  
3.10 Setting up Condor for Special Environments

The following sections describe how to set up Condor for use in a number of special environments or configurations. See section 3.3 for installation instructions for the various ``contrib modules'' that you can optionally download and install in your pool.

  
3.10.1 Using Condor with AFS

If you are using AFS at your site, be sure to read section 3.4.5 on ``Shared Filesystem Config File Entries'' for details on configuring your machines to interact with and use shared filesystems, AFS in particular.

Condor does not currently have a way to authenticate itself to AFS. This applies both to the Condor daemons, which would like to authenticate as the AFS user condor, and to the condor_shadow, which would like to authenticate as the user who submitted the job it is serving. Since neither of these is possible yet, there are a number of special steps that people who use AFS with Condor must take. Some of this must be done by the administrator(s) installing Condor. Some of this must be done by Condor users who submit jobs.

  
3.10.1.1 AFS and Condor for Administrators

The most important thing is that, since the Condor daemons cannot authenticate to AFS, the LOCAL_DIR (and its subdirectories like ``log'' and ``spool'') for each machine must either be writable by unauthenticated users or not be on AFS. The first option is a VERY bad security hole, so you should NOT have your local directory on AFS. If you have NFS installed as well and want the LOCAL_DIR for each machine on a shared file system, use NFS. Otherwise, put the LOCAL_DIR on a local partition on each machine in your pool. This means that you should run condor_install to install your release directory and configure your pool, setting the LOCAL_DIR parameter to some local partition. When that is complete, log into each machine in your pool and run condor_init to set up the local Condor directory.

The RELEASE_DIR, which holds all the Condor binaries, libraries, and scripts, can and probably should be on AFS. None of the Condor daemons need to write to these files; they only need to read them. So, you just have to make your RELEASE_DIR world readable and Condor will work just fine. This makes it easier to upgrade your binaries at a later date, lets your users find the Condor tools in a consistent location on all the machines in your pool, and lets you keep the Condor config files in a centralized location. This is what we do at UW-Madison's CS department Condor pool and it works quite well.
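
For example, the two directory settings might be given along these lines, with the release directory on AFS and the local directory on a local partition (both paths are purely illustrative; use whatever locations fit your site):

  RELEASE_DIR = /afs/your.cell/condor/release
  LOCAL_DIR   = /scratch/condor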

Finally, you might want to set up some special AFS groups to help your users deal with Condor and AFS better (you'll want to read the section below anyway, since you're probably going to have to explain this stuff to your users). Basically, if you can, create an AFS group that contains all unauthenticated users but that is restricted to a given host or subnet. You're supposed to be able to make these host-based ACLs with AFS, but we've had some trouble getting that working here at UW-Madison. What we have instead is a special group for all machines in our department. So, the users here just have to make their output directories on AFS writable to any process running on any of our machines, instead of any process on any machine with AFS on the Internet.

  
3.10.1.2 AFS and Condor for Users

The condor_shadow process runs on the machine where you submitted your Condor jobs and performs all file system access for your jobs. Because this process isn't authenticated to AFS as the user who submitted the job, it will not normally be able to write any output. So, when you submit jobs, any directories where your job will be creating output files will need to be world writable (to non-authenticated AFS users). In addition, if your program writes to stdout or stderr, or you're using a user log for your jobs, those files will need to be in a directory that's world-writable.

Any input for your job, either the file you specify as input in your submit file, or any files your program opens explicitly, needs to be world-readable.

Some sites may have special AFS groups set up that can make this unauthenticated access to your files less scary. For example, there's supposed to be a way with AFS to grant access to any unauthenticated process on a given host. That way, you only have to grant write access to unauthenticated processes on your submit machine, instead of to any unauthenticated process on the Internet. Similarly, unauthenticated read access could be granted only to processes running on your submit machine. Ask your AFS administrators about the existence of such AFS groups and details of how to use them.
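
For example, if your site provides such a group, you might grant it write access to a job's output directory with the standard AFS fs setacl command. The group name below is made up, so ask your AFS administrators for the one that applies at your site:

        cd ~/condor/output
        fs setacl . site:our-machines write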

The other solution to this problem is to just not use AFS at all. If you have disk space on your submit machine in a partition that is not on AFS, you can submit your jobs from there. While the condor_shadow is not authenticated to AFS, it does run with the effective UID of the user who submitted the jobs. So, on a local (or NFS) file system, the condor_shadow will be able to access your files normally, and you won't have to grant any special permissions to anyone other than yourself. If the Condor daemons are not started as root, however, the shadow will not be able to run with your effective UID, and you'll have a problem similar to the one you would have with files on AFS. See the section on ``Running Condor as Non-Root'' for details.

  
3.10.2 Configuring Condor for Multiple Platforms

Beginning with Condor version 6.0.1, you can use a single, global config file for all platforms in your Condor pool, with only platform-specific settings placed in separate files. This greatly simplifies administration of a heterogeneous pool by allowing you to change platform-independent, global settings in one place, instead of separately for each platform. This is made possible by the LOCAL_CONFIG_FILE parameter being treated by Condor as a list of files, instead of a single file. Of course, this will only help you if you are using a shared filesystem for the machines in your pool, so that multiple machines can actually share a single set of configuration files.

If you have multiple platforms, you should put all platform-independent settings (the vast majority) into your regular condor_config file, which would be shared by all platforms. This global file is the one found via the CONDOR_CONFIG environment variable, in user condor's home directory, or in /etc/condor/condor_config.

You would then set the LOCAL_CONFIG_FILE parameter from that global config file to specify both a platform-specific config file and optionally, a local, machine-specific config file (this parameter is described in section 3.4.2 on ``Condor-wide Config File Entries'').

The order in which you specify files in the LOCAL_CONFIG_FILE parameter is important, because settings in files at the beginning of the list are overridden if the same settings occur in files later in the list. So, if you specify the platform-specific file and then the machine-specific file, settings in the machine-specific file would override those in the platform-specific file (which is probably what you want).

  
3.10.2.1 Specifying a Platform-Specific Config File

To specify the platform-specific file, you can simply use the ARCH and OPSYS parameters, which are defined automatically by Condor. For example, if you had Intel Linux machines, Sparc Solaris 2.6 machines, and SGIs running IRIX 6.x, you might have files named:

        condor_config.INTEL.LINUX
        condor_config.SUN4x.SOLARIS26
        condor_config.SGI.IRIX6

Then, assuming these three files were in the directory held in the ETC macro, and you were using machine-specific config files in the same directory, named by each machine's hostname, your LOCAL_CONFIG_FILE parameter would be set to:

  LOCAL_CONFIG_FILE = $(ETC)/condor_config.$(ARCH).$(OPSYS), \
                      $(ETC)/$(HOSTNAME).local

Alternatively, if you are using AFS, you can use an ``@sys link'' to specify the platform-specific config file and let AFS resolve this link differently on different systems. For example, perhaps you have a soft link named ``condor_config.platform'' that points to ``condor_config.@sys''. In this case, your files might be named:

        condor_config.i386_linux2
        condor_config.sun4x_56
        condor_config.sgi_64
        condor_config.platform -> condor_config.@sys

and your LOCAL_CONFIG_FILE parameter would be set to:

  LOCAL_CONFIG_FILE = $(ETC)/condor_config.platform, \
                      $(ETC)/$(HOSTNAME).local

  
3.10.2.2 Platform-Specific Config File Settings

The only settings that are truly platform-specific are:

RELEASE_DIR
Full path to where you have installed your Condor binaries. While the config files may be shared among different platforms, the binaries certainly cannot. Therefore, you must still maintain separate release directories for each platform in your pool. See section 3.4.2 on ``Condor-wide Config File Entries'' for details.

MAIL
The full path to your mail program. See section 3.4.2 on ``Condor-wide Config File Entries'' for details.

CONSOLE_DEVICES
Which devices in /dev should be treated as ``console devices''. See section 3.4.8 on ``condor_startd Config File Entries'' for details.

DAEMON_LIST
Which daemons the condor_master should start up. The only reason this setting is platform-specific is because on Alphas running Digital Unix and SGIs running IRIX, you must use the condor_kbdd, which is not needed on other platforms. See section 3.4.7 on ``condor_master Config File Entries'' for details.

Reasonable defaults for all of these settings will be found in the default config files inside a given platform's binary distribution (except the RELEASE_DIR, since it is up to you where you want to install your Condor binaries and libraries). If you have multiple platforms, take one of the condor_config files, either one generated by running condor_install or the <release_dir>/etc/examples/condor_config.generic file, move these settings out into a platform-specific file, and install the resulting platform-independent file as your global config file. Then, pull the same settings out of the config files for any other platforms you are setting up and put them in their own platform-specific files. Finally, set your LOCAL_CONFIG_FILE parameter to point to the appropriate platform-specific file, as described above.
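
As a sketch, the platform-specific file for the Intel Linux machines in the earlier example (condor_config.INTEL.LINUX) might contain nothing but these four settings; the values shown are only examples, so use whatever is correct for your installation:

  RELEASE_DIR     = /usr/local/condor-linux
  MAIL            = /bin/mail
  CONSOLE_DEVICES = mouse, console
  DAEMON_LIST     = MASTER, STARTD, SCHEDD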

Not even all of these settings are necessarily going to be different. For example, if you have installed a mail program that understands the ``-s'' option in /usr/local/bin/mail on all your platforms, you could just set MAIL to that in your global file and not define it anywhere else. If you've only got Digital Unix and IRIX machines, the DAEMON_LIST will be the same for each, so there's no reason not to put that in the global config file (or, if you have no IRIX or Digital Unix machines, DAEMON_LIST won't have to be platform-specific either).

  
3.10.2.3 Other Uses for Platform-Specific Config Files

It is certainly possible that you might want other settings to be platform-specific as well. Perhaps you want a different startd policy for one of your platforms. Maybe different people should get the email about problems with different platforms. There's nothing hard-coded about any of this. What you decide should be shared and what should not is entirely up to you and how you lay out your config files.

Since the LOCAL_CONFIG_FILE parameter can be an arbitrary list of files, you can even break up your global, platform-independent settings into separate files. In fact, your global config file might only contain a definition for LOCAL_CONFIG_FILE, and all other settings would be handled in separate files.

You might want to give different people permission to change different Condor settings. For example, if you wanted some user to be able to change certain settings, but nothing else, you could specify those settings in a file early in the LOCAL_CONFIG_FILE list, give that user write permission on that file, and then include all the other files after that one. That way, if the user tries to change settings he or she shouldn't, those changes are simply overridden.
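
A minimal sketch of this arrangement follows; the file name and its position in the list are chosen purely for illustration:

  LOCAL_CONFIG_FILE = $(ETC)/condor_config.user_tunable, \
                      $(ETC)/condor_config.$(ARCH).$(OPSYS), \
                      $(ETC)/$(HOSTNAME).local

Giving the user write permission on condor_config.user_tunable (and on no other config file) means that anything also set in the later files overrides whatever that user puts there.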

As you can see, this mechanism is quite flexible and powerful. If you have very specific configuration needs, they can probably be met by using file permissions, the LOCAL_CONFIG_FILE setting, and your imagination.

  
3.10.3 Full Installation of condor_compile

In order to take advantage of two major Condor features, checkpointing and remote system calls, users of the Condor system need to relink their binaries. Programs that are not relinked for Condor can run in Condor's ``vanilla'' universe just fine; however, they cannot checkpoint and migrate, or run on machines without a shared filesystem.

To relink your programs with Condor, we provide a special tool, condor_compile. As installed by default, condor_compile works with the following commands: gcc, g++, g77, cc, acc, c89, CC, f77, fort77, ld. On Solaris and Digital Unix, f90 is also supported. See the condor_compile(1) man page for details on using condor_compile.

However, you can make condor_compile work transparently with any command on your system, including make.

The basic idea here is to replace the system linker (ld) with the Condor linker. Then, when a program is to be linked, the Condor linker figures out whether the binary is being linked for Condor or is a normal binary. If it is a normal build, the old ld is called. If the binary is to be linked for Condor, the script performs the necessary operations to prepare a binary that can be used with Condor. To differentiate between normal builds and Condor builds, the user simply places condor_compile before the build command, which sets the appropriate environment variable that lets the Condor linker script know it needs to do its magic.
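
For example, once the full installation described below is complete, relinking a single program or running a whole build for Condor is just a matter of prefixing the usual command (the program name here is only a placeholder):

        condor_compile cc -o myprog myprog.c
        condor_compile make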

In order to perform this full installation of condor_compile, the following steps need to be taken:

1.
Rename the system linker from ld to ld.real.
2.
Copy the condor linker to the location of the previous ld.
3.
Set the owner of the linker to root.
4.
Set the permissions on the new linker to 755.

The actual commands that you must execute depend upon the system that you are on. The location of the system linker (ld) is as follows:

	Operating System              Location of ld (ld-path)
	Linux                         /usr/bin
	Solaris 2.X                   /usr/ccs/bin
	OSF/1 (Digital Unix)          /usr/lib/cmplrs/cc

On these platforms, issue the following commands (as root), where ld-path is replaced by the path to your system's ld.

        mv /[ld-path]/ld /[ld-path]/ld.real
        cp /usr/local/condor/lib/ld /[ld-path]/ld
        chown root /[ld-path]/ld
        chmod 755 /[ld-path]/ld

On IRIX, things are more complicated: there are multiple ld binaries that need to be moved, and symbolic links need to be made in order to convince the linker to work, since it looks at the name of its own binary to figure out what to do.

        mv /usr/lib/ld /usr/lib/ld.real
        mv /usr/lib/uld /usr/lib/uld.real
        cp /usr/local/condor/lib/ld /usr/lib/ld
        ln /usr/lib/ld /usr/lib/uld
        chown root /usr/lib/ld /usr/lib/uld
        chmod 755 /usr/lib/ld /usr/lib/uld
        mkdir /usr/lib/condor
        chown root /usr/lib/condor
        chmod 755 /usr/lib/condor
        ln -s /usr/lib/uld.real /usr/lib/condor/uld
        ln -s /usr/lib/uld.real /usr/lib/condor/old_ld

If you remove Condor from your system later on, linking will continue to work, since the Condor linker always defaults to producing normal binaries and simply calls the real ld. In the interest of simplicity, however, it is recommended that you reverse the above changes by moving your ld.real linker back to its former position as ld, overwriting the Condor linker. On IRIX, you need to do this for both linkers, and you will probably want to remove the symbolic links as well.

  
3.10.4 Installing the condor_kbdd

The condor keyboard daemon (condor_kbdd) monitors X events on machines where the operating system does not provide a way of monitoring the idle time of the keyboard or mouse. In particular, this is necessary on Digital Unix machines and IRIX machines.

NOTE: If you are running on Solaris, Linux, or HP/UX, you do not need to use the keyboard daemon.

Although great measures have been taken to make this daemon as robust as possible, the X window system was not designed for this kind of monitoring, so the daemon is less than optimal on machines where many users frequently log in and out on the console.

In order to work with X authority, the system by which X authorizes processes to connect to X servers, the condor keyboard daemon needs to run with super user privileges. Currently, the daemon assumes that X uses the HOME environment variable to locate a file named .Xauthority, which contains keys necessary to connect to an X server. The keyboard daemon attempts to set this environment variable to various users' home directories in order to gain a connection to the X server and monitor events. This may fail on your system if you are using a non-standard approach. If the keyboard daemon is not allowed to attach to the X server, the state of a machine may be incorrectly set to idle when a user is, in fact, using the machine.

In some environments, the keyboard daemon will not be able to connect to the X server because the user currently logged into the system keeps their authentication token for using the X server in a place that no local user on the current machine can get to. This may be the case if you are running AFS and have the user's X authority file in an AFS home directory. There may also be cases where you cannot run the daemon with super user privileges for political reasons, but you would still like to be able to monitor X activity. In these cases, you will need to change your XDM configuration in order to start up the keyboard daemon with the permissions of the user who is currently logging in.

Although your situation may differ, if you are running X11R6.3, you will probably want to edit the files in /usr/X11R6/lib/X11/xdm. The Xsession file should start the keyboard daemon at the end, and the Xreset file should shut it down. As of patch level 4 of Condor version 6.0, the keyboard daemon has some additional command line options to facilitate this. The -l option can be used to write the daemon's log file to a place where the user running the daemon has permission to write. We recommend something akin to $HOME/.kbdd.log, since this is a place where every user can write and it won't get in the way. The -pidfile and -k options allow for easy shutdown of the daemon by storing the process id in a file. You will need to add lines to your XDM config that look something like this:

	condor_kbdd -l $HOME/.kbdd.log -pidfile $HOME/.kbdd.pid

This will start the keyboard daemon as the user who is currently logging in and write its log to the file $HOME/.kbdd.log. It will also save the process id of the daemon to $HOME/.kbdd.pid, so that when the user logs out, XDM can simply do a:

	condor_kbdd -k $HOME/.kbdd.pid

This will shut down the process whose id is recorded in $HOME/.kbdd.pid and exit.

To see how well the keyboard daemon is working on your system, review its log and look for successful connections to the X server. If you see none, the keyboard daemon may be unable to connect to your machine's X server. If this happens, please send mail to condor-admin@cs.wisc.edu and let us know about your situation.

  
3.10.5 Installing a Checkpoint Server

The Checkpoint Server is a repository for checkpoint files. Using checkpoint servers reduces the disk requirements of submitting machines in the pool since they no longer need to store checkpoint files locally. Checkpoint server machines should have a large amount of disk space available, and should have a fast connection to machines in the Condor pool.

If your spool directories are on a network file system, then checkpoint files will make two trips over the network, one between the submission machine and the execution machine and a second between the submit machine and the network file server. If you install a checkpoint server and configure it to use the server's local disk, the checkpoint will travel only once over the network, between the execution machine and the checkpoint server. You may also obtain checkpointing network performance benefits by using multiple checkpoint servers, as discussed below.

NOTE: It is a good idea to pick very stable machines for your checkpoint servers. If the checkpoint servers crash, the Condor system will continue to operate, though poorly. While the Condor system will recover from a checkpoint server crash as best it can, there are two problems that can (and will) occur:

1.
If the checkpoint server is not functioning, when jobs need to checkpoint, they cannot do so. The jobs will keep trying to contact the checkpoint server, backing off exponentially in the time they wait between attempts. Normally, jobs only have a limited time to checkpoint before they are kicked off the machine. So, if the server is down for a long period of time, chances are that you'll lose a lot of work by jobs being killed without writing a checkpoint.

2.
When the jobs wish to start, if their checkpoints cannot be retrieved from the checkpoint server, they will either have to be restarted from scratch, or the job will wait for the server to come back on-line. You can control this behavior with the MAX_DISCARDED_RUN_TIME parameter in your config file (see section 3.4.6 for details). Basically, this represents the maximum amount of CPU time you're willing to discard by starting a job over from scratch if the checkpoint server isn't responding to requests.

  
3.10.5.1 Preparing to Install a Checkpoint Server

Because of the problems that exist if your pool is configured to use a checkpoint server and that server is down, it is advisable to shut your pool down before doing any maintenance on your checkpoint server. See section 3.9 for details on how to do that.

When modifying the checkpoint server configuration of a submission machine, you must make sure there are no jobs currently in the queue on that machine. If jobs in the queue have checkpoint files in the local spool directories of your submit machines, those jobs will never run once the submit machines are configured to use a checkpoint server, because the checkpoint files cannot be found on the server. Either remove the jobs from the queue or let them complete before you configure a submission machine with a non-empty queue. Alternatively, you may proceed with installing the checkpoint server, configure only those submission machines with empty queues, and postpone the configuration of submission machines with non-empty job queues until their queues are empty.

  
3.10.5.2 Installing the Checkpoint Server Module

To install a checkpoint server, download the appropriate binary contrib module for the platform(s) your server will run on. When you uncompress and untar the file, you'll have a directory that contains a README, ckpt_server.tar, and so on. The ckpt_server.tar acts much like the release.tar file from a main release. This archive contains these files:

        sbin/condor_ckpt_server
        sbin/condor_cleanckpts
        etc/examples/condor_config.local.ckpt.server

These are all new files, not found in the main release, so you can safely untar the archive directly into your existing release directory. condor_ckpt_server is the checkpoint server binary. condor_cleanckpts is a script that can be run periodically to remove stale checkpoint files from your server. Normally, the checkpoint server cleans up all old files by itself. However, in certain error situations, stale files that are no longer needed can be left behind. So, you may want to put a cron job in place that calls condor_cleanckpts every week or so, just to be safe. The example config file is described below.
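
For example, a weekly cron entry along the following lines would do; the path assumes the release directory /usr/local/condor used in earlier examples, so adjust it for your site:

        0 3 * * 0 /usr/local/condor/sbin/condor_cleanckpts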

Once you have unpacked the contrib module, there are a few more steps you must complete. Each is discussed in its own section:

1.
Configure the checkpoint server.
2.
Spawn the checkpoint server.
3.
Configure your pool to use the checkpoint server.

  
3.10.5.3 Configuring a Checkpoint Server

There are a few settings you must place in the local config file of your checkpoint server. The file etc/examples/condor_config.local.ckpt.server contains all such settings, and you can just insert it into the local configuration file of your checkpoint server machine.

There is one setting that you must customize: CKPT_SERVER_DIR, which defines where your checkpoint files will be located. It is best if this directory is on a very fast local file system (preferably a RAID). The speed of this file system has a direct impact on the speed at which checkpoint files can be stored to and retrieved from the server by remote machines.

The other optional settings are:

DAEMON_LIST
(Described in section 3.4.7). If you want the checkpoint server managed by the condor_master, the DAEMON_LIST entry must have MASTER and CKPT_SERVER. Add STARTD if you want to allow jobs to run on your checkpoint server. Similarly, add SCHEDD if you would like to submit jobs from your checkpoint server.

The rest of these settings are the checkpoint-server-specific versions of the Condor logging entries, described in section 3.4.3.

CKPT_SERVER_LOG
The CKPT_SERVER_LOG is where the checkpoint server writes its log.

MAX_CKPT_SERVER_LOG
Use this item to configure the size of the checkpoint server log before it is rotated.

CKPT_SERVER_DEBUG
The amount of information you would like printed in your logfile. Currently, the only debug level supported is D_ALWAYS.
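
Putting these together, the local config file on the checkpoint server machine might look something like the following sketch; the CKPT_SERVER_DIR path is only an example, and the other values are merely illustrative:

  DAEMON_LIST         = MASTER, CKPT_SERVER
  CKPT_SERVER_DIR     = /data/ckpt
  CKPT_SERVER_LOG     = $(LOG)/CkptServerLog
  MAX_CKPT_SERVER_LOG = 640000
  CKPT_SERVER_DEBUG   = D_ALWAYS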

  
3.10.5.4 Spawning a Checkpoint Server

To spawn a checkpoint server once it is configured to run on a given machine, all you have to do is restart Condor on that host so that the condor_master notices the new configuration. You can do this by sending a condor_restart command from any machine with ``administrator'' access to your pool. See section 3.8 for full details about IP/host-based security in Condor.
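
For example, from a machine with administrator access you might issue something like the following (the hostname is made up); if your version of condor_restart does not accept a target hostname, simply run condor_restart on the checkpoint server machine itself:

        condor_restart ckpt-server.cs.wisc.edu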

  
3.10.5.5 Configuring your Pool to Use the Checkpoint Server

Once the checkpoint server is installed and running, you just have to change a few settings in your config files to let your pool know about your new server:

USE_CKPT_SERVER
This parameter should be set to ``True''.

CKPT_SERVER_HOST
This parameter should be set to the full hostname of the machine that is now running your checkpoint server.

It is most convenient to set these parameters in your global config file so they are in effect for all submission machines. However, you may configure each submission machine separately (using local config files) if you do not want all of your submission machines to use a checkpoint server at this time. If USE_CKPT_SERVER is set to ``False'' or is undefined, the submission machine will not use a checkpoint server.
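
For example, the following two lines in the global config file point every submission machine at a single checkpoint server; the hostname is, of course, just an example:

  USE_CKPT_SERVER  = True
  CKPT_SERVER_HOST = ckpt-server.cs.wisc.edu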

Once these settings are in place, you simply have to send a condor_reconfig to all machines in your pool so the changes take effect. This is described in section 3.9.2.

  
3.10.5.6 Configuring your Pool to Use Multiple Checkpoint Servers

It is possible to configure a Condor pool to use multiple checkpoint servers. This enables the administrator to deploy checkpoint servers across the network to improve checkpointing performance. In this case, Condor machines are configured to checkpoint to the ``nearest'' checkpoint server. There are two main benefits to deploying multiple checkpoint servers: checkpoint-related network traffic is spread across several servers instead of being concentrated on one, and checkpoints can be written to and read from a server close to the machine running the job.

Once you have multiple checkpoint servers running in your pool, the following configuration changes are required to make them active.

First, USE_CKPT_SERVER should be set to ``True'' on all submission machines whose jobs should use a checkpoint server. Additionally, STARTER_CHOOSES_CKPT_SERVER should be set to ``True'' on these submission machines. When this parameter is true, the checkpoint server specified by the execution machine is used instead of the one specified by the submission machine. (See section 3.4.6 for more details.) This allows the job to use the checkpoint server closest to the machine on which it is running, instead of the server closest to the submission machine. For convenience, we suggest that you set these parameters in the global config file.

Next, you must set CKPT_SERVER_HOST on each machine. As described above, this should be set to the full hostname of the checkpoint server machine. In the case of multiple checkpoint servers, you will want to set this to be the hostname of the nearest server for each machine in the local config file.
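
As a sketch, with two checkpoint servers the configuration might look like this; all hostnames are purely illustrative:

  # In the global config file:
  USE_CKPT_SERVER             = True
  STARTER_CHOOSES_CKPT_SERVER = True

  # In each machine's local config file, naming that machine's nearest server:
  CKPT_SERVER_HOST = ckpt-east.cs.wisc.edu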

Finally, once these settings are in place, you simply have to send a condor_reconfig to all machines in your pool so the changes take effect. This is described in section 3.9.2.

Now, the jobs in your pool will checkpoint to the nearest checkpoint server. On restart, the job will remember where its checkpoint was stored and read it from the appropriate server. After a job successfully writes a checkpoint to a new server, it will remove any previous checkpoints left on other servers.

NOTE: If the configured checkpoint server is unavailable, the job will keep trying to contact that server as described above. It will not use alternate checkpoint servers. This may change in future versions of Condor.

  
3.10.6 Flocking: Configuring a Schedd to Submit to Multiple Pools

The condor_schedd may be configured to submit jobs to more than one pool. In the default configuration, the condor_schedd contacts the Central Manager specified by the CONDOR_HOST macro (described in section 3.4.2) to locate execute machines available to run jobs in its queue. However, the FLOCK_HOSTS macro (described in section 3.4.9) may be used to specify additional Central Managers for the condor_schedd to contact. When the local pool does not satisfy all job requests, the condor_schedd will try the pools specified by FLOCK_HOSTS in turn until all jobs are satisfied.
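
For example, to let a schedd overflow into two other pools, in order, you might add a line like the following to its config file; the central manager hostnames are made up:

  FLOCK_HOSTS = cm.pool-two.example.edu, cm.pool-three.example.edu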

  
3.10.7 Configuring The Startd for SMP Machines

This section describes how to configure the condor_startd for SMP (Symmetric Multi-Processor) machines. Beginning with Condor version 6.1, machines with more than one CPU can be configured to run more than one job at a time. As always, owners of the resources have great flexibility in defining the policy under which multiple jobs may run, suspend, vacate, etc.

  
3.10.7.1 How Shared Resources are Represented to Condor

SMP machines are represented to the Condor system by breaking their shared resources up into individual virtual machines (``vm'') that can be matched with and claimed by users. Each virtual machine is represented by an individual ``ClassAd'' (see the ClassAd reference, section 4.1, for details). In this way, a single SMP machine appears to the Condor system as a collection of separate virtual machines. For example, an SMP machine named ``vulture.cs.wisc.edu'' would appear to Condor as the multiple machines ``vm1@vulture.cs.wisc.edu'', ``vm2@vulture.cs.wisc.edu'', and so on.

For now, the condor_startd simply creates a separate virtual machine for every CPU. All shared system resources (like RAM, disk space, swap space, etc) are divided evenly among all the virtual machines. In future versions of Condor, you will be able to configure what percentage or total value of each resource will go to each vm.

  
3.10.7.2 Configuring Startd Policy for SMP Machines

NOTE: Be sure you have read and understand section 3.6 on ``Configuring The Startd Policy'' before you proceed with this section.

Each virtual machine from an SMP is treated as an independent machine, with its own view of its machine state. For now, a single set of policy expressions is in place for all virtual machines simultaneously. Eventually, you will be able to explicitly specify separate policies for each one. However, since you do have control over each virtual machine's view of its own state, you can effectively have separate policies for each resource.

For example, you can configure how many of the virtual machines ``notice'' console or tty activity on the SMP as a whole. Ones that aren't configured to notice any activity will report ConsoleIdle and KeyboardIdle times from when the startd was started (plus a configurable number of seconds). So, you can set up a 4-CPU machine with all the default startd policy settings and with the keyboard and console ``connected'' to only one virtual machine. Assuming there isn't too much load average (see section 3.10.7.3 below on ``Load Average for SMP Machines''), only one virtual machine will suspend or vacate its job when the owner starts typing at the machine again. The rest of the virtual machines could be matched with jobs and leave those jobs running, even while the user was interactively using the machine.
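
A minimal sketch of such a setup is shown below. The parameter names are assumptions based on the SMP-related condor_startd settings referred to in section 3.4.8, so verify the exact names there for your version of Condor.

  # Assumed parameter names; see section 3.4.8 for the authoritative list.
  # Only vm1 is "connected" to the console and keyboard:
  VIRTUAL_MACHINES_CONNECTED_TO_CONSOLE  = 1
  VIRTUAL_MACHINES_CONNECTED_TO_KEYBOARD = 1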

Or, if you wish, you can configure all virtual machines to notice all tty and console activity. In this case, if a machine owner came back to her machine, all the currently running jobs would be suspended or preempted (depending on your policy expressions) at the same time.

  
3.10.7.3 Load Average for SMP Machines

Most operating systems define the load average for an SMP machine as the total load on all CPUs. For example, if you have a 4-CPU machine with 3 CPU-bound processes running at the same time, the load would be 3.0. In Condor, we maintain this view of the total load average and publish it in all resource ClassAds as TotalLoadAvg.

However, we also define the ``per-CPU'' load average for SMP machines. In this way, the model that each node on an SMP is a virtual machine, totally separate from the other nodes, can be maintained. All of the default, single-CPU policy expressions can be used directly on SMP machines, without modification, since the LoadAvg and CondorLoadAvg attributes are the per-virtual machine versions, not the total, SMP-wide versions.

The per-CPU load average on SMP machines is a number we basically invented. There is no system call you can use to ask your operating system for this value. Here's how it works:

We already compute the load average generated by Condor on each virtual machine. We do this by closely monitoring all processes spawned by any of the Condor daemons, even ones that are orphaned and then inherited by init. This Condor load average per virtual machine is reported as CondorLoadAvg in all resource ClassAds, and the total Condor load average for the entire machine is reported as TotalCondorLoadAvg. We also have the total, system-wide load average for the entire machine (reported as TotalLoadAvg).

Basically, we walk through all the virtual machines and assign out portions of the total load average to each one. First, we assign the known Condor load average to each node that is generating any. Whatever load average is left in the total system load is considered owner load. Any virtual machines we already think are in the Owner state (like ones that have keyboard activity) are the first to get assigned this owner load. We hand out owner load in increments of at most 1.0, so generally speaking, no virtual machine has a load average above 1.0. If we run out of total load average before we run out of virtual machines, all the remaining machines think they have no load average at all. If, instead, we run out of virtual machines and we still have owner load left, we start assigning that load to Condor nodes, too, creating individual nodes with a load average higher than 1.0.
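
As a worked example (with numbers invented purely for illustration), consider a 4-CPU machine with a total system load average of 2.5 where vm1 is running a Condor job that generates a load of 1.0. vm1 is assigned a CondorLoad of 1.0, leaving 1.5 of owner load. If vm2 is in the Owner state because of keyboard activity, it is assigned 1.0 of owner load (the per-vm cap), the remaining 0.5 goes to the next virtual machine, and the last one reports no load at all.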

  
3.10.7.4 New Config File Parameters for the SMP Startd

The config file settings you might want to customize on an SMP machine are those that control each virtual machine's view of the system resources, as described in section 3.10.7 above. These settings are fully described in section 3.4.8, which lists all the config file settings for the condor_startd.

  
3.10.7.5 Debug logging in the SMP Startd

This section describes how the startd handles its debug messages for SMP machines. In general, a given log message is either machine-wide (like reporting the total system load average) or specific to a given virtual machine. Any log entries specific to a virtual machine have an extra header printed in the entry: vm#:. So, for example, here is the output about system resources being gathered (with D_FULLDEBUG and D_LOAD turned on) on a 2-CPU machine with no Condor activity and the keyboard connected to both virtual machines:

11/25 18:15 Swap space: 131064
11/25 18:15 number of kbytes available for (/home/condor/execute): 1345063
11/25 18:15 Looking up RESERVED_DISK parameter
11/25 18:15 Reserving 5120 kbytes for file system
11/25 18:15 Disk space: 1339943
11/25 18:15 Load avg: 0.340000 0.800000 1.170000
11/25 18:15 Idle Time: user= 0 , console= 4 seconds
11/25 18:15 SystemLoad: 0.340   TotalCondorLoad: 0.000  TotalOwnerLoad: 0.340
11/25 18:15 vm1: Idle time: Keyboard: 0        Console: 4
11/25 18:15 vm1: SystemLoad: 0.340  CondorLoad: 0.000  OwnerLoad: 0.340
11/25 18:15 vm2: Idle time: Keyboard: 0        Console: 4
11/25 18:15 vm2: SystemLoad: 0.000  CondorLoad: 0.000  OwnerLoad: 0.000
11/25 18:15 vm1: State: Owner           Activity: Idle
11/25 18:15 vm2: State: Owner           Activity: Idle

If, on the other hand, this machine only had one virtual machine connected to the keyboard and console, and the other vm was running a job, it might look something like this:

11/25 18:19 Load avg: 1.250000 0.910000 1.090000
11/25 18:19 Idle Time: user= 0 , console= 0 seconds
11/25 18:19 SystemLoad: 1.250   TotalCondorLoad: 0.996  TotalOwnerLoad: 0.254
11/25 18:19 vm1: Idle time: Keyboard: 0        Console: 0
11/25 18:19 vm1: SystemLoad: 0.254  CondorLoad: 0.000  OwnerLoad: 0.254
11/25 18:19 vm2: Idle time: Keyboard: 1496     Console: 1496
11/25 18:19 vm2: SystemLoad: 0.996  CondorLoad: 0.996  OwnerLoad: 0.000
11/25 18:19 vm1: State: Owner           Activity: Idle
11/25 18:19 vm2: State: Claimed         Activity: Busy

As you can see, messages about shared system resources (like total swap space) are printed without the header, while vm-specific messages (like the load average or state of each vm) get the special vm#: header.

