This section describes how to install various contrib modules in the Condor system. Some of these modules are separate, optional pieces not included in the main distribution of Condor, such as the checkpoint server or DAGMan. Others are integral parts of Condor taken from the development series that provide features users might want, such as the new SMP-aware condor_startd or the CondorView collector. Both of the latter come automatically with Condor version 6.1 and greater. However, if you do not want to switch over to using only the development binaries, you can install these separate modules and maintain most of the stable release at your site.
To install CondorView for your pool, you need two things: the CondorView server and the CondorView client. Since these are completely separate modules, each is handled in its own section.
The CondorView server is an enhanced version of the condor_collector which can log information to disk, providing a persistent, historical database of your pool state. This includes machine state, the state of jobs submitted by users, and so on. This enhanced condor_collector is simply the version 6.1 development series collector, but it can be installed in a 6.0 pool. The historical information logging can be turned on or off, so you can install the CondorView collector without using up disk space for historical information if you do not want it.
To install the CondorView server, you must download the appropriate binary module for the platform on which you will run your CondorView server. This does not have to be the same platform as your existing central manager (see below). Once you uncompress and untar the module, you will have a directory with a view_server.tar file, a README, and so on. The view_server.tar file acts much like the release.tar file for a main release of Condor. It contains all the binaries and supporting files you would install in your release directory:
sbin/condor_collector
etc/examples/condor_config.local.view_server
You have two options to choose from when deciding how to install this enhanced condor_collector in your pool:
If you replace your existing collector with the enhanced version, because it is development code, a bug could cause problems for your entire pool. On the other hand, if you install the enhanced version on a separate host and there are problems, only CondorView will be affected, not your entire pool. However, installing the CondorView collector on a separate host generates more network traffic (from all the duplicate updates sent from each machine in your pool to both collectors). In addition, the installation procedure to have both collectors running is more complicated. You will have to decide for yourself which solution you feel more comfortable with.
Before we discuss the details of one type of installation or the other, we explain the steps you must take in either case.
Before you install the CondorView collector (as described in the following sections), you must add a few settings to the local config file of that machine to enable historical data collection. These settings are described in detail in the Condor Version 6.1 Administrator's Manual, in the section ``condor_collector Config File Entries''. However, a short explanation of the ones you must customize is provided below. These entries are also explained in the etc/examples/condor_config.local.view_server file, included in the contrib module. You can simply insert that file into the local config file for your CondorView collector host and customize it as appropriate for your site.
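As a sketch, the relevant local config additions might look like the following. The directory path is a placeholder for your site; POOL_HISTORY_DIR is the entry named in this section, while KEEP_POOL_HISTORY is our best recollection of the on/off switch from the 6.1 manual, so verify both against the condor_config.local.view_server example file:

```
# Where to store the historical pool data (placeholder path;
# must NOT be your existing Spool or Log directory).
POOL_HISTORY_DIR  = /full/path/to/viewhist

# Turn historical data logging on (entry name as we recall it
# from the 6.1 manual; confirm against the example config file).
KEEP_POOL_HISTORY = True
```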
NOTE: This should be a separate directory, not the same as either the Spool or Log directories you have already set up for Condor. There are a few problems with putting these files into either of those directories.
Once these settings are in place in the local config file for your CondorView server host, you must create the directory specified in POOL_HISTORY_DIR and make it writable by whatever user your CondorView collector runs as. This is the same user that owns the CollectorLog file in your Log directory (usually ``condor'').
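For example, assuming a placeholder path and that the collector runs as the user ``condor'' (the chown step requires root, so it is shown commented out here):

```shell
# Create the directory named by POOL_HISTORY_DIR (placeholder path)
# and make it writable by the user the collector runs as.
HISTDIR=/tmp/condor/viewhist
mkdir -p "$HISTDIR"
# chown condor "$HISTDIR"   # run as root if the collector runs as ``condor''
ls -ld "$HISTDIR"
```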
Once those steps are completed, you are ready to install the new binaries and you will begin collecting historical information. Then, you should install the CondorView client contrib module which contains the tools used to query and display this information.
To install the new CondorView collector as your main collector, simply replace your existing binary with the new one found in the view_server.tar file. First, move your existing condor_collector binary out of the way with the ``mv'' command. For example:
% cd /full/path/to/your/release/directory
% cd sbin
% mv condor_collector condor_collector.old
Then, from that same directory, untar the view_server.tar file into your release directory; this installs a new condor_collector binary and an example config file.
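Continuing the example above, the untar step might look like this (the location of the unpacked contrib module is hypothetical):

```
% cd /full/path/to/your/release/directory
% tar xf /tmp/view_server_module/view_server.tar
```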
Within 5 minutes, the condor_master will notice the new timestamp on
your condor_collector binary, shut down your existing
collector, and spawn the new version.
You will see messages about this in the log file for your
condor_master (usually MasterLog in your log
directory).
Once the new collector is running, it is safe to remove your old
binary, though you may want to keep it around in case you have
problems with the new version and want to revert.
Once this is completed, you just have to add a few config file entries to the local config file on your central manager to enable historical data collection. These are described below in the ``Configuring the CondorView Server Module'' section.
To install the CondorView collector in addition to your regular collector requires a little extra work. First, you should untar the view_server.tar file into some temporary location (not your main release directory). Copy the sbin/condor_collector file out of there, and into your main release directory's sbin with a new name (such as condor_collector.view_server).
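For example, the staging steps might look like this (all paths here are hypothetical):

```
% mkdir /tmp/view_server
% cd /tmp/view_server
% tar xf view_server.tar
% cp sbin/condor_collector /full/path/to/your/release/directory/sbin/condor_collector.view_server
```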
Next, you must configure whatever host is going to run your separate CondorView server to spawn this new collector in addition to whatever other daemons it is running. You do this by adding ``COLLECTOR'' to the DAEMON_LIST on this machine, and defining what ``COLLECTOR'' means. For example:
DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR
COLLECTOR = $(SBIN)/condor_collector.view_server
For this change to take effect, you must restart the condor_master on this host (which you can do with the condor_restart command, if you run that command from a machine with ``ADMINISTRATOR'' access to your pool). See section 3.8 for full details of IP/host-based security in Condor.
Finally, you must tell all the machines in your pool to start sending updates to both collectors. You do this by specifying the following setting in your global config file:
CONDOR_VIEW_HOST = full.hostname
where ``full.hostname'' is the full hostname of the machine where you
are running your CondorView collector.
Once this setting is in place, you must send a condor_reconfig to your entire pool. The easiest way to do this is:
% condor_reconfig `condor_status -master`
Again, this command must be run from a trusted ``administrator''
machine for it to work.
If your spool directories are on a network file system, checkpoint files will make two trips over the network: one between the submission machine and the execution machine, and a second between the submission machine and the network file server. If you install a checkpoint server and configure it to use the server's local disk, each checkpoint travels over the network only once, between the execution machine and the checkpoint server. You may also obtain checkpointing network performance benefits by using multiple checkpoint servers, as discussed below.
NOTE: It is a good idea to pick very stable machines for your checkpoint servers. If the checkpoint servers crash, the Condor system will continue to operate, though poorly. While the Condor system will recover from a checkpoint server crash as best it can, there are two problems that can (and will) occur:
Basically, this represents the maximum amount of CPU time you are
willing to discard by starting a job over from scratch if the
checkpoint server is not responding to requests.
Because of the problems that exist if your pool is configured to use a
checkpoint server and that server is down, it is advisable to shut
your pool down before doing any maintenance on your checkpoint
server.
See section 3.9 for details on how to do that.
When modifying the checkpoint server configuration of a submission machine, make sure there are no jobs currently in the queue on that machine. If jobs in your queues have checkpoint files in the local spool directories of your submission machines, those jobs will never run once those machines are configured to use a checkpoint server, because the checkpoint files cannot be found on the server. Either remove the jobs from your queues or let them complete before configuring those submission machines. Alternatively, you may proceed with the installation, configuring only those submission machines with empty queues and postponing the configuration of machines with non-empty queues until their queues are empty.
To install a checkpoint server, download the appropriate binary contrib module for the platform(s) your server will run on. When you uncompress and untar the file, you'll have a directory that contains a README, ckpt_server.tar, and so on. The ckpt_server.tar acts much like the release.tar file from a main release. This archive contains these files:
sbin/condor_ckpt_server
sbin/condor_cleanckpts
etc/examples/condor_config.local.ckpt.server
These are all new files, not found in the main release, so you can
safely untar the archive directly into your existing release
directory.
condor_ckpt_server is the checkpoint server binary.
condor_cleanckpts is a script that can be periodically run to
remove stale checkpoint files from your server.
Normally, the checkpoint server cleans all old files by itself.
However, in certain error situations, stale files can be left that are
no longer needed.
So, you may want to put a cron job in place that calls
condor_cleanckpts every week or so, just to be safe.
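A hypothetical crontab entry for the user Condor runs as, invoking the script early every Sunday morning (the path is a placeholder for your release directory):

```
# min hour day month weekday  command
0 4 * * 0 /full/path/to/your/release/directory/sbin/condor_cleanckpts
```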
The example config file is described below.
Once you have unpacked the contrib module, there are a few more steps you must complete. Each is discussed in its own section:
There are a few settings you must place in the local config file of your checkpoint server. The file etc/examples/condor_config.local.ckpt.server contains all such settings, and you can just insert it into the local configuration file of your checkpoint server machine.
There is one setting that you must customize: CKPT_SERVER_DIR, which defines where your checkpoint files will be located. It is best if this is on a very fast local file system (preferably a RAID). The speed of this file system has a direct impact on how quickly checkpoint files can be stored and retrieved by remote machines.
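For example, in the checkpoint server's local config file (the path is a placeholder for a fast local file system on the server):

```
CKPT_SERVER_DIR = /full/path/to/ckpt_files
```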
The other optional settings are:
The rest of these settings are the checkpoint-server-specific versions
of the Condor logging entries, described in section 3.4.3.
To spawn a checkpoint server once it is configured to run on a given
machine, all you have to do is restart Condor on that host to enable
the condor_master to notice the new configuration.
You can do this by sending a condor_restart command from any machine
with ``administrator'' access to your pool.
See section 3.8 for full details about IP/host-based security in Condor.
Once the checkpoint server is installed and running, you just have to change a few settings in your config files to let your pool know about your new server:
It is most convenient to set these parameters in your global config file so they are in effect for all submission machines. However, you may configure each submission machine separately (using local config files) if you do not want all of your submission machines to use a checkpoint server at this time. If USE_CKPT_SERVER is set to ``False'' or is undefined, the submission machine will not use a checkpoint server.
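As a sketch, the global config entries might look like the following, using the setting names mentioned in this section (the hostname is a placeholder):

```
# Let submission machines use the checkpoint server.
USE_CKPT_SERVER  = True

# Full hostname of the machine running condor_ckpt_server (placeholder).
CKPT_SERVER_HOST = ckpt-server.your.domain.edu
```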
Once these settings are in place, you simply have to send a
condor_reconfig to all machines in your pool so the changes take
effect.
This is described in section 3.9.2.
It is possible to configure a Condor pool to use multiple checkpoint servers. This enables the administrator to deploy checkpoint servers across the network to improve checkpointing performance. In this case, Condor machines are configured to checkpoint to the ``nearest'' checkpoint server. There are two main benefits to deploying multiple checkpoint servers:
Once you have multiple checkpoint servers running in your pool, the following configuration changes are required to make them active.
First, USE_CKPT_SERVER should be set to ``True'' on all
submission machines whose jobs should use a checkpoint server.
Additionally, STARTER_CHOOSES_CKPT_SERVER should be set to
``True'' on these submission machines.
When true, this parameter specifies that the checkpoint server
specified by the execution machine should be used instead of the
checkpoint server specified by the submission machine.
(See section 3.4.6 for more details.)
This allows the job to use the checkpoint server closest to the
machine on which it is running, instead of the server closest to the
submission machine.
For convenience, we suggest that you set these parameters in the
global config file.
Next, you must set CKPT_SERVER_HOST on each machine. As described above, this should be set to the full hostname of the checkpoint server machine. In the case of multiple checkpoint servers, you will want to set this to be the hostname of the nearest server for each machine in the local config file.
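Putting these steps together as a sketch (all hostnames are placeholders):

```
# Global config file (applies to all submission machines):
USE_CKPT_SERVER             = True
STARTER_CHOOSES_CKPT_SERVER = True

# Local config file on each machine -- point at the nearest
# checkpoint server for that machine:
CKPT_SERVER_HOST = ckpt-server-1.your.domain.edu
```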
Finally, once these settings are in place, you simply have to send a
condor_reconfig to all machines in your pool so the changes take
effect.
This is described in section 3.9.2.
Now, the jobs in your pool will checkpoint to the nearest checkpoint server. On restart, the job will remember where its checkpoint was stored and read it from the appropriate server. After a job successfully writes a checkpoint to a new server, it will remove any previous checkpoints left on other servers.
NOTE: If the configured checkpoint server is unavailable, the job will keep trying to contact that server as described above. It will not use alternate checkpoint servers. This may change in future versions of Condor.
To install support for PVM in Condor, download the file archive from http://www.cs.wisc.edu/condor/condor-pvm and follow the directions found in the INSTALL file contained in the archive.