This section describes how to install various contrib modules in the Condor system. Some of these modules are separate, optional pieces not included in the main distribution of Condor, such as the checkpoint server or DAGMan. Others are integral parts of Condor taken from the development series that provide features users might want, such as the new SMP-aware condor_startd or the CondorView collector. Both of the latter come automatically with Condor version 6.1 and greater. However, if you do not want to switch over to using only the development binaries, you can install these separate modules and maintain most of the stable release at your site.
To install CondorView for your pool, you need two things: the CondorView server and the CondorView client. Since these are completely separate modules, each is handled in its own section.
The CondorView server is an enhanced version of the condor_collector which can log information to disk, providing a persistent, historical database of your pool state. This includes machine state, the state of jobs submitted by users, and so on. This enhanced condor_collector is simply the version 6.1 development series collector, but it can be installed in a 6.0 pool. The historical information logging can be turned on or off, so you can install the CondorView collector without using up disk space for historical information if you do not want it.
To install the CondorView server, you must download the appropriate binary module for the platform on which you will run your CondorView server. This does not have to be the same platform as your existing central manager (see below). Once you uncompress and untar the module, you will have a directory with a view_server.tar file, a README, and so on. The view_server.tar file acts much like the release.tar file for a main release of Condor. It contains all the binaries and supporting files you would install in your release directory:
sbin/condor_collector
etc/examples/condor_config.local.view_server
You have two options to choose from when deciding how to install this enhanced condor_collector in your pool:
If you replace your existing collector with the enhanced version, because it is development code, a bug could cause problems for your entire pool. On the other hand, if you install the enhanced version on a separate host and there are problems, only CondorView will be affected, not your entire pool. However, installing the CondorView collector on a separate host generates more network traffic (from all the duplicate updates sent from each machine in your pool to both collectors). In addition, the installation procedure to have both collectors running is more complicated. You will have to decide for yourself which solution you feel more comfortable with.
Before we discuss the details of one type of installation or the other, we explain the steps you must take in either case.
Before you install the CondorView collector (as described in the following sections), you must add a few settings to the local config file of that machine to enable historical data collection. These settings are described in detail in the Condor Version 6.1 Administrator's Manual, in the section ``condor_collector Config File Entries''. However, a short explanation of the ones you must customize is provided below. These entries are also explained in the etc/examples/condor_config.local.view_server file, included in the contrib module. You can simply insert that file into the local config file for your CondorView collector host and customize it as appropriate for your site.
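As a sketch, the relevant local config additions might look like the following. The directory path is a placeholder for your site; POOL_HISTORY_DIR is the entry named in this section, while KEEP_POOL_HISTORY is our best recollection of the on/off switch from the 6.1 manual, so verify both against the condor_config.local.view_server example file:

```
# Where to store the historical pool data (placeholder path;
# must NOT be your existing Spool or Log directory).
POOL_HISTORY_DIR  = /full/path/to/viewhist

# Turn historical data logging on (entry name as we recall it
# from the 6.1 manual; confirm against the example config file).
KEEP_POOL_HISTORY = True
```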
NOTE: This should be a separate directory, not the same as either the Spool or Log directories you have already set up for Condor. There are a few problems with putting these files into either of those directories.
Once these settings are in place in the local config file for your CondorView server host, you must create the directory specified in POOL_HISTORY_DIR and make it writable by whatever user your CondorView collector runs as. This is the same user that owns the CollectorLog file in your Log directory (usually ``condor'').
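For example, assuming a placeholder path and that the collector runs as the user ``condor'' (the chown step requires root, so it is shown commented out here):

```shell
# Create the directory named by POOL_HISTORY_DIR (placeholder path)
# and make it writable by the user the collector runs as.
HISTDIR=/tmp/condor/viewhist
mkdir -p "$HISTDIR"
# chown condor "$HISTDIR"   # run as root if the collector runs as ``condor''
ls -ld "$HISTDIR"
```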
Once those steps are completed, you are ready to install the new binaries and you will begin collecting historical information. Then, you should install the CondorView client contrib module which contains the tools used to query and display this information.
To install the new CondorView collector as your main collector, simply replace your existing binary with the new one found in the view_server.tar file. First, move your existing condor_collector binary out of the way with the ``mv'' command. For example:
% cd /full/path/to/your/release/directory
% cd sbin
% mv condor_collector condor_collector.old
Then, from that same directory, untar the view_server.tar file into your release directory; this installs a new condor_collector binary and an example config file.
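Continuing the example above, the untar step might look like this (the location of the unpacked contrib module is hypothetical):

```
% cd /full/path/to/your/release/directory
% tar xf /tmp/view_server_module/view_server.tar
```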
Within 5 minutes, the condor_master will notice the new timestamp on
your condor_collector binary, shut down your existing
collector, and spawn the new version.
You will see messages about this in the log file for your
condor_master (usually MasterLog in your log
directory).
Once the new collector is running, it is safe to remove your old
binary, though you may want to keep it around in case you have
problems with the new version and want to revert.
Once this is completed, you just have to add a few config file entries to the local config file on your central manager to enable historical data collection. These are described below in the ``Configuring the CondorView Server Module'' section.
To install the CondorView collector in addition to your regular collector requires a little extra work. First, you should untar the view_server.tar file into some temporary location (not your main release directory). Copy the sbin/condor_collector file out of there, and into your main release directory's sbin with a new name (such as condor_collector.view_server).
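For example, the staging steps might look like this (all paths here are hypothetical):

```
% mkdir /tmp/view_server
% cd /tmp/view_server
% tar xf view_server.tar
% cp sbin/condor_collector /full/path/to/your/release/directory/sbin/condor_collector.view_server
```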
Next, you must configure whatever host is going to run your separate CondorView server to spawn this new collector in addition to whatever other daemons it is running. You do this by adding ``COLLECTOR'' to the DAEMON_LIST on this machine, and defining what ``COLLECTOR'' means. For example:
DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR
COLLECTOR = $(SBIN)/condor_collector.view_server
For this change to take effect, you must restart the condor_master on this host (which you can do with the condor_restart command, if you run that command from a machine with ``ADMINISTRATOR'' access to your pool). See section 3.8 for full details of IP/host-based security in Condor.
Finally, you must tell all the machines in your pool to start sending updates to both collectors. You do this by specifying the following setting in your global config file:
CONDOR_VIEW_HOST = full.hostname
where ``full.hostname'' is the full hostname of the machine where you
are running your CondorView collector.
Once this setting is in place, you must send a condor_reconfig to your entire pool. The easiest way to do this is:
% condor_reconfig `condor_status -master`
Again, this command must be run from a trusted ``administrator''
machine for it to work.
If your spool directories are on a network file system, checkpoint files will make two trips over the network: one between the submission machine and the execution machine, and a second between the submission machine and the network file server. If you install a checkpoint server and configure it to use the server's local disk, each checkpoint travels over the network only once, between the execution machine and the checkpoint server. You may also obtain checkpointing network performance benefits by using multiple checkpoint servers, as discussed below.
NOTE: It is a good idea to pick very stable machines for your checkpoint servers. If the checkpoint servers crash, the Condor system will continue to operate, though poorly. While the Condor system will recover from a checkpoint server crash as best it can, there are two problems that can (and will) occur:
Basically, this represents the maximum amount of CPU time you are
willing to discard by starting a job over from scratch if the
checkpoint server is not responding to requests.
Because of the problems that exist if your pool is configured to use a
checkpoint server and that server is down, it is advisable to shut
your pool down before doing any maintenance on your checkpoint
server.
See section 3.9 for details on how to do that.
When modifying the checkpoint server configuration of a submission machine, make sure there are no jobs currently in the queue on that machine. If jobs in your queues have checkpoint files in the local spool directories of your submission machines, those jobs will never run once those machines are configured to use a checkpoint server, because the checkpoint files cannot be found on the server. Either remove the jobs from your queues or let them complete before configuring those submission machines. Alternatively, you may proceed with the installation, configuring only those submission machines with empty queues and postponing the configuration of machines with non-empty queues until their queues are empty.
To install a checkpoint server, download the appropriate binary contrib module for the platform(s) your server will run on. When you uncompress and untar the file, you'll have a directory that contains a README, ckpt_server.tar, and so on. The ckpt_server.tar acts much like the release.tar file from a main release. This archive contains these files:
sbin/condor_ckpt_server
sbin/condor_cleanckpts
etc/examples/condor_config.local.ckpt.server
These are all new files, not found in the main release, so you can
safely untar the archive directly into your existing release
directory.
condor_ckpt_server is the checkpoint server binary.
condor_cleanckpts is a script that can be periodically run to
remove stale checkpoint files from your server.
Normally, the checkpoint server cleans all old files by itself.
However, in certain error situations, stale files can be left that are
no longer needed.
So, you may want to put a cron job in place that calls
condor_cleanckpts every week or so, just to be safe.
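A hypothetical crontab entry for the user Condor runs as, invoking the script early every Sunday morning (the path is a placeholder for your release directory):

```
# min hour day month weekday  command
0 4 * * 0 /full/path/to/your/release/directory/sbin/condor_cleanckpts
```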
The example config file is described below.
Once you have unpacked the contrib module, there are a few more steps you must complete. Each is discussed in its own section:
There are a few settings you must place in the local config file of your checkpoint server. The file etc/examples/condor_config.local.ckpt.server contains all such settings, and you can just insert it into the local configuration file of your checkpoint server machine.
There is one setting that you must customize: CKPT_SERVER_DIR, which defines where your checkpoint files will be located. It is best if this is on a very fast local file system (preferably a RAID). The speed of this file system has a direct impact on how quickly checkpoint files can be stored and retrieved by remote machines.
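For example, in the checkpoint server's local config file (the path is a placeholder for a fast local file system on the server):

```
CKPT_SERVER_DIR = /full/path/to/ckpt_files
```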
The other optional settings are:
The rest of these settings are the checkpoint-server-specific versions
of the Condor logging entries, described in section 3.4.3.
To spawn a checkpoint server once it is configured to run on a given
machine, all you have to do is restart Condor on that host to enable
the condor_master to notice the new configuration.
You can do this by sending a condor_restart command from any machine
with ``administrator'' access to your pool.
See section 3.8 for full details about IP/host-based security in Condor.
Once the checkpoint server is installed and running, you just have to change a few settings in your config files to let your pool know about your new server:
It is most convenient to set these parameters in your global config file so they are in effect for all submission machines. However, you may configure each submission machine separately (using local config files) if you do not want all of your submission machines to use a checkpoint server at this time. If USE_CKPT_SERVER is set to ``False'' or is undefined, the submission machine will not use a checkpoint server.
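As a sketch, the global config entries might look like the following, using the setting names mentioned in this section (the hostname is a placeholder):

```
# Let submission machines use the checkpoint server.
USE_CKPT_SERVER  = True

# Full hostname of the machine running condor_ckpt_server (placeholder).
CKPT_SERVER_HOST = ckpt-server.your.domain.edu
```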
Once these settings are in place, you simply have to send a
condor_reconfig to all machines in your pool so the changes take
effect.
This is described in section 3.9.2.
It is possible to configure a Condor pool to use multiple checkpoint servers. This enables the administrator to deploy checkpoint servers across the network to improve checkpointing performance. In this case, Condor machines are configured to checkpoint to the ``nearest'' checkpoint server. There are two main benefits to deploying multiple checkpoint servers:
Once you have multiple checkpoint servers running in your pool, the following configuration changes are required to make them active.
First, USE_CKPT_SERVER should be set to ``True'' on all
submission machines whose jobs should use a checkpoint server.
Additionally, STARTER_CHOOSES_CKPT_SERVER should be set to
``True'' on these submission machines.
When true, this parameter specifies that the checkpoint server
specified by the execution machine should be used instead of the
checkpoint server specified by the submission machine.
(See section 3.4.6 for more details.)
This allows the job to use the checkpoint server closest to the
machine on which it is running, instead of the server closest to the
submission machine.
For convenience, we suggest that you set these parameters in the
global config file.
Next, you must set CKPT_SERVER_HOST on each machine. As described above, this should be set to the full hostname of the checkpoint server machine. In the case of multiple checkpoint servers, you will want to set this to be the hostname of the nearest server for each machine in the local config file.
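Putting these steps together as a sketch (all hostnames are placeholders):

```
# Global config file (applies to all submission machines):
USE_CKPT_SERVER             = True
STARTER_CHOOSES_CKPT_SERVER = True

# Local config file on each machine -- point at the nearest
# checkpoint server for that machine:
CKPT_SERVER_HOST = ckpt-server-1.your.domain.edu
```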
Finally, once these settings are in place, you simply have to send a
condor_reconfig to all machines in your pool so the changes take
effect.
This is described in section 3.9.2.
Now, the jobs in your pool will checkpoint to the nearest checkpoint server. On restart, the job will remember where its checkpoint was stored and read it from the appropriate server. After a job successfully writes a checkpoint to a new server, it will remove any previous checkpoints left on other servers.
NOTE: If the configured checkpoint server is unavailable, the job will keep trying to contact that server as described above. It will not use alternate checkpoint servers. This may change in future versions of Condor.
To install support for PVM in Condor, download the file archive from http://www.cs.wisc.edu/condor/condor-pvm and follow the directions found in the INSTALL file contained in the archive.