Next: 2.6 Submitting a Job Up: 2. Users' Manual Previous: 2.4 Road-map for running

2.5 Job Preparation

Before submitting your program to Condor, you must first make certain your program is batch ready. Next you'll need to decide upon a Condor Universe, or runtime environment, for your job.

2.5.1 Batch Ready

Condor runs your program unattended and in the background. Make certain that your program can do this before submitting it to Condor. Condor can redirect console output (stdout and stderr) to files and take keyboard input (stdin) from a file for you, so you may have to create one or more files that contain the keystrokes your program expects.

It is also very easy to submit multiple runs of your program to Condor. Perhaps you want to run the same program 500 times on 500 different input data sets. If so, you need to arrange your data files so that each run can read its own input, and so that one run's output files do not clobber (overwrite) another run's files. For each individual run, Condor allows you to customize that run's initial working directory, stdin, stdout, stderr, command-line arguments, and shell environment. Therefore, if your program directly opens its own data files, it should ideally read the filenames to use from either stdin or the command line. If your program opens a static filename every time, you will likely need to make a separate subdirectory for each run to hold its data files.
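One way to arrange data for many runs is to give each run its own subdirectory. The following Python sketch is only an illustration of that layout. The helper name prepare_run_dirs, the directory names (run_0, run_1, ...), and the file name input.dat are hypothetical and are not part of Condor.

```python
import os
import tempfile


def prepare_run_dirs(base_dir, datasets, input_name="input.dat"):
    """Create one subdirectory per run and write that run's input into it.

    Returns the list of directories created, in run order.
    """
    run_dirs = []
    for run_number, data in enumerate(datasets):
        # Each run gets a private directory, so a program that always
        # opens the same static filename cannot clobber another run.
        run_dir = os.path.join(base_dir, "run_%d" % run_number)
        os.makedirs(run_dir, exist_ok=True)
        with open(os.path.join(run_dir, input_name), "w") as handle:
            handle.write(data)
        run_dirs.append(run_dir)
    return run_dirs


if __name__ == "__main__":
    # Demonstration in a scratch directory with three small data sets.
    base = tempfile.mkdtemp()
    for directory in prepare_run_dirs(base, ["1 2 3\n", "4 5 6\n", "7 8 9\n"]):
        print(directory)
```

Each run's initial working directory can then be set to its own subdirectory, so that a program which opens a static filename finds its own copy of the input and writes its output alongside it.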

2.5.2 Choosing a Condor Universe

A Universe in Condor defines an execution environment. You state which Universe to use for each job in the submit-description file when the job is submitted. Condor Version 6.1.2 supports three different Universes for user jobs: Standard, Vanilla, and PVM.

If your program is a parallel application written for PVM, then you would ask Condor for the PVM universe at submit time. See section 2.9 for information on using Condor with PVM jobs.

Otherwise, you need to decide between Standard or Vanilla Universe. In general, Standard Universe provides more services to your job than Vanilla Universe and therefore Standard is usually preferable. But Standard Universe also imposes some restrictions on what your job can do. Vanilla Universe has very few restrictions, and can be used when either the Standard Universe's additional services are not desired or when the job cannot abide by the Standard Universe's restrictions.

  
2.5.2.1 Standard Universe

In the Standard Universe, which is the default, Condor automatically checkpoints the job, that is, it takes a snapshot of the job's current state. If a Standard Universe job is running on a machine and must leave it (perhaps because the owner of the machine has returned), Condor checkpoints the job and migrates it to another idle machine. There, Condor restarts the job from the checkpoint, so the job continues from where it left off.
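The idea of resuming from a snapshot can be sketched at the application level. This is only an analogy: Condor checkpoints the whole process transparently, with no changes to the program, whereas the toy below saves its own state explicitly with Python's pickle module. The function name run_with_checkpoints and the stop_after parameter (which simulates the job being forced off a machine) are hypothetical.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(total_steps, checkpoint_path, stop_after=None):
    """Sum the integers 0..total_steps-1, saving state after every step.

    If checkpoint_path exists, the computation resumes from it.
    stop_after simulates eviction after that many steps in this call.
    Returns (finished, running_sum).
    """
    state = {"next_step": 0, "running_sum": 0}
    if os.path.exists(checkpoint_path):
        # Restart from the last snapshot instead of from the beginning.
        with open(checkpoint_path, "rb") as handle:
            state = pickle.load(handle)

    steps_done_now = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_done_now >= stop_after:
            return False, state["running_sum"]  # "evicted" mid-run
        state["running_sum"] += state["next_step"]
        state["next_step"] += 1
        steps_done_now += 1
        # Take a snapshot of the current state.
        with open(checkpoint_path, "wb") as handle:
            pickle.dump(state, handle)
    return True, state["running_sum"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    print(run_with_checkpoints(10, path, stop_after=4))  # interrupted run
    print(run_with_checkpoints(10, path))                # resumed run
```

The second call does not repeat the first four steps; it picks up from the saved state, which is the property that lets a checkpointed job move between machines without losing work.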

Furthermore, Standard Universe jobs can use Condor's remote system calls mechanism, which enables the program to access data files from any machine in the Condor pool, regardless of whether that machine shares a file system via NFS (or AFS) or whether the user has an account there. Even if your files are just sitting on your local hard drive, or in /tmp, Condor jobs can access them. This works as follows: when your Condor job starts up on some remote machine, a corresponding condor_shadow process also starts up on the machine where you submitted the job. As your job runs on the remote machine, Condor traps hundreds of operating system calls (such as calls to open, read, and write files) and ships them over the network via a remote procedure call to the condor_shadow process. The condor_shadow executes the system call on the submit machine and passes the result back over the network to your Condor job. The end result is that, to your job, everything appears as if it were simply running on the submit machine, even as it bounces around to different machines in the pool.
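The forwarding pattern described above can be modeled in a few lines. This is a toy model under stated assumptions, not Condor's implementation: the classes ToyShadow and ToyRemoteJob and the request format are hypothetical, and the "network" is replaced by a direct function call so the sketch stays self-contained.

```python
import os
import tempfile


class ToyShadow:
    """Stands in for the process on the submit machine: does the real file I/O."""

    def __init__(self):
        self._files = {}
        self._next_fd = 0

    def handle(self, call, *args):
        # Execute one forwarded request locally and return its result.
        if call == "open":
            path, mode = args
            fd = self._next_fd
            self._next_fd += 1
            self._files[fd] = open(path, mode)
            return fd
        if call == "read":
            (fd,) = args
            return self._files[fd].read()
        if call == "write":
            fd, data = args
            return self._files[fd].write(data)
        if call == "close":
            (fd,) = args
            self._files.pop(fd).close()
            return None
        raise ValueError("unsupported call: %s" % call)


class ToyRemoteJob:
    """Stands in for the job on the execute machine: never opens files itself."""

    def __init__(self, send_request):
        # send_request would cross the network in a real system.
        self._send = send_request

    def copy_upper(self, src, dst):
        fd_in = self._send("open", src, "r")
        text = self._send("read", fd_in)
        self._send("close", fd_in)
        fd_out = self._send("open", dst, "w")
        self._send("write", fd_out, text.upper())
        self._send("close", fd_out)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "in.txt")
    dst = os.path.join(workdir, "out.txt")
    with open(src, "w") as handle:
        handle.write("hello\n")
    ToyRemoteJob(ToyShadow().handle).copy_upper(src, dst)
    print(open(dst).read(), end="")
```

The job side only issues requests and consumes results; every file is opened where the shadow runs. That separation is why, in the real mechanism, the job does not need a shared file system or a user account on the machine that executes it.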

The transparent checkpoint/migration and remote system calls are highly desirable services. However, all Standard Universe jobs must be re-linked with the Condor libraries. Although this is a simple process, after doing so there are a few restrictions on what the program can do:

1. On some platforms, specifically HPUX and Digital Unix (OSF/1), shared libraries are not supported; therefore on these platforms applications must be statically linked (Note: shared library checkpoint support is available on IRIX, Solaris, and LINUX).
2. Only single process jobs are supported, i.e. the fork(2), exec(2), system(3) and similar calls are not implemented.
3. Signals and signal handlers are supported, but Condor reserves the SIGUSR2 and SIGTSTP signals and does not permit their use by user code.
4. Most interprocess communication (IPC) calls are not supported, i.e. the socket(2), send(2), recv(2), and similar calls are not implemented.
5. All file operations must be idempotent -- read-only and write-only file accesses work correctly, but programs which both read and write to the same file may not.
6. Each Condor job that has been checkpointed has an associated checkpoint file which is approximately the size of the address space of the process. Disk space must be available to store the checkpoint file on the submitting machine (or on a Condor Checkpoint Server if your site administrator has set one up).

Although relinking a program for use in Condor's Standard Universe is very easy to do and typically requires no changes to the program's source code, sometimes users who wish to utilize Condor do not have access to their program's source or object code. Without access to either the source or object code, relinking for the Standard Universe is impossible. This situation is typical with commercial applications, which usually only provide a binary executable and only rarely provide source or object code.

2.5.2.2 Vanilla Universe

The Vanilla Universe in Condor is for running programs that cannot be successfully re-linked for submission into the Standard Universe. Shell scripts are another good candidate for the Vanilla Universe. The downside is that Vanilla jobs cannot checkpoint or use remote system calls. So, for example, when a user returns to a workstation running a Vanilla job, Condor can either suspend the job or restart it from the beginning somewhere else. Furthermore, unlike Standard jobs, Vanilla jobs must rely on an external mechanism (such as NFS or AFS) for accessing data files from different machines, because remote system calls are only available in the Standard Universe.

2.5.3 Relinking for the Standard Universe

Relinking a program with the Condor libraries (condor_rt0.o and condor_syscall_lib.a) is a simple one-step process with Condor Version 6.1.2. To re-link a program with the Condor libraries for submission into the Standard Universe, simply run condor_compile. See the command reference page for condor_compile.

Note that even once your job is re-linked, you can still run your program outside of Condor directly from the shell prompt as usual. When you do this, the following message is printed to remind you that this binary is linked with the Condor libraries:

  WARNING: This binary has been linked for Condor.
  WARNING: Setting up to run outside of Condor...


condor-admin@cs.wisc.edu