  
3.6 Configuring The Startd Policy

This section describes how to configure the condor_startd to implement the policy you choose for when remote jobs should start, be suspended, (possibly) resumed, vacated (with a checkpoint) or killed (no checkpoint). This policy is the heart of Condor's balancing act between the needs and wishes of resource owners (machine owners) and resource users (people submitting their jobs to Condor). Please read this section carefully if you plan to change any of the settings described below, as getting it wrong can have a severe impact on either the owners of machines in your pool (in which case they might ask to be removed from the pool entirely) or the users of your pool (in which case they might stop using Condor).

Before we get into the details, there are a few things to note:

To define your policy, you basically set a number of expressions in the config file (see section 3.4 on ``Configuring Condor'' for an introduction to Condor's config files). These expressions are evaluated in the context of the machine's ClassAd and the ClassAd of a potential resource request (a job that has been submitted to Condor). The expressions can therefore reference attributes from either ClassAd. First, we'll list all the attributes that are included in the machine's ClassAd. Then, we'll list all the attributes that are included in a job ClassAd. Next, we'll explain the START expression, which describes to Condor what conditions must be met for the machine to start a job. Then, we'll describe the RANK expression, which allows you to specify which kinds of jobs a given machine prefers to run. Then, we'll discuss in some detail how the condor_startd works, in particular, the machine states and activities, to give you an idea of what is possible for your policy decisions. Finally, we offer two example policy settings.

  
3.6.1 Startd ClassAd Attributes

The condor_startd represents the machine on which it is running to the Condor pool. It publishes a number of characteristics about the machine in its ClassAd to help in match-making with resource requests. The values of all these attributes can be found by using condor_status -l hostname. On an SMP machine, the startd will break the machine up and advertise it as separate virtual machines, each with its own name and ClassAd. The attributes themselves and what they represent are described below:

Activity
: String which describes Condor job activity on the machine. Can have one of the following values:
``Idle''
: There is no job activity
``Busy''
: A job is busy running
``Suspended''
: A job is currently suspended
``Vacating''
: A job is currently checkpointing
``Killing''
: A job is currently being killed
``Benchmarking''
: The startd is running benchmarks
AFSCell
: If the machine is running AFS, this is a string containing the AFS cell name.
Arch
: String with the architecture of the machine. Typically one of the following:
``INTEL''
: Intel CPU (Pentium, Pentium II, etc).
``ALPHA''
: Digital Alpha CPU
``SGI''
: Silicon Graphics MIPS CPU
``SUN4u''
: Sun UltraSparc CPU
``SUN4x''
: A Sun Sparc CPU other than an UltraSparc, i.e. sun4m or sun4c CPU found in older Sparc workstations such as the Sparc 10, Sparc 20, IPC, IPX, etc.
``HPPA1''
: Hewlett Packard PA-RISC 1.x CPU (i.e. PA-RISC 7000 series CPU) based workstation
``HPPA2''
: Hewlett Packard PA-RISC 2.x CPU (i.e. PA-RISC 8000 series CPU) based workstation
ClockDay
: The day of the week, where 0 = Sunday, 1 = Monday, ... , 6 = Saturday.
ClockMin
: The number of minutes passed since midnight.
CondorLoadAvg
: The load average generated by Condor (either from remote jobs or running benchmarks).
ConsoleIdle
: The number of seconds since activity on the system console keyboard or console mouse has last been detected.
Cpus
: Number of CPUs in this machine, i.e. 1 = single CPU machine, 2 = dual CPUs, etc.
CurrentRank
: A float which represents this machine owner's affinity for running the Condor job which it is currently hosting. If not currently hosting a Condor job, CurrentRank is -1.0.
Disk
: The amount of disk space on this machine available for the job in kbytes (e.g. 23000 = 23 megabytes). Specifically, this is the amount of disk space available in the directory specified in the Condor configuration files by the EXECUTE macro, minus any space reserved with the RESERVED_DISK macro.
EnteredCurrentActivity
: Time at which the machine entered the current Activity (see Activity entry above). Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).
FileSystemDomain
: A domain name configured by the Condor administrator which describes a cluster of machines which all access the same networked filesystems, usually via NFS or AFS.
KeyboardIdle
: The number of seconds since activity on any keyboard or mouse associated with this machine has last been detected. Unlike ConsoleIdle, KeyboardIdle also takes activity on pseudo-terminals into account (i.e. virtual ``keyboard'' activity from telnet and rlogin sessions as well). Note that KeyboardIdle will always be equal to or less than ConsoleIdle.
KFlops
: Relative floating point performance as determined via a linpack benchmark.
LastHeardFrom
: Time when the Condor Central Manager last received a status update from this machine. Expressed as seconds since the epoch (integer value). Note: This attribute is only inserted by the Central Manager once it receives the ClassAd. It is not present in the startd's copy of the ClassAd. Therefore, you couldn't use this attribute in defining startd expressions (which you wouldn't want to, anyway).
LoadAvg
: A floating point number with the machine's current load average.
Machine
: A string with the machine's fully qualified hostname.
Memory
: The amount of RAM in megabytes.
Mips
: Relative integer performance as determined via a dhrystone benchmark.
MyType
: The ClassAd type; always set to the literal string ``Machine''.
Name
: The name of this resource; typically the same value as the Machine attribute, but could be customized by the site administrator. On SMP machines, the startd will divide the CPUs up into separate virtual machines, each with a unique name. These names will be of the form ``vm#@full.hostname'', for example, ``vm1@vulture.cs.wisc.edu'', which signifies virtual machine 1 from vulture.cs.wisc.edu.
OpSys
: String describing the operating system running on this machine. For Condor Version 6.1.2 typically one of the following:
``HPUX10'' (for HPUX 10.20)
``IRIX6'' (for IRIX 6.2, 6.3, or 6.4)
``LINUX'' (for LINUX 2.0.x kernel systems)
``LINUX-GLIBC'' (for LINUX systems, using GNU's libc)
``OSF1'' (for Digital Unix 4.x)
``SOLARIS251''
``SOLARIS26''
Requirements
: A boolean which, when evaluated within the context of the Machine ClassAd and a Job ClassAd, must evaluate to TRUE before Condor will allow the job to use this machine.
StartdIpAddr
: String with the IP and port address of the condor_startd daemon which is publishing this Machine ClassAd.
State
: String which publishes the machine's Condor state, which can be:
``Owner''
: The machine owner is using the machine, and it is unavailable to Condor.
``Unclaimed''
: The machine is available to run Condor jobs, but a good match (i.e. job to run here) is either not available or not yet found.
``Matched''
: The Condor Central Manager has found a good match for this resource, but a Condor scheduler has not yet claimed it.
``Claimed''
: The machine is claimed by a remote condor_schedd and is probably running a job.
``Preempting''
: A Condor job is being preempted (possibly via checkpointing), either to clear the machine for a higher priority job or because the machine owner wants the machine back.
TargetType
: Describes what type of ClassAd to match with. Always set to the string literal ``Job'', because Machine ClassAds always want to be matched with Jobs, and vice-versa.
UidDomain
: A domain name configured by the Condor administrator which describes a cluster of machines which all have the same "passwd" file entries, and therefore all have the same logins.
VirtualMemory
: The amount of currently available virtual memory (swap space) expressed in kbytes.
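
To make the ClockDay and ClockMin numbering concrete, here is a minimal Python sketch that derives both values from a timestamp. The function name clock_attributes is hypothetical, and the use of local wall-clock time is an assumption; this only illustrates the conventions listed above, not how the startd itself computes the attributes.

```python
from datetime import datetime

def clock_attributes(now):
    """Return (ClockDay, ClockMin) for a datetime, per the definitions above."""
    # Python's weekday() counts Monday as 0; ClockDay counts Sunday as 0.
    clock_day = (now.weekday() + 1) % 7
    # ClockMin is the number of minutes elapsed since midnight.
    clock_min = now.hour * 60 + now.minute
    return clock_day, clock_min

# Sunday, January 2, 2000 at 13:05
print(clock_attributes(datetime(2000, 1, 2, 13, 5)))
```

Note the off-by-one relative to Python's own weekday numbering; an expression that tests ClockDay == 0 is testing for Sunday, not Monday.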

  
3.6.2 Job ClassAd Attributes

This section has not yet been written.

  
3.6.3 The START expression

The most important expression in the startd (and possibly in all of Condor) is the START expression. This expression describes what conditions must be met for a given machine to service a resource request (in other words, start someone's job). This expression (like any other expression) can reference attributes in the machine's ClassAd (such as KeyboardIdle, LoadAvg, etc), or attributes in a potential requester's ClassAd (such as Owner, Imagesize, even Cmd, the name of the executable the requester wants to run). What the START expression evaluates to plays a crucial role in determining what state and activity the machine is in.

It is technically the Requirements expression that is used for matching with other jobs. The startd just always defines the Requirements expression as the START expression. However, in situations where the machine wants to make itself unavailable for further matches, it sets its Requirements expression to False, not its START expression. When the START expression locally evaluates to true, the machine advertises the Requirements expression as ``True'' and doesn't even publish the START expression.

Normally, the expressions in the machine ClassAd are evaluated against certain request ClassAds in the condor_negotiator to see if there is a match, or against whatever request ClassAd currently has claimed the machine. However, by locally evaluating an expression, the machine only evaluates the expression against its own ClassAd. If an expression cannot be locally evaluated (because it references other expressions that are only found in a request ad, such as Owner or Imagesize), the expression is (usually) undefined. See the ClassAd appendix, section 4.1, for specifics of how undefined terms are handled in ClassAd expression evaluation.

NOTE: If you have machines with lots of real memory and swap space so the only scarce resource is CPU time, you could use the JOB_RENICE_INCREMENT (see section 3.4.12 on ``condor_starter Config File Entries'' for details) so that Condor starts jobs on your machine with low priority. Then, you could set up your machines with:

        START : True
        SUSPEND : False
        PREEMPT : False
        KILL : False
This way, Condor jobs would always run and would never be kicked off. However, because they would run with ``nice priority'', interactive response on your machines would not suffer. You probably wouldn't even notice Condor was running the jobs, assuming you had enough free memory for the Condor jobs so that you weren't swapping all the time.

  
3.6.4 The RANK expression

A machine can be configured to prefer running certain jobs over other jobs. This is done via the RANK expression. This is an expression, just like any other in the machine's ClassAd. It can reference any attribute found in either the machine ClassAd or a request ad (normally, in fact, it references things in the request ad). Probably the most common use of this expression is to configure a machine to prefer to run jobs from the owner of that machine, or by extension, a group of machines to prefer jobs from the owners of those machines.

For example, imagine you have a small research group with 4 machines: ``tenorsax'', ``piano'', ``bass'' and ``drums''. These machines are owned by 4 users: ``coltrane'', ``tyner'', ``garrison'' and ``jones'', respectively.

Say there's a large Condor pool in your department, but you spent a lot of money on really fast machines for your group. You want to make sure that if anyone in your group has Condor jobs, they have priority on your machines. To achieve this, all you have to do is set the RANK expression on your machines to refer to the Owner attribute and prefer requests where that attribute matches one of the people in your group:

        RANK : Owner == "coltrane" || Owner == "tyner" \
               || Owner == "garrison" || Owner == "jones"

The RANK expression is evaluated as a floating point number. However, just like in C, boolean expressions evaluate to either 1 or 0 depending on if they're true or false. So, if this expression evaluated to 1 (because the remote job was owned by one of the blessed folks), that would be higher than anyone else (for whom the expression would evaluate to 0).

If you wanted to get really fancy, you could still have the same basic setup, where anyone from your group has priority on your machines, but the actual machine owner has even more priority on their own machine. For example, you'd put the following entry in Jimmy Garrison's local config file bass.local:

        RANK : (Owner == "coltrane") + (Owner == "tyner") \
               + ((Owner == "garrison") * 10) + (Owner == "jones")
Notice, we're using ``+'' instead of ``||'', since we want to be able to distinguish which terms matched and which ones didn't. Each comparison is wrapped in parentheses so that it is evaluated before the addition. Now, if anyone who wasn't in the John Coltrane quartet was running a job on ``bass'', the RANK would evaluate numerically to 0, since none of those boolean terms would evaluate to 1, and 0+0+0+0 is still 0. Now, suppose Elvin Jones submits a job. His job would match this machine (assuming the START was true for him at that time) and the RANK would numerically evaluate to 1 (since one of the boolean terms would evaluate to 1), so Elvin would preempt whoever else was using the machine at the time. After a while, say Jimmy decides to submit a job (maybe even from another machine, it doesn't matter, all that matters is that it's Jimmy's job). Now, the RANK would evaluate to 10, since the boolean that matches him gets multiplied by 10. So, Jimmy would preempt even Elvin, and his job would run on his machine.
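
If it helps to see the arithmetic, the following Python sketch mimics how the RANK expression for ``bass'' scores jobs from different owners. It is only an illustration of the numbers worked out above: rank_on_bass is a hypothetical helper, ``davis'' is a made-up user from outside the group, and nothing here is part of Condor itself.

```python
def rank_on_bass(owner):
    """Mimic the RANK expression in bass.local for a job's Owner attribute."""
    # As in the RANK expression, a true comparison counts as 1 and a
    # false one as 0, so the terms can simply be added together.
    return float((owner == "coltrane") + (owner == "tyner")
                 + (owner == "garrison") * 10 + (owner == "jones"))

for owner in ("davis", "jones", "garrison"):
    print(owner, rank_on_bass(owner))
```

A job with a higher value is the one the machine would rather serve, which is why Jimmy Garrison's job displaces Elvin Jones's, and Elvin's displaces an outsider's.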

The RANK expression doesn't just have to refer to the Owner of the jobs. Suppose you have a machine with a ton of memory, and others with not much at all. You could configure your big-memory machine to prefer to run jobs with bigger memory requirements:

        RANK : ImageSize

That's all there is to it. The bigger the job, the more this machine wants to run it. That's pretty altruistic of you, always servicing bigger and bigger jobs, even if they're not yours. So, perhaps you still want to be a nice guy, all else being equal, but if you have jobs, you want to run them, regardless of everyone else's Imagesize:

        RANK : ((Owner == "coltrane") * 1000000000000) + ImageSize
This scheme would break down if someone submitted a job with an image size of more than 10^12 kbytes. However, if they did, this RANK expression preferring their job over yours wouldn't be the only problem Condor had. :-)
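
The breakdown point can be checked with a short Python sketch. The helper rank and the constant name COLTRANE_BONUS are hypothetical, and the image sizes are invented purely for illustration.

```python
COLTRANE_BONUS = 10**12  # the same constant used in the RANK expression

def rank(owner, image_size_kb):
    """Owner bonus plus image size, mirroring the RANK expression above."""
    return (owner == "coltrane") * COLTRANE_BONUS + image_size_kb

# A tiny job from coltrane outranks a very large job from anyone else...
print(rank("coltrane", 1_000) > rank("tyner", 500_000_000))
# ...unless the other job's image size is larger than 10^12 kbytes.
print(rank("coltrane", 1_000) > rank("tyner", 10**12 + 2_000))
```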

  
3.6.5 Machine States

A given machine could be in a number of different states, depending on whether or not the machine is available to run Condor jobs, and if so, what stage in the Condor protocol has been reached. The possible states are:

Owner
The machine is being used by the machine owner, or at least is not available to run Condor jobs. When the machine first starts up, it begins in this state.
Unclaimed
The machine is available to run Condor jobs, but is not currently doing so in any way.
Matched
The machine is available to run jobs, and has been matched by the negotiator with a given schedd. That schedd just hasn't claimed this machine yet. In this state, the machine is unavailable for further matches.

Claimed
The machine has been claimed by a schedd.
Preempting
The machine was claimed by a schedd, but is now preempting that claim because either the owner of the machine came back, the negotiator decided to preempt this match because another user with higher priority has jobs waiting to run, or the negotiator decided to preempt this match because it found another request that this resource would rather serve (see section 3.6.4 on the RANK expression).

See figure 3.2 for the various states and the possible transitions between them.


  
Figure 3.2: Machine States

  
3.6.6 Machine Activities

Within some of these states, there could be a number of different activities the machine is in. The idea is that all the things that are true about a given state are true regardless of what activity you are in. However, there are certain important differences between each activity, which is why they are separated out from each other within a given state. In general, you must specify both a state and an activity to describe what ``state'' the machine is in. This will be denoted in this manual as ``state/activity'' pairs. For example, ``Claimed/Busy''. The following list describes all the possible state/activity pairs:

Figure 3.3 gives the overall view of all machine states and activities, and shows all the possible transitions from one to another within the Condor system. Each transition is labeled with a number on the diagram, and transition numbers referred to in this manual will be bold. This may seem pretty daunting, but it's actually easier to handle than it looks.


  
Figure 3.3: Machine States and Activities

Various expressions are used to determine when and if many of these state and activity transitions occur. Other transitions are initiated by parts of the Condor protocol (such as when the condor_negotiator matches a machine with a schedd). The following section describes the conditions that lead to the various state and activity transitions.

  
3.6.7 State and Activity Transitions

This section will trace through all possible state and activity transitions within the machine and describe the conditions under which each one occurs. Whenever a transition occurs, the machine records when it entered its new activity and/or new state. These times are often used to write the expressions that determine when further transitions occur (for example, you might only enter the Killing activity if you've been in the Vacating activity longer than a given amount of time).

  
3.6.7.1 Owner State

When the startd is first spawned, the machine it represents enters the Owner state. The machine will remain in this state as long as the START expression locally evaluates to false. If the START locally evaluates to true or can't be locally evaluated (it evaluates to undefined), transition 1 will occur and the machine will enter the Unclaimed state.

So long as the START expression locally evaluates to false, there is no possible request in the Condor system that could match it, so the machine is unavailable to Condor and stays in the Owner state. For example, if the START expression was:

        START : KeyboardIdle > 15 * $(MINUTE) && Owner == "coltrane"
and if KeyboardIdle was only 34 seconds, then the machine would still be in the Owner state, even though the expression references Owner, which is undefined. False && anything is False, even False && undefined.

If, however, the START expression was:

        START : KeyboardIdle > 15 * $(MINUTE) || Owner == "coltrane"
and KeyboardIdle was still only 34 seconds, then the machine would leave the Owner state and go to Unclaimed. This is because ``False || undefined'' is undefined. So, while this machine isn't available to just anybody, if user ``coltrane'' has jobs submitted, the machine is willing to run them. Anyone else would have to wait until KeyboardIdle exceeds 15 minutes. However, since ``coltrane'' might claim this resource, but hasn't yet, the machine goes to the Unclaimed state.
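
The two START examples can be reproduced with a small Python model of three-valued logic. This is a sketch of the evaluation rules described here, not Condor's ClassAd evaluator: the helpers and3, or3, eq3, local_start and stays_in_owner_state are hypothetical names, and UNDEFINED is just a sentinel object standing in for the ClassAd undefined value.

```python
UNDEFINED = object()  # stands in for an attribute missing from the machine ad

def and3(a, b):
    """Three-valued AND: False wins even against UNDEFINED."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True

def or3(a, b):
    """Three-valued OR: True wins even against UNDEFINED."""
    if a is True or b is True:
        return True
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return False

def eq3(a, b):
    """Comparison that yields UNDEFINED if either side is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b

MINUTE = 60
machine_ad = {"KeyboardIdle": 34}  # no Owner: that lives in the request ad

def local_start(combine):
    """Evaluate START against the machine ad only, joining terms with combine."""
    idle_long_enough = machine_ad["KeyboardIdle"] > 15 * MINUTE
    owner_matches = eq3(machine_ad.get("Owner", UNDEFINED), "coltrane")
    return combine(idle_long_enough, owner_matches)

def stays_in_owner_state(start_value):
    """The machine stays in Owner only while START is locally False."""
    return start_value is False

print(stays_in_owner_state(local_start(and3)))  # the && version of START
print(stays_in_owner_state(local_start(or3)))   # the || version of START
```

With ``&&'' the result is False and the machine stays in the Owner state; with ``||'' the result is undefined, which is not False, so the machine moves on to Unclaimed.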

While in the Owner state the startd only polls the status of the machine every UPDATE_INTERVAL to see if anything has changed that would lead it to a different state. The idea is that you don't want to put much load on the machine while the Owner is using it (frequently waking up, computing load averages, checking the access times on files, computing free swap space, etc), and there's nothing time critical that the startd needs to be sure to notice as soon as it happens. If the START expression evaluates to True and it's 5 minutes before we notice it, that's a drop in the bucket of High Throughput Computing.

The machine can only go to the Unclaimed state from the Owner state, and only does so when the START expression no longer locally evaluates to False. Generally speaking, if the START expression locally evaluates to false at any time, the machine will either transition directly to the Owner state, or to the Preempting state on its way to the Owner state, if there's a job running that needs preempting.

  
3.6.7.2 Unclaimed State

While in the Unclaimed state, if the START expression locally evaluates to false, the machine will return to the Owner state via transition 2.

When it's in the Unclaimed state, another expression comes into effect, RunBenchmarks. Whenever RunBenchmarks evaluates to True while the machine is in the Unclaimed state, the machine will transition from the Idle activity to the Benchmarking activity (transition 3) and perform benchmarks to determine MIPS and KFLOPS. When the benchmarks complete, the machine returns to the Idle activity (transition 4).

The startd automatically inserts an attribute, LastBenchmark, whenever it runs benchmarks, so commonly RunBenchmarks is defined in terms of this attribute, for example:

        BenchmarkTimer = (CurrentTime - LastBenchmark)
        RunBenchmarks : $(BenchmarkTimer) >= (4 * $(HOUR))
Here, a macro, BenchmarkTimer, is defined to help write the expression. The idea is that this macro holds the time since the last benchmark, so when this time exceeds 4 hours, we run the benchmarks again. The startd keeps a weighted average of these benchmarking results to try to get the most accurate numbers possible. That's why you would want the startd to run them more than once in its lifetime.

NOTE: LastBenchmark is initialized to 0 before the benchmarks have ever been run. So, if you want the startd to run benchmarks as soon as the machine is unclaimed (if it hasn't done so already), just include a term for LastBenchmark as in the example above.
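
A quick Python sketch shows why a LastBenchmark of 0 triggers an immediate run. The function run_benchmarks is a hypothetical stand-in for the RunBenchmarks expression in the example above, and the timestamps are arbitrary values in seconds since the epoch.

```python
HOUR = 3600

def run_benchmarks(current_time, last_benchmark):
    """Mirror of: RunBenchmarks : $(BenchmarkTimer) >= (4 * $(HOUR))"""
    benchmark_timer = current_time - last_benchmark
    return benchmark_timer >= 4 * HOUR

now = 946684800  # an arbitrary "current time"

# Never benchmarked: LastBenchmark is 0, so the timer is enormous.
print(run_benchmarks(now, 0))
# Benchmarked one hour ago: too soon to run again.
print(run_benchmarks(now, now - 1 * HOUR))
# Benchmarked four hours ago: time to run again.
print(run_benchmarks(now, now - 4 * HOUR))
```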

NOTE: If RunBenchmarks is defined, and set to something other than ``False'', the startd will automatically run one set of benchmarks when it first starts up. So, if you want to totally disable benchmarks, both at startup, and at any time thereafter, just set RunBenchmarks to ``False'' or comment it out from your config file.

From the Unclaimed state, the machine can go to two other possible states: Matched or Claimed/Idle. Once the condor_negotiator matches an Unclaimed machine with a requester at a given schedd, the negotiator sends a command to both parties, notifying them of the match. If the schedd gets that notification and initiates the claiming procedure with the machine before the negotiator's message gets to the machine, the Matched state is skipped entirely, and the machine goes directly to the Claimed/Idle state (transition 5). However, normally the machine will enter the Matched state (transition 6), even if it's only for a brief period of time.

  
3.6.7.3 Matched State

The Matched state is not very interesting to Condor. The only noteworthy things are that the machine lies about its START expression while in this state and says that Requirements are false to prevent being matched again before it has been claimed, and that the startd starts a timer to make sure it doesn't stay in the Matched state too long. This timer is set with the MATCH_TIMEOUT config file parameter. It is specified in seconds and defaults to 300 (5 minutes). If the schedd that was matched with this machine doesn't claim it within this period of time, the machine gives up on it and goes back into the Owner state via transition 7 (which it will probably leave right away to get to the Unclaimed state again and wait for another match).

At any time while the machine is in the Matched state, if the START expression locally evaluates to false, the machine enters the Owner state directly (transition 7).

If the schedd that was matched with the machine claims it before the MATCH_TIMEOUT expires, the machine goes into the Claimed/Idle state (transition 8).
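
The three ways out of the Matched state can be summarized in a few lines of Python. This is a sketch under stated assumptions: next_from_matched is a hypothetical function, and the order in which the conditions are checked is our choice for illustration, since the text does not say which one the startd tests first.

```python
def next_from_matched(start_is_false, claimed_by_schedd,
                      seconds_in_matched, match_timeout=300):
    """Return the next state for a machine currently in the Matched state."""
    if start_is_false:
        return "Owner"          # transition 7: START locally evaluates to false
    if claimed_by_schedd:
        return "Claimed/Idle"   # transition 8: the schedd claimed the machine
    if seconds_in_matched > match_timeout:
        return "Owner"          # transition 7: MATCH_TIMEOUT expired
    return "Matched"            # keep waiting for the schedd

print(next_from_matched(False, False, 120))
print(next_from_matched(False, True, 120))
print(next_from_matched(False, False, 301))
```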

  
3.6.7.4 Claimed State

The Claimed state is certainly the most complicated state. It has the most possible activities, and the most expressions that determine what it will do next. In addition, the condor_checkpoint and condor_vacate commands only have any effect on the machine when it's in the Claimed state. In general, there are two sets of expressions that might take effect, depending on whether the universe of the request that claimed the machine is Standard or Vanilla. The Standard Universe expressions are the ``normal'' expressions, for example:

        WANT_SUSPEND            : True
        WANT_VACATE             : $(ActivationTimer) > 10 * $(MINUTE)
        SUSPEND                 : $(KeyboardBusy) || $(CPUBusy)
        ...

The Vanilla expressions have ``_VANILLA'' appended to the end, for example:

        WANT_SUSPEND_VANILLA    : True
        WANT_VACATE_VANILLA     : True
        SUSPEND_VANILLA         : $(KeyboardBusy) || $(CPUBusy)
        ...

If you don't specify separate vanilla versions, the normal versions will be used for all jobs, including vanilla jobs. For the purposes of this manual, we'll always refer to the normal (Standard Universe) expressions. Keep in mind that if the request was for the Vanilla Universe, the Vanilla expressions (if they were defined) would be in effect, instead. The reason for this is that the resource owner might want the machine to behave differently for Vanilla jobs, since they can't checkpoint. For example, they might want to let Vanilla jobs remain suspended for much longer than standard jobs.

While Claimed, the POLLING_INTERVAL takes effect, and the startd starts polling the machine much more frequently to evaluate its state.

If the owner starts typing on the console again, we want to notice as soon as possible and start doing whatever that owner wants at that point. For SMP machines, if any virtual machine is in the Claimed state, the startd will poll the machine more frequently. If we're already polling for one virtual machine, it doesn't really cost us any more to evaluate the state of all the virtual machines at the same time.
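
The choice of polling period on an SMP machine can be sketched as follows. The function polling_period is hypothetical, and treating every state other than Claimed like the Owner state is a simplification on our part: the text only says that Owner uses UPDATE_INTERVAL and Claimed uses POLLING_INTERVAL. The interval values in the usage lines are illustrative, not documented defaults.

```python
def polling_period(vm_states, update_interval, polling_interval):
    """Pick how often (in seconds) the startd re-evaluates the machine.

    vm_states holds the state of each virtual machine on this host.
    """
    # One Claimed virtual machine is enough to switch the whole host
    # to the faster polling rate.
    if any(state == "Claimed" for state in vm_states):
        return polling_interval
    return update_interval

print(polling_period(["Owner", "Owner"], update_interval=300, polling_interval=5))
print(polling_period(["Owner", "Claimed"], update_interval=300, polling_interval=5))
```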

In general, when the startd is going to kick a job off a machine (usually because of activity on the machine that signifies that the owner is using the machine again) the startd will go through successive levels of getting the job out of the way. The first and least costly to the job is suspending it. This even works for Vanilla jobs. If suspending the job for a little while doesn't satisfy the machine owner (the owner is still using the machine after a certain period of time, for example), the startd moves on to vacating the job, which involves performing a checkpoint so that the work it had completed up until this point is not lost. If even that does not satisfy the machine owner (usually because it's taking too long and the owner wants their machine back now), the final, most drastic stage is reached: killing. Killing is just quick death to the job, without a checkpoint. For Vanilla jobs, vacating and killing are basically equivalent, though a vanilla job can request to have a certain softkill signal sent to it at vacate time so that it can perform application-specific checkpointing, for example.

The WANT_SUSPEND expression determines if the machine will even evaluate the SUSPEND expression to consider entering the Suspended activity. The WANT_VACATE expression determines what happens when the machine enters the Preempting state: whether it will go to the Vacating activity, or go directly to Killing. If one or both of these expressions evaluates to false, the machine will skip that stage of getting rid of the job and proceed directly to the more drastic stages.

When the machine first enters the Claimed state, it goes to the Idle activity. From there, it has two options. It can enter the Preempting state via transition 9 (if a condor_vacate comes in, or if the START expression locally evaluates to false). Or, it can enter the Busy activity (transition 10) if the schedd that has claimed the machine decides to activate the claim and start a job.

From Claimed/Busy, the machine can go to many different state/activity combinations. The startd evaluates the WANT_SUSPEND expression to decide which other expressions to evaluate. If WANT_SUSPEND is true, the startd will evaluate the SUSPEND expression, and if it is false, the startd will evaluate the PREEMPT expression and skip the Suspended activity entirely. Here are all the possible state/activity destinations that the machine can get to from Claimed/Busy:

Claimed/Idle
If the starter that is serving a given job exits (because the job completes, for example), the machine will go back to Claimed/Idle (transition 11).
Preempting
If WANT_SUSPEND is false and the PREEMPT expression is true, the machine will enter the Preempting state (transition 12). The other reason the machine would go from Claimed/Busy to Preempting is if the condor_negotiator matched the machine with a ``better'' match. This better match could either be from the machine's perspective (see section 3.6.4 on the RANK Expression above) or from the negotiator's perspective (because a user with a better user priority has jobs that should be running on this machine). In this case, WANT_VACATE is assumed to be true, and the machine will always go to Preempting/Vacating.
Claimed/Suspended
If both the WANT_SUSPEND and SUSPEND expressions evaluate to true, the machine will suspend the job (transition 13).
Claimed/Busy
While it's not really a state change, there is another thing that could happen to the machine while it's in Claimed/Busy, which is that either a condor_checkpoint command could arrive, or the PeriodicCheckpoint expression could evaluate to true. When either of these things occurs, the startd requests that the job begin a periodic checkpoint. Since the startd has no way to know when this process completes, there's no way periodic checkpointing could be its own state. However, for the purposes of all the expressions, periodic checkpointing is Claimed/Busy, just as if a job were running.
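
The destinations reachable from Claimed/Busy can be condensed into a small decision function. This Python sketch is an illustration only: next_from_claimed_busy is a hypothetical name, each argument stands for the current value of the corresponding expression or event, and the order in which the conditions are tested is an assumption, since the text does not specify it.

```python
def next_from_claimed_busy(starter_exited, better_match,
                           want_suspend, suspend, preempt):
    """Return the next state/activity for a machine in Claimed/Busy."""
    if starter_exited:
        return "Claimed/Idle"           # transition 11
    if better_match:
        # WANT_VACATE is assumed to be true for a better match.
        return "Preempting/Vacating"
    if want_suspend:
        # Only SUSPEND is consulted when WANT_SUSPEND is true.
        if suspend:
            return "Claimed/Suspended"  # transition 13
    elif preempt:
        # PREEMPT is consulted when WANT_SUSPEND is false.
        return "Preempting"             # transition 12
    return "Claimed/Busy"               # nothing changed; the job keeps running

print(next_from_claimed_busy(False, False, True, True, True))
print(next_from_claimed_busy(False, False, False, True, True))
```

The first call suspends the job, because WANT_SUSPEND is true and so PREEMPT is ignored; the second goes to Preempting, because WANT_SUSPEND is false and the Suspended activity is skipped entirely.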

You already know what happens in Claimed/Idle, so now we'll discuss what happens in Claimed/Suspended. Again, there are multiple state/activity combinations that you can reach from Claimed/Suspended:

Claimed/Busy
If the CONTINUE expression evaluates to true, the machine will resume the computation and will go back to the Claimed/Busy state (transition 14).

Preempting
If the PREEMPT expression is true, the machine will enter the Preempting state (transition 15).

From the Claimed state, you can only enter other activities in the Claimed state (all of which we've already discussed), or the Preempting state, which is described next.

  
3.6.7.5 Preempting State

The Preempting state is much less complicated than the Claimed state. Basically, there are two possible activities, and two possible destinations. Depending on WANT_VACATE you either enter the Vacating activity (if it's true) or the Killing activity (if it's false).

While in the Preempting state (regardless of activity) the machine advertises its Requirements expression as False to signify that it is not available for further matches, either because it is about to go to the Owner state anyway, or because it has already been matched with one preempting match, and further preempting matches are disallowed until the machine has been claimed by the new match.

The main function of the Preempting state is to get rid of the starter associated with this resource. If the condor_starter associated with a given claim exits while the machine is still in the Vacating activity, it means the job successfully completed its checkpoint.

If the machine is in the Vacating activity, it keeps evaluating the KILL expression. As soon as this expression evaluates to true, the machine enters the Killing activity (transition 16).

When the starter exits, or if there was no starter running when the machine entered the Preempting state (because it came from Claimed/Idle), the other job of the Preempting state is completed: notifying the schedd that had claimed this machine that the claim is broken.

At this point, the machine will either enter the Owner state via transition 17 (if the job was preempted because the machine owner came back) or the Claimed/Idle state via transition 18 (if the job was preempted because a better match was found).

When the machine enters the Killing activity, it begins a timer, the length of which is defined by the KILLING_TIMEOUT macro. This macro is defined in seconds and defaults to 30. If this timer expires and the machine is still in the Killing activity, something has gone seriously wrong with the condor_starter, and the startd tries to vacate the job immediately by sending SIGKILL to all of the condor_starter's children, and then to the condor_starter itself.

Again, once the starter is gone and the schedd that had claimed the machine is notified that the claim is broken, the machine will either enter the Owner state via transition 19 (if the job was preempted because the machine owner came back) or the Claimed/Idle state via transition 20 (if the job was preempted because a better match was found).
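The Preempting state described above can be sketched as a small decision function. This is only an illustration in Python, not the startd's actual code; the function names and the boolean inputs (starter_running, owner_returned) are hypothetical stand-ins for what the startd observes.

```python
def preempting_entry_activity(want_vacate):
    """Activity chosen on entering Preempting, based on WANT_VACATE."""
    return "Vacating" if want_vacate else "Killing"


def preempting_step(activity, starter_running, kill, owner_returned):
    """Return where the machine is after one evaluation pass.

    activity        -- "Vacating" or "Killing"
    starter_running -- hypothetical flag: is a condor_starter still alive?
    kill            -- current value of the KILL expression
    owner_returned  -- hypothetical flag: preempted because the owner came back?
    """
    if not starter_running:
        # The starter is gone, the schedd is told the claim is broken,
        # and the machine leaves the Preempting state.
        return "Owner" if owner_returned else "Claimed/Idle"
    if activity == "Vacating" and kill:
        # Transition 16: give up on the checkpoint and start killing.
        return "Preempting/Killing"
    return "Preempting/" + activity
```

For example, preempting_step("Vacating", True, True, False) gives "Preempting/Killing", while a starter that has already exited sends the machine to Owner or Claimed/Idle depending on why the job was preempted.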

  
3.6.8 State/Activity Transition Expression Summary

The following section is meant to summarize the information from the previous sections to serve as a quick reference. If anything is unclear here, please refer to the previous sections for clarification.

START
When this is true, the machine is willing to spawn a remote Condor job.
RunBenchmarks
While in the Unclaimed state, the machine will run benchmarks whenever this is true.
MATCH_TIMEOUT
If the machine has been in the Matched state longer than this, it will go back to the Owner state.
WANT_SUSPEND
If this is true, the machine will evaluate the SUSPEND expression to see if it should transition to the Suspended activity. If this is false, the machine will look at the PREEMPT expression.
SUSPEND
If WANT_SUSPEND is true, and the machine is in the Claimed/Busy state, it will enter the Suspended activity if SUSPEND is true.
CONTINUE
If the machine is in the Claimed/Suspended state, it will enter the Busy activity if CONTINUE is true.
PREEMPT
If the machine is either in the Claimed/Suspended activity, or is in the Claimed/Busy activity and WANT_SUSPEND is false, the machine will enter the Preempting state whenever PREEMPT is true.
WANT_VACATE
This is only checked when the PREEMPT expression is true and the machine enters the Preempting state. If WANT_VACATE is true, the machine will enter the Vacating activity. If it is false, the machine will proceed directly to the Killing activity.
KILL
If the machine is in the Preempting/Vacating state, it will enter Preempting/Killing whenever KILL is true.
KILLING_TIMEOUT
If the machine is in the Preempting/Killing state for longer than KILLING_TIMEOUT seconds, the startd will just send a SIGKILL to the condor_starter and all its children to try to kill the job as quickly as possible.
PERIODIC_CHECKPOINT
If the machine is in the Claimed/Busy state and PERIODIC_CHECKPOINT is true, the user's job will begin a periodic checkpoint.
RANK
If this expression evaluates to a higher number for a pending resource request than it does for the current request, the machine will preempt the current request (enter the Preempting/Vacating state). When the preemption is complete, the machine will enter the Claimed/Idle state with the new resource request claiming it.
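To see how WANT_SUSPEND, SUSPEND, CONTINUE, PREEMPT and WANT_VACATE fit together in the Claimed state, here is a rough Python sketch of the summary above. It is not the startd's implementation. The function name is hypothetical, and the order in which PREEMPT and CONTINUE are checked in Claimed/Suspended is an assumption of this sketch.

```python
def claimed_step(activity, want_suspend, suspend, cont, preempt, want_vacate):
    """Return the next state/activity for a machine in the Claimed state.

    activity is "Busy" or "Suspended"; the other arguments are the current
    boolean values of the corresponding policy expressions.
    """
    def enter_preempting():
        # WANT_VACATE is only consulted on the way into Preempting.
        return "Preempting/Vacating" if want_vacate else "Preempting/Killing"

    if activity == "Busy":
        if want_suspend:
            # With WANT_SUSPEND true, only SUSPEND matters while Busy.
            return "Claimed/Suspended" if suspend else "Claimed/Busy"
        # With WANT_SUSPEND false, the machine looks at PREEMPT instead.
        return enter_preempting() if preempt else "Claimed/Busy"

    if activity == "Suspended":
        # Assumption: PREEMPT is checked before CONTINUE.
        if preempt:
            return enter_preempting()
        return "Claimed/Busy" if cont else "Claimed/Suspended"

    raise ValueError("unknown activity: " + activity)
```

Note that a true PREEMPT has no effect in Claimed/Busy while WANT_SUSPEND is true; the job must first be suspended.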

  
3.6.9 Example Policy Settings

The following section provides two examples of how you might configure the policy at your pool. Each one is described in English, then the actual macros and expressions used are listed and explained with comments. Finally, the entire set of macros and expressions is listed in one block so you can see them in one place for easy reference.

  
3.6.9.1 Default Policy Settings

These settings are the default as shipped with Condor. They have been used for many years with no problems. The Vanilla expressions are identical to the regular ones. (They aren't even listed here. If you don't define them, the regular expressions are used for Vanilla jobs as well).

First, we define a bunch of macros which help us write the expressions more clearly. In particular, we use:

StateTimer
How long we've been in the current state.

ActivityTimer
How long we've been in the current activity.

ActivationTimer
How long the job has been running on this machine.

LastCkpt
How long it's been since we last performed a periodic checkpoint.

NonCondorLoadAvg
The difference between the system load and the Condor load (i.e., the load generated by everything but Condor).

BackgroundLoad
How much background load we're willing to have on our machine and still start a Condor job.

HighLoad
If the $(NonCondorLoadAvg) goes over this, the CPU is ``busy'' and we want to start evicting the Condor job.

StartIdleTime
How long the keyboard has to be idle before we'll start a job.

ContinueIdleTime
How long the keyboard has to be idle before we'll resume a suspended job.

MaxSuspendTime
How long we're willing to let the job be suspended before we move on to more drastic measures.

MaxVacateTime
How long we're willing to let the job be checkpointing before we give up on it and have to kill it outright.

KeyboardBusy
A boolean string that evaluates to true when the keyboard is being used.

CPU_Idle
A boolean string that evaluates to true when the CPU is idle.

CPU_Busy
A boolean string that evaluates to true when the CPU is busy.

MachineBusy
The CPU or the Keyboard is busy.

##  These macros are here to help write legible expressions:
MINUTE          = 60
HOUR            = (60 * $(MINUTE))
StateTimer      = (CurrentTime - EnteredCurrentState)
ActivityTimer   = (CurrentTime - EnteredCurrentActivity)
ActivationTimer = (CurrentTime - JobStart)
LastCkpt        = (CurrentTime - LastPeriodicCheckpoint)

NonCondorLoadAvg        = (LoadAvg - CondorLoadAvg)
BackgroundLoad          = 0.3
HighLoad                = 0.5
StartIdleTime           = 15 * $(MINUTE)
ContinueIdleTime        = 5 * $(MINUTE)
MaxSuspendTime          = 10 * $(MINUTE)
MaxVacateTime           = 5 * $(MINUTE)

KeyboardBusy            = KeyboardIdle < $(MINUTE)
CPU_Idle                = $(NonCondorLoadAvg) <= $(BackgroundLoad)
CPU_Busy                = $(NonCondorLoadAvg) >= $(HighLoad)
MachineBusy             = ($(CPU_Busy) || $(KeyboardBusy))
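To make it easier to see what an expression such as $(MachineBusy) finally looks like, here is a minimal Python sketch that expands $(NAME) references by plain text substitution. Treating expansion as repeated textual replacement is an assumption made for illustration; the MACROS table and the expand function are hypothetical and are not part of Condor.

```python
import re

# A few of the macros from the listing above, stored as raw text.
MACROS = {
    "MINUTE": "60",
    "NonCondorLoadAvg": "(LoadAvg - CondorLoadAvg)",
    "BackgroundLoad": "0.3",
    "HighLoad": "0.5",
    "KeyboardBusy": "KeyboardIdle < $(MINUTE)",
    "CPU_Busy": "$(NonCondorLoadAvg) >= $(HighLoad)",
    "MachineBusy": "($(CPU_Busy) || $(KeyboardBusy))",
}

_REFERENCE = re.compile(r"\$\((\w+)\)")


def expand(text, macros=MACROS):
    """Replace $(NAME) references until none are left.

    Unknown names raise KeyError; circular definitions are not detected.
    """
    while True:
        expanded = _REFERENCE.sub(lambda m: macros[m.group(1)], text)
        if expanded == text:
            return expanded
        text = expanded


print(expand("$(MachineBusy)"))
# prints: ((LoadAvg - CondorLoadAvg) >= 0.5 || KeyboardIdle < 60)
```

Only LoadAvg, CondorLoadAvg and KeyboardIdle remain after expansion; these are ClassAd attributes that are looked up when the expression is evaluated.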

Next, we specify that we always want to suspend jobs first. If suspending is not enough, we try to vacate them gracefully, unless they have been running for less than 10 minutes, in which case we just kill them instead of trying to checkpoint those 10 minutes of work.

WANT_SUSPEND            : True
WANT_VACATE             : $(ActivationTimer) > 10 * $(MINUTE)

Finally, we define the actual expressions. Start any job if the CPU is idle (as defined by our macro), and the keyboard has been idle long enough.

START           : $(CPU_Idle) && KeyboardIdle > $(StartIdleTime)

Suspend a job if the machine is busy.

SUSPEND         : $(MachineBusy)

Continue a suspended job if the CPU is idle and the Keyboard has been idle for long enough.

CONTINUE        : $(CPU_Idle) && KeyboardIdle > $(ContinueIdleTime)
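The START, SUSPEND and CONTINUE expressions above can be tried out against sample numbers with a short Python sketch. The function names and arguments are hypothetical; the thresholds are copied from the default macros, with keyboard_idle given in seconds.

```python
MINUTE = 60
BACKGROUND_LOAD = 0.3
HIGH_LOAD = 0.5
START_IDLE_TIME = 15 * MINUTE
CONTINUE_IDLE_TIME = 5 * MINUTE


def cpu_idle(load_avg, condor_load_avg):
    """Mirror of $(CPU_Idle): non-Condor load is at or below BackgroundLoad."""
    return (load_avg - condor_load_avg) <= BACKGROUND_LOAD


def cpu_busy(load_avg, condor_load_avg):
    """Mirror of $(CPU_Busy): non-Condor load is at or above HighLoad."""
    return (load_avg - condor_load_avg) >= HIGH_LOAD


def start(load_avg, condor_load_avg, keyboard_idle):
    return cpu_idle(load_avg, condor_load_avg) and keyboard_idle > START_IDLE_TIME


def suspend(load_avg, condor_load_avg, keyboard_idle):
    keyboard_busy = keyboard_idle < MINUTE
    return cpu_busy(load_avg, condor_load_avg) or keyboard_busy


def cont(load_avg, condor_load_avg, keyboard_idle):
    return cpu_idle(load_avg, condor_load_avg) and keyboard_idle > CONTINUE_IDLE_TIME
```

With a non-Condor load of about 0.4 (between BackgroundLoad and HighLoad) and an idle keyboard, both suspend and cont are false, so a running job keeps running and a suspended job stays suspended. The gap between the two thresholds keeps the machine from flipping back and forth.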

There are two conditions that we want to preempt under. First, if we have suspended the job, but it's been suspended too long. Second, if we don't even want to suspend the job, and the machine is busy.

PREEMPT         : ( ($(ActivityTimer) > $(MaxSuspendTime)) && \
                   (Activity == "Suspended") ) || \
                  ( $(MachineBusy) && (WANT_SUSPEND == False) )

Kill a job if we've been vacating for too long.

KILL            : $(ActivityTimer) > $(MaxVacateTime)
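The PREEMPT and KILL expressions can be read as the following Python sketch. The function names and arguments are hypothetical; activity_timer is the number of seconds spent in the current activity, as $(ActivityTimer) is defined above.

```python
MINUTE = 60
MAX_SUSPEND_TIME = 10 * MINUTE
MAX_VACATE_TIME = 5 * MINUTE


def preempt(activity, activity_timer, machine_busy, want_suspend):
    """Mirror of the default PREEMPT expression."""
    suspended_too_long = (activity_timer > MAX_SUSPEND_TIME
                          and activity == "Suspended")
    busy_without_suspend = machine_busy and not want_suspend
    return suspended_too_long or busy_without_suspend


def kill(activity_timer):
    """Mirror of the default KILL expression (evaluated while Vacating)."""
    return activity_timer > MAX_VACATE_TIME
```

With the default WANT_SUSPEND of True, only the first clause can fire, so a job is preempted after it has been suspended for more than 10 minutes and killed if vacating takes more than 5 minutes.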

Finally, we specify that we want periodic checkpointing. Jobs smaller than 60 megs are periodically checkpointed every 6 hours. Larger jobs are only checkpointed every 12 hours.

PERIODIC_CHECKPOINT     : ( (ImageSize < 60000) && \
                            ($(LastCkpt) > (6 * $(HOUR))) ) || \
                          ( $(LastCkpt) > (12 * $(HOUR)) )
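As a quick check of this expression, here is a small Python sketch of the same test. The function name is hypothetical; image_size_kb is in kilobytes and last_ckpt is the number of seconds since the last periodic checkpoint.

```python
HOUR = 60 * 60


def periodic_checkpoint(image_size_kb, last_ckpt):
    """Mirror of the default PERIODIC_CHECKPOINT expression."""
    small_and_due = image_size_kb < 60000 and last_ckpt > 6 * HOUR
    anyone_due = last_ckpt > 12 * HOUR
    return small_and_due or anyone_due
```

A 20000 KB job becomes due once more than 6 hours have passed, while a 100000 KB job is only due after more than 12 hours.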

For clarity and reference, the entire set of policy settings is included once more without comments:

##  These macros are here to help write legible expressions:
MINUTE          = 60
HOUR            = (60 * $(MINUTE))
StateTimer      = (CurrentTime - EnteredCurrentState)
ActivityTimer   = (CurrentTime - EnteredCurrentActivity)
ActivationTimer = (CurrentTime - JobStart)
LastCkpt        = (CurrentTime - LastPeriodicCheckpoint)

NonCondorLoadAvg        = (LoadAvg - CondorLoadAvg)
BackgroundLoad          = 0.3
HighLoad                = 0.5
StartIdleTime           = 15 * $(MINUTE)
ContinueIdleTime        = 5 * $(MINUTE)
MaxSuspendTime          = 10 * $(MINUTE)
MaxVacateTime           = 5 * $(MINUTE)

KeyboardBusy            = KeyboardIdle < $(MINUTE)
CPU_Idle                = $(NonCondorLoadAvg) <= $(BackgroundLoad)
CPU_Busy                = $(NonCondorLoadAvg) >= $(HighLoad)
MachineBusy             = ($(CPU_Busy) || $(KeyboardBusy))

WANT_SUSPEND            : True
WANT_VACATE             : $(ActivationTimer) > 10 * $(MINUTE)

START           : $(CPU_Idle) && KeyboardIdle > $(StartIdleTime)
SUSPEND         : $(MachineBusy)
CONTINUE        : $(CPU_Idle) && KeyboardIdle > $(ContinueIdleTime)
PREEMPT         : ( ($(ActivityTimer) > $(MaxSuspendTime)) && \
                   (Activity == "Suspended") ) || \
                  ( $(MachineBusy) && (WANT_SUSPEND == False) )
KILL            : $(ActivityTimer) > $(MaxVacateTime)

PERIODIC_CHECKPOINT     : ( (ImageSize < 60000) && \
                            ($(LastCkpt) > (6 * $(HOUR))) ) || \
                          ( $(LastCkpt) > (12 * $(HOUR)) )

  
3.6.9.2 UW-Madison CS Condor Pool Policy Settings

Due to a recent increase in the number of Condor users and the size of their jobs (many users here are submitting jobs with an ImageSize of over 100 megs!), we have had to customize our policy to try to handle this range of ImageSize better.

Basically, whether we suspend or vacate jobs is now a function of the ImageSize of the job that's currently running (ImageSize is given in kilobytes). We have divided the ImageSize into three possible categories, which we define with macros.

BigJob          = (ImageSize > (30 * 1024))
MediumJob       = (ImageSize <= (30 * 1024) && ImageSize >= (10 * 1024))
SmallJob        = (ImageSize < (10 * 1024))

Our policy can be summed up with the following few sentences: If the job is ``small'', it goes through the normal progression of suspend to vacate to kill based on the tried and true times. If the job is ``medium'', when the user comes back, we start vacating the job right away. The idea is that if we checkpoint immediately, all our pages are still in memory, checkpointing will be fast, and we'll free up memory pages as soon as we checkpoint. If we suspend, our pages will start getting swapped out and when we finally want to checkpoint (10 minutes later), we'll have to start swapping out the user's pages again, they'll see reduced performance, and checkpointing will take much longer. If the job is ``big'', don't even bother checkpointing, since we won't finish before the owner gets too upset and we might as well not even bother putting the wasted load on the network and checkpoint server.

All the logic for our pool's special policy is tuned with the WANT_ expressions. All of the other expressions and macros just use the defaults. We only want to suspend jobs if they are ``small'', and we only want to vacate jobs that are ``small'' or ``medium''. We still want to always suspend Vanilla jobs, regardless of their size.

WANT_SUSPEND            : $(SmallJob)
WANT_VACATE             : $(MediumJob) || $(SmallJob)
WANT_SUSPEND_VANILLA    : True
WANT_VACATE_VANILLA     : True
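The size categories and the two WANT_ expressions can be sketched in Python as follows. The function names and the vanilla flag are hypothetical; the flag stands in for the separate WANT_SUSPEND_VANILLA and WANT_VACATE_VANILLA settings.

```python
def size_class(image_size_kb):
    """Classify a job by ImageSize (kilobytes), mirroring the three macros."""
    if image_size_kb > 30 * 1024:
        return "big"
    if image_size_kb >= 10 * 1024:
        return "medium"
    return "small"


def want_suspend(image_size_kb, vanilla=False):
    # Vanilla jobs are always suspended; other jobs only if small.
    return vanilla or size_class(image_size_kb) == "small"


def want_vacate(image_size_kb, vanilla=False):
    # Vanilla jobs always vacate; other jobs only if small or medium.
    return vanilla or size_class(image_size_kb) in ("small", "medium")
```

For a 20 MB job (20 * 1024 KB), want_suspend is false and want_vacate is true, so the job goes straight to vacating. For a 50 MB job both are false, so it is killed without a checkpoint.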

Now, we define the actual expressions (for which we just use the defaults). We really do this with macros and simply define the expressions with the macros later on. This may seem really strange, but we do it because it makes it easier to do special customized settings (for example, for testing purposes) and still reference the defaults. There will be a brief example of this at the end of this section.

CS_START        = $(CPU_Idle) && KeyboardIdle > $(StartIdleTime)
CS_SUSPEND      = $(MachineBusy)
CS_CONTINUE     = (KeyboardIdle > $(ContinueIdleTime)) && $(CPU_Idle)
CS_PREEMPT      = ( ($(ActivityTimer) > $(MaxSuspendTime)) && \
                   (Activity == "Suspended") ) || \
                  ( $(MachineBusy) && (WANT_SUSPEND == False) )
CS_KILL         = ($(ActivityTimer) > $(MaxVacateTime))

Here's where we actually define the expressions in terms of our special macros:

START       : $(CS_START)
SUSPEND     : $(CS_SUSPEND)
CONTINUE    : $(CS_CONTINUE)
PREEMPT     : $(CS_PREEMPT)
KILL        : $(CS_KILL)

We still don't want to define separate Vanilla versions of any of these, since we already have a different WANT_SUSPEND for vanilla jobs and all of the policy expressions are just written in terms of that.

Periodic checkpointing also takes image size into account. Since we kill large jobs right away at eviction time, we want to periodically checkpoint them more frequently (every 3 hours), since that's the only way they make forward progress. However, with all those large periodic checkpoints going on so frequently, we don't want to bog down our network or our checkpoint servers. So, we only periodically checkpoint small or medium jobs every 12 hours, since they get the privilege of checkpointing at eviction time.

PERIODIC_CHECKPOINT  : (($(LastCkpt) > (3 * $(HOUR))) \
      && $(BigJob)) || (($(LastCkpt) > (12 * $(HOUR))) && \
      ($(SmallJob) || $(MediumJob)))
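Read as Python, this expression looks roughly like the sketch below. The function name is hypothetical; image_size_kb is in kilobytes and last_ckpt is in seconds.

```python
HOUR = 60 * 60


def cs_periodic_checkpoint(image_size_kb, last_ckpt):
    """Mirror of the PERIODIC_CHECKPOINT expression for our pool."""
    big = image_size_kb > 30 * 1024
    small_or_medium = image_size_kb <= 30 * 1024
    return ((last_ckpt > 3 * HOUR and big)
            or (last_ckpt > 12 * HOUR and small_or_medium))
```

A 100 MB job is due after more than 3 hours, while a 5 MB job waits more than 12 hours.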

For clarity and reference, the entire set of policy settings is included once more, without comments:

MINUTE          = 60
HOUR            = (60 * $(MINUTE))
ActivationTimer = (CurrentTime - JobStart)
StateTimer      = (CurrentTime - EnteredCurrentState)
ActivityTimer   = (CurrentTime - EnteredCurrentActivity)
LastCkpt        = (CurrentTime - LastPeriodicCheckpoint)

NonCondorLoadAvg   = (LoadAvg - CondorLoadAvg)
BackgroundLoad     = 0.3
HighLoad           = 0.5
StartIdleTime      = 15 * $(MINUTE)
ContinueIdleTime   = 5 * $(MINUTE)
MaxSuspendTime     = 10 * $(MINUTE)
MaxVacateTime      = 5 * $(MINUTE)

KeyboardBusy       = KeyboardIdle < $(MINUTE)
CPU_Idle           = $(NonCondorLoadAvg) <= $(BackgroundLoad)
CPU_Busy           = $(NonCondorLoadAvg) >= $(HighLoad)
MachineBusy        = ($(CPU_Busy) || $(KeyboardBusy))

BigJob       = (ImageSize > (30 * 1024))
MediumJob    = (ImageSize <= (30 * 1024) && ImageSize >= (10 * 1024))
SmallJob     = (ImageSize < (10 * 1024))

WANT_SUSPEND            : $(SmallJob)
WANT_VACATE             : $(MediumJob) || $(SmallJob)
WANT_SUSPEND_VANILLA    : True
WANT_VACATE_VANILLA     : True

CS_START    = $(CPU_Idle) && KeyboardIdle > $(StartIdleTime)
CS_SUSPEND  = $(MachineBusy)
CS_CONTINUE = (KeyboardIdle > $(ContinueIdleTime)) && $(CPU_Idle)
CS_PREEMPT  = ( ($(ActivityTimer) > $(MaxSuspendTime)) && \
               (Activity == "Suspended") ) || \
              ( $(MachineBusy) && (WANT_SUSPEND == False) )
CS_KILL     = ($(ActivityTimer) > $(MaxVacateTime))

START       : $(CS_START)
SUSPEND     : $(CS_SUSPEND)
CONTINUE    : $(CS_CONTINUE)
PREEMPT     : $(CS_PREEMPT)
KILL        : $(CS_KILL)

PERIODIC_CHECKPOINT  : (($(LastCkpt) > (3 * $(HOUR))) \
      && $(BigJob)) || (($(LastCkpt) > (12 * $(HOUR))) && \
      ($(SmallJob) || $(MediumJob)))

As a final example, we show how our default macros can be used to set up a given machine for testing. Suppose we want the machine to behave just like normal, but if user ``coltrane'' submits a job, we want that job to start regardless of what's happening on the machine, and we don't want the job suspended, vacated or killed. For example, we might know ``coltrane'' is just going to be submitting very short running programs to test something and he wants to see them execute right away. Anyway, we could configure any machine (or our whole pool, for that matter) with the following five expressions:

        START      : ($(CS_START)) || Owner == "coltrane"
        SUSPEND    : ($(CS_SUSPEND)) && Owner != "coltrane"
        CONTINUE   : $(CS_CONTINUE)
        PREEMPT    : ($(CS_PREEMPT)) && Owner != "coltrane"
        KILL       : $(CS_KILL)

Notice that you don't have to do anything special with either the CONTINUE or KILL expressions. If Coltrane's jobs never suspend, they'll never even look at CONTINUE. Similarly, if they never preempt, they'll never look at KILL.
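The effect of the ``coltrane'' override on START, SUSPEND and PREEMPT can be checked with a short Python sketch. The function names are hypothetical; the first argument of each stands for the value the corresponding CS_ macro would have, and owner stands for the Owner attribute of the job.

```python
def start(cs_start, owner):
    # coltrane's jobs start no matter what the default policy says.
    return cs_start or owner == "coltrane"


def suspend(cs_suspend, owner):
    # coltrane's jobs are never suspended.
    return cs_suspend and owner != "coltrane"


def preempt(cs_preempt, owner):
    # coltrane's jobs are never preempted, so KILL is never reached.
    return cs_preempt and owner != "coltrane"
```

For any other user, all three functions simply return the default value unchanged.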

  
3.6.10 Differences from the Version 6.0 Policy Settings

This section describes how the policy expressions just laid out differ from the policy expressions in previous versions of Condor. If you've never used Condor version 6.0 or earlier, or never looked closely at the policy settings, you can probably skip this section.

In summary, there is no longer a VACATE expression, and the KILL expression is not evaluated while a machine is claimed. There is only a PREEMPT expression which describes the conditions when a machine will move from the Claimed state to the Preempting state. Once a machine has decided to go into the Preempting state, the WANT_VACATE expression controls whether or not the job should be vacated with a checkpoint or directly killed. The KILL expression only determines when the machine goes from Preempting/Vacating to Preempting/Killing.

Before, the KILL expression had to handle three distinct cases (the transitions from Claimed/Busy, Claimed/Suspended and Preempting/Vacating) and the VACATE expression had to handle two (the transitions from Claimed/Busy and Claimed/Suspended). Now, PREEMPT has to handle the same two cases as the previous VACATE expression, but the KILL expression only handles one. In fact, very complicated policies can now be specified using all of the default expressions, and only tuning the WANT_VACATE and WANT_SUSPEND expressions. In previous versions, if you made heavy use of the WANT_* expressions, the KILL expression could become incredibly complicated.


condor-admin@cs.wisc.edu