Introduction to HTC and Condor for (CE) Administrators


INFN online training - March 15th, 2022.

Francesco Prelz, francesco.prelz@mi.infn.it

⇔ Program

  1. High Throughput vs High Performance - why ?
  2. Role, structure and interplay of the (HT)Condor daemons.
    Or (1): where are policies configured and enforced?
    Or (2): what are the differences among the three VMs we just set up?
  3. The job life-cycle. "Where" the heck "is" my (or her/his) job?
  4. Log files, debug levels and other information sources.
  5. Brief hands-on session.
There are many important things we do not cover here.

⇔ High Performance or High Throughput?

[Figures: three slides contrasting high-throughput and high-performance computing - HTC is measured in floating-point operations per year (FLOPY), HPC in floating-point operations per second (FLOPS).]

⇔ Condor "philosophy"

Condor philosophy in one sentence (Greg Thain's):

To reliably run as many jobs as possible
on as many machines as possible, subject to all constraints.

(in order of precedence - reliability is first! - "Jobs are like money in the bank"®)

  • Each daemon takes on a crisply defined responsibility in realising this first principle.
Let's first agree on some generic batch system terminology:
  • Job execution requests are described according to some convention, collected at submit nodes, and queued somewhere.
  • Jobs are executed on worker nodes (sometimes called workers, or execution nodes), where all needed dependencies and environment are hopefully satisfied.
  • Batch systems typically organise management functions and processes on head nodes - mostly distinct from workers.
  • Our three VMs wish to represent these three node categories.
  • We cannot and should not forget the human factor - we cannot call them names. This is where all the constraints come from.

Reliably run...

...as many jobs...

...on as many machines as possible.

The policy of worker (or execution) nodes is represented by the condor_startd:
  • The startd doesn't start jobs - rather, it manages the machine and creates “slots”, places for jobs to run, then starts an actual condor_starter.
  • In case of conflict with the job policy, the machine wins. The resource owner is the boss and the job is a guest.
  • The startd is near-sighted: it sees only the machine and the running and candidate jobs, and knows nothing of the rest of the system.
  • The startd can select, pre-empt and limit jobs, and choose how to describe the local machine to the world (see the configuration sketch below).
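To make this concrete, here is a minimal sketch of a worker-node policy configuration. The knob names (START, RANK, PREEMPT, STARTD_ATTRS) are standard, but every expression, the user name and the HasSpecialSoftware attribute are made up for illustration:

# Hypothetical startd policy - values are illustrative only.
START = (TARGET.RequestMemory <= 4096)       # only accept jobs asking for <= 4 GB
RANK = (TARGET.Owner == "alice")             # prefer a (made-up) local user's jobs
PREEMPT = (TotalJobRunTime > 48 * 3600)      # evict jobs after 48 hours of runtime
HasSpecialSoftware = True                    # a made-up machine attribute...
STARTD_ATTRS = $(STARTD_ATTRS), HasSpecialSoftware   # ...advertised to the pool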
We can now name and locate the various daemons more precisely.
  • One condor_starter per slot/per job manages the running jobs:
    • Creates environment.
    • Monitors and reports job usage.
    • Properly (and "philosophically") cleans up after use (and the startd cleans up after the starter).
    • Handles file transfer (preferred over an unmanaged, undeclared shared FS), communicating directly with its partner on the submit side, the condor_shadow (the relevant submit-file declarations are sketched after this list).
    • Many of these functions can fail.
  • Two more daemons left to describe: the condor_collector and condor_negotiator.
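Before moving on: the file transfer handled by the starter/shadow pair is declared in the submit file. A minimal sketch, with made-up file names:

# Submit-file declarations driving starter/shadow file transfer
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = data.in, params.json   # hypothetical input files
transfer_output_files = result.out            # hypothetical output file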

⇔ The "Central Manager" (1)

I.e.: Condor Services typically running on the head nodes.
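The collector can be queried for the ads of any daemon type; a few illustrative condor_status invocations (output omitted):

$ condor_status                        # slot (startd) ads - the default
$ condor_status -schedd                # schedd ads
$ condor_status -negotiator            # the negotiator ad
$ condor_status -any -af MyType Name   # every ad the collector knows about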

⇔ The "Central Manager" (2)

⇔ The spinning pie

  1. Get all slots in the pool (via condor_status, possibly selecting them with NEGOTIATOR_SLOT_CONSTRAINT).
  2. Get all submitters with pending requests (condor_status -submitters).
  3. Compute the number of slots submitters should get
    • Based on historical usage (condor_userprio -all).
    • With corrections: effective_priority = real_priority * (configurable) priority_factor.
    • Priority smoothed by PRIORITY_HALFLIFE, defaulting to 86400 (seconds - 24 hours).
  4. Hand out slots to submitters in ascending effective priority order (a lower priority value means higher priority in Condor).
    • When more matching slots than needed are found, they are ordered by RANK.
  5. As slots are allocated before matches are checked, there may be leftover slots to assign. Repeat as needed.
  6. The matching ('claimed') startd and schedd will begin handling the request - starting the starter and shadow. Many things can still fail at this level...
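The quantities involved in this cycle can be inspected and tuned with condor_userprio; a hedged sketch (the user name is made up):

$ condor_userprio -all                                # usage, real and effective priorities
$ condor_userprio -setfactor alice@mi.infn.it 100.0   # change a submitter's priority factor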

⇔ Daemon configuration

Life cycle of an HTCondor job (again, courtesy of Greg Thain):

Where it all starts: a submit file. Submit files are not ClassAds.

universe = vanilla
executable = /path/to/my/computation
request_memory = 70M
arguments = $(ProcId)
should_transfer_files = yes
output = out.$(ProcId)
error = error.$(ProcId)
log = /path/to/user.log
+IsVerySpecialJob = true
queue
condor_submit turns it into a job ClassAd:

JobUniverse = 5
Cmd = "computation"
Args = "0"
RequestMemory = 70000000
Requirements = OpSys == "Linux...
DiskUsage = 0
Output = "out.0"
Error = "error.0"
UserLog = "/path/to/user.log"
IsVerySpecialJob = true

In both condor_q and condor_status, the -af ('autoformat') option is very practical to list specific attributes (-af:l shows the attribute names, -af:th formats a nice table):

$ condor_status -af Machine DetectedMemory Disk
baldassarre.fisica.unimi.it 128724 171330201
baldassarre.fisica.unimi.it 128724 343348
bell.heisenberg.pcteor1.mi.infn.it 32132 838312772
bethe.heisenberg.pcteor1.mi.infn.it 24147 437415148
bloch.heisenberg.pcteor1.mi.infn.it 64513 884268984
(...)
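The modifiers can be combined; for instance (same query as above, output omitted):

$ condor_status -af:l Machine DetectedMemory Disk    # one 'Attr = value' line per attribute
$ condor_status -af:th Machine DetectedMemory Disk   # tab-separated columns with headings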

But the '-constraint' flag is also interesting, as it can select ClassAds based on their contents:

$ condor_status -af Machine DetectedMemory Disk \
    -constraint 'Clustername=="magi-pool" && Disk > 1000000'
baldassarre.fisica.unimi.it 128724 171330138
melchiorre.fisica.unimi.it 128724 166939416

The same syntax can be used to build a 'Requirements' expression to select resources to match jobs.
Once again, the full reference for the ClassAd language is found in the HTCondor manual.
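For instance, a sketch of the selection above recast as a job requirement in a submit file (Clustername is an attribute specific to this pool, and the threshold is illustrative):

# Submit-file requirement using the same ClassAd syntax as -constraint above
requirements = (Clustername == "magi-pool") && (TARGET.Disk > 1000000)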

"Old" condor_q format (admin-friendly: can be made default by setting CONDOR_Q_ONLY_MY_JOBS and CONDOR_Q_DASH_BATCH_IS_DEFAULT in the config file):

$ condor_q [-nobatch -allusers]

-- Schedd: orsone.mi.infn.it : <192.84.138.153:9618?... @ 01/30/19 11:50:49
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6376.0   prelz   1/30 10:52  0+00:00:00 I  0    0.0 echo Hello, world!

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp
Total for prelz: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, ...

"New" format:

$ condor_q

-- Schedd: orsone.mi.infn.it : <192.84.138.153:9618?... @ 01/30/19 10:52:33
OWNER  BATCH_NAME      SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
prelz  CMD: /bin/echo  1/30 10:52    _      _      1      1 6376.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp
Total for prelz: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, ...

Why is my (or very-important-user X's) job not starting?

Let's start with the user-level tools: check the UserLog, then check the Requirements expression:

$ condor_q -better-analyze[:reverse] 6385

-- Schedd: orsone.mi.infn.it : <192.84.138.153:9618?...
The Requirements expression for job 6385.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    ((TARGET.HasFileTransfer) || (TARGET.FileSystemDomain == MY.FileSystemDomain))

Job 6385.000 defines the following attributes:
(... snip ...)

The Requirements expression for job 6385.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          74  TARGET.Arch == "X86_64"
[7]         101  TARGET.HasFileTransfer

6385.000: Job has not yet been considered by the matchmaker.

6385.000: Run analysis summary ignoring user priority. Of 101 machines,
     27 are rejected by your job's requirements
      8 reject your job because of their own requirements
      0 match and are already running your jobs
      9 match but are serving other users
     57 are available to run your job

Why does a job end up in a 'Held' state?

$ condor_q

-- Schedd: gaspare.fisica.unimi.it : <159.149.47.93:9618?... @ 01/30/19 12:22:25
 ID          OWNER           SUBMITTED     RUN_TIME ST PRI SIZE CMD
115318.0   mino.cancelli   11/20 15:35  6+18:37:59 H  0    1.0 add_user_mpiexe
(... etc. etc. ...)

Just find the reason in the job ClassAd (-l and -af work just as for machine ClassAds):

$ condor_q 115318 -af HoldReason
Failed to initialize user log to /home/mino.cancelli/run_condor/run_whatever.log

Jobs are put on hold when they are prevented from making progress by a condition that may be temporary or recoverable, but that Condor cannot fix by itself.
When the problem is fixed, jobs can be released:
$ condor_release 115318
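Releases can also be automated from the submit file; a hedged sketch with made-up thresholds:

# Retry a held job every 10 minutes, up to 5 starts
periodic_release = (NumJobStarts < 5) && ((time() - EnteredCurrentStatus) > 600)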

When a user-level approach fails...

How to tweak the log verbosity.
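As a minimal sketch: each daemon has a <SUBSYS>_DEBUG configuration knob selecting the categories of messages written to its log (the values below are illustrative; the same pattern works for STARTD_DEBUG, NEGOTIATOR_DEBUG, etc.):

SCHEDD_DEBUG = D_FULLDEBUG D_COMMAND   # much chattier SchedLog, with command tracing
MAX_SCHEDD_LOG = 100000000             # let the log grow to ~100 MB before rotation

Remember that a condor_reconfig is needed for the running daemons to pick up the change.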

⇔ Reference material

Available (HT)Condor reference:

⇔ Let's try some of this ourselves (1).

⇔ Let's try some of this ourselves (2)...

⇔ Thank you!

Your contact e-mail for issues:
htcondor-support@lists.infn.it