INFN online training - March 15^th, 2022.

Francesco Prelz, francesco.prelz@mi.infn.it

High Throughput vs High Performance - why ?
Role, structure and interplay of the (HT)Condor daemons.
Or (1): where are policies configured and enforced?.
Or (2): what are the differences among the three VMs we just set up?
The job life-cycle. "Where" the heck "is" my (or her/his) job ?
Log files, debug levels and other information sources.
Brief hands-on session.

There's many important things we do not cover here, e.g:

The installation from scratch of a Condor pool - because we just did it with the assistance of http://get.htcondor.org.
The (complex - and now based mostly on JSON Web Tokens!) authentication/authorisation model (detailed in the Condor manual here, and covered in another talk by David Rebatto).

As usual, the best computing equipment is the one tailored to the application at hand.
If it weren't for the price tag, anyone would ask for as many execution cores as possible, all available in parallel, now.
Indeed, institutions that try to elbow their way through the Top 500 list cannot be too concerned about the price tag.
When "now" and "as many as possible" aren't a real application requirement (as in the many "pleasantly parallel" cases we know well), any available computing equipment, even if slow, heterogeneous and/or poorly connected can contribute some amount of progress to the overall effort. As long as there are tools to harness the workload in that way...
A few pictures are better than a thousand words!

high-throughput /	high-performance

high-throughput /	high-performance

high-throughput /	high-performance
FLOPY	FLOPS

Condor philosophy in one sentence (Greg Thain's):

To reliably run as many jobs as possible
on as many machines as possible, subject to all constraints.

(in order of precedence - reliability is first! - "Jobs are like money in the bank"^®)

All daemons take a crisply defined responsibility in realising this first principle.

Let's first agree on some generic batch system terminology:

Job execution requests are described according to some convention and collected at submit nodes and queued somewhere.
Jobs are executed on worker nodes (sometimes called workers, or execution nodes), where all needed dependencies and environment are hopefully satisfied.
Batch systems typically organise management functions and processes on head nodes - mostly distinct from workers.
Our three VMs wish to represent these three node categories.
We cannot and should not forget the human factor - we cannot call them names. This is where all the constraints come from.

Every responsibility in the system is handled by a daemon, typically a UNIX process (unless you are on Windows).
Each process knows how to exit nicely, handle and report failures.
No clutter: everything is cleaned up - in case of success or failure.
Resource usage is measured, reported and limited.
This requires some management by a parent (or guardian, or baby-sitter) outside the process. The condor_master ("systemctl start condor" == run condor_master):
- A small process - runs on every machine where Condor does anything.
- Fork-execs (and restarts, and stops) all needed (configured) processes as children.
- Checks that the kids stay healthy: makes sure they are killed when hung.
- Tunes the Linux kernel if needed.
- Exits when the disk is full.
- Handles admin commands such as condor_on, condor_off, condor_reconfig.
- Distributes SIGHUP, terminates children when killed.

Requires some scheduling (that can occur locally on a submit node).
Jobs are sent to the condor_schedd:
- A reliable, slow job database.
- Makes sure that jobs are restarted after a crash and no jobs can fall through a bottomless crack.
- The schedd doesn't really schedule: has jobs, can request machines - but only uses the machines given to it.
Scalability to handle many jobs is achieved by multiplying and distributing the schedds.
The schedd can be near the user - can even be disconnected while the jobs keep running - “submit locally, run globally”.

Machines will necessarily be heterogeneous.
Could be on foreign Condor pools.
Could be on the same pool but with different config.
Could be in places without shared filesystem or other dependencies or parts of the environment.
The configuration of the submit side (represented by the schedd) and of the execution side has to be different, as the responsibilities and the best interest of both sides are different.
- All policies are expressed in the same language - Classified ads. We don't have time to enter into too many details here, but ClassAds are fundamentally collections of named expressions, that can reference each other via a small set of 'total' operators and predefined functions. All details are found in the manual here.

The policy of worker (or execution) nodes is represented by the condor_startd:

The startd doesn't start jobs - it rather manages the machine and creates “slots”, places for jobs to run, then starts an actual condor_starter. ›
In case of conflict with the job policy the machine wins. The resource owner is the boss and the job is a guest.
The startd is near-sighted: it just sees the machine and the running and candidate jobs - knows nothing of the rest of the system.
The startd can select, pre-empt, limit jobs, and choose how to describe the local machine to the world.

We can now name and locate the various daemons more precisely.

One condor_starter per slot/per job manages the running jobs:
- Creates environment.
- Monitors and reports job usage.
- Properly (and "philosophically") cleans up after use (and the startd cleans up after the starter).
- Handles file transfer (preferred over unmanaged, undeclared shared FS), communicating directly with its partner on the submit side, the condor_shadow.
- Many of these functions can fail.
Two more daemons left to describe: the condor_collector and condor_negotiator.

I.e.: Condor Services typically running on the head nodes.

The condor_collector is Condor's central database - normally there's just one - see
condor_config_val COLLECTOR_HOST.
A lightweight service, all in memory, with no permanent storage.
Every service reports (frequently) to the collector in classad format.
If the daemon is missing, some activity stops, but jobs don't fail.
All contents of the collector can be queried with the condor_status command, just as all contents of the schedd can be queried with condor_q.

The condor_negotiator is where most of the scheduling really happens.
A slow process, constantly cycling through user requests and allocating machines to jobs.
If the daemon is missing, some activity stops, but jobs don't fail.
Rich, complex semantics - could talk for hours about it.
Some relevant policy options are realised here (pre-emption vs. no pre-emption i.e. fairness vs. throughput, accounting groups and quotas, concurrency limits):
"very powerful, often ignored".
Enabling debugging output from negotiator gives a cornucopia of information on what's going on.

The matchmaker in the negotiator tries to achieve long-term fairness in the allocation of resource shares to competing users.
Available compute resources are “The Pie”
Users, with their relative priorities, are each trying to get their “Pie Slice” (as long as job and machine requirements/"preferences" are satisfied)
First, the Matchmaker takes some jobs from each user and finds resources for them.
After all users have got their initial “Pie Slice”, if there are still more jobs and resources, the matchmaker continues “spinning the pie” and handing out resources until everything is matched.
A user who didn't submit many jobs recently will get larger pie slices for some time.
HTCondor tracks usage and has a formula for determining priority based on both current demand and prior usage - prior usage exponentially "decays” over time, with configurable decay factors.

Get all slots in the pool (via condor_status, possibly selecting them with NEGOTIATOR_SLOT_CONSTRAINT).
Get all submitters with pending requests (condor_status -submitters).
Compute the number of slots submitters should get
- Based on historical usage (condor_userprio -all).
- With corrections: effective_priority = real_priority * (configurable) priority_factor.
- Priority smoothed by PRIORITY_HALFLIFE, defaulting to 86400 (seconds - 24 hours).
Hand out slots to submitters in ascending effective priority order (a lower priority value means higher priority in Condor).
- When more matching slots than needed are found, they are ordered by RANK.
As the slot allocation is performed before checking for matches, there will be leftover slots to assign. Repeat as needed.
The matching ('claimed') startd and schedd will begin handling the request - starting the starter and shadow. Many things can still fail at this level...

We cannot really substitute the Condor manual when it comes to the details of a 2000+ knob configuration...
Your best friend is condor_config_val. It tells you the location of config files:
$ condor_config_val -config Configuration source: /etc/condor/condor_config Local configuration source: /etc/condor/condor_config.local (...)
It tells you where individual attributes are defined:
$ condor_config_val -v CONDOR_HOST CONDOR_HOST = cmcondor.bo.infn.it # at: /etc/condor/condor_config.local, line 4 # raw: CONDOR_HOST = cmcondor.bo.infn.it

condor_config_val can dump the entire config space:
$ condor_config_val -dump |head # Configuration from machine: orsone.mi.infn.it ABORT_ON_EXCEPTION = false ACCOUNTANT_HOST = (... etc. etc. etc. ...)
Each individual daemon holds an authoritative copy of the configuration, (this can be synchronised with condor_reconfig), but condor_config_val can also query each daemon at runtime (and even change the values on the go, if allowed) when one of these switches is used:
-master Query the master -schedd Query the schedd -startd Query the startd -collector Query the collector -negotiator Query the negotiator
The current recommendation for local config modifications is to store them in small snippets in the /etc/condor/config.d directory. Files are read in lexicographical order (the last definition of an attribute takes precedence), and the recommended filename has two leading digits, e.g. /etc/condor/config.d/05-some_example.

condor_config_val can also dump the so-called "metaknobs". Logical collections of consistent settings that are used to set up certain host or security profiles (e.g. 'Submit' node, 'Execute' node, 'CentralManager', 'Personal' Condor, Host- vs. User-based security, etc.) as well as other policies and features - see manual here.):
$ condor_config_val use ROLE:Submit use ROLE:Submit is DAEMON_LIST=$(DAEMON_LIST) SCHEDD
Caution: there are config overrides via the environment. Setting env variables named _condor_ATTRIBUTE_NAME=value (note leading underscore) takes precedence over all other configuration sources.
condor_reconfig (or sending SIGHUP to the master) causes configuration to be re-read and to become effective - with the exception of changes to a few attributes (notably DAEMON_LIST, NETWORK_INTERFACE, a few security settings, full listing here, that require a condor_restart.

universe = vanilla executable = /path/to/my/computation request_memory = 70M arguments = $(ProcID) should_transfer_input = yes output = out.$(ProcID) error = error.$(ProcId) log = /path/to/user.log +IsVerySpecialJob = true Queue

→

JobUniverse = 5 Cmd = computation Args = “0” RequestMemory = 70000000 Requirements = Opsys == “Linux... DiskUsage = 0 Output = “out.0” Error = “error.0” UserLog = “/path/to/user.log” IsVerySpecialJob = true

The file (left) fed into condor_submit becomes a ClassAd (right) stored in the condor_schedd (both in memory and on a backup "bootstrap" file), where it can be accessed via the (expensive!) condor_q command.
There's no fixed schema. Attributes prepended with '+' in the submit files go straight into the ClassAd.
ClassAd can be modified via SUBMIT_ATTRS, condor_qedit, job transforms.

In both condor_q and condor_status, the -af ('autoformat') option is very practical to list specific attributes (-af:l shows the attribute names, -af:th formats a nice table):

$ condor_status -af Machine DetectedMemory Disk baldassarre.fisica.unimi.it 128724 171330201 baldassarre.fisica.unimi.it 128724 343348 bell.heisenberg.pcteor1.mi.infn.it 32132 838312772 bethe.heisenberg.pcteor1.mi.infn.it 24147 437415148 bloch.heisenberg.pcteor1.mi.infn.it 64513 884268984 (...)

But the '-constraint' flag is also interesting, as it can select ClassAds based on their contents:

$ condor_status -af Machine DetectedMemory Disk \ -constraint 'Clustername=="magi-pool" && Disk > 1000000' baldassarre.fisica.unimi.it 128724 171330138 melchiorre.fisica.unimi.it 128724 166939416

The same syntax can be used to build a 'Requirements' expression to select resources to match jobs.
Once again, the full reference for the ClassAd language is found in the HTCondor manual, here.

"Old" condor_q format (admin-friendly: can be made default by setting CONDOR_Q_ONLY_MY_JOBS and CONDOR_Q_DASH_BATCH_IS_DEFAULT in the config file):

$ condor_q [-nobatch -allusers] -- Schedd: orsone.mi.infn.it : <192.84.138.153:9618?... @ 01/30/19 11:50:49 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 6376.0 prelz 1/30 10:52 0+00:00:00 I 0 0.0 echo Hello, world! Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp Total for prelz: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, ...

"New" format:

$ condor_q -- Schedd: orsone.mi.infn.it : <192.84.138.153:9618?... @ 01/30/19 10:52:33 OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS prelz CMD: /bin/echo 1/30 10:52 _ _ 1 1 6376.0 Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp Total for prelz: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, ...

Why is my (or very-important-user X's) job not starting ?

Let's start with the user-level tools: Check the UserLog. Check the Requirements expression

$ condor_q -better-analyze[:reverse] 6385 -- Schedd: orsone.mi.infn.it : <192.84.138.153:9618?... The Requirements expression for job 6385.000 is (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.HasFileTransfer) || (TARGET.FileSystemDomain == MY.FileSystemDomain)) Job 6385.000 defines the following attributes: (... snip ...) The Requirements expression for job 6385.000 reduces to these conditions: Slots Step Matched Condition ----- -------- --------- [0] 74 TARGET.Arch == "X86_64" [7] 101 TARGET.HasFileTransfer 6385.000: Job has not yet been considered by the matchmaker. 6385.000: Run analysis summary ignoring user priority. Of 101 machines, 27 are rejected by your job's requirements 8 reject your job because of their own requirements 0 match and are already running your jobs 9 match but are serving other users 57 are available to run your job

Why does a job end up in a 'Held' state ?

$ condor_q -- Schedd: gaspare.fisica.unimi.it : <159.149.47.93:9618?... @ 01/30/19 12:22:25 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 115318.0 mino.cancelli 11/20 15:35 6+18:37:59 H 0 1.0 add_user_mpiexe (... etc. etc. ...)

Just find the reason in the job ClassAd (-l and -af work just as for machine ClassAds):

$ condor_q 115318 -af HoldReason Failed to initialize user log to /home/mino.cancelli/run_condor/run_whatever.log

Jobs are put on hold when a condition that could be temporary or recoverable but cannot be fixed by Condor itself prevents the job from making progress.

When the problem is fixed, jobs can be released:

$ condor_release 115318

... superpowers come to the rescue. Check the logs. Usual location:
/var/log/condor/CollectorLog /var/log/condor/SchedLog /var/log/condor/GangliadLog /var/log/condor/ShadowLog /var/log/condor/MasterLog /var/log/condor/SharedPortLog /var/log/condor/MatchLog /var/log/condor/StarterLog /var/log/condor/NegotiatorLog /var/log/condor/StartLog /var/log/condor/ProcLog /var/log/condor/XferStatsLog
Condor can fetch logs for you with condor_fetchlog:
$ condor_fetchlog `condor_config_val CONDOR_HOST` NEGOTIATOR 11/25/20 02:43:50 Now in new log file /var/log/condor/NegotiatorLog 11/25/20 02:43:50 MaxPrioValue = 46659.054688 11/25/20 02:43:50 NumSubmitterAds = 1 11/25/20 02:43:50 Negotiating with puddu@gaspare.fisica.unimi.it skipped... (... etc. etc. etc. ...)
There are many levels of log verbosity.

Relevant config attributes: ALL_DEBUG (affects all daemons) and <SUBSYS>_DEBUG (with <SUBSYS> == STARTD, SCHEDD, COLLECTOR, NEGOTIATOR, etc...)
Many log 'subsystems' can be enabled. They are detailed in the manual, here. Most frequently used:
- D_FULLDEBUG: This level provides verbose output of a general nature into the log files. Frequent log messages for very specific debugging purposes would be excluded. In those cases, the messages would be viewed by having that another flag and D_FULLDEBUG both listed in the configuration file.
- D_ALL: This flag turns on all debugging output by enabling all of the debug levels at once. There is no need to list any other debug levels in addition to D_ALL; doing so would be redundant. Be warned: this will generate about a HUGE amount of output (and cause log files to be rapidly rotated!).
- D_NETWORK: All daemons (except in part the shadow) log a message on every TCP accept, connect, and close, and on every UDP send and receive.
- D_SECURITY: Details about the setup of secure network communication, including the negotiation of a socket authentication mechanism, the management of a session key cache and the authentication process itself are logged.
Logs can be redirected by setting <SUBSYS>_<LOGLEVEL>_LOG (e.g. COLLECTOR_SECURITY_LOG).

Available (HT)Condor reference:

Videos on Youtube, like pretty much for every other current project.
HTCondor admin tutorial, given at each and every HTCondor week.
HTCondor user tutorial, given at each and every HTCondor week (at least the ones held in presence) or at other events.
HTCondor manual.

HTCondor Wiki.
HTCondor mailing lists, including a specific list for INFN users (htcondor-support@lists.infn.it).
HTCondor on Github.

We already started from the install recipe available from the Condor Download Page.
curl -fsSL https://get.htcondor.org | sudo /bin/bash -s -- --no-dry-run
A few words on Condor versions. Note: the schema changed significantly with major version 9:
- There is a stable, "Long Term Support" (LTS) version series, or "channel" (currently v9.0.x, with v10.0.0 due to appear shortly) getting only bugfixes.
- New features get frequently added in the 'feature' channels, labeled as v9.y.x, with y ≥ 1 (the last one is v9.5.0) along with any bug fix that would go into stable.
- Periodically, a new LTS series is started from where the 'feature' series reached in the meanwhile.
- Version with the same 'major' number can be expected to interoperate.
- Older major/minor versions will interoperate with the newer versions "wherever they can" - line protocols have been stable for a long time - but odd side-effects are possible.

An initial, dead-simple job submission, monitoring and control exercise can be found here
Minimum-dependency job executable: pi_pl
And now let's continue on our installations...

Your contact e-mail resource for issues
htcondor-support@lists.infn.it

Introduction to HTC and Condor for (CE) Administrators

INFN online training - March 15^th, 2022.

Francesco Prelz, francesco.prelz@mi.infn.it

⇔ Program

⇔ High Performance or High Throughput?

HTC picture (1/3)

HTC picture (2/3)

HTC picture (3/3)

⇔ Condor "philosophy"

⇔ Reliably run...

⇔ ...as many jobs...

⇔ ...on as many machines as possible.

⇔ The "Central Manager" (1)

⇔ The "Central Manager" (2)

⇔ The spinning pie

⇔ Daemon configuration

Life cycle of HTCondor job: (again, courtesy Greg Thain)

Where it all starts: a submit file. Submit files are not ClassAds.

Why is my (or very-important-user X's) job not starting ?

When a user-level approach fails...

How to tweak the log verbosity.

⇔ Reference material

⇔ Let's try some of this ourselves (1).

⇔ Let's try some of this ourselves (2)...

⇔ Thank you!

INFN online training - March 15th, 2022.

Francesco Prelz, francesco.prelz@mi.infn.it

Why is my (or very-important-user X's) job not starting ?

INFN online training - March 15^th, 2022.