Count (or compute) on 'Amico'

Original picture bitmap source: www.publicdomainpictures.net. License: CC0 Public Domain
⇔ Program
→ Conceptual introduction:
- Distributed computing and storage: features and expectation management
- Opportunistic computing: survival techniques
→ Practical introduction - available tools:
→ Job examples - in increasing order of complexity:
- Start with a rare specimen: hello, world.
- Assembly of the submit file
- Job submission and monitoring
- What to do when things go wrong
- Add file transfer via "sandbox".
- Multiple/parametric job submission and control.
- File access via Object Storage
- Script submission - Object Storage file staging
- Interactive jobs
→ More complex cases (tomorrow...):
⇔ Distributed computing - main ingredients
⇔ Distributed storage - oh, my, oh, why?
⇔ Opportunistic resource access - on top of it all!
Three options to survive job pre-emption:
- Let jobs be restarted from scratch:
- Make sure an interrupted job doesn't leave any state behind (in files or databases).
- Partition the job in small-enough units of execution.
- Logical checkpointing:
- Periodically save enough state (in a local file or external database) to safely resume the execution from a known rescue point.
- This will help in reducing cycle waste.
- Physical checkpointing:
- Save (and restore) the entire job virtual memory state. Either via the (soon to be phased out) HTCondor standard universe, or via Docker+CRIU.
- Knowing that there are a few things that cannot be physically checkpointed:
- Data in transit on the network.
- Data cached by the OS on open files.
- OS Inter-Process Communications (IPC) structures.
- Kernel-level threads.
- File locks.
- Alarms, timers.
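To make the 'logical checkpointing' idea concrete, here is a minimal, self-contained shell sketch (the checkpoint file name and the toy workload are our own illustrative choices, not part of any 'Amico' tooling): the job saves its progress after every unit of work and resumes from the last rescue point when restarted.

```shell
#!/bin/sh
# Toy job with logical checkpointing: count to 10, one 'work unit' per step.
# CKPT and the workload are illustrative; a real job would save whatever
# state is needed to resume (parameters, partial results, file offsets...).
CKPT=ckpt.txt

# Resume from the rescue point, if a previous (pre-empted) run left one.
if [ -f "$CKPT" ]; then
    i=$(cat "$CKPT")
else
    i=0
fi

while [ "$i" -lt 10 ]; do
    i=$((i + 1))
    # ... one small-enough unit of real work would go here ...
    # Save state atomically, so a pre-emption mid-write cannot corrupt it.
    echo "$i" > "$CKPT.tmp" && mv "$CKPT.tmp" "$CKPT"
done
echo "completed at step $i"
```

If the job is pre-empted at, say, step 6, the next run starts from 7 instead of 1 - exactly the cycle-waste reduction mentioned above.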
Mental picture of the 'amico' cluster (1/3)
(figure: the high-throughput / high-performance spectrum)
Mental picture of the 'amico' cluster (2/3)
(figure: the high-throughput / high-performance spectrum)
Mental picture of the 'amico' cluster (3/3)
(figure: the high-throughput (FLOPY) / high-performance (FLOPS) spectrum)
⇔ Storage organisation/management:

Time to earn that first coffee break:
$ which s3
/usr/bin/s3
Missing the s3 command? Please grab a complimentary static x86_64 executable here. The tool is configurable via the appropriate environment variables:
$ env S3_ACCESS_KEY_ID=XXX S3_SECRET_ACCESS_KEY=YYY \
S3_HOSTNAME=rgw.fisica.unimi.it s3 (... command ...)
or:
$ export S3_ACCESS_KEY_ID=XXX
$ export S3_SECRET_ACCESS_KEY=YYY
$ export S3_HOSTNAME=rgw.fisica.unimi.it
$ s3 list
Here are a few examples - let's try them:
$ s3 create tut
Bucket successfully created.
$ s3 test tut
Bucket Status
---------------------------------------------- --------------------
tut USA
$ s3 put tut/my/first/file filename=/etc/motd
$ s3 list tut
Key Last Modified Size
---------------------------------- -------------------- -----
my/first/file 2019-01-28T13:01:33Z 286
$ s3 get tut/my/first/file
The programs included with the Debian GNU/Linux system are free...
(... or whatever the contents of /etc/motd...).
$ s3 get tut/my/first/file filename=/tmp/junk
$ head -2 /tmp/junk
The programs included with the Debian GNU/Linux system are free...
$ s3 put tut/my/second/file < /etc/motd
$ s3 list tut
Key Last Modified Size
----------------------------------- -------------------- -----
my/first/file 2019-01-28T13:01:33Z 286
my/second/file 2019-01-28T13:06:32Z 286
$ s3 list tut prefix=my/second/
Key Last Modified Size
----------------------------------- -------------------- -----
my/second/file 2019-01-28T13:06:32Z 286
Yet another object:
$ s3 copy tut/my/second/file tut/my/third/file
$ s3 list tut
Key Last Modified Size
----------------------------------- -------------------- -----
my/first/file 2019-01-28T13:01:33Z 286
my/second/file 2019-01-28T13:06:32Z 286
my/third/file 2019-01-28T13:08:40Z 286
$ s3 delete tut
ERROR: ErrorBucketNotEmpty
Extra Details:
BucketName: tut
RequestId: tx0000000000000001ea231-005c4f0d53-1ce77c-default
HostId: 1ce77c-default-default
Hmmmm: not much help from the s3 command in removing many objects. So it's either:
$ s3 delete tut/my/first/file
$ s3 delete tut/my/second/file
$ s3 delete tut/my/third/file
$ s3 delete tut
or, depending on your taste, down a slippery and somewhat hokey path:
$ s3 list tut prefix=my/ |tail -n +3|awk '{system ("s3 delete tut/" $1);}'
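A slightly more defensive variant of the same pipeline can be sketched as follows. It assumes the two header lines that s3 list prints (hard-wired here as sample data for illustration), and only echoes the delete commands - swap the echo for an actual invocation to really delete:

```shell
#!/bin/sh
# Sketch: turn 's3 list' output into per-object delete commands (dry run).
# Sample listing hard-wired for illustration; in real use, pipe in
# the output of:  s3 list tut prefix=my/
list_output='Key                 Last Modified         Size
------------------- --------------------- -----
my/first/file       2019-01-28T13:01:33Z  286
my/second/file      2019-01-28T13:06:32Z  286'

printf '%s\n' "$list_output" | tail -n +3 |
    awk '{print "s3 delete tut/" $1}' |
    while read -r cmd; do
        echo "$cmd"   # dry run; use: eval "$cmd" to actually delete
    done
```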
ACLs on S3
The s3 command is a bit picky about the ACL file format:
$ cat > my_acl << EOACL
OwnerID myself ()
UserID myself () FULL_CONTROL
UserID someone_else () READ
Group All Users READ
EOACL
$ s3 setacl tut filename=my_acl
$ s3 setacl tut/inferno filename=my_acl
$ s3 getacl tut/inferno
OwnerID prelz Francesco Prelz
Type User Identifier Permission
------ ------------------------- ------------
Group All Users READ
UserID another (Another User) READ
UserID prelz (Francesco Prelz) FULL_CONTROL
Objects readable by all users can be downloaded via plain HTTP:
$ wget 'http://tut.rgw.fisica.unimi.it/inferno'
⇔ Quiz! (over the break)
Why is there no s3 rename/mv command?
Other options for Object Storage access
Tools that have been found useful:
⇔ Computing organisation/management:
(HT)Condor

- HTCondor user tutorial, given at each and every HTCondor week.
- HTCondor manual.
- HTCondor Wiki.
- HTCondor mailing lists, including a specific list for INFN users (condor@lists.infn.it).
- HTCondor on Github.
Before entering into the details of how jobs are submitted and
controlled, let's focus on some terminology to describe the
computing resources available in the friendly 'Amico' clusters:
Spinning pie
- HTCondor is based around a concept called “Fair Share”:
- Assumes users are competing for resources
- Aims for long-term fairness
- Available compute resources are “The Pie”
- Users, with their relative priorities, are each trying to get their “Pie Slice” (as long as job and machine requirements/"preferences" are satisfied)
- First, the Matchmaker takes some jobs from each user and finds resources for them.
- After all users have got their initial “Pie Slice”, if there are still more jobs and resources, the matchmaker continues “spinning the pie” and handing out resources until everything is matched.
- A user who didn't submit many jobs recently will get larger pie slices for some time.
- HTCondor tracks usage and has a formula for determining priority based on both current demand and prior usage - prior usage exponentially "decays" over time.
Federated 'Amico' clusters:
ClusterName | Submit/Head Node | Group
magi-pool | gaspare.mi.infn.it | General Purpose
teor-pool | heisenberg.pcteor1.mi.infn.it | Theory
proof-pool | proof[-XX].mi.infn.it (any node) | HEP - ATLAS
lagrange | halley.fisica.unimi.it | Condensed matter
etsfmi | etsfmi.fisica.unimi.it | Condensed matter
erebor-pool | erebor.fisica.unimi.it | Cosmology
doraemon | doraemon.fisica.unimi.it | Cosmology

Hostname | Group
stargate.fisica.unimi.it | General Purpose
virgo.fisica.unimi.it | Theor. Astrophysics
List available 'Amico' resources: condor_status -pool superpool-cm.
$ condor_status -pool superpool-cm
Name                                 OpSys  Arch    State      Activity  LoadAv  Mem     ActvtyTime
slot1@baldassarre.fisica.unimi.it    LINUX  X86_64  Unclaimed  Idle      0.000   128212  4+15:11:17
slot1_1@baldassarre.fisica.unimi.it  LINUX  X86_64  Claimed    Busy      10.690  512     0+01:48:08
(... etc. etc. ...)
Machines Owner Claimed Unclaimed Matched Preempting Drain
X86_64/LINUX 86 0 2 84 0 0 0
Total 86 0 2 84 0 0 0
A '-l' option returns all available machine attributes, in 'classified ad' (or ClassAd) format:
$ condor_status -l -pool superpool-cm
Activity = "Idle"
AddressV1 = "{[ p=\"primary\"; a=\"159.149.47.95\"; port=9618;
n=\"Internet\"; spid=\"12543_087f_3\"; noUDP=true; ] ... }"
Arch = "X86_64"
(... etc. etc. etc. etc. etc. etc. ...)
Familiarising with the available attributes (benchmark results, OS and
processor types, kernel version, software dependencies, etc.)
is worth some time.
The -af ('autoformat') option is useful to list specific attributes (-af:l shows the attribute names, -af:th formats a nice table):
$ condor_status -pool superpool-cm -af Machine DetectedMemory Disk
baldassarre.fisica.unimi.it 128724 171330201
baldassarre.fisica.unimi.it 128724 343348
bell.heisenberg.pcteor1.mi.infn.it 32132 838312772
bethe.heisenberg.pcteor1.mi.infn.it 24147 437415148
bloch.heisenberg.pcteor1.mi.infn.it 64513 884268984 (...)
But the '-constraint' flag is waaay more interesting:
$ condor_status -pool superpool-cm -af Machine DetectedMemory Disk \
-constraint 'Clustername=="magi-pool" && Disk > 1000000'
baldassarre.fisica.unimi.it 128724 171330138
melchiorre.fisica.unimi.it 128724 166939416
The same syntax can be used to build a 'Requirements' expression to select resources to match and execute jobs. The reference for the ClassAd language is found in the HTCondor manual, here.
Ready for Hello, world! - a real wildebeest.
- Have your executable ready: runnable from the command line, with no need for interactive input. What about /bin/echo?
- Prepare a submit file.
- Note: we explore a few examples of submit files here, but there's a rich reference section on submit files in the HTCondor manual.
- Submit the job:
$ condor_submit hello_world_submit
Submitting job(s).
1 job(s) submitted to cluster 6376.
- (optional) If looking at the user log sounds too tedious (but we can use this time better):
condor_wait tutorial_jobs.log [job ID]
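The hello_world_submit file itself is not reproduced in these notes; a minimal version consistent with this example could look like the sketch below (the output/error file names are our own illustrative choices):

```
# hello_world_submit - minimal sketch; output/error names are illustrative
Executable = /bin/echo
Arguments  = "Hello, world!"
Output     = hello_world.out
Error      = hello_world.err
Log        = tutorial_jobs.log
Queue
```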
Life cycle of an HTCondor job (courtesy of Greg Thain)

"Old" condor_q format:
$ condor_q [-nobatch -allusers]
-- Schedd: orsone.mi.infn.it : <192.84.138.153:9618?... @ 01/30/19 11:50:49
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
6376.0 prelz 1/30 10:52 0+00:00:00 I 0 0.0 echo Hello, world!
Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp
Total for prelz: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, ...
"New" format:
$ condor_q
-- Schedd: orsone.mi.infn.it : <192.84.138.153:9618?... @ 01/30/19 10:52:33
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
prelz CMD: /bin/echo 1/30 10:52 _ _ 1 1 6376.0
Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp
Total for prelz: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, ...
When the job completes, don't forget to check the
exit code in the user log.
Why is my job not starting ? - The Requirements expression
$ condor_q -better-analyze[:reverse] 6385
-- Schedd: orsone.mi.infn.it : <192.84.138.153:9618?...
The Requirements expression for job 6385.000 is
(TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
(TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
((TARGET.HasFileTransfer) || (TARGET.FileSystemDomain == MY.FileSystemDomain))
Job 6385.000 defines the following attributes: (... snip ...)
The Requirements expression for job 6385.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 74 TARGET.Arch == "X86_64"
[7] 101 TARGET.HasFileTransfer
6385.000: Job has not yet been considered by the matchmaker.
6385.000: Run analysis summary ignoring user priority. Of 101 machines,
27 are rejected by your job's requirements
8 reject your job because of their own requirements
0 match and are already running your jobs
9 match but are serving other users
57 are available to run your job
Why did my job end up in a 'Held' state ?
$ condor_q
-- Schedd: gaspare.fisica.unimi.it : <159.149.47.93:9618?... @ 01/30/19 12:22:25
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
115318.0 mino.cancelli 11/20 15:35 6+18:37:59 H 0 1.0 add_user_mpiexe
(... etc. etc. ...)
Just find the reason in the job ClassAd (-l and -af work just as for machine ClassAds):
$ condor_q 115318 -af HoldReason
Failed to initialize user log to /home/mino.cancelli/run_condor/run_whatever.log
Jobs are put on hold when a condition that
could be temporary or recoverable but cannot be fixed by Condor itself
prevents the job from making progress.
When the problem is fixed, jobs can be released:
$ condor_release 115318
When the job reports errors on stderr, don't forget to fetch them back:
Error = /path/to/stderr_file
Recap: which options for your I/O needs?
For read-only POSIX access another option would be a web-based file system (e.g. CVMFS, WebDAV, etc.). The limits shown in the diagram are back-of-the-envelope suggestions - many more-specific cases may require custom attention.
Basic I/O: the 'job sandbox'

- Let's try another example requiring an input file.
- Note: all new files found in the job execution area (but not in any subdirectory) will be transferred back by default if transfer_output_files is not specified, as demonstrated in the next example (Executable, Source, Input 1, Input 2).
- When output files are written throughout the entire duration of the job (e.g. to save a logical checkpoint for later recovery), we also need to set:
When_To_Transfer_Output = On_Exit_Or_Evict
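A sketch of what such a submit file could look like, pulling these pieces together (all file names here are illustrative assumptions, not the actual tutorial material):

```
# Sketch: explicit sandbox file transfer; file names are illustrative
Executable              = process_data
Transfer_Input_Files    = input1.dat, input2.dat
Should_Transfer_Files   = YES
When_To_Transfer_Output = On_Exit_Or_Evict
Output                  = process_data.out
Error                   = process_data.err
Log                     = tutorial_jobs.log
Queue
```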
Specify what your job requires (1)
- Implies knowing the needs of our jobs:
- requesting too little: causes problems for your jobs and for others' + jobs might end up being held by Condor
- requesting too much: jobs will match to fewer “slots”
- Condor measures and logs resource usage:
$ condor_history -l 6394 -af:l DiskUsage ResidentSetSize ExecutableSize RemoteUserCpu BytesRecvd
ResidentSetSize = 0
ExecutableSize = 1750
RemoteUserCpu = 0.0
BytesRecvd = 2196498.0
DiskUsage = 2250
- Certain requests have to be specified in the Job Submit file:
request_cpus = 1
request_memory = 20MB
request_disk = 1MB
Specify your job Requirements (2)
- Requests can be also made in the Requirements and Rank submit file expressions.
- They can be made against any attribute in the machine ad -
but CPU, Memory and Disk requirements will fail in certain cases when specified here:
# Select machines by name
Requirements = Regexp("node", Machine)
# Select machines by attribute(s)
Requirements = ClusterName == "magi-pool" && HasJava
# And prefer faster machines
Rank = Mips
How to handle many jobs.
- Remember: smaller execution units get better goodput on
distributed systems,
so do prefer running many independent jobs:
- to analyze multiple data files
- to sweep over various parameter or input combinations
- ClusterId and ProcId can be accessed inside the submit file using the $(ClusterId) and $(ProcId) macros.
- Sweep over input files:
(we can at last add another input file).
- Submit many small jobs:
(with a trick to change the input file list).
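As a sketch of the 'many small jobs' idea, a submit file along these lines queues one job per matching input file (the executable and file names are illustrative assumptions; the queue ... matching syntax is from recent HTCondor versions):

```
# Sketch: one job per input file, using submit-file macros
Executable           = analyze
Arguments            = $(Filename)
Transfer_Input_Files = $(Filename)
Output               = out.$(ClusterId).$(ProcId)
Error                = err.$(ClusterId).$(ProcId)
Log                  = tutorial_jobs.log
Queue Filename matching files data_*.txt
```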
Time to use that object storage of ours!

- Let's first stash our files onto the object storage (even if they
aren't that large):
$ s3 put tut/inferno filename=dc_inferno.txt
16270 bytes remaining (92% complete) ...
$ s3 put tut/purgatorio filename=dc_purgatorio.txt
$ s3 put tut/paradiso filename=dc_paradiso.txt
- Then our option of choice would be to adapt the code to use object
storage natively:
- This isn't terribly complex (see the LibS3 usage reference in the CEPH docs).
- There is some environmental pressure to move to an asynchronous I/O model...
- Old-style source vs. Source using libs3 - diff -u - Jsdiff
- So that sending the resulting static executable
(or otherwise self-contained executable)
onto a distributed system becomes a piece of cake®:
Other options for S3 access

- ROOT also provides native S3 read-only access (honoring, please note, the S3_ACCESS_ID and S3_ACCESS_KEY environment variables):
export S3_ACCESS_ID=$S3_ACCESS_KEY_ID
export S3_ACCESS_KEY=$S3_SECRET_ACCESS_KEY
TFile* f = TFile::Open("as3://rgw.fisica.unimi.it/tut/calib_file.root");
f->ls(); // etc. etc.
- Another option would be the CERN Davix library (on Github). It is also natively accessible from recent versions of ROOT, so it may be interesting for ROOT users who can resist the confusion of tongues:
$ davix-get --s3accesskey $S3_ACCESS_KEY_ID \
    --s3secretkey $S3_SECRET_ACCESS_KEY \
    s3://tut.rgw.fisica.unimi.it/inferno
For anything else a simple exe is not enough

- Suppose, just suppose, you don't like, don't prefer or just cannot modify your code as in the previous example...
- We can then put together a nice shell script to stage the input files in, run the executable and stage the output files out.
- And here's the corresponding submit file.
- Most of the issues with executing scripts boil down to what we can reasonably assume to find installed on the remote execution machine. Pure POSIX /bin/sh, /bin/rm, /bin/basename, /dev/null should be safe enough, but we cannot blindly assume to find everything we need.
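A possible skeleton for such a stage-in/run/stage-out wrapper, restricted to the POSIX tools just listed, is sketched below. The s3 invocations, object names and executable name are illustrative assumptions; the script defaults to a dry run (printing the commands rather than executing them) so it can be inspected safely - set DRY_RUN=0 to really run them.

```shell
#!/bin/sh
# Sketch: stage in / run / stage out wrapper (names are illustrative).
# DRY_RUN defaults to 1: commands are printed, not executed.
set -e
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi
}

run s3 get tut/inferno filename=inferno.txt    # stage the input in
run ./my_executable inferno.txt result.txt     # the real work
run s3 put tut/result filename=result.txt      # stage the output out
rm -f inferno.txt result.txt                   # leave no state behind
```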
Beware the snakes
- Another popular example: in order to submit a Python script we'd better use another nice shell script to make sure we aren't missing anything or getting stuck in the Python version peat bog.
Direct submission:
Submission via wrapper script:
Note that the 'indirect' script uses Condor's ability to edit the attributes in the job queue to exclude machines that were found not to have the needed Python version - a technique that may be used to work around other (inevitable) black holes.
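A minimal sketch of what such a defensive wrapper might check before launching the payload is shown below; the required version (3.6) and the payload script name are illustrative assumptions. It fails fast, with a clear message, when the execution machine falls short:

```shell
#!/bin/sh
# Sketch: refuse to run the Python payload on unsuitable machines.
# Required version and payload name are illustrative assumptions.
need_major=3 need_minor=6

if ! command -v python3 > /dev/null 2>&1; then
    echo "no python3 found on $(hostname)" >&2
    exit 1
fi

# Ask the interpreter for its own version, then compare numerically.
ver=$(python3 -c 'import sys; print("%d %d" % sys.version_info[:2])')
major=${ver% *}
minor=${ver#* }
if [ "$major" -lt "$need_major" ] ||
   { [ "$major" -eq "$need_major" ] && [ "$minor" -lt "$need_minor" ]; }; then
    echo "python3 too old ($major.$minor) on $(hostname)" >&2
    exit 1
fi

# python3 my_script.py "$@"    # the actual payload would run here
echo "python3 $major.$minor accepted"
```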
Any help with this dependency spaghetti ?
- Start simple, if possible: 'Amico' includes execution nodes that mount CVMFS for ATLAS (add HasCVMFS to the job requirements).
- Try packing all dependencies with the job: CDE may be a recommended option that allows one to carry all needed dependencies with the job. Note that cde_exec itself is not a static executable and may hit incompatible glibc versions, and that there are a few catches both in a possible script to unwind CDE tarballs (1, 2) and in the corresponding two-
- There are simpler ways to make an ELF executable self-contained - they are harder to make truly general, but may be applicable and useful in certain cases, if one knows what (s)he's doing. One such example will be shown tomorrow.
- Hopefully this makes the option to assemble and run a Docker container (require HasDocker) look much simpler than it seemed.
More details on this tomorrow as well - so please stay tuned!
Interactive [access to] jobs, anyone ?
- condor_ssh_to_job is a tool to get a shell with the current working directory where a job is running (configuration allowing).
- condor_submit -interactive results in a shell prompt issued on the execute machine where the job runs.
- A submit file can still be used: many options are ignored (Universe, Executable, Arguments...), but (notably) Requirements, Rank and sandbox Transfers are not.
- Do we now feel fluent enough to try them both out ?
$ condor_submit -interactive
Submitting job(s).
1 job(s) submitted to cluster 115676.
Waiting for job to start...
⇔ See you Tomorrow!
Your contact e-mail resource for issues
with the 'Amico' infrastructure: amico-troubles@mi.infn.it