Count (or compute) on 'Amico'

Original picture bitmap source:www.publicdomainpictures.net. License: CC0 Public domain

→ Conceptual introduction:

Distributed computing and storage: features and expectation management
Opportunistic computing: survival techniques

→ Practical introduction - available tools:

'Amico' (maybe) distributed storage: CEPH
'Amico' (hopefully) distributed computing: (HT)Condor

→ Job examples - in increasing order of complexity:

Start with a rare specimen: hello, world.
- Assembly of the submit file
- Job submission and monitoring
- What to do when things go wrong
Add file transfer via "sandbox".
Multiple/parametric job submission and control.
File access via Object Storage
Script submission - Object Storage file staging
Interactive jobs

→ More complex cases (tomorrow...):

Here's a concentrated system
Interconnection latencies:
negligible
Expansion/Scale capabilites:
almost none
Complete control and access to any system component
- (including hardware maintenance)

But wait: there are other execution resources
- and they may even be available
Distributing the execution may be a good idea:
- Better scalability
- Faster turnaround
- Maintenance falls on someone else's shoulders
As long as a few issues are kept under control:
- dependencies may not be exactly the same everywhere
- numeric results may require a systematic study/check
- latency and bandwidth of access to resources is limited and has to be shared
- any component in a distributed system may (and will) fail

This is such a beautiful idea that other people may like to join and jump on the bandwagon!
⇒ Access to the resources needs to be arbitrated according to some notion of fairness.
It just cannot be arbitrated manually...

The fact that no one just processes a handful of small files was not forgotten.
The handling of exabyte-scale data stores is an issue that has been successfully tackled even in our field.

We'll now focus on the stage and actors of data access/transfer.
Our ordinary jobs typically use one, or more, or all of these access mechanisms.
Some access strategies are inherently distributed, others are not: coaxing the latter to support concurrent access has produced countless pain.

A file access model that calls for writes (updates, appends) to existing files with simultaneous write and read access (such as the POSIX model that serves so well for local file access but also supports user-level and partial file locking) brings a number of hard problems.
- Pretty much all of the distributed file system failures, inconsistencies and data losses are a consequence of this.
A write-once, read-many access model removes this issue entirely
This is where the wonderful scalability of all web giant production file stores comes from. This is available today from an object store nearest to you!

The 'Amico' cluster federates resources that were procured by individual research groups, for specific purposes.
Cluster owners retain execution priority (⇔ power of pre-emption - suspending and/or killing running jobs) on their cluster(s).
Job re-submission can of course be handled automatically, but maximising the goodput in this environment requires a few points of attention on the job side. These just cannot be ignored.
Pre-emption is particularly bad (and difficult to handle) for jobs that require real parallelism.

Let jobs be restarted from scratch:
- Make sure an interrupted job doesn't leave any state behind (in files or databases).
- Partition the job in small-enough units of execution.
Logical checkpointing:
- Periodically save enough state (in a local file or external database) to safely resume the execution from a known rescue point.
- This will help in reducing cycle waste.

Physical checkpointing:

Save (and restore) the entire job virtual memory state. Either via the (soon to be phased out) HTCondor standard universe, or via Docker+CRIU.

Knowing that there are a few things that cannot be physically checkpointed:

Data in transit on the network.
Data cached by the OS on open files.
OS Inter-Process Communications (IPC) structures.

Kernel-level threads.
File locks.
Alarms, timers.

high-throughput /	high-performance

high-throughput /	high-performance

high-throughput /	high-performance
FLOPY	FLOPS

A bird's eye view on how the 'Amico' data object store works.
'Placement group' location is determined only by hashing the object name - no centralised 'directory' or database needed.
A hierarchical naming scheme is possible, but no 'subdirectory' listing is possible. The store is just able to list all objects in a given storage bucket.
Terminology:
- Pool: collection of objects handled by the low-level store, and mapped on a given set of placement groups. The number of pools is kept limited.
- Bucket: logical collection of objects handled by the higher-level 'object gateway'. The bucket name contributes to the name hash. Buckets for different users are stored in the same pool with the appropriate access rights. The number of buckets can be large.
Objects at all levels include metadata (OMAP/XATTRS at the pool level, 'headers' at the bucket level).
All of this is much clearer in practice: let's try it out.

Let's start with some interactive tool to access our object storage via the Amazon S3 protocol:

$ which s3 /usr/bin/s3

Missing the s3 command ? Please grab a complimentary static x86_64 executable here. The tool is configurable via appropriate environment variables:

$ env S3_ACCESS_KEY_ID=XXX S3_SECRET_ACCESS_KEY=YYY \ S3_HOSTNAME=rgw.fisica.unimi.it s3 (... command ...)

or:

$ export S3_ACCESS_KEY_ID=XXX $ export S3_SECRET_ACCESS_KEY=YYY $ export S3_HOSTNAME=rgw.fisica.unimi.it $ s3 list

Main S3 commands: create, delete, get, put, copy, list.
Here's a few examples - let's try them:

$ s3 create tut Bucket successfully created. $ s3 test tut Bucket Status ---------------------------------------------- -------------------- tut USA $ s3 put tut/my/first/file filename=/etc/motd $ s3 list tut Key Last Modified Size ---------------------------------- -------------------- ----- my/first/file 2019-01-28T13:01:33Z 286 $ s3 get tut/my/first/file The programs included with the Debian GNU/Linux system are free... (... or whatever the contents of /etc/motd...). $ s3 get tut/my/first/file filename=/tmp/junk $ head -2 /tmp/junk The programs included with the Debian GNU/Linux system are free...

Let's try with another object/file:

$ s3 put tut/my/second/file < /etc/motd $ s3 list tut Key Last Modified Size ----------------------------------- -------------------- ----- my/first/file 2019-01-28T13:01:33Z 286 my/second/file 2019-01-28T13:06:32Z 286 $ s3 list tut prefix=my/second/ Key Last Modified Size ----------------------------------- -------------------- ----- my/second/file 2019-01-28T13:06:32Z 286

Yet another object:

$ s3 copy tut/my/second/file tut/my/third/file $ s3 list tut Key Last Modified Size ----------------------------------- -------------------- ----- my/first/file 2019-01-28T13:01:33Z 286 my/second/file 2019-01-28T13:06:32Z 286 my/third/file 2019-01-28T13:08:40Z 286

And now let's try to get rid of them all:

$ s3 delete tut ERROR: ErrorBucketNotEmpty Extra Details: BucketName: tut RequestId: tx0000000000000001ea231-005c4f0d53-1ce77c-default HostId: 1ce77c-default-default

Hmmmm: not much help from the s3 command in removing many objects. So it's either:

$ s3 delete tut/my/first/file $ s3 delete tut/my/second/file $ s3 delete tut/my/third/file $ s3 delete tut

or, depending on your taste, on a slippery and a bit hokey path:

$ s3 list tut prefix=my/ |tail -n +3|awk '{system ("s3 delete tut/" $1);}'

Access can be granted to other users/groups via an appropriate Access Control List.
The s3 command is a bit picky about the ACL file format:

cat > my_acl << EOACL OwnerID myself () UserID myself () FULL_CONTROL UserID someone_else () READ Group All Users READ EOACL $ s3 setacl tut filename=my_acl $ s3 setacl tut/inferno filename=my_acl $ s3 getacl tut/inferno OwnerID prelz Francesco Prelz Type User Identifier Permission ------ ------------------------- ------------ Group All Users READ UserID another (Another User) READ UserID prelz (Francesco Prelz) FULL_CONTROL

Objects readable by all users can be downloaded via plain HTTP:

wget 'http://tut.rgw.fisica.unimi.it/inferno'

Why is there no s3 rename/mv command?

Tools that have been found useful:

Amazon's recommended tool: s3cmd (written in Python).
CrossFTP.
Cyberduck.io for MacOS or Windows.

Available reference:

HTCondor user tutorial, given at each and every HTCondor week.
HTCondor manual.

HTCondor Wiki.
HTCondor mailing lists, including a specific list for INFN users (condor@lists.infn.it).
HTCondor on Github.

Before entering into the details of how jobs are submitted and controlled, let's focus on some terminology to describe the computing resources available in the friendly 'Amico' clusters:

The computing resources are organised in a number of privately owned and operated computing clusters.
Inside each cluster, one Head Node is usually charged with co-ordinating the cluster, and sometimes also acts as a single network point of entry.
Other nodes in the cluster execute jobs (Execution Nodes). In the 'Amico' infrastructure, all executing nodes in any cluster can communicate directly over the local area network.
We'll shortly go over the list of available clusters.
Nodes where jobs are submitted and queued are called Submit Nodes.

Typically, users who need to submit jobs share some interests with cluster owners, so they have priority access to some cluster.
Interactive execution and (possibly) various batch systems are used to organise the workload in each cluster.
Typically with less than 100% resource occupancy.
'Amico' wants to be friendly to local cluster owners, and will suspend, then migrate jobs when local workload appears. Current default policy:
- Suspend after 2 minutes of local activity.
- Vacate and migrate if the job cannot be restarted within 10 minutes.

An upper-tier service (or "Central Manager", codename: superpool-cm) matching available computing resources with pending job requests can compensate load peaks across clusters and increase goodput.
The semantics of this resource sharing service is opportunistic: HTCondor is a specialised solution for this scenario.
If HTCondor is also used as a local cluster 'batch system', then local and 'Amico' jobs can be handled in a uniform way.
This scenario cannot be serviced with any number of FIFO (first-in-first-out) queues.

HTCondor is based around a concept called “Fair Share”:
- Assumes users are competing for resources
- Aims for long-term fairness
Available compute resources are “The Pie”
Users, with their relative priorities, are each trying to get their “Pie Slice” (as long as job and machine requirements/"preferences" are satisfied)
First, the Matchmaker takes some jobs from each user and finds resources for them.
After all users have got their initial “Pie Slice”, if there are still more jobs and resources, the matchmaker continues “spinning the pie” and handing out resources until everything is matched.
A user who didn't submit many jobs recently will get larger pie slices for some time.
HTCondor tracks usage and has a formula for determining priority based on both current demand and prior usage - prior usage exponentially "decays” over time.

`ClusterName`	Submit/Head Node	Group
`magi-pool`	`gaspare.mi.infn.it`	General Purpose
`teor-pool`	`heisenberg.pcteor1.mi.infn.it`	Theory
`proof-pool`	`proof[-XX].mi.infn.it` (any node)	HEP - ATLAS
`lagrange`	`halley.fisica.unimi.it`	Condensed matter
`etsfmi`	`etsfmi.fisica.unimi.it`	Condensed matter
`erebor-pool`	`erebor.fisica.unimi.it`	Cosmology
`doraemon`	`doraemon.fisica.unimi.it`	Cosmology

Other (standalone) Submit nodes (there could be many more):

Hostname	Group
`stargate.fisica.unimi.it`	General Purpose
`virgo.fisica.unimi.it`	Theor. Astrophysics

List available 'Amico' resources: condor_status -pool superpool-cm.

$ condor_status -pool superpool-cm Name OpSys Arch State Activity LoadAv Mem ActvtyTime slot1@baldassarre.fisica.unimi.it LINUX X86_64 Unclaimed Idle 0.000 128212 4+15:11:17 slot1_1@baldassarre.fisica.unimi.it LINUX X86_64 Claimed Busy (... etc. etc. ...) 10.690 512 0+01:48:08 Machines Owner Claimed Unclaimed Matched Preempting Drain X86_64/LINUX 86 0 2 84 0 0 0 Total 86 0 2 84 0 0 0

A '-l' option returns all available machine attributes, all in 'classified ad' (or ClassAd) format:

$ condor_status -l -pool superpool-cm Activity = "Idle" AddressV1 = "{[ p=\"primary\"; a=\"159.149.47.95\"; port=9618; n=\"Internet\"; spid=\"12543_087f_3\"; noUDP=true; ] ... }" Arch = "X86_64" (... etc. etc. etc. etc. etc. etc. ...)

Familiarising with the available attributes (benchmark results, OS and processor types, kernel version, software dependencies, etc.) is worth some time.

The -af ('autoformat') option is useful to list specific attributes (-af:l shows the attribute names, -af:th formats a nice table):

$ condor_status -pool superpool-cm -af Machine DetectedMemory Disk baldassarre.fisica.unimi.it 128724 171330201 baldassarre.fisica.unimi.it 128724 343348 bell.heisenberg.pcteor1.mi.infn.it 32132 838312772 bethe.heisenberg.pcteor1.mi.infn.it 24147 437415148 bloch.heisenberg.pcteor1.mi.infn.it 64513 884268984 (...)

But the '-constraint' flag is waaay more interesting:

$ condor_status -pool superpool-cm -af Machine DetectedMemory Disk \ -constraint 'Clustername=="magi-pool" && Disk > 1000000' baldassarre.fisica.unimi.it 128724 171330138 melchiorre.fisica.unimi.it 128724 166939416

The same syntax can be used to build a 'Requirements' expression to select resources to match and execute jobs.
The reference for the ClassAd language is found in the HTCondor manual, here

Have your executable ready. Runnable from the command line and with no need for interactive input. What about /bin/echo ?
Prepare a
- Note: we explore a few examples of submit files here, but there's a rich reference section on submit files in the HTCondor manual
Submit the job:
$ condor_submit hello_world_submit Submitting job(s). 1 job(s) submitted to cluster 6376.
(optional) If looking at the user log sounds too tedious (but we can use this time better):
condor_wait tutorial_jobs.log [job ID]

"Old" condor_q format:

$ condor_q [-nobatch -allusers] -- Schedd: orsone.mi.infn.it : <192.84.138.153:9618?... @ 01/30/19 11:50:49 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 6376.0 prelz 1/30 10:52 0+00:00:00 I 0 0.0 echo Hello, world! Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp Total for prelz: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, ...

"New" format:

$ condor_q -- Schedd: orsone.mi.infn.it : <192.84.138.153:9618?... @ 01/30/19 10:52:33 OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS prelz CMD: /bin/echo 1/30 10:52 _ _ 1 1 6376.0 Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp Total for prelz: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, ...

When the job completes, don't forget to check the exit code in the user log.

Why is my job not starting ? - The Requirements expression

$ condor_q -better-analyze[:reverse] 6385 -- Schedd: orsone.mi.infn.it : <192.84.138.153:9618?... The Requirements expression for job 6385.000 is (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.HasFileTransfer) || (TARGET.FileSystemDomain == MY.FileSystemDomain)) Job 6385.000 defines the following attributes: (... snip ...) The Requirements expression for job 6385.000 reduces to these conditions: Slots Step Matched Condition ----- -------- --------- [0] 74 TARGET.Arch == "X86_64" [7] 101 TARGET.HasFileTransfer 6385.000: Job has not yet been considered by the matchmaker. 6385.000: Run analysis summary ignoring user priority. Of 101 machines, 27 are rejected by your job's requirements 8 reject your job because of their own requirements 0 match and are already running your jobs 9 match but are serving other users 57 are available to run your job

Why did my job end up in a 'Held' state ?

$ condor_q -- Schedd: gaspare.fisica.unimi.it : <159.149.47.93:9618?... @ 01/30/19 12:22:25 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 115318.0 mino.cancelli 11/20 15:35 6+18:37:59 H 0 1.0 add_user_mpiexe (... etc. etc. ...)

Just find the reason in the job ClassAd (-l and -af work just as for machine ClassAds):

$ condor_q 115318 -af HoldReason Failed to initialize user log to /home/mino.cancelli/run_condor/run_whatever.log

Jobs are put on hold when a condition that could be temporary or recoverable but cannot be fixed by Condor itself prevents the job from making progress.

When the problem is fixed, jobs can be released:

$ condor_release 115318

When the job reports errors on stderr, don't forget to fetch them back:
Error = /path/to/stderr_file

For read-only POSIX access another option would be a web-based file system (e.g. CVMFS, WebDAV, etc.).

The limits shown in the diagram are back-of-the-envelope suggestions - many more-specific cases may require custom attention.

Let's try another example requiring an input file.
Note: all new files found in the job execution area (but not in any subdirectory) will be transferred back by default if transfer_output_files is not specified, as demonstrated in the next (Executable, Source, Input 1, Input 2)
When output files are written for the entire duration of the jobs (e.g. to save a logical checkpoint for later recovery), we need to set also:
When_To_Transfer_Output = On_Exit_Or_Evict

Implies knowing the needs of our jobs:
- requesting too little: causes problems for your and other jobs + jobs might end up being held by Condor
- requesting too much: jobs will match to fewer “slots”
Condor measures and logs resource usage:
$ condor_history -l 6394 -af:l DiskUsage ResidentSetSize\ ExecutableSize RemoteUserCpu BytesRecvd ResidentSetSize = 0 ExecutableSize = 1750 RemoteUserCpu = 0.0 BytesRecvd = 2196498.0 DiskUsage = 2250
Certain requests have to be specified in the Job Submit file:
request_cpus = 1 request_memory = 20MB request_disk = 1MB

Requests can be also made in the Requirements and Rank submit file expressions.
They can be made against any attribute in the machine ad - but CPU, Memory and Disk requirements will fail in certain cases when specified here:
# Select machines by name Requirements = Regexp("node",Machine) # Select machines by attribute(s) Requirements = ClusterName == "magi-pool" && HasJava # And prefer faster machines Rank = Mips

Remember: smaller execution units get better goodput on distributed systems, so do prefer running many independent jobs:
- to analyze multiple data files
- to sweep over various parameter or input combinations
ClusterId and ProcId can be accessed inside the submit file using the $(ClusterId) and $(ProcId) macros.
Sweep over input files: (we can at last add another input file).
Submit many small jobs: (with a trick to change the input file list).

Let's first stash our files onto the object storage (even if they aren't that large):
$ s3 put tut/inferno filename=dc_inferno.txt 16270 bytes remaining (92% complete) ... $ s3 put tut/purgatorio filename=dc_purgatorio.txt $ s3 put tut/paradiso filename=dc_paradiso.txt
Then our option of choice would be to adapt the code to use object storage natively:
- This isn't terribly complex (see the LibS3 usage reference in the CEPH docs).
- There is some environmental pressure to move to an asynchronous I/O model...
- Old-style source vs. Source using libs3 - diff -u - Jsdiff
So that sending the resulting static executable (or otherwise self-contained executable) onto a distributed system becomes a piece of cake^®:

ROOT also provides native S3 read-only access (honoring, please note, the S3_ACCESS_ID and S3_ACCESS_KEY environment variables):
export S3_ACCESS_ID=$S3_ACCESS_KEY_ID export S3_ACCESS_KEY=$S3_SECRET_ACCESS_KEY TFile* f = TFile::Open("as3://rgw.fisica.unimi.it/tut/calib_file.root"); f->ls(); // etc. etc.
Another option would be the CERN Davix (on Github) library. It is also natively accessible from recent versions of ROOT, so it may be interesting for ROOT users who can resist the confusion of tongues:
davix-get --s3accesskey $S3_ACCESS_KEY_ID \ --s3secretkey $S3_SECRET_ACCESS_KEY \ s3://tut.rgw.fisica.unimi.it/inferno

Suppose, just suppose you don't like, don't prefer or just cannot modify your code as in the previous example...
- We can then put together a nice shell script to stage the input files in, run the executable and stage the output files out.
- And here's the corresponding
Most of the issues with executing scripts boil down to what we can reasonably assume to find installed on the remote execution machine. Pure POSIX /bin/sh, /bin/rm, /bin/basename, /dev/null should be safe enough, but we cannot blindly assume to find everything we need.

Another popular example: in order to submit a Python script we'd better use another nice shell script to make sure we aren't missing anything or getting stuck in the Python version peat bog.

Direct submission:

Submission via wrapper script:

Note that the 'indirect' script uses Condor's ability to edit the attributes in the job queue to exclude machines that were found not to have the needed Python version - a technique that may be used to work around other (inevitable) black holes.

Start simple, if possible: 'Amico' includes execution nodes that mount CVMFS for ATLAS (add HasCVMS to the job requirements).
Try packing all dependencies with the job: CDE may be a recommended option that allows to carry all needed dependencies with the job. Note that cde_exec itself is not a static executable and may hit incompatible glibc versions, and that there are a few catches both in a possible script to unwind CDE tarballs (1, 2) and in the corrisponding two -
There are simpler ways to make an ELF executable self-contained - they are harder to make truly general, but may be applicable and useful in certain cases, if one knows what (s)he's doing. One such example will be shown tomorrow.
Hopefully this makes the option to assemble and run a Docker container (require HasDocker) look much simpler that it seemed.
More details on this tomorrow as well - so please stay tuned!

condor_ssh_to_job is a tool to get a shell with current working directory where a job is running (configuration allowing).
condor_submit -interactive results in a shell prompt issued on the execute machine where the job runs.
- A submit file can still be used: many options are ignored (Universe, Executable, Arguments...), but (notably) Requirements, Rank and sandbox Transfers are not.
Do we now feel fluent enough to try them both out ?
$ condor_submit -interactive Submitting job(s). 1 job(s) submitted to cluster 115676. Waiting for job to start...

Your contact e-mail resource for issues
with the 'Amico' infrastructure:
amico-troubles@mi.infn.it

Count (or compute) on 'Amico'

⇔ Program

⇔ Distributed computing - main ingredients

⇔ Distributed storage - oh, my, oh, why?

⇔ Opportunistic resource access - on top of it all!

Three options to survive job pre-emption:

Mental picture of the 'amico' cluster(1/3)

Mental picture of the 'amico' cluster(2/3)

Mental picture of the 'amico' cluster(3/3)

⇔ Storage organisation/management:

Time to earn that first coffee break:

ACLs on S3

⇔ Quiz! (over the break)

Other options for Object Storage access

⇔ Computing organisation/management: (HT)Condor

Spinning pie

Federated 'Amico' clusters:

Ready for Hello, world! - a real wildebeest.

Life cycle of HTCondor job: (courtesy Greg Thain)

Recap: which options for your I/O needs?

Basic I/O: the 'job sandbox'

Specify what your job requires (1)

Specify your job `Requirements` (2)

How to handle many jobs.

Time to use that object storage of ours!

Other options for S3 access

For anything else a simple `exe` is not enough

Beware the snakes

Any help with this dependencies spaghetti ?

Interactive [access to] jobs, anyone ?

⇔ See you Tomorrow!