Count (or compute) on 'Amico'

Original picture bitmap License: CC0 Public domain

⇔ Program

→ Conceptual introduction:

  1. Distributed computing and storage: features and expectation management
  2. Opportunistic computing: survival techniques

→ Practical introduction - available tools:

  1. 'Amico' (maybe) distributed storage: CEPH
  2. 'Amico' (hopefully) distributed computing: (HT)Condor

→ Job examples - in increasing order of complexity:

  1. Start with a rare specimen: hello, world.
    • Assembly of the submit file
    • Job submission and monitoring
    • What to do when things go wrong
  2. Add file transfer via "sandbox".
  3. Multiple/parametric job submission and control.
  4. File access via Object Storage
  5. Script submission - Object Storage file staging
  6. Interactive jobs

→ More complex cases (tomorrow...):

  1. Common dependencies and how to require them
  2. Docker, docker universe
  3. Challenges and opportunities for parallel execution in opportunistic environment (HP-HTC).
  4. MPI job cases

⇔ Distributed computing - main ingredients

  • Here's a concentrated system
  • Interconnection latencies:
  • Expansion/Scale capabilities: almost none
  • Complete control and access to any system component
    • (including hardware maintenance)
  • But wait: there are other execution resources
    • and they may even be available
  • Distributing the execution may be a good idea:
    • Better scalability
    • Faster turnaround
    • Maintenance falls on someone else's shoulders
  • As long as a few issues are kept under control:
    • dependencies may not be exactly the same everywhere
    • numeric results may require a systematic study/check
    • latency and bandwidth of access to resources is limited and has to be shared
    • any component in a distributed system may (and will) fail
  • This is such a beautiful idea that other people may like to join and jump on the bandwagon!
  • ⇒ Access to the resources needs to be arbitrated according to some notion of fairness.
  • It just cannot be arbitrated manually...

⇔ Distributed storage - oh, my, oh, why?

  • We haven't forgotten that no one processes just a handful of small files.
  • The handling of exabyte-scale data stores is an issue that has been successfully tackled even in our field.
  • We'll now focus on the stage and actors of data access/transfer.
  • Our ordinary jobs typically use one, or more, or all of these access mechanisms.
  • Some access strategies are inherently distributed, others are not: coaxing the latter into supporting concurrent access has produced endless pain.
  • A file access model that allows writes (updates, appends) to existing files, with simultaneous write and read access (such as the POSIX model, which serves local file access so well and also supports user-level and partial file locking), brings a number of hard problems.
    • Pretty much all distributed file system failures, inconsistencies and data losses are a consequence of this.
  • A write-once, read-many access model removes this issue entirely.
  • This is where the wonderful scalability of all web giant production file stores comes from. This is available today from an object store nearest to you!

⇔ Opportunistic resource access - on top of it all!

  • The 'Amico' cluster federates resources that were procured by individual research groups, for specific purposes.
  • Cluster owners retain execution priority (⇔ power of pre-emption - suspending and/or killing running jobs) on their cluster(s).
  • Job re-submission can of course be handled automatically, but maximising the goodput in this environment requires a few points of attention on the job side. These just cannot be ignored.
  • Pre-emption is particularly bad (and difficult to handle) for jobs that require real parallelism.

Three options to survive job pre-emption:

  1. Let jobs be restarted from scratch:
    • Make sure an interrupted job doesn't leave any state behind (in files or databases).
    • Partition the job in small-enough units of execution.
  2. Logical checkpointing:
    • Periodically save enough state (in a local file or external database) to safely resume the execution from a known rescue point.
    • This will help in reducing cycle waste.
  3. Physical checkpointing:
    • Save (and restore) the entire job virtual memory state. Either via the (soon to be phased out) HTCondor standard universe, or via Docker+CRIU.
    • Knowing that there are a few things that cannot be physically checkpointed:
      • Data in transit on the network.
      • Data cached by the OS on open files.
      • OS Inter-Process Communications (IPC) structures.
      • Kernel-level threads.
      • File locks.
      • Alarms, timers.
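Option 2 above (logical checkpointing) fits in a few lines of shell. A minimal sketch, assuming the job can be partitioned into numbered work units; the file name checkpoint.dat and the unit count are our choices:

```shell
#!/bin/sh
# Logical checkpointing sketch (illustrative): record the index of the last
# completed work unit, so a restarted job resumes instead of starting over.

CKPT=checkpoint.dat     # rescue point (could also be an external database)
TOTAL=10                # number of work units in this job

# Resume from the last saved unit, or start from the beginning.
START=1
if [ -f "$CKPT" ]; then
    START=$(( $(cat "$CKPT") + 1 ))
fi

i=$START
while [ "$i" -le "$TOTAL" ]; do
    # ... the real work for unit $i would go here ...
    echo "$i" > "$CKPT"   # save state only after the unit completes
    i=$((i + 1))
done
```

Since the state is saved only after a unit completes, a pre-emption in mid-unit costs at most one unit of redone work on restart.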

Mental picture of the 'amico' cluster (1/3 - 3/3)

(figure slides: high-throughput / high-performance)

⇔ Storage organisation/management:

  • A bird's eye view on how the 'Amico' data object store works.
  • 'Placement group' location is determined only by hashing the object name - no centralised 'directory' or database needed.
  • A hierarchical naming scheme is possible, but no 'subdirectory' listing is possible. The store is just able to list all objects in a given storage bucket.
  • Terminology:
    • Pool: collection of objects handled by the low-level store, and mapped on a given set of placement groups. The number of pools is kept limited.
    • Bucket: logical collection of objects handled by the higher-level 'object gateway'. The bucket name contributes to the name hash. Buckets for different users are stored in the same pool with the appropriate access rights. The number of buckets can be large.
  • Objects at all levels include metadata (OMAP/XATTRS at the pool level, 'headers' at the bucket level).
  • All of this is much clearer in practice: let's try it out.

Time to earn that first coffee break:

Let's start with some interactive tool to access our object storage via the Amazon S3 protocol:
$ which s3
/usr/bin/s3

Missing the s3 command? Please grab a complimentary static x86_64 executable here. The tool is configurable via appropriate environment variables:

$ env S3_ACCESS_KEY_ID=XXX S3_SECRET_ACCESS_KEY=YYY \
    s3 (... command ...)

or:

$ export S3_ACCESS_KEY_ID=XXX
$ export S3_SECRET_ACCESS_KEY=YYY
$ s3 list
Main S3 commands: create, delete, get, put, copy, list.
Here's a few examples - let's try them:
$ s3 create tut
Bucket successfully created.
$ s3 test tut
Bucket                                         Status
---------------------------------------------- --------------------
tut                                            USA
$ s3 put tut/my/first/file filename=/etc/motd
$ s3 list tut
Key                                Last Modified        Size
---------------------------------- -------------------- -----
my/first/file                      2019-01-28T13:01:33Z   286
$ s3 get tut/my/first/file
The programs included with the Debian GNU/Linux system are free...
(... or whatever the contents of /etc/motd ...)
$ s3 get tut/my/first/file filename=/tmp/junk
$ head -2 /tmp/junk
The programs included with the Debian GNU/Linux system are free...
Let's try with another object/file:
$ s3 put tut/my/second/file < /etc/motd
$ s3 list tut
Key                                 Last Modified        Size
----------------------------------- -------------------- -----
my/first/file                       2019-01-28T13:01:33Z   286
my/second/file                      2019-01-28T13:06:32Z   286
$ s3 list tut prefix=my/second/
Key                                 Last Modified        Size
----------------------------------- -------------------- -----
my/second/file                      2019-01-28T13:06:32Z   286
Yet another object:
$ s3 copy tut/my/second/file tut/my/third/file
$ s3 list tut
Key                                 Last Modified        Size
----------------------------------- -------------------- -----
my/first/file                       2019-01-28T13:01:33Z   286
my/second/file                      2019-01-28T13:06:32Z   286
my/third/file                       2019-01-28T13:08:40Z   286
And now let's try to get rid of them all:
$ s3 delete tut
ERROR: ErrorBucketNotEmpty
Extra Details:
  BucketName: tut
  RequestId: tx0000000000000001ea231-005c4f0d53-1ce77c-default
  HostId: 1ce77c-default-default
Hmmmm: not much help from the s3 command for removing many objects at once. So it's either:

$ s3 delete tut/my/first/file
$ s3 delete tut/my/second/file
$ s3 delete tut/my/third/file
$ s3 delete tut

or, depending on your taste, a slippery and somewhat hokey one-liner:

$ s3 list tut prefix=my/ | tail -n +3 | awk '{system("s3 delete tut/" $1);}'

ACLs on S3

Access can be granted to other users/groups via an appropriate Access Control List.
The s3 command is a bit picky about the ACL file format:
$ cat > my_acl << EOACL
OwnerID myself ()
UserID myself () FULL_CONTROL
UserID someone_else () READ
Group All Users READ
EOACL
$ s3 setacl tut filename=my_acl
$ s3 setacl tut/inferno filename=my_acl
$ s3 getacl tut/inferno
OwnerID prelz Francesco Prelz
Type    User Identifier            Permission
------  -------------------------  ------------
Group   All Users                  READ
UserID  another (Another User)     READ
UserID  prelz (Francesco Prelz)    FULL_CONTROL
Objects readable by all users can be downloaded via plain HTTP:
wget ''

⇔ Quiz! (over the break)

Why is there no s3 rename/mv command?

Other options for Object Storage access

Tools that have been found useful:

⇔ Computing organisation/management: (HT)Condor

Available reference:
Before entering into the details of how jobs are submitted and controlled, let's focus on some terminology to describe the computing resources available in the friendly 'Amico' clusters:
  • The computing resources are organised in a number of privately owned and operated computing clusters.
  • Inside each cluster, one Head Node is usually charged with co-ordinating the cluster, and sometimes also acts as a single network point of entry.
  • Other nodes in the cluster execute jobs (Execution Nodes). In the 'Amico' infrastructure, all executing nodes in any cluster can communicate directly over the local area network.
  • We'll shortly go over the list of available clusters.
  • Nodes where jobs are submitted and queued are called Submit Nodes.
  • Typically, users who need to submit jobs share some interests with cluster owners, so they have priority access to some cluster.
  • Interactive execution and (possibly) various batch systems are used to organise the workload in each cluster.
  • Typically with less than 100% resource occupancy.
  • 'Amico' wants to be friendly to local cluster owners, and will suspend, then migrate jobs when local workload appears. Current default policy:
    • Suspend after 2 minutes of local activity.
    • Vacate and migrate if the job cannot be restarted within 10 minutes.
  • An upper-tier service (or "Central Manager", codename: superpool-cm) matching available computing resources with pending job requests can compensate load peaks across clusters and increase goodput.
  • The semantics of this resource sharing service is opportunistic: HTCondor is a specialised solution for this scenario.
  • If HTCondor is also used as a local cluster 'batch system', then local and 'Amico' jobs can be handled in a uniform way.
  • This scenario cannot be serviced with any number of FIFO (first-in-first-out) queues.

Spinning pie

Federated 'Amico' clusters:

ClusterName   Submit/Head Node       Group
-----------   ---------------------  ----------------
magi-pool                            General Purpose
teor-pool                            Theory
proof-pool    proof[-XX] (any node)  HEP - ATLAS
lagrange                             Condensed matter
etsfmi                               Condensed matter
erebor-pool                          Cosmology
doraemon                             Cosmology

Other (standalone) Submit nodes (there could be many more):

Hostname      Group
-----------   ----------------
              General Purpose
              Theor.
              Astrophysics

List available 'Amico' resources: condor_status -pool superpool-cm.

$ condor_status -pool superpool-cm
Name  OpSys   Arch    State      Activity  LoadAv  Mem     ActvtyTime
      LINUX   X86_64  Unclaimed  Idle       0.000  128212  4+15:11:17
      LINUX   X86_64  Claimed    Busy      10.690     512  0+01:48:08
(... etc. etc. ...)
              Machines  Owner  Claimed  Unclaimed  Matched  Preempting  Drain
X86_64/LINUX        86      0        2         84        0           0      0
       Total        86      0        2         84        0           0      0
A '-l' option returns all available machine attributes, all in 'classified ad' (or ClassAd) format:
$ condor_status -l -pool superpool-cm
Activity = "Idle"
AddressV1 = "{[ p=\"primary\"; a=\"\"; port=9618; n=\"Internet\"; spid=\"12543_087f_3\"; noUDP=true; ] ... }"
Arch = "X86_64"
(... etc. etc. etc. etc. etc. etc. ...)
Familiarising with the available attributes (benchmark results, OS and processor types, kernel version, software dependencies, etc.) is worth some time.

The -af ('autoformat') option is useful to list specific attributes (-af:l shows the attribute names, -af:th formats a nice table):

$ condor_status -pool superpool-cm -af Machine DetectedMemory Disk
128724 171330201
128724    343348
 32132 838312772
 24147 437415148
 64513 884268984
(...)
But the '-constraint' flag is waaay more interesting:
$ condor_status -pool superpool-cm -af Machine DetectedMemory Disk \
    -constraint 'Clustername=="magi-pool" && Disk > 1000000'
128724 171330138
128724 166939416
The same syntax can be used to build a 'Requirements' expression to select resources to match and execute jobs.
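For instance, restricting a job to the machines matched by the condor_status query above could be done with a submit-file line like the following sketch (the disk threshold is an arbitrary illustration):

```
Requirements = (Clustername == "magi-pool") && (Disk > 1000000)
```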
The reference for the ClassAd language is found in the HTCondor manual, here

Ready for Hello, world! - a real wildebeest.

  1. Have your executable ready. Runnable from the command line and with no need for interactive input. What about /bin/echo ?
  2. Prepare a submit file:
    • Note: we explore a few examples of submit files here, but there's a rich reference section on submit files in the HTCondor manual
  3. Submit the job:
    $ condor_submit hello_world_submit
    Submitting job(s).
    1 job(s) submitted to cluster 6376.
  4. (optional) If repeatedly checking the user log sounds too tedious (though we could use this time better):
    $ condor_wait tutorial_jobs.log [job ID]
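A minimal submit file matching the transcript above could look like the following sketch (the executable, arguments and log file come from the examples; the Output/Error file names are our choice):

```
# hello_world_submit - minimal HTCondor submit description
Universe   = vanilla
Executable = /bin/echo
Arguments  = Hello, world!
Log        = tutorial_jobs.log
Output     = hello_world.out
Error      = hello_world.err
Queue
```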

Life cycle of HTCondor job: (courtesy Greg Thain)

"Old" condor_q format:

$ condor_q [-nobatch -allusers]

-- Schedd: : < @ 01/30/19 11:50:49
 ID       OWNER   SUBMITTED    RUN_TIME  ST PRI SIZE CMD
6376.0    prelz   1/30 10:52  0+00:00:00 I   0  0.0  echo Hello, world!

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp
Total for prelz: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, ...

"New" format:

$ condor_q

-- Schedd: : < @ 01/30/19 10:52:33
OWNER  BATCH_NAME      SUBMITTED  DONE  RUN  IDLE  TOTAL  JOB_IDS
prelz  CMD: /bin/echo  1/30 10:52    _    _     1      1  6376.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp
Total for prelz: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 susp
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, ...
When the job completes, don't forget to check the exit code in the user log.

Why is my job not starting? - The Requirements expression

$ condor_q -better-analyze[:reverse] 6385

-- Schedd: : <
The Requirements expression for job 6385.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    ((TARGET.HasFileTransfer) || (TARGET.FileSystemDomain == MY.FileSystemDomain))

Job 6385.000 defines the following attributes:
(... snip ...)

The Requirements expression for job 6385.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          74  TARGET.Arch == "X86_64"
[7]         101  TARGET.HasFileTransfer

6385.000: Job has not yet been considered by the matchmaker.

6385.000: Run analysis summary ignoring user priority. Of 101 machines,
     27 are rejected by your job's requirements
      8 reject your job because of their own requirements
      0 match and are already running your jobs
      9 match but are serving other users
     57 are available to run your job

Why did my job end up in a 'Held' state?

$ condor_q

-- Schedd: : < @ 01/30/19 12:22:25
 ID          OWNER          SUBMITTED    RUN_TIME  ST PRI SIZE CMD
115318.0     mino.cancelli  11/20 15:35 6+18:37:59 H   0  1.0  add_user_mpiexe
(... etc. etc. ...)

Just find the reason in the job ClassAd (-l and -af work just as for machine ClassAds):

$ condor_q 115318 -af HoldReason
Failed to initialize user log to /home/mino.cancelli/run_condor/run_whatever.log

Jobs are put on hold when something that may be temporary or recoverable, but that Condor cannot fix by itself, prevents the job from making progress.
When the problem is fixed, jobs can be released:
$ condor_release 115318

When the job reports errors on stderr, don't forget to fetch them back:
Error = /path/to/stderr_file

Recap: which options for your I/O needs?

For read-only POSIX access another option would be a web-based file system (e.g. CVMFS, WebDAV, etc.).

The limits shown in the diagram are back-of-the-envelope suggestions - many more-specific cases may require custom attention.

Basic I/O: the 'job sandbox'  

Specify what your job requires (1)

Specify your job Requirements (2)

How to handle many jobs.

Time to use that object storage of ours!  

  1. Let's first stash our files onto the object storage (even if they aren't that large):
    $ s3 put tut/inferno filename=dc_inferno.txt
    16270 bytes remaining (92% complete) ...
    $ s3 put tut/purgatorio filename=dc_purgatorio.txt
    $ s3 put tut/paradiso filename=dc_paradiso.txt
  2. Then our option of choice would be to adapt the code to use object storage natively:
  3. So that sending the resulting static executable (or otherwise self-contained executable) onto a distributed system becomes a piece of cake®:
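As a sketch of that last point (the 'tut' bucket and object names follow the earlier examples, and a word count stands in for the real computation), such a self-contained job could be little more than a wrapper that stages data in and out via the object store:

```shell
#!/bin/sh
# Job wrapper sketch: stage input in from the object store, compute, stage the
# result back out. Bucket and object names are illustrative.
set -e

stage_and_run() {
    s3 get tut/inferno filename=inferno.txt          # stage in
    wc -w < inferno.txt > word_count.txt             # the actual "computation"
    s3 put tut/inferno_words filename=word_count.txt # stage out
}

# Only run where the s3 tool (and its credentials) are actually available.
if command -v s3 >/dev/null 2>&1; then
    stage_and_run
fi
```

Because the wrapper carries no local input or output files, the submit file only needs to transfer the wrapper (and a static s3 executable) to the execution node.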

Other options for S3 access  

For anything else a simple exe is not enough  

Beware the snakes

Any help with this dependency spaghetti?

Interactive [access to] jobs, anyone ?

⇔ See you Tomorrow!

Your contact e-mail resource for issues
with the 'Amico' infrastructure: