'Amico' - special cases
⇔ Program
So far we presented ideas and recipes that should work with any type of workload. Today we'll focus on a few special cases that may not be of interest to everyone, so we present them in order of decreasing generality:
⇔ How was this past night? Sweet dreams?
- Any questions, doubts, comments, dreams?
Let's now consider the case of the Canonical Unportable Untouchable Old Physics Fortran Code - or any other code that maybe doesn't use the full POSIX file semantics, but just cannot be moved away from LIBC stat(), open(), read(), write(), etc.
- We can try mounting an S3 bucket via FUSE using S3FS. Just create a ~/.passwd-s3fs file, containing the access ID and key separated by a colon, and readable only by its owner:
$ echo "$S3_ACCESS_KEY_ID:$S3_SECRET_ACCESS_KEY" > ~/.passwd-s3fs
$ chmod 0600 ~/.passwd-s3fs
- But... there's something wrong:
$ mkdir s3bucket
$ s3fs -o "url=https://rgw.fisica.unimi.it" tut s3bucket
$ ls -l s3bucket
---------- 1 root root 206815 Feb 12 12:14 inferno
---------- 1 root root 212878 Feb 12 12:16 paradiso
---------- 1 root root 207307 Feb 12 12:16 purgatorio
$ fusermount -u s3bucket
- ...we first need to at least add a set of POSIX permissions (octal 0100644) via an attribute in the S3 'header':
$ s3 copy tut/inferno tut/inferno x-amz-meta-mode=33188
$ s3 copy tut/purgatorio tut/purgatorio x-amz-meta-mode=33188
$ s3 copy tut/paradiso tut/paradiso x-amz-meta-mode=33188
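In case you're wondering where 33188 comes from: it's just the decimal form of octal 0100644, i.e. S_IFREG | 0644. A quick sanity check from any POSIX shell:
$ printf '%d\n' 0100644
33188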
- Let's not bother about x-amz-meta-uid and x-amz-meta-gid for now:
$ s3fs -o "url=https://rgw.fisica.unimi.it" tut s3bucket
$ ls -l s3bucket
-rw-r--r-- 1 root root 206815 Feb 12 12:14 inferno
-rw-r--r-- 1 root root 212878 Feb 12 12:16 paradiso
-rw-r--r-- 1 root root 207307 Feb 12 12:16 purgatorio
$ /path/to/rhymes_and_reason s3bucket/inferno
mi disse: <<Quel folletto e` Gianni Schicchi,
ad ogne conoscenza or li fa bruni.
<<Oh!>>, diss'io lui, <<se l'altro non ti ficchi
Ed elli a me: <<Vano pensiero aduni:
quando Fetonte abbandono` li freni,
dovre' io ben riconoscere alcuni
(...)
OK!
$ fusermount -u s3bucket
Testing S3FS
- Then, by creatively assembling a few recipes we saw yesterday, we can try locating machines that have FUSE installed (there's more than one thing in FUSE that can worry a sysadmin - e.g. stray, leftover mounts). If the executing node doesn't provide S3FS we can bring our own (luckily, it's a single executable).
- Actually, since a static compile of s3fs hits a nasty GLIBC bug whose still-'SUSPENDED' state shows the current attitude towards static linking, our portable s3fs.tpk was put together with another technique, demonstrated in the trace_and_package.sh script, which we'll now briefly comment on.
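The gist, as a minimal sketch (the real trace_and_package.sh isn't reproduced in these notes, so option choices and file names are illustrative): run the program once under strace, collect every file it opened, and pack everything into a relocatable tarball.
#!/bin/sh
# Sketch: trace a sample run, then package the executable plus every
# file it touched. Usage: ./trace_and_package.sh ./s3fs [args...]
TRACE=$(mktemp)
strace -f -e trace=open,openat -o "$TRACE" "$@"
# Keep the absolute paths that appear in the trace (paths from failed
# opens are tolerated by tar's --ignore-failed-read below).
grep -o '"/[^"]*"' "$TRACE" | tr -d '"' | sort -u > files.list
tar -czf package.tpk --ignore-failed-read -T files.list
rm -f "$TRACE" files.list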
Here's the test submission:
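A sketch of what such a submit file could look like (the HasFUSE attribute and the wrapper script name are hypothetical - adapt them to the recipes from yesterday):
universe              = vanilla
# Hypothetical machine attribute advertising a working FUSE setup
requirements          = (TARGET.HasFUSE =?= True)
# Hypothetical wrapper: unpacks s3fs.tpk, mounts the bucket, runs the test
executable            = run_s3fs_test.sh
transfer_input_files  = s3fs.tpk
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = s3fs_test.stdout
error  = s3fs_test.stderr
log    = s3fs_test.log
queue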
- Remember: depending on what exactly it does, your code may still fail with S3FS. Tracing the system calls with strace may help in understanding why, but figuring this out is up to you.
Too rapidly going out of fashion: the Standard Universe.
- For the case of monolithic executables this is so powerful and simple that it's still worth demonstrating, while the sun sets:
$ condor_compile cc -o pi pi.c
LINKING FOR CONDOR : /usr/bin/ld -L/usr/lib64/condor -Bstatic --build-id --no-add-needed --eh-frame-hdr --hash-style=gnu -m elf_x86_64 -o pi /usr/lib64/condor/condor_rt0.o /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../ (etc. etc.)
$ ./pi 10
Condor: Notice: Will checkpoint to ./pi.ckpt
Condor: Notice: Remote system calls disabled.
pi = 3.141592653
- Compare the Vanilla and the Standard Universe
Remember: a 'Standard' job will redirect I/O to the submit machine and automatically resume from where it left off. But... as Standard jobs are handled by separate code inside Condor, some features may be unavailable or may break. Again, this is a tool - it may not be the right one for everyone: experiment and, if possible, enjoy.
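For completeness, a minimal Standard-universe submit file for the pi binary built above might look like this (file names are illustrative):
universe   = standard
executable = pi
arguments  = 10
output     = pi.stdout
error      = pi.stderr
log        = pi.log
queue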
Your first (?) Docker® container
- Disclaimer: we can in no way cover all the material in the Docker "get started" tutorial. This is more of a crash-landing. And... if you are already familiar with Docker please relax for a short while.
- The following examples can be run on a Linux machine where the Docker daemon is running and the user is included in the docker UNIX group. This should be doable if you brought your own device.
- Suppose that, in a world dominated by obsolete installations of CERN Scientific Linux, you do need some fresh tool from Debian, or (wow) Ubuntu...!
$ docker search debian
NAME     DESCRIPTION                                    STARS  OFFICIAL
ubuntu   Ubuntu is a Debian-based Linux operating sys…  9179   [OK]
debian   Debian is a Linux distribution that's compos…  2978   [OK]
(... and much more...)
The above is a list of the Docker images available in the central, world-wide "Docker Hub".
Listing Docker® Hub image versions
- Suppose that we now would like a specific version of an image, and we'd like to list all available versions. This suddenly becomes suuuuch a low-level question to ask:
$ curl -s https://registry.hub.docker.com/v1/repositories/ubuntu/tags \
  | python -m json.tool
[
    {
        "layer": "",
        "name": "latest"
    },
    {
        "layer": "",
        "name": "10.04"
    },
(... etc. etc. etc. ...)
Creating custom Docker® images
- Let's then put together a recipe for customising a default image with our requirements:
$ mkdir test_docker
$ cd test_docker
$ cat << EOD > Dockerfile
FROM debian:buster
RUN apt-get update && apt-get install -y libs3-dev
ENTRYPOINT []
EOD
$ docker build -t XXXmyname/buster-libs3:0.1 .
Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM debian:buster
buster: Pulling from library/debian
(...)
Step 2/3 : RUN apt-get update && apt-get install -y libs3-dev
Step 3/3 : ENTRYPOINT []
Successfully built 74ea437e7243
Successfully tagged XXXmyname/buster-libs3:0.1
$ docker tag XXXmyname/buster-libs3:0.1 XXXmyname/buster-libs3:latest
Structure of Docker® images
- A Docker image is a collection of files (it can be 'saved' to and 'loaded' from a tarball) over which a series of layered differential changes are applied (the same model as Git).
$ docker image ls XXXmyname/buster-libs3
REPOSITORY               TAG     IMAGE ID      CREATED         SIZE
XXXmyname/buster-libs3   0.1     0ab5fe086a1a  8 minutes ago   179MB
XXXmyname/buster-libs3   latest  0ab5fe086a1a  8 minutes ago   179MB
$ docker image history XXXmyname/buster-libs3
IMAGE         CREATED      CREATED BY                                    SIZE
0ab5fe086a1a  10 mins ago  /bin/sh -c #(nop) ENTRYPOINT []               0B
d4b9e69c8653  10 mins ago  /bin/sh -c apt-get update && apt-get install  65.3MB
613327ff5c12  7 days ago   /bin/sh -c #(nop) CMD ["bash"]                0B
              7 days ago   /bin/sh -c #(nop) ADD file:22a69a330913adf55  114MB
The local Docker® registry - dr.mi.infn.it
- Satisfactory images can be pushed back to the global Docker Hub (with an appropriate account) or to a local registry (anyone can push from within our LANs):
$ docker tag XXXmyname/buster-libs3:0.1 \
    dr.mi.infn.it/XXXmyname/buster-libs3:0.1
$ docker tag dr.mi.infn.it/XXXmyname/buster-libs3:0.1 \
    dr.mi.infn.it/XXXmyname/buster-libs3:latest
$ docker push dr.mi.infn.it/XXXmyname/buster-libs3
The push refers to repository [dr.mi.infn.it/XXXmyname/buster-libs3]
d7f08b4ff0d5: Pushed
(...)
- The image is now available on the local registry for anyone to grab using the name dr.mi.infn.it/XXXmyname/buster-libs3. This includes any node in the 'Amico' infrastructure running Docker, with no need to transfer anything explicitly (you guessed right: the local registry back-end is on the common object storage):
$ curl -s https://dr.mi.infn.it/v2/_catalog | python -m json.tool
{
    "repositories": [
        "centos-run-openmpi",
        "condordockergeant4example",
        "XXXmyname/buster-libs3",
        "run-adda-openmpi",
        "run-crystal-openmpi",
        "run-crystal17-openmpi",
        "run-gift-spigs"
    ]
}
$ curl -s https://dr.mi.infn.it/v2/XXXmyname/buster-libs3/tags/list
{"name":"XXXmyname/buster-libs3","tags":["0.1","latest"]}
Docker® containers
- Docker is not a virtual machine system. It runs processes on the one Operating System of the host. It uses Cgroups (just like Condor) to properly isolate its processes.
- The filesystem of each Docker process is attached to a container, which uses the same modification-layer system used for images to store any change. The modifications should be thought of as volatile unless docker commit'ted.
- As usual, this is clearer in practice:
$ docker run -it -v `pwd`:/mnt \
    dr.mi.infn.it/XXXmyname/buster-libs3 /bin/bash
root@e289a4940975:/# ls /mnt
(...)
root@e289a4940975:/# which s3
/usr/bin/s3
(... Setup the usual S3 env variables ...)
root@e289a4940975:/# ldd /mnt/rhymes_and_objects
        not a dynamic executable
root@e289a4940975:/# /mnt/rhymes_and_objects s3:tut/inferno
(...)
- Note that:
  - Outbound network connectivity worked out of the box - via IPv4 NAT.
  - We were once again saved by our old-fashioned static build!
root@e289a4940975:/# which c++
(... hmm - nothing ...)
root@e289a4940975:/# apt-get install -y build-essential
Reading package lists... Done
Building dependency tree
(... etc. etc. etc. ...)
root@e289a4940975:/# mkdir /home/rhymes
root@e289a4940975:/# cd /home/rhymes/
(... Argh - c++98 styled std::make_pair<> ...)
root@e289a4940975:/home/rhymes# c++ --std=c++98 -o rhymes_and_reason \
    /mnt/rhymes_and_reason.cpp
root@e289a4940975:/home/rhymes# c++ --std=c++98 -o rhymes_and_objects \
    /mnt/rhymes_and_objects.cpp -ls3
root@e289a4940975:/home/rhymes# ./rhymes_and_reason /mnt/dc_inferno.txt
(...)
root@e289a4940975:/home/rhymes# ./rhymes_and_objects s3:tut/inferno
(...)
- If we now commit this container as a new version of our image, it will contain what we need for running our code independent of the environment (but it will be harder to reproduce, as the recipe is not in a Dockerfile). And much more...:
root@e289a4940975:/# exit
$ docker ps -l
CONTAINER ID  IMAGE                                 COMMAND      CREATED            STATUS                         NAMES
e289a4940975  dr.mi.infn.it/XXXmyname/buster-libs3  "/bin/bash"  About an hour ago  Exited (0) About a minute ago  frosty_aryabhata
$ docker commit e289a4940975 XXXmyname/buster-libs3:0.2
sha256:9d6ae1a2745850768b3a73faaa788cb34bef214de0dceeb1e0a8dcfc12d6b130
$ docker tag XXXmyname/buster-libs3:0.2 \
    dr.mi.infn.it/XXXmyname/buster-libs3:0.2
$ docker push dr.mi.infn.it/XXXmyname/buster-libs3:0.2
The push refers to repository [dr.mi.infn.it/XXXmyname/buster-libs3]
e59c2ff9667d: Pushed
$ docker image ls XXXmyname/buster-libs3
REPOSITORY               TAG     IMAGE ID      CREATED         SIZE
XXXmyname/buster-libs3   0.2     9d6ae1a27458  19 minutes ago  449MB
XXXmyname/buster-libs3   0.1     0ab5fe086a1a  2 hours ago     179MB
XXXmyname/buster-libs3   latest  0ab5fe086a1a  2 hours ago     179MB
- Now anyone can run our code, anywhere Docker® is available:
$ docker run \
    --rm=true -e S3_HOSTNAME=rgw.fisica.unimi.it \
    -e S3_ACCESS_KEY_ID="XXX" -e S3_SECRET_ACCESS_KEY="YYY" \
    dr.mi.infn.it/XXXmyname/buster-libs3:0.2 \
    /home/rhymes/rhymes_and_objects s3:tut/inferno
(...)
- 'Anywhere' includes Condor execute nodes where Docker is running (and the trusty condor user belongs to the docker UNIX group...!)
- Please note that without --rm=true Docker keeps all exited containers (and dangling, untagged images as well) by default, so one must remember to clean them up every now and then:
docker rm `docker ps -q -f='status=exited'`
docker rmi `docker images -q -f='dangling=true'`
Available 'Amico' dependencies
Dependency | Requirement | Available (partitionable) machines (*)
CVMFS mounted (from ATLAS experiment servers) | HasCVMFS | 44
Local installation of Java® (check JavaVendor, JavaVersion, JavaSpecificationVersion) | HasJava | 87
Local installation of Docker®, with the ability to start jobs by the 'condor' user | HasDocker | 11
Ability to handle Parallel Universe jobs (regardless of any local installation of MPI!) | HasMPI | 92
(*) condor_status -pool superpool-cm -constraint HasXXXX, as of February 14th, 2019.
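To request one of these dependencies, a job just has to reference the corresponding attribute in its requirements expression - a minimal sketch:
# Ask for a machine that mounts CVMFS
requirements = (TARGET.HasCVMFS =?= True)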
Submitting a 'Docker Universe' job
- We are now ready to submit our first 'Docker' Universe job. The 'Docker' universe includes very little more than the Vanilla Universe, namely:
  - Automatic inclusion of TARGET.HasDocker in the job Requirements.
  - Launch and cleanup of the container on the execute node via the command configured in the DOCKER config variable:
$ condor_config_val DOCKER
/usr/bin/docker
- Here's the submit file:
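A sketch of what a Docker-universe submit file along these lines could look like (image and file names follow the examples above; everything else is illustrative):
universe      = docker
docker_image  = dr.mi.infn.it/XXXmyname/buster-libs3:0.2
executable    = /home/rhymes/rhymes_and_objects
arguments     = s3:tut/inferno
environment   = "S3_HOSTNAME=rgw.fisica.unimi.it S3_ACCESS_KEY_ID=XXX S3_SECRET_ACCESS_KEY=YYY"
output        = docker_job.stdout
error         = docker_job.stderr
log           = docker_job.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue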
- And here's the (dismal) output, from the node we happened to land on:
$ cat docker_job.stderr
FATAL: kernel too old
- Why? We ran into the ultimate dependency of our job: the set of system calls provided by the kernel. This changes slowly and sparingly, but Docker (like any similar system that doesn't run another kernel) cannot work around it.
- Let's aim for an older distribution, and try to build a smaller image directly with another Dockerfile (sketched below).
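The actual Dockerfile used here isn't reproduced in these notes; a hypothetical one along these lines (the base image and package names are assumptions) would be:
FROM centos:6
# Package names are illustrative - install whatever the job actually needs
RUN yum install -y epel-release && yum install -y libs3
ENTRYPOINT []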
- We should now be able to run the container in the Docker universe:
- And enjoy a coffee break... Perhaps say good-bye to our non-MPI users.
- Remember: the Docker Hub already contains a number of images ready to go and tackle many tasks. Special thanks to Ruggero Turra for the nifty example he committed to GitHub (also available as dr.mi.infn.it/condordockergeant4example) that shows how to run self-contained Geant4 simulations.
Amor, ch'a nullo amato amar perdona,
pur un linguaggio nel mondo non s'usa.
prese costui de la bella persona
che 'l tien legato, o anima confusa,
quando la brina in su la terra assempra
Poi disse a me: <<Elli stessi s'accusa;
che 'l sole i crin sotto l'Aquario tempra
mostro` gia` mai con tutta l'Etiopia
ma poco dura a la sua penna tempra,
Tra questa cruda e tristissima copia
la via e` lunga e 'l cammino e` malvagio,
sanza sperar pertugio o elitropia:
⇔ Parallel "Universe" - main ingredients
- The "Dedicated" Scheduler for a given resource must be unique: exactly one schedd manages any given dedicated resource.
Try running an MPI job
- MPI jobs are, at the very least, just another case of dealing with job dependencies. But these don't scare us anymore.
- The Condor manual recommends packaging MPI applications with CDE, so let's try a simple example using wave_mpi.c:
$ mpicc -o wave_mpi wave_mpi.c -lm
$ cde mpirun -n 4 wave_mpi 10000
$ tar -cf wave_mpi.tar cde-package/
$ gzip -9 wave_mpi.tar
- Time to reuse our run_cde_exe.sh and prepare a nice submit file (sketched below).
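A sketch of such a Vanilla-universe submit file (the arguments that run_cde_exe.sh expects are an assumption here - adapt to the real script):
universe              = vanilla
executable            = run_cde_exe.sh
# Hypothetical convention: tarball first, then the command to run inside it
arguments             = wave_mpi.tar.gz mpirun -n 4 wave_mpi 10000
transfer_input_files  = wave_mpi.tar.gz
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
request_cpus          = 4
output = wave_mpi.stdout
error  = wave_mpi.stderr
log    = wave_mpi.log
queue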
The real Parallel Universe
- Here's how the same submission would look if we used the real Parallel Universe (and the openmpiscript provided in the Condor distribution):
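A sketch of such a Parallel-universe submit file (openmpiscript ships with the Condor distribution; file names are illustrative):
universe              = parallel
executable            = openmpiscript
arguments             = wave_mpi 10000
machine_count         = 4
transfer_input_files  = wave_mpi
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = wave_mpi.$(NODE).stdout
error  = wave_mpi.$(NODE).stderr
log    = wave_mpi.log
queue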
Note and remember that:
  - The executable is specific to a given architecture and set of library versions (recompile as appropriate).
  - The same version of MPI (be it OpenMPI or MPICH - but the script for MPICH is different) has to be installed throughout the cluster.
  - A configuration enabling all hosts to contact each other via SSH is created on the fly (via condor_ssh and sshd.sh) - network connectivity must allow connections into the worker nodes.
  - Debugging requires some creativity. Plus, the script structure was revised in recent Condor versions (using orted_launcher.sh, which removes the need for a real SSHD server, as SSHD was just used to launch orted).
'Vanilla' MPI jobs can (of course) run on Docker
- If we can run mpirun on CDE, we can run it on Docker® too - see the Dockerfile sketched below. Assembling this image takes a while, but we can find it ready as dr.mi.infn.it/prelz/centos-openmpi:latest
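A sketch of what such a Dockerfile might contain (the real one isn't reproduced here; package names assume CentOS 7):
FROM centos:7
# OpenMPI plus a compiler toolchain; CentOS puts the MPI binaries
# under /usr/lib64/openmpi/bin
RUN yum install -y openmpi openmpi-devel gcc gcc-c++ make
ENV PATH=/usr/lib64/openmpi/bin:$PATH
ENTRYPOINT []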
- As we have seen, the real problem with Docker is configuring network inbound connectivity properly. But with the same-host 'Vanilla' jobs this is not needed.
- So let's just try executing a quick&dirty script to compile and run our MPI job inside a Docker container (a sketch follows):
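A quick&dirty sketch along those lines (image name from above; everything else is illustrative):
#!/bin/sh
# Compile and run the MPI example inside the container, bind-mounting
# the current directory on /mnt and running as the invoking user
# rather than root - just as a batch system would.
docker run --rm=true -u "$(id -u):$(id -g)" -v "$(pwd)":/mnt \
    dr.mi.infn.it/prelz/centos-openmpi:latest \
    /bin/sh -c 'cd /mnt && mpicc -o wave_mpi wave_mpi.c -lm && mpirun -n 4 ./wave_mpi 10000'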
- Right ...?
- Uh oh:
-------------------------------------------------------------------------- Open MPI was unable to obtain the username in order to create a path for its required temporary directories. This type of error is usually caused by a transient failure of network-based authentication services (e.g., LDAP or NIS failure due to network congestion), but can also be an indication of system misconfiguration. Please consult your system administrator about these issues and try again. --------------------------------------------------------------------------
- So, let's "consult my system administrator" (?#@%!!)
- (Wearing sysop hat): this requires a small workaround in the form of a setuid-root executable inside the container (it's there already...):
# cc -o /usr/local/sbin/add_user_mpiexec /usr/local/src/add_user_mpiexec.c
# chmod u+s /usr/local/sbin/add_user_mpiexec
# ls -l /usr/local/sbin/add_user_mpiexec
-rwsr-xr-x 1 root root 9048 Feb 19 17:41 /usr/local/sbin/add_user_mpiexec
...and an updated quick&dirty script.
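The source of /usr/local/src/add_user_mpiexec.c wasn't shown; a minimal sketch of what such a setuid helper might do (everything below is an assumption): as root, add a passwd entry for the calling UID so OpenMPI can resolve a username, then drop privileges and exec mpiexec.
/* add_user_mpiexec.c - hypothetical sketch of the setuid workaround */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
    uid_t uid = getuid();
    gid_t gid = getgid();

    /* Append a minimal passwd entry for the calling user
     * (works because the setuid bit gives us euid 0). */
    FILE *pw = fopen("/etc/passwd", "a");
    if (pw != NULL) {
        fprintf(pw, "mpiuser%u:x:%u:%u::/tmp:/bin/sh\n",
                (unsigned)uid, (unsigned)uid, (unsigned)gid);
        fclose(pw);
    }

    /* Drop root privileges back to the calling user before exec'ing. */
    if (setgid(gid) != 0 || setuid(uid) != 0) {
        perror("drop privileges");
        return 1;
    }

    argv[0] = "mpiexec";
    execvp("mpiexec", argv);
    perror("execvp mpiexec");
    return 1;
}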
Bingo. Eat road-runner!
19 February 2019 05:49:21 PM
MPI_WAVE:
  C version.
  Estimate a solution of the wave equation using MPI.
  Using 4 processes.
  Using a total of 1001 points.
  Using 20000 time steps of size 0.0004.
  Computing final solution at time 8
       I      X      F(X)   Exact
       0  0.000     0.000   0.000
       1  0.001     0.006   0.006
       2  0.002     0.013   0.013
(... etc. etc. etc. ...)
⇔ Thank you for staying with us!
- Many MPI job cases can be accommodated within a custom container (a few local customers were served with Intel compilers running on the fly for the executing processor architecture, MPICH, odd dependencies). Freedom to port one's code is worth as much as any other type of freedom!
- Yes, HTCondor can also use, via a few obscure features, the parallel/"dedicated" scheduler to schedule parallel docker containers - the tricky part is establishing a set of network ports that allow MPI to communicate. We are trying to set up a general enough scheme (codename: 'paddock') with the sole initial purpose of characterising the penalty of running on 'HTHPC' instead of HPC. Stay tuned for updates - whenever we are able to make some progress.
- Thank you again for your patience!