Sharing group-owned clusters at the UNIMI Physics Department.
via a 'super'-collector/negotiator (or "startd flocking", © ToddT)
Francesco Prelz, David Rebatto
INFN, sezione di Milano
Summary
- Problem description and expectations.
- Enters: Dan Bradley.
- The idea.
- Journey journal, with the occasional wild beast, or
$$(NegotiatorMatchExprNegotiatorName) encounter.
- A few config toggles, of course.
- What is keeping job submitters and cluster owners happy.
The composite department
- The Physics Department at the State University of Milan is a fairly
large department: 80-some faculty, about 1200 students. Its research activity is structured in a dozen 'groups',
active in various fields (high-energy and nuclear physics, solid state
and condensed matter physics, astrophysics, theory, electronics,
environmental and medical physics, etc...).
- Many of these groups proceeded independently with the purchase of
computing resources. The thought of rationalising this process
only came along a posteriori, even though this should be a
familiar enough scenario for people close to the Condor conspiracy...
- The typical purchase (with a couple of notable exceptions: the LHC Tier-2 centre and the Theory group)
would be a turn-key configuration of one rackful of worker nodes with an InfiniBand
interconnect, for the execution of MPI jobs.
- LHC Tier-2 cores: ~2500. Cores in the other 8 'group' clusters that we were
able to gain access to, as of today: 1968.
- The Tier-2 centre has been running Condor 'as a batch system' since the
early days of WLCG (2002 or so): not investing in other technologies
sounded wise enough...
Enters: Dan Bradley (1)
[Picture taken at the EU Condor Week, Milan - 2006]
Enters: Dan Bradley (2)
- Actually, we did look up the Condor Wiki, and were
inspired by this entry:
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToHaveExecuteMachines
- The description of what this solution is for is
strikingly similar to the department scenario we just described...
- → Just throw partitionable slots into the picture - and possibly the submission of parallel jobs
via the Dedicated Scheduler (a sketch follows).
- The entry is not signed, but the Wiki update history hints at the mind behind it.
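- As a hedged illustration (ours, not from the wiki entry; the scheduler host name is hypothetical), the two extra ingredients on an execute node could look like:
# One partitionable slot spanning the whole machine
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
# Accept parallel-universe jobs from this pool's Dedicated Scheduler
DedicatedScheduler = "DedicatedScheduler@submit.fisica.unimi.it"
STARTD_ATTRS = $(STARTD_ATTRS) DedicatedScheduler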
The idea (1-6)
[Diagram slides] In short:
- Each group cluster keeps its own collector and negotiator.
- Every startd additionally reports to a single 'super' collector.
- A 'super' negotiator sitting on top of that collector can match jobs flocked from any pool onto any cluster.
- Matches made by the local negotiator always win over 'super' matches (see the config below).
"Super"-pool config (1)
NegotiatorName = "whatever-pool"
NEGOTIATOR_MATCH_EXPRS = NegotiatorName
SUPER_COLLECTOR = superpool-cm.fisica.unimi.it
LOCAL_COLLECTOR = $(CONDOR_HOST)
# the local negotiator should only ever report to the local collector
NEGOTIATOR.COLLECTOR_HOST = $(LOCAL_COLLECTOR)
# startds should report to both collectors
STARTD.COLLECTOR_HOST = $(LOCAL_COLLECTOR),$(SUPER_COLLECTOR)
# trust both negotiators
#ALLOW_NEGOTIATOR=$(COLLECTOR_HOST)
ALLOW_NEGOTIATOR = $(LOCAL_COLLECTOR),$(SUPER_COLLECTOR)
# Flocking to super-pool
FLOCK_TO = $(SUPER_COLLECTOR)
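- With the above in place, every startd should also show up in the super collector; a quick sanity check, e.g.:
condor_status -pool superpool-cm.fisica.unimi.it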
"Super"-pool config (2)
# Advertise in the machine ad the name of the pool
ClusterName = $(NegotiatorName)
STARTD_ATTRS = $(STARTD_ATTRS) ClusterName
# Advertise in the machine ad the name of the negotiator that made the match
# for the job that is currently running. We need this in SUPER_START.
CurJobPool = "$$(NegotiatorMatchExprNegotiatorName)"
SUBMIT_EXPRS = $(SUBMIT_EXPRS) CurJobPool
STARTD_JOB_EXPRS = $(STARTD_JOB_EXPRS) CurJobPool
# Turn PREEMPT on only for jobs coming from an external pool
PREEMPT = ($(PREEMPT)) && (MY.CurJobPool =!= $(NegotiatorName))
# We do not want the super-negotiator to preempt local-negotiator matches.
# Therefore, only match jobs if:
# 1. the new match is from the local pool
# OR 2. the existing match is not from the local pool
SUPER_START = NegotiatorMatchExprNegotiatorName =?= $(NegotiatorName) || \
MY.CurJobPool =!= $(NegotiatorName)
START = ($(START)) && ($(SUPER_START))
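- To verify the mechanism (slot and host names hypothetical): once a job is matched, both attributes should be visible in the slot ad:
condor_status -long slot1_1@wn01.fisica.unimi.it | grep -E 'ClusterName|CurJobPool'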
The winding road
- So, we turned the configuration crank and...
jobs started going on hold with this HoldReason:
Cannot expand $$ expression
(NegotiatorMatchExprNegotiatorName).
- After a good deal of debugging, patches for the proper
propagation of 'match time' attributes, both for partitionable slots
and for the Dedicated Scheduler, were proposed and fed back for review.
- Another issue we found was with the
recycling of dynamic slots in the Dedicated Scheduler: they were used once, then forgotten until
CLAIM_WORKLIFE expired.
- All patches were eventually integrated into the releases, starting
with Condor version 8.7.5.
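- Jobs held for this reason are easy to spot, e.g. with:
condor_q -hold -af:j HoldReason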
Keeping job submitters happy
- While the standard universe still serves
a number of customers nicely - those who just run a single executable (usually
compiled from Fortran source...) and avail themselves of
static linking, remote I/O and checkpointing - more complex workloads
are served by:
- Running docker containers (we install docker wherever
possible on the various pools, and use the
HasDocker attribute; see the submit sketch after this list).
- Mounting CERN's CVMFS for read-only access to common software
distributions (we publish a HasCVMFS attribute).
- Providing users with CEPH-based object storage
space, and encouraging them to redirect all job I/O there.
- Waiting for CRIU to
provide workable checkpoint/restore capabilities for docker
containers.
- As mentioned, many users run various
flavours of MPI jobs. While we do configure each local pool with
its own Dedicated Scheduler, we are also trying to make these
MPI set-ups portable via docker, so that we can hope to schedule them
via the DedSched on other pools as well. But this would be another talk...
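- A hedged sketch of a submit file leveraging these attributes (image and executable names are hypothetical):
universe      = docker
docker_image  = cern/cc7-base
executable    = run_analysis.sh
# match only nodes that advertise both docker and CVMFS
requirements  = (HasDocker =?= True) && (HasCVMFS =?= True)
queue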
Keeping local pool owners happy
- Local matches always win: the PREEMPT and SUPER_START expressions shown above ensure that jobs matched by the local negotiator are never preempted on behalf of the 'super' pool, so each group keeps priority on the resources it purchased.
Conclusions
- The set-up seems to be working. Sometimes it
takes some persuasion to un-cling people from their own resources.
- With enough persistence in interacting with
the Condor devel team, order can be brought to cases where the configuration
semantics doesn't produce the expected effects...
- As mentioned, we are trying to build (via
Docker) enough portability into our local MPI applications so that we
can eventually have Condor launch them. This is still work in progress,
and we'll hopefully have progress to report here in 2019.