Sharing group-owned clusters at the UNIMI Physics Department.

via a 'super'-collector/negotiator (or "startd flocking", © ToddT)

Francesco Prelz, David Rebatto

INFN, sezione di Milano

Summary

The composite department

Enter: Dan Bradley (1)

  Picture taken at the EU Condor week,
  Milan - 2006

Enter: Dan Bradley (2)

The idea (1)

The idea (2)

The idea (3)

The idea (4)

The idea (5)

The idea (6)

"Super"-pool config (1)

NegotiatorName = "whatever-pool"
NEGOTIATOR_MATCH_EXPRS = NegotiatorName
SUPER_COLLECTOR = superpool-cm.fisica.unimi.it
LOCAL_COLLECTOR = $(CONDOR_HOST)

# the local negotiator should only ever report to the local collector
NEGOTIATOR.COLLECTOR_HOST = $(LOCAL_COLLECTOR)
# startds should report to both collectors
STARTD.COLLECTOR_HOST = $(LOCAL_COLLECTOR),$(SUPER_COLLECTOR)

# trust both negotiators
#ALLOW_NEGOTIATOR=$(COLLECTOR_HOST)
ALLOW_NEGOTIATOR = $(LOCAL_COLLECTOR),$(SUPER_COLLECTOR)

# Flocking to super-pool
FLOCK_TO = $(SUPER_COLLECTOR)
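
To see which collectors each daemon ends up talking to, the $(NAME) substitutions above can be traced by hand. The snippet below is only a toy sketch of that substitution, not the real HTCondor configuration parser. The host name local-cm.example.org is a hypothetical stand-in for $(CONDOR_HOST), which the slide does not spell out.

```python
import re

# Toy model of $(NAME) macro substitution; it handles only simple
# references and is NOT the real HTCondor configuration parser.
MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_.]*)\)")


def expand(value, table):
    """Repeatedly replace $(NAME) with table[NAME] until nothing changes."""
    previous = None
    while previous != value:
        previous = value
        value = MACRO.sub(lambda m: table[m.group(1)], value)
    return value


config = {
    # Hypothetical local central manager (assumption, not from the slides).
    "CONDOR_HOST": "local-cm.example.org",
    "SUPER_COLLECTOR": "superpool-cm.fisica.unimi.it",
    "LOCAL_COLLECTOR": "$(CONDOR_HOST)",
    "NEGOTIATOR.COLLECTOR_HOST": "$(LOCAL_COLLECTOR)",
    "STARTD.COLLECTOR_HOST": "$(LOCAL_COLLECTOR),$(SUPER_COLLECTOR)",
    "ALLOW_NEGOTIATOR": "$(LOCAL_COLLECTOR),$(SUPER_COLLECTOR)",
    "FLOCK_TO": "$(SUPER_COLLECTOR)",
}

resolved = {name: expand(value, config) for name, value in config.items()}
for name in ("NEGOTIATOR.COLLECTOR_HOST", "STARTD.COLLECTOR_HOST", "FLOCK_TO"):
    print(f"{name} = {resolved[name]}")
```

In this sketch the local negotiator resolves to the local collector only, while the startd resolves to both collectors, which is the asymmetry the configuration relies on.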

"Super"-pool config (2)

# Advertise in the machine ad the name of the pool
ClusterName = $(NegotiatorName)
STARTD_ATTRS = $(STARTD_ATTRS) ClusterName

# Advertise in the machine ad the name of the negotiator that made the match
# for the job that is currently running.  We need this in SUPER_START.
CurJobPool = "$$(NegotiatorMatchExprNegotiatorName)"
SUBMIT_EXPRS = $(SUBMIT_EXPRS) CurJobPool
STARTD_JOB_EXPRS = $(STARTD_JOB_EXPRS) CurJobPool

# Turn PREEMPT on only for jobs coming from an external pool
PREEMPT = ($(PREEMPT)) && (MY.CurJobPool =!= $(NegotiatorName))

# We do not want the super-negotiator to preempt local-negotiator matches.
# Therefore, only match jobs if:
#      1. the new match is from the local pool
#   OR 2. the existing match is not from the local pool
SUPER_START = NegotiatorMatchExprNegotiatorName =?= $(NegotiatorName) || \
              MY.CurJobPool =!= $(NegotiatorName)

START = ($(START)) && ($(SUPER_START))
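
The effect of SUPER_START and of the extra PREEMPT clause is easiest to check as a truth table. The following is a minimal sketch in Python that models the ClassAd =?= operator (which always yields true or false, even when an operand is undefined) and evaluates both expressions for each combination. The pool name "super-pool" and the function names are assumptions made for illustration; None stands in for an undefined attribute, for example CurJobPool on an idle slot.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value


def is_identical(a, b):
    """Model of ClassAd '=?=': always True or False, even with UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def super_start(local_pool, new_match_pool, cur_job_pool):
    """SUPER_START: new match is local OR existing match is not local."""
    return (is_identical(new_match_pool, local_pool)
            or not is_identical(cur_job_pool, local_pool))


def preempt_allowed(base_preempt, local_pool, cur_job_pool):
    """PREEMPT: the original policy, but only for jobs from another pool."""
    return base_preempt and not is_identical(cur_job_pool, local_pool)


LOCAL = "whatever-pool"   # NegotiatorName from the slide
SUPER = "super-pool"      # hypothetical name of the super-negotiator

results = {}
for new_match in (LOCAL, SUPER):
    for running in (UNDEFINED, SUPER, LOCAL):
        results[(new_match, running)] = super_start(LOCAL, new_match, running)
        print(f"new match from {new_match!r:16} slot runs job from "
              f"{running!r:16} SUPER_START = {results[(new_match, running)]}")
```

Under this model the only rejected combination is a match made by the super-negotiator on a slot whose current job was matched by the local negotiator, which is the protection the comments in the configuration describe.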

The winding road

  • So, we turned the configuration crank and... jobs started going on hold with this HoldReason:
    Cannot expand $$ expression (NegotiatorMatchExprNegotiatorName).
  • After a good deal of debugging, we proposed patches and submitted them for review, so that 'match time' attributes are propagated properly both for partitionable jobs and for the dedicated scheduler.
  • Another issue we found was with the recycling of dynamic slots in the Dedicated Scheduler: they were used once, then forgotten until CLAIM_WORKLIFE expired.
  • All patches were eventually integrated, starting with Condor version 8.7.5.
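
The hold reason quoted above follows from how a $$(Attribute) reference in the job ad is filled in at match time. The sketch below is a toy model of that step, assuming the value is looked up in the ad delivered with the match. The names expand_match_time and HoldJob are hypothetical, and this is not HTCondor code. It shows why the job is held when the negotiator attribute never reaches the match ad.

```python
import re

# Toy model of match-time $$(Attribute) substitution; not HTCondor code.
DOLLAR_DOLLAR = re.compile(r"\$\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


class HoldJob(Exception):
    """Raised when the job would be put on hold (hypothetical name)."""


def expand_match_time(value, match_ad):
    """Replace each $$(Attr) with match_ad[Attr]; hold the job if missing."""
    def lookup(m):
        name = m.group(1)
        if name not in match_ad:
            raise HoldJob(f"Cannot expand $$ expression ({name}).")
        return str(match_ad[name])
    return DOLLAR_DOLLAR.sub(lookup, value)


# CurJobPool as set in the configuration on the slide.
CUR_JOB_POOL = '"$$(NegotiatorMatchExprNegotiatorName)"'

# Attribute propagated correctly: the reference is filled in.
good_ad = {"NegotiatorMatchExprNegotiatorName": "whatever-pool"}
print(expand_match_time(CUR_JOB_POOL, good_ad))

# Attribute lost on the way to the match ad: the job goes on hold.
try:
    expand_match_time(CUR_JOB_POOL, {})
except HoldJob as reason:
    print("HoldReason:", reason)
```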

Keeping job submitters happy

Keeping local pool owners happy

Conclusions

  • The set-up seems to be working. Sometimes it takes some persuasion to un-cling people from their own resources.
  • With enough persistence in interacting with the Condor devel team, order can be brought to cases where the configuration semantics doesn't produce the expected effects...
  • As mentioned, we are trying to build (via Docker) enough portability into our local MPI applications so that we can eventually have Condor launch them. This is still a work in progress, and we hope to have results to report here in 2019.

Questions

  • Thank you for your time.
