1.7 Version History
This section provides descriptions of what features have been added or
bugs fixed for each version of Condor.
Each release series is covered in its own section.
1.7.1 Development Release Series 6.1
This is the first development release.
You should only install it if you really know what you are doing.
Development releases will come out quickly, with lots of new features
being added, many bugs fixed, etc.
It should not be used for a production pool.
1.7.1.1 Version 6.1.3
- Owners can now specify how the SMP-Startd partitions the system
resources into the different types and numbers of virtual machines,
specifying the number of CPUs, megs of RAM, megs of swap space, etc.,
in each.
The number of virtual machines reporting of any given type can also be
changed dynamically through the condor_smp_admin tool.
- Fixed a bug in the reporting of virtual memory and disk space on
SMP machines where each virtual machine represented was advertising
the total in the system for itself, instead of its own share.
Now, both the totals, and the virtual machine-specific values are
advertised.
- Fixed bug in ClassAd printing when you tried to display an
integer or float attribute that didn't exist in the given ClassAd.
This could show up in condor_status, condor_q, condor_history, etc.
- Various commands sent to the Condor daemons now have separate
debug levels associated with them.
For example, commands such as ``keep-alives'', and the command sent
from the condor_kbdd to the condor_startd are only seen in the
various log files if D_FULLDEBUG is turned on, instead of
D_COMMAND, which is now enabled by default for all daemons on
all platforms.
Administrators retaining their old configuration when upgrading to
this version are encouraged to enable D_COMMAND in the
SCHEDD_DEBUG setting.
In addition, for IRIX and Digital Unix machines, it should be enabled
in the STARTD_DEBUG setting as well.
See section 3.4.3 for details on
debug levels in Condor.
- New debug levels added to Condor:
- D_NETWORK, used by various daemons in Condor to report
various network statistics about the Condor daemons.
- D_KEYBOARD, used by the condor_startd to print out
statistics about remote tty and console idle times in the
condor_startd.
This information used to be logged at D_FULLDEBUG, along with
everything else, so now, you can see just the idle times, and/or have
the information stored to a separate file.
- Fixed a bug in the SMP-Startd's load average computation that
could cause certain rare exceptions to be treated as fatal, when in
fact, the Startd could recover from them.
- Added a -run option to condor_q, which displays
information for running jobs, including the remote host where each job
is running.
- Macros can now be incrementally defined. See
section 3.4.1 for more details.
- condor_config_val can now be used to set configuration
variables. See the condor_config_val man page
for more details.
1.7.1.2 Version 6.1.2
- Fixed some bugs in the condor_install script.
Also, enhanced condor_install to customize the path to perl in
various perl scripts used by Condor.
- Fixed a problem with our build environment that left some files
out of the release.tar files in the binary releases on some
platforms.
- condor_dagman, ``DAGMan'' (see section 2.10 for details), is now
included in the development release by default.
- Fixed a bug in the computation of the total physical memory on
HPUX machines that resulted in an overflow on machines with
lots of RAM (over 1 gigabyte).
Also, if you define ``MEMORY'' in your config file, that value will
override whatever value Condor computes for your machine.
- Fixed a bug in condor_starter.pvm, the PVM version of the
Condor starter (available as an optional ``Contrib module''), when you
disabled STARTER_LOCAL_LOGGING.
Now, having this set to ``False'' will properly place debug messages
from condor_starter.pvm into the ShadowLog file of the
machine that submitted the job (as opposed to the StarterLog
file on the machine executing the job).
1.7.1.3 Version 6.1.1
- Fixed a bug in the condor_startd where we compute the load
average caused by Condor that was causing us to get the wrong values.
This could cause a cycle of continuous job suspends and job resumes.
- Beginning with this version, any jobs linked with the Condor
checkpoint libraries will use the zlib compression code (used by gzip
and others) to compress periodic checkpoints before they are written
to the network.
These compressed checkpoints are uncompressed at startup time.
This saves network bandwidth, disk space, as well as time (if the
network is the bottleneck to checkpointing, which it usually is).
In future versions of Condor, all checkpoints will probably be
compressed, but at this time, it is only used for periodic
checkpoints.
Note, you have to relink your jobs with the condor_compile command
to have this feature enabled.
Old jobs (not relinked) will continue to run just fine, but their
checkpoints won't be compressed.
- condor_status now has better support for displaying checkpoint
server ClassAds.
- More contrib modules from the development series are now
available, such as the checkpoint server, PVM support, and the
CondorView server.
- Fixed some minor bugs in the UserLog code that were causing
problems for DAGMan in exceptional error cases.
- Fixed an obscure bug in the logging code when D_PRIV was
enabled that could result in incorrect file permissions on log files.
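The periodic-checkpoint compression introduced in version 6.1.1 relies on zlib. As a rough illustration of the idea only (this is not Condor's checkpoint code, and the function names and the stand-in "image" below are hypothetical), a buffer can be compressed before it is written to the network and restored on the other side:

```python
import zlib


def compress_checkpoint(image: bytes, level: int = 6) -> bytes:
    """Compress a checkpoint image (hypothetical helper, not a Condor API)."""
    return zlib.compress(image, level)


def restore_checkpoint(blob: bytes) -> bytes:
    """Undo compress_checkpoint() when the job starts up again."""
    return zlib.decompress(blob)


# Stand-in for a process image: repetitive data, such as zero-filled pages.
image = b"\x00" * 4096 + b"condor" * 512
blob = compress_checkpoint(image)

# The round trip must reproduce the original bytes exactly.
assert restore_checkpoint(blob) == image
print(len(image), "bytes before,", len(blob), "bytes after compression")
```

How much is saved depends entirely on the contents of the image; repetitive data such as the sample above compresses well, while already-compressed data may not shrink at all.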
1.7.1.4 Version 6.1.0
- Support has been added to the condor_startd to run multiple
jobs on SMP machines.
See section 3.10.7 for details about setting up and
configuring SMP support.
- The expressions that control the condor_startd policy for
vacating jobs have been simplified.
See section 3.6 for complete details on the new
policy expressions, and section 3.6.10 for an explanation of what's
different from the version 6.0 expressions.
- We now perform better tracking of processes spawned by Condor.
If children die and are inherited by init, we still know they belong
to Condor.
This allows us to better ensure we don't leave processes lying around
when we need to get off a machine, and enables us to have a much more
accurate computation of the load average generated by Condor (the
CondorLoadAvg as reported by the condor_startd).
- The condor_collector now can store historical information
about your pool state.
This information can be queried with the condor_stats program (see
its man page), which is used by the
condor_view Java GUI, which is available as a separate contrib
module.
- Condor jobs can now be put in a ``hold'' state with the
condor_hold command.
Such jobs remain in the job queue (and can be viewed with condor_q),
but there will not be any negotiation to find machines for them.
If a job is having a temporary problem (like the permissions are
wrong on files it needs to access), the job can be put on hold until
the problem can be solved.
Jobs put on hold can be released with the condor_release command.
- condor_userprio now has the notion of user factors as a
way to create different groups of users in different priority levels.
See section 3.5 for details.
This includes the ability to specify a local priority domain, and all
users from other domains get a much worse priority.
- Usage statistics by user are now available from
condor_userprio.
See the man page for details.
- The condor_schedd has been enhanced to enable ``flocking'',
where it seeks matches with machines in multiple pools if its requests
cannot be serviced in the local pool.
See section 3.10.6 for more details.
- The condor_schedd has been enhanced to give condor_q and
other interactive tools better response time.
- The condor_schedd has also been enhanced to allow it to check
the permissions of the files you specify for input, output, error and
so on.
If the schedd doesn't have the required access rights to the files,
the jobs will not be submitted, and condor_submit will print an
error message.
- When you perform a condor_rm command, and the job you removed
was using a ``user log'', the remove event is now recorded into the
log.
- Two new attributes have been added to the job classad when it
begins executing: RemoteHost and LastRemoteHost.
These attributes list the IP address and port of the startd that is
either currently running the job, or the last startd to run the job
(if it's run on more than one machine).
This information helps users track their job's execution more closely,
and allows administrators to troubleshoot problems more effectively.
- The performance of checkpointing was increased by using larger
buffers for the network I/O used to get the checkpoint file on and off
the remote executing host (this helps for all pools, with or without
checkpoint servers).
1.7.2 Stable Release Series 6.0
6.0 is the first version of Condor with ClassAds.
It contains many other fundamental enhancements over version 5.
It is also the first official stable release series, with a
development series (6.1) simultaneously available.
1.7.2.1 Version 6.0.3
- Fixed a bug that was causing the hostname of the submit machine
that claimed a given execute machine to be incorrectly reported by the
condor_startd at sites using NIS.
- Fixed a bug in the condor_startd's benchmarking code that
could cause a floating point exception (SIGFPE, signal 8) on very,
very fast machines, such as newer Alphas.
- Fixed an obscure bug in condor_submit that could happen when
you set a requirements expression that references the ``Memory''
attribute.
The bug only showed up with certain formations of the requirement
expression.
1.7.2.2 Version 6.0.2
- Fixed a bug in the fcntl() call for Solaris 2.6 that was
causing problems with file I/O inside Fortran jobs.
- Fixed a bug in the way the DEFAULT_DOMAIN_NAME
parameter was handled so that this feature now works properly.
- Fixed a bug in how the SOFT_UID_DOMAIN config file
parameter was used in the condor_starter.
This feature is also documented in the manual now (see
section 3.4.5).
- You can now set the RunBenchmarks expression to ``False'' and
the condor_startd will never run benchmarks, not even at startup
time.
- Fixed a bug in getwd() and getcwd() for sites
that use the NFS automounter.
This bug was only present if user programs tried to call
chdir() themselves.
Now, this is supported.
- Fixed a bug in the way we were computing the available virtual
memory on HPUX 10.20 machines.
- Fixed a bug in condor_q -analyze so it will correctly identify
more situations where a job won't run.
- Fixed a bug in condor_status -format so that if the requested
attribute isn't available for a given machine, the format string
(including spaces, tabs, newlines, etc) is still printed, just the
value for the requested attribute will be an empty string.
- Fixed a bug in the condor_schedd that was causing
condor_history to not print out the first ClassAd attribute of all
jobs that have completed.
- Fixed a bug in condor_q that would cause a segmentation fault
if the argument list was too long.
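The condor_status -format fix in version 6.0.2 (a missing attribute still prints the format string, with an empty value) can be pictured with a small sketch. This is not the condor_status implementation: the function name is hypothetical, the ClassAd is modeled as a plain dictionary, and only a string conversion (%s) is handled.

```python
def render_format(fmt: str, ad: dict, attr: str) -> str:
    """Apply a printf-style format to one attribute of a ClassAd-like dict.

    Hypothetical helper: a missing attribute is rendered as an empty
    string, so the literal text around the conversion (spaces, tabs,
    newlines) is still printed.
    """
    value = ad.get(attr, "")
    return fmt % (value,)


machine_ad = {"Name": "node1.example.org", "Arch": "INTEL"}

# Attribute present: the value is substituted as usual.
print(render_format("Arch: %s\n", machine_ad, "Arch"), end="")

# Attribute missing: the surrounding text is kept, the value is empty.
print(render_format("OpSys: %s\n", machine_ad, "OpSys"), end="")
```

Keeping the literal text means that column-oriented output stays aligned even when some machines do not advertise the requested attribute.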
1.7.2.3 Version 6.0.1
1.7.2.4 Version 6.0 pl4
NOTE: Back in the bad old days, we used this evil ``patch level''
version number scheme, with versions like ``6.0pl4''.
This has all gone away in the current versions of Condor.
- Fixed a bug that could cause a segmentation violation in the
condor_schedd under rare conditions when a condor_shadow exited.
- Fixed a bug that prevented core files created by user jobs
from being transferred back to the
submit machine for inspection by the user who submitted them.
- Fixed a bug that would cause some Condor daemons to go into an
infinite loop if the "ps" command output duplicate entries.
This only happens on certain platforms, and even then, only under rare
conditions.
However, the bug has been fixed and Condor now handles this case
properly.
- Fixed a bug in the condor_shadow that would cause a
segmentation violation if there was a problem writing to the user log
file specified by "log = filename" in the submit file used with
condor_submit.
- Added new command line arguments for the Condor daemons to support
saving the PID (process id) of the given daemon to a file, sending a
signal to the PID specified in a given file, and overriding what
directory is used for logging for a given daemon.
These are primarily for use with the condor_kbdd when it needs to be
started by XDM for the user logged onto the console, instead of
running as root.
See section 3.10.4 on ``Installing the condor_kbdd''
for details.
- Added support for the CREATE_CORE_FILES config file
parameter.
If this setting is defined, Condor will override whatever limits you
have set and in the case of a fatal error, will either create core
files or not depending on the value you specify ("true" or "false").
- Most Condor tools (condor_on, condor_off,
condor_master_off, condor_restart, condor_vacate,
condor_checkpoint, condor_reconfig, condor_reconfig_schedd,
condor_reschedule) can now take the IP address and port you want to
send the command to directly on the command line, instead of only
accepting hostnames.
This IP/port must be passed in a special format used in Condor (which
you will see in the daemon's log files, etc).
It is of the form:
<ip.address:port>.
For example:
<123.456.789.123:4567>.
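The <ip.address:port> form described in the last item above is easy to recognize mechanically. The following is a minimal sketch of a parser for that form, written for illustration only; the function name and the sample address are hypothetical, and the validation shown here is an assumption rather than a description of what the Condor tools check.

```python
import re
from typing import Tuple

# Matches "<host:port>": angle brackets, a host part, a colon, a decimal port.
_ADDR_RE = re.compile(r"^<([^:<>]+):(\d+)>$")


def parse_daemon_address(text: str) -> Tuple[str, int]:
    """Split a string of the form <ip.address:port> into (ip, port).

    Hypothetical helper; raises ValueError if the string is not in
    that form or the port is outside the TCP/UDP port range.
    """
    match = _ADDR_RE.match(text.strip())
    if match is None:
        raise ValueError("not of the form <ip.address:port>: %r" % text)
    host, port = match.group(1), int(match.group(2))
    if not 0 < port < 65536:
        raise ValueError("port out of range: %d" % port)
    return host, port


# 192.0.2.10 is a documentation-only address used as sample input.
print(parse_daemon_address("<192.0.2.10:4567>"))
```

On a Unix shell, the angle brackets would normally need to be quoted when such an address is passed on the command line, since < and > are redirection characters.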
1.7.2.5 Version 6.0 pl3
- Fixed a bug that would cause a segmentation violation if a
machine was not configured with a full hostname as either the official
hostname or as any of the hostname aliases.
- If your host information does not include a fully qualified
hostname anywhere, you can specify a domain in the
DEFAULT_DOMAIN_NAME parameter in your global config file
which will be appended to your hostname whenever Condor needs to use a
fully qualified name.
- All Condor daemons and most tools now support a "-version"
option that displays the version information and exits.
- The condor_install script now prompts for a short description
of your pool, which it stores in your central manager's local config
file as COLLECTOR_NAME.
This description is used to display the name of your pool when sending
information to the Condor developers.
- When the condor_shadow process starts up, if it is configured
to use a checkpoint server and it cannot connect to the server, the
shadow will check the MAX_DISCARDED_RUN_TIME parameter.
If the job in question has accumulated more CPU minutes than this
parameter, the condor_shadow will keep trying to connect to the
checkpoint server until it is successful.
Otherwise, the condor_shadow will just start the job over from
scratch immediately.
- If Condor is configured to use a checkpoint server, it will only
use the checkpoint server.
Previously, if there was a problem connecting to the checkpoint
server, Condor would fall back to using the submit machine to store
checkpoints.
However, this caused problems with local disks filling up on machines
without much disk space.
- Fixed a rare race condition that could cause a segmentation
violation if a Condor daemon or tool opened a socket to a daemon and
then closed it right away.
- All TCP sockets in Condor now have the "keep alive" socket option
enabled.
This allows Condor daemons to notice if their peer goes away in a hard
crash.
- Fixed a bug that could cause the condor_schedd to kill jobs
without a checkpoint during its graceful shutdown method under certain
conditions.
- The condor_schedd now supports the
MAX_SHADOW_EXCEPTIONS parameter.
If the condor_shadow processes for a given match die due to a fatal
error (an exception) more than this number of times, the
condor_schedd will now relinquish that match and stop trying to
spawn condor_shadow processes for it.
- The "-master" option to condor_status now displays the Name
attribute of all condor_master daemons in your pool, as opposed
to the Machine attribute.
This helps for pools that have submit-only machines joining them, for
example.
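The MAX_DISCARDED_RUN_TIME behavior described in the list above amounts to a three-way decision at condor_shadow startup. The sketch below restates that decision as described in the text; it is not the condor_shadow source, all names are hypothetical, and the accumulated CPU time is assumed to be expressed in the same units as the parameter.

```python
from enum import Enum


class Action(Enum):
    USE_SERVER = "use the checkpoint server"
    KEEP_RETRYING = "keep retrying the connection"
    RESTART_FROM_SCRATCH = "start the job over from scratch"


def choose_action(server_reachable: bool,
                  accumulated_cpu: float,
                  max_discarded_run_time: float) -> Action:
    """Decide what to do when the shadow starts (illustrative only)."""
    if server_reachable:
        return Action.USE_SERVER
    # The job has more work at stake than the site is willing to discard,
    # so wait for the checkpoint server instead of losing that work.
    if accumulated_cpu > max_discarded_run_time:
        return Action.KEEP_RETRYING
    # Little work would be lost, so restarting immediately is cheaper.
    return Action.RESTART_FROM_SCRATCH


print(choose_action(False, accumulated_cpu=120, max_discarded_run_time=60).value)
print(choose_action(False, accumulated_cpu=5, max_discarded_run_time=60).value)
```

A larger parameter value therefore favors restarting jobs quickly, while a smaller one favors waiting for the checkpoint server to come back.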
1.7.2.6 Version 6.0 pl2
- In patch level 1, code was added to more accurately find the
full hostname of the local machine.
Part of this code relied on the resolver, which on many platforms is a
dynamic library.
On Solaris, this library has needed many security patches and the
installation of Solaris on our development machines produced binaries
that are incompatible with sites that haven't applied all the security
patches.
So, the code in Condor that relies on this library was simply removed
for Solaris.
- Version information is now built into Condor.
You can see the CondorVersion attribute in every daemon's
ClassAd.
You can also run the UNIX command "ident" on any Condor binary to see
the version.
- Fixed a bug in the "remote submit" mode of condor_submit.
The remote submit wasn't connecting to the specified schedd, but was
instead trying to connect to the local schedd.
- Fixed a bug in the condor_schedd that could cause it to exit
with an error due to its log file being locked improperly under
certain rare circumstances.
1.7.2.7 Version 6.0 pl1
- condor_kbdd bug patched: On Silicon Graphics and DEC Alpha
ports, if your X11 server is using Xauthority user authentication, and
the condor_kbdd was unable to read the user's .Xauthority
file for some reason, the condor_kbdd would fall into an infinite
loop.
- When using a Condor Checkpoint Server, the protocol between the
Checkpoint Server and the condor_schedd has been made more robust
for a faulty network connection. Specifically, this improves
reliability when submitting jobs across the Internet and using a
remote Checkpoint Server.
- Fixed a bug concerning MAX_JOBS_RUNNING: The parameter
MAX_JOBS_RUNNING in the config file controls the maximum
number of simultaneous condor_shadow processes allowed on your
submission machine.
The bug was that the number of shadow processes could, under certain
conditions, exceed the number specified by
MAX_JOBS_RUNNING.
- Added new parameter JOB_RENICE_INCREMENT that can be
specified in the config file.
This parameter specifies the UNIX nice level at which the condor_starter
will start the user job.
It works just like the renice(1) command in UNIX.
It can be any integer between 1 and 19; a value of 19 is the lowest
possible priority.
- Improved response time for condor_userprio.
- Fixed a bug that caused periodic checkpoints to happen more
often than specified.
- Fixed some bugs in the installation procedure for certain
environments that weren't handled properly, and made the documentation
for the installation procedure more clear.
- Fixed a bug on IRIX that could allow vanilla jobs to be started
as root under certain conditions.
This was caused by the non-standard uid that user "nobody" has on
IRIX.
Thanks to Chris Lindsey at NCSA for help discovering this bug.
- On machines where the /etc/hosts file is misconfigured to
list just the hostname first, then the full hostname as an alias,
Condor now correctly finds the full hostname anyway.
- The local config file and local root config file are now
located only through the LOCAL_CONFIG_FILE and
LOCAL_ROOT_CONFIG_FILE parameters in the global config
files.
Previously, /etc/condor and user condor's home directory
(~condor) were searched as well.
This could cause problems with submit-only installations of Condor at
a site that already had Condor installed.
1.7.2.8 Version 6.0 pl0
- Initial Version 6.0 release.
condor-admin@cs.wisc.edu