Info
Where "cpus" are mentioned, read "cores". I'll try to make this clearer at some point.
When Gridengine (e.g., 8.1.9) calculates cpu usage information, it is done based on slots used and either cpu time or wallclock time. This works if each slot corresponds to a single cpu. But, if more than one cpu is allocated to a slot, this calculation does not give the desired usage information.
For the purposes of this article:
execd_params
is configured with ACCT_RESERVED_USAGE=true
and SHARETREE_RESERVED_USAGE=true
so that wallclock time rather than cpu time is used.cpu = wallclock * nslots * ncpus_per_slot
The complex used to track the number of cpus used per slot is specified in the execd_params
with SLOT_MULTIPLIER_NAME
. E.g.
execd_params SLOT_MULTIPLIER_NAME=ncpus
The complex would be defined as:
ncpus ncpus INT <= FORCED YES 0 1000
A queue would be configured with:
complex_values ncpus=4
So, a job request for 4 slots X 16 cpus would be:
#$ -pe dev 4
#$ -l ncpus=16
and a run of 30s would amount to:
cpu = 30 * 4 *16 = 1920
versus what is currently returned:
cpu = 30 *4 = 120
source/libs/sgeobj/sge_conf.h
:mconf_get_slot_multiplier_name()
declarationsource/libs/sgeobj/sge_conf.c
:slot_multiplier_name
static variable to hold value set in execd_params SLOT_MULTIPLIER_NAME
mconf_get_slot_multiplier_name()
source/daemons/execd/load_avg.h
:build_reserved_usage()
declaration to accept reference to jobsource/daemons/execd/load_avg.c
:build_reserved_usage()
to accept reference to jobbuild_reserved_usage()
to get slot multiplier and use it when calculating cpu usageget_slot_multiplier()
static function get the multiplier value if defined, or 1.0 otherwisecalculate_reserved_usage()
to get job reference and provide it when calling build_reserved_usage()
source/daemons/execd/reaper_execd.c
:build_derived_final_usage()
to provide job reference when calling build_reserved_usage()
Patch files (based off of https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/:
Changes to the code are minimal. The highlights are:
build_reserved_usage()
function to take a job reference so that the consumable information can be obtainedbuild_reserved_usage()
to use the slot multiplier valueThese changes make it possible to account for the number of cpus per slot which is critical to calculating cpu usage by wallclock.