Where "cpus" are mentioned, read "cores". I'll try to make this clearer at some point.

Introduction

When Gridengine (e.g., 8.1.9) calculates cpu usage information, it bases the calculation on the number of slots used and either cpu time or wallclock time. This works when each slot corresponds to a single cpu. However, when more than one cpu is allocated to a slot, the calculation does not give the desired usage information.

For the purposes of this article:

with the goal of calculating cpu usage as:

cpu = wallclock * nslots * ncpus_per_slot
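As a minimal sketch of the formula above (not the Gridengine code itself), the calculation can be written as a small Python function. The function and parameter names are illustrative assumptions; the default of 1 cpu per slot reproduces the unmodified slot-based calculation.

```python
def cpu_usage(wallclock, nslots, ncpus_per_slot=1):
    """Return wallclock-based cpu usage in cpu-seconds.

    wallclock       -- run time of the job in seconds
    nslots          -- number of slots granted to the job
    ncpus_per_slot  -- cpus (cores) behind each slot; 1 gives the
                       plain slot-based figure
    """
    if wallclock < 0 or nslots < 1 or ncpus_per_slot < 1:
        raise ValueError(
            "wallclock must be >= 0; nslots and ncpus_per_slot must be >= 1"
        )
    return wallclock * nslots * ncpus_per_slot


if __name__ == "__main__":
    # One cpu per slot: the multiplier has no effect.
    print(cpu_usage(60, 2))
    # Eight cpus per slot: usage scales by the multiplier.
    print(cpu_usage(60, 2, 8))
```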

Usage

The complex used to track the number of cpus per slot is named by SLOT_MULTIPLIER_NAME in the execd_params. For example:

execd_params	SLOT_MULTIPLIER_NAME=ncpus

The complex would be defined as:

ncpus               ncpus       INT       <=    FORCED      YES        0        1000

A queue would be configured with:

complex_values        ncpus=4

So, a job request for 4 slots with 16 cpus per slot would be:

#$ -pe dev 4
#$ -l ncpus=16

and a run of 30s would amount to:

cpu = 30 * 4 * 16 = 1920

versus what is currently returned:

cpu = 30 * 4 = 120
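The two figures above can be reproduced from the job request with a short Python sketch. This is only an illustration of the arithmetic, not how execd reads the request: parse_request is a hypothetical helper, and it assumes a fixed slot count after -pe (no slot ranges) and a complex named ncpus, as in the example configuration.

```python
def parse_request(script, multiplier_name="ncpus"):
    """Extract (nslots, ncpus_per_slot) from '#$' directives.

    Hypothetical helper for illustration only. Both values default to 1
    when the corresponding directive is absent.
    """
    nslots, per_slot = 1, 1
    for line in script.splitlines():
        line = line.strip()
        if not line.startswith("#$"):
            continue
        args = line[2:].split()
        if len(args) >= 3 and args[0] == "-pe":
            # '-pe <pe_name> <slots>': assumes a single integer slot count.
            nslots = int(args[2])
        elif len(args) >= 2 and args[0] == "-l":
            # '-l name=value[,name=value...]'
            for item in args[1].split(","):
                name, _, value = item.partition("=")
                if name == multiplier_name and value:
                    per_slot = int(value)
    return nslots, per_slot


JOB_SCRIPT = """\
#$ -pe dev 4
#$ -l ncpus=16
"""

if __name__ == "__main__":
    wallclock = 30  # seconds
    nslots, per_slot = parse_request(JOB_SCRIPT)
    print("slots only:      ", wallclock * nslots)
    print("with multiplier: ", wallclock * nslots * per_slot)
```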

Implementation

Changes

source/libs/sgeobj/sge_conf.h:

source/libs/sgeobj/sge_conf.c:

source/daemons/execd/load_avg.h:

source/daemons/execd/load_avg.c:

source/daemons/execd/reaper_execd.c:

Patch files (based off of https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/):

Summary

Changes to the code are minimal. The highlights are:

Conclusion

These changes make it possible to account for the number of cpus per slot, which is critical when calculating cpu usage from wallclock time.