{{toc-local/}}

{{info}}
Where "cpus" are mentioned, read "cores". I'll try to make this clearer at some point.
{{/info}}

= Introduction =

When Grid Engine (e.g., 8.1.9) calculates cpu usage, it does so from the number of slots used and either cpu time or wallclock time. This works when each slot corresponds to a single cpu. When more than one cpu is allocated to a slot, however, the result under-reports the cpus actually reserved.

For the purposes of this article:

* (((
execd_params is configured with ACCT_RESERVED_USAGE=true and SHARETREE_RESERVED_USAGE=true so that wallclock time rather than cpu time is used
)))
* (((
cpus are a dedicated resource managed by a consumable
)))
* (((
jobs request the number of cpus for each slot
)))

The goal is to calculate cpu usage as:

{{noformat}}
cpu = wallclock * nslots * ncpus_per_slot
{{/noformat}}

= Usage =

The complex used to track the number of cpus per slot is named in execd_params with SLOT_MULTIPLIER_NAME. For example:

{{noformat}}
execd_params SLOT_MULTIPLIER_NAME=ncpus
{{/noformat}}

The complex would be defined as:

{{noformat}}
ncpus ncpus INT <= FORCED YES 0 1000
{{/noformat}}

A queue would be configured with:

{{noformat}}
complex_values ncpus=4
{{/noformat}}

* 4 cpus allocatable from this queue

A job request for 4 slots with 16 cpus per slot would be:

{{noformat}}
#$ -pe dev 4
#$ -l ncpus=16
{{/noformat}}

and a run of 30s would amount to:

(% style="margin-left: 30.0px;" %)
cpu = 30 * 4 * 16 = 1920

versus what is currently returned:

(% style="margin-left: 30.0px;" %)
cpu = 30 * 4 = 120

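The arithmetic above can be checked with a minimal Python sketch. The function name reserved_cpu and its parameters are hypothetical and are not part of Grid Engine or the patch; a multiplier of 1.0 stands in for the current behaviour, where cpus per slot are not taken into account.

```python
def reserved_cpu(wallclock, nslots, slot_multiplier=1.0):
    """Reserved cpu usage: wallclock * nslots * cpus per slot.

    Hypothetical helper for illustration only; slot_multiplier=1.0
    models the calculation without the slot multiplier.
    """
    return wallclock * nslots * slot_multiplier


# Worked example from the text: 30s run, 4 slots, 16 cpus per slot.
with_multiplier = reserved_cpu(30, 4, 16)
without_multiplier = reserved_cpu(30, 4)
print(with_multiplier, without_multiplier)
```

The two values correspond to the 1920 and 120 figures given above.
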
= Implementation =

== Changes ==

source/libs/sgeobj/sge_conf.h:

* add mconf_get_slot_multiplier_name() declaration

source/libs/sgeobj/sge_conf.c:

* set slot_multiplier_name static variable to hold value set in execd_params SLOT_MULTIPLIER_NAME
* implement mconf_get_slot_multiplier_name()

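As a rough model of what the sge_conf.c change has to do, the following Python sketch pulls SLOT_MULTIPLIER_NAME out of an execd_params string. It is not the patch's C code. The function name parse_slot_multiplier_name is hypothetical, and the sketch assumes that entries are KEY=VALUE tokens separated by commas, semicolons or whitespace, and that the key is matched exactly.

```python
import re


def parse_slot_multiplier_name(execd_params):
    """Return the SLOT_MULTIPLIER_NAME value from execd_params, or None.

    Assumption: entries are KEY=VALUE tokens separated by commas,
    semicolons or whitespace; the key comparison is case-sensitive.
    """
    for entry in re.split(r"[,;\s]+", execd_params.strip()):
        key, sep, value = entry.partition("=")
        if sep and key == "SLOT_MULTIPLIER_NAME" and value:
            return value
    return None


params = "ACCT_RESERVED_USAGE=true,SHARETREE_RESERVED_USAGE=true,SLOT_MULTIPLIER_NAME=ncpus"
print(parse_slot_multiplier_name(params))
```

Returning None when the parameter is absent lets the caller fall back to a multiplier of 1.0, which keeps the existing accounting unchanged for sites that do not set it.
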
source/daemons/execd/load_avg.h:

* augment build_reserved_usage() declaration to accept reference to job

source/daemons/execd/load_avg.c:

* augment build_reserved_usage() to accept reference to job
* enhance build_reserved_usage() to get slot multiplier and use it when calculating cpu usage
* add get_slot_multiplier() static function to get the multiplier value if defined, or 1.0 otherwise
* update calculate_reserved_usage() to get job reference and provide it when calling build_reserved_usage()

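The lookup-with-fallback behaviour described for get_slot_multiplier() can be modelled in a few lines of Python. This is a sketch of the logic only, not a translation of the C code. The names lookup_slot_multiplier and reserved_cpu_for_job are hypothetical, the job's resource requests are represented as a plain dict, and treating an unparsable or non-positive value as "not defined" is an assumption of this sketch.

```python
def lookup_slot_multiplier(job_requests, multiplier_name):
    """Return cpus per slot requested by the job, or 1.0 if not defined.

    job_requests: dict of resource name -> requested value (hypothetical
    stand-in for the job's request list).
    Assumption: unparsable or non-positive values fall back to 1.0.
    """
    if not multiplier_name:
        return 1.0
    raw = job_requests.get(multiplier_name)
    if raw is None:
        return 1.0
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 1.0
    return value if value > 0 else 1.0


def reserved_cpu_for_job(wallclock, nslots, job_requests, multiplier_name):
    """cpu = wallclock * nslots * ncpus_per_slot, using the lookup above."""
    return wallclock * nslots * lookup_slot_multiplier(job_requests, multiplier_name)


# Job from the usage example: -pe dev 4, -l ncpus=16, 30s wallclock.
print(reserved_cpu_for_job(30, 4, {"ncpus": "16"}, "ncpus"))
```

Passing the job's requests into the calculation is the reason build_reserved_usage() needs the job reference: without it there is nothing to look the consumable up in.
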
source/daemons/execd/reaper_execd.c:

* update build_derived_final_usage() to provide job reference when calling build_reserved_usage()

Patch files (based on [[https:~~/~~/arc.liv.ac.uk/downloads/SGE/releases/8.1.9/>>url:https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/||shape="rect"]]):

* [[attach:0001-Support-for-slot-multiplier.patch]]

== Summary ==

Changes to the code are minimal. The highlights are:

* change the signature of the build_reserved_usage() function to take a job reference so that the consumable information can be obtained
* update build_reserved_usage() to use the slot multiplier value

= Conclusion =

These changes make it possible to account for the number of cpus per slot, which is critical to calculating cpu usage from wallclock time.
