{{toc-local/}}

{{info}}
Where "cpus" are mentioned, read "cores". I'll try to make this clearer at some point.
{{/info}}

= Introduction =

When Gridengine (e.g., 8.1.9) calculates cpu usage information, it bases the calculation on the slots used and either cpu time or wallclock time. This works if each slot corresponds to a single cpu. However, if more than one cpu is allocated to a slot, the calculation undercounts and does not give the desired usage information.

For the purposes of this article:

* (((
execd_params is configured with ACCT_RESERVED_USAGE=true and SHARETREE_RESERVED_USAGE=true so that wallclock time rather than cpu time is used
)))
* (((
cpus are a dedicated resource managed by a consumable
)))
* (((
jobs request the number of cpus for each slot
)))

with the goal of calculating cpu usage as:

{{noformat}}
cpu = wallclock * nslots * ncpus_per_slot
{{/noformat}}

= Usage =

The complex used to track the number of cpus per slot is named in execd_params with SLOT_MULTIPLIER_NAME. For example:

{{noformat}}
execd_params SLOT_MULTIPLIER_NAME=ncpus
{{/noformat}}

The complex would be defined as:

{{noformat}}
ncpus ncpus INT <= FORCED YES 0 1000
{{/noformat}}

A queue would be configured with:

{{noformat}}
complex_values ncpus=4
{{/noformat}}

* 4 cpus are allocatable from this queue

So, a job requesting 4 slots with 16 cpus per slot would specify:

{{noformat}}
#$ -pe dev 4
#$ -l ncpus=16
{{/noformat}}

and a run of 30s would amount to:

(% style="margin-left: 30.0px;" %)
cpu = 30 * 4 * 16 = 1920

versus what is currently returned:

(% style="margin-left: 30.0px;" %)
cpu = 30 * 4 = 120
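
The arithmetic above can be sketched in a few lines of C. This is only an illustration of the formula cpu = wallclock * nslots * ncpus_per_slot. The function name reserved_cpu_usage() is hypothetical and not part of Gridengine, and treating a non-positive multiplier as 1 is an assumption made here so that the sketch falls back to the slots-only result.

```c
/* Hypothetical helper, not part of Gridengine: reserved cpu usage in
 * cpu-seconds, following cpu = wallclock * nslots * ncpus_per_slot. */
double reserved_cpu_usage(double wallclock, int nslots, double ncpus_per_slot)
{
    /* Assumption: a missing or non-positive multiplier is treated as 1,
     * which reproduces the slots-only calculation. */
    if (ncpus_per_slot <= 0.0) {
        ncpus_per_slot = 1.0;
    }
    return wallclock * (double)nslots * ncpus_per_slot;
}
```

Calling reserved_cpu_usage(30.0, 4, 16.0) corresponds to the 30s run with 4 slots and 16 cpus per slot, while passing a multiplier of 1.0 corresponds to the slots-only figure.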

= Implementation =

== Changes ==

source/libs/sgeobj/sge_conf.h:

* add mconf_get_slot_multiplier_name() declaration

source/libs/sgeobj/sge_conf.c:

* set slot_multiplier_name static variable to hold the value set in execd_params SLOT_MULTIPLIER_NAME
* implement mconf_get_slot_multiplier_name()
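
To make the idea of picking SLOT_MULTIPLIER_NAME out of execd_params concrete, here is a minimal, self-contained sketch. It is not the code from the patch: the function name parse_slot_multiplier_name() is hypothetical, and the assumption that parameters are separated by commas or whitespace and matched case-sensitively is made only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not the patch's code: copy the value of
 * SLOT_MULTIPLIER_NAME=<value> from a parameter string into out.
 * Returns 1 if a value was found and fits in out, 0 otherwise. */
int parse_slot_multiplier_name(const char *params, char *out, size_t outlen)
{
    static const char key[] = "SLOT_MULTIPLIER_NAME=";
    const size_t keylen = sizeof(key) - 1;
    const char *p = params;

    if (params == NULL || out == NULL || outlen == 0) {
        return 0;
    }
    while ((p = strstr(p, key)) != NULL) {
        /* Assumption: a parameter starts the string or follows a
         * comma or whitespace separator. */
        if (p == params || strchr(", \t", p[-1]) != NULL) {
            size_t n = strcspn(p + keylen, ", \t\n");
            if (n == 0 || n >= outlen) {
                return 0;
            }
            memcpy(out, p + keylen, n);
            out[n] = '\0';
            return 1;
        }
        p += keylen; /* matched inside a longer name, keep searching */
    }
    return 0;
}
```

With the settings used in this article, the sketch would extract ncpus from a string such as ACCT_RESERVED_USAGE=true,SHARETREE_RESERVED_USAGE=true,SLOT_MULTIPLIER_NAME=ncpus.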

source/daemons/execd/load_avg.h:

* augment build_reserved_usage() declaration to accept a reference to the job

source/daemons/execd/load_avg.c:

* augment build_reserved_usage() to accept a reference to the job
* enhance build_reserved_usage() to get the slot multiplier and use it when calculating cpu usage
* add get_slot_multiplier() static function to get the multiplier value if defined, or 1.0 otherwise
* update calculate_reserved_usage() to get the job reference and provide it when calling build_reserved_usage()
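
The lookup described for get_slot_multiplier() can be sketched as follows. This is a stand-alone illustration under stated assumptions, not the patch itself: the resource_request type and the function get_slot_multiplier_sketch() are hypothetical stand-ins for Gridengine's own job structures and list accessors, which are not reproduced here. Ignoring non-positive values is also an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one resource request of a job
 * (e.g. "-l ncpus=16"); the real code reads the job's own structures. */
typedef struct {
    const char *name;
    double value;
} resource_request;

/* Returns the requested value of the complex named multiplier_name,
 * or 1.0 if no name is configured, the job did not request it,
 * or (assumption) the requested value is not positive. */
double get_slot_multiplier_sketch(const char *multiplier_name,
                                  const resource_request *requests,
                                  size_t nrequests)
{
    size_t i;

    if (multiplier_name == NULL || multiplier_name[0] == '\0') {
        return 1.0;
    }
    for (i = 0; i < nrequests; i++) {
        if (requests[i].name != NULL &&
            strcmp(requests[i].name, multiplier_name) == 0 &&
            requests[i].value > 0.0) {
            return requests[i].value;
        }
    }
    return 1.0;
}
```

Defaulting to 1.0 keeps the result identical to the unpatched calculation for any job that does not request the configured complex.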

source/daemons/execd/reaper_execd.c:

* update build_derived_final_usage() to provide the job reference when calling build_reserved_usage()

Patch files (based on [[https:~~/~~/arc.liv.ac.uk/downloads/SGE/releases/8.1.9/>>url:https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/||shape="rect"]]):

* [[attach:0001-Support-for-slot-multiplier.patch]]

== Summary ==

Changes to the code are minimal. The highlights are:

* change the signature of the build_reserved_usage() function to take a job reference so that the consumable information can be obtained
* update build_reserved_usage() to use the slot multiplier value

= Conclusion =

These changes make it possible to account for the number of cpus per slot, which is critical when calculating cpu usage from wallclock time.
