Wiki source code of Slot Multiplier for Calculating CPU Usage in Gridengine
Last modified by Admin on 2021/10/31 17:45
author | version | line-number | content |
---|---|---|---|
1.1 | 1 | |
2 | |||
3 | {{toc-local/}} | ||
4 | |||
5.1 | 5 | {{info}} |
6 | Where "cpus" are mentioned, read "cores". I'll try to make this clearer at some point. | ||
7 | {{/info}} | ||
8 | |||
1.1 | 9 | = Introduction = |
10 | |||
4.1 | 11 | When Gridengine (e.g., 8.1.9) calculates cpu usage, it bases the calculation on the number of slots used and either cpu time or wallclock time. This works if each slot corresponds to a single cpu. However, if more than one cpu is allocated to a slot, the calculation under-reports usage by a factor of the number of cpus per slot. |
1.1 | 12 | |
4.1 | 13 | For the purposes of this article, assume that: |
1.1 | 14 | |
15 | * ((( | ||
16 | execd_params is configured with ACCT_RESERVED_USAGE=true and SHARETREE_RESERVED_USAGE=true so that wallclock time rather than cpu time is used | ||
17 | ))) | ||
18 | * ((( | ||
4.1 | 19 | cpus are a dedicated resource managed by a consumable |
1.1 | 20 | ))) |
21 | * ((( | ||
22 | jobs request the number of cpus for each slot | ||
23 | ))) | ||
24 | |||
4.1 | 25 | The goal is to calculate cpu usage as: |
1.1 | 26 | |
27 | {{noformat}} | ||
28 | cpu = wallclock * nslots * ncpus_per_slot | ||
29 | {{/noformat}} | ||
30 | |||
31 | = Usage = | ||
32 | |||
33 | The complex that tracks the number of cpus used per slot is named in execd_params with SLOT_MULTIPLIER_NAME. For example: | |
34 | |||
35 | {{noformat}} | ||
36 | execd_params SLOT_MULTIPLIER_NAME=ncpus | ||
37 | {{/noformat}} | ||
38 | |||
39 | The complex would be defined as (columns: name, shortcut, type, relop, requestable, consumable, default, urgency): | |
40 | |||
41 | {{noformat}} | ||
42 | ncpus ncpus INT <= FORCED YES 0 1000 | ||
43 | {{/noformat}} | ||
44 | |||
4.1 | 45 | A queue would be configured with: |
1.1 | 46 | |
47 | {{noformat}} | ||
48 | complex_values ncpus=4 | ||
49 | {{/noformat}} | ||
50 | |||
51 | * 4 cpus allocatable from this queue | ||
52 | |||
53 | So, a job request for 4 slots with 16 cpus per slot (which needs queues offering more cpus than the 4 configured above) would be: | |
54 | |||
55 | {{noformat}} | ||
56 | #$ -pe dev 4 | ||
57 | #$ -l ncpus=16 | ||
58 | {{/noformat}} | ||
59 | |||
60 | A run of 30 seconds would then amount to: | |
61 | |||
62 | (% style="margin-left: 30.0px;" %) | ||
63 | cpu = 30 * 4 * 16 = 1920 | |
64 | |||
65 | versus what is returned without the slot multiplier: | |
66 | |||
67 | (% style="margin-left: 30.0px;" %) | ||
68 | cpu = 30 * 4 = 120 | |
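
The arithmetic above can be checked with a short Python sketch. The function name reserved_cpu_usage and its parameters are illustrative assumptions, not part of Gridengine; the function only restates the formula cpu = wallclock * nslots * ncpus_per_slot.

```python
def reserved_cpu_usage(wallclock_s, nslots, ncpus_per_slot=1):
    """Reserved cpu usage in cpu-seconds.

    Hypothetical helper for illustration only: it restates
    cpu = wallclock * nslots * ncpus_per_slot from this article.
    A multiplier of 1 corresponds to the unpatched calculation.
    """
    return wallclock_s * nslots * ncpus_per_slot


# The example job: 30 s run, 4 slots, 16 cpus requested per slot.
with_multiplier = reserved_cpu_usage(30, 4, 16)
without_multiplier = reserved_cpu_usage(30, 4)

print(with_multiplier, without_multiplier)  # 1920 120
```

The two results differ by exactly the per-slot cpu count, which is the factor the slot multiplier is meant to restore.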
69 | |||
70 | = Implementation = | ||
71 | |||
72 | == Changes == | ||
73 | |||
74 | source/libs/sgeobj/sge_conf.h: | ||
75 | |||
76 | * add mconf_get_slot_multiplier_name() declaration | ||
77 | |||
78 | source/libs/sgeobj/sge_conf.c: | ||
79 | |||
80 | * set slot_multiplier_name static variable to hold value set in execd_params SLOT_MULTIPLIER_NAME | ||
81 | * implement mconf_get_slot_multiplier_name() | ||
82 | |||
83 | source/daemons/execd/load_avg.h: | ||
84 | |||
85 | * augment build_reserved_usage() declaration to accept reference to job | ||
86 | |||
87 | source/daemons/execd/load_avg.c: | ||
88 | |||
89 | * augment build_reserved_usage() to accept reference to job | ||
90 | * enhance build_reserved_usage() to get slot multiplier and use it when calculating cpu usage | ||
91 | * add get_slot_multiplier() static function to get the multiplier value if defined, or 1.0 otherwise | |
92 | * update calculate_reserved_usage() to get job reference and provide it when calling build_reserved_usage() | ||
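
The fallback behaviour described for get_slot_multiplier() can be sketched in Python. This is not the patch itself (the patch is C and is not reproduced here): the names below only mirror the C functions, and the plain dict standing in for the job's resource requests is an assumption made for illustration.

```python
from typing import Mapping, Optional


def get_slot_multiplier(requests: Mapping[str, float],
                        multiplier_name: Optional[str]) -> float:
    """Return the per-slot multiplier, or 1.0 if none applies.

    `requests` is a hypothetical stand-in for the job's requested
    resources; `multiplier_name` plays the role of the value set
    via SLOT_MULTIPLIER_NAME in execd_params.
    """
    if not multiplier_name:
        # SLOT_MULTIPLIER_NAME not configured: keep the old behaviour.
        return 1.0
    value = requests.get(multiplier_name)
    if value is None:
        # The job did not request the complex: one cpu per slot.
        return 1.0
    return float(value)


def build_reserved_cpu(wallclock_s: float, nslots: int,
                       requests: Mapping[str, float],
                       multiplier_name: Optional[str] = None) -> float:
    """cpu = wallclock * nslots * multiplier, as in this article."""
    return wallclock_s * nslots * get_slot_multiplier(requests, multiplier_name)
```

Defaulting to 1.0 means that jobs which do not request the complex, and clusters that do not set SLOT_MULTIPLIER_NAME, get the same result as before the change.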
93 | |||
94 | source/daemons/execd/reaper_execd.c: | ||
95 | |||
96 | * update build_derived_final_usage() to provide job reference when calling build_reserved_usage() | ||
97 | |||
98 | Patch file (based on [[https:~~/~~/arc.liv.ac.uk/downloads/SGE/releases/8.1.9/>>url:https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/||shape="rect"]]): | |
99 | |||
100 | * [[attach:0001-Support-for-slot-multiplier.patch]] | ||
101 | |||
102 | == Summary == | ||
103 | |||
104 | Changes to the code are minimal. The highlights are: | ||
105 | |||
106 | * change the signature of the build_reserved_usage() function to take a job reference so that the consumable information can be obtained | ||
107 | * update build_reserved_usage() to use the slot multiplier value | ||
108 | |||
109 | = Conclusion = | ||
110 | |||
4.1 | 111 | These changes make it possible to account for the number of cpus per slot, which is critical when cpu usage is calculated from wallclock time. |