Accounting and Priorities
Concept of job priority
The individual job priority is computed as a weighted sum of three different factors - see below for details:
- the time that the job is waiting in the queue
- the share of the project's compute time that has already been used
- a special priority granted as "quality of service" (QOS) to specific projects or kinds of usage
Thus, a job will get an especially high priority if it
- has already been waiting in the queue for a long time (age_factor)
- runs under an account that has not yet used its share of compute time (FairShare_factor)
- is associated with a high priority for other reasons (e.g. express QOS for small tests)
SLURM job priority calculation
On Mistral we use the Multi-factor Job Priority plugin of SLURM to influence job priority. A job's priority at any given time is a weighted sum of the following factors:
- age_factor ∈ [0,1] with 1 when the age is more than PriorityMaxAge (30 days, 0 hours)
- FairShare_factor ∈ [0,1] as explained below
- QOS_factor ∈ [0,1] normalized according to 'sacctmgr show qos' (e.g. normal = 0, express = 0.1, bench = 1)
with the weights:
- PriorityWeightFairshare=1000
- PriorityWeightQOS=1000
- PriorityWeightAge=1000
The final priority is then calculated as
Job_priority = PriorityWeightAge * age_factor + PriorityWeightFairshare * FairShare_factor + PriorityWeightQOS * QOS_factor
and can be checked with the sprio command:
PRIORITY = AGE + FAIRSHARE + QOS ∈ [0,3000]
AGE = Weighted age priority ∈ [0,1000]
FAIRSHARE = Weighted fair-share priority ∈ [0,1000]
QOS = Weighted quality of service priority ∈ [0,1000]
While squeue has format options (%p and %Q) that display a job's composite priority, sprio can be used to display a breakdown of the priority components for each job, e.g.
$ sprio
  JOBID   PRIORITY   AGE   FAIRSHARE   QOS
1421556       1175   100         975   100
2015831        274    20         204    50
2017372        258     0         258     0
...
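The weighted sum above can be sketched in a few lines of Python. This is only an illustration, not SLURM's implementation: the function names are hypothetical, the linear ramp of the age factor up to PriorityMaxAge is an assumption, and SLURM reports integer priorities while this sketch keeps floats.

```python
# Weights and PriorityMaxAge as quoted in the text above.
PRIORITY_WEIGHT_AGE = 1000
PRIORITY_WEIGHT_FAIRSHARE = 1000
PRIORITY_WEIGHT_QOS = 1000
PRIORITY_MAX_AGE_DAYS = 30.0


def age_factor(days_in_queue: float) -> float:
    """Assumption: linear ramp from 0 to 1, capped once PriorityMaxAge is reached."""
    return min(max(days_in_queue, 0.0) / PRIORITY_MAX_AGE_DAYS, 1.0)


def job_priority(age: float, fairshare: float, qos: float) -> float:
    """Weighted sum of the three factors, each expected in [0, 1]."""
    return (PRIORITY_WEIGHT_AGE * age
            + PRIORITY_WEIGHT_FAIRSHARE * fairshare
            + PRIORITY_WEIGHT_QOS * qos)


def matches_breakdown(priority: int, age: int, fairshare: int, qos: int) -> bool:
    """Check that an sprio row satisfies PRIORITY = AGE + FAIRSHARE + QOS."""
    return priority == age + fairshare + qos


if __name__ == "__main__":
    # Rows copied from the sprio example above: JOBID, PRIORITY, AGE, FAIRSHARE, QOS.
    rows = [
        (1421556, 1175, 100, 975, 100),
        (2015831, 274, 20, 204, 50),
        (2017372, 258, 0, 258, 0),
    ]
    for jobid, prio, age, fairshare, qos in rows:
        print(jobid, matches_breakdown(prio, age, fairshare, qos))

    # Illustrative inputs: 3 days in the queue, FairShare 0.975, express QOS (0.1).
    print(job_priority(age_factor(3), 0.975, 0.1))
```

Under the linear-ramp assumption, a job that has waited 3 days with a FairShare factor of 0.975 and express QOS ends up at roughly the priority of job 1421556 in the sprio listing.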
SLURM accounting storage
For each SLURM job, the accounting database stores the computing cycles delivered by a machine in units of allocated_cpus*wall_clock_seconds.
Hence, one Haswell node used for one hour in the compute partition is accounted internally as
1 NodeHour HSW = 3600*48 CPUsec = 172800 CPUsec
while a Broadwell node in the compute2 partition used for one hour in exclusive state is accounted internally as
1 NodeHour BDW = 3600*72 CPUsec = 259200 CPUsec
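The two conversions can be reproduced with a small helper. This is a sketch for illustration only: the function name and the "HSW"/"BDW" dictionary keys are hypothetical labels, while the CPU counts per node (48 and 72) are taken from the formulas above.

```python
SECONDS_PER_HOUR = 3600

# Allocated CPUs per exclusively used node, as used in the formulas above.
CPUS_PER_NODE = {"HSW": 48, "BDW": 72}


def node_hours_to_cpu_seconds(node_hours: float, node_type: str) -> float:
    """Convert node hours on the given node type into accounted CPU seconds."""
    return node_hours * SECONDS_PER_HOUR * CPUS_PER_NODE[node_type]


if __name__ == "__main__":
    print(node_hours_to_cpu_seconds(1, "HSW"))  # 3600*48
    print(node_hours_to_cpu_seconds(1, "BDW"))  # 3600*72
```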
HLRE projects are accounted in node hours (as shown at https://luv.dkrz.de/projects/). Broadwell nodes have a lower CPU frequency than Haswell nodes but a higher core count. We expect these two effects to compensate each other, so that the node hour is a balanced metric across both node types. This might change in the future if different accounting weights are introduced for different node types.
FairShare Factor
While all other factors used to calculate a job's priority are fairly easy to understand, the FairShare factor needs to be explained in detail.
The FairShare factor does not involve a fixed allotment (like the granted NodeHours per project), whereby a user's access to a machine is cut off once that allotment is reached.
Instead, the FairShare factor serves to prioritize queued jobs such that those jobs charging accounts that are under-serviced are scheduled first, while jobs charging accounts that are over-serviced are scheduled when the machine would otherwise go idle.
SLURM's FairShare factor is therefore mainly based on the ratio of the amount of computing resources the user's jobs have already consumed to the share of computing resources that the user has been granted. The higher the FairShare value, the fewer resources were used compared to what was granted, and the higher the job is placed in the queue.
DKRZ uses a two-level share hierarchy. On the top level (parent level), project accounts are grouped according to their shareholder grant - e.g. project ba1234 is a bmbf project, while mh1234 is an mpg project. On the bottom level (account level), each project is represented by the NodeHours granted for the current period. Together these form the Normalized Shares per project:
S_project = (S_parent / S_parent-siblings) * (S_account / S_account-siblings)
As an example take the following settings (given by sshare)
Account              User       Raw Shares Norm Shares    Raw Usage Effectv Usage  FairShare
-------------------- ---------- ---------- ----------- ------------ ------------- ----------
root                                          1.000000 256867246717      0.000000   1.000000
 root                root                1    0.000969
 bmbf                                  205    0.198643      1000000      0.000004
  ba1234                            238938    0.032619      1000000      0.000004   0.999915
  rest_bmbf                        1216131    0.166024
 mpg                                   205    0.198643  71017642553      0.276476
  mh1234                            238938    0.032619  16528791499      0.099181   0.121532
  mh4321                            238938    0.032619      1000000      0.045404   0.381050
  rest_mpg ...                      977193    0.133405
 rest                                  621    0.601744

Project mh1234 as mpg child has the following shares:

sum of Raw Shares (parent level):  S_parent-siblings = 1032
Raw Shares of mpg:                 S_parent = 205
Norm Shares of mpg:                S_1 = 205/1032 = 0.198643
                                   (i.e. ~20 % of the compute resources are for mpg)

sum of Raw Shares (account level): S_account-siblings = 1455069
Raw Shares of mh1234:              S_account = 238938
Norm Shares of mh1234:             S_2 = 238938/1455069 = 0.164211
                                   (i.e. ~16.4 % of mpg share is for mh1234)

=> S_mh1234 = S_1 * S_2 = 0.032619
   (i.e. ~3.3 % of the compute resources are for mh1234)
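The normalization for mh1234 can be recomputed from the Raw Shares of the sshare example. The sketch below is illustrative only. The function name is hypothetical, and the sibling lists are read off the example listing (root, bmbf, mpg, rest on the parent level; mh1234, mh4321, rest_mpg on the account level).

```python
def normalized_share(parent_raw, parent_level_raw, account_raw, account_level_raw):
    """S_project = (S_parent / S_parent-siblings) * (S_account / S_account-siblings).

    parent_level_raw and account_level_raw hold the Raw Shares of all
    entries on the respective level, including the entry itself.
    """
    parent_fraction = parent_raw / sum(parent_level_raw)
    account_fraction = account_raw / sum(account_level_raw)
    return parent_fraction * account_fraction


if __name__ == "__main__":
    parent_level = [1, 205, 205, 621]         # root user, bmbf, mpg, rest
    mpg_children = [238938, 238938, 977193]   # mh1234, mh4321, rest_mpg

    s_mh1234 = normalized_share(205, parent_level, 238938, mpg_children)
    print(f"{s_mh1234:.6f}")  # compare with the 'Norm Shares' column of mh1234
```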
Users can query the current shares for all projects they belong to via
$ sshare
This shows for each SLURM account the value of S_project in column 'Norm Shares'. Furthermore, column 'Raw Usage' shows the already used CPUseconds for this account (U_account).
SLURM implements a further fairness mechanism: the FairShare is calculated not only from the Raw Usage of an account but also from the Raw Usage of its parent (i.e. the sum of all siblings). As a result, accounts under different parents that have the same charged usage and the same shares can get a different FairShare because of their sibling accounts. This so-called 'Effective Usage' is ultimately used to calculate the FairShare factor.
U_project-eff = U_account/U_total + ( (U_parent - U_account)/U_total * S_account / S_account-siblings)
For the example above this reads
U_mh1234-eff = U_mh1234/U_total + ( (U_mpg - U_mh1234)/U_total * S_mh1234 / S_mh1234-siblings)
             = 16528791499/256867246717 + ( (71017642553 - 16528791499)/256867246717 * 238938 / 1455069 )
             = 0.064348 + (0.212128 * 0.164211)
             = 0.099181

The sibling project mh4321 with much less Raw Usage but same shares gets:

U_mh4321-eff = U_mh4321/U_total + ( (U_mpg - U_mh4321)/U_total * S_mh4321 / S_mh4321-siblings)
             = 1000000/256867246717 + ( (71017642553 - 1000000)/256867246717 * 238938 / 1455069 )
             = 0.000004 + (0.276472 * 0.164211)
             = 0.045404

The project ba1234 with same prerequisites as the project mh4321 but different parent gets:

U_ba1234-eff = U_ba1234/U_total + ( (U_bmbf - U_ba1234)/U_total * S_ba1234 / S_ba1234-siblings)
             = 1000000/256867246717 + ( (1000000 - 1000000)/256867246717 * 238938 / 1455069 )
             = 0.000004 + (0.0 * 0.164211)
             = 0.000004
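The three effective-usage calculations follow the same pattern and can be checked with a short function. This is a sketch of the formula as stated in this text, not of SLURM's internals. The function and parameter names are hypothetical, and the input numbers are the Raw Usage and Raw Shares values from the sshare example.

```python
def effective_usage(u_account, u_parent, u_total, s_account, s_account_siblings):
    """U_eff = U_account/U_total
             + (U_parent - U_account)/U_total * S_account/S_account-siblings.

    s_account and s_account_siblings are Raw Shares on the account level.
    """
    own_part = u_account / u_total
    sibling_part = (u_parent - u_account) / u_total * s_account / s_account_siblings
    return own_part + sibling_part


if __name__ == "__main__":
    U_TOTAL = 256867246717   # Raw Usage of root
    U_MPG = 71017642553      # Raw Usage of parent mpg
    U_BMBF = 1000000         # Raw Usage of parent bmbf
    SHARES, SIBLING_SHARES = 238938, 1455069

    for name, u_account, u_parent in [
        ("mh1234", 16528791499, U_MPG),
        ("mh4321", 1000000, U_MPG),
        ("ba1234", 1000000, U_BMBF),
    ]:
        u_eff = effective_usage(u_account, u_parent, U_TOTAL, SHARES, SIBLING_SHARES)
        print(name, f"{u_eff:.6f}")  # compare with the 'Effectv Usage' column
```

The second term is what separates mh4321 from ba1234: both accounts have the same Raw Usage, but mh4321 inherits part of the heavy usage of its mpg siblings.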
Finally, the Fair-share factor of a project is
FS_project = 2**(-U_project-eff/S_project)
Again, for the example above this reads
FS_mh1234 = 2**(-U_mh1234-eff/S_mh1234) = 2**(-0.099181/0.032619) = 0.121532

The sibling project mh4321 gets a better FairShare since it used less resources so far:

FS_mh4321 = 2**(-U_mh4321-eff/S_mh4321) = 2**(-0.045404/0.032619) = 0.381050

while the project ba1234 is prioritized over mh4321 (although both already used the same amount of resources):

FS_ba1234 = 2**(-U_ba1234-eff/S_ba1234) = 2**(-0.000004/0.032619) = 0.999915
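The FairShare values can be reproduced from the rounded Effective Usage and Norm Shares figures. The following is a minimal sketch of the formula given in this text. The function name is hypothetical, and small differences in the last digit can occur because the inputs are already rounded.

```python
def fairshare_factor(u_eff: float, s_norm: float) -> float:
    """FS = 2**(-U_eff / S), with U_eff the Effective Usage and S the Norm Shares."""
    return 2.0 ** (-u_eff / s_norm)


if __name__ == "__main__":
    S_PROJECT = 0.032619  # Norm Shares, identical for all three example projects

    for name, u_eff in [("mh1234", 0.099181), ("mh4321", 0.045404), ("ba1234", 0.000004)]:
        print(name, f"{fairshare_factor(u_eff, S_PROJECT):.6f}")

    # Limiting cases: no usage at all, and usage exactly equal to the share.
    print(fairshare_factor(0.0, S_PROJECT))
    print(fairshare_factor(S_PROJECT, S_PROJECT))
```

Because only the ratio U_eff / S enters the exponent, doubling both the usage and the share of a project leaves its FairShare factor unchanged.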
Some general examples:
- project X has not used any CPUtime so far (Raw Usage = 0) and no sibling has used any CPUtime either: U_project-eff=0 => FS_project = 1
- compared to all others (sum of raw usage U_total), project X used exactly its granted share: U_project-eff = S_project => FS_project = 0.5
Attention: cases such as "half/all/double CPUtime used so far" do not map directly onto the SLURM FairShare! Only the ratio U_project-eff / S_project is of interest for FairShare!
In general: A FairShare factor above 0.5 indicates that the project has consumed less than its allocated share, while a FairShare factor below 0.5 indicates that the project has consumed more than its allocated share of the computing resources.