
Accounting and Priorities

How SLURM uses the accounting data to calculate a job's priority.

SLURM job priority calculation

On mistral we use the Multi-factor Job Priority plugin of SLURM to determine job priority. A job's priority at any given time is a weighted sum of the following factors:

  • age_factor ∈ [0,1], reaching 1 once the job's age exceeds PriorityMaxAge (30 days, 0 hours)
  • FairShare_factor ∈ [0,1] as explained below
  • QOS_factor ∈ [0,1] normalized according to 'sacctmgr show qos' (e.g. normal = 0, express = 0.1, bench = 1)

with the weights:

  • PriorityWeightFairshare=1000
  • PriorityWeightQOS=1000
  • PriorityWeightAge=1000

The final priority is then calculated as

Job_priority = PriorityWeightAge * age_factor + 
               PriorityWeightFairshare * FairShare_factor +
               PriorityWeightQOS * QOS_factor

and can be checked with the sprio command:

PRIORITY = AGE + FAIRSHARE + QOS ∈ [0,3000]
AGE = Weighted age priority ∈ [0,1000]
FAIRSHARE = Weighted fair-share priority ∈ [0,1000]
QOS = Weighted quality of service priority ∈ [0,1000]

While squeue has format options (%p and %Q) that display a job's composite priority, sprio can be used to display a breakdown of the priority components for each job, e.g.

$ sprio
          JOBID   PRIORITY        AGE  FAIRSHARE  QOS
        1421556       1175        100        975  100
        2015831        274         20        204   50
        2017372        258          0        258    0
        ...
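
The weighted sum above can be reproduced in a few lines of Python. This is only a sketch: the function name job_priority is made up for illustration, the weights are the ones listed above, and rounding each component to an integer is an assumption made to mimic the integer columns of sprio, not something taken from the SLURM sources.

```python
# Weights as configured on mistral (see the list of weights above).
PRIORITY_WEIGHT_AGE = 1000
PRIORITY_WEIGHT_FAIRSHARE = 1000
PRIORITY_WEIGHT_QOS = 1000


def job_priority(age_factor, fairshare_factor, qos_factor):
    """Return (PRIORITY, AGE, FAIRSHARE, QOS) for factors given in [0, 1].

    Hypothetical helper; rounding to integers is an assumption made to
    mimic the integer columns printed by sprio.
    """
    for factor in (age_factor, fairshare_factor, qos_factor):
        if not 0.0 <= factor <= 1.0:
            raise ValueError("each factor must lie in [0, 1]")
    age = round(PRIORITY_WEIGHT_AGE * age_factor)
    fairshare = round(PRIORITY_WEIGHT_FAIRSHARE * fairshare_factor)
    qos = round(PRIORITY_WEIGHT_QOS * qos_factor)
    return age + fairshare + qos, age, fairshare, qos


# Factors matching the first job of the sprio listing above.
print(job_priority(0.1, 0.975, 0.1))
```

With the factors 0.1, 0.975 and 0.1 this gives the components of job 1421556 in the listing above (100 + 975 + 100 = 1175).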

SLURM accounting storage

For each SLURM job the accounting database stores the computing cycles delivered by the machine in units of allocated_cpus*wall_clock_seconds.

Hence, one Haswell node used for one hour in the compute partition is accounted internally as

1 NodeHour HSW = 3600*48 CPUsec = 172800 CPUsec

while a Broadwell node in the compute2 partition used for one hour in exclusive state is accounted internally as

1 NodeHour BDW = 3600*72 CPUsec = 259200 CPUsec
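
The conversion between node hours and the internally stored CPU seconds is simple arithmetic, and the following sketch spells it out. The helper names and the dictionary are made up for illustration; the CPU counts per node (48 and 72) are read off the two formulas above.

```python
SECONDS_PER_HOUR = 3600

# Allocated CPUs per node, as implied by the two formulas above.
CPUS_PER_NODE = {"HSW": 48, "BDW": 72}


def node_hours_to_cpu_seconds(node_hours, node_type):
    """Convert node hours to CPU seconds (allocated_cpus * wall_clock_seconds)."""
    return node_hours * SECONDS_PER_HOUR * CPUS_PER_NODE[node_type]


def cpu_seconds_to_node_hours(cpu_seconds, node_type):
    """Inverse conversion, e.g. to read a 'Raw Usage' value in node hours."""
    return cpu_seconds / (SECONDS_PER_HOUR * CPUS_PER_NODE[node_type])


print(node_hours_to_cpu_seconds(1, "HSW"))  # one Haswell node hour
print(node_hours_to_cpu_seconds(1, "BDW"))  # one Broadwell node hour
```

Note that a usage figure in CPU seconds can only be converted back to node hours if the node type of the jobs is known.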

HLRE projects are accounted in nodehours (as shown at https://luv.dkrz.de/projects/). Since Broadwell nodes have a lower CPU frequency than Haswell nodes but a higher core count, we expect both effects to compensate each other, resulting in a balanced nodehour metric. This might change in the future if different accounting weights are introduced for different node types.

FairShare Factor

While the other factors that enter a job's priority are fairly easy to understand, the FairShare factor needs to be explained in detail.

The FairShare factor does not involve a fixed allotment (like the granted NodeHours per project), whereby a user's access to a machine is cut off once that allotment is reached.

Instead, the FairShare factor serves to prioritize queued jobs such that those jobs charging accounts that are under-serviced are scheduled first, while jobs charging accounts that are over-serviced are scheduled when the machine would otherwise go idle.

The SLURM FairShare factor is therefore mainly based on the ratio of the computing resources a user's jobs have already consumed to the share of the computing resources that the user has been granted. The higher the FairShare value, the fewer resources were used compared to what was granted, and the higher the job is placed in the queue.

DKRZ uses a two-level share hierarchy. On the top level (parent level), project accounts are grouped according to their shareholder grant, e.g. project ba1234 is a bmbf project, while mh1234 is an mpg project. On the bottom level (account level), each project is represented by its granted NodeHours for the current period. Together these form the Normalized Shares per project:

S_project = (S_parent / S_parent-siblings) * (S_account / S_account-siblings)

As an example take the following settings (given by sshare)

             Account       User Raw Shares Norm Shares    Raw Usage Effectv Usage  FairShare
-------------------- ---------- ---------- -----------  ----------- ------------- ----------
root                                          1.000000 256867246717      0.000000   1.000000
 root                      root          1    0.000969
 bmbf                                  205    0.198643      1000000      0.000004
  ba1234                            238938    0.032619      1000000      0.000004   0.999915
  rest_bmbf                        1216131    0.166024
 mpg                                   205    0.198643  71017642553      0.276476
  mh1234                            238938    0.032619  16528791499      0.099181   0.121532
  mh4321                            238938    0.032619      1000000      0.045404   0.381050
  rest_mpg ...                      977193    0.133405
 rest                                  621    0.601744


Project mh1234, as a child of mpg, has the following shares:

sum of Raw Shares (parent level): S_parent-siblings = 1032
Raw Shares of mpg: S_parent = 205
Norm Shares of mpg: S_1 = 205/1032 = 0.198643 (i.e. ~20 % of the compute resources are for mpg)

sum of Raw Shares (account level): S_account-siblings = 1455069
Raw Shares of mh1234: S_account = 238938
Norm Shares of mh1234: S_2 = 238938/1455069 = 0.164211 (i.e. ~16.4 % of mpg share is for mh1234)

=> S_mh1234 = S_1 * S_2 = 0.032619 (i.e. ~3.3 % of the compute resources are for mh1234)
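
The two-level normalization can be checked with a short Python sketch. The dictionaries simply restate the Raw Shares from the sshare example; the function name normalized_share is hypothetical, and the key "rest_mpg" stands for the remaining mpg accounts that the example lists as a single row.

```python
# Raw Shares as listed by sshare in the example above.
PARENT_LEVEL = {"root": 1, "bmbf": 205, "mpg": 205, "rest": 621}
# "rest_mpg" stands for the remaining mpg accounts shown as one row above.
MPG_ACCOUNTS = {"mh1234": 238938, "mh4321": 238938, "rest_mpg": 977193}


def normalized_share(parent, parent_level, account, account_level):
    """S_project = (S_parent / S_parent-siblings) * (S_account / S_account-siblings)."""
    s_1 = parent_level[parent] / sum(parent_level.values())
    s_2 = account_level[account] / sum(account_level.values())
    return s_1 * s_2


s_mh1234 = normalized_share("mpg", PARENT_LEVEL, "mh1234", MPG_ACCOUNTS)
print(f"S_mh1234 = {s_mh1234:.6f}")
```

The printed value should agree with the 'Norm Shares' column for mh1234 in the sshare example.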

Users can query the current shares for all projects they belong to via

$ sshare

This shows for each SLURM account the value of S_project in column 'Norm Shares'. Furthermore, column 'Raw Usage' shows the already used CPUseconds for this account (U_account).

SLURM applies a further notion of fairness: the FairShare calculation takes into account not only the Raw Usage of an account but also the Raw Usage of its parent (i.e. the sum over all siblings). As a result, accounts under different parents with the same charged usage and the same shares can get a different FairShare because of their sibling accounts. This so-called 'Effective Usage' is ultimately used to calculate the Fair-share factor.

U_project-eff = U_account/U_total + ( (U_parent - U_account)/U_total * S_account / S_account-siblings)

For the example above this reads

U_mh1234-eff = U_mh1234/U_total + ( (U_mpg - U_mh1234)/U_total * S_mh1234 / S_mh1234-siblings)
	     = 16528791499/256867246717 + ( (71017642553 - 16528791499)/256867246717 * 238938 / 1455069 )
	     = 0.064348 + (0.212128 * 0.164211) = 0.099181

The sibling project mh4321 with much less Raw Usage but same shares gets:

U_mh4321-eff = U_mh4321/U_total + ( (U_mpg - U_mh4321)/U_total * S_mh4321 / S_mh4321-siblings)
	     = 1000000/256867246717 + ( (71017642553 - 1000000)/256867246717 * 238938 / 1455069 )
	     = 0.000004 + (0.276472 * 0.164211) = 0.045404

The project ba1234 with same prerequisites as the project mh4321 but different parent gets:

U_ba1234-eff = U_ba1234/U_total + ( (U_bmbf - U_ba1234)/U_total * S_ba1234 / S_ba1234-siblings)
	     = 1000000/256867246717 + ( (1000000 - 1000000)/256867246717 * 238938 / 1455069 )
	     = 0.000004 + (0.0 * 0.164211) = 0.000004
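
The three Effective Usage calculations above follow the same pattern and can be reproduced with the sketch below. The function and constant names are made up for illustration; all numbers are taken from the sshare example, with Raw Shares used for S_account and S_account-siblings.

```python
U_TOTAL = 256867246717          # Raw Usage of root, in CPU seconds
S_ACCOUNT_SIBLINGS = 1455069    # sum of Raw Shares on the account level


def effective_usage(u_account, u_parent, s_account,
                    s_account_siblings=S_ACCOUNT_SIBLINGS, u_total=U_TOTAL):
    """Own usage plus a share-weighted part of the siblings' usage."""
    own = u_account / u_total
    siblings = (u_parent - u_account) / u_total
    return own + siblings * s_account / s_account_siblings


U_MPG, U_BMBF = 71017642553, 1000000
for name, u_account, u_parent in [
    ("mh1234", 16528791499, U_MPG),
    ("mh4321", 1000000, U_MPG),
    ("ba1234", 1000000, U_BMBF),
]:
    print(f"U_{name}-eff = {effective_usage(u_account, u_parent, 238938):.6f}")
```

The second term is what separates mh4321 from ba1234: both have the same own usage, but only mh4321 inherits part of the heavy usage of its mpg siblings.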

Finally, the Fair-share factor of a project is

FS_project = 2**(-U_project-eff/S_project)

Again, for the example above this reads

FS_mh1234 = 2**(-U_mh1234-eff/S_mh1234) = 2**(-0.099181/0.032619) = 0.121532

The sibling project mh4321 gets a better FairShare since it used less resources so far:

FS_mh4321 = 2**(-U_mh4321-eff/S_mh4321) = 2**(-0.045404/0.032619) = 0.381050

while the project ba1234 is prioritized over mh4321 (although both already used the same amount of resources):

FS_ba1234 = 2**(-U_ba1234-eff/S_ba1234) = 2**(-0.000004/0.032619) = 0.999915
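
As a quick check, the Fair-share factors of the three example projects can be recomputed from the rounded values given above. The function name fairshare_factor is hypothetical, and because the inputs are already rounded to six digits, the last printed digit may differ slightly from what sshare reports.

```python
def fairshare_factor(effective_usage, normalized_shares):
    """FS_project = 2**(-U_project-eff / S_project)."""
    if normalized_shares <= 0:
        raise ValueError("normalized shares must be positive")
    return 2 ** (-effective_usage / normalized_shares)


S_PROJECT = 0.032619  # Norm Shares, identical for all three example projects
for name, u_eff in [("mh1234", 0.099181), ("mh4321", 0.045404), ("ba1234", 0.000004)]:
    print(f"FS_{name} = {fairshare_factor(u_eff, S_PROJECT):.6f}")
```

Because the usage enters through an exponent, the factor halves each time the effective usage grows by one further multiple of the project's normalized share.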

Some general examples:

  1. project X has not used any CPUtime so far (Raw Usage = 0) and no sibling has used any CPUtime either: U_project-eff = 0 => FS_project = 1
  2. compared to all others (sum of raw usage U_total), project X used exactly its granted share: U_project-eff = S_project => FS_project = 0.5

Attention: "half/all/double CPUtime used so far" cases are not directly mapped by SLURM FairShare! Only the ratio U_project-eff / S_project is of interest for FairShare!

In general: A FairShare factor above 0.5 indicates that the project has consumed less than its allocated share, while a FairShare factor below 0.5 indicates that the project has consumed more than its allocated share of the computing resources.
