Computational resources allocation and accounting¶
About¶
This document explains how we allocate and keep track of the projects’ computational resources. For details about the storage accounting, refer to Calculation of the storage space utilization.
In the text below, the terms, option names, and tools mentioned are related to the Slurm workload manager - the software we use at Discoverer for job control and computational resource management. Every authorized user can read the Slurm configuration on the login or compute nodes. All configuration files are located in the folder /etc/slurm.
Organization of the Slurm accounts¶
One Slurm account per project¶
Warning
All resource allocations on Discoverer are based on projects (groups of users). No resource allocation will be provided on a per-user basis.
One Slurm account is created for each accepted project. The Slurm account defines the quantity of computational resources allocated to the project and regulates their utilization. Those computational resources are exclusively accessible to the users who participate in the project. The principal investigator (PI) determines the list of users who will be able to use the resources managed by the Slurm account assigned to the project. It is not permissible for any external users to gain access to the Slurm account.
Each Slurm account has a unique name. That name contains the ID of the project. The system engineers at Discoverer create the Slurm account and allocate the computational resources that have been approved for the project. Those users who are eligible to utilize and share the allocated computational resources are added to the Slurm account.
Important
The system engineers at Discoverer will provide the PI and eligible users with the name of the Slurm account assigned to their project.
Trackable resources (TRES)¶
Important
The only trackable resource we have under Slurm’s control at this time is the CPU time.
The number of trackable resources assigned to each Slurm account is determined by the core or node hours requested by the principal investigator in the approved project proposal. To load the Slurm account with the specified number of core or node hours, that number has to be converted into CPU minutes. Below, we explain that conversion in detail, but before we proceed with the conversion formulas, let us introduce some basic definitions.
Note
1 core hour is the utilization or reservation of one CPU core for 1 hour
Our compute nodes are equipped with 2 x AMD EPYC 7H12 64-Core Processors (see Resource Overview). Since each compute node has 128 CPU cores, if those 128 CPU cores are being used for 1 hour of computation, that resource utilization corresponds to 1 node hour:
Note
1 node hour = 128 core hours
The CPU utilization is counted by Slurm in CPU minutes. To make those CPU minutes a meaningful measure and relate them to core and node hours, we need to take into account that hyperthreading is enabled on our processors. Each processor core runs two processor threads when hyperthreading is enabled. The Linux operating system, on the other hand, regards each thread as a single central processing unit (CPU). We therefore obtain the following relations between core/node hours and CPU minutes:
1 core hour = 2 × 60 CPU minutes
1 node hour = 128 × 2 × 60 CPU minutes
Important
Example: The PI requested 18 000 node hours on Discoverer for their project. In that case, the Slurm account for the project will be loaded with 18000 × 128 × 2 × 60 = 276 480 000 CPU minutes.
Maximum CPUs per Slurm account¶
The maximum number of CPUs per Slurm account determines how many processor threads can be used simultaneously by all currently running jobs associated with that account. The PI should initially specify that number in the project documentation; it is possible to request a change later, provided the modification is justified by the need to increase job productivity.
Important
Example: The PI requested 18 000 node hours on Discoverer for their project, to be spent by utilizing at most 512 CPU cores. The maximum number of CPUs used by all simultaneously running jobs associated with the project Slurm account will then be limited to 1024 (two threads per core). If the threshold of 1024 concurrently utilized processor threads is reached, newly submitted jobs will be queued and executed only once enough CPU threads are freed by the previously running jobs.
CPUs to node aggregation¶
To utilize the compute nodes better and prevent the inefficient spreading of parallel tasks across a large number of nodes, we may apply a maximum node utilization policy to each Slurm account. The goal of that policy is to limit the scattering of processes across many compute nodes and to keep the nodes more or less exclusive to the jobs running on them.
Important
Example: According to the project documentation, the PI requested the maximum utilization of 1024 CPU cores on Discoverer. Unless the PI has a valid reason to the contrary, we limit the maximum number of nodes in use by the Slurm account for the project to 1024 / 128 = 8. So, all the jobs running at the same time on that Slurm account cannot occupy more than 8 nodes at once.
If that policy noticeably hinders the parallel execution of jobs, we may change how it is applied to certain Slurm accounts and adjust the number of nodes assigned.
Maximum wall time¶
Wall time, also called real-world time, clock time, wall-clock time, or elapsed real time, is the expected time for completing a job. That time has to be specified in the Slurm batch script. Job submissions with unlimited wall time are not allowed on Discoverer.
The maximum wall time per job associated with the Slurm account of a project is provided by the PI in the accepted project application.
Maximum number of running jobs¶
Each Slurm project account has a limit on the number of jobs that can be executed simultaneously within the account. The number might be given by the PI in the accepted project application, or it might be set by our system engineers.
Maximum number of submitted jobs¶
That number is the maximum number of jobs that can be successfully submitted to the queue. If the PI does not give that number in the accepted project application, the system engineers in charge of the Slurm accounts assign it to the account. If the project needs to run job arrays, that number can be changed on request.
Users¶
Upon a proper request from the PI, the system engineers can assign users to a Slurm account. It is noteworthy that a user may be enrolled in multiple Slurm accounts if that user contributes to more than one project.
QoS¶
Every project Slurm account is assigned a default Quality of Service (QoS) object. Its role is to ensure that the limits are enforced, re-defined, or extended when necessary. If the project requires additional QoS objects, they may also be added to the account.
Displaying the resources loaded into the Slurm account and QoS¶
The limits loaded into the account, as well as the list of users and the QoS assigned to the account, can be displayed on the login node. One can accomplish this by executing a command line that displays the list of associations for the account:
sacctmgr show association where account=ehpc-reg-XXXXXX-YYY
The output represents a table. It consists of the following column names:
Cluster, Account, User, Partition, Share, Priority, GrpJobs, GrpTRES, GrpSubmit, GrpWall, GrpTRESMins, MaxJobs, MaxTRES, MaxTRESPerNode, MaxSubmit, MaxWall, MaxTRESMins, QOS, Def QOS, GrpTRESRunMin
To widen a column, add a format option to the command line and specify there the name of the column and its width (in number of symbols). For example, to set the width of the column GrpTRES to 20 symbols, execute:
sacctmgr show association where account=ehpc-reg-XXXXXX-YYY format=GrpTRES%20
Showing how many resources are left in the Slurm account¶
Currently, the utilization of the computational resources loaded into the account can be estimated by executing the following command line:
sshare -A ehpc-reg-XXXXXX-YYY -u " " -o account,user,GrpTRESRaw%80,GrpTRESMins,RawUsage
The CPU minutes spent so far by all jobs associated with the Slurm account are displayed in the column GrpTRESRaw (see the numbers assigned to cpu and billing there), whereas the number assigned to cpu in GrpTRESMins shows the CPU minutes loaded into the account (the limit). To obtain the core and node hours from those CPU minutes, apply the following formulas (see the explanations for the units above):
core hours = CPU minutes / 60 / 2
node hours = CPU minutes / 60 / 2 / 128
The number in RawUsage shows the total CPU seconds spent so far by all jobs linked to the Slurm account. That number should be used as a reference or to roughly check the accuracy of the computed CPU minutes. The RawUsage number is usually not something the users need to deal with, but they should include it in any report they send to the Discoverer helpdesk.
Important
Although it is true that the fair-share component is activated in slurm.conf (PriorityType=priority/multifactor), we do not decay the historic usage (PriorityDecayHalfLife=0) nor do we clean it (PriorityUsageResetPeriod=NONE). Based on that particular configuration of the fair-share component in slurm.conf, the output of the sshare command should provide a precise estimate of the computational resources spent or left in the Slurm account.
Warning
The output of the sshare command line displays numbers stored in Slurm's database, which is updated once every 5 minutes. Therefore, if you repeatedly invoke the sshare command line between two updates of the database values, you will receive the same numbers each time, even if there are ongoing jobs associated with the Slurm account at that time.
Displaying the computational resources used by a single job¶
Details¶
In its SQL database, Slurm stores the job duration as an integer number in a column called CPUTimeRAW. When hyperthreading is enabled, CPUTimeRAW measures the number of thread seconds spent on executing the job in scope:
CPUTimeRAW = thread seconds = job duration in seconds × number of threads requested
The thread hours are computed as:
thread hours = thread seconds / 3600
Since each CPU core runs two threads (hyperthreading is enabled on all compute nodes), the corresponding core hours are computed as:
core hours = thread hours / 2
Since the compute capacity of a single compute node is 256 threads:
node hours = thread hours / 256 = core hours / 128
Example¶
Let us collect the information about the CPU resources spent by a job with job ID 1234 and compute the corresponding core hours and node hours. First, we need to dump the value of CPUTimeRAW corresponding to the job with ID 1234:
$ sacct -j 1234 -X --format=CPUTimeRAW
The result:
CPUTimeRAW
----------
100102023
shows the number of thread seconds spent by the job. Afterwards, we apply the formulas from before and get the sought core hours and node hours:
core hours = 100102023 / 3600 / 2 = 13903.06
node hours = 100102023 / 3600 / 256 = 108.62
One may add to the format declaration an instruction for displaying the state of the job:
$ sacct -j 1234 -X --format=CPUTimeRAW,state%22
Displaying information about all jobs run by a user and the resources spent on them¶
Here, we provide some basic methods for displaying information about all jobs run by a single user and for estimating the computational resources spent on those jobs.
The example below is based on the tool sacct. It demonstrates how to display the most important parameters for estimating the actual job sizes and their status. When properly executed, this command line will display statistics about all jobs run by the user with username username between January 1, 2023 and October 31, 2023. The result will be a table with the following columns: job ID (jobid), number of nodes occupied (nnodes), number of CPUs utilized (ncpus), the CPU raw time spent on the job (cputimeraw), the elapsed time (elapsed), comparable to the wall time, and the state (state), deliberately elongated to 22 symbols:
sacct -S 2023-01-01 -E 2023-11-01 -u username -X --format=jobid,nnodes,ncpus,cputimeraw,elapsed,state%22
Keep in mind the cluster metrics here. Since hyperthreading is enabled on our processors, the number of CPUs reported is equal to the number of threads allocated. Consequently, the CPU raw time measures the number of thread seconds spent on the execution of the job. For further information about how to convert the CPU raw time into core hours and node hours, refer to the section "Displaying the computational resources used by a single job" above.
The state identifier is typically one of the following: RUNNING, COMPLETED, TIMEOUT, FAILED, CANCELLED, or CANCELLED by uid. All jobs with status RUNNING are currently being executed. Note that TIMEOUT means that the job did not finish within the wall time set by the user or by the Slurm account wall time limits. CANCELLED signifies that the job has been terminated by Slurm on an administrative level. If the job is cancelled by a user, the corresponding user ID (as a number) will be displayed in the status. For example, CANCELLED by 2001 means the job was cancelled by the user with ID 2001. It is possible to translate that UID number to the actual username by typing on the login node:
id -nu 2001
When a job is terminated with the status FAILED, it simply means that the exit status of the execution was not 0. Slurm itself is unable to provide further information about the reason for the job failure. That reason might be found in the log files created during the job execution in the submission folder. Sometimes, the log files cannot provide any clues about the job failure. If that is the case, the application needs to be run again, this time in debug mode, in order to gather additional data that could help to find the instructions in the code whose execution results in a non-zero exit status.
Getting help¶
See Getting help