Resource Overview

Aggregated computational resources

Discoverer (CPU cluster)

144384 (1128 × 128) hardware CPU cores

288768 (1128 × 256) logical CPUs

295.5 (0.25 × 1110 + 1 × 18) TiB RAM

Discoverer+ (GPU+CPU cluster)

32 (4 × 8) NVIDIA H200 GPU accelerators

448 (112 × 4) hardware CPU cores

896 (224 × 4) logical CPUs

7.84 (1.96 × 4) TiB RAM

Compute nodes

Discoverer (CPU cluster)

1128 compute nodes

1110 “regular” nodes and 18 “fat” nodes

Each compute node is equipped with:

Detailed hardware information:

See also Partitions (of nodes)

Discoverer+ (GPU+CPU cluster)

4 compute nodes (4 × DGX H200)

For detailed hardware configuration:

https://www.nvidia.com/en-us/data-center/dgx-h200/

Detailed hardware information:

Comparing NVIDIA H200 to NVIDIA H100

Memory capacity and bandwidth comparison:

  • Memory Capacity:

    • H100: 80 GB of HBM3 memory
    • H200: 141 GB of HBM3e memory (nearly double the H100’s capacity)
  • Memory Bandwidth:

    • H100: 3.35 TB/s
    • H200: 4.8 TB/s (43% higher bandwidth)

Performance and Productivity:

  • AI Inference Performance:

Based on MLPerf Inference v4.0 benchmarks using the Llama 2 70B model:

  • H100: 22,290 tokens/second (offline), 21,504 tokens/second (server)
  • H200: 31,712 tokens/second (offline), 29,526 tokens/second (server)

This represents approximately a 37-42% throughput increase!

Overall Performance:

  • The H200 delivers up to 45% more performance in specific generative AI and HPC benchmarks compared to the H100
  • For HPC workloads, the H200 provides 1.7x the performance of the H100

Energy Efficiency:

NVIDIA estimates the H200 uses up to 50% less energy for key LLM inference workloads compared to the H100, resulting in a 50% lower total cost of ownership over the device lifetime. Both GPUs share the same Hopper architecture and consume approximately 700W, but the H200’s enhanced memory and bandwidth allow it to complete tasks faster, reducing overall energy consumption per workload.
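
If you want to verify the GPU model, memory capacity, and power limit on a Discoverer+ node yourself, a minimal check (assuming you have a shell on one of the DGX H200 nodes, e.g. through an interactive Slurm job) is:

nvidia-smi --query-gpu=name,memory.total,power.limit --format=csv

The reported values should correspond to the H200 figures quoted above.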

Connectivity

Data center (per node)

Internet (overall)

10 Gbps aggregated maximum Internet connection speed (via BREN, GÉANT, Evolink)

Storage (per node)

Locally installed

  • Discoverer (CPU cluster):

No local storage devices are installed on the compute nodes of Discoverer (CPU cluster); they are diskless nodes.

  • Discoverer+ (CPU+GPU cluster):

Local storage devices are installed on the compute nodes of Discoverer+ (CPU+GPU cluster). The block devices on one of those nodes, as reported by lsblk, are:

nvme4n1     259:0    0   1.7T  0 disk
├─nvme4n1p1 259:2    0   512M  0 part  /boot/efi
└─nvme4n1p2 259:3    0   1.7T  0 part
  └─md0       9:0    0   1.7T  0 raid1 /
nvme5n1     259:1    0   1.7T  0 disk
├─nvme5n1p1 259:4    0   512M  0 part  /boot/efi
└─nvme5n1p2 259:5    0   1.7T  0 part
  └─md0       9:0    0   1.7T  0 raid1 /
nvme8n1     259:7    0   3.5T  0 disk
└─md1         9:1    0  24.5T  0 raid5 /raid
nvme2n1     259:9    0   3.5T  0 disk
└─md1         9:1    0  24.5T  0 raid5 /raid
nvme0n1     259:11   0   3.5T  0 disk
└─md1         9:1    0  24.5T  0 raid5 /raid
nvme1n1     259:13   0   3.5T  0 disk
└─md1         9:1    0  24.5T  0 raid5 /raid
nvme6n1     259:17   0   3.5T  0 disk
└─md1         9:1    0  24.5T  0 raid5 /raid
nvme9n1     259:18   0   3.5T  0 disk
└─md1         9:1    0  24.5T  0 raid5 /raid
nvme7n1     259:19   0   3.5T  0 disk
└─md1         9:1    0  24.5T  0 raid5 /raid
nvme3n1     259:21   0   3.5T  0 disk
└─md1         9:1    0  24.5T  0 raid5 /raid

These are the RAID volumes maintained on the compute nodes of Discoverer+ (CPU+GPU cluster), as reported in /proc/mdstat:

Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid5 nvme2n1[0] nvme0n1[3] nvme9n1[6] nvme8n1[8] nvme6n1[4] nvme3n1[2] nvme1n1[1] nvme7n1[5]
      26254240768 blocks super 1.2 level 5, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
      bitmap: 0/28 pages [0KB], 65536KB chunk

md0 : active raid1 nvme5n1p2[1] nvme4n1p2[0]
      1874715648 blocks super 1.2 [2/2] [UU]
      bitmap: 6/14 pages [24KB], 65536KB chunk

The operating system is installed on the RAID1 array /dev/md0, which consists of two Micron 7450 NVMe SSDs (1.7 TB each).

The RAID5 array /dev/md1 consists of eight Samsung PM1743 NVMe drives (model MZWLO3T8HCLS-00A07, 3.5 TB each). That array, with 24.5 TB of total capacity, is used as local scratch space on each of the compute nodes of Discoverer+ (CPU+GPU cluster).
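
To check the available capacity of the local scratch space and the health of the RAID arrays on a Discoverer+ compute node, you can run (assuming you have a shell on the node; /raid is the mount point shown in the listing above):

df -h /raid
cat /proc/mdstat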

Network-attached (shared)

  • Available on Discoverer (CPU cluster):

    • /home is NFS (over Ethernet) storage for home folders (size: 4.4 TB) [by NetApp]
    • /valhalla is Lustre (over InfiniBand) parallel scratch bulk storage (size: 5.1 PB on NVMe) [by HPE on Cray ClusterStor E1000]
  • Available on Discoverer+ (CPU+GPU cluster):

    • /valhalla is Lustre (over InfiniBand) parallel scratch bulk storage (size: 5.1 PB on NVMe) [by HPE on Cray ClusterStor E1000]
    • /weka is WEKA (over InfiniBand) very fast parallel scratch bulk storage (size: 273 TB on NVMe) [by WEKA on WEKA cluster 4.4]
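
To see how much space is currently used on the shared file systems listed above, you can query them with standard tools. A simple sketch, run for example on a Discoverer login or compute node, and assuming the Lustre client utilities (lfs) are available there:

df -h /home /valhalla
lfs df -h /valhalla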

Partitions (of nodes)

With respect to the control and aggregation of compute resources (managed by Slurm), and their hosting location, the compute nodes are organized into partitions:

Summary

Slurm cluster name   Partition name   Number of nodes   Participating nodes (list of host names)
discoverer           ALL              1128              cn[0001-1110], fn[01-18]
discoverer           cn               1110              cn[0001-1110]
discoverer           fn               18                fn[01-18]
disco-plus           common           2                 dgx[1-2]
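
A similar summary can be obtained directly from Slurm on the respective login node (the exact column layout differs slightly from the table above):

sinfo -s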

Based on rack location (Discoverer CPU cluster)

Name Num of nodes Participating nodes (list of host names)*
rack1 96 cn[0001-0096]
rack2 96 cn[0097-0192]
rack3 96 cn[0193-0288]
rack4 96 cn[0289-0384]
rack5 96 cn[0385-0480]
rack6 96 cn[0481-0576]
rack7 96 cn[0577-0672]
rack8 96 cn[0673-0768]
rack9 96 cn[0769-0864]
rack10 96 cn[0865-0960]
rack11 96 cn[0961-1056]
rack12 72 cn[1057-1110], fn[01-18]

Based on IB connectivity (per switch, Discoverer CPU cluster)

Name Num of nodes Participating nodes (list of host names)*
pm1-isw0 24 cn[0001-0012,0025-0030,0039-0040,0043-0044,0047-0048]
pm1-isw1 24 cn[0013-0024,0031-0038,0041-0042,0045-0046]
pm1-isw2 24 cn[0061-0073,0076,0079,0082,0085-0088,0091-0094]
pm1-isw3 24 cn[0049-0060,0074-0075,0077-0078,0080-0081,0083-0084,0089-0090,0095-0096]
pm2-isw0 24 cn[0097-0108,0121-0126,0135-0136,0139-0140,0143-0144]
pm2-isw1 24 cn[0109-0120,0127-0134,0137-0138,0141-0142]
pm2-isw2 24 cn[0157-0169,0172,0175,0178,0181-0184,0187-0190]
pm2-isw3 24 cn[0145-0156,0170-0171,0173-0174,0176-0177,0179-0180,0185-0186,0191-0192]
pm3-isw0 24 cn[0193-0204,0217-0222,0231-0232,0235-0236,0239-0240]
pm3-isw1 24 cn[0205-0216,0223-0230,0233-0234,0237-0238]
pm3-isw2 24 cn[0253-0265,0268,0271,0274,0277-0280,0283-0286]
pm3-isw3 24 cn[0241-0252,0266-0267,0269-0270,0272-0273,0275-0276,0281-0282,0287-0288]
pm4-isw0 24 cn[0289-0300,0313-0318,0327-0328,0331-0332,0335-0336]
pm4-isw1 24 cn[0301-0312,0319-0326,0329-0330,0333-0334]
pm4-isw2 24 cn[0349-0361,0364,0367,0370,0373-0376,0379-0382]
pm4-isw3 24 cn[0337-0348,0362-0363,0365-0366,0368-0369,0371-0372,0377-0378,0383-0384]
pm5-isw0 24 cn[0385-0396,0409-0414,0423-0424,0427-0428,0431-0432]
pm5-isw1 24 cn[0397-0408,0415-0422,0425-0426,0429-0430]
pm5-isw2 24 cn[0445-0457,0460,0463,0466,0469-0472,0475-0478]
pm5-isw3 24 cn[0433-0444,0458-0459,0461-0462,0464-0465,0467-0468,0473-0474,0479-0480]
pm6-isw0 24 cn[0481-0492,0505-0510,0519-0520,0523-0524,0527-0528]
pm6-isw1 24 cn[0493-0504,0511-0518,0521-0522,0525-0526]
pm6-isw2 24 cn[0541-0553,0556,0559,0562,0565-0568,0571-0574]
pm6-isw3 24 cn[0529-0540,0554-0555,0557-0558,0560-0561,0563-0564,0569-0570,0575-0576]
pm7-isw0 24 cn[0577-0588,0601-0606,0615-0616,0619-0620,0623-0624]
pm7-isw1 24 cn[0589-0600,0607-0614,0617-0618,0621-0622]
pm7-isw2 24 cn[0637-0649,0652,0655,0658,0661-0664,0667-0670]
pm7-isw3 24 cn[0625-0636,0650-0651,0653-0654,0656-0657,0659-0660,0665-0666,0671-0672]
pm8-isw0 24 cn[0673-0684,0697-0702,0711-0712,0715-0716,0719-0720]
pm8-isw1 24 cn[0685-0696,0703-0710,0713-0714,0717-0718]
pm8-isw2 24 cn[0733-0745,0748,0751,0754,0757-0760,0763-0766]
pm8-isw3 24 cn[0721-0732,0746-0747,0749-0750,0752-0753,0755-0756,0761-0762,0767-0768]
pm9-isw0 24 cn[0769-0780,0793-0798,0807-0808,0811-0812,0815-0816]
pm9-isw1 24 cn[0781-0792,0799-0806,0809-0810,0813-0814]
pm9-isw2 24 cn[0829-0841,0844,0847,0850,0853-0856,0859-0862]
pm9-isw3 24 cn[0817-0828,0842-0843,0845-0846,0848-0849,0851-0852,0857-0858,0863-0864]
pm10-isw0 24 cn[0865-0876,0889-0894,0903-0904,0907-0908,0911-0912]
pm10-isw1 24 cn[0877-0888,0895-0902,0905-0906,0909-0910]
pm10-isw2 24 cn[0925-0937,0940,0943,0946,0949-0952,0955-0958]
pm10-isw3 24 cn[0913-0924,0938-0939,0941-0942,0944-0945,0947-0948,0953-0954,0959-0960]
pm11-isw0 24 cn[0961-0972,0985-0990,0999-1000,1003-1004,1007-1008]
pm11-isw1 24 cn[0973-0984,0991-0998,1001-1002,1005-1006]
pm11-isw2 24 cn[1021-1033,1036,1039,1042,1045-1048,1051-1054]
pm11-isw3 24 cn[1009-1020,1034-1035,1037-1038,1040-1041,1043-1044,1049-1050,1055-1056]
pm12-isw0 24 cn[1057-1068,1081-1086,1095-1096,1099-1100,1103-1104]
pm12-isw1 24 cn[1069-1080,1087-1094,1097-1098,1101-1102]
pm12-isw2 12 fn[07-18]
pm12-isw3 12 cn[1105-1110],fn[01-06]
  • “cn” stands for a “regular” node, “fn” for a “fat” node, and “dgx” for a “DGX H200” node

Display partition and compute node information

Important

We use Slurm to manage the compute resources and to submit jobs to the compute nodes. The Slurm configuration file is located at /etc/slurm/slurm.conf. This file is readable by everyone, but it can be modified only by the system administrator. All information presented below relates to the Slurm management of resources across our clusters.
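
Since the configuration file is world-readable, you can, for instance, list the partition definitions it contains (a simple sketch; some settings may also live in files included from the main configuration):

grep -i '^PartitionName' /etc/slurm/slurm.conf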

Show the partitions of compute nodes

The partitions of compute nodes are structures defined in the Slurm configuration file. They are used to group the nodes into logical units and to control the distribution of resources available to the jobs running on these nodes. Each partition is identified by a name and can contain one or more nodes. It has its own set of parameters that control the execution of jobs on the partition.

sinfo

The result appears as a table with the following columns:

  • PARTITION - The name of the partition
  • AVAIL - The availability of the partition
  • TIMELIMIT - The time limit for the partition
  • NODES - The number of nodes in the partition
  • STATE - The state of the partition
  • NODELIST - The list of nodes in the partition

The NODELIST column contains the list of nodes in the partition. Each node is represented by its name, and the nodes are separated by commas, unless their names contain consecutive numbers. In the latter case they are represented by a range of numbers in square brackets (aggregated range).
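
If you need the individual host names instead of the aggregated range, Slurm can expand it for you. For example (the node range here is just an illustration):

scontrol show hostnames "cn[0001-0004,0010]"

Each host name is printed on a separate line, which is convenient for scripting.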

The STATE column indicates the state of the nodes in the given partition. The possible values are:

  • idle - Available for accepting and running new jobs
  • alloc - Currently fully allocated to running jobs
  • mix - Partially allocated (some CPUs/cores in use)
  • drain - Being administratively drained (unavailable to new jobs, existing jobs are still running)
  • drain* - Failed node (no new jobs, no existing jobs, likely due to hardware failure)

Note

One node can be configured as a member of multiple partitions. For example, the node cn0001 participates in the partitions cn, rack1, pm1-isw0, and ALL. Nodes in the “drain” state (no * after drain) have been administratively drained by the system administrator and cannot be used by jobs submitted to the partition. That does not mean those nodes are broken or unusable; they are simply marked as drained to prevent new jobs from being scheduled on them. A node in the “drain*” state, however, is completely drained, no jobs can run on it, and most probably an error has occurred on the node.

You may preview the nodes that are in the drain state in each partition using the following command:

sinfo -t drain

In case you want to check the drain status, and the recorded reason, for a specific node, you can use the following command (replace <node_name> with the actual host name):

sinfo -R -n <node_name>

Show the nodes in a specific partition only

The example below shows the nodes participating in the partition cn and their current state:

sinfo -p cn

Example output:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cn           up   infinite     21 drain* cn[0004-0005,0061,0070-0075,0082-0084,0283-0285,0406-0408,0535,0544,0952]
cn           up   infinite     19  drain cn[0007,0048,0062-0063,0094-0096,0289,0334-0336,0536-0537,0541,0545-0546,0558,0560-0561]
cn           up   infinite     90    mix cn[0064-0069,0076,0079,0085-0088,0145-0156,0170-0171,0173-0174,0176-0177,0179-0180,0185-0186,0191-0192,0253-0254,0261-0265,0268,0271,0274,0277-0278,0327-0331,0333,0397-0405,0415-0422,0425-0426,0429,0542,0548,0793-0798,0807-0808,0811-0812,1022,1106-1108]
cn           up   infinite    241  alloc cn[0001-0003,0006,0008-0047,0049-0060,0077-0078,0080-0081,0089-0090,0093,0097-0144,0157-0169,0172,0175,0178,0181-0184,0187-0190,0290-0294,0301-0312,0319-0326,0337-0348,0362-0363,0365-0366,0368-0369,0371-0372,0377-0378,0383-0384,0430,0543,0547,0549-0553,0556,0559,0562,0564-0568,0571-0572,0576,0769-0780,0925-0937,0940,0943,0946,0949-0951,0955-0958,1105,1109-1110]
cn           up   infinite    739   idle cn[0091-0092,0193-0252,0255-0260,0266-0267,0269-0270,0272-0273,0275-0276,0279-0282,0286-0288,0295-0300,0313-0318,0332,0349-0361,0364,0367,0370,0373-0376,0379-0382,0385-0396,0409-0414,0423-0424,0427-0428,0431-0534,0538-0540,0554-0555,0557,0563,0569-0570,0573-0575,0577-0768,0781-0792,0799-0806,0809-0810,0813-0924,0938-0939,0941-0942,0944-0945,0947-0948,0953-0954,0959-1021,1023-1104]
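
If you only need the number of nodes per state in the partition, you can ask sinfo to group them (a minimal sketch using the %T and %D format fields):

sinfo -p cn -o "%T %D"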

Show memory, CPU, and GPU information for each node in the partition:

Note

The GPUs are presented in the output as generic resources (GRES) in the format gpu:<number_of_gpus>, where <number_of_gpus> is the number of GPUs available on the displayed node.

sinfo -N -p common -o "%N %T %C %G %m %c"

Example output:

NODELIST STATE CPUS(A/I/O/T) GRES MEMORY CPUS
dgx1 mixed 100/124/0/224 gpu:8 2063425 224
dgx2 mixed 16/208/0/224 gpu:7 2063425 224

Important

Note that if no GPUs are available on the node, the GRES column is empty (null):

NODELIST STATE CPUS(A/I/O/T) GRES MEMORY CPUS
cn0001 allocated 256/0/0/256 (null) 257700 256
cn0002 allocated 256/0/0/256 (null) 257700 256
cn0003 allocated 256/0/0/256 (null) 257700 256

The state is reported in the STATE column. The possible values are:

  • idle - Available for accepting and running new jobs
  • allocated - Currently all trackable resources are fully allocated to running jobs
  • mixed - Partially allocated (some trackable resources are in use)
  • drained - Being administratively drained (unavailable to new jobs, existing jobs are still running)
  • drained* - Failed node (no new jobs, no existing jobs, likely due to hardware failure)

The CPU allocation format A/I/O/T consists of:

  • A - Allocated CPUs (in use by jobs)
  • I - Idle CPUs (available for new jobs)
  • O - Other CPUs (reserved, offline, etc.)
  • T - Total CPUs on the node

Parameters used to control the resources available on the partition

This command shows the parameters that control the execution of jobs on each partition, in the format of Slurm configuration options:

scontrol show partition

You may preview the parameters for a specific partition as well:

scontrol show partition cn

In both cases, the output will contain this type of structured information:

PartitionName=cn
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=cn[0001-1110]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=284160 TotalNodes=1110 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

To “decipher” this output you need to know the meaning of the parameters. The most important parameters are listed below (an example of extracting one of them follows the list):

  • MaxTime - The maximum time limit for jobs in the partition
  • MaxNodes - The maximum number of nodes that may be allocated to a single job in the partition
  • MaxCPUsPerNode - The maximum number of CPUs that jobs may use on any node of the partition
  • MaxMemPerNode - The maximum memory that jobs may use on any node of the partition
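
For example, to extract just one of these parameters from the output (a simple grep-based sketch):

scontrol show partition cn | grep -o 'MaxTime=[^ ]*'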

The other parameters can be found in the Slurm documentation: https://slurm.schedmd.com/scontrol.html

Information about the nodes

Note

The information about a node does not depend on the partition it belongs to; it is global information about the nodes in the cluster. However, a node can be a member of multiple partitions, in which case its information is displayed for each partition it belongs to.

Brief node information can be displayed with the following command:

sinfo -N -p cn

Example output:

NODELIST   NODES PARTITION STATE
cn0001         1        cn alloc
cn0002         1        cn alloc
cn0003         1        cn alloc
cn0004         1        cn drain*
cn0005         1        cn drain*
cn0006         1        cn alloc
cn0007         1        cn drain
cn0008         1        cn alloc
cn0009         1        cn alloc
cn0010         1        cn alloc
cn0011         1        cn alloc

The state of the resources on each node can be printed out using the following command:

sinfo -N -p cn -o "%N %T %C %m %c"

Example output:

NODELIST STATE CPUS(A/I/O/T) MEMORY CPUS
cn0001 allocated 256/0/0/256 257700 256
cn0002 allocated 256/0/0/256 257700 256
cn0003 allocated 256/0/0/256 257700 256
cn0004 drained* 0/0/256/256 257700 256
cn0005 drained* 0/0/256/256 257700 256
cn0006 allocated 256/0/0/256 257700 256
cn0007 drained 0/0/256/256 257700 256
cn0008 allocated 256/0/0/256 257700 256
cn0064 mixed 128/128/0/256 257700 256
cn0065 mixed 128/128/0/256 257700 256

CPU allocation format A/I/O/T:

  • A - Allocated CPUs (in use by jobs)
  • I - Idle CPUs (available for new jobs)
  • O - Other CPUs (reserved, offline, etc.)
  • T - Total CPUs on the node

Note

The number 256 here refers to the total number of CPU threads available per node. The Linux OS running on the nodes discovers the CPU threads as logical CPUs and reports them as processors to the Linux kernel. The kernel, in turn, reports them to Slurm, which is why we see 256 logical CPUs per node. In fact, each node has 128 physical CPU cores, but due to hyper-threading we see 256 logical CPUs per node (see above).
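
You can verify this on a compute node (for example, from within an interactive job) with lscpu; the relevant lines can be filtered like this:

lscpu | grep -E 'Socket|Core|Thread|^CPU\(s\)'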

Examples:

  • 256/0/0/256 - All 256 CPUs allocated (fully busy)
  • 128/128/0/256 - 128 CPUs allocated, 128 idle (mixed state)
  • 0/0/256/256 - All 256 CPUs offline (drained nodes)
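
To inspect the resource state of a single node only, you can restrict the same sinfo call to that node (cn0064 here is just an example host name):

sinfo -N -n cn0064 -o "%N %T %C %m %c"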

Show information for each node in the cluster

Most of the output produced by the command below is formatted as Slurm configuration options.

scontrol show nodes

Example output:

NodeName=cn1108 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=12 CPUTot=256 CPULoad=6.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=cn1108-interconnect-1 NodeHostName=cn1108 Version=20.02.6-Bull.1.1
   OS=Linux 4.18.0-305.12.1.el8_4.x86_64 #1 SMP Mon Jul 26 08:06:24 EDT 2021
   RealMemory=257700 AllocMem=12288 FreeMem=242512 Sockets=8 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=cn,rack12,pm12-isw3,ALL
   BootTime=2025-09-02T13:49:53 SlurmdStartTime=2025-09-02T13:51:12
   CfgTRES=cpu=256,mem=257700M,billing=256
   AllocTRES=cpu=12,mem=12G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=fn18 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=256 CPUTot=256 CPULoad=48.06
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=fn18-interconnect-1 NodeHostName=fn18 Version=20.02.6-Bull.1.1
   OS=Linux 4.18.0-305.12.1.el8_4.x86_64 #1 SMP Mon Jul 26 08:06:24 EDT 2021
   RealMemory=1031800 AllocMem=262144 FreeMem=729398 Sockets=8 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=fn,rack12,pm12-isw2,ALL
   BootTime=2025-09-02T14:43:50 SlurmdStartTime=2025-09-02T14:45:13
   CfgTRES=cpu=256,mem=1031800M,billing=256
   AllocTRES=cpu=256,mem=256G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Important fields in the output are explained below:

  • NodeName - Unique identifier for the compute node
  • State - Current node state (ALLOCATED, MIXED, IDLE, DRAINED, etc.)
  • CPUAlloc/CPUTot - CPUs allocated to jobs vs. total CPUs (e.g., 12/256 means 12 CPUs, i.e. 12 CPU threads, are in use out of 256 total)
  • CPULoad - Current CPU load (system load average) reported by the node
  • RealMemory - Total physical memory in MB
  • AllocMem/FreeMem - Allocated vs Free memory in MB
  • Sockets - Number of NUMA domains/NUMA nodes on the node (not physical CPU sockets)
  • ThreadsPerCore - Hyperthreading enabled (2 = 2 threads per physical core)
  • Partitions - Which Slurm partitions this node belongs to
  • BootTime/SlurmdStartTime - When the node and Slurm daemon started
  • CfgTRES - Configured trackable resources (total available)
  • AllocTRES - Currently allocated trackable resources
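
To display the same information for a single node only, pass its name to scontrol (cn1108 is the node shown in the example above):

scontrol show node cn1108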

The other parameters can be found in the Slurm documentation: https://slurm.schedmd.com/scontrol.html