Resource Overview¶
Aggregated computational resources¶
Discoverer (CPU cluster)¶
144384 (1128 × 128) hardware CPU cores
288768 (1128 × 256) logical CPUs
295.5 (0.25 × 1110 + 1 × 18 ) TiB RAM
Discoverer+ (GPU+CPU cluster)¶
32 (4 × 8) NVIDIA H200 GPU accelerators
448 (112 × 4) hardware CPU cores
896 (224 × 4) logical CPUs
7.84 (1.96 × 4) TiB RAM
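As a quick cross-check of the aggregated figures, Slurm can report how many nodes of each hardware configuration it manages (a minimal sketch using standard sinfo options; run it on the login node of the respective cluster):
sinfo -e -o "%D %c %m"
Each output row lists a group of identically configured nodes within a partition: node count, logical CPUs per node, and memory per node in MB, from which the totals above can be recomputed.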
Compute nodes¶
Discoverer (CPU cluster)¶
1128 compute nodes
1110 “regular” nodes + 18 “fat” nodes
Each compute node is equipped with:
- CPU: 2 × AMD EPYC 7H12 64-Core Processor (reported to Slurm as Sockets 8, CoresPerSocket 16, ThreadsPerCore 2, where Sockets counts NUMA domains rather than physical sockets)
- RAM:
- per “regular” node : 256 GB (Micron Technology Inc. DDR4, 3200 MT/s, 16 × 16 GB)
- per “fat” node: 1 TB (Micron Technology Inc. DDR4, 3200 MT/s, 16 × 64 GB)
- Ethernet: 2 × 1000BASE-T, 100BASE-TX, 10BASE-T Intel® Ethernet Controller I350
- InfiniBand: Mellanox Technologies MT28908 Family [ConnectX-6], 1 × 200Gb/s
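If you want to verify this configuration from within a job running on a compute node, standard Linux tools report the same data (a hedged sketch; none of these commands is specific to Discoverer, and the InfiniBand tools may not be installed in every environment):
lscpu        # CPU model, sockets, cores per socket, threads per core
free -h      # installed RAM
ip -br link  # network interfaces
ibstat       # InfiniBand HCA state and link rate (requires infiniband-diags)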
For detailed hardware information, see also Partitions (of nodes).
Discoverer+ (GPU+CPU cluster)¶
4 compute nodes (4 × DGX H200)
For detailed hardware configuration:
https://www.nvidia.com/en-us/data-center/dgx-h200/
Comparing NVIDIA H200 to NVIDIA H100¶
Memory capacity and bandwidth comparison:
Memory Capacity:
- H100: 80 GB of HBM3 memory
- H200: 141 GB of HBM3e memory (nearly double the H100’s capacity)
Memory Bandwidth:
- H100: 3.35 TB/s
- H200: 4.8 TB/s (43% higher bandwidth)
Performance and Productivity:
- AI Inference Performance:
Based on MLPerf Inference v4.0 benchmarks using the Llama 2 70B model:
- H100: 22,290 tokens/second (offline), 21,504 tokens/second (server)
- H200: 31,712 tokens/second (offline), 29,526 tokens/second (server)
This represents approximately a 37-42% throughput increase!
Overall Performance:
- The H200 delivers up to 45% more performance in specific generative AI and HPC benchmarks compared to the H100
- For HPC workloads, the H200 provides 1.7x the performance of the H100
Energy Efficiency:
NVIDIA estimates the H200 uses up to 50% less energy for key LLM inference workloads compared to the H100, resulting in a 50% lower total cost of ownership over the device lifetime. Both GPUs share the same Hopper architecture and consume approximately 700W, but the H200’s enhanced memory and bandwidth allow it to complete tasks faster, reducing overall energy consumption per workload.
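To check the GPU model and memory actually visible on a Discoverer+ node, nvidia-smi can be queried directly (a minimal sketch; it assumes you run it inside a job that has GPUs allocated to it):
nvidia-smi --query-gpu=index,name,memory.total --format=csv
On an H200 node, each listed GPU should report on the order of 141 GB of HBM3e memory.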
Connectivity¶
Data center (per node)¶
- Discoverer (CPU cluster):
- 1 × InfiniBand controller: 200 Gbps [Mellanox Technologies MT28908 Family [ConnectX-6]]
- 1 × Ethernet controller: 10 Gbps [Intel® Ethernet Controller I350]
- Discoverer+ (CPU+GPU cluster):
- 10 × InfiniBand controller: 400 Gbps [Mellanox Technologies MT2910 Family [ConnectX-7]]
- 2 × Ethernet controller: 400 Gbps [Mellanox Technologies MT2910 Family [ConnectX-7]]
- 2 × Ethernet Controller: 100 Gbps [Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)]
- 1 × Ethernet controller: 10 Gbps [Intel Corporation Ethernet Controller X550 (rev 01)]
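The controllers listed above can be enumerated on a node with lspci (a hedged example; the exact device strings may differ between firmware and driver versions):
lspci | grep -i -E 'ethernet|infiniband'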
Storage (per node)¶
Locally installed¶
- Discoverer (CPU cluster):
No local storage devices are installed on the compute nodes of Discoverer (CPU cluster); they are diskless nodes.
- Discoverer+ (CPU+GPU cluster):
Local storage devices are installed on the compute nodes of Discoverer+ (CPU+GPU cluster).
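The block device layout shown below corresponds to the default output of lsblk, which you can run on a Discoverer+ node to reproduce it:
lsblk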
nvme4n1 259:0 0 1.7T 0 disk
├─nvme4n1p1 259:2 0 512M 0 part /boot/efi
└─nvme4n1p2 259:3 0 1.7T 0 part
└─md0 9:0 0 1.7T 0 raid1 /
nvme5n1 259:1 0 1.7T 0 disk
├─nvme5n1p1 259:4 0 512M 0 part /boot/efi
└─nvme5n1p2 259:5 0 1.7T 0 part
└─md0 9:0 0 1.7T 0 raid1 /
nvme8n1 259:7 0 3.5T 0 disk
└─md1 9:1 0 24.5T 0 raid5 /raid
nvme2n1 259:9 0 3.5T 0 disk
└─md1 9:1 0 24.5T 0 raid5 /raid
nvme0n1 259:11 0 3.5T 0 disk
└─md1 9:1 0 24.5T 0 raid5 /raid
nvme1n1 259:13 0 3.5T 0 disk
└─md1 9:1 0 24.5T 0 raid5 /raid
nvme6n1 259:17 0 3.5T 0 disk
└─md1 9:1 0 24.5T 0 raid5 /raid
nvme9n1 259:18 0 3.5T 0 disk
└─md1 9:1 0 24.5T 0 raid5 /raid
nvme7n1 259:19 0 3.5T 0 disk
└─md1 9:1 0 24.5T 0 raid5 /raid
nvme3n1 259:21 0 3.5T 0 disk
└─md1 9:1 0 24.5T 0 raid5 /raid
These are the RAID volumes maintained on the compute nodes of Discoverer+ (CPU+GPU cluster).
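The listing below comes from /proc/mdstat and can be displayed on a node with:
cat /proc/mdstat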
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid5 nvme2n1[0] nvme0n1[3] nvme9n1[6] nvme8n1[8] nvme6n1[4] nvme3n1[2] nvme1n1[1] nvme7n1[5]
26254240768 blocks super 1.2 level 5, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
bitmap: 0/28 pages [0KB], 65536KB chunk
md0 : active raid1 nvme5n1p2[1] nvme4n1p2[0]
1874715648 blocks super 1.2 [2/2] [UU]
bitmap: 6/14 pages [24KB], 65536KB chunk
The operating system is installed on the RAID1 array /dev/md0, which consists of two Micron 7450 NVMe SSDs (1.7 TB each).
The RAID5 array /dev/md1 consists of eight Samsung PM1743 MZWLO3T8HCLS-00A07 NVMe drives (3.5 TB each). That array, with 24.5 TB of total capacity, is used as local scratch space on each of the compute nodes of Discoverer+ (CPU+GPU cluster).
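To check how much of that local scratch space is currently free on the node your job runs on, query its mount point (the /raid mount point is taken from the lsblk output above):
df -h /raid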
Partitions (of nodes)¶
With respect to the control and aggregation of compute resources (managed by Slurm), and their hosting location, the compute nodes are organized into partitions:
Summary¶
Slurm cluster name   Partition name   Number of nodes   Participating nodes (list of host names)
discoverer           ALL              1128              cn[0001-1110], fn[01-18]
discoverer           cn               1110              cn[0001-1110]
discoverer           fn               18                fn[01-18]
disco-plus           common           2                 dgx[1-2]
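A condensed live view of the same information (partition names, node counts, and aggregated node states) can be obtained on the corresponding login node with:
sinfo -s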
Based on rack location (Discoverer CPU cluster)¶
Name     Num of nodes   Participating nodes (list of host names)*
rack1    96             cn[0001-0096]
rack2    96             cn[0097-0192]
rack3    96             cn[0193-0288]
rack4    96             cn[0289-0384]
rack5    96             cn[0385-0480]
rack6    96             cn[0481-0576]
rack7    96             cn[0577-0672]
rack8    96             cn[0673-0768]
rack9    96             cn[0769-0864]
rack10   96             cn[0865-0960]
rack11   96             cn[0961-1056]
rack12   72             cn[1057-1110], fn[01-18]
Based on IB connectivity (per switch, Discoverer CPU cluster)¶
Name        Num of nodes   Participating nodes (list of host names)*
pm1-isw0    24             cn[0001-0012,0025-0030,0039-0040,0043-0044,0047-0048]
pm1-isw1    24             cn[0013-0024,0031-0038,0041-0042,0045-0046]
pm1-isw2    24             cn[0061-0073,0076,0079,0082,0085-0088,0091-0094]
pm1-isw3    24             cn[0049-0060,0074-0075,0077-0078,0080-0081,0083-0084,0089-0090,0095-0096]
pm2-isw0    24             cn[0097-0108,0121-0126,0135-0136,0139-0140,0143-0144]
pm2-isw1    24             cn[0109-0120,0127-0134,0137-0138,0141-0142]
pm2-isw2    24             cn[0157-0169,0172,0175,0178,0181-0184,0187-0190]
pm2-isw3    24             cn[0145-0156,0170-0171,0173-0174,0176-0177,0179-0180,0185-0186,0191-0192]
pm3-isw0    24             cn[0193-0204,0217-0222,0231-0232,0235-0236,0239-0240]
pm3-isw1    24             cn[0205-0216,0223-0230,0233-0234,0237-0238]
pm3-isw2    24             cn[0253-0265,0268,0271,0274,0277-0280,0283-0286]
pm3-isw3    24             cn[0241-0252,0266-0267,0269-0270,0272-0273,0275-0276,0281-0282,0287-0288]
pm4-isw0    24             cn[0289-0300,0313-0318,0327-0328,0331-0332,0335-0336]
pm4-isw1    24             cn[0301-0312,0319-0326,0329-0330,0333-0334]
pm4-isw2    24             cn[0349-0361,0364,0367,0370,0373-0376,0379-0382]
pm4-isw3    24             cn[0337-0348,0362-0363,0365-0366,0368-0369,0371-0372,0377-0378,0383-0384]
pm5-isw0    24             cn[0385-0396,0409-0414,0423-0424,0427-0428,0431-0432]
pm5-isw1    24             cn[0397-0408,0415-0422,0425-0426,0429-0430]
pm5-isw2    24             cn[0445-0457,0460,0463,0466,0469-0472,0475-0478]
pm5-isw3    24             cn[0433-0444,0458-0459,0461-0462,0464-0465,0467-0468,0473-0474,0479-0480]
pm6-isw0    24             cn[0481-0492,0505-0510,0519-0520,0523-0524,0527-0528]
pm6-isw1    24             cn[0493-0504,0511-0518,0521-0522,0525-0526]
pm6-isw2    24             cn[0541-0553,0556,0559,0562,0565-0568,0571-0574]
pm6-isw3    24             cn[0529-0540,0554-0555,0557-0558,0560-0561,0563-0564,0569-0570,0575-0576]
pm7-isw0    24             cn[0577-0588,0601-0606,0615-0616,0619-0620,0623-0624]
pm7-isw1    24             cn[0589-0600,0607-0614,0617-0618,0621-0622]
pm7-isw2    24             cn[0637-0649,0652,0655,0658,0661-0664,0667-0670]
pm7-isw3    24             cn[0625-0636,0650-0651,0653-0654,0656-0657,0659-0660,0665-0666,0671-0672]
pm8-isw0    24             cn[0673-0684,0697-0702,0711-0712,0715-0716,0719-0720]
pm8-isw1    24             cn[0685-0696,0703-0710,0713-0714,0717-0718]
pm8-isw2    24             cn[0733-0745,0748,0751,0754,0757-0760,0763-0766]
pm8-isw3    24             cn[0721-0732,0746-0747,0749-0750,0752-0753,0755-0756,0761-0762,0767-0768]
pm9-isw0    24             cn[0769-0780,0793-0798,0807-0808,0811-0812,0815-0816]
pm9-isw1    24             cn[0781-0792,0799-0806,0809-0810,0813-0814]
pm9-isw2    24             cn[0829-0841,0844,0847,0850,0853-0856,0859-0862]
pm9-isw3    24             cn[0817-0828,0842-0843,0845-0846,0848-0849,0851-0852,0857-0858,0863-0864]
pm10-isw0   24             cn[0865-0876,0889-0894,0903-0904,0907-0908,0911-0912]
pm10-isw1   24             cn[0877-0888,0895-0902,0905-0906,0909-0910]
pm10-isw2   24             cn[0925-0937,0940,0943,0946,0949-0952,0955-0958]
pm10-isw3   24             cn[0913-0924,0938-0939,0941-0942,0944-0945,0947-0948,0953-0954,0959-0960]
pm11-isw0   24             cn[0961-0972,0985-0990,0999-1000,1003-1004,1007-1008]
pm11-isw1   24             cn[0973-0984,0991-0998,1001-1002,1005-1006]
pm11-isw2   24             cn[1021-1033,1036,1039,1042,1045-1048,1051-1054]
pm11-isw3   24             cn[1009-1020,1034-1035,1037-1038,1040-1041,1043-1044,1049-1050,1055-1056]
pm12-isw0   24             cn[1057-1068,1081-1086,1095-1096,1099-1100,1103-1104]
pm12-isw1   24             cn[1069-1080,1087-1094,1097-1098,1101-1102]
pm12-isw2   12             fn[07-18]
pm12-isw3   12             cn[1105-1110],fn[01-06]
- “cn” stands for a “regular” node, “fn” for a “fat” node, and “dgx” for a “DGX H200” node
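Because the rack- and switch-based groupings are defined as Slurm partitions, they can be used to keep a tightly coupled job within a single InfiniBand switch. A hedged sketch (pm1-isw0 is just one entry from the table above, job.sh stands for your own batch script, and the resource numbers are placeholders):
sbatch --partition=pm1-isw0 --nodes=4 --ntasks-per-node=128 job.sh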
Display partition and compute node information¶
Important
We use Slurm to manage the compute resources and to submit jobs to the compute nodes. The Slurm configuration file is located at /etc/slurm/slurm.conf. That file is readable by everyone but can be modified only by the system administrators. All information presented below relates to the Slurm management of resources across our clusters.
Show the partitions of compute nodes¶
The partitions of compute nodes are structures defined in the Slurm configuration file. They are used to group the nodes into logical units and to control the distribution of resources available to the jobs running on these nodes. Each partition is identified by a name and can contain one or more nodes. It has its own set of parameters that control the execution of jobs on the partition.
sinfo
The result appears as a table with the following columns:
PARTITION - The name of the partition
AVAIL - The availability of the partition
TIMELIMIT - The time limit for the partition
NODES - The number of nodes in the partition
STATE - The state of the partition
NODELIST - The list of nodes in the partition
The NODELIST column contains the list of nodes in the partition. Each node is represented by its name, and the nodes are separated by commas, unless their names contain consecutive numbers. In the latter case they are represented by a range of numbers in square brackets (aggregated range).
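If you need the individual host names from such an aggregated range (for example, to loop over them in a script), Slurm can expand it for you:
scontrol show hostnames cn[0001-0004]
This prints one host name per line: cn0001, cn0002, cn0003, cn0004.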
The STATE column indicates the state (availability) of the nodes in the given partition. The possible values are:
idle - Available for accepting and running new jobs
alloc - Currently fully allocated to running jobs
mix - Partially allocated (some CPUs/cores in use)
drain - Being administratively drained (unavailable to new jobs, existing jobs are still running)
drain* - Failed node (no new jobs, no existing jobs, likely due to hardware failure)
Note
One node can be configured as a member of multiple partitions. For example, the node cn0001 participates in the partitions cn, rack1, pm1-isw0, and ALL.
Nodes in the “drain” state (no * after drain) have been administratively drained by the system administrator and cannot be used by jobs submitted to the partition. That does not mean those nodes are broken or unusable; they are simply marked as drained to prevent new jobs from being scheduled on them. When a node is in the “drain*” state, however, it is completely drained, no jobs can run on it, and most probably some error has occurred on the node.
You may preview the drained nodes in each partition using the following command:
sinfo -d
In case you want to check the reason a specific node is drained, you can use the following command:
sinfo -R -n <node_name>
- Show the nodes in a specific partition only
The example below shows the nodes participating in the partition cn and their current state:
sinfo -p cn
Example output:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cn up infinite 21 drain* cn[0004-0005,0061,0070-0075,0082-0084,0283-0285,0406-0408,0535,0544,0952]
cn up infinite 19 drain cn[0007,0048,0062-0063,0094-0096,0289,0334-0336,0536-0537,0541,0545-0546,0558,0560-0561]
cn up infinite 90 mix cn[0064-0069,0076,0079,0085-0088,0145-0156,0170-0171,0173-0174,0176-0177,0179-0180,0185-0186,0191-0192,0253-0254,0261-0265,0268,0271,0274,0277-0278,0327-0331,0333,0397-0405,0415-0422,0425-0426,0429,0542,0548,0793-0798,0807-0808,0811-0812,1022,1106-1108]
cn up infinite 241 alloc cn[0001-0003,0006,0008-0047,0049-0060,0077-0078,0080-0081,0089-0090,0093,0097-0144,0157-0169,0172,0175,0178,0181-0184,0187-0190,0290-0294,0301-0312,0319-0326,0337-0348,0362-0363,0365-0366,0368-0369,0371-0372,0377-0378,0383-0384,0430,0543,0547,0549-0553,0556,0559,0562,0564-0568,0571-0572,0576,0769-0780,0925-0937,0940,0943,0946,0949-0951,0955-0958,1105,1109-1110]
cn up infinite 739 idle cn[0091-0092,0193-0252,0255-0260,0266-0267,0269-0270,0272-0273,0275-0276,0279-0282,0286-0288,0295-0300,0313-0318,0332,0349-0361,0364,0367,0370,0373-0376,0379-0382,0385-0396,0409-0414,0423-0424,0427-0428,0431-0534,0538-0540,0554-0555,0557,0563,0569-0570,0573-0575,0577-0768,0781-0792,0799-0806,0809-0810,0813-0924,0938-0939,0941-0942,0944-0945,0947-0948,0953-0954,0959-1021,1023-1104]
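To list only the nodes of a partition that are in a particular state (for example, the idle nodes a new job could start on), filter by state (a minimal sketch using standard sinfo options):
sinfo -p cn -t idle -o "%D %N"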
Show memory, CPU, and GPU information for each node in the partition:
Note
The GPUs are presented in the output as generic resources (GRES) in the format gpu:<number_of_gpus>, where <number_of_gpus> is the number of GPUs available on the displayed node.
sinfo -N -p common -o "%N %T %C %G %m %c"
Example output:
NODELIST STATE CPUS(A/I/O/T) GRES MEMORY CPUS
dgx1 mixed 100/124/0/224 gpu:8 2063425 224
dgx2 mixed 16/208/0/224 gpu:7 2063425 224
Important
Note that if no GPUs are available on the node, the GRES column is empty (null):
NODELIST STATE CPUS(A/I/O/T) GRES MEMORY CPUS
cn0001 allocated 256/0/0/256 (null) 257700 256
cn0002 allocated 256/0/0/256 (null) 257700 256
cn0003 allocated 256/0/0/256 (null) 257700 256
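The gpu GRES shown above is also what you request when submitting GPU jobs to Discoverer+. A hedged sketch (job.sh stands for your own batch script, and the resource numbers are placeholders, not site defaults):
sbatch --partition=common --gres=gpu:2 --cpus-per-task=16 job.sh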
The state is reported in the STATE column. The possible values are:
idle - Available for accepting and running new jobs
allocated - All trackable resources are currently fully allocated to running jobs
mixed - Partially allocated (some trackable resources are in use)
drained - Being administratively drained (unavailable to new jobs, existing jobs are still running)
drained* - Failed node (no new jobs, no existing jobs, likely due to hardware failure)
The CPU allocation format A/I/O/T consists of:
A - Allocated CPUs (in use by jobs)
I - Idle CPUs (available for new jobs)
O - Other CPUs (reserved, offline, etc.)
T - Total CPUs on the node
Parameters used to control the resources available on the partition¶
This command shows the parameters set to control the execution of jobs on each partition in terms of Slurm configuration options format:
scontrol show partition
You may preview the parameters for a specific partition as well:
scontrol show partition cn
In both cases, the output will contain this type of structured information:
PartitionName=cn
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=cn[0001-1110]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=284160 TotalNodes=1110 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
To “decipher” this output you need to know the meaning of the parameters. The most important parameters are:
MaxTime - The maximum time limit for jobs in the partition
MaxNodes - The maximum number of nodes a job in the partition is allowed to use
MaxCPUsPerNode - The maximum number of CPUs a job is allowed to use on each node in the partition
DefMemPerNode/MaxMemPerNode - The default and maximum memory a job is allowed to use per node in the partition
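To extract just one of these parameters from the output, standard text tools are enough (a small sketch assuming grep is available on the login nodes):
scontrol show partition cn | grep -o 'MaxTime=[^ ]*'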
The other parameters can be found in the Slurm documentation: https://slurm.schedmd.com/scontrol.html
Information about the nodes¶
Note
The information about a node does not depend on the partition it belongs to; it is global information about that node in the cluster. However, a node can be a member of multiple partitions, and when listing nodes per partition the same node information is shown for each partition it belongs to.
Brief node information can be displayed with the following command:
sinfo -N -p cn
Example output:
NODELIST NODES PARTITION STATE
cn0001 1 cn alloc
cn0002 1 cn alloc
cn0003 1 cn alloc
cn0004 1 cn drain*
cn0005 1 cn drain*
cn0006 1 cn alloc
cn0007 1 cn drain
cn0008 1 cn alloc
cn0009 1 cn alloc
cn0010 1 cn alloc
cn0011 1 cn alloc
State of resources on each node can be printed out using the following command:
sinfo -N -p cn -o "%N %T %C %m %c"
Example output:
NODELIST STATE CPUS(A/I/O/T) MEMORY CPUS
cn0001 allocated 256/0/0/256 257700 256
cn0002 allocated 256/0/0/256 257700 256
cn0003 allocated 256/0/0/256 257700 256
cn0004 drained* 0/0/256/256 257700 256
cn0005 drained* 0/0/256/256 257700 256
cn0006 allocated 256/0/0/256 257700 256
cn0007 drained 0/0/256/256 257700 256
cn0008 allocated 256/0/0/256 257700 256
cn0064 mixed 128/128/0/256 257700 256
cn0065 mixed 128/128/0/256 257700 256
CPU allocation format A/I/O/T:
A - Allocated CPUs (in use by jobs)
I - Idle CPUs (available for new jobs)
O - Other CPUs (reserved, offline, etc.)
T - Total CPUs on the node
Note
The number 256 here refers to the total number of CPU threads available per node. The Linux OS running on the nodes discovers the CPU threads as logical CPUs and reports them as processors to the Linux kernel, which in turn reports them to Slurm; that is why we see 256 logical CPUs per node. In fact, each node has 128 physical CPU cores, but due to the hyper-threading technology we see 256 logical CPUs per node (see above).
Examples:
256/0/0/256 - All 256 CPUs allocated (fully busy)
128/128/0/256 - 128 CPUs allocated, 128 idle (mixed state)
0/0/256/256 - All 256 CPUs offline (drained node)
Show information for each node in the cluster¶
Most of the output produced by the command below follows the format of the Slurm configuration file.
scontrol show nodes
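The same command accepts a node name if you are interested in a single node only (cn1108 below is merely an example taken from the output that follows):
scontrol show node cn1108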
Example output:
NodeName=cn1108 Arch=x86_64 CoresPerSocket=16
CPUAlloc=12 CPUTot=256 CPULoad=6.02
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=cn1108-interconnect-1 NodeHostName=cn1108 Version=20.02.6-Bull.1.1
OS=Linux 4.18.0-305.12.1.el8_4.x86_64 #1 SMP Mon Jul 26 08:06:24 EDT 2021
RealMemory=257700 AllocMem=12288 FreeMem=242512 Sockets=8 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=cn,rack12,pm12-isw3,ALL
BootTime=2025-09-02T13:49:53 SlurmdStartTime=2025-09-02T13:51:12
CfgTRES=cpu=256,mem=257700M,billing=256
AllocTRES=cpu=12,mem=12G
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=fn18 Arch=x86_64 CoresPerSocket=16
CPUAlloc=256 CPUTot=256 CPULoad=48.06
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=fn18-interconnect-1 NodeHostName=fn18 Version=20.02.6-Bull.1.1
OS=Linux 4.18.0-305.12.1.el8_4.x86_64 #1 SMP Mon Jul 26 08:06:24 EDT 2021
RealMemory=1031800 AllocMem=262144 FreeMem=729398 Sockets=8 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=fn,rack12,pm12-isw2,ALL
BootTime=2025-09-02T14:43:50 SlurmdStartTime=2025-09-02T14:45:13
CfgTRES=cpu=256,mem=1031800M,billing=256
AllocTRES=cpu=256,mem=256G
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Important fields in the output are explained below:
NodeName - Unique identifier for the compute node
State - Current node state (ALLOCATED, MIXED, IDLE, DRAINED, etc.)
CPUAlloc/CPUTot - Allocated vs total CPUs (e.g., CPUAlloc=12 means 12 CPUs, i.e. 12 CPU threads, in use out of 256 total)
CPULoad - Current CPU load on the node
RealMemory - Total physical memory in MB
AllocMem/FreeMem - Allocated vs free memory in MB
Sockets - Number of NUMA domains/NUMA nodes on the node (not physical CPU sockets)
ThreadsPerCore - Hyper-threading enabled (2 = 2 threads per physical core)
Partitions - Which Slurm partitions this node belongs to
BootTime/SlurmdStartTime - When the node and the Slurm daemon started
CfgTRES - Configured trackable resources (total available)
AllocTRES - Currently allocated trackable resources
The other parameters can be found in the Slurm documentation: https://slurm.schedmd.com/scontrol.html