Resource Overview¶
Aggregated computational resources¶
Discoverer (CPU cluster)¶
144384 (1128 × 128) hardware CPU cores
288768 (1128 × 256) logical CPUs
295.5 (0.25 × 1110 + 1 × 18 ) TiB RAM
Discoverer+ (GPU+CPU cluster)¶
32 (4 × 8) NVIDIA H200 GPU accelerators
448 (112 × 4) hardware CPU cores
896 (224 × 4) logical CPUs
7.84 (1.96 × 4) TiB RAM
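The aggregated figures above follow directly from the per-node values. A minimal sketch that recomputes them is given below; the function and parameter names are hypothetical and chosen only for illustration, while the numeric defaults are the per-node values quoted in this overview.

```python
def cluster_totals(nodes_ram_tib, cores_per_node, threads_per_core=2, gpus_per_node=0):
    """Aggregate per-node figures into cluster-wide totals.

    nodes_ram_tib: list of (node_count, ram_per_node_in_TiB) pairs,
                   one pair per node type.
    """
    nodes = sum(count for count, _ in nodes_ram_tib)
    cores = nodes * cores_per_node
    return {
        "nodes": nodes,
        "cores": cores,
        "logical_cpus": cores * threads_per_core,
        "ram_tib": sum(count * ram for count, ram in nodes_ram_tib),
        "gpus": nodes * gpus_per_node,
    }


# Discoverer: 1110 regular nodes (0.25 TiB each) + 18 fat nodes (1 TiB each),
# 128 hardware cores per node.
discoverer = cluster_totals([(1110, 0.25), (18, 1.0)], cores_per_node=128)

# Discoverer+: 4 nodes, 112 hardware cores, 1.96 TiB RAM and 8 GPUs per node.
discoverer_plus = cluster_totals([(4, 1.96)], cores_per_node=112, gpus_per_node=8)

if __name__ == "__main__":
    print(discoverer)
    print(discoverer_plus)
```

The sketch assumes two hardware threads per core on both clusters, which matches the ratio of logical CPUs to hardware cores listed above.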
Compute nodes¶
Discoverer (CPU cluster)¶
1128 compute nodes
- 1110 “regular” nodes
- 18 “fat” nodes
Each compute node is equipped with:
- CPU: 2 × AMD EPYC 7H12 64-Core Processor (Sockets 8, CoresPerSocket 16, ThreadsPerCore 2)
- RAM:
  - per “regular” node: 256 GB (Micron Technology Inc. DDR4, 3200 MT/s, 16 × 16 GB)
  - per “fat” node: 1 TB (Micron Technology Inc. DDR4, 3200 MT/s, 16 × 64 GB)
- Ethernet: 2 × 1000BASE-T, 100BASE-TX, 10BASE-T Intel® Ethernet Controller I350
- InfiniBand: Mellanox Technologies MT28908 Family [ConnectX-6], 1 × 200 Gb/s
Detailed hardware information:
See also Partitions (of nodes)
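When sizing a job it can help to derive the per-node CPU counts and the memory available per core from the figures above. The sketch below is illustrative only: the class and attribute names are hypothetical, and it assumes that "GB" here means binary units (so that 256 GB equals 0.25 TiB and 1 TB equals 1024 GB), which is consistent with the aggregated RAM figure quoted in this overview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeTopology:
    """Per-node layout using the Sockets/CoresPerSocket/ThreadsPerCore values."""
    sockets: int
    cores_per_socket: int
    threads_per_core: int
    ram_gb: int

    @property
    def cores(self) -> int:
        # Hardware cores per node.
        return self.sockets * self.cores_per_socket

    @property
    def logical_cpus(self) -> int:
        # Logical CPUs (hardware threads) per node.
        return self.cores * self.threads_per_core

    @property
    def ram_gb_per_core(self) -> float:
        # Memory available per hardware core if RAM is shared evenly.
        return self.ram_gb / self.cores


# Values taken from the node description above.
regular_node = NodeTopology(sockets=8, cores_per_socket=16, threads_per_core=2, ram_gb=256)
fat_node = NodeTopology(sockets=8, cores_per_socket=16, threads_per_core=2, ram_gb=1024)

if __name__ == "__main__":
    for name, node in (("regular", regular_node), ("fat", fat_node)):
        print(name, node.cores, node.logical_cpus, node.ram_gb_per_core)
```

Note that 8 sockets × 16 cores gives the same 128 hardware cores per node as 2 processors × 64 cores.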
Discoverer+ (GPU+CPU cluster)¶
4 compute nodes (4 × DGX H200)
For detailed hardware configuration:
https://www.nvidia.com/en-us/data-center/dgx-h200/
Connectivity¶
Data center (per node)¶
- Discoverer (CPU cluster):
  - 1 × InfiniBand controller: 200 Gbps [Mellanox Technologies MT28908 Family [ConnectX-6]]
  - 1 × Ethernet controller: 10 Gbps [Intel® Ethernet Controller I350]
- Discoverer+ (GPU+CPU cluster):
  - 10 × InfiniBand controller: 400 Gbps [Mellanox Technologies MT2910 Family [ConnectX-7]]
  - 2 × Ethernet controller: 400 Gbps [Mellanox Technologies MT2910 Family [ConnectX-7]]
  - 2 × Ethernet controller: 100 Gbps [Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)]
  - 1 × Ethernet controller: 10 Gbps [Intel Corporation Ethernet Controller X550 (rev 01)]
Storage (per node)¶
Locally installed¶
No local storage devices are installed in the compute nodes.
Partitions (of nodes)¶
The compute nodes are organized into partitions managed by Slurm. The partitions reflect how the compute resources are controlled and aggregated, and where the nodes are hosted:
Summary¶
| Slurm cluster name | Partition name | Number of nodes | Participating nodes (list of host names) |
|---|---|---|---|
| discoverer | ALL | 1128 | cn[0001-1110], fn[01-18] |
| discoverer | cn | 1110 | cn[0001-1110] |
| discoverer | fn | 18 | fn[01-18] |
| disco-plus | common | 2 | dgx[1-2] |
Based on rack location (Discoverer CPU cluster)¶
| Name | Number of nodes | Participating nodes (list of host names)* |
|---|---|---|
| rack1 | 96 | cn[0001-0096] |
| rack2 | 96 | cn[0097-0192] |
| rack3 | 96 | cn[0193-0288] |
| rack4 | 96 | cn[0289-0384] |
| rack5 | 96 | cn[0385-0480] |
| rack6 | 96 | cn[0481-0576] |
| rack7 | 96 | cn[0577-0672] |
| rack8 | 96 | cn[0673-0768] |
| rack9 | 96 | cn[0769-0864] |
| rack10 | 96 | cn[0865-0960] |
| rack11 | 96 | cn[0961-1056] |
| rack12 | 72 | cn[1057-1110], fn[01-18] |
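Because the regular nodes are numbered consecutively with 96 nodes per rack, the rack of a node can be derived from its host name. The following is a minimal sketch of that lookup; `rack_of` is a hypothetical helper written for this page, not a tool installed on the cluster, and it relies only on the ranges listed in the table above.

```python
import re

NODES_PER_RACK = 96      # racks 1 to 11 each hold 96 regular nodes
LAST_REGULAR_NODE = 1110  # highest "cn" index
LAST_FAT_NODE = 18        # highest "fn" index


def rack_of(hostname: str) -> str:
    """Return the rack partition name for a Discoverer CPU cluster host name."""
    match = re.fullmatch(r"(cn|fn)(\d+)", hostname)
    if match is None:
        raise ValueError(f"unrecognised host name: {hostname!r}")
    kind, index = match.group(1), int(match.group(2))
    if kind == "fn":
        if not 1 <= index <= LAST_FAT_NODE:
            raise ValueError(f"fat node index out of range: {hostname!r}")
        return "rack12"  # all fat nodes are hosted in rack12
    if not 1 <= index <= LAST_REGULAR_NODE:
        raise ValueError(f"regular node index out of range: {hostname!r}")
    # cn0001-cn0096 -> rack1, cn0097-cn0192 -> rack2, and so on.
    return f"rack{(index - 1) // NODES_PER_RACK + 1}"


if __name__ == "__main__":
    for host in ("cn0001", "cn0097", "cn1056", "cn1110", "fn07"):
        print(host, rack_of(host))
```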
Based on IB connectivity (per switch, Discoverer CPU cluster)¶
| Name | Number of nodes | Participating nodes (list of host names)* |
|---|---|---|
| pm1-isw0 | 24 | cn[0001-0012,0025-0030,0039-0040,0043-0044,0047-0048] |
| pm1-isw1 | 24 | cn[0013-0024,0031-0038,0041-0042,0045-0046] |
| pm1-isw2 | 24 | cn[0061-0073,0076,0079,0082,0085-0088,0091-0094] |
| pm1-isw3 | 24 | cn[0049-0060,0074-0075,0077-0078,0080-0081,0083-0084,0089-0090,0095-0096] |
| pm2-isw0 | 24 | cn[0097-0108,0121-0126,0135-0136,0139-0140,0143-0144] |
| pm2-isw1 | 24 | cn[0109-0120,0127-0134,0137-0138,0141-0142] |
| pm2-isw2 | 24 | cn[0157-0169,0172,0175,0178,0181-0184,0187-0190] |
| pm2-isw3 | 24 | cn[0145-0156,0170-0171,0173-0174,0176-0177,0179-0180,0185-0186,0191-0192] |
| pm3-isw0 | 24 | cn[0193-0204,0217-0222,0231-0232,0235-0236,0239-0240] |
| pm3-isw1 | 24 | cn[0205-0216,0223-0230,0233-0234,0237-0238] |
| pm3-isw2 | 24 | cn[0253-0265,0268,0271,0274,0277-0280,0283-0286] |
| pm3-isw3 | 24 | cn[0241-0252,0266-0267,0269-0270,0272-0273,0275-0276,0281-0282,0287-0288] |
| pm4-isw0 | 24 | cn[0289-0300,0313-0318,0327-0328,0331-0332,0335-0336] |
| pm4-isw1 | 24 | cn[0301-0312,0319-0326,0329-0330,0333-0334] |
| pm4-isw2 | 24 | cn[0349-0361,0364,0367,0370,0373-0376,0379-0382] |
| pm4-isw3 | 24 | cn[0337-0348,0362-0363,0365-0366,0368-0369,0371-0372,0377-0378,0383-0384] |
| pm5-isw0 | 24 | cn[0385-0396,0409-0414,0423-0424,0427-0428,0431-0432] |
| pm5-isw1 | 24 | cn[0397-0408,0415-0422,0425-0426,0429-0430] |
| pm5-isw2 | 24 | cn[0445-0457,0460,0463,0466,0469-0472,0475-0478] |
| pm5-isw3 | 24 | cn[0433-0444,0458-0459,0461-0462,0464-0465,0467-0468,0473-0474,0479-0480] |
| pm6-isw0 | 24 | cn[0481-0492,0505-0510,0519-0520,0523-0524,0527-0528] |
| pm6-isw1 | 24 | cn[0493-0504,0511-0518,0521-0522,0525-0526] |
| pm6-isw2 | 24 | cn[0541-0553,0556,0559,0562,0565-0568,0571-0574] |
| pm6-isw3 | 24 | cn[0529-0540,0554-0555,0557-0558,0560-0561,0563-0564,0569-0570,0575-0576] |
| pm7-isw0 | 24 | cn[0577-0588,0601-0606,0615-0616,0619-0620,0623-0624] |
| pm7-isw1 | 24 | cn[0589-0600,0607-0614,0617-0618,0621-0622] |
| pm7-isw2 | 24 | cn[0637-0649,0652,0655,0658,0661-0664,0667-0670] |
| pm7-isw3 | 24 | cn[0625-0636,0650-0651,0653-0654,0656-0657,0659-0660,0665-0666,0671-0672] |
| pm8-isw0 | 24 | cn[0673-0684,0697-0702,0711-0712,0715-0716,0719-0720] |
| pm8-isw1 | 24 | cn[0685-0696,0703-0710,0713-0714,0717-0718] |
| pm8-isw2 | 24 | cn[0733-0745,0748,0751,0754,0757-0760,0763-0766] |
| pm8-isw3 | 24 | cn[0721-0732,0746-0747,0749-0750,0752-0753,0755-0756,0761-0762,0767-0768] |
| pm9-isw0 | 24 | cn[0769-0780,0793-0798,0807-0808,0811-0812,0815-0816] |
| pm9-isw1 | 24 | cn[0781-0792,0799-0806,0809-0810,0813-0814] |
| pm9-isw2 | 24 | cn[0829-0841,0844,0847,0850,0853-0856,0859-0862] |
| pm9-isw3 | 24 | cn[0817-0828,0842-0843,0845-0846,0848-0849,0851-0852,0857-0858,0863-0864] |
| pm10-isw0 | 24 | cn[0865-0876,0889-0894,0903-0904,0907-0908,0911-0912] |
| pm10-isw1 | 24 | cn[0877-0888,0895-0902,0905-0906,0909-0910] |
| pm10-isw2 | 24 | cn[0925-0937,0940,0943,0946,0949-0952,0955-0958] |
| pm10-isw3 | 24 | cn[0913-0924,0938-0939,0941-0942,0944-0945,0947-0948,0953-0954,0959-0960] |
| pm11-isw0 | 24 | cn[0961-0972,0985-0990,0999-1000,1003-1004,1007-1008] |
| pm11-isw1 | 24 | cn[0973-0984,0991-0998,1001-1002,1005-1006] |
| pm11-isw2 | 24 | cn[1021-1033,1036,1039,1042,1045-1048,1051-1054] |
| pm11-isw3 | 24 | cn[1009-1020,1034-1035,1037-1038,1040-1041,1043-1044,1049-1050,1055-1056] |
| pm12-isw0 | 24 | cn[1057-1068,1081-1086,1095-1096,1099-1100,1103-1104] |
| pm12-isw1 | 24 | cn[1069-1080,1087-1094,1097-1098,1101-1102] |
| pm12-isw2 | 12 | fn[07-18] |
| pm12-isw3 | 12 | cn[1105-1110],fn[01-06] |
* “cn” stands for a “regular” node, “fn” for a “fat” node, and “dgx” for a “DGX H200” node.
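The host name lists in the tables above use the compact bracket notation, for example `cn[0001-0012,0025-0030]`. As an illustration of how that notation expands, here is a small pure-Python sketch; `expand_hostlist` is a hypothetical helper that handles only the simple forms appearing on this page (a prefix followed by one bracketed list of zero-padded numbers or ranges). On a system with Slurm installed, `scontrol show hostnames` performs this expansion and is likely the better choice.

```python
import re


def expand_hostlist(expression: str) -> list[str]:
    """Expand e.g. 'cn[1105-1110],fn[01-06]' into individual host names."""
    hosts = []
    # Each part is a prefix optionally followed by one bracketed list;
    # commas inside the brackets stay within the same part.
    for part in re.findall(r"[^,\[\]\s]+(?:\[[^\]]*\])?", expression):
        match = re.fullmatch(r"([^\[\]]+)\[([^\]]*)\]", part)
        if match is None:
            hosts.append(part)  # plain host name without brackets
            continue
        prefix, body = match.groups()
        for item in body.split(","):
            low, _, high = item.partition("-")
            high = high or low  # a single number is a range of one
            width = len(low)    # preserve zero padding, e.g. 0001
            for number in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{number:0{width}d}")
    return hosts


if __name__ == "__main__":
    print(len(expand_hostlist("cn[0001-1110], fn[01-18]")))
    print(expand_hostlist("cn[1105-1110],fn[01-06]"))
```

Expanding an entry and counting the result is a quick way to check a partition's node count against the table.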