Resource Overview

Aggregated computational resources

Discoverer (CPU cluster)

144384 (1128 × 128) hardware CPU cores

288768 (1128 × 256) logical CPUs

295.5 (0.25 × 1110 + 1 × 18) TiB RAM

Discoverer+ (GPU+CPU cluster)

32 (4 × 8) NVIDIA H200 GPU accelerators

448 (112 × 4) hardware CPU cores

896 (224 × 4) logical CPUs

7.84 (1.96 × 4) TiB RAM
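The aggregated figures above follow directly from the per-node multipliers. A minimal Python sketch of that arithmetic is given below; it assumes (from the multipliers and the node counts in this document) that the 0.25 TiB figure applies to the 1110 regular nodes and the 1 TiB figure to the 18 fat nodes. The variable names are illustrative only.

```python
# Node counts and per-node figures are taken from the multipliers listed above.
REGULAR_NODES, FAT_NODES = 1110, 18
CPU_NODES = REGULAR_NODES + FAT_NODES  # 1128 nodes in Discoverer

discoverer = {
    "hardware_cores": CPU_NODES * 128,
    "logical_cpus": CPU_NODES * 256,
    # Assumption: 0.25 TiB per regular node, 1 TiB per fat node.
    "ram_tib": REGULAR_NODES * 0.25 + FAT_NODES * 1.0,
}

DGX_NODES = 4
discoverer_plus = {
    "gpus": DGX_NODES * 8,
    "hardware_cores": DGX_NODES * 112,
    "logical_cpus": DGX_NODES * 224,
    "ram_tib": round(DGX_NODES * 1.96, 2),
}

for name, totals in (("Discoverer", discoverer), ("Discoverer+", discoverer_plus)):
    print(name, totals)
```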

Compute nodes

Discoverer (CPU cluster)

1128 compute nodes

1110 “regular” nodes

18 “fat” nodes

Each compute node is equipped with:

Detailed hardware information:

See also Partitions (of nodes)

Discoverer+ (GPU+CPU cluster)

4 compute nodes (4 × DGX H200)

For the detailed hardware configuration, see https://www.nvidia.com/en-us/data-center/dgx-h200/

Detailed hardware information:

Connectivity

Data center (per node)

Internet (overall)

10 Gbps aggregated maximum Internet connection speed (via BREN, GÉANT, Evolink)

Storage (per node)

Locally installed

No local storage devices are installed on the compute nodes.

Network-attached (shared)

  • Available on Discoverer (CPU cluster):

    • /home is NFS (over Ethernet) storage for home folders (size: 4.4 TB) [by NetApp]
    • /discofs is Lustre (over InfiniBand) parallel scratch bulk storage (size: 2.1 PB on HDD) [by DDN on DDN ES7990X EXAScaler]
    • /disco2fs is Lustre (over InfiniBand) parallel scratch bulk storage (size: 27 TB on NVMe) [by DDN on DDN ES200NVX EXAScaler]
    • /valhalla is Lustre (over InfiniBand) parallel scratch bulk storage (size: 5.1 PB on NVMe) [by HPE on Cray ClusterStor E1000]
  • Available on Discoverer+ (GPU+CPU cluster):

    • /valhalla is Lustre (over InfiniBand) parallel scratch bulk storage (size: 5.1 PB on NVMe) [by HPE on Cray ClusterStor E1000]
    • /weka is WEKA (over InfiniBand) very fast parallel scratch bulk storage (size: 273 TB on NVMe) [by WEKA on WEKA cluster 4.4]
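Since the compute nodes have no local disks, job data has to live on one of the shared file systems listed above. The following is a minimal sketch, using only the Python standard library, of how one might check which of those mount points are visible on the current node and how much free space each reports. The helper names `free_space_tb` and `report` are hypothetical, and the figures are file-system-wide totals, which may differ from any per-user or per-project quota.

```python
import shutil
from pathlib import Path

# Mount points named in the list above; not all exist on every cluster.
SHARED_MOUNTS = ["/home", "/discofs", "/disco2fs", "/valhalla", "/weka"]


def free_space_tb(path):
    """Return free space in TB (10**12 bytes), or None if the path is not mounted here."""
    p = Path(path)
    if not p.is_dir():
        return None
    return shutil.disk_usage(p).free / 10**12


def report(mounts=SHARED_MOUNTS):
    """Map each mount point to its free space in TB (None when absent)."""
    return {m: free_space_tb(m) for m in mounts}


if __name__ == "__main__":
    for mount, free in report().items():
        status = "not available on this node" if free is None else f"{free:.1f} TB free"
        print(f"{mount}: {status}")
```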

Partitions (of nodes)

With respect to the control and aggregation of compute resources (managed by Slurm), and their hosting location, the compute nodes are organized into partitions:

Summary

Slurm cluster name  Partition name  Number of nodes  Participating nodes (list of host names)
discoverer          ALL             1128             cn[0001-1110], fn[01-18]
discoverer          cn              1110             cn[0001-1110]
discoverer          fn              18               fn[01-18]
disco-plus          common          2                dgx[1-2]
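The host lists in the summary use a compact bracket notation (for example `cn[0001-1110], fn[01-18]`). As an illustration of how that notation expands into individual host names, here is a small Python sketch. The function `expand_hostlist` is a hypothetical helper written for this page, not part of Slurm, and it handles only the simple forms that appear in this document.

```python
import re


def expand_hostlist(expr):
    """Expand a host list such as 'cn[0001-0003,0007], fn[01-02]' into host names."""
    hosts = []
    # Split on commas that are not inside square brackets.
    for item in re.split(r",\s*(?![^\[]*\])", expr.strip()):
        match = re.fullmatch(r"(\w+)\[([^\]]+)\]", item)
        if match is None:
            hosts.append(item)  # plain host name without a bracketed range
            continue
        prefix, body = match.groups()
        for part in body.split(","):
            lo, _, hi = part.partition("-")
            hi = hi or lo        # a single number such as '0076'
            width = len(lo)      # keep the zero padding used in the list
            hosts.extend(f"{prefix}{n:0{width}d}" for n in range(int(lo), int(hi) + 1))
    return hosts


all_nodes = expand_hostlist("cn[0001-1110], fn[01-18]")
print(len(all_nodes), all_nodes[0], all_nodes[-1])
```

Expanding the list for the ALL partition this way yields 1128 names, matching the node count in the summary.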

Based on rack location (Discoverer CPU cluster)

Name    Number of nodes  Participating nodes (list of host names)*
rack1   96               cn[0001-0096]
rack2   96               cn[0097-0192]
rack3   96               cn[0193-0288]
rack4   96               cn[0289-0384]
rack5   96               cn[0385-0480]
rack6   96               cn[0481-0576]
rack7   96               cn[0577-0672]
rack8   96               cn[0673-0768]
rack9   96               cn[0769-0864]
rack10  96               cn[0865-0960]
rack11  96               cn[0961-1056]
rack12  72               cn[1057-1110], fn[01-18]
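Because the regular nodes are numbered consecutively with 96 per rack, the rack of a node can be computed from its host name rather than looked up. The sketch below illustrates this under the layout in the table above; `rack_of` is a hypothetical helper name.

```python
NODES_PER_RACK = 96   # racks 1 to 11 each hold 96 regular nodes
LAST_REGULAR = 1110   # cn1057 to cn1110 share rack12 with the fat nodes
LAST_FAT = 18


def rack_of(host):
    """Return the rack partition name for a host such as 'cn0097' or 'fn03'."""
    kind, number = host[:2], int(host[2:])
    if kind == "fn" and 1 <= number <= LAST_FAT:
        return "rack12"  # all fat nodes are located in rack12
    if kind == "cn" and 1 <= number <= LAST_REGULAR:
        return f"rack{(number - 1) // NODES_PER_RACK + 1}"
    raise ValueError(f"unknown host name: {host}")


for host in ("cn0001", "cn0096", "cn0097", "cn1057", "cn1110", "fn18"):
    print(host, rack_of(host))
```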

Based on IB connectivity (per switch, Discoverer CPU cluster)

Name       Number of nodes  Participating nodes (list of host names)*
pm1-isw0   24               cn[0001-0012,0025-0030,0039-0040,0043-0044,0047-0048]
pm1-isw1   24               cn[0013-0024,0031-0038,0041-0042,0045-0046]
pm1-isw2   24               cn[0061-0073,0076,0079,0082,0085-0088,0091-0094]
pm1-isw3   24               cn[0049-0060,0074-0075,0077-0078,0080-0081,0083-0084,0089-0090,0095-0096]
pm2-isw0   24               cn[0097-0108,0121-0126,0135-0136,0139-0140,0143-0144]
pm2-isw1   24               cn[0109-0120,0127-0134,0137-0138,0141-0142]
pm2-isw2   24               cn[0157-0169,0172,0175,0178,0181-0184,0187-0190]
pm2-isw3   24               cn[0145-0156,0170-0171,0173-0174,0176-0177,0179-0180,0185-0186,0191-0192]
pm3-isw0   24               cn[0193-0204,0217-0222,0231-0232,0235-0236,0239-0240]
pm3-isw1   24               cn[0205-0216,0223-0230,0233-0234,0237-0238]
pm3-isw2   24               cn[0253-0265,0268,0271,0274,0277-0280,0283-0286]
pm3-isw3   24               cn[0241-0252,0266-0267,0269-0270,0272-0273,0275-0276,0281-0282,0287-0288]
pm4-isw0   24               cn[0289-0300,0313-0318,0327-0328,0331-0332,0335-0336]
pm4-isw1   24               cn[0301-0312,0319-0326,0329-0330,0333-0334]
pm4-isw2   24               cn[0349-0361,0364,0367,0370,0373-0376,0379-0382]
pm4-isw3   24               cn[0337-0348,0362-0363,0365-0366,0368-0369,0371-0372,0377-0378,0383-0384]
pm5-isw0   24               cn[0385-0396,0409-0414,0423-0424,0427-0428,0431-0432]
pm5-isw1   24               cn[0397-0408,0415-0422,0425-0426,0429-0430]
pm5-isw2   24               cn[0445-0457,0460,0463,0466,0469-0472,0475-0478]
pm5-isw3   24               cn[0433-0444,0458-0459,0461-0462,0464-0465,0467-0468,0473-0474,0479-0480]
pm6-isw0   24               cn[0481-0492,0505-0510,0519-0520,0523-0524,0527-0528]
pm6-isw1   24               cn[0493-0504,0511-0518,0521-0522,0525-0526]
pm6-isw2   24               cn[0541-0553,0556,0559,0562,0565-0568,0571-0574]
pm6-isw3   24               cn[0529-0540,0554-0555,0557-0558,0560-0561,0563-0564,0569-0570,0575-0576]
pm7-isw0   24               cn[0577-0588,0601-0606,0615-0616,0619-0620,0623-0624]
pm7-isw1   24               cn[0589-0600,0607-0614,0617-0618,0621-0622]
pm7-isw2   24               cn[0637-0649,0652,0655,0658,0661-0664,0667-0670]
pm7-isw3   24               cn[0625-0636,0650-0651,0653-0654,0656-0657,0659-0660,0665-0666,0671-0672]
pm8-isw0   24               cn[0673-0684,0697-0702,0711-0712,0715-0716,0719-0720]
pm8-isw1   24               cn[0685-0696,0703-0710,0713-0714,0717-0718]
pm8-isw2   24               cn[0733-0745,0748,0751,0754,0757-0760,0763-0766]
pm8-isw3   24               cn[0721-0732,0746-0747,0749-0750,0752-0753,0755-0756,0761-0762,0767-0768]
pm9-isw0   24               cn[0769-0780,0793-0798,0807-0808,0811-0812,0815-0816]
pm9-isw1   24               cn[0781-0792,0799-0806,0809-0810,0813-0814]
pm9-isw2   24               cn[0829-0841,0844,0847,0850,0853-0856,0859-0862]
pm9-isw3   24               cn[0817-0828,0842-0843,0845-0846,0848-0849,0851-0852,0857-0858,0863-0864]
pm10-isw0  24               cn[0865-0876,0889-0894,0903-0904,0907-0908,0911-0912]
pm10-isw1  24               cn[0877-0888,0895-0902,0905-0906,0909-0910]
pm10-isw2  24               cn[0925-0937,0940,0943,0946,0949-0952,0955-0958]
pm10-isw3  24               cn[0913-0924,0938-0939,0941-0942,0944-0945,0947-0948,0953-0954,0959-0960]
pm11-isw0  24               cn[0961-0972,0985-0990,0999-1000,1003-1004,1007-1008]
pm11-isw1  24               cn[0973-0984,0991-0998,1001-1002,1005-1006]
pm11-isw2  24               cn[1021-1033,1036,1039,1042,1045-1048,1051-1054]
pm11-isw3  24               cn[1009-1020,1034-1035,1037-1038,1040-1041,1043-1044,1049-1050,1055-1056]
pm12-isw0  24               cn[1057-1068,1081-1086,1095-1096,1099-1100,1103-1104]
pm12-isw1  24               cn[1069-1080,1087-1094,1097-1098,1101-1102]
pm12-isw2  12               fn[07-18]
pm12-isw3  12               cn[1105-1110],fn[01-06]
* “cn” stands for a “regular” node, “fn” for a “fat” node, and “dgx” for a “DGX H200” node.
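The per-switch lists for the regular nodes appear to repeat the same pattern in every rack, shifted by 96 node numbers. Assuming that regularity holds (it is an observation drawn from the table, not a documented guarantee), the switch serving a node can be derived from the first rack's four rows plus the special handling of the fat nodes in rack 12. The names `RACK1_PATTERN` and `switch_of` are hypothetical.

```python
# Positions (1 to 96) within a rack served by each switch, copied from the pm1 rows.
RACK1_PATTERN = {
    "isw0": "1-12,25-30,39-40,43-44,47-48",
    "isw1": "13-24,31-38,41-42,45-46",
    "isw2": "61-73,76,79,82,85-88,91-94",
    "isw3": "49-60,74-75,77-78,80-81,83-84,89-90,95-96",
}


def _expand(ranges):
    """Turn '1-3,7' into the set {1, 2, 3, 7}."""
    numbers = set()
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        numbers.update(range(int(lo), int(hi or lo) + 1))
    return numbers


POSITION_TO_SWITCH = {
    pos: switch for switch, ranges in RACK1_PATTERN.items() for pos in _expand(ranges)
}


def switch_of(host):
    """Return the IB switch partition name for a host such as 'cn0169' or 'fn07'."""
    kind, number = host[:2], int(host[2:])
    if kind == "fn" and 1 <= number <= 18:
        # Fat nodes: fn[01-06] are on pm12-isw3, fn[07-18] on pm12-isw2.
        return "pm12-isw3" if number <= 6 else "pm12-isw2"
    if kind == "cn" and 1 <= number <= 1110:
        rack_index, position = divmod(number - 1, 96)
        return f"pm{rack_index + 1}-{POSITION_TO_SWITCH[position + 1]}"
    raise ValueError(f"unknown host name: {host}")


def same_switch(host_a, host_b):
    """True when both hosts hang off the same IB switch."""
    return switch_of(host_a) == switch_of(host_b)
```

A check such as `same_switch("cn0001", "cn0025")` could help when deciding whether a small multi-node job stays behind a single switch.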