Resource limits on login.discoverer.bg
Overview
login.discoverer.bg is the login node of the Discoverer CPU cluster. It is a shared
gateway used by all users of the cluster. It is not a compute node.
Its purpose is strictly limited to the following tasks:
connecting to the cluster via SSH;
submitting, monitoring, and cancelling SLURM jobs;
checking project resource allocation and quota;
managing files and directories under the project’s storage space.
Any workload beyond these lightweight operations — including IDEs, AI coding agents, interactive notebooks, long-running scripts, or any process that sustains significant CPU or memory consumption — does not belong on the login node. Such workloads degrade the experience for every other user sharing the node and, as described in the sections below, are subject to automatic throttling.
Per-user resource limits
To protect the shared environment, the login node enforces per-user resource limits via
systemd cgroup v2 user slices. Every user session is placed inside a slice named
user-<UID>.slice, and the following limits apply to the aggregate of all processes in that
slice:
| Parameter | Default limit | Description |
|---|---|---|
| CPUQuota | 200% | Maximum CPU allocation (2 logical threads equivalent) |
| MemoryHigh | 4.0 GB | Soft memory ceiling; throttling begins above this threshold |
| TasksMax | 5000 | Maximum number of concurrent processes and threads |
| IOReadBandwidthMax / IOWriteBandwidthMax | 10 MB/s per direction | Local block device I/O throttle (applied to /dev/nvme0n1p3) |
These limits apply to all users within the designated UID range and are enforced continuously. They are not negotiable and will not be raised to accommodate unsupported workloads.
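For reference, limits of this shape are typically declared as a systemd drop-in applied to user slices. The fragment below is an illustrative sketch consistent with the values in the table, not the actual configuration deployed on login.discoverer.bg; the file path is an assumption, and the mechanism restricting enforcement to the designated UID range is not shown:

```ini
# Illustrative drop-in, e.g. /etc/systemd/system/user-.slice.d/99-limits.conf
[Slice]
CPUQuota=200%
MemoryHigh=4G
TasksMax=5000
IOReadBandwidthMax=/dev/nvme0n1p3 10M
IOWriteBandwidthMax=/dev/nvme0n1p3 10M
```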
Note
The MemoryHigh parameter is a soft ceiling, not a hard kill threshold. When a user
slice exceeds it, the kernel begins throttling memory allocation and aggressively reclaiming
pages. Processes are not immediately terminated; instead they stall in uninterruptible sleep
(kernel state D), which increments the system load average regardless of CPU utilisation.
This behaviour is intentional: it preserves enough system headroom for the affected user to
log in via a new SSH session and terminate the offending process themselves.
Note
The I/O throttle applies only to local block devices. The NFS home filesystem and Lustre
project storage (/valhalla) are accessed over the network and are outside the scope of
the cgroup I/O controller. Only I/O directed to the local NVMe device (/dev/nvme0n1p3)
is subject to the bandwidth limit.
Checking your own resource consumption
Users are encouraged to monitor their own slice before attributing login node slowness to other accounts. The following command shows the current state of your cgroup slice:
systemctl status user-$(id -u).slice
The relevant line in the output is:
Memory: X.XG (high: 4.0G available: YB)
As long as the available figure is non-zero, your session is within the limit and is not
contributing to system load. If available reads 0B, your slice is at or above the
MemoryHigh threshold and your processes are being throttled.
The raw byte values can be read directly from the cgroup filesystem:
cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/memory.current
cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/memory.high
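The two raw values can be combined into a quick headroom check. The helper below is a minimal sketch built on the cgroup paths shown above; the function name is illustrative:

```shell
# Sketch: report remaining headroom below the MemoryHigh ceiling of a
# cgroup slice. Defaults to the current user's slice as documented above;
# pass a different slice directory to inspect another path.
headroom_mib() {
  slice_dir="${1:-/sys/fs/cgroup/user.slice/user-$(id -u).slice}"
  cur=$(cat "$slice_dir/memory.current")
  high=$(cat "$slice_dir/memory.high")
  if [ "$high" = "max" ]; then
    # memory.high reads "max" when no MemoryHigh limit is configured
    echo "no MemoryHigh limit set"
    return 0
  fi
  echo "$(( (high - cur) / 1024 / 1024 )) MiB remaining"
}
```

Invoke it as `headroom_mib` on the login node; a result approaching 0 MiB means your slice is about to be throttled.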
Important
When the login node feels slow or unresponsive, the first step is always to check your own account using the commands above. Do not assume that another user’s processes are the cause before verifying that your own slice is within its limits. Each user’s slice is accounted and throttled independently; a throttled slice belonging to another user cannot directly cause throttling in yours.
Effect of throttling on system load average
The system load average reported by uptime counts all processes in the runnable or uninterruptible sleep state. A process stalled in state D, waiting on memory reclaim, I/O, or a kernel lock, counts as 1.0 toward the load average for as long as it remains in that state, irrespective of how much CPU it is consuming.
This means a user slice sitting at MemoryHigh with available: 0B can produce a load
average contribution of 10–20 or more from a handful of processes, even though those processes
show only modest CPU percentages in top. The discrepancy between the CPU figures visible in
top and the load average reported by uptime is the diagnostic signature of memory-pressure
throttling.
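That signature can be checked directly. The one-liner below is a sketch, assuming a procps-style ps; the function name is illustrative:

```shell
# Sketch: count your own processes currently in uninterruptible sleep
# (state D). A persistently non-zero count alongside modest CPU figures
# in top is the memory-pressure throttling signature described above.
d_state_count() {
  # The first character of the ps STAT column is the kernel state;
  # grep -c exits non-zero on zero matches, hence the trailing "|| true".
  ps -u "$(id -u)" -o stat= | grep -c '^D' || true
}
```

Running d_state_count during a slowdown distinguishes memory-pressure stalls (count above zero) from ordinary CPU contention (count zero).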
On the use of VSCode, AI coding agents, and similar tools
Some users connect VSCode, Claude Code, OpenHands, or similar development environments to the login node via Remote-SSH. This is not explicitly blocked, but the following conditions apply without exception:
Use is at the user's own risk. Such tools are not a supported use case on the login node.
All processes are subject to the cgroup limits described above. VSCode and its background processes (language servers, file indexers, extension workers) will be throttled as soon as the user's slice approaches the MemoryHigh or CPUQuota ceiling. This is by design and will not be changed.
No obligation to provide additional resources arises. The login node is not a compute resource. No EuroHPC resource allocation policy, national allocation agreement, or any other instrument governing access to this system creates an entitlement to additional login node capacity for the purpose of running development tools.
Performance degradation is not a support issue. If a VSCode or agent session becomes slow or unresponsive, the cause is cgroup throttling as described in this document. Users experiencing this should terminate the offending processes and migrate their workflow to a local workstation (see Recommended workflow).
The same conditions apply equally to all users. A user whose VSCode or node processes are
predominantly in state D is already being throttled by their own cgroup limits. Those
processes are not running freely and are not the cause of performance problems in other users’
sessions. Each user slice is independent.
On submitting development tools as SLURM jobs
Unlike the Discoverer+ GPU cluster, the Discoverer CPU cluster does not operate a GPU-based
billing fairness model. There is no billing counter that penalises CPU or memory consumption
relative to GPU allocation, because the compute nodes carry no GPUs.
This does not make SLURM interactive jobs a viable alternative for running development tools, for two reasons.
First, Discoverer does not permit unlimited wall time. Every job must declare a wall time, and no job may exceed the maximum wall time enforced by the SLURM account's QoS, which does not exceed 2 days. A development tool such as VSCode, Claude Code, or OpenHands running as an interactive job will therefore be terminated unconditionally by SLURM when the wall time expires, making it unsuitable for any persistent development workflow.
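In practice the wall-time declaration is part of every job script. The sketch below shows only the shape of that declaration; the account, partition, and executable names are illustrative placeholders, not Discoverer's actual values:

```bash
#!/bin/bash
#SBATCH --account=<project>      # illustrative placeholder
#SBATCH --partition=<partition>  # illustrative placeholder
#SBATCH --time=2-00:00:00        # wall time; capped by the QoS maximum of 2 days
#SBATCH --ntasks=1

srun ./my_program                # illustrative placeholder
```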
Second, on Discoverer the only trackable resource subject to allocation limits is CPU time. Memory usage is recorded by SLURM but carries no allocation limit and is not enforced. A development tool held in an interactive job for an extended period consumes project CPU-minutes for the full duration of the job regardless of whether any useful computational work is being performed. Users are strongly encouraged to run such tools on a local workstation instead, as described in the next section.
Recommended workflow
The following workflow is correct, supported, and consistent with how resource allocations on the Discoverer CPU cluster are intended to be consumed:
Run VSCode, Claude Code, OpenHands, Jupyter, or any other development tool on a local workstation or laptop. These tools have no business running on shared HPC infrastructure.
Connect from those local tools to the login node via SSH only for the following purposes:
submitting and monitoring SLURM jobs;
checking project resource allocation and storage quota;
managing files and directories under the project’s storage space.
For interactive workloads that genuinely require cluster compute resources, request an interactive SLURM job allocation on a compute node. Local tools may then connect to the allocated node directly via SSH for the duration of the job. The login node is never the target for such connections.
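One way to wire up such a connection is an SSH client configuration that reaches the allocated compute node through the login node. The host aliases and the node-name pattern below are assumptions for illustration, not Discoverer's actual naming:

```
# ~/.ssh/config (illustrative; adjust names to your environment)
Host discoverer
    HostName login.discoverer.bg
    User <username>

# Compute nodes, reachable only for the duration of a SLURM allocation
Host cn*
    ProxyJump discoverer
    User <username>
```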
This model ensures that the login node remains responsive for all users, that project CPU-hour budgets are consumed by productive computational workloads rather than development tooling, and that users retain full access to the resources their project has been allocated.
Getting help
See Getting help