Linux Cgroups Beginner Guide: Resource Management for HPC
Overview
Control Groups (cgroups) are a Linux kernel feature that allows you to allocate, limit, and monitor system resources — CPU, memory, I/O, and more — for groups of processes. If you work on an HPC cluster, cgroups are almost certainly running behind the scenes every time you submit a job through [[hyperqueue-basics|HyperQueue]] or Slurm.
Why should you care? On a shared HPC cluster, one runaway job can consume all memory on a node and crash every other job running there. Cgroups prevent this by enforcing hard resource boundaries. When your Slurm job requests 4 CPUs and 16GB of RAM, it is cgroups that ensure your job cannot exceed those limits — and that other jobs on the same node cannot steal your allocation.
This guide focuses on cgroups v2 (the modern unified hierarchy), with references to v1 where it helps you understand older systems. By the end, you will be able to inspect cgroups on a live system, understand how Slurm uses them, and diagnose common resource-related job failures.
What You Will Learn
- What cgroups are and why HPC clusters depend on them
- The cgroup hierarchy and how controllers work
- How Slurm creates cgroups for every job and step
- How to inspect cgroup settings and diagnose OOM kills
- The difference between cgroups v1 and v2
Prerequisites
Before starting this tutorial, you should have:
- Basic Linux command-line skills — navigating directories, reading files, running commands (see [[linux-permissions-beginner-guide|Linux Permissions Beginner Guide]] for fundamentals)
- SSH access to an HPC cluster running a modern Linux distribution (RHEL/Rocky 8+, Ubuntu 22.04+)
- A Slurm account with permission to submit jobs
- Understanding of basic HPC concepts — nodes, jobs, partitions (see [[hyperqueue-basics|HyperQueue Basics]] for job scheduling concepts)
No root access is needed for most inspection tasks. Some exercises that create cgroups require root or delegation permissions.
Key Concepts
What Is a Cgroup?
A cgroup (control group) is a kernel mechanism that organizes processes into hierarchical groups and applies resource constraints to those groups. Think of it as a folder that contains processes, where the folder itself has rules about how much CPU, memory, or I/O those processes can use.
The Cgroup Hierarchy
Cgroups are organized in a tree structure, similar to a filesystem. Every process on the system belongs to exactly one cgroup. Here is what the hierarchy looks like on a typical HPC compute node running Slurm:
root (/)
├── system.slice
│   ├── slurmstepd
│   │   ├── job_123/
│   │   │   ├── step0
│   │   │   │   └── tasks (your processes)
│   │   │   └── step1
│   │   └── job_456/
│   │       ├── step0
│   │       │   └── tasks (other user's processes)
│   │       └── step1
│   └── other services
└── user.slice
Each level in the tree can have resource limits. A child cgroup cannot exceed the limits of its parent.
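One way to picture the parent/child rule is as a minimum taken over the path from the root down to a cgroup. The Python sketch below is illustrative only: `effective_limit` is a hypothetical helper, and the sample values stand in for the contents of a limit file such as `memory.max` at each level, where `max` means that level sets no limit.

```python
def effective_limit(limits_along_path):
    """Return the tightest limit on the path from the root to a cgroup.

    Each entry is the text of a limit file such as memory.max:
    either a byte count or the literal string "max" (no limit).
    Returns None when no level on the path sets a limit.
    """
    numeric = [int(v) for v in limits_along_path if v.strip() != "max"]
    return min(numeric) if numeric else None


# Hypothetical values for root -> system.slice -> job -> step
path_limits = ["max", "max", "4294967296", "8589934592"]
print(effective_limit(path_limits))
```

Here the step asks for 8 GiB, but the job level above it caps usage at 4 GiB, so the job's value is the one that applies.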
Controllers
Controllers are the kernel subsystems that actually enforce resource limits. Each controller manages one type of resource:
| Controller | What It Controls | HPC Use Case |
|---|---|---|
| cpu | CPU time allocation | Preventing jobs from stealing CPU cycles |
| cpuset | CPU core and NUMA node pinning | Binding MPI ranks to specific cores |
| memory | RAM and swap limits | Enforcing --mem requests in Slurm |
| io | Disk I/O bandwidth | Preventing I/O-heavy jobs from starving others |
| pids | Number of processes/threads | Preventing fork bombs |
| hugetlb | Huge page allocation | Managing large-page memory for scientific apps |
| rdma | RDMA/InfiniBand resources | HPC network resource isolation |
Cgroups v1 vs v2: A Brief Comparison
Most modern HPC clusters are migrating to cgroups v2. Here is why:
| Feature | v1 | v2 |
|---|---|---|
| Hierarchy | Multiple separate trees (one per controller) | Single unified tree |
| Controller mounting | Each controller mounted separately | All controllers in one hierarchy |
| Process placement | Processes anywhere in the tree | Processes only at leaf nodes |
| Resource distribution | Inconsistent across controllers | Uniform pressure-based model |
| Slurm support | cgroup/v1 plugin (deprecated) | cgroup/v2 plugin (recommended) |
Key takeaway: cgroups v2 simplifies everything by putting all controllers in a single hierarchy. If you see paths like /sys/fs/cgroup/memory/ and /sys/fs/cgroup/cpu/ as separate mount points, that is v1. If you see a single /sys/fs/cgroup/ with everything combined, that is v2.
Step-by-Step Instructions
Step 1: Check Which Cgroup Version Your Cluster Uses
# Check the cgroup version
stat -fc %T /sys/fs/cgroup/
Expected output for cgroups v2:
cgroup2fs
Expected output for cgroups v1:
tmpfs
You can also check the mount points:
mount | grep cgroup
Cgroups v2 output:
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
Cgroups v1 output (multiple lines):
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
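If you want to run this check across many nodes, the same decision can be scripted. The sketch below is a minimal Python example, not an official tool: `detect_cgroup_version` is a hypothetical helper that classifies the text printed by `mount | grep cgroup`. It also reports `hybrid`, since some systemd-based systems mount v1 controllers and a v2 hierarchy side by side.

```python
def detect_cgroup_version(mount_output):
    """Classify `mount | grep cgroup` text as 'v2', 'v1', 'hybrid', or 'none'."""
    fs_types = set()
    for line in mount_output.splitlines():
        fields = line.split()
        # mount lines look like: <source> on <mountpoint> type <fstype> (<options>)
        if len(fields) >= 5 and fields[1] == "on" and fields[3] == "type":
            fs_types.add(fields[4])
    has_v1 = "cgroup" in fs_types
    has_v2 = "cgroup2" in fs_types
    if has_v1 and has_v2:
        return "hybrid"
    if has_v2:
        return "v2"
    if has_v1:
        return "v1"
    return "none"


sample = "cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)"
print(detect_cgroup_version(sample))
```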
Step 2: Explore the Cgroup Filesystem
On a cgroups v2 system, browse the hierarchy:
# List the top-level cgroup directory
ls /sys/fs/cgroup/
Expected output:
cgroup.controllers cgroup.stat io.stat
cgroup.max.depth cgroup.subtree_control memory.current
cgroup.max.descendants cgroup.threads memory.stat
cgroup.procs cpu.stat system.slice/
cgroup.pressure init.scope/ user.slice/
Step 3: Find Your Process in the Cgroup Tree
Every process belongs to a cgroup. Find yours:
# See which cgroup your current shell belongs to
cat /proc/self/cgroup
Expected output (v2):
0::/user.slice/user-1000.slice/session-42.scope
Expected output (v1 — multiple lines, one per controller):
12:memory:/user.slice/user-1000.slice
11:cpuset:/
10:cpu,cpuacct:/user.slice/user-1000.slice
...
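Each line of `/proc/<PID>/cgroup` has three colon-separated fields: a hierarchy ID, a comma-separated controller list, and a path. On v2 the ID is 0 and the controller list is empty. The following Python sketch parses that format; the function names are hypothetical, and the sample strings are the ones shown above.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<PID>/cgroup text into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.strip().splitlines():
        # Split on the first two colons only; the path itself may contain colons.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def v2_path(entries):
    """Return the unified (v2) cgroup path, or None on a pure v1 system."""
    for hierarchy_id, names, path in entries:
        if hierarchy_id == 0 and not names:
            return path
    return None


entries = parse_proc_cgroup("0::/user.slice/user-1000.slice/session-42.scope\n")
print(v2_path(entries))
```

To use it on a live system, pass in the contents of `/proc/self/cgroup` instead of the sample string.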
Step 4: Inspect a Running Slurm Job's Cgroups
When Slurm runs your job, it creates a cgroup for it. Submit a test job and inspect its cgroup:
# Submit a simple job that sleeps so you can inspect it
sbatch --job-name=cgroup-test --ntasks=1 --cpus-per-task=2 --mem=4G --time=00:10:00 --wrap="sleep 600"
Expected output:
Submitted batch job 123456
Now SSH to the compute node where the job is running and inspect:
# Find which node the job is on
squeue -u $USER -o "%.8i %.8j %.4T %.10M %.6D %R"
Expected output:
JOBID NAME ST TIME NODES NODELIST
123456 cgroup-t R 0:05 1 node001
# SSH to the node and find the job's cgroup (requires node access)
ssh node001
find /sys/fs/cgroup -name "job_123456" -type d 2>/dev/null
Expected output (v2, systemd-based):
/sys/fs/cgroup/system.slice/slurmstepd.scope/job_123456
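Going the other way, you can recover the job ID from a cgroup path, for example to tell which job a stray process belongs to. This Python sketch assumes the `job_<id>` and `step_<name>` directory naming used in the paths in this guide; your cluster's layout may differ, and `job_and_step` is a hypothetical helper.

```python
import re

# Assumed naming: job_<id> and step_<name> directories, as in the paths shown here.
_JOB_STEP = re.compile(r"/job_(\d+)(?:/step_([^/]+))?")


def job_and_step(cgroup_path):
    """Return (job_id, step_name) found in a cgroup path; either may be None."""
    match = _JOB_STEP.search(cgroup_path)
    if match is None:
        return (None, None)
    return (int(match.group(1)), match.group(2))


print(job_and_step("/sys/fs/cgroup/system.slice/slurmstepd.scope/job_123456"))
```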
Step 5: Read Cgroup Resource Limits
Once you find the job's cgroup directory, inspect its limits:
JOB_CGROUP="/sys/fs/cgroup/system.slice/slurmstepd.scope/job_123456"
# Memory limit (in bytes)
cat $JOB_CGROUP/memory.max
Expected output:
4294967296
That is 4GB (4 * 1024^3), matching the --mem=4G we requested.
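Raw byte counts are hard to read at a glance, and `memory.max` can also contain the literal string `max` when no limit is set. A small Python sketch of the conversion follows; `format_limit` is a hypothetical helper and uses binary units (1 GiB = 1024^3 bytes), which is how the 4G request above maps to 4294967296.

```python
def format_limit(raw):
    """Render a memory.max or memory.current value as a human-readable string."""
    value = raw.strip()
    if value == "max":
        return "unlimited"
    size = float(value)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit that keeps the number below 1024.
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


print(format_limit("4294967296"))
```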
# Current memory usage
cat $JOB_CGROUP/memory.current
Expected output:
8192000
# CPU set — which cores the job can use
cat $JOB_CGROUP/cpuset.cpus
Expected output:
0-1
This shows the job is pinned to cores 0 and 1, matching --cpus-per-task=2.
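The cpuset files use a compact list format that mixes ranges and single IDs, such as `0-1` or `0-3,8,10-11`. To count or compare cores, it helps to expand the list. The Python sketch below is one way to do that; `expand_cpu_list` is a hypothetical helper name.

```python
def expand_cpu_list(text):
    """Expand a cpuset list such as '0-1,4,8-10' into a sorted list of core IDs."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file yields an empty list
        if "-" in part:
            first, last = part.split("-")
            cores.update(range(int(first), int(last) + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


print(len(expand_cpu_list("0-1")))
```

Comparing the length of the result with your `--cpus-per-task` request is a quick sanity check on the allocation.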
# Number of processes in this cgroup
cat $JOB_CGROUP/pids.current
Expected output:
3
Step 6: Check Available Controllers
See which controllers are available and active:
# Available controllers at the root
cat /sys/fs/cgroup/cgroup.controllers
Expected output:
cpuset cpu io memory hugetlb pids rdma misc
# Controllers enabled for child cgroups
cat /sys/fs/cgroup/cgroup.subtree_control
Expected output:
cpuset cpu io memory pids
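The difference between the two files tells you which controllers exist on the system but are not passed down to child cgroups. A short Python sketch of that comparison, using the sample values above (`controller_report` is a hypothetical helper):

```python
def controller_report(available_text, enabled_text):
    """Compare cgroup.controllers with cgroup.subtree_control."""
    available = set(available_text.split())
    enabled = set(enabled_text.split())
    return {
        "enabled": sorted(available & enabled),
        "not_enabled": sorted(available - enabled),
    }


report = controller_report(
    "cpuset cpu io memory hugetlb pids rdma misc",
    "cpuset cpu io memory pids",
)
print(report["not_enabled"])
```

In this sample, hugetlb, misc, and rdma are available but not enabled for children, so limits for those resources cannot be set on cgroups directly below the root.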
Practical Examples
Example 1: Understanding an OOM-Killed Job
One of the most common HPC issues is a job being killed for exceeding its memory limit. Here is what happens behind the scenes:
# Submit a job that requests 2GB but tries to use more
sbatch --mem=2G --wrap="python3 -c \"
import numpy as np
# Allocate and fill a 3GB array, more than our 2GB limit
# (np.ones writes every page; np.zeros may reserve memory without using it)
arr = np.ones((3 * 1024**3 // 8,), dtype=np.float64)
print('This will never print')
\""
Check the job's exit status:
sacct -j <JOBID> --format=JobID,State,ExitCode,MaxRSS
Expected output:
JobID State ExitCode MaxRSS
------------ ---------- -------- ----------
123457 OUT_OF_ME+ 0:137
123457.batch OUT_OF_ME+ 0:137 2048576K
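The value 137 is 128 + 9: by shell convention, it means the process was killed by signal 9 (SIGKILL). Here the signal came from the kernel's OOM killer, triggered when the job's memory cgroup reached its limit. The ExitCode that sacct reports for an OOM kill can differ between Slurm versions, so treat the OUT_OF_MEMORY state as the more reliable indicator.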
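Exit statuses above 128 conventionally encode 128 plus a signal number. The Python sketch below decodes such a status with the standard `signal` module. `describe_exit_status` is a hypothetical helper, and the convention is only a heuristic: a program can also return a value above 128 on its own.

```python
import signal


def describe_exit_status(status):
    """Describe a shell-style exit status, decoding 128 + N as signal N."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # not a named signal on this platform
        return f"killed by {name}"
    return f"exited with code {status}"


print(describe_exit_status(137))
```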
To confirm, check the kernel log on the compute node:
dmesg | grep -i "oom\|killed process" | tail -5
Expected output:
[12345.678] memory: usage 2097152kB, limit 2097152kB, failcnt 42
[12345.679] oom-kill: constraint=CONSTRAINT_MEMCG ...
[12345.680] Killed process 54321 (python3) total-vm:3145728kB ...
Example 2: Checking CPU Pinning for MPI Jobs
When running parallel jobs, CPU pinning via the cpuset controller prevents processes from migrating between cores, which is critical for performance:
# Submit a 4-task MPI job
sbatch --ntasks=4 --cpus-per-task=1 --wrap="srun hostname"
On the compute node, inspect each step's cpuset:
JOB_DIR="/sys/fs/cgroup/system.slice/slurmstepd.scope/job_123458"
for step in "$JOB_DIR"/step_*/; do
    echo "$(basename "$step"): cpus=$(cat "$step/cpuset.cpus")"
done
Expected output:
step_0: cpus=0
step_1: cpus=1
step_2: cpus=2
step_3: cpus=3
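Each cgroup is restricted to a separate core, so the processes in it do not contend for the same CPU. The exact layout (for example, whether tasks get their own cgroups beneath a step) depends on your Slurm version and configuration, so the directory names on your cluster may differ.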
Example 3: Monitoring Real-Time Resource Usage
You can monitor a job's resource consumption in real time through cgroups:
# Watch memory usage of a running job (run on the compute node)
JOB_CGROUP="/sys/fs/cgroup/system.slice/slurmstepd.scope/job_123456"
# \$(...) is escaped so the files are re-read on every refresh, not just once
watch -n 1 "echo \"Memory: \$(cat $JOB_CGROUP/memory.current) / \$(cat $JOB_CGROUP/memory.max) bytes\"; \
echo 'CPU usage:'; head -3 $JOB_CGROUP/cpu.stat"
Expected output (updates every second):
Memory: 1073741824 / 4294967296 bytes
CPU usage:
usage_usec 5432100
user_usec 5000000
system_usec 432100
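The counters in `cpu.stat` are cumulative, so a single reading says little about current load. Taking two readings and dividing the change in `usage_usec` by the elapsed time gives the average number of cores in use. The Python sketch below shows the arithmetic; both function names are hypothetical, and the second sample value is made up for illustration.

```python
def parse_flat_keyed(text):
    """Parse 'key value' lines (the format of cpu.stat) into a dict of ints."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats


def cpu_cores_used(earlier, later, interval_seconds):
    """Average number of cores busy between two cpu.stat samples."""
    delta_usec = later["usage_usec"] - earlier["usage_usec"]
    return delta_usec / (interval_seconds * 1_000_000)


first = parse_flat_keyed("usage_usec 5432100\nuser_usec 5000000\nsystem_usec 432100\n")
second = {"usage_usec": 7432100}  # hypothetical reading taken one second later
print(cpu_cores_used(first, second, 1.0))
```

A result close to your `--cpus-per-task` value means the job is keeping its cores busy; a much lower value suggests the job is waiting on I/O or is not parallelized as expected.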
Hands-On Exercises
Exercise 1: Cgroup Version Discovery
Goal: Determine the cgroup version on your cluster's compute nodes.
- Submit a job:
srun --pty bash
- Run:
stat -fc %T /sys/fs/cgroup/
- Run:
cat /proc/self/cgroup
- Note whether you see a single line (v2) or multiple lines (v1)
- Check the Slurm configuration:
scontrol show config | grep -i cgroup
Expected findings: You should see CgroupPlugin = cgroup/v2 (or autodetect) and a unified hierarchy.
Exercise 2: Memory Limit Exploration
Goal: Understand how Slurm translates --mem to cgroup limits.
- Submit a job with a specific memory request:
srun --mem=8G --pty bash
- Find your job's cgroup:
cat /proc/self/cgroup
- Navigate to that cgroup directory and check:
CGROUP_PATH=$(cat /proc/self/cgroup | cut -d: -f3)
cat /sys/fs/cgroup${CGROUP_PATH}/memory.max
- Convert the value from bytes to GB. Does it match your request?
- Monitor your memory usage:
cat /sys/fs/cgroup${CGROUP_PATH}/memory.current
Exercise 3: Trigger and Diagnose an OOM Kill
Goal: Intentionally exceed a memory limit and observe the cgroup enforcement.
- Submit a job with a small memory limit:
sbatch --mem=512M --wrap="python3 -c \"
data = []
for i in range(1000):
    data.append(bytearray(1024*1024))  # 1MB per iteration
    if i % 100 == 0:
        print(f'Allocated {i} MB')
\""
- Wait for the job to finish, then check:
sacct -j <JOBID> --format=JobID,State,ExitCode,MaxRSS,ReqMem
- What exit code do you see? What state?
Exercise 4: Compare Requested vs. Used Resources
Goal: Learn to use sacct and cgroup data to identify inefficient jobs.
# Check your recent jobs' efficiency
sacct --starttime=$(date -d '7 days ago' +%Y-%m-%d) \
--format=JobID,JobName,ReqMem,MaxRSS,ReqCPUS,CPUTime,State \
-u $USER
Look for jobs where MaxRSS is much less than ReqMem — these are wasting allocated resources that could be used by other users.
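To put a number on "much less", you can compute the ratio of MaxRSS to ReqMem. The Python sketch below assumes sacct prints both values with a K, M, G, or T suffix in binary units, and that older Slurm releases may append `n` (per node) or `c` (per CPU) to ReqMem. It does not multiply per-CPU requests by the CPU count, so treat the result for those jobs as approximate. Both function names are hypothetical.

```python
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def slurm_mem_to_bytes(text):
    """Convert a sacct memory string such as '2048576K' or '4G' to bytes."""
    value = text.strip()
    # Older Slurm releases append 'n' (per node) or 'c' (per CPU) to ReqMem.
    if value and value[-1] in "nc":
        value = value[:-1]
    if value and value[-1].upper() in _UNITS:
        return int(float(value[:-1]) * _UNITS[value[-1].upper()])
    return int(float(value))  # assumption: a bare number is already in bytes


def memory_efficiency(max_rss, req_mem):
    """Fraction of the requested memory that the job actually used at its peak."""
    return slurm_mem_to_bytes(max_rss) / slurm_mem_to_bytes(req_mem)


print(memory_efficiency("1G", "4G"))
```

A job that peaks at a quarter of its request, as in this example, could likely be resubmitted with a smaller `--mem` value.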
Troubleshooting
Job Killed with Exit Code 137 (OOM Kill)
Symptom: Job state shows OUT_OF_MEMORY, exit code is 0:137.
Cause: Your job exceeded its memory cgroup limit.
Solution:
# Check the peak memory Slurm recorded (for an OOM-killed job, treat it as a lower bound)
sacct -j <JOBID> --format=MaxRSS
# Resubmit with more memory (add 20% buffer)
sbatch --mem=12G your_script.sh
Job Cannot See All GPUs
Symptom: Your job only sees 1 GPU even though the node has 4.
Cause: Slurm uses cgroup device restrictions (the devices controller in v1, an eBPF device filter in v2) to limit GPU visibility to match your --gres=gpu:N request.
Solution: This is expected behavior. Request the number of GPUs you need:
sbatch --gres=gpu:2 your_gpu_script.sh
"Cannot write to cgroup" Errors
Symptom: Slurm logs show errors about writing to cgroup files.
Cause: The cgroup filesystem is not properly configured or systemd is not delegating correctly.
Solution (for admins):
# Check if the cgroup filesystem is mounted
mount | grep cgroup2
# Verify Slurm's cgroup configuration
scontrol show config | grep -i cgroup
Job Uses More CPUs Than Requested
Symptom: Your job's processes spread across more cores than you requested.
Cause: The cpuset controller may not be enabled, or ConstrainCores is not set in cgroup.conf.
Solution (for admins): Ensure cgroup.conf contains:
ConstrainCores=yes
Checking Slurm's Cgroup Configuration
To see how your cluster's Slurm is configured for cgroups:
# View the cgroup-related configuration
scontrol show config | grep -i cgroup
Expected output:
CgroupPlugin = autodetect
# View the cgroup.conf file (if you have access)
cat /etc/slurm/cgroup.conf
Expected output:
CgroupPlugin=autodetect
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRAMSpace=100
AllowedSwapSpace=0
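These settings determine how a `--mem` request becomes a cgroup limit: with ConstrainRAMSpace=yes, the RAM limit is the requested memory scaled by the AllowedRAMSpace percentage. The Python sketch below illustrates that arithmetic under simplifying assumptions. It treats keys as case-sensitive, ignores swap settings, and uses hypothetical helper names; consult the cgroup.conf reference for the exact rules.

```python
def parse_cgroup_conf(text):
    """Parse key=value lines from cgroup.conf text, ignoring comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def ram_limit_bytes(requested_bytes, settings):
    """RAM limit implied by the settings, or None if RAM is not constrained."""
    if settings.get("ConstrainRAMSpace", "no").lower() != "yes":
        return None
    percent = float(settings.get("AllowedRAMSpace", "100"))
    return int(requested_bytes * percent / 100)


conf = parse_cgroup_conf("ConstrainRAMSpace=yes\nAllowedRAMSpace=100\n")
print(ram_limit_bytes(4 * 1024**3, conf))
```

With AllowedRAMSpace=100, a 4G request yields the same 4294967296-byte value seen in memory.max earlier in this guide.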
References
- Linux Kernel cgroups v2 Documentation — the definitive kernel reference
- cgroups(7) man page — Linux manual page covering both v1 and v2
- Slurm Control Group Documentation — how Slurm integrates with cgroups
- Slurm cgroup/v2 Plugin — v2-specific configuration
- Slurm cgroup.conf Reference — configuration file documentation
- Rocky Linux: Migrating cgroups v1 to v2 — practical migration guide
- Red Hat: cgroups v2 in RHEL 8 — enterprise perspective
- HPC Sysadmin Basics: cgroups — HPC-focused cgroup usage
Related Tutorials
- [[cgroups-deep-dive|Cgroups Deep Dive]] — advanced cgroup internals, Slurm plugin configuration, and container integration
- [[docker-test-container-beginner-guide|Docker Test Container Beginner Guide]] — containers use cgroups for resource isolation
- [[docker-test-container-deep-dive|Docker Test Container Deep Dive]] — deeper look at container resource management
- [[linux-permissions-beginner-guide|Linux Permissions Beginner Guide]] — foundational Linux concepts
- [[linux-permissions-deep-dive|Linux Permissions Deep Dive]] — advanced Linux security mechanisms
- [[kubernetes-beginner-guide|Kubernetes Beginner Guide]] — Kubernetes uses cgroups for pod resource limits
- [[kubernetes-deep-dive|Kubernetes Deep Dive]] — Kubernetes resource management internals
- [[isaaclab-metagrasp-apptainer-hpc-beginner-guide|IsaacLab Apptainer HPC Guide]] — running containers on HPC with cgroup-managed resources
- [[hyperqueue-basics|HyperQueue Basics]] — alternative HPC job scheduling
- [[hyperqueue-deep-dive|HyperQueue Deep Dive]] — advanced HPC scheduling concepts
- [[parsl-beginner-guide|Parsl Beginner Guide]] — parallel computing on HPC clusters
- [[parsl-deep-dive|Parsl Deep Dive]] — advanced parallel execution patterns
Summary
Cgroups are the invisible backbone of resource management on HPC clusters. Every time Slurm runs your job, it creates a cgroup that enforces the CPU, memory, and device limits you requested. Understanding cgroups helps you:
- Diagnose job failures — OOM kills, CPU throttling, and device access issues all trace back to cgroup limits
- Right-size your resource requests — inspect actual usage through cgroup accounting to avoid wasting cluster resources
- Understand isolation — know why your job cannot see all GPUs or all memory on a node
Key points to remember:
- Cgroups v2 uses a single unified hierarchy (check with stat -fc %T /sys/fs/cgroup/)
- Slurm creates cgroups automatically for every job, step, and task
- Memory limits are hard: exceeding memory.max triggers an OOM kill (exit code 137)
- CPU pinning via cpuset keeps your processes on their assigned cores
- Inspect cgroups through /sys/fs/cgroup/ and /proc/<PID>/cgroup
For a deeper understanding of cgroup internals, Slurm plugin configuration, and container integration, continue to the [[cgroups-deep-dive|Cgroups Deep Dive]].