Skip to content

cli: size worker pool by cgroup-aware cpu count#976

Open
GrapeBaBa wants to merge 8 commits into
mainfrom
cgroup-aware-cpu-count
Open

cli: size worker pool by cgroup-aware cpu count#976
GrapeBaBa wants to merge 8 commits into
mainfrom
cgroup-aware-cpu-count

Conversation

@GrapeBaBa

Copy link
Copy Markdown
Member

Motivation

std.Thread.getCpuCount() only reads the CPU affinity mask, so under a cgroup CPU quota (docker --cpus / k8s limits.cpu) it reports the host core count rather than the quota, over-sizing the shared worker thread pool and causing CFS throttling in CPU-limited containers. (node.zig already has a comment acknowledging this exact issue near the --rayon-threads guard.)

Changes

  • Add pkgs/cli/src/cpu_count.ziggetNumCpus(gpa, io) = min(cgroup quota, affinity), via /proc/self/{cgroup,mountinfo} + cpu.max / cpu.cfs_quota_us (cgroup v1 + v2). Ported from the num_cpus crate.
  • Use it when sizing the worker pool in both paths: the simulation path (main.zig) and the single-node path (node.zig). Both pass the process Io (init.io); Node.init gains an io param so it uses the caller's io rather than a global/throwaway one.
  • /proc files report size 0 → streamed to EOF (readerStreaming + allocRemaining); ceilDiv overflow-safe; non-Linux falls back to affinity.

Testing

  • 15 unit tests in cpu_count.zig; vectors adapted byte-for-byte from seanmonstar/num_cpus (MIT).
  • zig build install builds the cli; OrbStack Linux container: --cpus=2→2, --cpus=4→4, unlimited→host count.

🤖 Developed with AI assistance (Claude Code).

std.Thread.getCpuCount() only reads the CPU affinity mask, so under a
cgroup CPU quota (docker --cpus / k8s limits.cpu) it reports the host
core count rather than the quota, over-sizing the shared worker thread
pool and causing CFS throttling in CPU-limited containers.

Add pkgs/cli/src/cpu_count.zig (getNumCpus) returning min(cgroup quota,
affinity) by reading /proc/self/{cgroup,mountinfo} and cpu.max /
cpu.cfs_quota_us (cgroup v1 + v2), and use it when sizing the worker pool
in both the simulation (main.zig) and single-node (node.zig) paths.

/proc files report size 0, so they are streamed to EOF (readerStreaming +
allocRemaining); ceilDiv is overflow-safe; non-Linux falls back to the
affinity count.

CPU-count logic ported from the num_cpus crate; test vectors adapted from
seanmonstar/num_cpus (MIT).
g11tech
g11tech previously approved these changes Jun 5, 2026

@ch4r10t33r ch4r10t33r left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three medium notes inline — none blocking. The rayon thread sizing also benefits from this change (rayon_threads derives from desired_workers), which is worth calling out in the PR body since it's a bigger win than "worker pool sizing" alone.

Comment thread pkgs/cli/src/cpu_count.zig
Comment thread pkgs/cli/src/cpu_count.zig Outdated
Comment thread pkgs/cli/src/cpu_count.zig Outdated
Sync cpu_count.zig with the refined error contract: distinguish a genuine
absence of quota from a detection failure, instead of collapsing every
failure into a silent affinity fallback that could mask a real quota and
over-size the worker pool in a CPU-limited container.

- null (-> affinity) only when no quota can be located: non-Linux, no cpu
  controller, no cgroup mount, unresolvable path, quota file absent, or
  unlimited (cpu.max "max" / cfs_quota_us "-1").
- error otherwise: unreadable /proc or quota file, unopenable cgroup dir,
  malformed/zero content, or an unavailable affinity count (getCpuCount no
  longer `catch 1`) -- now propagated to the caller.
- match the v1 "-1" / v2 "max" unlimited sentinels before parseUnsigned so
  they are not taken for malformed input now that parse errors propagate.
- document the 8 MB mountinfo read ceiling.

getNumCpus is now fallible; both call sites (main.zig, node.zig) `try` it.
@GrapeBaBa GrapeBaBa requested review from ch4r10t33r and g11tech June 9, 2026 15:44
GrapeBaBa added 4 commits June 9, 2026 23:45
- Keep cgroup paths containing ':' intact (containerd's
  ...slice:cri-containerd:<id> cgroupfs naming): take the path with
  rest() instead of truncating at the next ':'.
- Treat '..' cgroup paths (process outside its cgroupns root) as
  unresolvable -> affinity fallback instead of a hard error.
- Take the minimum quota along the cgroup ancestor chain (leaf up to
  the mount point), matching Rust available_parallelism: the kernel
  enforces every level but each level's file reports only its own
  limit, so leaf-only reads missed LXC/Proxmox cpulimit and systemd
  CPUQuota= on a parent slice.
- Polish per review: assert ceilDiv preconditions, dedup quota-file
  reads, drop unused pubs, 100-col wraps, hybrid-mount and
  buffer-limit test coverage (23 tests).

Synced verbatim from ChainSafe/lodestar-z#386 (commits 7c84fab7..f504039f);
verified there end-to-end in containers including the ancestor and
colon-path scenarios.
Inline the logicalCpus wrapper (unnecessary abstraction over
std.Thread.getCpuCount(); the n >= 1 assert moves into getNumCpus),
drop the redundant 'never exceeds logical' test (the postcondition
assert checks it on every call), and document the consumer in the
module doc (adapted to this repo: the CLI worker pool).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants