Skip to content

Commit af8af73

Browse files
committed
docs(adr): record NCCL purego/dlopen migration (ADR-002)
Documents the rationale for replacing the CGo nccl binding with a runtime dlopen of libnccl.so.2, the AArch64 hidden-pointer ABI trick that lets us pass ncclUniqueId by value through the shared cuda.Ccall trampoline, and the consequences for build/CI/test posture.
1 parent d68cb06 commit af8af73

1 file changed

Lines changed: 81 additions & 0 deletions

File tree

docs/adr/002-nccl-purego.md

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
# ADR-002: Migrate internal/nccl from CGo to purego/dlopen
2+
3+
**Status:** Accepted
4+
**Date:** 2026-04-09
5+
**Authors:** David Ndungu
6+
7+
## Context
8+
9+
The original `internal/nccl` package was a CGo binding (`#include <nccl.h>`,
10+
`-lnccl`) gated behind `//go:build cuda`. This forced any build that wanted
11+
NCCL to link against the system NCCL headers/library at compile time and
12+
exposed the package only when the `cuda` build tag was set. It also broke the
13+
project-wide rule that `go build ./...` must succeed on every supported
14+
platform without CGo.
15+
16+
Every other GPU runtime binding in ztensor (cuBLAS, cuDNN, cuRAND, TensorRT,
17+
HIP/ROCm, OpenCL) is already loaded at runtime via `internal/cuda.DlopenPath`
18+
and `internal/cuda.Ccall`. NCCL was the lone holdout.
19+
20+
## Decision
21+
22+
`internal/nccl` is implemented in Go-only via runtime dlopen of
23+
`libnccl.so.2`, mirroring the pattern in `internal/cublas/cublas_purego.go`.
24+
25+
Key points:
26+
27+
- **No build tag** on `nccl_purego.go` — it compiles on every platform.
28+
- On non-linux GOOS, `loadNccl` returns a `nccl: not supported on $GOOS`
29+
error without attempting `dlopen`. Every exported entry point surfaces this
30+
as a clean error rather than a panic.
31+
- ABI constants (`ncclSuccess`, the data-type and reduction-op enums, and
32+
`NCCL_UNIQUE_ID_BYTES = 128`) are hardcoded against the stable NCCL 2.x ABI.
33+
- `UniqueID` is marshaled as a fixed-size `[128]byte` array. `Bytes()` and
34+
`UniqueIDFromBytes` provide the serialization round-trip used to ferry the
35+
bootstrap blob between ranks.
36+
- The legacy CGo implementation is retained as `nccl_cgo.go` behind
37+
`//go:build cuda && cgo && nccl_cgo`. It is OFF by default and exists only
38+
as a debugging fallback if a future NCCL release introduces an ABI quirk
39+
that the dlopen path cannot handle.
40+
41+
### AArch64 hidden-pointer ABI for ncclCommInitRank
42+
43+
`ncclCommInitRank` takes the 128-byte `ncclUniqueId` **by value**:
44+
45+
```c
46+
ncclResult_t ncclCommInitRank(ncclComm_t* comm, int nranks,
47+
ncclUniqueId commId, int rank);
48+
```
49+
50+
The shared `cuda.Ccall` trampoline only marshals `uintptr`-sized arguments,
51+
so passing 128 bytes by value is not directly possible. Fortunately the only
52+
supported NCCL platform for ztensor is **linux/arm64** (the DGX Spark host).
53+
Per AAPCS64 §6.8.2 rule B.4, *"if the argument type is a Composite Type that
54+
is larger than 16 bytes, then the argument is copied to memory allocated by
55+
the caller and the argument is replaced by a pointer to the copy."* That
56+
means at the ABI level the third argument is already a pointer; passing
57+
`uintptr(unsafe.Pointer(&uid.id[0]))` is the correct calling convention.
58+
59+
If we ever need to support NCCL on linux/amd64 (System V ABI), this trick
60+
will not work — SysV passes large aggregates on the stack by value — and we
61+
will need a small assembly trampoline, or we can fall back to the CGo path
62+
via the `nccl_cgo` build tag.
63+
64+
## Consequences
65+
66+
- `go build ./...` works everywhere with no `-tags cuda` and no system NCCL
67+
installed.
68+
- Tests in `internal/nccl/nccl_test.go` no longer require a build tag; they
69+
call `requireNccl(t)` to skip when `libnccl.so.2` is not dlopen-able. Pure
70+
constant and marshaling tests run on every platform.
71+
- The duplicate `internal/nccl` copy inside the `zerfoo` repository is **not**
72+
touched by this change and should be migrated in a follow-up.
73+
- CI's `go vet` exclude list adds `/internal/nccl$` since the dlopen
74+
trampoline relies on the same `unsafe.Pointer(uintptr(...))` pattern as
75+
every other GPU runtime binding (already documented in `docs/QUALITY.md`).
76+
77+
## References
78+
79+
- Issue: zerfoo/ztensor#78
80+
- Reference pattern: `internal/cublas/cublas_purego.go`
81+
- AAPCS64: <https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst>

0 commit comments

Comments
 (0)