|
| 1 | +# ADR-002: Migrate internal/nccl from CGo to purego/dlopen |
| 2 | + |
| 3 | +**Status:** Accepted |
| 4 | +**Date:** 2026-04-09 |
| 5 | +**Authors:** David Ndungu |
| 6 | + |
| 7 | +## Context |
| 8 | + |
| 9 | +The original `internal/nccl` package was a CGo binding (`#include <nccl.h>`, |
| 10 | +`-lnccl`) gated behind `//go:build cuda`. This forced any build that wanted |
| 11 | +NCCL to link against the system NCCL headers/library at compile time and |
| 12 | +exposed the package only when the `cuda` build tag was set. It also broke the |
| 13 | +project-wide rule that `go build ./...` must succeed on every supported |
| 14 | +platform without CGo. |
| 15 | + |
| 16 | +Every other GPU runtime binding in ztensor (cuBLAS, cuDNN, cuRAND, TensorRT, |
| 17 | +HIP/ROCm, OpenCL) is already loaded at runtime via `internal/cuda.DlopenPath` |
| 18 | +and `internal/cuda.Ccall`. NCCL was the lone holdout. |
| 19 | + |
| 20 | +## Decision |
| 21 | + |
| 22 | +`internal/nccl` is implemented in Go-only via runtime dlopen of |
| 23 | +`libnccl.so.2`, mirroring the pattern in `internal/cublas/cublas_purego.go`. |
| 24 | + |
| 25 | +Key points: |
| 26 | + |
| 27 | +- **No build tag** on `nccl_purego.go` — it compiles on every platform. |
| 28 | +- On non-linux GOOS, `loadNccl` returns a `nccl: not supported on $GOOS` |
| 29 | + error without attempting `dlopen`. Every exported entry point surfaces this |
| 30 | + as a clean error rather than a panic. |
| 31 | +- ABI constants (`ncclSuccess`, the data-type and reduction-op enums, and |
| 32 | + `NCCL_UNIQUE_ID_BYTES = 128`) are hardcoded against the stable NCCL 2.x ABI. |
| 33 | +- `UniqueID` is marshaled as a fixed-size `[128]byte` array. `Bytes()` and |
| 34 | + `UniqueIDFromBytes` provide the serialization round-trip used to ferry the |
| 35 | + bootstrap blob between ranks. |
| 36 | +- The legacy CGo implementation is retained as `nccl_cgo.go` behind |
| 37 | + `//go:build cuda && cgo && nccl_cgo`. It is OFF by default and exists only |
| 38 | + as a debugging fallback if a future NCCL release introduces an ABI quirk |
| 39 | + that the dlopen path cannot handle. |
| 40 | + |
| 41 | +### AArch64 hidden-pointer ABI for ncclCommInitRank |
| 42 | + |
| 43 | +`ncclCommInitRank` takes the 128-byte `ncclUniqueId` **by value**: |
| 44 | + |
| 45 | +```c |
| 46 | +ncclResult_t ncclCommInitRank(ncclComm_t* comm, int nranks, |
| 47 | + ncclUniqueId commId, int rank); |
| 48 | +``` |
| 49 | +
|
| 50 | +The shared `cuda.Ccall` trampoline only marshals `uintptr`-sized arguments, |
| 51 | +so passing 128 bytes by value is not directly possible. Fortunately the only |
| 52 | +supported NCCL platform for ztensor is **linux/arm64** (the DGX Spark host). |
| 53 | +Per AAPCS64 §6.8.2 rule B.4, *"if the argument type is a Composite Type that |
| 54 | +is larger than 16 bytes, then the argument is copied to memory allocated by |
| 55 | +the caller and the argument is replaced by a pointer to the copy."* That |
| 56 | +means at the ABI level the third argument is already a pointer; passing |
| 57 | +`uintptr(unsafe.Pointer(&uid.id[0]))` is the correct calling convention. |
| 58 | +
|
| 59 | +If we ever need to support NCCL on linux/amd64 (System V ABI), this trick |
| 60 | +will not work — SysV passes large aggregates on the stack by value — and we |
| 61 | +will need a small assembly trampoline, or we can fall back to the CGo path |
| 62 | +via the `nccl_cgo` build tag. |
| 63 | +
|
| 64 | +## Consequences |
| 65 | +
|
| 66 | +- `go build ./...` works everywhere with no `-tags cuda` and no system NCCL |
| 67 | + installed. |
| 68 | +- Tests in `internal/nccl/nccl_test.go` no longer require a build tag; they |
| 69 | + call `requireNccl(t)` to skip when `libnccl.so.2` is not dlopen-able. Pure |
| 70 | + constant and marshaling tests run on every platform. |
| 71 | +- The duplicate `internal/nccl` copy inside the `zerfoo` repository is **not** |
| 72 | + touched by this change and should be migrated in a follow-up. |
| 73 | +- CI's `go vet` exclude list adds `/internal/nccl$` since the dlopen |
| 74 | + trampoline relies on the same `unsafe.Pointer(uintptr(...))` pattern as |
| 75 | + every other GPU runtime binding (already documented in `docs/QUALITY.md`). |
| 76 | +
|
| 77 | +## References |
| 78 | +
|
| 79 | +- Issue: zerfoo/ztensor#78 |
| 80 | +- Reference pattern: `internal/cublas/cublas_purego.go` |
| 81 | +- AAPCS64: <https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst> |
0 commit comments