Skip to content

Commit 9b8e5aa

Browse files
committed
docs(plan): add open-issues resolution plan for #78 and #79
1 parent 06f6b1b commit 9b8e5aa

1 file changed

Lines changed: 188 additions & 0 deletions

File tree

docs/plan.md

Lines changed: 188 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,188 @@
1+
# ztensor Open GitHub Issues Resolution
2+
3+
## Context
4+
5+
Resolve all open GitHub issues in `github.com/zerfoo/ztensor`:
6+
7+
- **#78** Migrate `internal/nccl` from CGo to purego/dlopen. PR #80 is open on branch `chore/nccl-purego`, CI green (per memory 2412).
8+
- **#79** GPU engine `dst`-output routing bug: tensors written by GPU kernels read back as all-zero via `.Data()` on DGX GB10, freezing PatchTST training in zerfoo at a deterministic loss. CPU engine converges on the same test. Minimal ztensor-level diagnostic tests on branch `debug/gpu-dst-routing` did NOT reproduce the bug on DGX (memory 2407), so the fault is narrower than "any engine op" — likely specific to the op mix, in-place aliasing, or storage-flip path exercised by `trainWindowedGPU`.
9+
10+
### Objectives
11+
1. Merge PR #80, close #78.
12+
2. Reproduce #79 with a ztensor-only test, land a fix, close #79.
13+
14+
### Non-goals
15+
- Refactoring unrelated GPU engine code.
16+
- Touching zerfoo's PatchTST training loop beyond what is needed to bisect.
17+
18+
### Success metrics
19+
- `gh issue list --repo zerfoo/ztensor --state open` returns zero issues.
20+
- `go build ./...` in ztensor compiles with no build tags.
21+
- A ztensor unit test reproduces the original #79 failure on a pre-fix commit and passes post-fix on DGX Spark.
22+
23+
## Discovery Summary
24+
25+
Work type: **Engineering**.
26+
27+
Existing artifacts:
28+
- Branch `chore/nccl-purego` → PR #80 (CI green, verified on DGX GB10, ADR drafted).
29+
- Branch `debug/gpu-dst-routing` with basic round-trip tests for `Add`/`MatMul` — did not reproduce #79.
30+
- Issue #79 body contains seven probe logs and four hypotheses (α buffer mismatch, β SetStorage-after-kernel clobber, γ D2H source divergence, δ in-place aliasing).
31+
- zerfoo `devlog.md` entry dated 2026-04-08 "FINAL" has full investigation history.
32+
33+
Key code paths implicated by #79:
34+
- `compute/gpu_kernels.go:121-132``makeGPUResult` allocates a fresh device buffer and calls `dst[0].SetStorage(gs)`.
35+
- `tensor/gpu_storage.go:215-250``GPUStorage.Slice()` does `make + D2H memcpy` on every `.Data()`.
36+
- `compute/gpu_engine*.go` — Add/Sub/Mul/MatMul dispatch.
37+
38+
## Scope and Deliverables
39+
40+
| ID | Deliverable | Acceptance |
41+
|----|-------------|------------|
42+
| D1 | PR #80 merged via rebase-and-merge | #78 closed, `main` builds without `-tags cuda` |
43+
| D2 | ztensor regression test reproducing #79 on DGX | Test fails on current `main`, passes after fix |
44+
| D3 | Root-cause fix in ztensor GPU engine | PatchTST GPU training in zerfoo converges on DGX |
45+
| D4 | ADR documenting the routing fix | File in `docs/adr/` |
46+
| D5 | devlog entry with RCA | Appended to `docs/devlog.md` |
47+
48+
In scope: NCCL migration finalization, #79 reproduction, bisect, fix, tests, docs.
49+
Out of scope: New GPU features, perf work, unrelated kernel changes.
50+
51+
## Checkable Work Breakdown
52+
53+
### E1 — Close #78 (NCCL purego)
54+
55+
- [ ] T1.1 Re-check PR #80 CI on latest push. Owner: TBD. Est: 10m. verifies: [infrastructure]
56+
- [ ] T1.2 Rebase PR #80 onto `main` if needed. Owner: TBD. Est: 15m. verifies: [infrastructure]
57+
- [ ] T1.3 Rebase-and-merge PR #80, confirm #78 auto-closes. Owner: TBD. Est: 10m. verifies: [infrastructure]
58+
- [ ] T1.4 Verify `go build ./...` (no tags) on `main` post-merge. Owner: TBD. Est: 5m. verifies: [infrastructure]
59+
- [ ] T1.5 Open tracking issue in zerfoo to drop its duplicate `internal/nccl` and import from ztensor. Owner: TBD. Est: 15m. verifies: [infrastructure]
60+
61+
### E2 — Reproduce #79 at ztensor level
62+
63+
- [ ] T2.1 Expand `gpu_engine_dst_routing_test.go` to mirror `trainWindowedGPU`'s exact op sequence: MatMul → Add (in-place dst==src0) → Sub → elementwise scale, with storage flips between CPU-allocated wrappers and GPU kernel outputs. Owner: TBD. Est: 90m. verifies: [UC-79]
64+
- [ ] T2.2 Add a test exercising `engine.Add(ctx, a, b, a)` (in-place aliasing, hypothesis δ). Owner: TBD. Est: 30m. verifies: [UC-79]
65+
- [ ] T2.3 Add a test where `dst` is a freshly-allocated CPUStorage wrapper that gets flipped to GPUStorage by `makeGPUResult`, then read via `.Data()` immediately and after `engine.Sync()`. Owner: TBD. Est: 45m. verifies: [UC-79]
66+
- [ ] T2.4 Submit reproducer as a Spark Job manifest to `192.168.86.250:8080`; capture logs. Owner: TBD. Est: 30m. verifies: [UC-79]
67+
- [ ] T2.5 If still not reproduced, port the failing `trainWindowedGPU` first-batch path to a standalone `ztensor/compute` integration test (vendored minimal graph). Owner: TBD. Est: 2h. verifies: [UC-79]
68+
69+
### E3 — Diagnose #79
70+
71+
- [ ] T3.1 Instrument `makeGPUResult` to log: caller `dst` pointer, existing storage dptr, kernel write-target dptr, post-`SetStorage` dptr. Run reproducer on DGX via Spark. Owner: TBD. Est: 60m. verifies: [UC-79]
72+
- [ ] T3.2 Instrument `GPUStorage.Slice()` D2H to log source dptr and first 4 bytes. Owner: TBD. Est: 30m. verifies: [UC-79]
73+
- [ ] T3.3 Classify root cause against hypotheses α/β/γ/δ from issue #79. Owner: TBD. Est: 30m. verifies: [UC-79]
74+
- [ ] T3.4 Write ADR `docs/adr/002-gpu-dst-routing-fix.md` describing the chosen fix. Owner: TBD. Est: 30m. verifies: [infrastructure]
75+
76+
### E4 — Fix #79
77+
78+
- [ ] T4.1 Implement the routing fix in `compute/gpu_kernels.go` (and engine ops as needed). Owner: TBD. Est: 2h. verifies: [UC-79]
79+
- [ ] T4.2 Make `.Data()` on GPU tensor implicitly `Sync()` the stream before D2H memcpy (safety net). Owner: TBD. Est: 45m. verifies: [UC-79]
80+
- [ ] T4.3 Remove diagnostic logging added in E3. Owner: TBD. Est: 15m. verifies: [infrastructure]
81+
- [ ] T4.4 Run `gofmt`, `go vet`, `golangci-lint run` across touched packages. Owner: TBD. Est: 15m. verifies: [infrastructure]
82+
- [ ] T4.5 Unit tests: ensure E2 tests now pass on DGX via Spark submission. Owner: TBD. Est: 30m. verifies: [UC-79]
83+
- [ ] T4.6 End-to-end: run `zerfoo/scripts/bench-spark.sh -samples 1000 -channels 5 -epochs 3` against a zerfoo branch pinned to the fixed ztensor; confirm loss decreases across epochs. Owner: TBD. Est: 45m. verifies: [UC-79]
84+
85+
### E5 — Release and close
86+
87+
- [ ] T5.1 Commit fix in small logical commits (test, instrumentation-removal, fix, ADR, devlog). Owner: TBD. Est: 20m. verifies: [infrastructure]
88+
- [ ] T5.2 Open PR for #79 fix, link issue. Owner: TBD. Est: 10m. verifies: [infrastructure]
89+
- [ ] T5.3 Rebase-and-merge after CI green. Owner: TBD. Est: 10m. verifies: [infrastructure]
90+
- [ ] T5.4 Confirm release-please PR bumps ztensor version; merge when ready. Owner: TBD. Est: 10m. verifies: [infrastructure]
91+
- [ ] T5.5 Append `docs/devlog.md` entry with RCA, probe outputs, fix summary. Owner: TBD. Est: 20m. verifies: [infrastructure]
92+
- [ ] T5.6 Confirm both #78 and #79 are closed. Owner: TBD. Est: 5m. verifies: [infrastructure]
93+
94+
## Parallel Work
95+
96+
| Track | Tasks | Notes |
97+
|-------|-------|-------|
98+
| A — NCCL merge | T1.1–T1.5 | Independent of #79 work |
99+
| B — #79 repro | T2.1–T2.5 | Independent of A |
100+
| C — #79 diagnose/fix | T3.*, T4.* | Depends on B |
101+
| D — Release | T5.* | Depends on A and C |
102+
103+
### Waves
104+
105+
### Wave 1: Parallel kickoff (6 agents)
106+
- [ ] T1.1 CI recheck PR #80
107+
- [ ] T1.2 Rebase PR #80
108+
- [ ] T2.1 Expand routing test
109+
- [ ] T2.2 In-place aliasing test
110+
- [ ] T2.3 Storage-flip test
111+
- [ ] T1.5 zerfoo duplicate-nccl tracker issue
112+
113+
### Wave 2: Merge + Spark repro (3 agents)
114+
- [ ] T1.3 Merge PR #80
115+
- [ ] T1.4 Post-merge build verify
116+
- [ ] T2.4 Submit reproducer to Spark
117+
118+
### Wave 3: Diagnose (fallback + instrument) (3 agents)
119+
- [ ] T2.5 Fallback trainWindowedGPU port (only if Wave 2 didn't repro)
120+
- [ ] T3.1 Instrument makeGPUResult
121+
- [ ] T3.2 Instrument GPUStorage.Slice
122+
123+
### Wave 4: Classify + fix (3 agents)
124+
- [ ] T3.3 Classify root cause
125+
- [ ] T3.4 ADR
126+
- [ ] T4.1 Implement fix
127+
128+
### Wave 5: Harden + verify (4 agents)
129+
- [ ] T4.2 Implicit Sync on .Data()
130+
- [ ] T4.3 Remove diagnostics
131+
- [ ] T4.4 Lint/format
132+
- [ ] T4.5 Spark test rerun
133+
134+
### Wave 6: Release (5 agents, sequential within track)
135+
- [ ] T4.6 zerfoo bench verification
136+
- [ ] T5.1 Small commits
137+
- [ ] T5.2 Open PR
138+
- [ ] T5.3 Merge
139+
- [ ] T5.4 release-please
140+
- [ ] T5.5 devlog
141+
- [ ] T5.6 Confirm closed
142+
143+
## Timeline and Milestones
144+
145+
| ID | Milestone | Dependencies | Exit criteria |
146+
|----|-----------|--------------|---------------|
147+
| M1 | #78 closed | Wave 2 | PR #80 merged, main builds no-tags |
148+
| M2 | #79 reproduced in ztensor test | Wave 2 or Wave 3 | Test fails deterministically on DGX |
149+
| M3 | #79 root cause identified | Wave 4 | Classified α/β/γ/δ, ADR drafted |
150+
| M4 | #79 fix merged | Wave 6 | PR merged, CI green |
151+
| M5 | zerfoo PatchTST GPU training converges | T4.6 | Loss decreases over 3 epochs via bench-spark |
152+
153+
## Risk Register
154+
155+
| ID | Risk | Impact | Likelihood | Mitigation |
156+
|----|------|--------|------------|------------|
157+
| R1 | #79 cannot be reproduced at ztensor level | High | Medium | Fallback T2.5: port minimal zerfoo graph verbatim |
158+
| R2 | Fix masks symptom, zerfoo still frozen | High | Low | Gate M4 on T4.6 bench-spark convergence |
159+
| R3 | Spark host contention delays DGX runs | Med | Low | Batch probes into single manifest submissions |
160+
| R4 | NCCL purego regressions in zerfoo's distributed path | Med | Low | T1.5 follow-up issue; keep `nccl_cgo` legacy path behind `//go:build nccl_cgo` for rollback |
161+
162+
## Operating Procedure
163+
164+
- Definition of done: PR merged, CI green, DGX Spark verification captured in PR description, release-please PR merged.
165+
- Always add tests alongside fixes.
166+
- Always run `gofmt`, `go vet`, `golangci-lint` after code changes.
167+
- No `ssh` benchmarks — use `bench-spark.sh` / Spark manifests only.
168+
- Small logical commits; never mix files across subdirectories in a single commit.
169+
- Rebase-and-merge on GitHub (never squash, never merge commits).
170+
171+
## Progress Log
172+
173+
### 2026-04-09 — Plan created
174+
Plan authored to resolve #78 and #79. #78 has PR #80 already open (CI green on DGX per prior session). #79 diagnostic tests on `debug/gpu-dst-routing` did not reproduce; plan adds a fallback path to port zerfoo's failing op sequence directly into a ztensor test. No ADRs created yet — ADR-002 will be written when root cause is classified.
175+
176+
## Hand-off Notes
177+
178+
- PR #80 branch: `chore/nccl-purego`. Reference pattern: `internal/cublas/cublas_purego.go`.
179+
- #79 instrumentation targets: `compute/gpu_kernels.go:121-132`, `tensor/gpu_storage.go:215-250`.
180+
- DGX Spark: `SPARK=http://192.168.86.250:8080`. Submit manifests, never ssh.
181+
- zerfoo reproducer: `scripts/bench-spark.sh -samples 1000 -channels 5 -epochs 3 -cleanup`. Frozen loss signature: `0.268357`.
182+
- Full investigation history: zerfoo `docs/devlog.md` 2026-04-08 "FINAL" entry.
183+
184+
## Appendix
185+
186+
Use cases:
187+
- **UC-79** PatchTST training on GPU engine converges on DGX GB10 (currently BROKEN — loss frozen at 0.268357).
188+
- **UC-78** `go build ./...` in ztensor compiles the NCCL path without `-tags cuda` (IN PROGRESS via PR #80).

0 commit comments

Comments
 (0)