Commit 957e74c

docs(plan): trim to investigation-closed state for #78/#79
1 parent e04540e

1 file changed: docs/plan.md
Lines changed: 80 additions & 181 deletions
@@ -2,187 +2,86 @@

## Context

-Resolve all open GitHub issues in `github.com/zerfoo/ztensor`:
-
-- **#78** Migrate `internal/nccl` from CGo to purego/dlopen. PR #80 is open on branch `chore/nccl-purego`, CI green (per memory 2412).
-- **#79** GPU engine `dst`-output routing bug: tensors written by GPU kernels read back as all-zero via `.Data()` on DGX GB10, freezing PatchTST training in zerfoo at a deterministic loss. CPU engine converges on the same test. Minimal ztensor-level diagnostic tests on branch `debug/gpu-dst-routing` did NOT reproduce the bug on DGX (memory 2407), so the fault is narrower than "any engine op" -- likely specific to the op mix, in-place aliasing, or storage-flip path exercised by `trainWindowedGPU`.
-
-### Objectives
-1. Merge PR #80, close #78.
-2. Reproduce #79 with a ztensor-only test, land a fix, close #79.
-
-### Non-goals
-- Refactoring unrelated GPU engine code.
-- Touching zerfoo's PatchTST training loop beyond what is needed to bisect.
-
-### Success metrics
-- `gh issue list --repo zerfoo/ztensor --state open` returns zero issues.
-- `go build ./...` in ztensor compiles with no build tags.
-- A ztensor unit test reproduces the original #79 failure on a pre-fix commit and passes post-fix on DGX Spark.
-
-## Discovery Summary
-
-Work type: **Engineering**.
-
-Existing artifacts:
-- Branch `chore/nccl-purego` → PR #80 (CI green, verified on DGX GB10, ADR drafted).
-- Branch `debug/gpu-dst-routing` with basic round-trip tests for `Add`/`MatMul` -- did not reproduce #79.
-- Issue #79 body contains seven probe logs and four hypotheses (α buffer mismatch, β SetStorage-after-kernel clobber, γ D2H source divergence, δ in-place aliasing).
-- zerfoo `devlog.md` entry dated 2026-04-08 "FINAL" has full investigation history.
-
-Key code paths implicated by #79:
-- `compute/gpu_kernels.go:121-132` -- `makeGPUResult` allocates a fresh device buffer and calls `dst[0].SetStorage(gs)`.
-- `tensor/gpu_storage.go:215-250` -- `GPUStorage.Slice()` does `make` + D2H memcpy on every `.Data()`.
-- `compute/gpu_engine*.go` -- Add/Sub/Mul/MatMul dispatch.
-
-## Scope and Deliverables
-
-| ID | Deliverable | Acceptance |
-|----|-------------|------------|
-| D1 | PR #80 merged via rebase-and-merge | #78 closed, `main` builds without `-tags cuda` |
-| D2 | ztensor regression test reproducing #79 on DGX | Test fails on current `main`, passes after fix |
-| D3 | Root-cause fix in ztensor GPU engine | PatchTST GPU training in zerfoo converges on DGX |
-| D4 | ADR documenting the routing fix | File in `docs/adr/` |
-| D5 | devlog entry with RCA | Appended to `docs/devlog.md` |
-
-In scope: NCCL migration finalization, #79 reproduction, bisect, fix, tests, docs.
-Out of scope: new GPU features, perf work, unrelated kernel changes.
-
-## Checkable Work Breakdown
-
-### E1 -- Close #78 (NCCL purego)
-
-- [x] T1.1 Re-check PR #80 CI on latest push. 2026-04-09. CI green (test/COMPLETED/SUCCESS).
-- [x] T1.2 Rebase PR #80 onto `main` if needed. 2026-04-09. mergeStateStatus=CLEAN, no rebase needed.
-- [x] T1.3 Rebase-and-merge PR #80, confirm #78 auto-closes. 2026-04-09. Merged as af8af73; #78 CLOSED.
-- [x] T1.4 Verify `go build ./...` (no tags) on `main` post-merge. 2026-04-09. Build OK.
-- [ ] T1.5 Open a tracking issue in zerfoo to drop its duplicate `internal/nccl` and import from ztensor. Owner: TBD. Est: 15m. verifies: [infrastructure]
-
-### E2 -- Reproduce #79 at ztensor level
-
-- [ ] T2.1 Expand `gpu_engine_dst_routing_test.go` to mirror `trainWindowedGPU`'s exact op sequence: MatMul → Add (in-place dst==src0) → Sub → elementwise scale, with storage flips between CPU-allocated wrappers and GPU kernel outputs. Owner: TBD. Est: 90m. verifies: [UC-79]
-- [ ] T2.2 Add a test exercising `engine.Add(ctx, a, b, a)` (in-place aliasing, hypothesis δ). Owner: TBD. Est: 30m. verifies: [UC-79]
-- [ ] T2.3 Add a test where `dst` is a freshly allocated CPUStorage wrapper that gets flipped to GPUStorage by `makeGPUResult`, then read via `.Data()` immediately and after `engine.Sync()`. Owner: TBD. Est: 45m. verifies: [UC-79]
-- [ ] T2.4 Submit the reproducer as a Spark Job manifest to `192.168.86.250:8080`; capture logs. Owner: TBD. Est: 30m. verifies: [UC-79]
-- [x] T2.5 Port the `trainWindowedGPU` patch-embedding backward op sequence to a standalone compute test. 2026-04-09. Added TestGPUEngine_PatchTSTBackward_DstRoundTrip on branch fix/issue-79-matmul-accumulate-repro. All 5 dst-routing tests PASS on DGX GB10 via Spark pod ztensor-issue79-repro-1775759440. Bug not reproducible at the ztensor primitive level. See docs/devlog.md 2026-04-09 entry.
-
-### E3 -- Diagnose #79
-
-- [ ] T3.1 Instrument `makeGPUResult` to log: caller `dst` pointer, existing storage dptr, kernel write-target dptr, post-`SetStorage` dptr. Run the reproducer on DGX via Spark. Owner: TBD. Est: 60m. verifies: [UC-79]
-- [ ] T3.2 Instrument `GPUStorage.Slice()` D2H to log the source dptr and first 4 bytes. Owner: TBD. Est: 30m. verifies: [UC-79]
-- [ ] T3.3 Classify the root cause against hypotheses α/β/γ/δ from issue #79. Owner: TBD. Est: 30m. verifies: [UC-79]
-- [ ] T3.4 Write ADR `docs/adr/002-gpu-dst-routing-fix.md` describing the chosen fix. Owner: TBD. Est: 30m. verifies: [infrastructure]
-
-### E4 -- Fix #79
-
-- [ ] T4.1 Implement the routing fix in `compute/gpu_kernels.go` (and engine ops as needed). Owner: TBD. Est: 2h. verifies: [UC-79]
-- [ ] T4.2 Make `.Data()` on a GPU tensor implicitly `Sync()` the stream before the D2H memcpy (safety net). Owner: TBD. Est: 45m. verifies: [UC-79]
-- [ ] T4.3 Remove diagnostic logging added in E3. Owner: TBD. Est: 15m. verifies: [infrastructure]
-- [ ] T4.4 Run `gofmt`, `go vet`, `golangci-lint run` across touched packages. Owner: TBD. Est: 15m. verifies: [infrastructure]
-- [ ] T4.5 Unit tests: ensure the E2 tests now pass on DGX via Spark submission. Owner: TBD. Est: 30m. verifies: [UC-79]
-- [ ] T4.6 End-to-end: run `zerfoo/scripts/bench-spark.sh -samples 1000 -channels 5 -epochs 3` against a zerfoo branch pinned to the fixed ztensor; confirm loss decreases across epochs. Owner: TBD. Est: 45m. verifies: [UC-79]
-
-### E5 -- Release and close
-
-- [ ] T5.1 Commit the fix in small logical commits (test, instrumentation removal, fix, ADR, devlog). Owner: TBD. Est: 20m. verifies: [infrastructure]
-- [ ] T5.2 Open a PR for the #79 fix, link the issue. Owner: TBD. Est: 10m. verifies: [infrastructure]
-- [ ] T5.3 Rebase-and-merge after CI is green. Owner: TBD. Est: 10m. verifies: [infrastructure]
-- [ ] T5.4 Confirm the release-please PR bumps the ztensor version; merge when ready. Owner: TBD. Est: 10m. verifies: [infrastructure]
-- [ ] T5.5 Append a `docs/devlog.md` entry with RCA, probe outputs, fix summary. Owner: TBD. Est: 20m. verifies: [infrastructure]
-- [ ] T5.6 Confirm both #78 and #79 are closed. Owner: TBD. Est: 5m. verifies: [infrastructure]
-
-## Parallel Work
-
-| Track | Tasks | Notes |
-|-------|-------|-------|
-| A -- NCCL merge | T1.1–T1.5 | Independent of #79 work |
-| B -- #79 repro | T2.1–T2.5 | Independent of A |
-| C -- #79 diagnose/fix | T3.*, T4.* | Depends on B |
-| D -- Release | T5.* | Depends on A and C |
-
-### Waves
-
-#### Wave 1: Parallel kickoff (6 agents)
-- [ ] T1.1 CI recheck PR #80
-- [ ] T1.2 Rebase PR #80
-- [ ] T2.1 Expand routing test
-- [ ] T2.2 In-place aliasing test
-- [ ] T2.3 Storage-flip test
-- [ ] T1.5 zerfoo duplicate-nccl tracker issue
-
-#### Wave 2: Merge + Spark repro (3 agents)
-- [ ] T1.3 Merge PR #80
-- [ ] T1.4 Post-merge build verify
-- [ ] T2.4 Submit reproducer to Spark
-
-#### Wave 3: Diagnose (fallback + instrument) (3 agents)
-- [ ] T2.5 Fallback trainWindowedGPU port (only if Wave 2 didn't repro)
-- [ ] T3.1 Instrument makeGPUResult
-- [ ] T3.2 Instrument GPUStorage.Slice
-
-#### Wave 4: Classify + fix (3 agents)
-- [ ] T3.3 Classify root cause
-- [ ] T3.4 ADR
-- [ ] T4.1 Implement fix
-
-#### Wave 5: Harden + verify (4 agents)
-- [ ] T4.2 Implicit Sync on .Data()
-- [ ] T4.3 Remove diagnostics
-- [ ] T4.4 Lint/format
-- [ ] T4.5 Spark test rerun
-
-#### Wave 6: Release (5 agents, sequential within track)
-- [ ] T4.6 zerfoo bench verification
-- [ ] T5.1 Small commits
-- [ ] T5.2 Open PR
-- [ ] T5.3 Merge
-- [ ] T5.4 release-please
-- [ ] T5.5 devlog
-- [ ] T5.6 Confirm closed
-
-## Timeline and Milestones
-
-| ID | Milestone | Dependencies | Exit criteria |
-|----|-----------|--------------|---------------|
-| M1 | #78 closed | Wave 2 | PR #80 merged, `main` builds with no tags |
-| M2 | #79 reproduced in ztensor test | Wave 2 or Wave 3 | Test fails deterministically on DGX |
-| M3 | #79 root cause identified | Wave 4 | Classified α/β/γ/δ, ADR drafted |
-| M4 | #79 fix merged | Wave 6 | PR merged, CI green |
-| M5 | zerfoo PatchTST GPU training converges | T4.6 | Loss decreases over 3 epochs via bench-spark |
-
-## Risk Register
-
-| ID | Risk | Impact | Likelihood | Mitigation |
-|----|------|--------|------------|------------|
-| R1 | #79 cannot be reproduced at ztensor level | High | Medium | Fallback T2.5: port the minimal zerfoo graph verbatim |
-| R2 | Fix masks the symptom, zerfoo still frozen | High | Low | Gate M4 on T4.6 bench-spark convergence |
-| R3 | Spark host contention delays DGX runs | Med | Low | Batch probes into single manifest submissions |
-| R4 | NCCL purego regressions in zerfoo's distributed path | Med | Low | T1.5 follow-up issue; keep the `nccl_cgo` legacy path behind `//go:build nccl_cgo` for rollback |
-
-## Operating Procedure
-
-- Definition of done: PR merged, CI green, DGX Spark verification captured in the PR description, release-please PR merged.
-- Always add tests alongside fixes.
-- Always run `gofmt`, `go vet`, `golangci-lint` after code changes.
-- No `ssh` benchmarks -- use `bench-spark.sh` / Spark manifests only.
-- Small logical commits; never mix files across subdirectories in a single commit.
-- Rebase-and-merge on GitHub (never squash, never merge commits).
+Resolve all open GitHub issues in `github.com/zerfoo/ztensor`.
+
+Status as of 2026-04-09:
+- **#78** NCCL purego migration -- CLOSED via PR #80 (merged `af8af73`).
+- **#79** GPU engine dst-output routing -- INVESTIGATION CLOSED ztensor-side.
+  ztensor primitives cleared; follow-up must happen in zerfoo. Branch
+  `fix/issue-79-matmul-accumulate-repro` retained as evidence.
+
+No open ztensor issues remain assigned to this plan. If new issues are
+filed, re-open this plan or start a fresh one.
+
+## Evidence for #79 closure
+
+Test file `compute/gpu_dst_roundtrip_test.go` on branch
+`fix/issue-79-matmul-accumulate-repro` ports the exact backward-pass op
+sequence from `zerfoo/timeseries/patchtst_gpu_train.go:1022-1031`:
+
+```
+Transpose(patches -> patchesT)
+Zero(dPEW)
+MatMul(patchesT, dX, dPEW)
+Add(gradW, dPEW, gradW)   # in-place accumulate
+gradW.Data()
+```
+
+Ran 7 variants on DGX GB10 via Spark pod `ztensor-issue79-repro-1775761950`:
+
+```
+TestGPUEngine_Add_DstRoundTrip_OutOfPlace        PASS
+TestGPUEngine_Add_DstRoundTrip_InPlace           PASS
+TestGPUEngine_Add_DstRoundTrip_RepeatedInPlace   PASS
+TestGPUEngine_Add_DstRoundTrip_NoExplicitSync    PASS
+TestGPUEngine_PatchTSTBackward_DstRoundTrip      PASS (4x3 / 4x2 tiny)
+TestGPUEngine_PatchTSTBackward_RealisticShapes   PASS (1600x8 / 1600x64, 20 iters)
+TestGPUEngine_PatchTSTBackward_LargerBatch       PASS (3200x8 / 3200x64, 20 iters)
+```
+
+None of the four hypotheses (alpha/beta/gamma/delta) from issue #79 is
+triggered by the patch-embedding backward sequence at production shapes
+and over many accumulation iterations. The `makeGPUResult` /
+`SetStorage` / `GPUStorage.Slice()` path correctly routes dst tensors.
+
+Findings were posted to issue #79 as two comments; devlog entry dated
+2026-04-09.
+
+## Remaining suspects (zerfoo-side)
+
+Future work, not tracked in this plan:
+1. The full `encoderBackward` op chain (dozens of ops per batch) -- not covered here.
+2. The CPU-loop `dPosData += dXData` interleave at `patchtst_gpu_train.go:1012-1019`,
+   which forces a mid-backward D2H on the same stream.
+3. zerfoo-side `gradTs` wrapper/arena state diverging from fresh `tensor.New` wrappers.
+
+Recommended next action: instrument `trainWindowedGPU` directly
+(log device pointers before/after each op on the first batch).
+
+## Infra notes captured during investigation
+
+- Spark silently drops `pod.spec.containers[0].command` when it is multi-element.
+  Use `args: ["bash", "-c", ...]` with no `command` field.
+- Spark silently truncates long `args[i]` strings. Put scripts on the host at
+  `/var/lib/zerfoo/bench-out/*.sh` and mount them.
+- Spark drops container stdout/stderr. Redirect to a host file with
+  `exec >...log 2>&1` inside the script.
+- ztensor's `-tags cuda` build tag is unmaintained. The kernels package
+  has only `//go:build !cuda` purego files. The default build is the GPU
+  path. Do not pass `-tags cuda`.
+- A prebuilt `/opt/zerfoo/lib/libkernels.so` exists on the DGX host and
+  must be mounted into any pod running ztensor GPU tests.
+
+Reference manifest: `docs/bench/manifests/issue-79-repro.yaml`.
+Reference script: `/var/lib/zerfoo/bench-out/issue79-run.sh` on the DGX host.
## Progress Log

-### 2026-04-09 -- Plan created
-Plan authored to resolve #78 and #79. #78 has PR #80 already open (CI green on DGX per a prior session). #79 diagnostic tests on `debug/gpu-dst-routing` did not reproduce; the plan adds a fallback path to port zerfoo's failing op sequence directly into a ztensor test. No ADRs created yet -- ADR-002 will be written when the root cause is classified.
-
-## Hand-off Notes
-
-- PR #80 branch: `chore/nccl-purego`. Reference pattern: `internal/cublas/cublas_purego.go`.
-- #79 instrumentation targets: `compute/gpu_kernels.go:121-132`, `tensor/gpu_storage.go:215-250`.
-- DGX Spark: `SPARK=http://192.168.86.250:8080`. Submit manifests, never ssh.
-- zerfoo reproducer: `scripts/bench-spark.sh -samples 1000 -channels 5 -epochs 3 -cleanup`. Frozen loss signature: `0.268357`.
-- Full investigation history: zerfoo `docs/devlog.md` 2026-04-08 "FINAL" entry.
-
-## Appendix
-
-Use cases:
-- **UC-79** PatchTST training on the GPU engine converges on DGX GB10 (currently BROKEN -- loss frozen at 0.268357).
-- **UC-78** `go build ./...` in ztensor compiles the NCCL path without `-tags cuda` (IN PROGRESS via PR #80).
+### 2026-04-09 -- Investigation closed
+- Merged PR #80, closed #78.
+- Wrote, expanded, and ran the 7-variant dst-routing test suite on DGX GB10.
+- All variants PASS at production shapes with 20-iteration accumulation.
+- Posted two update comments to issue #79 with findings.
+- Trimmed the plan: removed the E3/E4/E5 (diagnose/fix/release) tracks, since
+  the underlying premise -- reproducing #79 at the ztensor level -- is
+  disproven. Investigation continues in zerfoo, if at all.
