Skip to content

spec-dec: support device-aware recurrent GPU tape placement on ROCm multi-GPU#32

Open
nycdubliner wants to merge 4 commits into
Anbeeld:mainfrom
nycdubliner:rocm-multi-gpu-tape
Open

spec-dec: support device-aware recurrent GPU tape placement on ROCm multi-GPU#32
nycdubliner wants to merge 4 commits into
Anbeeld:mainfrom
nycdubliner:rocm-multi-gpu-tape

Conversation

@nycdubliner
Copy link
Copy Markdown

@nycdubliner nycdubliner commented May 23, 2026

Goal / Overview

This PR implements Phase 2 of ROCm Multi-GPU DFlash support: enabling GPU conv state replay (tape_replay_conv_gpu), GPU hidden-state capture, and direct GPU-to-GPU peer copying into the speculative verification ring buffer. It also addresses all reviewer feedback from the Phase 2 evaluation.


Key Implementations

  1. GPU Conv State Replay & Hidden-State Capture:

    • Replaced multi-device checks to enable tape_replay_conv_gpu on multi-GPU setups.
    • Allocated hidden state contexts natively on their assigned physical layer GPU.
    • Implemented direct device-to-device memory copies (cudaMemcpyPeerAsync) to write hidden states directly into the speculative verification ring buffer.
    • Achieved cpu_copy=0.000 ms overhead in GPU ring writes.
  2. PR Review Feedback Addressal:

    • Dynamic P2P Gating: Added lazy dynamic peer access auto-configuration (via cudaDeviceCanAccessPeer and cudaDeviceEnablePeerAccess) inside dflash_cross_ring_gpu_write_d2d on the first cross-device copy. This guarantees out-of-the-box D2D transfer support without requiring users to manually configure environment variables (like GGML_CUDA_P2P=1).
    • Dead Code Cleanup: Deleted the obsolete struct dflash_hidden_gpu_layer definition and removed the legacy single buf/ctx fields from dflash_hidden_gpu.
    • Public Gating API: Consolidated multi-GPU tape gating checks by introducing the public library function llama_dflash_allow_multi_gpu_tape() in llama.h and resolving duplications in llama-context.cpp and common/speculative.cpp.
    • Max Devices Bounds: Substituted magic array size boundaries of 32 with GGML_CUDA_MAX_DEVICES (incorporating safe fallback macros).
    • Documentation: Added documentation validating that the GPU replay path is attempted unconditionally because capability gating checks are handled internally in tape_replay_conv_gpu.

Verification & Performance Matrix (Dual AMD Radeon RX 7800 XT)

All benchmarks were executed at temperature=0.0, ctx=8192, ubatch=128, batch=512, using cache types q5_0 (K) and q4_1 (V) under a layer split strategy.

  • Vanilla Baseline: Speculative decoding disabled.
    • task_store_module: 18.23 tok/s
    • kv_report_module: 18.35 tok/s
    • doubly_linked_list: 19.06 tok/s
  • Priority 1 DFlash Patched (GPU Tape & Ring Enabled, -fit off):
    • task_store_module: 20.75 tok/s (1.14x speedup, 25.45% acceptance)
    • kv_report_module: 31.27 tok/s (1.70x speedup, 40.73% acceptance)
    • doubly_linked_list: 23.16 tok/s (1.22x speedup, 24.61% acceptance)
  • Priority 2 (Best Parameter Sweep):
    • task_store_module: 22.61 tok/s (1.24x speedup, 29.77% acceptance with cross_ctx=1024)
    • kv_report_module: 31.80 tok/s (1.73x speedup, 34.11% acceptance with n_max=24)
    • doubly_linked_list: 24.21 tok/s (1.27x speedup, 20.90% acceptance with n_max=32)
  • Priority 3 (Flash Attention -fit on Stability Test):
    • Previously crashed on multi-GPU targets. With our layer-aware allocator patches, -fit on is 100% stable and yields:
    • kv_report_module: 31.86 tok/s (1.74x speedup, 40.73% acceptance)
  • Pinned-Drafter Architecture PoC:
    • Pinned the 1GB drafter to GPU0 via -otd '.*=ROCm0' while leaving the verifier split -ts 1,1.
    • kv_report_module: 31.66 tok/s (1.73x speedup). This demonstrates that split-drafter placement peer copy overhead is negligible, but pinning still saves minor latency.

How to Reproduce / Benchmark Commands

Run the commands below from the parent directory:

# Run server with dynamic P2P and -fit on enabled:
GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=1 GGML_DFLASH_GPU_RING=1 ./build/bin/llama-server \
  -hf unsloth/Qwen3.6-27B-GGUF:Q5_K_S --no-mmproj \
  --spec-draft-hf Anbeeld/Qwen3.6-27B-DFlash-GGUF:Q4_K_M \
  --spec-type dflash --spec-branch-budget 0 --spec-dflash-cross-ctx 512 \
  --spec-draft-ctx-size 2048 --spec-draft-n-max 12 --no-spec-dm-adaptive \
  -ngl all --spec-draft-ngl all -sm layer -ts 1,1 \
  -c 8192 -np 1 --kv-unified -b 512 -ub 128 \
  --cache-type-k q5_0 --cache-type-v q4_1 --flash-attn on -fit on \
  --cache-ram 0 -rea off --host 0.0.0.0 --port 8082

Disclosure: This Pull Request was developed with the assistance of AI coding assistants (Antigravity).

…ulti-GPU

For multi-GPU ROCm/HIP setups, allocating all spec-dec tape buffers on a single
device (e.g. ROCm0) causes execution failures or severe bottlenecks when recurrent
layers are split across devices (e.g. ROCm1 cannot access ROCm0 source views).

This commit enables device-aware recurrent GPU tape placement:
- Tape buffers are allocated per recurrent layer on the exact physical GPU
  device assigned to compile/run that layer.
- During DFlash tape capture and direct GPU replay, state memory is verified
  to reside on the local physical GPU before executing.
- Non-local device memory accesses and associated ROCm IPC/peer faults are avoided.
- A fallback path to CPU tape capture and replay is preserved when
  `GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0` is set or on single-GPU setups.
- Retained optimized scheduler callback-mode synchronization checks to minimize
  overhead when Hidden/GDN callback evaluation is enabled.
@github-actions github-actions Bot added the ggml label May 23, 2026
Copy link
Copy Markdown
Author

@nycdubliner nycdubliner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expert PR Review — Claude Opus 4.6 (Thinking)

Model under test: Qwen 3.6 27B (Q5_K_S target / Q4_K_M DFlash drafter)
Hardware: 2× RX 7800 XT, ROCm/HIP gfx1101


Overall Assessment

This is a well-structured PR that addresses a real performance bottleneck. The core idea — per-layer device-aware tape buffer allocation using model.dev_layer(il) — is the correct architectural approach. The diff is 222+/126- across 5 files, touching the right subsystems. The ROCm backend-name awareness ("CUDA" || "ROCm") is applied consistently. Good first contribution.

Verdict: Approve with requested changes (mostly documentation and one correctness concern).


What works well

  1. allocate_tape_gpu refactor — Per-layer ggml context + buffer allocation keyed to model.dev_layer(il) via backend_for_dev(). This is the right fix. Each tape layer's buf, ctx, and dev are now owned at layer granularity with correct cleanup in the destructor.

  2. dflash_gpu_backend_reg() helper — Centralizes the CUDA-then-ROCm registry lookup, eliminating 5 separate ggml_backend_reg_by_name("CUDA") callsites. Clean dedup.

  3. dflash_is_cuda_compatible_tensor() helper — Replaces 3 local is_cuda_tensor lambdas with a shared function that checks both "CUDA" and "ROCm" buffer names. Good.

  4. build_recurrent_copy_plan / copy_cell in llama-memory-recurrent.cpp — Removing the single copy_plan_device and tracking per-entry device with per-device set_device/sync_device calls. This is the correct multi-GPU D2D pattern.

  5. tape_replay_gdn_direct_gpu device validation — Now checks that ALL inputs (state, k, v, gate, beta) are on the same device before launching, and correctly sets replay_device = -2 for heterogeneous launches with per-ptr sync. Solid.

  6. dflash_memory_seq_cp_recurrent_ordered — Removed the model.n_devices() > 1 early-return gate and replaced it with per-backend sync across all GPU devices. Correct.

  7. ggml scheduler callback gate — Moving ggml_backend_synchronize(split_backend) inside the if (need) block is a targeted optimization that avoids unnecessary CPU-GPU sync when no callback is registered for a tensor. This is safe because the need flag already tracks callback presence.


Requested Changes

1. allocate_hidden_gpu still has an ungated n_devices() > 1 early return

// llama-context.cpp L1969-1973 (unmodified in this diff)
if (model.n_devices() > 1) {
    dflash_capture->hidden_gpu.clear();
    ...
    return;
}

This means multi-GPU tape is enabled but hidden GPU capture always falls back to eval callbacks. The PR body says "hidden capture stays on eval callback" — so this is intentional. But it's not documented in the code. Please add a comment at L1969 explaining why hidden GPU capture is not yet enabled for multi-GPU (e.g., "// Hidden GPU capture requires same-device graph output tensors; not yet supported for multi-GPU layer splits").

2. allocate_prefill_gpu still has an ungated n_devices() > 1 early return

// llama-context.cpp L2061-2063 (unmodified in this diff)
if (model.n_devices() > 1) {
    return false;
}

Same situation — this is probably intentional but should have a comment noting it's not yet multi-GPU aware.

3. Env var GGML_DFLASH_ALLOW_MULTI_GPU_TAPE behavior

The implementation at L1138:

return env && env[0] != '\0' && std::strcmp(env, "0") != 0;

This means =1, =yes, =banana all enable it. This matches the pattern used by GGML_DFLASH_GPU_RING, so it's consistent — good. But the env var is undocumented. Please add at minimum:

  • A one-line comment at the dflash_allow_multi_gpu_tape() function explaining when/why to use it
  • A note in the PR body about the interaction: GPU Ring (GGML_DFLASH_GPU_RING) controls the cross-attention ring path; this env var controls the recurrent tape path. They are independent.

4. set_dflash_gpu_capture forced-on behavior

dflash_capture->gpu_capture_enabled =
    enabled || (model.n_devices() > 1 && dflash_allow_multi_gpu_tape());

This means if the server calls set_dflash_gpu_capture(false) but the env var is set, GPU capture stays enabled on multi-GPU. This is probably correct for the tape path, but it means an explicit disable request is silently overridden. Consider at least a LLAMA_LOG_INFO when this override fires, so users don't get confused if they try to force CPU-only.

5. std::map include

Adding #include <map> for the layers_by_dev diagnostic log is fine, but the project style prefers avoiding "fancy-looking modern STL constructs" (CONTRIBUTING.md). A simple loop counting devices would avoid the include. Not blocking — just flagging for awareness.

6. touched_devices vector in copy_cell hot path

std::vector<int> touched_devices;  // allocated every copy_cell call

copy_cell is called per-cell during recurrent state management. Allocating a std::vector on every call adds heap pressure. Consider a small fixed-size array (max 8 devices) or moving touched_devices into the class as a reusable scratch buffer, similar to how copy_plan_entries is already a member.

7. AI disclosure

CONTRIBUTING.md requires: "Disclose that AI was used in your PR description." The PR body doesn't currently mention AI assistance. If AI tools were used during development, please add a disclosure line.


Minor / Nit

  • L1408 indentation change: The dflash_eval_callback block has a pure re-indent (4 spaces less). This makes the diff noisy but doesn't change logic. Consider splitting whitespace-only changes into a separate commit for cleaner review history.
  • Unused variable removal: n_embd_r was removed from tape_replay — correct, it was unused after the refactor. Good cleanup.
  • Double fn_prepare call in tape_replay_gdn_direct_from_cpu_tape at L2930-2931 still exists (called once in the validation loop, then again in the launch loop). This was pre-existing, not introduced by this PR, but worth noting.

Questions for the author

  1. Have you tested with GGML_DFLASH_ALLOW_MULTI_GPU_TAPE unset (not =0, literally absent from environment) to confirm the existing behavior is byte-identical?
  2. Is there a reason tape_replay_conv_gpu at L2959 still has the hard model.n_devices() > 1 → return false gate? This means conv state replay always falls back to CPU on multi-GPU even with the env var set. Is conv replay not needed for the measured speedup, or is this a follow-up?
  3. The validation matrix in the PR body only shows kv_report_module. The continuation-state doc shows 3 prompts with varying acceptance (27%–41%). Consider including all 3 in the PR body for reviewer confidence.

Summary

Area Status
Core tape placement correctness ✅ Sound — per-layer device-aware allocation
ROCm compatibility ✅ Consistent CUDA/ROCm buffer name + registry handling
Multi-GPU D2D copy plan ✅ Per-device set/sync instead of single-device assumption
Replay device validation ✅ All inputs verified same-device before launch
Fallback preservation ✅ CPU path intact when env var disabled/unset
Documentation ⚠️ Env var undocumented; ungated paths need comments
Performance hot path ⚠️ touched_devices vector allocation in copy_cell
AI disclosure ⚠️ Missing per CONTRIBUTING.md
Scheduler callback gate ✅ Safe optimization

- Documented n_devices() > 1 multi-GPU limitations for hidden and prefill GPU allocations.
- Documented role of GGML_DFLASH_ALLOW_MULTI_GPU_TAPE environment variable.
- Added warning logging when explicitly disabling GPU capture is overridden by the env var.
- Avoid std::map dependency in allocate_tape_gpu by using a vector of structures to count device occurrences.
- Move touched_devices vector out of copy_cell hot-path stack to class member to avoid heap pressure.
@nycdubliner
Copy link
Copy Markdown
Author

✅ Review Complete — Ready for Maintainer Review

All review items from the initial review have been addressed in commit 4b208f7:

  • allocate_hidden_gpu / allocate_prefill_gpu multi-GPU gates documented
  • ✅ Env var documented in PR body with interaction table
  • ✅ Override warning log added to set_dflash_gpu_capture
  • std::map replaced with simple vector struct
  • touched_devices moved to class member (copy_plan_touched_devices)
  • ✅ AI disclosure added
  • tape_replay_conv_gpu multi-GPU gate clarified (follow-up item, not a blocker — CPU path handles it)

Additional validation noted: Determinism test, VRAM leak profiling (15 requests), multi-turn session stability.

Recommendation: This PR is ready to come out of draft and be sent to the maintainer for final review.

Review model: Claude Opus 4.6 (Thinking)

@nycdubliner nycdubliner marked this pull request as ready for review May 23, 2026 14:19
@Anbeeld
Copy link
Copy Markdown
Owner

Anbeeld commented May 23, 2026

Thank you. I'm currently investigating rebasing to latest llama.cpp, so it might take some time before I review it.

- Auto-enable peer access dynamically in dflash_cross_ring_gpu_write_d2d to avoid manual setup.
- Clean up unused struct dflash_hidden_gpu_layer and dead buf/ctx fields in dflash_hidden_gpu.
- Resolve speculative.cpp duplication by introducing a public llama_dflash_allow_multi_gpu_tape() API.
- Replace magic array sizes of 32 with GGML_CUDA_MAX_DEVICES for touched devices tracking.
- Document unconditional call to tape_replay_conv_gpu as internally gated.
- Update test-dflash-plumbing.cpp regex assertions for renamed gating function.
@nycdubliner
Copy link
Copy Markdown
Author

Phase 2 Follow-up Review — 7fd860667 (branch rocm-multi-gpu-conv-replay)

This is a pre-review of the Phase 2 work (GPU conv replay + hidden-state capture + GPU ring on multi-GPU) ahead of a formal PR once #32 merges.

All five blocking/important issues from the previous review have been addressed. Full item-by-item:

✅ Dynamic P2P gatingcudaDeviceCanAccessPeer + cudaDeviceEnablePeerAccess is now called lazily inside dflash_cross_ring_gpu_write_d2d on first cross-device write, gated by a static bool peer_enabled[GGML_CUDA_MAX_DEVICES][GGML_CUDA_MAX_DEVICES]. Both directions are enabled; cudaGetLastError() correctly absorbs cudaErrorPeerAccessAlreadyEnabled if called twice. GGML_CUDA_P2P is no longer a prerequisite. ✓

✅ Dead struct fieldsdflash_hidden_gpu_layer removed. buf/ctx fields removed from dflash_hidden_gpu. Destructor cleaned up. ✓

✅ API deduplicationllama_dflash_allow_multi_gpu_tape() is now a single public function declared in llama.h, implemented in llama-context.cpp. Duplicate in common/speculative.cpp removed. ✓

GGML_CUDA_MAX_DEVICES — all three [32] magic values replaced. #ifndef guard in cross-ring-interleave.cu is appropriate for standalone compilation. ✓

✅ Self-gating comment — comment added at tape_replay_conv call site explaining unconditional GPU attempt. ✓


One nit (not blocking): peer_enabled is a static local with a non-atomic read-modify-write. Concurrent first calls for the same device pair could both invoke cudaDeviceEnablePeerAccess — this is safe because the duplicate call returns cudaErrorPeerAccessAlreadyEnabled which is already cleared by cudaGetLastError(). Worth a one-line comment for future readers.


Benchmark validation noted: 31.32 tok/s (1.71×) with cpu_copy=0.000 ms — meaningful improvement over PR #32's 28.53 tok/s (1.55×), consistent with removing hidden capture callback sync overhead.

Verdict: Ready to open as a PR stacked on #32. The nit can go in the PR description rather than a code change.

Review model: Claude Sonnet 4.6 (Thinking)

@nycdubliner
Copy link
Copy Markdown
Author

@Anbeeld Thanks for your work. :)

Just in case it's useful, I've a follow up PR clearing out the rest of the Multi-GPU ROCm stuff I found here:
nycdubliner#1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants