Add continuous depth-1 MTP speculation (DS4_MTP_CONTINUOUS) #371

Open
pandysp wants to merge 1 commit into antirez:main from pandysp:cont-depth1

@pandysp (Contributor) commented Jun 9, 2026

Continuous depth-1 MTP speculation, discussed in #369.

The shipped --mtp-draft 2 decodes the base token on its own and then batch-verifies the MTP draft; that standalone decode is a full shared-weight pass the verify could have carried. This folds them together: draft from the trunk hidden state the previous verify left in batch_cur_hc, and verify [first_token, draft] in one batched pass, removing the base decode.
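The cycle described above can be illustrated with a toy simulation. This is only a sketch under loud assumptions: `target_next` and `draft_next` are hypothetical stand-ins for the base model's argmax and the MTP head (they are not functions from this codebase), the "verify" is two calls to the toy model rather than one batched pass, and the toy verifier is exact, so it does not reproduce the near-greedy tie behavior of the real batched verifier.

```python
def target_next(prefix):
    """Toy stand-in for the base model's greedy argmax (hypothetical)."""
    return (31 * sum(prefix) + len(prefix)) % 7


def draft_next(prefix):
    """Toy stand-in for the MTP head: deliberately wrong on every third position."""
    t = target_next(prefix)
    return t if len(prefix) % 3 else (t + 1) % 7


def greedy(prompt, n):
    """Plain autoregressive decode: one model pass per committed token."""
    ctx, out = list(prompt), []
    for _ in range(n):
        tok = target_next(ctx)
        out.append(tok)
        ctx.append(tok)
    return out


def continuous_depth1(prompt, n):
    """Each cycle verifies [first_token, draft] together; no standalone base decode."""
    ctx, out = list(prompt), []
    first = target_next(ctx)  # next token, already known from the previous pass
    verify_passes = 0
    while len(out) < n:
        draft = draft_next(ctx + [first])
        # One (conceptually batched) verify pass yields the argmax after each position.
        verify_passes += 1
        after_first = target_next(ctx + [first])
        after_draft = target_next(ctx + [first, draft])
        out.append(first)
        ctx.append(first)
        if draft == after_first:
            # Draft accepted: commit it and carry the verifier's next token forward.
            out.append(draft)
            ctx.append(draft)
            first = after_draft
        else:
            # Draft rejected: only the verifier's own argmax is ever committed.
            first = after_first
    return out[:n], verify_passes


tokens, passes = continuous_depth1([1, 2, 3], 20)
print(tokens == greedy([1, 2, 3], 20), passes)
```

In this toy, every cycle costs one verify pass and commits one or two tokens, so the pass count per token approaches 0.5 as draft acceptance approaches 100%.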

Branch-measured, paired and interleaved on an M4 Max (q2-q4-imatrix base, Q4K/Q8 MTP head): continuous beats --mtp-draft 2 by +7% to +12% across copy / technical-prose / free-prose. On copy, with the base decode gone, the deterministic shared-read count is 0.50 per token versus 0.70. Against plain autoregressive decode the gain is content-dependent, from +20% on copy down to roughly flat on free prose, since the speculation benefit itself tracks draft acceptance. Full table and method in #369.

Same near-greedy class as the batched draft-2 verifier, not bit-exact to a strict decode: it only ever commits the verifier's argmax, byte-identical to plain decode on copy-heavy output and diverging only at genuine logit ties on prose. It defers to --quality and DS4_MTP_STRICT, which select the exact verifier. Depth-1 only; it does not revive deeper drafting (the head's step-2 acceptance drops off a cliff), it makes the depth-1 cycle cheaper.

Env-gated by DS4_MTP_CONTINUOUS, about 120 lines, nearly all one block in ds4_session_eval_speculative_argmax reusing the existing verify, draft, and prefix-1 helpers. It rides the existing speculative path, so it takes effect under greedy decode with --mtp-draft 2 or higher; the env var alone does nothing. The anchor is invalidated on sync, rewind, invalidate, payload restore, and plain eval, and draft misses log under DS4_MTP_SPEC_LOG. The test reuses the #358 verify-depth oracle (replay the committed stream, require every token within tie tolerance of the argmax):

make ds4_test
DS4_TEST_MODEL=<base.gguf> DS4_TEST_MTP=<mtp.gguf> ./ds4_test --cont-argmax-gap
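The oracle idea (replay the committed stream and require each token to be within a tie tolerance of the argmax) can be sketched as below. This is an illustration only, not the code in ds4_test: the function name, the logits layout, and the tolerance value are all assumptions made for the example.

```python
def tie_tolerance_violations(logits_per_step, committed, tol):
    """Return the step indices where the committed token trails the argmax by more than tol.

    logits_per_step: one list of logits per replayed position (hypothetical layout).
    committed: the token id that was actually committed at each position.
    """
    bad = []
    for step, (logits, tok) in enumerate(zip(logits_per_step, committed)):
        gap = max(logits) - logits[tok]
        if gap > tol:
            bad.append(step)
    return bad


# Made-up logits: step 1 is a near-tie, step 2 commits a clearly non-argmax token.
logits = [
    [1.0, 3.0, 2.0],
    [0.5, 0.5001, 0.1],
    [2.0, 0.0, 1.0],
]
print(tie_tolerance_violations(logits, [1, 0, 2], tol=1e-3))
```

A strict (bit-exact) check would correspond to `tol=0`, which is why a near-greedy verifier needs the tolerance while the exact verifier does not.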

Verified on Metal, single-session. The call sites are backend-shared, but I have not tested CUDA or the server.

@rinaldofesta

Third-party verification of #371 (+#381) on Apple M5 Max 128GB, macOS Darwin 25.5, Metal backend, DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf + DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, base 91bafb5.

Three build states: A = main 91bafb5, B = A + #381 (clamp), C = B + #371 (continuous). Each built clean (make clean && make, 0 warnings).

Correctness

| check | A | B | C |
|---|---|---|---|
| ./ds4_test --logprob-vectors | OK | OK | OK |
| DS4_TEST_MTP=... ./ds4_test --mtp-verify-depth (incl. #371's new continuous tests on C) | OK | OK | OK |

Committed-token identity (greedy, -n 256 --temp 0 --nothink, fixed prompt, sha256 of output)

| config | hash class | note |
|---|---|---|
| A no-MTP (pure greedy) | G | reference |
| A --mtp-draft 2 | M | main's margin-gated verifier is near-greedy by design (deviates from G at token ~20) |
| A --mtp-draft 2 + DS4_MTP_STRICT=1 | G | strict = bit-identical to greedy |
| B --mtp-draft 2 | M | identical to A: #381 changes nothing here, as claimed |
| C --mtp-draft 2 (env unset) | M | identical to A: #371 inert without DS4_MTP_CONTINUOUS, as claimed |
| C continuous --mtp-draft 1 | G | lossless on this prompt |
| C continuous --mtp-draft 2 | M′ | deterministic (re-run hash-identical); a different near-greedy sequence than M, same class |
| C continuous --mtp-draft 2 + DS4_MTP_STRICT=1 | G | strict correctly defers to the exact verifier |

Determinism: repeated runs of A-draft2 and C-continuous-draft2 produced byte-identical outputs.
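For anyone repeating the identity check, grouping configurations by the sha256 of their output can be sketched as follows. The helper name and the placeholder strings are assumptions for illustration; they are not the real model outputs or the script used for the table above.

```python
import hashlib
from collections import defaultdict


def hash_classes(outputs):
    """Group config names whose generated text is byte-identical (same sha256)."""
    groups = defaultdict(list)
    for config, text in outputs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(config)
    # Insertion order is preserved, so classes appear in first-seen order.
    return list(groups.values())


# Placeholder outputs only; substitute the captured stdout of each run.
outputs = {
    "A no-MTP": "text-g",
    "A --mtp-draft 2": "text-m",
    "A --mtp-draft 2 + DS4_MTP_STRICT=1": "text-g",
    "C continuous --mtp-draft 2": "text-m-prime",
}
for members in hash_classes(outputs):
    print(members)
```

Running the same config twice and feeding both outputs in under different keys gives a quick determinism check: they should land in the same group.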

Speed (median of 3 interleaved runs, min–max in parentheses, default --power 100, idle machine, 60s cooldowns)

-n 256 --temp 0 --nothink; short = one-line prompt, long = 20kB (~5k tokens) of promessi_sposi.txt.

| config | short gen t/s | long gen t/s | long prefill t/s |
|---|---|---|---|
| A no-MTP | 39.13 (39.08–39.18) | 31.73 (31.71–31.76) | 413.2 |
| A --mtp-draft 1 | 38.98 (38.97–38.99) | 31.70 (31.68–31.75) | 413.7 |
| A --mtp-draft 2 | 37.63 (37.63–37.64) | 32.94 (32.91–32.96) | 413.6 |
| B --mtp-draft 2 | 37.62 (37.59–37.64) | 32.94 (32.93–32.97) | 413.7 |
| C --mtp-draft 2 (env unset) | 37.60 (37.55–37.61) | 32.96 (32.93–32.98) | 413.8 |
| C continuous --mtp-draft 1 | 39.01 (38.90–39.05) | 31.72 (31.71–31.74) | 413.7 |
| C continuous --mtp-draft 2 | 39.41 (39.39–39.42) | 37.32 (37.31–37.32) | 413.8 |

On this hardware the continuous path is a clear win: long-prompt generation +13.3% over main's --mtp-draft 2 (37.32 vs 32.94) and +17.6% over no-MTP (vs 31.73). It also removes draft-2's short-prompt penalty: main's draft-2 is slower than no-MTP on short prompts (37.63 vs 39.13, −3.8%) while continuous draft-2 edges it out (+0.7%). Draft-1 in both flavors is within noise of no-MTP here — the acceptance gains roughly cancel the draft cost.
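The percentages quoted above follow directly from the table medians; a small sketch to recompute them (the helper name is just for this example):

```python
def pct_delta(new, base):
    """Relative change of new over base, in percent, rounded to one decimal."""
    return round((new / base - 1.0) * 100.0, 1)


# Median t/s values taken from the speed table.
print(pct_delta(37.32, 32.94))  # continuous draft-2 vs main draft-2, long prompt
print(pct_delta(37.32, 31.73))  # continuous draft-2 vs no-MTP, long prompt
print(pct_delta(37.63, 39.13))  # main draft-2 vs no-MTP, short prompt
print(pct_delta(39.41, 39.13))  # continuous draft-2 vs no-MTP, short prompt
```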

Long-prompt prefill is flat (~413.7 t/s) across all configs.

Non-MTP regression check: canonical ds4-bench sweep (2048→65536 step 2048), each state run twice in opposite order with cooldowns — no measurable delta (best-of-2 per frontier: gen mean −0.3%, prefill mean +1.6%, both inside same-state run-to-run variance). Worth noting for anyone comparing sweeps back-to-back: run order is a real confound — the same C binary measured 30.4 t/s gen at ctx 2048 when started immediately after a full A sweep, and 36.8 t/s when run first after a cooldown.

Notes

  • Timing-wise, B ≡ A and C-with-env-unset ≡ A within the min–max bands, consistent with #381 being a correctness guard and #371 being fully inert when disabled.
  • Since DS4_MTP_CONTINUOUS=1 stayed in the same output class (near-greedy, deterministic; strict mode bit-exact to greedy) and never regressed in our runs, the data may support making continuous the single fixed behavior rather than an env knob (with --quality/DS4_MTP_STRICT keeping the exact path) — leaving that call to the maintainer.

Commands used:

# correctness
./ds4_test --logprob-vectors
DS4_TEST_MTP=DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf ./ds4_test --mtp-verify-depth
# identity + speed (per config; env/flags as in the tables)
[DS4_MTP_CONTINUOUS=1] ./ds4 -p "<prompt>" -n 256 --temp 0 --nothink [--mtp MTP.gguf --mtp-draft N]
# bench
./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 --csv out.csv
