Add continuous depth-1 MTP speculation (DS4_MTP_CONTINUOUS) #371
Third-party verification of #371 (+#381) on Apple M5 Max 128GB, macOS Darwin 25.5, Metal backend. Three build states: A = main
Correctness
Committed-token identity (greedy,
| config | hash class | note |
|---|---|---|
| A no-MTP (pure greedy) | G | reference |
| A --mtp-draft 2 | M | main's margin-gated verifier is near-greedy by design (deviates from G at token ~20) |
| A --mtp-draft 2 + DS4_MTP_STRICT=1 | G | strict = bit-identical to greedy |
| B --mtp-draft 2 | M | identical to A; #381 changes nothing here, as claimed |
| C --mtp-draft 2 (env unset) | M | identical to A; #371 inert without DS4_MTP_CONTINUOUS, as claimed |
| C continuous --mtp-draft 1 | G | lossless on this prompt |
| C continuous --mtp-draft 2 | M′ | deterministic (re-run hash-identical); a different near-greedy sequence than M, same class |
| C continuous --mtp-draft 2 + DS4_MTP_STRICT=1 | G | strict correctly defers to the exact verifier |
Determinism: repeated runs of A-draft2 and C-continuous-draft2 produced byte-identical outputs.
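For anyone wanting to reproduce the hash-class column, here is a minimal sketch of the grouping step. It assumes the committed output of each config has been captured as bytes; the choice of SHA-256, the `hash_classes` helper, and the placeholder byte strings are illustrative assumptions, not what was actually run for the table above.

```python
import hashlib
from collections import defaultdict

def hash_classes(outputs):
    """Group config names whose committed output bytes hash identically."""
    groups = defaultdict(list)
    for name, data in outputs.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    # One list of config names per distinct digest, in first-seen order.
    return list(groups.values())

# Placeholder byte strings standing in for captured outputs (hypothetical).
outputs = {
    "A no-MTP": b"alpha beta gamma",
    "A draft2": b"alpha beta delta",
    "A draft2 strict": b"alpha beta gamma",
    "C continuous draft1": b"alpha beta gamma",
}
classes = hash_classes(outputs)
for members in classes:
    print(members)
```

Configs that land in the same list share a class; re-running one config and checking that it stays in its own list is the determinism check.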
Speed (median of 3 interleaved runs, min–max in parentheses, default --power 100, idle machine, 60s cooldowns)
-n 256 --temp 0 --nothink; short = one-line prompt, long = 20kB (~5k tokens) of promessi_sposi.txt.
| config | short gen t/s | long gen t/s | long prefill t/s |
|---|---|---|---|
| A no-MTP | 39.13 (39.08–39.18) | 31.73 (31.71–31.76) | 413.2 |
| A --mtp-draft 1 | 38.98 (38.97–38.99) | 31.70 (31.68–31.75) | 413.7 |
| A --mtp-draft 2 | 37.63 (37.63–37.64) | 32.94 (32.91–32.96) | 413.6 |
| B --mtp-draft 2 | 37.62 (37.59–37.64) | 32.94 (32.93–32.97) | 413.7 |
| C --mtp-draft 2 (env unset) | 37.60 (37.55–37.61) | 32.96 (32.93–32.98) | 413.8 |
| C continuous --mtp-draft 1 | 39.01 (38.90–39.05) | 31.72 (31.71–31.74) | 413.7 |
| C continuous --mtp-draft 2 | 39.41 (39.39–39.42) | 37.32 (37.31–37.32) | 413.8 |
On this hardware the continuous path is a clear win: long-prompt generation +13.3% over main's --mtp-draft 2 (37.32 vs 32.94) and +17.6% over no-MTP (vs 31.73). It also removes draft-2's short-prompt penalty: main's draft-2 is slower than no-MTP on short prompts (37.63 vs 39.13, −3.8%) while continuous draft-2 edges it out (+0.7%). Draft-1 in both flavors is within noise of no-MTP here — the acceptance gains roughly cancel the draft cost.
Long-prompt prefill is flat (~413.7 t/s) across all configs.
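The percentages quoted above follow directly from the table medians. A small sketch of the arithmetic, with the values copied from the speed table (the `pct_delta` helper name is just for illustration):

```python
def pct_delta(new, base):
    """Relative change of `new` over `base`, in percent."""
    return (new - base) / base * 100.0

# Median tokens/s copied from the speed table.
long_no_mtp, long_draft2, long_cont2 = 31.73, 32.94, 37.32
short_no_mtp, short_draft2, short_cont2 = 39.13, 37.63, 39.41

print(round(pct_delta(long_cont2, long_draft2), 1))     # 13.3
print(round(pct_delta(long_cont2, long_no_mtp), 1))     # 17.6
print(round(pct_delta(short_draft2, short_no_mtp), 1))  # -3.8
print(round(pct_delta(short_cont2, short_no_mtp), 1))   # 0.7
```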
Non-MTP regression check: canonical ds4-bench sweep (2048→65536 step 2048), each state run twice in opposite order with cooldowns — no measurable delta (best-of-2 per frontier: gen mean −0.3%, prefill mean +1.6%, both inside same-state run-to-run variance). Worth noting for anyone comparing sweeps back-to-back: run order is a real confound — the same C binary measured 30.4 t/s gen at ctx 2048 when started immediately after a full A sweep, and 36.8 t/s when run first after a cooldown.
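Given how large the run-order effect is, one way to keep it out of a comparison is to rotate which config starts each round and to sleep between runs. The sketch below is an assumption about how such a harness could look, not the script used here; `interleaved_schedule`, `run_schedule`, `summarize`, and the `run_one` callback are hypothetical names, and `run_one` would wrap whatever actually launches the binary and returns tokens/s.

```python
import statistics
import time

def interleaved_schedule(configs, rounds):
    """Rotate the starting config each round so no config always runs first."""
    order = []
    for r in range(rounds):
        k = r % len(configs)
        order.extend(configs[k:] + configs[:k])
    return order

def run_schedule(schedule, run_one, cooldown_s=60.0, sleep=time.sleep):
    """Run each scheduled config with a cooldown in between; collect samples per config."""
    samples = {}
    for i, name in enumerate(schedule):
        if i > 0:
            sleep(cooldown_s)  # let the machine return to idle before the next run
        samples.setdefault(name, []).append(run_one(name))
    return samples

def summarize(samples):
    """Median plus min and max, the form used in the speed table."""
    return statistics.median(samples), min(samples), max(samples)
```

For example, `interleaved_schedule(["A", "B", "C"], 3)` gives A B C, B C A, C A B, so each state runs first, second, and last exactly once.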
Notes
- Timing-wise, B ≡ A and C-with-env-unset ≡ A within the min–max bands, consistent with #381 (Clamp MTP draft depth to the prefill capacity) being a correctness guard and #371 (Add continuous depth-1 MTP speculation) being fully inert when disabled.
- Since DS4_MTP_CONTINUOUS=1 stayed in the same output class (near-greedy, deterministic; strict mode bit-exact to greedy) and never regressed in our runs, the data may support making continuous the single fixed behavior rather than an env knob (with --quality / DS4_MTP_STRICT keeping the exact path). That call is left to the maintainer.
Commands used:
# correctness
./ds4_test --logprob-vectors
DS4_TEST_MTP=DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf ./ds4_test --mtp-verify-depth
# identity + speed (per config; env/flags as in the tables)
[DS4_MTP_CONTINUOUS=1] ./ds4 -p "<prompt>" -n 256 --temp 0 --nothink [--mtp MTP.gguf --mtp-draft N]
# bench
./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt \
--ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 --csv out.csv
Continuous depth-1 MTP speculation, discussed in #369.
The shipped --mtp-draft 2 decodes the base token on its own and then batch-verifies the MTP draft; that standalone decode is a full shared-weight pass the verify could have carried. This folds them together: draft from the trunk hidden state the previous verify left in batch_cur_hc, and verify [first_token, draft] in one batched pass, removing the base decode.
Branch-measured, paired and interleaved on an M4 Max (q2-q4-imatrix base, Q4K/Q8 MTP head): continuous beats --mtp-draft 2 by +7% to +12% across copy / technical-prose / free-prose, deterministic 0.50 versus 0.70 shared reads per token on copy with the base decode gone. Against plain autoregressive decode the gain is content-dependent, +20% on copy down to roughly flat on free prose, since the speculation benefit itself tracks draft acceptance. Full table and method in #369.
Same near-greedy class as the batched draft-2 verifier, not bit-exact to a strict decode: it only ever commits the verifier's argmax, byte-identical to plain decode on copy-heavy output and diverging only at genuine logit ties on prose. It defers to --quality and DS4_MTP_STRICT, which select the exact verifier. Depth-1 only; it does not revive deeper drafting (the head's step-2 acceptance drops off a cliff), it makes the depth-1 cycle cheaper.
Env-gated by DS4_MTP_CONTINUOUS, about 120 lines, nearly all one block in ds4_session_eval_speculative_argmax reusing the existing verify, draft, and prefix-1 helpers. It rides the existing speculative path, so it takes effect under greedy decode with --mtp-draft 2 or higher; the env var alone does nothing. The anchor is invalidated on sync, rewind, invalidate, payload restore, and plain eval, and draft misses log under DS4_MTP_SPEC_LOG. The test reuses the #358 verify-depth oracle (replay the committed stream, require every token within tie tolerance of the argmax):
Verified on Metal, single-session. The call sites are backend-shared, but I have not tested CUDA or the server.
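To make the cycle described above concrete, here is a toy sketch of the continuous loop and of a replay oracle. It is not the code in ds4_session_eval_speculative_argmax: `target_logits`, `draft_next`, `verify`, and the 11-token vocabulary are hypothetical stand-ins, the toy verifier is an exact argmax with no margin gate (so the toy is lossless, whereas the real path is near-greedy), and KV rollback on a draft miss is not modeled.

```python
VOCAB = 11

def target_logits(seq):
    """Toy stand-in for the base model: logits for the token following seq."""
    want = (seq[-1] * 5 + 3) % VOCAB
    return [1.0 if t == want else 0.0 for t in range(VOCAB)]

def argmax(logits):
    return max(range(len(logits)), key=logits.__getitem__)

def draft_next(seq):
    """Toy stand-in for the MTP head: correct except when len(seq) is a multiple of 5."""
    guess = argmax(target_logits(seq))
    return (guess + 1) % VOCAB if len(seq) % 5 == 0 else guess

def verify(prefix, batch):
    """One batched base-model pass: the argmax after each token in batch."""
    preds, seq = [], list(prefix)
    for tok in batch:
        seq.append(tok)
        preds.append(argmax(target_logits(seq)))
    return preds

def continuous_decode(prompt, n):
    """Each pass verifies [first_token, draft]; there is no standalone base decode."""
    committed = list(prompt)
    first = argmax(target_logits(committed))  # produced by prefill
    out, passes = [], 0
    while len(out) < n:
        draft = draft_next(committed + [first])
        after_first, after_draft = verify(committed, [first, draft])
        passes += 1
        committed.append(first)
        out.append(first)
        if draft == after_first:   # draft accepted: commit it, reuse the second row
            committed.append(draft)
            out.append(draft)
            first = after_draft
        else:                      # draft missed: the verifier's argmax becomes first_token
            first = after_first
    return out[:n], passes

def greedy_decode(prompt, n):
    """Plain autoregressive reference: one pass per token."""
    seq, out = list(prompt), []
    for _ in range(n):
        tok = argmax(target_logits(seq))
        seq.append(tok)
        out.append(tok)
    return out

def within_tie_tolerance(prompt, stream, tol=1e-6):
    """Replay the committed stream; every token must be within tol of the top logit."""
    seq = list(prompt)
    for tok in stream:
        logits = target_logits(seq)
        if max(logits) - logits[tok] > tol:
            return False
        seq.append(tok)
    return True

out, passes = continuous_decode([1, 2, 3], 32)
print(out == greedy_decode([1, 2, 3], 32), passes, within_tie_tolerance([1, 2, 3], out))
```

In this toy, base-model passes per committed token fall between 0.5 (every draft accepted) and 1.0 (every draft missed), which is the same shape as the acceptance-dependent gain described above.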