Add continuous depth-1 MTP speculation (DS4_MTP_CONTINUOUS) by pandysp · Pull Request #371 · antirez/ds4

pandysp · 2026-06-09T17:36:15Z

Continuous depth-1 MTP speculation, discussed in #369.

The shipped --mtp-draft 2 decodes the base token on its own and then batch-verifies the MTP draft; that standalone decode is a full shared-weight pass the verify could have carried. This folds them together: draft from the trunk hidden state the previous verify left in batch_cur_hc, and verify [first_token, draft] in one batched pass, removing the base decode.
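The folded cycle can be illustrated with a toy loop. This is a minimal sketch in Python, not the project's C code: `target_next` and `draft_next` are hypothetical stand-ins for the base model's argmax and the MTP head, and the two `target_next` calls inside one iteration model a single batched pass over `[first_token, draft]`. The toy is exactly greedy because its "verifier" has no batched-numerics effects, which the real path may have.

```python
def greedy(target_next, prompt, n):
    """Plain autoregressive decode: one pass per committed token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def continuous_depth1(target_next, draft_next, prompt, n):
    """Toy continuous depth-1 speculation.

    Returns (tokens, verify_passes). The single seeding decode that
    produces the very first token is not counted in verify_passes.
    """
    first = target_next(list(prompt))  # seed for the first cycle
    produced = []
    verify_passes = 0
    while len(produced) < n:
        ctx = list(prompt) + produced + [first]
        draft = draft_next(ctx)  # guess for the token after `first`
        # One batched verify pass over [first, draft]:
        after_first = target_next(ctx)            # argmax following `first`
        after_draft = target_next(ctx + [draft])  # argmax following `draft`
        verify_passes += 1
        produced.append(first)
        if draft == after_first:
            # Draft accepted: two tokens committed by this pass.
            produced.append(draft)
            first = after_draft
        else:
            # Draft rejected: only the verifier's argmax carries over.
            first = after_first
    return produced[:n], verify_passes


if __name__ == "__main__":
    def model(ctx):
        return (7 * ctx[-1] + len(ctx)) % 5

    tokens, passes = continuous_depth1(model, model, [1], 8)
    print(tokens == greedy(model, [1], 8), passes / len(tokens))
```

With a draft that is always accepted, each verify pass commits two tokens, so the toy needs 0.5 verify passes per token; with a draft that is never accepted it degrades to one pass per token.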

Branch-measured, paired and interleaved on an M4 Max (q2-q4-imatrix base, Q4K/Q8 MTP head): continuous beats --mtp-draft 2 by +7% to +12% across copy, technical prose, and free prose, with a deterministic 0.50 versus 0.70 shared reads per token on copy now that the base decode is gone. Against plain autoregressive decode the gain is content-dependent, from +20% on copy down to roughly flat on free prose, since the speculation benefit itself tracks draft acceptance. Full table and method in #369.

Same near-greedy class as the batched draft-2 verifier, not bit-exact to a strict decode: it only ever commits the verifier's argmax, byte-identical to plain decode on copy-heavy output and diverging only at genuine logit ties on prose. It defers to --quality and DS4_MTP_STRICT, which select the exact verifier. Depth-1 only; it does not revive deeper drafting (the head's step-2 acceptance drops off a cliff), it makes the depth-1 cycle cheaper.

Env-gated by DS4_MTP_CONTINUOUS, about 120 lines, nearly all one block in ds4_session_eval_speculative_argmax reusing the existing verify, draft, and prefix-1 helpers. It rides the existing speculative path, so it takes effect under greedy decode with --mtp-draft 2 or higher; the env var alone does nothing. The anchor is invalidated on sync, rewind, invalidate, payload restore, and plain eval, and draft misses log under DS4_MTP_SPEC_LOG. The test reuses the #358 verify-depth oracle (replay the committed stream, require every token within tie tolerance of the argmax):

make ds4_test
DS4_TEST_MODEL=<base.gguf> DS4_TEST_MTP=<mtp.gguf> ./ds4_test --cont-argmax-gap
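The oracle idea (replay the committed stream and require every token to sit within a tie tolerance of the argmax) can be sketched as below. This is an assumption-laden illustration in Python, not the `ds4_test` implementation: the function name `first_violation`, the list-of-lists logit layout, and the tolerance value are all hypothetical, and the numbers are made up for the example.

```python
def first_violation(logits_per_step, committed, tol):
    """Return the index of the first committed token whose logit trails
    that step's best logit by more than `tol`, or -1 if all tokens pass."""
    for i, (logits, tok) in enumerate(zip(logits_per_step, committed)):
        if max(logits) - logits[tok] > tol:
            return i
    return -1


if __name__ == "__main__":
    # Illustrative logits only, not taken from a real run.
    steps = [[0.1, 2.0, 1.5], [3.0, 2.95, 0.0], [1.0, 4.0, 2.0]]
    # Step 1 commits a near-tie (3.0 vs 2.95), which the tolerance allows.
    print(first_violation(steps, [1, 1, 1], tol=0.1))
    # Step 2 commits a token far from the argmax, which is flagged.
    print(first_violation(steps, [1, 1, 2], tol=0.1))
```

A check of this shape accepts a stream that differs from strict greedy only at near-ties, which is the "near-greedy" class the description refers to.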

Verified on Metal, single-session. The call sites are backend-shared, but I have not tested CUDA or the server.

rinaldofesta · 2026-06-10T20:39:03Z

Third-party verification of #371 (+#381) on Apple M5 Max 128GB, macOS Darwin 25.5, Metal backend, DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf + DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, base 91bafb5.

Three build states: A = main 91bafb5, B = A + #381 (clamp), C = B + #371 (continuous). Each built clean (make clean && make, 0 warnings).

Correctness

| check | A | B | C |
| --- | --- | --- | --- |
| `./ds4_test --logprob-vectors` | OK | OK | OK |
| `DS4_TEST_MTP=... ./ds4_test --mtp-verify-depth` (incl. #371's new continuous tests on C) | OK | OK | OK |

Committed-token identity (greedy, `-n 256 --temp 0 --nothink`, fixed prompt, sha256 of output)

| config | hash class | note |
| --- | --- | --- |
| A no-MTP (pure greedy) | G | reference |
| A `--mtp-draft 2` | M | main's margin-gated verifier is near-greedy by design (deviates from G at token ~20) |
| A `--mtp-draft 2` + `DS4_MTP_STRICT=1` | G | strict = bit-identical to greedy |
| B `--mtp-draft 2` | M | identical to A; #381 changes nothing here, as claimed |
| C `--mtp-draft 2` (env unset) | M | identical to A; #371 inert without `DS4_MTP_CONTINUOUS`, as claimed |
| C continuous `--mtp-draft 1` | G | lossless on this prompt |
| C continuous `--mtp-draft 2` | M′ | deterministic (re-run hash-identical); a different near-greedy sequence than M, same class |
| C continuous `--mtp-draft 2` + `DS4_MTP_STRICT=1` | G | strict correctly defers to the exact verifier |

Determinism: repeated runs of A-draft2 and C-continuous-draft2 produced byte-identical outputs.
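The hash-class comparison above amounts to grouping configurations by the sha256 of their output. A minimal sketch of that bookkeeping, assuming the outputs have already been captured as bytes; the helper name `hash_classes` and the placeholder outputs are hypothetical and are not the real generations:

```python
import hashlib
from collections import defaultdict


def hash_classes(outputs):
    """Group config names by the sha256 digest of their output bytes.

    Returns a list of name lists, in first-seen order of each digest.
    """
    groups = defaultdict(list)
    for name, data in outputs.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return list(groups.values())


if __name__ == "__main__":
    # Placeholder bytes standing in for captured model output.
    outputs = {
        "A no-MTP": b"text-one",
        "A draft2": b"text-two",
        "A draft2 strict": b"text-one",
    }
    for names in hash_classes(outputs):
        print(names)
```

Configurations that land in the same list produced byte-identical output; running the same configuration twice and finding both runs in one list is the determinism check.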

Speed (median of 3 interleaved runs, min–max in parentheses, default `--power 100`, idle machine, 60s cooldowns)

-n 256 --temp 0 --nothink; short = one-line prompt, long = 20kB (~5k tokens) of promessi_sposi.txt.

| config | short gen t/s | long gen t/s | long prefill t/s |
| --- | --- | --- | --- |
| A no-MTP | 39.13 (39.08–39.18) | 31.73 (31.71–31.76) | 413.2 |
| A `--mtp-draft 1` | 38.98 (38.97–38.99) | 31.70 (31.68–31.75) | 413.7 |
| A `--mtp-draft 2` | 37.63 (37.63–37.64) | 32.94 (32.91–32.96) | 413.6 |
| B `--mtp-draft 2` | 37.62 (37.59–37.64) | 32.94 (32.93–32.97) | 413.7 |
| C `--mtp-draft 2` (env unset) | 37.60 (37.55–37.61) | 32.96 (32.93–32.98) | 413.8 |
| C continuous `--mtp-draft 1` | 39.01 (38.90–39.05) | 31.72 (31.71–31.74) | 413.7 |
| C continuous `--mtp-draft 2` | 39.41 (39.39–39.42) | 37.32 (37.31–37.32) | 413.8 |

On this hardware the continuous path is a clear win: long-prompt generation is +13.3% over main's --mtp-draft 2 (37.32 vs 32.94) and +17.6% over no-MTP (37.32 vs 31.73). It also removes draft-2's short-prompt penalty: main's draft-2 is slower than no-MTP on short prompts (37.63 vs 39.13, −3.8%), while continuous draft-2 edges no-MTP out (39.41 vs 39.13, +0.7%). Draft-1 in both flavors is within noise of no-MTP here; the acceptance gains roughly cancel the draft cost.
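The percentages follow directly from the median tokens-per-second figures in the speed table. A short sketch of the arithmetic, using only numbers from that table (the helper name `pct_gain` is just for illustration):

```python
def pct_gain(new, base):
    """Relative change of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


if __name__ == "__main__":
    comparisons = [
        ("long: continuous draft-2 vs main draft-2", 37.32, 32.94),
        ("long: continuous draft-2 vs no-MTP", 37.32, 31.73),
        ("short: main draft-2 vs no-MTP", 37.63, 39.13),
        ("short: continuous draft-2 vs no-MTP", 39.41, 39.13),
    ]
    for label, new, base in comparisons:
        print(f"{label}: {pct_gain(new, base):+.1f}%")
```

Rounded to one decimal place, the four comparisons come out to +13.3%, +17.6%, -3.8%, and +0.7%.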

Long-prompt prefill is flat (~413.7 t/s) across all configs.

Non-MTP regression check: canonical ds4-bench sweep (2048→65536 step 2048), each state run twice in opposite order with cooldowns — no measurable delta (best-of-2 per frontier: gen mean −0.3%, prefill mean +1.6%, both inside same-state run-to-run variance). Worth noting for anyone comparing sweeps back-to-back: run order is a real confound — the same C binary measured 30.4 t/s gen at ctx 2048 when started immediately after a full A sweep, and 36.8 t/s when run first after a cooldown.

Notes

Timing-wise, B ≡ A and C-with-env-unset ≡ A within the min–max bands, consistent with #381 being a correctness guard and #371 being fully inert when disabled.
Since DS4_MTP_CONTINUOUS=1 stayed in the same output class (near-greedy, deterministic; strict mode bit-exact to greedy) and never regressed in our runs, the data may support making continuous the single fixed behavior rather than an env knob (with --quality/DS4_MTP_STRICT keeping the exact path) — leaving that call to the maintainer.

Commands used:

# correctness
./ds4_test --logprob-vectors
DS4_TEST_MTP=DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf ./ds4_test --mtp-verify-depth
# identity + speed (per config; env/flags as in the tables)
[DS4_MTP_CONTINUOUS=1] ./ds4 -p "<prompt>" -n 256 --temp 0 --nothink [--mtp MTP.gguf --mtp-draft N]
# bench
./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 --csv out.csv

Add continuous depth-1 speculation (DS4_MTP_CONTINUOUS)

205f146

pandysp force-pushed the cont-depth1 branch from 49c7f64 to 205f146 on June 10, 2026 11:46

This was referenced Jun 10, 2026

Continuous depth-1 MTP speculation: +7-12% over --mtp-draft 2 #369

Open

Clamp MTP draft depth to the prefill capacity #381

Open

pandysp wants to merge 1 commit into antirez:main from pandysp:cont-depth1
