Commit 0f90427
feat(phi3): fused QKV → Q4 split path (opt-in via TQ_PHI3_SPLIT=1)
Adds support for splitting Phi-3's fused gguf_w_qkv (and fused
gguf_w_up_gate) into separate wq/wk/wv / w_gate/w_up FP32 tensors
during load-time Q4 conversion. With this path active, Phi-3.5 Q4_K_M
can use the batched prefill fast path — measured 4.6× end-to-end
speedup (149s → 32s) on a ~250-token prompt.
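
As an illustration of the split step, here is a minimal sketch. It assumes the fused tensors are stacked row-wise as [Q; K; V] and [gate; up]; this commit does not show the layout, and the helper names and the list-of-rows representation are hypothetical, not taken from the loader.

```python
def split_fused_qkv(w_qkv, n_heads, n_kv_heads, head_dim):
    """Split a fused QKV weight (list of rows) into wq, wk, wv.

    Assumes rows are stacked as [Q; K; V]; this ordering is an assumption.
    """
    q_rows = n_heads * head_dim
    kv_rows = n_kv_heads * head_dim
    if len(w_qkv) != q_rows + 2 * kv_rows:
        raise ValueError("fused QKV row count does not match head layout")
    wq = w_qkv[:q_rows]
    wk = w_qkv[q_rows:q_rows + kv_rows]
    wv = w_qkv[q_rows + kv_rows:]
    return wq, wk, wv


def split_fused_gate_up(w_gate_up):
    """Split a fused gate/up weight into w_gate, w_up.

    Assumes the gate rows come first and both halves have equal size.
    """
    if len(w_gate_up) % 2 != 0:
        raise ValueError("fused gate/up tensor must have an even row count")
    half = len(w_gate_up) // 2
    return w_gate_up[:half], w_gate_up[half:]


# Toy tensor: 16 rows of width 2, where row r is filled with the value r.
fused = [[float(r)] * 2 for r in range(16)]
wq, wk, wv = split_fused_qkv(fused, n_heads=2, n_kv_heads=1, head_dim=4)
w_gate, w_up = split_fused_gate_up(fused)
```

Once the tensors are separate, each one can be converted on its own, which is what lets the batched prefill path treat Phi-3 like models that ship unfused weights.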
However, the conversion introduces a measurable quality regression
on arithmetic/exact tasks: Phi-3.5 Q4_K_M's answer to "2+2=" changed
from "4" to "3" after conversion. The internal Q4 format stores only
one scale per 32 weights, whereas Q4_K has 6-bit sub-block scales
(strictly more precision).
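
To make the precision loss concrete, here is a toy sketch of requantization error. It rests on two assumptions that this commit does not confirm: that the internal Q4 format is symmetric (one scale per 32 weights, no offset), and that the Q4_K side can be modelled as a plain scale-plus-minimum grid. Real Q4_K additionally packs its sub-block scales and minimums into 6 bits each inside 256-weight super-blocks, which is left out here.

```python
def quant_asym4(block):
    """4-bit scale-plus-minimum round trip (simplified Q4_K-like sub-block)."""
    lo, hi = min(block), max(block)
    scale = (hi - lo) / 15 or 1.0  # 16 levels across [lo, hi]
    codes = [min(15, max(0, round((x - lo) / scale))) for x in block]
    return [lo + scale * q for q in codes]


def quant_sym4(block):
    """4-bit scale-only round trip, codes in [-8, 7] (assumed internal Q4)."""
    amax = max(abs(x) for x in block)
    scale = amax / 7 or 1.0
    codes = [min(7, max(-8, round(x / scale))) for x in block]
    return [scale * q for q in codes]


# One 32-weight block whose values are all positive, in [1.0, 2.5].
block = [1.0 + 0.1 * (i % 16) for i in range(32)]

as_q4k = quant_asym4(block)   # what the GGUF file would hold (simplified)
as_q4 = quant_sym4(as_q4k)    # after load-time conversion to the assumed Q4

err_first = max(abs(a - b) for a, b in zip(block, as_q4k))
err_requant = max(abs(a - b) for a, b in zip(as_q4k, as_q4))
print(f"scale+min error: {err_first:.4f}, extra requantization error: {err_requant:.4f}")
```

In this toy block the scale-plus-minimum grid represents every weight almost exactly, while the scale-only grid spends most of its 16 codes on the unused range below 1.0, so the second quantization adds error that the first did not have.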
The split path is gated behind TQ_PHI3_SPLIT=1 until a
higher-precision split path is available. Default behavior is unchanged:
- Phi-3.5 Q4_K_M stays on the raw-GGUF int8 path (int8 dot kernel)
- No batched prefill for Phi-3 (too precision-sensitive to risk a regression)
The 11/11 STRICT tests now pass with default settings. Users who
prioritize prefill speed on long Phi-3 prompts can opt in via:
TQ_PHI3_SPLIT=1 ./build/quant phi3.gguf -p "long prompt..." -n N
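
The opt-in gate can be sketched as follows. This is an illustration only: the function name and the returned labels are hypothetical, and treating exactly "1" as the only enabling value is an assumption, since the actual loader's parsing of TQ_PHI3_SPLIT is not shown in this message.

```python
import os


def phi3_load_path(env=None):
    """Pick the Phi-3 load path from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: only the exact value "1" opts in; anything else keeps the default.
    if env.get("TQ_PHI3_SPLIT") == "1":
        return "split-q4-batched"   # fused tensors split, batched prefill enabled
    return "raw-gguf-int8"          # default: raw-GGUF int8 path, no batched prefill


print(phi3_load_path())
```

Keeping the default on the slower path means the quality regression can only appear when the variable is set explicitly.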
Future work: investigate Q4+Q2 progressive conversion for Phi-3 to
preserve Q4_K-level precision while gaining batched prefill.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent baabe82 commit 0f90427
2 files changed
Lines changed: 105 additions & 22 deletions