|
46 | 46 | └──────────────────┴──────────────────────────────────────────────────┘ |
47 | 47 | ``` |
48 | 48 |
|
49 | | -### Perplexity — PPL +0.03% (Almost Zero Degradation) |
| 49 | +### Perplexity — Zero Degradation Across Architectures |
50 | 50 |
|
51 | 51 | ``` |
52 | | -Gemma 3 4B, 101 tokens, teacher-forced: |
| 52 | +SmolLM2 1.7B (Llama arch), 105 tokens: Gemma 3 4B, 101 tokens: |
53 | 53 |
|
54 | | - FP16 KV ████████████████████████████████████ 35.99 PPL (baseline) |
55 | | - 1-bit K + FP16 V ████████████████████████████████████ 35.99 PPL (+0.00%) |
56 | | - 1-bit K + Q4 V ████████████████████████████████████ 36.00 PPL (+0.03%) ← almost no loss |
57 | | - 1-bit K + Q2 V █████████████████████████████████████████ 42.23 PPL (+17.3%) |
| 54 | + baseline ██████ 5.84 PPL baseline ████████████████████ 35.99 PPL |
| 55 | + 1-bit K ██████ 5.84 PPL (+0.00%) 1-bit K ████████████████████ 35.99 PPL (+0.00%) |
| 56 | + 1-bit K+Q4V ██████ 5.82 PPL (-0.04%) 1-bit K+Q4V ████████████████████ 36.00 PPL (+0.03%) |
58 | 57 |
|
59 | 58 | K-only quantization (V as FP16) is perplexity-identical. |
60 | 59 | K + Q4 V adds just +0.03% PPL — statistically negligible. |
@@ -114,15 +113,16 @@ ctest --test-dir build # 32/32 should pass |
114 | 113 |
|
115 | 114 | ## Supported Models |
116 | 115 |
|
117 | | -| Model | Params | Format | Speed (6T, M3) | KV 1-bit Verified | |
118 | | -|-------|--------|--------|----------------|-------------------| |
119 | | -| **Qwen3.5-35B-A3B** | 35B (3B active) | GGUF IQ2_XXS | ~1-4 tok/s | byte-identical ✓ | |
120 | | -| **Qwen3.5-4B** | 4B | GGUF Q8_0 | ~15 tok/s | byte-identical ✓ | |
121 | | -| **Qwen3.5-0.8B** | 752M | TQM / GGUF | 35 tok/s | byte-identical ✓ | |
122 | | -| **Gemma 3 4B** | 4B | TQM | 20 tok/s | PPL +0.03% ✓ | |
123 | | -| **Gemma 3 270M** | 270M | TQM | 176 tok/s | byte-identical ✓ | |
| 116 | +| Model | Arch | Params | Format | Speed (6T, M3) | KV 1-bit Verified | |
| 117 | +|-------|------|--------|--------|----------------|-------------------| |
| 118 | +| **Qwen3.5-35B-A3B** | Qwen2-MoE | 35B (3B active) | GGUF IQ2_XXS | ~1-4 tok/s | byte-identical ✓ | |
| 119 | +| **Qwen3.5-4B** | Qwen3.5 | 4B | GGUF Q8_0 | 5.4 tok/s | byte-identical ✓ | |
| 120 | +| **SmolLM2-1.7B** | **Llama** | 1.7B | GGUF Q8_0 | 24 tok/s | **PPL +0.00%** ✓ | |
| 121 | +| **Qwen3.5-0.8B** | Qwen3.5 | 752M | TQM / GGUF | 35 tok/s | byte-identical ✓ | |
| 122 | +| **Gemma 3 4B** | Gemma 3 | 4B | TQM | 20 tok/s | PPL +0.03% ✓ | |
| 123 | +| **Gemma 3 270M** | Gemma 3 | 270M | TQM | 176 tok/s | byte-identical ✓ | |
124 | 124 |
|
125 | | -Architectures: Gemma 3 (sliding window, GeGLU), Qwen3.5 (DeltaNet hybrid), Qwen2-MoE (256 experts, top-8, shared expert). |
| 125 | +**4 architectures verified:** Llama (SmolLM2), Gemma 3 (sliding window, GeGLU), Qwen3.5 (DeltaNet hybrid), Qwen2-MoE (256 experts, top-8, shared expert). |
126 | 126 |
|
127 | 127 | --- |
128 | 128 |
|
|
0 commit comments