Commit 1c656b5
tokenizer: document O(n^2) BPE merge cap + Llama 3 detection guard
The max_tok cap at max_seq_len protects against the O(n^2) BPE merge cost on
long texts. GPT-2 byte-level BPE starts with one token per byte, so a 17 KB
input begins as ~17,000 tokens, and the naive merge loop (a full scan per
merge, up to one merge per token) performs on the order of 17,000^2 ≈ 289
million operations, which is impractical. Added a sentencepiece
detection guard (vocab < 100K) and documented the BPE complexity issue.
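To illustrate why the cap matters, here is a minimal sketch of a capped byte-level BPE encode. The actual code is not visible in this diff, so every name below (naive_bpe_encode, ranks, max_tok, the merged-id scheme) is illustrative rather than taken from the repo:

```python
# Sketch of a capped byte-level BPE encode path (illustrative names only).

def naive_bpe_encode(data: bytes, ranks: dict[tuple[int, int], int],
                     max_tok: int) -> list[int]:
    """Byte-level BPE: start with one token per byte, then repeatedly merge
    the best-ranked adjacent pair. Each pass scans the whole token list and
    up to n-1 merges can occur, so the loop is O(n^2) in the input length.
    Capping the input at max_tok bounds the worst case."""
    # The cap: never feed more than max_tok initial byte-tokens into the
    # quadratic merge loop (GPT-2-style BPE yields one token per byte).
    tokens = list(data[:max_tok])
    while True:
        # Scan all adjacent pairs for the lowest-ranked merge: O(n) per pass.
        best_rank, best_i = None, None
        for i in range(len(tokens) - 1):
            rank = ranks.get((tokens[i], tokens[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break  # no mergeable pair left
        # Replace the pair with its merged token id (256 + rank is an
        # illustrative id scheme, not the repo's actual one).
        tokens[best_i:best_i + 2] = [256 + best_rank]
    return tokens
```

With max_tok tied to max_seq_len, even a 17 KB input enters the quadratic loop with at most max_seq_len byte-tokens, so the worst case is bounded by the model's context window rather than by document length.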
All PPL measurements at 957 tokens remain valid; 957 is still the correct
maximum for the current tokenizer. S1 correction #9 validated.
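The detection guard itself is likewise not shown in the extracted diff. A minimal sketch of the vocab-size heuristic it describes, with a hypothetical constant and function name (only the <100K threshold comes from the commit message):

```python
# Sketch of the vocab-size detection guard (hypothetical names).

SENTENCEPIECE_VOCAB_CEILING = 100_000  # threshold from the commit message

def detect_tokenizer_family(vocab_size: int) -> str:
    """Heuristic guard: sentencepiece checkpoints (e.g. Llama 2's 32,000-entry
    vocab) sit well below 100K, while Llama 3's tiktoken-style BPE vocabulary
    has 128,256 entries and lands above it."""
    if vocab_size < SENTENCEPIECE_VOCAB_CEILING:
        return "sentencepiece"
    return "bpe"  # Llama 3-style tokenizer
```

For example, detect_tokenizer_family(32_000) returns "sentencepiece" and detect_tokenizer_family(128_256) returns "bpe", so the 100K cutoff separates the two families cleanly.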
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 1a7fe20 · commit 1c656b5
2 files changed: 20 additions & 4 deletions
[File 1 of 2 — filename not captured: one hunk spanning old lines 1152–1160 / new lines 1152–1170; old lines 1155–1157 replaced by new lines 1155–1167 (13 additions, 3 deletions). The diff text itself was not extracted.]
[File 2 of 2 — filename not captured: one hunk spanning old lines 417–423 / new lines 417–429; old line 420 replaced by new lines 420–426 (7 additions, 1 deletion). The diff text itself was not extracted.]