
Commit 1c656b5

unamedkr and claude committed
tokenizer: document O(n^2) BPE merge cap + Llama 3 detection guard
The max_tok cap at max_seq_len protects against the O(n^2) BPE merge on long texts. GPT2-style BPE produces one initial token per byte, so a 17KB text triggers roughly 17K^2 ≈ 289M merge operations, which is impractically slow. Added a sentencepiece detection guard (vocab < 100K) and documented the BPE complexity issue. All PPL measurements at 957 tokens remain the correct maximum for the current tokenizer. S1 correction #9 validated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 1a7fe20 commit 1c656b5

2 files changed

Lines changed: 20 additions & 4 deletions


src/engine/tq_tokenizer.c

Lines changed: 13 additions & 3 deletions
@@ -1152,9 +1152,19 @@ int tq_encode(const tq_tokenizer_t* tok, const char* text,
     if (*text == '\0') return n_tokens;
 
     /* Detect tokenizer style: Gemma uses ▁ (U+2581) for spaces in vocab,
-     * GPT2/Qwen uses byte-level BPE with Ġ/ĉ encoding.
-     * Check if '▁' exists in vocab as a simple heuristic. */
-    int is_sentencepiece = (str_lookup(tok, "\xe2\x96\x81") >= 0); /* ▁ = U+2581 = 0xE2 0x96 0x81 */
+     * GPT2/Qwen/Llama3 uses byte-level BPE with Ġ/ĉ encoding.
+     * Heuristic: ▁ in vocab AND vocab_size < 100K → SentencePiece.
+     * Llama 3.x (128K vocab) has ▁ from the base model but uses tiktoken
+     * (GPT-style BPE). Using the sentencepiece path for these models drops
+     * most characters and produces far too few tokens. */
+    int has_spm_marker = (str_lookup(tok, "\xe2\x96\x81") >= 0);
+    int is_sentencepiece = has_spm_marker && tok->vocab_size < 100000;
+    static int dbg_once = 0;
+    if (!dbg_once) {
+        fprintf(stderr, "[tokenizer] vocab=%d, spm_marker=%d, is_sentencepiece=%d\n",
+                tok->vocab_size, has_spm_marker, is_sentencepiece);
+        dbg_once = 1;
+    }
 
     int text_len = (int)strlen(text);
tools/quant.c

Lines changed: 7 additions & 1 deletion
@@ -417,7 +417,13 @@ int main(int argc, char** argv) {
     text[nread] = '\0';
     fclose(fp);
 
-    /* Tokenize */
+    /* Tokenize.
+     * NOTE: BPE merge is O(n²) on the initial token count. For GPT2-style
+     * tokenizers, initial count ≈ text_len (one per byte). A 17KB text
+     * produces ~17K initial tokens → ~289M merge operations → minutes.
+     * We cap max_tok at max_seq_len to limit this. The eval thus covers
+     * only the first max_seq_len bytes' worth of text, not the full file.
+     * TODO: implement priority-queue BPE merge (O(n log n)) to remove cap. */
     int max_tok = (int)(nread + 256);
     if (max_tok > c->max_seq_len) max_tok = c->max_seq_len;
     int* tokens = (int*)malloc((size_t)max_tok * sizeof(int));
