Commit 2396027
Fix GGUF BPE merge parsing — fixes Qwen3/Llama3 garbage output (#21)
tq_load_tokenizer_from_gguf() allocated the merge_pairs buffer and
set n_merges, but never actually parsed the GGUF merge strings into
(id_a, id_b, id_merged) triples. The buffer was zeroed and left
unpopulated.
BPE tokenizers (Qwen3 248K vocab, Llama 3, GPT-2 style) depend on
merge pairs to combine byte tokens into word tokens. Without parsed
merges, every byte was emitted as a separate token, producing
garbage Unicode output.
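
To make the dependency on merge pairs concrete, here is a minimal sketch of how (id_a, id_b, id_merged) triples are typically consumed during BPE encoding. It is not the code from this commit: the function names, the flat int-array layout, and the "lower index = higher priority" ordering are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout (hypothetical): merges[3*i + 0..2] = (id_a, id_b, id_merged),
 * with a lower index i meaning a higher-priority merge rule. */
static int find_merge_rank(const int *merges, int n_merges,
                           int a, int b, int *merged_out) {
    for (int i = 0; i < n_merges; i++) {
        if (merges[3 * i] == a && merges[3 * i + 1] == b) {
            *merged_out = merges[3 * i + 2];
            return i;
        }
    }
    return -1; /* no rule for this adjacent pair */
}

/* Repeatedly merges the best-ranked adjacent pair in place.
 * Returns the new token count. With no usable rules the input is
 * returned unchanged, i.e. one token per byte. */
static int bpe_apply_merges(int *toks, int n, const int *merges, int n_merges) {
    assert(toks != NULL && merges != NULL);
    for (;;) {
        int best_rank = -1, best_pos = -1, best_id = -1;
        for (int i = 0; i + 1 < n; i++) {
            int merged = -1;
            int r = find_merge_rank(merges, n_merges, toks[i], toks[i + 1], &merged);
            if (r >= 0 && (best_rank < 0 || r < best_rank)) {
                best_rank = r;
                best_pos = i;
                best_id = merged;
            }
        }
        if (best_pos < 0) break;            /* nothing left to merge */
        toks[best_pos] = best_id;           /* replace the pair with the merged id */
        for (int j = best_pos + 1; j + 1 < n; j++) {
            toks[j] = toks[j + 1];          /* close the gap */
        }
        n--;
    }
    return n;
}
```

With a toy vocabulary {0:"l", 1:"o", 2:"w", 3:"lo", 4:"low"} and rules (0,1,3) and (3,2,4), the sequence [0, 1, 2] collapses to the single token 4. With zero rules it stays at three byte-level tokens, which mirrors the symptom described above.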
SentencePiece tokenizers (SmolLM2, Gemma) worked because they use
character-level encoding and don't need BPE merges.
The fix iterates over the GGUF string array, splits each "tok_a tok_b"
merge rule, looks up token IDs, and stores the triple — identical to
the existing JSON tokenizer path (tq_tokenizer.c:596-672).
Applied to both tq_tokenizer.c (library) and quant.h (single-header /
WASM).
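
The parsing step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the committed code: parse_merges and lookup_token are hypothetical names, the rules and vocabulary are assumed to be NUL-terminated C strings (GGUF stores length-prefixed strings), the lookup is a plain linear scan, and skipping rules whose tokens are missing from the vocabulary is a choice made for this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: linear vocabulary lookup, returns -1 if absent. */
static int lookup_token(const char *const *vocab, int n_vocab,
                        const char *s, size_t len) {
    for (int i = 0; i < n_vocab; i++) {
        if (strlen(vocab[i]) == len && memcmp(vocab[i], s, len) == 0) {
            return i;
        }
    }
    return -1;
}

/* Parses "tok_a tok_b" rules into (id_a, id_b, id_merged) triples,
 * written to out[3*k .. 3*k+2]. out must hold 3 * n_rules ints.
 * Returns the number of triples stored. */
static int parse_merges(const char *const *rules, int n_rules,
                        const char *const *vocab, int n_vocab, int *out) {
    assert(rules != NULL && vocab != NULL && out != NULL);
    int stored = 0;
    for (int i = 0; i < n_rules; i++) {
        const char *rule = rules[i];
        const char *sp = strchr(rule, ' ');   /* assumed single-space separator */
        if (sp == NULL || sp == rule || sp[1] == '\0') continue; /* malformed */

        size_t len_a = (size_t)(sp - rule);
        size_t len_b = strlen(sp + 1);

        /* The merged token is the concatenation of both halves. */
        char *joined = (char *)malloc(len_a + len_b + 1);
        if (joined == NULL) break;
        memcpy(joined, rule, len_a);
        memcpy(joined + len_a, sp + 1, len_b + 1); /* includes the NUL */

        int id_a = lookup_token(vocab, n_vocab, rule, len_a);
        int id_b = lookup_token(vocab, n_vocab, sp + 1, len_b);
        int id_m = lookup_token(vocab, n_vocab, joined, len_a + len_b);
        free(joined);

        if (id_a < 0 || id_b < 0 || id_m < 0) continue; /* unknown token: skip */
        out[3 * stored + 0] = id_a;
        out[3 * stored + 1] = id_b;
        out[3 * stored + 2] = id_m;
        stored++;
    }
    return stored;
}
```

A linear scan per lookup is quadratic in vocabulary size, so for a vocabulary in the hundreds of thousands a hash map or sorted index would be the more realistic choice; the scan is used here only to keep the sketch short.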
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 10c49ff commit 2396027
2 files changed
Lines changed: 98 additions & 12 deletions
File 1 of 2: @@ -8033,18 +8033,61 @@ (49 additions, 6 deletions; diff body not preserved)
File 2 of 2: @@ -881,18 +881,61 @@ (49 additions, 6 deletions; diff body not preserved)