This guide covers everything needed to build THAIME from source, including the dictionary pipeline, WASM target, and web demo.
| Tool | Version | Purpose |
|---|---|---|
| Rust | stable | Core toolchain |
| wasm-pack | latest | WASM build (optional) |
| Node.js | 22+ | Web demo (optional) |
| cbindgen | 0.29+ | C header generation (auto via build dep) |
# Build everything (debug)
cargo build --workspace
# Run tests
cargo test --workspace
# Run the CLI
cargo run -p thaime_cli
# Run the TUI
cargo run -p thaime_tui

This works out of the box if the data/dict/ directory already contains compiled dictionary binaries (they are committed to the repository).
The build.sh script runs the complete pipeline: dictionary generation, workspace build, WASM compilation, and web demo setup.
./build.sh [path/to/trie_dataset.json]

Default input: data/input/trie_dataset.json
Step 1/6: Build thaime_dictgen (release)
└─ cargo build -r -p thaime_dictgen
Step 2/6: Generate dictionary
└─ target/release/thaime_dictgen <input.json> data/dict/
Step 3/6: Version dictionary files
└─ Renames trie.bin, metadata.bin, thaime.dict
to trie-v0_5_0.bin, metadata-v0_5_0.bin, thaime-v0_5_0.dict
(version tag derived from workspace Cargo.toml)
Step 4/6: Build workspace (release)
└─ cargo build -r --workspace
(embeds fresh dictionary via include_bytes!())
Step 5/6: Build WASM package
└─ wasm-pack build crates/thaime_wasm --target web
Step 6/6: Copy dict + ngram to web demo
└─ Copies thaime-v0_5_0.dict to web/public/dict/
└─ Copies highest min_count ngram binary to web/public/dict/
└─ Writes web/.env with VITE_DICT_FILE and VITE_NGRAM_FILE
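The renaming rule in step 3 can be sketched in Rust as follows. This is only an illustration of the naming scheme shown above, not the code in build.sh (which is a shell script); the function names `version_tag` and `versioned_name` are hypothetical, and the handling of pre-release or build-metadata suffixes is not covered by the source.

```rust
/// Turn a Cargo version such as "0.5.0" into the tag used in file names ("v0_5_0").
/// Hypothetical helper; assumes a plain MAJOR.MINOR.PATCH version string.
fn version_tag(cargo_version: &str) -> String {
    format!("v{}", cargo_version.replace('.', "_"))
}

/// Insert the tag before the extension: "trie.bin" becomes "trie-v0_5_0.bin".
fn versioned_name(file_name: &str, cargo_version: &str) -> String {
    let tag = version_tag(cargo_version);
    match file_name.rsplit_once('.') {
        Some((stem, ext)) => format!("{}-{}.{}", stem, tag, ext),
        // No extension: append the tag at the end (assumption, not from the source).
        None => format!("{}-{}", file_name, tag),
    }
}
```

With workspace version 0.5.0, `versioned_name("thaime.dict", "0.5.0")` yields the `thaime-v0_5_0.dict` name used in steps 3 and 6.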
The dictionary source data (trie_dataset.json) is generated from the companion thaime-nlp repository:
# In the thaime-nlp repo:
python -m pipelines trie run
# Copy output to thaime:
cp pipelines/outputs/trie/trie_dataset.json /path/to/thaime/data/input/

The thaime_dictgen crate compiles this JSON into binary files:

cargo run -p thaime_dictgen -- data/input/trie_dataset.json data/dict/

This produces three files in data/dict/:
| File | Contents |
|---|---|
| trie.bin | yada double-array trie (romanization keys → group IDs) |
| metadata.bin | bincode-serialized word metadata + CSR posting lists |
| thaime.dict | Combined blob for WASM ([4B trie_len][trie][metadata]) |
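The combined layout `[4B trie_len][trie][metadata]` can be read and written with a few lines of Rust. The sketch below is a minimal illustration, not the thaime_dictgen implementation: the function names are hypothetical, and the byte order of the length prefix is an assumption (little-endian), since the layout description does not state it.

```rust
/// Split a combined blob laid out as [4-byte trie_len][trie][metadata].
/// ASSUMPTION: the prefix is a little-endian u32; the byte order is not given in the guide.
fn split_dict_blob(blob: &[u8]) -> Option<(&[u8], &[u8])> {
    if blob.len() < 4 {
        return None; // too short to hold the length prefix
    }
    let trie_len = u32::from_le_bytes([blob[0], blob[1], blob[2], blob[3]]) as usize;
    let body = &blob[4..];
    if body.len() < trie_len {
        return None; // prefix claims more trie bytes than are present
    }
    // First `trie_len` bytes are the trie, everything after is metadata.
    Some(body.split_at(trie_len))
}

/// Build a blob in the same layout (same little-endian assumption).
fn join_dict_blob(trie: &[u8], metadata: &[u8]) -> Vec<u8> {
    let mut blob = Vec::with_capacity(4 + trie.len() + metadata.len());
    blob.extend_from_slice(&(trie.len() as u32).to_le_bytes());
    blob.extend_from_slice(trie);
    blob.extend_from_slice(metadata);
    blob
}
```

Storing only the trie length is enough because the metadata section simply runs to the end of the blob.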
The build script (crates/thaime_engine/build.rs) discovers these files and sets THAIME_TRIE_PATH and THAIME_METADATA_PATH environment variables so they are embedded at compile time via include_bytes!(). It prefers versioned filenames (e.g., trie-v0_5_0.bin) and falls back to unversioned names.
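The "prefer versioned, fall back to unversioned" rule can be sketched as a pure function over the file names found in data/dict/. This is a hedged illustration rather than the contents of build.rs: `pick_dict_file` and `rustc_env_directive` are hypothetical names, and picking the first versioned match when several exist is an assumption. The `cargo:rustc-env=VAR=VALUE` line is the standard Cargo build-script directive for making a value visible to `env!()`.

```rust
/// Pick the file to embed for `stem`/`ext` from the names found in the dictionary directory.
/// A versioned name ("trie-v0_5_0.bin") wins over the unversioned one ("trie.bin").
/// ASSUMPTION: if several versioned files exist, the first one listed is used.
fn pick_dict_file(names: &[&str], stem: &str, ext: &str) -> Option<String> {
    let versioned_prefix = format!("{}-v", stem);
    let suffix = format!(".{}", ext);
    let unversioned = format!("{}.{}", stem, ext);

    names
        .iter()
        .copied()
        .find(|n| n.starts_with(&versioned_prefix) && n.ends_with(&suffix))
        .or_else(|| names.iter().copied().find(|n| *n == unversioned))
        .map(str::to_string)
}

/// Format the build-script directive that exposes `path` to the crate as env var `var`.
fn rustc_env_directive(var: &str, path: &str) -> String {
    format!("cargo:rustc-env={}={}", var, path)
}
```

A build script would print the directive to stdout, for example `println!("{}", rustc_env_directive("THAIME_TRIE_PATH", path))`, so that the crate can embed the file with `include_bytes!(env!("THAIME_TRIE_PATH"))`.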
N-gram language model binaries are generated from the thaime-nlp repository:
# In the thaime-nlp repo:
python -m pipelines ngram run
# Copy output to thaime:
cp pipelines/outputs/ngram/thaime_ngram_v1_mc*.bin /path/to/thaime/data/input/

Multiple variants with different min_count thresholds are available (e.g., mc10, mc15, mc20). Higher min_count means fewer but more reliable n-gram entries. The build script and build.sh both prefer the highest available min_count variant.
File naming convention: thaime_ngram_v{format_version}_mc{min_count}.bin
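As a rough sketch of how the naming convention and the "highest min_count wins" preference fit together, the Rust below parses a file name and selects a variant. The type and function names are hypothetical, and the actual selection logic in build.rs and build.sh may differ in detail (for example in how the format version is taken into account).

```rust
/// Parsed parts of a name like "thaime_ngram_v1_mc20.bin".
#[derive(Debug, PartialEq)]
struct NgramFile {
    format_version: u32,
    min_count: u32,
}

/// Parse `thaime_ngram_v{format_version}_mc{min_count}.bin`; returns None for other names.
fn parse_ngram_name(name: &str) -> Option<NgramFile> {
    let rest = name.strip_prefix("thaime_ngram_v")?.strip_suffix(".bin")?;
    let (version, min_count) = rest.split_once("_mc")?;
    Some(NgramFile {
        format_version: version.parse().ok()?,
        min_count: min_count.parse().ok()?,
    })
}

/// Among the given file names, return the n-gram binary with the highest min_count.
/// Names that do not follow the convention are ignored.
fn pick_highest_min_count<'a>(names: &[&'a str]) -> Option<&'a str> {
    names
        .iter()
        .copied()
        .filter_map(|n| parse_ngram_name(n).map(|f| (f.min_count, n)))
        .max_by_key(|(min_count, _)| *min_count)
        .map(|(_, n)| n)
}
```

Comparing the parsed number rather than the raw string avoids ordering mistakes such as "mc9" sorting after "mc20".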
N-gram data can be loaded in several ways:
| Mode | How | When |
|---|---|---|
| Embedded | embed-ngram feature flag | Compile-time embedding via include_bytes!() |
| Runtime (binary) | --ngram-bin <path> (CLI) | Load pre-built binary at startup |
| Runtime (TSV) | --ngram-dir <path> (CLI) | Load raw TSV count files (dev/debug) |
| Auto-discover | Default CLI behavior | Scans data/input/ for binary, falls back to TSV |
| Hot-load (WASM) | WasmEngine::load_ngram() | Fire-and-forget after page load |
| Feature | Default | Description |
|---|---|---|
| embed-dict | yes | Embed dictionary binaries at compile time |
| embed-ngram | no | Embed n-gram binary at compile time |
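One way a feature flag like embed-dict can gate compile-time embedding is shown below. This is a sketch under assumptions, not the engine's actual code: the function name `embedded_trie` is hypothetical, and it assumes the build script has set THAIME_TRIE_PATH as described in this guide.

```rust
/// Dictionary bytes baked into the binary when the `embed-dict` feature is enabled.
/// Hypothetical accessor; the real engine API may look different.
#[cfg(feature = "embed-dict")]
fn embedded_trie() -> Option<&'static [u8]> {
    // THAIME_TRIE_PATH is expected to be set by the build script;
    // include_bytes! reads that file at compile time.
    let bytes: &'static [u8] = include_bytes!(env!("THAIME_TRIE_PATH"));
    Some(bytes)
}

/// Without the feature nothing is embedded and callers must load the data at runtime.
#[cfg(not(feature = "embed-dict"))]
fn embedded_trie() -> Option<&'static [u8]> {
    None
}
```

Because the disabled variant is removed before macro expansion, a build with `--no-default-features` does not need the dictionary files or the environment variable to exist.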
# Build with both features (embed-dict is enabled by default)
cargo build --features embed-ngram
# Build without embedded dictionary (for runtime loading)
cargo build --no-default-features

| Artifact | Path | Description |
|---|---|---|
| Shared library | target/{debug,release}/libthaime_engine.so | C ABI library for frontends |
| Static library | target/{debug,release}/libthaime_engine.a | Static linking alternative |
| C header | target/thaime.h | Auto-generated by cbindgen |
| CLI binary | target/{debug,release}/thaime_cli | Interactive REPL |
| TUI binary | target/{debug,release}/thaime_tui | Visual debugger |
| WASM package | crates/thaime_wasm/pkg/ | wasm-bindgen JS+WASM bundle |
Requires wasm-pack and the wasm32-unknown-unknown target:
rustup target add wasm32-unknown-unknown
cargo install wasm-pack
wasm-pack build crates/thaime_wasm --target web

Output goes to crates/thaime_wasm/pkg/, which the web demo's vite.config.ts references.
The web demo is a React + TypeScript application in the web/ directory:
cd web
npm install
npm run dev # Development server
npm run build # Production build → web/dist/

The demo needs dictionary and n-gram files in web/public/dict/. The build.sh pipeline (step 6) handles this, or you can copy them manually:
mkdir -p web/public/dict
cp data/dict/thaime-v0_5_0.dict web/public/dict/
cp data/input/thaime_ngram_v1_mc20.bin web/public/dict/
echo 'VITE_DICT_FILE=thaime-v0_5_0.dict' > web/.env
echo 'VITE_NGRAM_FILE=thaime_ngram_v1_mc20.bin' >> web/.env

The C header is auto-generated by cbindgen during cargo build via the build script at crates/thaime_engine/build.rs. Output: target/thaime.h.
To verify exported symbols:
nm -D target/debug/libthaime_engine.so | grep thaime_

Two GitHub Actions workflows are configured:
Runs on push to main and all PRs:
- License header check (hawkeye)
- Format check (cargo fmt --all -- --check)
- Clippy lint (cargo clippy --all-targets --all-features -- -D warnings)
- Build (cargo build --workspace)
- Test (cargo test --workspace)
Runs on push to main (and manual dispatch):
- Build WASM package
- Copy versioned dictionary + n-gram to web/public/dict/
- Build Vite production bundle
- Deploy to GitHub Pages
# Format all code
cargo fmt --all
# Lint with all features enabled
cargo clippy --all-targets --all-features
# Run all tests
cargo test --workspace

- All source files must include // SPDX-License-Identifier: MPL-2.0
- Each module should have #[cfg(test)] mod tests { ... }
- Cargo.lock is committed for reproducible builds
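A source file that follows the license-header and test-module conventions might look like the skeleton below. The helper function is purely illustrative and does not exist in THAIME; only the SPDX line and the `#[cfg(test)] mod tests` layout come from the conventions listed here.

```rust
// SPDX-License-Identifier: MPL-2.0

/// Hypothetical helper, present only to give the test module something to exercise.
/// Returns true for a non-empty key made solely of ASCII lowercase letters.
pub fn is_ascii_lowercase_key(key: &str) -> bool {
    !key.is_empty() && key.bytes().all(|b| b.is_ascii_lowercase())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn accepts_only_lowercase_ascii_keys() {
        assert!(is_ascii_lowercase_key("sawatdi"));
        assert!(!is_ascii_lowercase_key("Sawatdi"));
        assert!(!is_ascii_lowercase_key(""));
    }
}
```

Keeping the tests in the same file means `cargo test --workspace` picks them up without any extra configuration.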