You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
perf(wasm): Web Worker + no ASYNCIFY — maximum inference speed (#28)
* fix: quantcpp CLI command + default to Llama-3.2-1B (user feedback)
User feedback: "quantcpp command not found" + "garbage text from 135M"
1. Added `quantcpp` CLI entry point (pyproject.toml [project.scripts])
- `quantcpp "question"` — one-shot
- `quantcpp` — interactive chat
- `quantcpp --model path.gguf` — custom model
2. Default model changed from SmolLM2-135M to Llama-3.2-1B
- 135M produces garbage text — terrible first impression
- 1B is 750MB (bigger download) but actually useful output
- SmolLM2-135M still available for bandwidth-constrained users
3. README Quick Start now shows `quantcpp` CLI first, Python second
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf(wasm): Web Worker architecture — eliminate ASYNCIFY for max speed
Replace ASYNCIFY-based streaming with a dedicated Web Worker.
Inference runs entirely in the worker thread; tokens stream to
the main thread via postMessage(). The main thread never blocks.
Changes:
- inference-worker.js: new Web Worker that loads WASM + runs
quant_generate() in a blocking loop, posting each token
- quant_wasm.c: simplified — removed ASYNCIFY, sleep, async
variants. Single sync callback posts tokens via EM_JS
- build.sh: removed -sASYNCIFY and ASYNCIFY_IMPORTS. Added
-mrelaxed-simd for FMA. Fixed 1GB memory (no growth penalty
with pthreads). ALLOW_MEMORY_GROWTH=0
- index.html: generate() sends to worker, receives tokens via
onmessage handler. Model loading via transferable ArrayBuffer
Performance impact:
- ASYNCIFY removal: ~30-50% less overhead (no stack unwind/rewind)
- Fixed memory: eliminates pthreads+growth penalty
- Relaxed SIMD: FMA instructions where available
- Binary: 384K → 256K (-33%)
Combined with pthreads (PR #27) and SIMD128 (PR #25):
expected total speedup 8-15x vs original single-thread build.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0 commit comments