Commit 2520758

unamedkr and claude committed
Add @quantcpp/wasm npm package + ROADMAP realism update
ROADMAP:
- Split Direction 2 into "production-ready" (uniform_4b) and "building
  blocks (research, not yet production-ready)" sections, reflecting the
  honest TurboQuant reproduction status from issue #14.

wasm/ — npm package layout for @quantcpp/wasm:
- package.json with proper metadata (Apache 2.0, keywords, repo)
- index.mjs: ESM Quant class with create/loadModel/generate/free
- index.js: CJS shim that lazy-imports the ESM module
- index.d.ts: full TypeScript surface (KVType, GenerateOptions, etc.)
- README.md: install + quick start + API reference + KV type table
- .npmignore: keep build artifacts out of the published tarball

The package wraps the existing 192KB quant.wasm so any web project can do
`npm install @quantcpp/wasm` and run GGUF inference client-side. This is
part of the embedded/edge moat — none of the other TurboQuant impls have
a browser story.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4da6915 commit 2520758

7 files changed

Lines changed: 418 additions & 13 deletions

ROADMAP.md

Lines changed: 24 additions & 13 deletions
```diff
@@ -46,25 +46,36 @@ The world's simplest way to add LLM to a C/C++ project.
 
 ## Direction 2: KV Compression Research Platform
 
-The reference implementation for KV cache quantization research.
+A C reference engine for KV cache quantization research.
 
-### Done
-- [x] 7 quantization types (Polar, QJL, Turbo, Uniform, TurboKV)
+### Production-ready
+- [x] `uniform_4b` KV quantization (4–7x compression, +6.3% PPL on Llama 3.2 3B)
+- [x] `uniform_4b` + Q4 V combo (6.9x KV memory reduction)
 - [x] Delta compression (P-frame encoding)
-- [x] QK-norm aware compression
+- [x] QK-norm aware compression (Gemma 4 / hybrid attention models)
 - [x] Plugin architecture (3 functions to add new type)
-- [x] 34 unit tests
-
-### In Progress
-- [ ] "Add Your Own Type" tutorial (docs/custom-quantization.md)
+- [x] 35 unit tests
+
+### Building blocks (research, not yet production-ready)
+- [x] Random Hadamard Transform (`tq_rht.c`)
+- [x] Lloyd-Max-Gaussian codebook quantizer (`tq_codebook.c`)
+- [x] 1-bit QJL sign hash (`tq_qjl.c`)
+- [x] PolarQuant (polar coordinate) compression (`tq_polar.c`)
+- [x] `turbo_kv_*` types composing the building blocks (paper structure, gap in quality)
+
+### Open: TurboQuant paper reproduction
+- [ ] Close the gap on `turbo_kv_*` quality vs Google paper — see issue #14
+- [ ] Per-channel outlier handling (paper's 32-channel split)
+- [ ] QJL constant verification for Rademacher rows
+- [ ] Per-head rotation seeds
+- [ ] Regression test pinning `turbo_kv_4b` PPL on Llama 3.2 3B ≤ 14.5
+
+### Planned (after Direction 2 reproduction)
+- [ ] "Add Your Own Type" tutorial polish (docs/custom-quantization.md)
 - [ ] Arxiv tech report
-
-### Planned
-- [ ] llama.cpp KV type PR (ggml type registration)
+- [ ] llama.cpp KV type PR (ggml type registration) — only after paper reproduction works
 - [ ] vLLM KV compression plugin
 - [ ] Benchmarking suite (PPL across models × KV types)
-- [ ] Learned codebook quantization
-- [ ] Per-head adaptive bit allocation
 
 ## Non-Goals
 
```

wasm/.npmignore

Lines changed: 5 additions & 0 deletions
```
_headers
build.sh
quant_wasm.c
test/
.npmignore
```

wasm/README.md

Lines changed: 101 additions & 0 deletions
````markdown
# @quantcpp/wasm

> Single-header C LLM inference engine compiled to WebAssembly. **192 KB** binary. Runs GGUF models in your browser with KV cache compression.

[![npm version](https://img.shields.io/npm/v/@quantcpp/wasm.svg)](https://www.npmjs.com/package/@quantcpp/wasm)
[![Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)

## Install

```bash
npm install @quantcpp/wasm
```

## Quick start

```html
<script type="module">
import { Quant } from '@quantcpp/wasm';

const q = await Quant.create({
  scriptUrl: 'node_modules/@quantcpp/wasm/quant.js',
  modelUrl: 'https://huggingface.co/bartowski/SmolLM2-135M-Instruct-GGUF/resolve/main/SmolLM2-135M-Instruct-Q8_0.gguf',
  onStatus: (msg) => console.log('[quant]', msg),
});

await q.generate('The capital of France is', {
  maxTokens: 32,
  temperature: 0.0,
  onToken: (text) => document.body.append(text),
  onDone: (nTokens, elapsedMs) => {
    console.log(`Generated ${nTokens} tokens in ${elapsedMs.toFixed(0)} ms`);
  },
});

q.free();
</script>
```

## Why?

- **192 KB binary.** The entire inference engine — tokenizer, transformer forward pass, KV cache compression — is smaller than most JPEGs.
- **Zero server.** Models load and run entirely client-side. Nothing is uploaded.
- **Real models.** Llama 3, Qwen 3.5, Gemma 3, SmolLM2, and any other GGUF model under your memory budget.
- **KV compression built in.** Run 4–7× longer context than an FP16 KV cache in the same memory.
- **One file at the core.** Powered by [`quant.h`](https://github.com/quantumaikr/quant.cpp), a 628 KB single-header C library you can drop into any project.

## API

See [`index.d.ts`](./index.d.ts) for the full TypeScript surface.

```ts
import { Quant } from '@quantcpp/wasm';

const q = await Quant.create({
  scriptUrl: './quant.js',        // path to the WASM glue script
  modelUrl: '/models/llama.gguf', // optional eager model load
  kvType: 'uniform_4b',           // KV cache quantization
  vQuant: 'q4',                   // value cache quantization
});

await q.generate('Hello', {
  maxTokens: 64,
  temperature: 0.7,
  onToken: (text) => process.stdout.write(text),
});

q.free();
```

### Supported KV quantization types

| Type | Bits/elem | Notes |
|---|---|---|
| `fp32` | 32 | baseline |
| `uniform_4b` | 4 | recommended; +6.3% PPL on Llama 3.2 3B |
| `uniform_2b` | 2 | maximum compression, lower quality |
| `polar_3b` / `polar_4b` | 3 / 4 | PolarQuant-style |
| `qjl_1b` | 1 | sign-hash baseline |
| `turbo_kv_3b` / `turbo_kv_4b` | 3 / 4 | TurboQuant-structure (research; see [issue #14](https://github.com/quantumaikr/quant.cpp/issues/14)) |

## Build from source

```bash
git clone https://github.com/quantumaikr/quant.cpp
cd quant.cpp/wasm
bash build.sh   # requires emscripten (brew install emscripten)
```

Output: `quant.wasm` (192 KB) and `quant.js` (~30 KB glue).

## License

Apache 2.0. See [LICENSE](../LICENSE).

## Citation

If you use quant.cpp's KV compression building blocks in research, please cite the underlying papers:

- [TurboQuant — Zandieh et al., ICLR 2026](https://arxiv.org/abs/2504.19874)
- [PolarQuant — AISTATS 2026](https://arxiv.org/abs/2502.02617)
- [QJL — AAAI 2025](https://arxiv.org/abs/2406.03482)
````
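The compression ratios quoted above follow from bits-per-element arithmetic. A minimal sketch of the sizing math (the model dimensions below are illustrative placeholders, not measured quant.cpp values; the repo's 6.9× figure additionally reflects scale/zero-point metadata and baseline details not modeled here):

```javascript
// KV cache size: one K element and one V element per (layer, kv-head,
// head-dim, position); kBits + vBits covers both, converted to bytes.
function kvCacheBytes({ layers, kvHeads, headDim, seqLen, kBits, vBits }) {
  return layers * kvHeads * headDim * seqLen * (kBits + vBits) / 8;
}

// Illustrative Llama-3.2-3B-like dims (placeholders, not from quant.cpp):
const dims = { layers: 28, kvHeads: 8, headDim: 128, seqLen: 8192 };

const fp16 = kvCacheBytes({ ...dims, kBits: 16, vBits: 16 });
const q4q4 = kvCacheBytes({ ...dims, kBits: 4, vBits: 4 }); // uniform_4b + q4 V

console.log((fp16 / q4q4).toFixed(1)); // "4.0" nominal, before metadata overhead
```

The same arithmetic is why dropping only the K cache to 4 bits (V at fp16) lands closer to the low end of the 4–7× range.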

wasm/index.d.ts

Lines changed: 72 additions & 0 deletions
```ts
/**
 * Type definitions for @quantcpp/wasm
 */

export type KVType =
  | 'fp32'
  | 'uniform_4b'
  | 'uniform_2b'
  | 'uniform_3b'
  | 'polar_3b'
  | 'polar_4b'
  | 'qjl_1b'
  | 'turbo_3b'
  | 'turbo_4b'
  | 'turbo_kv_1b'
  | 'turbo_kv_3b'
  | 'turbo_kv_4b';

export type VQuant = 'fp16' | 'q4' | 'q2';

export interface QuantCreateOptions {
  /** URL to load quant.js from. Default: './quant.js' */
  scriptUrl?: string;
  /** Optional URL of a .gguf model to load on creation. */
  modelUrl?: string;
  /** KV cache quantization type. Default: 'uniform_4b'. */
  kvType?: KVType;
  /** Value cache quantization. Default: 'fp16'. */
  vQuant?: VQuant;
  /** Status callback for engine messages. */
  onStatus?: (message: string) => void;
}

export interface GenerateOptions {
  /** Maximum tokens to generate. Default: 128. */
  maxTokens?: number;
  /** Sampling temperature. Default: 0.7. */
  temperature?: number;
  /** Per-token callback for streaming. */
  onToken?: (text: string) => void;
  /** Called when generation completes. */
  onDone?: (nTokens: number, elapsedMs: number) => void;
}

export interface GenerateResult {
  nTokens: number;
  elapsedMs: number;
}

export class Quant {
  private constructor(module: unknown);

  /**
   * Create a Quant instance. If `modelUrl` is provided, the model is
   * loaded before the promise resolves.
   */
  static create(options?: QuantCreateOptions): Promise<Quant>;

  /** Load a GGUF model from a URL. */
  loadModel(url: string): Promise<void>;

  /** Generate text from a prompt. Resolves when generation completes. */
  generate(prompt: string, options?: GenerateOptions): Promise<GenerateResult>;

  /** Free model resources. */
  free(): void;

  /** Whether a model is currently loaded. */
  isReady(): boolean;
}

export default Quant;
```

wasm/index.js

Lines changed: 26 additions & 0 deletions
```js
/**
 * @quantcpp/wasm — CommonJS entry point
 *
 * Lazily imports the ESM module for CommonJS consumers.
 */
'use strict';

let _esmPromise = null;

function loadEsm() {
  if (!_esmPromise) {
    _esmPromise = import('./index.mjs');
  }
  return _esmPromise;
}

module.exports = {
  /**
   * Async-load the ESM Quant class.
   * Usage:
   *   const { Quant } = await require('@quantcpp/wasm').load();
   */
  load: async function () {
    return loadEsm();
  },
};
```
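The shim's core trick is a memoized dynamic import: the first `load()` starts the import, and every later call reuses the same in-flight promise, so `index.mjs` is evaluated at most once. A standalone sketch of the pattern (the stand-in `importFn` replaces the real `import('./index.mjs')`):

```javascript
'use strict';

// Memoized lazy loader: importFn runs at most once, and every caller
// shares the same promise (and therefore the same module object).
function makeLazyLoader(importFn) {
  let promise = null;
  return function load() {
    if (!promise) promise = importFn();
    return promise;
  };
}

// Demo with a stand-in importFn that counts how often it is evaluated.
let evaluations = 0;
const load = makeLazyLoader(async () => {
  evaluations += 1;
  return { Quant: class Quant {} };
});

Promise.all([load(), load()]).then(([a, b]) => {
  console.log(evaluations, a === b); // 1 true
});
```

Because the *promise* is cached rather than the resolved module, concurrent callers that race before the import finishes still share one evaluation.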

wasm/index.mjs

Lines changed: 134 additions & 0 deletions
```js
/**
 * @quantcpp/wasm — ESM entry point
 *
 * Single-header C LLM inference engine in your browser.
 *
 * Usage:
 *
 *   import { Quant } from '@quantcpp/wasm';
 *
 *   const q = await Quant.create({
 *     modelUrl: '/models/SmolLM2-135M-Instruct-Q8_0.gguf',
 *     kvType: 'uniform_4b',
 *     vQuant: 'q4',
 *     onStatus: (msg) => console.log('[status]', msg),
 *   });
 *
 *   await q.generate('Hello, my name is', {
 *     maxTokens: 64,
 *     temperature: 0.7,
 *     onToken: (text) => document.body.append(text),
 *   });
 *
 *   q.free();
 */

let _modulePromise = null;

function loadEmscriptenModule(scriptUrl) {
  if (_modulePromise) return _modulePromise;
  _modulePromise = new Promise((resolve, reject) => {
    if (typeof window === 'undefined') {
      reject(new Error('Node.js loader not implemented yet — use the browser build for now'));
      return;
    }
    const script = document.createElement('script');
    script.src = scriptUrl;
    script.onload = () => {
      // Emscripten modularize=0 attaches Module to globalThis
      if (typeof Module === 'undefined') {
        reject(new Error('quant.js loaded but Module is undefined'));
        return;
      }
      if (Module.calledRun) {
        // Runtime finished initializing before we could attach the hook.
        resolve(Module);
        return;
      }
      Module.onRuntimeInitialized = () => resolve(Module);
    };
    script.onerror = () => reject(new Error(`Failed to load ${scriptUrl}`));
    document.head.appendChild(script);
  });
  return _modulePromise;
}

export class Quant {
  constructor(module) {
    this._m = module;
    this._loaded = false;
  }

  /**
   * Create a Quant instance, optionally loading a model.
   * @param {object} opts
   * @param {string} [opts.scriptUrl='./quant.js']
   * @param {string} [opts.modelUrl] - URL to a .gguf model file
   * @param {string} [opts.kvType='uniform_4b'] - one of fp32, uniform_4b, turbo_kv_3b, ...
   * @param {string} [opts.vQuant='fp16'] - one of fp16, q4, q2
   * @param {function} [opts.onStatus] - status callback
   */
  static async create(opts = {}) {
    const scriptUrl = opts.scriptUrl || './quant.js';
    const module = await loadEmscriptenModule(scriptUrl);

    if (opts.onStatus) module.onStatus = opts.onStatus;

    const q = new Quant(module);

    if (opts.modelUrl) {
      await q.loadModel(opts.modelUrl);
    }

    return q;
  }

  /**
   * Load a GGUF model from a URL into the WASM filesystem.
   */
  async loadModel(url) {
    const resp = await fetch(url);
    if (!resp.ok) throw new Error(`Failed to fetch model: ${resp.status} ${resp.statusText}`);
    const buf = new Uint8Array(await resp.arrayBuffer());
    const path = '/model.gguf';
    this._m.FS.writeFile(path, buf);
    const ret = this._m.ccall('wasm_load_model', 'number', ['string'], [path]);
    if (ret !== 0) throw new Error(`wasm_load_model failed (rc=${ret})`);
    this._loaded = true;
  }

  /**
   * Generate text from a prompt.
   * @param {string} prompt
   * @param {object} opts
   * @param {number} [opts.maxTokens=128]
   * @param {number} [opts.temperature=0.7]
   * @param {function} [opts.onToken] - called per token with the decoded text
   * @param {function} [opts.onDone] - called with (nTokens, elapsedMs)
   */
  generate(prompt, opts = {}) {
    if (!this._loaded) throw new Error('No model loaded — call loadModel() first or pass modelUrl to create()');
    if (opts.onToken) this._m.onToken = opts.onToken;
    const maxTokens = opts.maxTokens ?? 128;
    const temperature = opts.temperature ?? 0.7;
    return new Promise((resolve) => {
      // Wrap onDone so it both forwards to the caller and resolves the promise.
      this._m.onDone = (nTokens, elapsedMs) => {
        if (opts.onDone) opts.onDone(nTokens, elapsedMs);
        resolve({ nTokens, elapsedMs });
      };
      this._m.ccall('wasm_generate', 'number', ['string', 'number', 'number'], [prompt, maxTokens, temperature]);
    });
  }

  /**
   * Free model resources. Call when done.
   */
  free() {
    if (this._loaded) {
      this._m.ccall('wasm_free_model', null, [], []);
      this._loaded = false;
    }
  }

  isReady() {
    return this._loaded;
  }
}

export default Quant;
```
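`generate()` above bridges the engine's completion callback to a promise: a wrapper callback both forwards `(nTokens, elapsedMs)` to the caller's `onDone` and resolves the promise with a structured result. The same pattern in isolation, against a fake engine object (no WASM involved; `start()` stands in for the `ccall('wasm_generate', ...)`):

```javascript
// Promise bridge: installs a completion callback on the engine, forwards
// it to the caller's onDone, and resolves with { nTokens, elapsedMs }.
function runWithDone(engine, opts = {}) {
  return new Promise((resolve) => {
    engine.onDone = (nTokens, elapsedMs) => {
      if (opts.onDone) opts.onDone(nTokens, elapsedMs);
      resolve({ nTokens, elapsedMs });
    };
    engine.start(); // stand-in for ccall('wasm_generate', ...)
  });
}

// Fake engine that synchronously reports 3 tokens in 12 ms.
const fakeEngine = {
  onDone: null,
  start() { this.onDone(3, 12); },
};

runWithDone(fakeEngine, { onDone: (n, ms) => console.log(`${n} tokens in ${ms} ms`) })
  .then((r) => console.log(r.nTokens, r.elapsedMs)); // 3 12
```

Installing the wrapper inside the promise executor, before `start()`, is what makes the bridge safe even when the engine fires the callback synchronously.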
