Another Node binding of llama.cpp, designed to keep the API as close as possible to llama.rn.
- macOS
  - arm64: CPU and Metal GPU acceleration
  - x86_64: CPU only
- Windows (x86_64 and arm64)
  - CPU
  - GPU acceleration via Vulkan
  - GPU acceleration via CUDA (x86_64)
- Linux (x86_64 and arm64)
  - CPU
  - GPU acceleration via Vulkan
  - GPU acceleration via CUDA
- Web
  - WebAssembly CPU
  - GPU acceleration via WebGPU
npm install @fugood/llama.node

import { loadModel } from '@fugood/llama.node'
// Initialize a Llama context with the model (may take a while)
const context = await loadModel({
  model: 'path/to/gguf/model',
  n_ctx: 2048,
  n_gpu_layers: 99, // > 0: enable GPU
  // lib_variant: 'vulkan', // Change backend
})
// Do completion
const { text } = await context.completion(
  {
    prompt: 'This is a conversation between user and llama, a friendly chatbot. respond in simple markdown.\n\nUser: Hello!\nLlama:',
    n_predict: 100,
    stop: ['</s>', 'Llama:', 'User:'],
    // n_threads: 4,
  },
  (data) => {
    // This is a partial completion callback
    const { token } = data
  },
)
console.log('Result:', text)

Browser builds are published as the optional @fugood/node-llama-wasm package.
Import @fugood/llama.node in browser bundles; the package browser entry
re-exports the WASM runtime while preserving the same high-level loadModel() /
context wrapper shape where browser constraints allow:
import { loadModel } from '@fugood/llama.node'
const context = await loadModel({
  model: 'https://example.com/model.gguf',
  n_ctx: 2048,
})
const tokens = await context.tokenize('Hello from the browser')
const state = await context.saveSession()
await context.loadSession(new Blob([state]))

Build the package with Emscripten:
npm run build-wasm
npm run build-wasm-docker
npm run build-wasm -- --webgpu
npm run serve-wasm-test

CPU and WebGPU builds use separate directories under build-wasm/, so switching
variants does not invalidate the other build. Fresh builds use Ninja when it is
available, JOBS=8 can cap parallelism, and installing ccache enables compiler
launcher caching automatically. Emscripten's system-library cache is kept in
build-wasm/emcache unless EM_CACHE is already set. The Docker helper selects
emscripten/emsdk:4.0.14-arm64 on arm64 hosts such as Apple Silicon Macs, and
uses emscripten/emsdk:4.0.13 on amd64 hosts. Override with EMSCRIPTEN_IMAGE
or EMSCRIPTEN_PLATFORM if you need a specific container image.
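
The image selection described above can be sketched as a small function. This is a hypothetical illustration, not the Docker helper's actual source: the name `selectEmscriptenImage` and the `arch` argument are invented for the example, and only the image tags and the `EMSCRIPTEN_IMAGE` override come from the text.

```javascript
// Hypothetical sketch of the Docker image selection described above.
// Not the package's build script; the function name is illustrative.
function selectEmscriptenImage(arch, env = {}) {
  // An explicit override always wins.
  if (env.EMSCRIPTEN_IMAGE) return env.EMSCRIPTEN_IMAGE
  // arm64 hosts (for example Apple Silicon Macs) get the arm64 image tag.
  if (arch === 'arm64') return 'emscripten/emsdk:4.0.14-arm64'
  // Every other architecture is treated as amd64 in this sketch.
  return 'emscripten/emsdk:4.0.13'
}

console.log(selectEmscriptenImage('arm64'))
console.log(selectEmscriptenImage('x64'))
console.log(selectEmscriptenImage('x64', { EMSCRIPTEN_IMAGE: 'my/image:tag' }))
```

In Node, `process.arch` and `process.env` would be the natural inputs for such a function.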
loadModel() runs the WASM runtime in a dedicated Web Worker by default so
model loading, tokenization, completion, state, embeddings, rerank, multimodal
staging, and benchmarks do not block the browser UI thread. On isolated pages
with SharedArrayBuffer, the CPU path selects the pthread build and defaults
n_threads to min(4, navigator.hardwareConcurrency). Pass
wasm: { threads: false } to force the single-thread artifact, or
wasm: { maxThreads: 8 } / n_threads to tune CPU thread count. Pass
wasm: { worker: false } only when you intentionally need direct access to the
Emscripten module on the current thread.
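
A minimal sketch of the default thread choice, assuming that "isolated" means cross-origin isolated and that the single-thread artifact corresponds to one thread. `defaultThreads` and its `env` argument are hypothetical stand-ins for the browser globals, not part of the package API; the option shapes below it are taken from the paragraph above.

```javascript
// Hypothetical sketch of the default CPU thread choice described above.
// `env` stands in for browser globals so the logic can run anywhere.
function defaultThreads(env) {
  const canUseThreads = env.crossOriginIsolated && env.hasSharedArrayBuffer
  if (!canUseThreads) return 1 // assumption: single-thread artifact
  return Math.min(4, env.hardwareConcurrency)
}

// Option shapes from the text; each object would be passed to loadModel().
const model = 'https://example.com/model.gguf'
const singleThread = { model, wasm: { threads: false } } // force single-thread artifact
const moreThreads = { model, wasm: { maxThreads: 8 } } // tune CPU thread count
const mainThread = { model, wasm: { worker: false } } // skip the dedicated Web Worker

console.log(defaultThreads({
  crossOriginIsolated: true,
  hasSharedArrayBuffer: true,
  hardwareConcurrency: 16,
}))
console.log(singleThread, moreThreads, mainThread)
```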
Model strings are fetched as URLs by default. saveSession() returns an
ArrayBuffer, and loadSession() accepts a URL, Blob, ArrayBuffer, or typed
array. URL downloads are saved in browser Cache Storage by default so repeated
model/session/media URL loads can reuse the previous bytes; pass
wasm: { cacheDownloads: false } to force a fresh fetch, wasm.cacheName to use
a separate cache, or call clearWasmDownloadCache() to clear the default cache.
The loadModel() progress callback still receives the numeric percentage first,
and also receives an optional detail object with source: 'network' | 'cache' | 'memory' | 'buffer' so UI code can distinguish real downloads from cache hits.
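
A progress handler that uses the optional detail object might look like the sketch below. `describeProgress` is a hypothetical helper and the message strings are invented; only the argument order (percentage first, detail second), the `source` values, and the `wasm.cacheDownloads` / `wasm.cacheName` option names come from the text. How the callback is passed to loadModel() is not shown here.

```javascript
// Hypothetical progress handler: numeric percentage first, optional detail second.
function describeProgress(percent, detail) {
  const source = detail && detail.source
  if (source === 'network') return `Downloading... ${percent}%`
  if (source === 'cache') return `Loading from cache... ${percent}%`
  // 'memory', 'buffer', or no detail object at all.
  return `Loading... ${percent}%`
}

// Cache-related option shapes from the text; the cache name is illustrative.
const freshFetch = { model: 'https://example.com/model.gguf', wasm: { cacheDownloads: false } }
const ownCache = { model: 'https://example.com/model.gguf', wasm: { cacheName: 'my-app-models' } }

console.log(describeProgress(42, { source: 'network' }))
console.log(describeProgress(100, { source: 'cache' }))
console.log(describeProgress(10))
console.log(freshFetch, ownCache)
```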
Browser ArrayBuffer limits still apply, so split GGUF files at or above the 2
GB limit into smaller shards before loading. WebGPU is opt-in via n_gpu_layers
and requires both navigator.gpu and WebAssembly JSPI.
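
Both constraints can be checked up front. The sketch below rests on two assumptions that the text does not confirm: that the 2 GB limit means 2 * 1024^3 bytes, and that JSPI support can be detected through the `WebAssembly.Suspending` constructor. The helper names are hypothetical.

```javascript
// Assumed interpretation of the 2 GB limit: 2 * 1024^3 bytes.
const TWO_GB = 2 * 1024 ** 3

// True when a GGUF file of this size should be split before loading.
function needsSharding(sizeInBytes) {
  return sizeInBytes >= TWO_GB
}

// Feature check for the WebGPU path. Detecting JSPI through
// WebAssembly.Suspending is an assumption, not the package's own check.
function canUseWebGPU(nav, wasm) {
  const hasGpu = Boolean(nav && nav.gpu)
  const hasJspi = Boolean(wasm && typeof wasm.Suspending === 'function')
  return hasGpu && hasJspi
}

// In a browser this would be called as canUseWebGPU(navigator, WebAssembly).
console.log(needsSharding(3 * 1024 ** 3))
console.log(canUseWebGPU({ gpu: {} }, { Suspending: function () {} }))
console.log(canUseWebGPU({}, { Suspending: function () {} }))
```

A caller could set n_gpu_layers only when the check passes and otherwise stay on the CPU path.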
The browser package includes a serial context.parallel queue for API
compatibility. It preserves the parallel API shape for web callers, but it runs
one request at a time until the native slot-manager path is ported to WASM.
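
To illustrate what "one request at a time" means in practice, here is a generic serial promise queue. It is a sketch of the general technique, not the package's context.parallel implementation; `SerialQueue` and `enqueue` are hypothetical names.

```javascript
// Generic serial queue: each task starts only after the previous one settles.
class SerialQueue {
  constructor() {
    this.tail = Promise.resolve()
  }

  enqueue(task) {
    const result = this.tail.then(() => task())
    // Keep the chain alive even if a task rejects.
    this.tail = result.catch(() => {})
    return result
  }
}

const queue = new SerialQueue()
const order = []
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// The second task is faster, but it still waits for the first to finish.
const done = Promise.all([
  queue.enqueue(async () => { await sleep(20); order.push('first') }),
  queue.enqueue(async () => { order.push('second') }),
])

done.then(() => console.log(order))
```

Callers still get one promise per request, so code written against a parallel API shape keeps working; only the throughput differs.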
- default: General usage; no GPU support except on macOS (Metal)
- vulkan: GPU support via Vulkan (Windows/Linux), but may be unstable in some scenarios
- cuda: GPU support via CUDA (Windows/Linux), but only for limited capability
  - Linux: (x86_64: 8.9, arm64: 8.7)
  - Windows: x86_64 - 12.0
- wasm: Browser package with WebAssembly CPU and optional WebGPU
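
One way to pick a lib_variant per platform is sketched below. `candidateVariants` is a hypothetical helper, and the preference order (cuda, then vulkan, then default) is an assumption made for the example; only the variant names and their platform coverage come from the list above.

```javascript
// Hypothetical helper: lib_variant values worth trying on a given platform.
// `platform` uses Node's process.platform values ('darwin', 'win32', 'linux').
function candidateVariants(platform) {
  // On macOS the default variant already covers Metal.
  if (platform === 'darwin') return ['default']
  // Windows and Linux have GPU variants; the ordering here is an assumption.
  if (platform === 'win32' || platform === 'linux') return ['cuda', 'vulkan', 'default']
  return ['default']
}

console.log(candidateVariants('darwin'))
console.log(candidateVariants('linux'))
```

A caller could try each candidate as the lib_variant option of loadModel() and fall back to the next one if loading fails.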
MIT
Built and maintained by BRICKS.