llama.node

Another Node binding of llama.cpp, designed to keep its API as close to llama.rn as possible.

llama.cpp: Inference of LLaMA model in pure C/C++
llama.rn: React Native binding of llama.cpp

Platform Support

macOS
- arm64: CPU and Metal GPU acceleration
- x86_64: CPU only
Windows (x86_64 and arm64)
- CPU
- GPU acceleration via Vulkan
- GPU acceleration via CUDA (x86_64)
Linux (x86_64 and arm64)
- CPU
- GPU acceleration via Vulkan
- GPU acceleration via CUDA
Web
- WebAssembly CPU
- GPU acceleration via WebGPU

Installation

npm install @fugood/llama.node

Usage

import { loadModel } from '@fugood/llama.node'

// Initialize a Llama context with the model (may take a while)
const context = await loadModel({
  model: 'path/to/gguf/model',
  n_ctx: 2048,
  n_gpu_layers: 99, // > 0: enable GPU
  // lib_variant: 'vulkan', // Change backend
})

// Do completion
const { text } = await context.completion(
  {
    prompt: 'This is a conversation between user and llama, a friendly chatbot. respond in simple markdown.\n\nUser: Hello!\nLlama:',
    n_predict: 100,
    stop: ['</s>', 'Llama:', 'User:'],
    // n_threads: 4,
  },
  (data) => {
    // This is a partial completion callback
    const { token } = data
  },
)
console.log('Result:', text)
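
The partial completion callback above receives one token per call. As a minimal sketch of how a caller might accumulate those tokens for a live UI, the helper below collects them into a running string. `createTokenCollector` is a hypothetical name, not part of @fugood/llama.node, and the sketch assumes `token` is a string; the stream is simulated so the snippet runs without a model.

```javascript
// Hypothetical helper (not part of @fugood/llama.node): collects streamed tokens.
// Assumption: each callback payload carries the token text as a string in `token`.
function createTokenCollector(onUpdate) {
  const tokens = []
  return {
    // Intended to be passed as the second argument to context.completion().
    onPartial(data) {
      const { token } = data
      if (typeof token !== 'string') return // skip payloads without token text
      tokens.push(token)
      if (onUpdate) onUpdate(tokens.join(''))
    },
    // Returns everything received so far.
    text() {
      return tokens.join('')
    },
  }
}

// Simulated callbacks standing in for a real completion stream.
const collector = createTokenCollector()
for (const token of ['Hel', 'lo', '!']) collector.onPartial({ token })
console.log(collector.text()) // Hello!
```

With a real context, `collector.onPartial` would take the place of the inline `(data) => { ... }` callback in the completion call above.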

WebAssembly

Browser builds are published as the optional @fugood/node-llama-wasm package. In browser bundles, import @fugood/llama.node as usual; the package's browser entry re-exports the WASM runtime and keeps the same high-level loadModel() / context wrapper shape where browser constraints allow:

import { loadModel } from '@fugood/llama.node'

const context = await loadModel({
  model: 'https://example.com/model.gguf',
  n_ctx: 2048,
})

const tokens = await context.tokenize('Hello from the browser')
const state = await context.saveSession()
await context.loadSession(new Blob([state]))

Build the package with Emscripten:

npm run build-wasm
npm run build-wasm-docker
npm run build-wasm -- --webgpu
npm run serve-wasm-test

CPU and WebGPU builds use separate directories under build-wasm/, so switching variants does not invalidate the other build. Fresh builds use Ninja when it is available, setting JOBS (for example, JOBS=8) caps parallelism, and installing ccache enables compiler-launcher caching automatically. Emscripten's system-library cache is kept in build-wasm/emcache unless EM_CACHE is already set. The Docker helper selects emscripten/emsdk:4.0.14-arm64 on arm64 hosts such as Apple Silicon Macs and emscripten/emsdk:4.0.13 on amd64 hosts. Override with EMSCRIPTEN_IMAGE or EMSCRIPTEN_PLATFORM if you need a specific container image.

loadModel() runs the WASM runtime in a dedicated Web Worker by default so model loading, tokenization, completion, state, embeddings, rerank, multimodal staging, and benchmarks do not block the browser UI thread. On isolated pages with SharedArrayBuffer, the CPU path selects the pthread build and defaults n_threads to min(4, navigator.hardwareConcurrency). Pass wasm: { threads: false } to force the single-thread artifact, or wasm: { maxThreads: 8 } / n_threads to tune CPU thread count. Pass wasm: { worker: false } only when you intentionally need direct access to the Emscripten module on the current thread.
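
The default thread selection described above can be pictured with a small sketch. It only mirrors the stated rule (pthread build and min(4, navigator.hardwareConcurrency) on isolated pages with SharedArrayBuffer) and is not the package's actual code. `describeCpuDefaults` and `readBrowserEnv` are hypothetical names, and reporting `n_threads: 1` for the single-thread artifact is an assumption.

```javascript
// Illustrative only: mirrors the defaults described in the text, not the package's code.
function describeCpuDefaults({ isolated, hasSharedArrayBuffer, hardwareConcurrency }) {
  const canUseThreads = isolated && hasSharedArrayBuffer
  // Assumption: the single-thread artifact is reported here as one thread.
  if (!canUseThreads) return { build: 'single-thread', n_threads: 1 }
  return { build: 'pthread', n_threads: Math.min(4, hardwareConcurrency) }
}

// Reads the relevant browser globals; falls back to safe values elsewhere.
function readBrowserEnv(g = globalThis) {
  return {
    isolated: g.crossOriginIsolated === true,
    hasSharedArrayBuffer: typeof g.SharedArrayBuffer === 'function',
    hardwareConcurrency: (g.navigator && g.navigator.hardwareConcurrency) || 1,
  }
}

console.log(
  describeCpuDefaults({ isolated: true, hasSharedArrayBuffer: true, hardwareConcurrency: 8 }),
) // { build: 'pthread', n_threads: 4 }
```

In a page, `describeCpuDefaults(readBrowserEnv())` gives a quick way to check which path to expect before deciding whether to pass wasm: { threads: false } or wasm: { maxThreads: 8 }.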

Model strings are fetched as URLs by default. saveSession() returns an ArrayBuffer, and loadSession() accepts a URL, Blob, ArrayBuffer, or typed array. URL downloads are saved in browser Cache Storage by default so repeated model/session/media URL loads can reuse the previous bytes; pass wasm: { cacheDownloads: false } to force a fresh fetch, wasm.cacheName to use a separate cache, or call clearWasmDownloadCache() to clear the default cache. The loadModel() progress callback still receives the numeric percentage first, and also receives an optional detail object with source: 'network' | 'cache' | 'memory' | 'buffer' so UI code can distinguish real downloads from cache hits. Browser ArrayBuffer limits still apply, so split GGUF files at or above the 2 GB limit into smaller shards before loading. WebGPU is opt-in via n_gpu_layers and requires both navigator.gpu and WebAssembly JSPI.
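
As a rough sketch of how UI code might use the optional detail object, the function below turns the (percentage, detail) pair into a status label. `describeProgress` is a hypothetical helper. Only the callback's argument shape and the four `source` values come from the text above. Grouping 'memory' and 'buffer' under one label is an assumption, and how the callback is handed to loadModel() is not shown here.

```javascript
// Hypothetical UI helper; only the (percent, detail) shape comes from the README text.
function describeProgress(percent, detail) {
  const source = detail && detail.source
  switch (source) {
    case 'network':
      return `Downloading: ${percent}%`
    case 'cache':
      return `Loading from browser cache: ${percent}%`
    case 'memory':
    case 'buffer':
      // Assumption: both values mean the bytes were already available locally.
      return `Loading from local bytes: ${percent}%`
    default:
      // detail is optional, so fall back to a generic label.
      return `Loading: ${percent}%`
  }
}

console.log(describeProgress(42, { source: 'network' })) // Downloading: 42%
console.log(describeProgress(100)) // Loading: 100%
```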

The browser package includes a serial context.parallel queue for API compatibility. It preserves the parallel API shape for web callers, but it runs one request at a time until the native slot-manager path is ported to WASM.
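
To illustrate what "one request at a time" means in practice, here is a generic serial promise queue. It is a sketch of the general technique, not the implementation behind context.parallel. `SerialQueue` and `job` are hypothetical names, and timers stand in for real requests.

```javascript
// Illustration of a serial queue; not the package's implementation.
class SerialQueue {
  constructor() {
    this.tail = Promise.resolve()
  }

  // Runs task() only after every previously enqueued task has settled.
  enqueue(task) {
    const result = this.tail.then(() => task())
    // Keep the chain usable even if this task rejects.
    this.tail = result.catch(() => {})
    return result
  }
}

const queue = new SerialQueue()
const log = []

// Builds a fake request that records when it starts and ends.
const job = (name, ms) => () =>
  new Promise((resolve) => {
    log.push(`start ${name}`)
    setTimeout(() => {
      log.push(`end ${name}`)
      resolve(name)
    }, ms)
  })

Promise.all([queue.enqueue(job('a', 20)), queue.enqueue(job('b', 5))]).then(() => {
  console.log(log.join(', ')) // start a, end a, start b, end b
})
```

Although job b is shorter, it does not start until job a has finished, which is the behavior web callers should expect from the serial queue until requests can actually overlap.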

Lib Variants

default: General use; no GPU support except on macOS (Metal)
vulkan: GPU support via Vulkan (Windows/Linux); some scenarios may be unstable
cuda: GPU support via CUDA (Windows/Linux), but only for a limited set of capabilities:
  Linux: x86_64: 8.9, arm64: 8.7; Windows: x86_64: 12.0
wasm: Browser package with WebAssembly CPU and optional WebGPU
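
One way to choose a value for the lib_variant option at runtime is sketched below. `pickLibVariant` is a hypothetical helper, not part of the package. Its mapping follows the Platform Support and Lib Variants lists in this README, and it uses the platform and architecture strings reported by Node's process.platform and process.arch. It does not check whether the chosen variant is actually installed or whether the GPU is supported.

```javascript
// Hypothetical helper; the mapping follows the lists in this README.
function pickLibVariant({ platform, arch, preferCuda = false }) {
  // macOS and unknown platforms use the default build
  // (on macOS: Metal on arm64, CPU only on x86_64).
  if (platform !== 'win32' && platform !== 'linux') return 'default'
  // CUDA is listed for Linux (x86_64 and arm64) and for Windows x86_64 only.
  const cudaListed = !(platform === 'win32' && arch !== 'x64')
  return preferCuda && cudaListed ? 'cuda' : 'vulkan'
}

console.log(pickLibVariant({ platform: 'linux', arch: 'x64' })) // vulkan
console.log(pickLibVariant({ platform: 'win32', arch: 'arm64', preferCuda: true })) // vulkan
console.log(pickLibVariant({ platform: 'darwin', arch: 'arm64' })) // default
```

The result can be passed as `lib_variant` alongside `model` and `n_gpu_layers` in the loadModel() options, for example with `pickLibVariant({ platform: process.platform, arch: process.arch })`.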

License

MIT

Built and maintained by BRICKS.
