# Docker Usage Guide

quant.cpp ships as a minimal Docker image (~10MB) built on Alpine Linux.
The binary is statically linked with zero runtime dependencies.

## Quick Start

### Build the image

```bash
docker build -t quant.cpp .
```

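The repository's `Dockerfile` is not reproduced here; a multi-stage build along these lines would yield the static ~10MB image described above. The base image tag, build command, and binary path are assumptions, not the project's actual file:

```dockerfile
# Sketch only: a build stage compiles a static binary, the final stage
# ships it on bare Alpine. Check the repository's actual Dockerfile.
FROM alpine:3.19 AS build
RUN apk add --no-cache build-base
COPY . /src
WORKDIR /src
RUN make            # assumed build command producing ./quant

FROM alpine:3.19
COPY --from=build /src/quant /usr/local/bin/quant
ENTRYPOINT ["/usr/local/bin/quant"]
```
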
### Run inference

Mount a directory containing your GGUF model file and pass CLI arguments:

```bash
docker run -v ./models:/models quant.cpp /models/model.gguf -p "hello" -k uniform_4b -v q4
```

Note: relative host paths like `./models` require a recent Docker CLI; on older versions, use an absolute path, e.g. `-v "$(pwd)/models":/models`.

### Full example with all options

```bash
docker run -v ./models:/models quant.cpp \
  /models/model.gguf \
  -p "Once upon a time" \
  -n 512 \
  -k turbo_3b \
  -v q4 \
  -j 4 \
  -T 0.8
```

### Print model info

```bash
docker run -v ./models:/models quant.cpp /models/model.gguf --info
```

### Compute perplexity

```bash
docker run -v ./models:/models -v ./data:/data quant.cpp \
  /models/model.gguf --ppl /data/wikitext.txt -k polar_3b -v q4
```

## Docker Compose

The included `docker-compose.yml` provides a preconfigured inference service:

```bash
# Place your model at ./models/model.gguf, then:
docker compose up

# Override the prompt:
docker compose run inference /models/model.gguf -p "Your prompt here" -k turbo_3b -v q4
```

Edit `docker-compose.yml` to change the default model path, KV compression type,
or thread count.

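For reference, a minimal compose file matching the commands above might look like the sketch below. The service name `inference` comes from the commands in this guide; the remaining fields are assumptions, so check the repository's actual file:

```yaml
# Sketch of a possible docker-compose.yml, not the shipped one.
services:
  inference:
    build: .
    volumes:
      - ./models:/models
    command:
      - /models/model.gguf
      - -p
      - "Once upon a time"
      - -k
      - turbo_3b
      - -v
      - q4
```
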
## KV Compression Options

| Flag | Values | Description |
|------|--------|-------------|
| `-k` | `fp32`, `uniform_4b`, `uniform_2b`, `polar_3b`, `polar_4b`, `turbo_3b`, `turbo_4b` | Key cache quantization |
| `-v` | `fp16`, `q4`, `q2` | Value cache quantization |
| `-j` | integer | Thread count for matmul |

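To compare modes, it can help to sweep `-k` while holding `-v` fixed. The hypothetical helper below only prints the `docker run` command for each key-cache mode (a dry run), reusing the perplexity invocation from above:

```shell
# Hypothetical dry-run helper: print one perplexity command per key-cache mode.
sweep_cmds() {
  for k in fp32 uniform_4b uniform_2b polar_3b polar_4b turbo_3b turbo_4b; do
    printf 'docker run --rm -v "%s/models":/models -v "%s/data":/data quant.cpp /models/model.gguf --ppl /data/wikitext.txt -k %s -v q4\n' \
      "$PWD" "$PWD" "$k"
  done
}
sweep_cmds
```

Pipe the output to `sh` to actually execute the sweep.
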
## Volume Mounts

Models are not baked into the image. Mount them at runtime:

- `/models`: the default mount point for GGUF model files
- Mount additional directories as needed (e.g., `/data` for perplexity evaluation)

## Image Size

The final image is approximately 10MB:
- Alpine base: ~7MB
- quant binary: ~500KB (statically linked, zero dependencies)