This project uses Zappa and Ollama to deploy Ollama-compatible models to AWS Lambda.
The CLI provides commands for preparing, deploying, and managing Ollama model deployments:
```shell
# Prepare deployment files for an Ollama model
python -m merle.cli prepare --model {OLLAMA_MODEL} [--auth-token TOKEN] [--tags KEY=VALUE,...]

# Deploy a prepared model to AWS Lambda
python -m merle.cli deploy --model {MODEL_NAME} --auth-token {AUTH_TOKEN}

# List all configured models
python -m merle.cli list

# Start an interactive chat session with a deployed model
python -m merle.cli chat --model {MODEL_NAME}

# Tear down a deployed Lambda function
python -m merle.cli destroy --model {MODEL_NAME}
```

Note: You can find a list of available Ollama models at https://ollama.com/library
Before deploying, ensure your AWS credentials are configured. Merle uses the standard AWS credential chain:
```shell
# Option 1: Set AWS profile (recommended for multiple accounts)
export AWS_PROFILE=your-profile-name

# Option 2: Set credentials directly
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key

# Optional: Set default region (overrides the CLI default)
export AWS_DEFAULT_REGION=us-east-1
```

Region Configuration:
- Default region: ap-northeast-1
- Override with the --region option: merle prepare --model llama2 --region us-west-2
- Or set via environment: export AWS_DEFAULT_REGION=us-west-2

Note: The region must be specified during the prepare step, as it is embedded in the deployment configuration.
merle supports two deployment topologies. Pick one with --topology at prepare / deploy time.
| Topology | Max request duration | Auth layer | When to pick it |
|---|---|---|---|
| apigw (default) | 29 seconds (API Gateway REST integration cap; cannot be raised) | API Gateway custom authorizer Lambda validates X-API-Key before the request reaches Lambda | Small models whose warm end-to-end latency is well under 29s (e.g. tinyllama, tiny quantised 1B models) |
| function-url | Up to Lambda's configured timeout_seconds (15 min max) | Lambda Function URL with AuthType=NONE; the Flask app validates X-API-Key via a before_request hook | Anything that can't finish in 29s on CPU: basically every real-world model, including schroneko/gemma-2-2b-jpn-it, llama3.2, mistral, and larger |
WARNING: 29s ceiling on apigw. API Gateway REST has a hard 29-second integration timeout that AWS does not let you raise. If cold-start + model-load + first-token exceeds 29s, the client gets HTTP 504 "Endpoint request timed out" even though the Lambda itself finishes the request (visible in CloudWatch). Streaming does not help: API Gateway buffers before flushing. For CPU inference on anything larger than toy models, choose --topology function-url.
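In practice the choice reduces to comparing your model's warm end-to-end latency against the 29-second cap with some headroom for cold starts. A minimal sketch of that rule of thumb (the choose_topology helper and the 50% margin are illustrative, not part of merle):

```python
APIGW_CAP_S = 29.0  # hard API Gateway REST integration timeout


def choose_topology(warm_latency_s: float, margin: float = 0.5) -> str:
    """Pick apigw only when warm latency fits well under the 29 s cap;
    the margin leaves headroom for cold starts and network jitter."""
    if warm_latency_s <= APIGW_CAP_S * margin:
        return "apigw"
    return "function-url"
```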
```shell
# Prepare + deploy with a Function URL (no API Gateway)
uvx merle prepare --model schroneko/gemma-2-2b-jpn-it --topology function-url
uvx merle deploy --model schroneko/gemma-2-2b-jpn-it

# Subsequent chat uses the Function URL automatically
uvx merle chat --model schroneko/gemma-2-2b-jpn-it
```

Under function-url, merle sets MERLE_REQUIRE_API_KEY=true on the Lambda. The Flask app in the container enforces X-API-Key on every route (including /health and /), matching the behaviour of the API Gateway authorizer in apigw mode. The authorizer Lambda and its IAM role are not provisioned in function-url mode.
To switch an existing deployment between topologies, destroy it first:
```shell
uvx merle destroy --model {MODEL}
uvx merle prepare --model {MODEL} --topology function-url
uvx merle deploy --model {MODEL}
```

merle proxies both Ollama's native API (/api/*) and Ollama's OpenAI-compatible surface (/v1/*). OpenAI SDK users can point at the merle deployment URL directly:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://<function-url-or-apigw-url>/v1",
    api_key="<the X-API-Key you set at prepare time>",
    default_headers={"X-API-Key": "<same token>"},
)
reply = client.chat.completions.create(
    model="schroneko/gemma-2-2b-jpn-it",
    messages=[{"role": "user", "content": "こんにちは"}],
)
```

Note: the OpenAI SDK sends the token as Authorization: Bearer ..., but merle's authorizer and in-app gate read X-API-Key. Pass the token in default_headers as shown, or set X-API-Key on each request.
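The native /api/* surface works the same way. A standard-library-only sketch of a non-streaming call (merle_headers and native_chat are hypothetical helpers written for this example, not part of merle):

```python
import json
import urllib.request


def merle_headers(token: str) -> dict:
    """Headers satisfying both the OpenAI SDK convention and merle's
    X-API-Key gate (merle only checks X-API-Key)."""
    return {
        "Authorization": f"Bearer {token}",  # what OpenAI clients send
        "X-API-Key": token,                  # what merle validates
        "Content-Type": "application/json",
    }


def native_chat(base_url: str, token: str, model: str, prompt: str) -> str:
    """Call Ollama's native /api/chat endpoint (non-streaming) through
    a merle deployment and return the reply text."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/api/chat", data=payload, headers=merle_headers(token)
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["message"]["content"]
```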
You can run merle without installing it using uvx, which executes the CLI in an isolated environment:
```shell
# Prepare deployment files (with optional region)
uvx merle prepare --model llama2 --auth-token YOUR_TOKEN --region us-east-1

# Deploy to AWS Lambda
uvx merle deploy --model llama2 --auth-token YOUR_TOKEN

# List configured models
uvx merle list

# Start interactive chat
uvx merle chat --model llama2

# Destroy deployment
uvx merle destroy --model llama2

# Check version
uvx merle --version
```

Benefits of using uvx:
- No installation required
- Always uses an isolated environment
- Fast subsequent runs due to caching
- Perfect for CI/CD pipelines and one-off commands
Note: First run may take a moment to set up the environment, but subsequent runs are nearly instant due to uv's caching.
```
zappa-merle/
├── .github/
│   └── workflows/
│       ├── register-circleci-project.yml
│       └── test.yml
├── merle/
│   ├── __init__.py
│   ├── app.py
│   ├── chat.py
│   ├── cli.py
│   ├── functions.py
│   ├── settings.py
│   └── templates/
│       ├── Dockerfile.template
│       ├── authorizer.py
│       └── zappa_settings.json.template
├── tests/
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_chat.py
│   ├── test_cli.py
│   ├── test_deployment_completeness.py
│   ├── test_docker.py
│   └── test_functions.py
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── README.md
├── pyproject.toml
└── uv.lock
```
Python: 3.13
Requires uv for dependency management

- Install pre-commit hooks (ruff). Assumes pre-commit is already installed:

```shell
pre-commit install
```

- Install project and development dependencies:

```shell
uv sync
```

Run checks:

```shell
uv run poe check
```

Run type checking:

```shell
uv run poe typecheck
```
This project uses pytest for running test cases. Test cases should be added in the tests directory. To run them, execute:

```shell
pytest -v

# Or, from the parent directory
uv run poe test
```