A production-ready, stateless, and horizontally scalable LLM inference gateway built with FastAPI.
- SSE Streaming: Proxies server-sent events with zero buffering (see the sketch after this list).
- Latency-Aware Load Balancing: Routes requests using the Peak EWMA algorithm popularized by Finagle and Linkerd.
- Circuit Breaker: Pulls degraded backends out of rotation as soon as failures appear, shielding the system from hard failures.
- Semantic Caching: A FAISS-backed vector-similarity cache plus an exact-match tier cuts latency for repeated or semantically similar queries.
- Observability: Exposes rich Prometheus metrics (latency, error rates, EWMA scores).
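
The zero-buffering claim comes down to forwarding upstream chunks the moment they arrive. A minimal sketch of the proxy path, assuming httpx as the upstream client; the backend URL and endpoint are illustrative placeholders, not the gateway's real routing:

```python
# Minimal sketch of a zero-buffering SSE proxy, assuming httpx upstream.
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()
client = httpx.AsyncClient(timeout=None)  # no read timeout on long streams

@app.post("/v1/completions")
async def proxy_completions(request: Request):
    payload = await request.json()
    backend = "http://localhost:8001/v1/completions"  # illustrative backend

    async def event_stream():
        # Forward each upstream chunk as it arrives; nothing is held
        # beyond the chunk currently in flight.
        async with client.stream("POST", backend, json=payload) as upstream:
            async for chunk in upstream.aiter_bytes():
                yield chunk

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

Keeping the `async with` inside the generator means the upstream connection is closed whenever the generator is closed, including on client disconnect.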
```
Client
  ↓
FastAPI Gateway ←→ Semantic Cache (FAISS + Redis-like LRU)
  ↓
Router (Peak EWMA + Circuit Breaker)
  ↓
Model Backends (vLLM / Mock Servers)

Prometheus → Grafana (Metrics Pipeline)
```
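
The Router box above combines two signals. Here is a minimal sketch of the Peak EWMA half (class names and the 10 s decay horizon are assumptions, not the gateway's actual code): each backend's latency EWMA decays over time but jumps straight to any new peak, and the score scales that estimate by in-flight load, so a single slow response deprioritizes a backend on the very next pick.

```python
# Illustrative Peak EWMA scoring; lower score wins.
import math
import time

DECAY_SECONDS = 10.0  # assumed decay horizon for the latency estimate

class BackendStats:
    def __init__(self) -> None:
        self.ewma_ms = 0.0         # smoothed latency estimate
        self.inflight = 0          # outstanding requests
        self._last = time.monotonic()

    def observe(self, latency_ms: float) -> None:
        now = time.monotonic()
        dt, self._last = now - self._last, now
        if latency_ms > self.ewma_ms:
            # "Peak" behavior: a latency spike is adopted immediately,
            # so the next routing decision already sees the degradation.
            self.ewma_ms = latency_ms
        else:
            # Otherwise decay smoothly toward the new sample.
            w = math.exp(-dt / DECAY_SECONDS)
            self.ewma_ms = self.ewma_ms * w + latency_ms * (1.0 - w)

    def score(self) -> float:
        # Latency estimate scaled by load; lower is better.
        return self.ewma_ms * (self.inflight + 1)

def pick(backends: dict[str, BackendStats]) -> str:
    # A tripped circuit breaker would filter a backend out of this dict.
    return min(backends, key=lambda name: backends[name].score())
```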
```bash
docker-compose up --build
```

This starts:

- The gateway on port `8000`.
- Two mock streaming backends on ports `8001` and `8002`.
- Prometheus on port `9090`.
- Grafana on port `3000`.
```bash
curl -N -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello world", "stream": true}'
```

Watch the response stream token by token. Send the same request again, and it will return instantly via the semantic cache.
- State is held in-process per gateway replica, so circuit-breaker state can differ between replicas: each one observes its own network path to the backends.
- Cache Invalidation: FAISS does not support efficient vector deletion for most index types. We keep an ID-to-metadata mapping, drop stale entries logically, and rebuild the index in the background (sketched below).
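
A minimal sketch of that scheme (names are illustrative): invalidation only records a tombstone, and a background rebuild constructs a fresh index from the surviving vectors, which the gateway can swap in with a single reference assignment.

```python
# Illustrative tombstone scheme: invalidation never mutates FAISS on the
# hot path; a background task rebuilds from live vectors only.
import faiss
import numpy as np

tombstones: set[int] = set()   # logically deleted vector IDs

def invalidate(vec_id: int) -> None:
    tombstones.add(vec_id)     # O(1), no index mutation

def rebuild(dim: int, vectors: list[np.ndarray]) -> faiss.IndexFlatIP:
    live = [v for i, v in enumerate(vectors) if i not in tombstones]
    fresh = faiss.IndexFlatIP(dim)
    if live:
        fresh.add(np.vstack(live).astype("float32"))
    # Caller swaps this in atomically, then resets the tombstone set and
    # remaps the ID-to-metadata table onto the new, compacted IDs.
    return fresh
```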
- CPU Bottleneck: The embedding model runs inside the gateway process. At high concurrency it can block the async event loop unless isolated via `run_in_executor` (sketched below). Under sustained load, externalize embeddings to a sidecar service.
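
A minimal sketch of that isolation; the dedicated pool, its size, and the `embed_async` helper are assumptions for illustration:

```python
# Keep the CPU-bound embedding call off the event loop via run_in_executor.
import asyncio
from concurrent.futures import ThreadPoolExecutor

embed_pool = ThreadPoolExecutor(max_workers=2)  # isolated from the default pool

async def embed_async(model, prompt: str):
    loop = asyncio.get_running_loop()
    # model.encode is CPU-bound; running it in the pool keeps the loop
    # free to keep shuttling SSE chunks for in-flight streams.
    return await loop.run_in_executor(embed_pool, model.encode, [prompt])
```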