A production-ready, stateless, and horizontally scalable LLM inference gateway built with FastAPI.
- SSE Streaming: Proxies server-sent events with zero buffering (see the sketch after this list).
- Latency-Aware Load Balancing: Routes requests using the Peak EWMA algorithm popularized by Finagle and Linkerd.
- Circuit Breaker: Pulls degraded backends out of rotation as soon as failures appear, shielding the system from hard failures.
- Semantic Caching: A FAISS-backed vector-similarity cache plus an exact-match tier cuts latency for repeated or semantically similar queries.
- Observability: Exposes rich Prometheus metrics (latency, error rates, EWMA scores).
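
The zero-buffering claim comes down to forwarding upstream chunks the moment they arrive. A minimal sketch of the proxy path, assuming httpx as the upstream client; the backend URL and endpoint are illustrative placeholders, not the gateway's real routing:

```python
# Minimal sketch of a zero-buffering SSE proxy, assuming httpx upstream.
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()
client = httpx.AsyncClient(timeout=None)  # no read timeout on long streams

@app.post("/v1/completions")
async def proxy_completions(request: Request):
    payload = await request.json()
    backend = "http://localhost:8001/v1/completions"  # illustrative backend

    async def event_stream():
        # Forward each upstream chunk as it arrives; nothing is held
        # beyond the chunk currently in flight.
        async with client.stream("POST", backend, json=payload) as upstream:
            async for chunk in upstream.aiter_bytes():
                yield chunk

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

Keeping the `async with` inside the generator means the upstream connection is closed whenever the generator is closed, including on client disconnect.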
```
Client
  ↓
FastAPI Gateway ←→ Semantic Cache (FAISS + Redis-like LRU)
  ↓
Router (Peak EWMA + Circuit Breaker)
  ↓
Model Backends (vLLM / Mock Servers)

Prometheus → Grafana (Metrics Pipeline)
```
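
The Router box above combines two signals. Here is a minimal sketch of the Peak EWMA half (class names and the 10 s decay horizon are assumptions, not the gateway's actual code): each backend's latency EWMA decays over time but jumps straight to any new peak, and the score scales that estimate by in-flight load, so a single slow response deprioritizes a backend on the very next pick.

```python
# Illustrative Peak EWMA scoring; lower score wins.
import math
import time

DECAY_SECONDS = 10.0  # assumed decay horizon for the latency estimate

class BackendStats:
    def __init__(self) -> None:
        self.ewma_ms = 0.0         # smoothed latency estimate
        self.inflight = 0          # outstanding requests
        self._last = time.monotonic()

    def observe(self, latency_ms: float) -> None:
        now = time.monotonic()
        dt, self._last = now - self._last, now
        if latency_ms > self.ewma_ms:
            # "Peak" behavior: a latency spike is adopted immediately,
            # so the next routing decision already sees the degradation.
            self.ewma_ms = latency_ms
        else:
            # Otherwise decay smoothly toward the new sample.
            w = math.exp(-dt / DECAY_SECONDS)
            self.ewma_ms = self.ewma_ms * w + latency_ms * (1.0 - w)

    def score(self) -> float:
        # Latency estimate scaled by load; lower is better.
        return self.ewma_ms * (self.inflight + 1)

def pick(backends: dict[str, BackendStats]) -> str:
    # A tripped circuit breaker would filter a backend out of this dict.
    return min(backends, key=lambda name: backends[name].score())
```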
```bash
docker-compose up --build
```

This starts:

- The gateway on port `8000`.
- Two mock streaming backends on ports `8001` and `8002`.
- Prometheus on port `9090`.
- Grafana on port `3000`.
```bash
curl -N -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello world", "stream": true}'
```

Watch the response stream token by token. Send the same request again, and it will return instantly via the semantic cache.
- State is held in-process per gateway replica, so circuit-breaker state can differ between replicas: each one observes its own network path to the backends.
- Cache Invalidation: FAISS does not support efficient vector deletion for most index types. We keep an ID-to-metadata mapping, drop stale entries logically, and rebuild the index in the background (sketched below).
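
A minimal sketch of that scheme (names are illustrative): invalidation only records a tombstone, and a background rebuild constructs a fresh index from the surviving vectors, which the gateway can swap in with a single reference assignment.

```python
# Illustrative tombstone scheme: invalidation never mutates FAISS on the
# hot path; a background task rebuilds from live vectors only.
import faiss
import numpy as np

tombstones: set[int] = set()   # logically deleted vector IDs

def invalidate(vec_id: int) -> None:
    tombstones.add(vec_id)     # O(1), no index mutation

def rebuild(dim: int, vectors: list[np.ndarray]) -> faiss.IndexFlatIP:
    live = [v for i, v in enumerate(vectors) if i not in tombstones]
    fresh = faiss.IndexFlatIP(dim)
    if live:
        fresh.add(np.vstack(live).astype("float32"))
    # Caller swaps this in atomically, then resets the tombstone set and
    # remaps the ID-to-metadata table onto the new, compacted IDs.
    return fresh
```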
- CPU Bottleneck: The embedding model runs inside the gateway process. At high concurrency it can block the async event loop unless isolated via `run_in_executor` (sketched below). Under sustained load, externalize embeddings to a sidecar service.
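
A minimal sketch of that isolation; the dedicated pool, its size, and the `embed_async` helper are assumptions for illustration:

```python
# Keep the CPU-bound embedding call off the event loop via run_in_executor.
import asyncio
from concurrent.futures import ThreadPoolExecutor

embed_pool = ThreadPoolExecutor(max_workers=2)  # isolated from the default pool

async def embed_async(model, prompt: str):
    loop = asyncio.get_running_loop()
    # model.encode is CPU-bound; running it in the pool keeps the loop
    # free to keep shuttling SSE chunks for in-flight streams.
    return await loop.run_in_executor(embed_pool, model.encode, [prompt])
```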