Nikhil172913832/LLMInferenceGateway
LLM Inference Gateway

A production-ready, stateless, and horizontally scalable LLM inference gateway built with FastAPI.

Features

  • SSE Streaming: Proxies server-sent events with zero buffering.
  • Latency-Aware Load Balancing: Routes requests using the Peak EWMA algorithm (popularized by Finagle and Linkerd), favoring backends with the lowest recent latency.
  • Circuit Breaker: Instantly deprioritizes degraded backends, shielding the system from cascading failures.
  • Semantic Caching: An exact-match cache plus a FAISS-backed vector-similarity index cuts latency for repeated or near-duplicate queries.
  • Observability: Exposes rich Prometheus metrics (latency, error rates, EWMA scores).
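As a rough illustration of the latency-aware routing above: Peak EWMA keeps a per-backend latency estimate that jumps immediately on slow responses but decays back smoothly, then picks between two randomly sampled backends by score. This is a minimal sketch under assumed semantics; the class and field names are illustrative, not taken from the gateway's code:

```python
import math
import random
import time

class Backend:
    """Tracks a peak-decaying EWMA of response latency (illustrative)."""

    def __init__(self, url: str, decay_s: float = 10.0):
        self.url = url
        self.decay_s = decay_s        # time constant for the exponential decay
        self.cost = 0.0               # current latency estimate (seconds)
        self.last_update = time.monotonic()
        self.inflight = 0             # outstanding requests

    def observe(self, latency_s: float) -> None:
        now = time.monotonic()
        dt = now - self.last_update
        self.last_update = now
        w = math.exp(-dt / self.decay_s)
        if latency_s > self.cost:
            # "Peak" behavior: react to latency spikes immediately...
            self.cost = latency_s
        else:
            # ...but decay back toward the new sample smoothly.
            self.cost = self.cost * w + latency_s * (1 - w)

    def score(self) -> float:
        # Penalize backends with many in-flight requests (least-request flavor).
        return self.cost * (self.inflight + 1)

def pick(backends: list[Backend]) -> Backend:
    """Power-of-two-choices: sample two backends, take the lower score."""
    a, b = random.sample(backends, 2)
    return a if a.score() <= b.score() else b
```

The power-of-two-choices sampling avoids the herd effect of always sending traffic to the single best-scored backend.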

Architecture

Client
  ↓
FastAPI Gateway  ←→  Semantic Cache (FAISS + Redis-like LRU)
  ↓
Router (Peak EWMA + Circuit Breaker)
  ↓
Model Backends (vLLM / Mock Servers)

Prometheus → Grafana (Metrics Pipeline)

Running the Project

docker-compose up --build

This starts:

  1. The gateway on port 8000.
  2. Two mock streaming backends on ports 8001 and 8002.
  3. Prometheus on port 9090.
  4. Grafana on port 3000.

Testing the Gateway

curl -N -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello world", "stream": true}'

Watch the response stream token by token. Send the same request again, and it will return instantly via the semantic cache.

Known Limitations and Tradeoffs

  • In-Process State: Routing and circuit-breaker state live in each gateway replica, so replicas may disagree about backend health depending on the failures each one observes.
  • Cache Invalidation: FAISS has limited support for deleting vectors, so stale entries are dropped logically via an ID-to-metadata mapping and the index is rebuilt in the background.
  • CPU Bottleneck: The embedding model runs inside the gateway process. Under high concurrency it can block the async event loop unless isolated via run_in_executor; at sustained high load, move embedding into a sidecar service.
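The run_in_executor isolation mentioned in the last bullet might look like the following sketch, where embed_sync is a stand-in for the real embedding model's encode call:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool so CPU-bound embedding work never runs on the event loop
# thread; sizing it caps how much embedding work runs concurrently.
_executor = ThreadPoolExecutor(max_workers=2)

def embed_sync(text: str) -> list[float]:
    # Placeholder for a real model call, e.g. SentenceTransformer.encode().
    # This toy version derives two floats from the text deterministically.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

async def embed(text: str) -> list[float]:
    loop = asyncio.get_running_loop()
    # Offload to the pool; the event loop stays free to serve other requests.
    return await loop.run_in_executor(_executor, embed_sync, text)
```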
