An agent + web interface to benchmark LLMs (Claude / GPT / Gemini) at writing GSQL from natural language. Pick a Workspace → Graph, select one/many/all benchmark questions and the models, run the batch, and read a detailed report — each query is scored through a three-layer gate: does it run → is the result correct → how well-written is it.
Status: Phase A + Iteration A.1, mocks end-to-end. The full generate → execute → retry×3 → score loop, the product web UI (Workspace→Graph→batch→report), and the scoring gates (Layers 1–2 + tries-decay) run entirely on mocks: no API keys, no TigerGraph. Real providers (Phase B) and the full Layer-3 quality scoring (Phase C) are stubbed behind clean interfaces and raise NotImplementedError rather than silently faking.
Plan: ~/.claude/plans/do-one-thing-do-nifty-anchor.md.
- Workspace → Graph model: a mock "Demo Workspace" containing an AML graph and a Transactions graph; questions filter to the selected graph.
- 8 seed questions (one per tier × graph) and 4 mock "models" (first-try ace, retrier, struggler, total fail).
- The agent loop: 3 attempts, locked feedback (raw TG error on a failed run; one frozen string on a wrong result — never the answer), tries-decay 20 / 12 / 6 / 0.
- Layer 2 Result Comparator: float tolerance, unordered set/multiset, top-N tie groups, spurious-column projection, null normalization.
- Batch benchmark over (selected questions × selected models) with live SSE progress, then a detailed report: per-tier first-attempt accuracy and score/100 matrices, recorded-not-scored telemetry (avg iterations, tokens, latency, cost/correct), per-question drill-down into each model's attempts, and CSV/JSON export.
- Per-attempt telemetry and a run manifest on every run.
- 25 backend tests; quality columns (Layer 3, 80 pts) show as pending until real models are connected.
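The Layer 2 comparator behaviours listed above can be illustrated with a small sketch. This is not the project's comparator: the function names (`normalize_cell`, `cells_equal`, `rows_match`), the tolerance defaults and the set of null spellings are assumptions made for illustration. It covers float tolerance, unordered multiset comparison and null normalization only; top-N tie groups and spurious-column projection are left out.

```python
import math
from typing import Any

# Assumed null spellings; the real comparator may normalize a different set.
NULL_TOKENS = {"null", "none"}


def normalize_cell(value: Any) -> Any:
    """Map null-like values to None and numbers to float so they compare uniformly."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    if isinstance(value, bool):
        return value  # keep booleans as-is (bool is a subclass of int)
    if isinstance(value, (int, float)):
        return float(value)
    return value


def cells_equal(a: Any, b: Any, rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> bool:
    """Compare two cells after normalization, using a tolerance for floats."""
    a, b = normalize_cell(a), normalize_cell(b)
    if isinstance(a, float) and isinstance(b, float):
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    return a == b


def rows_match(expected: list[tuple], actual: list[tuple]) -> bool:
    """Order-insensitive multiset comparison: every expected row must pair
    with a distinct actual row, and the row counts must be equal."""
    if len(expected) != len(actual):
        return False
    remaining = list(actual)
    for exp_row in expected:
        for i, act_row in enumerate(remaining):
            if len(exp_row) == len(act_row) and all(
                cells_equal(e, a) for e, a in zip(exp_row, act_row)
            ):
                del remaining[i]  # consume the matched row so duplicates count
                break
        else:
            return False  # no unmatched actual row equals this expected row
    return True
```

The pairing is greedy and quadratic in the number of rows, which is acceptable for small benchmark results but is a simplification; tolerance-based equality is not transitive, so a stricter implementation might need a proper matching step.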
- Python 3.12, Node 18+, Docker (for Postgres).
# 1. Postgres
docker compose up -d db
# 2. Backend
cd backend
python -m venv .venv
# Windows: ./.venv/Scripts/python -m pip install -e ".[dev]"
# macOS/Linux: ./.venv/bin/python -m pip install -e ".[dev]"
./.venv/Scripts/python -m uvicorn app.main:app --reload # http://127.0.0.1:8000 (macOS/Linux: use ./.venv/bin/python)
# 3. Frontend (separate terminal)
cd frontend
npm install
npm run dev # http://localhost:5173

Open http://localhost:5173, pick a question, hit ▶ Run & watch on a model, or check two+ models and Compare.
cd backend && ./.venv/Scripts/python -m app.demo

cd backend
./.venv/Scripts/python -m pytest -q # 20 tests
./.venv/Scripts/python -m ruff check app tests

React/Vite UI ──HTTP/SSE──► FastAPI ──► Run Orchestrator (state machine)
                                            │ codes against interfaces only
          ┌─────────────────────────────────┼──────────────────────────────┐
LLMProvider (Mock│LiteLLM*)    TGAdapter (Mock│Savanna*)            Scoring Engine
                                 + WorkspaceManager        L1 → L2 Comparator → L3* + tries
          └──────────────── Telemetry ───────┴────────► Postgres (SQLAlchemy) ──► Aggregator
(* = Phase B/C)
Two hard seams — TGAdapter and LLMProvider — each with a Mock. The orchestrator and
scoring engine are database-free pure logic, so they're unit-tested without Postgres.
backend/app/
orchestrator/ state_machine (runner.py), locked feedback, prompt
adapters/tg/ TGAdapter + MockTGAdapter + WorkspaceManager
adapters/llm/ LLMProvider + MockLLMProvider
scoring/ comparator (L2), engine (gates + tries), static_analyzer + judge (Phase C stubs)
catalog/ YAML question-bank loader
models/ SQLAlchemy entities
api/ routes, persistence, seed
questions/ aml/*.yaml, txn/*.yaml (ground-truth specs — server-side only)
frontend/src/ QuestionBrowser, RunPanel, ResultView, CompareGrid
Question YAML holds the ground-truth result and comparator/allow-set tags. The question API serves a public projection that strips all of it; the wrong-result retry feedback is a single frozen constant with zero ground-truth bytes (enforced by a test). Leaking ground truth into a model's context would contaminate the benchmark and invalidate every number.
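One way to picture the public projection and the frozen feedback is sketched below. The key names (`ground_truth`, `comparator`, `allow_set`), the example question, its id and the feedback wording are all hypothetical; only the idea of stripping server-side fields and checking that the feedback carries no ground-truth values is taken from the paragraph above.

```python
# Hypothetical server-side keys; the real YAML schema may differ.
PRIVATE_KEYS = frozenset({"ground_truth", "comparator", "allow_set"})

# Frozen feedback for a wrong result: a constant, never built from the answer.
WRONG_RESULT_FEEDBACK = "The query ran, but its result does not match the expected result."


def public_projection(question: dict) -> dict:
    """Return a copy of the question with every server-side key removed."""
    return {k: v for k, v in question.items() if k not in PRIVATE_KEYS}


def leaks_ground_truth(text: str, ground_truth: list[list]) -> bool:
    """True if any ground-truth cell value appears verbatim in the text."""
    return any(str(cell) in text for row in ground_truth for cell in row)


# Made-up question record, used only to exercise the two helpers.
question = {
    "id": "aml-t1-001",
    "text": "How many accounts sent more than 10 transfers?",
    "ground_truth": [[4217]],
    "comparator": {"float_tol": 1e-6},
}
public = public_projection(question)
```

A test in this spirit could loop over every question in the bank and assert that `leaks_ground_truth(WRONG_RESULT_FEEDBACK, q["ground_truth"])` is false and that the projected question has no key in `PRIVATE_KEYS`.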