
devanshu-tg/benchmark_agent


NL→GSQL Benchmark Agent

An agent and web interface for benchmarking how well LLMs (Claude / GPT / Gemini) write GSQL from natural language. Pick a Workspace → Graph, select one, several, or all benchmark questions and the models to test, run the batch, and read a detailed report. Each query is scored through a three-layer gate: does it run → is the result correct → how well is it written.

Status: Phase A + Iteration A.1 — mocks end-to-end. The full generate → execute → retry×3 → score loop, the product web UI (Workspace→Graph→batch→report), and the scoring gates (Layers 1–2 + tries-decay) run entirely on mocks — no API keys, no TigerGraph. Real providers (Phase B) and the full Layer-3 quality scoring (Phase C) are stubbed behind clean interfaces and raise NotImplementedError rather than silently faking. Plan: ~/.claude/plans/do-one-thing-do-nifty-anchor.md.

What works today

  • Workspace → Graph model: a mock "Demo Workspace" containing an AML graph and a Transactions graph; questions filter to the selected graph.
  • 8 seed questions (one per tier × graph) and 4 mock "models" (first-try ace, retrier, struggler, total fail).
  • The agent loop: 3 attempts with locked feedback (the raw TG error on a failed run; one frozen string on a wrong result, never the answer) and tries-decay 20 / 12 / 6 / 0 (points for passing on attempt 1 / 2 / 3, or 0 if no attempt passes).
  • Layer 2 Result Comparator: float tolerance, unordered set/multiset, top-N tie groups, spurious-column projection, null normalization.
  • Batch benchmark over (selected questions × selected models) with live SSE progress, then a detailed report: per-tier first-attempt accuracy and score/100 matrices, recorded-not-scored telemetry (avg iterations, tokens, latency, cost/correct), per-question drill-down into each model's attempts, and CSV/JSON export.
  • Per-attempt telemetry and a run manifest on every run.
  • 25 backend tests; quality columns (Layer 3, 80 pts) show as pending until real models.
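As an illustration of the Layer 2 Result Comparator bullet above, here is a minimal sketch of three of the listed behaviours: float tolerance, unordered multiset comparison, and null normalization, plus projection onto the expected columns so spurious extra columns are ignored. It is not the repository's implementation. The function names, the tolerance values, and the set of strings treated as null are all assumptions, and top-N tie groups are left out.

```python
import math

# Assumption: these spellings count as null; the real comparator may differ.
NULL_STRINGS = {"", "null", "none"}


def normalize_null(value):
    """Collapse None, NaN, and null-like strings to None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, str) and value.strip().lower() in NULL_STRINGS:
        return None
    return value


def cells_equal(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two cells, using a tolerance for numbers (tolerances are assumed)."""
    a, b = normalize_null(a), normalize_null(b)
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, bool) or isinstance(b, bool):
        return a is b  # keep True distinct from 1
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    return a == b


def rows_equal(expected, actual, columns):
    """Compare rows on the expected columns only; extra actual columns are ignored."""
    return all(c in actual and cells_equal(expected[c], actual[c]) for c in columns)


def results_match(expected_rows, actual_rows):
    """Order-insensitive multiset comparison: each expected row consumes one actual row."""
    if len(expected_rows) != len(actual_rows):
        return False
    if not expected_rows:
        return True
    columns = list(expected_rows[0].keys())
    remaining = list(actual_rows)
    for exp in expected_rows:
        for i, act in enumerate(remaining):
            if rows_equal(exp, act, columns):
                del remaining[i]
                break
        else:
            return False  # no unmatched actual row equals this expected row
    return True
```

The greedy row matching is quadratic and is only meant to show the idea; a production comparator would likely canonicalize rows first.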

Prerequisites

  • Python 3.12, Node 18+, Docker (for Postgres).

Quickstart

# 1. Postgres
docker compose up -d db

# 2. Backend
cd backend
python -m venv .venv
# Windows paths are shown; on macOS/Linux use ./.venv/bin/python
# in place of ./.venv/Scripts/python in this and the following commands.
./.venv/Scripts/python -m pip install -e ".[dev]"
./.venv/Scripts/python -m uvicorn app.main:app --reload   # http://127.0.0.1:8000

# 3. Frontend (separate terminal)
cd frontend
npm install
npm run dev                                               # http://localhost:5173

Open http://localhost:5173, pick a question, and either hit ▶ Run & watch on a single model or check two or more models and hit Compare.

Zero-setup CLI demo (no DB, no server, no creds)

cd backend && ./.venv/Scripts/python -m app.demo

Tests & lint

cd backend
./.venv/Scripts/python -m pytest -q      # backend test suite
./.venv/Scripts/python -m ruff check app tests

Architecture

React/Vite UI ──HTTP/SSE──► FastAPI ──► Run Orchestrator (state machine)
                                          │  codes against interfaces only
        ┌─────────────────────────────────┼──────────────────────────────┐
   LLMProvider (Mock│LiteLLM*)      TGAdapter (Mock│Savanna*)        Scoring Engine
                                    + WorkspaceManager          L1 → L2 Comparator → L3* + tries
        └──────────────── Telemetry ───────┴────────► Postgres (SQLAlchemy) ──► Aggregator
                                                              (* = Phase B/C)

Two hard seams — TGAdapter and LLMProvider — each with a Mock. The orchestrator and scoring engine are database-free pure logic, so they're unit-tested without Postgres.
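To make the two seams concrete, the following is a rough sketch of how an orchestrator might code against interfaces only and run the three-attempt loop with locked feedback and tries-decay. The class names LLMProvider, TGAdapter, MockLLMProvider, and MockTGAdapter come from this README; everything else (the method names `generate` and `execute`, the `ExecResult` type, the feedback wording, and the plain equality check standing in for the Layer 2 comparator) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol

TRIES_POINTS = (20, 12, 6)  # points for a pass on attempt 1 / 2 / 3; otherwise 0
# Hypothetical wording; the point is that it is one constant with no ground truth in it.
WRONG_RESULT_FEEDBACK = "The query ran but returned an incorrect result."


@dataclass
class ExecResult:
    ok: bool
    rows: list = field(default_factory=list)
    error: str = ""


class LLMProvider(Protocol):
    def generate(self, question: str, feedback: Optional[str]) -> str: ...


class TGAdapter(Protocol):
    def execute(self, gsql: str) -> ExecResult: ...


class MockLLMProvider:
    """Replays a scripted list of queries, one per attempt."""

    def __init__(self, scripted):
        self._scripted = list(scripted)

    def generate(self, question, feedback):
        return self._scripted.pop(0)


class MockTGAdapter:
    """Maps known query strings to rows; anything else is a failed run."""

    def __init__(self, table):
        self._table = table

    def execute(self, gsql):
        if gsql not in self._table:
            return ExecResult(ok=False, error=f"mock error near: {gsql}")
        return ExecResult(ok=True, rows=self._table[gsql])


def run_question(question, expected_rows, llm, tg, max_attempts=3):
    """Return (attempts_used, tries_points) for one question and one model."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        gsql = llm.generate(question, feedback)
        result = tg.execute(gsql)
        if not result.ok:
            feedback = result.error  # Layer 1 failed: pass the raw error back
            continue
        if result.rows != expected_rows:  # stand-in for the Layer 2 comparator
            feedback = WRONG_RESULT_FEEDBACK  # frozen string, never the answer
            continue
        return attempt, TRIES_POINTS[attempt - 1]
    return max_attempts, 0
```

Because `run_question` only touches the two protocols, swapping a mock for a real provider would not change the loop, and the loop can be tested without a database.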

backend/app/
  orchestrator/   state_machine (runner.py), locked feedback, prompt
  adapters/tg/    TGAdapter + MockTGAdapter + WorkspaceManager
  adapters/llm/   LLMProvider + MockLLMProvider
  scoring/        comparator (L2), engine (gates + tries), static_analyzer + judge (Phase C stubs)
  catalog/        YAML question-bank loader
  models/         SQLAlchemy entities
  api/            routes, persistence, seed
questions/        aml/*.yaml, txn/*.yaml   (ground-truth specs — server-side only)
frontend/src/     QuestionBrowser, RunPanel, ResultView, CompareGrid

Ground truth is never leaked

Question YAML holds the ground-truth result and comparator/allow-set tags. The question API serves a public projection that strips all of it; the wrong-result retry feedback is a single frozen constant with zero ground-truth bytes (enforced by a test). Leaking ground truth to a model would contaminate the benchmark and invalidate every number.
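One way to picture the public projection is an allow-list: only fields named as public are copied out, so a newly added private field cannot leak by accident. This is a sketch under assumed field names (`id`, `graph`, `tier`, `text`, `ground_truth`, `comparator`), not the schema the repository actually uses.

```python
# Hypothetical field names; the real question schema may differ.
PUBLIC_FIELDS = ("id", "graph", "tier", "text")


def public_projection(question: dict) -> dict:
    """Copy only allow-listed fields; ground truth and comparator tags never pass through."""
    return {key: question[key] for key in PUBLIC_FIELDS if key in question}


# Example question as it might look after loading from YAML (values are made up).
question = {
    "id": "aml-t1",
    "graph": "AML",
    "tier": 1,
    "text": "How many accounts are flagged?",
    "ground_truth": [{"count": 42}],
    "comparator": "multiset",
}

public = public_projection(question)
```

An allow-list is chosen here over a deny-list because it fails closed: forgetting to update it hides a field rather than exposing one.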
