Gold-label evaluation framework for LLM agents. Measure what your model actually does, track it over time, and catch regressions before they ship.
pip install evalspec
# Create a dataset
cat > datasets/analytics.yaml <<EOF
questions:
- id: Q-001
question: "How many swaps happened last quarter?"
gold_answer: "tool_call"
expected_tool: "get_swap_counts"
expected_behaviour: "answer_with_citation"
language: EN
EOF
# Run against OpenAI
export OPENAI_API_KEY="sk-..."
evalspec-run --all --provider openai --model gpt-4o --tag baseline-v1
# Record baseline and check for regressions
evalspec-regression --record baselines/gpt4o.json --provider openai --model gpt-4o
evalspec-regression --check baselines/gpt4o.json --provider openai --model gpt-4oEvery question gets a gold_answer that defines correct behavior:
| Label | Meaning | Measured by |
|---|---|---|
ABSTAIN |
Model must refuse | was_refused=True |
CLARIFY |
Model must ask for clarification | asked_clarification=True |
tool_call |
Model must call the expected tool | expected_tool in called tools |
| Provider | Flag | Environment |
|---|---|---|
| OpenAI | --provider openai --model gpt-4o |
OPENAI_API_KEY |
| opencode | --provider opencode --model deepseek |
opencode CLI |
| Mock | --mock --mock-mode perfect |
None |
| HTTP | --agent-url http://localhost:8080 |
None |
| Command | Purpose |
|---|---|
evalspec-run |
Run evaluation harness |
evalspec-split |
80/20 stratified holdout split |
evalspec-compare |
Side-by-side model comparison |
evalspec-regression |
CI regression gate |
evalspec-leakage |
Parametric leakage filter |
evalspec-run --all --provider openai --model gpt-4o --tag gpt4o --report runs/gpt4o.json
evalspec-run --all --provider openai --model gpt-4o-mini --tag gpt4o-mini --report runs/gpt4o-mini.json
evalspec-compare runs/gpt4o.json runs/gpt4o-mini.json --html compare.html# .github/workflows/eval.yml
on: [pull_request]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install evalspec openai pyyaml
- run: evalspec-regression --check baselines/gpt4o.json -p openai --model gpt-4o
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}- Gold labels, not heuristics: Every question defines what "correct" means (refuse, clarify, or call a specific tool).
- Version-frozen: Reports include dataset hashes so you know exactly which corpus a score refers to.
- Held-out split: 80/20 stratified split prevents overfitting to the eval set.
- CI-gated: 3% regression tolerance means prompt or model changes don't silently degrade quality.
MIT