Skip to content

Ref34t/evalspec

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

evalspec

Gold-label evaluation framework for LLM agents. Measure what your model actually does, track it over time, and catch regressions before they ship.

Quick start

pip install evalspec

# Create a dataset
cat > datasets/analytics.yaml <<EOF
questions:
  - id: Q-001
    question: "How many swaps happened last quarter?"
    gold_answer: "tool_call"
    expected_tool: "get_swap_counts"
    expected_behaviour: "answer_with_citation"
    language: EN
EOF

# Run against OpenAI
export OPENAI_API_KEY="sk-..."
evalspec-run --all --provider openai --model gpt-4o --tag baseline-v1

# Record baseline and check for regressions
evalspec-regression --record baselines/gpt4o.json --provider openai --model gpt-4o
evalspec-regression --check baselines/gpt4o.json --provider openai --model gpt-4o

Gold labels

Every question gets a gold_answer that defines correct behavior:

Label Meaning Measured by
ABSTAIN Model must refuse was_refused=True
CLARIFY Model must ask for clarification asked_clarification=True
tool_call Model must call the expected tool expected_tool in called tools

Agents

Provider Flag Environment
OpenAI --provider openai --model gpt-4o OPENAI_API_KEY
opencode --provider opencode --model deepseek opencode CLI
Mock --mock --mock-mode perfect None
HTTP --agent-url http://localhost:8080 None

CLI tools

Command Purpose
evalspec-run Run evaluation harness
evalspec-split 80/20 stratified holdout split
evalspec-compare Side-by-side model comparison
evalspec-regression CI regression gate
evalspec-leakage Parametric leakage filter

Model comparison

evalspec-run --all --provider openai --model gpt-4o --tag gpt4o --report runs/gpt4o.json
evalspec-run --all --provider openai --model gpt-4o-mini --tag gpt4o-mini --report runs/gpt4o-mini.json
evalspec-compare runs/gpt4o.json runs/gpt4o-mini.json --html compare.html

CI gate

# .github/workflows/eval.yml
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install evalspec openai pyyaml
      - run: evalspec-regression --check baselines/gpt4o.json -p openai --model gpt-4o
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Philosophy

  • Gold labels, not heuristics: Every question defines what "correct" means (refuse, clarify, or call a specific tool).
  • Version-frozen: Reports include dataset hashes so you know exactly which corpus a score refers to.
  • Held-out split: 80/20 stratified split prevents overfitting to the eval set.
  • CI-gated: 3% regression tolerance means prompt or model changes don't silently degrade quality.

License

MIT

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages