This repository contains the evaluation code for the BEAVER text-to-SQL benchmark. It includes (i) four text-to-SQL methods: ReFoRCE, DAIL-SQL, DIN-SQL, and Few-shot (ii) two evaluation modes: coarse-grained evaluation (execution accuracy) and fine-grained evaluation across five subtasks critical to text-to-SQL.
If you find our data, code, or the paper helpful, please cite the paper:
@article{chen2024beaver,
title={BEAVER: an enterprise benchmark for text-to-sql},
author={Chen, Peter Baile and Yang, Devin and Li, Weiyue and Wenz, Fabian and Zhang, Yi and Tatbul, Nesime and Cafarella, Michael and Demiralp, {\c{C}}a{\u{g}}atay and Stonebraker, Michael},
journal={arXiv preprint arXiv:2409.02038},
year={2024}
}
βββ .env # Credential file (API keys + MySQL database password)
βββ data/ # Dataset files (e.g. metadata, questions, tables)
β βββ dw/ # `dw` dataset
β β βββ example.json # Few-shot examples
β β βββ ...
β βββ ... # Other datasets
βββ eval/ # Evaluation methods and scripts
β βββ reforce/ # ReFoRCE evaluation pipeline
β βββ fewshot/ # Few-shot evaluation pipeline
β βββ dailsql/ # DAIL-SQL evaluation pipeline
β βββ dinsql/ # DIN-SQL evaluation pipeline
β βββ evaluate_ex_acc.py # Script for computing execution accuracy
β βββ evaluate_subtasks.py # Script for subtask evaluation
βββ retrieve/ # Table retrieval
All API keys and MySQL credentials are managed through a single .env file at the root directory.
The .env file should contain:
# LLM API Keys
OPENAI_API_KEY=xxx
OPENROUTER_API_KEY=xxx
# MySQL Credentials (shared across all databases)
MYSQL_HOST=localhost
MYSQL_USER=root
MYSQL_PASSWORD=xxx
Our gated dataset is hoted on Hugging Face. You must be authenticated to access it, which means logging in through your CLI.
python data/download_hf.py --sample [sample_size]For each of the four datasets (dw, nova, neutron, dw_real), this script automatically downloads the datasets from Hugging Face and generates:
dev.json: The full set of questions.dev_tables.json: The full set of tables.dev_sampled.json: A sampled subset of questions of sizesample_size(default100) since running on the full dataset can be computationally expensive
You should also follow the instruction here to setup the MySQL databases.
The following command applies a retrieve-then-rerank pipeline to retrieve tables in dev_tables.json that are semantically relevant to questions in dev_sampled.json, guiding downstream text-to-SQL generation. The pipeline uses a dense embedding model embed_model for retrieval and an optional reranker model rerank_model for improved ordering. If rerank_model is omitted, the pipeline performs retrieval only, without the reranking step.
python retrieve/retrieve.py --dataset [dataset] --embed_provider local --rerank_model Qwen/Qwen3-Reranker-8B
The command takes the following arguments:
dataset: one ofdw,dw_real,neutron,novaembed_model(defaultQwen/Qwen3-Embedding-8B): the dense embedding model for the retrieval stepembed_k(default50): the number of tables returned by the retrieval stepembed_provider:local(GPU required) oropenrouter(requires a validOPENROUTER_API_KEY)rerank_model(defaultNone): the optional reranker model for the reranking step; the paper usedQwen/Qwen3-Reranker-8Brerank_k(default15): the number of tables returned by the reranking step
Note: the retrieval step outputs retrieved_tables.json, and the reranking step (if enabled) outputs reranked_tables.json. While generating SQL, the text-to-SQL method uses reranked_tables.json if it exists and otherwise retrieved_tables.json.
We consider four text-to-SQL methods:
- ReFoRCE (adapted from this official ReFoRCE implementation): an agentic method with candidate generation, majority voting, and column exploration.
- DIN-SQL (adapted from this Spider2 implementation): a method with query decomposition and self-correction
- DAIL-SQL (adapted from this Spider2 implementation): a method with example selection
- Few-shot: a method with static in-context examples
You can setup the package environment using either conda or venv.
cd eval
# Using `conda`
conda create -n beaver-eval python=3.10 -y
conda activate beaver-eval
# Using `venv`
python3 -m venv beaver-eval
source beaver-eval/bin/activate
pip install -r requirements.txt
# To run the DAILβSQL method, execute the following additional commands:
python dailsql/nltk_downloader.py
python -m spacy download en_core_web_smAll methods can be executed using the run.sh script in their respective folders.
cd eval/[method]
./run.sh --model [model] --dataset [dataset] --setting {0,1,2}model: the LLM for SQL generation (e.g.,gpt-5-mini)dataset: one ofdw,dw_real,neutron,novasetting=0(default setting): Standard end-to-end setting with no hints. Base information with only the top-k tables retrieved from table retrieval.setting=1: With hints for three schema-linking subtasks. Includes gold tables, column mapping, and join keys.setting=2: With hints for all five subtasks. Includes three subtasks insetting=1, domain knowledge, and subquery decomposition.
We evaluate text-to-SQL methods in two complementary modes:
- Coarse-grained evaluation using execution accuracy. This is the standard metric, but an all-or-nothing signal makes it hard to diagnose where a method failed, especially when producing a correct SQL involves solving multiple compounded challenges, such as domain knowledge and complex query construction.
- Fine-grained evaluation across five subtasks critical to text-to-SQL performance, enabling more targeted error analysis.
Execution accuracy (ex_acc) is 1 when the generated SQL returns the same execution result as the gold SQL, and 0 otherwise.
cd eval
# Using `conda`
conda activate beaver-eval
# Using `venv`
source beaver-eval/bin/activate
python evaluate_ex_acc.py --dataset [dataset] --input_dir unified-output/[method]/[run_name]We assess five subtasks: multi-table retrieval, join key detection, column mapping, domain knowledge extraction, and query decomposition.
For each subtask (except query decomposition), we compare information extracted from the generated SQL (using a specified LLM model) with annotated information from the gold SQL. Query decomposition is evaluated via LLM-as-a-judge: the model scores how closely the structure of the generated SQL matches the gold queryβs decomposition based on humanβdesigned rubrics.
python evaluate_subtasks.py --dataset [dataset] --model [model] --input_dir unified-output/[method]/[run_name]