Problem
The external-benchmark roadmap now selects BFCL as the cheapest public size-axis benchmark for the policy-vs-free-form contrast, but @tangle-network/agent-bench does not yet expose a BFCL adapter.
Desired adapter
Add a fail-loud bfcl adapter under bench/src/benchmarks/ that delegates to the official Berkeley Function Calling Leaderboard assets/evaluator where possible. The first useful scope is deterministic function-call categories suitable for weak-vs-gold calibration and small-vs-large model comparisons.
Constraints
- Do not fabricate BFCL scores.
- Fixture mode may test adapter plumbing, but official mode must require the real Gorilla/BFCL checkout or bfcl-eval project root.
- Keep BFCL V4/V3 naming current: V4 is the latest official line, while V3 multi-turn/missing-function categories are the immediate research target if they remain the best fit.
- Expose through the existing ADAPTERS map; no runtime-loop changes.
- Add preflight/load/judge tests that fail loud without official assets and pass on fixtures.
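The fixture/official split above could be enforced by a preflight along the following lines. This is only a sketch under stated assumptions: the real agent-bench adapter interface is not shown in this issue, so `BfclAdapterConfig`, `preflightBfcl`, the `officialRoot` field, and the injected `pathExists` callback are all hypothetical names chosen for illustration.

```typescript
// Hypothetical shapes: the actual adapter interface in agent-bench may differ.
type BfclMode = "fixture" | "official";

interface BfclAdapterConfig {
  mode: BfclMode;
  // Assumed field: path to a Gorilla/BFCL checkout or bfcl-eval project root.
  officialRoot?: string;
}

interface BfclPreflightResult {
  mode: BfclMode;
  root: string | null;
}

// Fail-loud preflight: official mode throws when assets are missing and
// never falls back to fixtures, so no score can be produced without them.
// `pathExists` is injected so the check is testable without touching disk.
function preflightBfcl(
  config: BfclAdapterConfig,
  pathExists: (path: string) => boolean,
): BfclPreflightResult {
  if (config.mode === "fixture") {
    // Fixture mode only exercises adapter plumbing; no official assets needed.
    return { mode: "fixture", root: null };
  }
  if (!config.officialRoot) {
    throw new Error(
      "bfcl: official mode requires a Gorilla/BFCL checkout or bfcl-eval project root",
    );
  }
  if (!pathExists(config.officialRoot)) {
    throw new Error(`bfcl: official root not found: ${config.officialRoot}`);
  }
  return { mode: "official", root: config.officialRoot };
}
```

Injecting `pathExists` lets a preflight test cover both outcomes: a stub returning `false` exercises the fail-loud path, and a stub returning `true` exercises the pass path, with no real checkout on disk.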
Research use
This enables the AppWorld + BFCL easy-subset-first tranche: AppWorld serves external validity through stateful code/API orchestration, while BFCL serves the size axis for function-calling and missing-function/free-form contrast.