
feat(bench): add BFCL function-calling adapter #412

Description

@drewstone

Problem

The external-benchmark roadmap now selects BFCL as the cheapest public size-axis benchmark for the policy-vs-free-form contrast, but @tangle-network/agent-bench does not yet expose a BFCL adapter.

Desired adapter

Add a fail-loud bfcl adapter under bench/src/benchmarks/ that delegates to the official Berkeley Function Calling Leaderboard assets/evaluator where possible. The first useful scope is deterministic function-call categories suitable for weak-vs-gold calibration and small-vs-large model comparisons.
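
As a starting point for the deterministic categories, a fixture-mode judge could compare a predicted call to a gold call by exact match. This is only a sketch for testing adapter plumbing: `FunctionCall`, `JudgeResult`, `canonical`, and `judgeFixtureCall` are hypothetical names, not the agent-bench adapter interface and not the official BFCL evaluator, which official mode should delegate to instead.

```typescript
// Hypothetical shapes for illustration; not the official BFCL schema.
interface FunctionCall {
  name: string;
  args: Record<string, unknown>;
}

interface JudgeResult {
  pass: boolean;
  reason: string;
}

// Serialize a value with sorted object keys so that argument order
// does not affect equality. Array order is preserved.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonical).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    const body = keys.map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`);
    return `{${body.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Exact-match judge for fixture mode only. It never produces a score
// that should be reported as a BFCL result.
function judgeFixtureCall(
  predicted: FunctionCall | null,
  gold: FunctionCall,
): JudgeResult {
  if (predicted === null) {
    return { pass: false, reason: "no function call produced" };
  }
  if (predicted.name !== gold.name) {
    return {
      pass: false,
      reason: `expected function ${gold.name}, got ${predicted.name}`,
    };
  }
  if (canonical(predicted.args) !== canonical(gold.args)) {
    return { pass: false, reason: "argument mismatch" };
  }
  return { pass: true, reason: "exact match" };
}
```

Exact match is deliberately stricter than a real evaluator is likely to be, so a fixture pass says nothing about official BFCL accuracy.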

Constraints

  • Do not fabricate BFCL scores.
  • Fixture mode may test adapter plumbing, but official mode must require the real Gorilla/BFCL checkout or bfcl-eval project root.
  • Keep BFCL V4/V3 naming current: V4 is the latest official line, while V3 multi-turn/missing-function categories are the immediate research target if they remain the best fit.
  • Expose through the existing ADAPTERS map; no runtime-loop changes.
  • Add preflight/load/judge tests that fail loud without official assets and pass on fixtures.
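
The fail-loud constraint above could be sketched as a preflight check that throws instead of falling back. Everything named here (`BfclConfig`, `officialRoot`, `preflight`, the injected `pathExists` probe) is an assumption for illustration; the real adapter interface and the exact files to look for in a Gorilla/BFCL checkout would need to be confirmed against the repositories.

```typescript
type BfclMode = "fixture" | "official";

// Hypothetical config shape; field names are assumptions.
interface BfclConfig {
  mode: BfclMode;
  // Path to a Gorilla/BFCL checkout or bfcl-eval project root.
  // Required in official mode, ignored in fixture mode.
  officialRoot?: string;
}

interface PreflightResult {
  mode: BfclMode;
  root: string | null;
}

// `pathExists` is injected so tests can run without touching the disk;
// production code could pass a filesystem existence check instead.
function preflight(
  config: BfclConfig,
  pathExists: (path: string) => boolean,
): PreflightResult {
  if (config.mode === "fixture") {
    // Fixture mode exercises plumbing only and needs no official assets.
    return { mode: "fixture", root: null };
  }
  if (!config.officialRoot) {
    throw new Error(
      "bfcl preflight: official mode requires officialRoot (Gorilla/BFCL checkout or bfcl-eval project root)",
    );
  }
  if (!pathExists(config.officialRoot)) {
    throw new Error(
      `bfcl preflight: official assets not found at ${config.officialRoot}`,
    );
  }
  return { mode: "official", root: config.officialRoot };
}
```

There is intentionally no branch that downgrades official mode to fixture mode when assets are missing, since a silent fallback is how fabricated scores would slip in.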

Research use

This enables the AppWorld + BFCL easy-subset-first tranche: AppWorld serves external validity through stateful code/API orchestration, while BFCL serves the size axis for function calling and the missing-function/free-form contrast.
