sony/MusTBench

MusTBench: Benchmarking and Advancing Temporal Grounding in Music LLMs

MusTBench is a benchmark for evaluating temporal grounding in Large Audio-Language Models (LALMs). It evaluates whether a model can identify, describe, order, and localize temporally situated musical events.

MusTBench contains 1,264 question-answer examples over 875 unique audio tracks and consists of five tasks:

| Task | Full name | Output | QA examples |
|------|-----------|--------|-------------|
| TSG | Temporal Source Grounding | Timestamp (M:SS) | 400 |
| LTR | Local Timestamp Reasoning | Multiple-choice answer (A-D) | 208 |
| TAD | Timestamp-Aware Description | Free-form transition description | 208 |
| GTO | Global Transition Ordering | Multiple-choice answer (A-F) | 198 |
| MTR | Music Temporal Region Grounding | One or more intervals (M:SS-M:SS) | 250 |
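
Two of the output formats above are timestamp strings. The following is a minimal sketch of how such strings could be converted to seconds. It assumes that M is an unpadded minute count and SS is a two-digit seconds field below 60. The helper names `parse_timestamp` and `parse_interval` are hypothetical and are not taken from the repository.

```python
import re

# Assumption: M is one or more digits, SS is exactly two digits in 00-59.
_TIMESTAMP = re.compile(r"^(\d+):([0-5]\d)$")


def parse_timestamp(text: str) -> int:
    """Convert an M:SS string such as "1:05" into whole seconds."""
    match = _TIMESTAMP.match(text.strip())
    if match is None:
        raise ValueError(f"not an M:SS timestamp: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def parse_interval(text: str) -> tuple[int, int]:
    """Convert an M:SS-M:SS string such as "0:30-1:15" into (start, end) seconds."""
    start_text, separator, end_text = text.strip().partition("-")
    if not separator:
        raise ValueError(f"not an M:SS-M:SS interval: {text!r}")
    start, end = parse_timestamp(start_text), parse_timestamp(end_text)
    if end < start:
        raise ValueError(f"interval ends before it starts: {text!r}")
    return start, end
```

For example, `parse_timestamp("1:05")` gives 65 and `parse_interval("0:30-1:15")` gives `(30, 75)`.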

Installation

Clone the repository:

git clone https://github.com/wqysony/MusTBench.git
cd MusTBench

Create a Python environment:

conda create -n mustbench python=3.10 -y
conda activate mustbench

Install a PyTorch build compatible with the CUDA version on your system, then install the remaining dependencies:

pip install pandas pyarrow datasets accelerate transformers librosa soundfile

The inference code was tested with Python 3.10, PyTorch 2.9.0, Transformers 4.57.1, and Accelerate 1.10.1. ffmpeg is recommended for audio decoding.

Benchmark Data

The benchmark is provided as a unified Parquet file:

benchmarks/mustbench.parquet

Each row corresponds to one QA example.

Schema

| Column | Description |
|--------|-------------|
| id | Unique QA example ID |
| track_id | Audio track identifier |
| task | Task type: TSG, LTR, TAD, GTO, or MTR |
| category | Category information for TSG; null for other tasks |
| qa_type | QA subtype for TSG; null for other tasks |
| question | Question text |
| answer | Canonical answer for single-answer tasks |
| answers | List of reference answers; for MTR, this can contain multiple reference intervals |
| split | Dataset split |

Load the Local Parquet File

import pandas as pd

benchmark = pd.read_parquet("benchmarks/mustbench.parquet")
print(benchmark)
print(benchmark.iloc[0])

tsg = benchmark[benchmark["task"] == "TSG"]

Load from Hugging Face

from datasets import load_dataset

dataset = load_dataset("wqysony/mustbench", split="test")
print(dataset)
print(dataset[0])

Source Audio

This repository distributes MusTBench annotations and QA data, but does not redistribute the associated audio files.

The benchmark uses tracks from MTG-Jamendo and Slakh2100.

Download both datasets from their original providers and comply with their respective licenses and access conditions.

The 875 required tracks consist of:

| Source dataset | Track ID format | Tracks | Collected format |
|----------------|-----------------|--------|------------------|
| MTG-Jamendo | Numeric ID, e.g. 1001315 | 749 | MP3 |
| Slakh2100 | Track followed by five digits, e.g. Track01876 | 126 | FLAC |
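
The two ID formats make it possible to tell the source dataset from the track ID alone. Below is a rough sketch of that mapping. It assumes IDs are compared as strings and that the collected file name is simply the ID plus the extension listed above. `expected_filename` is a hypothetical helper, not a function from `code/collect_audios.py`.

```python
import re


def expected_filename(track_id) -> str:
    """Map a benchmark track_id to the file name expected in the audio directory."""
    text = str(track_id)  # tolerate IDs stored as integers
    if re.fullmatch(r"\d+", text):
        # Numeric ID: MTG-Jamendo, collected as MP3.
        return f"{text}.mp3"
    if re.fullmatch(r"Track\d{5}", text):
        # "Track" plus five digits: Slakh2100, collected as FLAC.
        return f"{text}.flac"
    raise ValueError(f"unrecognized track_id format: {text!r}")
```

With the example IDs from the table, this yields `1001315.mp3` and `Track01876.flac`.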

Preparing the Audio Files

code/collect_audios.py reads all unique track_id values from the benchmark, finds the corresponding files in downloaded MTG-Jamendo and Slakh2100 directories, and copies only the 875 required tracks into one flat directory.

Run the collector after downloading both source datasets:

python code/collect_audios.py \
  --benchmark-path benchmarks/mustbench.parquet \
  --mtg-root-path /absolute/path/to/mtg-jamendo \
  --slakh-root-path /absolute/path/to/slakh2100 \
  --output-dir audios

The resulting directory is flat:

audios/
├── 1001315.mp3
├── 1002010.mp3
├── ...
├── Track01876.flac
└── ...

Existing byte-identical destination files are skipped. A differing destination file causes an error unless --overwrite is supplied. Unrelated MP3 or FLAC files in the output directory cause an error unless --allow-extra-audio is supplied.
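
The copy policy described above can be illustrated with a short sketch. It is not the code in `code/collect_audios.py`. The function names, return values, and exception type are assumptions chosen for illustration, and the `overwrite` argument only mirrors the meaning of `--overwrite`.

```python
import filecmp
import shutil
from pathlib import Path

AUDIO_SUFFIXES = {".mp3", ".flac"}


def copy_track(source: Path, destination: Path, overwrite: bool = False) -> str:
    """Copy one track, skipping byte-identical files and refusing silent overwrites."""
    existed = destination.exists()
    if existed:
        # shallow=False forces a byte-by-byte comparison of the two files.
        if filecmp.cmp(source, destination, shallow=False):
            return "skipped"
        if not overwrite:
            raise FileExistsError(f"{destination} exists with different content")
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)
    return "overwritten" if existed else "copied"


def find_extra_audio(output_dir: Path, expected_names: set[str]) -> list[str]:
    """List MP3 or FLAC files in output_dir that the benchmark does not require."""
    return sorted(
        path.name
        for path in output_dir.iterdir()
        if path.suffix.lower() in AUDIO_SUFFIXES and path.name not in expected_names
    )
```

A caller could treat a non-empty result from `find_extra_audio` as an error unless the user has opted in, which corresponds to the role of `--allow-extra-audio`.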

Inference

The provided inference pipeline runs Qwen/Qwen2.5-Omni-7B through Hugging Face Transformers.

Run all commands from the repository root.

python code/inference.py \
  --benchmark-path benchmarks/mustbench.parquet \
  --audio-root audios \
  --output-path results/mustbench_qwen25_omni_predictions.json \
  --model-id Qwen/Qwen2.5-Omni-7B \
  --device-map auto \
  --torch-dtype bfloat16 \
  --batch-size 1 \
  --save-every 20 \
  --continue-on-generation-error

To run only selected tasks, add an option such as --tasks TSG MTR.

Rerunning the same command with the same --output-path resumes an interrupted run and skips completed examples.
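
One common way to implement this kind of resume behavior is sketched below. It assumes the output file is a JSON list of records keyed by `id`, which the README does not specify. `run_with_resume`, `load_completed`, and the `predict` callback are hypothetical names, and the actual logic in `code/inference.py` may differ.

```python
import json
from pathlib import Path


def load_completed(output_path: Path) -> dict:
    """Return previously saved records keyed by example ID (empty if no file yet)."""
    if not output_path.exists():
        return {}
    with output_path.open(encoding="utf-8") as handle:
        return {record["id"]: record for record in json.load(handle)}


def save_records(records: dict, output_path: Path) -> None:
    """Write to a temporary file first so an interruption cannot truncate the output."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = output_path.with_suffix(output_path.suffix + ".tmp")
    tmp_path.write_text(json.dumps(list(records.values()), indent=2), encoding="utf-8")
    tmp_path.replace(output_path)


def run_with_resume(examples, output_path: Path, predict, save_every: int = 20) -> int:
    """Run predict() only on examples missing from output_path; return how many ran."""
    completed = load_completed(output_path)
    pending = [example for example in examples if example["id"] not in completed]
    for index, example in enumerate(pending, start=1):
        completed[example["id"]] = {**example, "response": predict(example)}
        # Checkpoint periodically and once more after the final example.
        if index % save_every == 0 or index == len(pending):
            save_records(completed, output_path)
    return len(pending)
```

Calling `run_with_resume` a second time with the same `output_path` invokes `predict` only for IDs that are not yet in the file.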

Predictions are saved as JSON with the original benchmark fields, model response, parsed prediction, references, and parse status.

Evaluation

After inference is complete, evaluate the prediction JSON with:

python code/evaluate.py \
  results/mustbench_qwen25_omni_predictions.json \
  --audio-root audios \
  --clap-device cuda

The evaluator reports the following metrics:

| Type | Task | Metric | Definition |
|------|------|--------|------------|
| Type1 | TSG | Hit@3s | Fraction of predicted timestamps within 3 seconds of a reference timestamp |
| Type2 | LTR | Accuracy | Exact accuracy of the predicted multiple-choice option |
| Type3 | TAD | METEOR | Exact-token METEOR between the generated and reference transition descriptions |
| Type3 | TAD | CLAPScore | Cosine similarity between the CLAP text embedding of the generated description and the CLAP audio embedding of the clip spanning T-10s to T+10s around the queried timestamp T |
| Type4 | GTO | Accuracy | Exact accuracy of the predicted chronological-order option |
| Type5 | MTR | Macro Temporal IoU | Mean per-example intersection-over-union of the predicted and reference interval sets |
| Type5 | MTR | Macro Temporal F1 | Mean per-example harmonic mean of temporal precision and temporal recall |
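
The timestamp and interval metrics can be made concrete with a small sketch that follows the definitions above. It is not the code in `code/evaluate.py`. It assumes times are already in seconds, that "within 3 seconds" is inclusive, that an unparsed prediction (`None`) counts as a miss, and that overlapping intervals within one set are merged before measuring. All function names are hypothetical.

```python
def hit_at_3s(predictions, references, tolerance=3.0):
    """Fraction of predictions within `tolerance` seconds of any reference timestamp."""
    hits = sum(
        pred is not None and any(abs(pred - ref) <= tolerance for ref in refs)
        for pred, refs in zip(predictions, references)
    )
    return hits / len(predictions)


def merge(intervals):
    """Merge overlapping (start, end) intervals so no span is counted twice."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(intervals):
    return sum(end - start for start, end in merge(intervals))


def intersection_length(first, second):
    return sum(
        max(0, min(e1, e2) - max(s1, s2))
        for s1, e1 in merge(first)
        for s2, e2 in merge(second)
    )


def temporal_iou_f1(predicted, reference):
    """Return (IoU, F1) for one example's predicted and reference interval sets."""
    inter = intersection_length(predicted, reference)
    pred_len, ref_len = total_length(predicted), total_length(reference)
    union = pred_len + ref_len - inter
    iou = inter / union if union > 0 else 0.0
    precision = inter / pred_len if pred_len > 0 else 0.0
    recall = inter / ref_len if ref_len > 0 else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator > 0 else 0.0
    return iou, f1


def macro_average(pairs):
    """Average per-example (IoU, F1) over (predicted, reference) pairs."""
    scores = [temporal_iou_f1(predicted, reference) for predicted, reference in pairs]
    count = len(scores)
    return sum(s[0] for s in scores) / count, sum(s[1] for s in scores) / count
```

For a prediction of 10-20 s against a reference of 15-25 s, the overlap is 5 s and the union is 15 s, so the IoU is 1/3, while precision and recall are both 0.5, giving an F1 of 0.5.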

By default, the evaluator writes:

results/mustbench_qwen25_omni_evaluation.json
results/mustbench_qwen25_omni_evaluation_rows.json

The first file contains aggregate metrics, while the second contains per-example scores and parsed predictions. Use --skip-clap to evaluate all other metrics without loading the CLAP model.

Citation

If you use MusTBench in your research, please cite:

@article{kwon2026MusTBench,
  title={MusTBench: Benchmarking and Advancing Temporal Grounding in Music LLMs},
  author={Kwon, Daeyong and Wu, Qiyu and Kuriya, Shinobu and Koo, Junghyun and Cui, Shuyang and Zhong, Zhi and Liao, Wei-Hsiang and Wakaki, Hiromi and Mitsufuji, Yuki},
  journal={arXiv preprint arXiv:2605.29300},
  year={2026}
}

Acknowledgements

MusTBench builds on audio from Slakh2100 and MTG-Jamendo. We thank the authors and maintainers of these datasets and the developers of the open-source models and libraries used by this project.
