DualStreamBench: A Dual-System Benchmark for Streaming Video Understanding

DualStreamBench is a benchmark and evaluation framework for streaming video understanding under a dual-system setting. It evaluates how a fast S1 path, a slower S2 reasoning path, memory, a streaming visual cache, and a route scheduler interact under real video-stream constraints.

The repository directory is still named SysStream for compatibility with existing scripts and result paths.

What This Repository Contains

A five-part streaming video benchmark: proactive, clue, realtime, reasoning, and sequential.
Dual-system execution: S1 answers from a short window, while S2 can use a longer context window and memory retrieval.
Route scheduling: rule, semantic, API/local-router, force-S2, and offline rerun/simulation paths.
Streaming execution: asynchronous video reading, frame-window construction, visual embedding cache, and optional KV reuse.
Local and API model backends: Qwen, LLaVA, LLaVA-NeXT-Video, MiniCPM, InternVL, OpenAI, Anthropic, and Google.
Analysis scripts for routing, result completeness, router comparison, memory ablation, and compact table generation.
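
As a rough illustration of the dual-system idea in the list above, the sketch below selects frames for a short S1 window and a longer S2 window from a stream of timestamps. The function name select_window, the window lengths, and the frame representation are assumptions made for illustration; the real windows are set in the system section of src/config.yaml, and the repository's own implementation may differ.

```python
from typing import List, Sequence


def select_window(timestamps: Sequence[float], now: float, window_s: float) -> List[int]:
    """Return indices of frames whose timestamp lies in (now - window_s, now]."""
    start = now - window_s
    return [i for i, t in enumerate(timestamps) if start < t <= now]


# Hypothetical stream sampled at 1 FPS for 60 seconds.
timestamps = [float(t) for t in range(60)]
now = 59.0

# Assumed window lengths: S1 sees a short recent window, S2 a longer one.
s1_frames = select_window(timestamps, now, window_s=4.0)
s2_frames = select_window(timestamps, now, window_s=32.0)

print(len(s1_frames), len(s2_frames))  # prints "4 32"
```

In this sketch the S2 window is a superset of the S1 window, so anything S1 can see is also available to S2; memory retrieval would extend S2 further back than either window.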

Setup

Install the Python dependencies in the environment you use for evaluation:

pip install -r requirements.txt

For local model/router experiments, the project has usually been run in the SysStream conda environment:

conda run -n SysStream python <script.py>

Model weights and sentence-transformer caches are expected under local cache directories such as model_cache/ or the paths configured in src/config.yaml.

Dataset Layout

The merged benchmark files live under datasets/final/:

datasets/final/all.json
datasets/final/task_sets/*.json
datasets/final/shards/*.json
datasets/final/shards_mini/*.json
datasets/final/memory_ablation/*.json
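
To check that a local checkout matches this layout, a small helper along the following lines could count the JSON files in each subdirectory. The function name list_benchmark_files is hypothetical and not part of the repository; only the directory names come from the listing above.

```python
from pathlib import Path
from typing import Dict, List


def list_benchmark_files(root: Path) -> Dict[str, List[Path]]:
    """Group benchmark JSON files under `root` by the subdirectory they live in."""
    groups: Dict[str, List[Path]] = {}
    for sub in ("task_sets", "shards", "shards_mini", "memory_ablation"):
        # A missing subdirectory simply yields an empty list.
        groups[sub] = sorted((root / sub).glob("*.json"))
    return groups


if __name__ == "__main__":
    # Run from the repository root.
    for name, files in list_benchmark_files(Path("datasets/final")).items():
        print(f"{name}: {len(files)} file(s)")
```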

The video root used by evaluation is set by the data_dir key in the dataset section of the evaluation config, for example src/config.yaml:

dataset:
  data_dir: ../datasets/data

Running Evaluation

Run a single config-driven evaluation from the repository root:

python src/eval_from_config.py --config src/config.yaml

Useful overrides:

python src/eval_from_config.py --config src/config.yaml --max_samples 5
python src/eval_from_config.py --config src/config.yaml --mode stream
python src/eval_from_config.py --config src/config.yaml --benchmark clue
python src/eval_from_config.py --config src/config.yaml --resume
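
For a quick smoke test across all five benchmark parts, the overrides above could be combined in a loop. The sketch below only builds and prints the commands. It assumes that --benchmark accepts each of the five names listed earlier in this README and that --benchmark and --max_samples can be passed together; neither is confirmed here, so check the script's argument parser first. The helper name build_command is hypothetical.

```python
import shlex
from typing import List

# Benchmark names taken from the five-part list in this README.
BENCHMARKS = ["proactive", "clue", "realtime", "reasoning", "sequential"]


def build_command(benchmark: str, config: str = "src/config.yaml", max_samples: int = 5) -> List[str]:
    """Build an argument list for one smoke-test run of a single benchmark."""
    return [
        "python", "src/eval_from_config.py",
        "--config", config,
        "--benchmark", benchmark,
        "--max_samples", str(max_samples),
    ]


for name in BENCHMARKS:
    # Print the command instead of executing it; pass the list to
    # subprocess.run(..., check=True) to actually launch the run.
    print(shlex.join(build_command(name)))
```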

Run sharded evaluation with generated configs and tmux sessions:

python run_all.py --dry-run
python run_all.py

run_all.py owns the generated shard configs under src/configs/ and launches the shard jobs. Treat the generated shard configs as outputs rather than as the source of truth: to change a setting, edit src/config.yaml or the generator path instead.

API Batch Evaluation

The batch workflow is controlled by top-level variables in src/batch_eval.py, not CLI flags:

cd src
python batch_eval.py

Important variables include:

CONFIG_PATH
DATASET
DATASET_SHARD
EVAL_MODE
BATCH_DRY_RUN
POLL_INTERVAL
POLL_TIMEOUT
CLEAN_START

Batch outputs and state are written under batch_files/, batch_state/, batch_frames/, and configured result paths.
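
The POLL_INTERVAL and POLL_TIMEOUT variables suggest a poll-until-done loop around the batch job. A minimal sketch of that pattern is shown below, assuming both values are in seconds; the helper wait_until_done and the fake status check are hypothetical, and the actual loop in src/batch_eval.py may be structured differently.

```python
import time
from typing import Callable

# Illustrative values; the real ones are set at the top of src/batch_eval.py.
POLL_INTERVAL = 0.01  # assumed: seconds between status checks
POLL_TIMEOUT = 1.0    # assumed: give up after this many seconds


def wait_until_done(is_done: Callable[[], bool], interval: float, timeout: float) -> bool:
    """Poll `is_done` every `interval` seconds; return False if `timeout` elapses first."""
    deadline = time.monotonic() + timeout
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical status check that reports completion on the third poll.
calls = {"n": 0}


def fake_status() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until_done(fake_status, POLL_INTERVAL, POLL_TIMEOUT))  # prints "True"
```

Using time.monotonic() for the deadline keeps the timeout correct even if the system clock is adjusted while a long batch job is pending.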

Reports And Analysis

The benchmark report entrypoints scan existing result files and write reports:

python src/benchmark/run_eval.py
python src/benchmark/run_batch_report.py

Repository-local analysis scripts are grouped under scripts/:

scripts/
├── routing/        # scheduler simulation, misroute analysis, route reruns
├── result_tools/   # result completeness checks and S1/S2 score exports
├── reporting/      # router comparison, backend tables, memory ablation reports
├── experiments/    # small runtime experiments such as prefill_stride tests
└── tools/          # model/cache utilities

Common commands:

python scripts/routing/analyze_scheduler_misroutes.py
python scripts/routing/simulate_scheduler_policy.py
python scripts/routing/rerun_scheduler_routes.py

python scripts/result_tools/check_results_completeness.py
python scripts/result_tools/export_s1_s2_case_scores.py

python scripts/reporting/compare_router_results.py
python scripts/reporting/export_backend_router_table.py
python scripts/reporting/compare_memory_ablation_results.py

python scripts/experiments/run_stride_test.py
python scripts/tools/cache_retrieval_model.py
python scripts/tools/download_modelscope.py

The script directory names intentionally avoid reports and results, because those names are ignored as runtime-output directories by .gitignore.

Configuration Notes

Primary configuration is in src/config.yaml.

Key sections:

dataset: benchmark JSON and video data roots.
model: backend type, model path, dtype, quantization, GPU placement.
video: FPS, max frames, and resize policy.
system: S1/S2 windows, structured output, S2 memory behavior, and scheduler.
system.streaming_embed_cache: streaming visual cache, prefill stride, KV reuse, and context clipping.
memory: short-term/long-term memory and retrieval encoder settings.
stream: streaming runtime and reader settings.
output: result directory, file name, resume behavior, and compact saving.
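
When writing a new config by hand, it can help to check that the sections listed above are present before launching a run. The sketch below works on an already-parsed config dictionary (for example the result of loading the YAML file). The helper missing_sections is hypothetical, and treating every listed section as required is an assumption; the evaluation code may well supply defaults for some of them.

```python
from typing import Any, Dict, List

# Section names taken from the list above; nested keys use dotted paths.
EXPECTED_SECTIONS = [
    "dataset", "model", "video", "system",
    "system.streaming_embed_cache", "memory", "stream", "output",
]


def missing_sections(config: Dict[str, Any], expected: List[str] = EXPECTED_SECTIONS) -> List[str]:
    """Return the dotted section paths that are absent from a parsed config."""
    missing = []
    for path in expected:
        node: Any = config
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing


# Hypothetical, heavily trimmed config as it might look after YAML parsing.
example = {
    "dataset": {"data_dir": "../datasets/data"},
    "model": {}, "video": {}, "system": {}, "memory": {}, "output": {},
}
print(missing_sections(example))  # prints "['system.streaming_embed_cache', 'stream']"
```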

Supported local backend names include:

qwen, llava, llava-onevision, llava_next_video, minicpm, internvl

API configs are provided separately, for example src/config_openai.yaml, src/config_anthropic.yaml, and src/config_google.yaml.

Output Directories

Runtime outputs are intentionally ignored by Git:

results/
reports/
memory_storage/
model_cache/
batch_files/
batch_frames/
batch_state/
src/configs/

Saved-result analysis commonly uses:

results_1/api/
results_1/internvl/
results_1/qwen/
results_1/minicpm/
results_1/llava_next_video/
results_1/memory/
results_1/router/

License

MIT License
