DualStreamBench is a benchmark and evaluation framework for streaming video understanding under a dual-system setting. It evaluates how a fast S1 path, a slower S2 reasoning path, memory, streaming visual cache, and a route scheduler interact under real video-stream constraints.
The repository directory is still named SysStream for compatibility with
existing scripts and result paths.
- A five-part streaming video benchmark: proactive, clue, realtime, reasoning, and sequential.
- Dual-system execution: S1 answers from a short window, while S2 can use a longer context window and memory retrieval.
- Route scheduling: rule, semantic, API/local-router, force-S2, and offline rerun/simulation paths.
- Streaming execution: asynchronous video reading, frame-window construction, visual embedding cache, and optional KV reuse.
- Local and API model backends: Qwen, LLaVA, LLaVA-NeXT-Video, MiniCPM, InternVL, OpenAI, Anthropic, and Google.
- Analysis scripts for routing, result completeness, router comparison, memory ablation, and compact table generation.
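To make the dual-system idea concrete, here is a minimal sketch of how a short S1 window, a longer S2 window, and a rule-based route decision could fit together. It is not the repository's implementation: the `Query` class, the keyword list, the function names, and the window lengths are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Query:
    """A question asked at some point in the stream (hypothetical structure)."""
    text: str
    timestamp: float  # seconds since the start of the stream


def frame_window(timestamps: List[float], now: float, span: float) -> List[float]:
    """Return the frame timestamps that fall in the window (now - span, now]."""
    return [t for t in timestamps if now - span < t <= now]


# Hypothetical rule: questions that refer to the past are sent to S2.
S2_KEYWORDS = ("before", "earlier", "previously", "how many times")


def route(query: Query) -> str:
    """Pick "s1" or "s2" using a simple keyword rule (illustrative only)."""
    lowered = query.text.lower()
    return "s2" if any(keyword in lowered for keyword in S2_KEYWORDS) else "s1"


def select_frames(
    query: Query,
    timestamps: List[float],
    s1_span: float = 4.0,   # assumed short window, in seconds
    s2_span: float = 60.0,  # assumed long window, in seconds
) -> Tuple[str, List[float]]:
    """Route the query, then build the frame window for the chosen path."""
    path = route(query)
    span = s1_span if path == "s1" else s2_span
    return path, frame_window(timestamps, query.timestamp, span)


if __name__ == "__main__":
    frames = [float(i) for i in range(100)]  # one frame per second
    for text in ("What is the person holding now?",
                 "What did the person pick up earlier?"):
        path, window = select_frames(Query(text, 90.0), frames)
        print(path, len(window))
```

With one frame per second, the first question stays on S1 and sees 4 frames, while the second is routed to S2 and sees 60. A semantic or model-based router would replace only `route`, leaving the window construction unchanged.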
Install the Python dependencies in the environment you use for evaluation:
pip install -r requirements.txt
For local model/router experiments, the project has usually been run in the
SysStream conda environment:
conda run -n SysStream python <script.py>
Model weights and sentence-transformer caches are expected under local cache
directories such as model_cache/ or the paths configured in src/config.yaml.
The merged benchmark files live under datasets/final/:
- datasets/final/all.json
- datasets/final/task_sets/*.json
- datasets/final/shards/*.json
- datasets/final/shards_mini/*.json
- datasets/final/memory_ablation/*.json
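Before launching a run, it can help to confirm which of these files are actually present. The following sketch only inspects the directory layout listed above. The function name `list_benchmark_files` is hypothetical, and nothing is assumed about the contents of the JSON files.

```python
from pathlib import Path
from typing import Dict, List

# Sub-directories named in the list above; all.json sits at the top level.
SUBDIRS = ("task_sets", "shards", "shards_mini", "memory_ablation")


def list_benchmark_files(final_dir: str) -> Dict[str, List[str]]:
    """Map each benchmark group to the sorted JSON file names found on disk."""
    root = Path(final_dir)
    inventory: Dict[str, List[str]] = {
        "all": ["all.json"] if (root / "all.json").is_file() else []
    }
    for name in SUBDIRS:
        folder = root / name
        # A missing directory is reported as an empty group, not an error.
        found = folder.glob("*.json") if folder.is_dir() else []
        inventory[name] = sorted(path.name for path in found)
    return inventory


if __name__ == "__main__":
    for group, files in list_benchmark_files("datasets/final").items():
        print(f"{group}: {len(files)} file(s)")
```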
The video root used by evaluation is configured through:
dataset:
  data_dir: ../datasets/data
Run a single config-driven evaluation from the repository root:
python src/eval_from_config.py --config src/config.yaml
Useful overrides:
python src/eval_from_config.py --config src/config.yaml --max_samples 5
python src/eval_from_config.py --config src/config.yaml --mode stream
python src/eval_from_config.py --config src/config.yaml --benchmark clue
python src/eval_from_config.py --config src/config.yaml --resume
Run sharded evaluation with generated configs and tmux sessions:
python run_all.py --dry-run
python run_all.py
run_all.py owns the generated shard configs under src/configs/ and launches
the shard jobs. Do not edit generated shard configs as source-of-truth changes;
edit src/config.yaml or the generator path instead.
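The reason generated configs should not be edited by hand is easier to see with a sketch of the general pattern: each shard config is derived from one base config, so any regeneration overwrites manual changes. The code below illustrates that pattern only and does not describe how run_all.py actually works. The keys `json_path` and `result_file` and both function names are hypothetical, and the tmux launch step is deliberately left out.

```python
import copy
from typing import Any, Dict, List


def make_shard_configs(
    base: Dict[str, Any], shard_files: List[str], result_dir: str
) -> List[Dict[str, Any]]:
    """Derive one config per shard from a shared base config."""
    configs = []
    for index, shard_file in enumerate(shard_files):
        cfg = copy.deepcopy(base)  # never mutate the source-of-truth config
        # Hypothetical keys: the real config schema may use other names.
        cfg.setdefault("dataset", {})["json_path"] = shard_file
        cfg.setdefault("output", {})["result_file"] = (
            f"{result_dir}/shard_{index:02d}.json"
        )
        configs.append(cfg)
    return configs


def plan_commands(config_paths: List[str], dry_run: bool = True) -> List[str]:
    """Build one evaluation command per shard config; print them on a dry run."""
    commands = [
        f"python src/eval_from_config.py --config {path}" for path in config_paths
    ]
    if dry_run:
        for command in commands:
            print(command)
    return commands


if __name__ == "__main__":
    base = {"model": {"type": "qwen"}, "output": {"resume": True}}
    shards = ["datasets/final/shards/part_0.json",
              "datasets/final/shards/part_1.json"]  # hypothetical file names
    for cfg in make_shard_configs(base, shards, "results"):
        print(cfg["dataset"]["json_path"], "->", cfg["output"]["result_file"])
```

Because every shard config is a deep copy of the base, a change made in the base propagates to all shards on the next generation, which is the behavior the warning above relies on.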
The batch workflow is controlled by top-level variables in src/batch_eval.py,
not CLI flags:
cd src
python batch_eval.py
Important variables include:
- CONFIG_PATH
- DATASET
- DATASET_SHARD
- EVAL_MODE
- BATCH_DRY_RUN
- POLL_INTERVAL
- POLL_TIMEOUT
- CLEAN_START
Batch outputs and state are written under batch_files/, batch_state/,
batch_frames/, and configured result paths.
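Variables named like POLL_INTERVAL and POLL_TIMEOUT usually control a wait loop that checks a batch job until it finishes or a deadline passes. The sketch below shows that generic pattern. It borrows the two names from the list above, but their units, default values, and exact role in src/batch_eval.py are assumptions, and `wait_until_done` is a hypothetical helper.

```python
import time
from typing import Callable

# Illustrative values only; the real settings live in src/batch_eval.py.
POLL_INTERVAL = 0.01  # assumed: seconds to sleep between status checks
POLL_TIMEOUT = 1.0    # assumed: seconds to wait before giving up


def wait_until_done(
    is_done: Callable[[], bool],
    interval: float = POLL_INTERVAL,
    timeout: float = POLL_TIMEOUT,
) -> bool:
    """Poll is_done() until it returns True or the timeout elapses.

    Returns True if the job finished in time, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    checks = {"count": 0}

    def fake_job_finished() -> bool:
        # Stand-in for a real status query; reports done on the third check.
        checks["count"] += 1
        return checks["count"] >= 3

    print(wait_until_done(fake_job_finished))
```

Using `time.monotonic()` keeps the deadline stable even if the system clock is adjusted during a long batch run.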
The benchmark report entrypoints scan existing result files and write reports:
python src/benchmark/run_eval.py
python src/benchmark/run_batch_report.py
Repository-local analysis scripts are grouped under scripts/:
scripts/
├── routing/ # scheduler simulation, misroute analysis, route reruns
├── result_tools/ # result completeness checks and S1/S2 score exports
├── reporting/ # router comparison, backend tables, memory ablation reports
├── experiments/ # small runtime experiments such as prefill_stride tests
└── tools/ # model/cache utilities
Common commands:
python scripts/routing/analyze_scheduler_misroutes.py
python scripts/routing/simulate_scheduler_policy.py
python scripts/routing/rerun_scheduler_routes.py
python scripts/result_tools/check_results_completeness.py
python scripts/result_tools/export_s1_s2_case_scores.py
python scripts/reporting/compare_router_results.py
python scripts/reporting/export_backend_router_table.py
python scripts/reporting/compare_memory_ablation_results.py
python scripts/experiments/run_stride_test.py
python scripts/tools/cache_retrieval_model.py
python scripts/tools/download_modelscope.py
The script directory names intentionally avoid reports and results, because
those names are ignored as runtime-output directories by .gitignore.
Primary configuration is in src/config.yaml.
Key sections:
- dataset: benchmark JSON and video data roots.
- model: backend type, model path, dtype, quantization, GPU placement.
- video: FPS, max frames, and resize policy.
- system: S1/S2 windows, structured output, S2 memory behavior, and scheduler.
- system.streaming_embed_cache: streaming visual cache, prefill stride, KV reuse, and context clipping.
- memory: short-term/long-term memory and retrieval encoder settings.
- stream: streaming runtime and reader settings.
- output: result directory, file name, resume behavior, and compact saving.
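A quick sanity check on a loaded config can catch a missing section before a long run starts. The sketch below takes an already-parsed config as a plain dictionary and reports which of the sections listed above are absent. Treating every section as required is an assumption, and `missing_sections` is a hypothetical helper, not part of the repository.

```python
from typing import Any, Dict, List

# Top-level sections named in the list above.
TOP_LEVEL_SECTIONS = ("dataset", "model", "video", "system",
                      "memory", "stream", "output")


def missing_sections(config: Dict[str, Any]) -> List[str]:
    """Return the names of expected config sections that are not present."""
    missing = [name for name in TOP_LEVEL_SECTIONS if name not in config]
    system = config.get("system")
    # streaming_embed_cache is nested under system, so check it separately.
    if isinstance(system, dict) and "streaming_embed_cache" not in system:
        missing.append("system.streaming_embed_cache")
    return missing


if __name__ == "__main__":
    partial = {
        "dataset": {"data_dir": "../datasets/data"},
        "model": {},
        "system": {},
    }
    print(missing_sections(partial))
```

In practice the dictionary would come from parsing src/config.yaml with a YAML library such as PyYAML (`yaml.safe_load`), which is assumed to be installed.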
Supported local backend names include:
qwen, llava, llava-onevision, llava_next_video, minicpm, internvl
API configs are provided separately, for example src/config_openai.yaml,
src/config_anthropic.yaml, and src/config_google.yaml.
Runtime outputs are intentionally ignored by Git:
- results/
- reports/
- memory_storage/
- model_cache/
- batch_files/
- batch_frames/
- batch_state/
- src/configs/
Saved-result analysis commonly uses:
- results_1/api/
- results_1/internvl/
- results_1/qwen/
- results_1/minicpm/
- results_1/llava_next_video/
- results_1/memory/
- results_1/router/
MIT License