oopb/SysStream

DualStreamBench: A Dual-System Benchmark for Streaming Video Understanding

DualStreamBench is a benchmark and evaluation framework for streaming video understanding under a dual-system setting. It evaluates how a fast S1 path, a slower S2 reasoning path, memory, a streaming visual cache, and a route scheduler interact under real video-stream constraints.

The repository directory is still named SysStream for compatibility with existing scripts and result paths.

What This Repository Contains

  • A five-part streaming video benchmark: proactive, clue, realtime, reasoning, and sequential.
  • Dual-system execution: S1 answers from a short window, while S2 can use a longer context window and memory retrieval.
  • Route scheduling: rule, semantic, API/local-router, force-S2, and offline rerun/simulation paths.
  • Streaming execution: asynchronous video reading, frame-window construction, visual embedding cache, and optional KV reuse.
  • Local and API model backends: Qwen, LLaVA, LLaVA-NeXT-Video, MiniCPM, InternVL, OpenAI, Anthropic, and Google.
  • Analysis scripts for routing, result completeness, router comparison, memory ablation, and compact table generation.
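To make the dual-system idea concrete, here is a minimal sketch of how a short S1 window, a longer S2 window, and a rule-based route could fit together. It is not the repository's implementation: the window sizes, the cue list, and the names `latest_window`, `rule_route`, and `build_context` are all hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Deque, Dict, List

# Hypothetical window sizes; the real values are set in the config file.
S1_WINDOW = 4    # frames visible to the fast path
S2_WINDOW = 16   # frames visible to the slower reasoning path


def latest_window(frames: Deque[int], size: int) -> List[int]:
    """Return the most recent `size` frames (fewer if the stream is young)."""
    return list(frames)[-size:]


def rule_route(question: str) -> str:
    """Toy rule router: questions with reasoning cues go to S2, the rest to S1."""
    cues = ("why", "how many", "before", "after")
    return "S2" if any(cue in question.lower() for cue in cues) else "S1"


def build_context(frames: Deque[int], question: str) -> Dict[str, object]:
    """Pick a route, then cut the frame window that route is allowed to see."""
    route = rule_route(question)
    size = S2_WINDOW if route == "S2" else S1_WINDOW
    return {"route": route, "frames": latest_window(frames, size)}


# Simulate a stream of 20 frame indices held in a bounded buffer.
buffer: Deque[int] = deque(maxlen=S2_WINDOW)
for frame_idx in range(20):
    buffer.append(frame_idx)

fast = build_context(buffer, "What is on the table?")
slow = build_context(buffer, "Why did the person leave?")
print(fast["route"], fast["frames"])
print(slow["route"], len(slow["frames"]))
```

In this toy version the route only changes how much visual context is used. The semantic and API/local-router schedulers listed above would replace `rule_route` with a learned or remote decision.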

Setup

Install the Python dependencies in the environment you use for evaluation:

pip install -r requirements.txt

For local model/router experiments, the project has usually been run in the SysStream conda environment:

conda run -n SysStream python <script.py>

Model weights and sentence-transformer caches are expected under local cache directories such as model_cache/ or the paths configured in src/config.yaml.

Dataset Layout

The merged benchmark files live under datasets/final/:

  • datasets/final/all.json
  • datasets/final/task_sets/*.json
  • datasets/final/shards/*.json
  • datasets/final/shards_mini/*.json
  • datasets/final/memory_ablation/*.json

The video root used by evaluation is set in the dataset section of the config file (for example src/config.yaml):

dataset:
  data_dir: ../datasets/data
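The relative value `../datasets/data` raises the question of what it is relative to. The sketch below assumes it is resolved against the directory that holds the config file, which is one common convention. The repository may instead resolve it against the working directory, so treat this as an illustration only. `resolve_data_dir` is a hypothetical helper, not a function from the codebase.

```python
import posixpath
from pathlib import PurePosixPath


def resolve_data_dir(config_path: str, data_dir: str) -> str:
    """Resolve `data_dir` against the config file's directory (assumed convention).

    Absolute values are returned unchanged apart from normalization.
    """
    base = PurePosixPath(config_path).parent
    return posixpath.normpath(str(base / data_dir))


resolved = resolve_data_dir("src/config.yaml", "../datasets/data")
print(resolved)
```

Under this assumption, a config stored in src/ points at a datasets/data directory in the repository root, next to datasets/final/.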

Running Evaluation

Run a single config-driven evaluation from the repository root:

python src/eval_from_config.py --config src/config.yaml

Useful overrides:

python src/eval_from_config.py --config src/config.yaml --max_samples 5
python src/eval_from_config.py --config src/config.yaml --mode stream
python src/eval_from_config.py --config src/config.yaml --benchmark clue
python src/eval_from_config.py --config src/config.yaml --resume
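As a rough picture of what resuming a run involves, the sketch below skips samples that already have a saved result and then applies a sample cap. The result layout (a JSON list of records with an `id` field) and the helper names are assumptions made for the example. The actual result schema and the exact semantics of `--resume` and `--max_samples` are defined by the repository code.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional, Set


def load_done_ids(result_path: str) -> Set[str]:
    """Collect IDs of samples that already have a saved result (assumed schema)."""
    if not os.path.exists(result_path):
        return set()
    with open(result_path, "r", encoding="utf-8") as f:
        return {record["id"] for record in json.load(f)}


def pending_samples(samples: List[Dict], result_path: str,
                    max_samples: Optional[int] = None) -> List[Dict]:
    """Drop finished samples, then optionally cap how many are evaluated."""
    done = load_done_ids(result_path)
    todo = [s for s in samples if s["id"] not in done]
    return todo if max_samples is None else todo[:max_samples]


# Demo: six samples, two of which were finished by an earlier run.
samples = [{"id": f"q{i}"} for i in range(6)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"id": "q0", "answer": "A"}, {"id": "q1", "answer": "B"}], f)
    remaining = pending_samples(samples, path, max_samples=3)

remaining_ids = [s["id"] for s in remaining]
print(remaining_ids)
```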

Run sharded evaluation with generated configs and tmux sessions:

python run_all.py --dry-run
python run_all.py

run_all.py owns the generated shard configs under src/configs/ and launches the shard jobs. The generated shard configs are not the source of truth, so do not edit them directly; change src/config.yaml or the generator path instead.

API Batch Evaluation

The batch workflow is controlled by top-level variables in src/batch_eval.py, not CLI flags:

cd src
python batch_eval.py

Important variables include:

  • CONFIG_PATH
  • DATASET
  • DATASET_SHARD
  • EVAL_MODE
  • BATCH_DRY_RUN
  • POLL_INTERVAL
  • POLL_TIMEOUT
  • CLEAN_START

Batch outputs and state are written under batch_files/, batch_state/, batch_frames/, and configured result paths.
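The POLL_INTERVAL and POLL_TIMEOUT variables suggest a poll-until-done loop around the provider's batch API. The following is a minimal sketch of such a loop, not the code in src/batch_eval.py: the status strings, the unit (seconds), the example values, and the `wait_for_batch` helper are all assumptions, and the status callback stands in for a real API call.

```python
import time
from typing import Callable

# Illustrative values in seconds; the real defaults live in src/batch_eval.py.
POLL_INTERVAL = 0.01
POLL_TIMEOUT = 1.0


def wait_for_batch(get_status: Callable[[], str],
                   interval: float = POLL_INTERVAL,
                   timeout: float = POLL_TIMEOUT) -> str:
    """Poll `get_status` until it reports a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            return "timeout"
        time.sleep(interval)


# Fake backend that reports "in_progress" twice before completing.
_states = iter(["in_progress", "in_progress", "completed"])
final_status = wait_for_batch(lambda: next(_states))

# A job that never finishes ends with "timeout" instead of blocking forever.
stuck_status = wait_for_batch(lambda: "in_progress", interval=0.01, timeout=0.05)
print(final_status, stuck_status)
```

Using `time.monotonic()` for the deadline keeps the timeout independent of system clock adjustments during long-running batch jobs.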

Reports And Analysis

The benchmark report entrypoints scan existing result files and write reports:

python src/benchmark/run_eval.py
python src/benchmark/run_batch_report.py

Repository-local analysis scripts are grouped under scripts/:

scripts/
├── routing/        # scheduler simulation, misroute analysis, route reruns
├── result_tools/   # result completeness checks and S1/S2 score exports
├── reporting/      # router comparison, backend tables, memory ablation reports
├── experiments/    # small runtime experiments such as prefill_stride tests
└── tools/          # model/cache utilities

Common commands:

python scripts/routing/analyze_scheduler_misroutes.py
python scripts/routing/simulate_scheduler_policy.py
python scripts/routing/rerun_scheduler_routes.py

python scripts/result_tools/check_results_completeness.py
python scripts/result_tools/export_s1_s2_case_scores.py

python scripts/reporting/compare_router_results.py
python scripts/reporting/export_backend_router_table.py
python scripts/reporting/compare_memory_ablation_results.py

python scripts/experiments/run_stride_test.py
python scripts/tools/cache_retrieval_model.py
python scripts/tools/download_modelscope.py

The script directory names intentionally avoid reports and results, because .gitignore treats those names as runtime-output directories and would ignore the scripts.

Configuration Notes

Primary configuration is in src/config.yaml.

Key sections:

  • dataset: benchmark JSON and video data roots.
  • model: backend type, model path, dtype, quantization, GPU placement.
  • video: FPS, max frames, and resize policy.
  • system: S1/S2 windows, structured output, S2 memory behavior, and scheduler.
  • system.streaming_embed_cache: streaming visual cache, prefill stride, KV reuse, and context clipping.
  • memory: short-term/long-term memory and retrieval encoder settings.
  • stream: streaming runtime and reader settings.
  • output: result directory, file name, resume behavior, and compact saving.
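When editing or generating configs, a quick structural check can catch a missing section before a long run starts. The sketch below works on an already-parsed config dictionary and only checks for the section names listed above. Whether every section is actually required is an assumption, and `missing_sections` is a hypothetical helper rather than part of the repository.

```python
from typing import Any, Dict, List

# Top-level sections named in this README; treating all of them as required
# is an assumption made for this example.
EXPECTED_SECTIONS = ("dataset", "model", "video", "system",
                     "memory", "stream", "output")


def missing_sections(config: Dict[str, Any]) -> List[str]:
    """Return the expected sections that are absent from a parsed config."""
    missing = [name for name in EXPECTED_SECTIONS if name not in config]
    system = config.get("system") or {}
    # The streaming cache settings are nested under `system`.
    if "system" in config and "streaming_embed_cache" not in system:
        missing.append("system.streaming_embed_cache")
    return missing


# A deliberately incomplete config, as it might look after parsing the YAML.
example = {
    "dataset": {"data_dir": "../datasets/data"},
    "model": {},
    "video": {},
    "system": {},
    "output": {},
}
problems = missing_sections(example)
print(problems)
```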

Supported local backend names include:

qwen, llava, llava-onevision, llava_next_video, minicpm, internvl

API configs are provided separately, for example src/config_openai.yaml, src/config_anthropic.yaml, and src/config_google.yaml.

Output Directories

Runtime outputs are intentionally ignored by Git:

  • results/
  • reports/
  • memory_storage/
  • model_cache/
  • batch_files/
  • batch_frames/
  • batch_state/
  • src/configs/

Saved-result analysis commonly uses:

  • results_1/api/
  • results_1/internvl/
  • results_1/qwen/
  • results_1/minicpm/
  • results_1/llava_next_video/
  • results_1/memory/
  • results_1/router/

License

MIT License
