Yajie Li*, Bozhou Zhang*, Chun Gu, Zipei Ma, Jiahui Zhang, Jiankang Deng, Xiatian Zhu, Li Zhang
Main MoLA environment:
conda create -n mola python==3.10
conda activate mola
pip install setuptools==57.5.0
git clone --recurse-submodules https://github.com/mees/calvin.git
cd calvin
sh install.sh
cd MoLA_PATH
pip install -r requirements.txt
pip install "numpy<2" --force-reinstall
pip uninstall -y torch torchvision torchaudio
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

IDM training environment:
cd idms
conda create -n mola-idm python=3.8
conda activate mola-idm
pip install -r requirements.txt
cd ..

Benchmark environments:
git clone --recurse-submodules https://github.com/mees/calvin.git
cd calvin
sh install.sh
cd ..
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .
cd ..

MoLA first pre-encodes robot videos into SVD VAE latents, then fine-tunes the video imagination model on the cached latent videos.
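The commands in this step read their paths from environment variables (VIDEO_DATASET_DIR, LIBERO_SOURCE_DIR, LATENT_OUTPUT_DIR, SVD_MODEL_PATH). Because the shell silently expands an unset variable to an empty string, a small preflight check can catch a missing export before any encoding or training starts. The sketch below is one possible way to do that; the helper name `require_env` is hypothetical and not part of the MoLA repository.

```shell
# Hypothetical helper (not part of MoLA): report every named variable
# that is unset or empty, and return non-zero if any is missing.
require_env() {
    missing=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call before running the latent-preparation script:
#   require_env VIDEO_DATASET_DIR LIBERO_SOURCE_DIR LATENT_OUTPUT_DIR SVD_MODEL_PATH || exit 1
```

The check only confirms that each variable is non-empty; it does not verify that the paths exist.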
Download pre-extracted video-latent data. The latents are available on Hugging Face: yjguo/vpp_svd_latent.
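Prepare LIBERO video-latent data. Download the official LIBERO HDF5 datasets from Hugging Face (yifengzhu-hf/LIBERO-datasets), then encode them. VIDEO_DATASET_DIR must be exported first, because LATENT_OUTPUT_DIR is derived from it: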
export LIBERO_SOURCE_DIR=/path/to/LIBERO-datasets
export LATENT_OUTPUT_DIR="$VIDEO_DATASET_DIR/libero"
export SVD_MODEL_PATH=stabilityai/stable-video-diffusion-img2vid
python step1_prepare_latent_libero.py

Set train_args.dataset_dir in video_conf/*.yaml, or use:
export VIDEO_DATASET_DIR=/path/to/opensource_robotdata

Train the CALVIN video model:
accelerate launch --main_process_port 29506 step1_train_svd.py \
--config video_conf/train_calvin_svd.yaml \
train_args.clip_model_path=/path/to/clip-vit-base-patch32

Train the mixed CALVIN + LIBERO video model:
accelerate launch --main_process_port 29506 step1_train_svd.py \
--config video_conf/train_calvin_libero_svd.yaml \
train_args.clip_model_path=/path/to/clip-vit-base-patch32

Pretrained video model: a model fine-tuned on STH-v2, Open X-Embodiment, and CALVIN ABC videos is available on Hugging Face: yjguo/svd-robot-calvin-ft.
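To use the pretrained video model instead of training one, the checkpoint can be fetched with the Hugging Face CLI. The following is a minimal sketch, assuming `huggingface_hub` is installed so that `huggingface-cli` is on the PATH; the wrapper name `fetch_video_model`, the `DRY_RUN` switch, and the default destination directory are all hypothetical. Whether the downloaded directory can be passed directly as `/path/to/video_model` in the later commands is also an assumption to verify against the repository.

```shell
# Hypothetical wrapper (not part of MoLA): download the pretrained video model.
# With DRY_RUN=1 the command is only printed, which is useful for checking
# the destination path before starting a large download.
fetch_video_model() {
    dest=${1:-./checkpoints/svd-robot-calvin-ft}
    # Build the command as positional parameters so paths with spaces survive.
    set -- huggingface-cli download yjguo/svd-robot-calvin-ft --local-dir "$dest"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example:
#   DRY_RUN=1 fetch_video_model /data/models/svd-robot-calvin-ft
#   fetch_video_model /data/models/svd-robot-calvin-ft
```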
IDM training: see idms/README.md for details.

Action model training on CALVIN:
bash scripts/train_calvin_stage3.sh \
/path/to/calvin/task_ABC_D \
/path/to/video_model \
openai/clip-vit-base-patch32 \
8 \
/path/to/idm_flow.pt \
/path/to/idm_depth.pt \
/path/to/idm_semantic.pt \
false \
32

LIBERO:
First train the action model on libero_90:
bash scripts/train_libero_stage3.sh \
/path/to/libero_calvin_style/libero_90 \
/path/to/video_model \
openai/clip-vit-base-patch32 \
8 \
/path/to/idm_flow.pt \
/path/to/idm_depth.pt \
/path/to/idm_semantic.pt \
false \
32

Then train one separate model for each LIBERO suite:
for SUITE in libero_spatial libero_object libero_goal libero_10; do
    bash scripts/train_libero_stage3.sh \
        "/path/to/libero_calvin_style/${SUITE}" \
        /path/to/video_model \
        openai/clip-vit-base-patch32 \
        8 \
        /path/to/idm_flow.pt \
        /path/to/idm_depth.pt \
        /path/to/idm_semantic.pt \
        false \
        32
done

Rollout on CALVIN:
bash scripts/rollout_calvin.sh \
/path/to/video_model \
/path/to/action_model \
/path/to/clip-vit-base-patch32 \
"$CALVIN_DATASET_DIR" \
8

LIBERO:
bash scripts/rollout_libero.sh \
/path/to/action_model_dir \
/path/to/video_model \
/path/to/clip-vit-base-patch32 \
/path/to/LIBERO \
8 \
[SUITE=libero_goal]

@article{li2026imagined,
title={From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation},
author={Li, Yajie and Zhang, Bozhou and Gu, Chun and Ma, Zipei and Zhang, Jiahui and Deng, Jiankang and Zhu, Xiatian and Zhang, Li},
journal={arXiv preprint arXiv:2605.12167},
year={2026}
}