From Imagined Futures to Executable Actions:

Mixture of Latent Actions for Robot Manipulation

ICML 2026

Yajie Li*, Bozhou Zhang*, Chun Gu, Zipei Ma, Jiahui Zhang, Jiankang Deng, Xiatian Zhu, Li Zhang

Model Overview

Environment

Main MoLA environment:

conda create -n mola python=3.10
conda activate mola

pip install setuptools==57.5.0
git clone --recurse-submodules https://github.com/mees/calvin.git
cd calvin
sh install.sh

cd MoLA_PATH
pip install -r requirements.txt
pip install "numpy<2" --force-reinstall

pip uninstall -y torch torchvision torchaudio
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
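After installation, it can help to confirm that the pins above actually took effect, since a later `pip` operation may replace a forced reinstall. The sketch below is not part of the repository: `version_matches`, `numpy_is_v1`, and `installed_version` are hypothetical helper names, and the checks only compare version strings.

```shell
#!/bin/sh
# Hypothetical helpers (not part of MoLA) for checking the version pins above.

# Succeeds when a reported version equals the pin, ignoring a local
# suffix such as "+cu121" that PyTorch wheels append.
version_matches() {
  reported=$1
  pinned=$2
  case "$reported" in
    "$pinned" | "$pinned"+*) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds when a NumPy version string has major version 1,
# the practical case of the "numpy<2" pin.
numpy_is_v1() {
  case "$1" in
    1.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Query the active interpreter; prints "missing" if the import fails.
installed_version() {
  python -c "import $1; print($1.__version__)" 2>/dev/null || echo missing
}
```

Inside the activated `mola` environment, a check such as `version_matches "$(installed_version torch)" 2.1.0 || echo "torch pin not satisfied"` reports a mismatch without stopping the shell.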

IDM training environment:

cd idms
conda create -n mola-idm python=3.8
conda activate mola-idm
pip install -r requirements.txt
cd ..

Benchmark environments:

git clone --recurse-submodules https://github.com/mees/calvin.git
cd calvin
sh install.sh
cd ..

git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .
cd ..

Training

Step 1: Video Imagination Model

MoLA first pre-encodes robot videos into SVD VAE latents, then fine-tunes the video imagination model on the cached latent videos.

Step 1.1: Prepare Latent Videos

Download pre-extracted video-latent data from Hugging Face: yjguo/vpp_svd_latent.

To prepare LIBERO video-latent data, download the official LIBERO HDF5 datasets from Hugging Face (yifengzhu-hf/LIBERO-datasets), then encode them with the SVD VAE:

export VIDEO_DATASET_DIR=/path/to/opensource_robotdata
export LIBERO_SOURCE_DIR=/path/to/LIBERO-datasets
export LATENT_OUTPUT_DIR="$VIDEO_DATASET_DIR/libero"
export SVD_MODEL_PATH=stabilityai/stable-video-diffusion-img2vid

python step1_prepare_latent_libero.py

VIDEO_DATASET_DIR must be exported before LATENT_OUTPUT_DIR, which is derived from it. For training, either keep VIDEO_DATASET_DIR exported as above or set train_args.dataset_dir in video_conf/*.yaml.
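Because `LATENT_OUTPUT_DIR` is built from `VIDEO_DATASET_DIR`, an unset `VIDEO_DATASET_DIR` silently expands to `/libero` rather than failing. A small pre-flight check can catch this before the encoding script runs. The sketch below is an illustration only: `missing_vars` is a hypothetical helper, not something the repository provides.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of MoLA): list required variables
# that are unset or empty before running step1_prepare_latent_libero.py.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
}

missing=$(missing_vars VIDEO_DATASET_DIR LIBERO_SOURCE_DIR LATENT_OUTPUT_DIR SVD_MODEL_PATH)
if [ -n "$missing" ]; then
  echo "Unset variables:" $missing >&2
else
  echo "All latent-preparation variables are set."
fi
```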

Step 1.2: Train the Video Model

Train the CALVIN video model:

accelerate launch --main_process_port 29506 step1_train_svd.py \
  --config video_conf/train_calvin_svd.yaml \
  train_args.clip_model_path=/path/to/clip-vit-base-patch32

Train the mixed CALVIN + LIBERO video model:

accelerate launch --main_process_port 29506 step1_train_svd.py \
  --config video_conf/train_calvin_libero_svd.yaml \
  train_args.clip_model_path=/path/to/clip-vit-base-patch32
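The two launch commands above differ only in the config file. If you switch between them often, a thin wrapper that prints the command before running it can reduce copy-paste mistakes. This is a sketch under assumptions: the function names and the dataset labels `calvin` and `calvin_libero` are invented here, while the config paths and port are the ones shown above.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of MoLA): map a dataset label to the
# config files named above and print the launch command without running it.
video_config_for() {
  case "$1" in
    calvin) echo "video_conf/train_calvin_svd.yaml" ;;
    calvin_libero) echo "video_conf/train_calvin_libero_svd.yaml" ;;
    *) echo "unknown dataset: $1" >&2; return 1 ;;
  esac
}

print_video_train_cmd() {
  dataset=$1
  clip_path=$2
  # Abort if the label does not map to a known config.
  config=$(video_config_for "$dataset") || return 1
  echo "accelerate launch --main_process_port 29506 step1_train_svd.py" \
    "--config $config train_args.clip_model_path=$clip_path"
}

print_video_train_cmd calvin /path/to/clip-vit-base-patch32
```

Printing rather than executing keeps the sketch safe to run anywhere; once the output looks right, paste it into the shell.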

Pretrained video model: a video model fine-tuned on STH-v2, Open X-Embodiment, and CALVIN ABC videos is available on Hugging Face: yjguo/svd-robot-calvin-ft.

Step 2: Mixture of Inverse Dynamics Models

See idms/README.md for details.

Step 3: Action Model

CALVIN:

bash scripts/train_calvin_stage3.sh \
  /path/to/calvin/task_ABC_D \
  /path/to/video_model \
  openai/clip-vit-base-patch32 \
  8 \
  /path/to/idm_flow.pt \
  /path/to/idm_depth.pt \
  /path/to/idm_semantic.pt \
  false \
  32
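The command above takes three IDM checkpoints from Step 2 as positional arguments. Checking that they exist before launching can save a failed multi-GPU start. The following is a minimal sketch, not repository code: `missing_files` is a hypothetical helper, and the paths are the placeholders from the command above.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of MoLA): print every path
# argument that is not an existing regular file.
missing_files() {
  for path in "$@"; do
    [ -f "$path" ] || printf '%s\n' "$path"
  done
}

missing=$(missing_files /path/to/idm_flow.pt /path/to/idm_depth.pt /path/to/idm_semantic.pt)
if [ -n "$missing" ]; then
  echo "Missing IDM checkpoints:" >&2
  echo "$missing" >&2
fi
```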

LIBERO:

First train the action model on libero_90:

bash scripts/train_libero_stage3.sh \
  /path/to/libero_calvin_style/libero_90 \
  /path/to/video_model \
  openai/clip-vit-base-patch32 \
  8 \
  /path/to/idm_flow.pt \
  /path/to/idm_depth.pt \
  /path/to/idm_semantic.pt \
  false \
  32

Then train one separate model for each LIBERO suite:

for SUITE in libero_spatial libero_object libero_goal libero_10; do
  bash scripts/train_libero_stage3.sh \
    "/path/to/libero_calvin_style/${SUITE}" \
    /path/to/video_model \
    openai/clip-vit-base-patch32 \
    8 \
    /path/to/idm_flow.pt \
    /path/to/idm_depth.pt \
    /path/to/idm_semantic.pt \
    false \
    32
done
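The loop above runs the four suites back to back and does not say which ones failed. One possible variant, sketched here with hypothetical names (`run_suites`, `train_one`), keeps going after a failure and prints the failed suites at the end. The stand-in `train_one` only echoes; its body would be replaced by the real `scripts/train_libero_stage3.sh` call.

```shell
#!/bin/sh
# Hypothetical helper (not part of MoLA): call a per-suite command for each
# suite in the loop above, keep going on failure, and print the failed suites.
run_suites() {
  failed=""
  for suite in libero_spatial libero_object libero_goal libero_10; do
    if ! "$@" "$suite"; then
      failed="$failed $suite"
    fi
  done
  # Trim the leading space before printing.
  printf '%s\n' "${failed# }"
}

# Stand-in for the real training call; replace the body with the
# scripts/train_libero_stage3.sh invocation shown above, using "$1"
# as the suite name in the dataset path.
train_one() {
  echo "would train on $1" >&2
}

run_suites train_one
```

An empty line on standard output means every suite returned success; otherwise the output lists the suites to rerun.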

Evaluation

CALVIN:

bash scripts/rollout_calvin.sh \
  /path/to/video_model \
  /path/to/action_model \
  /path/to/clip-vit-base-patch32 \
  "$CALVIN_DATASET_DIR" \
  8

LIBERO:

bash scripts/rollout_libero.sh \
  /path/to/action_model_dir \
  /path/to/video_model \
  /path/to/clip-vit-base-patch32 \
  /path/to/LIBERO \
  8 \
  [SUITE=libero_goal]

Acknowledgements

Citation

@article{li2026imagined,
  title={From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation},
  author={Li, Yajie and Zhang, Bozhou and Gu, Chun and Ma, Zipei and Zhang, Jiahui and Deng, Jiankang and Zhu, Xiatian and Zhang, Li},
  journal={arXiv preprint arXiv:2605.12167},
  year={2026}
}
