
🤖 Awesome Embodied Robotics and Agent

This is a curated list of research on embodied robotics and agents built with Vision-Language Models (VLMs) and Large Language Models (LLMs), maintained by haonan.

Watch this repository for the latest updates, and feel free to open pull requests if you find interesting papers!

News 🔥

[2025/10/30] 🎉 Our survey paper "A Survey on Efficient Vision-Language-Action Models" [arXiv] has been released!
[2025/04/23] Added π-0.5, a lightweight and modular framework designed to integrate perception, control, and learning directly within physical systems.
[2025/03/18] Added some popular vision-language-action (VLA) models. 🦾
[2024/06/28] Created a new section on agent self-evolution research. 🤖
[2024/06/07] Added Mobile-Agent-v2, a mobile device operation assistant that navigates effectively via multi-agent collaboration. 🚀
[2024/05/13] Added "Learning Interactive Real-World Simulators", winner of an outstanding paper award at ICLR 2024. 🥇
[2024/04/24] Added "A Survey on Self-Evolution of Large Language Models", a systematic survey on self-evolution in LLMs! 💥
[2024/04/16] Added some CVPR 2024 papers.
[2024/04/15] Added MetaGPT, accepted for oral presentation (top 1.2%) at ICLR 2024 and ranked #1 in the LLM-based Agent category. 🚀
[2024/03/13] Added CRADLE, an interesting paper exploring LLM-based agents in Red Dead Redemption II! 🎮

Development of Embodied Robotics and Benchmarks

Figure 1. The Organization of Our Survey. We systematically categorize efficient VLAs into three core pillars: (1) Efficient Model Design, encompassing efficient architectures and model compression techniques, (2) Efficient Training, covering efficient pre-training and post-training strategies, and (3) Efficient Data Collection, including efficient data collection and augmentation methods. The framework also reviews VLA foundations, key applications, challenges, and future directions, establishing the groundwork for advancing scalable embodied intelligence.

Table of Contents 🍃

Methods

Survey

Vision-Language-Action Model

Self-Evolving Agents

Advanced Agent Applications

LLMs with RL or World Model

Planning and Manipulation or Pretraining

Multi-Agent Learning and Coordination

Vision and Language Navigation

Detection

  • DetGPT: Detect What You Need via Reasoning [arXiv 2023]
    Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Jianhua Han, Hang Xu, Lingpeng Kong, Tong Zhang
    The Hong Kong University of Science and Technology; The University of Hong Kong; Shanghai Jiao Tong University

3D Grounding

Interactive Embodied Learning

Rearrangement

Benchmark

Simulator

Others

Acknowledgments

Thanks to everyone who has contributed to this repository! Special thanks to those who submitted PRs with solid work; your efforts make this project better and stronger. 🚀✨
