RoboMIND 2.0
A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence
Chengkai Hou1,2,*,‡, Kun Wu1,*,‡, Jiaming Liu2,*,‡, Zhengping Che1,*,†, Di Wu1,2,*,
Fei Liao1,2,*, Guangrun Li1,2,*, Jingyang He1,2,*, Qiuxuan Feng1,2,*, Zhao Jin1,*,
Chenyang Gu2, Zhuoyang Liu2, Nuowei Han2, Xiangju Mi2, Yaoxu Lv2,
Yankai Fu2, Gaole Dai2, Langzhe Gu2, Tao Li1, Yuheng Zhang1, Xinhua Wang1,
Shichao Fan1, Yixue Zhang1, Meng Li1, Zhen Zhao1, Ning Liu1,
Zhiyuan Xu1, Pei Ren1, Junjie Ji1, Haonan Liu1,
Kuan Cheng2, Shanghang Zhang2,✉, Jian Tang1,✉
1Beijing Innovation Center of Humanoid Robotics   2School of Computer Science, Peking University
*Equal contribution   ‡Co-first authors   †Project lead   ✉Corresponding authors
Abstract.

While data-driven imitation learning has revolutionized robotic manipulation, current approaches remain constrained by the scarcity of large-scale, diverse real-world demonstrations. Consequently, the ability of existing models to generalize across long-horizon bimanual tasks and mobile manipulation in unstructured environments remains limited.

To bridge this gap, we present RoboMIND 2.0, a comprehensive real-world dataset comprising over 310k dual-arm manipulation trajectories collected across six distinct robot embodiments and 739 complex tasks. Crucially, to support research in contact-rich and spatially extended tasks, the dataset incorporates 12k tactile-enhanced episodes and 20k mobile manipulation trajectories. Complementing this physical data, we construct high-fidelity digital twins of our real-world environments, releasing an additional 20k-trajectory simulated dataset to facilitate robust sim-to-real transfer.

To fully exploit the potential of RoboMIND 2.0, we propose the MIND-2 system, a hierarchical dual-system framework optimized via offline reinforcement learning. MIND-2 integrates a high-level semantic planner (MIND-2-VLM), which decomposes abstract natural language instructions into grounded subgoals, with a low-level Vision-Language-Action executor (MIND-2-VLA), which generates precise, proprioception-aware motor actions. Extensive evaluations across six distinct robotic embodiments validate the effectiveness of our dataset and demonstrate that the MIND-2 system significantly outperforms four single-task baselines (covering both 2D image and 3D point cloud modalities) as well as four state-of-the-art VLA models. Furthermore, we observe that integrating tactile modalities yields measurable gains in fine-grained manipulation tasks.

Finally, experimental results show that mixing real and simulated data during training consistently enhances physical execution performance, validating both the fidelity of our simulation benchmarks and the cost-efficiency of synthetic data augmentation. Our full dataset, simulation assets, and training code are publicly released to advance research in general-purpose robotic manipulation.

We introduce RoboMIND 2.0, a large-scale dataset comprising 310K dual-arm trajectories collected from six heterogeneous robot embodiments, totaling over 1,000 hours. The dataset features rich modalities, including 12K tactile-enriched sequences and 20K mobile manipulation trajectories. Collected through a unified teleoperation and quality-assurance pipeline, RoboMIND 2.0 ensures consistent proprioception and provides fine-grained natural language annotations. To support scalable training and evaluation, we release digital-twin USD assets and 20K simulation trajectories aligned with real-world tasks. Building on this foundation, we propose MIND-2, a dual-system controller that integrates a slow high-level planner (MIND-2-VLM) with a fast low-level policy (MIND-2-VLA), enabling robust long-horizon mobile manipulation across diverse scenarios.
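The fast-slow structure above can be sketched as a minimal control loop: the slow system replans subgoals at low frequency while the fast system issues motor actions at high frequency. This is an illustrative sketch only, not the released MIND-2 implementation; the class names, method signatures, stub planner output, and step counts are all assumptions.

```python
# Illustrative sketch of a fast-slow (dual-system) controller loop.
# Planner/policy internals, names, and rates are assumptions, not the
# released MIND-2 code.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SlowPlanner:
    """Stands in for MIND-2-VLM: maps an instruction to grounded subgoals."""

    def plan(self, instruction: str) -> List[str]:
        # A real VLM planner would decompose the instruction semantically;
        # here we stub three ordered subgoals.
        return [f"{instruction}::subgoal_{i}" for i in range(3)]


@dataclass
class FastPolicy:
    """Stands in for MIND-2-VLA: maps (subgoal, observation) to an action."""

    def act(self, subgoal: str, obs: Dict) -> Dict:
        # Placeholder 7-DoF joint-delta action conditioned on the subgoal.
        return {"subgoal": subgoal, "joint_delta": [0.0] * 7}


def run_episode(instruction: str, steps_per_subgoal: int = 5) -> List[Dict]:
    """Slow system plans once per subgoal; fast system acts every step."""
    planner, policy = SlowPlanner(), FastPolicy()
    actions = []
    for subgoal in planner.plan(instruction):          # low-frequency planning
        for _ in range(steps_per_subgoal):             # high-frequency control
            obs = {"proprio": [0.0] * 7, "rgb": None}  # placeholder observation
            actions.append(policy.act(subgoal, obs))
    return actions
```

The design point is the frequency split: the expensive planner is queried only at subgoal boundaries, so the reactive policy can run at control rate in between.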

Collection Platform.

Collection platform of Franka and UR5e. We collect a robotic manipulation dataset by controlling the dual-arm systems (Franka and UR5e) via HACTS.

Collection platform of AgileX and ARX. We use a VR headset to control the ARX robot for data collection, and employ a master arm to teleoperate the slave arm for gathering the AgileX manipulation dataset.

Collection platform of Tien Kung and Tian Yi. For the humanoid robot Tien Kung, data collectors wear motion-capture suits to record joint movements, which are then mapped to the robot to enable manipulation. For Tian Yi, a dual-arm mobile robot with a wheeled base, we collect data using two human operators.

Real-world Setup.

Robotic real-world setup. For the Franka and UR5e robots, we use cameras positioned at the top, left, and right viewpoints to record the visual information of the task trajectories. For the humanoid robot Tien Kung and the wheeled dual-arm robot Tian Yi, we use their built-in RGB-D cameras to capture visual observations. For the AgileX and ARX robots, we use dual wrist-mounted cameras (one on each arm) as well as a head-mounted camera to capture visual information.

Real-World Experiments.

Performance comparison of single-task imitation learning methods and VLA models across different task categories.

Dual System Experiments.

Performance comparison across AgileX mobile manipulation tasks. The MIND-2 fast-slow system achieves significantly better performance across various tasks compared to both VLA models and single-task imitation learning methods.

Success rates across three collaborative tasks. MIND-2 (Post Training) is instantiated by fine-tuning InternVL3 and pi0.5 directly on data from the three multi-robot collaboration tasks. MIND-2 (Full-scale Training) is first pretrained on the full-scale mobile manipulation dataset using the fast-slow system architecture, and then further fine-tuned via post-training on data from the three multi-robot collaboration tasks. For MIND-2 (Offline RL), after full-scale training we apply Implicit Q-Learning (IQL) to perform offline reinforcement learning on MIND-2-VLA.
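For readers unfamiliar with IQL, its value update fits an upper expectile of the Q-function via an asymmetric squared loss, and the policy is then extracted with advantage weighting, which avoids querying out-of-sample actions on offline data. The sketch below shows these two standard IQL ingredients in NumPy; it is a generic illustration of the published algorithm, not the MIND-2 training code, and the default `tau`, `beta`, and clipping values are assumptions.

```python
# Sketch of the two core IQL terms (generic algorithm, not MIND-2's code).
import numpy as np


def expectile_loss(u: np.ndarray, tau: float = 0.7) -> np.ndarray:
    """Asymmetric L2 used by IQL's value update: L = |tau - 1{u<0}| * u^2,
    where u = Q(s, a) - V(s). With tau > 0.5, positive residuals are
    upweighted, so V(s) tracks an upper expectile of Q over dataset
    actions rather than a max over all (possibly unseen) actions."""
    weight = np.where(u < 0.0, 1.0 - tau, tau)
    return weight * u ** 2


def awr_weight(u: np.ndarray, beta: float = 3.0, clip: float = 100.0) -> np.ndarray:
    """Advantage-weighted policy extraction: exp(beta * (Q - V)), clipped
    for numerical stability. Actions with higher estimated advantage get
    larger weight in the behavior-cloning-style policy loss."""
    return np.minimum(np.exp(beta * u), clip)
```

Applied to MIND-2-VLA, such weights would rescale the action-imitation loss so that higher-advantage demonstration actions dominate the post-training update.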

BibTeX
@misc{hou2025robomind20multimodalbimanual,
      title={RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence}, 
      author={Chengkai Hou and Kun Wu and Jiaming Liu and Zhengping Che and Di Wu and Fei Liao and Guangrun Li and Jingyang He and Qiuxuan Feng and Zhao Jin and Chenyang Gu and Zhuoyang Liu and Nuowei Han and Xiangju Mi and Yaoxu Lv and Yankai Fu and Gaole Dai and Langzhe Gu and Tao Li and Yuheng Zhang and Yixue Zhang and Xinhua Wang and Shichao Fan and Meng Li and Zhen Zhao and Ning Liu and Zhiyuan Xu and Pei Ren and Junjie Ji and Haonan Liu and Kuan Cheng and Shanghang Zhang and Jian Tang},
      year={2025},
      eprint={2512.24653},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.24653}, 
}