Robowheel: A Data Engine from Real-World Human Demonstrations for Cross-Embodiment Robotic Learning

1 Tsinghua University 2 Synapath 3 The Chinese University of Hong Kong 4 The University of Hong Kong 5 The Hong Kong Polytechnic University
* Equal contribution   |   Corresponding authors: dsyuhong@gmail.com, gaozihanthu@gmail.com, adreamob@gmail.com, thu.lhchen@gmail.com
Teaser image of real-world robot manipulation

We present a data engine and training pipeline that directly leverage real-world manipulation behaviors, transfer effectively across diverse robot embodiments, and continually improve by iteratively absorbing rapidly accumulating physical interaction experience.

Abstract

We introduce Robowheel, a data engine that converts human hand-object interaction (HOI) videos into training-ready supervision for cross-morphology robotic learning. From monocular RGB/RGB-D inputs, we perform high-precision HOI reconstruction and enforce physical plausibility via a reinforcement learning (RL) optimizer that refines hand–object relative poses under contact and penetration constraints. The reconstructed, contact-rich trajectories are then retargeted to diverse embodiments, including robot arms with simple end-effectors, dexterous hands, and humanoids, yielding executable actions and rollouts. To scale coverage, we build a simulation-augmented framework on Isaac Sim with diverse domain randomization (embodiments, trajectories, object retrieval, background textures, hand motion mirroring), which enriches the distributions of trajectories and observations while preserving spatial relationships and physical plausibility. Together, these stages form an end-to-end pipeline from video → reconstruction → retargeting → augmentation → data acquisition. We validate the data on mainstream vision-language-action (VLA) and imitation learning architectures, demonstrating that trajectories produced by our pipeline are as stable as those from teleoperation and yield comparable continual performance gains. To our knowledge, this provides the first quantitative evidence that HOI modalities can serve as effective supervision for robotic learning. Compared with teleoperation, Robowheel is lightweight: a single monocular RGB(D) camera suffices to extract a universal, embodiment-agnostic motion representation that can be flexibly retargeted across embodiments. We further assemble a large-scale multimodal dataset combining multi-camera captures, monocular videos, and public HOI corpora for training and evaluating embodied models.

Pipeline Overview

We construct a reconstruction-retargeting-augmentation-training loop in Robowheel: the data engine directly leverages real-world manipulation behaviors, transfers effectively across diverse robot embodiments, and continually improves by iteratively absorbing rapidly accumulating physical interaction experience.

Overview figure of the Robowheel pipeline

RoboWheel first watches humans and objects move, then rebuilds that interaction in a clean physical world. It estimates whole-body or hand-only motion, reconstructs object geometry and 6-DoF pose, and maps everything into a canonical action space. A TSDF stage then optimizes the hand pose to remove interpenetration with the object, for example by solving
\[
\min_{\Theta_{\text{hand}}}\; \sum_{t} \Big[ \sum_{v \in \mathcal{V}_{\text{hand}}} \phi_o\!\big(T_{\text{hand}}(t)\,v\big)^2 + \lambda \big\| \Delta T_{\text{hand}}(t) \big\|^2 \Big],
\]
where \(\phi_o\) is the object TSDF, \(T_{\text{hand}}(t)\) is the hand transform at timestamp \(t\), and \(\Delta T_{\text{hand}}(t)=\log\!\big(T_{\text{hand}}(t)^{-1}T_{\text{hand}}(t+1)\big)\) denotes the incremental transform between consecutive frames. After this TSDF optimization, an RL policy in simulation nudges the 3D hand poses so that the final HOI sequence is physically plausible and dynamically smooth.
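
To make the objective concrete, below is a minimal PyTorch sketch of the penetration-plus-smoothness loss, assuming the object TSDF is available as a callable that returns signed distance values at query points. The function and variable names are illustrative (not from the released code), and the SE(3) log in the smoothness term is approximated by a simple matrix difference.

import torch

def hoi_refinement_loss(T_hand, hand_verts, phi_o, lam=1e-2):
    """Penetration + smoothness objective for a hand trajectory (illustrative sketch).

    T_hand:     (T, 4, 4) per-frame hand transforms (the variables being optimized).
    hand_verts: (V, 3) canonical hand vertices.
    phi_o:      callable mapping (N, 3) points to signed TSDF values of the object
                (negative inside the object).
    lam:        weight of the temporal smoothness term.
    """
    n_frames = T_hand.shape[0]
    # Homogeneous hand vertices, shared across frames.
    ones = torch.ones(hand_verts.shape[0], 1,
                      dtype=hand_verts.dtype, device=hand_verts.device)
    verts_h = torch.cat([hand_verts, ones], dim=1)                     # (V, 4)

    # Transform vertices by every per-frame hand transform.
    verts_obj = torch.einsum('tij,vj->tvi', T_hand, verts_h)[..., :3]  # (T, V, 3)

    # Contact/penetration term from the objective above: squared TSDF value at each
    # transformed vertex. A common variant penalizes only penetrating vertices,
    # e.g. torch.clamp(-sdf, min=0.0).pow(2).
    sdf = phi_o(verts_obj.reshape(-1, 3)).reshape(n_frames, -1)        # (T, V)
    penetration = sdf.pow(2).sum()

    # Smoothness term: the incremental transform Delta T is approximated here by the
    # difference of consecutive matrices; a proper SE(3) log map could be substituted.
    delta = T_hand[1:] - T_hand[:-1]                                   # (T-1, 4, 4)
    smoothness = delta.pow(2).sum()

    return penetration + lam * smoothness

In such a setup, the per-frame transforms would be refined with a standard gradient-based optimizer over this loss before the RL-based refinement stage described above.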

Next, RoboWheel puts these human skills onto robots. It retargets the refined actions to different robot arms, replays them in sim or on hardware to gather rich visual data and augment trajectories, then feeds that into VLA/IL models like Pi0, RDT, ACT, and Diffusion Policy. The same canonical actions can also be mapped to dexterous hands and humanoids, turning one human demonstration into skills that travel across embodiments and domains.
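
As a rough illustration of the retargeting step, the sketch below maps a refined canonical wrist-pose trajectory to joint and gripper commands for a generic arm with a parallel gripper. The IK call (solve_ik) is a placeholder for whatever solver a given embodiment provides, and the hand-aperture-to-gripper mapping is an assumption for illustration, not the paper's exact formulation.

import numpy as np

def retarget_to_arm(wrist_poses, hand_apertures, solve_ik,
                    flange_offset=np.eye(4), max_gripper_width=0.08):
    """Retarget a canonical hand trajectory to arm joint + gripper commands (sketch).

    wrist_poses:    (T, 4, 4) wrist poses in the canonical (object/world) frame.
    hand_apertures: (T,) thumb-index distances in meters from the HOI reconstruction.
    solve_ik:       placeholder callable, solve_ik(target_pose_4x4, seed) -> joint vector.
    flange_offset:  fixed transform from the wrist frame to the robot flange/TCP frame.
    """
    joint_traj, gripper_traj = [], []
    seed = None
    for T_wrist, aperture in zip(wrist_poses, hand_apertures):
        # Express the desired end-effector pose in the robot's TCP convention.
        target = T_wrist @ flange_offset
        # Solve IK, warm-starting from the previous solution for temporal smoothness.
        q = solve_ik(target, seed)
        seed = q
        # Map hand aperture to a normalized gripper-width command.
        width = float(np.clip(aperture, 0.0, max_gripper_width))
        joint_traj.append(q)
        gripper_traj.append(width / max_gripper_width)
    return np.asarray(joint_traj), np.asarray(gripper_traj)

The resulting joint and gripper trajectories can then be replayed in simulation or on hardware to render observations and paired with the augmented scenes for VLA/IL training.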

HOI Reconstruction Results

Scalability

BibTeX

@article{YourPaperKey2024,
  title={Your Paper Title Here},
  author={First Author and Second Author and Third Author},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}