The Art of Modalities in Human-Centric Data
Introducing ModalityNet
Compared with conventional AI systems, embodied intelligence requires a considerably richer and more diverse set of data modalities, encompassing vision, language, motion dynamics, tactile feedback, depth perception, mesh-level object tracking, scene reconstruction, audio signals, and beyond. Such multimodal embodied data are extremely scarce, and even systematic methods for acquiring them remain an open research problem.
We propose ModalityNet (https://modalitynet.com), named in tribute to ImageNet's seminal contribution to modern computer vision and aspiring to serve as its embodied-intelligence counterpart. ModalityNet is a large-scale, human-centric data benchmark covering the data modalities listed above.
We first introduce a four-level data pyramid structure of embodied AI (see Figure 0), which serves as the foundational design principle of ModalityNet.

Figure 0
In the data pyramid, as we move from the first layer to the last, the data becomes progressively less embodiment-specific, while its diversity and its generalization with respect to the real physical world scale up. Teleoperation data offer embodiment-specific, in-domain supervision: policies can be trained via direct behavioral cloning with no observation or action space gap. However, teleoperation data are expensive and inefficient to collect, and exhibit limited cross-embodiment generalization. Internet data and synthetic data are inherently multi-source and large-scale. They offer substantially greater diversity, but the discrepancy between their formats and the observation / action spaces of real embodiments is significant, and their precision is largely unconstrained, making direct alignment with embodied control particularly challenging.
To bridge the substantial gap between the top and bottom layers, ModalityNet focuses on acquiring two novel categories of human-centric data as intermediate layers: High-Precision Human-environment Interaction (HiPHI) data and human-centric In The Wild (ITW) data. HiPHI data are collected in dedicated data factories and feature ultra-high precision and multimodal synchronization, while ITW data target natural diversity through daily-life collection. Specifically, three datasets are defined:
• HiPHI – Motion with Object & Vision (HiPHI-MOV) targets short-horizon whole-body motion capture, together with ego-vision and interacted-object tracking. Its objective is to exhaustively span the feasible human motion manifold while preserving a structurally clean and physically consistent distribution. It is designed to systematically characterize the intrinsic dynamics of embodied movement, addressing the fundamental question of how the body moves.
• HiPHI – OmniModality (HiPHI-OM) is dedicated to full-modality embodied data acquisition of fine-grained dexterous manipulation, particularly under complex object interaction and high-precision operational settings. Its objective is to systematically collect and align modalities, encompassing vision, language, motion dynamics, tactile feedback, depth perception, object-level mesh tracking, scene reconstruction, and audio signals. It addresses the fundamental question of omni-modality acquisition for dexterous manipulation.
• In the Wild (ITW) aims at standardized daily-life data collection under unconstrained natural environments, featuring the most diverse scenarios, tasks, objects and long-tail behaviors. It addresses the problem of generalized physical commonsense under real-world data distributions.
The three datasets are strategically positioned along three orthogonal design dimensions of embodied data: HiPHI-MOV spans the motion space, HiPHI-OM captures the full spectrum of modalities, and ITW grounds the intrinsic complexity of real-world distributions. Below, we introduce each dataset in detail.
The HiPHI-MOV Data
Introduction. The HiPHI-MOV Dataset is a human-centric, high-fidelity multimodal corpus specifically engineered for developing robust locomotion and whole-body loco-manipulation policies. It includes full-body motion capture, tracking of interacting objects (if present), egocentric RGB-D visual data, and third-person RGB-D visual data. Full-body motion is captured with 17 body trackers and 13 trackers per hand, and is modeled and exported as a body BVH file with 21 end-effector 6-DOF poses plus, for each hand, a hand BVH file with 15 skeletal joint 6-DOF poses. The acquisition environment is deployed in large indoor spaces equipped with hybrid optical-inertial motion capture systems.
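To make the structure of a synchronized sample concrete, the following is a minimal Python sketch of how one HiPHI-MOV frame could be represented. The class names, field names, and shapes are illustrative assumptions rather than the dataset's actual schema, and real BVH parsing would rely on a dedicated library.

```python
# Minimal sketch of one synchronized HiPHI-MOV sample. All names and shapes
# are illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass

import numpy as np


@dataclass
class Pose6DOF:
    """A 6-DOF pose: 3D position plus orientation as a unit quaternion (w, x, y, z)."""
    position: np.ndarray    # shape (3,)
    quaternion: np.ndarray  # shape (4,)


@dataclass
class MOVFrame:
    """One time-aligned frame across the HiPHI-MOV modalities."""
    timestamp: float
    body_poses: list[Pose6DOF]        # 21 6-DOF poses from the body BVH
    left_hand_poses: list[Pose6DOF]   # 15 skeletal joint poses per hand
    right_hand_poses: list[Pose6DOF]
    ego_rgb: np.ndarray               # (H, W, 3) egocentric color image
    ego_depth: np.ndarray             # (H, W) egocentric depth map, metres
    object_pose: Pose6DOF | None      # mesh-level pose of the interacted object, if any


def validate_frame(frame: MOVFrame) -> None:
    """Basic structural checks before feeding a frame into a training pipeline."""
    assert len(frame.body_poses) == 21, "expected 21 body 6-DOF poses"
    assert len(frame.left_hand_poses) == 15 and len(frame.right_hand_poses) == 15
    assert frame.ego_rgb.shape[:2] == frame.ego_depth.shape, "RGB and depth must be aligned"
```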

Figure 1

Figure 2
Quality Assessment. As described above, HiPHI-MOV aims to span the whole-body motion space. To achieve this, we employ a set of embodiment-relevant verb primitives, guided by FrameNet [1] theory, that together cover the human whole-body motion space. For each primitive, we guide LLMs to generate performance scripts for data collection. Under this script-generation framework, the dataset is organized as illustrated in Figure 1.
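As a rough illustration of this script-generation step, the snippet below maps a handful of verb primitives to LLM prompts. The primitive list, the prompt wording, and the overall interface are assumptions made for illustration; the actual FrameNet-guided taxonomy and prompting pipeline are not reproduced here.

```python
# Illustrative sketch of turning verb primitives into LLM prompts for
# performance-script generation. The primitive list and prompt wording are
# assumptions; the real taxonomy and LLM interface are not shown.
VERB_PRIMITIVES = ["walk", "crouch", "carry", "push", "climb"]  # illustrative subset

PROMPT_TEMPLATE = (
    "You are designing a motion-capture session. Write a short performance script "
    "in which an actor demonstrates the whole-body action '{verb}' several times, "
    "varying speed, direction, and (if relevant) the interacted object."
)


def build_prompts(primitives: list[str]) -> dict[str, str]:
    """Map each embodiment-relevant verb primitive to a script-generation prompt."""
    return {verb: PROMPT_TEMPLATE.format(verb=verb) for verb in primitives}


if __name__ == "__main__":
    for verb, prompt in build_prompts(VERB_PRIMITIVES).items():
        print(f"--- {verb} ---\n{prompt}\n")
```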
A certain portion of the motion primitives corresponds to human self-motion that does not involve object interaction; recent systems such as NVIDIA's SONIC [2] have been trained on hundreds of hours of large-scale mocap data of this kind. Another substantial portion of primitives in our framework involves full-body interactions with objects, such as carrying; for these primitives, we simultaneously capture mesh-level object trajectories. HiPHI-MOV is intentionally designed for whole-body behavior. Although finger motions are naturally present, the manipulated objects are primarily everyday macroscopic items (e.g., tables, chairs, boxes), where palm-level manipulation suffices. Fine-grained, dexterous finger-level manipulation is instead systematically studied in the HiPHI-OM dataset.
We evaluate multiple quality dimensions of HiPHI-MOV data, including mesh penetration, floating artifacts, skating, and other physical inconsistencies. The quality evaluation results are shown in Figure 2. Currently, our data platform operates at a production capacity of approximately xxx hours per week.
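To give a concrete sense of what these physical-consistency checks involve, below is a small sketch of a foot-skating metric: while a foot is in contact with the ground, its horizontal displacement between frames should be near zero. The contact heuristic and thresholds are assumptions for the sketch, not the actual QA criteria used on the platform.

```python
# Illustrative foot-skating check. Thresholds and the height-based contact
# heuristic are assumptions, not the platform's actual QA criteria.
import numpy as np


def skating_ratio(foot_positions: np.ndarray,
                  contact_height: float = 0.03,
                  slip_tolerance: float = 0.005) -> float:
    """Fraction of in-contact frames in which the foot slides more than slip_tolerance metres.

    foot_positions: (T, 3) array of one foot's world positions over T frames (z is up).
    """
    heights = foot_positions[:, 2]
    in_contact = heights[:-1] < contact_height                       # contact at frame t
    horizontal_slip = np.linalg.norm(np.diff(foot_positions[:, :2], axis=0), axis=1)
    if in_contact.sum() == 0:
        return 0.0
    return float((horizontal_slip[in_contact] > slip_tolerance).mean())
```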
How to use HiPHI-MOV. HiPHI-MOV is well suited to motion-tracking-based locomotion or loco-manipulation learning. Such approaches use high-quality, diverse human motion trajectories as reference signals, construct imitation-based reward functions, and perform reinforcement learning (RL) in physics simulation. Scaling up this type of data enables generalization across the human motion space, with applications including whole-body teleoperation and terrain-adaptive loco-manipulation.
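As one possible instantiation of such an imitation-based reward, the sketch below computes a DeepMimic-style tracking reward against a reference trajectory. The weights, error terms, and state layout are illustrative assumptions; practical RL setups typically add further terms (velocities, end-effector positions, contacts) and embodiment-specific retargeting.

```python
# Sketch of a DeepMimic-style tracking reward against a reference motion frame.
# Weights and error terms are illustrative assumptions, not a prescribed recipe.
import numpy as np


def tracking_reward(sim_joint_pos: np.ndarray,
                    ref_joint_pos: np.ndarray,
                    sim_root_pos: np.ndarray,
                    ref_root_pos: np.ndarray,
                    w_pose: float = 0.7,
                    w_root: float = 0.3) -> float:
    """Reward simulated motion for staying close to the reference at this timestep."""
    pose_err = np.sum((sim_joint_pos - ref_joint_pos) ** 2)   # joint tracking error
    root_err = np.sum((sim_root_pos - ref_root_pos) ** 2)     # root (pelvis) tracking error
    return float(w_pose * np.exp(-2.0 * pose_err) + w_root * np.exp(-10.0 * root_err))
```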
To validate the practical trainability of HiPHI-MOV, we established a closed-loop data-to-training pipeline comprising cross-embodiment retargeting, simulation-based RL training, sim-to-sim evaluation, and sim-to-real deployment. We further conducted a preliminary scaling-law study to evaluate the effect of increasing dataset size on model performance. Video 1 summarizes the data utilization pipeline and shows the results of our data and model validation, demonstrating the quality and practical trainability of the dataset.
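For the retargeting stage of that pipeline, a toy sketch is given below: matching joint angles from the human pose are copied onto robot joints with different names and clipped to the robot's limits. The joint mapping and limits are invented for illustration; production retargeting usually solves an optimization over link lengths, orientations, and contact constraints rather than a direct copy.

```python
# Toy sketch of cross-embodiment retargeting by joint-name mapping and clipping.
# The mapping and limits are hypothetical; real retargeting is an optimization.
import numpy as np

HUMAN_TO_ROBOT_JOINTS = {              # hypothetical mapping
    "LeftKnee": "left_knee_pitch",
    "RightKnee": "right_knee_pitch",
    "LeftShoulder": "left_shoulder_pitch",
}

ROBOT_JOINT_LIMITS = {                 # hypothetical limits, radians
    "left_knee_pitch": (-0.1, 2.3),
    "right_knee_pitch": (-0.1, 2.3),
    "left_shoulder_pitch": (-3.0, 3.0),
}


def retarget(human_angles: dict[str, float]) -> dict[str, float]:
    """Copy mapped joint angles onto the robot, clipped to its joint limits."""
    targets = {}
    for human_joint, robot_joint in HUMAN_TO_ROBOT_JOINTS.items():
        lo, hi = ROBOT_JOINT_LIMITS[robot_joint]
        targets[robot_joint] = float(np.clip(human_angles.get(human_joint, 0.0), lo, hi))
    return targets
```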
Noitom Robotics
Noitom Robotics is a data infrastructure company for embodied AI. We build end-to-end pipelines that transform real-world human activity into synchronized, training-ready multimodal datasets (motion, vision, and interaction signals) at scale. By standardizing capture, labeling, quality control, and governance, Noitom Robotics helps teams overcome the data bottlenecks that limit robotic intelligence, including the cross-embodiment challenge of transferring learned skills across different robot forms. Our mission is to power faster, safer deployment of capable robots by making high-quality embodied data accessible, scalable, and production-grade. www.noitomrobotics.com