Five steps. One pipeline. Every dataset is delivered in LeRobot and COCO 1.0 formats with GDPR documentation, ready to drop into your training run.
We record egocentric 4K video from the worker's perspective — the same viewpoint a humanoid robot has. Real tasks, real people, real environments across Germany and the EU.
For industrial datasets, we bring our equipment on-site. For everyday datasets, we work with our participant network across Germany. Every recording environment is documented with lighting conditions, camera position, and task description.
Raw 4K footage is transcoded to 1080p H.264 MP4 using FFmpeg. Frames are extracted at 2–5 fps for annotation. The 4K original is retained unmodified as the master copy for future re-processing.
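As a rough illustration of the transcoding and frame-extraction step, the sketch below builds the two FFmpeg command lines in Python. The function names, file names, the CRF value of 18, and the choice to extract frames from the 1080p file rather than the 4K original are assumptions made for this example, not our production settings.

```python
from pathlib import Path


def transcode_cmd(src: Path, dst: Path, crf: int = 18) -> list[str]:
    """Build an FFmpeg command that downscales a clip to 1080p H.264."""
    return [
        "ffmpeg", "-i", str(src),
        "-vf", "scale=-2:1080",   # height 1080; width follows the aspect ratio, kept even
        "-c:v", "libx264",        # H.264 encoder
        "-crf", str(crf),         # assumed quality setting (lower means higher quality)
        "-pix_fmt", "yuv420p",    # widely compatible pixel format
        str(dst),
    ]


def extract_frames_cmd(video: Path, out_dir: Path, fps: float = 2.0) -> list[str]:
    """Build an FFmpeg command that samples still frames at a fixed rate."""
    return [
        "ffmpeg", "-i", str(video),
        "-vf", f"fps={fps:g}",                # e.g. fps=2 or fps=5
        str(out_dir / "frame_%06d.jpg"),      # hypothetical naming pattern
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) on a machine where FFmpeg is installed.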
MediaPipe automatically detects 21 keypoints per hand in every frame, each with x, y, and z coordinates. Z-values are normalized per frame relative to hand size (z_norm) to ensure cross-frame consistency. Grounding DINO identifies and labels objects in the scene via text prompts.
This automated step pre-annotates 80–90% of frames without human intervention. Confidence scores are retained for every keypoint.
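One way the per-frame z normalization could work is sketched below. It assumes that hand size is measured as the 2D distance between the wrist (MediaPipe landmark 0) and the middle-finger knuckle (landmark 9); that reference length and the function name are illustrative choices, not necessarily the exact definition behind z_norm.

```python
import math

WRIST = 0               # MediaPipe hand landmark index for the wrist
MIDDLE_FINGER_MCP = 9   # MediaPipe index for the middle-finger knuckle


def normalize_z(landmarks: list[tuple[float, float, float]]) -> list[float]:
    """Divide each z value by the hand size of the same frame.

    landmarks: 21 (x, y, z) tuples for one hand in one frame.
    Hand size is ASSUMED to be the wrist-to-middle-knuckle distance in x/y.
    """
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    wx, wy, _ = landmarks[WRIST]
    mx, my, _ = landmarks[MIDDLE_FINGER_MCP]
    hand_size = math.hypot(mx - wx, my - wy)
    if hand_size == 0:
        raise ValueError("degenerate hand: zero size")
    return [z / hand_size for _, _, z in landmarks]
```

Because every frame is scaled by its own hand size, a hand close to the camera and the same hand farther away yield comparable z_norm values.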
MediaPipe pre-annotations are imported into CVAT, our local annotation platform. Human annotators review every frame, correcting edge cases: occluded hands, fast movements, low-light conditions, unusual poses.
Action labels are added per sequence. Every frame receives a natural-language description of hand position, finger state, and action, which is essential for training vision-language-action (VLA) models.
The final dataset is exported in all requested formats, packaged with a README, GDPR consent documentation, and data lineage records. Delivered as a ZIP or via secure transfer.
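To give a feel for the COCO side of the export, the sketch below assembles one COCO-style keypoint annotation for a single hand. The function name and the choice to derive the bounding box from the labeled keypoints are assumptions for illustration. The flat [x, y, v] keypoint layout and the visibility flags (0 = not labeled, 1 = labeled but not visible, 2 = labeled and visible) follow the public COCO keypoint format.

```python
def coco_hand_annotation(
    ann_id: int,
    image_id: int,
    category_id: int,
    keypoints: list[tuple[float, float, int]],
) -> dict:
    """Build one COCO-style keypoint annotation from (x, y, visibility) tuples."""
    labeled = [(x, y) for x, y, v in keypoints if v > 0]
    if not labeled:
        raise ValueError("no labeled keypoints")

    # ASSUMPTION: tight bounding box around the labeled keypoints.
    xs = [x for x, _ in labeled]
    ys = [y for _, y in labeled]
    x0, y0 = min(xs), min(ys)
    w, h = max(xs) - x0, max(ys) - y0

    # COCO stores keypoints as one flat list: x1, y1, v1, x2, y2, v2, ...
    flat: list[float] = []
    for x, y, v in keypoints:
        flat.extend([x, y, v])

    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "keypoints": flat,
        "num_keypoints": len(labeled),
        "bbox": [x0, y0, w, h],   # COCO bbox order: x, y, width, height
        "area": w * h,
        "iscrowd": 0,
    }
```

For a 21-keypoint hand, the keypoints list in the resulting annotation holds 63 values.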