How we work

From human motion to robot action.

Five steps. One pipeline. Every dataset is delivered in LeRobot and COCO 1.0 formats with GDPR documentation, ready to drop into your training run.

Record in real environments

We record egocentric 4K video from the worker's perspective — the same viewpoint a humanoid robot has. Real tasks, real people, real environments across Germany and the EU.

For industrial datasets, we bring our equipment on-site. For everyday datasets, we work with our participant network across Germany. Every recording environment is documented with lighting conditions, camera position, and task description.

iPhone 4K | Chest/head mount | Consent protocol | GDPR compliant

Transcode and prepare

Raw 4K footage is transcoded to 1080p H.264 MP4 using FFmpeg. Frames are extracted at 2–5 fps for annotation. The 4K original is retained unmodified as the master copy for future re-processing.

FFmpeg | H.264 | 1080p | Frame extraction
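
As a rough illustration of the transcode and frame-extraction step, the two FFmpeg invocations could be assembled in Python along the following lines. This is a minimal sketch, not our production script: the function names, the CRF value, the preset, and the output file pattern are assumptions chosen for the example.

```python
from pathlib import Path


def build_transcode_cmd(src: str, dst: str, crf: int = 18) -> list[str]:
    """Build an FFmpeg command that downscales a clip to 1080p H.264 MP4.

    The CRF and preset values are illustrative assumptions.
    """
    return [
        "ffmpeg",
        "-i", src,
        "-vf", "scale=-2:1080",  # height 1080, width keeps aspect ratio (even number)
        "-c:v", "libx264",       # H.264 encoder
        "-crf", str(crf),
        "-preset", "medium",
        "-pix_fmt", "yuv420p",   # widely supported pixel format
        "-c:a", "copy",          # keep the original audio stream as-is
        dst,
    ]


def build_extract_cmd(src: str, out_dir: str, fps: float = 2.0) -> list[str]:
    """Build an FFmpeg command that writes JPG frames at a fixed rate."""
    return [
        "ffmpeg",
        "-i", src,
        "-vf", f"fps={fps:g}",   # e.g. 2 frames per second
        "-q:v", "2",             # high JPEG quality
        str(Path(out_dir) / "frame_%06d.jpg"),
    ]
```

Either list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed.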

Automatic pre-annotation

MediaPipe detects 21 hand keypoints per frame automatically — x, y, and z coordinates. Z-values are normalized per frame relative to hand size (z_norm) to ensure cross-frame consistency. Grounding DINO identifies and labels objects in the scene via text prompts.

For 80–90% of frames, the pre-annotations need no manual correction. Confidence scores are retained for every keypoint.

MediaPipe 0.10.x | Grounding DINO | Segment Anything | 21 keypoints · x/y/z
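
The per-frame z normalization can be sketched as follows. The exact definition of "hand size" is not spelled out above, so this example assumes it is the 2D distance between the wrist (landmark 0) and the middle-finger MCP joint (landmark 9) in MediaPipe's hand landmark indexing. The function name is hypothetical.

```python
import math

WRIST = 0        # MediaPipe Hands landmark index of the wrist
MIDDLE_MCP = 9   # MediaPipe Hands landmark index of the middle-finger MCP joint


def normalize_z(landmarks):
    """Return z_norm for each of the 21 landmarks of one hand in one frame.

    landmarks: list of 21 (x, y, z) tuples.
    Hand size is ASSUMED to be the wrist-to-middle-MCP distance in the x/y plane.
    """
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    wx, wy, _ = landmarks[WRIST]
    mx, my, _ = landmarks[MIDDLE_MCP]
    hand_size = math.hypot(mx - wx, my - wy)
    if hand_size == 0:
        raise ValueError("degenerate hand: wrist and middle MCP coincide")
    # Dividing by hand size makes depth comparable across frames in which
    # the hand appears larger or smaller in the image.
    return [z / hand_size for _, _, z in landmarks]
```

Other reference lengths (for example the palm width) would work the same way; what matters is that the same choice is applied to every frame.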

Human review and correction

MediaPipe pre-annotations are imported into CVAT, our locally hosted annotation platform. Human annotators review every frame and correct edge cases: occluded hands, fast movements, low-light conditions, and unusual poses.

Action labels are added per sequence. Every frame receives a natural language description of hand position, finger state, and action — essential for VLA model training.

CVAT 2.x (local) | Bounding boxes | Keypoint correction | Natural language labels

Export and delivery

The final dataset is exported in all requested formats, packaged with a README, GDPR consent documentation, and data lineage records. Delivered as a ZIP or via secure transfer.

LeRobot format | COCO 1.0 JSON | YOLO TXT | MOT CSV | MP4 + JPG frames | GDPR documentation

Delivery formats

LeRobot

episodes.jsonl + frames.jsonl + MP4. Drop directly into Hugging Face training pipelines.

COCO 1.0

Full bounding-box and keypoint annotations. Compatible with any training framework that reads COCO JSON.
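
For orientation, a single COCO-style keypoint annotation for one hand could be assembled as in the sketch below. The field names follow the public COCO keypoint convention (flat `[x, y, v]` triplets, `bbox` as `[x, y, width, height]`). The function name is hypothetical, and deriving the box from the keypoint extent is an assumption made for the example; delivered boxes come from the annotation step.

```python
def to_coco_annotation(ann_id, image_id, category_id, keypoints_px, visibility=None):
    """Build one COCO keypoint annotation dict.

    keypoints_px: list of (x, y) pixel coordinates, e.g. 21 hand keypoints.
    visibility:   per-keypoint COCO flag (0 = not labeled, 1 = labeled but
                  not visible, 2 = labeled and visible). Defaults to all 2.
    """
    if visibility is None:
        visibility = [2] * len(keypoints_px)

    flat = []
    for (x, y), v in zip(keypoints_px, visibility):
        flat.extend([float(x), float(y), int(v)])

    labeled = [(x, y) for (x, y), v in zip(keypoints_px, visibility) if v > 0]
    if labeled:
        xs = [p[0] for p in labeled]
        ys = [p[1] for p in labeled]
        # ASSUMPTION: tight box around the labeled keypoints.
        bbox = [float(min(xs)), float(min(ys)),
                float(max(xs) - min(xs)), float(max(ys) - min(ys))]
    else:
        bbox = [0.0, 0.0, 0.0, 0.0]

    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "keypoints": flat,
        "num_keypoints": len(labeled),
        "bbox": bbox,
        "area": bbox[2] * bbox[3],
        "iscrowd": 0,
    }
```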

YOLO TXT

Normalized bounding boxes per frame. Ready for YOLO-family model training.
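
"Normalized" here means each box is stored as a class index followed by the box center and size, all divided by the image dimensions. A minimal sketch of that conversion from a pixel box, with a hypothetical function name and a six-decimal precision chosen for the example:

```python
def to_yolo_line(class_id, x, y, w, h, img_w, img_h):
    """Convert a pixel box (top-left x, y, width, height) to one YOLO TXT line.

    Output layout: "<class> <x_center> <y_center> <width> <height>",
    with all four box values normalized to the range 0..1.
    """
    cx = (x + w / 2) / img_w
    cy = (y + h / 2) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"


# A 960x540 box centered in a 1920x1080 frame:
print(to_yolo_line(0, 480, 270, 960, 540, 1920, 1080))
# prints "0 0.500000 0.500000 0.500000 0.500000"
```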

MOT CSV

Multi-object tracking format. For temporal sequence modeling across frames.
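
As an illustration of a tracking file, the sketch below writes rows in the common MOTChallenge 2D layout (frame, track id, box left, top, width, height, confidence, then three unused world coordinates set to -1). That column layout is an assumption for this example, and the function name is hypothetical; the README shipped with each dataset is the reference for the columns actually delivered.

```python
import csv
import io


def to_mot_csv(tracks):
    """Serialize tracked boxes to MOTChallenge-style CSV text.

    tracks: iterable of (frame, track_id, left, top, width, height, conf).
    Rows are sorted by frame, then by track id.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for frame, track_id, left, top, width, height, conf in sorted(tracks):
        # The last three columns (x, y, z) are unused for 2D tracking.
        writer.writerow([frame, track_id, left, top, width, height, conf, -1, -1, -1])
    return buf.getvalue()
```

Because the same track id appears on consecutive frames, a model or evaluation script can follow one object through the sequence.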

Raw frames

JPG frames at 2–5 fps. PNG for segmentation masks and depth maps.

GDPR docs