NVIDIA Publishes Physical AI Data Factory Blueprint on GitHub With Major Robotics Partners

NVIDIA has released its Physical AI Data Factory Blueprint on GitHub, an open-source framework for massive-scale data processing designed to accelerate the training of robots, autonomous vehicles, and other embodied AI systems. The April release arrives with an expanding partner ecosystem and a new healthcare-focused dataset that signals physical AI's push into medical domains.

The Blueprint

The Data Factory Blueprint provides a modular pipeline for generating, processing, and validating the enormous datasets required to train physical AI systems. The framework supports synthetic data generation — creating photorealistic simulated environments where robots can train on millions of scenarios without real-world risk — as well as reinforcement learning workflows that allow systems to optimize their behavior through trial and error at scale.

The architecture is built on NVIDIA's Cosmos world foundation models and integrates with the company's Omniverse simulation platform. Engineers can specify environmental parameters, object types, physics properties, and edge cases, then generate vast datasets of training scenarios on demand.

Partner Ecosystem

Several major companies have adopted or co-developed portions of the blueprint:

FieldAI is using the framework for autonomous navigation in unstructured outdoor environments
Hexagon Robotics has integrated the pipeline into its industrial automation training workflows
Skild AI is leveraging synthetic data generation to train general-purpose robot foundation models
Uber continues to expand its use of the blueprint for autonomous vehicle perception
Teradyne Robotics, parent of Universal Robots and MiR, applies the framework to improve manipulation reliability

This breadth of adoption suggests the blueprint is achieving NVIDIA's goal of becoming the default infrastructure for physical AI development across industries.

Open-H Healthcare Robotics Dataset

Alongside the blueprint, NVIDIA launched Open-H, a healthcare robotics dataset containing more than 700 hours of surgical video and clinical procedure recordings. The dataset is designed to train AI systems that assist in surgical environments — from instrument tracking and tissue identification to autonomous suturing and procedural guidance.

Open-H represents one of the largest publicly available datasets for healthcare robotics, addressing a critical data scarcity problem in medical AI. Collecting surgical training data has historically been constrained by privacy regulations, institutional gatekeeping, and the sheer difficulty of instrumenting operating rooms. By releasing a curated, de-identified dataset at scale, NVIDIA is attempting to lower the barrier to entry for research teams working on surgical AI.

Strategic Significance

The open-source release follows NVIDIA's established playbook: give away the software and development tools to increase demand for the GPU hardware that runs them. Every lab and startup that adopts the Data Factory Blueprint becomes a potential customer for NVIDIA's training and inference hardware.

But the significance extends beyond commercial strategy. Physical AI has been constrained by data scarcity in a way that language AI has not — the real world does not generate neatly labeled training data the way the internet generates text. Synthetic data generation at scale may be the key to unlocking the same rapid capability gains in robotics that transformer models achieved in language.

NVIDIA Publishes Physical AI Data Factory Blueprint on GitHub With Major Robotics Partners

The Blueprint

Partner Ecosystem

Open-H Healthcare Robotics Dataset

Strategic Significance

Get Lanceum in your inbox

More in Research

Hybrid AI Approach Slashes Energy Use 100x While Boosting Accuracy

Carnegie Mellon Launches AI-Astronomy Fellowship to Accelerate Cosmic Discovery

Neuromorphic Computers Crack Physics Simulations Once Reserved for Supercomputers