🦾 Great milestone for open-source robotics: π₀ & π₀.₅ by Physical Intelligence are now on Hugging Face, fully ported to PyTorch in LeRobot and validated side-by-side with OpenPI, for everyone to experiment with, fine-tune & deploy on their robots!

π₀.₅ is a Vision-Language-Action model that represents a significant evolution from π₀, addressing a big challenge in robotics: open-world generalization. While robots can perform impressive tasks in controlled environments, π₀.₅ is designed to generalize to entirely new environments and situations that were never seen during training.

Generalization must occur at multiple levels:
- Physical level: understanding how to pick up a spoon (by the handle) or a plate (by the edge), even with unseen objects in cluttered environments
- Semantic level: understanding task semantics, such as where to put clothes and shoes (the laundry hamper, not the bed) and which tools are appropriate for cleaning spills
- Environmental level: adapting to "messy" real-world environments like homes, grocery stores, offices, and hospitals

The breakthrough innovation in π₀.₅ is co-training on heterogeneous data sources. The model learns from:
- Multimodal web data: image captioning, visual question answering, object detection
- Verbal instructions: humans coaching robots through complex tasks step by step
- Subtask commands: high-level semantic behavior labels (e.g., "pick up the pillow" for an unmade bed)
- Cross-embodiment robot data: data from various robot platforms with different capabilities
- Multi-environment data: static robots deployed across many different homes
- Mobile manipulation data: ~400 hours of mobile robot demonstrations

This diverse training mixture creates a "curriculum" that enables generalization across physical, visual, and semantic levels simultaneously.

Huge thanks to the Physical Intelligence team & contributors.
Model: https://lnkd.in/eAEr7Yk6
LeRobot: https://lnkd.in/ehzQ3Mqy
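For anyone who wants to try the LeRobot port, the usual pattern is to pull the policy from the Hub and call select_action on an observation batch. The sketch below assumes a PI0Policy class and a lerobot/pi0 repo id; the exact import path, feature keys, and tensor shapes depend on your LeRobot version and robot configuration, so treat them as placeholders and check the LeRobot docs.

```python
# Hedged sketch: loading the LeRobot PyTorch port of pi0 and querying one action.
# Import path, repo id, feature keys, and shapes are assumptions; check the LeRobot docs.
import torch
from lerobot.common.policies.pi0.modeling_pi0 import PI0Policy  # path may differ by version

policy = PI0Policy.from_pretrained("lerobot/pi0")
policy.eval()

batch = {
    "observation.images.top": torch.zeros(1, 3, 224, 224),  # placeholder camera frame
    "observation.state": torch.zeros(1, 14),                 # placeholder proprioception
    "task": ["pick up the spoon by the handle"],             # language instruction
}
with torch.no_grad():
    action = policy.select_action(batch)  # next action to send to the robot
```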
Applications of Heterogeneous Pre-Training in Robotics
Summary
Heterogeneous pre-training in robotics is a method where robots are trained on a wide variety of data types—like images, text, human demonstrations, and sensor readings—to help them learn and adapt to new environments and tasks without needing to be programmed for each specific scenario. This approach makes it possible for robots to handle complex, unpredictable situations using skills learned from diverse sources.
- Mix training sources: Incorporate images, videos, and instructions from different environments and robots so your system learns to handle new challenges independently.
- Use shared data models: Align different kinds of data into a common format, making it easier for robots to understand and switch between tasks in everyday situations (see the sketch after this list).
- Open access collaboration: Take advantage of open-source models and shared datasets to improve robot capabilities across various industries, from manufacturing to healthcare.
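To make the "shared data model" idea concrete, here is a minimal sketch of a common record format that episodes from different robots (or web data) could be converted into before training. All names (RobotStep, to_shared_step, STATE_DIM) are hypothetical illustrations, not from any particular library.

```python
# Illustrative only: a minimal shared schema for heterogeneous robot data.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class RobotStep:
    """One timestep in a canonical format shared across robots and data sources."""
    image: np.ndarray             # RGB frame, (H, W, 3), uint8
    state: np.ndarray             # proprioception, padded/truncated to a fixed size
    action: Optional[np.ndarray]  # commanded action; None for web/video-only data
    instruction: str              # natural-language task description
    embodiment: str               # which robot (or "web") the sample came from

STATE_DIM = 32  # fixed state size so different robots fit one model input

def to_shared_step(raw: dict, embodiment: str) -> RobotStep:
    """Map a source-specific sample (keys vary per dataset) into the shared schema."""
    state = np.zeros(STATE_DIM, dtype=np.float32)
    src_state = np.asarray(raw.get("joint_positions", []), dtype=np.float32)
    state[: min(len(src_state), STATE_DIM)] = src_state[:STATE_DIM]
    return RobotStep(
        image=raw["rgb"],
        state=state,
        action=raw.get("action"),
        instruction=raw.get("language", ""),
        embodiment=embodiment,
    )

# Example: normalizing a sample from a hypothetical "aloha" episode.
step = to_shared_step({"rgb": np.zeros((224, 224, 3), np.uint8),
                       "joint_positions": [0.1, 0.2],
                       "language": "fold the towel"}, "aloha")
```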
-
🦾 𝗔 𝗳𝗮𝘀𝘁𝗲𝗿, 𝗯𝗲𝘁𝘁𝗲𝗿 𝘄𝗮𝘆 𝘁𝗼 𝘁𝗿𝗮𝗶𝗻 𝗴𝗲𝗻𝗲𝗿𝗮𝗹-𝗽𝘂𝗿𝗽𝗼𝘀𝗲 𝗿𝗼𝗯𝗼𝘁𝘀 - Inspired by LLMs, researchers developed a training technique that enables robots to absorb diverse sensor data 𝗶𝗻𝘁𝗼 𝗮 𝘂𝗻𝗶𝗳𝗶𝗲𝗱 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝗮𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲.

Using principles from large language models (LLMs), MIT's Heterogeneous Pretrained Transformers (HPT) model enables robots to absorb diverse sensor data—like vision and proprioception—into a unified training architecture, boosting performance over traditional methods by more than 20% in real-world and simulated environments. This architecture lets robots build a broad foundation for "understanding" tasks, similar to how LLMs build theirs from linguistic data. AI's language models, capable of generalizing across varied inputs, have demonstrated that diverse data can create scalable, adaptable intelligence. Applying this concept to robotics, MIT's HPT allows a robot to train faster by 𝗴𝗲𝗻𝗲𝗿𝗮𝗹𝗶𝘇𝗶𝗻𝗴 from different data sources in a way that approximates human intuition. This cross-pollination of technologies vastly reduces the time and data needed for training, marking a pivotal moment in robotics where complex, real-world tasks become feasible 𝘄𝗶𝘁𝗵𝗼𝘂𝘁 𝘁𝗵𝗲 𝗰𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝘁𝘀 𝗼𝗳 𝘁𝗿𝗮𝗱𝗶𝘁𝗶𝗼𝗻𝗮𝗹 𝘁𝗮𝘀𝗸-𝘀𝗽𝗲𝗰𝗶𝗳𝗶𝗰 𝗽𝗿𝗼𝗴𝗿𝗮𝗺𝗺𝗶𝗻𝗴.

This cross-acceleration is more than theoretical - industries can now realistically deploy robots for a wide range of tasks without customized programming for each. Such adaptability could redefine roles in manufacturing, logistics, and healthcare, supporting leaner and more responsive operations. We're now on the cusp of achieving general-purpose robotics—a vision where robots perform complex, multi-domain tasks.

👉 What's extraordinary is how disciplines like robotics, AI, spatial computing, and cloud computing are profoundly accelerating each other - with remarkable achievements daily.

𝐹𝑜𝑙𝑙𝑜𝑤 𝑚𝑒 𝑓𝑜𝑟 𝑚𝑜𝑟𝑒 𝑖𝑛𝑠𝑖𝑔ℎ𝑡𝑠 𝑜𝑛 𝑡ℎ𝑒 𝑖𝑛𝑡𝑒𝑟𝑠𝑒𝑐𝑡𝑖𝑜𝑛𝑠 𝑜𝑓 #AI, #Robotics, 𝑎𝑛𝑑 𝑛𝑒𝑥𝑡-𝑔𝑒𝑛 #IndustrialAutomation. #IndustrialAI #MachineLearning #LLMs #IndustrialAutomation #Innovation
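To make the "unified training architecture" concrete, here is a rough PyTorch sketch of the stem/trunk/head layout the post describes: modality-specific encoders project vision and proprioception into a shared token space, a shared transformer trunk fuses them, and a task-specific head emits actions. All module names and sizes are illustrative assumptions, not the released HPT implementation.

```python
# Rough sketch of an HPT-style stem/trunk/head layout; sizes and names are illustrative.
import torch
import torch.nn as nn

class HPTStyleSketch(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=8, state_dim=32, action_dim=7):
        super().__init__()
        # Stems: map each sensor modality into the shared token space.
        self.vision_stem = nn.Sequential(            # tiny CNN stand-in for a real image encoder
            nn.Conv2d(3, 32, 8, stride=8), nn.GELU(),
            nn.Conv2d(32, d_model, 4, stride=4),
        )
        self.proprio_stem = nn.Linear(state_dim, d_model)
        # Trunk: shared transformer over tokens from all modalities and embodiments.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        # Head: embodiment/task-specific action decoder.
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image, state):
        vis = self.vision_stem(image)                       # (B, d, h, w)
        vis_tokens = vis.flatten(2).transpose(1, 2)         # (B, h*w, d)
        prop_token = self.proprio_stem(state).unsqueeze(1)  # (B, 1, d)
        tokens = torch.cat([vis_tokens, prop_token], dim=1)
        fused = self.trunk(tokens)
        return self.action_head(fused[:, -1])               # action read from the proprio slot

model = HPTStyleSketch()
action = model(torch.zeros(2, 3, 224, 224), torch.zeros(2, 32))  # -> (2, 7)
```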
-
🚀 The world's first Open Foundation Model for generalist humanoid robots was just launched during NVIDIA's GTC, and it's nothing short of exciting! My take is that this new model, designed for diverse manipulation tasks, will be performing in open-ended environments where "new, unseen data" comes in on the fly! I'm hoping we surmount the hurdles seen with autonomous vehicles as we fine-tune this foundation model into many sub-versions. Making it open source is a major strength, in my opinion. Researchers around the world will be thinking about ways to fine-tune it using innovative reinforcement learning techniques, given that Omniverse and Cosmos provide a space to explore synthetic data while removing the constraints of human-annotated data.

Nonetheless, here are the quick facts about GR00T N1:
🔹 Vision-Language-Action (VLA) Architecture: Combines a vision-language model for reasoning (System 2) with a diffusion transformer for real-time motor actions (System 1).
🔹 Trained on Heterogeneous Data: Uses a structured data pyramid of human videos, synthetic simulations, and real-robot demonstrations.
🔹 Cross-Embodiment Generalization: Supports multiple robot types, from simple arms to full humanoid robots.
🔹 High-Frequency Control: Processes perception at 10 Hz and generates motor actions at 120 Hz on an NVIDIA L40 GPU.
🔹 State-of-the-Art Learning: Outperforms imitation learning baselines in both simulation and real-world humanoid benchmarks.
🔹 Open-Source Availability: Model weights, datasets, and simulation environments are accessible on GitHub & Hugging Face.

Hope you're as excited as I am about this new frontier, and what's coming next! #genai #technology #artificialintelligence
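The 10 Hz / 120 Hz figures amount to a two-rate control loop: the slow System 2 refreshes a plan while the fast System 1 keeps emitting actions from the latest plan. Below is a minimal, hypothetical sketch of that pattern; the rates come from the post, while the function names and stub planner/decoder are made up for illustration.

```python
# Hypothetical two-rate loop illustrating the System 2 (slow) / System 1 (fast) split.
# Rates (10 Hz / 120 Hz) are from the post; the planner/decoder stubs are placeholders.
import time

PLAN_HZ = 10     # slow vision-language reasoning ("System 2")
ACTION_HZ = 120  # fast motor action generation ("System 1")

def run_plan(observation, instruction):
    """Placeholder for the slow VLM: returns a latent plan / goal representation."""
    return {"goal": instruction, "timestamp": time.time()}

def decode_action(plan, observation):
    """Placeholder for the fast action decoder conditioned on the latest plan."""
    return [0.0] * 7  # dummy 7-DoF action

def control_loop(get_observation, instruction, duration_s=1.0):
    plan = run_plan(get_observation(), instruction)
    next_plan_t = time.monotonic() + 1.0 / PLAN_HZ
    end_t = time.monotonic() + duration_s
    while time.monotonic() < end_t:
        obs = get_observation()
        if time.monotonic() >= next_plan_t:   # refresh the plan at ~10 Hz
            plan = run_plan(obs, instruction)
            next_plan_t += 1.0 / PLAN_HZ
        action = decode_action(plan, obs)     # emit actions at ~120 Hz
        # send_to_robot(action)               # robot I/O omitted in this sketch
        time.sleep(1.0 / ACTION_HZ)

control_loop(lambda: {"image": None, "state": None}, "pick up the cup")
```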
-
MIT researchers have developed an innovative technique to train general-purpose robots faster and better, inspired by large language models. 💡 By combining a vast amount of diverse data from various sources, this approach enables robots to learn a wide range of tasks without the need for extensive task-specific training. The method, called 𝐇𝐞𝐭𝐞𝐫𝐨𝐠𝐞𝐧𝐞𝐨𝐮𝐬 𝐏𝐫𝐞𝐭𝐫𝐚𝐢𝐧𝐞𝐝 𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐞𝐫𝐬 (𝐇𝐏𝐓), aligns data from different domains and modalities into a shared "language" that a generative AI model can process. 🧠 HPT outperformed traditional training methods by more than 20% in both simulation and real-world experiments. This breakthrough could lead to more versatile and adaptable robots, capable of seamlessly switching between tasks like the iconic Rosie from "The Jetsons." 📺 🔗 https://lnkd.in/dMc_tECN #Robotics #ArtificialIntelligence #MachineLearning #InnovationInAction #MITResearch
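One practical consequence of pretraining on a shared "language" is cheap adaptation: in principle, the shared trunk can be frozen and only a small task- or robot-specific head trained on new demonstrations. The sketch below illustrates that transfer step with placeholder modules and data; it is not the published HPT fine-tuning recipe.

```python
# Hypothetical transfer step: freeze a heterogeneously pretrained trunk and train
# only a small new head for a new robot/task. Names and sizes are illustrative.
import torch
import torch.nn as nn

d_model, action_dim = 256, 6

# Stand-in for a trunk pretrained on heterogeneous data (weights would be loaded here).
trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4
)
for p in trunk.parameters():
    p.requires_grad = False                   # keep the shared representation fixed

new_head = nn.Linear(d_model, action_dim)     # only this small head is trained

optimizer = torch.optim.AdamW(new_head.parameters(), lr=1e-4)
tokens = torch.randn(8, 16, d_model)          # placeholder token sequences from the stems
target = torch.randn(8, action_dim)           # placeholder expert actions

pred = new_head(trunk(tokens)[:, 0])          # decode an action from the first token
loss = nn.functional.mse_loss(pred, target)
loss.backward()
optimizer.step()
```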