As we've seen recently with the release of DeepSeek, there is substantial room for improvement in large-scale foundation models, both in architectural efficiency and in unsupervised training techniques. While the discussion has mostly centered on LLMs, there is also a strong need to improve the scalability of generative AI in other domains, such as video and multi-sensor world models.

In the last several months we have released multiple foundation models for video and multi-sensor generative simulation in the autonomous driving space: VidGen-1 and VidGen-2, WorldGen-1, and GenSim-2. These models were developed fully in-house (not fine-tuned from any open-source models) using only ~100 H100 GPUs, inclusive of all R&D and final training runs, a tiny fraction of the compute budgets typically associated with video foundation model development (thousands to tens of thousands of H100 GPUs).

How did we achieve industry-leading foundation models with far less compute? We combined DNN architecture innovation with advanced unsupervised learning techniques. By leveraging our Deep Teaching technology and improvements to generative AI DNN architectures, we were able to use smaller, more efficient models while simultaneously accelerating the unsupervised learning process, leading to superior scaling laws compared to industry-typical methods. That means higher accuracy per compute dollar spent, both during training and inference. We have verified that these scaling-law advantages persist at larger scales of compute and data, and we look forward to continuing to push the frontier of world models for autonomous driving and robotics by scaling up. In essence, combining Deep Teaching with generative AI architecture innovation leads to a highly scalable form of generative AI for simulation.
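To make the scaling-law claim above concrete, here is a minimal sketch of how such comparisons are typically quantified: fit a power law L(C) = a * C^(-b) to (compute, validation loss) measurements for each training recipe and compare the fitted coefficients. The data points, recipe names, and NumPy-based fit below are illustrative assumptions, not the author's actual results.

```python
# Illustrative sketch only: fitting a compute-vs-loss power law
# L(C) = a * C^(-b), which is the usual way "scaling law" comparisons
# like the one described above are quantified.
# All data points below are made up for illustration.
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss = a * compute^(-b) via linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(intercept), -slope  # (a, b)

# Hypothetical (GPU-hours, validation loss) pairs for two training recipes.
compute = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
loss_baseline = np.array([2.10, 1.85, 1.62, 1.44, 1.30])
loss_efficient = np.array([1.95, 1.66, 1.41, 1.21, 1.05])

for name, loss in [("baseline", loss_baseline), ("efficient", loss_efficient)]:
    a, b = fit_power_law(compute, loss)
    print(f"{name}: loss ~ {a:.2f} * C^(-{b:.3f})")
```

A larger fitted exponent b (or a lower prefactor a) at matched compute is the quantitative sense in which one method's scaling law can be called superior to another's.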
Latest AIST Robot Foundation Model Developments
Summary
The latest AIST robot foundation model developments refer to cutting-edge advancements in artificial intelligence models that serve as a base for teaching robots new skills, allowing them to learn complex tasks from various sources like videos and human demonstrations. These models make it possible for robots to understand instructions, adapt to new environments, and perform a wide range of activities without needing to be reprogrammed for every new task.
- Explore open-source tools: Take advantage of newly released foundation models and datasets that are available for public use, helping you experiment and innovate in robotics without high barriers to entry.
- Combine multiple data types: Use diverse sources like human demonstrations, synthetic simulations, and sensor data to train robots, making them better at handling new and unfamiliar situations (see the sketch after this list).
- Try generalist models: Consider using foundation models designed to work across different robot types, enabling easy adaptation to new tasks by simply providing a video demonstration or language instruction.
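As a companion to the second tip above, here is a minimal, hypothetical sketch of a weighted mixture sampler that interleaves human demonstrations, synthetic simulations, and sensor logs during training. The source names, mixture weights, and dummy data streams are assumptions for illustration only.

```python
# Hypothetical sketch of a weighted data-mixture sampler that interleaves
# human demonstrations, synthetic rollouts, and real sensor logs during
# training. Source names and weights are illustrative assumptions.
import random

def mixture_sampler(sources, weights, seed=0):
    """Yield (source_name, sample) pairs drawn in proportion to `weights`."""
    rng = random.Random(seed)
    names = list(sources)
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        yield name, next(sources[name])

def dummy_stream(tag):
    """Stand-in for a real data loader; emits labeled placeholder samples."""
    i = 0
    while True:
        yield f"{tag}-sample-{i}"
        i += 1

sources = {
    "human_demos": dummy_stream("human"),
    "synthetic_sim": dummy_stream("sim"),
    "sensor_logs": dummy_stream("sensor"),
}
sampler = mixture_sampler(sources, weights=[0.2, 0.5, 0.3])
for _ in range(5):
    print(next(sampler))
```

In practice the weights themselves become a tuning knob: heavier sampling of cheap synthetic data early in training, more real demonstrations later, is one common pattern.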
-
🚀 The world's first Open Foundation Model for generalist humanoid robots was just launched during NVIDIA's GTC, and it's nothing short of exciting! My take is that this new model, designed for diverse manipulation tasks, will be performing in open-ended environments, where "new, unseen data" will be coming in on the fly! I'm hoping we're surmounting the hurdles seen with autonomous vehicles as we fine-tune this foundation model into many sub-versions. Making it open source is a major strength, in my opinion. Researchers around the world will be thinking about ways to fine-tune it using innovative reinforcement learning techniques, given that Omniverse and Cosmos provide a space to explore synthetic data while removing the constraints of human-annotated data.

Nonetheless, here are the quick facts about Groot N1:
🔹 Vision-Language-Action (VLA) Architecture: Combines a vision-language model for reasoning (System 2) with a diffusion transformer for real-time motor actions (System 1).
🔹 Trained on Heterogeneous Data: Uses a structured data pyramid of human videos, synthetic simulations, and real-robot demonstrations.
🔹 Cross-Embodiment Generalization: Supports multiple robot types, from simple arms to full humanoid robots.
🔹 High-Frequency Control: Processes perception at 10 Hz and generates motor actions at 120 Hz on an NVIDIA L40 GPU.
🔹 State-of-the-Art Learning: Outperforms imitation learning baselines in both simulation and real-world humanoid benchmarks.
🔹 Open-Source Availability: Model weights, datasets, and simulation environments are accessible on GitHub & Hugging Face.

Hope you're as excited as I am about this new frontier, and what's coming next! #genai #technology #artificialintelligence
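The 10 Hz perception / 120 Hz action split described above is a dual-rate control pattern. Below is a minimal, hypothetical sketch of that pattern; it is not the Groot N1 API, and every name in it is a placeholder. A slow "System 2" planner refreshes a latent plan roughly every twelfth control tick, while a fast "System 1" action head emits a motor command on every tick.

```python
# Hypothetical sketch of a dual-rate control loop in the style described
# above: a slow "System 2" planner updates a latent plan at ~10 Hz while a
# fast "System 1" action head emits motor commands at ~120 Hz. This is NOT
# the Groot N1 API; all names here are placeholders.
import time

CONTROL_HZ = 120
PLAN_HZ = 10
TICKS_PER_PLAN = CONTROL_HZ // PLAN_HZ  # replan every 12 control ticks

def slow_planner(observation, instruction):
    """Stand-in for a vision-language model producing a latent plan."""
    return {"instruction": instruction, "obs_summary": sum(observation)}

def fast_action_head(plan, observation):
    """Stand-in for a diffusion-transformer action head."""
    return [0.01 * plan["obs_summary"] + o for o in observation]

def control_loop(instruction, num_ticks=36):
    plan = None
    for tick in range(num_ticks):
        observation = [0.0, 0.1, 0.2]                  # placeholder sensor features
        if tick % TICKS_PER_PLAN == 0:
            plan = slow_planner(observation, instruction)   # ~10 Hz path
        action = fast_action_head(plan, observation)        # ~120 Hz path
        time.sleep(1 / CONTROL_HZ)                     # crude loop pacing
    return action

control_loop("pick up the red cube")
```

The point of the split is latency budgeting: the expensive reasoning model never sits on the critical path of the high-frequency motor loop.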
-
TRI's "LBM 1.0" paper appeared on arXiv last night! Large Behavior Models (LBMs) are foundation models for robots that map robot sensors (notably camera inputs) and natural language commands into robot actions. The robots are programmed purely through demonstration; we can develop incredible new skills, like the one shown in the video below, without writing a single line of new code. There is a lot of excitement in the field right now because of the incredible potential of this type of technology. Inevitably, there is also a lot of hype. One of our main goals for this paper was to put out a very careful and thorough study on the topic, to help people understand the state of the technology and to share a lot of detail on how we're achieving it. The short version is: LBMs work! We see consistent and statistically significant improvements as we increase the amount of pretraining data. But doing the science is still hard; as a field, we have more work to do to improve the statistical power of our experiments. Please check out our project website for the paper and more details: https://lnkd.in/eDn_sqGh. https://lnkd.in/epSksw5E
Video: CutAppleInSlices Task (https://www.youtube.com/)
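As a rough illustration of the idea above, that a policy maps camera inputs and language commands to actions and is trained purely from demonstrations, here is a minimal behavior-cloning step in PyTorch. The network, tensor shapes, and action-chunk horizon are assumptions for illustration and are not TRI's LBM implementation.

```python
# Minimal behavior-cloning sketch in the spirit of the LBM description
# above: a policy maps (image features, language embedding) to an action
# chunk and is trained to imitate demonstrated actions. Shapes, sizes, and
# the architecture are illustrative assumptions, not TRI's implementation.
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    def __init__(self, img_dim=512, lang_dim=256, act_dim=7, horizon=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + lang_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim * horizon),
        )
        self.act_dim, self.horizon = act_dim, horizon

    def forward(self, img_feat, lang_emb):
        x = torch.cat([img_feat, lang_emb], dim=-1)
        return self.net(x).view(-1, self.horizon, self.act_dim)

policy = TinyPolicy()
optim = torch.optim.Adam(policy.parameters(), lr=1e-4)

# One imitation step on a fake demonstration batch.
img_feat = torch.randn(32, 512)       # stand-in for camera encoder output
lang_emb = torch.randn(32, 256)       # stand-in for command embedding
demo_actions = torch.randn(32, 8, 7)  # demonstrated action chunks

optim.zero_grad()
pred = policy(img_feat, lang_emb)
loss = nn.functional.mse_loss(pred, demo_actions)
loss.backward()
optim.step()
```

The "more pretraining data helps" finding reported in the post corresponds to repeating this kind of step over increasingly large demonstration corpora before task-specific evaluation.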
-
𝗔 𝗡𝗲𝘄 𝗘𝗿𝗮 𝗼𝗳 𝗣𝗵𝘆𝘀𝗶𝗰𝗮𝗹 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝗰𝗲? What if a single model could control multiple robots — and learn new tasks from just one video?

The team at Physical Intelligence has developed a generalist robotics policy that demonstrates exactly that:
🧠 One model
🤖 Multiple embodiments
🛠️ Countless real-world tasks

Trained across 𝗲𝗶𝗴𝗵𝘁 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝘁 𝗿𝗼𝗯𝗼𝘁 𝗽𝗹𝗮𝘁𝗳𝗼𝗿𝗺𝘀, it can perform everything from folding laundry to bussing tables — even tasks it has never seen before.

All it takes?
✅ A single video demonstration
✅ Or a natural language instruction
No fine-tuning. No task-specific code. Just action.

This isn't just robotics automation — it's a bold step toward 𝗴𝗲𝗻𝗲𝗿𝗮𝗹-𝗽𝘂𝗿𝗽𝗼𝘀𝗲 𝗽𝗵𝘆𝘀𝗶𝗰𝗮𝗹 𝗶𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝗰𝗲. We've seen what foundation models can do in vision and language. This is what happens when that power moves into the physical world.

#AI #Robotics #Automation #Innovation
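To illustrate the "one video demonstration or one language instruction" interface described above, here is a hypothetical, toy sketch in which either prompt type is mapped to a single conditioning vector consumed by the same generalist policy. All class and function names are placeholders, not Physical Intelligence's API.

```python
# Hypothetical interface sketch for one-shot task specification: a single
# generalist policy is conditioned either on a video demonstration or a
# language instruction by mapping both to one conditioning vector. These
# names are placeholders, not Physical Intelligence's API.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TaskPrompt:
    text: Optional[str] = None
    demo_frames: Optional[List[bytes]] = None  # raw frames of a demo video

def encode_prompt(prompt: TaskPrompt) -> List[float]:
    """Map either prompt type into one (toy) conditioning vector."""
    if prompt.text is not None:
        return [float(len(prompt.text)), 0.0]          # toy text encoding
    if prompt.demo_frames is not None:
        return [0.0, float(len(prompt.demo_frames))]   # toy video encoding
    raise ValueError("Provide a language instruction or a video demonstration.")

def generalist_policy(conditioning: List[float], observation: List[float]) -> List[float]:
    """Toy stand-in: action depends on both the task conditioning and the observation."""
    return [c + o for c, o in zip(conditioning, observation)]

action = generalist_policy(encode_prompt(TaskPrompt(text="fold the laundry")), [0.1, 0.2])
```

The design choice worth noting is that the policy itself never changes between tasks or embodiments; only the conditioning input does, which is what "no fine-tuning, no task-specific code" amounts to in practice.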