Robotics data is expensive and slow to collect. A lot of video is available online, but it isn't readily usable for robotics because it lacks action labels. AMPLIFY addresses this by learning Actionless Motion Priors that unlock better sample efficiency, generalization, and scaling for robot learning.

Our key insight is to factor the problem into two stages:
- The "what": predict the visual dynamics required to accomplish a task.
- The "how": map predicted motions to low-level actions.

This decoupling enables remarkable generalization: our policy can perform tasks for which we have NO action data, only videos. We outperform SOTA BC baselines on this setting by 27x 🤯

AMPLIFY is composed of three stages:
1. Motion Tokenization: We track dense keypoint grids through videos and compress their trajectories into discrete motion tokens.
2. Forward Dynamics: Given an image and a task description (e.g., "open the box"), we autoregressively predict a sequence of motion tokens representing how keypoints should move over the next second or so. This model can train on ANY text-labeled video data: robot demonstrations, human videos, YouTube videos.
3. Inverse Dynamics: We decode predicted motion tokens into robot actions. This module learns the robot-specific mapping from desired motions to actions, and it can train on ANY robot interaction data, not just expert demonstrations (think off-task data, play data, or even random actions).

So, does it actually work?

Few-shot learning: Given just 2 action-annotated demos per task, AMPLIFY nearly doubles SOTA few-shot performance on LIBERO. This is possible because our Actionless Motion Priors provide a strong inductive bias that dramatically reduces the amount of robot data needed to train a policy.

Cross-embodiment learning: We train the forward dynamics model on both human and robot videos, but the inverse model sees only robot actions. Result: 1.4× average improvement on real-world tasks. Our system successfully transfers motion information from human demonstrations to robot execution.

And now my favorite result: AMPLIFY enables zero-shot task generalization. We train on LIBERO-90 tasks and evaluate on tasks where we've seen no actions, only pixels. While our best baseline achieves ~2% success, AMPLIFY reaches a 60% average success rate, outperforming SOTA behavior cloning baselines by 27x.

This points to a new way to train VLAs for robotics that doesn't have to start with large-scale teleoperation. Instead of collecting millions of robot demonstrations, we just need to teach robots to read the language of motion. Then every video becomes training data.

Led by Jeremy Collins & Loránd Cheng in collaboration with Kunal Aneja, Albert Wilcox, and Benjamin Joffe at the College of Computing at Georgia Tech.

Check out our paper and project page for more details:
📄 Paper: https://lnkd.in/eZif-mB7
🌐 Website: https://lnkd.in/ezXhzWGQ
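Below is a minimal sketch of the three-stage decomposition the post describes (motion tokenization, forward dynamics, inverse dynamics). All module names, dimensions, and the toy encoders are illustrative assumptions, not the AMPLIFY implementation.

```python
# Minimal sketch of an AMPLIFY-style "what"/"how" factorization.
# Shapes, encoders, and codebook sizes are assumptions for illustration.
import torch
import torch.nn as nn

class MotionTokenizer(nn.Module):
    """Stage 1 (sketch): compress keypoint trajectories into discrete motion tokens."""
    def __init__(self, n_keypoints=400, horizon=16, codebook_size=512, dim=128):
        super().__init__()
        self.encoder = nn.Linear(n_keypoints * 2 * horizon, dim)
        self.codebook = nn.Embedding(codebook_size, dim)  # VQ-style codebook (assumed)

    def forward(self, keypoint_tracks):
        # keypoint_tracks: (batch, horizon, n_keypoints, 2) pixel trajectories
        z = self.encoder(keypoint_tracks.flatten(1))
        dists = torch.cdist(z, self.codebook.weight)      # nearest-codebook lookup
        return dists.argmin(dim=-1)                       # (batch,) discrete token ids

class ForwardDynamics(nn.Module):
    """Stage 2 (sketch): image + task text -> motion tokens.
    Trainable on any text-labeled video, since no actions are involved."""
    def __init__(self, codebook_size=512, dim=256, n_tokens=8):
        super().__init__()
        self.n_tokens = n_tokens
        self.image_enc = nn.Linear(3 * 64 * 64, dim)   # stand-in for a vision encoder
        self.text_enc = nn.Linear(32, dim)             # stand-in for a language encoder
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, image, text_emb):
        ctx = self.image_enc(image.flatten(1)) + self.text_enc(text_emb)
        logits = self.head(ctx)
        # autoregressive decoding is simplified here to repeating one greedy prediction
        return logits.argmax(dim=-1, keepdim=True).repeat(1, self.n_tokens)

class InverseDynamics(nn.Module):
    """Stage 3 (sketch): motion tokens -> robot actions.
    Trainable on any robot interaction data (play, off-task, random)."""
    def __init__(self, codebook_size=512, dim=256, action_dim=7):
        super().__init__()
        self.token_emb = nn.Embedding(codebook_size, dim)
        self.policy = nn.Linear(dim, action_dim)

    def forward(self, motion_tokens):
        return self.policy(self.token_emb(motion_tokens).mean(dim=1))

# Training-time: tokenize tracked keypoint trajectories into motion tokens.
motion_ids = MotionTokenizer()(torch.randn(1, 16, 400, 2))

# Inference: predict "what" should move, then decode "how" to act.
image, text_emb = torch.randn(1, 3, 64, 64), torch.randn(1, 32)
tokens = ForwardDynamics()(image, text_emb)
action = InverseDynamics()(tokens)
print(action.shape)  # (1, 7) end-effector command (assumed action space)
```

The design point carried over from the post: the forward model never sees actions (so it can train on any text-labeled video), while the inverse model never needs task labels (so it can train on any robot interaction data).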
How to Apply Zero-Shot Learning in Robotics
Summary
Zero-shot learning in robotics allows robots to perform tasks they've never been explicitly trained for by leveraging patterns and knowledge from existing data, often using videos and other unlabeled inputs.
- Utilize diverse datasets: Train models on large-scale, multi-source videos or demonstrations, including non-robotic data like human actions, to build versatile and adaptive robot learning systems.
- Apply motion analysis: Focus on breaking tasks into smaller components, such as predicting movements and mapping them to actions, to enable robots to generalize across new tasks.
- Incorporate world modeling: Use advanced AI systems that understand and predict physical environments, enabling robots to anticipate outcomes and operate in dynamic, real-world scenarios.
🚀 Advancing World Models with V-JEPA 2. Excited to share the latest milestone from one of the research teams I work with at Meta FAIR Core Learning & Reasoning. V-JEPA 2 is a world model built to understand physical reality, anticipate outcomes, and plan efficient strategies.

We humans use our ability to predict how the world will respond to our actions, or the actions of others, all the time, especially when planning what actions to take and how to best navigate new situations.

V-JEPA 2 is a self-supervised world model that reasons in a learned latent space, not in token space like LLMs. The encoder and predictor are pre-trained through self-supervised learning on visual data, leveraging abundant natural videos (more than 1 million hours) from diverse sources to bootstrap physical-world understanding and prediction.

As Yann LeCun highlighted today at VivaTech, this represents a fundamental shift toward AI systems that can reason, plan, and act through physical world models. To reach advanced machine intelligence, AI must go beyond perception and understand how the physical world works, anticipating dynamics, causality, and consequences. V-JEPA 2 does just that.

Robotics is a powerful example: with just 62 hours of robot interaction and no task-specific training, V-JEPA 2 enables zero-shot planning and achieves up to 80% task success, reducing latency from minutes to seconds. It's a leap toward practical, adaptable robots that can operate in dynamic, real-world environments.

The tech community is taking note. On Hacker News, one commenter summed it up well: "It's important to keep some perspective: there are zero robots in the wild, at the moment, that use a world model to work on tasks they weren't specifically trained on. This is cutting edge research ..." - ricardobeat.

💡 If you're passionate about physical world modeling, causal understanding, or building intelligent systems that interact with the real world, join us. More information, the paper, and a set of explainer videos are available here:
🔗 Site: https://ai.meta.com/vjepa
🔗 Blog: https://lnkd.in/gwccegvT
🔗 Paper: https://lnkd.in/gFGDi4xG
🔗 Meta Careers: https://lnkd.in/g-MaBMc5
#AI #WorldModels #VJEPA #MetaAI #FAIR #EmbodiedAI #Robotics #CausalReasoning #LeCun #MachineLearning
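For readers unfamiliar with the JEPA idea, here is a hedged toy sketch of the core training signal the post describes: predict the representation of future frames in a learned latent space rather than reconstructing pixels or predicting tokens. The encoders, predictor, and shapes below are placeholder assumptions, not Meta's architecture.

```python
# Toy sketch of joint-embedding predictive training: the loss lives in latent
# space, so no labels and no pixel reconstruction are needed.
import torch
import torch.nn as nn

dim = 256
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))         # x_t -> z_t
target_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))  # frozen/EMA copy (assumed)
predictor = nn.Linear(dim, dim)                                            # z_t -> predicted z_{t+1}

frames_t = torch.randn(8, 3, 32, 32)      # current frames
frames_next = torch.randn(8, 3, 32, 32)   # future frames

z_t = encoder(frames_t)
with torch.no_grad():                      # targets come from the non-trained encoder copy
    z_next = target_encoder(frames_next)

loss = nn.functional.mse_loss(predictor(z_t), z_next)
loss.backward()  # self-supervised: the only signal is latent-space prediction error
```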
Meta AI Releases V-JEPA 2: Open-Source Self-Supervised World Models for Understanding, Prediction, and Planning

Meta AI has released V-JEPA 2, an open-source video world model designed to learn from large-scale unlabeled video data using a self-supervised joint-embedding predictive architecture. Trained on over 1 million hours of internet-scale video and 1 million images, V-JEPA 2 excels at motion understanding, action anticipation, and video question answering. It achieves state-of-the-art performance on benchmarks like Something-Something v2 and Epic-Kitchens-100 without requiring language supervision during pretraining. Its architecture scales to over 1B parameters, leveraging pretraining strategies such as progressive resolution and temporal extension to enable robust video representation learning.

In addition to perception tasks, Meta introduces V-JEPA 2-AC, an action-conditioned extension trained on just 62 hours of robot interaction data. This version enables zero-shot planning and manipulation on real-world robotic arms, performing tasks like grasping and pick-and-place using visual goals alone. Compared to other models like Octo and Cosmos, V-JEPA 2-AC offers faster inference and higher task success rates without task-specific tuning or rewards. Together, V-JEPA 2 and its variants showcase a scalable and efficient path toward general-purpose embodied AI.

🧲 Read full article: https://lnkd.in/gH2BTZa7
🎓 Paper: https://lnkd.in/gQEDYMMQ
🔥 Models on Hugging Face: https://lnkd.in/g_Gw9ZW9
💡 GitHub Page: https://lnkd.in/gwCQj8wc
Meta AI at Meta
#artificialintelligence #robotics #ai #opensource
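As a rough illustration of how an action-conditioned world model can support zero-shot planning from a visual goal, here is a random-shooting planner over a latent dynamics model. The encoder, dynamics function, and cost below are stand-ins; this is not the released V-JEPA 2-AC API or its exact planning procedure.

```python
# Hedged sketch: plan toward a goal image by rolling out candidate action
# sequences in latent space and executing the first action of the best one.
import torch
import torch.nn as nn

dim, action_dim, horizon, n_samples = 128, 7, 5, 256

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))   # stand-in visual encoder
dynamics = nn.Linear(dim + action_dim, dim)                          # (z_t, a_t) -> z_{t+1}, assumed learned

def plan(obs, goal_img):
    """Random-shooting / MPC-style planner: pick the action sequence whose
    predicted final latent lands closest to the goal image's latent."""
    with torch.no_grad():
        z0 = encoder(obs.unsqueeze(0))
        z_goal = encoder(goal_img.unsqueeze(0))
        actions = torch.randn(n_samples, horizon, action_dim)   # candidate action sequences
        z = z0.repeat(n_samples, 1)
        for t in range(horizon):
            z = dynamics(torch.cat([z, actions[:, t]], dim=-1))  # roll out in latent space
        cost = (z - z_goal).norm(dim=-1)                          # distance to goal latent
        return actions[cost.argmin(), 0]                          # execute only the first action

obs, goal_img = torch.randn(3, 32, 32), torch.randn(3, 32, 32)
print(plan(obs, goal_img))  # 7-dim action command (assumed action space)
```

No reward function or task-specific tuning appears anywhere in this loop; the goal image alone defines the objective, which is the property the post highlights.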
📢 🧬 New paper drop: "Prompting Decision Transformers for Zero-Shot Reach-Avoid Policies" by stellar PhD student Kevin Li (Massachusetts Institute of Technology, Harvard Medical School). https://lnkd.in/eWbtEyVy

Imagine an agent that can reach any goal while avoiding danger, without retraining, even when the hazards change. That's the reach-avoid challenge. Think self-driving cars dodging new construction, or cell therapies steering clear of tumorigenic states.

Most RL methods hardwire the danger zones during training. Want to avoid something new? Retrain. Want to scale to new configurations? Retrain. But what if you could just tell the model what to avoid, on the fly?

Enter RADT: the Reach-Avoid Decision Transformer. It learns from suboptimal data, uses no rewards or costs, encodes goals and avoid regions as prompt tokens, and generalizes zero-shot to new goals and hazards. 🧵👇

What is different here? RADT does not see rewards. Instead, it learns from relabeled offline trajectories: each trajectory is framed as either a "good" or "bad" demonstration of avoiding the specified regions.

The prompt looks like this (a rough sketch of its construction follows the post):
✅ Goal token
❌ One or more avoid tokens (of any shape/size)
🟢 Success/failure indicators
You can mix, match, or modify the prompt at inference time, and RADT will adapt zero-shot.

Benchmarks: FetchReach and MazeObstacle 🏗️
- RADT beats baselines (even retrained ones!) at avoiding hazards and hitting targets
- It handles more and larger avoid regions without ever seeing them in training
- Zero-shot generalization actually works

Real-world application: cell reprogramming 🧬. Start with a fibroblast, reach a cardiomyocyte, and avoid dangerous intermediate states (e.g., tumorigenic ones). RADT reduces time spent in harmful expression states; even when avoidance is impossible, it minimizes exposure.

Why it matters:
- Flexible deployment: same model, new avoid regions
- Reward-free: no need for hand-designed cost functions
- Works in both robotics and biology
- Helps in safety-critical settings where retraining is infeasible

Limitations: it can only handle box-shaped avoid regions for now. But the core idea, prompt-driven, reward-free, zero-shot control, is powerful and widely applicable. RADT is part of a bigger vision: general-purpose agents that follow high-level instructions about where to go and what to avoid, safely and efficiently.

Read the paper: https://lnkd.in/eWbtEyVy

👏 Big kudos to Kevin Li for pushing the frontier on safe, compositional policy learning! Massachusetts Institute of Technology, Harvard Medical School Department of Biomedical Informatics, Kempner Institute at Harvard University, Harvard Data Science Initiative, Broad Institute of MIT and Harvard, Harvard University
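Here is an illustrative sketch of how a reach-avoid prompt along the lines described above (goal token, avoid-region tokens, success/failure indicator) might be assembled for a decision-transformer-style policy. The projections, dimensions, and state encoding are assumptions, not the RADT implementation.

```python
# Sketch: build a prompt of goal / avoid / outcome tokens that prefixes the
# trajectory tokens fed to a decision-transformer-style policy.
import torch
import torch.nn as nn

dim = 64
goal_proj = nn.Linear(3, dim)       # goal position -> goal token (assumed 3D goal)
avoid_proj = nn.Linear(6, dim)      # box-shaped avoid region (min_xyz, max_xyz) -> avoid token
flag_emb = nn.Embedding(2, dim)     # 0 = "bad" demo (entered a region), 1 = "good" demo
state_proj = nn.Linear(10, dim)     # environment state -> state token (assumed 10D state)

def build_prompt(goal, avoid_boxes, success_flag, states):
    """Concatenate prompt tokens with trajectory tokens, so the policy can be
    re-targeted at inference time just by editing the prompt."""
    tokens = [goal_proj(goal)]
    tokens += [avoid_proj(box) for box in avoid_boxes]   # any number of avoid regions
    tokens += [flag_emb(torch.tensor(success_flag))]
    tokens += [state_proj(s) for s in states]
    return torch.stack(tokens)       # (prompt_len + T, dim), input to the transformer

prompt = build_prompt(
    goal=torch.randn(3),
    avoid_boxes=[torch.randn(6), torch.randn(6)],        # two hazards, swappable at test time
    success_flag=1,
    states=[torch.randn(10) for _ in range(4)],
)
print(prompt.shape)  # torch.Size([8, 64])
```

The point of this construction is that the avoid regions live entirely in the prompt, so changing what to avoid at inference time requires no retraining, only a different set of avoid tokens.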