Applying Motion Capture from Video to Robotics

Explore top LinkedIn content from expert professionals.

Summary

Applying motion capture from video to robotics refers to using ordinary video footage to teach robots how to mimic human or animal movements, making robot training more accessible and flexible. By converting visual demonstrations into robotic actions, this process allows robots to learn new skills and behaviors without needing expensive equipment or labor-intensive data collection.

  • Use everyday videos: Take advantage of widely available online videos to extract movement patterns for robot learning, cutting down on specialized recording needs.
  • Break tasks into steps: Separate visual analysis from action planning to help robots understand both what needs to be done and how to do it using their own abilities.
  • Adapt to real-world changes: Incorporate systems that let robots adjust their movements based on new environments or feedback, making them more reliable in varied situations.
Summarized by AI based on LinkedIn member posts
  • View profile for Supriya Rathi

    105k+ | India #1 Robotics Communicator. World #10 | Share your research, and find new ideas through my community | DM for global collabs

    108,576 followers

    SLoMo is a first-of-its-kind #framework for transferring skilled motions from casually captured, in-the-wild video footage of #humans and #animals to legged #robots. The authors demonstrate the approach by transferring the motions of cats, dogs, and humans to example robots, including a #quadruped (on hardware) and a #humanoid (in simulation). Traditional motion imitation for legged motor skills often requires expert animators, collaborative demonstrations, and/or expensive motion capture equipment, all of which limit scalability. SLoMo relies only on easy-to-obtain videos, readily available in online repositories like YouTube, and converts them into motion primitives that can be executed reliably by real-world robots. It works in three stages:

    1. Synthesize a physically plausible reconstructed key-point trajectory from monocular video.
    2. Optimize a dynamically feasible reference trajectory for the robot offline, including body and foot motion as well as a contact sequence that closely tracks the key points.
    3. Track the reference trajectory online using a general-purpose model predictive controller on robot hardware.

    Paper: https://lnkd.in/dVBMHHWB
    #Github: https://lnkd.in/d8RfAuJu
    By: John Zhang, Shuo Yang, Gengshan Yang, Arun Bishop, Swaminathan Gurumurthy, Deva Ramanan, Zac Manchester

    #robotics #research #inspiration #bioinspiration #invention #tech
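
    To make the offline retargeting step (stage 2) concrete, here is a minimal, hedged sketch in Python - not the SLoMo implementation, which optimizes full-body dynamics and contact sequences - of fitting a reference trajectory that tracks key points while penalizing acceleration. All shapes, weights, and data are illustrative assumptions.

        # Hedged sketch: stage 2 as a toy offline trajectory optimization that
        # tracks reconstructed key points while penalizing accelerations,
        # solved as a least-squares problem.
        import numpy as np
        from scipy.optimize import least_squares

        T, D = 50, 3                       # horizon steps, state dimension (x, y, z)
        rng = np.random.default_rng(0)
        keypoints = np.cumsum(rng.normal(0, 0.02, (T, D)), axis=0)  # stand-in for video key points

        def residuals(flat_traj, w_track=1.0, w_smooth=5.0):
            traj = flat_traj.reshape(T, D)
            track = w_track * (traj - keypoints).ravel()           # stay close to key points
            accel = w_smooth * np.diff(traj, n=2, axis=0).ravel()  # penalize harsh accelerations
            return np.concatenate([track, accel])

        sol = least_squares(residuals, keypoints.ravel())
        reference = sol.x.reshape(T, D)    # reference trajectory handed to an online MPC tracker
        print(reference[:3])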

  • View profile for Animesh Garg

    RL + Foundation Models in Robotics. Faculty at Georgia Tech. Prev at Nvidia

    18,669 followers

    Robotics data is expensive and slow to collect. A lot of video is available online, but it isn't readily usable for robotics because it lacks action labels. AMPLIFY solves this problem by learning Actionless Motion Priors that unlock better sample efficiency, generalization, and scaling for robot learning.

    Our key insight is to factor the problem into two stages:
    - The "what": predict the visual dynamics required to accomplish a task.
    - The "how": map predicted motions to low-level actions.
    This decoupling enables remarkable generalizability: our policy can perform tasks where we have NO action data, only videos. We outperform SOTA BC baselines on this by 27x 🤯

    AMPLIFY is composed of three stages:
    1. Motion Tokenization: We track dense keypoint grids through videos and compress their trajectories into discrete motion tokens.
    2. Forward Dynamics: Given an image and a task description (e.g., "open the box"), we autoregressively predict a sequence of motion tokens representing how keypoints should move over the next second or so. This model can train on ANY text-labeled video data: robot demonstrations, human videos, YouTube videos.
    3. Inverse Dynamics: We decode predicted motion tokens into robot actions. This module learns the robot-specific mapping from desired motions to actions and can train on ANY robot interaction data, not just expert demonstrations (think off-task data, play data, or even random actions).

    So, does it actually work?
    - Few-shot learning: Given just 2 action-annotated demos per task, AMPLIFY nearly doubles SOTA few-shot performance on LIBERO. This is possible because our Actionless Motion Priors provide a strong inductive bias that dramatically reduces the amount of robot data needed to train a policy.
    - Cross-embodiment learning: We train the forward dynamics model on both human and robot videos, but the inverse model sees only robot actions. Result: 1.4× average improvement on real-world tasks. Our system successfully transfers motion information from human demonstrations to robot execution.
    - And now my favorite result: AMPLIFY enables zero-shot task generalization. We train on LIBERO-90 tasks and evaluate on tasks where we've seen no actions, only pixels. While our best baseline achieves ~2% success, AMPLIFY reaches a 60% average success rate, outperforming SOTA behavior cloning baselines by 27x.

    This is a new way to train VLAs for robotics that doesn't have to start with large-scale teleoperation. Instead of collecting millions of robot demonstrations, we just need to teach robots how to read the language of motion. Then every video becomes training data.

    Led by Jeremy Collins & Loránd Cheng in collaboration with Kunal Aneja, Albert Wilcox, and Benjamin Joffe at the College of Computing at Georgia Tech.

    Check out our paper and project page for more details:
    📄 Paper: https://lnkd.in/eZif-mB7
    🌐 Website: https://lnkd.in/ezXhzWGQ
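
    As a toy illustration of the motion-token interface, here is a hedged sketch, not the AMPLIFY code (which tracks dense keypoint grids and learns the quantizer): per-step keypoint displacements quantized into a small discrete vocabulary with k-means. The track data, grid size, and vocabulary size are assumptions.

        # Hedged sketch of a "motion tokenizer": quantize per-step keypoint
        # displacements into discrete tokens that a forward-dynamics model could
        # predict and an inverse-dynamics head could decode into robot actions.
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        # Fake tracks: (num_frames, num_keypoints, 2) image-space positions.
        tracks = np.cumsum(rng.normal(0, 1.0, (16, 32, 2)), axis=0)

        # Per-step keypoint displacements are the "motion" to be tokenized.
        deltas = np.diff(tracks, axis=0).reshape(-1, 32 * 2)   # shape (15, 64)

        codebook = KMeans(n_clusters=8, random_state=0).fit(deltas)
        motion_tokens = codebook.predict(deltas)               # one discrete token per step
        print(motion_tokens)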

  • View profile for Jim Fan

    NVIDIA Director of AI & Distinguished Scientist. Co-Lead of Project GR00T (Humanoid Robotics) & GEAR Lab. Stanford Ph.D. OpenAI's first intern. Solving Physical AGI, one motor at a time.

    223,331 followers

    Let's reverse engineer the Tesla Optimus humanoid robot! Part 2: Neural Architecture. Optimus is trained end-to-end: videos in, actions out. I'm quite sure it's implemented by a multimodal Transformer with the following components:

    (1) Image: some variant of an efficient ViT, or simply an old ResNet/EfficientNet backbone (https://lnkd.in/gwJceCJu). The block pick-and-place demo doesn't require sophisticated vision. The spatial feature map from the image backbone can be tokenized easily.

    (2) Video: two ways. Either flatten the video into a sequence of images and produce tokens independently, or use a video-level tokenizer. There are numerous ways to efficiently process video pixel volumes. You don't necessarily need Transformer backbones, e.g. SlowFast Networks (https://lnkd.in/gxZdpeB9) and RubiksNet (https://lnkd.in/gigDhJeT, my paper at ECCV 2020, efficient CUDA shift primitives).

    (3) Language: it's not clear if Optimus is language-prompted. If it is, there needs to be a way to "fuse" the language representations into perception. FiLM is a very lightweight neural network module that serves this purpose (https://lnkd.in/gRskFhwv). You can think of it intuitively as "cross attention" of the language embedding into the image-processing neural pathway.

    (4) Action tokenization: Optimus needs to convert continuous motion signals into discrete tokens for the autoregressive Transformer to work. A few ways:
    - Directly bin the continuous values for each hand joint control: [0, 0.01) -> token #0, [0.01, 0.02) -> token #1, etc. This is straightforward but could be inefficient due to the long sequence length.
    - The joint movements are highly dependent on each other, which means they occupy a low-dimensional "state space". Apply a VQ-VAE to the motion data to obtain a shorter, compressed token set.

    (5) Putting the above pieces together, we have a Transformer controller that consumes video tokens (optionally with language modulation) and outputs action tokens, one step at a time. The next frame from the table is fed back to the Transformer, so it knows the consequence of its action. That gives the *self-corrective ability* shown in the demo.

    I believe the architecture is most similar to:
    - NVIDIA VIMA (my team's work): https://lnkd.in/gZEDB3fD
    - Google RT-1: https://lnkd.in/g7N45aCU

    Lastly, I'm genuinely impressed by the hardware quality. The motions are fluid, and the aesthetics are amazing as well. As I mentioned above, it's such a great decision to follow human morphology closely, so that there is no gap in imitating humans. Atlas from Boston Dynamics only has simple gripper-style hands. In the long run, Optimus' bi-dexterous, 5-finger hands will prove far superior in daily tasks.
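
    The uniform binning in point (4) is simple enough to show directly. A minimal, hedged sketch of that idea follows; the bin count and joint values are illustrative, and this is not Tesla's implementation.

        # Hedged sketch of uniform-binning action tokenization, as described in
        # point (4): continuous joint commands become integer tokens.
        import numpy as np

        NUM_BINS = 100                       # [0, 0.01) -> token 0, [0.01, 0.02) -> token 1, ...

        def actions_to_tokens(joint_values: np.ndarray) -> np.ndarray:
            """Map continuous joint commands in [0, 1) to integer tokens."""
            return np.clip((joint_values * NUM_BINS).astype(int), 0, NUM_BINS - 1)

        def tokens_to_actions(tokens: np.ndarray) -> np.ndarray:
            """Decode tokens back to bin-center joint commands."""
            return (tokens + 0.5) / NUM_BINS

        joints = np.array([0.005, 0.013, 0.731])      # one control step, three joints
        tokens = actions_to_tokens(joints)            # -> [ 0  1 73]
        print(tokens, tokens_to_actions(tokens))      # decoding loses at most half a bin width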

  • View profile for Ahsen Khaliq

    ML @ Hugging Face

    35,815 followers

    Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

    Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface for specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration, given a single static multi-view object scan.

    We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization, which enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration's intended behavior while considering the robot's own morphological limits, rather than attempting to reproduce the hand's motion.

    We evaluate 4D-DPM's 3D tracking accuracy on ground-truth-annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an 87% success rate on average, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models, without any task-specific training, fine-tuning, dataset collection, or annotation.
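
    As a much simpler stand-in for the part-motion recovery that 4D-DPM performs with differentiable rendering, here is a hedged sketch of how a part-centric trajectory can be built once 3D points on a part are tracked: per-frame rigid transforms estimated with the Kabsch algorithm. The data is synthetic and the approach is only illustrative.

        # Hedged illustration: recover a part's rigid motion between two frames
        # from tracked 3D points via the Kabsch/Procrustes algorithm; chaining
        # the per-frame (R, t) over time gives a part-centric trajectory.
        import numpy as np

        def rigid_transform(p_src: np.ndarray, p_dst: np.ndarray):
            """Best-fit rotation R and translation t with R @ p + t ~= q for point rows."""
            c_src, c_dst = p_src.mean(axis=0), p_dst.mean(axis=0)
            H = (p_src - c_src).T @ (p_dst - c_dst)
            U, _, Vt = np.linalg.svd(H)
            R = Vt.T @ U.T
            if np.linalg.det(R) < 0:          # fix a possible reflection
                Vt[-1] *= -1
                R = Vt.T @ U.T
            t = c_dst - R @ c_src
            return R, t

        rng = np.random.default_rng(0)
        part_pts = rng.normal(size=(50, 3))                  # points tracked on one object part
        Q = np.linalg.qr(rng.normal(size=(3, 3)))[0]
        true_R = Q if np.linalg.det(Q) > 0 else -Q           # a proper rotation
        true_t = np.array([0.1, 0.0, 0.05])
        moved_pts = part_pts @ true_R.T + true_t             # the part after one frame of motion
        R, t = rigid_transform(part_pts, moved_pts)
        print(np.allclose(R, true_R), np.round(t, 3))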

  • View profile for Chris Paxton

    AI + Robotics Research Scientist

    6,969 followers

    Just collecting manipulation data isn't enough for robots - they need to be able to move around in the world, which has a whole different set of challenges from pure manipulation. And bringing navigation and manipulation together in a single framework is even more challenging.

    Enter HERMES, from Zhecheng Yuan and Tianming Wei. This is a four-stage process in which human videos are used to set up an RL sim-to-real training pipeline that overcomes differences between robot and human kinematics, used together with a navigation foundation model to move around in a variety of environments.

    To learn more, join us as Zhecheng Yuan and Tianming Wei tell us how they built their system to perform mobile dexterous manipulation from human videos in a variety of environments. Watch Episode #45 of RoboPapers today, hosted by Michael Cho and Chris Paxton!

    Abstract: Leveraging human motion data to impart robots with versatile manipulation skills has emerged as a promising paradigm in robotic manipulation. Nevertheless, translating multi-source human hand motions into feasible robot behaviors remains challenging, particularly for robots equipped with multi-fingered dexterous hands characterized by complex, high-dimensional action spaces. Moreover, existing approaches often struggle to produce policies capable of adapting to diverse environmental conditions. In this paper, we introduce HERMES, a human-to-robot learning framework for mobile bimanual dexterous manipulation. First, HERMES formulates a unified reinforcement learning approach capable of seamlessly transforming heterogeneous human hand motions from multiple sources into physically plausible robotic behaviors. Subsequently, to mitigate the sim2real gap, we devise an end-to-end, depth image-based sim2real transfer method for improved generalization to real-world scenarios. Furthermore, to enable autonomous operation in varied and unstructured environments, we augment the navigation foundation model with a closed-loop Perspective-n-Point (PnP) localization mechanism, ensuring precise alignment of visual goals and effectively bridging autonomous navigation and dexterous manipulation. Extensive experimental results demonstrate that HERMES consistently exhibits generalizable behaviors across diverse, in-the-wild scenarios, successfully performing numerous complex mobile bimanual dexterous manipulation tasks.

    Project Page: https://lnkd.in/e-aEbQzn
    arXiv: https://lnkd.in/eemU6Pwa

    Watch/listen:
    YouTube: https://lnkd.in/erzbkYjz
    Substack: https://lnkd.in/e3ea76Q8
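
    The closed-loop Perspective-n-Point localization mentioned in the abstract can be illustrated with OpenCV's standard solver. This is a hedged sketch with made-up landmarks and intrinsics, not the HERMES code.

        # Hedged sketch of the PnP idea: given 3D landmarks of a visual goal and
        # their detected 2D projections in the current camera image, solvePnP
        # recovers the relative pose used to correct the robot's alignment
        # before manipulation starts.
        import cv2
        import numpy as np

        # Known 3D landmark positions in the goal frame (meters); assumed values.
        object_points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                                  [0.1, 0.1, 0.0], [0.0, 0.1, 0.0]], dtype=np.float64)
        # Matching 2D detections in the current image (pixels); assumed values.
        image_points = np.array([[320.0, 240.0], [400.0, 238.0],
                                 [402.0, 318.0], [322.0, 320.0]], dtype=np.float64)
        K = np.array([[600.0, 0.0, 320.0],
                      [0.0, 600.0, 240.0],
                      [0.0, 0.0, 1.0]])                     # pinhole camera intrinsics
        ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
        # rvec/tvec give the goal pose in the camera frame; in a closed loop, the
        # robot is nudged until this pose matches the desired alignment.
        print(ok, tvec.ravel())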

    Ep#45: HERMES: Human-to-Robot Embodied Learning From Multi-Source Motion Data for Mobile Dexterous Manipulation

    robopapers.substack.com

  • View profile for Aaron Prather

    Director, Robotics & Autonomous Systems Program at ASTM International

    81,367 followers

    Controlling humanoid robots remotely has always been tough, requiring big improvements in both hardware and software to make the robots move easily and naturally. This research, conducted by team members from the Florida Institute for Human and Machine Cognition, Boardwalk Robotics, and the University of West Florida, introduces a new way of controlling robots that combines several key elements: motion capture without calibration, fast whole-body movement streaming, and special high-speed cycloidal motors.

    The motion capture system is unique because it needs only 7 sensors to create full-body movements for the robot, making it simple to set up. The kinematics streaming tool controls the robot's movements in real time, making the robot respond quickly with less delay. The cycloidal motors can handle high speeds and impacts, which is important for tough environments. Together, these tools create a powerful system for controlling robots. Tests with the humanoid robot Nadia showed that this setup works really well, making robot control more efficient and effective than before.

    Read the research here: https://lnkd.in/e7Fd8wwE
    Watch the full video here: https://lnkd.in/ei6QDaxC

  • View profile for Akshet Patel 🤖

    Robotics Engineer | Creator

    45,430 followers

    HumanPlus brings robots one step closer to humans - not just in form, but in motion, response, and intent.

    [⚡Join 2500+ Robotics enthusiasts – https://lnkd.in/dYxB9iCh]

    A team from Stanford University - Zipeng Fu, Qingqing Zhao, Qi Wu*, Gordon Wetzstein, and Chelsea Finn (*equal contributions) - introduces HumanPlus, a framework for humanoid imitation and real-time shadowing directly from human videos and demonstrations. 🏆 Best Paper Award Finalist (Top 6) at CoRL 2024.

    The system uses vision-based motion capture and imitation learning to train policies that allow humanoid robots to mimic human movement across a wide range of behaviours. It enables one-shot shadowing, where the robot mirrors a person's motion in real time, and long-horizon imitation of demonstrations. The trained policy runs directly on hardware, showing generalisation to unseen motions and tasks.

    This brings us closer to truly human-compatible robotics: machines that not only look like us but move with our fluency and intent. If humanoids can now shadow and imitate us this closely, how should we shape the next generation of collaboration?

    Paper: https://lnkd.in/etV46kMt
    Project Page: https://lnkd.in/eW9RE9qs

    #HumanoidRobotics #ImitationLearning #HumanRobotInteraction #EmbodiedAI #CoRL2024

  • View profile for Pranshu Kaushal

    MDE @KLA |Ex Research Officer @A*STAR |Ex- Embedded Intern @Cirus Defence | Ex -Robotics Intern @IITD |Ex-Robotics Research Intern @SUTD | Prototype - Funding Recipient at 19 | 4M+ Impressions

    10,618 followers

    We're stepping into a new era of 𝐫𝐨𝐛𝐨𝐭 𝐜𝐨𝐧𝐭𝐫𝐨𝐥 with a method that enables the control of a robotic character using 𝐟𝐮𝐥𝐥-𝐛𝐨𝐝𝐲 𝐤𝐢𝐧𝐞𝐦𝐚𝐭𝐢𝐜 𝐦𝐨𝐭𝐢𝐨𝐧 references. The technique involves a two-stage approach:

    𝐋𝐚𝐭𝐞𝐧𝐭 𝐌𝐨𝐭𝐢𝐨𝐧 𝐄𝐧𝐜𝐨𝐝𝐢𝐧𝐠: A Variational Autoencoder (VAE) learns compact representations from unstructured human motion data.

    𝐃𝐲𝐧𝐚𝐦𝐢𝐜 𝐂𝐨𝐧𝐭𝐫𝐨𝐥 𝐏𝐨𝐥𝐢𝐜𝐲: This embedding is used to train a control policy that accurately maps kinematic inputs to robot actuator outputs.

    The result? Robust tracking of complex, unseen motions both in simulation and on real-world robots.

    Video: https://lnkd.in/dAWsBQQb
    Paper: https://lnkd.in/d-kzexdJ
    Authors: Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, Moritz Baecher

    #Robotics #AI #MotionControl #Kinematics #DeepLearning
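
    A minimal, hedged sketch of the first stage - a VAE over short motion windows - with assumed window length, joint count, and latent size; this is not the paper's architecture, only an illustration of the interface a downstream control policy would consume.

        # Hedged sketch: a small VAE that compresses a window of full-body poses
        # into a latent motion embedding.
        import torch
        import torch.nn as nn

        WINDOW, DOF, LATENT = 30, 35, 16        # 30 frames, 35 joint values, 16-D latent (assumed)

        class MotionVAE(nn.Module):
            def __init__(self):
                super().__init__()
                self.enc = nn.Sequential(nn.Flatten(), nn.Linear(WINDOW * DOF, 256), nn.ReLU())
                self.to_mu, self.to_logvar = nn.Linear(256, LATENT), nn.Linear(256, LATENT)
                self.dec = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                                         nn.Linear(256, WINDOW * DOF))

            def forward(self, x):
                h = self.enc(x)
                mu, logvar = self.to_mu(h), self.to_logvar(h)
                z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
                recon = self.dec(z).view(-1, WINDOW, DOF)
                return recon, mu, logvar

        vae = MotionVAE()
        poses = torch.randn(8, WINDOW, DOF)                  # a batch of motion clips
        recon, mu, logvar = vae(poses)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = nn.functional.mse_loss(recon, poses) + 1e-3 * kl   # ELBO-style training objective
        print(loss.item())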

  • View profile for Mike Kalil

    content pro | mikekalil.com | youtube: @mikekalil | digital marketer | interested in deep tech, industry 4.0, b2b saas, product development, ai in manufacturing, digital engineering, automation, iiot

    3,901 followers

    Researchers from Nvidia and Carnegie Mellon University are teaching humanoid robots to move like top athletes using artificial intelligence. The team just introduced a new humanoid robot training framework called ASAP, short for Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills.

    ASAP was tested in three environments: Nvidia's Isaac Gym, Isaac Sim, and a real Unitree G1 humanoid robot. In their paper, the researchers said the method made the robot's movements more natural and reduced errors by more than 50 percent. They said it's a promising way to make humanoid robots move like real people in the future. "ASAP enables highly agile motions that were previously difficult to achieve, demonstrating the potential of delta action learning in bridging simulation and real-world dynamics," the researchers wrote.

    The research team used videos of real people doing moves like Cristiano Ronaldo's celebration jump, LeBron James's "silencer" pose, and Kobe Bryant's fadeaway shot. A tool called TRAM converted the videos into 3D motion data to teach the robots. The robots trained in simulation first but struggled in the real world because of differences in physics. The motion data included body position, movement, and shape details. The team fine-tuned the data before using it to train the G1 robots to copy the actions.

    The researchers concluded that humanoid robots can get much better at moving like humans, but there are still big physical limitations. The researchers had to stop testing because of overheating and mechanical failures. Two robots were damaged during testing because the high-energy movements overheated their motors and stressed hardware. "Future directions could focus on developing damage-aware policy architectures to mitigate hardware risks," they wrote.

    #nvidia #robotics #humanoidrobots #ai #technology
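
    The quoted "delta action learning" can be sketched in a few lines. This is a hedged illustration with toy dynamics and assumed dimensions, not the ASAP implementation: a small network learns a corrective action residual so that the simulator, driven by action plus delta, reproduces transitions recorded on the real robot.

        # Hedged sketch of delta action learning: fit a residual action that
        # closes the gap between simulated and real-world dynamics.
        import torch
        import torch.nn as nn

        STATE_DIM, ACTION_DIM = 45, 23          # assumed humanoid state/action sizes

        delta_net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
                                  nn.Linear(256, ACTION_DIM))
        opt = torch.optim.Adam(delta_net.parameters(), lr=1e-3)

        def sim_step(state, action):
            """Stand-in for a differentiable simulator step (toy linear dynamics)."""
            return state + 0.1 * action.mean(dim=-1, keepdim=True).repeat(1, STATE_DIM)

        # Toy "real robot" transitions; in practice these come from hardware rollouts.
        state = torch.randn(64, STATE_DIM)
        action = torch.randn(64, ACTION_DIM)
        real_next_state = sim_step(state, action + 0.3)        # real world differs from sim

        for _ in range(200):                                   # fit the residual dynamics gap
            delta = delta_net(torch.cat([state, action], dim=-1))
            pred_next = sim_step(state, action + delta)
            loss = nn.functional.mse_loss(pred_next, real_next_state)
            opt.zero_grad(); loss.backward(); opt.step()
        # The corrected simulator (sim_step with action + delta) can then be used
        # to fine-tune the motion-tracking policy before deploying to hardware.
        print(loss.item())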
