From Rigid Paths to Adaptive Perception: How Robots Learn to See

Not long ago, industrial robots relied on rigidity: fixed jigs, calibrated paths, and zero tolerance for variation. Efficient, yes, but brittle. Today, perception has changed that. Robots now see and adapt. Cameras, depth sensors, and AI models allow them to align, grasp, and inspect without human correction.

At Black Coffee Robotics, we have worked across this evolution:
• Using ICP for millimeter-level AMR docking and part alignment in welding systems
• Applying RANSAC to identify waterlines and docks from noisy marine LiDAR data
• Combining SIFT/ORB with geometric solvers for cluttered object localization
• Deploying YOLO and ClipSeg for assistive and construction robots
• Experimenting with Vision-Language-Action models for zero-shot task execution in simulation

Our latest blog explains these techniques, how they are used in practice, and how teams can select the right perception approach for their own robotic applications. It concludes with a case study of a vision-guided welding system that adapts its weld path in real time using perception.

Read the full post and watch the welding demo: https://lnkd.in/gp6jQS3r

#robotics #AI #perception #ros2 #isaacsim
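As a rough illustration of the classical end of that toolbox, here is a minimal, hedged sketch (not Black Coffee Robotics' production code) of RANSAC plane extraction followed by ICP refinement using Open3D; the file paths and thresholds are placeholders.

```python
# Minimal sketch: RANSAC plane segmentation + ICP refinement with Open3D.
# File names and thresholds are illustrative placeholders, not production values.
import numpy as np
import open3d as o3d

# Load a LiDAR/depth scan and a reference model of the docking target.
scan = o3d.io.read_point_cloud("scan.pcd")           # placeholder path
target = o3d.io.read_point_cloud("dock_model.pcd")   # placeholder path

# RANSAC: segment the dominant plane (e.g., floor or water surface)
# so the remaining points belong to the structure we want to align to.
plane_model, inliers = scan.segment_plane(
    distance_threshold=0.02, ransac_n=3, num_iterations=1000)
structure = scan.select_by_index(inliers, invert=True)

# ICP: refine an initial pose guess into a tight point-to-plane alignment.
structure.estimate_normals()
target.estimate_normals()
result = o3d.pipelines.registration.registration_icp(
    structure, target,
    max_correspondence_distance=0.05,
    init=np.eye(4),
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPlane())
print("Fitness:", result.fitness)
print("Estimated transform:\n", result.transformation)
```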
Overcoming Robot Perception Challenges in Automation
Explore top LinkedIn content from expert professionals.
Summary
Overcoming robot perception challenges in automation means improving how robots use sensors and AI to see, feel, and understand their environment so they can handle unpredictable tasks more reliably. This involves both visual and tactile technologies, allowing robots to adapt like humans when assembling parts, sorting objects, or working alongside people.
- Combine sensory inputs: Pairing vision systems with touch sensors helps robots adapt to complex tasks that rely on both seeing and feeling, such as gripping delicate objects or threading cables.
- Use smart algorithms: Training robots with new AI models and carefully selected data allows them to recognize items, avoid obstacles, and adjust their actions in real time.
- Design for transparency: Programming robots to explain their actions and alert users to issues builds trust and makes it easier to troubleshoot or collaborate in busy industrial settings.
-
I've started a series of short experiments using advanced Vision-Language Models (#VLM) to improve #robot #perception. In the first article, I showed how simple prompt engineering can steer Grounded SAM 2 to produce impressive detection and segmentation results. However, the major challenge remains: most #robotic systems, including mine, lack GPUs powerful enough to run these large models in real time.

In my latest experiment, I tackled this issue by using Grounded SAM 2 to auto-label a dataset and then fine-tuning a compact #YOLO v8 model. The result? A small, efficient model that detects and segments my SHL-1 robot in real time on its onboard #NVIDIA #Jetson computer!

If you're working in #robotics or #computervision and want to skip the tedious process of manually labeling datasets, check out my article (code included). I explain how I fine-tuned a YOLO model in just a couple of hours instead of days. Thanks to Roboflow and its amazing #opensource tools for making all of this more straightforward.

#AI #MachineLearning #DeepLearning
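For readers who want the shape of the fine-tune-and-deploy step, here is a minimal, hedged sketch using the Ultralytics YOLOv8 API (not the article's exact code); it assumes the Grounded SAM 2 auto-labeling has already produced a YOLO-format dataset, and the paths and hyperparameters are placeholders.

```python
# Minimal sketch: fine-tune a compact YOLOv8 segmentation model on an
# auto-labeled dataset, then export it for a Jetson-class device.
# Paths and hyperparameters are placeholders.
from ultralytics import YOLO

# Start from a small pretrained segmentation checkpoint.
model = YOLO("yolov8n-seg.pt")

# "data.yaml" points at the auto-labeled images/labels in YOLO format,
# e.g. masks produced by Grounded SAM 2 and exported via Roboflow tooling.
model.train(data="shl1_dataset/data.yaml", epochs=50, imgsz=640, batch=16)

metrics = model.val()                       # evaluate on the held-out split
model.export(format="engine", half=True)    # TensorRT engine for Jetson deployment
```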
-
Tactile sensing, or the sense of touch, is very important for humans and robots when they need to handle tasks involving a lot of contact. In robotic tactile sensing, there are three big challenges:
🤖 Understanding the signals from sensors
🤖 Predicting what the sensors will sense in new situations
🤖 Learning how to use the sensor data to make decisions

For visuotactile sensors, which combine vision and touch, interpreting the data is easier because they are similar to vision sensors like cameras. However, predicting sensor signals is still challenging because these sensors deal with contact, deformation, light, and images, which are all expensive and difficult to simulate. This also makes it hard to learn sensor-based policies, since large-scale data collection through simulation is challenging.

Researchers at NVIDIA have developed TacSL (pronounced "taxel"), a new library for simulating and learning with visuotactile sensors using GPUs. It allows users to simulate visuotactile images and get contact-force distributions over 200 times faster than previous methods, all within the popular Isaac Gym simulator. TacSL also offers a learning toolkit with various sensor models, training environments that involve a lot of contact, and online and offline algorithms to help with learning policies that transfer from simulation to real-life scenarios.

On the algorithm side, TacSL introduces a new online reinforcement-learning method called asymmetric actor-critic distillation (AACD), designed to learn how to use tactile data effectively in simulation and transfer that knowledge to real-world applications.

Finally, TacSL demonstrates its usefulness by showing how its library and algorithms improve contact-rich tasks and successfully transfer learning from simulation to the real world.

📝 Research Paper: https://lnkd.in/eRDWcSvF
📊 Project Page: https://lnkd.in/eDUFUjbp

#robotics #research
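TacSL's actual AACD code lives in the linked paper and library; below is only a generic, hedged PyTorch illustration of the asymmetric idea it builds on: the critic consumes privileged simulator state (e.g., contact forces and poses), while the actor only consumes observations available on the real robot (a tactile image plus proprioception). All dimensions are placeholders.

```python
# Generic illustration of an asymmetric actor-critic (not TacSL's AACD code):
# the critic sees privileged simulation state, the actor sees only what the
# hardware provides (tactile image + proprioception). Dimensions are placeholders.
import torch
import torch.nn as nn

class TactileActor(nn.Module):
    def __init__(self, tactile_shape=(3, 64, 64), proprio_dim=14, action_dim=7):
        super().__init__()
        c, h, w = tactile_shape
        self.encoder = nn.Sequential(
            nn.Conv2d(c, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Flatten())
        with torch.no_grad():
            feat_dim = self.encoder(torch.zeros(1, c, h, w)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(feat_dim + proprio_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh())

    def forward(self, tactile, proprio):
        return self.head(torch.cat([self.encoder(tactile), proprio], dim=-1))

class PrivilegedCritic(nn.Module):
    """Value function that sees ground-truth sim state (contact forces, poses)."""
    def __init__(self, priv_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(priv_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, priv_state):
        return self.net(priv_state)

# Smoke test with dummy batches.
actor, critic = TactileActor(), PrivilegedCritic()
a = actor(torch.randn(8, 3, 64, 64), torch.randn(8, 14))
v = critic(torch.randn(8, 64))
print(a.shape, v.shape)  # torch.Size([8, 7]) torch.Size([8, 1])
```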
-
Can AI and LLMs Get Robots to Cooperate Smarter?

Imagine walking onto a busy factory floor where robots are performing complicated tasks while describing, in real time, what they're doing. That sci-fi dream is now well within reach. This paper gives great insight into how LLMs and VLMs will reshape human-robot collaboration, particularly in high-consequence industries.

🔹 Research Focus
Ammar N. Abbas (TU Dublin Computer Science) and Csaba Beleznai (AIT Austrian Institute of Technology) discuss how integrating LLMs and VLMs into robotics lets robots interpret natural-language commands, understand image inputs, and explain their internal processes in plain language. The approach is about creating interpretable systems, building trust and safety, and simplifying operations.

🔹 Language-Based Control
LLMs are good at taking general instructions like "Pick up the red object" and turning them into specific movements. Few-shot prompting lets robots learn sophisticated trajectories without exhaustive programming of their motions, which reduces training time while increasing flexibility.

🔹 Context-Aware Perception
By externalizing internal states, the robot alerts the operator when a collision is imminent or when something is missing from the environment. This transparency not only builds trust but also enables quicker, better-informed decisions, reducing downtime and risk.

🔹 Integrating Input from Vision
VLMs process sequential images to give robots enhanced spatial awareness. This capability enables tasks like sorting items by attributes, avoiding obstacles, and identifying safe zones for operations.

🔹 Robot Structure Awareness
Equipping LLMs with knowledge of a robot's physical structure, such as its reach or mechanical limits, allows for better task planning. For instance, it prevents overreaching and unsafe movements while ensuring accuracy and safety in the workplace.

🔹 Key Takeaway
The framework was demonstrated in simulation on industrial tasks like stacking, obstacle avoidance, and grasping, showing:
- Accurate generation of control patterns
- Real-time contextual reasoning and feedback
- Successful multi-step tasks using both structural and visual data

📌 Practical Applications
This research aims to make advanced robotics accessible to non-experts by bridging automation and collaboration. It promises faster deployment, enhanced safety and efficiency, and improved trust between human and robotic teams.

👉 How can AI and LLMs enhance decision-making in industrial robotics? What are the biggest challenges in implementing LLM-driven robotics? 👈

#ArtificialIntelligence #MachineLearning #AI #GenerativeAI #IndustrialAutomation #Robotics #SmartManufacturing

Subscribe to my Newsletter: https://lnkd.in/dQzKZJ79
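As a hedged toy illustration of the few-shot prompting idea described above (not the paper's framework), the sketch below maps a natural-language command to a small JSON action schema that the robot stack can validate before execution; `call_llm` and the skill names are hypothetical placeholders.

```python
# Illustrative sketch: few-shot prompting that maps a natural-language command
# to a structured action the robot stack can validate. `call_llm` is a
# hypothetical stub for whatever chat-completion API is used in practice.
import json

FEW_SHOT_PROMPT = """You convert operator commands into JSON robot actions.
Command: Pick up the blue cube.
Action: {"skill": "pick", "object": "blue cube"}
Command: Place it in the bin on the left.
Action: {"skill": "place", "target": "left bin"}
Command: <COMMAND>
Action:"""

def call_llm(prompt: str) -> str:
    # Hypothetical stub so the sketch runs; swap in a real LLM call in practice.
    return '{"skill": "pick", "object": "red object"}'

def command_to_action(command: str) -> dict:
    raw = call_llm(FEW_SHOT_PROMPT.replace("<COMMAND>", command))
    action = json.loads(raw)
    # Validate against the skills the robot actually exposes before executing.
    if action.get("skill") not in {"pick", "place"}:
        raise ValueError(f"unknown skill: {action}")
    return action

print(command_to_action("Pick up the red object"))
```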
-
Robotic AI's reliance on vision is limiting its ability to interact with the physical world accurately.

Vision systems dominate robotic AI because they're cost-effective and can collect massive datasets. But this overemphasis on vision overlooks the critical role of force sensing, which provides tactile data that vision simply can't replicate. Without it, robots are limited to estimating force feedback from visuals, leading to inefficiencies in delicate tasks like assembly, gripping, or threading.

As Edward Adelson, professor at the Massachusetts Institute of Technology, explained in his TED Talk, "Force feedback allows robots to perform tactile tasks that vision alone cannot achieve—like folding a towel or threading a cable—by feeling their way through interactions, just as humans do." Adelson's work on GelSight technology highlights how tactile sensing can unlock superhuman precision for robots, enabling them to understand their environment through touch.

The challenge? Force sensors are an added cost, generate less data, and are harder to integrate. But they offer essential benefits:
• Reliability and Safety: For tasks where mistakes aren't an option, force feedback provides the assurance vision alone cannot.
• Deeper Learning: Force sensing enriches AI by adding layers of contact-based data for more robust decision-making.
• Expanding Applications: From industrial automation to medical robotics, tactile data opens doors to tasks beyond vision's reach.

ATI Industrial Automation supports robotics through robust, precise robotic force sensors, helping to bring accuracy to robotic AI data collection.

Edward Adelson's TED Talk: https://lnkd.in/epeCvwqj

#robotics
-
New paper: Compose by Focus — we make robot visuomotor skills robustly composable by focusing perception on the task-relevant parts of a scene using 3D scene graphs, then learning a single policy over those structured inputs.

Motivation: planners (VLM/TAMP) can break a long task into sub-goals, but visuomotor policies often crumble in cluttered, novel scenes. The issue isn't the planner; it's the policy's brittle visual processing under distribution shift. We argue skills must be focused, attending only to relevant objects and relations.

Method in a nutshell: for each sub-goal, we build a dynamic sub-scene graph. Grounded SAM segments relevant objects, a VLM infers relations, a GNN encodes the graph, CLIP encodes the skill text, a diffusion policy is conditioned on both language and graph features, and a VLM planner sequences sub-goals.

Why it matters: focusing on relevant objects/relations mitigates distribution shift, enabling reliable multi-skill composition in visually complex scenes. Scene graphs also form a natural interface between high-level planning and low-level control, easing long-horizon execution and reducing data demands.

Project & paper: https://lnkd.in/eGv9JQ5n
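As a hedged toy sketch (not the paper's architecture), the snippet below shows the general conditioning pattern: pool per-node features from a task-relevant sub-scene graph, concatenate them with a language embedding, and feed both to a policy head. The mean-pool "GNN", the random "CLIP" vector, and all dimensions are stand-ins.

```python
# Toy illustration: condition a policy head on pooled scene-graph features plus
# a language embedding. The mean-pool "GNN" and random "CLIP" vector are stand-ins.
import torch
import torch.nn as nn

class GraphLanguagePolicy(nn.Module):
    def __init__(self, node_dim=32, text_dim=512, action_dim=7):
        super().__init__()
        self.node_proj = nn.Linear(node_dim, 128)  # per-node encoder (stand-in for a GNN)
        self.text_proj = nn.Linear(text_dim, 128)  # stand-in for a CLIP text projection
        self.policy = nn.Sequential(
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim))

    def forward(self, node_feats, text_emb):
        # node_feats: (num_nodes, node_dim) for the task-relevant sub-scene graph.
        graph_feat = self.node_proj(node_feats).mean(dim=0)  # simple mean pooling
        cond = torch.cat([graph_feat, self.text_proj(text_emb)], dim=-1)
        return self.policy(cond)

policy = GraphLanguagePolicy()
nodes = torch.randn(5, 32)        # e.g., 5 relevant objects segmented from the scene
text = torch.randn(512)           # e.g., embedding of "stack the red block"
print(policy(nodes, text).shape)  # torch.Size([7])
```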
-
Minimizing Sim2Real Error in Robotics When Deploying to the Real World

This is maybe the best open-source content on the topic available on the internet. It is written directly by the Reinforcement Learning and Imitation Learning team at LimX Dynamics: https://lnkd.in/gFxXSCYe

Deployment: When deploying a trained policy to the real robot, we often encounter stable simulation performance but markedly different real-world behavior – the Sim2Real Gap. This is a common challenge in robot locomotion training. When facing a significant gap, we need to check whether the following two aspects have been properly addressed:
1. Build accurate robot models.
2. Establish an accurate simulation environment.

If a large gap persists after checking these conditions, we need to analyze whether our policy is overfitting, meaning it is only applicable within a narrow range of environments. In this case, focus on the appropriateness of Domain Randomization: adding randomization items or increasing the randomization range can enhance the model's generalization performance. After optimization, the open-source algorithm's performance on the real robot closely approaches that of the deeply optimized algorithm in remote control mode.

Opportunities for Further Optimization
The algorithm developed this time is implemented on top of the CTS framework. To further enhance performance, researchers are encouraged to explore the following directions in depth:
1. CTS framework enhancement: Improve the CTS framework on multiple levels, such as innovatively designing the Teacher Policy mechanism, using knowledge distillation to improve adaptability in complex terrain, and incorporating insights into robot dynamics into the Student Policy.
2. State representation optimization: The current version explicitly models the robot's linear velocity. Further work could expand implicit feature learning from historical observation data.
3. Improved network architecture: Upgrade the existing MLP policy network to a sequence-aware architecture (e.g., CNN or GRU) to better handle temporal dependencies in motion control.
4. Introducing more observations: Incorporating sensor data like vision can further enrich the robot's environmental perception.
5. Introducing other generative models: Experiment with methods like diffusion models to generate longer action sequence commands.

Click here for the website: https://lnkd.in/gFxXSCYe
Sales: Gigi YE
Cofounder: Li Zhang
Github: https://lnkd.in/gpAUqMbZ
Detailed article: https://lnkd.in/gD-F9iGp
Paper: https://lnkd.in/gz7Wp64P

#Robotics #BipedalRobot #TRON1 #ReinforcementLearning #RL #EmbodiedIntelligence #AI #MachineLearning #OpenSource #SDK #Simulation #Sim2Real #IsaacGym #CTS #RobotLocomotion #Algorithm #LimX #TechInnovation #Python #ControlSystems
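As a hedged, generic sketch of the domain-randomization advice above (not LimX Dynamics' implementation), the wrapper below resamples physics parameters each episode and adds observation noise in a gym-style interface; the parameter names, ranges, and `set_physics_params` hook are assumptions, not a real simulator API.

```python
# Generic domain-randomization sketch: resample physics parameters per episode
# and add observation noise so the policy cannot overfit to one simulated world.
# Parameter names, ranges, and the set_physics_params hook are assumptions.
import numpy as np

class DomainRandomizationWrapper:
    def __init__(self, env, rng=None):
        self.env = env
        self.rng = rng or np.random.default_rng()

    def reset(self, **kwargs):
        # Per-episode randomization of model parameters; in practice these map
        # to your simulator's own API (e.g., Isaac Gym actor/shape properties).
        params = {
            "friction": self.rng.uniform(0.4, 1.2),
            "added_base_mass": self.rng.uniform(-1.0, 2.0),      # kg
            "motor_strength_scale": self.rng.uniform(0.85, 1.15),
            "control_latency": self.rng.uniform(0.0, 0.02),      # seconds
        }
        self.env.set_physics_params(params)  # assumed hook on the sim env
        obs = self.env.reset(**kwargs)
        return self._noisy(obs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._noisy(obs), reward, done, info

    def _noisy(self, obs):
        # Simple sensor-noise model: small Gaussian noise on proprioception.
        return obs + self.rng.normal(0.0, 0.01, size=np.shape(obs))
```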
-
LLMs in the Service of Active Perception

Perception and action (e.g., planning and control) are fundamental to intelligence, yet they are often studied separately in robotics. Traditionally, perception transforms sensory signals into symbolic representations, while action relies on symbolic models to generate movement. This approach assumes perception provides a static 3D model for planning, but it does not reflect biological reality. In nature, perception and action co-evolve, forming an interwoven process.

For example, when cutting a tomato, one must first perceive and locate the knife, then grasp the knife, then perceive the tomato, then bring the knife to the tomato, perform the cutting motion, and finally verify the cut. These interleaved sequences of perception and action can be captured in Perception-Action-Coordination (PAC) programs (consisting of both perceptual and motor functions), which serve as modular, compositional building blocks of intelligence, enabling counterfactual reasoning and control.

By shifting from traditional planning to PAC programs, we can integrate active perception and visual reasoning, which Large Language Models (LLMs) can help structure. Unlike natural language, programs have a clear grammar, making them ideal for LLMs. A key application is attribute learning, where robots learn about object properties (e.g., weight, size) by interacting with them rather than relying on static datasets, which do not scale. Vision-Language Models (VLMs) may align linguistic instructions with visual information but fail to grasp non-visual attributes like weight.

To address this, the proposed framework combines LLMs, VLMs, and robotic control functions to generate PAC programs that actively explore object attributes. These programs invoke sensory, manipulation, and navigation functions, allowing robots to reason beyond visual perception and understand an object's properties through interaction. This approach moves towards more intelligent robots that can see, think, and act to explore their environments.

The work, titled "Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs" by @Angelos Mavrogiannis, @Dehao Yuan, and @Yiannis Aloimonos, will be presented at ICRA 2025 in Atlanta.

Project Page: https://lnkd.in/eff8grGq
arXiv: https://lnkd.in/e2dDEFz3

#AI #Robotics #LLM #VLM #ActivePerception
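As a hedged toy illustration of what a PAC program looks like (not the paper's actual perception-action API), the sketch below interleaves perception and action calls for the tomato example and for non-visual attribute discovery; every function name on the `robot` object is hypothetical.

```python
# Toy illustration of a Perception-Action-Coordination (PAC) program for the
# tomato-cutting example. All methods on `robot` are hypothetical stand-ins for
# a perception/manipulation API, not the paper's actual interface.

def cut_tomato(robot):
    knife_pose = robot.perceive("knife")            # perception: locate the knife
    robot.grasp(knife_pose)                          # action: grasp it
    tomato_pose = robot.perceive("tomato")           # perception: locate the tomato
    robot.move_tool_to(tomato_pose, offset_z=0.05)   # action: approach from above
    robot.execute_motion("slice")                    # action: perform the cut
    # perception again: verify the result, enabling retry / counterfactual reasoning
    if not robot.check("tomato is cut in half"):
        robot.execute_motion("slice")

# Non-visual attribute discovery follows the same pattern, e.g. estimating an
# object's weight by lifting it and reading a wrist force-torque sensor:
def estimate_weight(robot, obj_name):
    pose = robot.perceive(obj_name)
    robot.grasp(pose)
    robot.lift(height=0.10)
    return robot.read_wrist_force().z / 9.81  # mass in kg (idealized)
```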
-
What looks simple on camera is anything but simple under the hood.

In this short video from our Capgemini 𝗔𝗜 𝗥𝗼𝗯𝗼𝘁𝗶𝗰𝘀 & 𝗘𝘅𝗽𝗲𝗿𝗶𝗲𝗻𝗰𝗲𝘀 𝗟𝗮𝗯, a robot executes a 𝗳𝘂𝗹𝗹𝘆 𝗮𝘂𝘁𝗼𝗻𝗼𝗺𝗼𝘂𝘀 pick-and-place scenario. Straightforward? Not quite. Behind this single motion lies the convergence of multiple disciplines:
🔹 Navigation & spatial awareness
🔹 Lidar, depth sensing & multi-modal sensors
🔹 Computer vision & shape recognition
🔹 Dexterity & manipulation in unstructured environments

And here's the real breakthrough: 𝗮𝘁 𝘁𝗵𝗲 𝗲𝗻𝗱 𝗼𝗳 𝘁𝗵𝗲 𝘃𝗶𝗱𝗲𝗼, you can actually see the robot 𝗮𝘂𝘁𝗼-𝗮𝗱𝗷𝘂𝘀𝘁 𝗶𝘁𝘀 𝗼𝘄𝗻 𝗽𝗹𝗮𝗰𝗲𝗺𝗲𝗻𝘁. When the environment shifts, it recalibrates in real time to 𝗴𝘂𝗮𝗿𝗮𝗻𝘁𝗲𝗲 𝟭𝟬𝟬% 𝘀𝘂𝗰𝗰𝗲𝘀𝘀 𝗼𝗳 𝘁𝗵𝗲 𝗽𝗶𝗰𝗸𝗶𝗻𝗴.

This is beyond autonomy: it's the ability to adapt constantly, synchronizing perception, decision-making, and action in dynamic contexts.

What makes me proud is not only the technology itself, but the orchestration behind it. Navigation alone doesn't solve this. Vision alone doesn't solve this. Dexterity alone doesn't solve this. It's the way all of these capabilities are designed to work in harmony that turns a "simple" pick-and-place scenario into a showcase of true robotic intelligence.

A big shout-out to my team who made this possible. This is where #AI, #robotics, and #human ingenuity meet. And where the future of adaptive automation is being built today.

#PhysicalAI #AIRobotics

Baptiste AMARE Jules Carpentier Marc Blanchon Antonio Jesús Jaramillo Mesa Nitin Dhemre Xavi Navarro Muncunill
-
🤖 Embracing Uncertainty in Robotics: Ajna Network's Breakthrough 🤯

In the intricate dance of robotics and dynamic environments, accurate predictions are the key to reliable decision-making. But what if we told you that there's now a mathematical formulation that can unravel uncertainty from noisy sensor data without any prior knowledge about the information? Enter Ajna, a transformative network that goes beyond predictions, offering insights into uncertainty levels for better robotic decision-making.

Traditionally, predictions based on sensor data have often led to errors due to noisy measurements. The rise of neural networks has revolutionized perception-driven predictions in robotics, but these advancements often lacked associated uncertainty metrics. Our novel Ajna network bridges this gap by providing heteroscedastic aleatoric uncertainty estimation without assumptions about prediction labels or network architecture.

A remarkable feature of Ajna is its minimal computational footprint. It seamlessly integrates into existing neural network training frameworks with just a small tweak to the loss function, making it an ideal fit even for resource-constrained robots. Ajna empowers robots to make real-time decisions based on nuanced uncertainty insights.

We take this breakthrough even further by leveraging Ajna's uncertainty cues to tackle common robotics challenges. From evading dynamic obstacles to navigating through complex scenes, Ajna thrives on the uncertainties of optical flow from monocular cameras. Even flying through unknown gaps and segmenting object piles becomes possible without depth computation, thanks to Ajna's innovative approach.

We have successfully tested Ajna across a spectrum of robotics and computer vision tasks, demonstrating its prowess against methods that directly use depth information. This is more than just a network; it's a paradigm shift in the way robotics harnesses uncertainty to enhance decision-making.

🚀 Stay informed, stay curious! 🌐📚 Science never ceases to amaze! 🌟✨

#RoboticsInnovation #AI #NeuralNetworks #DecisionMaking #UncertaintyInRobotics #AjnaNetwork #TechnologyBreakthrough #StayInformed #StayCurious

University of Maryland Nitin J Sanket Chahat Deep Singh Cornelia Fermuller Yiannis Aloimonos
National Science Foundation (NSF): BCS 1824198
National Science Foundation (NSF): CNS 1544787
Office of Naval Research: N00014-17-1-2622

https://lnkd.in/gpZ-JRyi
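Ajna's exact formulation is in the linked paper; as a generic, hedged illustration of the "small tweak to the loss function" style of heteroscedastic aleatoric uncertainty, the sketch below has the network predict both a value and a per-sample log-variance, so confident-but-wrong predictions are penalized while high predicted variance discounts the error. Dimensions are placeholders.

```python
# Generic heteroscedastic aleatoric-uncertainty loss (not Ajna's exact
# formulation): the network predicts a value and a per-sample log-variance.
import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    def __init__(self, in_dim=128, out_dim=2):
        super().__init__()
        self.mean = nn.Linear(in_dim, out_dim)     # prediction (e.g., optical flow)
        self.log_var = nn.Linear(in_dim, out_dim)  # per-sample log-variance

    def forward(self, features):
        return self.mean(features), self.log_var(features)

def heteroscedastic_loss(pred, log_var, target):
    # 0.5 * exp(-log_var) * (pred - target)^2 + 0.5 * log_var, averaged over the batch.
    return (0.5 * torch.exp(-log_var) * (pred - target) ** 2 + 0.5 * log_var).mean()

# Smoke test with dummy features and targets.
head = UncertaintyHead()
feats, target = torch.randn(16, 128), torch.randn(16, 2)
pred, log_var = head(feats)
loss = heteroscedastic_loss(pred, log_var, target)
loss.backward()
print(float(loss))
```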