Recent LLM Breakthroughs in Complex Reasoning

Explore top LinkedIn content from expert professionals.

Summary

Recent breakthroughs in large language models (LLMs) have pushed the boundaries of complex reasoning, enabling these AI systems to solve multi-step problems, make creative decisions, and reliably interact with external information sources. Complex reasoning in LLMs refers to an AI's ability to tackle tasks that require logical thinking, planning, hypothesis generation, and adaptation beyond simple prediction.

  • Explore new reasoning architectures: Look into models that reason in continuous latent spaces or use looped computation for improved logic and planning on challenging tasks.
  • Balance thinking time: Consider using techniques that control how long a model "thinks" before answering, which can improve accuracy while saving computational resources.
  • Integrate external knowledge smartly: Adopt frameworks that allow models to retrieve and use multiple sources of information in parallel, reducing errors and grounding answers in real-world data.
Summarized by AI based on LinkedIn member posts
  • Ross Dawson

    Futurist | Board advisor | Global keynote speaker | Humans + AI Leader | Bestselling author | Podcaster | LinkedIn Top Voice | Founder: AHT Group - Informivity - Bondi Innovation

    34,044 followers

    Chain-of-Thought has been a fundamental technique driving LLM performance. Now 'Chain of Continuous Thought' (Coconut) significantly improves reasoning performance by working in latent space rather than language space. This paper from Meta's AI research group lays out the logic and results:

    💡 Continuous Reasoning Unlocks Efficiency: Large Language Models (LLMs) traditionally reason in "language space," where reasoning steps are expressed as explicit tokens, leading to inefficiencies. The Coconut (Chain of Continuous Thought) paradigm instead reasons in a continuous latent space by feeding the model's hidden state back as input. This reduces reliance on explicit tokens and improves reasoning efficiency, especially for complex tasks requiring backtracking.

    📊 Higher Accuracy in Complex Reasoning Tasks: Coconut achieves significant accuracy improvements on complex tasks requiring planning and logic. On ProsQA, a reasoning-intensive task, Coconut attains 97.0% accuracy, far exceeding Chain-of-Thought (CoT) at 77.5%. On logical reasoning tasks like ProntoQA, it reaches near-perfect 99.8% accuracy, outperforming or matching other baselines while demonstrating superior planning.

    ⚡ Greater Efficiency with Fewer Tokens: Coconut sharply reduces the number of generated tokens. For example, on GSM8k (math reasoning), Coconut reaches 34.1% accuracy using just 8.2 tokens, versus CoT's 42.9% with 25 tokens: far fewer explicit reasoning steps at a modest cost in accuracy on this task.

    🌟 Parallel Reasoning Explores Multiple Alternative Steps: Coconut enables LLMs to simultaneously explore multiple reasoning paths by encoding alternative next steps in the continuous latent space. This parallel reasoning behavior mimics breadth-first search (BFS), allowing the model to avoid premature decisions and progressively narrow down the correct solution.

    🔄 Multi-Stage Training Accelerates Learning: Coconut uses a curriculum-based training strategy in which the explicit reasoning chain is gradually replaced with latent thoughts. This phased approach eases learning and improves performance on math problems (GSM8k) and logical tasks, outperforming baselines like No-CoT and iCoT.

    🔍 Latent Reasoning Improves Planning and Focus: By reasoning in latent space, the model avoids premature decisions and progressively narrows down possibilities. Coconut shows reduced hallucinations and improved accuracy compared to CoT, demonstrating its ability to prioritize promising reasoning paths while pruning irrelevant ones.

    New model architectures are consistently improving LLM performance and efficiency. Even without more training data or underlying model progress, we are seeing steady advances. Link to paper in comments.
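
    For intuition, here is a minimal, illustrative sketch of the core mechanism described above, assuming GPT-2 from Hugging Face transformers as a stand-in model: instead of decoding a token at each reasoning step, the last hidden state is fed back in as the next input embedding. The number of latent steps (n_latent_thoughts), the prompt, and the single-token answer decode are assumptions for illustration, not the paper's training recipe.

```python
# Minimal sketch of latent "continuous thought": instead of decoding a token at each
# reasoning step, feed the model's last hidden state back in as the next input
# embedding, then decode the answer in language space. Illustrative only; the real
# Coconut training recipe (curriculum over latent-thought stages) is not shown.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model, not the paper's
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

prompt = "Question: A is taller than B, B is taller than C. Who is tallest? Think."
inputs_embeds = model.get_input_embeddings()(tok(prompt, return_tensors="pt").input_ids)

n_latent_thoughts = 4  # assumed hyperparameter: number of continuous thought steps
with torch.no_grad():
    for _ in range(n_latent_thoughts):
        out = model(inputs_embeds=inputs_embeds)
        last_hidden = out.hidden_states[-1][:, -1:, :]  # hidden state at the last position
        # Append it as the next "thought" embedding rather than sampling a token.
        inputs_embeds = torch.cat([inputs_embeds, last_hidden], dim=1)

    # After the latent thoughts, switch back to language space for the answer.
    out = model(inputs_embeds=inputs_embeds)
    answer_token = out.logits[:, -1, :].argmax(dim=-1)
print(tok.decode(answer_token))
```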

  • Andreas Sjostrom

    LinkedIn Top Voice | AI Agents | Robotics | Vice President at Capgemini's Applied Innovation Exchange | Author | Speaker | San Francisco | Palo Alto

    13,643 followers

    AI models are reasoning, creating, and evolving. The evidence is no longer theoretical; it's peer-reviewed, measurable, and, in some domains, superhuman. In the last 18 months, we've seen LLMs move far beyond next-token prediction. They're beginning to demonstrate real reasoning, hypothesis generation, long-horizon planning, and even scientific creativity. Here are six breakthroughs that redefine what these models can do:

    1. Superhuman Clinical Reasoning (Nature Medicine, 2025): In a rigorous test across 12 specialties, GPT-4 scored 89% on the NEJM Knowledge+ medical reasoning exam, outperforming the average physician score of 74%. This wasn't just Q&A; it involved multi-hop reasoning, risk evaluation, and treatment planning. That's structured decision-making in high-stakes domains.

    2. Creative Research Ideation (Zhou et al., 2024, arXiv:2412.10849): Across 10 fields from physics to economics, GPT-4 and Claude generated research questions rated more creative than human-generated ones in 53% of cases. This wasn't trivia; domain experts blindly compared ideas from AI and researchers, and in over half the cases the AI won.

    3. Falsifiable Hypotheses from Raw Data (Nemati et al., 2024): GPT-4o was fed raw experimental tables from biology and materials science and asked to propose novel hypotheses. 46% of them were judged publishable by experts, outperforming PhD students (29%) on the same task. That's not pattern matching; that's creative scientific reasoning from scratch.

    4. Self-Evolving Agents (2024): LLM agents that reflect, revise memory, and re-prompt themselves improved their performance on coding benchmarks from 21% → 34% in just four self-corrective cycles, without retraining. This is meta-cognition in action: learning from failure, iterating, and adapting over time.

    5. Long-Term Agent Memory (A-MEM, 2025): Agents equipped with dynamic long-term memory (inspired by Zettelkasten) achieved 2× higher success on complex web tasks, planning across multiple steps with context continuity.

    6. Emergent Social Reasoning (AgentSociety, 2025): In a simulation of 1,000 LLM-driven agents, researchers observed emergent social behaviors: rumor spreading, collaborative planning, and even economic trade. No hardcoding, just distributed reasoning, goal propagation, and learning-by-interaction.

    These findings span healthcare, science, software engineering, and multi-agent simulations. They reveal systems that generate, reason, and coordinate, not just predict. So when some argue that "AI is only simulating thought," we should ask: are the tests capturing how real reasoning happens? The Tower of Hanoi isn't where science, medicine, or innovation happens. The real test is:

    1. Can a model make a novel discovery?
    2. Can it self-correct across steps?
    3. Can it outperform domain experts in structured judgment?

    And increasingly, the answer is: yes. Let's not confuse symbolic puzzles with intelligence. Reasoning is already here, and it's evolving.

  • Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    13,357 followers

    Breaking: RAG-R1 Framework Revolutionizes How LLMs Handle External Knowledge

    Researchers from the AWorld Team and Inclusion AI have just released RAG-R1, a groundbreaking training framework that fundamentally changes how Large Language Models interact with external knowledge sources during reasoning.

    The Core Innovation: Traditional RAG systems suffer from a critical bottleneck: they generate only a single search query when external retrieval is needed, leading to substantial inference time and limited knowledge acquisition. RAG-R1 solves this with multi-query parallelism, enabling models to generate up to three parallel search queries simultaneously.

    Under the Hood Architecture: The framework operates through a two-stage training process.

    Stage 1: Format Learning SFT. The system generates samples integrating reasoning and search, segmented into four distinct categories. Models learn to respond in a "think-then-search" format, using special tokens like <think>, <search>, and <answer> to structure their reasoning process.

    Stage 2: Retrieval-Augmented RL. This stage employs Proximal Policy Optimization with outcome-based rewards to enhance reasoning capabilities. The system implements a retrieval-masked loss to prevent retrieved tokens from interfering with the model's inherent reasoning abilities.

    Technical Breakthrough: Multi-query parallelism returns results in JSON format, clearly aligning search queries with retrieved documents. This approach reduces retrieval rounds by 11.1% while keeping time per retrieval operation comparable.

    Performance Impact: Testing on seven question-answering benchmarks with Qwen2.5-7B-Instruct as the backbone model showed remarkable results:
    - Up to 13.2% improvement over the strongest baselines
    - Significant performance gains across both general QA and multi-hop reasoning tasks
    - Excellent generalization to out-of-domain datasets

    The framework addresses the fundamental challenge of LLMs generating hallucinated or outdated responses by enabling adaptive use of both internal and external knowledge during reasoning. This represents a significant step forward in making AI systems more reliable and grounded in real-world knowledge.
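
    As a rough illustration of the inference-side idea, the sketch below parses a <search> block, issues up to three queries in parallel, and returns the results as JSON keyed by query. The retrieve() function, tag handling, and JSON layout are assumptions for illustration; this is not the RAG-R1 implementation or its training code.

```python
# Sketch of the "think-then-search" loop with multi-query parallelism, assuming a
# hypothetical retrieve() function and an LLM that emits <think>/<search>/<answer>
# tags as described in the post. Not the authors' implementation.
import json
import re
from concurrent.futures import ThreadPoolExecutor

def retrieve(query: str) -> list[str]:
    """Hypothetical retriever: return top documents for a query."""
    return [f"doc about {query}"]

def run_parallel_search(model_output: str, max_queries: int = 3) -> str:
    # Extract the queries the model placed inside its <search> block.
    block = re.search(r"<search>(.*?)</search>", model_output, re.S)
    if block is None:
        return ""
    queries = [q.strip() for q in block.group(1).splitlines() if q.strip()][:max_queries]

    # Issue all queries in parallel instead of one retrieval round per query.
    with ThreadPoolExecutor(max_workers=max_queries) as pool:
        results = list(pool.map(retrieve, queries))

    # Return JSON that keeps each query aligned with its retrieved documents; this is
    # appended to the context before the next <think> step.
    return json.dumps({q: docs for q, docs in zip(queries, results)}, indent=2)

print(run_parallel_search("<think>I need dates and names.</think>\n"
                          "<search>\nwhen was X founded\nwho founded X\n</search>"))
```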

  • Zain Hasan

    I build and teach AI | AI/ML @ Together AI | EngSci ℕΨ/PhD @ UofT | Previously: vector DBs, data scientist, lecturer & health tech founder | 🇺🇸🇨🇦🇵🇰

    16,483 followers

    Can you tell reasoning models to think longer before producing an answer? Apparently yes, and there are multiple ways, presented by CMU and Stanford! I recently explored two papers tackling the same challenge in AI reasoning from completely different angles: how to control and optimize the "thinking time" of large language models to balance computational resources and performance.

    The problem both papers address: Modern reasoning LLMs like OpenAI's o1 and DeepSeek's R1 can generate extended chain-of-thought sequences to solve complex problems. However, the uncontrolled nature of this reasoning leads to inefficiencies: sometimes wasting resources with lengthy reasoning, other times stopping too early on difficult problems.

    Two different solutions: reinforcement learning vs. minimal data.

    L1: Reinforcement learning for precise control. Carnegie Mellon researchers Aggarwal and Welleck's "L1: Controlling How Long A Reasoning Model Thinks" takes a sophisticated reinforcement learning approach:
    >> Length Controlled Policy Optimization (LCPO): trains models to generate correct answers while adhering to user-specified length constraints
    >> Two variants: L1-Exact (produces reasoning of exactly the requested length) and L1-Max (doesn't exceed the requested length)
    >> Results: outperforms previous methods by up to 100% relative and 20% absolute performance on math tasks

    S1: Minimal data and test-time control. Muennighoff et al.'s "s1: Simple test-time scaling" takes a remarkably lightweight approach:
    >> Minimal dataset: just 1,000 carefully curated high-quality examples (the s1K dataset)
    >> Budget forcing: a simple test-time mechanism to control reasoning length
    >> Efficiency: training takes only 26 minutes on 16 H100 GPUs
    >> Results: outperforms OpenAI's o1-preview on competition math by up to 27%

    Despite their different approaches, both methods demonstrate:
    >> Test-time scaling: performance improves with allocated computation
    >> Strong generalization: both methods work across various reasoning tasks
    >> Efficiency improvements: both outperform previous approaches with fewer resources

    Perhaps most impressively, L1's 1.5B-parameter model matches GPT-4o at equal reasoning lengths, while S1 outperforms o1-preview while using just 1/800th the training data of comparable models. Together, they suggest that reasoning in LLMs may be more accessible than previously thought, whether through careful optimization of smaller models or minimal fine-tuning of larger ones.

    Both papers are open-source with code available:
    L1: https://lnkd.in/gj6Pw4ki
    S1: https://lnkd.in/gPJsZ2iC
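
    Here is a minimal sketch of the budget-forcing idea from s1, under stated assumptions: generate() and token_len() are stand-ins for an LLM call and a tokenizer, and the <think> tags, "Wait" cue, and budget values are illustrative choices rather than the paper's exact code.

```python
# Sketch of s1-style "budget forcing" at decode time: cap the thinking tokens, and if
# the model tries to stop thinking early, append a cue such as "Wait" so it keeps
# reasoning. generate() and token_len() are stand-ins for any chat LLM and tokenizer.
def generate(prompt: str, stop: str = "</think>", max_tokens: int = 256) -> str:
    """Stand-in for an LLM call that decodes until `stop` or `max_tokens`."""
    return "Let me check the edge cases again. The answer should be 42."

def token_len(text: str) -> int:
    """Stand-in tokenizer: word count is a rough proxy for token count."""
    return len(text.split())

def budget_forced_answer(question: str, think_budget: int = 256) -> str:
    prompt = question + "\n<think>"
    thought = generate(prompt, stop="</think>", max_tokens=think_budget)

    if token_len(thought) < think_budget:
        # The model ended its thinking early: suppress the end-of-thinking marker and
        # force extra reasoning by appending "Wait" and decoding again.
        thought += "\nWait," + generate(prompt + thought + "\nWait,",
                                        stop="</think>",
                                        max_tokens=think_budget - token_len(thought))

    # Budget spent (or thinking judged sufficient): close the block and get the answer.
    return generate(prompt + thought + "</think>\nFinal answer:", stop="\n", max_tokens=32)

print(budget_forced_answer("What is 6 * 7?"))
```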

  • Jason Eshraghian

    Assistant Professor, University of California, Santa Cruz

    6,417 followers

    Rui-Jie Zhu has done it again. While our previous LLMs targeted efficiency (SpikeGPT, MatMul-free LM), this work is squarely focused on hitting SoTA. Our latest models, Ouro 1.4B and Ouro 2.6B, match or beat frontier models 2-3x larger (Qwen3, Llama3.2, Gemma3).

    The modern path to improving LLMs has been: 1) add parameters, 2) scale tokens, and 3) add "reasoning" by letting models talk to themselves via Chain-of-Thought. (1) More parameters require more data to remain compute-optimal. (2) Frontier models already train on most of the internet; we've run into a bit of a data wall. (3) These limits turned attention to reasoning models. Reasoning models are powerful, but they bloat your finite context window, and they under-utilize the trillions of tokens available in pretraining, since reasoning is usually added post-hoc.

    So are parameters and tokens the only two scaling dimensions in pretraining? Our latest paper shows a third scaling dimension: the depth of computation. We scale latent-space reasoning via recurrent depth during pretraining. Inspired by Universal Transformers, we push this to an industrial scale, which introduced non-trivial challenges.

    - We pretrained on 7.7 trillion tokens, an unprecedented industrial scale for "looped computation".
    - We trained the model to learn when to stop. This required an entropy-regularized objective to prevent reward-hacking by constantly maximizing the number of loops. For simple tokens, the model might exit early at T=2, while for complex problems it is more likely to max out the recursion steps.
    - Ouro 1.4B (up to 4 loops) matches the performance of SoTA 4B Transformers across the board. Ouro 2.6B (4 loops) matches 8B and 12B models.
    - We tried to lift the hood on why this is so powerful. On raw knowledge storage (bits per parameter), Ouro is similar to standard LLMs; on manipulation- and reasoning-heavy tasks, looped computation wins. This is consistent with the "physics of LLMs" view that more compute per token improves reasoning efficiency.

    Thank you to all collaborators for making this happen, across ByteDance, Princeton, Conscium, MILA, Univ of Montreal, Peking, Carnegie Mellon, UPenn, Univ of Manchester, and M-A-P.

    Base models and Thinking variants have been open-sourced with vLLM/SGLang integration: https://lnkd.in/giBjDKYE
    Preprint: https://lnkd.in/gmEkUpNs

    Rui-Jie Zhu: in need of a well-deserved nap.
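
    To make the "looped computation" idea concrete, here is a toy PyTorch sketch: one shared transformer block applied up to max_loops times, with a learned gate that lets easy inputs exit early. The layer sizes, pooled halting score, and greedy exit rule are assumptions for illustration; Ouro's actual architecture and entropy-regularized training objective are not reproduced here.

```python
# Toy sketch of "looped" computation: a single shared block applied up to max_loops
# times, with a learned halting gate that lets easy inputs exit early. Illustrative
# only; not the Ouro architecture or its training recipe.
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, max_loops: int = 4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.exit_gate = nn.Linear(d_model, 1)   # predicts a "stop here" score
        self.max_loops = max_loops

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, int]:
        for step in range(1, self.max_loops + 1):
            x = self.block(x)                    # the same weights are reused at every loop
            p_stop = torch.sigmoid(self.exit_gate(x.mean(dim=1)))  # pooled halting score
            if p_stop.mean() > 0.5:              # greedy early exit for the sketch
                return x, step
        return x, self.max_loops

x = torch.randn(2, 16, 256)                      # (batch, sequence, d_model)
_, used_loops = LoopedBlock()(x)
print("loops used:", used_loops)
```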

  • Ketaki Sodhi, PhD

    Head of AI Enablement @ Moody’s | ex-MSFT & ex-Harvard D3 GenAI Council

    4,540 followers

    🧠 The evolution of AI reasoning is fascinating, and with all the buzz about AI agents, we're seeing a rapid shift from "fast thinking" to "deliberate reasoning" in Large Language Models. Most LLMs today operate like System 1 thinking in humans (from Kahneman's seminal work on System 1 and System 2): quick, intuitive, and sometimes prone to errors. But 2024 has brought exciting developments in pushing these models toward System 2 thinking, the slow, methodical reasoning we use for complex problems.

    📝 Chain of Thought was the first breakthrough. Imagine teaching someone by saying "show your work": instead of jumping to answers, we prompt LLMs to write out their step-by-step reasoning. Simple but powerful: "First, I'll calculate X... Then, considering Y..." This dramatically improved accuracy on complex tasks.

    🌳 Graph of Thought took this further. Instead of a linear path, it explores multiple reasoning routes simultaneously. Think of it like brainstorming, where you map out different approaches to a problem, evaluate each path, and choose the most promising one. This helps catch errors and find innovative solutions.

    🎲 And now, researchers have introduced Monte Carlo Tree Search for LLM reasoning. Think of it like a chess grandmaster exploring possible moves, but instead of game positions, we're dealing with reasoning steps. Each potential path is tested hundreds of times, and the most promising ones are explored further.

    The implications? We're getting closer to AI systems that can tackle complex reasoning tasks with the kind of methodical approach humans use for critical thinking. We're already seeing this with models like o1 that crush benchmarks for PhD-level reasoning compared to GPT-4-class models. But the use cases for these will be different: these models aren't just a "more powerful GPT-4" but are useful for a different set of problems and applications, especially those requiring precise logical reasoning or complex problem-solving. And as GPT-4-class models start getting commoditized, more model providers will lean into developments in this area.
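
    For a concrete sense of how Monte Carlo Tree Search maps onto reasoning, here is a minimal sketch: nodes are partial chains of thought, expansion proposes candidate next steps, rollouts score completions, and selection uses UCB1. propose_steps() and score_rollout() are hypothetical LLM-backed calls stubbed with toy values; no specific paper's implementation is reproduced here.

```python
# Minimal sketch of MCTS over reasoning steps: nodes hold partial chains of thought;
# expansion proposes next steps, rollouts score them, and selection uses UCB1.
import math
import random

def propose_steps(chain: list[str]) -> list[str]:
    """Hypothetical: ask an LLM for a few candidate next reasoning steps."""
    return [f"step option {i} after {len(chain)} steps" for i in range(3)]

def score_rollout(chain: list[str]) -> float:
    """Hypothetical: complete the chain with an LLM and score the final answer (0..1)."""
    return random.random()

class Node:
    def __init__(self, chain, parent=None):
        self.chain, self.parent = chain, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def mcts(question: str, iterations: int = 100) -> list[str]:
    root = Node([question])
    for _ in range(iterations):
        node = root
        while node.children:                       # selection
            node = max(node.children, key=Node.ucb)
        node.children = [Node(node.chain + [s], node) for s in propose_steps(node.chain)]
        leaf = random.choice(node.children)        # expansion
        reward = score_rollout(leaf.chain)         # simulation (rollout)
        while leaf:                                # backpropagation
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits).chain

print(mcts("Solve: 17 * 24 = ?")[:2])
```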

  • Prabhu Prakash Kagitha

    Research @ Drexel | Ex-Lead ML Engineer @ Akaike.ai | Scalable Deep learning/NLP/LLM solutions

    3,113 followers

    The recently released open-source model DeepSeek-R1 is comparable to OpenAI o1 on multiple benchmarks. This o1-level complex reasoning ability, as described in the paper, comes from "emergent long Chain-of-Thought (CoT) from large-scale reinforcement learning." This might sound alienating, but it should feel rather familiar after a brief reflection. Let's understand by contrasting the training recipes of standard LLMs vs. reasoning ("thinking") LLMs.

    Simplified recipe for training SOTA LLMs (DeepSeek-V3, Llama, Qwen, ...):
    📌 Pre-training: starting from random weights, train on ~10-20 trillion tokens of data
    📌 Post-training:
    ✔️ Supervised fine-tuning (SFT) with millions of (prompt, response) pairs
    ✔️ Preference learning (RLHF/PPO/DPO) with millions of human preferences or labels from reward models

    Before going over the recipe for thinking/reasoning models, let's consider why we are building these new types of models. For some hard problems, like Olympiad-level math problems, rather than giving one answer, giving multiple answers or trying different approaches proved to be helpful. Even better if there is an internal thought process (long CoT) that tries an approach, regularly introspects, and changes the approach or explores new strategies if the previous one is not satisfactory. See the contrast here: standard CoT breaks the problem into multiple steps but tries only one approach.

    This came to be known as inference-time scaling (at least, one way of doing this scaling): we generate more tokens based on the complexity of the problem (possibly tens of thousands of tokens) while inferring from a model. These extra generated tokens for exploration can be treated as "internal thinking" tokens. (See an LLM having an 'aha moment' in one of the images attached.) Usually, these thinking models are trained on math, coding, or logic-heavy domains. These tasks have a verifiable correct answer, and the correctness acts as a reward/preference for RL.

    Simplified recipe for training reasoning LLMs (DeepSeek-R1, possibly OpenAI o1/o3):
    1️⃣ After pre-training, perform SFT on a small amount of warm-up data rather than millions of (prompt, response) pairs.
    2️⃣ Perform RL for many iterations with correctness as the reward (a minimal sketch of such a reward follows below). This is where long CoT emerged, just from the trial and error of RL training.
    3️⃣ Perform SFT on large curated data for broad tasks.
    4️⃣ Perform RL again with correctness as the reward for math/code tasks, and an LLM-as-a-judge style generative reward model for other tasks.

    It is important to note that emergent long CoT (internal thinking) with large-scale RL was only possible in domains where we know the exact answers. Acquiring complex reasoning abilities outside of these domains is still an open research question.
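
    Below is a minimal sketch of what a rule-based, correctness-driven reward can look like in verifiable domains. The tag format, the 0.1 format bonus, and the exact-match check are assumptions for illustration; DeepSeek-R1's actual reward design differs in its details.

```python
# Sketch of a rule-based reward for RL on verifiable tasks: a small format reward for
# emitting <think>...</think><answer>...</answer>, plus a correctness reward when the
# extracted answer matches the reference. Illustrative weights and parsing only.
import re

def reasoning_reward(completion: str, reference_answer: str) -> float:
    reward = 0.0
    has_format = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                                completion, re.S))
    if has_format:
        reward += 0.1                                     # small format reward
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    if answer and answer.group(1).strip() == reference_answer.strip():
        reward += 1.0                                     # correctness is the main signal
    return reward

sample = "<think>2+2 is 4, double check... yes.</think> <answer>4</answer>"
print(reasoning_reward(sample, "4"))    # 1.1
```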

  • Daron Yondem

    AI/ML/GenAI Lead at AWS | PhD in Leadership | Helping enterprises align AI and humans around real business outcomes | Former CTO | Speaker & Coach

    54,964 followers

    Ever wondered if LLMs can think too much? New research shows that maintaining excessive reasoning history actually HARMS performance. Less can be more when it comes to AI reasoning!

    Researchers have introduced "Atom of Thoughts" (AOT), a framework that transforms complex reasoning in Large Language Models into a Markov process. Unlike traditional approaches that store and process entire reasoning chains, AOT only maintains the information relevant to the current step, mimicking how humans naturally solve problems. When applied to gpt-4o-mini, AOT achieved an 80.6% F1 score on HotpotQA, surpassing specialized reasoning models like o3-mini (+3.4%) and DeepSeek-R1 (+10.6%).

    The key innovation is a two-phase mechanism: first decomposing questions into dependency-based directed acyclic graphs, then contracting subquestions into simplified, independent problems. Most impressively, AOT can serve as both a standalone framework and a plug-in enhancement for existing methods. By eliminating redundant historical information, it achieves higher performance while using significantly fewer computational resources; the test-time efficiency curves show the steepest performance-to-cost ratio among all compared methods.

    Could this Markov-style reasoning be the key to making LLMs more efficient and effective at complex problem-solving? What other cognitive science principles might still be waiting to revolutionize AI reasoning? Full paper link in comments.

    #AIResearch #MachineLearning #LLMs #ReasoningAI #ComputationalEfficiency
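
    A minimal sketch of the decompose-then-contract loop is below, assuming hypothetical LLM-backed decompose(), answer(), and contract() calls (stubbed with toy values here). It illustrates the Markov property of carrying only the current contracted question forward; it is not the paper's implementation.

```python
# Sketch of an Atom-of-Thoughts style loop: decompose the current question into a
# dependency DAG of subquestions, answer the independent ones, then contract the rest
# plus the new facts into a single self-contained question, so no reasoning history is
# carried forward. The three helper calls are hypothetical LLM calls with toy stand-ins.
def decompose(question: str) -> dict[str, list[str]]:
    """Hypothetical LLM call: map each subquestion to the subquestions it depends on."""
    return {"find the intermediate fact": [],
            "use the fact to answer": ["find the intermediate fact"]}

def answer(question: str) -> str:
    """Hypothetical LLM call: directly answer a self-contained question."""
    return f"answer({question})"

def contract(solved: dict[str, str], remaining: list[str]) -> str:
    """Hypothetical LLM call: fold solved facts and open subquestions into one question."""
    facts = "; ".join(f"{q} -> {a}" for q, a in solved.items())
    return f"Given {facts}, answer: {'; '.join(remaining)}"

def atom_of_thoughts(question: str, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        dag = decompose(question)
        independent = [q for q, deps in dag.items() if not deps]
        dependent = [q for q in dag if q not in independent]
        if not dependent:
            break                                 # simple enough to answer directly
        solved = {q: answer(q) for q in independent}
        # Markov step: the contracted question is the only state carried forward.
        question = contract(solved, dependent)
    return answer(question)

print(atom_of_thoughts("What year did the author of the cited novel win the prize?"))
```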

  • Jon Salisbury

    CAIO - CEO @ Nexigen - Ultra Curious, Humble - Cyber Security, Cloud, Smart City, AI, Quantum, Human Centered, Psychology, Leadership. Cooperation, Patience, Encourage, Helpful, Christian, Love! (100k weekly views)

    17,109 followers

    Dynamic #Reasoning Graphs + LLMs = $$

    Large Language Models (LLMs) often stumble on complex tasks when confined to linear reasoning. What if they could dynamically restructure their thought process like humans? A new paper introduces Adaptive Graph of Thoughts (AGoT), a test-time framework that replaces rigid prompting strategies (like Chain/Tree of Thought) with dynamic directed acyclic graphs (DAGs). Instead of forcing fixed reasoning steps, AGoT recursively decomposes problems into sub-tasks, selectively expanding only the most critical pathways. This is crucial for industries like scientific research or legal analysis, where problems demand non-linear, nested reasoning.

    The key innovation lies in complexity checks: AGoT assesses each reasoning node, spawning sub-graphs for intricate subtasks while resolving simpler ones directly. This mirrors how experts allocate mental effort, drilling into uncertainties while streamlining obvious steps. The framework achieved a 46.2% improvement on GPQA (a notoriously hard science QA benchmark), rivaling gains from compute-heavy fine-tuning. By unifying chain, tree, and graph paradigms, AGoT retains CoT's clarity, ToT's exploration, and GoT's flexibility without manual tuning. The result? LLMs that self-adapt their reasoning depth based on problem complexity, with no architectural changes needed.

    Kelly Cohen Lohith Srikanth Pentapalli Josette Riep Joe Oquist💡 Lori Mazor Zachary Huhn Summer Crenshaw Monica Turner James Downs

    Link to paper: https://lnkd.in/gSJSgpbC
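
    As a rough illustration of the complexity-check idea, the sketch below recursively decides per reasoning node whether to answer directly or spawn sub-tasks. is_complex(), split(), solve_directly(), and combine() are hypothetical LLM-backed calls with toy stand-ins; this is not the AGoT codebase.

```python
# Sketch of recursive decomposition with a complexity check: simple nodes are resolved
# directly, complex nodes spawn a sub-graph of subtasks. All helpers are hypothetical
# LLM calls replaced with toy stand-ins for the sketch.
def is_complex(task: str) -> bool:
    """Hypothetical LLM judgment: does this task need further decomposition?"""
    return len(task.split(" and ")) > 1          # toy proxy for the sketch

def split(task: str) -> list[str]:
    """Hypothetical: propose subtasks for a complex task."""
    return task.split(" and ")

def solve_directly(task: str) -> str:
    """Hypothetical: answer a simple task in one shot."""
    return f"answer({task})"

def combine(task: str, sub_answers: list[str]) -> str:
    """Hypothetical: synthesize subtask answers into an answer for the parent task."""
    return f"combined({'; '.join(sub_answers)})"

def agot_solve(task: str, depth: int = 0, max_depth: int = 3) -> str:
    if depth >= max_depth or not is_complex(task):
        return solve_directly(task)              # simple node: resolve directly
    sub_answers = [agot_solve(sub, depth + 1, max_depth) for sub in split(task)]
    return combine(task, sub_answers)            # complex node: expand a sub-graph

print(agot_solve("identify the reaction mechanism and estimate the rate constant"))
```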

  • Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks, but we found their fundamental limitations are more severe than expected. In our latest work, we compared each "thinking" LRM with its "non-thinking" LLM twin. Unlike most prior work that only measures final performance, we analyzed their actual reasoning traces, looking inside their long "thoughts". Our study reveals several interesting results:

    1 - Three distinct performance regimes 🟡🔵🔴: Under equal inference-compute budgets, standard LLMs outperform reasoning models on low-complexity tasks, reasoning models excel at medium complexity, and both collapse to zero accuracy on high-complexity tasks.

    2 - Counterintuitive scaling limits 🔄: As problems get more difficult, reasoning models initially think more (good!) but then START THINKING LESS despite having plenty of token budget left. They give up right when they should work harder!

    3 - Looking inside the "thoughts" 🔍: By replaying every intermediate move in our simulators, our puzzle setup shows that LRMs find answers early but then "overthink" on simple problems, eventually reach correct solutions only after exploring wrong paths on medium problems, and completely fail on hard problems.

    4 - Catastrophic failure on exact computation ⚠️: Even when given explicit solution algorithms, reasoning models still collapse at the same complexity thresholds, revealing fundamental symbolic-manipulation limits and erratic performance across tasks, as shown by Claude 3.7 flawlessly handling ~100 Tower of Hanoi moves yet floundering after just four steps in the River Crossing puzzle.

    5 - Scaling compute is helpful, but not enough to close the reasoning gaps 🧠: Our findings challenge assumptions about LRM capabilities. Despite sophisticated self-reflection mechanisms from RL training, our results suggest that these models can't follow algorithm steps and, importantly, can't generalize algorithmic reasoning beyond certain complexity thresholds.

    #Paper: https://lnkd.in/g3XJC-cX

    Work done with my colleagues at Apple: Parshin Shojaee, keivan alizadeh vahid, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar

    #AI #MachineLearning #LLM #reasoning
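
    For intuition about how such traces can be checked, here is a small sketch of a puzzle simulator that replays a Tower of Hanoi move sequence and verifies every intermediate step. The (disk, from_peg, to_peg) trace format is an assumption for illustration, not the paper's exact setup.

```python
# Sketch of a puzzle simulator that replays a model's move-by-move trace and verifies
# each intermediate state, here for Tower of Hanoi.
def replay_hanoi(n_disks: int, moves: list[tuple[int, int, int]]) -> bool:
    pegs = [list(range(n_disks, 0, -1)), [], []]     # peg 0 holds disks n..1, largest first
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False                             # moving a disk that isn't on top of src
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                             # placing a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))    # solved iff all disks are on peg 2

# A correct 3-disk solution passes; truncate or corrupt it and the check fails.
solution = [(1, 0, 2), (2, 0, 1), (1, 2, 1), (3, 0, 2), (1, 1, 0), (2, 1, 2), (1, 0, 2)]
print(replay_hanoi(3, solution))   # True
```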

