How much do language models actually think? A recent paper from Apple, 'The Illusion of Thinking', explores this question by probing the limits of Large Reasoning Models (LRMs) such as Claude 3.7 Sonnet Thinking and DeepSeek-R1. These models aim to improve reasoning by generating long Chain-of-Thought (CoT) traces before producing an answer. Instead of relying on traditional math benchmarks, the authors designed controlled puzzle environments (like Tower of Hanoi and River Crossing) that allow them to systematically vary problem complexity and analyze model behavior step by step.

Key takeaways from the paper:
🔹 Three performance regimes:
→ At low complexity: non-thinking models often outperform LRMs in both accuracy and token efficiency.
→ At medium complexity: LRMs show benefits thanks to more elaborate reasoning traces.
→ At high complexity: both model types collapse (accuracy drops to zero).
🔹 As problems grow more complex, models actually use fewer thinking tokens despite having sufficient token budget, which highlights a possible inference-time scaling limitation.
🔹 On simple tasks, models often reach the correct solution early but then continue generating incorrect or redundant reasoning.
🔹 Even when the correct algorithm is provided in the prompt, models still fail at execution as complexity increases.

The authors raise an important question: are today's LRMs truly engaging in reasoning, or just producing more elaborate pattern completions?

You can read the paper here: https://lnkd.in/dn3GTT66
The image used in the post is taken from the same paper.

Curious to hear your take, especially if you work on reasoning, interpretability, or evaluation design.

#technology #generativeai #artificialintelligence #llms #innovation
Limits of LLMs in Creative Problem Solving
Summary
Large language models (LLMs) and their specialized reasoning variants show impressive performance on basic tasks, but recent research reveals major limitations when it comes to creative problem-solving. These systems often struggle with complex reasoning, generalizing solutions, and adapting to new challenges that go beyond simple pattern matching.
- Recognize boundaries: Be aware that LLMs can solve straightforward problems but tend to falter when task complexity rises or when asked to think creatively.
- Check for generalization: Understand that these models often rely on memorized patterns and may not adapt well to unfamiliar situations or new problem formats.
- Mind computational limits: Notice that increasing the amount of data or computational power alone does not guarantee deeper reasoning or improved problem-solving skills in language models.
-
The Illusion of Reasoning: A Timely Reflection from Apple’s Latest AI Research

Apple recently published a white paper titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models.” As the AI community accelerates its focus on agentic AI and reasoning models, Apple’s study offers a timely reality check.

What did they do? Apple rigorously tested several state-of-the-art reasoning-optimized large language models (LLMs) using classic logic puzzles like Tower of Hanoi and River Crossing—structured problems that probe genuine reasoning ability.

What did they find?
• Models can solve simple problems reliably, but as complexity increases, performance drops sharply.
• Minor changes (renaming puzzle elements, altering phrasing) can cause drastic performance shifts—highlighting a reliance on memorized patterns rather than true logical understanding.
• Trace analysis shows that current models simulate reasoning by leveraging pattern-matching, not structured logic.

💡 Key Insight: While today’s models can appear to reason well under certain conditions, much of their success stems from surface-level pattern recognition—not deep, generalizable reasoning.

Conclusion: This work reminds us that although current LLMs are powerful language processors, they remain far from achieving robust, transparent reasoning. As we build AI agents designed to assist in real-world decision-making, we must tread carefully—understanding both the capabilities and limitations of today’s models. In short, humility and rigor must accompany progress. Apple’s contribution is a welcome call for both.

If you work with LLMs or agentic AI—how are you addressing reasoning robustness in your systems? Would love to hear your thoughts!

#AI #ReasoningModels #AppleResearch #LLM #ResponsibleAI #engineeringtidbits
-
Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks, but we found their fundamental limitations are more severe than expected. In our latest work, we compared each “thinking” LRM with its “non-thinking” LLM twin. Unlike most prior works that only measure the final performance, we analyzed their actual reasoning traces—looking inside their long "thoughts". Our study reveals several interesting results:

1 - Three distinct performance regimes 🟡🔵🔴: Under equal inference-compute budgets, standard LLMs outperform reasoning models on low-complexity tasks, reasoning models excel at medium complexity, and both collapse to zero accuracy on high-complexity tasks.

2 - Counterintuitive scaling limits 🔄: As problems get more difficult, reasoning models initially think more (good!) but then START THINKING LESS despite having plenty of token budget left. They give up right when they should work harder!

3 - Looking inside the "thoughts" 🔍: By replaying every intermediate move in our simulators, our puzzle setup shows that LRMs find answers early but then "overthink" on simple problems, eventually reach correct solutions only after exploring wrong paths on medium problems, and completely fail on hard problems.

4 - Catastrophic failure on exact computation ⚠️: Even when given explicit solution algorithms, reasoning models still collapse at the same complexity thresholds—revealing fundamental symbolic manipulation limits and erratic performance across tasks, as shown by Claude 3.7 flawlessly handling ~100 Tower of Hanoi moves yet floundering after just four steps in the River Crossing puzzle.

5 - Scaling compute is helpful, but not enough to close the reasoning gaps 🧠: Our findings challenge assumptions about LRM capabilities. Despite sophisticated self-reflection mechanisms from RL training, our results suggest that these models can't follow algorithm steps and importantly can't generalize algorithmic reasoning beyond certain complexity thresholds.

#Paper: https://lnkd.in/g3XJC-cX
Work done with my colleagues at Apple: Parshin Shojaee, keivan alizadeh vahid, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar

#AI #MachineLearning #LLM #reasoning
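Point 3 above describes replaying every intermediate move in a simulator. As a rough illustration of what such a check could look like (my own Python sketch, not the authors' code), a Tower of Hanoi validator only needs to track the contents of each peg and flag the first move that takes a disk from an empty peg or places a larger disk on a smaller one:

```python
def replay_hanoi(moves, n_disks):
    """Replay a list of (source, target) peg moves and report the first invalid one.

    Pegs are 'A', 'B', 'C'; disks are integers, where a larger number means a larger disk.
    Returns (is_solved, index_of_first_invalid_move_or_None).
    """
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:                       # no disk to move from this peg
            return False, i
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:  # larger disk placed on a smaller one
            return False, i
        pegs[dst].append(pegs[src].pop())
    return len(pegs["C"]) == n_disks, None
```

Feeding a model's proposed move list through a checker like this is what makes move-by-move analysis possible: you learn not just whether the answer is wrong, but exactly where the trace first goes wrong.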
-
-
🚨 BIG! Apple's latest paper argues that Large Reasoning Models have significant limitations and COLLAPSE at high complexity. Has AI hit a wall? Was it AI hype all the time? 😱

What the paper says: "Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds. We identified three distinct reasoning regimes: standard LLMs outperform LRMs at low complexity, LRMs excel at moderate complexity, and both collapse at high complexity. Particularly concerning is the counterintuitive reduction in reasoning effort as problems approach critical complexity, suggesting an inherent compute scaling limit in LRMs. Our detailed analysis of reasoning traces further exposed complexity-dependent reasoning patterns, from inefficient “overthinking” on simpler problems to complete failure on complex ones. These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning.

Finally, we presented some surprising results on LRMs that lead to several open questions for future work. Most notably, we observed their limitations in performing exact computation; for example, when we provided the solution algorithm for the Tower of Hanoi to the models, their performance on this puzzle did not improve. Moreover, investigating the first failure move of the models revealed surprising behaviors. For instance, they could perform up to 100 correct moves in the Tower of Hanoi but fail to provide more than 5 correct moves in the River Crossing puzzle. We believe our results can pave the way for future investigations into the reasoning capabilities of these systems."

👉 Read the paper below.
👉 Never miss my updates and recommended papers: join my newsletter's 63,400+ subscribers.
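For readers unsure what "the solution algorithm for the Tower of Hanoi" refers to: the puzzle has a standard recursive solution that enumerates every move explicitly. A minimal Python sketch of that recursion (mine, not the paper's prompt text, which presumably states the same procedure in words or pseudocode):

```python
def hanoi_moves(n, source="A", target="C", auxiliary="B"):
    """Return the full move list for an n-disk Tower of Hanoi instance.

    The classic recursion: move n-1 disks to the spare peg, move the largest
    disk to the target, then move the n-1 disks on top of it.
    """
    if n == 0:
        return []
    return (hanoi_moves(n - 1, source, auxiliary, target)
            + [(source, target)]
            + hanoi_moves(n - 1, auxiliary, source, target))

print(len(hanoi_moves(8)))   # 255 moves: the count is always 2**n - 1
```

Because the solution needs exactly 2**n - 1 moves, even a modest disk count demands hundreds of precise steps, and that is the regime where the paper reports execution breaking down even with the algorithm supplied.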
-
The Illusion of Thinking in LLMs - Apple researchers have spilled the beans on the strengths and limitations of reasoning models. Reasoning models "collapse" beyond certain task complexities.

"The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" highlights several limitations of Large Language Models (LLMs) and their specialized variants, Large Reasoning Models (LRMs), particularly in the context of reasoning and problem-solving. Below is a list of the key limitations of LLMs identified by Apple researchers:

(1) Poor Performance on Reasoning Benchmarks: Earlier iterations of LLMs exhibited poor performance on reasoning benchmarks, indicating fundamental challenges in reasoning capabilities (Page 4, Section 2).
(2) Lack of Generalizable Reasoning: Despite advancements, LLMs and LRMs fail to develop generalizable problem-solving capabilities, especially for planning tasks. Performance collapses to zero beyond certain complexity thresholds in controlled puzzle environments (Page 3, Section 1; Page 11, Section 5).
(3) Data Contamination Issues: Established mathematical and coding benchmarks suffer from data contamination, where models may have been exposed to similar problems during training, skewing performance evaluations (Page 2, Section 1; Page 5, Section 3).
(4) Inefficiency in Low-Complexity Tasks: For simpler, low-compositional problems, standard LLMs demonstrate greater efficiency and accuracy compared to LRMs, suggesting that additional "thinking" mechanisms in LRMs may introduce unnecessary overhead (Page 3, Section 1; Page 7, Section 4.2.1).
(5) Complete Collapse at High Complexity: Both LLMs and LRMs experience complete performance collapse when problem complexity exceeds a critical threshold, indicating a fundamental limitation in handling highly complex, compositionally deep tasks (Page 3, Section 1; Page 8, Section 4.2.2).
(6) Counterintuitive Scaling Limitation: LRMs reduce their reasoning effort (measured by inference-time tokens) as problem complexity increases beyond a certain point, despite having ample token budgets, revealing a scaling limitation in reasoning capabilities (Page 3, Section 1; Page 8, Section 4.2.2).
(7) Overthinking Phenomenon: In simpler problems, LLMs and LRMs often identify correct solutions early but continue exploring incorrect alternatives, wasting computational resources in an "overthinking" pattern (Page 3, Section 1; Page 9, Section 4.3).
-
There's a very important study on LLMs by Apple AI that's been going around -- everyone is reacting to it, but it's important to understand what they actually found. What they did is: they tested LRMs on complex puzzles to reveal limitations in reasoning capabilities.

- Apple designed controlled puzzle simulations to assess how LLMs tackle problem complexity. They used games like Tower of Hanoi and River Crossing to push models beyond familiar tasks.
- They tested popular models like Claude Thinking, DeepSeek-R1, and o3-mini on puzzles they hadn’t encountered before, focusing on their performance as problems got tougher.
- As puzzle complexity increased, results showed that these "reasoning" models dropped to 0% accuracy, regardless of additional computing power or resources.
- Interestingly, when given precise solution algorithms, models still struggled with complex tasks, indicating an inability to follow instructions consistently under pressure.
- The study identified three performance regimes: in low complexity, regular models performed better; medium complexity gave some edge to reasoning models; but high complexity led to total breakdowns.
- The research highlighted that models excel at sophisticated pattern matching but fail when patterns become overly complex, suggesting they aren't genuinely reasoning.

This study challenges the idea that we're close to AGI and suggests that current reasoning methods might have fundamental limitations that can't just be solved with more data or compute.

Link to the paper: https://lnkd.in/gEwnujsB

#AI #LLMs #GenAI
-
Apple’s machine learning team just released a paper that takes aim at one of the core assumptions behind Chain-of-Thought (CoT) prompting—a technique used to help large language models (LLMs) “think out loud” to solve complex problems.

What they found? Many CoT-based models collapse when applied to complex reasoning tasks like the advanced levels of Tower of Hanoi (e.g., with more than 8 disks to place), despite performing well on traditional benchmarks. Why? Because these tasks go well beyond the narrow prompting examples used during fine-tuning and require longer sequences of precise reasoning than a CoT model can handle.

An interesting observation from the paper is that, for simple cases, raw LLMs actually perform slightly better than LRMs, though LRMs significantly outperform raw LLMs on medium-level cases. This indicates that if we can decompose a long, difficult reasoning task into several medium-level tasks, we can still make the best use of existing LRMs—and if we can decompose it further into many simple-level tasks, a standard LLM would be even better than an LRM. Considering that LRM responses are usually much longer than those of standard LLMs (LRMs need to generate their reasoning process explicitly), we would not only be solving the problem better, but also at lower cost.

What does this mean for users? If you’ve been relying on a single model to handle multi-step reasoning—like planning, logic puzzles, or simulations—this paper suggests you might want to rethink your approach.

Here’s my take:
- While I’ve always been skeptical of CoT-style large reasoning models (LRMs), I don’t think we should write them off completely. They’re specialists—and they can outperform on tough tasks like coding or niche benchmarks. But they are constrained by an inherent imprecision that emerges as tasks scale.
- For broader, more general-purpose use cases, LLMs paired with multi-agent systems are a more robust path forward. Instead of pushing a single model to its limits, we can distribute reasoning across agents—each focused, each efficient—working together to scale intelligence more reliably.

Worth a read: Apple’s study via The Guardian: https://lnkd.in/gEq2hYhK

Cognizant, Xin Qiu, Elliot Meyerson
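The decomposition idea in the post above maps neatly onto Tower of Hanoi itself, since an n-disk instance splits into two (n-1)-disk subproblems plus one move. A hypothetical Python sketch of that framing, where solve_subtask is a placeholder for whatever model call (LRM, LLM, or agent) you would delegate small instances to:

```python
def solve_hanoi_by_decomposition(n, source, target, auxiliary, solve_subtask):
    """Split an n-disk task into smaller subtasks and delegate each one.

    `solve_subtask(n, source, target, auxiliary)` is a hypothetical callable
    (e.g., a model call) that returns a move list for instances it can
    handle reliably. MANAGEABLE is an assumed reliability threshold.
    """
    MANAGEABLE = 5
    if n <= MANAGEABLE:
        return solve_subtask(n, source, target, auxiliary)
    # Delegate the two smaller subproblems and stitch the results together.
    return (solve_hanoi_by_decomposition(n - 1, source, auxiliary, target, solve_subtask)
            + [(source, target)]
            + solve_hanoi_by_decomposition(n - 1, auxiliary, source, target, solve_subtask))
```

For Tower of Hanoi the decomposition is essentially the solution algorithm itself, so the harder question the post raises is whether comparably clean splits exist for messier real-world tasks.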
-
The Illusion of Thinking — a recent paper from Apple researchers — is a must-read if you're building or deploying reasoning-capable LLMs. The team conducted rigorous experiments using controllable puzzle environments (e.g., Tower of Hanoi, Blocks World) to measure how well Large Reasoning Models (LRMs) actually "think"—beyond benchmark accuracy.

Key insights:
• Three distinct regimes emerge:
— At low complexity, standard LLMs outperform LRMs.
— At medium complexity, structured reasoning gives LRMs an edge.
— At high complexity, both collapse—accuracy drops to zero.
• Surprisingly, as task complexity increases, LRMs reduce reasoning effort, even with ample compute budget—suggesting a fundamental inference-time scaling limit.
• Even when given explicit algorithms, models struggle to execute them—exposing weaknesses in symbolic manipulation and generalization.

This study reframes how we think about "reasoning" in LLMs. It challenges assumptions that more tokens or longer CoT traces imply better thinking—and urges us to focus on robust reasoning, not just reasoning-looking behavior. As we push generative AI into complex decision-making domains, understanding these limitations is not academic—it's essential.

#AI #LLMs #Reasoning #ChainOfThought #AIResearch #GenerativeAI #LLMScaling #AppleAI
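If you want to check the reduced-effort pattern in your own evaluations, a sweep over problem sizes that logs both accuracy and reasoning-token usage is enough. In the sketch below, generate_with_reasoning, make_puzzle_prompt, and check_answer are all hypothetical placeholders for your model wrapper, prompt builder, and answer checker:

```python
def sweep_complexity(generate_with_reasoning, make_puzzle_prompt, check_answer, sizes):
    """Record accuracy and reasoning-token usage as puzzle size grows.

    All three callables are assumed to be supplied by the caller:
    - generate_with_reasoning(prompt) -> (answer_text, reasoning_token_count)
    - make_puzzle_prompt(n) -> prompt string for a size-n instance
    - check_answer(n, answer_text) -> bool
    """
    results = []
    for n in sizes:
        prompt = make_puzzle_prompt(n)
        answer, reasoning_tokens = generate_with_reasoning(prompt)
        results.append({
            "size": n,
            "correct": check_answer(n, answer),
            "reasoning_tokens": reasoning_tokens,
        })
    return results

# The counterintuitive pattern described in the paper would show up here as
# reasoning_tokens rising with size and then dropping off right around the
# sizes where accuracy collapses.
```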
-
Apple report: LLMs don’t reason. They autocomplete based on probability.

Apple released a report called "The Illusion of Thinking" that showed how LLMs (large language models) failed at solving new logic puzzles. They acted like students who memorized years of example tests & answers, but they were not able to solve new problems.

Does that mean LLMs are not useful? Not at all. They are amazing tools. But if you're like me, you've been using AI tools like ChatGPT in your daily work, and you're starting to see their limits.

I've come to understand LLMs as giant thought-matching engines. It's like a Wikipedia with a conversational interface. Yes, it can write me a book, but only because it can match my prompt with related patterns of sentences that it has already memorized.

Are you seeing AI hitting any limits lately?

Here's a link to the report: https://lnkd.in/djY_ZRW6