You know all those claims that LLMs think like humans? Turns out it's not true 😱

In our new paper we put this to the test by checking whether LLMs form concepts the same way humans do. Do LLMs truly grasp concepts and meaning the way humans do, or is their success primarily rooted in sophisticated statistical pattern matching over vast datasets? We used classic cognitive experiments as benchmarks. What we found is surprising... 🧐

We used seminal datasets from cognitive psychology that mapped how humans actually categorize things like "birds" or "furniture" ('robin' as a typical bird). The nice thing about these datasets is that they are not crowdsourced; they are rigorous scientific benchmarks.

We tested 30+ LLMs (BERT, Llama, Gemma, Qwen, etc.) using an information-theoretic framework that measures the trade-off between:
- Compression (how efficiently you organize information)
- Meaning preservation (how much semantic detail you keep)

Finding #1: The Good News
LLMs DO form broad conceptual categories that align with humans significantly above chance. Surprisingly (or not?), smaller encoder models like BERT outperformed much larger models. Scale isn't everything!

Finding #2: The Catch
LLMs struggle with fine-grained semantic distinctions. They can't capture "typicality" - like knowing a robin is a more typical bird than a penguin. Their internal concept structure doesn't match human intuitions about category membership.

Finding #3: The Big Difference
Here's the kicker: LLMs and humans optimize for completely different things.
- LLMs: aggressive statistical compression (minimize redundancy)
- Humans: adaptive richness (preserve flexibility and context)

This explains why LLMs can be simultaneously impressive AND miss obvious human-like reasoning. They're not broken - they're just optimized for pattern matching rather than the rich, contextual understanding humans use.

What this means:
- Current scaling might not lead to human-like understanding
- We need architectures that balance compression with semantic richness
- The path to AGI ( 😅 ) might require rethinking optimization objectives

Our paper gives tools to measure this compression-meaning trade-off, which could guide future AI development toward more human-aligned conceptual representations. Cool to see cognitive psychology and AI research coming together!

Thanks to Chen Shani, Ph.D., who did all the work, and to Yann LeCun and Dan Jurafsky for their guidance.
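For readers who want a feel for what a compression-vs-meaning measurement can look like, here is a minimal Python sketch. It is not the paper's actual framework: the toy embeddings, the KMeans clustering step, and the entropy/adjusted-mutual-information proxies are illustrative assumptions only.

```python
# Hypothetical sketch of a compression-vs-meaning trade-off measure.
# The toy embeddings, the clustering step, and the scoring choices are
# illustrative assumptions, not the paper's actual framework.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)

# Toy stand-ins for model embeddings of items and their human category labels.
items = ["robin", "sparrow", "penguin", "chair", "table", "sofa"]
human_labels = [0, 0, 0, 1, 1, 1]               # bird vs. furniture
embeddings = rng.normal(size=(len(items), 16))  # replace with real model embeddings

def tradeoff(embeddings, human_labels, n_clusters):
    """Return (compression, meaning) proxies for one clustering granularity."""
    z = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    # Compression proxy: entropy of the cluster assignment (lower = more compressed).
    _, counts = np.unique(z, return_counts=True)
    p = counts / counts.sum()
    compression = -(p * np.log2(p)).sum()
    # Meaning proxy: agreement between clusters and human categories.
    meaning = adjusted_mutual_info_score(human_labels, z)
    return compression, meaning

for k in (2, 3):
    c, m = tradeoff(embeddings, human_labels, n_clusters=k)
    print(f"k={k}: compression(entropy)={c:.2f} bits, human alignment(AMI)={m:.2f}")
```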
Limitations of LLMs in Recognizing Non-Human Logic
Summary
Large language models (LLMs) are powerful tools for processing text, but they struggle to understand and reason with logic that isn’t built from human-like thinking patterns. These limitations stem from LLMs’ reliance on pattern matching and statistical shortcuts, which prevents them from truly grasping abstract concepts or handling complex reasoning, especially outside familiar human contexts.
- Expect pattern reliance: Remember that LLM outputs often reflect memorized patterns rather than deep logical reasoning, especially with novel or complex problems.
- Monitor for errors: Always review AI-generated answers, since LLMs can generate incorrect or misleading results when faced with intricate logic, evolving rules, or unfamiliar scenarios.
- Keep humans involved: Use LLMs as supportive tools in research or decision-making, but ensure human oversight for tasks that demand reliable reasoning or contextual understanding.
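As a concrete illustration of the "keep humans involved" point, here is a minimal sketch of a human-in-the-loop gate. The `call_llm` helper, the confidence score, and the escalation rule are hypothetical placeholders, not a specific product's API.

```python
# Minimal human-in-the-loop gate: a hypothetical sketch, not a specific product's API.
# `call_llm` and the confidence heuristic are placeholder assumptions.
from dataclasses import dataclass

@dataclass
class Draft:
    answer: str
    confidence: float  # however you estimate it: self-rating, log-probs, a verifier model

def call_llm(question: str) -> Draft:
    # Placeholder for a real model call plus a confidence estimate.
    return Draft(answer="<model answer>", confidence=0.55)

def answer_with_oversight(question: str, threshold: float = 0.8) -> str:
    draft = call_llm(question)
    if draft.confidence >= threshold:
        return draft.answer                      # low-stakes, pattern-friendly case
    # Anything uncertain, novel, or high-stakes goes to a person instead.
    return f"[NEEDS HUMAN REVIEW] question={question!r} draft={draft.answer!r}"

print(answer_with_oversight("Does clause 4.2 permit early termination?"))
```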
-
Most AI systems today rely on a single cognitive mechanism: procedural memory. That’s the kind of memory involved in learning repeatable patterns — how to ride a bike, follow a recipe, or autocomplete a sentence. It’s also the dominant architecture behind LLMs: self-attention over statistical embeddings. That explains a lot about LLM strengths as well as their failures.

LLMs do well in what psychologist Robin Hogarth called “kind” environments — stable, predictable domains where the same actions reliably lead to the same outcomes. But they tend to fail in “wicked” environments — settings where the rules shift, feedback is delayed, and the right answer depends on context that isn’t explicitly stated. In those environments, procedural strategies break down. Humans rely on other mechanisms instead: semantic memory for organizing abstract knowledge, associative learning for recognizing useful patterns, episodic memory for recalling prior experiences.

LLMs don’t have those. So they:
➡️ miss abstract relationships between ideas
➡️ fail to generalize across context
➡️ lose track of evolving goals
➡️ don’t build up any durable sense of what works and what doesn’t

This isn’t a matter of more data or better training. It’s an architectural limitation.

At Aampe, we’ve had to grapple with these gaps directly — because customer engagement is a wicked learning environment. That pushed us to move beyond purely procedural systems and build machinery that can form and adapt conceptual associations over time.

Working on these problems has made me uneasy about how singular LLM cognition really is. If one mechanism were enough, evolution wouldn't have given us several.
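To make the missing mechanisms a bit more concrete, here is a toy sketch of an episodic-memory layer an agent could consult before acting. It is purely illustrative (not Aampe's machinery); the Jaccard similarity and the data structures are assumptions for the sake of the example.

```python
# Toy episodic memory: store (context, action, outcome) and recall similar past episodes.
# Purely illustrative -- not Aampe's machinery; the similarity measure is an assumption.
from dataclasses import dataclass, field

@dataclass
class Episode:
    context: set[str]   # bag-of-features description of the situation
    action: str
    reward: float

@dataclass
class EpisodicMemory:
    episodes: list[Episode] = field(default_factory=list)

    def record(self, context: set[str], action: str, reward: float) -> None:
        self.episodes.append(Episode(context, action, reward))

    def recall(self, context: set[str], k: int = 3) -> list[Episode]:
        # Jaccard similarity as a crude stand-in for real retrieval.
        def sim(e: Episode) -> float:
            return len(e.context & context) / max(len(e.context | context), 1)
        return sorted(self.episodes, key=sim, reverse=True)[:k]

memory = EpisodicMemory()
memory.record({"weekend", "push_notification", "discount"}, action="send_offer", reward=1.0)
memory.record({"weekday", "email", "reminder"}, action="send_reminder", reward=0.2)
print(memory.recall({"weekend", "discount"}))
```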
-
Hallucination in large language models (LLMs) has been widely studied, but the key question remains: can it ever be eliminated? A recent paper systematically dismantles the idea that hallucination can be fully eradicated. Instead, it argues that hallucination is not just an incidental flaw but an inherent limitation of LLMs.

1️⃣ Hallucination is Unavoidable
The paper establishes that LLMs cannot learn all computable functions, meaning they will inevitably generate incorrect outputs. Even with perfect training data, LLMs cannot always produce factually correct responses due to inherent computational constraints. No matter how much we refine architectures, training data, or mitigation techniques, hallucination cannot be eliminated—only minimized.

2️⃣ Mathematical Proofs of Hallucination
They use concepts from learning theory and diagonalization arguments to prove that any LLM will fail on certain inputs. The research outlines that LLMs, even in their most optimized state, will hallucinate on infinitely many inputs when faced with complex, computation-heavy problems.

3️⃣ Identifying Hallucination-Prone Tasks
Certain problem types are guaranteed to trigger hallucinations due to their computational complexity:
🔹 NP-complete problems (e.g., Boolean satisfiability)
🔹 Presburger arithmetic (exponential complexity)
🔹 Logical reasoning and entailment (undecidable problems)
This means that asking LLMs to reason about intricate logic or mathematical problems will often lead to errors.

4️⃣ Why More Data and Bigger Models Won’t Fix It
A common assumption is that hallucination can be mitigated by scaling—adding more parameters or training data. The paper challenges this notion: while larger models improve accuracy, they do not eliminate hallucination for complex, unsolvable problems.

5️⃣ Mitigation Strategies and Their Limitations
Various techniques have been introduced to reduce hallucinations, but none can completely eliminate them:
✅ Retrieval-Augmented Generation (RAG) – helps provide factual grounding but does not guarantee accuracy.
✅ Chain-of-Thought Prompting – improves reasoning but does not fix fundamental hallucination limits.
✅ Guardrails & External Tools – can reduce risk but require human oversight.
They suggest LLMs should never be used for fully autonomous decision-making in safety-critical applications.

The Bigger Question: How Do We Build Safe AI?
If hallucination is an unavoidable reality of LLMs, how do we ensure safe deployment? The research makes it clear: LLMs should not be blindly trusted. They should be integrated into workflows with:
🔹 Human in the loop
🔹 External fact-checking systems
🔹 Strict guidelines

Are we designing AI with realistic expectations, or are we setting ourselves up for failure by expecting perfection? Should LLMs be used in high-stakes environments despite their hallucinations, or should we rethink their applications?

#ai #artificialintelligence #technology
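In the spirit of the mitigation list above, here is a minimal retrieval-grounded answering sketch with an explicit refusal path. The keyword retriever, the `generate` placeholder, and the grounding check are assumptions for illustration; as the paper argues, this narrows but does not close the hallucination gap.

```python
# Minimal retrieval-grounded answering with an explicit refusal path.
# The retriever, `generate`, and the grounding check are placeholder assumptions.
def retrieve(question: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    # Crude keyword overlap in place of a real vector search.
    words = set(question.lower().split())
    scored = [(len(words & set(text.lower().split())), text) for text in corpus.values()]
    scored = [(s, t) for s, t in scored if s > 0]          # drop passages with no overlap
    return [t for _, t in sorted(scored, reverse=True)[:k]]

def generate(question: str, passages: list[str]) -> str:
    # Placeholder for an LLM call conditioned on the retrieved passages.
    return f"Answer to {question!r} based on {len(passages)} passage(s)."

def answer(question: str, corpus: dict[str, str]) -> str:
    passages = retrieve(question, corpus)
    if not passages:
        # Refuse rather than guess: grounding reduces, but does not eliminate, hallucination.
        return "I don't have enough grounded context to answer that reliably."
    return generate(question, passages)

corpus = {"doc1": "The refund policy allows returns within 30 days.",
          "doc2": "Premium support is available on weekdays."}
print(answer("What is the refund window?", corpus))
print(answer("Quarterly revenue figures?", corpus))  # no grounding -> refusal
```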
-
DON’T rely on AI to do your research…

Large language models (LLMs) are often praised for their ability to process information and assist with problem-solving, but can they really reason like we do? The latest study by Apple researchers reveals significant limitations in their capacity for genuine mathematical reasoning - and raises important questions about their reliability in research contexts.

What Apple Found:
1. Inconsistent results: LLMs struggle with variations of the same problem, even at a basic grade-school math level. This variability challenges the validity of current benchmarks like GSM8K, which rely on single-point accuracy metrics.
2. Fragility to complexity: As questions become slightly more challenging, performance drops drastically, exposing a fragile reasoning process.
3. Susceptibility to irrelevant information: When distracting but inconsequential details were included in problems, model performance plummeted by up to 65%. Even repeated exposure to similar questions or fine-tuning couldn’t fix this.
4. Pattern matching ≠ reasoning: The models often “solve” problems by sophisticated pattern matching, not genuine logical understanding.

What this means for research: While LLMs are powerful tools for speeding up certain tasks, their inability to discern critical from irrelevant information, and their reliance on pattern recognition, make them unreliable for rigorous, logic-based research. This is particularly true in fields like mathematics, engineering, and data-driven sciences, where accuracy and reasoning are non-negotiable.

As exciting as these tools are, they’re not ready to replace human critical thinking (yet?). How do you see AI evolving in research applications?

#research #chemicalengineering #scientist #engineering #professor

PS. The full paper is available on arXiv under 2410.05229.
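Below is a rough sketch of the kind of perturbation testing described, assuming a simple question template, an irrelevant numeric detail, and a toy "model" that naively sums every number it sees. None of this is the paper's released benchmark, but it shows why distracting details trip up pattern matchers.

```python
# Sketch of perturbation testing: the same grade-school problem with varied
# names/numbers and an irrelevant numeric detail added.
# The template, the toy "model", and the scoring are illustrative assumptions.
import random
import re

TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "{distractor}How many apples does {name} pick in total?")

def make_variant(seed: int, with_distractor: bool) -> tuple[str, int]:
    rng = random.Random(seed)
    name = rng.choice(["Ava", "Liam", "Noor", "Kenji"])
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    distractor = "3 of the apples are unusually small. " if with_distractor else ""
    return TEMPLATE.format(name=name, a=a, b=b, distractor=distractor), a + b

def toy_model(question: str) -> int:
    # Stand-in for an LLM that pattern-matches "add up the numbers you see".
    return sum(int(n) for n in re.findall(r"\d+", question))

def accuracy(with_distractor: bool, n: int = 200) -> float:
    hits = sum(toy_model(q) == gold
               for q, gold in (make_variant(s, with_distractor) for s in range(n)))
    return hits / n

print("clean:", accuracy(False), " with irrelevant detail:", accuracy(True))
```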
-
The Illusion of Reasoning: A Timely Reflection from Apple’s Latest AI Research

Apple recently published a white paper titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models.” As the AI community accelerates its focus on agentic AI and reasoning models, Apple’s study offers a timely reality check.

What did they do?
Apple rigorously tested several state-of-the-art reasoning-optimized language models using classic logic puzzles like Tower of Hanoi and River Crossing—structured problems that probe genuine reasoning ability.

What did they find?
• Models can solve simple problems reliably, but as complexity increases, performance drops sharply.
• Minor changes (renaming puzzle elements, altering phrasing) can cause drastic performance shifts—highlighting a reliance on memorized patterns rather than true logical understanding.
• Trace analysis shows that current models simulate reasoning by leveraging pattern matching, not structured logic.

💡 Key Insight: While today’s models can appear to reason well under certain conditions, much of their success stems from surface-level pattern recognition—not deep, generalizable reasoning.

Conclusion: This work reminds us that although current LLMs are powerful language processors, they remain far from achieving robust, transparent reasoning. As we build AI agents designed to assist in real-world decision-making, we must tread carefully—understanding both the capabilities and limitations of today’s models. In short, humility and rigor must accompany progress. Apple’s contribution is a welcome call for both.

If you work with LLMs or agentic AI—how are you addressing reasoning robustness in your systems? Would love to hear your thoughts!

#AI #ReasoningModels #AppleResearch #LLM #ResponsibleAI #engineeringtidbits
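If you want to grade puzzle outputs yourself, a small verifier goes a long way. Below is a sketch of a Tower of Hanoi move checker; the `(src, dst)` move format is an assumption about how a model's answer might be parsed.

```python
# A small checker for Tower of Hanoi move sequences, the kind of harness one might
# use to grade model outputs on these puzzles. The (src, dst) move format is an assumption.
def check_hanoi(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Return True if `moves` legally transfers all disks from peg 0 to peg 2."""
    pegs = {0: list(range(n_disks, 0, -1)), 1: [], 2: []}  # tops are at the end
    for src, dst in moves:
        if not pegs[src]:
            return False                      # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                      # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))

# Example: the optimal 3-disk solution (7 moves) passes the check.
solution_3 = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(check_hanoi(3, solution_3))  # True
```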
-
🤖 Why LLMs just pretend to think

Two new papers challenge the hype around LLM "reasoning" capabilities, combining theoretical analysis with empirical testing to show that even the most advanced models like o1 and Claude are sophisticated imitators rather than genuine thinkers.

→ CoT is just fancy pattern matching: Shao and Cheng [1] argue that Chain-of-Thought works by activating training patterns that look like reasoning steps, but models aren't actually manipulating abstract concepts or deriving rules from first principles.
→ Empirical evidence confirms the imitation hypothesis: Shojaee and Mirzadeh's [2] controlled experiments show that reasoning models hit a complexity wall and bizarrely reduce their thinking effort as problems get harder (exactly what you'd expect from pattern matchers, not true reasoners).
→ Three distinct regimes expose the illusion: Simple problems: regular LLMs beat reasoning models (overthinking hurts). Medium problems: reasoning models shine (patterns help). Complex problems: everything collapses to zero accuracy (no relevant patterns exist).
→ Even with cheat codes, imitation fails: When researchers provided the exact Tower of Hanoi algorithm, models still collapsed at the same complexity point, proving they can't execute logical steps reliably. They're just really good at mimicking reasoning format.
→ Current evaluation methods miss the real problem: Both papers show that focusing on final-answer correctness creates an "illusion of reasoning". Models generate plausible-looking steps that are actually sophisticated imitations of reasoning schemata from training data.

I think these papers are important to probe the limitations of current reasoning techniques and inspire new methods. The theoretical framework (CoT as constrained imitation) neatly explains the empirical findings (complexity collapse and reduced effort). It's nice to see some papers pushing back on the reasoning hype to stay grounded. :)
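For reference, the "exact Tower of Hanoi algorithm" mentioned above fits in a few lines of standard recursion; executing it faithfully is trivial for a program, which is exactly the contrast the papers draw. The peg labels below are arbitrary.

```python
# The classic recursive Tower of Hanoi algorithm the post alludes to.
# Executing it faithfully for large N is trivial for a program, which is the point.
def hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    if n == 0:
        return []
    return (hanoi(n - 1, src, dst, aux)   # move n-1 disks out of the way
            + [(src, dst)]                # move the largest disk
            + hanoi(n - 1, aux, src, dst))  # move the n-1 disks onto it

print(hanoi(3))          # 7 moves: [('A','C'), ('A','B'), ('C','B'), ('A','C'), ('B','A'), ('B','C'), ('A','C')]
print(len(hanoi(10)))    # 1023 moves, i.e. 2**10 - 1
```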
-
Lesson 2: Let LLMs Talk, Let Engines Think — The Right Way to Handle Math and Logic in GenAI Apps

One of the most common misconceptions in LLM-based applications is this: “If it can write code and summarize reports, surely it can handle math too… right?” Not exactly — especially if you’re using non-reasoning models (base or instruction-tuned LLMs without tools or plug-ins).

Try this: ask an LLM to multiply 123456789 by 12.3456778. You’ll likely get a confident, well-worded answer — but it may be wrong. Or ask it to count how many times the word “client” appears in a long paragraph. Again, it may sound right — but be off by a few.

Why? Because LLMs don’t do arithmetic, logic, or counting the way structured systems do. They don’t execute, they generate. When you ask for calculations, comparisons, or filtering, you’re relying on structured logic — rules that require precise, step-by-step execution (like those in SQL, Python, or Excel). LLMs don’t follow those rules. They predict based on patterns they’ve seen in text.

That means:
• Math is unreliable
• Counting can be inconsistent
• Filtering, sorting, or aggregating can silently fail
• Even well-written answers may be wrong

But this isn’t a flaw — it’s a limitation we can design around. LLMs are exceptional at understanding and generating language. Let them handle user intent and communication — not computation.

Here’s the right pattern, aka Natural Language Query (NLQ):
• LLM interprets natural language
• Analytical engine performs the logic
• LLM communicates the results

Even then — don’t assume the answer is always correct. We still need robust validation layers to verify accuracy before presenting results. I’ll dive into validation in later lessons.

#LLM #GenAI #EnterpriseAI #ResponsibleAI #NaturalLanguageQuery #AIEngineering #LLMApplications #LLMRealityCheck #DataValidation
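Here is a minimal sketch of that NLQ pattern. The `interpret` and `communicate` steps stand in for LLM calls (hard-coded here for the example from the post); only the deterministic engine does the arithmetic.

```python
# Sketch of the NLQ pattern described above: the "LLM" steps are placeholders that
# would normally emit/phrase structured requests; the arithmetic is done by the engine.
from decimal import Decimal

def interpret(question: str) -> dict:
    # Placeholder for an LLM call that maps free text to a structured request.
    # Hard-coded here to the example from the post; a real system would parse intent.
    return {"op": "multiply", "args": ["123456789", "12.3456778"]}

def execute(request: dict) -> Decimal:
    # Deterministic engine: exact, rule-following computation the LLM cannot guarantee.
    if request["op"] == "multiply":
        a, b = (Decimal(x) for x in request["args"])
        return a * b
    raise ValueError(f"unsupported operation: {request['op']}")

def communicate(question: str, result: Decimal) -> str:
    # Placeholder for a second LLM call that phrases the verified result for the user.
    return f"The result of your request ({question}) is {result}."

question = "Multiply 123456789 by 12.3456778"
print(communicate(question, execute(interpret(question))))
```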
-
So, there was a viral paper by Apple researchers saying LLMs cannot "reason." One of its observations is: "Note that this model also achieves near-perfect accuracy when solving the Tower of Hanoi with (N=5), which requires 31 moves, while it fails to solve the River Crossing puzzle when (N=3), which has a solution of 11 moves. This likely suggests that examples of River Crossing with N>2 are scarce on the web, meaning LRMs may not have frequently encountered or memorized such instances during training."

This is what we already knew. LLMs learn by browsing text found online, and they try to guess how to best autocomplete text based on it. In addition, an army of human annotators writes down thousands of additional examples of how to solve common problems and how to give appropriate answers to inappropriate questions, which is further fed into the learning process. If a problem hasn't been spoken about much (either online or in the hand-written data), then the LLM will struggle to solve it.

I've been saying that this is AI's main limitation for about eleven years now! In my book Smart Until It's Dumb, published in early 2023, I share an example of AI failing to identify a gigantic cow in a picture. The reason is that the cow is on the beach in this specific picture, whereas in most pictures cows are standing on grass, not sand.

Nobody knows how to solve this limitation. Additional data is becoming scarce (they've pretty much scraped the whole internet already), and hand-written data cannot be generated in a scalable way and will only cover well-known, specific problems.

The implications of this are far-reaching. A great number of AI startups that promise to "do the job for you" will fail because their products will be unreliable. In order to use AI effectively and build good AI-based products, we need to acknowledge AI's limitations, not ignore them.
-
A debate is quietly reshaping how we think about reasoning in LLMs, and it has real implications for how we build AI systems today.

In 𝗧𝗵𝗲 𝗜𝗹𝗹𝘂𝘀𝗶𝗼𝗻 𝗼𝗳 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴, recently published by Apple, researchers tested reasoning-augmented LLMs on structured problems like Tower of Hanoi, River Crossing, and Blocks World. The results were sharp. As task complexity increased, even models trained for reasoning began to fail. Performance dropped, not just in output quality, but in the effort models applied to thinking. The conclusion: reasoning in LLMs may appear to exist on the surface, but collapses when deeper, compositional logic is required. They argue that we should not mistake verbal fluency for true reasoning capability.

A recent response, 𝗧𝗵𝗲 𝗜𝗹𝗹𝘂𝘀𝗶𝗼𝗻 𝗼𝗳 𝘁𝗵𝗲 𝗜𝗹𝗹𝘂𝘀𝗶𝗼𝗻 𝗼𝗳 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴, offers a different angle. The authors do not dispute that models fail on some of these tasks. But they show that many of those failures are a result of poor task design. Some models were asked to generate outputs that exceeded their token limits. Others were penalized for correctly stating that a task had no solution. When tasks were reframed more realistically, such as asking the model to generate an algorithm instead of every step, models performed well. Their conclusion is that what looks like reasoning failure is often a mismatch between evaluation expectations and what the model is actually being asked to do.

Taken together, these papers provide a much-needed framework for thinking about when LLMs and reasoning-focused models (LRMs) are useful and where they are not.

For simple tasks like summarization, retrieval, or classification, classic LLMs work well. They are fast, general, and effective. Adding reasoning often adds cost and confusion without improving performance.

For medium-complexity tasks like applying policy logic, referencing context, or handling multi-turn interactions, LRMs offer clear value. Their planning ability, when structured well, improves accuracy and consistency.

For complex tasks like symbolic reasoning, recursive planning, or solving puzzles with deep constraints, both LLMs and LRMs fail more often than they succeed. They either give up early, apply shallow logic, or lose coherence midway. These tasks require additional architecture: modular agents, memory-aware execution, or fallback control.

Take contact center automation as an example. For routine account questions, classic LLMs may suffice. For dynamic policy explanation or billing disputes, LRMs can help. For high-stakes calls involving eligibility, compliance, or contract negotiation, more structure is required. But this is just one example.

The bigger lesson is this: we should stop assuming reasoning scales cleanly with model size or prompt complexity. It does not. Reasoning has limits, and those limits depend on how we frame the task, what we ask the model to output, and how we measure success.
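One rough way to operationalize this tiering is a complexity-based router. The tier labels, keyword heuristic, and handler names below are illustrative placeholders, not recommendations of specific models.

```python
# Rough sketch of complexity-based routing in the spirit of the tiers above.
# The tier labels, handler names, and escalation rule are illustrative placeholders.
from enum import Enum, auto
from typing import Callable

class Tier(Enum):
    SIMPLE = auto()    # summarization, retrieval, classification
    MEDIUM = auto()    # policy logic, multi-turn context
    COMPLEX = auto()   # deep constraints, compliance, negotiation

def classify(task: str) -> Tier:
    # Placeholder heuristic; a real router might use a classifier or explicit metadata.
    if any(k in task.lower() for k in ("compliance", "contract", "eligibility")):
        return Tier.COMPLEX
    if any(k in task.lower() for k in ("policy", "dispute", "billing")):
        return Tier.MEDIUM
    return Tier.SIMPLE

HANDLERS: dict[Tier, Callable[[str], str]] = {
    Tier.SIMPLE: lambda t: f"[classic LLM] {t}",
    Tier.MEDIUM: lambda t: f"[reasoning model] {t}",
    Tier.COMPLEX: lambda t: f"[structured agent + human review] {t}",
}

def route(task: str) -> str:
    return HANDLERS[classify(task)](task)

for task in ("Summarize this call transcript",
             "Explain the billing dispute policy",
             "Assess contract eligibility for early termination"):
    print(route(task))
```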