I spend a lot of time with technical founders building AI companies. Many assume that if we just make models bigger and feed them more data, we'll eventually reach true intelligence. I see a different reality: the fundamental limits of transformer architecture run deeper than most founders realize.

Transformer models face three architectural barriers that no amount of scale can solve:

1️⃣ The Edge Case Wall
Take autonomous vehicles: every time you think you've handled all scenarios, reality throws a new one: a child chasing a ball, construction patterns you've never seen, extreme weather conditions. The architecture itself can't generalize to truly novel situations, no matter how much data you feed it.

2️⃣ The Pattern Matching Trap
Our portfolio companies building enterprise AI tools hit this constantly. Current models can mimic patterns brilliantly but struggle to reason about new scenarios. It's like having a highly skilled copywriter who can't generate original insights. The limitation isn't in the training; it's baked into how transformers work.

3️⃣ The Semantic Gap
LLMs process text without truly understanding meaning. We see this clearly in technical domains like software development. Models can generate syntactically perfect code but often miss fundamental logic because they don't grasp what the code actually does.

This creates a massive opportunity for technical founders willing to rethink AI architecture from first principles. Some promising directions I'm tracking:
→ World models that understand causality and physical interaction
→ Architectures designed for reasoning during inference rather than training
→ Systems that combine multiple specialized models rather than one large generalist (sketched below)

Founders: while others chase marginal improvements through scale, focus on solving the fundamental problems to build the next $100B+ business (and I'll be your first check ;))
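As a minimal sketch of that last direction, here is a toy router that dispatches a query to a domain-specific model instead of one generalist. The model names and the call_model stub are hypothetical placeholders, not a real API.

```python
# Minimal sketch of routing queries to specialized models instead of one generalist.
# The model names and call_model() stub are hypothetical placeholders, not a real API.

SPECIALISTS = {
    "code": "code-specialist-v1",      # hypothetical code-generation model
    "math": "math-specialist-v1",      # hypothetical math/reasoning model
    "default": "general-model-v1",     # fallback generalist
}

def route(query: str) -> str:
    """Pick a specialist based on crude keyword heuristics."""
    q = query.lower()
    if any(k in q for k in ("def ", "class ", "compile", "stack trace")):
        return SPECIALISTS["code"]
    if any(k in q for k in ("prove", "integral", "equation", "solve for")):
        return SPECIALISTS["math"]
    return SPECIALISTS["default"]

def call_model(model_name: str, query: str) -> str:
    # Stub: in practice this would call the chosen model's API.
    return f"[{model_name}] response to: {query}"

query = "solve for x: 2x + 3 = 11"
print(call_model(route(query), query))
```

In practice the router itself is often a small model or classifier rather than keyword rules; the point is the composition pattern, not the heuristic.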
Limitations of Large World Models (LWMs)
Explore top LinkedIn content from expert professionals.
Summary
Large world models (LWMs), such as advanced language models, are powerful AI systems designed to process enormous amounts of information, but they face notable limitations in reasoning, handling novel scenarios, and maintaining transparency in decision-making. These shortcomings highlight why simply scaling up models does not guarantee true intelligence or reliable performance, especially on complex, real-world tasks.
- Recognize reasoning gaps: Remember that LWMs often rely on pattern matching rather than genuine logical reasoning, which can lead to unreliable outcomes in unfamiliar or complex scenarios.
- Address transparency issues: Be aware that the inner workings of LWMs are difficult to interpret, making it challenging to understand how decisions are made and complicating efforts to ensure fairness and safety.
- Test with real-world complexity: When deploying LWMs, assess their performance on diverse, realistic scenarios, as these models may struggle to generalize beyond their training data, especially in edge cases.
-
The "black box" nature of LLMs poses significant challenges for regulation and ensuring safety. Due to their opaque and complex internal workings, it is often not clear how these models arrive at specific answers or why they generate certain outputs. This lack of transparency complicates efforts to establish robust regulatory frameworks, as regulators find it difficult to assess compliance with ethical and legal standards, including privacy and fairness. Furthermore, without a clear understanding of how answers are generated, users may question the reliability and trustworthiness of the responses they receive. This uncertainty can deter wider adoption and reliance on LLMs. This study (https://lnkd.in/efjmvwiw) aims to address some of these issues by introducing CausalBench which is designed to address the limitations of existing causal evaluation methods by enhancing the complexity and diversity of the data, tasks, and prompt formats used in the assessments. The purpose of CausalBench is to test and understand the limits of LLMs in identifying and reasoning about causality particularly how well they can perform under conditions that mimic real-world examples. Using CausalBench, the authors then evaluated 19 leading LLMs on their capability to discern direct and indirect correlations, construct causal skeletons, and identify explicit causality from structured and unstructured data. Here are the key takeaways: • 𝗦𝗲𝗻𝘀𝗶𝘁𝗶𝘃𝗶𝘁𝘆 𝘁𝗼 𝗗𝗮𝘁𝗮𝘀𝗲𝘁 𝗦𝗰𝗮𝗹𝗲: LLMs are capable of recognizing direct correlations in smaller datasets, but their performance declines with larger, more complex datasets, particularly in detecting indirect correlations. This indicates a need for models trained on larger and more complex network structures. • 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗘𝗱𝗴𝗲 𝗼𝗳 𝗖𝗹𝗼𝘀𝗲𝗱-𝘀𝗼𝘂𝗿𝗰𝗲 𝗟𝗟𝗠𝘀: Closed-source LLMs like GPT3.5-Turbo and GPT4 outperform open-source models in causality-related tasks, suggesting that the extensive training data and diverse datasets used for these models enhance their ability to handle complex causal queries. • 𝗜𝗺𝗽𝗮𝗰𝘁 𝗼𝗳 𝗣𝗿𝗼𝗺𝗽𝘁 𝗗𝗲𝘀𝗶𝗴𝗻: The effectiveness of LLMs varies with different prompt formats, with combinations of variable names with structured data or background knowledge proving particularly beneficial. The development of comprehensive benchmarks like CausalBench is pivotal in demystifying the "black box" nature of LLMs. This enhanced transparency aids in complex reasoning tasks, guiding the selection of appropriate models for specific applications based on empirical performance data. Additionally, a more granular understanding of LLM capabilities and behaviors facilitates more effective regulation and risk management, addressing both ethical and practical concerns in deploying these models in sensitive or high-stakes environments.
-
"We find that all five studied off-the-shelf LLMs show forms of escalation and difficult-to-predict escalation patterns. We observe that models tend to develop arms-race dynamics, leading to greater conflict, and in rare cases, even to the deployment of nuclear weapons. Qualitatively, we also collect the models' reported reasonings for chosen actions and observe worrying justifications based on deterrence and first-strike tactics. Given the high stakes of military and foreign-policy contexts, we recommend further examination and cautious consideration before deploying autonomous language model agents for strategic military or diplomatic decision-making." Good work from Juan Pablo Rivera, Gabriel M., Anka Reuel, Max Lamparth, Ph.D., Chandler Smith, and Jacquelyn Schneider
-
DON’T rely on AI to do your research…

Large language models (LLMs) are often praised for their ability to process information and assist with problem-solving, but can they really reason the way we do? The latest study by Apple researchers reveals significant limitations in their capacity for genuine mathematical reasoning and raises important questions about their reliability in research contexts.

What Apple found:
1. Inconsistent results: LLMs struggle with variations of the same problem, even at a basic grade-school math level. This variability challenges the validity of current benchmarks like GSM8K, which rely on single-point accuracy metrics.
2. Fragility to complexity: As questions become slightly more challenging, performance drops drastically, exposing a fragile reasoning process.
3. Susceptibility to irrelevant information: When distracting but inconsequential details were included in problems, model performance plummeted by up to 65%. Even repeated exposure to similar questions or fine-tuning couldn’t fix this.
4. Pattern matching ≠ reasoning: The models often “solve” problems by sophisticated pattern matching, not genuine logical understanding.

What this means for research: While LLMs are powerful tools for speeding up certain tasks, their inability to separate critical from irrelevant information, and their reliance on pattern recognition, make them unreliable for rigorous, logic-based research. This is particularly true in fields like mathematics, engineering, and the data-driven sciences, where accuracy and reasoning are non-negotiable.

As exciting as these tools are, they’re not ready to replace human critical thinking (yet?). How do you see AI evolving in research applications?

#research #chemicalengineering #scientist #engineering #professor
PS. Full paper available on arXiv under 2410.05229
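The third finding suggests a simple robustness check anyone can run before trusting an LLM on quantitative work: generate surface variants of one problem (renamed entities, changed numbers, an irrelevant clause) and see whether the answer survives. A minimal sketch, with solve_with_llm left as a hypothetical stub:

```python
# Sketch of the robustness check the Apple study motivates: build surface variants of
# one grade-school problem and measure how often a model's answer stays correct.
# solve_with_llm() is a hypothetical stub standing in for a real model call + parsing.

import random

def make_variant(name: str, apples: int, eaten: int, add_distractor: bool) -> tuple[str, int]:
    problem = f"{name} has {apples} apples and eats {eaten}."
    if add_distractor:
        problem += " Some of the apples are slightly smaller than the rest."  # irrelevant detail
    problem += " How many apples are left?"
    return problem, apples - eaten  # gold answer is unaffected by the distractor

def solve_with_llm(problem: str) -> int:
    # Stub: replace with a real model call and numeric answer extraction.
    return 0

random.seed(0)
names = ["Ava", "Noah", "Mia", "Liam"]
results = []
for _ in range(20):
    prob, gold = make_variant(random.choice(names), random.randint(5, 30),
                              random.randint(1, 4), add_distractor=random.random() < 0.5)
    results.append(solve_with_llm(prob) == gold)
print("accuracy across variants:", sum(results) / len(results))
```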
-
The Illusion of Reasoning: A Timely Reflection from Apple’s Latest AI Research

Apple recently published a white paper titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models.” As the AI community accelerates its focus on agentic AI and reasoning models, Apple’s study offers a timely reality check.

What did they do? Apple rigorously tested several state-of-the-art reasoning-optimized language models using classic logic puzzles like Tower of Hanoi and River Crossing, structured problems that probe genuine reasoning ability.

What did they find?
• Models can solve simple problems reliably, but as complexity increases, performance drops sharply.
• Minor changes (renaming puzzle elements, altering phrasing) can cause drastic performance shifts, highlighting a reliance on memorized patterns rather than true logical understanding.
• Trace analysis shows that current models simulate reasoning by leveraging pattern matching, not structured logic.

💡 Key insight: While today’s models can appear to reason well under certain conditions, much of their success stems from surface-level pattern recognition, not deep, generalizable reasoning.

Conclusion: This work reminds us that although current LLMs are powerful language processors, they remain far from achieving robust, transparent reasoning. As we build AI agents designed to assist in real-world decision-making, we must tread carefully, understanding both the capabilities and limitations of today’s models. In short, humility and rigor must accompany progress; Apple’s contribution is a welcome call for both.

If you work with LLMs or agentic AI, how are you addressing reasoning robustness in your systems? Would love to hear your thoughts!

#AI #ReasoningModels #AppleResearch #LLM #ResponsibleAI #engineeringtidbits
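One practical takeaway is to verify a model's reasoning trace against the puzzle's rules rather than trusting its self-reported success. Below is a minimal sketch of such a checker for Tower of Hanoi; the move list shown is the standard 3-disk solution, used here only as a sanity check in place of model output.

```python
# Minimal verifier for Tower of Hanoi move sequences: check a proposed move list
# against the rules instead of trusting the model's claim that it solved the puzzle.

def verify_hanoi(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Check that a sequence of (from_peg, to_peg) moves legally solves n-disk Hanoi (pegs 0,1,2)."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at the bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                       # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # placing a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # all disks on the target peg, in order

# Standard 3-disk solution (7 moves); a model's proposed moves would go here instead.
moves = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(verify_hanoi(3, moves))  # True
```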
-
🌶 This gap in modern LLMs hardly gets any attention: while many LLMs can process hundreds of thousands of input tokens, they often struggle to produce even a few thousand output tokens. Why is that? 🤔

It’s easy to see why this limitation is often ignored; most LLM tasks don’t need more than a few thousand output tokens. But think about future uses, like having LLMs write entire movie scripts or books! This new paper explains that the issue arises because a model’s output length is usually limited by the longest outputs in its training data, and to solve this, the authors introduce "AgentWrite", a tool that breaks long tasks into smaller parts, allowing LLMs to generate over 20,000 words smoothly.

📖 Insights
👉 The authors show that the primary limitation on LLM output length is the scarcity of long-output examples in existing SFT datasets.
👉 This means that even though LLMs can process extensive input sequences, their output is capped by the longest examples they've encountered during fine-tuning, typically around 2,000 words.
👉 AgentWrite breaks ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to produce coherent outputs exceeding 20,000 words. This method effectively bypasses the limitations imposed by existing SFT datasets.
👉 Leveraging AgentWrite, the authors generated the LongWriter-6k dataset, consisting of 6,000 SFT examples with output lengths ranging from 2,000 to 32,000 words.
👉 By incorporating the LongWriter-6k dataset into training, the authors scaled the output length of models to over 10,000 words without compromising the quality of the generated text.

⛳ The paper also introduces LongBench-Write, a new benchmark specifically designed to evaluate the ultra-long generation capabilities of LLMs. The authors’ 9B parameter model, further improved through Direct Preference Optimization (DPO), achieved state-of-the-art performance on this benchmark, surpassing even larger proprietary models.

Link: https://lnkd.in/gvVE4sbi
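For intuition, here is a rough sketch of the plan-then-write idea behind AgentWrite, not the paper's implementation: ask for an outline first, then generate each section separately so no single call has to exceed the output lengths the model saw during fine-tuning. The llm function, prompts, and section count are hypothetical assumptions.

```python
# Rough sketch of a plan-then-write loop: outline first, then write sections one by one
# and concatenate. llm() is a hypothetical stub standing in for a real model call.

def llm(prompt: str) -> str:
    # Stub: replace with a real model call.
    return "..."

def write_long(task: str, n_sections: int = 10) -> str:
    outline = llm(f"Write a {n_sections}-point outline for: {task}. One line per section.")
    sections: list[str] = []
    for i, heading in enumerate(outline.splitlines()[:n_sections], start=1):
        context = "\n\n".join(sections[-2:])  # pass recent sections along for continuity
        sections.append(llm(
            f"Task: {task}\nOutline item {i}: {heading}\n"
            f"Previously written:\n{context}\n\nWrite about 2000 words for this item."
        ))
    return "\n\n".join(sections)

draft = write_long("a history of weather forecasting")
```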
-
#GeoAnalystBench: A #GeoAI benchmark for assessing large language models for spatial analysis workflow and code generation. Recent advances in large language models (LLMs) have fueled growing interest in automating geospatial analysis and GIS workflows, yet their actual capabilities remain uncertain. We call for more rigorous evaluation of LLMs on well-defined geoprocessing tasks before making claims about full GIS automation. To this end, we present GeoAnalystBench, a benchmark of 50 Python-based tasks (with over a hundred sub-tasks) derived from real-world geospatial problems and validated by GIS experts. Each task is paired with a minimum deliverable product, and evaluation covers GIS workflow validity, structural alignment, semantic similarity, and code quality (CodeBLEU). Using this benchmark, we assess both proprietary and open-source models. Results reveal a clear gap: proprietary models such as ChatGPT-4o-mini achieve high validity (95%) and stronger code alignment (CodeBLEU 0.39), while smaller open-source models like DeepSeek-R1-7B often generate incomplete or inconsistent workflows (48.5% validity, 0.272 CodeBLEU). Tasks requiring deeper spatial reasoning, such as spatial relationship detection or optimal site selection, remain the most challenging across all models. These findings demonstrate both the promise and limitations of current LLMs in GIS automation and provide a reproducible framework to advance GeoAI research with human-in-the-loop support. Paper: https://lnkd.in/d53cdfcT Code and Benchmark Datasets: https://lnkd.in/dAriGtcy
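As a rough illustration of the shape of such an evaluation loop (not the paper's CodeBLEU pipeline), the sketch below scores generated geoprocessing code against a reference solution with a crude token-overlap proxy; the code snippets and the scoring function are illustrative stand-ins.

```python
# Crude stand-in for code-similarity scoring of generated geoprocessing code against a
# reference solution. This is a token-overlap proxy, not CodeBLEU; snippets are illustrative.

import difflib
import re

def token_similarity(generated: str, reference: str) -> float:
    """Rough proxy for code similarity: ratio of matching code tokens."""
    def tokenize(s: str) -> list[str]:
        return re.findall(r"[A-Za-z_]\w*|\d+|\S", s)
    return difflib.SequenceMatcher(None, tokenize(generated), tokenize(reference)).ratio()

reference = "import geopandas as gpd\ngdf = gpd.read_file('sites.shp')\nbuf = gdf.buffer(500)"
generated = "import geopandas as gpd\ngdf = gpd.read_file('sites.shp')\nbuffered = gdf.geometry.buffer(500)"

print(round(token_similarity(generated, reference), 3))
```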
-
🤔 Ever wondered why LLMs struggle with basic arithmetic? Tokenization could be one of the suspects! 🤫

Tokenization is the first step in feeding text to Large Language Models (LLMs). It slices text into smaller pieces (words, subwords, or even characters) and assigns each one a unique ID. For example: “John Doe” → [6721, 870, 17] (split into parts).

📉 As smart as it sounds, tokenization comes with hidden costs that affect how models handle context, numbers, and even whitespace. 🔍 So, what exactly are the challenges with tokenization?

✂️ Fragmentation and Context Loss: Models aim to balance vocabulary size and flexibility. Instead of creating unique tokens for every long or rare word, tokenization breaks them into subwords that are easier to process and reuse. However, this process can result in loss of semantic meaning when the word’s structure is complex or unfamiliar to the model. Example: 'unbelievable' → ["un", "believ", "able"].

🔤 Case Sensitivity and Symbol Handling: “hello” is tokenized as [31373], while “HELLO” becomes [13909, 3069, 46]. These inconsistencies mean that identical words with different capitalization are treated as unrelated, creating challenges for case-sensitive applications like coding or structured data analysis.

🔢 Numerical Representation Issues: Handling numbers should be simple, but tokenization makes it complicated. For instance, while the number “380” may be treated as a single token [380], “381” could be split into two tokens [38, 1]. Such inconsistencies make tasks like arithmetic, pattern matching, and date handling significantly harder. Further, reversing the digits of “381” requires manipulating multiple tokens rather than one, making it difficult for models to operate with mathematical precision.

⬜ Whitespace, the Invisible Culprit: Consider the phrases “Once upon a” and “Once upon a ” (with a trailing space). The former might tokenize as [once, upon, a], while the latter could include an extra whitespace token, [once, upon, a, ␣]. This subtle difference impacts text formatting, code generation, and prompt engineering, where exact outputs often matter.

❌ Vocabulary Incompatibility Across Models: Different models have different tokenization approaches and assign different IDs to the same tokens. This lack of standardization creates challenges for cross-model compatibility and for multimodal systems that integrate text with code or visual data.

So, while tokenization has been a practical bridge between human and machine language, future AI models must evolve to process text more naturally, mirroring the human brain’s ability to grasp context and meaning seamlessly. Researchers are developing token-free architectures like Meta’s recent Byte Latent Transformer (BLT), which processes raw bytes instead of tokens, offering greater scalability, improved multilingual handling, and fewer formatting errors.

What are your thoughts on this? Follow Kavana Venkatesh for such AI content!

#genai #llms #nlp
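These effects are easy to observe directly. The sketch below uses the tiktoken library (pip install tiktoken) with the cl100k_base encoding; exact token IDs and splits vary by tokenizer and model, so the output should be read as illustrative rather than canonical.

```python
# Probe a few of the tokenization quirks described above: casing, numbers, and
# trailing whitespace can all change how many tokens a string becomes.
# Requires: pip install tiktoken. IDs/splits are tokenizer-dependent.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", "HELLO", "unbelievable", "380", "381", "Once upon a", "Once upon a "]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]  # decode each token id back to its text piece
    print(f"{text!r:20} -> {len(ids)} token(s): {pieces}")
```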
-
Large language models aren't effective for many languages. Scholars explain what's at stake for the approximately 5 billion people who don't speak English, by the Stanford Institute for Human-Centered Artificial Intelligence (HAI).

👨‍⚖️ Five billion people worldwide risk systematic exclusion from the AI revolution due to linguistic barriers in large language models. Current AI systems excel for English speakers but fail dramatically for languages like Vietnamese and Nahuatl, creating a new form of digital stratification that transcends traditional economic divides.

💡 The problem extends beyond translation: these models embed cultural assumptions that distort local contexts while potentially accelerating the homogenization of human knowledge systems. This creates a recursive exclusion where marginalized communities lose both immediate technological benefits and long-term influence over systems that will shape future social and economic structures.

🚨 The technical challenge reflects deeper questions about who controls AI development and whose worldviews become encoded in these increasingly powerful systems. Policymakers face a critical window to intervene before technological path dependence locks in current inequities.

🧭 It is key to establish international standards for multilingual AI benchmarking, create incentive structures rewarding genuine linguistic inclusivity over superficial localization, and develop governance frameworks treating language data as cultural infrastructure deserving protective policies, recognizing that linguistic diversity in AI isn't just about fairness but about preserving the cognitive and cultural resources humanity needs for complex global problem-solving.

https://lnkd.in/etzw3HpB
-
One of the most interesting posters I saw at NeurIPS this year was this paper: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝘁𝗵𝗲 𝗪𝗼𝗿𝗹𝗱 𝗠𝗼𝗱𝗲𝗹 𝗜𝗺𝗽𝗹𝗶𝗰𝗶𝘁 𝗶𝗻 𝗮 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝘃𝗲 𝗠𝗼𝗱𝗲𝗹

It tackles a big question in AI: do transformers implicitly learn world models through next-token prediction? This matters because a robust world model isn't just critical for out-of-distribution generalization and causal reasoning; according to some researchers, it's also a key step toward AGI.

In their work, Keyon Vafa and colleagues introduce two novel metrics and an experimental framework to evaluate how well transformers implicitly learn a world model. Their setup explores whether transformers can infer the map of Manhattan when trained to predict the next direction in a route given an origin and destination. While the models excel at 𝗽𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗻𝗴 routes, they are strikingly fragile at 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴 the underlying map structure of Manhattan. To measure this, the authors propose two metrics:

1. 𝗦𝗲𝗾𝘂𝗲𝗻𝗰𝗲 𝗖𝗼𝗺𝗽𝗿𝗲𝘀𝘀𝗶𝗼𝗻: Imagine traveling from Madison Ave & 43rd St to Lexington Ave & 38th St. There are multiple valid routes, such as “North → East → South” or “West → South → East.” Humans intuitively understand that once they reach Lexington & 38th, their subsequent decisions are independent of how they arrived there. Surprisingly, transformers struggle with this generalization and fail to predict consistent next steps after such sequences.

2. 𝗦𝗲𝗾𝘂𝗲𝗻𝗰𝗲 𝗗𝗶𝘀𝘁𝗶𝗻𝗰𝘁𝗶𝗼𝗻: Consider two routes with different intermediate steps, such as “North → East” versus “West → South,” which result in different sets of valid next moves (e.g., East vs. South). The authors formalize the precision and recall of the model’s understanding of such distinctions using a Myhill-Nerode boundary framework.

Using these metrics, the authors show that transformers trained to predict routes perform poorly on both measures, even compared to models trained on random walks. The finding highlights that while transformers are adept at next-token prediction, they fail to internalize a true world model. Such limitations lead to fragility and unpredictable behavior, particularly in novel or out-of-distribution scenarios. Consider, for instance, an autonomous car relying on a transformer-based routing system during a natural disaster when standard routes are unavailable: a broken internal map could result in catastrophic failures.

paper: https://lnkd.in/e73KJEKJ
code: https://lnkd.in/eBXd_dhc
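The sequence-compression idea can be approximated with a very small consistency probe: two route prefixes that end at the same intersection should admit the same set of valid next moves, so a model with a coherent internal map should predict them consistently. A minimal sketch, with predict_next as a hypothetical stub in place of the trained model; the two prefixes are the equivalent routes from the example above.

```python
# Tiny consistency probe in the spirit of the paper's sequence-compression metric:
# compare a model's predicted next moves after two prefixes that reach the same
# intersection. predict_next() is a hypothetical stub; a real check would query the
# trained route model and compare its predictions against the true street graph.

def predict_next(route_prefix: list[str]) -> set[str]:
    # Stub: replace with the model's predicted next directions for this prefix.
    return {"East"}

# Two different ways of reaching the same intersection from the same origin.
prefix_a = ["North", "East", "South"]
prefix_b = ["West", "South", "East"]

preds_a, preds_b = predict_next(prefix_a), predict_next(prefix_b)
print("predictions agree after equivalent prefixes:", preds_a == preds_b)
```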