LLM Accuracy in Complex Context Scenarios


Summary

LLM accuracy in complex context scenarios refers to how reliably large language models (LLMs) can generate correct answers when faced with tasks that involve multiple steps, evolving information, or intricate reasoning, especially in real-world interactions. Improving this accuracy is crucial for making AI systems trustworthy and useful in practical applications like scientific analysis, negotiation, and question answering over knowledge graphs.

  • Test for real-world use: Evaluate LLMs using scenarios that reflect natural, multi-turn conversations and complex tasks to better understand their reliability and limitations.
  • Structure decision workflows: Guide LLMs with frameworks like game theory, ontologies, or iterative retrieval methods to boost reasoning and adaptability in situations with incomplete or dynamic information.
  • Invest in semantic tools: Use knowledge graphs and metadata to check, repair, and improve model-generated queries, which helps increase answer accuracy on complicated questions.
Summarized by AI based on LinkedIn member posts
  • Akash Sharma

    CEO at Vellum

    🧠 If you're building apps with LLMs, this paper is a must-read.

    Researchers at Microsoft and Salesforce recently released LLMs Get Lost in Multi-Turn Conversation — and the findings resonate with our experience at Vellum. They ran 200,000+ simulations across 15 top models, comparing performance on the same task in two modes:
    - Single-turn (user provides a well-specified prompt upfront)
    - Multi-turn (user reveals task requirements gradually — like real users do)

    The result?
    ✅ 90% avg accuracy in single-turn
    💬 65% avg accuracy in multi-turn
    🔻 39% performance drop across the board
    😬 Unreliability more than doubled

    Even the best models get lost when the task unfolds over multiple messages. They latch onto early assumptions, generate bloated answers, and fail to adapt when more information arrives.

    For application builders, this changes how we think about evaluation and reliability:
    - One-shot prompt benchmarks ≠ user reality
    - Multi-turn behavior needs to be a first-class test case
    - Agents and wrappers won't fix everything — the underlying model still gets confused

    This paper validates something we've seen in the wild: the moment users interact conversationally, reliability tanks — unless you're deliberate about managing context, fallback strategies, and prompt structure.

    📌 If you're building on LLMs, read this. Test differently. Optimize for the real-world path, not the happy path.
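
The practical takeaway for builders is to make multi-turn behavior a first-class test case. A minimal sketch of what that can look like, assuming the OpenAI Python client, a hypothetical task format with pre-split "shards", and a placeholder grade() check: the same task is scored once as a fully specified prompt and once revealed gradually, so the gap the paper measures shows up in your own eval suite.

```python
# Minimal sketch: score the same task in single-turn and multi-turn mode.
# Assumes the OpenAI Python client; the task format and grading rule are placeholders.
from openai import OpenAI

client = OpenAI()

def ask(messages, model="gpt-4o"):
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

def grade(output: str, expected: str) -> bool:
    # Placeholder check; swap in whatever correctness test fits your task.
    return expected.lower() in output.lower()

def eval_task(task):
    """task = {"full_prompt": str, "shards": [str, ...], "answer": str} (hypothetical format)."""
    # Single-turn: everything the model needs, stated upfront.
    single = ask([{"role": "user", "content": task["full_prompt"]}])

    # Multi-turn: requirements revealed one shard at a time, like real users do.
    messages = []
    for shard in task["shards"]:
        messages.append({"role": "user", "content": shard})
        messages.append({"role": "assistant", "content": ask(messages)})
    multi = messages[-1]["content"]

    return grade(single, task["answer"]), grade(multi, task["answer"])
```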

  • Ross Dawson

    Futurist | Board advisor | Global keynote speaker | Humans + AI Leader | Bestselling author | Podcaster | LinkedIn Top Voice | Founder: AHT Group - Informivity - Bondi Innovation

    LLMs struggle with rationality in complex game theory situations, which are very common in the real world. However, integrating structured game-theoretic workflows into LLMs enables them to compute and execute optimal strategies such as Nash Equilibria. This will be vital for bringing AI into real-world situations, especially with the rise of agentic AI.

    The paper "Game-theoretic LLM: Agent Workflow for Negotiation Games" (link in comments) examines the performance of LLMs in strategic games and how to improve them. Highlights from the paper:

    💡 Strategic Limitations of LLMs in Game Theory: LLMs struggle with rationality in complex game scenarios, particularly as game complexity increases. Despite their ability to process large amounts of data, LLMs often deviate from Nash Equilibria in games with larger payoff matrices or sequential decision trees. This limitation suggests a need for structured guidance to improve their strategic reasoning capabilities.

    🔄 Workflow-Driven Rationality Improvements: Integrating game-theoretic workflows significantly enhances the performance of LLMs in strategic games. By guiding decision-making with principles like Nash Equilibria, Pareto optimality, and backward induction, LLMs showed improved ability to identify optimal strategies and robust rationality even in negotiation scenarios.

    🤝 Negotiation as a Double-Edged Sword: Negotiation improved outcomes in coordination games but sometimes led LLMs away from Nash Equilibria in scenarios where those equilibria were not Pareto optimal. This reflects a tendency for LLMs to prioritize fairness or trust over strict game-theoretic rationality when engaging in dialogue with other agents.

    🌐 Challenges with Incomplete Information: In incomplete-information games, LLMs demonstrated difficulty handling private valuations and uncertainty. Novel workflows incorporating Bayesian belief updating allowed agents to reason under uncertainty and propose envy-free, Pareto-optimal allocations. However, these scenarios highlighted the need for more nuanced algorithms to account for real-world negotiation dynamics.

    📊 Model Variance in Performance: Different LLM models displayed varying levels of rationality and susceptibility to negotiation-induced deviations. For instance, model o1 consistently adhered more closely to Nash Equilibria compared to others, underscoring the importance of model-specific optimization for strategic tasks.

    🚀 Practical Implications: The findings suggest LLMs can be optimized for strategic applications like automated negotiation, economic modeling, and collaborative problem-solving. However, careful design of workflows and prompts is essential to mitigate their inherent biases and enhance their utility in high-stakes, interactive environments.
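
As a concrete (and deliberately tiny) illustration of the kind of game-theoretic check such a workflow can bolt onto an agent, here is a pure-strategy Nash equilibrium finder for a two-player normal-form game. This is a generic textbook computation, not the paper's actual workflow or prompts.

```python
# Find pure-strategy Nash equilibria of a 2-player normal-form game via best-response checks.
# Illustrative only; not the paper's agent workflow.
import numpy as np

def pure_nash_equilibria(payoff_a, payoff_b):
    """payoff_a[i, j], payoff_b[i, j]: payoffs when the row player picks i and the column player picks j."""
    a, b = np.asarray(payoff_a), np.asarray(payoff_b)
    equilibria = []
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            row_best = a[i, j] >= a[:, j].max()   # row player cannot improve by deviating
            col_best = b[i, j] >= b[i, :].max()   # column player cannot improve by deviating
            if row_best and col_best:
                equilibria.append((i, j))
    return equilibria

# Prisoner's dilemma: action 0 = cooperate, action 1 = defect.
A = [[-1, -3], [0, -2]]   # row player's payoffs
B = [[-1, 0], [-3, -2]]   # column player's payoffs
print(pure_nash_equilibria(A, B))  # [(1, 1)] -> mutual defection is the unique pure NE
```

A workflow like the paper describes can run a deterministic check of this kind before letting the model commit to a move, instead of trusting the LLM's unguided intuition about the payoff matrix.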

  • Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    Researchers at UC San Diego and Tsinghua just solved a major challenge in making LLMs reliable for scientific tasks: knowing when to use tools versus solving problems directly.

    Their method, called Adapting While Learning (AWL), achieves this through a novel two-component training approach:
    (1) World knowledge distillation - the model learns to solve problems directly by studying tool-generated solutions
    (2) Tool usage adaptation - the model learns to intelligently switch to tools only for complex problems it can't solve reliably

    The results are impressive:
    * 28% improvement in answer accuracy across scientific domains
    * 14% increase in tool usage precision
    * Strong performance even with 80% noisy training data
    * Outperforms GPT-4 and Claude on custom scientific datasets

    Current approaches either make LLMs over-reliant on tools or prone to hallucinations when solving complex problems. This method mimics how human experts work - first assessing if they can solve a problem directly before deciding to use specialized tools.

    Paper: https://lnkd.in/g37EK3-m

    —
    Join thousands of world-class researchers and engineers from Google, Stanford, OpenAI, and Meta staying ahead on AI: http://aitidbits.ai
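
A rough sketch of the inference-time behavior this enables, with hypothetical llm() and run_tool() stubs and an assumed confidence threshold: the model attempts a direct answer with a self-assessed confidence and only falls back to the specialized tool when that confidence is low. This mirrors the idea in spirit; the paper's contribution is the training recipe, not this routing code.

```python
# Sketch of tool-routing at inference time: answer directly when confident,
# otherwise defer to a specialised tool. llm() and run_tool() are hypothetical stubs.
import json

CONFIDENCE_THRESHOLD = 0.7  # assumption: tune on a validation set

def answer(question: str) -> str:
    # Ask the model for an answer plus a self-assessed confidence score.
    raw = llm(
        "Answer the question and rate your confidence from 0 to 1. "
        'Respond as JSON: {"answer": ..., "confidence": ...}\n\n' + question
    )
    attempt = json.loads(raw)

    if attempt["confidence"] >= CONFIDENCE_THRESHOLD:
        return attempt["answer"]          # solve directly, like an expert doing mental math

    tool_output = run_tool(question)      # e.g. a numerical solver or simulator
    # Let the model turn the raw tool output into a final answer.
    return llm(f"Question: {question}\nTool output: {tool_output}\nWrite the final answer.")
```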

  • Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    Excited to share a groundbreaking advancement in AI: KG-IRAG!

    I've been diving into this fascinating paper on Knowledge Graph-Based Iterative Retrieval-Augmented Generation (KG-IRAG), a novel framework that significantly enhances how Large Language Models (LLMs) handle complex temporal reasoning tasks.

    >> What makes KG-IRAG special?

    Unlike traditional RAG methods that often struggle with multi-step reasoning, KG-IRAG introduces an iterative approach that progressively gathers relevant data from knowledge graphs. This is particularly powerful for queries involving temporal dependencies - like planning trips based on weather conditions or traffic patterns.

    The framework employs two collaborative LLMs:
    - LLM1 identifies the initial exploration plan and generates a reasoning prompt
    - LLM2 evaluates retrieved data and determines if further retrieval steps are needed

    The magic happens in the iterative retrieval process, where the system incrementally explores the knowledge graph, retrieving only what's needed when it's needed. This prevents information overload while ensuring comprehensive data collection.

    >> Technical implementation details:

    The researchers constructed knowledge graphs treating time, location, and event status as key entities, with relationships capturing temporal, spatial, and event-based correlations. Their approach models time as an entity for easier retrieval and reasoning.

    The system follows a sophisticated algorithm where:
    1. Initial time/location parameters are identified
    2. Relevant triplets are retrieved from the KG
    3. LLM2 evaluates if current data is sufficient
    4. If insufficient, search criteria are adjusted based on detected "abnormal events"
    5. This continues until enough information is gathered to generate an accurate answer

    >> Impressive results:

    The researchers at UNSW, MBZUAI and TII evaluated KG-IRAG on three custom datasets (weatherQA-Irish, weatherQA-Sydney, and trafficQA-TFNSW), demonstrating significant improvements in accuracy for complex temporal reasoning tasks compared to standard RAG methods.

    What's particularly impressive is how KG-IRAG outperforms other approaches on questions requiring dynamic temporal reasoning - like determining the latest time to leave early or the earliest time to leave late to avoid adverse conditions.

    This work represents a significant step forward in making LLMs more capable of handling real-world temporal reasoning tasks. Excited to see how this technology evolves!
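
A highly simplified sketch of the iterative loop in the numbered steps above, with hypothetical helpers (plan_query, kg_retrieve, judge, adjust_plan, generate_answer) standing in for LLM1, the graph store, and LLM2. It shows the control flow only, not the paper's prompts or datasets.

```python
# Sketch of an iterative KG retrieval loop (control flow only; all helpers are hypothetical stubs).
MAX_ROUNDS = 5

def kg_irag_answer(question: str) -> str:
    plan = plan_query(question)              # LLM1: initial time/location parameters + reasoning prompt
    evidence = []
    for _ in range(MAX_ROUNDS):
        triples = kg_retrieve(plan)          # pull (subject, relation, object) triples for the current plan
        evidence.extend(triples)
        verdict = judge(question, evidence)  # LLM2: is the evidence sufficient? if not, what is missing?
        if verdict["sufficient"]:
            break
        # Expand the search around whatever the judge flagged (e.g. an abnormal weather event).
        plan = adjust_plan(plan, verdict["missing"])
    return generate_answer(question, evidence)  # final LLM call grounded in the retrieved triples
```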

  • Juan Sequeda

    Principal Researcher at ServiceNow (data.world acquisition); co-host of Catalog & Cocktails, the honest, no-bs, non-salesy data podcast. 20 years working in Knowledge Graphs & Ontologies (way before it was cool)

    How can we further increase the accuracy of LLM-powered question answering systems? Ontologies to the rescue!

    That is the conclusion of the latest research coming from the data.world AI Lab with Dean Allemang. Based on our previous Knowledge Graph LLM Accuracy benchmark research, our intuition is that accuracy can be further increased by 1) leveraging the ontology of the knowledge graph to check for errors in the generated queries and 2) using the LLM to repair incorrect queries.

    We ask ourselves the following two research questions:
    1️⃣ To what extent can accuracy increase by leveraging the ontology of a knowledge graph to detect errors in a SPARQL query and an LLM to repair the errors?
    2️⃣ What types of errors are most commonly present in SPARQL queries generated by an LLM?

    🧪 Our hypothesis: An ontology can increase the accuracy of an LLM-powered question answering system that answers a natural language question over a knowledge graph.

    📏 Our approach consists of:
    - Ontology-based Query Check (OBQC): checks deterministically whether the query is valid by applying rules based on the semantics of the ontology. The rules check the body of the query (i.e. the WHERE clause) and the head of the query (i.e. the SELECT clause). If a check does not pass, it returns an explanation.
    - LLM Repair: repairs the SPARQL query generated by the LLM. It takes as input the incorrect query and the explanation and sends a zero-shot prompt to the LLM. The result is a new query, which can then be passed back to the OBQC.

    🏅 Results (using our chat-with-the-data benchmark and GPT-4):
    - Our OBQC and LLM Repair approach increased the accuracy to 72.55%. If the repairs were not successful after three iterations, an unknown result was returned, which occurred 8% of the time. Thus the final error rate is 19.44%. "I don't know" is a valid answer which reduces the error rate.
    - Low-complexity questions on low-complexity schemas achieve an error rate of 10.46%, which is now arguably at levels deemed acceptable by users.
    - All questions on high-complexity schemas substantially increased in accuracy.
    - 70% of the repairs were done by rules checking the body of the query. The majority were rules related to the domain of a property.

    Putting this all together with our previous work, LLM question answering accuracy that leverages Knowledge Graphs and Ontologies is over 4x the SQL accuracy! These results support the main conclusion of our research: investment in metadata, semantics, ontologies and Knowledge Graphs is a precondition to achieve higher accuracy for LLM-powered question answering systems. Link to paper in comments.

    We are honored that we get to work with strategic customers to push the barrier of the data catalog and knowledge graph industry, and the data.world product. We are proud that our research results are a core part of the data.world AI Context Engine. Thanks for all the valuable feedback we have received from colleagues across industry and academia.
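
A minimal sketch of the check-and-repair loop described above, assuming hypothetical obqc_check(), llm_repair(), and run_sparql() helpers; the actual OBQC rules and repair prompts are defined in the paper.

```python
# Sketch of an ontology check + LLM repair loop for generated SPARQL.
# obqc_check(), llm_repair(), and run_sparql() are hypothetical stand-ins for the paper's components.
MAX_REPAIRS = 3

def answer_with_repair(question: str, sparql: str, ontology) -> str:
    for _ in range(MAX_REPAIRS):
        violation = obqc_check(sparql, ontology)   # e.g. "variable used where the property's domain is Person"
        if violation is None:
            return run_sparql(sparql)              # query passed all ontology rules; execute it
        # Zero-shot repair: give the LLM the failing query plus the rule's explanation.
        sparql = llm_repair(question, sparql, violation)
    return "I don't know"                          # unrepairable after three attempts -> honest unknown
```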

  • Cameron R. Wolfe, Ph.D.

    Research @ Netflix

    My favorite paper from NeurIPS’24 shows us that frontier LLMs don’t pay very close attention to their context windows…

    Needle In A Haystack: The needle in a haystack test is the most common way to test LLMs with long context windows. The test is conducted via the following steps:
    1. Place a fact / statement within a corpus of text.
    2. Ask the LLM to generate the fact given the corpus as input.
    3. Repeat this test while increasing the size of the corpus and placing the fact at different locations.

    From this test, we see if an LLM “pays attention” to different regions of a long context window, but this test purely examines whether the LLM is able to recall information from its context.

    Where does this fall short? Most tasks being solved by LLMs require more than information recall. The LLM may need to perform inference, manipulate knowledge, or reason in order to solve a task. With this in mind, we might wonder if we could generalize the needle in a haystack test to analyze more complex LLM capabilities under different context lengths.

    BABILong generalizes the needle in a haystack test to perform long context reasoning. The LLM is tested based upon its ability to reason over facts that are distributed in very long text corpora. Reasoning tasks that are tested include fact chaining, induction, deduction, counting, list / set comprehension, and more. Such reasoning tasks are challenging, especially when necessary information is scattered in a large context window.

    “Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity.” - BABILong paper

    Can LLMs reason over long context? We see in the BABILong paper that most frontier LLMs struggle to solve long context reasoning problems. Even top LLMs like GPT-4 and Gemini-1.5 seem to consistently use only ~20% of their context window. In fact, most LLMs struggle to answer questions about facts in texts longer than 10,000 tokens!

    What can we do about this? First, we should just be aware of this finding! Be wary of using super long contexts, as they might deteriorate the LLM’s ability to solve more complex problems that require reasoning. However, we see in the BABILong paper that these issues can be mitigated with a few different approaches:
    - Using RAG is helpful. However, this approach only works up to a certain context length and has limitations (e.g., it struggles to solve problems where the order of facts matters).
    - Recurrent transformers can answer questions about facts from very long contexts.
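
A small sketch of the basic needle-in-a-haystack setup from the numbered steps above, with a hypothetical llm() call; BABILong layers reasoning tasks (fact chaining, counting, and so on) on top of this recall skeleton.

```python
# Sketch of a needle-in-a-haystack recall test: hide a fact at varying depths in filler
# text of varying lengths and check whether the model can retrieve it. llm() is a hypothetical stub.
FILLER = "The sky was grey and the streets were quiet. " * 50  # any distractor prose works
NEEDLE = "The secret launch code is 7481."
QUESTION = "What is the secret launch code?"

def build_haystack(total_chars: int, depth: float) -> str:
    """Place the needle at a relative depth (0 = start, 1 = end) of a corpus of roughly total_chars."""
    filler = (FILLER * (total_chars // len(FILLER) + 1))[:total_chars]
    cut = int(len(filler) * depth)
    return filler[:cut] + " " + NEEDLE + " " + filler[cut:]

results = {}
for total_chars in (5_000, 50_000, 200_000):
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_haystack(total_chars, depth) + "\n\n" + QUESTION
        answer = llm(prompt)                       # hypothetical model call
        results[(total_chars, depth)] = "7481" in answer
print(results)  # recall broken down by context length and needle position
```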

  • “Don’t bet against the models” is the classic advice. And I believe it. But our latest paper revealed there’s still a long way to go before models excel at large context recall.

    The specific problem is the “needle in a haystack” scenario, where models have to dig through a large context window to find a small bit of information. Historically, this has been a tough problem to solve. Many papers find that:
    1. Information buried in the middle of a context window is usually missed, with accuracy dipping from 75% to below 55% in some tests ("Lost in the Middle: How Language Models Use Long Contexts")
    2. Conflicting information with training data lowers performance (up to 30% in tests from "LLM In-Context Recall is Prompt Dependent")
    3. Performance does improve with state-of-the-art models IF the prompt is favorable ("LLM In-Context Recall is Prompt Dependent")
    4. Performance often depends on prompt construction ("LLM In-Context Recall is Prompt Dependent")

    Which paints a bad picture: agent recall is dependent on luck, a ton of prompt crafting, and hoping that your new data doesn’t conflict with training data. And out of the box, even frontier models have a tough time with this if the stars don’t align. We can do better.

    In our case, we evaluate recall using the LongMem evaluation, which is representative of the complexity of enterprise GenAI use cases. The eval tests recall of small, nuanced details from a large chat history, such as a customer support conversation. The eval is a real LLM stress test.

    One thing became abundantly clear from our benchmarking: Zep scored far higher than having entire conversations in the context window. Specifically:
    - 18.5% aggregate accuracy improvement over the full-context baseline, and 100%+ for many individual tests
    - A 90% reduction in latency vs. full-context
    - Only 2% of the tokens compared to context stuffing

    It makes sense - Zep focuses on including only the most relevant information in the context window, instead of dumping everything. So “needle-in-a-haystack” becomes a non-issue. The LLM doesn’t have to sift through everything to find the relevant context - Zep surfaces it automatically.

    As for “don’t bet against the models”, we’ve clearly got a long way to go. And even if they get better and better at recall, without significant LLM architectural advances, filling the context window will still be slow and expensive. If accuracy and latency are your concern, I wouldn’t wait around!
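
A generic sketch of the "surface only the relevant context" idea, assuming the OpenAI embeddings API; this is not Zep's implementation (its memory layer is considerably more involved), just the simplest version of selecting messages by similarity instead of stuffing the full history.

```python
# Generic sketch: select only the most relevant chat-history messages for the prompt,
# instead of stuffing the full conversation into the context window.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def select_context(history: list[str], question: str, k: int = 5) -> list[str]:
    vecs = embed(history + [question])
    msg_vecs, q_vec = vecs[:-1], vecs[-1]
    # Cosine similarity between each past message and the current question.
    sims = msg_vecs @ q_vec / (np.linalg.norm(msg_vecs, axis=1) * np.linalg.norm(q_vec))
    top = np.argsort(sims)[-k:]
    return [history[i] for i in sorted(top)]   # keep chronological order for readability
```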

  • Elena Gurevich

    AI & IP Attorney for Startups & SMEs | Speaker | Practical AI Governance & Compliance | Owner, EG Legal Services | EU GPAI Code of Practice WG | Board Member, Center for Art Law

    A new study by the Stanford Institute for Human-Centered Artificial Intelligence (HAI) examines "legal hallucinations" in LLMs (GPT-3.5, Llama 2, and PaLM 2). One of the findings showed that large language models' performance "deteriorates when dealing with more complex tasks that require a nuanced understanding of legal issues or interpretation of legal texts. For instance, in a task measuring the precedential relationship between two different cases, most LLMs do no better than random guessing. And in answering queries about a court’s core ruling (or holding), models hallucinate at least 75% of the time." Link to paper in comments.

  • Chris Gaun

    AI today is the worst it will ever be

    Apple shows where LLMs have 0% accuracy and hit a complexity wall.

    Instead of using benchmarks, which LLMs train towards, Apple gave Claude Thinking, DeepSeek-R1, and o3-mini (among the most advanced reasoning models) novel puzzles. All the reasoning models hit a wall where they had 0% accuracy on these high-complexity problems, no matter how much compute was thrown at them. This has actually been shown before, without as much fanfare. In this experiment they also gave the LLMs the exact algorithm to solve the problem, and they still failed.

    Basically, the pattern matching works until a high degree of complexity is hit, then it breaks down. https://lnkd.in/ePFdhH2n
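
For a sense of how quickly puzzle complexity ramps, here is a Tower of Hanoi illustration; the post doesn't name the specific puzzles, so treat the choice of puzzle as an assumption. Solving n disks takes 2^n - 1 moves, so the solution trace a model must produce grows exponentially from a single complexity knob.

```python
# Illustration (assumed example, not named in the post): Tower of Hanoi lets you dial
# complexity with one knob. An n-disk instance needs 2**n - 1 moves, so the full
# solution trace a model must produce grows exponentially.
def hanoi(n, src="A", dst="C", aux="B", moves=None):
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
    else:
        hanoi(n - 1, src, aux, dst, moves)   # move the top n-1 disks out of the way
        moves.append((src, dst))             # move the largest disk
        hanoi(n - 1, aux, dst, src, moves)   # stack the n-1 disks back on top
    return moves

for n in (3, 7, 10, 15):
    print(n, "disks ->", len(hanoi(n)), "moves")   # 7, 127, 1023, 32767
```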

  • Sivasankar Natarajan

    Technical Director | GenAI Practitioner | Azure Cloud Architect | Data & Analytics | Solutioning What’s Next

    Context engineering is the process of deliberately designing, structuring, and manipulating the inputs, metadata, memory, and environment surrounding an LLM to produce better, more reliable, and more useful outputs.

    Here’s how to think about it:
    - The LLM is the CPU
    - The Context Window is the RAM
    - Context Engineering is your OS

    Just like RAM, the context window has strict limits. What you load into it, and when, defines everything from performance to reliability. Think of it as "Prompt Engineering" on steroids, with a focus on providing a rich and structured environment for the LLM to work within.

    Here’s the framework I keep coming back to: The 4 Cs of Context Engineering.

    1. Save Context
    Store important information outside the context window so it can be reused later.
    - Log task results
    - Store conversation states and chat history
    - Persist metadata
    This is about Memory. Offload what the model doesn’t need right now but might need soon.

    2. Select Context
    Pull relevant information into the context window for the task at hand.
    - Use search (RAG)
    - Look up memory
    - Query prior interactions
    Selection quality = Output quality. Garbage in, garbage out.

    3. Compress Context
    When you exceed token limits, you compress.
    - Summarize
    - Cluster with embeddings
    - Trim token-by-token
    Think like a systems engineer. Signal > Noise. Token budgets are real.

    4. Isolate Context
    Sometimes the best boost in performance comes from narrowing scope.
    - Scope to one subtask
    - Modularize agents
    - Run isolated threads
    Less clutter = Fewer Hallucinations = More Deterministic Behavior.

    Why does this matter? Most LLM failures aren’t because of weak prompts. They fail because the context window is overloaded, underutilized, or just ignored.

    Let me know if you want a rundown of Prompt Engineering vs Context Engineering.
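
A minimal sketch of how the four steps might fit together in one request cycle, with hypothetical memory_store, search(), summarize(), count_tokens(), and llm() helpers; it shows the orchestration pattern, not any particular framework's API.

```python
# Sketch of the 4 Cs as one request cycle. All helpers (memory_store, search,
# summarize, count_tokens, llm) are hypothetical stand-ins, not a real framework API.
TOKEN_BUDGET = 8_000

def handle_turn(user_message: str, session_id: str) -> str:
    # 1. Save: persist the raw turn outside the context window for later reuse.
    memory_store.append(session_id, {"role": "user", "content": user_message})

    # 2. Select: pull only the context relevant to this message (search, memory lookup, prior turns).
    context = search(memory_store, query=user_message, k=8)

    # 3. Compress: stay inside the token budget by summarizing the overflow.
    while count_tokens(context) > TOKEN_BUDGET:
        context = summarize(context)

    # 4. Isolate: scope the call to one subtask with only the selected context.
    reply = llm(system="Answer using only the provided context.",
                context=context, user=user_message)

    memory_store.append(session_id, {"role": "assistant", "content": reply})
    return reply
```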
