Impact of Decontextualization on LLM Performance


Summary

Decontextualization is the loss or removal of the surrounding context that large language models (LLMs) rely on to understand and answer questions. Recent research shows that stripping away context, such as document structure, formatting, or the order of facts, can sharply reduce LLM performance, especially on long documents and complex reasoning tasks.

  • Preserve structure: Retain meaningful formatting and organizational details when preparing data for LLMs, as these elements help the model interpret information accurately.
  • Prioritize relevance: Select information that is directly related to your query instead of simply increasing the amount of data included, since too much irrelevant content can distract or confuse the LLM.
  • Manage context size: Avoid overwhelming LLMs with excessively long inputs, and use retrieval techniques to provide focused, useful context for better problem solving and reasoning.
Summarized by AI based on LinkedIn member posts
  • Philipp Schmid

    AI Developer Experience at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News

    162,859 followers

    How good are LLMs at long-context RAG? Databricks Mosaic Research ran over 2,000 experiments on 13 open and closed LLMs across 4 curated RAG datasets. 👀 TL;DR:
    📈 Retrieving more documents generally improves RAG performance, but only up to a point.
    🔬 Most models show decreased performance beyond a model-specific context size (e.g., 32k for Llama-3.1-405b, 64k for GPT-4).
    🆚 Different models fail in distinct ways with long contexts (e.g., copyright concerns, summarizing instead of answering).
    🚫 Claude 3.5 copyright-related failures increased from 3.7% at 16k context to 49.5% at 64k context.
    👎🏻 DBRX failures to follow instructions jumped from 5.2% at 8k context to 50.4% at 32k.
    🔄 Mixtral started generating repeated content ("梦梦梦梦梦梦").
    🤷🏻‍♂️ LLMs still suffer from the "lost in the middle" problem, failing to use information from the middle portions of long texts effectively.
    📊 The optimal context size depends on both the model and the specific task.
    🧠 Lack of long-context post-training may be the reason for model failures.
    Blog: https://lnkd.in/eCXDuZPP
    A good example of where fine-tuning on domain/task-specific long-context synthetic datasets could significantly improve the performance of open models compared to closed models. 🚀
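
A practical takeaway from these numbers is to cap how much retrieved text you hand the model instead of maxing out the window. Below is a minimal sketch of that idea; the per-model budgets and the rough token estimate are illustrative assumptions, not the study's measured saturation points.

```python
# Sketch: cap retrieved RAG context at a per-model token budget before prompting.
# The budget values below are placeholders to tune per model and task.

from typing import List

# Hypothetical "sweet spot" context budgets (in tokens) per model.
CONTEXT_BUDGET = {
    "llama-3.1-405b": 32_000,
    "gpt-4": 64_000,
}

def rough_token_count(text: str) -> int:
    # Crude heuristic (~4 characters per token); swap in a real tokenizer for production use.
    return max(1, len(text) // 4)

def build_context(ranked_chunks: List[str], model: str, reserve_for_prompt: int = 2_000) -> str:
    """Greedily add retrieved chunks (already ranked by relevance) until the
    model-specific budget is reached, then stop."""
    budget = CONTEXT_BUDGET.get(model, 16_000) - reserve_for_prompt
    used, selected = 0, []
    for chunk in ranked_chunks:
        cost = rough_token_count(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)
```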

  • Cameron R. Wolfe, Ph.D.

    Research @ Netflix

    21,294 followers

    My favorite paper from NeurIPS’24 shows us that frontier LLMs don’t pay very close attention to their context windows…
    Needle In A Haystack: The needle-in-a-haystack test is the most common way to test LLMs with long context windows. The test is conducted via the following steps:
    1. Place a fact / statement within a corpus of text.
    2. Ask the LLM to generate the fact given the corpus as input.
    3. Repeat this test while increasing the size of the corpus and placing the fact at different locations.
    From this test, we see whether an LLM “pays attention” to different regions of a long context window, but it purely examines whether the LLM is able to recall information from its context.
    Where does this fall short? Most tasks being solved by LLMs require more than information recall. The LLM may need to perform inference, manipulate knowledge, or reason in order to solve a task. With this in mind, we might wonder whether we could generalize the needle-in-a-haystack test to analyze more complex LLM capabilities at different context lengths.
    BABILong generalizes the needle-in-a-haystack test to long-context reasoning. The LLM is tested on its ability to reason over facts that are distributed across very long text corpora. The reasoning tasks tested include fact chaining, induction, deduction, counting, list / set comprehension, and more. Such reasoning tasks are challenging, especially when the necessary information is scattered across a large context window.
    “Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity.” - BABILong paper
    Can LLMs reason over long context? The BABILong paper shows that most frontier LLMs struggle to solve long-context reasoning problems. Even top LLMs like GPT-4 and Gemini-1.5 seem to consistently use only ~20% of their context window. In fact, most LLMs struggle to answer questions about facts in texts longer than 10,000 tokens!
    What can we do about this? First, just be aware of this finding! Be wary of using super long contexts, as they might deteriorate the LLM’s ability to solve more complex problems that require reasoning. However, the BABILong paper shows that these issues can be mitigated with a few different approaches:
    - Using RAG is helpful. However, this approach only works up to a certain context length and has limitations (e.g., it struggles with problems where the order of facts matters).
    - Recurrent transformers can answer questions about facts from very long contexts.
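
The three-step recipe above is simple enough to sketch directly. Here is a minimal harness, assuming a generic `call_llm(prompt) -> str` client; the needle text, filler sentence, and sweep sizes are invented for illustration.

```python
# Sketch of the needle-in-a-haystack procedure: hide one fact in filler text,
# vary corpus size and needle position, and ask the model to recall it.

NEEDLE = "The secret launch code is 7-4-1-9."
QUESTION = "What is the secret launch code?"
FILLER_SENTENCE = "The quick brown fox jumps over the lazy dog. "

def build_haystack(num_sentences: int, needle_position: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER_SENTENCE] * num_sentences
    insert_at = int(needle_position * num_sentences)
    sentences.insert(insert_at, NEEDLE + " ")
    return "".join(sentences)

def run_sweep(call_llm, sizes=(500, 2_000, 8_000), depths=(0.0, 0.5, 1.0)):
    """Check recall across corpus sizes and needle depths."""
    results = {}
    for size in sizes:
        for depth in depths:
            corpus = build_haystack(size, depth)
            prompt = f"{corpus}\n\nQuestion: {QUESTION}\nAnswer:"
            answer = call_llm(prompt)
            results[(size, depth)] = "7-4-1-9" in answer
    return results
```

BABILong extends this setup by replacing simple recall with reasoning over several scattered facts, which is exactly where the models above start to break down.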

  • Aishwarya Naresh Reganti

    Founder & CEO @ LevelUp Labs | Ex-AWS | Consulting, Training & Investing in AI

    114,025 followers

    🤔 This paper is super interesting: it shows that preserving the structure and semantics of HTML can actually boost LLM performance in RAG systems. The authors point out that most RAG systems strip HTML down to plain text, losing important details like headings, table layouts, and tags. This throws away context that could make a big difference in how well the LLM understands the data. To fix this, they propose HtmlRAG, which works in two steps:
    ⛳ HTML Cleaning: Gets rid of unnecessary elements like CSS, JavaScript, and comments while keeping the meaningful content. This shrinks the HTML to just 6% of its original size without losing context.
    ⛳ Block-Tree-Based Pruning: A more refined process with two stages:
    - Embedding-Based Pruning: Scores blocks based on how relevant they are to the query and tosses out the less important ones.
    - Generative Pruning: Uses a depth-first algorithm and token trees to ensure only the most relevant info stays.
    💡 HtmlRAG outperforms plain-text-based RAG systems across six QA datasets, showing that keeping HTML structure really pays off. What surprised me the most is that this works with off-the-shelf models that weren’t trained to handle HTML tags. It seems the semantic meaning those tags bring is enough to make a difference! Link: https://lnkd.in/ef-_dppd
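
The cleaning step is easy to approximate with off-the-shelf parsing. Below is a rough sketch using BeautifulSoup that strips scripts, styles, comments, and attributes while keeping the tag structure; it illustrates the idea and is not the authors' HtmlRAG implementation.

```python
# Sketch of the "HTML cleaning" idea: drop non-content elements while keeping
# semantic tags (headings, tables, lists) intact for the LLM.

from bs4 import BeautifulSoup, Comment  # pip install beautifulsoup4

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")

    # Remove elements that carry no content for the LLM.
    for tag in soup(["script", "style", "noscript", "svg", "iframe"]):
        tag.decompose()

    # Strip HTML comments.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Drop tag attributes (class, style, ids, ...) but keep the tags themselves,
    # so headings and table layout survive in the cleaned markup.
    for tag in soup.find_all(True):
        tag.attrs = {}

    return str(soup)
```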

  • Shivani Virdi

    AI Engineering | Founder @ NeoSage | ex-Microsoft • AWS • Adobe | Teaching 70K+ How to Build Production-Grade GenAI Systems

    73,948 followers

    If you think bigger LLM context windows help, think again. Here are 10 areas where they fail:
    1. 𝗟𝗮𝘁𝗲𝗻𝗰𝘆 𝘀𝗰𝗮𝗹𝗲𝘀 𝘄𝗶𝘁𝗵 𝗰𝗼𝗻𝘁𝗲𝘅𝘁: More tokens = more comparisons. Attention grows quadratically, so long inputs slow everything down. System impact: Multi-hop workflows like agents become bottlenecked. Chained LLM calls with full context? Your system feels frozen.
    2. 𝗔𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻 𝗴𝗲𝘁𝘀 𝗱𝗶𝗹𝘂𝘁𝗲𝗱: The more you add, the harder it is for the model to know what matters. System impact: Core facts get drowned out. You might have included them, but they’re not 𝘶𝘴𝘦𝘥.
    3. 𝗟𝗟𝗠𝘀 𝗮𝗿𝗲𝗻’𝘁 𝗴𝗼𝗼𝗱 𝗮𝘁 𝗻𝗲𝗲𝗱𝗹𝗲-𝗶𝗻-𝗮-𝗵𝗮𝘆𝘀𝘁𝗮𝗰𝗸: They don’t search, they pattern match. If the answer isn’t obvious or recent, it might get skipped entirely. System impact: Even if the answer is in the prompt, the model might miss it completely.
    4. 𝗦𝘂𝗺𝗺𝗮𝗿𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗶𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝗲𝘀 𝗹𝗼𝘀𝘀𝘆 𝗰𝗼𝗺𝗽𝗿𝗲𝘀𝘀𝗶𝗼𝗻: To squeeze more into context, teams compress. But that strips nuance, rationale, and edge cases. System impact: Downstream LLM reasoning gets generic. Your system “almost” answers everything, but never sharply.
    5. 𝗟𝗼𝗻𝗴 𝗰𝗼𝗻𝘁𝗲𝘅𝘁𝘀 𝗳𝗮𝘃𝗼𝗿 𝗿𝗲𝗰𝗲𝗻𝘁 𝘁𝗼𝗸𝗲𝗻𝘀: Models often give more weight to what comes last, especially in autoregressive setups. System impact: Older instructions, facts, or injected prompts fade away, leading to ignored constraints or context drift.
    6. 𝗥𝗲𝗽𝗲𝘁𝗶𝘁𝗶𝘃𝗲 𝗶𝗻𝗽𝘂𝘁𝘀 𝗯𝗹𝘂𝗿 𝗰𝗼𝗻𝗰𝗲𝗽𝘁𝘀: Similar content inside long prompts tends to get mixed up or merged together. System impact: You see weird blends in answers: part of doc A, part of doc B, a fully wrong synthesis.
    7. 𝗕𝗶𝗴𝗴𝗲𝗿 𝘄𝗶𝗻𝗱𝗼𝘄𝘀 ≠ 𝗯𝗲𝘁𝘁𝗲𝗿 𝗴𝗿𝗼𝘂𝗻𝗱𝗶𝗻𝗴: LLMs try to use everything you send. Without structure, they can’t tell what’s signal vs noise. System impact: Answers sound confident, but are stitched from partial truths. False confidence skyrockets.
    8. 𝗖𝗼𝘀𝘁 𝗴𝗿𝗼𝘄𝘀 𝗹𝗶𝗻𝗲𝗮𝗿𝗹𝘆, 𝗯𝘂𝘁 𝘃𝗮𝗹𝘂𝗲 𝗱𝗼𝗲𝘀𝗻’𝘁: More tokens increase your bill, but not your accuracy. Past a point, you just pay more to get the same (or worse). System impact: You’re scaling spend for minimal quality improvement (or even degradation).
    9. 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝘀𝗶𝘇𝗲 ≠ 𝗺𝗲𝗺𝗼𝗿𝘆: A prompt is a one-time input. There’s no persistence, just a giant scratchpad that gets wiped after use. System impact: Builders confuse long prompts for memory. Real memory requires external state management.
    10. 𝗠𝗼𝘀𝘁 𝗰𝗼𝗻𝘁𝗲𝘅𝘁 𝗽𝗿𝗼𝗯𝗹𝗲𝗺𝘀 𝗮𝗿𝗲 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 𝗽𝗿𝗼𝗯𝗹𝗲𝗺𝘀 𝗶𝗻 𝗱𝗶𝘀𝗴𝘂𝗶𝘀𝗲: The real issue often isn’t size, it’s relevance. You need RAG. System impact: Bigger context becomes a crutch. Smart systems optimize 𝘸𝘩𝘢𝘵 is passed, not 𝘩𝘰𝘸 𝘮𝘶𝘤𝘩.
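
Point 10 is the most actionable: score candidate chunks against the query and pass only the best few. Here is a minimal sketch, where `embed` stands in for any sentence-embedding function you already have (an assumption, not a specific library).

```python
# Sketch: treat an oversized prompt as a retrieval problem by selecting only
# the chunks most relevant to the query, rather than passing the whole corpus.

import math
from typing import Callable, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_relevant(query: str, chunks: List[str],
                    embed: Callable[[str], Sequence[float]],
                    top_k: int = 5) -> List[str]:
    """Return the top_k chunks ranked by cosine similarity to the query."""
    q_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    return ranked[:top_k]
```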

  • Adam Chan

    Bringing developers together to build epic projects with epic tools!

    9,159 followers

    As you build your next agent or optimize an existing one, ask yourself: is everything in this context earning its keep? If not, here are six ways to fix it.
    As research like “Lost in the Middle” has shown, LLMs don’t treat every token in their context window equally, and performance degrades in surprising ways as input length grows (a pattern observed across 18 models, including GPT-4, Claude, and Gemini). Four key failure modes have been put in the spotlight:
    • 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗣𝗼𝗶𝘀𝗼𝗻𝗶𝗻𝗴 - Errors enter the context and get repeatedly referenced
    • 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗗𝗶𝘀𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻 - The model over-focuses on accumulated context instead of what it learned in training
    • 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗖𝗼𝗻𝗳𝘂𝘀𝗶𝗼𝗻 - Superfluous content degrades response quality
    • 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗖𝗹𝗮𝘀𝗵 - Conflicting information degrades reasoning
    Here are 6 proven techniques to fix these issues:
    1️⃣ 𝗥𝗔𝗚 - Selectively add only relevant information
    2️⃣ 𝗧𝗼𝗼𝗹 𝗟𝗼𝗮𝗱𝗼𝘂𝘁 - Choose only the tools relevant to the task at hand
    3️⃣ 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗤𝘂𝗮𝗿𝗮𝗻𝘁𝗶𝗻𝗲 - Isolate contexts in dedicated threads
    4️⃣ 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗣𝗿𝘂𝗻𝗶𝗻𝗴 - Remove irrelevant information
    5️⃣ 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗦𝘂𝗺𝗺𝗮𝗿𝗶𝘇𝗮𝘁𝗶𝗼𝗻 - Condense verbose content
    6️⃣ 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗢𝗳𝗳𝗹𝗼𝗮𝗱𝗶𝗻𝗴 - Store information outside the LLM context
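
As one concrete illustration of the last technique, context offloading can be as simple as an external scratchpad that an agent writes to between calls and reads back selectively. This is a minimal sketch; the `Scratchpad` class and the example keys are hypothetical, not taken from a particular framework.

```python
# Sketch of "context offloading": keep working notes in external state instead
# of the prompt, and pull back only what the next step needs.

import json
from pathlib import Path

class Scratchpad:
    """Tiny external store an agent can write to between LLM calls."""

    def __init__(self, path: str = "scratchpad.json"):
        self.path = Path(path)
        self.state = json.loads(self.path.read_text()) if self.path.exists() else {}

    def write(self, key: str, value: str) -> None:
        self.state[key] = value
        self.path.write_text(json.dumps(self.state, indent=2))

    def read(self, keys: list) -> str:
        """Return only the requested notes, formatted for the next prompt."""
        return "\n".join(f"{k}: {self.state[k]}" for k in keys if k in self.state)

# Usage idea: instead of replaying the full conversation, the agent stores
# findings ("user_goal", "files_reviewed", ...) via pad.write(...) and injects
# just the relevant keys into the next prompt with pad.read(["user_goal"]).
```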
