The interview is for a Generative AI Engineer role at Cohere.

Interviewer: "Your client complains that the LLM keeps losing track of earlier details in a long chat. What's happening?"

You: "That's a classic context window problem. Every LLM has a fixed memory limit - say 8k, 32k, or 200k tokens. Once that's exceeded, earlier tokens get dropped or compressed, and the model literally forgets."

Interviewer: "So you just buy a bigger model?"

You: "You can, but that's like using a megaphone when you need a microphone. A larger context window costs more, runs slower, and doesn't always reason better."

Interviewer: "Then how do you manage long-term memory?"

You:
1. Summarization memory - periodically condense earlier chat segments into concise summaries.
2. Vector memory - store older context as embeddings; retrieve only the relevant pieces later.
3. Hybrid memory - combine summaries for continuity and retrieval for precision.

Interviewer: "So you're basically simulating memory?"

You: "Yep. LLMs are stateless by design. You build memory on top of them - a retrieval layer that acts like long-term memory. Otherwise, your chatbot becomes a goldfish."

Interviewer: "And how do you know if the memory strategy works?"

You: "When the system recalls context correctly without bloating cost or latency. If a user says, 'Remind me what I told you last week,' and it answers from stored embeddings - that's memory done right."

Interviewer: "So context management isn't a model issue - it's an architecture issue?"

You: "Exactly. Most think 'context length' equals intelligence. But true intelligence is recall with relevance - not recall with redundancy."

#ai #genai #llms #rag #memory
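A minimal sketch of the hybrid memory idea from the interview above: keep a rolling summary of old turns for continuity, plus an embedding store for precise recall. `summarize_fn` and `embed_fn` are placeholders for whatever summarization and embedding models you actually use, not any specific API.

```python
# Hybrid conversation memory: rolling summary + vector retrieval (illustrative sketch).
import numpy as np

class HybridMemory:
    def __init__(self, summarize_fn, embed_fn, max_recent_turns=10):
        self.summarize_fn = summarize_fn      # callable: list[str] -> str (hypothetical)
        self.embed_fn = embed_fn              # callable: str -> np.ndarray (hypothetical)
        self.max_recent_turns = max_recent_turns
        self.recent = []                      # verbatim recent turns
        self.summary = ""                     # rolling summary of older turns
        self.store = []                       # (embedding, text) pairs for retrieval

    def add_turn(self, text: str):
        self.recent.append(text)
        self.store.append((self.embed_fn(text), text))
        if len(self.recent) > self.max_recent_turns:
            # Fold the oldest turns into the rolling summary to stay within budget.
            overflow = self.recent[:-self.max_recent_turns]
            self.recent = self.recent[-self.max_recent_turns:]
            self.summary = self.summarize_fn([self.summary] + overflow)

    def build_context(self, query: str, k: int = 3) -> str:
        # Retrieve the k stored turns most similar to the query (cosine similarity).
        q = self.embed_fn(query)
        scored = sorted(
            self.store,
            key=lambda item: float(np.dot(item[0], q)
                                   / (np.linalg.norm(item[0]) * np.linalg.norm(q) + 1e-8)),
            reverse=True,
        )
        retrieved = [text for _, text in scored[:k]]
        return "\n".join(
            ["Summary of earlier conversation: " + self.summary]
            + ["Relevant earlier detail: " + t for t in retrieved]
            + self.recent
        )
```

The prompt sent to the model then contains the summary (continuity), a few retrieved details (precision), and the verbatim recent turns, instead of the entire chat history.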
LLM Performance and Coherence Challenges
Summary
LLM performance and coherence challenges refer to the difficulties large language models (LLMs) face in maintaining accuracy, consistency, and context over extended interactions or complex tasks. These issues can include memory limitations, loss of relevant information, unreliable reasoning in long conversations, and unpredictable behavior when handling real-world scenarios.
- Assess context limits: Remember that LLMs have built-in memory restrictions, so avoid designing applications around lengthy or sprawling conversations without building in context management strategies (a minimal trimming sketch follows this list).
- Build robust workflows: Use clear steps and tightly scoped tasks whenever possible, as LLMs perform best when problems are broken down into structured, repeatable actions.
- Monitor and update: Regularly test and adjust your prompts and retrieval systems, since models can change over time and may handle information differently as your data or requirements evolve.
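The simplest context-management strategy from the first point above is to keep only as many recent turns as fit a token budget. This sketch uses a rough 4-characters-per-token estimate as a stand-in for a real tokenizer; swap in your model's tokenizer for accurate counts.

```python
# Keep the newest conversation turns that fit within a token budget (illustrative sketch).
def estimate_tokens(text: str) -> int:
    # Crude heuristic, NOT a real tokenizer: ~4 characters per token.
    return max(1, len(text) // 4)

def trim_to_budget(turns: list[str], budget_tokens: int) -> list[str]:
    kept, used = [], 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order
```

Trimming alone loses old details, which is why the summarization and retrieval strategies discussed elsewhere in this page are usually layered on top.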
-
Most people still think of LLMs as "just a model." But if you've ever shipped one in production, you know it's not that simple. Behind every performant LLM system, there's a stack of decisions about pretraining, fine-tuning, inference, evaluation, and application-specific tradeoffs. This diagram captures it well: LLMs aren't one-dimensional. They're systems. And each dimension introduces new failure points or optimization levers. Let's break it down:

🧠 Pre-Training
Start with modality.
→ Text-only models like LLaMA, UL2, and PaLM have predictable inductive biases.
→ Multimodal ones like GPT-4, Gemini, and LaVIN introduce more complex token fusion, grounding challenges, and cross-modal alignment issues.
Understanding the data diet matters just as much as parameter count.

🛠 Fine-Tuning
This is where most teams underestimate complexity:
→ PEFT strategies like LoRA and Prefix Tuning help with parameter efficiency, but can behave differently under distribution shift.
→ Alignment techniques (RLHF, DPO, RAFT) aren't interchangeable. They encode different human preference priors.
→ Quantization and pruning decisions will directly impact latency, memory usage, and downstream behavior.

⚡️ Efficiency
Inference optimization is still underexplored. Techniques like dynamic prompt caching, paged attention, speculative decoding, and batch streaming make the difference between real-time and unusable. The infra layer is where GenAI products often break.

📏 Evaluation
One benchmark doesn't cut it. You need a full matrix:
→ NLG (summarization, completion) and NLU (classification, reasoning),
→ alignment tests (honesty, helpfulness, safety),
→ dataset quality, and
→ cost breakdowns across training + inference + memory.
Evaluation isn't just a model task; it's a systems-level concern.

🧾 Inference & Prompting
Multi-turn prompts, CoT, ToT, and ICL all behave differently under different sampling strategies and context lengths. Prompting isn't trivial anymore. It's an orchestration layer in itself.

Whether you're building for legal, education, robotics, or finance, the "general-purpose" tag doesn't hold. Every domain has its own retrieval, grounding, and reasoning constraints.

-------
Follow me (Aishwarya Srinivasan) for more AI insights and subscribe to my Substack for more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
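To make the fine-tuning point above concrete, here is a minimal PEFT/LoRA sketch using the Hugging Face `peft` and `transformers` libraries. The base checkpoint name and the target attention projections are illustrative assumptions; check which modules your architecture exposes before reusing them.

```python
# Attach LoRA adapters to a causal LM so only a small fraction of weights train (sketch).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint
lora_cfg = LoraConfig(
    r=8,                                   # low-rank dimension: adapter capacity vs. size
    lora_alpha=16,                         # scaling applied to the adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # typically well under 1% of total parameters
```

Parameter efficiency is exactly why behavior under distribution shift needs separate evaluation: the frozen base model still dominates what the adapted model can and cannot do.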
-
Some challenges in building LLM-powered applications (including RAG systems) for large companies:

1. Hallucinations are very damaging to the brand. It only takes one for people to lose faith in the tool completely. Contrary to popular belief, RAG doesn't fix hallucinations.
2. Chunking a knowledge base is not straightforward. This leads to poor context retrieval, which leads to bad answers from a model powering a RAG system.
3. As information changes, you also need to change your chunks and embeddings. Depending on the complexity of the information, this can become a nightmare.
4. Models are black boxes. We only have access to modify their inputs (prompts), but it's hard to determine cause-effect when troubleshooting (e.g., Why is "Produce concise answers" working better than "Reply in short sentences"?)
5. Prompts are too brittle. Every new version of a model can cause your previous prompts to stop working. Unfortunately, you don't know why or how to fix them (see #4 above.)
6. It is not yet clear how to reliably evaluate production systems.
7. Costs and latency are still significant issues. The best models out there cost a lot of money and are very slow. Cheap and fast models have very limited applicability.
8. There are not enough qualified people to deal with these issues. I cannot highlight this problem enough.

You may encounter one or more of these problems in a project at once. Depending on your requirements, some of these issues may be showstoppers (hallucinating direction instructions for a robot) or simple nuances (support agent hallucinating an incorrect product description.) There's still a lot of work to do until these systems mature to a point where they are viable for most use cases.
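A minimal sketch of the fixed-size chunking mentioned in point 2, using a whitespace word count as a rough token proxy (an assumption, not a real tokenizer). It illustrates why chunking is not straightforward: arbitrary boundaries can split a sentence or a fact across chunks, and overlap only reduces, not eliminates, that loss.

```python
# Naive fixed-size chunking with overlap (illustrative sketch, word-based token proxy).
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap   # step back so adjacent chunks share some context
    return chunks
```

Point 3 follows directly from this: every time the source document changes, the affected chunks must be re-cut and re-embedded, which is why chunk bookkeeping becomes a maintenance burden.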
-
My favorite paper from NeurIPS'24 shows us that frontier LLMs don't pay very close attention to their context windows…

Needle In A Haystack: The needle in a haystack test is the most common way to test LLMs with long context windows. The test is conducted via the following steps:
1. Place a fact / statement within a corpus of text.
2. Ask the LLM to generate the fact given the corpus as input.
3. Repeat this test while increasing the size of the corpus and placing the fact at different locations.
From this test, we see if an LLM "pays attention" to different regions of a long context window, but this test purely examines whether the LLM is able to recall information from its context.

Where does this fall short? Most tasks being solved by LLMs require more than information recall. The LLM may need to perform inference, manipulate knowledge, or reason in order to solve a task. With this in mind, we might wonder if we could generalize the needle in a haystack test to analyze more complex LLM capabilities under different context lengths.

BABILong generalizes the needle in a haystack test to perform long context reasoning. The LLM is tested based upon its ability to reason over facts that are distributed in very long text corpora. Reasoning tasks that are tested include fact chaining, induction, deduction, counting, list / set comprehension, and more. Such reasoning tasks are challenging, especially when necessary information is scattered in a large context window.

"Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity." - BABILong paper

Can LLMs reason over long context? We see in the BABILong paper that most frontier LLMs struggle to solve long context reasoning problems. Even top LLMs like GPT-4 and Gemini-1.5 seem to consistently use only ~20% of their context window. In fact, most LLMs struggle to answer questions about facts in texts longer than 10,000 tokens!

What can we do about this? First, we should just be aware of this finding! Be wary of using super long contexts, as they might deteriorate the LLM's ability to solve more complex problems that require reasoning. However, we see in the BABILong paper that these issues can be mitigated with a few different approaches:
- Using RAG is helpful. However, this approach only works up to a certain context length and has limitations (e.g., struggles to solve problems where the order of facts matters).
- Recurrent transformers can answer questions about facts from very long contexts.
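A minimal harness for the needle-in-a-haystack procedure described above. `ask_llm`, the filler sentence, and the needle are placeholders for whatever client and test material you use; the pass check is a deliberately crude string match.

```python
# Bury a fact at a chosen depth in a synthetic corpus and check whether the model recalls it (sketch).
def run_needle_test(ask_llm, filler_sentence: str, needle: str, question: str,
                    corpus_sentences: int, depth_fraction: float) -> bool:
    corpus = [filler_sentence] * corpus_sentences
    position = int(depth_fraction * len(corpus))       # where the needle is buried
    corpus.insert(position, needle)
    prompt = " ".join(corpus) + "\n\nQuestion: " + question
    answer = ask_llm(prompt)                           # hypothetical LLM call
    # Crude recall check: does the answer contain the needle's key token?
    return needle.split()[-1].rstrip(".").lower() in answer.lower()

# Step 3 of the test is a sweep over corpus size and needle depth, e.g.:
# for n in (100, 1_000, 10_000):
#     for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
#         run_needle_test(ask_llm, filler, needle, question, n, depth)
```

BABILong's contribution is to replace the single recall check with reasoning tasks over several such needles, which is where the 10-20% effective-context finding comes from.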
-
Many companies are diving into AI agents without a clear framework for when they are appropriate or how to assess their effectiveness. Several recent benchmarks offer a more structured view of where LLM agents are effective and where they are not.

LLM agents consistently perform well in short, structured tasks involving tool use. A March 2025 survey on evaluation methods highlights their ability to decompose problems into tool calls, maintain state across multiple steps, and apply reflection to self-correct. Architectures like PLAN-and-ACT and AgentGen, which incorporate Monte Carlo Tree Search, improve task completion rates by 8 to 15 percent across domains such as information retrieval, scripting, and constrained planning.

Structured hybrid pipelines are another area where agents perform reliably. Benchmarks like ThinkGeo and ToolQA show that when paired with stable interfaces and clearly defined tool actions, LLMs can handle classification, data extraction, and logic operations at production-grade accuracy.

The performance drops sharply in more complex settings. In Vending-Bench, agents tasked with managing a vending operation over extended interactions failed after roughly 20 million tokens. They lost track of inventory, misordered events, or repeated actions indefinitely. These breakdowns occurred even when the full context was available, pointing to fundamental limitations in long-horizon planning and execution logic.

SOP-Bench further illustrates this boundary. Across 1,800 real-world industrial procedures, Function-Calling agents completed only 27 percent of tasks. When exposed to larger tool registries, performance degraded significantly. Agents frequently selected incorrect tools, despite having structured metadata and step-by-step guidance.

These findings suggest that LLM agents work best when the task is tightly scoped, repeatable, and structured around deterministic APIs. They consistently underperform when the workflow requires extended decision-making, coordination, or procedural nuance.

To formalize this distinction, I use the SMART framework to assess agent fit:
• 𝗦𝗰𝗼𝗽𝗲 & 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 – Is the process linear and clearly defined?
• 𝗠𝗲𝘁𝗿𝗶𝗰𝘀 & 𝗠𝗲𝗮𝘀𝘂𝗿𝗲𝗺𝗲𝗻𝘁 – Is there sufficient volume and quantifiable ROI?
• 𝗔𝗰𝗰𝗲𝘀𝘀 & 𝗔𝗰𝘁𝗶𝗼𝗻𝗮𝗯𝗶𝗹𝗶𝘁𝘆 – Are tools and APIs integrated and callable?
• 𝗥𝗶𝘀𝗸 & 𝗥𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆 – Can failures be logged, audited, and contained?
• 𝗧𝗲𝗺𝗽𝗼𝗿𝗮𝗹 𝗟𝗲𝗻𝗴𝘁𝗵 – Is the task short, self-contained, and episodic?

When all five criteria are met, agentic automation is likely to succeed. When even one is missing, the use case may require redesign before introducing LLM agents. The strongest agent implementations I've seen start with ruthless scoping, not ambitious scale. What filters do you use before greenlighting an AI agent?
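One illustrative (not canonical) way to operationalize the SMART screen above is as a simple all-or-nothing checklist, reflecting the post's rule that a single missing criterion means redesign before automation.

```python
# SMART agent-fit checklist as a go/no-go gate (illustrative encoding of the framework above).
from dataclasses import dataclass

@dataclass
class SmartAssessment:
    scoped_and_structured: bool   # linear, clearly defined process
    measurable: bool              # sufficient volume and quantifiable ROI
    actionable: bool              # tools and APIs integrated and callable
    reliable: bool                # failures can be logged, audited, contained
    temporally_bounded: bool      # short, self-contained, episodic task

    def greenlight(self) -> bool:
        # Agentic automation only when every criterion holds.
        return all(vars(self).values())

# Example: a long-horizon workflow fails the temporal criterion, so it is flagged for redesign.
print(SmartAssessment(True, True, True, True, False).greenlight())  # False
```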
-
Tackling Hallucination in LLMs: Mitigation & Evaluation Strategies

As Large Language Models (LLMs) redefine how we interact with AI, one critical challenge is hallucination—when models generate false or misleading responses. This issue affects the reliability of LLMs, particularly in high-stakes applications like healthcare, legal, and education. To ensure trustworthiness, it’s essential to adopt robust strategies for mitigating and evaluating hallucination. The workflow outlined above presents a structured approach to addressing this challenge:

1️⃣ Hallucination QA Set Generation
Starting with a raw corpus, we process knowledge bases and apply weighted sampling to create diverse, high-quality datasets. This includes generating baseline questions, multi-context queries, and complex reasoning tasks, ensuring a comprehensive evaluation framework. Rigorous filtering and quality checks ensure datasets are robust and aligned with real-world complexities.

2️⃣ Hallucination Benchmarking
By pre-processing datasets, answers are categorized as correct or hallucinated, providing a benchmark for model performance. This phase involves tools like classification models and text generation to assess reliability under various conditions.

3️⃣ Hallucination Mitigation Strategies
- In-Context Learning: Enhancing output reliability by incorporating examples directly in the prompt.
- Retrieval-Augmented Generation: Supplementing model responses with real-time data retrieval.
- Parameter-Efficient Fine-Tuning: Fine-tuning targeted parts of the model for specific tasks.

By implementing these strategies, we can significantly reduce hallucination risks, ensuring LLMs deliver accurate and context-aware responses across diverse applications.

💡 What strategies do you employ to minimize hallucination in AI systems? Let’s discuss and learn together in the comments!
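A minimal retrieval-augmented generation sketch for the mitigation strategies above: retrieve the most relevant documents and instruct the model to answer only from them. The embedding model name is an illustrative assumption; any sentence-embedding model would do.

```python
# Build a grounded prompt from the top-k most similar documents (illustrative RAG sketch).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model

def build_grounded_prompt(question: str, documents: list[str], k: int = 3) -> str:
    doc_vecs = encoder.encode(documents, normalize_embeddings=True)
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec                        # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]               # indices of the k best matches
    context = "\n".join(documents[i] for i in top)
    return (
        "Answer using ONLY the context below. If the answer is not in the context, "
        "say you don't know.\n\nContext:\n" + context + "\n\nQuestion: " + question
    )
```

Grounding reduces, but does not eliminate, hallucination (as the RAG post earlier on this page notes), which is why the benchmarking step above remains necessary.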
-
LLMs' apparent understanding runs deeper than we thought. New research reveals a pervasive illusion: meet 'Potemkin Understanding.'

I've gone through another research paper, in depth, and this one's worth your while (I think). This groundbreaking paper, "Potemkin Understanding in Large Language Models", directly challenges the assumption that high benchmark scores mean large language models truly understand. Researchers from MIT, UChicago, and Harvard have identified a critical failure mode they call 'Potemkin Understanding'.

Think of it as an LLM building a perfect-looking facade of knowledge. It can flawlessly define a concept, even pass tests, but its internal understanding is fundamentally incoherent, unlike any human. It might explain a perfect rhyming scheme, then write a poem that fails to rhyme. This illusion of comprehension is where LLMs answer complex questions correctly yet fundamentally misunderstand concepts in ways no human would. They often can't tell you when they're truly right or dangerously wrong.

Some of this you may think: yes, but we've had this before, Markus. Well, it turns out this phenomenon's scale extends far beyond the occasional errors we are already aware of. The paper finds Potemkins are ubiquitous across models, tasks, and domains, exposing a deeper internal incoherence in concept representations. Critically, this invalidates existing benchmarks as measures of true understanding.

This research scientifically validates what many of us have argued: flawless output doesn't equate to genuine understanding. It underscores the critical need for human judgment and the "expert in the loop" to discern genuine insight from mere statistical mimicry. This directly reinforces themes I've explored in "Thinking Machines That Don't", an article publishing at The Learning Guild this week, and the imperative for critical human discernment.

This is essential reading for anyone relying on LLMs for strategic decisions. Read the full paper here: https://lnkd.in/gsckwVA3

Would love to hear your thoughts.

#AIStrategy #TheEndeavorReport #AppliedAI
-
Unique Complexities of LLM Inference:

LLM inference poses interesting challenges that I haven't really experienced in other areas of HPC, because you must optimize for both throughput and latency in an incredibly dynamic environment. This poses challenges - generally, throughput and latency are at odds with one another: when you optimize for one, you hurt the other. But in the inference case, you have a real-time user sending prompts over the internet and expecting a response. They want to receive text at least at the rate they can read, so they are not left waiting (how impatient we are 😁). However, in a production environment, you may have thousands of users sending prompts. The more prompts you process at a time, the better your throughput gets but the worse your latency gets. Conversely, smaller batches help latency but kill throughput.

In many ways, it's the age-old web server problem, except for a unique difference - GPUs are specifically designed to be throughput monsters with expected latency overhead. Many techniques for performant GPU programming are about hiding that latency during processing of data. Furthermore, larger models don't fit in a single GPU and need sharding across multiple GPUs. Very large models need parallelism both at the GPU level and at the node level. All of this incurs communication overhead, which affects both throughput and latency.

My favorite project right now is vLLM because they are dealing with these complexities in pretty cool ways - from optimizing the control plane on the CPU with techniques like adaptive batching and efficient queueing and scheduling, to optimizing memory on the GPU using dynamic memory management, paged attention, and interesting pipelining. The rate at which their software is improving is also pretty cool - I benchmark their inference a lot and see consistent improvements from release to release. I've learned a lot from diving into their code and watching their office hours on YouTube. If you are interested in high-performance inference, it's a great project to check out.

If you like my content, feel free to follow or connect! #softwareengineering #gpus
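A small vLLM usage sketch for the offline batch case discussed above. The model name, parallelism degree, and sampling values are illustrative assumptions; the calls follow vLLM's offline LLM/SamplingParams interface, which can shift between releases.

```python
# Offline batch inference with vLLM: the engine batches requests internally
# (continuous batching + paged attention), trading some per-request latency
# for much higher aggregate throughput.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed checkpoint
          tensor_parallel_size=1)                        # raise to shard across GPUs
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

prompts = [
    "Explain paged attention in one sentence.",
    "Why do large batches hurt per-request latency?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

For the interactive, thousands-of-users case described above, the same engine is typically run as a server with streaming, where the batching/scheduling tradeoffs become visible as time-to-first-token versus tokens-per-second.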
-
Leading large language models (LLMs) are trained on public data. However, the majority of the world's data is dark data that is not publicly accessible, mainly in the form of private organizational data or enterprise data.

𝐓𝐡𝐞 𝐚𝐮𝐭𝐡𝐨𝐫𝐬 𝐬𝐡𝐨𝐰 𝐭𝐡𝐚𝐭 𝐭𝐡𝐞 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 𝐨𝐟 𝐦𝐞𝐭𝐡𝐨𝐝𝐬 𝐛𝐚𝐬𝐞𝐝 𝐨𝐧 𝐋𝐋𝐌𝐬 𝐬𝐞𝐫𝐢𝐨𝐮𝐬𝐥𝐲 𝐝𝐞𝐠𝐫𝐚𝐝𝐞𝐬 𝐰𝐡𝐞𝐧 𝐭𝐞𝐬𝐭𝐞𝐝 𝐨𝐧 𝐫𝐞𝐚𝐥-𝐰𝐨𝐫𝐥𝐝 𝐞𝐧𝐭𝐞𝐫𝐩𝐫𝐢𝐬𝐞 𝐝𝐚𝐭𝐚𝐬𝐞𝐭𝐬. Current benchmarks, based on public data, overestimate the performance of LLMs. They release a new benchmark dataset, the Goby Benchmark, to advance discovery in enterprise data integration.

Based on their experience with this enterprise benchmark, 𝐭𝐡𝐞 𝐚𝐮𝐭𝐡𝐨𝐫𝐬 𝐩𝐫𝐨𝐩𝐨𝐬𝐞 𝐭𝐞𝐜𝐡𝐧𝐢𝐪𝐮𝐞𝐬 𝐭𝐨 𝐮𝐩𝐥𝐢𝐟𝐭 𝐭𝐡𝐞 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 𝐨𝐟 𝐋𝐋𝐌𝐬 𝐨𝐧 𝐞𝐧𝐭𝐞𝐫𝐩𝐫𝐢𝐬𝐞 𝐝𝐚𝐭𝐚, 𝐢𝐧𝐜𝐥𝐮𝐝𝐢𝐧𝐠: (1) 𝐡𝐢𝐞𝐫𝐚𝐫𝐜𝐡𝐢𝐜𝐚𝐥 𝐚𝐧𝐧𝐨𝐭𝐚𝐭𝐢𝐨𝐧, (2) 𝐫𝐮𝐧𝐭𝐢𝐦𝐞 𝐜𝐥𝐚𝐬𝐬-𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠, 𝐚𝐧𝐝 (3) 𝐨𝐧𝐭𝐨𝐥𝐨𝐠𝐲 𝐬𝐲𝐧𝐭𝐡𝐞𝐬𝐢𝐬. 𝐓𝐡𝐞𝐲 𝐬𝐡𝐨𝐰 𝐭𝐡𝐚𝐭, 𝐨𝐧𝐜𝐞 𝐭𝐡𝐞𝐬𝐞 𝐭𝐞𝐜𝐡𝐧𝐢𝐪𝐮𝐞𝐬 𝐚𝐫𝐞 𝐝𝐞𝐩𝐥𝐨𝐲𝐞𝐝, 𝐭𝐡𝐞 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 𝐨𝐧 𝐞𝐧𝐭𝐞𝐫𝐩𝐫𝐢𝐬𝐞 𝐝𝐚𝐭𝐚 𝐛𝐞𝐜𝐨𝐦𝐞𝐬 𝐨𝐧 𝐩𝐚𝐫 𝐰𝐢𝐭𝐡 𝐭𝐡𝐚𝐭 𝐨𝐟 𝐩𝐮𝐛𝐥𝐢𝐜 𝐝𝐚𝐭𝐚.

SOURCE: https://lnkd.in/gBC53NPb
-
LLMs struggle in multi-turn conversations. A new study (arXiv:2505.06120) shows that top models like GPT-4 and Gemini 2.5 drop ~39% in performance when conversations span multiple, underspecified turns. Why? They make early assumptions and rarely recover, leading to 112% more errors. This failure mode is universal and not solved by scale. We need better training and eval methods for real dialogue use cases, not just one-shot prompts. https://lnkd.in/gsrZhTSt