🚨 Why do we need to move beyond single-turn task evaluation of large language models (LLMs)? 🤔

I have long advocated for evaluation methods of LLMs and other GenAI applications in healthcare that reflect real clinical scenarios, rather than multiple-choice questions or clinical vignettes with medical jargon. For example, interactions between clinicians and patients typically involve multi-turn conversations.

🔬 A study by Microsoft and Salesforce tested 200,000 AI conversations, using large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. They selected a total of 15 LLMs from eight model families: OpenAI (GPT-4o-mini, GPT-4o, o3, and GPT-4.1), Anthropic (Claude 3 Haiku, Claude 3.7 Sonnet), Google’s Gemini (Gemini 2.5 Flash, Gemini 2.5 Pro), Meta’s Llama (Llama3.1-8B-Instruct, Llama3.3-70B-Instruct, Llama 4 Scout), AI2 OLMo-2-13B, Microsoft Phi-4, Deepseek-R1, and Cohere Command-A.

❓ The results?
❌ Multi-turn conversations resulted in an average 39% drop in performance across six generation tasks.
❌ Their analysis of conversations revealed a minor decline in aptitude and a significant increase in unreliability.

📉 Here's why LLMs stumble:
• 🚧 Premature assumptions derail conversations.
• 🗣️ Overly verbose replies confuse rather than clarify.
• 🔄 Difficulty adapting after initial mistakes.

😵‍💫 Simply put: when an AI goes off track early, it gets lost and does not recover.

✅ The authors advocate:
• Multi-turn conversations must become a priority.
• Better multi-turn testing is crucial. Single-turn tests just aren't realistic.
• Users should be aware of these limitations.

🔗 Link to the original paper is in the first comment 👇

#AI #ConversationalAI #LargeLanguageModels #LLMs
Challenges Faced by LLMs in Multi-Turn Conversations
Explore top LinkedIn content from expert professionals.
Summary
Large Language Models (LLMs) face significant challenges in multi-turn conversations, such as maintaining context, recovering from errors, and adapting to new information, leading to reduced reliability and accuracy. These hurdles highlight the need for realistic testing and improved conversational strategies to better mimic real-world interactions.
- Focus on realistic testing: Include multi-turn scenarios in evaluation processes to better assess and address the challenges LLMs face during dynamic conversations.
- Minimize information overload: Provide concise and structured prompts to reduce the risk of the model generating verbose or confusing responses.
- Handle errors proactively: Reset conversations or reframe incorrect assumptions when needed to help the model navigate back to the correct context.
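To make the "reset or reframe" recommendation concrete, here is a minimal sketch; the function name and prompt wording are illustrative assumptions, not taken from the paper. The idea is to restate the full requirements, explicitly name the wrong assumption, and start a clean call rather than arguing with the model inside the derailed thread.

```python
# Minimal sketch of "reset and reframe" (all names and wording are illustrative):
# instead of correcting the model inside the derailed conversation, restate the
# full requirements, call out the incorrect assumption, and start a fresh call.

def reframe_prompt(requirements: list[str], wrong_assumption: str, correction: str) -> str:
    reqs = "\n".join(f"- {r}" for r in requirements)
    return (
        "Please solve the following task from scratch.\n\n"
        f"Requirements:\n{reqs}\n\n"
        f"Note: do NOT assume {wrong_assumption}. Instead, {correction}\n"
    )

prompt = reframe_prompt(
    requirements=[
        "Summarize the customer interview transcript",
        "Keep the summary under 200 words",
    ],
    wrong_assumption="the interview is about pricing",
    correction="focus on onboarding friction, which is what the transcript covers.",
)
# Send `prompt` as the first message of a brand-new conversation (no prior history).
```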
-
In my AI+UXR workshops, I recommend starting a fresh chat each time you ask the LLM to do a significant task. Why? Because #UXresearch tools need to be reliable, and the more you talk to the LLM, the more that reliability takes a hit. This can introduce unknown errors.

This happens for several reasons, but here are a few big (albeit interrelated) ones:

1️⃣ LLMs can get lost even in short multi-turn conversations
According to recent research from Microsoft and Salesforce, providing instructions over multiple turns (vs. all at once upfront) can dramatically degrade the output of LLMs. This is true even for reasoning models like o3 and Deepseek-R1, which “deteriorate in similar ways.”

2️⃣ Past turns influence how the LLM weights different concepts
In the workshop, I show a conversation that continuously, subtly references safaris, until the LLM takes a hard turn and generates content with a giraffe in it. Every token influences future tokens, and repeated concepts (even inadvertent ones) can “prime” the model to produce unexpected output.

3️⃣ Every turn is an opportunity for “context poisoning”
“Context poisoning” is when inaccurate, irrelevant, or hallucinated information gets into the LLM context, causing misleading results or deviation from instructions. This is sometimes exploited to jailbreak LLMs, but it can happen unintentionally as well. In simple terms, bad assumptions early on are hard to recover from.

To avoid these issues, I recommend:
🧩 Starting the conversation from scratch any time you’re doing an important research task (including turning off memory and custom instructions)
🧩 Using a single well-structured prompt when possible (sketched after this post)
🧩 And always, testing carefully and being alert to errors in LLM output

I talk about these issues (and a lot more) in my workshops, and I’m writing about this today because the question was asked by some of my amazing workshop participants. Sign up in my profile to get notified about my next public workshop, or if you’re looking for private, in-house training for your team, drop me a note!

#AI #UX #LLM #userresearch
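A minimal sketch of the "fresh chat, single well-structured prompt" pattern recommended above. The helper names and the `call_llm` placeholder are assumptions for illustration, not a specific vendor's API; swap in whatever chat call your stack uses.

```python
# Sketch: build one well-structured prompt and send it in a fresh, stateless call
# instead of drip-feeding requirements across turns. `call_llm` is a placeholder.

def call_llm(messages: list[dict]) -> str:
    """Placeholder for your provider's chat-completion call (assumption, not a real API)."""
    raise NotImplementedError("wire this to your LLM provider")

def build_research_prompt(task: str, context: str, constraints: list[str], output_format: str) -> str:
    """Consolidate everything the model needs into a single structured prompt."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Task:\n{task}\n\n"
        f"Context:\n{context}\n\n"
        f"Constraints:\n{constraint_lines}\n\n"
        f"Output format:\n{output_format}\n"
    )

def run_fresh(prompt: str) -> str:
    # A fresh, single-turn call: no prior message history, no memory,
    # no custom instructions carried over from earlier chats.
    return call_llm([{"role": "user", "content": prompt}])

prompt = build_research_prompt(
    task="Summarize the key usability issues from the interview notes.",
    context="<paste the notes here>",
    constraints=["Quote participants verbatim", "Do not speculate beyond the notes"],
    output_format="A numbered list, one issue per line",
)
# answer = run_fresh(prompt)
```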
-
Models can lose up to 40% accuracy after just two back-and-forths.

A recent paper, LLMs GET LOST IN MULTI-TURN CONVERSATION (linked below), took single-shot prompts from popular benchmarks and broke them up such that the model had to have a multi-turn conversation to retrieve all of the information.

The TL;DR:
- Single-shot prompts: ~90% accuracy.
- Multi-turn prompts: ~65%, even across top models like Gemini 2.5.

One solution here: once you have all the context ready to go, share it all with a fresh LLM. Concatenating the shards and sending them to a model that didn't have the message history brought performance back up into the 90% range.

If you're shipping any AI feature with back-and-forth interactions, make multi-turn test cases a core part of your QA and evaluation suite.
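A rough sketch of the shard-concatenation idea described above. The helper names are illustrative (this is not the paper's code): pull the user's requirements out of an existing multi-turn history and replay them as one fully specified prompt to a model that has no memory of the earlier exchange.

```python
# Sketch: consolidate the user "shards" from a multi-turn history and hand them
# to a fresh model in a single turn, dropping the assistant's earlier replies.

def call_llm(messages: list[dict]) -> str:
    """Placeholder for a real chat-completion API (assumption)."""
    raise NotImplementedError

def consolidate_shards(history: list[dict]) -> str:
    """Join every user turn (the 'shards') into a single specification,
    deliberately leaving out the assistant's possibly wrong earlier answers."""
    shards = [turn["content"] for turn in history if turn["role"] == "user"]
    return "Here is the full task, all requirements included:\n\n" + "\n".join(
        f"- {shard}" for shard in shards
    )

def answer_with_fresh_model(history: list[dict]) -> str:
    fresh_prompt = consolidate_shards(history)
    # Single-turn call with no prior context attached.
    return call_llm([{"role": "user", "content": fresh_prompt}])

# Example history where requirements arrived one turn at a time:
history = [
    {"role": "user", "content": "Write a SQL query over the orders table."},
    {"role": "assistant", "content": "SELECT * FROM orders;"},
    {"role": "user", "content": "Only include orders from 2024."},
    {"role": "user", "content": "Group totals by customer and sort descending."},
]
print(consolidate_shards(history))
```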
-
LLMs Get Lost In Multi-Turn Conversation "Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that *when LLMs take a wrong turn in a conversation, they get lost and do not recover*."
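A rough sketch of the aptitude/unreliability decomposition the quote refers to, computed from repeated simulation runs of the same task. The paper frames aptitude as best-case performance and unreliability as the spread between best- and worst-case runs; the 90th/10th-percentile choice below is an approximation of that framing, not a verbatim reimplementation.

```python
# Decompose repeated per-run scores into aptitude (best-case performance) and
# unreliability (spread between good and bad runs). Percentile choice is an
# approximation of the paper's framing; the score lists are hypothetical.

from statistics import quantiles

def aptitude_and_unreliability(scores: list[float]) -> tuple[float, float]:
    """scores: per-run task scores (e.g., 0-100) for one model on one task."""
    deciles = quantiles(scores, n=10)   # 9 cut points: 10th..90th percentile
    p10, p90 = deciles[0], deciles[-1]
    aptitude = p90                      # what the model can do on a good run
    unreliability = p90 - p10           # how far a bad run falls below a good one
    return aptitude, unreliability

single_turn = [92, 95, 90, 93, 94, 91, 96, 92, 93, 95]   # hypothetical per-run scores
multi_turn  = [88, 45, 70, 30, 85, 55, 90, 40, 65, 50]

for label, runs in [("single-turn", single_turn), ("multi-turn", multi_turn)]:
    apt, unrel = aptitude_and_unreliability(runs)
    print(f"{label}: aptitude ~{apt:.0f}, unreliability ~{unrel:.0f}")
```

With the hypothetical numbers above, the multi-turn runs show only a slightly lower best case but a much wider spread, which is the pattern the quote describes: a minor loss in aptitude and a large increase in unreliability.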
-
🧠 If you're building apps with LLMs, this paper is a must-read.

Researchers at Microsoft and Salesforce recently released LLMs Get Lost in Multi-Turn Conversation — and the findings resonate with our experience at Vellum.

They ran 200,000+ simulations across 15 top models, comparing performance on the same task in two modes:
- Single-turn (user provides a well-specified prompt upfront)
- Multi-turn (user reveals task requirements gradually — like real users do)

The result?
✅ 90% avg accuracy in single-turn
💬 65% avg accuracy in multi-turn
🔻 -39% performance drop across the board
😬 Unreliability more than doubled

Even the best models get lost when the task unfolds over multiple messages. They latch onto early assumptions, generate bloated answers, and fail to adapt when more info arrives.

For application builders, this changes how we think about evaluation and reliability:
- One-shot prompt benchmarks ≠ user reality
- Multi-turn behavior needs to be a first-class test case (see the sketch after this post)
- Agents and wrappers won't fix everything — the underlying model still gets confused

This paper validates something we've seen in the wild: the moment users interact conversationally, reliability tanks — unless you're deliberate about managing context, fallback strategies, and prompt structure.

📌 If you're building on LLMs, read this. Test differently. Optimize for the real-world path, not the happy path.
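A sketch of treating multi-turn behavior as a first-class test case. Everything here (ask_fn, grade, the shard texts) is illustrative rather than a specific framework's API: run the same task as one well-specified prompt and as a sharded conversation, and require the sharded path to pass too, ideally across repeated runs.

```python
# Sketch: evaluate the same task single-turn and multi-turn, and gate on a
# minimum multi-turn pass rate instead of a single lucky run.

from typing import Callable

SHARDS = [
    "Write a Python function that parses ISO-8601 dates.",
    "It should return None instead of raising on bad input.",
    "It must accept both 'YYYY-MM-DD' and full timestamps.",
]

def grade(answer: str) -> bool:
    """Toy grader; a real eval would execute the code or score against a rubric."""
    return "def " in answer and "None" in answer

def run_single_turn(ask_fn: Callable[[list[dict]], str]) -> bool:
    messages = [{"role": "user", "content": "\n".join(SHARDS)}]
    return grade(ask_fn(messages))

def run_multi_turn(ask_fn: Callable[[list[dict]], str]) -> bool:
    messages: list[dict] = []
    reply = ""
    for shard in SHARDS:  # reveal requirements one turn at a time, like a real user
        messages.append({"role": "user", "content": shard})
        reply = ask_fn(messages)
        messages.append({"role": "assistant", "content": reply})
    return grade(reply)

def multi_turn_pass_rate(ask_fn: Callable[[list[dict]], str], runs: int = 5) -> float:
    # The degradation is mostly unreliability, so repeat the conversation and
    # require a minimum pass rate before shipping.
    return sum(run_multi_turn(ask_fn) for _ in range(runs)) / runs
```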
-
LLMs Get Lost in Multi-Turn Conversations: New Research Reveals Major Reliability Gap

Just read a fascinating new paper from Microsoft and Salesforce Research revealing a critical flaw in today's LLMs: they dramatically underperform in multi-turn conversations compared to single-turn interactions.

📊 Key findings:
🔗 LLMs suffer an average 39% performance drop in multi-turn settings across six generation tasks
🔗 This occurs even in conversations with as few as two turns
🔗 The problem affects ALL tested models, including the most advanced ones (Claude 3.7, GPT-4.1, Gemini 2.5)

🔍 The researchers call this the "lost in conversation" phenomenon - when LLMs take a wrong turn in conversation, they get lost and don't recover. This is caused by:
🔗 Making assumptions too early
🔗 Prematurely generating final solutions
🔗 Relying too heavily on previous (incorrect) answers
🔗 Producing overly verbose responses

💬 Why conversation-level evaluation matters:
Traditional LLM benchmarks focus on single-turn performance, creating a dangerous blind spot. Real-world AI interactions are conversational by nature, and this research shows that even the most capable models struggle with maintaining context and adapting to new information over multiple turns. Without robust conversation-level evaluation, we risk deploying systems that perform brilliantly in lab tests but frustrate users in practice.

🔎 At Coval, this is exactly what we focus on: evaluating LLMs in realistic conversational scenarios rather than isolated prompts. By measuring how models handle the natural flow of information across turns, we can identify reliability issues before they impact users and guide development toward truly conversational AI.

This research highlights a critical gap between how we evaluate LLMs (single-turn) versus how we use them in practice (multi-turn). As we build AI assistants and agents, addressing this reliability issue becomes essential.