🚨 New study reveals that, when used to summarize scientific research, generative AI is nearly five times MORE likely than human-written summaries to overgeneralize the findings. Many haven't realized it yet, but Gen AI's accuracy problem is worse than initially thought.
According to the paper "Generalization Bias in Large Language Model Summarization of Scientific Research," written by Uwe Peters & Benjamin Chin-Yee and published in Royal Society Open Science:
"AI chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize complex scientific information in accessible terms. However, when summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study. We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing 4,900 LLM-generated summaries to their original scientific texts. Even when explicitly prompted for accuracy, most LLMs produced broader generalizations of scientific results than those in the original texts, with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26–73% of cases. In a direct comparison of LLM-generated and human-authored science summaries, LLM summaries were nearly five times more likely to contain broad generalizations (odds ratio = 4.85, 95% CI [3.06, 7.70], p < 0.001). Notably, newer models tended to perform worse in generalization accuracy than earlier ones. Our results indicate a strong bias in many widely used LLMs towards overgeneralizing scientific conclusions, posing a significant risk of large-scale misinterpretations of research findings. We highlight potential mitigation strategies, including lowering LLM temperature settings and benchmarking LLMs for generalization accuracy."
👉 Link to the paper below.
👉 NEVER MISS my updates and analyses: join my newsletter's 61,700+ subscribers (link below).
AI Chatbot Usage Insights
-
Most people think of RAG (Retrieval-Augmented Generation) as: 𝘘𝘶𝘦𝘳𝘺 → 𝘝𝘦𝘤𝘵𝘰𝘳 𝘋𝘉 → 𝘓𝘓𝘔 → 𝘈𝘯𝘴𝘸𝘦𝘳
But that's just step one. In 2025, we're seeing a shift toward 𝗔𝗴𝗲𝗻𝘁𝗶𝗰 𝗥𝗔𝗚 systems—where LLMs don't just retrieve and respond, but also 𝗿𝗲𝗮𝘀𝗼𝗻, 𝗽𝗹𝗮𝗻, 𝗮𝗻𝗱 𝗮𝗰𝘁.
The core idea:
→ A query is embedded and used to fetch relevant chunks from a vector DB.
→ An 𝗔𝗜 𝗔𝗴𝗲𝗻𝘁 leverages those chunks to craft context-aware prompts.
→ It can also invoke external tools:
• Web Search
• APIs
• Internal Databases
This unlocks workflows that are:
• Dynamic
• Context-aware
• Action-oriented
It's not just answering — it's deciding 𝘄𝗵𝗮𝘁 𝘁𝗼 𝗱𝗼 𝗻𝗲𝘅𝘁.
Toolkits like 𝗟𝗮𝗻𝗴𝗚𝗿𝗮𝗽𝗵, 𝗖𝗿𝗲𝘄𝗔𝗜, 𝗚𝗼𝗼𝗴𝗹𝗲 𝗔𝗗𝗞, and 𝗔𝘂𝘁𝗼𝗚𝗲𝗻 are making this architecture practical for real-world systems.
What tools or techniques are 𝘺𝘰𝘂 using to take your LLM apps beyond static chatbots?
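To make the flow concrete, here is a minimal sketch of that loop in Python. Everything in it (embed_fn, vector_search, web_search, call_llm, the Chunk type) is a hypothetical placeholder standing in for a real embedding model, vector DB, tool, and LLM client, not the API of any of the toolkits mentioned above.

```python
# Minimal agentic RAG sketch with placeholder components.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float

def embed_fn(text: str) -> list[float]:
    # Placeholder embedding; swap in a real embedding model.
    return [float(ord(c)) for c in text[:8]]

def vector_search(query_vec: list[float], k: int = 3) -> list[Chunk]:
    # Placeholder vector DB lookup; swap in Pinecone/Qdrant/etc.
    return [Chunk(text="Retrieved passage about the topic.", score=0.82)]

def web_search(query: str) -> str:
    # Placeholder external tool call.
    return f"Top web result for: {query}"

def call_llm(prompt: str) -> str:
    # Placeholder LLM call; swap in your model client.
    return "ANSWER: based on the retrieved context ..."

def agentic_rag(query: str) -> str:
    chunks = vector_search(embed_fn(query))
    context = "\n".join(c.text for c in chunks)

    # Agent step: the LLM decides whether retrieval is enough
    # or whether it should invoke an external tool first.
    decision = call_llm(
        f"Context:\n{context}\n\nQuestion: {query}\n"
        "Reply 'SEARCH: <query>' if the context is insufficient, "
        "otherwise reply 'ANSWER: <answer>'."
    )
    if decision.startswith("SEARCH:"):
        context += "\n" + web_search(decision.removeprefix("SEARCH:").strip())
        decision = call_llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
    return decision

print(agentic_rag("What changed in agentic RAG in 2025?"))
```

The key design choice is that decision step: the model itself chooses whether the retrieved context is sufficient or whether a tool call is needed before answering.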
-
Most people evaluate LLMs on benchmarks alone. But in production, the real question is: how well do they actually perform?
When you're running inference at scale, these are the 3 performance metrics that matter most:
1️⃣ Latency
How fast does the model respond after receiving a prompt? There are two kinds to care about:
→ First-token latency: time to start generating a response
→ End-to-end latency: time to generate the full response
Latency directly impacts UX for chat, speed for agentic workflows, and runtime cost for batch jobs. Even small delays add up fast at scale.
2️⃣ Context Window
How much information can the model take in, both from the prompt and prior turns? This affects long-form summarization, RAG, and agent memory. Models range from:
→ GPT-3.5 / LLaMA 2: 4k–8k tokens
→ GPT-4 / Claude 2: 32k–200k tokens
→ GPT-OSS-120B: 131k tokens
Larger context enables richer workflows but comes with tradeoffs: slower inference and higher compute cost. Use compression techniques like attention sinks or sliding windows to get more out of your context window.
3️⃣ Throughput
How many tokens or requests can the model handle per second? This is key when you're serving thousands of requests or processing large document batches. Higher throughput = faster completion and lower cost.
How to optimize based on your use case:
→ Real-time chat or tool use → prioritize low latency
→ Long documents or RAG → prioritize a large context window
→ Agentic workflows → find a balance between latency and context
→ Async or high-volume processing → prioritize high throughput
My 2 cents 🤌
→ Choose in-region, lightweight models for lower latency
→ Use 32k+ context models only when necessary
→ Mix long-context models with fast first-token latency for agents
→ Optimize batch size and decoding strategy to maximize throughput
Don't just pick a model based on benchmarks. Pick the right tradeoffs for your workload.
〰️〰️〰️
Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
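If you want to put numbers on the two latency metrics and on throughput, a rough sketch like the one below works against any streaming endpoint. stream_tokens is a hypothetical placeholder for your model client's streaming call; swap in the real SDK you use.

```python
# Sketch: measure first-token latency, end-to-end latency, and throughput.
import time

def stream_tokens(prompt: str):
    # Placeholder generator that yields tokens with a small delay,
    # standing in for a real streaming LLM response.
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)
        yield tok

def measure(prompt: str) -> dict:
    start = time.perf_counter()
    first_token_latency = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if first_token_latency is None:
            first_token_latency = time.perf_counter() - start  # time to first token
        n_tokens += 1
    end_to_end = time.perf_counter() - start                   # full response time
    return {
        "first_token_latency_s": round(first_token_latency, 3),
        "end_to_end_latency_s": round(end_to_end, 3),
        "throughput_tok_per_s": round(n_tokens / end_to_end, 1),
    }

print(measure("Summarize this document."))
```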
-
🚨 Why do we need to move beyond single-turn task evaluation of large language models (LLMs)? 🤔
I have long advocated for evaluation methods of LLMs and other GenAI applications in healthcare that reflect real clinical scenarios, rather than multiple-choice questions or clinical vignettes with medical jargon. For example, interactions between clinicians and patients typically involve multi-turn conversations.
🔬 A study by Microsoft and Salesforce tested 200,000 AI conversations, using large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. They selected a total of 15 LLMs from eight model families: OpenAI (GPT-4o-mini, GPT-4o, o3, and GPT-4.1), Anthropic (Claude 3 Haiku, Claude 3.7 Sonnet), Google's Gemini (Gemini 2.5 Flash, Gemini 2.5 Pro), Meta's Llama (Llama3.1-8B-Instruct, Llama3.3-70B-Instruct, Llama 4 Scout), AI2 OLMo-2-13B, Microsoft Phi-4, Deepseek-R1, and Cohere Command-A.
❓ The results?
❌ Multi-turn conversations resulted in an average 39% drop in performance across six generation tasks.
❌ Their analysis of conversations revealed a minor decline in aptitude and a significant increase in unreliability.
📉 Here's why LLMs stumble:
• 🚧 Premature assumptions derail conversations.
• 🗣️ Overly verbose replies confuse rather than clarify.
• 🔄 Difficulty adapting after initial mistakes.
😵💫 Simply put: When an AI goes off track early, it gets lost and does not recover.
✅ The authors advocate:
• Multi-turn conversations must become a priority.
• Better multi-turn testing is crucial. Single-turn tests just aren't realistic.
• Users should be aware of these limitations.
🔗 to the original paper is in the first comment 👇
#AI #ConversationalAI #LargeLanguageModels #LLMs
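As a rough illustration (not the paper's actual harness), the difference between the two settings can be framed as: give the model the full task in one turn versus revealing it piece by piece and only grading the final answer. call_llm and is_correct below are hypothetical placeholders.

```python
# Sketch: single-turn vs multi-turn evaluation of the same task.
def call_llm(messages: list[dict]) -> str:
    # Placeholder model call; swap in a real chat-completions client.
    return "model answer"

def is_correct(answer: str, reference: str) -> bool:
    # Placeholder grader; real evaluations use task-specific checks.
    return reference.lower() in answer.lower()

def single_turn(full_task: str, reference: str) -> bool:
    answer = call_llm([{"role": "user", "content": full_task}])
    return is_correct(answer, reference)

def multi_turn(shards: list[str], reference: str) -> bool:
    # Reveal the task turn by turn, as users do in real conversations,
    # and grade only the final answer.
    messages, answer = [], ""
    for shard in shards:
        messages.append({"role": "user", "content": shard})
        answer = call_llm(messages)
        messages.append({"role": "assistant", "content": answer})
    return is_correct(answer, reference)

shards = ["I need a function.", "It should add two numbers.", "In Python."]
print(single_turn(" ".join(shards), "def add"), multi_turn(shards, "def add"))
```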
-
The Illusion of Thinking in LLMs - Apple researchers have spilled the beans on the strengths and limitations of reasoning models. Reasoning models "collapse" beyond certain task complexities.
"The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" highlights several limitations of Large Language Models (LLMs) and their specialized variants, Large Reasoning Models (LRMs), particularly in the context of reasoning and problem-solving. Below is a list of the key limitations identified by Apple researchers:
(1) Poor Performance on Reasoning Benchmarks: Earlier iterations of LLMs exhibited poor performance on reasoning benchmarks, indicating fundamental challenges in reasoning capabilities (Page 4, Section 2).
(2) Lack of Generalizable Reasoning: Despite advancements, LLMs and LRMs fail to develop generalizable problem-solving capabilities, especially for planning tasks. Performance collapses to zero beyond certain complexity thresholds in controlled puzzle environments (Page 3, Section 1; Page 11, Section 5).
(3) Data Contamination Issues: Established mathematical and coding benchmarks suffer from data contamination, where models may have been exposed to similar problems during training, skewing performance evaluations (Page 2, Section 1; Page 5, Section 3).
(4) Inefficiency in Low-Complexity Tasks: For simpler, low-compositional problems, standard LLMs demonstrate greater efficiency and accuracy compared to LRMs, suggesting that additional "thinking" mechanisms in LRMs may introduce unnecessary overhead (Page 3, Section 1; Page 7, Section 4.2.1).
(5) Complete Collapse at High Complexity: Both LLMs and LRMs experience complete performance collapse when problem complexity exceeds a critical threshold, indicating a fundamental limitation in handling highly complex, compositionally deep tasks (Page 3, Section 1; Page 8, Section 4.2.2).
(6) Counterintuitive Scaling Limitation: LRMs reduce their reasoning effort (measured by inference-time tokens) as problem complexity increases beyond a certain point, despite having ample token budgets, revealing a scaling limitation in reasoning capabilities (Page 3, Section 1; Page 8, Section 4.2.2).
(7) Overthinking Phenomenon: In simpler problems, LLMs and LRMs often identify correct solutions early but continue exploring incorrect alternatives, wasting computational resources in an "overthinking" pattern (Page 3, Section 1; Page 9, Section 4.3).
-
LLMs are optimized for the next turn's response. This results in poor human-AI collaboration, as it doesn't help users achieve their goals or clarify intent. A new model, CollabLLM, is optimized for long-term collaboration instead.
The paper "CollabLLM: From Passive Responders to Active Collaborators" by Stanford University and Microsoft researchers tests this approach to improving outcomes from LLM interaction. (link in comments)
💡 CollabLLM transforms AI from passive responders to active collaborators. Traditional LLMs focus on single-turn responses, often missing user intent and leading to inefficient conversations. CollabLLM introduces a "multiturn-aware reward" system and applies reinforcement fine-tuning on these rewards. This enables the AI to engage in deeper, more interactive exchanges by actively uncovering user intent and guiding users toward their goals.
🔄 Multiturn-aware rewards optimize long-term collaboration. Unlike standard reinforcement learning that prioritizes immediate responses, CollabLLM uses forward sampling (simulating potential conversations) to estimate the long-term value of interactions. This approach improves interactivity by 46.3% and enhances task performance by 18.5%, making conversations more productive and user-centered.
📊 CollabLLM outperforms traditional models in complex tasks. In document editing, coding assistance, and math problem-solving, CollabLLM increases user satisfaction by 17.6% and reduces time spent by 10.4%. It ensures that AI-generated content aligns with user expectations through dynamic feedback loops.
🤝 Proactive intent discovery leads to better responses. Unlike standard LLMs that assume user needs, CollabLLM asks clarifying questions before responding, leading to more accurate and relevant answers. This results in higher-quality output and a smoother user experience.
🚀 CollabLLM generalizes well across different domains. Tested on the Abg-CoQA conversational QA benchmark, CollabLLM proactively asked clarifying questions 52.8% of the time, compared to just 15.4% for GPT-4o. This demonstrates its ability to handle ambiguous queries effectively, making it more adaptable to real-world scenarios.
🔬 Real-world studies confirm efficiency and engagement gains. A 201-person user study showed that CollabLLM-generated documents received higher quality ratings (8.50/10) and sustained higher engagement over multiple turns, unlike baseline models, which saw declining satisfaction in longer conversations.
It is time to move beyond the single-step LLM responses we have become used to, toward interactions that lead us where we want to go. This is a useful advance toward better human-AI collaboration. It's a critical topic, and I'll be sharing a lot more on how we can get there.
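For intuition, here is a rough conceptual sketch of the forward-sampling idea behind a multiturn-aware reward: roll out a few simulated future turns after a candidate response and average a conversation-level score. simulate_user, model_reply, and score_conversation are hypothetical placeholders (the score here is random just to keep the snippet runnable); this is not the paper's training code.

```python
# Sketch: estimating a multiturn-aware reward via forward sampling.
import random

def simulate_user(history: list[str]) -> str:
    # Placeholder user simulator; real systems use an LLM persona.
    return "user follow-up"

def model_reply(history: list[str]) -> str:
    # Placeholder model turn inside the rollout.
    return "candidate model turn"

def score_conversation(history: list[str]) -> float:
    # Placeholder conversation-level score, e.g. task success minus user effort.
    return random.random()

def multiturn_aware_reward(history: list[str], candidate: str,
                           horizon: int = 3, n_samples: int = 4) -> float:
    """Estimate the long-term value of `candidate` by rolling out simulated
    future turns and averaging the conversation-level score."""
    total = 0.0
    for _ in range(n_samples):
        rollout = history + [candidate]
        for _ in range(horizon):
            rollout.append(simulate_user(rollout))
            rollout.append(model_reply(rollout))
        total += score_conversation(rollout)
    return total / n_samples

print(multiturn_aware_reward(["user: help me draft a doc"], "Which audience is it for?"))
```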
-
In the rapidly evolving world of conversational AI, Large Language Model (LLM) based chatbots have become indispensable across industries, powering everything from customer support to virtual assistants. However, evaluating their effectiveness is no simple task, as human language is inherently complex, ambiguous, and context-dependent. In a recent blog post, Microsoft's Data Science team outlined key metrics designed to assess chatbot performance comprehensively.
Chatbot evaluation can be broadly categorized into two key areas: search performance and LLM-specific metrics. On the search front, one critical factor is retrieval stability, which ensures that slight variations in user input do not drastically change the chatbot's search results. Another vital aspect is search relevance, which can be measured through multiple approaches, such as comparing chatbot responses against a ground truth dataset or conducting A/B tests to evaluate how well the retrieved information aligns with user intent.
Beyond search performance, chatbot evaluation must also account for LLM-specific metrics, which focus on how well the model generates responses. These include:
- Task Completion: Measures the chatbot's ability to accurately interpret and fulfill user requests. A high-performing chatbot should successfully execute tasks, such as setting reminders or providing step-by-step instructions.
- Intelligence: Assesses coherence, contextual awareness, and the depth of responses. A chatbot should go beyond surface-level answers and demonstrate reasoning and adaptability.
- Relevance: Evaluates whether the chatbot's responses are appropriate, clear, and aligned with user expectations in terms of tone, clarity, and courtesy.
- Hallucination: Ensures that the chatbot's responses are factually accurate and grounded in reliable data, minimizing misinformation and misleading statements.
Effectively evaluating LLM-based chatbots requires a holistic, multi-dimensional approach that integrates search performance and LLM-generated response quality. By considering these diverse metrics, developers can refine chatbot behavior, enhance user interactions, and build AI-driven conversational systems that are not only intelligent but also reliable and trustworthy.
#DataScience #MachineLearning #LLM #Evaluation #Metrics #SnacksWeeklyonDataScience
– – –
Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain in more detail the concepts discussed in this and future posts:
-- Spotify: https://lnkd.in/gKgaMvbh
-- Apple Podcast: https://lnkd.in/gj6aPBBY
-- Youtube: https://lnkd.in/gcwPeBmR
https://lnkd.in/gAC8eXmy
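The retrieval-stability idea above can be made concrete with a small check: paraphrases of the same question should retrieve largely the same documents. The sketch below uses a hypothetical retrieve function as a stand-in for the search layer and reports the mean Jaccard overlap; it is an illustration under those assumptions, not Microsoft's metric implementation.

```python
# Sketch: retrieval stability as mean Jaccard overlap across paraphrases.
def retrieve(query: str, k: int = 5) -> set[str]:
    # Placeholder search layer returning document IDs; swap in your retriever.
    return {"doc_12", "doc_31", "doc_7"}

def retrieval_stability(query: str, paraphrases: list[str]) -> float:
    """Average overlap between the base query's results and each paraphrase's."""
    base = retrieve(query)
    overlaps = []
    for p in paraphrases:
        other = retrieve(p)
        overlaps.append(len(base & other) / len(base | other))
    return sum(overlaps) / len(overlaps)

print(retrieval_stability(
    "How do I reset my password?",
    ["I forgot my password, what do I do?", "Steps to change a lost password"],
))
```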
-
The advancement of artificial intelligence, especially the development of sophisticated chatbots, has significantly changed how we find and share information. While these chatbots exhibit remarkable proficiency with human language—evident in their ability to craft compelling stories, mimic political speeches, and even produce creative works—it's crucial to recognize their limitations. They are not perfect. In fact, chatbots are not only prone to mistakes but can also generate misleading or entirely fabricated information. These fabricated responses often appear indistinguishable from credible, evidence-based data, creating a serious challenge for informed decision-making and constructive dialogue.
At the heart of these chatbots are large language models (LLMs), which function by predicting words based on massive datasets. This probabilistic mechanism enables them to produce logical, coherent text. However, it also means they are inherently prone to errors or "hallucinations." When chatbots are designed to sound authoritative, a mix of accurate and fabricated information can inadvertently contribute to the spread of both misinformation and disinformation. This risk becomes particularly alarming in areas like political communication or public policy, where persuasive language can easily slip into manipulation.
Even with decades of advancements, modern AI technologies are still essentially advanced imitations of human conversation. These systems remain largely opaque "black boxes," whose internal operations are often not fully understood, even by their creators. While these innovations have yielded groundbreaking applications for customer support, digital assistants, and creative writing, they also amplify the danger of users being misled by inaccuracies.
From both regulatory and ethical perspectives, the rise of chatbots capable of fabricating information demands urgent attention. The responsibility for creating safeguards cannot lie exclusively with the companies that develop and benefit from these tools. Instead, a comprehensive, collaborative approach is critical. This approach should include greater transparency, stringent fact-checking mechanisms, and international cooperation to ensure that these powerful AI systems are used to educate and inform rather than mislead or deceive.
-
AI agents without proper memory are just expensive chatbots repeating the same mistakes. After building 20+ production agents, I discovered most developers only implement 1 out of 5 critical memory types. Here's the complete memory architecture powering agents at Google, Microsoft, and top AI startups:
𝗦𝗵𝗼𝗿𝘁-𝘁𝗲𝗿𝗺 𝗠𝗲𝗺𝗼𝗿𝘆 (𝗪𝗼𝗿𝗸𝗶𝗻𝗴 𝗠𝗲𝗺𝗼𝗿𝘆)
→ Maintains conversation context (last 5-10 turns)
→ Enables coherent multi-turn dialogues
→ Clears after session ends
→ Implementation: Rolling buffer/context window
𝗟𝗼𝗻𝗴-𝘁𝗲𝗿𝗺 𝗠𝗲𝗺𝗼𝗿𝘆 (𝗣𝗲𝗿𝘀𝗶𝘀𝘁𝗲𝗻𝘁 𝗦𝘁𝗼𝗿𝗮𝗴𝗲)
Unlike short-term memory, long-term memory persists across sessions and contains three specialized subsystems:
𝟭. 𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 𝗠𝗲𝗺𝗼𝗿𝘆 (𝗞𝗻𝗼𝘄𝗹𝗲𝗱𝗴𝗲 𝗕𝗮𝘀𝗲)
→ Domain expertise and factual knowledge
→ Company policies, product catalogs
→ Doesn't change per user interaction
→ Implementation: Vector DB (Pinecone/Qdrant) + RAG
𝟮. 𝗘𝗽𝗶𝘀𝗼𝗱𝗶𝗰 𝗠𝗲𝗺𝗼𝗿𝘆 (𝗘𝘅𝗽𝗲𝗿𝗶𝗲𝗻𝗰𝗲 𝗟𝗼𝗴𝘀)
→ Specific past interactions and outcomes
→ "Last time user tried X, Y happened"
→ Enables learning from past actions
→ Implementation: Few-shot prompting + event logs
𝟯. 𝗣𝗿𝗼𝗰𝗲𝗱𝘂𝗿𝗮𝗹 𝗠𝗲𝗺𝗼𝗿𝘆 (𝗦𝗸𝗶𝗹𝗹 𝗦𝗲𝘁𝘀)
→ How to execute specific workflows
→ Learned task sequences and patterns
→ Improves with repetition
→ Implementation: Function definitions + prompt templates
When processing user input, intelligent agents don't query memories in isolation:
1️⃣ Short-term provides immediate context
2️⃣ Semantic supplies relevant domain knowledge
3️⃣ Episodic recalls similar past scenarios
4️⃣ Procedural suggests proven action sequences
This orchestrated approach enables agents to:
- Handle complex multi-step tasks autonomously
- Learn from failures without retraining
- Provide contextually aware responses
- Build relationships over time
LangChain, LangGraph, and AutoGen all provide memory abstractions, but most developers only scratch the surface. The difference between a demo and production? Memory that actually remembers.
Over to you: Which memory type is your agent missing?
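As a rough sketch of how these pieces might be wired together in code: the class and function names below are illustrative placeholders, not the memory abstractions of LangChain, LangGraph, or AutoGen, and the retrieval logic is deliberately simplistic.

```python
# Sketch: the memory types combined into one context-aware prompt.
from collections import deque

class ShortTermMemory:
    """Rolling buffer of the last N turns (working memory)."""
    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)
    def add(self, role: str, text: str):
        self.turns.append((role, text))

class SemanticMemory:
    """Stand-in for a vector DB of domain knowledge (RAG)."""
    def __init__(self, facts: list[str]):
        self.facts = facts
    def search(self, query: str, k: int = 2) -> list[str]:
        words = query.lower().split()
        return [f for f in self.facts if any(w in f.lower() for w in words)][:k]

class EpisodicMemory:
    """Log of specific past interactions and their outcomes."""
    def __init__(self):
        self.events: list[str] = []
    def recall(self, query: str) -> list[str]:
        return [e for e in self.events if query.lower() in e.lower()]

class ProceduralMemory:
    """Learned workflows stored as reusable templates."""
    def __init__(self):
        self.skills = {"refund": "1) verify order 2) check policy 3) issue refund"}
    def get(self, task: str) -> str:
        return self.skills.get(task, "")

def build_prompt(user_msg, short_term, semantic, episodic, procedural, task):
    # Orchestrate all memory types into one prompt for the LLM.
    return "\n".join([
        "Recent turns: " + str(list(short_term.turns)),
        "Knowledge: " + "; ".join(semantic.search(user_msg)),
        "Past episodes: " + "; ".join(episodic.recall(task)),
        "Workflow: " + procedural.get(task),
        "User: " + user_msg,
    ])

stm = ShortTermMemory(); stm.add("user", "I want a refund for order 123")
sem = SemanticMemory(["Refund policy: refunds allowed within 30 days."])
epi = EpisodicMemory(); epi.events.append("refund issued for order 99 last week")
print(build_prompt("I want a refund for order 123", stm, sem, epi, ProceduralMemory(), "refund"))
```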