There is growing interest in using large language models (LLMs) to retrieve scientific literature and answer medical questions. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations with the same proficiency as domain experts remains poorly characterized. Systematic reviews (SRs), in which experts synthesize evidence across studies, are a cornerstone of clinical decision-making, research, and policy. Their rigorous evaluation of study quality and consistency makes them a strong reference for assessing expert reasoning, raising a simple question: can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies?

To explore this question, we present:

🎯 MedEvidence Benchmark: A human-curated benchmark of 284 questions (from 100 open-access SRs) across 10 medical specialties. All questions are manually transformed into closed-form question answering to facilitate evaluation.

📊 Large-scale evaluation on MedEvidence: We analyze 24 LLMs spanning general-domain, medical-finetuned, and reasoning models.

Through our systematic evaluation, we find that:
1. Reasoning does not necessarily improve performance
2. Larger models do not consistently yield greater gains
3. Medical fine-tuning degrades accuracy on MedEvidence

Instead, most models show overconfidence and, unlike human experts, lack scientific skepticism toward low-quality findings.

😨 These results suggest that more work is still required before LLMs can reliably match the conclusions of expert-conducted SRs, even though these systems are already deployed and being used by clinicians!

📄 Paper: https://lnkd.in/ghTa3pVA
🌐 Website: https://lnkd.in/gvCTcsxR

Huge shoutout to my incredible first co-authors, Christopher Polzak and Min Woo Sun, and to James Burgess, Yuhui Zhang, and Serena Yeung-Levy for their amazing contributions and collaboration.
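As a rough illustration of the closed-form setup described above, here is a minimal scoring sketch. The item fields, the answer options, and the ask_model helper are hypothetical placeholders for illustration, not the actual MedEvidence schema or evaluation code.

```python
# Hypothetical sketch: score an LLM's closed-form answers against the
# conclusions of the corresponding systematic reviews. Field names and the
# ask_model() callable are illustrative placeholders, not the MedEvidence API.
from dataclasses import dataclass

OPTIONS = {"yes", "no", "uncertain"}  # assumed closed-form answer set

@dataclass
class Item:
    question: str        # closed-form question derived from the SR
    studies: list[str]   # text of the same studies the reviewers used
    sr_answer: str       # conclusion reached by the systematic review

def normalize(answer: str) -> str:
    """Map free-text model output onto one of the allowed options."""
    answer = answer.strip().lower()
    return next((opt for opt in OPTIONS if opt in answer), "uncertain")

def accuracy(items: list[Item], ask_model) -> float:
    """ask_model(question, studies) -> str is supplied by the caller."""
    correct = sum(
        normalize(ask_model(it.question, it.studies)) == it.sr_answer
        for it in items
    )
    return correct / len(items)
```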
Evaluating LLM Accuracy on Familiar and Rare Information
Explore top LinkedIn content from expert professionals.
Summary
Evaluating LLM accuracy on familiar and rare information means measuring how well large language models (LLMs) can provide correct answers both for well-known topics and for less common or specialized questions. This process helps us understand if these AI tools can truly mimic human-level reasoning and caution, especially when dealing with complex or unfamiliar data.
- Compare approaches: Review both long-context models and retrieval-based systems to identify which method best suits your use case, as they excel in different scenarios (a minimal comparison harness is sketched after this list).
- Validate with benchmarks: Use curated question sets and expert-reviewed sources to assess if the LLM delivers reliable answers for both common and specialized queries.
- Monitor factual accuracy: Break down AI-generated responses into individual facts and verify them against trusted references to reduce errors and improve trustworthiness.
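To make the first two points concrete, here is a minimal sketch of a benchmark harness that scores two answering pipelines, such as a long-context model and a RAG system, on the same curated question set. The pipeline callables, the item layout, and the exact-match metric are assumptions for illustration, not a specific evaluation framework.

```python
# Illustrative harness: compare two answering pipelines on one curated benchmark.
# `long_context_answer` and `rag_answer` are assumed callables supplied by you.
from typing import Callable

def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(pipeline: Callable[[str], str], benchmark: list[dict]) -> float:
    """Benchmark items are assumed to look like {'question': ..., 'answer': ...}."""
    hits = sum(exact_match(pipeline(item["question"]), item["answer"]) for item in benchmark)
    return hits / len(benchmark)

def compare(benchmark: list[dict],
            long_context_answer: Callable[[str], str],
            rag_answer: Callable[[str], str]) -> dict[str, float]:
    return {
        "long_context": evaluate(long_context_answer, benchmark),
        "rag": evaluate(rag_answer, benchmark),
    }
```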
Fascinating new research comparing Long Context LLMs vs RAG approaches! A comprehensive study by researchers from Nanyang Technological University Singapore and Fudan University reveals key insights into how these technologies perform across different scenarios. After analyzing 12 QA datasets with over 19,000 questions, the researchers found:

Key Technical Findings:
- Long Context (LC) models excel at processing Wikipedia articles and stories, achieving 56.3% accuracy compared to RAG's 49.0%
- RAG shows superior performance in dialogue-based contexts and on fragmented information
- RAPTOR, a hierarchical tree-based retrieval system, outperformed traditional chunk-based and index-based retrievers with 38.5% accuracy

Under the Hood: The study implements a novel three-phase evaluation framework:
1. Empirical retriever assessment across multiple architectures
2. Direct LC vs RAG comparison using filtered datasets
3. Granular analysis of performance patterns across different question types and knowledge sources

Most interesting finding: RAG exclusively answered 10% of the questions that LC couldn't handle, suggesting these approaches are complementary rather than competitive.

The research team also introduced an innovative question-filtering methodology to ensure fair comparison by removing queries answerable through parametric knowledge alone. This work significantly advances our understanding of when to use each approach in production systems. A must-read for anyone working with LLMs or building RAG systems!
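The filtering idea described above can be sketched in a few lines: query the model closed-book, and drop any question it already answers correctly from parametric knowledge, so the remaining set actually tests use of the provided context. The helpers and item layout below are assumptions for illustration, not the study's released code.

```python
# Illustrative sketch of parametric-knowledge filtering: keep only questions
# the model cannot answer without any retrieved or long-context evidence.
# `closed_book_answer` and `is_correct` are assumed helpers, not the paper's code.
from typing import Callable

def filter_parametric(
    questions: list[dict],
    closed_book_answer: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> list[dict]:
    kept = []
    for q in questions:  # items assumed to look like {'question': ..., 'answer': ...}
        prediction = closed_book_answer(q["question"])  # no context supplied
        if not is_correct(prediction, q["answer"]):
            kept.append(q)  # model needed external evidence, so keep the question
    return kept
```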
In our daily discussions about generative AI, the fear of AI 'hallucinating', or fabricating information, often surfaces. This conversation, however, opens the door to an exciting question: Could AI surpass human accuracy in identifying truths?

Enter a groundbreaking study by #Google #DeepMind and #Stanford researchers, which introduces a novel framework called SAFE. Tested across approximately 16,000 facts, SAFE demonstrated superhuman performance, aligning with human evaluators 72% of the time and besting them in 76% of contested cases, all while being 20 times more cost-effective than traditional methods.

The essence of this methodology lies in two pivotal steps. Initially, the LongFact prompt set, crafted using GPT-4, targets the comprehensive assessment of long-form content's factuality across 38 varied topics. Then, the SAFE framework takes this base further by meticulously breaking down responses into individual facts and validating each through targeted Google Search queries.

The process unfolds across four critical stages:
1. 𝗣𝗿𝗼𝗺𝗽𝘁 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 𝘄𝗶𝘁𝗵 𝗟𝗼𝗻𝗴𝗙𝗮𝗰𝘁: Crafting varied, fact-seeking prompts to elicit detailed LLM responses.
2. 𝗗𝗲𝗰𝗼𝗺𝗽𝗼𝘀𝗶𝘁𝗶𝗼𝗻 𝗶𝗻𝘁𝗼 𝗜𝗻𝗱𝗶𝘃𝗶𝗱𝘂𝗮𝗹 𝗙𝗮𝗰𝘁𝘀: Segmenting these responses into distinct facts for precise evaluation.
3. 𝗙𝗮𝗰𝘁 𝗩𝗲𝗿𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻 𝘃𝗶𝗮 𝗚𝗼𝗼𝗴𝗹𝗲 𝗦𝗲𝗮𝗿𝗰𝗵: Using LLMs to formulate and dispatch queries, checking each fact's accuracy against search results.
4. 𝗜𝘁𝗲𝗿𝗮𝘁𝗶𝘃𝗲 𝗥𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗮𝗻𝗱 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻: Applying a multi-step reasoning process to assess the support level for each fact.

This approach doesn't just mark a leap in evaluating the factuality of LLM-generated content; it also paves the way for more trustworthy AI applications in countless fields.

For a deep dive into this fascinating study, including access to the LongFact prompts and the SAFE framework, visit: https://lnkd.in/eVr4rz-u
Find the full paper here: https://lnkd.in/eSjZ5Tn9

#GenAI #LLM #Hallucination #FactChecking #DeepMind #Stanford #Google #SAFE #LongFact
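For intuition, here is a heavily simplified sketch of a SAFE-style loop covering stages 2 through 4 above (prompt generation with LongFact happens upstream). The llm and google_search helpers and the prompts are placeholders for illustration, not the released SAFE implementation linked above.

```python
# Simplified, illustrative SAFE-style loop: decompose a long-form response into
# atomic facts, search for evidence, and rate each fact. The llm() and
# google_search() helpers are assumed placeholders, not the released SAFE code.

def llm(prompt: str) -> str:
    """Placeholder for a call to your language model."""
    raise NotImplementedError

def google_search(query: str) -> str:
    """Placeholder for a search call returning snippet text."""
    raise NotImplementedError

def rate_response(response: str) -> dict[str, int]:
    # Stage 2: decompose the response into individual, self-contained facts.
    facts = [f.strip() for f in llm(
        "List every distinct factual claim in the text below, one per line:\n" + response
    ).splitlines() if f.strip()]

    supported = not_supported = 0
    for fact in facts:
        # Stage 3: formulate a search query and fetch evidence for this fact.
        query = llm("Write a short search query to verify this claim: " + fact)
        evidence = google_search(query)
        # Stage 4: multi-step reasoning reduced here to a single judgment call.
        verdict = llm(f"Claim: {fact}\nEvidence: {evidence}\nAnswer 'supported' or 'not supported'.")
        if "not supported" in verdict.lower():
            not_supported += 1
        else:
            supported += 1
    return {"supported": supported, "not_supported": not_supported}
```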