How do we actually evaluate LLMs? A simple question with a long answer! I wrote a new article explaining the four main approaches:
1. Multiple-choice benchmarks
2. Verifiers
3. Leaderboards
4. LLM-as-a-judge methods
Each comes with hands-on, from-scratch code examples so you can see how they work under the hood and where the trade-offs lie. Of course, there is no single "best" method; each has its place depending on the goal. But understanding these foundations hopefully helps make sense of all the leaderboards, papers, and various claims we see every week (or day). 🔗 https://lnkd.in/gHCmmzas
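To give a flavor of the first approach, here is a minimal sketch of multiple-choice benchmark scoring. The `ask_model` function and the two toy questions are illustrative assumptions, not items from a real benchmark like MMLU; swap in your own model client and question set.

```python
# Minimal sketch of multiple-choice benchmark scoring.
# `ask_model` is a hypothetical stand-in for an LLM call; the questions
# below are toy examples, not a real benchmark.

questions = [
    {"prompt": "2 + 2 = ?", "choices": {"A": "3", "B": "4", "C": "5"}, "answer": "B"},
    {"prompt": "Capital of France?", "choices": {"A": "Paris", "B": "Rome", "C": "Madrid"}, "answer": "A"},
]

def ask_model(prompt: str) -> str:
    # Placeholder baseline that always answers "A"; replace with a real LLM call.
    return "A"

def accuracy(items) -> float:
    correct = 0
    for item in items:
        formatted = item["prompt"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in item["choices"].items()
        ) + "\nAnswer with a single letter."
        prediction = ask_model(formatted).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(items)

print(f"accuracy: {accuracy(questions):.2f}")  # 0.50 with the always-A placeholder
```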
-
Don't just blindly use LLMs; evaluate them to see whether they fit your criteria. Not all LLMs are created equal. Here’s how to measure whether they’re right for your use case👇
Evaluating LLMs is critical to assess their performance, reliability, and suitability for specific tasks. Without evaluation, it would be impossible to determine whether a model generates coherent, relevant, or factually correct outputs, particularly in applications like translation, summarization, or question answering. Evaluation ensures models align with human expectations, avoid biases, and improve iteratively.
Different metrics cater to distinct aspects of model performance:
- Perplexity quantifies how well a model predicts a sequence (lower scores indicate better familiarity with the data), making it useful for gauging fluency.
- ROUGE-1 measures unigram (single-word) overlap between model outputs and references, ideal for tasks like summarization where content coverage matters.
- BLEU focuses on n-gram precision (e.g., exact phrase matches) and is commonly used in machine translation to assess accuracy.
- METEOR extends this by incorporating synonyms, paraphrases, and stemming, offering a more flexible semantic evaluation.
- Exact Match (EM) is the strictest metric, requiring verbatim alignment with the reference; it is often used in closed-domain tasks like factual QA where precision is paramount.
Each metric reflects a trade-off: EM prioritizes literal correctness, ROUGE leans toward recall (how much of the reference content is covered), and BLEU leans toward precision (how much of the output matches the reference). METEOR accommodates linguistic diversity by crediting synonyms and paraphrases, while Perplexity rewards fluent prediction rather than overlap with a reference. Choosing the right metric depends on the task: EM for factual accuracy in trivia, ROUGE for summarization coverage, Perplexity for generative fluency. Collectively, these metrics provide a multifaceted view of LLM capabilities, enabling developers to refine models, mitigate errors, and align outputs with user needs. The table’s examples, such as EM scoring 0 for a paraphrased answer, highlight how minor phrasing changes affect scores, underscoring the importance of context-aware metric selection.
Know more about how to evaluate LLMs: https://lnkd.in/gfPBxrWc
Here is my complete in-depth guide on evaluating LLMs: https://lnkd.in/gjWt9jRu
Follow me on my YouTube channel so you don't miss any AI topic: https://lnkd.in/gMCpfMKh
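To make two of these metrics concrete, here is a small from-scratch sketch of Exact Match and ROUGE-1. It uses naive whitespace tokenization and lowercasing as simplifying assumptions; real implementations add normalization and stemming.

```python
# From-scratch sketch of Exact Match and ROUGE-1 (unigram overlap).
# Whitespace tokenization keeps the example simple; production metrics
# normalize punctuation and apply stemming.
from collections import Counter

def exact_match(prediction: str, reference: str) -> int:
    return int(prediction.strip().lower() == reference.strip().lower())

def rouge1(prediction: str, reference: str) -> dict:
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())   # shared unigrams
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "The Eiffel Tower is located in Paris"
paraphrase = "The Eiffel Tower stands in Paris"
print(exact_match(paraphrase, reference))  # 0 -- paraphrasing breaks EM
print(rouge1(paraphrase, reference))       # partial credit for word overlap
```

This reproduces the point above: a paraphrased answer scores 0 on EM but still earns partial credit on ROUGE-1.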
-
IBM Research 𝗮𝗻𝗱 Yale University 𝗷𝘂𝘀𝘁 𝗿𝗲𝗹𝗲𝗮𝘀𝗲𝗱 𝗮 𝗳𝘂𝗹𝗹 360° 𝗿𝗲𝘃𝗶𝗲𝘄 𝗼𝗳 𝗵𝗼𝘄 𝘄𝗲 𝘁𝗲𝘀𝘁 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁𝘀. ⬇️
They looked at 120+ evaluation methods — and mapped out what’s working and what’s missing.
Currently everyone’s building AI agents. Almost no one agrees on how to properly evaluate them. This is critical, because without rigorous evaluation, we can’t trust these systems to be reliable, safe, or ready for real-world use.
𝗛𝗲𝗿𝗲’𝘀 𝘄𝗵𝗮𝘁 𝘀𝘁𝗮𝗻𝗱𝘀 𝗼𝘂𝘁: ⬇️
1. 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝗶𝗻𝗴 𝗮𝗴𝗲𝗻𝘁𝘀 ≠ 𝗯𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝗶𝗻𝗴 𝗺𝗼𝗱𝗲𝗹𝘀
➜ Agents aren’t static LLMs. They act, adapt, and evolve. Old-school metrics can’t keep up with real-world autonomy.
2. 𝗥𝗲𝗳𝗹𝗲𝗰𝘁𝗶𝗼𝗻 𝗶𝘀 𝗺𝗲𝗮𝘀𝘂𝗿𝗮𝗯𝗹𝗲 𝗻𝗼𝘄
➜ Benchmarks like LLF-Bench evaluate how agents process feedback and course-correct (which is crucial for evaluation quality). Without this, agents just repeat their mistakes.
3. 𝗖𝗼𝘀𝘁-𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆 𝗶𝘀 𝗯𝗲𝗶𝗻𝗴 𝗶𝗴𝗻𝗼𝗿𝗲𝗱 — 𝗱𝗮𝗻𝗴𝗲𝗿𝗼𝘂𝘀𝗹𝘆
➜ Top agents burn through huge numbers of tokens and API calls. We need benchmarks that track performance and price; otherwise no one can afford to deploy them. (A toy sketch of cost-aware reporting follows after this post.)
4. 𝗙𝗼𝘂𝗿 𝘀𝗸𝗶𝗹𝗹𝘀 𝗱𝗲𝗳𝗶𝗻𝗲 𝘁𝗼𝗽-𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗶𝗻𝗴 𝗮𝗴𝗲𝗻𝘁𝘀
➜ It's critical to evaluate each individual component — otherwise, key weaknesses can go unnoticed and compromise overall performance:
* Breaking down complex tasks (planning)
* Using tools and APIs (tool use)
* Learning from feedback (reflection)
* Remembering previous steps (memory)
5. 𝗧𝗲𝘀𝘁𝗶𝗻𝗴 𝗶𝘀 𝗯𝗲𝗰𝗼𝗺𝗶𝗻𝗴 𝗺𝗼𝗿𝗲 𝗿𝗲𝗮𝗹𝗶𝘀𝘁𝗶𝗰
➜ New benchmarks simulate actual jobs:
* Online shopping (WebArena)
* Debugging code (SWE-Bench)
* Helping customers (τ-bench)
* Research tasks (PaperBench)
* Multi-step workflows (OSWorld, CRMWorld)
More in the comments and below!
𝗪𝗮𝗻𝘁 𝗺𝗼𝗿𝗲 𝗯𝗿𝗲𝗮𝗸𝗱𝗼𝘄𝗻𝘀 𝗹𝗶𝗸𝗲 𝘁𝗵𝗶𝘀? Subscribe to Human in the Loop — my new weekly deep dive on AI agents, real-world tools, and strategic insights: https://lnkd.in/dbf74Y9E
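On point 3, a hedged illustration of what cost-aware reporting could look like. The per-task numbers, the assumed token price, and the "successes per dollar" ratio are my own simplifications, not a metric from the survey.

```python
# Illustrative cost-aware agent reporting (not the survey's metric).
# Report success rate alongside token spend, plus a simple ratio so
# cheap-but-decent agents stay visible next to expensive top scorers.

runs = [  # hypothetical per-task results for one agent
    {"success": True,  "tokens": 42_000},
    {"success": False, "tokens": 118_000},
    {"success": True,  "tokens": 55_000},
]
PRICE_PER_1K_TOKENS = 0.01  # assumed blended price; adjust for your provider

successes = sum(r["success"] for r in runs)
success_rate = successes / len(runs)
cost = sum(r["tokens"] for r in runs) / 1000 * PRICE_PER_1K_TOKENS
print(f"success rate: {success_rate:.2f}, cost: ${cost:.2f}, "
      f"successes per dollar: {successes / cost:.2f}")
```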
-
Explaining the evaluation method LLM-as-a-Judge (LLMaaJ).
Token-based metrics like BLEU or ROUGE are still useful for structured tasks like translation or summarization. But for open-ended answers, RAG copilots, or complex enterprise prompts, they often miss the bigger picture. That’s where LLMaaJ changes the game.
𝗪𝗵𝗮𝘁 𝗶𝘀 𝗶𝘁? You use a powerful LLM as an evaluator, not a generator. It’s given:
- The original question
- The generated answer
- The retrieved context or gold answer
𝗧𝗵𝗲𝗻 𝗶𝘁 𝗮𝘀𝘀𝗲𝘀𝘀𝗲𝘀:
✅ Faithfulness to the source
✅ Factual accuracy
✅ Semantic alignment, even if the answer is phrased differently
𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀: LLMaaJ captures what traditional metrics can’t. It understands paraphrasing. It flags hallucinations. It mirrors human judgment, which is critical when deploying GenAI systems in the enterprise.
𝗖𝗼𝗺𝗺𝗼𝗻 𝗟𝗟𝗠𝗮𝗮𝗝-𝗯𝗮𝘀𝗲𝗱 𝗺𝗲𝘁𝗿𝗶𝗰𝘀:
- Answer correctness
- Answer faithfulness
- Coherence, tone, and even reasoning quality
📌 If you’re building enterprise-grade copilots or RAG workflows, LLMaaJ is how you scale QA beyond manual reviews.
To put LLMaaJ into practice, check out EvalAssist, a new tool from IBM Research. It offers a web-based UI to streamline LLM evaluations:
- Refine your criteria iteratively using Unitxt
- Generate structured evaluations
- Export as Jupyter notebooks to scale effortlessly
A powerful way to bring LLM-as-a-Judge into your QA stack.
- Get Started guide: https://lnkd.in/g4QP3-Ue
- Demo Site: https://lnkd.in/gUSrV65s
- Github Repo: https://lnkd.in/gPVEQRtv
- Whitepapers: https://lnkd.in/gnHi6SeW
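A minimal sketch of the LLMaaJ pattern described above: the judge receives the question, the candidate answer, and the retrieved context, and returns a structured verdict. `call_llm` is a hypothetical placeholder for whatever model client you use, and the rubric and JSON schema are illustrative assumptions (tools like EvalAssist and Unitxt wrap this kind of flow for you).

```python
# Minimal LLM-as-a-Judge sketch. `call_llm` is a hypothetical placeholder
# for your model client; the rubric and JSON fields are illustrative.
import json

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Retrieved context: {context}
Candidate answer: {answer}

Rate the answer on faithfulness to the context, factual accuracy,
and semantic alignment with the question. Respond with JSON:
{{"faithfulness": 1-5, "accuracy": 1-5, "alignment": 1-5, "rationale": "..."}}"""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your provider's chat/completions call.
    return '{"faithfulness": 5, "accuracy": 5, "alignment": 4, "rationale": "Grounded."}'

def judge(question: str, context: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    return json.loads(raw)  # production code should validate and retry on malformed JSON

print(judge("Who wrote Hamlet?",
            "Hamlet is a tragedy by William Shakespeare.",
            "Shakespeare wrote it."))
```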
-
OpenAI CPO: Evals are becoming a core skill for PMs.
PM in 2025 is changing fast. PMs need to learn brand new skills:
1. AI Evals (https://lnkd.in/eGbzWMxf)
2. AI PRDs (https://lnkd.in/eMu59p_z)
3. AI Strategy (https://lnkd.in/egemMhMF)
4. AI Discovery (https://lnkd.in/e7Q6mMpc)
5. AI Prototyping (https://lnkd.in/eJujDhBV)
And evals are among the deepest topics. There are three steps to them:
1. Observing (https://lnkd.in/e3eQBdMp)
2. Analyzing Errors (https://lnkd.in/eEG83W5D)
3. Building LLM Judges (https://lnkd.in/ez3stJRm)
- - - - - -
Here's your simple guide to evals in 5 minutes: (Repost this before anything else ♻️)
𝟭. 𝗕𝗼𝗼𝘁𝘀𝘁𝗿𝗮𝗽 𝗬𝗼𝘂𝗿 𝗗𝗮𝘁𝗮𝘀𝗲𝘁
Start with 100 diverse traces of your LLM pipeline. Use real data if you can, or systematic synthetic data generation across key dimensions if you can't. Quality over quantity here: aggressive filtering beats volume.
𝟮. 𝗔𝗻𝗮𝗹𝘆𝘇𝗲 𝗧𝗵𝗿𝗼𝘂𝗴𝗵 𝗢𝗽𝗲𝗻 𝗖𝗼𝗱𝗶𝗻𝗴
Read every trace carefully and label failure modes without preconceptions. Look for the first upstream failure in each trace. Continue until you hit theoretical saturation, when new traces reveal no fundamentally new error types.
𝟯. 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗬𝗼𝘂𝗿 𝗙𝗮𝗶𝗹𝘂𝗿𝗲 𝗠𝗼𝗱𝗲𝘀
Group similar failures into coherent, binary categories through axial coding. Focus on Gulf of Generalization failures (where clear instructions are misapplied) rather than Gulf of Specification issues (ambiguous prompts you can fix easily).
𝟰. 𝗕𝘂𝗶𝗹𝗱 𝗔𝘂𝘁𝗼𝗺𝗮𝘁𝗲𝗱 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗼𝗿𝘀
Create dedicated evaluators for each failure mode. Use code-based checks when possible (regex, schema validation, execution tests). For subjective judgments, build LLM-as-Judge evaluators with clear Pass/Fail criteria, few-shot examples, and structured JSON outputs. (A tiny code-based example follows after this post.)
𝟱. 𝗗𝗲𝗽𝗹𝗼𝘆 𝘁𝗵𝗲 𝗜𝗺𝗽𝗿𝗼𝘃𝗲𝗺𝗲𝗻𝘁 𝗙𝗹𝘆𝘄𝗵𝗲𝗲𝗹
Integrate evals into CI/CD, monitor production with bias-corrected success rates, and cycle through Analyze → Measure → Improve continuously. New failure modes in production feed back into your evaluation artifacts.
Evals are now a core skill for AI PMs. This is your map.
- - - - -
I learned this from Hamel Husain and Shreya Shankar. Get 35% off their course: https://lnkd.in/e5DSNJtM
📌 Want our step-by-step guide to evals? Comment 'steps' + DM me. Repost to cut the line.
➕ Follow Aakash Gupta to stay on top of AI x PM.
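For step 4, here is a hedged sketch of one code-based evaluator. The failure mode ("the response must cite a source URL") and the output fields are made-up examples chosen to show the Pass/Fail shape, not a prescribed schema.

```python
# Sketch of a code-based evaluator for one hypothetical failure mode:
# "response must cite at least one source URL". Pass/Fail, no LLM needed.
import re

def eval_cites_source(response: str) -> dict:
    has_url = bool(re.search(r"https?://\S+", response))
    return {"evaluator": "cites_source", "result": "Pass" if has_url else "Fail"}

# For subjective failure modes, an LLM-as-Judge evaluator would return the
# same structured shape, e.g.:
# {"evaluator": "tone_is_professional", "result": "Pass", "rationale": "..."}

print(eval_cites_source("See the docs at https://example.com for details."))  # Pass
print(eval_cites_source("Trust me, it works."))                               # Fail
```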
-
Booking.com released a guide on how they evaluate every single AI app they build.
Evaluating LLMs in prod is a different game than evaluating traditional ML models.
- Golden datasets matter: human-annotated data is still the foundation for building trustworthy judge-LLMs. Without reliable labels, automated evaluation breaks down.
- Annotation protocols are key: whether you go with a single annotator (basic) or multiple with consensus/weights (advanced), the consistency of annotation directly impacts evaluation quality.
- Judge-LLM can't be the same as target-LLM: a stronger LLM can be used to evaluate the outputs of another, allowing scalable and automated monitoring of GenAI systems.
- Pointwise vs. comparative judges: pointwise scoring works for production monitoring, but comparative evaluation (A vs. B) often provides stronger signals for ranking and system improvement.
- Automation + synthetic data are emerging directions: auto-prompt pipelines and synthetic golden datasets could significantly reduce the time and cost of judge-LLM development.
♻️ Share it with anyone who might benefit :)
Link to a visual + interactive version of the article: https://lnkd.in/gsDdX-h3
I regularly share AI Agents and RAG projects on my newsletter 𝑨𝑰 𝑬𝒏𝒈𝒊𝒏𝒆𝒆𝒓𝒊𝒏𝒈 𝑾𝒊𝒕𝒉 𝑺𝒂𝒓𝒕𝒉𝒂𝒌: https://lnkd.in/gaJTcZBR
Link to full article: https://lnkd.in/g3-qWFhB
#AI #GenAI #LLMs
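On the annotation-protocol point, a small illustration (my own sketch, not Booking.com's pipeline) of turning multiple annotators into a consensus golden label via majority vote, with ties flagged for adjudication.

```python
# Illustrative consensus labeling: majority vote across annotators,
# with ties sent back for adjudication. Not Booking.com's actual protocol.
from collections import Counter

def consensus(labels: list[str]) -> str:
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "disputed"  # tie: escalate to an adjudicator
    return counts[0][0]

print(consensus(["correct", "correct", "incorrect"]))  # correct
print(consensus(["correct", "incorrect"]))             # disputed
```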
-
Evaluating LLMs is not like testing traditional software.
Traditional systems are deterministic → pass/fail. LLMs are probabilistic → same input, different outputs, shifting behaviors over time. That makes model selection and monitoring one of the hardest engineering problems today.
This is where Eval Protocol (EP), developed by Fireworks AI, comes in. It’s an open-source framework for building an internal model leaderboard, where you can define, run, and track evals that actually reflect your business needs.
→ Simulated Users – generate synthetic but realistic user interactions to stress-test models under lifelike conditions.
→ evaluation_test – pytest-compatible evals (pointwise, groupwise, all) so you can treat model behavior like unit tests in CI/CD.
→ MCP Extensions – evaluate agents that use tools, multi-step reasoning, or multi-turn dialogue via the Model Context Protocol.
→ UI Review – a dashboard to visualize eval results, compare across models, and catch regressions before they ship.
Instead of relying on generic benchmarks, EP lets you encode your own success criteria and continuously measure models against them. If you’re serious about scaling LLMs in production, this is worth a look: evalprotocol.io
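To show the pytest-compatible idea in generic form, here is a plain pytest sketch of treating a model behavior check like a unit test. This is not the actual `evaluation_test` API from Eval Protocol (see evalprotocol.io for the real interface); `fake_model` and the test cases are placeholders.

```python
# Generic pytest-style eval sketch (NOT the Eval Protocol API -- see
# evalprotocol.io for the real `evaluation_test` interface).
# The point: model behavior checks can run in CI/CD like unit tests.
import pytest

def fake_model(prompt: str) -> str:
    # Placeholder model call; replace with your provider client.
    return "You can cancel within 24 hours for a full refund."

CASES = [
    ("How do I cancel my booking?", "cancel"),
    ("What is the refund window?", "24 hours"),
]

@pytest.mark.parametrize("prompt,required_phrase", CASES)
def test_answer_contains_required_phrase(prompt, required_phrase):
    answer = fake_model(prompt).lower()
    assert required_phrase in answer, f"missing '{required_phrase}' in: {answer}"
```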
-
Researchers at UC San Diego and Tsinghua just solved a major challenge in making LLMs reliable for scientific tasks: knowing when to use tools versus solving problems directly.
Their method, called Adapting While Learning (AWL), achieves this through a novel two-component training approach:
(1) World knowledge distillation - the model learns to solve problems directly by studying tool-generated solutions
(2) Tool usage adaptation - the model learns to intelligently switch to tools only for complex problems it can't solve reliably
The results are impressive:
* 28% improvement in answer accuracy across scientific domains
* 14% increase in tool usage precision
* Strong performance even with 80% noisy training data
* Outperforms GPT-4 and Claude on custom scientific datasets
Current approaches either make LLMs over-reliant on tools or prone to hallucinations when solving complex problems. This method mimics how human experts work - first assessing if they can solve a problem directly before deciding to use specialized tools.
Paper https://lnkd.in/g37EK3-m
—
Join thousands of world-class researchers and engineers from Google, Stanford, OpenAI, and Meta staying ahead on AI http://aitidbits.ai
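A toy sketch of the switching behavior the paper trains for: solve directly when confident, fall back to a tool otherwise. The confidence estimate and threshold below are my own simplifications for illustration, not the AWL training procedure.

```python
# Toy illustration of "use tools only when needed" -- not the AWL method
# itself, which trains this behavior via distillation and adaptation.

def self_confidence(problem: str) -> float:
    # Hypothetical stand-in, e.g. derived from token log-probs or a self-rating.
    return 0.9 if len(problem) < 40 else 0.3

def solve_directly(problem: str) -> str:
    return f"direct answer to: {problem}"

def solve_with_tool(problem: str) -> str:
    return f"tool-assisted answer to: {problem}"

def answer(problem: str, threshold: float = 0.7) -> str:
    if self_confidence(problem) >= threshold:
        return solve_directly(problem)
    return solve_with_tool(problem)  # escalate hard problems to a solver/simulator

print(answer("What is 12 * 9?"))
print(answer("Simulate the orbital decay of a satellite with atmospheric drag over 10 years."))
```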
-
More teams are starting to rely on LLMs as judges for evaluation, but the raw scores they produce can be statistically misleading because the judge mislabels both correct and incorrect answers. “How to Correctly Report LLM-as-a-Judge Evaluations” shows that naive accuracy estimates are biased and can overstate or understate performance depending on the judge model’s sensitivity and specificity. The researchers designed a plug-in estimator and confidence interval that correct this bias by incorporating uncertainty from both the test dataset and a calibration dataset with ground-truth labels. They also introduced an adaptive calibration method that allocates samples where the evaluator is weakest, which reduces variance and yields more reliable accuracy reporting. This type of rigor will matter as more organizations depend on LLM-based evaluation at scale. Do give it a read. Human evaluators are expensive and not scalable! #LLM
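The paper's exact estimator and confidence intervals are worth reading in full; as a flavor of the idea, here is the classic misclassification correction (a Rogan-Gladen-style estimator) that recovers true accuracy from a judge's observed pass rate once the judge's sensitivity and specificity have been estimated on a labeled calibration set. The numbers in the example are made up.

```python
# Classic misclassification correction, shown as a flavor of the idea --
# read the paper for its plug-in estimator and confidence intervals.

def corrected_accuracy(observed_pass_rate: float,
                       judge_sensitivity: float,
                       judge_specificity: float) -> float:
    """Recover true accuracy from the judge's observed pass rate.

    observed = true*sens + (1 - true)*(1 - spec)  =>  solve for `true`.
    Sensitivity and specificity come from a calibration set with
    ground-truth labels.
    """
    denom = judge_sensitivity + judge_specificity - 1
    if denom <= 0:
        raise ValueError("judge is no better than random; correction undefined")
    corrected = (observed_pass_rate + judge_specificity - 1) / denom
    return min(max(corrected, 0.0), 1.0)  # clip to [0, 1]

# Judge says 80% pass, but it only catches 90% of true passes and
# wrongly passes 15% of true failures:
print(corrected_accuracy(0.80, judge_sensitivity=0.90, judge_specificity=0.85))  # ~0.87
```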
-
RAG is good. Evaluation makes it better. The question is: 𝐇𝐨𝐰 𝐝𝐨 𝐲𝐨𝐮 𝐞𝐯𝐚𝐥𝐮𝐚𝐭𝐞 𝐚 𝐑𝐀𝐆 𝐬𝐲𝐬𝐭𝐞𝐦?
Retrieval-augmented generation (RAG) systems change how LLMs generate responses by integrating real-time data retrieval into the generative process. But how do we ensure these systems are reliable in production?
At the core of every RAG system are two key components:
1️⃣ Retriever: Identifies relevant information in a vector database using similarity search.
2️⃣ Generator: Combines the retrieved documents with the user query to produce an accurate response.
For RAG to work seamlessly, both components must perform optimally. This is where evaluation comes in.
One way to evaluate RAG systems is the TRIAD framework from TruLens, which consists of three metrics:
🔹 Context Relevance: Ensures retrieved documents align with the query.
🔹 Faithfulness (Groundedness): Verifies that the response is factually accurate and grounded in the retrieved documents.
🔹 Answer Relevance: Measures how well the response addresses the query.
But here’s the challenge: traditional evaluation requires significant data collection and ground truth, which can be resource-intensive. Enter LLM-as-a-Judge, a faster, cost-effective alternative to human evaluation.
𝐇𝐨𝐰 𝐃𝐨𝐞𝐬 𝐋𝐋𝐌-𝐚𝐬-𝐚-𝐉𝐮𝐝𝐠𝐞 𝐖𝐨𝐫𝐤? The LLM evaluates generated outputs based on predefined guidelines. It can assess:
✅ Context Relevance: Is the retrieved document relevant to the query?
✅ Faithfulness: Is the response factually accurate and grounded?
✅ Answer Relevance: Does the response address the query effectively?
𝐖𝐡𝐲 𝐃𝐨𝐞𝐬 𝐋𝐋𝐌-𝐚𝐬-𝐚-𝐉𝐮𝐝𝐠𝐞 𝐖𝐨𝐫𝐤? Critiquing text is inherently easier than generating it. By leveraging the LLM’s classification capabilities, we can evaluate RAG systems effectively, even in production environments.
𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 𝐒𝐭𝐫𝐚𝐭𝐞𝐠𝐢𝐞𝐬 𝐰𝐢𝐭𝐡 𝐋𝐋𝐌-𝐚𝐬-𝐚-𝐉𝐮𝐝𝐠𝐞:
🔸 Pairwise Comparison: Compare two responses and choose the better one.
🔸 Reference-Free Evaluation: Assess responses based on criteria like tone, bias, or correctness.
🔸 Reference-Based Evaluation: Judge responses against a reference document or context.
While LLM-as-a-Judge isn’t perfect, it provides a robust framework for improving RAG systems (a minimal sketch of the three TRIAD checks follows below).
Key Takeaways:
🔑 RAG systems require rigorous evaluation to ensure reliability in production.
🔑 The TRIAD framework offers a structured approach to evaluating context relevance, faithfulness, and answer relevance.
🔑 LLM-as-a-Judge is a powerful tool for scalable, cost-effective evaluation.
If you want to know more about building a RAG evaluation system with LLM-as-a-Judge, I recently wrote about it in my latest newsletter post.
✍️Article Link: https://lnkd.in/gMpmWFj3
🔗RAG-To-Know Repository: https://lnkd.in/gQqqQd2a
What are your thoughts on using LLMs to evaluate RAG systems? Let’s discuss it!
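Here is the promised compact sketch of running the three TRIAD-style checks with an LLM judge. `call_llm` is a hypothetical placeholder and the 1-5 rubric prompts are illustrative assumptions, not TruLens's implementation.

```python
# Illustrative TRIAD-style checks via an LLM judge (not TruLens's code).
# `call_llm` is a hypothetical placeholder returning a 1-5 score as text.

def call_llm(prompt: str) -> str:
    return "4"  # swap in your model client

TEMPLATES = {
    "context_relevance": "Rate 1-5 how relevant this context is to the query.\nQuery: {query}\nContext: {context}",
    "faithfulness": "Rate 1-5 how well the answer is grounded in the context.\nContext: {context}\nAnswer: {answer}",
    "answer_relevance": "Rate 1-5 how well the answer addresses the query.\nQuery: {query}\nAnswer: {answer}",
}

def triad_scores(query: str, context: str, answer: str) -> dict:
    # Each metric gets its own judge call with a dedicated rubric prompt.
    return {name: int(call_llm(tpl.format(query=query, context=context, answer=answer)))
            for name, tpl in TEMPLATES.items()}

print(triad_scores("When was the Eiffel Tower built?",
                   "The Eiffel Tower was completed in 1889.",
                   "It was completed in 1889."))
```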