This paper by Apple researchers (Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu (Janet) Li, Guoli Yin, Zirui Wang, Ruoming Pang) introduces TOOLSANDBOX, a benchmark for evaluating how well LLMs handle stateful, conversational, and interactive tool-use tasks, and for surfacing the capabilities and limitations of these models. The tools range from APIs and databases to simple functions such as checking the weather or making a restaurant reservation.

Key concepts in the paper:
- Stateful tasks: the model must remember actions or decisions it made earlier in the conversation. For example, if it turned on the internet in a previous step, it should know the internet is now on and not try to turn it on again.
- Tool use: the model must know when and how to use different tools. Some tasks require several tools in sequence, and the model has to work out the correct order and timing.
- Evaluation: the benchmark tests models on scenarios that require multiple steps, state management, and decision-making with limited information.

The paper concludes that while models handle simple tasks increasingly well, they still struggle with complex scenarios that require chaining multiple tools, remembering previous actions, and deciding with incomplete information. This helps pinpoint where current models fall short and where improvements are needed. In particular, the paper highlights how hard it is for models like Mistral and Hermes to recognize when to issue a tool call; Mistral, for example, often treats a tool-use scenario as a code generation task, which hurts its performance. GPT-4o and Claude-3-Opus are also evaluated, with GPT-4o achieving the highest similarity score, although both models struggle with complex tool-call sequences. The recurring challenges are managing tasks that depend on prior state, using tools consistently across contexts, and handling situations with incomplete data or on-the-fly decision-making.

TOOLSANDBOX is compared with other benchmarks such as BFCL (Berkeley Function Calling Leaderboard), ToolEval, and API-Bank. While those also target tool-use capabilities, TOOLSANDBOX is distinguished by its focus on stateful, interactive, and conversational tool use, along with a human-authored ground truth for evaluation. Even the most advanced state-of-the-art LLMs struggle with the complex tasks it poses, underscoring how hard it is to make LLMs effective tool users in real-world scenarios.
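To make the "stateful" idea concrete, here is a minimal sketch, not taken from the TOOLSANDBOX code or API; the names WorldState, enable_cellular, and send_message are illustrative. It shows a tool whose success depends on state set by an earlier tool call, which the model must infer and establish itself.

```python
from dataclasses import dataclass, field

@dataclass
class WorldState:
    """Toy world state shared across the whole conversation."""
    cellular_on: bool = False
    messages_sent: list = field(default_factory=list)

def enable_cellular(state: WorldState) -> str:
    state.cellular_on = True
    return "Cellular service enabled."

def send_message(state: WorldState, to: str, body: str) -> str:
    # A stateful dependency: this tool only works if a prior tool
    # call has already switched cellular service on.
    if not state.cellular_on:
        return "Error: cellular service is off. Enable it first."
    state.messages_sent.append((to, body))
    return f"Message sent to {to}."

state = WorldState()
print(send_message(state, "Alice", "Running late"))  # fails: missing precondition
print(enable_cellular(state))                        # implicit step the agent must infer
print(send_message(state, "Alice", "Running late"))  # now succeeds
```

An evaluator in this style can compare the final world state and the sequence of tool calls against a human-authored ground-truth trajectory, which is in the spirit of the evaluation the summary describes.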
Challenges with AI Reasoning Benchmarks
Summary
AI reasoning benchmarks are tools designed to evaluate how well artificial intelligence (AI) models perform reasoning tasks, often in complex, real-world scenarios. However, challenges such as managing memory, ensuring task consistency, and evaluating true reasoning abilities make these benchmarks difficult to design and execute effectively.
- Focus on realistic tasks: Ensure benchmarks assess real-world scenarios requiring stateful, multi-step reasoning, instead of relying heavily on artificial or rule-based puzzles.
- Address evaluation gaps: Create methods that measure both accuracy and adaptability in AI systems, especially for tasks requiring tool use, memory, or long-term planning.
- Mitigate data contamination: Develop benchmarks using fresh, unseen data to prevent training overlap, ensuring more accurate and fair AI evaluation.
-
📢 New paper: AI Agents That Matter 📢

Summary: we find pervasive shortcomings in the state of AI agent evaluation. We show how to incorporate cost into agent evaluation and optimization. We also show the importance of being precise about what a benchmark aims to measure and ensuring that it actually measures that. Our overarching goal is to stimulate the development of agents that are useful in the real world and not just accurate on benchmarks.

Abstract: https://lnkd.in/e6stMh-f PDF: https://lnkd.in/e8apuwrZ HTML: https://lnkd.in/eh4bJwjc

By Sayash Kapoor, Benedikt Ströbl, Zachary Siegel, Nitya Nadgir, and me.

Why agents? Some of the most exciting applications of large language models involve taking real-world action, such as booking flight tickets or finding and fixing software bugs. AI systems that carry out such tasks are called agents. They use LLMs in combination with other software to use tools such as web search and code terminals.

What's the problem? To measure the effectiveness of AI agents on real-world tasks, researchers have created benchmark datasets. But as we’ve said before, LLM evaluation is a minefield (https://lnkd.in/ekZguXXq). It turns out that agent evaluation has a bunch of additional pitfalls that affect today’s benchmarks and evaluation practices. This state of affairs encourages the development of agents that do well on benchmarks without being useful in practice.

The paper makes five empirical contributions:
– Cost-controlled evaluation yields a reassessment of leaderboard-topping agents.
– Jointly optimizing accuracy and cost can yield better agent design.
– Model developers and downstream developers have distinct benchmarking needs.
– Agent benchmarks enable shortcuts.
– Agent evaluations lack standardization and reproducibility.

Conclusion: AI agent benchmarking is new and best practices haven't yet been established, making it hard to distinguish genuine advances from hype. We think agents are sufficiently different from models that benchmarking practices need to be rethought. In our paper, we take the first steps toward a principled approach to agent benchmarking. We hope these steps will raise the rigor of AI agent evaluation and provide a firm foundation for progress.
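One concrete reading of the "jointly optimizing accuracy and cost" point is to compare agents on a cost-accuracy Pareto frontier rather than on accuracy alone. A rough sketch with invented numbers; the agent names and figures below are illustrative, not results from the paper.

```python
# Illustrative (cost per task in USD, accuracy) pairs -- not real benchmark results.
agents = {
    "simple_baseline":  (0.02, 0.61),
    "retry_baseline":   (0.05, 0.68),
    "complex_pipeline": (1.40, 0.67),
}

def pareto_frontier(results):
    """Keep agents that are not dominated by a cheaper, at-least-as-accurate agent."""
    frontier = {}
    for name, (cost, acc) in results.items():
        dominated = any(
            c <= cost and a >= acc and (c, a) != (cost, acc)
            for c, a in results.values()
        )
        if not dominated:
            frontier[name] = (cost, acc)
    return frontier

print(pareto_frontier(agents))
# {'simple_baseline': (0.02, 0.61), 'retry_baseline': (0.05, 0.68)}
# The complex pipeline is dominated: it costs ~28x more and is no more accurate,
# which an accuracy-only leaderboard would hide.
```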
-
Apple’s new paper, “The Illusion of Thinking,” is making waves for its critique of how today’s AI models reason. It’s a well-constructed piece, but also a deeply narrow one. Apple argues that large language models only appear to think, relying on superficial patterns rather than true reasoning. But what if the real illusion isn’t in the models… it’s in the tests designed to measure them?

The authors focus on synthetic, rule-based puzzles like Tower of Hanoi and River Crossing to argue that LLMs collapse under complexity. But these tests bear little resemblance to how reasoning unfolds in the real world, whether in business, research, or agentic AI systems. Ironically, the paper shows models perform well on low- to medium-complexity tasks, the exact space where most practical value is being delivered today. And it completely misses how frontier models perform on meaningful, real-world reasoning benchmarks.

In Stanford HAI’s AI Index 2025, GPT-4 Turbo and Claude Opus achieve near-human accuracy on MATH, GSM8K, and AIME: tough arithmetic, logic, and math Olympiad-style reasoning benchmarks. These aren’t puzzles; they’re tests of structured thinking, inference, and abstraction.

Even more importantly, LLMs are now creating, not just predicting. A 2024 Stanford study involving over 100 NLP researchers found that GPT-4 and Claude-2 generated scientific research ideas rated significantly more novel than those proposed by human experts. In 2025, a large-scale creativity benchmark tested Claude and GPT-4 across 13 divergent thinking tasks, showing they performed at the 52nd percentile of human participants, matching the creativity of individual humans and small teams. In applied research contexts, agentic systems autonomously read papers, form hypotheses, write code, and iterate experiments, showcasing structured, goal-driven reasoning in open scientific workflows.

While it’s true that many real-world tasks contain patterns, the consistent success of frontier models across math, code, research, and long-form analysis reflects more than mimicry. These models increasingly show abstraction, flexible reasoning, and adaptive behaviors, especially in agentic frameworks that involve memory, goal-setting, and tool use. Their ability to self-correct, synthesize, and operate in open-ended, ambiguous situations points to genuine cognitive capability, not just clever pattern reuse. The trajectory here suggests that scalable reasoning is emerging and accelerating. And when used in agentic patterns, these models are already:
- Designing and running ML experiments autonomously
- Building internal workflows
- Reasoning over long horizons with purpose and feedback

If Apple measured human reasoning like they measured LLMs, even Einstein might have failed the test. IMO, this paper says more about Apple’s position in the AI race than it does about the nature of reasoning. The real illusion is believing intelligence can be measured by rule-bound puzzles alone.
-
Agentic AI promises autonomous problem-solving, but it also brings tough technical challenges. Here are four key pitfalls that researchers are grappling with, both in theory and practice:

Evaluation in open-ended tasks: Traditional AI benchmarks (accuracy, QA tests, etc.) fall short for agents operating in dynamic, multi-step environments. An agent might need to plan, use tools, remember context, and adapt – aspects that static benchmarks don’t capture. New evaluation methods (e.g. simulation-based benchmarks like AgentBench or CAMEL) aim to measure goal completion, adaptability, and long-horizon reasoning instead of one-shot answers.

Loops & long-horizon planning: Autonomy means running iteratively towards a goal – but without robust control, agents can spiral into endless loops. Early experiments (e.g. AutoGPT) famously got stuck repeating tasks infinitely due to limited memory of past actions. In general, long-horizon planning remains brittle; many agents struggle to stay stable and recover from errors over extended sequences.

Hallucinations & grounding: Agents built on large language models can hallucinate – confidently generating false information. In a multi-agent system this is even riskier: one agent’s mistake can propagate to others, causing cascading errors across the entire system. Mitigating this requires grounding the agent in real-world context. Techniques like retrieval-augmented generation (tool use, web search, databases) let the agent verify facts with up-to-date data, reducing hallucinations and enhancing trust.

Safe termination criteria: When does the agent know a task is done? Defining clear stop conditions is critical to avoid runaway behavior. Common strategies include goal completion checks and rule-based limits (e.g. max iterations or timeouts) to prevent endless operations (see the control-loop sketch after this post). Without reliable termination criteria, an agent might waste resources or even go off-track instead of gracefully stopping when appropriate.

Each of these challenges highlights how agentic AI is harder than it looks. They’re sparking lively debates on evaluation standards, control mechanisms, and safety protocols for autonomous AI. How is your team addressing these issues? Are there other obstacles or solutions you find crucial? Let’s discuss – the path to truly reliable AI agents will require tackling all of the above.
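As referenced in the termination item above, here is a hedged sketch of a generic agent control loop with explicit stop conditions; run_step and goal_met are placeholders for whatever the agent and its success checker actually do, not any particular framework's API.

```python
import time

MAX_STEPS = 10          # rule-based limit on iterations
TIMEOUT_SECONDS = 60    # wall-clock budget

def run_agent(task, run_step, goal_met):
    """Generic control loop with three stop conditions:
    goal completion, step budget, and timeout."""
    history = []
    start = time.monotonic()
    for _ in range(MAX_STEPS):
        if time.monotonic() - start > TIMEOUT_SECONDS:
            return {"status": "timeout", "history": history}
        action, observation = run_step(task, history)
        history.append((action, observation))
        # Crude loop detection: stop if the agent repeats its last action verbatim.
        if len(history) >= 2 and history[-1][0] == history[-2][0]:
            return {"status": "stuck_in_loop", "history": history}
        if goal_met(task, history):
            return {"status": "done", "history": history}
    return {"status": "step_budget_exhausted", "history": history}
```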
-
Many companies have started experimenting with simple RAG systems, probably as their first use case, to test the effectiveness of generative AI in extracting knowledge from unstructured data like PDFs, text files, and PowerPoint files. If you've used basic RAG architectures with tools like LlamaIndex or LangChain, you might have already encountered three key problems:

𝟭. 𝗜𝗻𝗮𝗱𝗲𝗾𝘂𝗮𝘁𝗲 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗠𝗲𝘁𝗿𝗶𝗰𝘀: Existing metrics fail to catch subtle errors like unsupported claims or hallucinations, making it hard to accurately assess and enhance system performance.

𝟮. 𝗗𝗶𝗳𝗳𝗶𝗰𝘂𝗹𝘁𝘆 𝗛𝗮𝗻𝗱𝗹𝗶𝗻𝗴 𝗖𝗼𝗺𝗽𝗹𝗲𝘅 𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻𝘀: Standard RAG methods often struggle to find and combine information from multiple sources effectively, leading to slower responses and less relevant results.

𝟯. 𝗦𝘁𝗿𝘂𝗴𝗴𝗹𝗶𝗻𝗴 𝘁𝗼 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗮𝗻𝗱 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗶𝗼𝗻𝘀: Basic RAG approaches often miss the deeper relationships between information pieces, resulting in incomplete or inaccurate answers that don't fully meet user needs.

In this post I will introduce three useful papers to address these gaps:

𝟭. 𝗥𝗔𝗚𝗖𝗵𝗲𝗰𝗸𝗲𝗿: introduces a new framework for evaluating RAG systems with a focus on fine-grained, claim-level metrics. It proposes a comprehensive set of metrics: claim-level precision, recall, and F1 score to measure the correctness and completeness of responses; claim recall and context precision to evaluate the effectiveness of the retriever; and faithfulness, noise sensitivity, hallucination rate, self-knowledge reliance, and context utilization to diagnose the generator's performance. Consider using these metrics to help identify errors, enhance accuracy, and reduce hallucinations in generated outputs (a toy sketch of the claim-level scores follows after this post).

𝟮. 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁𝗥𝗔𝗚: It uses a labeler and filter mechanism to identify and retain only the most relevant parts of retrieved information, reducing the need for repeated large language model calls. This iterative approach refines search queries efficiently, lowering latency and costs while maintaining high accuracy for complex, multi-hop questions.

𝟯. 𝗚𝗿𝗮𝗽𝗵𝗥𝗔𝗚: By leveraging structured data from knowledge graphs, GraphRAG methods enhance the retrieval process, capturing complex relationships and dependencies between entities that traditional text-based retrieval methods often miss. This approach enables the generation of more precise and context-aware content, making it particularly valuable for applications in domains that require a deep understanding of interconnected data, such as scientific research, legal documentation, and complex question answering. For example, in tasks such as query-focused summarization, GraphRAG demonstrates substantial gains by effectively leveraging graph structures to capture local and global relationships within documents.

It's encouraging to see how quickly gaps are identified and improvements are made in the GenAI world.
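A toy illustration of the claim-level scores mentioned under RAGChecker. It assumes the response and the ground truth have already been decomposed into atomic claims and that some entailment checker decides support; both are stand-ins here, and this is not RAGChecker's actual implementation, just the standard precision/recall/F1 arithmetic applied at the claim level.

```python
def claim_level_scores(response_claims, gt_claims, entailed_by):
    """response_claims, gt_claims: lists of atomic claim strings.
    entailed_by(claim, claims): True if `claim` is supported by `claims`
    (in practice an LLM or NLI model; here just a stand-in)."""
    correct = [c for c in response_claims if entailed_by(c, gt_claims)]
    recalled = [c for c in gt_claims if entailed_by(c, response_claims)]
    precision = len(correct) / len(response_claims) if response_claims else 0.0
    recall = len(recalled) / len(gt_claims) if gt_claims else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy check with exact string match standing in for entailment:
naive_entail = lambda claim, claims: claim in claims
print(claim_level_scores(
    ["Paris is in France", "Paris has 10M people"],
    ["Paris is in France", "Paris is the capital of France"],
    naive_entail,
))  # precision 0.5, recall 0.5, f1 0.5
```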
-
Are today’s AI models truly reasoning, or are we mistaking structured output for understanding? Apple’s recent research on Large Reasoning Models (LRMs) critically examines state-of-the-art models like OpenAI’s o3, Claude 3.7, and Gemini. Key takeaways: - Performance collapses when problem complexity increases, regardless of model size or compute - Chain-of-thought prompting often creates an illusion of reasoning without true abstraction - Reasoning traces are inconsistent and inefficient, with models overthinking and still failing - Popular benchmarks reward pattern recognition, not genuine reasoning These findings challenge the notion that current architectures are close to human-like reasoning and suggest we may need fundamentally new approaches to make meaningful progress. How are you thinking about these limitations and what would meaningful progress in reasoning look like from here? #AI #LLMs #Reasoning #AppleResearch
-
A barrier to faster progress in generative AI is evaluations (evals), particularly of custom AI applications that generate free-form text. Let’s say you have a multi-agent research system that includes a researcher agent and a writer agent. Would adding a fact-checking agent improve the results? If we can’t efficiently evaluate the impact of such changes, it’s hard to know which changes to keep.

For evaluating general-purpose foundation models such as large language models (LLMs) — which are trained to respond to a large variety of prompts — we have standardized tests like MMLU (multiple-choice questions) and HumanEval (testing code generation); the LMSYS Chatbot Arena, which pits two LLMs’ responses against each other and asks a human to judge which response is superior; and large-scale benchmarking like HELM. These evaluation tools are invaluable for giving LLM users a sense of different models' relative performance. Nonetheless, they have limitations: for example, leakage of benchmark datasets’ questions and answers into training data is a constant worry, and human preference for certain answers does not mean those answers are more accurate.

In contrast, our current options for evaluating specific applications built using LLMs are far more limited. Here, I see two major types of applications.

- For applications designed to deliver unambiguous, right-or-wrong responses, we have reasonable options. Let’s say we want an LLM to read a resume and extract the candidate's most recent job title, or read a customer email and route it to the right department. We can create a test set that comprises ground-truth labeled examples with the right responses, and measure the percentage of times the LLM generates the right output. The main bottleneck is creating the labeled test set, which is expensive but surmountable.
- But many LLM-based applications generate free-text output with no single right response. For example, if we ask an LLM to summarize customer emails, there’s a multitude of possible good (and bad) responses. The same holds for a system to do web research and write an article about a topic, or a RAG system for answering questions. It’s impractical to hire an army of human experts to read the LLM’s outputs every time we tweak the algorithm and evaluate if the answers have improved — we need an automated way to test the outputs.

Thus, many teams use an advanced language model to evaluate outputs. In the customer email summarization example, we might design an evaluation rubric (scoring criteria) for what makes a good summary. Given an email summary generated by our system, we might prompt an advanced LLM to read it and score it according to our rubric. I’ve found that the results of such a procedure, while better than nothing, can also be noisy — sometimes too noisy to reliably tell me if the way I’ve tweaked an algorithm is good or bad.

[Reached LinkedIn's length limit. Rest of text: https://lnkd.in/gQEDtSr7 ]
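A minimal sketch of the rubric-based LLM-as-judge setup described above. The rubric wording and the complete(prompt) helper are assumptions standing in for a call to a strong model; averaging several independent judgments is one simple way to tame the noise the post mentions.

```python
import json

RUBRIC = """Score the summary from 1-5 on each criterion and return JSON:
- coverage: mentions the customer's main request and any deadline
- faithfulness: contains no claims absent from the original email
- brevity: at most three sentences
Return: {"coverage": int, "faithfulness": int, "brevity": int, "rationale": str}"""

def judge_summary(email: str, summary: str, complete) -> dict:
    """`complete(prompt) -> str` is a placeholder for a call to an advanced LLM."""
    prompt = f"{RUBRIC}\n\nOriginal email:\n{email}\n\nCandidate summary:\n{summary}"
    return json.loads(complete(prompt))

def mean_judge(email, summary, complete, n=5):
    """Average n independent judgments to reduce scoring noise."""
    scores = [judge_summary(email, summary, complete) for _ in range(n)]
    keys = ("coverage", "faithfulness", "brevity")
    return {k: sum(s[k] for s in scores) / n for k in keys}
```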
-
Meta and HKUST’s new RAG benchmark shows how far we are from fully trustworthy question-answering systems. The new benchmark, called Comprehensive RAG Benchmark (CRAG), addresses the shortcomings of existing benchmarks by offering a diverse set of ~4.5k QA pairs across five domains: Finance, Sports, Music, Movie, and Open domain. It also simulates real-world scenarios using mock APIs for web and Knowledge Graph searches.

Study findings:
1. Advanced LLMs like GPT-4 and Claude achieve only up to 34% accuracy on CRAG
2. Adding RAG straightforwardly improves accuracy to just 44%
3. State-of-the-art industry RAG solutions like Perplexity answer only 63% of questions without hallucination

CRAG also reveals much lower accuracy for facts that change quickly over time (like stock prices and sports scores), less popular entities like lesser-known athletes and niche music artists, and complex questions that require multi-step reasoning. These findings surface the limitations of today's RAG methods and suggest crucial directions for future research.

Paper https://lnkd.in/gs87Y8Sm

— Join thousands of world-class researchers and engineers from Google, Stanford, OpenAI, and Meta staying ahead on AI http://aitidbits.ai
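One way to make the accuracy-versus-hallucination framing above concrete is a three-way scoring scheme that rewards correct answers, penalizes hallucinated ones, and treats declined answers as neutral. The label counts below are made up for illustration and are not CRAG's reported splits.

```python
def score_answers(judgments):
    """judgments: list of labels in {"correct", "hallucinated", "missing"}
    ("missing" = the system declined to answer)."""
    n = len(judgments)
    accuracy = judgments.count("correct") / n
    hallucination = judgments.count("hallucinated") / n
    missing = judgments.count("missing") / n
    # Truthfulness-style aggregate: a wrong answer is worse than no answer.
    truthfulness = accuracy - hallucination
    return {"accuracy": accuracy, "hallucination": hallucination,
            "missing": missing, "truthfulness": truthfulness}

# Made-up label counts, loosely echoing the 34% accuracy figure above:
print(score_answers(["correct"] * 34 + ["hallucinated"] * 40 + ["missing"] * 26))
# accuracy 0.34, hallucination 0.40, missing 0.26 -> truthfulness ~ -0.06
```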
-
Imagine a world where some people cheat, but there’s no way to tell. And then suddenly, there is.

The problem with LLM benchmarks is that they are contaminated. Model makers know the “answers” ahead of time, and train on them. This may not even be intentional, but it is hard to avoid, because lots of people write about LLM benchmarks, and they get copied into different places that are then swept up in the maw of a model’s web scraper. Whether it happens by accident or on purpose, it is the same as feeding your “test” data into your training set. Also the same as cheating on a test. You end up with a high test score that doesn’t generalize to new questions.

Researchers from UC Berkeley, MIT, and Cornell have a new benchmark called LiveCodeBench that tries to get around this problem. They do so by continuously sourcing fresh questions from code competition sites like LeetCode. The questions are tagged with their release date. When you test an LLM, you only include code problems released after its knowledge cutoff. Voilà.

- GPT-4 and Claude Opus top the new, decontaminated charts
- DeepSeek shows significant overfitting. This model powers DeepSeek Coder.
- Google’s Gemini Pro is in the middle of the pack

Here’s the full chart: Paper: https://buff.ly/4aCb9Pw Code: https://buff.ly/4aUWlLD

#OpenSource #AI #MachineLearning #GPT4 #Benchmarks #LLMBenchmarks #Overfitting #CodingBenchmark #LeetCode #KnowledgeCutoff #ModelEvaluation #CheatingDetection #ContaminatedData #ArtificialIntelligence
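The decontamination mechanism described above is simple to express in code: keep only problems released after the model's knowledge cutoff. A sketch with invented problem records and an assumed cutoff date, for illustration only.

```python
from datetime import date

problems = [  # illustrative records, not actual LiveCodeBench data
    {"id": "two-sum-variant",  "released": date(2023, 3, 1)},
    {"id": "grid-paths-mod",   "released": date(2024, 2, 15)},
    {"id": "interval-merge-k", "released": date(2024, 5, 3)},
]

def decontaminated(problems, knowledge_cutoff: date):
    """Keep only problems released after the model's training cutoff,
    so the model cannot have seen them (or their editorials) in training."""
    return [p for p in problems if p["released"] > knowledge_cutoff]

model_cutoff = date(2023, 12, 1)  # assumed cutoff, for illustration only
print([p["id"] for p in decontaminated(problems, model_cutoff)])
# ['grid-paths-mod', 'interval-merge-k']
```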
-
Imagine you're interviewing a brilliant job candidate who aces every question... until you throw them a curveball. Suddenly, their performance plummets. This scenario isn't just happening in human interviews—it's playing out in the world of AI, and the results are eye-opening.

Recent findings from LMSYS, a leading LLM benchmarking organization, have revealed a startling gap in the performance of many popular AI models when faced with more challenging tasks (check out the drop on Llama in the graph below). They have actually added a new category called "Hard Prompts" to measure this. Blog post - https://lnkd.in/ewG3TGaD

While some models shine consistently, others struggle when the going gets tough. Key takeaways:
↳ Performance Divide: Many well-known models, including Google's Gemini 1.5 Pro and Meta's Llama 3, showed significant drops in performance when tackling harder prompts requiring complex reasoning.
↳ Consistency Champions: OpenAI's GPT-4o and Anthropic's Claude 3 Opus maintained their high performance across both normal and challenging queries.
↳ The Reasoning Gap: This discrepancy suggests that many current LLMs may still struggle with advanced reasoning tasks, despite their impressive performance on standard benchmarks.
↳ Synthetic Data Advantage: Industry experts speculate that the use of synthetic training data might be giving some companies an edge in developing more robust AI reasoning capabilities.
↳ Beyond English: As the AI landscape globalizes, evaluations are expanding to include performance in non-English languages, with Chinese and Russian seeing significant usage.
↳ Closing the Gap: Open-weight models are rapidly catching up to their closed-weight counterparts, complicating the evaluation process.

What this means for the future of AI: As AI continues to evolve, we may need to rethink how we evaluate and choose LLMs. ☑️ Future assessments might focus more on subjective qualities like creativity, humor, or articulateness—factors that are increasingly important in real-world applications. This shift is already visible in the market, with companies like xAI's "anti-woke" Grok model and OpenAI's "flirty" GPT-4o voice showcasing how personality and user experience are becoming key differentiators.

As AI becomes more integrated into our daily lives and business operations, understanding these nuances in performance and capabilities is crucial. The next time you're considering an AI solution, remember: not all models are created equal, especially when it comes to tackling the tough stuff.

What's your take on this evolving AI landscape? Have you noticed differences in AI performance in your work? Share your thoughts and experiences in the comments below!