Chain-of-Agents

Interesting idea: train a single model with the capabilities of a multi-agent system. 84.6% reduction in inference cost! Distillation and agentic RL are no joke! Here are my notes:

● Overview
This work trains single models to natively behave like multi-agent systems, coordinating "role-playing" and tool agents end-to-end. Strong multi-agent frameworks are distilled into Chain-of-Agents (CoA) trajectories, which are then optimized with agentic RL on verifiable tasks.

● Paradigm shift
CoA generalizes ReAct/TIR by dynamically activating multiple roles and tools within one model, preserving a single coherent state while cutting inter-agent chatter.

● Training recipe
1) Multi-agent distillation turns successful OAgents runs into CoA-formatted traces with planning, tool calls, observations, and reflection, filtered for difficulty and quality.
2) Agentic RL targets hard queries where tools matter, with simple binary rewards: LLM-as-Judge for web tasks and executable or exact-match rewards for code/math (see the sketches after these notes).

● Training framework
Stage 1 (SFT) uses reformatted ReAct-style data (both short and long reasoning chains) to give the model a solid "cold start." Progressive filtering keeps only high-quality trajectories, emphasizing coherence, tool efficiency, and reflective reasoning.
Stage 2 (RL) builds on the SFT base. The model performs tool-aware rollouts on unused QA pairs; rewards are computed from task correctness (via LLM-as-Judge, exact match, or test cases), and policy updates improve tool coordination and reasoning robustness.

● Main results
With Qwen-2.5-32B backbones, Agent Foundation Models (AFM) set new pass@1 records: GAIA 55.3, BrowseComp 11.1, HLE 18.0, and WebWalker 63.0; they also top multi-hop QA suites across model sizes.

● Code + math
AFM-RL-32B reaches AIME25 59.8, MATH500 94.6, OlympiadBench 72.1, and LiveCodeBench v5 47.9, beating prior TIR methods including ReTool and ReVeal.

● Efficiency and robustness
Compared to traditional multi-agent systems, AFM cuts inference tokens and tool calls substantially; the paper reports an 84.6% token-cost reduction while staying competitive. It also generalizes better to unseen tools when strict output formatting is required.

● Test-time scaling
Best-of-3 and pass@3 markedly boost AFM, e.g., GAIA 69.9 and HLE 33.2, closing the gap with larger proprietary agent stacks.

Overall, Chain-of-Agents enables training single-agent foundation models that natively simulate multi-agent collaboration, combining multi-agent distillation with agentic RL to achieve state-of-the-art results.
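To make the reward design in the RL stage concrete, here is a minimal sketch of how the binary, verifiable rewards described above could be wired up. Names like `Rollout`, `llm_judge_reward`, and `compute_reward` are my own illustrative choices, not the paper's code, and a real harness would sandbox code execution and use more careful answer normalization.

```python
# Minimal sketch (my own, not the paper's implementation) of the binary reward
# scheme used in the agentic-RL stage: LLM-as-Judge for open-ended web tasks,
# exact match for math, and test-case execution for code.

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Rollout:
    task_type: str        # "web", "math", or "code"
    question: str
    final_answer: str     # answer (or code) extracted from the CoA trajectory
    reference: str        # gold answer for web/math tasks


def exact_match_reward(pred: str, gold: str) -> float:
    """Binary reward for math: 1.0 iff the normalized answers match."""
    normalize = lambda s: s.strip().lower().replace(" ", "")
    return 1.0 if normalize(pred) == normalize(gold) else 0.0


def code_execution_reward(pred_code: str, tests: Sequence[str]) -> float:
    """Binary reward for code: 1.0 iff all unit tests pass.
    (Sketch only -- a real harness would sandbox execution and add timeouts.)"""
    scope: dict = {}
    try:
        exec(pred_code, scope)          # define the candidate solution
        for test in tests:
            exec(test, scope)           # each test raises AssertionError on failure
        return 1.0
    except Exception:
        return 0.0


def llm_judge_reward(question: str, pred: str, gold: str,
                     judge: Callable[[str], str]) -> float:
    """Binary reward for web tasks: an external LLM judge answers yes/no.
    `judge` is a hypothetical callable that queries the judge model."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {gold}\n"
        f"Model answer: {pred}\n"
        "Is the model answer correct? Reply with 'yes' or 'no'."
    )
    return 1.0 if judge(prompt).strip().lower().startswith("yes") else 0.0


def compute_reward(rollout: Rollout, judge: Callable[[str], str],
                   tests: Sequence[str] = ()) -> float:
    """Dispatch to the verifier matching the task type, as in the RL stage."""
    if rollout.task_type == "math":
        return exact_match_reward(rollout.final_answer, rollout.reference)
    if rollout.task_type == "code":
        return code_execution_reward(rollout.final_answer, tests)
    return llm_judge_reward(rollout.question, rollout.final_answer,
                            rollout.reference, judge)
```

The 0/1 signal is what makes these tasks "verifiable": policy updates only need correctness of the final outcome, not a learned reward model.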
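And since the test-time scaling numbers depend on which selection scheme is used, here is a small sketch contrasting the two; `judge_score` and `is_correct` are illustrative placeholders I'm assuming, not interfaces from the paper.

```python
# Sketch (my own) of the two test-time-scaling modes referenced above:
# pass@k   -> did *any* of the k sampled trajectories solve the task,
# best-of-n -> pick one trajectory with a selector, then score only that pick.

from typing import Callable, Sequence


def pass_at_k(answers: Sequence[str], is_correct: Callable[[str], bool]) -> float:
    """1.0 if at least one of the k sampled answers is correct."""
    return 1.0 if any(is_correct(a) for a in answers) else 0.0


def best_of_n(answers: Sequence[str],
              judge_score: Callable[[str], float],
              is_correct: Callable[[str], bool]) -> float:
    """Score only the answer the selector (e.g., an LLM judge) ranks highest."""
    best = max(answers, key=judge_score)
    return 1.0 if is_correct(best) else 0.0
```

pass@k is the more optimistic oracle-style number, while best-of-n reflects what a deployed system can actually return, which is why the two are reported separately.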