Agent architectures in the Real World ⬇️

𝗪𝗵𝗮𝘁'𝘀 𝘁𝗵𝗲 𝗽𝗿𝗼𝗯𝗹𝗲𝗺?

There is plenty of advice on how to build agent prototypes that
> use third-party APIs, like OpenAI or Anthropic
> encapsulate all the agent + tooling logic inside a single Python program
> run locally with docker compose

But the problem is, this design DOES NOT scale. Meaning, if you are a company trying to use this blueprint, it will be either
> too slow
> too expensive, or
> BOTH

So the question is: how can you design agentic systems that are cost-efficient, in both time and money?

Here is a blueprint.

𝗦𝘆𝘀𝘁𝗲𝗺 𝗮𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲 📐

> 𝗔𝗴𝗲𝗻𝘁 𝗔𝗣𝗜 𝗴𝗮𝘁𝗲𝘄𝗮𝘆
Routes incoming requests to the appropriate agent using a lightweight LLM.

> 𝗔𝗴𝗲𝗻𝘁 𝗹𝗼𝗴𝗶𝗰 𝘀𝗲𝗿𝘃𝗶𝗰𝗲𝘀
A task-specific agent receives the task and runs a step-wise workflow (e.g. LangGraph, CrewAI, LangChain...) that invokes one or several LLMs and a set of external tools.

> 𝗧𝗼𝗼𝗹 𝘀𝗲𝗿𝘃𝗲𝗿𝘀
They act as an interface between your agents and the backend services those agents need to solve the task. Here you can use MCP clients and servers, and a library like FastMCP.

> 𝗟𝗟𝗠 𝘀𝗲𝗿𝘃𝗲𝗿𝘀
They need to run on dedicated GPU nodes, using tools like vLLM or NVIDIA NIM.

Every service runs as a containerised app in your Kubernetes cluster. BOOM!

𝗪𝗮𝗻𝗻𝗮 𝗹𝗲𝗮𝗿𝗻 𝗺𝗼𝗿𝗲 𝗿𝗲𝗮𝗹 𝘄𝗼𝗿𝗹𝗱 𝗟𝗟𝗠𝗢𝗽𝘀?

In the coming weeks Marius Rugan and I will dig deeper into LLMOps system architecture. In public. For FREE.

Follow Pau Labarta Bajo so you don't miss what is coming next.
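To make the tool-server layer concrete, here is a minimal sketch of a tool server built with FastMCP, which the post names as an option. The server name, the tool, and the CRM endpoint are illustrative assumptions, not something described in the post.

```python
# Minimal FastMCP tool server sketch. The server name, tool, and the CRM
# endpoint below are illustrative assumptions, not a real deployment.
from fastmcp import FastMCP
import httpx

mcp = FastMCP("crm-tools")  # hypothetical server name

@mcp.tool()
def lookup_customer(customer_id: str) -> dict:
    """Fetch a customer record from an assumed backend CRM service."""
    resp = httpx.get(f"http://crm-service.internal/api/customers/{customer_id}")
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Serve over stdio so an MCP client inside an agent service can connect.
    mcp.run()
```

In the blueprint above, each such tool server would run as its own container in the cluster, so agent logic services can scale independently of the backends they call.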
AI Agent System Fundamentals
-
If you're building AI agents that need to work reliably in production, not just in demos, this is the full-stack setup I've found useful.

From routing to memory, planning to monitoring, here's how the stack breaks down 👇

🧠 Agent Orchestration
→ Agent Router handles load balancing using consistent hashing, so tasks always go to the right agent
→ Task Planner uses HTN (Hierarchical Task Network) planning and MCTS to break big problems into smaller ones and optimize execution order
→ Memory Manager stores both episodic and semantic memory, with vector search to retrieve relevant past experiences
→ Tool Registry keeps track of which tools the agent can use and runs them in sandboxed environments with schema validation

⚙️ Agent Runtime
→ LLM Engine runs models with optimizations like FP8 quantization, speculative decoding (which speeds things up), and key-value caching
→ Function calls run asynchronously, with retry logic and schema validation to prevent invalid requests
→ Vector Store supports hybrid retrieval using ChromaDB and Qdrant, plus FAISS for fast similarity search
→ State Management lets agents recover from failures by saving checkpoints in Redis or S3

🧱 Infrastructure
→ Kubernetes auto-scales agents based on usage, including GPU-aware scheduling
→ Monitoring uses OpenTelemetry, Prometheus, and Grafana to track what agents are doing and detect anomalies
→ Message Queue (Kafka + Redis Streams) helps route tasks with prioritization and fallback handling
→ Storage uses PostgreSQL for metadata and S3 for large data, with encryption and backups enabled

🔁 Execution Flow
Every agent follows this basic loop:
→ Reason (analyze the context)
→ Act (use the right tool or function)
→ Observe (check the result)
→ Reflect (store it in memory for next time)

Why this matters
→ Without a good memory system, agents forget everything between steps
→ Without planning, tasks get run in the wrong order, or not at all
→ Without proper observability, you can't tell what's working or why it failed
→ And without the right infrastructure, the whole thing breaks when usage scales

If you're building something similar, I would love to hear how you're thinking about memory, planning, or runtime optimization.

〰️〰️〰️〰️
♻️ Repost this so other AI Engineers can see it!
🔔 Follow me (Aishwarya Srinivasan) for more AI insights, news, and educational resources
📙 I write long-form technical blogs on Substack, if you'd like deeper dives: https://lnkd.in/dpBNr6Jg
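As a rough illustration of the Reason → Act → Observe → Reflect loop described in the post, here is a minimal sketch. The `llm`, `tools`, and `memory` objects are generic placeholders, not a specific framework's API.

```python
# Minimal sketch of the Reason -> Act -> Observe -> Reflect loop.
# `llm`, `tools`, and `memory` are illustrative placeholders, not a framework API.
from dataclasses import dataclass, field

@dataclass
class Agent:
    llm: object                          # anything with a .complete(prompt) -> str method
    tools: dict                          # tool name -> callable
    memory: list = field(default_factory=list)

    def run(self, task: str, max_steps: int = 10) -> str:
        context = task
        for _ in range(max_steps):
            # Reason: ask the model for the next action (or a final answer).
            plan = self.llm.complete(
                f"Task: {task}\nMemory: {self.memory}\nContext: {context}\nNext action?"
            )
            if plan.startswith("FINAL:"):
                return plan.removeprefix("FINAL:").strip()
            tool_name, _, tool_arg = plan.partition(" ")
            # Act: call the chosen tool.
            observation = self.tools[tool_name](tool_arg)
            # Observe: fold the result back into the working context.
            context = f"{context}\nObservation: {observation}"
            # Reflect: store the step so later turns (and sessions) can reuse it.
            self.memory.append({"action": plan, "observation": observation})
        return "Stopped after max_steps without a final answer"
```

In a production stack like the one above, the memory list would live in the Memory Manager (vector store plus metadata) and the tool calls would go through the sandboxed Tool Registry rather than direct Python callables.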
-
𝗔𝟮𝗔 (𝗔𝗴𝗲𝗻𝘁-𝘁𝗼-𝗔𝗴𝗲𝗻𝘁) 𝗮𝗻𝗱 𝗠𝗖𝗣 (𝗠𝗼𝗱𝗲𝗹 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗣𝗿𝗼𝘁𝗼𝗰𝗼𝗹) are two emerging protocols designed to facilitate advanced AI agent systems, but they serve distinct roles and are often used together in modern agentic architectures.

𝗛𝗼𝘄 𝗧𝗵𝗲𝘆 𝗪𝗼𝗿𝗸 𝗧𝗼𝗴𝗲𝘁𝗵𝗲𝗿
Rather than being competitors, 𝗔𝟮𝗔 𝗮𝗻𝗱 𝗠𝗖𝗣 𝗮𝗿𝗲 𝗰𝗼𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝗿𝘆 𝗽𝗿𝗼𝘁𝗼𝗰𝗼𝗹𝘀 that address different layers of the agent ecosystem:
• 𝗔𝟮𝗔 is about agents collaborating, delegating tasks, and sharing results across a distributed network. For example, an orchestrating agent might delegate subtasks to specialized agents (analytics, HR, finance) via A2A.
• 𝗠𝗖𝗣 is about giving an agent (often an LLM) structured access to external tools and data. Within an agent, MCP is used to invoke functions, fetch documents, or perform computations as needed.

𝗧𝘆𝗽𝗶𝗰𝗮𝗹 𝗪𝗼𝗿𝗸𝗳𝗹𝗼𝘄 𝗘𝘅𝗮𝗺𝗽𝗹𝗲
• A user submits a complex request.
• The orchestrating agent uses 𝗔𝟮𝗔 to delegate subtasks to other agents.
• One of those agents uses 𝗠𝗖𝗣 internally to access tools or data.
• Results are returned via A2A, enabling end-to-end collaboration.

𝗗𝗶𝘀𝘁𝗶𝗻𝗰𝘁 𝗦𝘁𝗿𝗲𝗻𝗴𝘁𝗵𝘀
• 𝗔𝟮𝗔 𝗲𝘅𝗰𝗲𝗹𝘀 𝗮𝘁:
  Multi-agent collaboration and orchestration
  Handling complex, multi-domain workflows
  Allowing independent scaling and updating of agents
  Supporting long-running, asynchronous tasks
• 𝗠𝗖𝗣 𝗲𝘅𝗰𝗲𝗹𝘀 𝗮𝘁:
  Structured tool and data integration for LLMs
  Standardizing access to diverse resources
  Transparent, auditable execution steps
  Single-agent scenarios needing a precise tool

𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗮𝗹 𝗔𝗻𝗮𝗹𝗼𝗴𝘆
• 𝗠𝗖𝗣 is like a 𝘶𝘯𝘪𝘷𝘦𝘳𝘴𝘢𝘭 𝘤𝘰𝘯𝘯𝘦𝘤𝘵𝘰𝘳 (USB-C port) between an agent and its tools/data.
• 𝗔𝟮𝗔 is like a 𝘯𝘦𝘵𝘸𝘰𝘳𝘬 𝘤𝘢𝘣𝘭𝘦 connecting multiple agents, enabling them to form a collaborative team.

𝗦𝗲𝗰𝘂𝗿𝗶𝘁𝘆 𝗮𝗻𝗱 𝗖𝗼𝗺𝗽𝗹𝗲𝘅𝗶𝘁𝘆 𝗖𝗼𝗻𝘀𝗶𝗱𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝘀
• 𝗔𝟮𝗔 introduces many endpoints and requires robust authentication and authorization (OAuth 2.0, API keys).
• 𝗠𝗖𝗣 needs careful sandboxing of tool calls to prevent prompt injection or tool poisoning.
Both are built with enterprise security in mind.

𝗜𝗻𝗱𝘂𝘀𝘁𝗿𝘆 𝗔𝗱𝗼𝗽𝘁𝗶𝗼𝗻
• 𝗔𝟮𝗔: Google, Salesforce, SAP, LangChain, Atlassian, Cohere, and others are building A2A-enabled agents.
• 𝗠𝗖𝗣: Anthropic (Claude Desktop), Zed, Cursor AI, and tool-based LLM UIs.

Modern agentic systems often combine both: 𝗔𝟮𝗔 𝗳𝗼𝗿 𝗶𝗻𝘁𝗲𝗿-𝗮𝗴𝗲𝗻𝘁 𝗼𝗿𝗰𝗵𝗲𝘀𝘁𝗿𝗮𝘁𝗶𝗼𝗻, 𝗠𝗖𝗣 𝗳𝗼𝗿 𝗶𝗻𝘁𝗿𝗮-𝗮𝗴𝗲𝗻𝘁 𝘁𝗼𝗼𝗹 𝗶𝗻𝘁𝗲𝗴𝗿𝗮𝘁𝗶𝗼𝗻. This layered approach supports scalable, composable, and secure AI applications.
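A rough sketch of the "orchestrator delegates over A2A" step in the workflow above. The endpoint URL, JSON-RPC method name, and payload shape approximate A2A's JSON-RPC-over-HTTP style but are assumptions for illustration; a real deployment would follow the published A2A spec and SDKs.

```python
# Illustrative sketch only: the endpoint, method name, and payload shape
# approximate A2A's JSON-RPC-over-HTTP style but are not taken from the spec.
import uuid
import httpx

def delegate_via_a2a(agent_url: str, subtask: str) -> dict:
    """Orchestrator -> specialist agent: send a subtask as an A2A-style JSON-RPC call."""
    payload = {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": "message/send",  # assumed method name
        "params": {
            "message": {"role": "user", "parts": [{"kind": "text", "text": subtask}]}
        },
    }
    return httpx.post(agent_url, json=payload, timeout=60).json()

# Inside the specialist agent, MCP (not shown here) would be used to call tools
# such as a finance database before the result travels back over A2A.
result = delegate_via_a2a("https://finance-agent.internal/a2a", "Summarize Q3 spend")
```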
-
One of the biggest challenges to high agentic performance has always been maintaining context integrity. Every iteration of prompt optimization or memory rewriting seems to make the agent forget what made it smart in the first place. The result is what researchers now call brevity bias and context collapse, where iterative rewriting simplifies prompts to the point that the system loses all nuance and depth.

ACE (Agentic Context Engineering) eliminates the need to overwrite context while allowing it to evolve. It treats context as a living playbook that grows richer with every interaction. Each update is modular and incremental, allowing agents to:
🔹 Keep learning strategies instead of discarding them
🔹 Retain domain heuristics essential for multi-turn reasoning and tool use
🔹 Adapt faster while reducing latency and rollout costs significantly

To make this possible, ACE's architecture introduces three intelligent components that mirror how we humans improve with experience:
🧠 Generator creates reasoning paths
🔍 Reflector learns from outcomes
📚 Curator organizes those lessons into a structured and reusable memory

As a result, you have an agent that learns continuously without retraining. Check out benchmarks such as AppWorld and FiNER in the paper below, where ACE delivered double-digit accuracy gains while reducing adaptation latency by more than eighty percent.

From a product and systems perspective, ACE goes beyond an optimization framework to redefine how agents evolve. Context becomes the foundation for learning itself, making agents more performant, transparent, explainable, and self-improving with every decision cycle.

#AIAgents #AgentContextEngineering
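A very loose sketch of the Generator / Reflector / Curator split described in the post, to show what "modular, incremental context updates" can look like. The class names, function signatures, and entry format are assumptions for illustration, not the ACE paper's actual implementation.

```python
# Loose illustration of incremental context updates in the spirit of ACE:
# lessons are appended as modular playbook entries instead of rewriting the
# whole prompt. Names and formats here are illustrative, not the paper's code.
from dataclasses import dataclass, field

@dataclass
class Playbook:
    entries: list[str] = field(default_factory=list)

    def add(self, lesson: str) -> None:
        self.entries.append(lesson)      # incremental: never overwrite old strategies

    def render(self) -> str:
        return "\n".join(f"- {e}" for e in self.entries)

def generator(llm, task: str, playbook: Playbook) -> str:
    # Produce a reasoning path conditioned on the accumulated playbook.
    return llm.complete(f"Playbook:\n{playbook.render()}\n\nTask: {task}")

def reflector(llm, task: str, trajectory: str, outcome: str) -> str:
    # Turn the outcome into a reusable lesson.
    return llm.complete(
        f"Task: {task}\nTrajectory: {trajectory}\nOutcome: {outcome}\n"
        "What reusable lesson should be kept for next time?"
    )

def curator(playbook: Playbook, lesson: str) -> None:
    # Organize lessons: here simply deduplicate before storing.
    if lesson not in playbook.entries:
        playbook.add(lesson)
```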
-
Google announced the Agent2Agent Protocol. How is it related to MCP, and what is this all about? 🤖

𝟏. 𝐌𝐨𝐝𝐞𝐥 𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐏𝐫𝐨𝐭𝐨𝐜𝐨𝐥 (𝐌𝐂𝐏): 𝐌𝐨𝐝𝐞𝐥-𝐭𝐨-𝐓𝐨𝐨𝐥/𝐃𝐚𝐭𝐚 𝐈𝐧𝐭𝐞𝐫𝐚𝐜𝐭𝐢𝐨𝐧

𝐏𝐮𝐫𝐩𝐨𝐬𝐞: MCP is designed to be a universal standard for how an AI model (or an application housing a model, sometimes called an "agent" in this context) securely connects to and interacts with external tools, APIs, and data sources (called "MCP servers").

𝐆𝐨𝐚𝐥: To provide the AI model with necessary "context" (like files, database entries, real-time information) from these external sources and allow the model to trigger actions (like updating a record, sending a message) using those tools. It aims to eliminate the need for custom, one-off integrations for every tool.

𝐈𝐧𝐭𝐞𝐫𝐚𝐜𝐭𝐢𝐨𝐧 𝐓𝐲𝐩𝐞: Primarily Client (AI model/app) <-> Server (Tool/API/Data Source).

𝐀𝐧𝐚𝐥𝐨𝐠𝐲: Think of MCP like a standardized USB port or HTTP protocol for AI. It allows any compatible AI model to "plug into" and use any compatible external tool or data source without needing a special adapter each time.

𝐅𝐨𝐜𝐮𝐬: Enhancing the capabilities of a single AI model/application by giving it secure and standardized access to the outside world.

𝟐. 𝐀𝐠𝐞𝐧𝐭-𝐭𝐨-𝐀𝐠𝐞𝐧𝐭 (𝐀𝟐𝐀) 𝐂𝐨𝐦𝐦𝐮𝐧𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝐏𝐫𝐨𝐭𝐨𝐜𝐨𝐥𝐬: 𝐀𝐠𝐞𝐧𝐭-𝐭𝐨-𝐀𝐠𝐞𝐧𝐭 𝐈𝐧𝐭𝐞𝐫𝐚𝐜𝐭𝐢𝐨𝐧

𝐏𝐮𝐫𝐩𝐨𝐬𝐞: These protocols define standards for how multiple distinct autonomous AI agents communicate directly with each other to collaborate, coordinate tasks, negotiate, and share information.

𝐆𝐨𝐚𝐥: To enable complex multi-agent systems where agents can work together effectively, delegate tasks, and achieve goals that a single agent couldn't manage alone. This includes agents potentially built by different developers or organizations.

𝐈𝐧𝐭𝐞𝐫𝐚𝐜𝐭𝐢𝐨𝐧 𝐓𝐲𝐩𝐞: Agent <-> Agent

𝐌𝐞𝐜𝐡𝐚𝐧𝐢𝐬𝐦: Often based on established theories defining message types (inform, request, query), message structures, interaction protocols, and sometimes shared languages/ontologies. Newer protocols like Google's A2A build on web standards (HTTP, JSON-RPC) for interoperability.

𝐀𝐧𝐚𝐥𝐨𝐠𝐲: Think of A2A protocols as a shared language, grammar, and set of conversational rules (etiquette) that allow different agents to understand each other and work together cooperatively.

𝐅𝐨𝐜𝐮𝐬: Enabling communication, collaboration, and coordination between multiple distinct AI agents.

MCP Official: https://lnkd.in/gRMcrwpn
A2A Official: https://lnkd.in/g6PCJZWn

Follow Arpit Adlakha for more!
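To ground the MCP half ("Client <-> Server" interaction), here is a minimal client-side sketch using the official MCP Python SDK as I understand its API; the server command and the tool name `search_docs` are illustrative assumptions.

```python
# Minimal MCP client sketch: an AI app (client) talking to one tool server.
# Uses the official MCP Python SDK as I understand it; the server command and
# the "search_docs" tool are illustrative assumptions.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(command="python", args=["docs_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()   # discover what the server offers
            print([t.name for t in tools.tools])
            result = await session.call_tool("search_docs", {"query": "refund policy"})
            print(result.content)

asyncio.run(main())
```

The A2A side would sit one layer up, passing tasks between such agents rather than between an agent and its tools (see the delegation sketch earlier in this collection).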
-
AI agents are getting smarter, but they've hit a wall.

Here's the thing: no matter how powerful your LLM is, it's limited by one frustrating thing: the context window.

If you've worked with AI agents, you know the pain:
- The model forgets what happened earlier.
- You lose track of the conversation.
- Your agent starts acting like it has amnesia.

This is where Model Context Protocol (MCP) steps in, and honestly, it's a game changer.

Instead of stuffing everything into a model's tiny context window, MCP creates a bridge between your AI agents, tools, and data sources. It lets agents dynamically load the right context at the right time. No more hitting limits. No more starting over.

This diagram shows how it works:
- Your AI agent (whether it's Claude, LangChain, CrewAI, or LlamaIndex) connects through MCP to tools like GitHub, Slack, Snowflake, Zendesk, Dropbox, you name it.
- The MCP Server + Client handle everything behind the scenes:
-- Tracking your session
-- Managing tokens
-- Pulling in conversation history and context
-- Feeding your model exactly what it needs, when it needs it

The result?
✅ Your agent remembers the full conversation, even across multiple steps or sessions
✅ It taps into real-time enterprise data without losing performance
✅ It acts less like a chatbot and more like an actual teammate

And this is just the start. Protocols like MCP are making AI agents way more reliable, which is key if we want them to handle real-world tasks like customer service, operations, data analysis, and more.

Bottom line: if you're building with AI right now and not thinking about context management, you're going to hit scaling problems fast.

Join The Ravit Show Newsletter: https://lnkd.in/dCpqgbSN

Have you played around with MCP or similar setups yet? What's your biggest frustration when it comes to building agents that can actually remember?

#data #ai #agents #theravitshow
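As a concrete, simplified illustration of wiring an agent host to several tool servers through MCP: the snippet below writes an `mcpServers` configuration of the shape used by hosts such as Claude Desktop, expressed as a Python dict. The exact server packages, environment variables, and file path are assumptions; check each server's docs before relying on them.

```python
# Simplified illustration: registering several MCP tool servers with an agent host.
# The "mcpServers" shape mirrors configs used by hosts such as Claude Desktop;
# package names, env vars, and the output path are assumptions.
import json
import pathlib

mcp_config = {
    "mcpServers": {
        "github": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-github"],
            "env": {"GITHUB_PERSONAL_ACCESS_TOKEN": "<token>"},
        },
        "slack": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-slack"],
            "env": {"SLACK_BOT_TOKEN": "<token>"},
        },
    }
}

# Write the config where the host expects it (path is illustrative).
pathlib.Path("claude_desktop_config.json").write_text(json.dumps(mcp_config, indent=2))
```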
-
🚀 When building agentic applications, one of the most overlooked performance levers is prompt/prefix caching.

Most agent workflows are multi-step: the agent needs to correctly execute 25-50 steps before task completion, and the same long context needs to be passed to the model repeatedly. With prompt caching, we can avoid re-computing large chunks of unchanged context, leading to big speedups and cost reductions.

But here's the important part: cache effectiveness isn't just about infrastructure. It's about how you prompt your models.

✅ Do: Keep the bulk of the prompt constant and only append incremental context, so cache hit rates stay high.
❌ Don't: Keep updating the start of the prompt (like adding the current time/date every turn); cache hit rates collapse, and your agents get slower and more expensive.

Designing prompts with caching in mind is what separates a smooth, cost-efficient agent from one that feels sluggish and costly.
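A small sketch of the do/don't above. The message layout is generic chat-API style, and the exact caching behavior depends on your provider or inference server, so treat this as a pattern rather than a guarantee.

```python
# Pattern sketch: keep the long, static prefix identical across turns so
# prefix/prompt caching can reuse it; append only incremental context at the end.
# Message format is generic chat-API style; caching specifics vary by provider.
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a support agent..."      # long, stable instructions
TOOL_DEFS = "...tool schemas, rarely change..."   # also stable

def build_messages_good(history: list[dict], new_observation: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT + "\n" + TOOL_DEFS},  # unchanged prefix
        *history,                                                          # grows append-only
        {"role": "user", "content": new_observation},                      # only new tokens
    ]

def build_messages_bad(history: list[dict], new_observation: str) -> list[dict]:
    # Anti-pattern: a timestamp at the top changes the prefix every turn,
    # so previously cached prefix tokens can no longer be reused.
    stamped = f"Current time: {datetime.now(timezone.utc).isoformat()}\n{SYSTEM_PROMPT}"
    return [
        {"role": "system", "content": stamped},
        *history,
        {"role": "user", "content": new_observation},
    ]
```

If the current time really is needed, appending it near the end of the prompt (after the stable prefix) keeps most of the cached prefix intact.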
-
Context engineering is quickly becoming one of the most critical skills in applied AI.

Not prompt tweaking. Not model fine-tuning. But knowing what information a model needs, and when to give it. That is the real unlock behind AI agents that actually work.

At its core, context engineering is about delivering the right information to the model, at the right time, in the right format, so it can reason effectively. It pushes developers to think more intentionally about how they shape a model's inputs:
🔸 What does the model need to know for this task?
🔸 Where should that information come from?
🔸 How do we fit it within the limits of the context window?
🔸 And how do we prevent irrelevant or conflicting signals from getting in the way?

Why does this matter so much? In practice, most agent failures are not due to weak models. They happen because the model did not have the context it needed. It missed a key fact, relied on stale data, or was overloaded with noise.

Context engineering addresses this directly. It forces you to design the flow of information step by step: not just what the model sees, but how and when it sees it.

This context can come from many places:
🔹 Long- and short-term memory (such as prior conversations or user history)
🔹 Retrieved data from APIs, vector stores, or internal systems
🔹 Tool definitions and their recent outputs
🔹 Structured formats or schemas that define how information is used
🔹 Global state shared across multi-step workflows

Frameworks like LlamaIndex, LangGraph, and LangChain are evolving to support this shift, giving developers the tools to manage context with much more precision. And there are now better resources than ever to help teams write, select, compress, and organize context with real control.

Image from the LangChain blog.

#contextengineering #llms #generativeai #artificialintelligence #technology
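To make the "right information, right time, right format" idea concrete, here is a small assembly sketch that draws on the sources listed above under a fixed token budget. The `memory`, `retriever`, and `count_tokens` helpers are assumed placeholders, not any particular framework's API.

```python
# Sketch of assembling context from several sources under a token budget.
# `memory`, `retriever`, and `count_tokens` are assumed placeholders,
# not a real framework API.
def assemble_context(task: str, memory, retriever, tool_outputs: list[str],
                     count_tokens, budget: int = 8000) -> str:
    sections = [
        ("user history", memory.recent(task)),             # short/long-term memory
        ("retrieved docs", retriever.search(task, k=5)),   # APIs / vector stores
        ("tool outputs", tool_outputs),                     # recent tool results
    ]
    parts, used = [], 0
    for name, items in sections:
        for item in items:
            cost = count_tokens(str(item))
            if used + cost > budget:        # stop before overflowing the window
                break
            parts.append(f"[{name}] {item}")
            used += cost
    return "\n".join(parts)
```

The ordering of sections and the per-section budget are exactly the kind of decisions context engineering makes explicit instead of leaving to chance.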
-
⚠️ While LLM providers keep increasing context window sizes, designing for long context windows isn't the right approach. Here's why:

🔹 Inefficiency at Scale
Bigger windows mean more tokens per request. That adds overhead without delivering proportional gains in reasoning or outcomes.

🔹 Noise Over Signal
Feeding everything into the model doesn't ensure better answers. Longer contexts can blur what's truly relevant, leading to weaker responses.

🔹 Memory Management & Loss of Context
Models still struggle to remember and prioritize information over long spans. Extending the window doesn't solve the deeper challenge of structured memory.

🔹 Smarter Patterns Exist
Techniques like retrieval-augmented generation (RAG), summarization, structured memory, and lean agentic workflows provide sharper, more reliable results.

💡 The principle is simple: design for the right context, not the longest one. Every unnecessary token adds cost, compute, and carbon overhead.

👉 One piece of practical advice: always design systems with constraints in mind; that's where true efficiency emerges. For context windows, think: how would you solve the problem if you only had a smaller window available? That mindset pushes you toward leaner, more efficient solutions.

#leanagenticai
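One way to apply the "smaller window" mindset is to summarize older turns instead of carrying them verbatim, keeping only the most recent ones intact. A rough sketch follows; the `llm.complete` call and `count_tokens` helper are assumed placeholders.

```python
# Rough sketch: keep recent turns verbatim and compress older ones into a running
# summary, so the prompt fits a small, fixed budget. `llm` and `count_tokens`
# are assumed placeholders.
def fit_history(llm, turns: list[str], count_tokens, budget: int = 2000) -> str:
    recent, older = [], list(turns)
    # Keep the newest turns verbatim until roughly half the budget is used.
    while older:
        candidate = older[-1]
        if count_tokens("\n".join([candidate, *recent])) > budget // 2:
            break
        recent.insert(0, older.pop())
    summary = ""
    if older:
        # Compress everything that didn't fit into a short bullet summary.
        summary = llm.complete(
            "Summarize the following conversation in a few bullet points, "
            "keeping decisions, constraints, and open questions:\n" + "\n".join(older)
        )
    header = f"Summary of earlier conversation:\n{summary}\n\n" if summary else ""
    return header + "\n".join(recent)
```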
-
Context is the new compute

For the past month, I keep circling the same word in my notebook: context. Models are racing toward million-token windows and agentic behavior, but the winners won't be whoever ships the biggest model. They'll be the teams that shape, select, compress, and govern context best.

Why does this matter?
Long context is here. GPT-4.1, Gemini 1.5, Claude 3.5 Sonnet: huge windows are the new normal. But long ≠ useful. LLMs still miss info ("lost in the middle").
RAG is still the control plane. Retrieval (and re-ranking) stays essential, at least for now.
Engineers are shifting from "writing code" to "composing context." Orchestration, not syntax, is where leverage lives.

The Context Stack
1. Context Orchestrators – Policy engines that decide what to retrieve, how much, how to compress, and where to place it. Expect dynamic retrieval and critique loops.
2. Contextual Retrieval > Vanilla RAG – Vector search drops nuance. Adding structure, authorship, and graph edges improves evidence and auditability.
3. Memory Infrastructure – Agents need persistent memory with TTLs, scopes, and consent. Think "Snowflake for memory," not a bigger vector DB.
4. Context Observability & QA – Metrics for lost needles, reranker drift, and token economics. Databricks-style RAG evals should become standard QA suites.
5. Compression & Layout – Systems that say more with fewer tokens. Expect advances in quoting, snippets, and layout to beat "dump the PDF."
6. Context Governance – Provenance, revocation, per-segment licensing, and user-level privacy that travels with data.
7. Hardware-aware Context – Prompt caching, streaming retrieval, and pre-embedding will be table stakes as context scales.

Evaluation systems need a whole different post, but they change with context as well.

And MCP? What MCP is missing:
The Model Context Protocol (MCP) standardizes how models talk to tools and data, but it's about interfaces, not judgment.
- It doesn't solve how to prioritize, compress, or govern context.
- It doesn't decide which docs to include or how to avoid "lost in the middle."
- It ignores observability, provenance, and cost optimization.
- It is weak at carrying context across servers and interfaces.
MCP defines the pipes, but not the water.

The uncomfortable truth
Most AI products underperform not because the model is "bad," but because their context pipeline is. If your system can't:
(a) find the right evidence
(b) compress it without losing meaning
(c) place it where the model will notice it
(d) prove provenance
…you're wasting tokens and shipping untrustworthy answers.

The next defensible moats may just be context moats. Winners will treat context like a product surface: instrumented, optimized, and governed end-to-end.
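A toy sketch combining point 5 (Compression & Layout) with the "lost in the middle" observation: place the highest-ranked evidence at the start and end of the context, where models tend to attend best. The reranker scores and helper names are assumed placeholders, not a specific orchestrator's API.

```python
# Toy sketch of context layout: put the strongest evidence at the edges of the
# prompt (start and end), since models tend to miss material "in the middle".
# `docs` is a list of (score, text) pairs from an assumed reranker.
def layout_context(docs: list[tuple[float, str]], max_docs: int = 8) -> str:
    ranked = sorted(docs, key=lambda d: d[0], reverse=True)[:max_docs]
    front, back = [], []
    # Alternate placement: best doc at the front, second-best at the back,
    # so the weakest evidence ends up in the middle.
    for i, (_, text) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(text)
    ordered = front + back[::-1]
    return "\n\n".join(f"[evidence {i + 1}]\n{t}" for i, t in enumerate(ordered))
```

A fuller context orchestrator would add the compression, provenance tagging, and observability hooks the post argues for; this only shows the placement step.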