Large language models (LLMs) are typically optimized to answer people's questions. But there is a trend toward models also being optimized to fit into agentic workflows. This will give a huge boost to agentic performance!

Following ChatGPT's breakaway success at answering questions, a lot of LLM development focused on providing a good consumer experience. So LLMs were tuned to answer questions ("Why did Shakespeare write Macbeth?") or follow human-provided instructions ("Explain why Shakespeare wrote Macbeth"). A large fraction of the datasets for instruction tuning guide models to provide more helpful responses to human-written questions and instructions of the sort one might ask a consumer-facing LLM like those offered by the web interfaces of ChatGPT, Claude, or Gemini.

But agentic workloads call for different behaviors. Rather than directly generating responses for consumers, AI software may use a model as part of an iterative workflow to reflect on its own output, use tools, write plans, and collaborate in a multi-agent setting. Major model makers are increasingly optimizing models to be used in AI agents as well.

Take tool use (or function calling). If an LLM is asked about the current weather, it won't be able to derive the information needed from its training data. Instead, it might generate a request for an API call to get that information. Even before GPT-4 natively supported function calls, application developers were already using LLMs to generate function calls, but they had to write more complex prompts (such as variations of ReAct prompts) that tell the LLM what functions are available, then have a separate software routine parse the LLM's output (perhaps with regular expressions) to figure out whether it wants to call a function. Generating such calls became much more reliable after GPT-4, and then many other models, natively supported function calling. Today, LLMs can decide to call functions to search for information for retrieval augmented generation (RAG), execute code, send emails, place orders online, and much more.

Recently, Anthropic released a version of its model that is capable of computer use, using mouse clicks and keystrokes to operate a computer (usually a virtual machine). I've enjoyed playing with the demo. While other teams have been prompting LLMs to use computers to build a new generation of RPA (robotic process automation) applications, native support for computer use by a major LLM provider is a great step forward. This will help many developers! [Reached length limit; full text: https://lnkd.in/gHmiM3Tx ]
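As a concrete illustration of the weather example above, here is a minimal sketch of native function calling using the OpenAI Python SDK. The model name and tool schema are illustrative assumptions, not taken from the post.

```python
from openai import OpenAI

client = OpenAI()

# Describe the tool the model is allowed to call (schema is illustrative).
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a given city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)

# If the model decides it needs live data, it returns a structured tool call
# instead of a final answer; the application executes the call and sends the
# result back in a follow-up message.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```

Before native support, the tool descriptions above would have lived inside the prompt text itself, with a regex-based parser recovering the call from free-form output.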
DSPy Trends in Large Language Model Development
Summary
“dspy-trends-in-large-language-model-development” refers to the growing practice of using the DSPy framework and related methods to build smarter, more reliable applications on top of large language models (LLMs): systems that can act as agents, access real-time data, and automate tasks through structured programming rather than manual prompt tweaking. DSPy is a toolkit that lets developers specify their needs in code, automate prompt generation, and improve model behavior through measurable feedback, taking LLMs beyond simple Q&A to complex agentic workflows.
- Specify requirements: Focus on writing clear, testable specifications for your AI tasks instead of relying on handwritten prompts, making it easier to adapt to new models and update workflows.
- Automate prompt tuning: Use frameworks like DSPy to generate prompts based on your code and success metrics, so each change improves results instead of causing unpredictable shifts.
- Integrate real-time tools: Combine DSPy with protocols like MCP to let your AI agents access live APIs and systems, enabling real-world automation such as dynamic booking or information retrieval.
Ten months ago, DeepSeek collapsed AI training costs by 90% using distillation - transferring knowledge from larger models to smaller ones at a fraction of the cost. Distillation works like a tutor training a student: a large model teaches a smaller one. As we've shifted from knowledge retrieval to agentic systems, we wondered if there was a parallel technique for tool calling. Could a large model teach a smaller one to call the right tools? The answer is yes, or at least yes in our case.

Here's our current effort: (first image) Every time we used Claude Code, we logged the session - our query, the available tools, & which tools Claude chose. These logs became training examples showing the local model what good tool calling looks like. To pick the right data, we used SemDeDup & CaR, selection algorithms that cherry-pick the examples most likely to lead to better results. Claude Code then fired up our local model, powered by GPT-OSS 20B, and peppered it with the queries. Claude graded GPT on which tools it called. Claude's assessments were fed into a prompt-optimization system built with DSPy & GEPA, and all of that data was used to improve the prompt: DSPy searches for existing examples that could improve the prompt, while GEPA mutates it and tests different mutations.

Combined, we improved from a 12% Claude match rate to 93% in three iterations by increasing the data volume to cover different scenarios: (second image) DSPy improved accuracy from 0% to 12%, and GEPA pushed it much higher, all the way to 93%, after three phases. The local model now matches Claude's tool call chain in 93% of cases.

Make no mistake: matching Claude 93% of the time doesn't mean 93% accuracy. When we benchmarked Claude itself, it only produced consistent results about 50% of the time. This is non-determinism at work. This proof of concept works for a small set of tools written in the code mode fashion. It suggests there is potential for tool calling distillation. If you've tried something similar, I'd love to hear from you.
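The post does not share code, but the loop it describes can be sketched in DSPy. Everything below is an assumption for illustration: the local endpoint, the signature, the `logged_sessions` variable, and the use of DSPy's MIPROv2 optimizer in place of the GEPA setup the author used.

```python
import dspy

# Point DSPy at a locally served GPT-OSS 20B (endpoint and key are placeholders).
local_lm = dspy.LM("openai/gpt-oss-20b", api_base="http://localhost:8000/v1", api_key="local")
dspy.configure(lm=local_lm)

class SelectTools(dspy.Signature):
    """Given a user query and the available tools, list the tools to call, in order."""
    query: str = dspy.InputField()
    available_tools: str = dspy.InputField()
    tool_calls: str = dspy.OutputField(desc="comma-separated tool names in call order")

student = dspy.ChainOfThought(SelectTools)

# Logged Claude Code sessions (query, tools offered, Claude's tool-call chain)
# become the training set; `logged_sessions` is a hypothetical list of triples.
trainset = [
    dspy.Example(query=q, available_tools=t, tool_calls=c).with_inputs("query", "available_tools")
    for q, t, c in logged_sessions
]

# Metric: does the local model reproduce Claude's logged tool-call chain?
def claude_match(example, pred, trace=None):
    return float(pred.tool_calls.strip() == example.tool_calls.strip())

# A prompt optimizer searches instructions and few-shot demos against the metric.
optimizer = dspy.MIPROv2(metric=claude_match, auto="light")
distilled = optimizer.compile(student, trainset=trainset)
```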
-
Working with LLMs often means spending considerable time crafting the "perfect prompt." As we've come to know, even minor changes in wording, phrasing, keywords, or formatting can drastically alter a model's output. This becomes exponentially more complex in applications where multiple prompts interact or feed into each other. Debugging these systems quickly becomes unwieldy as small tweaks trigger unpredictable downstream changes. The fundamental brittleness of direct prompting and working with complicated text creates an endless cycle of adjustments and variations.

To address this, Stanford's NLP lab released the framework DSPy (Declarative Self-improving Python), shifting the paradigm from "prompting" to "programming" language models. Drawing from established deep learning frameworks like PyTorch and TensorFlow, DSPy abstracts away manual prompt engineering through a three-phase approach.

The first phase focuses on defining the desired LLM interaction pattern. This could be as straightforward as "input -> output" for general processing, "context, question -> answer" for RAG, or "document -> summary" for summarization. While DSPy handles the initial universal prompt generation, the framework's true power lies in its optimization capabilities.

Optimization relies fundamentally on clearly defined metrics - the second crucial phase of DSPy. Unlike traditional prompt engineering's trial-and-error approach, DSPy introduces rigorous measurement frameworks for every aspect of LLM interaction. This includes quantifying input quality, output accuracy, and overall program success through concrete, measurable criteria. More nuanced tasks can leverage LLM-based judges to score outputs on dimensions like conciseness, engagement, or factual consistency. By establishing these precise success criteria, DSPy turns prompt optimization from a subjective art into a systematic process that can be iteratively improved through measurable feedback loops.

With these metrics in place, the third phase showcases DSPy's optimization potential. Its optimizers can find, create, or retrieve the most effective few-shot examples for your specific use case, or directly enhance instruction prompts through sophisticated algorithms. No more uncertain prompt tweaking - each change is validated against your defined metrics and examples, ensuring reliable improvements. This creates a clear optimization path where you can track progress toward your defined goals rather than making blind adjustments and hoping for better results.

Having spent the past weeks testing and documenting DSPy, I've created custom diagrams and explanations of its implemented solutions and put every piece of it to the test. To see exactly how it works and how to apply these concepts to your own programs, check out my latest video here: https://lnkd.in/ew4PacYt
Stop Prompt Engineering! Program Your LLMs with DSPy
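A minimal sketch of the three phases described in the post above (signature, metric, optimizer). The model name, the LLM-judge metric, and the `trainset` variable are illustrative assumptions.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # assumed model name

# Phase 1: declare the interaction pattern instead of hand-writing a prompt.
summarize = dspy.ChainOfThought("document -> summary")

# Phase 2: a measurable success criterion; here an LLM judge checks faithfulness.
judge = dspy.Predict("document, summary -> faithful: bool")

def faithful_metric(example, pred, trace=None):
    return judge(document=example.document, summary=pred.summary).faithful

# Phase 3: an optimizer selects few-shot demos that maximize the metric;
# `trainset` is a list of dspy.Example(document=..., summary=...) items.
optimizer = dspy.BootstrapFewShot(metric=faithful_metric)
compiled_summarizer = optimizer.compile(summarize, trainset=trainset)
```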
-
Played with adding MCP to DSPy (Community) recently. Building intelligent agents that actually do things, not just talk about them, has always felt like a gap in LLM workflows. That's where integrating Model Context Protocol (MCP) with DSPy changes the game.

MCP lets agents fetch real-time data from external tools, APIs, and live systems, instead of being stuck in static prompts or pre-indexed corpora. Think: flight booking systems that pull current flight info and actually book the ticket. DSPy, on the other hand, brings structure. It's a declarative framework for LLMs, so instead of hand-crafting fragile prompts, we define how the model should behave using modules, signatures, and planning logic. Together, they form a tight loop: DSPy handles reasoning, MCP handles action.

✅ Real-time data access
✅ Clear modular structure
✅ Cleaner debugging
✅ Better user experience

I've already used this stack for agent POCs with document automation and internal tools. It's fast, composable, and very real-world friendly. Curious to see where else it goes; bullish on this, as MCP takes care of the abstractions and DSPy takes care of the brittleness of LLM systems.
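A minimal sketch of wiring an MCP server's tools into a DSPy ReAct agent, loosely following DSPy's MCP integration (dspy.Tool.from_mcp_tool). The flight-booking server script, signature, model name, and request are hypothetical.

```python
import asyncio
import dspy
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # assumed model name

# Hypothetical local MCP server exposing flight search/booking tools.
server = StdioServerParameters(command="python", args=["flight_server.py"])

class BookFlight(dspy.Signature):
    """Find and book a flight that satisfies the user's request."""
    request: str = dspy.InputField()
    confirmation: str = dspy.OutputField()

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Wrap each MCP tool so the DSPy agent can call it: MCP supplies the
            # live actions, DSPy supplies the reasoning loop.
            mcp_tools = (await session.list_tools()).tools
            tools = [dspy.Tool.from_mcp_tool(session, t) for t in mcp_tools]
            agent = dspy.ReAct(BookFlight, tools=tools)
            result = await agent.acall(request="Book the cheapest SFO to JFK flight tomorrow")
            print(result.confirmation)

asyncio.run(main())
```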
-
The emerging architectural pattern I see in successful AI implementations 👇

Write 𝘀𝗽𝗲𝗰𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻𝘀, not just prompts. Most teams today encode everything into a single massive prompt.
✍️ Business logic sits next to model-specific parsing requirements.
✍️ Data schemas are mixed with retry strategies.
✍️ Domain rules are tangled with reasoning formats.
This works, but it creates 𝘁𝗶𝗴𝗵𝘁 𝗰𝗼𝘂𝗽𝗹𝗶𝗻𝗴. When you switch models or when updates arrive, you often need to rewrite everything, because you've mixed what your system needs to do with how one specific model processes it.

𝗛𝗲𝗿𝗲'𝘀 𝘁𝗵𝗲 𝗼𝗽𝗽𝗼𝗿𝘁𝘂𝗻𝗶𝘁𝘆: instead of optimizing prompts, define your actual requirements.
🎯 Write down what goes in and what comes out.
🎯 Create test suites based on real data.
🎯 Build your control flow in code.
🎯 Then let your prompts be generated from these specifications.
When models change, regenerate. Your core system stays intact.

𝗙𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻 𝗺𝗼𝗱𝗲𝗹𝘀 𝗸𝗲𝗲𝗽 𝗴𝗲𝘁𝘁𝗶𝗻𝗴 𝗯𝗲𝘁𝘁𝗲𝗿 𝗮𝘁 𝗲𝘃𝗲𝗿𝘆𝘁𝗵𝗶𝗻𝗴 𝗲𝘅𝗰𝗲𝗽𝘁 𝗸𝗻𝗼𝘄𝗶𝗻𝗴 𝘄𝗵𝗮𝘁 𝗲𝘅𝗮𝗰𝘁𝗹𝘆 𝘆𝗼𝘂 𝘄𝗮𝗻𝘁 𝘁𝗼 𝗱𝗼. They won't magically understand your domain or handle your edge cases. That's still your job. You can focus on what matters: defining what your system does, not how to coax this month's model into doing it.

Two frameworks are pioneering this approach.
𝗗𝗦𝗣𝘆 𝗳𝗿𝗼𝗺 𝗦𝘁𝗮𝗻𝗳𝗼𝗿𝗱 treats prompts as compiled output. You write signatures like "document -> revenue, profit" and it generates optimal prompts for any model. The same specification works across different LLMs. Your focus stays on defining the task, not crafting the perfect wording.
𝗕𝗔𝗠𝗟 takes a different angle, turning prompts into typed functions. You define schemas for your inputs and outputs, write a simple prompt template, and BAML handles the rest. When you switch models, your contracts remain the same. The framework ensures your outputs match your schemas regardless of which LLM runs underneath.

𝗕𝗼𝘁𝗵 𝗮𝗽𝗽𝗿𝗼𝗮𝗰𝗵𝗲𝘀 𝘀𝗵𝗮𝗿𝗲 𝗮 𝗰𝗿𝘂𝗰𝗶𝗮𝗹 𝗶𝗻𝘀𝗶𝗴𝗵𝘁: specifications are assets, prompts are compilation targets. Your business logic, evaluation criteria, and data contracts survive model transitions. No need for carefully tuned prompts, formatting tricks, and model-specific workarounds.

#AI #Engineering #Architecture #SystemDesign
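To make "specifications are assets, prompts are compilation targets" concrete, here is a minimal DSPy sketch in which the signature and metric stay fixed while the underlying model is swapped. The model names, the metric, and the `trainset` variable are illustrative assumptions.

```python
import dspy

# The specification: what goes in, what comes out. This never changes.
extract = dspy.Predict("document -> revenue, profit")

# The evaluation criterion, also model-agnostic.
def exact_match(example, pred, trace=None):
    return pred.revenue == example.revenue and pred.profit == example.profit

# Swap the underlying model and recompile; the spec and tests survive the switch.
for model in ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet-20240620"]:
    dspy.configure(lm=dspy.LM(model))
    compiled = dspy.BootstrapFewShot(metric=exact_match).compile(extract, trainset=trainset)
    # `compiled` carries model-specific prompts and demos generated from the spec.
```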