For the last couple of years, Large Language Models (LLMs) have dominated AI, driving advances in text generation, search, and automation. But 2025 marks a shift, one that moves beyond token-based prediction to a deeper, more structured understanding of language. Meta's Large Concept Models (LCMs), introduced in December 2024, redefine AI's ability to reason, generate, and interact by focusing on concepts rather than individual words.

Unlike LLMs, which rely on token-by-token generation, LCMs operate at a higher level of abstraction, processing entire sentences and ideas as unified concepts. This shift enables AI to grasp deeper meaning, maintain coherence over longer contexts, and produce more structured outputs.

Attached is a fantastic graphic created by Manthan Patel.

How LCMs Work:
🔹 Conceptual Processing – Instead of breaking sentences into discrete words, LCMs encode entire ideas, allowing for higher-level reasoning and contextual depth.
🔹 SONAR Embeddings – A breakthrough in representation learning, SONAR embeddings capture the essence of a sentence rather than just its words, making AI more context-aware and language-agnostic.
🔹 Diffusion Techniques – Borrowing from the success of generative diffusion models, LCMs stabilize text generation, reducing hallucinations and improving reliability.
🔹 Quantization Methods – By discretizing the embedding space into learned units, LCMs improve robustness and minimize errors from small perturbations in phrasing.
🔹 Multimodal Integration – Unlike traditional LLMs that primarily process text, LCMs integrate text, speech, and other data types, enabling more intuitive, cross-lingual AI interactions.

Why LCMs Are a Paradigm Shift:
✔️ Deeper Understanding: LCMs go beyond word prediction to grasp the underlying intent and meaning behind a sentence.
✔️ More Structured Outputs: Instead of just generating fluent text, LCMs organize thoughts logically, making them more useful for technical documentation, legal analysis, and complex reports.
✔️ Improved Reasoning & Coherence: LLMs often lose track of long-range dependencies in text. By processing entire ideas, LCMs maintain context better across long conversations and documents.
✔️ Cross-Domain Applications: From research and enterprise AI to multilingual customer interactions, LCMs unlock new possibilities where traditional LLMs struggle.

LCMs vs. LLMs: The Key Differences
🔹 LLMs predict text at the token level, which pushes them toward word-by-word optimization rather than holistic comprehension.
🔹 LCMs process entire concepts, allowing for abstract reasoning and structured thought representation.
🔹 LLMs may struggle with context loss in long texts, while LCMs excel at maintaining coherence across extended interactions.
🔹 LCMs are more resistant to adversarial input variations, making them more reliable in critical applications like legal tech, enterprise AI, and scientific research.
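To make the token-vs-concept distinction concrete, here is a minimal, illustrative sketch: encode whole sentences into fixed-size embeddings and treat the sequence of embeddings, not tokens, as the unit of prediction. It uses sentence-transformers as a stand-in encoder (Meta's LCMs use SONAR), so treat it as an analogy rather than the actual LCM pipeline.

```python
# Illustrative sketch of concept-level processing: map each sentence to one embedding
# ("concept") and reason over the sequence of embeddings instead of tokens.
# sentence-transformers is a stand-in for a SONAR-style encoder; this is NOT Meta's code.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in sentence encoder

document = [
    "Large Concept Models operate on sentence-level representations.",
    "Each sentence is mapped to a single embedding, or 'concept'.",
    "A sequence model then predicts the next concept, not the next token.",
]

# One vector per sentence: the "concept sequence" an LCM would reason over.
concepts = encoder.encode(document, normalize_embeddings=True)  # shape (3, 384)

# Toy "next-concept" setup: given the earlier concepts, approximate the last one.
context, target = concepts[:-1], concepts[-1]
baseline_guess = context.mean(axis=0)  # trivial baseline standing in for the LCM decoder
cosine = float(np.dot(baseline_guess, target) /
               (np.linalg.norm(baseline_guess) * np.linalg.norm(target)))
print(f"cosine(baseline guess, actual next concept) = {cosine:.3f}")
```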
-
The interview is for a Generative AI Engineer role at Cohere.

Interviewer: "Your client complains that the LLM keeps losing track of earlier details in a long chat. What's happening?"
You: "That's a classic context window problem. Every LLM has a fixed memory limit - say 8k, 32k, or 200k tokens. Once that's exceeded, earlier tokens get dropped or compressed, and the model literally forgets."

Interviewer: "So you just buy a bigger model?"
You: "You can, but that's like using a megaphone when you need a microphone. A larger context window costs more, runs slower, and doesn't always reason better."

Interviewer: "Then how do you manage long-term memory?"
You:
1. Summarization memory - periodically condense earlier chat segments into concise summaries.
2. Vector memory - store older context as embeddings; retrieve only the relevant pieces later.
3. Hybrid memory - combine summaries for continuity and retrieval for precision.

Interviewer: "So you're basically simulating memory?"
You: "Yep. LLMs are stateless by design. You build memory on top of them - a retrieval layer that acts like long-term memory. Otherwise, your chatbot becomes a goldfish."

Interviewer: "And how do you know if the memory strategy works?"
You: "When the system recalls context correctly without bloating cost or latency. If a user says, 'Remind me what I told you last week,' and it answers from stored embeddings - that's memory done right."

Interviewer: "So context management isn't a model issue - it's an architecture issue?"
You: "Exactly. Most think 'context length' equals intelligence. But true intelligence is recall with relevance - not recall with redundancy."

#ai #genai #llms #rag #memory
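A minimal sketch of the hybrid memory idea from the dialogue above. This is illustrative only: `call_llm` is a placeholder for whatever chat completion API you use, and sentence-transformers stands in for your embedding model.

```python
# Hybrid memory sketch: summarize old turns for continuity, embed them for retrieval,
# and inject only the relevant pieces into the next prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat completion API here")

class HybridMemory:
    def __init__(self, summarize_every: int = 10):
        self.turns: list[str] = []           # recent raw turns (short-term)
        self.summaries: list[str] = []       # condensed history (continuity)
        self.chunks: list[str] = []          # older turns, retrievable by similarity
        self.vectors: list[np.ndarray] = []  # embeddings for those chunks
        self.summarize_every = summarize_every

    def add_turn(self, text: str) -> None:
        self.turns.append(text)
        if len(self.turns) >= self.summarize_every:
            old = self.turns[:]
            self.turns.clear()
            # 1) Summarization memory: condense the old segment.
            self.summaries.append(call_llm("Summarize briefly:\n" + "\n".join(old)))
            # 2) Vector memory: keep the raw turns retrievable later.
            self.chunks.extend(old)
            self.vectors.extend(embedder.encode(old, normalize_embeddings=True))

    def build_context(self, query: str, k: int = 3) -> str:
        retrieved: list[str] = []
        if self.vectors:
            q = embedder.encode([query], normalize_embeddings=True)[0]
            scores = np.array(self.vectors) @ q
            retrieved = [self.chunks[i] for i in np.argsort(-scores)[:k]]
        # 3) Hybrid: summaries for continuity + retrieved details for precision.
        return "\n".join(self.summaries + retrieved + self.turns)
```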
-
Most people still think of LLMs as "just a model." But if you've ever shipped one in production, you know it's not that simple. Behind every performant LLM system, there's a stack of decisions about pretraining, fine-tuning, inference, evaluation, and application-specific tradeoffs.

This diagram captures it well: LLMs aren't one-dimensional. They're systems. And each dimension introduces new failure points or optimization levers. Let's break it down:

🧠 Pre-Training
Start with modality.
→ Text-only models like LLaMA, UL2, and PaLM have predictable inductive biases.
→ Multimodal ones like GPT-4, Gemini, and LaVIN introduce more complex token fusion, grounding challenges, and cross-modal alignment issues.
Understanding the data diet matters just as much as parameter count.

🛠 Fine-Tuning
This is where most teams underestimate complexity:
→ PEFT strategies like LoRA and Prefix Tuning help with parameter efficiency, but can behave differently under distribution shift (see the LoRA sketch after this post).
→ Alignment techniques (RLHF, DPO, RAFT) aren't interchangeable. They encode different human preference priors.
→ Quantization and pruning decisions directly impact latency, memory usage, and downstream behavior.

⚡️ Efficiency
Inference optimization is still underexplored. Techniques like dynamic prompt caching, paged attention, speculative decoding, and batch streaming make the difference between real-time and unusable. The infra layer is where GenAI products often break.

📏 Evaluation
One benchmark doesn't cut it. You need a full matrix:
→ NLG (summarization, completion) and NLU (classification, reasoning),
→ alignment tests (honesty, helpfulness, safety),
→ dataset quality, and
→ cost breakdowns across training, inference, and memory.
Evaluation isn't just a model task; it's a systems-level concern.

🧾 Inference & Prompting
Multi-turn prompts, CoT, ToT, and ICL all behave differently under different sampling strategies and context lengths. Prompting isn't trivial anymore. It's an orchestration layer in itself.

Whether you're building for legal, education, robotics, or finance, the "general-purpose" tag doesn't hold. Every domain has its own retrieval, grounding, and reasoning constraints.

-------
Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
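For the PEFT point above, here is a hedged sketch of wrapping a causal LM with LoRA via Hugging Face's peft library. The model name, target modules, and hyperparameters are illustrative; swap in whatever base model and settings fit your setup.

```python
# LoRA sketch with the peft library: adapt only small low-rank matrices on top of a
# frozen base model, so the trainable parameter count drops dramatically.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example base model

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # low-rank dimension: smaller = fewer trainable params
    lora_alpha=16,                         # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-dependent)
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base model's parameters
```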
-
After LLMs, LCMs: Large Concept Models. Because we don't think in tokens, so why should machines?

Meta's AI research team has built a model that operates at the sentence level, resulting in substantial performance improvements, notably in zero-shot translation and text continuation. This is a promising direction, with great scope to vary the unit of "concept", which I expect will work better at the sub-sentence level. It is particularly interesting to envisage how this could be applied to Humans + AI cognition, with better integration with human thinking by working at more similar semantic levels.

Key insights in the paper (link to paper and GitHub repo in comments):

🌟 Revolutionizing Semantic Understanding with Concepts. The LCM architecture shifts focus from token-level processing to higher-level "concepts," such as sentences. This abstraction enables reasoning across 200 languages and multiple modalities, surpassing conventional token-based LLMs. Practically, this design promotes efficiency in multilingual tasks, enabling scalable applications in text and speech analysis.

📚 Explicit Hierarchical Structuring for Enhanced Coherence. By processing information in a structured flow, from abstract concepts to detailed content, the LCM mirrors human planning methods like outlining essays or talks. This hierarchical design supports better readability and interactive edits, making it ideal for generating and analyzing long-form content.

🧠 Zero-Shot Generalization Across Languages and Modalities. Thanks to its use of the SONAR embedding space, the LCM excels in zero-shot tasks across text, speech, and experimental American Sign Language inputs. This capability reduces dependency on fine-tuning for new languages or modalities, broadening its use in global communication tools.

🔀 Diffusion-Based Models Offer Robust Text Generation. Diffusion-based methods within the LCM generate more coherent, semantically rich continuations than alternatives like simple regression or quantization. These models also balance accuracy with creative variability.

🚀 Efficient Handling of Long Contexts. The LCM's concept-based representation significantly reduces sequence length compared to token-based models. This efficiency allows it to process lengthy documents with reduced computational overhead, enhancing feasibility for large-scale applications.

🤖 Opportunities in Modality Integration. With modular encoders and decoders, the LCM avoids the competition issues faced by multimodal models. This extensibility supports independent development of language- or modality-specific components, making it a versatile backbone for diverse AI systems.
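The sequence-length point is easy to see with a rough count: the same passage is far shorter measured in sentences (concepts) than in tokens. A throwaway sketch follows; the tokenizer choice is arbitrary and the sentence splitter is naive, so the exact numbers will vary by model.

```python
# Rough illustration of the sequence-length argument: the same passage measured in
# tokens (what an LLM attends over) vs. sentences (what an LCM would attend over).
import re
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # arbitrary tokenizer for counting
passage = (
    "Large Concept Models operate on sentences rather than tokens. "
    "This shrinks the sequence the model must attend over. "
    "Shorter sequences mean cheaper attention for long documents."
)

n_tokens = len(tok(passage)["input_ids"])
n_concepts = len(re.split(r"(?<=[.!?])\s+", passage.strip()))  # naive sentence split
print(f"tokens: {n_tokens}, concepts (sentences): {n_concepts}")
```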
-
Evaluations, or "Evals", are the backbone of production-ready GenAI applications. Over the past year, we've built LLM-powered solutions for our customers and connected with AI leaders, uncovering a common struggle: the lack of clear, pluggable evaluation frameworks. If you've ever been stuck wondering how to evaluate your LLM effectively, today's post is for you. Here's what I've learned about creating impactful Evals:

𝗪𝗵𝗮𝘁 𝗠𝗮𝗸𝗲𝘀 𝗮 𝗚𝗿𝗲𝗮𝘁 𝗘𝘃𝗮𝗹?
- Clarity and Focus: Prioritize a few interpretable metrics that align closely with your application's most important outcomes.
- Efficiency: Opt for automated, fast-to-compute metrics to streamline iterative testing.
- Representation Matters: Use datasets that reflect real-world diversity to ensure reliability and scalability.

𝗧𝗵𝗲 𝗘𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗼𝗳 𝗠𝗲𝘁𝗿𝗶𝗰𝘀: 𝗙𝗿𝗼𝗺 𝗕𝗟𝗘𝗨 𝘁𝗼 𝗟𝗟𝗠-𝗔𝘀𝘀𝗶𝘀𝘁𝗲𝗱 𝗘𝘃𝗮𝗹𝘀
Traditional metrics like BLEU and ROUGE paved the way but often miss nuances like tone or semantics. LLM-assisted Evals (e.g., GPTScore, LLM-Eval) now leverage AI to evaluate itself, achieving up to 80% agreement with human judgments. Combining machine feedback with human evaluators provides a balanced and effective assessment framework.

𝗙𝗿𝗼𝗺 𝗧𝗵𝗲𝗼𝗿𝘆 𝘁𝗼 𝗣𝗿𝗮𝗰𝘁𝗶𝗰𝗲: 𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗬𝗼𝘂𝗿 𝗘𝘃𝗮𝗹 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲
- Create a Golden Test Set: Use tools like Langchain or RAGAS to simulate real-world conditions.
- Grade Effectively: Leverage libraries like TruLens or LlamaIndex for hybrid LLM + human feedback.
- Iterate and Optimize: Continuously refine metrics and evaluation flows to align with customer needs.

If you're working on LLM-powered applications, building high-quality Evals is one of the most impactful investments you can make. It's not just about metrics; it's about ensuring your app resonates with real-world users and delivers measurable value.
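Here is a minimal, framework-agnostic sketch of the golden-test-set plus LLM-as-judge loop described above. `judge_llm`, the rubric, and the golden rows are all placeholders of my own, not any particular library's API.

```python
# LLM-assisted eval sketch: run the app over a small golden set and have a judge
# model grade each answer against a reference, then average the scores.
import json

GOLDEN_SET = [
    {"question": "What is our refund window?", "reference": "30 days from delivery."},
    {"question": "Do we ship internationally?", "reference": "Yes, to 40+ countries."},
]

JUDGE_PROMPT = """You are grading an answer against a reference.
Question: {question}
Reference: {reference}
Answer: {answer}
Return JSON: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def judge_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your judge model here")

def evaluate(generate_answer) -> float:
    """generate_answer: callable mapping a question string to your app's answer."""
    scores = []
    for row in GOLDEN_SET:
        answer = generate_answer(row["question"])
        verdict = json.loads(judge_llm(JUDGE_PROMPT.format(answer=answer, **row)))
        scores.append(verdict["score"])
    return sum(scores) / len(scores)  # average judge score across the golden set
```

In practice you would pair this machine score with periodic human review, as the post suggests, and version the golden set alongside your prompts.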
-
Some challenges in building LLM-powered applications (including RAG systems) for large companies:

1. Hallucinations are very damaging to the brand. It only takes one for people to lose faith in the tool completely. Contrary to popular belief, RAG doesn't fix hallucinations.

2. Chunking a knowledge base is not straightforward. This leads to poor context retrieval, which leads to bad answers from the model powering a RAG system.

3. As information changes, you also need to change your chunks and embeddings. Depending on the complexity of the information, this can become a nightmare (one way to keep embeddings in sync is sketched after this post).

4. Models are black boxes. We only have access to modify their inputs (prompts), but it's hard to determine cause and effect when troubleshooting (e.g., why is "Produce concise answers" working better than "Reply in short sentences"?).

5. Prompts are too brittle. Every new version of a model can cause your previous prompts to stop working. Unfortunately, you don't know why or how to fix them (see #4 above).

6. It is not yet clear how to reliably evaluate production systems.

7. Costs and latency are still significant issues. The best models out there cost a lot of money and are very slow. Cheap and fast models have very limited applicability.

8. There are not enough qualified people to deal with these issues. I cannot highlight this problem enough.

You may encounter one or more of these problems in a project at once. Depending on your requirements, some of these issues may be showstoppers (hallucinated direction instructions for a robot) or simple nuances (a support agent hallucinating an incorrect product description). There's still a lot of work to do until these systems mature to a point where they are viable for most use cases.
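One way to tame point 3, sketched below: key each chunk by a content hash so only new or changed chunks get re-embedded. The `embed` function and the dict-based index are placeholders, not a specific vector database's API.

```python
# Incremental re-embedding sketch: re-embed only chunks whose content changed,
# and drop index entries for chunks that no longer exist in the source.
import hashlib

def embed(texts: list[str]) -> list[list[float]]:
    raise NotImplementedError("plug in your embedding model here")

def chunk_id(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync_index(chunks: list[str], index: dict[str, list[float]]) -> dict[str, list[float]]:
    """Bring the embedding index in line with the current chunk set."""
    wanted = {chunk_id(c): c for c in chunks}
    new_ids = [cid for cid in wanted if cid not in index]
    if new_ids:
        vectors = embed([wanted[cid] for cid in new_ids])  # only the changed/new chunks
        index.update(dict(zip(new_ids, vectors)))
    return {cid: vec for cid, vec in index.items() if cid in wanted}
```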
-
How good are LLMs in long-context RAG? Databricks Mosaic Research ran over 2,000 experiments with 13 open and closed LLMs on 4 curated RAG datasets. 👀

TL;DR:
📈 Retrieving more documents generally improves RAG performance, but only up to a point.
🔬 Most models show decreased performance beyond a model-specific context size (e.g., 32k for Llama-3.1-405b, 64k for GPT-4).
🆚 Different models fail in distinct ways with long contexts (e.g., copyright concerns, summarizing instead of answering).
🚫 Claude 3.5's copyright-related failures increased from 3.7% at 16k context to 49.5% at 64k context.
👎🏻 DBRX's failure to follow instructions jumped from 5.2% at 8k context to 50.4% at 32k.
🔄 Mixtral started to generate repeated content ("梦梦梦梦梦梦").
🤷🏻‍♂️ LLMs still suffer from "lost in the middle," failing to use information from the middle portions of long texts effectively.
📊 The optimal context size depends on both the model and the specific task.
🧠 Lack of long-context post-training may be the reason for many of these failures.

Blog: https://lnkd.in/eCXDuZPP

A good example of where fine-tuning on domain- or task-specific long-context synthetic datasets could significantly improve the performance of open models compared to closed models. 🚀
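A hedged sketch of the kind of sweep behind findings like these: fix a QA set, vary the retrieval depth, and track accuracy per context budget. `retrieve`, `generate`, and `is_correct` are placeholders for your own stack, not the Databricks harness.

```python
# Context-size sweep sketch: measure answer accuracy as a function of how many
# retrieved documents are packed into the prompt.
def retrieve(question: str, k: int) -> list[str]:
    raise NotImplementedError("your retriever here")

def generate(question: str, docs: list[str]) -> str:
    raise NotImplementedError("your RAG prompt + model call here")

def is_correct(answer: str, reference: str) -> bool:
    raise NotImplementedError("your grading logic here (exact match, judge model, ...)")

def sweep_retrieval_depths(qa_pairs, depths=(1, 5, 10, 20, 50)):
    """Return accuracy per retrieval depth - the shape of curve the study reports."""
    results = {}
    for k in depths:
        hits = 0
        for question, reference in qa_pairs:
            answer = generate(question, retrieve(question, k))
            hits += is_correct(answer, reference)
        results[k] = hits / len(qa_pairs)
    return results  # maps retrieval depth -> accuracy; look for where it peaks and drops
```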
-
Unique Complexities of LLM Inference:

LLM inference poses interesting challenges that I haven't really experienced in other areas of HPC, because you must optimize for both throughput and latency in an incredibly dynamic environment. This is hard - generally, throughput and latency are at odds with one another: when you optimize for one, you hurt the other.

In the inference case, you have a real-time user sending prompts over the internet and expecting a response. They want to receive text at least at the rate they can read, so they are not kept waiting (how impatient we are 😁). However, in a production environment, you may have thousands of users sending prompts. The more prompts you process at a time, the better your throughput gets but the worse your latency gets. Conversely, smaller batches help latency but kill throughput.

In many ways, it's the age-old web server problem, with a unique difference: GPUs are specifically designed to be throughput monsters with expected latency overhead. Many techniques for performant GPU programming are about hiding that latency while processing data. Furthermore, larger models don't fit in a single GPU and need sharding across multiple GPUs. Very large models need parallelism both at the GPU level and at the node level. All of these incur communication overhead, which affects both throughput and latency.

My favorite project right now is vLLM because they are dealing with these complexities in pretty cool ways - from optimizing the control plane on the CPU with techniques like adaptive batching and efficient queueing and scheduling, to optimizing memory on the GPU with dynamic memory management, paged attention, and interesting pipelining. The rate at which their software is improving is also pretty cool - I benchmark their inference a lot and see consistent improvements from release to release. I've learned a lot from diving into their code and watching their office hours on YouTube. If you are interested in high-performance inference, it's a great project to check out.

If you like my content, feel free to follow or connect! #softwareengineering #gpus
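For anyone curious what the caller side of vLLM looks like, here is a minimal usage sketch: the engine handles continuous batching, paged attention, and scheduling internally, so you just submit prompts. The model name and sampling settings are illustrative, and API details may shift between releases.

```python
# Minimal vLLM usage sketch: one call, many prompts; the engine batches them to
# balance throughput and per-request latency.
from vllm import LLM, SamplingParams

# For models too large for one GPU, pass tensor_parallel_size=N to shard across GPUs.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Explain paged attention in one paragraph.",
    "Why do large batches hurt per-request latency?",
]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text[:120], "...")
```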
-
Exciting Research Alert: Solving Knowledge Conflicts in Retrieval-Augmented LLMs

I just came across a fascinating new paper that addresses a critical challenge in retrieval-augmented large language models (LLMs): knowledge conflicts.

>> The Problem
When LLMs encounter conflicting information between their internal knowledge and retrieved external data, they often struggle to generate reliable responses. This creates uncertainty in decision-making and undermines the reliability of AI systems in real-world applications.

>> The Solution: Swin-VIB
Researchers from Nankai University, Haihe Lab of ITAI, Meta AI, and InnoPeak Technology, Inc. have developed an innovative framework called Swin-VIB that tackles this problem head-on. Swin-VIB integrates a pipeline of variational information bottleneck models to adaptively augment retrieved information and guide LLM preference during response generation.

The approach is grounded in information theory and reveals a key insight: when the difference between conflicting and supplementary information is significant, LLMs confidently settle into stable preferences, but when the distinction is ambiguous, uncertainty increases dramatically.

>> How It Works Under the Hood
The system operates through a sliding-window mechanism that:
1. Extracts attention scores from transformer decoder blocks
2. Compresses these representations through an encoder-decoder bottleneck structure
3. Adaptively accepts or rejects external context based on uncertainty predictions

The framework uses a three-step process:
- Retrieval: gathering relevant contexts from external knowledge
- Augmentation: using trained bottlenecks to perform sliding-window inference
- Response generation: producing reliable outputs even with conflicting information

>> Impressive Results
The results speak for themselves:
- Improved single-choice task accuracy by at least 7.54% over competitive baselines
- Enhanced performance across single-choice, open-ended QA, and retrieval-augmented generation (RAG) tasks
- Significantly reduced output uncertainty in LLMs
- Maintained high-quality response generation without compromising authenticity or coherence

This research represents a significant step forward in making retrieval-augmented LLMs more reliable in the wild, especially when dealing with misinformation, biases, or outdated knowledge.

What are your thoughts on this approach to handling knowledge conflicts in AI systems? Have you encountered similar challenges in your work with LLMs?
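To make the "adaptively accepts or rejects external context" step more concrete, here is an illustrative gating sketch. This is not the Swin-VIB implementation: `estimate_uncertainty` merely stands in for the trained variational bottleneck's score, and `llm` is a placeholder callable.

```python
# Illustrative context-gating pattern (NOT the paper's code): score each retrieved
# passage for uncertainty and only include the confident ones in the prompt.
def estimate_uncertainty(question: str, passage: str) -> float:
    raise NotImplementedError("stand-in for the bottleneck's uncertainty score")

def filter_context(question: str, passages: list[str], threshold: float = 0.5) -> list[str]:
    """Keep passages the gating model is confident are helpful; drop ambiguous ones."""
    return [p for p in passages if estimate_uncertainty(question, p) < threshold]

def answer(question: str, passages: list[str], llm) -> str:
    context = "\n\n".join(filter_context(question, passages))
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```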
-
Many companies are diving into AI agents without a clear framework for when they are appropriate or how to assess their effectiveness. Several recent benchmarks offer a more structured view of where LLM agents are effective and where they are not.

LLM agents consistently perform well in short, structured tasks involving tool use. A March 2025 survey on evaluation methods highlights their ability to decompose problems into tool calls, maintain state across multiple steps, and apply reflection to self-correct. Architectures like PLAN-and-ACT and AgentGen, which incorporate Monte Carlo Tree Search, improve task completion rates by 8 to 15 percent across domains such as information retrieval, scripting, and constrained planning.

Structured hybrid pipelines are another area where agents perform reliably. Benchmarks like ThinkGeo and ToolQA show that when paired with stable interfaces and clearly defined tool actions, LLMs can handle classification, data extraction, and logic operations at production-grade accuracy.

Performance drops sharply in more complex settings. In Vending-Bench, agents tasked with managing a vending operation over extended interactions failed after roughly 20 million tokens. They lost track of inventory, misordered events, or repeated actions indefinitely. These breakdowns occurred even when the full context was available, pointing to fundamental limitations in long-horizon planning and execution logic.

SOP-Bench further illustrates this boundary. Across 1,800 real-world industrial procedures, function-calling agents completed only 27 percent of tasks. When exposed to larger tool registries, performance degraded significantly: agents frequently selected incorrect tools despite having structured metadata and step-by-step guidance.

These findings suggest that LLM agents work best when the task is tightly scoped, repeatable, and structured around deterministic APIs. They consistently underperform when the workflow requires extended decision-making, coordination, or procedural nuance.

To formalize this distinction, I use the SMART framework to assess agent fit (a small checklist-as-code sketch follows this post):
• 𝗦𝗰𝗼𝗽𝗲 & 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 – Is the process linear and clearly defined?
• 𝗠𝗲𝘁𝗿𝗶𝗰𝘀 & 𝗠𝗲𝗮𝘀𝘂𝗿𝗲𝗺𝗲𝗻𝘁 – Is there sufficient volume and quantifiable ROI?
• 𝗔𝗰𝗰𝗲𝘀𝘀 & 𝗔𝗰𝘁𝗶𝗼𝗻𝗮𝗯𝗶𝗹𝗶𝘁𝘆 – Are tools and APIs integrated and callable?
• 𝗥𝗶𝘀𝗸 & 𝗥𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆 – Can failures be logged, audited, and contained?
• 𝗧𝗲𝗺𝗽𝗼𝗿𝗮𝗹 𝗟𝗲𝗻𝗴𝘁𝗵 – Is the task short, self-contained, and episodic?

When all five criteria are met, agentic automation is likely to succeed. When even one is missing, the use case may require redesign before introducing LLM agents. The strongest agent implementations I've seen start with ruthless scoping, not ambitious scale.

What filters do you use before greenlighting an AI agent?
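A tiny checklist-as-code version of SMART. The five fields and the all-criteria-must-pass rule mirror the post; the field names, example use cases, and the dataclass framing are my own illustrative choices.

```python
# SMART checklist sketch: score a candidate agent use case on the five criteria
# from the post; only greenlight when every one of them holds.
from dataclasses import dataclass

@dataclass
class AgentUseCase:
    scoped_and_structured: bool  # Scope & Structure: linear, clearly defined process?
    measurable: bool             # Metrics & Measurement: volume and quantifiable ROI?
    actionable: bool             # Access & Actionability: tools/APIs integrated and callable?
    reliable: bool               # Risk & Reliability: failures logged, audited, contained?
    temporally_short: bool       # Temporal Length: short, self-contained, episodic?

    def ready_for_agents(self) -> bool:
        return all([self.scoped_and_structured, self.measurable,
                    self.actionable, self.reliable, self.temporally_short])

# Hypothetical examples, just to show usage.
invoice_triage = AgentUseCase(True, True, True, True, True)
open_ended_ops = AgentUseCase(True, False, True, True, False)
print(invoice_triage.ready_for_agents())  # True  -> likely a good agent fit
print(open_ended_ops.ready_for_agents())  # False -> redesign before adding agents
```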