I’ve been thinking a lot about accuracy lately.

In conversational AI, accuracy is the metric that gets the most attention. But the more time I spend with voice agents, the more I’ve realized: accuracy captures only a thin slice of how they actually perform.

A voice agent can maintain a high accuracy score while drifting across turns, skipping guardrails, or responding to an ASR hallucination, words that were never in the audio. None of that shows up in a simple pass/fail metric.

This came up in a previous episode of The Voice Loop (our podcast) with Fabian Seipel from ai-coustics. The audio layer alone introduces distortions most teams never evaluate: clipped speech, mic inconsistencies, compression artifacts, overlapping voices, and more.

The reality is: accuracy measures the endpoint, not the behavior of the pipeline that produced it. And conversational AI is a pipeline problem: ASR → retrieval → reasoning → response. If any of those layers misinterprets, drifts, or breaks, the “accurate” answer can still be wrong in practice.

I wrote more about this on the Hamming AI blog. Link in comments!