Confident, Fluent, and Wrong: The Silent Failure Mode Your Evals Aren't Catching

There's a specific production incident that doesn't announce itself. Your logs show HTTP 200s. Your monitoring reports zero errors. Your dashboard is green. And somewhere in your system, an AI agent has been returning plausible-looking garbage for 72 hours while your users make decisions based on it.

Towards AI documented exactly this scenario: a deployed agent silently hallucinating for three days, nothing in the infrastructure flagging it, HTTP 200s the whole way down. If you've been following this newsletter, you know I covered the green-dashboard problem back in April. What's worth revisiting now is why these failures are so hard to catch — and why the failure taxonomy matters for how you instrument.

The Two Failure Modes Aren't the Same Problem

Most teams treat hallucination as a single category. It isn't, and conflating the two types leads to evals that catch the easy case while missing the dangerous one.

The first type is pure fabrication: invented citations, made-up statistics, named entities with no grounding. These fail loudly in practice — a fabricated paper is traceable, a wrong statistic gets contradicted. Your evals probably catch most of these.

The second type is what Tian Pan calls operational hallucination: the model knows the right domain, selects the right tool, invokes the right concept — and gets the operationally critical detail wrong. A backup flag that silently no-ops. A date format one delimiter off. An API parameter from a neighboring library version that's syntactically valid but functionally broken. The code runs. The API returns 200. The backup "completes." Until it doesn't, in a way that's hard to trace back to the model.
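To make "syntactically valid, operationally wrong" concrete, here is a minimal sketch using Python's standard `datetime.strptime`. The date string and both format strings are invented for illustration and are not taken from any incident cited here. Both calls parse without raising, yet they produce different dates, which is the kind of detail a quick scan will not catch.

```python
from datetime import datetime

raw = "2024-03-04"  # hypothetical value produced upstream by an agent

# Intended format: year-month-day.
intended = datetime.strptime(raw, "%Y-%m-%d")

# Plausible but wrong format: day and month swapped. It still parses cleanly.
swapped = datetime.strptime(raw, "%Y-%d-%m")

print(intended.date())      # 2024-03-04
print(swapped.date())       # 2024-04-03
print(intended == swapped)  # False, and no exception was raised anywhere
```

Nothing in this snippet would show up in an error log. The only way to notice the difference is to compare the resulting value against what the workflow was supposed to produce.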

This is the failure mode your evals are almost certainly not catching, because surface-level correctness checks pass. The output looks right. It would satisfy a human reviewer doing a quick scan.

Why the System Around the Model Fails to Notice

The deeper problem isn't the model — it's the missing layer between model output and downstream consequence. Roli Bosch at Hermes Labs describes it precisely: the system didn't just fail to alert you, it failed to record that something was missing in the first place. A null result treated as a valid empty response. An incomplete retrieval result passed downstream as if it were complete. The system kept going, and everyone downstream inherited an answer that looked complete.

Datadog's State of AI Engineering report from April 2026 found that roughly 1 in 20 production AI requests fail silently while continuing to return outputs that look correct — and that's the aggregate across all failure types. Operational hallucinations are a subset, but they're the subset with the worst blast radius because they tend to affect agentic workflows where the model is taking actions, not just generating text.

The Air Canada chatbot case — documented in detail by Pericherla and Srinivasan — is the canonical example of this at the business layer. The chatbot passed its demo. Someone tested it. It gave a clean, plausible-looking answer about bereavement fares. That answer was wrong in exactly the way that costs money: directionally correct (yes, there's a bereavement policy), operationally broken (the timing requirement was inverted). A British Columbia tribunal awarded the customer $650.88 plus interest. The chatbot was removed within weeks.

The pattern: confident, fluent, directionally correct, operationally wrong.

What Actually Catches This

Standard evals check for factual accuracy against a reference set. They don't check whether acting on a model's output produces the intended outcome. That requires a different instrumentation layer.

Jaskaran Singh's production experience points to context window failures as a related silent failure class — models losing original instructions without warning as context fills, leading to inconsistent behavior in automated systems. The failure isn't visible in the output; it's visible only in behavioral drift over time.

For operational hallucinations specifically, the detection has to happen at the execution boundary, not the output boundary. That means:

Execution-layer validation: check the effect of tool calls, not just their syntax. Did the backup actually run? Did the row count change as expected?
Behavioral regression tests: run the same agentic workflow against a known state and verify the outcome, not the output string.
Failure mode classification: tools like LangHeal's LLM-as-a-judge approach, which classifies failures as schema violations, hallucinations, or tool failures, give you the taxonomy to route different failure types to different remediation paths.
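The first item in that list, execution-layer validation, can be sketched with the standard-library `sqlite3` module. This is a rough illustration under stated assumptions, not a production pattern: `run_tool_call`, `validated_call`, the `orders` table, and the expected row-count deltas are all hypothetical names and values chosen for the example.

```python
import sqlite3

def run_tool_call(conn, sql, params=()):
    """Stand-in for an agent-issued tool call (hypothetical helper)."""
    conn.execute(sql, params)
    conn.commit()

def count_rows(conn, table):
    # The table name comes from trusted test code here, never from model output.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def validated_call(conn, sql, params, table, expected_delta):
    """Run the call, then verify its effect on the row count."""
    before = count_rows(conn, table)
    run_tool_call(conn, sql, params)
    after = count_rows(conn, table)
    return (after - before) == expected_delta, before, after

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders (status) VALUES ('open')")
conn.commit()

# A correct call: exactly one row is added.
good = validated_call(
    conn, "INSERT INTO orders (status) VALUES (?)", ("open",), "orders", 1
)

# An operationally wrong call: valid SQL that runs without error,
# but the status value is slightly off ('opened'), so nothing is deleted.
bad = validated_call(
    conn, "DELETE FROM orders WHERE status = ?", ("opened",), "orders", -2
)

print(good)  # (True, 1, 2)
print(bad)   # (False, 2, 2)
```

The second call never raises and would look healthy in a log. Only the before-and-after comparison reveals that it did nothing.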

The core insight from Parthasarathy's production evaluation retrospective is worth keeping: in traditional software, failures are crashes — they're loud. In LLM systems, failures are subtly wrong answers that look identical to correct ones in your logs. A right answer and a hallucinated answer are both just strings.

Your eval harness needs to stop treating them as equivalent.

Eval Patterns

Test at the execution boundary, not the output boundary. For agentic workflows, the eval question isn't "does this output look correct?" — it's "did the system state change in the expected way?" Write integration tests that verify downstream effects: row counts, file states, API side effects. Operational hallucinations are invisible to string-matching evals and only surface when you check what actually happened.
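A behavioral regression test along these lines might look like the sketch below. It assumes a hypothetical `run_backup` step with a `dry_run` flag that silently skips the copy while still reporting success; the function names, file names, and flag are illustrative and not drawn from any real tool. The test builds a known state, runs the workflow, and compares file contents rather than the returned message.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def run_backup(src: Path, dst: Path, dry_run: bool = False) -> str:
    """Hypothetical agent-driven backup step. It always reports success."""
    if not dry_run:
        shutil.copytree(src, dst, dirs_exist_ok=True)
    return "Backup completed successfully."

def tree_digest(root: Path) -> dict:
    """Map each relative file path to the SHA-256 of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def backup_matches_source(dry_run: bool) -> bool:
    """Build a known state, run the workflow, compare the outcome to the source."""
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "src", Path(tmp) / "dst"
        src.mkdir()
        (src / "a.txt").write_text("alpha")
        (src / "b.txt").write_text("beta")
        message = run_backup(src, dst, dry_run=dry_run)
        assert "completed" in message  # the output string looks fine either way
        return dst.exists() and tree_digest(dst) == tree_digest(src)

print(backup_matches_source(dry_run=False))  # True
print(backup_matches_source(dry_run=True))   # False
```

A string-matching eval would pass both runs, since the message is identical. The state comparison is what separates them.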

Reliability Notes

Add a null-result interception layer. Silent failures often originate when empty or incomplete retrieval results get passed downstream as valid responses. Before model output reaches any action layer, intercept null results explicitly — log them, route them to a fallback, and never let "no data found" silently become "here's a confident answer anyway." This is the architectural gap Bosch identifies as the layer most teams aren't building.
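One possible shape for such an interception layer is sketched below. Everything here is an assumption made for illustration: the `Retrieval` container, the `expected_min` completeness threshold, `MissingDataError`, and the `retrieve`, `generate`, and `fallback` callables are hypothetical and do not come from Bosch's writing or any specific framework.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

logger = logging.getLogger("null_guard")

@dataclass
class Retrieval:
    documents: Sequence[str]
    expected_min: int = 1  # hypothetical completeness threshold

class MissingDataError(Exception):
    """Raised when a retrieval result is empty or incomplete."""

def guard_retrieval(result: Optional[Retrieval]) -> Retrieval:
    """Record and reject missing data instead of passing it downstream."""
    if result is None or len(result.documents) == 0:
        logger.warning("retrieval returned no data")
        raise MissingDataError("no data found")
    if len(result.documents) < result.expected_min:
        logger.warning(
            "retrieval incomplete: got %d, expected at least %d",
            len(result.documents), result.expected_min,
        )
        raise MissingDataError("incomplete retrieval")
    return result

def answer(
    query: str,
    retrieve: Callable[[str], Optional[Retrieval]],
    generate: Callable[[str, Sequence[str]], str],
    fallback: Callable[[str, str], str],
) -> str:
    """Only call the model when the guard confirms there is data to ground it."""
    try:
        context = guard_retrieval(retrieve(query))
    except MissingDataError as exc:
        return fallback(query, str(exc))
    return generate(query, context.documents)

# Usage with stub callables standing in for real retrieval and generation.
generate = lambda q, docs: f"answer grounded in {len(docs)} document(s)"
fallback = lambda q, why: f"escalated: {why}"

empty = answer("refund policy", lambda q: Retrieval([]), generate, fallback)
partial = answer("refund policy", lambda q: Retrieval(["a"], expected_min=3), generate, fallback)
full = answer("refund policy", lambda q: Retrieval(["a", "b", "c"], expected_min=3), generate, fallback)

print(empty)    # escalated: no data found
print(partial)  # escalated: incomplete retrieval
print(full)     # answer grounded in 3 document(s)
```

The design choice worth noting is that the guard both logs and raises. The log gives you the record that something was missing, and the exception forces the caller to pick an explicit path for it.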

Cost Watch

Failure remediation is more expensive than failure prevention. The Air Canada case ended in a tribunal award of under $1,000, but the real cost was the news cycle, the legal overhead, and the chatbot shutdown. For smaller teams, the equivalent is customer churn from a campaign that bounced, or a data migration that silently corrupted a table. The compute cost of adding execution-layer validation is trivial compared to the incident response cost of discovering an operational hallucination three days after deployment. Instrument the execution boundary now; it's cheaper than the postmortem.