
Your Dashboards Are Green. Your AI Is Producing Garbage.


There's a specific kind of production incident that doesn't look like an incident. No pages fire. No error budgets burn. The status page stays green. But somewhere in the last six weeks, your AI feature quietly got worse — and you're finding out because a user complained, or because churn ticked up, or because someone on the team finally ran a manual spot-check and didn't like what they saw.

This is the monitoring gap. And it's not a tooling problem you can close by adding a Datadog plugin.

Traditional APM Was Built for a Different Kind of Failure

Standard application monitoring answers one question well: did the infrastructure work? Request completed, HTTP 200 returned, latency within SLA. For most services, that's enough. For LLM-powered features, it covers roughly 20% of the actual failure space.

The other 80% hides in failure modes that don't raise exceptions. A request can complete in 300ms, return a 200, consume tokens, and produce an answer that is confidently wrong. No alert fires. No on-call page. The system just silently produces worse outputs at the same speed and cost.

The core problem is architectural: traditional observability treats the model as a black box that either returns a response or doesn't. It has no concept of whether that response is correct, grounded, or consistent with what the same prompt produced last month. As one SRE put it: "The failure mode isn't a 500 error — it's a confident hallucination delivered with perfect latency and a 200 status code. Your dashboards are green. Your AI is producing garbage."

The Failure Modes That Don't Show Up in Your Metrics

The most dangerous is semantic degradation — output quality declining gradually with no discrete trigger. Retrieval data drifts as document collections update. User queries evolve toward edge cases the system wasn't tuned for. Prompt changes accumulate small regressions that each look within tolerance but collectively pull quality down. Production RAG systems show significant retrieval accuracy degradation within 90 days of initial deployment — not because anything broke, but because everything quietly shifted.

Then there's the vendor status page problem. On April 20, 2026, a 90-minute partial ChatGPT outage initially went undetected by status pages. A month earlier, in March, Azure-hosted GPT-5.2 endpoints returned HTTP 400 and 429 errors for 20 hours while aggregate availability signals stayed green. Vendor status pages rely on binary aggregates that mask silent latency creep — p99 latency doubling while the average stays within SLA bands, or regional degradation while global metrics look healthy.

And then there's cost. Inference cost is generated where routing decisions happen — model selection, retry logic, token budgets, context window management — while your observability monitors the infrastructure layer. These are different layers, and the gap between them is expensive. A poorly optimized prompt can cost more per day than the entire Kubernetes cluster running the application. Output tokens cost 3–10x more than input tokens. What looks like $500/month in a pilot becomes $15,000 at production scale, before accounting for growth.
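To make that arithmetic concrete, here's a back-of-the-envelope sketch in Python. The per-token prices are assumptions for illustration, not any vendor's rate card; the point is how the output-token premium compounds with traffic.

```python
# Illustrative per-token prices (assumed, not a real rate card):
INPUT_PRICE = 2.50 / 1_000_000    # $ per input token
OUTPUT_PRICE = 10.00 / 1_000_000  # $ per output token: 4x the input rate

def monthly_cost(requests_per_day: int, in_tokens: int, out_tokens: int) -> float:
    per_request = in_tokens * INPUT_PRICE + out_tokens * OUTPUT_PRICE
    return requests_per_day * per_request * 30

pilot = monthly_cost(2_000, 1_500, 600)        # light pilot traffic
production = monthly_cost(60_000, 1_500, 600)  # same prompt, 30x traffic
print(f"pilot ${pilot:,.0f}/mo -> production ${production:,.0f}/mo")
# pilot $585/mo -> production $17,550/mo
```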

What You Actually Need to Instrument

The metrics that matter for AI system health aren't the ones your APM vendor surfaces by default. Based on what practitioners are actually building:

Time-to-first-token (TTFT) is your earliest warning signal for provider-side queueing — it spikes before total response latency reveals an incident. Token throughput (tokens per second) catches slow-generation failures where the stream produces tokens at half-speed, bypassing standard wall-clock timeouts. Structured-output validation rates identify silent quality drift caused by model-routing fallbacks — invisible to traditional latency and availability metrics.
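A minimal sketch of what that instrumentation can look like, assuming the OpenAI Python SDK and the OpenTelemetry metrics API. The TTFT metric name loosely follows the experimental GenAI semantic conventions; the throughput metric name is made up, since no standard exists for it yet, and the model name is a placeholder.

```python
import time

from openai import OpenAI
from opentelemetry import metrics

meter = metrics.get_meter("llm.monitoring")
ttft_hist = meter.create_histogram(
    "gen_ai.server.time_to_first_token",  # experimental GenAI semconv name
    unit="s",
    description="Latency until the first streamed token arrives",
)
throughput_hist = meter.create_histogram(
    "llm.tokens_per_second",  # hypothetical name; no standard exists yet
    unit="{token}/s",
    description="Token generation rate over the full stream",
)

client = OpenAI()

def stream_with_metrics(prompt: str, model: str = "gpt-4o") -> str:
    start = time.monotonic()
    first_token_at = None
    chunks: list[str] = []
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            if first_token_at is None:
                # First content chunk: record TTFT before anything else.
                first_token_at = time.monotonic()
                ttft_hist.record(first_token_at - start,
                                 {"gen_ai.request.model": model})
            chunks.append(delta)
    elapsed = time.monotonic() - start
    if chunks and elapsed > 0:
        # Chunk count approximates token count; swap in a tokenizer
        # if you need exact figures.
        throughput_hist.record(len(chunks) / elapsed,
                               {"gen_ai.request.model": model})
    return "".join(chunks)
```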

The tooling market reflects the problem. AI-native platforms like Langfuse, Arize Phoenix, and Helicone understand prompts, tokens, and semantic evaluation — but have no context about your infrastructure, SLOs, or cost centers. Traditional APM vendors understand infrastructure deeply but treat AI as just another microservice. OpenTelemetry's GenAI Semantic Conventions are the closest thing to a unifying standard — still experimental as of Q1 2026, not GA. The instrumentation layer is converging. Everything above it is fragmented.

I'd argue the practical path for small teams right now is layered: keep your existing APM for infrastructure health, add one AI-native tool for semantic and cost visibility, and instrument TTFT and output validation rates yourself via OpenTelemetry. It's not elegant. It's what the current tooling reality supports.
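For the output-validation piece, one way to do it yourself is to count schema pass/fail rates so drift shows up as a declining ratio. This sketch assumes pydantic v2 and the OpenTelemetry metrics API; the schema and metric name are illustrative.

```python
from opentelemetry import metrics
from pydantic import BaseModel, ValidationError

meter = metrics.get_meter("llm.monitoring")
validation_counter = meter.create_counter(
    "llm.structured_output.validations",  # hypothetical metric name
    description="Structured-output validation attempts, tagged by outcome",
)

class Answer(BaseModel):
    # Stand-in schema; replace with whatever your feature expects.
    summary: str
    confidence: float

def validate_output(raw_json: str, model: str) -> Answer | None:
    try:
        answer = Answer.model_validate_json(raw_json)
        validation_counter.add(1, {"outcome": "valid",
                                   "gen_ai.request.model": model})
        return answer
    except ValidationError:
        # A rising invalid rate often means a silent routing fallback,
        # not a code change on your side.
        validation_counter.add(1, {"outcome": "invalid",
                                   "gen_ai.request.model": model})
        return None
```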


Eval Patterns

SLO-based monitoring is outperforming threshold alerting for LLM reliability. Honeycomb's production data shows SLOs help catch anomalies without triggering noisy alert floods — critical when a single prompt tweak can produce wildly different outputs. Pair SLOs with multi-window burn-rate alerts: page only when the breach rate consumes your error budget across both a short window (say, 5 minutes) and a longer one (say, an hour), not on individual long-prompt outliers.
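A minimal sketch of that multi-window logic, assuming a 99% quality SLO; the 14.4x threshold is the classic fast-burn value from the Google SRE Workbook, and the window stats would come from whatever metrics backend you already run.

```python
from dataclasses import dataclass

SLO_TARGET = 0.99              # 99% of responses meet the quality bar
ERROR_BUDGET = 1 - SLO_TARGET  # 1% may fail over the SLO period

@dataclass
class WindowStats:
    failures: int
    total: int

    @property
    def error_rate(self) -> float:
        return self.failures / self.total if self.total else 0.0

def burn_rate(window: WindowStats) -> float:
    """How many times faster than 'exactly on budget' we are failing."""
    return window.error_rate / ERROR_BUDGET

def should_page(short: WindowStats, long: WindowStats,
                threshold: float = 14.4) -> bool:
    # Requiring BOTH windows to breach suppresses pages from one
    # long-prompt outlier: the short window confirms it's happening
    # now, the long window confirms it's not a blip.
    return burn_rate(short) >= threshold and burn_rate(long) >= threshold

# e.g. 5-minute and 1-hour windows fed from your metrics backend
print(should_page(WindowStats(12, 60), WindowStats(144, 720)))  # True
```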

Reliability Notes

Regional skew detection is non-negotiable if you're running on hosted endpoints. Global aggregates can stay healthy while EU regions are fully degraded — exactly the failure mode in the March 2026 Azure incident. Instrument per-region TTFT independently. Don't wait for your vendor's status page to tell you something you should have detected yourself.
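One lightweight way to self-detect skew without waiting on a vendor, assuming you can tag each request with its region. The window sizes and the 2x ratio are illustrative starting points, not tuned values.

```python
from collections import defaultdict, deque
from statistics import quantiles

WINDOW = 500        # rolling samples kept per region
SKEW_RATIO = 2.0    # flag a region when its p99 is 2x the global p99
MIN_SAMPLES = 100   # don't judge until there's enough data

ttft_samples: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW))

def record_ttft(region: str, ttft_seconds: float) -> list[str]:
    """Record one TTFT sample; return regions that look degraded."""
    ttft_samples[region].append(ttft_seconds)
    all_samples = [s for q in ttft_samples.values() for s in q]
    if len(all_samples) < MIN_SAMPLES:
        return []
    global_p99 = quantiles(all_samples, n=100)[98]
    degraded = []
    for reg, q in ttft_samples.items():
        if len(q) >= 50 and quantiles(list(q), n=100)[98] > SKEW_RATIO * global_p99:
            degraded.append(reg)
    return degraded
```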

Cost Watch

Reasoning models like o3 add internal thinking tokens that inflate consumption silently — and most teams don't see this until the bill arrives. If you're running any reasoning model in production, instrument token consumption at the model-routing layer, not just at the API response layer. The gap between what you think you're spending and what you're actually spending lives in that instrumentation blind spot.
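A sketch of metering at that layer, assuming the OpenAI SDK, whose usage object reports reasoning tokens for reasoning models under completion_tokens_details; adapt the field access for other providers. The function name and ledger shape are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def route_and_meter(prompt: str, model: str) -> tuple[str, dict]:
    """Call the model and return (text, token ledger) for cost tracking."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    usage = resp.usage
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", 0) or 0
    ledger = {
        "model": model,
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
        # Reasoning tokens are billed as output but never appear in the
        # response text: the silent multiplier on the invoice.
        "reasoning_tokens": reasoning,
        "visible_output_tokens": usage.completion_tokens - reasoning,
    }
    return resp.choices[0].message.content, ledger
```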