AI Ops Weekly

7/21/2026

Your Prompt Logs Are Useless. Here's What to Capture Instead.

A request comes in. The model returns something wrong. You open your logs and find: status: 200, latency: 1.4s, tokens: 847. Completely healthy. Completely useless. This is the def…

7/14/2026

P99 Is the Number Your SLO Should Be Built Around. Most Teams Use P50.

There's a specific failure mode I see constantly in production AI systems: a team sets a latency SLO, hits it consistently on their dashboards, and still gets user complaints about…

7/7/2026

JSON Schema Validation Ate My Retry Budget. Here's What Actually Replaced Regex.

There's a specific kind of production incident that starts with a Slack message like "the extraction pipeline is returning nulls again." You dig in. The LLM returned valid JSON — s…

6/30/2026

Your Fallback Chain Looks Right on Paper. It's Silently Corrupting Data in Production.

Most teams discover their LLM fallback strategy is broken at the worst possible moment: not during a 503 from a provider, but three steps later, when a downstream service starts be…

6/23/2026

The Dashboards Are Green. The Answers Are Getting Worse. Here's How to Catch That.

There's a specific kind of production incident that never pages anyone. The agent runs on schedule. Latency is flat. Costs are stable. HTTP 200s all the way down. And over six week…

6/16/2026

You're Not Paying for AI. You're Paying to Resend the Same Tokens Over and Over.

A startup launched an AI research assistant. Their cost model said $0.04 per query. Their actual cost was $4.20 per session. By week three, they'd accumulated $67,000 in unexpected…

6/9/2026

Rate Limits Aren't Your Problem. Your Rate Limit Math Is.

Most teams discover their LLM capacity plan is wrong at 2am, not during sprint planning. The bill looked fine. The projected monthly spend was within budget. Then a queue drained,…

6/2/2026

Prompt Changes Are Code Changes. Start Treating Them That Way.

Most teams have a deployment process for code. Peer review, staging environment, automated tests, rollback plan. The whole apparatus. Then someone edits a system prompt in a shared…

5/26/2026

Confident, Fluent, and Wrong: The Silent Failure Mode Your Evals Aren't Catching

There's a specific production incident that doesn't announce itself. Your logs show HTTP 200s. Your monitoring reports zero errors. Your dashboard is green. And somewhere in your s…

5/19/2026

The Three-Layer Cache Stack That Actually Cuts Your AI Bill

Most teams treating LLM cost as a model selection problem are solving the wrong equation. You can swap GPT-4o for a cheaper model and claw back 30% — or you can fix your caching ar…

5/12/2026

SRE Taught You to Monitor Systems. AI Breaks That Assumption at the Foundation.

The fintech team that added a single comma to their system prompt didn't know they'd broken anything. Their application kept running. Latency was normal. Error rate: zero. Their in…

5/5/2026

Fine-Tuning Is a Capital Expenditure. Treat It Like One.

Most small teams reach for fine-tuning the moment their prompts stop working. That instinct is expensive — and usually wrong. The decision between prompt engineering, fine-tuning,…

4/28/2026

Your Dashboards Are Green. Your AI Is Producing Garbage.

There's a specific kind of production incident that doesn't look like an incident. No pages fire. No error budgets burn. The status page stays green. But somewhere in the last six…

4/21/2026

Your RAG Pipeline's Retrieval Layer Is Lying to You

Here's the uncomfortable starting point: most teams shipping RAG systems have no idea what their retrieval quality actually is. They ship, collect vague user feedback, and assume t…

4/15/2026

RAG Pipelines Don't Fail at the Model Layer. They Fail Three Steps Before It.

Your demo worked. The retrieval looked clean, the answers were coherent, and you shipped it. Then real users showed up with real questions, and the support tickets started. Wrong a…

4/7/2026

Prompt Regression Is a Deployment Problem, Not a Testing Problem

Most teams discover they have a prompt regression problem the same way: a model provider quietly ships an update, outputs start drifting, and someone notices in a support ticket th…

3/27/2026

Your API Bill Is the Cheap Part

Most teams budget for LLM costs by pulling up a pricing page and multiplying tokens by rate. That math isn't wrong — it's just incomplete by a factor of three or four, depending on…

3/25/2026

Benchmarks Gave You a Number. Production Gave You a Pager Alert.

There's a moment every team hits, usually around week three of a new model rollout, when someone pulls up the eval dashboard and says some version of: "But it scored higher on the…

3/13/2026

Benchmarks Pass. Production Burns. Here's What Eval Actually Looks Like.

A team ships a customer support bot with 94% accuracy on their internal test suite. Two weeks into production, escalations are up 40%. The model is fluent, latency is fine, HTTP 20…

The Queue Is the Feature: Why Your Fallback Chain Fails Before the Backup Model Fires
