Latest issue

The SLO You Set in a Conference Room Will Lie to You in Production

8/2/2026

There's a particular kind of meeting that happens at most engineering organizations, usually sometime in Q1 or after a bad incident. Someone pulls up a blank doc, types "Service Level Objectives," and the room starts negotiating. By the end of an hour, you have numbers. Four nine…

Recent posts

7/26/2026
The Incident You Closed Is Still Teaching You — If You Bother to Listen
The postmortem is filed. The action items are in Jira. Someone marked the incident resolved at 4:17am and went back to sleep. By Monday morning, the ticket has a due date three spr…
7/19/2026
The Escalation Path Nobody Tested Is the One You'll Need Tonight
You've tested your rollback procedure. You've run game days. You've validated that your alerting fires within thirty seconds of a threshold breach. What you probably haven't tested…
7/12/2026
The Toil You're Not Measuring Is the Toil That's Eating Your Team
Somewhere on your team right now, someone is doing something for the third time this week that they've done a hundred times before. Restarting a service. Manually promoting a confi…
7/5/2026
The Chaos Monkey Problem: Why Failure Injection Tells You Less Than You Think
You schedule the chaos experiment for Tuesday at 2pm. You kill a pod. The system recovers. You write "resilience validated" in the ticket and close it. Three weeks later, a real fa…
6/28/2026
The Incident Command Vacuum: Why Your Response Breaks Down Before the First Slack Message
The pager fires at 2:47am. Within ninety seconds, three engineers are awake and in the incident channel. Someone starts checking dashboards. Someone else begins restarting services…
6/21/2026
The Monitoring Dashboard Nobody Looks at During an Incident
There's a particular kind of operational irony that only reveals itself at 2am: the team spent three months building a beautiful observability dashboard, and when the incident actu…
6/14/2026
The Reliability Debt You're Paying Without Knowing It
Nobody budgets for the cost of almost-incidents. The production system that degraded for eleven minutes and then recovered on its own. The deployment that caused elevated error rat…
6/7/2026
The Dependency You Don't Know About Is the One That Will Kill You
The failure mode nobody writes about in their postmortem isn't the database that crashed or the deploy that went sideways. It's the service three hops away that nobody on your team…
5/31/2026
The On-Call Handoff Is Where Your Incident Response Actually Breaks
The incident is over. The service is green. Someone writes "resolved" in the Slack thread and closes the bridge. The on-call engineer who fought through the night hands off to the…
5/24/2026
The Postmortem That Only Asks "What Broke" Is Missing the Harder Question
You're in the postmortem. The timeline is on the screen. Someone walks through the sequence: the deploy went out, the error rate climbed, the alert fired, the on-call responded, th…
5/17/2026
The Blast Radius Problem: Why Your Incident Scope Is Almost Always Wrong
The call comes in at 2:47am. Database latency is spiking. You page the on-call DBA, scope the incident to the database tier, and start working the problem. Forty minutes later, you…
5/10/2026
The Deployment That Worked Fine in Staging Will Betray You in Production
You've seen this movie. The change passes every test. Staging looks clean. The deploy goes out on a Tuesday afternoon — low traffic, good timing, cautious team. Then something star…
5/3/2026
The Incident You Survived Isn't the One That Will Break You
The postmortem is done. The action items are filed. Someone updated the runbook. You closed the ticket, and the on-call rotation moved on. Three months later, a different system fa…
4/26/2026
The Alert That Fires Every Time Teaches You Nothing
Three hundred alerts in a week. Forty of them actionable. The other two hundred and sixty? Your team learned to ignore those months ago — they just haven't gotten around to deletin…
4/19/2026
Your Oncall Rotation Is a Diagnostic Tool You're Not Reading
Most teams treat oncall as a staffing problem. Someone has to be paged; someone has to respond. The rotation exists to distribute that burden. Fair enough. But if that's the whole…
4/12/2026
The Runbook That Lies to You Is Worse Than No Runbook at All
You're twenty minutes into an incident. The alerts are firing, the on-call channel is filling up, and someone pastes a link to the runbook. You follow step three. Nothing changes.…
4/5/2026
The Runbook Nobody Updates Is the One You'll Need at 3am
There's a particular kind of dread that hits when you're mid-incident, you've pulled up the runbook, and the first step references a service that was deprecated eight months ago. T…
3/30/2026
The Runbook That Lies to You Is Worse Than No Runbook at All
It's 2:47am. The alert fires. You find the runbook, follow it step by step, and the system gets worse. Somewhere between the last incident and this one, the architecture changed —…
3/25/2026
Nobody Knows What "Operational" Means Anymore, and That's the Problem
There's a version of this newsletter that opens with a crisp thesis, three supporting data points, and a clean close. That version would be dishonest right now. Because the most op…