Rate Limits Aren't Your Problem. Your Rate Limit Math Is.

Most teams discover their LLM capacity plan is wrong at 2am, not during sprint planning. The bill looked fine. The projected monthly spend was within budget. Then a queue drained, a retry storm kicked in, and production started throwing 429s on a workload that was, by every cost-based metric, comfortably within limits.

This is the default failure mode. And it's not bad luck — it's a predictable consequence of planning from the wrong number.

Cost and Capacity Are Different Functions

When you multiply requests by tokens by price, you get a smooth, linear curve. Rate limits don't work that way. As DEV Community's LLM capacity planning breakdown puts it: cost is a smooth, linear function of tokens; rate limits are a step function of several independent dimensions, and the one that binds first is usually not the one you watched.

Anthropic's API enforces three separate counters per model class: RPM, ITPM (input tokens per minute), and OTPM (output tokens per minute). A 429 fires the instant any single one is exceeded. Respan's production guide gives the Tier 1 Sonnet numbers: 50 RPM, 30,000 ITPM, 8,000 OTPM. Fifty 30-token pings in one minute fit within the 50 RPM allowance while consuming only 1,500 of the 30,000 input tokens. A single 40,000-token document is already over the input ceiling before the first response token is generated.
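To make "the one that binds first" concrete, here is a minimal sketch that computes utilization per dimension and reports the tightest one. The limits are the Tier 1 Sonnet figures quoted above; the helper names (MinuteLimits, utilization, binding_dimension) are hypothetical and not part of any provider SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinuteLimits:
    rpm: int   # requests per minute
    itpm: int  # input tokens per minute
    otpm: int  # output tokens per minute


# Figures quoted in the text; check your own tier's current values.
TIER1_SONNET = MinuteLimits(rpm=50, itpm=30_000, otpm=8_000)


def utilization(limits, requests, input_tokens, output_tokens):
    """Share of each per-minute limit consumed (1.0 means at the ceiling)."""
    return {
        "rpm": requests / limits.rpm,
        "itpm": input_tokens / limits.itpm,
        "otpm": output_tokens / limits.otpm,
    }


def binding_dimension(limits, requests, input_tokens, output_tokens):
    """Return (name, utilization) of the dimension closest to, or past, its limit."""
    usage = utilization(limits, requests, input_tokens, output_tokens)
    name = max(usage, key=usage.get)
    return name, usage[name]


# One 40,000-token document in a minute: input tokens bind, not requests.
print(binding_dimension(TIER1_SONNET, requests=1, input_tokens=40_000, output_tokens=500))

# Two long generations in a minute: 4% of RPM, 100% of OTPM.
print(binding_dimension(TIER1_SONNET, requests=2, input_tokens=2_000, output_tokens=8_000))
```

A capacity plan built this way asks "which dimension is at 1.0 first?" for each workload shape, rather than "what does the month cost?"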

OpenAI adds a fourth dimension: daily caps (RPD and TPD). Respan's OpenAI guide flags these as the sneaky ones — a workload that runs comfortably under TPM all day can still walk into a wall at hour 19 because it crossed the daily token ceiling. Nothing in a cost model has a "day" in it.
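The "hour 19" wall is simple arithmetic once the daily cap is in the model. The sketch below assumes steady traffic, and every number in it is hypothetical, chosen only to reproduce the shape of the failure; real TPM and TPD values depend on your model and tier.

```python
def hours_until_daily_cap(tokens_per_minute, tokens_per_day_cap):
    """Hours of steady traffic before the daily token cap is hit.

    Returns None when the cap would not be reached within 24 hours.
    """
    if tokens_per_minute <= 0:
        return None
    hours = tokens_per_day_cap / (tokens_per_minute * 60)
    return hours if hours < 24 else None


# Hypothetical workload: a steady 100,000 tokens per minute (comfortably
# under an assumed 200,000 TPM limit) against an assumed 114,000,000 TPD cap.
print(hours_until_daily_cap(100_000, 114_000_000))
```

A per-minute dashboard would show this workload at 50% all day, right up until the daily counter runs out.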

The practical consequence: you can be at 4% of your RPM budget and 100% of your OTPM budget and get throttled. Your dashboard shows green. Your users see errors.

The Per-Second Trap Nobody Talks About

Here's the detail that catches teams who do think about RPM. A "60 RPM" limit is not "60 requests you can fire in a burst at t=0 and then idle for 60 seconds." Providers can enforce per-minute limits over shorter intervals, effectively per second. As the DEV Community piece explains, a 60 RPM limit behaves much closer to "1 request per second." If your traffic is bursty (a cron job fans out, a queue drains, a retry storm kicks in), you can be averaging well under 60 RPM over the full minute and still get a 429, because you exceeded the instantaneous allowance.
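A small simulation shows why the average is misleading. This sketch models "60 RPM enforced as roughly 1 request per second" as a token bucket with a burst capacity of one request. That is an assumption for illustration only; providers do not publish their exact enforcement algorithm, and count_rejections is a hypothetical helper.

```python
def count_rejections(arrival_times, rpm, burst=1):
    """Count requests a per-second-style bucket would reject.

    The bucket refills at rpm/60 requests per second and holds at most
    `burst` requests' worth of allowance (an assumed enforcement model).
    """
    rate = rpm / 60.0
    tokens = float(burst)
    last = None
    rejected = 0
    for t in sorted(arrival_times):
        if last is not None:
            # Refill for the time elapsed since the previous arrival.
            tokens = min(float(burst), tokens + (t - last) * rate)
        last = t
        if tokens >= 1.0:
            tokens -= 1.0
        else:
            rejected += 1
    return rejected


# 30 requests in one minute is half of a 60 RPM limit either way.
burst_traffic = [0.0] * 30                       # queue drains all at once
smooth_traffic = [2.0 * i for i in range(30)]    # one request every 2 seconds

print(count_rejections(burst_traffic, rpm=60))   # most of the burst is rejected
print(count_rejections(smooth_traffic, rpm=60))  # nothing is rejected
```

Both traffic patterns average 30 RPM. Only the shape differs, and only the shape decides whether you see 429s.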

This is why retry storms are so destructive. The agent hits a 429, retries, hits another 429, retries with more context appended, and the per-second bucket never recovers. TrueFoundry's gateway writeup describes the shape precisely: each retry is a full provider call, context grows, and the unit of waste is dollars per token rather than milliseconds per request.
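The compounding is easy to put a number on. A rough sketch, assuming each retry resends the full context plus a fixed number of newly appended tokens; the function name and the figures are hypothetical.

```python
def retry_input_tokens(base_context, growth_per_retry, attempts):
    """Total input tokens sent across `attempts` calls when every retry
    resends the whole context plus `growth_per_retry` extra tokens."""
    return sum(base_context + i * growth_per_retry for i in range(attempts))


# Hypothetical agent: 4,000-token context, 500 tokens appended per retry,
# 5 attempts in total.
print(retry_input_tokens(4_000, 500, 5))
```

Under these assumptions, one task that fails five times sends 25,000 input tokens, most of a 30,000 ITPM budget, without producing a single useful response.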

Token-Based Rate Limiting at the Gateway Layer

The fix starts with metering in the right units. Zuplo's engineering blog makes the case directly: requests-per-minute is the wrong meter for LLM endpoints. One call can be 50 tokens or 50,000. A 60-RPM cap treats them identically, and the heavy user empties your provider budget before the cheap user finishes their first session.

The providers already know this — they publish token-based limits themselves. Your gateway should meter in the same units. That means tracking input and output tokens separately per identity, not just counting request completions.
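Metering in token units per identity might look something like the following. This is a minimal in-memory sketch, not any gateway's actual implementation: the class names are hypothetical, it assumes you can count input tokens and know the request's maximum output tokens before forwarding it, and a real deployment would need shared state and locking.

```python
import time


class TokenBudget:
    """A refilling budget measured in LLM tokens rather than requests."""

    def __init__(self, per_minute, now):
        self.capacity = float(per_minute)
        self.rate = per_minute / 60.0  # tokens restored per second
        self.available = float(per_minute)
        self.updated = now

    def refill(self, now):
        elapsed = max(0.0, now - self.updated)
        self.available = min(self.capacity, self.available + elapsed * self.rate)
        self.updated = now


class PerIdentityMeter:
    """Tracks input and output token budgets separately for each identity."""

    def __init__(self, itpm, otpm, clock=time.monotonic):
        self.itpm = itpm
        self.otpm = otpm
        self.clock = clock
        self._budgets = {}  # identity -> (input budget, output budget)

    def admit(self, identity, input_tokens, max_output_tokens):
        now = self.clock()
        if identity not in self._budgets:
            self._budgets[identity] = (
                TokenBudget(self.itpm, now),
                TokenBudget(self.otpm, now),
            )
        inp, out = self._budgets[identity]
        inp.refill(now)
        out.refill(now)
        # Check both dimensions before spending either, so a rejected
        # request does not consume any budget.
        if input_tokens > inp.available or max_output_tokens > out.available:
            return False
        inp.available -= input_tokens
        out.available -= max_output_tokens
        return True


meter = PerIdentityMeter(itpm=30_000, otpm=8_000)
print(meter.admit("heavy-user", input_tokens=25_000, max_output_tokens=1_000))
print(meter.admit("heavy-user", input_tokens=10_000, max_output_tokens=1_000))
print(meter.admit("light-user", input_tokens=50, max_output_tokens=100))
```

The heavy user's second call is refused on input tokens while the light user is unaffected, which is exactly the isolation a request counter cannot provide.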

The three primitives that actually work in production, per TrueFoundry's analysis: a token bucket per identity (so one runaway agent doesn't exhaust shared quota), a circuit breaker per pattern (so retry storms get cut off before they compound), and a fallback chain per route (so a 429 on the primary provider doesn't become a user-facing error).
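Of the three primitives, the circuit breaker is the one teams most often skip. Here is a minimal sketch of the idea, assuming a simple consecutive-failure threshold and a fixed cooldown. The class, its parameters, and the defaults are hypothetical and are not taken from TrueFoundry's implementation.

```python
import time


class CircuitBreaker:
    """Stops calls for a cooldown period after repeated consecutive failures."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        # After the cooldown, calls are allowed again; a further failure
        # re-opens the circuit immediately because the count is not reset.
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=30.0)
for _ in range(3):
    if breaker.allow():
        breaker.record_failure()  # pretend the provider returned a 429

print(breaker.allow())  # the circuit is open, so the fourth retry never leaves
```

Keeping one breaker per retry pattern (per agent, per route, or per prompt template) means a single looping workflow trips its own circuit without blocking healthy traffic.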

429 and 529 Are Not the Same Error

One more distinction that matters operationally: Anthropic returns two different error shapes that look identical to a naive retry handler. Respan's Anthropic guide draws the line clearly. A 429 rate_limit_error means you exceeded your tier's limits — retry with exponential backoff and jitter, respect the retry-after header. A 529 overloaded_error means Anthropic's own capacity is saturated regardless of your tier. There's no retry-after value you can trust, and retrying into it just makes things worse. The right response to a 529 is to route to a fallback provider — Bedrock, Vertex, or a secondary key — not to back off and hammer the same endpoint again.
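A retry handler that respects this distinction can start as a small decision function. The sketch below assumes the caller has already extracted the status code and any retry-after value from the response. The function name, return shape, and default delays are hypothetical, and the backoff uses the "full jitter" variant as one reasonable choice among several.

```python
import random


def next_action(status_code, attempt, retry_after=None,
                base_delay=1.0, max_delay=60.0, rng=random.random):
    """Decide how to react to a failed provider call.

    Returns ("retry", delay_seconds), ("fallback", 0.0), or ("raise", 0.0).
    """
    if status_code == 429:
        # Over your own quota: wait, preferring the server's hint if present.
        if retry_after is not None:
            return ("retry", float(retry_after))
        ceiling = min(max_delay, base_delay * 2 ** attempt)
        return ("retry", rng() * ceiling)  # exponential backoff with full jitter
    if status_code == 529:
        # Provider-side overload: waiting on the same endpoint does not help.
        return ("fallback", 0.0)
    return ("raise", 0.0)


print(next_action(429, attempt=0, retry_after=7))
print(next_action(529, attempt=0))
```

Keeping the decision separate from the HTTP client makes it easy to test and to reuse across providers whose overload signals differ.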

FutureAGI's fallback strategy guide documents exactly this failure mode: a team running a customer-support copilot on a single Anthropic key watched it go offline during a cluster failure, then manually swapped to OpenAI with a 14-minute redeploy. The fix isn't faster incident response. It's a gateway that distinguishes between "you're over quota" and "the provider is down" and routes accordingly without human intervention.
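The routing half of that gateway can be sketched as an ordered fallback chain. This is an illustration under stated assumptions, not FutureAGI's or any vendor's code: ProviderError and call_with_fallback are hypothetical, and the provider callables stand in for real client calls.

```python
class ProviderError(Exception):
    """Hypothetical error type carrying the provider's HTTP status code."""

    def __init__(self, status_code):
        super().__init__(f"provider returned {status_code}")
        self.status_code = status_code


def call_with_fallback(providers, prompt):
    """Try each (name, callable) in order, moving on only when a provider
    reports overload (529). Other errors propagate to the caller, which
    can apply quota-style handling such as backoff."""
    last_error = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as err:
            if err.status_code != 529:
                raise
            last_error = err
    raise RuntimeError("every provider in the chain is overloaded") from last_error


# Stand-in providers for illustration only.
def primary(prompt):
    raise ProviderError(529)


def secondary(prompt):
    return f"answer to: {prompt}"


print(call_with_fallback([("primary", primary), ("secondary", secondary)], "hello"))
```

With a chain like this in the request path, the switch to the secondary provider happens per request rather than per redeploy.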

The capacity plan that keeps you off the 2am rotation isn't built from your monthly bill. It's built from the per-minute, per-dimension limits of your specific model tier — and it accounts for what happens when any one of those dimensions hits the ceiling before the others.