A startup launched an AI research assistant. Their cost model said $0.04 per query. Their actual cost was $4.20 per session. By week three, they'd accumulated $67,000 in unexpected LLM costs — not because the model was doing anything wrong, but because nobody had done the compound math.
That gap between estimated and actual cost is where most small teams get hurt. And it's almost never the model's fault.
The Bill Is Mostly History Replay
Here's the math that bites you. When a user sends a three-word message to your AI agent, they're not sending three words. They're sending the entire conversation history, every tool call result, and your system prompt, all over again. Vishal Bulbule's breakdown puts a concrete number on it: a single "What's the status?" can trigger 18,000 input tokens. At Claude Sonnet rates (the figures imply about $3 per million input tokens), that's $0.054 for one message. Across a 10-seat team running five sessions a day, you're looking at $500–$3,000/month, most of it history replay rather than new content.
The $67,000 startup story is instructive because every component was individually defensible. An 8,000-token system prompt covering all the edge cases? Reasonable. Injecting the 10 most relevant RAG chunks at 1,500 tokens each? Sounds thorough. Preserving conversation history across a research session? Good UX. Prompting for detailed, well-sourced answers? Obviously correct. The problem is that these decisions compound. By turn 10 of a 15-turn session, you're paying for nine prior turns of full questions and comprehensive answers on every single call. The actual per-session cost was roughly 100x the per-query estimate: $4.20 against $0.04.
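To make the compounding concrete, here is a minimal sketch of a session cost model. The system prompt and RAG sizes come from the story above; the question and answer sizes, the prices, and the choice to re-inject chunks each turn without keeping them in history are assumptions for illustration, not the startup's real numbers.

```python
# Sketch of how input cost compounds across a multi-turn session.
# Prices and the question/answer sizes are illustrative assumptions.

INPUT_PRICE_PER_M = 3.00    # USD per million input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens (assumed)


def session_cost(turns, system_tokens=8_000, rag_tokens=10 * 1_500,
                 question_tokens=50, answer_tokens=1_500):
    """Return (total_usd, per_turn_usd) for a session that replays history."""
    history_tokens = 0
    per_turn = []
    for _ in range(turns):
        # Every call resends the system prompt, fresh RAG chunks,
        # the full prior conversation, and the new question.
        input_tokens = (system_tokens + rag_tokens
                        + history_tokens + question_tokens)
        cost = (input_tokens * INPUT_PRICE_PER_M
                + answer_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
        per_turn.append(cost)
        # The question and answer become history for the next turn.
        history_tokens += question_tokens + answer_tokens
    return sum(per_turn), per_turn


total, per_turn = session_cost(turns=15)
naive = 15 * per_turn[0]  # the "clean single-turn" estimate, scaled up
print(f"first turn: ${per_turn[0]:.4f}")
print(f"last turn:  ${per_turn[-1]:.4f}")
print(f"session:    ${total:.2f} (single-turn estimate x15: ${naive:.2f})")
```

Even in this simplified model, every turn costs more than the one before it, and the static prompt plus retrieved chunks are billed again on each call. Swap in your own token counts and prices; the shape of the curve matters more than these particular values.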
This is why cost modeling on a "clean single-turn basis" is almost always wrong for agentic workloads.
The 50x Model Mismatch Nobody Talks About
The compound context problem is one failure mode. The model selection problem is a different one, and it's arguably more expensive at scale.
Neurometric's analysis of token engineering puts the spread bluntly: frontier models cost $5–$25 per million tokens, while small models handle commodity tasks (extraction, classification, formatting) for $0.10–$0.50 per million tokens. That's a 50x price difference when you compare like-for-like ends of each range, and up to 250x at the extremes, paid on every request, every day, for work the cheaper model handles just as well on benchmarks.
The reason teams don't fix this isn't ignorance — it's that model routing requires ongoing maintenance. Major model releases land weekly. Gemma 4 went from non-existent to routing 240 billion tokens per week on OpenRouter. Keeping your routing logic current with that pace of change is a real engineering burden, which is why most teams default to one frontier model for everything and pay the premium on tasks that don't need it.
The fix isn't complicated in principle: classify your tasks by complexity, route simple ones to cheaper models, and reserve frontier spend for the calls that actually require it. The operational discipline to maintain that routing is where it gets hard.
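In its simplest form, that routing rule is a lookup. The sketch below assumes two hypothetical model tiers with made-up names and prices picked from inside the ranges quoted above, and a hand-maintained set of commodity task types; a real router would need its own classification step and current pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                  # hypothetical model identifier
    price_per_m_tokens: float  # USD per million tokens (assumed, blended)


SMALL = ModelTier("small-model", 0.25)
FRONTIER = ModelTier("frontier-model", 15.00)

# Task types assumed to be safe for the cheaper tier.
COMMODITY_TASKS = {"extraction", "classification", "formatting"}


def route(task_type: str) -> ModelTier:
    """Send commodity work to the small model, everything else to frontier."""
    return SMALL if task_type in COMMODITY_TASKS else FRONTIER


def monthly_cost(workload: dict[str, int], router) -> float:
    """workload maps task type -> tokens per month."""
    return sum(tokens * router(task).price_per_m_tokens / 1_000_000
               for task, tokens in workload.items())


# Hypothetical workload: most tokens go to commodity tasks.
workload = {
    "classification": 400_000_000,
    "extraction": 400_000_000,
    "research": 200_000_000,
}

routed = monthly_cost(workload, route)
everything_frontier = monthly_cost(workload, lambda task: FRONTIER)
print(f"routed: ${routed:,.0f}  all-frontier: ${everything_frontier:,.0f}")
```

The hard part is not this function. It is keeping `COMMODITY_TASKS` and the tier table honest as models and prices change.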
Caching Is the Easiest Money You're Leaving on the Table
If you're not using prompt caching and your system prompt is longer than a few hundred tokens, you're paying full price to resend static content thousands of times a day. The AI Corner's token cost playbook puts the ceiling on Claude's prompt caching at 90% off repeated input tokens — the catch is that prefix ordering matters, and you need to hit a meaningful cache hit rate before the economics work.
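A rough way to see where "the economics work" is to compute the break-even hit rate. The sketch below assumes cache reads are billed at 10% of the base input price (the 90% discount above) and cache writes carry a 25% premium; the write premium and the base price are assumptions here, so check your provider's current pricing before relying on them. It also treats every miss as a fresh cache write, which is a simplification.

```python
def uncached_cost(static_tokens, calls, base_price_per_m=3.0):
    """Cost of resending the static prefix at full price on every call."""
    return static_tokens * calls * base_price_per_m / 1_000_000


def cached_cost(static_tokens, calls, hit_rate, base_price_per_m=3.0,
                read_multiplier=0.10, write_multiplier=1.25):
    """Cost of the static prefix when a fraction of calls hit the cache.

    Assumes every miss re-writes the cache at the write premium.
    """
    hits = calls * hit_rate
    misses = calls - hits
    per_token = base_price_per_m / 1_000_000
    return static_tokens * per_token * (hits * read_multiplier
                                        + misses * write_multiplier)


def break_even_hit_rate(read_multiplier=0.10, write_multiplier=1.25):
    """Hit rate at which caching costs the same as not caching."""
    return (write_multiplier - 1.0) / (write_multiplier - read_multiplier)


# Example: an 8,000-token static prefix sent 10,000 times a day.
for rate in (0.1, 0.5, 0.9):
    print(f"hit rate {rate:.0%}: ${cached_cost(8_000, 10_000, rate):.2f} "
          f"vs ${uncached_cost(8_000, 10_000):.2f} uncached")
print(f"break-even hit rate: {break_even_hit_rate():.1%}")
```

Under these assumptions, caching loses money below the break-even rate and pays off quickly above it. That is why prefix ordering matters: static content has to sit at the front of the prompt, unchanged, for the hits to happen at all.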
The broader pattern: most production apps resend the same system prompt, the same tool list, and the same documents on every call. That's not a model problem. That's an architecture problem with a straightforward fix.
Getmaxim's guide to LLM cost tracking makes the operational point that you can't optimize what you can't see. Traditional monitoring wasn't built for token-based pricing — costs vary per request based on input length, output length, model, and cache status. Without per-request attribution at the user, feature, and team level, you're doing cost archaeology after the bill arrives instead of catching waste in real time.
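Per-request attribution can start as a small record attached to every call. The sketch below is one possible shape, not any particular tool's schema: the field names, model names, and price table are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    user: str
    feature: str
    team: str
    model: str
    input_tokens: int         # total input tokens, including cached ones
    cached_input_tokens: int  # portion of the input served from cache
    output_tokens: int


# USD per million tokens; hypothetical models and assumed prices.
PRICES = {
    "frontier-model": {"input": 3.00, "cached_input": 0.30, "output": 15.00},
    "small-model": {"input": 0.25, "cached_input": 0.025, "output": 0.50},
}


def request_cost(r: RequestRecord) -> float:
    """Price one request from its model, token counts, and cache status."""
    p = PRICES[r.model]
    uncached = r.input_tokens - r.cached_input_tokens
    return (uncached * p["input"]
            + r.cached_input_tokens * p["cached_input"]
            + r.output_tokens * p["output"]) / 1_000_000


def cost_by(records, dimension: str) -> dict[str, float]:
    """Total cost grouped by 'user', 'feature', 'team', or 'model'."""
    totals = defaultdict(float)
    for r in records:
        totals[getattr(r, dimension)] += request_cost(r)
    return dict(totals)


records = [
    RequestRecord("alice", "search", "research", "frontier-model", 18_000, 0, 500),
    RequestRecord("bob", "summarize", "support", "frontier-model", 10_000, 8_000, 200),
    RequestRecord("alice", "tagging", "research", "small-model", 2_000, 0, 100),
]
print(cost_by(records, "team"))
print(cost_by(records, "feature"))
```

Once every call emits a record like this, "which team burned the most tokens" becomes a one-line query rather than an investigation.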
The Discipline That's Actually Missing
Sam Altman acknowledged at a recent enterprise event that AI costs went from a non-issue to "a huge issue" in the span of a few months. Uber reportedly set token caps after its COO said spending was becoming harder to justify. Amazon shut down its internal token leaderboard. These are not small-team problems anymore.
The teams that get this right aren't doing anything exotic. They're treating token spend the way they treat database query costs: instrument everything, attribute costs to features, set budgets with enforcement (not just alerts), and review the breakdown in the same meeting as latency and error rates.
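The difference between an alert and enforcement is whether the call still goes through. A minimal sketch of the enforcing version follows; `TokenBudget` and `BudgetExceeded` are hypothetical names, and a production version would need persistence, resets per billing period, and concurrency handling.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push a feature past its budget."""


class TokenBudget:
    def __init__(self, limits_usd: dict[str, float]):
        self.limits = dict(limits_usd)
        self.spent = {feature: 0.0 for feature in limits_usd}

    def charge(self, feature: str, cost_usd: float) -> float:
        """Record spend for a feature, refusing charges over the limit.

        Returns the remaining budget after the charge.
        """
        if feature not in self.limits:
            raise KeyError(f"no budget configured for feature {feature!r}")
        if self.spent[feature] + cost_usd > self.limits[feature]:
            # Enforcement: block the call instead of only logging it.
            raise BudgetExceeded(
                f"{feature}: ${self.spent[feature]:.2f} spent, "
                f"${cost_usd:.2f} requested, limit ${self.limits[feature]:.2f}"
            )
        self.spent[feature] += cost_usd
        return self.limits[feature] - self.spent[feature]


budget = TokenBudget({"search": 1.00})
budget.charge("search", 0.75)      # allowed
try:
    budget.charge("search", 0.50)  # would exceed the $1.00 limit
except BudgetExceeded as exc:
    print(f"blocked: {exc}")
```

What happens after a block (degrade to a cheaper model, queue the request, page someone) is a product decision. The point is that the decision gets made before the money is spent.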
The $47,000 overage that one engineering team discovered in their April architecture review wasn't a billing surprise. It was a visibility failure. Nobody could explain which team, which feature, or which developer burned the most tokens. The logs existed. The attribution didn't.
Token debt compounds exactly like technical debt. The teams paying the least aren't using cheaper models — they're the ones who built the instrumentation before the bill got interesting.
