The Three-Layer Cache Stack That Actually Cuts Your AI Bill

Most teams treating LLM cost as a model selection problem are solving the wrong equation. You can swap GPT-4o for a cheaper model and claw back 30% — or you can fix your caching architecture and cut 50–90% without touching your model at all. The math isn't close.

The reason teams miss this: "caching" in LLM infrastructure isn't one thing. It's three distinct layers, each eliminating cost at a different point in the request lifecycle. Running only one layer while ignoring the others is like patching one memory leak and calling the system optimized.

The Three Layers Don't Overlap — They Stack

Akshay Ghalme's deep-dive on LLM caching lays out the architecture cleanly: provider prompt caching, semantic caching, and edge/response caching operate independently. A request that misses all three pays full price. A request that hits layer one still calls the LLM. A request that hits layer two never touches the LLM at all.

Layer 1: Provider prompt caching. Anthropic and OpenAI both cache the KV attention matrices for prompt prefixes — meaning if your system prompt, document context, or tool definitions appear at the front of every request, the provider can skip recomputing them. The discount on cached input tokens runs 50–90%, depending on what fraction of your prompt is stable. The implementation difference matters: OpenAI activates this automatically with limited control, while Anthropic exposes explicit cache_control breakpoints in the API, giving you precise control over which sections get cached. If you're on Anthropic and haven't set cache breakpoints, you're leaving money on the table every request.

The structural rule is simple: stable content at the front, dynamic content at the rear. Your system prompt, retrieved documents, and tool schemas go first. The user's actual query goes last. Teams that invert this — prepending dynamic context before a stable system prompt — get near-zero cache hit rates and wonder why the bill isn't moving.

Layer 2: Semantic caching. This is where you eliminate the LLM call entirely. Instead of asking the model the same question twice, you embed incoming queries into vectors, check a vector store for semantically similar past queries above a similarity threshold (typically 0.92–0.97), and return the cached response on a hit. Production implementations report 20–73% token cost reductions, with cache hit latency around 50ms versus 1–3 seconds for a full LLM call.

The workloads where this pays off most are the predictable ones: customer support bots, FAQ systems, documentation assistants. One benchmark across 45,000 queries reported a 40% cache hit rate with 24x faster response times on hits. The embedding cost to power this — roughly $0.02 per million tokens for lightweight models — is negligible against the inference spend it displaces.

The failure mode worth watching: similarity threshold tuning. Set it too low and you return cached answers to queries that are semantically adjacent but contextually different. Set it too high and your hit rate collapses. This is the trickiest operational knob in the stack, and it needs workload-specific calibration, not a default value from a tutorial.

Layer 3: Edge/response caching. For deterministic, parameter-driven responses — think report generation with fixed inputs, or AI features where the same user action always produces the same prompt — you can cache the final response at your CDN like any HTTP response. Sub-10ms latency, zero per-request LLM cost. This layer applies to a narrower slice of traffic than the other two, but where it fits, it's the cheapest possible outcome.

The Token Budget Problem Upstream of All This

Caching is the highest-leverage optimization, but it doesn't fix bloated prompts. Teams at magically.life, processing over 1 billion tokens per week, found that smart optimization strategies — including prompt compression alongside caching — can reduce costs 70–80%. The claim that most development teams waste 40–60% of their token budgets on suboptimal implementations is consistent with what the caching numbers imply: a lot of those tokens are redundant context being re-sent on every call.

Context compression is the complement to caching. Proxy-based compression approaches report 95%+ accuracy preservation with 40–90% token reduction — though that summary doesn't include enough implementation detail to evaluate the methodology. I'd treat those numbers as directionally interesting rather than a firm benchmark until you've tested against your own workload.

The practical sequence: compress what you send, cache what you send repeatedly. Doing one without the other is half an optimization.

Eval Patterns

Cache correctness needs its own eval pass. Build a test set of query pairs that are semantically similar but should return different answers — edge cases for your similarity threshold. Run it before you tune the threshold down for higher hit rates. A 73% cost reduction means nothing if 5% of cache hits are returning wrong answers silently.

Reliability Notes

Semantic cache staleness is the production failure nobody talks about until it happens. If your underlying data changes — product catalog, policy docs, knowledge base — cached responses can go stale without any error signal. Implement TTLs on cache entries tied to your data update cadence, not a generic expiration window. Watch for user complaints about outdated information as a lagging indicator that your TTL is too long.

Cost Watch

Before you build anything: instrument your current prompt structure and measure what fraction of input tokens are stable versus dynamic per request. If stable content is less than 40% of your average prompt, provider-level prompt caching will underperform. Fix the prompt structure first. The optimization only works if there's something worth caching.