Experiments
April 15, 2025 · 18 min read

How I Learned to Stop Worrying and Love the Cache

Why our 97% cost reduction came from forgetting everything we knew about modern architecture

Zev
Founder, Esy

Building Esy's RAG system taught us that sometimes the most sophisticated solution is knowing when to be simple. This is the story of how we reduced our inference costs by 97% by going backwards.

When we first architected our retrieval-augmented generation system, we followed every best practice. We built a microservices architecture with separate vector databases, implemented complex caching strategies, and used the latest embedding models. The system was beautiful, modern, and expensive.

The Problem

Our API costs were spiraling out of control. Every user query triggered multiple embedding calls, vector similarity searches, and LLM inference requests. We were burning through $400/hour during peak usage.

```python
import openai

# Our initial approach - no caching
async def get_embedding(text: str):
    # This ran on EVERY. SINGLE. QUERY.
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response['data'][0]['embedding']

# With 1000+ requests per minute, we had a problem
```

The Solution

The breakthrough came when we analyzed our query patterns. We discovered that 80% of queries fell into predictable categories. Instead of computing everything on-demand, we could pre-compute and cache the most common response patterns.

```python
import time

# Our simple but effective caching strategy
class SmartCache:
    def __init__(self, ttl=3600):
        self.cache = {}
        self.ttl = ttl

    def get_or_compute(self, query_hash, compute_fn):
        if query_hash in self.cache:
            cached_time, result = self.cache[query_hash]
            if time.time() - cached_time < self.ttl:
                return result

        # Cache miss - compute and store
        result = compute_fn()
        self.cache[query_hash] = (time.time(), result)
        return result

    def invalidate_pattern(self, pattern):
        keys_to_remove = [k for k in self.cache.keys() if pattern in k]
        for key in keys_to_remove:
            del self.cache[key]
```
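For illustration, here is roughly how a call site can use it. The `embed` stub and `cache_key` helper below are hypothetical stand-ins, not our production code:

```python
import hashlib

def embed(text: str) -> list:
    # Hypothetical stand-in for the expensive embedding call
    return [0.0] * 1536

def cache_key(text: str) -> str:
    # Stable key derived from the raw query text
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

cache = SmartCache(ttl=3600)

query = "How does retrieval-augmented generation work?"
embedding = cache.get_or_compute(cache_key(query), lambda: embed(query))
```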

But the real game-changer wasn't the caching algorithm—it was what we chose to cache. Instead of caching individual embedding vectors or database results, we cached complete response chains.
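To make that concrete, here is a minimal sketch of what caching a response chain means in practice: the key comes from a normalized form of the query, and the cached value is the final assembled answer rather than intermediate embeddings or retrieved chunks. The `normalize_query` and `run_full_rag_pipeline` helpers are illustrative assumptions, not our exact code:

```python
import re

def normalize_query(text: str) -> str:
    # Collapse case and whitespace so near-duplicate queries share one cache entry
    return re.sub(r"\s+", " ", text.strip().lower())

def run_full_rag_pipeline(text: str) -> str:
    # Placeholder for the full chain: embed -> retrieve -> assemble prompt -> LLM
    return f"generated answer for: {text}"

response_cache = SmartCache(ttl=3600)

def answer_query(text: str) -> str:
    # Cache the end product of the whole chain, not the intermediate vectors
    key = normalize_query(text)
    return response_cache.get_or_compute(key, lambda: run_full_rag_pipeline(text))
```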

The Results

The system now handles 10x the traffic at 1/30th the cost. Response times dropped from 2+ seconds to under 200ms. Most importantly, we achieved this without sacrificing quality—our cache hit rate sits consistently above 95%.

Performance Metrics

  • 97% cost reduction

  • 90% faster response times

  • 95% cache hit rate

Chart: Cache Performance, Before vs After Optimization

Architecture Evolution

Before: Complex Microservices

User Query → Query Service → Vector DB → Embedding Service → LLM Service → Response Assembly → Cache Layer

After: Smart Monolith

User Query → Pattern Matcher → Cache Hit/Miss → [LLM if needed] → Response
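As a rough sketch of that flow, reusing `normalize_query`, `response_cache`, and `run_full_rag_pipeline` from the earlier sketch (the category keywords and `handle_query` entry point are assumptions for illustration):

```python
# Hypothetical pattern matcher: bucket a query into a coarse category
# before touching the cache or the LLM.
CATEGORY_KEYWORDS = {
    "code_help": ("how do i", "example", "snippet"),
    "debugging": ("error", "traceback", "not working"),
}

def match_pattern(text: str) -> str:
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return category
    return "general"

def handle_query(text: str) -> str:
    # Cache key combines category and normalized text; the LLM only runs on a miss
    key = f"{match_pattern(text)}:{normalize_query(text)}"
    return response_cache.get_or_compute(key, lambda: run_full_rag_pipeline(text))
```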

Cache Strategy Deep Dive

Digging further into the logs, we bucketed every query by intent. The distribution below is what convinced us that pre-computing answers for the hottest categories would pay off; a small warm-up sketch follows it.

Query Pattern Analysis

Query Distribution by Type

  • General Questions: 45%

  • Code Help: 25%

  • Debugging: 20%

  • Other: 10%
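Because general questions dominate, pre-computing answers for the most frequent queries before peak hours is cheap insurance. A hypothetical warm-up pass, reusing `handle_query` from the sketch above (in practice the list would come from query-log analytics):

```python
# Hypothetical warm-up pass: pre-compute answers for the most frequent queries
top_queries = [
    "what is retrieval-augmented generation?",
    "how do i fix a circular import in python?",
]

for q in top_queries:
    handle_query(q)  # populates the response cache ahead of peak traffic
```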

Cost Breakdown

Monthly API Costs (USD)

  • Before Optimization: $2,400

  • After Optimization: $75

Savings: $2,325/month (97% reduction) • ROI: 3,000% in first month

Chart: Monthly Infrastructure Costs

Lessons Learned

"Sometimes the most sophisticated engineering solution is knowing when to be simple. We spent months building a complex, beautiful system that solved the wrong problem."

The key insight wasn't technical—it was analytical. By deeply understanding our usage patterns before optimizing, we found a solution that was simultaneously simpler and more effective than our original architecture.

Key Takeaways

  • Measure first, optimize second: The data will often surprise you

  • Cache computed results, not raw data: Cache the expensive calculation, not the inputs

  • Monitor religiously: Set up alerts for cache performance metrics

  • The boring solution is often the right solution: Your CFO will love you for thinking about costs

This experience reinforced a fundamental principle: measure first, optimize second. The data will often surprise you, and the best engineering solution isn't always the most complex one.

Tags: performance, architecture, cost-optimization
