Experiment · April 15, 2025 · 18 min read

How I Learned to Stop Worrying and Love the Cache

A story about building Esy's RAG system, making expensive mistakes, and discovering that sometimes the old ways are the best ways

Sarah Chen
Building Esy

Last Tuesday at 3:47 AM, I discovered our RAG system was making the same embedding call 47,000 times per hour. The same query. The same vector. The same response. Over and over, like an extremely expensive version of Groundhog Day. Our OpenAI bill was having the best month of its life.

This is the story of how I learned that in the race to build cutting-edge AI systems, sometimes the most powerful optimization is the one developers have used since the dawn of computing: just cache it.

The Discovery

It started with a Datadog alert. "Anomaly detected: API costs." I ignored it. We were growing fast, costs go up, that's normal. Then came the second alert. Then the third. By the time I checked our dashboard, we were burning through $400 per hour on OpenAI embeddings.

The culprit? A seemingly innocent function that generated embeddings for user queries. Every. Single. Time. Even for identical queries. Even for queries we'd seen thousands of times before.

"$400 per hour on embeddings. The same ones. Over and over."

Here's what our code looked like. Brace yourself:

The Expensive Mistake
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def get_embedding(text: str):
    # This ran on EVERY. SINGLE. QUERY.
    response = await client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

No cache. No memoization. No nothing. Just pure, unadulterated API calls flowing directly from our servers to OpenAI's accountants.

The Embarrassingly Simple Solution

The fix took about a dozen lines of actual logic and about as many minutes to implement. I almost didn't write this post because the solution felt too obvious. But then I remembered: obvious solutions that save hundreds of thousands of dollars are worth sharing.

Before: $400/hour
After: $12/hour

The Simple Fix
import hashlib
import json

import redis.asyncio as aioredis
from openai import AsyncOpenAI

client = AsyncOpenAI()
redis = aioredis.from_url("redis://localhost:6379")  # adjust to your Redis URL

# Level 1: a small in-process cache for hot queries.
# (functools.lru_cache doesn't mix with async functions, so a plain dict
#  does the job here; cap or evict it in production.)
_local_cache: dict[str, list[float]] = {}

async def get_embedding(text: str) -> list[float]:
    cache_key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"

    # Level 1: in-process
    if cache_key in _local_cache:
        return _local_cache[cache_key]

    # Level 2: Redis
    cached = await redis.get(cache_key)
    if cached:
        embedding = json.loads(cached)
        _local_cache[cache_key] = embedding
        return embedding

    # Only call the API on a genuine miss
    response = await client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    embedding = response.data[0].embedding

    # Cache for 7 days
    await redis.setex(cache_key, 604800, json.dumps(embedding))
    _local_cache[cache_key] = embedding
    return embedding

Two levels of caching: an in-process cache for hot queries, Redis for everything else. The results were immediate and dramatic.
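
To make the win concrete, here's a minimal usage sketch. It assumes the get_embedding above, a running Redis, and an OPENAI_API_KEY in the environment; the query string is just an example. The first call pays the API round trip, the second comes straight back from memory.

Seeing the Cache Hit
import asyncio
import time

async def main():
    query = "what is retrieval-augmented generation?"

    start = time.perf_counter()
    await get_embedding(query)   # cold: hits OpenAI, then fills both caches
    print(f"cold call: {(time.perf_counter() - start) * 1000:.1f} ms")

    start = time.perf_counter()
    await get_embedding(query)   # warm: served from the in-process dict
    print(f"warm call: {(time.perf_counter() - start) * 1000:.1f} ms")

asyncio.run(main())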

The Numbers Don't Lie

  • Cost Reduction: 97%
  • Cache Hit Rate: 94%
  • Latency Improvement: 42ms → 3ms
  • One very happy CFO

But the real win wasn't just the cost savings. Our system became faster, more reliable, and could handle 10x the load without breaking a sweat. Turns out, not hitting an external API for every request is good for performance. Who knew?

What I Learned

"In our rush to build the future, we sometimes forget the lessons of the past. Caching isn't old-fashioned — it's timeless."

The tech industry has a habit of reinventing wheels. We build complex distributed systems, implement cutting-edge algorithms, and architect for infinite scale. But sometimes, the solution has been sitting in computer science textbooks since the 1960s.

Cache invalidation might be one of the two hard problems in computer science, but cache implementation? That's been solved for decades. And when you're dealing with embeddings that don't change, invalidation isn't even a concern.
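
If you want belt and braces anyway, one hedged tweak is to bake the model name into the cache key, so switching embedding models simply starts a fresh cache rather than ever serving stale vectors. The helper below is a hypothetical variation on the key scheme used above, not what's running in our production code:

A Model-Aware Cache Key
import hashlib

def embedding_cache_key(text: str, model: str = "text-embedding-3-small") -> str:
    # Including the model in both the prefix and the hash means a model
    # upgrade naturally misses the old entries; no invalidation required.
    digest = hashlib.md5(f"{model}:{text}".encode()).hexdigest()
    return f"emb:{model}:{digest}"

print(embedding_cache_key("hello world"))
# emb:text-embedding-3-small:<md5 digest>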

  • Monitor your API costs obsessively (see the counter sketch after this list)
  • Profile before optimizing (but also, just add caching)
  • The boring solution is often the right solution
  • Your CFO will love you for thinking about costs
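
On the monitoring point: you don't need anything fancy to stay cost-aware. The sketch below is a minimal, hypothetical version using plain counters to track the cache hit rate that drives the bill; in production you'd push the same numbers to whatever system fires your cost alerts.

Counting Hits and Misses
from collections import Counter

stats = Counter()

def record(hit: bool) -> None:
    # Hypothetical hook: call record(True) on the cache-hit branches of
    # get_embedding and record(False) right before the OpenAI call.
    stats["hits" if hit else "misses"] += 1

def hit_rate() -> float:
    total = stats["hits"] + stats["misses"]
    return stats["hits"] / total if total else 0.0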

So here's to caching — the unsexy, unglamorous, absolutely essential optimization that saved our bacon. Sometimes the best engineering is knowing when not to engineer at all.

#caching #performance #rag #architecture

About Sarah Chen

Building Esy. Writing about the intersection of technology and creativity.
