How I Learned to Stop Worrying and Love the Cache
A story about building Esy's RAG system, making expensive mistakes, and discovering that sometimes the old ways are the best ways
Last Tuesday at 3:47 AM, I discovered our RAG system was making the same embedding call 47,000 times per hour. The same query. The same vector. The same response. Over and over, like an extremely expensive version of Groundhog Day. Our OpenAI bill was having the best month of its life.
This is the story of how I learned that in the race to build cutting-edge AI systems, sometimes the most powerful optimization is the one developers have used since the dawn of computing: just cache it.
The Discovery
It started with a Datadog alert. "Anomaly detected: API costs." I ignored it. We were growing fast, costs go up, that's normal. Then came the second alert. Then the third. By the time I checked our dashboard, we were burning through $400 per hour on OpenAI embeddings.
The culprit? A seemingly innocent function that generated embeddings for user queries. Every. Single. Time. Even for identical queries. Even for queries we'd seen thousands of times before.
"$400 per hour on embeddings. The same ones. Over and over."
Here's what our code looked like. Brace yourself:
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def get_embedding(text: str) -> list[float]:
    # This ran on EVERY. SINGLE. QUERY. No cache, no reuse.
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding
No cache. No memoization. No nothing. Just pure, unadulterated API calls flowing directly from our servers to OpenAI's accountants.
The Embarrassingly Simple Solution
The fix took 12 lines of code and about as many minutes to implement. I almost didn't write this post because the solution felt too obvious. But then I remembered: obvious solutions that save hundreds of thousands of dollars are worth sharing.
Before: $400/hour
After: $12/hour
import hashlib
import json

from async_lru import alru_cache  # functools.lru_cache can't cache results of async functions
from openai import AsyncOpenAI
from redis.asyncio import Redis

client = AsyncOpenAI()
redis = Redis()

@alru_cache(maxsize=10000)  # Level 1: in-process LRU for hot queries
async def get_embedding(text: str) -> list[float]:
    # Level 2: shared Redis cache, keyed by a hash of the query text
    cache_key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"
    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)

    # Only call the API if nothing is cached
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    embedding = response.data[0].embedding

    # Cache for 7 days; embeddings for a fixed model never change
    await redis.setex(cache_key, 604800, json.dumps(embedding))
    return embedding
Two levels of caching: LRU in-memory for hot queries, Redis for everything else. The results were immediate and dramatic.
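As a quick sanity check, this is roughly how the two layers behave, assuming the function above, a reachable Redis instance, and an OpenAI key in the environment (the query string is just an example):

import asyncio

async def main() -> None:
    # First call misses both caches and pays for one OpenAI request
    v1 = await get_embedding("how do I cite a source in APA?")
    # An identical second call is served from the in-process LRU, no network involved
    v2 = await get_embedding("how do I cite a source in APA?")
    assert v1 == v2

asyncio.run(main())

The second call never leaves the process, and a third call from a different server would still skip OpenAI thanks to the shared Redis layer.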
The Numbers Don't Lie
- Cost Reduction: 97%
- Cache Hit Rate: 94%
- Latency Improvement: 42ms → 3ms
- Happy CFO: ∞
But the real win wasn't just the cost savings. Our system became faster, more reliable, and could handle 10x the load without breaking a sweat. Turns out, not hitting an external API for every request is good for performance. Who knew?
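The hit-rate number is easy to reproduce in your own setup: a pair of counters next to the cache lookup is all it takes. A minimal sketch, assuming the async Redis client from the fix above (the counter key names are made up):

from redis.asyncio import Redis

redis = Redis()

async def record_cache_result(hit: bool) -> None:
    # Bump a shared counter on every lookup so hit rate can sit next to API spend on a dashboard
    await redis.incr("emb:cache_hits" if hit else "emb:cache_misses")

async def cache_hit_rate() -> float:
    hits = int(await redis.get("emb:cache_hits") or 0)
    misses = int(await redis.get("emb:cache_misses") or 0)
    total = hits + misses
    return hits / total if total else 0.0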
What I Learned
"In our rush to build the future, we sometimes forget the lessons of the past. Caching isn't old-fashioned — it's timeless."
The tech industry has a habit of reinventing wheels. We build complex distributed systems, implement cutting-edge algorithms, and architect for infinite scale. But sometimes, the solution has been sitting in computer science textbooks since the 1960s.
Cache invalidation might be one of the two hard problems in computer science, but cache implementation? That's been solved for decades. And when you're dealing with embeddings that never change for a given model, invalidation isn't even a concern.
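One safeguard that isn't in the fix above but costs nothing: bake the model name into the cache key, so that switching embedding models invalidates old entries automatically. A minimal sketch (the helper name is hypothetical):

import hashlib

EMBEDDING_MODEL = "text-embedding-3-small"

def embedding_cache_key(text: str) -> str:
    # Keying on the model name means a model upgrade can never serve stale vectors
    return f"emb:{EMBEDDING_MODEL}:{hashlib.md5(text.encode()).hexdigest()}"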
- Monitor your API costs obsessively
- Profile before optimizing (but also, just add caching)
- The boring solution is often the right solution
- Your CFO will love you for thinking about costs
So here's to caching — the unsexy, unglamorous, absolutely essential optimization that saved our bacon. Sometimes the best engineering is knowing when not to engineer at all.
About Sarah Chen
Building Esy. Writing about the intersection of technology and creativity.