Experiments
April 15, 2025 · 18 min read

How I Learned to Stop Worrying and Love the Cache

Why our 97% cost reduction came from forgetting everything we knew about modern architecture

Zev
Founder, Esy

Building Esy's RAG system taught us that sometimes the most sophisticated solution is knowing when to be simple. This is the story of how we reduced our inference costs by 97% by going backwards.

When we first architected our retrieval-augmented generation system, we followed every best practice. We built a microservices architecture with separate vector databases, implemented complex caching strategies, and used the latest embedding models. The system was beautiful, modern, and expensive.

The Problem

Our API costs were spiraling out of control. Every user query triggered multiple embedding calls, vector similarity searches, and LLM inference requests. We were burning through $400/hour during peak usage.

```python
import openai

# Our initial approach - no caching
async def get_embedding(text: str):
    # This ran on EVERY. SINGLE. QUERY.
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response['data'][0]['embedding']

# With 1000+ requests per minute, we had a problem
```

The Solution

The breakthrough came when we analyzed our query patterns. We discovered that 80% of queries fell into predictable categories. Instead of computing everything on-demand, we could pre-compute and cache the most common response patterns.

```python
import time

# Our simple but effective caching strategy
class SmartCache:
    def __init__(self, ttl=3600):
        self.cache = {}
        self.ttl = ttl

    def get_or_compute(self, query_hash, compute_fn):
        if query_hash in self.cache:
            cached_time, result = self.cache[query_hash]
            if time.time() - cached_time < self.ttl:
                return result

        # Cache miss - compute and store
        result = compute_fn()
        self.cache[query_hash] = (time.time(), result)
        return result

    def invalidate_pattern(self, pattern):
        keys_to_remove = [k for k in self.cache.keys() if pattern in k]
        for key in keys_to_remove:
            del self.cache[key]
```
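For illustration, here is roughly how a call site can use it. The `embed` stub and `cache_key` helper below are hypothetical stand-ins, not our production code:

```python
import hashlib

def embed(text: str) -> list:
    # Hypothetical stand-in for the expensive embedding call
    return [0.0] * 1536

def cache_key(text: str) -> str:
    # Stable key derived from the raw query text
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

cache = SmartCache(ttl=3600)

query = "How does retrieval-augmented generation work?"
embedding = cache.get_or_compute(cache_key(query), lambda: embed(query))
```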

But the real game-changer wasn't the caching algorithm—it was what we chose to cache. Instead of caching individual embedding vectors or database results, we cached complete response chains.
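To make that concrete, here is a minimal sketch of what caching a response chain means in practice: the key comes from a normalized form of the query, and the cached value is the final assembled answer rather than intermediate embeddings or retrieved chunks. The `normalize_query` and `run_full_rag_pipeline` helpers are illustrative assumptions, not our exact code:

```python
import re

def normalize_query(text: str) -> str:
    # Collapse case and whitespace so near-duplicate queries share one cache entry
    return re.sub(r"\s+", " ", text.strip().lower())

def run_full_rag_pipeline(text: str) -> str:
    # Placeholder for the full chain: embed -> retrieve -> assemble prompt -> LLM
    return f"generated answer for: {text}"

response_cache = SmartCache(ttl=3600)

def answer_query(text: str) -> str:
    # Cache the end product of the whole chain, not the intermediate vectors
    key = normalize_query(text)
    return response_cache.get_or_compute(key, lambda: run_full_rag_pipeline(text))
```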

The Results

The system now handles 10x the traffic at 1/30th the cost. Response times dropped from 2+ seconds to under 200ms. Most importantly, we achieved this without sacrificing quality—our cache hit rate sits consistently above 95%.

Performance Metrics

  • 97% cost reduction

  • 90% faster response times

  • 95% cache hit rate

Chart: Cache Performance, Before vs After Optimization

Architecture Evolution

Before: Complex Microservices

User Query → Query Service → Vector DB → Embedding Service → LLM Service → Response Assembly → Cache Layer

After: Smart Monolith

User Query → Pattern Matcher → Cache Hit/Miss → [LLM if needed] → Response
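As a rough sketch of that flow, reusing `normalize_query`, `response_cache`, and `run_full_rag_pipeline` from the earlier sketch (the category keywords and `handle_query` entry point are assumptions for illustration):

```python
# Hypothetical pattern matcher: bucket a query into a coarse category
# before touching the cache or the LLM.
CATEGORY_KEYWORDS = {
    "code_help": ("how do i", "example", "snippet"),
    "debugging": ("error", "traceback", "not working"),
}

def match_pattern(text: str) -> str:
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return category
    return "general"

def handle_query(text: str) -> str:
    # Cache key combines category and normalized text; the LLM only runs on a miss
    key = f"{match_pattern(text)}:{normalize_query(text)}"
    return response_cache.get_or_compute(key, lambda: run_full_rag_pipeline(text))
```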

Cache Strategy Deep Dive

Digging further into the logs, we bucketed every query by intent. The distribution below is what convinced us that pre-computing answers for the hottest categories would pay off; a small warm-up sketch follows it.

Query Pattern Analysis

Query Distribution by Type

  • General Questions: 45%

  • Code Help: 25%

  • Debugging: 20%

  • Other: 10%
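Because general questions dominate, pre-computing answers for the most frequent queries before peak hours is cheap insurance. A hypothetical warm-up pass, reusing `handle_query` from the sketch above (in practice the list would come from query-log analytics):

```python
# Hypothetical warm-up pass: pre-compute answers for the most frequent queries
top_queries = [
    "what is retrieval-augmented generation?",
    "how do i fix a circular import in python?",
]

for q in top_queries:
    handle_query(q)  # populates the response cache ahead of peak traffic
```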

Cost Breakdown

Monthly API Costs (USD)

  • Before Optimization: $2,400

  • After Optimization: $75

Savings: $2,325/month (97% reduction) • ROI: 3,000% in first month

Chart: Monthly Infrastructure Costs

Lessons Learned

"Sometimes the most sophisticated engineering solution is knowing when to be simple. We spent months building a complex, beautiful system that solved the wrong problem."

The key insight wasn't technical—it was analytical. By deeply understanding our usage patterns before optimizing, we found a solution that was simultaneously simpler and more effective than our original architecture.

Key Takeaways

  • Measure first, optimize second: The data will often surprise you

  • Cache computed results, not raw data: Cache the expensive calculation, not the inputs

  • Monitor religiously: Set up alerts for cache performance metrics

  • The boring solution is often the right solution: Your CFO will love you for thinking about costs

This experience reinforced a fundamental principle: measure first, optimize second. The data will often surprise you, and the best engineering solution isn't always the most complex one.

Tags: performance, architecture, cost-optimization
