Monolith vs. Microservices
“We need to move to microservices. The monolith can’t scale.”
This was my tech lead’s opening line in our architecture review meeting. We had just crossed 100,000 users, and things were starting to break. Pages were loading slowly. Background jobs were backing up. The database was groaning under load.
I had heard this story before. “Monoliths don’t scale” is the rallying cry of every developer who’s read one too many Netflix engineering blogs. And honestly, I wanted to believe it. Microservices sounded exciting. Modern. The kind of thing you put on your resume.
But I had this nagging feeling we were about to make a very expensive mistake.
So I did something unpopular—I asked us to spend two weeks actually profiling our monolith to understand what was breaking. Not what we thought was breaking. What was actually breaking.
What we found surprised everyone. And it saved us from a six-month rewrite that would have solved exactly zero of our real problems.
The Breaking Point
Let me paint you a picture of our monolith at 100K users:
- Rails application running on 4 servers
- PostgreSQL database with 200GB of data
- Redis for caching and background jobs
- Response times creeping from 200ms to 2+ seconds
- Background job queue backing up to 6 hours
- Database connection pool constantly maxed out
- Random timeout errors 2-3 times per day
The conventional wisdom said: “Your monolith is too big. Break it into microservices.”
But here’s what nobody tells you—microservices don’t magically fix performance problems. They just distribute them.
What We Actually Found
I spent two weeks with our APM tools, database logs, and a lot of coffee. Here’s what was really breaking:
Problem 1: One Feature Was Eating 40% of Our Resources
We had a “similar items” recommendation feature. Every time someone viewed a product, we calculated similar items based on 15 different attributes using a complex scoring algorithm.
In memory.
For every page view.
# The innocent-looking code
def similar_items
  all_items = Item.where(category: self.category).limit(1000)

  all_items.map do |item|
    score = calculate_similarity_score(self, item)
    [item, score]
  end.sort_by { |_, score| -score }.first(10).map(&:first)
end

def calculate_similarity_score(item1, item2)
  # 15 different attribute comparisons
  # Vector calculations
  # String similarity algorithms
  # You get the idea - this was SLOW
end

This single method was:
- Loading 1,000 records from the database
- Running similarity calculations 1,000 times per page view
- Doing this for 40% of our page views
The “microservices” solution: Extract this to a recommendation service!
What we actually did: Cached the recommendations for 6 hours.
def similar_items
  Rails.cache.fetch("similar_items:#{id}", expires_in: 6.hours) do
    # Same calculation, but cached
    calculate_similar_items
  end
end
Result:
- Response times: 2,100ms → 340ms
- Database load: Down 35%
- Cost: 4 hours of engineering time
A microservice would have taken 3 weeks to build, introduced network latency, added deployment complexity, and solved nothing.
Problem 2: We Were Making API Calls in the Request Cycle
Our checkout process called a third-party payment verification API. It took 1.2 seconds on average. Sometimes 4+ seconds.
def create
  @order = Order.new(order_params)

  # This blocks the entire request
  payment_result = PaymentGateway.verify_payment(params[:payment_token])

  if payment_result.success?
    @order.save
    redirect_to thank_you_path
  else
    render :new
  end
end
The “microservices” solution: Extract payments to a separate service!
What we actually did: Made it asynchronous.
def create
  @order = Order.new(order_params)
  @order.status = 'pending'
  @order.save

  # Process in background
  PaymentVerificationJob.perform_later(@order.id, params[:payment_token])

  redirect_to processing_path
end

# Background job
class PaymentVerificationJob < ApplicationJob
  def perform(order_id, payment_token)
    order = Order.find(order_id)
    result = PaymentGateway.verify_payment(payment_token)

    if result.success?
      order.update(status: 'confirmed')
      OrderMailer.confirmation(order).deliver_later
    else
      order.update(status: 'failed')
      OrderMailer.payment_failed(order).deliver_later
    end
  end
end
Result:
- Checkout response time: 3,800ms → 420ms
- No more timeout errors
- Better user experience (the checkout responds instantly, and the processing page shows real-time status updates; see the sketch below)
- Cost: 2 days of work
Again, microservices would have added complexity without solving the actual problem—blocking I/O in the request cycle.
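For the real-time status piece, the processing page just polls a small status endpoint. Here’s a minimal sketch of what that can look like; the route, controller action, and JSON shape are illustrative assumptions, not our exact code:
# config/routes.rb (assumed route)
# get 'orders/:id/status', to: 'orders#status', as: :order_status

class OrdersController < ApplicationController
  # Polled by the processing page every couple of seconds via a bit of JavaScript
  def status
    order = current_user.orders.find(params[:id]) # current_user from whatever auth you use
    render json: { status: order.status }         # 'pending', 'confirmed', or 'failed'
  end
end
The page flips to a confirmation or failure message as soon as the background job updates the order.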
Problem 3: N+1 Queries Everywhere
This is the problem every Rails developer knows about but somehow still writes:
# In our admin dashboard
@orders = Order.includes(:user).limit(50)

# In the view
<% @orders.each do |order| %>
  <tr>
    <td><%= order.user.name %></td>
    <td><%= order.items.count %></td> <!-- N+1 here -->
    <td><%= order.total %></td>
    <td><%= order.shipping_address.city %></td> <!-- And here -->
  </tr>
<% end %>
This generated 150+ queries to render one admin page.
The “microservices” solution: None. Microservices don’t fix N+1 queries. They just make them harder to debug across network boundaries.
What we actually did: Fixed the eager loading.
@orders = Order.includes(:user, :items, :shipping_address).limit(50)
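One subtlety if you copy this fix: with includes(:items), the view also needs order.items.size rather than order.items.count, otherwise Rails still fires a COUNT query per row and the N+1 quietly comes back:
<td><%= order.items.size %></td> <!-- uses the preloaded records, no extra query -->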
Result:
- Admin dashboard load time: 4,200ms → 680ms
- 150 queries → 4 queries
- Cost: 1 hour
Problem 4: No Database Indexing Strategy
Our slowest queries looked like this:
SELECT * FROM orders
WHERE user_id = 123
AND status IN ('pending', 'processing')
ORDER BY created_at DESC;
-- Execution time: 2,400ms
Running an EXPLAIN showed a sequential scan across 2 million rows.
We had indexes on user_id and created_at separately, but not together. PostgreSQL couldn’t use them efficiently.
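If you want to reproduce this kind of check yourself, ActiveRecord will run EXPLAIN for you straight from the Rails console; a quick sketch (the ID is a placeholder):
# Prints the PostgreSQL query plan for this relation
Order.where(user_id: 123, status: %w[pending processing])
     .order(created_at: :desc)
     .explain
# Before the composite index, the plan showed a sequential scan over the orders table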
The “microservices” solution: Split the orders table across multiple databases!
What we actually did: Added composite indexes.
# In a migration
add_index :orders, [:user_id, :status, :created_at],
          name: 'index_orders_on_user_status_date'
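On a table with 2 million rows it’s also worth building the index without locking writes. A sketch of the safer variant, assuming PostgreSQL (the migration class name and Rails version are illustrative):
class AddUserStatusDateIndexToOrders < ActiveRecord::Migration[6.1]
  disable_ddl_transaction! # required for CREATE INDEX CONCURRENTLY

  def change
    add_index :orders, [:user_id, :status, :created_at],
              name: 'index_orders_on_user_status_date',
              algorithm: :concurrently
  end
end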
Result:
- Query time: 2,400ms → 12ms
- Database CPU: Down 40%
- Cost: 2 hours (including testing)
Problem 5: Background Jobs Processing Synchronously
Our image processing jobs were running in Sidekiq, but we configured them wrong:
class ImageProcessor
  include Sidekiq::Worker

  def perform(image_id)
    image = Image.find(image_id)

    # Processing multiple sizes synchronously
    process_thumbnail(image) # 2 seconds
    process_medium(image)    # 4 seconds
    process_large(image)     # 6 seconds
    process_watermark(image) # 3 seconds
    # Total: 15 seconds per job
  end
end
With 100K users uploading images, our job queue backed up for hours.
The “microservices” solution: Separate image processing service with its own infrastructure!
What we actually did: Parallel processing with proper job splitting.
class ImageProcessor
  include Sidekiq::Worker

  def perform(image_id)
    image = Image.find(image_id)

    # Split into parallel jobs
    ProcessThumbnailJob.perform_async(image_id)
    ProcessMediumJob.perform_async(image_id)
    ProcessLargeJob.perform_async(image_id)
    ProcessWatermarkJob.perform_async(image_id)
  end
end

class ProcessThumbnailJob
  include Sidekiq::Worker
  sidekiq_options queue: :image_processing

  def perform(image_id)
    image = Image.find(image_id)
    process_thumbnail(image)
  end
end
# Same pattern for other sizes
We also added more Sidekiq workers:
# Before: 2 workers
# After: 8 workers with dedicated queues
:queues:
  - [critical, 4]
  - [default, 2]
  - [image_processing, 8]
  - [low_priority, 1]
Result:
- Job processing time: 6 hours backlog → real-time processing
- Image uploads: 15 seconds → 2 seconds (user-facing time)
- Cost: 3 days of work + $120/month for additional worker servers
When Microservices Actually Made Sense
I’m not saying microservices are always wrong. After fixing these issues, we did extract one service—and for good reasons.
Our email sending was becoming a bottleneck:
- 200K+ emails per day
- Complex templating logic
- Multiple providers (SendGrid, Mailgun, AWS SES)
- Required maintaining provider health checks and failover
- Growing compliance requirements (GDPR, CAN-SPAM)
This was a clear bounded context that:
- Had minimal coupling to our core business logic
- Required different scaling characteristics
- Needed specialized monitoring and alerting
- Had a stable, well-defined interface
# Simple API contract
POST /api/v1/emails
{
  "template": "order_confirmation",
  "to": "user@example.com",
  "data": { "order_id": 12345 }
}
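On the monolith side, the call is just an HTTP POST. A minimal sketch using Ruby’s standard library; the EmailService class name and internal hostname are hypothetical:
require 'net/http'
require 'json'

class EmailService
  ENDPOINT = URI('https://email.internal.example.com/api/v1/emails') # hypothetical host

  def self.deliver(template:, to:, data: {})
    http = Net::HTTP.new(ENDPOINT.host, ENDPOINT.port)
    http.use_ssl = true
    request = Net::HTTP::Post.new(ENDPOINT.path, 'Content-Type' => 'application/json')
    request.body = { template: template, to: to, data: data }.to_json
    http.request(request)
  end
end

# Usage from a background job in the main app:
# EmailService.deliver(template: 'order_confirmation', to: user.email, data: { order_id: order.id })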
We extracted it to a separate Rails app. Deployment was independent. Scaling was independent. When email providers had issues, they didn’t affect our main application.
That’s when microservices make sense—not when you’re trying to fix performance problems.
What We Didn’t Do (That Everyone Suggested)
1. Rewrite in Go/Node/[Insert Hipster Language]
“Rails can’t handle 100K users!” Yes, it can. GitHub, Shopify, and Basecamp handle millions.
2. Switch to Kubernetes
We were running on 4 EC2 instances with Capistrano deployments. Kubernetes would have been operational overhead we didn’t need.
3. Implement Event Sourcing
Cool architecture pattern. Completely unnecessary for our use case.
4. Split the Database
Database splitting is hard. You lose transactions, joins become complex, and data consistency becomes your new nightmare. We didn’t need it.
5. Add a Message Queue Architecture
We already had Redis and Sidekiq. That was enough.
The Real Numbers After 6 Months
We took our “monolith that can’t scale” from 100K users to 350K users without breaking it apart.
Before optimization:
- Average response time: 2,100ms
- 95th percentile: 5,800ms
- Database connections: Constantly maxed at 100
- Background job lag: 6 hours
- Error rate: 0.8%
- Servers: 4 application, 1 database
- Monthly cost: $4,200
After optimization:
- Average response time: 280ms
- 95th percentile: 620ms
- Database connections: Averaging 35/100
- Background job lag: Real-time
- Error rate: 0.02%
- Servers: 6 application, 1 database, 1 email service
- Monthly cost: $4,800
We handled 3.5x more traffic with minimal cost increase and significantly better performance.
What Actually Matters at Scale
Here’s what I learned about scaling to 100K+ users:
1. Caching Strategy Beats Architecture
Most requests shouldn’t hit your database. Most calculations shouldn’t run every time.
We implemented caching at multiple levels:
- Browser caching (CDN, HTTP headers)
- Application caching (Rails.cache, Redis)
- Database query caching
- Fragment caching for expensive views
This had more impact than any architectural decision.
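To make those layers concrete, here’s a sketch of what two of them looked like in practice; the controller, model, and column names are illustrative rather than our exact code:
# HTTP/CDN caching: let the browser and CloudFront hold public pages briefly
class ProductsController < ApplicationController
  def show
    @product = Product.find(params[:id])
    expires_in 5.minutes, public: true # sets the Cache-Control header
  end
end

# Application caching: expensive lookups go through Rails.cache (Redis)
def top_sellers
  Rails.cache.fetch('top_sellers', expires_in: 1.hour) do
    Product.order(sales_count: :desc).limit(20).to_a
  end
end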
2. Database Optimization Beats Everything
Your database will be the bottleneck. Always.
- Add proper indexes
- Use EXPLAIN to understand query plans
- Implement connection pooling correctly
- Use read replicas for reporting
- Archive old data
We spent more time optimizing SQL than anything else, and it showed.
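The read-replica point, for example, is mostly configuration in modern Rails (6+). A sketch assuming a primary_replica entry exists in database.yml:
# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Route heavy reporting reads to the replica explicitly
ActiveRecord::Base.connected_to(role: :reading) do
  # read-only queries here, e.g. a monthly sales report
  Order.where(created_at: 1.month.ago..Time.current).count
end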
3. Async Processing Is Non-Negotiable
If it doesn’t need to happen in the request cycle, don’t do it in the request cycle.
- Email sending
- Image processing
- Report generation
- Third-party API calls
- Analytics tracking
Background jobs aren’t optional at scale.
4. Monitoring Shows You What Actually Matters
We used:
- New Relic for APM
- Datadog for infrastructure
- Sentry for error tracking
- Custom dashboards for business metrics
Without monitoring, we would have chased ghost problems based on assumptions.
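The custom business metrics were mostly lightweight instrumentation rather than anything fancy. A sketch using ActiveSupport::Notifications; the event name and the StatsD client are assumptions for illustration:
# Emit an event wherever an order is placed
ActiveSupport::Notifications.instrument('business.order_placed', total: order.total)

# Subscribe once (e.g. in an initializer) and forward to your metrics backend
ActiveSupport::Notifications.subscribe('business.order_placed') do |_name, _start, _finish, _id, payload|
  StatsD.increment('orders.placed')                   # assumes a StatsD-style client
  StatsD.gauge('orders.total_value', payload[:total])
end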
5. Premature Optimization Is Real
We almost spent 6 months building a microservices architecture to fix problems we didn’t understand. Instead, we spent 3 weeks fixing the actual problems.
The time to break apart your monolith is not when performance degrades. It’s when:
- Team coordination becomes impossible
- Deployment coupling causes problems
- Different parts need to scale independently
- Domain boundaries are clear and stable
The Microservices Trap
Here’s what nobody tells you about microservices:
Complexity doesn’t disappear—it moves.
Before:
- One codebase
- One deployment
- One database transaction
- Stack traces that make sense
After:
- N codebases to maintain
- N deployment pipelines
- Distributed transactions (good luck)
- Errors across network boundaries
- Service discovery
- API versioning between services
- Testing becomes exponentially harder
- Local development requires running 10+ services
All that complexity costs time and money. It’s an investment that only pays off if you actually need it.
When Should You Actually Split?
Based on our experience and watching other companies, here are the real triggers:
Organizational Boundaries
When you have 3+ teams all deploying to the same codebase and stepping on each other’s toes.
Different Scaling Characteristics
When one feature needs 50 servers while everything else needs 5.
Technology Constraints
When you genuinely need different tech stacks (rare, but it happens—video encoding, ML models, etc.).
Deployment Independence
When you’re deploying 20 times a day and every deploy affects the entire application.
Regulatory Requirements
When different parts of your system have different compliance needs.
Our Actual Architecture at 350K Users
Here’s what we’re running today:
Main Application (Rails Monolith):
- Core business logic
- User management
- Order processing
- Product catalog
- Admin interfaces
Email Service (Extracted):
- Template management
- Provider failover
- Bounce handling
- Compliance tracking
Infrastructure:
- 6 application servers behind ALB
- 1 primary database with 2 read replicas
- Redis cluster for caching and jobs
- CloudFront CDN
- S3 for assets and uploads
That’s it. No Kubernetes, no service mesh, no event bus, and no microservices for the sake of microservices.
Advice for Your 100K User Moment
When you hit scaling problems, resist the urge to rewrite. Instead:
Week 1: Measure
- Enable APM tooling
- Profile database queries
- Track slow endpoints
- Monitor background jobs
- Watch error rates
Week 2: Fix the Obvious
- Add caching
- Fix N+1 queries
- Add database indexes
- Move blocking I/O to background jobs
- Optimize your slowest endpoints
Week 3: Optimize
- Implement read replicas
- Add CDN for static assets
- Optimize asset delivery
- Tune your database
- Scale horizontally (add servers)
Week 4: Plan
- NOW you can decide if you need microservices
- With real data
- Based on actual problems
- Not blog posts from Netflix
Nine times out of ten, you’ll find that your “monolith that can’t scale” just needed some basic optimization.
In the End
We didn’t need microservices. We needed to fix our code.
Caching, indexing, async processing, and horizontal scaling took us from 100K users to 350K users. Our monolith is still running strong.
Could we benefit from microservices eventually? Maybe. When we have 20 engineers instead of 5, or when we’re deploying multiple times per day instead of twice per week. When we have real organizational boundaries that justify the complexity.
But not today. And probably not for you either.
The next time someone tells you “the monolith can’t scale,” ask them what they’ve actually measured. Chances are, they’re solving the wrong problem.


