Monolith vs Microservices: What Actually Broke at 100K Users

“We need to move to microservices. The monolith can’t scale.”

This was my tech lead’s opening line in our architecture review meeting. We had just crossed 100,000 users, and things were starting to break. Pages were loading slowly. Background jobs were backing up. The database was groaning under load.

I had heard this story before. “Monoliths don’t scale” is the rallying cry of every developer who’s read one too many Netflix engineering blogs. And honestly, I wanted to believe it. Microservices sounded exciting. Modern. The kind of thing you put on your resume.

But I had this nagging feeling we were about to make a very expensive mistake.

So I did something unpopular—I asked us to spend two weeks actually profiling our monolith to understand what was breaking. Not what we thought was breaking. What was actually breaking.

What we found surprised everyone. And it saved us from a six-month rewrite that would have solved exactly zero of our real problems.

The Breaking Point

Let me paint you a picture of our monolith at 100K users:

  • Rails application running on 4 servers
  • PostgreSQL database with 200GB of data
  • Redis for caching and background jobs
  • Response times creeping from 200ms to 2+ seconds
  • Background job queue backing up to 6 hours
  • Database connection pool constantly maxed out
  • Random timeout errors 2-3 times per day

The conventional wisdom said: “Your monolith is too big. Break it into microservices.”

But here’s what nobody tells you—microservices don’t magically fix performance problems. They just distribute them.

What We Actually Found

I spent two weeks with our APM tools, database logs, and a lot of coffee. Here’s what was really breaking:

Problem 1: One Feature Was Eating 40% of Our Resources

We had a “similar items” recommendation feature. Every time someone viewed a product, we calculated similar items based on 15 different attributes using a complex scoring algorithm.

In memory.

For every page view.

# The innocent-looking code
def similar_items
  all_items = Item.where(category: self.category).limit(1000)
  
  all_items.map do |item|
    score = calculate_similarity_score(self, item)
    [item, score]
  end.sort_by { |_, score| -score }.first(10).map(&:first)
end

def calculate_similarity_score(item1, item2)
  # 15 different attribute comparisons
  # Vector calculations
  # String similarity algorithms
  # You get the idea - this was SLOW
end

This single method was:

  • Loading 1,000 records from the database
  • Running similarity calculations 1,000 times per page view
  • Doing this for 40% of our page views

The “microservices” solution: Extract this to a recommendation service!

What we actually did: Cached the recommendations for 6 hours.

def similar_items
  Rails.cache.fetch("similar_items:#{id}", expires_in: 6.hours) do
    # Same calculation, but cached
    calculate_similar_items
  end
end
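
One detail the cached version needs that the snippet glosses over is invalidation: if an item is edited, its recommendations can stay stale for up to 6 hours. A minimal sketch of busting the cache on update (the callback is an assumption, not from the original code):

# Expire the cached recommendations whenever the item changes
class Item < ApplicationRecord
  after_commit :expire_similar_items_cache, on: :update

  private

  def expire_similar_items_cache
    Rails.cache.delete("similar_items:#{id}")
  end
end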

Result:

  • Response times: 2,100ms → 340ms
  • Database load: Down 35%
  • Cost: 4 hours of engineering time

A microservice would have taken 3 weeks to build, introduced network latency, added deployment complexity, and solved nothing.

Problem 2: We Were Making API Calls in the Request Cycle

Our checkout process called a third-party payment verification API. It took 1.2 seconds on average. Sometimes 4+ seconds.

def create
  @order = Order.new(order_params)
  
  # This blocks the entire request
  payment_result = PaymentGateway.verify_payment(params[:payment_token])
  
  if payment_result.success?
    @order.save
    redirect_to thank_you_path
  else
    render :new
  end
end

The “microservices” solution: Extract payments to a separate service!

What we actually did: Made it asynchronous.

def create
  @order = Order.new(order_params)
  @order.status = 'pending'
  @order.save
  
  # Process in background
  PaymentVerificationJob.perform_later(@order.id, params[:payment_token])
  
  redirect_to processing_path
end

# Background job
class PaymentVerificationJob < ApplicationJob
  def perform(order_id, payment_token)
    order = Order.find(order_id)
    result = PaymentGateway.verify_payment(payment_token)
    
    if result.success?
      order.update(status: 'confirmed')
      OrderMailer.confirmation(order).deliver_later
    else
      order.update(status: 'failed')
      OrderMailer.payment_failed(order).deliver_later
    end
  end
end

Result:

  • Checkout response time: 3,800ms → 420ms
  • No more timeout errors
  • Better user experience: an instant confirmation page with real-time status updates (sketched below)
  • Cost: 2 days of work
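
The real-time status piece can be as simple as a JSON endpoint the confirmation page polls every few seconds. A hedged sketch (controller name, route, and the current_user helper are assumptions):

# Polled by the "processing" page until status flips to confirmed/failed
class OrderStatusesController < ApplicationController
  def show
    order = current_user.orders.find(params[:id])
    render json: { status: order.status }
  end
end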

Again, microservices would have added complexity without solving the actual problem—blocking I/O in the request cycle.

Problem 3: N+1 Queries Everywhere

This is the problem every Rails developer knows about but somehow still writes:

# In our admin dashboard
@orders = Order.includes(:user).limit(50)

# In the view
<% @orders.each do |order| %>
  <tr>
    <td><%= order.user.name %></td>
    <td><%= order.items.count %></td>  <!-- N+1 here -->
    <td><%= order.total %></td>
    <td><%= order.shipping_address.city %></td>  <!-- And here -->
  </tr>
<% end %>

This generated 150+ queries to render one admin page.

The “microservices” solution: None. Microservices don’t fix N+1 queries. They just make them harder to debug across network boundaries.

What we actually did: Fixed the eager loading.

@orders = Order.includes(:user, :items, :shipping_address).limit(50)
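# In the view, use order.items.size instead of order.items.count:
# .size reuses the eager-loaded records, while .count still fires a COUNT query per row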

Result:

  • Admin dashboard load time: 4,200ms → 680ms
  • 150 queries → 4 queries
  • Cost: 1 hour

Problem 4: No Database Indexing Strategy

Our slowest queries looked like this:

SELECT * FROM orders 
WHERE user_id = 123 
AND status IN ('pending', 'processing') 
ORDER BY created_at DESC;

-- Execution time: 2,400ms

Running an EXPLAIN showed a sequential scan across 2 million rows.

We had indexes on user_id and created_at separately, but not together, so PostgreSQL couldn’t use them efficiently for this query. A single composite index on (user_id, status, created_at) lets it filter on user and status and return the rows already sorted by created_at.

The “microservices” solution: Split the orders table across multiple databases!

What we actually did: Added composite indexes.

# In a migration
add_index :orders, [:user_id, :status, :created_at], 
          name: 'index_orders_on_user_status_date'
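
A quick sanity check from the Rails console that the planner actually picks up the new index (exact plan output varies by PostgreSQL version):

puts Order.where(user_id: 123, status: %w[pending processing])
          .order(created_at: :desc)
          .explain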

Result:

  • Query time: 2,400ms → 12ms
  • Database CPU: Down 40%
  • Cost: 2 hours (including testing)

Problem 5: Background Jobs Processing Everything Serially

Our image processing jobs were running in Sidekiq, but we configured them wrong:

class ImageProcessor
  include Sidekiq::Worker
  
  def perform(image_id)
    image = Image.find(image_id)
    
    # Processing multiple sizes synchronously
    process_thumbnail(image)    # 2 seconds
    process_medium(image)        # 4 seconds
    process_large(image)         # 6 seconds
    process_watermark(image)     # 3 seconds
    
    # Total: 15 seconds per job
  end
end

With 100K users uploading images, our job queue backed up for hours.

The “microservices” solution: Separate image processing service with its own infrastructure!

What we actually did: Parallel processing with proper job splitting.

class ImageProcessor
  include Sidekiq::Worker
  
  def perform(image_id)
    image = Image.find(image_id)
    
    # Split into parallel jobs
    ProcessThumbnailJob.perform_async(image_id)
    ProcessMediumJob.perform_async(image_id)
    ProcessLargeJob.perform_async(image_id)
    ProcessWatermarkJob.perform_async(image_id)
  end
end

class ProcessThumbnailJob
  include Sidekiq::Worker
  sidekiq_options queue: :image_processing
  
  def perform(image_id)
    image = Image.find(image_id)
    process_thumbnail(image)
  end
end

# Same pattern for other sizes

We also added more Sidekiq workers:

# Before: 2 workers
# After: 8 workers with dedicated queues
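# e.g. config/sidekiq.yml: the numbers are relative queue priority weights, not worker counts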

:queues:
  - [critical, 4]
  - [default, 2]
  - [image_processing, 8]
  - [low_priority, 1]

Result:

  • Job processing time: 6 hours backlog → real-time processing
  • Image uploads: 15 seconds → 2 seconds (user-facing time)
  • Cost: 3 days of work + $120/month for additional worker servers

When Microservices Actually Made Sense

I’m not saying microservices are always wrong. After fixing these issues, we did extract one service—and for good reasons.

Our email sending was becoming a bottleneck:

  • 200K+ emails per day
  • Complex templating logic
  • Multiple providers (SendGrid, Mailgun, AWS SES)
  • Required maintaining provider health checks and failover
  • Growing compliance requirements (GDPR, CAN-SPAM)

This was a clear bounded context that:

  1. Had minimal coupling to our core business logic
  2. Required different scaling characteristics
  3. Needed specialized monitoring and alerting
  4. Had a stable, well-defined interface

# Simple API contract
POST /api/v1/emails
{
  "template": "order_confirmation",
  "to": "user@example.com",
  "data": { "order_id": 12345 }
}
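
A hedged sketch of how the monolith might call this endpoint (class name, env var, and HTTP client choice are assumptions, not from the post):

require "net/http"
require "json"

class EmailServiceClient
  # POSTs the template name, recipient, and data hash to the email service
  def self.deliver(template:, to:, data:)
    uri = URI("#{ENV.fetch('EMAIL_SERVICE_URL')}/api/v1/emails")
    Net::HTTP.post(uri, { template: template, to: to, data: data }.to_json,
                   "Content-Type" => "application/json")
  end
end

# Usage from the monolith:
EmailServiceClient.deliver(template: "order_confirmation",
                           to: "user@example.com",
                           data: { order_id: 12345 })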

We extracted it to a separate Rails app. Deployment was independent. Scaling was independent. When email providers had issues, they didn’t affect our main application.

That’s when microservices make sense—not when you’re trying to fix performance problems.

What We Didn’t Do (That Everyone Suggested)

1. Rewrite in Go/Node/[Insert Hipster Language]

“Rails can’t handle 100K users!” Yes, it can. GitHub, Shopify, and Basecamp handle millions.

2. Switch to Kubernetes

We were running on 4 EC2 instances with Capistrano deployments. Kubernetes would have been operational overhead we didn’t need.

3. Implement Event Sourcing

Cool architecture pattern. Completely unnecessary for our use case.

4. Split the Database

Database splitting is hard. You lose transactions, joins become complex, and data consistency becomes your new nightmare. We didn’t need it.

5. Add a Message Queue Architecture

We already had Redis and Sidekiq. That was enough.

The Real Numbers After 6 Months

We took our “monolith that can’t scale” from 100K users to 350K users without breaking it apart.

Before optimization:

  • Average response time: 2,100ms
  • 95th percentile: 5,800ms
  • Database connections: Constantly maxed at 100
  • Background job lag: 6 hours
  • Error rate: 0.8%
  • Servers: 4 application, 1 database
  • Monthly cost: $4,200

After optimization:

  • Average response time: 280ms
  • 95th percentile: 620ms
  • Database connections: Averaging 35/100
  • Background job lag: Real-time
  • Error rate: 0.02%
  • Servers: 6 application, 1 database, 1 email service
  • Monthly cost: $4,800

We handled 3.5x more traffic with minimal cost increase and significantly better performance.

What Actually Matters at Scale

Here’s what I learned about scaling to 100K+ users:

1. Caching Strategy Beats Architecture

Most requests shouldn’t hit your database. Most calculations shouldn’t run every time.

We implemented caching at multiple levels:

  • Browser caching (CDN, HTTP headers)
  • Application caching (Rails.cache, Redis)
  • Database query caching
  • Fragment caching for expensive views

This had more impact than any architectural decision.
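
For the browser/CDN layer, a minimal sketch (controller and timings are illustrative, not from the original app): Cache-Control headers let CloudFront and browsers serve repeat reads without touching Rails at all.

class ProductsController < ApplicationController
  def show
    @item = Item.find(params[:id])
    expires_in 10.minutes, public: true  # => Cache-Control: public, max-age=600
  end
end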

2. Database Optimization Beats Everything

Your database will be the bottleneck. Always.

  • Add proper indexes
  • Use EXPLAIN to understand query plans
  • Implement connection pooling correctly
  • Use read replicas for reporting
  • Archive old data

We spent more time optimizing SQL than anything else, and it showed.
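
The read-replica piece, sketched with Rails 6+ multiple-database support (role and database names are assumptions and have to match config/database.yml):

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Route a heavy reporting query to the replica explicitly
ActiveRecord::Base.connected_to(role: :reading) do
  Order.where(created_at: 1.month.ago..Time.current).sum(:total)
end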

3. Async Processing Is Non-Negotiable

If it doesn’t need to happen in the request cycle, don’t do it in the request cycle.

  • Email sending
  • Image processing
  • Report generation
  • Third-party API calls
  • Analytics tracking

Background jobs aren’t optional at scale.

4. Monitoring Shows You What Actually Matters

We used:

  • New Relic for APM
  • Datadog for infrastructure
  • Sentry for error tracking
  • Custom dashboards for business metrics

Without monitoring, we would have chased ghost problems based on assumptions.
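
One low-effort way to feed custom business-metric dashboards is plain ActiveSupport::Notifications events; a sketch (the event name and StatsD client are assumptions, not from the original setup):

# Subscriber, e.g. in an initializer
ActiveSupport::Notifications.subscribe("order.confirmed") do |_name, _start, _finish, _id, _payload|
  StatsD.increment("orders.confirmed")  # assumes a statsd client such as statsd-instrument
end

# Emitted inside PaymentVerificationJob after order.update(status: 'confirmed')
ActiveSupport::Notifications.instrument("order.confirmed", order_id: order_id)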

5. Premature Optimization Is Real

We almost spent 6 months building a microservices architecture to fix problems we didn’t understand. Instead, we spent 3 weeks fixing the actual problems.

The time to break apart your monolith is not when performance degrades. It’s when:

  • Team coordination becomes impossible
  • Deployment coupling causes problems
  • Different parts need to scale independently
  • Domain boundaries are clear and stable

The Microservices Trap

Here’s what nobody tells you about microservices:

Complexity doesn’t disappear—it moves.

Before:

  • One codebase
  • One deployment
  • One database transaction
  • Stack traces that make sense

After:

  • N codebases to maintain
  • N deployment pipelines
  • Distributed transactions (good luck)
  • Errors across network boundaries
  • Service discovery
  • API versioning between services
  • Testing becomes exponentially harder
  • Local development requires running 10+ services

All that complexity costs time and money. It’s an investment that only pays off if you actually need it.

When Should You Actually Split?

Based on our experience and watching other companies, here are the real triggers:

Organizational Boundaries

When you have 3+ teams all deploying to the same codebase and stepping on each other’s toes.

Different Scaling Characteristics

When one feature needs 50 servers while everything else needs 5.

Technology Constraints

When you genuinely need different tech stacks (rare, but it happens—video encoding, ML models, etc.).

Deployment Independence

When you’re deploying 20 times a day and every deploy affects the entire application.

Regulatory Requirements

When different parts of your system have different compliance needs.

Our Actual Architecture at 350K Users

Here’s what we’re running today:

Main Application (Rails Monolith):

  • Core business logic
  • User management
  • Order processing
  • Product catalog
  • Admin interfaces

Email Service (Extracted):

  • Template management
  • Provider failover
  • Bounce handling
  • Compliance tracking

Infrastructure:

  • 6 application servers behind ALB
  • 1 primary database with 2 read replicas
  • Redis cluster for caching and jobs
  • CloudFront CDN
  • S3 for assets and uploads

That’s it. No Kubernetes, no service mesh, no event bus, and no microservices for the sake of microservices.

Advice for Your 100K User Moment

When you hit scaling problems, resist the urge to rewrite. Instead:

Week 1: Measure

  • Enable APM tooling
  • Profile database queries
  • Track slow endpoints
  • Monitor background jobs
  • Watch error rates

Week 2: Fix the Obvious

  • Add caching
  • Fix N+1 queries
  • Add database indexes
  • Move blocking I/O to background jobs
  • Optimize your slowest endpoints
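
For the N+1 item above, one cheap guardrail is the bullet gem, which flags offenders during development; a sketch of its config (assumed, not from the original setup):

# config/environments/development.rb
config.after_initialize do
  Bullet.enable        = true
  Bullet.bullet_logger = true   # log N+1s to log/bullet.log
  Bullet.add_footer    = true   # show offenders directly in the browser
end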

Week 3: Optimize

  • Implement read replicas
  • Add CDN for static assets
  • Optimize asset delivery
  • Tune your database
  • Scale horizontally (add servers)

Week 4: Plan

  • NOW you can decide if you need microservices
  • With real data
  • Based on actual problems
  • Not blog posts from Netflix

Nine times out of ten, you’ll find that your “monolith that can’t scale” just needed some basic optimization.

In the End

We didn’t need microservices. We needed to fix our code.

Caching, indexing, async processing, and horizontal scaling took us from 100K users to 350K users. Our monolith is still running strong.

Could we benefit from microservices eventually? Maybe: when we have 20 engineers instead of 5, when we’re deploying multiple times per day instead of twice per week, and when we have real organizational boundaries that justify the complexity.

But not today. And probably not for you either.

The next time someone tells you “the monolith can’t scale,” ask them what they’ve actually measured. Chances are, they’re solving the wrong problem.

