Monolith vs Microservices: What Actually Broke at 100K Users

“We need to move to microservices. The monolith can’t scale.”

This was my tech lead’s opening line in our architecture review meeting. We had just crossed 100,000 users, and things were starting to break. Pages were loading slowly. Background jobs were backing up. The database was groaning under load.

I had heard this story before. “Monoliths don’t scale” is the rallying cry of every developer who’s read one too many Netflix engineering blogs. And honestly, I wanted to believe it. Microservices sounded exciting. Modern. The kind of thing you put on your resume.

But I had this nagging feeling we were about to make a very expensive mistake.

So I did something unpopular—I asked us to spend two weeks actually profiling our monolith to understand what was breaking. Not what we thought was breaking. What was actually breaking.

What we found surprised everyone. And it saved us from a six-month rewrite that would have solved exactly zero of our real problems.

The Breaking Point

Let me paint you a picture of our monolith at 100K users:

  • Rails application running on 4 servers
  • PostgreSQL database with 200GB of data
  • Redis for caching and background jobs
  • Response times creeping from 200ms to 2+ seconds
  • Background job queue backing up to 6 hours
  • Database connection pool constantly maxed out
  • Random timeout errors 2-3 times per day

The conventional wisdom said: “Your monolith is too big. Break it into microservices.”

But here’s what nobody tells you—microservices don’t magically fix performance problems. They just distribute them.

What We Actually Found

I spent two weeks with our APM tools, database logs, and a lot of coffee. Here’s what was really breaking:

Problem 1: One Feature Was Eating 40% of Our Resources

We had a “similar items” recommendation feature. Every time someone viewed a product, we calculated similar items based on 15 different attributes using a complex scoring algorithm.

In memory.

For every page view.

# The innocent-looking code
def similar_items
  all_items = Item.where(category: self.category).limit(1000)
  
  all_items.map do |item|
    score = calculate_similarity_score(self, item)
    [item, score]
  end.sort_by { |_, score| -score }.first(10).map(&:first)
end

def calculate_similarity_score(item1, item2)
  # 15 different attribute comparisons
  # Vector calculations
  # String similarity algorithms
  # You get the idea - this was SLOW
end

This single method was:

  • Loading 1,000 records from the database
  • Running similarity calculations 1,000 times per page view
  • Doing this for 40% of our page views

The “microservices” solution: Extract this to a recommendation service!

What we actually did: Cached the recommendations for 6 hours.

def similar_items
  Rails.cache.fetch("similar_items:#{id}", expires_in: 6.hours) do
    # Same calculation, but cached
    calculate_similar_items
  end
end
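
One detail the cached version needs that the snippet glosses over is invalidation: if an item is edited, its recommendations can stay stale for up to 6 hours. A minimal sketch of busting the cache on update (the callback is an assumption, not from the original code):

# Expire the cached recommendations whenever the item changes
class Item < ApplicationRecord
  after_commit :expire_similar_items_cache, on: :update

  private

  def expire_similar_items_cache
    Rails.cache.delete("similar_items:#{id}")
  end
end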

Result:

  • Response times: 2,100ms → 340ms
  • Database load: Down 35%
  • Cost: 4 hours of engineering time

A microservice would have taken 3 weeks to build, introduced network latency, added deployment complexity, and solved nothing.

Problem 2: We Were Making API Calls in the Request Cycle

Our checkout process called a third-party payment verification API. It took 1.2 seconds on average. Sometimes 4+ seconds.

def create
  @order = Order.new(order_params)
  
  # This blocks the entire request
  payment_result = PaymentGateway.verify_payment(params[:payment_token])
  
  if payment_result.success?
    @order.save
    redirect_to thank_you_path
  else
    render :new
  end
end

The “microservices” solution: Extract payments to a separate service!

What we actually did: Made it asynchronous.

def create
  @order = Order.new(order_params)
  @order.status = 'pending'
  @order.save
  
  # Process in background
  PaymentVerificationJob.perform_later(@order.id, params[:payment_token])
  
  redirect_to processing_path
end

# Background job
class PaymentVerificationJob < ApplicationJob
  def perform(order_id, payment_token)
    order = Order.find(order_id)
    result = PaymentGateway.verify_payment(payment_token)
    
    if result.success?
      order.update(status: 'confirmed')
      OrderMailer.confirmation(order).deliver_later
    else
      order.update(status: 'failed')
      OrderMailer.payment_failed(order).deliver_later
    end
  end
end

Result:

  • Checkout response time: 3,800ms → 420ms
  • No more timeout errors
  • Better user experience: an instant confirmation page with real-time status updates (sketched below)
  • Cost: 2 days of work
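
The real-time status piece can be as simple as a JSON endpoint the confirmation page polls every few seconds. A hedged sketch (controller name, route, and the current_user helper are assumptions):

# Polled by the "processing" page until status flips to confirmed/failed
class OrderStatusesController < ApplicationController
  def show
    order = current_user.orders.find(params[:id])
    render json: { status: order.status }
  end
end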

Again, microservices would have added complexity without solving the actual problem—blocking I/O in the request cycle.

Problem 3: N+1 Queries Everywhere

This is the problem every Rails developer knows about but somehow still writes:

# In our admin dashboard
@orders = Order.includes(:user).limit(50)

# In the view
<% @orders.each do |order| %>
  <tr>
    <td><%= order.user.name %></td>
    <td><%= order.items.count %></td>  <!-- N+1 here -->
    <td><%= order.total %></td>
    <td><%= order.shipping_address.city %></td>  <!-- And here -->
  </tr>
<% end %>

This generated 150+ queries to render one admin page.

The “microservices” solution: None. Microservices don’t fix N+1 queries. They just make them harder to debug across network boundaries.

What we actually did: Fixed the eager loading.

@orders = Order.includes(:user, :items, :shipping_address).limit(50)
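# In the view, use order.items.size instead of order.items.count:
# .size reuses the eager-loaded records, while .count still fires a COUNT query per row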

Result:

  • Admin dashboard load time: 4,200ms → 680ms
  • 150 queries → 4 queries
  • Cost: 1 hour

Problem 4: No Database Indexing Strategy

Our slowest queries looked like this:

SELECT * FROM orders 
WHERE user_id = 123 
AND status IN ('pending', 'processing') 
ORDER BY created_at DESC;

-- Execution time: 2,400ms

Running an EXPLAIN showed a sequential scan across 2 million rows.

We had indexes on user_id and created_at separately, but not together, so PostgreSQL couldn’t use them efficiently for this query. A single composite index on (user_id, status, created_at) lets it filter on user and status and return the rows already sorted by created_at.

The “microservices” solution: Split the orders table across multiple databases!

What we actually did: Added composite indexes.

# In a migration
add_index :orders, [:user_id, :status, :created_at], 
          name: 'index_orders_on_user_status_date'
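
A quick sanity check from the Rails console that the planner actually picks up the new index (exact plan output varies by PostgreSQL version):

puts Order.where(user_id: 123, status: %w[pending processing])
          .order(created_at: :desc)
          .explain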

Result:

  • Query time: 2,400ms → 12ms
  • Database CPU: Down 40%
  • Cost: 2 hours (including testing)

Problem 5: Background Jobs Processing Everything Serially

Our image processing jobs were running in Sidekiq, but we configured them wrong:

class ImageProcessor
  include Sidekiq::Worker
  
  def perform(image_id)
    image = Image.find(image_id)
    
    # Processing multiple sizes synchronously
    process_thumbnail(image)    # 2 seconds
    process_medium(image)        # 4 seconds
    process_large(image)         # 6 seconds
    process_watermark(image)     # 3 seconds
    
    # Total: 15 seconds per job
  end
end

With 100K users uploading images, our job queue backed up for hours.

The “microservices” solution: Separate image processing service with its own infrastructure!

What we actually did: Parallel processing with proper job splitting.

class ImageProcessor
  include Sidekiq::Worker
  
  def perform(image_id)
    image = Image.find(image_id)
    
    # Split into parallel jobs
    ProcessThumbnailJob.perform_async(image_id)
    ProcessMediumJob.perform_async(image_id)
    ProcessLargeJob.perform_async(image_id)
    ProcessWatermarkJob.perform_async(image_id)
  end
end

class ProcessThumbnailJob
  include Sidekiq::Worker
  sidekiq_options queue: :image_processing
  
  def perform(image_id)
    image = Image.find(image_id)
    process_thumbnail(image)
  end
end

# Same pattern for other sizes

We also added more Sidekiq workers:

# Before: 2 workers
# After: 8 workers with dedicated queues
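# e.g. config/sidekiq.yml: the numbers are relative queue priority weights, not worker counts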

:queues:
  - [critical, 4]
  - [default, 2]
  - [image_processing, 8]
  - [low_priority, 1]

Result:

  • Job processing time: 6 hours backlog → real-time processing
  • Image uploads: 15 seconds → 2 seconds (user-facing time)
  • Cost: 3 days of work + $120/month for additional worker servers

When Microservices Actually Made Sense

I’m not saying microservices are always wrong. After fixing these issues, we did extract one service—and for good reasons.

Our email sending was becoming a bottleneck:

  • 200K+ emails per day
  • Complex templating logic
  • Multiple providers (SendGrid, Mailgun, AWS SES)
  • Required maintaining provider health checks and failover
  • Growing compliance requirements (GDPR, CAN-SPAM)

This was a clear bounded context that:

  1. Had minimal coupling to our core business logic
  2. Required different scaling characteristics
  3. Needed specialized monitoring and alerting
  4. Had a stable, well-defined interface

# Simple API contract
POST /api/v1/emails
{
  "template": "order_confirmation",
  "to": "user@example.com",
  "data": { "order_id": 12345 }
}
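
A hedged sketch of how the monolith might call this endpoint (class name, env var, and HTTP client choice are assumptions, not from the post):

require "net/http"
require "json"

class EmailServiceClient
  # POSTs the template name, recipient, and data hash to the email service
  def self.deliver(template:, to:, data:)
    uri = URI("#{ENV.fetch('EMAIL_SERVICE_URL')}/api/v1/emails")
    Net::HTTP.post(uri, { template: template, to: to, data: data }.to_json,
                   "Content-Type" => "application/json")
  end
end

# Usage from the monolith:
EmailServiceClient.deliver(template: "order_confirmation",
                           to: "user@example.com",
                           data: { order_id: 12345 })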

We extracted it to a separate Rails app. Deployment was independent. Scaling was independent. When email providers had issues, they didn’t affect our main application.

That’s when microservices make sense—not when you’re trying to fix performance problems.

What We Didn’t Do (That Everyone Suggested)

1. Rewrite in Go/Node/[Insert Hipster Language]

“Rails can’t handle 100K users!” Yes, it can. GitHub, Shopify, and Basecamp handle millions.

2. Switch to Kubernetes

We were running on 4 EC2 instances with Capistrano deployments. Kubernetes would have been operational overhead we didn’t need.

3. Implement Event Sourcing

Cool architecture pattern. Completely unnecessary for our use case.

4. Split the Database

Database splitting is hard. You lose transactions, joins become complex, and data consistency becomes your new nightmare. We didn’t need it.

5. Add a Message Queue Architecture

We already had Redis and Sidekiq. That was enough.

The Real Numbers After 6 Months

We took our “monolith that can’t scale” from 100K users to 350K users without breaking it apart.

Before optimization:

  • Average response time: 2,100ms
  • 95th percentile: 5,800ms
  • Database connections: Constantly maxed at 100
  • Background job lag: 6 hours
  • Error rate: 0.8%
  • Servers: 4 application, 1 database
  • Monthly cost: $4,200

After optimization:

  • Average response time: 280ms
  • 95th percentile: 620ms
  • Database connections: Averaging 35/100
  • Background job lag: Real-time
  • Error rate: 0.02%
  • Servers: 6 application, 1 database, 1 email service
  • Monthly cost: $4,800

We handled 3.5x more traffic with minimal cost increase and significantly better performance.

What Actually Matters at Scale

Here’s what I learned about scaling to 100K+ users:

1. Caching Strategy Beats Architecture

Most requests shouldn’t hit your database. Most calculations shouldn’t run every time.

We implemented caching at multiple levels:

  • Browser caching (CDN, HTTP headers)
  • Application caching (Rails.cache, Redis)
  • Database query caching
  • Fragment caching for expensive views

This had more impact than any architectural decision.
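
For the browser/CDN layer, a minimal sketch (controller and timings are illustrative, not from the original app): Cache-Control headers let CloudFront and browsers serve repeat reads without touching Rails at all.

class ProductsController < ApplicationController
  def show
    @item = Item.find(params[:id])
    expires_in 10.minutes, public: true  # => Cache-Control: public, max-age=600
  end
end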

2. Database Optimization Beats Everything

Your database will be the bottleneck. Always.

  • Add proper indexes
  • Use EXPLAIN to understand query plans
  • Implement connection pooling correctly
  • Use read replicas for reporting
  • Archive old data

We spent more time optimizing SQL than anything else, and it showed.
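
The read-replica piece, sketched with Rails 6+ multiple-database support (role and database names are assumptions and have to match config/database.yml):

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Route a heavy reporting query to the replica explicitly
ActiveRecord::Base.connected_to(role: :reading) do
  Order.where(created_at: 1.month.ago..Time.current).sum(:total)
end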

3. Async Processing Is Non-Negotiable

If it doesn’t need to happen in the request cycle, don’t do it in the request cycle.

  • Email sending
  • Image processing
  • Report generation
  • Third-party API calls
  • Analytics tracking

Background jobs aren’t optional at scale.

4. Monitoring Shows You What Actually Matters

We used:

  • New Relic for APM
  • Datadog for infrastructure
  • Sentry for error tracking
  • Custom dashboards for business metrics

Without monitoring, we would have chased ghost problems based on assumptions.
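
One low-effort way to feed custom business-metric dashboards is plain ActiveSupport::Notifications events; a sketch (the event name and StatsD client are assumptions, not from the original setup):

# Subscriber, e.g. in an initializer
ActiveSupport::Notifications.subscribe("order.confirmed") do |_name, _start, _finish, _id, _payload|
  StatsD.increment("orders.confirmed")  # assumes a statsd client such as statsd-instrument
end

# Emitted inside PaymentVerificationJob after order.update(status: 'confirmed')
ActiveSupport::Notifications.instrument("order.confirmed", order_id: order_id)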

5. Premature Optimization Is Real

We almost spent 6 months building a microservices architecture to fix problems we didn’t understand. Instead, we spent 3 weeks fixing the actual problems.

The time to break apart your monolith is not when performance degrades. It’s when:

  • Team coordination becomes impossible
  • Deployment coupling causes problems
  • Different parts need to scale independently
  • Domain boundaries are clear and stable

The Microservices Trap

Here’s what nobody tells you about microservices:

Complexity doesn’t disappear—it moves.

Before:

  • One codebase
  • One deployment
  • One database transaction
  • Stack traces that make sense

After:

  • N codebases to maintain
  • N deployment pipelines
  • Distributed transactions (good luck)
  • Errors across network boundaries
  • Service discovery
  • API versioning between services
  • Testing becomes exponentially harder
  • Local development requires running 10+ services

All that complexity costs time and money. It’s an investment that only pays off if you actually need it.

When Should You Actually Split?

Based on our experience and watching other companies, here are the real triggers:

Organizational Boundaries

When you have 3+ teams all deploying to the same codebase and stepping on each other’s toes.

Different Scaling Characteristics

When one feature needs 50 servers while everything else needs 5.

Technology Constraints

When you genuinely need different tech stacks (rare, but it happens—video encoding, ML models, etc.).

Deployment Independence

When you’re deploying 20 times a day and every deploy affects the entire application.

Regulatory Requirements

When different parts of your system have different compliance needs.

Our Actual Architecture at 350K Users

Here’s what we’re running today:

Main Application (Rails Monolith):

  • Core business logic
  • User management
  • Order processing
  • Product catalog
  • Admin interfaces

Email Service (Extracted):

  • Template management
  • Provider failover
  • Bounce handling
  • Compliance tracking

Infrastructure:

  • 6 application servers behind ALB
  • 1 primary database with 2 read replicas
  • Redis cluster for caching and jobs
  • CloudFront CDN
  • S3 for assets and uploads

That’s it. No Kubernetes, no service mesh, no event bus, and no microservices for the sake of microservices.

Advice for Your 100K User Moment

When you hit scaling problems, resist the urge to rewrite. Instead:

Week 1: Measure

  • Enable APM tooling
  • Profile database queries
  • Track slow endpoints
  • Monitor background jobs
  • Watch error rates

Week 2: Fix the Obvious

  • Add caching
  • Fix N+1 queries
  • Add database indexes
  • Move blocking I/O to background jobs
  • Optimize your slowest endpoints
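
For the N+1 item above, one cheap guardrail is the bullet gem, which flags offenders during development; a sketch of its config (assumed, not from the original setup):

# config/environments/development.rb
config.after_initialize do
  Bullet.enable        = true
  Bullet.bullet_logger = true   # log N+1s to log/bullet.log
  Bullet.add_footer    = true   # show offenders directly in the browser
end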

Week 3: Optimize

  • Implement read replicas
  • Add CDN for static assets
  • Optimize asset delivery
  • Tune your database
  • Scale horizontally (add servers)

Week 4: Plan

  • NOW you can decide if you need microservices
  • With real data
  • Based on actual problems
  • Not blog posts from Netflix

Nine times out of ten, you’ll find that your “monolith that can’t scale” just needed some basic optimization.

In the End

We didn’t need microservices. We needed to fix our code.

Caching, indexing, async processing, and horizontal scaling took us from 100K users to 350K users. Our monolith is still running strong.

Could we benefit from microservices eventually? Maybe: when we have 20 engineers instead of 5, when we’re deploying multiple times per day instead of twice per week, and when we have real organizational boundaries that justify the complexity.

But not today. And probably not for you either.

The next time someone tells you “the monolith can’t scale,” ask them what they’ve actually measured. Chances are, they’re solving the wrong problem.

