We Cut Our AWS Bill by 42% Using Only Configuration Changes

I was in a meeting with our CFO when she slid a printed AWS bill across the table. “$47,000 for last month. That’s up 18% from last quarter. What are we paying for?”

I did what every engineer does when confronted with money questions—I got defensive. “We’re growing. More users means more infrastructure. That’s just how it works.”

She wasn’t buying it. “Your user base grew 12%, but costs grew 18%. Something doesn’t add up.”

She was right. I spent the next two weeks diving into our AWS console, and what I found was embarrassing. We weren’t paying for growth—we were paying for laziness, defaults, and things we’d set up two years ago and never touched again.

Three weeks later, our bill was $27,000. Same infrastructure, same performance, same uptime. We just stopped being stupid with our money.

Here’s exactly what we changed.

Discovery 1: We Were Paying for Servers That Were Doing Nothing

I started with AWS Cost Explorer and filtered by service. EC2 was eating 62% of our budget—$29,000 per month. That seemed high for our traffic.

I pulled up our EC2 dashboard and started clicking through instances. That’s when I saw it:

Instance: api-worker-04

  • Status: Running
  • Type: c5.2xlarge ($0.34/hour = $245/month)
  • CPU Utilization (30 days): 2.3%
  • Network In/Out: Basically zero
  • Last deployment: 8 months ago

This server was doing absolutely nothing. It was running because someone spun it up during a load test last year and forgot to shut it down. $245/month for eight months is $1,960 down the drain.

I found eleven more just like it.

Some were old staging environments. Some were testing boxes that never got cleaned up. One was literally called “temp-test-delete-me” and had been running for six months.

Action taken:

# Tagged all instances with purpose and owner
aws ec2 create-tags --resources i-1234567890abcdef0 \
    --tags Key=Purpose,Value=production \
           Key=Owner,Value=devops-team \
           Key=Project,Value=api-backend

Then I wrote a simple Lambda function that ran weekly:

import boto3
from datetime import datetime, timedelta

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    cloudwatch = boto3.client('cloudwatch')
    
    instances = ec2.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )
    
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            
            # Check CPU utilization over last 7 days
            stats = cloudwatch.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                StartTime=datetime.now() - timedelta(days=7),
                EndTime=datetime.now(),
                Period=3600,
                Statistics=['Average']
            )
            
            avg_cpu = sum(s['Average'] for s in stats['Datapoints']) / len(stats['Datapoints']) if stats['Datapoints'] else 0
            
            # Alert if CPU < 5% for a week
            if avg_cpu < 5:
                print(f"Low utilization alert: {instance_id} - {avg_cpu}% CPU")
                # Send to Slack, email, etc.
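
Running it weekly is just an EventBridge schedule pointed at the function. A rough sketch (the function name and the account/region in the ARN are placeholders, not our real ones):

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Placeholder ARN -- substitute your own account, region, and function name
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:audit-idle-instances'

# Fire every Monday at 08:00 UTC
rule = events.put_rule(
    Name='weekly-idle-instance-audit',
    ScheduleExpression='cron(0 8 ? * MON *)',
    State='ENABLED',
)

# Point the rule at the Lambda function
events.put_targets(
    Rule='weekly-idle-instance-audit',
    Targets=[{'Id': 'idle-audit-lambda', 'Arn': function_arn}],
)

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName='audit-idle-instances',
    StatementId='allow-eventbridge-weekly-audit',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'],
)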

Savings: $3,200/month from terminating zombie instances.

Discovery 2: Our RDS Instances Were Hilariously Oversized

Next stop: RDS. Our primary database was running on a db.r5.4xlarge (16 vCPUs, 128 GB RAM) costing us $1,872/month.

I checked CloudWatch metrics:

  • CPU Utilization: Peak 18%, average 8%
  • Memory: Using 42 GB out of 128 GB
  • IOPS: Not even close to the provisioned limit

We were paying for a Ferrari but driving it like a Honda Civic.

What we did:

  1. Enabled Enhanced Monitoring (should have done this from day one)
  2. Watched metrics for two weeks during peak traffic
  3. Realized we could comfortably run on db.r5.xlarge (4 vCPUs, 32 GB RAM)
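
Watching those metrics doesn't have to be manual. Here's a rough sketch of the kind of CloudWatch check you can script before committing to the smaller class (the 14-day window and the two metrics are our judgment calls; adapt them to your workload):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

def rds_stat(metric, stat, reduce_fn, days=14):
    """Aggregate an RDS CloudWatch metric for production-db over the last `days` days."""
    resp = cloudwatch.get_metric_statistics(
        Namespace='AWS/RDS',
        MetricName=metric,
        Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': 'production-db'}],
        StartTime=datetime.utcnow() - timedelta(days=days),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=[stat],
    )
    points = [p[stat] for p in resp['Datapoints']]
    return reduce_fn(points) if points else None

# Worst-case CPU and worst-case free memory during the observation window
peak_cpu = rds_stat('CPUUtilization', 'Maximum', max)   # percent
min_free = rds_stat('FreeableMemory', 'Minimum', min)   # bytes

print(f"Peak CPU over 14 days: {peak_cpu:.1f}%")
print(f"Lowest freeable memory: {min_free / 1024**3:.1f} GB")

# db.r5.xlarge has 32 GB of RAM, so we wanted comfortable headroom on both numbers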

The downgrade process was scary but smooth:

# Create a snapshot first (always)
aws rds create-db-snapshot \
    --db-instance-identifier production-db \
    --db-snapshot-identifier pre-downsize-snapshot-2024

# Modify instance class
aws rds modify-db-instance \
    --db-instance-identifier production-db \
    --db-instance-class db.r5.xlarge \
    --apply-immediately

We scheduled it during a low-traffic window at 3 AM on a Tuesday. The instance was down for about 8 minutes during the resize.

Result: From $1,872/month to $468/month. Savings: $1,404/month

And performance didn’t change at all. Turns out we never needed that much horsepower.

Discovery 3: We Were Paying for Storage We Weren’t Using

S3 was costing us $6,200/month. For file storage. That seemed insane.

I dug into the bucket analytics:

aws s3api list-buckets --query "Buckets[].Name" --output text | \
    tr '\t' '\n' | \
    xargs -I {} aws s3api get-bucket-location --bucket {}

What I found:

  • 2.4 TB of user uploads (totally fine)
  • 8.7 TB of application logs from 2022-2023 (what?!)
  • 1.9 TB of database backups older than 6 months
  • 670 GB of temporary files that were never cleaned up

We had logs from two years ago sitting in S3 Standard storage at $0.023 per GB/month.
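
If you'd rather script the digging than click through the console, per-bucket size by storage class is published daily as the CloudWatch BucketSizeBytes metric. A rough sketch that reports how much each bucket keeps in S3 Standard:

import boto3
from datetime import datetime, timedelta

s3 = boto3.client('s3')
cloudwatch = boto3.client('cloudwatch')

# BucketSizeBytes is reported once a day, so look back a couple of days.
# Note: S3 storage metrics are published in each bucket's home region.
start = datetime.utcnow() - timedelta(days=2)
end = datetime.utcnow()

for bucket in s3.list_buckets()['Buckets']:
    resp = cloudwatch.get_metric_statistics(
        Namespace='AWS/S3',
        MetricName='BucketSizeBytes',
        Dimensions=[
            {'Name': 'BucketName', 'Value': bucket['Name']},
            {'Name': 'StorageType', 'Value': 'StandardStorage'},
        ],
        StartTime=start,
        EndTime=end,
        Period=86400,
        Statistics=['Average'],
    )
    if resp['Datapoints']:
        size_gb = resp['Datapoints'][-1]['Average'] / 1024**3
        print(f"{bucket['Name']}: {size_gb:,.1f} GB in S3 Standard")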

The fix:

Created lifecycle policies for every bucket:

{
  "Rules": [
    {
      "Id": "Move old logs to Glacier",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      },
      "Filter": {
        "Prefix": "logs/"
      }
    },
    {
      "Id": "Delete temp files",
      "Status": "Enabled",
      "Expiration": {
        "Days": 7
      },
      "Filter": {
        "Prefix": "temp/"
      }
    },
    {
      "Id": "Intelligent tiering for uploads",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "INTELLIGENT_TIERING"
        }
      ],
      "Filter": {
        "Prefix": "uploads/"
      }
    }
  ]
}

Applied via CLI:

aws s3api put-bucket-lifecycle-configuration \
    --bucket my-app-storage \
    --lifecycle-configuration file://lifecycle-policy.json

This automatically:

  • Moved old logs to Glacier ($0.004/GB vs $0.023/GB)
  • Deleted logs older than 1 year
  • Moved infrequently accessed uploads to cheaper storage
  • Cleaned up temp files after 7 days

Savings: $4,100/month in S3 costs alone.

Discovery 4: Our Load Balancers Were More Expensive Than Our Servers

We had five Application Load Balancers running:

  • Production API: $16.20/month + $0.008/LCU-hour
  • Staging API: $16.20/month + minimal traffic
  • Development API: $16.20/month + minimal traffic
  • Admin dashboard: $16.20/month + barely used
  • Legacy service (we forgot about): $16.20/month + zero traffic

Each ALB costs a base $16.20/month plus usage. But here's the kicker: rule evaluations count toward your LCU charges, so a bloated rule set isn't free either.

Our production ALB had 47 rules. We needed maybe 10.

What we did:

  1. Consolidated staging and dev environments behind a single ALB using host-based routing (there's a boto3 sketch of the rules after this list):
# Before: 2 ALBs
- ALB 1: staging-api.example.com
- ALB 2: dev-api.example.com

# After: 1 ALB with rules
- ALB: shared-nonprod.example.com
  Rules:
    - Host: staging-api.example.com → Target Group A
    - Host: dev-api.example.com → Target Group B
  2. Moved the admin dashboard (5 requests/minute) to the main ALB
  3. Deleted the legacy ALB that wasn't serving any traffic
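
Step 1 boils down to listener rules. A sketch of what the host-based rules look like in boto3 (the listener and target group ARNs are placeholders for whatever your shared ALB uses):

import boto3

elbv2 = boto3.client('elbv2')

# Placeholder ARNs -- substitute the shared non-prod ALB's listener and your target groups
listener_arn = 'arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/shared-nonprod/abc/def'
routes = {
    'staging-api.example.com': 'arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/staging-api/111',
    'dev-api.example.com': 'arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/dev-api/222',
}

# One rule per hostname: match the Host header, forward to that environment's target group
for priority, (hostname, target_group_arn) in enumerate(routes.items(), start=10):
    elbv2.create_rule(
        ListenerArn=listener_arn,
        Priority=priority,
        Conditions=[{
            'Field': 'host-header',
            'HostHeaderConfig': {'Values': [hostname]},
        }],
        Actions=[{
            'Type': 'forward',
            'TargetGroupArn': target_group_arn,
        }],
    )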

Savings: $800/month in ALB base costs and reduced LCU charges.

Discovery 5: Reserved Instances for Predictable Workloads

This one was painful because we should have done it years ago.

Our production servers ran 24/7/365. We were paying on-demand pricing—the most expensive option—because we “wanted flexibility.”

That’s like renting a car daily for three years instead of just buying one.

I pulled a utilization report for the last 12 months:

  • 8 EC2 instances ran constantly: t3.xlarge
  • 2 RDS instances ran constantly: db.r5.xlarge
  • 3 ElastiCache nodes ran constantly: cache.r5.large

These never changed. They never scaled down. They just… ran.

Action taken:

Purchased 1-year Reserved Instances with partial upfront payment:

# Example for EC2
aws ec2 purchase-reserved-instances-offering \
    --reserved-instances-offering-id <offering-id> \
    --instance-count 8

For RDS:

aws rds purchase-reserved-db-instances-offering \
    --reserved-db-instances-offering-id <offering-id> \
    --db-instance-count 2
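
The <offering-id> values come from the offerings API. A sketch of how to look up the 1-year partial-upfront offering for t3.xlarge (the platform and tenancy filters are assumptions about our setup; adjust them for yours):

import boto3

ec2 = boto3.client('ec2')

# 1-year term (31,536,000 seconds), partial upfront, standard class, Linux, shared tenancy
resp = ec2.describe_reserved_instances_offerings(
    InstanceType='t3.xlarge',
    ProductDescription='Linux/UNIX',
    OfferingType='Partial Upfront',
    OfferingClass='standard',
    InstanceTenancy='default',
    MinDuration=31536000,
    MaxDuration=31536000,
)

for offering in resp['ReservedInstancesOfferings']:
    print(
        offering['ReservedInstancesOfferingId'],
        offering['FixedPrice'],        # upfront portion
        offering['RecurringCharges'],  # hourly portion
    )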

Pricing comparison:

  • t3.xlarge on-demand: $0.1664/hour ≈ $1,461/year per instance
  • t3.xlarge reserved (1-yr, partial upfront): $0.0984/hour ≈ $864/year per instance

Savings: $7,164/month across EC2, RDS, and ElastiCache.

Yes, that’s $7,000 per month for clicking a few buttons.

Discovery 6: Auto Scaling Groups That Never Scaled

Our ASGs were configured with:

Min: 4 instances
Max: 20 instances
Desired: 4 instances

In two years, they had scaled to 5 instances exactly twice—during a product launch—and immediately scaled back down.

We were maintaining the infrastructure for scaling we never used:

  • Multiple target groups
  • Complex scaling policies
  • CloudWatch alarms
  • SNS topics for notifications

What we changed:

  1. Reduced max capacity to 8 (still plenty of headroom)
  2. Simplified scaling policies to only CPU-based triggers
  3. Deleted unused CloudWatch alarms ($0.10/alarm/month adds up)
  4. Combined similar ASGs where possible

We had three separate ASGs for the same application because “microservices.” They could have been one ASG with different user data.
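
Both of the first two changes are a couple of API calls. A rough boto3 sketch, with the ASG name and the 60% CPU target as placeholders for whatever fits your workload:

import boto3

autoscaling = boto3.client('autoscaling')

ASG_NAME = 'api-backend-asg'  # placeholder name

# Cap the headroom we were never using
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName=ASG_NAME,
    MinSize=4,
    MaxSize=8,
    DesiredCapacity=4,
)

# Replace the pile of step policies with one CPU target-tracking policy
autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName='cpu-target-tracking',
    PolicyType='TargetTrackingScaling',
    TargetTrackingConfiguration={
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'ASGAverageCPUUtilization',
        },
        'TargetValue': 60.0,
    },
)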

Savings: $280/month in reduced overhead and smaller max capacity.

Discovery 7: NAT Gateway Bills Were Out of Control

Our NAT Gateway was costing $387/month. For what? Private subnets to access the internet.

But here’s the thing: most of our private subnet traffic was going to other AWS services, mainly S3 and DynamoDB.

NAT Gateways charge $0.045 per GB of data processed. We were processing 7.2 TB/month through them.

The fix:

Set up VPC endpoints for AWS services:

# S3 Gateway Endpoint (free!)
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-12345678 \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-12345678

# DynamoDB Gateway Endpoint (also free!)
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-12345678 \
    --service-name com.amazonaws.us-east-1.dynamodb \
    --route-table-ids rtb-12345678

This routes traffic to S3 and DynamoDB through AWS’s internal network instead of through the NAT Gateway.

We also found that several services were pulling Docker images from Docker Hub through the NAT Gateway. We moved to Amazon ECR and used VPC endpoints for that too.
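
ECR needs interface endpoints rather than the free gateway kind (plus the S3 gateway endpoint above, since image layers are served from S3), and interface endpoints carry a small hourly charge per AZ, so weigh that against the NAT savings. A sketch with placeholder subnet and security group IDs:

import boto3

ec2 = boto3.client('ec2')

# Placeholders -- use your private subnets and a security group allowing HTTPS from them
vpc_id = 'vpc-12345678'
subnet_ids = ['subnet-aaaa1111', 'subnet-bbbb2222']
security_group_ids = ['sg-12345678']

# ECR needs both the API endpoint and the Docker registry endpoint
for service in ('com.amazonaws.us-east-1.ecr.api', 'com.amazonaws.us-east-1.ecr.dkr'):
    ec2.create_vpc_endpoint(
        VpcEndpointType='Interface',
        VpcId=vpc_id,
        ServiceName=service,
        SubnetIds=subnet_ids,
        SecurityGroupIds=security_group_ids,
        PrivateDnsEnabled=True,
    )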

Result: Data through NAT Gateway dropped from 7.2 TB to 1.1 TB.

Savings: $274/month

Discovery 8: CloudWatch Logs Were Killing Us

We had verbose logging enabled everywhere. Every API request, every query, every function call—logged.

CloudWatch Logs charges $0.50 per GB ingested and $0.03 per GB stored. We were ingesting 890 GB per month.

That’s $445/month for logs we never looked at.

What we did:

  1. Reduced log levels in production (INFO instead of DEBUG)
  2. Set retention periods on all log groups:
# List all log groups without retention
aws logs describe-log-groups \
    --query 'logGroups[?!retentionInDays].[logGroupName]' \
    --output text | \
    xargs -I {} aws logs put-retention-policy \
        --log-group-name {} \
        --retention-in-days 7
  3. Used metric filters to extract important data (sketched below), then deleted the raw logs
  4. Moved long-term logs to S3 for cheaper storage
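
Each metric filter from step 3 is a single API call. A sketch that counts ERROR lines in a hypothetical log group (the group name and metric names are made up for illustration):

import boto3

logs = boto3.client('logs')

# Turn ERROR log lines into a metric, so the raw logs can expire quickly
logs.put_metric_filter(
    logGroupName='/app/api-backend',   # hypothetical log group
    filterName='api-error-count',
    filterPattern='ERROR',
    metricTransformations=[{
        'metricName': 'ApiErrorCount',
        'metricNamespace': 'AppMetrics',
        'metricValue': '1',
        'defaultValue': 0,
    }],
)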

Savings: $320/month

The Quick Wins Checklist

After finding all these issues, I created a monthly audit checklist:

Compute:

  • [ ] Any EC2 instances with <10% CPU for 7 days?
  • [ ] Any instances without tags?
  • [ ] Any instances running for >30 days without deployment?
  • [ ] Can we use Reserved Instances for steady workloads?

Storage:

  • [ ] Any S3 buckets without lifecycle policies?
  • [ ] Any EBS volumes not attached to instances?
  • [ ] Any snapshots older than 3 months?
  • [ ] Any old AMIs we’re not using?

Database:

  • [ ] RDS instances sized correctly for actual usage?
  • [ ] Multi-AZ enabled only where needed?
  • [ ] Automated backups set to minimum retention?
  • [ ] Any old manual snapshots?

Network:

  • [ ] Any unused load balancers?
  • [ ] NAT Gateway traffic reducible via VPC endpoints?
  • [ ] Any Elastic IPs not associated with instances? ($0.005/hour each)

Monitoring:

  • [ ] CloudWatch log retention set appropriately?
  • [ ] Any unused CloudWatch alarms?
  • [ ] Metrics retention longer than needed?
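
A few of these checks are trivial to script. A sketch that flags unattached EBS volumes and unassociated Elastic IPs, two of the storage and network items above:

import boto3

ec2 = boto3.client('ec2')

# EBS volumes in the 'available' state are not attached to anything
volumes = ec2.describe_volumes(
    Filters=[{'Name': 'status', 'Values': ['available']}]
)['Volumes']
for vol in volumes:
    print(f"Unattached volume: {vol['VolumeId']} ({vol['Size']} GB)")

# Elastic IPs with no association are billed hourly for doing nothing
addresses = ec2.describe_addresses()['Addresses']
for addr in addresses:
    if 'AssociationId' not in addr:
        print(f"Unassociated Elastic IP: {addr['PublicIp']}")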

The Tools That Helped

1. AWS Cost Explorer

This is built into the AWS Console. I also pulled daily cost breakdowns by service from the CLI:

aws ce get-cost-and-usage \
    --time-period Start=2024-01-01,End=2024-01-31 \
    --granularity DAILY \
    --metrics BlendedCost \
    --group-by Type=DIMENSION,Key=SERVICE

2. AWS Trusted Advisor

Free with Business/Enterprise support. Shows:

  • Underutilized instances
  • Idle load balancers
  • Unassociated Elastic IPs
  • RDS idle DB instances

3. AWS Compute Optimizer

Analyzes CloudWatch metrics and recommends right-sizing:

aws compute-optimizer get-ec2-instance-recommendations \
    --instance-arns arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0

4. Cost Anomaly Detection

Set up alerts for unexpected cost increases:

aws ce create-anomaly-monitor \
    --anomaly-monitor MonitorName="ProductionCostMonitor",MonitorType=DIMENSIONAL,MonitorDimension=SERVICE
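
The monitor only detects anomalies; to actually hear about them you attach a subscription. A sketch in boto3 (the monitor ARN, dollar threshold, and email address are placeholders):

import boto3

ce = boto3.client('ce')

ce.create_anomaly_subscription(
    AnomalySubscription={
        'SubscriptionName': 'ProductionCostAlerts',
        'MonitorArnList': [
            # ARN returned when the anomaly monitor above was created (placeholder)
            'arn:aws:ce::123456789012:anomalymonitor/example-monitor-id',
        ],
        'Subscribers': [
            {'Type': 'EMAIL', 'Address': 'devops-team@example.com'},
        ],
        'Frequency': 'DAILY',
        'Threshold': 100.0,  # only alert on anomalies with at least $100 of impact
    },
)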

The Results After 3 Months

Before:

  • Monthly AWS bill: $47,000
  • Cost per user: $3.92
  • Cost per API call: $0.0023

After:

  • Monthly AWS bill: $27,200
  • Cost per user: $2.27
  • Cost per API call: $0.0013

Total savings: $19,800/month = $237,600/year

Breakdown by category:

  • EC2 optimization: $7,800/month
  • Storage optimization: $4,600/month
  • RDS right-sizing: $1,800/month
  • Reserved Instances: $4,200/month
  • Network optimization: $800/month
  • Monitoring/logging: $600/month

And we didn’t change a single line of application code. No performance degradation. No downtime (except that 8-minute RDS resize).

What I Learned

1. Defaults are expensive: AWS defaults to convenient, not cheap. You have to opt into savings.

2. Nobody watches the bill: Development teams spin things up; nobody spins things down.

3. Tagging is not optional: Without tags, you can’t track what anything is for.

4. Audit monthly: Costs drift. What made sense in January is wasteful by June.

5. Reserved Instances are free money: If it runs 24/7, buy a reservation. Period.

6. Storage policies save thousands: Lifecycle rules cost nothing to set up and save constantly.

Your Action Plan

Start here:

Week 1: Discovery

  • Enable AWS Cost Explorer
  • Run Trusted Advisor
  • Tag everything that’s running
  • Identify zombie resources

Week 2: Quick Wins

  • Set up S3 lifecycle policies
  • Delete unused resources
  • Set CloudWatch log retention
  • Enable Cost Anomaly Detection

Week 3: Optimization

  • Right-size oversized instances
  • Purchase Reserved Instances for steady workloads
  • Set up VPC endpoints
  • Consolidate load balancers

Week 4: Automation

  • Create Lambda for usage monitoring
  • Set up monthly cost reports
  • Document what everything costs
  • Create a cost review calendar

The Real Lesson

Our CFO was right. We were being lazy. We treated AWS like it was free because we never looked at the bill.

The infrastructure grew organically—someone needed a server, they spun one up. Someone needed to test something, they created a new environment. Nobody cleaned up after themselves.

We cut our bill by 42% in three weeks without touching our application. That’s not optimization. That’s basic housekeeping.

If you haven’t audited your AWS costs in the last three months, you’re throwing money away. I guarantee you have zombie instances, oversized databases, and storage you forgot about.

Go check right now. I’ll wait.
