AWS Cost Optimization
I was in a meeting with our CFO when she slid a printed AWS bill across the table. “$47,000 for last month. That’s up 18% from last quarter. What are we paying for?”
I did what every engineer does when confronted with money questions—I got defensive. “We’re growing. More users means more infrastructure. That’s just how it works.”
She wasn’t buying it. “Your user base grew 12%, but costs grew 18%. Something doesn’t add up.”
She was right. I spent the next two weeks diving into our AWS console, and what I found was embarrassing. We weren’t paying for growth—we were paying for laziness, defaults, and things we’d set up two years ago and never touched again.
Three weeks later, our bill was $27,000. Same infrastructure, same performance, same uptime. We just stopped being stupid with our money.
Here’s exactly what we changed.
Discovery 1: We Were Paying for Servers That Were Doing Nothing
I started with AWS Cost Explorer and filtered by service. EC2 was eating 62% of our budget—$29,000 per month. That seemed high for our traffic.
I pulled up our EC2 dashboard and started clicking through instances. That’s when I saw it:
Instance: api-worker-04
- Status: Running
- Type: c5.2xlarge ($0.34/hour = $245/month)
- CPU Utilization (30 days): 2.3%
- Network In/Out: Basically zero
- Last deployment: 8 months ago
This server was doing absolutely nothing. It was running because someone spun it up during a load test last year and forgot to shut it down. $245/month for eight months is $1,960 down the drain.
I found eleven more just like it.
Some were old staging environments. Some were testing boxes that never got cleaned up. One was literally called “temp-test-delete-me” and had been running for six months.
Action taken:
# Tagged all instances with purpose and owner
aws ec2 create-tags --resources i-1234567890abcdef0 \
--tags Key=Purpose,Value=production \
Key=Owner,Value=devops-team \
Key=Project,Value=api-backend
Then I wrote a simple Lambda function that ran weekly:
import boto3
from datetime import datetime, timedelta

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    cloudwatch = boto3.client('cloudwatch')

    instances = ec2.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )

    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']

            # Check CPU utilization over the last 7 days
            stats = cloudwatch.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                StartTime=datetime.now() - timedelta(days=7),
                EndTime=datetime.now(),
                Period=3600,
                Statistics=['Average']
            )

            datapoints = stats['Datapoints']
            avg_cpu = sum(d['Average'] for d in datapoints) / len(datapoints) if datapoints else 0

            # Alert if CPU < 5% for a week
            if avg_cpu < 5:
                print(f"Low utilization alert: {instance_id} - {avg_cpu}% CPU")
                # Send to Slack, email, etc.
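That last comment is doing a lot of work. One quick way to make the alert actually go somewhere is publishing to an SNS topic wired up to email or Slack; a minimal sketch, assuming a hypothetical cost-alerts topic:
import boto3

sns = boto3.client('sns')

def notify_low_utilization(instance_id, avg_cpu):
    # Publish the alert to an SNS topic (the ARN below is a placeholder)
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:cost-alerts',
        Subject='Low EC2 utilization',
        Message=f"{instance_id} averaged {avg_cpu:.1f}% CPU over the last 7 days"
    )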
Savings: $3,200/month from terminating zombie instances.
Discovery 2: Our RDS Instances Were Hilariously Oversized
Next stop: RDS. Our primary database was running on a db.r5.4xlarge (16 vCPUs, 128 GB RAM) costing us $1,872/month.
I checked CloudWatch metrics:
- CPU Utilization: Peak 18%, average 8%
- Memory: Using 42 GB out of 128 GB
- IOPS: Not even close to the provisioned limit
We were paying for a Ferrari but driving it like a Honda Civic.
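For reference, here's roughly how I pulled those peak and average numbers from CloudWatch. It's a sketch rather than the exact script; the two-week window matches what we watched, and production-db is the same instance identifier used in the snapshot command below:
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

def rds_metric(metric_name, db_identifier='production-db', days=14):
    # Hourly datapoints for one RDS metric over the last `days` days
    stats = cloudwatch.get_metric_statistics(
        Namespace='AWS/RDS',
        MetricName=metric_name,
        Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': db_identifier}],
        StartTime=datetime.utcnow() - timedelta(days=days),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=['Average', 'Maximum']
    )
    points = stats['Datapoints']
    avg = sum(p['Average'] for p in points) / len(points) if points else 0
    peak = max((p['Maximum'] for p in points), default=0)
    return avg, peak

# CPUUtilization is a percentage; FreeableMemory is in bytes
print(rds_metric('CPUUtilization'))
print(rds_metric('FreeableMemory'))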
What we did:
- Enabled Enhanced Monitoring (should have done this from day one)
- Watched metrics for two weeks during peak traffic
- Realized we could comfortably run on db.r5.xlarge (4 vCPUs, 32 GB RAM)
The downgrade process was scary but smooth:
# Create a snapshot first (always)
aws rds create-db-snapshot \
--db-instance-identifier production-db \
--db-snapshot-identifier pre-downsize-snapshot-2024
# Modify instance class
aws rds modify-db-instance \
--db-instance-identifier production-db \
--db-instance-class db.r5.xlarge \
--apply-immediately
We scheduled it during a low-traffic window at 3 AM on a Tuesday. The instance was down for about 8 minutes during the resize.
Result: From $1,872/month to $468/month. Savings: $1,404/month
And performance didn’t change at all. Turns out we never needed that much horsepower.
Discovery 3: We Were Paying for Storage We Weren’t Using
S3 was costing us $6,200/month. For file storage. That seemed insane.
I dug into the bucket analytics:
aws s3api list-buckets --query "Buckets[].Name" --output text | tr '\t' '\n' | \
xargs -I {} aws s3api get-bucket-location --bucket {}
What I found:
- 2.4 TB of user uploads (totally fine)
- 8.7 TB of application logs from 2022-2023 (what?!)
- 1.9 TB of database backups older than 6 months
- 670 GB of temporary files that were never cleaned up
We had logs from two years ago sitting in S3 Standard storage at $0.023 per GB/month.
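To get those per-prefix numbers, I summed object sizes bucket by bucket. This is a sketch, not the exact script: the bucket name matches the lifecycle example below, the prefixes are illustrative, and on multi-terabyte buckets an S3 Inventory report or Storage Lens is much faster than listing objects:
import boto3

s3 = boto3.client('s3')

def prefix_size_gb(bucket, prefix):
    # Walk every object under the prefix and add up the sizes (slow on huge buckets)
    paginator = s3.get_paginator('list_objects_v2')
    total_bytes = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            total_bytes += obj['Size']
    return total_bytes / 1024 ** 3

for prefix in ['uploads/', 'logs/', 'backups/', 'temp/']:
    print(prefix, round(prefix_size_gb('my-app-storage', prefix), 1), 'GB')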
The fix:
Created lifecycle policies for every bucket:
{
  "Rules": [
    {
      "Id": "Move old logs to Glacier",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      },
      "Filter": {
        "Prefix": "logs/"
      }
    },
    {
      "Id": "Delete temp files",
      "Status": "Enabled",
      "Expiration": {
        "Days": 7
      },
      "Filter": {
        "Prefix": "temp/"
      }
    },
    {
      "Id": "Intelligent tiering for uploads",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "INTELLIGENT_TIERING"
        }
      ],
      "Filter": {
        "Prefix": "uploads/"
      }
    }
  ]
}
Applied via CLI:
aws s3api put-bucket-lifecycle-configuration \
--bucket my-app-storage \
--lifecycle-configuration file://lifecycle-policy.json
This automatically:
- Moved old logs to Glacier ($0.004/GB vs $0.023/GB)
- Deleted logs older than 1 year
- Moved infrequently accessed uploads to cheaper storage
- Cleaned up temp files after 7 days
Savings: $4,100/month in S3 costs alone.
Discovery 4: Our Load Balancers Were More Expensive Than Our Servers
We had five Application Load Balancers running:
- Production API: $16.20/month + $0.008/LCU-hour
- Staging API: $16.20/month + minimal traffic
- Development API: $16.20/month + minimal traffic
- Admin dashboard: $16.20/month + barely used
- Legacy service (we forgot about): $16.20/month + zero traffic
Each ALB costs a base $16.20/month plus usage. But here's the kicker: rule evaluations count toward the LCU charge too, so every request is effectively multiplied by the number of rules it has to evaluate beyond the first 10 free.
Our production ALB had 47 rules. We needed maybe 10.
What we did:
- Consolidated staging and dev environments behind a single ALB using host-based routing (rule sketch after this list):
# Before: 2 ALBs
- ALB 1: staging-api.example.com
- ALB 2: dev-api.example.com
# After: 1 ALB with rules
- ALB: shared-nonprod.example.com
Rules:
- Host: staging-api.example.com → Target Group A
- Host: dev-api.example.com → Target Group B
- Moved the admin dashboard (5 requests/minute) to the main ALB
- Deleted the legacy ALB that wasn’t serving any traffic
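The host-based rules behind that consolidation are one create-rule call per hostname. A rough boto3 sketch; the listener and target group ARNs are placeholders:
import boto3

elbv2 = boto3.client('elbv2')

# Route staging traffic on the shared listener by Host header (ARNs are placeholders)
elbv2.create_rule(
    ListenerArn='arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/shared-nonprod/abc123/def456',
    Priority=10,
    Conditions=[{
        'Field': 'host-header',
        'HostHeaderConfig': {'Values': ['staging-api.example.com']}
    }],
    Actions=[{
        'Type': 'forward',
        'TargetGroupArn': 'arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/staging-api/abc123def456'
    }]
)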
Savings: $800/month in ALB base costs and reduced LCU charges.
Discovery 5: Reserved Instances for Predictable Workloads
This one was painful because we should have done it years ago.
Our production servers ran 24/7/365. We were paying on-demand pricing—the most expensive option—because we “wanted flexibility.”
That’s like renting a car daily for three years instead of just buying one.
I pulled a utilization report for the last 12 months:
- 8 EC2 instances ran constantly: t3.xlarge
- 2 RDS instances ran constantly: db.r5.xlarge
- 3 ElastiCache nodes ran constantly: cache.r5.large
These never changed. They never scaled down. They just… ran.
Action taken:
Purchased 1-year Reserved Instances with partial upfront payment:
# Example for EC2
aws ec2 purchase-reserved-instances-offering \
--reserved-instances-offering-id <offering-id> \
--instance-count 8
For RDS:
aws rds purchase-reserved-db-instances-offering \
--reserved-db-instances-offering-id <offering-id> \
--db-instance-count 2
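To get the <offering-id> values in the first place, you query the available offerings and pick the matching term and payment option. A sketch for our EC2 case; the Linux/UNIX platform is an assumption about our fleet:
import boto3

ec2 = boto3.client('ec2')

# 1-year (31,536,000 seconds) partial-upfront standard offerings for t3.xlarge
offerings = ec2.describe_reserved_instances_offerings(
    InstanceType='t3.xlarge',
    ProductDescription='Linux/UNIX',
    OfferingClass='standard',
    OfferingType='Partial Upfront',
    MinDuration=31536000,
    MaxDuration=31536000,
    IncludeMarketplace=False
)

for offering in offerings['ReservedInstancesOfferings']:
    print(offering['ReservedInstancesOfferingId'],
          offering['FixedPrice'],          # the upfront portion
          offering['RecurringCharges'])    # the hourly portion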
Pricing comparison:
- t3.xlarge on-demand: $0.1664/hour ≈ $1,461 per instance per year
- t3.xlarge reserved (1-yr, partial upfront): effective $0.0984/hour ≈ $864 per instance per year
Savings: $7,164/month across EC2, RDS, and ElastiCache.
Yes, that’s $7,000 per month for clicking a few buttons.
Discovery 6: Auto Scaling Groups That Never Scaled
Our ASGs were configured with:
Min: 4 instances
Max: 20 instances
Desired: 4 instances
In two years, they had scaled to 5 instances exactly twice—during a product launch—and immediately scaled back down.
We were maintaining the infrastructure for scaling we never used:
- Multiple target groups
- Complex scaling policies
- CloudWatch alarms
- SNS topics for notifications
What we changed:
- Reduced max capacity to 8 (still plenty of headroom)
- Simplified scaling policies to CPU-based triggers only (see the sketch after this list)
- Deleted unused CloudWatch alarms ($0.10/alarm/month adds up)
- Combined similar ASGs where possible
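For the CPU-based policy, a single target-tracking rule is the simplest version. This is a sketch rather than our exact config; the ASG name and the 60% target are assumptions:
import boto3

autoscaling = boto3.client('autoscaling')

# Keep average CPU across the group around 60% (name and target are placeholders)
autoscaling.put_scaling_policy(
    AutoScalingGroupName='api-backend-asg',
    PolicyName='cpu-target-tracking',
    PolicyType='TargetTrackingScaling',
    TargetTrackingConfiguration={
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'ASGAverageCPUUtilization'
        },
        'TargetValue': 60.0
    }
)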
We had three separate ASGs for the same application because “microservices.” They could have been one ASG with different user data.
Savings: $280/month in reduced overhead and smaller max capacity.
Discovery 7: NAT Gateway Bills Were Out of Control
Our NAT Gateway was costing $387/month. For what? So instances in private subnets could reach the internet.
But here’s the thing—most of our private subnet traffic was talking to other AWS services: S3, DynamoDB, RDS.
NAT Gateways charge $0.045 per GB of data processed. We were processing 7.2 TB/month through them.
The fix:
Set up VPC endpoints for AWS services:
# S3 Gateway Endpoint (free!)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345678 \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-12345678
# DynamoDB Gateway Endpoint (also free!)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345678 \
--service-name com.amazonaws.us-east-1.dynamodb \
--route-table-ids rtb-12345678
This routes traffic to S3 and DynamoDB through AWS’s internal network instead of through the NAT Gateway.
We also found that several services were pulling Docker images from Docker Hub through the NAT Gateway. We moved to Amazon ECR and used VPC endpoints for that too.
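Unlike the S3 and DynamoDB gateway endpoints above, ECR needs interface endpoints (ecr.api and ecr.dkr, plus S3 for the image layers), and interface endpoints do carry a small hourly charge per AZ. A sketch; the subnet and security group IDs are placeholders:
import boto3

ec2 = boto3.client('ec2')

# Interface endpoints so image pulls skip the NAT Gateway (IDs are placeholders)
for service in ['com.amazonaws.us-east-1.ecr.api', 'com.amazonaws.us-east-1.ecr.dkr']:
    ec2.create_vpc_endpoint(
        VpcEndpointType='Interface',
        VpcId='vpc-12345678',
        ServiceName=service,
        SubnetIds=['subnet-aaaa1111', 'subnet-bbbb2222'],
        SecurityGroupIds=['sg-12345678'],
        PrivateDnsEnabled=True
    )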
Result: Data through NAT Gateway dropped from 7.2 TB to 1.1 TB.
Savings: $274/month
Discovery 8: CloudWatch Logs Were Killing Us
We had verbose logging enabled everywhere. Every API request, every query, every function call—logged.
CloudWatch Logs charges $0.50 per GB ingested and $0.03 per GB stored. We were ingesting 890 GB per month.
That’s $445/month for logs we never looked at.
What we did:
- Reduced log levels in production (INFO instead of DEBUG)
- Set retention periods on all log groups:
# List all log groups without retention
aws logs describe-log-groups \
--query 'logGroups[?!retentionInDays].[logGroupName]' \
--output text | \
xargs -I {} aws logs put-retention-policy \
--log-group-name {} \
--retention-in-days 7
- Used metric filters to extract important data, then deleted the raw logs
- Moved long-term logs to S3 for cheaper storage
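That last move was just CloudWatch Logs export tasks. A sketch; the log group, time window, and destination are placeholders, and the destination bucket needs a policy that lets the CloudWatch Logs service write to it:
import boto3
from datetime import datetime, timedelta, timezone

logs = boto3.client('logs')

# Export the previous 30 days of one log group to S3 (names are placeholders)
now = datetime.now(timezone.utc)
logs.create_export_task(
    taskName='api-logs-archive',
    logGroupName='/aws/lambda/api-backend',
    fromTime=int((now - timedelta(days=30)).timestamp() * 1000),
    to=int(now.timestamp() * 1000),
    destination='my-app-storage',
    destinationPrefix='logs/cloudwatch-exports'
)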
Savings: $320/month
The Quick Wins Checklist
After finding all these issues, I created a monthly audit checklist:
Compute:
- [ ] Any EC2 instances with <10% CPU for 7 days?
- [ ] Any instances without tags?
- [ ] Any instances running for >30 days without deployment?
- [ ] Can we use Reserved Instances for steady workloads?
Storage:
- [ ] Any S3 buckets without lifecycle policies?
- [ ] Any EBS volumes not attached to instances?
- [ ] Any snapshots older than 3 months?
- [ ] Any old AMIs we’re not using?
Database:
- [ ] RDS instances sized correctly for actual usage?
- [ ] Multi-AZ enabled only where needed?
- [ ] Automated backups set to minimum retention?
- [ ] Any old manual snapshots?
Network:
- [ ] Any unused load balancers?
- [ ] NAT Gateway traffic reducible via VPC endpoints?
- [ ] Any Elastic IPs not associated with instances? ($0.005/hour each)
Monitoring:
- [ ] CloudWatch log retention set appropriately?
- [ ] Any unused CloudWatch alarms?
- [ ] Metrics retention longer than needed?
The Tools That Helped
1. AWS Cost Explorer
It's built into the AWS console. I set up daily email reports there and pulled the same data from the CLI when I wanted to script it:
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-02-01 \
--granularity DAILY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
2. AWS Trusted Advisor
Included at no extra cost with Business or Enterprise support. It flags:
- Underutilized instances
- Idle load balancers
- Unassociated Elastic IPs
- RDS idle DB instances
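If you'd rather script it than click through the console, the same checks are exposed via the Support API (it only works on a Business or Enterprise support plan, and only from us-east-1). A sketch:
import boto3

# The Support API requires a Business or Enterprise support plan
support = boto3.client('support', region_name='us-east-1')

checks = support.describe_trusted_advisor_checks(language='en')['checks']
for check in checks:
    if check['category'] == 'cost_optimizing':
        result = support.describe_trusted_advisor_check_result(checkId=check['id'])['result']
        print(f"{check['name']}: {result['status']}, "
              f"{len(result.get('flaggedResources', []))} flagged resources")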
3. AWS Compute Optimizer
Analyzes CloudWatch metrics and recommends right-sizing:
aws compute-optimizer get-ec2-instance-recommendations \
--instance-arns arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0
4. Cost Anomaly Detection
Set up alerts for unexpected cost increases:
aws ce create-anomaly-monitor \
--anomaly-monitor MonitorName="ProductionCostMonitor",MonitorType=DIMENSIONAL,MonitorDimension=SERVICE
Pair the monitor with an anomaly subscription (email or SNS) so the alerts actually reach someone.
The Results After 3 Months
Before:
- Monthly AWS bill: $47,000
- Cost per user: $3.92
- Cost per API call: $0.0023
After:
- Monthly AWS bill: $27,200
- Cost per user: $2.27
- Cost per API call: $0.0013
Total savings: $19,800/month = $237,600/year
Breakdown by category:
- EC2 optimization: $7,800/month
- Storage optimization: $4,600/month
- RDS right-sizing: $1,800/month
- Reserved Instances: $4,200/month
- Network optimization: $800/month
- Monitoring/logging: $600/month
And we didn’t change a single line of application code. No performance degradation. No downtime (except that 8-minute RDS resize).
What I Learned
1. Defaults are expensive: AWS defaults to convenient, not cheap. You have to opt into savings.
2. Nobody watches the bill: Development teams spin things up; nobody spins things down.
3. Tagging is not optional: Without tags, you can’t track what anything is for.
4. Audit monthly: Costs drift. What made sense in January is wasteful by June.
5. Reserved Instances are free money: If it runs 24/7, buy a reservation. Period.
6. Storage policies save thousands: Lifecycle rules cost nothing to set up and save constantly.
Your Action Plan
Start here:
Week 1: Discovery
- Enable AWS Cost Explorer
- Run Trusted Advisor
- Tag everything that’s running
- Identify zombie resources
Week 2: Quick Wins
- Set up S3 lifecycle policies
- Delete unused resources
- Set CloudWatch log retention
- Enable Cost Anomaly Detection
Week 3: Optimization
- Right-size oversized instances
- Purchase Reserved Instances for steady workloads
- Set up VPC endpoints
- Consolidate load balancers
Week 4: Automation
- Create Lambda for usage monitoring
- Set up monthly cost reports (see the sketch after this plan)
- Document what everything costs
- Create a cost review calendar
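For the monthly report item, a small scheduled script over the Cost Explorer API goes a long way. A sketch that breaks last month's spend down by service, using the same BlendedCost metric as the CLI command earlier:
import boto3
from datetime import date, timedelta

ce = boto3.client('ce')

# Boundaries of the previous calendar month (Cost Explorer end dates are exclusive)
first_of_this_month = date.today().replace(day=1)
first_of_last_month = (first_of_this_month - timedelta(days=1)).replace(day=1)

response = ce.get_cost_and_usage(
    TimePeriod={'Start': first_of_last_month.isoformat(), 'End': first_of_this_month.isoformat()},
    Granularity='MONTHLY',
    Metrics=['BlendedCost'],
    GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
)

for group in response['ResultsByTime'][0]['Groups']:
    service = group['Keys'][0]
    amount = float(group['Metrics']['BlendedCost']['Amount'])
    if amount >= 1:  # skip the sub-dollar noise
        print(f"{service}: ${amount:,.2f}")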
The Real Lesson
Our CFO was right. We were being lazy. We treated AWS like it was free because we never looked at the bill.
The infrastructure grew organically—someone needed a server, they spun one up. Someone needed to test something, they created a new environment. Nobody cleaned up after themselves.
We cut our bill by 42% in three weeks without touching our application. That’s not optimization. That’s basic housekeeping.
If you haven’t audited your AWS costs in the last three months, you’re throwing money away. I guarantee you have zombie instances, oversized databases, and storage you forgot about.
Go check right now. I’ll wait.


