Cloud Cost Optimization Strategies

Ways to cut waste in cloud environments while keeping performance and resilience where they need to be.

Baikal Signal
FinOps without buzzword fog: measure, rightsize, automate, repeat.

Cloud costs can spiral out of control without proper management. This guide covers practical strategies to reduce spending while maintaining performance and reliability.

Cost Visibility

You can't optimize what you can't measure. Start with comprehensive cost tracking:

Tagging Strategy

Tag all resources with:

  • Environment: production, staging, development
  • Team: ownership and accountability
  • Project: cost allocation
  • CostCenter: business unit tracking
aws ec2 run-instances \
  --tag-specifications 'ResourceType=instance,Tags=[
    {Key=Environment,Value=production},
    {Key=Team,Value=platform},
    {Key=Project,Value=api-backend}
  ]'

Instance Rightsizing

Many instances are provisioned well above what their workloads actually use. Monitor real utilization and resize accordingly.

Analysis Process

  1. Collect CPU and memory metrics for 14 days
  2. Identify instances with <50% average utilization
  3. Test smaller instance types in staging
  4. Gradually migrate production workloads
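Step 2 of the process above can be sketched in a few lines. The instance IDs and sample values here are made-up illustrations, not real data:

```python
# Flag instances whose average CPU utilization over the collection
# window falls below the 50% threshold from step 2.

def underutilized(metrics, threshold=50.0):
    """Return instance IDs whose average utilization (%) is below threshold."""
    return [
        instance_id
        for instance_id, samples in metrics.items()
        if sum(samples) / len(samples) < threshold
    ]

# Hypothetical 14-day averages, sampled per day (abridged to 4 samples here)
cpu_metrics = {
    "i-0api1": [28, 35, 31, 22],   # ~29% average -> downsize candidate
    "i-0db01": [72, 81, 64, 69],   # ~71% average -> leave as-is
}
print(underutilized(cpu_metrics))  # ['i-0api1']
```

In practice the metrics dict would be populated from your monitoring system, and memory utilization should be checked the same way before committing to a smaller type.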

Example savings from downsizing:

# Before: r5.2xlarge (8 vCPU, 64 GB RAM) = $0.504/hour
# Actual usage: 30% CPU, 20 GB RAM
# After: r5.xlarge (4 vCPU, 32 GB RAM) = $0.252/hour
# Savings: 50% ≈ $184/month per instance (730-hour month)
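The arithmetic behind that savings figure, using the on-demand rates quoted above and a 730-hour average month:

```python
# Monthly savings from swapping one instance type for a cheaper one,
# assuming the instance runs around the clock.

HOURS_PER_MONTH = 730  # average hours in a month

def monthly_savings(before_hourly, after_hourly):
    """Dollar savings per month from the hourly rate difference."""
    return (before_hourly - after_hourly) * HOURS_PER_MONTH

print(monthly_savings(0.504, 0.252))  # ~183.96 USD per instance
```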

Storage Optimization

Storage costs add up, especially with snapshots and old data.

Lifecycle Policies

Implement automatic tiering:

  • Move to infrequent access after 30 days
  • Archive to Glacier after 90 days
  • Delete old snapshots after 180 days
{
  "Rules": [{
    "Id": "archive-old-data",
    "Transitions": [
      { "Days": 30, "StorageClass": "STANDARD_IA" },
      { "Days": 90, "StorageClass": "GLACIER" }
    ],
    "Expiration": { "Days": 365 }
  }]
}
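To see what tiering buys you, here is a rough first-year cost estimate for an object following the policy above. The per-GB prices are assumed placeholders for illustration, not current AWS pricing:

```python
# Assumed example prices in USD per GB-month -- check your provider's
# actual price sheet before relying on these numbers.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.004,
}

def first_year_cost(gb):
    """Approximate first-year cost under the lifecycle policy:
    ~1 month STANDARD, ~2 months STANDARD_IA, ~9 months GLACIER."""
    schedule = [("STANDARD", 1), ("STANDARD_IA", 2), ("GLACIER", 9)]
    return sum(PRICE_PER_GB_MONTH[tier] * months * gb
               for tier, months in schedule)

tiered = first_year_cost(100)          # 100 GB with tiering
flat = PRICE_PER_GB_MONTH["STANDARD"] * 12 * 100  # 100 GB STANDARD all year
print(f"tiered: ${tiered:.2f}, standard-only: ${flat:.2f}")
```

Under these assumed prices, tiering cuts the first-year storage bill to roughly a third of keeping everything in the standard class.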

Reserved Instances

For stable workloads, reserved instances offer significant savings:

  • 1-year commitment: 30-40% discount
  • 3-year commitment: 50-60% discount
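A quick break-even check helps decide whether a reservation pays off. The sketch below uses the discount ranges above; the hourly rate is an assumed example:

```python
def breakeven_utilization(on_demand_hourly, reserved_discount):
    """Fraction of hours an instance must run for a reservation to beat
    on-demand. A reservation is billed for every hour whether used or not;
    on-demand is billed only for hours actually run."""
    reserved_hourly = on_demand_hourly * (1 - reserved_discount)
    return reserved_hourly / on_demand_hourly

# With a 35% discount, the instance must run more than 65% of the time
# for the reservation to come out ahead.
print(round(breakeven_utilization(0.504, 0.35), 2))  # 0.65
```

This is why the lists below steer always-on databases toward reservations and intermittent workloads toward spot.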

When to Use

Reserve capacity for:

  • Databases running 24/7
  • Baseline compute capacity
  • Predictable batch workloads

Use spot instances for:

  • Fault-tolerant processing
  • Development environments
  • CI/CD runners

Automated Scaling

Scale resources based on actual demand:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
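The HPA's core scaling rule, per the Kubernetes documentation, is desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to the min/max bounds. A small sketch of that rule with the bounds from the manifest above:

```python
import math

def desired_replicas(current, current_util, target_util, lo=3, hi=20):
    """HPA scaling rule: scale proportionally to the utilization ratio,
    rounded up, then clamped to [minReplicas, maxReplicas]."""
    want = math.ceil(current * current_util / target_util)
    return max(lo, min(hi, want))

# 5 pods averaging 90% CPU against a 70% target -> scale out to 7
print(desired_replicas(5, 90, 70))  # 7
```

Note the real controller also applies stabilization windows and tolerance bands to avoid flapping; this shows only the proportional core.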

Schedule-Based Scaling

Reduce resources during off-hours:

# Scale down at night
0 22 * * * kubectl scale deployment api --replicas=2
# Scale up in morning
0 7 * * * kubectl scale deployment api --replicas=10
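The schedule above runs 2 replicas from 22:00 to 07:00 instead of 10 around the clock. A rough tally of what that saves (the per-replica cost is an assumed figure, not a real price):

```python
REPLICA_HOURLY = 0.10  # assumed cost per replica-hour, USD

def daily_replica_hours(day_replicas=10, night_replicas=2, night_hours=9):
    """Total replica-hours per day under the cron schedule:
    22:00-07:00 is a 9-hour night window."""
    day_hours = 24 - night_hours
    return day_replicas * day_hours + night_replicas * night_hours

always_on = 10 * 24                 # 240 replica-hours without the schedule
scheduled = daily_replica_hours()   # 168 replica-hours with it
saved = always_on - scheduled
print(f"saved {saved} replica-hours/day, ~${saved * REPLICA_HOURLY:.2f}/day")
```

The same idea applies to pausing development environments overnight, where the night window is often longer and the savings correspondingly larger.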

Summary

Cloud cost optimization is ongoing work. Start with visibility through tagging, rightsize instances based on actual usage, optimize storage with lifecycle policies, use reserved instances for baseline capacity, and implement autoscaling. Review costs monthly and adjust strategies based on workload changes.