Building a High-Availability, Cost-Optimized Ruby Stack on AWS

Leveraging AWS Spot Instances for Ruby Application Servers

Achieving cost optimization on AWS for a Ruby stack, particularly for stateless application servers, hinges on strategically utilizing Spot Instances. These instances offer significant savings (up to 90% off On-Demand prices) but come with the caveat of potential interruption. For a high-availability setup, we must design our architecture to gracefully handle these interruptions without impacting end-user experience.

The core principle is to treat Spot Instances as ephemeral compute resources. This means no persistent data should reside on the instance itself. All application state, session data, and critical information must be externalized to managed AWS services. For a typical Ruby on Rails application, this translates to using services like RDS for the database, ElastiCache for session storage and caching, and S3 for file storage.

We’ll employ Auto Scaling Groups (ASGs) to manage our fleet of Ruby application servers. ASGs can be configured to launch Spot Instances and to replace them automatically when they are reclaimed. Combined with the interruption notice that AWS issues shortly before reclaiming an instance, this lets us drain connections and shut instances down gracefully.

Configuring Auto Scaling Groups for Spot Instance Interruption Handling

The key to a resilient Spot Instance strategy is reacting to the interruption notice in time. AWS publishes the notice through the instance metadata service: about two minutes before a Spot Instance is reclaimed, the `spot/instance-action` metadata path starts returning the pending action and its scheduled time (until then it returns HTTP 404). Software running on the instance can watch for this and trigger a graceful shutdown sequence.

A common approach is to run a small background process or daemon on each application server that periodically polls the instance metadata service for termination notices. Upon detecting a notice, this process can initiate a graceful shutdown of the Ruby application server, signaling to the load balancer to stop sending new requests and allowing existing requests to complete.

Spot Instance Interruption Handling Script (Bash)

Here’s a sample Bash script that can be run as a systemd service on your EC2 instances. It polls the metadata service every few seconds using an IMDSv2 session token and, upon detecting an interruption notice, runs a hypothetical application shutdown script.

#!/bin/bash

# Path to your application's graceful shutdown script
SHUTDOWN_SCRIPT="/opt/your_app/bin/graceful_shutdown.sh"
METADATA_BASE="http://169.254.169.254/latest"
METADATA_URL="$METADATA_BASE/meta-data/spot/instance-action"
# The notice arrives only two minutes before reclamation, so poll frequently
POLL_INTERVAL_SECONDS=5

echo "Starting Spot Instance interruption handler..."

while true; do
    # Request a short-lived IMDSv2 session token (needed when IMDSv1 is disabled)
    TOKEN=$(curl -s --connect-timeout 5 -X PUT "$METADATA_BASE/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60")

    # The endpoint returns 404 until an interruption is scheduled.
    # -f makes curl fail on HTTP errors, so an error page is never mistaken for a notice.
    if ACTION=$(curl -sf --connect-timeout 5 \
        -H "X-aws-ec2-metadata-token: $TOKEN" "$METADATA_URL"); then
        echo "Spot Instance interruption notice detected: $ACTION"
        echo "Initiating graceful shutdown..."
        # Run the shutdown script in the foreground: if this handler exited first,
        # systemd could kill a backgrounded child along with the service.
        if [ -x "$SHUTDOWN_SCRIPT" ]; then
            "$SHUTDOWN_SCRIPT"
            exit $?
        else
            echo "Shutdown script not found or not executable at $SHUTDOWN_SCRIPT" >&2
            exit 1
        fi
    fi

    # Wait before polling again
    sleep "$POLL_INTERVAL_SECONDS"
done
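
The notice body is a small JSON document containing an `action` and a `time` field, so the shutdown logic can work out how long it may safely spend draining. The following Ruby sketch illustrates one way to do that. The method name `drain_budget_seconds` and the 15-second safety margin are assumptions made for this example, not AWS recommendations.

```ruby
require "json"
require "time"

# Seconds held in reserve for the OS shutdown itself (illustrative value).
SAFETY_MARGIN_SECONDS = 15

# Parses the body returned by the spot/instance-action metadata path and
# returns the whole seconds left for draining, or nil when the body is not
# a valid interruption notice (for example an HTML error page).
def drain_budget_seconds(body, now: Time.now.utc)
  notice = JSON.parse(body)
  return nil unless notice.is_a?(Hash) && notice["action"] && notice["time"]

  remaining = Time.iso8601(notice["time"]) - now
  [(remaining - SAFETY_MARGIN_SECONDS).floor, 0].max
rescue JSON::ParserError, ArgumentError
  nil
end

sample = '{"action": "terminate", "time": "2024-01-01T12:02:00Z"}'
puts drain_budget_seconds(sample, now: Time.utc(2024, 1, 1, 12, 0, 0))
```

A shutdown script could use the returned value as an upper bound for any wait it performs, instead of a fixed sleep.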

Graceful Shutdown Script Example (Ruby)

This Ruby script demonstrates how you might signal your application server (e.g., Puma, Unicorn) to stop accepting new requests and finish processing existing ones. The exact implementation will depend on your application server and how it exposes its control interface.

#!/usr/bin/env ruby

# This is a simplified example. You'll need to adapt it to your specific
# application server (e.g., Puma, Unicorn) and its PID file location.

APP_PID_FILE = "/var/run/your_app.pid" # Adjust this path

def send_signal_to_app(pid, signal)
  begin
    Process.kill(signal, pid.to_i)
    puts "Sent #{signal} signal to PID #{pid}"
  rescue Errno::ESRCH
    puts "Process with PID #{pid} not found."
  rescue => e
    puts "Error sending signal #{signal} to PID #{pid}: #{e.message}"
  end
end

if File.exist?(APP_PID_FILE)
  pid = File.read(APP_PID_FILE).strip
  puts "Found application PID: #{pid}"

  # Ask the application server to stop accepting new requests and finish
  # in-flight ones. Puma shuts down gracefully on SIGTERM; Unicorn does so
  # on SIGQUIT. (SIGUSR1 and SIGUSR2 are used for restarts and log
  # reopening, not for shutdown.)
  # Consult your application server's documentation.
  send_signal_to_app(pid, "TERM")

  # Optionally, you might want to wait for a short period
  # to allow existing requests to finish before the instance is terminated.
  # Keep the wait well inside the two-minute interruption window: requests
  # still in flight when AWS reclaims the instance are lost.
  # sleep 30 # Example: wait for 30 seconds

  # For a more robust solution, deregister the instance from the load
  # balancer's target group before signaling the application server.
else
  puts "Application PID file not found at #{APP_PID_FILE}. Cannot perform graceful shutdown."
end

# Ensure the application process eventually exits or is terminated by AWS.
# In a real-world scenario, you might also want to signal the load balancer
# to deregister this instance from the target group.

Integrating with Elastic Load Balancing (ELB)

High availability requires a load balancer to distribute traffic across your application servers. AWS Elastic Load Balancing (ELB) is the natural choice. For Spot Instances, it’s crucial to configure ELB health checks effectively and to ensure that instances are deregistered gracefully before termination.

When a Spot Instance receives a termination notice, your graceful shutdown script should ideally trigger the deregistration of that instance from the ELB’s target group. This prevents the load balancer from sending any new traffic to an instance that is about to disappear. Application servers do not notify ELB on their own, so either call the AWS CLI/SDK from your shutdown script or have the application start failing its health check once shutdown begins.
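
One way for the application itself to tell the load balancer that it is going away is to fail its health check as soon as shutdown starts. The sketch below is a minimal Rack-style middleware that does this. The class name `DrainAwareHealthCheck`, the `/healthz` path, and the marker file location are all hypothetical choices for illustration.

```ruby
# Rack-style middleware: answers the health-check path itself and reports
# 503 once a drain marker file exists. All other requests pass through.
class DrainAwareHealthCheck
  def initialize(app, path: "/healthz", marker: "/tmp/your_app.draining")
    @app = app
    @path = path
    @marker = marker
  end

  def call(env)
    return @app.call(env) unless env["PATH_INFO"] == @path

    if File.exist?(@marker)
      # The shutdown script created the marker: tell the load balancer to stop routing here.
      [503, { "content-type" => "text/plain" }, ["draining"]]
    else
      [200, { "content-type" => "text/plain" }, ["ok"]]
    end
  end
end
```

In this arrangement the shutdown script would create the marker file before signaling the application server. The load balancer only marks a target unhealthy after several failed checks, so this approach reacts more slowly than explicit deregistration and is best treated as a complement to it.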

Deregistering Instance from Target Group (AWS CLI Example)

This command can be executed within your `graceful_shutdown.sh` script, ideally before signaling your application server so that no new requests arrive while it drains. You’ll need to obtain the instance ID and the target group ARN.

#!/bin/bash

# --- Variables ---
# These would typically be dynamically retrieved or passed as arguments
IMDS="http://169.254.169.254/latest"
# IMDSv2 session token (needed when the instance has IMDSv1 disabled)
TOKEN=$(curl -s -X PUT "$IMDS/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" "$IMDS/meta-data/instance-id")
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-ruby-app-tg/abcdef1234567890" # Replace with your ARN
REGION="us-east-1" # Replace with your region

echo "Attempting to deregister instance $INSTANCE_ID from target group $TARGET_GROUP_ARN..."

if aws elbv2 deregister-targets \
    --target-group-arn "$TARGET_GROUP_ARN" \
    --targets Id="$INSTANCE_ID" \
    --region "$REGION"; then
    echo "Successfully initiated deregistration for instance $INSTANCE_ID."
else
    echo "Failed to deregister instance $INSTANCE_ID." >&2
fi

# Deregistration only starts connection draining. The target group's
# deregistration delay defaults to 300 seconds, which is longer than the
# two-minute Spot notice, so consider lowering it and sizing any wait to fit.
# sleep 60 # Example: wait for 60 seconds

Important Note: Ensure the IAM role attached to your EC2 instances has the necessary permissions to call `elasticloadbalancing:DeregisterTargets`.

Database and Session Management: Externalizing State

As mentioned, Spot Instances are ephemeral. This means no data should be stored locally. For a Ruby stack, this is critical for your database and session management.

Database: Amazon RDS

Amazon Relational Database Service (RDS) is the standard for managed relational databases on AWS. For cost optimization, consider using RDS Reserved Instances or Savings Plans for your primary database if your workload is predictable. For less critical or development databases, On-Demand instances might suffice. Ensure your RDS instance is in a Multi-AZ configuration for high availability, independent of your application server’s Spot Instance strategy.

Session Management: Amazon ElastiCache (Redis)

Storing user sessions in memory on application servers is a common anti-pattern, especially with ephemeral instances. Amazon ElastiCache for Redis provides a highly available, in-memory data store well suited to session management. Configure your Ruby application to use ElastiCache for sessions; the Rails example below uses the `:redis_session_store` store provided by the `redis-session-store` gem.

# config/initializers/session_store.rb (Rails Example)

Rails.application.config.session_store :redis_session_store,
  redis: {
    host: ENV.fetch('REDIS_HOST', 'localhost'), # e.g., 'my-redis-cluster.xxxxxx.ng.0001.use1.cache.amazonaws.com'
    port: ENV.fetch('REDIS_PORT', 6379).to_i,
    db: ENV.fetch('REDIS_DB', 0).to_i,
    # Add other Redis options as needed, e.g., password, ssl
  },
  key: '_your_app_session_id',
  expire_after: 1.week # Adjust session expiration

Ensure your ElastiCache cluster is configured for Multi-AZ with automatic failover for high availability. For cost savings, consider using ElastiCache reserved nodes if your usage patterns are stable.

Static Assets and File Storage: Amazon S3

All static assets (images, CSS, JavaScript) and user-uploaded files should be stored in Amazon S3. Use a Content Delivery Network (CDN) like Amazon CloudFront to serve these assets efficiently and reduce latency for your users. This completely offloads static file serving from your application servers.

Orchestration and Deployment

For deploying your Ruby application to these Spot Instances, consider using tools like AWS CodeDeploy, Ansible, or Chef. Your deployment process should ensure that new instances are properly configured with your application code, dependencies, and the Spot Instance interruption handler. When deploying updates, you’ll typically perform a rolling update, launching new instances with the updated code and then terminating the old ones, further minimizing downtime.

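When configuring your Auto Scaling Group, define a Launch Template that specifies the AMI, security groups, IAM role, and user data script to bootstrap your instances. (Launch Configurations are a legacy option and do not support mixed instances policies.) In the group’s mixed instances policy, list several interchangeable instance types and choose the `SpotAllocationStrategy` deliberately: `lowest-price` minimizes cost but ignores available capacity and therefore carries a higher interruption risk, while `price-capacity-optimized` trades a little cost for fewer interruptions. The split between purchase options is controlled by `OnDemandBaseCapacity` and `OnDemandPercentageAboveBaseCapacity`; setting both to 0 yields a Spot-only group. For true cost optimization, a Spot-only group is preferred, relying on the interruption handling and the ASG’s ability to rebalance capacity, whereas a small On-Demand base raises the availability floor at extra cost.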
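
As a rough illustration of such a configuration, the Ruby sketch below builds the request parameters for a group that keeps a small On-Demand base and fills the rest with Spot capacity. The helper name `spot_asg_params`, the group and template names, subnet IDs, instance types, and sizes are all placeholders. The sketch uses `price-capacity-optimized` rather than `lowest-price`, on the assumption that fewer interruptions are worth a slightly higher price.

```ruby
# Builds parameters for an Auto Scaling group that mixes an On-Demand base
# with Spot capacity spread across several instance types.
def spot_asg_params(name:, launch_template:, subnets:, instance_types:,
                    on_demand_base: 1, min: 2, max: 10)
  {
    auto_scaling_group_name: name,
    min_size: min,
    max_size: max,
    vpc_zone_identifier: subnets.join(","),
    # Lets the group launch replacements when instances are at elevated risk.
    capacity_rebalance: true,
    mixed_instances_policy: {
      launch_template: {
        launch_template_specification: {
          launch_template_name: launch_template,
          version: '$Latest'
        },
        overrides: instance_types.map { |type| { instance_type: type } }
      },
      instances_distribution: {
        on_demand_base_capacity: on_demand_base,
        # 0 means everything above the base is Spot.
        on_demand_percentage_above_base_capacity: 0,
        spot_allocation_strategy: "price-capacity-optimized"
      }
    }
  }
end

params = spot_asg_params(
  name: "ruby-app-asg",
  launch_template: "ruby-app-template",
  subnets: %w[subnet-aaaa1111 subnet-bbbb2222],
  instance_types: %w[m5.large m5a.large m6i.large]
)
# With the aws-sdk-autoscaling gem, this hash could be passed as:
#   Aws::AutoScaling::Client.new.create_auto_scaling_group(params)
```

Passing `on_demand_base: 0` would produce a Spot-only group. Spreading the group across several instance types and subnets in different Availability Zones gives it more Spot pools to draw from.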

Monitoring and Alerting

Robust monitoring is essential for any production system, especially one relying on Spot Instances. Key metrics to monitor include:

Spot Instance Interruptions: Interruption notices are published as `EC2 Spot Instance Interruption Warning` events on Amazon EventBridge. Create a rule that forwards them to a notification target (or counts them as a custom metric) so your team can investigate patterns or potential issues with your interruption handling.
ASG Capacity: Enable group metrics and compare `GroupInServiceInstances` with `GroupDesiredCapacity`. A persistent gap indicates that the group cannot obtain or keep enough healthy capacity.
ELB Health Checks: Monitor the `UnhealthyHostCount` and `HealthyHostCount` metrics for your target group. A rising unhealthy count indicates problems with your application or its health check endpoint.
Application Performance: Use application performance monitoring (APM) tools (e.g., New Relic, Datadog) to track request latency, error rates, and resource utilization on your application servers.
Resource Utilization: Monitor CPU, memory, and network utilization for your EC2 instances and RDS/ElastiCache instances.

Alerting on these metrics will provide early warning of potential issues, allowing you to proactively address them before they impact users.
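
For Spot interruptions specifically, AWS publishes an EventBridge event with the detail type `EC2 Spot Instance Interruption Warning`. The Ruby sketch below shows the event pattern a rule could use and how a consumer might pull the instance ID out of a matching event. The helper name `interrupted_instance_id` and the sample instance ID are illustrative assumptions.

```ruby
require "json"

# Event pattern that matches Spot interruption warnings.
SPOT_INTERRUPTION_PATTERN = {
  "source" => ["aws.ec2"],
  "detail-type" => ["EC2 Spot Instance Interruption Warning"]
}.freeze

# Returns the interrupted instance ID when the event matches the pattern,
# otherwise nil.
def interrupted_instance_id(event_json)
  event = JSON.parse(event_json)
  matches = SPOT_INTERRUPTION_PATTERN.all? do |field, allowed|
    allowed.include?(event[field])
  end
  matches ? event.dig("detail", "instance-id") : nil
end

sample_event = {
  "source" => "aws.ec2",
  "detail-type" => "EC2 Spot Instance Interruption Warning",
  "detail" => { "instance-id" => "i-0123456789abcdef0", "instance-action" => "terminate" }
}.to_json

puts JSON.generate(SPOT_INTERRUPTION_PATTERN)
puts interrupted_instance_id(sample_event)
```

The JSON printed from `SPOT_INTERRUPTION_PATTERN` is the form an EventBridge rule expects for its event pattern. The parsing helper would typically live in whatever consumes the events, such as a notification handler.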

Conclusion: Balancing Cost and Availability

Building a high-availability Ruby stack on AWS with a strong focus on cost optimization is achievable by embracing the ephemeral nature of Spot Instances. By externalizing state to managed services like RDS and ElastiCache, leveraging ELB for traffic distribution, and implementing robust Spot Instance interruption handling, you can significantly reduce your AWS infrastructure costs without compromising application availability. The key is meticulous design, thorough testing of your interruption handling mechanisms, and vigilant monitoring.