Disaster Recovery 101: Architecting Auto-Failovers for DynamoDB and Ruby Deployments on Google Cloud

Establishing Multi-Region DynamoDB Replication

A robust disaster recovery strategy for a cloud-native application hinges on resilient data storage. For applications leveraging Amazon DynamoDB, this means implementing global tables to ensure data availability across multiple AWS regions. This isn’t just about backups; it’s about active-active replication that allows for seamless failover with minimal data loss.

The core mechanism for this is DynamoDB Global Tables. When configured, DynamoDB automatically replicates data changes across all specified regions. Replication is asynchronous and usually completes within about a second, so a write accepted in one region may not yet be visible in another at the moment of a failover. Concurrent writes to the same item in different regions are reconciled on a last-writer-wins basis. The key is to select regions that are geographically diverse but also have acceptable latency for your application’s read and write patterns.

Consider a scenario where your primary region is us-east-1 and your secondary DR region is eu-west-1. You would create your DynamoDB table in us-east-1 and then add the eu-west-1 region to its global table configuration. This is typically done via the AWS Management Console, AWS CLI, or SDKs.

AWS CLI Configuration for Global Tables

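Here’s how you’d initiate this using the AWS CLI. First, create your table in the primary region. The CLI workflow for global tables (version 2019.11.21) expects DynamoDB Streams to be enabled with the NEW_AND_OLD_IMAGES view type, and a provisioned-capacity table is expected to use auto scaling for writes before replicas are added, so this example uses on-demand billing instead:

aws dynamodb create-table \
    --table-name MyApplicationTable \
    --attribute-definitions AttributeName=id,AttributeType=S \
    --key-schema AttributeName=id,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST \
    --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES \
    --region us-east-1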

Once the table is created and stable, you can add a replica in another region. This command associates an existing table in one region with a new replica table in another region, effectively creating a global table.

aws dynamodb update-table \
    --table-name MyApplicationTable \
    --replica-updates '[{"Create": {"RegionName": "eu-west-1"}}]' \
    --region us-east-1

You can verify the global table status and replica creation progress using:

aws dynamodb describe-table --table-name MyApplicationTable --region us-east-1

Look for the Replicas section in the output to confirm the status of the replica in eu-west-1. It will transition from CREATING to ACTIVE.
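
For scripted rollouts, the same check can be made from Ruby. The sketch below is a hypothetical helper, not part of the AWS SDK: pending_replica_regions takes the replica list from a describe_table response (objects exposing region_name and replica_status, as the SDK's replica descriptions do) and returns the regions that are not yet ACTIVE. The SDK call is left in a comment so the helper can be tried without AWS credentials.

```ruby
# Hypothetical helper: returns the regions in `expected_regions` whose
# replica is missing from the list or not yet ACTIVE.
def pending_replica_regions(replicas, expected_regions)
  active = Array(replicas)
           .select { |replica| replica.replica_status == "ACTIVE" }
           .map(&:region_name)
  expected_regions - active
end

# With the aws-sdk-dynamodb gem, the replica list would come from:
#   client   = Aws::DynamoDB::Client.new(region: "us-east-1")
#   replicas = client.describe_table(table_name: "MyApplicationTable").table.replicas

# Stand-in objects with the same two fields, for trying the helper offline.
replica_struct = Struct.new(:region_name, :replica_status)
sample = [replica_struct.new("eu-west-1", "CREATING")]
puts pending_replica_regions(sample, ["eu-west-1"]).inspect
```

A deployment script could call this in a loop with a sleep between attempts and proceed only once the returned list is empty.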

Architecting Ruby Deployments for Auto-Failover

For a Ruby on Rails application deployed on Google Cloud Platform (GCP), auto-failover requires a multi-region deployment strategy for your compute instances and a mechanism to direct traffic to the healthy region.

GCP Compute Engine Instance Groups and Load Balancing

We’ll leverage GCP’s Managed Instance Groups (MIGs) and a Global External HTTP(S) Load Balancer. MIGs allow for auto-scaling and auto-healing of your application instances. By creating MIGs in multiple regions (e.g., us-central1 and europe-west1), we establish redundant compute capacity.

The Global External HTTP(S) Load Balancer routes requests to these regional MIGs, and its health checks monitor the instances within each MIG. If the instances in one region become unhealthy, the load balancer automatically stops sending traffic there and directs it to the healthy region. For that to happen, both regional MIGs must be reachable from the backend service that the URL map routes to; the usual way to achieve this is to attach both MIGs as backends of a single backend service.

Terraform for Infrastructure as Code

Managing this multi-region infrastructure is best done with Infrastructure as Code (IaC). Terraform is an excellent choice for this. Below is a simplified Terraform configuration snippet demonstrating the setup.

# main.tf

provider "google" {
  project = "your-gcp-project-id"
  region  = "us-central1" # Primary region for initial configuration
}

# Define instance template for the application
resource "google_compute_instance_template" "app_template" {
  name_prefix  = "ruby-app-template-"
  machine_type = "e2-medium"
  tags         = ["ruby-app", "http-server"]

  disk {
    source_image = "debian-cloud/debian-11"
    auto_delete  = true
    boot         = true
  }

  network_interface {
    network = "default"
    access_config {
      // Ephemeral IP
    }
  }

  metadata_startup_script = file("startup-script.sh") # Script to install Ruby, deploy app, etc.

  lifecycle {
    create_before_destroy = true
  }
}

# Managed Instance Group in us-central1
resource "google_compute_region_instance_group_manager" "app_mig_us" {
  name               = "ruby-app-mig-us"
  region             = "us-central1"
  base_instance_name = "ruby-app-us"
  version {
    instance_template = google_compute_instance_template.app_template.id
    name              = "v1"
  }
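  target_size = 2 # Initial desired instance count

  # Named port referenced by port_name = "http" on the backend service
  named_port {
    name = "http"
    port = 80
  }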

  auto_healing_policies {
    health_check      = google_compute_health_check.app_health_check.id
    initial_delay_sec = 300
  }
}

# Managed Instance Group in europe-west1
resource "google_compute_region_instance_group_manager" "app_mig_eu" {
  name               = "ruby-app-mig-eu"
  region             = "europe-west1"
  base_instance_name = "ruby-app-eu"
  version {
    instance_template = google_compute_instance_template.app_template.id
    name              = "v1"
  }
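  target_size = 2

  # Named port referenced by port_name = "http" on the backend service
  named_port {
    name = "http"
    port = 80
  }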

  auto_healing_policies {
    health_check      = google_compute_health_check.app_health_check.id
    initial_delay_sec = 300
  }
}

# Health Check for the application
resource "google_compute_health_check" "app_health_check" {
  name                = "ruby-app-health-check"
  check_interval_sec  = 5
  timeout_sec         = 5
  healthy_threshold   = 2
  unhealthy_threshold = 3

  http_health_check {
    port         = 80
    request_path = "/health" # Your application's health check endpoint
  }
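}

# Allow Google Cloud health check probes and load balancer traffic to reach
# the instances. Without this rule every backend is reported as unhealthy.
resource "google_compute_firewall" "allow_health_checks" {
  name          = "ruby-app-allow-health-checks"
  network       = "default"
  source_ranges = ["130.211.0.0/22", "35.191.0.0/16"]
  target_tags   = ["ruby-app"]

  allow {
    protocol = "tcp"
    ports    = ["80"]
  }
}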

# Single global backend service with one backend per regional MIG.
# The load balancer sends each request to the closest region that has
# healthy capacity, so an unhealthy region is drained automatically.
resource "google_compute_backend_service" "app_backend" {
  name                  = "ruby-app-backend"
  protocol              = "HTTP"
  port_name             = "http"
  timeout_sec           = 10
  enable_cdn            = false
  load_balancing_scheme = "EXTERNAL_MANAGED"

  backend {
    group           = google_compute_region_instance_group_manager.app_mig_us.instance_group
    balancing_mode  = "UTILIZATION"
    capacity_scaler = 1.0
  }

  backend {
    group           = google_compute_region_instance_group_manager.app_mig_eu.instance_group
    balancing_mode  = "UTILIZATION"
    capacity_scaler = 1.0
  }

  health_checks = [google_compute_health_check.app_health_check.id]
}

# URL Map sending all traffic to the shared backend service
resource "google_compute_url_map" "app_url_map" {
  name            = "ruby-app-url-map"
  default_service = google_compute_backend_service.app_backend.id
}

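# Target HTTP proxy: a forwarding rule cannot point at a URL map directly
resource "google_compute_target_http_proxy" "app_http_proxy" {
  name    = "ruby-app-http-proxy"
  url_map = google_compute_url_map.app_url_map.id
}

# Global External HTTP(S) Load Balancer Frontend
resource "google_compute_global_forwarding_rule" "app_http_forwarding_rule" {
  name                  = "ruby-app-http-forwarding-rule"
  ip_protocol           = "TCP"
  load_balancing_scheme = "EXTERNAL_MANAGED"
  port_range            = "80"
  target                = google_compute_target_http_proxy.app_http_proxy.id
  ip_address            = google_compute_global_address.app_static_ip.address
}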

# Static IP for the Load Balancer
resource "google_compute_global_address" "app_static_ip" {
  name = "ruby-app-static-ip"
}

# Output the Load Balancer IP
output "load_balancer_ip" {
  value = google_compute_global_address.app_static_ip.address
}

The startup-script.sh would contain commands to install Ruby, Bundler, your application’s dependencies, configure your web server (e.g., Puma), and start the application. It’s crucial that this script is idempotent and handles potential re-runs gracefully.

Application-Level Considerations

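Your Ruby application needs to know which DynamoDB replica to talk to and how to authenticate. Because the application runs on GCP while DynamoDB is an AWS service, the instance's GCP service account cannot be granted DynamoDB permissions directly. The application needs AWS credentials: either an IAM user's access keys kept in a secret store such as Google Secret Manager, or, preferably, short-lived credentials obtained by exchanging the service account's Google-issued identity token for an IAM role through AWS web identity federation. Attach the DynamoDB permissions to that IAM user or role.

The AWS SDK for Ruby does not infer a region from the GCP environment, and GCP region names are not valid AWS regions; there is no dynamodb.us-central1.amazonaws.com endpoint. Configure the region explicitly so that each deployment uses its nearest replica: instances in us-central1 should use us-east-1 (dynamodb.us-east-1.amazonaws.com), and instances in europe-west1 should use eu-west-1 (dynamodb.eu-west-1.amazonaws.com). Because every replica of a global table accepts both reads and writes, the application keeps working against its own region's replica when the load balancer shifts traffic between regions.

Your application’s configuration should not hardcode region-specific DynamoDB endpoints. Instead, set the AWS_REGION environment variable per region and let the SDK derive the endpoint from it. Since the example uses one instance template for both MIGs, the startup script can derive the value from the instance's zone, which is available from the GCP metadata server.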
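
One way to keep the region choice out of application code is a small lookup from the GCP region to the AWS region that hosts the nearest replica. The sketch below is illustrative only: replica_region_map, aws_region_for, and the GCP_REGION environment variable are hypothetical names, and the pairs simply mirror the regions used in this article.

```ruby
# Hypothetical mapping from the GCP region an instance runs in to the
# AWS region that hosts the nearest DynamoDB replica.
def replica_region_map
  {
    "us-central1"  => "us-east-1",
    "europe-west1" => "eu-west-1"
  }
end

# Fails loudly for unmapped regions instead of silently falling back,
# so a new deployment region cannot start without a replica decision.
def aws_region_for(gcp_region)
  replica_region_map.fetch(gcp_region) do
    raise ArgumentError, "no DynamoDB replica mapped for GCP region #{gcp_region.inspect}"
  end
end

# With the aws-sdk-dynamodb gem, the result would be passed to the client:
#   Aws::DynamoDB::Client.new(region: aws_region_for(ENV.fetch("GCP_REGION")))

puts aws_region_for("europe-west1")
```

Raising on an unknown region is a deliberate choice here: a silent default would send traffic from a new region to a distant replica without anyone noticing.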

Implementing Automated Failover Procedures

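The automated failover is primarily handled by the GCP Global External HTTP(S) Load Balancer’s health checks. The load balancer has no fixed primary or secondary: it sends each request to the closest region with healthy capacity. When the health checks for the instances in one region (e.g., us-central1) fail consistently, the load balancer stops routing to that region and sends its traffic to the remaining healthy region (e.g., europe-west1). This only works if both MIGs are backends of the backend service that receives the traffic.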
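
How quickly a failed backend is detected follows from the health check settings. As a rough estimate, assuming evenly spaced probes and ignoring propagation delay inside the load balancer, an instance that stops responding is marked unhealthy after unhealthy_threshold consecutive failed probes, each of which can take up to timeout_sec to be declared failed. The function name below is hypothetical; the values are the ones from the Terraform health check in this article.

```ruby
# Rough worst-case time, in seconds, between an instance failing and the
# health check marking it unhealthy. This is an estimate under the stated
# assumptions, not a guarantee from the load balancer.
def worst_case_detection_seconds(check_interval_sec:, timeout_sec:, unhealthy_threshold:)
  unhealthy_threshold * check_interval_sec + timeout_sec
end

puts worst_case_detection_seconds(
  check_interval_sec: 5,
  timeout_sec: 5,
  unhealthy_threshold: 3
)
```

With those settings the estimate is 20 seconds. Lowering the interval or threshold shortens detection at the cost of more probe traffic and a higher chance of reacting to a transient blip.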

Health Check Endpoint Design

The /health endpoint in your Ruby application is critical. It should perform essential checks:

Verify that the application process is running.
Attempt a cheap read against the DynamoDB replica for the instance's region. This confirms that credentials and connectivity to the data store are working; it does not prove that replication is caught up.
Check any other critical external dependencies (e.g., Redis, external APIs).

A simple Rails controller action for this might look like:

# app/controllers/health_controller.rb
require "aws-sdk-dynamodb"

class HealthController < ApplicationController
  # Key used only by the health check (illustrative value). The item does not
  # need to exist: get_item returns an empty result instead of raising when
  # the key is not found.
  HEALTH_CHECK_KEY = { "id" => "health-check" }.freeze

  def show
    # A real request is required here. Instantiating the client alone makes
    # no network call and does not validate credentials.
    dynamodb_client.get_item(table_name: "MyApplicationTable", key: HEALTH_CHECK_KEY)

    render json: { status: "ok", database: "connected" }, status: :ok
  rescue Aws::DynamoDB::Errors::ServiceError => e
    # DynamoDB answered with an error (throttling, access denied, missing table)
    render json: { status: "error", database: "disconnected", message: e.message }, status: :service_unavailable
  rescue StandardError => e
    # Covers networking failures and missing region or credentials
    render json: { status: "error", message: e.message }, status: :service_unavailable
  end

  private

  # Short timeouts and no retries keep the response inside the load
  # balancer's 5-second health check timeout. The region is read from the
  # AWS_REGION environment variable.
  def dynamodb_client
    Aws::DynamoDB::Client.new(http_open_timeout: 1, http_read_timeout: 2, retry_limit: 0)
  end
end

And the corresponding route:

# config/routes.rb
Rails.application.routes.draw do
  get '/health', to: 'health#show'
  # ... other routes
end

The initial_delay_sec in the auto_healing_policies for the MIGs is important. It gives new instances time to boot, install dependencies, and start the application before autohealing acts on failed health checks. If it is set too low, the MIG will recreate instances that are still starting up. Adjust this based on your application’s startup time.
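
When the endpoint covers several dependencies, it helps to report each one separately so that a failed health check points at its cause. The following is a minimal sketch in plain Ruby; run_health_checks is a hypothetical helper, and the lambdas stand in for real checks such as the DynamoDB read or a Redis ping.

```ruby
# Runs each named check and reports per-dependency results.
# `checks` maps a name to a callable; a check fails if it raises or
# returns false or nil.
def run_health_checks(checks)
  results = checks.to_h do |name, check|
    outcome = begin
      check.call ? "ok" : "failed"
    rescue StandardError => e
      "error: #{e.class}"
    end
    [name, outcome]
  end

  healthy = results.values.all? { |value| value == "ok" }
  { status: healthy ? "ok" : "error", checks: results }
end

# Stand-in checks for trying the helper without any external services.
report = run_health_checks({
  "process"  => -> { true },
  "dynamodb" => -> { raise IOError, "simulated timeout" }
})
puts report.inspect
```

In a controller, the report could be rendered as JSON with a 200 status when report[:status] is "ok" and 503 otherwise. Keep in mind that every dependency included here can take an otherwise working instance out of rotation, so only add checks whose failure really means the instance cannot serve traffic.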

Monitoring and Alerting

While failover is automated, proactive monitoring and alerting are essential for understanding when and why failovers occur. On the GCP side, Cloud Monitoring (formerly Stackdriver) can be configured to:

Monitor the health check status of your load balancer and backend services.
Track the number of unhealthy instances in your MIGs.
Alert on high error rates from your application.

On the AWS side, Amazon CloudWatch publishes the DynamoDB metrics for each region, including consumed read/write capacity, request latency, throttled requests, and the ReplicationLatency metric for global tables.

Set up alerts for when a region’s backend service becomes unhealthy or when traffic is significantly shifted to the secondary region. This allows your operations team to investigate the root cause of the failure in the primary region without impacting users.

Testing and Validation

A disaster recovery plan is only as good as its tested execution. Regularly simulate failures to validate your auto-failover mechanisms.

Simulating Failures

You can simulate a failure in a region by:

Manually stopping instances within a MIG in the primary region.
Temporarily modifying the health check endpoint in your application to return an unhealthy status.
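Blocking the health check probes with a temporary firewall rule, or injecting latency or packet loss on the instances themselves (for example with Linux tc/netem), so that health checks time out.
Simulating the loss of a DynamoDB region, for example by temporarily blocking the instances' egress to that regional endpoint. A real regional outage is rare and cannot be triggered on demand.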

Observe how the GCP Load Balancer reacts. Verify that traffic is rerouted to the healthy region and that your application remains accessible to users. Monitor logs and metrics during the test to identify any bottlenecks or unexpected behavior.
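
To put a number on what users experience during a drill, a small probe can poll the load balancer and measure the longest stretch of failed requests. This is a sketch under assumptions: probe and longest_outage_seconds are hypothetical helpers, LOAD_BALANCER_IP is a placeholder for the Terraform output, and a single probing location only approximates what real users see.

```ruby
require "net/http"
require "uri"

# Polls `url` `count` times, `interval` seconds apart, and returns an array
# of [Time, healthy?] samples. Any error or non-2xx response counts as failed.
def probe(url, count:, interval: 1)
  uri = URI(url)
  Array.new(count) do
    healthy = begin
      response = Net::HTTP.start(uri.host, uri.port, open_timeout: 2, read_timeout: 2) do |http|
        http.get(uri.request_uri)
      end
      response.is_a?(Net::HTTPSuccess)
    rescue StandardError
      false
    end
    sample = [Time.now, healthy]
    sleep interval
    sample
  end
end

# Longest outage in seconds, measured from the first failed sample of a run
# to the next successful sample (or to the last sample if it never recovers).
def longest_outage_seconds(samples)
  longest = 0.0
  outage_start = nil
  samples.each do |time, healthy|
    if healthy
      longest = [longest, time - outage_start].max if outage_start
      outage_start = nil
    else
      outage_start ||= time
    end
  end
  longest = [longest, samples.last[0] - outage_start].max if outage_start
  longest
end

# Example use during a drill (placeholder address, not run here):
#   samples = probe("http://LOAD_BALANCER_IP/health", count: 300)
#   puts longest_outage_seconds(samples)
```

Comparing the measured outage against the health check detection time gives a quick sanity check on whether failover behaved as configured.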

After a simulated failover, ensure that the system automatically recovers when the failed region is restored. This might involve restarting the stopped instances or reverting the health check endpoint. Once that region's backends pass their health checks again, the load balancer should resume routing nearby users to it.