Automating Multi-Region Redundancy for Ruby Architectures on Google Cloud

Establishing Multi-Region Redundancy for Ruby Applications on Google Cloud

Achieving robust disaster recovery for critical Ruby applications on Google Cloud Platform (GCP) necessitates a multi-region strategy. This involves not just replicating infrastructure but also implementing automated failover mechanisms and data synchronization. This guide details a practical approach using GCP services like Cloud SQL, Compute Engine, Cloud Load Balancing, and Cloud DNS, orchestrated with Infrastructure as Code (IaC) principles.

Designing the Multi-Region Architecture

The core of a multi-region setup involves deploying identical application stacks in at least two distinct GCP regions. A primary region will handle normal traffic, while a secondary region remains on standby, ready to take over in case of a regional outage. Key components include:

Compute Engine Instances: Identical Ruby application servers deployed in both regions.
Cloud SQL Instances: Replicated database instances to ensure data consistency.
Cloud Load Balancing: Global load balancing to direct traffic to the active region and facilitate failover.
Cloud DNS: Managing DNS records for seamless traffic redirection.
Infrastructure as Code (IaC): Tools like Terraform to manage and provision infrastructure consistently across regions.

Database Replication and Synchronization

Data integrity is paramount. For a primary/standby setup on Cloud SQL, we'll use a cross-region read replica: it receives changes from the primary asynchronously and, in a disaster scenario, can be promoted to a standalone instance. Promotion is irreversible, and writes that had not yet replicated when the outage began are lost, so replication lag effectively bounds the recovery point.

Configuring Cloud SQL Cross-Region Read Replicas

This can be managed via the `gcloud` CLI or Terraform. Below is a Terraform example for setting up a primary instance and a cross-region read replica.

resource "google_sql_database_instance" "primary_db" {
  name             = "ruby-app-primary-db"
  region           = "us-central1"
  database_version = "POSTGRES_13" # Or your preferred version
  settings {
    tier = "db-f1-micro" # Adjust tier as needed
    ip_configuration {
      ipv4_enabled    = true
      private_network = "projects/your-gcp-project-id/global/networks/your-vpc-network"
    }
  }
}

resource "google_sql_database_instance" "replica_db" {
  name             = "ruby-app-replica-db"
  region           = "us-east1" # Secondary region
  database_version = "POSTGRES_13"
  settings {
    tier = "db-f1-micro"
    ip_configuration {
      ipv4_enabled    = true
      private_network = "projects/your-gcp-project-id/global/networks/your-vpc-network"
    }
  }

  # Configure as a read replica of the primary instance
  master_instance_name = google_sql_database_instance.primary_db.name
}

In this configuration, `ruby-app-replica-db` in `us-east1` will continuously replicate data from `ruby-app-primary-db` in `us-central1`. For automated failover, we’ll need a mechanism to promote the replica.

Automating Application Deployment and Scaling

Consistency in deployment across regions is crucial. We’ll use Compute Engine instance templates and managed instance groups (MIGs) to ensure identical application stacks. Terraform is ideal for managing these resources.

Instance Templates and Managed Instance Groups

Define your application’s compute resources using instance templates. These templates are then used to create MIGs, which manage the scaling and health of your application instances.

resource "google_compute_instance_template" "app_template" {
  name_prefix  = "ruby-app-template-"
  machine_type = "e2-medium" # Adjust as needed
  tags         = ["ruby-app", "webserver"]

  disk {
    source_image = "debian-cloud/debian-11" # Or your preferred base image
    auto_delete  = true
    boot         = true
  }

  network_interface {
    network = "your-vpc-network"
    access_config {
      // Ephemeral IP for initial setup, or use static IPs with NAT
    }
  }

  metadata = {
    # User data for bootstrapping (e.g., cloud-init, startup scripts)
    startup-script = file("scripts/startup.sh")
  }

  service_account {
    scopes = ["cloud-platform"]
  }
}

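resource "google_compute_instance_group_manager" "primary_mig" {
  name               = "ruby-app-primary-mig"
  base_instance_name = "ruby-app-primary"
  zone               = "us-central1-a" # Specific zone within the region
  target_size        = 2 # Initial number of instances

  version {
    instance_template = google_compute_instance_template.app_template.id
    name              = "v1"
  }

  # Named port that backend services reference via port_name = "http".
  # Port 80 is an assumption; use the port your Ruby web server listens on.
  named_port {
    name = "http"
    port = 80
  }

  # Autohealing: recreate instances that fail the shared health check
  # (google_compute_health_check.default, defined with the load balancer).
  # A MIG references a health check resource; it does not define one inline.
  auto_healing_policies {
    health_check      = google_compute_health_check.default.id
    initial_delay_sec = 300 # Allow time for the startup script to finish
  }
}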

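resource "google_compute_instance_group_manager" "secondary_mig" {
  name               = "ruby-app-secondary-mig"
  base_instance_name = "ruby-app-secondary"
  zone               = "us-east1-b" # Specific zone in secondary region
  target_size        = 0 # Standby, scale up on failover

  version {
    instance_template = google_compute_instance_template.app_template.id
    name              = "v1"
  }

  # Same named port as the primary MIG (port 80 is an assumption).
  named_port {
    name = "http"
    port = 80
  }

  # Autohealing against the shared health check resource.
  auto_healing_policies {
    health_check      = google_compute_health_check.default.id
    initial_delay_sec = 300
  }

  # The failover automation resizes this MIG outside Terraform, so do not
  # let the next apply reset it to zero.
  lifecycle {
    ignore_changes = [target_size]
  }
}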

The `startup.sh` script would handle application installation, configuration, and starting the Ruby web server (e.g., Puma, Unicorn). It should also be configured to connect to the appropriate Cloud SQL instance based on the region.
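One way to make the startup script region-aware is to derive the region from the instance's zone and look up the database host for that region. The sketch below is illustrative only: it assumes the zone string is read from the Compute Engine metadata server (the `instance/zone` entry, which returns a value such as `projects/123456/zones/us-east1-b`), and the `DB_HOSTS` mapping, its IP addresses, and both helper names are hypothetical placeholders.

```python
# Hypothetical mapping of region -> Cloud SQL private IP; replace with real values.
DB_HOSTS = {
    "us-central1": "10.10.0.5",
    "us-east1": "10.20.0.5",
}


def region_from_zone(zone_path: str) -> str:
    """Turn 'projects/123456/zones/us-east1-b' (or 'us-east1-b') into 'us-east1'."""
    zone = zone_path.rsplit("/", 1)[-1]
    # A zone name is the region name plus a '-<letter>' suffix.
    region, _, _suffix = zone.rpartition("-")
    if not region:
        raise ValueError(f"not a zone name: {zone_path!r}")
    return region


def database_host_for(zone_path: str, hosts: dict = DB_HOSTS) -> str:
    """Pick the database host configured for the instance's region."""
    region = region_from_zone(zone_path)
    if region not in hosts:
        raise KeyError(f"no database configured for region {region!r}")
    return hosts[region]
```

The startup script could export the result (for example as a `DATABASE_HOST` environment variable, a name assumed here) before launching Puma or Unicorn, so the same instance template works in both regions.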

Global Load Balancing and Traffic Management

Google Cloud Load Balancing provides a global anycast IP address that directs traffic to the closest healthy backend. For multi-region failover, we’ll use a combination of a global external HTTP(S) load balancer and backend services configured for each region.

Configuring Global External HTTP(S) Load Balancer

This involves setting up a backend service that points to the MIGs in each region. The load balancer will automatically detect unhealthy backends and route traffic away from them.

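# Target proxy that connects the forwarding rule to the URL map.
# A forwarding rule cannot target a URL map directly.
resource "google_compute_target_http_proxy" "default" {
  name    = "ruby-app-http-proxy"
  url_map = google_compute_url_map.default.id
}

# Forwarding rule for the global load balancer
resource "google_compute_global_forwarding_rule" "default" {
  name                  = "ruby-app-global-forwarding-rule"
  ip_protocol           = "TCP"
  load_balancing_scheme = "EXTERNAL"
  port_range            = "80" # For HTTPS, use 443 with a target HTTPS proxy and SSL certificates
  target                = google_compute_target_http_proxy.default.id
  # ip_address is omitted, so an ephemeral IP is assigned. Reference a
  # google_compute_global_address here to reserve a static IP.
}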

# URL map to route requests to the backend service
resource "google_compute_url_map" "default" {
  name            = "ruby-app-url-map"
  default_service = google_compute_backend_service.app_backend.id
}

# A single backend service with both regional MIGs as backends.
# The load balancer sends traffic only to backends with healthy instances,
# so the secondary MIG takes over once it has been scaled up and the primary
# is unhealthy. While both MIGs are healthy, users are served from the
# closest region that has capacity.
resource "google_compute_backend_service" "app_backend" {
  name                  = "ruby-app-backend"
  protocol              = "HTTP"
  port_name             = "http"
  timeout_sec           = 10
  enable_cdn            = false
  load_balancing_scheme = "EXTERNAL"

  # Primary region
  backend {
    group = google_compute_instance_group_manager.primary_mig.instance_group
  }

  # Secondary region (standby)
  backend {
    group = google_compute_instance_group_manager.secondary_mig.instance_group
  }

  # Health check for the backend service
  health_checks = [google_compute_health_check.default.id]
}

# Health check configuration.
# The VPC firewall must allow Google's health-check ranges
# (130.211.0.0/22 and 35.191.0.0/16) to reach the instances.
resource "google_compute_health_check" "default" {
  name                = "ruby-app-health-check"
  check_interval_sec  = 5
  timeout_sec         = 5
  healthy_threshold   = 2
  unhealthy_threshold = 2

  http_health_check {
    request_path = "/healthz"
  }
}

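The global load balancer sends traffic only to backends that pass their health checks. For the secondary MIG in `us-east1` to take over when the primary MIG in `us-central1` becomes unhealthy, it must be attached to the backend service that the URL map references, and it must have running, healthy instances. With `target_size = 0` it serves nothing until it is scaled up.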

Automated Failover and Failback Orchestration

While the load balancer handles automatic failover for stateless applications, database promotion and scaling up the secondary region require explicit orchestration. This can be achieved using Cloud Functions, Cloud Run, or custom scripts triggered by monitoring alerts.

Database Promotion and MIG Scaling

When a regional outage is detected (e.g., via Cloud Monitoring alerts on the primary region’s health checks or database replication lag), a process should be initiated to:

Promote the Cloud SQL Read Replica: This involves a manual step or an automated script.
Update Application Configuration: Ensure application instances in the secondary region connect to the newly promoted primary database.
Scale Up the Secondary MIG: Increase `target_size` for `google_compute_instance_group_manager.secondary_mig` to handle the expected traffic.

A Cloud Function triggered by a Pub/Sub message (from a Cloud Monitoring alert) could execute these steps. The function would use the GCP client libraries to interact with Cloud SQL and Compute Engine APIs.

# Example Python Cloud Function for database promotion and scaling
import time

import google.auth
import googleapiclient.discovery

def promote_and_scale(event, context):
    # Authenticate with GCP using the function's service account
    credentials, project = google.auth.default()
    sqladmin = googleapiclient.discovery.build('sqladmin', 'v1beta4', credentials=credentials)
    compute = googleapiclient.discovery.build('compute', 'v1', credentials=credentials)

    replica_region = "us-east1"
    replica_db_instance = "ruby-app-replica-db"
    secondary_mig_name = "ruby-app-secondary-mig"
    secondary_mig_zone = "us-east1-b"
    new_target_size = 5 # Scale up to 5 instances

    # 1. Promote the replica (promoteReplica takes no request body)
    print(f"Promoting replica {replica_db_instance} in {replica_region}...")
    operation = sqladmin.instances().promoteReplica(
        project=project,
        instance=replica_db_instance
    ).execute()
    print(f"Promotion initiated: {operation['name']}")

    # Poll the operation until Cloud SQL reports it as DONE. Promotion can take
    # several minutes, so set the function timeout accordingly.
    while operation.get("status") != "DONE":
        time.sleep(10)
        operation = sqladmin.operations().get(
            project=project,
            operation=operation["name"]
        ).execute()
    if operation.get("error"):
        raise RuntimeError(f"Promotion failed: {operation['error']}")

    # 2. Update application configuration (if necessary, e.g., via startup scripts or environment variables)
    # This might involve updating instance templates or re-deploying the MIG with new configurations.
    # For simplicity, assume application instances can dynamically reconfigure or are deployed with region-aware settings.

    # 3. Scale up the secondary MIG
    print(f"Scaling up MIG {secondary_mig_name} in {secondary_mig_zone} to {new_target_size} instances...")
    request = compute.instanceGroupManagers().resize(
        project=project,
        zone=secondary_mig_zone,
        instanceGroupManager=secondary_mig_name,
        size=new_target_size
    )
    response = request.execute()
    print(f"Resize initiated: {response}")

    return "Failover process initiated."
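The function above acts on every message it receives, including the notification sent when an incident closes. A small guard can decode the Pub/Sub payload first and proceed only for an open incident. This is a sketch under assumptions: it presumes the message body is the JSON that Cloud Monitoring publishes to Pub/Sub notification channels, with an `incident.state` field whose value is `open` or `closed`; verify the payload shape in your project before relying on it. The helper name `is_open_incident` is hypothetical.

```python
import base64
import json


def is_open_incident(event: dict) -> bool:
    """Return True only when the Pub/Sub message reports an open incident."""
    raw = event.get("data")
    if not raw:
        return False
    try:
        # Pub/Sub delivers the message body base64-encoded in event["data"].
        payload = json.loads(base64.b64decode(raw).decode("utf-8"))
    except ValueError:
        # Covers invalid base64, invalid UTF-8, and invalid JSON.
        return False
    incident = payload.get("incident") if isinstance(payload, dict) else None
    # Assumed payload shape: {"incident": {"state": "open" | "closed", ...}}
    return isinstance(incident, dict) and incident.get("state") == "open"
```

Calling `is_open_incident(event)` at the top of `promote_and_scale` and returning early when it is false keeps a recovery notification from triggering a second promotion attempt.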

Failback Strategy

Failback involves returning operations to the primary region once it’s restored. This typically includes:

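Re-establishing replication from the new primary (in the secondary region) back to the original primary region. A promoted replica cannot be turned back into a replica, so this means creating a fresh read replica of the new primary in the original region.
Ensuring the database in the original region is up-to-date with data from the secondary region before it takes writes again.
Shifting traffic back to the primary region via the load balancer.
Scaling down the secondary MIG.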

Failback is often a more manual process to ensure data consistency and minimize downtime during the transition.
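Even when failback stays manual, the go/no-go decision can be written down as an explicit pre-flight check. The sketch below is one possible shape for it, not a prescribed procedure: the `FailbackStatus` fields, the `failback_blockers` helper, and the 5-second default lag limit are all assumptions chosen for illustration, and the input values would come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass
class FailbackStatus:
    replica_lag_seconds: float      # lag of the replica rebuilt in the original region
    primary_healthy_instances: int  # healthy app instances in the original region
    required_instances: int         # instances needed to carry production traffic


def failback_blockers(status: FailbackStatus, max_lag_seconds: float = 5.0) -> list:
    """Return human-readable reasons why failback should not start yet."""
    blockers = []
    if status.replica_lag_seconds > max_lag_seconds:
        blockers.append(
            f"replica lag {status.replica_lag_seconds}s exceeds {max_lag_seconds}s"
        )
    if status.primary_healthy_instances < status.required_instances:
        blockers.append(
            f"only {status.primary_healthy_instances} of "
            f"{status.required_instances} required instances are healthy"
        )
    return blockers
```

An empty list means none of the checked conditions blocks failback; any entry is a reason for the operator to wait.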

Monitoring and Alerting

Comprehensive monitoring is essential for detecting failures and triggering automated responses. Google Cloud Monitoring (formerly Stackdriver) should be configured to:

Monitor the health of Cloud SQL instances (replication lag, CPU, memory).
Monitor the health of Compute Engine instances and MIGs (instance health checks, CPU utilization).
Monitor the global load balancer’s backend health.
Set up alerting policies for critical metrics (e.g., high replication lag, unhealthy instances, load balancer errors).

Example Cloud Monitoring Alerting Policy

An alert can be configured to fire when the Cloud SQL replication lag for the secondary instance exceeds a defined threshold. This alert can then trigger the Cloud Function for failover.

# Example alert configuration (conceptual, actual configuration via Cloud Console or API)
ALERT POLICY: "Cloud SQL Replication Lag Critical"
CONDITION:
  METRIC: "cloudsql.googleapis.com/database/replication/replica_lag"
  RESOURCE_TYPE: "cloudsql_database"
  FILTER: "resource.labels.database_id = 'your-gcp-project-id:ruby-app-replica-db'"
  THRESHOLD:
    OPERATOR: "above"
    VALUE: 300 # Seconds (5 minutes of lag)
TRIGGER:
  COUNT: 1
NOTIFICATION:
  PUBSUB_TOPIC: "projects/your-gcp-project-id/topics/failover-trigger"

This Pub/Sub topic would then be configured to trigger the Cloud Function described earlier.
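Because promotion is disruptive, it can help to require a sustained breach rather than a single sample before failing over. The sketch below shows that idea in isolation: it checks whether the most recent lag samples all exceed the threshold. The function name, the sample list, and the `min_consecutive` default of 3 are assumptions for illustration; in Cloud Monitoring the equivalent control is the condition's duration window.

```python
def lag_breach_sustained(samples, threshold_seconds=300, min_consecutive=3):
    """Return True if the latest `min_consecutive` samples all exceed the threshold.

    `samples` is a list of replication-lag readings in seconds, oldest first.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(samples) < min_consecutive:
        # Not enough data yet to call the breach sustained.
        return False
    recent = samples[-min_consecutive:]
    return all(sample > threshold_seconds for sample in recent)
```

A single spike such as `[20, 450, 30]` does not qualify, while three consecutive readings above 300 seconds do.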

Conclusion

Implementing multi-region redundancy for Ruby applications on GCP involves a layered approach. By leveraging IaC for consistent infrastructure, Cloud SQL for data resilience, Global Load Balancing for traffic management, and automated orchestration for failover, you can build a highly available and disaster-resilient architecture. Continuous testing of failover and failback procedures is critical to ensure the system performs as expected during a real incident.