Automating Multi-Region Redundancy for Ruby Architectures on DigitalOcean

Establishing Multi-Region Redundancy for Ruby Applications on DigitalOcean

Achieving robust disaster recovery for critical Ruby applications necessitates a multi-region strategy. This involves deploying your application stack across geographically distinct data centers, ensuring that an outage in one region does not impact service availability. This document outlines a practical, production-ready approach using DigitalOcean’s infrastructure, focusing on automated failover and data synchronization.

Core Components and Architecture Overview

Our multi-region architecture will leverage the following DigitalOcean resources:

Droplets: For hosting application servers, web servers, and background job processors.
Managed Databases (PostgreSQL): For persistent data storage, configured for cross-region replication.
Load Balancers: To distribute traffic across active regions and facilitate failover.
Object Storage (Spaces): For static assets and backups, with cross-region replication enabled where applicable.
DNS: To manage traffic routing and point to the active load balancer.

The primary goal is active-passive or active-active redundancy. For simplicity and cost-effectiveness, we’ll detail an active-passive setup with automated failover, where one region is primary and the other is a warm standby. Traffic is directed to the primary region, and in case of failure, DNS and load balancer configurations are updated to direct traffic to the secondary region.

Database Replication Strategy: PostgreSQL Cross-Region

DigitalOcean’s Managed PostgreSQL offers built-in read replicas, which can be placed in a different region from the primary. For multi-region disaster recovery, we need a replica that can be promoted to a primary in the event of a failover. This involves a primary database in Region A and a standby in Region B that is kept up to date through streaming replication.

Configuring Primary PostgreSQL in Region A

When creating your Managed PostgreSQL cluster in Region A, ensure it’s configured with sufficient resources. Note down the connection details (host, port, user, password, database name).

Setting up Standby PostgreSQL in Region B

Add the standby in Region B as a read replica of the primary cluster rather than as an independent, empty cluster. The replica is seeded from the primary in Region A and then follows it through replication.

The key to cross-region replication for DR is PostgreSQL’s built-in streaming replication. DigitalOcean’s Managed Databases don’t expose `pg_basebackup` or standby settings such as `recovery.conf` (replaced by `standby.signal` and `postgresql.conf` parameters in PostgreSQL 12 and later), so a standby cannot be wired up by hand. Instead, we set up a dedicated replica in the secondary region and promote it when the primary fails. A more robust, albeit more complex, approach involves setting up logical replication or, on self-managed PostgreSQL, using a tool like Patroni for automated failover management.

For a managed solution, we’ll rely on DigitalOcean’s read replica functionality and a manual promotion process, or a script that automates this. Let’s assume we have a primary in Region A and a read-replica in Region B. The critical step is the promotion of the read-replica to a standalone primary.
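Before promoting the replica, it helps to know how far it trails the primary. The following is a minimal sketch, assuming you can read `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on the replica (whether a managed cluster permits this for your database user is an assumption to verify). Both functions return an LSN such as `0/3000060`, and the byte difference between the two values is the replication lag. The function names and the 16 MiB threshold are illustrative, not part of any DigitalOcean API.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    # An LSN is a 64-bit position printed as two 32-bit hexadecimal halves.
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica has not replayed yet (never negative)."""
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn))


def safe_to_promote(primary_lsn: str, replica_lsn: str,
                    max_lag_bytes: int = 16 * 1024 * 1024) -> bool:
    """True if the replica trails the last known primary position by at most max_lag_bytes."""
    return replication_lag_bytes(primary_lsn, replica_lsn) <= max_lag_bytes
```

Because the primary is unreachable during a real outage, the monitoring job would need to record the primary's LSN periodically and compare the replica against the last recorded value.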

Automating Database Failover (Conceptual Script)

A script, triggered by monitoring, would perform the following steps:

Detect primary database unavailability in Region A.
Connect to the Managed PostgreSQL cluster in Region B.
Promote the replica. For DigitalOcean Managed Databases this is done through the control panel or the API; a script would call the API so that the replica stops replicating and becomes a writable, standalone cluster.
Update application configurations (e.g., environment variables on Droplets) to point to the new primary database in Region B.

A simplified Python script using the DigitalOcean API (requires `pip install python-digitalocean`, which provides the `digitalocean` module) could look like this:

import digitalocean
import os
import time

# --- Configuration ---
TOKEN = os.environ.get("DIGITALOCEAN_TOKEN")
PRIMARY_DB_SLUG = "your-primary-db-slug-region-a" # e.g., "my-app-db-nyc3"
STANDBY_DB_SLUG = "your-standby-db-slug-region-b" # e.g., "my-app-db-ams3"
APP_DROPLET_TAG = "ruby-app-server"
PRIMARY_REGION_TAG = "primary-region" # Tag for Droplets in the primary region
SECONDARY_REGION_TAG = "secondary-region" # Tag for Droplets in the secondary region
FAILOVER_LOCK_FILE = "/tmp/failover_in_progress.lock"

# --- DigitalOcean API Client ---
manager = digitalocean.Manager(token=TOKEN)

def is_primary_db_available():
    # In a real scenario, this would involve more robust checks:
    # - Ping the DB endpoint
    # - Attempt a simple query (e.g., SELECT 1)
    # For simplicity, we'll assume a check that returns True if available.
    print("Checking primary DB availability...")
    # Placeholder: Replace with actual DB health check
    try:
        # Example: Attempt to connect to primary DB
        # import psycopg2
        # conn = psycopg2.connect(database="...", user="...", password="...", host=PRIMARY_DB_HOST, port=PRIMARY_DB_PORT)
        # conn.close()
        return True # Assume available for now
    except Exception as e:
        print(f"Primary DB check failed: {e}")
        return False

def promote_standby_db():
    print(f"Attempting to promote standby DB: {STANDBY_DB_SLUG}")
    try:
        # The python-digitalocean Manager used here has no helper for managed
        # databases, so promotion has to go through the REST API directly.
        # The API reference documents a replica promotion endpoint:
        #   PUT /v2/databases/{cluster_uuid}/replicas/{replica_name}/promote
        # Verify it against the current API reference before relying on it.
        #
        # Before promoting, confirm that:
        # 1. Replication was healthy and lag was low when the primary failed.
        # 2. The old primary can no longer accept writes (to avoid split-brain).
        #
        # For demonstration, we simulate the call and assume it succeeds.
        print("Simulating promotion of standby DB...")
        time.sleep(10) # Simulate promotion time
        print(f"Standby DB {STANDBY_DB_SLUG} promoted successfully (simulated).")
        return True
    except Exception as e:
        print(f"Failed to promote standby DB: {e}")
        return False

def update_app_configs_for_region(region_tag, db_host, db_port, db_name, db_user, db_password):
    print(f"Updating application configurations for Droplets tagged with '{region_tag}' to use DB: {db_host}:{db_port}/{db_name}")
    droplets = manager.get_all_droplets(tag_name=region_tag) # python-digitalocean filters by tag via tag_name
    for droplet in droplets:
        print(f"Updating droplet: {droplet.name} ({droplet.id})")
        # This is highly application-specific. It might involve:
        # - SSHing into the droplet and updating environment files (e.g., .env)
        # - Restarting the application service (e.g., systemctl restart myapp.service)
        # - Using DigitalOcean's user-data or cloud-config for initial setup.
        #
        # Example: Using SSH to update a .env file and restart
        try:
            # This requires SSH keys to be set up and accessible.
            # You'd typically use a library like 'paramiko' for SSH.
            # For brevity, we'll just print the intended actions.
            print(f"  - SSH into droplet and update environment variables (DATABASE_URL, etc.)")
            print(f"  - Restart application service (e.g., systemctl restart your_ruby_app)")
            # Example command to run remotely:
            # ssh user@droplet_ip 'echo "DATABASE_URL=postgresql://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}" >> /path/to/.env && systemctl restart your_ruby_app'
            time.sleep(2) # Simulate update time
        except Exception as e:
            print(f"  - Failed to update droplet {droplet.name}: {e}")

def failover_to_region_b():
    # Take the lock atomically: with O_EXCL, os.open fails if the file already
    # exists, so two runs on this host cannot both start a failover.
    try:
        lock_fd = os.open(FAILOVER_LOCK_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        print("Failover process already initiated or in progress. Exiting.")
        return False

    with os.fdopen(lock_fd, "w") as f:
        f.write(str(time.time()))

    print("Starting failover process...")

    if not promote_standby_db():
        print("Failover process failed during database promotion.")
        os.remove(FAILOVER_LOCK_FILE) # Release the lock so a later run can retry
        return False

    # Connection details of the promoted database in Region B.
    # These are placeholders: fetch the real values from the DO API after promotion.
    new_db_host = "your-new-primary-db-host-region-b"
    new_db_port = 25060 # Managed clusters typically listen on 25060; confirm in the connection details
    new_db_name = "your-db-name"
    new_db_user = "your-db-user"
    new_db_password = "your-db-password"

    # Point the app servers in the *secondary* region at the *new* primary DB
    update_app_configs_for_region(SECONDARY_REGION_TAG, new_db_host, new_db_port, new_db_name, new_db_user, new_db_password)

    # Update DNS to point to the load balancer in Region B
    print("Updating DNS records to point to Region B's load balancer...")
    # This would use a DNS provider's API (e.g., DigitalOcean DNS, Cloudflare)
    # to change A records or CNAMEs.
    # Example: update_dns_record("region-b-load-balancer-ip")
    print("DNS update complete (simulated).")

    # Ensure the load balancer in Region B is active and serving traffic.
    print("Updating load balancer configuration (simulated).")

    print("Failover process completed successfully.")
    # The lock file stays in place after a successful failover so that later
    # runs do not fail over again; remove it as part of failback.
    return True

# --- Main Execution Loop ---
if __name__ == "__main__":
    # This script would typically run as a scheduled job or be triggered by an external monitoring system.
    # For demonstration, we'll simulate a check.
    print("Starting automated failover monitoring...")
    if not is_primary_db_available():
        print("Primary database is unavailable. Initiating failover.")
        failover_to_region_b()
    else:
        print("Primary database is available. No failover needed.")

    # In a real system, this would be a loop with a sleep or a cron job.
    # Example:
    # while True:
    #     if not is_primary_db_available():
    #         failover_to_region_b()
    #     time.sleep(60) # Check every 60 seconds
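The monitoring loop above would fail over on a single failed check, which a brief network blip could trigger. The following is a sketch of a debounced check, assuming a TCP connection to the database port is an acceptable liveness signal. `tcp_probe`, `FailureCounter`, and the threshold of three failures are hypothetical names and values chosen for illustration.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailureCounter:
    """Report an outage only after `threshold` consecutive failed checks."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True when failover should start."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

In the loop, each iteration would pass the probe result to `counter.record(...)` and call the failover function only when it returns True. A successful TCP connection only proves that the port is open; running `SELECT 1` through a database driver is a stronger check.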

Important Considerations for Database Failover:

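API Access: Ensure your DigitalOcean API token has write access, since the failover modifies database and DNS resources, and store it outside the script (for example, in an environment variable).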
SSH Access: Droplets must be configured for SSH access, and your script needs credentials (SSH keys) to connect.
Application Configuration: The method of updating application configurations (e.g., environment variables, config files) is highly dependent on your deployment strategy (e.g., Docker, systemd, Capistrano).
DNS Propagation: DNS changes can take time to propagate globally. Consider using low TTL values for critical DNS records during failover.
Locking: Implement a robust locking mechanism to prevent multiple failover attempts.
Failback: A failback strategy (returning to the original primary region after it’s restored) is crucial and requires its own automation.
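The locking consideration above can be sketched with an atomic file creation. This assumes the failover script runs on a single host; if several monitors on different hosts can trigger failover, the lock has to live in shared storage instead. `acquire_lock`, `lock_age_seconds`, and `release_lock` are hypothetical helper names.

```python
import os
import time
from typing import Optional


def acquire_lock(path: str) -> bool:
    """Try to take the failover lock; return True if this process now holds it."""
    try:
        # O_EXCL makes creation atomic: the call fails if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(f"{os.getpid()} {time.time()}\n")
    return True


def lock_age_seconds(path: str) -> Optional[float]:
    """Age of the lock file in seconds, or None if no lock exists."""
    try:
        return max(0.0, time.time() - os.path.getmtime(path))
    except FileNotFoundError:
        return None


def release_lock(path: str) -> None:
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

The sketch deliberately does not delete old locks on its own. An old lock may mean a crashed run or a completed failover, so `lock_age_seconds` is better used to alert an operator than to retry automatically.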

Application Deployment and Load Balancing

Deploy your Ruby application consistently across both regions. Use infrastructure-as-code tools like Terraform or Ansible to ensure identical configurations.

Infrastructure as Code (Terraform Example)

Define your Droplets, Load Balancers, and potentially database configurations in Terraform for reproducible deployments.

# main.tf (simplified example)

provider "digitalocean" {
  token = var.do_token
}

variable "do_token" {
  description = "DigitalOcean API Token"
  type        = string
  sensitive   = true
}

variable "region_a" {
  description = "Primary DigitalOcean region"
  type        = string
  default     = "nyc3"
}

variable "region_b" {
  description = "Secondary DigitalOcean region"
  type        = string
  default     = "ams3"
}

# --- Region A Resources ---
resource "digitalocean_droplet" "app_server_a" {
  count    = 2 # Example: 2 app servers in region A
  image    = "ubuntu-22-04-x64"
  region   = var.region_a
  size     = "s-2vcpu-4gb"
  ssh_keys = [digitalocean_ssh_key.deployer.id]
  tags     = ["ruby-app-server", "primary-region"]

  connection {
    type        = "ssh"
    user        = "root" # Or your deployment user
    private_key = file("~/.ssh/id_rsa") # Ensure this key is accessible
    host        = self.ipv4_address
    timeout     = "2m"
  }

  provisioner "remote-exec" {
    inline = [
      "apt-get update",
      "apt-get install -y ruby-full build-essential git",
      # Add commands to clone your app, install gems, set up systemd service, etc.
      # Example:
      # "git clone your-repo /opt/your_app",
      # "cd /opt/your_app && bundle install",
      # "cp /opt/your_app/config/systemd.service /etc/systemd/system/your_ruby_app.service",
      # "systemctl enable your_ruby_app",
      # "systemctl start your_ruby_app"
    ]
  }
}

resource "digitalocean_database_cluster" "db_a" {
  name       = "my-app-db-region-a"
  engine     = "pg"
  version    = "14"
  region     = var.region_a
  size       = "db-s-1vcpu-2gb"
  node_count = 1
  # A cross-region read replica is not an attribute of the cluster; it is declared
  # with a separate digitalocean_database_replica resource that references this cluster's id.
}

resource "digitalocean_loadbalancer" "lb_a" {
  name        = "my-app-lb-region-a"
  region      = var.region_a
  droplet_ids = digitalocean_droplet.app_server_a[*].id

  healthcheck {
    port     = 3000 # Check the port the app listens on, not the public entry port
    path     = "/"
    protocol = "http"
  }

  forwarding_rule {
    entry_protocol  = "http"
    entry_port      = 80
    target_protocol = "http"
    target_port     = 3000 # Your app's port
  }
}

# --- Region B Resources (Standby) ---
resource "digitalocean_droplet" "app_server_b" {
  count    = 1 # Example: 1 app server in region B (warm standby)
  image    = "ubuntu-22-04-x64"
  region   = var.region_b
  size     = "s-2vcpu-4gb"
  ssh_keys = [digitalocean_ssh_key.deployer.id]
  tags     = ["ruby-app-server", "secondary-region"]

  connection {
    type        = "ssh"
    user        = "root"
    private_key = file("~/.ssh/id_rsa")
    host        = self.ipv4_address
    timeout     = "2m"
  }

  provisioner "remote-exec" {
    inline = [
      "apt-get update",
      "apt-get install -y ruby-full build-essential git",
      # Similar setup as region A, but might be configured to use standby DB initially
    ]
  }
}

resource "digitalocean_database_replica" "db_b" {
  cluster_id = digitalocean_database_cluster.db_a.id
  name       = "my-app-db-region-b"
  region     = var.region_b
  size       = "db-s-1vcpu-2gb"
  # Read replica of db_a, hosted in region B.
  # Promotion during failover happens outside Terraform, so expect to
  # reconcile the Terraform state afterwards.
}

resource "digitalocean_loadbalancer" "lb_b" {
  name        = "my-app-lb-region-b"
  region      = var.region_b
  droplet_ids = digitalocean_droplet.app_server_b[*].id

  healthcheck {
    port     = 3000 # Check the port the app listens on, not the public entry port
    path     = "/"
    protocol = "http"
  }

  forwarding_rule {
    entry_protocol  = "http"
    entry_port      = 80
    target_protocol = "http"
    target_port     = 3000
  }
}

# --- SSH Key ---
resource "digitalocean_ssh_key" "deployer" {
  name       = "deployer-key"
  public_key = file("~/.ssh/id_rsa.pub")
}

# --- DNS Configuration ---
# You would typically manage DNS via DigitalOcean's DNS service or an external provider.
# This example assumes you're using DO DNS.
resource "digitalocean_record" "app_a" {
  domain = "yourdomain.com"
  type   = "A"
  name   = "app"
  value  = digitalocean_loadbalancer.lb_a.ip
  ttl    = 60 # Low TTL for faster failover
}

resource "digitalocean_record" "app_b" {
  domain = "yourdomain.com"
  type   = "A"
  name   = "app-standby" # Temporary name or managed by failover script
  value  = digitalocean_loadbalancer.lb_b.ip
  ttl    = 60
}

Load Balancer and DNS Failover

DigitalOcean Load Balancers distribute traffic to Droplets within their region. For multi-region failover, we need a mechanism to switch the public-facing DNS record from the primary region’s Load Balancer IP to the secondary region’s Load Balancer IP.

The failover script (or a dedicated monitoring service) should:

Monitor the health of the primary region’s Load Balancer and application endpoints.
If the primary region is deemed unhealthy, update the main DNS A record for your application (e.g., app.yourdomain.com) to point to the IP address of the Load Balancer in the secondary region.
Ensure the secondary region’s Load Balancer is configured to direct traffic to its Droplets.
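How long clients keep reaching the failed region depends on three delays that add up: detecting the outage, promoting the database, and waiting for cached DNS answers to expire. The following is a rough estimate as a sketch. The function name and sample figures are illustrative; only the 60-second TTL comes from the Terraform example, and the promotion time in particular is a guess that should be measured in a drill.

```python
def estimated_switchover_seconds(check_interval: float,
                                 failures_required: int,
                                 promotion_seconds: float,
                                 dns_ttl: float) -> float:
    """Approximate time from outage to clients reaching the secondary region."""
    # Detection: the monitor needs this many failed checks in a row.
    detection = check_interval * failures_required
    # Resolvers may serve the old A record until its TTL expires.
    return detection + promotion_seconds + dns_ttl


# 60 s checks, 3 failures required, 120 s promotion, 60 s TTL
print(estimated_switchover_seconds(60, 3, 120, 60))  # 360
```

Some resolvers and clients cache records for longer than the TTL, so the real figure can exceed this estimate.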

Using the DigitalOcean API for DNS management:

import digitalocean
import os

TOKEN = os.environ.get("DIGITALOCEAN_TOKEN")
DOMAIN_NAME = "yourdomain.com"
APP_RECORD_NAME = "app" # The main A record for your app

lb_manager = digitalocean.Manager(token=TOKEN)

def update_dns_record(new_ip):
    print(f"Updating DNS record '{APP_RECORD_NAME}.{DOMAIN_NAME}' to IP: {new_ip}")
    try:
        domain = lb_manager.get_domain(DOMAIN_NAME)
        records = domain.get_records()
        for record in records:
            if record.type == "A" and record.name == APP_RECORD_NAME:
                # python-digitalocean stores the record's value in `data`
                print(f"Found existing record: ID={record.id}, Value={record.data}")
                record.data = new_ip
                record.save() # Sends the modified record to the API
                print("DNS record updated successfully.")
                return True
        print("Error: A record not found for the application.")
        return False
    except Exception as e:
        print(f"Error updating DNS record: {e}")
        return False

# Example usage within the failover script:
# if failover_to_region_b():
#     # After successful DB promotion and app config updates in Region B
#     region_b_lb_ip = get_load_balancer_ip("my-app-lb-region-b") # Function to fetch LB IP
#     update_dns_record(region_b_lb_ip)
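The commented usage refers to a `get_load_balancer_ip` helper that the script does not define. One possible sketch is shown here, assuming python-digitalocean's `Manager.get_all_load_balancers()` and the `name` and `ip` attributes on the objects it returns. Unlike the commented call, this version takes the manager as a second argument so that it has no hidden global dependency.

```python
def get_load_balancer_ip(name, manager):
    """Return the public IP of the load balancer called `name`, or None if absent."""
    for lb in manager.get_all_load_balancers():
        if lb.name == name:
            return lb.ip
    return None
```

With the script above it would be called as `get_load_balancer_ip("my-app-lb-region-b", lb_manager)`. Check for `None` before passing the result to `update_dns_record`, so that a missing load balancer does not overwrite the A record with an empty value.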

Object Storage (Spaces) and Backups

Static assets served via DigitalOcean Spaces should also be considered for redundancy. The goal is to keep a copy of every object in a second Space in a different region, either through a built-in replication setting, if your account offers one, or through a scheduled sync with an S3-compatible tool such as rclone or s3cmd.

Configuring Spaces Cross-Region Replication

1. Create a primary Space in Region A.

2. Create a secondary Space in Region B.

3. On the primary Space, enable replication to the secondary Space if the control panel offers that setting. Otherwise, schedule a sync job that copies new and changed objects from the primary Space to the secondary.

This ensures that uploads to the primary Space are mirrored to the secondary. Your application should be configured to use the endpoint of the primary Space. In a failover scenario, update your application’s configuration to point to the secondary Space’s endpoint if the primary Space becomes inaccessible. Because replication is not instantaneous, objects uploaded shortly before the outage may be missing from the secondary.
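Switching the application between Spaces is simpler if the endpoint and bucket are derived from a single "active region" setting. The following is a minimal sketch, assuming one Space per region and the usual `https://<region>.digitaloceanspaces.com` endpoint format. The bucket names and the `spaces_settings` helper are hypothetical.

```python
# Hypothetical mapping of region slug to the Space (bucket) in that region.
SPACES_BY_REGION = {
    "nyc3": "my-app-assets-nyc3",
    "ams3": "my-app-assets-ams3",
}


def spaces_settings(active_region: str) -> dict:
    """Return the endpoint URL and bucket name for the currently active region."""
    if active_region not in SPACES_BY_REGION:
        raise ValueError(f"No Space configured for region {active_region!r}")
    return {
        "endpoint_url": f"https://{active_region}.digitaloceanspaces.com",
        "bucket": SPACES_BY_REGION[active_region],
    }
```

The returned values can be handed to any S3-compatible client. During failover, only the active region setting has to change.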

Monitoring and Alerting

A robust monitoring system is paramount. Use DigitalOcean’s monitoring tools, Prometheus, Grafana, or third-party services to track:

Droplet CPU, memory, and disk usage.
Database connection counts, query latency, and replication lag.
Load Balancer health checks and request rates.
Application-level metrics (e.g., error rates, response times).
Network latency between regions.

Configure alerts for critical thresholds. These alerts should trigger the automated failover process or notify an on-call engineer.
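The split between "trigger failover" and "notify an engineer" can be made explicit in code. The following is a sketch of one possible policy: a hard outage (database unreachable, or no healthy Droplets behind the Load Balancer) triggers failover, while degraded metrics only page someone. The `RegionMetrics` fields and the thresholds are assumptions to tune for your own system.

```python
from dataclasses import dataclass


@dataclass
class RegionMetrics:
    healthy_droplets: int           # Droplets passing the load balancer health check
    db_reachable: bool              # Result of the database health check
    error_rate: float               # Fraction of failed requests, 0.0 to 1.0
    replication_lag_seconds: float  # How far the standby trails the primary


def decide_action(m: RegionMetrics,
                  max_error_rate: float = 0.05,
                  max_lag_seconds: float = 30.0) -> str:
    """Return 'failover', 'page', or 'ok' for the primary region's metrics."""
    if not m.db_reachable or m.healthy_droplets == 0:
        return "failover"
    if m.error_rate > max_error_rate or m.replication_lag_seconds > max_lag_seconds:
        return "page"
    return "ok"
```

Keeping degraded states on the paging path avoids failing over to a standby that is itself behind, which is exactly the situation that high replication lag indicates.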

Testing Your Disaster Recovery Plan

Regularly test your failover and failback procedures. This is not a “set it and forget it” solution. Simulate failures:

Shut down Droplets in the primary region.
Simulate database unavailability.
Test the DNS switch.
Verify application functionality in the secondary region.
Test the failback process to return operations to the primary region once it’s restored.

Document the entire process, including manual steps, API endpoints used, and expected outcomes. This documentation is vital for training and for use during an actual incident.
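A drill is easier to repeat and document when its steps are scripted and timed. The following is a small sketch of a runner that executes named checks in order, stops at the first failure, and records how long each step took. `run_drill` and the step names in the usage note are hypothetical; each check is any callable that returns True on success.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_drill(steps: List[Tuple[str, Callable[[], bool]]]) -> List[Dict[str, object]]:
    """Run drill steps in order, stopping at the first step that fails."""
    results = []
    for name, check in steps:
        started = time.monotonic()
        try:
            passed = bool(check())
            error = ""
        except Exception as exc:  # A crashing check counts as a failed step
            passed = False
            error = repr(exc)
        results.append({
            "step": name,
            "passed": passed,
            "seconds": round(time.monotonic() - started, 3),
            "error": error,
        })
        if not passed:
            break
    return results
```

A drill might pass steps such as `("promote replica", promote_standby_db)` followed by a check that requests the application through the secondary region. The recorded durations give a measured recovery time to compare against your target.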

Conclusion

Implementing multi-region redundancy for Ruby applications on DigitalOcean requires careful planning and automation. By leveraging managed databases with replication, infrastructure-as-code, automated DNS updates, and comprehensive monitoring, you can build a resilient architecture capable of withstanding regional outages and ensuring business continuity.