Automating Multi-Region Redundancy for Ruby Architectures on Linode

Establishing Multi-Region Redundancy: A Linode-Centric Approach for Ruby Applications

Achieving robust disaster recovery for critical Ruby applications necessitates a multi-region strategy. This post details a practical, production-ready implementation leveraging Linode’s global infrastructure, focusing on automated failover and data synchronization. We’ll cover database replication, application server deployment, and load balancing across geographically dispersed data centers.

Database Replication Strategy: PostgreSQL in a Primary-Secondary Configuration

For relational data, PostgreSQL’s built-in streaming replication is a reliable and performant choice. We’ll set up a primary instance in one Linode region and a synchronous or asynchronous standby in another. Asynchronous replication is generally preferred for cross-region setups to avoid latency-induced write stalls, though synchronous offers stronger consistency guarantees at the cost of performance.

Primary Linode (e.g., us-east) Configuration:

Ensure PostgreSQL is installed and running. Modify postgresql.conf and pg_hba.conf.

# /etc/postgresql/[version]/main/postgresql.conf
listen_addresses = '*'
wal_level = replica
max_wal_senders = 5
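wal_keep_size = '1GB' # PostgreSQL 13+; on 12 and earlier use wal_keep_segments = 64 (64 x 16 MB)
synchronous_commit = on # Keeps local commits durable; off risks losing recent commits if the primary crashes
synchronous_standby_names = '' # Empty, so replication to the standby stays asynchronous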

# /etc/postgresql/[version]/main/pg_hba.conf
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     replicator      [secondary_ip]/32       md5
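# Narrow 0.0.0.0/0 to your application servers' addresses; as written it admits any host that knows a password.
host    all             all             0.0.0.0/0               md5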

Create a replication user:

CREATE USER replicator REPLICATION LOGIN PASSWORD 'your_replication_password';
GRANT CONNECT ON DATABASE your_database TO replicator;

Restart PostgreSQL on the primary.

Secondary Linode (e.g., eu-west) Configuration:

Install PostgreSQL. Stop the PostgreSQL service before proceeding.

sudo systemctl stop postgresql

Clear the default data directory, then clone the primary with pg_basebackup. The -R flag writes primary_conninfo into postgresql.auto.conf and, on PostgreSQL 12+, creates standby.signal, so the server starts as a streaming standby. On older versions it writes recovery.conf instead.

sudo rm -rf /var/lib/postgresql/[version]/main/*
sudo -u postgres pg_basebackup -h [primary_ip] -U replicator -D /var/lib/postgresql/[version]/main -R -P -v -W

Without -R, configure the standby by hand. The marker file must be standby.signal; recovery.signal starts targeted recovery and does not keep the server in standby mode.

# For PostgreSQL 12+, create an empty standby.signal file and set primary_conninfo in postgresql.conf.
# For older versions, create recovery.conf:
# /var/lib/postgresql/[version]/main/recovery.conf
# standby_mode = 'on'
# primary_conninfo = 'host=[primary_ip] port=5432 user=replicator password=your_replication_password'
# trigger_file = '/tmp/promote_standby'

Ensure listen_addresses is set to '*' in postgresql.conf on the secondary if you intend to promote it and have other services connect directly. Then start PostgreSQL.

sudo systemctl start postgresql

Monitor replication status on the primary:

SELECT * FROM pg_stat_replication;
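
The sent_lsn and replay_lsn columns of that view can be turned into a lag figure in bytes. The following is a minimal sketch, assuming you have already fetched the two values as strings. The function names are illustrative and not part of any library. PostgreSQL can also compute the same difference server-side with pg_wal_lsn_diff().

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    # An LSN is two hexadecimal halves: the upper and lower 32 bits of a 64-bit position.
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL sent by the primary that the standby has not yet replayed."""
    return max(0, lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn))


if __name__ == "__main__":
    # Hypothetical values, shaped like the columns in pg_stat_replication.
    print(replication_lag_bytes("0/3000060", "0/3000000"))  # 96
```

A lag that keeps growing between checks suggests the standby is falling behind, which matters for how much data an asynchronous failover could lose.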

Application Server Deployment: Stateless Ruby Instances

Our Ruby application servers (e.g., Puma, Unicorn) should be stateless. This means any session data, file uploads, or temporary data must be stored externally. We’ll deploy identical application stacks on multiple Linode instances across different regions.

Infrastructure as Code (IaC): Use tools like Terraform or Ansible to provision and configure these servers consistently. This ensures identical environments and simplifies updates.

Example Ansible Playbook Snippet (for provisioning a web server):

---
- name: Deploy Ruby Application Server
  hosts: webservers
  become: yes
  vars:
    ruby_version: "3.1.2"
    app_dir: "/srv/my_ruby_app"
    repo_url: "git@[git_host]:your_org/my_ruby_app.git" # SSH clone URL; substitute your Git host

  tasks:
    - name: Install system dependencies
      apt:
        name: ['git', 'build-essential', 'libssl-dev', 'zlib1g-dev', 'libreadline-dev', 'libyaml-dev', 'libsqlite3-dev', 'sqlite3', 'libxml2-dev', 'libxslt1-dev', 'libcurl4-openssl-dev', 'software-properties-common', 'libffi-dev']
        state: present

    - name: Add Ruby PPA and install Ruby
      apt_repository:
        repo: 'ppa:rael-gc/ruby-3.1'
        state: present
      when: ansible_distribution == 'Ubuntu'

    - name: Install Ruby
      apt:
        name: "ruby{{ ruby_version }}"
        state: present
      when: ansible_distribution == 'Ubuntu'

    - name: Install Bundler
      gem:
        name: bundler
        version: "~> 2.3"
        executable: "/usr/bin/gem-{{ ruby_version }}"
        user_install: no # Install system-wide; the default installs into the become user's home
      environment:
        PATH: "/usr/local/bin:{{ ansible_env.PATH }}"

    - name: Create application directory
      file:
        path: "{{ app_dir }}"
        state: directory
        owner: www-data
        group: www-data
        mode: '0755'

    - name: Clone or update application repository
      git:
        repo: "{{ repo_url }}"
        dest: "{{ app_dir }}"
        version: main # Or a specific tag/branch
        accept_hostkey: yes
      become_user: www-data
      notify: Restart application service # Restart only when new code was pulled

    - name: Install application gems
      bundler:
        chdir: "{{ app_dir }}"
        state: present
      environment:
        PATH: "/usr/local/bin:{{ ansible_env.PATH }}"
      become_user: www-data

    - name: Configure application environment (e.g., .env file)
      template:
        src: templates/env.j2
        dest: "{{ app_dir }}/.env"
        owner: www-data
        group: www-data
        mode: '0600' # The file usually holds secrets, so keep it unreadable to other users
      notify: Restart application service # Restart only when the configuration changed

    - name: Ensure application service is running (e.g., systemd unit)
      systemd:
        name: my_ruby_app
        state: started
        enabled: yes
        daemon_reload: yes

  handlers:
    - name: Restart application service
      systemd:
        name: my_ruby_app
        state: restarted
        daemon_reload: yes

Ensure you have a corresponding systemd service file (e.g., /etc/systemd/system/my_ruby_app.service) to manage your Puma/Unicorn process.

Global Load Balancing and Health Checks

Linode’s NodeBalancers are region-specific. For true multi-region load balancing and failover, we need a global solution. This can be achieved using DNS-based load balancing with health checks.

Strategy:

Deploy a Linode NodeBalancer in each region where your application is deployed.
Configure each NodeBalancer to point to the application servers within its region.
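Use a global DNS provider that supports health-checked failover (e.g., Cloudflare or AWS Route 53) to manage your primary domain. Linode’s DNS Manager can also host the zone, but it then needs an external monitor that updates records through the Linode API.
Configure DNS records for your application’s domain to point at the regional NodeBalancers: A records for their IP addresses, or CNAMEs for their hostnames. Keep the TTL short (for example, 60 to 300 seconds) so resolvers pick up a failover quickly.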
Implement health checks at the DNS level or via an external monitoring service. The DNS provider should automatically route traffic away from unhealthy regions.

Example DNS Configuration (Conceptual using a provider that supports health checks):

# Assuming a DNS provider like Cloudflare with Load Balancer/Health Check features
# Or using Linode DNS Manager with external monitoring integration

# Primary DNS Record: myapp.yourdomain.com
# Type: A or CNAME
# Value: IP Address of NodeBalancer in Region A (e.g., us-east)
# Health Check: Enabled, pointing to a specific health check endpoint on the NodeBalancer/App Server (e.g., /health)
# Failover: If Region A is unhealthy, automatically switch to Region B.

# Secondary DNS Record (Failover):
# Type: A or CNAME
# Value: IP Address of NodeBalancer in Region B (e.g., eu-west)
# Health Check: Enabled (less critical if it's purely failover)

# Note: Some DNS providers allow direct integration with load balancers or health check services.
# For Linode, you might configure external monitoring services (e.g., UptimeRobot, Pingdom)
# and then use their API or a custom script to update DNS records via Linode's API
# when a region becomes unhealthy.
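
The failover rule in the conceptual configuration above amounts to "publish the first healthy region in priority order". Here is a minimal sketch of that decision, assuming health results come from whatever monitor you use. The function name, region labels, and addresses (taken from the reserved documentation ranges) are hypothetical.

```python
from typing import Dict, List, Optional


def select_dns_target(
    priority: List[str],
    healthy: Dict[str, bool],
    targets: Dict[str, str],
) -> Optional[str]:
    """Return the NodeBalancer address of the first healthy region, or None if all are down."""
    for region in priority:
        # A region missing from the health map is treated as unhealthy.
        if healthy.get(region, False):
            return targets[region]
    return None


if __name__ == "__main__":
    targets = {"us-east": "192.0.2.10", "eu-west": "198.51.100.20"}
    priority = ["us-east", "eu-west"]
    # Region A is down, so the record should point at Region B.
    print(select_dns_target(priority, {"us-east": False, "eu-west": True}, targets))
```

Returning None when every region is unhealthy leaves the caller to decide whether to keep the current record rather than publish a dead address.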

NodeBalancer Configuration (Linode UI/API):

Create a NodeBalancer in each region (e.g., `us-east`, `eu-west`).
Add backend nodes pointing to your application server IPs within that region.
Configure a listener on port 80 (HTTP) or 443 (HTTPS).
Set up a health check:

Protocol: HTTP/HTTPS
Path: /health (ensure your Ruby app has a dedicated health check endpoint)
Check Interval: 10-15 seconds
Response Timeout: 5 seconds
Unhealthy Threshold: 3
Healthy Threshold: 2
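
To see how these thresholds interact, here is a small sketch of threshold-based health tracking. It illustrates the general technique only and is not a description of how NodeBalancers are implemented internally. The class and function names are hypothetical.

```python
class HealthTracker:
    """Flip state only after enough consecutive checks contradict the current state."""

    def __init__(self, unhealthy_threshold: int = 3, healthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that disagree with the current state

    def record(self, check_passed: bool) -> bool:
        """Feed one check result and return the resulting state."""
        if check_passed == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            needed = self.healthy_threshold if check_passed else self.unhealthy_threshold
            if self._streak >= needed:
                self.healthy = check_passed
                self._streak = 0
        return self.healthy


def worst_case_detection_seconds(interval: int, timeout: int, threshold: int) -> int:
    """Rough upper bound: the failure starts right after a passing check, and the last check times out."""
    return interval * threshold + timeout


if __name__ == "__main__":
    tracker = HealthTracker()
    print([tracker.record(False) for _ in range(3)])  # [True, True, False]
    print(worst_case_detection_seconds(15, 5, 3))  # 50
```

With a 15-second interval, a 5-second timeout, and an unhealthy threshold of 3, a backend can stay in rotation for roughly 50 seconds after it fails, which is worth including in your recovery time estimate.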

Automating Failover and Data Synchronization

Manual failover is prone to error and delay. Automation is key.

Database Failover Automation

Promoting a PostgreSQL standby requires careful execution. A common approach involves a monitoring script that checks replication lag and primary health. If the primary is deemed unhealthy, the script can attempt to promote the standby.

import psycopg2
import subprocess
import time
import requests
import os

PRIMARY_DB_HOST = os.environ.get("PRIMARY_DB_HOST", "primary.db.example.com")
SECONDARY_DB_HOST = os.environ.get("SECONDARY_DB_HOST", "secondary.db.example.com")
REPLICATION_USER = os.environ.get("REPLICATION_USER", "replicator")
REPLICATION_PASSWORD = os.environ.get("REPLICATION_PASSWORD", "your_replication_password")
PRIMARY_APP_URL = os.environ.get("PRIMARY_APP_URL", "http://app.us-east.example.com")
SECONDARY_APP_URL = os.environ.get("SECONDARY_APP_URL", "http://app.eu-west.example.com")
HEALTH_CHECK_PATH = "/health"
MONITOR_INTERVAL = 30 # seconds

def is_primary_healthy():
    try:
        # Check if primary DB is reachable and accepting connections
        conn = psycopg2.connect(host=PRIMARY_DB_HOST, user=REPLICATION_USER, password=REPLICATION_PASSWORD, dbname="postgres", connect_timeout=5)
        conn.close()
        # Check if application endpoint is responding
        response = requests.get(f"{PRIMARY_APP_URL}{HEALTH_CHECK_PATH}", timeout=5)
        return response.status_code == 200
    except (psycopg2.OperationalError, requests.exceptions.RequestException):
        return False

def is_standby_ready_to_promote():
    try:
        conn = psycopg2.connect(host=SECONDARY_DB_HOST, user=REPLICATION_USER, password=REPLICATION_PASSWORD, dbname="postgres", connect_timeout=5)
        cur = conn.cursor()
        # A standby that can be promoted is running and still in recovery
        cur.execute("SELECT pg_is_in_recovery();")
        in_recovery = cur.fetchone()[0]
        cur.close()
        conn.close()
        # If it is not in recovery, it has already been promoted and there is nothing to promote
        return in_recovery
    except psycopg2.OperationalError:
        return False

def promote_standby():
    print(f"Attempting to promote standby at {SECONDARY_DB_HOST}...")
    # This script runs on a separate monitoring host, so every command goes over SSH.
    # Assumption: key-based SSH and passwordless sudo are set up on the standby.
    ssh = ['ssh', SECONDARY_DB_HOST]
    data_dir = '/var/lib/postgresql/12/main' # Adjust path for your PG version
    try:
        # For PostgreSQL 12+, use pg_ctl promote
        # For older versions, use trigger_file mechanism
        if subprocess.run(ssh + ['sudo', 'test', '-f', f'{data_dir}/standby.signal']).returncode == 0:
            # pg_ctl refuses to run as root, so run it as the postgres user.
            # On Debian/Ubuntu it may not be on PATH; use its full path if needed.
            subprocess.run(ssh + ['sudo', '-u', 'postgres', 'pg_ctl', 'promote', '-D', data_dir], check=True)
        else:
            # Assuming trigger_file is configured in recovery.conf
            trigger_file_path = '/tmp/promote_standby' # Must match recovery.conf
            if subprocess.run(ssh + ['test', '-e', trigger_file_path]).returncode == 0:
                print(f"Trigger file {trigger_file_path} already exists. Standby might be promoting or already promoted.")
                return True # Assume it's handled
            else:
                subprocess.run(ssh + ['sudo', '-u', 'postgres', 'touch', trigger_file_path], check=True)
                print(f"Created trigger file: {trigger_file_path}")

        print("Promotion command sent. Waiting for standby to become primary...")
        # Give it some time to promote
        time.sleep(15)
        return True
    except Exception as e:
        print(f"Error promoting standby: {e}")
        return False

def update_dns_records():
    # This is a placeholder. You would use the Linode API, Cloudflare API, etc.
    # to update DNS records to point to the new primary (or a load balancer).
    print("Updating DNS records to point to the new primary region...")
    # Example: Call Linode API to update A record for app.yourdomain.com
    # Example: Call Cloudflare API to update CNAME/A record
    pass

def main():
    # Require several consecutive failed checks so a single dropped request does not trigger a failover
    failures_needed = 3
    failures = 0
    while True:
        if not is_primary_healthy():
            failures += 1
            print(f"Primary check failed ({failures}/{failures_needed}).")
            if failures >= failures_needed:
                print("Primary is unhealthy. Checking standby...")
                if is_standby_ready_to_promote():
                    if promote_standby():
                        # Wait a bit for the new primary to stabilize
                        time.sleep(30)
                        # Now, update DNS to direct traffic to the new primary region
                        update_dns_records()
                        print("Failover initiated. DNS records updated.")
                        # Optionally, restart app servers in the new primary region if needed
                        # Or ensure load balancers are correctly configured.
                        break # Exit loop after successful failover
                else:
                    print("Standby is not ready for promotion. Further investigation needed.")
        else:
            failures = 0
            print("Primary is healthy.")

        time.sleep(MONITOR_INTERVAL)

if __name__ == "__main__":
    main()

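This script needs to be deployed on a separate monitoring server or a highly available bastion host. It requires appropriate API credentials for DNS updates and SSH access (or direct commands) to the database servers. Before promoting, make sure the old primary can no longer accept writes, for example by shutting it down or blocking its traffic. Otherwise both nodes may act as primary and their data will diverge.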
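
The update_dns_records placeholder could be filled in along these lines if the zone is hosted in Linode’s DNS Manager. This is a sketch under assumptions: it targets the Linode API v4 domain record update endpoint, the function name and the domain ID, record ID, and token values are hypothetical, and it only builds the request so you can inspect it before sending anything.

```python
import json
import urllib.request

LINODE_API = "https://api.linode.com/v4"


def build_record_update(domain_id: int, record_id: int, new_ip: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request that repoints one A record at new_ip."""
    body = json.dumps({"target": new_ip}).encode("utf-8")
    return urllib.request.Request(
        f"{LINODE_API}/domains/{domain_id}/records/{record_id}",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )


if __name__ == "__main__":
    # Hypothetical IDs and token; look up the real ones in your Linode account.
    req = build_record_update(1234, 5678, "198.51.100.20", "example-token")
    print(req.get_method(), req.full_url)
    # To perform the call: urllib.request.urlopen(req, timeout=10)
```

Keeping request construction separate from sending makes the failover path easy to test without touching live DNS.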

Application Health Check Endpoint

Your Ruby application must expose a health check endpoint (e.g., /health). This endpoint should verify:

The application process is running.
It can connect to the database (read-only check is sufficient).
Any essential external services are reachable.

Example Rails Controller Action:

# app/controllers/health_controller.rb
class HealthController < ApplicationController
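  # raise: false avoids an ArgumentError if authenticate_user! is not defined in your app
  skip_before_action :authenticate_user!, raise: false # Adjust as needed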

  def show
    # Check database connection
    begin
      ActiveRecord::Base.connection.execute('SELECT 1')
      db_status = :ok
    rescue => e
      db_status = :error
      Rails.logger.error("Database health check failed: #{e.message}")
    end

    # Add checks for other critical services (e.g., Redis, external APIs)

    if db_status == :ok # && other_services_ok
      render json: { status: 'ok', database: db_status }, status: :ok
    else
      render json: { status: 'error', database: db_status }, status: :service_unavailable
    end
  end
end

# config/routes.rb
Rails.application.routes.draw do
  get '/health', to: 'health#show'
  # ... other routes
end

Data Synchronization for Non-Relational Data

For data not stored in PostgreSQL (e.g., files in object storage, cache data), ensure a cross-region strategy is in place.

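Object Storage: Use an object store such as Linode Object Storage, AWS S3, or Google Cloud Storage, and check what cross-region replication your provider offers. Where native replication is available (AWS S3 Cross-Region Replication, for example), configure it from your primary region to your secondary region. Otherwise, schedule a bucket-to-bucket sync with an S3-compatible tool such as rclone.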
Caching: If using Redis or Memcached, consider a distributed cache solution or accept that cache data will be lost during a failover and will need to be repopulated. For critical caching, explore solutions like Redis Cluster with replication or managed services that offer cross-region capabilities.
Background Jobs: Ensure your job queue (e.g., Sidekiq, Resque) is either region-aware or that jobs can be processed by workers in either region. If using a centralized Redis for Sidekiq, ensure it’s highly available and potentially replicated cross-region.

Testing and Validation

Regularly test your failover procedures. This is non-negotiable.

Simulated Failures: Periodically stop the primary database, block network traffic to a region, or shut down application servers in one region to simulate an outage.
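Monitor Failover Time: Measure how long the system takes to become fully operational in the secondary region, and compare it against your Recovery Time Objective (RTO).
Data Integrity Checks: After failover, check for lost or corrupted data. With asynchronous replication, the most recent commits may not have reached the standby, so compare the loss against your Recovery Point Objective (RPO).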
DNS Propagation Testing: Verify that DNS changes propagate as expected across different DNS resolvers.
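
One way to put a number on failover time during a drill is to probe the public health endpoint at a fixed interval and compute the outage from the recorded results. Below is a minimal sketch, assuming the samples are (timestamp in seconds, probe succeeded) pairs collected by your own prober. The function names and sample values are hypothetical.

```python
from typing import List, Optional, Tuple


def measure_outage_seconds(samples: List[Tuple[float, bool]]) -> Optional[float]:
    """Seconds from the first failed probe to the next successful one, or None if no full outage was seen."""
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            return timestamp - outage_start
    return None


def meets_rto(outage_seconds: Optional[float], rto_seconds: float) -> bool:
    """True when a measured outage exists and is within the recovery time objective."""
    return outage_seconds is not None and outage_seconds <= rto_seconds


if __name__ == "__main__":
    # Hypothetical drill: probes every 10 s, service down from t=10 until t=70.
    drill = [(0, True), (10, False), (20, False), (70, True)]
    outage = measure_outage_seconds(drill)
    print(outage, meets_rto(outage, rto_seconds=120))  # 60 True
```

The measurement is only as fine as the probe interval, so use an interval well below the RTO you are trying to verify.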

By implementing these strategies, you can build a resilient Ruby architecture on Linode capable of withstanding regional outages and ensuring business continuity.