Automating Multi-Region Redundancy for Ruby Architectures on Linode
Establishing Multi-Region Redundancy: A Linode-Centric Approach for Ruby Applications
Achieving robust disaster recovery for critical Ruby applications necessitates a multi-region strategy. This post details a practical, production-ready implementation leveraging Linode’s global infrastructure, focusing on automated failover and data synchronization. We’ll cover database replication, application server deployment, and load balancing across geographically dispersed data centers.
Database Replication Strategy: PostgreSQL in a Primary-Secondary Configuration
For relational data, PostgreSQL’s built-in streaming replication is a reliable and performant choice. We’ll set up a primary instance in one Linode region and a synchronous or asynchronous standby in another. Asynchronous replication is generally preferred for cross-region setups to avoid latency-induced write stalls, though synchronous offers stronger consistency guarantees at the cost of performance.
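To make the latency trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes each synchronous commit waits for at least one network round trip to the standby. The 80 ms round-trip time and 1 ms local commit cost are illustrative figures, not measurements between any two Linode regions.

```python
def max_serial_commits_per_second(rtt_ms: float, local_commit_ms: float = 1.0) -> float:
    """Upper bound on commits/sec for ONE connection committing back to back.

    With synchronous replication, each commit waits for the standby to
    acknowledge, so it costs at least one round trip on top of local work.
    """
    return 1000.0 / (rtt_ms + local_commit_ms)


if __name__ == "__main__":
    # Hypothetical figures: 1 ms local commit, 80 ms cross-region round trip.
    print(f"async: ~{max_serial_commits_per_second(0.0):.0f} commits/s per connection")
    print(f"sync:  ~{max_serial_commits_per_second(80.0):.1f} commits/s per connection")
```

Concurrent connections raise aggregate throughput, but the per-transaction latency floor remains, which is why asynchronous replication is the usual default across regions.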
Primary Linode (e.g., us-east) Configuration:
Ensure PostgreSQL is installed and running. Modify postgresql.conf and pg_hba.conf.
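# /etc/postgresql/[version]/main/postgresql.conf
listen_addresses = '*'
wal_level = replica
max_wal_senders = 5
wal_keep_segments = 64          # PostgreSQL 12 and earlier; on 13+ use wal_keep_size = '1GB' instead
synchronous_commit = off        # Optional: also skips waiting for the local WAL flush; omit to keep the default (on)
synchronous_standby_names = ''  # Empty means asynchronous replication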
# /etc/postgresql/[version]/main/pg_hba.conf
# TYPE  DATABASE     USER        ADDRESS            METHOD
host    replication  replicator  [secondary_ip]/32  md5
host    all          all         0.0.0.0/0          md5
Create a replication user:
CREATE USER replicator REPLICATION LOGIN PASSWORD 'your_replication_password';
GRANT CONNECT ON DATABASE your_database TO replicator;
Restart PostgreSQL on the primary.
Secondary Linode (e.g., eu-west) Configuration:
Install PostgreSQL. Stop the PostgreSQL service before proceeding.
sudo systemctl stop postgresql
Clear the default data directory, then clone the primary with pg_basebackup. The -R flag writes the replication settings for you: on PostgreSQL 12+ it creates standby.signal and adds primary_conninfo to postgresql.auto.conf; on older versions it writes recovery.conf.
sudo sh -c 'rm -rf /var/lib/postgresql/[version]/main/*'
sudo -u postgres pg_basebackup -h [primary_ip] -U replicator -D /var/lib/postgresql/[version]/main -R -P -v -W
Do not create recovery.signal for a standby: that file requests targeted (archive) recovery, whereas standby.signal is what keeps the instance in standby mode. After the base backup, verify that the settings below are in place.
# For PostgreSQL 12+: standby.signal exists in the data directory and
# primary_conninfo is set in postgresql.auto.conf (or postgresql.conf).
# For older versions, recovery.conf should contain:
# /var/lib/postgresql/[version]/main/recovery.conf
# standby_mode = 'on'
# primary_conninfo = 'host=[primary_ip] port=5432 user=replicator password=your_replication_password'
# trigger_file = '/tmp/promote_standby'
Ensure listen_addresses is set to '*' in postgresql.conf on the secondary if you intend to promote it and have other services connect directly. Then start PostgreSQL.
sudo systemctl start postgresql
Monitor replication status on the primary:
SELECT * FROM pg_stat_replication;
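The sent_lsn and replay_lsn columns of that view (PostgreSQL 10+) can be turned into a lag figure in bytes. The sketch below does the conversion client-side, assuming you have already fetched the two values as text. The 16 MiB threshold is an arbitrary example, and PostgreSQL's own pg_wal_lsn_diff() performs the same subtraction server-side.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replay_lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL sent to the standby but not yet replayed there."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)


def lag_is_acceptable(sent_lsn: str, replay_lsn: str,
                      max_lag_bytes: int = 16 * 1024 * 1024) -> bool:
    """True when the standby is within max_lag_bytes of the primary."""
    return replay_lag_bytes(sent_lsn, replay_lsn) <= max_lag_bytes


if __name__ == "__main__":
    # Made-up LSN values for illustration.
    print(replay_lag_bytes("0/3000060", "0/3000000"))
```

A monitor could refuse to promote automatically when the last known lag exceeds the threshold, since that lag is roughly the data an asynchronous failover would lose.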
Application Server Deployment: Stateless Ruby Instances
Our Ruby application servers (e.g., Puma, Unicorn) should be stateless. This means any session data, file uploads, or temporary data must be stored externally. We’ll deploy identical application stacks on multiple Linode instances across different regions.
Infrastructure as Code (IaC): Use tools like Terraform or Ansible to provision and configure these servers consistently. This ensures identical environments and simplifies updates.
Example Ansible Playbook Snippet (for provisioning a web server):
---
- name: Deploy Ruby Application Server
  hosts: webservers
  become: yes
  vars:
    ruby_version: "3.1.2"
    app_dir: "/srv/my_ruby_app"
    repo_url: "git@[git_host]:your_org/my_ruby_app.git"  # SSH clone URL; substitute your Git host

  tasks:
    - name: Install system dependencies
      apt:
        name: ['git', 'build-essential', 'libssl-dev', 'zlib1g-dev', 'libreadline-dev', 'libyaml-dev', 'libsqlite3-dev', 'sqlite3', 'libxml2-dev', 'libxslt1-dev', 'libcurl4-openssl-dev', 'software-properties-common', 'libffi-dev']
        state: present
        update_cache: yes

    - name: Add Ruby PPA and install Ruby
      apt_repository:
        repo: 'ppa:rael-gc/ruby-3.1'
        state: present
      when: ansible_distribution == 'Ubuntu'

    - name: Install Ruby
      apt:
        name: "ruby{{ ruby_version }}"
        state: present
      when: ansible_distribution == 'Ubuntu'

    - name: Install Bundler
      gem:
        name: bundler
        version: "~> 2.3"
        executable: "/usr/bin/gem-{{ ruby_version }}"
      environment:
        PATH: "/usr/local/bin:{{ ansible_env.PATH }}"

    - name: Create application directory
      file:
        path: "{{ app_dir }}"
        state: directory
        owner: www-data
        group: www-data
        mode: '0755'

    - name: Clone or update application repository
      git:
        repo: "{{ repo_url }}"
        dest: "{{ app_dir }}"
        version: main  # Or a specific tag/branch
        accept_hostkey: yes
      become_user: www-data
      notify: Restart application service  # Restart only when the code changes

    - name: Install application gems
      bundler:
        chdir: "{{ app_dir }}"
        state: present
      environment:
        PATH: "/usr/local/bin:{{ ansible_env.PATH }}"
      become_user: www-data

    - name: Configure application environment (e.g., .env file)
      template:
        src: templates/env.j2
        dest: "{{ app_dir }}/.env"
        owner: www-data
        group: www-data
        mode: '0640'  # The file usually holds secrets, so keep it from being world-readable
      notify: Restart application service  # Restart only when the config changes

    - name: Ensure application service is running (e.g., systemd unit)
      systemd:
        name: my_ruby_app
        state: started
        enabled: yes
        daemon_reload: yes

  handlers:
    - name: Restart application service
      systemd:
        name: my_ruby_app
        state: restarted
        daemon_reload: yes
Ensure you have a corresponding systemd service file (e.g., /etc/systemd/system/my_ruby_app.service) to manage your Puma/Unicorn process.
Global Load Balancing and Health Checks
Linode’s NodeBalancers are region-specific. For true multi-region load balancing and failover, we need a global solution. This can be achieved using DNS-based load balancing with health checks.
Strategy:
- Deploy a Linode NodeBalancer in each region where your application is deployed.
- Configure each NodeBalancer to point to the application servers within its region.
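- Use a global DNS provider with health-checked failover (e.g., Cloudflare or AWS Route 53), or Linode's DNS Manager paired with external monitoring that updates records through the Linode API, to manage your primary domain.
- Configure DNS records for your application's domain: A records pointing to the IP addresses of your regional NodeBalancers, or CNAME records pointing to their hostnames. Keep TTLs short so that a failover reaches clients quickly.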
- Implement health checks at the DNS level or via an external monitoring service. The DNS provider should automatically route traffic away from unhealthy regions.
Example DNS Configuration (Conceptual using a provider that supports health checks):
# Assuming a DNS provider like Cloudflare with Load Balancer/Health Check features
# Or using Linode DNS Manager with external monitoring integration

# Primary DNS Record: myapp.yourdomain.com
# Type: A or CNAME
# Value: IP address (A) or hostname (CNAME) of the NodeBalancer in Region A (e.g., us-east)
# Health Check: Enabled, pointing to a specific health check endpoint on the NodeBalancer/App Server (e.g., /health)
# Failover: If Region A is unhealthy, automatically switch to Region B.

# Secondary DNS Record (Failover):
# Type: A or CNAME
# Value: IP address (A) or hostname (CNAME) of the NodeBalancer in Region B (e.g., eu-west)
# Health Check: Enabled (less critical if it's purely failover)

# Note: Some DNS providers allow direct integration with load balancers or health check services.
# For Linode, you might configure external monitoring services (e.g., UptimeRobot, Pingdom)
# and then use their API or a custom script to update DNS records via Linode's API
# when a region becomes unhealthy.
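The failover rule described above reduces to "serve the first healthy region in priority order". Here is a minimal sketch of that decision as a pure function. The region names match the examples in this post, while the IP addresses are reserved documentation addresses and the function name is hypothetical.

```python
from typing import Mapping, Optional, Sequence


def choose_dns_target(priority: Sequence[str],
                      healthy: Mapping[str, bool],
                      targets: Mapping[str, str]) -> Optional[str]:
    """Return the NodeBalancer IP of the first healthy region, or None if all are down."""
    for region in priority:
        # A region with no health report is treated as unhealthy.
        if healthy.get(region, False):
            return targets[region]
    return None


if __name__ == "__main__":
    targets = {"us-east": "192.0.2.10", "eu-west": "198.51.100.20"}  # placeholder IPs
    health = {"us-east": False, "eu-west": True}
    print(choose_dns_target(["us-east", "eu-west"], health, targets))
```

Returning None when every region is down lets the caller leave DNS untouched rather than flapping between two dead targets.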
NodeBalancer Configuration (Linode UI/API):
- Create a NodeBalancer in each region (e.g., `us-east`, `eu-west`).
- Add backend nodes pointing to your application server IPs within that region.
- Configure a listener on port 80 (HTTP) or 443 (HTTPS).
- Set up a health check:
  - Protocol: HTTP/HTTPS
  - Path: /health (ensure your Ruby app has a dedicated health check endpoint)
  - Check Interval: 10-15 seconds
  - Response Timeout: 5 seconds
  - Unhealthy Threshold: 3
  - Healthy Threshold: 2
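To see how the two thresholds interact, the sketch below models a generic threshold-based health check: a node is marked down after a run of consecutive failures and back up after a run of consecutive successes. It is an illustrative model under those assumptions, not a description of how NodeBalancers implement checks internally.

```python
from typing import Iterable, List


def node_states(checks: Iterable[bool],
                unhealthy_threshold: int = 3,
                healthy_threshold: int = 2,
                initially_up: bool = True) -> List[bool]:
    """Return the node's up/down state after each health check result."""
    up = initially_up
    streak = 0  # consecutive results that contradict the current state
    states = []
    for ok in checks:
        if ok == up:
            streak = 0  # result agrees with current state, so reset the run
        else:
            streak += 1
            needed = unhealthy_threshold if up else healthy_threshold
            if streak >= needed:
                up = not up
                streak = 0
        states.append(up)
    return states


if __name__ == "__main__":
    # One success, three failures, then two successes.
    print(node_states([True, False, False, False, True, True]))
```

With a 10-second interval and an unhealthy threshold of 3, a hard failure takes on the order of 30 seconds to detect, while a single failed check never removes a node from rotation.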
Automating Failover and Data Synchronization
Manual failover is prone to error and delay. Automation is key.
Database Failover Automation
Promoting a PostgreSQL standby requires careful execution. A common approach involves a monitoring script that checks replication lag and primary health. If the primary is deemed unhealthy, the script can attempt to promote the standby.
import os
import subprocess
import time

import psycopg2
import requests

PRIMARY_DB_HOST = os.environ.get("PRIMARY_DB_HOST", "primary.db.example.com")
SECONDARY_DB_HOST = os.environ.get("SECONDARY_DB_HOST", "secondary.db.example.com")
REPLICATION_USER = os.environ.get("REPLICATION_USER", "replicator")
REPLICATION_PASSWORD = os.environ.get("REPLICATION_PASSWORD", "your_replication_password")
PRIMARY_APP_URL = os.environ.get("PRIMARY_APP_URL", "http://app.us-east.example.com")
SECONDARY_APP_URL = os.environ.get("SECONDARY_APP_URL", "http://app.eu-west.example.com")
HEALTH_CHECK_PATH = "/health"
MONITOR_INTERVAL = 30  # seconds
FAILURE_THRESHOLD = 3  # consecutive failed checks before failing over
PG_DATA_DIR = "/var/lib/postgresql/12/main"  # Adjust path for your PG version
TRIGGER_FILE = "/tmp/promote_standby"  # Must match recovery.conf (pre-12 only)


def is_primary_healthy():
    try:
        # Check if primary DB is reachable and accepting connections
        conn = psycopg2.connect(host=PRIMARY_DB_HOST, user=REPLICATION_USER,
                                password=REPLICATION_PASSWORD, dbname="postgres",
                                connect_timeout=5)
        conn.close()
        # Check if application endpoint is responding
        response = requests.get(f"{PRIMARY_APP_URL}{HEALTH_CHECK_PATH}", timeout=5)
        return response.status_code == 200
    except (psycopg2.OperationalError, requests.exceptions.RequestException):
        return False


def standby_in_recovery():
    """Return pg_is_in_recovery() from the standby, or None if it is unreachable."""
    conn = None
    try:
        conn = psycopg2.connect(host=SECONDARY_DB_HOST, user=REPLICATION_USER,
                                password=REPLICATION_PASSWORD, dbname="postgres",
                                connect_timeout=5)
        with conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery();")
            return cur.fetchone()[0]
    except psycopg2.Error:
        return None
    finally:
        if conn is not None:
            conn.close()


def is_standby_ready_to_promote():
    # A promotable standby is reachable AND still in recovery.
    # If it is not in recovery, it has already been promoted.
    return standby_in_recovery() is True


def run_on_standby(*cmd, check=True):
    # The monitor runs on a separate host, so commands go over SSH.
    # Assumes key-based SSH and passwordless sudo on the standby.
    return subprocess.run(["ssh", SECONDARY_DB_HOST, *cmd], check=check)


def promote_standby():
    print(f"Attempting to promote standby at {SECONDARY_DB_HOST}...")
    try:
        has_signal = run_on_standby("sudo", "-u", "postgres", "test", "-f",
                                    f"{PG_DATA_DIR}/standby.signal",
                                    check=False).returncode == 0
        if has_signal:
            # PostgreSQL 12+. pg_ctl refuses to run as root, so run it as postgres.
            # On Debian/Ubuntu, pg_ctl lives in /usr/lib/postgresql/[version]/bin.
            run_on_standby("sudo", "-u", "postgres", "pg_ctl", "promote", "-D", PG_DATA_DIR)
        else:
            # Older versions: create the trigger_file named in recovery.conf.
            # Created as postgres so the server can remove it after promotion.
            run_on_standby("sudo", "-u", "postgres", "touch", TRIGGER_FILE)
            print(f"Created trigger file: {TRIGGER_FILE}")
    except (subprocess.CalledProcessError, OSError) as e:
        print(f"Error promoting standby: {e}")
        return False

    print("Promotion command sent. Waiting for standby to become primary...")
    for _ in range(12):  # Poll for up to 60 seconds
        if standby_in_recovery() is False:
            return True
        time.sleep(5)
    print("Standby did not leave recovery in time.")
    return False


def update_dns_records():
    # This is a placeholder. You would use the Linode API, Cloudflare API, etc.
    # to update DNS records to point to the new primary (or a load balancer).
    print("Updating DNS records to point to the new primary region...")
    # Example: Call Linode API to update A record for app.yourdomain.com
    # Example: Call Cloudflare API to update CNAME/A record
    pass


def main():
    failures = 0
    while True:
        if is_primary_healthy():
            failures = 0
            print("Primary is healthy.")
        else:
            failures += 1
            print(f"Primary check failed ({failures}/{FAILURE_THRESHOLD}).")
            if failures >= FAILURE_THRESHOLD:
                if not is_standby_ready_to_promote():
                    print("Standby is not ready for promotion. Further investigation needed.")
                elif promote_standby():
                    # Make sure the old primary can no longer accept writes
                    # before traffic is sent to the new one.
                    update_dns_records()
                    print("Failover complete. DNS records updated.")
                    # Optionally, restart app servers in the new primary region if needed
                    # Or ensure load balancers are correctly configured.
                    break  # Exit loop after successful failover
                else:
                    print("Promotion failed. Manual intervention required.")
        time.sleep(MONITOR_INTERVAL)


if __name__ == "__main__":
    main()
This script needs to be deployed on a separate monitoring server or a highly available bastion host. It requires appropriate API credentials for DNS updates and SSH access (or direct commands) to the database servers.
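One way the DNS update step could look with Linode's DNS Manager is a PUT to the domain record endpoint of the Linode API v4. The sketch below uses only the standard library and separates building the request from sending it. The token, domain ID, record ID, and target IP are placeholders you would supply; treat this as a starting point rather than a finished client.

```python
import json
import urllib.request

LINODE_API = "https://api.linode.com/v4"


def build_record_update(token: str, domain_id: int, record_id: int,
                        new_target: str, ttl_sec: int = 300) -> urllib.request.Request:
    """Build (but do not send) a request that repoints one DNS record."""
    body = json.dumps({"target": new_target, "ttl_sec": ttl_sec}).encode("utf-8")
    return urllib.request.Request(
        url=f"{LINODE_API}/domains/{domain_id}/records/{record_id}",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def update_dns_record(token: str, domain_id: int, record_id: int, new_target: str) -> dict:
    """Send the update and return the API's JSON response."""
    request = build_record_update(token, domain_id, record_id, new_target)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Keeping the request builder separate makes the failover path testable without touching real DNS. In production, read the token from the environment rather than hard-coding it.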
Application Health Check Endpoint
Your Ruby application must expose a health check endpoint (e.g., /health). This endpoint should verify:
- The application process is running.
- It can connect to the database (read-only check is sufficient).
- Any essential external services are reachable.
Example Rails Controller Action:
# app/controllers/health_controller.rb
class HealthController < ApplicationController
  # raise: false avoids an ArgumentError if the callback is not defined
  skip_before_action :authenticate_user!, raise: false # Adjust as needed

  def show
    # Check database connection
    begin
      ActiveRecord::Base.connection.execute('SELECT 1')
      db_status = :ok
    rescue StandardError => e
      db_status = :error
      Rails.logger.error("Database health check failed: #{e.message}")
    end

    # Add checks for other critical services (e.g., Redis, external APIs)
    if db_status == :ok # && other_services_ok
      render json: { status: 'ok', database: db_status }, status: :ok
    else
      render json: { status: 'error', database: db_status }, status: :service_unavailable
    end
  end
end

# config/routes.rb
Rails.application.routes.draw do
  get '/health', to: 'health#show'
  # ... other routes
end
Data Synchronization for Non-Relational Data
For data not stored in PostgreSQL (e.g., files in object storage, cache data), ensure a cross-region strategy is in place.
- Object Storage: Use a service such as Linode Object Storage, AWS S3, or Google Cloud Storage. Check whether your provider offers built-in cross-region replication; where it does not, schedule a sync job that copies objects from your primary region's bucket to a bucket in your secondary region.
- Caching: If using Redis or Memcached, consider a distributed cache solution or accept that cache data will be lost during a failover and will need to be repopulated. For critical caching, explore solutions like Redis Cluster with replication or managed services that offer cross-region capabilities.
- Background Jobs: Ensure your job queue (e.g., Sidekiq, Resque) is either region-aware or that jobs can be processed by workers in either region. If using a centralized Redis for Sidekiq, ensure it’s highly available and potentially replicated cross-region.
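For object storage without built-in replication, a periodic sync job only needs to know which objects differ between the two buckets. The sketch below shows that comparison on plain dictionaries mapping object key to a checksum or ETag. How you list the buckets is left to your storage client, and the function name is hypothetical.

```python
from typing import Dict, List


def keys_to_copy(primary: Dict[str, str], secondary: Dict[str, str]) -> List[str]:
    """Keys that are missing from the secondary bucket or whose checksum differs."""
    return sorted(key for key, checksum in primary.items()
                  if secondary.get(key) != checksum)


if __name__ == "__main__":
    primary = {"uploads/a.png": "etag-1", "uploads/b.png": "etag-2"}
    secondary = {"uploads/a.png": "etag-1"}
    print(keys_to_copy(primary, secondary))
```

Note that ETags are not always content hashes (multipart uploads are a common exception), so a real job may need to compare size and modification time as well.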
Testing and Validation
Regularly test your failover procedures. This is non-negotiable.
- Simulated Failures: Periodically stop the primary database, block network traffic to a region, or shut down application servers in one region to simulate an outage.
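- Monitor Failover Time: Measure how long it actually takes for the system to become fully operational in the secondary region, and compare that against your Recovery Time Objective (RTO).
- Data Integrity Checks: After failover, perform checks to ensure no data was lost or corrupted. With asynchronous replication, transactions not yet replayed on the standby are lost, so compare the observed loss against your Recovery Point Objective (RPO).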
- DNS Propagation Testing: Verify that DNS changes propagate as expected across different DNS resolvers.
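Failover time is easiest to track when each drill measures it the same way. The sketch below polls a probe until it succeeds and reports the elapsed time. The probe is any callable you provide (for example, one that requests the /health endpoint through the public hostname), and the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Optional


def measure_recovery_seconds(probe: Callable[[], bool],
                             timeout_s: float = 600.0,
                             interval_s: float = 1.0,
                             clock: Callable[[], float] = time.monotonic,
                             sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Return seconds until probe() first succeeds, or None if timeout_s elapses."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Start the measurement at the moment you trigger the simulated outage, and record the result from each drill so regressions in recovery time become visible.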
By implementing these strategies, you can build a resilient Ruby architecture on Linode capable of withstanding regional outages and ensuring business continuity.