Disaster Recovery 101: Architecting Auto-Failovers for DynamoDB and Ruby Deployments on DigitalOcean
Establishing Multi-Region DynamoDB Replication
A robust disaster recovery strategy for a DynamoDB-backed application hinges on effective cross-region replication. This is more than taking backups: the goal is a continuously available, synchronized copy of your data in a geographically distinct region. On AWS, DynamoDB Global Tables are the cornerstone of this approach, providing active-active replication across multiple AWS regions. DynamoDB itself is an AWS service and is not available on DigitalOcean, but the principles of multi-region data availability are universal. For a DigitalOcean equivalent, we’ll approximate the pattern with a primary database cluster and a read-replica cluster in separate regions, plus a failover mechanism that promotes the replica.
Let’s assume a primary database cluster is provisioned in DigitalOcean’s New York 3 (NYC3) data center, and a read-replica cluster is configured in their San Francisco 1 (SFO1) data center. Replication is handled either at the application level or by a managed database service that supports cross-region replication, if one is available and suits your needs. For this example, we’ll outline a conceptual approach using a primary and a replica, with application logic to handle synchronization and failover.
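Before promoting a replica it helps to know how far it trails the primary. The following is a minimal sketch, assuming a PostgreSQL pair where the primary's position comes from `pg_current_wal_lsn()` and the replica's from `pg_last_wal_replay_lsn()`. The function names and the 16 MiB threshold are hypothetical choices for illustration, not part of the setup described above.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits and the
    low 32 bits of a 64-bit WAL position, separated by a slash.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica has not yet replayed (never negative)."""
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn))


def safe_to_promote(primary_lsn: str, replica_lsn: str,
                    max_lag_bytes: int = 16 * 1024 * 1024) -> bool:
    """Return True when the replica is within the assumed lag budget."""
    return replication_lag_bytes(primary_lsn, replica_lsn) <= max_lag_bytes
```

Once the primary is unreachable its current position can no longer be queried, so in practice the monitor would compare against the last primary LSN it recorded while the primary was still healthy.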
Automating Failover for Ruby Applications on DigitalOcean
The core of our automated failover lies in a health check mechanism and an orchestration script. This script will periodically probe the primary database and the application instances. If the primary database becomes unresponsive or exhibits critical errors, the script will initiate a failover process.
We’ll use DigitalOcean’s Droplets for our Ruby application servers and a managed PostgreSQL or MySQL cluster for our database. For simplicity, let’s assume a PostgreSQL setup. The health check will involve a simple query to the primary database. If this query times out or returns an error, the failover is triggered.
Health Check Script (Python)
This Python script will run as a cron job on a dedicated monitoring Droplet or as a background process on one of the application servers. It connects to the primary database and executes a simple `SELECT 1` query. If the connection or query fails, it signals a potential outage.
import psycopg2
import requests
import os
import time

PRIMARY_DB_HOST = os.environ.get('PRIMARY_DB_HOST', 'your-primary-db-host.digitalocean.com')
PRIMARY_DB_USER = os.environ.get('PRIMARY_DB_USER', 'db_user')
PRIMARY_DB_PASSWORD = os.environ.get('PRIMARY_DB_PASSWORD', 'db_password')
PRIMARY_DB_NAME = os.environ.get('PRIMARY_DB_NAME', 'app_database')
HEALTH_CHECK_URL = os.environ.get('HEALTH_CHECK_URL', 'http://your-app-primary-host/health')
FAILOVER_TRIGGER_URL = os.environ.get('FAILOVER_TRIGGER_URL', 'http://your-failover-manager-host/trigger_failover')


def check_database_health():
    try:
        conn = psycopg2.connect(
            host=PRIMARY_DB_HOST,
            user=PRIMARY_DB_USER,
            password=PRIMARY_DB_PASSWORD,
            dbname=PRIMARY_DB_NAME,
            connect_timeout=5
        )
        cur = conn.cursor()
        cur.execute("SELECT 1;")
        cur.close()
        conn.close()
        print("Primary database is healthy.")
        return True
    except psycopg2.OperationalError as e:
        print(f"Primary database health check failed: {e}")
        return False
    except Exception as e:
        print(f"An unexpected error occurred during DB health check: {e}")
        return False


def check_application_health():
    try:
        response = requests.get(HEALTH_CHECK_URL, timeout=5)
        response.raise_for_status()  # Raise an exception for bad status codes
        print("Primary application endpoint is healthy.")
        return True
    except requests.exceptions.RequestException as e:
        print(f"Primary application health check failed: {e}")
        return False


def trigger_failover():
    try:
        print("Attempting to trigger failover...")
        response = requests.post(FAILOVER_TRIGGER_URL, json={'reason': 'primary_unreachable'}, timeout=10)
        response.raise_for_status()
        print(f"Failover trigger successful: {response.json()}")
    except requests.exceptions.RequestException as e:
        print(f"Failed to trigger failover: {e}")


if __name__ == "__main__":
    db_healthy = check_database_health()
    app_healthy = check_application_health()
    if not db_healthy or not app_healthy:
        print("Primary system is unhealthy. Initiating failover process.")
        trigger_failover()
    else:
        print("Primary system is healthy. No action needed.")
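The script above triggers a failover on a single failed probe, so one dropped packet could move production to another region. One possible hardening, sketched here under assumptions, is to require several consecutive failed runs first. Because each cron run is a fresh process, the counter has to live on disk. The `record_result` helper, the threshold of 3, and the state file path are all hypothetical and not part of the original script.

```python
import json
import os
import tempfile

FAILURE_THRESHOLD = 3  # assumed value: consecutive failed runs required before failover


def record_result(state_path: str, healthy: bool, threshold: int = FAILURE_THRESHOLD) -> bool:
    """Update the on-disk failure counter and return True when failover should fire."""
    try:
        with open(state_path) as fh:
            failures = int(json.load(fh).get("consecutive_failures", 0))
    except (FileNotFoundError, ValueError, AttributeError, TypeError):
        failures = 0  # first run or unreadable state: start from zero

    # A healthy run resets the streak; an unhealthy run extends it.
    failures = 0 if healthy else failures + 1

    # Write to a temp file and rename, so a crash cannot leave a truncated state file.
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump({"consecutive_failures": failures}, fh)
    os.replace(tmp_path, state_path)

    # Fire only on the run that reaches the threshold, not on every later failure.
    return failures == threshold
```

In the main block this could be used as `if record_result("/var/tmp/health_state.json", db_healthy and app_healthy): trigger_failover()`. Firing exactly once avoids re-triggering every minute while the primary stays down, at the cost of not retrying automatically if the trigger request itself fails.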
Failover Orchestration Service (Ruby/Sinatra)
This Sinatra application will act as the central point for managing failover. It exposes an endpoint that the health check script can call. Upon receiving a failover trigger, it will execute a series of commands to reconfigure DNS, update application configurations, and potentially promote a read replica.
require 'sinatra'
require 'json'
require 'open3'

# Configuration for your DigitalOcean Droplets and database
PRIMARY_APP_TAG = 'production-primary'
REPLICA_APP_TAG = 'production-replica'
PRIMARY_DB_HOST = ENV.fetch('PRIMARY_DB_HOST', 'your-primary-db-host.digitalocean.com')
REPLICA_DB_HOST = ENV.fetch('REPLICA_DB_HOST', 'your-replica-db-host.digitalocean.com')
DO_API_TOKEN = ENV.fetch('DO_API_TOKEN', 'your-digitalocean-api-token')
DO_REGION = ENV.fetch('DO_REGION', 'nyc3') # Region of the failover manager

# --- Database Promotion Logic (Conceptual for PostgreSQL) ---
# This assumes you have SSH access to the replica DB host and can run psql commands.
# In a real-world scenario, you'd likely use a managed service's API or a more robust tool.
def promote_replica_db
  puts "Attempting to promote replica database at #{REPLICA_DB_HOST}..."
  # pg_promote() (PostgreSQL 12+) promotes a running standby to primary.
  # The server must stay up for promotion to work, so PostgreSQL is not stopped first.
  # This is highly dependent on your specific database setup and version.
  # A more robust solution would involve database-specific tools or managed service APIs.
  command = "ssh root@#{REPLICA_DB_HOST} 'sudo -u postgres psql -c \"SELECT pg_promote();\"'"
  _stdout, stderr, status = Open3.capture3(command)
  if status.success?
    puts "Replica database promotion command executed successfully."
    # Further steps might include updating connection strings for applications
    return true
  else
    puts "Error promoting replica database: #{stderr}"
    return false
  end
end

# --- DNS Update Logic (Conceptual) ---
# This would typically involve updating A records or CNAMEs via DigitalOcean's DNS API
# or a load balancer's configuration. For simplicity, we'll simulate by updating
# an environment variable or a configuration file that applications read.
def update_dns_records
  puts "Simulating DNS update to point to replica..."
  # In a real scenario, you'd use the DigitalOcean API to update DNS records.
  # Example:
  # `curl -X PUT "https://api.digitalocean.com/v2/domains/yourdomain.com/records/record-id" -d '{"data": "new-ip-address"}' -H "Content-Type: application/json" -H "Authorization: Bearer #{DO_API_TOKEN}"`
  # For this example, we'll just log the action.
  puts "DNS records would be updated here to point to the replica."
  return true
end

# --- Application Configuration Update ---
# This involves updating application configuration files or environment variables
# on the replica Droplets to point to the new primary database.
def update_app_configs
  puts "Updating application configurations on replica Droplets..."
  # This would involve SSHing into each replica Droplet and modifying config files.
  # Example:
  # `ssh root@replica-droplet-ip 'sed -i "s/#{PRIMARY_DB_HOST}/#{REPLICA_DB_HOST}/g" /path/to/config.yml'`
  # For this example, we'll just log the action.
  puts "Application configurations would be updated here."
  return true
end

# --- Droplet Tagging/Management (Conceptual) ---
# This would involve using the DigitalOcean API to tag Droplets,
# potentially to direct traffic or manage their roles.
def reassign_droplet_roles
  puts "Reassigning Droplet roles (e.g., tagging)..."
  # Example: Remove PRIMARY_APP_TAG from the failed primary and apply it to the
  # promoted replica Droplets (replacing REPLICA_APP_TAG there).
  # This is highly dependent on how you manage your application deployments.
  puts "Droplet roles would be managed via DigitalOcean API here."
  return true
end

post '/trigger_failover' do
  content_type :json
  request.body.rewind
  payload = JSON.parse(request.body.read)
  reason = payload['reason'] || 'unknown'
  puts "Received failover trigger. Reason: #{reason}"

  # --- Failover Sequence ---
  # 1. Promote the replica database
  unless promote_replica_db
    status 500
    return { error: 'Failed to promote replica database' }.to_json
  end

  # 2. Update DNS records to point to the replica infrastructure
  unless update_dns_records
    status 500
    return { error: 'Failed to update DNS records' }.to_json
  end

  # 3. Update application configurations on replica servers
  unless update_app_configs
    status 500
    return { error: 'Failed to update application configurations' }.to_json
  end

  # 4. (Optional) Reassign Droplet roles or update load balancer pools
  unless reassign_droplet_roles
    status 500
    return { error: 'Failed to reassign Droplet roles' }.to_json
  end

  puts "Failover process completed successfully."
  status 200
  { message: 'Failover initiated successfully' }.to_json
end

get '/health' do
  content_type :json
  status 200
  { status: 'ok' }.to_json
end

# --- Deployment Notes ---
# - Ensure the DO_API_TOKEN has sufficient permissions.
# - The server running this Sinatra app needs SSH access to the replica DB host.
# - This script assumes a PostgreSQL setup. Adapt `promote_replica_db` for MySQL or other DBs.
# - DNS updates require integration with DigitalOcean's API.
# - Application configuration updates require a strategy for pushing changes to Droplets.
Deployment and Configuration
1. Provision Droplets: Set up your primary and replica application Droplets in different DigitalOcean regions (e.g., NYC3 and SFO1). Ensure they are tagged appropriately (e.g., 'production-primary', 'production-replica'). Provision your primary and replica database clusters in these respective regions.
2. Configure Database Replication: Set up streaming replication from your primary database to the replica. This is crucial for data consistency.
3. Deploy Health Check Script: Deploy the Python health check script to a dedicated monitoring Droplet or one of your application servers. Configure it to run as a cron job (e.g., every minute).
Example cron entry:
* * * * * /usr/bin/python3 /path/to/your/health_check.py >> /var/log/health_check.log 2>&1
4. Deploy Failover Orchestration Service: Deploy the Ruby Sinatra application on a Droplet in one of your regions (preferably not the primary region to avoid a single point of failure for the orchestrator itself). Ensure it's running and accessible.
5. Set Environment Variables: Configure the necessary environment variables for both the health check script and the Sinatra application (database credentials, API tokens, hostnames, etc.).
6. Configure DNS: Point your application's primary DNS record to a load balancer or directly to your primary application Droplet's IP. The failover script will be responsible for updating this record.
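The DNS record update that step 6 delegates to the failover script could look roughly like the sketch below. It only builds the HTTP request against DigitalOcean's domain records endpoint (the same URL shape used in the commented `curl` example) and does not send it. The function name, the `A` record type, and the 60-second TTL are assumptions for illustration.

```python
import json
import urllib.request

DO_API_BASE = "https://api.digitalocean.com/v2"


def build_dns_update_request(domain: str, record_id: int, new_ip: str,
                             api_token: str, ttl: int = 60) -> urllib.request.Request:
    """Build (but do not send) a request that repoints an A record at new_ip."""
    body = json.dumps({"type": "A", "data": new_ip, "ttl": ttl}).encode("utf-8")
    return urllib.request.Request(
        url=f"{DO_API_BASE}/domains/{domain}/records/{record_id}",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )
```

Sending it would be a matter of `urllib.request.urlopen(request, timeout=10)` with error handling around it. Keeping construction separate from sending makes the request easy to inspect in a dry run before a real failover depends on it.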
Testing the Failover
Thorough testing is paramount. Simulate failures by:
- Stopping the primary database service.
- Blocking network access to the primary database.
- Simulating application unresponsiveness by returning 5xx errors from the health check endpoint.
Monitor the logs of the health check script and the failover orchestration service to verify that the failover process is triggered and executed correctly. Crucially, test the application's functionality after a failover to ensure data integrity and availability.
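To put a number on a failover drill, one option is to probe the public health endpoint at a fixed interval during the test and compute the longest outage afterwards. The helper below is a hypothetical sketch that works on already collected `(timestamp, ok)` samples; how the samples are gathered is left open.

```python
from typing import Iterable, Optional, Tuple


def longest_outage_seconds(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest span from a first failed probe to the next successful one.

    samples must be ordered by timestamp. An outage that is still open when the
    samples end is ignored, because its real length is unknown.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp  # first failure of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # service recovered
    return longest
```

Comparing this figure across drills shows whether changes to check intervals, DNS TTLs, or the promotion step actually shorten the user-visible outage.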
Considerations for Production
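Database Specifics: The `promote_replica_db` function is highly conceptual. For PostgreSQL, a standby is promoted with `pg_ctl promote` or `SELECT pg_promote()`; if the old primary comes back online, `pg_rewind` can resynchronize it so that it can rejoin as a standby. For MySQL, promoting a replica typically means running `STOP SLAVE;` and `RESET SLAVE ALL;` on it (`STOP REPLICA;` and `RESET REPLICA ALL;` from MySQL 8.0.22), while `CHANGE MASTER TO ... MASTER_AUTO_POSITION = 1; START SLAVE;` is what you run on any remaining servers to repoint them at the new primary. Managed database services often provide APIs for failover that are more robust.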
DNS Propagation: Be mindful of DNS propagation delays. Using low TTL values for your DNS records can help, but it's not instantaneous. Consider using a Global Load Balancer if your provider offers one.
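A rough upper bound on how long clients may keep hitting the failed primary is the detection time plus the failover run time plus the DNS TTL. The sketch below just adds those terms; the function name and the example numbers are assumptions, and it ignores resolvers that cache records for longer than the TTL.

```python
def worst_case_switchover_seconds(check_interval: float,
                                  checks_to_detect: int,
                                  failover_duration: float,
                                  dns_ttl: float) -> float:
    """Estimate the worst-case time until clients resolve the new address.

    check_interval:    seconds between health check runs (60 for a per-minute cron job)
    checks_to_detect:  failed runs needed before failover is triggered
    failover_duration: assumed seconds for promotion, DNS update, and config changes
    dns_ttl:           TTL of the DNS record in seconds
    """
    detection = check_interval * checks_to_detect
    return detection + failover_duration + dns_ttl
```

With a per-minute cron job that triggers on the first failure, an assumed 30-second failover run, and a 60-second TTL, `worst_case_switchover_seconds(60, 1, 30, 60)` gives 150 seconds, which shows that lowering the TTL alone only removes part of the delay.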
State Management: If your application has stateful components (e.g., background job queues, in-memory caches), ensure these are also handled during failover. This might involve replicating queue states or re-initializing caches.
Orchestrator Redundancy: The failover orchestration service itself should be highly available. Consider running multiple instances of the Sinatra app behind a load balancer, or using a more robust orchestration tool like Nomad or Kubernetes.
Rollback Strategy: Define a clear rollback procedure in case the automated failover has unintended consequences or if the primary can be restored quickly.