Automating Multi-Region Redundancy for Ruby Architectures on OVH
Establishing Multi-Region Redundancy with OVHcloud: A Deep Dive for Ruby Architectures
Achieving robust disaster recovery for critical Ruby applications necessitates a multi-region strategy. This post outlines a practical, production-grade approach leveraging OVHcloud’s infrastructure, focusing on automated failover and data synchronization for a typical Rails application stack. We’ll cover database replication, application server deployment, and load balancing configuration.
Database Replication Strategy: PostgreSQL in a Multi-Region Setup
For our Ruby on Rails application, we’ll opt for PostgreSQL, a reliable and feature-rich relational database. OVHcloud’s Public Cloud instances provide the necessary building blocks. A primary-replica setup across two distinct OVHcloud regions (e.g., GRA1 and RBX3) is a common and effective pattern. We’ll configure streaming replication to ensure near real-time data consistency.
Primary Database Server Configuration (GRA1)
On the primary instance in GRA1, ensure PostgreSQL is installed and running. The key is to configure postgresql.conf and pg_hba.conf to enable replication and allow connections from the replica.
postgresql.conf Adjustments
Locate your postgresql.conf file (typically in /etc/postgresql/[version]/main/). Modify the following parameters:
wal_level = replica
max_wal_senders = 5
wal_keep_segments = 64   # PostgreSQL 12 and earlier; on 13+ use wal_keep_size instead (e.g. wal_keep_size = 1GB)
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal-archive/%f'
The archive_command is crucial for Point-In-Time Recovery (PITR) and can be adapted to use rsync or other transfer methods for offsite backups. Ensure the /var/lib/postgresql/wal-archive/ directory exists and is writable by the PostgreSQL user.
pg_hba.conf Configuration
Allow the replica server to connect for replication. Add a line like this, replacing [replica_ip_address] with the private IP of your replica instance in RBX3:
host replication repl_user [replica_ip_address]/32 md5
Create a dedicated replication user:
-- Connect to PostgreSQL as a superuser
CREATE ROLE repl_user WITH REPLICATION LOGIN PASSWORD 'your_strong_replication_password';
Restart PostgreSQL after these changes:
sudo systemctl restart postgresql
Replica Database Server Configuration (RBX3)
On the replica instance in RBX3, install PostgreSQL. Before starting it, you need to configure it to connect to the primary. First, stop the PostgreSQL service if it auto-started:
sudo systemctl stop postgresql
Clean out the default data directory to ensure a clean clone:
sudo rm -rf /var/lib/postgresql/[version]/main/*
Perform the base backup using pg_basebackup. Replace [primary_ip_address] with the private IP of your GRA1 primary instance and repl_user with your replication username.
sudo -u postgres pg_basebackup -h [primary_ip_address] -U repl_user -D /var/lib/postgresql/[version]/main -P -v -R
The -R flag automatically creates the standby.signal file and a postgresql.auto.conf with the necessary connection information for streaming replication. Ensure the PostgreSQL data directory permissions are correct.
postgresql.conf on Replica
The -R flag in pg_basebackup should have created a postgresql.auto.conf file in the data directory. Verify it contains settings like:
primary_conninfo = 'host=[primary_ip_address] port=5432 user=repl_user password=your_strong_replication_password sslmode=prefer sslcompression=0'
primary_slot_name = 'replication_slot_name'
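It’s highly recommended to create a replication slot on the primary to prevent WAL segments from being removed before the replica has received them. Note that -R only writes primary_slot_name when pg_basebackup is also given the slot with -S; otherwise add the setting by hand. Create the slot on the primary before taking the base backup: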
-- On the primary:
SELECT pg_create_physical_replication_slot('replication_slot_name');
Start the PostgreSQL service on the replica:
sudo systemctl start postgresql
Monitor the replication status on both servers using pg_stat_replication on the primary and pg_stat_wal_receiver on the replica.
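To turn those views into a number you can alert on, one option is to compare WAL positions. The sketch below is a minimal illustration, not part of the original setup: it assumes you have already fetched the primary's current LSN (for example from pg_current_wal_lsn()) and the replica's replay_lsn from pg_stat_replication as strings, and the helper names lsn_to_bytes and replication_lag_bytes are hypothetical.

```ruby
# Convert a PostgreSQL LSN such as "0/3000060" into an absolute byte position.
# An LSN is written as two hexadecimal numbers: the high and low 32 bits.
def lsn_to_bytes(lsn)
  high, low = lsn.to_s.split("/", 2)
  raise ArgumentError, "invalid LSN: #{lsn.inspect}" unless high && low
  (Integer(high, 16) << 32) + Integer(low, 16)
end

# Bytes of WAL the replica has not replayed yet (never negative).
def replication_lag_bytes(primary_lsn, replica_replay_lsn)
  [lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_replay_lsn), 0].max
end
```

PostgreSQL can do the same subtraction server-side with pg_wal_lsn_diff; doing it in Ruby is mainly useful when the two values come from separate connections.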
Application Deployment and Load Balancing
For a typical Rails application, we’ll deploy stateless application servers in both regions. OVHcloud’s Load Balancer service will distribute traffic and manage failover.
Deploying Rails Application Servers
Use a configuration management tool like Ansible or Chef, or a container orchestration platform like Kubernetes (using OVHcloud Managed Kubernetes Service) to deploy your Rails application consistently across both regions. Configure the application to send writes to the single primary in GRA1 and reads to a replica, preferably one in its own region. In this two-server example the only replica is in RBX3, so GRA1 application servers that read from it pay cross-region latency. For simplicity in this example, we’ll assume direct server deployment.
On each application server (e.g., in GRA1 and RBX3), your database.yml should reflect this:
production:
  primary:
    adapter: postgresql
    encoding: unicode
    database: your_app_db
    pool: 5
    username: your_db_user
    password: your_db_password
    host: &primary_db_host "[primary_db_private_ip_in_gra1]" # Use private IP for GRA1
  replica:
    adapter: postgresql
    encoding: unicode
    database: your_app_db
    pool: 5
    username: your_db_user
    password: your_db_password
    host: &replica_db_host "[replica_db_private_ip_in_rbx3]" # Use private IP for RBX3
    replica: true
# In your Rails application, you'll need logic to direct writes to 'primary'
# and reads to 'replica'. Gems like 'makara' can help manage this.
For write operations, your application logic (or a gem like makara) must explicitly use the primary configuration. For read operations, it should use the replica configuration. This offloads read traffic from the primary, improving performance.
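The routing decision itself can be sketched in plain Ruby. The example below is an illustration under assumptions, not the behavior of makara or of any particular gem: it sends non-read HTTP methods to the primary, and keeps a client on the primary for a short window after its last write so it does not read stale data from a lagging replica. The names database_role and READ_METHODS and the 2-second default are hypothetical.

```ruby
READ_METHODS = %w[GET HEAD].freeze

# Decide which database role a request should use.
# last_write_at: Time of this client's most recent write, or nil if unknown.
# delay: seconds to keep reading from the primary after a write.
def database_role(http_method, last_write_at: nil, now: Time.now, delay: 2)
  return :writing unless READ_METHODS.include?(http_method.to_s.upcase)
  return :writing if last_write_at && (now - last_write_at) < delay
  :reading
end
```

Rails 6 and later ship built-in multiple-database support (connects_to and connected_to) that can apply a similar rule, so a hand-rolled router like this is mostly useful for understanding the trade-off.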
OVHcloud Load Balancer Configuration
OVHcloud’s Load Balancer service can be configured to manage traffic distribution and health checks. We’ll set up a Global Load Balancer (GLB) and regional load balancers.
Global Load Balancer (GLB)
The GLB will have a single public IP address and will direct traffic to the regional load balancers based on health checks. Geo-proximity routing is also possible, but for disaster recovery an active-active or active-passive policy is more common.
Regional Load Balancers
Create two regional load balancers, one in GRA1 and one in RBX3. Each will point to the application servers within its region.
Health Checks
Configure health checks for your application servers. A simple HTTP check on a dedicated health endpoint (e.g., /health) is usually sufficient. The health check should verify that the application is running and can connect to its local database replica.
# Example health check endpoint in Rails (config/routes.rb)
get '/health', to: 'health#show'

# Example HealthController (app/controllers/health_controller.rb)
class HealthController < ApplicationController
  def show
    # Check database connection
    ActiveRecord::Base.connection.execute('SELECT 1')
    render json: { status: 'ok', database: 'connected' }, status: :ok
  rescue StandardError => e
    render json: { status: 'error', message: e.message }, status: :internal_server_error
  end
end
Failover Logic
The OVHcloud Load Balancer will automatically stop sending traffic to unhealthy instances or regions. For a full regional failover, you might configure the GLB to direct all traffic to one region, with the other region acting as a standby. If the primary region becomes unhealthy, the GLB will shift traffic to the secondary region.
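The active-passive rule described above amounts to "use the first healthy region in priority order". The following is a minimal sketch of that rule only; it does not call any OVHcloud API, and the function name active_region and the shape of the health hash are assumptions made for illustration.

```ruby
# regions: region names in priority order (first entry is the preferred one).
# health:  hash mapping region name => true/false from the latest health check.
# Returns the region that should receive traffic, or nil if none is healthy.
def active_region(regions, health)
  regions.find { |region| health[region] }
end
```

For example, active_region(%w[GRA1 RBX3], "GRA1" => false, "RBX3" => true) selects RBX3, and traffic returns to GRA1 as soon as its health entry is true again.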
Automating Failover and Failback
Manual failover is prone to error and delay. Automation is key for effective disaster recovery.
Database Failover
Automating database failover is complex. A common approach involves:
- Monitoring replication lag and primary health.
- In case of primary failure, promoting the replica to become the new primary.
- Reconfiguring other replicas (if any) to follow the new primary.
- Updating application configurations (e.g., database.yml or DNS records) to point to the new primary.
Tools like repmgr, or custom scripts built on PostgreSQL’s own tooling (pg_ctl promote or the pg_promote() function), can facilitate this. For a simpler active-passive setup, you might have a script that:
- Detects primary failure (e.g., via failed health checks or monitoring tools).
- Connects to the replica in RBX3 and executes pg_ctl promote.
- Updates the application’s configuration or a central configuration store (like Consul or etcd) with the new primary’s IP address.
- Triggers a rolling restart of application servers to pick up the new configuration.
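The steps above can be sketched as a small coordinator. This is an illustrative outline under assumptions rather than a production tool: the class name FailoverCoordinator is hypothetical, and the three steps (promotion, publishing the new address, restarting application servers) are injected as callables so the real commands stay outside the sketch. It requires several consecutive failed checks before acting, so a single dropped health check does not trigger a promotion.

```ruby
# Counts consecutive failed health checks on the primary and runs the
# failover steps once, when the threshold is reached.
class FailoverCoordinator
  def initialize(threshold:, promote:, publish_primary:, restart_apps:)
    @threshold = threshold
    @promote = promote
    @publish_primary = publish_primary
    @restart_apps = restart_apps
    @failures = 0
    @failed_over = false
  end

  def failed_over?
    @failed_over
  end

  # Record one health-check result for the current primary.
  def record(primary_healthy)
    return if @failed_over
    @failures = primary_healthy ? 0 : @failures + 1
    fail_over! if @failures >= @threshold
  end

  private

  def fail_over!
    new_primary = @promote.call         # e.g. run `pg_ctl promote` on the replica
    @publish_primary.call(new_primary)  # e.g. write the address to Consul or etcd
    @restart_apps.call                  # e.g. rolling restart of application servers
    @failed_over = true
  end
end
```

Keeping the side effects in injected callables also makes it straightforward to rehearse the sequence in a test environment before pointing it at real servers.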
Application Server Failover
The OVHcloud Load Balancer handles application server failover automatically based on health checks. If all servers in a region fail, the GLB will direct traffic to the healthy region.
Failback Strategy
Failback (returning operations to the original primary region) requires careful planning:
- Ensure the original primary database server is healthy and synchronized.
- Because the replica was promoted during failover, the original primary must be rebuilt as a replica of the current primary (for example with pg_rewind or a fresh pg_basebackup) before it can take over again.
- Update application configurations to point back to the original primary region.
- Gradually shift traffic back using the load balancer.
Automating failback is often more challenging than failover due to the need to maintain data consistency during the transition. Manual intervention with thorough validation is often preferred for failback.
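Even when failback stays manual, the validation can be written down as an explicit gate. The sketch below is one possible shape for such a check, with hypothetical names (FailbackState, failback_blockers) and illustrative thresholds; it assumes you collect the inputs yourself, for instance pg_is_in_recovery() on the old primary and the replay lag in bytes.

```ruby
# Snapshot of the facts to verify before shifting traffic back.
FailbackState = Struct.new(
  :old_primary_in_recovery, # true if the old primary currently runs as a replica
  :replay_lag_bytes,        # WAL bytes it still has to replay
  :app_error_rate,          # fraction of failing requests in the current region
  keyword_init: true
)

# Returns a list of human-readable reasons not to fail back yet.
# An empty list means every check passed.
def failback_blockers(state, max_lag_bytes: 0, max_error_rate: 0.01)
  blockers = []
  blockers << "old primary is not running as a replica" unless state.old_primary_in_recovery
  blockers << "replay lag is #{state.replay_lag_bytes} bytes" if state.replay_lag_bytes > max_lag_bytes
  blockers << "application error rate is too high" if state.app_error_rate > max_error_rate
  blockers
end
```

Returning the reasons rather than a bare boolean gives the operator something concrete to act on when the gate refuses.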
Monitoring and Alerting
Comprehensive monitoring is non-negotiable. Implement:
- Replication Lag: Monitor pg_stat_replication on the primary and pg_stat_wal_receiver on the replica. Set alerts for excessive lag.
- Database Health: Monitor CPU, memory, disk I/O, and connection counts on both database servers.
- Application Health: Monitor application response times, error rates, and resource utilization on application servers.
- Load Balancer Status: Monitor the health of backend servers and regions via the OVHcloud console or API.
- Network Latency: Monitor latency between regions, especially for database replication.
Utilize tools like Prometheus/Grafana, Datadog, or OVHcloud’s integrated monitoring solutions. Configure alerts to notify your team immediately of any issues that could impact availability.
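If you want a quick inter-region latency probe before wiring up a full monitoring stack, timing a TCP connect to the remote database port is a rough proxy. The helper below is a minimal sketch using only Ruby's standard library; the name tcp_connect_ms and the 2-second default timeout are assumptions, and the result includes connection setup only, not query time.

```ruby
require "socket"

# Time a TCP connect to host:port, in milliseconds.
# Returns nil if the connection is refused, unreachable, or times out.
def tcp_connect_ms(host, port, timeout: 2)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  Socket.tcp(host, port, connect_timeout: timeout) { } # block form closes the socket
  ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round(2)
rescue SystemCallError, SocketError
  nil
end
```

A call such as tcp_connect_ms("[primary_ip_address]", 5432) from an RBX3 server could be run on a schedule and exported as a metric to whichever tool you chose above.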
Conclusion
Implementing multi-region redundancy for your Ruby architecture on OVHcloud requires a layered approach, addressing database replication, application deployment, and intelligent load balancing. By automating key failover processes and establishing robust monitoring, you can significantly enhance your application’s resilience against regional outages, ensuring business continuity.