Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Ruby Deployments on OVH
Establishing a Robust MySQL Replication Topology for High Availability
A foundational element for any disaster recovery strategy involving MySQL is a well-architected replication setup. For automated failover, we need a primary-replica configuration that is not only functional but also monitored and capable of being promoted programmatically. We’ll focus on asynchronous replication initially, as it’s the most common and performant, with considerations for semi-synchronous replication if stricter data consistency guarantees are paramount.
Our topology will consist of at least one primary instance and multiple replica instances. For OVH deployments, this often means leveraging their dedicated servers or Public Cloud instances. We’ll assume a basic setup where MySQL is installed and configured on each node.
MySQL Primary Configuration (my.cnf)
On the designated primary server, the MySQL configuration file (typically /etc/mysql/my.cnf or /etc/my.cnf) needs specific settings to enable binary logging and define a unique server ID. This is crucial for replication.
[mysqld] server-id = 1 log_bin = /var/log/mysql/mysql-bin.log binlog_format = ROW gtid_mode = ON enforce_gtid_consistency= ON relay_log = /var/log/mysql/mysql-relay-bin.log read_only = OFF innodb_flush_log_at_trx_commit = 1 innodb_buffer_pool_size = 4G # Adjust based on your server's RAM max_connections = 200 # Adjust as needed skip-name-resolve
Key parameters:
server-id: Must be unique across all servers in the replication topology. We use ‘1’ for the primary.log_bin: Enables binary logging, which records all data changes.binlog_format=ROW: Recommended for consistency and avoiding issues with non-deterministic statements.gtid_mode=ONandenforce_gtid_consistency=ON: Global Transaction Identifiers simplify failover and replica management significantly.read_only=OFF: Ensures the primary can accept writes.innodb_flush_log_at_trx_commit=1: Guarantees durability (ACID compliance) at the cost of some performance.skip-name-resolve: Improves performance by avoiding DNS lookups for client connections.
MySQL Replica Configuration (my.cnf)
Replica servers require similar configurations but with a different server-id and settings to point to the primary.
[mysqld] server-id = 2 # Unique ID for this replica log_bin = /var/log/mysql/mysql-bin.log # Still good practice for chained replication or if it becomes a primary binlog_format = ROW gtid_mode = ON enforce_gtid_consistency= ON relay_log = /var/log/mysql/mysql-relay-bin.log read_only = ON # Crucial: Prevents accidental writes to replicas innodb_flush_log_at_trx_commit = 1 innodb_buffer_pool_size = 4G # Adjust based on your server's RAM max_connections = 200 # Adjust as needed skip-name-resolve
The critical difference here is read_only=ON. This prevents applications from writing directly to a replica, which would break replication consistency.
Setting up Replication User and Initializing Replicas
On the primary MySQL server, create a dedicated replication user. This user needs the REPLICATION SLAVE privilege.
-- On the primary MySQL server CREATE USER 'replicator'@'%' IDENTIFIED BY 'your_strong_password'; GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%'; FLUSH PRIVILEGES;
To initialize a replica, you need a consistent snapshot of the primary’s data. The most robust method with GTIDs is to use mysqldump with the --master-data=2 and --single-transaction flags, or for larger datasets, consider Percona XtraBackup.
First, get the current binary log file and position (or GTID set) from the primary. With GTIDs, this is simpler.
-- On the primary MySQL server SHOW MASTER STATUS; -- Or, to get GTID set: SHOW GLOBAL STATUS LIKE 'Gtid_executed';
Then, take a dump. For GTID-based replication, mysqldump will automatically include the necessary GTID information if gtid_mode=ON.
# On the primary or a client machine mysqldump -u root -p --all-databases --master-data=2 --single-transaction --gtid > initial_dump.sql # If using Percona XtraBackup for large datasets: # xtrabackup --backup --target-dir=/path/to/backup --user=root --password=your_root_password # xtrabackup --prepare --target-dir=/path/to/backup # Then copy the prepared data to the replica's data directory.
On each replica, stop MySQL, clear its data directory (or restore from backup), and then import the dump. After importing, configure the replica to connect to the primary.
-- On the replica MySQL server -- Stop replication if it was running STOP SLAVE; -- Reset slave configuration (use with caution if data is already present) RESET SLAVE ALL; -- Configure replication to point to the primary CHANGE MASTER TO MASTER_HOST='primary_ip_address', MASTER_USER='replicator', MASTER_PASSWORD='your_strong_password', MASTER_PORT=3306, MASTER_AUTO_POSITION=1; -- Use AUTO_POSITION for GTID-based replication -- Start replication START SLAVE; -- Verify replication status SHOW SLAVE STATUS\G
The SHOW SLAVE STATUS\G output is critical. Look for Slave_IO_Running: Yes and Slave_SQL_Running: Yes. Also, monitor Seconds_Behind_Master; it should ideally be 0 or very low.
Implementing Automated Failover with Orchestration Tools
Manual failover is error-prone and slow. For automated failover, we need a mechanism that monitors the health of the primary and can promote a replica when the primary becomes unavailable. Orchestration tools like Orchestrator, MHA (Master High Availability), or custom scripts leveraging tools like pt-heartbeat and a distributed lock manager (e.g., etcd, ZooKeeper) are common.
We’ll outline an approach using Orchestrator, a popular open-source tool designed for MySQL topology management and automated failover. Orchestrator needs to be installed on one or more dedicated nodes (or co-located on non-database servers).
Orchestrator Setup and Configuration
Orchestrator requires a backend database to store its topology information. PostgreSQL or MySQL can be used. For high availability of Orchestrator itself, run multiple instances in a cluster.
Install Orchestrator (e.g., from binary or Docker). The configuration file (orchestrator.conf.json) is key. Here’s a simplified example:
{
"Debug": false,
"ListenAddress": ":3000",
"MySQLTopologyUser": "orchestrator",
"MySQLTopologyPassword": "your_orchestrator_db_password",
"MySQLOrchestratorHost": "127.0.0.1",
"MySQLOrchestratorPort": 3306,
"MySQLOrchestratorDatabase": "orchestrator",
"DiscoveryPeriodSeconds": 10,
"FailureDetectionPeriodBlockSeconds": 600,
"RecoveryPeriodBlockSeconds": 300,
"Promotion-User": "orchestrator_promote",
"Promotion-Password": "your_promote_password",
"Promotion-Host": "127.0.0.1",
"Promotion-Port": 3306,
"Promotion-Database": "orchestrator",
"DetectClusterAliasConflicts": true,
"DetectClusterAliasGroup": true,
"SlaveLagQuery": "SELECT * FROM mysql.slave_master_info WHERE Master_Host IS NOT NULL AND Slave_SQL_Running = 'Yes' AND Slave_IO_Running = 'Yes' AND Seconds_Behind_Master > {lag_minutes} * 60",
"AutoDiscoverByClusterName": true,
"ClusterNameDetectionQuery": "SELECT @@global.server_id",
"SnapshotTopologies": true,
"SnapshotTopologiesPeriodSeconds": 3600,
"Hooks": [
{
"name": "MySQL-Promotion",
"when": "post-promotion",
"command": "/path/to/your/post-promotion-hook.sh",
"timeout-seconds": 60
},
{
"name": "MySQL-Failover",
"when": "post-failover",
"command": "/path/to/your/post-failover-hook.sh",
"timeout-seconds": 60
}
]
}
You’ll need to create the orchestrator user and database in your backend MySQL instance. Also, create the orchestrator_promote user on the primary and replicas with sufficient privileges to stop/start slaves, reset slave configurations, and potentially change the read-only status.
-- On the backend MySQL for Orchestrator CREATE DATABASE orchestrator; -- Grant privileges for the orchestrator user GRANT ALL PRIVILEGES ON orchestrator.* TO 'orchestrator'@'%' IDENTIFIED BY 'your_orchestrator_db_password'; FLUSH PRIVILEGES; -- On each MySQL server (primary and replicas) for promotion CREATE USER 'orchestrator_promote'@'%' IDENTIFIED BY 'your_promote_password'; GRANT SUPER, REPLICATION SLAVE, REPLICATION CLIENT, RELOAD, PROCESS, LOCK TABLES, RESTART ON *.* TO 'orchestrator_promote'@'%'; FLUSH PRIVILEGES;
Start Orchestrator. It will begin discovering your MySQL topology. You can then use its API or web UI to explicitly define clusters and their desired states.
Orchestrator’s Automated Failover Mechanism
Orchestrator continuously monitors the health of MySQL instances. When it detects a primary failure (e.g., it’s unreachable, or health checks fail), it initiates a failover process:
- Detection: Orchestrator’s discovery process identifies that the primary is down.
- Replica Selection: It selects the best replica to promote based on factors like replication lag, GTID set, and potentially other heuristics.
- Promotion: The chosen replica is promoted to become the new primary. This involves:
- Setting
read_only=OFFon the chosen replica. - Setting
read_only=ONon all other former replicas. - Reconfiguring other replicas to replicate from the newly promoted primary.
- Setting
- Application Reconfiguration (Crucial): This is the most challenging part. Orchestrator itself doesn’t typically reconfigure your application’s database connection strings. This requires an external mechanism.
Integrating with Ruby Applications for Seamless Reconfiguration
Automated failover is only half the battle. The application must be able to connect to the new primary seamlessly. For Ruby applications, this often involves managing database connection pools and dynamically updating them.
Database Connection Management in Rails
In a Rails application, database connections are typically managed by ActiveRecord. The config/database.yml file defines connection parameters. For high availability, we need a way to update these parameters without restarting the entire application.
# config/database.yml production: adapter: mysql2 encoding: utf8mb4 pool: 5 username: app_user password: your_app_password host: primary_db_host # This needs to be dynamically updated database: app_database
The host parameter is the critical piece. It needs to point to a stable endpoint that always resolves to the current primary. This can be achieved using:
- DNS CNAME Record: A CNAME record (e.g.,
db.myapp.com) pointing to the IP address of the current primary. This requires an external system to update the DNS record upon failover. - Virtual IP (VIP): A floating IP address managed by a cluster manager (like Keepalived) that is moved to the new primary server.
- Load Balancer: A dedicated load balancer (e.g., HAProxy, Nginx) that directs traffic to the current primary.
Dynamic Configuration Updates
When Orchestrator performs a promotion, it can trigger a hook (as defined in orchestrator.conf.json). This hook script is the ideal place to initiate the application-level reconfiguration.
Let’s consider the DNS CNAME approach. The hook script would need to interact with your DNS provider’s API (e.g., OVH’s API, AWS Route 53 API, Cloudflare API) to update the CNAME record.
#!/bin/bash # /path/to/your/post-promotion-hook.sh # This script is executed by Orchestrator after a successful promotion. # It should update the application's endpoint to point to the new primary. NEW_PRIMARY_IP="$1" # Orchestrator passes the new primary's IP as the first argument CLUSTER_NAME="$2" # Orchestrator passes the cluster name as the second argument LOG_FILE="/var/log/orchestrator_hooks.log" echo "$(date): Post-promotion hook triggered for cluster $CLUSTER_NAME. New primary IP: $NEW_PRIMARY_IP" >> $LOG_FILE # --- Example: Update DNS CNAME using a hypothetical DNS API client --- # Replace with your actual DNS provider's CLI or API calls. # For OVH, you might use 'ovh' CLI or direct API calls. DNS_RECORD_NAME="db.myapp.com" DNS_ZONE="myapp.com" # Example using a placeholder 'update-dns' command # This command would typically involve authentication and specific API calls. # update-dns --record-name $DNS_RECORD_NAME --zone $DNS_ZONE --type CNAME --value $NEW_PRIMARY_IP --api-key YOUR_API_KEY --api-secret YOUR_API_SECRET # A more robust solution might involve a dedicated service that listens for Orchestrator hooks # and manages DNS/VIP/Load Balancer updates. # --- Example: Update a configuration file that your application reloads --- # If your application can reload its config without a full restart, this is an option. # For Rails, this is less common for database host changes without a restart. # echo "Updating application config..." >> $LOG_FILE # sed -i "s/host: .*/host: $NEW_PRIMARY_IP/" /path/to/your/app/config/database.yml # touch /path/to/your/app/tmp/restart.txt # For Passenger/Puma to reload # --- Example: Notify monitoring systems --- # curl -X POST -d "message=MySQL primary promoted to $NEW_PRIMARY_IP for cluster $CLUSTER_NAME" http://your-monitoring-webhook/ echo "$(date): DNS update initiated for $DNS_RECORD_NAME to point to $NEW_PRIMARY_IP." >> $LOG_FILE echo "$(date): Hook script finished." >> $LOG_FILE exit 0
The hook script receives the new primary’s IP address as an argument. It’s crucial to implement robust error handling and retry mechanisms within this script, as DNS propagation can take time, and API calls can fail.
Using a Load Balancer (HAProxy)
A more resilient approach is to use a load balancer like HAProxy. HAProxy can monitor the health of your MySQL instances and automatically direct traffic to the available primary.
# /etc/haproxy/haproxy.cfg
frontend mysql_frontend
bind *:3306
mode tcp
default_backend mysql_backend
backend mysql_backend
mode tcp
balance roundrobin
option tcp-check
# Health check for MySQL: expects a specific string on successful connection
# This requires a MySQL user with minimal privileges to run 'SELECT 1;'
tcp-check connect port 3306
tcp-check send GET_MASTER_STATUS\r\n
tcp-check expect ! r/^$/ # Expect any response, not an empty one
# If using Orchestrator, you might have a script that updates the backend dynamically.
# Otherwise, manually add/remove servers or use HAProxy's API/runtime configuration.
server primary_db_1 192.168.1.10:3306 check port 3306 inter 2s fall 3 rise 2
server replica_db_1 192.168.1.11:3306 check port 3306 inter 2s fall 3 rise 2 backup
server replica_db_2 192.168.1.12:3306 check port 3306 inter 2s fall 3 rise 2 backup
In this HAProxy setup, the application connects to the HAProxy IP on port 3306. HAProxy’s health checks will determine which server is the active primary. Orchestrator’s hook script could then be used to dynamically update HAProxy’s backend configuration or signal HAProxy to reconfigure itself.
Monitoring and Alerting for Disaster Recovery
A robust disaster recovery strategy is incomplete without comprehensive monitoring and alerting. Key metrics to track include:
- Replication Lag:
Seconds_Behind_MasterinSHOW SLAVE STATUS. - Replication Status:
Slave_IO_RunningandSlave_SQL_Running. - Primary Availability: Network reachability, MySQL service status.
- Disk Space: Especially for binary logs and data directories.
- CPU/Memory Usage: To identify performance bottlenecks that might affect failover.
- Orchestrator Health: Ensure Orchestrator instances are running and discovering the topology correctly.
Tools like Prometheus with MySQL exporter, Zabbix, Nagios, or Datadog can be used. Alerts should be configured for critical failures (e.g., replication broken, primary down) and potential issues (e.g., high replication lag).
Example: Prometheus Alerting Rule for Replication Lag
groups:
- name: mysql_replication_alerts
rules:
- alert: MySQLReplicationLagging
expr: mysql_slave_status_seconds_behind_master{slave_io_running="Yes", slave_sql_running="Yes"} > 300 # Lagging by more than 5 minutes
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL replication lag detected on {{ $labels.instance }}"
description: "MySQL replica {{ $labels.instance }} is lagging behind master by {{ $value }} seconds."
- alert: MySQLReplicationBroken
expr: |
mysql_slave_io_running != 1 OR mysql_slave_sql_running != 1
for: 1m
labels:
severity: critical
annotations:
summary: "MySQL replication is broken on {{ $labels.instance }}"
description: "MySQL replica {{ $labels.instance }} has IO running: {{ mysql_slave_io_running }} and SQL running: {{ mysql_slave_sql_running }}."
These alerts should trigger notifications to your on-call engineering team, allowing for prompt investigation and intervention, even if automated failover is in place.