Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Ruby Deployments on OVH

Establishing a Robust MySQL Replication Topology for High Availability

A foundational element for any disaster recovery strategy involving MySQL is a well-architected replication setup. For automated failover, we need a primary-replica configuration that is not only functional but also monitored and capable of being promoted programmatically. We’ll focus on asynchronous replication initially, as it’s the most common and performant, with considerations for semi-synchronous replication if stricter data consistency guarantees are paramount.

Our topology will consist of at least one primary instance and multiple replica instances. For OVH deployments, this often means leveraging their dedicated servers or Public Cloud instances. We’ll assume a basic setup where MySQL is installed and configured on each node.

MySQL Primary Configuration (my.cnf)

On the designated primary server, the MySQL configuration file (typically /etc/mysql/my.cnf or /etc/my.cnf) needs specific settings to enable binary logging and define a unique server ID. This is crucial for replication.

[mysqld]
server-id               = 1
log_bin                 = /var/log/mysql/mysql-bin.log
binlog_format           = ROW
gtid_mode               = ON
enforce_gtid_consistency= ON
relay_log               = /var/log/mysql/mysql-relay-bin.log
read_only               = OFF
innodb_flush_log_at_trx_commit = 1
innodb_buffer_pool_size = 4G  # Adjust based on your server's RAM
max_connections         = 200 # Adjust as needed
skip-name-resolve

Key parameters:

server-id: Must be unique across all servers in the replication topology. We use ‘1’ for the primary.
log_bin: Enables binary logging, which records all data changes.
binlog_format=ROW: Recommended for consistency and avoiding issues with non-deterministic statements.
gtid_mode=ON and enforce_gtid_consistency=ON: Global Transaction Identifiers simplify failover and replica management significantly.
read_only=OFF: Ensures the primary can accept writes.
innodb_flush_log_at_trx_commit=1: Guarantees durability (ACID compliance) at the cost of some performance.
skip-name-resolve: Improves performance by avoiding DNS lookups for client connections.

MySQL Replica Configuration (my.cnf)

Replica servers require similar configurations but with a different server-id and settings to point to the primary.

[mysqld]
server-id               = 2 # Unique ID for this replica
log_bin                 = /var/log/mysql/mysql-bin.log # Still good practice for chained replication or if it becomes a primary
binlog_format           = ROW
gtid_mode               = ON
enforce_gtid_consistency= ON
relay_log               = /var/log/mysql/mysql-relay-bin.log
read_only               = ON # Crucial: Prevents accidental writes to replicas
innodb_flush_log_at_trx_commit = 1
innodb_buffer_pool_size = 4G  # Adjust based on your server's RAM
max_connections         = 200 # Adjust as needed
skip-name-resolve

The critical difference here is read_only=ON. This prevents applications from writing directly to a replica, which would break replication consistency.

Setting up Replication User and Initializing Replicas

On the primary MySQL server, create a dedicated replication user. This user needs the REPLICATION SLAVE privilege.

-- On the primary MySQL server
CREATE USER 'replicator'@'%' IDENTIFIED BY 'your_strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';
FLUSH PRIVILEGES;

To initialize a replica, you need a consistent snapshot of the primary’s data. The most robust method with GTIDs is to use mysqldump with the --master-data=2 and --single-transaction flags, or for larger datasets, consider Percona XtraBackup.

First, get the current binary log file and position (or GTID set) from the primary. With GTIDs, this is simpler.

-- On the primary MySQL server
SHOW MASTER STATUS;
-- Or, to get GTID set:
SHOW GLOBAL STATUS LIKE 'Gtid_executed';

Then, take a dump. For GTID-based replication, mysqldump will automatically include the necessary GTID information if gtid_mode=ON.

# On the primary or a client machine
mysqldump -u root -p --all-databases --master-data=2 --single-transaction --gtid > initial_dump.sql
# If using Percona XtraBackup for large datasets:
# xtrabackup --backup --target-dir=/path/to/backup --user=root --password=your_root_password
# xtrabackup --prepare --target-dir=/path/to/backup
# Then copy the prepared data to the replica's data directory.

On each replica, stop MySQL, clear its data directory (or restore from backup), and then import the dump. After importing, configure the replica to connect to the primary.

-- On the replica MySQL server
-- Stop replication if it was running
STOP SLAVE;
-- Reset slave configuration (use with caution if data is already present)
RESET SLAVE ALL;

-- Configure replication to point to the primary
CHANGE MASTER TO
  MASTER_HOST='primary_ip_address',
  MASTER_USER='replicator',
  MASTER_PASSWORD='your_strong_password',
  MASTER_PORT=3306,
  MASTER_AUTO_POSITION=1; -- Use AUTO_POSITION for GTID-based replication

-- Start replication
START SLAVE;

-- Verify replication status
SHOW SLAVE STATUS\G

The SHOW SLAVE STATUS\G output is critical. Look for Slave_IO_Running: Yes and Slave_SQL_Running: Yes. Also, monitor Seconds_Behind_Master; it should ideally be 0 or very low.

Implementing Automated Failover with Orchestration Tools

Manual failover is error-prone and slow. For automated failover, we need a mechanism that monitors the health of the primary and can promote a replica when the primary becomes unavailable. Orchestration tools like Orchestrator, MHA (Master High Availability), or custom scripts leveraging tools like pt-heartbeat and a distributed lock manager (e.g., etcd, ZooKeeper) are common.

We’ll outline an approach using Orchestrator, a popular open-source tool designed for MySQL topology management and automated failover. Orchestrator needs to be installed on one or more dedicated nodes (or co-located on non-database servers).

Orchestrator Setup and Configuration

Orchestrator requires a backend database to store its topology information. PostgreSQL or MySQL can be used. For high availability of Orchestrator itself, run multiple instances in a cluster.

Install Orchestrator (e.g., from binary or Docker). The configuration file (orchestrator.conf.json) is key. Here’s a simplified example:

{
  "Debug": false,
  "ListenAddress": ":3000",
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "your_orchestrator_db_password",
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "DiscoveryPeriodSeconds": 10,
  "FailureDetectionPeriodBlockSeconds": 600,
  "RecoveryPeriodBlockSeconds": 300,
  "Promotion-User": "orchestrator_promote",
  "Promotion-Password": "your_promote_password",
  "Promotion-Host": "127.0.0.1",
  "Promotion-Port": 3306,
  "Promotion-Database": "orchestrator",
  "DetectClusterAliasConflicts": true,
  "DetectClusterAliasGroup": true,
  "SlaveLagQuery": "SELECT * FROM mysql.slave_master_info WHERE Master_Host IS NOT NULL AND Slave_SQL_Running = 'Yes' AND Slave_IO_Running = 'Yes' AND Seconds_Behind_Master > {lag_minutes} * 60",
  "AutoDiscoverByClusterName": true,
  "ClusterNameDetectionQuery": "SELECT @@global.server_id",
  "SnapshotTopologies": true,
  "SnapshotTopologiesPeriodSeconds": 3600,
  "Hooks": [
    {
      "name": "MySQL-Promotion",
      "when": "post-promotion",
      "command": "/path/to/your/post-promotion-hook.sh",
      "timeout-seconds": 60
    },
    {
      "name": "MySQL-Failover",
      "when": "post-failover",
      "command": "/path/to/your/post-failover-hook.sh",
      "timeout-seconds": 60
    }
  ]
}

You’ll need to create the orchestrator user and database in your backend MySQL instance. Also, create the orchestrator_promote user on the primary and replicas with sufficient privileges to stop/start slaves, reset slave configurations, and potentially change the read-only status.

-- On the backend MySQL for Orchestrator
CREATE DATABASE orchestrator;
-- Grant privileges for the orchestrator user
GRANT ALL PRIVILEGES ON orchestrator.* TO 'orchestrator'@'%' IDENTIFIED BY 'your_orchestrator_db_password';
FLUSH PRIVILEGES;

-- On each MySQL server (primary and replicas) for promotion
CREATE USER 'orchestrator_promote'@'%' IDENTIFIED BY 'your_promote_password';
GRANT SUPER, REPLICATION SLAVE, REPLICATION CLIENT, RELOAD, PROCESS, LOCK TABLES, RESTART ON *.* TO 'orchestrator_promote'@'%';
FLUSH PRIVILEGES;

Start Orchestrator. It will begin discovering your MySQL topology. You can then use its API or web UI to explicitly define clusters and their desired states.

Orchestrator’s Automated Failover Mechanism

Orchestrator continuously monitors the health of MySQL instances. When it detects a primary failure (e.g., it’s unreachable, or health checks fail), it initiates a failover process:

Detection: Orchestrator’s discovery process identifies that the primary is down.
Replica Selection: It selects the best replica to promote based on factors like replication lag, GTID set, and potentially other heuristics.
Promotion: The chosen replica is promoted to become the new primary. This involves:
- Setting read_only=OFF on the chosen replica.
- Setting read_only=ON on all other former replicas.
- Reconfiguring other replicas to replicate from the newly promoted primary.
Application Reconfiguration (Crucial): This is the most challenging part. Orchestrator itself doesn’t typically reconfigure your application’s database connection strings. This requires an external mechanism.

Integrating with Ruby Applications for Seamless Reconfiguration

Automated failover is only half the battle. The application must be able to connect to the new primary seamlessly. For Ruby applications, this often involves managing database connection pools and dynamically updating them.

Database Connection Management in Rails

In a Rails application, database connections are typically managed by ActiveRecord. The config/database.yml file defines connection parameters. For high availability, we need a way to update these parameters without restarting the entire application.

# config/database.yml
production:
  adapter: mysql2
  encoding: utf8mb4
  pool: 5
  username: app_user
  password: your_app_password
  host: primary_db_host # This needs to be dynamically updated
  database: app_database

The host parameter is the critical piece. It needs to point to a stable endpoint that always resolves to the current primary. This can be achieved using:

DNS CNAME Record: A CNAME record (e.g., db.myapp.com) pointing to the IP address of the current primary. This requires an external system to update the DNS record upon failover.
Virtual IP (VIP): A floating IP address managed by a cluster manager (like Keepalived) that is moved to the new primary server.
Load Balancer: A dedicated load balancer (e.g., HAProxy, Nginx) that directs traffic to the current primary.

Dynamic Configuration Updates

When Orchestrator performs a promotion, it can trigger a hook (as defined in orchestrator.conf.json). This hook script is the ideal place to initiate the application-level reconfiguration.

Let’s consider the DNS CNAME approach. The hook script would need to interact with your DNS provider’s API (e.g., OVH’s API, AWS Route 53 API, Cloudflare API) to update the CNAME record.

#!/bin/bash
# /path/to/your/post-promotion-hook.sh

# This script is executed by Orchestrator after a successful promotion.
# It should update the application's endpoint to point to the new primary.

NEW_PRIMARY_IP="$1" # Orchestrator passes the new primary's IP as the first argument
CLUSTER_NAME="$2"    # Orchestrator passes the cluster name as the second argument
LOG_FILE="/var/log/orchestrator_hooks.log"

echo "$(date): Post-promotion hook triggered for cluster $CLUSTER_NAME. New primary IP: $NEW_PRIMARY_IP" >> $LOG_FILE

# --- Example: Update DNS CNAME using a hypothetical DNS API client ---
# Replace with your actual DNS provider's CLI or API calls.
# For OVH, you might use 'ovh' CLI or direct API calls.

DNS_RECORD_NAME="db.myapp.com"
DNS_ZONE="myapp.com"

# Example using a placeholder 'update-dns' command
# This command would typically involve authentication and specific API calls.
# update-dns --record-name $DNS_RECORD_NAME --zone $DNS_ZONE --type CNAME --value $NEW_PRIMARY_IP --api-key YOUR_API_KEY --api-secret YOUR_API_SECRET

# A more robust solution might involve a dedicated service that listens for Orchestrator hooks
# and manages DNS/VIP/Load Balancer updates.

# --- Example: Update a configuration file that your application reloads ---
# If your application can reload its config without a full restart, this is an option.
# For Rails, this is less common for database host changes without a restart.

# echo "Updating application config..." >> $LOG_FILE
# sed -i "s/host: .*/host: $NEW_PRIMARY_IP/" /path/to/your/app/config/database.yml
# touch /path/to/your/app/tmp/restart.txt # For Passenger/Puma to reload

# --- Example: Notify monitoring systems ---
# curl -X POST -d "message=MySQL primary promoted to $NEW_PRIMARY_IP for cluster $CLUSTER_NAME" http://your-monitoring-webhook/

echo "$(date): DNS update initiated for $DNS_RECORD_NAME to point to $NEW_PRIMARY_IP." >> $LOG_FILE
echo "$(date): Hook script finished." >> $LOG_FILE

exit 0

The hook script receives the new primary’s IP address as an argument. It’s crucial to implement robust error handling and retry mechanisms within this script, as DNS propagation can take time, and API calls can fail.

Using a Load Balancer (HAProxy)

A more resilient approach is to use a load balancer like HAProxy. HAProxy can monitor the health of your MySQL instances and automatically direct traffic to the available primary.

# /etc/haproxy/haproxy.cfg
frontend mysql_frontend
    bind *:3306
    mode tcp
    default_backend mysql_backend

backend mysql_backend
    mode tcp
    balance roundrobin
    option tcp-check
    # Health check for MySQL: expects a specific string on successful connection
    # This requires a MySQL user with minimal privileges to run 'SELECT 1;'
    tcp-check connect port 3306
    tcp-check send GET_MASTER_STATUS\r\n
    tcp-check expect ! r/^$/ # Expect any response, not an empty one
    # If using Orchestrator, you might have a script that updates the backend dynamically.
    # Otherwise, manually add/remove servers or use HAProxy's API/runtime configuration.
    server primary_db_1 192.168.1.10:3306 check port 3306 inter 2s fall 3 rise 2
    server replica_db_1 192.168.1.11:3306 check port 3306 inter 2s fall 3 rise 2 backup
    server replica_db_2 192.168.1.12:3306 check port 3306 inter 2s fall 3 rise 2 backup

In this HAProxy setup, the application connects to the HAProxy IP on port 3306. HAProxy’s health checks will determine which server is the active primary. Orchestrator’s hook script could then be used to dynamically update HAProxy’s backend configuration or signal HAProxy to reconfigure itself.

Monitoring and Alerting for Disaster Recovery

A robust disaster recovery strategy is incomplete without comprehensive monitoring and alerting. Key metrics to track include:

Replication Lag: Seconds_Behind_Master in SHOW SLAVE STATUS.
Replication Status: Slave_IO_Running and Slave_SQL_Running.
Primary Availability: Network reachability, MySQL service status.
Disk Space: Especially for binary logs and data directories.
CPU/Memory Usage: To identify performance bottlenecks that might affect failover.
Orchestrator Health: Ensure Orchestrator instances are running and discovering the topology correctly.

Tools like Prometheus with MySQL exporter, Zabbix, Nagios, or Datadog can be used. Alerts should be configured for critical failures (e.g., replication broken, primary down) and potential issues (e.g., high replication lag).

Example: Prometheus Alerting Rule for Replication Lag

groups:
- name: mysql_replication_alerts
  rules:
  - alert: MySQLReplicationLagging
    expr: mysql_slave_status_seconds_behind_master{slave_io_running="Yes", slave_sql_running="Yes"} > 300 # Lagging by more than 5 minutes
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "MySQL replication lag detected on {{ $labels.instance }}"
      description: "MySQL replica {{ $labels.instance }} is lagging behind master by {{ $value }} seconds."

  - alert: MySQLReplicationBroken
    expr: |
      mysql_slave_io_running != 1 OR mysql_slave_sql_running != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "MySQL replication is broken on {{ $labels.instance }}"
      description: "MySQL replica {{ $labels.instance }} has IO running: {{ mysql_slave_io_running }} and SQL running: {{ mysql_slave_sql_running }}."

These alerts should trigger notifications to your on-call engineering team, allowing for prompt investigation and intervention, even if automated failover is in place.