Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Shopify Deployments on OVH
Establishing a Robust MySQL Replication Topology for High Availability
For any mission-critical Shopify deployment, a resilient MySQL infrastructure is paramount. We’ll architect a multi-master or active-passive replication setup, focusing on automated failover mechanisms. This involves setting up primary and replica instances, ensuring data consistency, and implementing health checks that can trigger failover events.
Our chosen strategy will be an active-passive setup with semi-synchronous replication for reduced data loss during failover. We’ll leverage GTID (Global Transaction Identifiers) for simplified failover and recovery. The OVH Public Cloud offers robust networking and compute resources ideal for this deployment.
MySQL Configuration for Replication and GTID
On both the primary and replica MySQL servers, the following `my.cnf` configurations are essential. Ensure these are applied consistently across all nodes. We’ll use `server-id` to uniquely identify each instance and `log_bin` to enable binary logging, which is crucial for replication. `gtid_mode=ON` and `enforce_gtid_consistency=ON` are critical for GTID-based replication.
For the primary server:
[mysqld] server-id = 1 log_bin = /var/log/mysql/mysql-bin.log gtid_mode = ON enforce_gtid_consistency = ON binlog_format = ROW relay_log = /var/log/mysql/mysql-relay-bin.log read_only = OFF innodb_flush_log_at_trx_commit = 1 sync_binlog = 1 # Optional: For semi-synchronous replication (requires plugin) # plugin_load_add = rpl_semi_sync_master=rpl_semi_sync_master.so # rpl_semi_sync_master_enabled = 1 # rpl_semi_sync_master_timeout = 10000
For the replica server:
[mysqld] server-id = 2 log_bin = /var/log/mysql/mysql-bin.log gtid_mode = ON enforce_gtid_consistency = ON binlog_format = ROW relay_log = /var/log/mysql/mysql-relay-bin.log read_only = ON innodb_flush_log_at_trx_commit = 1 sync_binlog = 1 # Optional: For semi-synchronous replication (requires plugin) # plugin_load_add = rpl_semi_sync_replica=rpl_semi_sync_replica.so # rpl_semi_sync_replica_enabled = 1
Setting Up Replication Users and Initializing Replicas
On the primary MySQL server, create a dedicated replication user and grant necessary privileges. Then, take a consistent snapshot of the primary database. For GTID-based replication, this is simpler as the replica will automatically determine its starting point.
On the primary:
-- On Primary MySQL Server CREATE USER 'repl_user'@'%' IDENTIFIED BY 'your_strong_password'; GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'%'; FLUSH PRIVILEGES; -- To get the current GTID set for initial sync (if not using a tool like Percona XtraBackup) SHOW MASTER STATUS; -- Note down the 'Executed_Gtid_Set' value. This will be used if manual sync is needed.
To initialize the replica, the most robust method is using Percona XtraBackup. This tool performs hot backups without blocking writes on the primary. After taking a backup, restore it to the replica server. Then, configure the replica to connect to the primary.
On the replica (after restoring backup):
-- On Replica MySQL Server -- Stop replica threads if they are running STOP SLAVE; -- Configure the replica to connect to the primary using GTID CHANGE MASTER TO MASTER_HOST='primary_mysql_ip_or_hostname', MASTER_USER='repl_user', MASTER_PASSWORD='your_strong_password', MASTER_PORT=3306, MASTER_AUTO_POSITION=1; -- This is key for GTID-based replication -- Start replica threads START SLAVE; -- Verify replication status SHOW SLAVE STATUS\G
Crucially, ensure `MASTER_AUTO_POSITION=1` is set. This tells the replica to use GTIDs to automatically determine where to start replicating from. The `SHOW SLAVE STATUS\G` output should show `Slave_IO_Running: Yes` and `Slave_SQL_Running: Yes`, with `Seconds_Behind_Master` ideally at 0 or a very low number.
Automating Failover with Orchestrator
Manual failover is error-prone and slow. Orchestrator is an excellent open-source tool for MySQL high availability and disaster recovery. It detects failures, promotes replicas, and reconfigures other replicas automatically. We’ll deploy Orchestrator as a service on a separate, highly available instance (or cluster).
First, install Orchestrator. On a Debian/Ubuntu system:
wget https://github.com/github/orchestrator/releases/download/v3.2.7/orchestrator_3.2.7_amd64.deb sudo dpkg -i orchestrator_3.2.7_amd64.deb sudo apt-get install -f # To install any missing dependencies
Next, configure Orchestrator. The `orchestrator.conf.json` file (typically in `/etc/orchestrator/`) is central. Key parameters include database credentials for Orchestrator’s internal database (e.g., MySQL itself), discovery credentials, and failover settings.
{
"Debug": false,
"ListenAddress": ":3000",
"MySQLTopologyUser": "orchestrator",
"MySQLTopologyPassword": "your_orchestrator_db_password",
"MySQLOrchestratorHost": "127.0.0.1",
"MySQLOrchestratorPort": 3306,
"MySQLOrchestratorDatabase": "orchestrator",
"DiscoveryPeriodSeconds": 10,
"FailureDetectionPeriodSeconds": 30,
"RecoveryPeriodSeconds": 60,
"DetectClusterFailuresPeriodSeconds": 10,
"PromotionUser": "repl_user",
"PromotionPassword": "your_strong_password",
"PromotionPoolSize": 5,
"PromotionSkipDeconfig": false,
"SnapshotPeriodSeconds": 3600,
"PreemptiveSrvPromotion": true,
"PostMasterFailoverProcesses": [
"/usr/local/bin/post_failover_hook.sh"
],
"PostInstancePromotionProcesses": [
"/usr/local/bin/post_instance_promotion_hook.sh"
],
"AutoDiscoverByClusterName": true,
"ClusterNameDetectionQuery": "SELECT @@global.server_id, @@global.gtid_executed",
"TopologyRefreshSeconds": 60,
"MaxLagSeconds": 300,
"MaxReplicationLagQuery": "SELECT SUM(IF(VARIABLE_NAME='Seconds_Behind_Master', VARIABLE_VALUE, 0)) FROM performance_schema.replication_connection_status WHERE CHANNEL_NAME='master' AND VARIABLE_NAME='Seconds_Behind_Master'",
"SlaveLagQuery": "SELECT @@GLOBAL.gtid_executed",
"SlaveLagQueryMaxReplicas": 10,
"SlaveLagQueryMaxLagSeconds": 300,
"SlaveLagQueryMinReplicas": 2,
"SlaveLagQueryMinLagSeconds": 60
}
Create the Orchestrator database and user:
-- On the MySQL server where Orchestrator's DB will reside CREATE DATABASE orchestrator; CREATE USER 'orchestrator'@'localhost' IDENTIFIED BY 'your_orchestrator_db_password'; GRANT ALL PRIVILEGES ON orchestrator.* TO 'orchestrator'@'localhost'; FLUSH PRIVILEGES;
Start and enable the Orchestrator service:
sudo systemctl start orchestrator sudo systemctl enable orchestrator
Once Orchestrator is running, access its web UI (default: `http://your_orchestrator_host:3000`). Discover your MySQL topology by clicking “Discover DBs”. Orchestrator will then visualize your primary and replicas. Configure it to automatically detect and recover from failures. The `PreemptiveSrvPromotion: true` setting is crucial for automatic failover.
Integrating with Shopify: Application-Level Considerations
Your Shopify application needs to be aware of the database failover. Direct database connections from the application to a single IP address are problematic. Instead, use a Virtual IP (VIP) or a DNS-based approach managed by your load balancer or Orchestrator’s API.
Option 1: Virtual IP (VIP) with Keepalived
Keepalived can manage a VIP that floats to the current primary MySQL server. Your application connects to this VIP. Orchestrator can be configured to trigger Keepalived scripts during failover to move the VIP.
Orchestrator’s `PostMasterFailoverProcesses` and `PostInstancePromotionProcesses` are hooks for this. A simple script might look like this:
#!/bin/bash
# /usr/local/bin/post_failover_hook.sh
# This script is executed by Orchestrator after a master failover.
# It should update the VIP management system.
NEW_MASTER_IP=$1 # Orchestrator passes the new master's IP as the first argument
VIP_ADDRESS="192.168.1.100" # Your floating VIP
echo "New master detected: $NEW_MASTER_IP. Moving VIP $VIP_ADDRESS."
# Example: Using a custom script to manage VIP (e.g., via cloud provider API or keepalived config update)
# This is highly dependent on your infrastructure.
# For Keepalived, you might need to trigger a re-sync or re-configuration.
# For cloud providers, you'd use their API to update the floating IP.
# Example for a hypothetical cloud API call:
# curl -X POST "https://api.yourcloud.com/v1/floatingips/$VIP_ADDRESS/assign" -d "instance_id=$(hostname -f)"
# Example for updating Keepalived config and reloading:
# sed -i "s/virtual_ipaddress \{ 1 \}.*/virtual_ipaddress \{ 1 \} $NEW_MASTER_IP\/32 dev eth0;/g" /etc/keepalived/keepalived.conf
# systemctl reload keepalived
echo "VIP update process initiated."
exit 0
Ensure the script has execute permissions (`chmod +x /usr/local/bin/post_failover_hook.sh`) and that Orchestrator is configured to run it.
Option 2: DNS-based Failover
Use a DNS service that supports health checks and automatic record updates. Your application connects to a DNS name (e.g., `mysql.your-shopify-app.com`). Orchestrator can update the DNS A record to point to the new primary’s IP address upon failover. This requires integrating Orchestrator with your DNS provider’s API.
A script for DNS updates would involve using `dig` or `nsupdate` with your DNS provider’s API. This is often more complex due to API authentication and rate limiting.
OVH Specific Considerations and Monitoring
When deploying on OVH Public Cloud, leverage their network features. Ensure your MySQL instances are in the same region and availability zone for low latency, but consider cross-zone or cross-region replicas for disaster recovery. Use OVH’s load balancers or internal network configurations to manage access to the VIP or DNS endpoint.
Monitoring is critical. Implement comprehensive monitoring for:
- MySQL replication status (e.g., `Seconds_Behind_Master`, `Slave_IO_Running`, `Slave_SQL_Running`).
- Orchestrator health and alerts.
- Server resource utilization (CPU, RAM, Disk I/O).
- Network latency between nodes.
- Application-level connectivity to the database.
Tools like Prometheus with `mysqld_exporter` and Grafana are excellent for visualizing this data. Configure alerts for any deviations from normal operating parameters. Orchestrator itself provides alerts via email or Slack, which should be configured.
By combining robust MySQL replication, automated failover with Orchestrator, and careful application integration, you can build a highly available and resilient database layer for your Shopify deployment on OVH, minimizing downtime and protecting your business.