Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Shopify Deployments on OVH

Establishing a Robust MySQL Replication Topology for High Availability

For any mission-critical Shopify deployment, a resilient MySQL infrastructure is paramount. We’ll architect a multi-master or active-passive replication setup, focusing on automated failover mechanisms. This involves setting up primary and replica instances, ensuring data consistency, and implementing health checks that can trigger failover events.

Our chosen strategy will be an active-passive setup with semi-synchronous replication for reduced data loss during failover. We’ll leverage GTID (Global Transaction Identifiers) for simplified failover and recovery. The OVH Public Cloud offers robust networking and compute resources ideal for this deployment.

MySQL Configuration for Replication and GTID

On both the primary and replica MySQL servers, the following `my.cnf` configurations are essential. Ensure these are applied consistently across all nodes. We’ll use `server-id` to uniquely identify each instance and `log_bin` to enable binary logging, which is crucial for replication. `gtid_mode=ON` and `enforce_gtid_consistency=ON` are critical for GTID-based replication.

For the primary server:

[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_format = ROW
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = OFF
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
# Optional: For semi-synchronous replication (requires plugin)
# plugin_load_add = rpl_semi_sync_master=rpl_semi_sync_master.so
# rpl_semi_sync_master_enabled = 1
# rpl_semi_sync_master_timeout = 10000

For the replica server:

[mysqld]
server-id = 2
log_bin = /var/log/mysql/mysql-bin.log
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_format = ROW
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = ON
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
# Optional: For semi-synchronous replication (requires plugin)
# plugin_load_add = rpl_semi_sync_replica=rpl_semi_sync_replica.so
# rpl_semi_sync_replica_enabled = 1

Setting Up Replication Users and Initializing Replicas

On the primary MySQL server, create a dedicated replication user and grant necessary privileges. Then, take a consistent snapshot of the primary database. For GTID-based replication, this is simpler as the replica will automatically determine its starting point.

On the primary:

-- On Primary MySQL Server
CREATE USER 'repl_user'@'%' IDENTIFIED BY 'your_strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'%';
FLUSH PRIVILEGES;

-- To get the current GTID set for initial sync (if not using a tool like Percona XtraBackup)
SHOW MASTER STATUS;
-- Note down the 'Executed_Gtid_Set' value. This will be used if manual sync is needed.

To initialize the replica, the most robust method is using Percona XtraBackup. This tool performs hot backups without blocking writes on the primary. After taking a backup, restore it to the replica server. Then, configure the replica to connect to the primary.

On the replica (after restoring backup):

-- On Replica MySQL Server
-- Stop replica threads if they are running
STOP SLAVE;

-- Configure the replica to connect to the primary using GTID
CHANGE MASTER TO
  MASTER_HOST='primary_mysql_ip_or_hostname',
  MASTER_USER='repl_user',
  MASTER_PASSWORD='your_strong_password',
  MASTER_PORT=3306,
  MASTER_AUTO_POSITION=1; -- This is key for GTID-based replication

-- Start replica threads
START SLAVE;

-- Verify replication status
SHOW SLAVE STATUS\G

Crucially, ensure `MASTER_AUTO_POSITION=1` is set. This tells the replica to use GTIDs to automatically determine where to start replicating from. The `SHOW SLAVE STATUS\G` output should show `Slave_IO_Running: Yes` and `Slave_SQL_Running: Yes`, with `Seconds_Behind_Master` ideally at 0 or a very low number.

Automating Failover with Orchestrator

Manual failover is error-prone and slow. Orchestrator is an excellent open-source tool for MySQL high availability and disaster recovery. It detects failures, promotes replicas, and reconfigures other replicas automatically. We’ll deploy Orchestrator as a service on a separate, highly available instance (or cluster).

First, install Orchestrator. On a Debian/Ubuntu system:

wget https://github.com/github/orchestrator/releases/download/v3.2.7/orchestrator_3.2.7_amd64.deb
sudo dpkg -i orchestrator_3.2.7_amd64.deb
sudo apt-get install -f # To install any missing dependencies

Next, configure Orchestrator. The `orchestrator.conf.json` file (typically in `/etc/orchestrator/`) is central. Key parameters include database credentials for Orchestrator’s internal database (e.g., MySQL itself), discovery credentials, and failover settings.

{
  "Debug": false,
  "ListenAddress": ":3000",
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "your_orchestrator_db_password",
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "DiscoveryPeriodSeconds": 10,
  "FailureDetectionPeriodSeconds": 30,
  "RecoveryPeriodSeconds": 60,
  "DetectClusterFailuresPeriodSeconds": 10,
  "PromotionUser": "repl_user",
  "PromotionPassword": "your_strong_password",
  "PromotionPoolSize": 5,
  "PromotionSkipDeconfig": false,
  "SnapshotPeriodSeconds": 3600,
  "PreemptiveSrvPromotion": true,
  "PostMasterFailoverProcesses": [
    "/usr/local/bin/post_failover_hook.sh"
  ],
  "PostInstancePromotionProcesses": [
    "/usr/local/bin/post_instance_promotion_hook.sh"
  ],
  "AutoDiscoverByClusterName": true,
  "ClusterNameDetectionQuery": "SELECT @@global.server_id, @@global.gtid_executed",
  "TopologyRefreshSeconds": 60,
  "MaxLagSeconds": 300,
  "MaxReplicationLagQuery": "SELECT SUM(IF(VARIABLE_NAME='Seconds_Behind_Master', VARIABLE_VALUE, 0)) FROM performance_schema.replication_connection_status WHERE CHANNEL_NAME='master' AND VARIABLE_NAME='Seconds_Behind_Master'",
  "SlaveLagQuery": "SELECT @@GLOBAL.gtid_executed",
  "SlaveLagQueryMaxReplicas": 10,
  "SlaveLagQueryMaxLagSeconds": 300,
  "SlaveLagQueryMinReplicas": 2,
  "SlaveLagQueryMinLagSeconds": 60
}

Create the Orchestrator database and user:

-- On the MySQL server where Orchestrator's DB will reside
CREATE DATABASE orchestrator;
CREATE USER 'orchestrator'@'localhost' IDENTIFIED BY 'your_orchestrator_db_password';
GRANT ALL PRIVILEGES ON orchestrator.* TO 'orchestrator'@'localhost';
FLUSH PRIVILEGES;

Start and enable the Orchestrator service:

sudo systemctl start orchestrator
sudo systemctl enable orchestrator

Once Orchestrator is running, access its web UI (default: `http://your_orchestrator_host:3000`). Discover your MySQL topology by clicking “Discover DBs”. Orchestrator will then visualize your primary and replicas. Configure it to automatically detect and recover from failures. The `PreemptiveSrvPromotion: true` setting is crucial for automatic failover.

Integrating with Shopify: Application-Level Considerations

Your Shopify application needs to be aware of the database failover. Direct database connections from the application to a single IP address are problematic. Instead, use a Virtual IP (VIP) or a DNS-based approach managed by your load balancer or Orchestrator’s API.

Option 1: Virtual IP (VIP) with Keepalived

Keepalived can manage a VIP that floats to the current primary MySQL server. Your application connects to this VIP. Orchestrator can be configured to trigger Keepalived scripts during failover to move the VIP.

Orchestrator’s `PostMasterFailoverProcesses` and `PostInstancePromotionProcesses` are hooks for this. A simple script might look like this:

#!/bin/bash
# /usr/local/bin/post_failover_hook.sh
# This script is executed by Orchestrator after a master failover.
# It should update the VIP management system.

NEW_MASTER_IP=$1 # Orchestrator passes the new master's IP as the first argument
VIP_ADDRESS="192.168.1.100" # Your floating VIP

echo "New master detected: $NEW_MASTER_IP. Moving VIP $VIP_ADDRESS."

# Example: Using a custom script to manage VIP (e.g., via cloud provider API or keepalived config update)
# This is highly dependent on your infrastructure.
# For Keepalived, you might need to trigger a re-sync or re-configuration.
# For cloud providers, you'd use their API to update the floating IP.

# Example for a hypothetical cloud API call:
# curl -X POST "https://api.yourcloud.com/v1/floatingips/$VIP_ADDRESS/assign" -d "instance_id=$(hostname -f)"

# Example for updating Keepalived config and reloading:
# sed -i "s/virtual_ipaddress \{ 1 \}.*/virtual_ipaddress \{ 1 \} $NEW_MASTER_IP\/32 dev eth0;/g" /etc/keepalived/keepalived.conf
# systemctl reload keepalived

echo "VIP update process initiated."
exit 0

Ensure the script has execute permissions (`chmod +x /usr/local/bin/post_failover_hook.sh`) and that Orchestrator is configured to run it.

Option 2: DNS-based Failover

Use a DNS service that supports health checks and automatic record updates. Your application connects to a DNS name (e.g., `mysql.your-shopify-app.com`). Orchestrator can update the DNS A record to point to the new primary’s IP address upon failover. This requires integrating Orchestrator with your DNS provider’s API.

A script for DNS updates would involve using `dig` or `nsupdate` with your DNS provider’s API. This is often more complex due to API authentication and rate limiting.

OVH Specific Considerations and Monitoring

When deploying on OVH Public Cloud, leverage their network features. Ensure your MySQL instances are in the same region and availability zone for low latency, but consider cross-zone or cross-region replicas for disaster recovery. Use OVH’s load balancers or internal network configurations to manage access to the VIP or DNS endpoint.

Monitoring is critical. Implement comprehensive monitoring for:

MySQL replication status (e.g., `Seconds_Behind_Master`, `Slave_IO_Running`, `Slave_SQL_Running`).
Orchestrator health and alerts.
Server resource utilization (CPU, RAM, Disk I/O).
Network latency between nodes.
Application-level connectivity to the database.

Tools like Prometheus with `mysqld_exporter` and Grafana are excellent for visualizing this data. Configure alerts for any deviations from normal operating parameters. Orchestrator itself provides alerts via email or Slack, which should be configured.

By combining robust MySQL replication, automated failover with Orchestrator, and careful application integration, you can build a highly available and resilient database layer for your Shopify deployment on OVH, minimizing downtime and protecting your business.

Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Shopify Deployments on OVH

Establishing a Robust MySQL Replication Topology for High Availability

MySQL Configuration for Replication and GTID

Setting Up Replication Users and Initializing Replicas

Automating Failover with Orchestrator

Integrating with Shopify: Application-Level Considerations

OVH Specific Considerations and Monitoring

Recent Posts

Top Categories

Our Products

Our Services