Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Magento 2 Deployments on OVH
Understanding the Challenge: MySQL and Magento 2 High Availability
Deploying a mission-critical Magento 2 instance necessitates a robust disaster recovery (DR) strategy, with automated failover being the cornerstone of minimizing downtime. For many, this involves a primary MySQL database and a Magento 2 application layer, often hosted on cloud providers like OVH. The core challenge lies in detecting a primary database failure and seamlessly redirecting all application traffic to a standby replica without manual intervention. This post outlines an architectural approach to achieve this, focusing on MySQL replication, proxying, and health checks.
MySQL Replication Topology for Failover
A standard asynchronous or semi-synchronous replication setup is the foundation. We’ll configure a primary MySQL instance and at least one replica. For true high availability, multiple replicas are recommended, but for failover automation, one is the minimum. The key is ensuring the replica is consistently lagging behind the primary by a minimal, acceptable margin. We’ll use GTID-based replication for easier failover management.
Primary MySQL Configuration Snippet
On your primary MySQL server (e.g., `mysql-primary.example.com`):
[mysqld] server-id = 1 log_bin = /var/log/mysql/mysql-bin.log binlog_format = ROW gtid_mode = ON enforce_gtid_consistency = ON relay_log = /var/log/mysql/mysql-relay-bin.log read_only = OFF bind-address = 0.0.0.0
On your replica MySQL server (e.g., `mysql-replica.example.com`):
[mysqld] server-id = 2 log_bin = /var/log/mysql/mysql-bin.log binlog_format = ROW gtid_mode = ON enforce_gtid_consistency = ON relay_log = /var/log/mysql/mysql-relay-bin.log read_only = ON bind-address = 0.0.0.0
Setting up Replication
After configuring `my.cnf` and restarting MySQL on both servers, create a replication user on the primary and initiate replication from a consistent snapshot. Using Percona XtraBackup is highly recommended for minimal downtime during the initial backup.
-- On Primary MySQL CREATE USER 'repl_user'@'%' IDENTIFIED BY 'your_secure_password'; GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'%'; FLUSH PRIVILEGES; -- On Replica MySQL (after restoring backup from primary) CHANGE MASTER TO MASTER_HOST='mysql-primary.example.com', MASTER_USER='repl_user', MASTER_PASSWORD='your_secure_password', MASTER_PORT=3306, MASTER_USE_GTID=slave_pos; START SLAVE; -- Verify replication status SHOW SLAVE STATUS\G
Ensure `Seconds_Behind_Master` is consistently low (ideally 0 or a few seconds). If it drifts significantly, investigate network latency, disk I/O, or query performance on the replica.
Introducing a MySQL Proxy for Traffic Management
Directly pointing Magento 2 to a single MySQL instance is insufficient for failover. A proxy layer is essential. ProxySQL is a popular, high-performance, thread-pool aware proxy that can manage connections, route queries, and perform health checks. We’ll configure ProxySQL to point to our primary MySQL instance and, upon failure, switch traffic to the replica.
ProxySQL Installation and Basic Configuration
Install ProxySQL on a dedicated server or co-located with your Magento application servers. The configuration is managed via its internal MySQL interface.
# Example installation on Debian/Ubuntu sudo apt update sudo apt install proxysql
Connect to ProxySQL’s admin interface:
mysql -u admin -padmin -h 127.0.0.1 -P 6032
Configuring ProxySQL Hostgroups and Servers
We define ‘hostgroups’ to logically group our MySQL servers. Hostgroup 10 will be our primary writers, and hostgroup 20 will be our read-only replicas. Initially, only hostgroup 10 will be active for writes.
-- Load default configuration if not already done
LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL USERS TO RUNTIME;
LOAD MYSQL HOSTGROUPS TO RUNTIME;
LOAD MYSQL QUERY RULES TO RUNTIME;
LOAD MYSQL MONITOR TO RUNTIME;
-- Define MySQL servers
INSERT INTO mysql_servers (hostname, hostgroup_id, port, weight, max_connections, active) VALUES
('mysql-primary.example.com', 10, 3306, 1000, 2000, 1),
('mysql-replica.example.com', 20, 3306, 1000, 2000, 1);
-- Define hostgroups
INSERT INTO mysql_hostgroups (id, name, writer_hostgroup, reader_hostgroup) VALUES
(10, 'writers', 10, 10),
(20, 'readers', 10, 20);
-- Configure ProxySQL to use the defined hostgroups
UPDATE mysql_servers SET hostgroup_id = 10 WHERE hostname = 'mysql-primary.example.com';
UPDATE mysql_servers SET hostgroup_id = 20 WHERE hostname = 'mysql-replica.example.com';
-- Apply changes
LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL HOSTGROUPS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SAVE MYSQL HOSTGROUPS TO DISK;
Configuring Query Rules for Write Splitting
Magento 2 primarily performs writes. We need to ensure all write operations go to the primary. ProxySQL’s query rules can achieve this. For simplicity, we’ll route all traffic to hostgroup 10 initially.
-- Rule to send all traffic to hostgroup 10 (writers) INSERT INTO mysql_query_rules (rule_id, active, destination_hostgroup, apply) VALUES (1, 1, 10, 1); -- Apply changes LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;
Magento 2’s database connection string should now point to the ProxySQL server’s IP address and port (default 6446 for MySQL protocol).
Automated Failover Mechanism
The core of automation lies in a health check and failover script. This script will periodically check the health of the primary MySQL server via ProxySQL. If it detects an issue, it will reconfigure ProxySQL to promote the replica and then attempt to re-establish replication from the new primary.
ProxySQL Health Checks
ProxySQL has built-in monitoring. We need to configure it to check the health of our MySQL servers.
-- Configure monitoring for servers UPDATE mysql_servers SET monitor_hostgroup = 10 WHERE hostname = 'mysql-primary.example.com'; UPDATE mysql_servers SET monitor_hostgroup = 20 WHERE hostname = 'mysql-replica.example.com'; -- Configure health check intervals and thresholds -- These are example values, tune them based on your network and tolerance for false positives UPDATE mysql_monitor SET interval_ms = 5000, -- Check every 5 seconds timeout_ms = 2000, -- Timeout after 2 seconds max_maint_errors = 3, -- Mark as offline after 3 consecutive errors max_fail_errors = 5; -- Mark as failed after 5 consecutive errors -- Apply changes LOAD MYSQL MONITOR TO RUNTIME; SAVE MYSQL MONITOR TO DISK;
The Failover Script (Python Example)
This Python script uses the `mysql.connector` library to interact with ProxySQL’s admin interface and the MySQL servers. It should be run as a cron job or managed by a process supervisor like `systemd`.
import mysql.connector
import time
import logging
# --- Configuration ---
PROXYSQL_ADMIN_HOST = '127.0.0.1'
PROXYSQL_ADMIN_PORT = 6032
PROXYSQL_ADMIN_USER = 'admin'
PROXYSQL_ADMIN_PASSWORD = 'admin_password' # Use a strong password!
PRIMARY_MYSQL_HOST = 'mysql-primary.example.com'
REPLICA_MYSQL_HOST = 'mysql-replica.example.com'
REPLICA_MYSQL_USER = 'repl_user'
REPLICA_MYSQL_PASSWORD = 'your_secure_password' # Use a strong password!
# GTID-based replication commands
# These commands assume the replica is already configured to use GTID
# and we are promoting it to be the new primary.
# The actual promotion involves stopping replication, setting read_only=OFF,
# and then re-establishing replication from the new primary.
# This script focuses on ProxySQL reconfiguration and initiating replication setup.
LOG_FILE = '/var/log/proxysql_failover.log'
CHECK_INTERVAL_SECONDS = 30 # How often to check primary health
# --- Logging Setup ---
logging.basicConfig(filename=LOG_FILE, level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s')
def get_db_connection(host, user, password, port=6032):
try:
conn = mysql.connector.connect(
host=host,
user=user,
password=password,
port=port
)
return conn
except mysql.connector.Error as err:
logging.error(f"Error connecting to DB {host}:{port}: {err}")
return None
def is_primary_healthy(proxysql_conn):
try:
cursor = proxysql_conn.cursor(dictionary=True)
# Check if the primary server (hostgroup 10) is online in ProxySQL
cursor.execute("SELECT hostgroup_id, hostname, status FROM mysql_servers WHERE hostgroup_id = 10 AND active = 1;")
servers = cursor.fetchall()
cursor.close()
if not servers:
logging.warning("No servers found in hostgroup 10 (writers).")
return False
# ProxySQL's status for a server in a writer hostgroup should be 'ONLINE'
# We are checking if the *primary* server is marked as ONLINE by ProxySQL's monitor
# A more direct check would be to query MySQL itself, but this leverages ProxySQL's view.
# A better approach might be to check ProxySQL's monitor status directly.
# For simplicity, we'll assume if the primary is in hostgroup 10 and active, it's "healthy" from ProxySQL's perspective.
# A more robust check would involve querying MySQL's `SHOW SLAVE STATUS` or similar.
# Let's refine this to check ProxySQL's internal monitor status.
cursor = proxysql_conn.cursor(dictionary=True)
cursor.execute("SELECT hostgroup_id, hostname, status FROM mysql_servers_state WHERE hostgroup_id = 10;")
states = cursor.fetchall()
cursor.close()
for state in states:
if state['hostname'] == PRIMARY_MYSQL_HOST and state['status'] == 'ONLINE':
return True
return False
except Exception as e:
logging.error(f"Error checking primary health: {e}")
return False
def promote_replica(proxysql_conn):
logging.info("Attempting to promote replica...")
try:
cursor = proxysql_conn.cursor()
# 1. Stop replication on the replica
logging.info(f"Stopping replication on replica: {REPLICA_MYSQL_HOST}")
conn_replica = get_db_connection(REPLICA_MYSQL_HOST, REPLICA_MYSQL_USER, REPLICA_MYSQL_PASSWORD, port=3306)
if conn_replica:
cursor_replica = conn_replica.cursor()
cursor_replica.execute("STOP SLAVE;")
conn_replica.commit()
cursor_replica.close()
conn_replica.close()
logging.info("Replication stopped on replica.")
else:
logging.error("Failed to connect to replica to stop replication.")
return False
# 2. Set replica to read-write mode
logging.info(f"Setting replica to read-write: {REPLICA_MYSQL_HOST}")
conn_replica = get_db_connection(REPLICA_MYSQL_HOST, REPLICA_MYSQL_USER, REPLICA_MYSQL_PASSWORD, port=3306)
if conn_replica:
cursor_replica = conn_replica.cursor()
cursor_replica.execute("SET GLOBAL read_only = 0;")
conn_replica.commit()
cursor_replica.close()
conn_replica.close()
logging.info("Replica set to read-write.")
else:
logging.error("Failed to connect to replica to set read_only.")
return False
# 3. Update ProxySQL configuration:
# - Remove primary from hostgroup 10 (writers)
# - Add replica to hostgroup 10 (writers)
# - Ensure replica is not in hostgroup 20 (readers, if it was)
# Remove primary from hostgroup 10
logging.info(f"Removing {PRIMARY_MYSQL_HOST} from hostgroup 10.")
cursor.execute(f"DELETE FROM mysql_servers WHERE hostname = '{PRIMARY_MYSQL_HOST}' AND hostgroup_id = 10;")
cursor.execute("LOAD MYSQL SERVERS TO RUNTIME;")
# Add replica to hostgroup 10
logging.info(f"Adding {REPLICA_MYSQL_HOST} to hostgroup 10.")
# Check if replica is already in hostgroup 10, if so, update it. Otherwise, insert.
cursor.execute(f"SELECT COUNT(*) FROM mysql_servers WHERE hostname = '{REPLICA_MYSQL_HOST}' AND hostgroup_id = 10;")
if cursor.fetchone()[0] > 0:
cursor.execute(f"UPDATE mysql_servers SET active = 1, weight = 1000, max_connections = 2000 WHERE hostname = '{REPLICA_MYSQL_HOST}' AND hostgroup_id = 10;")
else:
cursor.execute(f"INSERT INTO mysql_servers (hostname, hostgroup_id, port, weight, max_connections, active) VALUES ('{REPLICA_MYSQL_HOST}', 10, 3306, 1000, 2000, 1);")
cursor.execute("LOAD MYSQL SERVERS TO RUNTIME;")
# Ensure replica is not in hostgroup 20 (if it was previously)
logging.info(f"Ensuring {REPLICA_MYSQL_HOST} is not in hostgroup 20.")
cursor.execute(f"DELETE FROM mysql_servers WHERE hostname = '{REPLICA_MYSQL_HOST}' AND hostgroup_id = 20;")
cursor.execute("LOAD MYSQL SERVERS TO RUNTIME;")
# Save changes
cursor.execute("SAVE MYSQL SERVERS TO DISK;")
logging.info("ProxySQL configuration updated to promote replica.")
# 4. Re-establish replication from the NEW primary (which was the old replica)
# This is a critical step. We need to know the GTID position of the new primary.
# For simplicity, this script assumes the new primary is already in a state
# where it can serve as a source. A more robust solution would involve
# capturing the GTID from the old primary *before* it failed, or using
# a tool to find the latest GTID on the new primary.
# For this example, we'll just restart replication pointing to the new primary.
# This part is tricky and might require manual intervention or a more sophisticated script.
# A common pattern is to have a separate "standby" server that is always a replica,
# and when the primary fails, the standby becomes the new primary, and a *new* standby
# is provisioned from the new primary.
# For this script, we'll assume the old primary is gone and the replica is now the primary.
# We need to set up a *new* replica from this new primary.
# This script will NOT handle setting up a new replica automatically.
# It focuses on promoting the existing replica to be the primary.
# If you have a dedicated standby server that you want to replicate from the NEW primary:
# NEW_PRIMARY_HOST = REPLICA_MYSQL_HOST
# NEW_STANDBY_HOST = 'mysql-new-standby.example.com'
# ... then configure replication on NEW_STANDBY_HOST to point to NEW_PRIMARY_HOST
return True
except Exception as e:
logging.error(f"Error during replica promotion: {e}")
return False
def failback_to_primary(proxysql_conn):
logging.info("Attempting to failback to original primary...")
try:
cursor = proxysql_conn.cursor()
# 1. Set the current primary (which was the replica) back to read-only
current_primary_host = None
cursor.execute("SELECT hostname FROM mysql_servers WHERE hostgroup_id = 10 AND active = 1;")
result = cursor.fetchone()
if result:
current_primary_host = result[0]
logging.info(f"Current primary identified as: {current_primary_host}")
if current_primary_host and current_primary_host != PRIMARY_MYSQL_HOST:
logging.info(f"Setting current primary {current_primary_host} to read-only.")
conn_current_primary = get_db_connection(current_primary_host, REPLICA_MYSQL_USER, REPLICA_MYSQL_PASSWORD, port=3306)
if conn_current_primary:
cursor_current_primary = conn_current_primary.cursor()
cursor_current_primary.execute("SET GLOBAL read_only = 1;")
conn_current_primary.commit()
cursor_current_primary.close()
conn_current_primary.close()
logging.info("Current primary set to read-only.")
else:
logging.error(f"Failed to connect to current primary {current_primary_host} to set read_only.")
return False
else:
logging.warning("Could not identify a current primary to failback from, or it's already the original primary.")
# If the original primary is back online, we might need to re-establish replication.
# 2. Re-establish replication on the original primary from the current primary
logging.info(f"Re-establishing replication on {PRIMARY_MYSQL_HOST} from {current_primary_host}.")
conn_original_primary = get_db_connection(PRIMARY_MYSQL_HOST, REPLICA_MYSQL_USER, REPLICA_MYSQL_PASSWORD, port=3306)
if conn_original_primary:
cursor_original_primary = conn_original_primary.cursor()
# Assuming GTID-based replication is set up and we just need to point it.
# This is a simplified command. In a real scenario, you'd need to ensure
# the original primary is clean and ready to be a replica.
# You might need to reset slave and reconfigure.
cursor_original_primary.execute(f"CHANGE MASTER TO MASTER_HOST='{current_primary_host}', MASTER_USER='{REPLICA_MYSQL_USER}', MASTER_PASSWORD='{REPLICA_MYSQL_PASSWORD}', MASTER_PORT=3306, MASTER_USE_GTID=slave_pos;")
cursor_original_primary.execute("START SLAVE;")
conn_original_primary.commit()
cursor_original_primary.close()
conn_original_primary.close()
logging.info("Replication re-established on original primary.")
else:
logging.error(f"Failed to connect to original primary {PRIMARY_MYSQL_HOST} to re-establish replication.")
return False
# 3. Update ProxySQL configuration:
# - Remove replica from hostgroup 10 (writers)
# - Add original primary back to hostgroup 10 (writers)
# - Ensure replica is back in hostgroup 20 (readers)
# Remove replica from hostgroup 10
logging.info(f"Removing {current_primary_host} from hostgroup 10.")
cursor.execute(f"DELETE FROM mysql_servers WHERE hostname = '{current_primary_host}' AND hostgroup_id = 10;")
cursor.execute("LOAD MYSQL SERVERS TO RUNTIME;")
# Add original primary back to hostgroup 10
logging.info(f"Adding {PRIMARY_MYSQL_HOST} back to hostgroup 10.")
cursor.execute(f"INSERT INTO mysql_servers (hostname, hostgroup_id, port, weight, max_connections, active) VALUES ('{PRIMARY_MYSQL_HOST}', 10, 3306, 1000, 2000, 1);")
cursor.execute("LOAD MYSQL SERVERS TO RUNTIME;")
# Add replica back to hostgroup 20 (readers)
logging.info(f"Adding {current_primary_host} back to hostgroup 20.")
cursor.execute(f"INSERT INTO mysql_servers (hostname, hostgroup_id, port, weight, max_connections, active) VALUES ('{current_primary_host}', 20, 3306, 1000, 2000, 1);")
cursor.execute("LOAD MYSQL SERVERS TO RUNTIME;")
# Save changes
cursor.execute("SAVE MYSQL SERVERS TO DISK;")
logging.info("ProxySQL configuration updated for failback.")
return True
except Exception as e:
logging.error(f"Error during failback: {e}")
return False
def main():
proxysql_conn = get_db_connection(PROXYSQL_ADMIN_HOST, PROXYSQL_ADMIN_USER, PROXYSQL_ADMIN_PASSWORD, port=PROXYSQL_ADMIN_PORT)
if not proxysql_conn:
logging.error("Failed to connect to ProxySQL admin interface. Exiting.")
return
primary_is_currently_online = False
try:
# Check ProxySQL's view of the primary server's status
cursor = proxysql_conn.cursor(dictionary=True)
cursor.execute("SELECT status FROM mysql_servers_state WHERE hostname = %s AND hostgroup_id = 10;", (PRIMARY_MYSQL_HOST,))
result = cursor.fetchone()
if result and result['status'] == 'ONLINE':
primary_is_currently_online = True
cursor.close()
except Exception as e:
logging.error(f"Error querying ProxySQL server state: {e}")
# Assume primary is offline if we can't query its state.
if primary_is_currently_online:
logging.info(f"Primary server {PRIMARY_MYSQL_HOST} is online. No failover needed.")
# Optional: Check if failback is needed (e.g., if original primary is back online and replica is still primary)
# This logic can become complex. For now, we only trigger failover.
else:
logging.warning(f"Primary server {PRIMARY_MYSQL_HOST} appears to be offline. Initiating failover.")
if promote_replica(proxysql_conn):
logging.info("Failover to replica completed successfully.")
# After successful promotion, we might want to set up a new replica for the new primary.
# This script does not automate that.
else:
logging.error("Failover process failed.")
proxysql_conn.close()
if __name__ == "__main__":
while True:
main()
time.sleep(CHECK_INTERVAL_SECONDS)
Cron Job or Systemd Service
To run this script continuously, use `cron` or `systemd`.
# Example cron entry (runs every minute) * * * * * /usr/bin/python3 /path/to/your/failover_script.py >> /var/log/failover_script.log 2>&1
[Unit] Description=ProxySQL MySQL Failover Monitor After=network.target proxysql.service [Service] ExecStart=/usr/bin/python3 /path/to/your/failover_script.py Restart=always User=root Group=root [Install] WantedBy=multi-user.target
Enable and start the systemd service:
sudo systemctl enable proxysql-failover.service sudo systemctl start proxysql-failover.service
Magento 2 Considerations
Ensure your Magento 2 application’s `app/etc/env.php` is configured to point to the ProxySQL server’s IP address and the MySQL port (6446 by default). The database user Magento uses must have sufficient privileges on the MySQL server.
<?php
return [
'backend' => [
'frontName' => 'admin_your_backend_path'
],
'crypt' => [
'key' => 'your_encryption_key'
],
'db' => [
'connection' => [
'default' => [
'host' => 'proxysql_server_ip', // IP of your ProxySQL server
'dbname' => 'magento_db',
'username' => 'magento_user',
'password' => 'magento_password',
'model' => 'mysql4',
'initStatements' => 'SET NAMES utf8; SET SESSION sql_mode="NO_ENGINE_SUBSTITUTION";',
'engine' => 'innodb',
'port' => '6446', // ProxySQL's default MySQL port
'active' => 1
],
'innodb_buffer_pool_size' => '1024M', // Example for tuning
'query_cache_type' => '0', // Disable query cache
'query_cache_size' => '0'
]
],
// ... other configuration
];
?>
During a failover, Magento 2’s persistent connections to the old primary might need to be reset. Restarting the Magento application services (e.g., PHP-FPM, web server) after a successful failover is often necessary to clear these connections and ensure they pick up the new database endpoint.
Advanced Considerations and Next Steps
- Multi-Region Failover: For true disaster recovery, consider replicating MySQL across different OVH data centers or cloud regions. This adds complexity to replication lag and failover orchestration.
- Automated New Replica Provisioning: The script above promotes an existing replica. A more advanced system would automatically provision a *new* replica from the newly promoted primary to maintain HA.
- Read Replicas: Configure ProxySQL to route read-only queries to dedicated read replicas (hostgroup 20) for performance scaling.
- ProxySQL High Availability: Run multiple instances of ProxySQL in an active/passive or active/active cluster for resilience of the proxy layer itself.
- Monitoring and Alerting: Integrate with tools like Prometheus, Grafana, or Zabbix to monitor ProxySQL’s status, replication lag, and script execution. Set up alerts for failures.
- Testing: Rigorously test your failover process regularly. Simulate primary failures and verify that the failover occurs as expected and that Magento remains accessible.
- Database User Management: Ensure the database user used by Magento is replicated or managed consistently across primary and replica.
- Magento Cache Management: After a failover, consider clearing Magento’s cache to ensure it doesn’t serve stale data that might have been written to the old primary just before it went down.
Implementing automated failover for MySQL and Magento 2 is a multi-faceted task. By combining robust MySQL replication, a capable proxy like ProxySQL, and a well-tested automation script, you can significantly reduce downtime and protect your critical e-commerce operations.