Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Python Deployments on OVH
Automated MySQL Failover with Orchestrator and Keepalived
Achieving true high availability for MySQL necessitates an automated failover strategy. Relying on manual intervention during an outage is a recipe for extended downtime and significant business impact. This section details a robust, automated failover architecture using Orchestrator for cluster management and Keepalived for virtual IP (VIP) failover, deployed on OVH infrastructure.
Our setup assumes a multi-instance MySQL deployment, typically with one primary (writer) and multiple replicas (readers). Orchestrator will monitor the health of these instances and manage the promotion of a replica to primary in case of failure. Keepalived will ensure that the application’s connection endpoint (a VIP) always points to the currently active primary MySQL server.
Orchestrator Deployment and Configuration
Orchestrator is a highly available MySQL replication topology manager. It can be deployed as a single instance or, for better resilience, as a cluster itself. For this guide, we’ll focus on a single Orchestrator instance for simplicity, but a clustered deployment is recommended for production.
First, install Orchestrator on a dedicated management server or one of your application servers. Ensure it has network access to all MySQL instances.
Installation (Debian/Ubuntu)
wget https://github.com/openark/orchestrator/releases/download/v3.2.7/orchestrator-3.2.7-linux-amd64.tar.gz tar -xzf orchestrator-3.2.7-linux-amd64.tar.gz sudo mv orchestrator-3.2.7-linux-amd64 /usr/local/orchestrator sudo ln -s /usr/local/orchestrator/orchestrator /usr/local/bin/orchestrator
Configuration File
Create the Orchestrator configuration file, typically at /etc/orchestrator.conf.json. Key parameters include database credentials for Orchestrator to connect to MySQL, discovery settings, and failover hooks.
{
"Debug": false,
"ListenAddress": ":3000",
"MySQLTopologyUser": "orchestrator",
"MySQLTopologyPassword": "your_orchestrator_db_password",
"MySQLOrchestratorHost": "127.0.0.1",
"MySQLOrchestratorPort": 3306,
"MySQLOrchestratorDatabase": "orchestrator",
"DiscoveryPeriodSeconds": 10,
"PromotionHooks": [
"/etc/orchestrator/hooks/mysql-promote-replica.sh"
],
"PostMasterDiscoveryHooks": [
"/etc/orchestrator/hooks/mysql-post-master-discovery.sh"
],
"MySQLReplicationLagQuery": "SELECT MAX(seconds_behind_master) FROM mysql.slave_status WHERE master_host IS NOT NULL AND master_host != ''",
"SlaveLagQuery": "SELECT MAX(seconds_behind_master) FROM mysql.slave_status WHERE master_host IS NOT NULL AND master_host != ''",
"Flavor": "MySQL"
}
You’ll need to create the orchestrator database and user on one of your MySQL instances (or a dedicated management DB) and grant it necessary privileges. Orchestrator will use this to store its internal state.
CREATE DATABASE IF NOT EXISTS orchestrator; CREATE USER 'orchestrator'@'%' IDENTIFIED BY 'your_orchestrator_db_password'; GRANT ALL PRIVILEGES ON orchestrator.* TO 'orchestrator'@'%'; FLUSH PRIVILEGES;
Failover Hooks: The Automation Core
The power of Orchestrator lies in its hook system. These are scripts executed when specific events occur, such as a master promotion. We’ll define two critical hooks:
mysql-promote-replica.sh: Promoting a Replica
This script is executed by Orchestrator when it decides to promote a replica to become the new master. Its primary responsibility is to ensure the chosen replica is ready and to update its replication configuration if necessary. Crucially, it needs to signal to Keepalived that the VIP should be moved.
#!/bin/bash # This script is executed by Orchestrator when a replica is promoted. # It should perform any necessary actions on the promoted instance # and signal Keepalived to move the VIP. # Orchestrator passes arguments: # $1: promoted_host # $2: promoted_port # $3: original_master_host # $4: original_master_port # $5: cluster_name (if applicable) PROMOTED_HOST="$1" PROMOTED_PORT="$2" ORIGINAL_MASTER_HOST="$3" ORIGINAL_MASTER_PORT="$4" CLUSTER_NAME="$5" echo "Orchestrator hook: Promoting replica $PROMOTED_HOST:$PROMOTED_PORT to master." echo "Original master was $ORIGINAL_MASTER_HOST:$ORIGINAL_MASTER_PORT." echo "Cluster name: $CLUSTER_NAME" # --- Step 1: Verify the promoted instance is ready --- # In a real-world scenario, you might want to add checks here: # - Ensure MySQL is running and accessible. # - Check replication status (though Orchestrator already did this). # - Potentially run a quick health check query. # Example: Check if MySQL is listening if ! nc -z $PROMOTED_HOST $PROMOTED_PORT; then echo "Error: MySQL is not accessible on $PROMOTED_HOST:$PROMOTED_PORT." exit 1 fi # --- Step 2: Update replication for other replicas --- # Orchestrator usually handles re-pointing slaves to the new master. # However, if you have custom logic or need to ensure it, you can do it here. # For simplicity, we assume Orchestrator's default behavior is sufficient. # If not, you would query Orchestrator's API or DB to get the list of slaves # and execute CHANGE MASTER TO commands. # --- Step 3: Signal Keepalived to move the VIP --- # This is the critical part for application connectivity. # We'll use a simple mechanism: trigger a script on the Keepalived master. # This could be via SSH, a shared file, or a dedicated API. # For this example, we'll assume SSH access to the Keepalived master. KEEPALIVED_MASTER_IP="192.168.1.100" # IP of the server running Keepalived master KEEPALIVED_SSH_USER="deploy" VIP_MANAGER_SCRIPT="/usr/local/bin/manage_mysql_vip.sh" echo "Attempting to signal Keepalived master ($KEEPALIVED_MASTER_IP) to move VIP..." ssh -o StrictHostKeyChecking=no -o ConnectTimeout=5 $KEEPALIVED_SSH_USER@$KEEPALIVED_MASTER_IP "$VIP_MANAGER_SCRIPT move_mysql_vip $PROMOTED_HOST" if [ $? -eq 0 ]; then echo "Successfully signaled Keepalived master to move VIP." else echo "Error: Failed to signal Keepalived master. Manual intervention may be required." exit 1 fi echo "Promotion hook completed." exit 0
Make this script executable: sudo chmod +x /etc/orchestrator/hooks/mysql-promote-replica.sh.
mysql-post-master-discovery.sh: Post-Discovery Actions
This hook runs after Orchestrator discovers the topology. It’s useful for tasks that need to happen regardless of a promotion, such as ensuring all replicas are correctly configured or updating monitoring systems. For our VIP management, we’ll use it to ensure the VIP is correctly assigned if the primary MySQL instance is healthy.
#!/bin/bash
# This script is executed by Orchestrator after discovering the topology.
# It can be used to perform actions on all discovered instances.
# We'll use it to ensure the VIP is correctly assigned if the primary is healthy.
# Orchestrator passes arguments:
# $1: discovered_host
# $2: discovered_port
# $3: is_master (true/false)
# $4: is_candidate (true/false)
# $5: cluster_name (if applicable)
DISCOVERED_HOST="$1"
DISCOVERED_PORT="$2"
IS_MASTER="$3"
IS_CANDIDATE="$4"
CLUSTER_NAME="$5"
echo "Orchestrator hook: Post-discovery action for $DISCOVERED_HOST:$DISCOVERED_PORT"
# --- Step 1: If this is the current master, ensure VIP is assigned ---
if [ "$IS_MASTER" = "true" ]; then
echo "$DISCOVERED_HOST is the master. Ensuring VIP is assigned..."
KEEPALIVED_MASTER_IP="192.168.1.100" # IP of the server running Keepalived master
KEEPALIVED_SSH_USER="deploy"
VIP_MANAGER_SCRIPT="/usr/local/bin/manage_mysql_vip.sh"
echo "Attempting to signal Keepalived master ($KEEPALIVED_MASTER_IP) to assign VIP to $DISCOVERED_HOST..."
ssh -o StrictHostKeyChecking=no -o ConnectTimeout=5 $KEEPALIVED_SSH_USER@$KEEPALIVED_MASTER_IP "$VIP_MANAGER_SCRIPT assign_mysql_vip $DISCOVERED_HOST"
if [ $? -eq 0 ]; then
echo "Successfully signaled Keepalived master to assign VIP."
else
echo "Error: Failed to signal Keepalived master. Manual intervention may be required."
# Decide if this should be a fatal error for Orchestrator hook
# exit 1
fi
fi
echo "Post-discovery hook completed."
exit 0
Make this script executable: sudo chmod +x /etc/orchestrator/hooks/mysql-post-master-discovery.sh.
Keepalived Configuration for VIP Failover
Keepalived provides a simple yet effective way to manage a Virtual IP address (VIP) across multiple servers. The server with the VIP is considered the active one. If it fails, Keepalived automatically transfers the VIP to a backup server.
We’ll deploy Keepalived on at least two servers that can reach the MySQL instances. One will be the MASTER, and the other(s) will be BACKUP. The VIP will be assigned to the MASTER. When the MASTER fails, the BACKUP will take over the VIP.
Installation (Debian/Ubuntu)
sudo apt update sudo apt install keepalived -y
Keepalived Configuration File
The main configuration file is /etc/keepalived/keepalived.conf. We need to define a VRRP instance that manages our MySQL VIP.
vrrp_script check_mysql {
script "/usr/local/bin/check_mysql_health.sh"
interval 2
weight 20 # Add weight to increase priority if MySQL is healthy
fall 2
rise 2
}
vrrp_instance VI_MYSQL {
state MASTER # Set to BACKUP on other nodes
interface eth0 # Your primary network interface
virtual_router_id 51 # Must be unique for this VRRP group
priority 150 # Higher priority for MASTER, lower for BACKUP (e.g., 100)
advert_int 1 # Advertisement interval in seconds
authentication {
auth_type PASS
auth_pass your_vrrp_password
}
virtual_ipaddress {
192.168.1.101/24 dev eth0 label eth0:vip # Your MySQL VIP
}
track_script {
check_mysql
}
notify_master "/usr/local/bin/mysql_vip_notify.sh master"
notify_backup "/usr/local/bin/mysql_vip_notify.sh backup"
notify_fault "/usr/local/bin/mysql_vip_vip_notify.sh fault"
}
Important Notes:
- Replace
eth0with your actual network interface. virtual_router_idmust be the same across all nodes in the VRRP group.prioritydetermines which node becomes MASTER. Higher is better.auth_passmust be identical on all nodes.- The
virtual_ipaddressis the VIP your applications will connect to. - The
check_mysqlscript is crucial for dynamic VIP management. notify_master,notify_backup, andnotify_faultscripts are executed when the state changes.
check_mysql_health.sh: Health Check Script
This script is executed by Keepalived to determine the health of the MySQL instance on the current node. If the instance is unhealthy, Keepalived will reduce the priority of this node, potentially causing the VIP to failover.
#!/bin/bash
# Script to check MySQL health for Keepalived.
# Assumes MySQL is running on localhost and accessible via socket or default port.
MYSQL_USER="health_check_user"
MYSQL_PASSWORD="your_health_check_password"
MYSQL_HOST="127.0.0.1"
MYSQL_PORT="3306"
# Use a simple query that should always succeed on a healthy master.
# Avoid queries that might be slow or depend on specific data.
QUERY="SELECT 1;"
# Attempt to connect and execute the query.
mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASSWORD -e "$QUERY" > /dev/null 2>&1
if [ $? -eq 0 ]; then
# MySQL is healthy
exit 0
else
# MySQL is unhealthy
exit 1
fi
Make this script executable: sudo chmod +x /usr/local/bin/check_mysql_health.sh. Ensure the health_check_user exists in MySQL with appropriate privileges (e.g., USAGE). This user should NOT be the Orchestrator user.
mysql_vip_notify.sh: Notification Script
This script is executed by Keepalived when the node transitions to MASTER, BACKUP, or FAULT state. We’ll use this to coordinate with Orchestrator and ensure the VIP is correctly managed.
#!/bin/bash
# Script executed by Keepalived on state change.
# $1: state (master, backup, fault)
STATE="$1"
VIP="192.168.1.101" # The MySQL VIP
echo "Keepalived state changed to: $STATE"
if [ "$STATE" == "master" ]; then
echo "This node is now MASTER. Ensuring VIP $VIP is assigned."
# The VIP is automatically assigned by Keepalived when it becomes MASTER.
# We can optionally log this or trigger other actions.
# For example, if Orchestrator's post-discovery hook didn't run yet,
# we might want to explicitly tell it to assign the VIP.
# However, relying on Orchestrator's post-discovery hook is cleaner.
elif [ "$STATE" == "backup" ]; then
echo "This node is now BACKUP. VIP $VIP will be managed by another node."
# VIP is automatically removed from this node by Keepalived.
elif [ "$STATE" == "fault" ]; then
echo "This node entered FAULT state. VIP $VIP will be released."
# VIP is automatically released by Keepalived.
# This state might indicate a network issue or Keepalived crash.
fi
exit 0
Make this script executable: sudo chmod +x /usr/local/bin/mysql_vip_notify.sh.
manage_mysql_vip.sh: VIP Management Script (Centralized)
This script will reside on the Keepalived MASTER node and will be called by Orchestrator’s hooks. It’s responsible for assigning or moving the VIP based on Orchestrator’s decisions.
#!/bin/bash
# Script to manage the MySQL VIP, called by Orchestrator hooks.
# This script should reside on the server designated as the Keepalived MASTER.
VIP="192.168.1.101"
INTERFACE="eth0" # Your primary network interface
TARGET_HOST="$1" # The MySQL host to assign the VIP to
ACTION="$2" # 'assign_mysql_vip' or 'move_mysql_vip'
echo "VIP Management Script: Action='$ACTION', Target Host='$TARGET_HOST'"
if [ "$ACTION" == "assign_mysql_vip" ] || [ "$ACTION" == "move_mysql_vip" ]; then
if [ -z "$TARGET_HOST" ]; then
echo "Error: Target host is not specified."
exit 1
fi
# Check if the target host is the current node. If so, Keepalived handles it.
CURRENT_NODE_IP=$(hostname -I | awk '{print $1}')
if [ "$TARGET_HOST" == "$CURRENT_NODE_IP" ]; then
echo "Target host is the current node ($TARGET_HOST). Keepalived will manage the VIP."
# Ensure VIP is configured on this node if it's the master and target
if ip addr show $INTERFACE | grep -q $VIP; then
echo "VIP $VIP already assigned to $INTERFACE."
else
echo "Assigning VIP $VIP to $INTERFACE..."
ip addr add $VIP/24 dev $INTERFACE label $INTERFACE:vip
if [ $? -ne 0 ]; then
echo "Error: Failed to assign VIP $VIP to $INTERFACE."
exit 1
fi
echo "VIP $VIP assigned successfully."
fi
exit 0
fi
# If the target host is NOT the current node, we need to remove the VIP from the current node.
# This happens when Orchestrator promotes a *different* server.
echo "Target host ($TARGET_HOST) is different from current node ($CURRENT_NODE_IP)."
echo "Removing VIP $VIP from $INTERFACE if it exists..."
if ip addr show $INTERFACE | grep -q $VIP; then
ip addr del $VIP/24 dev $INTERFACE
if [ $? -ne 0 ]; then
echo "Warning: Failed to remove VIP $VIP from $INTERFACE."
# Continue, as Keepalived might handle this or the target node will take over.
else
echo "VIP $VIP removed from $INTERFACE."
fi
else
echo "VIP $VIP not found on $INTERFACE."
fi
# In a more advanced setup, you might want to SSH to the TARGET_HOST
# and ensure it's ready before assigning the VIP. However, Keepalived's
# health checks and Orchestrator's promotion logic should cover this.
# The primary role here is to ensure the VIP is NOT on the wrong node.
echo "VIP management action completed for target host $TARGET_HOST."
exit 0
else
echo "Error: Unknown action '$ACTION'."
exit 1
fi
Make this script executable: sudo chmod +x /usr/local/bin/manage_mysql_vip.sh. This script should be deployed on the server designated as the Keepalived MASTER.
Orchestrator and Keepalived Integration Flow
1. Initial State: Keepalived MASTER node has the MySQL VIP. Orchestrator discovers the topology and its mysql-post-master-discovery.sh hook runs. It signals the Keepalived MASTER node (via SSH) to assign the VIP to the current MySQL primary. The manage_mysql_vip.sh script on the Keepalived MASTER ensures the VIP is on the correct interface (which is itself, if it’s the primary MySQL host).
2. MySQL Primary Failure: Orchestrator detects the primary MySQL instance is down. It selects a healthy replica and initiates a promotion.
3. Replica Promotion: Orchestrator executes the mysql-promote-replica.sh hook on the chosen replica. This script signals the Keepalived MASTER node (via SSH) to move the VIP.
4. VIP Move: The manage_mysql_vip.sh script on the Keepalived MASTER node receives the signal. It detects that the target MySQL host (the newly promoted primary) is *not* the current Keepalived MASTER node. It then removes the VIP from the current Keepalived MASTER node’s interface.
5. Keepalived Failover: The original Keepalived MASTER node, having lost its MySQL health check (or simply because the VIP was removed), transitions to BACKUP state. The Keepalived BACKUP node(s) detect the loss of the MASTER and transition to MASTER state. They automatically assign the VIP to their own interface.
6. Application Reconnect: Applications attempting to connect to the MySQL VIP will now connect to the newly promoted primary server. Any existing connections might be dropped, but new connections will succeed.
Python Application Deployment Considerations
Your Python application needs to be aware of the VIP and handle potential connection drops gracefully. Using a connection pool is highly recommended.
Connection Pooling
Libraries like SQLAlchemy provide robust connection pooling. Configure your application to use the MySQL VIP as the host.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
import time
# Use the VIP as the host
MYSQL_VIP = "192.168.1.101"
DB_USER = "app_user"
DB_PASSWORD = "your_app_password"
DB_NAME = "your_database"
# Connection string using the VIP
DATABASE_URL = f"mysql+mysqlconnector://{DB_USER}:{DB_PASSWORD}@{MYSQL_VIP}/{DB_NAME}"
# Configure connection pooling
# pool_size: Number of connections to keep open
# max_overflow: Number of additional connections that can be opened
# pool_recycle: Time in seconds after which a connection is automatically recycled
engine = create_engine(
DATABASE_URL,
pool_size=10,
max_overflow=5,
pool_recycle=1800, # Recycle connections every 30 minutes
connect_args={'connect_timeout': 5} # Set connection timeout
)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
def get_db():
db = SessionLocal()
try:
yield db
finally:
db.close()
# Example usage in a web framework (e.g., FastAPI)
# from fastapi import Depends, FastAPI
# app = FastAPI()
# @app.get("/items/")
# async def read_items(db: Session = Depends(get_db)):
# # Perform database operations
# try:
# # Example: Fetching data
# result = db.execute("SELECT 1") # Simple query to test connection
# return {"message": "Database connection successful", "query_result": result.scalar()}
# except Exception as e:
# # Handle connection errors gracefully
# return {"message": f"Database error: {str(e)}"}
# --- Simulating connection attempts ---
def test_connection():
db = None
try:
print(f"Attempting to connect to MySQL VIP: {MYSQL_VIP}")
db = SessionLocal()
# Execute a simple query to verify connection
result = db.execute("SELECT @@hostname as server_name").scalar()
print(f"Successfully connected to: {result}")
return True
except Exception as e:
print(f"Connection failed: {e}")
return False
finally:
if db:
db.close()
if __name__ == "__main__":
print("Testing initial connection...")
if test_connection():
print("Initial connection successful.")
else:
print("Initial connection failed. Waiting for failover...")
# Simulate application running and handling potential drops
for _ in range(5): # Try a few times with delays
time.sleep(10) # Wait between attempts
print("\n--- Testing connection after delay ---")
if test_connection():
print("Connection re-established successfully.")
break
else:
print("Still unable to connect. Waiting...")
else:
print("\nFailed to establish connection after multiple retries.")
The connect_timeout parameter in create_engine is crucial. It dictates how long the application waits for a connection to be established before giving up. During a failover, connections will be dropped, and new connection attempts will time out until the VIP is stable on the new primary. A value between 5-10 seconds is often a good starting point.
Monitoring and Alerting
Automated failover is only part of the solution. Robust monitoring and alerting are essential to detect issues that automation might miss and to notify operators of failures and recoveries.
- Orchestrator UI: Regularly check the Orchestrator web UI (accessible at
http://your-orchestrator-host:3000) to visualize your topology and see any detected issues or promotions. - Keepalived Logs: Monitor
/var/log/syslogorjournalctl -u keepalivedfor Keepalived state changes and errors. - MySQL Logs: Monitor MySQL error logs for signs of failure.
- Application Metrics: Track connection success rates, query latency, and error rates from your Python application.
- Orchestrator Alerts: Configure Orchestrator to send alerts (e.g., via email or Slack) for critical events like failed promotions or topology changes.
By combining Orchestrator’s intelligent cluster management with Keepalived’s reliable VIP failover, and ensuring your Python applications are resilient, you can build a highly available MySQL deployment on OVH that withstands individual instance failures with minimal downtime.