Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Python Deployments on DigitalOcean
Establishing a Highly Available MySQL Cluster with Orchestrator
For mission-critical applications, a single MySQL instance is a single point of failure. Architecting for high availability (HA) necessitates a robust failover strategy. We’ll leverage Orchestrator, a popular MySQL replication topology manager, to automate failover detection and execution. This approach minimizes downtime by automatically promoting a replica to primary when the current primary becomes unavailable.
Our setup will involve at least two MySQL instances: one primary and one or more replicas. Orchestrator will monitor these instances, track their replication status, and manage the promotion process. For simplicity in this example, we’ll assume a basic master-replica setup, but Orchestrator scales to complex topologies.
Orchestrator Installation and Configuration
Orchestrator can be installed from source or via pre-compiled binaries. For production, using pre-compiled binaries is generally recommended for ease of management. We’ll focus on the configuration aspects.
The core configuration file for Orchestrator is typically located at /etc/orchestrator/orchestrator.conf.json. Here’s a minimal, production-ready configuration:
{
"Debug": false,
"ListenAddress": ":3000",
"MySQLTopologyUser": "orchestrator",
"MySQLTopologyPassword": "your_mysql_password",
"MySQLOrchestratorHostPort": "127.0.0.1:3306",
"MySQLOrchestratorCredentialsConfigFile": "/etc/orchestrator/orchestrator-mysql-credentials.txt",
"DiscoveryPeriodSeconds": 10,
"FailureDetectionPeriodSeconds": 5,
"PromotionLagQuery": "SELECT MAX(seconds_behind_master) FROM mysql.slave_status",
"PromotionRule": "most-available",
"PostUnsuccessfulFailoverProcesses": [
"/path/to/your/slack_notification_script.sh"
],
"PostSuccessfulFailoverProcesses": [
"/path/to/your/slack_notification_script.sh"
],
"HooksGranularitySeconds": 60,
"SlaveLagQuery": "SELECT MAX(seconds_behind_master) FROM mysql.slave_status",
"ReplicaServerIDPattern": "^(mysql|mariadb)-",
"DetectClusterAlias": true,
"ClusterAliasDetectionCommand": "echo $(hostname -s | sed 's/mysql-//')",
"GlobalUserCredentials": {
"user": "orchestrator",
"password": "your_mysql_password"
},
"InstancePollSeconds": 10,
"TopologyPollSeconds": 60,
"MaxConcurrentOrchestrations": 10,
"SnapshotTopologiesOnStartup": true,
"SnapshotTopologiesPeriodSeconds": 3600,
"SnapshotTopologiesPath": "/var/lib/orchestrator/snapshots"
}
Key parameters to note:
ListenAddress: The address Orchestrator listens on for API requests and its web UI.MySQLTopologyUserandMySQLTopologyPassword: Credentials for Orchestrator to connect to MySQL instances. These should be dedicated, least-privilege users.MySQLOrchestratorCredentialsConfigFile: A file containing credentials for Orchestrator to connect to its *own* backend MySQL database (if using one for state persistence).DiscoveryPeriodSecondsandFailureDetectionPeriodSeconds: Control how frequently Orchestrator polls instances and checks for failures. Tune these based on your tolerance for detection latency.PromotionRule: Defines the logic for selecting a replica to promote.most-availableis a common choice, promoting the replica with the least replication lag.PostUnsuccessfulFailoverProcessesandPostSuccessfulFailoverProcesses: Crucial for integrating with alerting systems (e.g., Slack, PagerDuty). These scripts are executed after a failover attempt.ClusterAliasDetectionCommand: If your MySQL instances are part of logical clusters, this command helps Orchestrator identify them.
You’ll also need a credentials file (e.g., /etc/orchestrator/orchestrator-mysql-credentials.txt) for Orchestrator’s backend database if you’re not using an embedded backend:
user:orchestrator password:your_orchestrator_db_password
Ensure the Orchestrator user has the necessary privileges on your MySQL instances:
CREATE USER 'orchestrator'@'%' IDENTIFIED BY 'your_mysql_password'; GRANT REPLICATION CLIENT, REPLICATION SLAVE, PROCESS, RELOAD, SUPER, SELECT, LOCK TABLES, SHOW DATABASES, EVENT, SHOW VIEW, CREATE TEMPORARY TABLES, SHOW VIEW, SUPER, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'orchestrator'@'%'; FLUSH PRIVILEGES;
MySQL Replication Setup
Before Orchestrator can manage failovers, you need a properly configured MySQL replication topology. This involves setting up binary logging on the primary and configuring replicas to connect to it.
On the primary MySQL server (my.cnf or my.ini):
[mysqld] server-id = 1 log_bin = /var/log/mysql/mysql-bin.log binlog_format = ROW gtid_mode = ON enforce_gtid_consistency = ON relay_log = /var/log/mysql/mysql-relay-bin.log read_only = OFF
On each replica MySQL server (my.cnf or my.ini):
[mysqld] server-id = 2 # Unique ID for each replica log_bin = /var/log/mysql/mysql-bin.log binlog_format = ROW gtid_mode = ON enforce_gtid_consistency = ON relay_log = /var/log/mysql/mysql-relay-bin.log read_only = ON
After configuring my.cnf, restart MySQL. Then, on the primary, obtain its binary log file and position (or GTID set) for setting up replicas. With GTID enabled, this is simpler. You can use SHOW MASTER STATUS; or query performance_schema.replication_connection_status.
On each replica, configure it to replicate from the primary:
-- On the replica CHANGE MASTER TO MASTER_HOST='', MASTER_USER='replication_user', MASTER_PASSWORD='replication_password', MASTER_PORT=3306, MASTER_AUTO_POSITION=1; -- Use 1 for GTID-based replication START SLAVE; -- Ensure replication user exists on primary with REPLICATION SLAVE privilege
Integrating Python Applications with Orchestrator
Your Python applications need to be aware of the current MySQL primary. Instead of hardcoding the primary’s IP address, they should query Orchestrator’s API to discover the active primary. This ensures that when a failover occurs, your applications automatically connect to the new primary.
We can use the requests library in Python to interact with Orchestrator’s REST API. Here’s a Python function to get the current primary instance for a given cluster (assuming you’ve configured cluster aliases in Orchestrator):
import requests
import os
ORCHESTRATOR_API_URL = os.environ.get("ORCHESTRATOR_API_URL", "http://localhost:3000")
def get_mysql_primary(cluster_name):
"""
Retrieves the current primary MySQL instance for a given cluster from Orchestrator.
"""
try:
response = requests.get(f"{ORCHESTRATOR_API_URL}/api/discover/clusters/{cluster_name}")
response.raise_for_status() # Raise an exception for bad status codes
data = response.json()
if not data or not data.get("Clusters"):
return None
# Orchestrator API returns a list of clusters, we expect one
cluster_info = data["Clusters"][0]
# Find the primary node within the cluster
for node in cluster_info.get("Nodes", []):
if node.get("IsPrimary"):
return {
"host": node.get("HostName"),
"port": node.get("Port"),
"instance_id": node.get("InstanceId")
}
return None
except requests.exceptions.RequestException as e:
print(f"Error querying Orchestrator API: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred: {e}")
return None
# Example usage:
if __name__ == "__main__":
# Assuming your cluster is named 'my_app_cluster' in Orchestrator
cluster_name = "my_app_cluster"
primary_info = get_mysql_primary(cluster_name)
if primary_info:
print(f"Current MySQL primary for cluster '{cluster_name}':")
print(f" Host: {primary_info['host']}")
print(f" Port: {primary_info['port']}")
print(f" Instance ID: {primary_info['instance_id']}")
# Use this information to configure your database connection pool
# For example, with SQLAlchemy:
# engine = create_engine(f"mysql+mysqlconnector://user:password@{primary_info['host']}:{primary_info['port']}/database")
else:
print(f"Could not determine MySQL primary for cluster '{cluster_name}'.")
# Implement fallback or error handling logic here
This function should be called periodically or on application startup to obtain the current primary’s connection details. Your application’s database connection manager should dynamically update its configuration based on this information. For robust applications, consider implementing a retry mechanism that uses this function to re-discover the primary if a connection to the current primary fails.
Automated Failover Workflow and Testing
Orchestrator’s core value lies in its automated failover capabilities. When Orchestrator detects that the primary is unreachable (e.g., via network checks or failed health probes), it initiates a failover process:
- Detection: Orchestrator’s polling mechanism identifies the primary as down.
- Replica Selection: Based on the
PromotionRule(e.g.,most-available), Orchestrator selects the best replica to promote. - Promotion: Orchestrator stops replication on the chosen replica, sets it to
read_writemode, and updates its internal topology. - Re-pointing: Orchestrator attempts to re-point other replicas to the newly promoted primary.
- Hooks Execution:
PostSuccessfulFailoverProcessesorPostUnsuccessfulFailoverProcessesare triggered, allowing for notifications or further automated remediation.
Testing Failovers: Thorough testing is paramount. You can simulate failures in several ways:
- Graceful Primary Shutdown: Stop the MySQL process on the primary server. Orchestrator should detect this and initiate a failover.
- Network Isolation: Use firewall rules (
iptables, DigitalOcean Cloud Firewall) to block network access to the primary MySQL port from the Orchestrator server and other replicas. - Simulated Instance Failure: If running Orchestrator and MySQL in Docker, stop the MySQL container.
After simulating a failure, verify:
- Orchestrator’s web UI shows the failover has occurred and the new primary is correctly identified.
- Your Python application can successfully connect to the new primary.
- Replication is functioning correctly between the new primary and its replicas.
- Alerts were sent via the configured hooks.
DigitalOcean Specific Considerations
When deploying this on DigitalOcean, consider the following:
- Droplets: Deploy Orchestrator on a dedicated Droplet, or co-locate it with a non-critical application component. Ensure it has network access to all MySQL Droplets.
- Firewalls: Utilize DigitalOcean’s Cloud Firewalls to restrict access to your MySQL ports (default 3306) only from your application servers and the Orchestrator Droplet. This enhances security.
- Monitoring: Integrate Orchestrator’s status and failover events with DigitalOcean’s monitoring tools or your preferred external monitoring solution (e.g., Prometheus, Datadog) for comprehensive visibility.
- Backups: While Orchestrator handles HA, it does not replace backups. Ensure you have a robust backup strategy for your MySQL data, leveraging DigitalOcean’s snapshot capabilities or custom backup scripts.
- Networking: If your MySQL instances are spread across different VPCs or regions, ensure proper network connectivity and low latency. For HA within a single region, using private networking is highly recommended.
By implementing Orchestrator for MySQL failover and designing your Python applications to dynamically discover the primary, you achieve a robust, automated disaster recovery solution that significantly minimizes downtime.