Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Shopify Deployments on Linode
Establishing a High-Availability MySQL Cluster with Orchestrator
For mission-critical Shopify deployments, a single MySQL instance is an unacceptable single point of failure. We’ll architect a highly available MySQL cluster leveraging Orchestrator for automated failover. Orchestrator is a powerful tool for MySQL replication management, topology discovery, and automated failover. It’s designed to be robust and can handle complex topologies.
Our setup will involve a primary (writer) instance and one or more replicas. Orchestrator will monitor the health of the primary and, in case of failure, promote a replica to become the new primary. This process is designed to be as seamless as possible for applications, though some brief downtime during the failover window is inevitable.
Orchestrator Installation and Configuration
We’ll install Orchestrator on a separate, dedicated Linode instance. This instance should have network access to all MySQL nodes. For this example, we’ll assume a Debian/Ubuntu-based system.
First, download the latest Orchestrator binary. You can find the latest release on the Orchestrator GitHub releases page. For example:
wget https://github.com/openark/orchestrator/releases/download/v3.2.7/orchestrator-3.2.7-linux-amd64.tar.gz tar xvfz orchestrator-3.2.7-linux-amd64.tar.gz sudo mv orchestrator-3.2.7-linux-amd64 /usr/local/bin/orchestrator sudo chmod +x /usr/local/bin/orchestrator
Next, create a configuration file for Orchestrator. This file defines how Orchestrator connects to MySQL and its own operational parameters.
[orchestrator] data-dir = /var/lib/orchestrator log-dir = /var/log/orchestrator mysql-host = 127.0.0.1 mysql-port = 3306 mysql-user = orchestrator_user mysql-password = your_orchestrator_db_password mysql-socket = /var/run/mysqld/mysqld.sock discovery-poll-seconds = 10 failure-check-seconds = 5 promotion-user = orchestrator_promote_user promotion-password = your_promotion_password replication-user = orchestrator_replication_user replication-password = your_replication_password web-frontend-port = 3000 debug = false read-only = false graceful-master-takeover = true detect-long-running-transactions = true detect-unseen-downtime = true skip-slave-start = false auto-discover-on-startup = true cluster-name = shopify_cluster cluster-alias = shopify_primary_alias
You’ll need to create the specified MySQL users on your primary MySQL instance. These users require specific privileges for Orchestrator to manage replication and perform failovers. Ensure these users are created on the primary and replicated to any potential failover candidates.
-- On your primary MySQL instance: CREATE USER 'orchestrator_user'@'%' IDENTIFIED BY 'your_orchestrator_db_password'; GRANT SUPER, PROCESS, REPLICATION SLAVE, REPLICATION CLIENT, RELOAD, LOCK TABLES, SHOW DATABASES, SHOW VIEW, EVENT, CREATE, ALTER, DROP, INSERT, UPDATE, DELETE, SELECT ON *.* TO 'orchestrator_user'@'%'; CREATE USER 'orchestrator_promote_user'@'%' IDENTIFIED BY 'your_promotion_password'; GRANT SUPER, REPLICATION SLAVE, REPLICATION CLIENT, RELOAD, LOCK TABLES, PROCESS ON *.* TO 'orchestrator_promote_user'@'%'; CREATE USER 'orchestrator_replication_user'@'%' IDENTIFIED BY 'your_replication_password'; GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'orchestrator_replication_user'@'%'; FLUSH PRIVILEGES;
Create the necessary directories and set permissions:
sudo mkdir -p /var/lib/orchestrator sudo mkdir -p /var/log/orchestrator sudo chown -R orchestrator:orchestrator /var/lib/orchestrator sudo chown -R orchestrator:orchestrator /var/log/orchestrator
Now, initialize the Orchestrator database schema. Orchestrator uses its own MySQL database to store its state. This can be on the same instance as Orchestrator or a separate one. For simplicity, we’ll assume it’s on the Orchestrator instance itself.
sudo -u orchestrator orchestrator --mysql-db-host=127.0.0.1 --mysql-db-port=3306 --mysql-db-user=orchestrator_user --mysql-db-password=your_orchestrator_db_password --mysql-db-name=orchestrator_db --mysql-db-create-database=true --mysql-db-schema-init=true
Start the Orchestrator service. You can use systemd for this.
sudo tee /etc/systemd/system/orchestrator.service <<EOF [Unit] Description=Orchestrator After=network.target [Service] User=orchestrator Group=orchestrator ExecStart=/usr/local/bin/orchestrator -c /etc/orchestrator/orchestrator.conf.json Restart=always RestartSec=10 [Install] WantedBy=multi-user.target EOF sudo systemctl daemon-reload sudo systemctl enable orchestrator sudo systemctl start orchestrator
MySQL Replication Setup
Before Orchestrator can manage your cluster, you need a properly configured MySQL replication topology. This typically involves a primary instance and one or more replicas configured for asynchronous replication.
On the primary instance, ensure binary logging is enabled and set a unique server ID.
[mysqld] server-id = 1 log-bin = mysql-bin binlog_format = ROW gtid_mode = ON enforce_gtid_consistency = ON
On each replica instance, configure a unique server ID and point it to the primary.
[mysqld] server-id = 2 # Unique for each replica log-bin = mysql-bin # Optional, but good practice for chaining binlog_format = ROW gtid_mode = ON enforce_gtid_consistency = ON relay-log = mysql-relay-bin
After configuring MySQL, you’ll need to set up the initial replication. This involves taking a consistent snapshot of the primary, copying it to the replica, and then configuring the replica to connect to the primary using its binary log coordinates or GTID information.
-- On the primary: SHOW MASTER STATUS; -- Note the File and Position, or GTID set. -- On the replica (after restoring data snapshot): CHANGE MASTER TO MASTER_HOST='primary_ip_address', MASTER_USER='replication_user', MASTER_PASSWORD='replication_password', MASTER_PORT=3306, MASTER_AUTO_POSITION=1; -- Use this if GTID is enabled -- Or if not using GTID: -- MASTER_LOG_FILE='mysql-bin.xxxxxx', -- MASTER_LOG_POS=xxxxxx; START SLAVE; SHOW SLAVE STATUS\G;
Discovering and Managing the Cluster with Orchestrator
Once Orchestrator is running and MySQL replication is set up, you need to tell Orchestrator about your MySQL instances. You can do this via the Orchestrator API or by configuring discovery.
The simplest way to start is by manually registering your primary instance. Orchestrator will then discover replicas connected to it.
curl -i -X POST "http://localhost:3000/api/discover-instance?instance=primary_ip_address:3306"
After discovery, you can access the Orchestrator web UI (defaulting to port 3000) to visualize your cluster topology. Orchestrator will automatically detect replication relationships, identify potential issues, and provide options for manual intervention.
To enable automatic failover, you need to configure Orchestrator’s failover settings. This is done within the Orchestrator configuration file or via API calls. Key parameters include:
auto-failover-on-unreachable-master: Set totrueto enable automatic failover when the master is unreachable.auto-failover-on-slave-lag: Set totrueto enable failover if a replica lags too far behind the master.promotion-userandpromotion-password: Credentials for the user Orchestrator uses to promote a replica.
You can also use the API to enable auto-failover for specific clusters:
curl -i -X POST "http://localhost:3000/api/set-cluster-topology-promotion-rule?clusterName=shopify_cluster&rule=AutoFailover"
Integrating Shopify with the High-Availability MySQL Cluster
Shopify itself is a SaaS platform and does not allow direct modification of its underlying database. However, if you are running a custom Shopify-like application or a headless commerce solution that *does* manage its own MySQL database, the integration strategy is crucial. The key is to abstract the database endpoint.
Abstracting the Database Endpoint
Your application (e.g., your custom Shopify backend, API gateway, or microservices) should not connect directly to the IP address of the primary MySQL instance. Instead, it should connect to a stable, virtual IP address or a DNS name that always resolves to the current primary.
There are several ways to achieve this:
- Virtual IP (VIP) Management: Tools like Keepalived can manage a floating IP address that is moved to the active MySQL primary. Orchestrator can be configured to trigger Keepalived scripts upon failover to update the VIP.
- DNS Updates: Orchestrator can be configured to update DNS records (e.g., A records in Route 53, Cloudflare, or your Linode DNS) to point to the new primary’s IP address. This requires Orchestrator to have credentials and permissions to manage your DNS provider.
- Proxy Layer: Introduce a proxy like ProxySQL or HAProxy in front of your MySQL cluster. The proxy can be configured to point to the current primary. Orchestrator can then instruct the proxy to switch its backend upon failover.
Let’s consider the DNS update approach using Orchestrator’s hooks.
Orchestrator Hooks for DNS Updates
Orchestrator allows you to define scripts that run on specific events, such as a successful master promotion. We can leverage this to update our DNS records.
First, ensure your application’s database connection string uses a DNS name (e.g., db.yourdomain.com) instead of an IP address. This DNS name should initially point to your primary MySQL instance’s IP.
Create a script that Orchestrator will execute. This script will use a DNS provider’s CLI or API to update the record. For example, using Linode’s CLI:
#!/bin/bash
# This script is executed by Orchestrator on master promotion.
# It updates the DNS record for the database.
DB_DNS_NAME="db.yourdomain.com"
LINODE_DOMAIN="yourdomain.com"
LINODE_RECORD_ID="your_record_id" # Get this from Linode DNS manager
LINODE_TOKEN="your_linode_api_token" # Store securely, e.g., in Orchestrator's config or a protected file
NEW_MASTER_IP="$1" # Orchestrator passes the new master IP as the first argument
if [ -z "$NEW_MASTER_IP" ]; then
echo "Error: New master IP not provided."
exit 1
fi
echo "Updating DNS record for $DB_DNS_NAME to $NEW_MASTER_IP"
# Use Linode CLI to update the DNS record
# Ensure linode-cli is installed and configured with the token
# Example: linode-cli dns records-update $LINODE_DOMAIN $LINODE_RECORD_ID --target $NEW_MASTER_IP --type A
# For demonstration, we'll just echo the command. Replace with actual API call.
echo "Simulating Linode DNS update: linode-cli dns records-update $LINODE_DOMAIN $LINODE_RECORD_ID --target $NEW_MASTER_IP --type A"
# In a real scenario, you would execute the above command.
# Example using curl if linode-cli is not available or preferred:
# curl -X PUT "https://api.linode.com/v4/domains/$LINODE_DOMAIN/records/$LINODE_RECORD_ID" \
# -H "Authorization: Bearer $LINODE_TOKEN" \
# -H "Content-Type: application/json" \
# -d "{\"type\": \"A\", \"target\": \"$NEW_MASTER_IP\", \"name\": \"$(echo $DB_DNS_NAME | sed "s/\.$LINODE_DOMAIN//")\"}"
# Check the exit status of the command and log appropriately
# if [ $? -eq 0 ]; then
# echo "DNS update successful."
# else
# echo "DNS update failed."
# exit 1
# fi
exit 0
Make this script executable:
sudo chmod +x /path/to/your/dns_update_script.sh
Now, configure Orchestrator to use this script as a hook. You can do this by adding the following to your Orchestrator configuration file or by using the API:
[orchestrator] # ... other configurations ... post-promotion-hook = /path/to/your/dns_update_script.sh
Alternatively, via API:
curl -i -X POST "http://localhost:3000/api/set-cluster-topology-promotion-hook?clusterName=shopify_cluster&hook=/path/to/your/dns_update_script.sh"
Testing and Monitoring Failover
Thorough testing is paramount. Simulate failures to ensure the automated failover process works as expected and that your application can reconnect to the new primary.
Simulating Failures
You can simulate a primary failure by:
- Stopping the MySQL service on the primary instance:
sudo systemctl stop mysql - Blocking network access to the primary instance from the Orchestrator node.
- Physically disconnecting the primary server (if applicable).
Monitor the Orchestrator UI and logs. You should see Orchestrator detect the primary’s unavailability, select a suitable replica, promote it, and execute the configured hooks (like DNS updates).
After the failover, verify:
- The DNS record (if used) points to the new primary’s IP.
- Your application can connect to the database using the abstract endpoint (DNS name).
- Replication is functioning correctly from the new primary to its replicas.
- Orchestrator has re-established the topology, potentially making the old primary a replica of the new one after it recovers.
Monitoring Orchestrator and MySQL Health
Beyond Orchestrator’s built-in UI, integrate its metrics and status into your broader monitoring system (e.g., Prometheus, Datadog, Nagios).
Key metrics to monitor:
- Orchestrator service health (is it running?).
- MySQL replication lag on all nodes.
- Orchestrator’s internal database health.
- Number of unreachable masters detected.
- Number of successful/failed promotions.
For MySQL, ensure you have robust monitoring for:
- Query performance.
- Connection counts.
- Disk I/O and space.
- Replication status (
SHOW SLAVE STATUS\G).
By implementing Orchestrator for automated MySQL failover and abstracting your database endpoint, you significantly enhance the resilience of your Shopify deployment against database failures, ensuring minimal downtime and continuous operation.