Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Shopify Deployments on Linode

Establishing a High-Availability MySQL Cluster with Orchestrator

For mission-critical Shopify deployments, a single MySQL instance is an unacceptable single point of failure. We’ll architect a highly available MySQL cluster leveraging Orchestrator for automated failover. Orchestrator is a powerful tool for MySQL replication management, topology discovery, and automated failover. It’s designed to be robust and can handle complex topologies.

Our setup will involve a primary (writer) instance and one or more replicas. Orchestrator will monitor the health of the primary and, in case of failure, promote a replica to become the new primary. This process is designed to be as seamless as possible for applications, though some brief downtime during the failover window is inevitable.

Orchestrator Installation and Configuration

We’ll install Orchestrator on a separate, dedicated Linode instance. This instance should have network access to all MySQL nodes. For this example, we’ll assume a Debian/Ubuntu-based system.

First, download the latest Orchestrator binary. You can find the latest release on the Orchestrator GitHub releases page. For example:

wget https://github.com/openark/orchestrator/releases/download/v3.2.7/orchestrator-3.2.7-linux-amd64.tar.gz
tar xvfz orchestrator-3.2.7-linux-amd64.tar.gz
sudo mv orchestrator-3.2.7-linux-amd64 /usr/local/bin/orchestrator
sudo chmod +x /usr/local/bin/orchestrator

Next, create a configuration file for Orchestrator. This file defines how Orchestrator connects to MySQL and its own operational parameters.

[orchestrator]
data-dir = /var/lib/orchestrator
log-dir = /var/log/orchestrator
mysql-host = 127.0.0.1
mysql-port = 3306
mysql-user = orchestrator_user
mysql-password = your_orchestrator_db_password
mysql-socket = /var/run/mysqld/mysqld.sock
discovery-poll-seconds = 10
failure-check-seconds = 5
promotion-user = orchestrator_promote_user
promotion-password = your_promotion_password
replication-user = orchestrator_replication_user
replication-password = your_replication_password
web-frontend-port = 3000
debug = false
read-only = false
graceful-master-takeover = true
detect-long-running-transactions = true
detect-unseen-downtime = true
skip-slave-start = false
auto-discover-on-startup = true
cluster-name = shopify_cluster
cluster-alias = shopify_primary_alias

You’ll need to create the specified MySQL users on your primary MySQL instance. These users require specific privileges for Orchestrator to manage replication and perform failovers. Ensure these users are created on the primary and replicated to any potential failover candidates.

-- On your primary MySQL instance:
CREATE USER 'orchestrator_user'@'%' IDENTIFIED BY 'your_orchestrator_db_password';
GRANT SUPER, PROCESS, REPLICATION SLAVE, REPLICATION CLIENT, RELOAD, LOCK TABLES, SHOW DATABASES, SHOW VIEW, EVENT, CREATE, ALTER, DROP, INSERT, UPDATE, DELETE, SELECT ON *.* TO 'orchestrator_user'@'%';

CREATE USER 'orchestrator_promote_user'@'%' IDENTIFIED BY 'your_promotion_password';
GRANT SUPER, REPLICATION SLAVE, REPLICATION CLIENT, RELOAD, LOCK TABLES, PROCESS ON *.* TO 'orchestrator_promote_user'@'%';

CREATE USER 'orchestrator_replication_user'@'%' IDENTIFIED BY 'your_replication_password';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'orchestrator_replication_user'@'%';

FLUSH PRIVILEGES;

Create the necessary directories and set permissions:

sudo mkdir -p /var/lib/orchestrator
sudo mkdir -p /var/log/orchestrator
sudo chown -R orchestrator:orchestrator /var/lib/orchestrator
sudo chown -R orchestrator:orchestrator /var/log/orchestrator

Now, initialize the Orchestrator database schema. Orchestrator uses its own MySQL database to store its state. This can be on the same instance as Orchestrator or a separate one. For simplicity, we’ll assume it’s on the Orchestrator instance itself.

sudo -u orchestrator orchestrator --mysql-db-host=127.0.0.1 --mysql-db-port=3306 --mysql-db-user=orchestrator_user --mysql-db-password=your_orchestrator_db_password --mysql-db-name=orchestrator_db --mysql-db-create-database=true --mysql-db-schema-init=true

Start the Orchestrator service. You can use systemd for this.

sudo tee /etc/systemd/system/orchestrator.service <<EOF
[Unit]
Description=Orchestrator
After=network.target

[Service]
User=orchestrator
Group=orchestrator
ExecStart=/usr/local/bin/orchestrator -c /etc/orchestrator/orchestrator.conf.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable orchestrator
sudo systemctl start orchestrator

MySQL Replication Setup

Before Orchestrator can manage your cluster, you need a properly configured MySQL replication topology. This typically involves a primary instance and one or more replicas configured for asynchronous replication.

On the primary instance, ensure binary logging is enabled and set a unique server ID.

[mysqld]
server-id = 1
log-bin = mysql-bin
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON

On each replica instance, configure a unique server ID and point it to the primary.

[mysqld]
server-id = 2 # Unique for each replica
log-bin = mysql-bin # Optional, but good practice for chaining
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON
relay-log = mysql-relay-bin

After configuring MySQL, you’ll need to set up the initial replication. This involves taking a consistent snapshot of the primary, copying it to the replica, and then configuring the replica to connect to the primary using its binary log coordinates or GTID information.

-- On the primary:
SHOW MASTER STATUS;
-- Note the File and Position, or GTID set.

-- On the replica (after restoring data snapshot):
CHANGE MASTER TO
  MASTER_HOST='primary_ip_address',
  MASTER_USER='replication_user',
  MASTER_PASSWORD='replication_password',
  MASTER_PORT=3306,
  MASTER_AUTO_POSITION=1; -- Use this if GTID is enabled
-- Or if not using GTID:
-- MASTER_LOG_FILE='mysql-bin.xxxxxx',
-- MASTER_LOG_POS=xxxxxx;

START SLAVE;
SHOW SLAVE STATUS\G;

Discovering and Managing the Cluster with Orchestrator

Once Orchestrator is running and MySQL replication is set up, you need to tell Orchestrator about your MySQL instances. You can do this via the Orchestrator API or by configuring discovery.

The simplest way to start is by manually registering your primary instance. Orchestrator will then discover replicas connected to it.

curl -i -X POST "http://localhost:3000/api/discover-instance?instance=primary_ip_address:3306"

After discovery, you can access the Orchestrator web UI (defaulting to port 3000) to visualize your cluster topology. Orchestrator will automatically detect replication relationships, identify potential issues, and provide options for manual intervention.

To enable automatic failover, you need to configure Orchestrator’s failover settings. This is done within the Orchestrator configuration file or via API calls. Key parameters include:

auto-failover-on-unreachable-master: Set to true to enable automatic failover when the master is unreachable.
auto-failover-on-slave-lag: Set to true to enable failover if a replica lags too far behind the master.
promotion-user and promotion-password: Credentials for the user Orchestrator uses to promote a replica.

You can also use the API to enable auto-failover for specific clusters:

curl -i -X POST "http://localhost:3000/api/set-cluster-topology-promotion-rule?clusterName=shopify_cluster&rule=AutoFailover"

Integrating Shopify with the High-Availability MySQL Cluster

Shopify itself is a SaaS platform and does not allow direct modification of its underlying database. However, if you are running a custom Shopify-like application or a headless commerce solution that *does* manage its own MySQL database, the integration strategy is crucial. The key is to abstract the database endpoint.

Abstracting the Database Endpoint

Your application (e.g., your custom Shopify backend, API gateway, or microservices) should not connect directly to the IP address of the primary MySQL instance. Instead, it should connect to a stable, virtual IP address or a DNS name that always resolves to the current primary.

There are several ways to achieve this:

Virtual IP (VIP) Management: Tools like Keepalived can manage a floating IP address that is moved to the active MySQL primary. Orchestrator can be configured to trigger Keepalived scripts upon failover to update the VIP.
DNS Updates: Orchestrator can be configured to update DNS records (e.g., A records in Route 53, Cloudflare, or your Linode DNS) to point to the new primary’s IP address. This requires Orchestrator to have credentials and permissions to manage your DNS provider.
Proxy Layer: Introduce a proxy like ProxySQL or HAProxy in front of your MySQL cluster. The proxy can be configured to point to the current primary. Orchestrator can then instruct the proxy to switch its backend upon failover.

Let’s consider the DNS update approach using Orchestrator’s hooks.

Orchestrator Hooks for DNS Updates

Orchestrator allows you to define scripts that run on specific events, such as a successful master promotion. We can leverage this to update our DNS records.

First, ensure your application’s database connection string uses a DNS name (e.g., db.yourdomain.com) instead of an IP address. This DNS name should initially point to your primary MySQL instance’s IP.

Create a script that Orchestrator will execute. This script will use a DNS provider’s CLI or API to update the record. For example, using Linode’s CLI:

#!/bin/bash

# This script is executed by Orchestrator on master promotion.
# It updates the DNS record for the database.

DB_DNS_NAME="db.yourdomain.com"
LINODE_DOMAIN="yourdomain.com"
LINODE_RECORD_ID="your_record_id" # Get this from Linode DNS manager
LINODE_TOKEN="your_linode_api_token" # Store securely, e.g., in Orchestrator's config or a protected file

NEW_MASTER_IP="$1" # Orchestrator passes the new master IP as the first argument

if [ -z "$NEW_MASTER_IP" ]; then
  echo "Error: New master IP not provided."
  exit 1
fi

echo "Updating DNS record for $DB_DNS_NAME to $NEW_MASTER_IP"

# Use Linode CLI to update the DNS record
# Ensure linode-cli is installed and configured with the token
# Example: linode-cli dns records-update $LINODE_DOMAIN $LINODE_RECORD_ID --target $NEW_MASTER_IP --type A

# For demonstration, we'll just echo the command. Replace with actual API call.
echo "Simulating Linode DNS update: linode-cli dns records-update $LINODE_DOMAIN $LINODE_RECORD_ID --target $NEW_MASTER_IP --type A"

# In a real scenario, you would execute the above command.
# Example using curl if linode-cli is not available or preferred:
# curl -X PUT "https://api.linode.com/v4/domains/$LINODE_DOMAIN/records/$LINODE_RECORD_ID" \
#   -H "Authorization: Bearer $LINODE_TOKEN" \
#   -H "Content-Type: application/json" \
#   -d "{\"type\": \"A\", \"target\": \"$NEW_MASTER_IP\", \"name\": \"$(echo $DB_DNS_NAME | sed "s/\.$LINODE_DOMAIN//")\"}"

# Check the exit status of the command and log appropriately
# if [ $? -eq 0 ]; then
#   echo "DNS update successful."
# else
#   echo "DNS update failed."
#   exit 1
# fi

exit 0

Make this script executable:

sudo chmod +x /path/to/your/dns_update_script.sh

Now, configure Orchestrator to use this script as a hook. You can do this by adding the following to your Orchestrator configuration file or by using the API:

[orchestrator]
# ... other configurations ...
post-promotion-hook = /path/to/your/dns_update_script.sh

Alternatively, via API:

curl -i -X POST "http://localhost:3000/api/set-cluster-topology-promotion-hook?clusterName=shopify_cluster&hook=/path/to/your/dns_update_script.sh"

Testing and Monitoring Failover

Thorough testing is paramount. Simulate failures to ensure the automated failover process works as expected and that your application can reconnect to the new primary.

Simulating Failures

You can simulate a primary failure by:

Stopping the MySQL service on the primary instance: sudo systemctl stop mysql
Blocking network access to the primary instance from the Orchestrator node.
Physically disconnecting the primary server (if applicable).

Monitor the Orchestrator UI and logs. You should see Orchestrator detect the primary’s unavailability, select a suitable replica, promote it, and execute the configured hooks (like DNS updates).

After the failover, verify:

The DNS record (if used) points to the new primary’s IP.
Your application can connect to the database using the abstract endpoint (DNS name).
Replication is functioning correctly from the new primary to its replicas.
Orchestrator has re-established the topology, potentially making the old primary a replica of the new one after it recovers.

Monitoring Orchestrator and MySQL Health

Beyond Orchestrator’s built-in UI, integrate its metrics and status into your broader monitoring system (e.g., Prometheus, Datadog, Nagios).

Key metrics to monitor:

Orchestrator service health (is it running?).
MySQL replication lag on all nodes.
Orchestrator’s internal database health.
Number of unreachable masters detected.
Number of successful/failed promotions.

For MySQL, ensure you have robust monitoring for:

Query performance.
Connection counts.
Disk I/O and space.
Replication status (SHOW SLAVE STATUS\G).

By implementing Orchestrator for automated MySQL failover and abstracting your database endpoint, you significantly enhance the resilience of your Shopify deployment against database failures, ensuring minimal downtime and continuous operation.