Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Shopify Deployments on DigitalOcean

Automated MySQL Failover with Orchestrator on DigitalOcean

Achieving true high availability for critical databases like MySQL necessitates an automated failover strategy. Manual intervention during an outage is a recipe for extended downtime and lost revenue. This section details the setup of Orchestrator, a popular open-source tool for MySQL replication management and automated failover, within a DigitalOcean environment.

Our architecture will consist of at least three MySQL nodes: one primary and two replicas. Orchestrator will monitor the health of the primary and, upon detecting failure, promote one of the replicas to become the new primary. This setup assumes a basic understanding of MySQL replication.

Orchestrator Installation and Configuration

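We’ll deploy Orchestrator as a standalone service. For simplicity, we’ll install it on a separate DigitalOcean Droplet. For production, consider running Orchestrator itself in a highly available setup (for example, several Orchestrator nodes using its raft mode), so that the failover tool is not a single point of failure.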

First, download the Orchestrator .deb package from the project's GitHub releases page (https://github.com/openark/orchestrator/releases) and install it. The version below is only an example; confirm the current release and the exact asset name on the releases page:

sudo apt-get update
sudo apt-get install -y wget
ORCH_VERSION=3.2.6  # example version; check the releases page
wget "https://github.com/openark/orchestrator/releases/download/v${ORCH_VERSION}/orchestrator_${ORCH_VERSION}_amd64.deb"
sudo dpkg -i "orchestrator_${ORCH_VERSION}_amd64.deb"
sudo apt-get install -f -y  # pull in any missing dependencies

Next, configure Orchestrator. It looks for its configuration at /etc/orchestrator.conf.json first, then at conf/orchestrator.conf.json and orchestrator.conf.json relative to its working directory. We need to define how Orchestrator connects to the MySQL topology and to its own backend database.

For the backend database, Orchestrator can use MySQL itself. Let’s set up a dedicated MySQL instance for Orchestrator’s state. This instance does NOT need to be highly available for Orchestrator’s basic functionality, but it’s good practice to have it resilient.

-- On the dedicated MySQL instance for Orchestrator backend
CREATE DATABASE orchestrator;
CREATE USER 'orchestrator'@'%' IDENTIFIED BY 'your_strong_password';
GRANT ALL PRIVILEGES ON orchestrator.* TO 'orchestrator'@'%';
FLUSH PRIVILEGES;

Now, configure orchestrator.conf.json. Replace placeholders with your actual credentials and hostnames/IPs.

{
  "Debug": false,
  "ListenAddress": ":3000",
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "your_mysql_replication_password",
  "MySQLOrchestratorHost": "orchestrator-backend-mysql-host",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orchestrator",
  "MySQLOrchestratorPassword": "your_orchestrator_backend_password",
  "InstancePollSeconds": 10,
  "RecoveryPeriodBlockSeconds": 300,
  "RecoverMasterClusterFilters": ["*"],
  "RecoverIntermediateMasterClusterFilters": ["*"],
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "PreFailoverProcesses": [
    "/path/to/your/pre_failover_script.sh {failedHost} {failureCluster}"
  ],
  "PostMasterFailoverProcesses": [
    "/path/to/your/post_failover_script.sh {successorHost} {failureCluster}"
  ]
}

Three points about this file are easy to miss. First, Orchestrator only performs automated recovery on clusters matched by RecoverMasterClusterFilters; the "*" pattern matches every cluster. Second, RecoveryPeriodBlockSeconds prevents a second automated recovery on the same cluster within that window, which guards against flapping. Third, hook scripts receive no positional arguments unless you pass them: Orchestrator substitutes placeholders such as {failedHost}, {successorHost}, and {failureCluster} into the command line before running it, and it also exports matching ORC_* environment variables (for example ORC_SUCCESSOR_HOST).

Crucially, the MySQLTopologyUser needs appropriate privileges on all your MySQL instances (primary and replicas) to read replication status and perform failover operations. Orchestrator uses this same account to promote a replica, so no separate promotion user is needed.

-- On the primary only; the statements replicate to the replicas.
-- (Running them on every node separately would collide with the replicated statements.)
CREATE USER 'orchestrator'@'%' IDENTIFIED BY 'your_mysql_replication_password';
GRANT REPLICATION CLIENT, REPLICATION SLAVE, PROCESS, RELOAD, SUPER, SELECT ON *.* TO 'orchestrator'@'%';
FLUSH PRIVILEGES;

Start and enable the Orchestrator service:

sudo systemctl start orchestrator
sudo systemctl enable orchestrator

Discovering and Registering MySQL Instances

Once Orchestrator is running, you can access its web UI (port 3000 on the Orchestrator Droplet, per the ListenAddress setting) to discover your MySQL topology. Open Clusters > Discover in the top menu and enter the hostname and port of one of your MySQL instances (preferably the current primary). Orchestrator will then probe the instance, follow its replication links to find the replicas, and build the topology map.

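Alternatively, you can use the Orchestrator API or CLI to discover instances. The discover endpoint takes the host and port in the URL path:

# Via the HTTP API
curl -s http://localhost:3000/api/discover/your_mysql_primary_host/3306

# Via orchestrator-client (expects ORCHESTRATOR_API to point at the API, e.g. http://localhost:3000/api)
orchestrator-client -c discover -i your_mysql_primary_host:3306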

After discovery, review the topology in the web UI. Ensure all instances are correctly identified and their replication relationships are accurate. Orchestrator will automatically detect and display clusters. You can also manually assign cluster aliases for better organization.

Configuring Automated Failover

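Automated failover is not enabled by default. Orchestrator detects failures on every cluster it knows about, but it only recovers clusters matched by the RecoverMasterClusterFilters setting (for example, ["*"] to match all clusters). When Orchestrator concludes that the primary of a matched cluster is dead (e.g., the Droplet is down or the MySQL process has stopped, and the replicas also report that they cannot reach it), it initiates a failover process. This involves:

Identifying a suitable replica to promote. Orchestrator prefers the most up-to-date replica and also weighs factors such as configured promotion rules, binary logging settings, and MySQL version compatibility.
Executing the PreFailoverProcesses hooks (if configured). Hooks run locally on the Orchestrator host, not on the MySQL nodes, and a non-zero exit code from a pre-failover hook aborts the recovery.
Promoting the chosen replica using the MySQLTopologyUser credentials. With ApplyMySQLPromotionAfterMasterFailover enabled, this means clearing the replica's replication configuration and making it writable (read_only = 0).
Reconfiguring other replicas to replicate from the newly promoted primary.
Executing the PostMasterFailoverProcesses hooks (if configured), again on the Orchestrator host.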
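The first step, choosing a candidate, can be illustrated with a deliberately simplified sketch. This is not Orchestrator's actual selection logic, which weighs more than lag; it only shows the basic idea of preferring the least-lagging healthy replica. The pick_candidate function and its "host lag_seconds" input format are hypothetical.

```bash
# pick_candidate: reads "host lag_seconds" lines on stdin and prints the host
# with the lowest lag. Lines with a missing or non-numeric lag (for example
# NULL, meaning replication is not running) are skipped.
pick_candidate() {
  local best_host="" best_lag="" host lag
  while read -r host lag; do
    case "$lag" in
      ''|*[!0-9]*) continue ;;   # not a plain non-negative integer
    esac
    if [ -z "$best_lag" ] || [ "$lag" -lt "$best_lag" ]; then
      best_host=$host
      best_lag=$lag
    fi
  done
  if [ -n "$best_host" ]; then
    printf '%s\n' "$best_host"
  else
    return 1   # no usable candidate
  fi
}
```

For example, feeding it `printf 'db2 4\ndb3 1\n'` prints db3, and a replica reporting NULL lag is never chosen.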

The pre- and post-failover hooks (PreFailoverProcesses and PostMasterFailoverProcesses in Orchestrator's configuration) are critical for integrating with your application. These scripts can:

Update DNS records or load balancer configurations to point to the new primary.
Notify monitoring systems or Slack channels about the failover event.
Trigger application-level reconnections or reconfigurations.
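As an illustration of the notification case, here is a minimal sketch of a hook helper that posts to a Slack incoming webhook. It assumes you supply the webhook URL in a SLACK_WEBHOOK_URL variable (a hypothetical name), and the function names are illustrative. The JSON escaping only handles backslashes and double quotes, which is enough for hostnames and cluster names but not for arbitrary text.

```bash
# Escape backslashes and double quotes so the text is safe inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build the Slack payload: build_payload CLUSTER FAILED_HOST SUCCESSOR_HOST
build_payload() {
  local text="MySQL failover in cluster $1: $2 -> $3"
  printf '{"text": "%s"}' "$(json_escape "$text")"
}

# Post the payload to the webhook; returns curl's exit status.
notify_slack() {
  curl -sf -X POST -H 'Content-Type: application/json' \
    -d "$(build_payload "$1" "$2" "$3")" "$SLACK_WEBHOOK_URL"
}
```

Inside a hook, this could be called as `notify_slack "$ORC_FAILURE_CLUSTER" "$ORC_FAILED_HOST" "$ORC_SUCCESSOR_HOST"`, using the environment variables Orchestrator exports to hook processes.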

Here’s a sample post_failover_script.sh that could update a DigitalOcean Load Balancer:

#!/bin/bash

# Executed by Orchestrator after a successful primary failover.
# Arguments arrive only if the hook command passes them, e.g.
#   "/path/to/your/post_failover_script.sh {successorHost} {failureCluster}"
# ORC_SUCCESSOR_HOST and ORC_FAILURE_CLUSTER (exported by Orchestrator) are used as fallbacks.

NEW_PRIMARY=${1:-${ORC_SUCCESSOR_HOST:-}}
CLUSTER_NAME=${2:-${ORC_FAILURE_CLUSTER:-unknown}}
DO_TOKEN="YOUR_DIGITALOCEAN_API_TOKEN"
LB_ID="YOUR_LOAD_BALANCER_ID" # The ID of your DigitalOcean Load Balancer
API="https://api.digitalocean.com/v2"

if [ -z "$NEW_PRIMARY" ]; then
    echo "Error: new primary host was not provided" >&2
    exit 1
fi

# Resolve the new primary's IP address if a hostname was provided
if [[ "$NEW_PRIMARY" =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
    NEW_PRIMARY_IP_ADDRESS=$NEW_PRIMARY
else
    NEW_PRIMARY_IP_ADDRESS=$(dig +short "$NEW_PRIMARY" | head -n 1)
    if [ -z "$NEW_PRIMARY_IP_ADDRESS" ]; then
        echo "Error: Could not resolve IP address for $NEW_PRIMARY" >&2
        exit 1
    fi
fi

echo "Failover complete for cluster $CLUSTER_NAME. New primary is $NEW_PRIMARY ($NEW_PRIMARY_IP_ADDRESS)."

# DigitalOcean Load Balancers target Droplets (by ID or tag), not raw IPs,
# so map the new primary's IP address to its Droplet ID.
# The load balancer itself needs a TCP forwarding rule for port 3306.
NEW_DROPLET_ID=$(curl -sf "$API/droplets?per_page=200" \
  -H "Authorization: Bearer $DO_TOKEN" \
  | jq -r --arg ip "$NEW_PRIMARY_IP_ADDRESS" \
      '.droplets[] | select(any(.networks.v4[]; .ip_address == $ip)) | .id' \
  | head -n 1)
if [ -z "$NEW_DROPLET_ID" ]; then
    echo "Error: no Droplet found with IP $NEW_PRIMARY_IP_ADDRESS" >&2
    exit 1
fi

# Fetch the Droplets currently attached to the load balancer
CURRENT_IDS=$(curl -sf "$API/load_balancers/$LB_ID" \
  -H "Authorization: Bearer $DO_TOKEN" | jq -c '.load_balancer.droplet_ids')
if [ -z "$CURRENT_IDS" ] || [ "$CURRENT_IDS" == "null" ]; then
    echo "Error: could not read the configuration of load balancer $LB_ID" >&2
    exit 1
fi

# Everything attached other than the new primary is treated as a stale target
OLD_IDS=$(echo "$CURRENT_IDS" | jq -c --argjson new "$NEW_DROPLET_ID" 'map(select(. != $new))')

# If the new primary is already the only target, do nothing
if [ "$OLD_IDS" == "[]" ] && \
   echo "$CURRENT_IDS" | jq -e --argjson new "$NEW_DROPLET_ID" 'any(.[]; . == $new)' > /dev/null; then
    echo "Load balancer already points to the new primary. No update needed."
    exit 0
fi

echo "Updating Load Balancer $LB_ID to point to Droplet $NEW_DROPLET_ID ($NEW_PRIMARY_IP_ADDRESS)..."

# Attach the new primary first, then detach the stale targets
curl -sf -X POST "$API/load_balancers/$LB_ID/droplets" \
  -H "Authorization: Bearer $DO_TOKEN" -H "Content-Type: application/json" \
  -d "{\"droplet_ids\": [$NEW_DROPLET_ID]}" \
  || { echo "Error: failed to attach Droplet $NEW_DROPLET_ID" >&2; exit 1; }

if [ "$OLD_IDS" != "[]" ]; then
    curl -sf -X DELETE "$API/load_balancers/$LB_ID/droplets" \
      -H "Authorization: Bearer $DO_TOKEN" -H "Content-Type: application/json" \
      -d "{\"droplet_ids\": $OLD_IDS}" \
      || { echo "Error: failed to detach Droplets $OLD_IDS" >&2; exit 1; }
fi

echo "Load balancer $LB_ID now targets Droplet $NEW_DROPLET_ID."
exit 0

Ensure the script is executable (`chmod +x /path/to/your/post_failover_script.sh`) and that the user running Orchestrator has permissions to execute it and any commands it invokes (e.g., curl, dig, DigitalOcean API client). For DigitalOcean API interaction, you’ll need to generate an API token and ensure network access from the Orchestrator Droplet to the DigitalOcean API endpoints.
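Since the hook needs an API token, one option is to keep the token out of the script and read it from a file that only the Orchestrator service user can read. The sketch below assumes a hypothetical path, /etc/orchestrator-hooks/do_token, and a helper name of our choosing; adapt both to your environment.

```bash
# load_do_token [FILE]: read the DigitalOcean API token from a file into
# DO_TOKEN, stripping whitespace. Fails if the file is unreadable or empty.
load_do_token() {
  local file=${1:-/etc/orchestrator-hooks/do_token}   # hypothetical default path
  if [ ! -r "$file" ]; then
    echo "Token file not readable: $file" >&2
    return 1
  fi
  DO_TOKEN=$(tr -d '[:space:]' < "$file")
  if [ -z "$DO_TOKEN" ]; then
    echo "Token file is empty: $file" >&2
    return 1
  fi
  export DO_TOKEN
}
```

A hook would call `load_do_token || exit 1` near the top instead of assigning the token inline, and the file itself would be restricted with something like `chmod 600`.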

Shopify Deployment Considerations: High Availability and Failover

Shopify’s managed platform abstracts away much of the infrastructure complexity, including database failover. However, when building custom applications or integrations that interact with Shopify’s APIs or rely on external databases, ensuring high availability becomes your responsibility. This section focuses on strategies for external dependencies, particularly databases, that your Shopify integration might use.

External Database High Availability for Shopify Apps

Many Shopify applications require an external database to store application data, user preferences, or custom logic. If your Shopify app is deployed on DigitalOcean and uses MySQL, the Orchestrator setup described previously is directly applicable. The key is to ensure your Shopify app can seamlessly connect to the *current* primary database after a failover.

Challenge: Dynamic Primary IP/Hostname

When a MySQL failover occurs, the IP address or hostname of the primary database changes. Your Shopify application needs a mechanism to discover this new endpoint. Common solutions include:

DNS-based Failover: Maintain a DNS record (e.g., db.myapp.com) that always points to the current MySQL primary. Your Orchestrator’s failover script would update this DNS record (via DigitalOcean DNS API or a third-party DNS provider’s API) after a successful promotion. This is the most common and recommended approach.
Load Balancer Abstraction: If your application connects through a load balancer (like a DigitalOcean Load Balancer), the failover script updates the load balancer’s target to the new primary. The application connects to the stable IP/hostname of the load balancer.
Service Discovery: For more complex microservice architectures, a dedicated service discovery tool (like Consul or etcd) can be used. Orchestrator would update the service registry with the new primary’s address.
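For the service discovery option, a hook could publish the new primary's address to Consul's key/value store through its HTTP API. This is a minimal sketch under stated assumptions: the key name mysql/primary, the local agent address, and both function names are hypothetical, and a real deployment would likely need an ACL token as well.

```bash
# consul_kv_url CONSUL_ADDR KEY: build the KV endpoint URL for a key.
consul_kv_url() {
  printf '%s/v1/kv/%s' "${1%/}" "$2"
}

# publish_primary CONSUL_ADDR KEY PRIMARY_ADDRESS: store the address as the key's value.
publish_primary() {
  curl -sf -X PUT --data "$3" "$(consul_kv_url "$1" "$2")" > /dev/null
}
```

From a post-failover hook this might be invoked as `publish_primary http://127.0.0.1:8500 mysql/primary "$ORC_SUCCESSOR_HOST"`, with applications reading the same key to find the current primary.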

Implementing DNS Failover for Shopify Apps

Let’s refine the post_failover_script.sh to update DigitalOcean DNS records.

#!/bin/bash

# Executed by Orchestrator after a successful failover.
# Arguments arrive only if the hook command passes them, e.g.
#   "/path/to/your/post_failover_script.sh {successorHost} {failureCluster}"
# ORC_SUCCESSOR_HOST and ORC_FAILURE_CLUSTER (exported by Orchestrator) are used as fallbacks.

NEW_PRIMARY_HOSTNAME=${1:-${ORC_SUCCESSOR_HOST:-}} # The hostname Orchestrator knows the new primary by
CLUSTER_NAME=${2:-${ORC_FAILURE_CLUSTER:-unknown}}
DO_TOKEN="YOUR_DIGITALOCEAN_API_TOKEN"
DO_DOMAIN="myapp.com" # The domain managed in DigitalOcean DNS
DNS_RECORD_NAME="db.myapp.com" # The A record to update (fully qualified, must be within DO_DOMAIN)

if [ -z "$NEW_PRIMARY_HOSTNAME" ]; then
    echo "Error: new primary host was not provided" >&2
    exit 1
fi

# Resolve the new primary's IP address (skip the lookup if we were given an IP)
if [[ "$NEW_PRIMARY_HOSTNAME" =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
    NEW_PRIMARY_IP=$NEW_PRIMARY_HOSTNAME
else
    NEW_PRIMARY_IP=$(dig +short "$NEW_PRIMARY_HOSTNAME" | head -n 1)
fi
if [ -z "$NEW_PRIMARY_IP" ]; then
    echo "Error: Could not resolve IP address for $NEW_PRIMARY_HOSTNAME" >&2
    exit 1
fi

echo "Failover complete for cluster $CLUSTER_NAME. New primary is $NEW_PRIMARY_HOSTNAME ($NEW_PRIMARY_IP)."

# Get the current A record; the API's name filter expects the fully qualified name
RECORD_INFO=$(curl -s -X GET "https://api.digitalocean.com/v2/domains/$DO_DOMAIN/records?name=$DNS_RECORD_NAME&type=A" \
  -H "Authorization: Bearer $DO_TOKEN")

# "// empty" turns a missing field into an empty string instead of the literal "null"
RECORD_ID=$(echo "$RECORD_INFO" | jq -r '.domain_records[0].id // empty')
CURRENT_IP=$(echo "$RECORD_INFO" | jq -r '.domain_records[0].data // empty')

if [ -z "$RECORD_ID" ]; then
    echo "Error: DNS record '$DNS_RECORD_NAME' not found in domain '$DO_DOMAIN'." >&2
    exit 1
fi

echo "Current DNS record '$DNS_RECORD_NAME' points to $CURRENT_IP (ID: $RECORD_ID)."

# If the IP hasn't changed, no need to update
if [ "$NEW_PRIMARY_IP" == "$CURRENT_IP" ]; then
    echo "DNS record already points to the new primary. No update needed."
    exit 0
fi

echo "Updating DNS record '$DNS_RECORD_NAME' to point to $NEW_PRIMARY_IP..."

# Update the DNS record via DigitalOcean API
UPDATE_RESPONSE=$(curl -s -X PUT "https://api.digitalocean.com/v2/domains/$DO_DOMAIN/records/$RECORD_ID" \
  -H "Authorization: Bearer $DO_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"type\": \"A\", \"data\": \"$NEW_PRIMARY_IP\"}")

# A successful update returns a single "domain_record" object
UPDATED_IP=$(echo "$UPDATE_RESPONSE" | jq -r '.domain_record.data // empty')
if [ "$UPDATED_IP" == "$NEW_PRIMARY_IP" ]; then
    echo "Successfully updated DNS record. New IP: $UPDATED_IP"
    exit 0
else
    echo "Error updating DNS record:" >&2
    echo "$UPDATE_RESPONSE" >&2
    exit 1
fi

Ensure the script has execute permissions and that the Orchestrator Droplet has network access to the DigitalOcean API. The script depends on curl, on jq for parsing JSON responses from the API, and on dig for resolving hostnames. On Ubuntu or Debian, install them with sudo apt-get install -y curl jq dnsutils.
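Because a hook that fails halfway through a failover is hard to debug, it can help to verify these dependencies up front. The following is a small sketch of such a guard; the require_cmds name is our own.

```bash
# require_cmds CMD...: report every missing command on stderr and
# return non-zero if at least one is missing.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" > /dev/null 2>&1; then
      echo "Missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A hook script would then start with `require_cmds curl jq dig || exit 1`, so a missing tool is reported before any API call is attempted.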

Application Connection Logic

Your Shopify application’s database connection logic should be configured to use the DNS name (e.g., db.myapp.com) rather than a static IP address. Most database drivers and ORMs support this. For example, in PHP with PDO:

$dbHost = 'db.myapp.com'; // Use the DNS name
$dbName = 'your_app_database';
$dbUser = 'your_app_user';
$dbPass = 'your_app_password';
$dbPort = 3306;

try {
    $dsn = "mysql:host={$dbHost};port={$dbPort};dbname={$dbName};charset=utf8mb4";
    $options = [
        PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
        PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
        PDO::ATTR_EMULATE_PREPARES   => false,
    ];
    $pdo = new PDO($dsn, $dbUser, $dbPass, $options);
    // Connection successful
    echo "Connected to database successfully!";
} catch (\PDOException $e) {
    // Handle connection error - log it, potentially retry, or show an error page
    // For a Shopify app, you might want to display a user-friendly "Service temporarily unavailable" message.
    error_log("Database connection failed: " . $e->getMessage());
    // In a real Shopify app, you'd render a specific error view.
    // For example, if using a framework like Laravel:
    // return response()->view('errors.database_unavailable', [], 503);
    die("Service temporarily unavailable. Please try again later.");
}

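When a failover occurs and the DNS record is updated, new connections resolve to the new primary as soon as cached answers expire. The record's TTL (Time To Live) bounds that delay, so keep it low for this record (for example, 30 to 60 seconds). Connections opened before the failover, including persistent or pooled connections, stay bound to the old primary's IP until they are closed, so the application should drop and re-establish its connection when it hits a connection error rather than retrying on the same handle.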
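The same retry idea applies to operational scripts that run around a failover, such as deploy steps or health checks that must wait for the DNS change to take effect. Below is a minimal sketch of a retry helper with exponential backoff; the retry function is our own, not part of any tool mentioned here.

```bash
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling
# the delay after each failed attempt. Returns 1 if every attempt fails.
retry() {
  local max=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For instance, `retry 5 1 mysqladmin --host=db.myapp.com ping` waits 1, 2, 4, and 8 seconds between attempts before giving up, which comfortably covers a short TTL.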

Monitoring and Alerting

While Orchestrator handles the failover, robust monitoring and alerting are essential to know when it happens and to catch any issues it might miss. Integrate Orchestrator’s events with your existing monitoring stack (e.g., Prometheus, Grafana, Datadog).

Orchestrator's configuration file has no built-in Prometheus listener setting. What it does support is pushing its internal metrics to Graphite, which you can configure in orchestrator.conf.json:

{
  "GraphiteAddr": "your-graphite-host:2003",
  "GraphitePath": "orchestrator.{hostname}"
}

To get this data into Prometheus, bridge it through a Graphite-to-Prometheus exporter, or run a small exporter of your own that polls Orchestrator's HTTP API (for example /api/health, /api/problems, and /api/audit-recovery). You can then create Grafana dashboards to visualize:

Number of active clusters
Number of failed instances
Recent failover events
Replication lag across your topology

Set up alerts for critical events, such as:

Orchestrator service down
Failover initiated or completed
High replication lag detected
Instance unreachable for an extended period
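Two of these alerts can be sketched as a simple check suitable for cron or a monitoring agent. The threshold values, function names, and the way lag is obtained are all assumptions for illustration; the health check relies on Orchestrator's /api/health endpoint answering with a success status when the service is up.

```bash
# lag_status LAG_SECONDS WARN_SECONDS CRIT_SECONDS: classify replication lag.
# A missing or non-numeric lag (e.g. NULL) is treated as CRITICAL because it
# usually means replication is not running.
lag_status() {
  case "$1" in
    ''|*[!0-9]*) echo "CRITICAL"; return 0 ;;
  esac
  if [ "$1" -ge "$3" ]; then
    echo "CRITICAL"
  elif [ "$1" -ge "$2" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# orchestrator_healthy BASE_URL: succeeds if the health endpoint responds with success.
orchestrator_healthy() {
  curl -sf --max-time 5 "${1%/}/api/health" > /dev/null
}
```

A cron job could call `orchestrator_healthy http://localhost:3000 || send_alert "Orchestrator service down"`, where send_alert stands in for whatever alerting command you use.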

For Shopify applications, ensure your alerting also covers:

Application error rates (especially database connection errors)
Availability of your Shopify app’s endpoints

By combining Orchestrator’s automated failover capabilities with DNS management and comprehensive monitoring, you can build a resilient MySQL infrastructure that supports your critical Shopify integrations, minimizing downtime and ensuring a smooth experience for your customers.