Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Laravel Deployments on Linode

Establishing a High-Availability MySQL Cluster with Replication

For any critical application, a single MySQL instance is a single point of failure. To achieve high availability and enable automated failover, we must architect a robust MySQL cluster. The cornerstone of this architecture is asynchronous or semi-synchronous replication. For this guide, we’ll focus on setting up a primary-replica topology, which is the most common and straightforward for automated failover scenarios. We’ll use two Linode instances for this setup: one for the primary MySQL server and another for the replica.

First, ensure MySQL is installed and running on both Linode instances. We’ll configure the primary server to enable binary logging, which is essential for replication.

Primary MySQL Server Configuration

Edit the MySQL configuration file, typically located at /etc/mysql/mysql.conf.d/mysqld.cnf or /etc/my.cnf. Add or modify the following directives under the [mysqld] section:

[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = OFF
bind-address = 0.0.0.0

After saving the configuration, restart the MySQL service:

sudo systemctl restart mysql

Next, create a dedicated replication user on the primary server. This user will be used by the replica to connect and fetch binary logs.

-- On the primary MySQL server:
CREATE USER 'replicator'@'%' IDENTIFIED BY 'your_strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';
FLUSH PRIVILEGES;

Finally, obtain the current binary log file name and position. This information is crucial for initializing the replica. You can get this by executing:

SHOW MASTER STATUS;

Note down the File and Position values. If you enabled GTID, you can also use SHOW MASTER STATUS; to get the GTID set. For GTID-based replication, the File and Position are less critical for initial setup but still useful for diagnostics.

Replica MySQL Server Configuration

On the replica server, edit its MySQL configuration file (e.g., /etc/mysql/mysql.conf.d/mysqld.cnf) and add/modify these directives:

[mysqld]
server-id = 2
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = ON
bind-address = 0.0.0.0

Restart the MySQL service on the replica:

sudo systemctl restart mysql

Now, configure the replica to connect to the primary. You’ll need the primary’s IP address, the replication user’s credentials, and the binary log file/position obtained earlier. If using GTID, the MASTER_AUTO_POSITION=1 is preferred.

-- On the replica MySQL server:
CHANGE MASTER TO
  MASTER_HOST='',
  MASTER_USER='replicator',
  MASTER_PASSWORD='your_strong_password',
  MASTER_PORT=3306,
  MASTER_AUTO_POSITION=1; -- Use this if GTID is enabled on primary
-- If not using GTID, use:
-- MASTER_LOG_FILE='mysql-bin.XXXXXX',
-- MASTER_LOG_POS=YYYYYYY;

START SLAVE;

SHOW SLAVE STATUS\G

The SHOW SLAVE STATUS\G command is critical. Look for Slave_IO_Running: Yes and Slave_SQL_Running: Yes. If you see errors, troubleshoot them by examining the MySQL error logs on both servers and checking network connectivity and firewall rules.

Implementing Automated Failover with Orchestrator

While MySQL replication ensures data consistency, it doesn’t automatically handle failover. For automated failover, we’ll leverage Orchestrator, a popular open-source tool designed for MySQL topology management and high availability. Orchestrator can detect failures and promote a replica to become the new primary.

Orchestrator Installation and Configuration

Orchestrator can be installed on a separate, dedicated server or on one of the MySQL nodes (though a separate node is recommended for true resilience). For simplicity, we’ll assume a separate Linode instance for Orchestrator.

Download the latest Orchestrator binary from its GitHub releases page. For example, on a Debian/Ubuntu system:

wget https://github.com/openark/orchestrator/releases/download/v3.2.7/orchestrator-3.2.7-linux-amd64.tar.gz
tar xvfz orchestrator-3.2.7-linux-amd64.tar.gz
sudo mv orchestrator-3.2.7-linux-amd64 /usr/local/orchestrator
sudo ln -s /usr/local/orchestrator/orchestrator /usr/local/bin/orchestrator

Orchestrator requires a backend database to store its state. We’ll use a small, dedicated MySQL instance for this. You can set this up on the same Orchestrator server or a separate one. For this example, we’ll assume a local MySQL instance for Orchestrator’s backend.

# On the Orchestrator server, install MySQL client and server if not present
sudo apt update
sudo apt install mysql-server mysql-client -y

# Secure MySQL installation (optional but recommended)
sudo mysql_secure_installation

# Create a database and user for Orchestrator
sudo mysql -u root -p
CREATE DATABASE orchestrator;
CREATE USER 'orchestrator'@'localhost' IDENTIFIED BY 'orchestrator_db_password';
GRANT ALL PRIVILEGES ON orchestrator.* TO 'orchestrator'@'localhost';
FLUSH PRIVILEGES;
EXIT;

Create Orchestrator’s configuration file, typically /etc/orchestrator.conf.json:

{
  "Debug": true,
  "ListenAddress": ":3000",
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "your_mysql_topology_password",
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "Flavor": "MySQL",
  "DiscoveryPeriodSeconds": 10,
  "PromotionUser": "orchestrator_promote",
  "PromotionPassword": "your_promotion_password",
  "RecoveryPeriodBlockSeconds": 3600,
  "FailureDetectionPeriodBlockSeconds": 300,
  "SlaveLagQuery": "SELECT * FROM mysql.slave_master_info WHERE Master_Host IS NOT NULL AND Slave_SQL_Running = 'Yes' AND Slave_IO_Running = 'Yes' AND Seconds_Behind_Master > {lag_minutes}",
  "SlaveLagQueryMinutes": 60,
  "DetectClusterAliases": true,
  "SnapshotPeriodSeconds": 3600,
  "SnapshotTopologies": true,
  "PreemptivePaging": true,
  "HTTPProxyórios": "localhost:8080"
}

Important Notes on Configuration:

MySQLTopologyUser and MySQLTopologyPassword: This user needs read access to MySQL’s information_schema and performance_schema, and the ability to execute SHOW SLAVE STATUS. Create this user on your MySQL servers.
PromotionUser and PromotionPassword: This user needs privileges to promote a replica (e.g., REPLICATION SLAVE, RELOAD, SUPER, PROCESS). This user will be used by Orchestrator to execute commands like STOP SLAVE, RESET SLAVE ALL, and CHANGE MASTER TO on the promoted replica and other nodes.
SlaveLagQueryMinutes: Defines the acceptable lag for a replica. If a replica lags beyond this, it might not be considered for promotion.
ListenAddress: The address Orchestrator will listen on for its API and web UI.

Create the necessary MySQL users for Orchestrator on both your primary and replica MySQL servers:

-- On Primary and Replica MySQL servers:
CREATE USER 'orchestrator'@'%' IDENTIFIED BY 'your_mysql_topology_password';
GRANT SELECT, PROCESS, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'orchestrator'@'%';
FLUSH PRIVILEGES;

CREATE USER 'orchestrator_promote'@'%' IDENTIFIED BY 'your_promotion_password';
GRANT REPLICATION SLAVE, RELOAD, SUPER, PROCESS ON *.* TO 'orchestrator_promote'@'%';
FLUSH PRIVILEGES;

Now, start Orchestrator. It’s best to run it as a systemd service for production environments.

# Create a systemd service file for Orchestrator
sudo nano /etc/systemd/system/orchestrator.service

[Unit]
Description=Orchestrator MySQL HA
After=network.target

[Service]
User=root
Group=root
ExecStart=/usr/local/bin/orchestrator -config=/etc/orchestrator.conf.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Reload systemd, enable and start Orchestrator
sudo systemctl daemon-reload
sudo systemctl enable orchestrator
sudo systemctl start orchestrator
sudo systemctl status orchestrator

Once Orchestrator is running, access its web UI (default: http://:3000). Orchestrator will automatically discover your MySQL topology. You should see your primary and replica listed. If not, you might need to manually add your primary instance via the UI or API.

Configuring Orchestrator for Auto-Failover

Orchestrator’s auto-failover is configured via its API. You can use curl or Orchestrator’s web UI to enable it. The key setting is auto-failover-topology-groups.

First, ensure Orchestrator has discovered your cluster. If not, add your primary instance:

curl -X POST "http://localhost:3000/api/discover-instance?instance=:3306"

To enable auto-failover for a specific cluster (or all clusters if you have only one), you can use the API. Orchestrator groups instances into “topology groups”. By default, each discovered cluster forms a group. You can explicitly define a group:

# Example: Enable auto-failover for a specific cluster (replace with your cluster alias or primary host)
# You can find the cluster alias in the Orchestrator UI.
curl -X POST \
  -d '{"TopologyGroup": "my_mysql_cluster", "DiscoverByClusterAlias": true}' \
  http://localhost:3000/api/auto-failover-topology-groups

Or, to enable it for all discovered clusters:

curl -X POST \
  -d '{"TopologyGroup": "default", "DiscoverByClusterAlias": false, "DiscoverByHost": true}' \
  http://localhost:3000/api/auto-failover-topology-groups

Orchestrator will now monitor your MySQL topology. If it detects the primary is down (e.g., it cannot connect, or SHOW SLAVE STATUS on replicas indicates a broken connection), it will attempt to promote the most suitable replica to become the new primary. This process involves:

Identifying a healthy replica with minimal lag.
Stopping replication on that replica.
Resetting its replication configuration.
Making it writable (read_only = OFF).
Updating other replicas to replicate from the newly promoted primary.
Updating DNS or application configurations (this part requires integration).

Integrating with Laravel Applications

Your Laravel application needs to be aware of the primary MySQL instance. The most robust way to handle this is by using a Virtual IP (VIP) or a DNS-based approach that can be updated upon failover.

Option 1: DNS-Based Failover

This is often the simplest to implement initially. You’ll have a DNS record (e.g., db.yourdomain.com) pointing to the IP address of your current primary MySQL server. When Orchestrator promotes a new primary, it needs to trigger an update to this DNS record.

How to automate DNS updates:

Cloud Provider APIs: Most cloud providers (including Linode) offer APIs to manage DNS records. You can write a script that Orchestrator calls upon successful promotion to update the DNS record.
External DNS Services: Services like Cloudflare or Route 53 also have APIs.

Orchestrator has a hook system that allows you to execute custom scripts after certain events, including successful promotions. You can configure this in orchestrator.conf.json:

{
  // ... other configurations
  "PostFailoverProcesses": [
    "/path/to/your/dns_update_script.sh"
  ]
}

The dns_update_script.sh would look something like this (example using Linode API):

#!/bin/bash

# This script is executed by Orchestrator after a successful failover.
# It receives information about the failover event via environment variables.

# Example: Update Linode DNS record
# Ensure you have a Linode API token with DNS management permissions.
# Store your token securely, e.g., in a separate config file or environment variable.

LINODE_API_TOKEN="YOUR_LINODE_API_TOKEN"
RECORD_ID="YOUR_DNS_RECORD_ID" # The ID of the 'db.yourdomain.com' record
NEW_IP="$ORCHESTRATOR_TO_INSTANCE_IP" # Orchestrator provides the new primary IP

# Check if NEW_IP is set
if [ -z "$NEW_IP" ]; then
  echo "Error: NEW_IP not set. Orchestrator did not provide the new primary IP."
  exit 1
fi

echo "Updating DNS record ID $RECORD_ID to IP $NEW_IP..."

curl -X PUT "https://api.linode.com/v4/domains/records/$RECORD_ID" \
  -H "Authorization: Bearer $LINODE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "data": "'"$NEW_IP"'",
    "name": "db",
    "port": 3306,
    "priority": 10,
    "record_type": "A",
    "target": "",
    "ttl_sec": 300
  }'

if [ $? -eq 0 ]; then
  echo "DNS record updated successfully."
else
  echo "Error updating DNS record."
  exit 1
fi

exit 0

You’ll need to obtain the RECORD_ID from your Linode DNS manager. The script should be executable (chmod +x /path/to/your/dns_update_script.sh).

Option 2: Virtual IP (VIP) Failover

A Virtual IP is an IP address that is not tied to a specific physical machine but rather to a cluster. Tools like Keepalived or Pacemaker can manage VIPs. The VIP would be configured on your application servers to point to the current primary MySQL instance. When a failover occurs, the VIP is moved to the new primary.

This approach is more complex to set up but can offer faster failover times as applications don’t need to wait for DNS propagation. Orchestrator can be configured to trigger Keepalived/Pacemaker scripts to move the VIP.

In your Laravel application’s config/database.php, you would configure the database connection to use the VIP address:

return [
    // ...
    'connections' => [
        'mysql' => [
            'driver' => 'mysql',
            'host' => env('DB_HOST', '192.168.1.100'), // This would be your VIP
            'port' => env('DB_PORT', 3306),
            'database' => env('DB_DATABASE', 'forge'),
            'username' => env('DB_USERNAME', 'forge'),
            'password' => env('DB_PASSWORD', ''),
            // ...
        ],
    ],
    // ...
];

The DB_HOST would be the VIP. Orchestrator’s PostFailoverProcesses hook would then execute a script that instructs Keepalived/Pacemaker to move the VIP.

Laravel Application Code Considerations

Even with automated failover, your application might encounter brief connection errors during the transition. Implement robust error handling and retry mechanisms in your Laravel application.

try {
    // Your database operations
    DB::table('users')->get();
} catch (\Illuminate\Database\QueryException $e) {
    // Check for specific MySQL connection errors if possible
    // For example, error code 2002 (Can't connect to local MySQL server)
    // or 2003 (Can't connect to MySQL server on 'host')
    if ($e->getCode() == 2002 || $e->getCode() == 2003) {
        // Log the error, potentially inform the user of a temporary outage
        // and implement a short delay before retrying or redirecting.
        // A simple retry might be:
        // sleep(5);
        // return redirect()->back()->withInput(); // Or re-run the operation
    }
    // Handle other exceptions
    throw $e;
}

Consider using Laravel’s queue system for background jobs that interact with the database. This can help absorb temporary database unavailability.

Monitoring and Testing

A high-availability setup is only as good as its monitoring and testing. Regularly test your failover process to ensure it works as expected and to identify any weaknesses.

Key monitoring points:

Orchestrator Health: Monitor the Orchestrator service itself. Ensure it’s running and can connect to its backend database and MySQL instances.
MySQL Replication Status: Use Orchestrator’s UI or direct SQL queries (SHOW SLAVE STATUS\G) to verify replication is healthy.
Application Connectivity: Monitor your Laravel application’s ability to connect to the database.
DNS/VIP Status: Ensure your DNS records or VIPs are correctly pointing to the active primary.

Testing Failover:

Simulate Primary Failure: Gracefully stop the MySQL service on the primary node (sudo systemctl stop mysql). Observe Orchestrator’s dashboard to see it detect the failure and initiate a promotion. Verify the application can still connect after the failover.
Simulate Replica Failure: Stop a replica and observe how Orchestrator handles it.
Network Partitioning: Test scenarios where network issues might isolate the primary.

By implementing a robust MySQL replication strategy, leveraging Orchestrator for automated failover, and integrating with your Laravel application’s configuration and error handling, you can significantly improve the resilience and availability of your critical data infrastructure.