Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Perl Deployments on Linode

Establishing a Robust MySQL Replication Topology for High Availability

Achieving automated failover for critical MySQL deployments necessitates a well-defined replication strategy. For this architecture, we’ll focus on a primary-replica setup with asynchronous replication, which offers a balance of performance and simplicity. The goal is to have a hot standby ready to take over with minimal data loss in the event of a primary failure. This involves configuring MySQL on both the primary and replica instances, ensuring proper network connectivity, and setting up the replication user.

On the primary MySQL server, ensure binary logging is enabled and configured correctly. This is the foundation of replication. The log_bin directive must be set, and a unique server-id is crucial for identifying the server within the replication topology. We also recommend setting binlog_format to ROW for maximum data integrity, although MIXED can be a viable alternative depending on specific workload characteristics.

Primary MySQL Server Configuration

Edit your MySQL configuration file (typically my.cnf or mysqld.cnf) on the primary node. The following settings are essential:

[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
# Optional: For improved durability, consider sync_binlog
# sync_binlog = 1
# Optional: For GTID-based replication (more complex but robust)
# gtid_mode = ON
# enforce_gtid_consistency = ON

After modifying the configuration, restart the MySQL service. You’ll then need to create a dedicated replication user. This user will be used by the replica to connect to the primary and fetch binary log events.

-- On the primary MySQL server
CREATE USER 'repl_user'@'%' IDENTIFIED BY 'your_strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'%';
FLUSH PRIVILEGES;

Next, obtain the current binary log file name and position. This information is critical for initializing the replica. If you’re using GTIDs, this step is slightly different, but for basic asynchronous replication, we’ll use SHOW MASTER STATUS.

-- On the primary MySQL server
SHOW MASTER STATUS;

Note down the File and Position values. You’ll also need to create a consistent snapshot of your primary database. This can be done using mysqldump or by taking a filesystem-level snapshot if your storage solution supports it. For this example, we’ll use mysqldump, ensuring it’s done while the primary is operational and the binary log position is captured.

# On the primary MySQL server (or a client machine with access)
mysqldump --all-databases --master-data=2 --single-transaction -u root -p > full_backup.sql
# The --master-data=2 option comments out the CHANGE MASTER TO command in the dump file.
# We will manually configure this on the replica.
# Alternatively, to get the exact binlog file and position at the time of dump:
# mysqldump --all-databases --single-transaction -u root -p > full_backup.sql
# Then, on the primary, run:
# SHOW MASTER STATUS;
# And manually record the File and Position.

Configuring the Replica MySQL Server

On the replica server, ensure MySQL is installed and running. Similar to the primary, a unique server-id is required. It must be different from the primary’s server-id. Binary logging can be enabled on the replica if it’s also intended to act as a primary for another replica, but it’s not strictly necessary for a simple primary-replica setup.

[mysqld]
server-id = 2
# Optional: If this replica will also serve read replicas
# log_bin = /var/log/mysql/mysql-bin.log
# binlog_format = ROW

Restart the MySQL service on the replica. Now, import the data from the primary’s backup. This will establish the initial dataset on the replica.

# On the replica MySQL server
mysql -u root -p < full_backup.sql

Once the data is imported, configure the replica to connect to the primary using the CHANGE MASTER TO command. This command tells the replica where to find the primary, which user to use for authentication, and where in the binary log stream to start replicating from. Use the File and Position obtained earlier from SHOW MASTER STATUS on the primary.

-- On the replica MySQL server
CHANGE MASTER TO
  MASTER_HOST='primary_ip_address',
  MASTER_USER='repl_user',
  MASTER_PASSWORD='your_strong_password',
  MASTER_LOG_FILE='mysql-bin.XXXXXX', -- Replace with the File from SHOW MASTER STATUS
  MASTER_LOG_POS=YYYYYY; -- Replace with the Position from SHOW MASTER STATUS

Finally, start the replication process and check its status.

-- On the replica MySQL server
START SLAVE;
SHOW SLAVE STATUS\G

You should see Slave_IO_Running: Yes and Slave_SQL_Running: Yes. If there are any errors, the Last_IO_Errno, Last_IO_Error, Last_SQL_Errno, and Last_SQL_Error fields will provide clues. Common issues include network connectivity problems, incorrect credentials, or a mismatch in binary log positions.

Implementing a Health Check and Failover Mechanism

Automated failover requires a mechanism to detect primary failure and initiate the switch. We’ll use a combination of a simple health check script and a tool like orchestrator or a custom solution leveraging Linode’s API and a distributed consensus system (like etcd or Consul) for more advanced scenarios. For this example, we’ll outline a basic script-based approach that can be triggered by a monitoring system.

The health check script should periodically attempt to connect to the primary MySQL instance and execute a simple query. If the connection fails or the query times out, it indicates a potential primary failure. The script should then trigger the failover procedure.

Perl Script for Health Check and Failover Trigger

This Perl script connects to the primary MySQL server. If it fails to connect or execute a simple query, it will log an error and attempt to initiate a failover by signaling an external process (e.g., sending a webhook, updating a status file, or calling an API). For simplicity, this example just logs the event and exits with a non-zero status code, which a monitoring system like Nagios or Prometheus Alertmanager can interpret.

#!/usr/bin/perl

use strict;
use warnings;
use DBI;

my $db_host = 'primary_ip_address';
my $db_user = 'health_check_user'; # A read-only user is sufficient
my $db_pass = 'your_health_check_password';
my $db_name = 'information_schema'; # A small, always available database

my $dbh;
my $dsn = "DBI:mysql:database=$db_name;host=$db_host";

eval {
    $dbh = DBI->connect($dsn, $db_user, $db_pass, { RaiseError => 1, AutoCommit => 1, mysql_connect_timeout => 5 });
    my $sth = $dbh->prepare("SELECT 1");
    $sth->execute();
    $sth->fetchrow_array();
    $sth->finish();
    $dbh->disconnect();
};

if ($@) {
    # Connection or query failed - potential primary failure
    my $error = $@;
    chomp $error;
    print STDERR "MySQL health check failed: $error\n";

    # --- FAILOVER TRIGGER LOGIC ---
    # In a real-world scenario, this is where you'd initiate failover.
    # This could involve:
    # 1. Calling a Linode API to reassign a floating IP.
    # 2. Updating a configuration file that your application reads.
    # 3. Sending an alert to a dedicated failover manager.
    # 4. Triggering a script that promotes the replica.
    #
    # Example: Triggering a separate failover script
    # system("/usr/local/bin/trigger_mysql_failover.sh");

    exit 1; # Indicate failure
} else {
    print "MySQL health check successful.\n";
    exit 0; # Indicate success
}

Create a read-only user for this health check on the primary MySQL server:

-- On the primary MySQL server
CREATE USER 'health_check_user'@'%' IDENTIFIED BY 'your_health_check_password';
GRANT SELECT ON information_schema.* TO 'health_check_user'@'%';
FLUSH PRIVILEGES;

Automating Replica Promotion and Application Reconfiguration

When the health check script detects a failure, it needs to trigger a process that promotes the replica to become the new primary. This involves several steps:

Stop Replication: The replica must stop replicating from the old primary to prevent it from receiving further writes that could cause conflicts or data loss.
Promote Replica: The replica needs to be designated as the new primary. This might involve changing its server-id (if using GTIDs and strict consistency) or simply marking it as the active primary in a configuration management system.
Update Application Configuration: The most critical step is redirecting application traffic to the new primary. This is often managed by a load balancer, a floating IP address, or by updating application configuration files.

A robust failover script would perform these actions. For a Linode deployment, this often involves using the Linode API to reassign a Floating IP address to the replica server. The application itself would then connect to this Floating IP.

Example Failover Script (Conceptual)

This is a conceptual outline. Actual implementation requires careful handling of API credentials, error checking, and idempotency. We’ll use linode-cli for managing the Floating IP.

#!/bin/bash

# Configuration
OLD_PRIMARY_IP="primary_ip_address"
NEW_PRIMARY_IP="replica_ip_address"
FLOATING_IP="your_floating_ip_address"
LINODE_REGION="your_linode_region" # e.g., us-east
REPLICA_NODE_ID="your_replica_linode_id" # The Linode ID of the replica server

# --- Step 1: Stop Replication on Replica (if it's running) ---
echo "Stopping replication on replica..."
mysql -h $NEW_PRIMARY_IP -u root -p'your_root_password' -e "STOP SLAVE;"
if [ $? -ne 0 ]; then
    echo "Error stopping slave. Proceeding with caution."
fi

# --- Step 2: Promote Replica (conceptually) ---
# In a simple setup, stopping replication and reassigning IP is enough.
# For more complex setups, you might need to run 'RESET SLAVE ALL;'
# or promote it in a cluster manager.
echo "Promoting replica (conceptually)..."
# If using GTIDs and need to ensure consistency, this step is more involved.
# For basic async, this is often sufficient.

# --- Step 3: Reassign Floating IP ---
echo "Reassigning Floating IP $FLOATING_IP to $REPLICA_NODE_ID..."
# First, remove the IP from the old primary if it was associated
# This step is crucial if the old primary is still reachable and has the IP.
# You might need to query existing IPs associated with the old primary's Linode ID.
# For simplicity, we assume the IP is not actively assigned to the old primary's Linode ID anymore
# or that the old primary is truly down.

# Assign the Floating IP to the new primary's Linode ID
linode-cli floating-ips assign $FLOATING_IP --node $REPLICA_NODE_ID --region $LINODE_REGION
if [ $? -ne 0 ]; then
    echo "FATAL: Failed to assign Floating IP $FLOATING_IP to node $REPLICA_NODE_ID."
    exit 1
fi
echo "Floating IP $FLOATING_IP successfully assigned to node $REPLICA_NODE_ID."

# --- Step 4: Update Application Configuration (if not using Floating IP) ---
# If your application directly uses IP addresses, you'd update config files here.
# For example, using sed or a configuration management tool.
# echo "Updating application configuration..."
# sed -i "s/$OLD_PRIMARY_IP/$NEW_PRIMARY_IP/g" /path/to/app/config.ini

# --- Step 5: Verify Application Connectivity ---
# Add a loop here to check if applications can connect to the new primary.
echo "Failover process initiated. Please verify application connectivity."

exit 0

This script assumes you have linode-cli installed and configured with API credentials. The monitoring system (e.g., Nagios, Prometheus) would execute the health check script. If it returns a non-zero exit code, the monitoring system can be configured to execute this failover script.

Integrating with Perl Applications

Your Perl applications need to be resilient to database connection changes. The most straightforward approach is to configure your applications to connect to the Floating IP address. If the Floating IP is reassigned, the application will automatically connect to the new primary without code changes. If your application configuration is static, you’ll need a mechanism to update it. This could involve:

Using a configuration management tool (Ansible, Chef, Puppet) to push updated configuration files.
Implementing a dynamic configuration loader that periodically checks a central configuration store (e.g., Consul, etcd, or a simple file on a shared volume).
Having the failover script itself update the application’s configuration file and signal the application to reload its configuration.

For applications that directly embed database connection strings, a more involved approach might be necessary. Consider using a connection pooling library that supports dynamic endpoint updates or a proxy like ProxySQL or MaxScale, which can manage connections and route traffic based on the health of backend servers.

Testing and Validation

Thorough testing is paramount. Regularly simulate primary failures:

Graceful Shutdown: Stop the MySQL service on the primary. Observe if the failover script triggers and the Floating IP is reassigned. Verify application connectivity.
Hard Failure: Simulate a Linode instance crash (e.g., by terminating the instance). This tests the resilience of your monitoring and failover process to unexpected outages.
Network Partition: Block network access from the replica to the primary. This tests how your health checks behave under network isolation.
Data Integrity Checks: After a failover, perform checks to ensure no critical data was lost. Compare row counts or checksums for key tables between the old primary (if recoverable) and the new primary.

Document your failover procedures and conduct regular drills. Ensure your team is familiar with the process and can manually intervene if automation fails.