Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Perl Deployments on AWS

Automated MySQL Failover with AWS RDS and Perl Health Checks

Achieving true high availability for critical database services like MySQL necessitates an automated failover strategy. Relying on manual intervention during an outage is a recipe for extended downtime and significant business impact. This guide details an architecture for automated MySQL failover on AWS, leveraging Amazon RDS Multi-AZ deployments and a custom Perl-based health check and failover orchestration system.

Leveraging Amazon RDS Multi-AZ for High Availability

Amazon Relational Database Service (RDS) provides a managed solution that significantly simplifies database administration, including high availability. For MySQL, the Multi-AZ deployment option is the foundational element for automatic failover. When enabled, RDS provisions and maintains a synchronous standby replica of your database instance in a different Availability Zone (AZ). In the event of a primary instance failure (e.g., instance hardware failure, AZ disruption, network issues), RDS automatically initiates a failover to the standby replica. This process is transparent to your application, as RDS updates the DNS CNAME record to point to the standby instance. The failover typically completes within 60-120 seconds.

While RDS Multi-AZ handles the infrastructure-level failover, it’s crucial to understand its limitations and how to integrate it with application-level logic. RDS failover is reactive; it happens *after* a failure is detected. For applications that require near-zero downtime or more granular control over the failover process, a complementary application-level health check and orchestration layer is beneficial.

Perl-Based Health Check and Orchestration Daemon

We’ll implement a lightweight daemon written in Perl to continuously monitor the health of the primary MySQL instance. This daemon will perform more sophisticated checks than the basic RDS health metrics, and if it detects a persistent issue, it can trigger notifications or, in more advanced scenarios, initiate application-level actions. For this example, we’ll focus on robust health checking and alerting.

Prerequisites

An AWS account with an RDS MySQL Multi-AZ instance configured.
AWS CLI installed and configured with appropriate IAM permissions to describe RDS instances.
Perl installed on an EC2 instance (or other compute resource) that has network access to your RDS endpoint.
The DBI and DBD::mysql Perl modules installed.

Health Check Script (`mysql_health_check.pl`)

This script connects to the primary RDS instance, executes a simple query, and checks for specific error conditions. It also queries AWS for RDS health status to correlate findings.

#!/usr/bin/perl

use strict;
use warnings;
use DBI;
use Time::HiRes qw(sleep);
use JSON;
use LWP::UserAgent;
use HTTP::Request;

# --- Configuration ---
my $DB_HOST = 'your-rds-instance.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com'; # Replace with your RDS endpoint
my $DB_PORT = 3306;
my $DB_NAME = 'your_database_name'; # Replace with your database name
my $DB_USER = 'your_db_user';       # Replace with your database user
my $DB_PASS = 'your_db_password';   # Replace with your database password (consider using Secrets Manager)

my $CHECK_INTERVAL_SECONDS = 30;
my $MAX_FAILURES_BEFORE_ALERT = 3; # Number of consecutive failures to trigger an alert

my $AWS_REGION = 'us-east-1'; # Replace with your AWS region
my $RDS_INSTANCE_IDENTIFIER = 'your-rds-instance-identifier'; # Replace with your RDS instance identifier

my $ALERT_SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:your-sns-topic'; # Replace with your SNS topic ARN

# --- Global State ---
my $consecutive_failures = 0;
my $last_alert_time = 0;
my $ALERT_COOLDOWN_SECONDS = 300; # 5 minutes cooldown between alerts

# --- Helper Functions ---

sub log_message {
    my ($level, $message) = @_;
    my $timestamp = localtime();
    print "[$timestamp] [$level] $message\n";
}

sub send_sns_alert {
    my ($subject, $message) = @_;
    log_message('INFO', "Sending SNS alert: $subject");

    my $sns_command = "aws sns publish --topic-arn '$ALERT_SNS_TOPIC_ARN' --subject '$subject' --message '$message'";
    my $result = `$sns_command 2>&1`;

    if ($? != 0) {
        log_message('ERROR', "Failed to send SNS alert: $result");
    } else {
        log_message('INFO', "SNS alert sent successfully.");
    }
}

sub get_rds_health_status {
    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);

    my $url = "http://169.254.169.254/latest/meta-data/iam/security-credentials/YOUR_IAM_ROLE_NAME"; # Replace with your EC2 instance profile role name
    my $req = HTTP::Request->new(GET => $url);
    my $res = $ua->request($req);

    if ($res->is_success) {
        my $credentials_json = $res->decoded_content;
        my $credentials = JSON->new->decode($credentials_json);
        my $access_key_id = $credentials->{AccessKeyId};
        my $secret_access_key = $credentials->{SecretAccessKey};
        my $session_token = $credentials->{SessionToken};

        # Constructing the AWS CLI command for RDS describe-db-instances
        # This is a simplified approach. For production, consider using AWS SDKs.
        my $aws_cli_command = "aws rds describe-db-instances --db-instance-identifier '$RDS_INSTANCE_IDENTIFIER' --region '$AWS_REGION' --output json";

        # In a real-world scenario, you'd sign this request properly.
        # For simplicity here, we assume the EC2 instance has an IAM role with permissions.
        # The AWS CLI will use the instance profile credentials automatically.
        my $rds_status_json = `AWS_ACCESS_KEY_ID=$access_key_id AWS_SECRET_ACCESS_KEY=$secret_access_key AWS_SESSION_TOKEN=$session_token $aws_cli_command 2>&1`;

        if ($? != 0) {
            log_message('ERROR', "AWS CLI error describing RDS instance: $rds_status_json");
            return undef;
        }

        my $rds_data = JSON->new->decode($rds_status_json);
        if (exists $rds_data->{DBInstances} && @{$rds_data->{DBInstances}} > 0) {
            return $rds_data->{DBInstances}[0];
        } else {
            log_message('ERROR', "No DBInstances found in AWS CLI response.");
            return undef;
        }
    } else {
        log_message('ERROR', "Failed to retrieve EC2 instance credentials: " . $res->status_line);
        return undef;
    }
}


sub check_database_health {
    my $dsn = "DBI:mysql:database=$DB_NAME;host=$DB_HOST;port=$DB_PORT";
    my $dbh;
    eval {
        $dbh = DBI->connect($dsn, $DB_USER, $DB_PASS, { RaiseError => 1, PrintError => 0, AutoCommit => 1 });
        # Execute a simple query to check connectivity and basic functionality
        my $sth = $dbh->prepare("SELECT 1");
        $sth->execute();
        my $result = $sth->fetchrow_array();
        $sth->finish();
        $dbh->disconnect();
        return ($result == 1);
    };
    if ($@) {
        log_message('ERROR', "Database connection/query failed: $@");
        return 0; # Failure
    }
    return 1; # Success
}

# --- Main Loop ---
log_message('INFO', "Starting MySQL health check daemon...");

while (1) {
    my $is_db_healthy = check_database_health();
    my $rds_instance_info = get_rds_health_status();

    my $is_rds_primary = 0;
    my $rds_instance_status = 'unknown';

    if ($rds_instance_info) {
        $rds_instance_status = $rds_instance_info->{DBInstanceStatus};
        if ($rds_instance_status eq 'available' && $rds_instance_info->{Endpoint} eq $DB_HOST) {
            $is_rds_primary = 1;
        }
    } else {
        log_message('WARN', "Could not retrieve RDS instance status for $RDS_INSTANCE_IDENTIFIER.");
    }

    my $current_time = time;

    if ($is_db_healthy && $is_rds_primary && $rds_instance_status eq 'available') {
        log_message('INFO', "Database is healthy. RDS instance status: $rds_instance_status.");
        $consecutive_failures = 0; # Reset failures on success
    } else {
        $consecutive_failures++;
        my $alert_message = "MySQL health check failed. Consecutive failures: $consecutive_failures.\n";
        $alert_message .= "DB Check Status: " . ($is_db_healthy ? "Healthy" : "Unhealthy") . "\n";
        $alert_message .= "RDS Instance Status: $rds_instance_status\n";
        $alert_message .= "RDS Endpoint: " . ($rds_instance_info ? $rds_instance_info->{Endpoint} : "N/A") . "\n";
        $alert_message .= "Expected Primary Endpoint: $DB_HOST\n";

        log_message('WARN', $alert_message);

        if ($consecutive_failures >= $MAX_FAILURES_BEFORE_ALERT) {
            if ($current_time - $last_alert_time > $ALERT_COOLDOWN_SECONDS) {
                my $alert_subject = "CRITICAL: MySQL Health Check Failure on $RDS_INSTANCE_IDENTIFIER";
                send_sns_alert($alert_subject, $alert_message);
                $last_alert_time = $current_time;
            } else {
                log_message('INFO', "Alert cooldown active. Skipping SNS notification.");
            }
        }
    }

    sleep($CHECK_INTERVAL_SECONDS);
}

Explanation of the Perl Script

Configuration: Defines database credentials, host, port, interval, and AWS parameters. Crucially, avoid hardcoding sensitive credentials. Use AWS Secrets Manager or IAM roles with appropriate permissions for EC2 instances running this script.
log_message: A simple utility for timestamped logging.
send_sns_alert: Publishes alerts to an AWS SNS topic using the AWS CLI. This is a basic implementation; for production, consider using the AWS SDK for Perl.
get_rds_health_status: This function attempts to retrieve the RDS instance’s status from the AWS API. It uses the EC2 instance’s IAM role to authenticate. This is a simplified example; proper AWS SDK integration would be more robust. It checks if the instance is ‘available’ and if its endpoint matches the expected primary endpoint.
check_database_health: Connects to the MySQL database using DBI and executes a simple `SELECT 1` query. It returns 1 for success and 0 for failure. Error handling is done via eval.
Main Loop: Continuously calls check_database_health and get_rds_health_status. It tracks consecutive failures. If failures exceed MAX_FAILURES_BEFORE_ALERT and the alert cooldown period has passed, it sends an SNS notification.

Deployment and Orchestration

The Perl script should be deployed on an EC2 instance that has network connectivity to your RDS instance. It’s recommended to run this script as a systemd service to ensure it restarts automatically if the EC2 instance reboots or the script crashes.

EC2 Instance Setup

Launch an EC2 instance in the same VPC as your RDS instance (or a peered VPC) to minimize latency and ensure network access.
Assign an IAM role to the EC2 instance with permissions to:
- rds:DescribeDBInstances
- sns:Publish

Install Perl and the necessary modules:

sudo yum update -y
sudo yum install perl-DBI perl-DBD-MySQL -y
# Or for Ubuntu/Debian:
# sudo apt update -y
# sudo apt install perl libdbi-perl libdbd-mysql-perl -y

Install the AWS CLI:

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

Save the Perl script (e.g., as /opt/scripts/mysql_health_check.pl).
Make the script executable: chmod +x /opt/scripts/mysql_health_check.pl.

Systemd Service Configuration

Create a systemd service file (e.g., /etc/systemd/system/mysql-health-check.service) to manage the script.

[Unit]
Description=MySQL Health Check Daemon
After=network.target

[Service]
User=ec2-user # Or the user you want to run the script as
Group=ec2-user # Or the group
WorkingDirectory=/opt/scripts/
ExecStart=/usr/bin/perl /opt/scripts/mysql_health_check.pl
Restart=always
RestartSec=10
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=mysql-health-check

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable mysql-health-check.service
sudo systemctl start mysql-health-check.service
sudo systemctl status mysql-health-check.service

Advanced Considerations and Next Steps

Application-Level Failover Orchestration

While SNS alerts are valuable for human operators, a truly automated system might require the health check daemon to trigger application-level failover actions. This could involve:

Updating Application Configuration: The daemon could update a configuration file or a service discovery endpoint (e.g., Consul, etcd) with the new RDS endpoint if a failover is detected.
Restarting Application Services: Triggering restarts of application tiers that might have cached the old database endpoint.
Integrating with Load Balancers: If using a custom load balancer or a service mesh, the daemon could signal it to direct traffic to the new primary.

Implementing these actions requires careful design to avoid race conditions and ensure consistency. For instance, the daemon could write the new primary endpoint to an S3 bucket or DynamoDB table, and application instances would periodically poll this source for the current database endpoint.

Database Credentials Management

Hardcoding database passwords in scripts is a significant security risk. For production environments, leverage AWS Secrets Manager. The Perl script can be modified to retrieve credentials from Secrets Manager using the AWS SDK or CLI.

# Example of retrieving a secret using AWS CLI (to be integrated into Perl)
aws secretsmanager get-secret-value --secret-id your-rds-secret-name --region us-east-1 --query SecretString --output text

Monitoring and Alerting Granularity

Enhance monitoring by:

Collecting detailed metrics from the Perl script (e.g., query latency, connection errors) and sending them to CloudWatch or Prometheus.
Configuring CloudWatch Alarms on RDS metrics (e.g., CPUUtilization, FreeableMemory, ReplicaLag) to complement the custom health checks.
Ensuring the EC2 instance running the health check daemon is itself highly available (e.g., part of an Auto Scaling Group with health checks).

Testing Failover Scenarios

Regularly test your failover mechanism. You can simulate a failure by:

Stopping the RDS instance (this will trigger an automatic Multi-AZ failover).
Simulating network issues from the EC2 instance to the RDS endpoint.
Manually terminating the primary RDS instance (use with extreme caution in production).

Verify that alerts are triggered correctly and that applications can reconnect to the new primary endpoint after the failover. Monitor application logs for any connection errors during the failover window.