Disaster Recovery 101: Architecting Auto-Failovers for MySQL and WordPress Deployments on AWS
Leveraging AWS RDS Multi-AZ for MySQL High Availability
For mission-critical WordPress deployments, MySQL availability is paramount. AWS Relational Database Service (RDS) addresses this with its Multi-AZ deployment option. This configuration automatically provisions and maintains a synchronous standby replica of your primary database instance in a different Availability Zone (AZ). In the event of a primary instance failure, certain planned maintenance operations, or an AZ outage, RDS automatically fails over to the standby by repointing the instance endpoint's DNS record. The endpoint name itself does not change, so WordPress needs no configuration change, but the failover is not invisible: open connections are dropped and the database is unreachable for a short window (AWS cites typically 60 to 120 seconds) until the standby is promoted.
Configuring RDS Multi-AZ is straightforward during instance creation or modification. The key is selecting the “Multi-AZ deployment” option. RDS handles the underlying replication and failover orchestration. However, understanding the failover process and its implications for your WordPress application is crucial for architecting a truly resilient system.
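For an existing instance, the same switch can be flipped through the API. The following is a minimal sketch using boto3's `modify_db_instance` call; the helper name `enable_multi_az` and the identifier `wordpress-db` are placeholders introduced here for illustration, and the client is passed in so the function can be exercised without AWS credentials.

```python
def enable_multi_az(rds_client, db_instance_identifier, apply_immediately=False):
    """Request conversion of an existing RDS instance to Multi-AZ.

    rds_client is expected to behave like boto3.client("rds").
    With apply_immediately=False the change waits for the next
    maintenance window instead of starting right away.
    """
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=db_instance_identifier,
        MultiAZ=True,
        ApplyImmediately=apply_immediately,
    )


# Usage (hypothetical instance name):
#   import boto3
#   enable_multi_az(boto3.client("rds"), "wordpress-db")
```

Converting a running instance can add load while the standby is being created, so deferring the change to the maintenance window is the more conservative default in this sketch.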
Automating WordPress Application Failover with EC2 Auto Scaling and Elastic Load Balancing
While RDS Multi-AZ handles database failover, your WordPress application servers running on EC2 instances also need a high-availability strategy. This typically involves a combination of Elastic Load Balancing (ELB) and EC2 Auto Scaling Groups (ASG).
An ELB distributes incoming traffic across multiple EC2 instances. With health checks configured, the ELB detects unhealthy instances and stops sending traffic to them. EC2 Auto Scaling Groups work in conjunction with the ELB: an ASG monitors instance health, launches new instances to replace unhealthy ones, and scales out based on demand. Even if an EC2 instance fails, a replacement is provisioned and registered with the ELB, maintaining application availability.
Configuring ELB Health Checks for WordPress
Effective health checks are the cornerstone of automated failover. For WordPress, a simple HTTP/HTTPS check to the root path (`/`) might not be sufficient, as it doesn’t verify database connectivity. A more robust approach involves a custom health check script that verifies both application responsiveness and database reachability.
Create a simple PHP script (e.g., /wp-admin/health-check.php) on your WordPress servers:
<?php
/**
 * WordPress Health Check Script
 * Verifies application responsiveness and database connectivity.
 */

// Load only the WordPress core and database layer, skipping themes and
// most plugins, so the check stays cheap.
define('SHORTINIT', true);

// Assumes this file lives in wp-admin/, one level below the WordPress root.
$wp_load = dirname(__DIR__) . '/wp-load.php';

// Check that WordPress is present (basic check)
if (!file_exists($wp_load)) {
    http_response_code(503);
    echo "wp-load.php not found.";
    exit;
}

// Loading WordPress reads wp-config.php (DB_HOST, DB_USER, ...) and creates
// the global $wpdb connection. Depending on the WordPress version and debug
// settings, a database outage may make WordPress abort here with its own
// HTTP 500 error page, which also fails the health check.
require_once $wp_load;

// Run a trivial query through WordPress's own connection. This returns
// null if the database is unreachable.
$result = $wpdb->get_var('SELECT 1');
if ((string) $result !== '1') {
    http_response_code(503);
    // Do not echo connection details; this endpoint is publicly reachable.
    echo "Database connection failed.";
    exit;
}

// If all checks pass
http_response_code(200);
echo "WordPress is healthy.";
?>
Ensure this script is accessible via HTTP/HTTPS. Then, configure your ELB’s health check settings:
- Protocol: HTTP (or HTTPS if your ELB is configured for it)
- Port: 80 (or 443)
- Path: /wp-admin/health-check.php
- Healthy Threshold: 2 (or 3)
- Unhealthy Threshold: 2 (or 3)
- Timeout: 5 seconds
- Interval: 30 seconds
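These settings ensure that the ELB considers an instance unhealthy only after multiple consecutive failures, preventing transient network glitches from triggering unnecessary failovers. The trade-off is detection time: with a 30-second interval and an unhealthy threshold of 2, an instance is taken out of rotation roughly 60 seconds after it starts failing. The timeout should be short enough to detect issues quickly but long enough to avoid false positives.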
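The settings above map onto the health check parameters of an ELBv2 target group. As a rough sketch, assuming an Application Load Balancer target group, the helper below builds the keyword arguments accepted by boto3's `elbv2.modify_target_group` and estimates how long a failing instance keeps receiving traffic. The function names are illustrative, not part of any AWS API.

```python
def health_check_settings(path="/wp-admin/health-check.php", port=80,
                          interval=30, timeout=5, healthy=2, unhealthy=2):
    """Build health check keyword arguments for elbv2.modify_target_group."""
    if timeout >= interval:
        # Sanity check: a probe must finish before the next one starts.
        raise ValueError("timeout must be shorter than the interval")
    return {
        "HealthCheckProtocol": "HTTP",
        "HealthCheckPort": str(port),  # the API expects the port as a string
        "HealthCheckPath": path,
        "HealthCheckIntervalSeconds": interval,
        "HealthCheckTimeoutSeconds": timeout,
        "HealthyThresholdCount": healthy,
        "UnhealthyThresholdCount": unhealthy,
        "Matcher": {"HttpCode": "200"},  # only a 200 counts as healthy
    }


def detection_window_seconds(settings):
    """Approximate time until a failing instance is marked unhealthy."""
    return (settings["HealthCheckIntervalSeconds"]
            * settings["UnhealthyThresholdCount"])


settings = health_check_settings()
print(detection_window_seconds(settings))

# Applying the settings (hypothetical target group ARN):
#   import boto3
#   boto3.client("elbv2").modify_target_group(
#       TargetGroupArn="arn:aws:elasticloadbalancing:...", **settings)
```

Lowering the interval or the unhealthy threshold shortens the detection window at the cost of more probes and a higher chance of reacting to a transient blip.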
Configuring EC2 Auto Scaling Groups for WordPress
An Auto Scaling Group (ASG) is configured with a Launch Template that defines how new EC2 instances are launched (Launch Configurations are the older equivalent; AWS recommends Launch Templates for new setups). This template should include:
- AMI ID: Your custom WordPress AMI or a standard Amazon Linux 2 AMI.
- Instance Type: Appropriate for your WordPress workload.
- Security Groups: Allowing inbound traffic from the ELB and necessary outbound access.
- User Data Script: A shell script to install and configure WordPress on boot, including connecting to your RDS instance.
The ASG itself is configured with:
- Desired Capacity: The number of instances to maintain.
- Min Size: The minimum number of instances.
- Max Size: The maximum number of instances.
- VPC and Subnets: Spanning multiple Availability Zones for high availability.
- Load Balancer: Associating the ASG with your ELB.
- Health Check Type: Set to “ELB” to leverage the ELB’s health checks.
When the ELB marks an instance as unhealthy, the ASG will automatically terminate it and launch a replacement. The new instance will be provisioned, configured by the user data script, and automatically registered with the ELB, seamlessly taking over traffic.
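The ASG settings listed above correspond closely to the parameters of boto3's `autoscaling.create_auto_scaling_group`. The sketch below only assembles and validates those parameters; the group name, launch template name, subnet IDs, target group ARN, and the 300-second grace period are all placeholder assumptions.

```python
def auto_scaling_group_params(name, launch_template_name, subnet_ids,
                              target_group_arn, min_size=2, max_size=4,
                              desired=2, grace_period=300):
    """Build keyword arguments for autoscaling.create_auto_scaling_group."""
    if not min_size <= desired <= max_size:
        raise ValueError("desired capacity must lie between min and max size")
    if len(subnet_ids) < 2:
        # One subnet means one AZ, which defeats the purpose of the ASG here.
        raise ValueError("use subnets in at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
        # The API takes the subnets as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "TargetGroupARNs": [target_group_arn],
        # Replace instances that fail the load balancer's health check.
        "HealthCheckType": "ELB",
        # Give user data time to install WordPress before checks count.
        "HealthCheckGracePeriod": grace_period,
    }


# Usage (hypothetical names):
#   import boto3
#   params = auto_scaling_group_params(
#       "wordpress-asg", "wordpress-template",
#       ["subnet-aaaa1111", "subnet-bbbb2222"],
#       "arn:aws:elasticloadbalancing:...")
#   boto3.client("autoscaling").create_auto_scaling_group(**params)
```

The grace period matters for a user-data-driven setup: if it is shorter than the time WordPress takes to come up, the ASG may terminate instances that were merely still booting.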
Implementing a Custom Failover Orchestration (Advanced)
While RDS Multi-AZ and ELB/ASG provide automated failover for typical scenarios, some edge cases or specific requirements call for more granular control over the failover process. This could involve custom logic for application-level failover, such as switching to a read-only replica or triggering specific maintenance tasks.
Leveraging AWS Lambda and CloudWatch Events for Custom Actions
AWS CloudWatch Events (now EventBridge) can monitor RDS events, such as failover notifications. You can set up a rule that triggers an AWS Lambda function when a specific RDS event occurs.
CloudWatch Event Rule:
{
  "source": [
    "aws.rds"
  ],
  "detail-type": [
    "RDS DB Instance Event"
  ],
  "detail": {
    "EventCategories": [
      "failover"
    ],
    "SourceArn": [
      "arn:aws:rds:REGION:ACCOUNT_ID:db:YOUR_DB_INSTANCE_IDENTIFIER"
    ]
  }
}
This rule would trigger a Lambda function upon an RDS failover event for a specific database instance.
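The rule can also be created programmatically. The sketch below, which assumes boto3's EventBridge client (`boto3.client("events")`), builds the same event pattern and attaches a Lambda function as the target. The rule name, target ID, and ARNs are placeholders introduced for illustration.

```python
import json


def failover_event_pattern(db_instance_arn):
    """Event pattern matching RDS failover events for one DB instance."""
    return {
        "source": ["aws.rds"],
        "detail-type": ["RDS DB Instance Event"],
        "detail": {
            "EventCategories": ["failover"],
            "SourceArn": [db_instance_arn],
        },
    }


def register_failover_rule(events_client, rule_name, db_instance_arn, lambda_arn):
    """Create (or update) the rule and point it at the Lambda function."""
    response = events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(failover_event_pattern(db_instance_arn)),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "failover-handler", "Arn": lambda_arn}],
    )
    return response["RuleArn"]


# Usage (hypothetical names and ARNs):
#   import boto3
#   register_failover_rule(
#       boto3.client("events"), "wordpress-rds-failover",
#       "arn:aws:rds:REGION:ACCOUNT_ID:db:YOUR_DB_INSTANCE_IDENTIFIER",
#       "arn:aws:lambda:REGION:ACCOUNT_ID:function:YOUR_FUNCTION")
```

EventBridge also needs permission to invoke the function, which is granted separately on the Lambda side (for example with the Lambda `add_permission` API and the principal `events.amazonaws.com`).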
AWS Lambda Function (Python example):
import json

import boto3

rds_client = boto3.client('rds')
ssm_client = boto3.client('ssm')

# Shell script run on each instance; {endpoint} is filled in by the handler.
UPDATE_SCRIPT_TEMPLATE = """#!/bin/bash
# Update wp-config.php with the current RDS endpoint
sed -i "s/define( *'DB_HOST', *'[^']*' *);/define('DB_HOST', '{endpoint}');/" /var/www/html/wp-config.php
echo "Updated DB_HOST in wp-config.php to {endpoint}"
# Optional: Clear WordPress cache if using a caching plugin
# wp cache flush --allow-root
# Optional: Restart PHP-FPM or the web server if necessary
# sudo systemctl restart php-fpm
# sudo systemctl restart apache2
"""


def lambda_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2))

    # RDS events carry the instance name in 'SourceIdentifier'. They do not
    # include the endpoint, so it has to be fetched from RDS.
    db_instance_identifier = event['detail']['SourceIdentifier']

    # Fetch the current endpoint from RDS
    try:
        response = rds_client.describe_db_instances(
            DBInstanceIdentifier=db_instance_identifier
        )
        if not response['DBInstances']:
            print(f"Could not find DB instance details for {db_instance_identifier}")
            return {'statusCode': 404, 'body': 'DB instance not found'}
        # 'Endpoint' is a dict (Address, Port, HostedZoneId); wp-config.php
        # needs only the hostname.
        current_endpoint = response['DBInstances'][0]['Endpoint']['Address']
        print(f"Current DB Endpoint for {db_instance_identifier}: {current_endpoint}")
    except Exception as e:
        print(f"Error fetching DB instance details: {e}")
        return {'statusCode': 500, 'body': f'Error fetching DB instance details: {e}'}

    # Update WordPress configuration files on EC2 instances.
    # This is a complex step and requires a mechanism to update instances.
    # Options include:
    # 1. SSM Run Command to execute a script on instances.
    # 2. Updating a configuration management tool (Ansible, Chef, Puppet).
    # 3. Using a shared configuration store (e.g., AWS Systems Manager
    #    Parameter Store) and having WordPress read from it.

    # Example using SSM Run Command (simplified). The Lambda function needs
    # IAM permissions for SSM. The instance IDs are hard-coded here for
    # demonstration; in a real scenario you would query the ASG for them.
    instance_ids = ['i-0123456789abcdef0', 'i-0abcdef0123456789']  # Example instance IDs
    update_script = UPDATE_SCRIPT_TEMPLATE.format(endpoint=current_endpoint)

    try:
        response = ssm_client.send_command(
            InstanceIds=instance_ids,
            DocumentName='AWS-RunShellScript',
            Parameters={'commands': [update_script]},
            Comment='Automated WordPress DB host update after RDS failover'
        )
        command_id = response['Command']['CommandId']
        print(f"Sent SSM command {command_id} to instances.")
    except Exception as e:
        print(f"Error sending SSM command: {e}")
        return {'statusCode': 500, 'body': f'Error sending SSM command: {e}'}

    # Optional: Trigger ELB deregistration/registration if needed.
    # In most cases, ELB health checks and the ASG handle instance
    # replacement. To influence ELB targets manually, use a boto3
    # 'elbv2' client.
    return {
        'statusCode': 200,
        'body': json.dumps('RDS failover processed and WordPress updated.')
    }
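This Lambda function, triggered by an RDS failover event, looks up the instance's current endpoint and uses AWS Systems Manager (SSM) Run Command to rewrite DB_HOST in wp-config.php on your EC2 instances. Note that a Multi-AZ failover keeps the same endpoint hostname and only changes the DNS record behind it, so in a plain Multi-AZ setup the rewrite leaves wp-config.php effectively unchanged. The pattern is more useful when the new primary has a different endpoint (for example, a promoted read replica) or when you want to run follow-up steps such as flushing caches or restarting PHP-FPM to drop stale connections. Remember to grant the Lambda function the necessary IAM permissions to interact with RDS and SSM, and make sure the target instances run the SSM Agent with an instance profile that allows Systems Manager to manage them.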
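The hard-coded instance IDs in the function above can be replaced by a lookup against the Auto Scaling Group. A minimal sketch, assuming boto3's `autoscaling.describe_auto_scaling_groups` call and a hypothetical group name:

```python
def in_service_instance_ids(autoscaling_client, group_name):
    """Return the IDs of instances the ASG currently reports as InService.

    autoscaling_client is expected to behave like boto3.client("autoscaling").
    Instances that are still launching or already terminating are skipped,
    since an SSM command sent to them would likely fail or be pointless.
    """
    response = autoscaling_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    groups = response["AutoScalingGroups"]
    if not groups:
        return []
    return [
        instance["InstanceId"]
        for instance in groups[0]["Instances"]
        if instance["LifecycleState"] == "InService"
    ]


# Usage inside the Lambda handler (hypothetical group name):
#   import boto3
#   instance_ids = in_service_instance_ids(
#       boto3.client("autoscaling"), "wordpress-asg")
```

The Lambda role would additionally need the `autoscaling:DescribeAutoScalingGroups` permission for this lookup.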
Testing Your Failover Strategy
Regular, rigorous testing is non-negotiable. Simulate failures to validate your automated failover mechanisms.
- RDS Failover Test: In the AWS RDS console, select your database instance and choose “Reboot” with the “Reboot with failover” option. Monitor your WordPress site for any downtime.
- EC2 Instance Failure: Manually stop or terminate an EC2 instance within your Auto Scaling Group. Observe if the ELB stops sending traffic to it and if the ASG launches a replacement.
- AZ Outage Simulation: While more complex, you can simulate an AZ outage by reconfiguring your VPC subnets or security groups to isolate instances in one AZ.
Document the results of each test, including the duration of any downtime, and use this information to refine your configurations and scripts.
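To put a number on downtime during these tests, one option is to poll the health check URL at a fixed interval and compute the longest run of failed probes. The sketch below uses only the Python standard library; the function names and the sampling approach are illustrative assumptions, and the resolution of the result is limited by the polling interval.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 503, refused connections, and timeouts.
        return False


def longest_outage_seconds(samples):
    """Longest stretch of failed probes.

    samples is a chronological list of (timestamp_seconds, is_healthy)
    pairs. An outage runs from the first failed probe to the next
    successful one, or to the last sample if the site never recovered.
    """
    longest = 0.0
    outage_start = None
    for timestamp, healthy in samples:
        if not healthy and outage_start is None:
            outage_start = timestamp
        elif healthy and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Usage (hypothetical URL), sampling every 5 seconds for 10 minutes:
#   import time
#   samples = []
#   for _ in range(120):
#       samples.append((time.time(), probe("https://example.com/wp-admin/health-check.php")))
#       time.sleep(5)
#   print(longest_outage_seconds(samples))
```

Start the polling loop before triggering the failover (for example via the console, or boto3's `reboot_db_instance` with `ForceFailover=True`) so the samples bracket the whole event.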
Conclusion
Architecting for automated failover for WordPress on AWS involves a multi-layered approach. RDS Multi-AZ provides database resilience, while ELB and EC2 Auto Scaling Groups ensure application availability. For advanced scenarios, custom orchestration using CloudWatch Events and Lambda offers fine-grained control. By implementing and diligently testing these strategies, you can build a highly available and resilient WordPress deployment capable of withstanding infrastructure failures with minimal disruption.