Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Magento 2 Deployments on AWS

Leveraging AWS RDS for Automated MySQL Failover

For mission-critical Magento 2 deployments, a robust disaster recovery strategy is paramount. At its core, this means ensuring minimal downtime during database failures. Amazon Web Services (AWS) Relational Database Service (RDS) offers a managed solution that significantly simplifies achieving high availability and automated failover for MySQL instances. The key lies in configuring Multi-AZ deployments.

When you enable Multi-AZ for an RDS MySQL instance, AWS automatically provisions and maintains a synchronous standby replica in a different Availability Zone (AZ) within the same AWS Region. During planned maintenance or an unplanned outage affecting the primary instance, RDS automatically fails over to the standby. No application configuration change is needed, because the instance endpoint keeps the same hostname: RDS repoints the endpoint's DNS record at the standby, which is then promoted to become the new primary. The failover is not invisible to the application, though. Existing connections are dropped and must be re-established, and the interruption typically lasts around one to two minutes, depending on the workload and how long crash recovery takes.
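
To confirm after the fact that a failover actually happened, the RDS event log can be queried. The following is a minimal sketch, assuming `rds_client` is a `boto3.client("rds")` created by the caller; the function name is hypothetical, and pagination of the event list is ignored for brevity.

```python
def recent_failover_events(rds_client, db_instance_id, minutes=60):
    """Return (timestamp, message) pairs for failover events in the last `minutes`.

    `rds_client` is assumed to be boto3.client("rds"); it is passed in so the
    function can be exercised with a stub.
    """
    response = rds_client.describe_events(
        SourceIdentifier=db_instance_id,
        SourceType="db-instance",
        EventCategories=["failover"],  # only failover-related events
        Duration=minutes,              # look-back window in minutes
    )
    return [(e.get("Date"), e.get("Message", "")) for e in response.get("Events", [])]
```

An empty list means RDS recorded no failover events for that instance in the window.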

Configuring RDS Multi-AZ Deployment

Configuring Multi-AZ can be done during the initial creation of an RDS instance or by modifying an existing one. The process is straightforward via the AWS Management Console, AWS CLI, or SDKs.

Using AWS CLI for Multi-AZ Configuration

To create a new RDS MySQL instance with Multi-AZ enabled:

aws rds create-db-instance \
    --db-instance-identifier my-magento-db-primary \
    --db-instance-class db.r5.large \
    --engine mysql \
    --allocated-storage 100 \
    --master-username admin \
    --master-user-password YOUR_SECURE_PASSWORD \
    --vpc-security-group-ids sg-xxxxxxxxxxxxxxxxx \
    --db-subnet-group-name my-db-subnet-group \
    --multi-az \
    --backup-retention-period 7 \
    --region us-east-1

To modify an existing RDS MySQL instance to enable Multi-AZ:

aws rds modify-db-instance \
    --db-instance-identifier my-magento-db-primary \
    --multi-az \
    --apply-immediately \
    --region us-east-1

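The --apply-immediately flag applies the change right away instead of deferring it to the next maintenance window. Converting a Single-AZ instance to Multi-AZ should not cause downtime, but RDS takes a snapshot of the primary and restores it to create the standby, which can noticeably increase I/O latency on the primary while the standby is provisioned and synchronized. Schedule the conversion for a low-traffic period.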
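
Because the conversion takes a while, it can help to check whether Multi-AZ is already active or still pending. A minimal sketch, assuming `rds_client` is a `boto3.client("rds")` supplied by the caller; the function name and the three status strings are illustrative, not part of any AWS API.

```python
def multi_az_status(rds_client, db_instance_id):
    """Report whether Multi-AZ is 'enabled', 'pending', or 'disabled'."""
    response = rds_client.describe_db_instances(DBInstanceIdentifier=db_instance_id)
    instance = response["DBInstances"][0]
    if instance.get("MultiAZ"):
        return "enabled"
    # A requested but not yet applied change shows up under PendingModifiedValues.
    if instance.get("PendingModifiedValues", {}).get("MultiAZ"):
        return "pending"
    return "disabled"
```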

Architecting Magento 2 for High Availability with RDS Read Replicas

While Multi-AZ provides automatic failover for the primary database, the standby does not serve read traffic. For a high-performance Magento 2 deployment, especially one with significant read operations (product listings, category pages, search), leveraging RDS Read Replicas is crucial. Read Replicas allow you to scale read capacity independently of write capacity and, because a replica can be promoted to a standalone instance, they can also play a part in disaster recovery.

Creating and Configuring Read Replicas

RDS creates a Read Replica from a snapshot of the source instance, which is why automated backups (a backup retention period greater than 0) must be enabled on the source. Replicas use asynchronous replication to stay in sync with the primary, so they can lag behind it. Importantly, Read Replicas can be deployed across different Availability Zones and even different AWS Regions, offering a powerful DR strategy.
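
Since replication is asynchronous, it is worth watching how far a replica trails the primary. The sketch below reads the `ReplicaLag` metric that RDS publishes to CloudWatch in the `AWS/RDS` namespace. It assumes `cloudwatch_client` is a `boto3.client("cloudwatch")`; the function names and the 30-second threshold are illustrative choices, not recommendations from AWS.

```python
from datetime import datetime, timedelta, timezone


def latest_replica_lag(cloudwatch_client, replica_id, window_minutes=5):
    """Return the most recent average ReplicaLag in seconds, or None if no data."""
    end = datetime.now(timezone.utc)
    response = cloudwatch_client.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        StartTime=end - timedelta(minutes=window_minutes),
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    # Datapoints are not guaranteed to arrive in time order, so sort them.
    datapoints = sorted(response.get("Datapoints", []), key=lambda d: d["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else None


def is_replica_fresh(lag_seconds, max_lag_seconds=30):
    """Treat missing data as stale so callers fail safe."""
    return lag_seconds is not None and lag_seconds <= max_lag_seconds
```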

Using AWS CLI to Create a Read Replica

aws rds create-db-instance-read-replica \
    --db-instance-identifier my-magento-db-replica-1 \
    --source-db-instance-identifier my-magento-db-primary \
    --db-instance-class db.r5.large \
    --availability-zone us-east-1a \
    --region us-east-1

You can create multiple read replicas to distribute read traffic. For cross-region disaster recovery, run the command against the destination region and pass the source instance's ARN as --source-db-instance-identifier.

Magento 2 Database Configuration for Read/Write Splitting

Magento 2’s architecture supports database read/write splitting. This is configured in the app/etc/env.php file. You’ll define multiple database connections, one for the primary (writes and reads) and others for read replicas.

Modifying `env.php` for Read/Write Splitting

Locate the 'db' section in your app/etc/env.php file. You’ll need to modify the 'connection' array to include your read replica(s). The 'default' connection will be your primary RDS instance, and you can add new connections for read replicas.

<?php
return [
    'backend' => [
        'frontName' => 'admin_secret_path'
    ],
    'crypt' => [
        'key' => 'your_application_crypt_key'
    ],
    'db' => [
        'connection' => [
            'default' => [
                'host' => 'my-magento-db-primary.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com',
                'dbname' => 'magento_db',
                'username' => 'admin',
                'password' => 'YOUR_SECURE_PASSWORD',
                'model' => 'mysql4',
                'initStatements' => 'SET NAMES utf8',
                'driver_options' => [
                    1000 => 1 // PDO::MYSQL_ATTR_USE_BUFFERED_QUERY
                ]
            ],
            'read' => [
                'host' => 'my-magento-db-replica-1.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com',
                'dbname' => 'magento_db',
                'username' => 'admin',
                'password' => 'YOUR_SECURE_PASSWORD',
                'model' => 'mysql4',
                'initStatements' => 'SET NAMES utf8',
                'driver_options' => [
                    1000 => 1 // PDO::MYSQL_ATTR_USE_BUFFERED_QUERY
                ]
            ]
        ],
    ],
    'resource' => [
        'default_setup' => [
            'connection' => 'default'
        ]
    ],
    // ... other configuration
];

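Defining a connection in env.php only makes it available; whether Magento 2 actually sends queries to it depends on the edition and version. A resource uses a connection only if the 'resource' section maps it to that connection or code requests the connection by name, so do not assume that read queries reach the 'read' connection automatically. Adobe Commerce, for example, uses a separate 'slave_connection' key in its cloud configuration. Verify the behavior against the documentation for your release and confirm it by watching query traffic on the replica. For more complex read/write splitting scenarios or to manage multiple read replicas, consider using a database proxy like ProxySQL or MaxScale, which can be deployed on EC2 instances and configured to route traffic intelligently.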
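
To make the routing decision concrete, here is a deliberately naive sketch of the kind of rule a read/write-splitting proxy applies. It is not how ProxySQL, MaxScale, or Magento implement routing; the function name and the "primary"/"replica" labels are hypothetical. It errs toward the primary: anything inside a transaction, any locking read, and anything that is not plainly a SELECT or SHOW goes to the writer.

```python
import re

# Statements that only read data.
_READ_ONLY = re.compile(r"^\s*(select|show)\b", re.IGNORECASE)
# Locking reads must run on the primary even though they start with SELECT.
_LOCKING = re.compile(r"\bfor\s+(update|share)\b|\block\s+in\s+share\s+mode\b", re.IGNORECASE)


def choose_backend(sql, in_transaction=False):
    """Return 'replica' for plain reads outside a transaction, else 'primary'."""
    if in_transaction:
        # Keep a transaction on one server so it sees its own writes.
        return "primary"
    if _READ_ONLY.match(sql) and not _LOCKING.search(sql):
        return "replica"
    return "primary"
```

Because replicas lag, a real setup also has to decide which reads can tolerate slightly stale data; checkout and cart queries are usually kept on the primary.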

Automating Failover Detection and Application Restart

While RDS Multi-AZ handles the database failover automatically, your application instances (e.g., EC2 web servers running PHP-FPM) need to be aware of potential connection disruptions and re-establish connections to the new primary endpoint. This is where automation becomes critical.
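
The simplest form of that awareness is retrying the connection with a growing delay until the endpoint resolves to the new primary. A minimal sketch, assuming the caller passes any zero-argument `connect` callable; the function name and the default attempt and delay values are illustrative.

```python
import time


def connect_with_retry(connect, attempts=6, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call `connect()` until it succeeds, backing off exponentially between tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:  # in real code, catch only the driver's error class
            last_error = exc
            if attempt < attempts - 1:
                # Delays grow 1s, 2s, 4s, ... capped at max_delay.
                sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ConnectionError(f"database unreachable after {attempts} attempts") from last_error
```

With PyMySQL this could be called as `connect_with_retry(lambda: pymysql.connect(host=DB_HOST, user=DB_USER, password=DB_PASSWORD, connect_timeout=5))`. Each attempt performs a fresh DNS lookup at connect time, which is what lets it find the promoted standby.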

Monitoring Database Connectivity

A common approach is to implement a health check mechanism. This can be a simple script that periodically attempts to connect to the database and verifies its availability. If the script publishes each result as a custom CloudWatch metric, a CloudWatch alarm on that metric can trigger actions when checks fail.

Example: Python Health Check Script

import pymysql
import time
import os

DB_HOST = os.environ.get('DB_HOST', 'my-magento-db-primary.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com')
DB_USER = os.environ.get('DB_USER', 'admin')
DB_PASSWORD = os.environ.get('DB_PASSWORD', 'YOUR_SECURE_PASSWORD')
DB_NAME = os.environ.get('DB_NAME', 'magento_db')
CHECK_INTERVAL = 60 # seconds

def check_db_connection():
    try:
        connection = pymysql.connect(
            host=DB_HOST,
            user=DB_USER,
            password=DB_PASSWORD,
            database=DB_NAME,
            connect_timeout=5
        )
        connection.close()
        return True
    except pymysql.Error as e:
        print(f"Database connection error: {e}")
        return False

if __name__ == "__main__":
    print(f"Starting database health check for {DB_HOST}...")
    while True:
        if check_db_connection():
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} - Database is reachable.")
        else:
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} - Database is NOT reachable. Initiating recovery actions...")
            # In a real-world scenario, this would trigger an SNS notification,
            # an Auto Scaling Group termination/launch, or a Lambda function.
            # For simplicity, we'll just exit here, assuming an external orchestrator
            # will handle the restart.
            raise SystemExit(1)
        time.sleep(CHECK_INTERVAL)

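Because the script loops indefinitely, run it as a long-lived service (for example under systemd or another process supervisor) on a dedicated EC2 instance. To run it from cron instead, remove the loop so that each invocation performs a single check. When the script detects a failure, it can publish to an SNS topic or invoke an AWS Lambda function to initiate recovery actions.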
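
One way to make the check result visible to AWS is to publish it as a custom CloudWatch metric on every iteration. A minimal sketch, assuming `cloudwatch_client` is a `boto3.client("cloudwatch")`; the namespace `Magento/DR`, the metric name `DatabaseReachable`, and the `DBHost` dimension are hypothetical names chosen for this example.

```python
def publish_db_health(cloudwatch_client, db_host, reachable, namespace="Magento/DR"):
    """Publish 1.0 when the database is reachable and 0.0 when it is not."""
    cloudwatch_client.put_metric_data(
        Namespace=namespace,
        MetricData=[
            {
                "MetricName": "DatabaseReachable",
                "Dimensions": [{"Name": "DBHost", "Value": db_host}],
                "Value": 1.0 if reachable else 0.0,
                "Unit": "Count",
            }
        ],
    )
```

In the health check loop this would be called as `publish_db_health(cloudwatch, DB_HOST, check_db_connection())`.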

Automated Application Restart/Reconfiguration

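After a Multi-AZ failover the endpoint hostname is unchanged, but your application servers might still hold stale connections or a cached DNS answer that points at the old primary. After promoting a read replica, the endpoint itself changes and the application configuration must be updated. To ensure seamless recovery in either case: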

EC2 Auto Scaling Groups: Run your Magento 2 application servers in an EC2 Auto Scaling Group. If the health check finds that an instance cannot recover its database connection, the instance can be marked unhealthy so that the group replaces it. Replacement instances resolve the database endpoint afresh and establish new connections. Treat this as a last resort: after a Multi-AZ failover, restarting PHP-FPM is often enough.
Configuration Management Tools: Tools like Ansible, Chef, or Puppet can be used to re-deploy or restart application services. A trigger mechanism (e.g., an SNS notification from CloudWatch) can initiate a playbook that restarts PHP-FPM or the web server and, if the endpoint has changed (for example after promoting a replica), updates the database host in env.php.
AWS Lambda: A Lambda function triggered by a CloudWatch alarm can be responsible for orchestrating the recovery. It can update configuration files on EC2 instances via Systems Manager, start an Auto Scaling instance refresh, or send notifications.
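
As an illustration of the lighter-weight option, the sketch below uses Systems Manager Run Command to restart PHP-FPM on tagged instances. It assumes `ssm_client` is a `boto3.client("ssm")`, that the instances run the SSM Agent, and that they carry a hypothetical `Role=magento-web` tag; the service name `php-fpm` also varies by distribution and PHP version.

```python
def restart_php_fpm(ssm_client, tag_key="Role", tag_value="magento-web", service="php-fpm"):
    """Restart the PHP-FPM service on all instances matching the tag.

    Returns the Run Command ID so the caller can poll for completion.
    """
    response = ssm_client.send_command(
        Targets=[{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [f"systemctl restart {service}"]},
        MaxConcurrency="50%",  # restart half the fleet at a time
        Comment="Drop stale DB connections after failover",
    )
    return response["Command"]["CommandId"]
```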

Example: Triggering an Auto Scaling Group Action via Lambda

This Python Lambda function, triggered (for example through SNS) by a CloudWatch alarm on database connectivity failure, replaces application instances in a specific Auto Scaling Group. If an instance ID is supplied, it terminates that instance without lowering the desired capacity, so the group launches a replacement; otherwise it starts a rolling instance refresh of the whole group.

import os

import boto3

autoscaling_client = boto3.client('autoscaling')
asg_name = os.environ['AUTO_SCALING_GROUP_NAME']
instance_id_to_terminate = os.environ.get('INSTANCE_ID_TO_TERMINATE')  # Optional: specify an instance to target

def lambda_handler(event, context):
    print(f"Received event: {event}")

    try:
        if instance_id_to_terminate:
            print(f"Terminating specific instance: {instance_id_to_terminate}")
            response = autoscaling_client.terminate_instance_in_auto_scaling_group(
                InstanceId=instance_id_to_terminate,
                # False keeps the desired capacity, so the group launches a replacement.
                ShouldDecrementDesiredCapacity=False
            )
        else:
            print(f"Starting instance refresh for Auto Scaling Group: {asg_name}")
            # A rolling instance refresh replaces instances in batches while keeping
            # at least MinHealthyPercentage of the group in service. The call fails
            # if a refresh is already in progress; the except block reports that.
            response = autoscaling_client.start_instance_refresh(
                AutoScalingGroupName=asg_name,
                Strategy='Rolling',
                Preferences={'MinHealthyPercentage': 50}
            )
        print(f"Auto Scaling response: {response}")

        print("Successfully initiated recovery action.")
        return {
            'statusCode': 200,
            'body': 'Recovery action initiated.'
        }
    except Exception as e:
        print(f"Error initiating recovery action: {e}")
        return {
            'statusCode': 500,
            'body': f'Error initiating recovery action: {str(e)}'
        }

This Lambda function would be triggered by a CloudWatch alarm. Alarms evaluate metrics, not process exit codes, so the health check script must publish its result as a custom metric (for example 1 for reachable and 0 for unreachable). The alarm should fire when that metric stays at 0, or stops arriving, for several consecutive periods.
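
A sketch of such an alarm follows. It assumes `cloudwatch_client` is a `boto3.client("cloudwatch")` and that a health check publishes a custom metric; the namespace `Magento/DR`, metric name `DatabaseReachable`, dimension `DBHost`, and alarm name are all hypothetical. The alarm notifies an SNS topic, to which the Lambda function can be subscribed.

```python
def create_db_unreachable_alarm(cloudwatch_client, db_host, sns_topic_arn, namespace="Magento/DR"):
    """Alarm when the database has been unreachable for three consecutive minutes."""
    cloudwatch_client.put_metric_alarm(
        AlarmName=f"magento-db-unreachable-{db_host}",
        Namespace=namespace,
        MetricName="DatabaseReachable",
        Dimensions=[{"Name": "DBHost", "Value": db_host}],
        Statistic="Minimum",             # one failed check in a period counts
        Period=60,                       # seconds per evaluation period
        EvaluationPeriods=3,             # require three bad periods in a row
        Threshold=1.0,
        ComparisonOperator="LessThanThreshold",
        TreatMissingData="breaching",    # a silent health checker is also a failure
        AlarmActions=[sns_topic_arn],
    )
```

Requiring several consecutive periods keeps a single dropped check from triggering instance replacement; the numbers here are starting points to tune, not recommendations.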

Cross-Region Disaster Recovery for Magento 2

For the highest level of resilience, a cross-region disaster recovery strategy is essential. This protects against catastrophic failures affecting an entire AWS Region.

RDS Cross-Region Read Replicas

You can create RDS Read Replicas in a different AWS Region from your primary RDS instance. These replicas are asynchronously replicated. In a DR scenario, you would promote a cross-region read replica to become the new primary database.

Creating a Cross-Region Read Replica

aws rds create-db-instance-read-replica \
    --db-instance-identifier my-magento-db-dr-replica \
    --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:my-magento-db-primary \
    --db-instance-class db.r5.large \
    --region us-west-2

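Note the use of the ARN (Amazon Resource Name) of the source DB instance, which is required for cross-region replication. RDS carries the replication traffic between regions itself, so no VPC peering is needed for the replica to stay in sync. VPC peering or AWS Transit Gateway is only needed if application servers in one region must reach a database in the other over private networking. If the source instance is encrypted, also pass --kms-key-id with a KMS key from the destination region.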

Promoting a Cross-Region Read Replica

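In the event of a regional outage, you would promote the cross-region read replica to a standalone database instance. RDS does not do this automatically, and promotion cannot be undone: the promoted instance stops replicating from the old primary. Because replication is asynchronous, any transactions that had not reached the replica when the primary region failed are lost, so check the replica's last known lag before promoting. This step requires careful orchestration.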

aws rds promote-read-replica \
    --db-instance-identifier my-magento-db-dr-replica \
    --region us-west-2

After promotion, you would update your application’s DNS records or configuration to point to this new primary database in the DR region. This is where a robust DNS failover strategy (e.g., using Amazon Route 53 with health checks) becomes critical.
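
The promotion step can be scripted so that the new endpoint is only handed to the application once the instance is ready. A minimal sketch, assuming `rds_client` is a `boto3.client("rds", region_name="us-west-2")`; the function name is hypothetical. One caveat: the waiter may return early if the instance has not yet left the "available" state when polling starts, which is why the sketch also checks that the instance no longer reports a replication source.

```python
def promote_and_wait(rds_client, replica_id):
    """Promote a read replica and return the endpoint address of the new primary."""
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Block until RDS reports the instance as available again.
    waiter = rds_client.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=replica_id)

    response = rds_client.describe_db_instances(DBInstanceIdentifier=replica_id)
    instance = response["DBInstances"][0]
    if instance.get("ReadReplicaSourceDBInstanceIdentifier"):
        # Still a replica: promotion has not completed, so do not repoint traffic.
        raise RuntimeError(f"{replica_id} is still replicating; retry later")
    return instance["Endpoint"]["Address"]
```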

Data Synchronization and Application Deployment in DR Region

For a true cross-region DR, you need to ensure your application code and static assets are also deployed to the DR region. This can be achieved through CI/CD pipelines that deploy to multiple regions simultaneously or are triggered upon a DR event.

Consider using AWS services like:

Amazon Route 53: For health checks and DNS-based failover to the DR region.
AWS CodeDeploy: To automate application deployments to EC2 instances in the DR region.
Amazon S3: For storing media assets and configuration files, with Cross-Region Replication copying them to a bucket in the DR region.
AWS Systems Manager: To manage and configure instances in the DR region.
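
To show what DNS-based failover looks like in practice, the sketch below builds a Route 53 change batch with a PRIMARY and a SECONDARY failover record for the storefront hostname. All names are hypothetical: the record name, the two load balancer targets, and the health check ID are placeholders that the caller would supply.

```python
def failover_change_batch(record_name, primary_target, secondary_target,
                          primary_health_check_id, ttl=60):
    """Build a ChangeBatch with PRIMARY/SECONDARY failover CNAME records."""

    def record(role, target, health_check_id=None):
        record_set = {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,        # short TTL so clients notice the switch quickly
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id:
            record_set["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Failover routing for the Magento storefront",
        "Changes": [
            # Route 53 answers with the primary while its health check passes.
            record("PRIMARY", primary_target, primary_health_check_id),
            record("SECONDARY", secondary_target),
        ],
    }
```

The result would be passed to `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`. Route 53 then serves the secondary record only while the primary's health check is failing.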

Architecting for automated failover is an ongoing process. Regular testing of your DR plan is non-negotiable to ensure it functions as expected when a real disaster strikes.