Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and Shopify Deployments on AWS

Automated MongoDB Failover with Amazon DocumentDB and Route 53

Achieving true disaster recovery for critical databases like MongoDB requires automated failover. On AWS, Amazon DocumentDB (with MongoDB compatibility), the managed MongoDB-compatible database service, combined with Amazon Route 53 offers a robust and scalable solution. This approach minimizes downtime by automatically redirecting traffic to a healthy primary in the event of an outage.

The core components of this strategy are:

Amazon DocumentDB (with MongoDB compatibility): Provides managed MongoDB-compatible database clusters with built-in replication and high availability features.
Amazon Route 53: A highly available and scalable cloud Domain Name System (DNS) web service. We’ll use it to give the database a stable DNS name that can be repointed during a failover.
AWS Lambda: A serverless compute service that can be triggered by events (like CloudWatch alarm state changes) to perform actions, such as updating DNS records.

Configuring DocumentDB for High Availability

When creating your DocumentDB cluster, ensure you configure it with multiple replicas across different Availability Zones (AZs) within a region. This is the foundational step for any failover strategy. A minimum of three instances (one primary, two read replicas) is recommended for production environments.

DocumentDB has no single “replica count” setting: you create the cluster first, then add one instance per replica and place each in a different AZ. You can do this via the AWS Management Console, AWS CLI, or SDKs. For example, using the AWS CLI:

aws docdb create-db-cluster --db-cluster-identifier my-mongo-cluster --engine docdb --master-username docdbadmin --master-user-password 'your_secure_password' --db-subnet-group-name my-db-subnet-group --vpc-security-group-ids sg-xxxxxxxxxxxxxxxxx --engine-version 4.0.0 --backup-retention-period 7 --preferred-backup-window '03:00-04:00' --preferred-maintenance-window 'sun:04:00-sun:05:00' --tags Key=Environment,Value=Production

aws docdb create-db-instance --db-instance-identifier my-mongo-instance-1 --db-cluster-identifier my-mongo-cluster --db-instance-class db.r5.large --engine docdb --availability-zone us-east-1a
aws docdb create-db-instance --db-instance-identifier my-mongo-instance-2 --db-cluster-identifier my-mongo-cluster --db-instance-class db.r5.large --engine docdb --availability-zone us-east-1b
aws docdb create-db-instance --db-instance-identifier my-mongo-instance-3 --db-cluster-identifier my-mongo-cluster --db-instance-class db.r5.large --engine docdb --availability-zone us-east-1c
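
Once the instances exist, it is worth confirming that they really landed in distinct AZs. The following is a minimal sketch, assuming the input has the shape of the DBInstances list returned by describe-db-instances (each entry carrying DBInstanceIdentifier and AvailabilityZone). The helper names az_spread and is_multi_az are hypothetical, and the sample data simply mirrors the three instances created above.

```python
from collections import defaultdict


def az_spread(instances):
    """Group instance identifiers by Availability Zone.

    `instances` is a list of dicts shaped like the entries of the
    DBInstances list returned by describe-db-instances (assumption).
    """
    by_az = defaultdict(list)
    for inst in instances:
        by_az[inst["AvailabilityZone"]].append(inst["DBInstanceIdentifier"])
    return dict(by_az)


def is_multi_az(instances, min_zones=2):
    """True when the instances span at least `min_zones` distinct AZs."""
    return len(az_spread(instances)) >= min_zones


# Sample data mirroring the three instances created above.
sample = [
    {"DBInstanceIdentifier": "my-mongo-instance-1", "AvailabilityZone": "us-east-1a"},
    {"DBInstanceIdentifier": "my-mongo-instance-2", "AvailabilityZone": "us-east-1b"},
    {"DBInstanceIdentifier": "my-mongo-instance-3", "AvailabilityZone": "us-east-1c"},
]

print(az_spread(sample))
print(is_multi_az(sample, min_zones=3))
```

In practice the list would come from an SDK call rather than a literal; the check itself stays the same.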

Setting Up Route 53 Health Checks

Route 53 health checks are the usual way to monitor an endpoint, but they fit DocumentDB poorly. Route 53’s checkers only speak HTTP, HTTPS, and plain TCP, so they cannot authenticate or run a `ping` command or a lightweight query. They also probe from outside your VPC, while a DocumentDB cluster endpoint (standard port 27017) is reachable only from inside it. A direct health check against the cluster endpoint is therefore not practical without an intermediary, such as a small HTTP service inside the VPC that runs the query on the checker’s behalf.

Alternatively, and often more reliably, you can leverage CloudWatch alarms. DocumentDB emits metrics to CloudWatch, and you can set up an alarm that fires when key metrics (e.g., `CPUUtilization`, `DatabaseConnections`) cross a threshold. The alarm can then trigger a Lambda function, either directly or through an SNS topic.

Let’s focus on the CloudWatch alarm approach, as it’s more robust for managed services.

Creating a CloudWatch Alarm for DocumentDB Health

We’ll create an alarm that monitors the `DatabaseConnections` metric for our DocumentDB cluster. If the number of connections drops to zero for a sustained period (indicating the primary might be down or unreachable), we’ll trigger an action.

aws cloudwatch put-metric-alarm --alarm-name "DocumentDBPrimaryUnreachable" --alarm-description "Alarm when DocumentDB primary has no connections" --metric-name "DatabaseConnections" --namespace "AWS/DocDB" --statistic Sum --period 300 --threshold 1 --comparison-operator "LessThanOrEqualToThreshold" --dimensions Name=DBClusterIdentifier,Value=my-mongo-cluster --evaluation-periods 2 --datapoints-to-alarm 2 --treat-missing-data notBreaching --alarm-actions arn:aws:sns:us-east-1:123456789012:my-docdb-failover-topic

In this command:

--metric-name DatabaseConnections: We’re monitoring the number of open database connections.
--statistic Sum and --period 300: The reported connection counts are summed over each 5-minute period.
--threshold 1 with LessThanOrEqualToThreshold: A period is breaching when that sum is at most 1, i.e. when there are effectively no connections.
--evaluation-periods 2: The alarm looks at the two most recent periods (10 minutes total).
--datapoints-to-alarm 2: Both of those periods must be breaching for the alarm to fire.
--treat-missing-data notBreaching: Periods without data count as healthy. The trade-off is that a cluster that stops reporting metrics altogether will not trip the alarm; use breaching instead if missing data should count as an outage.
--alarm-actions: The action to run when the alarm fires. For now it points to an SNS topic that our Lambda function will subscribe to; a later step invokes the Lambda function directly.
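
How these flags interact is easier to see with a toy model. The sketch below is a simplified teaching model of an "M out of N" evaluation for a LessThanOrEqualToThreshold alarm. It is not CloudWatch’s exact algorithm, the function name alarm_state is hypothetical, and it only models the breaching and notBreaching settings for missing data.

```python
def alarm_state(datapoints, threshold=1, evaluation_periods=2,
                datapoints_to_alarm=2, treat_missing="notBreaching"):
    """Simplified M-out-of-N evaluation for a LessThanOrEqualToThreshold alarm.

    `datapoints` holds one aggregated value per period, oldest first;
    None marks a period with no data.
    """
    window = datapoints[-evaluation_periods:]
    breaching = 0
    for value in window:
        if value is None:
            # Missing data only counts against us when configured to.
            if treat_missing == "breaching":
                breaching += 1
        elif value <= threshold:
            breaching += 1
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


print(alarm_state([40, 0, 0]))        # two breaching periods
print(alarm_state([40, 35, 0]))       # only one breaching period
print(alarm_state([40, None, None]))  # metrics stopped, notBreaching
print(alarm_state([40, None, None], treat_missing="breaching"))
```

The third call is the case to watch: with notBreaching, a cluster that goes completely silent stays in OK in this model, which is why the missing-data setting deserves a deliberate choice.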

Developing a Lambda Function for DNS Failover

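The Lambda function is invoked when the CloudWatch alarm fires. Its responsibility is to make sure the DNS record in Route 53 points at the cluster’s current writer endpoint. DocumentDB automatically promotes a replica to primary after the old primary fails, typically within minutes. Note that the cluster endpoint returned by DescribeDBClusters is itself a stable DNS name that DocumentDB repoints to the new primary, so after an in-cluster failover the function usually re-asserts the same CNAME target. It becomes the actual switch when the record has to be repointed at a different cluster.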

The Lambda function will need the following IAM permissions:

rds:DescribeDBClusters: To get the current primary endpoint. DocumentDB shares its management API with Amazon RDS, so the action uses the rds: prefix.
route53:ChangeResourceRecordSets: To update the DNS record.
logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents: For logging.
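
These permissions could be expressed as an IAM policy document along the following lines. This is a sketch under assumptions: the function name failover_lambda_policy and the Sid values are hypothetical, the hosted zone ID is the placeholder used elsewhere in this article, and the broad logs resource should be narrowed to the function’s own log group in a real deployment.

```python
import json


def failover_lambda_policy(hosted_zone_id):
    """Build a least-privilege-style policy for the failover function."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DescribeCluster",
                "Effect": "Allow",
                "Action": "rds:DescribeDBClusters",
                "Resource": "*",
            },
            {
                "Sid": "UpdateDnsRecord",
                "Effect": "Allow",
                "Action": "route53:ChangeResourceRecordSets",
                # Restrict DNS changes to the one hosted zone we manage.
                "Resource": f"arn:aws:route53:::hostedzone/{hosted_zone_id}",
            },
            {
                "Sid": "WriteLogs",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "arn:aws:logs:*:*:*",
            },
        ],
    }


print(json.dumps(failover_lambda_policy("Z1XXXXXXXXXXXXX"), indent=2))
```

Scoping route53:ChangeResourceRecordSets to a single hosted zone keeps a misbehaving function from touching unrelated DNS zones.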

Here’s a Python 3.9 Lambda function example:

import boto3
import json
import os

rds_client = boto3.client('rds')
route53_client = boto3.client('route53')

# Environment variables
HOSTED_ZONE_ID = os.environ['HOSTED_ZONE_ID'] # e.g., Z1XXXXXXXXXXXXX
RECORD_NAME = os.environ['RECORD_NAME']       # e.g., mongo.yourdomain.com.
DB_CLUSTER_IDENTIFIER = os.environ['DB_CLUSTER_IDENTIFIER'] # e.g., my-mongo-cluster

def get_primary_endpoint(cluster_identifier):
    """Retrieves the primary endpoint of the DocumentDB cluster."""
    try:
        response = rds_client.describe_db_clusters(DBClusterIdentifier=cluster_identifier)
        if not response['DBClusters']:
            print(f"Error: DB cluster '{cluster_identifier}' not found.")
            return None
        
        cluster = response['DBClusters'][0]
        if cluster['Status'] != 'available':
            print(f"Error: DB cluster '{cluster_identifier}' is not available. Status: {cluster['Status']}")
            return None
            
        return cluster['Endpoint']
    except Exception as e:
        print(f"Error describing DB cluster '{cluster_identifier}': {e}")
        return None

def update_route53_record(hosted_zone_id, record_name, new_endpoint):
    """Upserts a CNAME record that points record_name at the cluster endpoint."""
    # DocumentDB endpoints are DNS names, not IP addresses, so the record
    # must be a CNAME rather than an A record.
    change_batch = {
        'Changes': [
            {
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': record_name,
                    'Type': 'CNAME',
                    'TTL': 60,  # Short TTL so clients pick up changes quickly
                    'ResourceRecords': [
                        {'Value': new_endpoint}
                    ]
                }
            }
        ]
    }

    try:
        response = route53_client.change_resource_record_sets(
            HostedZoneId=hosted_zone_id,
            ChangeBatch=change_batch
        )
        print(f"Successfully updated Route 53 record for {record_name} to {new_endpoint}. Change Info: {response['ChangeInfo']}")
        return True
    except Exception as e:
        print(f"Error updating Route 53 record for {record_name}: {e}")
        return False

def extract_alarm(event):
    """Returns (alarm_name, state) for an SNS-wrapped or direct alarm event, else (None, None)."""
    try:
        if 'Records' in event:
            # Alarm -> SNS topic -> Lambda: the alarm is a JSON string in the SNS message.
            message = json.loads(event['Records'][0]['Sns']['Message'])
            return message['AlarmName'], message['NewStateValue']
        if 'alarmData' in event:
            # Alarm -> Lambda (direct alarm action).
            alarm = event['alarmData']
            return alarm['alarmName'], alarm['state']['value']
    except (KeyError, IndexError, TypeError, ValueError) as e:
        print(f"Could not parse alarm event: {e}")
    return None, None

def lambda_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2))

    # The cluster to inspect comes from the DB_CLUSTER_IDENTIFIER env var,
    # so only the alarm name and state are read from the event.
    alarm_name, state = extract_alarm(event)
    if alarm_name is None:
        print("Event structure not recognized.")
        return {
            'statusCode': 400,
            'body': json.dumps('Invalid event structure.')
        }

    if state != 'ALARM':
        print(f"Alarm state is not ALARM: {state}")
        return {
            'statusCode': 200,
            'body': json.dumps('Alarm state not ALARM, no action taken.')
        }

    print(f"CloudWatch Alarm '{alarm_name}' is in ALARM state.")

    # Get the current primary endpoint
    primary_endpoint = get_primary_endpoint(DB_CLUSTER_IDENTIFIER)
    if not primary_endpoint:
        return {
            'statusCode': 500,
            'body': json.dumps('Failed to retrieve primary endpoint.')
        }

    print(f"Current primary endpoint: {primary_endpoint}")
    # Update Route 53 record
    if update_route53_record(HOSTED_ZONE_ID, RECORD_NAME, primary_endpoint):
        return {
            'statusCode': 200,
            'body': json.dumps('Route 53 record updated successfully!')
        }
    return {
        'statusCode': 500,
        'body': json.dumps('Failed to update Route 53 record.')
    }

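Important Note on DNS Record Type: DocumentDB cluster endpoints are DNS names, not IP addresses. Therefore, the Route 53 record type must be CNAME, not A, which is what the function above upserts. Because a CNAME cannot sit at the zone apex, use a subdomain such as mongo.yourdomain.com. Your application should connect to mongo.yourdomain.com, and Route 53 will resolve this to the actual DocumentDB cluster endpoint. If TLS is enabled on the cluster, keep in mind that its certificate is issued for the DocumentDB endpoint name rather than your alias, so hostname verification has to be accounted for in the client configuration.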
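
If the same function might one day point the record at something other than a DocumentDB endpoint, the record type can be derived from the target instead of being hard-coded. This is a minimal sketch using only the standard library; the helper name record_type_for is hypothetical, and the example hostname is a made-up endpoint in the DocumentDB naming style.

```python
import ipaddress


def record_type_for(target):
    """Pick a DNS record type for a target value.

    IP literals map to A (IPv4) or AAAA (IPv6); anything else is
    treated as a hostname and therefore needs a CNAME.
    """
    try:
        ip = ipaddress.ip_address(target)
    except ValueError:
        return "CNAME"
    return "A" if ip.version == 4 else "AAAA"


# Made-up endpoint name, for illustration only.
endpoint = "my-mongo-cluster.cluster-abcdefghijkl.us-east-1.docdb.amazonaws.com"
print(record_type_for(endpoint))
print(record_type_for("10.0.12.34"))
```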

Connecting CloudWatch Alarm to Lambda

Now we need to link the CloudWatch alarm to our Lambda function. There are two options: keep the SNS topic as the alarm action and subscribe the Lambda function to that topic, or, more directly, set the Lambda function itself as the alarm action.

Direct Triggering (Recommended):

aws lambda add-permission --function-name my-docdb-failover-lambda --statement-id "AllowCloudWatchAlarmInvoke" --action "lambda:InvokeFunction" --principal "lambda.alarms.cloudwatch.amazonaws.com" --source-arn "arn:aws:cloudwatch:us-east-1:123456789012:alarm:DocumentDBPrimaryUnreachable"

aws cloudwatch put-metric-alarm --alarm-name "DocumentDBPrimaryUnreachable" --alarm-description "Alarm when DocumentDB primary has no connections" --metric-name "DatabaseConnections" --namespace "AWS/DocDB" --statistic Sum --period 300 --threshold 1 --comparison-operator "LessThanOrEqualToThreshold" --dimensions Name=DBClusterIdentifier,Value=my-mongo-cluster --evaluation-periods 2 --datapoints-to-alarm 2 --treat-missing-data notBreaching --alarm-actions arn:aws:lambda:us-east-1:123456789012:function:my-docdb-failover-lambda

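This approach directly invokes the Lambda function when the alarm state changes to ALARM. With a direct alarm action, the event carries the alarm name and state under alarmData (alarmData.alarmName and alarmData.state.value) rather than inside an SNS Records wrapper, so the handler must read the state from there.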

Configuring Shopify for Auto-Failover

Shopify, as a SaaS platform, abstracts away much of the underlying infrastructure. Direct control over database failover for Shopify *stores* is not possible. However, if you are using Shopify’s APIs to build custom applications or integrations that rely on an external database (like the MongoDB cluster we just configured for high availability), then the failover strategy described above is directly applicable to your application’s data layer.

Your Shopify application, running on a platform like AWS (e.g., EC2, ECS, EKS, Lambda), would connect to your MongoDB cluster using the Route 53 DNS record (e.g., mongo.yourdomain.com). When a failover occurs, the Lambda function updates the DNS record. Your application, with a short DNS TTL, will eventually resolve the new primary endpoint and continue operations with minimal interruption.

Key considerations for your Shopify application:

Connection Pooling: Ensure your application’s database connection pool is configured to handle connection errors gracefully and to re-establish connections when the DNS record is updated.
Retry Logic: Implement robust retry mechanisms for database operations. When a connection fails, the application should retry the operation after a short, exponential backoff delay. This is crucial for handling transient network issues during failover.
DNS Caching: Be mindful of DNS caching on your application servers. A low TTL (e.g., 60 seconds) on the Route 53 record helps clients pick up the DNS change relatively quickly, but only if every layer honors it. Some runtimes and resolvers cache lookups longer than the TTL (the JVM, for example, can cache successful lookups indefinitely depending on its security settings). Configure them to respect the record’s TTL, avoid resolvers with aggressive caching, and flush DNS caches programmatically if issues persist.
Application Downtime: While the database failover is automated, there will still be a period of unavailability. This includes the time it takes for DocumentDB to promote a new primary, the time for the CloudWatch alarm to trigger, the Lambda execution time, and the DNS propagation time. Aim to minimize each of these. For DocumentDB, promotion is typically complete within a few minutes. Detection is the slowest link in the setup above: the alarm needs two consecutive 5-minute periods, so it can take around 10 minutes to fire. DNS changes then take up to the record’s TTL to reach clients, and longer if resolvers cache beyond it.
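
The retry logic described above can be sketched as follows. This is a generic helper under stated assumptions, not a driver feature: the name retry_with_backoff and its parameters are hypothetical, it uses exponential backoff with full jitter, and the built-in ConnectionError stands in for whatever exception your MongoDB driver raises on a lost connection (with PyMongo that would typically be pymongo.errors.AutoReconnect).

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=8.0, retry_on=(ConnectionError,),
                       sleep=time.sleep):
    """Call operation(), retrying on `retry_on` with exponential backoff.

    The wait before retry k is drawn uniformly from
    [0, min(max_delay, base_delay * 2**(k-1))] ("full jitter"), which
    spreads out reconnect attempts from many clients after a failover.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the original error.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


# Usage with a stand-in query that fails twice, then succeeds.
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("primary unreachable")
    return {"ok": 1}


print(retry_with_backoff(flaky_query, sleep=lambda seconds: None))
```

Only wrap operations that are safe to repeat. A write that may already have been applied before the connection dropped needs an idempotency check before it is retried.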

Testing the Failover Mechanism

Thorough testing is paramount. You can simulate a primary instance failure by manually deleting the primary DocumentDB instance. This will force DocumentDB to initiate its failover process. Monitor the CloudWatch alarm and the Route 53 DNS record changes. Verify that your application can reconnect and resume operations after the failover is complete.

# First, identify the current primary (writer) instance
PRIMARY_ID=$(aws docdb describe-db-clusters --db-cluster-identifier my-mongo-cluster --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter==`true`].DBInstanceIdentifier' --output text)
echo "Current primary: $PRIMARY_ID"

# Then, delete the identified primary instance (USE WITH CAUTION IN PRODUCTION)
# aws docdb delete-db-instance --db-instance-identifier "$PRIMARY_ID"

After initiating the deletion, observe the CloudWatch alarm state and the Route 53 record. You should see the alarm transition to ALARM, triggering the Lambda function, which in turn updates the Route 53 record to point to the new primary’s endpoint. Test your application’s connectivity and functionality against the DNS record.
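
A less destructive way to exercise the same path is a forced failover, which promotes a replica without deleting anything. The sketch below assumes a boto3 DocumentDB client is passed in; the helper names pick_failover_target and force_failover are hypothetical, and choosing the alphabetically first replica is an arbitrary rule for illustration. Whether a forced failover trips the connection-based alarm depends on how quickly clients reconnect, so treat it as a complement to the deletion test rather than a replacement.

```python
def pick_failover_target(cluster):
    """Return (current_writer, replica_to_promote) from a DBClusters entry."""
    members = cluster["DBClusterMembers"]
    writers = [m["DBInstanceIdentifier"] for m in members if m["IsClusterWriter"]]
    replicas = [m["DBInstanceIdentifier"] for m in members if not m["IsClusterWriter"]]
    if not writers or not replicas:
        raise ValueError("need one writer and at least one replica to fail over")
    # Arbitrary but deterministic choice: first replica by name.
    return writers[0], sorted(replicas)[0]


def force_failover(client, cluster_id):
    """Promote a replica via the failover API instead of deleting the primary."""
    response = client.describe_db_clusters(DBClusterIdentifier=cluster_id)
    writer, target = pick_failover_target(response["DBClusters"][0])
    client.failover_db_cluster(
        DBClusterIdentifier=cluster_id,
        TargetDBInstanceIdentifier=target,
    )
    return writer, target


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   old, new = force_failover(boto3.client("docdb"), "my-mongo-cluster")
#   print(f"Requested failover from {old} to {new}")
```

Passing the client in as a parameter keeps the helper easy to test against a stub before pointing it at a real cluster.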

Advanced Considerations and Alternatives

Using Route 53 Failover Routing Policy Directly: For simpler scenarios where you can directly monitor an endpoint (e.g., a web server), Route 53’s built-in failover routing policy is an option. However, for managed services like DocumentDB, where direct endpoint health checks can be complex or unreliable, the CloudWatch alarm + Lambda approach is generally more robust.

Application-Level Failover: MongoDB drivers that connect in replica set mode discover the current primary themselves and reconnect after a failover, and some ORMs build on this. While useful, this relies on the application itself to detect failures and re-establish connections. Integrating it with a DNS-based failover provides a multi-layered approach to resilience.

Multi-Region Failover: For true disaster recovery across geographic regions, you would extend this architecture. This involves setting up cross-region replication for DocumentDB (for example, with global clusters), using Route 53 failover, latency-based, or geolocation routing, and potentially a more complex Lambda function or AWS Step Functions workflow to manage failover across regions. This is significantly more complex and costly.

Load Balancers: For applications that are not directly connecting to MongoDB but rather through an intermediary (e.g., a custom API layer), you would implement the failover at the load balancer level (e.g., AWS ELB) pointing to your API instances, and your API instances would use the DNS-based MongoDB failover described above.