Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and Magento 2 Deployments on AWS

Establishing a Multi-AZ MongoDB Replica Set for High Availability

A robust disaster recovery strategy for MongoDB hinges on a well-configured replica set distributed across multiple Availability Zones (AZs) within AWS. This ensures that if one AZ experiences an outage, the remaining nodes can maintain service availability. We’ll focus on a three-node replica set (primary, secondary, secondary) as a baseline, with considerations for scaling to larger deployments.

The core of this setup involves deploying MongoDB instances on EC2, configuring them as a replica set, and leveraging AWS networking constructs like Security Groups and VPCs to isolate and secure the deployment.

EC2 Instance Configuration and MongoDB Installation

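For production, choose instances with enough I/O headroom. Storage-optimized families such as `i3en` or `i4i` provide fast local NVMe, but that instance store is ephemeral, so keep durable data on performance-tuned EBS volumes or be prepared to resync a replaced node from the other members. A typical setup might involve: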

Three EC2 instances, each in a different AZ within the same AWS region.
Each instance running a recent, stable version of MongoDB Community Edition.
EBS volumes attached for data storage, ideally gp3 (General Purpose SSD with IOPS and throughput provisioned as needed) or io2 (Provisioned IOPS SSD).

Installation can be automated using user data scripts or configuration management tools like Ansible. Here’s a simplified example for Ubuntu:

MongoDB Installation Script (User Data Example)

#!/bin/bash
set -euxo pipefail

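# Add the MongoDB 6.0 repository (apt-key is deprecated, so use a dedicated keyring)
apt-get update
apt-get install -y gnupg curl
curl -fsSL https://www.mongodb.org/static/pgp/server-6.0.asc | gpg --dearmor -o /usr/share/keyrings/mongodb-server-6.0.gpg
# "focal" targets Ubuntu 20.04; change the codename to match your release
echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-6.0.gpg ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/6.0 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-6.0.list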

# Update package list and install MongoDB
apt-get update
apt-get install -y mongodb-org

# Start and enable MongoDB service
systemctl start mongod
systemctl enable mongod

# Configure MongoDB for replica set (details below)
# ...

Replica Set Configuration

The critical step is configuring the replica set. This involves editing the MongoDB configuration file (typically /etc/mongod.conf) on each instance and then initiating the replica set from one of the nodes.

MongoDB Configuration File (`/etc/mongod.conf`)

Every node must use the same replication.replSetName; the member list itself is not stored in this file but is defined later when the replica set is initiated. For internal communication, set net.bindIp so that the other nodes can connect, ideally by binding to localhost plus the instance’s private IP address rather than to all interfaces.

# /etc/mongod.conf on each node
storage:
  dbPath: /var/lib/mongodb
  # Journaling is always on for replica set members; storage.journal.enabled
  # was removed in MongoDB 6.1, so it is omitted here.
  wiredTiger:
    engineConfig:
      cacheSizeGB: 0.5 # Adjust based on instance RAM
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  port: 27017
  bindIp: 0.0.0.0 # Prefer "localhost,<private-ip>" to limit exposure
processManagement:
  # Do not set fork: true when mongod is managed by the Ubuntu package's
  # systemd unit, which expects the process to stay in the foreground.
  timeZoneInfo: /usr/share/zoneinfo
security:
  keyFile: /etc/mongo-keyfile # Must be identical on every member
  authorization: enabled # Implied by keyFile; stated here for clarity
replication:
  replSetName: "rs0" # The name of your replica set
# Do not add sharding.clusterRole here. It applies only to members of a
# sharded cluster (config servers or shards), not to this basic replica set.

Security Note: The keyFile is the shared secret that replica set members use to authenticate to each other, and enabling it also enforces client authorization. It must be generated securely (e.g., using openssl rand -base64 756), be identical on all replica set members, and be owned by the user that runs mongod with strict file permissions (chmod 400).
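The key file step can also be sketched in Python. This is a minimal sketch rather than an official tool: the function names `generate_key` and `write_keyfile` are hypothetical, and the key is produced by base64-encoding 756 random bytes, which mirrors the size of the openssl command's output.

```python
import base64
import os
import secrets
import stat


def generate_key(num_bytes: int = 756) -> str:
    """Return base64 text built from `num_bytes` of cryptographically strong randomness."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


def write_keyfile(path: str, key: str) -> None:
    """Create the key file readable by its owner only (the equivalent of chmod 400)."""
    # O_EXCL refuses to overwrite an existing key file by accident.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o400)
    with os.fdopen(fd, "w") as handle:
        handle.write(key + "\n")
    # Set the mode explicitly so the result does not depend on the umask.
    os.chmod(path, stat.S_IRUSR)
```

Calling `write_keyfile("mongo-keyfile", generate_key())` once produces a file that can then be copied to /etc/mongo-keyfile on every member over a secure channel; generating a separate key per node would prevent the members from authenticating to each other.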

Initiating the Replica Set

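Once MongoDB is installed and configured on all nodes, connect to one of them (preferably the one intended to be the initial primary) with mongosh and initiate the replica set. MongoDB 6.0 packages no longer ship the legacy mongo shell. Because authorization is enabled and no users exist yet, run the shell on the node itself so that the localhost exception applies. Ensure your Security Group allows traffic on port 27017 between the replica set members.

# On one of the MongoDB instances
mongosh --host localhost --port 27017

# Inside mongosh:
# Replace each <node-N-private-dns> placeholder with that member's private DNS name or IP
rs.initiate(
  {
    _id : "rs0",
    members: [
      { _id : 0, host : "<node-0-private-dns>:27017" },
      { _id : 1, host : "<node-1-private-dns>:27017" },
      { _id : 2, host : "<node-2-private-dns>:27017" }
    ]
  }
)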

After initiation, you can check the status with rs.status(). The replica set will elect a primary node. Applications should connect with a replica set connection string that lists the members as seed hosts and names the set (e.g., mongodb://<host1>:27017,<host2>:27017,<host3>:27017/?replicaSet=rs0) rather than pointing at a single host, so the driver can discover the current primary and follow it after a failover.
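A small helper can keep such connection strings consistent across services. The following is a sketch using only the Python standard library; the function name `build_replica_set_uri` and the `mongo-*.internal` host names are hypothetical, and a real deployment would pass the resulting string to its MongoDB driver.

```python
from urllib.parse import quote_plus, urlencode


def build_replica_set_uri(hosts, replica_set, username=None, password=None,
                          database="", **options):
    """Build a mongodb:// URI that seeds the driver with every replica set member."""
    if not hosts:
        raise ValueError("at least one seed host is required")
    credentials = ""
    if username is not None:
        # Percent-encode credentials so characters such as '@' or ':' stay valid.
        credentials = f"{quote_plus(username)}:{quote_plus(password or '')}@"
    query = urlencode({"replicaSet": replica_set, **options})
    return f"mongodb://{credentials}{','.join(hosts)}/{database}?{query}"


uri = build_replica_set_uri(
    ["mongo-a.internal:27017", "mongo-b.internal:27017", "mongo-c.internal:27017"],
    "rs0",
    readPreference="primaryPreferred",
)
print(uri)
```

Listing all three members means the driver can still bootstrap when any single node, or its whole AZ, is unreachable.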

Automating Magento 2 Failover with AWS Services

Magento 2 deployments introduce additional complexity due to their reliance on multiple components: web servers (Nginx/Apache), PHP-FPM, database (MySQL/MariaDB), and potentially Redis/Varnish. Achieving auto-failover requires orchestrating these components across redundant infrastructure.

Database Failover (RDS Multi-AZ)

For the primary database, AWS Relational Database Service (RDS) with Multi-AZ deployment is the most straightforward and robust solution. RDS automatically provisions and maintains a synchronous standby replica in a different Availability Zone. In the event of a primary instance failure, RDS automatically fails over to the standby, typically within one to two minutes. The endpoint DNS name for your RDS instance remains the same, so the application configuration does not change; clients only need to reconnect once the failover completes.
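Because the endpoint stays the same, the client side of a failover mostly comes down to reconnecting patiently. Magento handles its own database connections in PHP, so the following Python sketch is only an illustration of the retry-with-backoff idea for auxiliary scripts or workers; `connect_with_retry` and its parameters are hypothetical names, and `connect` stands for whatever callable opens the connection.

```python
import time


def connect_with_retry(connect, attempts=6, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call `connect()` until it succeeds, backing off exponentially between tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except OSError as error:  # socket-level failures such as refused or timed out
            last_error = error
            if attempt < attempts - 1:
                # Delays grow 1s, 2s, 4s, ... and are capped at max_delay.
                sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

The `sleep` parameter is injectable so the behaviour can be exercised in tests without actually waiting.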

Web Server and Application Layer Redundancy

For the Magento application itself, a common pattern involves:

Auto Scaling Groups (ASG): To maintain a desired number of healthy EC2 instances running Magento.
Elastic Load Balancer (ELB): To distribute traffic across healthy instances and perform health checks.
Shared Storage: For media files (pub/media) and potentially session storage.

Auto Scaling Group Configuration

An ASG ensures that if an instance fails, a new one is launched to replace it. The ASG should be configured to span multiple Availability Zones.

# Example ASG Launch Template Configuration (Conceptual)
# This would be configured via AWS Console, CLI, or IaC tools like Terraform/CloudFormation

# Instance Type: e.g., t3.medium or m5.large
# AMI: Custom AMI with Magento, Nginx, PHP-FPM pre-installed and configured
# Security Group: Allows inbound traffic from ELB on ports 80/443, outbound to RDS, Redis, etc.
# User Data Script: For any post-launch configuration (e.g., joining a cluster, fetching latest code)
# EBS Volumes: Attached for Magento installation, logs, etc.
# IAM Role: For accessing other AWS services (e.g., S3 for media)

# Auto Scaling Group Settings:
# Desired Capacity: e.g., 2
# Min Size: e.g., 1
# Max Size: e.g., 4
# Availability Zones: us-east-1a, us-east-1b, us-east-1c
# Health Check Type: ELB (preferred) or EC2
# Health Check Grace Period: e.g., 300 seconds (to allow instances to start up)
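
The settings above map onto the parameters of the Auto Scaling CreateAutoScalingGroup API. As a hedged sketch, the helper below only assembles and sanity-checks the request; the function name `build_asg_request` and every argument value you pass in (group name, launch template name, subnet IDs, target group ARN) are assumptions for illustration.

```python
def build_asg_request(name, launch_template_name, subnet_ids, target_group_arns,
                      min_size=1, max_size=4, desired_capacity=2, grace_period=300):
    """Return keyword arguments for the Auto Scaling CreateAutoScalingGroup call."""
    if not min_size <= desired_capacity <= max_size:
        raise ValueError("desired capacity must lie between min and max size")
    # Assumes one subnet per Availability Zone, so two subnets imply two AZs.
    if len(set(subnet_ids)) < 2:
        raise ValueError("use subnets in at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {"LaunchTemplateName": launch_template_name,
                           "Version": "$Latest"},
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired_capacity,
        "VPCZoneIdentifier": ",".join(subnet_ids),  # comma-separated subnet IDs
        "TargetGroupARNs": list(target_group_arns),
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": grace_period,
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**request)`; in practice most teams would express the same settings in Terraform or CloudFormation.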

Elastic Load Balancer (ELB) Configuration

An Application Load Balancer (ALB) is recommended for HTTP/S traffic. It will route traffic to healthy EC2 instances within the ASG.

# Example ALB Configuration (Conceptual)

# Listener: HTTP:80, HTTPS:443 (with ACM certificate)
# Target Group:
#   Protocol: HTTP
#   Port: 80 (or 8080 if Nginx listens on a different port)
#   VPC: Your Magento VPC
#   Health Checks:
#     Protocol: HTTP
#     Path: /health_check.php (a simple PHP file returning 200 OK)
#     Interval: 30 seconds
#     Timeout: 5 seconds
#     Healthy Threshold: 2
#     Unhealthy Threshold: 2
#   Targets: Registered EC2 instances from the ASG
#   Availability Zones: Enabled for all AZs where the ASG operates

The health check path (e.g., /health_check.php) is critical. This script should perform a minimal check, such as verifying database connectivity and returning an HTTP 200 status code. If the script fails, the ELB will mark the instance as unhealthy and stop sending traffic to it.
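The interval and threshold values determine how quickly a failed instance stops receiving traffic. The arithmetic can be sketched as below, under a simplified model that assumes evenly spaced checks and a target that fails every check once it breaks; the function name is hypothetical and real detection times will vary.

```python
def detection_window_seconds(interval, unhealthy_threshold, timeout):
    """Estimate (best, worst) seconds before a broken target is marked unhealthy.

    Best case: the target breaks just before a check, so the remaining
    (threshold - 1) checks are each one interval apart.
    Worst case: it breaks just after a check passes, so a full interval elapses
    before the first failure, and the last check may hang until the timeout.
    """
    best = interval * (unhealthy_threshold - 1)
    worst = interval * unhealthy_threshold + timeout
    return best, worst


# Values from the example target group: 30 s interval, threshold 2, 5 s timeout.
print(detection_window_seconds(30, 2, 5))
```

With the example settings this gives roughly 30 to 65 seconds, which is worth keeping in mind when setting recovery time objectives; shortening the interval reduces the window at the cost of more health check traffic.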

Shared Storage for Media

Magento’s pub/media directory must be accessible by all web servers. AWS Elastic File System (EFS) is a good choice for this. It provides a managed NFS file system that can be mounted by multiple EC2 instances across different AZs.

# Example EFS Mount Command (on EC2 instance)
sudo apt-get update
sudo apt-get install -y nfs-common

# Create mount point
sudo mkdir -p /var/www/html/your_magento_root/pub/media

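# Mount EFS (replace <efs-mount-target-dns-name> with your EFS mount target DNS name)
# Note: "tls" is an option of the EFS mount helper (mount -t efs, from amazon-efs-utils),
# not of the plain NFS client, so it is not used here.
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport <efs-mount-target-dns-name>:/ /var/www/html/your_magento_root/pub/media

# Add to /etc/fstab for persistence
echo "<efs-mount-target-dns-name>:/ /var/www/html/your_magento_root/pub/media nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev 0 0" | sudo tee -a /etc/fstab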

Ensure the Security Group for your EFS mount targets allows inbound NFS traffic (TCP port 2049) from your Magento web server Security Group.

Orchestrating Failover with AWS Lambda and EventBridge

While RDS Multi-AZ and ASGs handle most of the failover, there are scenarios that might require custom logic, such as failing over a custom-built caching layer or triggering specific post-failover tasks. AWS Lambda and EventBridge can be used to build these custom failover workflows.

Example: Custom Redis Failover Trigger

If you’re using a self-managed Redis cluster, you might need a mechanism to detect failure and promote a replica. This can be achieved by:

A CloudWatch Alarm monitoring Redis health metrics (e.g., connection errors, latency).
An EventBridge rule triggered by the CloudWatch Alarm.
A Lambda function invoked by EventBridge to execute Redis failover commands against a Sentinel (e.g., redis-cli -p 26379 SENTINEL FAILOVER <master-name>).

The Lambda function would need appropriate IAM permissions to interact with the Redis instances (e.g., via Systems Manager Run Command or direct network access if configured). This approach is more complex and generally less preferred than using a managed service like ElastiCache for Redis with Multi-AZ enabled.
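The alarm-to-Lambda path could be sketched as follows. This is an illustration under several assumptions, not a production handler: the instance ID and the master name `mymaster` are placeholders, the target instance is assumed to run a Sentinel on port 26379 and to be managed by Systems Manager, and the extra keyword arguments exist only so the function can be exercised without AWS access.

```python
import shlex


def build_failover_command(master_name, sentinel_port=26379):
    """Return the redis-cli command that asks Sentinel to fail over `master_name`."""
    return f"redis-cli -p {int(sentinel_port)} SENTINEL FAILOVER {shlex.quote(master_name)}"


def handler(event, context, ssm_client=None,
            instance_id="i-0123456789abcdef0",  # placeholder instance ID
            master_name="mymaster"):            # placeholder Sentinel master name
    """Lambda entry point for a CloudWatch alarm state-change event from EventBridge."""
    state = event.get("detail", {}).get("state", {}).get("value")
    if state != "ALARM":
        # Ignore transitions back to OK or INSUFFICIENT_DATA.
        return {"action": "ignored", "state": state}
    if ssm_client is None:
        import boto3  # included in the AWS Lambda Python runtime
        ssm_client = boto3.client("ssm")
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [build_failover_command(master_name)]},
    )
    return {"action": "failover_requested",
            "command_id": response["Command"]["CommandId"]}
```

The function's execution role would need ssm:SendCommand on the target instance and document. Note that Sentinel already performs automatic failover on its own, so a trigger like this is mainly useful for forcing a failover in situations Sentinel does not detect.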

Deployment and Code Updates

Automated deployments are crucial for disaster recovery. Using CI/CD pipelines (e.g., AWS CodePipeline, Jenkins, GitLab CI) to build and deploy new versions of Magento to the ASG ensures that failover instances are running the latest stable code. Blue/Green deployments or Canary releases can further minimize risk during updates.

Monitoring and Testing for Resilience

A disaster recovery plan is only effective if it’s regularly tested and monitored. Comprehensive monitoring is key to detecting failures early and verifying that failover mechanisms are functioning as expected.

Key Monitoring Metrics

RDS: CPU utilization, freeable memory, disk I/O, replica lag (for read replicas, if used), database connections.
EC2 Instances (Magento App Servers): CPU, memory, disk I/O, network traffic, application-specific metrics (e.g., request latency, error rates).
ELB: Healthy/unhealthy host counts, request counts, latency, HTTP error codes (5xx).
MongoDB: Network traffic, disk I/O, oplog window, replication lag, connection counts, query performance.
EFS: Throughput, latency.

AWS CloudWatch is the primary tool for collecting and visualizing these metrics. Set up alarms for critical thresholds to proactively alert your operations team.
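MongoDB replication lag is not a built-in CloudWatch metric for self-managed nodes, so it usually has to be derived and published as a custom metric. As a sketch, the function below computes per-secondary lag from a document shaped like the output of rs.status() (the `members`, `name`, `stateStr`, and `optimeDate` fields); the function name is hypothetical, and fetching the status and publishing the result are left out.

```python
from datetime import datetime, timedelta


def replication_lag_seconds(status):
    """Map each SECONDARY's name to its lag behind the PRIMARY, in seconds."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise RuntimeError("no PRIMARY found; an election may be in progress")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hand-written sample data, not output from a real cluster.
now = datetime(2024, 1, 1, 12, 0, 0)
sample_status = {
    "members": [
        {"name": "node-0:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "node-1:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=2)},
        {"name": "node-2:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=45)},
    ]
}
print(replication_lag_seconds(sample_status))
```

An alarm on the largest value catches a secondary that is falling behind before it becomes a poor failover candidate.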

Regular Failover Testing

The most critical aspect of DR is validation. Schedule regular, controlled failover tests. This involves:

Simulating failures: Terminate an EC2 instance, reboot the RDS primary instance with the failover option (simply stopping the instance does not exercise Multi-AZ failover), or simulate network partitions.
Observing failover: Monitor the ELB health checks, ASG instance replacement, and RDS failover process.
Verifying application functionality: Perform basic user journeys on the application to ensure it’s accessible and functional after failover.
Documenting results: Record the time taken for failover, any issues encountered, and lessons learned.

These tests should be performed at least quarterly, and ideally more frequently for critical systems. They provide confidence in the automated failover mechanisms and highlight areas for improvement.
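To make the recorded failover time objective, a test run can log probe results and compute the outage afterwards. The following sketch assumes a list of (timestamp, ok) pairs collected by any external prober, such as a loop requesting the storefront every few seconds; `longest_outage` is a hypothetical helper name.

```python
def longest_outage(samples):
    """Return the longest continuous failure window, in seconds.

    `samples` is an iterable of (timestamp_seconds, ok) pairs in time order.
    An outage runs from the first failed probe to the next successful one.
    """
    longest = 0.0
    outage_start = None
    last_timestamp = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
        last_timestamp = timestamp
    if outage_start is not None and last_timestamp is not None:
        # Still failing at the end of the run: count up to the last probe.
        longest = max(longest, last_timestamp - outage_start)
    return longest
```

The resolution of the result is limited by the probe interval, so a 5-second probe cannot distinguish a 61-second outage from a 64-second one; that is usually precise enough for comparing quarterly test runs.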