Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and C++ Deployments on AWS
Designing for Resiliency: MongoDB Auto-Failover with AWS EC2 and Route 53
Achieving true disaster recovery for a critical MongoDB deployment necessitates an automated failover strategy. This isn’t about manual intervention during an outage; it’s about a system that detects failure and seamlessly redirects traffic to a healthy replica set member. We’ll architect this using AWS EC2 instances for MongoDB, leveraging Route 53 for DNS-based failover, and a custom health check mechanism.
MongoDB Replica Set Configuration for High Availability
A robust MongoDB deployment for HA relies on a replica set. For our auto-failover scenario, we’ll assume a minimum of three nodes to ensure a quorum even if one node fails. The primary node handles all write operations, while secondaries replicate the data. If the primary becomes unavailable, the remaining members elect a new primary.
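The three-node minimum follows from majority arithmetic. As a rough illustration (the helper names `majority` and `fault_tolerance` are made up for this sketch, and it assumes every member votes), the number of failures a replica set can absorb while still electing a primary works out as:

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: a strict majority of voting members."""
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """How many voting members can be lost while a majority remains."""
    return voting_members - majority(voting_members)


for n in (2, 3, 4, 5):
    print(f"{n} members: majority={majority(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Two members tolerate no failures, and a fourth member adds no tolerance over three, which is why odd member counts are the usual choice.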
Here’s a sample MongoDB configuration file snippet for a replica set member:
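# /etc/mongod.conf
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true   # omit on MongoDB 6.1+, where journaling is always on and this option was removed
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  bindIp: 0.0.0.0   # listens on all interfaces; restrict access with security groups
  port: 27017
security:
  authorization: enabled
  # With authorization enabled, members also need internal authentication (e.g. security.keyFile).
replication:
  replSetName: "rs0"
# No sharding section: clusterRole (configsvr or shardsvr) applies only to members of a sharded cluster.
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid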
To initialize the replica set, connect to one of the MongoDB instances (preferably the one intended to be the initial primary) and run:
rs.initiate(
  {
    _id: "rs0",
    members: [
      { _id: 0, host: "mongodb-node-1.example.com:27017" },
      { _id: 1, host: "mongodb-node-2.example.com:27017" },
      { _id: 2, host: "mongodb-node-3.example.com:27017" }
    ]
  }
)
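Replace mongodb-node-X.example.com with the actual private IP addresses or resolvable hostnames of your EC2 instances. Drivers connect to members using these exact names, so they must also resolve from your application hosts. Ensure the instances can reach each other on port 27017, for example by placing them in the same VPC with a security group rule that allows traffic between members.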
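If you prefer to script the initialization, the same configuration document can be built and sent from Python. This is a minimal sketch, not the only way to do it: `build_replset_config` and `initiate` are hypothetical helper names, the hostnames are the placeholders used above, and `initiate` assumes PyMongo is installed and the target mongod is reachable.

```python
def build_replset_config(name, hosts, port=27017):
    """Return a replSetInitiate config document for the given hosts."""
    if len(set(hosts)) != len(hosts):
        raise ValueError("duplicate host in replica set config")
    return {
        "_id": name,
        "members": [
            {"_id": i, "host": f"{host}:{port}"} for i, host in enumerate(hosts)
        ],
    }


def initiate(seed_host, config):
    """Send the config to one member. Requires PyMongo and a reachable mongod."""
    from pymongo import MongoClient  # imported lazily so the builder works without PyMongo

    # directConnection is needed because the node is not yet part of a replica set.
    client = MongoClient(seed_host, 27017, directConnection=True)
    return client.admin.command("replSetInitiate", config)


config = build_replset_config(
    "rs0",
    [
        "mongodb-node-1.example.com",
        "mongodb-node-2.example.com",
        "mongodb-node-3.example.com",
    ],
)
```

Calling `initiate("mongodb-node-1.example.com", config)` would then play the role of the rs.initiate() call shown above.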
AWS Route 53 Health Checks and DNS Failover
Route 53’s health checking capabilities are crucial for detecting the failure of the primary MongoDB node. We’ll configure a custom health check that probes the MongoDB replica set status.
Step 1: Create a Health Check in Route 53
- Navigate to the Route 53 console.
- Go to “Health checks” and click “Create health check”.
- Name: MongoDBPrimaryHealthCheck
- What to monitor: “Endpoint”
- Protocol: “TCP”
- Port: 27017
- IP address or domain name: Enter the address of your current primary MongoDB node. Route 53 health checkers probe from the public internet, so the address must be publicly reachable and the security group must allow the Route 53 health checker IP ranges; a private IP address cannot be probed this way.
- Advanced configuration: Set the “Request interval” (Route 53 offers 10 or 30 seconds) and “Failure threshold” (e.g., 3).
- Enable this health check: Yes
- Click “Create health check”.
This initial health check is a starting point: a TCP check only confirms that the port accepts connections, not that the node is the primary. A more sophisticated approach involves a health check that queries the replica set status. However, Route 53’s native health checks cannot execute arbitrary commands. For a truly robust check, we’ll need an intermediary script.
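The console steps for the TCP health check can also be expressed with boto3. The sketch below only assumes the settings listed in Step 1; `tcp_health_check_request` and `create_tcp_health_check` are hypothetical helper names, and the IP address is a documentation-range placeholder rather than a real node.

```python
import uuid


def tcp_health_check_request(ip_address, port=27017, interval=30, threshold=3):
    """Build keyword arguments for Route 53 create_health_check (TCP probe)."""
    if interval not in (10, 30):
        raise ValueError("Route 53 supports request intervals of 10 or 30 seconds")
    return {
        "CallerReference": str(uuid.uuid4()),  # idempotency token, unique per request
        "HealthCheckConfig": {
            "IPAddress": ip_address,
            "Port": port,
            "Type": "TCP",
            "RequestInterval": interval,
            "FailureThreshold": threshold,
        },
    }


def create_tcp_health_check(ip_address):
    """Create the health check. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without boto3

    route53 = boto3.client("route53")
    response = route53.create_health_check(**tcp_health_check_request(ip_address))
    return response["HealthCheck"]["Id"]


request = tcp_health_check_request("203.0.113.10")  # placeholder address
```

Keeping the request builder separate from the API call makes the settings easy to review before anything is created in your account.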
Custom Health Check Script for Replica Set Status
We’ll deploy a small script on a separate EC2 instance (or even on one of the MongoDB nodes, though a separate instance is more resilient) that periodically checks the replica set status and reports its health to Route 53. This script uses the PyMongo driver to run replSetGetStatus, the command behind rs.status().
Step 1: Create the Health Check Script (Python)
# /opt/scripts/mongodb_health_check.py
import pymongo
import os
import sys
import logging

# Configuration
MONGO_HOST = os.environ.get("MONGO_HOST", "localhost")  # Default to localhost if not set
MONGO_PORT = int(os.environ.get("MONGO_PORT", 27017))
REPLICA_SET_NAME = os.environ.get("REPLICA_SET_NAME", "rs0")
ROUTE53_HEALTH_CHECK_ID = os.environ.get("ROUTE53_HEALTH_CHECK_ID")
AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")
LOG_FILE = "/var/log/mongodb_health_check.log"

# Setup logging
logging.basicConfig(filename=LOG_FILE, level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
def check_mongo_health():
    """Return True if the replica set currently reports a primary."""
    try:
        client = pymongo.MongoClient(MONGO_HOST, MONGO_PORT, serverSelectionTimeoutMS=5000)
        # ismaster is a cheap connectivity probe that works before authentication.
        client.admin.command('ismaster')
        # replSetGetStatus needs an authenticated user (e.g. the clusterMonitor role)
        # when authorization is enabled; no credentials are passed here.
        rs_status = client.admin.command('replSetGetStatus')
        primary_member = None
        for member in rs_status.get('members', []):
            if member.get('stateStr') == 'PRIMARY':
                primary_member = member
                break
        if primary_member:
            logging.info(f"MongoDB health check successful. Primary: {primary_member.get('name')}")
            return True
        else:
            logging.warning("MongoDB health check failed: No primary found.")
            return False
    except pymongo.errors.ConnectionFailure as e:
        logging.error(f"MongoDB health check failed: Connection error - {e}")
        return False
    except Exception as e:
        logging.error(f"MongoDB health check failed: Unexpected error - {e}")
        return False
def report_health_to_route53(is_healthy):
    """Record the result as a HealthStatus tag on the Route 53 health check."""
    if not ROUTE53_HEALTH_CHECK_ID:
        logging.warning("ROUTE53_HEALTH_CHECK_ID not set. Skipping Route 53 reporting.")
        return
    try:
        import boto3
        client = boto3.client('route53', region_name=AWS_REGION)
        health_status = "Healthy" if is_healthy else "Unhealthy"
        # AddTags takes a list of Key/Value pairs, not a plain dict.
        client.change_tags_for_resource(
            ResourceType='healthcheck',
            ResourceId=ROUTE53_HEALTH_CHECK_ID,
            AddTags=[{'Key': 'HealthStatus', 'Value': health_status}]
        )
        logging.info(f"Reported health to Route 53: {health_status}")
    except ImportError:
        logging.error("Boto3 not installed. Cannot report to Route 53.")
    except Exception as e:
        logging.error(f"Failed to report health to Route 53: {e}")
if __name__ == "__main__":
    is_healthy = check_mongo_health()
    report_health_to_route53(is_healthy)
    sys.exit(0 if is_healthy else 1)
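The decision logic in the script can be tested without a live cluster by separating it from the network call. The sketch below is one possible refactoring, not part of the original script: `summarize_replset` is a hypothetical helper, the sample document only mimics the `members`, `name`, `stateStr`, and `health` fields of a replSetGetStatus reply, and the majority test assumes every member votes.

```python
def summarize_replset(rs_status):
    """Summarize a replSetGetStatus-style document."""
    members = rs_status.get("members", [])
    primaries = [m["name"] for m in members if m.get("stateStr") == "PRIMARY"]
    healthy = [m["name"] for m in members if m.get("health") == 1]
    return {
        "primary": primaries[0] if primaries else None,
        "healthy_members": len(healthy),
        # Assumes all members vote; a strict majority must be reachable.
        "has_majority": len(healthy) > len(members) // 2,
    }


# Hand-written sample shaped like a replSetGetStatus reply (not real output).
sample = {
    "set": "rs0",
    "members": [
        {"name": "mongodb-node-1.example.com:27017", "stateStr": "PRIMARY", "health": 1},
        {"name": "mongodb-node-2.example.com:27017", "stateStr": "SECONDARY", "health": 1},
        {"name": "mongodb-node-3.example.com:27017", "stateStr": "(not reachable/healthy)", "health": 0},
    ],
}

summary = summarize_replset(sample)
```

In the script, `check_mongo_health` could pass the result of `client.admin.command('replSetGetStatus')` to such a function and base its return value on `summary["primary"]`.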
Step 2: Install Dependencies and Configure IAM Role
- Install Python, PyMongo, and Boto3: sudo apt-get update && sudo apt-get install -y python3 python3-pip && pip3 install pymongo boto3 (or the equivalent for your OS).
- Create an IAM role for the EC2 instance running this script. This role needs permission to call route53:ChangeTagsForResource.
- Attach this IAM role to the EC2 instance.
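A least-privilege policy for that role might look like the following. This is a sketch under assumptions: `tagging_policy` is a hypothetical helper, the health check ID is a placeholder, and the resource ARN follows the `arn:aws:route53:::healthcheck/<id>` pattern, which you should confirm against the IAM documentation for Route 53 before use.

```python
import json


def tagging_policy(health_check_id):
    """Build an IAM policy document allowing tag changes on one health check."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "route53:ChangeTagsForResource",
                "Resource": f"arn:aws:route53:::healthcheck/{health_check_id}",
            }
        ],
    }


policy = tagging_policy("EXAMPLE-HEALTH-CHECK-ID")  # placeholder ID
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy on the role.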
Step 3: Create a Route 53 Health Check that uses the script’s output
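The script above records its result in a tag named HealthStatus on the Route 53 health check resource. Route 53 does not evaluate tags when it computes health, so the tag is informational only. For Route 53 to act on the script’s result, create a *separate* endpoint health check against the instance running the script. This assumes that instance serves an HTTP(S) endpoint that succeeds only while the replica set check passes, which the script above does not do by itself.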
- In Route 53, create a new health check.
- Name: MongoDBReplicaSetHealthCheck
- What to monitor: “Endpoint”
- Protocol: “HTTPS” (or “HTTP”, matching what the script host serves; Route 53 does connect to this endpoint)
- Port: 443 (or the port the script host listens on)
- IP address or domain name: Use the public IP address of the EC2 instance running the script.
- Advanced configuration:
  - Request interval: 30 seconds
  - Failure threshold: 3
  - Health check regions: Keep the recommended set, or select at least three regions close to your deployment.
- Enable this health check: Yes
- Click “Create health check”.
After creating this health check, open its details and confirm under “Health checker regions” that the selected regions can reach the script host. A health check affects DNS answers only once it is associated with a Route 53 record. Keep in mind that the HealthStatus tag lives on the *health check resource itself*, and Route 53 does not interpret tags or script output when deciding on failover. A more common pattern is to have the script update a record set directly or trigger an external mechanism, such as a CloudWatch alarm that a Route 53 health check monitors.
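Route 53 endpoint health checks treat an HTTP 2xx or 3xx response as healthy, so one way to let Route 53 act on a replica set check is to serve its result over HTTP from the script host. The sketch below uses only the Python standard library; `make_handler`, `serve`, and port 8080 are assumptions for illustration, and `check` stands for any callable that returns True when the replica set is healthy, such as `check_mongo_health` from the script above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(check):
    """Build a handler class that answers 200 when check() is true, else 503."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            try:
                healthy = bool(check())
            except Exception:
                healthy = False  # treat a crashing check as unhealthy
            body = b"OK" if healthy else b"UNHEALTHY"
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep probe traffic out of stderr

    return HealthHandler


def serve(check, port=8080):
    """Serve health responses until interrupted."""
    HTTPServer(("0.0.0.0", port), make_handler(check)).serve_forever()
```

Running `serve(check_mongo_health)` on the script host would give the endpoint health check something to probe. Because each request runs the check synchronously, a production version would likely cache the last result so a slow MongoDB call does not exceed the health checker's response timeout.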
Revised Route 53 Strategy: Using a Health Check to Monitor a Record Set
A more direct approach for DNS failover is to have Route 53 monitor a specific record set and use a health check that can actually probe the MongoDB primary. If the primary fails, the health check fails, and Route 53 automatically fails over the record set.
- Create a Route 53 Record Set:
  - Go to your Hosted Zone.
  - Click “Create record”.
  - Record name: mongo.yourdomain.com (or your application’s MongoDB endpoint).
  - Record type: A
  - Value: The private IP address of your *current primary* MongoDB node.
  - Routing policy: “Failover”
  - Failover record type: “Primary”
  - Set ID: PrimaryMongoDB
  - Associate with health check: Select the MongoDBPrimaryHealthCheck you created earlier (the one that probes port 27017).
  - Click “Create records”.
- Create a Secondary Record Set:
  - Click “Create record” again.
  - Record name: mongo.yourdomain.com
  - Record type: A
  - Value: The private IP address of a *healthy secondary* node (the node you expect to take over; the replica set election, not DNS, decides which member actually becomes primary).
  - Routing policy: “Failover”
  - Failover record type: “Secondary”
  - Set ID: SecondaryMongoDB
  - Associate with health check: Create a *new* health check that probes the *current secondary* node (e.g., MongoDBSecondaryHealthCheck). A health check on the secondary record is optional, but without one Route 53 always treats that record as healthy.
  - Click “Create records”.
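The two failover records can also be created through the Route 53 API. The following sketch builds the change batch for boto3's `change_resource_record_sets`; `failover_record` and `apply_changes` are hypothetical helpers, and the IP addresses and health check IDs are placeholders, not values from a real deployment.

```python
def failover_record(name, ip, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def apply_changes(hosted_zone_id, changes):
    """Submit the changes. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders work without boto3

    route53 = boto3.client("route53")
    return route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch={"Changes": changes}
    )


changes = [
    failover_record("mongo.yourdomain.com", "10.0.1.10", "PRIMARY",
                    "PrimaryMongoDB", "placeholder-primary-check-id"),
    failover_record("mongo.yourdomain.com", "10.0.2.10", "SECONDARY",
                    "SecondaryMongoDB", "placeholder-secondary-check-id"),
]
```

Both records share the same name and type and differ only in `SetIdentifier`, `Failover` role, value, and health check, which mirrors the console steps above.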
With this setup, if the primary node (associated with PrimaryMongoDB) becomes unhealthy, Route 53 will automatically start returning the IP address of the secondary node (associated with SecondaryMongoDB) for mongo.yourdomain.com. Clients see the change only after their cached DNS answer expires, so keep the record TTL short (for example, 60 seconds or less). Your application should be configured to connect to mongo.yourdomain.com.
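How long a DNS failover takes can be estimated from the health check settings and the record TTL. The helper below is a back-of-the-envelope sketch (the function name is made up, and it ignores health checker aggregation, resolver behavior, and the replica set election itself), so treat the result as an order-of-magnitude figure rather than a guarantee.

```python
def worst_case_failover_seconds(request_interval, failure_threshold, ttl):
    """Rough upper bound: time to mark unhealthy plus time for DNS caches to expire."""
    detection = request_interval * failure_threshold
    return detection + ttl


# 30-second interval, threshold of 3, 60-second TTL
print(worst_case_failover_seconds(30, 3, 60))  # 150
```

Dropping the interval to 10 seconds with the same threshold and TTL brings the estimate down to 90 seconds.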
C++ Application Integration for MongoDB Failover
Your C++ application needs to be aware of the DNS endpoint and handle potential connection errors gracefully. The key is to connect to the DNS name (e.g., mongo.yourdomain.com) rather than a static IP. The MongoDB C++ driver typically handles replica set discovery and failover internally if configured correctly.
Step 1: MongoDB C++ Driver Connection String
#include <iostream>
#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/exception/exception.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

int main() {
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_document;

    mongocxx::instance instance{};  // must outlive all other driver objects

    // Use the DNS name that Route 53 manages. Timeouts are URI options:
    // connectTimeoutMS bounds each socket connect, serverSelectionTimeoutMS
    // bounds how long an operation waits for a suitable server.
    mongocxx::uri uri(
        "mongodb://mongo.yourdomain.com:27017/"
        "?replicaSet=rs0&connectTimeoutMS=5000&serverSelectionTimeoutMS=10000");

    try {
        mongocxx::client client(uri);
        auto db = client["mydatabase"];
        auto collection = db["mycollection"];

        // Creating the client does not open a connection; ping forces server selection.
        db.run_command(make_document(kvp("ping", 1)));
        std::cout << "Successfully connected to MongoDB replica set." << std::endl;

        // Example write operation
        collection.insert_one(make_document(kvp("hello", "world")));
        std::cout << "Insert operation successful." << std::endl;
    } catch (const mongocxx::exception& e) {
        std::cerr << "MongoDB connection or operation failed: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
When the connection string names the replica set, the MongoDB C++ driver uses the Route 53 name only as a seed: it discovers the topology from that node and then connects to members by the host names in the replica set configuration, so those names must be resolvable from the application. If the primary node fails, the driver will detect the change (via heartbeats or failed operations) and initiate a new server selection process, eventually connecting to the newly elected primary. The serverSelectionTimeoutMS option is critical here; it defines how long the driver will wait to find a suitable server before giving up.
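Operations that arrive during an election can still fail once the server selection timeout expires, so applications often wrap them in a bounded retry. The logic is language-neutral; it is sketched here in Python to match the health check script, and a C++ application would mirror it around its mongocxx calls. `with_retries` and its parameters are hypothetical names, and `ConnectionError` stands in for whatever exception type your driver raises.

```python
import time


def with_retries(operation, attempts=5, base_delay=0.5, max_delay=8.0,
                 retry_on=(ConnectionError,), sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on selected errors."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Only retry operations that are safe to repeat; a non-idempotent write that actually reached the old primary could otherwise be applied twice.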
Monitoring and Alerting
Automated failover is only part of the solution. Comprehensive monitoring and alerting are essential to ensure the system is functioning as expected and to be notified of failures that require investigation.
- Route 53 Health Check Status: Monitor the health check status in the AWS console. Set up CloudWatch Alarms based on the health check status.
- MongoDB Replica Set Status: Use rs.status(), rs.printSecondaryReplicationInfo(), rs.printReplicationInfo(), or custom scripts to monitor replica set lag, oplog window, and member states; mongostat and mongotop add per-node throughput and activity.
- Application Logs: Ensure your C++ application logs connection errors and other relevant information.
- EC2 Instance Metrics: Monitor CPU, memory, and network utilization of your MongoDB and health check EC2 instances.
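As an example of the custom scripts mentioned in the list above, replication lag can be derived from the `optimeDate` fields of a replSetGetStatus reply. This is a minimal sketch: `replication_lag_seconds` is a hypothetical helper, and the sample document is hand-written to resemble the reply rather than captured from a real cluster.

```python
from datetime import datetime, timedelta


def replication_lag_seconds(rs_status):
    """Map each secondary's name to its lag behind the primary, in seconds."""
    members = rs_status.get("members", [])
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        return {}  # no primary, so lag is undefined
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m.get("stateStr") == "SECONDARY"
    }


now = datetime(2024, 1, 1, 12, 0, 0)  # fixed timestamp for the illustration
sample = {
    "members": [
        {"name": "node-1:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "node-2:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=2)},
        {"name": "node-3:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=45)},
    ]
}

lag = replication_lag_seconds(sample)
```

A monitoring job could publish these values as a custom metric and alert when a secondary falls further behind than your recovery point objective allows.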
By combining Route 53’s DNS failover capabilities with a well-configured MongoDB replica set and a resilient C++ application, you can achieve a robust disaster recovery strategy for your critical data infrastructure on AWS.