Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and Perl Deployments on DigitalOcean
Establishing a MongoDB Replica Set for High Availability
A robust disaster recovery strategy for MongoDB hinges on a properly configured replica set, which provides data redundancy and automatic failover. We’ll focus on a three-member replica set deployed across different DigitalOcean availability zones for maximum resilience. The set consists of one primary, one data-bearing secondary, and one arbiter. The arbiter votes in elections to maintain quorum but stores no data, so it never serves reads and can never become primary. (A member cannot be both a hidden secondary and an arbiter: a hidden member holds a full copy of the data, while an arbiter holds none.)
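Quorum here means a strict majority of the voting members. As a rough illustration of why three members is the usual minimum, the sketch below computes the majority and the number of members that can be lost while an election is still possible. The function names `majority` and `fault_tolerance` are made up for this example.

```shell
#!/usr/bin/env bash
# Votes needed to elect a primary, given the number of voting members.
majority() {
  local voters="$1"
  echo $(( voters / 2 + 1 ))
}

# Voting members that can be lost while an election can still succeed.
fault_tolerance() {
  local voters="$1"
  echo $(( voters - $(majority "$voters") ))
}

for n in 2 3 4 5; do
  echo "voters=$n majority=$(majority "$n") tolerates=$(fault_tolerance "$n")"
done
```

A two-member set tolerates no failures, which is why the third voting member is added even when it stores no data.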
MongoDB Deployment and Configuration
Assume you have three DigitalOcean Droplets provisioned, each with MongoDB installed. We’ll use private networking for inter-node communication. Ensure your MongoDB configuration files (`/etc/mongod.conf`) are set up to bind to the private IP address and enable replication.
Node 1 (Primary) Configuration (`/etc/mongod.conf`)
# /etc/mongod.conf on Node 1 (Primary)
net:
  port: 27017
  bindIp: 10.10.0.1,127.0.0.1 # Replace with Node 1's private IP
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true # Omit on MongoDB 6.1+, where journaling is always enabled
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
replication:
  replSetName: myReplicaSet
# No sharding section: this is a plain replica set member, not a config server.
# Setting clusterRole on only one member would prevent the set from forming.
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
security:
  keyFile: /etc/mongo-keyfile
  authorization: enabled
Node 2 (Secondary) Configuration (`/etc/mongod.conf`)
# /etc/mongod.conf on Node 2 (Secondary)
net:
  port: 27017
  bindIp: 10.10.0.2,127.0.0.1 # Replace with Node 2's private IP
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
replication:
  replSetName: myReplicaSet
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
security:
  keyFile: /etc/mongo-keyfile
  authorization: enabled
Node 3 (Arbiter) Configuration (`/etc/mongod.conf`)
# /etc/mongod.conf on Node 3 (Arbiter)
net:
  port: 27017
  bindIp: 10.10.0.3,127.0.0.1 # Replace with Node 3's private IP
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
replication:
  replSetName: myReplicaSet
# No sharding configuration here as it's not a config server
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
security:
  keyFile: /etc/mongo-keyfile
  authorization: enabled
Setting up the Replica Set
First, generate a secure key file and distribute it to all nodes. Ensure file permissions are restrictive.
Generate and Distribute Key File
On one of the nodes (e.g., Node 1):
# On Node 1
openssl rand -base64 756 > /etc/mongo-keyfile
chmod 400 /etc/mongo-keyfile
chown mongodb:mongodb /etc/mongo-keyfile
Then, securely copy this key file to the other nodes:
# On Node 1: copy the key file to Node 2 and Node 3
# (<user> stands for an SSH account on the target node that can write to /etc)
scp /etc/mongo-keyfile <user>@10.10.0.2:/etc/mongo-keyfile
scp /etc/mongo-keyfile <user>@10.10.0.3:/etc/mongo-keyfile

# On Node 2 and Node 3, set permissions
chmod 400 /etc/mongo-keyfile
chown mongodb:mongodb /etc/mongo-keyfile
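MongoDB refuses to start on Unix systems if the key file is readable by group or others, so it can help to verify the permissions on every node before restarting `mongod`. The following is a minimal sketch of such a check. The function name `keyfile_perms_ok` is made up for this example, and it assumes GNU `stat` (as shipped on typical DigitalOcean Linux images).

```shell
#!/usr/bin/env bash
# Returns 0 if the file exists and grants no access to group or others.
keyfile_perms_ok() {
  local file="$1" mode
  [ -f "$file" ] || return 1
  # GNU stat prints the octal mode, e.g. 400 or 644
  mode=$(stat -c '%a' "$file") || return 1
  case "$mode" in
    *00) return 0 ;;  # last two digits (group, others) are both 0
    *)   return 1 ;;
  esac
}

if keyfile_perms_ok /etc/mongo-keyfile; then
  echo "key file permissions look fine"
else
  echo "key file is missing or too permissive" >&2
fi
```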
Initiate the Replica Set
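Connect to Node 1 using the MongoDB shell and initiate the replica set. The command below assumes an administrative user already exists. On a brand-new deployment with `authorization: enabled`, connect from the node itself without credentials instead (the localhost exception allows this until the first user is created), run `rs.initiate()`, and then create the administrative user on the primary.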
# On Node 1
mongo --port 27017 --username adminUser --password 'adminPassword' --authenticationDatabase admin

// Inside the mongo shell:
rs.initiate(
  {
    _id: "myReplicaSet",
    members: [
      { _id: 0, host: "10.10.0.1:27017" },                   // Node 1 (expected primary)
      { _id: 1, host: "10.10.0.2:27017" },                   // Node 2 (secondary)
      { _id: 2, host: "10.10.0.3:27017", arbiterOnly: true } // Node 3 (arbiter)
    ]
  }
)
After initiation, check the replica set status:
// Inside the mongo shell on any node:
rs.status()
You should see one data-bearing member (typically Node 1) as PRIMARY, the other as SECONDARY, and Node 3 as ARBITER.
Perl Application Integration and Failover Handling
Your Perl application needs to be aware of the MongoDB replica set and handle potential failovers gracefully. This involves using the appropriate MongoDB driver and implementing connection logic that can adapt to primary changes.
Perl MongoDB Driver Configuration
We’ll use the `MongoDB` Perl module. Ensure it’s installed:
cpan MongoDB
Connecting to the Replica Set
The connection string should list all members of the replica set. The driver will automatically discover the primary.
use strict;
use warnings;
use MongoDB;

my $mongo_client;
my $db;

# Connection parameters
my $replica_set_name = "myReplicaSet";
my @hosts = (
    "10.10.0.1:27017",    # Node 1 (Primary)
    "10.10.0.2:27017",    # Node 2 (Secondary)
    "10.10.0.3:27017",    # Node 3 (Arbiter)
);
my $database_name = "myAppDB";
my $username      = "appUser";
my $password      = "appPassword";

# Construct the connection URI
my $connection_uri = "mongodb://" . join(",", @hosts) . "/?replicaSet=$replica_set_name";

eval {
    $mongo_client = MongoDB::MongoClient->new(
        host     => $connection_uri,
        ssl      => 0,          # Set to 1 if using SSL
        username => $username,
        password => $password,
        db_name  => "admin",    # Database that holds the user's credentials
        connect_timeout_ms          => 5000,     # 5 seconds
        server_selection_timeout_ms => 10000,    # 10 seconds
    );
    $db = $mongo_client->get_database($database_name);

    # Perform a simple operation to verify the connection.
    # count_documents needs driver v2.0+; older releases only offer count().
    my $collection = $db->get_collection("status");
    my $doc_count  = $collection->count_documents({});
    print "Successfully connected to MongoDB. Document count in 'status' collection: $doc_count\n";
    1;
} or do {
    warn "Failed to connect to MongoDB: $@\n";
    # Implement retry logic or alert mechanism here
    exit 1;
};

# Your application logic using $db goes here...
# Example:
# my $users_collection = $db->get_collection("users");
# $users_collection->insert_one({ name => "Test User", timestamp => time });
Handling Connection Errors and Failovers
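The `MongoDB` Perl driver monitors the replica set and selects the new primary after a failover. The `server_selection_timeout_ms` parameter controls how long an operation waits for a suitable server, so it must be long enough to cover an election when the current primary becomes unavailable. Operations that are in flight when the primary fails can still raise errors, so the application should catch them and retry. For more sophisticated handling, you can implement event listeners or periodic health checks.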
use strict;
use warnings;
use MongoDB;

# ... (previous connection setup) ...

# Example of a simple retry mechanism
sub get_mongo_connection {
    my ($attempts, $delay) = @_;

    for my $attempt (1 .. $attempts) {
        my ($client, $db_handle);

        # A `return` inside eval only leaves the eval block, not the sub,
        # so record success in $ok and return from the sub afterwards.
        my $ok = eval {
            $client = MongoDB::MongoClient->new(
                host     => $connection_uri,
                ssl      => 0,
                username => $username,
                password => $password,
                db_name  => "admin",
                connect_timeout_ms          => 5000,
                server_selection_timeout_ms => 10000,
            );
            $db_handle = $client->get_database($database_name);
            # Test the connection with a cheap server round trip
            $db_handle->run_command([ ping => 1 ]);
            1;
        };

        if ($ok) {
            print "Connection successful on attempt $attempt\n";
            return ($client, $db_handle);
        }

        warn "Connection attempt $attempt failed: $@\n";
        sleep $delay if $attempt < $attempts;
    }

    return (undef, undef); # All attempts failed
}

my ($mongo_client, $db) = get_mongo_connection(5, 10); # Try 5 times with 10s delay
if (!$db) {
    die "Failed to establish MongoDB connection after multiple retries.\n";
}
# Use $db for your application logic...
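The same retry idea is useful for the shell tooling that surrounds the application, such as deploy hooks or cron jobs that must wait for a primary. The sketch below is a generic wrapper and not part of the Perl driver. The function name `retry` is made up for this example, and it uses exponential backoff (doubling the delay after each failure) rather than the fixed delay shown in the Perl code.

```shell
#!/usr/bin/env bash
# retry <max_attempts> <initial_delay_seconds> <command> [args...]
# Runs the command until it succeeds, doubling the delay after each failure.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Harmless demonstration; in practice the command might be something like
#   retry 5 2 mongo --host 10.10.0.1:27017 --quiet --eval 'db.runCommand({ ping: 1 })'
retry 3 1 true && echo "command succeeded"
```

Backoff keeps a recovering cluster from being hit at a constant rate by every waiting client, at the cost of slower recovery detection on later attempts.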
Automated Failover Orchestration with DigitalOcean and External Monitoring
While MongoDB handles replica set failover internally, orchestrating the *detection* of a failure and potential *application-level adjustments* requires external tooling. For this, we’ll leverage DigitalOcean’s monitoring capabilities and a simple external script.
DigitalOcean Monitoring and Alerting
DigitalOcean’s built-in monitoring can alert you when a Droplet becomes unresponsive. However, this is often too late for automatic failover. A more proactive approach involves custom health checks.
Custom Health Check Script
We can deploy a small script on a separate, reliable Droplet (or even a server outside DigitalOcean) that periodically checks the health of the MongoDB primary. This script will use the Perl MongoDB driver to attempt a write operation to a dedicated “heartbeat” collection.
# heartbeat_checker.pl
use MongoDB;
use strict;
use warnings;
use Time::HiRes qw(sleep);

# --- Configuration ---
my $replica_set_name = "myReplicaSet";
my @hosts = (
    "10.10.0.1:27017",
    "10.10.0.2:27017",
    "10.10.0.3:27017"
);
my $database_name = "admin"; # Use admin DB for health checks
my $username = "healthCheckerUser"; # Dedicated user with write access to heartbeat collection
my $password = "healthCheckerPassword";
my $heartbeat_collection_name = "heartbeat";
my $check_interval_seconds = 15;
my $connection_timeout_ms = 3000;
my $server_selection_timeout_ms = 5000;
my $alert_threshold_seconds = 60; # If heartbeat is older than this, consider it a failure
my $primary_node_ip = "10.10.0.1"; # Expected primary IP

my $connection_uri = "mongodb://" . join(",", @hosts) . "/?replicaSet=$replica_set_name";
sub check_mongodb_health {
    # eval yields undef if anything inside dies, otherwise the block's last value.
    # (A `return` inside eval would only leave the eval block, not this sub.)
    my $ok = eval {
        my $client = MongoDB::MongoClient->new(
            host     => $connection_uri,
            ssl      => 0,
            username => $username,
            password => $password,
            db_name  => "admin",
            connect_timeout_ms          => $connection_timeout_ms,
            server_selection_timeout_ms => $server_selection_timeout_ms,
        );
        my $db         = $client->get_database($database_name);
        my $collection = $db->get_collection($heartbeat_collection_name);

        # Attempt a write operation; only a primary can accept it
        my $result = $collection->update_one(
            { _id => "heartbeat" },
            { '$set' => { timestamp => time, node_ip => $primary_node_ip } },
            { upsert => 1 },
        );

        # Either an existing document matched or a new one was upserted
        my $written = ($result->matched_count || defined $result->upserted_id) ? 1 : 0;
        if ($written) {
            print "Heartbeat successful. Timestamp: " . time . "\n";
        }
        else {
            warn "Heartbeat write operation did not report success.\n";
        }
        $written;
    };
    if (!defined $ok) {
        warn "MongoDB connection or operation failed: $@\n";
        return 0; # Failure
    }
    return $ok;
}
sub check_heartbeat_freshness {
    # As above: capture the eval result instead of returning from inside it
    my $fresh = eval {
        my $client = MongoDB::MongoClient->new(
            host     => $connection_uri,
            ssl      => 0,
            username => $username,
            password => $password,
            db_name  => "admin",
            connect_timeout_ms          => $connection_timeout_ms,
            server_selection_timeout_ms => $server_selection_timeout_ms,
        );
        my $db         = $client->get_database($database_name);
        my $collection = $db->get_collection($heartbeat_collection_name);

        my $heartbeat_doc = $collection->find_one({ _id => "heartbeat" });
        my $is_fresh = 0;
        if ($heartbeat_doc) {
            my $last_timestamp = $heartbeat_doc->{timestamp};
            if (time - $last_timestamp > $alert_threshold_seconds) {
                warn "Heartbeat is stale! Last update: " . scalar localtime($last_timestamp) . "\n";
            }
            else {
                print "Heartbeat is fresh. Last update: " . scalar localtime($last_timestamp) . "\n";
                $is_fresh = 1;
            }
        }
        else {
            warn "Heartbeat document not found.\n"; # Stale (or never written)
        }
        $is_fresh;
    };
    if (!defined $fresh) {
        warn "MongoDB connection or operation failed during freshness check: $@\n";
        return 0; # Failure
    }
    return $fresh;
}
# --- Main Loop ---
my $last_successful_heartbeat = time;
while (1) {
    if (check_mongodb_health()) {
        $last_successful_heartbeat = time;
    } else {
        # If the direct health check fails, try checking freshness
        if (!check_heartbeat_freshness()) {
            my $current_time = time;
            if ($current_time - $last_successful_heartbeat > $alert_threshold_seconds) {
                # Trigger alert/failover action
                print "ALERT: MongoDB primary appears to be down or unresponsive for more than $alert_threshold_seconds seconds!\n";
                # In a real-world scenario, this would trigger an automated failover script
                # For demonstration, we'll just exit and let a supervisor restart this script
                # or trigger an external notification system.
                # system("curl -X POST -d 'message=MongoDB Primary Failure Alert' YOUR_ALERTING_WEBHOOK");
                exit 1; # Exit to allow supervisor to restart
            } else {
                print "MongoDB unresponsive, but within acceptable stale threshold. Waiting...\n";
            }
        }
    }
    sleep $check_interval_seconds;
}
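The staleness rule used by the checker is a plain comparison of epoch timestamps: the heartbeat counts as stale once its age exceeds the threshold. The sketch below restates that rule as a shell helper, which can be handy when a cron job or a second monitor needs to apply the same test. The function name `heartbeat_is_stale` is made up for this example, and the current time can be passed in explicitly so the logic is easy to check.

```shell
#!/usr/bin/env bash
# heartbeat_is_stale <last_epoch> <threshold_seconds> [now_epoch]
# Exit status 0 means stale, non-zero means fresh.
heartbeat_is_stale() {
  local last="$1" threshold="$2" now="${3:-$(date +%s)}"
  [ $(( now - last )) -gt "$threshold" ]
}

# Worked example with fixed timestamps and the 60-second threshold used above
if heartbeat_is_stale 1000 60 1075; then
  echo "age 75s: stale"
fi
if ! heartbeat_is_stale 1000 60 1060; then
  echo "age 60s: still fresh (the age must exceed the threshold)"
fi
```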
Automated Failover Script (Conceptual)
When the `heartbeat_checker.pl` script detects a failure (e.g., by exiting with a non-zero status), a supervisor process (like `systemd` or `supervisord`) should catch this. The supervisor can then trigger a more complex failover script. This script would:
- Verify the primary is indeed down (e.g., by pinging or attempting a connection from multiple points).
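- If confirmed and a majority of voting members is still up, wait for the replica set to elect a new primary on its own. If the old primary is still reachable but unhealthy, run `rs.stepDown()` on it to force an election (the command only works on the current primary). Only when a majority has been lost should you fall back to a forced `rs.reconfig()` on a surviving member. This is a complex operation and requires careful scripting to avoid split-brain scenarios.
- Update DNS records (e.g., using DigitalOcean’s API) to point applications to the new primary. Applications that connect with the full replica set URI find the new primary on their own, so this step matters mainly for clients that connect through a single hostname.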
- Notify operations teams via Slack, PagerDuty, etc.
A simplified example of a failover trigger script (run by `systemd` or similar):
#!/bin/bash
# failover_orchestrator.sh
# This script is triggered when heartbeat_checker.pl exits with an error.

LOG_FILE="/var/log/mongodb_failover.log"
MONGO_PRIMARY_IP="10.10.0.1"    # Current expected primary
MONGO_SECONDARY_IP="10.10.0.2"  # A known secondary
REPLICA_SET_NAME="myReplicaSet"
ADMIN_USER="adminUser"
ADMIN_PASS="adminPassword"      # Store securely, e.g., in env vars or a secret manager
NEW_PRIMARY=""

log() { echo "$(date): $*" >> "$LOG_FILE"; }

log "Failover detected. Initiating orchestration."

# 1. Verify primary is down (basic ping check)
if ping -c 1 -W 2 "$MONGO_PRIMARY_IP" > /dev/null 2>&1; then
    log "Primary $MONGO_PRIMARY_IP is still reachable. Aborting automated failover."
    exit 0 # Host answers ping; note this does not prove mongod itself is healthy
fi
log "Primary $MONGO_PRIMARY_IP confirmed unreachable."

# 2. Wait for the replica set to elect a new primary.
# rs.stepDown() only works on the current primary, so running it on a secondary
# cannot promote that secondary. With a majority of members still up, the
# election happens automatically and we only need to poll for the result.
# This is a simplified example. In production, use a dedicated user with appropriate roles.
log "Waiting for a new primary, polling via $MONGO_SECONDARY_IP."
for attempt in 1 2 3 4 5 6; do
    NEW_PRIMARY=$(mongo --host "$MONGO_SECONDARY_IP:27017" \
        --username "$ADMIN_USER" --password "$ADMIN_PASS" \
        --authenticationDatabase admin --quiet \
        --eval "var p = rs.status().members.filter(function (m) { return m.stateStr === 'PRIMARY'; })[0]; print(p ? p.name : '');" \
        2>> "$LOG_FILE")
    [ -n "$NEW_PRIMARY" ] && break
    sleep 5
done

if [ -n "$NEW_PRIMARY" ]; then
    log "New primary identified: $NEW_PRIMARY"
    # 3. Update DNS (example using DigitalOcean API - requires DO_API_TOKEN)
    # This part is highly dependent on your DNS setup and API credentials.
    # Example: Update A record for 'mongodb.yourdomain.com'
    # curl -X PUT "https://api.digitalocean.com/v2/domains/yourdomain.com/records/RECORD_ID" \
    #   -d "{\"type\":\"A\",\"name\":\"mongodb\",\"data\":\"$(echo $NEW_PRIMARY | cut -d: -f1)\"}" \
    #   -H "Authorization: Bearer $DO_API_TOKEN"
    # log "DNS updated for new primary."
else
    log "No new primary was elected in time. Manual intervention may be required."
fi

# 4. Send notifications (e.g., Slack, PagerDuty)
# curl -X POST -d 'payload={"text": "MongoDB failover initiated. New primary: '$NEW_PRIMARY'"}' YOUR_SLACK_WEBHOOK_URL

log "Failover orchestration script finished."
exit 0
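The ping check in step 1 relies on a single vantage point, so a network problem on the monitoring Droplet alone could look like a primary failure. A minimal sketch of the "verify from multiple points" idea is shown below: several independent probes are run and the primary is treated as down only if enough of them fail. The function name `confirm_down` and the host `monitor2` in the usage comment are made up for this example.

```shell
#!/usr/bin/env bash
# confirm_down <required_failures> <probe_command>...
# Each probe is a command string that exits 0 when the target is reachable.
# Returns 0 (confirmed down) only if at least <required_failures> probes fail.
confirm_down() {
  local required="$1" failures=0 probe
  shift
  for probe in "$@"; do
    if ! sh -c "$probe" > /dev/null 2>&1; then
      failures=$(( failures + 1 ))
    fi
  done
  [ "$failures" -ge "$required" ]
}

# In practice the probes might look like this (monitor2 is a hypothetical second host):
#   confirm_down 2 \
#     "ping -c 1 -W 2 10.10.0.1" \
#     "nc -z -w 2 10.10.0.1 27017" \
#     "ssh monitor2 ping -c 1 -W 2 10.10.0.1"
# Harmless demonstration with stand-in probes: two fail, one succeeds.
if confirm_down 2 "false" "false" "true"; then
  echo "primary confirmed down by at least 2 probes"
fi
```

Requiring agreement between probes reduces false positives, but it does not replace the replica set's own election logic, which remains the authority on who is primary.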
Systemd Service for Health Checker
To ensure the `heartbeat_checker.pl` script runs reliably and restarts on failure, configure it as a `systemd` service.
# /etc/systemd/system/mongodb-heartbeat.service
[Unit]
Description=MongoDB Heartbeat Checker
After=network.target

[Service]
# Run as a non-root user (systemd does not support trailing comments on a setting line)
User=your_app_user
Group=your_app_group
WorkingDirectory=/path/to/your/scripts
ExecStart=/usr/bin/perl /path/to/your/scripts/heartbeat_checker.pl
Restart=on-failure
RestartSec=10
StandardOutput=append:/var/log/mongodb_heartbeat.log
StandardError=append:/var/log/mongodb_heartbeat.log

[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable mongodb-heartbeat.service
sudo systemctl start mongodb-heartbeat.service
sudo systemctl status mongodb-heartbeat.service
Triggering the Failover Orchestrator
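Modify the `mongodb-heartbeat.service` to also run the `failover_orchestrator.sh` script upon failure. This can be achieved by chaining them or using `ExecStopPost` if the orchestrator is designed to be run once per failure event. Note that `ExecStopPost` runs every time the service stops, including a clean `systemctl stop`, so the orchestrator should inspect the `$SERVICE_RESULT` and `$EXIT_STATUS` variables that systemd passes to it and do nothing unless the checker actually failed.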
# /etc/systemd/system/mongodb-heartbeat.service (modified)
[Unit]
Description=MongoDB Heartbeat Checker and Failover Orchestrator
After=network.target

[Service]
User=your_app_user
Group=your_app_group
WorkingDirectory=/path/to/your/scripts
ExecStart=/usr/bin/perl /path/to/your/scripts/heartbeat_checker.pl
# When heartbeat_checker.pl exits with non-zero status, run the orchestrator
ExecStopPost=/bin/bash /path/to/your/scripts/failover_orchestrator.sh
Restart=on-failure
RestartSec=10
StandardOutput=append:/var/log/mongodb_heartbeat.log
StandardError=append:/var/log/mongodb_heartbeat.log

[Install]
WantedBy=multi-user.target
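Because `ExecStopPost` commands run on every stop of the service, not only after a crash, the orchestrator needs a guard at the top. systemd exports `SERVICE_RESULT`, `EXIT_CODE`, and `EXIT_STATUS` to these commands, and the sketch below uses them to decide whether to proceed. The function name `should_run_failover` is made up for this example, and treating only a non-zero exit code as a trigger is one possible policy, not the only one.

```shell
#!/usr/bin/env bash
# Returns 0 only when systemd reports that the main process exited with a
# non-zero status (SERVICE_RESULT=exit-code). A clean stop reports "success".
should_run_failover() {
  [ "${SERVICE_RESULT:-}" = "exit-code" ] && [ "${EXIT_STATUS:-0}" != "0" ]
}

if should_run_failover; then
  echo "heartbeat checker failed with status ${EXIT_STATUS}; starting failover"
else
  echo "service stopped cleanly or was run by hand; nothing to do"
fi
```

Placed at the start of `failover_orchestrator.sh`, a guard like this keeps routine restarts and deployments from triggering the failover path.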
This setup provides a layered approach to disaster recovery: MongoDB’s internal replica set ensures data availability and automatic failover within the cluster, while external monitoring and orchestration scripts provide proactive detection and automated recovery actions, including DNS updates, to minimize application downtime.