Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and Perl Deployments on DigitalOcean

Establishing a MongoDB Replica Set for High Availability

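A robust disaster recovery strategy for MongoDB hinges on a properly configured replica set, which provides data redundancy and automatic failover. We’ll focus on a three-member replica set spread across three DigitalOcean Droplets. DigitalOcean exposes datacenter regions rather than availability zones, and a VPC network is typically scoped to one region, so the private-networking setup below assumes all three Droplets share a region. The set consists of one primary, one data-bearing secondary, and one arbiter. The arbiter votes in elections so the set keeps a majority when one member fails, but it stores no data and serves no reads. Note that a member cannot be both a hidden secondary and an arbiter: a hidden member holds a full copy of the data, while an arbiter holds none.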
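
The quorum arithmetic behind the three-member choice can be checked with a small shell sketch. The function names `majority` and `fault_tolerance` are illustrative and not part of any MongoDB tooling; the rule they encode (a primary needs votes from a strict majority of voting members) is standard replica set behavior.

```shell
#!/bin/sh
# Votes needed to elect a primary: a strict majority of voting members.
majority() {
    echo $(( $1 / 2 + 1 ))
}

# Number of voting members that can fail while a primary can still be elected.
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3 4 5; do
    echo "members=$n majority=$(majority "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A four-member set tolerates no more failures than a three-member set, which is why odd member counts are the usual choice.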

MongoDB Deployment and Configuration

Assume you have three DigitalOcean Droplets provisioned, each with MongoDB installed. We’ll use private networking for inter-node communication. Ensure your MongoDB configuration files (`/etc/mongod.conf`) are set up to bind to the private IP address and enable replication.

Node 1 (Primary) Configuration (`/etc/mongod.conf`)

# /etc/mongod.conf on Node 1 (Primary)
net:
  port: 27017
  bindIp: 10.10.0.1,127.0.0.1  # Replace with Node 1's private IP
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
replication:
  replSetName: myReplicaSet
# No sharding section: a plain replica set member should not set clusterRole
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
security:
  keyFile: /etc/mongo-keyfile
  authorization: enabled

Node 2 (Secondary) Configuration (`/etc/mongod.conf`)

# /etc/mongod.conf on Node 2 (Secondary)
net:
  port: 27017
  bindIp: 10.10.0.2,127.0.0.1  # Replace with Node 2's private IP
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
replication:
  replSetName: myReplicaSet
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
security:
  keyFile: /etc/mongo-keyfile
  authorization: enabled

Node 3 (Arbiter) Configuration (`/etc/mongod.conf`)

# /etc/mongod.conf on Node 3 (Arbiter)
net:
  port: 27017
  bindIp: 10.10.0.3,127.0.0.1  # Replace with Node 3's private IP
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
replication:
  replSetName: myReplicaSet
# No sharding configuration here as it's not a config server
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
security:
  keyFile: /etc/mongo-keyfile
  authorization: enabled

Setting up the Replica Set

First, generate a secure key file and distribute it to all nodes. Ensure file permissions are restrictive.

Generate and Distribute Key File

On one of the nodes (e.g., Node 1):

# On Node 1
openssl rand -base64 756 > /etc/mongo-keyfile
chmod 400 /etc/mongo-keyfile
chown mongodb:mongodb /etc/mongo-keyfile

Then, securely copy this key file to the other nodes:

# On Node 1, to copy to Node 2 (replace <ssh-user> with your SSH login)
scp /etc/mongo-keyfile <ssh-user>@10.10.0.2:/etc/mongo-keyfile
# On Node 1, to copy to Node 3
scp /etc/mongo-keyfile <ssh-user>@10.10.0.3:/etc/mongo-keyfile

# On Node 2 and Node 3, set permissions
chmod 400 /etc/mongo-keyfile
chown mongodb:mongodb /etc/mongo-keyfile
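
Because `mongod` refuses to start when the key file is readable by group or other users, it can help to verify the permissions on each node before restarting the service. The following is a minimal sketch; the helper name `check_keyfile` is hypothetical, and it relies only on the mode string printed by `ls -ld`.

```shell
#!/bin/sh
# check_keyfile PATH: succeed only if PATH is a regular file with
# no permission bits set for group or other users.
check_keyfile() {
    keyfile=$1
    [ -f "$keyfile" ] || { echo "missing: $keyfile" >&2; return 1; }
    # Characters 5-10 of the ls mode string are the group and other bits.
    group_other=$(ls -ld "$keyfile" | cut -c5-10)
    if [ "$group_other" != "------" ]; then
        echo "unsafe permissions on $keyfile" >&2
        return 1
    fi
}

# Example (run on each node):
#   check_keyfile /etc/mongo-keyfile && echo "key file permissions look fine"
```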

Initiate the Replica Set

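Connect to Node 1 with the MongoDB shell and initiate the replica set. If an administrative user already exists, authenticate as shown. On a brand-new deployment with `authorization: enabled`, no users exist yet, so connect from Node 1 itself over localhost without credentials (MongoDB’s localhost exception allows this until the first user is created), run `rs.initiate()`, and then create the administrative user on the primary.

# On Node 1 (on MongoDB 6.0+, use mongosh; the legacy mongo shell is no longer shipped)
mongo --port 27017 --username adminUser --password 'adminPassword' --authenticationDatabase admin

// Inside the mongo shell:
rs.initiate(
  {
    _id : "myReplicaSet",
    members: [
      { _id: 0, host: "10.10.0.1:27017" }, // Node 1 (expected initial primary)
      { _id: 1, host: "10.10.0.2:27017" }, // Node 2 (secondary)
      { _id: 2, host: "10.10.0.3:27017", arbiterOnly: true } // Node 3 (arbiter; arbiters cannot also be hidden data members)
    ]
  }
)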

After initiation, check the replica set status:

// Inside the mongo shell on any node:
rs.status()

You should see Node 1 as PRIMARY, Node 2 as SECONDARY, and Node 3 as ARBITER.

Perl Application Integration and Failover Handling

Your Perl application needs to be aware of the MongoDB replica set and handle potential failovers gracefully. This involves using the appropriate MongoDB driver and implementing connection logic that can adapt to primary changes.

Perl MongoDB Driver Configuration

We’ll use the `MongoDB` Perl module. Ensure it’s installed:

cpan MongoDB

Connecting to the Replica Set

The connection string should list all members of the replica set. The driver will automatically discover the primary.

use MongoDB;
use strict;
use warnings;

my $mongo_client;
my $db;

# Connection parameters
my $replica_set_name = "myReplicaSet";
my @hosts = (
    "10.10.0.1:27017", # Node 1 (initial primary)
    "10.10.0.2:27017", # Node 2 (secondary)
    "10.10.0.3:27017"  # Node 3 (arbiter)
);
my $database_name = "myAppDB";
my $username = "appUser";
my $password = "appPassword";

# Construct the connection URI
my $connection_uri = "mongodb://" . join(",", @hosts) . "/?replicaSet=$replica_set_name";

eval {
    $mongo_client = MongoDB::MongoClient->new(
        host => $connection_uri,
        ssl => 0, # Set to 1 if using TLS/SSL
        username => $username,
        password => $password,
        db_name => "admin", # Database that holds the user's credentials
        connect_timeout_ms => 5000, # 5 seconds
        server_selection_timeout_ms => 10000 # 10 seconds
    );

    $db = $mongo_client->get_database($database_name);

    # The client connects lazily, so run a simple operation to verify the connection
    my $collection = $db->get_collection("status");
    # count_documents needs driver v2.0+; count() is deprecated there
    my $doc_count = $collection->count_documents({});
    print "Successfully connected to MongoDB. Document count in 'status' collection: $doc_count\n";
};
if ($@) {
    warn "Failed to connect to MongoDB: $@\n";
    # Implement retry logic or alert mechanism here
    exit 1;
}

# Your application logic using $db goes here...
# Example:
# my $users_collection = $db->get_collection("users");
# $users_collection->insert_one({ name => "Test User", timestamp => time });
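
The same seed-list URI is also what command-line tools such as `mongosh` accept. As a minimal sketch, the hypothetical helper `build_mongo_uri` below assembles it from a replica set name and a list of hosts; `replicaSet` and `authSource` are standard connection string options, while the helper itself is only an illustration.

```shell
#!/bin/sh
# build_mongo_uri RS_NAME HOST:PORT... -> replica set connection string
build_mongo_uri() {
    rs_name=$1
    shift
    hosts=""
    for h in "$@"; do
        # Join hosts with commas, without a leading comma.
        hosts="${hosts:+$hosts,}$h"
    done
    echo "mongodb://$hosts/?replicaSet=$rs_name&authSource=admin"
}

build_mongo_uri myReplicaSet 10.10.0.1:27017 10.10.0.2:27017 10.10.0.3:27017
```

Credentials are deliberately left out of the URI here so they do not end up in shell history; pass them separately or let the tool prompt for them.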

Handling Connection Errors and Failovers

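The `MongoDB` Perl driver monitors the replica set topology and routes operations to whichever member is currently primary. The `server_selection_timeout_ms` parameter bounds how long an operation waits for a suitable server, so it should comfortably exceed the time an election normally takes (MongoDB’s documentation cites roughly 12 seconds with default settings, which makes the 10-second value used here a tight fit). Operations in flight when the primary fails can still raise an error, so wrap them in retry logic and, for more sophisticated handling, add periodic health checks.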

use MongoDB;
use strict;
use warnings;

# ... (previous connection setup) ...

# Example of a simple retry mechanism
sub get_mongo_connection {
    my ($attempts, $delay) = @_;

    for my $attempt (1..$attempts) {
        my ($client, $db_handle);
        # eval yields 1 on success and undef if anything inside dies.
        # A "return" inside eval would only leave the eval block, not the sub.
        my $ok = eval {
            $client = MongoDB::MongoClient->new(
                host => $connection_uri,
                ssl => 0,
                username => $username,
                password => $password,
                db_name => "admin",
                connect_timeout_ms => 5000,
                server_selection_timeout_ms => 10000
            );
            $db_handle = $client->get_database($database_name);
            # Test the connection with a cheap round trip to the server
            $db_handle->run_command([ ping => 1 ]);
            1;
        };
        if ($ok) {
            print "Connection successful on attempt $attempt\n";
            return ($client, $db_handle);
        }
        warn "Connection attempt $attempt failed: $@\n";
        sleep $delay if $attempt < $attempts;
    }
    return (undef, undef); # Return undef if all attempts fail
}

my ($mongo_client, $db) = get_mongo_connection(5, 10); # Try 5 times with 10s delay

if (!$db) {
    die "Failed to establish MongoDB connection after multiple retries.\n";
}

# Use $db for your application logic...
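
The retry helper above waits a fixed delay between attempts. A common variation is exponential backoff, sketched here in shell for wrapping any command (for example, a script that pings the database). The function name `retry_with_backoff` is hypothetical, and doubling the delay is one reasonable policy rather than the only one.

```shell
#!/bin/sh
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after each failed attempt.
retry_with_backoff() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
            delay=$(( delay * 2 ))
        fi
        attempt=$(( attempt + 1 ))
    done
    return 1
}

# Example: up to 5 attempts, waiting 2s, 4s, 8s, 16s between them
#   retry_with_backoff 5 2 perl /path/to/your/scripts/check_connection.pl
```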

Automated Failover Orchestration with DigitalOcean and External Monitoring

While MongoDB handles replica set failover internally, orchestrating the *detection* of a failure and potential *application-level adjustments* requires external tooling. For this, we’ll leverage DigitalOcean’s monitoring capabilities and a simple external script.

DigitalOcean Monitoring and Alerting

DigitalOcean’s built-in monitoring can alert you when a Droplet becomes unresponsive. However, this is often too late for automatic failover. A more proactive approach involves custom health checks.

Custom Health Check Script

We can deploy a small script on a separate, reliable Droplet (or even a server outside DigitalOcean) that periodically checks the health of the MongoDB primary. This script will use the Perl MongoDB driver to attempt a write operation to a dedicated “heartbeat” collection.

# heartbeat_checker.pl
use MongoDB;
use strict;
use warnings;
use Time::HiRes qw(sleep);

# --- Configuration ---
my $replica_set_name = "myReplicaSet";
my @hosts = (
    "10.10.0.1:27017",
    "10.10.0.2:27017",
    "10.10.0.3:27017"
);
my $database_name = "admin"; # Use admin DB for health checks
my $username = "healthCheckerUser"; # Dedicated user with write access to heartbeat collection
my $password = "healthCheckerPassword";
my $heartbeat_collection_name = "heartbeat";
my $check_interval_seconds = 15;
my $connection_timeout_ms = 3000;
my $server_selection_timeout_ms = 5000;
my $alert_threshold_seconds = 60; # If heartbeat is older than this, consider it a failure
my $primary_node_ip = "10.10.0.1"; # Expected primary IP

my $connection_uri = "mongodb://" . join(",", @hosts) . "/?replicaSet=$replica_set_name";

sub check_mongodb_health {
    # eval yields the block's value (1 or 0), or undef if anything inside dies.
    # A "return" inside eval leaves only the eval block, not the sub.
    my $ok = eval {
        my $client = MongoDB::MongoClient->new(
            host => $connection_uri,
            ssl => 0,
            username => $username,
            password => $password,
            db_name => "admin",
            connect_timeout_ms => $connection_timeout_ms,
            server_selection_timeout_ms => $server_selection_timeout_ms
        );
        my $db = $client->get_database($database_name);
        my $collection = $db->get_collection($heartbeat_collection_name);

        # Attempt a write operation
        my $result = $collection->update_one(
            { _id => "heartbeat" },
            { '$set' => { timestamp => time, node_ip => $primary_node_ip } },
            { upsert => 1 }
        );

        # UpdateResult exposes matched_count and upserted_id (not upserted_count)
        if ($result->matched_count || defined $result->upserted_id) {
            print "Heartbeat successful. Timestamp: " . time . "\n";
            return 1; # Success
        }
        warn "Heartbeat write operation did not report success.\n";
        return 0; # Failure
    };
    if (!defined $ok) {
        warn "MongoDB connection or operation failed: $@\n";
        return 0; # Failure
    }
    return $ok;
}

sub check_heartbeat_freshness {
    # As above, capture the eval result; "return" inside eval does not leave the sub.
    my $ok = eval {
        my $client = MongoDB::MongoClient->new(
            # Allow reads from a secondary so this check still works without a primary
            host => $connection_uri . "&readPreference=secondaryPreferred",
            ssl => 0,
            username => $username,
            password => $password,
            db_name => "admin",
            connect_timeout_ms => $connection_timeout_ms,
            server_selection_timeout_ms => $server_selection_timeout_ms
        );
        my $db = $client->get_database($database_name);
        my $collection = $db->get_collection($heartbeat_collection_name);

        my $heartbeat_doc = $collection->find_one({ _id => "heartbeat" });

        if (!$heartbeat_doc) {
            warn "Heartbeat document not found.\n";
            return 0; # Stale (or never written)
        }

        my $last_timestamp = $heartbeat_doc->{timestamp};
        if (time - $last_timestamp > $alert_threshold_seconds) {
            warn "Heartbeat is stale! Last update: " . scalar localtime($last_timestamp) . "\n";
            return 0; # Stale
        }
        print "Heartbeat is fresh. Last update: " . scalar localtime($last_timestamp) . "\n";
        return 1; # Fresh
    };
    if (!defined $ok) {
        warn "MongoDB connection or operation failed during freshness check: $@\n";
        return 0; # Failure
    }
    return $ok;
}

# --- Main Loop ---
my $last_successful_heartbeat = time;

while (1) {
    if (check_mongodb_health()) {
        $last_successful_heartbeat = time;
    } else {
        # If the direct health check fails, try checking freshness
        if (!check_heartbeat_freshness()) {
            my $current_time = time;
            if ($current_time - $last_successful_heartbeat > $alert_threshold_seconds) {
                # Trigger alert/failover action
                print "ALERT: MongoDB primary appears to be down or unresponsive for more than $alert_threshold_seconds seconds!\n";
                # In a real-world scenario, this would trigger an automated failover script
                # For demonstration, we'll just exit and let a supervisor restart this script
                # or trigger an external notification system.
                # system("curl -X POST -d 'message=MongoDB Primary Failure Alert' YOUR_ALERTING_WEBHOOK");
                exit 1; # Exit to allow supervisor to restart
            } else {
                print "MongoDB unresponsive, but within acceptable stale threshold. Waiting...\n";
            }
        }
    }
    sleep $check_interval_seconds;
}
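
The staleness rule used in the script reduces to a single comparison between the age of the last heartbeat and the alert threshold. A minimal shell sketch of that rule follows; the function name `heartbeat_is_stale` is hypothetical, and all values are Unix epoch seconds.

```shell
#!/bin/sh
# heartbeat_is_stale LAST_EPOCH NOW_EPOCH THRESHOLD_SECONDS
# Succeeds when the heartbeat is strictly older than the threshold.
heartbeat_is_stale() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# Example: compare a stored timestamp against the current time
#   heartbeat_is_stale "$last_heartbeat" "$(date +%s)" 60 && echo "stale"
```

With a 15-second check interval and a 60-second threshold, roughly four consecutive missed heartbeats are needed before the alert fires.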

Automated Failover Script (Conceptual)

When the `heartbeat_checker.pl` script detects a failure (e.g., by exiting with a non-zero status), a supervisor process (like `systemd` or `supervisord`) should catch this. The supervisor can then trigger a more complex failover script. This script would:

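1. Verify the primary is indeed down (e.g., by pinging or attempting a connection from multiple points).
2. If confirmed, check whether the replica set has already elected a new primary. While a majority of voting members is reachable, MongoDB elects one automatically, and `rs.stepDown()` is only useful against a primary that is still reachable but unhealthy. A forced `rs.reconfig()` on a surviving member is a last resort for when the majority is lost; it is a complex operation and requires careful scripting to avoid split-brain scenarios.
3. Update DNS records (e.g., using DigitalOcean’s API) to point applications to the new primary. Clients that connect with the replica set URI discover the new primary themselves, so this step matters mainly for clients pinned to a single hostname.
4. Notify operations teams via Slack, PagerDuty, etc.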
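
The first step, confirming the failure from more than one vantage point, can be sketched as a function that declares the primary down only when every probe fails. The name `primary_confirmed_down` and the probe commands in the example are hypothetical; each probe is any command that exits 0 when it can reach the primary.

```shell
#!/bin/sh
# primary_confirmed_down PROBE...
# Succeeds only if at least one probe is given and every probe fails.
primary_confirmed_down() {
    # Refuse to declare a failure when there is nothing to check.
    [ "$#" -gt 0 ] || return 1
    for probe in "$@"; do
        if "$probe"; then
            return 1   # at least one vantage point still reaches the primary
        fi
    done
    return 0
}

# Example with two hypothetical probe scripts:
#   primary_confirmed_down /path/to/probe_from_monitor.sh /path/to/probe_from_node2.sh
```

Requiring all probes to fail guards against acting on a network problem that affects only the monitoring host.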

A simplified example of a failover trigger script (run by `systemd` or similar):

#!/bin/bash

# failover_orchestrator.sh
# This script is triggered when heartbeat_checker.pl exits with an error.

LOG_FILE="/var/log/mongodb_failover.log"
MONGO_PRIMARY_IP="10.10.0.1" # Current expected primary
MONGO_SECONDARY_IP="10.10.0.2" # A known secondary
REPLICA_SET_NAME="myReplicaSet"
ADMIN_USER="adminUser"
ADMIN_PASS="adminPassword" # Store securely, e.g., in env vars or a secret manager

echo "$(date): Failover detected. Initiating orchestration." >> $LOG_FILE

# 1. Verify primary is down (basic ping check)
# Note: an ICMP reply shows the Droplet is up, not that mongod is healthy
if ping -c 1 -W 2 "$MONGO_PRIMARY_IP" > /dev/null 2>&1; then
    echo "$(date): Primary $MONGO_PRIMARY_IP is still reachable. Aborting automated failover." >> $LOG_FILE
    exit 0 # Primary is back, no need to proceed
fi

echo "$(date): Primary $MONGO_PRIMARY_IP confirmed unreachable." >> $LOG_FILE

# 2. Find the newly elected primary (requires careful auth setup)
# rs.stepDown() only works on the current primary, so it cannot promote a secondary.
# With a majority of members still up, the set elects a new primary on its own;
# this step waits for that election and asks the surviving secondary who won.
# This is a simplified example. In production, use a dedicated user with appropriate roles.
echo "$(date): Waiting for the replica set to elect a new primary." >> $LOG_FILE
sleep 30
NEW_PRIMARY=$(mongo --host "$MONGO_SECONDARY_IP:27017" --username "$ADMIN_USER" --password "$ADMIN_PASS" --authenticationDatabase admin --quiet --eval "var p = rs.status().members.find(m => m.stateStr === 'PRIMARY'); print(p ? p.name : '')")
if [ -n "$NEW_PRIMARY" ]; then
    echo "$(date): New primary identified: $NEW_PRIMARY" >> $LOG_FILE
    # 3. Update DNS (example using DigitalOcean API - requires DO_API_TOKEN)
    # This part is highly dependent on your DNS setup and API credentials.
    # Example: Update A record for 'mongodb.yourdomain.com'
    # curl -X PUT "https://api.digitalocean.com/v2/domains/yourdomain.com/records/RECORD_ID" \
    #      -d "{\"type\":\"A\",\"name\":\"mongodb\",\"data\":\"$(echo $NEW_PRIMARY | cut -d: -f1)\"}" \
    #      -H "Authorization: Bearer $DO_API_TOKEN"
    # echo "$(date): DNS updated for new primary." >> $LOG_FILE
else
    echo "$(date): No primary found via $MONGO_SECONDARY_IP after waiting. Manual intervention may be required." >> $LOG_FILE
fi

# 4. Send notifications (e.g., Slack, PagerDuty)
# curl -X POST -d 'payload={"text": "MongoDB failover initiated. New primary: '$NEW_PRIMARY'"}' YOUR_SLACK_WEBHOOK_URL

echo "$(date): Failover orchestration script finished." >> $LOG_FILE
exit 0
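
The DNS step in the script needs the host part of the new primary’s `host:port` name and a JSON body for the record update. The sketch below isolates that string handling; `host_of` and `dns_record_payload` are hypothetical helper names, the code handles IPv4 addresses and hostnames only, and the request would typically also need a `Content-Type: application/json` header.

```shell
#!/bin/sh
# host_of HOST:PORT -> HOST (strips the last ":port" suffix if present)
host_of() {
    echo "${1%:*}"
}

# dns_record_payload NAME HOST:PORT -> JSON body for an A record update
dns_record_payload() {
    printf '{"type":"A","name":"%s","data":"%s"}\n' "$1" "$(host_of "$2")"
}

dns_record_payload mongodb 10.10.0.2:27017
```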

Systemd Service for Health Checker

To ensure the `heartbeat_checker.pl` script runs reliably and restarts on failure, configure it as a `systemd` service.

# /etc/systemd/system/mongodb-heartbeat.service
[Unit]
Description=MongoDB Heartbeat Checker
After=network.target

[Service]
# Run as a non-root user (systemd does not support trailing comments on a setting line)
User=your_app_user
Group=your_app_group
WorkingDirectory=/path/to/your/scripts
ExecStart=/usr/bin/perl /path/to/your/scripts/heartbeat_checker.pl
Restart=on-failure
RestartSec=10
StandardOutput=append:/var/log/mongodb_heartbeat.log
StandardError=append:/var/log/mongodb_heartbeat.log

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable mongodb-heartbeat.service
sudo systemctl start mongodb-heartbeat.service
sudo systemctl status mongodb-heartbeat.service

Triggering the Failover Orchestrator

Modify `mongodb-heartbeat.service` to also run `failover_orchestrator.sh` when the checker fails. `ExecStopPost` is one way to do this, but note that it runs after every stop of the service, including a clean `systemctl stop`. systemd passes the outcome to the command in the `$SERVICE_RESULT` environment variable, so the orchestrator should proceed only when that value is not `success`.

# /etc/systemd/system/mongodb-heartbeat.service (modified)
[Unit]
Description=MongoDB Heartbeat Checker and Failover Orchestrator
After=network.target

[Service]
User=your_app_user
Group=your_app_group
WorkingDirectory=/path/to/your/scripts
ExecStart=/usr/bin/perl /path/to/your/scripts/heartbeat_checker.pl
# ExecStopPost runs after every stop, clean or not; the orchestrator should
# check $SERVICE_RESULT and act only when it is not "success"
ExecStopPost=/bin/bash /path/to/your/scripts/failover_orchestrator.sh
Restart=on-failure
RestartSec=10
StandardOutput=append:/var/log/mongodb_heartbeat.log
StandardError=append:/var/log/mongodb_heartbeat.log

[Install]
WantedBy=multi-user.target
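
Since `ExecStopPost` commands run after every stop of the service, a guard at the top of `failover_orchestrator.sh` can keep the script from acting on a clean shutdown. This is a minimal sketch that assumes a systemd version recent enough to export `SERVICE_RESULT` to `ExecStopPost` commands; the function name `should_run_failover` is hypothetical.

```shell
#!/bin/sh
# should_run_failover: succeed only when systemd reports that the
# service did not stop cleanly (SERVICE_RESULT is set and not "success").
should_run_failover() {
    [ -n "${SERVICE_RESULT:-}" ] && [ "$SERVICE_RESULT" != "success" ]
}

# Intended use as the first step of failover_orchestrator.sh:
#   should_run_failover || exit 0
```

When the checker exits with status 1, systemd sets `SERVICE_RESULT` to `exit-code`, so the guard lets the orchestration proceed; after `systemctl stop` the value is `success` and the script exits early.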

This setup provides a layered approach to disaster recovery: MongoDB’s internal replica set ensures data availability and automatic failover within the cluster, while external monitoring and orchestration scripts provide proactive detection and automated recovery actions, including DNS updates, to minimize application downtime.