Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and Python Deployments on DigitalOcean

Establishing a MongoDB Replica Set for High Availability

A robust disaster recovery strategy for MongoDB hinges on implementing a replica set. This ensures data redundancy and automatic failover in case of node failure. We’ll focus on a three-node setup for quorum and resilience, deployed on DigitalOcean Droplets.
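The choice of three nodes follows from how elections work: a primary can only be elected by a strict majority of voting members. The following is a minimal sketch of that arithmetic; the function names are illustrative and not part of any MongoDB tooling.

```python
def election_majority(voting_members: int) -> int:
    """Votes needed to elect a primary: a strict majority of voting members."""
    if voting_members < 1:
        raise ValueError("a replica set needs at least one voting member")
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """How many voting members can be lost while a primary can still be elected."""
    return voting_members - election_majority(voting_members)


if __name__ == "__main__":
    # Print set size, majority, and tolerated failures for a few common sizes.
    for size in (2, 3, 4, 5):
        print(size, election_majority(size), fault_tolerance(size))
```

A two-node set tolerates no failures, and a fourth node adds no tolerance over three, which is why odd member counts are the usual recommendation.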

First, ensure MongoDB is installed on each Droplet. For this example, we’ll assume Ubuntu 22.04 LTS. The configuration file, typically located at /etc/mongod.conf, needs to be adjusted on each node.

Node 1: Primary Configuration

This node will initially act as the primary. We need to enable replication and specify the replica set name.

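# /etc/mongod.conf on Node 1
storage:
  dbPath: /var/lib/mongodb
  # On MongoDB 6.0 or earlier you may also set storage.journal.enabled: true.
  # The option was removed in 6.1, where journaling is always on.
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  # 0.0.0.0 listens on all interfaces, including the public one. Restrict access
  # with firewall rules, or bind to 127.0.0.1,<node_private_ip> instead.
  bindIp: 0.0.0.0
  port: 27017
security:
  keyFile: /etc/mongo-keyfile
replication:
  replSetName: rs0
processManagement:
  # fork: true is omitted here: the packaged systemd unit expects mongod to stay
  # in the foreground. Set it only if you start mongod by hand.
  pidFilePath: /var/run/mongodb/mongod.pid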

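Crucially, a keyFile is required for internal authentication between replica set members. Generate it on one node and securely copy it to the others. The file must be readable only by the user that runs mongod (mongodb on Ubuntu packages), so set both the owner and restrictive permissions (e.g., chown mongodb:mongodb and chmod 400 /etc/mongo-keyfile).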

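# On Node 1, generate the keyfile (run as root)
openssl rand -base64 756 > /etc/mongo-keyfile
chown mongodb:mongodb /etc/mongo-keyfile
chmod 400 /etc/mongo-keyfile
# Securely copy /etc/mongo-keyfile to Node 2 and Node 3 (e.g., with scp over the private network)
# Ensure owner mongodb and permissions 400 on all nodes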

Node 2 & Node 3: Secondary Configuration

The configuration for Node 2 and Node 3 is identical to Node 1. The keyFile path may differ per node, but its contents must be identical on every member, so copy the file from Node 1 rather than generating a new one. Ensure the replSetName is consistent across all nodes.

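# /etc/mongod.conf on Node 2 & Node 3 (identical to Node 1's config)
storage:
  dbPath: /var/lib/mongodb
  # On MongoDB 6.0 or earlier you may also set storage.journal.enabled: true.
  # The option was removed in 6.1, where journaling is always on.
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  # 0.0.0.0 listens on all interfaces, including the public one. Restrict access
  # with firewall rules, or bind to 127.0.0.1,<node_private_ip> instead.
  bindIp: 0.0.0.0
  port: 27017
security:
  keyFile: /etc/mongo-keyfile
replication:
  replSetName: rs0
processManagement:
  # fork: true is omitted here: the packaged systemd unit expects mongod to stay
  # in the foreground. Set it only if you start mongod by hand.
  pidFilePath: /var/run/mongodb/mongod.pid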

Initializing the Replica Set

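After configuring and restarting MongoDB on all nodes, connect to Node 1 with mongosh (the legacy mongo shell was removed in MongoDB 6.0) and initiate the replica set. No primary exists yet: an election follows initiation, and Node 1 usually becomes the first primary. Connect from Node 1 itself, because with access control enabled and no users defined, only a localhost connection is allowed to run the command.

// On Node 1, connect with mongosh over localhost
mongosh --port 27017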

// Inside the mongo shell:
rs.initiate(
  {
    _id: "rs0",
    members: [
      { _id: 0, host: "<node1_ip>:27017" },
      { _id: 1, host: "<node2_ip>:27017" },
      { _id: 2, host: "<node3_ip>:27017" }
    ]
  }
)

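Replace <nodeX_ip> with the actual private IP addresses of your DigitalOcean Droplets. After initiation, check the replica set status with rs.status(). Within a few seconds you should see one member in the ‘PRIMARY’ state and the other two in ‘SECONDARY’. Because the keyFile setting also enables access control, create an administrative user on the primary (the localhost exception permits this on a fresh deployment) before connecting from other hosts.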

Architecting Python Application for MongoDB Failover

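Your Python application needs to be aware of the MongoDB replica set and use a connection string that lets it discover and connect to the current primary. The PyMongo driver handles this for you: given a seed list and the replica set name, it monitors the members and follows the primary after an election.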

Connection String Configuration

Instead of connecting to a single MongoDB instance, use a connection string that lists all members of the replica set. This allows PyMongo to discover the topology and automatically switch to a new primary if the current one becomes unavailable.

# Example using environment variables for configuration
import os
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

# Retrieve MongoDB connection details from environment variables
MONGO_HOSTS = os.environ.get("MONGO_HOSTS", "mongodb://<node1_ip>:27017,<node2_ip>:27017,<node3_ip>:27017")
MONGO_REPLICA_SET_NAME = os.environ.get("MONGO_REPLICA_SET_NAME", "rs0")
MONGO_DB_NAME = os.environ.get("MONGO_DB_NAME", "mydatabase")

def get_mongo_client():
    """Establishes a connection to the MongoDB replica set."""
    try:
        client = MongoClient(
            MONGO_HOSTS,
            replicaSet=MONGO_REPLICA_SET_NAME,
            serverSelectionTimeoutMS=5000  # Timeout for server selection (in milliseconds)
        )
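        # MongoClient connects lazily; ping forces server selection so failures surface here.
        # The ping command is cheap and does not require auth.
        # (ismaster is deprecated in current MongoDB versions.)
        client.admin.command('ping')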
        print("MongoDB connection successful!")
        return client
    except ConnectionFailure as e:
        print(f"Could not connect to MongoDB: {e}")
        # Implement retry logic or graceful shutdown here
        return None

# Example usage:
# client = get_mongo_client()
# if client:
#     db = client[MONGO_DB_NAME]
#     # Perform database operations
#     # For example:
#     # my_collection = db["mycollection"]
#     # my_collection.insert_one({"name": "test"})
#     # client.close()

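The serverSelectionTimeoutMS is critical. It defines how long PyMongo waits for a suitable server (for writes, the primary) before raising ServerSelectionTimeoutError; the driver default is 30 seconds. It adds no latency during normal operation and only bounds how long an operation blocks while no primary is available. With default replica set settings an election typically completes in about 12 seconds, so a value of 5000ms (5 seconds) makes operations fail fast during a failover and rely on application-level retries rather than waiting out the election. Raise it if you would rather have calls block until the new primary is elected.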
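The same settings can be carried in the connection string itself, which keeps them in one environment variable. The sketch below assembles such a URI; build_mongo_uri is a hypothetical helper, and the choice of retryWrites=true and w=majority is an assumption about what a failover-tolerant application would want, not something the configuration above requires.

```python
from urllib.parse import quote_plus, urlencode


def build_mongo_uri(hosts, replica_set, username=None, password=None,
                    auth_source="admin", server_selection_timeout_ms=5000):
    """Build a replica set connection string from a seed list of host:port strings."""
    if not hosts:
        raise ValueError("at least one seed host is required")
    credentials = ""
    if username is not None and password is not None:
        # Reserved characters (@, :, /, ...) must be percent-encoded in the URI.
        credentials = f"{quote_plus(username)}:{quote_plus(password)}@"
    options = {
        "replicaSet": replica_set,
        "serverSelectionTimeoutMS": server_selection_timeout_ms,
        "retryWrites": "true",  # let the driver retry eligible writes once
        "w": "majority",        # acknowledge writes only after a majority has them
    }
    if credentials:
        options["authSource"] = auth_source
    return f"mongodb://{credentials}{','.join(hosts)}/?{urlencode(options)}"


if __name__ == "__main__":
    # Placeholder private IPs; substitute your Droplets' addresses.
    print(build_mongo_uri(["10.0.0.1:27017", "10.0.0.2:27017", "10.0.0.3:27017"], "rs0"))
```

The resulting string can be passed directly to MongoClient. Writes acknowledged with w=majority are not rolled back when a new primary is elected, which is usually the property you want from a disaster recovery setup.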

Handling Connection Errors and Retries

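Your application must gracefully handle ConnectionFailure exceptions; during a failover PyMongo raises its subclasses AutoReconnect and ServerSelectionTimeoutError. Handling might involve a retry mechanism with exponential backoff or alerting operators. MongoClient already maintains a connection pool, so create one client per process and reuse it rather than opening a connection per operation. Because a retried write may be applied twice, ensure that critical operations are idempotent.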

# Example of a simple retry mechanism
import time

from pymongo.errors import ConnectionFailure

def execute_with_retry(operation, max_retries=5, delay=2):
    """Executes an operation with retry logic."""
    for attempt in range(max_retries):
        try:
            return operation()
        except ConnectionFailure as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(delay * (2 ** attempt)) # Exponential backoff
            else:
                print("Max retries reached. Operation failed.")
                raise # Re-raise the exception after max retries
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            raise # Re-raise other exceptions

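# Example usage within your application logic.
# Create one MongoClient at startup and reuse it; it maintains its own connection pool.
# client = MongoClient(MONGO_HOSTS, replicaSet=MONGO_REPLICA_SET_NAME, serverSelectionTimeoutMS=5000)
#
# def insert_data():
#     # Let ConnectionFailure propagate so execute_with_retry can retry it.
#     # A retried insert can create duplicates unless it is idempotent,
#     # e.g. by supplying your own _id.
#     client[MONGO_DB_NAME]["mycollection"].insert_one({"data": "important"})
#
# try:
#     execute_with_retry(insert_data)
# except ConnectionFailure:
#     # Handle the ultimate failure, e.g., log, alert, return error to user
#     pass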
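
It helps to know how long a retry policy waits in total, since it should comfortably outlast an election. The sketch below computes the sleep schedule used by execute_with_retry and adds two common refinements, a cap and random jitter; backoff_delays and its parameters are illustrative names, not part of PyMongo.

```python
import random


def backoff_delays(max_retries=5, base_delay=2.0, max_delay=30.0,
                   jitter=0.0, rng=random.random):
    """Return the sleep before each retry: base_delay * 2**attempt, capped, plus jitter."""
    delays = []
    for attempt in range(max_retries - 1):  # no sleep after the final attempt
        delay = min(base_delay * (2 ** attempt), max_delay)
        # Jitter spreads out retries from many clients hitting a new primary at once.
        delay += delay * jitter * rng()
        delays.append(delay)
    return delays


if __name__ == "__main__":
    delays = backoff_delays()
    print(delays, sum(delays))
```

With the defaults from execute_with_retry (five attempts, 2 second base delay) the sleeps are 2, 4, 8, and 16 seconds, 30 seconds in total, so the later attempts land after a typical election has finished.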

Automating Failover Detection and Response

While MongoDB’s replica set handles automatic failover internally, your infrastructure needs to monitor the health of the replica set and potentially trigger external actions or notifications. DigitalOcean’s monitoring tools and custom scripts can be leveraged here.

Health Checks and Monitoring

Implement regular health checks that query the replica set status. This can be done via a cron job or a dedicated monitoring agent.

#!/bin/bash

MONGO_HOSTS="<node1_ip>:27017,<node2_ip>:27017,<node3_ip>:27017"
REPLICA_SET_NAME="rs0"
ALERT_EMAIL="<alert_email>"

# Check that mongosh is available (the legacy mongo shell was removed in MongoDB 6.0)
if ! command -v mongosh &>/dev/null; then
    echo "Error: mongosh not found."
    exit 1
fi

# Connect with a replica set URI so the shell can find the current primary.
# With keyFile authentication enabled, also pass credentials for a monitoring
# user (e.g., --username, --password, --authenticationDatabase).
MONGO_URI="mongodb://$MONGO_HOSTS/?replicaSet=$REPLICA_SET_NAME"

# Emit rs.status() as compact JSON so member states can be matched reliably.
# This is still a simplified check; a more robust script would parse the JSON.
if ! STATUS_OUTPUT=$(mongosh "$MONGO_URI" --quiet --eval "JSON.stringify(rs.status())" 2>&1); then
    echo "Error checking MongoDB replica set status: $STATUS_OUTPUT"
    echo "Subject: MongoDB Replica Set Alert - Status Check Failed" | sendmail "$ALERT_EMAIL"
    exit 1
fi

PRIMARY_COUNT=$(echo "$STATUS_OUTPUT" | grep -o '"stateStr":"PRIMARY"' | wc -l)
SECONDARY_COUNT=$(echo "$STATUS_OUTPUT" | grep -o '"stateStr":"SECONDARY"' | wc -l)
# Unreachable members report health 0 and a stateStr of "(not reachable/healthy)"
UNHEALTHY_COUNT=$(echo "$STATUS_OUTPUT" | grep -o '"health":0' | wc -l)

# Basic health check logic
# Expecting 1 PRIMARY, 2 SECONDARY for a 3-node set without arbiter
EXPECTED_SECONDARY=2
if [ "$PRIMARY_COUNT" -ne 1 ] || [ "$SECONDARY_COUNT" -ne "$EXPECTED_SECONDARY" ] || [ "$UNHEALTHY_COUNT" -gt 0 ]; then
    echo "MongoDB replica set health issue detected!"
    echo "Status: Primary=$PRIMARY_COUNT, Secondary=$SECONDARY_COUNT, Unhealthy=$UNHEALTHY_COUNT"
    echo "Replica Set Status Output:"
    echo "$STATUS_OUTPUT"
    echo "Subject: MongoDB Replica Set Alert - Health Issue Detected" | sendmail "$ALERT_EMAIL"
    exit 1
fi

echo "MongoDB replica set is healthy."
exit 0

This script can be scheduled via cron. For more advanced monitoring, consider integrating with DigitalOcean’s monitoring APIs or using tools like Prometheus with the MongoDB exporter.
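Counting state strings with grep says nothing about replication lag. As a sketch of a more thorough check, the function below inspects a parsed rs.status() document, assuming it has been serialized to JSON (for example with mongosh --eval "JSON.stringify(rs.status())") so that optimeDate arrives as an ISO 8601 string. The function name and the thresholds are assumptions chosen for illustration.

```python
from datetime import datetime


def _parse_time(value):
    """Parse an ISO 8601 timestamp such as 2024-05-01T12:00:00.000Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def find_problems(status, expected_members=3, max_lag_seconds=30):
    """Return a list of human-readable problems found in an rs.status() document."""
    members = status.get("members", [])
    problems = []
    if len(members) != expected_members:
        problems.append(f"expected {expected_members} members, found {len(members)}")
    primaries = [m for m in members if m.get("stateStr") == "PRIMARY"]
    if len(primaries) != 1:
        problems.append(f"expected exactly 1 PRIMARY, found {len(primaries)}")
    for m in members:
        if m.get("health") != 1:
            problems.append(f"{m.get('name')} is unreachable")
        elif m.get("stateStr") not in ("PRIMARY", "SECONDARY"):
            problems.append(f"{m.get('name')} is in state {m.get('stateStr')}")
    # Replication lag: how far each secondary's last applied write trails the primary's.
    if len(primaries) == 1 and "optimeDate" in primaries[0]:
        primary_time = _parse_time(primaries[0]["optimeDate"])
        for m in members:
            if m.get("stateStr") == "SECONDARY" and "optimeDate" in m:
                lag = (primary_time - _parse_time(m["optimeDate"])).total_seconds()
                if lag > max_lag_seconds:
                    problems.append(f"{m.get('name')} lags the primary by {lag:.0f}s")
    return problems
```

An empty list means the set looks healthy; anything else can be joined into the body of an alert. Feeding it json.load(sys.stdin) lets the same mongosh command from the shell script drive it.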

Automated Actions (Optional)

In a production environment, you might want to automate actions based on health check failures. This could include:

Triggering alerts via Slack, PagerDuty, etc.
Initiating automated diagnostics or recovery procedures (e.g., restarting a failed node, though this should be done with extreme caution).
Scaling up resources if performance degradation is detected.

For automated actions, a system like Ansible or a custom webhook receiver triggered by your monitoring system would be necessary. For instance, if the health check script detects a persistent issue, it could POST a payload to a webhook endpoint that triggers an Ansible playbook to attempt node recovery.
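As a rough sketch of the webhook side, the snippet below builds a JSON payload and POSTs it using only the standard library. The payload fields, the function names, and the endpoint are all hypothetical; your receiver (Slack, PagerDuty, or an Ansible-triggering service) defines the schema it actually expects.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert_payload(replica_set, problems, source="mongo-health-check"):
    """Serialize an alert as UTF-8 JSON. The field names here are illustrative."""
    payload = {
        "source": source,
        "replica_set": replica_set,
        "severity": "critical" if problems else "ok",
        "problems": list(problems),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(payload).encode("utf-8")


def post_alert(webhook_url, body, timeout=10):
    """POST the payload to a webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keep the receiver's actions conservative: notifying a human is always safe, while automated restarts should be rate-limited so a flapping node is not restarted in a loop.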

Deployment and Infrastructure Considerations on DigitalOcean

Leveraging DigitalOcean’s infrastructure is key to a successful DR strategy. This includes using private networking, snapshots, and potentially load balancers.

Private Networking

Ensure your MongoDB nodes are on the same DigitalOcean VPC (Virtual Private Cloud). This allows them to communicate using their private IP addresses, which is more secure and often faster than using public IPs. Configure your firewall rules (e.g., ufw) to only allow traffic on port 27017 from your application servers and other MongoDB nodes.

# Example ufw rules on MongoDB nodes
sudo ufw allow OpenSSH # Keep SSH reachable before enabling the firewall
sudo ufw allow from <app_server_private_ip> to any port 27017
sudo ufw allow from <other_mongo_node_private_ip> to any port 27017
sudo ufw allow from 127.0.0.1 to any port 27017 # For local access
sudo ufw enable

Snapshots and Backups

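Regular snapshots of your MongoDB Droplets give you restore points for whole-machine recovery. DigitalOcean offers on-demand snapshots and a scheduled Backups add-on for Droplets. Note that an image taken while mongod is running is only crash-consistent, and it restores to the moment it was taken rather than to an arbitrary point in time. For MongoDB-specific backups, consider using mongodump to create logical backups, which can be stored off-site or in object storage.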

# Example mongodump command (run from an application server or dedicated backup node)
MONGO_HOST="<primary_mongo_ip>:27017"
BACKUP_DIR="/mnt/backups/$(date +%Y-%m-%d_%H-%M-%S)"
mkdir -p "$BACKUP_DIR"

mongodump --host "$MONGO_HOST" --out "$BACKUP_DIR" --username "backupuser" --password "your_password" --authenticationDatabase "admin"

# For compressed backups
# mongodump --host "$MONGO_HOST" --archive="$BACKUP_DIR.gz" --gzip --username "backupuser" --password "your_password" --authenticationDatabase "admin"

Automate this process and transfer the backups to a separate storage solution (e.g., DigitalOcean Spaces) for true disaster recovery.
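Automation also needs a retention rule, or the backup volume eventually fills up. The following sketch picks which timestamped backup directories are old enough to delete, assuming they are named with the same date format as BACKUP_DIR above; the function name and the default retention values are illustrative.

```python
from datetime import datetime, timedelta

# Matches directory names produced by $(date +%Y-%m-%d_%H-%M-%S)
BACKUP_NAME_FORMAT = "%Y-%m-%d_%H-%M-%S"


def expired_backups(names, now, keep_days=7, keep_minimum=3):
    """Return backup names older than keep_days, always keeping the newest keep_minimum."""
    dated = []
    for name in names:
        try:
            dated.append((datetime.strptime(name, BACKUP_NAME_FORMAT), name))
        except ValueError:
            continue  # ignore anything that is not a timestamped backup
    dated.sort(reverse=True)  # newest first
    cutoff = now - timedelta(days=keep_days)
    return [name for taken_at, name in dated[keep_minimum:] if taken_at < cutoff]
```

A cron job could pass os.listdir("/mnt/backups") and datetime.now() to this function and remove the returned directories after the latest dump has been uploaded. Keeping a minimum number of backups regardless of age guards against deleting everything if dumps silently stop running.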

Load Balancers (Optional but Recommended)

While not strictly necessary for MongoDB’s internal failover, a DigitalOcean Load Balancer can be placed in front of your application servers. This simplifies application deployment and management, and can also perform health checks on your application instances. For MongoDB itself, direct client connections to the replica set are generally preferred over a load balancer, as the driver handles topology discovery and failover more effectively.

Testing Your Disaster Recovery Plan

A DR plan is only as good as its tested execution. Regularly simulate failures to validate your setup.

Simulating Node Failures

To test failover:

Gracefully shut down the primary MongoDB node (e.g., sudo systemctl stop mongod). Observe how a secondary is elected as the new primary.
Forcefully terminate the primary node’s process (e.g., sudo kill -9 $(pgrep mongod)). This simulates an unexpected crash and should also trigger an election.
Simulate network partitions by blocking traffic between nodes using firewall rules.

During these tests, monitor your application’s behavior. Verify that it can still connect to the database and that operations are eventually successful after the failover. Check your monitoring system for alerts and ensure they are triggered correctly.
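To put a number on "eventually successful", you can run a small probe loop while you stop the primary and record how long operations kept failing. This is a minimal sketch: measure_outage and its parameters are hypothetical, and the probe is any callable that raises on failure, such as a PyMongo insert against a test collection.

```python
import time


def measure_outage(probe, duration_seconds=60.0, interval_seconds=0.5,
                   clock=time.monotonic, sleep=time.sleep):
    """Call probe() repeatedly and return the longest run of consecutive failures, in seconds."""
    deadline = clock() + duration_seconds
    outage_started = None
    longest = 0.0
    while clock() < deadline:
        started = clock()
        try:
            probe()
            ok = True
        except Exception:
            ok = False
        if ok:
            if outage_started is not None:
                # First success after a failure run: the outage ends now.
                longest = max(longest, clock() - outage_started)
                outage_started = None
        elif outage_started is None:
            outage_started = started
        sleep(interval_seconds)
    if outage_started is not None:
        # Still failing when the observation window closed.
        longest = max(longest, clock() - outage_started)
    return longest
```

Start it with a probe such as lambda: collection.insert_one({"probe": True}) just before stopping the primary. The result reflects what the application experiences, which includes the driver's server selection timeout as well as the election itself.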

Testing Data Recovery

Periodically test restoring data from your backups (both snapshots and logical dumps) to ensure their integrity and that your restore procedures are documented and effective.