Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and C++ Deployments on DigitalOcean

Establishing a MongoDB Replica Set for High Availability

For any mission-critical application, especially one built on MongoDB, a robust disaster recovery strategy hinges on high availability, which MongoDB provides through replica sets. A replica set is a group of mongod instances that maintain the same data set, providing redundancy. If the primary becomes unavailable, the remaining members hold an election and one of the secondaries takes over as the new primary.

We’ll architect this on DigitalOcean, utilizing their Droplets for our MongoDB instances. For simplicity and demonstration, we’ll set up a 3-node replica set (one primary, two secondaries). Three nodes is also the practical minimum for production, because electing a primary requires a strict majority of voting members. Keep the number of voting members odd: an even-sized set costs more but tolerates no more failures than the odd-sized set one member smaller.
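The majority rule behind that sizing advice can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are made up for this example and are not part of any MongoDB API.

```cpp
#include <cassert>

// Votes required to elect a primary: a strict majority of voting members.
int majority(int voting_members) {
    assert(voting_members > 0);
    return voting_members / 2 + 1;
}

// Number of voting members that can be lost while an election can still succeed.
int fault_tolerance(int voting_members) {
    return voting_members - majority(voting_members);
}
```

With these definitions, fault_tolerance(3) and fault_tolerance(4) both evaluate to 1, while fault_tolerance(5) is 2, which is why a fourth member adds cost without adding resilience.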

Provisioning DigitalOcean Droplets

First, provision three Droplets on DigitalOcean. For this example, let’s assume they are named:

mongo-node-1 (Primary candidate)
mongo-node-2 (Secondary candidate)
mongo-node-3 (Secondary candidate)

Ensure these Droplets are in the same DigitalOcean region and attached to the same private (VPC) network, so that replication traffic stays off the public internet. DigitalOcean assigns each Droplet a private IP address on that network. For this guide, we’ll assume the following private IPs:

mongo-node-1: 10.10.10.1
mongo-node-2: 10.10.10.2
mongo-node-3: 10.10.10.3

Configuring MongoDB Instances

On each Droplet, install MongoDB. Then, configure the mongod.conf file. The key is to enable replication and specify the replica set name.

On mongo-node-1 (and similarly for nodes 2 and 3, adjusting the bindIp if necessary, though for private networks it’s often sufficient to bind to 0.0.0.0 or the private IP):

`/etc/mongod.conf` on `mongo-node-1`

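# mongod.conf

storage:
  dbPath: /var/lib/mongodb
  # journal.enabled applies to MongoDB 6.0 and earlier. The option was
  # removed in 6.1, where journaling is always on, so delete it there.
  journal:
    enabled: true
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  bindIp: 0.0.0.0  # Or, more restrictively: 127.0.0.1,10.10.10.1
  port: 27017
replication:
  replSetName: "myReplicaSet" # Crucial for replica set identification
processManagement:
  # fork: true  # Leave disabled when systemd runs mongod in the foreground (Debian/Ubuntu packages)
  pidFilePath: /var/run/mongodb/mongod.pid
security:
  authorization: enabled # Recommended for production
  keyFile: /etc/mongodb/keyfile # Assumed path: shared key that lets members authenticate to each other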

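Ensure the bindIp setting allows the other nodes to connect. For production, binding to localhost plus the private IP is more secure than 0.0.0.0. With authorization enabled, replica set members must also authenticate to one another, which requires a shared key file (security.keyFile) or x.509 certificates on every node. After modifying the configuration, restart the MongoDB service and enable it at boot: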

sudo systemctl restart mongod
sudo systemctl enable mongod

Repeat this configuration for mongo-node-2 and mongo-node-3, ensuring they all have the same replSetName. For mongo-node-2, bindIp would be 10.10.10.2, and for mongo-node-3, 10.10.10.3.

Initializing the Replica Set

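Connect to the MongoDB instance on mongo-node-1. This node will initiate the replica set configuration. Current releases ship the mongosh shell; the legacy mongo shell was removed in MongoDB 6.0, so use mongo only on older versions. If authorization is enabled and no users exist yet, connect over localhost from the node itself so that the localhost exception applies.

mongosh --host 10.10.10.1

Once connected, run the rs.initiate() command. This command takes a configuration document that specifies the members of the replica set. The member on which you run rs.initiate() usually becomes the first primary, although any eligible member can win a later election.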

rs.initiate(
  {
    _id: "myReplicaSet",
    members: [
      { _id: 0, host: "10.10.10.1:27017" },
      { _id: 1, host: "10.10.10.2:27017" },
      { _id: 2, host: "10.10.10.3:27017" }
    ]
  }
)

After running rs.initiate(), MongoDB will elect a primary. You can check the status of the replica set by running:

rs.status()

The output should show one member as PRIMARY and the others as SECONDARY. If you need to add members later or reconfigure, you can use rs.add() and rs.reconfig().

Architecting C++ Application Failover with Keepalived

For the C++ application layer, we need a mechanism to ensure that if the active instance fails, traffic is seamlessly redirected to a standby instance. This is a classic use case for a High Availability (HA) solution like Keepalived. Keepalived uses the Virtual Router Redundancy Protocol (VRRP) to manage a virtual IP address (VIP) across multiple servers.

We’ll deploy two Droplets for our C++ application instances, acting as active and passive nodes. A third Droplet can be used as a witness or for other supporting services, but for the core failover, two are sufficient.

Setting up Application Droplets

Provision two Droplets for your C++ application. Let’s name them:

app-node-1 (Active candidate)
app-node-2 (Passive candidate)

Assign static private IP addresses. For this example:

app-node-1: 10.10.10.11
app-node-2: 10.10.10.12

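Choose a Virtual IP (VIP) on the same subnet that clients will use to reach the application. Let’s use 10.10.10.10 as our VIP. Keepalived adds this address to whichever node is currently MASTER, so do not configure it statically on either Droplet. Because the standby node does not hold the VIP, have the application listen on all interfaces (0.0.0.0) rather than on the VIP itself, or it will fail to bind while in standby. Cloud networks, DigitalOcean’s included, may not deliver VRRP multicast or route addresses they did not assign. If the VIP does not move in your tests, consider Keepalived’s unicast peers together with a DigitalOcean Reserved IP reassigned from a notify script.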

Installing and Configuring Keepalived

Install Keepalived on both application Droplets:

sudo apt update
sudo apt install keepalived -y

The primary configuration file for Keepalived is /etc/keepalived/keepalived.conf. We need to configure it for VRRP.

`/etc/keepalived/keepalived.conf` on `app-node-1` (Active)

vrrp_script chk_app {
    script "/usr/local/bin/check_app_status.sh" # A script to check if the C++ app is running
    interval 2                                  # Check every 2 seconds
    weight -20                                  # Subtract 20 from priority while the check is failing
    fall 2                                      # Treat the check as failed after 2 consecutive failures
    rise 2                                      # Treat it as healthy again after 2 consecutive successes
}

vrrp_instance VI_1 {
    state MASTER                                # Start as MASTER on the active node
    interface eth0                                # Network interface to bind VIP to
    virtual_router_id 51                          # Unique ID for this VRRP instance
    priority 100                                  # Higher priority for active node
    advert_int 1                                  # Advertisement interval in seconds
    authentication {
        auth_type PASS
        auth_pass your_secret_password          # Shared secret; Keepalived uses only the first 8 characters
    }
    virtual_ipaddress {
        10.10.10.10/24                             # The Virtual IP address
    }
    track_script {
        chk_app
    }
}

`/etc/keepalived/keepalived.conf` on `app-node-2` (Passive)

vrrp_script chk_app {
    script "/usr/local/bin/check_app_status.sh"
    interval 2
    weight -20                                  # Must exceed the priority gap between the nodes (100 vs 90)
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state BACKUP                                # Start as BACKUP on passive node
    interface eth0                                # Network interface to bind VIP to
    virtual_router_id 51                          # Must match the active node
    priority 90                                   # Lower priority for passive node
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_password          # Must match the active node
    }
    virtual_ipaddress {
        10.10.10.10/24                             # Must match the active node
    }
    track_script {
        chk_app
    }
}

The state parameter only sets the role a node assumes at startup: MASTER on the intended active node (app-node-1) and BACKUP on the passive node (app-node-2). The priority parameter decides which node actually holds the MASTER role, since the node with the higher effective priority wins. The track_script directive is vital for automated failover: it ties that effective priority to the health of your C++ application, so the script’s weight must shift the priority by more than the gap between the two nodes for the VIP to move.
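To see why the size and sign of the weight matter, the following sketch models how a single tracked script changes a node’s effective priority. It is a simplified model for illustration, not Keepalived code. The Node struct and function names are hypothetical, and the weight 0 case (where a failed script puts the instance into FAULT state) and tie-breaking on equal priorities are deliberately left out.

```cpp
#include <cassert>

// Hypothetical model of one Keepalived node for this sketch.
struct Node {
    int base_priority;  // `priority` in vrrp_instance
    int script_weight;  // `weight` in vrrp_script
    bool script_ok;     // debounced result of the tracked health check
};

// A positive weight is added while the script succeeds;
// a negative weight is applied (lowering priority) while it fails.
int effective_priority(const Node& n) {
    assert(n.base_priority >= 1 && n.base_priority <= 254);
    if (n.script_weight > 0 && n.script_ok) {
        return n.base_priority + n.script_weight;
    }
    if (n.script_weight < 0 && !n.script_ok) {
        return n.base_priority + n.script_weight;
    }
    return n.base_priority;
}

// True if the standby now outranks the active node and would claim the VIP.
bool standby_takes_over(const Node& active, const Node& standby) {
    return effective_priority(standby) > effective_priority(active);
}
```

With base priorities of 100 and 90, a weight of +2 leaves a failed active node at 100 against a healthy standby at 92, so the VIP would stay put. A weight of -20 drops the failed node to 80, below the standby’s 90, and the standby takes over.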

Creating the Application Health Check Script

Create the /usr/local/bin/check_app_status.sh script on both application Droplets. This script should check if your C++ application is running and responsive. A simple check might be to see if the process is alive, or a more robust check could involve sending a health check request to the application.

#!/bin/bash

# Simple check: see if the C++ application process is running
# Replace 'your_cpp_app_process_name' with the actual process name or command
APP_PROCESS_NAME="your_cpp_app_process_name"

if pgrep -f "$APP_PROCESS_NAME" > /dev/null
then
    exit 0 # Process is running, exit with success
else
    exit 1 # Process is not running, exit with failure
fi

# More advanced check (example):
# curl -s --fail http://localhost:8080/health || exit 1
# exit 0

Make the script executable:

sudo chmod +x /usr/local/bin/check_app_status.sh
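Keepalived does not react to a single failed run of this script. The fall and rise settings debounce the results, so with interval 2 and fall 2 a stopped application is noticed after roughly four seconds. The class below is a minimal sketch of that hysteresis; its name is invented for this example, and it assumes the check starts out in the healthy state.

```cpp
#include <cassert>

// Debounces raw health-check results the way `fall` and `rise` do:
// the state flips only after N consecutive results of the opposite kind.
class CheckDebouncer {
public:
    CheckDebouncer(int fall, int rise) : fall_(fall), rise_(rise) {
        assert(fall > 0 && rise > 0);
    }

    // Feed one script result; returns the debounced health state.
    bool update(bool check_passed) {
        if (check_passed) {
            ++successes_;
            failures_ = 0;
            if (!healthy_ && successes_ >= rise_) healthy_ = true;
        } else {
            ++failures_;
            successes_ = 0;
            if (healthy_ && failures_ >= fall_) healthy_ = false;
        }
        return healthy_;
    }

private:
    int fall_;
    int rise_;
    int failures_ = 0;
    int successes_ = 0;
    bool healthy_ = true;  // assumption for this sketch: start healthy
};
```

A single transient failure followed by a success leaves the state unchanged, which keeps a briefly slow health check from triggering a needless VIP move.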

Starting Keepalived and Verifying Failover

Start and enable Keepalived on both nodes:

sudo systemctl start keepalived
sudo systemctl enable keepalived

On app-node-1 (the MASTER), you should see the VIP 10.10.10.10 assigned to the eth0 interface. You can verify this with:

ip addr show eth0

To test failover:

Stop your C++ application on app-node-1.
Observe the Keepalived logs (/var/log/syslog or journalctl -u keepalived). The chk_app script should fail, causing app-node-1’s priority to drop.
app-node-2 should detect this and transition to MASTER state, acquiring the VIP.
Verify the VIP is now on app-node-2 using ip addr show eth0.
Restart your C++ application on app-node-1. Because Keepalived preempts by default, app-node-1 will reclaim the VIP once its priority is again the higher of the two.

This setup keeps your C++ application reachable via the stable VIP when a single application node fails, abstracting away the underlying server failure.
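The test above covers an application that stops while its Droplet stays up. If the active Droplet itself disappears, the standby instead waits for advertisements to time out. The sketch below computes that wait, assuming Keepalived is running VRRP version 2 (its default) and using the timer formula from RFC 3768. The function name is made up for this example.

```cpp
#include <cassert>

// Seconds a BACKUP node waits without advertisements before declaring the
// MASTER down, per VRRPv2 (RFC 3768):
//   Skew_Time            = (256 - priority) / 256
//   Master_Down_Interval = 3 * advert_int + Skew_Time
double master_down_interval(int backup_priority, double advert_int_seconds) {
    assert(backup_priority >= 1 && backup_priority <= 254);
    assert(advert_int_seconds > 0.0);
    const double skew_time = (256.0 - backup_priority) / 256.0;
    return 3.0 * advert_int_seconds + skew_time;
}
```

For a standby with priority 90 and advert_int 1, this works out to a little over three and a half seconds before the standby promotes itself.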

Integrating MongoDB and C++ Application Failover

The final piece of the puzzle is ensuring your C++ application connects to the correct MongoDB primary and can adapt if the primary changes. MongoDB drivers typically handle replica set connections gracefully.

C++ Application MongoDB Connection String

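When configuring your C++ application to connect to MongoDB, use a connection string that specifies the replica set name and lists multiple members of the replica set. This allows the driver to discover the current primary and to reconnect if a failover occurs. The listed hosts act only as seeds: they must resolve from the application nodes (for example via DNS or /etc/hosts entries mapping to the private IPs), and the driver then uses the member addresses stored in the replica set configuration. If authorization is enabled, the connection string also needs credentials.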

mongodb://mongo-node-1:27017,mongo-node-2:27017,mongo-node-3:27017/?replicaSet=myReplicaSet&readPreference=primary

The replicaSet=myReplicaSet parameter is critical. The MongoDB C++ driver (and most other drivers) will automatically:

Connect to one of the listed hosts.
Discover the current primary member of the replica set.
Update its internal view of the replica set topology as members change (e.g., during a failover).
Automatically reconnect to the new primary if the current primary becomes unavailable.
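If the seed list lives in application configuration, the connection string can be assembled at startup. The helper below is a minimal sketch using only the standard library. Its name and parameters are hypothetical, it is not part of the MongoDB C++ driver, and it does not add or percent-encode credentials.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Joins seed hosts ("host:port") and options into a replica set connection URI.
std::string build_mongo_uri(const std::vector<std::string>& seed_hosts,
                            const std::string& replica_set,
                            const std::string& read_preference = "primary") {
    assert(!seed_hosts.empty());
    std::string uri = "mongodb://";
    for (std::size_t i = 0; i < seed_hosts.size(); ++i) {
        if (i > 0) uri += ',';  // hosts are comma-separated in the seed list
        uri += seed_hosts[i];
    }
    uri += "/?replicaSet=" + replica_set;
    uri += "&readPreference=" + read_preference;
    return uri;
}
```

Called with the three hosts above and "myReplicaSet", this returns the same string shown earlier. In an application using mongocxx, that string would typically be wrapped in a mongocxx::uri and passed to a mongocxx::client.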

Ensure your C++ application’s firewall rules allow outbound connections to the MongoDB nodes on port 27017. Similarly, MongoDB nodes must allow inbound connections from the application nodes.

Handling Application Failover Scenarios

When the C++ application node fails over (due to Keepalived detecting an issue and reassigning the VIP), the new active application instance will continue to use the same MongoDB connection string. Because the MongoDB driver is aware of the replica set, it will automatically connect to the current primary. If the MongoDB primary fails over, the C++ application driver will detect this and switch to the new primary once it’s elected.

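Operations that are in flight when the primary steps down can still fail during the election, which typically completes within several seconds under default settings. Recent drivers retry eligible reads and writes once automatically. For applications that cannot tolerate errors during that window, consider application-level logic that detects the failure and retries the operation. For most use cases, however, the automatic failover handled by the MongoDB driver is sufficient.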
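One way to structure such application-level retries is a small wrapper with exponential backoff. The sketch below is driver-agnostic: the operation is any callable, and it catches std::exception for brevity. A real application would more likely catch only the driver’s exception types that indicate a transient failure, and should retry only operations that are safe to repeat.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Runs `operation` until it succeeds or `max_attempts` is reached, doubling
// the delay after each failure. Returns the number of attempts used and
// rethrows the last exception if every attempt fails.
int retry_with_backoff(const std::function<void()>& operation,
                       int max_attempts,
                       std::chrono::milliseconds initial_delay) {
    assert(max_attempts > 0);
    auto delay = initial_delay;
    for (int attempt = 1;; ++attempt) {
        try {
            operation();
            return attempt;
        } catch (const std::exception&) {
            if (attempt == max_attempts) throw;  // out of attempts: propagate
            std::this_thread::sleep_for(delay);
            delay *= 2;
        }
    }
}
```

Choose the attempt count and initial delay so that the total wait comfortably covers an election, for example five attempts starting at 500 ms.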

Monitoring and Maintenance

A robust disaster recovery strategy is not complete without comprehensive monitoring and a well-defined maintenance plan.

Monitoring Tools

Utilize DigitalOcean’s built-in monitoring for Droplet health (CPU, memory, disk I/O). For more granular insights:

MongoDB Monitoring: Use MongoDB’s built-in tools like mongostat, mongotop, and rs.status(). Consider integrating with external monitoring solutions like Prometheus with the MongoDB exporter, or commercial tools like Datadog or New Relic. Key metrics to watch include replication lag (how far each secondary trails the primary), the oplog window, network latency between nodes, and member state.
Keepalived Monitoring: Regularly check Keepalived logs for state transitions and VRRP advertisements. Ensure the VIP is always assigned to an active node.
Application Monitoring: Implement application-level health checks and expose metrics that indicate application performance and availability.
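Replication lag can be derived from the per-member state and optime values that rs.status() reports (stateStr and optimeDate). The sketch below assumes those values have already been extracted into a simplified struct. The struct, its field names, and the alert function are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified view of one entry in rs.status().members.
struct Member {
    std::string name;
    std::string state;         // "PRIMARY", "SECONDARY", ...
    long long optime_seconds;  // last applied operation, seconds since epoch
};

// Largest lag in seconds of any secondary behind the primary,
// or -1 if the set currently has no primary.
long long max_replication_lag(const std::vector<Member>& members) {
    const Member* primary = nullptr;
    for (const auto& m : members) {
        if (m.state == "PRIMARY") primary = &m;
    }
    if (primary == nullptr) return -1;
    long long worst = 0;
    for (const auto& m : members) {
        if (m.state != "SECONDARY") continue;
        const long long lag = primary->optime_seconds - m.optime_seconds;
        if (lag > worst) worst = lag;
    }
    return worst;
}

// Alerts when there is no primary or the lag exceeds the threshold.
bool lag_alert(const std::vector<Member>& members, long long threshold_seconds) {
    assert(threshold_seconds >= 0);
    const long long lag = max_replication_lag(members);
    return lag < 0 || lag > threshold_seconds;
}
```

A check like this can run on a schedule and feed whichever alerting system you already use. The right threshold depends on how much stale data a failover to that secondary could expose.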

Maintenance Procedures

Perform regular maintenance to ensure the system remains stable and up-to-date:

MongoDB Upgrades: Plan and execute MongoDB version upgrades carefully, following the official documentation for rolling upgrades of replica sets to minimize downtime.
Keepalived Configuration Updates: Test any changes to keepalived.conf on a staging environment before applying to production.
Application Updates: Deploy new versions of your C++ application using a blue-green deployment strategy or similar to ensure zero downtime during updates.
Regular Backups: While replica sets provide high availability, they are not a substitute for backups. Implement a robust backup strategy (e.g., using mongodump or filesystem snapshots) and regularly test your restore process.

By combining MongoDB’s native replication with Keepalived for application-level failover, and ensuring your C++ application is configured to leverage these HA features, you can build a resilient and highly available system on DigitalOcean.