Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and C++ Deployments on OVH

Establishing a Multi-Region MongoDB Replica Set on OVH

A multi-region MongoDB replica set is the foundation of this disaster recovery design. We’ll spread it across OVH’s European data centers, specifically GRA (Gravelines) and RBX (Roubaix), running on dedicated servers. The goal is automatic failover with minimal data loss and downtime.

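Our setup involves three voting members: a primary and a secondary, both data-bearing, plus an arbiter. We’ll place the arbiter in a third, geographically distinct location (e.g., LIM, OVH’s Limburg site) if available, or in a separate zone within the same region if not. The arbiter holds no data but votes in elections, so the set keeps a two-of-three majority and can still elect a primary if one data-bearing node becomes unavailable. An arbiter is not a hidden member: hidden members hold data but are invisible to clients. Also note that with only two data-bearing nodes, writes that use a "majority" write concern cannot be acknowledged while one of them is down. If you depend on majority writes, use a third data-bearing secondary instead of the arbiter.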
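
The majority rule behind elections is simple arithmetic, sketched below. The function names are our own for illustration and are not part of any MongoDB API.

```cpp
#include <cassert>
#include <cstddef>

// Votes required to elect a primary when `voting_members` members can vote.
inline std::size_t majority_needed(std::size_t voting_members) {
    assert(voting_members > 0 && "a replica set needs at least one voting member");
    return voting_members / 2 + 1;
}

// Number of voting members that can be unreachable while an election can still succeed.
inline std::size_t tolerated_failures(std::size_t voting_members) {
    return voting_members - majority_needed(voting_members);
}
```

For our three-member set, majority_needed(3) is 2 and tolerated_failures(3) is 1. A fourth voting member would not help: tolerated_failures(4) is still 1, which is why odd member counts are the usual choice.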

MongoDB Node Configuration (Example: Ubuntu 22.04 LTS)

Each MongoDB node will require specific configuration. We’ll assume dedicated servers with static IP addresses. Ensure firewalls are configured to allow traffic on port 27017 (default MongoDB port) between all nodes.

Node 1: Primary (GRA)

On the primary node (e.g., 192.168.1.101 in GRA), edit the MongoDB configuration file, typically /etc/mongod.conf.

# /etc/mongod.conf

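storage:
  dbPath: /var/lib/mongodb
  # Journaling is always enabled from MongoDB 6.1 onward, where the
  # storage.journal.enabled option was removed, so it is not set here.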
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  bindIp: 0.0.0.0
  port: 27017
security:
  keyFile: /etc/mongodb-keyfile.pem
replication:
  replSetName: myReplicaSet
# No sharding section: this deployment is a plain replica set. A config server
# replica set (clusterRole: configsvr) cannot include an arbiter.

Generate the key file (must be identical on all replica set members):

openssl rand -base64 756 | sudo tee /etc/mongodb-keyfile.pem > /dev/null
sudo chmod 400 /etc/mongodb-keyfile.pem
sudo chown mongodb:mongodb /etc/mongodb-keyfile.pem

Restart MongoDB:

sudo systemctl restart mongod

Node 2: Secondary (RBX)

Configure /etc/mongod.conf identically to Node 1, but ensure the IP address in bindIp (if not 0.0.0.0) is the server’s local IP. Copy the generated /etc/mongodb-keyfile.pem to this node and set permissions.

# On Node 1 (GRA)
scp /etc/mongodb-keyfile.pem user@node2_ip:/etc/mongodb-keyfile.pem

# On Node 2 (RBX)
sudo chmod 400 /etc/mongodb-keyfile.pem
sudo chown mongodb:mongodb /etc/mongodb-keyfile.pem
sudo systemctl restart mongod

Node 3: Arbiter (LIM or separate zone)

The arbiter holds no user data, but it still needs a dbPath: mongod stores the replica set configuration there. Point it at a small, dedicated directory. Beyond that, the arbiter only needs to bind to a network interface and take part in elections.

# /etc/mongod.conf (Arbiter Node)

storage:
  dbPath: /var/lib/mongodb # Holds only configuration data on an arbiter
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  bindIp: 0.0.0.0
  port: 27017
security:
  keyFile: /etc/mongodb-keyfile.pem
replication:
  replSetName: myReplicaSet
# No sharding configuration needed for an arbiter

Copy the key file and restart MongoDB on the arbiter node.

# On Node 1 (GRA)
scp /etc/mongodb-keyfile.pem user@arbiter_ip:/etc/mongodb-keyfile.pem

# On Arbiter Node
sudo chmod 400 /etc/mongodb-keyfile.pem
sudo chown mongodb:mongodb /etc/mongodb-keyfile.pem
sudo systemctl restart mongod

Initializing the Replica Set

Connect to the primary node (Node 1) using the MongoDB shell and initiate the replica set.

# On Node 1 (GRA)
mongosh

# Inside mongosh:
rs.initiate(
  {
    _id: "myReplicaSet",
    members: [
      { _id: 0, host: "192.168.1.101:27017" }, // GRA Primary
      { _id: 1, host: "192.168.1.102:27017" }, // RBX Secondary
      { _id: 2, host: "192.168.1.103:27017", arbiterOnly: true } // LIM Arbiter
    ]
  }
)

Verify the replica set status:

# Inside mongosh:
rs.status()

You should see every member reporting health: 1, with one member in the ‘PRIMARY’ state, one in ‘SECONDARY’, and the arbiter in ‘ARBITER’. To confirm that the arbiter is configured with arbiterOnly: true, inspect the configuration with rs.conf().

Architecting C++ Application Auto-Failover

For C++ applications interacting with this MongoDB replica set, the driver handles much of the failover logic. However, the application itself needs to be resilient and potentially aware of the primary’s status for certain operations or for graceful shutdown/restart during failover events.

C++ MongoDB Driver Configuration

When connecting from your C++ application, specify the replica set name and provide a connection string that lists all members of the replica set. The driver will automatically discover the current primary and handle reconnections if the primary changes.
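
If the member list lives in configuration rather than in source code, the connection string can be assembled at startup. The helper below is a minimal sketch under that assumption. Its name and parameters are hypothetical, while replicaSet, connectTimeoutMS and serverSelectionTimeoutMS are standard connection-string options.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds a replica-set connection string from a seed list.
// `hosts` entries are "host:port" strings; timeouts are in milliseconds.
inline std::string build_replica_set_uri(const std::vector<std::string>& hosts,
                                         const std::string& replica_set,
                                         int connect_timeout_ms,
                                         int server_selection_timeout_ms) {
    assert(!hosts.empty() && "at least one seed host is required");
    std::string uri = "mongodb://";
    for (std::size_t i = 0; i < hosts.size(); ++i) {
        if (i > 0) uri += ',';  // Seed hosts are comma-separated
        uri += hosts[i];
    }
    uri += "/?replicaSet=" + replica_set;
    uri += "&connectTimeoutMS=" + std::to_string(connect_timeout_ms);
    uri += "&serverSelectionTimeoutMS=" + std::to_string(server_selection_timeout_ms);
    return uri;
}
```

The resulting string can be passed straight to the mongocxx::uri constructor. The sketch does not percent-encode anything, so it is only suitable for plain host names and replica set names.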

#include <iostream>
#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <bsoncxx/types.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/exception/exception.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

int main() {
    try {
        // Initialize mongocxx (exactly once per process)
        mongocxx::instance instance{};

        // Connection URI with replica set name and all members.
        // Timeouts are connection-string options: connectTimeoutMS bounds each
        // connection attempt, serverSelectionTimeoutMS bounds the wait for a usable server.
        // Replace with your actual IPs and domain names if applicable.
        // With keyFile authentication enabled, add user credentials to the URI as well.
        mongocxx::uri uri(
            "mongodb://192.168.1.101:27017,192.168.1.102:27017,192.168.1.103:27017/"
            "?replicaSet=myReplicaSet&connectTimeoutMS=5000&serverSelectionTimeoutMS=10000");

        // Connect to MongoDB
        mongocxx::client client(uri);

        // Access a database and collection
        auto db = client["mydatabase"];
        auto collection = db["mycollection"];

        // Perform an operation (e.g., insert a document)
        auto result = collection.insert_one(bsoncxx::builder::basic::make_document(
            bsoncxx::builder::basic::kvp("name", "test document"),
            bsoncxx::builder::basic::kvp("value", 123)
        ));

        // The result is empty for unacknowledged writes, so check before use
        if (result) {
            std::cout << "Inserted document with _id: "
                      << result->inserted_id().get_oid().value.to_string() << std::endl;
        }

    } catch (const mongocxx::exception& e) {
        std::cerr << "MongoDB Exception: " << e.what() << std::endl;
        // Implement application-level retry logic or graceful degradation here
        return 1;
    } catch (const std::exception& e) {
        std::cerr << "General Exception: " << e.what() << std::endl;
        return 1;
    }

    return 0;
}

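The server selection timeout (serverSelectionTimeoutMS in the connection string) is crucial. If the driver cannot find a suitable server within this window, for example while an election is in progress, the operation throws an exception. This is where your application’s resilience logic kicks in. Keep the value above the expected election time so that a routine failover does not surface as an error.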

Application-Level Resilience and Monitoring

While the MongoDB driver handles automatic failover, your C++ application should implement strategies to manage connection errors gracefully:

Retry Mechanisms: Implement exponential backoff retry logic for database operations that fail due to network issues or temporary unavailability during a failover.
Health Checks: Periodically check the health of the MongoDB connection, for example by running the ping command against the admin database and treating an exception or timeout as unhealthy.
Graceful Degradation: If database operations consistently fail, your application might need to enter a degraded mode, perhaps serving stale data from a cache or queuing requests for later processing.
Monitoring and Alerting: Integrate with monitoring systems (e.g., Prometheus, Datadog) to track database connection errors, replica set status, and election events. OVH’s monitoring tools can also be leveraged.
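
As one way to decide when to enter degraded mode, the sketch below counts consecutive failures against a threshold. The class and its policy are our own illustration, not a driver feature. A real service would likely add a cool-down period before leaving degraded mode.

```cpp
#include <cassert>

// Tracks consecutive database failures and reports when the application
// should switch to a degraded mode (hypothetical policy for illustration).
class DegradedModeTracker {
public:
    explicit DegradedModeTracker(int failure_threshold)
        : failure_threshold_(failure_threshold) {
        assert(failure_threshold > 0);
    }

    // Any success resets the streak, so isolated errors never trip the switch.
    void record_success() { consecutive_failures_ = 0; }
    void record_failure() { ++consecutive_failures_; }

    bool degraded() const { return consecutive_failures_ >= failure_threshold_; }

private:
    int failure_threshold_;
    int consecutive_failures_ = 0;
};
```

Call record_failure() from the catch block around each database operation and record_success() after each successful one. While degraded() is true, serve cached data or queue requests instead of hitting the database.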

Example: Basic Retry Logic (Conceptual)

This is a simplified illustration. A production-ready solution would involve more sophisticated error handling and backoff strategies.

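#include <chrono>
#include <iostream>
#include <thread>
#include <bsoncxx/document/view.hpp>
#include <mongocxx/collection.hpp>
#include <mongocxx/exception/exception.hpp>

// Inserts a document, retrying with exponential backoff on driver errors.
// Give the document an explicit _id: if an earlier attempt actually succeeded,
// the retry then fails with a duplicate key error instead of inserting twice.
void insert_with_retry(mongocxx::collection& collection, bsoncxx::document::view doc) {
    const int max_retries = 5;
    int retry_count = 0;
    std::chrono::milliseconds delay(100); // Initial delay

    while (retry_count < max_retries) {
        try {
            // Attempt database operation
            collection.insert_one(doc);
            // If successful, break the loop
            break;
        } catch (const mongocxx::exception& e) {
            std::cerr << "Database operation failed: " << e.what() << std::endl;
            retry_count++;
            if (retry_count >= max_retries) {
                std::cerr << "Max retries reached. Operation failed permanently." << std::endl;
                // Handle permanent failure: log, alert, enter degraded mode
                throw; // Re-throw or handle appropriately
            }
            // Exponential backoff
            std::this_thread::sleep_for(delay);
            delay *= 2; // Double the delay for next retry
        }
    }
}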
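
The same idea can be factored into a reusable helper that works for any operation and caps the delay so that retries never wait unboundedly. The following is a sketch that depends only on the standard library. The names backoff_delay and retry_with_backoff are our own, and the sleep function is injectable purely so the logic can be tested without real waiting.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

using Millis = std::chrono::milliseconds;

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `cap`.
inline Millis backoff_delay(int attempt, Millis base, Millis cap) {
    assert(attempt >= 0);
    Millis delay = base;
    for (int i = 0; i < attempt && delay < cap; ++i) delay *= 2;
    return std::min(delay, cap);
}

// Calls `op` until it returns without throwing or `max_attempts` is reached.
// The last failure is re-thrown to the caller.
template <typename Op>
auto retry_with_backoff(Op op, int max_attempts, Millis base, Millis cap,
                        const std::function<void(Millis)>& sleep_fn =
                            [](Millis d) { std::this_thread::sleep_for(d); })
    -> decltype(op()) {
    assert(max_attempts > 0);
    for (int attempt = 0;; ++attempt) {
        try {
            return op();
        } catch (const std::exception&) {
            if (attempt + 1 >= max_attempts) throw;  // Out of attempts: propagate
            sleep_fn(backoff_delay(attempt, base, cap));
        }
    }
}
```

A database call would be wrapped in a lambda, for example one that calls insert_one and returns its result. In production it is usually worth adding random jitter to the delay so that many clients do not retry in lockstep after a failover.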

OVH Infrastructure Considerations

Leveraging OVH’s infrastructure requires attention to network configuration and server management.

Network Configuration

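Ensure that your OVH dedicated servers have static IP addresses and that firewall rules (both on the servers and potentially OVH’s network firewall) allow traffic on port 27017 between all MongoDB nodes, and only from those nodes and your application servers. For inter-region communication, OVH’s private network offering (vRack), where available for your servers, keeps replication traffic off public interfaces and can reduce latency and cost. If nodes in different OVH regions have no private link, they must use public IPs. In that case, restrict port 27017 to known addresses and enable TLS for MongoDB traffic.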

Server Provisioning and Management

Use OVH’s control panel to provision dedicated servers. Consider using configuration management tools like Ansible or Chef to automate the setup of MongoDB and your C++ application across these servers. This ensures consistency and simplifies deployment and updates.

Monitoring with OVH Tools

OVH provides monitoring dashboards for your dedicated servers, including CPU, RAM, disk I/O, and network traffic. Monitor these metrics closely, especially during failover events, to identify any performance bottlenecks. You can also set up custom alerts for critical thresholds.

Testing Failover Scenarios

Regularly testing your failover mechanism is non-negotiable. Simulate failures to ensure your setup behaves as expected.

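Simulate Primary Failure: Stop the MongoDB process on the current primary node (e.g., sudo systemctl stop mongod). Observe the replica set status from a secondary node. A clean shutdown usually triggers an election almost immediately. After a crash or network loss, expect detection plus election to take roughly 10 to 12 seconds with the default electionTimeoutMillis of 10 seconds. Test your C++ application’s ability to reconnect and continue operations.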
Simulate Network Partition: Use firewall rules (e.g., iptables) to block network traffic between specific nodes. This helps test how the replica set handles network disruptions and ensures the arbiter correctly influences elections.
Simulate Node Failure: Shut down a secondary node or the arbiter. The replica set should continue operating, and the application should remain available.
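
Before running a partition test, it helps to predict which side should be able to elect a primary. The sketch below models that check under simplifying assumptions: it ignores member priorities and oplog freshness, and the Member type is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified model of one replica set member (hypothetical type for illustration).
struct Member {
    bool votes;         // Counts toward the election majority
    bool data_bearing;  // false for an arbiter
};

// True if the members reachable on one side of a partition could elect a primary:
// they need a strict majority of all votes and at least one data-bearing voter.
inline bool can_elect_primary(const std::vector<Member>& reachable,
                              std::size_t total_voting_members) {
    assert(total_voting_members > 0);
    std::size_t votes = 0;
    bool has_candidate = false;
    for (const Member& m : reachable) {
        if (m.votes) ++votes;
        if (m.votes && m.data_bearing) has_candidate = true;
    }
    return has_candidate && votes > total_voting_members / 2;
}
```

In our three-member set, a data-bearing node that can still reach the arbiter passes the check, while a data-bearing node isolated on its own does not and should step down or remain a secondary.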

Document the results of each test, including the time to failover and any application-level errors encountered. Use these findings to refine your configuration and application logic.
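
To record time to failover consistently, a small polling helper can timestamp the moment the application is able to work again. This is a sketch with hypothetical names. The probe is whatever check you choose, such as a test write through the driver, and is passed in as a callable so the helper itself has no MongoDB dependency.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

struct FailoverMeasurement {
    bool recovered;                     // false if the deadline passed first
    int probes;                         // Number of probe calls made
    std::chrono::milliseconds elapsed;  // Time from start until recovery or deadline
};

// Polls `probe` every `interval` until it returns true or `deadline` elapses.
inline FailoverMeasurement measure_failover(const std::function<bool()>& probe,
                                            std::chrono::milliseconds interval,
                                            std::chrono::milliseconds deadline) {
    assert(interval.count() > 0);
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    int probes = 0;
    while (true) {
        ++probes;
        const bool ok = probe();
        const auto elapsed =
            std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start);
        if (ok) return {true, probes, elapsed};
        if (elapsed >= deadline) return {false, probes, elapsed};
        std::this_thread::sleep_for(interval);
    }
}
```

Start the measurement right after stopping the primary and log the elapsed value next to the test scenario. The result is only as precise as the polling interval, so keep the interval well below the failover time you expect.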