Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and C Deployments on OVH
Understanding MongoDB Replica Sets and OVH’s Public Cloud Infrastructure
Architecting for disaster recovery, particularly automated failover, in a distributed database like MongoDB requires a deep understanding of both the database’s internal mechanisms and the underlying cloud infrastructure. For deployments on OVH’s Public Cloud, this means leveraging their robust network, compute, and storage offerings to build resilient MongoDB replica sets.
A MongoDB replica set is a group of mongod processes that maintain the same data set. It provides redundancy and high availability. A replica set consists of:
- Primary: A single mongod process that receives all write operations.
- Secondaries: mongod processes that replicate the primary’s oplog and apply its operations. They can serve read operations.
- Arbiter (Optional): A mongod process that does not hold data but participates in elections. It’s typically used to break ties in replica sets with an even number of data-bearing nodes.
The key to automated failover lies in the replica set’s election process. When the primary becomes unavailable, the remaining members of the replica set hold an election to choose a new primary from the available secondaries. This process is managed by the replica set’s internal heartbeat mechanism and configuration.
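The arithmetic behind an election can be sketched in a few lines. This is a minimal illustration, assuming every member counted is a voting member; the function names are made up for this example and are not part of MongoDB.

```javascript
// Votes required to elect a primary among `votingMembers` voters (strict majority).
function majorityNeeded(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// How many voting members can be lost while an election can still succeed.
function faultTolerance(votingMembers) {
  return votingMembers - majorityNeeded(votingMembers);
}

for (const n of [2, 3, 4, 5]) {
  console.log(`${n} voters: majority ${majorityNeeded(n)}, tolerates ${faultTolerance(n)} failure(s)`);
}
```

Note that four voters tolerate no more failures than three, which is why odd member counts are the usual choice.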
Designing the MongoDB Replica Set Topology for OVH
For a production-ready, auto-failover setup on OVH Public Cloud, we’ll aim for a minimum of three data-bearing nodes. This ensures that even if one node fails, the remaining two still form a majority and can elect a primary. Where the chosen OVH region offers multiple Availability Zones (AZs), spreading the nodes across them is crucial for resilience against datacenter-level failures.
Consider the following topology:
- Node 1: Primary (e.g., `mongo-01` in AZ-A)
- Node 2: Secondary (e.g., `mongo-02` in AZ-B)
- Node 3: Secondary (e.g., `mongo-03` in AZ-C)
- Node 4 (Optional): Arbiter (e.g., `mongo-arbiter-01` in AZ-A) – useful only when you have an even number of data-bearing nodes and need an odd number of voting members. With the three data-bearing nodes above, omit it: a fourth voting member raises the required majority to three without tolerating any additional failure.
Each node should be provisioned on a separate OVH Public Cloud instance (e.g., a General Purpose instance type like the `GP-SSD-2` or `GP-SSD-4` depending on I/O requirements). Ensure these instances are within the same OVH region but distributed across different Availability Zones. Network configuration is paramount: all MongoDB nodes must be able to communicate with each other on the MongoDB port (default 27017) via private IP addresses. Security groups or firewall rules must be configured to allow this traffic.
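One way to sanity-check a planned layout is to ask which zones, if lost entirely, would leave the set without a voting majority. The sketch below assumes a hand-written `topology` array (the `host` and `zone` field names are invented for this example) and that every listed member votes.

```javascript
// Hypothetical description of the planned layout: one entry per voting member.
const topology = [
  { host: "mongo-01", zone: "AZ-A" },
  { host: "mongo-02", zone: "AZ-B" },
  { host: "mongo-03", zone: "AZ-C" },
];

// Returns the zones whose complete loss would leave fewer voters than a majority.
function zonesThatBreakQuorum(members) {
  const majority = Math.floor(members.length / 2) + 1;
  const perZone = new Map();
  for (const m of members) {
    perZone.set(m.zone, (perZone.get(m.zone) || 0) + 1);
  }
  const risky = [];
  for (const [zone, count] of perZone) {
    if (members.length - count < majority) risky.push(zone);
  }
  return risky;
}

console.log(zonesThatBreakQuorum(topology)); // [] for the three-zone layout above
```

If two of the three nodes shared a zone, that zone would appear in the result, because losing it would leave a single voter.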
Provisioning OVH Instances and Network Configuration
We’ll use the OVH API or the Horizon dashboard to provision the instances. For automation, the OVH API is preferred. Let’s assume we’re using Ubuntu 22.04 LTS.
Example: Provisioning an instance via OVH API (Conceptual – requires API client setup)
# This is a conceptual example. Actual API calls depend on your chosen SDK/method.
# POST /cloud/project/{serviceName}/instance
{
  "region": "GRA",
  "flavorId": "generalPurpose-2",
  "imageName": "ubuntu_22_04",
  "name": "mongo-01",
  "sshKeyId": "your-ssh-key-id",
  "volumeType": "high-performance",
  "volumeSize": 100,
  "ipAddress": {
    "type": "private",
    "subnetId": "your-private-subnet-id-az-a"
  }
}
Repeat this for `mongo-02` (in AZ-B) and `mongo-03` (in AZ-C), ensuring they are in different subnets corresponding to their respective AZs. Obtain the private IP addresses for each instance.
Next, configure network security. On OVH, this is typically done via Security Groups. Ensure that each instance is part of a security group that allows inbound traffic on port 27017 from the private IP addresses of the other MongoDB instances.
Example: Firewall rule on `mongo-01` allowing traffic from `mongo-02` and `mongo-03` on port 27017.
# On mongo-01 (assuming private IPs: 10.0.0.1, 10.0.0.2, 10.0.0.3)
sudo ufw allow from 10.0.0.2 to any port 27017 proto tcp
sudo ufw allow from 10.0.0.3 to any port 27017 proto tcp
# Repeat similar rules on mongo-02 and mongo-03 for each other's IPs.
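Writing these rules by hand for every node is error-prone, so they can be generated instead. The following is a small sketch that only prints the commands; the helper name `ufwRulesFor` is an assumption for this example, and the IPs are the sample addresses used above.

```javascript
// Private IPs from the example above; replace with your own.
const nodeIps = ["10.0.0.1", "10.0.0.2", "10.0.0.3"];

// For one node, build the ufw commands that admit every *other* member on the MongoDB port.
function ufwRulesFor(selfIp, allIps, port = 27017) {
  return allIps
    .filter((ip) => ip !== selfIp)
    .map((ip) => `sudo ufw allow from ${ip} to any port ${port} proto tcp`);
}

// Print the rule set for each node; nothing is executed here.
for (const ip of nodeIps) {
  console.log(`# On ${ip}`);
  for (const rule of ufwRulesFor(ip, nodeIps)) console.log(rule);
}
```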
Installing and Configuring MongoDB
On each provisioned instance, install MongoDB. It’s recommended to use the official MongoDB Community Edition packages.
Step 1: Add MongoDB Repository and Install
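# On all MongoDB nodes
# Import the MongoDB 6.0 signing key into a dedicated keyring (apt-key is deprecated on Ubuntu 22.04)
sudo apt-get install -y gnupg wget
wget -qO - https://www.mongodb.org/static/pgp/server-6.0.asc | sudo gpg --dearmor -o /usr/share/keyrings/mongodb-server-6.0.gpg
echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-6.0.gpg ] https://repo.mongodb.org/apt/ubuntu $(lsb_release -cs)/mongodb-org/6.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-6.0.list
sudo apt-get update
sudo apt-get install -y mongodb-org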
Step 2: Configure MongoDB for Replication
Edit the MongoDB configuration file (`/etc/mongod.conf`) on each node. The key changes are:
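- Set `net.bindIp` to the instance’s private IP together with `127.0.0.1` (so local `mongosh` sessions still work), or to `0.0.0.0` to listen on all interfaces and rely on firewall rules to restrict access.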
- Enable replication by setting `replication.replSetName` to a chosen name (e.g., `myReplicaSet`).
- Ensure `storage.dbPath` and `systemLog.path` are correctly set.
Example: `/etc/mongod.conf` on `mongo-01`
# mongod.conf
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true # Accepted by 6.0; the option was removed in MongoDB 6.1
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  port: 27017
  bindIp: 0.0.0.0 # Or 127.0.0.1 plus the private IP of this instance
# processManagement.fork is left unset: the packaged systemd unit runs mongod in the foreground
replication:
  replSetName: myReplicaSet # The name of your replica set
# sharding:
#   clusterRole: shardsvr # Only for members of a sharded cluster (configsvr for config servers); omit for a plain replica set
# security:
#   keyFile: /path/to/keyfile # For authentication, highly recommended
#   authorization: enabled
Important Security Note: For production, you MUST enable authentication and use a keyfile for inter-node communication. This involves generating a keyfile, distributing it securely to all nodes, and configuring `security.keyFile` and `security.authorization: enabled` in `mongod.conf`. The example above leaves these settings commented out for brevity, but they are critical for production.
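Generating the keyfile content can be sketched as follows, mirroring the common `openssl rand -base64 756` approach. This is an illustration run with Node.js, not an official tool: the function names and the example path are assumptions, and distributing the file to the other nodes is left out.

```javascript
const crypto = require("crypto");
const fs = require("fs");

// 756 random bytes encode to 1008 base64 characters, inside MongoDB's 6 to 1024 character limit.
function generateKeyfileContent() {
  return crypto.randomBytes(756).toString("base64");
}

// Writes the key readable by the owner only (0400); mongod rejects key files
// that are accessible to group or others on Unix systems.
function writeKeyfile(path) {
  fs.writeFileSync(path, generateKeyfileContent() + "\n", { mode: 0o400 });
}

// Example (hypothetical path): writeKeyfile("./mongo-keyfile");
```

The same file must be present on every member and owned by the user that runs `mongod`.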
After configuring, restart MongoDB and enable it to start on boot:
# On all MongoDB nodes
sudo systemctl daemon-reload
sudo systemctl enable mongod
sudo systemctl restart mongod
sudo systemctl status mongod
Initializing the Replica Set
Once all nodes are up and running with the correct configuration, connect to one of the nodes (preferably the one intended to be the initial primary) using the `mongosh` client. Then, initiate the replica set.
Step 1: Connect to a MongoDB instance
# On mongo-01 (or any node)
mongosh
Step 2: Initiate the replica set
Inside the `mongosh` shell, run the `rs.initiate()` command. This command takes a configuration document that specifies the replica set members. Use the private IP addresses of your OVH instances.
rs.initiate(
  {
    _id: "myReplicaSet",
    members: [
      { _id: 0, host: "10.0.0.1:27017" }, // mongo-01 private IP
      { _id: 1, host: "10.0.0.2:27017" }, // mongo-02 private IP
      { _id: 2, host: "10.0.0.3:27017" }  // mongo-03 private IP
      // Add arbiter if used: { _id: 3, host: "10.0.0.4:27017", arbiterOnly: true }
    ]
  }
)
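When instances are provisioned by automation, the same configuration document can be built from a list of hosts rather than typed by hand. The helper below is a sketch with an invented name (`buildReplSetConfig`); the optional `priorities` map is an assumption added to show where a member's `priority` would go.

```javascript
// Builds an rs.initiate() document from an ordered list of "host:port" strings.
// `priorities` is optional; members without an entry keep MongoDB's default priority of 1.
function buildReplSetConfig(name, hosts, priorities = {}) {
  return {
    _id: name,
    members: hosts.map((host, index) => {
      const member = { _id: index, host };
      if (host in priorities) member.priority = priorities[host];
      return member;
    }),
  };
}

const config = buildReplSetConfig(
  "myReplicaSet",
  ["10.0.0.1:27017", "10.0.0.2:27017", "10.0.0.3:27017"],
  { "10.0.0.1:27017": 2 } // prefer mongo-01 as primary while it is healthy
);
console.log(JSON.stringify(config, null, 2));
// In mongosh, the result could then be passed on: rs.initiate(config)
```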
After running `rs.initiate()`, MongoDB will elect a primary. You can check the status of the replica set using `rs.status()`.
rs.status()
The output should show one member as `PRIMARY` and the others as `SECONDARY`. The `health` field for all members should be `1` (healthy).
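Because `rs.status()` returns a large document, it helps to reduce it to the few fields that matter here. The sketch below reads only `members[].name`, `members[].stateStr` and `members[].health`; the function name and the hand-written sample document are assumptions for illustration.

```javascript
// Reduces an rs.status() result to the facts that matter for failover checks.
function summarizeReplSet(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  return {
    primary: primary ? primary.name : null,
    secondaries: status.members.filter((m) => m.stateStr === "SECONDARY").map((m) => m.name),
    unhealthy: status.members.filter((m) => m.health !== 1).map((m) => m.name),
  };
}

// Hand-written sample containing only the fields used above; real output has many more.
const sampleStatus = {
  set: "myReplicaSet",
  members: [
    { name: "10.0.0.1:27017", health: 1, stateStr: "PRIMARY" },
    { name: "10.0.0.2:27017", health: 1, stateStr: "SECONDARY" },
    { name: "10.0.0.3:27017", health: 1, stateStr: "SECONDARY" },
  ],
};

console.log(summarizeReplSet(sampleStatus));
// In mongosh: summarizeReplSet(rs.status())
```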
Testing Automated Failover
To test the automated failover, simulate the failure of the current primary node. The simplest way is to stop the `mongod` process on the primary instance.
Step 1: Identify the current primary
# Connect to any replica set member
mongosh
# Check status
rs.status()
Note the `host` of the primary member.
Step 2: Stop the primary MongoDB process
# On the primary instance
sudo systemctl stop mongod
Step 3: Observe the election
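Wait for a short period. With the default `settings.electionTimeoutMillis` of 10 seconds, a new primary is typically elected within roughly 12 seconds of an unplanned failure; a clean shutdown can be faster, since the primary attempts to step down first. Connect to one of the remaining secondary nodes and check `rs.status()` again. You should see a new primary elected.
# Connect to a secondary node
mongosh
# Check status
rs.status()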
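The `PRIMARY` role should now be assigned to one of the previously secondary nodes. The replica set configuration itself does not change; the surviving members simply elect a new primary among themselves. When `mongod` is started again on the stopped node, it should rejoin as a secondary and catch up from the oplog.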
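To measure how long the election takes rather than checking by hand, the status can be polled until a different member reports `PRIMARY`. This is a sketch under stated assumptions: `waitForNewPrimary` is an invented helper, the status source and the sleep function are passed in so the logic stays testable, and the reported wait is the nominal polling time, not a wall-clock measurement.

```javascript
// Polls until a member other than `oldPrimary` reports PRIMARY, or the timeout elapses.
// `getStatus` returns an rs.status()-shaped object; `sleepFn(ms)` is required and must block.
function waitForNewPrimary(getStatus, oldPrimary, { timeoutMs = 60000, intervalMs = 1000, sleepFn }) {
  let waited = 0;
  while (waited <= timeoutMs) {
    let status = null;
    try {
      status = getStatus();
    } catch (err) {
      // The connection may drop briefly during an election; retry on the next poll.
    }
    const primary = status &&
      status.members.find((m) => m.stateStr === "PRIMARY" && m.name !== oldPrimary);
    if (primary) return { newPrimary: primary.name, waitedMs: waited };
    sleepFn(intervalMs);
    waited += intervalMs;
  }
  return { newPrimary: null, waitedMs: waited };
}

// In mongosh, connected to a surviving member (sleep is a mongosh built-in):
// waitForNewPrimary(() => rs.status(), "10.0.0.1:27017", { sleepFn: sleep });
```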
Monitoring and Alerting for Production
Automated failover is only part of a robust disaster recovery strategy. Comprehensive monitoring and alerting are essential to detect failures promptly and to understand the health of your replica set.
Key metrics to monitor:
- Replica Set Health: The `health` status of each member.
- Primary/Secondary Status: Which node is currently primary.
- Oplog Lag: The delay between an operation being written to the primary’s oplog and being applied on a secondary. Sustained high lag indicates replication issues and widens the window of recent writes that a failover could roll back.
- Network Latency: Latency between nodes, especially across AZs.
- Disk I/O and Usage: Crucial for database performance.
- CPU and Memory Usage: For instance health.
Tools like Prometheus with the `mongodb_exporter`, Datadog, or OVH’s built-in monitoring can be used. Configure alerts for:
- Replica set member health dropping to unhealthy.
- Oplog lag exceeding a defined threshold (e.g., 60 seconds).
- No primary elected for an extended period.
- Instance-level resource exhaustion (CPU, memory, disk).
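Several of these alert conditions can be derived directly from `rs.status()`. The sketch below is an illustration rather than a replacement for a monitoring stack: the function name and alert labels are invented, and lag is estimated from each member's `optimeDate`.

```javascript
// Derives basic alerts from an rs.status()-shaped object.
// Lag is the difference between the primary's and a secondary's optimeDate, in seconds.
function replicaSetAlerts(status, thresholdSeconds = 60) {
  const alerts = [];
  for (const m of status.members) {
    if (m.health !== 1) alerts.push({ alert: "unhealthy-member", member: m.name });
  }
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    alerts.push({ alert: "no-primary" });
    return alerts;
  }
  for (const m of status.members) {
    if (m.stateStr !== "SECONDARY") continue;
    const lagSeconds = (primary.optimeDate - m.optimeDate) / 1000;
    if (lagSeconds > thresholdSeconds) {
      alerts.push({ alert: "oplog-lag", member: m.name, lagSeconds });
    }
  }
  return alerts;
}

// In mongosh: replicaSetAlerts(rs.status(), 60)
```

A scheduled job could run this against any member and forward a non-empty result to the alerting channel of your choice.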
Advanced Considerations and Best Practices
1. Read Preferences: Configure your application’s read preference to leverage secondary nodes for read operations, distributing load and improving availability. For example, `secondaryPreferred` or `nearest`. Keep in mind that secondaries replicate asynchronously, so such reads can return slightly stale data.
2. Write Concerns: Use appropriate write concerns (e.g., `w: majority`) to ensure data durability. With `w: majority`, a write is acknowledged by a majority of the voting, data-bearing members before success is returned to the client, so an acknowledged write survives a failover instead of being rolled back.
// Example of setting write concern in mongosh
db.collection.insertOne({ item: "book" }, { writeConcern: { w: "majority", wtimeout: 5000 } });
3. Node Configuration (`priority` and `votes`): In the `rs.initiate()` configuration or via `rs.reconfig()`, you can set `priority` and `votes` for members. Higher priority nodes are more likely to be elected primary. `votes` determine if a node participates in elections. Ensure your configuration aligns with your availability goals and AZ distribution.
4. Network Stability: OVH’s Public Cloud offers good network reliability, but inter-AZ latency can be a factor. Monitor this and ensure your MongoDB configuration (e.g., election timeouts) is tuned appropriately. Consider using dedicated network interfaces if extreme low latency is required.
5. Backup Strategy: Automated failover handles instance/node failures, but not data corruption or accidental deletion. Implement a robust backup strategy (e.g., using `mongodump` or filesystem snapshots) and regularly test restores.
6. Sharding: For very large datasets or high throughput requirements, consider sharding your MongoDB deployment. This adds another layer of complexity but allows for horizontal scaling beyond a single replica set.
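The read preference and write concern from items 1 and 2 are often set once in the application's connection string, which should also list every member so the driver can find the new primary after a failover. The helper below is a sketch with an invented name (`buildMongoUri`); `replicaSet`, `readPreference`, `w` and `wtimeoutMS` are standard connection string options, and the default values chosen here are assumptions.

```javascript
// Builds a replica-set connection string that lists every member as a seed host.
function buildMongoUri(hosts, replicaSet, { readPreference = "primary", w = "majority", wtimeoutMS = 5000 } = {}) {
  const params = new URLSearchParams({
    replicaSet,
    readPreference,
    w,
    wtimeoutMS: String(wtimeoutMS),
  });
  return `mongodb://${hosts.join(",")}/?${params.toString()}`;
}

const uri = buildMongoUri(
  ["10.0.0.1:27017", "10.0.0.2:27017", "10.0.0.3:27017"],
  "myReplicaSet",
  { readPreference: "secondaryPreferred" }
);
console.log(uri);
```

With authentication enabled, credentials would also need to be supplied, ideally through the driver's options rather than embedded in the string.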
By carefully designing the replica set topology, configuring instances on OVH Public Cloud, and implementing robust monitoring, you can achieve a highly available MongoDB deployment with automated failover capabilities.