Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and Python Deployments on OVH
Understanding MongoDB Replica Sets for High Availability
Achieving robust disaster recovery for MongoDB hinges on its native replica set functionality. A replica set is a group of MongoDB servers that maintain the same data set. This provides redundancy and high availability. At least three nodes are recommended for a production replica set: one primary and two secondaries. The primary handles all write operations, and secondaries replicate the primary’s oplog (operations log) to maintain an identical data set. If the primary becomes unavailable, the remaining secondaries elect a new primary.
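The three-node recommendation follows from how elections work: a new primary needs votes from a strict majority of the voting members. The sketch below only illustrates that arithmetic; the function names `majority` and `fault_tolerance` are made up for this example and are not part of any MongoDB tooling.

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: a strict majority of voting members."""
    if voting_members < 1:
        raise ValueError("a replica set needs at least one voting member")
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """How many voting members can be lost while an election can still succeed."""
    return voting_members - majority(voting_members)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} members: majority={majority(n)}, "
              f"tolerates {fault_tolerance(n)} failure(s)")
```

Under this rule a two-node set tolerates no failures, and a fourth node adds no tolerance over three, which is why odd member counts are the usual choice.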
Configuring a replica set involves starting each MongoDB instance with specific configuration options. The key setting is the `--replSet` command-line option (or its configuration-file equivalent, `replication.replSetName`), which assigns a name to the replica set. This name must be identical on all members.
Setting Up a Three-Node MongoDB Replica Set on OVH Instances
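For this example, we’ll assume three OVH Public Cloud instances (e.g., `mongo1.example.com`, `mongo2.example.com`, `mongo3.example.com`) running Ubuntu 22.04. Ensure MongoDB is installed on each instance. The `mongodb-org` package is distributed through MongoDB's own `apt` repository, so that repository must be added to the instance first; once it is configured, install the package via `apt`:
sudo apt update
sudo apt install -y mongodb-org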
Next, we need to configure each MongoDB instance to be part of the replica set. Edit the MongoDB configuration file, typically located at /etc/mongod.conf. Ensure the following settings are present or modified:
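# /etc/mongod.conf
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true # Removed in MongoDB 6.1+, where journaling is always on; omit on those versions
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  bindIp: 0.0.0.0 # Or specific IPs for security
  port: 27017
processManagement:
  # The packaged service scripts do not expect fork to be changed from its default;
  # consider omitting this line when mongod is managed by systemctl.
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
replication:
  replSetName: "rs0" # The name of our replica set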
After modifying the configuration file on all three instances, restart the MongoDB service:
sudo systemctl restart mongod
sudo systemctl enable mongod
Now, connect to one of the MongoDB instances (e.g., `mongo1.example.com`) using the `mongosh` client and initiate the replica set configuration. This step only needs to be performed once from any member of the intended replica set.
mongosh --host mongo1.example.com:27017
Inside the `mongosh` shell, run the `rs.initiate()` command with a configuration object specifying all members of the replica set. It’s crucial to include the hostname or IP address and the port for each member.
rs.initiate(
  {
    _id: "rs0",
    members: [
      { _id: 0, host: "mongo1.example.com:27017" },
      { _id: 1, host: "mongo2.example.com:27017" },
      { _id: 2, host: "mongo3.example.com:27017" }
    ]
  }
)
You can verify the replica set status by running `rs.status()` in the `mongosh` shell. You should see one primary and two secondaries. If you need to add more members later, you can use `rs.add("new_host:port")` from the primary.
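The same check can be scripted. The sketch below assumes the `replSetGetStatus` command response, which carries a `set` name and a `members` list with `name` and `stateStr` fields; `summarize_members` and `fetch_status` are hypothetical helper names. The first works on a plain dictionary, while the second needs PyMongo and a reachable replica set.

```python
from collections import Counter


def summarize_members(status: dict) -> dict:
    """Reduce a replSetGetStatus-style document to the fields worth checking."""
    members = status.get("members", [])
    primaries = [m["name"] for m in members if m.get("stateStr") == "PRIMARY"]
    return {
        "set": status.get("set"),
        "primary": primaries[0] if primaries else None,
        "states": dict(Counter(m.get("stateStr", "UNKNOWN") for m in members)),
    }


def fetch_status(uri: str) -> dict:
    """Fetch the live status document (requires PyMongo and network access)."""
    # Imported lazily so summarize_members() stays dependency-free.
    from pymongo import MongoClient

    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    try:
        return client.admin.command("replSetGetStatus")
    finally:
        client.close()
```

A healthy three-node set should summarize to one `PRIMARY` and two `SECONDARY` entries. If authentication is enabled, the connecting user needs permission to run `replSetGetStatus`.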
Architecting Python Application for MongoDB Failover
Your Python application needs to be aware of the MongoDB replica set and configured to connect to it correctly. The standard MongoDB Python driver (PyMongo) handles replica set connections automatically. When you provide a list of hosts in the connection string, PyMongo will discover the replica set topology and connect to the current primary. If the primary fails, PyMongo will detect the change and connect to the newly elected primary.
Here’s a typical PyMongo connection string for a replica set:
import datetime

from pymongo import MongoClient
from pymongo.errors import PyMongoError

# Connection string for a replica set
# Replace with your actual hostnames/IPs and replica set name
MONGO_URI = "mongodb://mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/?replicaSet=rs0"

client = None
try:
    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)  # Timeout in milliseconds
    # MongoClient connects lazily, so force server selection here.
    # The ping command is cheap and does not require auth.
    client.admin.command("ping")
    print("Successfully connected to MongoDB replica set.")

    # Example: Accessing a database and collection
    db = client.mydatabase
    collection = db.mycollection

    # Example: Inserting a document with a timezone-aware UTC timestamp
    result = collection.insert_one(
        {"name": "Test Document", "timestamp": datetime.datetime.now(datetime.timezone.utc)}
    )
    print(f"Inserted document with ID: {result.inserted_id}")
except PyMongoError as e:
    print(f"Error connecting to MongoDB: {e}")
    # Implement your application's error handling strategy here
    # e.g., retry connection, log error, switch to read-only mode, etc.
finally:
    # client stays None if MongoClient() itself raised
    if client is not None:
        client.close()
        print("MongoDB connection closed.")
The `serverSelectionTimeoutMS` parameter is crucial. It defines how long PyMongo will wait to find a suitable server (here, the primary) before raising an error; PyMongo's default is 30,000 ms (30 seconds). A value between 5,000 ms (5 seconds) and 10,000 ms (10 seconds) is often a good starting point for production environments. Adjust this based on your network latency and expected failover times.
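The timeout can also travel inside the connection string, since `serverSelectionTimeoutMS` is a standard URI option as well as a `MongoClient` keyword argument. The helper below is a minimal sketch of assembling such a URI; `build_replica_set_uri` is a hypothetical name, and it does not handle credentials or escape special characters in host names.

```python
from urllib.parse import urlencode


def build_replica_set_uri(hosts, replica_set, **options):
    """Assemble a mongodb:// URI from a host list plus query-string options."""
    if not hosts:
        raise ValueError("at least one host is required")
    # replicaSet comes first, followed by any extra options in the order given.
    query = {"replicaSet": replica_set, **options}
    return f"mongodb://{','.join(hosts)}/?{urlencode(query)}"


if __name__ == "__main__":
    uri = build_replica_set_uri(
        ["mongo1.example.com:27017", "mongo2.example.com:27017", "mongo3.example.com:27017"],
        "rs0",
        serverSelectionTimeoutMS=5000,
    )
    print(uri)
```

Keeping the timeout in the URI lets it be changed through configuration without touching application code.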
Simulating Failover and Testing Auto-Failover
Regularly testing your failover mechanism is paramount. The simplest way to simulate a primary failure is to stop the MongoDB process on the current primary node.
First, identify the current primary:
mongosh --host mongo1.example.com:27017
rs.status()
Note which member is listed as “PRIMARY”. Then, on that specific instance, stop the MongoDB service:
# On the current primary instance
sudo systemctl stop mongod
Observe the `mongosh` shell connected to one of the secondaries. With default settings, a new primary is typically elected within roughly 10 to 12 seconds (the default `electionTimeoutMillis` is 10 seconds), although network conditions can lengthen this. You can verify the result by running `rs.status()` again.
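To put a number on the failover window during a test, a small polling loop is enough. The sketch below is an illustration under stated assumptions: `wait_for_new_primary` is a hypothetical helper, and `get_primary` is any callable that returns the current primary or `None`.

```python
import time


def wait_for_new_primary(get_primary, old_primary, timeout=60.0, interval=0.5,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll until a primary other than old_primary appears.

    Returns (new_primary, seconds_waited), or (None, seconds_waited) on timeout.
    """
    start = clock()
    while True:
        try:
            current = get_primary()
        except Exception:
            current = None  # treat errors during the election as "no primary yet"
        elapsed = clock() - start
        if current is not None and current != old_primary:
            return current, elapsed
        if elapsed >= timeout:
            return None, elapsed
        sleep(interval)
```

With PyMongo, `get_primary` could be `lambda: client.primary`, since `MongoClient.primary` returns the `(host, port)` of the current primary or `None`. Record the old primary before stopping `mongod`, then call the helper and log the elapsed time.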
During this failover period, your Python application might experience connection errors or timeouts. The PyMongo driver will attempt to reconnect and will eventually connect to the new primary once elected. It’s essential to implement retry logic and graceful error handling in your application to manage these transient failures.
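One way to structure such retry logic is a small wrapper with exponential backoff. This is a minimal sketch, not a drop-in policy: `call_with_retry` is a hypothetical helper, and the attempt count and delays are illustrative assumptions to tune against your observed failover times.

```python
import time


def call_with_retry(operation, retry_on, attempts=5, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Run operation(), retrying on the given exception types with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delays grow as base_delay * 1, 2, 4, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

With PyMongo, a call might look like `call_with_retry(lambda: collection.insert_one(doc), retry_on=(AutoReconnect,))`, where `AutoReconnect` comes from `pymongo.errors` and also covers "not primary" and server-selection timeout errors. Retrying a write that is not idempotent can apply it twice, so reserve this pattern for operations that are safe to repeat. PyMongo's built-in retryable writes already retry supported operations once.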
For more advanced testing, consider using tools that can orchestrate more complex failure scenarios, such as network partitions or simultaneous node failures, to ensure your replica set and application can withstand more severe disruptions.
OVH Specific Considerations: Networking and Security
When deploying MongoDB replica sets on OVH Public Cloud, pay close attention to network configuration. Ensure that your instances can communicate with each other on the MongoDB port (default 27017). This typically involves configuring security groups or firewall rules within the OVH control panel or using cloud-init scripts to set up `ufw` or `iptables`.
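# Example UFW rules on each MongoDB instance
# SOURCE_IP_1..3 are placeholders: substitute the addresses of the other
# replica set members and of your application servers.
# Make sure SSH access is allowed before enabling the firewall.
sudo ufw allow from SOURCE_IP_1 to any port 27017 proto tcp
sudo ufw allow from SOURCE_IP_2 to any port 27017 proto tcp
sudo ufw allow from SOURCE_IP_3 to any port 27017 proto tcp
sudo ufw enable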
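After changing firewall rules, it is worth confirming from each instance that the other members are actually reachable on the MongoDB port. The sketch below uses only the Python standard library; `port_is_reachable` and `check_members` are hypothetical helper names. A successful TCP connection shows only that the port is open, not that `mongod` is healthy.

```python
import socket


def port_is_reachable(host, port=27017, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_members(hosts, port=27017, timeout=3.0):
    """Map each host name to whether its MongoDB port accepts connections."""
    return {host: port_is_reachable(host, port, timeout) for host in hosts}
```

For example, `check_members(["mongo2.example.com", "mongo3.example.com"])` run on `mongo1.example.com` should report `True` for both peers once the rules are in place.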
For production environments, it’s highly recommended to restrict access to the MongoDB port to your application servers and the other replica set members only. Avoid binding MongoDB to `0.0.0.0` if possible; instead, bind it to the instance's own private IP address, on a private network shared with your application servers and the other MongoDB nodes. Furthermore, enable authentication (SCRAM-SHA-1 or SCRAM-SHA-256) and use TLS/SSL encryption for data in transit between your application and MongoDB, and between replica set members. Note that enabling access control on a replica set also requires internal authentication between members, using either a shared keyfile or x.509 certificates.
Monitoring and Alerting for Proactive Disaster Recovery
Automated failover is only effective if you know when it happens and can confirm that it is functioning correctly. Implement comprehensive monitoring for your MongoDB replica set. Key metrics to track include:
- Replica Set Status (Primary, Secondary, Arbiter, Down)
- Replication Lag (difference in oplog timestamps between primary and secondaries)
- Network Latency between nodes
- Disk I/O and Memory Usage on each node
- MongoDB Connection Counts and Query Performance
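Replication lag in particular can be derived from the `optimeDate` field that `replSetGetStatus` reports for each member. The function below is a minimal sketch working on a status document that has already been fetched; `replication_lag_seconds` is a hypothetical name, and the demo values are made up.

```python
from datetime import datetime, timedelta


def replication_lag_seconds(status: dict) -> dict:
    """Map each secondary's name to its lag behind the primary, in seconds."""
    members = status.get("members", [])
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        return {}  # no primary: lag is undefined until an election completes
    return {
        m["name"]: max(0.0, (primary["optimeDate"] - m["optimeDate"]).total_seconds())
        for m in members
        if m.get("stateStr") == "SECONDARY"
    }


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0)  # illustrative timestamp
    demo = {"members": [
        {"name": "mongo1.example.com:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "mongo2.example.com:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=4)},
    ]}
    print(replication_lag_seconds(demo))
```

Inside `mongosh`, `rs.printSecondaryReplicationInfo()` gives a similar view without any scripting.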
Tools like Prometheus with the `mongodb_exporter`, Datadog, or OVH’s built-in monitoring can be used. Configure alerts for critical conditions, such as:
- Loss of primary node
- High replication lag
- Nodes becoming unreachable
- High error rates in application logs indicating connection issues
These alerts will notify your operations team of potential issues, allowing for investigation and intervention before a minor incident becomes a full-blown disaster. Proactive monitoring and well-defined incident response procedures are the final, critical layer in your disaster recovery strategy.
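The alert conditions listed above can be prototyped as a plain function before they are wired into a monitoring system. This is an illustrative sketch only: `evaluate_alerts` is a hypothetical name, the 30-second lag threshold is an assumption, and the inputs are a mapping of member name to state string and a mapping of secondary name to lag in seconds.

```python
HEALTHY_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}


def evaluate_alerts(states, lag_seconds, max_lag_seconds=30.0):
    """Return a list of alert messages for the given replica set snapshot.

    states: member name -> state string (e.g. "PRIMARY", "SECONDARY")
    lag_seconds: secondary name -> replication lag in seconds
    """
    alerts = []
    if "PRIMARY" not in states.values():
        alerts.append("CRITICAL: replica set has no primary")
    for name, state in sorted(states.items()):
        if state not in HEALTHY_STATES:
            alerts.append(f"WARNING: {name} is in state {state}")
    for name, lag in sorted(lag_seconds.items()):
        if lag > max_lag_seconds:
            alerts.append(f"WARNING: {name} lags the primary by {lag:.0f}s")
    return alerts
```

In practice these rules would usually live in your alerting tool (for example, as Prometheus alert rules). A function like this is mainly useful for smoke tests and for agreeing on thresholds.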