Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and Ruby Deployments on DigitalOcean
Establishing a MongoDB Replica Set for High Availability
A robust disaster recovery strategy for MongoDB hinges on implementing a replica set. This ensures data redundancy and automatic failover in case of node failure. For a production deployment on DigitalOcean, we’ll configure three data-bearing nodes, one of them a hidden secondary for backups or analytics, plus a separate arbiter that votes in elections but stores no data. Note that this layout has four voting members, and MongoDB recommends an odd number of voters, so with three data-bearing nodes the arbiter is optional rather than required.
First, provision three DigitalOcean Droplets. For this example, let’s assume they have public IPs: `159.65.10.1` (primary candidate), `159.65.10.2` (secondary candidate), and `159.65.10.3` (hidden secondary/backup). We’ll also provision a fourth Droplet for the arbiter, `159.65.10.4`. Replica set traffic should stay on the private network, so the examples use `10.10.0.1` through `10.10.0.4` as placeholder private IPs for the same four Droplets.
On every Droplet, including the arbiter, install MongoDB. Ensure the firewall allows traffic on port 27017 between these nodes and from your application servers, and from nowhere else. Edit the MongoDB configuration file (`/etc/mongod.conf`) to enable replication and bind to the private IP address of each Droplet, so that MongoDB is not listening on the public interface.
MongoDB Configuration for Replica Set Nodes
On Droplet `159.65.10.1` (primary candidate):
# /etc/mongod.conf
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true # Remove on MongoDB 6.1+, where journaling is always on and this option no longer exists
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  bindIp: 127.0.0.1,10.10.0.1 # Replace 10.10.0.1 with the private IP of this Droplet
  port: 27017
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
replication:
  replSetName: myReplicaSet
# No sharding section here: clusterRole: configsvr applies only to the config servers of a sharded cluster
setParameter:
  enableLocalhostAuthBypass: false
Repeat this configuration for Droplets `159.65.10.2` and `159.65.10.3`, adjusting the `bindIp` to their respective private IPs. The hidden secondary on Droplet `159.65.10.3` uses the same file as the other data-bearing nodes: `hidden` is not a `mongod.conf` option. It is a property of the member in the replica set configuration, set together with `priority: 0` when the replica set is initiated.
On Droplet `159.65.10.4` (arbiter):
# /etc/mongod.conf (for arbiter)
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true # Remove on MongoDB 6.1+, where this option no longer exists
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  bindIp: 127.0.0.1,10.10.0.4 # Replace 10.10.0.4 with the private IP of this Droplet
  port: 27017
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
replication:
  replSetName: myReplicaSet
  # arbiterOnly is not a mongod.conf option; it is set on the member
  # in the replica set configuration passed to rs.initiate()
setParameter:
  enableLocalhostAuthBypass: false
After configuring and restarting MongoDB on all nodes, initiate the replica set from one of the data-bearing nodes (e.g., `159.65.10.1`).
Initializing the MongoDB Replica Set
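Connect to the MongoDB shell on the primary candidate Droplet (`159.65.10.1`). Because `mongod` is bound only to the loopback and private addresses, connect over the private IP or run the shell on the Droplet itself. On MongoDB 5.0 and later the shell is `mongosh`; older releases ship the legacy `mongo` shell.
mongosh --host 10.10.0.1 --port 27017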
Inside the MongoDB shell, run the following command:
// Member hosts must be addresses the nodes can reach on their bound interfaces,
// so use the private IPs (placeholders here), not the public ones.
rs.initiate(
  {
    _id: "myReplicaSet",
    version: 1,
    members: [
      { _id: 0, host: "10.10.0.1:27017", priority: 2 },
      { _id: 1, host: "10.10.0.2:27017", priority: 1 },
      { _id: 2, host: "10.10.0.3:27017", priority: 0, hidden: true },
      { _id: 3, host: "10.10.0.4:27017", arbiterOnly: true }
    ]
  }
)
Verify the replica set status:
rs.status()
The output should list every member with a `stateStr` of `PRIMARY`, `SECONDARY`, or `ARBITER` and a `health` value of 1. The `priority` field influences which member is elected primary: among eligible, up-to-date members, the one with the higher priority is preferred. The hidden member has a priority of 0, which prevents it from ever being elected primary, although it still votes in elections.
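To make the election arithmetic concrete, here is a minimal Ruby sketch. The `election_summary` helper is hypothetical (it is not part of any MongoDB tooling), and the member list simply mirrors the roles and priorities used above, with placeholder private IPs. It assumes every member has one vote unless `votes: 0` is set.

```ruby
# Hypothetical helper: summarize the election math for a member list
# shaped like the entries passed to rs.initiate().
def election_summary(members)
  # Every member votes unless it is explicitly configured with votes: 0.
  voters = members.reject { |m| m[:votes] == 0 }
  majority = voters.size / 2 + 1
  {
    voting_members: voters.size,
    majority_needed: majority,
    tolerated_failures: voters.size - majority,
    # Arbiters and priority-0 members can vote but can never become primary.
    electable: members
      .reject { |m| m[:arbiterOnly] || m.fetch(:priority, 1).zero? }
      .map { |m| m[:host] }
  }
end

members = [
  { host: "10.10.0.1:27017", priority: 2 },
  { host: "10.10.0.2:27017", priority: 1 },
  { host: "10.10.0.3:27017", priority: 0, hidden: true },
  { host: "10.10.0.4:27017", arbiterOnly: true }
]

puts election_summary(members)
```

For this layout the helper reports four voters and a required majority of three, so the set tolerates the loss of one voting member, and only the first two hosts can become primary. Three voters would tolerate one failure as well, which is why an arbiter adds little when there are already three data-bearing voters.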
Architecting Auto-Failover for Ruby Applications
For Ruby applications, particularly those using frameworks like Rails, the MongoDB driver needs to be aware of the replica set and capable of handling failover events. The `mongo` gem in Ruby provides built-in support for replica sets.
Configuring the MongoDB Connection String
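The connection string is the critical component that informs your Ruby application about the replica set. It carries the replica set name and a seed list of members. The seed list does not have to be exhaustive, because the driver discovers the remaining members from whichever seed it reaches first, but listing several members ensures the application can still connect when one of them is down.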
In a Rails application, this would typically be configured in `config/mongoid.yml` (if using Mongoid) or directly in your database connection setup.
Example using the `mongo` gem directly:
require 'mongo'

# Replace with your Droplet private IPs. "my_database" is a placeholder:
# without a database in the URI, the driver defaults to "admin".
mongo_uri = "mongodb://10.10.0.1:27017,10.10.0.2:27017,10.10.0.3:27017,10.10.0.4:27017/my_database?replicaSet=myReplicaSet&authSource=admin"

begin
  client = Mongo::Client.new(mongo_uri)
  # Access the database named in the URI
  db = client.database
  puts "Successfully connected to MongoDB replica set: #{client.cluster.replica_set_name}"
  # next_primary runs server selection and raises NoServerAvailable if no primary is found
  puts "Current primary: #{client.cluster.next_primary.address}"

  # Example: Insert a document
  result = db[:my_collection].insert_one({ name: "Test Document", timestamp: Time.now })
  puts "Inserted document with ID: #{result.inserted_id}"
rescue Mongo::Error::NoServerAvailable => e
  puts "Failed to connect to MongoDB: #{e.message}"
  # Implement retry logic or alert mechanism here
rescue StandardError => e
  puts "An unexpected error occurred: #{e.message}"
end
If using Mongoid, your `config/mongoid.yml` would look something like this:
production:
  clients:
    default:
      # "my_database" is a placeholder database name
      uri: "mongodb://10.10.0.1:27017,10.10.0.2:27017,10.10.0.3:27017,10.10.0.4:27017/my_database?replicaSet=myReplicaSet&authSource=admin"
      options:
        # Consider adding read preference for specific use cases
        # read: { mode: :secondary_preferred }
        # Consider adding write concern for stronger consistency guarantees
        # write: { w: :majority, j: true, wtimeout: 5000 }
  # Other Mongoid configurations...
The `replicaSet=myReplicaSet` parameter is crucial. The driver will discover all members of the replica set and monitor their status. When the primary fails, the driver will automatically detect the change and switch to the new primary.
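Hard-coding a long URI in several places invites typos, so it can help to assemble it from a host list. The sketch below uses only Ruby’s standard library. The `replica_set_uri` helper and the `my_database` name are hypothetical; `replicaSet`, `authSource`, and `serverSelectionTimeoutMS` are standard MongoDB connection string options.

```ruby
require "uri"

# Hypothetical helper: build a replica-set connection string from a seed list.
# Credentials are deliberately left out; supply them through your own
# secrets handling rather than embedding them in source code.
def replica_set_uri(hosts:, replica_set:, database:, options: {})
  query = URI.encode_www_form({ replicaSet: replica_set }.merge(options))
  "mongodb://#{hosts.join(',')}/#{database}?#{query}"
end

uri = replica_set_uri(
  hosts: %w[10.10.0.1:27017 10.10.0.2:27017 10.10.0.3:27017],
  replica_set: "myReplicaSet",
  database: "my_database",
  options: { authSource: "admin", serverSelectionTimeoutMS: 5000 }
)
puts uri
```

The resulting string can be passed to `Mongo::Client.new` or placed in the `uri` key of `config/mongoid.yml`. Lowering `serverSelectionTimeoutMS` from its 30-second default makes the application fail faster when no suitable server is available, at the cost of less patience during an election.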
Handling Connection Errors and Failover Gracefully
While the `mongo` gem handles automatic failover, your application should be prepared for transient connection issues during the failover process. Implement retry mechanisms and error handling.
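The `Mongo::Error::NoServerAvailable` exception is raised when server selection times out, meaning the driver could not find a suitable server (for writes, a primary) within the server selection timeout, which defaults to 30 seconds. During an election this is the error you are most likely to see for writes. Your application logic should catch it and potentially retry the operation after a short delay.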
# Example of a retry mechanism for a specific operation
def with_retry(max_retries: 3, delay: 2)
  retries = 0
  loop do
    begin
      return yield
    rescue Mongo::Error::NoServerAvailable => e
      retries += 1
      if retries <= max_retries
        puts "Connection error: #{e.message}. Retrying in #{delay} seconds (Attempt #{retries}/#{max_retries})..."
        sleep(delay)
      else
        puts "Failed to connect after #{max_retries} retries. Raising error."
        raise e # Re-raise the exception after exhausting retries
      end
    end
  end
end

# Usage:
begin
  with_retry do
    db[:my_collection].insert_one({ message: "This might fail during failover" })
  end
rescue Mongo::Error::NoServerAvailable
  # Log the persistent failure, notify ops, or present a user-friendly error
  Rails.logger.error("MongoDB is unavailable after multiple retries.")
  # Potentially render an error page or return a specific status code
end
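For more complex scenarios, tune the driver’s timeouts and connection pool rather than relying on retries alone. The `mongo` gem accepts client options such as `server_selection_timeout`, `connect_timeout`, `socket_timeout`, `max_pool_size`, and `wait_queue_timeout`, and a retry strategy with increasing delays avoids hammering the replica set while an election is in progress.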
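One common refinement of a fixed-delay retry is exponential backoff with jitter. The following is a minimal sketch, not a drop-in library: `with_backoff` and `TransientError` are hypothetical names, and `TransientError` stands in for `Mongo::Error::NoServerAvailable` so the example runs without a database. The `sleeper` parameter exists only so the waits can be observed or skipped in tests.

```ruby
# Hypothetical helper: retry a block with exponential backoff and full jitter.
def with_backoff(errors:, max_retries: 4, base_delay: 0.5, max_delay: 8.0,
                 sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    yield
  rescue *errors
    attempt += 1
    raise if attempt > max_retries # out of retries: re-raise the original error

    # The ceiling doubles on each attempt (0.5, 1, 2, 4, ...) up to max_delay.
    ceiling = [base_delay * (2**(attempt - 1)), max_delay].min
    sleeper.call(rand * ceiling) # full jitter: random wait in [0, ceiling)
    retry
  end
end

# Stand-in for Mongo::Error::NoServerAvailable in this self-contained demo.
class TransientError < StandardError; end

calls = 0
waits = []
result = with_backoff(errors: [TransientError], sleeper: ->(s) { waits << s }) do
  calls += 1
  raise TransientError, "no primary yet" if calls < 3

  :ok
end
puts "result=#{result} calls=#{calls} waits=#{waits.size}"
```

In an application you would pass `errors: [Mongo::Error::NoServerAvailable]` and wrap only idempotent operations, since a retried write whose first attempt actually reached the server could otherwise be applied twice.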
Automated Failover Monitoring and Alerting
While auto-failover is automated, it’s crucial to monitor the health of your replica set and be alerted when failovers occur. This allows for proactive investigation and ensures the system is functioning as expected.
Leveraging DigitalOcean Monitoring and External Tools
DigitalOcean’s built-in monitoring provides basic Droplet-level metrics (CPU, memory, disk I/O, network). However, for MongoDB-specific health, you’ll need more granular monitoring.
Tools like Prometheus with the `mongodb_exporter` are excellent for this. You can deploy Prometheus and Grafana on a separate Droplet to scrape metrics from your MongoDB instances.
The `mongodb_exporter` can expose metrics such as:
- Replica set status (primary, secondary, arbiter)
- Oplog lag
- Network traffic per member
- Disk usage per member
- Read/write operations per member
Configure Prometheus to scrape these metrics. Then, set up Alertmanager to define alert rules. For example, an alert can be triggered if a node is not reachable, if the oplog lag exceeds a certain threshold, or if the replica set has no primary for an extended period.
Alerting on Failover Events
A common alert strategy is to monitor the number of primaries in the replica set. An election normally completes in seconds, so if the count stays at zero for several minutes, it indicates a significant issue or a stalled election rather than a routine failover.
Here’s a conceptual Prometheus alerting rule (in YAML):
groups:
  - name: mongodb.rules
    rules:
      - alert: MongoReplicaSetNoPrimary
        # Metric and label names here are illustrative; they differ between
        # mongodb_exporter versions, so check what your exporter actually exposes.
        expr: mongodb_replica_set_state{state="primary"} == 0
        for: 5m # Wait for 5 minutes to avoid flapping alerts during brief elections
        labels:
          severity: critical
        annotations:
          summary: "MongoDB replica set '{{ $labels.replica_set }}' has no primary node."
          description: "The MongoDB replica set '{{ $labels.replica_set }}' has not reported a primary node for 5 minutes. This indicates a potential failure or prolonged election. Check MongoDB logs and cluster health."
      - alert: MongoReplicaSetOplogLagging
        expr: mongodb_replica_set_oplog_lag_seconds{replica_set="myReplicaSet", member_role="secondary"} > 600 # 10 minutes lag
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MongoDB secondary node '{{ $labels.instance }}' is lagging."
          description: "The secondary node '{{ $labels.instance }}' in replica set '{{ $labels.replica_set }}' is lagging behind the primary by more than 10 minutes ({{ $value }} seconds). This could impact read consistency and recovery time."
These alerts can be routed to Slack, PagerDuty, or email, ensuring your operations team is immediately notified of any critical events, including automatic failovers.
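The same conditions can be checked ad hoc from Ruby, for instance in a scheduled health-check task. The sketch below evaluates a hash shaped like the output of the `replSetGetStatus` admin command, using its `members`, `name`, `health`, `stateStr`, and `optimeDate` fields. The `replica_set_alerts` helper and the sample data are hypothetical, and the 600-second threshold simply mirrors the alert rule above.

```ruby
# Hypothetical helper: derive alert messages from a replSetGetStatus-style hash.
def replica_set_alerts(status, max_lag_seconds: 600)
  members = status.fetch("members")
  primary = members.find { |m| m["stateStr"] == "PRIMARY" }
  return ["no primary"] if primary.nil?

  alerts = []
  members.each do |m|
    # health is reported as 1 (reachable) or 0 (unreachable).
    alerts << "#{m['name']} unreachable" if m["health"].to_i.zero?
    next unless m["stateStr"] == "SECONDARY"

    # Replication lag: how far this secondary's last applied write
    # trails the primary's, in seconds.
    lag = primary["optimeDate"] - m["optimeDate"]
    alerts << "#{m['name']} lagging #{lag.round}s" if lag > max_lag_seconds
  end
  alerts
end

# Fabricated sample data for illustration only.
now = Time.utc(2024, 1, 1, 12, 0, 0)
status = {
  "set" => "myReplicaSet",
  "members" => [
    { "name" => "10.10.0.1:27017", "health" => 1, "stateStr" => "PRIMARY", "optimeDate" => now },
    { "name" => "10.10.0.2:27017", "health" => 1, "stateStr" => "SECONDARY", "optimeDate" => now - 900 },
    { "name" => "10.10.0.3:27017", "health" => 0, "stateStr" => "(not reachable/healthy)", "optimeDate" => now - 30 }
  ]
}
puts replica_set_alerts(status)
```

With the `mongo` gem, a real status document could be obtained by running the `replSetGetStatus` command against the `admin` database. A script like this complements Prometheus alerting rather than replacing it, since it only sees the replica set at the moment it runs.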
Testing Your Disaster Recovery Strategy
A disaster recovery plan is only as good as its last successful test. Regularly simulate failures to validate your auto-failover mechanisms and your team’s response.
Simulating Node Failures
The simplest way to test is to stop the MongoDB process on the primary node.
# On the current primary node (e.g., 159.65.10.1)
sudo systemctl stop mongod
Observe the following:
- The MongoDB driver in your Ruby application should automatically detect the primary’s unavailability.
- An election should be triggered among the remaining data-bearing nodes.
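- A new primary should be elected within seconds. With default settings, MongoDB’s documentation puts the typical time to elect a new primary at around 12 seconds or less, depending on network latency and configuration.
- Your Ruby application should switch to the new primary without a restart. Operations in flight during the election may fail or be retried, so expect a brief spike in errors or latency rather than a fully invisible switch.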
- Monitor your alerting system for notifications about the failover.
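You can also test secondary node failures by stopping MongoDB on a secondary Droplet. This should not impact write operations but will reduce redundancy. If the arbiter fails, the replica set continues to operate as long as a majority of voting members can still reach each other. In this four-member layout that means three of the four members, so the set tolerates the loss of any one member but loses its primary if two are down at once.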
Testing Application Resilience
During a simulated failover, monitor your Ruby application’s error logs and user-facing behavior. Check for:
- Increased error rates or latency for a brief period.
- Successful completion of operations after the failover.
- Correct behavior of retry mechanisms.
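To put a number on "a brief period", a drill can run a small probe loop alongside the failover and record when operations start and stop failing. This is a sketch under stated assumptions: `measure_outage` is a hypothetical helper, and the probe here is a fake that fails on a few attempts. In a real drill the probe would perform a small write against the replica set. The `clock` and `sleeper` parameters are injectable so the example runs instantly.

```ruby
# Hypothetical helper: approximate the outage window seen by a probe
# that is invoked every `interval` seconds for `duration` seconds.
def measure_outage(probe:, duration:, interval: 0.5,
                   clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                   sleeper: ->(seconds) { sleep(seconds) })
  started = clock.call
  first_failure = last_failure = nil
  failures = 0
  while clock.call - started < duration
    begin
      probe.call
    rescue StandardError
      now = clock.call
      first_failure ||= now
      last_failure = now
      failures += 1
    end
    sleeper.call(interval)
  end
  # The probe only samples every `interval` seconds, so add one interval
  # to the span between the first and last observed failures.
  outage = first_failure ? (last_failure - first_failure + interval) : 0.0
  { failures: failures, approx_outage_seconds: outage }
end

# Demo with a simulated clock and a fake probe failing on attempts 3 to 5.
fake_time = 0.0
attempts = 0
probe = lambda do
  attempts += 1
  raise "not primary" if (3..5).cover?(attempts)
end
report = measure_outage(probe: probe, duration: 5.0, interval: 0.5,
                        clock: -> { fake_time },
                        sleeper: ->(seconds) { fake_time += seconds })
puts report
```

Comparing the measured window across drills gives an early warning if failover is getting slower, for example after a configuration change or as data volume grows.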
Automated failover for MongoDB and seamless integration with Ruby applications significantly enhance the resilience of your deployment. By combining robust MongoDB replica set configuration with intelligent application-level connection management and comprehensive monitoring, you can build a highly available system on DigitalOcean.