Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and Ruby Deployments on DigitalOcean

Establishing a MongoDB Replica Set for High Availability

A robust disaster recovery strategy for MongoDB hinges on implementing a replica set, which provides data redundancy and automatic failover when a node fails. For a production deployment on DigitalOcean, we’ll configure a replica set with three data-bearing members, one of them a hidden secondary reserved for backups or analytics, plus a separate arbiter that votes in elections but stores no data. (With three data-bearing voting members the arbiter is optional; MongoDB recommends an odd number of voting members.)

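First, provision three DigitalOcean Droplets in the same VPC for the data-bearing members. For this example, assume they have the public IPs `159.65.10.1` (primary candidate), `159.65.10.2` (secondary candidate), and `159.65.10.3` (hidden secondary/backup). We’ll also provision a fourth Droplet for the arbiter, `159.65.10.4`. The matching private IPs, used for all replica set traffic below, are assumed to be `10.10.0.1` through `10.10.0.4`.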

On each of the four Droplets, install MongoDB; the arbiter runs `mongod` as well, it simply stores no data. Configure the firewall to allow traffic on port 27017 only between these nodes and your application servers. Edit the MongoDB configuration file (`/etc/mongod.conf`) to enable replication and bind to the private IP address of each Droplet, so that the database is not exposed on the public interface. If you are unsure of a Droplet’s private IP, `ip addr` lists the address on each network interface.

MongoDB Configuration for Replica Set Nodes

On Droplet `159.65.10.1` (primary candidate):

# /etc/mongod.conf
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  bindIp: 127.0.0.1,10.10.0.1  # Replace 10.10.0.1 with the private IP of this Droplet
  port: 27017
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
replication:
  replSetName: myReplicaSet
# No sharding section: this is a plain replica set, not a config server
setParameter:
  enableLocalhostAuthBypass: false

Repeat this configuration for Droplets `159.65.10.2` and `159.65.10.3`, adjusting `bindIp` to their respective private IPs. The hidden secondary on `159.65.10.3` needs no extra setting in `mongod.conf`: `hidden` is a replica set member option, not a `mongod` configuration option, so it is applied later in the `rs.initiate` member list. Its replication section is the same as on the other data-bearing nodes:

# /etc/mongod.conf (for hidden secondary)
# ... other configurations ...
replication:
  replSetName: myReplicaSet

On Droplet `159.65.10.4` (arbiter):

# /etc/mongod.conf (for arbiter)
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  bindIp: 127.0.0.1,10.10.0.4  # Replace 10.10.0.4 with the private IP of this Droplet
  port: 27017
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
replication:
  replSetName: myReplicaSet  # arbiterOnly is set per member in rs.initiate, not in mongod.conf
setParameter:
  enableLocalhostAuthBypass: false

After configuring and restarting MongoDB on all nodes, initiate the replica set from one of the data-bearing nodes (e.g., `159.65.10.1`).

Initializing the MongoDB Replica Set

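Connect to the MongoDB shell from the primary candidate Droplet (`159.65.10.1`). Because `mongod` is bound only to localhost and the private interface, connect through the private IP rather than the public one. The example uses `mongosh`; on releases before MongoDB 6.0 the legacy `mongo` shell accepts the same arguments.

mongosh --host 10.10.0.1 --port 27017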

Inside the MongoDB shell, run the following command:

rs.initiate(
  {
    _id: "myReplicaSet",
    version: 1,
    members: [
      // Use the private IPs that each mongod is bound to
      { _id: 0, host: "10.10.0.1:27017", priority: 2 },
      { _id: 1, host: "10.10.0.2:27017", priority: 1 },
      { _id: 2, host: "10.10.0.3:27017", priority: 0, hidden: true },
      { _id: 3, host: "10.10.0.4:27017", arbiterOnly: true }
    ]
  }
)

Verify the replica set status:

rs.status()

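The output should list one member as `PRIMARY`, the other data-bearing members as `SECONDARY`, and the arbiter as `ARBITER`, each with `health: 1`. The `priority` field influences elections: among eligible members with up-to-date data, a member with a higher priority is preferred as primary. The hidden member has a priority of 0, which prevents it from ever being elected primary, and `hidden: true` keeps it out of the member list that clients discover, so applications do not read from it.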
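The constraints on `priority` and `hidden` can be checked before the configuration ever reaches `rs.initiate`. The following is a minimal sketch in plain Ruby, assuming the member list is available as an array of hashes; `validate_members` is a hypothetical helper, not part of MongoDB or the `mongo` gem, and the sample list contains a deliberate mistake.

```ruby
# Hypothetical helper: returns a list of problems found in a member list
# shaped like the `members` array passed to rs.initiate.
def validate_members(members)
  problems = []

  ids = members.map { |m| m[:_id] }
  problems << "duplicate _id values" unless ids.uniq.size == ids.size

  hosts = members.map { |m| m[:host] }
  problems << "duplicate host values" unless hosts.uniq.size == hosts.size

  members.each do |m|
    priority = m.fetch(:priority, 1) # default for data-bearing members
    votes = m.fetch(:votes, 1)       # members vote unless told otherwise

    # Hidden and non-voting members must not be electable.
    if m[:hidden] && !priority.zero?
      problems << "#{m[:host]}: hidden member must have priority 0"
    end
    if votes.zero? && !priority.zero?
      problems << "#{m[:host]}: non-voting member must have priority 0"
    end
  end

  # MongoDB allows at most 7 voting members per replica set.
  voters = members.count { |m| m.fetch(:votes, 1) == 1 }
  problems << "#{voters} voting members exceeds the limit of 7" if voters > 7

  problems
end

members = [
  { _id: 0, host: "10.10.0.1:27017", priority: 2 },
  { _id: 1, host: "10.10.0.2:27017", priority: 1 },
  { _id: 2, host: "10.10.0.3:27017", priority: 1, hidden: true }, # deliberately wrong
  { _id: 3, host: "10.10.0.4:27017", arbiterOnly: true }
]

puts validate_members(members)
```

The check is intentionally small; MongoDB itself rejects an invalid configuration, so a helper like this is mainly useful in provisioning scripts that generate the member list.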

Architecting Auto-Failover for Ruby Applications

For Ruby applications, particularly those using frameworks like Rails, the MongoDB driver needs to be aware of the replica set and capable of handling failover events. The `mongo` gem in Ruby provides built-in support for replica sets.

Configuring the MongoDB Connection String

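The connection string is the critical component that informs your Ruby application about the replica set. It provides a seed list of replica set members along with the replica set name. The driver discovers the remaining members from whichever seed it reaches, so the list does not have to be exhaustive, but it should contain at least two data-bearing members so that a single unreachable host does not block startup.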
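When the host list comes from configuration or environment variables, the connection string can be assembled in code rather than written by hand. This is a small sketch using only the Ruby standard library; `build_mongo_uri` is a hypothetical helper name, and the hosts shown are the example private IPs.

```ruby
require 'uri'

# Hypothetical helper: builds a replica set connection string from a seed list.
def build_mongo_uri(hosts:, replica_set:, database: nil, options: {})
  raise ArgumentError, "at least one host is required" if hosts.empty?

  # replicaSet comes first, followed by any extra URI options.
  query = { "replicaSet" => replica_set }.merge(options.transform_keys(&:to_s))
  "mongodb://#{hosts.join(',')}/#{database}?#{URI.encode_www_form(query)}"
end

uri = build_mongo_uri(
  hosts: %w[10.10.0.1:27017 10.10.0.2:27017],
  replica_set: "myReplicaSet",
  options: { authSource: "admin" }
)
puts uri
# → mongodb://10.10.0.1:27017,10.10.0.2:27017/?replicaSet=myReplicaSet&authSource=admin
```

The helper does not handle credentials; if a username and password are added, they need to be percent-encoded before being placed in the URI.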

In a Rails application, this would typically be configured in `config/mongoid.yml` (if using Mongoid) or directly in your database connection setup.

Example using the `mongo` gem directly:

require 'mongo'

# Replace with your Droplet private IPs
mongo_uri = "mongodb://10.10.0.1:27017,10.10.0.2:27017,10.10.0.3:27017,10.10.0.4:27017/?replicaSet=myReplicaSet&authSource=admin"

begin
  client = Mongo::Client.new(mongo_uri)
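  # Access the database named in the URI (or the driver's default)
  db = client.database
  # next_primary blocks until server selection finds a primary or times out
  primary = client.cluster.next_primary
  puts "Successfully connected to MongoDB replica set: #{client.cluster.replica_set_name}"
  puts "Current primary: #{primary.address}"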

  # Example: Insert a document
  result = db[:my_collection].insert_one({ name: "Test Document", timestamp: Time.now })
  puts "Inserted document with ID: #{result.inserted_id}"

rescue Mongo::Error::NoServerAvailable => e
  puts "Failed to connect to MongoDB: #{e.message}"
  # Implement retry logic or alert mechanism here
rescue StandardError => e
  puts "An unexpected error occurred: #{e.message}"
end

If using Mongoid, your `config/mongoid.yml` would look something like this:

production:
  clients:
    default:
      uri: "mongodb://10.10.0.1:27017,10.10.0.2:27017,10.10.0.3:27017,10.10.0.4:27017/?replicaSet=myReplicaSet&authSource=admin"
      options:
        # Consider adding read preference for specific use cases
        # read: { mode: :secondary_preferred }
        # Consider adding write concern for stronger consistency guarantees
        # write: { w: :majority, j: true, wtimeout: 5000 }
  # Other Mongoid configurations...

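The `replicaSet=myReplicaSet` parameter is crucial. The driver will discover the members of the replica set and monitor their status in the background. When the primary fails, the driver detects the change and routes subsequent operations to the new primary. Operations that are in flight during the election can still fail or stall until a primary is available, so the application needs its own error handling as well.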

Handling Connection Errors and Failover Gracefully

While the `mongo` gem handles automatic failover, your application should be prepared for transient connection issues during the failover process. Implement retry mechanisms and error handling.

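The `Mongo::Error::NoServerAvailable` exception is raised when server selection times out, that is, when no suitable server (for writes, a primary) becomes available within the driver’s `server_selection_timeout`, which defaults to 30 seconds. Your application logic should catch this and potentially retry the operation after a short delay.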

# Example of a retry mechanism for a specific operation
def with_retry(max_retries: 3, delay: 2)
  retries = 0
  loop do
    begin
      return yield
    rescue Mongo::Error::NoServerAvailable => e
      retries += 1
      if retries <= max_retries
        puts "Connection error: #{e.message}. Retrying in #{delay} seconds (Attempt #{retries}/#{max_retries})..."
        sleep(delay)
      else
        puts "Failed to connect after #{max_retries} retries. Raising error."
        raise e # Re-raise the exception after exhausting retries
      end
    end
  end
end

# Usage:
begin
  with_retry do
    db[:my_collection].insert_one({ message: "This might fail during failover" })
  end
rescue Mongo::Error::NoServerAvailable
  # Log the persistent failure, notify ops, or present a user-friendly error
  Rails.logger.error("MongoDB is unavailable after multiple retries.")
  # Potentially render an error page or return a specific status code
end

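For more complex scenarios, tune the driver’s connection pool and timeouts rather than relying on application-level retries alone. The `mongo` gem accepts client options such as `max_pool_size`, `connect_timeout`, `socket_timeout`, and `server_selection_timeout`, and recent driver versions retry eligible reads and writes once by default (`retry_reads`, `retry_writes`).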
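A fixed delay between retries can cause many application processes to hit the new primary at the same moment. One common alternative is exponential backoff with jitter, sketched below. `with_backoff` and its parameters are hypothetical names; the error classes are passed in so the helper stays independent of the driver, and the `sleeper` argument exists only to make the helper easy to test.

```ruby
# Sketch: retry a block with exponential backoff and random jitter.
def with_backoff(retry_on:, max_retries: 4, base_delay: 0.5, max_delay: 8.0,
                 sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    yield
  rescue *retry_on
    attempt += 1
    raise if attempt > max_retries # give up and re-raise the last error

    # The delay doubles on each attempt and is capped at max_delay.
    delay = [base_delay * (2**(attempt - 1)), max_delay].min
    # Jitter: wait between 50% and 100% of the computed delay.
    sleeper.call(delay * (0.5 + rand * 0.5))
    retry
  end
end

# Usage with the mongo gem (assumes `db` is a Mongo::Database):
#
#   with_backoff(retry_on: [Mongo::Error::NoServerAvailable,
#                           Mongo::Error::SocketError]) do
#     db[:my_collection].insert_one({ message: "retried with backoff" })
#   end
```

Retrying is safest for idempotent operations. A write that timed out may already have been applied, so consider whether repeating it could create duplicates.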

Automated Failover Monitoring and Alerting

Although failover itself is automatic, it’s still important to monitor the health of your replica set and to be alerted when a failover occurs. Alerts let you investigate the cause and restore full redundancy before a second failure.

Leveraging DigitalOcean Monitoring and External Tools

DigitalOcean’s built-in monitoring provides basic Droplet-level metrics (CPU, memory, disk I/O, network). However, for MongoDB-specific health, you’ll need more granular monitoring.

Tools like Prometheus with the `mongodb_exporter` are excellent for this. You can deploy Prometheus and Grafana on a separate Droplet to scrape metrics from your MongoDB instances.

The `mongodb_exporter` can expose metrics such as:

Replica set status (primary, secondary, arbiter)
Oplog lag
Network traffic per member
Disk usage per member
Read/write operations per member

Configure Prometheus to scrape these metrics and define the alerting rules in Prometheus itself; Alertmanager then deduplicates the resulting alerts and routes them to your notification channels. For example, an alert can be triggered if a node is not reachable, if the oplog lag exceeds a certain threshold, or if the replica set has no primary for an extended period.

Alerting on Failover Events

A common alert strategy is to monitor the number of primaries in the replica set. If this stays at zero for longer than an election normally takes (the rule below waits five minutes), it indicates a significant issue or a prolonged failover event.

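Here’s a conceptual Prometheus alerting rule (in YAML). Metric and label names differ between `mongodb_exporter` versions, so treat the ones below as placeholders and check them against your exporter’s `/metrics` output: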

groups:
- name: mongodb.rules
  rules:
  - alert: MongoReplicaSetNoPrimary
    expr: mongodb_replica_set_state{state="primary"} == 0
    for: 5m # Wait for 5 minutes to avoid flapping alerts during brief elections
    labels:
      severity: critical
    annotations:
      summary: "MongoDB replica set '{{ $labels.replica_set }}' has no primary node."
      description: "The MongoDB replica set '{{ $labels.replica_set }}' has not reported a primary node for 5 minutes. This indicates a potential failure or prolonged election. Check MongoDB logs and cluster health."

  - alert: MongoReplicaSetOplogLagging
    expr: mongodb_replica_set_oplog_lag_seconds{replica_set="myReplicaSet", member_role="secondary"} > 600 # 10 minutes lag
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "MongoDB secondary node '{{ $labels.instance }}' is lagging."
      description: "The secondary node '{{ $labels.instance }}' in replica set '{{ $labels.replica_set }}' is lagging behind the primary by more than 10 minutes ({{ $value }} seconds). This could impact read consistency and recovery time."

These alerts can be routed to Slack, PagerDuty, or email, ensuring your operations team is immediately notified of any critical events, including automatic failovers.
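The lag that the second rule alerts on can also be cross-checked directly from the output of the `replSetGetStatus` command, whose `members` entries carry `name`, `stateStr`, and `optimeDate`. The sketch below works on a plain array of hashes so that it runs without a database; `replication_lag_seconds` is a hypothetical helper and the sample timestamps are invented for illustration.

```ruby
# Sketch: compute per-secondary replication lag from replSetGetStatus members.
# With the mongo gem, the input could come from something like:
#   client.use(:admin).command(replSetGetStatus: 1).documents.first["members"]
def replication_lag_seconds(members)
  primary = members.find { |m| m["stateStr"] == "PRIMARY" }
  return nil unless primary # no primary: lag is undefined

  members
    .select { |m| m["stateStr"] == "SECONDARY" }
    .to_h { |m| [m["name"], (primary["optimeDate"] - m["optimeDate"]).round] }
end

# Invented sample data: the hidden member is 12 minutes behind.
sample_members = [
  { "name" => "10.10.0.1:27017", "stateStr" => "PRIMARY",
    "optimeDate" => Time.utc(2024, 1, 1, 12, 0, 0) },
  { "name" => "10.10.0.2:27017", "stateStr" => "SECONDARY",
    "optimeDate" => Time.utc(2024, 1, 1, 11, 59, 58) },
  { "name" => "10.10.0.3:27017", "stateStr" => "SECONDARY",
    "optimeDate" => Time.utc(2024, 1, 1, 11, 48, 0) },
  { "name" => "10.10.0.4:27017", "stateStr" => "ARBITER" }
]

lag = replication_lag_seconds(sample_members)
lag.each { |name, seconds| puts "#{name}: #{seconds}s behind primary" }
```

With this sample, the hidden member’s 720 seconds of lag would exceed the 600-second threshold used in the rule above.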

Testing Your Disaster Recovery Strategy

A disaster recovery plan is only as good as its last successful test. Regularly simulate failures to validate your auto-failover mechanisms and your team’s response.

Simulating Node Failures

The simplest way to test is to stop the MongoDB process on the primary node.

# On the current primary node (e.g., 159.65.10.1)
sudo systemctl stop mongod

Observe the following:

The MongoDB driver in your Ruby application should automatically detect the primary’s unavailability.
An election should be triggered among the remaining data-bearing nodes.
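A new primary should typically be elected within about 12 seconds under the default replica set settings (the election timeout defaults to 10 seconds), depending on network latency and configuration.
Your Ruby application should resume normal operation against the new primary, possibly after a few retried or failed requests.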
Monitor your alerting system for notifications about the failover.

You can also test secondary node failures by stopping MongoDB on a secondary Droplet. This should not impact write operations but will reduce redundancy. If the arbiter fails, the replica set continues to operate as long as a majority of voting members remains available. In this topology that means three of the four voters, so the set tolerates the loss of any single member but not two at once.
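The majority arithmetic behind these scenarios is simple enough to check in a few lines. This sketch uses two hypothetical helper names, `majority` and `fault_tolerance`, and considers only voting members.

```ruby
# Votes needed to elect a primary: more than half of the voting members.
def majority(voters)
  voters / 2 + 1
end

# Number of voting members that can be lost while a primary can still be elected.
def fault_tolerance(voters)
  voters - majority(voters)
end

[3, 4, 5].each do |n|
  puts "#{n} voters: majority #{majority(n)}, tolerates #{fault_tolerance(n)} failure(s)"
end
# → 3 voters: majority 2, tolerates 1 failure(s)
# → 4 voters: majority 3, tolerates 1 failure(s)
# → 5 voters: majority 3, tolerates 2 failure(s)
```

Going from three voters to four raises the majority without raising the fault tolerance, which is why odd voter counts are preferred.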

Testing Application Resilience

During a simulated failover, monitor your Ruby application’s error logs and user-facing behavior. Check for:

Increased error rates or latency for a brief period.
Successful completion of operations after the failover.
Correct behavior of retry mechanisms.
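These checks are easier to evaluate with numbers than by reading logs. One option is a small probe that repeats an operation at a fixed interval during the test and records each outcome. The sketch below is driver-agnostic: `probe` and `summarize` are hypothetical helpers, the operation is supplied as a block, and the demo at the end uses a fake operation in place of a real write.

```ruby
# Sketch: run an operation repeatedly and record success, failure, and duration.
def probe(iterations:, interval: 1.0,
          clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
          sleeper: ->(seconds) { sleep(seconds) })
  results = []
  iterations.times do
    started = clock.call
    begin
      yield
      results << { ok: true, seconds: clock.call - started }
    rescue StandardError => e
      results << { ok: false, seconds: clock.call - started, error: e.class.name }
    end
    sleeper.call(interval)
  end
  results
end

# Condense the probe results into a few figures worth tracking per test run.
def summarize(results)
  failures = results.reject { |r| r[:ok] }
  failure_runs = results.chunk_while { |a, b| a[:ok] == b[:ok] }
                        .reject { |run| run.first[:ok] }
  {
    total: results.size,
    failed: failures.size,
    errors: failures.map { |r| r[:error] }.tally,
    longest_failure_streak: failure_runs.map(&:size).max || 0
  }
end

# Demo with a fake operation that fails on its third and fourth call.
# In a real test the block would be e.g. db[:probe].insert_one({ at: Time.now }).
calls = 0
results = probe(iterations: 5, interval: 0) do
  calls += 1
  raise "no primary" if [3, 4].include?(calls)
end
puts summarize(results)
```

Multiplying the longest failure streak by the probe interval gives a rough estimate of how long the application was unable to write during the failover.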

Automated failover for MongoDB and seamless integration with Ruby applications significantly enhance the resilience of your deployment. By combining robust MongoDB replica set configuration with intelligent application-level connection management and comprehensive monitoring, you can build a highly available system on DigitalOcean.