Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and Ruby Deployments on Linode

Establishing a MongoDB Replica Set for High Availability

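A robust disaster recovery strategy for MongoDB hinges on a properly configured replica set, which provides data redundancy and automatic failover when a node fails. This example assumes a three-node replica set, with each member on its own Linode instance. Three voting members let the set keep a majority, and therefore elect a PRIMARY, after losing any one node. Placing the instances in different Linode data centers also protects against a site-wide outage, at the cost of higher replication latency.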

Each MongoDB instance must be configured to participate in the replica set. This means setting replication.replSetName to the same value on every node in the MongoDB configuration file (typically /etc/mongod.conf) and restarting the daemon so that it picks up the change.

MongoDB Configuration File (`mongod.conf`)

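storage:
  dbPath: /var/lib/mongodb
  # journal.enabled is only accepted by older releases; MongoDB 6.1 and later
  # always journal, so drop this block there.
  journal:
    enabled: true
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  # 0.0.0.0 listens on every interface. Prefer 127.0.0.1 plus the node's
  # private IP, and restrict port 27017 with a firewall.
  bindIp: 0.0.0.0
  port: 27017
replication:
  # Must be identical on all three nodes.
  replSetName: rs0
processManagement:
  # Remove fork if your systemd unit runs mongod in the foreground.
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid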

After updating the configuration on all three MongoDB nodes, restart the MongoDB service on each:

sudo systemctl restart mongod

Initializing the Replica Set

Connect to one of the MongoDB instances with the MongoDB shell (mongosh, or the legacy mongo shell on releases before 6.0) and initiate the replica set from there. Run rs.initiate() on one member only, and list every member of the replica set in the configuration. For production environments, enable authentication and TLS so that traffic between replica set members, and between clients and the set, is authenticated and encrypted.

mongosh --host NODE1_IP --port 27017

rs.initiate(
  {
    _id : "rs0",
    members: [
      { _id: 0, host : "NODE1_IP:27017" },
      { _id: 1, host : "NODE2_IP:27017" },
      { _id: 2, host : "NODE3_IP:27017" }
    ]
  }
)

You can verify the replica set status by running rs.status() in the shell. The output lists the state of each member (PRIMARY, SECONDARY, or ARBITER if one is configured). Shortly after initiation, one member should report PRIMARY and the other two SECONDARY.
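The same check can be automated. The sketch below is a minimal, illustrative summary of an rs.status()-style document represented as a plain Ruby Hash. The helper name summarize_replica_set and the sample host names are hypothetical, and the majority calculation assumes every member is a voting member.

```ruby
# Summarize a replSetGetStatus-style document given as a Hash with string keys.
# Assumption: all listed members vote, so majority = floor(n / 2) + 1.
def summarize_replica_set(status)
  members  = status.fetch("members")
  primary  = members.find { |m| m["stateStr"] == "PRIMARY" }
  healthy  = members.count { |m| m["health"].to_i == 1 }
  majority = members.size / 2 + 1

  {
    set: status["set"],
    primary: primary && primary["name"],
    healthy_members: healthy,
    majority_needed: majority,
    can_elect_primary: healthy >= majority
  }
end

# Hypothetical sample data shaped like rs.status() output.
sample_status = {
  "set" => "rs0",
  "members" => [
    { "name" => "node-a.example:27017", "stateStr" => "PRIMARY",   "health" => 1 },
    { "name" => "node-b.example:27017", "stateStr" => "SECONDARY", "health" => 1 },
    { "name" => "node-c.example:27017", "stateStr" => "(not reachable/healthy)", "health" => 0 }
  ]
}

puts summarize_replica_set(sample_status)
```

In an application, a document of this shape could be obtained by running the replSetGetStatus command against the admin database; the Hash here stands in for that result so the sketch runs without a database.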

Architecting Auto-Failover for Ruby Applications

For Ruby applications, the official mongo gem (the MongoDB Ruby driver) has built-in support for replica sets. The connection string should name the replica set and list several members as seeds. From those seeds the driver discovers the full topology, finds the current PRIMARY, and routes writes to it. When the PRIMARY fails, the driver detects the newly elected PRIMARY and sends subsequent operations there. Operations that are in flight during the election can still fail, so the application should be prepared to retry them.

Ruby MongoDB Connection String

In your Ruby application’s configuration (e.g., within an initializer file in Rails), you’ll define the MongoDB connection using a URI that includes the replica set details.

# config/initializers/mongo.rb (for Rails)

# Ensure the mongo gem is available: add `gem 'mongo'` to your Gemfile
# and run `bundle install` (or `gem install mongo` outside Bundler).
require 'mongo'

# Replace the NODEn_IP placeholders with your actual MongoDB node addresses
# and rs0 with your replica set name
MONGO_URI = "mongodb://NODE1_IP:27017,NODE2_IP:27017,NODE3_IP:27017/?replicaSet=rs0"

# For production, consider adding authentication and TLS options
# MONGO_URI = "mongodb://user:password@NODE1_IP:27017,NODE2_IP:27017,NODE3_IP:27017/?replicaSet=rs0&tls=true&tlsCAFile=/path/to/ca.pem"

begin
  # Connect to MongoDB
  $mongo_client = Mongo::Client.new(MONGO_URI)

  # Optional: Ping the database to ensure connection is established
  $mongo_client.database.command(ping: 1)
  Rails.logger.info "Successfully connected to MongoDB replica set."

rescue Mongo::Error => e
  Rails.logger.error "Failed to connect to MongoDB: #{e.message}"
  # Depending on your application's criticality, you might want to exit or retry
  # exit(1)
end

# You can then access your collections like:
# @users_collection = $mongo_client[:users]

The mongo gem handles discovery of the PRIMARY node. It monitors the replica set members in the background, so when the current PRIMARY becomes unavailable it notices the newly elected PRIMARY and sends subsequent operations there without a restart or configuration change. Retryable writes, which recent driver versions enable by default, retry a failed write once; this covers many, but not all, failover-related errors.
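Hand-written connection strings are easy to get wrong once credentials contain special characters. The following is a minimal sketch of assembling the URI from its parts using only the Ruby standard library. The helper name build_mongo_uri, the host names, and the credentials are hypothetical; retryWrites and serverSelectionTimeoutMS are standard MongoDB URI options, and the values shown are illustrative rather than recommendations.

```ruby
require "erb"
require "uri"

# Build a replica set connection string from a seed list.
# User name and password are percent-encoded so characters such as
# "@", ":" and "/" do not break URI parsing.
def build_mongo_uri(hosts:, replica_set:, user: nil, password: nil, options: {})
  raise ArgumentError, "at least one host is required" if hosts.empty?

  credentials =
    if user && password
      "#{ERB::Util.url_encode(user)}:#{ERB::Util.url_encode(password)}@"
    else
      ""
    end

  query = URI.encode_www_form({ "replicaSet" => replica_set }.merge(options))
  "mongodb://#{credentials}#{hosts.join(',')}/?#{query}"
end

# Hypothetical hosts and credentials, for illustration only.
uri = build_mongo_uri(
  hosts: %w[db1.example.internal:27017 db2.example.internal:27017 db3.example.internal:27017],
  replica_set: "rs0",
  user: "app",
  password: "p@ss/word",
  options: { "retryWrites" => "true", "serverSelectionTimeoutMS" => "15000" }
)
puts uri
```

The resulting string can be passed to Mongo::Client.new in place of a literal. Listing all three members means the driver can still find the set when any single seed is down.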

Simulating Failover and Testing

Regular testing of your failover mechanism is paramount. This involves simulating node failures and observing how your application responds. A common approach is to stop the MongoDB service on the current PRIMARY node.

Simulating PRIMARY Node Failure

# On the current PRIMARY MongoDB node:
sudo systemctl stop mongod

After stopping the service, monitor your Ruby application logs. You should see connection errors at first, and database operations may fail or stall until a new PRIMARY is available. A clean shutdown usually lets the PRIMARY step down before it exits, so the election is quick. After an abrupt failure, the remaining members wait for the election timeout (settings.electionTimeoutMillis, 10 seconds by default) before electing a new PRIMARY. With default settings the whole process typically completes within about 12 seconds, after which the driver detects the new PRIMARY and the application resumes normal operation.
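To put a number on the failover window rather than eyeballing logs, a small probe loop can time how long a health check keeps failing. This is a sketch under stated assumptions: measure_outage is a hypothetical helper, the probe is any callable that raises while the database is unavailable, and the clock and sleeper are injectable so the example below can run against a simulated outage instead of a live replica set.

```ruby
# Call `probe` until it stops raising and return the elapsed seconds,
# or nil if it never succeeds within `max_checks` attempts.
def measure_outage(probe, interval: 1.0, max_checks: 120,
                   clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                   sleeper: ->(seconds) { sleep(seconds) })
  started_at = clock.call
  max_checks.times do
    begin
      probe.call
      return clock.call - started_at
    rescue StandardError
      sleeper.call(interval) # still down; wait before the next check
    end
  end
  nil
end

# Simulated outage: the first three checks fail, the fourth succeeds.
fake_now = 0.0
attempts = 0
flaky_probe = lambda do
  attempts += 1
  raise "no primary available" if attempts <= 3
end

outage = measure_outage(
  flaky_probe,
  interval: 2.0,
  clock: -> { fake_now },
  sleeper: ->(seconds) { fake_now += seconds }
)
puts "Simulated outage lasted #{outage} seconds"
```

In a real test, the probe could be a lambda that issues a ping command through the client from the initializer, started just before the PRIMARY is stopped. Comparing the measured window across runs shows whether a configuration change actually shortened recovery.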

To verify the new PRIMARY, connect to one of the remaining members and run rs.status(). One of the former SECONDARY nodes should now be listed as PRIMARY, and the stopped node should be reported as unreachable.

Testing Application Resilience

Beyond connection recovery, test critical application workflows during and immediately after a failover, and confirm that users experience minimal disruption. Writes acknowledged only by the old PRIMARY can be rolled back after a failover, so use the majority write concern (w: "majority") for data you cannot afford to lose. For critical applications, consider implementing retry mechanisms within your Ruby code for database operations that might fail during the brief failover window, and retry only operations that are safe to repeat.

# Example of a simple retry mechanism for a database operation
MAX_RETRIES = 3
RETRY_DELAY = 5 # seconds

# Runs the block and returns its result. If the driver raises an error
# (for example during a failover), the block is retried up to MAX_RETRIES
# times with a fixed delay. Only wrap operations that are safe to repeat.
def with_retry(operation_name)
  retries = 0
  begin
    yield
  rescue Mongo::Error => e
    retries += 1
    if retries <= MAX_RETRIES
      Rails.logger.warn "Operation '#{operation_name}' failed (#{e.message}). Retrying in #{RETRY_DELAY}s... (Attempt #{retries}/#{MAX_RETRIES})"
      sleep RETRY_DELAY
      retry
    else
      Rails.logger.error "Operation '#{operation_name}' failed after #{MAX_RETRIES} retries. #{e.message}"
      raise # Re-raise the exception after exhausting retries
    end
  end
end

# Usage:
# user = with_retry("fetching user data") do
#   $mongo_client[:users].find(_id: user_id).first
# end
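A fixed delay makes every client retry at the same moment once the new PRIMARY appears. One common alternative, sketched here as an assumption rather than as part of the original approach, is exponential backoff with full jitter. The names backoff_delay and with_backoff are hypothetical, and the example uses IOError as a stand-in so it runs without the driver.

```ruby
# Delay before retry number `attempt` (1-based): a random value between
# 0 and min(cap, base * 2^(attempt - 1)) seconds ("full jitter").
def backoff_delay(attempt, base: 0.5, cap: 8.0, rng: Random.new)
  ceiling = [cap, base * (2**(attempt - 1))].min
  rng.rand * ceiling
end

# Run the block, retrying on the listed error classes with growing delays.
# Returns the block's value; re-raises once max_attempts is reached.
def with_backoff(max_attempts: 5, retry_on: [StandardError],
                 sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *retry_on
    raise if attempt >= max_attempts
    sleeper.call(backoff_delay(attempt))
    retry
  end
end

# Simulated failover: the first two attempts fail, the third succeeds.
result = with_backoff(max_attempts: 4, retry_on: [IOError], sleeper: ->(_s) {}) do |attempt|
  raise IOError, "simulated failover" if attempt < 3
  "succeeded on attempt #{attempt}"
end
puts result
```

With the MongoDB driver, retry_on would more plausibly list connection-related classes such as Mongo::Error::SocketError, Mongo::Error::SocketTimeoutError, and Mongo::Error::NoServerAvailable, so that validation or duplicate-key errors are not retried pointlessly.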

Monitoring and Alerting

A critical component of any disaster recovery strategy is robust monitoring. Alerts should fire as soon as the replica set degrades, before users start reporting errors. For MongoDB, this means monitoring replica set health, node status, and replication lag.

Key Metrics to Monitor

Replica Set Status: Ensure a PRIMARY is always available and that SECONDARY nodes are in sync.
Replication Lag: Monitor the time difference between operations on the PRIMARY and their application on SECONDARY nodes. High lag indicates potential performance issues or a struggling SECONDARY.
Node Health: Monitor CPU, memory, disk I/O, and network usage on each MongoDB instance.
Application Connection Status: Track successful and failed connections from your Ruby application to the MongoDB cluster.
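Replication lag in particular is simple arithmetic on data that rs.status() already returns. The sketch below derives it from the optimeDate field of each member in an rs.status()-style Hash. The helper name replication_lag_seconds and the sample values are hypothetical, and the approach assumes the members' reported timestamps are comparable.

```ruby
# Lag per SECONDARY, in seconds, relative to the PRIMARY's last applied
# operation. Returns an empty Hash when no PRIMARY is present.
def replication_lag_seconds(status)
  members = status.fetch("members")
  primary = members.find { |m| m["stateStr"] == "PRIMARY" }
  return {} unless primary

  members
    .select { |m| m["stateStr"] == "SECONDARY" }
    .to_h { |m| [m["name"], (primary["optimeDate"] - m["optimeDate"]).round(3)] }
end

# Hypothetical sample shaped like rs.status() output.
sample_status = {
  "members" => [
    { "name" => "node-a.example:27017", "stateStr" => "PRIMARY",
      "optimeDate" => Time.utc(2024, 1, 1, 12, 0, 42) },
    { "name" => "node-b.example:27017", "stateStr" => "SECONDARY",
      "optimeDate" => Time.utc(2024, 1, 1, 12, 0, 42) },
    { "name" => "node-c.example:27017", "stateStr" => "SECONDARY",
      "optimeDate" => Time.utc(2024, 1, 1, 12, 0, 0) }
  ]
}

replication_lag_seconds(sample_status).each do |node, lag|
  puts "#{node}: #{lag}s behind the PRIMARY"
end
```

A value like this can be exported as a gauge to whichever monitoring system is in use, which keeps the alert threshold in one place instead of inside application code.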

Tools like Prometheus with the mongodb_exporter, Datadog, or New Relic can be configured to collect these metrics. Set up alerts for critical conditions, such as:

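No PRIMARY node detected for more than X seconds (an election normally completes in seconds, so choose a threshold just above your usual election time).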
Replication lag exceeding Y seconds.
MongoDB service is down on any node.
High number of connection errors from the application.
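The conditions above can be expressed as a small rule check, which is a convenient way to agree on thresholds before encoding them in a monitoring tool. This is an illustrative sketch only: the Thresholds struct, the evaluate_alerts helper, the snapshot keys, and all numeric values are hypothetical and stand in for the X and Y placeholders in the list.

```ruby
# Alert thresholds; the numbers used below are placeholders, not recommendations.
Thresholds = Struct.new(:max_seconds_without_primary, :max_lag_seconds,
                        :max_connection_errors, keyword_init: true)

# Return a list of human-readable alerts for one metrics snapshot.
def evaluate_alerts(snapshot, thresholds)
  alerts = []

  if snapshot[:seconds_without_primary] > thresholds.max_seconds_without_primary
    alerts << "no PRIMARY for #{snapshot[:seconds_without_primary]}s"
  end

  snapshot[:lag_seconds].each do |node, lag|
    alerts << "replication lag on #{node}: #{lag}s" if lag > thresholds.max_lag_seconds
  end

  snapshot[:down_nodes].each { |node| alerts << "mongod down on #{node}" }

  if snapshot[:connection_errors] > thresholds.max_connection_errors
    alerts << "#{snapshot[:connection_errors]} application connection errors"
  end

  alerts
end

thresholds = Thresholds.new(max_seconds_without_primary: 30,
                            max_lag_seconds: 10,
                            max_connection_errors: 50)

# Hypothetical snapshot assembled from the metrics described earlier.
snapshot = {
  seconds_without_primary: 0,
  lag_seconds: { "node-b.example:27017" => 2.5, "node-c.example:27017" => 45.0 },
  down_nodes: [],
  connection_errors: 3
}

puts evaluate_alerts(snapshot, thresholds)
```

In practice these rules would live in the alerting system itself (for example as Prometheus alert rules); the Ruby version is mainly useful for reviewing thresholds or for a lightweight cron-style check.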

For Linode deployments, Linode’s built-in monitoring can cover host-level metrics such as CPU, disk I/O, and network traffic, while the MongoDB-specific metrics come from the tools above. Route all of these alerts to a notification system like PagerDuty or Slack so that your operations team learns of a potential disaster immediately.