Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and Ruby Deployments on AWS

Designing for Resilience: MongoDB Replica Sets and Ruby Application Failover on AWS

Achieving true high availability for a critical application stack, particularly one involving a stateful database like MongoDB and a stateless application layer like Ruby on Rails, necessitates a robust disaster recovery strategy. This strategy must go beyond simple backups and encompass automated failover mechanisms. This document outlines an architectural approach for implementing automated failover for a MongoDB replica set and its associated Ruby application deployment within the AWS ecosystem.

MongoDB Replica Set Health and Failover Orchestration

A MongoDB replica set is the foundational element for database availability. It comprises multiple data-bearing nodes, one of which acts as the primary, handling all write operations. The other nodes are secondaries, replicating the primary’s oplog. In the event of a primary failure, the replica set automatically elects a new primary from the available secondaries. However, this automatic election is only part of the solution. We need to ensure our application layer is aware of and can seamlessly switch to the new primary.

Configuring a Multi-AZ MongoDB Replica Set

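For production deployments, a MongoDB replica set should span multiple Availability Zones (AZs) within a single AWS region. This provides resilience against AZ-level failures. A typical setup involves at least three data-bearing nodes, ideally distributed across three AZs, so that losing any one AZ still leaves a voting majority. An arbiter votes in elections but holds no data, so it is only worth adding when the set would otherwise have an even number of voting members (for example, two data-bearing nodes plus one arbiter). Adding an arbiter to a three-node set produces four voters and does not improve fault tolerance.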
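The majority arithmetic behind this layout can be checked with a few lines of Ruby. This is a minimal sketch: the method names are illustrative, and the AZ layouts passed in are example inputs rather than recommendations from MongoDB or AWS.

```ruby
# Votes needed to elect a primary: a strict majority of voting members.
def majority(voting_members)
  voting_members / 2 + 1
end

# How many voting members can be lost while an election can still succeed.
def fault_tolerance(voting_members)
  voting_members - majority(voting_members)
end

# members_per_az lists the voting members placed in each AZ, e.g. [1, 1, 1].
# True when the set keeps a majority after losing any single AZ.
def survives_any_az_loss?(members_per_az)
  total = members_per_az.sum
  members_per_az.all? { |in_az| total - in_az >= majority(total) }
end

(2..5).each do |n|
  puts "#{n} voting members: majority #{majority(n)}, tolerates #{fault_tolerance(n)} failure(s)"
end

puts survives_any_az_loss?([1, 1, 1]) # one member in each of three AZs
puts survives_any_az_loss?([2, 1])    # two AZs only
```

Three and four voting members both tolerate a single failure, which is why an even member count buys nothing, and a two-AZ layout cannot survive the loss of the AZ that holds most of the members.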

Here’s a conceptual `mongod.conf` snippet for a replica set member:

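storage:
  dbPath: /var/lib/mongodb
  # journal.enabled is only accepted up to MongoDB 6.0. From 6.1 onward
  # journaling is always on and this option must be removed.
  journal:
    enabled: true
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  # 0.0.0.0 listens on every interface. Prefer the instance's private
  # address, or restrict access with security groups if you keep this.
  bindIp: 0.0.0.0
  port: 27017
replication:
  replSetName: myReplicaSet
# Only set clusterRole on members of a sharded cluster. Leave the section
# out for a plain replica set.
# sharding:
#   clusterRole: shardsvr # or configsvr for a config server replica set
security:
  keyFile: /etc/mongo/mongodb-keyfile.pem
  authorization: enabled
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid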

The `replSetName` must be identical across all members. The `keyFile` is crucial for inter-node authentication and should be securely distributed to all members.
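Generating the key material is usually a one-off task. The sketch below assumes a keyfile made of random base64 text, which is the form MongoDB's documentation describes. The method names are illustrative, and the path in the usage note is simply the one from the configuration above.

```ruby
require "securerandom"

# Returns random base64 text for a keyfile. 756 random bytes encode to
# 1008 base64 characters, inside MongoDB's 6..1024 character limit.
def generate_keyfile_content
  SecureRandom.base64(756)
end

# Writes the key to `path` and makes it readable by its owner only.
# On Unix, mongod rejects keyfiles with group or world permissions.
def write_keyfile(path)
  File.write(path, generate_keyfile_content)
  File.chmod(0o400, path)
  path
end

puts generate_keyfile_content.length
```

Calling `write_keyfile("/etc/mongo/mongodb-keyfile.pem")` once and copying the resulting file to every member keeps the key identical across the set. The file must be owned by the user that runs `mongod`.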

Initiating and Managing the Replica Set

Once MongoDB instances are running with the correct configuration, initiate the replica set from one of the nodes (typically the one designated as the initial primary). Use `mongosh` for this; the legacy `mongo` shell was removed in MongoDB 6.0:

mongosh --port 27017
> rs.initiate(
   {
      _id : "myReplicaSet",
      members: [
         { _id: 0, host: "mongo-node-1.example.com:27017" },
         { _id: 1, host: "mongo-node-2.example.com:27017" },
         { _id: 2, host: "mongo-node-3.example.com:27017" }
      ]
   }
)

After initiation, you can add more members, configure priorities, and set up delayed secondaries if needed. Monitor the replica set status using `rs.status()`.
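For automated monitoring, the same information can be read programmatically. The sketch below assumes a hash shaped like the `replSetGetStatus` output that `rs.status()` wraps, using its `members`, `name`, `stateStr`, and `optimeDate` fields. The sample document is hand-written for illustration and is not real output.

```ruby
# Summarises a replSetGetStatus-style document: which member is primary and
# how many seconds each secondary's last applied operation trails it.
def summarize_replica_set(status)
  members = status.fetch("members")
  primary = members.find { |m| m["stateStr"] == "PRIMARY" }
  lag = {}
  if primary
    members.each do |m|
      next unless m["stateStr"] == "SECONDARY"
      lag[m["name"]] = (primary["optimeDate"] - m["optimeDate"]).round(1)
    end
  end
  { primary: primary && primary["name"], lag_seconds: lag }
end

# Hand-written sample input (hypothetical values).
now = Time.utc(2024, 1, 1, 12, 0, 0)
sample = {
  "set" => "myReplicaSet",
  "members" => [
    { "name" => "mongo-node-1.example.com:27017", "stateStr" => "PRIMARY",   "optimeDate" => now },
    { "name" => "mongo-node-2.example.com:27017", "stateStr" => "SECONDARY", "optimeDate" => now - 2 },
    { "name" => "mongo-node-3.example.com:27017", "stateStr" => "SECONDARY", "optimeDate" => now - 15 }
  ]
}
p summarize_replica_set(sample)
```

With the Ruby driver, a document of this shape can be obtained by running the `replSetGetStatus` command against the `admin` database.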

Ruby Application Integration and Connection Management

The Ruby application needs to be aware of the replica set’s current primary. Hardcoding a single MongoDB host is a recipe for failure. The standard approach is to provide the application with the replica set name and a list of potential hosts. The MongoDB driver will then handle discovering the current primary.

Configuring the MongoDB Driver

In a Rails application using the `mongo` gem, the connection is typically configured in `config/mongoid.yml` (for Mongoid ODM) or a custom initializer if using the native driver.

# config/mongoid.yml (example for Mongoid)
production:
  clients:
    default:
      database: my_app_production
      hosts:
        - mongo-node-1.example.com:27017
        - mongo-node-2.example.com:27017
        - mongo-node-3.example.com:27017
      options:
        # Driver options, including the replica set name, belong under options
        replica_set: myReplicaSet
        # Consider setting read preference to primary for consistency during failover
        # or secondary_preferred for read scaling if acceptable.
        read:
          mode: :primary
        # Add other connection options as needed
        # ssl: true
        # ssl_ca_cert: /path/to/ca.pem
        # ssl_cert: /path/to/client.pem
        # ssl_key: /path/to/client.key

The key here is `replica_set: myReplicaSet` and providing multiple `hosts`. The driver will connect to one of the listed hosts, discover the replica set configuration, and automatically direct operations to the current primary. If the primary changes, the driver will detect this and update its internal routing.
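The same seed list and replica set name can also be expressed as a standard `mongodb://` connection string, which Mongoid accepts through a `uri` setting in place of `hosts` and `database`. The helper below is a minimal sketch of that format; the method name is illustrative and the host names are the placeholders used throughout this document.

```ruby
require "uri"

# Builds a mongodb:// connection string from a seed list. The driver only
# needs to reach one listed host to discover the rest of the replica set.
def mongodb_uri(hosts:, database:, replica_set:, read_preference: "primary")
  query = URI.encode_www_form(replicaSet: replica_set, readPreference: read_preference)
  "mongodb://#{hosts.join(',')}/#{database}?#{query}"
end

puts mongodb_uri(
  hosts: %w[
    mongo-node-1.example.com:27017
    mongo-node-2.example.com:27017
    mongo-node-3.example.com:27017
  ],
  database: "my_app_production",
  replica_set: "myReplicaSet"
)
```

Listing every member is not strictly required, but a longer seed list means the application can still bootstrap when some of the listed hosts are down.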

Handling Connection Errors and Retries

Even with automatic primary discovery, transient network issues or the brief window during a failover can lead to connection errors. Implement robust error handling and retry logic in your Ruby application.

# Example in a Rails controller action (Mongoid model assumed)
# Errors the 2.x Ruby driver raises when it cannot reach a usable server,
# for example while an election is in progress.
connection_errors = [
  Mongo::Error::NoServerAvailable,
  Mongo::Error::SocketError,
  Mongo::Error::SocketTimeoutError
]
max_retries = 3
retry_count = 0

begin
  # Attempt a database operation
  user = User.find(params[:id])
  render json: user
rescue *connection_errors => e
  # Connection issues might indicate a failover in progress
  Rails.logger.error("MongoDB connection failed: #{e.message}")
  if retry_count < max_retries
    sleep(2**retry_count) # Exponential backoff: 1s, 2s, 4s
    retry_count += 1
    Rails.logger.info("Retrying MongoDB operation (retry #{retry_count} of #{max_retries})...")
    retry # Re-runs the begin block from the top
  end
  render json: { error: "Database unavailable, please try again later." }, status: :service_unavailable
rescue Mongo::Error::OperationFailure => e
  # The server was reached but rejected or failed the operation
  Rails.logger.error("MongoDB operation failed: #{e.message}")
  render json: { error: "Database operation failed, please try again." }, status: :service_unavailable
rescue Mongoid::Errors::DocumentNotFound
  # A missing record is not a database outage
  render json: { error: "Not found." }, status: :not_found
rescue StandardError => e
  # Catch other unexpected errors
  Rails.logger.error("An unexpected error occurred: #{e.message}")
  render json: { error: "An internal server error occurred." }, status: :internal_server_error
end

The connection-level errors (`Mongo::Error::NoServerAvailable`, `Mongo::Error::SocketError`, and `Mongo::Error::SocketTimeoutError`) are the important ones here; the 2.x driver has no `Mongo::Error::ConnectionFailure` class. The driver already waits for a usable server during server selection (up to `server_selection_timeout`, 30 seconds by default), and recent driver versions retry supported reads and writes once on their own, so application-level retries are a second line of defense. Exponential backoff keeps the application from hammering the replica set during a failover, but note that `sleep` blocks the request thread, so keep the retry count and delays small in a web request.
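To avoid repeating retry logic in every action, the backoff can be factored into a helper. The following is a minimal sketch in plain Ruby: `with_backoff`, its parameters, and the `TransientError` stand-in class are all illustrative names, not part of the MongoDB driver. It adds a delay cap and random jitter so that many application instances do not retry in lockstep.

```ruby
# Retries the block when it raises one of the `on` error classes.
# Delay before retry n is a random fraction of
# min(base_delay * 2**n, max_delay); the final failure is re-raised.
def with_backoff(on:, max_retries: 3, base_delay: 1.0, max_delay: 8.0,
                 sleeper: ->(seconds) { sleep(seconds) }, rng: Random.new)
  attempt = 0
  begin
    yield
  rescue *on
    raise if attempt >= max_retries
    delay = [base_delay * (2**attempt), max_delay].min
    sleeper.call(rng.rand * delay)
    attempt += 1
    retry
  end
end

# Stand-in for a driver connection error, so the example runs without MongoDB.
class TransientError < StandardError; end

calls = 0
result = with_backoff(on: [TransientError], sleeper: ->(_seconds) {}) do
  calls += 1
  raise TransientError, "no primary yet" if calls < 3
  "ok after #{calls} calls"
end
puts result
```

In an application, the `on:` list would hold the driver's connection error classes, and the default `sleeper` would be left in place. Injecting the sleeper is only there to keep tests fast.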

Automating Application Deployment and Health Checks

For automated failover of the application layer itself, we rely on AWS’s managed services and robust health checking. Deploying Ruby applications on AWS typically involves services like Elastic Beanstalk, ECS, or EKS.

Leveraging Elastic Load Balancing (ELB) and Auto Scaling Groups (ASG)

A common pattern is to run the Ruby application on EC2 instances in an Auto Scaling Group (ASG) that is registered with an Application Load Balancer (ALB) target group. The ALB distributes traffic to healthy instances within the ASG. The ASG, in turn, monitors instance health and replaces unhealthy ones.

Key Configuration Points:

ALB Target Group Health Checks: Configure the ALB to perform health checks on your application instances. These checks should be specific enough to verify application responsiveness and, crucially, its ability to connect to MongoDB. A simple HTTP 200 OK might not be sufficient. A dedicated health check endpoint that attempts a read operation from MongoDB is more robust.
ASG Health Check Type: Set the ASG’s health check type to `ELB` (or `EC2` if not using ELB, but ELB is recommended for this scenario). This ensures that if the ALB marks an instance as unhealthy, the ASG will terminate it and launch a replacement.
Desired Capacity and Scaling Policies: Configure the desired number of application instances and set up scaling policies based on metrics like CPU utilization or request count. This ensures the application layer can handle load and recover from instance failures.

Implementing a MongoDB-Aware Health Check Endpoint

A basic health check endpoint in a Rails application might look like this:

# app/controllers/health_check_controller.rb
class HealthCheckController < ApplicationController
  # If using Devise or similar. raise: false avoids an error when the
  # callback is not defined in this application.
  skip_before_action :authenticate_user!, raise: false

  def show
    # Attempt a simple, low-impact read operation
    # Using a small collection is ideal
    # For Mongoid:
    User.limit(1).first # Or a specific health check collection

    # For the native driver, reuse one long-lived client rather than
    # building a new one per request. MONGO_CLIENT is a hypothetical
    # constant created once in an initializer:
    # MONGO_CLIENT = Mongo::Client.new(MONGO_HOSTS, replica_set: MONGO_REPLICA_SET_NAME)
    # MONGO_CLIENT[:health_checks].find.limit(1).first

    render json: { status: "ok", database: "connected" }, status: :ok
  rescue Mongo::Error::NoServerAvailable, Mongo::Error::SocketError, Mongo::Error::SocketTimeoutError => e
    Rails.logger.error("Health check failed: MongoDB connection error - #{e.message}")
    render json: { status: "error", database: "disconnected", message: "Cannot connect to database" }, status: :service_unavailable
  rescue StandardError => e
    Rails.logger.error("Health check failed: Unexpected error - #{e.message}")
    render json: { status: "error", database: "unknown", message: "Internal server error" }, status: :internal_server_error
  end
end

# config/routes.rb
Rails.application.routes.draw do
  get 'health', to: 'health_check#show'
  # ... other routes
end

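Configure your ALB target group to poll the `/health` endpoint with a reasonable timeout (e.g., 5 seconds) and interval (e.g., 30 seconds). Once the configured number of consecutive health checks fail (the unhealthy threshold), the ALB marks the instance as unhealthy and, with the ASG health check type set to `ELB`, the ASG replaces it. Because this endpoint depends on MongoDB, every instance fails it at the same time while the replica set has no primary. Choose an interval and threshold that outlast a normal election (typically around 12 seconds with default replica set settings); otherwise a routine failover could cause the ASG to recycle the whole fleet.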
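Whether a given interval and threshold ride out a database election can be estimated with simple arithmetic. The sketch below is a simplified model that ignores the health check timeout and in-flight checks; the method names are illustrative, and the 12-second election figure is MongoDB's documented typical upper bound for default settings, not a guarantee.

```ruby
# Worst case: a check fires the instant the outage starts, then one more
# fires every `interval_seconds` until the outage ends.
def max_failed_checks(outage_seconds:, interval_seconds:)
  (outage_seconds / interval_seconds).floor + 1
end

# True when an outage of the given length cannot produce enough
# consecutive failures to reach the unhealthy threshold.
def tolerates_outage?(outage_seconds:, interval_seconds:, unhealthy_threshold:)
  max_failed_checks(outage_seconds: outage_seconds,
                    interval_seconds: interval_seconds) < unhealthy_threshold
end

election_seconds = 12 # assumed typical election time
[[30, 2], [5, 2], [5, 4]].each do |interval, threshold|
  ok = tolerates_outage?(outage_seconds: election_seconds,
                         interval_seconds: interval,
                         unhealthy_threshold: threshold)
  puts "interval=#{interval}s threshold=#{threshold}: #{ok ? 'rides out election' : 'may mark targets unhealthy'}"
end
```

A 30-second interval with a threshold of 2 passes this check, while a 5-second interval needs a threshold of at least 4. An alternative design is to keep the load balancer check independent of MongoDB and alert on database connectivity separately.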

Advanced Considerations and Monitoring

Automated MongoDB Failover Orchestration Tools

While MongoDB’s built-in replica set election is automatic, for more complex scenarios or to orchestrate actions *around* a failover (e.g., notifying operations teams, triggering external scripts), consider tools like:

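Percona Monitoring and Management (PMM) or MongoDB Ops Manager/Cloud Manager: Monitoring tools that track replica set state and can alert on events such as elections and primary changes. Note that Orchestrator, which is sometimes mentioned in this context, manages MySQL replication topologies and does not support MongoDB.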
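Custom Scripts with AWS Lambda/EventBridge: Poll the replica set status on a schedule (for example with the `replSetGetStatus` command) or publish it as custom CloudWatch metrics, since CloudWatch has no built-in metrics for self-managed MongoDB. Trigger Lambda functions on specific events (e.g., primary change detected) to perform custom actions.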
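The core of a custom primary-change detector is a comparison between two consecutive status snapshots. The sketch below assumes each snapshot is a hash shaped like `replSetGetStatus` output (`members`, `name`, `stateStr`). The method names and event labels are illustrative, and storing the previous snapshot between invocations is left out.

```ruby
# Name of the member currently reporting PRIMARY, or nil if there is none.
def primary_of(status)
  member = status.fetch("members").find { |m| m["stateStr"] == "PRIMARY" }
  member && member["name"]
end

# Returns nil when the primary is unchanged, otherwise a small event hash
# that a scheduled job could forward to a notification channel.
def primary_change(previous, current)
  before = primary_of(previous)
  after = primary_of(current)
  return nil if before == after

  event =
    if after.nil? then "primary_lost"
    elsif before.nil? then "primary_elected"
    else "primary_changed"
    end
  { event: event, from: before, to: after }
end

# Hand-written snapshots (hypothetical values).
earlier = { "members" => [
  { "name" => "mongo-node-1.example.com:27017", "stateStr" => "PRIMARY" },
  { "name" => "mongo-node-2.example.com:27017", "stateStr" => "SECONDARY" }
] }
later = { "members" => [
  { "name" => "mongo-node-1.example.com:27017", "stateStr" => "SECONDARY" },
  { "name" => "mongo-node-2.example.com:27017", "stateStr" => "PRIMARY" }
] }
p primary_change(earlier, later)
```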

Monitoring and Alerting

Comprehensive monitoring is non-negotiable. Key metrics to track include:

MongoDB Replica Set Status: `rs.status()` output, oplog lag, election counts, primary stepdowns. Use CloudWatch Agent to push custom metrics.
Application Performance: Request latency, error rates (especially database-related errors), connection pool usage.
AWS Infrastructure: ALB request counts, target group health status, ASG instance counts, EC2 CPU/Memory utilization.
Network Connectivity: Ensure security groups and NACLs allow necessary traffic between application servers and MongoDB instances.

Set up CloudWatch Alarms for critical metrics. For instance, an alarm on the number of unhealthy targets in your ALB’s target group, or an alarm on high MongoDB connection errors, can provide early warnings of potential issues or active failovers.
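As an illustration of the unhealthy-target alarm, the sketch below builds a parameter hash in the shape expected by CloudWatch's PutMetricAlarm API (`put_metric_alarm` in the AWS SDK for Ruby). The alarm name, period, threshold, and all identifiers in the usage example are placeholder assumptions to adapt, and the hash is only constructed here, not sent to AWS.

```ruby
# Parameters for an alarm that fires when the ALB reports at least one
# unhealthy target for two consecutive one-minute periods.
def unhealthy_targets_alarm(target_group:, load_balancer:, sns_topic_arn:)
  {
    alarm_name: "app-unhealthy-targets", # hypothetical name
    namespace: "AWS/ApplicationELB",
    metric_name: "UnHealthyHostCount",
    dimensions: [
      { name: "TargetGroup", value: target_group },
      { name: "LoadBalancer", value: load_balancer }
    ],
    statistic: "Maximum",
    period: 60,
    evaluation_periods: 2,
    threshold: 1,
    comparison_operator: "GreaterThanOrEqualToThreshold",
    alarm_actions: [sns_topic_arn]
  }
end

# Placeholder identifiers, not real resources.
params = unhealthy_targets_alarm(
  target_group: "targetgroup/my-app/0123456789abcdef",
  load_balancer: "app/my-alb/0123456789abcdef",
  sns_topic_arn: "arn:aws:sns:us-east-1:123456789012:ops-alerts"
)
p params
```

With the `aws-sdk-cloudwatch` gem installed and credentials configured, a hash like this could be passed to `Aws::CloudWatch::Client#put_metric_alarm`.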

Read Preferences and Consistency

The choice of read preference is critical. An application configured to read from secondaries (`:secondary`, `:secondary_preferred`) can read stale data at any time because of replication lag, and a failover can make this more visible while members resynchronize or if a network partition exists. For applications requiring strong consistency, setting the read preference to `:primary` (as shown in the `mongoid.yml` example) is the safest choice, at the cost of read scalability and with the caveat that reads fail until a new primary is elected. If serving possibly stale reads during that window is acceptable, `:primary_preferred` falls back to secondaries only when no primary is available.
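The effect of each mode during an election can be illustrated with a simplified model of server selection. This sketch is not the driver's implementation: it ignores latency windows, tag sets, and staleness limits, and the method name and server hashes are illustrative.

```ruby
# Returns the servers a read may be routed to under the given mode.
def eligible_servers(mode, servers)
  primaries   = servers.select { |s| s[:type] == :primary }
  secondaries = servers.select { |s| s[:type] == :secondary }
  case mode
  when :primary             then primaries
  when :primary_preferred   then primaries.empty? ? secondaries : primaries
  when :secondary           then secondaries
  when :secondary_preferred then secondaries.empty? ? primaries : secondaries
  when :nearest             then primaries + secondaries
  else raise ArgumentError, "unknown read preference: #{mode}"
  end
end

# Mid-election: the old primary is gone and no new one has been elected.
mid_election = [
  { host: "mongo-node-2.example.com:27017", type: :secondary },
  { host: "mongo-node-3.example.com:27017", type: :secondary }
]

%i[primary primary_preferred secondary_preferred].each do |mode|
  hosts = eligible_servers(mode, mid_election).map { |s| s[:host] }
  puts "#{mode}: #{hosts.empty? ? 'no eligible server, reads wait or fail' : hosts.join(', ')}"
end
```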

Testing Your Failover Strategy

Regularly test your automated failover. This involves:

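Simulating MongoDB Primary Failure: Gracefully step down the primary (`rs.stepDown()`) and observe the election process and application recovery. Because a step-down is a planned handover, also stop the primary's `mongod` process or its EC2 instance to exercise an unplanned failure.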
Terminating Application Instances: Manually terminate EC2 instances within the ASG to test the ALB/ASG failover mechanism.
Network Partitioning: Simulate network issues between application servers and the database.
Full Region Failure (if applicable): For multi-region DR, test failover to a secondary region.

Document the expected behavior for each test scenario and verify that the actual outcome matches. This iterative testing is crucial for building confidence in your disaster recovery architecture.
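One way to make "expected behavior" measurable is to probe the application at a fixed rate during each test and compute the longest window of failed probes. The sketch below covers only the analysis step and assumes the probe results have already been collected as `[seconds_since_start, ok]` pairs. The method name and sample data are illustrative.

```ruby
# Returns the longest span, in seconds, between the first failed probe of
# an outage and the next successful probe (or the last sample if the
# outage never ended).
def longest_outage(samples)
  longest = 0.0
  outage_started = nil
  samples.each do |time, ok|
    if ok
      if outage_started
        longest = [longest, time - outage_started].max
        outage_started = nil
      end
    else
      outage_started ||= time
    end
  end
  longest = [longest, samples.last[0] - outage_started].max if outage_started
  longest
end

# Hand-written probe results from a hypothetical failover test.
samples = [
  [0.0, true], [1.0, true], [2.0, false], [3.0, false],
  [4.0, false], [5.0, true], [6.0, true]
]
puts longest_outage(samples)
```

Comparing this figure against the recovery time objective for each scenario turns a manual observation into a pass or fail result.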