Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and Ruby Deployments on Google Cloud

Designing for Resilience: MongoDB Replica Sets and Ruby Application Failover on GCP

Achieving true disaster recovery for a production application hinges on automating failover processes. This isn’t about manual intervention during an outage; it’s about building systems that detect failures and seamlessly transition operations to a healthy state with minimal human oversight. For a typical Ruby on Rails application backed by MongoDB, this means orchestrating failover for both the database layer and the application tier. We’ll focus on a Google Cloud Platform (GCP) environment, leveraging its managed services and infrastructure for robustness.

MongoDB Replica Set Auto-Failover with GCP Health Checks and Load Balancing

MongoDB’s native replication provides the foundation for high availability. A replica set consists of multiple MongoDB instances, one primary and several secondaries. If the primary becomes unavailable, one of the secondaries is automatically elected as the new primary. However, ensuring your application connects to the *current* primary requires a robust strategy. On GCP, this involves integrating MongoDB’s replica set awareness with GCP’s networking and health checking capabilities.

Configuring MongoDB Replica Sets

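First, ensure your MongoDB instances are configured as a replica set. This is typically done via the MongoDB configuration file (mongod.conf). For a multi-zone or multi-region deployment on GCP, every instance needs its own resolvable hostname and the same replSetName. Note that bindIp: 0.0.0.0 makes mongod listen on all interfaces, so restrict access with VPC firewall rules and enable authentication before exposing the port to other hosts.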

# /etc/mongod.conf
replication:
  replSetName: "myReplicaSet"
net:
  bindIp: 0.0.0.0
  port: 27017
storage:
  dbPath: /var/lib/mongodb
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid

After configuring each instance, initiate the replica set from exactly one of the members. Connect with the MongoDB shell (mongosh; the legacy mongo shell on releases before 6.0) and run:

mongosh
rs.initiate(
   {
      _id : "myReplicaSet",
      members: [
         { _id : 0, host : "mongo-instance-1.gcp.example.com:27017" },
         { _id : 1, host : "mongo-instance-2.gcp.example.com:27017" },
         { _id : 2, host : "mongo-instance-3.gcp.example.com:27017" }
      ]
   }
)

Replace mongo-instance-X.gcp.example.com with the actual internal DNS names or IP addresses of your MongoDB nodes within your GCP VPC. Use an odd number of voting members so that elections always have a clear majority: three members tolerate one failure, five tolerate two. For multi-region deployments, place the members so that losing any single zone or region still leaves a majority; an arbiter in a third location can act as a tie-breaker, although it holds no data.
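The preference for an odd member count follows from majority arithmetic: an election needs a strict majority of voting members, so adding a fourth member raises the required majority without tolerating any additional failure. A minimal Ruby sketch of that arithmetic (the helper names are illustrative and not part of any MongoDB API):

```ruby
# Votes needed to elect a primary: a strict majority of voting members.
def majority(voting_members)
  voting_members / 2 + 1
end

# How many voting members can be lost while an election is still possible.
def fault_tolerance(voting_members)
  voting_members - majority(voting_members)
end

(3..5).each do |n|
  puts "#{n} voting members: majority #{majority(n)}, tolerates #{fault_tolerance(n)} failure(s)"
end
```

Three and four members both tolerate a single failure, while five tolerate two, which is why the even-sized set adds cost without adding resilience.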

GCP Load Balancer for Application Connectivity

Directly connecting your Ruby application to a specific MongoDB instance is brittle. Instead, we’ll use a GCP Load Balancer. For MongoDB, a TCP Load Balancer is suitable. The key is to configure health checks that accurately reflect the health of the MongoDB primary.

Configuring the TCP Load Balancer

Create a regional internal passthrough Network Load Balancer (TCP) in GCP so that MongoDB is reachable only from inside your VPC; a database port should not sit behind an external address. Forwarding rules that point directly at a backend service are regional, so every resource below uses the same region. The critical component here is the health check.

# Create a basic TCP health check on the MongoDB port (it only verifies that the port accepts connections)
gcloud compute health-checks create tcp mongo-health-check \
    --region [your-region] \
    --port 27017 \
    --description "MongoDB health check" \
    --timeout 5s \
    --check-interval 5s \
    --unhealthy-threshold 2 \
    --healthy-threshold 2

# Create a regional backend service for the internal load balancer
gcloud compute backend-services create mongo-backend-service \
    --load-balancing-scheme INTERNAL \
    --protocol TCP \
    --health-checks mongo-health-check \
    --health-checks-region [your-region] \
    --region [your-region]

# Add your MongoDB instances as backend instances
# Assuming you have instance groups already set up for your MongoDB nodes (repeat per zone)
gcloud compute backend-services add-backend mongo-backend-service \
    --instance-group [MONGO_INSTANCE_GROUP_NAME] \
    --instance-group-zone [ZONE] \
    --region [your-region]

# Create an internal forwarding rule to direct traffic to the backend service
gcloud compute forwarding-rules create mongo-forwarding-rule \
    --load-balancing-scheme INTERNAL \
    --network [YOUR_VPC_NETWORK] \
    --subnet [YOUR_SUBNET] \
    --address [RESERVED_IP_ADDRESS] \
    --ip-protocol TCP \
    --ports 27017 \
    --backend-service mongo-backend-service \
    --backend-service-region [your-region] \
    --region [your-region]

The default TCP health check simply checks if the port is open. This is insufficient for MongoDB. We need a health check that verifies the instance is the *primary*. This is where custom health checks or a proxy layer become necessary. A common pattern is to use a small proxy service (e.g., written in Go or Python) running on each MongoDB node that exposes an HTTP endpoint. This proxy queries MongoDB for its role and returns a 200 OK if it’s the primary, and a non-2xx status otherwise. The GCP health check would then target this HTTP endpoint.

Custom Health Check Example (Conceptual Go Proxy)

// main.go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

var mongoClient *mongo.Client

// directConnection=true pins the client to the local mongod. Without it the
// driver discovers the whole replica set and can route the command to the
// current primary, so every node would report itself as primary.
var mongoURI = "mongodb://localhost:27017/?directConnection=true"

func init() {
	clientOptions := options.Client().ApplyURI(mongoURI)
	var err error
	mongoClient, err = mongo.Connect(context.TODO(), clientOptions)
	if err != nil {
		log.Fatal(err)
	}
	// Ping the local node. A failure is only logged so the proxy keeps
	// running and /healthz can report the node as unhealthy.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err = mongoClient.Ping(ctx, nil); err != nil {
		log.Printf("MongoDB not reachable at startup: %v", err)
		return
	}
	log.Println("Connected to MongoDB!")
}

func isPrimaryHandler(w http.ResponseWriter, r *http.Request) {
	// Stay below the 5s health check timeout so GCP always gets an answer.
	ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
	defer cancel()

	// hello returns isWritablePrimary on MongoDB 5.0+ (and late 4.x patch
	// releases); older servers need isMaster and its "ismaster" field.
	var result struct {
		IsPrimary bool `bson:"isWritablePrimary"`
	}
	err := mongoClient.Database("admin").RunCommand(ctx, bson.D{{Key: "hello", Value: 1}}).Decode(&result)
	if err != nil {
		log.Printf("Error checking primary status: %v", err)
		http.Error(w, "Internal Server Error", http.StatusInternalServerError)
		return
	}

	if result.IsPrimary {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "I am the primary")
	} else {
		w.WriteHeader(http.StatusServiceUnavailable) // Or another non-2xx code
		fmt.Fprintln(w, "I am not the primary")
	}
}

func main() {
	http.HandleFunc("/healthz", isPrimaryHandler)
	log.Println("Starting health check server on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}

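Deploy this proxy on each MongoDB instance, listening on a separate port (e.g., 8080), and run it under a supervisor such as systemd so it restarts if it exits. Then configure your GCP health check to target this HTTP endpoint. Your firewall rules must allow Google’s health check ranges (35.191.0.0/16 and 130.211.0.0/22) to reach that port.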

# Create a custom HTTP health check
gcloud compute health-checks create http mongo-primary-health-check \
    --port 8080 \
    --request-path "/healthz" \
    --timeout 5s \
    --check-interval 5s \
    --unhealthy-threshold 2 \
    --healthy-threshold 2 \
    --region [your-region] # Use the same scope as the backend service

# Update your backend service to use this new health check
gcloud compute backend-services update mongo-backend-service \
    --health-checks mongo-primary-health-check \
    --health-checks-region [your-region] \
    --region [your-region]

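With this setup, the GCP Load Balancer directs traffic only to the MongoDB instance that reports itself as primary. When the primary fails, its health check fails too, and after two consecutive failed probes (about 10 seconds with the settings above) the LB stops sending new connections to it. The remaining secondaries elect a new primary, and once it passes two consecutive probes the LB starts routing traffic to it. Between those two moments no backend is healthy; a passthrough load balancer may then spread connections across all backends as a last resort, so writes can fail until the new primary is marked healthy and clients should be prepared to retry.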
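The threshold behaviour described above can be sketched as a small state machine. This is an illustration of the general technique, not GCP’s implementation: the class name and the single-prober model are assumptions, and real load balancers run several probers in parallel.

```ruby
# Tracks one backend's health the way a threshold-based checker does:
# the state flips only after N consecutive probes that disagree with it.
class BackendHealth
  attr_reader :healthy

  def initialize(healthy_threshold: 2, unhealthy_threshold: 2)
    @healthy_threshold = healthy_threshold
    @unhealthy_threshold = unhealthy_threshold
    @healthy = false # backends start out unhealthy until proven otherwise
    @streak = 0      # consecutive probes contradicting the current state
  end

  # Record one probe result (true means /healthz answered 200).
  def record(probe_ok)
    if probe_ok == @healthy
      @streak = 0
    else
      @streak += 1
      threshold = probe_ok ? @healthy_threshold : @unhealthy_threshold
      if @streak >= threshold
        @healthy = probe_ok
        @streak = 0
      end
    end
    @healthy
  end
end

# Approximate seconds of consecutive probes needed to flip the state.
def detection_seconds(check_interval:, threshold:)
  check_interval * threshold
end

backend = BackendHealth.new
states = [true, true, false, true, false, false].map { |ok| backend.record(ok) }
puts states.inspect
puts detection_seconds(check_interval: 5, threshold: 2)
```

A single failed probe in the middle of the sequence does not flip the state, which is the point of the thresholds: they trade a few seconds of detection time for protection against flapping.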

Ruby Application Connection String

Your Ruby application (e.g., Rails) should connect to the IP address or DNS name of the GCP Load Balancer’s forwarding rule, not directly to any MongoDB instance. One caveat matters here: when the replicaSet option is set, the MongoDB Ruby driver discovers the members from the replica set configuration and connects to the hostnames stored there, which bypasses the load balancer entirely. To keep traffic on the load balancer, connect in direct mode and leave replicaSet out of the URI.

# config/mongoid.yml (for Mongoid ODM)
production:
  clients:
    default:
      uri: "mongodb://[RESERVED_IP_ADDRESS]:27017/my_database?directConnection=true"
      options:
        # These are optional but good practice; Ruby driver timeouts are given in seconds
        connect_timeout: 5
        server_selection_timeout: 10
      # Alternative without the LB: list the members as seeds and let the driver track the primary itself
      # uri: "mongodb://mongo-instance-1.gcp.example.com:27017,mongo-instance-2.gcp.example.com:27017,mongo-instance-3.gcp.example.com:27017/my_database?replicaSet=myReplicaSet"
  options:
    raise_not_found_error: false
    # ... other options

The key here is the directConnection=true parameter. The driver treats the load balancer address as a single server and relies on the health check to guarantee that this server is the current primary. If your application instances can reach the MongoDB nodes directly, the replica set URI shown in the comment is the more conventional choice, because the driver then follows elections on its own without waiting for the load balancer.
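Even with automatic failover there is a short window, while the election runs and the health checks converge, in which database operations fail. A common way to ride it out is to retry with exponential backoff. The sketch below uses a stand-in error class so that it runs without the driver; TransientDbError and with_failover_retry are hypothetical names, and in a real application you would rescue the driver’s own connection and server selection errors instead.

```ruby
# Stand-in for the driver's transient errors (hypothetical; rescue the
# real driver's error classes in application code).
class TransientDbError < StandardError; end

# Runs the block, retrying on transient errors with exponential backoff.
# `sleeper` is injectable so the delays can be observed or skipped in tests.
def with_failover_retry(attempts: 5, base_delay: 0.5, sleeper: Kernel.method(:sleep))
  tries = 0
  begin
    tries += 1
    yield tries
  rescue TransientDbError
    raise if tries >= attempts # give up and surface the last error
    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end

# Simulated operation that fails twice while no primary is available.
delays = []
result = with_failover_retry(sleeper: ->(seconds) { delays << seconds }) do |attempt|
  raise TransientDbError, "no primary available" if attempt < 3
  "write ok on attempt #{attempt}"
end
puts result
puts delays.inspect
```

Only wrap operations that are safe to repeat, such as reads or idempotent writes; blindly retrying a non-idempotent insert can apply it twice if the first attempt succeeded but its acknowledgement was lost.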

Ruby Application Auto-Failover with GCP Managed Instance Groups and Load Balancing

For the application tier, we’ll employ a similar strategy: GCP Managed Instance Groups (MIGs) combined with a GCP Load Balancer. This provides automatic scaling, self-healing, and seamless failover.

Setting up Managed Instance Groups (MIGs)

A MIG allows you to run a set of identical VM instances that are managed as a single entity. GCP can automatically create, delete, and manage these instances based on your configuration.

# Create an instance template for your Ruby application VMs
gcloud compute instance-templates create ruby-app-template \
    --machine-type e2-medium \
    --image-family debian-11 \
    --image-project debian-cloud \
    --metadata startup-script='#! /bin/bash
        # Install Ruby, Bundler, and your application dependencies
        # Clone your application code from Git
        # Configure database connection strings (using environment variables or secrets manager)
        # Start your application server (e.g., Puma, Unicorn)
        # Example:
        # apt-get update -y
        # apt-get install -y ruby-full build-essential git
        # gem install bundler
        # git clone [YOUR_APP_REPO] /opt/my_app
        # cd /opt/my_app
        # bundle install
        # bundle exec puma -C config/puma.rb
    ' \
    --tags http-server,https-server \
    --scopes "cloud-platform" \
    --network [YOUR_VPC_NETWORK] \
    --subnet [YOUR_SUBNET] \
    --region [your-region] # Region of the subnet; instance templates do not take a zone

# Create a Managed Instance Group from the template
gcloud compute instance-groups managed create ruby-app-mig \
    --template ruby-app-template \
    --size 3 \
    --zone [ZONE] # Or --region [your-region]

The startup-script is crucial. It should automate the deployment of your Ruby application, including installing dependencies, fetching code, configuring environment variables (especially for database connection strings pointing to the MongoDB LB), and starting your application server. Using GCP Secret Manager for sensitive credentials is highly recommended.
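One way to keep the connection string out of the image is to have the startup script export it and let the application fail fast when it is missing or malformed. The sketch below assumes an environment variable named MONGODB_URI, which is a naming choice for this example and not something GCP or Mongoid mandates. It validates single-host URIs only, which matches the load balancer setup; Ruby’s URI parser does not accept multi-host seed lists.

```ruby
require "uri"

# Reads the MongoDB URI from the environment and validates its basic shape.
# Raises at boot time instead of failing later on the first query.
def mongodb_uri_from_env(env = ENV)
  raw = env.fetch("MONGODB_URI") { raise KeyError, "MONGODB_URI is not set" }
  uri = URI.parse(raw)
  unless %w[mongodb mongodb+srv].include?(uri.scheme) && uri.host
    raise ArgumentError, "MONGODB_URI must be a mongodb:// URI with a host"
  end
  raw
end

# Example with an explicit hash standing in for the process environment.
example_env = { "MONGODB_URI" => "mongodb://10.0.0.5:27017/my_database" }
puts mongodb_uri_from_env(example_env)
```

In config/mongoid.yml the value can then be referenced through ERB, for example uri: <%= ENV["MONGODB_URI"] %>, so the same template works in every environment.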

Application Load Balancer for Ruby Apps

For the application tier, we’ll use a GCP HTTP(S) Load Balancer. This is ideal for web traffic and integrates well with MIGs.

# Create a health check for your Ruby application
gcloud compute health-checks create http ruby-app-health-check \
    --port 80 \
    --request-path "/health" \
    --timeout 5s \
    --check-interval 5s \
    --unhealthy-threshold 2 \
    --healthy-threshold 2 \
    --global

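# Map the named port "http" to port 80 on the MIG so that --port-name http resolves
gcloud compute instance-groups set-named-ports ruby-app-mig \
    --named-ports http:80 \
    --zone [ZONE]

# Create a backend service for your Ruby application MIG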
gcloud compute backend-services create ruby-app-backend-service \
    --protocol HTTP \
    --health-checks ruby-app-health-check \
    --port-name http \
    --timeout 30s \
    --enable-cdn \
    --global

# Add the MIG as a backend to the backend service
gcloud compute backend-services add-backend ruby-app-backend-service \
    --instance-group ruby-app-mig \
    --instance-group-zone [ZONE] \
    --global

# Create a URL map
gcloud compute url-maps create ruby-app-url-map \
    --default-service ruby-app-backend-service \
    --global

# Create a target HTTP proxy
gcloud compute target-http-proxies create ruby-app-http-proxy \
    --url-map ruby-app-url-map \
    --global

# Create a global forwarding rule (for HTTP traffic)
gcloud compute forwarding-rules create ruby-app-forwarding-rule-http \
    --load-balancing-scheme EXTERNAL \
    --address [RESERVED_IP_ADDRESS_FOR_APP] \
    --ip-protocol TCP \
    --ports 80 \
    --target-http-proxy ruby-app-http-proxy \
    --global

# For HTTPS, you would create an SSL certificate and a target HTTPS proxy
# gcloud compute ssl-certificates create my-ssl-cert --domains example.com --global
# gcloud compute target-https-proxies create ruby-app-https-proxy --ssl-certificates my-ssl-cert --url-map ruby-app-url-map --global
# gcloud compute forwarding-rules create ruby-app-forwarding-rule-https --load-balancing-scheme EXTERNAL --address [RESERVED_IP_ADDRESS_FOR_APP] --ip-protocol TCP --ports 443 --target-https-proxy ruby-app-https-proxy --global

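The /health endpoint on your Ruby application should return a 200 OK status if the application is healthy and can connect to its dependencies (like MongoDB). Note that a load balancer health check only controls traffic routing. For the MIG to recreate unhealthy instances, you must also attach a health check to the group as an autohealing policy, for example with gcloud compute instance-groups managed update ruby-app-mig --health-check ruby-app-health-check --initial-delay 300 --zone [ZONE]. Set the initial delay longer than the startup script takes, so that instances are not recreated while they are still booting.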
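A health endpoint of this kind can be a very small Rack-compatible object. The sketch below is one possible shape, not a prescribed implementation: HealthApp and the checks hash are names invented for this example, and the dependency checks are injected as callables so the class runs without Rails or the MongoDB driver.

```ruby
require "json"

# Minimal Rack-compatible health app. `checks` maps a dependency name to a
# callable that returns truthy when healthy; a raised error counts as failure.
class HealthApp
  def initialize(checks)
    @checks = checks
  end

  def call(_env)
    results = @checks.transform_values do |check|
      begin
        check.call ? "ok" : "failed"
      rescue StandardError => e
        "error: #{e.class}"
      end
    end
    healthy = results.values.all? { |value| value == "ok" }
    status = healthy ? 200 : 503
    [status, { "content-type" => "application/json" }, [JSON.generate(results)]]
  end
end

# In a Mongoid application the check could be something like
#   -> { Mongoid.default_client.database.command(ping: 1) }
# Here a stub stands in so the example runs on its own.
app = HealthApp.new("mongodb" => -> { true })
status, _headers, body = app.call({})
puts status
puts body.first
```

It can be mounted in config.ru or in the Rails routes at /health. One design consideration: if this endpoint depends on MongoDB and is also used for autohealing, a database failover marks every application instance unhealthy at once, so you may prefer a dependency-free check for autohealing and the dependency-aware one for load balancing.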

Automated Failover in Action

When an instance in the Ruby app MIG becomes unhealthy (e.g., due to an application crash, dependency failure, or underlying VM issue), the health check will fail. The load balancer stops sending traffic to the instance once the unhealthy threshold is reached, which is about 10 seconds with the settings above. If an autohealing policy is configured on the MIG, GCP then recreates the instance from the instance template. Once the new instance boots up, installs dependencies, starts the application, and passes its health check, the load balancer will begin routing traffic to it. This entire process is automated and typically takes only a few minutes, most of it spent in the startup script.

Monitoring and Alerting

While automation handles failover, robust monitoring and alerting are essential to detect issues that automation might miss or to be notified when failovers occur. Use GCP’s Cloud Monitoring to:

Monitor MongoDB replica set status (e.g., primary count, replication lag).
Track health check status for both MongoDB and application backends.
Set up alerts for high error rates on the load balancers.
Monitor CPU, memory, and disk usage on MongoDB instances.
Alert on instance group resizing events (indicating potential failures or scaling).

Configure alerts to notify your operations team via email, Slack, or PagerDuty when critical thresholds are breached or when failover events are triggered.
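Replication lag, one of the metrics listed above, can be derived from the members array that the replSetGetStatus command returns: each member reports a stateStr and an optimeDate. The sketch below works on plain hashes shaped like that output, so it runs without a database; the helper name, the sample hosts, and the 10-second threshold are assumptions made for illustration.

```ruby
# Computes per-secondary replication lag in seconds from data shaped like
# the `members` array of replSetGetStatus (name, stateStr, optimeDate).
# Returns nil when there is no primary, which is itself worth alerting on.
def replication_lag(members)
  primary = members.find { |m| m[:stateStr] == "PRIMARY" }
  return nil unless primary

  members
    .select { |m| m[:stateStr] == "SECONDARY" }
    .to_h { |m| [m[:name], primary[:optimeDate] - m[:optimeDate]] }
end

# Sample data standing in for a real replSetGetStatus response.
members = [
  { name: "mongo-instance-1.gcp.example.com:27017", stateStr: "PRIMARY",
    optimeDate: Time.utc(2024, 1, 1, 0, 0, 30) },
  { name: "mongo-instance-2.gcp.example.com:27017", stateStr: "SECONDARY",
    optimeDate: Time.utc(2024, 1, 1, 0, 0, 28) },
  { name: "mongo-instance-3.gcp.example.com:27017", stateStr: "SECONDARY",
    optimeDate: Time.utc(2024, 1, 1, 0, 0, 5) }
]

lag = replication_lag(members)
lagging = lag.select { |_host, seconds| seconds > 10 }
puts lagging.keys.inspect
```

A script along these lines could publish the values as custom metrics so that Cloud Monitoring alert policies can act on them.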

Conclusion

By combining MongoDB’s native replication with GCP’s robust networking services (Load Balancing, Health Checks) and compute services (Managed Instance Groups), you can architect a highly available and resilient deployment. The key is to delegate failure detection and recovery to these managed services, ensuring your Ruby application and its MongoDB backend can automatically failover with minimal downtime.