Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and C Deployments on Google Cloud

Designing for Resiliency: MongoDB Auto-Failover on Google Cloud

Achieving true high availability for stateful services like MongoDB necessitates robust disaster recovery (DR) strategies, with automated failover being the cornerstone of minimizing downtime. This post details the architectural considerations and practical implementation for setting up automatic failover for a MongoDB replica set deployed on Google Cloud Platform (GCP), leveraging GCP’s native capabilities and custom automation.

MongoDB Replica Set Fundamentals for Failover

A MongoDB replica set is the foundational element for high availability. It comprises multiple data-bearing nodes, one of which is the primary, handling all write operations. The other nodes are secondaries, replicating the primary’s data. In the event of a primary node failure, the replica set automatically elects a new primary from the available secondaries. Key configurations for effective failover include:

Sufficient Node Count: A minimum of three voting members is recommended to avoid split-brain scenarios and ensure quorum. For production, five or seven voting members tolerate more simultaneous failures (seven is the maximum number of voting members MongoDB allows).
Arbiter Nodes: While they don’t store data, arbiters participate in elections. They are useful for increasing the voting count without the overhead of a full data-bearing node, but should be used judiciously.
Network Latency: Nodes should be deployed across different availability zones (AZs) within a region to tolerate AZ failures while keeping replication latency low. Cross-region deployments protect against the loss of an entire region, at the cost of higher latency and operational complexity.
Read Preference: Applications must be configured to respect the replica set’s topology and direct reads to secondaries when appropriate, reducing load on the primary and improving read availability during failovers.
Write Concern: Setting appropriate write concerns (e.g., w: "majority") ensures data durability and consistency across nodes before acknowledging a write operation.
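
The node-count guidance above follows from simple majority arithmetic. The sketch below is illustrative only; the function names `majority` and `fault_tolerance` are made up for this example and are not part of any MongoDB tooling.

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: a strict majority of voting members."""
    if voting_members < 1:
        raise ValueError("a replica set needs at least one voting member")
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """How many voting members can be lost while an election can still succeed."""
    return voting_members - majority(voting_members)


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 7):
        print(f"{n} voting members: majority={majority(n)}, "
              f"tolerates {fault_tolerance(n)} failure(s)")
```

Note that an even member count buys nothing: four voting members tolerate one failure, exactly like three, which is why odd counts are the usual choice.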

GCP Infrastructure for MongoDB Resilience

Google Cloud offers several services that are instrumental in building a resilient MongoDB deployment. We’ll focus on a multi-AZ deployment within a single GCP region for automated failover against instance or AZ failures.

Compute Engine Instances

Each MongoDB node will run on a Compute Engine (GCE) virtual machine. For optimal resilience, these VMs should be distributed across different GCP zones within the same region. For example, in the us-central1 region, you might deploy nodes in us-central1-a, us-central1-b, and us-central1-c.
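
Whether a given placement actually survives a zone outage can be checked before anything is deployed. The following is a minimal sketch that assumes every member is a voting member; the helper name `survives_any_single_zone_loss` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def survives_any_single_zone_loss(member_zones: Sequence[str]) -> bool:
    """True if a majority of members remains after losing any one zone.

    member_zones holds one entry per voting member, naming the zone it runs in.
    """
    if not member_zones:
        raise ValueError("at least one member is required")
    total = len(member_zones)
    needed = total // 2 + 1  # strict majority required for an election
    per_zone = Counter(member_zones)
    # Losing a zone removes every member placed in it.
    return all(total - count >= needed for count in per_zone.values())


if __name__ == "__main__":
    spread = ["us-central1-a", "us-central1-b", "us-central1-c"]
    skewed = ["us-central1-a", "us-central1-a", "us-central1-b"]
    print(survives_any_single_zone_loss(spread))
    print(survives_any_single_zone_loss(skewed))
```

The three-zone layout keeps two of three members after any zone loss. The skewed layout does not, because losing us-central1-a leaves a single member that cannot form a majority.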

Persistent Disks

MongoDB data should reside on Persistent Disks. For performance-critical workloads, consider SSD Persistent Disks. These disks are zone-specific, meaning a disk attached to a VM in us-central1-a remains in that zone. If a VM fails, its attached disk can be reattached to another VM in the same zone. For automatic failover, this implies that the replacement VM must be in the same zone as the failed VM’s disk, or a mechanism to migrate the disk must be in place (which is complex and generally avoided for rapid failover).

Networking

A Virtual Private Cloud (VPC) network with appropriate firewall rules is essential. MongoDB typically uses port 27017. Firewall rules must allow inter-node communication within the replica set and access from application servers. Using GCP’s internal IP addresses for inter-node communication is recommended for security and performance.

Automating Failover: The Core Challenge

MongoDB’s built-in replica set election mechanism handles failover when a primary becomes unreachable. However, this election process relies on the remaining nodes being able to communicate and reach a majority. The challenge lies in detecting failures promptly and ensuring that applications can seamlessly reconnect to the new primary. For automated failover that goes beyond MongoDB’s native capabilities (e.g., handling network partitions gracefully, or orchestrating VM replacement), custom tooling is required.

Health Checks and Monitoring

Robust monitoring is the first step towards automation. We need to monitor:

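MongoDB Node Status: Using rs.status() and checking the stateStr and health fields for each member. Typical stateStr values are PRIMARY, SECONDARY, ARBITER, STARTUP2, and RECOVERING; an unreachable member reports health 0 and a stateStr of "(not reachable/healthy)".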
GCE Instance Health: GCP’s built-in health checks and monitoring metrics (CPU, network, disk I/O).
Network Connectivity: Ping tests or TCP checks to MongoDB ports from application servers and other database nodes.

Tools like Prometheus with the MongoDB Exporter, or GCP’s Cloud Monitoring, can collect these metrics. Alerting should be configured for critical thresholds.
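
As a rough illustration of the node status check, the sketch below condenses a replSetGetStatus-style document into the facts an alerting rule usually needs. It works on a plain dictionary so it can be tried without a database; `summarize_replica_set` is a hypothetical helper, and it assumes every member votes, which a real check would confirm against the replica set configuration.

```python
from typing import Any, Dict, List, Optional


def summarize_replica_set(status: Dict[str, Any]) -> Dict[str, Any]:
    """Condense a replSetGetStatus-style document into a small health summary."""
    members: List[Dict[str, Any]] = status.get("members", [])
    primary: Optional[str] = None
    unhealthy: List[str] = []
    for member in members:
        if member.get("health", 0) != 1:
            # health is 1 for reachable members and 0 otherwise
            unhealthy.append(member["name"])
        elif member.get("stateStr") == "PRIMARY":
            primary = member["name"]
    healthy_count = len(members) - len(unhealthy)
    return {
        "primary": primary,
        "unhealthy": unhealthy,
        "healthy_count": healthy_count,
        # Assumption: all members vote, so majority is computed over all of them.
        "has_majority": healthy_count >= len(members) // 2 + 1,
    }


if __name__ == "__main__":
    sample = {"members": [
        {"name": "host1:27017", "health": 0, "stateStr": "(not reachable/healthy)"},
        {"name": "host2:27017", "health": 1, "stateStr": "PRIMARY"},
        {"name": "host3:27017", "health": 1, "stateStr": "SECONDARY"},
    ]}
    print(summarize_replica_set(sample))
```

With PyMongo, the input would come from client.admin.command('replSetGetStatus'), the same call used in the Cloud Function example in this post.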

Custom Failover Orchestration with Cloud Functions and Pub/Sub

For automated failover that reacts to GCE instance failures or network issues that MongoDB’s native election might not gracefully handle (e.g., a node becomes unresponsive due to underlying infrastructure issues), we can build a custom orchestration layer. This typically involves:

1. Triggering Failover Detection

A common pattern is to use GCP’s Cloud Monitoring to detect unhealthy GCE instances. When an instance is marked as unhealthy, it can trigger a Pub/Sub notification. Alternatively, a scheduled Cloud Function can periodically check the health of MongoDB nodes and GCE instances.

2. Pub/Sub and Cloud Functions for Event Handling

We can set up a Pub/Sub topic to receive health check failure events. A Cloud Function can be subscribed to this topic. This function will be responsible for initiating the failover process.

3. Failover Logic in Cloud Function

The Cloud Function, upon receiving a failure notification, needs to perform several actions:

Identify the Failed Node: Determine which GCE instance is unhealthy.
Verify MongoDB Status: Connect to the replica set and confirm the state of the nodes. If the failed node was the primary, check if a new primary has been elected.
Initiate Node Replacement (if necessary): If the failed node was critical (e.g., the primary and no new primary elected, or a secondary is down and the replica set cannot achieve quorum), the function might need to orchestrate the replacement of the failed GCE instance. This could involve deleting the unhealthy instance and launching a new one from a pre-defined instance template, ensuring it’s configured to join the replica set.
Update Application Connectivity: This is often the most complex part. Applications need to be informed of the new primary. This can be achieved by updating a DNS record, a load balancer configuration, or by having applications poll a configuration service.
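
The decision step in the list above can be kept as a small pure function, which makes it easy to unit test separately from any GCP or MongoDB calls. The policy below is only one possible reading of those steps, and the `Action` names and `decide_action` signature are invented for this sketch.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    REPLACE_INSTANCE = "replace_instance"    # election is done, restore redundancy
    WAIT_AND_RECHECK = "wait_and_recheck"    # an election may still be in progress
    REPLACE_AND_ALERT = "replace_and_alert"  # quorum is lost, humans should know


def decide_action(failed_member: str,
                  primary: Optional[str],
                  healthy_members: int,
                  total_members: int) -> Action:
    """Pick a follow-up action after an instance health alert."""
    has_majority = healthy_members >= total_members // 2 + 1
    if not has_majority:
        # No election can succeed until enough members come back.
        return Action.REPLACE_AND_ALERT
    if primary is None or primary == failed_member:
        # A majority exists but no new primary is visible yet.
        return Action.WAIT_AND_RECHECK
    return Action.REPLACE_INSTANCE


if __name__ == "__main__":
    print(decide_action("host1:27017", "host2:27017", 2, 3))
    print(decide_action("host1:27017", None, 2, 3))
    print(decide_action("host1:27017", None, 1, 3))
```

Keeping the policy free of side effects means the Cloud Function only has to gather the inputs and act on the returned value.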

Example Cloud Function (Python)

This is a simplified Python example for a Cloud Function. It assumes you have a mechanism to identify the unhealthy instance and a way to interact with the MongoDB replica set.

import base64  # needed to decode the Pub/Sub message payload
import json
import os

import pymongo

# --- Configuration ---
MONGO_URI = os.environ.get("MONGO_URI", "mongodb://user:password@host1:27017,host2:27017/?replicaSet=myReplicaSet")
GCP_PROJECT = os.environ.get("GCP_PROJECT")
GCP_ZONE = os.environ.get("GCP_ZONE") # Zone of the instance that triggered the alert
UNHEALTHY_INSTANCE_NAME = os.environ.get("UNHEALTHY_INSTANCE_NAME") # Name of the unhealthy instance

def handle_mongodb_failover(event, context):
    """
    Cloud Function to handle MongoDB failover events.
    Triggered by Pub/Sub messages from Cloud Monitoring alerts.
    """
    print(f"Received event: {event}")
    print(f"Context: {context}")

    try:
        # Parse Pub/Sub message data
        pubsub_message = json.loads(base64.b64decode(event['data']).decode('utf-8'))
        alert_details = pubsub_message.get('alert', {})
        resource_name = alert_details.get('resource', {}).get('labels', {}).get('instance_name')

        if not resource_name or resource_name != UNHEALTHY_INSTANCE_NAME:
            print(f"Alert not for the expected instance or missing instance name. Expected: {UNHEALTHY_INSTANCE_NAME}, Got: {resource_name}")
            return

        print(f"Detected unhealthy instance: {resource_name} in zone: {GCP_ZONE}")

        # --- Step 1: Verify MongoDB Replica Set Status ---
        client = pymongo.MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
        try:
            client.admin.command('ping') # Check connection
            repl_status = client.admin.command('replSetGetStatus')
            print("Successfully connected to MongoDB replica set.")

            primary_member = None
            for member in repl_status.get('members', []):
                if member['stateStr'] == 'PRIMARY':
                    primary_member = member
                    break

            if primary_member:
                print(f"Current primary: {primary_member['name']}")
                # If a primary exists, MongoDB's internal failover likely handled it.
                # We might still want to ensure the unhealthy node is replaced.
                print("MongoDB has a primary. Proceeding with potential instance replacement.")
            else:
                print("No primary found. This is a critical situation.")
                # Potentially trigger a more aggressive recovery or notification.

        except pymongo.errors.ConnectionFailure as e:
            print(f"Failed to connect to MongoDB: {e}")
            # This indicates a broader issue, potentially affecting quorum.
            return
        except Exception as e:
            print(f"An error occurred during MongoDB status check: {e}")
            return
        finally:
            client.close()

        # --- Step 2: Orchestrate Instance Replacement (Conceptual) ---
        # This part requires GCP Compute Engine API interaction.
        # In a real-world scenario, you'd use the google-cloud-compute library.
        print(f"Attempting to replace instance: {UNHEALTHY_INSTANCE_NAME} in zone: {GCP_ZONE}")

        # Example: Delete the unhealthy instance
        # compute_client = google.cloud.compute_v1.InstancesClient()
        # delete_operation = compute_client.delete(project=GCP_PROJECT, zone=GCP_ZONE, instance=UNHEALTHY_INSTANCE_NAME)
        # print(f"Delete operation initiated: {delete_operation.name}")

        # Example: Create a new instance from a template
        # instance_template_name = "your-mongodb-instance-template"
        # new_instance_config = { ... } # Define new instance configuration
        # create_operation = compute_client.insert(project=GCP_PROJECT, zone=GCP_ZONE, instance_resource=new_instance_config)
        # print(f"Create operation initiated: {create_operation.name}")

        # --- Step 3: Update Application Connectivity (Conceptual) ---
        # This is highly application-specific.
        # Options:
        # 1. Update a Cloud DNS record pointing to a load balancer or the new primary's IP.
        # 2. Update a configuration file in Cloud Storage that applications read.
        # 3. Trigger a rolling update of application deployments.
        print("Application connectivity update logic would go here.")

    except Exception as e:
        print(f"An unexpected error occurred: {e}")

# Example of how to simulate an event for local testing (requires pymongo; base64 is in the standard library)
# import base64
# mock_event_data = {
#     "alert": {
#         "resource": {
#             "type": "gce_instance",
#             "labels": {
#                 "project_id": "your-gcp-project",
#                 "instance_name": "your-mongodb-instance-1",
#                 "zone": "us-central1-a"
#             }
#         },
#         "metric": {
#             "type": "compute.googleapis.com/instance/cpu/utilization"
#         },
#         "condition": {
#             "displayName": "CPU utilization too high"
#         }
#     }
# }
# mock_event = {"data": base64.b64encode(json.dumps(mock_event_data).encode('utf-8')).decode('utf-8')}
# mock_context = {}
# handle_mongodb_failover(mock_event, mock_context)

Application Connectivity Management

Ensuring applications can find the new primary after a failover is critical. Several strategies can be employed:

DNS Updates: Maintain a DNS record (e.g., mongodb.yourdomain.com) that always points to the current primary’s IP address. A script or automation can update this DNS record post-failover. GCP’s Cloud DNS API can be used for this.
Load Balancer: Place a TCP load balancer in front of the MongoDB replica set. The load balancer can be configured to direct traffic only to the current primary. This requires the load balancer to be aware of the primary’s status, which can be achieved through health checks or by updating its backend configuration.
Application-Level Discovery: Applications can be designed to connect to the replica set using its connection string and discover the primary themselves. However, during a failover, there’s a brief period where the primary is unavailable, and applications need to handle this transient error and retry with the updated topology.
Configuration Service: A centralized configuration service (e.g., Consul, etcd, or even a simple file in Cloud Storage) can store the current primary’s address. Applications poll this service for the primary’s address.
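
For the application-level discovery option, most of the work is getting the connection string right. The sketch below assembles one from its parts using only the standard library; `build_replica_set_uri` is a hypothetical helper, and the host names and option values are placeholders. The options shown (replicaSet, w, readPreference) are standard MongoDB connection string options.

```python
from typing import Optional, Sequence
from urllib.parse import quote_plus, urlencode


def build_replica_set_uri(hosts: Sequence[str],
                          replica_set: str,
                          user: Optional[str] = None,
                          password: Optional[str] = None,
                          **options: str) -> str:
    """Build a mongodb:// URI that lists several seed hosts of one replica set."""
    if not hosts:
        raise ValueError("at least one seed host is required")
    credentials = ""
    if user is not None:
        # Credentials must be percent-encoded, or characters such as '@' break parsing.
        credentials = quote_plus(user)
        if password is not None:
            credentials += ":" + quote_plus(password)
        credentials += "@"
    params = {"replicaSet": replica_set, **options}
    return f"mongodb://{credentials}{','.join(hosts)}/?{urlencode(params)}"


if __name__ == "__main__":
    print(build_replica_set_uri(
        ["host1:27017", "host2:27017", "host3:27017"],
        "myReplicaSet",
        w="majority",
        readPreference="secondaryPreferred",
    ))
```

Listing more than one seed host matters during a failover: if the only listed host is the one that just failed, a freshly started application has nothing to discover the topology from.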

Testing Your Failover Strategy

A failover strategy is only as good as its tested execution. Regular, automated testing is paramount. This involves:

Simulated Node Failures: Gracefully shut down a primary node (db.shutdownServer()) and observe the election process and application reconnection.
Instance Termination: Terminate a GCE instance running a MongoDB node and verify that the replica set elects a new primary and that your automation replaces the instance.
Network Partition Simulation: Use firewall rules to temporarily isolate nodes and test how the replica set and applications behave.
Zone Failure Simulation: If possible, simulate an entire zone becoming unavailable (though this is difficult to do cleanly in a production environment without impacting other services).
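
To turn these tests into numbers, one option is to poll for the current primary during the test and compute how long the replica set had none. The sketch below assumes a separate (hypothetical) poller has already recorded (timestamp, primary name or None) samples; `primary_outages` only does the arithmetic.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Sample = Tuple[float, Optional[str]]  # (timestamp in seconds, primary name or None)


def primary_outages(samples: Sequence[Sample]) -> List[Dict[str, object]]:
    """Return every window in which no primary was observed.

    Samples must be ordered by time. A window still open at the last
    sample is not reported, because its end is unknown.
    """
    outages: List[Dict[str, object]] = []
    outage_start: Optional[float] = None
    last_primary: Optional[str] = None
    for timestamp, primary in samples:
        if primary is None:
            if outage_start is None:
                outage_start = timestamp
            continue
        if outage_start is not None:
            outages.append({
                "start": outage_start,
                "end": timestamp,
                "duration": timestamp - outage_start,
                "previous_primary": last_primary,
                "new_primary": primary,
            })
            outage_start = None
        last_primary = primary
    return outages


if __name__ == "__main__":
    observed = [(0.0, "host1:27017"), (1.0, "host1:27017"), (2.0, None),
                (3.0, None), (4.0, "host2:27017"), (5.0, "host2:27017")]
    for outage in primary_outages(observed):
        print(outage)
```

The measured duration is only as precise as the polling interval, so a one-second poll can misjudge the real gap by up to a second on either side.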

Considerations for C Deployments

If your C application directly interacts with MongoDB, the failover strategy needs to be integrated into the application’s error handling and connection management. C applications typically use the MongoDB C Driver.

MongoDB C Driver Connection Handling

The MongoDB C Driver supports replica set connections. When connecting, you provide a connection string that lists multiple hosts and specifies the replica set name. The driver will discover the topology and identify the primary.

#include <mongoc/mongoc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Callback invoked by the driver whenever its view of the topology changes.
// It is defined before main() so that it is declared when it is registered.
static void topology_changed_cb(const mongoc_apm_topology_changed_t *event) {
    const mongoc_topology_description_t *topology;
    mongoc_server_description_t **servers;
    size_t num_servers = 0;
    bool primary_found = false;

    printf("Topology changed!\n");

    // Iterate through all known servers and look for the primary
    topology = mongoc_apm_topology_changed_get_new_description(event);
    servers = mongoc_topology_description_get_servers(topology, &num_servers);
    for (size_t i = 0; i < num_servers; i++) {
        const char *type = mongoc_server_description_type(servers[i]);
        const char *address = mongoc_server_description_host(servers[i])->host_and_port;

        printf("  Server: %s, Type: %s\n", address, type);
        if (strcmp(type, "RSPrimary") == 0) {
            primary_found = true;
            // host_and_port already contains the port, so print it as one string
            printf("Primary detected: %s\n", address);
            // Log the change or export a metric here, e.g. signal a monitoring thread.
        }
    }
    if (!primary_found) {
        printf("No primary detected (possibly during failover or network issue).\n");
    }

    // Free the array returned by mongoc_topology_description_get_servers()
    mongoc_server_descriptions_destroy_all(servers, num_servers);
}

int main(void) {
    mongoc_client_t *client;
    mongoc_uri_t *uri;
    mongoc_apm_callbacks_t *callbacks;
    bson_t *command;
    bson_t reply;
    bson_error_t error;

    // Connection string for a replica set
    // Ensure this includes all members and the replicaSet name
    const char *uri_string = "mongodb://user:password@host1:27017,host2:27017,host3:27017/?replicaSet=myReplicaSet&appName=myCApp";

    // Initialize libmongoc before calling any other driver function
    mongoc_init();

    uri = mongoc_uri_new_with_error(uri_string, &error);
    if (!uri) {
        fprintf(stderr, "Failed to parse URI: %s\nError: %s\n", uri_string, error.message);
        mongoc_cleanup();
        return EXIT_FAILURE;
    }

    // Create a new client instance
    client = mongoc_client_new_from_uri(uri);
    if (!client) {
        fprintf(stderr, "Failed to create client.\n");
        mongoc_uri_destroy(uri);
        mongoc_cleanup();
        return EXIT_FAILURE;
    }

    // Register the topology change callback through the driver's APM interface
    callbacks = mongoc_apm_callbacks_new();
    mongoc_apm_set_topology_changed_cb(callbacks, topology_changed_cb);
    mongoc_client_set_apm_callbacks(client, callbacks, NULL);
    mongoc_apm_callbacks_destroy(callbacks);

    // Perform a simple operation to ensure connection and topology discovery
    command = BCON_NEW("ping", BCON_INT32(1));
    if (!mongoc_client_command_simple(client, "admin", command, NULL, &reply, &error)) {
        fprintf(stderr, "Failed to ping server: %s\n", error.message);
        // Handle connection errors - this might indicate a failover is in progress or has occurred
        // The topology_changed_cb should be invoked if the topology changes.
    } else {
        printf("Successfully connected and pinged the server.\n");
    }

    // reply is always initialized by the call above and must be destroyed
    bson_destroy(&reply);
    bson_destroy(command);

    // In a real application, you would keep the client alive and handle operations.
    // For this example, we clean up.
    mongoc_client_destroy(client);
    mongoc_uri_destroy(uri);
    mongoc_cleanup();

    return EXIT_SUCCESS;
}

The C driver automatically handles reconnecting and discovering the new primary when the topology changes, so operations issued through the same client are routed to the new primary without the application tracking addresses itself. A topology-changed callback, registered through the driver's APM (Application Performance Monitoring) callbacks, is invoked whenever the driver detects a change in the replica set's topology, such as a new primary being elected. Inside this callback, your application can log the event, export metrics, or signal other components that a failover has taken place.

Error Handling in C

Your C application must be prepared to handle transient errors during operations. When a failover occurs, operations targeting the old primary will fail. The driver will eventually detect the topology change and update its internal view. Your application should implement retry logic for operations that fail with specific error codes indicating network issues or unavailability of the primary.

// Example of retry logic within a C application
// This is a simplified illustration. Real-world retry logic
// would add jitter and finer-grained error code checks.

#include <mongoc/mongoc.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h> // sleep()

#define MAX_RETRIES 5

// Treat server selection failures and socket-level errors as transient
static bool is_transient_error(const bson_error_t *error) {
    return error->domain == MONGOC_ERROR_SERVER_SELECTION ||
           error->domain == MONGOC_ERROR_STREAM;
}

// collection is obtained from mongoc_client_get_collection() on a live client
bool insert_with_retry(mongoc_collection_t *collection) {
    bson_t *document = BCON_NEW("name", BCON_UTF8("test"));
    bson_error_t error;
    bool success = false;
    int retries = 0;

    // Note: a retry after a lost acknowledgement can insert the document twice
    // unless the document carries a fixed _id.
    while (!success && retries < MAX_RETRIES) {
        if (mongoc_collection_insert_one(collection, document, NULL, NULL, &error)) {
            success = true;
            printf("Document inserted successfully.\n");
        } else {
            retries++;
            fprintf(stderr, "Failed to insert document (Attempt %d/%d): %s\n", retries, MAX_RETRIES, error.message);

            if (!is_transient_error(&error)) {
                break; // Not a failover symptom, so retrying will not help
            }
            if (retries < MAX_RETRIES) {
                // Exponential backoff (simplified): 2, 4, 8, 16 seconds
                sleep(1u << retries);
            }
        }
    }

    if (!success) {
        fprintf(stderr, "Failed to insert document after %d attempt(s).\n", retries);
    }

    bson_destroy(document);
    return success;
}
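
The backoff-with-jitter idea mentioned in the comments is language-agnostic, so it is sketched here in Python for brevity. This is a minimal illustration of "full jitter" (a random delay between zero and an exponentially growing, capped ceiling); the names `backoff_delays` and `call_with_retry` and the default values are assumptions of this sketch, not driver APIs.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """One random delay per retry, each drawn from [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]


def call_with_retry(operation: Callable[[], T],
                    is_transient: Callable[[Exception], bool],
                    max_retries: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying up to max_retries times on transient errors."""
    for delay in backoff_delays(max_retries):
        try:
            return operation()
        except Exception as exc:
            if not is_transient(exc):
                raise  # permanent errors are not worth retrying
            sleep(delay)
    return operation()  # final attempt; any exception propagates to the caller


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_write() -> str:
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("primary unavailable")
        return "written"

    result = call_with_retry(flaky_write,
                             lambda exc: isinstance(exc, ConnectionError),
                             sleep=lambda seconds: None)
    print(result, attempts["count"])
```

Randomizing the delay keeps many clients from retrying in lockstep right after an election, which would otherwise hit the new primary with a burst of simultaneous requests.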

Conclusion

Architecting auto-failover for MongoDB on GCP involves a multi-layered approach. It starts with a well-configured MongoDB replica set distributed across GCP's resilient infrastructure. This is augmented by robust monitoring and custom automation, such as Cloud Functions triggered by Pub/Sub, to handle infrastructure-level failures. For C applications, careful integration with the MongoDB C Driver's topology change callbacks and resilient error handling is crucial. Continuous testing of these failover mechanisms is non-negotiable to ensure that your system can indeed recover automatically when disaster strikes.