Disaster Recovery 101: Architecting Auto-Failovers for MySQL and C++ Deployments on Google Cloud

Designing for High Availability: MySQL Replication and C++ Service Failover on GCP

This document outlines a disaster recovery strategy built around automated failover for a typical web service architecture comprising a MySQL database and a C++ application tier, deployed on Google Cloud Platform (GCP). The objective is to minimize downtime by combining Cloud SQL's synchronous in-region standby with asynchronous cross-region replication for MySQL, and by using health checks and automated replacement of unhealthy instances for the C++ services.

MySQL High Availability with Cloud SQL and Replication

For mission-critical MySQL deployments on GCP, leveraging Cloud SQL with its built-in High Availability (HA) configuration is the foundational step. Cloud SQL HA provisions a primary instance and a synchronous standby instance in a different zone within the same region. In the event of a primary instance failure, Cloud SQL automatically promotes the standby instance, minimizing downtime. However, this protects only against zonal failures. For disaster recovery across regions, and for applications that need finer control over failover, you also need a cross-region read replica and an external failover mechanism.

Cross-Region Replication Strategy

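We will implement asynchronous cross-region replication from a primary Cloud SQL instance to a read replica in a different GCP region. This replica will serve as the candidate for failover. Because the replication is asynchronous, transactions committed on the primary but not yet applied on the replica are lost on failover, so the achievable recovery point objective (RPO) is bounded by the replication lag. The application will be configured to connect to a floating IP address or a load balancer that can be dynamically re-pointed to the promoted replica.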
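Since the replica can trail the primary, it may help to record its lag regularly and compare it against the data-loss budget before relying on it as a failover target. The following is a minimal sketch, assuming the status row comes from MySQL's `SHOW REPLICA STATUS` (column `Seconds_Behind_Source` in MySQL 8.0.22 and later, `Seconds_Behind_Master` in older versions). The function names and the `max_lag_seconds` budget are hypothetical.

```python
from typing import Mapping, Optional


def replica_lag_seconds(status_row: Mapping[str, object]) -> Optional[int]:
    """Return the replication lag in seconds, or None if it is unknown."""
    # Newer servers report Seconds_Behind_Source; older ones Seconds_Behind_Master.
    for column in ("Seconds_Behind_Source", "Seconds_Behind_Master"):
        value = status_row.get(column)
        if value is not None:
            return int(value)
    return None


def lag_within_budget(status_row: Mapping[str, object], max_lag_seconds: int) -> bool:
    """True only when the lag is known and does not exceed the budget."""
    lag = replica_lag_seconds(status_row)
    return lag is not None and lag <= max_lag_seconds
```

An unknown lag is treated as "not within budget" on purpose: MySQL reports NULL for this column when replication is not running, which is typically the case once the primary is unreachable, so the last value recorded before the outage is the more useful number.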

Setting up a Cross-Region Read Replica

Assuming you have a primary Cloud SQL instance (e.g., `my-mysql-primary` in `us-central1`), you can create a read replica in another region (e.g., `us-east1`).

GCP Console/gcloud CLI Steps

Using the gcloud CLI is more scriptable for automation:

1. Create the Read Replica

This command creates a read replica in `us-east1`. Note that cross-region replicas incur network egress costs.

gcloud sql instances create my-mysql-replica \
    --project=your-gcp-project-id \
    --master-instance-name=my-mysql-primary \
    --region=us-east1 \
    --tier=db-custom-2-7680 \
    --storage-size=100GB \
    --availability-type=REGIONAL \
    --database-version=MYSQL_8_0 \
    --network=projects/your-gcp-project-id/global/networks/your-vpc-network

2. Configure Replication User

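For a read replica created with the gcloud command above, Cloud SQL sets up and manages the replication credentials itself, so this step can be skipped. A dedicated replication user is only needed when you replicate to or from a MySQL server outside Cloud SQL. In that case, create it on the primary instance with the REPLICATION SLAVE privilege.

-- Connect to your primary MySQL instance, then run:
CREATE USER 'repl_user'@'%' IDENTIFIED BY 'your_strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'%';
-- FLUSH PRIVILEGES is not required after CREATE USER / GRANT.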

3. Obtain Primary Instance’s IP and GTID Information

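You’ll need the primary instance’s private IP address and its GTID settings. You can find the IP address in the Cloud SQL console or via `gcloud sql instances describe my-mysql-primary --format='value(ipAddresses[0].ipAddress)'`. Note that index 0 is simply the first address listed, which is not necessarily the private one when the instance also has a public IP, so check the address type. GTID is enabled by default for new MySQL 8.0 instances.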
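To avoid depending on the order of the address list, the private address can be selected by its type. This is a small sketch, assuming the input is the JSON printed by `gcloud sql instances describe my-mysql-primary --format=json` and that each entry in `ipAddresses` carries a `type` field such as `PRIMARY` or `PRIVATE`, as in the Cloud SQL Admin API. The function name is hypothetical.

```python
import json
from typing import Optional


def private_ip_from_describe(describe_json: str) -> Optional[str]:
    """Return the PRIVATE address from `gcloud sql instances describe` JSON output."""
    instance = json.loads(describe_json)
    for entry in instance.get("ipAddresses", []):
        # Pick the address by type instead of by its position in the list.
        if entry.get("type") == "PRIVATE":
            return entry.get("ipAddress")
    return None
```

The function returns None when the instance has no private address, which the caller should treat as a configuration error rather than silently falling back to a public IP.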

Automating Failover with a Custom Script and Cloud Scheduler

Cloud SQL’s built-in HA is zone-level. For cross-region failover, we need an external mechanism. This involves a script that periodically checks the health of the primary instance and, if it’s unresponsive, promotes the replica and updates application connection endpoints.

Health Check Script (Python)

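This Python script uses the mysql.connector library to check connectivity by running a trivial `SELECT 1` query. In this document it runs on a Compute Engine instance in the primary region. Keep in mind that a monitor placed there goes down together with the region, so to cover a real regional outage consider running it, or a second copy, in the replica's region.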

import os
import time

import mysql.connector
import requests

# --- Configuration ---
PRIMARY_DB_HOST = os.environ.get("PRIMARY_DB_HOST", "your-primary-db-private-ip")
REPLICA_DB_HOST = os.environ.get("REPLICA_DB_HOST", "your-replica-db-private-ip")
DB_USER = os.environ.get("DB_USER", "health_check_user")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "your_health_check_password")
FAILOVER_TARGET_URL = os.environ.get("FAILOVER_TARGET_URL", "http://your-app-config-service/api/update-db-endpoint")
HEALTH_CHECK_INTERVAL_SECONDS = int(os.environ.get("HEALTH_CHECK_INTERVAL_SECONDS", 60))
FAILOVER_THRESHOLD_ATTEMPTS = int(os.environ.get("FAILOVER_THRESHOLD_ATTEMPTS", 3))

# --- Global State ---
unresponsive_attempts = 0
is_failover_active = False

def check_db_health(host, user, password):
    """Return True if the database accepts a connection and answers SELECT 1."""
    conn = None
    try:
        conn = mysql.connector.connect(
            host=host,
            user=user,
            password=password,
            database="information_schema", # Use a lightweight database
            connection_timeout=10
        )
        cursor = conn.cursor()
        # A simple query to check responsiveness
        cursor.execute("SELECT 1")
        cursor.fetchone()
        cursor.close()
        return True
    except mysql.connector.Error as err:
        print(f"Database health check failed for {host}: {err}")
        return False
    finally:
        # Always release the connection, even if the query failed midway.
        if conn is not None and conn.is_connected():
            conn.close()

def trigger_failover():
    """Initiates the failover process."""
    global is_failover_active
    print("Initiating failover process...")

    # 1. Promote the replica (This is a manual step or requires a more complex API interaction)
    # For Cloud SQL, promotion is typically done via the console or gcloud.
    # A more advanced setup might use the Cloud SQL Admin API if permissions allow.
    # Example using gcloud (requires service account with appropriate roles):
    # subprocess.run(['gcloud', 'sql', 'instances', 'promote-replica', 'my-mysql-replica', '--project=your-gcp-project-id'], check=True)
    print(f"Manually promote replica instance: {REPLICA_DB_HOST}")
    print("Waiting for replica promotion to complete...")
    # In a real scenario, you'd poll the replica's status until it's no longer a replica.

    # 2. Update application configuration
    try:
        payload = {"new_db_host": REPLICA_DB_HOST}
        response = requests.post(FAILOVER_TARGET_URL, json=payload, timeout=15)
        response.raise_for_status() # Raise an exception for bad status codes
        print(f"Successfully updated application configuration. New DB host: {REPLICA_DB_HOST}")
        is_failover_active = True
    except requests.exceptions.RequestException as e:
        print(f"Failed to update application configuration: {e}")
        # Consider rollback or alerting mechanisms here

def main():
    global unresponsive_attempts, is_failover_active

    print("Starting MySQL failover monitor...")
    while True:
        if not is_failover_active:
            if check_db_health(PRIMARY_DB_HOST, DB_USER, DB_PASSWORD):
                print(f"Primary DB ({PRIMARY_DB_HOST}) is healthy. Resetting attempts.")
                unresponsive_attempts = 0
            else:
                unresponsive_attempts += 1
                print(f"Primary DB ({PRIMARY_DB_HOST}) is unresponsive. Attempts: {unresponsive_attempts}/{FAILOVER_THRESHOLD_ATTEMPTS}")
                if unresponsive_attempts >= FAILOVER_THRESHOLD_ATTEMPTS:
                    trigger_failover()
        else:
            # If failover is active, we might want to monitor the new primary
            # or simply stop this script and rely on a separate monitoring for the promoted instance.
            print(f"Failover is active. Monitoring new primary: {REPLICA_DB_HOST}")
            # Optional: Add checks for the promoted replica
            if not check_db_health(REPLICA_DB_HOST, DB_USER, DB_PASSWORD):
                 print(f"Promoted replica ({REPLICA_DB_HOST}) is now unhealthy. Manual intervention required.")
                 # Alerting is crucial here.
            else:
                 print(f"Promoted replica ({REPLICA_DB_HOST}) is healthy.")

        time.sleep(HEALTH_CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    # Ensure necessary environment variables are set or hardcoded defaults are reasonable.
    # For production, use secrets management (e.g., GCP Secret Manager).
    main()

Automating Promotion and Configuration Update

The Python script above outlines the logic but leaves the actual promotion of the Cloud SQL replica as a manual step, performed via the GCP console or the gcloud command. For full automation, you would need to:

Grant a service account the necessary IAM roles (e.g., Cloud SQL Admin) to manage Cloud SQL instances.
Use the Cloud SQL Admin API (via client libraries or gcloud) to promote the replica. This involves calling the instances.promoteReplica method.
The FAILOVER_TARGET_URL should point to an internal API endpoint (e.g., running on GKE or Compute Engine) that updates the database connection string used by your C++ application. This could involve updating a configuration file, a database table, or a service discovery mechanism.
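The promotion step could be automated roughly as follows. This is a sketch rather than a definitive implementation: it shells out to `gcloud sql instances promote-replica`, assumes gcloud is installed and authenticated as a service account with sufficient Cloud SQL permissions, and uses hypothetical function names. The `runner` parameter exists only so the call can be replaced in tests.

```python
import subprocess
from typing import Callable, List


def build_promote_command(replica_name: str, project_id: str) -> List[str]:
    """Build the gcloud invocation that promotes a read replica."""
    return [
        "gcloud", "sql", "instances", "promote-replica", replica_name,
        f"--project={project_id}",
        "--quiet",  # suppress the interactive confirmation prompt
    ]


def promote_replica(
    replica_name: str,
    project_id: str,
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Run the promotion; raises CalledProcessError if gcloud exits non-zero."""
    runner(build_promote_command(replica_name, project_id), check=True)
```

Promotion is irreversible: the promoted instance becomes a standalone primary and stops replicating, so the caller should only reach this point after the failure threshold has been met.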

Scheduling the Health Check Script

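Use Cloud Scheduler to trigger the health check periodically. Note that the script above runs its own polling loop, so it either runs as a long-lived process (for example under systemd) or must be restructured to perform one check per invocation when driven by Cloud Scheduler. In both cases it should run on a Compute Engine instance that has network access to both the primary and replica databases.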

# Create a service account for Cloud Scheduler and Compute Engine
gcloud iam service-accounts create failover-monitor-sa \
    --display-name "Failover Monitor Service Account"

# Grant the service account necessary roles
gcloud projects add-iam-policy-binding your-gcp-project-id \
    --member "serviceAccount:failover-monitor-sa@your-gcp-project-id.iam.gserviceaccount.com" \
    --role "roles/cloudsql.admin" # For promoting replica
gcloud projects add-iam-policy-binding your-gcp-project-id \
    --member "serviceAccount:failover-monitor-sa@your-gcp-project-id.iam.gserviceaccount.com" \
    --role "roles/compute.viewer" # To potentially get instance IPs

# Create a Compute Engine instance and run the script
# (This is a simplified example; consider using managed instance groups or Kubernetes for robustness)
gcloud compute instances create failover-monitor-instance \
    --zone=us-central1-a \
    --machine-type=e2-medium \
    --service-account=failover-monitor-sa@your-gcp-project-id.iam.gserviceaccount.com \
    --scopes="https://www.googleapis.com/auth/cloud-platform" \
    --metadata startup-script='#! /bin/bash
        apt-get update
        apt-get install -y python3-pip python3-dev
        pip3 install mysql-connector-python google-auth requests
        # Download and run your Python script here
        # Ensure DB credentials and target URLs are securely managed (e.g., via Secret Manager)
    '

# Create a Cloud Scheduler job that runs every 5 minutes.
# The URI must be reachable from Cloud Scheduler, so "localhost" will not work.
# Replace the placeholder URL below with the HTTPS endpoint that fronts the
# monitor on the Compute Engine instance; the same value is used as the
# target audience of the OIDC token.
gcloud scheduler jobs create http mysql-failover-check \
    --schedule="*/5 * * * *" \
    --uri="https://your-monitor-endpoint.example.com/run-check" \
    --http-method=POST \
    --location=us-central1 \
    --oidc-service-account-email=failover-monitor-sa@your-gcp-project-id.iam.gserviceaccount.com \
    --oidc-token-audience="https://your-monitor-endpoint.example.com/run-check"

Note: The Cloud Scheduler setup above assumes you have a web server running on the Compute Engine instance to receive the HTTP trigger. A simpler approach is to use Cloud Functions triggered by Pub/Sub, which then invokes the Python script. Alternatively, use a cron job on the Compute Engine instance itself, managed by its startup script or a configuration management tool.
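When each check is a separate invocation (Cloud Scheduler, Cloud Functions, or cron), the consecutive-failure counter can no longer live in a Python global. A minimal sketch of the per-invocation decision follows. The function name is hypothetical, and where the counter is persisted between runs (a local file, a database row, and so on) is left open.

```python
from typing import Tuple


def evaluate_check(previous_failures: int, healthy: bool, threshold: int) -> Tuple[int, bool]:
    """Return (new_failure_count, should_failover) for one health check result."""
    if healthy:
        # Any successful check resets the streak.
        return 0, False
    failures = previous_failures + 1
    return failures, failures >= threshold


# Example: three consecutive failures with a threshold of 3.
count = 0
for result in (False, False, False):
    count, should_failover = evaluate_check(count, result, threshold=3)
```

With the `*/5 * * * *` schedule and a threshold of three, failover is triggered no earlier than 10 minutes after the first failed check, which is worth weighing against the recovery time objective.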

C++ Service Auto-Failover with Health Checks and Orchestration

For the C++ application tier, we’ll assume it’s deployed using containers, managed by Google Kubernetes Engine (GKE) or deployed as Compute Engine instances behind a load balancer. The strategy involves:

Implementing robust health check endpoints within the C++ application.
Configuring load balancers (GCP Load Balancer, GKE Ingress) to use these health checks.
For Compute Engine deployments, using a custom script or a managed service to restart unhealthy instances.

C++ Application Health Check Endpoint

Your C++ application should expose an HTTP endpoint (e.g., /healthz) that performs critical checks. This includes:

Database connectivity (using the current DB endpoint).
Availability of essential external services.
Internal application state (e.g., worker threads, cache status).
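The aggregation logic behind such an endpoint is language-independent: run each named check and report 503 if any of them fails. It is sketched here in Python for brevity, with hypothetical names; the C++ service would apply the same rule.

```python
import json
from typing import Callable, Dict, Tuple


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run all checks and return (http_status, json_body)."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises is treated the same as a failed check.
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        return 503, json.dumps({"status": "unhealthy", "failed": failed})
    return 200, json.dumps({"status": "ok"})
```

Listing the failed check names in the body makes it easier to tell a database outage apart from an application fault when reading load balancer or probe logs.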

Example C++ Health Check Implementation (using Boost.Beast)

This is a simplified example. In production, consider using a more mature HTTP server library or integrating with your existing web framework.

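#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>
#include <boost/beast/version.hpp>   // BOOST_BEAST_VERSION_STRING
#include <boost/asio/dispatch.hpp>
#include <boost/asio/ip/tcp.hpp>
#include <boost/asio/strand.hpp>
#include <boost/config.hpp>
#include <boost/system/error_code.hpp>
#include <chrono>                    // std::chrono::seconds
#include <cstdlib>                   // std::atoi, EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>
#include <memory>                    // std::shared_ptr, std::enable_shared_from_this
#include <mutex>
#include <string>
#include <thread>
#include <vector>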

namespace beast = boost::beast;         // from <boost/beast.hpp>
namespace http = beast::http;           // from <boost/beast/http.hpp>
namespace net = boost::asio;            // from <boost/asio.hpp>
using tcp = boost::asio::ip::tcp;       // from <boost/asio/ip/tcp.hpp>

// --- Global state for DB connection (simplified) ---
std::string current_db_host = "your-current-db-host";
std::mutex db_host_mutex;

// Function to simulate DB check
bool check_database_connection(const std::string& host) {
    // In a real app, this would use your DB connector (e.g., libmysqlclient, SOCI)
    // to attempt a connection or a simple query.
    std::cout << "Simulating DB check to: " << host << std::endl;
    // For demonstration, assume it's healthy if host is not empty
    return !host.empty();
}

// Handles an HTTP server connection
class session : public std::enable_shared_from_this<session>
{
    beast::tcp_stream stream_;
    beast::flat_buffer buffer_;
    http::request<http::string_body> req_;

public:
    session(tcp::socket&& socket)
        : stream_(std::move(socket))
    {
    }

    // Start the asynchronous operation
    void run()
    {
        // We need to be executing within a strand to perform async operations
        // on the stream.
        net::dispatch(stream_.get_executor(),
                      beast::bind_front_handler(
                          &session::do_read,
                          shared_from_this()));
    }

    void do_read()
    {
        // Make the request empty before reading
        req_ = {};

        // Set the timeout. This is crucial for preventing hung connections.
        beast::get_lowest_layer(stream_).expires_after(std::chrono::seconds(30));

        // Read a request
        http::async_read(stream_, buffer_, req_,
            beast::bind_front_handler(
                &session::on_read,
                shared_from_this()));
    }

    void on_read(beast::error_code ec, std::size_t bytes_transferred)
    {
        boost::ignore_unused(bytes_transferred);

        // This means they closed the connection
        if(ec == http::error::end_of_stream)
            return do_close();

        if(ec)
        {
            std::cerr << "read error: " << ec.message() << std::endl;
            return; // Error occurred, connection will be closed by caller
        }

        // Handle the request
        handle_request();
    }

    void handle_request()
    {
        // Decide status and body first, then build the response once.
        http::status status = http::status::ok;
        std::string body = R"({"status": "ok"})";

        // Check if it's the health check endpoint
        if (req_.target() == "/healthz") {
            // Copy the host under the lock so a slow DB check does not
            // block update_database_host().
            std::string host;
            {
                std::lock_guard<std::mutex> lock(db_host_mutex);
                host = current_db_host;
            }

            if (!check_database_connection(host)) {
                status = http::status::service_unavailable;
                body = R"({"status": "unhealthy", "reason": "database connection failed"})";
            }
        } else {
            // Handle other routes if necessary
            status = http::status::not_found;
            body = R"({"status": "not_found"})";
        }

        // async_write completes after this function returns, so the response
        // must not be a local variable. Keep it alive in a shared_ptr that the
        // completion handler captures.
        auto res = std::make_shared<http::response<http::string_body>>(status, req_.version());
        res->set(http::field::server, BOOST_BEAST_VERSION_STRING);
        res->set(http::field::content_type, "application/json");
        res->keep_alive(req_.keep_alive());
        res->body() = std::move(body);
        res->prepare_payload();

        // Set the timeout. This is crucial for preventing hung connections.
        beast::get_lowest_layer(stream_).expires_after(std::chrono::seconds(30));

        // Respond to the client; need_eof() tells on_write whether to close.
        http::async_write(stream_, *res,
            [self = shared_from_this(), res](beast::error_code ec, std::size_t bytes_transferred)
            {
                self->on_write(res->need_eof(), ec, bytes_transferred);
            });
    }

    void on_write(bool close, beast::error_code ec, std::size_t bytes_transferred)
    {
        boost::ignore_unused(bytes_transferred);

        if(ec)
        {
            std::cerr << "write error: " << ec.message() << std::endl;
            return; // Error occurred, connection will be closed by caller
        }

        if(close)
            return do_close();

        // We are done with the response, but keep the connection alive if requested
        // and if the client supports it.
        // For simplicity, we close after each request in this example.
        return do_close();
    }

    void do_close()
    {
        // Send a TCP shutdown
        beast::error_code ec;
        stream_.socket().shutdown(tcp::socket::shutdown_send, ec);

        // At this point the connection is closed gracefully
    }
};

// Accepts incoming connections and launches the sessions
class listener : public std::enable_shared_from_this<listener>
{
    net::io_context& ioc_;
    tcp::acceptor acceptor_;

public:
    listener(net::io_context& ioc, tcp::endpoint endpoint)
        : ioc_(ioc)
        , acceptor_(ioc)
    {
        beast::error_code ec;

        // Open the acceptor
        acceptor_.open(endpoint.protocol(), ec);
        if(ec)
        {
            std::cerr << "error: open acceptor: " << ec.message() << std::endl;
            return;
        }

        // Allow address reuse
        acceptor_.set_option(boost::asio::socket_base::reuse_address(true), ec);
        if(ec)
        {
            std::cerr << "error: set_option reuse_address: " << ec.message() << std::endl;
            return;
        }

        // Bind to the server address
        acceptor_.bind(endpoint, ec);
        if(ec)
        {
            std::cerr << "error: bind: " << ec.message() << std::endl;
            return;
        }

        // Start listening for connections
        acceptor_.listen(net::socket_base::max_listen_connections, ec);
        if(ec)
        {
            std::cerr << "error: listen: " << ec.message() << std::endl;
            return;
        }
    }

    // Start accepting incoming connections
    void run()
    {
        net::dispatch(ioc_,
                      beast::bind_front_handler(
                          &listener::do_accept,
                          shared_from_this()));
    }

private:
    void do_accept()
    {
        // The new connection gets its own socket
        auto socket = std::make_shared<tcp::socket>(ioc_);

        // Accept the next connection
        acceptor_.async_accept(*socket,
            beast::bind_front_handler(
                &listener::on_accept,
                shared_from_this(),
                socket));
    }

    void on_accept(std::shared_ptr<tcp::socket> socket, beast::error_code ec)
    {
        if(ec)
        {
            std::cerr << "accept error: " << ec.message() << std::endl;
        }
        else
        {
            // Create the session and run it
            std::make_shared<session>(std::move(*socket))->run();
        }

        // Accept another connection
        do_accept();
    }
};

// Function to update DB host (called by your config management)
void update_database_host(const std::string& new_host) {
    std::lock_guard<std::mutex> lock(db_host_mutex);
    current_db_host = new_host;
    std::cout << "Database host updated to: " << current_db_host << std::endl;
}

void run_server(net::io_context& ioc, const std::string& address, unsigned short port)
{
    auto const listen_endpoint = tcp::endpoint{net::ip::make_address(address), port};
    auto listener_ = std::make_shared<listener>(ioc, listen_endpoint);
    listener_->run();

    // Run the I/O service on the requested number of threads
    std::vector<std::thread> v;
    for(auto i = 0; i < 1; ++i) // Use multiple threads for production
        v.emplace_back(
            [&ioc]
            {
                ioc.run();
            });

    for(auto& t : v)
        t.join();
}

int main(int argc, char* argv[])
{
    try
    {
        // Check command line arguments.
        if (argc != 3)
        {
            std::cerr << "Usage: http-server <address> <port>\n";
            std::cerr << "Example:\n";
            std::cerr << "    http-server 0.0.0.0 8080\n";
            return EXIT_FAILURE;
        }
        auto const address = net::ip::make_address(argv[1]);
        auto const port = static_cast<unsigned short>(std::atoi(argv[2]));

        net::io_context ioc{1}; // Concurrency hint: a single thread will call ioc.run()

        // Start the HTTP server in a separate thread
        std::thread server_thread(run_server, std::ref(ioc), argv[1], port);

        // Simulate receiving a DB host update (e.g., from your config service)
        // In a real app, this would be triggered by an API call or message queue.
        std::this_thread::sleep_for(std::chrono::seconds(10));
        update_database_host("your-new-failover-db-host");

        server_thread.join();
    }
    catch (const std::exception& e)
    {
        std::cerr << "Exception: " << e.what() << std::endl;
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}

Load Balancer Health Checks (GCP Load Balancer / GKE Ingress)

When deploying your C++ application behind a GCP Load Balancer (Network or HTTP(S)) or GKE Ingress, configure the backend service health checks to point to your /healthz endpoint. This ensures that unhealthy instances are automatically removed from the load balancing pool.

Example GCP Load Balancer Health Check Configuration

gcloud compute health-checks create http my-app-health-check \
    --request-path=/healthz \
    --port=8080 \
    --check-interval=10s \
    --timeout=5s \
    --unhealthy-threshold=3 \
    --healthy-threshold=2 \
    --global # or --region=your-region for regional LBs

Then, associate this health check with your backend service or instance group.
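The health check parameters determine how quickly a failed instance stops receiving traffic. A rough worst-case estimate, assuming a single prober and that every failed probe takes the full timeout to be declared failed, is the check interval times the unhealthy threshold plus one timeout. The helper below is a hypothetical illustration of that arithmetic, not a statement of GCP's exact probing behavior.

```python
def approx_seconds_to_unhealthy(check_interval_s: int, timeout_s: int, unhealthy_threshold: int) -> int:
    """Rough upper bound on the time from failure to the backend being marked unhealthy."""
    # The failure may happen just after a successful probe, so a full interval
    # can pass before each of the consecutive failing probes.
    return check_interval_s * unhealthy_threshold + timeout_s


# Values from the health check above: 10s interval, 5s timeout, threshold 3.
detection_s = approx_seconds_to_unhealthy(10, 5, 3)
```

With the values used above this gives about 35 seconds, so lowering the interval or the threshold shortens detection at the cost of more probe traffic and a higher risk of false positives.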

Compute Engine Instance Auto-Restart

If your C++ services run directly on Compute Engine instances (not in GKE), you can leverage GCP’s instance health checks and autohealing capabilities. Create a health check similar to the one above and associate it with your Managed Instance Group (MIG). Configure the MIG’s autohealing policy so that instances failing the health check are automatically recreated.

Configuring MIG Auto-Healing

When creating or updating a MIG:

gcloud compute instance-groups managed create my-cpp-mig \
    --template=my-cpp-instance-template \
    --size=3 \
    --zone=us-central1-a \
    --health-check=my-app-health-check \
    --initial-delay=300s # Give new instances time to start up

# Or update an existing MIG
gcloud compute instance-groups managed update my-cpp-mig \
    --health-check=my-app-health-check \
    --initial-delay=300s \
    --zone=us-central1-a

The initial-delay is important to prevent premature restarts of newly launched instances.

Orchestrating the Full Failover Process

A complete disaster recovery solution requires coordination between the database and application tiers. When a database failover occurs:

The database failover script (running via Cloud Scheduler) promotes the replica.
It then notifies a central configuration service or directly calls an API on the application tier.
The application tier updates its database connection string to point to the newly promoted replica.
Load balancers and instance group health checks will naturally route traffic away from any remaining unhealthy application instances and towards healthy ones.
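The ordering of these steps matters: the application should only be re-pointed once the replica has actually finished promotion. A minimal sketch of that sequencing follows, with the three actions passed in as callables. All names are hypothetical, and how `is_promoted` queries the instance state (for example through the Cloud SQL Admin API) is an assumption left to the caller.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout_s: float, interval_s: float,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll predicate until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def run_failover(promote: Callable[[], None], is_promoted: Callable[[], bool],
                 notify_app: Callable[[], None], timeout_s: float = 600.0,
                 interval_s: float = 10.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> bool:
    """Promote, wait for completion, then notify the application tier."""
    promote()
    if not wait_until(is_promoted, timeout_s, interval_s, sleep, clock):
        # Promotion did not finish in time: do not re-point the application.
        return False
    notify_app()
    return True
```

Returning False without notifying the application keeps the tiers consistent when promotion stalls; the caller would typically raise an alert in that case.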

Considerations for Zero Downtime

Achieving true zero downtime during failover is challenging and depends heavily on application design:

Application State Management: Ensure your application can gracefully handle temporary database unavailability or connection drops. Implement retry mechanisms with exponential backoff.
Connection Pooling: If using connection pools, ensure they can be reconfigured or flushed upon database endpoint changes.
Read vs. Write Operations: During failover, there might be a brief period where writes are inconsistent or unavailable. Applications should be designed to tolerate this, perhaps by temporarily disabling write operations or queuing them.
Testing: Rigorous testing of the failover process in a staging environment is paramount. Simulate various failure scenarios (network partitions, instance failures, region outages).
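The retry-with-exponential-backoff idea mentioned above can be sketched as follows. The function and parameter names are hypothetical, and `ConnectionError` stands in for whatever exception your database client raises on a dropped or refused connection.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                       base_delay_s: float = 0.5, max_delay_s: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, retrying on ConnectionError with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each time (0.5s, 1s, 2s, ...) up to max_delay_s.
            sleep(min(max_delay_s, base_delay_s * (2 ** attempt)))
    raise AssertionError("unreachable")
```

In production, random jitter is commonly added to each delay so that many application instances do not reconnect to the newly promoted database at the same instant.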

Conclusion

Architecting for automated failover involves a multi-layered approach. For MySQL, Cloud SQL HA provides zone-level resilience, while cross-region replication combined with custom automation scripts enables regional disaster recovery. For C++ services, leveraging load balancer health checks and instance group auto-healing (or Kubernetes readiness/liveness probes) ensures that unhealthy application instances are automatically replaced or removed from service. The key to a successful automated failover lies in the seamless communication and coordination between these components, particularly when updating application configurations to reflect the new database endpoint.