Disaster Recovery 101: Architecting Auto-Failovers for MySQL and WordPress Deployments on Google Cloud

Designing for High Availability: MySQL Replication and Failover on GCP

Achieving true disaster recovery for a critical WordPress deployment hinges on robust, automated failover mechanisms for its underlying MySQL database. This isn’t about manual intervention during an outage; it’s about designing a system that detects failure and seamlessly transitions to a healthy replica with minimal to no human touch. On Google Cloud Platform (GCP), this can be architected using a combination of Cloud SQL’s built-in HA features, custom monitoring, and intelligent application-level routing.

Leveraging Cloud SQL for High Availability

Cloud SQL for MySQL offers a managed, highly available configuration that is the foundational layer for our disaster recovery strategy. This configuration automatically provisions a primary instance and a synchronous standby instance in a different zone within the same region. If the primary instance becomes unavailable, Cloud SQL automatically promotes the standby instance to become the new primary. This process is managed by Google and typically involves a failover time of a few minutes.
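Before relying on this behavior, it is worth confirming that the instance really is configured for HA. The Cloud SQL Admin API reports this in the instance resource as settings.availabilityType, which is REGIONAL for HA instances and ZONAL otherwise. The sketch below inspects a dictionary shaped like the response of an instances().get() call; the helper name is_ha_enabled and the sample payload are illustrative assumptions, not part of any library.

```python
def is_ha_enabled(instance: dict) -> bool:
    """Return True when a Cloud SQL instance resource is configured for HA."""
    settings = instance.get("settings", {})
    return settings.get("availabilityType") == "REGIONAL"


# Hand-written, trimmed sample shaped like a sqladmin instances().get() response.
sample = {
    "name": "wp-db",
    "region": "us-central1",
    "settings": {"availabilityType": "REGIONAL"},
}

print(is_ha_enabled(sample))
```

Running a check like this in a deployment pipeline is one way to catch an instance that was accidentally created as zonal.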

While Cloud SQL HA handles the infrastructure-level failover, it’s crucial to understand its limitations for application-level consistency and rapid failover. During the automatic promotion the database is briefly unavailable, existing connections are dropped, and transactions that were in flight (not yet committed) at the moment of failure are lost and must be retried by the application. For mission-critical applications, we need to augment this with application-aware failover strategies.

Implementing Read Replicas for Scalability and Disaster Recovery

Beyond the HA configuration, Cloud SQL allows for the creation of read replicas. These replicas are asynchronous copies of the primary instance and can be deployed in different regions. While primarily used for scaling read traffic, they also serve as potential candidates for disaster recovery in a multi-region setup. In a disaster scenario affecting an entire region, a cross-region read replica can be promoted to a standalone instance.

The asynchronous nature of read replicas means there’s a potential for data loss during a failover if the replica hasn’t yet received the latest committed transactions from the primary. This is a trade-off for lower latency and geographical distribution. For our automated failover, we’ll focus on the HA primary/standby first, and then consider read replicas for multi-region DR.
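To make the trade-off concrete, a failover script can compare the replica’s reported lag against the RPO before promoting it. MySQL exposes the lag as Seconds_Behind_Source in SHOW REPLICA STATUS (Seconds_Behind_Master in older versions), and reports NULL when replication is not running. The following is a minimal sketch under those assumptions; the function name and the threshold are hypothetical, and the reported lag is only an estimate of how much data would be lost.

```python
from typing import Optional


def replica_within_rpo(seconds_behind: Optional[int], rpo_seconds: int) -> bool:
    """Return True if the replica's reported lag is within the allowed RPO.

    seconds_behind is the value MySQL reports for replication lag, or None
    when replication is stopped or broken and the lag is unknown.
    """
    if seconds_behind is None:
        # Unknown lag: treat as outside the RPO so a human can decide.
        return False
    return seconds_behind <= rpo_seconds


# Example: with a 30-second RPO, a replica 12 seconds behind is acceptable.
print(replica_within_rpo(12, rpo_seconds=30))
print(replica_within_rpo(None, rpo_seconds=30))
```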

Automated Failover Orchestration: The Role of a Proxy Layer

Directly connecting WordPress to a single Cloud SQL instance endpoint is problematic during failover. A zonal HA failover keeps the instance’s IP address, but promoting a cross-region replica or restoring to a new instance produces a different address, and WordPress has no way to discover it on its own. To manage this dynamically, we introduce a proxy layer. HAProxy is a robust, open-source TCP/HTTP load balancer that can be configured to monitor database health and route traffic accordingly. We’ll deploy HAProxy on a Compute Engine instance (or as a GKE service) that WordPress applications will connect to.

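The HAProxy configuration will monitor the health of the Cloud SQL primary endpoint. When the primary becomes unhealthy, HAProxy will automatically stop sending traffic to it and direct all connections to the configured backup server. Keep in mind that the Cloud SQL HA standby is not separately addressable: after a zonal failover, the same IP address simply starts answering again. The backup entry is therefore most useful for a separately addressable instance, such as a replica that has been promoted. This provides a more seamless transition for the application.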

HAProxy Configuration for Cloud SQL Failover

Here’s a sample HAProxy configuration designed to monitor a primary database endpoint and fail over to a backup server. This assumes you have a mechanism to update HAProxy’s backend server addresses when the endpoint changes. A zonal Cloud SQL HA failover keeps the instance’s IP address, so address changes arise mainly when a replica is promoted or an instance is recreated. A more advanced setup would involve a script that polls Cloud SQL’s status and updates HAProxy’s configuration dynamically.

Let’s assume our primary Cloud SQL instance has an IP address of 35.200.100.50 and our failover target (for example, a replica that will be promoted to primary) has an IP address of 35.200.100.51. Until it is promoted, a replica is read-only, so writes routed to it will fail. In a real-world scenario, you’d use private IP addresses or a DNS name that gets updated.

HAProxy Configuration File (`haproxy.cfg`)

global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode tcp
    option tcplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

listen mysql-cluster
    bind *:3306
    mode tcp
    option mysql-check user haproxy_check # Requires a dedicated MySQL user with minimal privileges
    balance roundrobin
    # Primary Cloud SQL instance IP
    server primary 35.200.100.50:3306 check port 3306 inter 2s fall 3 rise 2
    # Failover target IP (used only while the primary is marked down)
    server standby 35.200.100.51:3306 check port 3306 inter 2s fall 3 rise 2 backup

In this configuration:

mode tcp: HAProxy will operate at the TCP level, suitable for MySQL.
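option mysql-check user haproxy_check: This is crucial. HAProxy connects to the MySQL server, sends a client authentication packet for the specified user followed by a QUIT packet, and parses the server’s handshake or error response. It does not run a query, so it confirms that MySQL accepts connections but not that the database is writable or consistent. If the handshake fails or the connection is refused, the server is marked as down. You’ll need to create a MySQL user such as haproxy_check, with no password and no privileges, on your Cloud SQL instances, ideally restricted to the HAProxy host.
balance roundrobin: Distributes connections evenly across the active servers. With a single non-backup server, as here, every connection goes to the primary.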
server primary ... check ...: Defines the primary instance with health checks. fall 3 means it must fail 3 consecutive checks to be marked down. rise 2 means it must pass 2 consecutive checks to be marked up.
server standby ... check ... backup: Defines the standby instance. The backup keyword is key here. HAProxy will only send traffic to a backup server if all non-backup servers in the same group are down.
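The fall and rise counters are easier to reason about with a small simulation. The sketch below is a deliberately simplified model of the thresholds described above, not HAProxy’s actual implementation (which has additional states and timing options); the function name simulate_health is an assumption made for illustration. With inter 2s and fall 3, it suggests that a hard failure is detected after roughly three check intervals, or about 6 seconds plus any check timeouts.

```python
def simulate_health(results, fall=3, rise=2, start_up=True):
    """Replay health check results (True = pass) and return the server
    state (True = up) after each check, using fall/rise thresholds."""
    up = start_up
    streak = 0  # consecutive results that contradict the current state
    states = []
    for ok in results:
        if ok == up:
            # Result agrees with the current state: reset the counter.
            streak = 0
        else:
            streak += 1
            threshold = fall if up else rise
            if streak >= threshold:
                up = not up
                streak = 0
        states.append(up)
    return states


# One pass, four failures, then two passes.
checks = [True, False, False, False, False, True, True]
print(simulate_health(checks))
# -> [True, True, True, False, False, False, True]
```

The server is marked down only on the third consecutive failure and comes back only after the second consecutive success, which is what prevents a single dropped check from triggering a failover.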

Dynamic IP Management and Automation

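The static IP addresses in the HAProxy configuration are a simplification. A zonal Cloud SQL HA failover keeps the instance’s IP address, but the database endpoint does change when a replica is promoted or an instance is recreated from a backup. To handle those cases automatically, we need a mechanism to detect the new address and update HAProxy. This involves two parts: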

1. Cloud SQL Instance IP Address Changes

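When Cloud SQL performs an HA failover within the region, the promoted standby keeps the instance’s IP address. The address to watch is the one that changes after a replica promotion or a restore to a new instance. You can monitor this via: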

Cloud Logging/Monitoring: Set up alerts for Cloud SQL instance status changes or IP address modifications.
Cloud SQL Admin API: Periodically query the API to get the current IP address of your primary instance.

2. Automating HAProxy Configuration Updates

A small script running on the same Compute Engine instance as HAProxy (or a separate management instance) can perform these steps:

Periodically query the Cloud SQL Admin API for the primary instance’s IP address.
Compare the current IP with the one configured in HAProxy.
If different, update the haproxy.cfg file, validate it (e.g., haproxy -c -f /etc/haproxy/haproxy.cfg), and gracefully reload HAProxy (e.g., systemctl reload haproxy).

Here’s a conceptual Python script using the Google API Python client (google-api-python-client):

import subprocess
import time

import googleapiclient.discovery  # pip install google-api-python-client

# Configuration
PROJECT_ID = 'your-gcp-project-id'
INSTANCE_NAME = 'your-mysql-instance-name' # The name of your Cloud SQL instance
HAPROXY_CONFIG_PATH = '/etc/haproxy/haproxy.cfg'
HAPROXY_RELOAD_CMD = ['sudo', 'systemctl', 'reload', 'haproxy']
MYSQL_PORT = 3306
POLL_INTERVAL_SECONDS = 60

# --- Helper function to get current primary IP ---
def get_cloudsql_primary_ip(project_id, instance_name):
    """Return the instance's current IP address, or None if it cannot be read."""
    try:
        # Authenticates with Application Default Credentials (e.g. the VM's service account)
        sqladmin = googleapiclient.discovery.build('sqladmin', 'v1beta4')
        request = sqladmin.instances().get(project=project_id, instance=instance_name)
        response = request.execute()
        # 'ipAddresses' entries are typed: PRIMARY is the public IP, PRIVATE the VPC-internal one.
        # Switch to 'PRIVATE' if HAProxy reaches Cloud SQL over private IP.
        for ip in response.get('ipAddresses', []):
            if ip.get('type') == 'PRIMARY':
                return ip.get('ipAddress')
        return None
    except Exception as e:
        print(f"Error fetching Cloud SQL IP: {e}")
        return None

# --- Helper function to update HAProxy config ---
def update_haproxy_config(current_primary_ip):
    """Rewrite the 'server primary' line if its IP is stale, then reload HAProxy."""
    try:
        with open(HAPROXY_CONFIG_PATH, 'r') as f:
            lines = f.readlines()

        new_lines = []
        primary_server_line_updated = False
        for line in lines:
            parts = line.split()
            # Expected line format: "server primary <ip>:<port> check ..."
            if len(parts) > 2 and parts[:2] == ['server', 'primary']:
                original_ip = parts[2].rsplit(':', 1)[0]
                if original_ip != current_primary_ip:
                    print(f"Updating primary IP from {original_ip} to {current_primary_ip}")
                    parts[2] = f"{current_primary_ip}:{MYSQL_PORT}"
                    # Preserve the original indentation of the line
                    indent = line[:len(line) - len(line.lstrip())]
                    line = indent + " ".join(parts) + "\n"
                    primary_server_line_updated = True
            new_lines.append(line)

        if primary_server_line_updated:
            with open(HAPROXY_CONFIG_PATH, 'w') as f:
                f.writelines(new_lines)
            print("HAProxy config updated. Reloading HAProxy...")
            # check=True raises if the reload command exits with an error
            subprocess.run(HAPROXY_RELOAD_CMD, check=True)
        else:
            print("No HAProxy config update needed.")

    except Exception as e:
        print(f"Error updating HAProxy config: {e}")

# --- Main loop ---
if __name__ == "__main__":
    # Only the 'server primary' line is managed here; the backup server's
    # address stays as written in haproxy.cfg.
    while True:
        current_primary_ip = get_cloudsql_primary_ip(PROJECT_ID, INSTANCE_NAME)
        if current_primary_ip:
            print(f"Current Cloud SQL primary IP: {current_primary_ip}")
            update_haproxy_config(current_primary_ip)
        else:
            print("Could not retrieve Cloud SQL primary IP. Skipping update.")

        time.sleep(POLL_INTERVAL_SECONDS) # Check every 60 seconds

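This script needs to run on the Compute Engine instance hosting HAProxy, because it edits the local haproxy.cfg and reloads the local service, and it needs write access to that file. Ensure the service account running it has IAM permission to read the instance through the Cloud SQL Admin API (cloudsql.instances.get, which is included in roles such as roles/cloudsql.viewer).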
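As an alternative to editing the file and reloading, HAProxy’s runtime API can change a server’s address in place through the admin socket, using the set server <backend>/<server> addr <ip> command. The sketch below builds that command and sends it over the stats socket path from the sample configuration. The helper names are assumptions, and a change made this way is not persisted, so haproxy.cfg still needs updating if the new address should survive a restart.

```python
import ipaddress
import socket


def build_set_addr_command(backend: str, server: str, new_ip: str) -> str:
    """Build the runtime API command that repoints a server at new_ip."""
    ipaddress.ip_address(new_ip)  # raises ValueError on a malformed address
    return f"set server {backend}/{server} addr {new_ip}\n"


def send_runtime_command(command: str,
                         socket_path: str = "/run/haproxy/admin.sock") -> str:
    """Send one command to the HAProxy admin socket and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            # HAProxy closes the connection after answering a single command.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


command = build_set_addr_command("mysql-cluster", "primary", "35.200.100.60")
print(command, end="")
# On the HAProxy host: print(send_runtime_command(command))
```

The socket in the sample configuration is declared with level admin, which this kind of command requires.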

WordPress Application Configuration

Your WordPress wp-config.php file should point to the HAProxy instance’s IP address and port, not directly to the Cloud SQL instance.

/** The name of the database for WordPress */
define( 'DB_NAME', 'your_database_name' );

/** MySQL database username */
define( 'DB_USER', 'your_db_user' );

/** MySQL database password */
define( 'DB_PASSWORD', 'your_db_password' );

/** MySQL hostname */
// Point this to your HAProxy instance's IP address
define( 'DB_HOST', 'YOUR_HAPROXY_IP:3306' );

/** Database Charset to use in creating database tables. */
define( 'DB_CHARSET', 'utf8mb4' );

/** The Database Collate type. Don't change this if in doubt. */
define( 'DB_COLLATE', '' );

Replace YOUR_HAPROXY_IP with the IP address of the Compute Engine instance running HAProxy, preferably its internal IP so that database traffic stays inside the VPC. Ensure your WordPress instances can reach this IP on port 3306.

Multi-Region Disaster Recovery with Read Replicas

For a more robust disaster recovery strategy that protects against entire region failures, you can leverage Cloud SQL cross-region read replicas. In this scenario:

Your primary Cloud SQL instance (with HA) resides in us-central1.
A cross-region read replica is provisioned in us-east1.
You would have a separate HAProxy instance (and potentially a separate WordPress deployment) in us-east1.

In the event of a us-central1 region outage:

You would manually (or via an automated script triggered by external monitoring) promote the us-east1 read replica to a standalone instance.
Point the us-east1 HAProxy backend at the newly promoted instance’s IP address. The DB_HOST in the us-east1 wp-config.php should already reference that HAProxy instance; if it references the us-central1 proxy instead, update it (ideally via a configuration management system).
The HAProxy instance in us-east1 would then route traffic to this new primary.

This process is typically more manual than the intra-region HA failover because promoting a read replica is a deliberate action. However, it provides a higher level of resilience against catastrophic regional failures. The potential for data loss exists here due to asynchronous replication, so RPO (Recovery Point Objective) needs to be carefully considered.
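If you do choose to script the promotion, the Cloud SQL Admin API exposes it as instances.promoteReplica. The following is a minimal sketch that takes an already built API client as a parameter; the function name promote_replica is an assumption, and because promotion cannot be undone, a real script would gate this call behind a lag check and explicit confirmation.

```python
def promote_replica(sqladmin, project_id: str, replica_name: str) -> dict:
    """Ask Cloud SQL to promote a read replica to a standalone instance.

    sqladmin is a client created with
    googleapiclient.discovery.build('sqladmin', 'v1beta4').
    Returns the long-running operation resource describing the promotion.
    """
    request = sqladmin.instances().promoteReplica(
        project=project_id,
        instance=replica_name,
    )
    return request.execute()


# Usage on a machine with credentials (replica name is a placeholder):
#   import googleapiclient.discovery
#   client = googleapiclient.discovery.build('sqladmin', 'v1beta4')
#   operation = promote_replica(client, 'your-gcp-project-id', 'your-replica-name')
#   print(operation.get('status'))
```

The call returns as soon as the operation is accepted, so the HAProxy backend should only be switched once the operation reports completion.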

Testing Your Failover Strategy

A disaster recovery plan is only as good as its last successful test. Regularly simulate failures to ensure your automated failover works as expected:

Simulate Primary Instance Failure: Trigger a manual failover of the HA instance (for example, with gcloud sql instances failover INSTANCE_NAME) or block network access to the primary from HAProxy. Observe HAProxy’s behavior and WordPress connectivity.
Simulate HAProxy Failure: If HAProxy is on a Compute Engine instance, stop the instance. Ensure a redundant HAProxy setup or a mechanism to restart it quickly.
Simulate Regional Outage: For multi-region DR, test the manual promotion of a read replica and the subsequent application configuration updates.
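To turn these drills into a number you can track, record periodic probe results during the test and compute the longest continuous outage. The sketch below assumes a probe log of (timestamp in seconds, success flag) pairs collected by whatever monitoring you already run; the function name and the sample data are illustrative.

```python
def longest_outage(samples):
    """Return the longest continuous failure span, in seconds.

    samples is a time-ordered list of (timestamp_seconds, ok) pairs.
    An outage runs from the first failed probe to the next successful one.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        # Still failing at the end of the log: count up to the last probe.
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Probes every 2 seconds; the database was unreachable from t=2 until t=8.
probe_log = [(0, True), (2, False), (4, False), (6, False), (8, True), (10, True)]
print(longest_outage(probe_log))
```

Comparing this figure across drills shows whether changes to check intervals or automation actually shorten the recovery time.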

Thorough testing will reveal any gaps in your automation, monitoring, or configuration management, allowing you to refine your disaster recovery architecture before a real event occurs.