Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Laravel Deployments on Google Cloud

GCP Cloud SQL High Availability vs. Manual Failover Orchestration

Google Cloud SQL for MySQL offers a built-in High Availability (HA) configuration. This feature automatically provisions a standby instance in a different zone within the same region. If the primary instance fails, Cloud SQL automatically promotes the standby to become the new primary, with minimal downtime. While this is a robust solution for many use cases, it abstracts away the underlying failover mechanisms. For scenarios demanding granular control, custom failover logic, or integration with application-level health checks, a more orchestrated approach is often necessary. This post details how to architect an automated failover system for a Laravel application running on GCP Compute Engine instances backed by Cloud SQL, focusing on application-aware health checks and automated switching of load balancer traffic between application instances.

Designing the Automated Failover Architecture

Our architecture will consist of the following components:

Primary and Standby Compute Engine Instances: Two GCE instances running the Laravel application. These instances will be in different zones within the same GCP region for resilience.
Cloud SQL Instance: A single Cloud SQL for MySQL instance. For this manual orchestration approach, we will *not* use the built-in Cloud SQL HA feature. Instead, we’ll manage failover at the application tier.
Load Balancer: A Google Cloud Load Balancer (specifically, a Network Load Balancer or Application Load Balancer) distributing traffic to the healthy application instances.
Health Check Service: A custom mechanism to determine the health of the primary database connection from the application instances.
Failover Orchestration Script: A script (e.g., Python or Bash) running on a separate, highly available monitoring instance or as a Kubernetes CronJob, responsible for detecting failures and initiating the failover process.

Cloud SQL Instance Configuration (No Built-in HA)

When setting up your Cloud SQL instance for this manual orchestration, ensure High Availability is *disabled*. This is crucial because we want to control the failover process ourselves. The instance should be configured with appropriate resources (CPU, RAM, storage) based on your application’s load. For connectivity, we’ll use Private IP to ensure secure communication between GCE instances and Cloud SQL within your VPC network.
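
As a rough illustration of these settings, the request body for the Cloud SQL Admin API could look like the sketch below. The instance name, machine tier, MySQL version, and VPC path are placeholder assumptions, not values from this setup; the point is only that `availabilityType` is `ZONAL` (HA disabled) and that public IPv4 is turned off in favor of a private network.

```python
def build_cloud_sql_instance_body(name, region, vpc_network, tier="db-custom-2-7680"):
    """Build a Cloud SQL Admin API instance body with HA disabled and Private IP only.

    All argument values are supplied by the caller; the default tier is a placeholder.
    """
    return {
        "name": name,
        "region": region,
        "databaseVersion": "MYSQL_8_0",  # assumed version, adjust as needed
        "settings": {
            "tier": tier,
            # ZONAL means a single-zone instance, i.e. built-in HA is off
            "availabilityType": "ZONAL",
            "ipConfiguration": {
                "ipv4Enabled": False,           # no public IP
                "privateNetwork": vpc_network,  # VPC used for Private IP
            },
        },
    }


body = build_cloud_sql_instance_body(
    name="your-cloudsql-instance-name",
    region="us-central1",
    vpc_network="projects/your-gcp-project-id/global/networks/default",
)
```

A body like this would typically be passed to the Admin API's `instances().insert(project=..., body=body)` call, or the same options can be set in the console or with gcloud.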

Compute Engine Instance Setup

We’ll deploy two GCE instances, each in a different zone (e.g., us-central1-a and us-central1-b). These instances will host our Laravel application. They should be configured with identical software stacks, including the web server (Nginx/Apache), PHP-FPM, and the Laravel application code. They will connect to the Cloud SQL instance using its Private IP address.

Load Balancer Configuration

A Google Cloud Load Balancer will sit in front of the two GCE instances. We’ll configure it with a backend service that targets both instances. Crucially, the load balancer never talks to the database itself: its health checks call an *application endpoint* (a dedicated health check route), and that endpoint in turn verifies the database connection. This ensures that traffic is only sent to instances that can successfully communicate with the database.
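
A minimal sketch of such a health check, expressed as a Compute Engine API resource body, might look like this. The check name, port, request path, and the interval and threshold values are illustrative assumptions; the path in particular must match wherever the Laravel health route is actually served.

```python
def build_http_health_check(name, request_path="/health/db", port=80):
    """Build a Compute Engine HTTP health check body for the application endpoint.

    The timing values below are example numbers, not recommendations.
    """
    return {
        "name": name,
        "type": "HTTP",
        "checkIntervalSec": 10,    # probe every 10 seconds
        "timeoutSec": 5,           # each probe may take up to 5 seconds
        "healthyThreshold": 2,     # successes needed to mark an instance healthy
        "unhealthyThreshold": 3,   # failures needed to mark an instance unhealthy
        "httpHealthCheck": {"port": port, "requestPath": request_path},
    }


health_check_body = build_http_health_check("laravel-db-health-check")
```

With the Python API client, a body like this can be submitted with `healthChecks().insert(project=..., body=health_check_body)` and then referenced from the backend service.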

Application-Level Database Health Check

The core of our automated failover lies in an application-level health check that verifies database connectivity. In Laravel, this can be implemented by creating a dedicated health check route that attempts a simple database query.

Laravel Health Check Route

Create a new route in routes/api.php (or routes/web.php if you prefer):

// routes/api.php
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Route;

Route::get('/health/db', function () {
    try {
        // Run a trivial query so the check exercises a real round trip to MySQL
        DB::select('SELECT 1');
        return response()->json(['status' => 'ok', 'database' => 'connected']);
    } catch (\Throwable $e) {
        // Log the details server-side; keep them out of the public response
        Log::error('Database connection error: ' . $e->getMessage());
        return response()->json(['status' => 'error', 'database' => 'disconnected'], 503); // 503 Service Unavailable
    }
});

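Ensure this route is accessible via your load balancer, and point the load balancer’s health check at it. Note that routes defined in routes/api.php are served under the /api prefix by default, so the endpoint is http://[INSTANCE_IP]/api/health/db; if you define the route in routes/web.php instead, it is http://[INSTANCE_IP]/health/db.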

Failover Orchestration Script (Python Example)

We need a script that periodically checks database connectivity as seen by the *current primary* application instance. If that check keeps failing, the script executes a failover procedure. This example assumes a single Cloud SQL instance, so there is no database replica to promote: “failover” here means moving traffic to an application instance that can still reach the database. The application instances need no reconfiguration, because the Cloud SQL Private IP is stable.

A more realistic scenario for manual orchestration might involve a primary/replica setup where you promote a replica. However, for simplicity with a single Cloud SQL instance, the “failover” here is more about detecting the *loss* of connectivity and potentially alerting or triggering a manual intervention. If you were using a self-managed MySQL cluster on GCE, this script would be more complex, involving promoting a replica.

Let’s consider a scenario where we have a primary application instance and a standby. The script monitors the primary’s database health. If it fails, it attempts to switch traffic to the standby.

Monitoring Script Logic

The script will:

Define the primary and standby application instance IPs/hostnames.
Define the Cloud SQL instance connection details.
Periodically call the primary application instance’s health endpoint from a *monitoring instance* (or run this as a cron job on one of the app instances, though a separate instance is more robust). Because that endpoint queries Cloud SQL, this tests the database connection as the application sees it.
If the check fails several times in a row and the standby’s health endpoint reports healthy, switch the load balancer’s backend to the standby instance.
If the check succeeds again after a failover, switch the load balancer back to the primary.

import requests
import time
import googleapiclient.discovery
from google.oauth2 import service_account

# --- Configuration ---
PROJECT_ID = 'your-gcp-project-id'
REGION = 'us-central1'
PRIMARY_INSTANCE_NAME = 'app-instance-1'
STANDBY_INSTANCE_NAME = 'app-instance-2'
LOAD_BALANCER_NAME = 'your-load-balancer-name' # e.g., your-nlb-name
BACKEND_SERVICE_NAME = 'your-backend-service-name' # Associated with the LB
CLOUD_SQL_CONNECTION_NAME = 'your-gcp-project-id:us-central1:your-cloudsql-instance-name'
HEALTH_CHECK_ENDPOINT = '/api/health/db' # The Laravel health check endpoint; routes in routes/api.php get the /api prefix (use '/health/db' for routes/web.php)
MONITORING_INTERVAL_SECONDS = 30
FAILOVER_THRESHOLD = 3 # Number of consecutive failures before triggering failover

# GCP Service Account Key (ensure it can read instances, i.e. compute.instances.get / compute.instances.list, and has compute.backendServices.get and compute.backendServices.update)
# It's recommended to use Workload Identity or Instance Service Accounts for better security.
# For simplicity, using a service account key file here.
SERVICE_ACCOUNT_FILE = 'path/to/your/service-account-key.json'

# --- GCP API Clients ---
try:
    credentials = service_account.Credentials.from_service_account_file(SERVICE_ACCOUNT_FILE)
    compute_service = googleapiclient.discovery.build('compute', 'v1', credentials=credentials)
except Exception as e:
    print(f"Error initializing GCP clients: {e}")
    exit(1)

# --- Helper Functions ---
def get_instance_ip(instance_name):
    """Gets the primary internal IP of a GCE instance, whichever zone it is in."""
    try:
        # aggregatedList searches all zones, so the standby in another zone is found too
        request = compute_service.instances().aggregatedList(
            project=PROJECT_ID, filter=f'name = "{instance_name}"'
        )
        response = request.execute()
        for scoped_list in response.get('items', {}).values():
            for instance in scoped_list.get('instances', []):
                for network_interface in instance.get('networkInterfaces', []):
                    # 'networkIP' holds the interface's internal IPv4 address
                    if network_interface.get('networkIP'):
                        return network_interface['networkIP']
        return None
    except Exception as e:
        print(f"Error getting IP for {instance_name}: {e}")
        return None

def check_database_health(instance_ip):
    """Checks the database health endpoint of a given instance."""
    url = f"http://{instance_ip}{HEALTH_CHECK_ENDPOINT}"
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status() # Raise HTTPError for bad responses (4xx or 5xx)
        data = response.json()
        return data.get('database') == 'connected'
    except requests.exceptions.RequestException as e:
        print(f"Health check failed for {instance_ip}: {e}")
        return False

def get_backend_service(backend_service_name):
    """Retrieves the backend service configuration."""
    try:
        request = compute_service.backendServices().get(project=PROJECT_ID, backendService=backend_service_name)
        return request.execute()
    except Exception as e:
        print(f"Error getting backend service {backend_service_name}: {e}")
        return None

def update_backend_service(backend_service_name, updated_backends):
    """Updates the backends of a backend service."""
    try:
        backend_service_body = {
            "backends": updated_backends
        }
        request = compute_service.backendServices().patch(
            project=PROJECT_ID,
            backendService=backend_service_name,
            body=backend_service_body
        )
        operation = request.execute()
        print(f"Patching backend service {backend_service_name}. Operation: {operation['name']}")
        # In a real-world scenario, you'd poll the operation status
        return True
    except Exception as e:
        print(f"Error updating backend service {backend_service_name}: {e}")
        return False

def get_instance_backend_config(instance_name, instance_zone):
    """Constructs the backend configuration for a GCE instance."""
    return {
        "group": f"projects/{PROJECT_ID}/zones/{instance_zone}/instanceGroups/{instance_name}-ig" # Assumes instance group name matches instance name
    }

# --- Main Loop ---
def main():
    primary_ip = get_instance_ip(PRIMARY_INSTANCE_NAME)
    standby_ip = get_instance_ip(STANDBY_INSTANCE_NAME) # Assuming standby is in a different zone, e.g., -b

    if not primary_ip or not standby_ip:
        print("Could not retrieve instance IPs. Exiting.")
        return

    print(f"Primary Instance IP: {primary_ip}")
    print(f"Standby Instance IP: {standby_ip}")

    consecutive_failures = 0
    is_primary_active = True # Assume primary is initially active

    while True:
        print(f"\nChecking health of primary instance ({primary_ip})...")
        if check_database_health(primary_ip):
            print("Primary database connection is healthy.")
            consecutive_failures = 0
            if not is_primary_active:
                print("Primary is back online. Re-enabling primary in load balancer.")
                # Reconfigure load balancer to use primary
                primary_backend = get_instance_backend_config(PRIMARY_INSTANCE_NAME, f"{REGION}-a")
                standby_backend = get_instance_backend_config(STANDBY_INSTANCE_NAME, f"{REGION}-b") # Adjust zone as needed
                
                # Ensure only primary is active
                backends_to_set = [primary_backend]
                
                if update_backend_service(BACKEND_SERVICE_NAME, backends_to_set):
                    is_primary_active = True
                else:
                    print("Failed to re-enable primary. Manual intervention may be required.")
        else:
            print("Primary database connection is unhealthy.")
            consecutive_failures += 1
            if consecutive_failures >= FAILOVER_THRESHOLD and is_primary_active:
                print(f"Failover threshold ({FAILOVER_THRESHOLD}) reached. Initiating failover to standby.")
                
                # Attempt to switch traffic to standby
                primary_backend = get_instance_backend_config(PRIMARY_INSTANCE_NAME, f"{REGION}-a")
                standby_backend = get_instance_backend_config(STANDBY_INSTANCE_NAME, f"{REGION}-b") # Adjust zone as needed

                # Check if standby is healthy before switching
                if check_database_health(standby_ip):
                    print("Standby instance is healthy. Switching traffic.")
                    backends_to_set = [standby_backend]
                    if update_backend_service(BACKEND_SERVICE_NAME, backends_to_set):
                        is_primary_active = False
                        print("Failover successful. Traffic directed to standby.")
                    else:
                        print("Failed to update backend service for failover. Manual intervention required.")
                else:
                    print("Standby instance is also unhealthy. Cannot perform failover. Alerting required.")
                    # Trigger alerts here
            elif consecutive_failures < FAILOVER_THRESHOLD:
                print(f"Consecutive failures: {consecutive_failures}/{FAILOVER_THRESHOLD}. Waiting.")

        time.sleep(MONITORING_INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
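
The script above returns as soon as the patch request is accepted and leaves operation polling as a comment. A minimal sketch of that polling step, assuming a global backend service (and therefore the `globalOperations` resource), could look like the following. The function name, timeout, and poll interval are illustrative choices, and `compute_service` is the client built earlier in the script.

```python
import time


def wait_for_global_operation(compute_service, project, operation_name,
                              timeout_seconds=120, poll_seconds=2):
    """Poll a global Compute Engine operation until it is DONE or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        op = compute_service.globalOperations().get(
            project=project, operation=operation_name
        ).execute()
        if op.get("status") == "DONE":
            # A finished operation can still have failed; surface that to the caller
            if "error" in op:
                raise RuntimeError(f"Operation {operation_name} failed: {op['error']}")
            return op
        time.sleep(poll_seconds)
    raise TimeoutError(f"Operation {operation_name} did not finish in {timeout_seconds}s")
```

In `update_backend_service`, this could be called with `operation['name']` before reporting success, so that the state flag only flips once the backend change has actually been applied.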

Important Considerations for the Script:

Instance Groups: The script assumes you have Instance Groups created for each GCE instance, and the Instance Group name matches the instance name. You'll need to create these in GCP.
Permissions: The service account used by this script needs sufficient IAM permissions to read instance details and to read and update Compute Engine backend services. Granting roles/compute.loadBalancerAdmin is a reasonable starting point, but refine based on your security policies. Using Instance Service Accounts is highly recommended over service account key files.
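Load Balancer Type: As written, the script calls the global backendServices API, which fits a global Application Load Balancer (External HTTP(S)). Regional load balancers, including backend-service-based Network Load Balancers, use the regionBackendServices API instead, which takes an additional region parameter. For older target-pool-based Network Load Balancers, you would be manipulating target pools or forwarding rules rather than backend services.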
State Management: The `is_primary_active` flag is a simple way to track state. In a more complex system, you might use a distributed lock or a dedicated state store.
Recovery: The script includes logic to switch back to the primary if it becomes healthy again. This "failback" mechanism is important.
Alerting: This script *detects* failures and attempts to *act*. It does not include robust alerting. You should integrate this script with a monitoring system (e.g., Cloud Monitoring, Prometheus Alertmanager) to send notifications when failovers occur or when failover attempts fail.
Cloud SQL Private IP: The script assumes your GCE instances and the monitoring instance can reach the Cloud SQL Private IP. Ensure your VPC network, subnets, and firewall rules are configured correctly.
Zone Awareness: The script hardcodes zones (-a, -b). In a production system, you'd want to dynamically determine the zones of your instances.
Instance Group Management: The script assumes a static setup where each instance has its own instance group. For more dynamic scaling, you'd integrate with Managed Instance Groups (MIGs).
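
To illustrate the zone-awareness point, the zone can be read from the instance resource itself rather than hardcoded. The sketch below assumes an instance resource dictionary as returned by the Compute Engine API, whose `zone` field is a full URL; the helper names are hypothetical, and the instance group naming is whatever your setup uses.

```python
def zone_from_instance(instance_resource):
    """Return the short zone name (e.g. 'us-central1-b') from an instance resource."""
    # The API reports the zone as a URL ending in .../zones/<zone-name>
    return instance_resource["zone"].rsplit("/", 1)[-1]


def instance_group_url(project, zone, group_name):
    """Build the partial URL of an unmanaged instance group in the given zone."""
    return f"projects/{project}/zones/{zone}/instanceGroups/{group_name}"


example_instance = {
    "name": "app-instance-2",
    "zone": "https://www.googleapis.com/compute/v1/projects/your-gcp-project-id/zones/us-central1-b",
}
zone = zone_from_instance(example_instance)
print(instance_group_url("your-gcp-project-id", zone, "app-instance-2-ig"))
```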

Deployment and Orchestration

The failover orchestration script needs to run continuously. Several options exist:

Dedicated Monitoring VM: A small, highly available GCE instance running the script. This VM itself needs to be resilient.
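Kubernetes CronJob: If you're running your application on GKE, a CronJob can periodically execute the script. This leverages Kubernetes' built-in scheduling and self-healing. Note that CronJobs are scheduled at one-minute granularity at best, and the script as written loops forever and keeps its failure counter in memory, so it would need to be adapted to perform a single pass per run and store its state externally.
Cloud Run Job: For a serverless approach, a Cloud Run Job can be scheduled to run the script. The same single-pass, external-state adaptation applies.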

For this manual orchestration, running the script on a dedicated, highly available monitoring VM is a common pattern. Ensure this monitoring VM has network access to both your application instances and the GCP API.
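
If the script is run as a scheduled job rather than a long-lived process, the failure counter and the active-instance flag have to survive between runs. One possible shape for that, sketched under the assumption that a small JSON file on persistent storage is acceptable as the state store, is to separate the decision logic from the I/O. The function names, the state layout, and the action labels are all hypothetical.

```python
import json
from pathlib import Path

DEFAULT_STATE = {"consecutive_failures": 0, "active": "primary"}


def load_state(path):
    """Load persisted state, or start from the default if no file exists yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else dict(DEFAULT_STATE)


def save_state(path, state):
    """Persist state for the next scheduled run."""
    Path(path).write_text(json.dumps(state))


def decide(state, primary_healthy, standby_healthy, threshold=3):
    """Return (new_state, action) for one monitoring pass.

    action is one of 'none', 'failover', 'failback', or 'alert'.
    """
    new_state = dict(state)
    if primary_healthy:
        new_state["consecutive_failures"] = 0
        if state["active"] == "standby":
            new_state["active"] = "primary"
            return new_state, "failback"
        return new_state, "none"
    new_state["consecutive_failures"] = state["consecutive_failures"] + 1
    if new_state["consecutive_failures"] >= threshold and state["active"] == "primary":
        if standby_healthy:
            new_state["active"] = "standby"
            return new_state, "failover"
        return new_state, "alert"  # both instances unhealthy, nothing to switch to
    return new_state, "none"
```

Each run would load the state, call `decide` with the results of the two health checks, perform the returned action against the load balancer, and save the new state only if that action succeeded.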

Testing the Failover

Thorough testing is paramount. Simulate failures by:

Stopping the web server (Nginx/Apache) or PHP-FPM on the primary application instance.
Simulating network partition between the primary application instance and Cloud SQL (e.g., temporarily modifying firewall rules).
Stopping the Cloud SQL instance (if feasible in a staging environment).

Observe the load balancer's health checks and the orchestration script's logs to verify that traffic is correctly rerouted to the standby instance. Test the application's functionality after the failover to ensure data integrity and application responsiveness.
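
When running these tests, it helps to put a number on the outage seen by clients. One simple way, sketched here, is to probe the load balancer address at a fixed interval during the test, record each result as a `(timestamp_seconds, ok)` pair, and compute the longest run of failures afterwards. The sample format and the function name are assumptions for illustration.

```python
def longest_outage_seconds(samples):
    """Return the longest failure window in a chronological list of (timestamp, ok) pairs."""
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp  # first failed probe of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    # An outage still open at the end of the test counts up to the last sample
    if outage_start is not None and samples:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

The result is only as precise as the probe interval, so a short interval (a second or so) gives a more useful figure for comparing against your recovery targets.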

Limitations and Alternatives

This manual orchestration approach provides granular control but introduces complexity. It requires careful scripting, robust error handling, and continuous monitoring. The primary limitation is the potential for a longer RTO (Recovery Time Objective) compared to managed HA solutions, and the risk of misconfiguration.
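
To make the RTO point concrete, the detection window of the monitoring loop can be estimated from its three settings. The sketch below uses the values from the example script (30-second interval, threshold of 3, 5-second request timeout) and assumes each failed check either fails instantly (best case) or runs into the full timeout (worst case). It covers detection only; the standby health check, the backend service update, and load balancer propagation add to the total.

```python
def detection_window_seconds(interval, threshold, timeout):
    """Estimate (best, worst) seconds from an outage starting to the failover trigger."""
    # Best case: outage begins just before a check and every failed check returns instantly
    best = (threshold - 1) * interval
    # Worst case: outage begins just after a check and every failed check hits the timeout
    worst = threshold * (interval + timeout)
    return best, worst


print(detection_window_seconds(interval=30, threshold=3, timeout=5))  # (60, 105)
```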

Alternatives:

Cloud SQL Built-in HA: As mentioned, this is the simplest and often most effective solution for many use cases. It handles failover automatically with minimal configuration.
MySQL Replication with GCE: For more advanced control, you could set up MySQL replication between two GCE instances (or a GCE instance and Cloud SQL read replica) and use a tool like Orchestrator or custom scripts to manage promotion. This is significantly more complex.
Database-as-a-Service with Automatic Failover: Consider other managed database services that offer robust, automated failover capabilities.

For most Laravel deployments on GCP, leveraging Cloud SQL's built-in High Availability feature is the recommended path unless specific requirements necessitate custom orchestration.