Disaster Recovery 101: Architecting Auto-Failovers for DynamoDB and PHP Deployments on Google Cloud

Establishing Multi-Region DynamoDB Replication

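Automated failover for critical applications hinges on resilient data stores. For DynamoDB, this means using its built-in global tables feature. Global tables go beyond backups: they provide multi-active replication across AWS regions, so every replica accepts both reads and writes, and changes typically reach the other replicas within about a second. If the primary region becomes unavailable, a secondary region can keep serving traffic, with the caveat that cross-region replication is eventually consistent by default and concurrent writes to the same item are reconciled last-writer-wins.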

The process involves creating a DynamoDB table in your primary region and then enabling global tables. This creates replica tables in other specified regions. DynamoDB handles the replication of data changes automatically. The key is to ensure your application is designed to connect to the *nearest* available replica table.
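
As a rough sketch of the replica step, the snippet below builds the parameters one would pass to the DynamoDB `UpdateTable` API (for example through boto3’s `update_table`) to add replicas to an existing table. The table name `orders`, the region list, and the helper name `build_replica_requests` are illustrative assumptions, not part of the setup described here. The helper emits one request per region on the conservative assumption that replicas are added one at a time.

```python
from typing import Dict, List


def build_replica_requests(table_name: str, replica_regions: List[str]) -> List[Dict]:
    """Build one UpdateTable parameter set per replica region.

    Each dict is meant to be passed as keyword arguments to a DynamoDB
    client created in the table's primary region, e.g.
    boto3.client("dynamodb", region_name=primary).update_table(**params).
    """
    requests = []
    for region in replica_regions:
        requests.append({
            "TableName": table_name,
            # "Create" asks DynamoDB to add a replica of the table in `region`.
            "ReplicaUpdates": [{"Create": {"RegionName": region}}],
        })
    return requests


if __name__ == "__main__":
    # Hypothetical table and regions, for illustration only.
    for params in build_replica_requests("orders", ["eu-west-1", "ap-southeast-1"]):
        print(params)
```

Keeping the request construction separate from the SDK call makes it easy to review exactly which regions will receive replicas before anything is sent to AWS.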

PHP Application Configuration for Multi-Region Awareness

Your PHP application needs to be aware of the available DynamoDB endpoints and intelligently select the closest one. This typically involves a configuration layer that can dynamically update the DynamoDB client’s region based on the application’s current deployment location. We’ll use environment variables to dictate the application’s region.

Consider a simple PHP class that abstracts DynamoDB interactions and handles region selection. This class will read an environment variable, `APP_REGION`, to determine which AWS region it should connect to. If the primary region is unavailable, a separate mechanism (discussed later) will update this environment variable to point to the secondary region.

DynamoDB Client Abstraction in PHP

Here’s a basic PHP class using the AWS SDK for PHP to manage DynamoDB connections. It prioritizes connecting to the region specified by the `APP_REGION` environment variable.

<?php

require 'vendor/autoload.php'; // Assuming you're using Composer

use Aws\DynamoDb\DynamoDbClient;
use Aws\Exception\AwsException;

class MultiRegionDynamoDBClient {
    private $client;
    private $tableName;
    private $region;

    public function __construct(string $tableName, array $config = []) {
        $this->tableName = $tableName;
        $this->region = getenv('APP_REGION') ?: ($config['default_region'] ?? 'us-east-1'); // Default to us-east-1 if not set

        try {
            $this->client = new DynamoDbClient([
                'region' => $this->region,
                'version' => 'latest',
                // Add other SDK configurations like credentials if not using IAM roles
                // 'credentials' => [
                //     'key'    => 'YOUR_AWS_ACCESS_KEY_ID',
                //     'secret' => 'YOUR_AWS_SECRET_ACCESS_KEY',
                // ],
            ]);
            // Ping the table to ensure connectivity
            $this->client->describeTable(['TableName' => $this->tableName]);
            error_log("Successfully connected to DynamoDB table '{$this->tableName}' in region '{$this->region}'.");
        } catch (AwsException $e) {
            error_log("Error connecting to DynamoDB in region '{$this->region}': " . $e->getMessage());
            // In a real failover scenario, you might attempt to switch regions here
            // or rely on an external mechanism to update APP_REGION.
            throw $e; // Re-throw to indicate connection failure
        }
    }

    public function getItem(array $key) {
        try {
            $result = $this->client->getItem([
                'TableName' => $this->tableName,
                'Key' => $key,
            ]);
            return $result->toArray();
        } catch (AwsException $e) {
            error_log("Error getting item from DynamoDB: " . $e->getMessage());
            throw $e;
        }
    }

    public function putItem(array $item) {
        try {
            $result = $this->client->putItem([
                'TableName' => $this->tableName,
                'Item' => $item,
            ]);
            return $result->toArray();
        } catch (AwsException $e) {
            error_log("Error putting item to DynamoDB: " . $e->getMessage());
            throw $e;
        }
    }

    // Add other DynamoDB operations as needed (query, scan, update, delete)

    public function getCurrentRegion(): string {
        return $this->region;
    }
}
?>

Automated Failover Orchestration with Google Cloud Load Balancing and Health Checks

The core of automated failover lies in detecting failures and rerouting traffic. On Google Cloud Platform (GCP), this is handled by combining a Global External HTTP(S) Load Balancer (named the global external Application Load Balancer in current Google Cloud documentation) with custom health checks and managed instance groups.

Setting up GCP Global Load Balancer

We’ll deploy our PHP application across multiple GCP regions (e.g., `us-central1` and `europe-west1`). A Global External HTTP(S) Load Balancer will sit in front of these deployments. The load balancer will distribute traffic to the healthy backend services in each region.

Backend Services and Instance Groups

For each region, you’ll have a Managed Instance Group (MIG) running your PHP application. Each MIG will be configured with an instance template that includes startup scripts to:

Install necessary dependencies (PHP, web server, AWS SDK).
Configure the web server (e.g., Apache or Nginx) to serve the PHP application.
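Crucially, set the `APP_REGION` environment variable to the AWS region that hosts the nearest DynamoDB replica (e.g., `us-east-1` for instances in `us-central1`, `eu-west-1` for instances in `europe-west1`). The value must be an AWS region name, because the PHP class passes it straight to the DynamoDB client; a GCP region name such as `us-central1` does not resolve to a DynamoDB endpoint.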
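
One way a startup script might derive `APP_REGION` is sketched below. It assumes the value handed to the PHP class has to be an AWS region name, so it maps the instance’s GCP region to a nearby AWS region. The mapping table and function names are illustrative assumptions. The only external fact relied on is that the Compute Engine metadata server reports the zone in the form `projects/<number>/zones/<zone>`.

```python
# Hypothetical mapping from GCP region to the AWS region hosting the
# closest DynamoDB replica; adjust to the regions you actually replicate to.
GCP_TO_AWS_REGION = {
    "us-central1": "us-east-1",
    "europe-west1": "eu-west-1",
}


def gcp_region_from_zone(zone_path: str) -> str:
    """Turn 'projects/123/zones/us-central1-a' (or 'us-central1-a') into 'us-central1'."""
    zone = zone_path.rsplit("/", 1)[-1]
    # A zone name is the region name plus a '-<letter>' suffix.
    return zone.rsplit("-", 1)[0]


def app_region_for_zone(zone_path: str, default: str = "us-east-1") -> str:
    """Pick the AWS region to export as APP_REGION for an instance in zone_path."""
    return GCP_TO_AWS_REGION.get(gcp_region_from_zone(zone_path), default)


if __name__ == "__main__":
    print(app_region_for_zone("projects/123456789/zones/europe-west1-b"))  # prints eu-west-1
```

Falling back to a default for unmapped regions keeps new deployments from starting without any `APP_REGION`, though failing loudly may be preferable in stricter environments.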

The load balancer will have a backend service associated with each regional MIG. These backend services will be configured to use a health check.

Custom Health Checks for Application and Data Layer Availability

A health check that only confirms the web server answers is insufficient. We need to verify not only that the PHP application is running but also that it can successfully communicate with its *intended* DynamoDB region. This requires a custom `/health` endpoint within your PHP application that exercises the DynamoDB connection.

PHP Health Check Endpoint Implementation

The `/health` endpoint should:

Check the `APP_REGION` environment variable to know which DynamoDB region it *should* be talking to.
Instantiate the `MultiRegionDynamoDBClient` (or a similar health-checking variant).
Attempt a simple, low-cost DynamoDB operation (e.g., `describeTable` or a `getItem` on a known, non-critical key).
Return an HTTP 200 OK if successful, and an HTTP 503 Service Unavailable if any part of the check fails.

<?php
// health.php

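require 'vendor/autoload.php';
require_once 'MultiRegionDynamoDBClient.php'; // Assuming the class is in this file

// Import the SDK exception so the catch block below matches it;
// without this, `AwsException` would resolve to a non-existent global class.
use Aws\Exception\AwsException;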

header('Content-Type: application/json');

$tableName = getenv('DYNAMODB_TABLE_NAME'); // Your DynamoDB table name
$appRegion = getenv('APP_REGION');

if (!$tableName || !$appRegion) {
    http_response_code(500);
    echo json_encode(['status' => 'error', 'message' => 'Configuration missing: DYNAMODB_TABLE_NAME or APP_REGION not set.']);
    exit;
}

try {
    // Instantiate the client, but we're primarily testing connectivity
    // The constructor already attempts a describeTable.
    // For a more robust health check, you might add a specific check here.
    $dynamoDBHealthCheck = new MultiRegionDynamoDBClient($tableName, ['default_region' => $appRegion]);

    // Optional: Perform a quick read operation to further verify
    // This assumes you have a known item, or can generate a dummy one.
    // For simplicity, we'll rely on the constructor's describeTable for now.
    // If you have a specific health check item, use getItem here.
    // $healthCheckKey = ['id' => 'health_check_key'];
    // $dynamoDBHealthCheck->getItem($healthCheckKey);

    http_response_code(200);
    echo json_encode([
        'status' => 'ok',
        'region' => $dynamoDBHealthCheck->getCurrentRegion(),
        'table' => $tableName,
        'message' => 'DynamoDB connection successful.'
    ]);

} catch (AwsException $e) {
    http_response_code(503); // Service Unavailable
    error_log("Health check failed for region {$appRegion}: " . $e->getMessage());
    echo json_encode([
        'status' => 'error',
        'region' => $appRegion,
        'table' => $tableName,
        'message' => 'DynamoDB connection failed: ' . $e->getMessage()
    ]);
} catch (Exception $e) {
    http_response_code(503); // Service Unavailable
    error_log("General health check error for region {$appRegion}: " . $e->getMessage());
    echo json_encode([
        'status' => 'error',
        'region' => $appRegion,
        'table' => $tableName,
        'message' => 'An unexpected error occurred: ' . $e->getMessage()
    ]);
}
?>

Configuring GCP Health Checks

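In GCP, you’ll create a custom health check that targets your application’s `/health` endpoint. Configure this health check to use the protocol and port your application is listening on (e.g., HTTP on port 80, or HTTPS on port 443 if the instances serve TLS, in which case the check is created with `https` rather than `http`). Keep the timeout no longer than the check interval, and choose thresholds that balance detection speed against false positives.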

# Example using gcloud CLI
gcloud compute health-checks create http my-app-health-check \
    --request-path=/health \
    --port=80 \
    --check-interval=5s \
    --timeout=5s \
    --unhealthy-threshold=3 \
    --healthy-threshold=2 \
    --global

Associate this health check with the backend services for each regional MIG. When the health check fails for a specific region’s backend service, the Global Load Balancer will stop sending traffic to that region’s MIG.
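
The interval, timeout, and threshold together determine how quickly a failing region is taken out of rotation. As a back-of-the-envelope sketch, assuming a single prober and a backend that starts failing just after a successful probe, the worst-case detection time is roughly the unhealthy threshold times the check interval, plus one timeout. The function name is hypothetical, and real-world timing may differ, so treat the result as an estimate rather than a guarantee.

```python
def worst_case_detection_seconds(check_interval: float, timeout: float,
                                 unhealthy_threshold: int) -> float:
    """Rough upper bound on the time to mark a backend unhealthy.

    Assumes one prober: up to one interval passes before the first failing
    probe starts, the remaining failing probes start one interval apart,
    and the last one may take the full timeout to be counted as failed.
    """
    return unhealthy_threshold * check_interval + timeout


if __name__ == "__main__":
    # Values from the gcloud example: 5s interval, 5s timeout, threshold 3.
    print(worst_case_detection_seconds(5, 5, 3))  # prints 20
```

With the example settings, a regional outage would be detected in about 20 seconds at worst, which is a useful number to compare against your recovery time objective.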

Implementing the Failover Trigger and Region Switch

The critical piece is the mechanism that detects a complete regional outage and orchestrates the switch. This involves monitoring the health of *all* regions and, when a primary region fails, updating the `APP_REGION` environment variable for the remaining healthy instances.

Monitoring and Alerting

Google Cloud Monitoring (formerly Stackdriver) is your primary tool here. Configure alerts based on the health check status of your backend services. You’ll want alerts when:

A regional backend service becomes unhealthy.
A regional backend service remains unhealthy for an extended period (indicating a potential full region failure).
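
The difference between these two alerts is essentially a debounce: a brief blip should only notify, while a sustained outage should start a failover. A minimal sketch of that decision follows, assuming health samples arrive as (timestamp, healthy) pairs in time order. The helper name `sustained_outage` and the 120-second window are illustrative assumptions.

```python
from typing import Iterable, Tuple


def sustained_outage(samples: Iterable[Tuple[float, bool]],
                     min_unhealthy_seconds: float = 120.0) -> bool:
    """Return True if the most recent unhealthy streak has lasted long enough.

    `samples` are (unix_timestamp, healthy) pairs sorted by timestamp.
    Any healthy sample resets the streak.
    """
    streak_start = None
    last_timestamp = None
    for timestamp, healthy in samples:
        if healthy:
            streak_start = None
        elif streak_start is None:
            streak_start = timestamp
        last_timestamp = timestamp
    if streak_start is None:
        return False
    return last_timestamp - streak_start >= min_unhealthy_seconds


if __name__ == "__main__":
    blip = [(0, True), (30, False), (60, True), (90, True)]
    outage = [(0, True), (30, False), (90, False), (150, False)]
    print(sustained_outage(blip), sustained_outage(outage))  # prints False True
```

In practice the same effect is usually configured on the alert condition itself, with a duration the condition must hold before the alert fires; the sketch only makes the logic explicit.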

Automated Region Switching Logic

This is the most complex part and can be implemented in several ways:

Option 1: Cloud Functions/Cloud Run Triggered by Alerts

When a critical alert fires (e.g., “Primary Region `us-central1` is unhealthy”), it can trigger a Cloud Function or Cloud Run service. This service would then:

Query the status of all regional backend services.
Identify the healthy secondary region(s).
Use the GCP API (via client libraries for Python, Node.js, etc.) to change the `APP_REGION` value used by the MIGs in the *remaining* healthy regions so that it points to the new primary DynamoDB region (e.g., from `us-east-1` to `eu-west-1`). Because instance templates are immutable, this means creating a new template with the updated startup script or metadata and setting it on the MIG, rather than editing the existing template.
Alternatively, and perhaps simpler, if the startup script reads `APP_REGION` from a source that can change without a new template (for example, project metadata), the Cloud Function could update that value and trigger a rolling restart of the MIGs in the healthy region, forcing the instances to pick up the new `APP_REGION` value.

# Example Python Cloud Function snippet (simplified)
import base64
import json
import os

from google.cloud import compute_v1

# GCP_PROJECT may not be set automatically on newer runtimes; configure it explicitly if needed.
PROJECT_ID = os.environ.get('GCP_PROJECT')
PRIMARY_REGION = 'us-central1'      # GCP region assumed to have failed
SECONDARY_REGION = 'europe-west1'   # GCP region assumed to be healthy
SECONDARY_AWS_REGION = 'eu-west-1'  # Example AWS region of the DynamoDB replica the survivors should use
TABLE_NAME = os.environ.get('DYNAMODB_TABLE_NAME') # Set as env var for the function

def trigger_failover(event, context):
    # Triggered by a Pub/Sub message from a Cloud Monitoring alert.
    # Background functions receive the Pub/Sub payload base64-encoded in event['data'].
    alert_details = json.loads(base64.b64decode(event['data']).decode('utf-8')) if event.get('data') else {}

    # Logic to determine which region is down and which is up
    # This would involve inspecting alert_details or querying backend health status
    # For demonstration, assume us-central1 is down and europe-west1 is up.

    print(f"Detected failure in {PRIMARY_REGION}. Initiating failover to {SECONDARY_REGION}.")

    # Option A: Point the MIG at a new instance template (templates are immutable, so a new one is created)
    # instance_template_name = f"php-app-template-{SECONDARY_REGION}"
    # update_instance_template(instance_template_name, {'APP_REGION': SECONDARY_AWS_REGION, 'DYNAMODB_TABLE_NAME': TABLE_NAME})

    # Option B: Restart the MIG's instances in the secondary region
    # This forces instances to re-run their startup script and re-read their env vars
    mig_name = f"php-app-mig-{SECONDARY_REGION}"
    trigger_rolling_restart(mig_name)

    print(f"Failover initiated. Instances in {SECONDARY_REGION} should now use {SECONDARY_AWS_REGION} as APP_REGION.")

def trigger_rolling_restart(mig_name):
    # Assumes a zonal MIG; a regional MIG would use RegionInstanceGroupManagersClient instead.
    # Note: this restarts all instances at once. For a gradual restart, use the MIG's update
    # policy (what `gcloud compute instance-groups managed rolling-action restart` does).
    instance_group_manager_client = compute_v1.InstanceGroupManagersClient()
    request = compute_v1.ApplyUpdatesToInstancesInstanceGroupManagerRequest(
        project=PROJECT_ID,
        zone=f"{SECONDARY_REGION}-b", # europe-west1 offers zones b, c and d (there is no zone 'a')
        instance_group_manager=mig_name,
        instance_group_managers_apply_updates_request_resource=compute_v1.InstanceGroupManagersApplyUpdatesRequest(
            all_instances=True,
            minimal_action="RESTART",
            most_disruptive_allowed_action="RESTART",
        ),
    )
    try:
        operation = instance_group_manager_client.apply_updates_to_instances(request=request)
        print(f"Restart operation initiated for MIG {mig_name}: {operation.name}")
    except Exception as e:
        print(f"Error triggering restart for MIG {mig_name}: {e}")

# You would need to implement update_instance_template similarly using compute_v1.InstanceTemplatesClient
# and handle IAM permissions for the Cloud Function to manage compute resources.
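
Deciding whether an incoming alert should actually start a failover can be sketched as follows. The code assumes the function event carries the Pub/Sub message base64-encoded under `data`, and that the decoded JSON contains an `incident` object with `state` and `policy_name` fields; verify those field names against the notification your alert policy actually emits. `should_fail_over` and the policy name are hypothetical.

```python
import base64
import json


def should_fail_over(event: dict, expected_policy: str) -> bool:
    """Return True only for a newly opened incident of the expected alert policy."""
    raw = event.get("data")
    if not raw:
        return False
    try:
        payload = json.loads(base64.b64decode(raw).decode("utf-8"))
    except ValueError:
        # Covers malformed base64, invalid UTF-8, and invalid JSON alike.
        return False
    if not isinstance(payload, dict):
        return False
    incident = payload.get("incident") or {}
    # Ignore "closed" notifications so that recovery does not re-trigger failover.
    return incident.get("state") == "open" and incident.get("policy_name") == expected_policy


if __name__ == "__main__":
    message = {"incident": {"state": "open", "policy_name": "primary-region-unhealthy"}}
    event = {"data": base64.b64encode(json.dumps(message).encode("utf-8")).decode("ascii")}
    print(should_fail_over(event, "primary-region-unhealthy"))  # prints True
```

Filtering on both the incident state and the policy name guards against an unrelated alert, or the "incident closed" message, restarting the only healthy region.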

Option 2: Application-Level Failover (More Complex, Less Recommended for Full Outage)

While the PHP application can detect DynamoDB connection errors, it’s generally not robust enough to orchestrate a *global* failover of its own instances. However, it can be part of a layered approach. If the primary region’s DynamoDB becomes unreachable, the application instances in *that* region could attempt to connect to the secondary region’s DynamoDB. This requires the `APP_REGION` environment variable to be dynamically updated *within* the instances themselves, or for the application to have logic to try multiple regions. This is brittle and doesn’t solve the problem of the load balancer directing traffic away from the failed region.
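
The "try multiple regions" logic mentioned above might look like the following sketch. `connect` stands in for whatever creates and pings a client (in this article, the PHP class constructor plays that role); the function names and region order are assumptions. It only helps the data path and does nothing to steer load balancer traffic.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Client = TypeVar("Client")


def connect_with_fallback(regions: Sequence[str],
                          connect: Callable[[str], Client]) -> Tuple[str, Client]:
    """Try each region in order and return (region, client) for the first that works."""
    errors = []
    for region in regions:
        try:
            return region, connect(region)
        except Exception as exc:  # a real client would catch its SDK's error type
            errors.append(f"{region}: {exc}")
    raise ConnectionError("all regions failed: " + "; ".join(errors))


if __name__ == "__main__":
    def fake_connect(region: str) -> str:
        # Simulate an outage in the first region.
        if region == "us-east-1":
            raise RuntimeError("endpoint unreachable")
        return f"client-for-{region}"

    print(connect_with_fallback(["us-east-1", "eu-west-1"], fake_connect))
```

Collecting the per-region errors before raising keeps the final exception useful for diagnosing whether the outage is regional or global.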

DNS and Global Load Balancer Considerations

The GCP Global External HTTP(S) Load Balancer provides a single, global IP address. When a regional backend service fails its health checks, the load balancer automatically stops sending traffic to that region. This is the primary traffic redirection mechanism. You don’t typically need to manage DNS failover for the application traffic itself if using a global load balancer.

However, ensure your DynamoDB client is configured to use the correct region. The `APP_REGION` environment variable is key here. When a failover occurs, the instances in the *remaining* healthy region(s) must update their `APP_REGION` to reflect the new primary. This is why the rolling restart or instance template update approach is crucial.

Testing and Validation

Thorough testing is non-negotiable. Simulate regional failures by:

Forcing health checks to fail for a specific region’s backend service (for example, by temporarily making `/health` return 503 there).
Manually stopping instances within a MIG.
Simulating network partitions or DynamoDB unavailability within a region (more difficult but valuable).

Monitor Cloud Monitoring dashboards to ensure alerts fire correctly and that traffic is rerouted as expected. Verify that the application in the surviving region(s) continues to operate correctly, connecting to the correct DynamoDB replica.
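
Part of that verification can be automated. The sketch below checks a `/health` response body against the DynamoDB region a surviving instance is expected to report. It assumes the JSON shape produced by the `health.php` example (`status` and `region` keys) and uses a hypothetical helper name. Fetching the body over HTTP is left out so the check stays easy to run inside a test suite.

```python
import json


def health_matches(body: str, expected_region: str) -> bool:
    """Return True if a /health JSON body reports status 'ok' for expected_region."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    return payload.get("status") == "ok" and payload.get("region") == expected_region


if __name__ == "__main__":
    sample = '{"status": "ok", "region": "eu-west-1", "table": "orders"}'
    print(health_matches(sample, "eu-west-1"))  # prints True
```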