Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and PHP Deployments on Google Cloud

Elasticsearch Cluster Architecture for High Availability

Achieving true disaster recovery for Elasticsearch hinges on a robust, multi-zone or multi-region architecture. For production deployments on Google Cloud Platform (GCP), this means leveraging Compute Engine instances across different zones within a region, or even across multiple regions for maximum resilience. The core principle is to ensure that a single zone or region failure does not render your data inaccessible or your search capabilities unavailable.

A typical highly available Elasticsearch cluster consists of dedicated master-eligible nodes, data nodes, and optionally ingest nodes. Two mechanisms make failover automatic. Data nodes hold the actual indices and shards, so every index needs at least one replica that lives in a different zone from its primary. Master election is quorum-based: a new master can only be elected while a majority of the master-eligible nodes can reach each other. Run an odd number of dedicated master-eligible nodes (e.g., 3 or 5), spread them across zones, and keep them free of indexing and search load so that cluster management is never starved of resources.
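The quorum rule can be checked with a few lines of arithmetic. The following is a minimal sketch of a simplified model: it ignores the automatic voting-configuration adjustments that recent Elasticsearch versions perform, and the function names and zone layouts are hypothetical.

```php
<?php
// Votes needed to elect a master: a strict majority of master-eligible nodes.
function quorumSize(int $masterEligible): int
{
    return intdiv($masterEligible, 2) + 1;
}

// True if a master can still be elected after losing any single zone.
// $mastersPerZone maps a zone name to its number of master-eligible nodes.
function survivesAnyZoneLoss(array $mastersPerZone): bool
{
    $total = array_sum($mastersPerZone);
    $quorum = quorumSize($total);
    foreach ($mastersPerZone as $zone => $count) {
        if ($total - $count < $quorum) {
            return false; // Losing this zone leaves fewer nodes than the quorum.
        }
    }
    return true;
}

// Hypothetical layouts for illustration.
var_dump(survivesAnyZoneLoss(['us-central1-a' => 1, 'us-central1-b' => 1, 'us-central1-c' => 1])); // bool(true)
var_dump(survivesAnyZoneLoss(['us-central1-a' => 2, 'us-central1-b' => 1])); // bool(false)
```

In this model a fourth master-eligible node adds nothing: four nodes need three votes and therefore tolerate one failure, exactly like three nodes. That is the reason for preferring odd counts.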

Configuring Elasticsearch for Multi-Zone Deployment

When deploying Elasticsearch on GCP, give each master-eligible node a static internal IP address so that the discovery settings below stay valid when an instance is recreated. We’ll run the nodes in Managed Instance Groups (MIGs), which automatically provision a replacement when a node fails; a regional MIG can place the replacement in any of its configured zones. Be aware that a recreated instance in a standard MIG typically receives a new internal IP, so keeping addresses stable requires a stateful MIG configuration or a hostname-based discovery list.

The Elasticsearch configuration file (elasticsearch.yml) needs to be aware of its network environment and cluster settings. Key parameters include:

cluster.name: A unique name for your Elasticsearch cluster.
node.name: A unique name for each node. This can be dynamically set using instance metadata.
network.host: Set to the node’s internal IP address.
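discovery.seed_hosts: The IP addresses or hostnames of master-eligible nodes that a starting node contacts to find the cluster.
cluster.initial_master_nodes: The node names of the master-eligible nodes that vote in the very first election. It is only used when bootstrapping a brand-new cluster and should be removed once the cluster has formed.
xpack.security.enabled: Turns on authentication and authorization; pair it with TLS on the transport and HTTP layers.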

Here’s a snippet of a typical elasticsearch.yml configuration, assuming nodes are in different zones within the same GCP region:

cluster.name: my-gcp-es-cluster
node.name: ${HOSTNAME} # Dynamically set via instance metadata or startup script

network.host: _site_ # Bind to the node's private (site-local) address; _local_ would bind to loopback only and block inter-node traffic

discovery.seed_hosts:
  - "10.128.0.10" # Example static IP for master-eligible node 1
  - "10.128.0.11" # Example static IP for master-eligible node 2
  - "10.128.0.12" # Example static IP for master-eligible node 3

cluster.initial_master_nodes:
  - "es-master-01"
  - "es-master-02"
  - "es-master-03"

http.port: 9200
transport.port: 9300

# For data nodes, configure shard allocation awareness to keep replicas
# in different zones than their primaries. Tag each node with its zone,
# then make the cluster zone-aware:
# node.attr.zone: us-central1-a
# cluster.routing.allocation.awareness.attributes: zone

# Security settings (essential for production)
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true
# ... other security configurations (certificates, users, roles)

To ensure nodes can discover each other reliably, especially in dynamic environments, using GCP’s internal DNS or a service discovery mechanism like Consul can be more robust than hardcoding IPs. However, for a fixed set of master nodes, static internal IPs are a common and effective approach.
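Because the node name and zone differ per instance, the node-specific part of elasticsearch.yml is usually generated at boot. The following is a minimal sketch of such a generator, assuming a startup script has already read the values from instance metadata; the function name, node name, and IPs are hypothetical.

```php
<?php
// Render the node-specific settings for elasticsearch.yml.
// node.attr.zone plus the awareness attribute lets Elasticsearch keep
// a primary and its replica in different zones.
function renderNodeSettings(string $nodeName, string $zone, array $seedHosts): string
{
    $lines = [
        'node.name: ' . $nodeName,
        'node.attr.zone: ' . $zone,
        'cluster.routing.allocation.awareness.attributes: zone',
        'discovery.seed_hosts:',
    ];
    foreach ($seedHosts as $host) {
        $lines[] = '  - "' . $host . '"';
    }
    return implode("\n", $lines) . "\n";
}

// Hypothetical values for one data node.
echo renderNodeSettings('es-data-01', 'us-central1-a', ['10.128.0.10', '10.128.0.11', '10.128.0.12']);
```

The output can be appended to a base configuration that holds the settings shared by all nodes, such as cluster.name and the security options.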

PHP Application Integration and Failover Logic

Your PHP application needs to be aware of the Elasticsearch cluster’s health and be able to switch to a healthy node or cluster if the primary one becomes unavailable. This involves implementing client-side failover logic.

Client-Side Failover with the Official Elasticsearch PHP Client

The official Elasticsearch PHP client (elasticsearch-php) provides mechanisms for handling multiple hosts and retries. We can configure it with a list of Elasticsearch endpoints, and when a request fails the client can retry it or fail over to another host in the list.

A common strategy is to give the client every node of the cluster (or the load balancers in front of them). A selector, such as round-robin, chooses the host for each request. If that host fails (e.g., connection refused, timeout), the client marks it as dead and retries the request on another host in the list. Combine this with explicit timeout and retry settings so that a dead node is detected quickly instead of stalling the request.

Consider a scenario where you have Elasticsearch nodes in different GCP regions for disaster recovery. Your PHP application, deployed in each region, would primarily connect to the local Elasticsearch cluster. If the local cluster is unhealthy, it would then attempt to connect to the Elasticsearch cluster in the secondary region.

<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

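// Configuration for primary Elasticsearch cluster (e.g., in us-central1).
// Plain http:// is used for brevity; with TLS and security enabled on the cluster,
// switch to https:// and supply credentials, e.g. via setBasicAuthentication().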
$primaryHosts = [
    'http://10.0.1.10:9200', // Primary ES node 1 (internal IP)
    'http://10.0.1.11:9200', // Primary ES node 2
    'http://10.0.1.12:9200', // Primary ES node 3
];

// Configuration for secondary Elasticsearch cluster (e.g., in us-east1)
$secondaryHosts = [
    'http://10.0.2.10:9200', // Secondary ES node 1 (internal IP)
    'http://10.0.2.11:9200', // Secondary ES node 2
    'http://10.0.2.12:9200', // Secondary ES node 3
];

// Build a client for the given hosts and verify that the cluster is usable.
// Throws if no host can be reached or the cluster does not reach yellow.
function connectToCluster(array $hosts)
{
    $client = ClientBuilder::create()
        ->setHosts($hosts)
        ->setConnectionPool('\Elasticsearch\ConnectionPool\StaticNoPingConnectionPool') // Fixed host list, no ping before requests
        ->setSelector('\Elasticsearch\ConnectionPool\Selectors\RoundRobinSelector') // Or other selectors
        ->setConnectionParams(['client' => [
            'timeout'         => 5, // Total request timeout in seconds
            'connect_timeout' => 5, // Connection timeout in seconds
        ]])
        ->build();

    // health() returns normally even for a red cluster, so wait briefly for
    // at least yellow; the resulting 408 response surfaces as an exception.
    $client->cluster()->health(['wait_for_status' => 'yellow', 'timeout' => '1s']);

    return $client;
}

$client = null;
$connected = false;

// Attempt to connect to the primary cluster first
try {
    $client = connectToCluster($primaryHosts);
    $connected = true;
    echo "Connected to primary Elasticsearch cluster.\n";

} catch (\Exception $e) {
    echo "Failed to connect to primary Elasticsearch cluster: " . $e->getMessage() . "\n";
    // Fallback to secondary cluster
    try {
        $client = connectToCluster($secondaryHosts);
        $connected = true;
        echo "Connected to secondary Elasticsearch cluster.\n";

    } catch (\Exception $e2) {
        echo "Failed to connect to secondary Elasticsearch cluster: " . $e2->getMessage() . "\n";
        // Handle critical failure - application cannot proceed without ES
        // Log error, display user-friendly message, etc.
        die("Elasticsearch is unavailable. Please try again later.");
    }
}

if ($connected && $client) {
    // Now you can use the $client object for your Elasticsearch operations
    // Example: Indexing a document
    try {
        $params = [
            'index' => 'my_index',
            'id'    => 'my_document_id',
            'body'  => ['test' => 'This is a test document.']
        ];
        $response = $client->index($params);
        print_r($response);
    } catch (\Exception $e) {
        echo "Error indexing document: " . $e->getMessage() . "\n";
        // Implement retry logic or error handling for specific operations
    }
}
?>

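In this example, we first attempt to connect to the primary cluster. If the cluster()->health() call throws an exception (because no host can be reached, the request times out, or Elasticsearch rejects it), we catch it and attempt to connect to the secondary cluster. Note that a reachable cluster reports a red status in the response body rather than by throwing, unless the call is made with wait_for_status. The StaticNoPingConnectionPool takes the host list as given and assumes every host is alive without pinging it first; a host that fails a request is marked as dead for a while and the next host is tried. This is useful when dealing with internal IPs or specific network configurations.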
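The same primary-then-secondary decision can be written as a loop over an ordered list of clusters, which avoids nested try/catch blocks and makes the required health status explicit. The following is a minimal sketch, not the client library’s own failover mechanism; the function names are hypothetical, and the probe is a stub standing in for a real health call.

```php
<?php
// Rank health statuses so they can be compared; unknown values rank lowest.
function statusRank(string $status): int
{
    $ranks = ['red' => 0, 'yellow' => 1, 'green' => 2];
    return $ranks[$status] ?? -1;
}

// Return the name of the first cluster whose probe reports at least $minStatus.
// $probe receives the host list and returns 'green', 'yellow' or 'red',
// or throws if the cluster cannot be reached. Returns null if none qualifies.
function selectCluster(array $clusters, callable $probe, string $minStatus = 'yellow'): ?string
{
    foreach ($clusters as $name => $hosts) {
        try {
            if (statusRank($probe($hosts)) >= statusRank($minStatus)) {
                return $name;
            }
        } catch (\Exception $e) {
            // Unreachable cluster: fall through to the next one.
        }
    }
    return null;
}

// Stub probe for illustration: the primary is reachable but red.
$clusters = [
    'primary'   => ['http://10.0.1.10:9200'],
    'secondary' => ['http://10.0.2.10:9200'],
];
$probe = function (array $hosts): string {
    return $hosts[0] === 'http://10.0.1.10:9200' ? 'red' : 'green';
};
echo selectCluster($clusters, $probe), "\n"; // prints "secondary"
```

In a real deployment the probe would build a client for the given hosts and return the status field of the cluster()->health() response. Keeping the probe injectable also makes the failover order easy to unit test.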

Leveraging GCP Load Balancers for Elasticsearch

For a more robust and managed solution, consider placing a GCP load balancer in front of your Elasticsearch nodes, typically an internal passthrough Network Load Balancer (formerly Internal TCP/UDP Load Balancing), since the PHP application reaches the cluster over the VPC. This abstracts the individual node IPs and provides a single stable endpoint for your PHP application.

A regional load balancer distributes traffic across your Elasticsearch nodes within a region, and its health checks automatically remove unhealthy nodes from the pool. For multi-region failover, you would typically use a global load balancer, such as the Global External HTTP(S) Load Balancer, with a backend service pointing directly at the instance groups in each region; a passthrough load balancer cannot itself serve as a backend. Because an external load balancer gives the cluster a public endpoint, restrict access to it tightly.

Your PHP application then connects to the IP address of the load balancer, and the failover logic shifts from client-side host list management to the load balancer’s health checking and traffic distribution.

<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

// Connect to the Internal Load Balancer IP for the Elasticsearch cluster
$esLoadBalancerIp = '10.0.0.50'; // Example Internal Load Balancer IP

try {
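    $client = ClientBuilder::create()
        ->setHosts([$esLoadBalancerIp . ':9200'])
        ->setConnectionPool('\Elasticsearch\ConnectionPool\StaticNoPingConnectionPool')
        ->setConnectionParams(['client' => [
            'timeout'         => 5, // Total request timeout in seconds
            'connect_timeout' => 5, // Connection timeout in seconds
        ]])
        ->build();

    // Perform a health check; waiting for yellow makes a red cluster raise an exception
    $client->cluster()->health(['wait_for_status' => 'yellow', 'timeout' => '1s']);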
    echo "Connected to Elasticsearch via Load Balancer.\n";

    // Use $client for operations...

} catch (\Exception $e) {
    echo "Failed to connect to Elasticsearch via Load Balancer: " . $e->getMessage() . "\n";
    // If the LB is unavailable, it implies a significant regional outage or LB failure.
    // This is where you might implement a fallback to a cross-region LB or a direct connection to a secondary cluster's LB.
    // For simplicity, we'll just die here, but a real-world scenario would have more sophisticated cross-region LB logic.
    die("Elasticsearch is unavailable. Please try again later.");
}
?>

When using load balancers, the PHP client configuration becomes simpler, pointing to a single endpoint. The resilience is then managed by GCP’s infrastructure. For multi-region DR, you’d configure a global load balancer to direct traffic to the closest region that still has healthy backends.
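Even with a load balancer in place, requests that are in flight while a backend is being removed can still fail, so the application may want to retry individual operations. The following is a minimal sketch of retrying with exponential backoff; the function name and default values are hypothetical, and the sleep function is injectable so the delays can be observed without waiting.

```php
<?php
// Call $operation until it succeeds or $maxAttempts is reached, doubling the
// wait between attempts. $sleep receives the delay in milliseconds.
function retryWithBackoff(callable $operation, int $maxAttempts = 4, int $baseDelayMs = 200, ?callable $sleep = null)
{
    $sleep = $sleep ?? function (int $ms): void {
        usleep($ms * 1000);
    };
    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (\Exception $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // Out of attempts: surface the last error.
            }
            $sleep($baseDelayMs * (2 ** ($attempt - 1)));
        }
    }
}

// Illustration: an operation that fails twice, then succeeds.
$calls = 0;
$result = retryWithBackoff(function () use (&$calls) {
    $calls++;
    if ($calls < 3) {
        throw new \RuntimeException('connection reset');
    }
    return 'indexed';
}, 4, 200, function (int $ms): void {
    echo "waiting {$ms} ms\n";
});
echo $result, " after {$calls} attempts\n";
```

Retrying is only safe for idempotent operations. Indexing a document with an explicit id, as in the earlier example, qualifies, whereas indexing with auto-generated ids could create duplicates.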

Automated Failover Orchestration with GCP Tools

True automated failover involves more than just client-side logic or load balancer health checks. It requires a system that can detect a complete regional failure and initiate actions, such as promoting a follower (replica) Elasticsearch cluster to primary, or reconfiguring global DNS or load balancers.

Managed Instance Groups (MIGs) and Autohealing

GCP’s Managed Instance Groups are fundamental. Run your Elasticsearch nodes in regional MIGs, which span multiple zones within a region, and enable autohealing. The MIG’s health check monitors the instances, and if an instance fails its health check, the MIG automatically recreates it. This provides resilience against individual instance failures. Because a recreated data node may come back with an empty disk, the replicas on the surviving nodes are what keep the data available while its shards are rebuilt.

For disaster recovery, you’d typically have separate MIGs for each region. A failure in one region would not affect the other. The challenge then becomes how to direct traffic to the healthy region.

Global Load Balancing and Health Checks

A Global External HTTP(S) Load Balancer is your primary tool for directing traffic to the correct region. Configure a backend service whose backends are the instance groups of each regional Elasticsearch cluster. The global load balancer performs health checks against those backends. If the backends in one region become unhealthy, the global load balancer automatically stops sending traffic to that region and directs it to the remaining healthy regions.

The health check configuration is critical. Note that /_cluster/health answers with HTTP 200 even when the cluster is red, because the status is reported in the response body. To make the status code meaningful, probe /_cluster/health?wait_for_status=yellow&timeout=3s instead: Elasticsearch then returns 200 once the cluster is at least yellow, and 408 if it does not get there within the timeout. Ensure that firewall rules allow the health check probes to reach your Elasticsearch nodes, and keep in mind that with security and HTTP TLS enabled the probe needs an HTTPS health check and an endpoint it can call without credentials.

# Example of a GCP health check configuration (conceptual)
# This would be configured via `gcloud` CLI or GCP Console

gcloud compute health-checks create http es-health-check \
    --request-path="/_cluster/health?wait_for_status=yellow&timeout=3s" \
    --port=9200 \
    --check-interval=30s \
    --timeout=5s \
    --unhealthy-threshold=3 \
    --healthy-threshold=2 \
    --global # Scope must match the backend service; use --region=us-central1 for a regional one

# Then associate this health check with your backend services
gcloud compute backend-services update es-backend-service-us-central1 \
    --health-checks=es-health-check \
    --global # if it's a global backend service
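The interval, timeout, and threshold values determine how quickly a failure is noticed, which feeds directly into your recovery time. The following is a rough estimate under a simplified model, not a figure guaranteed by GCP: a backend is marked unhealthy after failing the configured number of consecutive probes, and the last probe may take up to the full timeout to fail. The function name is hypothetical.

```php
<?php
// Rough worst-case time, in seconds, for a backend to be marked unhealthy:
// it must fail $unhealthyThreshold consecutive probes sent $intervalSec apart,
// and the final probe may take up to $timeoutSec before it counts as failed.
function worstCaseDetectionSeconds(int $intervalSec, int $timeoutSec, int $unhealthyThreshold): int
{
    return $intervalSec * $unhealthyThreshold + $timeoutSec;
}

echo worstCaseDetectionSeconds(30, 5, 3), "\n"; // values from the command above: 95
echo worstCaseDetectionSeconds(5, 3, 2), "\n";  // a tighter, hypothetical setting: 13
```

With the values shown in the command, a failed region could keep receiving traffic for about a minute and a half. Shorter intervals reduce that window at the cost of more probe traffic and a higher risk of flapping.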

When the global load balancer detects a regional failure, it reroutes traffic automatically. Your PHP application, configured to point to the global load balancer’s IP, keeps using the same endpoint while its requests are served by the healthy region. This only helps if the healthy region holds a usable copy of the data, which is what the replication strategies in the next section address.

Automated Cluster Promotion (Advanced)

For scenarios requiring a fully active-active or active-passive setup where data consistency across regions is paramount, you might need more advanced strategies. This could involve:

Cross-cluster replication (CCR): Elasticsearch’s built-in CCR replicates indices from a leader cluster to one or more follower clusters (the feature requires an appropriate Elastic subscription level). If the leader cluster fails, you promote each follower index to a regular, writable index by pausing replication, closing the index, unfollowing the leader, and reopening the index. Writes that had not yet been replicated are lost, so replication lag determines your effective RPO.
Custom automation scripts: Cloud Functions or Cloud Run services, triggered by monitoring alerts (e.g., from Cloud Monitoring), can run scripts that perform cluster promotion, update DNS records, or reconfigure load balancers.
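To make the promotion step concrete, the following sketch lists the REST calls an automation script would issue against the follower cluster for one index. It only builds the request list and does not send anything; the function name is hypothetical, and error handling, ordering across multiple indices, and client redirection are left out.

```php
<?php
// Return the ordered REST calls that turn a CCR follower index into a
// regular, writable index: pause following, close, unfollow, reopen.
function followerPromotionRequests(string $followerIndex): array
{
    $index = rawurlencode($followerIndex);
    return [
        ['POST', "/{$index}/_ccr/pause_follow"],
        ['POST', "/{$index}/_close"],
        ['POST', "/{$index}/_ccr/unfollow"],
        ['POST', "/{$index}/_open"],
    ];
}

foreach (followerPromotionRequests('my_index') as [$method, $path]) {
    echo "{$method} {$path}\n";
}
```

A Cloud Function or Cloud Run job could iterate over this list with any HTTP client and stop at the first failed call, so that a half-promoted index is noticed rather than silently left closed.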

Implementing automated cluster promotion is complex and depends heavily on your specific RPO (Recovery Point Objective) and RTO (Recovery Time Objective). For most use cases, relying on GCP’s global load balancing with regional MIGs and health checks provides a strong foundation for automated failover without the complexity of full cluster promotion automation.