Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and PHP Deployments on OVH

Elasticsearch Cluster Architecture for High Availability

Disaster recovery for Elasticsearch starts with an architecture that spans multiple availability zones (AZs) or, better, multiple regions. On OVH, this means distributing nodes across distinct physical locations. A common pattern is three master-eligible nodes, with data nodes spread across the same locations to preserve quorum and data redundancy. This example focuses on a single region with multiple availability zones, a common and cost-effective starting point. Full DR calls for a multi-region setup, but the same principles of node distribution and quorum apply.

A critical component is the Elasticsearch cluster’s quorum. With an odd number of master-eligible nodes (typically 3 or 5), the cluster can tolerate the failure of (N-1)/2 master nodes. For a 3-node master set, this means the cluster can survive the loss of one master node. Data nodes should also be distributed across availability zones. Elasticsearch’s shard allocation awareness can be configured to ensure replicas are placed in different zones than their primaries, further enhancing resilience.
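The quorum arithmetic above can be checked with a few lines of PHP. This is a minimal sketch; the function names quorumSize and tolerableFailures are illustrative and not part of any Elasticsearch API.

```php
<?php
// Sketch: quorum arithmetic for a set of master-eligible nodes.

// Number of nodes that must agree to elect a master (a strict majority).
function quorumSize(int $masterEligible): int {
    if ($masterEligible < 1) {
        throw new InvalidArgumentException("Need at least one master-eligible node.");
    }
    return intdiv($masterEligible, 2) + 1;
}

// Number of master-eligible nodes that can fail while a majority remains.
function tolerableFailures(int $masterEligible): int {
    return $masterEligible - quorumSize($masterEligible);
}

foreach ([3, 4, 5] as $n) {
    printf("%d nodes: quorum %d, tolerates %d failure(s)\n", $n, quorumSize($n), tolerableFailures($n));
}
```

Note that four nodes tolerate no more failures than three, which is why odd counts are the usual recommendation.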

Configuring Elasticsearch for Zone Awareness

The discovery.seed_hosts and cluster.initial_master_nodes settings let the nodes find each other and bootstrap the cluster; they do not make Elasticsearch zone-aware by themselves. Zone awareness comes from two further pieces: tagging each node with a custom attribute that records its availability zone, and listing that attribute in cluster.routing.allocation.awareness.attributes so the allocator takes it into account.

On each Elasticsearch node, in elasticsearch.yml, configure the following:

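cluster.name: "my-production-cluster"
node.name: "${HOSTNAME}" # Must match the names listed in cluster.initial_master_nodes exactly
network.host: 0.0.0.0 # Binds to all interfaces; restrict to a private address where possible
discovery.seed_hosts:
  - "es-node-1.example.com:9300"
  - "es-node-2.example.com:9300"
  - "es-node-3.example.com:9300"
# Only used when bootstrapping a brand-new cluster; remove once the cluster has formed
cluster.initial_master_nodes:
  - "es-node-1.example.com"
  - "es-node-2.example.com"
  - "es-node-3.example.com"

# Zone awareness configuration
node.attr.zone: "gra-a" # Example: Set this dynamically based on the node's actual zone
cluster.routing.allocation.awareness.attributes: zone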

The node.attr.zone attribute is crucial. This value should be dynamically set during node provisioning or startup, reflecting the actual OVH availability zone the node resides in. For example, if a node is in OVH’s ‘Gravelines’ region, ‘zone A’, this attribute would be set to ‘gra-a’ or similar. This allows Elasticsearch to make intelligent decisions about shard placement.

Shard Allocation with Awareness Attributes

Shard placement across zones is governed by the cluster-level setting cluster.routing.allocation.awareness.attributes; it is not an index setting. Set it to zone in elasticsearch.yml on every master-eligible node, or apply it dynamically through the cluster settings API. At the index level, you only choose the number of shards and replicas, typically via the Elasticsearch API at index creation.

When creating an index:

{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}

With awareness enabled, Elasticsearch spreads the copies of each shard across nodes that have different values for the zone attribute. With 3 data nodes across 3 zones and 2 replicas per shard, every shard has one copy in each zone, so the loss of any single zone still leaves two copies available.
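One way to confirm that the copies of a shard really end up in different zones is to compare the shard listing against a node-to-zone map. The sketch below assumes rows shaped like the output of GET _cat/shards?format=json (fields index, shard, prirep, node); the function name and the $nodeZones map are hypothetical and would come from your own inventory.

```php
<?php
// Sketch: report shards that have two copies in the same zone.
// $shards: rows shaped like GET _cat/shards?format=json output.
// $nodeZones: hypothetical map of node name => zone, built from your inventory.
function findZoneCollisions(array $shards, array $nodeZones): array {
    $seen = [];       // "index/shard" => [zone => node name]
    $collisions = [];
    foreach ($shards as $row) {
        if (empty($row['node'])) {
            continue; // Unassigned copies have no node
        }
        $key = $row['index'] . '/' . $row['shard'];
        // Nodes missing from the map get their own bucket to avoid false positives
        $zone = $nodeZones[$row['node']] ?? ('unmapped:' . $row['node']);
        if (isset($seen[$key][$zone])) {
            $collisions[] = "$key has two copies in zone $zone";
        }
        $seen[$key][$zone] = $row['node'];
    }
    return $collisions;
}
```

An empty result means every assigned copy of every shard sits in a distinct zone.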

PHP Application Integration and Failover Logic

Your PHP application needs to be aware of the Elasticsearch cluster topology and implement failover logic. The official Elasticsearch PHP client provides mechanisms for handling multiple hosts and retries, but for robust failover, we often need more explicit control.

A common strategy is to maintain a list of Elasticsearch endpoints in your PHP configuration and iterate through them upon connection or in case of failure. Using a load balancer in front of Elasticsearch is also a viable option, but direct client-side failover offers fine-grained control.

Implementing Client-Side Failover in PHP

We can create a simple wrapper class around the Elasticsearch client to manage multiple hosts and implement retry logic. This class will attempt to connect to a primary host, and if it fails, it will cycle through a list of secondary hosts.

<?php
require 'vendor/autoload.php'; // Assuming you use Composer

// Namespace used by elasticsearch-php 7.x; in 8.x the class is Elastic\Elasticsearch\ClientBuilder
use Elasticsearch\ClientBuilder;

class ElasticsearchFailoverClient {
    private $hosts;
    private $client;
    private $currentHostIndex = 0;
    private $maxRetries = 3; // Number of times to retry a failed operation on one host

    public function __construct(array $esHosts) {
        if (empty($esHosts)) {
            throw new \RuntimeException("No Elasticsearch hosts configured.");
        }
        $this->hosts = array_values($esHosts);
        $this->initializeClient();
    }

    // Connect to the current host, moving down the list until one responds
    private function initializeClient() {
        while ($this->currentHostIndex < count($this->hosts)) {
            $host = $this->hosts[$this->currentHostIndex];
            try {
                $client = ClientBuilder::create()
                    ->setHosts([$host])
                    ->build();
                // Perform a simple ping to verify connection. Some client versions
                // return false from ping() instead of throwing, so check the result.
                if (!$client->ping()) {
                    throw new \RuntimeException("Ping failed");
                }
                $this->client = $client;
                error_log("Successfully connected to Elasticsearch at: " . $host);
                return;
            } catch (\Exception $e) {
                error_log("Failed to connect to Elasticsearch at: " . $host . ". Error: " . $e->getMessage());
                $this->currentHostIndex++;
            }
        }
        // Every host failed: reset so the next call starts again from the first host
        $this->client = null;
        $this->currentHostIndex = 0;
        throw new \RuntimeException("All Elasticsearch hosts are unreachable.");
    }

    private function failoverToNextHost() {
        $this->currentHostIndex++;
        $this->initializeClient();
    }

    public function __call($method, $arguments) {
        if (!$this->client) {
            $this->initializeClient(); // Re-initialize if client was lost
        }
        $retries = 0;
        while (true) {
            try {
                // Dynamically call the method on the underlying Elasticsearch client
                return call_user_func_array([$this->client, $method], $arguments);
            } catch (\Exception $e) {
                // This also catches request errors (e.g. a malformed query) that no
                // other host will fix; narrow the catch to connection errors in production.
                error_log("Operation failed on host " . $this->hosts[$this->currentHostIndex] . ": " . $e->getMessage());
                $retries++;
                if ($retries > $this->maxRetries) {
                    error_log("Max retries reached. Attempting failover.");
                    try {
                        $this->failoverToNextHost();
                    } catch (\RuntimeException $failoverException) {
                        // No host left: surface the original error as the cause
                        throw new \RuntimeException("Elasticsearch operation failed after multiple retries and failover attempts.", 0, $e);
                    }
                    $retries = 0; // Reset retries for the new host
                    continue; // Try the operation again on the new host
                }
                // Still on the same host: wait a bit before retrying
                usleep(500000); // 500ms
            }
        }
    }
}

// --- Usage Example ---
$esHosts = [
    'http://es-node-1.example.com:9200',
    'http://es-node-2.example.com:9200',
    'http://es-node-3.example.com:9200',
    // Add more hosts as needed, ideally in different AZs
];

try {
    $esClient = new ElasticsearchFailoverClient($esHosts);

    // Example: Index a document
    $params = [
        'index' => 'my_index',
        'id'    => '1',
        'body'  => ['testField' => 'abc']
    ];
    $response = $esClient->index($params);
    print_r($response);

    // Example: Search
    $searchParams = [
        'index' => 'my_index',
        'body'  => [
            'query' => [
                'match' => [
                    'testField' => 'abc'
                ]
            ]
        ]
    ];
    $searchResponse = $esClient->search($searchParams);
    print_r($searchResponse);

} catch (\RuntimeException $e) {
    // Log the error and potentially alert administrators
    error_log("Elasticsearch critical error: " . $e->getMessage());
    // Display a user-friendly error message or redirect to an error page
    echo "An error occurred while accessing our data service. Please try again later.";
}
?>

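This wrapper class connects to the first host that responds. If an operation still fails after the configured number of retries, it switches to the next host in the list. Because it retries on any exception, request errors such as a malformed query are retried as well, so production code should limit retries to connection-level failures. Within those limits, it provides a basic but effective client-side failover mechanism.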
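The wrapper sleeps a fixed 500 ms between retries. A common refinement is exponential backoff, and the retry order is easier to test when it is separated from the Elasticsearch client. The sketch below is one way to do that; callWithFailover and its parameters are hypothetical names, and the operation is any callable that takes a host and either returns a result or throws.

```php
<?php
// Sketch: try an operation against each host in order, retrying a few times per host.
// $operation receives the host and either returns a result or throws.
// $sleep is injectable so tests can skip the real delay.
function callWithFailover(array $hosts, callable $operation, int $maxRetries = 3, ?callable $sleep = null) {
    $sleep = $sleep ?? function (int $attempt): void {
        // Exponential backoff: 0.5s, 1s, 2s, capped at 2s
        usleep(min(2000000, 250000 * (2 ** $attempt)));
    };
    $lastError = null;
    foreach ($hosts as $host) {
        for ($attempt = 1; $attempt <= $maxRetries; $attempt++) {
            try {
                return $operation($host);
            } catch (\Exception $e) {
                $lastError = $e;
                if ($attempt < $maxRetries) {
                    $sleep($attempt); // No need to wait before moving to the next host
                }
            }
        }
    }
    throw new \RuntimeException("All hosts failed.", 0, $lastError);
}
```

In an application, the operation would build or reuse a client for the given host and run the actual request; in a unit test, it can be a closure that fails for selected hosts.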

Orchestration and Health Checks

For automated failover to be truly effective, you need robust health checks and an orchestration layer. This could be:

Kubernetes: If running Elasticsearch and your PHP app on Kubernetes, you can leverage Kubernetes’ built-in health probes (liveness and readiness probes) and service discovery. Elasticsearch operators can manage cluster health and scaling. For PHP, Kubernetes Services with multiple Endpoints (pointing to healthy pods) and readiness probes are key.
OVH Managed Services: OVHcloud offers a Managed Kubernetes Service and other managed services that can abstract away some of this complexity.
Custom Scripts/Monitoring: For bare-metal or VM deployments, you’ll need external monitoring tools (e.g., Prometheus, Nagios) to check Elasticsearch and application health. These tools can trigger automated actions, such as restarting services or updating DNS records.
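Whichever orchestration layer you choose, it needs a signal it can act on. For the PHP side, that is usually a health endpoint that translates Elasticsearch cluster health into an HTTP status code. The sketch below assumes the decoded JSON body of GET _cluster/health, whose status field is green, yellow, or red; the function name and the allowYellow switch are illustrative.

```php
<?php
// Sketch: map a decoded GET _cluster/health response to an HTTP status code
// suitable for a readiness probe or an external monitor.
function readinessStatusCode(?array $clusterHealth, bool $allowYellow = true): int {
    if ($clusterHealth === null || !isset($clusterHealth['status'])) {
        return 503; // No usable response: treat as not ready
    }
    switch ($clusterHealth['status']) {
        case 'green':
            return 200;
        case 'yellow':
            // Yellow means all primaries are assigned but some replicas are not
            return $allowYellow ? 200 : 503;
        default:
            return 503; // Red or unexpected value
    }
}
```

A health endpoint could then call http_response_code(readinessStatusCode($health)) after fetching and decoding the cluster health response.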

Automating Elasticsearch Node Recovery

When an Elasticsearch node fails (e.g., due to an AZ outage), the cluster promotes replica shards to primaries and, after a delay (one minute by default), rebuilds the missing replicas on the remaining nodes, provided you have sufficient replicas and nodes. However, recovering the failed node itself requires automation.

For VM-based deployments on OVH:

Automated Re-provisioning: Use infrastructure-as-code tools like Terraform or Ansible. When a node is detected as unhealthy, the orchestration layer can trigger a script to terminate the old VM and provision a new one in the same AZ.
Configuration Management: Ensure your provisioning scripts automatically configure the new node with the correct Elasticsearch settings (cluster name, seed hosts, node attributes like zone).
DNS/Service Discovery Updates: If using DNS for service discovery, ensure the new node’s IP address is registered. If using a load balancer, update its backend pool.

Example Ansible tasks that set the zone attribute on a new Elasticsearch node and start the service:

- name: Configure Elasticsearch node attributes
  ansible.builtin.lineinfile:
    path: /etc/elasticsearch/elasticsearch.yml
    regexp: '^node\.attr\.zone:'
    line: "node.attr.zone: {{ ovh_availability_zone }}" # Variable passed to Ansible
    state: present
  notify: restart elasticsearch # Requires a handler with this exact name

- name: Ensure Elasticsearch service is running and enabled
  ansible.builtin.systemd:
    name: elasticsearch
    state: started
    enabled: true

Multi-Region Disaster Recovery Strategy

For true disaster recovery, a single-region, multi-AZ setup is insufficient. You need a multi-region strategy. This involves replicating your Elasticsearch data and application instances to a geographically separate region.

Key considerations for multi-region Elasticsearch DR:

Cross-Cluster Replication (CCR): Elasticsearch’s CCR feature allows you to replicate indices from a primary cluster in one region to a secondary cluster in another. This is essential for keeping your DR site’s data up-to-date. CCR is a paid feature, so check that your Elasticsearch subscription level includes it.
Active-Passive vs. Active-Active: Decide on your RTO (Recovery Time Objective) and RPO (Recovery Point Objective). Active-passive is simpler and often more cost-effective, with the DR site only becoming active upon a disaster. Active-active provides near-zero downtime but is significantly more complex and expensive.
DNS Failover: Implement a global DNS solution (e.g., OVH’s DNS, Cloudflare, AWS Route 53) that can automatically reroute traffic to the DR region if the primary region becomes unavailable. This DNS failover should be triggered by comprehensive health checks of your primary infrastructure.
Application Deployment: Ensure your PHP application is also deployed in the DR region, with its own data sources and dependencies replicated or available.

Implementing Cross-Cluster Replication (CCR)

CCR requires setting up a remote cluster connection and then configuring replication for specific indices.

1. **Configure Remote Cluster:** On your DR Elasticsearch cluster, define the primary cluster as a remote cluster in elasticsearch.yml. The alias chosen here (primary_cluster) is the name the follower request refers to:

cluster.remote.primary_cluster:
  seeds:
    - "es-node-1.primary.example.com:9300"
    - "es-node-2.primary.example.com:9300"

2. **Enable Replication:** Use the Elasticsearch API to create a follower index:

PUT /my_follower_index/_ccr/follow
{
  "remote_cluster": "primary_cluster",
  "leader_index": "my_leader_index"
}

This request creates my_follower_index on the current (DR) cluster and starts replicating my_leader_index from primary_cluster. You would typically automate the creation of these follower indices for all critical indices, for example with CCR auto-follow patterns.
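Once a follower index is replicating, its lag behind the leader is a rough indicator of the RPO you would get in a failover. The sketch below assumes the response shape of the follow stats API (GET /my_follower_index/_ccr/stats), where each shard entry carries leader_global_checkpoint and follower_global_checkpoint; verify the field names against your Elasticsearch version. The function name ccrMaxLag is illustrative.

```php
<?php
// Sketch: largest number of operations any follower shard is behind its leader.
// $followStats: decoded JSON from the CCR follow stats API (shape assumed, see above).
function ccrMaxLag(array $followStats): int {
    $maxLag = 0;
    foreach ($followStats['indices'] ?? [] as $index) {
        foreach ($index['shards'] ?? [] as $shard) {
            $lag = (int) $shard['leader_global_checkpoint']
                 - (int) $shard['follower_global_checkpoint'];
            $maxLag = max($maxLag, $lag);
        }
    }
    return $maxLag;
}
```

A monitoring job could alert when this value stays above a threshold you consider acceptable for your RPO.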

Automated DNS Failover with Health Checks

A robust DR strategy relies on automated traffic redirection. This involves:

Comprehensive Health Checks: Implement checks that go beyond simple port availability. These checks should verify the health of your Elasticsearch cluster (e.g., cluster status API, node health) and your PHP application (e.g., a dedicated health check endpoint that verifies database connectivity, Elasticsearch connectivity, etc.).
Monitoring System: Use a system like Prometheus with Alertmanager, or a commercial service, to continuously run these health checks.
Automated DNS Updates: Configure the monitoring system to trigger an automated update of your DNS records (e.g., changing an A record’s IP address) when the primary region’s health checks fail. OVH’s API can be used to programmatically update DNS records. Keep the record’s TTL low, since resolvers continue to serve the old address until the cached record expires.

Example conceptual flow:

#!/usr/bin/env bash
# Health check script (simplified)
PRIMARY_ES_HEALTH_URL="http://es-primary.example.com:9200/_cluster/health?wait_for_status=yellow&timeout=30s"
PRIMARY_APP_HEALTH_URL="http://app-primary.example.com/health"

# -o /dev/null discards the response body; --max-time makes a hung endpoint count as a failure
if ! curl -s --fail --max-time 40 -o /dev/null "$PRIMARY_ES_HEALTH_URL" || ! curl -s --fail --max-time 10 -o /dev/null "$PRIMARY_APP_HEALTH_URL"; then
    echo "Primary region unhealthy. Initiating failover..."
    # Call OVH API to update DNS record to point to DR region IP
    # (illustrative command only; check the actual syntax of your tooling)
    # e.g., ovh-cli domain update-record --domain example.com --name www --type A --value DR_IP_ADDRESS
    echo "DNS failover initiated."
else
    echo "Primary region healthy."
fi

This script would be run periodically by a cron job or a dedicated monitoring agent. The actual DNS update would involve interacting with OVH’s API, likely using their CLI tool or direct API calls via cURL or a dedicated SDK.
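A single failed check can be a transient blip, and switching DNS on every blip causes flapping between regions. One common safeguard is to fail over only after several consecutive failures. The sketch below shows that decision in isolation; shouldFailover and its default threshold of 3 are assumptions for illustration, and the history of check results would be stored by whatever runs the checks.

```php
<?php
// Sketch: decide whether to fail over based on recent health check results.
// $recentResults: list of booleans, oldest first, true = check passed.
// Returns true only when the last $threshold checks all failed.
function shouldFailover(array $recentResults, int $threshold = 3): bool {
    if ($threshold < 1 || count($recentResults) < $threshold) {
        return false; // Not enough history to decide
    }
    $lastN = array_slice($recentResults, -$threshold);
    return !in_array(true, $lastN, true);
}
```

The trade-off is detection time: with checks every 30 seconds and a threshold of 3, failover starts no sooner than about 90 seconds after the first failure.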