Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and PHP Deployments on AWS

Elasticsearch Cluster Architecture for High Availability

Achieving robust disaster recovery for Elasticsearch hinges on a well-architected cluster that inherently supports high availability. This involves understanding Elasticsearch’s distributed nature and leveraging its built-in features for resilience. We’ll focus on a multi-AZ deployment strategy within AWS, ensuring that a single Availability Zone failure does not cripple search capabilities.

A critical component is the Elasticsearch master node quorum. A master can only be elected by a majority of the master-eligible nodes, so an odd number of them is recommended. A common production pattern is at least three master-eligible nodes, each in a different Availability Zone. With that layout, losing one zone still leaves two of three nodes, which is enough to elect a master. The majority rule is also what prevents split-brain, where two sides of a network partition each elect their own master and diverge.
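The majority arithmetic behind this recommendation can be sketched in a few lines of PHP. This is an illustration only: the function names are made up for this example, and Elasticsearch manages its own voting configuration internally rather than exposing anything like this.

```php
<?php
// Majority needed to elect a master among $masterEligible voting nodes.
function masterQuorum(int $masterEligible): int
{
    return intdiv($masterEligible, 2) + 1;
}

// How many master-eligible nodes can be lost while a majority remains.
function toleratedFailures(int $masterEligible): int
{
    return $masterEligible - masterQuorum($masterEligible);
}

foreach ([3, 4, 5] as $n) {
    printf(
        "%d nodes: quorum %d, tolerates %d failure(s)\n",
        $n,
        masterQuorum($n),
        toleratedFailures($n)
    );
}
```

Four nodes tolerate no more failures than three, which is why an even count adds cost without adding resilience.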

Configuring Master Nodes and Shard Allocation

The Elasticsearch configuration file (elasticsearch.yml) is central to defining cluster behavior. For master nodes, the following settings are crucial:

cluster.name: "my-production-cluster"
node.name: ${HOSTNAME}
network.host: 0.0.0.0
discovery.seed_hosts:
  - "es-node-1.example.com:9300"
  - "es-node-2.example.com:9300"
  - "es-node-3.example.com:9300"
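# Bootstrap only: entries must match each node's node.name exactly.
# Remove this setting once the cluster has formed.
cluster.initial_master_nodes: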
  - "es-node-1.example.com"
  - "es-node-2.example.com"
  - "es-node-3.example.com"
# For data nodes, you might have these settings:
# node.roles: [ data, ingest ]
# For master-only nodes:
# node.roles: [ master ]

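The discovery.seed_hosts list gives a starting node the addresses of master-eligible nodes it can contact to find the cluster. cluster.initial_master_nodes is used only during the very first startup of a brand-new cluster, to bootstrap the first master election. Once the cluster has formed, remove this setting from every node: it is ignored by nodes that have already joined a cluster, and leaving it in place risks accidentally bootstrapping a second, separate cluster if a node is ever started with an empty data directory.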

Shard allocation awareness is essential for multi-AZ deployments. It tells Elasticsearch to spread the copies of each shard (the primary and its replicas) across Availability Zones, so that losing one zone cannot remove every copy. It is configured with the cluster.routing.allocation.awareness.attributes setting, which names a custom node attribute. Elasticsearch does not read EC2 tags, so each node must declare the attribute itself, for example as node.attr.zone in elasticsearch.yml.

# In elasticsearch.yml on each node
node.attr.zone: us-east-1a   # set to the Availability Zone this node runs in
cluster.routing.allocation.awareness.attributes: zone

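Set node.attr.zone on every node to the Availability Zone of its EC2 instance, for example us-east-1a. A bootstrap script can read the zone from the instance metadata service and write it into elasticsearch.yml. EC2 tags remain useful for inventory, but Elasticsearch does not see them. With the attribute in place, Elasticsearch tries to place a shard's primary and its replicas on nodes in different zones, which only helps if each index has at least one replica.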
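One way to verify the zone layout is to group nodes by the zone attribute they report. The sketch below assumes the rows come from GET /_cat/nodeattrs?format=json, which returns one row per node attribute with node, attr and value fields. The function name and the sample data are hypothetical.

```php
<?php
// Group node names by the value of one custom node attribute.
// $rows mirrors the decoded JSON of GET /_cat/nodeattrs?format=json.
function nodesByZone(array $rows, string $attribute = 'zone'): array
{
    $zones = [];
    foreach ($rows as $row) {
        if (($row['attr'] ?? null) === $attribute) {
            $zones[$row['value']][] = $row['node'];
        }
    }
    ksort($zones); // stable ordering for display and comparison
    return $zones;
}

// Hypothetical sample rows; a real check would fetch and json_decode() them.
$rows = [
    ['node' => 'es-node-1', 'attr' => 'zone', 'value' => 'us-east-1a'],
    ['node' => 'es-node-1', 'attr' => 'xpack.installed', 'value' => 'true'],
    ['node' => 'es-node-2', 'attr' => 'zone', 'value' => 'us-east-1b'],
    ['node' => 'es-node-3', 'attr' => 'zone', 'value' => 'us-east-1c'],
];

$zones = nodesByZone($rows);
echo count($zones) >= 2
    ? 'Nodes span ' . count($zones) . " zones\n"
    : "WARNING: all nodes report the same zone\n";
```

If every node lands in a single group, awareness has nothing to work with and a zone outage can still take out all copies of a shard.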

Automating Elasticsearch Failover with AWS Services

Manual intervention during an Elasticsearch outage is unacceptable for a production system. Automation is key. This involves a combination of AWS services to detect failures and orchestrate recovery.

Health Checks and Load Balancer Integration

Elasticsearch exposes a health API (/_cluster/health) that provides crucial information about the cluster’s status. We can use AWS Elastic Load Balancing, either an Application Load Balancer (ALB) or a Network Load Balancer (NLB), to distribute traffic to Elasticsearch nodes and perform health checks.

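Configure the load balancer with a health check targeting the /_cluster/health endpoint. In the response body, "status": "green" or "status": "yellow" is generally acceptable for read operations, while "status": "red" means at least one primary shard is unassigned and should be treated as unhealthy. Note, however, that load balancer health checks evaluate only the HTTP status code, and this endpoint normally answers 200 OK even when the status is red. The check below therefore confirms that a node is reachable and responding; acting on the status value requires a separate check that parses the body.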

# Example load balancer health check configuration
Protocol: HTTP
Port: 9200
Path: /_cluster/health
Healthy Threshold: 3 consecutive successes
Unhealthy Threshold: 2 consecutive failures
Timeout: 5 seconds
Interval: 10 seconds

When a node becomes unhealthy, the load balancer stops sending traffic to it. However, removing a node from rotation neither repairs nor replaces it. Elasticsearch elects a new master by itself while a quorum of master-eligible nodes remains, but restoring the lost capacity takes additional automation.
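Because a load balancer health check looks at the status code rather than the response body, the green/yellow/red decision has to be made somewhere that can parse JSON. A minimal sketch of that decision follows. The function names are hypothetical, and where the code runs (a small health endpoint next to each node, a scheduled job, and so on) is left open.

```php
<?php
// Decide whether a /_cluster/health response body should count as healthy.
// Returns false for "red", for malformed JSON, and for a missing status field.
function isClusterServiceable(string $healthJson): bool
{
    $health = json_decode($healthJson, true);
    if (!is_array($health) || !isset($health['status'])) {
        return false;
    }
    return in_array($health['status'], ['green', 'yellow'], true);
}

// Map the decision to a status code that a load balancer health check understands.
function healthCheckStatusCode(string $healthJson): int
{
    return isClusterServiceable($healthJson) ? 200 : 503;
}

echo healthCheckStatusCode('{"status":"yellow","unassigned_shards":2}'), "\n";
echo healthCheckStatusCode('{"status":"red","unassigned_shards":7}'), "\n";
```

Treating an unparseable body as unhealthy is a deliberate choice here: a node that cannot produce a valid health document should not receive traffic.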

Leveraging AWS Auto Scaling Groups and Lambda for Master Node Failover

A common strategy for master node resilience is to use an Auto Scaling Group (ASG) with a desired capacity of 3 (or more, depending on your master node count) and configure it to launch instances in multiple Availability Zones. The ASG can monitor the health of instances and replace unhealthy ones.

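Elasticsearch elects a new master on its own as long as a majority of master-eligible nodes remains, so the election itself needs no external automation. What can be automated is detection and instance replacement, by integrating AWS Lambda with CloudWatch Alarms. The alarm would be triggered by metrics indicating cluster instability (e.g., a drop in the number of master-eligible nodes reported by the Elasticsearch API, or a high number of unassigned shards). For a self-managed cluster, these values must first be published to CloudWatch as custom metrics.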

The Lambda function would then perform the following actions:

Query the Elasticsearch API to confirm the state of the cluster and identify the failed master-eligible node.
If a node is confirmed down, mark its instance unhealthy so that the ASG replaces it. While quorum holds, this restores redundancy without interrupting the cluster.
If quorum is lost, do not re-run the cluster.initial_master_nodes bootstrap: bootstrapping again forms a new, empty cluster instead of recovering the old one. Recovery then means bringing the original master-eligible nodes (with their data volumes) back online or restoring from a snapshot, and it usually warrants paging an operator rather than full automation.

A simpler approach for master node failure is to rely on the ASG’s health checks. If the ASG’s health check (which can use the EC2 status checks, the load balancer health checks, or a custom health status) detects an unhealthy master node, it terminates the instance and launches a replacement. The remaining master-eligible nodes elect a new master if they still form a majority, and the replacement finds the cluster through discovery.seed_hosts, so that list must resolve to at least one live master-eligible node. This approach relies on having enough master-eligible nodes to maintain quorum.
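The decision an automated responder has to make can be reduced to a comparison against quorum. The following is a sketch of that logic only; the function and the action names are hypothetical, and the calls that would actually query Elasticsearch or the ASG are left out.

```php
<?php
// Decide what an automated responder should do, given how many master-eligible
// nodes are expected and how many currently respond.
// The returned action names are placeholders for this example.
function decideRecoveryAction(int $expectedMasters, int $liveMasters): string
{
    $quorum = intdiv($expectedMasters, 2) + 1;

    if ($liveMasters >= $expectedMasters) {
        return 'none'; // nothing is missing
    }
    if ($liveMasters >= $quorum) {
        // The cluster still has a master; only redundancy needs restoring.
        return 'replace_instance';
    }
    // Quorum is lost: replacing instances alone will not bring the cluster back.
    return 'page_operator';
}

foreach ([3, 2, 1] as $live) {
    echo "$live of 3 masters live: ", decideRecoveryAction(3, $live), "\n";
}
```

Keeping the quorum-lost case out of the automated path is a conservative choice: the safe recovery steps at that point depend on what happened to the nodes' data.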

PHP Application Resilience and Failover Strategies

The PHP application layer also needs to be resilient to Elasticsearch unavailability. This involves implementing robust error handling, connection pooling, and potentially a fallback mechanism.

Elasticsearch Client Configuration and Connection Pooling

When using an Elasticsearch client library in PHP (e.g., the official elasticsearch-php client), configure it to connect to multiple nodes and implement retry logic. This allows the client to automatically try other nodes if the one it contacted first is unresponsive.

// Targets the 7.x client (Elasticsearch\ClientBuilder). The 8.x client uses the
// Elastic\Elasticsearch namespace and different builder options, so check the
// documentation for the version you have installed.
$client = \Elasticsearch\ClientBuilder::create()
    ->setHosts([
        'http://es-node-1.example.com:9200',
        'http://es-node-2.example.com:9200',
        'http://es-node-3.example.com:9200',
    ])
    ->setRetries(3) // Retry a failed request on other nodes up to 3 times
    ->setConnectionParams([
        'client' => [
            'timeout' => 5,         // Seconds allowed for the whole request
            'connect_timeout' => 2, // Seconds allowed to establish the connection
        ],
    ])
    ->build();

By default the 7.x client shuffles the host list and rotates through it, which helps distribute load across available nodes. The retry count and the two timeouts are crucial for handling transient network issues or slow responses from Elasticsearch. Adjust these values based on your network latency and Elasticsearch performance characteristics.

Implementing Application-Level Failover Logic

Even with client-side retries, a complete Elasticsearch cluster outage will eventually lead to application errors. To mitigate this, implement application-level failover logic. This could involve:

Graceful Degradation: If Elasticsearch is unavailable, the application can continue to function by disabling search features or serving cached data.
Fallback Data Source: For critical search functionalities, consider a secondary, less performant data source (e.g., a relational database) that can be queried as a last resort.
Circuit Breaker Pattern: Implement a circuit breaker in your PHP code. If Elasticsearch requests consistently fail, “trip” the circuit breaker to stop sending requests for a period, preventing cascading failures and allowing Elasticsearch time to recover.
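The first two options can be sketched as a small wrapper that tries the primary search and falls back to a secondary source when it fails. The function name, the result shape, and the two closures below are hypothetical stand-ins; in a real application the primary would call the Elasticsearch client and the fallback would query a cache or database.

```php
<?php
// Run the primary search; on any failure, fall back to a secondary source.
// The 'source' key lets the caller tell users that results may be degraded.
function searchWithFallback(callable $primary, callable $fallback): array
{
    try {
        return ['source' => 'elasticsearch', 'hits' => $primary()];
    } catch (\Throwable $e) {
        error_log('Primary search failed, using fallback: ' . $e->getMessage());
        return ['source' => 'fallback', 'hits' => $fallback()];
    }
}

// Stand-in for an Elasticsearch call during an outage.
$failingSearch = function (): array {
    throw new \RuntimeException('cluster unreachable');
};

// Stand-in for a slower query against a relational database or cache.
$databaseSearch = function (): array {
    return [['id' => 1, 'title' => 'Elasticsearch basics']];
};

$result = searchWithFallback($failingSearch, $databaseSearch);
echo $result['source'], ': ', count($result['hits']), " hit(s)\n";
```

The fallback itself is not guarded here; if it can also fail, the caller still needs a final error path.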

Here’s a simplified example of a circuit breaker pattern in PHP:

class ElasticsearchCircuitBreaker {
    private $threshold;
    private $timeout;
    private $failures = 0;
    private $lastFailureTime = 0;
    private $isOpen = false;
    private $halfOpen = false; // true while a trial request is allowed after the timeout

    public function __construct(int $threshold = 5, int $timeout = 60) {
        $this->threshold = $threshold; // consecutive failures that open the circuit
        $this->timeout = $timeout; // seconds to wait before allowing a trial request
    }

    public function allowRequest(): bool {
        if (!$this->isOpen) {
            return true; // Circuit is closed, allow request
        }
        if (time() - $this->lastFailureTime > $this->timeout) {
            // Timeout has passed: keep the circuit open but let a trial request through
            $this->halfOpen = true;
            return true;
        }
        return false; // Circuit is open, block request
    }

    public function recordFailure(): void {
        $this->failures++;
        $this->lastFailureTime = time();
        // A failed trial request re-opens the circuit immediately;
        // otherwise the circuit opens once the threshold is reached.
        if ($this->halfOpen || (!$this->isOpen && $this->failures >= $this->threshold)) {
            $this->isOpen = true;
            $this->halfOpen = false;
            error_log("Elasticsearch Circuit Breaker opened.");
        }
    }

    public function recordSuccess(): void {
        if ($this->isOpen) {
            error_log("Elasticsearch Circuit Breaker closed.");
        }
        // Any success, including a trial request, closes the circuit and resets the count
        $this->isOpen = false;
        $this->halfOpen = false;
        $this->failures = 0;
    }

    public function isOpen(): bool {
        return $this->isOpen;
    }
}

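// Usage in your application. The breaker must outlive a single call to be useful:
// reuse one instance, and persist its state if each request runs in a fresh process.
$circuitBreaker = new ElasticsearchCircuitBreaker(5, 120); // 5 failures, 120s timeout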

try {
    if ($circuitBreaker->allowRequest()) {
        // Attempt Elasticsearch operation
        $params = ['index' => 'my_index', 'body' => ['query' => ['match' => ['title' => 'elasticsearch']]]];
        $response = $client->search($params);
        $circuitBreaker->recordSuccess();
        // Process $response
    } else {
        // Circuit is open, handle gracefully (e.g., serve cached data, disable search)
        echo "Search is temporarily unavailable. Please try again later.";
        // Potentially log this event
    }
} catch (\Exception $e) {
    // Log the Elasticsearch exception
    error_log("Elasticsearch error: " . $e->getMessage());
    $circuitBreaker->recordFailure();
    // Handle the error: display a user-friendly message, try fallback, etc.
    echo "An error occurred while searching. Please try again later.";
}

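This circuit breaker pattern prevents repeated calls to a failing service, protecting both your application and the failing service from further strain. When the circuit opens, the application can switch to a degraded mode or serve stale data. Keep in mind that the class above holds its state in memory. Under PHP-FPM or mod_php every request starts with a fresh object, so the failure count and open/closed flag need to live in shared storage such as APCu or Redis for the breaker to work across requests.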

Monitoring and Alerting for Proactive Recovery

Effective disaster recovery is not just about reacting to failures but also about anticipating them. Comprehensive monitoring and alerting are essential.

Key Elasticsearch Metrics to Monitor

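Use CloudWatch or a third-party monitoring tool to track the following Elasticsearch metrics. For a self-managed cluster on EC2, CloudWatch collects only instance-level metrics by itself, so the Elasticsearch values must be published as custom metrics, for example by a small agent that polls the cluster health and node stats APIs: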

Cluster Health Status: status (green, yellow, red)
Node Count: Number of master, data, and ingest nodes. A sudden drop indicates a failure.
Shard Status: unassigned_shards, initializing_shards, relocating_shards. High numbers of unassigned shards are a critical alert.
JVM Heap Usage: Monitor heap_used_percent (from the node stats API) to catch memory pressure before it leads to an OutOfMemoryError.
CPU and Memory Utilization: Standard system metrics for EC2 instances.
Disk I/O and Space: Ensure nodes have sufficient disk resources.
Request Latency and Throughput: Monitor search and indexing performance.
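Several of these values come straight from the /_cluster/health response. The sketch below flattens a decoded response into numeric datapoints that a monitoring agent could publish. The metric names and the numeric encoding of the status are assumptions made for this example, not names that CloudWatch or Elasticsearch define.

```php
<?php
// Flatten a decoded /_cluster/health response into numeric datapoints.
// Metric names and the status encoding (0/1/2) are choices made for this sketch.
function clusterHealthMetrics(array $health): array
{
    $statusCodes = ['green' => 0, 'yellow' => 1, 'red' => 2];
    $status = $health['status'] ?? 'red'; // treat a missing status as the worst case

    return [
        'ClusterStatus'      => $statusCodes[$status] ?? 2,
        'NodeCount'          => (int) ($health['number_of_nodes'] ?? 0),
        'DataNodeCount'      => (int) ($health['number_of_data_nodes'] ?? 0),
        'UnassignedShards'   => (int) ($health['unassigned_shards'] ?? 0),
        'InitializingShards' => (int) ($health['initializing_shards'] ?? 0),
        'RelocatingShards'   => (int) ($health['relocating_shards'] ?? 0),
    ];
}

// Hypothetical response body; a real agent would fetch it over HTTP.
$body = '{"status":"yellow","number_of_nodes":5,"number_of_data_nodes":3,'
      . '"unassigned_shards":4,"initializing_shards":1,"relocating_shards":0}';

foreach (clusterHealthMetrics(json_decode($body, true)) as $name => $value) {
    echo "$name=$value\n";
}
```

Encoding the status as a number makes it possible to alarm on it with an ordinary threshold.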

Setting Up CloudWatch Alarms

Configure CloudWatch alarms based on these metrics. For example:

An alarm on unassigned_shards exceeding a threshold (e.g., > 0 for more than 5 minutes).
An alarm on the number of master-eligible nodes dropping below the quorum requirement.
An alarm on heap_used_percent consistently above 85%.

These alarms should trigger notifications (e.g., SNS to Slack, PagerDuty) and, as discussed earlier, potentially trigger automated recovery actions via AWS Lambda.
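A condition such as "above the threshold for more than 5 minutes" amounts to requiring several consecutive breaching datapoints. The sketch below shows that evaluation in isolation, assuming one sample per minute. The function name is hypothetical, and CloudWatch performs this evaluation itself once an alarm is configured.

```php
<?php
// True when the most recent $periods samples all exceed $threshold.
// Fewer samples than $periods means there is not enough evidence yet.
function breachedFor(array $samples, float $threshold, int $periods): bool
{
    if ($periods < 1 || count($samples) < $periods) {
        return false;
    }
    foreach (array_slice($samples, -$periods) as $value) {
        if ($value <= $threshold) {
            return false;
        }
    }
    return true;
}

// One sample per minute, oldest first (hypothetical values).
$unassignedShards = [0, 3, 3, 2, 2, 1];
$heapUsedPercent  = [88, 91, 84, 90, 92];

var_dump(breachedFor($unassignedShards, 0, 5)); // last five samples are all > 0
var_dump(breachedFor($heapUsedPercent, 85, 5)); // interrupted by the 84 sample
```

Requiring consecutive breaches filters out brief spikes, such as shards that are unassigned for a few seconds during a rolling restart.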

Conclusion: A Multi-Layered Approach to Resilience

Architecting for auto-failover for Elasticsearch and PHP deployments on AWS requires a multi-layered strategy. It begins with a resilient Elasticsearch cluster design, leveraging multi-AZ deployments and proper master node configuration. This is augmented by AWS services like ELB, Auto Scaling Groups, and Lambda for automated detection and recovery. Finally, the PHP application itself must be built with resilience in mind, incorporating robust client configurations, error handling, and patterns like the circuit breaker. Continuous monitoring and proactive alerting form the backbone of this strategy, ensuring that potential issues are identified and addressed before they impact users.