Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and PHP Deployments on DigitalOcean

Elasticsearch Cluster Architecture for High Availability

Achieving true disaster recovery for Elasticsearch hinges on a robust, multi-node cluster design that inherently supports failover. For production environments, especially those serving critical applications, a single-node Elasticsearch instance is a non-starter. We’ll focus on a setup that leverages Elasticsearch’s built-in master-eligible and data node roles, distributed across multiple DigitalOcean Droplets for resilience.

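A minimum viable HA cluster has at least three master-eligible nodes. Electing a master requires a majority of them, so with three nodes the cluster keeps a quorum of two when one node is lost, and two isolated halves can never both elect a master (the split-brain scenario). Data nodes store the actual indices and shards. For larger or busier clusters, we’ll also include dedicated coordinating nodes, which accept client requests, fan them out to the data nodes, and merge the results, taking that work off the data and master nodes.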
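
The quorum arithmetic behind the three-node recommendation can be sketched in a few lines of PHP. This is only an illustration of majority voting; the function names are made up for this example, and Elasticsearch manages its own voting configuration internally.

```php
<?php
declare(strict_types=1);

/**
 * Votes needed for a majority among $masterEligible voting nodes.
 */
function quorumSize(int $masterEligible): int
{
    if ($masterEligible < 1) {
        throw new InvalidArgumentException('At least one master-eligible node is required.');
    }
    return intdiv($masterEligible, 2) + 1;
}

/**
 * How many master-eligible nodes can fail while a majority can still be formed.
 */
function tolerableMasterFailures(int $masterEligible): int
{
    return $masterEligible - quorumSize($masterEligible);
}

foreach ([1, 2, 3, 5] as $n) {
    printf(
        "%d master-eligible: quorum %d, tolerates %d failure(s)\n",
        $n,
        quorumSize($n),
        tolerableMasterFailures($n)
    );
}
```

Two master-eligible nodes tolerate no more failures than one, which is why three is the practical minimum.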

DigitalOcean Droplet Configuration

We’ll provision Droplets with sufficient RAM and CPU for Elasticsearch. Given Elasticsearch’s memory demands, especially for heap, Droplets with at least 8GB RAM are recommended. For production, consider dedicated CPU instances. Network latency is also a critical factor. DigitalOcean does not expose availability zones within a region, so keep all nodes of one cluster in the same datacenter region (for example NYC3) on a shared VPC network, and avoid stretching a single cluster across regions, where latency hurts cluster stability. Protection against a regional outage is handled separately, with a second cluster in another region.

Example Droplet setup:

Master Nodes (3x): `c-4` (4 vCPU, 8GB RAM) or larger. These nodes will run with node.roles: [ master ].
Data Nodes (2x+): `c-4` or larger, depending on data volume and query load. These nodes will run with node.roles: [ data ].
Coordinating Nodes (1x+): `c-2` (2 vCPU, 4GB RAM) or larger. These nodes will run with node.roles: [ ] (an empty role list makes a node coordinating-only), or node.roles: [ ingest ] if they should also run ingest pipelines.

Elasticsearch Configuration for HA

The core of Elasticsearch’s HA lies in its configuration file, elasticsearch.yml. We need to configure discovery, cluster naming, and node roles correctly on each node.

Master Node Configuration (Example: `/etc/elasticsearch/elasticsearch.yml`)

On each of the three master nodes:

cluster.name: "my-prod-cluster"
node.name: "master-node-1" # Unique for each master node
node.roles: [ master ]
# 0.0.0.0 binds every interface. In production, prefer the Droplet's private (VPC)
# address and firewall ports 9200/9300 off from the public internet.
network.host: 0.0.0.0
discovery.seed_hosts:
  - "192.168.1.10:9300" # IP of master-node-1
  - "192.168.1.11:9300" # IP of master-node-2
  - "192.168.1.12:9300" # IP of master-node-3
# Only needed when bootstrapping a brand-new cluster. Remove it from every node
# once the cluster has formed.
cluster.initial_master_nodes:
  - "master-node-1"
  - "master-node-2"
  - "master-node-3"
http.port: 9200
transport.port: 9300
# Dedicated master nodes hold no shards, so no indexing buffer tuning is needed here.
# JVM heap size - at most 50% of system RAM, and below roughly 30GB
# Set it in a file under /etc/elasticsearch/jvm.options.d/ rather than in this file

Data Node Configuration (Example: `/etc/elasticsearch/elasticsearch.yml`)

On each data node:

cluster.name: "my-prod-cluster"
node.name: "data-node-1" # Unique for each data node
node.roles: [ data ]
# 0.0.0.0 binds every interface. In production, prefer the Droplet's private (VPC)
# address and firewall ports 9200/9300 off from the public internet.
network.host: 0.0.0.0
discovery.seed_hosts:
  - "192.168.1.10:9300" # IP of master-node-1
  - "192.168.1.11:9300" # IP of master-node-2
  - "192.168.1.12:9300" # IP of master-node-3
# cluster.initial_master_nodes is deliberately absent: it belongs only on
# master-eligible nodes, and only while bootstrapping a new cluster.
http.port: 9200
transport.port: 9300
# 10% of heap is the default and suits most workloads; raise it only for indexing-heavy nodes
indices.memory.index_buffer_size: "10%"
# JVM heap size - at most 50% of system RAM, and below roughly 30GB
# Set it in a file under /etc/elasticsearch/jvm.options.d/ rather than in this file
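
The heap rule of thumb from the configuration comments (half of system RAM, capped at about 30GB) can be worked through as follows. This is a minimal sketch for manual sizing: the function names are illustrative, and the 30GB cap is the approximate figure quoted above rather than an exact JVM threshold.

```php
<?php
declare(strict_types=1);

/**
 * Heap size in MB following the "50% of RAM, capped near 30GB" guideline.
 */
function recommendedHeapMb(int $ramMb, int $capMb = 30 * 1024): int
{
    return min(intdiv($ramMb, 2), $capMb);
}

/**
 * Render the two JVM flags, with identical minimum and maximum heap,
 * ready to be written into a JVM options file.
 */
function jvmHeapOptions(int $ramMb): string
{
    $heap = recommendedHeapMb($ramMb);
    return "-Xms{$heap}m\n-Xmx{$heap}m\n";
}

// An 8GB Droplet, as suggested for the master and data nodes above.
echo jvmHeapOptions(8 * 1024);
```

On an 8GB Droplet this yields a 4GB heap, leaving the other half of RAM to the operating system's filesystem cache, which Elasticsearch relies on heavily.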

Coordinating Node Configuration (Example: `/etc/elasticsearch/elasticsearch.yml`)

On each coordinating node:

cluster.name: "my-prod-cluster"
node.name: "coord-node-1" # Unique for each coordinating node
node.roles: [ ] # Coordinating-only; use [ ingest ] if this node should also run ingest pipelines
# 0.0.0.0 binds every interface. In production, prefer the Droplet's private (VPC)
# address and firewall ports 9200/9300 off from the public internet.
network.host: 0.0.0.0
discovery.seed_hosts:
  - "192.168.1.10:9300" # IP of master-node-1
  - "192.168.1.11:9300" # IP of master-node-2
  - "192.168.1.12:9300" # IP of master-node-3
# cluster.initial_master_nodes is deliberately absent: it belongs only on
# master-eligible nodes, and only while bootstrapping a new cluster.
http.port: 9200
transport.port: 9300
# Coordinating nodes do not need large heap sizes, but sufficient for request handling
# JVM heap size - adjust as needed, e.g., 2GB

Shard Allocation and Replication

To ensure data availability and resilience, we must configure index settings for replica shards. A minimum of one replica shard per primary shard is essential for failover. For higher availability, two or more replicas are recommended.

This can be set at index creation time or updated dynamically. For dynamic updates:

curl -X PUT "localhost:9200/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 2
  }
}
'

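This command sets the number of replicas to 2 for all existing indices. It does not affect indices created later; for those, set number_of_replicas in an index template or at index creation time. Elasticsearch distributes replica shards across different data nodes and never places two copies of the same shard on one node, so two replicas need at least three data nodes. With only two data nodes, the second replica stays unassigned and the cluster reports yellow. If a data node fails, Elasticsearch promotes a replica shard on another node to primary, ensuring data continuity.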
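
Because a replica is never allocated to a node that already holds a copy of the same shard, the useful replica count is bounded by the number of data nodes minus one. The sketch below clamps a requested value accordingly and builds a parameter array in the shape the 7.x PHP client's `indices()->putSettings()` call is expected to accept. The helper names are hypothetical, and the array shape should be checked against your client version.

```php
<?php
declare(strict_types=1);

/**
 * Highest replica count that can be fully allocated on $dataNodes nodes.
 */
function maxAllocatableReplicas(int $dataNodes): int
{
    return max(0, $dataNodes - 1);
}

/**
 * Build update-settings parameters, clamping the requested replica count
 * so that no replica is left permanently unassigned.
 */
function replicaSettingsParams(int $wanted, int $dataNodes, string $index = '_all'): array
{
    $replicas = min($wanted, maxAllocatableReplicas($dataNodes));

    return [
        'index' => $index,
        'body'  => ['index' => ['number_of_replicas' => $replicas]],
    ];
}

// With the two-data-node minimum from the Droplet list, a request for
// two replicas is reduced to one.
$params = replicaSettingsParams(2, 2);
echo json_encode($params['body']), "\n";
// Assumed usage with a configured client: $client->indices()->putSettings($params);
```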

PHP Application Integration and Failover Strategy

The PHP application needs to be aware of the Elasticsearch cluster and handle potential connection failures gracefully. This involves configuring the Elasticsearch client and implementing retry mechanisms or fallback strategies.

PHP Elasticsearch Client Configuration

We’ll use the official Elasticsearch PHP client. The key is to provide multiple hosts to the client, allowing it to attempt connections to different nodes in the cluster.

<?php
require 'vendor/autoload.php';

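// Namespace used by the 7.x client; 8.x moved it to Elastic\Elasticsearch\ClientBuilder
// and renamed the exception classes caught below.
use Elasticsearch\ClientBuilder;

$hosts = [
    'http://192.168.1.20:9200', // Coordinating Node 1
    'http://192.168.1.21:9200', // Coordinating Node 2
    'http://192.168.1.10:9200', // Master Node 1 (last resort; dedicated masters should not normally take client traffic)
];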

$client = ClientBuilder::create()
    ->setHosts($hosts)
    ->build();

try {
    // Example search query
    $params = [
        'index' => 'my_index',
        'body'  => [
            'query' => [
                'match' => [ 'title' => 'elasticsearch' ]
            ]
        ]
    ];
    $response = $client->search($params);
    // Process $response
    print_r($response);

} catch (\Elasticsearch\Common\Exceptions\NoNodesAvailableException $e) {
    // Handle the case where no Elasticsearch nodes are reachable
    error_log("Elasticsearch is unavailable: " . $e->getMessage());
    // Implement fallback strategy here
} catch (\Exception $e) {
    // Handle other potential Elasticsearch exceptions
    error_log("An error occurred with Elasticsearch: " . $e->getMessage());
}
?>

In this example, the client is configured with a list of potential Elasticsearch endpoints. The client spreads requests across these hosts (round-robin by default) and, when a node does not respond, retries the request on another host. Only when every host has failed does it throw NoNodesAvailableException, which is the point where the application’s own fallback has to take over.
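
To make the failover behaviour concrete, here is a simplified, self-contained sketch of trying hosts in turn. It is not the client library's actual implementation; `requestWithFailover` and the simulated `$send` transport are hypothetical names used only for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Try a request against each host in turn, starting at $startAt, and return
 * the first successful result.
 *
 * @param string[] $hosts base URLs of the nodes
 * @param callable $send  hypothetical transport: function (string $host) that
 *                        returns a response or throws RuntimeException
 */
function requestWithFailover(array $hosts, callable $send, int $startAt = 0)
{
    $hosts = array_values($hosts);
    $count = count($hosts);
    $lastError = null;

    for ($i = 0; $i < $count; $i++) {
        $host = $hosts[($startAt + $i) % $count];
        try {
            return $send($host);
        } catch (RuntimeException $e) {
            $lastError = $e; // remember the failure and move on to the next host
        }
    }

    throw new RuntimeException('No nodes available', 0, $lastError);
}

// Simulated transport: the first coordinating node is down.
$send = function (string $host): string {
    if ($host === 'http://192.168.1.20:9200') {
        throw new RuntimeException("connection refused: $host");
    }
    return "response from $host";
};

echo requestWithFailover(
    ['http://192.168.1.20:9200', 'http://192.168.1.21:9200'],
    $send
), "\n";
```

Varying `$startAt` per request gives a crude round-robin; the exception thrown after the last host plays the role that NoNodesAvailableException plays in the real client.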

Implementing a Fallback Strategy

When Elasticsearch is completely unavailable, the application should not simply crash or return a 5xx error to the user without context. A graceful fallback is essential.

Possible fallback strategies include:

Serving Stale Data: If the application has a local cache (e.g., Redis, Memcached, or even file-based) of frequently accessed search results, it can serve this stale data while Elasticsearch is down.
Displaying a User-Friendly Message: Inform the user that search functionality is temporarily unavailable.
Queueing Writes: For write operations (indexing), if Elasticsearch is down, these operations can be queued (e.g., using RabbitMQ or Kafka) and replayed once the cluster is back online.
Using a Secondary Data Source: In some scenarios, a simpler, less performant, but more resilient data source (e.g., a relational database with full-text search capabilities) could be used as a last resort.
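
For the stale-data strategy, results are best cached per query rather than under one global key, with an explicit limit on how old a served result may be. The class below is a minimal in-memory sketch under those assumptions: `StaleResultCache` is a hypothetical name, and in a real deployment the array would be replaced by a shared store such as Redis or Memcached, since PHP memory does not survive the request.

```php
<?php
declare(strict_types=1);

final class StaleResultCache
{
    /** @var array<string, array{storedAt: int, result: array}> */
    private array $entries = [];

    /** Derive a stable cache key from the index name and query body. */
    public static function keyFor(string $index, array $body): string
    {
        return 'es:' . $index . ':' . sha1((string) json_encode($body));
    }

    public function put(string $key, array $result, int $now): void
    {
        $this->entries[$key] = ['storedAt' => $now, 'result' => $result];
    }

    /** Return the cached result, or null if missing or older than $maxAgeSeconds. */
    public function get(string $key, int $now, int $maxAgeSeconds): ?array
    {
        if (!isset($this->entries[$key])) {
            return null;
        }
        $entry = $this->entries[$key];
        if ($now - $entry['storedAt'] > $maxAgeSeconds) {
            return null; // too stale to show, even during an outage
        }
        return $entry['result'];
    }
}

$cache = new StaleResultCache();
$key = StaleResultCache::keyFor('my_index', ['query' => ['match' => ['title' => 'elasticsearch']]]);
$cache->put($key, ['hits' => ['total' => 3]], time());
var_dump($cache->get($key, time(), 300) !== null);
```

Passing the current time in explicitly keeps the expiry logic easy to test; the maximum age is a product decision about how stale a search result may be before an error message is the better answer.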

Example Fallback Logic in PHP

<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;
use Elasticsearch\Common\Exceptions\NoNodesAvailableException;

// Assume $cacheService is an instance of your caching mechanism (e.g., Redis client)
// Assume $queueService is an instance of your message queue client (e.g., RabbitMQ client)

$hosts = [
    'http://192.168.1.20:9200',
    'http://192.168.1.21:9200',
    'http://192.168.1.10:9200',
];

$client = ClientBuilder::create()
    ->setHosts($hosts)
    ->build();

function performSearch($client) {
    $params = [
        'index' => 'my_index',
        'body'  => [
            'query' => [
                'match' => [ 'title' => 'elasticsearch' ]
            ]
        ]
    ];
    return $client->search($params);
}

function getFromCache($key) {
    // Placeholder only: $_SESSION is per-visitor and requires session_start().
    // In a real app, read from a shared store such as Redis or Memcached.
    return $_SESSION[$key] ?? null;
}

function saveToCache($key, $data, $ttl = 300) {
    // Placeholder only: $ttl is ignored here. A shared cache would store $data
    // under $key with an expiry of $ttl seconds.
    $_SESSION[$key] = $data;
}

function displayErrorToUser($message) {
    // Placeholder for user-facing error display
    echo "<div style='color: red;'>" . htmlspecialchars($message) . "</div>";
}

function queueWriteOperation($operationData) {
    // Placeholder for queuing logic
    error_log("Queuing write operation: " . json_encode($operationData));
    // $queueService->publish('elasticsearch_writes', json_encode($operationData));
}

try {
    $response = performSearch($client);
    // Cache the successful search results
    saveToCache('search_results_elasticsearch', $response);
    // Process and display $response
    print_r($response);

} catch (NoNodesAvailableException $e) {
    error_log("Elasticsearch is unavailable: " . $e->getMessage());
    
    // Attempt to retrieve from cache
    $cachedResults = getFromCache('search_results_elasticsearch');
    if ($cachedResults) {
        echo "<p>Displaying cached search results as Elasticsearch is currently unavailable.</p>";
        print_r($cachedResults);
    } else {
        displayErrorToUser("Search functionality is temporarily unavailable. Please try again later.");
        // Optionally, log this event for monitoring
    }
} catch (\Exception $e) {
    error_log("An error occurred with Elasticsearch: " . $e->getMessage());
    displayErrorToUser("An unexpected error occurred during search. Please try again later.");
}

// Example of handling a write operation (indexing)
function indexDocument($client, $index, $id, $doc) {
    try {
        $params = [
            'index' => $index,
            'id'    => $id,
            'body'  => $doc
        ];
        $client->index($params);
        error_log("Document indexed successfully: {$index}/{$id}");
    } catch (NoNodesAvailableException $e) {
        error_log("Elasticsearch unavailable for indexing: " . $e->getMessage());
        // Queue the write operation for later replay
        queueWriteOperation(['action' => 'index', 'index' => $index, 'id' => $id, 'doc' => $doc]);
    } catch (\Exception $e) {
        error_log("Error indexing document {$index}/{$id}: " . $e->getMessage());
        // Do not queue blindly here: a rejected request (for example a mapping
        // conflict) would fail again on every replay. Queue only errors that are
        // known to be transient, and surface the rest for investigation.
    }
}

// Example usage of indexDocument
// indexDocument($client, 'my_index', 'doc_123', ['title' => 'New Article', 'content' => '...']);

?>
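
Queued writes also need a replay step once the cluster is reachable again. The following is a minimal sketch of such a worker, assuming the queue has already been drained into an ordered array of the payloads produced by queueWriteOperation() above. `replayQueuedWrites` and the `$indexFn` callable are hypothetical names, and a real worker would consume from RabbitMQ or Kafka instead of an array.

```php
<?php
declare(strict_types=1);

/**
 * Replay queued write operations in order.
 *
 * @param array[]  $queued  operations shaped like ['action', 'index', 'id', 'doc']
 * @param callable $indexFn performs one write and throws RuntimeException
 *                          if Elasticsearch is still unreachable
 * @return array[] operations that could not be replayed, in original order
 */
function replayQueuedWrites(array $queued, callable $indexFn): array
{
    $remaining = [];
    $clusterDown = false;

    foreach ($queued as $op) {
        if ($clusterDown) {
            $remaining[] = $op; // keep ordering; do not hammer a cluster that is down
            continue;
        }
        try {
            $indexFn($op);
        } catch (RuntimeException $e) {
            $clusterDown = true;
            $remaining[] = $op;
        }
    }

    return $remaining;
}

// Simulated indexer that accepts one write and then loses the connection.
$calls = 0;
$left = replayQueuedWrites(
    [
        ['action' => 'index', 'index' => 'my_index', 'id' => 'doc_1', 'doc' => ['title' => 'A']],
        ['action' => 'index', 'index' => 'my_index', 'id' => 'doc_2', 'doc' => ['title' => 'B']],
    ],
    function (array $op) use (&$calls): void {
        if (++$calls > 1) {
            throw new RuntimeException('Elasticsearch still unavailable');
        }
    }
);
echo count($left), " operation(s) left in the queue\n";
```

Because each operation carries an explicit document ID, indexing it twice simply overwrites the same document, so a replay that is interrupted can be retried safely.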

Automated Failover Orchestration with DigitalOcean and External Tools

While Elasticsearch and the PHP client handle internal node failover and connection retries, true disaster recovery often requires orchestrating failover at the infrastructure level, especially if an entire DigitalOcean region becomes unavailable.

Load Balancer Configuration (HAProxy)

A highly available setup for the PHP application itself is paramount. Deploying PHP applications behind a load balancer like HAProxy, running on a separate, resilient Droplet (or a managed DigitalOcean Load Balancer), is standard practice. This load balancer should point to multiple instances of your PHP application.

For Elasticsearch, you might also place a load balancer in front of your coordinating nodes. This adds another layer of abstraction and failover capability.

HAProxy Configuration for Elasticsearch Coordinating Nodes

# /etc/haproxy/haproxy.cfg

global
    log /dev/log    local0
    log /dev/log    local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    timeout connect 5000
    timeout client  50000
    timeout server  50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

frontend http_in
    bind *:80
    acl is_elasticsearch_api hdr(Host) -i api.yourdomain.com
    use_backend elasticsearch_backend if is_elasticsearch_api
    default_backend php_app_backend

backend php_app_backend
    balance roundrobin
    server app1 192.168.1.30:80 check
    server app2 192.168.1.31:80 check

backend elasticsearch_backend
    balance roundrobin
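    # Probe the cluster health endpoint: the check succeeds only when the node
    # can reach an elected master (assumes the endpoint is reachable without authentication)
    option httpchk GET /_cluster/health
    http-check expect status 200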
    # Health check for Elasticsearch nodes (coordinating nodes)
    # Ensure these IPs are your coordinating nodes
    server coord1 192.168.1.20:9200 check port 9200 inter 2s fall 3 rise 2
    server coord2 192.168.1.21:9200 check port 9200 inter 2s fall 3 rise 2
    # Optionally add master nodes if they also serve HTTP and you want them in rotation
    # server master1 192.168.1.10:9200 check port 9200 inter 2s fall 3 rise 2

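In this HAProxy configuration, requests whose Host header is api.yourdomain.com are routed to the Elasticsearch coordinating nodes. HAProxy checks each node every 2 seconds, removes it from rotation after 3 consecutive failed checks, and restores it after 2 successful ones. The PHP client can then use a single endpoint, http://api.yourdomain.com, which must resolve to the HAProxy Droplet, and still benefit from load balancing and failover. Because this frontend exposes the Elasticsearch API, bind it to a private address or restrict access with a firewall rather than leaving it open to the internet.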
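
The `inter`, `fall` and `rise` values on the server lines determine how quickly HAProxy reacts. As a rough worked example (ignoring check timeouts and any `fastinter` or `downinter` overrides), the detection windows can be estimated like this; the function name is illustrative.

```php
<?php
declare(strict_types=1);

/**
 * Approximate time for HAProxy to change a server's state, given the
 * check interval and the fall/rise thresholds from the server line.
 */
function healthCheckWindows(int $interSeconds, int $fall, int $rise): array
{
    return [
        'seconds_to_mark_down' => $interSeconds * $fall,
        'seconds_to_mark_up'   => $interSeconds * $rise,
    ];
}

// Values from the configuration above: inter 2s fall 3 rise 2
print_r(healthCheckWindows(2, 3, 2));
```

With the values used here, a dead coordinating node keeps receiving traffic for up to about six seconds, which is one more reason the PHP side still needs its own retry and fallback handling.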

Cross-Region Disaster Recovery

For true DR against a regional outage, a multi-region deployment is necessary. This involves replicating your Elasticsearch data and application infrastructure to a secondary DigitalOcean region.

Elasticsearch Cross-Cluster Replication (CCR):

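Elasticsearch’s Cross-Cluster Replication (CCR) allows you to replicate indices from a primary cluster in one region to a secondary cluster in another region. This is a powerful feature for DR, ensuring that your data is available in a separate geographical location. Note that CCR requires a Platinum-level or higher Elastic subscription (or an active trial); it is not part of the free Basic license. Replication is pull-based: the secondary (follower) cluster connects to the primary (leader) cluster and pulls changes from it.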

# CCR is pull-based: run all of the commands below on the SECONDARY (follower) cluster, e.g. SFO3.
# The leader nodes' transport port (9300) must be reachable from the follower cluster across regions.

# 1. Register the primary cluster (e.g., NYC1) as a remote cluster. Either set it in
#    elasticsearch.yml on the SFO3 cluster nodes:
#      cluster.remote.nyc1_cluster.seeds: ["192.168.2.10:9300", "192.168.2.11:9300"]
#    or apply it dynamically:
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.remote.nyc1_cluster.seeds": ["192.168.2.10:9300", "192.168.2.11:9300"]
  }
}
'

# 2. Create the follower index. Elasticsearch copies the mappings and shard count from the
#    leader, so do not create the index by hand; number_of_shards cannot be overridden.
curl -X PUT "localhost:9200/my_leader_index_follower/_ccr/follow?wait_for_active_shards=1" -H 'Content-Type: application/json' -d'
{
  "remote_cluster": "nyc1_cluster",
  "leader_index": "my_leader_index",
  "settings": {
    "index.number_of_replicas": 1
  }
}
'

# 3. Optionally, follow newly created leader indices automatically by name pattern
#    (the pattern name and index pattern here are examples).
curl -X PUT "localhost:9200/_ccr/auto_follow/my_auto_follow_pattern" -H 'Content-Type: application/json' -d'
{
  "remote_cluster": "nyc1_cluster",
  "leader_index_patterns": ["my_leader_*"],
  "follow_index_pattern": "{{leader_index}}_follower"
}
'
# The actual CCR setup involves more detailed configuration of remote cluster access and security.
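
Replication lag determines how much data could be lost in a failover, so it is worth measuring. The sketch below computes per-shard lag from a decoded follower stats response (`GET /<follower_index>/_ccr/stats`). It assumes the response carries `leader_global_checkpoint` and `follower_global_checkpoint` for each shard, as in the documented response format; verify the field names against your Elasticsearch version. The function name and the sample data are illustrative.

```php
<?php
declare(strict_types=1);

/**
 * Operations each follower shard is behind its leader, keyed by "index[shard]".
 *
 * @param array $stats decoded JSON from the follower stats API (assumed shape)
 * @return array<string, int>
 */
function ccrLagByShard(array $stats): array
{
    $lag = [];
    foreach ($stats['indices'] ?? [] as $index) {
        foreach ($index['shards'] ?? [] as $shard) {
            $key = $index['index'] . '[' . $shard['shard_id'] . ']';
            $lag[$key] = max(
                0,
                $shard['leader_global_checkpoint'] - $shard['follower_global_checkpoint']
            );
        }
    }
    return $lag;
}

// Hand-written sample in the assumed response shape, not real cluster output.
$sample = [
    'indices' => [[
        'index'  => 'my_leader_index_follower',
        'shards' => [
            ['shard_id' => 0, 'leader_global_checkpoint' => 1024, 'follower_global_checkpoint' => 768],
        ],
    ]],
];
print_r(ccrLagByShard($sample));
```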

Application Deployment and DNS Failover:

Your PHP application instances should also be deployed in the secondary region. A global DNS provider (like DigitalOcean’s managed DNS or a third-party like Cloudflare) can be configured to point your application’s primary domain to the load balancer in the primary region. In the event of a regional outage, you can manually or automatically update the DNS records to point to the load balancer in the secondary region.

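Automated DNS failover can be achieved using health checks provided by DNS services or by custom scripts that monitor the primary region’s health and trigger DNS updates via API calls. In both cases, keep the record’s TTL low (for example 60 seconds) so that resolvers pick up the change quickly, and require several consecutive failed checks before switching to avoid flapping on a transient error.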
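
The decision logic of such a custom script can be kept separate from the provider-specific API call. Below is a minimal sketch of that logic only: `FailoverDecider` is a hypothetical class, the health-check results are supplied by the caller, and the DNS update itself (through whichever provider API you use) is deliberately left out.

```php
<?php
declare(strict_types=1);

final class FailoverDecider
{
    private int $threshold;
    private int $consecutiveFailures = 0;
    private bool $failedOver = false;

    public function __construct(int $threshold)
    {
        $this->threshold = max(1, $threshold);
    }

    /**
     * Record one health-check result for the primary region.
     * Returns true exactly once: on the check that crosses the threshold.
     */
    public function record(bool $primaryHealthy): bool
    {
        if ($this->failedOver) {
            return false; // already switched; failing back is a deliberate manual step here
        }
        $this->consecutiveFailures = $primaryHealthy ? 0 : $this->consecutiveFailures + 1;
        if ($this->consecutiveFailures >= $this->threshold) {
            $this->failedOver = true;
            return true;
        }
        return false;
    }
}

$decider = new FailoverDecider(3);
foreach ([true, false, false, true, false, false, false] as $healthy) {
    if ($decider->record($healthy)) {
        echo "Primary region failed 3 checks in a row: switch DNS to the secondary load balancer\n";
    }
}
```

Requiring consecutive failures means a single dropped health check does not move traffic between regions, and keeping fail-back manual avoids switching back before the primary cluster has caught up.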

Monitoring and Alerting

Robust monitoring is non-negotiable for any HA/DR strategy. Tools like Prometheus and Grafana, or DigitalOcean’s integrated monitoring, should be used to track:

Elasticsearch cluster health (green, yellow, red status).
Node status (up/down, CPU, memory, disk usage).
Network latency between nodes and regions.
Application error rates and response times.
HAProxy backend health.

Alerting should be configured for critical thresholds and failures, notifying the operations team immediately when issues arise, allowing for prompt intervention or automated failover procedures.
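
As an example of turning cluster health into an alert, the sketch below classifies the JSON returned by `GET /_cluster/health`, using its `status` and `number_of_nodes` fields. The function name, the severity labels, and the sample payloads are assumptions made for illustration; a real setup would feed the result into whatever alerting tool is in use.

```php
<?php
declare(strict_types=1);

/**
 * Map a cluster health response to an alert level: ok, warning, or critical.
 */
function clusterAlertLevel(string $healthJson, int $expectedNodes): string
{
    $health = json_decode($healthJson, true);
    if (!is_array($health) || !isset($health['status'])) {
        return 'critical'; // unreadable response: treat as an outage
    }
    if ($health['status'] === 'red') {
        return 'critical'; // at least one primary shard is unassigned
    }
    $nodes = (int) ($health['number_of_nodes'] ?? 0);
    if ($health['status'] === 'yellow' || $nodes < $expectedNodes) {
        return 'warning'; // replicas missing or a node has dropped out
    }
    return 'ok';
}

// Hand-written sample payloads, not real cluster output.
echo clusterAlertLevel('{"status":"green","number_of_nodes":6}', 6), "\n";
echo clusterAlertLevel('{"status":"yellow","number_of_nodes":5}', 6), "\n";
```

Comparing the node count with the expected cluster size catches a lost node even while the status is still green, for instance after shards have been reallocated.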