Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and PHP Deployments on DigitalOcean
Elasticsearch Cluster Architecture for High Availability
Achieving true disaster recovery for Elasticsearch hinges on a robust, multi-node cluster design that inherently supports failover. For production environments, especially those serving critical applications, a single-node Elasticsearch instance is a non-starter. We’ll focus on a setup that leverages Elasticsearch’s built-in master-eligible and data node roles, distributed across multiple DigitalOcean Droplets for resilience.
A minimum viable HA cluster consists of at least three master-eligible nodes. This quorum-based approach prevents split-brain scenarios. Data nodes will store the actual indices and shards. For optimal performance and resilience, we’ll also include dedicated coordinating nodes, which offload query and indexing requests from data and master nodes.
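The quorum arithmetic behind the three-node recommendation can be sketched in a few lines of PHP. This is only simple majority math under the assumption that every master-eligible node has one vote. Elasticsearch manages its own voting configuration, and the function names here are illustrative, not part of any library.

```php
<?php
declare(strict_types=1);

/**
 * Smallest number of master-eligible nodes that forms a majority.
 * Hypothetical helper for illustration only.
 */
function quorumSize(int $masterEligibleNodes): int
{
    if ($masterEligibleNodes < 1) {
        throw new InvalidArgumentException('Need at least one master-eligible node.');
    }
    return intdiv($masterEligibleNodes, 2) + 1;
}

/**
 * Number of master-eligible nodes that can fail while a majority remains.
 */
function tolerableMasterFailures(int $masterEligibleNodes): int
{
    return $masterEligibleNodes - quorumSize($masterEligibleNodes);
}

foreach ([1, 2, 3, 5] as $n) {
    printf(
        "%d master-eligible node(s): quorum %d, tolerates %d failure(s)\n",
        $n,
        quorumSize($n),
        tolerableMasterFailures($n)
    );
}
```

Two master-eligible nodes tolerate no failures at all, which is why three is the practical minimum.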
DigitalOcean Droplet Configuration
We’ll provision Droplets with sufficient RAM and CPU for Elasticsearch. Elasticsearch needs memory for both the JVM heap and the operating system's filesystem cache, so Droplets with at least 8GB RAM are recommended. For production, consider dedicated CPU instances. Network latency is also a critical factor: placing all nodes within the same DigitalOcean datacenter region is the simplest choice, while spreading them across nearby datacenters raises availability but requires careful consideration of the latency impact.
Example Droplet setup:
- Master Nodes (3x): `c-4` (4 vCPU, 8GB RAM) or larger. These nodes will run with node.roles: [ master ].
- Data Nodes (2x+): `c-4` or larger, depending on data volume and query load. These nodes will run with node.roles: [ data ].
- Coordinating Nodes (1x+): `c-2` (2 vCPU, 4GB RAM) or larger. These nodes will run with node.roles: [ ingest ] if they should also execute ingest pipelines, or with an empty list, node.roles: [ ], if ingest is handled elsewhere. An empty role list is what makes a node coordinating-only; there is no `search` role on a self-managed cluster.
Elasticsearch Configuration for HA
The core of Elasticsearch’s HA lies in its configuration file, elasticsearch.yml. We need to configure discovery, cluster naming, and node roles correctly on each node.
Master Node Configuration (Example: /etc/elasticsearch/elasticsearch.yml)
On each of the three master nodes:
cluster.name: "my-prod-cluster"
node.name: "master-node-1"   # Unique for each master node
node.roles: [ master ]
network.host: 0.0.0.0        # Binds every interface; prefer the Droplet's private IP and firewall ports 9200/9300
discovery.seed_hosts:
  - "192.168.1.10:9300"   # IP of master-node-1
  - "192.168.1.11:9300"   # IP of master-node-2
  - "192.168.1.12:9300"   # IP of master-node-3
# Only needed for the very first bootstrap of the cluster; remove it once the cluster has formed
cluster.initial_master_nodes:
  - "master-node-1"
  - "master-node-2"
  - "master-node-3"
http.port: 9200
transport.port: 9300
# indices.memory.index_buffer_size is not set here: dedicated master nodes hold no shards
# JVM heap size: typically 50% of system RAM, up to 30GB
# Edit the jvm.options file for this
Data Node Configuration (Example: /etc/elasticsearch/elasticsearch.yml)
On each data node:
cluster.name: "my-prod-cluster"
node.name: "data-node-1"   # Unique for each data node
node.roles: [ data ]
network.host: 0.0.0.0      # Binds every interface; prefer the Droplet's private IP and firewall ports 9200/9300
discovery.seed_hosts:
  - "192.168.1.10:9300"   # IP of master-node-1
  - "192.168.1.11:9300"   # IP of master-node-2
  - "192.168.1.12:9300"   # IP of master-node-3
# cluster.initial_master_nodes is intentionally absent: it belongs only on
# master-eligible nodes, and only for the first bootstrap of the cluster
http.port: 9200
transport.port: 9300
# Indexing buffer shared by all shards on this node. The default is 10% of heap;
# 50% is aggressive and only makes sense for indexing-heavy workloads
indices.memory.index_buffer_size: "50%"
# JVM heap size: typically 50% of system RAM, up to 30GB
# Edit the jvm.options file for this
Coordinating Node Configuration (Example: /etc/elasticsearch/elasticsearch.yml)
On each coordinating node:
cluster.name: "my-prod-cluster"
node.name: "coord-node-1"   # Unique for each coordinating node
node.roles: [ ]             # Empty list = coordinating-only; use [ ingest ] if using ingest pipelines
network.host: 0.0.0.0       # Binds every interface; prefer the Droplet's private IP and firewall ports 9200/9300
discovery.seed_hosts:
  - "192.168.1.10:9300"   # IP of master-node-1
  - "192.168.1.11:9300"   # IP of master-node-2
  - "192.168.1.12:9300"   # IP of master-node-3
# cluster.initial_master_nodes is intentionally absent: it belongs only on
# master-eligible nodes, and only for the first bootstrap of the cluster
http.port: 9200
transport.port: 9300
# Coordinating nodes do not need large heap sizes, but sufficient for request handling
# JVM heap size: adjust as needed, e.g., 2GB
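The heap guideline mentioned in the configuration comments (about 50% of system RAM, capped at 30GB) can be turned into concrete `-Xms`/`-Xmx` values. The following is a minimal sketch of that rule only; the helper names are hypothetical, and dedicated coordinating nodes may deliberately use less than the result.

```php
<?php
declare(strict_types=1);

const HEAP_CAP_MB = 30 * 1024; // 30GB ceiling taken from the guideline above

/**
 * Heap size in MB following the "half of RAM, at most 30GB" rule of thumb.
 * Hypothetical helper for illustration only.
 */
function recommendedHeapMb(int $systemRamMb): int
{
    if ($systemRamMb <= 0) {
        throw new InvalidArgumentException('RAM must be a positive number of megabytes.');
    }
    return min(intdiv($systemRamMb, 2), HEAP_CAP_MB);
}

/**
 * JVM flags with identical minimum and maximum heap,
 * so the heap is never resized at runtime.
 *
 * @return string[]
 */
function jvmHeapOptions(int $systemRamMb): array
{
    $heap = recommendedHeapMb($systemRamMb);
    return ["-Xms{$heap}m", "-Xmx{$heap}m"];
}

// An 8GB Droplet, as used for the master and data nodes in this article
echo implode("\n", jvmHeapOptions(8 * 1024)), "\n";
```

The two printed lines can be placed in the JVM options file of the node in question.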
Shard Allocation and Replication
To ensure data availability and resilience, we must configure index settings for replica shards. A minimum of one replica shard per primary shard is essential for failover. For higher availability, two or more replicas are recommended.
This can be set at index creation time or updated dynamically. For dynamic updates:
curl -X PUT "localhost:9200/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 2
  }
}
'
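This command sets the number of replicas to 2 for all existing indices. It does not affect indices created later; for those, set number_of_replicas at creation time or in an index template. Elasticsearch distributes the replica shards across different data nodes and never places two copies of the same shard on one node, so two replicas require at least three data nodes. With only two data nodes, use one replica, otherwise the extra replicas stay unassigned and the cluster remains yellow. If a data node fails, Elasticsearch promotes a replica shard on another available node to primary, ensuring data continuity.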
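Because Elasticsearch does not allocate a primary and its replica (or two replicas of the same shard) to the same node, the replica count has to fit the number of data nodes. The sketch below captures just that constraint. It assumes all data nodes are healthy and ignores allocation filters and disk watermarks, and the function names are made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Highest replica count that can be fully allocated on the given number of data nodes.
 * Each copy of a shard (1 primary + N replicas) needs its own node.
 */
function maxAllocatableReplicas(int $dataNodes): int
{
    return max(0, $dataNodes - 1);
}

/**
 * Rough expectation of cluster health for a replica setting,
 * assuming healthy nodes and no other allocation constraints.
 */
function expectedHealth(int $dataNodes, int $replicas): string
{
    if ($dataNodes < 1) {
        return 'red'; // no node can hold even the primaries
    }
    return $replicas <= maxAllocatableReplicas($dataNodes) ? 'green' : 'yellow';
}

foreach ([[2, 1], [2, 2], [3, 2]] as [$nodes, $replicas]) {
    printf(
        "%d data node(s), %d replica(s): expect %s\n",
        $nodes,
        $replicas,
        expectedHealth($nodes, $replicas)
    );
}
```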
PHP Application Integration and Failover Strategy
The PHP application needs to be aware of the Elasticsearch cluster and handle potential connection failures gracefully. This involves configuring the Elasticsearch client and implementing retry mechanisms or fallback strategies.
PHP Elasticsearch Client Configuration
We’ll use the official Elasticsearch PHP client. The key is to provide multiple hosts to the client, allowing it to attempt connections to different nodes in the cluster.
<?php
require 'vendor/autoload.php';

// The namespaces below match the 7.x line of the official client;
// the 8.x client uses the Elastic\Elasticsearch namespace instead.
use Elasticsearch\ClientBuilder;

$hosts = [
    'http://192.168.1.20:9200', // Coordinating Node 1
    'http://192.168.1.21:9200', // Coordinating Node 2
    'http://192.168.1.10:9200', // Master Node 1 (can also serve HTTP)
];

$client = ClientBuilder::create()
    ->setHosts($hosts)
    ->build();

try {
    // Example search query
    $params = [
        'index' => 'my_index',
        'body'  => [
            'query' => [
                'match' => ['title' => 'elasticsearch'],
            ],
        ],
    ];
    $response = $client->search($params);
    // Process $response
    print_r($response);
} catch (\Elasticsearch\Common\Exceptions\NoNodesAvailableException $e) {
    // Handle the case where no Elasticsearch nodes are reachable
    error_log("Elasticsearch is unavailable: " . $e->getMessage());
    // Implement fallback strategy here
} catch (\Exception $e) {
    // Handle other potential Elasticsearch exceptions
    error_log("An error occurred with Elasticsearch: " . $e->getMessage());
}
?>
In this example, the client is configured with a list of potential Elasticsearch endpoints. The client library automatically handles load balancing and failover among these hosts. If a node becomes unresponsive, the client will try the next available host in the list.
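On top of the client's own host failover, an application can wrap a call in a small retry loop with exponential backoff so that a brief outage does not immediately surface as an error. The helper below is a generic sketch, not part of the Elasticsearch client: `retryWithBackoff`, its parameters, and the injectable `$sleep` callable are all assumptions made for illustration. In real use, `$retryOn` would be narrowed to the client's connectivity exception.

```php
<?php
declare(strict_types=1);

/**
 * Runs $operation until it succeeds or $maxAttempts is reached.
 * Only exceptions that are instances of $retryOn are retried.
 * $sleep receives the delay in milliseconds; injecting it keeps tests fast.
 */
function retryWithBackoff(
    callable $operation,
    int $maxAttempts = 3,
    int $baseDelayMs = 100,
    string $retryOn = Exception::class,
    ?callable $sleep = null
) {
    if ($maxAttempts < 1) {
        throw new InvalidArgumentException('maxAttempts must be at least 1.');
    }
    $sleep = $sleep ?? static function (int $ms): void {
        usleep($ms * 1000);
    };

    $attempt = 0;
    while (true) {
        $attempt++;
        try {
            return $operation();
        } catch (Exception $e) {
            // Give up on non-retryable errors or when attempts are exhausted
            if (!($e instanceof $retryOn) || $attempt >= $maxAttempts) {
                throw $e;
            }
            // Delay doubles each time: base, 2*base, 4*base, ...
            $sleep($baseDelayMs * (2 ** ($attempt - 1)));
        }
    }
}

// Demo with a stand-in operation that fails twice before succeeding
$calls = 0;
$flaky = function () use (&$calls): string {
    $calls++;
    if ($calls < 3) {
        throw new RuntimeException('simulated connection failure');
    }
    return 'search results';
};
$noWait = static function (int $ms): void {
    // Skip real sleeping in the demo
};

echo retryWithBackoff($flaky, 5, 100, RuntimeException::class, $noWait), "\n";
```

With the 7.x client, the wrapped operation would be something like `fn () => $client->search($params)`. Keep the attempt count low so a real outage still reaches the fallback path quickly.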
Implementing a Fallback Strategy
When Elasticsearch is completely unavailable, the application should not simply crash or return a 5xx error to the user without context. A graceful fallback is essential.
Possible fallback strategies include:
- Serving Stale Data: If the application has a local cache (e.g., Redis, Memcached, or even file-based) of frequently accessed search results, it can serve this stale data while Elasticsearch is down.
- Displaying a User-Friendly Message: Inform the user that search functionality is temporarily unavailable.
- Queueing Writes: For write operations (indexing), if Elasticsearch is down, these operations can be queued (e.g., using RabbitMQ or Kafka) and replayed once the cluster is back online.
- Using a Secondary Data Source: In some scenarios, a simpler, less performant, but more resilient data source (e.g., a relational database with full-text search capabilities) could be used as a last resort.
Example Fallback Logic in PHP
<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;
use Elasticsearch\Common\Exceptions\NoNodesAvailableException;

// The placeholder cache below uses $_SESSION, which only persists after session_start()
session_start();

// Assume $cacheService is an instance of your caching mechanism (e.g., Redis client)
// Assume $queueService is an instance of your message queue client (e.g., RabbitMQ client)

$hosts = [
    'http://192.168.1.20:9200',
    'http://192.168.1.21:9200',
    'http://192.168.1.10:9200',
];

$client = ClientBuilder::create()
    ->setHosts($hosts)
    ->build();

function performSearch($client) {
    $params = [
        'index' => 'my_index',
        'body'  => [
            'query' => [
                'match' => ['title' => 'elasticsearch'],
            ],
        ],
    ];
    return $client->search($params);
}

function getFromCache($key) {
    // Placeholder for cache retrieval logic
    return $_SESSION[$key] ?? null; // Example using $_SESSION for simplicity
}

function saveToCache($key, $data, $ttl = 300) {
    // Placeholder for cache storage logic; $ttl is unused here because
    // $_SESSION has no per-key expiry
    $_SESSION[$key] = $data;
    // In a real app, use Redis/Memcached with TTL
}

function displayErrorToUser($message) {
    // Placeholder for user-facing error display
    echo "<div style='color: red;'>" . htmlspecialchars($message) . "</div>";
}

function queueWriteOperation($operationData) {
    // Placeholder for queuing logic
    error_log("Queuing write operation: " . json_encode($operationData));
    // $queueService->publish('elasticsearch_writes', json_encode($operationData));
}

try {
    $response = performSearch($client);
    // Cache the successful search results
    saveToCache('search_results_elasticsearch', $response);
    // Process and display $response
    print_r($response);
} catch (NoNodesAvailableException $e) {
    error_log("Elasticsearch is unavailable: " . $e->getMessage());
    // Attempt to retrieve from cache
    $cachedResults = getFromCache('search_results_elasticsearch');
    if ($cachedResults) {
        echo "<p>Displaying cached search results as Elasticsearch is currently unavailable.</p>";
        print_r($cachedResults);
    } else {
        displayErrorToUser("Search functionality is temporarily unavailable. Please try again later.");
        // Optionally, log this event for monitoring
    }
} catch (\Exception $e) {
    error_log("An error occurred with Elasticsearch: " . $e->getMessage());
    displayErrorToUser("An unexpected error occurred during search. Please try again later.");
}

// Example of handling a write operation (indexing)
function indexDocument($client, $index, $id, $doc) {
    try {
        $params = [
            'index' => $index,
            'id'    => $id,
            'body'  => $doc,
        ];
        $client->index($params);
        error_log("Document indexed successfully: {$index}/{$id}");
    } catch (NoNodesAvailableException $e) {
        error_log("Elasticsearch unavailable for indexing: " . $e->getMessage());
        // Queue the write operation for later replay
        queueWriteOperation(['action' => 'index', 'index' => $index, 'id' => $id, 'doc' => $doc]);
    } catch (\Exception $e) {
        error_log("Error indexing document {$index}/{$id}: " . $e->getMessage());
        // Decide on fallback for other errors, possibly queueing as well.
        // Caution: this also queues requests that Elasticsearch rejected (e.g., mapping
        // errors), which will fail again on replay unless they are handled separately.
        queueWriteOperation(['action' => 'index', 'index' => $index, 'id' => $id, 'doc' => $doc]);
    }
}

// Example usage of indexDocument
// indexDocument($client, 'my_index', 'doc_123', ['title' => 'New Article', 'content' => '...']);
?>
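Queued writes are only useful if something replays them once the cluster is reachable again. The sketch below shows one way to do that in order, stopping at the first failure so that nothing is lost or reordered. It uses an in-memory `SplQueue` as a stand-in for RabbitMQ or Kafka, and `replayQueuedWrites` and the `$indexer` callable are hypothetical names, not part of the code above.

```php
<?php
declare(strict_types=1);

/**
 * Replays queued operations oldest-first through $indexer.
 * Stops at the first failure and leaves the failed operation at the head
 * of the queue, so a later run can resume from the same point.
 *
 * @return int number of operations replayed successfully
 */
function replayQueuedWrites(SplQueue $queue, callable $indexer): int
{
    $replayed = 0;
    while (!$queue->isEmpty()) {
        $operation = $queue->bottom(); // peek at the oldest entry without removing it
        try {
            $indexer($operation);
        } catch (Exception $e) {
            break; // cluster is probably still down; retry on the next run
        }
        $queue->dequeue(); // remove only after the write succeeded
        $replayed++;
    }
    return $replayed;
}

// Demo: three queued index operations and a stand-in indexer that always succeeds
$queue = new SplQueue();
foreach (['doc_1', 'doc_2', 'doc_3'] as $id) {
    $queue->enqueue(['action' => 'index', 'index' => 'my_index', 'id' => $id, 'doc' => ['title' => $id]]);
}
$indexed = [];
$count = replayQueuedWrites($queue, function (array $op) use (&$indexed): void {
    $indexed[] = $op['id']; // a real indexer would call the Elasticsearch client here
});
printf("Replayed %d operation(s), %d left in queue\n", $count, count($queue));
```

Because a write can be delivered twice if the worker crashes between indexing and dequeuing, replayed operations should be idempotent, which indexing by explicit document ID is.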
Automated Failover Orchestration with DigitalOcean and External Tools
While Elasticsearch and the PHP client handle internal node failover and connection retries, true disaster recovery often requires orchestrating failover at the infrastructure level, especially if an entire DigitalOcean region becomes unavailable.
Load Balancer Configuration (HAProxy)
A highly available setup for the PHP application itself is paramount. Deploying PHP applications behind a load balancer like HAProxy, running on a separate, resilient Droplet (or a managed DigitalOcean Load Balancer), is standard practice. This load balancer should point to multiple instances of your PHP application.
For Elasticsearch, you might also place a load balancer in front of your coordinating nodes. This adds another layer of abstraction and failover capability.
HAProxy Configuration for Elasticsearch Coordinating Nodes
# /etc/haproxy/haproxy.cfg
global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

frontend http_in
    bind *:80
    acl is_elasticsearch_api hdr(Host) -i api.yourdomain.com
    use_backend elasticsearch_backend if is_elasticsearch_api
    default_backend php_app_backend

backend php_app_backend
    balance roundrobin
    server app1 192.168.1.30:80 check
    server app2 192.168.1.31:80 check

backend elasticsearch_backend
    balance roundrobin
    # Health check for Elasticsearch nodes (coordinating nodes):
    # a node is healthy when GET / returns a 2xx or 3xx response
    option httpchk GET /
    # Ensure these IPs are your coordinating nodes
    server coord1 192.168.1.20:9200 check port 9200 inter 2s fall 3 rise 2
    server coord2 192.168.1.21:9200 check port 9200 inter 2s fall 3 rise 2
    # Optionally add master nodes if they also serve HTTP and you want them in rotation
    # server master1 192.168.1.10:9200 check port 9200 inter 2s fall 3 rise 2
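In this HAProxy configuration, requests to api.yourdomain.com are routed to the Elasticsearch coordinating nodes. HAProxy monitors the health of these nodes and automatically removes unhealthy ones from the rotation. The PHP client should be pointed at api.yourdomain.com (resolving to the HAProxy address) rather than at the bare IP, because the routing rule matches on the Host header; requests without that header fall through to the PHP application backend. Configured this way, the client benefits from HAProxy's load balancing and failover.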
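The `inter 2s fall 3 rise 2` parameters mean that a server is checked every two seconds, taken out of rotation after three consecutive failed checks, and brought back after two consecutive successful ones. The small PHP class below models that state machine to make the behaviour concrete. It is a simplified illustration written for this article, not HAProxy code, and the class name is made up.

```php
<?php
declare(strict_types=1);

/**
 * Simplified model of HAProxy's rise/fall health-check thresholds.
 */
final class BackendHealth
{
    private bool $up = true;
    private int $streak = 0; // consecutive results that contradict the current state
    private int $fall;
    private int $rise;

    public function __construct(int $fall = 3, int $rise = 2)
    {
        $this->fall = $fall;
        $this->rise = $rise;
    }

    /**
     * Records one check result and returns whether the server is in rotation afterwards.
     */
    public function record(bool $checkPassed): bool
    {
        if ($checkPassed === $this->up) {
            $this->streak = 0; // result agrees with current state, reset the counter
            return $this->up;
        }
        $this->streak++;
        $threshold = $this->up ? $this->fall : $this->rise;
        if ($this->streak >= $threshold) {
            $this->up = !$this->up; // enough consecutive contradictions: flip state
            $this->streak = 0;
        }
        return $this->up;
    }
}

// Three failures take the server out; two successes bring it back
$server = new BackendHealth(3, 2);
foreach ([false, false, false, true, true] as $result) {
    printf("check %s -> %s\n", $result ? 'passed' : 'failed', $server->record($result) ? 'UP' : 'DOWN');
}
```

With a two-second interval, this means a dead coordinating node keeps receiving traffic for roughly six seconds before it is removed, which is the window the PHP-side fallback has to cover.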
Cross-Region Disaster Recovery
For true DR against a regional outage, a multi-region deployment is necessary. This involves replicating your Elasticsearch data and application infrastructure to a secondary DigitalOcean region.
Elasticsearch Cross-Cluster Replication (CCR):
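Elasticsearch’s Cross-Cluster Replication (CCR) allows you to replicate indices from a primary (leader) cluster in one region to a secondary (follower) cluster in another region. This is a powerful feature for DR, ensuring that your data is available in a separate geographical location. Two points matter for planning: CCR requires a subscription level that includes it (it is not part of the free Basic license), and it is active-passive, meaning follower indices are read-only until they are explicitly promoted during a failover.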
# CCR is configured on the secondary (follower) cluster, which pulls changes from the primary (leader).
# Nodes in the follower cluster need the remote_cluster_client role for this.

# 1. On the secondary cluster (e.g., SFO3), register the primary cluster (e.g., NYC1) as a remote.
#    In elasticsearch.yml on SFO3 cluster nodes:
# cluster.remote.nyc1_cluster.seeds: "192.168.2.10:9300,192.168.2.11:9300"

# 2. Then on the SFO3 cluster, create the follower index with the follow API:
curl -X PUT "localhost:9200/my_leader_index_follower/_ccr/follow?wait_for_active_shards=1" -H 'Content-Type: application/json' -d'
{
  "remote_cluster": "nyc1_cluster",
  "leader_index": "my_leader_index"
}
'

# 3. Optionally, on the SFO3 cluster, add an auto-follow pattern so that new leader indices
#    matching a pattern are followed automatically (the pattern name in the URL is arbitrary):
curl -X PUT "localhost:9200/_ccr/auto_follow/nyc1_auto_follow" -H 'Content-Type: application/json' -d'
{
  "remote_cluster": "nyc1_cluster",
  "leader_index_patterns": ["my_leader_index*"],
  "follow_index_pattern": "{{leader_index}}_follower"
}
'

# Both calls require the remote cluster connection to be set up first.
# The actual CCR setup involves more detailed configuration of remote cluster access and security.
Application Deployment and DNS Failover:
Your PHP application instances should also be deployed in the secondary region. A global DNS provider (like DigitalOcean’s managed DNS or a third-party like Cloudflare) can be configured to point your application’s primary domain to the load balancer in the primary region. In the event of a regional outage, you can manually or automatically update the DNS records to point to the load balancer in the secondary region.
Automated DNS failover can be achieved using health checks provided by DNS services or by custom scripts that monitor the primary region’s health and trigger DNS updates via API calls.
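The decision part of such a custom script can be very small: probe each region's health endpoint in priority order and pick the first one that answers. The sketch below shows only that decision. The region names, URLs, and function names are placeholders, and the actual DNS update (an API call to your DNS provider) is deliberately left out because it depends on the provider.

```php
<?php
declare(strict_types=1);

/**
 * Returns the name of the first healthy region, or null if none responds.
 *
 * @param array<string,string> $regionsByPriority region name => health-check URL, primary first
 * @param callable             $isHealthy         function (string $url): bool
 */
function chooseActiveRegion(array $regionsByPriority, callable $isHealthy): ?string
{
    foreach ($regionsByPriority as $name => $healthUrl) {
        if ($isHealthy($healthUrl)) {
            return (string) $name;
        }
    }
    return null;
}

/**
 * Example probe: healthy if the URL answers with a success status within 2 seconds.
 * Requires allow_url_fopen; swap in cURL or another HTTP client as needed.
 */
function httpProbe(string $url): bool
{
    $context = stream_context_create(['http' => ['timeout' => 2]]);
    return @file_get_contents($url, false, $context) !== false;
}

// Demo with a fake probe that simulates a primary-region outage
$regions = [
    'primary'   => 'http://primary-lb.example.com/health',   // placeholder URL
    'secondary' => 'http://secondary-lb.example.com/health', // placeholder URL
];
$fakeProbe = static function (string $url): bool {
    return strpos($url, 'secondary') !== false;
};

$active = chooseActiveRegion($regions, $fakeProbe);
echo "Traffic should go to: ", $active ?? 'none (all regions down)', "\n";
// A real script would now update the DNS record if $active differs from the current target.
```

In practice the probe result should be debounced over several runs before DNS is changed, and the record's TTL kept short, so that one slow response does not trigger a cross-region switch.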
Monitoring and Alerting
Robust monitoring is non-negotiable for any HA/DR strategy. Tools like Prometheus and Grafana, or DigitalOcean’s integrated monitoring, should be used to track:
- Elasticsearch cluster health (green, yellow, red status).
- Node status (up/down, CPU, memory, disk usage).
- Network latency between nodes and regions.
- Application error rates and response times.
- HAProxy backend health.
Alerting should be configured for critical thresholds and failures, notifying the operations team immediately when issues arise, allowing for prompt intervention or automated failover procedures.
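As a starting point for such alerts, the response of Elasticsearch's `_cluster/health` endpoint can be mapped to a severity level. The sketch below works on the JSON body only, so it can be fed from any HTTP client or monitoring agent. The function name, the severity labels, and the `$expectedNodes` parameter are assumptions for illustration; `status`, `number_of_nodes`, and `unassigned_shards` are fields of the health response.

```php
<?php
declare(strict_types=1);

/**
 * Maps a _cluster/health JSON body to 'ok', 'warning', or 'critical'.
 * Unparseable or unexpected input is treated as critical.
 */
function alertSeverity(string $healthJson, int $expectedNodes): string
{
    try {
        $health = json_decode($healthJson, true, 512, JSON_THROW_ON_ERROR);
    } catch (JsonException $e) {
        return 'critical';
    }
    if (!is_array($health)) {
        return 'critical';
    }

    $status = $health['status'] ?? 'red';
    $nodes = (int) ($health['number_of_nodes'] ?? 0);
    $unassigned = (int) ($health['unassigned_shards'] ?? 0);

    // Red (or anything unknown) means at least one primary shard is unavailable
    if ($status !== 'green' && $status !== 'yellow') {
        return 'critical';
    }
    // Yellow, a missing node, or unassigned shards means reduced redundancy
    if ($status === 'yellow' || $nodes < $expectedNodes || $unassigned > 0) {
        return 'warning';
    }
    return 'ok';
}

// Demo with a hand-written sample body
$sample = '{"cluster_name":"my-prod-cluster","status":"yellow","number_of_nodes":5,"unassigned_shards":4}';
echo alertSeverity($sample, 6), "\n";
```

A warning is the moment to act: the cluster is still serving requests, but one more node failure could turn it red.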