Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and PHP Deployments on Linode

Elasticsearch Cluster Setup for High Availability

Automated failover for Elasticsearch requires a multi-node cluster. We’ll focus on a setup that relies on two built-in mechanisms: master election among master-eligible nodes, which keeps the cluster coordinated when a node is lost, and shard replication, which keeps a copy of every shard on more than one node. For this example, we’ll assume a Linode environment with at least three Elasticsearch nodes, so that a majority of master-eligible nodes survives the loss of any one of them. Each node runs a recent version of Elasticsearch (e.g., 7.x or 8.x).
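The reason for starting at three nodes is majority arithmetic: a master can only be elected while more than half of the master-eligible nodes can see each other. The sketch below works through that arithmetic. The function names are illustrative and not part of Elasticsearch, which manages its voting configuration on its own.

```shell
#!/usr/bin/env bash

# Majority needed to elect a master among N master-eligible nodes.
quorum_size() {
    local nodes="$1"
    echo $(( nodes / 2 + 1 ))
}

# Number of master-eligible nodes that can fail while a majority remains.
tolerated_failures() {
    local nodes="$1"
    echo $(( nodes - (nodes / 2 + 1) ))
}

for n in 1 2 3 5; do
    echo "nodes=${n} quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

With two nodes the tolerated failure count is still zero, which is why a third node is the smallest useful step up.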

Each Elasticsearch node must be configured to participate in cluster formation. The critical parameters are cluster.name, node.name, discovery.seed_hosts, and cluster.initial_master_nodes.

Elasticsearch Node Configuration (`elasticsearch.yml`)

On each Elasticsearch node, modify the elasticsearch.yml file. The cluster.name must be identical on all nodes, and each node.name must be unique. The discovery.seed_hosts setting should list the IP addresses or hostnames of all master-eligible nodes. cluster.initial_master_nodes is used only once, to bootstrap a brand-new cluster: it names the nodes that form the first master quorum, and the names must match the node.name values exactly. Once the cluster has formed, remove this setting from every node's configuration, and do not set it on nodes that later join the existing cluster.

Node 1 Configuration

cluster.name: my-production-cluster
node.name: es-node-1
# Elasticsearch 7.9+ syntax; node.master and node.data are deprecated there and removed in 8.0
node.roles: [ master, data ]
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - "192.168.1.101:9300" # IP of es-node-1
  - "192.168.1.102:9300" # IP of es-node-2
  - "192.168.1.103:9300" # IP of es-node-3

cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"

# For production, consider security settings, JVM heap size, and shard allocation awareness
# xpack.security.enabled: true
# bootstrap.memory_lock: true
# cluster.routing.allocation.awareness.attributes: zone
# node.attr.zone: zone-a   # illustrative value; set per node

Node 2 Configuration

cluster.name: my-production-cluster
node.name: es-node-2
# Elasticsearch 7.9+ syntax; node.master and node.data are deprecated there and removed in 8.0
node.roles: [ master, data ]
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - "192.168.1.101:9300"
  - "192.168.1.102:9300"
  - "192.168.1.103:9300"

cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"

# xpack.security.enabled: true
# bootstrap.memory_lock: true
# cluster.routing.allocation.awareness.attributes: zone
# node.attr.zone: zone-b   # illustrative value; set per node

Node 3 Configuration

cluster.name: my-production-cluster
node.name: es-node-3
# Elasticsearch 7.9+ syntax; node.master and node.data are deprecated there and removed in 8.0
node.roles: [ master, data ]
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - "192.168.1.101:9300"
  - "192.168.1.102:9300"
  - "192.168.1.103:9300"

cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"

# xpack.security.enabled: true
# bootstrap.memory_lock: true
# cluster.routing.allocation.awareness.attributes: zone
# node.attr.zone: zone-c   # illustrative value; set per node
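The three files above differ only in node.name, so generating them from one source avoids drift between nodes. Here is a minimal sketch of that idea. The function name render_es_config is an assumption for illustration, and the output uses node.roles, which assumes Elasticsearch 7.9 or later (on older 7.x releases, substitute node.master and node.data).

```shell
#!/usr/bin/env bash

CLUSTER_NAME="my-production-cluster"
SEED_HOSTS=("192.168.1.101:9300" "192.168.1.102:9300" "192.168.1.103:9300")
MASTER_NODES=("es-node-1" "es-node-2" "es-node-3")

# Print an elasticsearch.yml for the given node name on stdout.
render_es_config() {
    local node_name="$1"
    echo "cluster.name: ${CLUSTER_NAME}"
    echo "node.name: ${node_name}"
    echo "node.roles: [ master, data ]"
    echo "network.host: 0.0.0.0"
    echo "http.port: 9200"
    echo "transport.port: 9300"
    echo "discovery.seed_hosts:"
    printf '  - "%s"\n' "${SEED_HOSTS[@]}"
    # Only needed for the very first bootstrap of the cluster
    echo "cluster.initial_master_nodes:"
    printf '  - "%s"\n' "${MASTER_NODES[@]}"
}

render_es_config "es-node-2"
```

Redirecting the output, for example `render_es_config "es-node-2" > elasticsearch.yml`, produces the file to copy to that node.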

Shard Replication and Allocation

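To keep data available when a node fails, configure indices with an appropriate replica count. A number_of_replicas of at least 1 is recommended, meaning each primary shard has one copy on a different node; Elasticsearch never places a replica on the same node as its primary. For disaster recovery across availability zones (if your Linode setup supports it), consider shard allocation awareness: tag each node with a custom attribute such as node.attr.zone, and set cluster.routing.allocation.awareness.attributes to that attribute name. Awareness is a cluster-level setting, not an index setting.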
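For an index that already exists, the replica count can be changed through the update index settings API. The sketch below builds that request body and shows the shard arithmetic. The helper names and the ES_URL default are assumptions for illustration, and set_replicas is defined but not called because it needs a reachable cluster.

```shell
#!/usr/bin/env bash

ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# Build the JSON body for PUT /<index>/_settings.
replica_settings_body() {
    local replicas="$1"
    printf '{"index":{"number_of_replicas":%d}}' "$replicas"
}

# Total shard copies in the cluster = primaries * (1 + replicas).
total_shard_copies() {
    local primaries="$1" replicas="$2"
    echo $(( primaries * (1 + replicas) ))
}

# Apply the replica count to one index (requires a running cluster).
set_replicas() {
    local index="$1" replicas="$2"
    curl -s -X PUT "${ES_URL}/${index}/_settings" \
        -H "Content-Type: application/json" \
        -d "$(replica_settings_body "$replicas")"
}

replica_settings_body 1; echo
total_shard_copies 3 1
```

Three primaries with one replica each give six shard copies, or two per node on a three-node cluster.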

Index Template Example

PUT _index_template/default_template
{
  "index_patterns": ["*"],
  "template": {
    "settings": {
      "index.number_of_shards": 3,
      "index.number_of_replicas": 1,
      "index.routing.allocation.require._name": "es-node-*"
    }
  }
}

After configuring and starting all Elasticsearch nodes, verify cluster health using the _cat/health and _cat/nodes APIs. Green means every primary and replica shard is allocated. Yellow means all primary shards are allocated but at least one replica is not: the cluster still serves reads and writes, but some data has no redundant copy, so treat a persistent yellow as a warning rather than a healthy steady state. Red means at least one primary shard is unallocated, which is a critical failure.
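The three statuses map naturally onto exit codes that a cron job or monitoring check can act on. The following is a sketch under assumptions: the function names and the ES_URL default are illustrative, and the exit-code convention (0 ok, 1 warning, 2 critical, 3 unknown) is borrowed from Nagios-style checks rather than anything Elasticsearch defines.

```shell
#!/usr/bin/env bash

ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# Map a cluster status word to a Nagios-style exit code.
classify_health() {
    case "$1" in
        green)  return 0 ;;   # all shards allocated
        yellow) return 1 ;;   # primaries allocated, some replicas missing
        red)    return 2 ;;   # at least one primary unallocated
        *)      return 3 ;;   # empty or unexpected: cluster unreachable?
    esac
}

# Fetch only the status word from the cluster (requires a running cluster).
fetch_health() {
    curl -s --max-time 10 "${ES_URL}/_cat/health?h=status" | tr -d '[:space:]'
}

# Demonstrate the mapping without contacting a cluster.
for status in green yellow red ""; do
    rc=0
    classify_health "$status" || rc=$?
    echo "status='${status}' -> code ${rc}"
done
```

Against a live cluster, the check would be `classify_health "$(fetch_health)"`.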

PHP Application Integration with Elasticsearch

Your PHP application needs to be resilient to Elasticsearch node failures. This involves implementing proper connection handling, retry mechanisms, and potentially load balancing or service discovery for your Elasticsearch clients.

PHP Elasticsearch Client Configuration

We’ll use the official Elasticsearch PHP client. The key is to provide multiple hosts to the client, allowing it to fail over automatically to another node if the one it is using becomes unreachable. For more advanced scenarios, consider integrating with a service discovery mechanism or a dedicated load balancer.

Basic PHP Client Setup with Multiple Hosts

<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

$hosts = [
    'http://192.168.1.101:9200', // IP of es-node-1
    'http://192.168.1.102:9200', // IP of es-node-2
    'http://192.168.1.103:9200'  // IP of es-node-3
];

// These builder methods are from the 7.x client (elasticsearch/elasticsearch ^7);
// the 8.x client uses the Elastic\Elasticsearch namespace and a different node pool API.
$client = ClientBuilder::create()
    ->setHosts($hosts)
    // Retry a failed request on other nodes up to 3 times
    ->setRetries(3)
    // Stick to one healthy node and move to the next only when it fails
    ->setSelector('\Elasticsearch\ConnectionPool\Selectors\StickyRoundRobinSelector')
    ->build();

try {
    // Example: Index a document
    $params = [
        'index' => 'my_index',
        'id'    => 'my_document_id',
        'body'  => ['testField' => 'abc']
    ];
    $response = $client->index($params);
    print_r($response);

    // Example: Search
    $searchParams = [
        'index' => 'my_index',
        'body'  => [
            'query' => [
                'match' => ['testField' => 'abc']
            ]
        ]
    ];
    $searchResponse = $client->search($searchParams);
    print_r($searchResponse);

} catch (\Exception $e) {
    // Log the error and handle gracefully
    error_log("Elasticsearch operation failed: " . $e->getMessage());
    // Depending on the operation, you might want to return an error to the user,
    // queue the operation for later, or use a fallback data source.
    echo "An error occurred. Please try again later.";
}
?>
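When debugging client failover from an application server, it helps to reproduce the same "try each host in turn" behavior outside PHP. This shell sketch does that. The function names and the PROBE override are assumptions made for illustration (the override lets the logic run without a cluster), and this is not how the PHP client is implemented internally.

```shell
#!/usr/bin/env bash

# Default probe: succeeds if the node answers on its HTTP port within 3 seconds.
probe_http() {
    curl -s -o /dev/null --max-time 3 "$1"
}

# Print the first host for which the probe succeeds; return 1 if none respond.
first_reachable_host() {
    local probe="${PROBE:-probe_http}" host
    for host in "$@"; do
        if "$probe" "$host"; then
            echo "$host"
            return 0
        fi
    done
    return 1
}

# Demo with a stand-in probe that treats only es-node-2 as alive.
fake_probe() { [ "$1" = "http://192.168.1.102:9200" ]; }
PROBE=fake_probe first_reachable_host \
    "http://192.168.1.101:9200" \
    "http://192.168.1.102:9200" \
    "http://192.168.1.103:9200"
```

Without the PROBE override, the same call checks the real nodes with curl.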

Implementing Application-Level Failover Logic

While the Elasticsearch client handles basic node failover, your PHP application might need more sophisticated logic. This could involve:

Graceful Degradation: If Elasticsearch is unavailable, can your application still serve some content or perform essential functions?
Asynchronous Operations: For non-critical writes, consider queuing them (e.g., using Redis or RabbitMQ) and retrying later when Elasticsearch is healthy.
Fallback Data Sources: In extreme cases, can you fall back to a simpler, less performant data store (e.g., a relational database) for certain queries?
Health Checks: Implement periodic health checks of your Elasticsearch cluster from within your application or via external monitoring tools.

Example: Asynchronous Indexing with a Queue

<?php
// Assuming you have a Redis client configured and connected
// $redis = new Redis();
// $redis->connect('127.0.0.1', 6379);

function indexDocumentAsync(array $documentData, string $indexName, string $documentId) {
    global $redis; // Access the global Redis client instance

    $payload = [
        'index' => $indexName,
        'id'    => $documentId,
        'body'  => $documentData
    ];

    // Push the job to a Redis queue
    $redis->rPush('elasticsearch_queue', json_encode($payload));
    echo "Document queued for indexing.\n";
}

// In your main application logic:
// indexDocumentAsync(['testField' => 'data for async index'], 'my_async_index', 'async_doc_1');

// A separate worker script would process this queue:
function processElasticsearchQueue() {
    global $redis, $client; // $client is your Elasticsearch client instance

    while (true) {
        // Blocking pop: waits until a job arrives, so no extra sleep is needed to avoid busy-waiting
        $job = $redis->blPop('elasticsearch_queue', 0);
        if (!$job) {
            continue;
        }
        $payload = json_decode($job[1], true);
        try {
            $response = $client->index($payload);
            // Log success or handle response if needed
            echo "Successfully indexed document: " . $payload['id'] . "\n";
        } catch (\Exception $e) {
            error_log("Failed to index document " . $payload['id'] . ": " . $e->getMessage());
            // Back off, then put the job back on the queue so it is not lost.
            // In production, cap the number of attempts and move repeat failures to a dead-letter queue.
            sleep(5);
            $redis->rPush('elasticsearch_queue', $job[1]);
        }
    }
}

// To run the worker:
// processElasticsearchQueue();
?>
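A fixed retry delay can hammer a cluster that is still recovering, so retry schedules usually grow with each attempt. The sketch below computes a capped exponential backoff. It is written in shell to match the operational scripts in this article, the function name and default values are assumptions, and the arithmetic ports directly to the PHP worker.

```shell
#!/usr/bin/env bash

# Delay in seconds before retry number ATTEMPT (1-based): base * 2^(attempt-1), capped.
backoff_delay() {
    local attempt="$1" base="${2:-5}" cap="${3:-300}"
    if (( attempt < 1 )); then
        attempt=1
    fi
    # Beyond 20 doublings the result is far past any sensible cap; this also avoids overflow
    if (( attempt > 20 )); then
        echo "$cap"
        return 0
    fi
    local delay=$(( base * (1 << (attempt - 1)) ))
    if (( delay > cap )); then
        delay="$cap"
    fi
    echo "$delay"
}

for attempt in 1 2 3 4 5 6 7 8; do
    echo "attempt ${attempt}: wait $(backoff_delay "$attempt")s"
done
```

With the defaults, the delays run 5, 10, 20, 40, 80, 160 seconds and then stay at the 300-second cap.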

Automated Failover Orchestration with Linode and External Tools

While Elasticsearch and the PHP client provide internal failover mechanisms, true automated failover often requires external orchestration, especially for infrastructure-level events like a Linode instance failure. This involves monitoring and automated recovery actions.

Monitoring Elasticsearch Cluster Health

Implement robust monitoring for your Elasticsearch cluster. This should include:

Node Status: Are all nodes online and part of the cluster?
Cluster Health: Is the cluster status green, yellow, or red?
Shard Allocation: Are all primary and replica shards allocated?
Resource Utilization: CPU, memory, disk I/O, and network traffic on Elasticsearch nodes.

Tools like Prometheus with the Elasticsearch Exporter, Datadog, or Nagios can be used for this. Alerts should be configured to trigger on critical events.
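Cluster status alone does not say which node is gone, and a three-node cluster that loses one node typically reports yellow, not red. Comparing the expected node list against what the cluster reports gives a more direct signal. Here is a minimal sketch, assuming the node names used earlier; the function name is illustrative.

```shell
#!/usr/bin/env bash

EXPECTED_NODES=("es-node-1" "es-node-2" "es-node-3")

# Read current node names on stdin (one per line) and print every
# expected node that is absent from that list.
missing_nodes() {
    local actual name
    # Strip trailing whitespace that column-formatted output may carry
    actual="$(sed 's/[[:space:]]*$//')"
    for name in "${EXPECTED_NODES[@]}"; do
        if ! grep -qxF -- "$name" <<< "$actual"; then
            echo "$name"
        fi
    done
}

# Demo with canned input in place of a live cluster.
printf 'es-node-1\nes-node-3\n' | missing_nodes
```

Against a live cluster, the input would come from `curl -s "http://localhost:9200/_cat/nodes?h=name" | missing_nodes`, and any output is a reason to alert.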

Linode Instance Health Checks and Automated Recovery

Linode’s Shutdown Watchdog (Lassie) automatically reboots an instance that powers off unexpectedly. For more advanced failover, consider:

External Load Balancers: Use Linode’s NodeBalancers or a third-party solution (like HAProxy running on a separate instance) to distribute traffic to your PHP application servers. Configure health checks on the NodeBalancer to remove unhealthy application instances from rotation.
Orchestration Tools: Tools like Ansible, Terraform, or Kubernetes (if you’re using it) can be used to automate the deployment and management of new instances.
Custom Scripts: For specific Linode API-driven failover, you might write custom scripts that are triggered by monitoring alerts. These scripts could:
- Detect a failed Elasticsearch node (e.g., via API checks or monitoring alerts).
- Provision a new Linode instance.
- Configure and join the new instance to the Elasticsearch cluster.
- Update DNS records or load balancer configurations.

Example: Using Linode API and Bash for Node Replacement (Conceptual)

This is a simplified conceptual example. A production-ready solution would require more robust error handling, state management, and security.

#!/bin/bash

# Configuration
LINODE_API_TOKEN="YOUR_LINODE_API_TOKEN"
REGION="us-east"
IMAGE="linode/ubuntu22.04"
PLAN="g6-standard-2" # Or your preferred plan
ELASTICSEARCH_CONFIG_URL="http://your-config-server/elasticsearch.yml"
ELASTICSEARCH_SERVICE_NAME="elasticsearch.service"
FAILED_NODE_IP="192.168.1.101" # The IP of the failed node

# Function to check Elasticsearch cluster health
check_es_health() {
    curl -s "http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=30s"
}

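# Function to get Linode instance ID by IP (requires jq).
# The API does not filter on an "ip" query parameter, so match client-side on each
# instance's ipv4 array. Only the first page of results is checked.
get_instance_id_by_ip() {
    local ip_address="$1"
    curl -s -X GET "https://api.linode.com/v4/linode/instances" \
        -H "Authorization: Bearer ${LINODE_API_TOKEN}" \
        -H "Content-Type: application/json" \
        | jq -r --arg ip "$ip_address" '.data[] | select(.ipv4[] == $ip) | .id'
}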

# Function to create a new Linode instance.
# Progress messages go to stderr so that stdout carries only the new instance ID.
create_new_linode() {
    echo "Creating new Linode instance..." >&2
    local response
    response=$(curl -s -X POST "https://api.linode.com/v4/linode/instances" \
        -H "Authorization: Bearer ${LINODE_API_TOKEN}" \
        -H "Content-Type: application/json" \
        -d '{
            "region": "'"${REGION}"'",
            "image": "'"${IMAGE}"'",
            "type": "'"${PLAN}"'",
            "label": "es-failover-'"$(date +%s)"'",
            "root_pass": "YOUR_SECURE_ROOT_PASSWORD",
            "authorized_keys": ["YOUR_SSH_PUBLIC_KEY"]
        }')
    # Print nothing when the API returns an error body without an id
    echo "$response" | jq -r '.id // empty'
}

# Function to configure and start Elasticsearch on a new node.
# Progress messages and SSH output go to stderr so that stdout carries only the new node's IP.
configure_es_node() {
    local new_instance_id="$1"
    local new_instance_ip
    # The /ips response nests IPv4 addresses under ipv4.public and ipv4.private
    new_instance_ip=$(curl -s -X GET "https://api.linode.com/v4/linode/instances/${new_instance_id}/ips" \
        -H "Authorization: Bearer ${LINODE_API_TOKEN}" \
        -H "Content-Type: application/json" | jq -r '.ipv4.public[0].address // empty')
    [ -n "$new_instance_ip" ] || return 1
    # Derive a unique node name from the IP by replacing dots with dashes
    local node_name="es-node-${new_instance_ip//./-}"

    echo "Configuring new Elasticsearch node at IP: ${new_instance_ip}" >&2

    # Use SSH to provision the new node. The heredoc is unquoted, so the variables
    # below are expanded locally before the script is sent to the remote shell.
    ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 "root@${new_instance_ip}" >&2 << EOF || return 1
        set -e
        # Update and install dependencies (Elasticsearch bundles its own JDK)
        apt-get update -y && apt-get upgrade -y
        apt-get install -y apt-transport-https gnupg wget curl

        # Download and install Elasticsearch (adjust version as needed)
        wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
        echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/7.x/apt stable main" | tee /etc/apt/sources.list.d/elastic-7.x.list
        apt-get update -y && apt-get install -y elasticsearch

        # Download and apply configuration. The template must not contain
        # cluster.initial_master_nodes, because this node joins an existing cluster.
        curl -fsS ${ELASTICSEARCH_CONFIG_URL} -o /etc/elasticsearch/elasticsearch.yml
        # Replace any node.name from the template with a unique one
        sed -i '/^node\.name:/d' /etc/elasticsearch/elasticsearch.yml
        echo "node.name: ${node_name}" >> /etc/elasticsearch/elasticsearch.yml
        # Update discovery.seed_hosts if necessary (e.g., if you have a dynamic list)

        # Enable and start Elasticsearch service
        systemctl daemon-reload
        systemctl enable ${ELASTICSEARCH_SERVICE_NAME}
        systemctl start ${ELASTICSEARCH_SERVICE_NAME}

        # Wait for Elasticsearch to start
        sleep 30

        # Verify cluster join (optional, so a slow start does not fail the whole run)
        curl -s "http://${new_instance_ip}:9200/_cat/health?h=status" || true
EOF
    echo "Elasticsearch configuration complete for ${new_instance_ip}" >&2
    echo "${new_instance_ip}" # The only line on stdout: the IP of the new node
}

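# --- Main Failover Logic ---

echo "Checking Elasticsearch cluster health..."
HEALTH_STATUS=$(check_es_health | jq -r '.status // empty')

# An empty status means the local node did not answer; do not mistake that for a healthy cluster
if [ -z "$HEALTH_STATUS" ]; then
    echo "ERROR: Could not read cluster health from localhost:9200. Aborting." >&2
    exit 1
fi

# Note: losing one node out of three usually shows as yellow, not red, when replicas exist
if [ "$HEALTH_STATUS" == "red" ]; then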
    echo "CRITICAL: Elasticsearch cluster is RED. Initiating failover procedure."

    # 1. Identify the failed node (this is a simplification; real detection is complex)
    #    In a real scenario, you'd use monitoring alerts to identify the specific node.
    echo "Assuming node with IP ${FAILED_NODE_IP} has failed."

    # 2. Remove the failed node from DNS/Load Balancer (if applicable)
    #    This step is highly dependent on your infrastructure.
    echo "TODO: Remove ${FAILED_NODE_IP} from DNS/Load Balancer."

    # 3. Provision a new Linode instance
    NEW_INSTANCE_ID=$(create_new_linode)
    if [ -z "$NEW_INSTANCE_ID" ]; then
        echo "ERROR: Failed to create new Linode instance. Aborting."
        exit 1
    fi
    echo "New Linode instance created with ID: ${NEW_INSTANCE_ID}"

    # Wait for the instance to be ready (Linode API might provide status)
    echo "Waiting for new instance to boot..."
    sleep 120 # Adjust as needed

    # 4. Configure the new instance as an Elasticsearch node
    NEW_NODE_IP=$(configure_es_node "${NEW_INSTANCE_ID}")
    if [ -z "$NEW_NODE_IP" ]; then
        echo "ERROR: Failed to configure new Elasticsearch node. Aborting."
        # Consider cleaning up the created Linode instance
        exit 1
    fi
    echo "New Elasticsearch node is running at ${NEW_NODE_IP}"

    # 5. Update DNS/Load Balancer with the new node's IP
    echo "TODO: Add ${NEW_NODE_IP} to DNS/Load Balancer."

    # 6. Update discovery.seed_hosts on the other nodes to include the new node's address
    #    The running cluster accepts the new node without this change, but the existing nodes
    #    read their seed hosts again on restart, so a stale entry for the failed node should not linger.

    echo "Failover procedure initiated. Monitor cluster health."
else
    echo "Elasticsearch cluster health is ${HEALTH_STATUS}. No immediate action required."
fi
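The fixed sleep 120 in the script either wastes time or is too short. Polling until a condition holds is more reliable. Below is a minimal sketch of a generic polling helper. The names wait_until and linode_is_running are assumptions for illustration, and linode_is_running relies on the status field of the Linode instance object and on jq, as the script above does.

```shell
#!/usr/bin/env bash

# Run a command repeatedly until it succeeds or TIMEOUT seconds have passed.
# Usage: wait_until TIMEOUT INTERVAL command [args...]
wait_until() {
    local timeout="$1" interval="$2"
    shift 2
    local deadline=$(( SECONDS + timeout ))
    until "$@"; do
        if (( SECONDS >= deadline )); then
            return 1
        fi
        sleep "$interval"
    done
}

# Succeeds once the Linode API reports the instance as running
# (requires LINODE_API_TOKEN and network access; not called in this demo).
linode_is_running() {
    local instance_id="$1"
    curl -s "https://api.linode.com/v4/linode/instances/${instance_id}" \
        -H "Authorization: Bearer ${LINODE_API_TOKEN}" \
        | jq -e '.status == "running"' > /dev/null
}

# Demo with a command that succeeds immediately.
wait_until 2 1 true && echo "ready"
```

In the failover script, `wait_until 300 10 linode_is_running "$NEW_INSTANCE_ID"` could stand in for the fixed sleep, with a non-zero return treated as a failed provisioning attempt.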

Conclusion

Architecting for automated failover is a multi-layered approach. It begins with a resilient Elasticsearch cluster configuration, complemented by a robust PHP application client that can handle transient network issues. For true disaster recovery and automated recovery from infrastructure failures, external monitoring and orchestration tools are indispensable. By combining these elements, you can build a highly available Elasticsearch and PHP deployment on Linode that minimizes downtime and ensures business continuity.