Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and C Deployments on OVH

Elasticsearch Cluster Setup for High Availability on OVH

Achieving robust disaster recovery for Elasticsearch hinges on a well-architected, multi-region deployment with automated failover. This section details the foundational setup for an Elasticsearch cluster designed for resilience on OVH’s infrastructure, focusing on replication and shard allocation strategies that facilitate seamless failover.

We’ll assume a primary region (e.g., GRA) and a secondary, geographically distinct region (e.g., BHS) for failover. Each region will host a dedicated Elasticsearch cluster. The key is to ensure data is replicated across these clusters and that a mechanism exists to redirect traffic to the secondary cluster should the primary become unavailable.

Configuring Elasticsearch Replication and Shard Allocation

For disaster recovery, we employ cross-cluster replication (CCR), which replicates indices from one cluster to another. Shard allocation awareness complements it inside each cluster by keeping a primary shard and its replicas in different availability zones within a region. Allocation awareness only operates within a single cluster, so it cannot place replicas across regions; the cross-region copy is CCR’s job.

Cross-Cluster Replication (CCR) Setup

First, establish a secure connection between your primary and secondary Elasticsearch clusters. This typically involves configuring transport layer security (TLS) and setting up remote cluster configurations.

Remote Cluster Configuration (Primary Cluster)

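CCR is pull-based: the follower cluster connects to the leader. The remote definition on the primary is therefore only needed to replicate back from the secondary during failback, but defining it up front keeps both clusters symmetric. In your primary cluster’s elasticsearch.yml, define the remote cluster connection:

cluster.remote.secondary_cluster:
  seeds:
    - secondary-es-node-1.ovh.example.com:9300
    - secondary-es-node-2.ovh.example.com:9300
  skip_unavailable: false
  # Valid modes are sniff (the default) and proxy
  mode: sniff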

Remote Cluster Configuration (Secondary Cluster)

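The secondary cluster hosts the follower indices, so it must be able to reach the primary. In your secondary cluster’s elasticsearch.yml:

cluster.remote.primary_cluster:
  seeds:
    - primary-es-node-1.ovh.example.com:9300
    - primary-es-node-2.ovh.example.com:9300
  skip_unavailable: false
  # Valid modes are sniff (the default) and proxy
  mode: sniff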
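Before creating any follower index, it helps to confirm that the remote connection is actually established. The sketch below queries the `_remote/info` API on the secondary cluster and reports aliases that are missing or disconnected. The helper names are illustrative, and the node URL in the usage comment is the example hostname used in this article.

```python
import json
import urllib.request


def fetch_remote_info(es_url, timeout=5):
    """Return the parsed body of GET /_remote/info from one node."""
    with urllib.request.urlopen(f"{es_url}/_remote/info", timeout=timeout) as resp:
        return json.load(resp)


def disconnected_remotes(remote_info, expected_aliases):
    """Return the expected aliases that are absent or not connected."""
    return [
        alias
        for alias in expected_aliases
        if not remote_info.get(alias, {}).get("connected", False)
    ]


# Usage against a live cluster (not executed here):
#   info = fetch_remote_info("http://secondary-es-node-1.ovh.example.com:9200")
#   print(disconnected_remotes(info, ["primary_cluster"]))
```

An empty list means every expected remote is connected; anything else should be resolved before CCR is configured.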

Configuring CCR Policies

Once the remote clusters are configured, create the follower index through the Elasticsearch API. The request goes to the secondary (follower) cluster and names the primary as the remote cluster. For example, to replicate an index named my-app-logs from the primary cluster to the secondary:

curl -X PUT "http://secondary-es-node-1.ovh.example.com:9200/my-app-logs-replica/_ccr/follow?wait_for_active_shards=1" -H 'Content-Type: application/json' -d'
{
  "remote_cluster": "primary_cluster",
  "leader_index": "my-app-logs"
}'

This command creates a follower index (my-app-logs-replica) on the secondary cluster that continuously replicates data from the leader index (my-app-logs) on the primary cluster. A follower index is read-only until it is promoted to a regular index, so plan for that step in your failover procedure. Ensure your index templates are also replicated or managed to maintain consistent mappings and settings.
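If the application writes to a series of indices rather than one (an assumption; this article only names the single index my-app-logs), an auto-follow pattern can create followers for new leader indices as they appear. The following is a minimal sketch that builds the `PUT /_ccr/auto_follow/<name>` request; the pattern name, the wildcard, and the helper name are hypothetical.

```python
import json
import urllib.request


def build_auto_follow_request(es_url, pattern_name, remote_cluster,
                              leader_patterns, follower_suffix="-replica"):
    """Build (but do not send) a CCR auto-follow pattern request."""
    body = {
        "remote_cluster": remote_cluster,
        "leader_index_patterns": leader_patterns,
        # {{leader_index}} is expanded by Elasticsearch to the leader's name
        "follow_index_pattern": "{{leader_index}}" + follower_suffix,
    }
    return urllib.request.Request(
        url=f"{es_url}/_ccr/auto_follow/{pattern_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Usage against the secondary cluster (not executed here):
#   req = build_auto_follow_request(
#       "http://secondary-es-node-1.ovh.example.com:9200",
#       "app-logs-pattern", "primary_cluster", ["my-app-logs-*"])
#   urllib.request.urlopen(req, timeout=10)
```

Like the follow request, this is sent to the cluster that will host the followers.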

Automated Failover Orchestration with HAProxy and Custom Scripts

Manual failover is not an option for true disaster recovery. We’ll implement an automated failover mechanism using HAProxy as a load balancer and a custom health check script that monitors the primary Elasticsearch cluster. If the primary becomes unhealthy, the script will reconfigure HAProxy to direct traffic to the secondary cluster.

HAProxy Configuration for Elasticsearch

Deploy HAProxy instances in front of both your primary and secondary Elasticsearch clusters. These HAProxy instances will act as the single entry point for your applications.

Primary HAProxy Configuration

The HAProxy configuration (/etc/haproxy/haproxy.cfg) initially routes traffic to the primary Elasticsearch nodes. HAProxy’s own HTTP check polls /_cluster/health on each node and takes unresponsive nodes out of rotation.

frontend elasticsearch_frontend
    bind *:9200
    mode http
    default_backend elasticsearch_backend

backend elasticsearch_backend
    mode http
    balance roundrobin
    option httpchk GET /_cluster/health?pretty
    http-check expect status 200
    server primary-es-1 192.168.1.10:9200 check port 9200 inter 2s fall 3 rise 2
    server primary-es-2 192.168.1.11:9200 check port 9200 inter 2s fall 3 rise 2
    # Secondary cluster nodes are marked as backup, so they receive traffic only
    # after the failover script promotes them or when every primary server is down
    server secondary-es-1 192.168.2.10:9200 check port 9200 inter 2s fall 3 rise 2 backup
    server secondary-es-2 192.168.2.11:9200 check port 9200 inter 2s fall 3 rise 2 backup
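To see which servers HAProxy currently considers up, the CSV produced by its `show stat` command can be parsed. The sketch below assumes a stats socket is enabled in the global section, which the configuration above does not show; the socket path in the comment is hypothetical.

```python
import csv
import io


def parse_server_status(show_stat_csv, backend="elasticsearch_backend"):
    """Map server name to status (e.g. UP, DOWN) for one backend."""
    # The header line of "show stat" starts with "# pxname,svname,..."
    reader = csv.DictReader(io.StringIO(show_stat_csv.lstrip("# ")))
    return {
        row["svname"]: row["status"]
        for row in reader
        if row["pxname"] == backend
        and row["svname"] not in ("FRONTEND", "BACKEND")
    }


# The CSV could be obtained with, for example:
#   echo "show stat" | socat stdio /run/haproxy/admin.sock   (hypothetical path)
```

Only the pxname, svname, and status columns are read, so the many other columns in the real output are ignored.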

Health Check Script for Automated Failover

We need a script that periodically checks the health of the primary Elasticsearch cluster. If the health check fails, the script will update the HAProxy configuration to switch traffic to the secondary cluster.

Python Health Check and Failover Script

This script will run on a dedicated monitoring server or one of the HAProxy nodes. It uses the `requests` library to query the cluster health API and `subprocess` to validate the configuration and reload HAProxy.

import requests
import subprocess
import time
import os

PRIMARY_ES_URL = "http://primary-es-node-1.ovh.example.com:9200"
SECONDARY_ES_URL = "http://secondary-es-node-1.ovh.example.com:9200" # For health check of secondary
HAPROXY_CONFIG_PATH = "/etc/haproxy/haproxy.cfg"
HAPROXY_RELOAD_CMD = ["sudo", "systemctl", "restart", "haproxy"] # Or "haproxy -c -f /etc/haproxy/haproxy.cfg && sudo systemctl reload haproxy" for graceful reload

CHECK_INTERVAL = 10 # seconds
FAIL_THRESHOLD = 3 # consecutive failures to trigger failover
RECOVERY_THRESHOLD = 5 # consecutive successes to trigger failback

def get_es_health(url):
    try:
        response = requests.get(f"{url}/_cluster/health", timeout=5)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error checking Elasticsearch health at {url}: {e}")
        return None

def is_primary_healthy(health_data):
    if not health_data:
        return False
    # Consider 'green' or 'yellow' as healthy for read operations
    return health_data.get("status") in ["green", "yellow"]

def set_backup_flag(server_line, backup):
    """Return the server line with the 'backup' keyword added or removed."""
    newline = "\n" if server_line.endswith("\n") else ""
    indent = server_line[:len(server_line) - len(server_line.lstrip())]
    tokens = [t for t in server_line.split() if t != "backup"]
    if backup:
        tokens.append("backup")
    return indent + " ".join(tokens) + newline

def modify_haproxy_config(failover_active):
    """Rewrite elasticsearch_backend so that exactly one cluster is active.

    Servers are classified by name prefix ("primary-" / "secondary-"), not by
    the presence of "backup", so the result is the same whatever state the
    file is currently in. Returns True only if HAProxy was reloaded.
    """
    print(f"Modifying HAProxy config. Failover active: {failover_active}")
    with open(HAPROXY_CONFIG_PATH, 'r') as f:
        lines = f.readlines()

    new_lines = []
    in_backend = False
    secondary_found = False
    section_keywords = ("global", "defaults", "frontend", "backend", "listen")

    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] in section_keywords:
            # Any section header ends the previous section
            in_backend = tokens[:2] == ["backend", "elasticsearch_backend"]
        elif in_backend and len(tokens) >= 2 and tokens[0] == "server":
            if tokens[1].startswith("primary-"):
                line = set_backup_flag(line, backup=failover_active)
            elif tokens[1].startswith("secondary-"):
                line = set_backup_flag(line, backup=not failover_active)
                secondary_found = True
        # Everything else (option httpchk, comments, other sections) is kept as-is
        new_lines.append(line)

    if failover_active and not secondary_found:
        print("No secondary servers found in elasticsearch_backend. Config left unchanged.")
        return False

    # Write to a temporary file first
    temp_config_path = HAPROXY_CONFIG_PATH + ".tmp"
    with open(temp_config_path, 'w') as f:
        f.writelines(new_lines)

    # Validate the candidate file, then swap it in and reload HAProxy
    try:
        subprocess.run(["sudo", "haproxy", "-c", "-f", temp_config_path], check=True)
        os.replace(temp_config_path, HAPROXY_CONFIG_PATH)
        subprocess.run(HAPROXY_RELOAD_CMD, check=True)
        print("HAProxy reloaded successfully.")
        return True
    except (subprocess.CalledProcessError, FileNotFoundError) as e:
        print(f"HAProxy validation or reload failed: {e}")
        if os.path.exists(temp_config_path):
            os.remove(temp_config_path) # Clean up temp file on error
        return False

def main():
    failover_state = {"active": False, "consecutive_failures": 0, "consecutive_successes": 0}

    while True:
        primary_health = get_es_health(PRIMARY_ES_URL)

        if primary_health and is_primary_healthy(primary_health):
            print("Primary Elasticsearch cluster is healthy.")
            failover_state["consecutive_failures"] = 0
            failover_state["consecutive_successes"] += 1

            if failover_state["active"] and failover_state["consecutive_successes"] >= RECOVERY_THRESHOLD:
                print("Primary cluster recovered. Initiating failback.")
                # Only record the failback if HAProxy actually switched over
                if modify_haproxy_config(failover_active=False):
                    failover_state["active"] = False
                failover_state["consecutive_successes"] = 0
        else:
            print("Primary Elasticsearch cluster is UNHEALTHY.")
            failover_state["consecutive_successes"] = 0
            failover_state["consecutive_failures"] += 1

            if not failover_state["active"] and failover_state["consecutive_failures"] >= FAIL_THRESHOLD:
                print("Triggering failover to secondary cluster.")
                # Verify the secondary cluster is healthy before failing over
                secondary_health = get_es_health(SECONDARY_ES_URL)
                if secondary_health and is_primary_healthy(secondary_health): # Reusing is_primary_healthy for simplicity
                    # Only record the failover if HAProxy actually switched over
                    if modify_haproxy_config(failover_active=True):
                        failover_state["active"] = True
                else:
                    print("Secondary cluster is also unhealthy. Cannot failover.")
                # Reset so the next attempt waits for FAIL_THRESHOLD new failures
                failover_state["consecutive_failures"] = 0

        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()

Important Considerations for the Script:

State Management: The script maintains a simple state to track whether failover is active and to implement thresholds for triggering failover and failback.
HAProxy Reload: The script uses systemctl restart haproxy. For zero-downtime reloads, consider using haproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy after validating the configuration.
Configuration Backup: Always back up your haproxy.cfg before running such scripts.
Permissions: The script needs appropriate permissions to read the HAProxy configuration and execute the reload command (e.g., via sudo).
Monitoring: This script is a basic example. In production, integrate it with a robust monitoring system (e.g., Prometheus, Nagios) for alerting and better state tracking.
Secondary Cluster Health: The script includes a basic check for the secondary cluster’s health before committing to failover.
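One step the script does not cover is making the secondary writable. Follower indices created by CCR are read-only, so redirecting traffic is not enough if applications need to write. The sketch below lists the API calls that promote a follower to a regular index (pause following, close, unfollow, reopen) and runs them in order. The helper names are illustrative, and my-app-logs-replica is the follower from the CCR example.

```python
import urllib.request


def promotion_steps(follower_index):
    """Ordered (method, path) pairs that turn a follower into a regular index."""
    return [
        ("POST", f"/{follower_index}/_ccr/pause_follow"),
        ("POST", f"/{follower_index}/_close"),
        ("POST", f"/{follower_index}/_ccr/unfollow"),
        ("POST", f"/{follower_index}/_open"),
    ]


def promote_follower(es_url, follower_index, timeout=30):
    """Run each step in order; urlopen raises HTTPError on a non-2xx reply."""
    for method, path in promotion_steps(follower_index):
        req = urllib.request.Request(f"{es_url}{path}", method=method)
        with urllib.request.urlopen(req, timeout=timeout):
            print(f"{method} {path} succeeded")


# Usage against the secondary cluster (not executed here):
#   promote_follower("http://secondary-es-node-1.ovh.example.com:9200",
#                    "my-app-logs-replica")
```

Unfollowing cannot be undone on the same index, so this belongs in the failover path only after the primary has been confirmed down.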

Orchestrating Failover for C Deployments

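For C deployments, the failover strategy depends on how your C applications interact with Elasticsearch. If they use a load balancer (like the HAProxy configured above), the failover is transparent at the connection level, although the application still has to address the follower index by its own name (my-app-logs-replica in the CCR example) unless the follower was created with the same name as the leader. However, if your C application has direct connections or connection pooling to Elasticsearch, you’ll need to manage connection re-establishment or updates.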

Application-Level Failover Logic

If your C application needs to be aware of the failover, you can implement logic to query the cluster health or use a separate service discovery mechanism. A common pattern is to have a configuration file or environment variable that points to the active Elasticsearch endpoint.

Dynamic Configuration Updates

When the HAProxy script triggers a failover, it could also trigger an update to a configuration management system (like Consul, etcd, or even a simple file update) that your C applications monitor. Alternatively, your C application could periodically poll the HAProxy status endpoint or a dedicated health check endpoint that reflects the active cluster.

Example: C Application Configuration Update (Conceptual)

Imagine your C application reads its Elasticsearch endpoint from a configuration file:

[Elasticsearch]
Host = primary-es-node-1.ovh.example.com
Port = 9200

When failover occurs, a separate process (or an extension of the HAProxy script) would update this file. The C application would need to be designed to detect changes in its configuration and re-initialize its Elasticsearch client connections. This might involve:

Using a library that supports dynamic configuration reloading.
Implementing a signal handler (e.g., SIGHUP) that tells the application to re-read its configuration.
Periodically polling the configuration file.
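The updating side of this pattern can be sketched as an extension of the Python failover script. The example rewrites the [Elasticsearch] section atomically, so the C application never reads a half-written file, and then optionally signals the application. The function names and the idea of passing a PID are assumptions for illustration.

```python
import configparser
import os
import signal
import tempfile


def write_es_endpoint(config_path, host, port=9200):
    """Atomically point the [Elasticsearch] section at a new host and port."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the "Host" / "Port" capitalisation
    parser.read(config_path)
    if not parser.has_section("Elasticsearch"):
        parser.add_section("Elasticsearch")
    parser["Elasticsearch"]["Host"] = host
    parser["Elasticsearch"]["Port"] = str(port)

    # Write next to the target so os.replace stays on one filesystem.
    # mkstemp creates the file with mode 0600; adjust permissions if the
    # application runs as a different user.
    directory = os.path.dirname(os.path.abspath(config_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        parser.write(tmp)
    os.replace(tmp_path, config_path)


def notify_application(pid):
    """Ask the application to re-read its configuration via SIGHUP."""
    os.kill(pid, signal.SIGHUP)
```

The C application still needs its own SIGHUP handler or polling loop; this sketch covers only the process that updates the file.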

Connection Pooling and Re-establishment

If your C application uses a persistent connection pool to Elasticsearch, a failover event will invalidate existing connections. The application must be able to gracefully close these connections and establish new ones to the active cluster. This often involves:

Implementing retry logic with exponential backoff when establishing new connections.
Ensuring the connection pool can be reset or re-initialized.
Handling connection errors during read/write operations by attempting to reconnect.
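The retry schedule itself is language-neutral. It is sketched here in Python to match the failover script, and the same arithmetic ports directly to C. The function name and default values are illustrative, not taken from any particular client library.

```python
import random


def backoff_delays(base=0.5, factor=2.0, cap=30.0, attempts=6,
                   jitter=True, rng=random.random):
    """Return the wait time in seconds before each reconnect attempt.

    The ceiling grows exponentially up to `cap`. With jitter enabled, each
    delay is drawn uniformly between 0 and the ceiling so that many clients
    do not reconnect at the same instant after a failover.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        delays.append(ceiling * rng() if jitter else ceiling)
    return delays
```

Capping the delay keeps a long outage from pushing the retry interval past the point where the application would notice recovery promptly.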

Testing and Validation

Thorough testing is paramount. Simulate various failure scenarios to ensure your automated failover works as expected.

Test Scenarios

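Primary Node Failure: Stop one or more primary Elasticsearch nodes. Verify that CCR continues, that HAProxy takes the stopped nodes out of rotation, and that the script triggers a full failover only when the node it polls (PRIMARY_ES_URL) is unreachable or the cluster status turns red.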
Primary Cluster Unavailability: Simulate a network partition or a complete shutdown of the primary Elasticsearch cluster. Observe the failover process and verify application connectivity to the secondary cluster.
Network Issues: Introduce latency or packet loss between your application and the primary cluster to test resilience.
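Failback Testing: After a failover, bring the primary cluster back online and verify that the failback mechanism correctly redirects traffic. Also verify how replication is re-established: a follower that was promoted to a regular index does not resume following on its own, so any writes accepted by the secondary must be replicated back before the primary takes traffic again.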
Data Integrity: After failover and failback, perform checks to ensure no data was lost or corrupted.
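A simple first data integrity check is to compare document counts between each leader index and its follower. The sketch below uses the `_count` API and assumes followers carry the -replica suffix from the CCR example; the helper names are illustrative. Counts only match once replication has caught up, so a mismatch during active indexing is not necessarily data loss.

```python
import json
import urllib.request


def fetch_doc_count(es_url, index, timeout=10):
    """Return the document count reported by GET /<index>/_count."""
    with urllib.request.urlopen(f"{es_url}/{index}/_count", timeout=timeout) as resp:
        return json.load(resp)["count"]


def count_mismatches(leader_counts, follower_counts, follower_suffix="-replica"):
    """Map each leader index to (leader, follower) counts where they differ.

    A follower count of None means the follower index was not found.
    """
    mismatches = {}
    for index, leader_count in leader_counts.items():
        follower_count = follower_counts.get(index + follower_suffix)
        if follower_count != leader_count:
            mismatches[index] = (leader_count, follower_count)
    return mismatches
```

Matching counts do not prove the documents are identical, so treat this as a smoke test rather than a full verification.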

Monitoring and Alerting

Implement comprehensive monitoring for:

Elasticsearch cluster health (both primary and secondary).
CCR status (replication lag).
HAProxy health and backend status.
The health check script itself.
Application-level connectivity and error rates.

Set up alerts for any deviations from expected behavior, especially during failover events. This ensures that even if automation fails, you are immediately notified.
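Replication lag can be derived from the CCR follow stats API. The sketch below reads `GET /<follower_index>/_ccr/stats` on the secondary cluster and reports the largest gap between the leader and follower global checkpoints across shards. The helper names are illustrative, and the alert threshold is left to the reader.

```python
import json
import urllib.request


def fetch_follow_stats(es_url, follower_index, timeout=10):
    """Return the parsed body of GET /<follower_index>/_ccr/stats."""
    url = f"{es_url}/{follower_index}/_ccr/stats"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def max_checkpoint_lag(follow_stats):
    """Largest leader-minus-follower global checkpoint gap over all shards."""
    lags = [
        shard["leader_global_checkpoint"] - shard["follower_global_checkpoint"]
        for index in follow_stats.get("indices", [])
        for shard in index.get("shards", [])
    ]
    return max(lags, default=0)


# Usage against the secondary cluster (not executed here):
#   stats = fetch_follow_stats("http://secondary-es-node-1.ovh.example.com:9200",
#                              "my-app-logs-replica")
#   print(max_checkpoint_lag(stats))
```

The lag is measured in operations, not seconds, so it is best tracked as a trend rather than compared against a fixed time budget.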