Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and C Deployments on OVH
Elasticsearch Cluster Setup for High Availability on OVH
Achieving robust disaster recovery for Elasticsearch hinges on a well-architected, multi-region deployment with automated failover. This section details the foundational setup for an Elasticsearch cluster designed for resilience on OVH’s infrastructure, focusing on replication and shard allocation strategies that facilitate seamless failover.
We’ll assume a primary region (e.g., GRA) and a secondary, geographically distinct region (e.g., BHS) for failover. Each region will host a dedicated Elasticsearch cluster. The key is to ensure data is replicated across these clusters and that a mechanism exists to redirect traffic to the secondary cluster should the primary become unavailable.
Configuring Elasticsearch Replication and Shard Allocation
For disaster recovery, we employ cross-cluster replication (CCR). This allows indices in one cluster to be replicated to another. We also need to configure shard allocation awareness to ensure replicas are placed in different availability zones within a region and, crucially, across regions for DR purposes.
Cross-Cluster Replication (CCR) Setup
First, establish a secure connection between your primary and secondary Elasticsearch clusters. This typically involves configuring transport layer security (TLS) and setting up remote cluster configurations.
Remote Cluster Configuration (Primary Cluster)
On your primary cluster’s elasticsearch.yml, define the remote cluster connection:
cluster.remote.secondary_cluster:
seeds:
- secondary-es-node-1.ovh.example.com:9300
- secondary-es-node-2.ovh.example.com:9300
skip_unavailable: false
mode: ALL
Remote Cluster Configuration (Secondary Cluster)
Similarly, on your secondary cluster’s elasticsearch.yml:
cluster.remote.primary_cluster:
seeds:
- primary-es-node-1.ovh.example.com:9300
- primary-es-node-2.ovh.example.com:9300
skip_unavailable: false
mode: ALL
Configuring CCR Policies
Once remote clusters are configured, you can set up CCR policies. This is typically done via the Elasticsearch API. For example, to replicate an index named my-app-logs from the primary cluster to the secondary:
curl -X PUT "http://primary-es-node-1.ovh.example.com:9200/_ccr/auto/my-app-logs?from_remote_cluster=secondary_cluster" -H 'Content-Type: application/json' -d'
{
"remote_cluster": "secondary_cluster",
"leader_index": "my-app-logs",
"follower_index": "my-app-logs-replica"
}'
This command creates a follower index (my-app-logs-replica) on the secondary cluster that continuously replicates data from the leader index (my-app-logs) on the primary cluster. Ensure your index templates are also replicated or managed to maintain consistent mappings and settings.
Automated Failover Orchestration with HAProxy and Custom Scripts
Manual failover is not an option for true disaster recovery. We’ll implement an automated failover mechanism using HAProxy as a load balancer and a custom health check script that monitors the primary Elasticsearch cluster. If the primary becomes unhealthy, the script will reconfigure HAProxy to direct traffic to the secondary cluster.
HAProxy Configuration for Elasticsearch
Deploy HAProxy instances in front of both your primary and secondary Elasticsearch clusters. These HAProxy instances will act as the single entry point for your applications.
Primary HAProxy Configuration
The HAProxy configuration (/etc/haproxy/haproxy.cfg) will initially point to the primary Elasticsearch nodes. We’ll use a custom health check endpoint.
frontend elasticsearch_frontend
bind *:9200
mode http
default_backend elasticsearch_backend
backend elasticsearch_backend
mode http
balance roundrobin
option httpchk GET /_cluster/health?pretty
http-check expect status 200
server primary-es-1 192.168.1.10:9200 check port 9200 inter 2s fall 3 rise 2
server primary-es-2 192.168.1.11:9200 check port 9200 inter 2s fall 3 rise 2
# Secondary cluster is initially commented out or marked as backup
# server secondary-es-1 192.168.2.10:9200 check port 9200 inter 2s fall 3 rise 2 backup
# server secondary-es-2 192.168.2.11:9200 check port 9200 inter 2s fall 3 rise 2 backup
Health Check Script for Automated Failover
We need a script that periodically checks the health of the primary Elasticsearch cluster. If the health check fails, the script will update the HAProxy configuration to switch traffic to the secondary cluster.
Python Health Check and Failover Script
This script will run on a dedicated monitoring server or one of the HAProxy nodes. It uses the Elasticsearch Python client to check cluster health and `subprocess` to reload HAProxy.
import requests
import json
import subprocess
import time
import os
PRIMARY_ES_URL = "http://primary-es-node-1.ovh.example.com:9200"
SECONDARY_ES_URL = "http://secondary-es-node-1.ovh.example.com:9200" # For health check of secondary
HAPROXY_CONFIG_PATH = "/etc/haproxy/haproxy.cfg"
HAPROXY_RELOAD_CMD = ["sudo", "systemctl", "restart", "haproxy"] # Or "haproxy -c -f /etc/haproxy/haproxy.cfg && sudo systemctl reload haproxy" for graceful reload
CHECK_INTERVAL = 10 # seconds
FAIL_THRESHOLD = 3 # consecutive failures to trigger failover
RECOVERY_THRESHOLD = 5 # consecutive successes to trigger failback
def get_es_health(url):
try:
response = requests.get(f"{url}/_cluster/health", timeout=5)
response.raise_for_status()
return response.json()
except requests.exceptions.RequestException as e:
print(f"Error checking Elasticsearch health at {url}: {e}")
return None
def is_primary_healthy(health_data):
if not health_data:
return False
# Consider 'green' or 'yellow' as healthy for read operations
return health_data.get("status") in ["green", "yellow"]
def modify_haproxy_config(failover_active):
print(f"Modifying HAProxy config. Failover active: {failover_active}")
with open(HAPROXY_CONFIG_PATH, 'r') as f:
lines = f.readlines()
new_lines = []
in_backend = False
primary_servers = []
secondary_servers = []
primary_backup_mode = False
secondary_backup_mode = False
for line in lines:
if line.strip().startswith("backend elasticsearch_backend"):
in_backend = True
new_lines.append(line)
continue
if in_backend:
if line.strip().startswith("server "):
if "backup" in line:
secondary_servers.append(line)
secondary_backup_mode = True
else:
primary_servers.append(line)
primary_backup_mode = True
elif line.strip().startswith("backend "): # End of backend section
in_backend = False
# Reconstruct backend based on failover_active
if failover_active:
# Primary servers become backup, secondary servers become active
for s in primary_servers:
if "backup" not in s:
new_lines.append(s.replace("check", "check backup"))
for s in secondary_servers:
if "backup" in s:
new_lines.append(s.replace("backup", ""))
else:
# Secondary servers become backup, primary servers become active
for s in primary_servers:
if "backup" not in s:
new_lines.append(s)
for s in secondary_servers:
if "backup" in s:
new_lines.append(s.replace("backup", "check backup"))
new_lines.append(line) # Add the next backend line
continue
else:
# Keep other lines within backend (e.g., option httpchk)
new_lines.append(line)
else:
new_lines.append(line)
# Write to a temporary file first
temp_config_path = HAPROXY_CONFIG_PATH + ".tmp"
with open(temp_config_path, 'w') as f:
f.writelines(new_lines)
# Validate and reload HAProxy
try:
subprocess.run(["sudo", "haproxy", "-c", "-f", temp_config_path], check=True)
os.replace(temp_config_path, HAPROXY_CONFIG_PATH)
subprocess.run(HAPROXY_RELOAD_CMD, check=True)
print("HAProxy reloaded successfully.")
except subprocess.CalledProcessError as e:
print(f"HAProxy configuration error or reload failed: {e}")
os.remove(temp_config_path) # Clean up temp file on error
except FileNotFoundError:
print("Error: HAProxy command not found. Is HAProxy installed and in PATH?")
os.remove(temp_config_path)
def main():
failover_state = {"active": False, "consecutive_failures": 0, "consecutive_successes": 0}
while True:
primary_health = get_es_health(PRIMARY_ES_URL)
if primary_health and is_primary_healthy(primary_health):
print("Primary Elasticsearch cluster is healthy.")
failover_state["consecutive_failures"] = 0
failover_state["consecutive_successes"] += 1
if failover_state["active"] and failover_state["consecutive_successes"] >= RECOVERY_THRESHOLD:
print("Primary cluster recovered. Initiating failback.")
modify_haproxy_config(failover_active=False)
failover_state["active"] = False
failover_state["consecutive_successes"] = 0
else:
print("Primary Elasticsearch cluster is UNHEALTHY.")
failover_state["consecutive_successes"] = 0
failover_state["consecutive_failures"] += 1
if not failover_state["active"] and failover_state["consecutive_failures"] >= FAIL_THRESHOLD:
print("Triggering failover to secondary cluster.")
# Optional: Verify secondary cluster is healthy before failing over
secondary_health = get_es_health(SECONDARY_ES_URL)
if secondary_health and is_primary_healthy(secondary_health): # Reusing is_primary_healthy for simplicity
modify_haproxy_config(failover_active=True)
failover_state["active"] = True
failover_state["consecutive_failures"] = 0
else:
print("Secondary cluster is also unhealthy. Cannot failover.")
failover_state["consecutive_failures"] = 0 # Reset to avoid repeated attempts if secondary is down
time.sleep(CHECK_INTERVAL)
if __name__ == "__main__":
main()
Important Considerations for the Script:
- State Management: The script maintains a simple state to track whether failover is active and to implement thresholds for triggering failover and failback.
- HAProxy Reload: The script uses
systemctl restart haproxy. For zero-downtime reloads, consider usinghaproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxyafter validating the configuration. - Configuration Backup: Always back up your
haproxy.cfgbefore running such scripts. - Permissions: The script needs appropriate permissions to read the HAProxy configuration and execute the reload command (e.g., via
sudo). - Monitoring: This script is a basic example. In production, integrate it with a robust monitoring system (e.g., Prometheus, Nagios) for alerting and better state tracking.
- Secondary Cluster Health: The script includes a basic check for the secondary cluster’s health before committing to failover.
Orchestrating Failover for C Deployments
For C deployments, the failover strategy depends on how your C applications interact with Elasticsearch. If they use a load balancer (like the HAProxy configured above), the failover is transparent. However, if your C application has direct connections or connection pooling to Elasticsearch, you’ll need to manage connection re-establishment or updates.
Application-Level Failover Logic
If your C application needs to be aware of the failover, you can implement logic to query the cluster health or use a separate service discovery mechanism. A common pattern is to have a configuration file or environment variable that points to the active Elasticsearch endpoint.
Dynamic Configuration Updates
When the HAProxy script triggers a failover, it could also trigger an update to a configuration management system (like Consul, etcd, or even a simple file update) that your C applications monitor. Alternatively, your C application could periodically poll the HAProxy status endpoint or a dedicated health check endpoint that reflects the active cluster.
Example: C Application Configuration Update (Conceptual)
Imagine your C application reads its Elasticsearch endpoint from a configuration file:
[Elasticsearch] Host = primary-es-node-1.ovh.example.com Port = 9200
When failover occurs, a separate process (or an extension of the HAProxy script) would update this file. The C application would need to be designed to detect changes in its configuration and re-initialize its Elasticsearch client connections. This might involve:
- Using a library that supports dynamic configuration reloading.
- Implementing a signal handler (e.g., SIGHUP) that tells the application to re-read its configuration.
- Periodically polling the configuration file.
Connection Pooling and Re-establishment
If your C application uses a persistent connection pool to Elasticsearch, a failover event will invalidate existing connections. The application must be able to gracefully close these connections and establish new ones to the active cluster. This often involves:
- Implementing retry logic with exponential backoff when establishing new connections.
- Ensuring the connection pool can be reset or re-initialized.
- Handling connection errors during read/write operations by attempting to reconnect.
Testing and Validation
Thorough testing is paramount. Simulate various failure scenarios to ensure your automated failover works as expected.
Test Scenarios
- Primary Node Failure: Stop one or more primary Elasticsearch nodes. Verify that CCR continues and that the health check script detects the failure and triggers HAProxy re-configuration.
- Primary Cluster Unavailability: Simulate a network partition or a complete shutdown of the primary Elasticsearch cluster. Observe the failover process and verify application connectivity to the secondary cluster.
- Network Issues: Introduce latency or packet loss between your application and the primary cluster to test resilience.
- Failback Testing: After a failover, bring the primary cluster back online and verify that the failback mechanism correctly redirects traffic and that CCR resumes its primary role.
- Data Integrity: After failover and failback, perform checks to ensure no data was lost or corrupted.
Monitoring and Alerting
Implement comprehensive monitoring for:
- Elasticsearch cluster health (both primary and secondary).
- CCR status (replication lag).
- HAProxy health and backend status.
- The health check script itself.
- Application-level connectivity and error rates.
Set up alerts for any deviations from expected behavior, especially during failover events. This ensures that even if automation fails, you are immediately notified.