Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and C++ Deployments on Linode

Designing for Resilience: Automated Failover for C++ Services and Elasticsearch on Linode

This document outlines a robust, automated failover strategy for critical C++ microservices and their backing Elasticsearch clusters, deployed on Linode infrastructure. The focus is on minimizing Mean Time To Recovery (MTTR) through proactive monitoring and automated orchestration, ensuring high availability for your core services.

Elasticsearch Cluster Health and Automated Failover

Elasticsearch’s distributed design provides a strong foundation for high availability: when a node drops out, the cluster elects a new master if needed and promotes replica shards on its own. What it does not do is restart a failed process or replace a failed instance, so automated recovery still needs external orchestration. We’ll combine Linode’s monitoring capabilities, custom health checks, and a simple orchestration script.

Elasticsearch Cluster Configuration for High Availability

A minimum of three master-eligible nodes is crucial for quorum. Data nodes should be configured with appropriate shard allocation awareness to distribute data across availability zones or distinct Linode regions if your architecture spans them. For simplicity in this example, we assume a single Linode region with multiple Linode instances acting as nodes.
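The quorum requirement can be made concrete with a little arithmetic. The sketch below assumes a simple majority model; the helper names are illustrative and not part of any Elasticsearch API.

```python
def quorum_size(master_eligible: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum_size(master_eligible)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} nodes: quorum={quorum_size(n)}, tolerated failures={tolerated_failures(n)}")
```

Under this model two master-eligible nodes tolerate no failures at all, which is why three is the practical minimum.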

The elasticsearch.yml configuration on each node should include:

cluster.name: "my-production-cluster"
node.name: "${HOSTNAME}"
network.host: 0.0.0.0
discovery.seed_hosts:
  - "elasticsearch-node-1.linode.internal:9300"
  - "elasticsearch-node-2.linode.internal:9300"
  - "elasticsearch-node-3.linode.internal:9300"
# Only needed when bootstrapping a brand-new cluster; remove once the cluster has formed
cluster.initial_master_nodes:
  - "elasticsearch-node-1"
  - "elasticsearch-node-2"
  - "elasticsearch-node-3"
# For data nodes, configure shard allocation awareness if using multiple racks/zones
# node.attr.zone: zone-a
# cluster.routing.allocation.awareness.attributes: zone
# For larger clusters, consider dedicated master nodes
# node.roles: [ master ]

Health Check Mechanism

We need a reliable way to determine if an Elasticsearch node is healthy and contributing to the cluster. The Elasticsearch Cluster Health API is ideal for this. A simple `curl` command can poll the cluster status.

curl -s "http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=5s"
# Expected output for a healthy cluster (status: green or yellow)
# {"cluster_name":"my-production-cluster","status":"green","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":10,"active_shards":30,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue":0,"active_shards_percent_as_number":100.0}

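A connection failure, an HTTP error, or a status other than ‘green’ or ‘yellow’ (depending on your tolerance for yellow status during transient issues) indicates a problem. Note that `curl` on its own exits non-zero only when the connection fails; pass `--fail` if HTTP error statuses should do the same. Keep in mind that this API reports the state of the whole cluster as seen by the queried node, so a ‘red’ status from a responsive node is a cluster-level signal, not evidence that this particular node is down. For automated restarts, we’ll specifically look for nodes that fail to return a response at all.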
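The interpretation rules described here can be sketched as a small Python function. This is a minimal sketch that assumes yellow is tolerated and that a response with `timed_out` set to true should count as unhealthy; `is_cluster_healthy` is a hypothetical helper name.

```python
import json

ACCEPTABLE_STATUSES = {"green", "yellow"}  # assumption: yellow is tolerated


def is_cluster_healthy(body: str) -> bool:
    """Return True only for a parseable, non-timed-out green/yellow response."""
    try:
        health = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(health, dict):
        return False
    if health.get("timed_out", False):
        return False
    return health.get("status") in ACCEPTABLE_STATUSES


if __name__ == "__main__":
    sample = '{"cluster_name":"my-production-cluster","status":"green","timed_out":false}'
    print(is_cluster_healthy(sample))
```

Treating an unparseable body as unhealthy keeps the check conservative: anything that is not a clear green or yellow counts as a problem.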

Orchestration Script for Node Failure Detection and Recovery

A Python script running on a separate control node (or a dedicated monitoring instance) can periodically check the health of each Elasticsearch node. If a node becomes unresponsive, the script can attempt to restart it. If the restart fails, it can trigger alerts and potentially initiate more drastic measures (though for Elasticsearch, manual intervention or automated scaling of new nodes is often preferred over aggressive automated failover of the cluster itself).

import requests
import time
import subprocess
import logging

# Configuration
ELASTICSEARCH_NODES = [
    {"host": "elasticsearch-node-1.linode.internal", "port": 9200, "name": "node1"},
    {"host": "elasticsearch-node-2.linode.internal", "port": 9200, "name": "node2"},
    {"host": "elasticsearch-node-3.linode.internal", "port": 9200, "name": "node3"},
]
HEALTH_CHECK_URL_TEMPLATE = "http://{host}:{port}/_cluster/health?wait_for_status=yellow&timeout=5s"
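# BatchMode fails fast instead of prompting for a password; ConnectTimeout bounds the connection attempt
RESTART_COMMAND_TEMPLATE = "ssh -o BatchMode=yes -o ConnectTimeout=10 user@{} 'sudo systemctl restart elasticsearch'"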
CHECK_INTERVAL_SECONDS = 30
RETRY_ATTEMPTS = 3
RETRY_DELAY_SECONDS = 15

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def check_node_health(node):
    """Return True if the node answers the health API, False if it is unreachable."""
    url = HEALTH_CHECK_URL_TEMPLATE.format(host=node["host"], port=node["port"])
    try:
        response = requests.get(url, timeout=10)
        # A timed-out wait_for_status may be reported as HTTP 408 with a JSON body,
        # so only treat other error codes as a failed check
        if response.status_code != 408:
            response.raise_for_status()
        health_data = response.json()
    except (requests.exceptions.RequestException, ValueError) as e:
        logging.error(f"Node {node['name']} ({node['host']}) is unreachable or failed health check: {e}")
        return False
    status = health_data.get("status")
    if status in ("green", "yellow"):
        logging.info(f"Node {node['name']} ({node['host']}) is healthy. Status: {status}")
    else:
        # The health API describes the whole cluster. This node responded, so
        # restarting it would not fix a red cluster; alert instead.
        logging.warning(f"Node {node['name']} ({node['host']}) is reachable, but cluster status is {status}")
    return True

def restart_node(node):
    logging.warning(f"Attempting to restart Elasticsearch on node {node['name']} ({node['host']})...")
    try:
        # Ensure you have passwordless SSH set up for 'user' or use SSH keys
        command = RESTART_COMMAND_TEMPLATE.format(node["host"])
        # The timeout keeps a hung SSH session from blocking the monitoring loop
        process = subprocess.run(command, shell=True, check=True, capture_output=True, text=True, timeout=60)
        logging.info(f"Restart command executed successfully for {node['name']}. Output: {process.stdout}")
        return True
    except subprocess.CalledProcessError as e:
        logging.error(f"Failed to restart Elasticsearch on {node['name']} ({node['host']}). Error: {e.stderr}")
        return False
    except Exception as e:
        logging.error(f"An unexpected error occurred during restart for {node['name']}: {e}")
        return False

def main():
    while True:
        for node in ELASTICSEARCH_NODES:
            if not check_node_health(node):
                logging.warning(f"Node {node['name']} ({node['host']}) is down. Initiating recovery sequence.")
                for attempt in range(RETRY_ATTEMPTS):
                    if restart_node(node):
                        logging.info(f"Restart attempt {attempt + 1}/{RETRY_ATTEMPTS} for {node['name']} succeeded. Waiting for cluster to recover...")
                        # Give Elasticsearch time to rejoin and rebalance
                        time.sleep(RETRY_DELAY_SECONDS * 2)
                        if check_node_health(node):
                            logging.info(f"Node {node['name']} is back online and healthy.")
                            break # Node recovered, move to next node
                    else:
                        logging.error(f"Restart attempt {attempt + 1}/{RETRY_ATTEMPTS} failed for {node['name']}.")
                    time.sleep(RETRY_DELAY_SECONDS)
                else:
                    logging.critical(f"Failed to recover node {node['name']} ({node['host']}) after {RETRY_ATTEMPTS} attempts. Manual intervention may be required.")
                    # Consider sending alerts here (e.g., PagerDuty, Slack)
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    main()

Prerequisites for the script:

Python 3 installed on the control node.
requests library installed (`pip install requests`).
Key-based (passwordless) SSH access configured from the control node to each Elasticsearch node for the specified user (e.g., `user`). Because the script runs non-interactively, that user must be able to run `sudo systemctl restart elasticsearch` without a password prompt.
The `elasticsearch` service must be managed by `systemd` (or equivalent).
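The script as written reacts to a single failed probe. One way to reduce restarts caused by a momentary network blip is to require several consecutive failures first. The sketch below is an optional addition, not part of the original script; `FailureTracker` and its threshold of 3 are assumptions chosen for illustration.

```python
from collections import defaultdict


class FailureTracker:
    """Counts consecutive failed probes per node."""

    def __init__(self, threshold: int = 3):
        if threshold < 1:
            raise ValueError("threshold must be >= 1")
        self.threshold = threshold
        self._failures = defaultdict(int)

    def record(self, node_name: str, healthy: bool) -> bool:
        """Record a probe result; return True when a restart should be attempted."""
        if healthy:
            self._failures[node_name] = 0
            return False
        self._failures[node_name] += 1
        if self._failures[node_name] >= self.threshold:
            self._failures[node_name] = 0  # start counting afresh after acting
            return True
        return False
```

In the monitoring loop, the result of `check_node_health(node)` would be passed to `tracker.record(node["name"], ...)`, and the recovery sequence would start only when it returns True. The trade-off is slower detection: with a 30-second interval and a threshold of 3, a dead node is acted on after roughly 90 seconds.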

C++ Service Auto-Failover

For stateless C++ microservices, achieving auto-failover typically involves a load balancer and multiple instances of the service running across different Linode instances. If one instance becomes unhealthy, the load balancer should stop sending traffic to it and route requests to the remaining healthy instances.

Load Balancer Configuration (HAProxy Example)

HAProxy is a powerful, high-performance TCP/HTTP load balancer. We’ll configure it to monitor the health of our C++ service instances.

frontend http_in
    bind *:80
    mode http
    default_backend cpp_services

backend cpp_services
    mode http
    balance roundrobin
    option httpchk GET /health # Assuming your C++ service exposes a /health endpoint
    http-check expect status 200 # Expect a 200 OK from the health check
    server cpp_service_1 192.168.1.10:8080 check # Replace with actual Linode IPs and ports
    server cpp_service_2 192.168.1.11:8080 check
    server cpp_service_3 192.168.1.12:8080 check
    # Add more servers as needed

In this HAProxy configuration:

frontend http_in: Listens on port 80 for incoming HTTP traffic.
backend cpp_services: Defines the pool of C++ service instances.
balance roundrobin: Distributes traffic evenly.
option httpchk GET /health: Configures HAProxy to send an HTTP GET request to the /health endpoint on each backend server.
http-check expect status 200: Specifies that a 200 OK response from the health check indicates a healthy server.
server ... check: Defines each backend server and enables health checking. HAProxy will automatically mark unhealthy servers as DOWN and remove them from the rotation.

C++ Service Health Endpoint Implementation

Your C++ service needs to expose an HTTP endpoint (e.g., /health) that returns a 200 OK status code if the service is healthy and operational. This endpoint should perform minimal checks, such as verifying its connection to essential dependencies (like Elasticsearch, if applicable) or its internal state.

// Example using cpp-httplib, a header-only HTTP server library.
// This is a conceptual snippet; adapt it to your chosen library (e.g., Boost.Beast).
#include "httplib.h"

int main() {
    httplib::Server server;

    server.Get("/health", [](const httplib::Request&, httplib::Response& res) {
        // Perform minimal health checks here.
        // For example, check if connected to Elasticsearch, database, etc.
        bool is_healthy = true;
        // if (!is_elasticsearch_connected()) {
        //     is_healthy = false;
        // }

        if (is_healthy) {
            res.status = 200;
            res.set_content("OK", "text/plain");
        } else {
            res.status = 503; // Service Unavailable
            res.set_content("Service Unavailable", "text/plain");
        }
    });

    // Port 8080 matches the server entries in the HAProxy backend
    server.listen("0.0.0.0", 8080);
    return 0;
}

When HAProxy receives a non-200 response (e.g., 503 Service Unavailable) from the /health endpoint, or no response at all, it counts the check as failed. After enough consecutive failures (the `fall` parameter on the `server` line, 3 by default), it marks the corresponding C++ service instance as DOWN, and traffic is routed to the remaining healthy instances. Once the instance recovers and passes enough consecutive checks (the `rise` parameter, 2 by default), HAProxy marks it as UP again and includes it in the rotation.
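To see how consecutive check results translate into UP and DOWN transitions, here is a toy Python model of that behaviour. It is a sketch for reasoning about thresholds, not HAProxy code; `ServerState` is a hypothetical name, and the thresholds of 2 passes to come up and 3 failures to go down are assumed values.

```python
class ServerState:
    """Toy model of threshold-based health tracking for one backend server."""

    def __init__(self, rise: int = 2, fall: int = 3):
        self.rise = rise
        self.fall = fall
        self.up = True
        self._streak = 0  # consecutive results that contradict the current state

    def observe(self, check_passed: bool) -> bool:
        """Feed one health-check result and return whether the server is UP."""
        if check_passed == self.up:
            self._streak = 0
        else:
            self._streak += 1
            needed = self.fall if self.up else self.rise
            if self._streak >= needed:
                self.up = not self.up
                self._streak = 0
        return self.up


if __name__ == "__main__":
    server = ServerState()
    results = [False, False, False, True, True]
    print([server.observe(r) for r in results])
```

A single failed check never takes a server out of rotation in this model, and a single passing check never brings it back, which is what keeps a flapping instance from bouncing in and out of the pool.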

Orchestrating Deployments and Failover with Linode

Linode’s infrastructure provides the building blocks. For true automation, consider integrating these checks and recovery mechanisms with a CI/CD pipeline or an orchestration tool like Ansible, Terraform, or Kubernetes (if you abstract your Linode instances into a managed Kubernetes cluster).

Automated Deployment of C++ Services

A typical workflow would involve:

Building the C++ service artifact.
Pushing the artifact to a repository (e.g., Docker Hub, private artifact repository).
Using a deployment tool (e.g., Ansible, `linode-cli`) to deploy new instances of the service onto available Linode instances.
Updating the HAProxy configuration (or the load balancer service in your orchestration tool) to include the new instances.

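# Example using Ansible to deploy the service and update HAProxy
# playbook.yml
---
- hosts: cpp_services
  become: true
  tasks:
    - name: Deploy C++ service
      copy:
        src: /path/to/your/cpp_service_binary
        dest: /usr/local/bin/cpp_service
        mode: '0755'
      notify:
        - Restart C++ service
    - name: Ensure C++ service is running and enabled
      systemd:
        name: cpp_service.service
        state: started
        enabled: yes

  handlers:
    - name: Restart C++ service
      systemd:
        name: cpp_service.service
        state: restarted

- hosts: load_balancers
  become: true
  tasks:
    - name: Update HAProxy configuration
      template:
        src: haproxy.cfg.j2
        dest: /etc/haproxy/haproxy.cfg
        # Refuse to install a configuration that HAProxy cannot parse
        validate: haproxy -c -f %s
      notify:
        - Reload HAProxy

  # Handlers are scoped to a play, so this one must live in the play that notifies it
  handlers:
    - name: Reload HAProxy
      systemd:
        name: haproxy
        state: reloaded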
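The contents of the haproxy.cfg.j2 template are not shown here. As a rough illustration of what the backend portion would need to produce, the Python sketch below renders `server` lines from a list of instance addresses. The function name, the naming scheme, and the default port are assumptions that mirror the HAProxy example earlier in this document.

```python
def render_backend_servers(instances, port=8080, prefix="cpp_service"):
    """Return HAProxy 'server' lines, one per instance IP, with health checks enabled."""
    lines = []
    for index, ip in enumerate(instances, start=1):
        lines.append(f"    server {prefix}_{index} {ip}:{port} check")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_backend_servers(["192.168.1.10", "192.168.1.11", "192.168.1.12"]))
```

Generating these lines from the same inventory that drives the deployment keeps the load balancer and the set of running instances from drifting apart.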

Monitoring and Alerting Integration

While the Python script handles basic Elasticsearch node restarts, a comprehensive monitoring solution is essential. Linode’s native monitoring can provide basic metrics. For advanced alerting, integrate with tools like Prometheus, Grafana, Alertmanager, or cloud-native solutions.

Key metrics to monitor:

Elasticsearch: Cluster health status, node status (master, data), JVM heap usage, disk space, indexing/search latency.
C++ Services: Request latency, error rates (especially 5xx), CPU/memory usage, network traffic.
HAProxy: Backend server status (UP/DOWN), connection errors, request rates.
Linode Instances: CPU utilization, memory usage, disk I/O, network I/O.

Configure alerts for critical thresholds. For example, if an Elasticsearch node remains DOWN for more than 5 minutes after automated restart attempts, or if the C++ service error rate exceeds 5% for more than 2 minutes, trigger an alert to your operations team.
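The "exceeds 5% for more than 2 minutes" rule is a sustained-threshold check. In practice an alerting tool would evaluate it, but the logic can be sketched in a few lines of Python. `ErrorRateAlert` is a hypothetical class, and the samples are assumed to be request and error counts collected since the previous sample.

```python
class ErrorRateAlert:
    """Fires when the error rate stays above a threshold for a sustained period."""

    def __init__(self, threshold: float = 0.05, sustain_seconds: float = 120):
        self.threshold = threshold
        self.sustain_seconds = sustain_seconds
        self._breach_started = None  # timestamp at which the rate first exceeded the threshold

    def update(self, timestamp: float, request_count: int, error_count: int) -> bool:
        """Feed one sample; return True if the alert should fire."""
        rate = error_count / request_count if request_count else 0.0
        if rate <= self.threshold:
            self._breach_started = None  # the breach ended, so reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started > self.sustain_seconds
```

Resetting the timer whenever the rate drops back under the threshold means a brief spike never pages anyone; only a continuous breach does.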

Conclusion

Architecting for automated failover requires a multi-layered approach. By combining Elasticsearch’s built-in resilience with external orchestration for health checks and restarts, and by leveraging load balancers with health-aware routing for stateless C++ services, you can significantly improve the availability and fault tolerance of your applications on Linode. Continuous monitoring and a well-defined alerting strategy are paramount to ensuring that automated recovery mechanisms are effective and that manual intervention is only required for truly exceptional circumstances.