Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Python Deployments on Linode

Elasticsearch Cluster Setup for High Availability

Automated failover for Elasticsearch depends on a multi-node cluster that can keep electing a master when a node is lost. We’ll focus on a setup that relies on Elasticsearch’s built-in quorum and discovery mechanisms. For this example, we’ll assume three Linode instances, each running a master-eligible Elasticsearch node. Three is the smallest cluster that can lose one node and still hold a majority (two of three), which is what allows a master election to succeed without risking a split-brain scenario.
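
The majority arithmetic behind that choice can be sketched in a few lines. This is an illustration of simple majority voting only; the function names are made up for this example, and Elasticsearch manages its actual voting configuration itself.

```python
def quorum_size(master_eligible):
    """Smallest number of nodes that forms a strict majority."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible):
    """How many master-eligible nodes can fail while a majority remains."""
    return master_eligible - quorum_size(master_eligible)


# Compare a few cluster sizes: columns are nodes, quorum, tolerated failures.
for n in (1, 2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

A two-node cluster tolerates no failures at all under this arithmetic, which is why the example starts at three nodes rather than two.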

Cluster formation relies on the discovery.seed_hosts setting, which tells each node where to find other master-eligible nodes. We also configure cluster.initial_master_nodes to bootstrap the cluster on its first startup. This setting is only needed the first time the cluster forms and should be removed from each node’s configuration afterwards. Ensure your Linode instances have static IP addresses or resolvable hostnames.

Elasticsearch Configuration (`elasticsearch.yml`)

On each Elasticsearch node (e.g., es-node-1, es-node-2, es-node-3), the elasticsearch.yml file should be configured as follows. Replace the IP addresses with your actual Linode instance IPs.

cluster.name: "my-production-cluster"
node.name: "${NODE_NAME}" # e.g., es-node-1, es-node-2, es-node-3
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - "192.168.1.101:9300" # IP of es-node-1
  - "192.168.1.102:9300" # IP of es-node-2
  - "192.168.1.103:9300" # IP of es-node-3

cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"

# For production, consider security settings like xpack.security.enabled: true
# and proper TLS configuration.

To manage the NODE_NAME environment variable, you can use systemd service files or a simple shell script during startup.

Verifying Cluster Health

Once Elasticsearch is running on all nodes, you can check the cluster health using the following API call. A status of green means all primary and replica shards are allocated. Yellow means all primary shards are allocated but some replicas are not, which is acceptable for failover testing but not ideal for production. Red means at least one primary shard is unallocated, so some data is unavailable.

curl -X GET "http://localhost:9200/_cluster/health?pretty"

The output should show:

{
  "cluster_name" : "my-production-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 1,
  "active_shards" : 3,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue" : 0,
  "active_shards_percent_as_number" : 100.0
}
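
The same check can be scripted so that it fails loudly when the cluster is not in the expected state. The following is a minimal sketch using only the standard library; the function names and the expected_nodes default are assumptions for this example, and it assumes the unauthenticated HTTP endpoint used in the curl command above.

```python
import json
import urllib.request


def fetch_health(base_url="http://localhost:9200", timeout=5.0):
    """GET /_cluster/health and return the parsed JSON body as a dict."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def health_problems(health, expected_nodes=3):
    """Return a list of problems found in a health response; empty means all checks passed."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"status is {status}, expected green")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        problems.append(f"{nodes} of {expected_nodes} nodes have joined")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned:
        problems.append(f"{unassigned} shard copies are unassigned")
    if health.get("timed_out"):
        problems.append("health request timed out")
    return problems
```

Calling health_problems(fetch_health()) on any node returns an empty list when the cluster matches the output shown above.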

Python Application Deployment and Load Balancing

Our Python application will interact with Elasticsearch. For high availability of the application itself, we’ll deploy multiple instances behind a load balancer. Linode’s NodeBalancers are a suitable choice for this. Clients will connect to the NodeBalancer’s IP address, and the NodeBalancer will forward each request to one of the application instances.

Python Application Structure (Example)

A simple Flask application demonstrating interaction with Elasticsearch. This example uses the elasticsearch-py client.

from flask import Flask, request, jsonify
from elasticsearch import Elasticsearch
import os

app = Flask(__name__)

# Configure Elasticsearch connection
# In a real-world scenario, use environment variables or a config file
# The ES_HOST should point to your Elasticsearch NodeBalancer or individual nodes
ES_HOST = os.environ.get("ES_HOST", "http://localhost:9200")
es = Elasticsearch([ES_HOST])

@app.route('/index', methods=['POST'])
def index_document():
    doc_id = request.json.get('id')
    data = request.json.get('data')
    if not doc_id or not data:
        return jsonify({"error": "Missing 'id' or 'data'"}), 400

    try:
        response = es.index(index="my-index", id=doc_id, document=data)
        # dict() handles both a plain dict (7.x client) and a response object (8.x client)
        return jsonify(dict(response)), 201
    except Exception as e:
        return jsonify({"error": str(e)}), 500

@app.route('/search', methods=['GET'])
def search_documents():
    query = request.args.get('q')
    if not query:
        return jsonify({"error": "Missing 'q' query parameter"}), 400

    try:
        # The query keyword replaces the deprecated body argument in recent client versions
        response = es.search(index="my-index", query={"match": {"content": query}})
        return jsonify(response['hits']['hits'])
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    # For production, use a proper WSGI server like Gunicorn
    # app.run(debug=True, host='0.0.0.0', port=5000)
    # Example with Gunicorn:
    # gunicorn -w 4 -b 0.0.0.0:5000 app:app
    app.run(host='0.0.0.0', port=5000)

Deploying Python Applications on Linode

We’ll deploy multiple instances of this Flask application on separate Linode instances. For simplicity, let’s assume we have two application servers (app-server-1, app-server-2). Each server will run the Python application using Gunicorn.

On each application server:

Install Python, pip, and necessary libraries: pip install Flask elasticsearch gunicorn (the elasticsearch-py client is published on PyPI under the name elasticsearch)
Save the Python code as app.py.
Run Gunicorn: gunicorn -w 4 -b 0.0.0.0:5000 app:app

Configuring Linode NodeBalancer

Create a NodeBalancer in your Linode Cloud Manager. Configure it with the following:

Frontend Protocol: HTTP (or HTTPS if you’ve set up SSL termination)
Frontend Port: 80 (or 443)
Backend Protocol: HTTP
Backend Port: 5000 (the port Gunicorn is listening on)
Backend Nodes: Add the private IP addresses of your application servers (e.g., app-server-1 and app-server-2) on port 5000.
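Health Checks: Configure a simple HTTP health check (a GET request to a path on port 5000) so the NodeBalancer only sends traffic to healthy application instances. The path must map to a route that returns a success status. The example application defines no route for /, so a check against / would receive a 404 and mark every backend as down; add a dedicated route such as /health and point the check at it.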

The NodeBalancer will now distribute incoming traffic to your Python application instances. If one instance becomes unhealthy, the NodeBalancer will automatically stop sending traffic to it.
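
An HTTP health check needs a route that answers with a success status, and the example application does not define one for /. The sketch below shows one way to add such a route. The /health path and both function names are assumptions for this example, and returning a dict from a view relies on Flask 1.1 or later.

```python
def health_status(ping):
    """Return (payload, http_status) for a health check.

    `ping` is any zero-argument callable that returns True when the
    dependency (here, Elasticsearch) is reachable.
    """
    try:
        reachable = bool(ping())
    except Exception:
        reachable = False
    # Always answer 200: if this code runs, the application process is alive.
    # The dependency state is reported in the body instead of failing the check,
    # so an Elasticsearch outage does not eject every app server at once.
    return {"status": "ok", "elasticsearch": reachable}, 200


def register_health_route(app, ping, path="/health"):
    """Attach the health endpoint to a Flask app object."""
    @app.route(path, methods=["GET"])
    def health():
        return health_status(ping)
    return health
```

In app.py this would be wired up with register_health_route(app, es.ping), with the NodeBalancer check path set to /health. Whether an unreachable Elasticsearch should fail the check is a design decision; this sketch treats the check as a liveness probe for the application only.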

Automated Failover Scenarios and Testing

Automated failover in this architecture occurs at two levels: Elasticsearch cluster and application instances.

Elasticsearch Failover

If a master-eligible Elasticsearch node fails:

The remaining nodes will detect the failure through Elasticsearch’s periodic fault-detection checks between the elected master and the other nodes.
They will hold a new election for the master role.
As long as a quorum (majority) of master-eligible nodes is available, a new master will be elected, and the cluster will continue to operate.
If a data node fails, Elasticsearch will promote replica shards on the surviving nodes to primaries where needed, then, after a short delay, allocate new replica copies on other available data nodes to restore replication.

Testing Elasticsearch Failover:

Stop the Elasticsearch service on one of the master-eligible nodes: sudo systemctl stop elasticsearch
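Monitor cluster health from another node: curl -X GET "http://localhost:9200/_cluster/health?pretty". The node count should drop to 2, and the status may turn yellow while replicas are reassigned. It returns to green only if the two remaining nodes can hold every shard copy (for example, with one replica per primary). With two replicas per primary, as in the sample output above, it stays yellow until the stopped node rejoins.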
Restart the stopped Elasticsearch service.
Verify the node rejoins the cluster and shards are rebalanced if necessary.
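
To capture what happens during the test without watching curl output by hand, the polling can be scripted. This is a sketch under stated assumptions: watch_health is a hypothetical helper, and fetch stands for any function that returns the parsed JSON of GET /_cluster/health from a surviving node.

```python
import time


def watch_health(fetch, polls=30, interval=2.0, sleep=time.sleep):
    """Poll `fetch()` and return the distinct (status, node_count) states in the order seen.

    `fetch` is a zero-argument callable returning the parsed JSON body of
    GET /_cluster/health; a raised exception is recorded as ("unreachable", 0).
    """
    transitions = []
    for i in range(polls):
        try:
            health = fetch()
            state = (health["status"], health["number_of_nodes"])
        except Exception:
            state = ("unreachable", 0)
        # Record a state only when it differs from the previous one.
        if not transitions or transitions[-1] != state:
            transitions.append(state)
        if i < polls - 1:
            sleep(interval)
    return transitions
```

Start the watcher on a node that stays up, then stop and restart the target node. The returned list shows each state change once, which makes it easy to see whether the cluster passed through yellow and where it settled.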

Application Instance Failover

If an application server (running the Python app) fails:

The Linode NodeBalancer’s health checks will detect that the instance is unresponsive.
The NodeBalancer will automatically remove the unhealthy instance from its rotation.
Incoming traffic will be directed solely to the remaining healthy application instances.

Testing Application Failover:

Stop the Gunicorn process on one of your application servers (e.g., app-server-1).
Send requests to your NodeBalancer’s IP address. You should see requests being served by the remaining healthy instance(s).
Check the NodeBalancer’s status in the Linode Cloud Manager; it should mark the instance as unhealthy.
Restart the Gunicorn process on the failed server.
The NodeBalancer should detect its health and add it back to the rotation.
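
The request step of this test can be scripted to count how many requests succeed while one instance is down. The helper names below are assumptions for this sketch, and the URL you pass in would be your own NodeBalancer address.

```python
from collections import Counter
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET to `url`, or "error" if no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered with a 4xx or 5xx status
    except (urllib.error.URLError, OSError):
        return "error"  # connection refused, timeout, or DNS failure


def probe(send, count=20):
    """Call `send()` `count` times and tally the results."""
    return Counter(send() for _ in range(count))
```

For example, probe(lambda: http_status(url)), where url points at the /search route behind the NodeBalancer, returns a tally of status codes. A few failures right after stopping Gunicorn are plausible until the health check marks the backend as down; after that the tally should contain only successful responses.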

Application-Level Resilience (Client-Side)

While the NodeBalancer handles infrastructure-level failover for the application instances, the Python application itself should be resilient to temporary Elasticsearch unavailability. The elasticsearch-py client has built-in retry mechanisms, but for more robust handling, consider implementing:

Connection Pooling: The client manages a pool of connections.
Timeouts and Retries: Configure appropriate timeouts for requests and implement custom retry logic in your application code for transient network issues or brief Elasticsearch unavailability.
Circuit Breaker Pattern: If Elasticsearch is consistently unavailable, your application could temporarily stop sending requests to it to prevent cascading failures.

For example, to add basic retry logic to your Elasticsearch client:

from elasticsearch import Elasticsearch, ConnectionError, ConnectionTimeout
import time

# ... (other imports and Flask setup)

# Configure Elasticsearch connection with transport-level retries.
# Parameter names follow the 8.x client; on 7.x, pass timeout=30 instead of request_timeout=30.
# RequestsHttpConnection is omitted: it was removed in 8.x and is not needed for retries.
es = Elasticsearch(
    [ES_HOST],
    retry_on_timeout=True,
    max_retries=3,
    retry_on_status=[500, 502, 503, 504],
    request_timeout=30
)

# Application-level retry with exponential backoff, layered on top of the client's own retries
def index_document_with_retry(index_name, doc_id, data, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return es.index(index=index_name, id=doc_id, document=data)
        except (ConnectionError, ConnectionTimeout) as ce:
            # The client raises its own exception types, not those of the underlying HTTP library
            print(f"Connection error on attempt {attempt + 1}: {ce}")
            if attempt < max_attempts - 1:
                time.sleep(2 ** attempt) # Exponential backoff: 1s, 2s, 4s, 8s
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            return {"error": str(e)}
    return {"error": "Failed to index document after multiple retries"}

# In your route:
# response = index_document_with_retry("my-index", doc_id, data)
# if "error" in response:
#     return jsonify({"error": response["error"]}), 500
# else:
#     return jsonify(response), 201
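
The circuit breaker pattern listed above can be sketched in a similar way. This is a minimal single-process illustration rather than a production implementation: the class names, thresholds, and the injectable clock are all assumptions for this example, and it is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A route would wrap its Elasticsearch calls as breaker.call(es.index, index="my-index", id=doc_id, document=data) and return a 503 when CircuitOpenError is raised. Because Gunicorn workers are separate processes, each worker would hold its own breaker state in this sketch.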

This layered approach (robust Elasticsearch clustering, load-balanced application instances with health checks, and resilient application code) forms the foundation of an automated disaster recovery strategy for your Python and Elasticsearch deployments on Linode.