Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and C Deployments on DigitalOcean

Elasticsearch Cluster Setup for High Availability

Achieving robust disaster recovery for Elasticsearch hinges on a well-architected, multi-node cluster with proper shard allocation and replication. For production deployments on DigitalOcean, we’ll leverage a combination of Elasticsearch’s built-in features and external orchestration for failover.

A typical Elasticsearch cluster for HA should consist of at least three master-eligible nodes and multiple data nodes. Data nodes should be configured with sufficient resources (CPU, RAM, and fast SSDs) to handle the expected load. We’ll assume a basic setup with dedicated master nodes and data nodes, though a mixed role configuration is also viable for smaller clusters.
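
The reason for three master-eligible nodes is majority voting: a master can only be elected while more than half of the voting nodes are reachable. The following is a simplified sketch of that arithmetic (the function names are illustrative, and Elasticsearch's real voting configuration adjusts itself in ways not modeled here):

```python
def quorum(master_eligible: int) -> int:
    """Votes needed to elect a master: a strict majority of the voting nodes."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


for n in (1, 2, 3, 5):
    print(f"{n} master-eligible: quorum={quorum(n)}, can lose {tolerated_failures(n)}")
```

Under this model two master-eligible nodes tolerate no failures at all, which is why three is the usual minimum.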

Configuring Elasticsearch for Resilience

The core of Elasticsearch’s resilience lies in its distributed nature and replication. We need to ensure that our indices have sufficient replicas and that these replicas are distributed across different availability zones, or even regions if the DR strategy demands it. For DigitalOcean, this translates to deploying Droplets in different datacenters.

The primary configuration file, elasticsearch.yml, is critical. Key settings for HA include:

cluster.name: Must be identical across all nodes in the cluster.
node.name: Unique identifier for each node.
network.host: The IP address or hostname the node binds to. Use a private IP for inter-node communication.
discovery.seed_hosts: A list of IP addresses or hostnames that a starting node contacts to discover the cluster. Ideally these are the master-eligible nodes.
cluster.initial_master_nodes: A list of the node names of the master-eligible nodes that vote in the very first election. It is only used when bootstrapping a brand-new cluster and should be removed once the cluster has formed.
indices.recovery.max_bytes_per_sec: Controls the speed of shard recovery. Adjust based on network bandwidth and disk I/O.
cluster.routing.allocation.enable: Set to all to allow shard allocation.
cluster.routing.allocation.node_concurrent_recoveries: Controls how many shards can be recovered concurrently on a single node.
cluster.routing.allocation.cluster_concurrent_rebalance: Controls how many shards can be rebalanced concurrently across the cluster.

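Here’s an example snippet for elasticsearch.yml on a master-eligible node. On Elasticsearch 7.9 or later, a dedicated master node would additionally set node.roles: [ master ]: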

cluster.name: my-production-cluster
node.name: es-master-01
network.host: [_local_,_site_]
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - es-master-01:9300
  - es-master-02:9300
  - es-master-03:9300
  - es-data-01:9300
  - es-data-02:9300

cluster.initial_master_nodes:
  - es-master-01
  - es-master-02
  - es-master-03

indices.recovery.max_bytes_per_sec: 50mb
cluster.routing.allocation.enable: all
cluster.routing.allocation.node_concurrent_recoveries: 2
cluster.routing.allocation.cluster_concurrent_rebalance: 2

xpack.security.enabled: true # Assuming X-Pack security is enabled
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true

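And for a data node (on Elasticsearch 7.9 or later, a dedicated data node would also set node.roles: [ data ]; without it the node keeps all default roles, including master-eligible):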

cluster.name: my-production-cluster
node.name: es-data-01
network.host: [_local_,_site_]
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - es-master-01:9300
  - es-master-02:9300
  - es-master-03:9300
  - es-data-01:9300
  - es-data-02:9300

cluster.initial_master_nodes:
  - es-master-01
  - es-master-02
  - es-master-03

indices.recovery.max_bytes_per_sec: 50mb
cluster.routing.allocation.enable: all
cluster.routing.allocation.node_concurrent_recoveries: 2
cluster.routing.allocation.cluster_concurrent_rebalance: 2

xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true
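
Once the nodes are up, it is worth confirming that the cluster actually sees three master-eligible nodes and exactly one elected master. The sketch below assumes you have fetched the output of GET /_cat/nodes?format=json&h=name,node.role,master yourself and pass it in as a string. The function name and sample data are illustrative only:

```python
import json


def check_masters(cat_nodes_json: str, required: int = 3) -> dict:
    """Summarise master eligibility from _cat/nodes JSON output."""
    nodes = json.loads(cat_nodes_json)
    # In node.role, the letter "m" marks a master-eligible node.
    eligible = sorted(n["name"] for n in nodes if "m" in n.get("node.role", ""))
    # The elected master is flagged with "*" in the master column.
    elected = [n["name"] for n in nodes if n.get("master") == "*"]
    return {
        "eligible": eligible,
        "elected": elected[0] if len(elected) == 1 else None,
        "ok": len(eligible) >= required and len(elected) == 1,
    }


sample = json.dumps([
    {"name": "es-master-01", "node.role": "m", "master": "*"},
    {"name": "es-master-02", "node.role": "m", "master": "-"},
    {"name": "es-master-03", "node.role": "m", "master": "-"},
    {"name": "es-data-01", "node.role": "di", "master": "-"},
])
print(check_masters(sample))
```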

Shard Replication and Allocation Strategies

To ensure data availability, each index must have at least one replica. For DR, we aim for at least two copies of every shard (one primary and one replica) placed in different DigitalOcean datacenters. If one datacenter becomes unavailable, the surviving replica can then be promoted to primary.

When creating an index, specify the number of primary shards and replicas. For example, the following request body (sent with PUT to the index name) creates an index with 3 primary shards and 2 replicas of each:

{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 2
    }
  }
}
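
To make the sizing implications concrete, here is a small sketch that builds the same settings body and works out how many shard copies it produces. The helper names are illustrative. The node-count rule relies on the fact that Elasticsearch never places two copies of the same shard on one node:

```python
import json


def index_settings(primaries: int, replicas: int) -> dict:
    """Request body for creating an index with the given shard layout."""
    return {"settings": {"index": {"number_of_shards": primaries,
                                   "number_of_replicas": replicas}}}


def total_shard_copies(primaries: int, replicas: int) -> int:
    """Each primary plus all of its replicas."""
    return primaries * (1 + replicas)


def min_data_nodes(replicas: int) -> int:
    """Fewest data nodes on which every copy of a shard can be allocated."""
    return replicas + 1


print(json.dumps(index_settings(3, 2), indent=2))
print("shard copies:", total_shard_copies(3, 2))
print("data nodes needed:", min_data_nodes(2))
```

With 3 primaries and 2 replicas the cluster holds 9 shard copies and needs at least 3 data nodes before all replicas can be assigned.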

Elasticsearch’s allocation deciders are crucial for controlling where shards are placed. We can use cluster.routing.allocation.awareness.attributes to ensure replicas are placed on nodes in different datacenters. First, you need to tag your nodes with datacenter awareness attributes in elasticsearch.yml:

# On nodes in datacenter A
cluster.routing.allocation.awareness.attributes: datacenter
node.attr.datacenter: nyc3

# On nodes in datacenter B
cluster.routing.allocation.awareness.attributes: datacenter
node.attr.datacenter: ams3

With this configuration, Elasticsearch will attempt to place a shard’s primary and its replicas on nodes in different datacenters. If you have multiple nodes within the same datacenter, it spreads the shard copies across them as evenly as possible.
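
Because awareness is best-effort, it can help to verify the resulting placement. The sketch below is one way to do that, assuming you have the parsed output of GET /_cat/shards?format=json and your own mapping from node name to datacenter. The function name, node names, and mapping are hypothetical:

```python
from collections import defaultdict


def shards_in_single_datacenter(shards: list, node_dc: dict) -> list:
    """Return (index, shard) pairs whose started copies all sit in one datacenter."""
    zones = defaultdict(set)
    for s in shards:
        if s.get("state") != "STARTED":
            continue  # ignore unassigned, initializing, or relocating copies
        zones[(s["index"], s["shard"])].add(node_dc.get(s["node"], "unknown"))
    return sorted(key for key, dcs in zones.items() if len(dcs) < 2)


node_dc = {"es-data-01": "nyc3", "es-data-02": "ams3", "es-data-03": "nyc3"}
shards = [
    {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED", "node": "es-data-01"},
    {"index": "logs", "shard": "0", "prirep": "r", "state": "STARTED", "node": "es-data-02"},
    {"index": "logs", "shard": "1", "prirep": "p", "state": "STARTED", "node": "es-data-01"},
    {"index": "logs", "shard": "1", "prirep": "r", "state": "STARTED", "node": "es-data-03"},
]
print(shards_in_single_datacenter(shards, node_dc))
```

Any pair this returns would not survive the loss of its datacenter.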

Automated Failover Orchestration with Keepalived and HAProxy

While Elasticsearch handles internal node failures and shard rebalancing, we need an external mechanism to manage client-facing access and ensure seamless failover of the API endpoint. This is where Keepalived and HAProxy come into play.

We’ll set up a highly available pair of HAProxy instances, managed by Keepalived for virtual IP (VIP) failover. This VIP will be the single point of access for your applications.

Keepalived Configuration for VIP Management

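Keepalived uses the Virtual Router Redundancy Protocol (VRRP) to manage a shared IP address. Two Droplets act as Keepalived peers: one is in MASTER state and holds the VIP, while the other is in BACKUP state, ready to take over. Note that VRRP assumes both peers share a network segment, and cloud networks often restrict multicast and unmanaged addresses, so on DigitalOcean you may need unicast peers (unicast_peer) and a Reserved IP reassigned through the API rather than a plain VIP.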

Install Keepalived on both HAProxy servers:

sudo apt update
sudo apt install keepalived -y

The primary configuration file is /etc/keepalived/keepalived.conf. Here’s an example for the MASTER node:

vrrp_script chk_haproxy {
    script "/usr/local/bin/check_haproxy.sh"
    interval 2
    weight -60 # Subtracted from priority while the check fails; must exceed the 50-point MASTER/BACKUP gap
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0 # Replace with your primary network interface
    virtual_router_id 51
    priority 150 # Higher priority for MASTER
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_password
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 # Your Virtual IP address
    }
    track_script {
        chk_haproxy
    }
}

And for the BACKUP node:

vrrp_script chk_haproxy {
    script "/usr/local/bin/check_haproxy.sh"
    interval 2
    weight -60 # Subtracted from priority while the check fails
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0 # Replace with your primary network interface
    virtual_router_id 51
    priority 100 # Lower priority for BACKUP
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_password
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 # Your Virtual IP address
    }
    track_script {
        chk_haproxy
    }
}
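
Whether a failed HAProxy check actually moves the VIP depends on how the vrrp_script weight compares with the gap between the two priorities. The sketch below is a simplified model of Keepalived's rule (a positive weight is added while the check succeeds, a negative weight is applied while it fails). It ignores the weight 0 case, where a failing check puts the instance into FAULT state, and the function names are illustrative:

```python
def effective_priority(base: int, weight: int, check_ok: bool) -> int:
    """Priority after applying a tracked script's weight (simplified model)."""
    if weight > 0 and check_ok:
        return base + weight   # bonus while the check passes
    if weight < 0 and not check_ok:
        return base + weight   # penalty while the check fails
    return base


def vip_moves(master_base: int, backup_base: int, weight: int) -> bool:
    """True if the BACKUP outranks the MASTER when only the MASTER's check fails."""
    master = effective_priority(master_base, weight, check_ok=False)
    backup = effective_priority(backup_base, weight, check_ok=True)
    return backup > master


for w in (2, -60):
    print(f"weight {w}: VIP moves on HAProxy failure -> {vip_moves(150, 100, w)}")
```

For the priorities 150 and 100 used above, a small positive weight leaves the MASTER ahead even with HAProxy down, whereas a negative weight larger than the 50-point gap hands the VIP to the BACKUP.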

The check_haproxy.sh script referenced by the chk_haproxy block monitors HAProxy’s health. Create it at /usr/local/bin/check_haproxy.sh:

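#!/bin/bash
# Exit 0 if an haproxy process is running, 1 otherwise.
# -x matches the process name exactly, so the pattern cannot
# match this script's own name (check_haproxy.sh).
if pgrep -x haproxy > /dev/null; then
    exit 0
else
    exit 1
fi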

Make it executable:

sudo chmod +x /usr/local/bin/check_haproxy.sh
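
A process check only shows that HAProxy is running, not that it is accepting connections. As a possible refinement, the sketch below tests whether the listening port answers. The function names and the choice of 127.0.0.1:9200 are assumptions based on the bind line used later in this article:

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_exit_code(host: str = "127.0.0.1", port: int = 9200) -> int:
    """Exit status in the convention Keepalived expects: 0 healthy, 1 failed."""
    return 0 if port_open(host, port) else 1
```

A wrapper script used as the vrrp_script target would end with sys.exit(check_exit_code()) after importing sys.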

Restart Keepalived on both nodes:

sudo systemctl restart keepalived

HAProxy Configuration for Elasticsearch Backend

HAProxy will act as the load balancer for your Elasticsearch cluster. It will distribute traffic across your Elasticsearch data nodes. We’ll configure it to perform health checks on the Elasticsearch nodes.

Install HAProxy:

sudo apt update
sudo apt install haproxy -y

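Edit the HAProxy configuration file, typically /etc/haproxy/haproxy.cfg. Binding to all addresses (*), as below, lets HAProxy start on the BACKUP node even though the VIP is not assigned there. Binding to the VIP itself would require net.ipv4.ip_nonlocal_bind=1.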

global
log         /dev/log local0
log         /dev/log local1 notice
maxconn     4096
user        haproxy
group       haproxy
daemon

defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    timeout connect 5000
    timeout client  50000
    timeout server  50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

listen elasticsearch_cluster
    bind *:9200 # Or the port your Elasticsearch API listens on
    mode tcp
    balance roundrobin
    option httpchk GET / HTTP/1.1\r\nHost:\ localhost\r\nConnection:\ close
    http-check expect status 200
    server es-data-01 10.10.0.1:9200 check port 9200 inter 2s fall 3 rise 2
    server es-data-02 10.10.0.2:9200 check port 9200 inter 2s fall 3 rise 2
    server es-data-03 10.10.0.3:9200 check port 9200 inter 2s fall 3 rise 2
    # Add more data nodes as needed

Explanation of HAProxy settings:

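mode tcp: HAProxy forwards connections at the TCP level without parsing requests, which lets TLS pass through to the Elasticsearch nodes untouched. HTTP mode also works with Elasticsearch’s REST API if you want HAProxy to inspect or route individual requests.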
balance roundrobin: Distributes requests evenly across available servers.
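option httpchk and http-check expect status 200: These lines configure an HTTP health check. HAProxy sends a GET request to the root path of each Elasticsearch node, and a 200 OK response indicates the node is healthy. Spaces inside the header portion of the check must be escaped with a backslash. If security and HTTP TLS are enabled as in the elasticsearch.yml examples above, an unauthenticated plain-HTTP check will not return 200, so the check needs TLS (check-ssl) and credentials, or a different expected status.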
server [name] [ip]:[port] check port [port] inter [interval] fall [failures] rise [successes]: Defines each Elasticsearch backend server. The check directive enables health checks. fall and rise determine how many consecutive failures or successes are needed to mark a server as down or up, respectively.
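
The inter, fall, and rise values together determine roughly how long HAProxy takes to notice a dead node. A rough sketch of that arithmetic follows. It ignores check timeouts and connection latency, and the helper names are illustrative:

```python
def parse_check_params(server_line: str) -> dict:
    """Pull inter, fall, and rise out of an HAProxy server line."""
    tokens = server_line.split()
    return {key: tokens[tokens.index(key) + 1]
            for key in ("inter", "fall", "rise") if key in tokens}


def to_seconds(value: str) -> float:
    """Convert an HAProxy time value such as '2s' or '500ms' to seconds."""
    if value.endswith("ms"):
        return int(value[:-2]) / 1000
    if value.endswith("s"):
        return float(value[:-1])
    return int(value) / 1000  # HAProxy treats bare numbers as milliseconds


line = "server es-data-01 10.10.0.1:9200 check port 9200 inter 2s fall 3 rise 2"
p = parse_check_params(line)
print("approx. seconds to mark down:", to_seconds(p["inter"]) * int(p["fall"]))
print("approx. seconds to mark up:", to_seconds(p["inter"]) * int(p["rise"]))
```

With the values above, a failed node is removed after about 6 seconds and readmitted about 4 seconds after it recovers.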

Restart HAProxy:

sudo systemctl restart haproxy

Testing the Failover Mechanism

To test the automated failover, simulate failures:

Simulate Elasticsearch Node Failure: Stop an Elasticsearch data node (e.g., sudo systemctl stop elasticsearch). HAProxy should detect the failure via its health checks and stop sending traffic to it. Elasticsearch promotes replicas of any lost primaries and, if the node stays away, allocates replacement replicas on the remaining nodes.
Simulate HAProxy/Keepalived Failure: On the MASTER Keepalived node, stop Keepalived (sudo systemctl stop keepalived). The VIP should be transferred to the BACKUP node. Verify by checking that the VIP appears on the BACKUP node (for example, with ip addr show eth0). Stopping HAProxy instead (sudo systemctl stop haproxy) exercises the track_script path.
Simulate Datacenter Failure: If you have Droplets in multiple DigitalOcean datacenters, simulate a datacenter outage by shutting down all nodes in one datacenter. Your applications should continue to receive responses from the remaining active nodes via the HAProxy VIP. This only holds if a majority of the master-eligible nodes survives, so make sure no single datacenter hosts a majority of them.

Monitor logs for Keepalived (/var/log/syslog) and HAProxy (/var/log/haproxy.log) to diagnose any issues during testing.
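
To put a number on each test, you can probe the VIP at a fixed interval while triggering the failure and then measure the longest gap. The sketch below is one way to do this. The probe function is shown but not called, the URL you pass it is your own, and treating any HTTP response (including 401) as reachable is an assumption that suits a secured cluster:

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 1.0) -> bool:
    """True if the endpoint answers at all, even with an HTTP error status."""
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        return True   # the server responded, e.g. 401 when security is enabled
    except OSError:
        return False  # refused, timed out, or unreachable


def longest_outage(samples: list) -> float:
    """Longest span from a first failed probe to the next success.

    samples is a time-ordered list of (timestamp_seconds, ok) tuples.
    """
    longest, start, last_ts = 0.0, None, None
    for ts, ok in samples:
        last_ts = ts
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            longest = max(longest, ts - start)
            start = None
    if start is not None:  # still failing when sampling stopped
        longest = max(longest, last_ts - start)
    return longest


samples = [(0, True), (1, False), (2, False), (3, True), (4, True)]
print("longest outage (s):", longest_outage(samples))
```

In a real test you would build the samples list with a loop such as samples.append((time.time(), probe(url))) followed by a short sleep.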

Considerations for Production Deployments

Monitoring: Implement comprehensive monitoring for your Elasticsearch cluster (e.g., using Elasticsearch’s own monitoring features, Prometheus/Grafana) and your HAProxy/Keepalived setup. Alerting on node failures, cluster health, and VIP status is paramount.

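Security: Ensure all inter-node communication in Elasticsearch is secured with TLS. Configure X-Pack security for authentication and authorization. Secure your HAProxy and Keepalived instances as well: Keepalived’s auth_type PASS is sent unencrypted and only the first eight characters of auth_pass are used, so restrict VRRP traffic to the two peers with firewall rules.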

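Network Configuration: Use private networking for inter-node communication within your DigitalOcean VPC. A VPC network is scoped to a single datacenter region, so traffic between datacenters needs VPC peering or an encrypted tunnel. Ensure firewall rules (DigitalOcean Cloud Firewalls or UFW) allow the necessary traffic between nodes and from clients to HAProxy.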

Backup and Restore: While replication provides high availability, it’s not a substitute for backups. Implement a robust backup strategy for your Elasticsearch data (e.g., using Elasticsearch’s snapshot/restore API to S3-compatible storage).
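
As a rough illustration of the snapshot workflow, the sketch below builds the request body for registering an S3-type repository (PUT /_snapshot/<repository>) and a dated path for creating a snapshot (PUT to that path). The repository name, bucket name, and naming scheme are hypothetical. It also assumes S3 repository support is available on your nodes and that the endpoint for a non-AWS service is configured separately in the node settings:

```python
from datetime import datetime, timezone


def repository_body(bucket: str) -> dict:
    """Body for registering an S3-type snapshot repository."""
    return {"type": "s3", "settings": {"bucket": bucket}}


def snapshot_path(repository: str, when: datetime) -> str:
    """Request path for creating a snapshot with a lowercase, dated name."""
    return f"/_snapshot/{repository}/snapshot-{when:%Y.%m.%d-%H%M}"


print(repository_body("my-es-backups"))
print(snapshot_path("do_spaces", datetime.now(timezone.utc)))
```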

Automated Deployment: Use infrastructure-as-code tools like Terraform or Ansible to automate the deployment and configuration of your Elasticsearch cluster, HAProxy, and Keepalived instances. This ensures consistency and simplifies recovery.
