Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Ruby Deployments on DigitalOcean

Elasticsearch Cluster Setup for High Availability on DigitalOcean

Achieving robust disaster recovery for Elasticsearch hinges on a well-architected, multi-node cluster with automatic failover capabilities. We’ll focus on a setup using DigitalOcean Droplets, leveraging Elasticsearch’s built-in quorum and discovery mechanisms. For this example, we’ll assume a three-node cluster, which is the minimum recommended for quorum-based failover: a majority of the master-eligible nodes must be available for the cluster to elect a master, so three nodes can tolerate the loss of one.
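The majority rule is easy to check with a few lines of Ruby. This is only an illustration of the arithmetic; the helper names quorum_size and tolerated_failures are made up for this sketch, and Elasticsearch manages its own voting configuration internally.

```ruby
# Smallest majority among the master-eligible nodes.
def quorum_size(master_eligible)
  master_eligible / 2 + 1
end

# How many master-eligible nodes can be lost while a majority remains.
def tolerated_failures(master_eligible)
  master_eligible - quorum_size(master_eligible)
end

[1, 2, 3, 5].each do |n|
  puts "#{n} node(s): quorum #{quorum_size(n)}, tolerates #{tolerated_failures(n)} failure(s)"
end
```

Note that two nodes tolerate no failures at all, which is why three is the practical minimum.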

Each Droplet will run a dedicated Elasticsearch instance. Network configuration is critical: the nodes must be able to reach each other on the transport port (default 9300), and your application servers must be able to reach them on the HTTP port (default 9200). We’ll use private networking for inter-node communication for security and performance.

Elasticsearch Configuration (`elasticsearch.yml`)

On each Elasticsearch Droplet, the elasticsearch.yml file needs to be configured to enable discovery and specify cluster details. The following configuration is a template; adjust IP addresses and cluster names as per your environment.

Node 1 (Master/Data):

cluster.name: my-production-cluster
node.name: es-node-1
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - 10.10.0.1:9300  # Private IP of es-node-1
  - 10.10.0.2:9300  # Private IP of es-node-2
  - 10.10.0.3:9300  # Private IP of es-node-3

cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"

node.roles: [ master, data, ingest ]

# For production, consider these settings:
# xpack.security.enabled: true
# xpack.security.transport.ssl.enabled: true
# xpack.security.http.ssl.enabled: true

Node 2 (Master/Data):

cluster.name: my-production-cluster
node.name: es-node-2
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - 10.10.0.1:9300
  - 10.10.0.2:9300
  - 10.10.0.3:9300

cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"

node.roles: [ master, data, ingest ]

# For production, consider these settings:
# xpack.security.enabled: true
# xpack.security.transport.ssl.enabled: true
# xpack.security.http.ssl.enabled: true

Node 3 (Master/Data):

cluster.name: my-production-cluster
node.name: es-node-3
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - 10.10.0.1:9300
  - 10.10.0.2:9300
  - 10.10.0.3:9300

cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"

node.roles: [ master, data, ingest ]

# For production, consider these settings:
# xpack.security.enabled: true
# xpack.security.transport.ssl.enabled: true
# xpack.security.http.ssl.enabled: true

Key Configuration Points:

cluster.name: Must be identical across all nodes.
node.name: Unique identifier for each node.
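network.host: Set to 0.0.0.0 to bind to all network interfaces, including the Droplet’s public one, or to the node’s private IP (e.g., 10.10.0.1) to keep Elasticsearch off the public interface. If you bind to 0.0.0.0, rely on the firewall rules below to block public access.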
discovery.seed_hosts: A list of IP addresses and ports of other nodes in the cluster. Elasticsearch will attempt to connect to these to discover the cluster.
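cluster.initial_master_nodes: A list of node names that are eligible to become the master node when the cluster is first formed. This is crucial for bootstrapping a brand-new cluster; once the cluster has formed, remove the setting from each node’s configuration, and do not set it on nodes that join an existing cluster.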
node.roles: Defines the roles of the node. For a small cluster, combining master and data roles is common. For larger clusters, dedicated master nodes are recommended.
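Because the three files differ only in node.name, they can be rendered from a single node list rather than edited by hand. The following is a minimal sketch under stated assumptions: the NODES mapping and the elasticsearch_config helper are hypothetical, and it binds network.host to each node’s private IP (the more restrictive option mentioned above) instead of 0.0.0.0.

```ruby
require 'yaml'

# Hypothetical inventory: node name => private IP (values from the template above).
NODES = {
  'es-node-1' => '10.10.0.1',
  'es-node-2' => '10.10.0.2',
  'es-node-3' => '10.10.0.3'
}.freeze

# Builds the settings hash for one node; only node.name and network.host vary.
def elasticsearch_config(node_name, nodes: NODES, cluster_name: 'my-production-cluster')
  raise ArgumentError, "unknown node #{node_name}" unless nodes.key?(node_name)

  {
    'cluster.name' => cluster_name,
    'node.name' => node_name,
    'network.host' => nodes.fetch(node_name),
    'http.port' => 9200,
    'transport.port' => 9300,
    'discovery.seed_hosts' => nodes.values.map { |ip| "#{ip}:9300" },
    'cluster.initial_master_nodes' => nodes.keys,
    'node.roles' => %w[master data ingest]
  }
end

puts elasticsearch_config('es-node-2').to_yaml
```

Generating the files this way keeps cluster.name and the seed host list identical on every node, which is the property the cluster depends on.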

Firewall Configuration (UFW Example)

On each Droplet, configure the firewall to allow necessary traffic. Assuming UFW is installed:

# Allow SSH (replace with your actual SSH port if not 22)
sudo ufw allow 22/tcp

# Allow Elasticsearch HTTP API (port 9200) from your application servers' IPs
# Replace 192.168.1.0/24 with your application subnet or specific IPs
sudo ufw allow from 192.168.1.0/24 to any port 9200 proto tcp

# Allow Elasticsearch Transport Layer (port 9300) between Elasticsearch nodes
# Replace 10.10.0.0/16 with your private network subnet
sudo ufw allow from 10.10.0.0/16 to any port 9300 proto tcp

# Enable the firewall
sudo ufw enable
sudo ufw status verbose

Important: For production, you’ll want to restrict access to port 9200 to only your application servers and potentially your management IPs. Port 9300 should only be accessible between your Elasticsearch nodes.

Ruby Application Integration and Health Checks

Your Ruby application needs to interact with Elasticsearch and, critically, be aware of its health. We’ll use the elasticsearch-ruby client and implement a basic health check mechanism.

Elasticsearch Ruby Client Setup

Add the gem to your Gemfile:

gem 'elasticsearch'

Then run bundle install.

Initialize the client in your application. It’s best practice to configure multiple hosts so the client can fail over if a node becomes unavailable. Note that the client only discovers nodes beyond those listed if sniffing is enabled (for example via the reload_connections option).

require 'elasticsearch'

# Elasticsearch nodes, reachable on port 9200 over the private network.
# The client only uses the hosts listed here unless sniffing
# (reload_connections) is enabled.
ELASTICSEARCH_HOSTS = [
  'http://10.10.0.1:9200',
  'http://10.10.0.2:9200',
  'http://10.10.0.3:9200'
].freeze

# Initialize the client.
# request_timeout (seconds) bounds how long a request waits on an unresponsive node.
# retry_on_failure retries a failed request on another host, up to the given count.
$elasticsearch_client = Elasticsearch::Client.new(
  hosts: ELASTICSEARCH_HOSTS,
  request_timeout: 5,
  retry_on_failure: 2,
  log: false
)

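# Example of a simple search query.
# The error class below is from the 7.x client; 8.x clients define it under
# Elastic::Transport::Transport::Errors instead.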
begin
  response = $elasticsearch_client.search index: 'my_index', body: { query: { match_all: {} } }
  puts "Search successful: #{response['hits']['total']['value']} hits"
rescue Elasticsearch::Transport::Transport::Errors::ServiceUnavailable => e
  puts "Elasticsearch is unavailable: #{e.message}"
rescue => e
  puts "An error occurred: #{e.message}"
end

The elasticsearch-ruby client is designed to be resilient. When initialized with multiple hosts, it rotates requests across them (round-robin by default). If a host is unresponsive (due to a timeout or network issue), the client marks that connection as dead and, when retry_on_failure is set, retries the request on another host. This provides a basic level of application-level failover.
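The rotate-and-retry idea can be illustrated without the gem. The class below is a deliberately simplified stand-in, not the client’s actual implementation: RotatingHosts and with_failover are hypothetical names, and real clients also track dead connections and resurrect them after a delay.

```ruby
# Simplified model of multi-host failover: rotate through the hosts and
# move on to the next one whenever a request raises.
class RotatingHosts
  class AllHostsFailed < StandardError; end

  def initialize(hosts)
    raise ArgumentError, 'at least one host is required' if hosts.empty?

    @hosts = hosts
    @next_index = 0
  end

  # Yields candidate hosts to the block until one call succeeds and returns
  # that result. Raises AllHostsFailed if every host raised.
  def with_failover
    errors = []
    @hosts.size.times do
      host = @hosts[@next_index]
      @next_index = (@next_index + 1) % @hosts.size
      begin
        return yield(host)
      rescue StandardError => e
        errors << "#{host}: #{e.message}"
      end
    end
    raise AllHostsFailed, errors.join('; ')
  end
end

# Usage: the first host is simulated as down, so the second one answers.
hosts = RotatingHosts.new(%w[10.10.0.1 10.10.0.2 10.10.0.3])
down = ['10.10.0.1']
answered_by = hosts.with_failover do |host|
  raise 'connection refused' if down.include?(host)

  host
end
puts "answered by #{answered_by}"
```

The request timeout matters here because each failed attempt costs up to one full timeout before the next host is tried.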

Implementing Application-Level Health Checks

Your application needs a way to determine if Elasticsearch is healthy. This can be integrated into your application’s monitoring or used by an external load balancer/orchestrator.

A simple health check endpoint in your Ruby on Rails application (e.g., in config/routes.rb and a corresponding controller action):

# config/routes.rb
get '/health', to: 'health#show'

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def show
    begin
      # Use the client's ping method for a lightweight check
      if $elasticsearch_client.ping
        render json: { elasticsearch: 'healthy' }, status: :ok
      else
        # ping might return false if no nodes are reachable,
        # or an exception might be raised for more severe issues.
        render json: { elasticsearch: 'unhealthy', error: 'Ping failed' }, status: :service_unavailable
      end
    rescue Elasticsearch::Transport::Transport::Errors::ServiceUnavailable => e
      render json: { elasticsearch: 'unhealthy', error: "Service Unavailable: #{e.message}" }, status: :service_unavailable
    rescue => e
      # Catch any other unexpected errors
      render json: { elasticsearch: 'unhealthy', error: "Unexpected error: #{e.message}" }, status: :internal_server_error
    end
  end
end

This health check endpoint can be polled by external monitoring tools (like DigitalOcean’s monitoring, Prometheus, or a custom script) or by a load balancer. If the endpoint returns a non-2xx status code, the system can take action.
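A custom polling script could look roughly like the following. It is a sketch under assumptions: endpoint_healthy? and ConsecutiveFailureAlarm are hypothetical names, the URL and threshold are placeholders, and requiring several consecutive failures before acting is a design choice to avoid reacting to a single dropped request.

```ruby
require 'net/http'
require 'uri'

# Returns true only for a 2xx response. Timeouts, refused connections and
# other errors all count as unhealthy.
def endpoint_healthy?(url, timeout: 3)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.get(uri.request_uri)
  end
  response.is_a?(Net::HTTPSuccess)
rescue StandardError
  false
end

# Fires once when the number of consecutive failed checks reaches the threshold.
class ConsecutiveFailureAlarm
  def initialize(threshold)
    @threshold = threshold
    @failures = 0
  end

  # Records one check result; returns true at the moment the alarm should fire.
  def record(healthy)
    @failures = healthy ? 0 : @failures + 1
    @failures == @threshold
  end
end

# Example loop (commented out so the file can be loaded without network access):
# alarm = ConsecutiveFailureAlarm.new(3)
# loop do
#   warn 'health endpoint failing' if alarm.record(endpoint_healthy?('http://app.example.internal/health'))
#   sleep 10
# end
```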

Automated Failover Orchestration with Keepalived and HAProxy

While Elasticsearch and the Ruby client offer internal resilience, true automated failover for your application’s access to Elasticsearch often requires an external component. We’ll use HAProxy as a load balancer for Elasticsearch and Keepalived for High Availability of the HAProxy instance itself. This setup ensures that your application always has a stable IP address to connect to, even if one HAProxy server fails.

HAProxy Configuration for Elasticsearch

We’ll deploy two HAProxy Droplets. These will sit in front of your Elasticsearch cluster. The application will connect to a Virtual IP (VIP) managed by Keepalived, which will direct traffic to one of the HAProxy instances. Each HAProxy instance will then load balance traffic to the Elasticsearch nodes.

HAProxy Configuration (`haproxy.cfg`) on both HAProxy Droplets:

global
log         /dev/log local0
log         /dev/log local1 notice
maxconn     4096
daemon

# Default settings
defaults
  log     global
  mode    http
  option  httplog
  option  dontlognull
  timeout connect 5000
  timeout client  50000
  timeout server  50000
  errorfile 400 /etc/haproxy/errors/400.http
  errorfile 403 /etc/haproxy/errors/403.http
  errorfile 408 /etc/haproxy/errors/408.http
  errorfile 500 /etc/haproxy/errors/500.http
  errorfile 502 /etc/haproxy/errors/502.http
  errorfile 503 /etc/haproxy/errors/503.http
  errorfile 504 /etc/haproxy/errors/504.http

# Frontend for application traffic to Elasticsearch
frontend elasticsearch_frontend
  bind *:9200
  mode http
  default_backend elasticsearch_backend

# Backend for Elasticsearch nodes
backend elasticsearch_backend
  mode http
  balance roundrobin
  option httpchk GET /_cluster/health?pretty
  http-check expect status 200
  # List your Elasticsearch nodes here. Use private IPs.
  server es1 10.10.0.1:9200 check port 9200 inter 2000 rise 2 fall 3
  server es2 10.10.0.2:9200 check port 9200 inter 2000 rise 2 fall 3
  server es3 10.10.0.3:9200 check port 9200 inter 2000 rise 2 fall 3

# Optional: HAProxy stats page
listen stats
  bind *:8404
  mode http
  stats enable
  stats uri /stats
  stats realm Haproxy\ Statistics
  stats auth admin:YourSecurePassword

Explanation:

mode http: HAProxy operates at the application layer for HTTP traffic.
balance roundrobin: Distributes requests evenly across available servers. Other options like leastconn can also be effective.
option httpchk: Configures HAProxy to perform HTTP health checks against the Elasticsearch cluster health endpoint.
http-check expect status 200: The health check expects an HTTP 200 OK status code. Elasticsearch’s cluster health endpoint typically returns 200 even if the cluster is not fully green, as long as it’s responsive. You might refine this to check for specific cluster states if needed, but it adds complexity.
server ... check port 9200 inter 2000 rise 2 fall 3: Defines each Elasticsearch node as a server. check enables health monitoring. inter is the interval between checks, rise is the number of successful checks to consider a server up, and fall is the number of failed checks to consider a server down.
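The effect of rise and fall is easier to see as a small state machine. The class below is a simplified model of the behaviour described above, not HAProxy’s actual code; BackendState and observe are hypothetical names. With inter 2000 and fall 3, a dead node is taken out of rotation after roughly three failed checks, i.e. on the order of six seconds.

```ruby
# Simplified model of HAProxy's rise/fall logic for one backend server.
class BackendState
  attr_reader :up

  def initialize(rise:, fall:, up: true)
    @rise = rise
    @fall = fall
    @up = up
    @streak = 0 # consecutive check results that contradict the current state
  end

  # Feeds one health-check result into the model and returns the resulting state.
  def observe(check_passed)
    if check_passed == @up
      @streak = 0
    else
      @streak += 1
      threshold = @up ? @fall : @rise
      if @streak >= threshold
        @up = !@up
        @streak = 0
      end
    end
    @up
  end
end

server = BackendState.new(rise: 2, fall: 3)
[false, false, false, true, true].each do |result|
  puts "check #{result ? 'passed' : 'failed'} -> #{server.observe(result) ? 'UP' : 'DOWN'}"
end
```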

Keepalived for HAProxy High Availability

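Keepalived provides a Virtual IP (VIP) that floats between the two HAProxy servers. If the primary HAProxy server fails, Keepalived automatically assigns the VIP to the secondary server. Be aware that DigitalOcean’s network does not deliver multicast VRRP advertisements and will not route an address that a Droplet simply claims for itself, so on DigitalOcean this pattern is typically implemented with unicast VRRP peers (unicast_peer) and a notify script that reassigns a Reserved IP (formerly Floating IP) through the DigitalOcean API. Treat the virtual_ipaddress in the configuration below as a stand-in for that address.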

Keepalived Configuration (`keepalived.conf`) on both HAProxy Droplets:

Droplet 1 (MASTER):

vrrp_script chk_haproxy {
    script "/usr/local/bin/check_haproxy.sh"
    interval 2
    weight -60  # Subtracted from priority on failure; must exceed the 150 vs 100 gap
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0  # Replace with your primary network interface
    virtual_router_id 51
    priority 150    # Higher priority for MASTER
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_password
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0  # Your Virtual IP address
    }
    track_script {
        chk_haproxy
    }
}

Droplet 2 (BACKUP):

vrrp_script chk_haproxy {
    script "/usr/local/bin/check_haproxy.sh"
    interval 2
    weight -60  # Subtracted from priority on failure; must exceed the 150 vs 100 gap
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0  # Replace with your primary network interface
    virtual_router_id 51
    priority 100    # Lower priority for BACKUP
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_password
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0  # Your Virtual IP address
    }
    track_script {
        chk_haproxy
    }
}

Health Check Script (`/usr/local/bin/check_haproxy.sh`):

#!/bin/bash

# Exit 0 if an HAProxy process is running, 1 otherwise.
# This only confirms that the process exists; it does not verify that HAProxy
# can reach any Elasticsearch backend. A more thorough check could query
# the Elasticsearch cluster health endpoint through HAProxy itself.
if pgrep -x haproxy > /dev/null; then
    exit 0
else
    exit 1
fi

Explanation:

vrrp_instance VI_1: Defines a VRRP instance. VI_1 is an arbitrary name.
state MASTER/BACKUP: Sets the initial state of the node.
interface eth0: The network interface to bind the VIP to. Ensure this is correct for your Droplets.
virtual_router_id: A unique ID for this VRRP group. Must be the same on both nodes.
priority: Determines which node becomes MASTER. Higher value wins.
authentication: Simple password authentication for VRRP packets.
virtual_ipaddress: The IP address that will be managed by Keepalived. Your application will connect to this IP.
vrrp_script chk_haproxy: Defines a script to run for health checking.
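track_script { chk_haproxy }: Tells Keepalived to monitor the script’s exit code. With a negative weight, a failing script lowers the node’s priority by that amount; with a positive weight, a passing script raises it. In both cases the weight’s magnitude must exceed the gap between the two priorities (150 and 100 here), otherwise the MASTER keeps the VIP even when HAProxy is down.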
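The interaction between priority and weight is worth checking by hand before relying on it. The sketch below models the documented weight rules only (it ignores preemption settings and tie-breaking); effective_priority and vip_holder are hypothetical helper names. It compares a weight of 20 with a weight of -60 for priorities 150 and 100 when HAProxy has failed on the MASTER.

```ruby
# Effective VRRP priority for one node given its track_script result.
# Negative weight: subtracted while the script fails.
# Positive weight: added while the script succeeds.
# Weight 0 (node enters FAULT state on failure) is outside this sketch.
def effective_priority(base, weight, script_ok)
  raise ArgumentError, 'weight 0 is not modelled here' if weight.zero?

  if weight.negative?
    script_ok ? base : base + weight
  else
    script_ok ? base + weight : base
  end
end

# Name of the node with the highest effective priority.
def vip_holder(nodes)
  nodes.max_by { |n| effective_priority(n[:priority], n[:weight], n[:script_ok]) }[:name]
end

[20, -60].each do |weight|
  nodes = [
    { name: 'haproxy-1', priority: 150, weight: weight, script_ok: false }, # HAProxy down
    { name: 'haproxy-2', priority: 100, weight: weight, script_ok: true }
  ]
  puts "weight #{weight}: VIP held by #{vip_holder(nodes)}"
end
```

With weight 20 the failed MASTER still wins (150 against 120), so no failover happens; with weight -60 it drops to 90 and the BACKUP at 100 takes over.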

Important: Ensure the /usr/local/bin/check_haproxy.sh script is executable (`chmod +x /usr/local/bin/check_haproxy.sh`) and that the eth0 interface is correct. You’ll also need to install keepalived and haproxy packages on both HAProxy Droplets.

Application Configuration Update

Update your Ruby application’s Elasticsearch client configuration to point to the Virtual IP managed by Keepalived.

require 'elasticsearch'

# Point to the Virtual IP managed by Keepalived
ELASTICSEARCH_HOSTS = [
  'http://192.168.1.100:9200'
  # You can optionally add the direct IPs of the HAProxy servers as fallbacks,
  # but the primary goal is to use the VIP.
  # 'http://10.10.0.10:9200', # HAProxy 1 IP
  # 'http://10.10.0.11:9200'  # HAProxy 2 IP
].freeze

# request_timeout (seconds) bounds how long a request waits on an unresponsive endpoint.
$elasticsearch_client = Elasticsearch::Client.new(
  hosts: ELASTICSEARCH_HOSTS,
  request_timeout: 5,
  log: false
)

# ... rest of your application logic

With this setup, your application connects to the stable VIP. If the primary HAProxy server fails, Keepalived promotes the backup server, and the VIP moves. If an Elasticsearch node fails, HAProxy’s health checks mark it as down and route requests only to the remaining healthy nodes. The application, which now sees only the VIP, needs no change in either case.

Testing and Monitoring Failover Scenarios

Thorough testing is paramount. Simulate failures to ensure your automated failover mechanisms work as expected.

Test Scenarios

Elasticsearch Node Failure: Stop the Elasticsearch service on one Droplet (e.g., sudo systemctl stop elasticsearch). Monitor HAProxy’s stats page (if enabled) to see the node marked as down. Verify your application can still perform searches.
HAProxy Server Failure: Stop the HAProxy service on one of the HAProxy Droplets (sudo systemctl stop haproxy). Observe the Keepalived logs on both HAProxy servers. The VIP should move to the surviving HAProxy server. Test application connectivity to the VIP.
Keepalived Failure (Simulated): Stop Keepalived on the MASTER HAProxy server. The VIP should transfer to the BACKUP server.
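Network Partition: Simulate network issues between nodes (e.g., using `iptables` to block traffic on port 9300 to one Elasticsearch node). The two nodes that can still see each other hold a majority and should keep serving requests; the isolated node should be unable to elect itself master.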
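To put a number on each scenario, it helps to probe the VIP at a fixed interval during the test and compute the longest stretch of failed probes afterwards. The helper below is a minimal sketch; longest_outage is a hypothetical name, and the sample data is invented for illustration rather than measured.

```ruby
# samples: array of [timestamp_in_seconds, success_boolean] pairs in time order.
# Returns the longest span between the first failed probe of an outage and
# the next successful probe (or the last sample if the outage never ended).
def longest_outage(samples)
  longest = 0.0
  outage_start = nil
  samples.each do |time, ok|
    if ok
      longest = [longest, time - outage_start].max if outage_start
      outage_start = nil
    else
      outage_start ||= time
    end
  end
  longest = [longest, samples.last[0] - outage_start].max if outage_start
  longest
end

# Invented example: probes every second, two failures during a VIP move.
samples = [[0, true], [1, true], [2, false], [3, false], [4, true], [5, true]]
puts "longest outage: #{longest_outage(samples)}s"
```

Collecting the samples can be as simple as timing a ping against the VIP in a loop while you stop services by hand.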

Monitoring Tools

Implement comprehensive monitoring:

DigitalOcean Monitoring: Utilize Droplet-level metrics (CPU, RAM, Network).
HAProxy Stats: Regularly check the HAProxy stats page for backend server status and traffic volume.
Elasticsearch Cluster Health API: Periodically query GET /_cluster/health from your application or a monitoring agent. Alert when the status is red, and investigate a yellow status that persists, since it means some replica shards are unassigned.
Application Health Endpoint: Use external monitoring services to poll your application’s /health endpoint.
Log Aggregation: Centralize logs from Elasticsearch, HAProxy, and your application for easier debugging. Tools like ELK stack (ironically), Graylog, or cloud-native solutions can be used.
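The cluster health check from the list above can be reduced to a small classification function that a monitoring agent calls on the JSON body returned by GET /_cluster/health. This is a sketch: cluster_health_verdict and the expected_nodes threshold are assumptions made for illustration, while status, timed_out and number_of_nodes are fields of the health response.

```ruby
require 'json'

# Classifies a /_cluster/health response body as :ok, :warn or :alert.
def cluster_health_verdict(body, expected_nodes: 3)
  health = JSON.parse(body)
  return :alert if health['status'] == 'red'
  return :alert if health['timed_out']
  return :warn if health['status'] == 'yellow'
  return :warn if health['number_of_nodes'].to_i < expected_nodes

  :ok
end

# Hand-written sample body resembling a cluster that has lost one node.
sample = '{"status":"yellow","timed_out":false,"number_of_nodes":2,"unassigned_shards":4}'
puts cluster_health_verdict(sample)
```

Checking the node count alongside the status catches the case where a node has dropped out but the remaining shards have already been reallocated and the status is back to green.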

By combining Elasticsearch’s inherent clustering, the resilience of the elasticsearch-ruby client, and the robust HA capabilities of HAProxy and Keepalived, you can architect a highly available Elasticsearch deployment on DigitalOcean that automatically handles node and service failures, ensuring minimal downtime for your Ruby applications.