Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Ruby Deployments on DigitalOcean
Elasticsearch Cluster Setup for High Availability on DigitalOcean
Achieving robust disaster recovery for Elasticsearch hinges on a well-architected, multi-node cluster with automatic failover capabilities. We’ll focus on a setup using DigitalOcean Droplets, leveraging Elasticsearch’s built-in quorum and discovery mechanisms. For this example, we’ll assume a three-node cluster, which is the minimum recommended for quorum-based failover (a majority of nodes must be available for the cluster to function).
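The majority rule behind the three-node recommendation can be worked through in a few lines. This is a minimal sketch of the arithmetic only: the function names are illustrative, and Elasticsearch manages its own voting configuration internally.

```ruby
# Majority required among master-eligible nodes.
def quorum(master_eligible)
  master_eligible / 2 + 1
end

# How many master-eligible nodes can be lost while a majority remains.
def tolerated_failures(master_eligible)
  master_eligible - quorum(master_eligible)
end

[1, 2, 3, 5].each do |n|
  puts "#{n} node(s): quorum #{quorum(n)}, tolerates #{tolerated_failures(n)} failure(s)"
end
```

Two nodes tolerate no failures at all, which is why three is the smallest size that survives the loss of one node.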
Each Droplet will run a dedicated Elasticsearch instance. Network configuration is critical: the nodes must reach each other on the transport port (default 9300), and clients must reach them on the HTTP port (default 9200). We’ll use private networking for inter-node communication for security and performance.
Elasticsearch Configuration (`elasticsearch.yml`)
On each Elasticsearch Droplet, the elasticsearch.yml file needs to be configured to enable discovery and specify cluster details. The following configuration is a template; adjust IP addresses and cluster names as per your environment.
Node 1 (Master/Data):
cluster.name: my-production-cluster
node.name: es-node-1
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
discovery.seed_hosts:
  - 10.10.0.1:9300 # Private IP of es-node-1
  - 10.10.0.2:9300 # Private IP of es-node-2
  - 10.10.0.3:9300 # Private IP of es-node-3
cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"
node.roles: [ master, data, ingest ]
# For production, consider these settings:
# xpack.security.enabled: true
# xpack.security.transport.ssl.enabled: true
# xpack.security.http.ssl.enabled: true
Node 2 (Master/Data):
cluster.name: my-production-cluster
node.name: es-node-2
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
discovery.seed_hosts:
  - 10.10.0.1:9300
  - 10.10.0.2:9300
  - 10.10.0.3:9300
cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"
node.roles: [ master, data, ingest ]
# For production, consider these settings:
# xpack.security.enabled: true
# xpack.security.transport.ssl.enabled: true
# xpack.security.http.ssl.enabled: true
Node 3 (Master/Data):
cluster.name: my-production-cluster
node.name: es-node-3
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
discovery.seed_hosts:
  - 10.10.0.1:9300
  - 10.10.0.2:9300
  - 10.10.0.3:9300
cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"
node.roles: [ master, data, ingest ]
# For production, consider these settings:
# xpack.security.enabled: true
# xpack.security.transport.ssl.enabled: true
# xpack.security.http.ssl.enabled: true
Key Configuration Points:
- `cluster.name`: Must be identical across all nodes.
- `node.name`: Unique identifier for each node.
- `network.host`: Set to `0.0.0.0` to bind to all network interfaces, or to the specific private IP if you want to be more restrictive.
- `discovery.seed_hosts`: A list of addresses and ports of the master-eligible nodes in the cluster. Elasticsearch will attempt to connect to these to discover the cluster.
- `cluster.initial_master_nodes`: A list of node names that are eligible to become the master node when the cluster is first formed. This is crucial for bootstrapping, and only for bootstrapping: remove the setting from every node once the cluster has formed.
- `node.roles`: Defines the roles of the node. For a small cluster, combining master and data roles is common. For larger clusters, dedicated master nodes are recommended.
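Since the three files differ only in `node.name`, they can be rendered from one template instead of being maintained by hand. The following is a minimal sketch under that assumption; the constant and function names are illustrative, and the values are the ones used in this article.

```ruby
CLUSTER_NAME = 'my-production-cluster'

# Node names and private IPs from the configuration above.
NODES = {
  'es-node-1' => '10.10.0.1',
  'es-node-2' => '10.10.0.2',
  'es-node-3' => '10.10.0.3'
}.freeze

# Render the elasticsearch.yml text for one node; only node.name varies.
def render_config(node_name)
  raise ArgumentError, "unknown node #{node_name}" unless NODES.key?(node_name)

  seed_hosts = NODES.values.map { |ip| "  - #{ip}:9300" }
  masters    = NODES.keys.map { |name| "  - \"#{name}\"" }

  [
    "cluster.name: #{CLUSTER_NAME}",
    "node.name: #{node_name}",
    'network.host: 0.0.0.0',
    'http.port: 9200',
    'transport.port: 9300',
    'discovery.seed_hosts:',
    *seed_hosts,
    'cluster.initial_master_nodes:',
    *masters,
    'node.roles: [ master, data, ingest ]'
  ].join("\n") + "\n"
end

puts render_config('es-node-2')
```

Writing the result to each Droplet (for example with your provisioning tool of choice) keeps `cluster.name` and the seed host list from drifting apart between nodes.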
Firewall Configuration (UFW Example)
On each Droplet, configure the firewall to allow necessary traffic. Assuming UFW is installed:
# Allow SSH (replace with your actual SSH port if not 22)
sudo ufw allow 22/tcp

# Allow Elasticsearch HTTP API (port 9200) from your application servers' IPs
# Replace 192.168.1.0/24 with your application subnet or specific IPs
sudo ufw allow from 192.168.1.0/24 to any port 9200 proto tcp

# Allow Elasticsearch Transport Layer (port 9300) between Elasticsearch nodes
# Replace 10.10.0.0/16 with your private network subnet
sudo ufw allow from 10.10.0.0/16 to any port 9300 proto tcp

# Enable the firewall
sudo ufw enable
sudo ufw status verbose
Important: For production, you’ll want to restrict access to port 9200 to only your application servers and potentially your management IPs. Port 9300 should only be accessible between your Elasticsearch nodes.
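One way to confirm the rules behave as intended is to attempt TCP connections from the machines that should, and should not, have access. The helper below is a small sketch using only Ruby’s standard library; `port_open?` is a hypothetical name, and the addresses in the usage comment are the ones from this article.

```ruby
require 'socket'

# Returns true when a TCP connection to host:port succeeds within the timeout.
def port_open?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  # Refused, unreachable, unresolvable, or timed out.
  false
end

# Example, run from an application server:
#   port_open?('10.10.0.1', 9200)  # should be allowed by the firewall
#   port_open?('10.10.0.1', 9300)  # should be blocked outside the cluster subnet
```

A blocked port usually shows up as a timeout rather than an immediate refusal, so keep the timeout short when checking several hosts.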
Ruby Application Integration and Health Checks
Your Ruby application needs to interact with Elasticsearch and, critically, be aware of its health. We’ll use the elasticsearch-ruby client and implement a basic health check mechanism.
Elasticsearch Ruby Client Setup
Add the gem to your Gemfile:
gem 'elasticsearch'
Then run bundle install.
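Initialize the client in your application. It’s best practice to configure multiple hosts so the client can spread requests across nodes and, with `retry_on_failure` enabled, retry a failed request on another host. Note that the client only uses the hosts you list; discovering additional nodes requires enabling its `reload_connections` option.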
require 'elasticsearch'
# Elasticsearch nodes, reachable on port 9200 over the private network.
# Only the hosts listed here are used; discovering further nodes requires
# enabling the client's reload_connections option.
ELASTICSEARCH_HOSTS = [
  'http://10.10.0.1:9200',
  'http://10.10.0.2:9200',
  'http://10.10.0.3:9200'
].freeze
# Initialize the client.
# retry_on_failure makes the client retry a failed request on another host.
# The short timeout is crucial for quick detection of unresponsive nodes.
$elasticsearch_client = Elasticsearch::Client.new(
  hosts: ELASTICSEARCH_HOSTS,
  request_timeout: 5, # seconds
  retry_on_failure: true,
  log: false
)
# Example of a simple search query
begin
  response = $elasticsearch_client.search(index: 'my_index', body: { query: { match_all: {} } })
  puts "Search successful: #{response['hits']['total']['value']} hits"
# Class name as in the 7.x client; the 8.x client defines it under Elastic::Transport instead.
rescue Elasticsearch::Transport::Transport::Errors::ServiceUnavailable => e
  puts "Elasticsearch is unavailable: #{e.message}"
rescue => e
  puts "An error occurred: #{e.message}"
end
The elasticsearch-ruby client is designed to be resilient. When initialized with multiple hosts, it rotates requests across them (round-robin by default). With `retry_on_failure` enabled, a request that fails because a host is unresponsive (timeout or network error) is retried on another host. This provides a basic level of application-level failover.
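The idea of retrying on another host can be illustrated without the gem. The sketch below is not the client’s implementation (the real client also tracks failed connections and rotates between hosts); `with_failover` and `fake_request` are hypothetical names used only to show the control flow.

```ruby
# Try each host in turn and return the first successful result.
# `request` is any callable that takes a host URL and raises on failure.
def with_failover(hosts, request)
  errors = {}
  hosts.each do |host|
    begin
      return request.call(host)
    rescue StandardError => e
      errors[host] = e.message # remember why this host was skipped
    end
  end
  raise "all hosts failed: #{errors}"
end

# Stand-in for a real HTTP call: the first node is "down".
fake_request = lambda do |host|
  raise IOError, 'connection refused' if host == 'http://10.10.0.1:9200'

  "response from #{host}"
end

puts with_failover(%w[http://10.10.0.1:9200 http://10.10.0.2:9200], fake_request)
# prints "response from http://10.10.0.2:9200"
```

The cost of this pattern is latency: each dead host adds up to one full timeout to the request, which is why a short timeout matters.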
Implementing Application-Level Health Checks
Your application needs a way to determine if Elasticsearch is healthy. This can be integrated into your application’s monitoring or used by an external load balancer/orchestrator.
A simple health check endpoint in your Ruby on Rails application (e.g., in config/routes.rb and a corresponding controller action):
# config/routes.rb
get '/health', to: 'health#show'
# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def show
    # Use the client's ping method for a lightweight check
    if $elasticsearch_client.ping
      render json: { elasticsearch: 'healthy' }, status: :ok
    else
      # ping returns false when no node can be reached;
      # more severe issues surface as exceptions.
      render json: { elasticsearch: 'unhealthy', error: 'Ping failed' }, status: :service_unavailable
    end
  # Class name as in the 7.x client; the 8.x client defines it under Elastic::Transport instead.
  rescue Elasticsearch::Transport::Transport::Errors::ServiceUnavailable => e
    render json: { elasticsearch: 'unhealthy', error: "Service Unavailable: #{e.message}" }, status: :service_unavailable
  rescue => e
    # Catch any other unexpected errors
    render json: { elasticsearch: 'unhealthy', error: "Unexpected error: #{e.message}" }, status: :internal_server_error
  end
end
This health check endpoint can be polled by external monitoring tools (like DigitalOcean’s monitoring, Prometheus, or a custom script) or by a load balancer. If the endpoint returns a non-2xx status code, the system can take action.
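A custom polling script can be as small as the sketch below, which classifies one probe of the `/health` endpoint by status code. It uses only the standard library; `probe` and `HTTP_STATUS` are hypothetical names, and the URL in the usage comment is a placeholder for your own application address.

```ruby
require 'net/http'
require 'uri'

# Default fetcher: GET the URL and return the numeric HTTP status.
HTTP_STATUS = lambda do |url|
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: 2, read_timeout: 2) do |http|
    http.get(uri.request_uri).code.to_i
  end
end

# Classify one probe: 2xx is healthy, any other status is unhealthy,
# and a connection error or timeout means the app itself is unreachable.
def probe(url, fetch: HTTP_STATUS)
  status = fetch.call(url)
  (200..299).cover?(status) ? :healthy : :unhealthy
rescue StandardError
  :unreachable
end

# Example (placeholder address):
#   probe('http://app.example.internal/health')
```

Keeping the fetcher injectable makes the classification easy to exercise without a running application.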
Automated Failover Orchestration with Keepalived and HAProxy
While Elasticsearch and the Ruby client offer internal resilience, true automated failover for your application’s access to Elasticsearch often requires an external component. We’ll use HAProxy as a load balancer for Elasticsearch and Keepalived for High Availability of the HAProxy instance itself. This setup ensures that your application always has a stable IP address to connect to, even if one HAProxy server fails.
HAProxy Configuration for Elasticsearch
We’ll deploy two HAProxy Droplets. These will sit in front of your Elasticsearch cluster. The application will connect to a Virtual IP (VIP) managed by Keepalived, which will direct traffic to one of the HAProxy instances. Each HAProxy instance will then load balance traffic to the Elasticsearch nodes.
HAProxy Configuration (`haproxy.cfg`) on both HAProxy Droplets:
global
    log /dev/log local0
    log /dev/log local1 notice
    maxconn 4096
    daemon

# Default settings
defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

# Frontend for application traffic to Elasticsearch
frontend elasticsearch_frontend
    bind *:9200
    mode http
    default_backend elasticsearch_backend

# Backend for Elasticsearch nodes
backend elasticsearch_backend
    mode http
    balance roundrobin
    option httpchk GET /_cluster/health?pretty
    http-check expect status 200
    # List your Elasticsearch nodes here. Use private IPs.
    server es1 10.10.0.1:9200 check port 9200 inter 2000 rise 2 fall 3
    server es2 10.10.0.2:9200 check port 9200 inter 2000 rise 2 fall 3
    server es3 10.10.0.3:9200 check port 9200 inter 2000 rise 2 fall 3

# Optional: HAProxy stats page
listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /stats
    stats realm Haproxy\ Statistics
    stats auth admin:YourSecurePassword
Explanation:
- `mode http`: HAProxy operates at the application layer for HTTP traffic.
- `balance roundrobin`: Distributes requests evenly across available servers. Other options like `leastconn` can also be effective.
- `option httpchk`: Configures HAProxy to perform HTTP health checks against the Elasticsearch cluster health endpoint.
- `http-check expect status 200`: The health check expects an HTTP 200 OK status code. Elasticsearch’s cluster health endpoint typically returns 200 even if the cluster is not fully green, as long as it’s responsive. You might refine this to check for specific cluster states if needed, but it adds complexity.
- `server ... check port 9200 inter 2000 rise 2 fall 3`: Defines each Elasticsearch node as a server. `check` enables health monitoring. `inter` is the interval between checks in milliseconds, `rise` is the number of successful checks to consider a server up, and `fall` is the number of failed checks to consider a server down.
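The `rise` and `fall` counters are easier to reason about with a small simulation. The class below is a simplified model of the behaviour described above, not HAProxy’s implementation; `CheckedServer` is a hypothetical name. With `inter 2000` and `fall 3`, a dead node is taken out of rotation after roughly 3 × 2000 ms = 6 seconds.

```ruby
# Simplified model of a health-checked backend server.
class CheckedServer
  attr_reader :up

  def initialize(rise:, fall:)
    @rise = rise
    @fall = fall
    @up = true
    @streak = 0 # consecutive results that contradict the current state
  end

  # Record one health-check result and return the resulting state.
  def record(success)
    if success == @up
      @streak = 0
    else
      @streak += 1
      if @streak >= (@up ? @fall : @rise)
        @up = !@up
        @streak = 0
      end
    end
    @up
  end
end

es1 = CheckedServer.new(rise: 2, fall: 3)
states = [false, false, false, true, true].map { |ok| es1.record(ok) }
p states
# prints [true, true, false, false, true]
```

Lowering `fall` or `inter` shortens detection time at the price of more false positives during brief pauses such as garbage collection.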
Keepalived for HAProxy High Availability
Keepalived provides a Virtual IP (VIP) that floats between the two HAProxy servers. If the primary HAProxy server fails, Keepalived automatically assigns the VIP to the secondary server.
Keepalived Configuration (`keepalived.conf`) on both HAProxy Droplets:
Droplet 1 (MASTER):
vrrp_script chk_haproxy {
    script "/usr/local/bin/check_haproxy.sh"
    interval 2
    weight -60 # On failure, drop below the BACKUP's priority (150 - 60 = 90 < 100)
    fall 2
    rise 2
}
vrrp_instance VI_1 {
    state MASTER
    interface eth0 # Replace with your primary network interface
    virtual_router_id 51
    priority 150 # Higher priority for MASTER
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_password # Only the first 8 characters are used
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 # Your Virtual IP address
    }
    track_script {
        chk_haproxy
    }
}
Droplet 2 (BACKUP):
vrrp_script chk_haproxy {
    script "/usr/local/bin/check_haproxy.sh"
    interval 2
    weight -60 # Must exceed the priority gap (150 - 100) for a failure to trigger failover
    fall 2
    rise 2
}
vrrp_instance VI_1 {
    state BACKUP
    interface eth0 # Replace with your primary network interface
    virtual_router_id 51
    priority 100 # Lower priority for BACKUP
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_password # Only the first 8 characters are used
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 # Your Virtual IP address
    }
    track_script {
        chk_haproxy
    }
}
Health Check Script (`/usr/local/bin/check_haproxy.sh`):
#!/bin/bash
# Exit 0 if an HAProxy process is running, non-zero otherwise.
# This is a deliberately simple check: it does not confirm that HAProxy can
# reach any Elasticsearch node. Backend health is left to HAProxy's own checks.
# A more thorough script could query the ES cluster health through HAProxy itself.
if pgrep haproxy > /dev/null; then
    exit 0
else
    exit 1
fi
Explanation:
- `vrrp_instance VI_1`: Defines a VRRP instance. `VI_1` is an arbitrary name.
- `state MASTER/BACKUP`: Sets the initial state of the node.
- `interface eth0`: The network interface to bind the VIP to. Ensure this is correct for your Droplets.
- `virtual_router_id`: A unique ID for this VRRP group. Must be the same on both nodes.
- `priority`: Determines which node becomes MASTER. Higher value wins.
- `authentication`: Simple password authentication for VRRP packets.
- `virtual_ipaddress`: The IP address that will be managed by Keepalived. Your application will connect to this IP.
- `vrrp_script chk_haproxy`: Defines a script to run for health checking.
- `track_script { chk_haproxy }`: Tells Keepalived to monitor the script’s exit code and adjust the node’s priority by `weight`. A negative weight lowers the priority while the script fails; a positive weight raises it while the script passes. Either way, the adjustment must be larger than the gap between the two nodes’ priorities, or a failing script will never trigger a failover.
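The interaction between `priority` and the script `weight` is worth checking by hand before relying on it. The sketch below models only that arithmetic and ignores preemption settings and tie-breaking; the function names and the node names `haproxy-1` and `haproxy-2` are illustrative.

```ruby
# Effective VRRP priority of a node, given the result of its tracked script.
# Positive weight: added while the script passes.
# Negative weight: added (i.e. subtracted) while the script fails.
def effective_priority(base, weight, script_ok)
  if weight.positive?
    script_ok ? base + weight : base
  else
    script_ok ? base : base + weight
  end
end

# Name of the node with the highest effective priority.
def vip_holder(nodes, weight)
  nodes.max_by { |n| effective_priority(n[:priority], weight, n[:haproxy_up]) }[:name]
end

# HAProxy has just died on the MASTER (priority 150); the BACKUP (100) is fine.
nodes = [
  { name: 'haproxy-1', priority: 150, haproxy_up: false },
  { name: 'haproxy-2', priority: 100, haproxy_up: true }
]

puts vip_holder(nodes, 20)  # 150 vs 120: prints "haproxy-1", no failover
puts vip_holder(nodes, -60) # 90 vs 100: prints "haproxy-2", failover
```

With a priority gap of 50, a weight of 20 leaves the VIP on the node whose HAProxy is down, while a weight of -60 moves it.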
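Important: Ensure the /usr/local/bin/check_haproxy.sh script is executable (`chmod +x /usr/local/bin/check_haproxy.sh`) and that the eth0 interface is correct. You’ll also need to install the keepalived and haproxy packages on both HAProxy Droplets. Note that this configuration relies on multicast VRRP advertisements and on the network accepting a self-assigned address. On DigitalOcean you will likely need Keepalived’s `unicast_src_ip` and `unicast_peer` options, plus a Reserved IP that a `notify_master` script reassigns through the DigitalOcean API, in place of a plain VIP.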
Application Configuration Update
Update your Ruby application’s Elasticsearch client configuration to point to the Virtual IP managed by Keepalived.
require 'elasticsearch'
# Point to the Virtual IP managed by Keepalived
ELASTICSEARCH_HOSTS = [
  'http://192.168.1.100:9200'
  # You can optionally add the direct IPs of the HAProxy servers as fallbacks
  # (add a comma after the VIP entry if you do), but the primary goal is to use the VIP.
  # 'http://10.10.0.10:9200', # HAProxy 1 IP
  # 'http://10.10.0.11:9200'  # HAProxy 2 IP
].freeze
$elasticsearch_client = Elasticsearch::Client.new(
  hosts: ELASTICSEARCH_HOSTS,
  request_timeout: 5, # seconds
  log: false
)
# ... rest of your application logic
With this setup, your application connects to the stable VIP. If the primary HAProxy server fails, Keepalived promotes the backup server, and the VIP moves. If an Elasticsearch node fails, HAProxy’s health checks mark it down and route requests only to the remaining healthy nodes. Because the client now sees a single endpoint, node selection happens in HAProxy rather than in the Ruby client.
Testing and Monitoring Failover Scenarios
Thorough testing is paramount. Simulate failures to ensure your automated failover mechanisms work as expected.
Test Scenarios
- Elasticsearch Node Failure: Stop the Elasticsearch service on one Droplet (e.g., `sudo systemctl stop elasticsearch`). Monitor HAProxy’s stats page (if enabled) to see the node marked as down. Verify your application can still perform searches.
- HAProxy Server Failure: Stop the HAProxy service on one of the HAProxy Droplets (`sudo systemctl stop haproxy`). Observe the Keepalived logs on both HAProxy servers. The VIP should move to the surviving HAProxy server. Test application connectivity to the VIP.
- Keepalived Failure (Simulated): Stop Keepalived on the MASTER HAProxy server. The VIP should transfer to the BACKUP server.
- Network Partition: Simulate network issues between nodes (e.g., using `iptables` to block traffic). Confirm that the side holding a majority of master-eligible nodes keeps serving requests and that the isolated node rejoins once traffic is restored.
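To put a number on each scenario, poll the service while you trigger the failure and compute the longest run of failed probes. The following is a rough sketch; `collect_samples` and `longest_outage` are hypothetical helpers, and the probe is any callable you supply (for example, one that pings Elasticsearch through the VIP).

```ruby
# Call `probe` `count` times, `interval` seconds apart.
# Returns [seconds_since_start, ok] pairs; a raised error counts as a failure.
def collect_samples(probe, count:, interval: 0.5)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  Array.new(count) do
    at = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    ok = begin
      probe.call ? true : false
    rescue StandardError
      false
    end
    sleep interval
    [at, ok]
  end
end

# Longest stretch, in seconds, between the first failed probe of an outage
# and the next successful one (or the last sample if it never recovered).
def longest_outage(samples)
  longest = 0.0
  outage_start = nil
  samples.each do |time, ok|
    if ok
      longest = [longest, time - outage_start].max if outage_start
      outage_start = nil
    else
      outage_start ||= time
    end
  end
  longest = [longest, samples.last[0] - outage_start].max if outage_start
  longest
end

# Worked example with hand-written samples: probes fail from t=1 until t=4.
samples = [[0, true], [1, false], [2, false], [3, false], [4, true], [5, true]]
puts longest_outage(samples) # prints 3
```

The measured value is only as precise as the polling interval, so choose an interval well below the failover time you expect.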
Monitoring Tools
Implement comprehensive monitoring:
- DigitalOcean Monitoring: Utilize Droplet-level metrics (CPU, RAM, Network).
- HAProxy Stats: Regularly check the HAProxy stats page for backend server status and traffic volume.
- Elasticsearch Cluster Health API: Periodically query `GET /_cluster/health` from your application or a monitoring agent. Alert on statuses other than `green` or `yellow`.
- Application Health Endpoint: Use external monitoring services to poll your application’s `/health` endpoint.
- Log Aggregation: Centralize logs from Elasticsearch, HAProxy, and your application for easier debugging. Tools like ELK stack (ironically), Graylog, or cloud-native solutions can be used.
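The alerting rule for the cluster health API can be expressed as a small function over the response body. This is a minimal sketch; `alert?` is a hypothetical name, and the sample body is abbreviated to a few of the fields the endpoint returns.

```ruby
require 'json'

# True when the cluster health response should raise an alert:
# any status other than green or yellow, or a body that cannot be read.
def alert?(body)
  data = JSON.parse(body)
  status = data.is_a?(Hash) ? data['status'] : nil
  !%w[green yellow].include?(status)
rescue JSON::ParserError
  true
end

sample = '{"cluster_name":"my-production-cluster","status":"yellow","number_of_nodes":2}'
puts alert?(sample) # prints false
```

Treating an unreadable response as an alert is a deliberate choice here: a monitoring check that fails silently is worse than a noisy one.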
By combining Elasticsearch’s inherent clustering, the resilience of the elasticsearch-ruby client, and the robust HA capabilities of HAProxy and Keepalived, you can architect a highly available Elasticsearch deployment on DigitalOcean that automatically handles node and service failures, ensuring minimal downtime for your Ruby applications.