Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Ruby Deployments on Linode

Elasticsearch Cluster Setup for High Availability

Achieving automated failover for Elasticsearch hinges on a robust, multi-node cluster configuration. We’ll focus on a setup designed for resilience, leveraging Elasticsearch’s built-in master election and shard replication mechanisms. For this example, we’ll assume a Linode environment with three dedicated Elasticsearch nodes, each running Ubuntu 22.04 LTS.

The core of Elasticsearch’s HA lies in its distributed nature. A minimum of three master-eligible nodes is recommended so that a majority can still elect a master after one node is lost, which prevents split-brain scenarios. Each index should also keep at least one replica of every primary shard. Elasticsearch never places a replica on the same node as its primary, so if one node fails, its data is still accessible from copies on other nodes.

Elasticsearch Configuration (`elasticsearch.yml`)

On each Elasticsearch node, the primary configuration file, /etc/elasticsearch/elasticsearch.yml, needs to be tuned for cluster discovery and fault tolerance. Key parameters include:

cluster.name: A unique name for your Elasticsearch cluster.
node.name: A unique identifier for each node.
network.host: The IP address Elasticsearch will bind to. Prefer the node’s private IP; 0.0.0.0 binds to all interfaces, including the public one, and should only be used behind a firewall.
discovery.seed_hosts: A list of IP addresses of other master-eligible nodes in the cluster.
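cluster.initial_master_nodes: A list of node names used to bootstrap a brand-new cluster the first time it forms. Once the cluster has formed, remove this setting from each node; it should not be set on nodes that are restarting or joining an existing cluster.
discovery.zen.minimum_master_nodes: Only relevant to Elasticsearch 6.x and earlier, where it had to be set by hand to a quorum of master-eligible nodes: floor(N/2) + 1, where N is the number of master-eligible nodes. For 3 nodes, this would be 2. From 7.0 onward the cluster maintains its own voting configuration and works out the quorum itself; the setting is ignored in 7.x and no longer accepted in 8.x, so leave it out. The quorum concept is still worth understanding, because it tells you how many master-eligible nodes you can afford to lose.
xpack.security.enabled: Set to true for production environments to enable authentication and authorization. When security is enabled on a multi-node cluster, Elasticsearch also expects TLS on the transport layer (the xpack.security.transport.ssl settings), so plan your certificates before turning it on.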
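
The quorum rule is easy to get wrong for even node counts, so here is a minimal Ruby sketch of the arithmetic. The method names quorum_size and tolerated_failures are made up for this illustration; they are not part of Elasticsearch or its client.

```ruby
# Majority needed among N master-eligible nodes: floor(N / 2) + 1.
# Ruby's Integer#/ already floors, so no explicit rounding is needed.
def quorum_size(master_eligible)
  raise ArgumentError, "need at least one node" if master_eligible < 1

  (master_eligible / 2) + 1
end

# How many master-eligible nodes can be lost while a majority remains.
def tolerated_failures(master_eligible)
  master_eligible - quorum_size(master_eligible)
end

(1..5).each do |n|
  puts "nodes=#{n} quorum=#{quorum_size(n)} tolerated_failures=#{tolerated_failures(n)}"
end
```

By this arithmetic a fourth master-eligible node tolerates no more failures than three do, which is why odd counts are the usual choice.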

Here’s a sample elasticsearch.yml for Node 1:

Node 1 Configuration

cluster.name: my-production-cluster
node.name: es-node-1
network.host: 10.10.0.1 # Private IP of Node 1
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - 10.10.0.1:9300
  - 10.10.0.2:9300
  - 10.10.0.3:9300

cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"

xpack.security.enabled: true

Node 2 Configuration

cluster.name: my-production-cluster
node.name: es-node-2
network.host: 10.10.0.2 # Private IP of Node 2
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - 10.10.0.1:9300
  - 10.10.0.2:9300
  - 10.10.0.3:9300

cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"

xpack.security.enabled: true

Node 3 Configuration

cluster.name: my-production-cluster
node.name: es-node-3
network.host: 10.10.0.3 # Private IP of Node 3
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - 10.10.0.1:9300
  - 10.10.0.2:9300
  - 10.10.0.3:9300

cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"

xpack.security.enabled: true
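
The three files differ only in node.name and network.host, so they can be rendered from a single inventory instead of being edited by hand. The following is a sketch under that assumption; the NODES constant and the render_config helper are hypothetical names, and the values are the ones used in the samples above.

```ruby
require "yaml"

# Hypothetical inventory: node name => private IP (from the sample configs).
NODES = {
  "es-node-1" => "10.10.0.1",
  "es-node-2" => "10.10.0.2",
  "es-node-3" => "10.10.0.3"
}.freeze

# Render the elasticsearch.yml text for one node.
# Only node.name and network.host vary between nodes.
def render_config(name, nodes = NODES, cluster: "my-production-cluster")
  seed_hosts = nodes.values.map { |ip| "  - #{ip}:9300" }.join("\n")
  masters    = nodes.keys.map { |n| "  - \"#{n}\"" }.join("\n")
  <<~CONFIG
    cluster.name: #{cluster}
    node.name: #{name}
    network.host: #{nodes.fetch(name)}
    http.port: 9200
    transport.port: 9300

    discovery.seed_hosts:
    #{seed_hosts}

    cluster.initial_master_nodes:
    #{masters}

    xpack.security.enabled: true
  CONFIG
end

NODES.each_key do |name|
  text = render_config(name)
  YAML.safe_load(text) # sanity check: raises if the text is not valid YAML
  puts "# ---- #{name} ----"
  puts text
end
```

Each rendered section can then be copied to /etc/elasticsearch/elasticsearch.yml on the matching node.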

After configuring each node, restart the Elasticsearch service:

sudo systemctl restart elasticsearch
sudo systemctl status elasticsearch

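Verify cluster health by querying any node. With security enabled, the request also needs credentials (for example -u elastic), and https instead of http if TLS is enabled on the HTTP layer: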

curl -X GET "http://localhost:9200/_cluster/health?pretty"

You should see a status of green or yellow (if replicas are not yet allocated) and the correct number of nodes reported.
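
The same check can be scripted. This is a minimal sketch rather than a monitoring tool: cluster_ok? and fetch_health are hypothetical helper names, and the expected node count of 3 is the assumption used throughout this guide. The status and number_of_nodes fields come from the cluster health response.

```ruby
require "json"
require "net/http"
require "uri"

# Decide whether a _cluster/health response looks acceptable:
# status green or yellow, and every expected node has joined.
def cluster_ok?(health, expected_nodes: 3)
  %w[green yellow].include?(health["status"]) &&
    health["number_of_nodes"] == expected_nodes
end

# Fetch the health document from one node. Defined but not called here,
# because it needs a reachable cluster; pass credentials if security is on.
def fetch_health(base_url, user: nil, password: nil)
  uri = URI.join(base_url, "/_cluster/health")
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(user, password) if user
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  JSON.parse(response.body)
end

# Canned response containing the fields this check reads.
sample = JSON.parse('{"status":"yellow","number_of_nodes":3,"unassigned_shards":2}')
puts cluster_ok?(sample)                           # prints true
puts cluster_ok?(sample.merge("status" => "red"))  # prints false
```

Against a live cluster, the call would look like cluster_ok?(fetch_health("http://localhost:9200")).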

Ruby Application Deployment and Configuration

Our Ruby application will interact with Elasticsearch for search and data storage. For high availability of the application itself, we’ll run it under the Puma application server, supervised by systemd, and place multiple instances behind a load balancer. Linode’s NodeBalancers are a good fit for this.

Puma Configuration for Resilience

Puma is a popular HTTP server for Ruby web applications. To ensure it can handle failures and restarts, we’ll configure it to run as a service using systemd and manage multiple worker processes.

Create a config/puma.rb file in your Ruby application’s root directory:

# config/puma.rb
workers Integer(ENV.fetch("WEB_CONCURRENCY") { 4 })
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS") { 5 })
threads threads_count, threads_count

preload_app!

environment ENV.fetch("RAILS_ENV") { "production" }

# Bind to a specific IP and port for Linode NodeBalancer
bind "tcp://0.0.0.0:3000" # Or a specific private IP if preferred

# Run in the foreground so systemd can supervise the process.
# (The daemonize option was removed in Puma 5, so it is not set here.)

# Logging
stdout_redirect "log/puma.stdout.log", "log/puma.stderr.log", true

# PID file
pidfile "tmp/pids/puma.pid"

# State file
state_path "tmp/pids/puma.state"

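# With preload_app!, forked workers must not share the parent's
# database connections: drop them before forking...
before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

# ...and open a fresh pool in each worker once it boots
on_worker_boot do
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end

# Clean up when workers exit
on_worker_shutdown do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end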

# Allow puma to be restarted by `rails restart` command.
plugin :tmp_restart

Next, create a systemd service file for Puma. Assuming your application is located at /var/www/my_ruby_app and you’re running as the user deploy:

# /etc/systemd/system/puma_my_ruby_app.service
[Unit]
Description=Puma HTTP Server for MyRubyApp
After=network.target

[Service]
Type=simple
User=deploy
Group=deploy
WorkingDirectory=/var/www/my_ruby_app
Environment="RAILS_ENV=production"
Environment="WEB_CONCURRENCY=4"
Environment="RAILS_MAX_THREADS=5"
ExecStart=/usr/local/bin/bundle exec puma -C config/puma.rb
ExecStop=/bin/kill -s TERM $MAINPID
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable puma_my_ruby_app.service
sudo systemctl start puma_my_ruby_app.service
sudo systemctl status puma_my_ruby_app.service

Repeat this setup on each application Linode. This guide assumes two application servers (App Server 1 and App Server 2).

Elasticsearch Client Configuration in Ruby

Your Ruby application needs to connect to the Elasticsearch cluster. The elasticsearch gem (developed in the elasticsearch-ruby repository) is the standard client. Ensure your client configuration points to your cluster’s HTTP endpoints.

In your application’s initialization (e.g., config/initializers/elasticsearch.rb for Rails):

# config/initializers/elasticsearch.rb
require 'elasticsearch'

# Use private IPs of your Elasticsearch nodes
ES_CLIENT = Elasticsearch::Client.new(
  hosts: [
    'http://10.10.0.1:9200',
    'http://10.10.0.2:9200',
    'http://10.10.0.3:9200'
  ],
  # Retry a failed request on another host (retries are off by default)
  retry_on_failure: 2,
  # If xpack.security.enabled is true, configure authentication
  # user: 'elastic', password: 'changeme',
  # For HTTPS, point the client at your CA rather than disabling verification
  # (placeholder path, adjust to your setup)
  # transport_options: { ssl: { ca_file: '/path/to/ca.crt' } }
)

# Optional: Ping the cluster to ensure connectivity on startup.
# ping normally returns false when no node is reachable, so check its result.
begin
  if ES_CLIENT.ping
    Rails.logger.info "Successfully connected to Elasticsearch cluster."
  else
    Rails.logger.error "Elasticsearch cluster did not respond to ping."
  end
rescue StandardError => e
  # Other failures (for example authentication errors) are raised
  Rails.logger.error "Failed to connect to Elasticsearch cluster: #{e.message}"
  # Depending on your app, you might want to exit or handle this more gracefully
end

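This configuration provides a list of hosts. The client rotates requests across them and marks a host as dead when a connection to it fails. Note that retries are off by default: set the retry_on_failure option so that a failed request is retried against another host. Together these provide basic resilience at the client level.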
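
To make "try the next host" concrete, here is a simplified model of the idea in plain Ruby. It is not the gem’s implementation: with_failover, HostDown, and fake_request are hypothetical names, and the request is faked so the sketch runs without a cluster.

```ruby
# Raised by the fake transport below to stand in for a refused connection.
class HostDown < StandardError; end

# Try each host in order and return the first successful result.
# Re-raises the last error if every host fails.
def with_failover(hosts)
  last_error = nil
  hosts.each do |host|
    begin
      return yield(host)
    rescue HostDown => e
      last_error = e
    end
  end
  raise(last_error || ArgumentError.new("no hosts given"))
end

HOSTS = %w[http://10.10.0.1:9200 http://10.10.0.2:9200 http://10.10.0.3:9200].freeze
DOWN  = ["http://10.10.0.1:9200"].freeze # pretend node 1 has failed

# Fake request: fails for hosts listed in DOWN, otherwise reports who answered.
def fake_request(host)
  raise HostDown, "#{host} unreachable" if DOWN.include?(host)

  "answered by #{host}"
end

puts with_failover(HOSTS) { |host| fake_request(host) }
# prints "answered by http://10.10.0.2:9200"
```

The real client additionally remembers which hosts are dead and retries them later, which this sketch leaves out.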

Load Balancing and Health Checks

To achieve true auto-failover for the application layer, we need a load balancer that can detect unhealthy application instances and route traffic away from them. Linode NodeBalancers are a managed solution that simplifies this.

Linode NodeBalancer Configuration

1. Create a NodeBalancer: In your Linode Cloud Manager, navigate to “NodeBalancers” and create a new one. Choose the same region as your application Linodes, because the NodeBalancer reaches its backends over their private IP addresses.

2. Configure Frontend: Set up a listener for your application’s port (e.g., port 80 or 443 for HTTP/S). If using HTTPS, you’ll configure SSL certificates here.

3. Configure Backend Nodes: Add your application server Linodes (App Server 1 and App Server 2) as backend nodes, using each server’s private IP address and the port your application is listening on (e.g., 3000).

4. Set Up Health Checks: This is critical for auto-failover. Configure health checks to ensure the NodeBalancer knows when an application instance is unresponsive.

Protocol: HTTP
Path: A specific endpoint in your application that indicates health. A common practice is to create a /health or /status endpoint in your Rails app that returns a 200 OK if the app is functioning.
Check Interval: How often to check (e.g., 10 seconds).
Response Timeout: How long to wait for a response (e.g., 5 seconds).
Check Attempts: Number of consecutive failed checks before a node is marked unhealthy and taken out of rotation (e.g., 3).
Recovery: A node that starts passing its checks again is returned to rotation automatically. NodeBalancers do not offer a separate healthy threshold setting.
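
The interval, timeout, and failure-count settings together determine how long a dead backend can keep receiving traffic. The sketch below works through that arithmetic for the example values. It assumes checks start on a fixed interval and that a node is removed after a set number of consecutive failures; the detection_window helper is hypothetical and real balancer timing may differ.

```ruby
# Rough bounds, in seconds, on how long a failed backend stays in rotation.
def detection_window(interval:, timeout:, failures:)
  # Best case: the node dies just before a check and every check fails instantly.
  best = (failures - 1) * interval
  # Worst case: the node dies just after a passing check and the final
  # failing check runs all the way to its timeout.
  worst = (failures * interval) + timeout
  { best: best, worst: worst }
end

window = detection_window(interval: 10, timeout: 5, failures: 3)
puts "a failed node may stay in rotation for #{window[:best]}-#{window[:worst]} seconds"
# prints "a failed node may stay in rotation for 20-35 seconds"
```

Shorter intervals and fewer attempts shrink this window, at the cost of more check traffic and a higher chance of removing a node over a brief hiccup.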

Example /health endpoint in Rails (app/controllers/health_controller.rb):

class HealthController < ApplicationController
  def show
    # Check Elasticsearch connection as part of health.
    # ping normally returns false when no node is reachable; other errors are raised.
    es_healthy =
      begin
        ES_CLIENT.ping
      rescue StandardError
        false
      end

    if es_healthy
      render json: { status: "ok", elasticsearch: "connected" }, status: :ok
    else
      render json: { status: "error", elasticsearch: "disconnected" }, status: :service_unavailable
    end
  end
end

And its route (config/routes.rb):

Rails.application.routes.draw do
  get '/health', to: 'health#show'
  # ... other routes
end

With these health checks configured, the Linode NodeBalancer will automatically stop sending traffic to any application instance that fails the health check, and resume sending traffic once it becomes healthy again.

Automated Failover Orchestration

The components described above form the foundation for automated failover. Let’s summarize how they work together during a failure scenario:

Scenario: Application Server Failure

Failure Detection: If App Server 1 becomes unresponsive (e.g., Puma process crashes, network issue), the Linode NodeBalancer’s health checks will start failing for that instance.
Traffic Rerouting: After the unhealthy threshold is met, the NodeBalancer will remove App Server 1 from its rotation. All new incoming traffic will be directed solely to App Server 2.
Application Resilience: The Ruby application on App Server 2 continues to serve requests, but it now carries the full load. Size each server with enough Puma workers and threads to absorb the traffic of a failed peer.
Recovery: If App Server 1 is restarted and passes its health checks, the NodeBalancer will automatically add it back into the rotation.
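
To put a number on the extra load during failover, the sketch below counts concurrent request slots using the WEB_CONCURRENCY and RAILS_MAX_THREADS values from the Puma configuration (4 workers, 5 threads). The request_slots helper is a hypothetical name, and treating one thread as one request slot is a simplification.

```ruby
# Concurrent request slots = servers x Puma workers x threads per worker.
def request_slots(servers:, workers: 4, threads: 5)
  servers * workers * threads
end

before = request_slots(servers: 2)
after  = request_slots(servers: 1)

puts "slots before failover: #{before}, after: #{after}"
# prints "slots before failover: 40, after: 20"
puts "capacity left after losing one server: #{(100.0 * after / before).round}%"
# prints "capacity left after losing one server: 50%"
```

With two servers, normal peak traffic needs to fit within half of the total capacity for a single-server failure to go unnoticed.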

Scenario: Elasticsearch Node Failure

Master Node Failure: If the elected master fails, the remaining master-eligible nodes hold a new election. As long as a quorum (floor(N/2) + 1) of master-eligible nodes remains, a new master will be elected and the cluster will continue to operate. In the 3-node cluster that quorum is 2, so one failure is tolerated.
Data Node Failure: If a data node fails, Elasticsearch will detect that some shard copies are no longer available. For every primary that was lost, it promotes a replica on another node to primary. If each index has at least one replica, data availability is maintained. After a short delay (one minute by default), the cluster rebuilds the missing replicas on the remaining nodes where allocation rules allow, and rebalances when the failed node or a replacement joins.
Application Impact: The elasticsearch-ruby client, configured with multiple hosts, will automatically attempt to connect to the remaining healthy Elasticsearch nodes. If the cluster remains operational (which it should after losing one of 3 nodes), the application will continue to function. If the cluster becomes unavailable (e.g., only 1 node left, so no quorum), the application’s search and indexing operations will fail and the /health endpoint will return 503. Because every application instance depends on the same cluster, the NodeBalancer may then take all of them out of rotation.
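
Replica promotion can be illustrated with a toy model. This sketch is not how Elasticsearch is implemented: the shard layout, initial_layout, and fail_node are all made up for illustration, and the model deliberately skips the step where the cluster rebuilds missing replicas.

```ruby
# Toy model of shard copies: shard id => { primary: node, replicas: [nodes] }.
def initial_layout
  {
    0 => { primary: "es-node-1", replicas: ["es-node-2"] },
    1 => { primary: "es-node-2", replicas: ["es-node-3"] },
    2 => { primary: "es-node-3", replicas: ["es-node-1"] }
  }
end

# Remove a failed node and promote a surviving replica wherever the
# primary was lost. Returns the new layout and the ids of shards that
# have no copy left.
def fail_node(layout, failed)
  lost = []
  updated = layout.each_with_object({}) do |(id, copies), result|
    replicas = copies[:replicas] - [failed]
    primary = copies[:primary]
    primary = replicas.shift if primary == failed # promotion
    if primary.nil?
      lost << id
    else
      result[id] = { primary: primary, replicas: replicas }
    end
  end
  [updated, lost]
end

layout, lost = fail_node(initial_layout, "es-node-1")
layout.each do |id, copies|
  puts "shard #{id}: primary=#{copies[:primary]} replicas=#{copies[:replicas].inspect}"
end
puts "lost shards: #{lost.inspect}"
```

After es-node-1 fails, shard 0 is served by its promoted copy on es-node-2 and no shard is lost. In this model a second failure would lose data, because the replicas are never rebuilt.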

Monitoring and Alerting

While automated failover handles immediate recovery, robust monitoring and alerting are crucial for proactive management and understanding failure patterns. Implement monitoring for:

Linode NodeBalancer Health: Linode provides basic metrics. For deeper insights, consider external monitoring services.
Elasticsearch Cluster Health: Use Elasticsearch’s own APIs (e.g., _cluster/health, _nodes/stats) and tools like Prometheus with the Elasticsearch Exporter, or commercial solutions. Set up alerts for cluster status changes (red/yellow), node failures, and high resource utilization.
Application Performance: Monitor Puma worker status, request latency, error rates, and resource usage on your application servers. Tools like New Relic, Datadog, or Prometheus with Node Exporter and application-specific exporters are valuable.
System Metrics: CPU, memory, disk I/O, and network traffic on all Linode instances.

By combining a resilient Elasticsearch cluster, a well-configured Ruby application with a load-balanced deployment, and comprehensive health checks, you can architect a system capable of automatically failing over during component failures, significantly improving your application’s uptime and reliability on Linode.