Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Ruby Deployments on Linode
Elasticsearch Cluster Setup for High Availability
Achieving automated failover for Elasticsearch hinges on a robust, multi-node cluster configuration. We’ll focus on a setup designed for resilience, leveraging Elasticsearch’s built-in master election and shard replication mechanisms. For this example, we’ll assume a Linode environment with three dedicated Elasticsearch nodes, each running Ubuntu 22.04 LTS.
The core of Elasticsearch’s HA lies in its distributed nature. A minimum of three master-eligible nodes is recommended so that a majority can still elect a master after one node fails, which prevents split-brain scenarios. Each index should keep at least one replica of every primary shard. Elasticsearch never allocates a replica to the same node as its primary, so if one node fails, its data is still accessible from copies on other nodes.
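The three-node recommendation follows from simple majority arithmetic. The sketch below is only an illustration of that arithmetic (the helper names `quorum` and `tolerated_failures` are made up for this example and are not part of Elasticsearch):

```ruby
# Majority quorum for N master-eligible nodes: floor(N / 2) + 1.
def quorum(master_eligible_nodes)
  (master_eligible_nodes / 2) + 1 # Integer division rounds down
end

# How many master-eligible nodes can fail while a majority still remains.
def tolerated_failures(master_eligible_nodes)
  master_eligible_nodes - quorum(master_eligible_nodes)
end

[1, 2, 3, 5].each do |n|
  puts "#{n} nodes: quorum=#{quorum(n)}, tolerated failures=#{tolerated_failures(n)}"
end
```

Note that two nodes tolerate no failures at all, which is why three is the smallest cluster that survives the loss of one master-eligible node.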
Elasticsearch Configuration (`elasticsearch.yml`)
On each Elasticsearch node, the primary configuration file, /etc/elasticsearch/elasticsearch.yml, needs to be tuned for cluster discovery and fault tolerance. Key parameters include:
- cluster.name: A unique name for your Elasticsearch cluster.
- node.name: A unique identifier for each node.
- network.host: The IP address Elasticsearch will bind to. Use 0.0.0.0 for all interfaces or a specific private IP.
- discovery.seed_hosts: A list of addresses of the master-eligible nodes in the cluster.
- cluster.initial_master_nodes: A list of node names that are eligible to become the initial master. This is needed only to bootstrap a brand-new cluster and should be removed from the configuration once the cluster has formed.
- discovery.zen.minimum_master_nodes: For versions before 7.0, this was critical. From Elasticsearch 7.0 onward, the cluster manages its voting quorum automatically and this setting is no longer used. The concept is still vital: a master can only be elected by a majority of (N/2) + 1 nodes, rounded down, where N is the number of master-eligible nodes. For 3 nodes, this is 2.
- xpack.security.enabled: Set to true for production environments to enable authentication and authorization. On a multi-node cluster this also requires TLS on the transport layer (the xpack.security.transport.ssl.* settings), which the sample configurations here omit for brevity.
Here’s a sample elasticsearch.yml for Node 1:
Node 1 Configuration
cluster.name: my-production-cluster
node.name: es-node-1
network.host: 10.10.0.1 # Private IP of Node 1
http.port: 9200
transport.port: 9300
discovery.seed_hosts:
  - 10.10.0.1:9300
  - 10.10.0.2:9300
  - 10.10.0.3:9300
cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"
xpack.security.enabled: true
Node 2 Configuration
cluster.name: my-production-cluster
node.name: es-node-2
network.host: 10.10.0.2 # Private IP of Node 2
http.port: 9200
transport.port: 9300
discovery.seed_hosts:
  - 10.10.0.1:9300
  - 10.10.0.2:9300
  - 10.10.0.3:9300
cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"
xpack.security.enabled: true
Node 3 Configuration
cluster.name: my-production-cluster
node.name: es-node-3
network.host: 10.10.0.3 # Private IP of Node 3
http.port: 9200
transport.port: 9300
discovery.seed_hosts:
  - 10.10.0.1:9300
  - 10.10.0.2:9300
  - 10.10.0.3:9300
cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"
xpack.security.enabled: true
After configuring each node, restart the Elasticsearch service:
sudo systemctl restart elasticsearch
sudo systemctl status elasticsearch
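Verify cluster health by querying any node on the address it is bound to. Because security is enabled, pass a username (curl prompts for the password):
curl -u elastic -X GET "http://10.10.0.1:9200/_cluster/health?pretty"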
You should see a status of green or yellow (if replicas are not yet allocated) and the correct number of nodes reported.
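The same check can be scripted. The following is a minimal sketch, assuming the standard `status` and `number_of_nodes` fields of the `_cluster/health` response. The helper names, the placeholder credentials, and the hand-written sample body are illustrative only:

```ruby
require 'json'
require 'net/http'
require 'uri'

# Fetch /_cluster/health from one node. base_url, user, and password are
# placeholders to be replaced with your own values.
def fetch_cluster_health(base_url, user: nil, password: nil)
  uri = URI.join(base_url, '/_cluster/health')
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(user, password) if user
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https',
                             open_timeout: 5, read_timeout: 5) do |http|
    http.request(request)
  end
  JSON.parse(response.body)
end

# Return a list of problems; an empty list means the cluster looks as expected.
def health_problems(health, expected_nodes:)
  problems = []
  problems << "status is #{health['status']}" unless %w[green yellow].include?(health['status'])
  if health['number_of_nodes'] != expected_nodes
    problems << "expected #{expected_nodes} nodes, found #{health['number_of_nodes']}"
  end
  problems
end

# Example with a hand-written response body (not real cluster output):
sample = JSON.parse('{"status":"yellow","number_of_nodes":2}')
puts health_problems(sample, expected_nodes: 3)
```

Against a live cluster, you would pass the result of `fetch_cluster_health('http://10.10.0.1:9200', user: 'elastic', password: '...')` to `health_problems` instead of the sample hash.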
Ruby Application Deployment and Configuration
Our Ruby application will interact with Elasticsearch for search and data storage. For high availability of the application itself, we’ll run it under the Puma application server, supervised by systemd, and place multiple instances behind a load balancer. Linode’s NodeBalancers are well suited to this.
Puma Configuration for Resilience
Puma is a popular HTTP server for Ruby web applications. To ensure it can handle failures and restarts, we’ll configure it to run as a service using systemd and manage multiple worker processes.
Create a config/puma.rb file in your Ruby application’s root directory:
# config/puma.rb
workers Integer(ENV.fetch("WEB_CONCURRENCY") { 4 })
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS") { 5 })
threads threads_count, threads_count
preload_app!
environment ENV.fetch("RAILS_ENV") { "production" }
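# Listen on all interfaces so the Linode NodeBalancer can reach the app
bind "tcp://0.0.0.0:3000" # Or a specific private IP if preferred
# Run in the foreground and let systemd supervise the process.
# (The `daemonize` option was removed in Puma 5; calling it raises an error.)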
# Logging
stdout_redirect "log/puma.stdout.log", "log/puma.stderr.log", true
# PID file
pidfile "tmp/pids/puma.pid"
# State file
state_path "tmp/pids/puma.state"
# Each forked worker needs its own database connections when using preload_app!
on_worker_boot do
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end
# Release database connections when a worker exits
on_worker_shutdown do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end
# Allow puma to be restarted by `rails restart` command.
plugin :tmp_restart
Next, create a systemd service file for Puma. Assuming your application is located at /var/www/my_ruby_app and you’re running as the user deploy:
# /etc/systemd/system/puma_my_ruby_app.service
[Unit]
Description=Puma HTTP Server for MyRubyApp
After=network.target

[Service]
Type=simple
User=deploy
Group=deploy
WorkingDirectory=/var/www/my_ruby_app
Environment="RAILS_ENV=production"
Environment="WEB_CONCURRENCY=4"
Environment="RAILS_MAX_THREADS=5"
ExecStart=/usr/local/bin/bundle exec puma -C config/puma.rb
ExecStop=/bin/kill -s TERM $MAINPID
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable puma_my_ruby_app.service
sudo systemctl start puma_my_ruby_app.service
sudo systemctl status puma_my_ruby_app.service
Deploy multiple instances of this application on separate Linode instances, for example two application servers (App Server 1 and App Server 2).
Elasticsearch Client Configuration in Ruby
Your Ruby application needs to connect to the Elasticsearch cluster. The elasticsearch gem (developed in the elasticsearch-ruby repository) is the official client. Ensure your client configuration points to your cluster’s HTTP endpoints.
In your application’s initialization (e.g., config/initializers/elasticsearch.rb for Rails):
# config/initializers/elasticsearch.rb
require 'elasticsearch'
# Use private IPs of your Elasticsearch nodes
ES_CLIENT = Elasticsearch::Client.new(
  hosts: [
    'http://10.10.0.1:9200',
    'http://10.10.0.2:9200',
    'http://10.10.0.3:9200'
  ],
  # Retry a failed request on another host, up to 3 times
  retry_on_failure: 3,
  # If xpack.security.enabled is true, configure authentication
  # user: 'elastic', password: 'changeme',
  # transport_options: { ssl: { verify: false } } # Use with caution, prefer proper certs
)
# Optional: Ping the cluster to check connectivity on startup.
# Depending on the client version, ping either returns false or raises
# when the cluster is unreachable, so handle both.
begin
  if ES_CLIENT.ping
    Rails.logger.info "Successfully connected to Elasticsearch cluster."
  else
    Rails.logger.error "Elasticsearch cluster did not respond to ping."
  end
rescue StandardError => e
  Rails.logger.error "Failed to connect to Elasticsearch cluster: #{e.message}"
  # Depending on your app, you might want to exit or handle this more gracefully
end
This configuration provides a list of hosts. The client rotates requests across them, marks unreachable hosts as dead, and, when the retry_on_failure option is enabled, retries a failed request on another host. This provides basic resilience at the client level.
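To make the failover idea concrete, here is a deliberately simplified sketch in plain Ruby. It is not the gem's implementation (the real client also tracks dead connections and revives them later). `with_failover` is a hypothetical helper, and the block stands in for a real request:

```ruby
# Try each host in turn and return the first successful result.
# Raises the last error when every host fails.
def with_failover(hosts)
  last_error = nil
  hosts.each do |host|
    begin
      return yield(host)
    rescue StandardError => e
      last_error = e # remember the failure and move on to the next host
    end
  end
  raise last_error || ArgumentError.new('no hosts given')
end

# Hypothetical stand-in for a real request: the first node is "down".
hosts = %w[10.10.0.1 10.10.0.2 10.10.0.3]
result = with_failover(hosts) do |host|
  raise IOError, "#{host} unreachable" if host == '10.10.0.1'
  "answered by #{host}"
end
puts result
```

Here the request to 10.10.0.1 fails and the second host answers, so the caller never sees the error. Only when all three hosts fail does the exception propagate.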
Load Balancing and Health Checks
To achieve true auto-failover for the application layer, we need a load balancer that can detect unhealthy application instances and route traffic away from them. Linode NodeBalancers are a managed solution that simplifies this.
Linode NodeBalancer Configuration
1. Create a NodeBalancer: In your Linode Cloud Manager, navigate to “NodeBalancers” and create a new one. Choose the same region as your application Linodes, since a NodeBalancer reaches its backends over the region’s private network.
2. Configure Frontend: Set up a listener for your application’s port (e.g., port 80 or 443 for HTTP/S). If using HTTPS, you’ll configure SSL certificates here.
3. Configure Backend Nodes: Add your application server Linodes (App Server 1 and App Server 2) as backend nodes, using their private IP addresses. Specify the port your application is listening on (e.g., 3000).
4. Set Up Health Checks: This is critical for auto-failover. Configure health checks to ensure the NodeBalancer knows when an application instance is unresponsive.
- Protocol: HTTP
- Path: A specific endpoint in your application that indicates health. A common practice is to create a /health or /status endpoint in your Rails app that returns a 200 OK if the app is functioning.
- Check Interval: How often to check (e.g., 10 seconds).
- Response Timeout: How long to wait for a response (e.g., 5 seconds).
- Unhealthy Threshold: Number of consecutive failures before marking a node unhealthy (e.g., 3).
- Healthy Threshold: Number of consecutive successes before marking a node healthy again (e.g., 2).
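These settings determine how long a failed backend can keep receiving traffic. The sketch below is a back-of-the-envelope model, not documented NodeBalancer behavior. It assumes checks start on a fixed interval, that a failing check takes anywhere from zero seconds up to the timeout, and that the hypothetical `detection_window` helper captures the best and worst cases:

```ruby
# Rough bounds, in seconds, on how long a dead backend keeps receiving traffic.
def detection_window(interval:, timeout:, unhealthy_threshold:)
  # Best case: the backend dies just before a check and each check fails instantly.
  best = interval * (unhealthy_threshold - 1)
  # Worst case: it dies just after a passing check and the last check runs to the timeout.
  worst = interval * unhealthy_threshold + timeout
  best..worst
end

window = detection_window(interval: 10, timeout: 5, unhealthy_threshold: 3)
puts "Detection takes roughly #{window.min}-#{window.max} seconds"
```

With the example values from the list (10-second interval, 5-second timeout, threshold of 3), the model gives a window of 20 to 35 seconds. Shorter intervals detect failures faster at the cost of more health-check traffic.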
Example /health endpoint in Rails (app/controllers/health_controller.rb):
class HealthController < ApplicationController
  def show
    # Check Elasticsearch connection as part of health.
    # Depending on the client version, ping returns false or raises
    # when the cluster is unreachable, so treat both as unhealthy.
    es_healthy = begin
      ES_CLIENT.ping
    rescue StandardError
      false
    end
    if es_healthy
      render json: { status: "ok", elasticsearch: "connected" }, status: :ok
    else
      render json: { status: "error", elasticsearch: "disconnected" }, status: :service_unavailable
    end
  end
end
And its route (config/routes.rb):
Rails.application.routes.draw do
  get '/health', to: 'health#show'
  # ... other routes
end
With these health checks configured, the Linode NodeBalancer will automatically stop sending traffic to any application instance that fails the health check, and resume sending traffic once it becomes healthy again.
Automated Failover Orchestration
The components described above form the foundation for automated failover. Let’s summarize how they work together during a failure scenario:
Scenario: Application Server Failure
- Failure Detection: If App Server 1 becomes unresponsive (e.g., Puma process crashes, network issue), the Linode NodeBalancer’s health checks will start failing for that instance.
- Traffic Rerouting: After the unhealthy threshold is met, the NodeBalancer will remove App Server 1 from its rotation. All new incoming traffic will be directed solely to App Server 2.
- Application Resilience: The Ruby application on App Server 2 continues to serve requests. If it was configured with multiple Puma workers, those workers can handle the increased load.
- Recovery: If App Server 1 is restarted and passes its health checks, the NodeBalancer will automatically add it back into the rotation.
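The detection and recovery steps above amount to a small state machine driven by consecutive check results. The following is a minimal simulation of that logic under the example thresholds (3 failures to mark unhealthy, 2 successes to mark healthy). `BackendHealth` is a hypothetical class for illustration, not NodeBalancer code:

```ruby
# Tracks one backend's state from a stream of health-check results.
class BackendHealth
  attr_reader :healthy

  def initialize(unhealthy_threshold: 3, healthy_threshold: 2)
    @unhealthy_threshold = unhealthy_threshold
    @healthy_threshold = healthy_threshold
    @healthy = true
    @streak = 0 # consecutive results that contradict the current state
  end

  # Record one check result and return the (possibly updated) state.
  def record(check_passed)
    if check_passed == @healthy
      @streak = 0
    else
      @streak += 1
      threshold = @healthy ? @unhealthy_threshold : @healthy_threshold
      if @streak >= threshold
        @healthy = !@healthy
        @streak = 0
      end
    end
    @healthy
  end
end

backend = BackendHealth.new
results = [false, false, false, true, true]
states = results.map { |passed| backend.record(passed) }
p states
```

In this run the backend stays in rotation through the first two failures, is removed on the third, and returns after two consecutive successes. A single failed check never removes a backend, which keeps brief hiccups from causing flapping.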
Scenario: Elasticsearch Node Failure
- Master Node Failure: If the elected master fails, the remaining master-eligible nodes hold a new election. As long as a quorum of master-eligible nodes remains ((N/2) + 1, rounded down), a new master will be elected, and the cluster will continue to operate.
- Data Node Failure: If a data node fails, Elasticsearch will detect that some shards are no longer available. It will then promote a replica shard on another node to become the primary for the affected indices. If each index has at least one replica, data availability will be maintained. The cluster will then rebuild the missing replicas on the remaining nodes where possible and rebalance shards when nodes come back online.
- Application Impact: The Elasticsearch Ruby client, configured with multiple hosts, will direct requests to the remaining healthy Elasticsearch nodes. If the cluster remains operational (which it should with 2 of 3 nodes), the application will continue to function. If the cluster becomes unavailable (e.g., only 1 of 3 nodes is left, which is below quorum), the application’s search and indexing operations will fail, and the /health endpoint will reflect this, potentially leading to the NodeBalancer taking the application instances offline if they can’t reach ES.
Monitoring and Alerting
While automated failover handles immediate recovery, robust monitoring and alerting are crucial for proactive management and understanding failure patterns. Implement monitoring for:
- Linode NodeBalancer Health: Linode provides basic metrics. For deeper insights, consider external monitoring services.
- Elasticsearch Cluster Health: Use Elasticsearch’s own APIs (e.g., _cluster/health, _nodes/stats) and tools like Prometheus with the Elasticsearch Exporter, or commercial solutions. Set up alerts for cluster status changes (red/yellow), node failures, and high resource utilization.
- Application Performance: Monitor Puma worker status, request latency, error rates, and resource usage on your application servers. Tools like New Relic, Datadog, or Prometheus with Node Exporter and application-specific exporters are valuable.
- System Metrics: CPU, memory, disk I/O, and network traffic on all Linode instances.
By combining a resilient Elasticsearch cluster, a well-configured Ruby application with a load-balanced deployment, and comprehensive health checks, you can architect a system capable of automatically failing over during component failures, significantly improving your application’s uptime and reliability on Linode.