Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Ruby Deployments on AWS
Elasticsearch Cluster Health and Node Failures
A robust Elasticsearch deployment hinges on maintaining cluster health and responding rapidly to node failures. Elasticsearch’s distributed nature provides resilience, but automated failover requires careful configuration and external orchestration. We’ll focus on a multi-AZ deployment within AWS, using either a self-managed cluster on EC2 instances or a managed service (Amazon Elasticsearch Service, now Amazon OpenSearch Service).
For self-managed clusters, understanding shard allocation and recovery is paramount. When a node fails, Elasticsearch attempts to reallocate its shards to other available nodes. This process is governed by settings like index.number_of_replicas and cluster.routing.allocation.enable. Ensuring sufficient replica shards and healthy nodes are available is the first line of defense.
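As a rough illustration of the two settings named above, the sketch below builds the request bodies for the index settings API and the cluster settings API. The index name `my_index`, the local unauthenticated node URL, and the helper names are assumptions made for this example, not part of any particular deployment.

```python
import json
import urllib.request


def replica_settings(replicas):
    """Body for PUT /<index>/_settings: replica copies kept per primary shard."""
    return {"index": {"number_of_replicas": replicas}}


def allocation_settings(mode):
    """Body for PUT /_cluster/settings: which kinds of shards may be allocated."""
    allowed = {"all", "primaries", "new_primaries", "none"}
    if mode not in allowed:
        raise ValueError(f"mode must be one of {sorted(allowed)}")
    return {"persistent": {"cluster.routing.allocation.enable": mode}}


def put_json(url, body, timeout=5):
    """Send a JSON body with HTTP PUT and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example usage against an assumed local node (not executed here):
#   base = "http://localhost:9200"
#   put_json(f"{base}/my_index/_settings", replica_settings(1))
#   put_json(f"{base}/_cluster/settings", allocation_settings("all"))
```

With one replica per primary, and with primaries and replicas placed on different nodes, the cluster can lose a single node without losing data. Setting allocation to `all` ensures that shards from a failed node are allowed to move elsewhere.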
Automating Elasticsearch Failover with AWS Services
For self-managed Elasticsearch on EC2, we can architect an automated failover mechanism using a combination of AWS services. This typically involves:
- Health Checks: Regularly monitoring Elasticsearch cluster health.
- Node Replacement: Automatically replacing unhealthy or unresponsive EC2 instances.
- Cluster Re-configuration: Ensuring the cluster rebalances and recovers after node replacement.
A common pattern is to use an Auto Scaling Group (ASG) for the Elasticsearch nodes. The ASG can be configured with custom health checks that go beyond basic EC2 instance status. We can leverage CloudWatch Alarms to trigger ASG actions.
Custom Elasticsearch Health Check Script
We’ll create a script that runs on each Elasticsearch node and reports health to CloudWatch. On every run, the script queries the Elasticsearch `_cluster/health` API and publishes a custom metric: 1 for ‘green’, 0.5 for ‘yellow’, and 0 for ‘red’ or when the node does not respond. Deciding whether an unhealthy state has lasted long enough to act on is left to the CloudWatch Alarm configured in the next section.
Here’s a Python script that can be deployed as a cron job on each Elasticsearch node:
import os
import sys

import boto3
import requests

# Configuration
ES_HOST = os.environ.get("ES_HOST", "localhost")
ES_PORT = int(os.environ.get("ES_PORT", 9200))
CLUSTER_NAME = os.environ.get("CLUSTER_NAME", "my-es-cluster")
REGION = os.environ.get("AWS_REGION", "us-east-1")
NAMESPACE = "ElasticsearchHealth"
METRIC_NAME = "ClusterStatus"
METADATA_BASE = "http://169.254.169.254/latest"


def get_instance_id():
    """Return the instance ID from the environment or the EC2 metadata service."""
    instance_id = os.environ.get("EC2_INSTANCE_ID")
    if instance_id:
        return instance_id
    try:
        # IMDSv2: fetch a session token first. Instances that enforce IMDSv2
        # reject metadata requests that do not carry one.
        token = requests.put(
            f"{METADATA_BASE}/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
            timeout=1,
        )
        token.raise_for_status()
        response = requests.get(
            f"{METADATA_BASE}/meta-data/instance-id",
            headers={"X-aws-ec2-metadata-token": token.text},
            timeout=1,
        )
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        # The instance ID is a required metric dimension, so give up without it
        print(f"Error fetching EC2 instance ID: {e}")
        sys.exit(1)


INSTANCE_ID = get_instance_id()
cloudwatch = boto3.client('cloudwatch', region_name=REGION)


def get_es_health():
    """Return the parsed _cluster/health response, or None if the node is unreachable."""
    try:
        response = requests.get(f"http://{ES_HOST}:{ES_PORT}/_cluster/health", timeout=5)
        response.raise_for_status()  # Raise an exception for bad status codes
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error connecting to Elasticsearch: {e}")
        return None


def publish_metric(status_value):
    """Publish the health value to CloudWatch, tagged with cluster and instance."""
    try:
        cloudwatch.put_metric_data(
            Namespace=NAMESPACE,
            MetricData=[
                {
                    'MetricName': METRIC_NAME,
                    'Dimensions': [
                        {'Name': 'ClusterName', 'Value': CLUSTER_NAME},
                        {'Name': 'InstanceId', 'Value': INSTANCE_ID},
                    ],
                    'Value': status_value,
                    'Unit': 'None',  # The value is a status code, not a count
                },
            ]
        )
        print(f"Published metric: {METRIC_NAME}={status_value} for instance {INSTANCE_ID}")
    except Exception as e:
        print(f"Error publishing metric to CloudWatch: {e}")


if __name__ == "__main__":
    health_data = get_es_health()
    if health_data:
        status = health_data.get('status')
        if status == 'green':
            publish_metric(1)  # Healthy
        elif status == 'yellow':
            publish_metric(0.5)  # Degraded but functional
        else:  # red or other unexpected status
            publish_metric(0)  # Unhealthy
    else:
        publish_metric(0)  # Unhealthy if connection fails
    # Optional: Publish node-specific metrics like number of shards
    # This requires more complex parsing of _nodes/stats API
This script should be scheduled to run every minute via cron. It uses f-strings, so it needs Python 3:
# In crontab -e
* * * * * /usr/bin/python3 /path/to/your/es_health_checker.py >> /var/log/es_health_checker.log 2>&1
CloudWatch Alarm and Auto Scaling Group Configuration
Next, we configure a CloudWatch Alarm to monitor the custom metric published by our script. This alarm will trigger an action when the metric indicates an unhealthy state.
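- Metric: ElasticsearchHealth/ClusterStatus
- Statistic: Minimum
- Period: 5 Minutes (adjust based on tolerance)
- Threshold: Less than 1 (meaning ‘yellow’ or ‘red’ status, or connection failure). A cluster is expected to turn yellow whenever it loses a node, so for an alarm that terminates instances, a threshold of less than 0.5 (red or unreachable only) is the safer choice.
- Datapoints to Alarm: 2 out of 3 (any 2 of the 3 most recent periods must be unhealthy; they need not be consecutive)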
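The alarm settings listed above could be expressed with boto3 roughly as follows. This is a sketch under assumptions: the alarm naming scheme and the helper names are hypothetical, and `TreatMissingData` is set to `breaching` as an addition to the list, on the reasoning that a node that has died cannot publish any datapoints at all. Because the script publishes the metric with both `ClusterName` and `InstanceId` dimensions, one alarm per instance is assumed.

```python
def build_alarm_kwargs(cluster_name, instance_id, action_arns):
    """Keyword arguments for CloudWatch put_metric_alarm, mirroring the list above."""
    return {
        "AlarmName": f"es-unhealthy-{cluster_name}-{instance_id}",  # hypothetical naming scheme
        "Namespace": "ElasticsearchHealth",
        "MetricName": "ClusterStatus",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster_name},
            {"Name": "InstanceId", "Value": instance_id},
        ],
        "Statistic": "Minimum",
        "Period": 300,                 # 5 minutes, in seconds
        "EvaluationPeriods": 3,        # the "3" in "2 out of 3"
        "DatapointsToAlarm": 2,        # the "2" in "2 out of 3"
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # assumption: silence counts as unhealthy
        "AlarmActions": list(action_arns),
    }


def create_alarm(cloudwatch, cluster_name, instance_id, action_arns):
    """Create or update the alarm using a boto3 CloudWatch client passed in by the caller."""
    kwargs = build_alarm_kwargs(cluster_name, instance_id, action_arns)
    cloudwatch.put_metric_alarm(**kwargs)
    return kwargs["AlarmName"]


# Example usage (not executed here; the SNS topic ARN is a placeholder):
#   import boto3
#   create_alarm(boto3.client("cloudwatch"), "my-es-cluster", "i-0123456789abcdef0",
#                ["arn:aws:sns:us-east-1:123456789012:es-health-alarms"])
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.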
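The goal of the alarm action is to get the unhealthy instance terminated so that the ASG replaces it. An ASG has no setting that binds its health check directly to a CloudWatch Alarm: it supports EC2 and ELB health checks natively and accepts custom health reports through its SetInstanceHealth API. A common arrangement is therefore for the alarm to notify an SNS topic, and for a subscriber (for example a Lambda function) to mark the instance named in the alarm’s InstanceId dimension as Unhealthy. The ASG then terminates and replaces that instance as it would for any failed health check.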
When an instance is terminated by the ASG, the ASG automatically launches a new instance to maintain the desired capacity. This new instance will start up, join the cluster (assuming discovery is configured correctly), and Elasticsearch will begin rebalancing shards onto it.
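One way to connect the alarm to the ASG’s health status is a small Lambda function subscribed to the alarm’s SNS topic, which reports the affected instance as unhealthy through the Auto Scaling `set_instance_health` call. The sketch below assumes the alarm notification arrives as JSON in the SNS message with the alarm’s dimensions under `Trigger.Dimensions` (keys `name` and `value`); verify that shape against a real notification before relying on it. The function names are hypothetical.

```python
import json


def instance_ids_from_alarm(sns_event):
    """Collect InstanceId dimension values from alarm notifications in ALARM state."""
    ids = []
    for record in sns_event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue  # ignore OK and INSUFFICIENT_DATA transitions
        for dim in message.get("Trigger", {}).get("Dimensions", []):
            if dim.get("name") == "InstanceId":
                ids.append(dim["value"])
    return ids


def mark_unhealthy(autoscaling, instance_ids):
    """Report each instance as Unhealthy so the ASG terminates and replaces it."""
    for instance_id in instance_ids:
        autoscaling.set_instance_health(
            InstanceId=instance_id,
            HealthStatus="Unhealthy",
            ShouldRespectGracePeriod=True,  # do not kill nodes that are still booting
        )
    return instance_ids


def handler(event, context):
    """Lambda entry point; boto3 is imported lazily so the helpers run without AWS."""
    import boto3

    return mark_unhealthy(boto3.client("autoscaling"), instance_ids_from_alarm(event))
```

Respecting the health check grace period matters here: a replacement node reports an unhealthy or unreachable cluster until Elasticsearch has started, and without the grace period it could be terminated before it ever joins.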
Ruby Application Resilience and Failover
For the Ruby application layer, resilience against Elasticsearch unavailability is key. This involves:
- Connection Pooling: Efficiently managing connections to Elasticsearch.
- Retries and Backoff: Gracefully handling temporary connection issues.
- Circuit Breakers: Preventing cascading failures.
- Graceful Degradation: Allowing the application to function partially if Elasticsearch is down.
We’ll use the official Elasticsearch Ruby client, which provides some of these features out-of-the-box. For more advanced patterns, we might implement custom middleware or use gems like circuitbox.
Configuring the Elasticsearch Ruby Client
The elasticsearch-ruby gem allows for configuring multiple hosts and retry strategies. When one host becomes unavailable, the client can automatically failover to another.
require 'elasticsearch'

# Assuming you have multiple Elasticsearch nodes or a load balancer
# If using a load balancer (e.g., ALB in front of ES nodes), list the LB's DNS name.
# If self-managed with multiple nodes, list their IPs/hostnames.
# For AWS OpenSearch Service, use the provided endpoint.

# Example with multiple nodes (replace with your actual endpoints)
es_hosts = [
  'http://es-node-1.example.com:9200',
  'http://es-node-2.example.com:9200',
  'http://es-node-3.example.com:9200'
]

# Or if using a load balancer/AWS OpenSearch Service endpoint:
# es_hosts = ['https://your-es-endpoint.region.es.amazonaws.com']

client = Elasticsearch::Client.new(
  hosts: es_hosts,
  retry_on_failure: 5,      # Retry a failed request up to 5 times, on a different host
  reload_connections: true, # Periodically re-discover cluster nodes; omit behind a load balancer
  # Optional: Add adapter for specific HTTP client features
  # adapter: :net_http,
  # transport_options: {
  #   request: { timeout: 60 } # Global request timeout
  # }
)

# Example of a search operation with potential failure.
# Note: with version 8.x of the gem, the error classes below live under
# Elastic::Transport::Transport::Errors instead.
begin
  response = client.search index: 'my_index', body: { query: { match: { title: 'ruby' } } }
  puts "Search results: #{response['hits']['total']['value']} hits"
rescue Elasticsearch::Transport::Transport::Errors::ServiceUnavailable => e
  # HTTP 503 returned by the cluster
  puts "Elasticsearch is unavailable: #{e.message}"
  # Implement fallback logic here:
  # - Log the error
  # - Return cached data if available
  # - Display a user-friendly error message
  # - Trigger alerts
rescue Elasticsearch::Transport::Transport::Errors::RequestTimeout => e
  # HTTP 408 returned by the cluster
  puts "Elasticsearch request timed out: #{e.message}"
  # Implement fallback logic
rescue StandardError => e
  # Includes connection-level errors raised by the HTTP adapter once retries are exhausted
  puts "An unexpected error occurred: #{e.message}"
  # Implement fallback logic
end
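The retry_on_failure option is crucial: when a request fails because a host cannot be reached, the client retries it on another host from the list, up to the configured number of times. reload_connections: true does something different from what its name suggests. Rather than repairing individual connections, it makes the client periodically re-fetch the list of nodes from the cluster, so that replacement nodes launched by the ASG are picked up. Leave it off when connecting through a load balancer or a managed service endpoint, where the node addresses the cluster reports are not meant to be contacted directly.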
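To make the retry idea concrete, here is a small sketch of rotating through hosts on connection errors. It is written in Python, like the health check script earlier, and it illustrates the general pattern only: it is not the Ruby client’s actual implementation, and the function and exception names are made up for the example.

```python
import itertools


class AllHostsFailed(Exception):
    """Raised when the initial attempt and every retry have failed."""


def request_with_failover(hosts, send, retries=5):
    """Call send(host), moving to the next host whenever a connection error occurs.

    Makes one initial attempt plus up to `retries` further attempts.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    rotation = itertools.cycle(hosts)
    last_error = None
    for _ in range(retries + 1):
        host = next(rotation)
        try:
            return send(host)
        except (ConnectionError, TimeoutError) as error:
            last_error = error  # remember the failure and try the next host
    raise AllHostsFailed(f"{retries + 1} attempts failed") from last_error
```

Only connection-level errors trigger a retry in this sketch. An error response from a reachable node, such as a malformed query, would be returned to the caller immediately, since another host would reject it in the same way.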
Implementing Circuit Breakers and Graceful Degradation
For more sophisticated failure handling, especially when Elasticsearch is experiencing prolonged issues, a circuit breaker pattern is advisable. This prevents the application from continuously hammering a failing service.
We can implement a simple circuit breaker using a shared cache (like Redis) or a gem. The idea is to track failures. If failures exceed a threshold within a time window, the circuit “opens,” and subsequent requests are immediately rejected or handled by a fallback mechanism without attempting to contact Elasticsearch.
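The state machine behind a circuit breaker is small enough to sketch directly. The version below is a simplified, in-process illustration in Python with hypothetical class names: it counts consecutive failures rather than failures within a sliding time window, and it keeps state in memory instead of a shared store such as Redis.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open")
            # Recovery timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The clock is injectable so the open, half-open, and closed transitions can be tested without waiting for real time to pass.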
# Example using a hypothetical circuit breaker interface. The constructor options,
# success!/failure! calls, and RedisStore below are illustrative only; adapt them
# to the actual API of the gem or custom implementation you use.
require 'circuitbox'
require 'redis' # Assuming Redis is used for state persistence

# Initialize Redis client
redis_client = Redis.new(url: ENV['REDIS_URL'])

# Configure the circuit breaker
# State is stored in Redis to be shared across application instances.
# A constant is used because local variables are not visible inside method bodies.
ES_CIRCUIT = Circuitbox.new(
  name: 'elasticsearch_circuit',
  failure_threshold: 5, # Open the circuit after 5 consecutive failures
  recovery_timeout: 60, # Try to close the circuit after 60 seconds
  store: Circuitbox::RedisStore.new(redis_client)
)

def perform_es_search(client, query)
  # Wrap the Elasticsearch call in the circuit breaker
  ES_CIRCUIT.run do
    begin
      response = client.search index: 'my_index', body: query
      # If successful, the circuit is considered healthy
      ES_CIRCUIT.success!
      return response
    rescue StandardError => e
      # Covers ServiceUnavailable, RequestTimeout, and connection-level errors.
      # If an error occurs, the circuit records the failure
      ES_CIRCUIT.failure!
      raise e # Re-raise the exception to be handled by the caller
    end
  end
rescue Circuitbox::OpenCircuitError
  puts "Elasticsearch circuit is open. Falling back..."
  # Implement fallback logic here:
  # - Return cached data
  # - Return default data
  # - Return an empty result set
  return { 'hits' => { 'total' => { 'value' => 0 }, 'hits' => [] } }
end

# Usage:
# search_query = { query: { match: { title: 'ruby' } } }
# results = perform_es_search(client, search_query)
# puts "Results: #{results['hits']['total']['value']}"
Graceful degradation means that if Elasticsearch is unavailable, the application should still serve *something*. This could be:
- Serving stale data from a cache (e.g., Redis, Memcached).
- Returning a simplified view or a “data unavailable” message.
- Prioritizing critical read operations over search if possible.
The logic for this fallback should be implemented within the rescue blocks of your Elasticsearch calls or within the circuit breaker’s fallback mechanism.
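The fallback order described above (live results, then stale cached results, then an empty response) can be sketched in a few lines. This Python illustration uses a plain dict as a stand-in for Redis or Memcached, and the function name and the `source` labels are assumptions made for the example.

```python
EMPTY_RESULT = {"hits": {"total": {"value": 0}, "hits": []}}


def search_with_fallback(search, cache, key):
    """Return (results, source), where source is 'live', 'stale', or 'unavailable'."""
    try:
        results = search()
    except Exception:
        if key in cache:
            return cache[key], "stale"      # serve the last known good results
        return EMPTY_RESULT, "unavailable"  # nothing cached: degrade to an empty set
    cache[key] = results                    # refresh the cache on every success
    return results, "live"
```

Returning the source alongside the results lets the caller decide how to present them, for instance by showing a notice when the data is stale.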
Deployment and Monitoring Considerations
Automated failover is only effective if it’s thoroughly tested and continuously monitored. Key considerations include:
- Testing: Regularly simulate node failures (e.g., by stopping an Elasticsearch EC2 instance) to verify that the failover process works as expected and that the application handles the temporary unavailability correctly.
- Monitoring: Beyond the custom CloudWatch metric, monitor key Elasticsearch performance indicators (latency, query throughput, JVM heap usage, disk I/O) and application-level metrics (error rates, response times).
- Alerting: Set up alerts not just for failures, but also for warning conditions (e.g., cluster status turning yellow, high latency) to proactively address issues before they trigger a full failover.
- Infrastructure as Code (IaC): Manage your ASG, CloudWatch Alarms, and application configurations using tools like Terraform or CloudFormation to ensure consistency and repeatability.
- Amazon OpenSearch Service: If using AWS’s managed service, much of the underlying infrastructure management and node replacement is handled for you. Focus shifts to configuring the service’s replication and multi-AZ deployment, and to ensuring your application client is resilient against endpoint unavailability. The custom health check script would then run outside the cluster and monitor the OpenSearch Service endpoint, since the managed nodes themselves are not accessible.
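For the testing point above, a failover drill is easier to evaluate when recovery time is measured rather than eyeballed. The sketch below polls a status function until the cluster reaches a target status and reports how long that took. The function name and parameters are hypothetical, and `fetch_status` is assumed to return the `status` field of `_cluster/health`, or `None` when the cluster cannot be reached.

```python
import time

STATUS_RANK = {"red": 0, "yellow": 1, "green": 2}


def wait_for_status(fetch_status, wanted="green", timeout=600.0, interval=10.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until the cluster is at least `wanted`.

    Returns the seconds elapsed, or None if `timeout` passes first.
    """
    start = clock()
    while True:
        status = fetch_status()
        # Unknown or missing statuses rank below red and never satisfy the wait.
        if STATUS_RANK.get(status, -1) >= STATUS_RANK[wanted]:
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

A drill might stop one node, call `wait_for_status` with a function that queries the cluster, and record the result as the observed recovery time. Elasticsearch can also block on the server side via the `wait_for_status` parameter of the `_cluster/health` API, which suits shorter waits.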
By combining robust Elasticsearch cluster management with resilient application design, you can build a highly available system on AWS that can withstand individual component failures.