Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Ruby Deployments on OVH
Elasticsearch Cluster Architecture for High Availability on OVH
Achieving robust disaster recovery for Elasticsearch, especially in a multi-region setup on OVH, necessitates a well-defined cluster architecture. We’ll focus on a primary-replica model with cross-region replication, leveraging OVH’s infrastructure for resilience. This involves configuring Elasticsearch’s built-in features and supplementing them with external tooling for automated failover.
Our strategy involves deploying an Elasticsearch cluster in a primary region (e.g., GRA) and configuring cross-cluster replication (CCR) to a secondary region (e.g., BHS). This ensures data redundancy and allows for a swift switchover in case of a primary region outage.
Elasticsearch Configuration for CCR
The core of cross-region replication lies in configuring the Elasticsearch cluster to act as a leader and follower. This is typically done via the Elasticsearch API. We’ll assume you have two independent Elasticsearch clusters running in different OVH regions.
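First, ensure your Elasticsearch instances are accessible from each other. This might involve configuring security groups or firewall rules within your OVH cloud project. Remote cluster connections use the transport port (9300 by default), so that port must be open between the clusters; port 9200 (or your configured HTTP port) only needs to be reachable from the hosts that issue API calls.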
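Before configuring replication, it can help to confirm that each node address is reachable from the other region. The following is a minimal sketch using only Ruby's standard library; `port_open?` is a hypothetical helper name, and a successful connection only shows that the port accepts TCP connections, not that Elasticsearch is healthy behind it.

```ruby
require "socket"

# Returns true if a TCP connection to host:port succeeds within the timeout.
# Hypothetical helper for illustration; it proves reachability only.
def port_open?(host, port, timeout: 3)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  # Refused, unreachable, unresolvable, or timed out.
  false
end
```

Calling `port_open?("192.168.2.10", 9300)` from a node in the other region, for each address you plan to list as a seed, gives a quick signal on whether the firewall rules are in place.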
CCR is pull-based: the follower cluster connects to the leader, reads the operations it has not yet seen, and replays them locally. The connection to the leader and the choice of indices to replicate are therefore both configured on the follower, not on the leader.
Leader Cluster Configuration (Primary Region)
For one-way replication, the leader mainly has to be reachable from the follower. Registering the follower as a remote cluster on the leader is optional, but it prepares the leader to follow the secondary region later, during failback. This is done in the elasticsearch.yml configuration file on each node of the leader cluster, or dynamically via the cluster settings API.
cluster.remote.bhs_cluster:
  seeds:
    - 192.168.1.10:9300
    - 192.168.1.11:9300
  skip_unavailable: false
Follower Cluster Configuration (Secondary Region)
On the follower cluster, register the leader cluster as a remote cluster. The follow request below refers to it by the alias gra_cluster.
cluster.remote.gra_cluster:
  seeds:
    - 192.168.2.10:9300
    - 192.168.2.11:9300
  skip_unavailable: false
Then, still on the follower cluster, create a follower index for each index you want to replicate. For example, to replicate an index named my_application_logs under the same name:
PUT /my_application_logs/_ccr/follow
{
  "remote_cluster": "gra_cluster",
  "leader_index": "my_application_logs"
}
This API call initiates the replication process. The follower index is read-only, and Elasticsearch keeps it synchronized with the leader index from then on.
You can monitor the replication status using the following API call on the follower cluster:
GET /my_application_logs/_ccr/stats
The response reports per-shard checkpoints for the leader and the follower, which reveal the replication lag and the overall health of the CCR process.
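To turn the stats response into a number you can alert on, one option is to compare the per-shard checkpoints. The sketch below assumes the response body of the `_ccr/stats` call above has already been fetched as a JSON string; the function names and the `max_lag` threshold are illustrative choices, not part of Elasticsearch.

```ruby
require "json"

# Computes replication lag per follower index from a _ccr/stats response body.
# Lag is counted in operations: leader_global_checkpoint minus
# follower_global_checkpoint, taking the worst shard of each index.
def ccr_lag_by_index(stats_json)
  stats = JSON.parse(stats_json)
  stats.fetch("indices", []).to_h do |index_stats|
    shard_lags = index_stats.fetch("shards", []).map do |shard|
      shard.fetch("leader_global_checkpoint") - shard.fetch("follower_global_checkpoint")
    end
    [index_stats.fetch("index"), shard_lags.max || 0]
  end
end

# True when at least one index is reported and all are within max_lag operations.
# max_lag is an assumed threshold; tune it to your write rate.
def replication_caught_up?(stats_json, max_lag: 0)
  lags = ccr_lag_by_index(stats_json)
  !lags.empty? && lags.values.all? { |lag| lag <= max_lag }
end
```

A check like `replication_caught_up?` is also useful later, when deciding whether it is safe to switch traffic between regions.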
Automated Failover for Ruby Applications on OVH
Automating failover for your Ruby applications involves a multi-pronged approach: detecting failures, orchestrating the switch, and ensuring application connectivity to the correct Elasticsearch instance.
Health Check and Failure Detection
We need a mechanism to continuously monitor the health of the primary Elasticsearch cluster and the Ruby application instances. This can be achieved using a combination of:
- Elasticsearch Cluster Health API: Regularly query the _cluster/health endpoint. A status of red or yellow, or a significant increase in unassigned_shards, indicates a problem. Yellow means some replica shards are unassigned, so the cluster still serves reads and writes; red means at least one primary shard is unavailable.
- Application-Level Health Checks: Implement endpoints in your Ruby application that check connectivity to Elasticsearch and perform a basic query.
- External Monitoring Tools: Services like Prometheus with Alertmanager, or even custom scripts, can poll these health endpoints and trigger alerts.
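An application-level check can interpret the cluster health response rather than just test connectivity. Here is a minimal sketch that classifies a `_cluster/health` response body; the three-level result and the `max_unassigned` tolerance are assumptions made for illustration.

```ruby
require "json"

# Classifies the body of GET /_cluster/health.
# Returns :healthy, :degraded, or :unhealthy.
# max_unassigned is an assumed tolerance, not an Elasticsearch default.
def classify_cluster_health(health_json, max_unassigned: 0)
  health = JSON.parse(health_json)
  return :unhealthy unless health.is_a?(Hash)

  status = health["status"]
  unassigned = health.fetch("unassigned_shards", 0)

  return :unhealthy if status == "red" || status.nil?
  return :degraded if status == "yellow" || unassigned > max_unassigned
  return :healthy if status == "green"

  :unhealthy # unknown status values are treated as failures
rescue JSON::ParserError
  :unhealthy # an empty or unparseable response is treated as a failure
end
```

Separating `:degraded` from `:unhealthy` lets you alert on a yellow cluster without triggering a cross-region failover for it.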
For this example, let’s consider a simple Bash script that periodically checks the Elasticsearch health and triggers a failover process if issues are detected.
#!/bin/bash
PRIMARY_ES_URL="http://primary-es.example.com:9200"
SECONDARY_ES_URL="http://secondary-es.example.com:9200"
FAILOVER_TRIGGER_FILE="/tmp/elasticsearch_failover_in_progress"
FAILOVER_ACTIVE_FILE="/tmp/elasticsearch_failover_active"
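check_es_health() {
local es_url=$1
local status
# With ?pretty the status line reads: "status" : "green",
# so split on double quotes and take the fourth field.
# --max-time stops an unreachable cluster from hanging the loop.
status=$(curl -s --max-time 10 "${es_url}/_cluster/health?pretty" | grep '"status"' | awk -F'"' '{print $4}')
if [[ "$status" == "green" || "$status" == "yellow" ]]; then
return 0 # Healthy
else
return 1 # Unhealthy (red, empty, or no response)
fi
}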
trigger_failover() {
echo "$(date): Primary Elasticsearch unhealthy. Initiating failover..."
# Mark failover as in progress to prevent multiple triggers
touch "$FAILOVER_TRIGGER_FILE"
# Update application configurations to point to secondary ES
echo "$(date): Updating Ruby application configurations..."
# This is a placeholder. Actual implementation depends on your deployment strategy.
# For example, you might update environment variables, configuration files, or use a service discovery mechanism.
update_ruby_app_config "$SECONDARY_ES_URL"
# Optionally, promote the secondary cluster to primary if using active-passive with manual promotion
# promote_secondary_es_cluster
# Mark failover as active
touch "$FAILOVER_ACTIVE_FILE"
echo "$(date): Failover complete. Applications now using secondary Elasticsearch."
}
revert_failover() {
echo "$(date): Primary Elasticsearch healthy again. Reverting failover..."
rm -f "$FAILOVER_TRIGGER_FILE" "$FAILOVER_ACTIVE_FILE"
# Update application configurations back to primary ES
echo "$(date): Reverting Ruby application configurations..."
update_ruby_app_config "$PRIMARY_ES_URL"
# If secondary was promoted, demote it back
# demote_secondary_es_cluster
echo "$(date): Revert complete. Applications now using primary Elasticsearch."
}
update_ruby_app_config() {
local new_es_url=$1
# Example: If using environment variables managed by a deployment tool like Kubernetes or Consul
# kubectl patch deployment your-ruby-app --patch '{"spec": {"template": {"spec": {"containers": [{"name": "your-ruby-app", "env": [{"name": "ELASTICSEARCH_URL", "value": "'$new_es_url'"}]}]}}}}'
# Or, if updating a configuration file on each app server:
# sed -i "s|ELASTICSEARCH_URL=.*|ELASTICSEARCH_URL=$new_es_url|g" /etc/your_app/config.yml
echo "Simulating update of Ruby app config to use: $new_es_url"
# In a real-world scenario, this would involve interacting with your deployment system.
}
# Main loop
while true; do
if [ -f "$FAILOVER_ACTIVE_FILE" ]; then
# Failover is active, check if primary is back up
if check_es_health "$PRIMARY_ES_URL"; then
revert_failover
else
echo "$(date): Failover active. Primary ES still unhealthy."
fi
else
# Failover is not active, check primary health
if ! check_es_health "$PRIMARY_ES_URL"; then
if [ ! -f "$FAILOVER_TRIGGER_FILE" ]; then
trigger_failover
else
echo "$(date): Primary ES unhealthy, but failover already in progress."
fi
else
echo "$(date): Primary ES healthy. No failover needed."
# Clean up any stale trigger file if primary is healthy
rm -f "$FAILOVER_TRIGGER_FILE"
fi
fi
sleep 60 # Check every 60 seconds
done
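The loop above acts on a single failed check, so one dropped request could trigger a failover. A common refinement is to require several consecutive failures before switching, and several consecutive successes before reverting. The following Ruby sketch shows that idea in isolation; the class name and both thresholds are assumptions, and the probe itself is left to the caller.

```ruby
# Tracks consecutive probe results and decides when to fail over or revert.
# failure_threshold and recovery_threshold are assumed values for illustration.
class FailoverDetector
  attr_reader :state

  def initialize(failure_threshold: 3, recovery_threshold: 5)
    @failure_threshold = failure_threshold
    @recovery_threshold = recovery_threshold
    @state = :primary
    @streak = 0
  end

  # Feed one probe result (true means the primary looked healthy).
  # Returns :failover, :revert, or nil when no action is needed.
  def record(primary_healthy)
    # Count only results that contradict the current state.
    contradicts = (@state == :primary) ? !primary_healthy : primary_healthy
    @streak = contradicts ? @streak + 1 : 0

    if @state == :primary && @streak >= @failure_threshold
      transition(:secondary, :failover)
    elsif @state == :secondary && @streak >= @recovery_threshold
      transition(:primary, :revert)
    end
  end

  private

  def transition(new_state, action)
    @state = new_state
    @streak = 0
    action
  end
end
```

Using a higher threshold for recovery than for failure is a deliberate choice here: it keeps traffic from bouncing back to a primary region that is still unstable.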
Orchestrating the Switch for Ruby Deployments
The Bash script above is a rudimentary example. In a production environment, you’d likely integrate this logic with a more sophisticated orchestration tool or service discovery mechanism.
Deployment Strategies:
- Environment Variables: The most straightforward approach. Update the ELASTICSEARCH_URL environment variable for your Ruby application instances. This can be managed by your deployment platform (e.g., Kubernetes, Docker Swarm, Nomad).
- Configuration Files: If your Ruby application reads Elasticsearch connection details from a configuration file (e.g., database.yml, elasticsearch.yml), the failover script would need to update this file and then signal the application to reload its configuration (e.g., via SIGHUP or a rolling restart).
- Service Discovery: Tools like Consul or etcd can be used to store the active Elasticsearch endpoint. Your Ruby application would query the service discovery tool to get the current endpoint. The failover script would update the service discovery record.
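For the service discovery option, the application needs to re-read the endpoint periodically without querying the registry on every request. The sketch below caches the value for a short TTL. `EndpointResolver` and its `lookup` callable are hypothetical: `lookup` stands in for whatever read your Consul or etcd client provides and is not a real client call.

```ruby
# Caches the active Elasticsearch endpoint and refreshes it after ttl seconds.
# `lookup` is any callable returning the current URL (hypothetical stand-in
# for a service discovery read). `clock` is injectable to ease testing.
class EndpointResolver
  def initialize(lookup, ttl: 10, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @lookup = lookup
    @ttl = ttl
    @clock = clock
    @cached_url = nil
    @fetched_at = nil
  end

  def current_url
    now = @clock.call
    if @cached_url.nil? || now - @fetched_at >= @ttl
      begin
        @cached_url = @lookup.call
        @fetched_at = now
      rescue StandardError
        # Keep serving the last known endpoint if the lookup itself fails;
        # re-raise only when there is nothing cached to fall back on.
        raise if @cached_url.nil?
      end
    end
    @cached_url
  end
end
```

With this approach, a failover takes effect within one TTL of the record being updated, and no application restart is required as long as the client is built from `current_url` per request or per short-lived connection.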
For a Ruby on Rails application, you might have a configuration like this:
# config/initializers/elasticsearch.rb
# Assuming ELASTICSEARCH_URL is set as an environment variable
if ENV['ELASTICSEARCH_URL'].present?
  Elasticsearch::Model.client = Elasticsearch::Client.new url: ENV['ELASTICSEARCH_URL']
else
  # Fallback or error handling if URL is not set
  raise "ELASTICSEARCH_URL environment variable is not set!"
end
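The Bash script’s update_ruby_app_config function would then be responsible for updating the environment variable in your deployment system. Because an initializer reads the variable only at boot, the application processes must also be restarted (for example, through a rolling restart) before they pick up the new endpoint.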
OVH Specific Considerations
When deploying on OVH, pay close attention to:
- Network Latency: Ensure your chosen regions have acceptable latency for Elasticsearch replication and application access. OVH’s network performance between its datacenters is generally good, but always test.
- Security Groups/Firewalls: Properly configure security groups to allow necessary traffic between your Elasticsearch clusters and between your application servers and Elasticsearch.
- Instance Sizing: Ensure your Elasticsearch nodes and application servers are adequately sized for peak load, especially during a failover event where the secondary cluster might initially bear the full load.
- Managed Services: If you are using OVH’s managed Elasticsearch service, consult their specific documentation for CCR and failover capabilities, as the configuration might differ from self-hosted instances.
Advanced Considerations: Promoting Secondary and Reverting
The provided script assumes a simple switch. In an active-passive setup, you also need to explicitly “promote” the secondary Elasticsearch cluster to become the new primary once a failover occurs. With CCR this step is required for writes, because follower indices are read-only until they are converted to regular indices.
Promoting the Secondary Cluster
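If your secondary cluster is configured to only follow the primary, you’ll need to stop the replication and make it writable. Elasticsearch accepts the unfollow request only once replication is paused and the index is closed, so first call POST /my_application_logs/_ccr/pause_follow and then POST /my_application_logs/_close on the secondary cluster. After that, run:
POST /my_application_logs/_ccr/unfollow
After unfollowing, reopen the index with POST /my_application_logs/_open. The index on the secondary cluster is now a standalone, writable index, and this conversion cannot be undone. You would then update your application’s configuration to point to this now-primary secondary cluster.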
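Promotion is easier to automate when the full request sequence is written down in one place: the unfollow request is only accepted once the follower is paused and the index is closed, and the index has to be reopened afterwards. The following sketch uses Ruby's standard Net::HTTP; the function names and `base_url` are assumptions, and authentication and TLS are omitted for brevity.

```ruby
require "net/http"
require "uri"

# Ordered request paths (all POST) that turn a follower index into a
# regular, writable index.
def promotion_paths(index)
  [
    "/#{index}/_ccr/pause_follow", # stop pulling changes from the leader
    "/#{index}/_close",            # unfollow requires a closed index
    "/#{index}/_ccr/unfollow",     # convert to a regular index (irreversible)
    "/#{index}/_open"              # make it available for reads and writes
  ]
end

# Sends each request in order and stops at the first non-2xx response.
# base_url is the secondary cluster's HTTP endpoint (an assumed value).
def promote_follower(base_url, index)
  promotion_paths(index).each do |path|
    uri = URI.join(base_url, path)
    response = Net::HTTP.post(uri, "", "Content-Type" => "application/json")
    unless response.is_a?(Net::HTTPSuccess)
      raise "#{path} failed with HTTP #{response.code}: #{response.body}"
    end
  end
  true
end
```

Stopping at the first failure matters here: continuing after a failed close, for example, would only produce a second, less informative error from the unfollow call.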
Reverting to the Primary Cluster
When the original primary region becomes healthy again, you’ll want to revert. This is often the trickiest part:
- Stop Applications: Briefly stop writes to the *current* primary (which was the secondary during failover) to ensure data consistency.
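- Re-establish Replication: Reconfigure the *original* primary cluster to follow the *new* primary (the one that was the secondary). A regular index cannot be converted into a follower, so this means deleting the stale index on the original primary and recreating it as a follower index with the CCR follow API.
- Data Synchronization: Allow time for the original primary to catch up, and confirm it with the CCR stats API before proceeding.
- Switch Back: Once synchronized, stop writes to the current primary, promote the follower index on the original primary (pause, close, unfollow, reopen), switch applications back to the original primary, and then re-establish replication in the original direction by recreating the follower index on the secondary.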
This process requires careful orchestration to avoid data loss or corruption. Automation here is key, but also requires robust error handling and rollback capabilities.
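One way to structure that orchestration is to describe each step together with its undo action, and to unwind completed steps in reverse order when a later step fails. The following is a generic Ruby sketch of that pattern; `run_with_rollback` and the step hashes are hypothetical, and the actual step bodies (stopping writes, recreating followers, waiting for sync) are left as placeholders for your own tooling.

```ruby
# Runs named steps in order. If a step raises, the completed steps are undone
# in reverse order and the original error is re-raised.
# Each step is a hash with :name, :run, and an optional :undo callable.
def run_with_rollback(steps, log: [])
  completed = []
  steps.each do |step|
    log << "run #{step[:name]}"
    step[:run].call
    completed << step
  end
  log
rescue StandardError => e
  completed.reverse_each do |step|
    next unless step[:undo]
    log << "undo #{step[:name]}"
    step[:undo].call
  end
  log << "failed: #{e.message}"
  raise
end
```

Passing in your own `log` array lets the caller inspect what ran and what was rolled back even when the call raises.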
Conclusion
Architecting auto-failover for Elasticsearch and Ruby deployments on OVH involves a layered approach. It starts with a resilient Elasticsearch cluster setup using cross-cluster replication, complemented by intelligent health monitoring and an automated orchestration layer for application configuration updates. While the core concepts are universal, specific implementation details will depend heavily on your deployment tools, application architecture, and tolerance for downtime.