Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Ruby Deployments on OVH

Elasticsearch Cluster Architecture for High Availability on OVH

Robust disaster recovery for Elasticsearch in a multi-region setup on OVH depends on a well-defined cluster architecture. We’ll focus on a leader-follower model with cross-region replication: Elasticsearch’s built-in cross-cluster replication keeps a second region in sync, and external tooling automates the failover.

Our strategy involves deploying an Elasticsearch cluster in a primary region (e.g., GRA) and configuring cross-cluster replication (CCR) to a secondary region (e.g., BHS). This ensures data redundancy and allows for a swift switchover in case of a primary region outage.

Elasticsearch Configuration for CCR

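Cross-region replication relies on CCR, in which a follower index in one cluster replicates a leader index in another. The follower cluster connects to the leader and pulls changes from it. Replication is set up through the Elasticsearch API, and CCR is a licensed feature, so confirm that your Elastic subscription includes it. We’ll assume you have two independent Elasticsearch clusters running in different OVH regions.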

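First, ensure your Elasticsearch instances are accessible from each other. Remote cluster connections use the transport port, so configure security groups or firewall rules within your OVH cloud project to allow traffic on port 9300 (or your configured transport port) between the clusters. Port 9200 (or your configured HTTP port) only needs to be reachable by API clients such as your applications and monitoring scripts.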

Replication is defined on the follower side, which names the leader through a remote cluster alias and selects the indices to replicate. The leader cluster (primary region) therefore needs very little CCR-specific configuration.

Leader Cluster Configuration (Primary Region)

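Optionally, add the follower cluster as a remote cluster on the leader. One-way replication does not require this, but it prepares the leader for the day the replication direction has to be reversed after a failover. This is done in the elasticsearch.yml configuration file on each node of the leader cluster, or dynamically via the cluster settings API.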

cluster.remote.bhs_cluster:
  seeds:
    - 192.168.1.10:9300
    - 192.168.1.11:9300
  skip_unavailable: false
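
As a rough illustration of the dynamic route, the same remote cluster definition can be sent to the cluster settings API (PUT /_cluster/settings) instead of being written to elasticsearch.yml. The Ruby sketch below builds that request body. The helper names remote_cluster_settings and register_remote_cluster are hypothetical, and the HTTP call assumes an endpoint without authentication or TLS, which a real cluster should not expose.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build the body for PUT /_cluster/settings that registers a remote cluster.
# alias_name and seeds mirror the YAML example (seeds use the transport port).
def remote_cluster_settings(alias_name, seeds, skip_unavailable: false)
  {
    persistent: {
      cluster: {
        remote: {
          alias_name => {
            seeds: seeds,
            skip_unavailable: skip_unavailable
          }
        }
      }
    }
  }
end

# Hypothetical helper: send the settings to a cluster over plain HTTP.
# Assumption: no authentication is needed; add credentials/TLS in practice.
def register_remote_cluster(base_url, alias_name, seeds)
  uri = URI.join(base_url, '/_cluster/settings')
  request = Net::HTTP::Put.new(uri, 'Content-Type' => 'application/json')
  request.body = JSON.generate(remote_cluster_settings(alias_name, seeds))
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end
```

A call such as register_remote_cluster('http://primary-es.example.com:9200', 'bhs_cluster', ['192.168.1.10:9300', '192.168.1.11:9300']) would then apply the setting without a node restart.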

Follower Cluster Configuration (Secondary Region)

On the follower cluster, register the leader cluster as a remote cluster. This is the connection that replication actually uses.

cluster.remote.gra_cluster:
  seeds:
    - 192.168.2.10:9300
    - 192.168.2.11:9300
  skip_unavailable: false

Then, still on the follower cluster, create a follower index for each index you want to replicate. For example, to replicate an index named my_application_logs from the leader:

PUT /my_application_logs/_ccr/follow
{
  "remote_cluster": "gra_cluster",
  "leader_index": "my_application_logs"
}

This API call initiates the replication process. The resulting follower index is read-only, and Elasticsearch keeps it synchronized with the leader from then on. You can monitor the replication status using the following API call on the follower cluster:

GET /my_application_logs/_ccr/stats

This provides insights into the replication lag and overall health of the CCR process.
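
To make "replication lag" concrete, here is a minimal Ruby sketch that derives a per-shard lag from an already parsed stats response. It assumes the response has the shape returned by the follower stats API, with an indices array whose shards carry leader_global_checkpoint and follower_global_checkpoint. The function names ccr_lag_by_shard and max_ccr_lag are hypothetical.

```ruby
# Compute replication lag per shard from a parsed /_ccr/stats response (a Hash).
# Lag is expressed in operations: leader checkpoint minus follower checkpoint.
def ccr_lag_by_shard(stats)
  stats.fetch('indices', []).flat_map do |index|
    index.fetch('shards', []).map do |shard|
      {
        index: index['index'],
        shard_id: shard['shard_id'],
        lag_ops: shard['leader_global_checkpoint'] - shard['follower_global_checkpoint'],
        # A fatal_exception entry means replication stopped for this shard.
        failed: !shard['fatal_exception'].nil?
      }
    end
  end
end

# Largest lag across all shards, or 0 when the response lists no shards.
def max_ccr_lag(stats)
  ccr_lag_by_shard(stats).map { |shard| shard[:lag_ops] }.max || 0
end
```

A monitoring job could alert when max_ccr_lag stays above a threshold you choose, or when any shard reports failed, since both affect how much data a failover would lose.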

Automated Failover for Ruby Applications on OVH

Automating failover for your Ruby applications involves a multi-pronged approach: detecting failures, orchestrating the switch, and ensuring application connectivity to the correct Elasticsearch instance.

Health Check and Failure Detection

We need a mechanism to continuously monitor the health of the primary Elasticsearch cluster and the Ruby application instances. This can be achieved using a combination of:

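Elasticsearch Cluster Health API: Regularly query the _cluster/health endpoint. A red status, a timeout, or no response at all indicates a problem. A yellow status means some replica shards are unassigned while all data remains available, so it usually warrants an alert rather than a failover. A significant increase in unassigned_shards is also worth alerting on.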
Application-Level Health Checks: Implement endpoints in your Ruby application that check connectivity to Elasticsearch and perform a basic query.
External Monitoring Tools: Services like Prometheus with Alertmanager, or even custom scripts, can poll these health endpoints and trigger alerts.
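
An application-level health check can be as small as a Rack-compatible endpoint. The sketch below is one possible shape, not a prescribed implementation: the class name ElasticsearchHealthCheck is hypothetical, and it assumes the injected client responds to ping the way the elasticsearch gem’s client does (returning true or false, or raising when the cluster is unreachable).

```ruby
require 'json'

# Minimal Rack-compatible health endpoint (hypothetical class name).
# Any object that responds to #ping can be injected as the client.
class ElasticsearchHealthCheck
  def initialize(client)
    @client = client
  end

  # Returns 200 when Elasticsearch answers the ping, 503 otherwise.
  def call(_env)
    healthy = begin
      @client.ping ? true : false
    rescue StandardError
      false # connection errors and timeouts count as unhealthy
    end

    status = healthy ? 200 : 503
    body = JSON.generate({ elasticsearch: healthy ? 'ok' : 'unreachable' })
    [status, { 'content-type' => 'application/json' }, [body]]
  end
end
```

In a Rails application this could be mounted in config/routes.rb at a path of your choice, and an external monitor would poll that path.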

For this example, let’s consider a simple Bash script that periodically checks the Elasticsearch health and triggers a failover process if issues are detected.

#!/bin/bash

PRIMARY_ES_URL="http://primary-es.example.com:9200"
SECONDARY_ES_URL="http://secondary-es.example.com:9200"
FAILOVER_TRIGGER_FILE="/tmp/elasticsearch_failover_in_progress"
FAILOVER_ACTIVE_FILE="/tmp/elasticsearch_failover_active"

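check_es_health() {
  local es_url="$1"
  local status
  # _cat/health with h=status returns only the status word. A timeout or
  # connection error yields an empty string, which counts as unhealthy.
  status=$(curl -s --max-time 10 "${es_url}/_cat/health?h=status" | tr -d '[:space:]')
  if [[ "$status" == "green" || "$status" == "yellow" ]]; then
    return 0 # Healthy (yellow: replicas unassigned, data still available)
  else
    return 1 # Unhealthy or unreachable
  fi
}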

trigger_failover() {
  echo "$(date): Primary Elasticsearch unhealthy. Initiating failover..."
  # Mark failover as in progress to prevent multiple triggers
  touch "$FAILOVER_TRIGGER_FILE"

  # Update application configurations to point to secondary ES
  echo "$(date): Updating Ruby application configurations..."
  # This is a placeholder. Actual implementation depends on your deployment strategy.
  # For example, you might update environment variables, configuration files, or use a service discovery mechanism.
  update_ruby_app_config "$SECONDARY_ES_URL"

  # CCR follower indices are read-only. If the application writes to Elasticsearch,
  # promote the secondary here (see "Promoting the Secondary Cluster" below).
  # promote_secondary_es_cluster

  # Mark failover as active
  touch "$FAILOVER_ACTIVE_FILE"
  echo "$(date): Failover complete. Applications now using secondary Elasticsearch."
}

revert_failover() {
  echo "$(date): Primary Elasticsearch healthy again. Reverting failover..."
  rm -f "$FAILOVER_TRIGGER_FILE" "$FAILOVER_ACTIVE_FILE"

  # Update application configurations back to primary ES
  echo "$(date): Reverting Ruby application configurations..."
  update_ruby_app_config "$PRIMARY_ES_URL"

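  # If the secondary was promoted and accepted writes, the primary is now stale.
  # Resynchronize it first (see "Reverting to the Primary Cluster" below).
  # demote_secondary_es_cluster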

  echo "$(date): Revert complete. Applications now using primary Elasticsearch."
}

update_ruby_app_config() {
  local new_es_url=$1
  # Example: If using environment variables managed by a deployment tool like Kubernetes or Consul
  # kubectl patch deployment your-ruby-app --patch '{"spec": {"template": {"spec": {"containers": [{"name": "your-ruby-app", "env": [{"name": "ELASTICSEARCH_URL", "value": "'$new_es_url'"}]}]}}}}'
  # Or, if updating a configuration file on each app server:
  # sed -i "s|ELASTICSEARCH_URL=.*|ELASTICSEARCH_URL=$new_es_url|g" /etc/your_app/config.yml
  echo "Simulating update of Ruby app config to use: $new_es_url"
  # In a real-world scenario, this would involve interacting with your deployment system.
}

# Main loop
while true; do
  if [ -f "$FAILOVER_ACTIVE_FILE" ]; then
    # Failover is active, check if primary is back up
    if check_es_health "$PRIMARY_ES_URL"; then
      revert_failover
    else
      echo "$(date): Failover active. Primary ES still unhealthy."
    fi
  else
    # Failover is not active, check primary health
    if ! check_es_health "$PRIMARY_ES_URL"; then
      if [ ! -f "$FAILOVER_TRIGGER_FILE" ]; then
        trigger_failover
      else
        echo "$(date): Primary ES unhealthy, but failover already in progress."
      fi
    else
      echo "$(date): Primary ES healthy. No failover needed."
      # Clean up any stale trigger file if primary is healthy
      rm -f "$FAILOVER_TRIGGER_FILE"
    fi
  fi
  sleep 60 # Check every 60 seconds
done
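
One weakness of the loop above is that a single failed check triggers a failover, so a brief network blip can cause an unnecessary switch. A common mitigation is to require several consecutive failures first. The Ruby sketch below shows that idea in isolation. The class name FailureDetector and the default threshold of 3 are illustrative assumptions, not values taken from the script.

```ruby
# Debounce health checks: signal a failover only after `threshold`
# consecutive unhealthy results (hypothetical helper class).
class FailureDetector
  attr_reader :consecutive_failures

  def initialize(threshold: 3)
    @threshold = threshold
    @consecutive_failures = 0
  end

  # Record one health check result.
  # Returns true exactly once, when the failure streak first reaches the threshold.
  def record(healthy)
    if healthy
      @consecutive_failures = 0
      return false
    end

    @consecutive_failures += 1
    @consecutive_failures == @threshold
  end
end
```

With a 60-second polling interval and a threshold of 3, a failover would start roughly three minutes after the primary stops responding, which is the trade-off between false positives and recovery time.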

Orchestrating the Switch for Ruby Deployments

The Bash script above is a rudimentary example. In a production environment, you’d likely integrate this logic with a more sophisticated orchestration tool or service discovery mechanism.

Deployment Strategies:

Environment Variables: The most straightforward approach. Update the ELASTICSEARCH_URL environment variable for your Ruby application instances. This can be managed by your deployment platform (e.g., Kubernetes, Docker Swarm, Nomad).
Configuration Files: If your Ruby application reads Elasticsearch connection details from a configuration file (e.g., database.yml, elasticsearch.yml), the failover script would need to update this file and then signal the application to reload its configuration (e.g., via SIGHUP or a rolling restart).
Service Discovery: Tools like Consul or etcd can be used to store the active Elasticsearch endpoint. Your Ruby application would query the service discovery tool to get the current endpoint. The failover script would update the service discovery record.
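
For the service discovery option, the application side can be sketched as a small resolver that looks up the active endpoint and caches it briefly. Everything here is illustrative: the class name ActiveEndpoint is hypothetical, and the lookup block stands in for whatever call your Consul or etcd client actually provides.

```ruby
# Resolve the active Elasticsearch endpoint through an injected lookup and
# cache the answer for ttl_seconds (hypothetical helper class).
class ActiveEndpoint
  def initialize(ttl_seconds:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }, &lookup)
    @ttl = ttl_seconds
    @clock = clock
    @lookup = lookup # block that returns the current endpoint URL
    @cached_url = nil
    @fetched_at = nil
  end

  def url
    now = @clock.call
    if @cached_url.nil? || now - @fetched_at >= @ttl
      begin
        @cached_url = @lookup.call
        @fetched_at = now
      rescue StandardError
        # With nothing cached there is no safe fallback, so re-raise.
        raise if @cached_url.nil?
        # Otherwise keep serving the last known endpoint and retry next call.
      end
    end
    @cached_url
  end
end
```

The TTL bounds how long an application keeps talking to the old cluster after the failover script updates the record, so it should be short relative to your recovery time objective.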

For a Ruby on Rails application, you might have a configuration like this:

# config/initializers/elasticsearch.rb
# Assuming ELASTICSEARCH_URL is set as an environment variable

if ENV['ELASTICSEARCH_URL'].present?
  Elasticsearch::Model.client = Elasticsearch::Client.new url: ENV['ELASTICSEARCH_URL']
else
  # Fallback or error handling if URL is not set
  raise "ELASTICSEARCH_URL environment variable is not set!"
end

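The Bash script’s update_ruby_app_config function would then be responsible for updating the environment variable in your deployment system. Because the initializer reads ELASTICSEARCH_URL only at boot, the change takes effect once the application processes restart, for example through a rolling restart.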

OVH Specific Considerations

When deploying on OVH, pay close attention to:

Network Latency: Ensure your chosen regions have acceptable latency for Elasticsearch replication and application access. OVH’s network performance between its datacenters is generally good, but always test.
Security Groups/Firewalls: Properly configure security groups to allow necessary traffic between your Elasticsearch clusters and between your application servers and Elasticsearch.
Instance Sizing: Ensure your Elasticsearch nodes and application servers are adequately sized for peak load, especially during a failover event where the secondary cluster might initially bear the full load.
Managed Services: If you are using OVH’s managed Elasticsearch service, consult their specific documentation for CCR and failover capabilities, as the configuration might differ from self-hosted instances.

Advanced Considerations: Promoting Secondary and Reverting

The provided script assumes a simple switch of endpoints. With CCR that is sufficient only for read-only workloads, because follower indices reject writes. If your application writes to Elasticsearch, you must explicitly “promote” the follower indices on the secondary cluster to regular indices once a failover occurs.

Promoting the Secondary Cluster

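If your secondary cluster only follows the primary, you’ll need to stop the replication and make its indices writable. A follower index cannot be unfollowed while it is replicating: it must first be paused and closed, and it is reopened after the unfollow call. Unfollowing is irreversible, since a regular index cannot be converted back into a follower. On the secondary cluster, run:

POST /my_application_logs/_ccr/pause_follow
POST /my_application_logs/_close
POST /my_application_logs/_ccr/unfollow
POST /my_application_logs/_open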

After unfollowing, the index on the secondary cluster becomes a standalone, writable index. You would then update your application’s configuration to point to this now-primary secondary cluster.
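
Promotion is a short, ordered sequence of API calls: pause following, close the index, unfollow, and reopen it. The Ruby sketch below captures that order and stops at the first failing step. The function names promotion_requests and promote_follower are hypothetical, and the block is a stand-in for however you send HTTP requests to the secondary cluster.

```ruby
# Ordered requests that turn a CCR follower index into a regular index.
def promotion_requests(index)
  [
    [:post, "/#{index}/_ccr/pause_follow"], # stop pulling changes
    [:post, "/#{index}/_close"],            # unfollow requires a closed index
    [:post, "/#{index}/_ccr/unfollow"],     # convert to a regular index
    [:post, "/#{index}/_open"]              # make it available for writes
  ]
end

# Run each step through the given block, which must return true on success.
# Raises on the first failed step so later steps are never attempted.
def promote_follower(index)
  promotion_requests(index).each do |method, path|
    ok = yield(method, path)
    raise "promotion step failed: #{method.upcase} #{path}" unless ok
  end
  true
end
```

Stopping at the first failure matters here: reopening or redirecting traffic to an index that is still a follower would leave the application pointed at a read-only index.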

Reverting to the Primary Cluster

When the original primary region becomes healthy again, you’ll want to revert. This is often the trickiest part:

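Keep Writes on the Current Primary: Applications continue to use the *current* primary (which was the secondary before failover) while the original primary is brought back. Nothing should write to the original primary during this phase.
Re-establish Replication: Configure the *original* primary cluster to follow the *new* primary (the one that was the secondary). Because an existing regular index cannot be converted into a follower, this typically means removing the stale copy of the index on the original primary and recreating it as a follower index.
Data Synchronization: Allow time for the original primary to catch up, and watch the CCR stats until its follower index has caught up with the leader.
Switch Back: Once synchronized, stop writes to the current primary, promote the index on the original primary, switch applications back to it, and then re-establish replication in the original direction (primary to secondary).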

This process requires careful orchestration to avoid data loss or corruption. Automation here is key, but also requires robust error handling and rollback capabilities.

Conclusion

Architecting auto-failover for Elasticsearch and Ruby deployments on OVH involves a layered approach. It starts with a resilient Elasticsearch cluster setup using cross-cluster replication, complemented by intelligent health monitoring and an automated orchestration layer for application configuration updates. While the core concepts are universal, specific implementation details will depend heavily on your deployment tools, application architecture, and tolerance for downtime.