Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Perl Deployments on OVH

Leveraging OVH’s Infrastructure for Resilient Elasticsearch and Perl Services

This document outlines a robust disaster recovery strategy focused on automated failover for critical Elasticsearch clusters and Perl-based applications hosted within OVH’s cloud ecosystem. The objective is to minimize downtime by proactively architecting for resilience, ensuring seamless transitions to standby resources in the event of primary infrastructure failure.

Elasticsearch Auto-Failover Architecture

A highly available Elasticsearch cluster relies on its inherent distributed nature and quorum-based consensus. For disaster recovery, we extend this by maintaining a secondary, geographically distinct cluster and implementing automated failover mechanisms. This involves replicating data and having a mechanism to switch traffic.

Data Replication Strategies

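Elasticsearch’s cross-cluster replication (CCR) is the cornerstone for maintaining a warm standby. CCR is pull-based: follower indices on the secondary cluster fetch operations from leader indices on the primary, and both clusters need a subscription level that includes CCR. We’ll replicate from the primary OVH region to a secondary region.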

Configuring Cross-Cluster Replication (CCR)

Because CCR is pull-based, the remote cluster connection is configured on the secondary (follower) cluster (e.g., RBX3) and points at the primary (leader) cluster (e.g., GRA1).

# On Secondary Cluster (RBX3) - elasticsearch.yml
cluster.remote.gra1.seeds: ["gra1-es-node1:9300", "gra1-es-node2:9300"]
cluster.remote.gra1.skip_unavailable: false

Then, on the secondary cluster, create a follower index for each leader index you want to replicate. To follow an index pattern, use an auto-follow pattern (PUT /_ccr/auto_follow/<pattern_name>) instead.

# On Secondary Cluster (RBX3) - using Elasticsearch API
PUT /my_app_logs_follower/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "gra1",
  "leader_index": "my_app_logs"
}

The follower index takes its mappings and settings from the leader. Values such as index.creation_date, index.uuid, and index.version.created are assigned by Elasticsearch and cannot be supplied in the request. The leader index must have soft deletes enabled, which is the default for indices created on Elasticsearch 7.0 and later.

Automated Failover Orchestration

Manual failover is error-prone and slow. We need an automated system. This can be achieved using a combination of health checks, a load balancer, and a script or orchestration tool.

Health Check and Load Balancer Configuration

OVH’s Load Balancer service is ideal for directing traffic. We’ll configure health checks that probe both the primary and secondary Elasticsearch clusters. The load balancer will automatically remove unhealthy nodes from the pool.

# Load Balancer Configuration Snippet (conceptual, HAProxy syntax)
# OVH's Load Balancer is configured through its control panel or API rather than a file,
# but the same concepts apply: frontends, backends (farms), servers, and health probes.
# Assume two frontend IPs: primary_lb_ip and secondary_lb_ip

# Frontend Configuration for Primary Cluster (GRA1)
frontend primary_es_frontend
  bind *:9200
  mode tcp
  default_backend primary_es_backend

backend primary_es_backend
  mode tcp
  balance roundrobin
  option tcp-check
  # Health check for Elasticsearch nodes
  # This check should be sophisticated enough to verify cluster health (e.g., _cluster/health API)
  # For simplicity, we'll show a basic TCP check. A more robust check would involve HTTP.
  server es-node-1 192.168.1.10:9200 check port 9200 inter 5s fall 3 rise 2
  server es-node-2 192.168.1.11:9200 check port 9200 inter 5s fall 3 rise 2
  server es-node-3 192.168.1.12:9200 check port 9200 inter 5s fall 3 rise 2

# Frontend Configuration for Secondary Cluster (RBX3)
frontend secondary_es_frontend
  bind *:9201 # Use a different port or IP for the secondary LB if needed
  mode tcp
  default_backend secondary_es_backend

backend secondary_es_backend
  mode tcp
  balance roundrobin
  option tcp-check
  server es-node-4 192.168.2.10:9200 check port 9200 inter 5s fall 3 rise 2
  server es-node-5 192.168.2.11:9200 check port 9200 inter 5s fall 3 rise 2
  server es-node-6 192.168.2.12:9200 check port 9200 inter 5s fall 3 rise 2

The critical part is the health check. A simple TCP check is insufficient, because a node can accept connections while the cluster is red or has no elected master. Use an HTTP check or a custom script that queries the _cluster/health API and verifies that the status is green or yellow (depending on tolerance) and that enough master-eligible nodes are present. If the primary cluster’s health check fails consistently, the load balancer will stop sending traffic to it.
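As a minimal sketch of such a check, the functions below classify a _cluster/health response body. The names es_status and es_is_healthy are hypothetical, and the sed-based extraction is only there to avoid extra dependencies; a real JSON parser is the safer choice in production.

```shell
#!/bin/sh
# Sketch: decide whether a _cluster/health response counts as healthy.
# es_status and es_is_healthy are hypothetical helper names.

# Print the "status" field of a _cluster/health JSON document (empty if absent).
es_status() {
    printf '%s' "$1" | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Return 0 when the cluster is usable. $2 is the worst acceptable status:
# "green" (strict) or "yellow" (tolerate unassigned replicas; the default).
es_is_healthy() {
    _es_st=$(es_status "$1")
    case "${2:-yellow}:$_es_st" in
        green:green|yellow:green|yellow:yellow) return 0 ;;
        *) return 1 ;;   # red, empty, or unparseable responses are unhealthy
    esac
}

# Example wiring (not executed here):
#   body=$(curl -fsS --max-time 5 "http://192.168.1.10:9200/_cluster/health") || body=''
#   es_is_healthy "$body" yellow || echo "primary unhealthy"
```

Treating an empty or unparseable body as unhealthy means a timeout and a red cluster take the same path, which keeps the caller simple.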

Triggering the Failover

When the primary Elasticsearch cluster becomes unhealthy, we need to activate the secondary. This involves two main steps:

Promote Secondary Cluster: The CCR follower index needs to be promoted to a read/write index.
Redirect Traffic: Update DNS or load balancer configurations to point to the secondary cluster.
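The promotion step can be sketched as an ordered list of API calls, assuming the standard CCR endpoints and a follower index that lives on the secondary cluster. The function names ccr_promote_plan and ccr_promote are hypothetical, and the base URL is whatever address reaches the secondary cluster.

```shell
#!/bin/sh
# Sketch: promote a CCR follower index to a regular read/write index.
# ccr_promote_plan and ccr_promote are hypothetical helper names.

# ccr_promote_plan INDEX -> prints "METHOD PATH" lines in the order they must run
ccr_promote_plan() {
    printf 'POST /%s/_ccr/pause_follow\n' "$1"   # stop pulling operations from the leader
    printf 'POST /%s/_close\n' "$1"              # unfollow requires a closed index
    printf 'POST /%s/_ccr/unfollow\n' "$1"       # convert the follower into a regular index
    printf 'POST /%s/_open\n' "$1"               # reopen it for reads and writes
}

# ccr_promote BASE_URL INDEX -> runs the plan against the SECONDARY cluster,
# stopping at the first call that fails.
ccr_promote() {
    while read -r _method _path; do
        curl -fsS -X "$_method" "$1$_path" </dev/null >/dev/null || return 1
    done <<EOF
$(ccr_promote_plan "$2")
EOF
}

# Example (not executed here):
#   ccr_promote "http://secondary-es-lb.example.com:9200" my_app_logs_follower
```

Keeping the plan separate from its execution makes the ordering easy to review and to print as a dry run before a drill.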

A robust solution would involve a monitoring agent (e.g., Prometheus with Alertmanager, or a custom script) that:

Monitors the health of the primary Elasticsearch cluster (via API calls).
When primary health degrades below a threshold for a sustained period, it triggers a failover workflow.
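The "sustained period" condition is what keeps a single slow response from triggering a failover. One way to express it, sketched here with hypothetical function names and an arbitrary threshold, is to count consecutive failed checks and only act once the streak reaches a limit.

```shell
#!/bin/sh
# Sketch: fail over only after N consecutive failed health checks.
# next_streak and should_failover are hypothetical helper names.

# next_streak PREVIOUS_STREAK CHECK_EXIT_CODE -> prints the new consecutive-failure count
next_streak() {
    if [ "$2" -eq 0 ]; then
        echo 0                 # any successful check resets the streak
    else
        echo $(( $1 + 1 ))
    fi
}

# should_failover STREAK THRESHOLD -> exit status 0 once the streak reaches the threshold
should_failover() {
    [ "$1" -ge "$2" ]
}

# Example loop (not executed here); check_primary is a placeholder for your probe:
#   streak=0
#   while :; do
#       check_primary; streak=$(next_streak "$streak" $?)
#       should_failover "$streak" 3 && break
#       sleep 10
#   done
```

With a 10-second interval and a threshold of 3, the workflow would start roughly 30 seconds after the first failed probe; both numbers are assumptions to tune against your recovery time objective.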

Failover Workflow Script (Conceptual Perl)

This Perl script demonstrates the logic. It would be triggered by an external monitoring system (e.g., Alertmanager webhook).

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use JSON;

my $primary_es_host = "http://primary-es-lb.example.com:9200";
my $secondary_es_host = "http://secondary-es-lb.example.com:9200"; # Or direct IP if LB is not yet active
my $follower_index_name = "my_app_logs_follower";

my $ua = LWP::UserAgent->new;
$ua->timeout(10);

# --- Step 1: Check Primary Cluster Health ---
my $health_req = HTTP::Request->new(GET => "$primary_es_host/_cluster/health");
my $health_res = $ua->request($health_req);

if ($health_res->is_success) {
    # Treat an unparseable body like an unknown status
    my $health_data = eval { decode_json($health_res->decoded_content) } || {};
    my $status = $health_data->{status} // 'unknown';
    if ($status eq "green" || $status eq "yellow") {
        print "Primary cluster is healthy. No failover needed.\n";
        exit 0;
    }
    print "Primary cluster status is: $status. Initiating failover.\n";
} else {
    print "Failed to get health from primary cluster: " . $health_res->status_line . ". Initiating failover.\n";
    # Assume primary is down, proceed to failover
}

# --- Step 2: Promote the follower index on the SECONDARY cluster ---
# This assumes CCR is already set up and the follower index is being replicated.
# All CCR calls go to the secondary: the follower index lives there, and the
# primary may be unreachable. Promotion is pause_follow -> close -> unfollow -> open.

# Helper: send a body-less request to the secondary cluster.
sub es_call {
    my ($method, $path) = @_;
    return $ua->request(HTTP::Request->new($method => "$secondary_es_host$path"));
}

# Check that the secondary answers CCR requests for this index before changing anything.
my $info_res = es_call(GET => "/$follower_index_name/_ccr/info");
die "CRITICAL: Cannot read CCR info for $follower_index_name: " . $info_res->status_line . "\n"
    unless $info_res->is_success;

# Pause replication (if active)
my $pause_res = es_call(POST => "/$follower_index_name/_ccr/pause_follow");
if ($pause_res->is_success) {
    print "Successfully paused CCR for $follower_index_name.\n";
} else {
    warn "Failed to pause CCR for $follower_index_name: " . $pause_res->status_line . "\n";
    # Decide if this is a critical failure or if we can proceed.
}

# The follower must be closed before unfollow, then reopened as a regular index.
for my $step (
    [ "close",    "/$follower_index_name/_close" ],
    [ "unfollow", "/$follower_index_name/_ccr/unfollow" ],
    [ "open",     "/$follower_index_name/_open" ],
) {
    my ($name, $path) = @$step;
    my $res = es_call(POST => $path);
    die "CRITICAL: Failed to $name $follower_index_name: " . $res->status_line . "\n"
        unless $res->is_success;
    print "Step '$name' succeeded for $follower_index_name.\n";
}

# --- Step 3: Update Load Balancer / DNS ---
# This is highly dependent on your OVH setup.
# If using OVH Load Balancer API:
# You'd call OVH API to update frontend/backend configurations to point to the secondary cluster.
# Example: Change the backend of the primary_es_frontend to point to secondary_es_backend.

print "Failover to secondary Elasticsearch cluster initiated.\n";
print "Manual intervention may be required to re-establish replication later.\n";

exit 0;

Important Considerations for Elasticsearch Failover:

Index State: The follower index must be in a state where it can be promoted. CCR handles this, but ensure it’s not in a corrupted state.
Master Node Election: The secondary cluster must have a healthy quorum of master-eligible nodes to function.
Re-establishing Replication: After failover, a plan is needed to re-establish CCR from the now-primary cluster back to the original primary (which is now the standby). This might involve re-creating the follower index.
Data Consistency: CCR is asynchronous. Writes acknowledged by the primary but not yet fetched by the follower are lost if the primary fails, so the effective recovery point depends on replication lag at the moment of failure.

Perl Application Auto-Failover

Perl applications, especially those with state or database dependencies, require a similar failover strategy. We’ll assume a typical web application architecture with a database backend.

Database Replication and Failover

For the database (e.g., MySQL, PostgreSQL), robust replication is essential. OVH’s managed database services often provide built-in replication and failover capabilities. If self-hosting, standard replication techniques apply.

MySQL Replication Example

Configure primary-secondary replication. The secondary database should be kept up-to-date.

# On Primary MySQL Server (my.cnf)
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON

# On Secondary MySQL Server (my.cnf)
server-id = 2
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON
read_only = ON # Important for standby

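Create the replication user on the primary, then point the secondary at the primary and start replication:

-- On Primary MySQL Server
CREATE USER 'repl_user'@'%' IDENTIFIED BY 'repl_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'%';

-- On Secondary MySQL Server
-- These legacy statements are deprecated in MySQL 8.0 in favor of
-- CHANGE REPLICATION SOURCE TO, START REPLICA, and SHOW REPLICA STATUS.
CHANGE MASTER TO
  MASTER_HOST='primary_db_ip',
  MASTER_USER='repl_user',
  MASTER_PASSWORD='repl_password',
  MASTER_PORT=3306,
  MASTER_AUTO_POSITION=1; -- For GTID-based replication

START SLAVE;
SHOW SLAVE STATUS\G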

For automated failover, tools like Orchestrator, MHA (Master High Availability), or custom scripts interacting with the MySQL replication status are necessary. These tools monitor replication lag and can promote the replica if the primary fails.
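To make the lag monitoring concrete, here is a rough sketch that reads the text output of SHOW REPLICA STATUS\G (or the legacy SHOW SLAVE STATUS\G) and classifies the reported lag. The function names and the 30-second threshold in the usage line are assumptions, and dedicated tools such as Orchestrator do considerably more than this.

```shell
#!/bin/sh
# Sketch: classify MySQL replication lag from SHOW REPLICA STATUS\G output.
# replica_lag_seconds and lag_state are hypothetical helper names.

# Read the status text on stdin and print the Seconds_Behind_Source
# (or legacy Seconds_Behind_Master) value; MySQL reports NULL when it is unknown.
replica_lag_seconds() {
    awk -F': *' '/Seconds_Behind_(Source|Master):/ { print $2; exit }'
}

# lag_state LAG MAX_SECONDS -> prints ok, lagging, or unknown
lag_state() {
    case "$1" in
        ''|NULL|*[!0-9]*) echo unknown ;;   # missing, NULL, or non-numeric value
        *) if [ "$1" -le "$2" ]; then echo ok; else echo lagging; fi ;;
    esac
}

# Example (not executed here):
#   lag=$(mysql -h secondary-db.example.com -e 'SHOW REPLICA STATUS\G' | replica_lag_seconds)
#   [ "$(lag_state "$lag" 30)" = ok ] || echo "replica is not safe to rely on"
```

An "unknown" result deserves its own alert: it usually means a replication thread is stopped, which is a different problem from a replica that is merely behind.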

Perl Application Deployment and Failover

This strategy assumes the Perl application tier is stateless or keeps its state externally (e.g., in a database or cache). Under that assumption, failover comes down to redirecting traffic to a standby application instance.

Load Balancer and Health Checks

Similar to Elasticsearch, OVH Load Balancers are used. The application servers (running Perl CGI, PSGI/Plack, or Mojolicious/Dancer apps) will have health check endpoints.

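# Example health check endpoint (e.g., in a Mojolicious app)
sub healthcheck {
    my $self = shift;
    # Basic check: ensure database connection is possible.
    # DBI's ping returns false on failure rather than dying, so test its result.
    my $ok = eval {
        my $dbh = $self->db->dbh; # Assuming $self->db is your DB connection object
        $dbh->ping;
    };
    if ($ok) {
        $self->render(json => { status => 'healthy' }, status => 200);
    } else {
        # Log the detail server-side instead of exposing driver errors to callers
        $self->app->log->error("Health check failed: " . ($@ || 'ping returned false'));
        $self->render(json => { status => 'unhealthy' }, status => 503);
    }
}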

The OVH Load Balancer would be configured to poll this endpoint on each application server. If a server fails the health check, it’s removed from the active pool.

Orchestrating Application Failover

When the primary application servers (and potentially the primary database) are deemed unhealthy, the failover process needs to redirect traffic to the standby environment.

Database Failover: If using a tool like Orchestrator, it will promote a replica to be the new primary.
Application Instance Activation: Ensure standby application instances are running and ready to accept traffic. This might involve starting services if they were in a stopped state.
Load Balancer/DNS Update: The OVH Load Balancer’s frontend configuration needs to be updated to point to the standby application servers. If using a global DNS solution, DNS records would be updated.
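Between activating the standby instances and redirecting traffic, the workflow should wait until the standby actually passes its health checks. A minimal sketch of that gate follows; wait_until_healthy is a hypothetical helper, and the /health path in the usage line is an assumed route name.

```shell
#!/bin/sh
# Sketch: block until a check command has passed several times in a row.
# wait_until_healthy is a hypothetical helper name.

# wait_until_healthy RISE MAX_ATTEMPTS INTERVAL_SECONDS CMD [ARGS...]
# Runs CMD repeatedly. Succeeds once CMD has passed RISE times consecutively,
# fails after MAX_ATTEMPTS total runs.
wait_until_healthy() {
    _rise=$1; _max=$2; _interval=$3
    shift 3
    _ok=0; _n=0
    while [ "$_n" -lt "$_max" ]; do
        _n=$((_n + 1))
        if "$@" >/dev/null 2>&1; then
            _ok=$((_ok + 1))
            if [ "$_ok" -ge "$_rise" ]; then return 0; fi
        else
            _ok=0              # a failure resets the consecutive-success count
        fi
        sleep "$_interval"
    done
    return 1
}

# Example (not executed here):
#   wait_until_healthy 2 30 5 curl -fsS --max-time 3 "http://standby-app-server-1/health" \
#       || { echo "standby never became healthy; aborting traffic switch"; exit 1; }
```

Requiring consecutive successes mirrors the load balancer's own "rise" setting, so the script and the load balancer agree on when a server counts as up.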

Failover Workflow Script (Conceptual Bash)

This Bash script, triggered by monitoring, orchestrates the application failover. It assumes a mechanism for database failover has already completed or is triggered concurrently.

#!/bin/bash

PRIMARY_APP_LB_IP="1.2.3.4" # OVH Load Balancer IP for app
SECONDARY_APP_LB_IP="5.6.7.8" # OVH Load Balancer IP for standby app
PRIMARY_DB_HOST="primary-db.example.com"
SECONDARY_DB_HOST="secondary-db.example.com" # Assumed to be promoted

# --- Step 1: Verify Primary Application Health ---
# This would typically be done by a monitoring system.
# If monitoring confirms primary app servers are down/unhealthy:

echo "Primary application servers detected as unhealthy. Initiating failover..."

# --- Step 2: Trigger Database Failover (if not already done) ---
# Example: Call a script that uses Orchestrator API or similar
# ./trigger_db_failover.sh "$PRIMARY_DB_HOST" "$SECONDARY_DB_HOST"
# if [ $? -ne 0 ]; then
# echo "CRITICAL: Database failover failed. Aborting application failover."
# exit 1
# fi
# echo "Database failover complete. New primary DB: $SECONDARY_DB_HOST"

# --- Step 3: Ensure Standby Application Instances are Running ---
# This might involve SSHing into standby servers and starting services.
# Example:
# ssh standby-app-server-1 "systemctl start my-perl-app.service"
# ssh standby-app-server-2 "systemctl start my-perl-app.service"
# ... wait for services to stabilize and pass health checks ...

# --- Step 4: Update Load Balancer Configuration ---
# This is the most complex part and requires interaction with OVH API.
# You would typically use 'curl' to call OVH's API to:
# 1. Disable the frontend pointing to primary app servers.
# 2. Enable/reconfigure the frontend to point to standby app servers.
# OR, if using separate LBs for primary/secondary, update DNS to point to the secondary LB IP.

echo "Updating OVH Load Balancer configuration to point to standby application servers..."
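# Placeholder for OVH API calls. The URL, IDs, and payload are illustrative only;
# look up the real route for your load balancer product in the OVH API console.
# Calls to api.ovh.com must be signed (X-Ovh-Application, X-Ovh-Consumer,
# X-Ovh-Timestamp, X-Ovh-Signature headers), not sent with a single static token.
# Example (the $OVH_* and $TS/$SIG variables are placeholders):
# curl -X PUT -H "Content-Type: application/json" \
#   -H "X-Ovh-Application: $OVH_APP_KEY" -H "X-Ovh-Consumer: $OVH_CONSUMER_KEY" \
#   -H "X-Ovh-Timestamp: $TS" -H "X-Ovh-Signature: $SIG" \
# -d '{"defaultBackend": "standby_app_backend_id"}' \
# "https://api.ovh.com/1.0/loadbalancer/your_lb_id/frontend/your_frontend_id"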

echo "Application failover to standby environment initiated."
echo "Ensure database connection strings in application configurations are updated if necessary."

exit 0

Key Considerations for Perl Application Failover

Configuration Management: Ensure standby application instances have the correct database connection strings, API keys, and other configurations pointing to the *new* primary database and Elasticsearch cluster after failover. Tools like Ansible or Chef are invaluable here.
Session Management: If your Perl application uses server-side sessions, ensure a shared session store (e.g., Redis, Memcached) is replicated or accessible from both primary and secondary environments.
Stateful Components: Any external state (caches, message queues) must also have a failover strategy.
Testing: Regular, automated DR drills are non-negotiable. Simulate failures and execute the failover scripts to validate their effectiveness and identify gaps.

OVH Specifics and Best Practices

OVH provides a range of services that can be leveraged:

Public Cloud Load Balancer: Essential for directing traffic and performing health checks. Configure sophisticated health checks that go beyond simple port checks.
Managed Databases: Utilize OVH’s managed database offerings (e.g., Managed Databases for MySQL, PostgreSQL) which often include automated replication and failover features.
Bare Metal Servers / Public Cloud Instances: For self-hosted Elasticsearch or application servers, ensure you have a plan for provisioning standby resources in a different OVH region.
API Access: All automation scripts will rely on OVH’s APIs for managing load balancers, DNS, and potentially other services. Securely manage API credentials.
Regional Redundancy: Deploy primary and secondary infrastructure in geographically distinct OVH regions (e.g., GRA1 vs. RBX3) to protect against regional outages.
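On the API access point: OVH's public API authenticates each call with a signature derived from the application secret, the consumer key, and the request itself. As a rough sketch, assuming the documented "$1$" plus SHA-1 scheme and that the sha1sum utility is available, the signature could be computed as below. The name ovh_signature is hypothetical, and the credentials should come from a secret store rather than from the script.

```shell
#!/bin/sh
# Sketch: compute the X-Ovh-Signature header value for one API request.
# ovh_signature is a hypothetical helper name; sha1sum (coreutils) is assumed.

# ovh_signature APP_SECRET CONSUMER_KEY METHOD FULL_URL BODY TIMESTAMP
# Signature = "$1$" followed by the hex SHA-1 of the six fields joined with "+".
ovh_signature() {
    printf '$1$%s\n' "$(printf '%s+%s+%s+%s+%s+%s' "$1" "$2" "$3" "$4" "$5" "$6" \
        | sha1sum | cut -d' ' -f1)"
}

# Example (not executed here); the timestamp should be the API's own clock,
# available from GET https://eu.api.ovh.com/1.0/auth/time:
#   ts=$(curl -fsS https://eu.api.ovh.com/1.0/auth/time)
#   sig=$(ovh_signature "$OVH_APP_SECRET" "$OVH_CONSUMER_KEY" GET "$url" "" "$ts")
```

Because the body and URL are part of the signed string, a signature cannot be reused for a different request, which limits the damage if one logged request leaks.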

Conclusion

Architecting for automated failover requires a holistic approach, integrating infrastructure capabilities with application-level resilience. By leveraging Elasticsearch’s CCR, robust database replication, and OVH’s Load Balancer with intelligent health checks, coupled with well-tested automation scripts, organizations can significantly reduce Mean Time To Recovery (MTTR) and ensure business continuity for their critical Perl applications and data services.