Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and WooCommerce Deployments on OVH

Elasticsearch Cluster Architecture for High Availability

Achieving robust disaster recovery for Elasticsearch hinges on a well-defined, multi-zone deployment strategy. For production environments, especially those powering critical applications like WooCommerce, a single-node or even a single-region setup is unacceptable. We’ll focus on a multi-master, multi-zone architecture within OVH’s infrastructure, leveraging their global network capabilities for resilience.

A typical HA Elasticsearch cluster comprises multiple master-eligible nodes, data nodes, and coordinating nodes. For disaster recovery, the key is to distribute these roles across distinct failure domains (e.g., OVH datacenters such as GRA and RBX in France, or BHS in Canada) and potentially across different geographic regions for true DR. This ensures that the failure of an entire data center or even a region does not lead to data loss or service interruption. Keep in mind that a single cluster stretched across distant sites pays for that resilience with higher inter-node latency.

Configuring Elasticsearch for Zone Awareness

Elasticsearch’s shard allocation awareness is crucial for distributing data across failure domains. By configuring cluster.routing.allocation.awareness.attributes, we instruct Elasticsearch to consider specific node attributes (like availability zone) when placing shards. Elasticsearch then tries to spread the primary and replica copies of each shard across the available zones, so that losing one zone does not remove every copy of a shard.

On each Elasticsearch node, you’ll need to define its zone in the elasticsearch.yml configuration file. This is typically done by setting a system environment variable or by directly adding attributes to the node’s configuration.

Node Configuration Example (elasticsearch.yml)

For a node in the GRA (Gravelines) zone:

cluster.name: "my-prod-es-cluster"
node.name: "es-node-gra-01"
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

# Zone awareness configuration
node.attr.zone: "gra"

discovery.seed_hosts:
  - "es-node-gra-01:9300"
  - "es-node-rbx-01:9300"
  - "es-node-bhs-01:9300"

# Only needed to bootstrap a brand-new cluster; remove once the cluster has formed
cluster.initial_master_nodes:
  - "es-node-gra-01"
  - "es-node-rbx-01"
  - "es-node-bhs-01"

cluster.routing.allocation.awareness.attributes: "zone"
cluster.routing.allocation.enable: "all"
cluster.routing.allocation.total_shards_per_node: 1000 # Adjust as per your needs

Repeat this configuration for nodes in other zones (e.g., node.attr.zone: "rbx" for a node in Roubaix) and ensure discovery.seed_hosts and cluster.initial_master_nodes include nodes from all intended zones.
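To verify that zone awareness is actually spreading shard copies, the shard listing can be compared against each node's zone. The following is a minimal Python sketch, assuming the rows have already been parsed from GET _cat/shards?h=index,shard,node and that the node-to-zone mapping mirrors node.attr.zone. The function name and the node es-node-gra-02 are hypothetical, used only for illustration.

```python
from collections import defaultdict

def shards_in_single_zone(shard_rows, node_zone):
    """Return (index, shard) pairs whose assigned copies all sit in one zone.

    shard_rows: iterable of (index, shard_number, node_name) tuples
    node_zone:  dict mapping node name -> value of node.attr.zone
    """
    zones = defaultdict(set)
    copies = defaultdict(int)
    for index, shard, node in shard_rows:
        zones[(index, shard)].add(node_zone[node])
        copies[(index, shard)] += 1
    # A shard with a single copy cannot be spread, so only flag shards with replicas.
    return sorted(k for k, z in zones.items() if copies[k] > 1 and len(z) == 1)

rows = [
    ("products", 0, "es-node-gra-01"),
    ("products", 0, "es-node-rbx-01"),
    ("products", 1, "es-node-gra-01"),
    ("products", 1, "es-node-gra-02"),  # hypothetical second GRA node
]
zones = {"es-node-gra-01": "gra", "es-node-gra-02": "gra", "es-node-rbx-01": "rbx"}
print(shards_in_single_zone(rows, zones))  # [('products', 1)]
```

Any shard reported here would lose all of its copies if that one zone went down.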

Automated Failover Strategy: Orchestration with Keepalived and HAProxy

While Elasticsearch itself provides high availability through its distributed nature and shard replication, the client-facing endpoint needs a robust failover mechanism. For WooCommerce, this means ensuring that the API requests and search queries directed to Elasticsearch are always routed to a healthy instance. We’ll use Keepalived for Virtual IP (VIP) management and HAProxy for load balancing and health checking.

Keepalived Configuration for VIP Failover

Keepalived uses the Virtual Router Redundancy Protocol (VRRP) to manage a floating IP address. Two or more Keepalived instances are configured, with one acting as the master and others as backups. If the master fails, a backup automatically takes over the VIP.

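This setup requires at least two dedicated servers (or VMs) running HAProxy, which in turn points to your Elasticsearch nodes. Because VRRP advertisements and a floating IP only work between hosts on the same layer-2 segment, both servers must share a private network (on OVH this is typically a vRack). Placing them in different datacenters is only possible where that network spans both sites.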

Keepalived Configuration (keepalived.conf)

On the primary Keepalived server (e.g., in GRA):

vrrp_script chk_haproxy {
    script "/usr/local/bin/check_haproxy.sh"
    interval 2
    weight 2
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state MASTER           # Use BACKUP on the secondary server
    interface eth0         # Your network interface
    virtual_router_id 51
    priority 101           # Higher priority for MASTER (e.g., 101), lower for BACKUP (e.g., 100)
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass s3cr3tpw   # Placeholder; Keepalived uses only the first 8 characters
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:0 # Your floating VIP
    }
    track_script {
        chk_haproxy
    }
}

On the secondary Keepalived server (e.g., in RBX), set state BACKUP and a lower priority (e.g., 100). The virtual_router_id and authentication settings must match on both servers.
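How the priority and track_script weight interact is easier to see with numbers. The sketch below is a simplified Python model, not Keepalived's implementation: it assumes a positive weight is added to the base priority while the check script succeeds, and that the highest effective priority holds the VIP. The node names lb-gra and lb-rbx are hypothetical, and real VRRP tie-breaking by IP address is not modelled.

```python
def effective_priority(base, weight, script_ok):
    """Add a positive track_script weight to the base priority while the script succeeds."""
    return base + weight if script_ok else base

def elect_master(nodes):
    """nodes: dict of name -> (base_priority, weight, script_ok).

    Return the node with the highest effective priority.
    """
    return max(nodes, key=lambda name: effective_priority(*nodes[name]))

healthy = {"lb-gra": (101, 2, True), "lb-rbx": (100, 2, True)}
haproxy_down = {"lb-gra": (101, 2, False), "lb-rbx": (100, 2, True)}

print(elect_master(healthy))       # lb-gra (103 vs 102)
print(elect_master(haproxy_down))  # lb-rbx (101 vs 102)
```

The weight has to exceed the gap between the two base priorities. With a gap of 1 and a weight of 2, a failed HAProxy check on the primary is enough to move the VIP.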

HAProxy Configuration for Elasticsearch Load Balancing

HAProxy will listen on the floating VIP and distribute traffic to healthy Elasticsearch nodes. It performs active health checks, removing unhealthy nodes from the pool.

HAProxy Configuration (haproxy.cfg)

global
    log /dev/log    local0
    log /dev/log    local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    timeout connect 5000
    timeout client  50000
    timeout server  50000

frontend es_frontend
    bind 192.168.1.100:9200  # Floating VIP; the standby node needs net.ipv4.ip_nonlocal_bind=1 to bind it
    mode http
    default_backend es_backend

backend es_backend
    mode http
    balance roundrobin
    option httpchk GET /_cluster/health?pretty
    http-check expect status 200
    server es-gra-01 10.0.0.1:9200 check port 9200 inter 2s fall 3 rise 2
    server es-rbx-01 10.0.0.2:9200 check port 9200 inter 2s fall 3 rise 2
    server es-bhs-01 10.0.0.3:9200 check port 9200 inter 2s fall 3 rise 2
    # Add more servers as needed, potentially from different regions for DR

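The option httpchk and http-check expect status 200 lines are critical for HAProxy to verify the health of Elasticsearch nodes. The check, port, inter, fall, and rise parameters define the health check behavior: with inter 2s and fall 3, a node is taken out of rotation after roughly six seconds of failed checks. Note that /_cluster/health returns HTTP 200 even when the cluster status is yellow or red, so this check verifies that a node is reachable and responding, not that every index is fully available. Adjust IP addresses and ports to match your Elasticsearch node configurations.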
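The fall and rise counters can be illustrated with a small state machine. This is a Python sketch of the general idea only (a server goes DOWN after fall consecutive failed checks and comes back UP after rise consecutive passes), not HAProxy's actual code. The function name is hypothetical.

```python
def track_health(results, fall=3, rise=2, initially_up=True):
    """Replay check results (True = check passed) and return the state after each one."""
    up = initially_up
    streak = 0  # consecutive results that contradict the current state
    states = []
    for ok in results:
        if ok == up:
            streak = 0
        else:
            streak += 1
            threshold = fall if up else rise
            if streak >= threshold:
                up = not up
                streak = 0
        states.append("UP" if up else "DOWN")
    return states

print(track_health([True, False, False, False, True, True]))
# ['UP', 'UP', 'UP', 'DOWN', 'DOWN', 'UP']
```

A single failed check never removes a node, which avoids flapping on transient network errors at the cost of a slower reaction.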

Health Check Script for Keepalived

The /usr/local/bin/check_haproxy.sh script referenced in Keepalived’s configuration is essential for ensuring that the VIP only floats to a server that is actively managing a healthy HAProxy instance.

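#!/bin/bash
# Succeed (exit 0) only while an haproxy process is running;
# Keepalived treats any non-zero exit status as a failed check.
if pidof haproxy > /dev/null; then
    exit 0
else
    exit 1
fi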

Make this script executable: chmod +x /usr/local/bin/check_haproxy.sh.

WooCommerce Integration and Failover Testing

WooCommerce applications typically interact with Elasticsearch for product search, filtering, and potentially other features. The key is to configure WooCommerce to point to the floating VIP address managed by Keepalived, rather than directly to an individual Elasticsearch node.

WooCommerce Elasticsearch Plugin Configuration

In the WordPress admin panel, open the settings of your Elasticsearch plugin (the exact location depends on the plugin used, e.g., “ElasticPress”). You will find fields for the Elasticsearch host(s). Enter the floating VIP address here.

# Example configuration snippet (plugin dependent)
# In WooCommerce settings or wp-config.php
define( 'EP_HOST', 'http://192.168.1.100:9200' ); # Use the floating VIP

If your plugin supports multiple hosts, you can list them, but for this HA setup, the VIP is the primary target. The plugin will then communicate with HAProxy, which will route each request to a healthy Elasticsearch node.

Simulating Failures for Testing

Thorough testing is paramount. You must simulate various failure scenarios to validate the auto-failover mechanism.

Elasticsearch Node Failure: Stop an Elasticsearch service on one of the nodes. Monitor HAProxy’s logs and the Elasticsearch cluster health API (via the VIP) to confirm the node is marked as down and traffic is rerouted.
HAProxy Server Failure: Stop the HAProxy service on one of the servers. Observe if Keepalived promotes the backup server and the VIP moves.
Keepalived Server Failure: Shut down one of the Keepalived servers. Verify that the other server takes over the VIP.
Network Partition: Simulate network issues between zones or regions to test how Elasticsearch handles split-brain scenarios and how clients reconnect.
Full Zone/Region Outage: If deploying across multiple OVH regions, simulate an entire region outage to test cross-region failover.

During testing, pay close attention to:

The time it takes for the VIP to failover.
The accuracy of HAProxy’s health checks.
The impact on WooCommerce user experience (e.g., search latency, errors).
Data consistency across Elasticsearch replicas.
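VIP failover time can be measured by polling the endpoint while a failure is injected. The following is a minimal Python sketch using only the standard library. The function names, polling interval, and timeout are assumptions to adapt to your environment, and an outage still in progress when sampling stops is not counted.

```python
import time
import urllib.request

def probe(url, timeout=1.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False

def longest_outage(samples):
    """samples: list of (timestamp_seconds, ok) pairs in time order.

    Return the longest gap between a first failed sample and the next success.
    """
    longest = 0.0
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            longest = max(longest, ts - start)
            start = None
    return longest

def watch(url, duration=60.0, interval=0.5):
    """Poll the endpoint for `duration` seconds and return the collected samples."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe(url)))
        time.sleep(interval)
    return samples

# Offline example with synthetic samples: failing from t=2.0, recovered at t=4.0
demo = [(0.0, True), (1.0, True), (2.0, False), (3.0, False), (3.5, False), (4.0, True)]
print(longest_outage(demo))  # 2.0
```

In a real test, run something like longest_outage(watch("http://192.168.1.100:9200/")) on a client machine, then stop HAProxy or Keepalived on the active server while it polls.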

Cross-Region Disaster Recovery for Elasticsearch

For true disaster recovery, a single-region multi-zone setup is insufficient. A catastrophic event affecting an entire OVH region requires a cross-region strategy. This involves replicating your Elasticsearch data to a separate, geographically distant OVH region.

Elasticsearch Cross-Cluster Replication (CCR)

Elasticsearch’s Cross-Cluster Replication (CCR) feature allows you to replicate indices from a primary cluster in one region to a secondary cluster in another. This is a powerful tool for DR: the follower indices in the DR region are read-only while replication runs, and can be converted into regular writable indices if the main region becomes unavailable. CCR is a commercial feature, so confirm that your Elastic subscription includes it before building on it.

CCR Configuration Steps

1. Set up a secondary Elasticsearch cluster in a different OVH region (e.g., RBX if your primary is GRA). Ensure it has sufficient capacity.

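2. Configure remote clusters. For one-way replication only the secondary (follower) cluster strictly needs a connection to the primary; defining the remote on both sides, as below, also prepares for replicating back after a failover. The connection details go in each cluster’s elasticsearch.yml.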

# In primary cluster (GRA) elasticsearch.yml
cluster.remote.secondary_cluster:
  seeds: "es-dr-node-rbx-01:9300,es-dr-node-rbx-02:9300"

# In secondary cluster (RBX) elasticsearch.yml
cluster.remote.primary_cluster:
  seeds: "es-gra-node-01:9300,es-gra-node-02:9300"

3. Create follower indices. On the secondary cluster, you define which indices from the primary cluster should be replicated. This is done via the Elasticsearch API.

# On the secondary cluster (RBX)
PUT /my_woocommerce_products_dr/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "primary_cluster",
  "leader_index": "my_woocommerce_products"
}

4. Configure auto-follow patterns for automatic replication of new indices.

# On the secondary cluster (RBX)
PUT /_ccr/auto_follow/my_auto_follow_pattern
{
  "remote_cluster": "primary_cluster",
  "leader_index_patterns": ["my_woocommerce_*"],
  "follow_index_pattern": "{{leader_index}}_dr"
}
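An auto-follow pattern derives each follower index name from a template in which {{leader_index}} stands for the name of the matched leader index. The Python sketch below approximates that behavior for planning purposes. The function name is hypothetical, and fnmatch supports more wildcard syntax than Elasticsearch's simple * patterns, so treat it as an illustration rather than an exact emulation.

```python
from fnmatch import fnmatchcase

def follower_name(leader_index, leader_index_patterns, follow_index_pattern):
    """Return the follower index name if the leader matches any pattern, else None."""
    if any(fnmatchcase(leader_index, p) for p in leader_index_patterns):
        return follow_index_pattern.replace("{{leader_index}}", leader_index)
    return None

patterns = ["my_woocommerce_*"]
template = "{{leader_index}}_dr"

print(follower_name("my_woocommerce_orders", patterns, template))  # my_woocommerce_orders_dr
print(follower_name("logs-2024", patterns, template))              # None
```

A template without the {{leader_index}} placeholder would map every matching leader to the same follower name, which is why the placeholder matters.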

Promoting the DR Cluster

In a disaster scenario, the process involves:

Pause replication on each follower index so that the secondary cluster stops pulling changes if the primary comes back online unexpectedly.
Convert the follower indices on the secondary cluster into regular read/write indices (close, unfollow, reopen).
Update WooCommerce configuration (or your application’s configuration) to point to the newly promoted DR Elasticsearch cluster’s VIP. This would involve a similar Keepalived/HAProxy setup in the DR region, or a DNS-based failover mechanism.

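Elasticsearch has no single "promote" call; a follower becomes a regular index through a pause, close, unfollow, and reopen sequence:

# On the secondary cluster (RBX)
POST /my_woocommerce_products_dr/_ccr/pause_follow
POST /my_woocommerce_products_dr/_close
POST /my_woocommerce_products_dr/_ccr/unfollow
POST /my_woocommerce_products_dr/_open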
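Promotion can be scripted so that it runs the same way under pressure. The following is a minimal Python sketch, assuming the documented CCR sequence for turning a follower into a regular index (pause_follow, close, unfollow, open). The function names are hypothetical, and authentication, TLS, and error handling are omitted and would be needed in practice.

```python
import urllib.request

def promotion_requests(follower_index):
    """Ordered (method, path) pairs that convert a follower into a regular index."""
    return [
        ("POST", f"/{follower_index}/_ccr/pause_follow"),
        ("POST", f"/{follower_index}/_close"),
        ("POST", f"/{follower_index}/_ccr/unfollow"),
        ("POST", f"/{follower_index}/_open"),
    ]

def promote(base_url, follower_index):
    """Send the promotion requests in order; an HTTP error aborts the sequence."""
    for method, path in promotion_requests(follower_index):
        req = urllib.request.Request(base_url + path, method=method)
        with urllib.request.urlopen(req, timeout=30) as resp:
            print(method, path, resp.status)

for method, path in promotion_requests("my_woocommerce_products_dr"):
    print(method, path)
```

Calling promote("http://localhost:9200", "my_woocommerce_products_dr") against the DR cluster would execute the steps; the order matters because unfollow only works on a paused, closed follower.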

Orchestrating Cross-Region Failover for WooCommerce

Automating cross-region failover is significantly more complex than single-region HA. It typically involves a combination of:

Global Load Balancers / DNS: A DNS-based or global load-balancing service, whether offered by OVH or by an external provider, can direct traffic to the active region based on health checks.
Automated Scripts: Custom scripts triggered by monitoring alerts (e.g., Prometheus Alertmanager) to perform CCR promotion, update DNS records, and reconfigure application endpoints.
Infrastructure as Code (IaC): Tools like Terraform or Ansible can be used to provision and configure the DR environment and manage the failover process.

Example: DNS Failover with Health Checks

Configure your DNS records (e.g., search.yourdomain.com) to point to the VIP of your primary region’s HAProxy. Set up health checks for this endpoint. If the health checks fail consistently, an automated process (e.g., a scheduled job, a serverless function, or a CI/CD pipeline job) can update the DNS record to point to the VIP of the DR region’s HAProxy. DNS propagation time, which depends largely on the record’s TTL and on how long clients cache it, is a critical factor in RTO (Recovery Time Objective), so keep the TTL short for this record.
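The decision to switch DNS and its effect on RTO can be sketched in a few lines of Python. The threshold, intervals, and function names below are illustrative assumptions, not recommended values, and the RTO figure is a rough estimate that ignores resolvers that disregard TTLs.

```python
def should_fail_over(recent_checks, threshold=3):
    """True when the last `threshold` health checks all failed."""
    return len(recent_checks) >= threshold and not any(recent_checks[-threshold:])

def estimated_rto_seconds(check_interval, threshold, promotion_time, dns_ttl):
    """Rough estimate: detection time + CCR promotion + cached DNS records expiring."""
    return check_interval * threshold + promotion_time + dns_ttl

print(should_fail_over([True, False, False, False]))  # True
print(should_fail_over([False, False, True]))         # False
print(estimated_rto_seconds(check_interval=10, threshold=3,
                            promotion_time=60, dns_ttl=60))  # 150
```

Requiring several consecutive failures guards against switching regions on a single dropped check, but every extra check adds directly to the recovery time.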

Conclusion

Architecting auto-failover for Elasticsearch and WooCommerce on OVH requires a layered approach. For high availability within a region, Keepalived and HAProxy provide a robust VIP failover and load balancing solution. For true disaster recovery, Elasticsearch’s Cross-Cluster Replication, combined with intelligent orchestration and DNS management, is essential. Continuous testing and monitoring are non-negotiable to ensure these systems perform as expected when disaster strikes.