Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and WooCommerce Deployments on OVH
Elasticsearch Cluster Architecture for High Availability
Achieving robust disaster recovery for Elasticsearch hinges on a well-defined, multi-zone deployment strategy. For production environments, especially those powering critical applications like WooCommerce, a single-node setup, or even a cluster confined to a single data center, is a single point of failure. We’ll focus on an architecture with several master-eligible nodes spread across multiple OVH data centers, so that the cluster can elect a new master and keep serving requests when one location is lost.
A typical HA Elasticsearch cluster comprises multiple master-eligible nodes, data nodes, and coordinating nodes. For disaster recovery, the key is to distribute these roles across distinct failure domains (e.g., OVH’s GRA and RBX data centers in France, or BHS in Canada) and, for true DR, across different geographic regions. With at least three master-eligible nodes in separate zones, a quorum survives the loss of any one zone, so the failure of an entire data center need not lead to data loss or service interruption.
Configuring Elasticsearch for Zone Awareness
Elasticsearch’s shard allocation awareness is crucial for distributing data across failure domains. By configuring cluster.routing.allocation.awareness.attributes, we instruct Elasticsearch to consider specific node attributes (like availability zone) when placing shards. This keeps a primary shard and all of its replicas from residing in the same zone, provided the index has at least one replica and nodes are available in more than one zone.
On each Elasticsearch node, you’ll need to define its zone in the elasticsearch.yml configuration file. This is typically done by setting a system environment variable or by directly adding attributes to the node’s configuration.
Node Configuration Example (elasticsearch.yml)
For a node in the GRA (Gravelines) zone:
cluster.name: "my-prod-es-cluster"
node.name: "es-node-gra-01"
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
# Zone awareness configuration
node.attr.zone: "gra"
discovery.seed_hosts:
  - "es-node-gra-01:9300"
  - "es-node-rbx-01:9300"
  - "es-node-bhs-01:9300"
cluster.initial_master_nodes:
  - "es-node-gra-01"
  - "es-node-rbx-01"
  - "es-node-bhs-01"
cluster.routing.allocation.awareness.attributes: "zone"
cluster.routing.allocation.enable: "all"
cluster.routing.allocation.total_shards_per_node: 1000 # Adjust as per your needs
Repeat this configuration for nodes in other zones (e.g., node.attr.zone: "rbx" for a node in Roubaix) and ensure discovery.seed_hosts and cluster.initial_master_nodes include nodes from all intended zones.
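Once the cluster is running, it is worth verifying that the awareness setting actually spread shard copies across zones. The following is a minimal sketch, not a definitive tool: it assumes rows shaped like the output of GET _cat/shards?format=json (fields index, shard, state, node) and a hypothetical node-to-zone mapping that you supply yourself, for example from the node.attr.zone values above.

```python
from collections import defaultdict


def shards_missing_zone_spread(shard_rows, node_zone):
    """Return (index, shard) pairs whose started copies all sit in one zone.

    shard_rows: dicts with "index", "shard", "state" and "node" keys
                (assumed to mirror _cat/shards JSON rows).
    node_zone:  hypothetical mapping of node name -> zone label.
    """
    zones_per_shard = defaultdict(set)
    copies_per_shard = defaultdict(int)
    for row in shard_rows:
        if row.get("state") != "STARTED":
            continue  # ignore unassigned or relocating copies
        key = (row["index"], row["shard"])
        zones_per_shard[key].add(node_zone[row["node"]])
        copies_per_shard[key] += 1
    # Shards with a single copy are not flagged: they have no replica to spread.
    return sorted(
        key
        for key, zones in zones_per_shard.items()
        if copies_per_shard[key] > 1 and len(zones) < 2
    )


if __name__ == "__main__":
    rows = [
        {"index": "products", "shard": "0", "state": "STARTED", "node": "es-node-gra-01"},
        {"index": "products", "shard": "0", "state": "STARTED", "node": "es-node-rbx-01"},
    ]
    zones = {"es-node-gra-01": "gra", "es-node-rbx-01": "rbx"}
    print(shards_missing_zone_spread(rows, zones))
```

An empty result means every replicated shard has copies in at least two zones; any pair that is returned deserves a closer look with the cluster allocation explain API.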
Automated Failover Strategy: Orchestration with Keepalived and HAProxy
While Elasticsearch itself provides high availability through its distributed nature and shard replication, the client-facing endpoint needs a robust failover mechanism. For WooCommerce, this means ensuring that the API requests and search queries directed to Elasticsearch are always routed to a healthy instance. We’ll use Keepalived for Virtual IP (VIP) management and HAProxy for load balancing and health checking.
Keepalived Configuration for VIP Failover
Keepalived uses the Virtual Router Redundancy Protocol (VRRP) to manage a floating IP address. Two or more Keepalived instances are configured, with one acting as the master and others as backups. If the master fails, a backup automatically takes over the VIP.
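This setup requires at least two dedicated servers (or VMs) in your infrastructure, ideally placed in different availability zones. Because VRRP advertisements and the floating IP depend on the servers sharing a network segment, both must sit on the same private network (on OVH, typically a vRack); failover between regions that do not share such a network needs a different mechanism, such as DNS. These servers will run HAProxy, which in turn will point to your Elasticsearch nodes.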
Keepalived Configuration (keepalived.conf)
On the primary Keepalived server (e.g., in GRA):
vrrp_script chk_haproxy {
    script "/usr/local/bin/check_haproxy.sh"
    interval 2
    weight 2
    fall 2
    rise 2
}
vrrp_instance VI_1 {
    state MASTER # Use BACKUP on the secondary server
    interface eth0 # Your network interface
    virtual_router_id 51
    priority 101 # Higher priority for MASTER (e.g., 101), lower for BACKUP (e.g., 100)
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mysecretpassword # Replace with your own; Keepalived uses only the first 8 characters
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:0 # Your floating VIP
    }
    track_script {
        chk_haproxy
    }
}
On the secondary Keepalived server (e.g., in RBX), set state BACKUP and a lower priority (e.g., 100). The virtual_router_id and authentication settings must match on both servers.
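To see why priorities of 101 and 100 combined with weight 2 produce a failover when HAProxy dies on the primary, the election can be modeled in a few lines. This is a simplified sketch under stated assumptions: preemption is enabled, both servers can see each other’s advertisements, and ties (which VRRP resolves by IP address) are ignored. The function names are illustrative and not part of Keepalived.

```python
def effective_priority(base, weight, script_ok):
    """Priority after applying a tracked script's weight.

    A positive weight is added while the script succeeds; a negative
    weight is applied (subtracting from the base) while it fails.
    """
    if weight > 0 and script_ok:
        return base + weight
    if weight < 0 and not script_ok:
        return base + weight
    return base


def elect_master(nodes):
    """nodes maps a name to (base_priority, weight, script_ok).

    Returns the name with the highest effective priority.
    """
    return max(nodes, key=lambda name: effective_priority(*nodes[name]))


if __name__ == "__main__":
    healthy = {"gra": (101, 2, True), "rbx": (100, 2, True)}
    haproxy_down_on_gra = {"gra": (101, 2, False), "rbx": (100, 2, True)}
    print(elect_master(healthy))              # gra (103 vs 102)
    print(elect_master(haproxy_down_on_gra))  # rbx (101 vs 102)
```

The takeaway is that the weight must exceed the gap between the two base priorities; with a gap of 1 and a weight of 2, a failed check on the primary is enough to hand the VIP to the backup.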
HAProxy Configuration for Elasticsearch Load Balancing
HAProxy will listen on the floating VIP and distribute traffic to healthy Elasticsearch nodes. It performs active health checks, removing unhealthy nodes from the pool.
HAProxy Configuration (haproxy.cfg)
global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon
defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
frontend es_frontend
    # Listen on the floating VIP. Set net.ipv4.ip_nonlocal_bind=1 so HAProxy
    # can also start on the server that does not currently hold the VIP.
    bind 192.168.1.100:9200
    mode http
    default_backend es_backend
backend es_backend
    mode http
    balance roundrobin
    option httpchk GET /_cluster/health?pretty
    http-check expect status 200
    server es-gra-01 10.0.0.1:9200 check port 9200 inter 2s fall 3 rise 2
    server es-rbx-01 10.0.0.2:9200 check port 9200 inter 2s fall 3 rise 2
    server es-bhs-01 10.0.0.3:9200 check port 9200 inter 2s fall 3 rise 2
    # Add more servers as needed, potentially from different regions for DR
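The option httpchk and http-check expect status 200 lines are critical for HAProxy to verify the health of Elasticsearch nodes. Note that /_cluster/health returns HTTP 200 even when the cluster status is yellow or red, so this check confirms that a node is reachable and answering rather than that the cluster is fully healthy. The check, port, inter, fall, and rise parameters define the health check behavior: a probe every 2 seconds, with a server marked down after 3 consecutive failures and up again after 2 consecutive successes. Adjust IP addresses and ports to match your Elasticsearch node configurations.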
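The effect of fall and rise is easiest to see by replaying a sequence of check results. The sketch below is a simplified model of that counting logic, not HAProxy’s actual implementation; the function name and the initially_up parameter are assumptions made for illustration.

```python
def track_server_state(check_results, fall=3, rise=2, initially_up=True):
    """Replay health-check results and return the server state after each one.

    A server that is up goes down after `fall` consecutive failures;
    a server that is down comes back after `rise` consecutive successes.
    """
    up = initially_up
    streak = 0  # consecutive results that contradict the current state
    states = []
    for ok in check_results:
        if ok == up:
            streak = 0  # result agrees with the current state
        else:
            streak += 1
            if streak >= (fall if up else rise):
                up = not up
                streak = 0
        states.append(up)
    return states


if __name__ == "__main__":
    results = [True, False, False, False, True, True]
    print(track_server_state(results))
    # [True, True, True, False, False, True]
```

With inter 2s and fall 3, a dead node keeps receiving traffic for roughly up to 6 seconds before it is removed from rotation, which is a useful number to keep in mind when reading failover test results.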
Health Check Script for Keepalived
The /usr/local/bin/check_haproxy.sh script referenced in Keepalived’s configuration is essential for ensuring that the VIP only floats to a server that is actively managing a healthy HAProxy instance.
#!/bin/bash
# Exit 0 while an haproxy process is running, 1 otherwise.
if [ "$(pidof haproxy)" ]; then
    exit 0
else
    exit 1
fi
Make this script executable: chmod +x /usr/local/bin/check_haproxy.sh.
WooCommerce Integration and Failover Testing
WooCommerce applications typically interact with Elasticsearch for product search, filtering, and potentially other features. The key is to configure WooCommerce to point to the floating VIP address managed by Keepalived, rather than directly to an individual Elasticsearch node.
WooCommerce Elasticsearch Plugin Configuration
Within your WooCommerce admin panel, navigate to the Elasticsearch plugin settings (the exact location depends on the plugin used, e.g., “ElasticPress”). You will find fields for the Elasticsearch host(s). Enter the floating VIP address here.
// Example configuration snippet (plugin dependent)
// In WooCommerce settings or wp-config.php
define( 'EP_HOST', 'http://192.168.1.100:9200' ); // Use the floating VIP
If your plugin supports multiple hosts, you can list them, but for this HA setup, the VIP is the primary target. The plugin will then communicate with HAProxy, which will route each request to a healthy Elasticsearch node.
Simulating Failures for Testing
Thorough testing is paramount. You must simulate various failure scenarios to validate the auto-failover mechanism.
- Elasticsearch Node Failure: Stop an Elasticsearch service on one of the nodes. Monitor HAProxy’s logs and the Elasticsearch cluster health API (via the VIP) to confirm the node is marked as down and traffic is rerouted.
- HAProxy Server Failure: Stop the HAProxy service on the server that currently holds the VIP. Observe whether Keepalived promotes the backup server and the VIP moves.
- Keepalived Server Failure: Shut down one of the Keepalived servers. Verify that the other server takes over the VIP.
- Network Partition: Simulate network issues between zones or regions to test how Elasticsearch handles split-brain scenarios and how clients reconnect.
- Full Zone/Region Outage: If deploying across multiple OVH regions, simulate an entire region outage to test cross-region failover.
During testing, pay close attention to:
- The time it takes for the VIP to fail over.
- The accuracy of HAProxy’s health checks.
- The impact on WooCommerce user experience (e.g., search latency, errors).
- Data consistency across Elasticsearch replicas.
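To put a number on failover time, one option is to poll the VIP at a fixed interval during a test and compute the longest gap between a failed probe and the next successful one. The following is a minimal sketch: probe and longest_outage are hypothetical helper names, and the URL you pass would be whatever endpoint sits behind your VIP.

```python
import urllib.error
import urllib.request


def probe(url, timeout=1.0):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def longest_outage(samples):
    """Longest span between a first failed probe and the next success.

    samples: (timestamp_seconds, ok) tuples in chronological order.
    An outage that is still open at the end of the samples is ignored.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    return longest


if __name__ == "__main__":
    recorded = [(0, True), (1, False), (2, False), (3, False), (4, True)]
    print(longest_outage(recorded))  # 3
```

In a real test you would collect the samples in a loop, calling probe once per interval with time.monotonic() as the timestamp. The measured value is only as precise as the polling interval.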
Cross-Region Disaster Recovery for Elasticsearch
For true disaster recovery, a single-region multi-zone setup is insufficient. A catastrophic event affecting an entire OVH region requires a cross-region strategy. This involves replicating your Elasticsearch data to a separate, geographically distant OVH region.
Elasticsearch Cross-Cluster Replication (CCR)
Elasticsearch’s Cross-Cluster Replication (CCR) feature allows you to replicate indices from a primary (leader) cluster in one region to a secondary (follower) cluster in another. This is a powerful tool for DR: the follower indices in the DR region are read-only copies that can be converted into regular writable indices if the main region becomes unavailable. CCR is a subscription feature, so confirm that your Elastic license level includes it before building on it.
CCR Configuration Steps
1. Set up a secondary Elasticsearch cluster in a different OVH region (e.g., RBX if your primary is GRA). Ensure it has sufficient capacity.
2. Configure the remote cluster connections. CCR is pull-based, so the secondary (follower) cluster must know how to reach the primary. Registering the secondary on the primary as well is only needed if you plan to replicate in the reverse direction after a failover. The connection details can be defined in each cluster’s elasticsearch.yml.
# In primary cluster (GRA) elasticsearch.yml
cluster.remote.secondary_cluster.seeds:
  - "es-dr-node-rbx-01:9300"
  - "es-dr-node-rbx-02:9300"
# In secondary cluster (RBX) elasticsearch.yml
cluster.remote.primary_cluster.seeds:
  - "es-gra-node-01:9300"
  - "es-gra-node-02:9300"
3. Create follower indices. On the secondary cluster, you define which indices from the primary cluster should be replicated. This is done via the Elasticsearch API.
# On the secondary cluster (RBX)
PUT /my_woocommerce_products_dr/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "primary_cluster",
  "leader_index": "my_woocommerce_products"
}
4. Configure auto-follow patterns for automatic replication of new indices.
# On the secondary cluster (RBX)
PUT /_ccr/auto_follow/my_auto_follow_pattern
{
  "remote_cluster": "primary_cluster",
  "leader_index_patterns": ["my_woocommerce_*"],
  "follow_index_pattern": "{{leader_index}}_dr"
}
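If you script these calls, it helps to build the requests in one place so index names and patterns stay consistent. The sketch below only constructs the method, path, and body for the documented create-follower (PUT /<follower_index>/_ccr/follow) and auto-follow (PUT /_ccr/auto_follow/<name>) endpoints; the helper names are hypothetical, and sending the requests is left to whatever HTTP client you use.

```python
def follow_request(follower_index, leader_index, remote_cluster):
    """Build the create-follower request for a single index."""
    body = {"remote_cluster": remote_cluster, "leader_index": leader_index}
    return ("PUT", f"/{follower_index}/_ccr/follow", body)


def auto_follow_request(name, remote_cluster, leader_patterns, suffix):
    """Build an auto-follow pattern whose followers get `suffix` appended."""
    body = {
        "remote_cluster": remote_cluster,
        "leader_index_patterns": list(leader_patterns),
        # {{leader_index}} is replaced by Elasticsearch with the leader's name.
        "follow_index_pattern": "{{leader_index}}" + suffix,
    }
    return ("PUT", f"/_ccr/auto_follow/{name}", body)


def rendered_follower_name(follow_index_pattern, leader_index):
    """Preview the follower name a pattern would produce (local helper only)."""
    return follow_index_pattern.replace("{{leader_index}}", leader_index)


if __name__ == "__main__":
    method, path, body = auto_follow_request(
        "my_auto_follow_pattern", "primary_cluster", ["my_woocommerce_*"], "_dr"
    )
    print(method, path)
    print(rendered_follower_name(body["follow_index_pattern"], "my_woocommerce_orders"))
    # my_woocommerce_orders_dr
```

Keep in mind that an auto-follow pattern applies to indices created on the leader after the pattern exists; indices that are already there still need an explicit follow request.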
Promoting the DR Cluster
In a disaster scenario, the process involves:
- Pause replication on each follower index so that it stops pulling changes if the primary comes back online unexpectedly.
- Convert the follower indices on the secondary cluster into regular read/write indices by closing them, unfollowing the leader, and reopening them.
- Update WooCommerce configuration (or your application’s configuration) to point to the newly promoted DR Elasticsearch cluster’s VIP. This would involve a similar Keepalived/HAProxy setup in the DR region, or a DNS-based failover mechanism.
Elasticsearch has no single promote call; the sequence for each follower index looks like this:
# On the secondary cluster (RBX)
POST /my_woocommerce_products_dr/_ccr/pause_follow
POST /my_woocommerce_products_dr/_close
POST /my_woocommerce_products_dr/_ccr/unfollow
POST /my_woocommerce_products_dr/_open
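In Elasticsearch’s CCR APIs, turning a follower into a regular index is a four-step sequence (pause_follow, close, unfollow, open), and the order matters. A small driver that stops at the first failed step keeps a half-promoted index from going unnoticed. This is a sketch under stated assumptions: promotion_steps and run_promotion are hypothetical names, and send stands in for whatever function issues the HTTP request and reports success.

```python
def promotion_steps(follower_index):
    """Ordered API calls that convert a follower index into a regular index."""
    return [
        ("POST", f"/{follower_index}/_ccr/pause_follow"),
        ("POST", f"/{follower_index}/_close"),
        ("POST", f"/{follower_index}/_ccr/unfollow"),
        ("POST", f"/{follower_index}/_open"),
    ]


def run_promotion(follower_index, send):
    """Run the steps in order; stop at the first failure.

    send(method, path) must return True on success.
    Returns the list of steps that completed.
    """
    completed = []
    for method, path in promotion_steps(follower_index):
        if not send(method, path):
            break
        completed.append((method, path))
    return completed


if __name__ == "__main__":
    # Dry run: print each call instead of sending it.
    def dry_run(method, path):
        print(method, path)
        return True

    run_promotion("my_woocommerce_products_dr", dry_run)
```

If fewer than four steps come back as completed, the index may be paused or closed but not yet writable, so the script should alert rather than repoint WooCommerce at the DR cluster.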
Orchestrating Cross-Region Failover for WooCommerce
Automating cross-region failover is significantly more complex than single-region HA. It typically involves a combination of:
- Global Load Balancers / DNS: A DNS-based or global load-balancing service, whether from OVH or an external provider, can direct traffic to the active region based on health checks.
- Automated Scripts: Custom scripts triggered by monitoring alerts (e.g., Prometheus Alertmanager) to perform CCR promotion, update DNS records, and reconfigure application endpoints.
- Infrastructure as Code (IaC): Tools like Terraform or Ansible can be used to provision and configure the DR environment and manage the failover process.
Example: DNS Failover with Health Checks
Configure your DNS records (e.g., search.yourdomain.com) to point to the VIP of your primary region’s HAProxy, using a short TTL. Set up health checks for this endpoint. If the health checks fail consistently, an automated process (e.g., a serverless function or a CI/CD pipeline job) can update the DNS record to point to the VIP of the DR region’s HAProxy. Clients keep using the old address until their cached record expires, so the record’s TTL, together with the time needed to detect the failure, is a critical factor in RTO (Recovery Time Objective).
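The decision to switch DNS should be based on a run of consecutive failures rather than a single missed check, and the resulting delay can be estimated up front. The sketch below illustrates both ideas; the function names, the threshold of 3, and the example interval and TTL values are assumptions, not recommendations from any particular DNS provider.

```python
def should_fail_over(recent_checks, threshold=3):
    """True when the most recent `threshold` health checks all failed."""
    if len(recent_checks) < threshold:
        return False
    return not any(recent_checks[-threshold:])


def worst_case_switch_seconds(check_interval, threshold, ttl):
    """Rough upper bound on client-visible delay.

    Detection time (interval x threshold) plus the lifetime of a record
    cached just before the update. Ignores the time the update itself
    takes and resolvers that do not honor the TTL.
    """
    return check_interval * threshold + ttl


if __name__ == "__main__":
    print(should_fail_over([True, False, False, False]))  # True
    print(should_fail_over([False, False, True]))         # False
    print(worst_case_switch_seconds(check_interval=10, threshold=3, ttl=60))  # 90
```

Comparing that estimate against your RTO shows quickly whether a DNS-based switch is fast enough or whether the TTL and check interval need to come down.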
Conclusion
Architecting auto-failover for Elasticsearch and WooCommerce on OVH requires a layered approach. For high availability within a region, Keepalived and HAProxy provide a robust VIP failover and load balancing solution. For true disaster recovery, Elasticsearch’s Cross-Cluster Replication, combined with intelligent orchestration and DNS management, is essential. Continuous testing and monitoring are non-negotiable to ensure these systems perform as expected when disaster strikes.