Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and WordPress Deployments on OVH

Automated Elasticsearch Failover with OVH Public Cloud Load Balancers

Achieving high availability for Elasticsearch clusters, especially when coupled with critical applications like WordPress, necessitates a robust failover strategy. This section details how to architect an automated failover mechanism for Elasticsearch leveraging OVH Public Cloud’s Load Balancer service. We’ll focus on a multi-node Elasticsearch cluster and configure the load balancer to detect and route traffic away from unhealthy nodes.

Elasticsearch Cluster Setup for High Availability

A foundational element for failover is a resilient Elasticsearch cluster. For this scenario, we assume a minimum of three Elasticsearch nodes, each holding the master-eligible, data, and ingest roles. With three master-eligible nodes, a majority (two) survives the loss of any single node, so the cluster can maintain quorum and continue operations. Key configurations within elasticsearch.yml should include:

discovery.seed_hosts: A list of potential master nodes to bootstrap discovery.
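cluster.initial_master_nodes: The names of the master-eligible nodes that take part in the very first master election. It is only used when a brand-new cluster bootstraps and should be removed once the cluster has formed.
node.roles: [ master, data, ingest ] (adjust per node). Releases before 7.9 used the boolean settings node.master, node.data, and node.ingest; these are deprecated in favor of node.roles and no longer accepted in 8.x.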

Ensure these nodes are accessible within your OVH Public Cloud network, ideally on private IPs for security and performance. For example, nodes might be on 10.0.0.1, 10.0.0.2, and 10.0.0.3.
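The quorum claim above comes down to simple majority arithmetic. The following is a minimal sketch of that arithmetic only; the function names are illustrative, and Elasticsearch manages its own voting configuration internally, so this is not its exact algorithm.

```python
def quorum(master_eligible: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


# Print nodes, quorum size, and tolerated failures for a few cluster sizes.
for n in (1, 2, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
```

Under this arithmetic, two nodes tolerate no more failures than one, which is why three is the usual minimum for a highly available cluster.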

OVH Public Cloud Load Balancer Configuration

OVH’s Load Balancer service (available via the OVHcloud Control Panel or API) is central to our automated failover. We will configure a TCP load balancer to distribute traffic across our Elasticsearch nodes. The critical components are the frontend configuration and the backend pool with health checks.

Frontend Configuration

The frontend will listen on a public IP address and a specific port (typically 9200 for Elasticsearch HTTP API). This is the endpoint your WordPress application or other clients will connect to.

Backend Pool and Health Checks

The backend pool will contain the private IP addresses and port of your Elasticsearch nodes. The health check is paramount for automated failover. The simplest option is a TCP check, but an HTTP check is more robust.

TCP Health Check (Basic)

A simple TCP check verifies if the port is open and accepting connections. While quick, it doesn’t confirm Elasticsearch is fully operational.

HTTP Health Check (Recommended)

A more effective health check uses the Elasticsearch HTTP API. We can configure the load balancer to send a GET request to the /_cluster/health endpoint and expect HTTP status 200 together with a JSON body containing "status":"green" or "status":"yellow". The body match matters: the endpoint normally answers 200 even when the cluster status is red, so the status code alone only shows that the node is up and responding. Keep in mind that this endpoint reports cluster-wide health, so a red cluster fails the check on every node at once.
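To make the check logic concrete, here is a minimal Python sketch of what such a probe evaluates. It is an illustration rather than how the OVH Load Balancer is implemented; the function names and the example base URL are assumptions.

```python
import json
import urllib.error
import urllib.request

ACCEPTABLE = {"green", "yellow"}


def is_healthy(status_code: int, body: str) -> bool:
    """Return True only for HTTP 200 with a cluster status of green or yellow."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") in ACCEPTABLE


def probe(base_url: str, timeout: float = 2.0) -> bool:
    """GET <base_url>/_cluster/health and evaluate the response.

    base_url is a placeholder such as "http://10.0.0.1:9200".
    """
    url = base_url.rstrip("/") + "/_cluster/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Refused connection, timeout, HTTP error status, or malformed URL.
        return False
```

Parsing the JSON instead of matching a raw substring makes the check independent of whitespace in the response body.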

Example configuration parameters for the health check:

Protocol: HTTP
Port: 9200
URI: /_cluster/health
Method: GET
Expected Status Code: 200
Response Body Match (recommended where the load balancer supports it): "status":"green" or "status":"yellow"
Interval: 5s (check every 5 seconds)
Timeout: 2s
Unhealthy Threshold: 3 (mark as unhealthy after 3 consecutive failures)
Healthy Threshold: 2 (mark as healthy after 2 consecutive successes)

When a node fails the health check, the OVH Load Balancer will automatically stop sending traffic to it. Once the node recovers and passes the health checks again, it will be automatically added back into the rotation.
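The threshold behavior can be sketched as a small state machine. This is a simplified model under the parameters listed above, not the load balancer's actual implementation; the class name and the detection-time estimate are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackendState:
    """Tracks one backend using consecutive-result thresholds."""
    unhealthy_threshold: int = 3
    healthy_threshold: int = 2
    in_rotation: bool = True
    failures: int = 0
    successes: int = 0

    def record(self, check_passed: bool) -> bool:
        """Feed one health-check result; return whether the backend is in rotation."""
        if check_passed:
            self.successes += 1
            self.failures = 0  # a success breaks any failure streak
            if not self.in_rotation and self.successes >= self.healthy_threshold:
                self.in_rotation = True
        else:
            self.failures += 1
            self.successes = 0  # a failure breaks any success streak
            if self.in_rotation and self.failures >= self.unhealthy_threshold:
                self.in_rotation = False
        return self.in_rotation


def rough_detection_seconds(interval: float, timeout: float, unhealthy_threshold: int) -> float:
    """Rough upper estimate: N checks spaced by the interval, the last one timing out."""
    return interval * unhealthy_threshold + timeout
```

With the example values (5s interval, 2s timeout, threshold of 3), this estimate gives about 17 seconds before a dead node leaves the rotation, which is a useful number to compare against client timeouts.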

WordPress Integration with Elasticsearch

For WordPress, the integration typically involves a plugin that replaces the default WordPress search with Elasticsearch queries. A popular choice is ElasticPress. The crucial aspect here is how the plugin is configured to connect to Elasticsearch.

Configuring ElasticPress for Load Balancer Endpoint

Instead of pointing ElasticPress directly to individual Elasticsearch node IPs, configure it to use the public IP address and port of the OVH Load Balancer. This abstracts the underlying Elasticsearch cluster topology from the WordPress application.

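In the WordPress admin area, navigate to the ElasticPress settings. The plugin takes the endpoint as a single Elasticsearch host URL (for example, http://<load-balancer-ip>:9200), which combines the following values: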

Host: The public IP of your OVH Load Balancer.
Port: 9200
Protocol: HTTP or HTTPS (if you’ve configured SSL termination on the LB).
Authentication (if applicable): Username and Password.

Ensure your WordPress server has network access to the Load Balancer’s public IP on port 9200. If your Elasticsearch cluster is on a private network, the Load Balancer must also have access to the private IPs of the Elasticsearch nodes.
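A quick way to confirm reachability from the WordPress server is a plain TCP connection test against the load balancer endpoint. The sketch below is a hypothetical helper, not part of ElasticPress; the function names are assumptions, and any address you pass in is a placeholder for your own load balancer IP.

```python
import socket
from urllib.parse import urlsplit


def endpoint_url(protocol: str, host: str, port: int) -> str:
    """Combine protocol, host, and port into a single URL, e.g. http://host:9200."""
    if protocol not in ("http", "https"):
        raise ValueError(f"unsupported protocol: {protocol!r}")
    return f"{protocol}://{host}:{port}"


def can_connect(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

A successful connection only proves that the network path and firewall rules are in place; it says nothing about cluster health, which remains the job of the load balancer's health check.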

Automated WordPress Failover (Application Level)

While the Elasticsearch failover is handled by the OVH Load Balancer, WordPress itself needs to be resilient. This typically involves deploying WordPress across multiple availability zones or even regions, with a robust database failover strategy and potentially a shared file system or object storage for uploads.

Database Failover (MySQL/MariaDB)

For the WordPress database, a common strategy is to use a managed database service that offers automatic failover (e.g., OVHcloud Managed Databases for MySQL/MariaDB) or to set up a replication cluster with a virtual IP (VIP) managed by a tool like Keepalived.

Example Keepalived Configuration (Conceptual):

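# Runs on both database servers; only "priority" differs between them.
vrrp_instance VI_1 {
    state BACKUP              # start as BACKUP; the election picks the MASTER
    interface eth0            # interface that carries VRRP and the VIP
    virtual_router_id 51      # must match on both peers
    priority 100              # higher value wins; e.g. use 150 on the preferred node
    advert_int 1              # advertisement interval in seconds
    authentication {
        auth_type PASS
        auth_pass mysecretpassword   # placeholder; Keepalived uses only the first 8 characters
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }
}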

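In this setup, two database servers run Keepalived. The node with the higher priority becomes MASTER and holds the VIP (192.168.1.100). If the MASTER stops sending VRRP advertisements, the BACKUP server detects this and transitions to MASTER, taking over the VIP. WordPress is configured to connect to this VIP. On its own, VRRP only notices that the peer host or its Keepalived process is gone; to fail over when the database service itself stops responding, pair the instance with a vrrp_script and track_script check. The VIP also only moves the address, so the replica must be ready to accept writes when it takes over.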
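The election outcome can be modeled in a few lines. This is a simplified sketch of the steady-state result with preemption enabled, not Keepalived's actual state machine; the node names, priorities, and addresses are made up for illustration.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VrrpNode:
    name: str
    priority: int       # the "priority" value from the vrrp_instance block
    address: str        # primary IP of the VRRP interface, used as tie-breaker
    alive: bool = True  # whether the node is still sending advertisements


def vip_holder(nodes: List[VrrpNode]) -> Optional[VrrpNode]:
    """Return the node that should hold the VIP, or None if no node is alive."""
    candidates = [n for n in nodes if n.alive]
    if not candidates:
        return None
    # Highest priority wins; equal priorities fall back to the higher IP address.
    return max(candidates, key=lambda n: (n.priority, ipaddress.ip_address(n.address)))


db1 = VrrpNode("db1", priority=150, address="192.168.1.11")
db2 = VrrpNode("db2", priority=100, address="192.168.1.12")
print(vip_holder([db1, db2]).name)
```

Marking db1 as not alive and calling vip_holder again shows the VIP moving to db2, which is the behavior a failover test should confirm on the real servers.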

Web Server Failover

Deploying multiple WordPress web servers behind another OVH Load Balancer (this time, likely an HTTP/HTTPS LB) provides web server redundancy. Health checks would target the WordPress application’s homepage or a specific health check endpoint.

Monitoring and Alerting

Automated failover is only effective if you are aware of failures and recoveries. Implement comprehensive monitoring:

Elasticsearch Cluster Health: Monitor _cluster/health status (green, yellow, red) via API calls or dedicated monitoring tools (e.g., Prometheus with Elasticsearch exporter).
Load Balancer Health Checks: OVH provides logs and metrics for load balancer health checks. Integrate these into your alerting system.
WordPress Application Health: Implement a dedicated health check endpoint in WordPress (e.g., /healthz) that checks database connectivity, Elasticsearch connectivity, and essential plugin status.
Server Resources: Monitor CPU, memory, disk I/O, and network traffic on all Elasticsearch and WordPress nodes.

Configure alerts for prolonged periods of unhealthy nodes, cluster red status, or application errors. Tools like Grafana, Prometheus Alertmanager, or cloud-native alerting services can be integrated.
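An alert on a "prolonged" bad state needs a duration condition so that a single red sample does not page anyone. Here is a minimal sketch of such a rule, assuming status samples are collected as (timestamp, status) pairs; the function name and the 60-second default are illustrative choices, not a recommendation from any particular tool.

```python
from typing import Iterable, Tuple

Sample = Tuple[float, str]  # (unix timestamp in seconds, cluster status)


def should_alert(samples: Iterable[Sample], bad_status: str = "red",
                 min_duration: float = 60.0) -> bool:
    """True if the latest samples show bad_status continuously for min_duration seconds."""
    ordered = sorted(samples)
    if not ordered or ordered[-1][1] != bad_status:
        return False
    # Walk backwards to find where the current streak of bad samples began.
    streak_start = ordered[-1][0]
    for ts, status in reversed(ordered):
        if status != bad_status:
            break
        streak_start = ts
    return ordered[-1][0] - streak_start >= min_duration
```

Prometheus Alertmanager rules express the same idea with a duration clause on the alert expression; the sketch just spells out the logic.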

Testing Your Failover Strategy

Regularly test your failover mechanisms to ensure they function as expected. This includes:

Simulating Node Failures: Gracefully stop an Elasticsearch node and observe if the OVH Load Balancer redirects traffic. Then, restart the node and verify it’s re-added.
Network Disruptions: Simulate network partitions between nodes or between WordPress and Elasticsearch.
Database Failover Testing: Force a database failover and confirm WordPress remains accessible and functional.
Full System Outage Simulation: If possible, test a scenario where an entire availability zone becomes unavailable.

Document the expected behavior for each test scenario and compare it against the actual results. This iterative process is crucial for building confidence in your disaster recovery architecture.
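One way to compare expected and actual behavior is to probe the endpoint at a fixed interval during a test and measure the longest observed outage. The sketch below assumes the probe results were recorded as (timestamp, ok) pairs; the function names and the budget value you pass in are assumptions for illustration.

```python
from typing import Iterable, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Longest span in seconds from the first failed probe to the next successful one.

    An outage still open at the end is measured up to the last sample.
    """
    longest = 0.0
    outage_start = None
    last_ts = None
    for ts, ok in sorted(samples):
        last_ts = ts
        if not ok and outage_start is None:
            outage_start = ts
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None and last_ts is not None:
        longest = max(longest, last_ts - outage_start)
    return longest


def within_budget(samples: Iterable[Tuple[float, bool]], expected_max_seconds: float) -> bool:
    """Compare the measured outage against the documented expectation."""
    return longest_outage(samples) <= expected_max_seconds
```

The measurement is only as precise as the probing interval, so probe more often than the failover time you expect to see.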