Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and WordPress Deployments on DigitalOcean

Elasticsearch Cluster Setup for High Availability

Achieving automated failover for Elasticsearch necessitates a robust, multi-node cluster configuration. We’ll focus on a setup designed for resilience, leveraging Elasticsearch’s built-in master-eligible nodes and shard replication. For this example, we’ll assume three DigitalOcean Droplets, each running Ubuntu 22.04 LTS, serving as Elasticsearch nodes.

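The core of Elasticsearch HA lies in its distributed nature. With multiple master-eligible nodes, the cluster can elect a new master if the current one becomes unavailable, provided a majority of the master-eligible nodes can still reach each other. Three nodes is the practical minimum: a majority of two survives the loss of any single node. Shard replication keeps data available even if a data node fails.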
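
The majority rule behind master election can be sketched in a few lines. This is a simplified illustration that assumes a plain majority over a fixed set of master-eligible nodes; the function names are hypothetical and are not part of any Elasticsearch API.

```python
def quorum(master_eligible: int) -> int:
    """Smallest majority of a voting set of the given size."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """Master-eligible nodes that can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} master-eligible node(s): quorum={quorum(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Under this simplified model, two nodes tolerate no more failures than one, which is why the example uses three.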

Node Configuration

Each Droplet will host an Elasticsearch instance. We’ll configure them to discover each other through a static list of seed hosts. Ensure that the necessary ports (9200 for HTTP, 9300 for transport) are open between the nodes, and restrict them to the private network rather than exposing them to the public internet.

Elasticsearch Configuration File (`/etc/elasticsearch/elasticsearch.yml`)

On each node, modify elasticsearch.yml. The key parameters are:

cluster.name: Must be identical across all nodes.
node.name: Unique for each node (e.g., es-node-1, es-node-2, es-node-3).
network.host: Set to the private IP address of the Droplet to bind to specific interfaces and improve security.
discovery.seed_hosts: The addresses of the master-eligible nodes that a starting node contacts to discover the rest of the cluster.
cluster.initial_master_nodes: The node names that vote in the very first master election. It is needed only to bootstrap a brand-new cluster; remove it from every node once the cluster has formed, and do not set it on nodes that join later.
xpack.security.enabled: Set to true for production environments. Elasticsearch 8.x enables it by default; on 7.x it must be enabled explicitly. It requires additional configuration for authentication and TLS.

Here’s an example configuration for es-node-1. Adapt network.host and node.name accordingly for other nodes.

Example `elasticsearch.yml` for `es-node-1`

cluster.name: my-production-cluster
node.name: es-node-1
network.host: 10.10.0.1  # Replace with the private IP of es-node-1
discovery.seed_hosts:
  - 10.10.0.1:9300
  - 10.10.0.2:9300
  - 10.10.0.3:9300
cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"
http.port: 9200
transport.port: 9300
# For production, enable security:
# xpack.security.enabled: true
# xpack.security.transport.ssl.enabled: true
# xpack.security.http.ssl.enabled: true
# xpack.security.transport.ssl.verification_mode: certificate
# xpack.security.transport.ssl.certificate_authorities: [ "certs/ca/ca.crt" ]
# xpack.security.transport.ssl.certificate: "certs/es-node-1.crt"
# xpack.security.transport.ssl.key: "certs/es-node-1.key"
# xpack.security.http.ssl.certificate_authorities: [ "certs/ca/ca.crt" ]
# xpack.security.http.ssl.certificate: "certs/es-node-1.crt"
# xpack.security.http.ssl.key: "certs/es-node-1.key"
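
Since only node.name and network.host differ between nodes, the three files can be generated from one template rather than edited by hand. The following is a minimal sketch, assuming the example node names and private IPs used in this article; the NODES mapping and render_config helper are hypothetical, and the security settings are left out for brevity.

```python
NODES = {  # node name -> private IP (example addresses from this article)
    "es-node-1": "10.10.0.1",
    "es-node-2": "10.10.0.2",
    "es-node-3": "10.10.0.3",
}


def render_config(node_name: str, cluster_name: str = "my-production-cluster") -> str:
    """Return the elasticsearch.yml text for one node of the cluster."""
    if node_name not in NODES:
        raise KeyError(f"unknown node: {node_name}")
    lines = [
        f"cluster.name: {cluster_name}",
        f"node.name: {node_name}",
        f"network.host: {NODES[node_name]}",
        "discovery.seed_hosts:",
        *[f"  - {ip}:9300" for ip in NODES.values()],
        "cluster.initial_master_nodes:",
        *[f'  - "{name}"' for name in NODES],
        "http.port: 9200",
        "transport.port: 9300",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for name in NODES:
        print(f"# ----- {name} -----")
        print(render_config(name))
```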

Shard Replication Configuration

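To ensure data availability, configure the number of replicas for your indices. A common practice is to set number_of_replicas to at least 1, meaning each primary shard will have one replica on a different node. number_of_replicas can be set at index creation time or updated dynamically later; number_of_shards is fixed once the index exists and can only be changed by reindexing, shrinking, or splitting the index.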

{
  "index": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
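
To see what these two numbers mean for a three-node cluster, the arithmetic can be sketched as follows. This is a rough model with hypothetical helper names; it relies on the fact that Elasticsearch never places two copies of the same shard on one node.

```python
def shard_copies(primaries: int, replicas: int) -> int:
    """Total shard copies (primaries plus replicas) the cluster must place."""
    return primaries * (1 + replicas)


def survivable_node_losses(replicas: int, data_nodes: int) -> int:
    """Simultaneous node losses that still leave one copy of every shard.

    Replicas beyond (data_nodes - 1) cannot be assigned anywhere, so they
    add no protection.
    """
    return max(0, min(replicas, data_nodes - 1))


if __name__ == "__main__":
    print("shard copies:", shard_copies(primaries=3, replicas=1))
    print("survivable node losses:", survivable_node_losses(replicas=1, data_nodes=3))
```

With 3 primaries and 1 replica, the cluster holds 6 shard copies and keeps every shard available after losing any one node.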

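You can verify the cluster health and node status using the Elasticsearch API. Because network.host binds Elasticsearch to the Droplet’s private IP, query that address rather than localhost (with security enabled, use https and pass credentials):

curl -X GET "http://10.10.0.1:9200/_cluster/health?pretty"
curl -X GET "http://10.10.0.1:9200/_cat/nodes?v"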
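
The same check can be scripted so that it runs unattended. The sketch below uses only the Python standard library and reads three fields that the _cluster/health response contains (status, number_of_nodes, unassigned_shards). The function names and the expected node count of 3 are assumptions for this example, and the sample response is illustrative rather than captured from a real cluster.

```python
import json
import urllib.request
from typing import List


def fetch_cluster_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/_cluster/health and decode the JSON response."""
    url = f"{base_url.rstrip('/')}/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def health_problems(health: dict, expected_nodes: int = 3) -> List[str]:
    """Return human-readable problems found in a _cluster/health response."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        problems.append(f"only {nodes} of {expected_nodes} nodes have joined")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} shard(s) are unassigned")
    return problems


if __name__ == "__main__":
    # Hypothetical response while one node is down (values are illustrative).
    sample = {"status": "yellow", "number_of_nodes": 2, "unassigned_shards": 3}
    for problem in health_problems(sample):
        print("PROBLEM:", problem)
```

In a real check you would pass the result of fetch_cluster_health to health_problems and exit non-zero when the list is not empty.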

WordPress High Availability with Load Balancer and Database Replication

For WordPress, high availability involves a redundant web server layer and a replicated database layer. We’ll use a DigitalOcean Load Balancer to distribute traffic across multiple WordPress Droplets, backed by either a DigitalOcean Managed MySQL cluster with a standby node or a self-managed primary-replica MySQL setup.

WordPress Droplet Setup

Deploy at least two identical WordPress Droplets. These should be configured to connect to the same database. For simplicity, we’ll assume a single managed MySQL instance initially, but will discuss replication next.

Web Server Configuration (Nginx Example)

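Ensure your web server (e.g., Nginx) is configured to serve your WordPress site. The critical part for HA is that these Droplets are stateless or keep their state externally. WordPress core does not use PHP sessions, but some plugins do; if yours rely on file-based sessions, move them to a shared store such as Redis. Media uploads and plugin or theme files under wp-content must also be identical on every Droplet, either through shared storage or through synchronization.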

server {
    listen 80;
    server_name yourdomain.com;

    root /var/www/html/wordpress;
    index index.php index.html index.htm;

    location / {
        try_files $uri $uri/ /index.php?$args;
    }

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/var/run/php/php8.1-fpm.sock; # Adjust PHP version as needed
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        include fastcgi_params;
    }

    # Deny access to sensitive files
    location ~ /\.ht {
        deny all;
    }
}

Database High Availability

A single MySQL instance is a single point of failure. DigitalOcean Managed Databases offer built-in high availability with automatic failover. For even greater control or if you’re managing your own database, consider replication.

Option 1: DigitalOcean Managed Databases (Recommended)

When creating a Managed Database cluster, select the “High Availability” option. This provisions a primary node and a standby node. In case of primary node failure, the standby node is promoted automatically. Your WordPress application should be configured to connect to the cluster’s read/write endpoint, which will automatically point to the active primary after a failover.

Option 2: MySQL Replication (Self-Managed)

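If managing your own MySQL, set up replication. A common pattern is a primary-replica setup. Replication alone does not fail over: for automatic failover you’ll need an external tool such as Orchestrator to promote a replica, typically paired with a proxy such as ProxySQL to redirect application traffic to the new primary.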

Setting up Primary-Replica Replication

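On the primary MySQL server, add the following to the MySQL configuration file (e.g., /etc/mysql/mysql.conf.d/mysqld.cnf on Ubuntu) and restart MySQL:

# Enable binary logging
[mysqld]
log_bin = /var/log/mysql/mysql-bin.log
server-id = 1
binlog_format = ROW

Then run the following statements in a MySQL session on the primary:

-- Create a replication user (consider restricting '%' to your private subnet)
CREATE USER 'replicator'@'%' IDENTIFIED BY 'your_replication_password';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';
FLUSH PRIVILEGES;

-- Get the current binary log file and position
-- (on MySQL 8.4 and later, use SHOW BINARY LOG STATUS instead)
SHOW MASTER STATUS;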

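On the replica MySQL server, add the following to the MySQL configuration file and restart MySQL:

# Set a unique server ID
[mysqld]
server-id = 2
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = 1  # Or 0 if you plan for multi-master later

Then run the following statements in a MySQL session on the replica. Replace the uppercase placeholders with the primary’s private IP and the File and Position values reported by SHOW MASTER STATUS:

-- Configure replication
-- (MySQL 8.0.23+ also offers CHANGE REPLICATION SOURCE TO, START REPLICA,
-- and SHOW REPLICA STATUS; MySQL 8.4 removes the older names used here)
CHANGE MASTER TO
  MASTER_HOST='PRIMARY_PRIVATE_IP',
  MASTER_USER='replicator',
  MASTER_PASSWORD='your_replication_password',
  MASTER_LOG_FILE='BINLOG_FILE',
  MASTER_LOG_POS=BINLOG_POSITION;

START SLAVE;

-- Verify replication status (\G ends the statement, so no trailing semicolon)
SHOW SLAVE STATUS\G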
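
When reading the replication status, three columns matter most: Slave_IO_Running, Slave_SQL_Running, and Seconds_Behind_Master. The sketch below shows one way to turn a status row into a pass/fail result. It assumes the row has already been fetched into a dictionary by whatever MySQL client you use; the function name and the 30-second lag threshold are hypothetical choices for this example.

```python
from typing import List


def replication_problems(status: dict, max_lag_seconds: int = 30) -> List[str]:
    """Inspect one SHOW SLAVE STATUS row (column name -> value)."""
    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("I/O thread is not running (cannot read from the primary)")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running (relay log is not being applied)")
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL here when replication is not running.
        problems.append("replication lag is unknown")
    elif int(lag) > max_lag_seconds:
        problems.append(f"replica is {int(lag)}s behind (limit {max_lag_seconds}s)")
    return problems


if __name__ == "__main__":
    # Illustrative row, not captured from a real server.
    row = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
           "Seconds_Behind_Master": 120}
    print(replication_problems(row))
```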

Load Balancer Configuration

DigitalOcean Load Balancers provide an easy way to distribute traffic. Configure it to forward HTTP/HTTPS traffic to your WordPress Droplets.

Health Checks

Crucially, configure health checks on the Load Balancer. This allows it to detect unhealthy WordPress Droplets and stop sending traffic to them. A common health check requests a specific URL on your WordPress site that returns a 200 OK status if the application is responsive.

# Example health check configuration (via DigitalOcean UI or API)
Protocol: HTTP
Port: 80
Path: /healthz.php  # A simple PHP file that returns HTTP 200 with the body "OK"
Interval: 10s
Timeout: 5s
Unhealthy Threshold: 3
Healthy Threshold: 2
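
The interval, timeout, and unhealthy threshold together determine how long a dead Droplet keeps receiving traffic. The estimate below is a rough sketch, assuming checks start on a fixed schedule and the timeout is shorter than the interval; the function name is hypothetical and the model is not taken from DigitalOcean’s documentation.

```python
def detection_window(interval_s, timeout_s, unhealthy_threshold):
    """Return (best_case, worst_case) seconds from failure to removal.

    Best case: the backend dies just before a check and every failed
    check is reported immediately (for example, connection refused).
    Worst case: it dies just after a passing check and every failed
    check runs into the timeout.
    """
    best = (unhealthy_threshold - 1) * interval_s
    worst = unhealthy_threshold * interval_s + timeout_s
    return best, worst


if __name__ == "__main__":
    best, worst = detection_window(interval_s=10, timeout_s=5, unhealthy_threshold=3)
    print(f"removed from rotation after roughly {best}-{worst} seconds")
```

With the values above, a failed Droplet is removed after roughly 20 to 35 seconds. Lowering the interval or threshold shortens that window at the cost of more false positives.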

Create a healthz.php file in your WordPress root directory:

<?php
header('Content-Type: text/plain');
echo 'OK';
exit(0);
?>

Automated Failover Orchestration

True automated failover requires more than just redundant components; it needs orchestration. This involves monitoring and automated switching mechanisms.

Elasticsearch Failover

Elasticsearch handles master election and shard failover automatically. If the master node fails, the remaining master-eligible nodes elect a new master, provided a majority of them is still available. If a data node fails, Elasticsearch promotes replica shards on the surviving nodes to primaries and then, after a short delay (index.unassigned.node_left.delayed_timeout, one minute by default), allocates new replicas to restore redundancy. The key is that discovery.seed_hosts is correct on every node and that nodes can reach each other on the transport port; cluster.initial_master_nodes matters only when the cluster first forms.

WordPress Failover

For WordPress, the Load Balancer is the primary orchestrator. When a Droplet becomes unresponsive (as determined by health checks), the Load Balancer automatically removes it from the rotation. When the Droplet recovers and passes health checks again, it’s automatically added back.
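
The remove-and-restore behaviour can be modelled as a small state machine driven by consecutive check results. The class below is an illustrative simulation with a hypothetical name, assuming an unhealthy threshold of 3 and a healthy threshold of 2; it is not DigitalOcean’s implementation.

```python
class BackendHealth:
    """Tracks whether one backend is in rotation, based on check results."""

    def __init__(self, unhealthy_threshold: int = 3, healthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.in_rotation = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Feed one health check result; return whether the backend is in rotation."""
        if check_passed == self.in_rotation:
            self._streak = 0  # result agrees with the current state
        else:
            self._streak += 1
            needed = self.healthy_threshold if check_passed else self.unhealthy_threshold
            if self._streak >= needed:
                self.in_rotation = check_passed
                self._streak = 0
        return self.in_rotation


if __name__ == "__main__":
    backend = BackendHealth()
    for result in [False, False, True, False, False, False, True, True]:
        state = "in rotation" if backend.record(result) else "removed"
        print(f"check {'passed' if result else 'failed'}: {state}")
```

Note that two failures followed by a success reset the count: only three consecutive failures remove the Droplet, which keeps a single slow response from triggering a failover.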

Database Failover Orchestration

Managed Databases: DigitalOcean handles this transparently. The read/write endpoint updates to point to the new primary. Ensure your WordPress application’s database connection string points to this endpoint.

Self-Managed MySQL with Orchestrator: Orchestrator is a popular open-source tool for MySQL replication topology management and automated failover. It monitors replication health and can automatically promote a replica to primary if the current primary fails. You would typically run Orchestrator on a separate, highly available set of instances.

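# Example Orchestrator command for a planned promotion (manual, for illustration).
# Orchestrator would run the equivalent recovery automatically on failure detection.
# CLUSTER_ALIAS and REPLICA_HOST are placeholders; verify the flags against your Orchestrator version.
orchestrator-client -c graceful-master-takeover -alias CLUSTER_ALIAS -d REPLICA_HOST:3306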

Your WordPress application would need to connect through a proxy such as ProxySQL, or through a virtual IP or DNS name that is repointed to the new primary after failover. ProxySQL can track each backend’s read_only flag and route writes to whichever server is currently writable.
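
The routing decision such a proxy makes can be reduced to one rule: send writes to the single reachable server that is not read-only. The sketch below illustrates that rule only; the pick_writer function and the shape of the servers dictionary are hypothetical, and it does not reproduce ProxySQL’s actual configuration or internals.

```python
from typing import Dict, Optional


def pick_writer(servers: Dict[str, dict]) -> Optional[str]:
    """Return the host that should receive writes, or None if unclear.

    Each value is a dict with boolean 'reachable' and 'read_only' keys.
    """
    writers = [host for host, state in servers.items()
               if state.get("reachable") and not state.get("read_only")]
    if len(writers) == 1:
        return writers[0]
    # Zero writers: a failover is still in progress.
    # Several writers: possible split-brain, so refuse to choose.
    return None


if __name__ == "__main__":
    # Illustrative state after the old primary (db-1) has failed.
    servers = {
        "db-1": {"reachable": False, "read_only": False},
        "db-2": {"reachable": True, "read_only": False},
        "db-3": {"reachable": True, "read_only": True},
    }
    print("writer:", pick_writer(servers))
```

Refusing to pick a writer when two servers are writable is a deliberate choice here: pausing writes briefly is usually cheaper than reconciling diverged data.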

Monitoring and Alerting

Automated failover is only effective if you know when it happens and if it’s working. Robust monitoring and alerting are paramount.

Elasticsearch Monitoring

Utilize Elasticsearch’s own health APIs and integrate with monitoring tools like Prometheus and Grafana. Key metrics to monitor include:

Cluster health status (green, yellow, red)
Node count and roles (master, data, ingest)
JVM heap usage
Disk space
Indexing and search latency
Unassigned, initializing, and relocating shard counts

Set up alerts for when the cluster health turns yellow or red, or when critical nodes become unresponsive.
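
An alert rule along these lines could look like the following sketch. The metric names (status, heap_used_pct, disk_used_pct), the thresholds, and the es_alerts function are assumptions for illustration; in practice these rules would usually live in your monitoring system rather than in a script.

```python
from typing import List, Tuple


def es_alerts(metrics: dict,
              heap_warn_pct: float = 85.0,
              disk_warn_pct: float = 80.0) -> List[Tuple[str, str]]:
    """Map a metrics snapshot to (severity, message) pairs."""
    alerts = []
    status = metrics.get("status")
    if status == "red":
        alerts.append(("critical", "cluster health is red: a primary shard is unassigned"))
    elif status == "yellow":
        alerts.append(("warning", "cluster health is yellow: a replica shard is unassigned"))
    heap = metrics.get("heap_used_pct", 0.0)
    if heap >= heap_warn_pct:
        alerts.append(("warning", f"JVM heap usage at {heap:.0f}%"))
    disk = metrics.get("disk_used_pct", 0.0)
    if disk >= disk_warn_pct:
        alerts.append(("warning", f"disk usage at {disk:.0f}%"))
    return alerts


if __name__ == "__main__":
    # Illustrative snapshot, not real measurements.
    snapshot = {"status": "yellow", "heap_used_pct": 91.0, "disk_used_pct": 42.0}
    for severity, message in es_alerts(snapshot):
        print(f"[{severity}] {message}")
```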

WordPress Monitoring

Monitor the health of your WordPress Droplets and the Load Balancer. DigitalOcean provides Droplet monitoring. Additionally, track:

Load Balancer health check status
Application-level metrics (e.g., response times, error rates)
Server resource utilization (CPU, RAM, disk I/O) on WordPress Droplets
Database performance metrics (query times, connections, replication lag if self-managed)

Tools like UptimeRobot, Prometheus with Node Exporter and Blackbox Exporter, or Datadog can be invaluable here. Configure alerts for Load Balancer health check failures, high error rates, or resource exhaustion.

Testing Your Failover Strategy

A disaster recovery plan is useless if not tested. Regularly simulate failures to ensure your automated failover mechanisms work as expected.

Simulating Failures

Elasticsearch:

Stop the Elasticsearch service on a master-eligible node. Observe cluster health and master re-election.
Stop the Elasticsearch service on a data node. Observe shard reallocation.
Simulate network partitions between nodes.

WordPress:

Stop the web server service (Nginx/Apache) on a WordPress Droplet. Verify it’s removed from the Load Balancer rotation.
Simulate high load to verify that the Load Balancer distributes traffic across Droplets and that health checks detect overloaded ones.
If using self-managed MySQL, stop the primary database server and verify the failover process (e.g., Orchestrator promoting a replica).
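
To put a number on each drill, probe the site at a fixed interval during the test and compute the longest outage from the results. The helper below is a minimal sketch with a hypothetical name; it assumes you have already collected (timestamp, success) samples sorted by time, and the sample data is illustrative.

```python
from typing import List, Tuple


def longest_outage(samples: List[Tuple[float, bool]]) -> float:
    """Longest span, in seconds, from a first failed probe to the next success.

    samples: (timestamp_seconds, probe_succeeded) pairs sorted by time.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp  # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)  # outage ends
            outage_start = None
    if outage_start is not None:
        # Still failing when sampling stopped: count up to the last sample.
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


if __name__ == "__main__":
    probes = [(0, True), (1, False), (2, False), (3, False),
              (4, True), (5, True), (6, False), (7, True)]
    print(f"longest outage: {longest_outage(probes)} s")
```

Recording this figure for every test run makes it easy to see whether a configuration change actually shortened the failover.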

Document the results of each test and refine your configurations and procedures based on the outcomes. Automation is key, but human oversight and regular validation are indispensable.