Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and WordPress Deployments on Linode

Elasticsearch Cluster Architecture for High Availability

Achieving automated failover for Elasticsearch necessitates a robust, multi-node cluster design. We’ll focus on a setup that leverages Elasticsearch’s built-in replication and shard allocation capabilities, augmented by external monitoring and orchestration for true auto-failover. For this example, we’ll assume a Linode environment with at least three Elasticsearch nodes, each running a recent version of Elasticsearch (e.g., 7.x or 8.x).

The core principle is to ensure that no single point of failure exists within the Elasticsearch cluster itself. This means configuring:

Master-eligible nodes: These nodes are responsible for cluster management. A minimum of three master-eligible nodes is recommended for quorum (e.g., 2 out of 3 must agree).
Data nodes: These nodes store the actual data shards.
Replication: Each primary shard should have at least one replica shard distributed across different nodes.
Shard Allocation Awareness: Configure Elasticsearch to spread each shard and its replicas across different failure domains (for example, different Linode data centers, if your deployment spans more than one), so that losing one domain does not take every copy of a shard with it.
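
The quorum arithmetic behind the three-node recommendation can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and Elasticsearch manages its voting configuration internally.

```python
def quorum_size(master_eligible: int) -> int:
    """Smallest majority of the given number of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can fail while a majority survives."""
    return master_eligible - quorum_size(master_eligible)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} nodes: quorum={quorum_size(n)}, tolerates={tolerated_failures(n)}")
```

Two master-eligible nodes tolerate no failures at all, which is why three is the practical minimum.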

Elasticsearch Configuration for Resilience

The primary configuration file for Elasticsearch is elasticsearch.yml. Key settings for high availability include:

Master Node Configuration

On each master-eligible node, ensure the following settings are present and correctly configured:

cluster.name: my-elasticsearch-cluster
node.name: es-master-01 # Unique name for each node
node.roles: [ master, data, ingest ] # Requires 7.9+; use separate roles for larger clusters

network.host: 0.0.0.0 # Prefer the node's private IP so the API is not exposed publicly
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - es-master-01.internal:9300 # Use Linode's private IP or internal DNS
  - es-master-02.internal:9300
  - es-master-03.internal:9300

# Needed only when bootstrapping a brand-new cluster; remove once the cluster has formed
cluster.initial_master_nodes:
  - es-master-01
  - es-master-02
  - es-master-03

# For shard allocation awareness (if nodes are spread across failure domains)
# cluster.routing.allocation.awareness.attributes: zone
# node.attr.zone: us-east-1a # Any label you choose; set per node to match where it runs

Data Node Configuration (if separate)

If you’re separating master and data roles, data nodes would have:

cluster.name: my-elasticsearch-cluster
node.name: es-data-01 # Unique name for each node
node.roles: [ data, ingest ] # Exclude 'master' role

network.host: 0.0.0.0 # Prefer the node's private IP so the API is not exposed publicly
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - es-master-01.internal:9300
  - es-master-02.internal:9300
  - es-master-03.internal:9300

# If using allocation awareness, ensure data nodes also have zone attributes
# node.attr.zone: us-east-1b
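
Once awareness attributes are set, it is worth checking that the copies of each shard really do land in different zones. The sketch below assumes rows shaped like the output of `_cat/shards?format=json` (with `index`, `shard`, and `node` fields) and a node-to-zone mapping that you maintain yourself; the function name and the mapping are hypothetical.

```python
from collections import defaultdict


def shards_in_single_zone(shard_rows, node_zone):
    """Return (index, shard) pairs whose assigned copies (two or more) share one zone.

    shard_rows: dicts with 'index', 'shard' and 'node' keys
    node_zone:  hypothetical mapping of node name -> zone label
    """
    zones = defaultdict(set)
    copies = defaultdict(int)
    for row in shard_rows:
        node = row.get("node")
        if not node:  # unassigned copies have no node
            continue
        key = (row["index"], row["shard"])
        zones[key].add(node_zone[node])
        copies[key] += 1
    return sorted(k for k in zones if copies[k] > 1 and len(zones[k]) == 1)


if __name__ == "__main__":
    rows = [
        {"index": "posts", "shard": "0", "prirep": "p", "node": "es-data-01"},
        {"index": "posts", "shard": "0", "prirep": "r", "node": "es-data-02"},
    ]
    print(shards_in_single_zone(rows, {"es-data-01": "a", "es-data-02": "a"}))
```

An empty result means every replicated shard has copies in at least two zones.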

Index Settings for Replication

Ensure your indices are created with appropriate replica counts. This can be done via the Elasticsearch API or by setting index templates.

Example API call to change the replica count on all existing indices (number_of_shards is fixed when an index is created, so it cannot be changed through this endpoint):

curl -X PUT "localhost:9200/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 1
  }
}
'

For production, use index templates so that every new index is created with the intended shard and replica counts.
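
As a rough sketch of the index-template approach, the Python below builds a composable index template body and sends it to the `_index_template` endpoint (available from Elasticsearch 7.8). The template name, the `wp-*` pattern, and the helper names are assumptions for illustration, and the request carries no authentication or TLS settings, which a secured cluster will require.

```python
import json
import urllib.request


def build_index_template(patterns, shards=3, replicas=1, priority=100):
    """Build a composable index template body with shard and replica defaults."""
    return {
        "index_patterns": list(patterns),
        "priority": priority,
        "template": {
            "settings": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        },
    }


def put_index_template(base_url, name, body, timeout=10):
    """Send the body to PUT /_index_template/<name> and return the HTTP status."""
    req = urllib.request.Request(
        f"{base_url}/_index_template/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical names; prints the body instead of contacting a cluster.
    print(json.dumps(build_index_template(["wp-*"]), indent=2))
```

Calling `put_index_template("http://localhost:9200", "wordpress-defaults", body)` would register the template; new indices matching the pattern then pick up the settings automatically.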

WordPress Database High Availability (MySQL/MariaDB)

WordPress relies heavily on its database. For automated failover, we’ll implement a primary-replica setup, plus a way to detect a primary failure and promote a replica. Linode’s Managed Databases can simplify this, but for a custom setup, we’ll outline a common approach using MySQL replication.

MySQL Replication Setup

This involves setting up one primary (master) database server and one or more replica servers. The primary records every change in its binary log, and replicas fetch and apply those events to stay synchronized.

On the Primary Server (db-primary.internal):

[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = OFF # Crucial for primary

After applying these changes, restart MySQL and create a replication user:

-- On the primary MySQL server
CREATE USER 'replicator'@'%' IDENTIFIED BY 'your_strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';
FLUSH PRIVILEGES;

-- Get the current binary log file and position (with GTID, check Executed_Gtid_Set)
SHOW MASTER STATUS;
-- Example output: File: mysql-bin.000001, Position: 123456, Executed_Gtid_Set: ...

On the Replica Server (db-replica-1.internal):

[mysqld]
server-id = 2 # Must be unique and different from primary
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = ON # Crucial for replicas to prevent accidental writes

After restarting MySQL on the replica, configure it to connect to the primary:

-- On the replica MySQL server
-- (recent MySQL 8.0 releases also offer CHANGE REPLICATION SOURCE TO / START REPLICA as the newer names)
CHANGE MASTER TO
  MASTER_HOST='db-primary.internal',
  MASTER_USER='replicator',
  MASTER_PASSWORD='your_strong_password',
  MASTER_PORT=3306,
  MASTER_AUTO_POSITION = 1; -- GTID auto-positioning in MySQL (MariaDB uses MASTER_USE_GTID = slave_pos instead)

START SLAVE;
SHOW SLAVE STATUS\G
-- Verify 'Slave_IO_Running: Yes' and 'Slave_SQL_Running: Yes'
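
The manual check above can be automated. As a minimal sketch, the Python below parses the vertical output of `SHOW SLAVE STATUS\G` and applies the same two conditions plus a lag limit. The function names and the 30-second default are assumptions, and newer MySQL releases that use `SHOW REPLICA STATUS` report renamed fields (for example `Replica_IO_Running`), which would need adjusting.

```python
def parse_vertical_status(text):
    """Parse `SHOW SLAVE STATUS\\G` style output into a dict of field -> value."""
    fields = {}
    for line in text.splitlines():
        # Skip blank lines and the "*** 1. row ***" separator
        if ":" not in line or line.lstrip().startswith("*"):
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def replication_healthy(fields, max_lag_seconds=30):
    """True when both replication threads run and lag is within the limit."""
    if fields.get("Slave_IO_Running") != "Yes":
        return False
    if fields.get("Slave_SQL_Running") != "Yes":
        return False
    lag = fields.get("Seconds_Behind_Master", "NULL")
    # Seconds_Behind_Master is NULL when replication is not running
    return lag.isdigit() and int(lag) <= max_lag_seconds


if __name__ == "__main__":
    sample = """*************************** 1. row ***************************
                 Slave_IO_Running: Yes
                Slave_SQL_Running: Yes
            Seconds_Behind_Master: 0
    """
    print(replication_healthy(parse_vertical_status(sample)))
```

The text could come from running `mysql -e 'SHOW SLAVE STATUS\G'` on the replica from a cron job or monitoring agent.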

Automated Failover Mechanism (Orchestration)

This is the most critical part for *automated* failover. We need a system that:

Monitors the health of the primary database.
Detects primary failure (e.g., unresponsiveness, network issues).
Promotes a healthy replica to become the new primary.
Reconfigures other replicas to follow the new primary.
Updates application configurations (WordPress) to point to the new primary.
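
The first three requirements can be sketched as follows. This is a deliberately simplified illustration rather than how any particular tool works: the class, the function, and the replica dictionary keys are all hypothetical, and production tools such as Orchestrator also take the replicas' own view of the primary into account before acting.

```python
class FailureDetector:
    """Declare the primary failed only after N consecutive failed checks."""

    def __init__(self, threshold=3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_ok: bool) -> bool:
        """Record one health-check result; return True when failover should start."""
        if check_ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


def choose_candidate(replicas):
    """Pick the replica with a running SQL thread and the least lag, or None."""
    healthy = [
        r for r in replicas
        if r["sql_running"] and r["lag_seconds"] is not None
    ]
    return min(healthy, key=lambda r: r["lag_seconds"], default=None)


if __name__ == "__main__":
    detector = FailureDetector(threshold=3)
    for ok in (True, False, False, False):
        if detector.record(ok):
            print("primary considered failed, starting promotion")
```

Requiring several consecutive failures trades a few seconds of detection time for protection against failing over on a single dropped health check.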

Several tools can achieve this:

Orchestrator: A popular open-source tool specifically designed for MySQL replication topology management and automated failover.
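ProxySQL: A high-performance SQL proxy that handles read/write splitting and routes traffic to whichever server is currently writable. It does not promote replicas itself, so it is usually paired with a tool such as Orchestrator.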
Custom Scripts with Monitoring: Using tools like systemd, monit, or Prometheus/Alertmanager to trigger custom failover scripts.

Let’s outline a simplified approach using Orchestrator.

Orchestrator Setup for MySQL Failover

Install Orchestrator on a dedicated server rather than on one of the database nodes, so that losing a database host cannot also take down the component responsible for failover. Then configure Orchestrator to connect to your MySQL instances.

# orchestrator.conf.json
{
  "Debug": true,
  "ListenAddress": ":3000",
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "your_orchestrator_password",
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orchestrator_backend",
  "MySQLOrchestratorPassword": "your_backend_password",
  "InstancePollSeconds": 10,
  "RecoverMasterClusterFilters": ["*"],
  "PostUnsuccessfulFailoverProcesses": [
    "/path/to/your/script/notify_admin.sh"
  ],
  "PostMasterFailoverProcesses": [
    "/path/to/your/script/update_wordpress_config.sh {successorHost}"
  ]
}

The MySQLTopology* keys are the credentials Orchestrator uses on the monitored servers. The MySQLOrchestrator* keys describe Orchestrator’s own backend database, where it stores its state; the backend user and password shown here are placeholders.

Create the orchestrator MySQL user with appropriate privileges on all database servers.

-- On all MySQL servers (primary and replicas)
CREATE USER 'orchestrator'@'%' IDENTIFIED BY 'your_orchestrator_password';
GRANT SUPER, PROCESS, REPLICATION SLAVE, REPLICATION CLIENT, RELOAD, LOCK TABLES, SHOW DATABASES, SHOW VIEW ON *.* TO 'orchestrator'@'%';
FLUSH PRIVILEGES;

Start Orchestrator in HTTP mode, then point it at one instance (for example through the Discover page of the web UI); it crawls the rest of the topology from there.

# Example command to start Orchestrator
orchestrator --config=/etc/orchestrator/orchestrator.conf.json http

Orchestrator’s web UI (port 3000 in this configuration) shows the discovered topology. Automated recovery of a failed primary is controlled by RecoverMasterClusterFilters, which lists the clusters Orchestrator may recover on its own ("*" matches all of them). The PostMasterFailoverProcesses hook runs after a successful primary failover. Orchestrator substitutes placeholders such as {successorHost} before executing the command, which is how the script learns the new primary and can update WordPress.

WordPress Application Layer Failover

WordPress itself needs to be aware of the database changes. The most common approach is to dynamically update the wp-config.php file.

Updating `wp-config.php`

The script executed by Orchestrator (e.g., update_wordpress_config.sh) would need to:

Identify the new primary database host.
Modify the wp-config.php file on all WordPress web servers so that DB_HOST points at the new primary.
Potentially restart PHP-FPM or clear WordPress object caches (e.g., Redis, Memcached) if applicable.

Example script snippet (ensure proper error handling and security):

#!/bin/bash
set -euo pipefail

# New primary host. Orchestrator does not pass arguments on its own; the hook
# command must include a placeholder such as {successorHost}.
NEW_DB_HOST="${1:-}"
WP_CONFIG_PATH="/var/www/html/wp-config.php" # Adjust path as needed

if [ -z "$NEW_DB_HOST" ]; then
  echo "Error: New DB host not provided." >&2
  exit 1
fi

echo "Updating $WP_CONFIG_PATH with new DB host: $NEW_DB_HOST"

# Replace the DB_HOST definition, tolerating optional whitespace and either quote style.
# This is a simplified example that edits a local file; with several web servers,
# run it on each one (for example over SSH) or use a config management tool.
if sed -i -E "s/define\([[:space:]]*['\"]DB_HOST['\"][[:space:]]*,.*\);/define( 'DB_HOST', '$NEW_DB_HOST' );/" "$WP_CONFIG_PATH" \
   && grep -q "'DB_HOST', '$NEW_DB_HOST'" "$WP_CONFIG_PATH"; then
  echo "Successfully updated DB_HOST in $WP_CONFIG_PATH."
  # Optional: restart PHP-FPM (unit name varies by distribution) or clear caches
  # systemctl restart php-fpm
  # redis-cli FLUSHALL  # Wipes every Redis database; use only on a dedicated cache
else
  echo "Error updating $WP_CONFIG_PATH." >&2
  exit 1
fi

exit 0

This script needs to be executable and accessible by the user Orchestrator runs as. Ensure the DB_USER and DB_PASSWORD remain consistent across all database servers.
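
For comparison, here is a hedged Python sketch of the same rewrite that validates the host value, insists on exactly one match, and swaps the file in with an atomic rename. The function names and the set of characters accepted in a host are assumptions, not part of WordPress or Orchestrator.

```python
import os
import re
import shutil

# Matches define('DB_HOST', '...'); with optional whitespace and either quote style
DB_HOST_DEFINE = re.compile(
    r"""define\(\s*(['"])DB_HOST\1\s*,\s*(['"]).*?\2\s*\)\s*;"""
)
SAFE_HOST = re.compile(r"[A-Za-z0-9.\-:]+")


def replace_db_host(config_text, new_host):
    """Return (updated_text, replacement_count) with DB_HOST rewritten."""
    if not SAFE_HOST.fullmatch(new_host):
        raise ValueError(f"suspicious host value: {new_host!r}")
    replacement = f"define( 'DB_HOST', '{new_host}' );"
    # A callable replacement avoids backslash interpretation in the new text
    return DB_HOST_DEFINE.subn(lambda m: replacement, config_text)


def update_wp_config(path, new_host):
    """Rewrite DB_HOST in the file at path, replacing the file atomically."""
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    updated, count = replace_db_host(text, new_host)
    if count != 1:
        raise RuntimeError(f"expected exactly one DB_HOST define, found {count}")
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(updated)
    shutil.copymode(path, tmp)  # keep the original permission bits
    os.replace(tmp, path)       # atomic when tmp and path share a filesystem
```

Because the new file is written next to the old one and then renamed over it, PHP never reads a half-written wp-config.php. File ownership is not preserved by this sketch, so run it as the user that should own the file.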

Elasticsearch Client Failover

Your WordPress application (or any other service interacting with Elasticsearch) needs to be resilient to Elasticsearch node failures. This typically involves:

Using an Elasticsearch client library: Most modern libraries (e.g., official PHP client) have built-in support for multiple hosts and automatic node discovery/reconnection.
Configuring multiple hosts: Provide a list of all Elasticsearch node addresses to the client.
Health checks: The client library should periodically check the health of nodes and remove unresponsive ones from its internal list.

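Example using the official PHP Elasticsearch client (7.x namespaces shown; the 8.x client uses Elastic\Elasticsearch\ClientBuilder and different exception classes):

<?php
require 'vendor/autoload.php'; // Composer autoloader

$hosts = [
    'http://es-node-1.internal:9200',
    'http://es-node-2.internal:9200',
    'http://es-node-3.internal:9200',
];

$client = Elasticsearch\ClientBuilder::create()
    ->setHosts($hosts)
    ->setRetries(2) // Retry a failed request on other nodes before giving up
    ->build();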

try {
    // Perform an operation, e.g., index a document
    $params = [
        'index' => 'my_index',
        'id'    => 'my_id',
        'body'  => ['testField' => 'abc']
    ];
    $response = $client->index($params);
    print_r($response);

} catch (\Elasticsearch\Common\Exceptions\NoNodesAvailableException $e) {
    // Handle the case where no Elasticsearch nodes are reachable
    echo "Error: No Elasticsearch nodes available. " . $e->getMessage();
    // Implement fallback logic, e.g., queueing requests
} catch (\Exception $e) {
    echo "An unexpected error occurred: " . $e->getMessage();
}
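
To make the client-side behaviour less of a black box, here is a small Python sketch of the idea: try hosts in order, mark a host dead when a connection-level error occurs, and skip it until a retry interval has passed. The class and exception names are hypothetical, and real client libraries implement this with more care (pings, sniffing, backoff).

```python
import time


class AllHostsFailed(Exception):
    """Raised when the request failed on every host that was tried."""


class HostPool:
    """Ordered failover over hosts, skipping ones recently marked dead."""

    def __init__(self, hosts, retry_after=60.0, clock=time.monotonic):
        self.hosts = list(hosts)
        self.retry_after = retry_after
        self.clock = clock
        self.dead_until = {}  # host -> time when it may be tried again

    def live_hosts(self):
        now = self.clock()
        live = [
            h for h in self.hosts
            if self.dead_until.get(h, float("-inf")) <= now
        ]
        return live or list(self.hosts)  # if all are marked dead, try them all

    def call(self, fn):
        """Run fn(host) on the first host that does not raise OSError."""
        errors = {}
        for host in self.live_hosts():
            try:
                result = fn(host)
            except OSError as exc:  # refused connection, timeout, DNS failure
                self.dead_until[host] = self.clock() + self.retry_after
                errors[host] = exc
                continue
            self.dead_until.pop(host, None)
            return result
        raise AllHostsFailed(errors)
```

Here `fn` stands for whatever performs the actual request against one host, for example a function that opens the URL with `urllib.request.urlopen` (whose `URLError` is a subclass of `OSError`).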

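When the elected master fails, the remaining master-eligible nodes elect a new one, provided a majority of them is still reachable. When a data node fails, Elasticsearch promotes replica shards on the surviving nodes to primaries, then rebuilds the missing replicas on other nodes after a short delay (index.unassigned.node_left.delayed_timeout, one minute by default), as long as enough nodes remain to hold them.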

Monitoring and Alerting

Automated failover is only as good as the monitoring that triggers it. Implement comprehensive monitoring for:

Elasticsearch Cluster Health: Use Elasticsearch’s `_cluster/health` API to check status (green, yellow, red) and node counts.
MySQL Replication Status: Monitor `SHOW SLAVE STATUS` on replicas.
Orchestrator Health: Ensure Orchestrator itself is running and healthy.
Application Connectivity: Monitor WordPress’s ability to connect to its database and Elasticsearch.
Linode Resource Utilization: CPU, RAM, Disk I/O, Network on all nodes.

Tools like Prometheus with Alertmanager, Datadog, or Nagios are essential. Set up alerts for critical conditions, such as:

Elasticsearch cluster status is red or yellow.
MySQL replication lag exceeds a threshold or replication is stopped.
Orchestrator reports an issue or fails to perform a failover.
High resource utilization on critical nodes.
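
As a sketch of how the first two alert conditions might be evaluated in code, the function below takes a parsed `_cluster/health` response (it reads the `status` and `number_of_nodes` fields) and a replication lag value, and returns the alerts that apply. The thresholds, severities, and function name are assumptions; in practice this logic usually lives in Alertmanager rules or your monitoring tool.

```python
def evaluate_alerts(cluster_health, replica_lag_seconds,
                    expected_nodes=3, max_lag_seconds=30):
    """Return a list of (severity, message) tuples for the current state.

    cluster_health:      parsed JSON from the _cluster/health API
    replica_lag_seconds: replication lag, or None when replication is stopped
    """
    alerts = []
    status = cluster_health.get("status")
    if status == "red":
        alerts.append(("critical", "Elasticsearch cluster status is red"))
    elif status == "yellow":
        alerts.append(("warning", "Elasticsearch cluster status is yellow"))

    nodes = cluster_health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        alerts.append(
            ("warning", f"only {nodes} of {expected_nodes} Elasticsearch nodes present")
        )

    if replica_lag_seconds is None:
        alerts.append(("critical", "MySQL replication is stopped"))
    elif replica_lag_seconds > max_lag_seconds:
        alerts.append(("warning", f"MySQL replication lag is {replica_lag_seconds}s"))
    return alerts


if __name__ == "__main__":
    print(evaluate_alerts({"status": "yellow", "number_of_nodes": 3}, 45))
```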

Testing Your Failover Strategy

Regularly test your failover procedures. This is non-negotiable for production systems. Simulate failures:

Stop the MySQL primary process.
Reboot an Elasticsearch master node.
Simulate network partitions.

Verify that the automated failover mechanisms trigger correctly, the application reconnects successfully, and data integrity is maintained. Document the entire process and the results of your tests.
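
One number worth recording in every test is how long recovery actually took. A minimal sketch, assuming a simple TCP reachability check is an acceptable stand-in for "the service is back" (the helper names are hypothetical):

```python
import socket
import time


def tcp_port_open(host, port, timeout=2.0):
    """True when a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(predicate, timeout=120.0, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it is true; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if predicate():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

After stopping the primary, a call such as `wait_until(lambda: tcp_port_open("db-replica-1.internal", 3306))` gives a rough recovery time. An open port does not prove the server accepts writes, so a real test should also run a write query through WordPress or a MySQL client.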