Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and WordPress Deployments on Linode
Elasticsearch Cluster Architecture for High Availability
Achieving automated failover for Elasticsearch necessitates a robust, multi-node cluster design. We’ll focus on a setup that leverages Elasticsearch’s built-in replication and shard allocation capabilities, augmented by external monitoring and orchestration for true auto-failover. For this example, we’ll assume a Linode environment with at least three Elasticsearch nodes, each running a recent version of Elasticsearch (e.g., 7.x or 8.x).
The core principle is to ensure that no single point of failure exists within the Elasticsearch cluster itself. This means configuring:
- Master-eligible nodes: These nodes are responsible for cluster management. A minimum of three master-eligible nodes is recommended for quorum (e.g., 2 out of 3 must agree).
- Data nodes: These nodes store the actual data shards.
- Replication: Each primary shard should have at least one replica shard distributed across different nodes.
- Shard Allocation Awareness: Configure Elasticsearch to distribute shards and their replicas across different Linode availability zones or physical racks if possible, to prevent data loss due to a single datacenter outage.
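The quorum figure quoted above (2 out of 3) is plain majority arithmetic. The sketch below shows only that arithmetic; it is not Elasticsearch's own voting-configuration logic, and the function names are illustrative assumptions.

```python
def quorum_size(master_eligible: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority survives."""
    return master_eligible - quorum_size(master_eligible)


if __name__ == "__main__":
    for nodes in (1, 2, 3, 5):
        print(f"{nodes} nodes: quorum={quorum_size(nodes)}, "
              f"tolerated failures={tolerated_failures(nodes)}")
```

Two master-eligible nodes tolerate no failures at all, which is why three is the practical minimum.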
Elasticsearch Configuration for Resilience
The primary configuration file for Elasticsearch is elasticsearch.yml. Key settings for high availability include:
Master Node Configuration
On each master-eligible node, ensure the following settings are present and correctly configured:
cluster.name: my-elasticsearch-cluster
node.name: es-master-01                # Unique name for each node
node.roles: [ master, data, ingest ]   # Or separate roles for larger clusters
network.host: 0.0.0.0                  # Or specific private IP for Linode internal network
http.port: 9200
transport.port: 9300
discovery.seed_hosts:
  - es-master-01.internal:9300         # Use Linode's private IP or internal DNS
  - es-master-02.internal:9300
  - es-master-03.internal:9300
cluster.initial_master_nodes:
  - es-master-01
  - es-master-02
  - es-master-03
# For advanced shard allocation awareness (if using multiple Linode zones)
# cluster.routing.allocation.awareness.attributes: zone
# node.attr.zone: us-east-1a           # Set this per node based on its Linode zone
Data Node Configuration (if separate)
If you’re separating master and data roles, data nodes would have:
cluster.name: my-elasticsearch-cluster
node.name: es-data-01                  # Unique name for each node
node.roles: [ data, ingest ]           # Exclude 'master' role
network.host: 0.0.0.0                  # Or specific private IP
http.port: 9200
transport.port: 9300
discovery.seed_hosts:
  - es-master-01.internal:9300
  - es-master-02.internal:9300
  - es-master-03.internal:9300
# If using allocation awareness, ensure data nodes also have zone attributes
# node.attr.zone: us-east-1b
Index Settings for Replication
Ensure your indices are created with appropriate replica counts. This can be done via the Elasticsearch API or by setting index templates.
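Example API call that sets the replica count on existing indices. Note that number_of_shards is a static setting: it is fixed when an index is created and cannot be changed through this endpoint, so set it at creation time or in a template.
curl -X PUT "localhost:9200/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 1
  }
}
'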
For production, it’s highly recommended to use Index Templates to enforce these settings for all new indices.
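As a rough sketch of what such a template could look like, the snippet below builds the request body for a composable index template (the `_index_template` API, available from Elasticsearch 7.8). The `wordpress-*` pattern, the priority value, and the function name are assumptions chosen for illustration.

```python
import json


def build_index_template(patterns, shards=3, replicas=1, priority=100):
    """Return a composable index template body that pins shard and replica counts."""
    return {
        "index_patterns": list(patterns),
        "priority": priority,
        "template": {
            "settings": {
                "index": {
                    "number_of_shards": shards,
                    "number_of_replicas": replicas,
                }
            }
        },
    }


# Hypothetical pattern; adjust to the index names your WordPress plugin creates.
body = build_index_template(["wordpress-*"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent with a PUT request to `/_index_template/<template-name>`. Unlike the `_settings` call, a template also applies `number_of_shards`, because it takes effect when the index is created.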
WordPress Database High Availability (MySQL/MariaDB)
WordPress relies heavily on its database. For automated failover, we’ll run a primary-replica topology and add a component that detects primary failure and promotes a replica. Linode’s Managed Databases can simplify this; for a custom setup, we’ll outline a common approach using MySQL replication.
MySQL Replication Setup
This involves setting up one primary (master) database server and one or more replica servers. The primary logs binary events, and replicas apply these events to stay synchronized.
On the Primary Server (db-primary.internal):
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = OFF   # Crucial for primary
After applying these changes, restart MySQL and create a replication user:
-- On the primary MySQL server
CREATE USER 'replicator'@'%' IDENTIFIED BY 'your_strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';
FLUSH PRIVILEGES;

-- Get the current binary log file and position (or GTID status)
SHOW MASTER STATUS;
-- Example output: File: mysql-bin.000001, Position: 123456, GTID_Executed: ...
On the Replica Server (db-replica-1.internal):
[mysqld]
server-id = 2     # Must be unique and different from primary
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = ON    # Crucial for replicas to prevent accidental writes
After restarting MySQL on the replica, configure it to connect to the primary:
-- On the replica MySQL server
CHANGE MASTER TO
  MASTER_HOST = 'db-primary.internal',
  MASTER_USER = 'replicator',
  MASTER_PASSWORD = 'your_strong_password',
  MASTER_PORT = 3306,
  MASTER_AUTO_POSITION = 1;  -- GTID auto-positioning in MySQL (MASTER_USE_GTID = slave_pos is the MariaDB equivalent)
START SLAVE;
SHOW SLAVE STATUS\G
-- Verify 'Slave_IO_Running: Yes' and 'Slave_SQL_Running: Yes'
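The verification step can be automated. The following minimal sketch parses the vertical (`\G`) output of `SHOW SLAVE STATUS` and checks both replication threads and the lag. The sample text is an abbreviated, illustrative excerpt rather than captured output, and the 30-second lag limit is an arbitrary assumption.

```python
def parse_vertical_status(text: str) -> dict:
    """Turn 'Key: value' lines from \\G-style output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Skip the '*** 1. row ***' separator and anything without a key.
        if ":" not in line or line.lstrip().startswith("*"):
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def replication_healthy(fields: dict, max_lag_seconds: int = 30) -> bool:
    """True when both threads run and the reported lag is within the limit."""
    lag = fields.get("Seconds_Behind_Master", "NULL")
    return (
        fields.get("Slave_IO_Running") == "Yes"
        and fields.get("Slave_SQL_Running") == "Yes"
        and lag.isdigit()
        and int(lag) <= max_lag_seconds
    )


# Illustrative excerpt only; real output contains many more fields.
SAMPLE = """\
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: db-primary.internal
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 0
"""

print(replication_healthy(parse_vertical_status(SAMPLE)))
```

A lag of `NULL` is treated as unhealthy, since MySQL reports it when the SQL thread is not applying events.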
Automated Failover Mechanism (Orchestration)
This is the most critical part for *automated* failover. We need a system that:
- Monitors the health of the primary database.
- Detects primary failure (e.g., unresponsiveness, network issues).
- Promotes a healthy replica to become the new primary.
- Reconfigures other replicas to follow the new primary.
- Updates application configurations (WordPress) to point to the new primary.
Several tools can achieve this:
- Orchestrator: A popular open-source tool specifically designed for MySQL replication topology management and automated failover.
- ProxySQL: A high-performance SQL proxy that can manage read/write splitting and failover.
- Custom Scripts with Monitoring: Using tools like systemd, monit, or Prometheus/Alertmanager to trigger custom failover scripts.
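To make the detection and promotion steps listed earlier more concrete, here is a deliberately simplified sketch of the decision logic. It assumes health probes are collected elsewhere and uses last-known lag as a stand-in for the replication-coordinate or GTID comparison that dedicated tools perform. The `Replica` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    host: str
    reachable: bool
    last_known_lag: Optional[int]  # seconds behind the primary; None when unknown


def primary_is_down(probe_results: List[bool], threshold: int = 3) -> bool:
    """Declare failure only after `threshold` consecutive failed probes."""
    recent = probe_results[-threshold:]
    return len(recent) == threshold and not any(recent)


def pick_promotion_candidate(replicas: List[Replica]) -> Optional[Replica]:
    """Prefer the reachable replica that was least behind the primary."""
    usable = [r for r in replicas if r.reachable and r.last_known_lag is not None]
    return min(usable, key=lambda r: r.last_known_lag, default=None)


replicas = [
    Replica("db-replica-1.internal", reachable=True, last_known_lag=5),
    Replica("db-replica-2.internal", reachable=True, last_known_lag=1),
    Replica("db-replica-3.internal", reachable=False, last_known_lag=0),
]
if primary_is_down([True, False, False, False]):
    candidate = pick_promotion_candidate(replicas)
    print("promote:", candidate.host if candidate else "no usable replica")
```

Requiring several consecutive failed probes is one way to avoid promoting a replica because of a single dropped packet.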
Let’s outline a simplified approach using Orchestrator.
Orchestrator Setup for MySQL Failover
Install Orchestrator on a separate server or one of your existing nodes (ensure it’s not a database node itself to avoid a single point of failure). Configure Orchestrator to connect to your MySQL instances.
Example /etc/orchestrator/orchestrator.conf.json (key names follow Orchestrator's sample configuration; verify them against the version you deploy):
{
  "Debug": true,
  "ListenAddress": ":3000",
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "your_orchestrator_password",
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orchestrator",
  "MySQLOrchestratorPassword": "your_orchestrator_password",
  "InstancePollSeconds": 10,
  "RecoverMasterClusterFilters": ["*"],
  "PostUnsuccessfulFailoverProcesses": [
    "/path/to/your/script/notify_admin.sh"
  ],
  "PostMasterFailoverProcesses": [
    "/path/to/your/script/update_wordpress_config.sh {successorHost}"
  ]
}
The MySQLTopology* settings are the credentials Orchestrator uses on the monitored servers; the MySQLOrchestrator* settings point at Orchestrator's own backend database.
Create the orchestrator MySQL user with appropriate privileges on all database servers.
-- On all MySQL servers (primary and replicas)
CREATE USER 'orchestrator'@'%' IDENTIFIED BY 'your_orchestrator_password';
GRANT SUPER, PROCESS, REPLICATION SLAVE, REPLICATION CLIENT, RELOAD, LOCK TABLES, SHOW DATABASES, SHOW VIEW ON *.* TO 'orchestrator'@'%';
FLUSH PRIVILEGES;
Start Orchestrator and let it discover your topology:
# Example command to start Orchestrator's HTTP service
orchestrator --config=/etc/orchestrator/orchestrator.conf.json http
Orchestrator’s web UI (default: port 3000) will show the topology. Automatic promotion is controlled by RecoverMasterClusterFilters; the "*" pattern above enables it for every discovered cluster. The key is the PostMasterFailoverProcesses hook, which runs after a successful primary failover and, in this configuration, passes the promoted host to the script that updates WordPress.
WordPress Application Layer Failover
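WordPress itself needs to be aware of the database changes. A straightforward approach is to rewrite the DB_HOST value in the wp-config.php file when a failover occurs; pointing WordPress at a proxy such as ProxySQL instead avoids touching the file at all.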
Updating wp-config.php
The script executed by Orchestrator (e.g., update_wordpress_config.sh) would need to:
- Identify the new primary database host.
- Modify the wp-config.php file on all WordPress web servers to reflect the new database host.
- Potentially restart PHP-FPM or clear WordPress object caches (e.g., Redis, Memcached) if applicable.
Example script snippet (ensure proper error handling and security):
#!/bin/bash
# Usage: update_wordpress_config.sh <new-db-host>
# The new primary host is expected as the first argument (for example, supplied
# through Orchestrator's {successorHost} placeholder in the hook command).
NEW_DB_HOST="$1"
WP_CONFIG_PATH="/var/www/html/wp-config.php"  # Adjust path as needed

if [ -z "$NEW_DB_HOST" ]; then
  echo "Error: New DB host not provided." >&2
  exit 1
fi

echo "Updating $WP_CONFIG_PATH with new DB host: $NEW_DB_HOST"

# Use sed to replace the DB_HOST definition. This is a simplified example.
# A more robust solution might involve templating or a dedicated config management tool.
# The pattern tolerates optional whitespace inside define(...). sed exits 0 even
# when nothing matched, so grep confirms that the new value is really in place.
if sed -i -E "s/define\(\s*'DB_HOST'\s*,.*\);/define( 'DB_HOST', '$NEW_DB_HOST' );/" "$WP_CONFIG_PATH" \
  && grep -qF "'DB_HOST', '$NEW_DB_HOST'" "$WP_CONFIG_PATH"; then
  echo "Successfully updated DB_HOST in $WP_CONFIG_PATH."
  # Optional: Clear cache or restart services
  # systemctl restart php-fpm
  # redis-cli FLUSHALL
else
  echo "Error updating $WP_CONFIG_PATH." >&2
  exit 1
fi
exit 0
This script needs to be executable and accessible by the user Orchestrator runs as. Ensure the DB_USER and DB_PASSWORD remain consistent across all database servers.
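For comparison, the same rewrite can be expressed as a small function that fails loudly when the DB_HOST line is missing or duplicated. This is a sketch under the assumption that wp-config.php contains a single-line `define(...)` for DB_HOST; the function name and the host-validation rule are illustrative choices.

```python
import re

# Matches define('DB_HOST', '...'); with either quote style and flexible spacing.
DB_HOST_RE = re.compile(r"""define\(\s*(['"])DB_HOST\1\s*,\s*(['"]).*?\2\s*\)\s*;""")


def set_db_host(config_text: str, new_host: str) -> str:
    """Return config_text with the DB_HOST definition pointing at new_host."""
    # Reject anything that is not a plain host or host:port, to avoid injecting PHP.
    if not re.fullmatch(r"[A-Za-z0-9.\-:]+", new_host):
        raise ValueError(f"suspicious host value: {new_host!r}")
    replacement = f"define( 'DB_HOST', '{new_host}' );"
    updated, count = DB_HOST_RE.subn(lambda _match: replacement, config_text)
    if count != 1:
        raise ValueError(f"expected exactly one DB_HOST definition, found {count}")
    return updated


original = "<?php\ndefine('DB_NAME', 'wp');\ndefine(  \"DB_HOST\" ,'db-primary.internal');\n"
print(set_db_host(original, "db-replica-1.internal"))
```

Raising an error when the count is not exactly one means a silent no-op cannot be mistaken for a successful failover.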
Elasticsearch Client Failover
Your WordPress application (or any other service interacting with Elasticsearch) needs to be resilient to Elasticsearch node failures. This typically involves:
- Using an Elasticsearch client library: Most modern libraries (e.g., official PHP client) have built-in support for multiple hosts and automatic node discovery/reconnection.
- Configuring multiple hosts: Provide a list of all Elasticsearch node addresses to the client.
- Health checks: The client library should periodically check the health of nodes and remove unresponsive ones from its internal list.
Example using the official PHP Elasticsearch client:
<?php
// Namespaces below match the 7.x PHP client; the 8.x client moved
// ClientBuilder to the Elastic\Elasticsearch namespace.
require 'vendor/autoload.php';

$hosts = [
    'http://es-node-1.internal:9200',
    'http://es-node-2.internal:9200',
    'http://es-node-3.internal:9200',
];
$client = Elasticsearch\ClientBuilder::create()
    ->setHosts($hosts)
    ->build();
try {
    // Perform an operation, e.g., index a document
    $params = [
        'index' => 'my_index',
        'id' => 'my_id',
        'body' => ['testField' => 'abc']
    ];
    $response = $client->index($params);
    print_r($response);
} catch (\Elasticsearch\Common\Exceptions\NoNodesAvailableException $e) {
    // Handle the case where no Elasticsearch nodes are reachable
    echo "Error: No Elasticsearch nodes available. " . $e->getMessage();
    // Implement fallback logic, e.g., queueing requests
} catch (\Exception $e) {
    echo "An unexpected error occurred: " . $e->getMessage();
}
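When the elected master node fails, the remaining master-eligible nodes elect a new master, provided a majority of them is still reachable. When a data node fails, Elasticsearch promotes replica shards on the surviving nodes to primaries and then rebuilds the missing replicas elsewhere, provided there are enough nodes and replicas configured. By default it waits about a minute (index.unassigned.node_left.delayed_timeout) before rebuilding, in case the node comes back.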
Monitoring and Alerting
Automated failover is only as good as the monitoring that triggers it. Implement comprehensive monitoring for:
- Elasticsearch Cluster Health: Use Elasticsearch’s `_cluster/health` API to check status (green, yellow, red) and node counts.
- MySQL Replication Status: Monitor `SHOW SLAVE STATUS` on replicas.
- Orchestrator Health: Ensure Orchestrator itself is running and healthy.
- Application Connectivity: Monitor WordPress’s ability to connect to its database and Elasticsearch.
- Linode Resource Utilization: CPU, RAM, Disk I/O, Network on all nodes.
Tools like Prometheus with Alertmanager, Datadog, or Nagios are essential. Set up alerts for critical conditions, such as:
- Elasticsearch cluster status is red or yellow.
- MySQL replication lag exceeds a threshold or replication is stopped.
- Orchestrator reports an issue or fails to perform a failover.
- High resource utilization on critical nodes.
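The alert conditions above can be reduced to a couple of small evaluation functions. The sketch below assumes the monitoring system has already fetched the `_cluster/health` response and the replication status; the thresholds and function names are illustrative, not recommendations.

```python
def cluster_alert_level(health: dict, expected_nodes: int) -> str:
    """Map a _cluster/health response to an alert level."""
    status = health.get("status")
    if status not in ("green", "yellow"):
        return "critical"  # red, or a response we do not understand
    if status == "yellow" or health.get("number_of_nodes", 0) < expected_nodes:
        return "warning"
    return "ok"


def replication_alert_level(running: bool, lag_seconds,
                            warn_after: int = 30, crit_after: int = 300) -> str:
    """Map replication thread state and lag (seconds, or None) to an alert level."""
    if not running or lag_seconds is None:
        return "critical"
    if lag_seconds >= crit_after:
        return "critical"
    if lag_seconds >= warn_after:
        return "warning"
    return "ok"


# Abbreviated, illustrative health response after losing one of three nodes.
sample_health = {"status": "yellow", "number_of_nodes": 2, "unassigned_shards": 3}
print(cluster_alert_level(sample_health, expected_nodes=3))
print(replication_alert_level(running=True, lag_seconds=45))
```

Treating an unrecognized status as critical errs on the side of paging someone when the health endpoint itself misbehaves.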
Testing Your Failover Strategy
Regularly test your failover procedures. This is non-negotiable for production systems. Simulate failures:
- Stop the MySQL primary process.
- Reboot an Elasticsearch master node.
- Simulate network partitions.
Verify that the automated failover mechanisms trigger correctly, the application reconnects successfully, and data integrity is maintained. Document the entire process and the results of your tests.
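One result worth recording for each drill is how long the application was actually unavailable. The sketch below computes the longest outage from a series of timestamped probe results. It assumes a separate process has been polling the site during the test, and the function name and sample data are hypothetical.

```python
from typing import List, Tuple


def longest_outage(samples: List[Tuple[float, bool]]) -> float:
    """Longest span, in seconds, from a first failed probe to the next success.

    `samples` holds (timestamp_seconds, probe_ok) pairs in chronological order.
    An outage still in progress at the end is measured up to the last sample.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None and samples:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Hypothetical drill: probes once per second, primary stopped around t=1.
drill = [(0, True), (1, False), (2, False), (3, False), (4, True), (5, True)]
print(f"longest outage: {longest_outage(drill)} s")
```

Comparing this number across drills shows whether configuration changes are shortening or lengthening the failover window.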