Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Magento 2 Deployments on DigitalOcean

Elasticsearch Cluster Architecture for High Availability

Achieving true disaster recovery for Elasticsearch hinges on a robust, multi-node cluster design with automatic failover capabilities. For a Magento 2 deployment, this typically involves a dedicated Elasticsearch cluster, separate from the web and database tiers, to ensure optimal performance and resilience. We’ll focus on a setup utilizing DigitalOcean Droplets, leveraging Elasticsearch’s built-in master election and shard allocation mechanisms.

A minimum of three master-eligible nodes is crucial for reliable quorum-based master election. A master can only be elected by a majority of the master-eligible nodes, so a three-node cluster tolerates the loss of one node without risking a split-brain scenario. Data nodes handle indexing and searching; in smaller deployments they can also be master-eligible to simplify management. For very large clusters, dedicated coordinating-only nodes can be introduced to take request routing and result merging off the data nodes.
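The quorum arithmetic behind the three-node recommendation can be sketched in a few lines of Python. This is plain majority math only; the function names are illustrative, and Elasticsearch's own voting configuration management is more involved than this.

```python
def quorum(master_eligible: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)
```

With three master-eligible nodes, `quorum(3)` is 2 and `tolerated_failures(3)` is 1. With two nodes the quorum is also 2, so losing either node leaves no majority, which is why two master-eligible nodes are no safer than one.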

Configuring Elasticsearch for Master Election and Discovery

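The core of Elasticsearch’s HA lies in its discovery and master election configuration. The settings below assume Elasticsearch 7.x or later, where the built-in cluster coordination subsystem replaced the older Zen Discovery module and its discovery.zen.* settings. The key parameters are found in elasticsearch.yml.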

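Ensure each node in your cluster can communicate with every other node. For a DigitalOcean setup, this means configuring your firewall (e.g., using `ufw` or DigitalOcean’s Cloud Firewalls) to allow transport traffic between the Elasticsearch nodes on port 9300 (the transport layer defaults to the range 9300-9400; the configuration below pins it to 9300) and HTTP API traffic on port 9200 from the hosts that query the cluster, such as the Magento 2 web servers and your monitoring Droplet. Restrict both rules to private IPs.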
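As a rough sketch, the firewall rules can be generated from a list of private IPs instead of being typed by hand on every Droplet. The helper name and the client IP in the usage note are hypothetical; the emitted lines use the standard `ufw allow from ... to any port ... proto tcp` form.

```python
import ipaddress


def ufw_rules(es_nodes, http_clients, transport_port=9300, http_port=9200):
    """Build ufw commands: transport between ES nodes, HTTP from client hosts."""
    for ip in list(es_nodes) + list(http_clients):
        # Refuse public addresses so the cluster is never opened to the internet.
        if not ipaddress.ip_address(ip).is_private:
            raise ValueError(f"{ip} is not a private address")
    rules = [
        f"ufw allow from {ip} to any port {transport_port} proto tcp"
        for ip in es_nodes
    ]
    rules += [
        f"ufw allow from {ip} to any port {http_port} proto tcp"
        for ip in http_clients
    ]
    return rules
```

Calling `ufw_rules(["10.10.0.1", "10.10.0.2", "10.10.0.3"], ["10.10.0.10"])` returns three transport rules followed by one HTTP rule, which you can review before running them on each node.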

`elasticsearch.yml` Configuration Snippet

On each Elasticsearch node, configure the following:

cluster.name: "magento-es-cluster"
node.name: "${HOSTNAME}" # Or a static name if preferred
network.host: 0.0.0.0 # Binds to all interfaces, including the public one; prefer the Droplet's private IP
discovery.seed_hosts:
  - "10.10.0.1:9300" # Replace with private IPs of your master-eligible nodes
  - "10.10.0.2:9300"
  - "10.10.0.3:9300"
cluster.initial_master_nodes:
  - "node-1" # Must match the node.name values of your initial master-eligible nodes exactly
  - "node-2"
  - "node-3"
http.port: 9200
transport.port: 9300
# For production, consider security settings like xpack.security.enabled: true
# and proper authentication/authorization.

Explanation:

cluster.name: Must be identical across all nodes in the cluster.
node.name: A unique identifier for each node. Using the hostname is convenient.
network.host: The IP address Elasticsearch will bind to. Use 0.0.0.0 to bind to all interfaces or a specific private IP for better security.
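discovery.seed_hosts: A list of addresses (and optionally ports) of the master-eligible nodes that a starting node contacts to discover the cluster.
cluster.initial_master_nodes: The node names (matching node.name exactly) of the master-eligible nodes that vote in the very first election. It is only used when a brand-new cluster bootstraps; remove it from the configuration once the cluster has formed, and do not set it on nodes that join an existing cluster.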
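A common bootstrapping mistake is listing names in cluster.initial_master_nodes that do not match any node.name. A small pre-flight check, sketched here with a hypothetical helper name, can catch that before the first start:

```python
def check_bootstrap(node_names, initial_master_nodes):
    """Return a list of problems found in the bootstrap settings (empty if none)."""
    problems = []
    known = set(node_names)
    if len(known) != len(node_names):
        problems.append("node.name values are not unique")
    unknown = [name for name in initial_master_nodes if name not in known]
    if unknown:
        problems.append(
            "initial_master_nodes entries match no node.name: " + ", ".join(unknown)
        )
    if not initial_master_nodes:
        problems.append("initial_master_nodes is empty; a new cluster cannot bootstrap")
    return problems
```

For example, if node.name resolves to hostnames such as `es-a`, `es-b`, and `es-c` while the list still says `node-1`, the check reports the mismatch instead of leaving the cluster waiting for nodes that never appear.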

Shard Allocation and Replication for Resilience

To ensure data availability even if a node fails, Elasticsearch uses shards and replicas. Primary shards are where data is written, and replica shards are copies. If a node holding a primary shard fails, a replica can be promoted to become the new primary.

For Magento 2, you’ll typically configure the number of replicas when creating indices. A common strategy is to have at least one replica (number_of_replicas: 1) for basic redundancy. For higher availability and disaster recovery across different availability zones or regions (though we’re focusing on a single DigitalOcean region here), you’d increase the replica count.

Example Index Template Configuration

You can set default index settings using index templates. This ensures new indices created by Magento 2 (or other applications) inherit these settings.

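PUT _index_template/magento_defaults
{
  "index_patterns": ["magento2*"],
  "template": {
    "settings": {
      "index": {
        "number_of_shards": 3,
        "number_of_replicas": 2,
        "refresh_interval": "1s"
      }
    }
  }
}

In this example, we’ve set 3 primary shards and 2 replicas for any index matching the pattern magento2*. Magento’s default index prefix is magento2, which produces names such as magento2_product_1_v1, so a pattern of magento2-* would not match them; adjust the pattern if you changed the prefix. Every primary shard has two additional copies, each on a different node, so on a three-node cluster every node holds a full copy of the data. If a node fails, Elasticsearch promotes a replica on a surviving node for each primary that was lost, so no data is lost and search and indexing continue. On three nodes the lost copies cannot be reassigned, because two copies of the same shard never share a node; the cluster reports yellow until the failed node returns or a replacement joins.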
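The relationship between replica count, node count, and cluster status can be worked through with a small calculation. This is a simplified model with an illustrative function name: it assumes the only constraint is that two copies of one shard never share a node, and it ignores disk watermarks, allocation filtering, and failures that destroy every copy of a shard at once.

```python
def shard_layout(primaries: int, replicas: int, data_nodes: int) -> dict:
    """Estimate shard copies, unassigned copies, and the resulting status."""
    copies_per_shard = replicas + 1              # the primary plus its replicas
    total_copies = primaries * copies_per_shard
    if data_nodes < 1:
        return {"total_copies": total_copies, "unassigned": total_copies, "status": "red"}
    assignable = min(copies_per_shard, data_nodes)   # at most one copy per node
    unassigned = primaries * (copies_per_shard - assignable)
    status = "green" if unassigned == 0 else "yellow"
    return {"total_copies": total_copies, "unassigned": unassigned, "status": status}
```

`shard_layout(3, 2, 3)` gives 9 copies and a green cluster, while `shard_layout(3, 2, 2)` (one node lost) leaves 3 replicas unassigned and a yellow status. With one replica, `shard_layout(3, 1, 2)` can return to green on the two surviving nodes once replicas are rebuilt.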

Automating Failover with DigitalOcean and External Monitoring

Elasticsearch’s internal mechanisms handle node failures within the cluster. However, for true disaster recovery, especially in scenarios like a Droplet failure or an entire datacenter issue, we need external orchestration. This involves monitoring the health of the Elasticsearch cluster and, if necessary, triggering failover actions for dependent services like Magento 2.

Health Checks and Monitoring

We need a reliable way to determine if the Elasticsearch cluster is healthy and accessible. This can be done via:

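Elasticsearch API Health Check: The _cluster/health endpoint reports the cluster’s status: green (all shards assigned), yellow (all primaries assigned, some replicas unassigned), or red (at least one primary unassigned).
Application-Level Checks: Verify from the Magento 2 side that the store can reach Elasticsearch and return results, for example by requesting a storefront search page and checking the response.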
External Monitoring Tools: Services like Prometheus with Elasticsearch Exporter, Datadog, or even custom scripts running on a separate monitoring Droplet.

Example Health Check Script (Bash)

This script can be run periodically from a monitoring Droplet or a cron job.

#!/bin/bash

ES_HOST="http://your-elasticsearch-loadbalancer-or-primary-node:9200"
EXPECTED_CLUSTER_STATUS="green" # Or "yellow" if you can tolerate unassigned replicas temporarily

# Ask only for the status column; --max-time keeps the check from hanging on an unreachable host
HEALTH_STATUS=$(curl -s --max-time 10 "$ES_HOST/_cat/health?h=status" | tr -d '[:space:]')

if [ "$HEALTH_STATUS" == "$EXPECTED_CLUSTER_STATUS" ]; then
    echo "Elasticsearch cluster is healthy. Status: $HEALTH_STATUS"
    exit 0
else
    # An empty status means the request failed or timed out
    echo "Elasticsearch cluster is UNHEALTHY. Status: ${HEALTH_STATUS:-unreachable}"
    # Trigger failover actions here
    exit 1
fi
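The same check can be sketched in Python, parsing the JSON from _cluster/health instead of matching text. The function names and the choice to accept yellow by default are assumptions made for illustration; only the `status` and `timed_out` fields of the response are used.

```python
import json
import urllib.request

SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def is_healthy(health: dict, worst_allowed: str = "yellow") -> bool:
    """Decide whether a _cluster/health response is acceptable."""
    status = health.get("status")
    if status not in SEVERITY or health.get("timed_out"):
        return False
    return SEVERITY[status] <= SEVERITY[worst_allowed]


def fetch_health(base_url: str, timeout: float = 10.0) -> dict:
    """Fetch _cluster/health; raises OSError if the cluster is unreachable."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)
```

A cron wrapper would call `fetch_health(...)`, treat `OSError` or `ValueError` as unhealthy, and exit non-zero when `is_healthy(...)` returns False. Accepting yellow avoids triggering a failover when a single node is down but every primary is still being served.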

Orchestrating Magento 2 Failover

When the Elasticsearch cluster is deemed unhealthy, Magento 2 needs to react. The most common approach is to switch to a fallback mechanism or gracefully degrade functionality. For Magento 2, this typically means:

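Disabling Elasticsearch Integration: Temporarily remove Magento’s reliance on Elasticsearch. On Magento 2.3.x and earlier this can mean switching catalog search to the MySQL engine. Magento 2.4 removed the MySQL search engine, so on 2.4 and later the realistic options are pointing Magento at a standby Elasticsearch cluster or accepting degraded search until the cluster recovers.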
Notifying Administrators: Sending alerts via email, Slack, or PagerDuty.
Initiating Recovery Procedures: This could involve attempting to restart Elasticsearch nodes, provisioning new nodes, or failing over to a completely separate, standby Elasticsearch cluster in another region.

Magento 2 Configuration for Fallback

Magento 2 allows configuration of search engines. While not a direct “failover” in the sense of automatic switching, you can script the process of changing the search engine configuration.

# Example: Switch to MySQL search (Magento 2.3.x and earlier only; Magento 2.4 removed the MySQL search engine)
# This command needs to be run from your Magento root directory with appropriate permissions.
# You would typically wrap this in a script triggered by the health check failure.

# Run the health check and branch on its exit code directly.
# ./check_es_health.sh is a placeholder path for the health check script shown earlier.

if ! ./check_es_health.sh; then # If health check failed
    echo "Elasticsearch is down. Attempting to switch Magento to MySQL search..."
    php bin/magento config:set catalog/search/engine mysql
    php bin/magento indexer:reindex catalogsearch_fulltext # Rebuild the search index for the new engine
    php bin/magento cache:flush
    echo "Magento search engine switched to MySQL."

    # Optionally, send an alert
    # send_alert "Elasticsearch is down. Magento search switched to MySQL."
else
    echo "Elasticsearch is healthy. No action needed."
fi

To implement automatic failover, the bash script above (or a more sophisticated Python/Ansible playbook) would execute the php bin/magento config:set ... commands when the Elasticsearch health check fails. You would also need a mechanism to detect when Elasticsearch is back online and then script the switch back to Elasticsearch.
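One way to handle both directions without flapping is to require several consecutive failed checks before failing over and several consecutive healthy checks before switching back. The sketch below is a minimal state machine under that assumption; the class, function, thresholds, and the "fallback" label are all illustrative, and the caller is expected to run the actual `bin/magento` commands when an action is returned.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    engine: str = "elasticsearch"  # what Magento is currently configured to use
    bad_streak: int = 0            # consecutive failed health checks
    good_streak: int = 0           # consecutive successful health checks


def step(state: FailoverState, healthy: bool, fail_after: int = 3, recover_after: int = 5):
    """Record one health-check result; return 'failover', 'failback', or None."""
    if healthy:
        state.good_streak += 1
        state.bad_streak = 0
    else:
        state.bad_streak += 1
        state.good_streak = 0

    if state.engine == "elasticsearch" and state.bad_streak >= fail_after:
        state.engine = "fallback"
        return "failover"
    if state.engine == "fallback" and state.good_streak >= recover_after:
        state.engine = "elasticsearch"
        return "failback"
    return None
```

Because the streak counters reset on every change of result, a single timed-out check never triggers a switch, and recovery is deliberately slower than failover. Persist the state between cron runs (for example in a small file) so the streaks survive across invocations.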

Leveraging DigitalOcean Load Balancers

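For the Magento 2 web servers to reliably connect to the Elasticsearch cluster through a single endpoint, a DigitalOcean Load Balancer is a practical option. It distributes requests across your Elasticsearch nodes (specifically, the HTTP API on port 9200). Because a Load Balancer is normally reachable on a public IP, enable Elasticsearch security and restrict access before exposing the API through it.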

Load Balancer Configuration

1. Create a Load Balancer: In the DigitalOcean control panel, create a new Load Balancer.

Forwarding Rules: Add a forwarding rule that accepts HTTP traffic (port 80, or HTTPS on 443 if using SSL) and forwards it to port 9200 on the Droplets.
Droplets: Add your Elasticsearch Droplets to the Load Balancer, by name or by tag. Keep the Load Balancer in the same VPC so it reaches them over their private IPs.
Health Checks: Configure an HTTP health check on port 9200 with the path /_cluster/health. A node that is down, or that cannot reach an elected master in time, fails the check and drops out of rotation. Set a reasonable interval (e.g., 10 seconds) and failure threshold (e.g., 3 failures).

Magento 2’s configuration for Elasticsearch should then point to the Load Balancer’s IP address or hostname, rather than individual Elasticsearch nodes. This way, if one Elasticsearch node (and thus its IP in the backend pool) becomes unhealthy, the Load Balancer will automatically stop sending traffic to it.
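If you prefer to create the Load Balancer through the DigitalOcean API rather than the control panel, the request body could look roughly like the following. The field names reflect the DigitalOcean API v2 load balancer schema as I understand it, so verify them against the current API reference; the helper name, the Droplet IDs, and the chosen thresholds are assumptions.

```python
def load_balancer_payload(name, region, droplet_ids, vpc_uuid=None):
    """Build a request body for creating a Load Balancer in front of Elasticsearch."""
    payload = {
        "name": name,
        "region": region,
        "forwarding_rules": [
            {
                "entry_protocol": "http",
                "entry_port": 80,        # port Magento connects to
                "target_protocol": "http",
                "target_port": 9200,     # Elasticsearch HTTP API on the Droplets
            }
        ],
        "health_check": {
            "protocol": "http",
            "port": 9200,
            "path": "/_cluster/health",
            "check_interval_seconds": 10,
            "response_timeout_seconds": 5,
            "unhealthy_threshold": 3,
            "healthy_threshold": 5,
        },
        "droplet_ids": list(droplet_ids),
    }
    if vpc_uuid:
        payload["vpc_uuid"] = vpc_uuid  # keep traffic on the private network
    return payload
```

The resulting dictionary would be sent as JSON in a POST to the load balancers endpoint with your API token. Keeping it in code makes the interval and threshold values reviewable alongside the rest of the recovery tooling.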

Advanced: Cross-Region Failover with Elasticsearch Cross-Cluster Replication (CCR)

For true disaster recovery that protects against entire datacenter failures, consider setting up a secondary Elasticsearch cluster in a different DigitalOcean region. Elasticsearch’s Cross-Cluster Replication (CCR) allows you to replicate indices from your primary cluster to a secondary cluster asynchronously.

CCR Setup Overview

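1. Configure Remote Clusters: Register the other cluster as a remote cluster under an alias, using the cluster.remote.<alias>.seeds setting in elasticsearch.yml. For one-way replication only the secondary (follower) cluster strictly needs this; configuring both directions prepares you for failing back later. Note that CCR requires a paid Elastic subscription tier and is not part of the free Basic license.

# On Primary Cluster (e.g., NYC1)
cluster.remote.secondary_cluster_alias.seeds: ["secondary-es-node-1:9300", "secondary-es-node-2:9300"]

# On Secondary Cluster (e.g., SFO2)
cluster.remote.primary_cluster_alias.seeds: ["primary-es-node-1:9300", "primary-es-node-2:9300"]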

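2. Configure Auto-Follow Patterns: On the secondary cluster, create a named auto-follow pattern so that indices newly created on the primary cluster are replicated automatically. Indices that already exist on the primary are not picked up by the pattern and must be followed individually with the follow API (PUT /<follower_index>/_ccr/follow). The pattern magento2* matches Magento’s default magento2_ index prefix.

PUT /_ccr/auto_follow/magento_indices
{
  "remote_cluster": "primary_cluster_alias",
  "leader_index_patterns": ["magento2*"],
  "follow_index_pattern": "{{leader_index}}"
}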

3. Failover Orchestration: In the event of a disaster, the failover process would involve:

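Promoting the secondary cluster to be the primary. Follower indices are read-only, so each one must be paused, closed, unfollowed, and reopened before it accepts writes (this is a manual step or requires advanced automation).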
Updating DNS records or Load Balancer configurations to point Magento 2 to the new primary cluster in the secondary region.
Ensuring Magento 2 is configured to use this new cluster.
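The promotion step can be sketched as an ordered list of REST calls per follower index: pause following, close the index, unfollow, then reopen it as a regular writable index. The helper below only builds that list (its name is illustrative); sending the requests, with authentication and error handling, is left to your orchestration tooling.

```python
def promotion_requests(follower_indices):
    """Return (method, path) pairs that turn follower indices into writable indices."""
    requests = []
    for index in follower_indices:
        requests += [
            ("POST", f"/{index}/_ccr/pause_follow"),  # stop pulling from the leader
            ("POST", f"/{index}/_close"),             # unfollow requires a closed index
            ("POST", f"/{index}/_ccr/unfollow"),      # convert to a regular index
            ("POST", f"/{index}/_open"),              # reopen for reads and writes
        ]
    return requests
```

Unfollowing is one-way: once an index has been converted it cannot resume following, so failing back later means setting up replication in the opposite direction.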

This cross-region setup provides the highest level of resilience but comes with increased complexity and cost. The asynchronous nature of CCR means there will be a small window of data loss during a failover event, which needs to be factored into your RPO (Recovery Point Objective).
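To put a number on that window, you can compare leader and follower checkpoints from the follower stats API (GET /<follower_index>/_ccr/stats). The sketch below assumes the response lists shards under `indices[].shards[]` with `leader_global_checkpoint` and `follower_global_checkpoint` fields; confirm the field names against the CCR documentation for your version. The function name is illustrative.

```python
def replication_lag(ccr_stats: dict) -> dict:
    """Return, per follower index, how many operations it is behind the leader."""
    lag = {}
    for index in ccr_stats.get("indices", []):
        behind = 0
        for shard in index.get("shards", []):
            delta = shard["leader_global_checkpoint"] - shard["follower_global_checkpoint"]
            behind += max(0, delta)  # never report a negative lag
        lag[index["index"]] = behind
    return lag
```

Tracking this value over time in your monitoring system shows how many write operations would be lost if the primary region disappeared at that moment, which is the practical measure of whether the setup meets your RPO.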