Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Magento 2 Deployments on OVH

Elasticsearch Cluster Health and Failover Strategy

Achieving high availability for Elasticsearch, the backbone of Magento 2’s search and catalog functionality, requires a robust failover strategy. We’ll focus on a multi-region deployment on OVH, leveraging their infrastructure for resilience. The core principle is to keep a complete, independently operable copy of the search data in a second, geographically separate data center, so that the loss of a single region does not cripple search operations.

A typical Elasticsearch cluster for a production Magento 2 deployment will consist of multiple master-eligible nodes, data nodes, and potentially coordinating-only nodes. For disaster recovery, we’ll deploy a primary cluster in one OVH region (e.g., GRA) and a secondary, read-only replica cluster in another (e.g., RBX). Data synchronization is paramount. Elasticsearch’s built-in cross-cluster replication (CCR) is the natural mechanism for this. It requires a Platinum or Enterprise subscription, and it needs careful configuration to keep the follower consistent with the leader and replication lag low.

Elasticsearch Cross-Cluster Replication (CCR) Setup

CCR allows you to replicate indices from a leader cluster to a follower cluster. For our DR scenario, the primary cluster in GRA will be the leader, and the secondary cluster in RBX will be the follower. This setup ensures that the RBX cluster is a near real-time replica of the GRA cluster.

First, configure the remote cluster connection on the follower cluster (RBX). This involves defining the leader cluster (GRA) as a remote cluster in the follower’s elasticsearch.yml configuration.

cluster.remote.gra_cluster.seeds: "gra-es-node1.example.com:9300,gra-es-node2.example.com:9300"
cluster.remote.gra_cluster.skip_unavailable: false

Next, on the leader cluster (GRA), configure security if your clusters are protected by the Elasticsearch security features (formerly X-Pack security). The leader needs a role that grants the read_ccr cluster privilege plus monitor and read on the replicated indices. Roles are defined through the security API or roles.yml; the elasticsearch-users tool only manages file-realm users and has to be run on every node.

# On the GRA cluster (leader): define the role via the security API (Kibana Dev Tools syntax)
PUT /_security/role/ccr_reader
{ "cluster": ["read_ccr"], "indices": [ { "names": ["magento_catalog_*"], "privileges": ["monitor", "read"] } ] }

# Create a file-realm user with that role (run on each GRA node)
bin/elasticsearch-users useradd remote_user -p your_secure_password -r ccr_reader

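Now, on the follower cluster (RBX), secure the connection to the leader cluster. Elasticsearch has no per-remote username and password setting in elasticsearch.yml. With certificate-based security, the follower authenticates to the leader at the transport layer, so both clusters must enable transport TLS and trust each other’s certificate authority. RBX also needs a role with the same name as the one on GRA (ccr_reader), here granting the manage_ccr cluster privilege plus monitor, read, write, and manage_follow_index on the follower indices. The certificate paths below are placeholders.

xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/rbx-node.p12
xpack.security.transport.ssl.truststore.path: certs/shared-ca.p12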

After restarting the Elasticsearch nodes in RBX, you should be able to verify the remote cluster connection:

curl -X GET "http://rbx-es-node1.example.com:9200/_remote/info?pretty"
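The same check can be scripted so that provisioning fails fast when the link is down. The sketch below assumes the response layout of the remote cluster info API (a top-level object keyed by cluster alias, each entry carrying a connected flag). The helper name remote_is_connected and the sample payload are hypothetical.

```python
def remote_is_connected(remote_info: dict, alias: str = "gra_cluster") -> bool:
    """Return True when the given remote cluster alias reports an active connection."""
    entry = remote_info.get(alias)
    if not isinstance(entry, dict):
        return False
    return bool(entry.get("connected"))


# Example payload, trimmed to the fields used here.
sample = {
    "gra_cluster": {"connected": True, "mode": "sniff", "num_nodes_connected": 2}
}
print(remote_is_connected(sample))  # True
```

In practice the dictionary would be the parsed JSON body returned by the curl call above.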

Once the remote connection is established, you can configure the replication of specific indices. For Magento 2, this typically includes indices related to catalog, search, and potentially logs. You can replicate individual indices or entire index patterns.

PUT _ccr/auto_follow/magento_catalog_follower
{
  "remote_cluster": "gra_cluster",
  "leader_index_patterns": ["magento_catalog_*"],
  "follow_index_pattern": "{{leader_index}}",
  "max_read_request_operation_count": 5120,
  "max_read_request_size": "32mb",
  "max_write_request_operation_count": 1000,
  "max_write_request_size": "9223372036854775807b",
  "max_outstanding_read_requests": 12,
  "max_outstanding_write_requests": 15,
  "max_write_buffer_count": 2147483647,
  "max_write_buffer_size": "512mb",
  "max_retry_delay": "500ms",
  "read_poll_timeout": "1m"
}

The auto_follow configuration automatically creates follower indices for any new leader indices matching the pattern. This is crucial for dynamic index creation in Magento 2.
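Auto-follow patterns only pick up indices created after the pattern exists, so indices that are already present on the leader need an explicit create-follower call. As a minimal sketch, the helper below builds that request (PUT /<follower_index>/_ccr/follow) for an index that keeps the leader's name. The function name build_follow_request is hypothetical, and sending the request is left to whatever HTTP client you already use.

```python
import json


def build_follow_request(leader_index: str, remote_cluster: str = "gra_cluster"):
    """Build the method, path and JSON body for a CCR create-follower call."""
    # The follower index is given the same name as the leader index.
    path = f"/{leader_index}/_ccr/follow?wait_for_active_shards=1"
    body = {"remote_cluster": remote_cluster, "leader_index": leader_index}
    return "PUT", path, json.dumps(body)


method, path, body = build_follow_request("magento_catalog_1")
print(method, path)
print(body)
```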

Automated Failover Orchestration

Manual failover is not an option for true disaster recovery. We need an automated system that monitors the health of the primary Elasticsearch cluster and initiates a failover to the secondary cluster when necessary. This can be achieved using a combination of external monitoring tools and custom scripts.

Monitoring Strategy:

Primary Cluster Health: Regularly poll the Elasticsearch cluster health API (GET _cluster/health) on the primary cluster (GRA). Key metrics to monitor include status (should be ‘green’ or ‘yellow’), number_of_nodes, and unassigned_shards.
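CCR Lag: Monitor replication with the follow stats API (GET <follower_index>/_ccr/stats) on the secondary cluster. The gap between leader_global_checkpoint and follower_global_checkpoint on each shard shows how many operations the follower is behind.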
Network Connectivity: Ensure network connectivity between the primary and secondary regions.
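As a rough illustration of the CCR lag check, the sketch below derives per-index lag from a follow stats response. It assumes the documented response layout (an indices array whose entries carry per-shard leader_global_checkpoint and follower_global_checkpoint values). The function name ccr_lag_operations and the sample payload are hypothetical.

```python
def ccr_lag_operations(follow_stats: dict) -> dict:
    """Map each follower index to its worst per-shard lag, counted in operations."""
    lag = {}
    for index_entry in follow_stats.get("indices", []):
        worst = 0
        for shard in index_entry.get("shards", []):
            behind = shard["leader_global_checkpoint"] - shard["follower_global_checkpoint"]
            worst = max(worst, behind)
        lag[index_entry["index"]] = worst
    return lag


sample = {"indices": [{"index": "magento_catalog_1", "shards": [
    {"shard_id": 0, "leader_global_checkpoint": 1500, "follower_global_checkpoint": 1480},
    {"shard_id": 1, "leader_global_checkpoint": 900, "follower_global_checkpoint": 900},
]}]}
print(ccr_lag_operations(sample))  # {'magento_catalog_1': 20}
```

Lag measured in operations is independent of clock differences between regions, which makes it a convenient number to alert on.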

Failover Trigger: A failover should be triggered if:

The primary cluster’s health status becomes ‘red’.
A significant number of nodes in the primary cluster become unreachable.
CCR lag exceeds a predefined acceptable threshold for an extended period.
Network connectivity to the primary cluster is lost.
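A single failed poll is rarely enough evidence for a regional failover, since a brief network blip would otherwise cause flapping. One common way to express "for an extended period" is to require several consecutive unhealthy polls. The class below is a minimal sketch of that idea; the name FailureDebouncer and the threshold of three polls are assumptions, not part of any library.

```python
class FailureDebouncer:
    """Report a sustained failure only after N consecutive unhealthy polls."""

    def __init__(self, required_failures: int = 3):
        self.required_failures = required_failures
        self.consecutive = 0

    def record(self, healthy: bool) -> bool:
        """Record one poll result; return True when failover should be triggered."""
        # Any healthy poll resets the streak.
        self.consecutive = 0 if healthy else self.consecutive + 1
        return self.consecutive >= self.required_failures


debouncer = FailureDebouncer(required_failures=3)
results = [debouncer.record(h) for h in [False, False, True, False, False, False]]
print(results)  # [False, False, False, False, False, True]
```

With a 30-second polling interval, three required failures means the primary must be unhealthy for roughly a minute and a half before any traffic is moved.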

Failover Scripting and Execution

A dedicated failover orchestrator (e.g., a Python script running on a separate, highly available instance, or a managed Kubernetes operator) will be responsible for monitoring and executing the failover. This orchestrator will:

Periodically check the health of the primary Elasticsearch cluster.
If a failure is detected, initiate a controlled failover process.
Update DNS records or load balancer configurations to point Magento 2 instances to the secondary Elasticsearch cluster.
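Convert the follower indices on the secondary cluster into regular indices (pause following, close, unfollow, reopen). Follower indices are read-only, so this step is required before Magento 2 can write to the now-primary cluster.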
Notify relevant teams via alerting systems (e.g., PagerDuty, Slack).

Here’s a conceptual Python script snippet for monitoring and triggering a failover. This script would run on a separate, resilient VM or container.

import requests
import time

PRIMARY_ES_URL = "http://gra-es-node1.example.com:9200"
SECONDARY_ES_URL = "http://rbx-es-node1.example.com:9200"
MONITOR_INTERVAL = 30  # seconds
FAILOVER_THRESHOLD_UNASSIGNED_SHARDS = 10
FAILOVER_THRESHOLD_CCR_LAG_SECONDS = 600 # 10 minutes

def get_cluster_health(es_url):
    try:
        response = requests.get(f"{es_url}/_cluster/health", timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error getting cluster health from {es_url}: {e}")
        return None

def get_ccr_stats(es_url, index_name):
    try:
        # Follow stats endpoint: GET /<follower_index>/_ccr/stats
        response = requests.get(f"{es_url}/{index_name}/_ccr/stats", timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error getting CCR stats for {index_name} from {es_url}: {e}")
        return None

def trigger_failover():
    print("Initiating failover to secondary cluster...")
    # In a real-world scenario, this would involve:
    # 1. Updating DNS records (e.g., via OVH API or a DNS provider API)
    # 2. Reconfiguring Magento 2 application instances (e.g., via Ansible, Kubernetes config maps)
    # 3. Converting follower indices on the secondary cluster into regular, writable indices
    #    (per index: POST <index>/_ccr/pause_follow, POST <index>/_close,
    #    POST <index>/_ccr/unfollow, POST <index>/_open)
    # 4. Sending alerts
    print("Failover initiated. Please verify manually and update DNS/configs.")
    # Example: Update DNS (requires OVH API client or equivalent)
    # update_dns_record("example.com", "magento-es", SECONDARY_ES_IP)

def trigger_failback():
    print("Initiating failback to primary cluster...")
    # Similar steps as failover, but pointing back to primary. CCR is one-way, so changes
    # made on the secondary must first be replicated or reindexed back to the primary.
    print("Failback initiated. Please verify manually and update DNS/configs.")
    # Example: Update DNS
    # update_dns_record("example.com", "magento-es", PRIMARY_ES_IP)

def main():
    failover_in_progress = False
    while True:
        primary_health = get_cluster_health(PRIMARY_ES_URL)
        secondary_health = get_cluster_health(SECONDARY_ES_URL)

        if primary_health is None:
            print("Primary cluster is unreachable. Attempting failover.")
            if not failover_in_progress:
                trigger_failover()
                failover_in_progress = True
            time.sleep(MONITOR_INTERVAL)
            continue

        if primary_health.get("status") == "red" or \
           primary_health.get("number_of_nodes", 0) < 2 or \
           primary_health.get("unassigned_shards", 0) > FAILOVER_THRESHOLD_UNASSIGNED_SHARDS:
            print(f"Primary cluster health is critical: {primary_health.get('status')}, unassigned shards: {primary_health.get('unassigned_shards', 0)}. Attempting failover.")
            if not failover_in_progress:
                trigger_failover()
                failover_in_progress = True
        else:
            # Check CCR lag if failover is not in progress
            if not failover_in_progress:
                # Assuming we are replicating a specific index, e.g., 'magento_catalog_1'
                # In a real scenario, you'd iterate through critical indices or use auto-follow stats
                ccr_stats = get_ccr_stats(SECONDARY_ES_URL, "magento_catalog_1")
                if ccr_stats and ccr_stats.get("indices"):
                    for index_data in ccr_stats["indices"]:
                        if index_data.get("index") != "magento_catalog_1":
                            continue
                        # Follow stats are reported per shard. time_since_last_read_millis is
                        # used here as a proxy for lag: it keeps growing while the follower
                        # cannot read from the leader.
                        stale_ms = max((shard.get("time_since_last_read_millis", 0)
                                        for shard in index_data.get("shards", [])), default=0)
                        if stale_ms / 1000 > FAILOVER_THRESHOLD_CCR_LAG_SECONDS:
                            print(f"CCR lag for magento_catalog_1 is too high: {stale_ms}ms. Attempting failover.")
                            trigger_failover()
                            failover_in_progress = True
                            break

        # If primary is healthy again after a failover, consider failback
        if failover_in_progress and primary_health.get("status") in ["green", "yellow"] and \
           primary_health.get("unassigned_shards", 0) <= FAILOVER_THRESHOLD_UNASSIGNED_SHARDS:
            # Add logic here to confirm the secondary cluster is the active one
            # and that the primary has been stable for a duration.
            # For simplicity, we'll assume a manual failback trigger or a separate process.
            pass

        time.sleep(MONITOR_INTERVAL)

if __name__ == "__main__":
    main()

The failover process needs to be carefully orchestrated. When a failure is detected, the system should:

Stop Writes to Primary: Prevent any new data from being written to the failing primary cluster.
Promote Secondary: Make the secondary cluster (RBX) the active, writable cluster. This involves reconfiguring Magento 2 to point to it.
Update DNS/Load Balancers: Crucially, update DNS records (e.g., magento-es.yourdomain.com) or load balancer configurations to direct all Magento 2 traffic to the IP address of the secondary cluster. This is often the most complex part and requires integration with OVH’s API or your DNS provider.
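Convert Follower Indices on Secondary: Follower indices are read-only, so each one must be paused, closed, unfollowed, and reopened before Magento 2 can write to it (for example during a reindex). This also stops the secondary from trying to replicate from the now-unreachable primary.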
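The promotion of a single follower index can be sketched as an ordered list of REST calls against the secondary cluster. The endpoints are the standard CCR and index APIs (pause follow, close, unfollow, open). The helper name promotion_steps is hypothetical, and executing the calls with authentication and error handling is left out.

```python
def promotion_steps(follower_index: str):
    """Return the ordered REST calls that turn a follower index into a regular index."""
    return [
        ("POST", f"/{follower_index}/_ccr/pause_follow"),  # stop pulling from the leader
        ("POST", f"/{follower_index}/_close"),             # unfollow requires a closed index
        ("POST", f"/{follower_index}/_ccr/unfollow"),      # drop the follower metadata
        ("POST", f"/{follower_index}/_open"),              # reopen as a writable index
    ]


for method, path in promotion_steps("magento_catalog_1"):
    print(method, path)
```

The order matters: the unfollow call is rejected unless following has been paused and the index is closed. Unfollowing is also irreversible, so failing back later means creating a new follower rather than resuming the old one.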

Magento 2 Application Configuration and Failover

Magento 2’s configuration for Elasticsearch is typically managed via environment variables or configuration files. For automated failover, we need a mechanism to dynamically update these settings.

Dynamic Configuration Updates

The most robust approach is to use a configuration management tool (like Ansible, Chef, Puppet) or a container orchestration platform (like Kubernetes) that can dynamically update application configurations. The failover orchestrator would trigger these tools.

Example using Environment Variables (Kubernetes/Docker Compose):

Containerized Magento 2 deployments often inject the Elasticsearch connection through environment variables. The variable names below are illustrative; the failover script would update whichever variables your images actually read for the Magento 2 pods/containers.

# Example of how a deployment tool might update env vars
# This is NOT the failover script itself, but what the failover script would trigger.

# If primary ES is active:
export MAGENTO_ES_HOSTS="gra-es-node1.example.com:9200,gra-es-node2.example.com:9200"
export MAGENTO_ES_INDEX_PREFIX="magento_catalog"

# After failover to secondary ES:
export MAGENTO_ES_HOSTS="rbx-es-node1.example.com:9200,rbx-es-node2.example.com:9200"
# Keep the index prefix unchanged: CCR follower indices carry the same names as their leaders.
export MAGENTO_ES_INDEX_PREFIX="magento_catalog"

# Then, trigger a rolling restart of Magento 2 application pods/services.
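Where Magento 2 reads its search settings from its own configuration store instead of custom environment variables, the orchestrator can generate the equivalent bin/magento calls. The sketch below assumes the elasticsearch7 engine, whose settings live under the catalog/search/elasticsearch7_* configuration paths; an OpenSearch-based install uses different paths. The function name is hypothetical.

```python
def magento_search_config_commands(host: str, port: int = 9200):
    """Build bin/magento invocations that repoint the search engine connection."""
    settings = {
        "catalog/search/elasticsearch7_server_hostname": host,
        "catalog/search/elasticsearch7_server_port": str(port),
    }
    commands = [["bin/magento", "config:set", path, value] for path, value in settings.items()]
    # Flush the configuration cache so the new values take effect.
    commands.append(["bin/magento", "cache:clean", "config"])
    return commands


for command in magento_search_config_commands("rbx-es-node1.example.com"):
    print(" ".join(command))
```

Returning argument lists rather than shell strings lets the caller pass each command straight to subprocess.run without quoting concerns.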

The key is to have a single, resolvable hostname for your Elasticsearch cluster (e.g., magento-es.yourdomain.com) that your failover mechanism updates to point to the active cluster’s IP address. This decouples the Magento 2 application from the underlying physical infrastructure.

Reindexing and Data Consistency Post-Failover

While CCR aims for near real-time replication, there might be a small window of data loss or inconsistency immediately after a catastrophic failure of the primary cluster. After a failover, it’s crucial to:

Verify Data Integrity: Perform spot checks on critical product data and search results on the secondary cluster.
Full Reindex (if necessary): In severe cases where CCR might have failed to replicate certain data, a full reindex on the secondary cluster might be required. This is a disruptive operation and should be a last resort.
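Monitor CCR Lag on Failback: CCR replicates in one direction only, so writes accepted by the secondary during the outage do not flow back automatically. Before failing back, replicate in the reverse direction (GRA following RBX) or reindex on the primary, and confirm that the primary has caught up with all changes from the secondary cluster before switching.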
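A coarse way to sanity-check both data integrity and failback readiness is to compare per-index document counts between the two clusters. The sketch below assumes the counts have already been collected (for example with the count API, GET <index>/_count) into plain dictionaries. The function name and the sample numbers are hypothetical, and matching counts do not prove that the documents themselves are identical.

```python
def indices_out_of_sync(primary_counts: dict, secondary_counts: dict) -> list:
    """List indices whose document counts differ between the two clusters."""
    names = sorted(set(primary_counts) | set(secondary_counts))
    # An index missing on one side yields None there and is reported as a mismatch.
    return [n for n in names if primary_counts.get(n) != secondary_counts.get(n)]


primary = {"magento_catalog_1": 10450, "magento_catalog_2": 320}
secondary = {"magento_catalog_1": 10450, "magento_catalog_2": 335}
print(indices_out_of_sync(primary, secondary))  # ['magento_catalog_2']
```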

OVH Specific Considerations

When deploying on OVH, several factors are critical for a successful DR strategy:

Region Selection and Network Latency

Choose OVH regions that are geographically distant enough to mitigate regional disasters but close enough to maintain acceptable network latency for CCR and potential manual intervention. OVH’s network infrastructure is generally robust, but inter-region latency should be factored into CCR lag tolerance.
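Before settling on a lag threshold, it helps to measure the actual round trip between the two regions. The sketch below times TCP connection setup to a node's transport port using only the standard library. It is a rough proxy for network latency, not a replication benchmark, and the function names are hypothetical.

```python
import socket
import time


def tcp_connect_ms(host: str, port: int, timeout: float = 3.0) -> float:
    """Return the time in milliseconds taken to open a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def median_connect_ms(host: str, port: int, samples: int = 5) -> float:
    """Take several samples and return the median to smooth out outliers."""
    timings = sorted(tcp_connect_ms(host, port) for _ in range(samples))
    return timings[len(timings) // 2]
```

Running median_connect_ms("gra-es-node1.example.com", 9300) from a node in RBX gives a baseline to compare against when replication lag starts to climb.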

IP Failover and DNS Management

OVH offers IP failover services (now marketed as Additional IP), which let you move an IP address from one server to another. You can use this to quickly redirect traffic to a standby server in the DR region. An alternative that often scales better in cloud environments is to manage DNS records via OVH’s API or a third-party DNS provider, keeping record TTLs short so that clients pick up the change quickly. The failover orchestrator would need to interact with these APIs.

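# Example: updating an A record through the OVH API with the python-ovh SDK.
# Credentials are read from ovh.conf or the OVH_* environment variables.
import ovh

client = ovh.Client(endpoint="ovh-eu")  # or your specific endpoint

def update_dns_record(zone, sub_domain, target_ip):
    # Find the IDs of the A records for the subdomain in the zone
    record_ids = client.get(f"/domain/zone/{zone}/record", fieldType="A", subDomain=sub_domain)
    for record_id in record_ids:
        # Point each record at the new target
        client.put(f"/domain/zone/{zone}/record/{record_id}", target=target_ip)
    # Apply the pending changes to the zone
    client.post(f"/domain/zone/{zone}/refresh")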

Ensure your Magento 2 instances are configured to use the dynamically updated DNS name for Elasticsearch, not hardcoded IPs.

Security and Access Control

Secure the communication between your Elasticsearch clusters using TLS. Configure firewall rules within OVH’s control panel to only allow necessary traffic between your Elasticsearch nodes and between your Magento 2 application servers and the Elasticsearch clusters. Use dedicated service accounts for CCR and application access.
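On the client side, the orchestrator and any health checks should verify the cluster's certificate and authenticate with a dedicated account. The standard-library sketch below builds a verifying TLS context and a Basic auth header. The account name magento_search and the CA path mentioned in the comment are placeholders.

```python
import base64
import ssl


def build_tls_context(ca_file=None) -> ssl.SSLContext:
    """Create a client context that verifies the server certificate and hostname."""
    context = ssl.create_default_context(cafile=ca_file)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def basic_auth_header(username: str, password: str) -> dict:
    """Build an HTTP Basic Authorization header for a dedicated service account."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


context = build_tls_context()  # pass ca_file="/path/to/ca.crt" for a private CA
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)  # True True
headers = basic_auth_header("magento_search", "change-me")
```

The context can be passed to urllib.request.urlopen through its context argument. In a real deployment the password would come from a secret store, not from source code.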

Testing and Validation

Regular, scheduled DR testing is non-negotiable. This involves simulating failures (e.g., shutting down nodes in the primary region, blocking network traffic) and verifying that the automated failover process works as expected. Document the entire process, including manual steps for failback and recovery.
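A drill is more useful when it produces a number. The sketch below measures how long search takes to become healthy again after a simulated failure by polling a caller-supplied health check. The function name is hypothetical, and the clock and sleep parameters exist only so the loop can be exercised without waiting in real time.

```python
import time


def measure_recovery_seconds(is_healthy, timeout=300.0, interval=1.0,
                             clock=time.monotonic, sleep=time.sleep):
    """Poll is_healthy() until it returns True; return elapsed seconds, or None on timeout."""
    start = clock()
    while clock() - start <= timeout:
        if is_healthy():
            return clock() - start
        sleep(interval)
    return None


# Simulated drill: the service reports healthy on the third poll.
answers = iter([False, False, True])
elapsed = measure_recovery_seconds(lambda: next(answers), timeout=5.0, interval=0.01)
print(f"recovered after {elapsed:.2f}s")
```

In a real drill, is_healthy would query the hostname Magento 2 uses for search and return True once the cluster health status is green or yellow. Recording the result of each drill shows whether recovery time drifts as the catalog grows.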

A comprehensive disaster recovery strategy for Elasticsearch and Magento 2 on OVH hinges on robust automation, meticulous configuration, and continuous testing. By implementing cross-cluster replication and an intelligent failover orchestrator, you can significantly minimize downtime and data loss in the event of a regional outage.