Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Laravel Deployments on OVH

Elasticsearch Cluster Health and Failover Strategies

Achieving high availability for Elasticsearch hinges on a cluster configuration that anticipates node failures. For production deployments, especially those supporting critical Laravel applications, run at least three master-eligible nodes spread across separate failure domains (different hosts, racks, or nearby data centers with low latency between them). Elasticsearch elects a single active master from these nodes at any time, so this is not a multi-master design. Nor is it simple replication; it depends on shard allocation, quorum management, and automated recovery mechanisms.

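A fundamental aspect is understanding Elasticsearch’s quorum mechanism. For a cluster to remain operational, a majority of master-eligible nodes must be able to communicate, which prevents split-brain scenarios. In Elasticsearch 6.x and earlier you enforce this yourself with discovery.zen.minimum_master_nodes. Its default of 1 is dangerously insufficient for production; set it to (N / 2) + 1, rounding N / 2 down, where N is the number of master-eligible nodes. From 7.0 onward the cluster maintains its voting configuration automatically. The setting is ignored in 7.x and removed in 8.x, so the majority rule still applies but you no longer configure it by hand.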
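The majority rule is easy to check with a few lines of Python. This is a minimal sketch for reasoning about cluster sizes; quorum_size and tolerated_failures are illustrative names, not part of any Elasticsearch API.

```python
def quorum_size(master_eligible_nodes: int) -> int:
    """Smallest majority of the master-eligible nodes: floor(N / 2) + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible_nodes - quorum_size(master_eligible_nodes)


for n in (2, 3, 4, 5):
    print(f"N={n}: quorum={quorum_size(n)}, tolerated failures={tolerated_failures(n)}")
```

Note that two and four master-eligible nodes tolerate no more failures than one and three respectively, which is why odd node counts are the usual choice.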

Configuring Master-Eligible Nodes and Quorum

On your OVH instances, ensure your Elasticsearch configuration files (typically /etc/elasticsearch/elasticsearch.yml) reflect a resilient master setup. For a cluster with three master-eligible nodes, the quorum is 2. If you scale to five master-eligible nodes, it becomes 3. The example below uses the discovery settings introduced in 7.0; the quorum value only has to be set explicitly on 6.x and earlier.

cluster.name: "my-production-cluster"
node.name: ${HOSTNAME}
network.host: 0.0.0.0
discovery.seed_hosts:
  - "es-node-1.example.com:9300"
  - "es-node-2.example.com:9300"
  - "es-node-3.example.com:9300"
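# Entries must match each node's node.name exactly (here, the value of ${HOSTNAME})
cluster.initial_master_nodes:
  - "es-node-1.example.com"
  - "es-node-2.example.com"
  - "es-node-3.example.com"
# discovery.zen.minimum_master_nodes: 2 # Only for 6.x and earlier (3 master-eligible nodes); ignored in 7.x, removed in 8.x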
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true

The cluster.initial_master_nodes setting is only used to bootstrap a brand-new cluster. Once the cluster has formed, remove it from every node’s configuration, and do not set it on nodes that join an existing cluster. For automated deployments, consider using environment variables or a configuration management tool like Ansible to manage these settings dynamically.
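As one way to manage these settings dynamically, the discovery section can be rendered from a node list, with the bootstrap setting included only for a first start. This is a minimal sketch, not an Ansible replacement; render_discovery_settings and its parameters are hypothetical names, and it assumes each node’s node.name equals the host name passed in.

```python
def render_discovery_settings(node_names, transport_port=9300, bootstrap=False):
    """Render the discovery part of elasticsearch.yml as text.

    bootstrap=True adds cluster.initial_master_nodes, which is only
    wanted the first time a brand-new cluster starts.
    """
    lines = ["discovery.seed_hosts:"]
    lines += [f'  - "{name}:{transport_port}"' for name in node_names]
    if bootstrap:
        lines.append("cluster.initial_master_nodes:")
        lines += [f'  - "{name}"' for name in node_names]
    return "\n".join(lines) + "\n"


nodes = ["es-node-1.example.com", "es-node-2.example.com", "es-node-3.example.com"]
print(render_discovery_settings(nodes, bootstrap=True))
```

Calling the function with bootstrap=False produces the configuration to use once the cluster has formed.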

Automated Node Failure Detection and Recovery

Elasticsearch’s built-in mechanisms handle node failures by reallocating shards from failed nodes to available nodes. However, proactive monitoring and automated restart/replacement are essential for rapid recovery. OVH’s cloud infrastructure provides tools for this.

We can leverage OVH’s API or their cloud monitoring tools to detect unresponsive Elasticsearch nodes. A common approach is to use a combination of:

External Health Checks: Tools like Prometheus with `blackbox_exporter` or custom scripts that periodically poll the Elasticsearch HTTP API (/_cluster/health) from an external network.
OVH Instance Monitoring: Utilize OVH’s internal monitoring to detect instance-level problems (e.g., an instance that has stopped or is unreachable over the network).
Automated Remediation: Upon detecting a failed node, trigger an automated process to restart the Elasticsearch service on the affected instance or, if the instance is deemed unrecoverable, provision a new instance and join it to the cluster.
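The detection and remediation steps above can be tied together with a small decision helper that avoids acting on a single failed probe. The sketch below is illustrative: NodeHealthTracker, its thresholds, and the action names are assumptions, and the health probe itself is left to whichever check you use.

```python
from collections import defaultdict


class NodeHealthTracker:
    """Decide what to do with a node based on consecutive failed probes."""

    def __init__(self, threshold=3, max_restarts=2):
        self.threshold = threshold        # failed probes in a row before acting
        self.max_restarts = max_restarts  # restarts to try before replacing
        self.failures = defaultdict(int)
        self.restarts = defaultdict(int)

    def record(self, node, healthy):
        """Record one probe result and return 'ok', 'wait', 'restart' or 'replace'."""
        if healthy:
            self.failures[node] = 0
            self.restarts[node] = 0
            return "ok"
        self.failures[node] += 1
        if self.failures[node] < self.threshold:
            return "wait"
        # Threshold reached: start counting again after taking an action
        self.failures[node] = 0
        if self.restarts[node] < self.max_restarts:
            self.restarts[node] += 1
            return "restart"
        return "replace"


tracker = NodeHealthTracker(threshold=2, max_restarts=1)
for result in (False, False, False, False, True):
    print(tracker.record("es-node-2", result))
```

With these settings a node is restarted after two failed probes and replaced only if it is still failing after the restart.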

For automated restarts, a simple systemd service restart can be triggered:

sudo systemctl restart elasticsearch.service

For more advanced scenarios, such as replacing a failed instance, you’d integrate with OVH’s Public Cloud API. A Python script using the `ovh` SDK could look like the following sketch, where service_name is your Public Cloud project ID:

import subprocess
import time

import ovh
import requests


def get_es_node_status(node_ip, auth=None, ca_bundle=True):
    """Return True if the node answers /_cluster/health over HTTPS with HTTP 200."""
    try:
        resp = requests.get(f"https://{node_ip}:9200/_cluster/health",
                            auth=auth, verify=ca_bundle, timeout=5)
        return resp.status_code == 200
    except requests.RequestException:
        return False


def restart_es_service(host):
    """SSH into the host and restart Elasticsearch (assumes key-based SSH and passwordless sudo)."""
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "elasticsearch.service"],
                   check=True, timeout=120)


def terminate_and_replace_instance(service_name, instance_id, image_id, flavor_id, region,
                                   timeout=600):
    client = ovh.Client()  # Reads endpoint and credentials from ovh.conf or OVH_* environment variables
    base = f"/cloud/project/{service_name}/instance"
    # Terminate the old instance
    client.delete(f"{base}/{instance_id}")
    # Create a new instance
    new_instance = client.post(base, name="es-node-new", imageId=image_id,
                               flavorId=flavor_id, region=region)
    # Poll until the new instance is ACTIVE instead of sleeping for a fixed time
    deadline = time.monotonic() + timeout
    while client.get(f"{base}/{new_instance['id']}")["status"] != "ACTIVE":
        if time.monotonic() > deadline:
            raise TimeoutError(f"instance {new_instance['id']} did not become ACTIVE")
        time.sleep(10)
    # Configure the new instance to join the cluster (e.g., via cloud-init or config management)
    return new_instance["id"]

# Main loop for monitoring
# ... detect failed_es_node_ip ...
# if failed_es_node_ip_is_unrecoverable:
#     instance_id_to_replace = get_instance_id_from_ip(failed_es_node_ip)
#     new_instance_id = terminate_and_replace_instance("your-project-id", instance_id_to_replace, "your-image-id", "your-flavor-id", "your-region")
#     # Ensure new instance is configured and joins the cluster
# else:
#     restart_es_service(failed_es_node_ip)

Laravel Application and Database Failover

For a Laravel application, high availability typically involves the database and the web servers themselves. We’ll focus on database failover, as it’s often the most critical component.

MySQL/MariaDB Replication and Automatic Failover

A standard approach for relational databases like MySQL or MariaDB is to set up a primary-replica replication topology. For automatic failover, we need a mechanism to detect primary failure and promote a replica.

Replication Setup:

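-- Prerequisite: each server needs a unique server_id, and the primary needs binary logging (log_bin) enabled.
-- The statements below use the classic syntax accepted by MariaDB and by MySQL before 8.4.
-- MySQL 8.0.23 and later also offer CHANGE REPLICATION SOURCE TO / START REPLICA / SHOW REPLICA STATUS; MySQL 8.4 removes the older forms.

-- On the primary server:
CREATE USER 'repl_user'@'%' IDENTIFIED BY 'your_replication_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'%';
SHOW MASTER STATUS; -- Note down File and Position

-- On the replica server:
CHANGE MASTER TO
  MASTER_HOST='primary_db_host',
  MASTER_USER='repl_user',
  MASTER_PASSWORD='your_replication_password',
  MASTER_LOG_FILE='mysql-bin.XXXXXX', -- From SHOW MASTER STATUS on primary
  MASTER_LOG_POS=YYYYYY; -- From SHOW MASTER STATUS on primary
START SLAVE;
SHOW SLAVE STATUS\G -- Verify 'Slave_IO_Running: Yes' and 'Slave_SQL_Running: Yes' (\G already ends the statement, so no semicolon)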
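The verification step can also be automated. The sketch below assumes you have already fetched one row of SHOW SLAVE STATUS as a dictionary (for example through a MySQL driver’s dictionary cursor); replication_problems and the 60-second lag limit are illustrative choices, while the three column names are the ones the statement returns.

```python
def replication_problems(status, max_lag_seconds=60):
    """Return a list of human-readable problems for one SHOW SLAVE STATUS row."""
    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("IO thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # The server reports NULL when it cannot compute the lag
        problems.append("replication lag is unknown (NULL)")
    elif int(lag) > max_lag_seconds:
        problems.append(f"replica is {lag}s behind the primary")
    return problems


row = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 120}
print(replication_problems(row))
```

An empty list means the replica looks healthy by these three criteria; anything else is worth alerting on before relying on that replica as a failover candidate.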

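For automatic failover, Orchestrator is a popular open-source tool designed specifically for MySQL replication topology management and failover: it detects a failed primary and promotes a replica. ProxySQL complements it rather than replacing it. It sits between Laravel and the databases and routes queries to whichever server is currently writable, but it does not promote replicas itself.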

Implementing Failover with Orchestrator

Orchestrator needs to be installed on a separate set of instances (or at least not on the database primaries themselves) to avoid a single point of failure. It connects to your MySQL instances, discovers the topology, and can be configured to automatically detect and recover from failures.

Orchestrator Configuration (orchestrator.conf.json):

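{
  "Debug": true,
  "ListenAddress": ":3000",
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "your_orchestrator_db_password",
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orchestrator_backend",
  "MySQLOrchestratorPassword": "your_orchestrator_backend_password",
  "InstancePollSeconds": 5,
  "RecoveryPeriodBlockSeconds": 300,
  "RecoverMasterClusterFilters": ["*"],
  "RecoverIntermediateMasterClusterFilters": ["*"],
  "PreFailoverProcesses": ["/path/to/pre_master_promotion.sh {failedHost} {failureCluster}"],
  "PostFailoverProcesses": ["/path/to/post_master_promotion.sh {successorHost} {failureCluster}"],
  "PostMasterFailoverProcesses": ["/path/to/post_master_failover.sh {successorHost} {failureCluster}"]
}

The key names above follow Orchestrator’s documented configuration; check them against the version you deploy. The MySQLTopology* credentials are used on the MySQL servers being managed, while the MySQLOrchestrator* settings point to Orchestrator’s own backend database. RecoverMasterClusterFilters controls which clusters are recovered automatically ("*" matches all), and RecoveryPeriodBlockSeconds prevents repeated failovers of the same cluster within the given window. The PreFailoverProcesses, PostFailoverProcesses, and PostMasterFailoverProcesses hooks are critical for integrating with your application. Each entry is a shell command in which Orchestrator substitutes placeholders such as {failedHost}, {successorHost}, and {failureCluster}. These scripts can update DNS, notify load balancers, or trigger application-level reconfiguration.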

Example post_master_failover.sh script:

#!/bin/bash
NEW_PRIMARY_HOST=$1
CLUSTER_NAME=$2

echo "Failover detected. New primary is: $NEW_PRIMARY_HOST for cluster: $CLUSTER_NAME"

# Example: Update a load balancer or DNS record
# This would typically involve calling an API for your load balancer service
# or using a tool like 'dns-updater' to change A records.

# Example: Notify your Laravel application or a monitoring system
# curl -X POST -H "Content-Type: application/json" -d '{"message": "MySQL failover: New primary is '"$NEW_PRIMARY_HOST"'"}' http://your-monitoring-api.example.com/notify

exit 0

Ensure the Orchestrator topology user (MySQLTopologyUser) has sufficient privileges on every MySQL instance it manages. Orchestrator’s documentation calls for SUPER, PROCESS, REPLICATION SLAVE, and RELOAD on *.*, plus SELECT on mysql.slave_master_info where replication metadata is stored in tables. Orchestrator uses this single account for discovery, replication management, and promotion, so separate orchestrator_repl or orchestrator_promote accounts are not required. Note that SET USER is not a MySQL privilege.

Laravel Application Load Balancing and Health Checks

Your Laravel application instances themselves need to be load-balanced and monitored. OVH offers various load balancing solutions, from basic network load balancers to more advanced application load balancers.

Nginx as a Load Balancer/Reverse Proxy:

# /etc/nginx/nginx.conf or included conf file
upstream laravel_app_servers {
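    # Passive health checks (open-source Nginx): after 3 failed attempts within 10s,
    # a server is taken out of rotation for 10s
    server 192.168.1.10:80 max_fails=3 fail_timeout=10s;
    server 192.168.1.11:80 max_fails=3 fail_timeout=10s;
    server 192.168.1.12:80 max_fails=3 fail_timeout=10s;
    # Add more servers as needed
    # Active health checks need NGINX Plus: the health_check directive goes in the proxying location block,
    # e.g. health_check interval=5s fails=3 passes=2 uri=/health;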
}

server {
    listen 80;
    server_name your-app.example.com;

    location / {
        proxy_pass http://laravel_app_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Health check endpoint for this Nginx instance itself (e.g., for Keepalived or an external probe).
    # It returns 200 OK without contacting the Laravel backends.
    location = /health {
        access_log off;
        default_type text/plain;
        return 200 "OK";
    }
}

For automated failover of the Nginx instances themselves, you can use tools like Keepalived to manage a virtual IP (VIP) that floats between active and standby Nginx servers. If the active Nginx server fails, Keepalived automatically assigns the VIP to the standby server.

Keepalived Configuration (/etc/keepalived/keepalived.conf):

global_defs {
    router_id NGINX_MASTER
}

# Define the script before the vrrp_instance that tracks it
vrrp_script CHK_NGINX {
    script "/usr/local/bin/check_nginx.sh"
    interval 2
    weight -20 # Lower priority by 20 if Nginx fails
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100 # Use a lower value (e.g., 90) on the standby
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_vrrp_password
    }
    virtual_ipaddress {
        192.168.1.200/24 dev eth0 # The VIP that clients connect to
    }
    # Health check for Nginx process
    track_script {
        CHK_NGINX
    }
}

# If using Nginx Plus with health checks, you can integrate that here.
# Otherwise, a script checking the Nginx process or its health endpoint is needed.

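The check_nginx.sh script would typically check whether the Nginx process is running and perhaps whether it responds on the health check endpoint. If the script fails twice in a row (fall 2), Keepalived lowers that node’s priority by 20 (the weight). Failover happens only if the standby’s priority is then the higher of the two, so with a priority of 100 on the active node the standby needs a priority above 80, for example 90.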
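The priority arithmetic is worth checking before relying on it. The following sketch models only the weight adjustment and the "highest priority wins" rule; effective_priority and elect_master are illustrative names, and real VRRP elections also involve tie-breaking and preemption settings that are ignored here.

```python
def effective_priority(base, weight, script_ok):
    """Priority after applying a tracked script's weight.

    A negative weight is subtracted while the script is failing;
    a positive weight is added while the script is succeeding.
    """
    if weight < 0 and not script_ok:
        return base + weight
    if weight > 0 and script_ok:
        return base + weight
    return base


def elect_master(nodes, weight=-20):
    """nodes maps name -> (base_priority, script_ok); the highest effective priority wins."""
    scores = {name: effective_priority(base, weight, ok)
              for name, (base, ok) in nodes.items()}
    return max(scores, key=scores.get), scores


# Nginx has failed on lb1; compare a standby priority of 90 with one of 70
print(elect_master({"lb1": (100, False), "lb2": (90, True)}))
print(elect_master({"lb1": (100, False), "lb2": (70, True)}))
```

In the second call lb1 keeps the VIP even though its Nginx is down, because 100 - 20 is still higher than 70. That is the misconfiguration to avoid when choosing the standby’s priority.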

Orchestration and Monitoring for Automated Failover

A truly automated failover system requires a robust orchestration and monitoring layer. This layer is responsible for detecting failures across all components (Elasticsearch, Database, Web Servers) and initiating the appropriate recovery procedures.

Centralized Monitoring with Prometheus and Alertmanager

Prometheus is an excellent choice for collecting metrics from all your services. You’ll need exporters for:

Node Exporter: For system-level metrics (CPU, memory, disk, network).
Elasticsearch Exporter: To gather Elasticsearch cluster health, node status, shard allocation, etc.
MySQL Exporter: To monitor database replication status, query performance, connections.
Blackbox Exporter: To perform external network probes (e.g., checking HTTP endpoints of your Laravel app, Elasticsearch API).

Alertmanager then processes alerts generated by Prometheus and routes them to appropriate notification channels (e.g., Slack, PagerDuty, email). Crucially, Alertmanager can also trigger webhooks to initiate automated remediation actions.

Prometheus Alerting Rule Example (alerts.yml):

groups:
- name: elasticsearch_alerts
  rules:
  - alert: ElasticsearchClusterDown
    # The community elasticsearch_exporter exposes one series per color, labelled color="green|yellow|red"
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster is in RED status"
      description: "Cluster {{ $labels.cluster }} is in RED status. Manual intervention may be required."
      runbook_url: "http://your-runbook-url/elasticsearch-red-status"

  - alert: ElasticsearchNodeDown
    expr: up{job="elasticsearch"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch node is down"
      description: "Node {{ $labels.instance }} in job {{ $labels.job }} is down. Attempting automated restart."
      webhook_url: "http://your-remediation-webhook:8080/webhook/elasticsearch-node-down" # Informational only: Alertmanager calls the URL set in a receiver's webhook_configs, not this annotation

- name: mysql_alerts
  rules:
  - alert: MySQLReplicationLagging
    expr: mysql_slave_status_seconds_behind_master > 60
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "MySQL replication lag detected"
      description: "Replica {{ $labels.instance }} is lagging by {{ $value }} seconds."

  - alert: MySQLPrimaryDown
    # mysql_up is reported for every scraped MySQL instance; narrow the selector (e.g., by instance) so this only matches the primary
    expr: mysql_up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "MySQL primary is down"
      description: "MySQL primary {{ $labels.instance }} is unreachable. Orchestrator should handle failover."
      runbook_url: "http://your-runbook-url/mysql-primary-down"
      # Orchestrator might be configured to handle this directly, or an Alertmanager webhook receiver could trigger it.
      # webhook_url: "http://your-remediation-webhook:8080/webhook/mysql-primary-down"

Custom Remediation Webhook Service

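Annotations such as webhook_url are only metadata attached to the alert; Alertmanager does not call them. To trigger remediation, define a receiver with webhook_configs in the Alertmanager configuration and route the relevant alerts to it. The receiver’s URL can point to a custom service (e.g., a small Python Flask app or a Go service) that listens for incoming alerts. This service acts as the central orchestrator for automated recovery.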

When an alert that calls for remediation is received (e.g., ElasticsearchNodeDown), the webhook service can:

For Elasticsearch: Trigger a script to restart the Elasticsearch service on the affected node. If the node is unrecoverable, initiate a replacement process using OVH APIs.
For MySQL: While Orchestrator handles direct database failover, the webhook could notify Orchestrator or trigger its API if direct integration is needed. It can also update application configurations or DNS.
For Laravel App Servers: If a web server instance fails, the webhook can trigger its replacement via OVH APIs and update the load balancer configuration.
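The dispatch logic described above can be sketched with the Python standard library alone. The payload shape (a top-level alerts list whose entries carry status and labels) is what Alertmanager’s webhook receiver sends; the ACTIONS table, the action names, and the decision to only print the planned actions are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from alert name to a remediation action
ACTIONS = {
    "ElasticsearchNodeDown": "restart_elasticsearch",
    "MySQLPrimaryDown": "notify_only",  # Orchestrator performs the actual failover
}


def plan_actions(payload):
    """Return (action, instance) pairs for every firing alert we know how to handle."""
    planned = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # ignore resolved alerts
        labels = alert.get("labels", {})
        action = ACTIONS.get(labels.get("alertname"))
        if action:
            planned.append((action, labels.get("instance", "unknown")))
    return planned


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        for action, instance in plan_actions(payload):
            # A real service would call the restart or replacement logic here
            print(f"would run {action} for {instance}")
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Start the webhook listener; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling serve() starts the listener. Keeping plan_actions separate from the HTTP handler makes the decision logic easy to test without a running server, and a production service would also need authentication and protection against acting twice on the same alert.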

This layered approach, combining built-in HA features of services like Elasticsearch and MySQL with external orchestration and monitoring tools, provides a robust framework for automated failover on OVH infrastructure.