Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Shopify Deployments on DigitalOcean

Elasticsearch Cluster Architecture for High Availability

Achieving robust disaster recovery for Elasticsearch hinges on a well-defined, multi-node cluster architecture that prioritizes data redundancy and automatic failover. For production deployments on DigitalOcean, we’ll focus on a setup that leverages Elasticsearch’s built-in quorum-based voting and shard allocation mechanisms, augmented by external monitoring and orchestration.

A typical highly available Elasticsearch cluster consists of at least three master-eligible nodes, several data nodes, and potentially dedicated coordinating nodes. The master nodes are responsible for cluster-wide operations, including managing indices, shard allocation, and node discovery. Data nodes store the actual indices and handle search and indexing requests. Coordinating nodes act as intelligent load balancers for search requests, distributing them across data nodes and aggregating results.
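The three-node minimum follows from majority voting: a master can only be elected while more than half of the voting nodes can reach each other. The arithmetic can be sketched in a few lines of Python. The function names are illustrative and are not part of any Elasticsearch API.

```python
def quorum_size(master_eligible: int) -> int:
    """Smallest strict majority of the voting nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """Worst-case number of master-eligible nodes that can be lost
    while a majority is still reachable."""
    return master_eligible - quorum_size(master_eligible)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} master-eligible: quorum={quorum_size(n)}, "
              f"can lose {tolerated_failures(n)}")
```

Under this model two master-eligible nodes are no safer than one in the worst case, which is why three is the usual starting point and why odd counts are preferred.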

Configuring Elasticsearch for Master Election and Shard Allocation

The core of Elasticsearch’s HA lies in its master election process and shard allocation strategies. We need to ensure that enough master-eligible nodes are available to form a quorum and that shards are replicated across different availability zones or even regions for true disaster resilience.

On each Elasticsearch node, the elasticsearch.yml configuration file is critical. For master-eligible nodes, the following settings are paramount:

cluster.name: "my-production-cluster"
node.name: "${HOSTNAME}"
network.host: 0.0.0.0
discovery.seed_hosts:
  - "es-node-1.example.com:9300"
  - "es-node-2.example.com:9300"
  - "es-node-3.example.com:9300"
cluster.initial_master_nodes:
  - "es-node-1"
  - "es-node-2"
  - "es-node-3"
node.roles: [ master, data, ingest ] # Example: Master and Data roles combined for smaller clusters
xpack.security.enabled: true # Essential for production
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true

Key parameters:

cluster.name: Must be identical across all nodes in the cluster.
discovery.seed_hosts: A list of IP addresses or hostnames of other master-eligible nodes that new nodes can contact to join the cluster.
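cluster.initial_master_nodes: The node names that vote in the very first master election. Each entry must match a node’s node.name exactly, so with node.name set to ${HOSTNAME} the droplet hostnames need to be es-node-1, es-node-2, and es-node-3. The setting is only used to bootstrap a brand-new cluster: remove it once the cluster has formed, and do not set it on nodes that join an existing cluster.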
node.roles: Defines the roles of the node. For HA, ensure at least three nodes have the master role.
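xpack.security.*: Enabling security is non-negotiable for production environments. The two ssl.enabled flags also require certificate settings (for example xpack.security.transport.ssl.keystore.path), which are omitted here for brevity; a node will not start with TLS enabled and no certificate configured.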

For data nodes, the configuration would be similar but might omit the master role if dedicated master nodes are used:

cluster.name: "my-production-cluster"
node.name: "${HOSTNAME}"
network.host: 0.0.0.0
discovery.seed_hosts:
  - "es-node-1.example.com:9300"
  - "es-node-2.example.com:9300"
  - "es-node-3.example.com:9300"
node.roles: [ data, ingest ] # Example: Data and Ingest roles
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true

Shard Replication and Allocation Awareness

To prevent data loss during a node failure or even a datacenter outage, shard replication is essential. Elasticsearch uses primary and replica shards. Every document is stored in exactly one primary shard, and each replica shard is a full copy of its primary that Elasticsearch always places on a different node. If a node holding a primary shard fails, one of its in-sync replicas is promoted to become the new primary.

The number of replicas is configured per index. For high availability, a minimum of 1 replica (meaning 2 copies of the data) is recommended. For disaster recovery, consider 2 or more replicas spread across separate failure domains. DigitalOcean does not expose availability zones within a datacenter, so in practice this means separate datacenters such as nyc1 and nyc3.

Index settings for replication:

PUT /my-index
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 2  // Ensures 3 copies of each shard
    }
  }
}
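As a quick sanity check on these numbers, the sketch below computes how many shard copies the cluster has to host and how many data nodes are needed before every copy can be assigned. The helper names are made up for illustration. The second function relies on the fact that Elasticsearch never allocates two copies of the same shard to one node.

```python
def total_shard_copies(number_of_shards: int, number_of_replicas: int) -> int:
    """Primaries plus all of their replicas."""
    return number_of_shards * (1 + number_of_replicas)


def min_data_nodes_for_green(number_of_replicas: int) -> int:
    """Each copy of a shard needs its own node, so a fully assigned
    (green) index needs at least replicas + 1 data nodes."""
    return number_of_replicas + 1


if __name__ == "__main__":
    shards, replicas = 3, 2  # values from the index settings above
    print("shard copies to place:", total_shard_copies(shards, replicas))
    print("data nodes needed:", min_data_nodes_for_green(replicas))
```

With fewer data nodes than that, some replicas stay unassigned and the index reports yellow rather than green.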

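To ensure replicas are placed in different zones, we can leverage shard allocation awareness. Elasticsearch does not read DigitalOcean droplet tags, so each node has to be told its zone through a custom node attribute in elasticsearch.yml. DigitalOcean also does not expose availability zones inside a datacenter, so "zone" here means whatever failure domain you can actually separate, typically the datacenter (nyc1, nyc2, nyc3).

First, set the attribute on every node to match the datacenter its droplet runs in. Then enable awareness for that attribute:

# In elasticsearch.yml on every node (the value differs per node)
node.attr.zone: nyc1
# In elasticsearch.yml on master-eligible nodes
cluster.routing.allocation.awareness.attributes: zone
# Or apply the same setting dynamically to a running cluster
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}

Awareness is a cluster-level setting and has no per-index equivalent, so it applies to every index. It tells Elasticsearch to try to place a shard and its replicas on nodes in different zones. If one zone is lost, Elasticsearch will by default rebuild the missing copies in the surviving zones; set cluster.routing.allocation.awareness.force.zone.values if you would rather leave those replicas unassigned than overload the remaining nodes. A cluster that spans datacenters also depends on a fast, reliable link between them, so measure latency between your chosen datacenters before committing to this layout.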
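Whether the placement actually spans zones is worth verifying rather than assuming. The sketch below is one way to do that offline, assuming you have already collected the shard placements (for example from GET _cat/shards) and a node-to-zone mapping (for example from GET _cat/nodeattrs). The function name and input shapes are hypothetical.

```python
from collections import defaultdict


def shards_confined_to_one_zone(placements, node_zone):
    """Return the (index, shard) pairs whose copies all sit in one zone.

    placements: iterable of (index_name, shard_number, node_name),
                one entry per started shard copy.
    node_zone:  dict mapping node_name to its zone attribute.
    """
    zones_per_shard = defaultdict(set)
    for index, shard, node in placements:
        zones_per_shard[(index, shard)].add(node_zone[node])
    return sorted(key for key, zones in zones_per_shard.items()
                  if len(zones) < 2)


if __name__ == "__main__":
    node_zone = {"es-node-1": "nyc1", "es-node-2": "nyc1", "es-node-3": "nyc3"}
    placements = [
        ("my-index", 0, "es-node-1"),
        ("my-index", 0, "es-node-2"),  # both copies in nyc1
        ("my-index", 1, "es-node-1"),
        ("my-index", 1, "es-node-3"),  # copies in nyc1 and nyc3
    ]
    print(shards_confined_to_one_zone(placements, node_zone))
```

Any shard this reports would become unavailable if its single zone went down, so an empty result is the goal.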

Automated Failover Orchestration with Keepalived and HAProxy

While Elasticsearch handles internal node failures and shard promotion, external access to the cluster needs its own high-availability layer. This is where a combination of Keepalived for virtual IP (VIP) failover and HAProxy for load balancing comes into play. This setup ensures that your applications always have a stable endpoint to connect to, even if the underlying Elasticsearch nodes or load balancers change.

Setting up Keepalived for VIP Failover

Keepalived provides a simple yet powerful mechanism for managing a floating IP address across multiple servers. If the primary server fails, Keepalived automatically transfers the VIP to a standby server.

We’ll deploy two dedicated servers (or use existing nodes with spare capacity) for Keepalived and HAProxy. Let’s call them lb-node-1 and lb-node-2.

Install Keepalived on both nodes:

sudo apt update
sudo apt install keepalived -y

Configure Keepalived on lb-node-1 (the primary):

! Configuration File for keepalived

global_defs {
   router_id LVS_DEVEL_1
}

vrrp_script chk_haproxy {
    script "/usr/local/bin/check_haproxy.sh"
    interval 2
    weight 2
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state BACKUP # Start as BACKUP on primary, will transition to MASTER
    interface eth0 # Replace with your actual network interface
    virtual_router_id 51
    priority 101 # Higher priority for MASTER
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 # Your desired Virtual IP
    }
    track_script {
        chk_haproxy
    }
}

Configure Keepalived on lb-node-2 (the standby):

! Configuration File for keepalived

global_defs {
   router_id LVS_DEVEL_2
}

vrrp_script chk_haproxy {
    script "/usr/local/bin/check_haproxy.sh"
    interval 2
    weight 2
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0 # Replace with your actual network interface
    virtual_router_id 51
    priority 100 # Lower priority for BACKUP
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 # Your desired Virtual IP
    }
    track_script {
        chk_haproxy
    }
}

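The virtual_router_id must be the same on both nodes. The priority determines which node becomes MASTER: the node with the higher priority wins. The virtual_ipaddress is the floating IP that applications will connect to. The track_script is crucial for automated failover. Its weight of 2 is added to a node’s priority only while the check succeeds, so when HAProxy dies on the MASTER and the check fails twice in a row (fall 2, at 2-second intervals), the MASTER’s priority drops below the standby’s and the VIP moves. Note that DigitalOcean droplet networks typically do not carry the multicast VRRP advertisements Keepalived sends by default, and an address claimed only on the interface is not routed to the droplet. On DigitalOcean this design is usually adapted by listing the other node under unicast_peer and using a notify_master script that reassigns a Reserved IP (formerly Floating IP) through the DigitalOcean API.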
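The interaction between priority and the track_script weight is easy to get wrong, so here is a small model of it in Python. It covers only the arithmetic, using the priorities 101 and 100 and the weight 2 from the configurations above. The function names are illustrative, and real Keepalived also applies tie-breaking and preemption rules that are left out.

```python
def effective_priority(base: int, weight: int, script_ok: bool) -> int:
    """Priority after applying a tracked script's weight.

    A positive weight is added while the script succeeds;
    a negative weight is applied (subtracting) while it fails.
    """
    if weight >= 0:
        return base + weight if script_ok else base
    return base if script_ok else base + weight


def vip_holder(nodes: dict) -> str:
    """nodes maps name -> (base_priority, weight, script_ok).
    The node with the highest effective priority holds the VIP."""
    return max(nodes, key=lambda name: effective_priority(*nodes[name]))


if __name__ == "__main__":
    healthy = {"lb-node-1": (101, 2, True), "lb-node-2": (100, 2, True)}
    haproxy_down = {"lb-node-1": (101, 2, False), "lb-node-2": (100, 2, True)}
    print("normal operation:", vip_holder(healthy))
    print("HAProxy down on lb-node-1:", vip_holder(haproxy_down))
```

The model also shows why the weight has to exceed the gap between the two base priorities: with a weight of 1 the failed node would tie with the standby at 101 instead of dropping below it.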

Create the HAProxy health check script /usr/local/bin/check_haproxy.sh on both nodes:

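#!/bin/bash
# -x matches the process name exactly. A plain "pgrep haproxy" would also
# match this script (check_haproxy.sh), so the check would never fail.
if pgrep -x haproxy > /dev/null; then
    exit 0
else
    exit 1
fi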

Make the script executable:

sudo chmod +x /usr/local/bin/check_haproxy.sh

Start and enable Keepalived on both nodes:

sudo systemctl start keepalived
sudo systemctl enable keepalived

Verify that one node has taken the VIP address. You can check this using ip addr show eth0 (replace eth0 with your interface).

Configuring HAProxy for Elasticsearch Load Balancing

HAProxy will sit behind the Keepalived VIP and distribute traffic to your Elasticsearch data nodes. It can also perform health checks on the Elasticsearch nodes.

Install HAProxy on both load balancer nodes:

sudo apt update
sudo apt install haproxy -y

Configure HAProxy in /etc/haproxy/haproxy.cfg. This configuration assumes your Elasticsearch data nodes are accessible at es-data-1, es-data-2, and es-data-3 on port 9200.

global
    log /dev/log    local0
    log /dev/log    local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    timeout connect 5000
    timeout client  50000
    timeout server  50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

frontend elasticsearch_frontend
    bind *:9200 # Bind to the port your applications use to access Elasticsearch
    mode http
    default_backend elasticsearch_backend

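backend elasticsearch_backend
    mode http
    balance roundrobin
    # Health check syntax for HAProxy 2.2+; on older releases use: option httpchk GET /
    option httpchk
    http-check send meth GET uri / ver HTTP/1.1 hdr Host localhost
    # With Elasticsearch security enabled, an unauthenticated GET / answers 401,
    # which still shows the node is up and serving HTTP.
    http-check expect status 200,401
    # Add your Elasticsearch data nodes here. "ssl" is required because
    # xpack.security.http.ssl.enabled is true; the ca-file path is a placeholder.
    server es-data-1 10.10.0.1:9200 ssl verify required ca-file /etc/haproxy/es-ca.pem check port 9200 inter 2000 rise 2 fall 3
    server es-data-2 10.10.0.2:9200 ssl verify required ca-file /etc/haproxy/es-ca.pem check port 9200 inter 2000 rise 2 fall 3
    server es-data-3 10.10.0.3:9200 ssl verify required ca-file /etc/haproxy/es-ca.pem check port 9200 inter 2000 rise 2 fall 3

listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /stats
    stats refresh 10s
    stats auth admin:YourSecurePassword # Change this!

Explanation of key HAProxy settings:

bind *:9200: HAProxy listens on port 9200 (the default Elasticsearch HTTP port) on all interfaces. This port will be accessible via the Keepalived VIP. In this example clients speak plain HTTP to HAProxy, which re-encrypts traffic to the nodes.
balance roundrobin: Distributes requests evenly across the backend servers.
option httpchk, http-check send, and http-check expect status 200,401: Configure HAProxy to perform HTTP health checks by sending a GET request to the root path of each Elasticsearch node. A 200 means the node answered normally; a 401 means it answered but wanted credentials, which is what a secured node returns to an anonymous request. To check more than liveness, send an Authorization header for a low-privilege monitoring user and expect only 200.
server es-data-X ... ssl verify required ca-file ... check port 9200 inter 2000 rise 2 fall 3: Defines each Elasticsearch data node. ssl makes HAProxy connect (and run its checks) over TLS and verify the node certificate against the given CA. check enables health checking. inter 2000 means check every 2 seconds. rise 2 means the server is considered healthy after 2 consecutive successful checks. fall 3 means the server is considered down after 3 consecutive failed checks.
listen stats: Enables the HAProxy statistics page, which is invaluable for monitoring. Remember to change the admin:YourSecurePassword to a strong, unique password.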
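To build intuition for how rise and fall interact, the following Python sketch replays a sequence of check results through the same consecutive-success and consecutive-failure rule. It is a simplified model for illustration, not HAProxy code, and the function names are made up.

```python
def simulate_checks(results, rise=2, fall=3, initially_up=True):
    """Replay health-check results (True = passed) and return the
    server state (True = up) after each check."""
    up = initially_up
    streak = 0  # consecutive results that contradict the current state
    states = []
    for ok in results:
        if ok == up:
            streak = 0
        else:
            streak += 1
            if streak >= (fall if up else rise):
                up = not up
                streak = 0
        states.append(up)
    return states


def detection_delay_ms(inter_ms: int, fall: int) -> int:
    """Approximate time to mark a dead server down, ignoring check timeouts."""
    return inter_ms * fall


if __name__ == "__main__":
    history = [False, False, True, False, False, False, True, True]
    print(simulate_checks(history))
    print("detection delay (ms):", detection_delay_ms(2000, 3))
```

With inter 2000 and fall 3, a dead node keeps receiving traffic for roughly six seconds, and a single passing check in the middle of a failure streak resets the count. Lowering inter shortens detection at the cost of more check traffic.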

Start and enable HAProxy:

sudo systemctl start haproxy
sudo systemctl enable haproxy

Now, your applications should point to the Keepalived VIP (192.168.1.100:9200 in this example) for all Elasticsearch interactions. If the primary HAProxy/Keepalived node fails, the VIP will move to the secondary, and HAProxy on the secondary will take over load balancing. If an Elasticsearch data node fails, HAProxy will stop sending traffic to it based on the health checks.

Shopify Deployment Considerations for High Availability

Shopify’s architecture is inherently distributed and resilient. However, when deploying custom applications or services that integrate with Shopify, or when managing your own Shopify-related infrastructure (e.g., a custom backend for Shopify POS, or a headless commerce setup), you need to apply similar HA principles.

Stateless Application Design

The most critical aspect of building HA applications that interact with Shopify is designing them to be stateless. This means that no session data or application state should be stored on the application server itself. All state should be externalized to a database (like your Elasticsearch cluster), a cache (e.g., Redis), or a dedicated session store.

For a PHP application (common in Shopify development), this translates to:

Avoid storing session data in PHP’s default file-based sessions. Use Redis or a database for session management.
Ensure any background jobs or workers are idempotent, meaning they can be run multiple times without changing the outcome beyond the initial execution.
Cache frequently accessed data from Shopify’s APIs aggressively.
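The idempotency point is the one most often skipped, so here is a minimal sketch of the pattern in Python. It assumes each job or webhook carries a unique identifier (called event_id here, a hypothetical name) and uses an in-memory set purely as a stand-in for a shared external store such as Redis; an in-process set would of course not survive a restart or be visible to other instances.

```python
class IdempotentProcessor:
    """Run a handler at most once per event ID."""

    def __init__(self, handler, seen=None):
        self._handler = handler
        # Stand-in for an external store shared by all app instances.
        self._seen = set() if seen is None else seen

    def process(self, event_id, payload) -> bool:
        """Return True if the handler ran, False if the event was a duplicate."""
        if event_id in self._seen:
            return False
        self._handler(payload)
        # Record the ID only after success so a failed run can be retried.
        self._seen.add(event_id)
        return True


if __name__ == "__main__":
    handled = []
    processor = IdempotentProcessor(handled.append)
    print(processor.process("evt-1", {"order": 1001}))  # first delivery
    print(processor.process("evt-1", {"order": 1001}))  # redelivery
    print(len(handled))
```

Recording the ID after the handler succeeds favors at-least-once processing: a crash between the two steps leads to a retry, so the handler itself should also tolerate being repeated.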

Multi-Instance Deployment and Load Balancing

Deploy your application instances across multiple DigitalOcean droplets. These droplets should ideally be spread across different availability zones within a region for resilience against zone-specific failures.

Use DigitalOcean’s Load Balancers to distribute incoming traffic to your application instances. Configure health checks on the load balancer to automatically remove unhealthy instances from the pool.

# Example of a basic health check endpoint in a PHP app
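public function healthCheckAction() {
    // Check database connection. Assuming Laravel's DB facade, getPdo()
    // throws on failure instead of returning false, so catch the exception.
    try {
        DB::connection()->getPdo();
    } catch (\Exception $e) {
        http_response_code(503); // Service Unavailable
        echo "Database connection failed";
        return;
    }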
    // Check external API connectivity (e.g., Shopify API) - simplified
    try {
        // Attempt a simple, non-intrusive API call
        // e.g., $client->get('/admin/api/2023-10/shop.json');
        // For simplicity, we'll just simulate a successful check here.
        // In a real scenario, you'd want to test actual connectivity.
        $isShopifyConnected = true; // Assume true for this example
    } catch (\Exception $e) {
        $isShopifyConnected = false;
    }

    if (!$isShopifyConnected) {
        http_response_code(503);
        echo "Shopify API connection failed";
        return;
    }

    http_response_code(200); // OK
    echo "OK";
}

This endpoint, typically mapped to /health or /status, should be configured in your DigitalOcean Load Balancer’s health check settings. The load balancer will periodically ping this endpoint. If it returns a non-2xx status code, the instance is marked unhealthy and traffic is rerouted.

Database and Cache High Availability

Your application’s state is managed by its dependencies. For Elasticsearch, refer to the previous section on setting up a highly available cluster. For caching (e.g., Redis), DigitalOcean offers managed Redis instances, which provide built-in replication and failover capabilities. If you’re self-hosting Redis, ensure you set up a Redis Sentinel or Redis Cluster for automatic failover.

# Example Redis Sentinel configuration snippet (sentinel.conf)
port 26379
sentinel monitor mymaster 10.10.0.10:6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel parallel-syncs mymaster 1
sentinel notification-script mymaster /etc/redis/notify.sh
sentinel client-reconfig-script mymaster /etc/redis/reconfig.sh

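Your application should use a Sentinel-aware Redis client configured with the addresses of the Sentinels and the master name (mymaster). The client asks the Sentinels for the current master’s address and then connects to that Redis instance directly; Sentinel does not proxy traffic. Run at least three Sentinels on separate droplets so that the quorum of 2 can still be reached when one of them is lost.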
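The quorum value in sentinel monitor (2 in the snippet above) only governs failure detection. Actually performing a failover additionally needs a majority of all Sentinels to authorize it. The sketch below captures that rule in Python; the function names are illustrative and this is a model of the decision, not Sentinel code.

```python
def can_mark_master_down(agreeing_sentinels: int, quorum: int) -> bool:
    """The master is flagged as objectively down once `quorum`
    Sentinels agree that it is unreachable."""
    return agreeing_sentinels >= quorum


def can_fail_over(total_sentinels: int, reachable_sentinels: int, quorum: int) -> bool:
    """A failover needs the quorum for detection AND votes from a
    majority of all Sentinels to elect the one that performs it."""
    majority = total_sentinels // 2 + 1
    return reachable_sentinels >= max(quorum, majority)


if __name__ == "__main__":
    # Three Sentinels, one lost together with the master's droplet.
    print(can_fail_over(total_sentinels=3, reachable_sentinels=2, quorum=2))
    # Two Sentinels, one lost: detection quorum can never be met.
    print(can_fail_over(total_sentinels=2, reachable_sentinels=1, quorum=2))
```

This is why two Sentinels are not enough: losing either one blocks failover, so the pair protects against nothing.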

Disaster Recovery Scenarios and Testing

A robust disaster recovery strategy isn’t complete without understanding failure scenarios and regularly testing your failover mechanisms.

Simulating Failures

Regularly test the following scenarios:

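Elasticsearch Node Failure: Stop a data node, then separately stop a single master-eligible node. With three master-eligible nodes, stopping two at once removes the voting majority and the cluster cannot elect a master, which is a different test. Verify that replicas are promoted to primaries and that the missing replicas are rebuilt on the remaining nodes. Monitor the cluster health API (GET /_cluster/health): yellow is expected while replicas are unassigned and should return to green, whereas red means at least one primary shard has no available copy.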

Load Balancer Node Failure: Stop one of the Keepalived/HAProxy nodes. Verify that the VIP automatically moves to the other node and that traffic continues to flow to Elasticsearch. Check the HAProxy stats page on the surviving node.

Application Instance Failure: Terminate one or more application droplets. Verify that the DigitalOcean Load Balancer stops sending traffic to the terminated instances and that the remaining instances handle the load.

Availability Zone Outage: If possible, simulate an AZ outage by shutting down all droplets in one zone. Verify that traffic is routed to instances in other zones and that Elasticsearch remains available via its multi-zone replica setup.
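During these drills it helps to turn the cluster health response into an explicit pass or fail. The Python sketch below interprets a parsed GET /_cluster/health response; the status, number_of_nodes, and unassigned_shards fields are part of that API, while the function name and the expected_nodes parameter are assumptions made for this example. Fetching the JSON (with TLS and credentials) is left out.

```python
def assess_cluster_health(health: dict, expected_nodes: int) -> list:
    """Return a list of human-readable problems; empty means healthy."""
    problems = []
    status = health.get("status")
    if status == "red":
        problems.append("status red: at least one primary shard is unassigned")
    elif status == "yellow":
        problems.append("status yellow: some replica shards are unassigned")
    elif status != "green":
        problems.append(f"unexpected status: {status!r}")

    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        problems.append(f"only {nodes} of {expected_nodes} nodes have joined")

    unassigned = health.get("unassigned_shards", 0)
    if unassigned:
        problems.append(f"{unassigned} shards are unassigned")
    return problems


if __name__ == "__main__":
    after_node_stop = {"status": "yellow", "number_of_nodes": 2,
                       "unassigned_shards": 3}
    for problem in assess_cluster_health(after_node_stop, expected_nodes=3):
        print(problem)
```

Running the check before, during, and after each simulated failure gives a simple record of how long the cluster took to return to green.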

Automated Backups and Cross-Region Replication

While failover handles immediate availability, data durability requires backups. For Elasticsearch, implement regular snapshots to DigitalOcean Spaces (S3-compatible object storage). Configure these snapshots to be taken from your primary cluster and potentially copied to a different region for true disaster recovery.

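# Register a repository pointing to DigitalOcean Spaces.
# Store the Spaces keys in the Elasticsearch keystore on every node instead of
# sending them in the request body:
#   bin/elasticsearch-keystore add s3.client.default.access_key
#   bin/elasticsearch-keystore add s3.client.default.secret_key
PUT /_snapshot/my_do_backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups-nyc3",
    "endpoint": "nyc3.digitaloceanspaces.com",
    "region": "nyc3",
    "client": "default"
  }
}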

# Schedule a daily snapshot with snapshot lifecycle management (SLM).
# The schedule uses a six-field cron expression: this one fires at 00:00:00 UTC every day.
PUT /_slm/policy/daily_snapshot
{
  "schedule": "0 0 0 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "my_do_backup",
  "config": {
    "indices": ["my-index-*"],
    "ignore_unavailable": true,
    "include_global_state": false,
    "partial": true
  }
}
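Scheduled snapshots are only useful if they keep succeeding, so it is worth checking their freshness against your recovery point objective (RPO). The Python sketch below works on the snapshots array returned by the get snapshot API (GET /_snapshot/my_do_backup/_all), using its state and end_time_in_millis fields. The function names and the 24-hour RPO are assumptions for illustration.

```python
def latest_success_age_hours(snapshots, now_millis):
    """Age in hours of the newest successful snapshot, or None if there is none."""
    ends = [s["end_time_in_millis"] for s in snapshots
            if s.get("state") == "SUCCESS"]
    if not ends:
        return None
    return (now_millis - max(ends)) / 3_600_000


def meets_rpo(snapshots, now_millis, rpo_hours):
    """True if a successful snapshot exists within the RPO window."""
    age = latest_success_age_hours(snapshots, now_millis)
    return age is not None and age <= rpo_hours


if __name__ == "__main__":
    hour = 3_600_000
    now = 100 * hour
    snapshots = [
        {"snapshot": "daily-snap-a", "state": "SUCCESS", "end_time_in_millis": 60 * hour},
        {"snapshot": "daily-snap-b", "state": "FAILED", "end_time_in_millis": 84 * hour},
    ]
    print(latest_success_age_hours(snapshots, now))
    print(meets_rpo(snapshots, now, rpo_hours=24))
```

Wiring a check like this into your monitoring catches the common failure mode where snapshots silently stop because of expired credentials or a full bucket.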

For your application data (if not solely in Elasticsearch), ensure your databases and caches are also backed up regularly. For critical data, consider cross-region replication for your databases and object storage.