Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and Shopify Deployments on DigitalOcean
Elasticsearch Cluster Architecture for High Availability
Achieving robust disaster recovery for Elasticsearch hinges on a well-defined, multi-node cluster architecture that prioritizes data redundancy and automatic failover. For production deployments on DigitalOcean, we’ll focus on a setup that leverages Elasticsearch’s built-in quorum-based voting and shard allocation mechanisms, augmented by external monitoring and orchestration.
A typical highly available Elasticsearch cluster consists of at least three master-eligible nodes, several data nodes, and potentially dedicated coordinating nodes. The master nodes are responsible for cluster-wide operations, including managing indices, shard allocation, and node discovery. Data nodes store the actual indices and handle search and indexing requests. Coordinating nodes act as intelligent load balancers for search requests, distributing them across data nodes and aggregating results.
Configuring Elasticsearch for Master Election and Shard Allocation
The core of Elasticsearch’s HA lies in its master election process and shard allocation strategies. We need to ensure that enough master-eligible nodes are available to form a quorum and that shards are replicated across different availability zones or even regions for true disaster resilience.
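The quorum arithmetic behind the three-node recommendation can be sketched in a few lines. This is an illustrative Python calculation, not Elasticsearch code, and the function names are made up for this example.

```python
def quorum_size(master_eligible: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum_size(master_eligible)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} master-eligible: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Two master-eligible nodes tolerate no failures at all, which is why three is the practical minimum for an election to survive the loss of one node.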
On each Elasticsearch node, the elasticsearch.yml configuration file is critical. For master-eligible nodes, the following settings are paramount:
cluster.name: "my-production-cluster"
node.name: "${HOSTNAME}"
network.host: 0.0.0.0
discovery.seed_hosts:
- "es-node-1.example.com:9300"
- "es-node-2.example.com:9300"
- "es-node-3.example.com:9300"
cluster.initial_master_nodes:
- "es-node-1"
- "es-node-2"
- "es-node-3"
node.roles: [ master, data, ingest ] # Example: Master and Data roles combined for smaller clusters
xpack.security.enabled: true # Essential for production
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true
Key parameters:
- cluster.name: Must be identical across all nodes in the cluster.
- discovery.seed_hosts: A list of IP addresses or hostnames of master-eligible nodes that a starting node contacts to discover and join the cluster.
- cluster.initial_master_nodes: The node names (matching each node's node.name) that take part in the very first master election. It is needed only to bootstrap a brand-new cluster and should be removed from the configuration once the cluster has formed.
- node.roles: Defines the roles of the node. For HA, ensure at least three nodes have the master role.
- xpack.security.*: Enabling security is non-negotiable for production environments.
For data nodes, the configuration would be similar but might omit the master role if dedicated master nodes are used:
cluster.name: "my-production-cluster"
node.name: "${HOSTNAME}"
network.host: 0.0.0.0
discovery.seed_hosts:
- "es-node-1.example.com:9300"
- "es-node-2.example.com:9300"
- "es-node-3.example.com:9300"
node.roles: [ data, ingest ] # Example: Data and Ingest roles
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true
Shard Replication and Allocation Awareness
To prevent data loss during a node failure or even a datacenter outage, shard replication is essential. Elasticsearch uses primary and replica shards. A primary shard is the original copy of a document, and replica shards are exact copies of the primary. If a node holding a primary shard fails, one of its replicas can be promoted to become the new primary.
The number of replicas is configured per index. For high availability, a minimum of 1 replica (meaning 2 copies of the data) is recommended. For disaster recovery, consider 2 or more replicas, distributed across different DigitalOcean availability zones.
Index settings for replication:
# number_of_replicas: 2 keeps 3 copies of each shard (1 primary + 2 replicas)
PUT /my-index
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 2
    }
  }
}
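To make the replica arithmetic concrete, here is a small Python sketch that derives the consequences of a given shard and replica count. The function and key names are hypothetical. It relies on one Elasticsearch rule: two copies of the same shard are never placed on the same node.

```python
def shard_copies(number_of_shards: int, number_of_replicas: int) -> dict:
    """Summarise what an index's shard/replica settings imply."""
    copies_per_shard = 1 + number_of_replicas  # one primary plus its replicas
    return {
        "copies_per_shard": copies_per_shard,
        # Total shard copies the cluster must host for this index
        "total_shard_copies": number_of_shards * copies_per_shard,
        # Copies of a shard that can be lost before its data is gone
        "copies_that_can_be_lost": number_of_replicas,
        # Copies of one shard never share a node, so this many data
        # nodes are needed before every replica can be assigned
        "min_data_nodes_for_green": copies_per_shard,
    }


if __name__ == "__main__":
    print(shard_copies(number_of_shards=3, number_of_replicas=2))
```

For the index above, that means nine shard copies in total and at least three data nodes before the cluster can report green.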
To ensure replicas are placed in different zones, we can leverage shard allocation awareness. Elasticsearch does not read DigitalOcean droplet tags or metadata, so each node must declare its location through a custom node attribute in elasticsearch.yml.
First, decide how your droplets map to zones (tags such as us-nyc1-zone1 and us-nyc1-zone2 help you keep track of this on the DigitalOcean side). Then, configure Elasticsearch:
# In elasticsearch.yml on every node: the zone this node runs in
node.attr.zone: nyc1 # set per node, e.g. nyc1, nyc2, or nyc3
# In elasticsearch.yml on master-eligible nodes
cluster.routing.allocation.awareness.attributes: zone
# Awareness is a cluster-level setting with no per-index equivalent; it can also be set dynamically
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}
This tells Elasticsearch to spread the primary and replicas of each shard across nodes with different zone values. You’ll need to ensure your DigitalOcean infrastructure is set up with distinct zones and that your Elasticsearch nodes are deployed accordingly. For instance, if you have droplets in nyc1, nyc2, and nyc3, you’d set node.attr.zone on each node to match its location and make the cluster aware of the zone attribute.
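Once awareness is configured, it is worth verifying that no shard has all of its copies in a single zone. The Python sketch below does this over a hypothetical input: a list of (index, shard number, zone) tuples, one per assigned shard copy. You would have to build that list yourself, for example by joining shard listings with each node's zone attribute. The function name is an assumption for illustration.

```python
from collections import defaultdict


def shards_confined_to_one_zone(copies):
    """Return (index, shard) pairs whose assigned copies all sit in one zone.

    copies: iterable of (index, shard_number, zone) tuples, one per shard copy.
    """
    zones_by_shard = defaultdict(set)
    for index, shard, zone in copies:
        zones_by_shard[(index, shard)].add(zone)
    # A shard seen in fewer than two zones would not survive a zone outage
    return sorted(key for key, zones in zones_by_shard.items() if len(zones) < 2)


if __name__ == "__main__":
    sample = [
        ("my-index", 0, "nyc1"), ("my-index", 0, "nyc2"),
        ("my-index", 1, "nyc1"), ("my-index", 1, "nyc1"),
    ]
    print(shards_confined_to_one_zone(sample))
```

An empty result means every shard has copies in at least two zones.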
Automated Failover Orchestration with Keepalived and HAProxy
While Elasticsearch handles internal node failures and shard promotion, external access to the cluster needs its own high-availability layer. This is where a combination of Keepalived for virtual IP (VIP) failover and HAProxy for load balancing comes into play. This setup ensures that your applications always have a stable endpoint to connect to, even if the underlying Elasticsearch nodes or load balancers change.
Setting up Keepalived for VIP Failover
Keepalived provides a simple yet powerful mechanism for managing a floating IP address across multiple servers. If the primary server fails, Keepalived automatically transfers the VIP to a standby server.
We’ll deploy two dedicated servers (or use existing nodes with spare capacity) for Keepalived and HAProxy. Let’s call them lb-node-1 and lb-node-2.
Install Keepalived on both nodes:
sudo apt update
sudo apt install keepalived -y
Configure Keepalived on lb-node-1 (the primary):
! Configuration File for keepalived
global_defs {
router_id LVS_DEVEL_1
}
vrrp_script chk_haproxy {
script "/usr/local/bin/check_haproxy.sh"
interval 2
weight 2
fall 2
rise 2
}
vrrp_instance VI_1 {
state BACKUP # Start as BACKUP on primary, will transition to MASTER
interface eth0 # Replace with your actual network interface
virtual_router_id 51
priority 101 # Higher priority for MASTER
advert_int 1
authentication {
auth_type PASS
auth_pass 1234
}
virtual_ipaddress {
192.168.1.100/24 dev eth0 # Your desired Virtual IP
}
track_script {
chk_haproxy
}
}
Configure Keepalived on lb-node-2 (the standby):
! Configuration File for keepalived
global_defs {
router_id LVS_DEVEL_2
}
vrrp_script chk_haproxy {
script "/usr/local/bin/check_haproxy.sh"
interval 2
weight 2
fall 2
rise 2
}
vrrp_instance VI_1 {
state BACKUP
interface eth0 # Replace with your actual network interface
virtual_router_id 51
priority 100 # Lower priority for BACKUP
advert_int 1
authentication {
auth_type PASS
auth_pass 1234
}
virtual_ipaddress {
192.168.1.100/24 dev eth0 # Your desired Virtual IP
}
track_script {
chk_haproxy
}
}
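The virtual_router_id must be the same on both nodes. The priority determines which node becomes MASTER: the node with the higher effective priority wins. The virtual_ipaddress is the floating IP that applications will connect to. The track_script is crucial for automated failover. While the check script succeeds, Keepalived adds the script's weight (2) to the node's priority, so lb-node-1 runs at 103 and lb-node-2 at 102. If HAProxy dies on lb-node-1, its priority falls back to 101 after two failed checks (fall 2), lb-node-2 then holds the higher priority, and the VIP moves over.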
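The interaction between priority and weight is easy to get wrong, so here is a simplified Python model of it. It is a sketch of the documented behaviour of a tracked script with a non-zero weight, not Keepalived's actual code. It ignores tie-breaking, and the function names are invented for this example.

```python
def effective_priority(base: int, weight: int, script_ok: bool) -> int:
    """Priority after applying a tracked script's weight.

    Positive weight: added while the script succeeds.
    Negative weight: subtracted while the script fails.
    (With weight 0, Keepalived instead puts the instance into FAULT
    on failure; that case is not modelled here.)
    """
    if weight > 0:
        return base + weight if script_ok else base
    if weight < 0:
        return base if script_ok else base + weight
    return base


def elected_master(nodes: dict) -> str:
    """nodes maps a node name to (base_priority, weight, script_ok)."""
    return max(nodes, key=lambda name: effective_priority(*nodes[name]))


if __name__ == "__main__":
    healthy = {"lb-node-1": (101, 2, True), "lb-node-2": (100, 2, True)}
    haproxy_down = {"lb-node-1": (101, 2, False), "lb-node-2": (100, 2, True)}
    print(elected_master(healthy), elected_master(haproxy_down))
```

One design consequence: the gap between the two base priorities must be smaller than the weight. With priorities 110 and 100 and a weight of 2, a dead HAProxy on the primary would leave it at 110 against 102 and no failover would happen.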
Create the HAProxy health check script /usr/local/bin/check_haproxy.sh on both nodes:
#!/bin/bash
if pgrep haproxy > /dev/null; then
exit 0
else
exit 1
fi
Make the script executable:
sudo chmod +x /usr/local/bin/check_haproxy.sh
Start and enable Keepalived on both nodes:
sudo systemctl start keepalived
sudo systemctl enable keepalived
Verify that one node has taken the VIP address. You can check this using ip addr show eth0 (replace eth0 with your interface).
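One caveat for DigitalOcean specifically: droplets generally cannot take over an arbitrary address just by announcing it on the network, so a VIP that exists only in virtual_ipaddress may not actually move traffic. A common pattern is to pair Keepalived with a DigitalOcean Reserved IP and reassign it through the API when a node becomes MASTER. The Python sketch below builds that API request. The endpoint path and payload follow the DigitalOcean API v2 Reserved IP actions as we understand them and should be checked against the current API reference. The function names, the example address, and the droplet ID are placeholders.

```python
import json
import urllib.request

API_BASE = "https://api.digitalocean.com/v2"  # assumed API base URL


def build_assign_request(reserved_ip: str, droplet_id: int, token: str) -> urllib.request.Request:
    """Build (but do not send) the request that assigns a Reserved IP to a droplet."""
    body = json.dumps({"type": "assign", "droplet_id": droplet_id}).encode()
    return urllib.request.Request(
        url=f"{API_BASE}/reserved_ips/{reserved_ip}/actions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def assign_reserved_ip(reserved_ip: str, droplet_id: int, token: str) -> int:
    """Send the request and return the HTTP status code (raises on HTTP errors)."""
    request = build_assign_request(reserved_ip, droplet_id, token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here
    req = build_assign_request("203.0.113.10", 12345, "example-token")
    print(req.get_method(), req.full_url)
```

A script like this would typically be wired into Keepalived through the notify_master directive of the vrrp_instance, so that the node claiming MASTER also claims the Reserved IP.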
Configuring HAProxy for Elasticsearch Load Balancing
HAProxy will sit behind the Keepalived VIP and distribute traffic to your Elasticsearch data nodes. It can also perform health checks on the Elasticsearch nodes.
Install HAProxy on both load balancer nodes:
sudo apt update
sudo apt install haproxy -y
Configure HAProxy in /etc/haproxy/haproxy.cfg. This configuration assumes your Elasticsearch data nodes are accessible at es-data-1, es-data-2, and es-data-3 on port 9200.
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
stats timeout 30s
user haproxy
group haproxy
daemon
defaults
log global
mode http
option httplog
option dontlognull
timeout connect 5000
timeout client 50000
timeout server 50000
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http
frontend elasticsearch_frontend
bind *:9200 # Bind to the port your applications use to access Elasticsearch
mode http
default_backend elasticsearch_backend
backend elasticsearch_backend
mode http
balance roundrobin
option httpchk GET / HTTP/1.1\r\nHost:\ localhost
http-check expect status 200
# Add your Elasticsearch data nodes here
server es-data-1 10.10.0.1:9200 check port 9200 inter 2000 rise 2 fall 3
server es-data-2 10.10.0.2:9200 check port 9200 inter 2000 rise 2 fall 3
server es-data-3 10.10.0.3:9200 check port 9200 inter 2000 rise 2 fall 3
listen stats
bind *:8404
mode http
stats enable
stats uri /stats
stats refresh 10s
stats auth admin:YourSecurePassword # Change this!
Explanation of key HAProxy settings:
- bind *:9200: HAProxy listens on port 9200 (the default Elasticsearch HTTP port) on all interfaces. This port will be accessible via the Keepalived VIP.
- balance roundrobin: Distributes requests evenly across the backend servers.
- option httpchk GET / HTTP/1.1\r\nHost:\ localhost and http-check expect status 200: Configures HAProxy to perform HTTP health checks by sending a GET request to the root path of each Elasticsearch node and expecting a 200 OK response. Note that with xpack.security enabled as in the node configuration earlier, the nodes speak HTTPS and answer an unauthenticated GET / with 401, so the server lines need TLS enabled (the ssl keyword) and the check needs valid credentials before any node will be reported healthy.
- server es-data-X ... check port 9200 inter 2000 rise 2 fall 3: Defines each Elasticsearch data node. check enables health checking. inter 2000 means check every 2 seconds. rise 2 means the server is considered healthy after 2 successful checks. fall 3 means the server is considered down after 3 failed checks.
- listen stats: Enables the HAProxy statistics page, which is invaluable for monitoring. Remember to change the admin:YourSecurePassword to a strong, unique password.
Start and enable HAProxy:
sudo systemctl start haproxy
sudo systemctl enable haproxy
Now, your applications should point to the Keepalived VIP (192.168.1.100:9200 in this example) for all Elasticsearch interactions. If the primary HAProxy/Keepalived node fails, the VIP will move to the secondary, and HAProxy on the secondary will take over load balancing. If an Elasticsearch data node fails, HAProxy will stop sending traffic to it based on the health checks.
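The rise and fall counters used in the server lines behave like a small state machine, and simulating it shows how long a failed node keeps receiving traffic. The Python sketch below is a simplified model, not HAProxy's implementation. It ignores check timeouts and treats every check as an instant pass or fail.

```python
def simulate_health(results, rise=2, fall=3, initially_up=True):
    """Return the server state (True = up) after each health check result.

    results: iterable of booleans, True meaning the check passed.
    A server goes down after `fall` consecutive failures and comes
    back after `rise` consecutive successes.
    """
    up = initially_up
    streak = 0  # consecutive results that contradict the current state
    states = []
    for ok in results:
        if ok == up:
            streak = 0
        else:
            streak += 1
            if streak >= (fall if up else rise):
                up = not up
                streak = 0
        states.append(up)
    return states


if __name__ == "__main__":
    checks = [True, False, False, False, True, True]
    print(simulate_health(checks))
```

With inter 2000 and fall 3, a node that stops responding is removed roughly six seconds after its first failed check, and with rise 2 it returns about four seconds after it recovers.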
Shopify Deployment Considerations for High Availability
Shopify’s architecture is inherently distributed and resilient. However, when deploying custom applications or services that integrate with Shopify, or when managing your own Shopify-related infrastructure (e.g., a custom backend for Shopify POS, or a headless commerce setup), you need to apply similar HA principles.
Stateless Application Design
The most critical aspect of building HA applications that interact with Shopify is designing them to be stateless. This means that no session data or application state should be stored on the application server itself. All state should be externalized to a database (like your Elasticsearch cluster), a cache (e.g., Redis), or a dedicated session store.
For a PHP application (common in Shopify development), this translates to:
- Avoid storing session data in PHP’s default file-based sessions. Use Redis or a database for session management.
- Ensure any background jobs or workers are idempotent, meaning they can be run multiple times without changing the outcome beyond the initial execution.
- Cache frequently accessed data from Shopify’s APIs aggressively.
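The idempotency requirement for workers can be illustrated with a minimal Python sketch. It deduplicates on an event ID, for instance a delivery ID attached to an incoming webhook. The class name and the event_id parameter are hypothetical. The in-memory set stands in for a shared store such as Redis, which a stateless multi-instance deployment would need instead.

```python
class IdempotentProcessor:
    """Run a handler at most once per event ID."""

    def __init__(self, handler, seen=None):
        self._handler = handler
        # In production this would be a shared store (e.g. Redis),
        # not process-local memory
        self._seen = seen if seen is not None else set()

    def process(self, event_id, payload) -> bool:
        """Return True if the payload was handled, False if it was a duplicate."""
        if event_id in self._seen:
            return False
        self._handler(payload)
        # Record the ID only after the handler succeeds, so a crashed
        # attempt can be retried
        self._seen.add(event_id)
        return True


if __name__ == "__main__":
    handled = []
    processor = IdempotentProcessor(handled.append)
    print(processor.process("evt-1", {"order": 1001}))
    print(processor.process("evt-1", {"order": 1001}))
    print(len(handled))
```

Recording the ID after the handler runs means a failure mid-handler leads to a retry rather than a lost event, so the handler itself should still tolerate being re-run.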
Multi-Instance Deployment and Load Balancing
Deploy your application instances across multiple DigitalOcean droplets. These droplets should ideally be spread across different availability zones within a region for resilience against zone-specific failures.
Use DigitalOcean’s Load Balancers to distribute incoming traffic to your application instances. Configure health checks on the load balancer to automatically remove unhealthy instances from the pool.
// Example of a basic health check endpoint in a PHP app.
// Assumes a Laravel-style DB facade; adapt the connection check to your framework.
public function healthCheckAction() {
    // Check database connection. getPdo() throws when the connection fails,
    // so catch the exception instead of testing the return value.
    try {
        DB::connection()->getPdo();
    } catch (\Exception $e) {
        http_response_code(503); // Service Unavailable
        echo "Database connection failed";
        return;
    }
    // Check external API connectivity (e.g., Shopify API) - simplified
    try {
        // Attempt a simple, non-intrusive API call
        // e.g., $client->get('/admin/api/2023-10/shop.json');
        // For simplicity, we'll just simulate a successful check here.
        // In a real scenario, you'd want to test actual connectivity.
        $isShopifyConnected = true; // Assume true for this example
    } catch (\Exception $e) {
        $isShopifyConnected = false;
    }
    if (!$isShopifyConnected) {
        http_response_code(503);
        echo "Shopify API connection failed";
        return;
    }
    http_response_code(200); // OK
    echo "OK";
}
This endpoint, typically mapped to /health or /status, should be configured in your DigitalOcean Load Balancer’s health check settings. The load balancer will periodically ping this endpoint. If it returns a non-2xx status code, the instance is marked unhealthy and traffic is rerouted.
Database and Cache High Availability
Your application’s state is managed by its dependencies. For Elasticsearch, refer to the previous section on setting up a highly available cluster. For caching (e.g., Redis), DigitalOcean offers managed Redis instances, which provide built-in replication and failover capabilities. If you’re self-hosting Redis, ensure you set up a Redis Sentinel or Redis Cluster for automatic failover.
# Example Redis Sentinel configuration snippet (sentinel.conf)
port 26379
# Syntax: sentinel monitor <name> <ip> <port> <quorum>
sentinel monitor mymaster 10.10.0.10 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel parallel-syncs mymaster 1
sentinel notification-script mymaster /etc/redis/notify.sh
sentinel client-reconfig-script mymaster /etc/redis/reconfig.sh
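Your application should use a Sentinel-aware Redis client configured with the Sentinel addresses (port 26379) and the master name (mymaster). The client asks Sentinel for the current master and then connects to that Redis instance directly; Sentinel itself does not proxy traffic.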
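To show what that lookup involves, here is a minimal Python sketch that asks a Sentinel for the current master with the SENTINEL get-master-addr-by-name command over a raw socket. It is an illustration only. A real application should rely on the Sentinel support in its Redis client library. The function names are invented, and the sketch assumes an unauthenticated Sentinel whose short reply arrives in a single read.

```python
import socket


def encode_command(*parts: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [f"*{len(parts)}\r\n".encode()]
    for part in parts:
        data = part.encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def parse_master_addr(reply: bytes):
    """Parse the two-element array reply into (host, port), or None if unknown."""
    lines = reply.split(b"\r\n")
    if len(lines) < 5 or lines[0] != b"*2":
        return None  # null reply or error: this Sentinel does not know the master
    return lines[2].decode(), int(lines[4])


def current_master(sentinels, name="mymaster", timeout=1.0):
    """Try each (host, port) Sentinel in turn until one reports the master."""
    for host, port in sentinels:
        try:
            with socket.create_connection((host, port), timeout=timeout) as conn:
                conn.sendall(encode_command("SENTINEL", "get-master-addr-by-name", name))
                addr = parse_master_addr(conn.recv(4096))
                if addr:
                    return addr
        except OSError:
            continue  # unreachable Sentinel: try the next one
    return None


if __name__ == "__main__":
    sample_reply = b"*2\r\n$10\r\n10.10.0.10\r\n$4\r\n6379\r\n"
    print(parse_master_addr(sample_reply))
```

Listing several Sentinels matters: the lookup has to keep working when the Sentinel on the failed machine is the one that is down.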
Disaster Recovery Scenarios and Testing
A robust disaster recovery strategy isn’t complete without understanding failure scenarios and regularly testing your failover mechanisms.
Simulating Failures
Regularly test your failover paths, and after each test check cluster health (GET /_cluster/health) for any red or yellow statuses.
Automated Backups and Cross-Region Replication
While failover handles immediate availability, data durability requires backups. For Elasticsearch, implement regular snapshots to DigitalOcean Spaces (S3-compatible object storage). Configure these snapshots to be taken from your primary cluster and potentially copied to a different region for true disaster recovery.
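# In elasticsearch.yml on every node: point the default S3 client at DigitalOcean Spaces
s3.client.default.endpoint: "nyc3.digitaloceanspaces.com"
# Credentials belong in the Elasticsearch keystore on every node, not in the repository settings:
# bin/elasticsearch-keystore add s3.client.default.access_key
# bin/elasticsearch-keystore add s3.client.default.secret_key
# Register a repository pointing to DigitalOcean Spaces
PUT /_snapshot/my_do_backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups-nyc3",
    "client": "default"
  }
}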
# Schedule a daily snapshot at midnight with snapshot lifecycle management (SLM)
PUT /_slm/policy/daily-snapshot
{
  "schedule": "0 0 0 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "my_do_backup",
  "config": {
    "indices": ["my-index-*"],
    "ignore_unavailable": true,
    "include_global_state": false,
    "partial": true
  }
}
For your application data (if not solely in Elasticsearch), ensure your databases and caches are also backed up regularly. For critical data, consider cross-region replication for your databases and object storage.