Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and WordPress Deployments on Google Cloud
Designing for Resilience: Elasticsearch and WordPress Auto-Failover on GCP
High availability for a critical web application depends on a tested disaster recovery strategy. For a deployment that pairs Elasticsearch (search) with WordPress (content management), that means automating failover at every tier. This document outlines an approach built on Google Cloud Platform (GCP) services that aims to keep downtime to a minimum.
Elasticsearch Cluster High Availability and Failover
Elasticsearch’s inherent distributed nature provides a strong foundation for HA. We’ll focus on configuring a multi-node cluster with appropriate shard allocation and replica settings, coupled with a load balancer for directing traffic to healthy nodes.
GCP Compute Engine Instance Group Configuration
We’ll utilize a Managed Instance Group (MIG) for our Elasticsearch nodes. This allows for auto-scaling and automated instance replacement. Health checks are paramount here.
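First, define a health check that probes an endpoint on each Elasticsearch node. An HTTP GET against the cluster health API is enough to show that the node is up and answering requests. The endpoint normally answers HTTP 200 whether the cluster status is green, yellow, or red, so this check detects dead nodes, not a degraded cluster. If Elasticsearch security and TLS are enabled (the default from 8.0), an unauthenticated plain-HTTP probe will fail and the check must be adapted.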
GCP Health Check Definition (gcloud CLI)
gcloud compute health-checks create http elasticsearch-health-check \
--port 9200 \
--request-path="/_cluster/health" \
--check-interval=5s \
--timeout=5s \
--unhealthy-threshold=3 \
--healthy-threshold=2
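Because the load balancer only looks at the HTTP status code, a monitoring job that needs the actual cluster state has to read the JSON body. The sketch below is a minimal illustration of that idea; `classify_cluster_health` is a hypothetical helper name, and it assumes the response body carries the usual `status` field of the cluster health API.

```python
import json


def classify_cluster_health(http_status: int, body: str) -> str:
    """Map one /_cluster/health response to 'green', 'yellow', 'red' or 'down'.

    Hypothetical helper for illustration; it is not part of any GCP or
    Elasticsearch client library.
    """
    # Anything other than 200 means the node did not answer normally.
    if http_status != 200:
        return "down"
    try:
        status = json.loads(body).get("status")
    except (ValueError, AttributeError):
        # Body was not JSON, or not a JSON object.
        return "down"
    return status if status in ("green", "yellow", "red") else "down"
```

A load balancer health check corresponds to "anything except down"; alerting can use the finer-grained result.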
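Next, create a backend service that uses this health check. The MIG is attached to it as a backend once the group exists (gcloud compute backend-services add-backend), and the group needs a named port http mapped to 9200 to match --port-name. Also allow Google’s health-check ranges (130.211.0.0/22 and 35.191.0.0/16) through the firewall on port 9200; otherwise every node is reported unhealthy.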
GCP Backend Service Definition (gcloud CLI)
gcloud compute backend-services create elasticsearch-backend \
--protocol=HTTP \
--port-name=http \
--health-checks=elasticsearch-health-check \
--global
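Now, create the Managed Instance Group. Ensure your Elasticsearch nodes are configured for discovery and that indices have sufficient replicas. Larger clusters often separate master-eligible, data, and ingest nodes; in the three-node group below, every node holds all three roles. For HA, run at least 3 master-eligible nodes. On Elasticsearch 6.x and earlier, set discovery.zen.minimum_master_nodes to floor(master_eligible_count / 2) + 1 to prevent split brain; from 7.0 onward the cluster manages its voting quorum itself and the setting is no longer used.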
GCP Managed Instance Group Creation (gcloud CLI)
gcloud compute instance-groups managed create elasticsearch-mig \
--template=elasticsearch-instance-template \
--size=3 \
--zone=us-central1-a \
--health-check=elasticsearch-health-check \
--initial-delay=300s
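The --initial-delay flag gives Elasticsearch time to start and join the cluster before autohealing acts on failed health checks; without it, the MIG can recreate nodes that are still booting. Note that --zone places all three nodes in one zone, so a zonal outage takes down the whole cluster. A regional MIG (--region=us-central1 in place of --zone) spreads the nodes across zones.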
Elasticsearch Configuration for Resilience
Within your Elasticsearch configuration (elasticsearch.yml), ensure the following:
Key elasticsearch.yml Settings
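cluster.name: "my-production-cluster"
node.master: true # Legacy role flags; releases from 7.9 use node.roles: [ master, data, ingest ]
node.data: true
node.ingest: true
discovery.seed_hosts: ["host1", "host2", "host3"] # Replace with actual IPs or DNS of master-eligible nodes
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"] # Replace with actual node names; only needed when bootstrapping a new cluster
# discovery.zen.minimum_master_nodes: 2 # Elasticsearch 6.x and earlier only (3 master-eligible nodes); 7.0+ uses the two discovery settings above instead
indices.recovery.max_bytes_per_sec: "100mb" # Adjust based on network capacity
# thread_pool.write.size: leave at the default, which is derived from the CPU count; Elasticsearch caps it at 1 + allocated processors
# thread_pool.search.size: leave at the default unless load testing shows a need to change it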
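The majority rule for master-eligible nodes, (master_eligible_count / 2) + 1 with integer division, can be checked with a few lines of arithmetic. This is a sketch for reasoning about cluster sizes, not Elasticsearch code; both function names are made up for the example.

```python
def minimum_master_nodes(master_eligible_count: int) -> int:
    """Smallest majority of the master-eligible nodes (integer division)."""
    if master_eligible_count < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_count // 2 + 1


def tolerated_master_failures(master_eligible_count: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible_count - minimum_master_nodes(master_eligible_count)
```

With 3 master-eligible nodes the quorum is 2 and one node can fail. A fourth node raises the quorum to 3 without tolerating any additional failure, which is why odd counts are the usual choice.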
For indices, always configure replicas. At least 1 replica per shard is recommended for HA, so that every shard has a second copy on a different node and stays available if one node fails.
Index Shard and Replica Configuration (Example)
PUT /my-index
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }
}
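To see what these two settings mean for capacity planning, the following sketch counts shard copies and checks whether every replica can be placed. It relies on the fact that Elasticsearch never allocates two copies of the same shard to one node; the function names are illustrative only.

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Each primary shard exists once plus once per configured replica."""
    return primaries * (1 + replicas)


def replicas_can_be_assigned(replicas: int, data_nodes: int) -> bool:
    """Copies of one shard must sit on distinct nodes, so a shard with
    `replicas` replicas needs at least replicas + 1 data nodes."""
    return data_nodes >= replicas + 1
```

For the index above, 3 primaries with 1 replica give 6 shard copies, and the three-node cluster has room to place all of them. On a single data node the replicas would stay unassigned and the cluster would report yellow.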
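The GCP load balancer, through the backend service that points at the MIG, routes traffic away from Elasticsearch nodes that fail the health check. When the MIG recreates a node, the new instance rejoins the cluster and Elasticsearch reallocates shards to it. Unless the data directory sits on a disk that survives recreation, the node comes back empty and its shard copies are rebuilt from the replicas on the surviving nodes.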
WordPress High Availability and Failover
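WordPress’s PHP tier can run as interchangeable replicas, but it is not fully stateless. Besides the database and any object cache, uploads, plugins, and themes live under wp-content on the local filesystem and must be placed on shared storage or offloaded for multiple replicas to stay consistent. We’ll focus on a highly available database and a resilient web server layer.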
Cloud SQL for MySQL High Availability
For the WordPress database, Cloud SQL for MySQL offers a managed HA configuration. Enabling this feature automatically provisions a standby instance in a different zone. In case of primary instance failure, Cloud SQL automatically fails over to the standby.
Enabling HA for Cloud SQL for MySQL (gcloud CLI)
gcloud sql instances patch YOUR_INSTANCE_NAME \
--availability-type=REGIONAL
Point your WordPress application at the Cloud SQL instance’s IP address. The address stays the same across a failover, so no configuration or code change is needed, but connections that are open when the failover starts are dropped. The application must reconnect, and requests can fail until the standby has taken over, so connection retries are essential.
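The retry behaviour matters more than the language it is written in. WordPress itself is PHP, so the following Python sketch is only an illustration of the pattern (exponential backoff with a cap) for any client or maintenance script that talks to the database during a failover. `connect_with_retry` and its parameters are hypothetical names, and `connect` stands for whatever function opens the connection.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, backing off exponentially.

    Retries only on ConnectionError and re-raises after the last attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Delays grow as base_delay, 2x, 4x, ... up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting.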
WordPress Web Server Auto-Failover with GKE
Deploying WordPress on Google Kubernetes Engine (GKE) provides a robust platform for managing web server availability. We’ll use a Deployment with multiple replicas and a Service with a Network Load Balancer.
GKE Deployment for WordPress
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wordpress-deployment
  labels:
    app: wordpress
spec:
  replicas: 3 # Start with 3, adjust based on load
  selector:
    matchLabels:
      app: wordpress
  template:
    metadata:
      labels:
        app: wordpress
    spec:
      containers:
      - name: wordpress
        image: wordpress:latest # Use a specific, tested version in production
        ports:
        - containerPort: 80
        env:
        - name: WORDPRESS_DB_HOST
          value: "YOUR_CLOUD_SQL_PRIVATE_IP:3306" # host[:port]; use "127.0.0.1:3306" when a Cloud SQL Auth Proxy sidecar runs in the pod
        - name: WORDPRESS_DB_USER
          valueFrom:
            secretKeyRef:
              name: wordpress-db-secrets
              key: user
        - name: WORDPRESS_DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: wordpress-db-secrets
              key: password
        - name: WORDPRESS_DB_NAME
          value: "wordpress_db"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          tcpSocket: # Only checks that the web server accepts connections, so a database failover does not restart every pod
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /wp-login.php # Or a custom health check endpoint; avoid /wp-cron.php, which triggers any due scheduled tasks on each request
            port: 80
          initialDelaySeconds: 15
          periodSeconds: 5
The livenessProbe and readinessProbe are how GKE manages pod health: a pod that fails its readiness probe is taken out of the Service’s endpoints, and a container that fails its liveness probe is restarted. The liveness probe here deliberately avoids the database so that a Cloud SQL failover does not restart every pod at once. WORDPRESS_DB_HOST must be a hostname or IP address, optionally with a port. The instance connection name (for example, my-project:us-central1:my-db-instance) is not a hostname; it is the argument given to the Cloud SQL Auth Proxy, which, when run as a sidecar, exposes the database on 127.0.0.1 for secure and reliable access.
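The probe settings determine how quickly Kubernetes reacts to a failing pod. As a rough estimate, and assuming the Kubernetes default failureThreshold of 3 (the manifest does not set it), the reaction time is the probe period multiplied by the threshold. The helper name below is made up for the example, and the result ignores probe timeouts and scheduling jitter.

```python
def seconds_to_trip(period_seconds: int, failure_threshold: int = 3) -> int:
    """Approximate time from the first failed probe to Kubernetes acting on it.

    failure_threshold defaults to 3, the Kubernetes default when the
    manifest does not set failureThreshold.
    """
    return period_seconds * failure_threshold


# Values taken from the Deployment manifest.
readiness_reaction = seconds_to_trip(5)   # pod removed from Service endpoints
liveness_reaction = seconds_to_trip(10)   # container restarted
```

Under these assumptions an unready pod stops receiving traffic after about 15 seconds and a hung container is restarted after about 30 seconds.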
GKE Service with Load Balancer
apiVersion: v1
kind: Service
metadata:
  name: wordpress-service
  labels:
    app: wordpress
spec:
  type: LoadBalancer
  selector:
    app: wordpress
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
This Kubernetes Service will provision a GCP Network Load Balancer, distributing traffic across the healthy WordPress pods managed by the Deployment. If a pod fails, the load balancer will stop sending traffic to it, and GKE will attempt to replace the pod.
Integrating Elasticsearch with WordPress
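For search functionality, plugins like “ElasticPress” can be used. The plugin configuration within WordPress needs to point to the Elasticsearch cluster’s endpoint, which should be the IP address or DNS name of the GCP load balancer fronting your Elasticsearch MIG. Keep that load balancer internal to the VPC where possible: a backend service created without --load-balancing-scheme defaults to EXTERNAL, and an Elasticsearch endpoint that is reachable from the internet without authentication exposes every index.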
Elasticsearch Endpoint Configuration (WordPress Admin)
In the WordPress admin area, navigate to the ElasticPress settings and configure the Elasticsearch host to point to your GCP load balancer’s IP address or hostname, using the port on which the load balancer’s forwarding rule listens. For example:
http://YOUR_ELASTICSEARCH_LOAD_BALANCER_IP:9200
Ensure that your WordPress pods have network access to the Elasticsearch endpoint (port 9200 in this example). This is controlled by VPC firewall rules, which must allow traffic from the GKE pod IP range to that port.
Automated Failover Orchestration and Monitoring
The described architecture relies on GCP’s managed services and Kubernetes’ self-healing capabilities for automated failover. However, proactive monitoring and alerting are essential to detect issues before they impact users and to validate failover events.
Monitoring with Google Cloud Operations (formerly Stackdriver)
Utilize Cloud Monitoring to track key metrics for both Elasticsearch and WordPress:
- Elasticsearch: Cluster health status (green, yellow, red), node status, JVM heap usage, CPU utilization, indexing/search latency.
- Cloud SQL: Instance CPU, memory, disk I/O, network traffic, replication lag (if applicable).
- GKE: Pod health (running, pending, failed), container CPU/memory usage, network traffic, load balancer health checks.
Configure alerting policies based on these metrics. For example, an alert can be triggered if the Elasticsearch cluster health status remains red for more than 5 minutes, or if a significant number of WordPress pods are in a failed state.
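The "red for more than 5 minutes" condition is a sustained-state check: a single red sample should not page anyone, but an unbroken run of them should. Cloud Monitoring evaluates such conditions itself; the sketch below only illustrates the logic, assuming status samples arrive as time-ordered (timestamp, status) pairs. The function name and data shape are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def red_for_longer_than(
    samples: Iterable[Tuple[float, str]],
    threshold_seconds: float = 300.0,
) -> bool:
    """Return True if the status stayed 'red' for more than threshold_seconds.

    samples: (unix_timestamp, status) pairs ordered by time.
    Any non-red sample resets the timer.
    """
    red_since: Optional[float] = None
    for timestamp, status in samples:
        if status == "red":
            if red_since is None:
                red_since = timestamp
            if timestamp - red_since > threshold_seconds:
                return True
        else:
            red_since = None
    return False
```

A brief recovery to yellow or green restarts the clock, which keeps a flapping cluster from triggering the sustained-outage alert; flapping is better caught by a separate condition.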
Testing Failover Scenarios
Regularly test your failover mechanisms. This can involve:
- Manually stopping an Elasticsearch node and observing its removal from the cluster and traffic redirection.
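- Simulating a database failure in a staging environment by triggering a manual Cloud SQL failover (gcloud sql instances failover YOUR_INSTANCE_NAME), rather than stopping the instance, which shuts it down without promoting the standby.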
- Deleting a WordPress pod to verify GKE’s ability to replace it and the load balancer’s traffic management.
Document the observed behavior and any necessary adjustments to configurations or alerting thresholds.
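One concrete number worth recording from each drill is the longest interval during which the site did not answer. The sketch below derives it from a log of periodic probe results; it assumes the log is a time-ordered list of (timestamp, succeeded) pairs collected by whatever probing tool is in use, and the function name is illustrative.

```python
from typing import Iterable, Optional, Tuple


def longest_outage_seconds(probes: Iterable[Tuple[float, bool]]) -> float:
    """Longest span from a first failed probe to the next successful one.

    probes: (unix_timestamp, succeeded) pairs ordered by time.
    An outage that is still open at the end of the log is not counted.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, ok in probes:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest
```

The measurement is only as precise as the probe interval, so probe at least as often as the recovery time you want to verify.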
Conclusion
By combining GCP’s managed services like Cloud SQL and GKE with careful configuration of Elasticsearch and WordPress, you can architect a highly available deployment with automated failover. This approach minimizes manual intervention during incidents, significantly reducing downtime and ensuring a more resilient application for your users.