Disaster Recovery 101: Architecting Auto-Failovers for Elasticsearch and WordPress Deployments on Google Cloud

Designing for Resilience: Elasticsearch and WordPress Auto-Failover on GCP

Achieving true high availability for critical web applications necessitates robust disaster recovery strategies. For deployments combining Elasticsearch for search and WordPress as the content management system, this translates to architecting automated failover mechanisms. This document outlines a production-ready approach leveraging Google Cloud Platform (GCP) services to ensure minimal downtime.

Elasticsearch Cluster High Availability and Failover

Elasticsearch’s inherent distributed nature provides a strong foundation for HA. We’ll focus on configuring a multi-node cluster with appropriate shard allocation and replica settings, coupled with a load balancer for directing traffic to healthy nodes.

GCP Compute Engine Instance Group Configuration

We’ll utilize a Managed Instance Group (MIG) for our Elasticsearch nodes. This allows for auto-scaling and automated instance replacement. Health checks are paramount here.

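First, define a health check that probes an HTTP endpoint on each Elasticsearch node. A GET request to the cluster health API is a reasonable starting point, with one caveat: this endpoint returns HTTP 200 even when the cluster status is yellow or red, so the check confirms that the node is up and answering requests, not that every shard is available.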

GCP Health Check Definition (gcloud CLI)

gcloud compute health-checks create http elasticsearch-health-check \
    --port 9200 \
    --request-path="/_cluster/health" \
    --check-interval=5s \
    --timeout=5s \
    --unhealthy-threshold=3 \
    --healthy-threshold=2
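
To make the effect of the interval and threshold flags concrete, the following minimal Python sketch estimates how long a node must fail (or pass) probes before its state flips. The function name is hypothetical, and the estimate is deliberately rough: it ignores the per-probe timeout and any jitter in probe scheduling.

```python
def detection_window_seconds(check_interval_s: float, threshold: int) -> float:
    """Approximate time for `threshold` consecutive probe results to accumulate."""
    if check_interval_s <= 0 or threshold < 1:
        raise ValueError("interval must be positive and threshold at least 1")
    return check_interval_s * threshold


# Values taken from the health check definition above.
time_to_unhealthy = detection_window_seconds(5, 3)  # 15 seconds of failed probes
time_to_healthy = detection_window_seconds(5, 2)    # 10 seconds of passing probes
print(time_to_unhealthy, time_to_healthy)
```

Lowering the interval or the unhealthy threshold shortens detection time, at the cost of more false positives during garbage-collection pauses or brief network blips.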

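Next, create a backend service that uses this health check. The MIG does not exist yet, so it is attached afterwards with gcloud compute backend-services add-backend. The --port-name value must also match a named port defined on the instance group (gcloud compute instance-groups set-named-ports), mapped in this setup to port 9200.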

GCP Backend Service Definition (gcloud CLI)

gcloud compute backend-services create elasticsearch-backend \
    --protocol=HTTP \
    --port-name=http \
    --health-checks=elasticsearch-health-check \
    --global

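Now, create the Managed Instance Group. Ensure your Elasticsearch nodes are configured for discovery and that indices have sufficient replicas. Larger clusters commonly separate dedicated master-eligible nodes, data nodes, and ingest nodes. For HA, run at least 3 master-eligible nodes so that a majority, (master_eligible_count / 2) + 1 using integer division, survives the loss of one node. Elasticsearch 7.0 and later maintains this quorum automatically; only on 6.x and earlier must you set discovery.zen.minimum_master_nodes to that value yourself.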
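
The majority formula (master_eligible_count / 2) + 1 is easy to misread, because the division is integer division. The short Python sketch below, with hypothetical function names, tabulates the quorum and the number of master-eligible nodes that can be lost while a quorum remains.

```python
def quorum(master_eligible_count: int) -> int:
    """Smallest majority of the master-eligible nodes (integer division)."""
    if master_eligible_count < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_count // 2 + 1


def tolerated_master_failures(master_eligible_count: int) -> int:
    """How many master-eligible nodes can fail while a quorum remains."""
    return master_eligible_count - quorum(master_eligible_count)


for n in (1, 2, 3, 4, 5):
    print(n, quorum(n), tolerated_master_failures(n))
```

Two master-eligible nodes tolerate no failures and four tolerate only one, which is why odd counts such as 3 or 5 are the usual choice.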

GCP Managed Instance Group Creation (gcloud CLI)

gcloud compute instance-groups managed create elasticsearch-mig \
    --template=elasticsearch-instance-template \
    --size=3 \
    --zone=us-central1-a \
    --health-check=elasticsearch-health-check \
    --initial-delay=300s

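The --initial-delay gives Elasticsearch time to bootstrap and join the cluster before autohealing acts on failed health checks; without it, the MIG may recreate nodes that are still starting. Note that a zonal MIG, created with --zone, does not survive a zone outage; passing --region=us-central1 instead creates a regional MIG that spreads nodes across zones.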

Elasticsearch Configuration for Resilience

Within your Elasticsearch configuration (elasticsearch.yml), ensure the following:

Key elasticsearch.yml Settings

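cluster.name: "my-production-cluster"
node.roles: [ master, data, ingest ] # Elasticsearch 7.9+; replaces the deprecated node.master / node.data / node.ingest booleans
discovery.seed_hosts: ["host1", "host2", "host3"] # Replace with actual IPs or DNS of master-eligible nodes
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"] # Replace with actual node names; needed only for the first cluster bootstrap
# discovery.zen.minimum_master_nodes: 2 # Only for 6.x and earlier (3 master-eligible nodes); ignored on 7.x and removed in 8.0
indices.recovery.max_bytes_per_sec: "100mb" # Adjust based on network capacity
# thread_pool.write.size and thread_pool.search.size are derived from the CPU count by default.
# The settings have no "indices." prefix, and the write pool cannot be sized far above the CPU count; keep the defaults unless load testing shows a need.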

For indices, always configure replicas. A minimum of 1 replica per shard is recommended for HA. This ensures that if a node fails, data is still available from another node.

Index Shard and Replica Configuration (Example)

PUT /my-index
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }
}
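
As a rough illustration of what these two settings imply, the Python sketch below computes the total number of shard copies and how many simultaneous data-node losses leave at least one copy of every shard. The names are hypothetical, and the model is simplified: it relies only on the rule that Elasticsearch never places two copies of the same shard on one node, and it assumes all copies were assigned before the failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardLayout:
    total_shard_copies: int      # primaries plus all replica copies
    all_copies_assignable: bool  # enough nodes to host every copy separately
    tolerated_node_losses: int   # simultaneous losses that keep every shard available


def shard_layout(primaries: int, replicas: int, data_nodes: int) -> ShardLayout:
    if primaries < 1 or replicas < 0 or data_nodes < 1:
        raise ValueError("primaries and data_nodes must be >= 1, replicas >= 0")
    return ShardLayout(
        total_shard_copies=primaries * (1 + replicas),
        # Each shard needs replicas + 1 distinct nodes for its copies.
        all_copies_assignable=data_nodes >= replicas + 1,
        # A shard survives as long as one of its copies does.
        tolerated_node_losses=min(replicas, data_nodes - 1),
    )


# The example index on a three-node cluster:
# 6 copies in total, all assignable, 1 node loss tolerated.
print(shard_layout(primaries=3, replicas=1, data_nodes=3))
```

Raising number_of_replicas to 2 on the same three nodes would tolerate two simultaneous node losses, at the cost of storing every document three times.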

The GCP load balancer, configured with the backend service pointing to the MIG, will automatically route traffic away from unhealthy Elasticsearch nodes. When a node is replaced by the MIG, it will rejoin the cluster, and shards will be rebalanced.

WordPress High Availability and Failover

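WordPress is not stateless out of the box: besides the database, it writes uploads, plugins, and themes to the local wp-content directory, and it may rely on an object cache. Running several replicas therefore requires shared or externalized storage for wp-content, in addition to the two pieces covered here: a highly available database and a resilient web server layer.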

Cloud SQL for MySQL High Availability

For the WordPress database, Cloud SQL for MySQL offers a managed HA configuration. Enabling this feature provisions a standby instance in a different zone of the same region. If the primary instance fails, Cloud SQL automatically fails over to the standby. For MySQL, automated backups and binary logging must be enabled on the instance before HA can be turned on.

Enabling HA for Cloud SQL for MySQL (gcloud CLI)

gcloud sql instances patch YOUR_INSTANCE_NAME \
    --availability-type=REGIONAL

Ensure your WordPress application is configured to connect to the Cloud SQL instance’s IP address. That address stays the same after a failover, so no configuration or code changes are needed. Existing connections are dropped while the standby takes over, however, so the application must be able to retry and reconnect.
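
The retry behavior mentioned here can be sketched as follows. WordPress itself is PHP, so this Python snippet only illustrates the pattern (bounded retries with exponential backoff). The connect callable and all parameter names are hypothetical stand-ins for whatever database client is in use.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, backing off exponentially between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error to the caller
            # Delays grow as 0.5 s, 1 s, 2 s, ... and are capped at max_delay_s.
            sleep(min(max_delay_s, base_delay_s * 2 ** attempt))
    raise AssertionError("unreachable")
```

The total retry budget should comfortably exceed the failover duration observed in your own tests; otherwise requests fail even though the database comes back on its own.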

WordPress Web Server Auto-Failover with GKE

Deploying WordPress on Google Kubernetes Engine (GKE) provides a robust platform for managing web server availability. We’ll use a Deployment with multiple replicas and a Service with a Network Load Balancer.

GKE Deployment for WordPress

apiVersion: apps/v1
kind: Deployment
metadata:
  name: wordpress-deployment
  labels:
    app: wordpress
spec:
  replicas: 3 # Start with 3, adjust based on load
  selector:
    matchLabels:
      app: wordpress
  template:
    metadata:
      labels:
        app: wordpress
    spec:
      containers:
      - name: wordpress
        image: wordpress:latest # Use a specific, tested version in production
        ports:
        - containerPort: 80
        env:
        - name: WORDPRESS_DB_HOST
          value: "127.0.0.1:3306" # Assumes a Cloud SQL Auth Proxy sidecar (not shown) listening on localhost; otherwise use the instance's private IP
        - name: WORDPRESS_DB_USER
          valueFrom:
            secretKeyRef:
              name: wordpress-db-secrets
              key: user
        - name: WORDPRESS_DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: wordpress-db-secrets
              key: password
        - name: WORDPRESS_DB_NAME
          value: "wordpress_db"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /wp-login.php # Or a custom health check endpoint; avoid /wp-cron.php, which triggers the scheduled-task runner on every probe
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /wp-login.php # Or a custom health check endpoint
            port: 80
          initialDelaySeconds: 15
          periodSeconds: 5

The readinessProbe and livenessProbe let Kubernetes manage pod health: a pod that fails its readiness probe is removed from the Service’s endpoints, and a container that fails its liveness probe is restarted. WORDPRESS_DB_HOST must be a host and port that MySQL clients can connect to. The instance connection name (for example my-project:us-central1:my-db-instance) is not a hostname; it is the argument passed to the Cloud SQL Auth Proxy, which then exposes the database on a local port such as 127.0.0.1:3306.

GKE Service with Load Balancer

apiVersion: v1
kind: Service
metadata:
  name: wordpress-service
  labels:
    app: wordpress
spec:
  type: LoadBalancer
  selector:
    app: wordpress
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80

This Kubernetes Service will provision an external passthrough Network Load Balancer on GCP, distributing traffic across the WordPress pods that pass their readiness probe. If a pod fails, it is removed from the Service’s endpoints so that it stops receiving traffic, and the Deployment creates a replacement.

Integrating Elasticsearch with WordPress

For search functionality, plugins like “ElasticPress” can be used. The plugin configuration within WordPress needs to point to the Elasticsearch cluster’s endpoint. This endpoint should be the IP address or DNS name of the GCP load balancer fronting your Elasticsearch MIG.

Elasticsearch Endpoint Configuration (WordPress Admin)

In the WordPress admin area, navigate to the ElasticPress settings and configure the Elasticsearch host to point to your GCP load balancer’s IP address or hostname. For example:

http://YOUR_ELASTICSEARCH_LOAD_BALANCER_IP:9200

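Ensure that your WordPress pods can reach the load balancer’s frontend port (9200 in the example above); this is controlled by VPC firewall rules. Do not expose this endpoint to the public internet: an internal load balancer reachable only from the GKE cluster’s VPC is the safer choice.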
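
A quick way to confirm reachability from inside the cluster is to request the cluster health API through the load balancer. The Python sketch below uses only the standard library; the function names are hypothetical, and the fetch parameter exists so the logic can be exercised without a live cluster.

```python
import json
from typing import Callable
from urllib.request import urlopen


def _http_get(url: str, timeout_s: float = 5.0) -> str:
    """Fetch a URL and return the response body as text."""
    with urlopen(url, timeout=timeout_s) as response:
        return response.read().decode("utf-8")


def cluster_status(base_url: str, fetch: Callable[[str], str] = _http_get) -> str:
    """Return the cluster status color reported behind `base_url`."""
    body = fetch(base_url.rstrip("/") + "/_cluster/health")
    status = json.loads(body).get("status")
    if status not in ("green", "yellow", "red"):
        raise ValueError(f"unexpected cluster health response: {body!r}")
    return status


# Example call (placeholder address, as in the text above):
# cluster_status("http://YOUR_ELASTICSEARCH_LOAD_BALANCER_IP:9200")
```

A connection error or timeout from this call points to networking or firewall rules, whereas a returned color of red points to the cluster itself.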

Automated Failover Orchestration and Monitoring

The described architecture relies on GCP’s managed services and Kubernetes’ self-healing capabilities for automated failover. However, proactive monitoring and alerting are essential to detect issues before they impact users and to validate failover events.

Monitoring with Google Cloud Operations (Stackdriver)

Use Cloud Monitoring to track key metrics for each layer of the stack:

Elasticsearch: Cluster health status (green, yellow, red), node status, JVM heap usage, CPU utilization, indexing/search latency.
Cloud SQL: Instance CPU, memory, disk I/O, network traffic, replication lag (if applicable).
GKE: Pod health (running, pending, failed), container CPU/memory usage, network traffic, load balancer health checks.

Configure alerting policies based on these metrics. For example, an alert can be triggered if the Elasticsearch cluster health status remains red for more than 5 minutes, or if a significant number of WordPress pods are in a failed state.
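
The "red for more than 5 minutes" condition is a sustained-state check, not a single-sample check. Cloud Monitoring alerting policies evaluate such duration windows themselves; the Python sketch below only illustrates the logic, using a hypothetical function name and a list of (timestamp, status) samples as an assumed input format.

```python
from typing import Iterable, Optional, Tuple


def red_for_longer_than(
    samples: Iterable[Tuple[float, str]],
    threshold_s: float = 300.0,
) -> bool:
    """Return True if status stayed "red" for more than `threshold_s` seconds.

    `samples` are (unix_timestamp, status) pairs in chronological order.
    """
    red_since: Optional[float] = None
    for timestamp, status in samples:
        if status == "red":
            if red_since is None:
                red_since = timestamp  # start of the current red streak
            if timestamp - red_since > threshold_s:
                return True
        else:
            red_since = None  # any non-red sample resets the streak
    return False
```

Requiring a sustained streak avoids paging on the brief red state that can occur while a replaced node rejoins and primaries are reassigned.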

Testing Failover Scenarios

Regularly test your failover mechanisms. This can involve:

Manually stopping an Elasticsearch node and observing its removal from the cluster and traffic redirection.
Simulating a database failure by triggering a manual Cloud SQL failover (gcloud sql instances failover YOUR_INSTANCE_NAME) in a staging environment.
Deleting a WordPress pod to verify GKE’s ability to replace it and the load balancer’s traffic management.

Document the observed behavior and any necessary adjustments to configurations or alerting thresholds.

Conclusion

By combining GCP’s managed services like Cloud SQL and GKE with careful configuration of Elasticsearch and WordPress, you can architect a highly available deployment with automated failover. This approach minimizes manual intervention during incidents, significantly reducing downtime and ensuring a more resilient application for your users.