Disaster Recovery 101: Architecting Auto-Failovers for Redis and Shopify Deployments on Google Cloud

Automated Redis Failover with Google Cloud Memorystore and Compute Engine

Achieving high availability for critical services like Redis on Google Cloud Platform (GCP) necessitates an automated failover strategy. This isn’t merely about having a replica; it’s about detecting failures and seamlessly redirecting traffic with minimal to zero downtime. For Redis, especially when deployed outside of managed Memorystore High Availability (HA) configurations (e.g., self-managed clusters on Compute Engine), this involves a combination of health checks, DNS manipulation, and potentially a sentinel-like orchestration layer.

Scenario: Self-Managed Redis Cluster on Compute Engine with a Load Balancer

Consider a scenario where you’re running a self-managed Redis deployment (a primary with one or more replicas, not Redis Cluster mode) across multiple Compute Engine instances for cost or control reasons. A common pattern is to place a Google Cloud Load Balancer (specifically, a passthrough Network Load Balancer for TCP traffic) in front of the Redis primary instance. The load balancer’s backend must then be updated to point to the new primary whenever the current one fails.

Health Checking Redis Instances

The first step is robust health checking. We need a mechanism that can reliably determine if a Redis instance is healthy and capable of serving traffic. A simple `PING` command is often insufficient: it only shows that the server process answers, and it succeeds on a read-only replica just as it does on the primary. A more comprehensive check attempts a write and a read against a dedicated health check key. Because replicas reject writes by default, such a check passes only on the current primary.

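# Example health check script (Bash)
REDIS_HOST="" # Address of the Redis instance to probe
REDIS_PORT="6379"
HEALTH_KEY="redis_health_check_$$" # Unique per run: $$ expands to this shell's PID

# Attempt to set a value. The reply text is compared instead of trusting the
# exit code alone, so an error reply (e.g., READONLY from a replica) counts as a failure.
# EX 60 lets the key expire on its own if the script dies before cleanup.
if [ "$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" SET "$HEALTH_KEY" "OK" EX 60 2>/dev/null)" = "OK" ]; then
    # Attempt to read the value back
    if [ "$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" GET "$HEALTH_KEY" 2>/dev/null)" = "OK" ]; then
        # Clean up the key
        redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" DEL "$HEALTH_KEY" > /dev/null 2>&1
        exit 0 # Healthy
    else
        echo "Redis GET failed on $REDIS_HOST" >&2
        exit 1 # Unhealthy
    fi
else
    echo "Redis SET failed on $REDIS_HOST" >&2
    exit 1 # Unhealthy
fi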

This script can be run periodically by a monitoring agent (e.g., Prometheus Node Exporter with a custom collector, or a custom daemon) on a separate instance or a dedicated monitoring VM. The output (exit code 0 for healthy, non-zero for unhealthy) is crucial for automation.
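
A single failed probe is usually not enough evidence to act on. The sketch below wraps any check command and reports failure only after several consecutive probes fail. The function name `check_with_threshold`, the threshold, and the interval are illustrative assumptions, not part of the original setup.

```shell
#!/bin/sh
# check_with_threshold MAX_FAILURES INTERVAL_SECONDS COMMAND...
# Runs COMMAND up to MAX_FAILURES times. Returns 0 as soon as one probe
# succeeds; returns 1 only if every probe in the window fails.
check_with_threshold() {
    local max_failures="$1" interval="$2" failures=0
    shift 2
    while [ "$failures" -lt "$max_failures" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        failures=$((failures + 1))
        # Sleep between probes, but not after the final failure.
        if [ "$failures" -lt "$max_failures" ]; then
            sleep "$interval"
        fi
    done
    return 1
}

# Usage (hypothetical script path): three probes, five seconds apart.
# check_with_threshold 3 5 /opt/redis/health_check.sh || echo "primary unhealthy"
```

Requiring consecutive failures trades a slightly longer detection time for fewer false alarms caused by a momentary network blip.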

Automating Failover with Redis Sentinel (Recommended for Self-Managed)

For self-managed Redis deployments, Redis Sentinel is the de facto standard for high availability and automatic failover. Each Sentinel monitors the Redis instances and marks a master as subjectively down (SDOWN) when it stops responding. Once enough Sentinels agree (the configured quorum), the master is considered objectively down (ODOWN); Sentinel then promotes a replica and reports the new master’s address to clients that ask for it.

Sentinel Configuration (`sentinel.conf`)

port 26379
daemonize yes
pidfile /var/run/redis/redis-sentinel.pid
logfile /var/log/redis/sentinel.log

# Monitor the master Redis instance
# Format: sentinel monitor <master-name> <ip> <port> <quorum>
# <quorum> is the number of Sentinels that must agree that the master is down.
sentinel monitor mymaster 10.10.1.10 6379 2

# Define the down-detection threshold
# Milliseconds the master must be unreachable before this Sentinel marks it as subjectively down (SDOWN).
sentinel down-after-milliseconds mymaster 5000

# Failover timeout in milliseconds. It limits how long an in-progress failover may run
# before it is abandoned, and a Sentinel waits twice this value before retrying a failover of the same master.
sentinel failover-timeout mymaster 10000

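# Number of replicas that may be reconfigured to sync from the new master at the same time after a failover.
# If set to 1, replicas resync one at a time, so the others stay available while each one catches up.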
sentinel parallel-syncs mymaster 1

# Optional: Define a custom script to run on failover
# sentinel notification-script mymaster /path/to/your/failover-script.sh
# sentinel client-reconfig-script mymaster /path/to/your/client-reconfig-script.sh

You would typically run multiple Sentinel instances (at least 3 for a quorum of 2) on separate Compute Engine VMs for redundancy. These Sentinels communicate with each other to agree on the state of the master and trigger failover.
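
The arithmetic behind "at least 3" can be made concrete. In Sentinel, the quorum only governs failure detection; the failover itself must also be authorized by a majority of all known Sentinels. The sketch below computes that majority and how many Sentinel losses a deployment tolerates. The function names are illustrative.

```shell
#!/bin/sh
# majority N: smallest number of Sentinels that forms a strict majority of N.
majority() {
    echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: how many Sentinels can be lost while a majority remains.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3 5; do
    echo "sentinels=$n majority=$(majority "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

With two Sentinels the majority is also two, so losing either one blocks failover; three is the smallest deployment that survives the loss of a single Sentinel.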

Integrating Sentinel with Google Cloud Load Balancer

The challenge with a self-managed deployment behind a Network Load Balancer (NLB) is that the NLB’s backend service points to a fixed set of instance groups and knows nothing about Redis roles. When Sentinel promotes a replica to master, the NLB’s backend configuration needs to be updated so that traffic reaches the new master. This is where GCP’s API and automation come into play.
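
Any automation first has to learn which node is currently the master. A minimal sketch, assuming the automation polls a Sentinel with `SENTINEL get-master-addr-by-name`, whose reply is the IP and the port on two lines when `redis-cli` output is piped: the helper names and the Sentinel address in the usage comment are hypothetical.

```shell
#!/bin/sh
# parse_master_addr: reads the two-line reply (IP, then port) on stdin
# and prints "ip:port". Fails if either line is missing or empty.
parse_master_addr() {
    local ip port
    read -r ip || [ -n "$ip" ] || return 1
    read -r port || [ -n "$port" ] || return 1
    [ -n "$ip" ] && [ -n "$port" ] || return 1
    echo "$ip:$port"
}

# master_changed PREVIOUS CURRENT: succeeds when CURRENT is set and differs.
master_changed() {
    [ -n "$2" ] && [ "$1" != "$2" ]
}

# Usage (hypothetical Sentinel at 10.10.1.20):
# current=$(redis-cli -h 10.10.1.20 -p 26379 \
#     SENTINEL get-master-addr-by-name mymaster | parse_master_addr)
# master_changed "$previous" "$current" && echo "master moved to $current"
```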

Custom Failover Script (`failover-script.sh`)

The `sentinel client-reconfig-script` (or a custom script triggered by Sentinel) is the key. Sentinel executes this script when a failover changes the master’s address. It receives seven arguments: the master name, the role of the Sentinel running the script, the state, and the old and new master addresses (IP and port each).

#!/bin/bash

# This script is executed by Redis Sentinel on failover (client-reconfig-script).
# Arguments:
# $1: master-name
# $2: role ("leader" or "observer")
# $3: state (currently always "failover")
# $4: from-ip (old master)
# $5: from-port (old master)
# $6: to-ip (new master)
# $7: to-port (new master)

MASTER_NAME="$1"
ROLE="$2"
EVENT="$3"
OLD_MASTER_IP="$4"
OLD_MASTER_PORT="$5"
NEW_MASTER_IP="$6"
NEW_MASTER_PORT="$7"

# GCP Project ID and Load Balancer details
GCP_PROJECT_ID="your-gcp-project-id"
FORWARDING_RULE_NAME="your-redis-nlb-forwarding-rule" # The forwarding rule for your NLB
BACKEND_SERVICE_NAME="your-redis-nlb-backend-service" # The backend service for your NLB

# --- Step 1: Update the Load Balancer Backend ---
# This is the most critical part. We need to update the backend service
# to point to the new master IP. This typically involves:
# 1. Getting the current backend service configuration.
# 2. Removing the old master instance from the backend.
# 3. Adding the new master instance to the backend.

# For a Network Load Balancer, we usually manage instances directly or via instance groups.
# If using instance groups, Sentinel needs to trigger an update to the instance group's
# membership or the load balancer's backend service.

# Example using gcloud to update backend service (assuming instance groups)
# This is a simplified example. A robust solution might involve more complex
# gcloud commands or direct API calls.

echo "Failover event detected for $MASTER_NAME: $EVENT"
echo "Old Master: $OLD_MASTER_IP"
echo "New Master: $NEW_MASTER_IP:$NEW_MASTER_PORT"

# --- Option A: If using Managed Instance Groups (MIGs) ---
# Sentinel needs to trigger an update to the MIG or the LB's backend service.
# This is complex as Sentinel doesn't directly manage MIGs.
# A common approach is to have a separate automation service that listens
# to Sentinel events or polls for master changes and updates the MIG/LB.

# --- Option B: Direct Backend Service Update (less common for NLB with instances) ---
# If your NLB is configured with specific backend IPs, you'd update those.
# For NLB with instance groups, you'd update the instance group.

# Let's assume a scenario where we need to update the backend service's
# named port configuration or instance group association.
# This is highly dependent on your specific LB setup.

# A more practical approach for self-managed Redis with NLB:
# 1. Have a dedicated health check VM that monitors Redis.
# 2. This VM also monitors the NLB's target health.
# 3. When the health check VM detects the primary is down, it triggers a
# script that:
# a. Promotes a replica (if not using Sentinel, which is not recommended).
# b. Updates the NLB's backend service to point to the new primary.
# This involves using `gcloud compute backend-services add-backend` and
# `remove-backend`, or the API.

# Example using gcloud to attach the new primary's instance group to the backend service
# (This is illustrative and requires careful planning of instance groups.
# Passthrough NLB backend services are regional, so use --region rather than --global.)
# gcloud compute backend-services add-backend $BACKEND_SERVICE_NAME \
# --project=$GCP_PROJECT_ID \
# --region= \
# --instance-group= \
# --instance-group-zone=

# A more direct approach for NLB pointing to specific IPs:
# You'd need to update the forwarding rule's target.
# This is generally not how NLBs are used for dynamic backend changes.

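# --- Alternative: Using a DNS-based approach instead of a load balancer ---
# Clients connect to a DNS name (e.g., an A record in a private Cloud DNS zone)
# and the automation updates that record to the new master's IP on failover.
# This works for raw Redis TCP traffic, but clients must re-resolve the name
# and reconnect, so keep the record's TTL short. An HTTP(S) LB is not an
# option here because it cannot carry the Redis protocol.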

# --- Recommended approach for self-managed Redis + NLB ---
# 1. Use Redis Sentinel for failover detection and promotion.
# 2. Have a separate automation service (e.g., a Python script running on a VM,
# or a Cloud Function triggered by Pub/Sub) that:
# a. Subscribes to Sentinel's Pub/Sub channels (e.g., the `+switch-master` event).
# b. OR polls Sentinel for master status.
# c. When a failover is detected, this service uses `gcloud` or the GCP API
# to update the NLB's backend service. This might involve updating
# the backend service to point to a different instance group, or
# dynamically managing backend IPs if not using instance groups.

# For simplicity, let's assume we have a way to update the NLB's backend.
# This script would call an external tool or API.

echo "Initiating NLB backend update..."
# Placeholder for actual GCP API call or gcloud command
# Example: update_gcp_nlb_backend $NEW_MASTER_IP

# --- Step 2: Notify other services (Optional) ---
# You might want to notify your application services about the new Redis master.
# This could be via Pub/Sub, a shared configuration store, or by reconfiguring
# the application instances.

echo "Notifying applications about new Redis master..."
# Example: publish_to_pubsub "redis-failover" "new_master_ip=$NEW_MASTER_IP"

# Exit with 0 to indicate success
exit 0
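
The `update_gcp_nlb_backend` placeholder above could take roughly the following shape. This is a sketch under stated assumptions: each Redis node sits in its own instance group, the backend service is regional, and the mapping from a master IP to its instance group is handled elsewhere. All resource names in the example call are hypothetical. The function only prints the `gcloud` commands so they can be reviewed before anything is executed.

```shell
#!/bin/sh
# print_backend_swap SERVICE REGION OLD_GROUP OLD_ZONE NEW_GROUP NEW_ZONE
# Prints (does not run) the gcloud commands that detach the old primary's
# instance group from a regional backend service and attach the new one.
# Prints nothing when old and new are the same, so repeated calls are harmless.
print_backend_swap() {
    local service="$1" region="$2"
    local old_group="$3" old_zone="$4"
    local new_group="$5" new_zone="$6"

    if [ "$old_group" = "$new_group" ] && [ "$old_zone" = "$new_zone" ]; then
        return 0
    fi
    echo "gcloud compute backend-services remove-backend $service --region=$region --instance-group=$old_group --instance-group-zone=$old_zone"
    echo "gcloud compute backend-services add-backend $service --region=$region --instance-group=$new_group --instance-group-zone=$new_zone"
}

# Hypothetical names: one instance group per Redis node.
print_backend_swap your-redis-nlb-backend-service us-central1 \
    redis-ig-a us-central1-a redis-ig-b us-central1-b
```

Detaching the old group first means a recovered old master, now a read-only replica, never receives write traffic. The cost is a short window with no backend, which falls inside the failover outage anyway.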

Important Considerations for the Script:

Authentication: The script needs credentials to interact with the GCP API (e.g., a service account with appropriate permissions).
Idempotency: The script should be idempotent. Running it multiple times with the same parameters should yield the same result.
Error Handling: Robust error handling and logging are critical. What happens if the GCP API call fails?
Network Load Balancer Specifics: GCP Network Load Balancers typically use backend services that point to instance groups. Updating the NLB means updating the instance group membership or the backend service’s target. A common pattern is to have two instance groups: one for the primary, one for the replica. The automation script would then switch the backend service to point to the instance group containing the new primary.
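
Idempotency and error handling can be combined in a small guard. The sketch below records the last master that was successfully applied in a state file and skips the update when Sentinel invokes the script again for the same master. The function name and the state file location are assumptions for illustration.

```shell
#!/bin/sh
# apply_once STATE_FILE NEW_MASTER COMMAND...
# Runs COMMAND only when NEW_MASTER differs from the value recorded in
# STATE_FILE, and records NEW_MASTER only after COMMAND succeeds.
apply_once() {
    local state_file="$1" new_master="$2"
    shift 2
    if [ -f "$state_file" ] && [ "$(cat "$state_file")" = "$new_master" ]; then
        echo "already applied: $new_master"
        return 0
    fi
    if "$@"; then
        printf '%s\n' "$new_master" > "$state_file"
        echo "applied: $new_master"
    else
        echo "update failed for $new_master; state left unchanged" >&2
        return 1
    fi
}

# Usage (hypothetical path and update command):
# apply_once /var/lib/redis-failover/last_master "$NEW_MASTER_IP" \
#     update_gcp_nlb_backend "$NEW_MASTER_IP"
```

Because the state file is written only after the command succeeds, a failed GCP call leaves the record untouched and the next invocation tries again.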

Managed Redis with Memorystore HA

For most production workloads, leveraging Google Cloud’s managed Memorystore for Redis is the superior approach. Memorystore offers built-in High Availability (HA) configurations that abstract away the complexities of managing Sentinel, Compute Engine instances, and load balancer updates.

Memorystore HA Architecture

Memorystore for Redis HA (the Standard Tier) provisions a primary and a replica node in different zones within the same region. If the primary node becomes unavailable, Memorystore automatically promotes the replica to primary. This failover process is managed by Google Cloud’s infrastructure and requires no configuration change in your application.

Application Connection Strategy

The key to seamless failover with Memorystore HA lies in how your application connects to Redis. Instead of connecting to the IP address of a specific node, your application should connect to the instance’s primary endpoint. This endpoint is a stable IP address that always routes to the current primary node, including after a failover.

<?php
// Example using Predis (PHP Redis client)

require 'vendor/autoload.php';

$memorystore_host = 'your-memorystore-instance-host'; // The instance's primary endpoint IP address
$memorystore_port = 6379;
$password = null; // If you have authentication enabled

try {
    $redis = new Predis\Client([
        'scheme' => 'tcp',
        'host'   => $memorystore_host,
        'port'   => $memorystore_port,
        // 'password' => $password, // Uncomment if password is set
    ]);

    // Perform a simple operation to test connection
    $redis->ping();
    echo "Successfully connected to Memorystore Redis!\n";

    // Set a key
    $redis->set('mykey', 'myvalue');
    echo "Set 'mykey' to: " . $redis->get('mykey') . "\n";

} catch (Exception $e) {
    echo "Could not connect to Redis: " . $e->getMessage() . "\n";
    // Implement retry logic or fallback mechanism here
}
?>

When a failover occurs in Memorystore, the primary endpoint keeps the same address and is redirected to the newly promoted primary. Existing connections are dropped, so your application must reconnect; its next connection attempt reaches the new primary without any configuration change. Many Redis clients can be configured to retry transient connection errors. Make sure yours does so, ideally with exponential backoff, to ride out the brief failover window.
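
The reconnect behaviour can be sketched outside the application as well, for example in a smoke test run from a VM. The helper below retries a command with exponentially growing delays. The function name and the attempt and delay values are illustrative assumptions, and the same pattern would normally live in the client’s own retry configuration.

```shell
#!/bin/sh
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND...
# Retries COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after every failed attempt.
retry_with_backoff() {
    local max_attempts="$1" delay="$2" attempt=1
    shift 2
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# Usage (MEMORYSTORE_HOST is a placeholder for the primary endpoint):
# retry_with_backoff 5 1 redis-cli -h "$MEMORYSTORE_HOST" -p 6379 PING
```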

Monitoring Memorystore Failovers

While Memorystore handles the failover automatically, you still need to monitor its health and performance. Google Cloud’s operations suite (formerly Stackdriver) provides:

Memorystore Metrics: Monitor latency, memory usage, CPU utilization, and connection counts.
Memorystore Events: Subscribe to events related to node status changes, including failovers. These can be sent to Pub/Sub for alerting or automated remediation.
Cloud Monitoring Dashboards: Create custom dashboards to visualize key Memorystore metrics and set up alerting policies for critical thresholds or events.

Disaster Recovery for Shopify Deployments on GCP

Disaster Recovery (DR) for a Shopify deployment on GCP typically involves protecting your custom applications, databases, and any other stateful services that extend or integrate with Shopify. Shopify itself is a SaaS platform, so its availability is managed by Shopify. Your DR strategy focuses on your *own* infrastructure that interacts with Shopify.

Key Components to Protect

Databases: MySQL, PostgreSQL, etc. (e.g., Cloud SQL, or self-managed on Compute Engine).
Application Servers: Compute Engine instances, GKE pods.
Caching Layers: Redis, Memcached (as discussed above).
Message Queues: Pub/Sub, RabbitMQ, Kafka.
Storage: Cloud Storage buckets, Persistent Disks.

DR Strategies for GCP Services

Multi-Region Deployment (Active-Active / Active-Passive)

The most robust DR strategy is to deploy your application and its dependencies across multiple GCP regions. This provides resilience against entire region failures.

Active-Active: Traffic is served from multiple regions simultaneously. This offers the lowest RTO/RPO but is the most complex and expensive. Requires global load balancing (e.g., Cloud Load Balancing with global forwarding rules) and data synchronization strategies.
Active-Passive: A primary region serves all traffic, while a secondary region is on standby, ready to take over. Data is replicated asynchronously or synchronously to the secondary region. This is more cost-effective than active-active but has a higher RTO.

Data Replication and Synchronization

For stateful services, data replication is paramount. The method depends on the service:

Cloud SQL: Use cross-region read replicas. For DR, you can promote a read replica to a standalone primary instance in the event of a primary region failure. Also configure automated backups and keep them in a multi-region backup location.
Compute Engine (Self-Managed Databases): Implement database-native replication (e.g., PostgreSQL streaming replication, MySQL replication) across regions. This is complex and requires careful network configuration and monitoring.
Cloud Storage: Use dual-region or multi-region buckets so that objects are stored redundantly across geographically separate locations.
Memorystore: Memorystore for Redis HA is regional. For multi-region DR, you would typically implement application-level replication, use a third-party solution that supports cross-region replication, or rely on periodic RDB exports to a Cloud Storage bucket in a multi-region location.
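
Two of the steps above map onto single `gcloud` commands. The sketch below only prints them so they can be pasted into a runbook or reviewed first. The function names and all instance, region, and bucket names are hypothetical placeholders.

```shell
#!/bin/sh
# print_promote_replica REPLICA_NAME
# Command that turns a Cloud SQL read replica into a standalone primary.
# --quiet suppresses the interactive confirmation prompt.
print_promote_replica() {
    echo "gcloud sql instances promote-replica $1 --quiet"
}

# print_redis_export INSTANCE REGION BUCKET
# Command that exports a Memorystore instance to an RDB file in Cloud Storage.
print_redis_export() {
    echo "gcloud redis instances export gs://$3/$1.rdb $1 --region=$2"
}

print_promote_replica your-sql-replica
print_redis_export your-redis-instance us-central1 your-dr-bucket
```

Promoting a replica is a one-way operation: the promoted instance stops replicating from the old primary, so the command belongs in the failover path, not in routine drills against production data.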

Automated Failover Orchestration

Automating the failover process is crucial for minimizing RTO (Recovery Time Objective). This often involves a combination of GCP services and custom logic:

Cloud DNS: Use Cloud DNS to manage your application’s DNS records. In a DR scenario, you can automate the update of DNS records to point to the IP addresses of your application instances in the secondary region. This can be triggered by health check failures or manual intervention.
Cloud Load Balancing: Global load balancers can direct traffic to backends in different regions. Health checks are configured per backend service. If a primary region’s backends fail health checks, the load balancer can automatically shift traffic to the secondary region’s backends.
Cloud Build / Cloud Deploy: For application deployments, use CI/CD pipelines to automate the deployment of your application to the secondary region during a DR event.
Custom Automation Scripts: As discussed with Redis, custom scripts (e.g., Python scripts running on Compute Engine, or Cloud Functions) can orchestrate the failover process by interacting with GCP APIs to update load balancers, DNS, promote database replicas, etc.
Disaster Recovery Orchestration Tools: For complex environments, consider dedicated DR orchestration tools that can manage multi-step failover procedures.
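
As one concrete instance of the Cloud DNS option, the sketch below picks the region that should receive traffic and prints the matching `gcloud dns record-sets update` command. The record name, zone name, IP addresses, and TTL are placeholder assumptions; the health verdict would come from your own monitoring.

```shell
#!/bin/sh
# choose_target PRIMARY_HEALTHY PRIMARY_IP SECONDARY_IP
# Prints the IP that should receive traffic ("yes" keeps the primary).
choose_target() {
    if [ "$1" = "yes" ]; then
        echo "$2"
    else
        echo "$3"
    fi
}

# print_dns_update RECORD_NAME ZONE IP TTL
# Prints (does not run) the command that repoints an A record.
print_dns_update() {
    local name="$1" zone="$2" ip="$3" ttl="$4"
    echo "gcloud dns record-sets update $name --zone=$zone --type=A --ttl=$ttl --rrdatas=$ip"
}

# Example: the primary region failed its health checks.
target=$(choose_target no 203.0.113.10 203.0.113.20)
print_dns_update app.example.com. your-zone "$target" 60
```

A short TTL keeps the DNS portion of the RTO small, since resolvers may keep serving the old address until the TTL expires.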

Testing Your DR Plan

A DR plan is only effective if it’s tested regularly. Schedule periodic DR drills where you simulate a failure of your primary region and execute your automated failover procedures. Document the results, identify any gaps, and refine your plan. This includes testing:

The automated failover scripts and their execution time.
The RTO and RPO achieved during the drill.
The integrity of replicated data in the secondary region.
The ability of applications to connect to services in the secondary region.
Communication and notification procedures.
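
Drill results are easier to compare over time when RTO and RPO are computed the same way every run. A minimal sketch, assuming you record three Unix timestamps during the drill; the timestamps and targets below are made-up example values.

```shell
#!/bin/sh
# elapsed_seconds START_EPOCH END_EPOCH
elapsed_seconds() {
    echo $(( $2 - $1 ))
}

# report LABEL MEASURED_SECONDS TARGET_SECONDS
# Prints PASS when the measured value is within the target, FAIL otherwise.
report() {
    local label="$1" measured="$2" target="$3"
    if [ "$measured" -le "$target" ]; then
        echo "$label: ${measured}s (target ${target}s) PASS"
    else
        echo "$label: ${measured}s (target ${target}s) FAIL"
    fi
}

# Example drill timestamps (Unix epoch seconds, illustrative only).
failure_at=1700000000
restored_at=1700000240
last_replicated_write_at=1699999985

# RTO: time from the failure until service was restored.
report RTO "$(elapsed_seconds "$failure_at" "$restored_at")" 300
# RPO: age of the newest data that reached the secondary before the failure.
report RPO "$(elapsed_seconds "$last_replicated_write_at" "$failure_at")" 60
```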