Disaster Recovery 101: Architecting Auto-Failovers for Redis and PHP Deployments on AWS

Leveraging AWS ElastiCache for Redis with Multi-AZ and Read Replicas

For mission-critical applications, a single Redis instance is a single point of failure. AWS ElastiCache for Redis offers robust solutions for high availability and disaster recovery. The core of this strategy lies in its Multi-AZ (Availability Zone) deployment with automatic failover and the judicious use of read replicas.

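When you enable Multi-AZ with automatic failover on a replication group that has at least one replica, ElastiCache places the primary node and its replicas across different Availability Zones within the same AWS region. Replication to the replicas is asynchronous, so a small number of the most recent writes can be lost in a failover. If the primary node becomes unavailable due to an infrastructure failure, ElastiCache promotes a replica to primary and repoints the primary endpoint’s DNS record at it. The endpoint name your application uses does not change, but existing connections are dropped and clients must reconnect, so the application still needs to tolerate a brief interruption.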

Configuring ElastiCache for Multi-AZ and Read Replicas

The configuration is straightforward via the AWS Management Console, AWS CLI, or Infrastructure as Code tools like Terraform or CloudFormation. Here’s a conceptual outline using AWS CLI:

To create a new Redis cluster with Multi-AZ enabled and a single read replica:

aws elasticache create-replication-group \
    --replication-group-id my-redis-cluster \
    --replication-group-description "My production Redis cluster with HA" \
    --engine redis \
    --cache-node-type cache.m5.large \
    --num-node-groups 1 \
    --replicas-per-node-group 1 \
    --automatic-failover-enabled \
    --multi-az-enabled \
    --engine-version 6.x \
    --port 6379 \
    --cache-subnet-group-name my-redis-subnet-group \
    --security-group-ids sg-xxxxxxxxxxxxxxxxx \
    --tags Key=Environment,Value=Production Key=Project,Value=MyApp

Key parameters:

--replication-group-id: A unique identifier for your Redis cluster.
--num-node-groups and --replicas-per-node-group: In Cluster mode, these set the number of shards and the number of replicas per shard. In non-Cluster mode (a single shard), use --num-node-groups 1; --replicas-per-node-group then sets the number of read replicas.
--automatic-failover-enabled and --multi-az-enabled: Both are needed. Automatic failover promotes a replica when the primary fails, and Multi-AZ, which requires automatic failover, keeps replicas in different Availability Zones from the primary.
--cache-subnet-group-name: Specifies the VPC subnets where ElastiCache nodes will be deployed. Ensure these subnets span multiple Availability Zones.
--security-group-ids: Controls network access to your ElastiCache cluster.

To add more read replicas to an existing cluster (non-Cluster mode):

aws elasticache increase-replica-count \
    --replication-group-id my-redis-cluster \
    --new-replica-count 2 \
    --apply-immediately

Note: --new-replica-count is the number of replicas, not counting the primary, so the command above results in three nodes in total. Changing the number of shards in a Cluster mode replication group is a separate operation (modify-replication-group-shard-configuration) and involves rebalancing slots.

PHP Application Integration and Failover Handling

Your PHP application needs to be resilient to Redis failovers. During a failover, open connections to the old primary break or start returning errors, so the client code must treat connection errors as recoverable, discard the broken connection, and reconnect through the primary endpoint to reach the new primary.

Using Predis with Reconnection Handling

The popular predis/predis library is a good choice. It doesn’t have explicit “failover” logic built-in for ElastiCache’s automatic failover, but its connection management can be leveraged.

A common pattern is to connect to the replication group’s primary endpoint. For non-Cluster mode, ElastiCache provides a primary endpoint whose DNS name always resolves to the current primary node (the configuration endpoint is the Cluster mode equivalent). When a failover occurs, this endpoint’s DNS record is updated to point to the new primary.

Here’s a basic PHP example using Predis:

<?php
require 'vendor/autoload.php';

use Predis\Client;
use Predis\Connection\ConnectionException;

// ElastiCache primary endpoint (e.g., my-redis-cluster.xxxxxx.ng.0001.use1.cache.amazonaws.com)
$redisEndpoint = getenv('REDIS_ENDPOINT');
$redisPort = (int) (getenv('REDIS_PORT') ?: 6379);

// In Predis, timeouts are connection parameters, not client options.
// No 'cluster' or 'replication' option is needed for a single primary endpoint.
$options = [
    'parameters' => [
        'scheme' => 'tcp',
        'host'   => $redisEndpoint,
        'port'   => $redisPort,
        'timeout' => 2.5,            // Connection timeout in seconds; short for faster detection
        'read_write_timeout' => 2.5, // Timeout for reads and writes on the socket
        // Add authentication if using Redis AUTH
        // 'password' => getenv('REDIS_PASSWORD'),
    ],
];

$redis = null;

function getRedisClient(array $options, &$redis): ?Client {
    if ($redis === null) {
        try {
            // The primary endpoint's DNS name resolves to the current primary,
            // so a fresh connection after a failover reaches the new primary.
            $redis = new Client($options['parameters']);
            $redis->connect(); // Predis connects lazily; connect now to surface errors early
            error_log("Successfully connected to Redis.");
        } catch (ConnectionException $e) {
            error_log("Failed to connect to Redis: " . $e->getMessage());
            $redis = null; // Do not keep a client that never connected
            return null;
        }
    }
    return $redis;
}

// --- Usage Example ---
$redis = getRedisClient($options, $redis);

if ($redis) {
    try {
        // Set a value
        $redis->set('mykey', 'myvalue', 'EX', 60); // Set with expiration
        echo "Set 'mykey' successfully.\n";

        // Get a value
        $value = $redis->get('mykey');
        echo "Got 'mykey': " . $value . "\n";

        // Example of a command that might fail during failover
        // If a failover happens, the next command might throw a ConnectionException
        // The getRedisClient function (or a wrapper around it) would then be called again
        // to re-establish the connection to the new primary.

    } catch (ConnectionException $e) {
        error_log("Redis operation failed: " . $e->getMessage());
        // Reset the client to force reconnection on next attempt
        $redis = null;
        // Implement retry logic here if needed
    } catch (\Exception $e) {
        error_log("An unexpected error occurred: " . $e->getMessage());
    }
} else {
    echo "Could not establish Redis connection.\n";
}
?>

In this example:

We pass the ElastiCache primary endpoint to Predis as ordinary connection parameters. No replication or cluster option is needed, because the endpoint itself always resolves to the current primary.
The timeout and read_write_timeout parameters are set relatively low to detect connection issues faster.
The getRedisClient function acts as a simple factory and connection manager. If a ConnectionException occurs during a Redis operation, we set $redis = null;. The next time getRedisClient is called, it will attempt to establish a new connection. This is a basic form of automatic reconnection.
For production, you’d want more sophisticated retry logic with exponential backoff.
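The retry logic mentioned above could be sketched roughly as follows. This is a minimal illustration, not part of Predis: retryWithBackoff, its parameters, and the 50% jitter are all assumptions made for the example. It catches \Exception so that it runs without Predis installed; in real use you would probably narrow that to Predis\Connection\ConnectionException.

```php
<?php
declare(strict_types=1);

/**
 * Run $operation, retrying on failure with exponential backoff.
 * Hypothetical helper: $sleep is injectable so the delay can be stubbed in tests.
 */
function retryWithBackoff(
    callable $operation,
    int $maxAttempts = 4,
    int $baseDelayMs = 100,
    ?callable $sleep = null
) {
    $sleep = $sleep ?? static function (int $ms): void {
        usleep($ms * 1000);
    };

    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation($attempt);
        } catch (\Exception $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // Out of attempts: surface the last error
            }
            // Base delay doubles each attempt (100, 200, 400 ms by default),
            // plus up to 50% random jitter so clients do not retry in lockstep.
            $delayMs = $baseDelayMs * (2 ** ($attempt - 1));
            $sleep($delayMs + random_int(0, intdiv($delayMs, 2)));
        }
    }
}
```

Inside the callback you would obtain a client, run the Redis command, and let connection errors propagate, resetting the shared client before the next attempt. Keep the total retry budget shorter than your request timeout, since a failover can take longer than a few retries will cover.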

Handling Redis Cluster Mode

If you are using Redis Cluster mode (multiple shards), the configuration for Predis changes slightly. You pass a list of one or more seed nodes, typically just the ElastiCache configuration endpoint, and set the cluster option to 'redis'. The configuration endpoint resolves to one of the nodes, and Predis learns the rest of the topology from the -MOVED and -ASK redirections it receives.

$options = [
    'cluster' => 'redis', // 'redis' selects Redis Cluster; 'predis' is client-side sharding
    'parameters' => [
        // Defaults applied to every node connection Predis opens
        'timeout' => 2.5,
        'read_write_timeout' => 2.5,
        // 'password' => getenv('REDIS_PASSWORD'),
    ],
    // ... other options ...
];

// With 'cluster' => 'redis', the first argument is a list of seed nodes.
// The configuration endpoint is enough; Predis follows redirections to reach the other nodes.
$redis = new Client(["tcp://{$redisEndpoint}:{$redisPort}"], $options);
$redis->connect();

In Redis Cluster mode, ElastiCache’s Multi-AZ applies to each shard’s primary node. If a primary node within a shard fails, its standby replica is promoted. Predis’s cluster awareness should handle routing requests to the correct shard, and if a node within a shard becomes unavailable, it will eventually discover the new primary for that shard.
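To make the routing step concrete, here is a small sketch of how a key maps to a shard in Redis Cluster: the key (or the part inside the first non-empty {...} hash tag) is hashed with CRC16 and taken modulo 16384 to get a slot, and each shard owns a range of slots. The function names crc16Xmodem and redisClusterSlot are made up for this example. Predis has its own implementation, so this is only for understanding and debugging.

```php
<?php
declare(strict_types=1);

/** CRC16 (XMODEM variant: polynomial 0x1021, initial value 0), as used by Redis Cluster. */
function crc16Xmodem(string $data): int
{
    $crc = 0;
    $len = strlen($data);
    for ($i = 0; $i < $len; $i++) {
        $crc ^= ord($data[$i]) << 8;
        for ($bit = 0; $bit < 8; $bit++) {
            $crc = ($crc & 0x8000) ? (($crc << 1) ^ 0x1021) : ($crc << 1);
            $crc &= 0xFFFF; // Keep the register at 16 bits
        }
    }
    return $crc;
}

/** Hash slot (0..16383) for a key, honouring {hash tags}. */
function redisClusterSlot(string $key): int
{
    $open = strpos($key, '{');
    if ($open !== false) {
        $close = strpos($key, '}', $open + 1);
        // Only a non-empty tag counts; "{}" means the whole key is hashed.
        if ($close !== false && $close > $open + 1) {
            $key = substr($key, $open + 1, $close - $open - 1);
        }
    }
    return crc16Xmodem($key) % 16384;
}

echo redisClusterSlot('foo'), "\n"; // 12182
```

Keys that share a hash tag, such as {user1000}.following and {user1000}.followers, land in the same slot, which is what allows multi-key commands on them in Cluster mode.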

AWS Lambda and Serverless Considerations

For serverless applications using AWS Lambda, managing Redis connections requires a different approach due to Lambda’s ephemeral nature and execution environment. Re-establishing connections on every invocation can be inefficient and lead to timeouts.

Connection Reuse in Lambda

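The key is to reuse connections across Lambda invocations by initializing the Redis client outside the main handler function. This way, the client object persists in the Lambda execution environment’s memory between invocations. PHP is not a native Lambda runtime, so this assumes a custom runtime or layer (for example, Bref) that keeps the PHP process alive between invocations.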

<?php
require 'vendor/autoload.php';

use Predis\Client;
use Predis\Connection\ConnectionException;

// Initialize Redis client outside the handler to reuse connections
$redisClient = null;

function getRedisClientInstance(): ?Client {
    global $redisClient;

    if ($redisClient === null) {
        $redisEndpoint = getenv('REDIS_ENDPOINT');
        $redisPort = getenv('REDIS_PORT') ?: 6379;

        // Timeouts are Predis connection parameters; no 'cluster' or
        // 'replication' option is needed for a single primary endpoint.
        $parameters = [
            'scheme' => 'tcp',
            'host'   => $redisEndpoint,
            'port'   => (int) $redisPort,
            'timeout' => 1.5,            // Very short connection timeout for Lambda
            'read_write_timeout' => 1.5,
            // 'password' => getenv('REDIS_PASSWORD'),
        ];

        try {
            $redisClient = new Client($parameters);
            $redisClient->connect();
            error_log("Redis client initialized and connected.");
        } catch (ConnectionException $e) {
            error_log("Failed to initialize Redis client: " . $e->getMessage());
            $redisClient = null; // Do not keep a client that never connected
            return null;
        }
    }
    return $redisClient;
}

// --- Lambda Handler Example ---
$handler = function (array $event, $context) {
    $redis = getRedisClientInstance();

    if (!$redis) {
        return [
            'statusCode' => 500,
            'body' => json_encode(['message' => 'Failed to connect to Redis']),
        ];
    }

    try {
        // Example: Increment a counter
        $counterKey = 'lambda_invocations';
        $currentCount = $redis->incr($counterKey);

        // Example: Set a session value
        $sessionId = $event['sessionId'] ?? 'default_session';
        $redis->set("session:$sessionId", json_encode(['user' => 'testuser', 'timestamp' => time()]), 'EX', 3600); // Expires in 1 hour

        return [
            'statusCode' => 200,
            'body' => json_encode([
                'message' => 'Redis operations successful',
                'current_invocation_count' => $currentCount,
                'session_key' => "session:$sessionId"
            ]),
        ];

    } catch (ConnectionException $e) {
        error_log("Redis operation failed during Lambda execution: " . $e->getMessage());
        // If a connection error occurs, reset the client.
        // The next invocation will attempt to re-initialize.
        global $redisClient;
        $redisClient = null;
        return [
            'statusCode' => 503, // Service Unavailable
            'body' => json_encode(['message' => 'Redis temporarily unavailable']),
        ];
    } catch (\Exception $e) {
        error_log("An unexpected error occurred in Lambda: " . $e->getMessage());
        return [
            'statusCode' => 500,
            'body' => json_encode(['message' => 'Internal server error']),
        ];
    }
};

// To test locally, you might simulate the handler call:
// $event = ['sessionId' => 'abc123'];
// $context = null;
// echo json_encode($handler($event, $context));
?>

Crucially, when a ConnectionException is caught within the Lambda handler, we set the global $redisClient to null. This ensures that the *next* Lambda invocation will attempt to create a fresh connection. ElastiCache’s DNS updates for failover will then be picked up by the new connection attempt.
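One refinement worth considering: a warm execution environment can sit frozen for minutes, so the socket held by the reused client may already be dead when the next invocation starts. The sketch below checks liveness before reuse and rebuilds the client if the check fails. ReusableClient, $factory, and $isAlive are hypothetical names for this example; with Predis, $factory would construct the Client and $isAlive would likely issue a PING, which is an assumption you should verify against your Predis version.

```php
<?php
declare(strict_types=1);

/**
 * Holds one client per execution environment and replaces it when a
 * health check fails. Hypothetical helper: the caller supplies both
 * the factory and the liveness probe.
 */
final class ReusableClient
{
    private $client = null;
    private $factory;
    private $isAlive;

    public function __construct(callable $factory, callable $isAlive)
    {
        $this->factory = $factory;
        $this->isAlive = $isAlive;
    }

    public function get()
    {
        if ($this->client !== null && !$this->check($this->client)) {
            $this->client = null; // Stale connection: discard and rebuild
        }
        if ($this->client === null) {
            $this->client = ($this->factory)();
        }
        return $this->client;
    }

    private function check($client): bool
    {
        try {
            return (bool) ($this->isAlive)($client);
        } catch (\Exception $e) {
            return false; // A probe that throws counts as a dead connection
        }
    }
}
```

The trade-off is one extra round trip per invocation in exchange for failing over within the same invocation instead of the next one.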

Monitoring and Alerting for Failovers

Automated failover is only part of the story. You need to know when it happens and if it’s successful. AWS provides CloudWatch metrics and events for ElastiCache that are essential for monitoring.

Key CloudWatch Metrics

EngineCPUUtilization: Monitor the CPU load on your Redis nodes. High utilization can precede issues.
CacheHits and CacheMisses: Track cache performance.
CurrConnections: Number of active connections. Sudden drops might indicate issues.
ReplicationLag: For read replicas, this indicates how far behind they are from the primary. High lag is a concern.
IsMaster: Reports 1 when a node is the primary and 0 when it is a replica. A change in this value on a node is a direct signal that a failover has moved the primary role.

ElastiCache Events and CloudWatch Events

ElastiCache publishes events to CloudWatch Events (now EventBridge). These events are invaluable for triggering automated responses or notifications when significant changes occur, such as a failover.

A rule pattern for a failover might look like the following. Treat the detail-type and detail field names as illustrative, and confirm them against an event captured from your own account before relying on the rule:

{
  "source": ["aws.elasticache"],
  "detail-type": ["ElastiCache:FailoverComplete"],
  "detail": {
    "ReplicationGroupId": ["my-redis-cluster"]
  }
}

You can configure EventBridge rules to:

Send notifications to an SNS topic (which can then alert via email, Slack, PagerDuty, etc.).
Trigger an AWS Lambda function for automated remediation or logging.
Trigger an AWS Step Functions workflow for more complex recovery procedures.

Setting up an EventBridge rule to capture ElastiCache:FailoverComplete events and send them to an SNS topic is a fundamental step for operational awareness.
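If the rule targets a Lambda function, the handler can apply the same filter in code before acting. The sketch below is a hypothetical helper (isFailoverCompleteEvent is not an AWS SDK function), and the field names it checks simply mirror the rule pattern in this article, so they are assumptions rather than a documented schema.

```php
<?php
declare(strict_types=1);

/**
 * Return true when $event looks like a failover-complete notification
 * for the given replication group. Field names are assumptions taken
 * from the example rule pattern, not a documented event schema.
 */
function isFailoverCompleteEvent(array $event, string $replicationGroupId): bool
{
    return ($event['source'] ?? null) === 'aws.elasticache'
        && ($event['detail-type'] ?? null) === 'ElastiCache:FailoverComplete'
        && ($event['detail']['ReplicationGroupId'] ?? null) === $replicationGroupId;
}

// Example: an event shaped like the pattern above.
$event = [
    'source' => 'aws.elasticache',
    'detail-type' => 'ElastiCache:FailoverComplete',
    'detail' => ['ReplicationGroupId' => 'my-redis-cluster'],
];
var_dump(isFailoverCompleteEvent($event, 'my-redis-cluster')); // bool(true)
```

Checking in code as well as in the rule keeps the handler safe if the rule is later broadened to cover more event types.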

Advanced Strategies: Global Datastores and Cross-Region Failover

For true disaster recovery across geographical regions, ElastiCache Global Datastore for Redis offers a solution. This feature allows you to have a primary cluster in one region and one or more secondary clusters in different regions. Writes to the primary cluster are replicated asynchronously to the secondary clusters.

Global Datastore Configuration

Setting up a Global Datastore involves three steps: creating a primary replication group, creating a global replication group from it, and then adding secondary replication groups in other regions. The replication between regions is managed by ElastiCache.

# 1. Create the primary replication group (e.g., in us-east-1)
aws elasticache create-replication-group \
    --replication-group-id my-redis-primary \
    --replication-group-description "Primary for Global Datastore" \
    --engine redis \
    --cache-node-type cache.m5.large \
    --num-node-groups 1 \
    --replicas-per-node-group 1 \
    --automatic-failover-enabled \
    --multi-az-enabled \
    --region us-east-1 \
    --cache-subnet-group-name my-redis-subnet-group-use1 \
    --security-group-ids sg-xxxxxxxxxxxxxxxxx

# 2. Create the Global Datastore from the primary.
# ElastiCache prepends a generated prefix to the suffix to form the global replication group ID.
aws elasticache create-global-replication-group \
    --global-replication-group-id-suffix my-global-redis-cluster \
    --primary-replication-group-id my-redis-primary \
    --region us-east-1

# 3. Create the secondary replication group (e.g., in eu-west-1).
# Engine, node type, and shard count are inherited from the primary, so they are not passed here.
aws elasticache create-replication-group \
    --replication-group-id my-redis-secondary \
    --replication-group-description "Secondary for Global Datastore" \
    --replicas-per-node-group 1 \
    --multi-az-enabled \
    --region eu-west-1 \
    --cache-subnet-group-name my-redis-subnet-group-euw1 \
    --security-group-ids sg-yyyyyyyyyyyyyyyyy \
    --global-replication-group-id xxxxx-my-global-redis-cluster # Replace with the ID returned in step 2

# In a regional outage, you would promote the secondary to primary.
# This is a manual or semi-automated step:
# aws elasticache failover-global-replication-group \
#     --global-replication-group-id xxxxx-my-global-redis-cluster \
#     --primary-region eu-west-1 \
#     --primary-replication-group-id my-redis-secondary

Application-side handling for Global Datastores typically involves:

Directing read traffic to the local (regional) read replicas for lower latency.
Implementing a strategy to switch write traffic to the secondary cluster if the primary region becomes unavailable. This often involves DNS-based failover (e.g., Route 53) or application-level logic that monitors the primary region’s health.
When a cross-region failover occurs, your application needs to connect to the new primary’s endpoint in the secondary region.

The failover process for Global Datastores is not fully automatic in the same way as Multi-AZ. Promoting a secondary cluster to primary is a deliberate action, often triggered by monitoring and alerting systems detecting a regional outage. Your application’s DNS or configuration must then be updated to point to the new primary.
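As one possible shape for the application-level logic, the sketch below picks a write endpoint from a priority-ordered list using a caller-supplied health probe. selectWriteEndpoint and the endpoint names are hypothetical. The probe itself is the hard part and is left to you: a secondary cluster is read-only until it has been promoted, so the probe needs to confirm that the endpoint accepts writes, not merely that it answers.

```php
<?php
declare(strict_types=1);

/**
 * Pick the first endpoint, in priority order, that passes the probe.
 * Returns null when no endpoint is usable. Hypothetical helper:
 * $isWritable is supplied by the caller and should verify that the
 * endpoint currently accepts writes.
 */
function selectWriteEndpoint(array $endpointsByPriority, callable $isWritable): ?string
{
    foreach ($endpointsByPriority as $endpoint) {
        if ($isWritable($endpoint)) {
            return $endpoint;
        }
    }
    return null;
}

// Example with a stubbed probe: the us-east-1 primary is down,
// and the eu-west-1 cluster has been promoted.
$endpoints = ['primary.use1.example.internal', 'secondary.euw1.example.internal'];
$promoted = ['secondary.euw1.example.internal' => true];

echo selectWriteEndpoint($endpoints, function (string $host) use ($promoted): bool {
    return $promoted[$host] ?? false;
}), "\n"; // secondary.euw1.example.internal
```

Cache the result for a short period instead of probing on every request, and prefer a single source of truth such as a DNS record or configuration value when several application instances must agree on the current primary.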

Conclusion

Architecting for Redis disaster recovery on AWS involves a layered approach. ElastiCache Multi-AZ provides automatic failover within a region. For cross-region resilience, Global Datastore, combined with robust application-level failover logic and DNS management, is essential. Continuous monitoring and well-defined alerting mechanisms are paramount to ensure that failovers are detected and handled effectively, minimizing the impact of outages on your applications.