Upgrading a multi-node Redis replication cluster on RHEL 9: Pre-flight failover validation runbooks

Pre-Upgrade Validation: Redis Replication Cluster Failover Readiness

Before embarking on a Redis replication cluster upgrade on RHEL 9, a critical prerequisite is validating the failover mechanism. A robust failover strategy ensures minimal downtime and data loss during maintenance. This document outlines a series of pre-flight checks and runbooks designed to confirm the health and readiness of your Redis Sentinel-managed replication setup.

Assessing Sentinel Health and Consensus

Redis Sentinel is the backbone of high availability. Its ability to elect a new master and reconfigure replicas is paramount. We’ll start by verifying Sentinel’s operational status and its ability to reach consensus among its nodes.

1. Verifying Sentinel Process Status

Ensure all Sentinel processes are running on their respective nodes. This is a fundamental check that can be performed via SSH on each Sentinel host.

On each Sentinel node:

sudo systemctl status redis-sentinel

Look for output indicating the service is active and running. If not, investigate logs (typically in /var/log/redis/sentinel.log or via journalctl -u redis-sentinel) for startup errors.
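
To avoid logging in to each host by hand, the same check can be looped over the Sentinel hosts. The following is a minimal sketch: the host names, the SENTINEL_HOSTS and REMOTE variables, and the function name are all assumptions made for illustration, and it presumes key-based SSH access to each node.

```shell
#!/bin/bash
# Hypothetical host list; replace with your own Sentinel hosts.
SENTINEL_HOSTS=${SENTINEL_HOSTS:-"sentinel1 sentinel2 sentinel3"}
# Command used to reach each host (assumed to be ssh; overridable).
REMOTE=${REMOTE:-ssh}

# Report whether redis-sentinel is active on every host.
# Returns non-zero if any host reports the service as not active.
check_sentinel_services() {
    local failed=0 host
    for host in $SENTINEL_HOSTS; do
        if $REMOTE "$host" systemctl is-active --quiet redis-sentinel; then
            echo "$host: active"
        else
            echo "$host: NOT active"
            failed=1
        fi
    done
    return "$failed"
}
```

Calling check_sentinel_services prints one line per host, and its exit status can gate the rest of the runbook.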

2. Checking Sentinel Cluster Quorum

Sentinel relies on two thresholds. The quorum configured for a master is the number of Sentinels that must agree the master is unreachable before it is marked objectively down. A majority of all known Sentinels must then authorize one Sentinel to carry out the failover. We can query a Sentinel instance to understand its view of the cluster.

Connect to any Sentinel instance using redis-cli and execute the SENTINEL masters command. This will show the current master, its status, and the number of Sentinels monitoring it.

redis-cli -p 26379 SENTINEL masters

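Examine the output for the num-other-sentinels, num-slaves, and quorum fields. num-other-sentinels should equal the total number of Sentinels minus one (the instance you are querying), and the value should be the same on every Sentinel. With N Sentinels, at least floor(N/2) + 1 must be reachable to authorize a failover, and at least quorum of them must agree that the master is down. Running SENTINEL ckquorum <master-name> checks both conditions from the queried Sentinel's point of view.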
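
The majority arithmetic can be sketched in a few lines of shell. This assumes the output of SENTINEL master <master-name> is piped in non-interactive mode, where redis-cli prints each field name and its value on alternating lines. The function names are hypothetical and not part of Redis.

```shell
#!/bin/bash
# Hypothetical helpers; names are illustrative only.

# Print the value that follows a field name in redis-cli's raw output.
# Field names sit on odd lines, their values on the following even lines.
sentinel_field() {
    awk -v f="$1" 'NR % 2 == 1 && $0 == f { getline; print; exit }'
}

# Smallest number of Sentinels that forms a majority of N.
majority_of() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds when this Sentinel plus the peers it knows about are enough
# to form a majority of the expected deployment size.
# Usage: redis-cli -p 26379 SENTINEL master mymaster | can_reach_majority 3
can_reach_majority() {
    local expected_total others
    expected_total=$1
    others=$(sentinel_field num-other-sentinels)
    [ $(( ${others:-0} + 1 )) -ge "$(majority_of "$expected_total")" ]
}
```

Note that num-other-sentinels counts the peers a Sentinel knows about, not the ones it can currently reach, so this complements SENTINEL ckquorum rather than replacing it.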

Additionally, run SENTINEL sentinels <master-name> to see the other Sentinels known to this instance and their status.

redis-cli -p 26379 SENTINEL sentinels mymaster

Verify that every Sentinel node is listed and that its flags field contains only sentinel. Additional flags such as s_down or disconnected mean the instance you queried cannot currently reach that peer.
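
Scanning the flags field by eye gets tedious with more than a few Sentinels. The sketch below filters the raw SENTINEL sentinels output for peers whose flags show a problem. The function name is hypothetical, and it assumes the non-interactive redis-cli format of alternating field-name and value lines.

```shell
#!/bin/bash
# Print "ip:port flags" for every peer Sentinel whose flags show a problem.
# Reads the raw output of `SENTINEL sentinels <master-name>` on stdin.
unhealthy_sentinels() {
    awk '
        NR % 2 == 1    { key = $0; next }   # odd lines are field names
        key == "ip"    { ip = $0 }
        key == "port"  { port = $0 }
        key == "flags" && /s_down|o_down|disconnected/ { print ip ":" port, $0 }
    '
}

# Usage:
#   redis-cli -p 26379 SENTINEL sentinels mymaster | unhealthy_sentinels
```

An empty result means every peer known to the queried Sentinel is currently reachable from it.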

Simulating Failover Scenarios

The most effective validation is to simulate a master failure and observe the Sentinel-led failover process. This should be done during a low-traffic period or a scheduled maintenance window.

3. Manual Master Shutdown (Simulated Failure)

This is the core validation step. We will gracefully shut down the current Redis master node. Sentinel should detect this and initiate a failover.

First, identify the current master. You can do this by connecting to any replica and reading the master_host and master_port fields from:

redis-cli -p 6379 INFO replication

Or, from a Sentinel:

redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster

Once the master IP and port are identified, SSH into the master node and shut down the Redis process.

sudo systemctl stop redis

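Observe the Sentinel logs on all Sentinel nodes. Once down-after-milliseconds has elapsed, you should see the master marked +sdown, then +odown once the quorum agrees, followed by +try-failover, +elected-leader, +selected-slave, +promoted-slave, and finally +switch-master, which reports the old and new master addresses.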
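
Besides watching the logs, you can poll a Sentinel until it reports a different master address. This is a rough sketch: the function name and the REDIS_CLI, SENTINEL_PORT, and MASTER_NAME variables are assumptions for illustration, with defaults matching the commands used in this runbook.

```shell
#!/bin/bash
# Illustrative, overridable defaults.
REDIS_CLI=${REDIS_CLI:-redis-cli}
SENTINEL_PORT=${SENTINEL_PORT:-26379}
MASTER_NAME=${MASTER_NAME:-mymaster}

# Poll Sentinel once per second until the reported master differs from the old one.
# Prints the new "ip:port" on success; returns 1 if the timeout expires first.
# Usage: wait_for_new_master <old-ip> <old-port> [timeout-seconds]
wait_for_new_master() {
    local old timeout elapsed current
    old="$1:$2"
    timeout=${3:-60}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # get-master-addr-by-name prints the IP and port on two lines; join them.
        current=$($REDIS_CLI -p "$SENTINEL_PORT" SENTINEL get-master-addr-by-name "$MASTER_NAME" | paste -sd: -)
        if [ -n "$current" ] && [ "$current" != "$old" ]; then
            echo "$current"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

A timeout somewhat longer than down-after-milliseconds plus failover-timeout is a reasonable starting point, since the address only changes after the promotion completes.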

4. Verifying New Master Election and Replica Promotion

After initiating the master shutdown, monitor the cluster state. The goal is to confirm that a new master has been elected and that the former replicas have successfully reconfigured themselves as replicas of the new master.

Connect to any Sentinel and query the master status again:

redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster

This command should now return the IP address and port of the newly elected master. Note this down.

Next, check the status of the replicas. Connect to one of the former replicas (which should now be a replica of the new master) and run:

redis-cli -p [replica_port] INFO replication

Verify that the master_host and master_port fields point to the new master identified in the previous step. Also, check the master_link_status, which should be up.
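
The same verification can be scripted by parsing the INFO replication output. The sketch below uses hypothetical function names, and the example address in the usage comment is a placeholder. INFO lines end in CRLF, so the carriage return is stripped before comparing.

```shell
#!/bin/bash
# Print the value of one field from `INFO replication` output read on stdin.
info_field() {
    tr -d '\r' | awk -v f="$1" 'index($0, f ":") == 1 { print substr($0, length(f) + 2); exit }'
}

# Succeeds when the replica reports the expected master and a healthy link.
# Usage:
#   redis-cli -p [replica_port] INFO replication | replica_follows 10.0.0.2 6379
replica_follows() {
    local info
    info=$(cat)
    [ "$(printf '%s\n' "$info" | info_field role)" = "slave" ] &&
    [ "$(printf '%s\n' "$info" | info_field master_host)" = "$1" ] &&
    [ "$(printf '%s\n' "$info" | info_field master_port)" = "$2" ] &&
    [ "$(printf '%s\n' "$info" | info_field master_link_status)" = "up" ]
}
```

Running it against each former replica gives a pass or fail per node instead of output that has to be read manually.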

You can also query the Sentinel for the list of replicas associated with the master (SENTINEL replicas is the equivalent name for this command since Redis 5.0):

redis-cli -p 26379 SENTINEL slaves mymaster

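Ensure all expected replicas are listed, each with flags of slave and a master-link-status of ok. The stopped original master also appears in this list, flagged s_down, until it is restarted.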

5. Testing Data Consistency Post-Failover

A successful failover is meaningless if data is lost or corrupted. Perform a quick data consistency check.

Before shutting down the original master in step 3, set a unique key-value pair on it and record the exact value that was written. For example:

redis-cli -p [original_master_port] SET validation_key "pre_failover_$(date +%s)"

After the failover is complete and the new master is established, connect to the new master and attempt to retrieve this key:

redis-cli -p [new_master_port] GET validation_key

If the key is retrieved with the value you recorded, that write had replicated before the failover. Because Redis replication is asynchronous, this does not prove that every write acknowledged just before the shutdown survived, so repeat the check for a few critical keys if necessary.
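
Since the value contains a timestamp, it has to be saved when it is written so it can be compared later. One way to do that is sketched below. The VALIDATION_FILE path and both function names are assumptions chosen for illustration.

```shell
#!/bin/bash
# Hypothetical location for the saved value; adjust as needed.
VALIDATION_FILE=${VALIDATION_FILE:-/tmp/redis_validation_value}

# Generate a unique value, save it for later, and print it for use with SET.
new_validation_value() {
    local value
    value="pre_failover_$(date +%s)"
    printf '%s\n' "$value" > "$VALIDATION_FILE"
    printf '%s\n' "$value"
}

# Succeeds when the value read back from the new master equals the saved one.
validation_value_matches() {
    [ -f "$VALIDATION_FILE" ] && [ "$1" = "$(cat "$VALIDATION_FILE")" ]
}

# Before the failover (placeholders as in the commands above):
#   redis-cli -p [original_master_port] SET validation_key "$(new_validation_value)"
# After the failover:
#   validation_value_matches "$(redis-cli -p [new_master_port] GET validation_key)" && echo "match"
```

Keeping the expected value in a file also lets the before and after steps run from separate shell sessions.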

Restoring the Original Master and Rejoining the Cluster

Once the failover is validated, it’s crucial to bring the original master back online and ensure it rejoins the cluster correctly as a replica.

6. Restarting the Original Master Node

On the node that was the original master, restart the Redis service:

sudo systemctl start redis

When the original master restarts, it initially comes up with its previous role. Sentinel detects it and reconfigures it as a replica of the current master, which is logged as +convert-to-slave in the Sentinel logs. Check the Redis logs on this node for messages indicating it is connecting to the new master and performing a partial or full resynchronization.

7. Verifying Rejoined Replica Status

After the original master has restarted and potentially resynced, verify its status within the cluster.

Connect to any Sentinel and check the list of slaves for the master:

redis-cli -p 26379 SENTINEL slaves mymaster

The original master node should now appear in this list with flags of slave, a role-reported of slave, and a master-link-status of ok. If it still reports itself as a master or is flagged s_down or disconnected, investigate its logs and Sentinel configuration.
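
To pick the rejoined node out of a long replica list, the raw SENTINEL slaves output can be filtered by IP address. The function name below is hypothetical, the address in the usage comment is a placeholder, and the sketch assumes the alternating field-name and value lines that redis-cli prints in non-interactive mode.

```shell
#!/bin/bash
# Print "<flags> <master-link-status>" for the replica with the given IP.
# Reads the raw output of `SENTINEL slaves <master-name>` on stdin.
replica_state() {
    awk -v want="$1" '
        NR % 2 == 1    { key = $0; next }   # odd lines are field names
        key == "ip"    { ip = $0 }
        key == "flags" { flags = $0 }
        key == "master-link-status" && ip == want { print flags, $0; exit }
    '
}

# Usage:
#   redis-cli -p 26379 SENTINEL slaves mymaster | replica_state 10.0.0.1
```

A healthy rejoined node prints slave ok. An empty result means Sentinel does not list that address as a replica at all.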

Automating Validation Checks

For production environments, these manual checks should be automated. A simple Bash script can orchestrate these commands and report on success or failure. Consider integrating these checks into your CI/CD pipeline or monitoring system.

#!/bin/bash

REDIS_PORT="6379"
SENTINEL_PORT="26379"
MASTER_NAME="mymaster"

echo "--- Checking Sentinel Health ---"
# Basic check for Sentinel process (run on each sentinel node)
# sudo systemctl status redis-sentinel

echo "--- Checking Sentinel Quorum ---"
# Connect to one sentinel and query the monitored master.
# redis-cli prints each field name and its value on alternating lines.
MASTER_INFO=$(redis-cli -p "$SENTINEL_PORT" SENTINEL master "$MASTER_NAME")
echo "$MASTER_INFO"
NUM_OTHER_SENTINELS=$(echo "$MASTER_INFO" | awk '$0 == "num-other-sentinels" { getline; print; exit }')
QUORUM=$(echo "$MASTER_INFO" | awk '$0 == "quorum" { getline; print; exit }')
# Count this Sentinel in addition to the peers it knows about.
TOTAL_SENTINELS=$(( ${NUM_OTHER_SENTINELS:-0} + 1 ))

if [ "$TOTAL_SENTINELS" -lt "${QUORUM:-1}" ]; then
    echo "WARNING: Sentinel quorum may not be met. Expected at least $QUORUM, found $TOTAL_SENTINELS."
else
    echo "Sentinel quorum appears healthy."
fi

echo "--- Recording Current Master and Writing Validation Key ---"
OLD_MASTER_ADDR=$(redis-cli -p "$SENTINEL_PORT" SENTINEL get-master-addr-by-name "$MASTER_NAME")
OLD_MASTER_IP=$(echo "$OLD_MASTER_ADDR" | head -n 1)
OLD_MASTER_PORT=$(echo "$OLD_MASTER_ADDR" | tail -n 1)
# Store the value now so it can be compared after the failover.
EXPECTED_VALUE="pre_failover_$(date +%s)"
redis-cli -h "$OLD_MASTER_IP" -p "$OLD_MASTER_PORT" SET validation_key "$EXPECTED_VALUE"

echo "--- Simulating Failover (Manual Step Required) ---"
echo "Please manually shut down the current master node ($OLD_MASTER_IP:$OLD_MASTER_PORT): sudo systemctl stop redis"
echo "Wait for failover to complete (monitor Sentinel logs)."
read -r -p "Press Enter when failover is complete..."

echo "--- Verifying New Master ---"
NEW_MASTER_ADDR=$(redis-cli -p "$SENTINEL_PORT" SENTINEL get-master-addr-by-name "$MASTER_NAME")
NEW_MASTER_IP=$(echo "$NEW_MASTER_ADDR" | head -n 1)
NEW_MASTER_PORT=$(echo "$NEW_MASTER_ADDR" | tail -n 1)

if [ -z "$NEW_MASTER_IP" ]; then
    echo "ERROR: Could not determine new master address. Failover may have failed."
    exit 1
fi
if [ "$NEW_MASTER_IP:$NEW_MASTER_PORT" == "$OLD_MASTER_IP:$OLD_MASTER_PORT" ]; then
    echo "ERROR: Sentinel still reports the original master. Failover did not occur."
    exit 1
fi
echo "New master identified: $NEW_MASTER_IP:$NEW_MASTER_PORT"

echo "--- Verifying Replica Status ---"
REPLICAS=$(redis-cli -p "$SENTINEL_PORT" SENTINEL slaves "$MASTER_NAME")
echo "$REPLICAS"
# Add more sophisticated checks here to ensure all expected replicas are up and connected to the new master.

echo "--- Verifying Data Consistency ---"
# Compare against the value written to the old master before the shutdown.
echo "Attempting to GET validation_key from new master ($NEW_MASTER_IP:$NEW_MASTER_PORT)..."
DATA=$(redis-cli -h "$NEW_MASTER_IP" -p "$NEW_MASTER_PORT" GET validation_key)

if [ "$DATA" == "$EXPECTED_VALUE" ]; then
    echo "SUCCESS: Data consistency check passed for 'validation_key'."
else
    echo "WARNING: Data consistency check for 'validation_key' failed or key not found."
fi

echo "--- Restarting Original Master Node ---"
echo "Please manually restart the original master node: sudo systemctl start redis"
echo "Monitor its logs and Sentinel logs for rejoining."
read -p "Press Enter when original master has restarted..."

echo "--- Verifying Rejoined Replica ---"
REPLICAS_AFTER_RESTART=$(redis-cli -p $SENTINEL_PORT SENTINEL slaves $MASTER_NAME)
echo "$REPLICAS_AFTER_RESTART"
# Check if the original master IP is now listed as a slave and is connected.

echo "--- Pre-Upgrade Validation Complete ---"
echo "Review all outputs carefully. If any step failed, investigate before proceeding with the upgrade."

This script provides a framework. For production, enhance it with error handling, specific key checks, and integration with your alerting systems. Thorough validation of the failover mechanism is the most critical step in ensuring a smooth Redis cluster upgrade.
