Disaster Recovery 101: Architecting Auto-Failovers for Redis and Ruby Deployments on Linode
Establishing a Redis Sentinel Cluster for High Availability
For robust Redis deployments, a Sentinel cluster is non-negotiable. It provides monitoring, notification, and automatic failover. We’ll configure a basic three-node Sentinel setup on Linode, ensuring quorum and resilience.
First, ensure Redis and Sentinel are installed on your chosen nodes. On Debian/Ubuntu systems, Sentinel ships as a separate package, so this is typically:
sudo apt update
sudo apt install redis-server redis-sentinel
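Next, we’ll configure each Sentinel instance. The configuration file is usually located at /etc/redis/sentinel.conf. Each Sentinel only needs to be told where the master is; it discovers the replicas and the other Sentinels through the master. Failover also requires at least one replica to promote, so on the two nodes that will act as replicas, add replicaof 192.168.1.101 6379 to /etc/redis/redis.conf, and make sure every Redis instance listens on its private address rather than only on 127.0.0.1.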
Sentinel Configuration Details
On each Sentinel node (let’s assume IPs: 192.168.1.101, 192.168.1.102, 192.168.1.103), modify /etc/redis/sentinel.conf as follows. Note that the master name (‘mymaster’) is arbitrary but must be consistent across all Sentinels.
port 26379
daemonize yes
pidfile /var/run/redis_sentinel.pid
logfile /var/log/redis/sentinel.log
dir /tmp

# Define the master Redis instance. The trailing 2 is the quorum: the
# minimum number of Sentinels that must agree the master is unreachable.
sentinel monitor mymaster 192.168.1.101 6379 2

# How long (in ms) the master must be unreachable before this Sentinel
# considers it down
sentinel down-after-milliseconds mymaster 5000

# Timeout (in ms) for the failover process
sentinel failover-timeout mymaster 10000

# Number of replicas that may resynchronize with the new master at the same time
sentinel parallel-syncs mymaster 1

# Other Sentinels do not need to be listed here. Sentinels monitoring the
# same master discover each other automatically, and each one records its
# peers by rewriting this file with "sentinel known-sentinel" lines.
# Do not add those lines by hand, and keep this file writable by Sentinel.

# Authentication (if the Redis instances set requirepass in redis.conf)
# sentinel auth-pass mymaster your_redis_password
After configuring all three Sentinel nodes, start the Sentinel service on each:
sudo systemctl start redis-sentinel
sudo systemctl enable redis-sentinel
You can verify the cluster status by connecting to any Sentinel instance:
redis-cli -p 26379 SENTINEL master mymaster
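This command should show the current master, its IP, port, number of replicas, and the number of other Sentinels monitoring it. The quorum value (set to 2 in sentinel monitor) is crucial: at least 2 Sentinels must agree that the master is down before it is marked objectively down. Starting the failover additionally requires one Sentinel to be elected leader by a majority of all Sentinels, which for three Sentinels is also 2.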
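The reply to SENTINEL master is a list of alternating field names and values. As a minimal sketch, a script could turn that reply into a hash and check the flags and the peer count. The helper names parse_sentinel_fields and master_healthy? are hypothetical (they are not part of redis-rb), and the sample reply is abbreviated. Against a live Sentinel the data would come from something like redis.call("SENTINEL", "master", "mymaster") on port 26379; depending on the client and protocol version the reply may already be a Hash, which the helper accepts as well.

```ruby
# Convert the [field, value, field, value, ...] reply of
# `SENTINEL master <name>` into a Hash. A Hash is passed through untouched.
def parse_sentinel_fields(reply)
  reply.is_a?(Hash) ? reply : reply.each_slice(2).to_h
end

# Returns true when the master carries no down flag and this Sentinel
# knows about all of its peers. expected_sentinels is the total number
# of Sentinels in the deployment (3 in this article).
def master_healthy?(fields, expected_sentinels: 3)
  flags = fields.fetch("flags", "").split(",")
  down = flags.include?("s_down") || flags.include?("o_down")
  peers = Integer(fields.fetch("num-other-sentinels", "0"))
  !down && peers >= expected_sentinels - 1
end

# Abbreviated sample reply for offline experimentation.
sample = ["name", "mymaster", "ip", "192.168.1.101", "port", "6379",
          "flags", "master", "num-slaves", "2",
          "num-other-sentinels", "2", "quorum", "2"]

fields = parse_sentinel_fields(sample)
puts "master #{fields['ip']}:#{fields['port']} healthy=#{master_healthy?(fields)}"
```

A check like this can run from cron or a deploy script to confirm that all three Sentinels see each other before you rely on automatic failover.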
Integrating Ruby Applications with Redis Failover
Your Ruby application needs to be aware of the Redis Sentinel setup. The redis-rb gem provides excellent Sentinel support. We’ll configure the connection to use Sentinel to discover the current master.
First, ensure you have the gem installed:
bundle add redis
In your Ruby application’s configuration (e.g., within an initializer or a configuration file), set up the Redis client to connect via Sentinel:
# config/initializers/redis.rb (for Rails applications)
# Or a similar configuration file for other Ruby frameworks
require "redis"

# Sentinel addresses; redis-rb expects a list of hashes with :host and :port
sentinel_hosts = [
  { host: '192.168.1.101', port: 26379 },
  { host: '192.168.1.102', port: 26379 },
  { host: '192.168.1.103', port: 26379 }
]

# The name of the master Redis instance as configured in sentinel.conf
master_name = 'mymaster'

# Initialize the Redis client using Sentinel
begin
  redis_client = Redis.new(
    # redis-rb 5.x takes the master name via name:.
    # On redis-rb 4.x, pass url: "redis://#{master_name}" instead.
    name: master_name,
    role: :master, # Connect to the current master (for writes)
    sentinels: sentinel_hosts,
    # Optional: Add authentication if your Redis requires it
    # password: 'your_redis_password'
  )

  # Test the connection
  redis_client.ping

  # Assign to a global variable or dependency injection container
  $redis = redis_client
  Rails.logger.info "Successfully connected to Redis via Sentinel."
rescue Redis::CannotConnectError => e
  Rails.logger.error "Failed to connect to Redis Sentinel: #{e.message}"
  # Implement fallback or error handling strategy here
  # For example, you might want to retry connection, or use a fallback data store.
  $redis = nil # Or a mock object
end

# If you need to connect to replicas for read operations
# (redis-rb 4.x uses role: :slave):
# replica_client = Redis.new(
#   name: master_name,
#   role: :replica,
#   sentinels: sentinel_hosts
# )
# $redis_replica = replica_client
This configuration tells the redis-rb client to first connect to one of the provided Sentinel hosts and ask it for the current master of ‘mymaster’. The client does not rely on Sentinel pushing a notification to it. Instead, when the connection to the old master breaks after a failover, the next connection attempt queries the Sentinels again and lands on the new master. Commands in flight at the moment of failover can still fail, so callers should be prepared to retry. The role: :master option ensures the client connects to the primary Redis instance for write operations.
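Because a failover leaves a short window in which commands fail, it can help to wrap critical Redis calls in a small retry helper. The following is a sketch only: with_failover_retry is a hypothetical helper, not part of redis-rb, and the delays are arbitrary starting values. With redis-rb loaded, passing errors: [Redis::BaseConnectionError] would be a natural choice; the demonstration below uses a stand-in error class so it runs without a Redis server.

```ruby
# Retry a block while Redis is briefly unreachable during a failover.
# errors:     exception classes treated as transient
# attempts:   total number of tries before giving up
# base_delay: first wait in seconds, doubled after every failed try
# sleeper:    injectable so the backoff can be observed or skipped in tests
def with_failover_retry(errors:, attempts: 5, base_delay: 0.2, sleeper: ->(s) { sleep(s) })
  tries = 0
  begin
    tries += 1
    yield
  rescue *errors
    raise if tries >= attempts # out of attempts: re-raise the last error
    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end

# Offline demonstration: fail twice, then succeed on the third call.
class FakeConnectionError < StandardError; end

calls = 0
result = with_failover_retry(errors: [FakeConnectionError], sleeper: ->(_s) {}) do
  calls += 1
  raise FakeConnectionError, "master unreachable" if calls < 3
  "PONG"
end
puts "succeeded after #{calls} calls: #{result}"
```

In an application, the block would contain the actual command, for example with_failover_retry(errors: [Redis::BaseConnectionError]) { $redis.set("key", "value") }. Keep the total retry time in the same range as down-after-milliseconds plus the promotion time, otherwise the helper gives up before the new master is available.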
Simulating a Redis Failover
To test the failover mechanism, you can manually trigger it. Connect to one of the Sentinel instances and issue the SENTINEL failover command:
redis-cli -p 26379 SENTINEL failover mymaster
Observe the logs on the Sentinel nodes and your Ruby application. The Sentinel logs should indicate that a failover is in progress, a leader is elected among Sentinels, and a replica is promoted. Your Ruby application, if configured correctly, should detect the change and reconnect to the new master with minimal interruption.
You can also simulate a master failure by stopping the Redis master process:
# On the current Redis master node
sudo systemctl stop redis-server
The Sentinels will detect the master’s unavailability after the down-after-milliseconds timeout and initiate a failover. Again, monitor the Sentinel logs and your application’s connection status.
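To put a number on "minimal interruption", you can probe Redis at a fixed interval during the test and compute the longest stretch of failed probes afterwards. The sketch below shows only the calculation; longest_outage is a hypothetical helper, and the commented probe loop assumes the $redis global from the initializer and an arbitrary key name.

```ruby
# Given probe samples as [timestamp_seconds, ok] pairs in time order,
# return the longest span (in seconds) from the first failed probe of an
# outage to the next successful probe. An outage still open at the end of
# the samples is not counted.
def longest_outage(samples)
  longest = 0.0
  outage_start = nil
  samples.each do |time, ok|
    if ok
      if outage_start
        longest = [longest, time - outage_start].max
        outage_start = nil
      end
    else
      outage_start ||= time
    end
  end
  longest
end

# Collecting samples against a live deployment might look like this:
#
#   samples = []
#   60.times do
#     ok = begin
#       $redis.set("failover:probe", Time.now.to_f)
#       true
#     rescue Redis::BaseError
#       false
#     end
#     samples << [Process.clock_gettime(Process::CLOCK_MONOTONIC), ok]
#     sleep 0.5
#   end

# Offline example: probes fail from t=1.0 until the probe at t=6.5 succeeds.
samples = [[0.0, true], [0.5, true], [1.0, false], [1.5, false], [6.5, true], [7.0, true]]
puts "longest outage: #{longest_outage(samples)}s"
```

With down-after-milliseconds set to 5000, an observed outage somewhat above five seconds is expected when the master is stopped abruptly; a much larger value suggests the client is not re-resolving the master through Sentinel.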
Architecting for Linode Kubernetes Engine (LKE) with Redis Operator
For more complex, cloud-native deployments on Linode, particularly within Kubernetes, leveraging a Redis Operator is the recommended approach. Operators automate the deployment, scaling, and management of stateful applications like Redis, including high availability and failover.
Options include the Redis Enterprise Operator and community-driven operators such as the one from spotahome. This example outlines the conceptual steps using a generic operator pattern rather than the exact schema of any one operator.
Deploying a Redis Cluster on LKE
First, ensure you have a Linode Kubernetes Engine cluster provisioned and kubectl configured to interact with it.
Install the chosen Redis Operator. This typically involves applying a set of Kubernetes manifests:
# Example: Applying operator manifests (specifics vary by operator)
kubectl apply -f https://raw.githubusercontent.com/spotahome/redis-operator/master/deploy/operator.yaml
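Once the operator is running, you can define your desired Redis cluster state by creating a custom resource whose schema comes from the Custom Resource Definition (CRD) the operator installs. Here’s a conceptual example of a Redis cluster manifest; the API version, kind, and field names are illustrative, so check them against your operator’s CRD before applying: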
# redis-cluster.yaml
apiVersion: redis.spotahome.com/v1
kind: Redis
metadata:
  name: my-redis-cluster
spec:
  global:
    redisProvider: "linode" # Or a generic "kubernetes" if not specific
  # Define storage class for persistent volumes
  storage:
    persistentVolumeClaim:
      storageClassName: "linode-block-storage" # Example Linode CSI driver storage class
      resources:
        requests:
          storage: 10Gi
  # Master/replica configuration for high availability
  mode: "cluster" # Or "standalone" with Sentinel, depending on operator
  replicas: 3 # Number of master nodes
  redisNodes:
    masterSet: 3 # Number of masters
    replicaSet: 1 # Number of replicas per master
  # Sentinel configuration (if mode is not "cluster" or operator supports it)
  sentinels:
    replicas: 3
  # Security and access
  redisTLS:
    enabled: false # Set to true for TLS
  password: "your_super_secret_password" # Operator will manage secrets
  # Resource requests and limits for Redis pods
  resources:
    requests:
      cpu: "100m"
      memory: "128Mi"
    limits:
      cpu: "500m"
      memory: "256Mi"
Apply this manifest to your LKE cluster:
kubectl apply -f redis-cluster.yaml
The Redis Operator will then provision the necessary StatefulSets, Services, and potentially Sentinel deployments to create a highly available Redis cluster. It will also manage the creation of Kubernetes Services that point to the current master and replicas, abstracting away the underlying failover process.
Ruby Application Integration with LKE Redis
When running your Ruby application within the same LKE cluster, you can connect to the Redis service exposed by the operator. The operator typically creates a Kubernetes Service that resolves to the current master.
# config/initializers/redis_lke.rb (for Rails on LKE)
require "redis"

# The Kubernetes Service name created by the Redis Operator
redis_service_name = 'my-redis-cluster-master' # This name is operator-dependent
namespace = ENV['REDIS_NAMESPACE'] || 'default' # Namespace the Redis Service lives in

# Construct the service DNS name
redis_host = "#{redis_service_name}.#{namespace}.svc.cluster.local"
redis_port = 6379 # Default Redis port

begin
  redis_client = Redis.new(
    host: redis_host,
    port: redis_port,
    password: ENV['REDIS_PASSWORD'], # Fetch password from environment variables/secrets
    # Optional: fail fast instead of hanging while the Service is repointed
    # read_timeout: 1,
    # write_timeout: 1
  )
  redis_client.ping
  $redis = redis_client
  Rails.logger.info "Successfully connected to Redis on LKE at #{redis_host}:#{redis_port}"
rescue Redis::CannotConnectError => e
  Rails.logger.error "Failed to connect to Redis on LKE: #{e.message}"
  # Handle connection failure
  $redis = nil
end
The key here is that, for operators that maintain a master Service, the Kubernetes Service my-redis-cluster-master (or whatever the operator names it) is repointed at the new master pod after a failover. Your application, by connecting to this stable Kubernetes Service name, benefits from the operator-managed failover without needing explicit Sentinel configuration in the application code itself. Connections opened before the failover are not migrated: they break, and the client reconnects through the Service. If your operator exposes only a Sentinel Service, use the Sentinel-based client configuration shown earlier and point it at that Service instead.
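Service endpoints can lag slightly behind a promotion, so a client may briefly reach a node that has just been demoted. One hedged way to guard against this is to ask the node for its role after connecting. In the sketch below, connected_to_master? is a hypothetical helper that works with any object responding to call, such as a redis-rb client; FakeRedis is a stand-in used only so the example runs without a server, and the IP address in it is made up.

```ruby
# Returns true when the connected node reports itself as a master.
# The Redis ROLE command replies with an array whose first element is
# the role name ("master", "slave" or "sentinel").
def connected_to_master?(client)
  reply = client.call("ROLE")
  reply.is_a?(Array) && reply.first == "master"
end

# Stand-in client for an offline demonstration (hypothetical class).
class FakeRedis
  def initialize(role_reply)
    @role_reply = role_reply
  end

  # Ignores the command and returns the canned ROLE reply.
  def call(*_command)
    @role_reply
  end
end

promoted = FakeRedis.new(["master", 3129659, []])
demoted  = FakeRedis.new(["slave", "10.2.0.15", 6379, "connected", 3129659])

puts "promoted node is master: #{connected_to_master?(promoted)}"
puts "demoted node is master:  #{connected_to_master?(demoted)}"
```

In the initializer, a false result could be treated like a connection failure: close the client, wait briefly, and connect again so that the Service resolves to the promoted pod.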
Monitoring and Alerting Strategies
A robust disaster recovery strategy is incomplete without comprehensive monitoring and alerting. For both standalone Sentinel setups and operator-managed clusters, we need to track key metrics and be notified of failures.
Metrics to Monitor
- Redis Sentinel Health: Number of masters down, number of Sentinels down, failover in progress.
- Redis Instance Health: Latency (p99, p95), memory usage, CPU usage, network traffic, connected clients, command statistics (e.g., INFO commandstats).
- Replication Lag: For replicas, monitor the difference in replication offset.
- Application-Level Metrics: Redis command success/error rates from the application’s perspective.
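As an illustration of the replication lag metric, the difference in replication offset can be derived from the output of INFO replication on the master. This is a sketch under stated assumptions: replication_lag is a hypothetical helper, it parses the raw text as printed by redis-cli (redis-rb's own info call returns an already parsed Hash), and the sample offsets are made up.

```ruby
# Parse the raw text of `INFO replication` taken from a master and return
# a Hash of "ip:port" => lag in bytes, where lag is master_repl_offset
# minus the offset reported for that replica.
def replication_lag(info_text)
  master_offset = nil
  replica_offsets = {}

  info_text.each_line do |line|
    key, value = line.strip.split(":", 2)
    next if value.nil? # section headers and blank lines

    if key == "master_repl_offset"
      master_offset = Integer(value)
    elsif key.match?(/\Aslave\d+\z/)
      # e.g. ip=192.168.1.102,port=6379,state=online,offset=1200,lag=0
      attrs = value.split(",").to_h { |pair| pair.split("=", 2) }
      replica_offsets["#{attrs['ip']}:#{attrs['port']}"] = Integer(attrs["offset"])
    end
  end

  return {} if master_offset.nil?

  replica_offsets.transform_values { |offset| master_offset - offset }
end

sample_info = <<~INFO
  # Replication
  role:master
  connected_slaves:2
  slave0:ip=192.168.1.102,port=6379,state=online,offset=1200,lag=0
  slave1:ip=192.168.1.103,port=6379,state=online,offset=950,lag=1
  master_repl_offset:1200
INFO

replication_lag(sample_info).each do |replica, lag|
  puts "#{replica} is #{lag} bytes behind"
end
```

A replica whose lag keeps growing is a poor failover candidate, so this value is worth alerting on alongside Sentinel health.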
Tools and Integrations
For Standalone Sentinel:
- Prometheus + Redis Exporter: Deploy the redis_exporter alongside your Redis instances. Configure Prometheus to scrape metrics from both Redis and Sentinel.
- Alertmanager: Define alerting rules in Prometheus (e.g., alert if a master is down for more than 30 seconds, or if Sentinel quorum is not met) and use Prometheus Alertmanager to route the resulting alerts.
- Linode Managed Databases (if applicable): If using Linode’s managed Redis service, leverage their built-in monitoring and alerting.
For LKE with Redis Operator:
- Prometheus Operator (e.g., kube-prometheus-stack): This stack often includes Prometheus, Alertmanager, and Grafana, pre-configured for Kubernetes. It can automatically discover and scrape metrics from Redis pods and services managed by the operator.
- Operator-Specific Metrics: Some operators expose their own metrics about the health and status of the managed Redis clusters.
- Kubernetes Events: Monitor Kubernetes events for Pod failures, Service endpoint changes, and StatefulSet issues.
Alerting Channels: Integrate Alertmanager with Slack, PagerDuty, Opsgenie, or email to ensure critical alerts reach the appropriate on-call engineers promptly.
Conclusion: Architecting for Resilience
Implementing automated failover for Redis and your Ruby applications on Linode requires a multi-faceted approach. For simpler setups, Redis Sentinel provides a robust, well-tested solution. For cloud-native environments like LKE, Redis Operators abstract away much of the complexity, offering a declarative and automated way to manage high availability. Regardless of the chosen path, rigorous testing of failover scenarios and comprehensive monitoring are paramount to ensuring your application remains available when disaster strikes.