Disaster Recovery 101: Architecting Auto-Failovers for Redis and Python Deployments on OVH
Redis Sentinel for High Availability
Achieving automated failover for Redis hinges on implementing a robust high-availability strategy. Redis Sentinel is the de facto standard for this, providing monitoring, notification, and automatic failover for Redis instances. We’ll focus on a multi-datacenter deployment scenario on OVH, assuming you have at least three Redis instances (one master, two replicas) and a minimum of three Sentinel processes distributed across different availability zones or even regions for true disaster resilience.
The core configuration for Redis Sentinel is managed in a `sentinel.conf` file. Here’s a breakdown of essential directives:
Sentinel Configuration (`sentinel.conf`)
# Monitor the master registered under the arbitrary name 'mymaster'.
# The final argument is the quorum: the number of Sentinels that must agree
# the master is unreachable before it is marked as down.
# With N Sentinels, a majority quorum is floor(N/2) + 1 (2 of 3, 3 of 5).
# The master's address is set only here; there are no separate
# 'sentinel master-host' or 'sentinel master-port' directives.
sentinel monitor mymaster 192.168.1.100 6379 2

# Time in milliseconds the master must be unreachable before this
# Sentinel considers it down.
sentinel down-after-milliseconds mymaster 5000

# Failover timeout in milliseconds. Among other uses, it bounds how long a
# failover attempt may take and how soon one can be retried for the same master.
sentinel failover-timeout mymaster 60000

# Number of replicas reconfigured to sync with the new master at the same time.
sentinel parallel-syncs mymaster 1

# Optional: If your Redis instances are protected by a password.
# sentinel auth-pass mymaster YourRedisPassword

# Optional: Working directory for the Sentinel process.
# dir /var/lib/redis

# Optional: Logging configuration.
logfile "/var/log/redis/sentinel.log"
loglevel notice

# Optional: Bind Sentinel to a specific IP address to listen on.
# bind 192.168.1.200
Key parameters to note:
- `sentinel monitor <master-name> <ip> <port> <quorum>`: This is the most critical directive. It tells Sentinel to monitor a Redis master at the specified IP and port. The `<quorum>` is the number of Sentinels that must agree that the master is down before a failover is triggered. For 3 Sentinels, a quorum of 2 is typical. For 5 Sentinels, a quorum of 3 is appropriate.
- `sentinel parallel-syncs <master-name> <num>`: This limits the number of replicas that are reconfigured to sync with the new master at the same time after a failover. A value of 1 is safe and avoids overwhelming the new master.
- `sentinel failover-timeout <master-name> <milliseconds>`: The failover timeout. Sentinel uses it in several ways, including to bound how long a failover attempt may take and how long to wait before retrying a failover of the same master.
- `sentinel down-after-milliseconds <master-name> <milliseconds>`: The time a master must be unreachable for a Sentinel to consider it down. This should be tuned based on network latency and expected Redis responsiveness.
- `sentinel auth-pass <master-name> <password>`: If your Redis instances require authentication, this directive is essential for Sentinel to connect to them.
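The quorum guidance above comes down to majority arithmetic. The quorum only decides when the master is declared down. The failover itself must still be authorized by a majority of all Sentinels, so a quorum below the majority does not let a minority fail over. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of Redis or redis-py.

```python
def sentinel_majority(total_sentinels: int) -> int:
    """Smallest number of Sentinels that forms a strict majority."""
    if total_sentinels < 1:
        raise ValueError("need at least one Sentinel")
    return total_sentinels // 2 + 1


def tolerated_sentinel_failures(total_sentinels: int) -> int:
    """How many Sentinels can be lost while a majority is still reachable."""
    return total_sentinels - sentinel_majority(total_sentinels)


for n in (3, 5, 7):
    # Columns: total Sentinels, majority, Sentinels that may fail
    print(n, sentinel_majority(n), tolerated_sentinel_failures(n))
```

With three Sentinels the majority is 2 and one Sentinel can fail. Four Sentinels need 3 and still tolerate only one failure, which is why odd counts are the usual choice.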
To deploy Redis Sentinel, you would typically run the Redis server with the --sentinel flag and point it to your `sentinel.conf` file. For example:
redis-server /etc/redis/sentinel.conf --sentinel
Ensure that your Sentinel instances can communicate with each other and with all Redis instances (master and replicas) on the configured ports (default 6379 for Redis, 26379 for Sentinel). Firewall rules on OVH instances must be configured accordingly.
Python Application Integration with Sentinel
Your Python application needs to be aware of the current Redis master. Directly connecting to a hardcoded IP address will break during a failover. The standard practice is to query Sentinel for the current master’s address. The redis-py library provides excellent support for this.
Using `redis-py` with Sentinel
First, ensure you have the library installed:
pip install redis
Then, configure your Python application to connect via Sentinel:
import redis
from redis.sentinel import Sentinel

# List of Sentinel (host, port) tuples
sentinels = [
    ('192.168.1.201', 26379),
    ('192.168.1.202', 26379),
    ('192.168.1.203', 26379),
]

# The name of the master as configured in sentinel.conf
master_name = 'mymaster'

try:
    # The Sentinel object talks to the Sentinel processes. The timeouts
    # keep the application from hanging if a Sentinel is unresponsive.
    sentinel = Sentinel(
        sentinels,
        socket_timeout=1,          # Timeout for socket reads and writes
        socket_connect_timeout=1,  # Timeout for establishing a connection
    )

    # master_for() returns a Redis client that asks Sentinel for the
    # current master's address whenever it opens a new connection.
    r = sentinel.master_for(
        master_name,
        decode_responses=True,  # Decode responses from bytes to strings
    )

    # Test the connection and perform a simple operation
    r.set('mykey', 'myvalue')
    value = r.get('mykey')
    print(f"Successfully connected to Redis. Value for 'mykey': {value}")
except redis.exceptions.ConnectionError as e:
    # Also covers the case where no Sentinel can report a master
    print(f"Could not connect to Redis via Sentinel: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
A redis-py client configured through Sentinel will:
- Query the provided Sentinel instances to discover the current master for the given master name.
- Establish a connection to that master.
- If the master becomes unavailable and Sentinel initiates a failover, detect the disconnection and re-query the Sentinels to find the new master, automatically reconnecting.
It’s crucial to include multiple Sentinel addresses in the sentinels list. If one Sentinel is down or unreachable, redis-py will try the others. The socket_timeout and socket_connect_timeout parameters are important for preventing your application from hanging indefinitely if a Sentinel is unresponsive.
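Reconnection is automatic, but a command issued while the failover is still in progress can fail before a new master is available. One common way to absorb that window is a small retry wrapper with exponential backoff. The sketch below is an assumption about how an application might do this, not something redis-py requires. `call_with_retry` and its parameters are hypothetical names, and the default exception types are Python built-ins so the block runs without a Redis server.

```python
import time


def call_with_retry(operation, retry_on=(ConnectionError, TimeoutError),
                    attempts=5, base_delay=0.2, sleep=time.sleep):
    """Run operation(), retrying with exponential backoff on transient errors.

    retry_on is the tuple of exception types treated as transient. The final
    failure is re-raised so the caller can still handle a prolonged outage.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Example with a stand-in operation that fails twice before succeeding.
state = {"calls": 0}

def flaky_write():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("master unavailable")
    return "OK"

print(call_with_retry(flaky_write, base_delay=0.01))
```

With redis-py, you would pass `retry_on=(redis.exceptions.ConnectionError, redis.exceptions.TimeoutError)`, since those classes do not derive from the built-in ones, and wrap the call as `lambda: r.set('mykey', 'myvalue')`. Keep the total retry time longer than `down-after-milliseconds` plus the time a failover usually takes in your environment.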
OVH Specific Considerations for Disaster Recovery
When deploying on OVH, several factors come into play for robust disaster recovery:
Network Configuration and Security Groups
Ensure your OVH security groups (firewall rules) allow:
- Communication between all Redis instances (master and replicas) on port 6379.
- Communication between all Sentinel instances on port 26379 (default Sentinel port) for failure detection, leader election, and failover coordination.
- Communication between Sentinel instances and all Redis instances on port 6379.
- Communication between your Python application servers and every Redis instance on port 6379, since any replica can be promoted to master.
- Communication between your Python application servers and the Sentinel instances on port 26379, so clients can discover the current master.
For inter-region or inter-datacenter deployments on OVH, consider using their Private Network features or VPNs to ensure secure and reliable communication between your Redis/Sentinel nodes if they are not in the same OVH Private Network. If using public IPs, ensure they are properly secured.
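Firewall rules are easy to get subtly wrong, so it helps to verify reachability from each machine after applying them. The following is a minimal sketch using only the standard library. The function names are illustrative, and the addresses in the usage comment are the example IPs used earlier in this article.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


def check_endpoints(endpoints, timeout: float = 1.0) -> dict:
    """Map each (host, port) pair to whether it is reachable from this machine."""
    return {(host, port): port_is_open(host, port, timeout)
            for host, port in endpoints}


# Usage from an application server (run manually, addresses are examples):
# results = check_endpoints([
#     ("192.168.1.100", 6379),
#     ("192.168.1.201", 26379),
#     ("192.168.1.202", 26379),
#     ("192.168.1.203", 26379),
# ])
# for endpoint, ok in results.items():
#     print(endpoint, "reachable" if ok else "BLOCKED")
```

Run the check from every role (application servers, Redis nodes, Sentinel nodes), because a rule can be correct in one direction and missing in the other. A successful TCP connection only shows the port is open; it does not confirm authentication or that the process behind it is healthy.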
Instance Placement and Availability Zones
To achieve true disaster recovery, your Redis master, replicas, and Sentinel instances should be distributed across different OVH Availability Zones (AZs) or even different OVH Regions. This prevents a single point of failure due to an AZ or region outage.
For example, a typical setup might look like this:
- Redis Master: AZ-A
- Redis Replica 1: AZ-B
- Redis Replica 2: AZ-C
- Sentinel 1: AZ-A
- Sentinel 2: AZ-B
- Sentinel 3: AZ-C
This ensures that even if an entire AZ goes down, you still have enough Sentinels to elect a new master from the remaining replicas, and your application can reconnect.
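The claim above can be checked for any placement before you deploy it. The sketch below models a layout as plain lists of zone names and asks whether an automatic failover is still possible after losing one zone. It is a simplified model under stated assumptions: it only counts surviving Sentinels and replicas, ignores replication lag and partial network partitions, and the function name is hypothetical.

```python
def can_recover_from_zone_loss(lost_zone, master_zone, replica_zones,
                               sentinel_zones, quorum):
    """Return True if Redis stays available after losing lost_zone."""
    if lost_zone != master_zone:
        return True  # The master is unaffected, so no failover is needed

    surviving_sentinels = sum(1 for z in sentinel_zones if z != lost_zone)
    surviving_replicas = sum(1 for z in replica_zones if z != lost_zone)
    majority = len(sentinel_zones) // 2 + 1

    # Enough Sentinels must remain to reach the quorum and to form a
    # majority, and at least one replica must remain to be promoted.
    return surviving_sentinels >= max(quorum, majority) and surviving_replicas >= 1


# The layout from the list above: one Sentinel and one Redis node per zone.
spread = dict(master_zone="AZ-A", replica_zones=["AZ-B", "AZ-C"],
              sentinel_zones=["AZ-A", "AZ-B", "AZ-C"], quorum=2)
print(can_recover_from_zone_loss("AZ-A", **spread))

# A riskier layout: two of the three Sentinels share the master's zone.
stacked = dict(master_zone="AZ-A", replica_zones=["AZ-B", "AZ-C"],
               sentinel_zones=["AZ-A", "AZ-A", "AZ-B"], quorum=2)
print(can_recover_from_zone_loss("AZ-A", **stacked))
```

The second layout fails the check because losing AZ-A leaves a single Sentinel, which can neither reach the quorum nor form a majority.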
Automated Deployment and Configuration Management
Manually configuring Redis and Sentinel across multiple instances is error-prone and not scalable. Leverage infrastructure-as-code tools like Terraform, Ansible, or Chef for automated deployment and configuration. This ensures consistency and repeatability.
An Ansible playbook snippet for configuring Redis and Sentinel might look like this:
---
- name: Configure Redis and Sentinel
  hosts: redis_servers
  become: yes
  vars:
    redis_port: 6379
    sentinel_port: 26379
    # Assumes the first host in the group is the initial master
    redis_master_ip: "{{ hostvars[groups['redis_servers'][0]]['ansible_default_ipv4']['address'] }}"
    sentinel_conf_path: /etc/redis/sentinel.conf
    redis_conf_path: /etc/redis/redis.conf
  tasks:
    - name: Install Redis and Sentinel
      apt:
        name:
          - redis-server
          - redis-sentinel  # Assumes Debian/Ubuntu package names
        state: present
        update_cache: yes
    - name: Configure Redis Master
      template:
        src: redis.conf.j2
        dest: "{{ redis_conf_path }}"
      # Compare with ==; 'in' against a single hostname is a substring test
      when: inventory_hostname == groups['redis_servers'][0]  # Master config
      notify: Restart Redis
    - name: Configure Redis Replica
      template:
        src: redis-replica.conf.j2
        dest: "{{ redis_conf_path }}"
      when: inventory_hostname != groups['redis_servers'][0]  # Replica config
      notify: Restart Redis
    - name: Configure Sentinel
      template:
        src: sentinel.conf.j2
        dest: "{{ sentinel_conf_path }}"
      notify: Restart Sentinel
    - name: Ensure Redis service is running and enabled
      systemd:
        name: redis-server
        state: started
        enabled: yes
    - name: Ensure Sentinel service is running and enabled (if applicable)
      systemd:
        name: redis-sentinel
        state: started
        enabled: yes
  handlers:
    - name: Restart Redis
      systemd:
        name: redis-server
        state: restarted
    - name: Restart Sentinel
      systemd:
        name: redis-sentinel
        state: restarted
You would then have Jinja2 templates (e.g., `redis.conf.j2`, `redis-replica.conf.j2`, `sentinel.conf.j2`) that dynamically generate the configuration files based on Ansible variables and inventory.
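To make the expected output of such a template concrete, here is a minimal Python sketch that renders the core Sentinel directives from a handful of variables. It is not the Jinja2 template itself. The function name and parameter names are hypothetical, and the defaults mirror the example `sentinel.conf` shown earlier.

```python
def render_sentinel_conf(master_name, master_ip, master_port=6379, quorum=2,
                         down_after_ms=5000, failover_timeout_ms=60000,
                         parallel_syncs=1, sentinel_port=26379):
    """Return the text of a minimal sentinel.conf for one monitored master."""
    lines = [
        f"port {sentinel_port}",
        f"sentinel monitor {master_name} {master_ip} {master_port} {quorum}",
        f"sentinel down-after-milliseconds {master_name} {down_after_ms}",
        f"sentinel failover-timeout {master_name} {failover_timeout_ms}",
        f"sentinel parallel-syncs {master_name} {parallel_syncs}",
    ]
    return "\n".join(lines) + "\n"


print(render_sentinel_conf("mymaster", "192.168.1.100"))
```

A Jinja2 version would substitute the same values from Ansible variables such as `redis_master_ip` and `sentinel_port`. Note that Sentinel rewrites its own configuration file at runtime, so it is worth checking how re-running the template task interacts with a live deployment.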
Testing and Monitoring Failover
Automated failover is only effective if it works reliably. Rigorous testing is paramount.
Simulating Failures
You can simulate various failure scenarios:
- Master Failure: Stop the Redis master process (e.g., `sudo systemctl stop redis-server`). Observe Sentinel logs and your application’s behavior.
- Network Partition: Block network traffic between a Sentinel and the master, or between Sentinels themselves.
- Instance Crash: Kill the Redis master process forcefully (e.g., `sudo kill -9 $(pgrep redis-server)`).
- Sentinel Failure: Stop one or more Sentinel processes.
After each simulation, verify that:
- Sentinel logs indicate the master is down and a failover has occurred.
- A new master has been elected.
- Your Python application can successfully connect to the new master and perform operations.
- Replicas have been reconfigured to sync with the new master.
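Part of this checklist can be automated. The sketch below polls for a change of master address after you trigger a failure and reports how long the switch took to become visible. It assumes you supply a `discover_master` callable that returns the current `(host, port)`. With redis-py that could be `lambda: sentinel.discover_master('mymaster')` on a `redis.sentinel.Sentinel` object. The helper name and its parameters are hypothetical.

```python
import time


def wait_for_new_master(discover_master, old_master, timeout=60.0,
                        interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until discover_master() reports an address other than old_master.

    Returns (new_master, seconds_elapsed) or raises TimeoutError.
    """
    start = clock()
    deadline = start + timeout
    while True:
        try:
            current = discover_master()
        except Exception:
            # Mid-failover, a lookup may fail because no master is known yet
            current = None
        if current is not None and tuple(current) != tuple(old_master):
            return tuple(current), clock() - start
        if clock() >= deadline:
            raise TimeoutError(f"master still {old_master!r} after {timeout}s")
        sleep(interval)


# Example with canned answers standing in for real Sentinel lookups.
answers = iter([("192.168.1.100", 6379), ("192.168.1.101", 6379)])
new_master, elapsed = wait_for_new_master(
    lambda: next(answers), ("192.168.1.100", 6379), interval=0.01)
print(new_master)
```

Record the old master address before stopping it, then call the helper and compare the elapsed time against your recovery time objective. A follow-up write through the application's own client confirms the third item in the list.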
Monitoring Sentinel and Redis
Implement comprehensive monitoring for your Redis and Sentinel instances. Key metrics include:
- Redis: Uptime, memory usage, connected clients, latency, command statistics, replication status.
- Sentinel: Uptime, number of masters being monitored, number of Sentinels in the quorum, number of Sentinels reporting master down, failover events.
Tools like Prometheus with the Redis Exporter, or commercial solutions, can be integrated with OVH’s monitoring services or your preferred observability platform. Set up alerts for critical events, such as Sentinel reporting a master down or a failover in progress.
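As a minimal sketch of the alerting logic, the function below turns the fields Sentinel reports for a master (the reply to `SENTINEL MASTER <name>`) into a list of alert messages. The field names `flags`, `num-slaves`, and `num-other-sentinels` and the flag values are the ones Sentinel emits. The function name, thresholds, and message wording are assumptions, and how you collect the fields and deliver the alerts is left to your monitoring stack.

```python
def master_alerts(state, expected_replicas=2, expected_sentinels=3):
    """Return alert messages derived from one SENTINEL MASTER reply."""
    raw_flags = state.get("flags", "")
    # Accept either the raw comma-separated string or an iterable of flags
    flags = set(raw_flags.split(",")) if isinstance(raw_flags, str) else set(raw_flags)

    alerts = []
    if "o_down" in flags:
        alerts.append("master is objectively down")
    elif "s_down" in flags:
        alerts.append("master is subjectively down")
    if "failover_in_progress" in flags:
        alerts.append("failover in progress")

    replicas = int(state.get("num-slaves", 0))
    if replicas < expected_replicas:
        alerts.append(f"only {replicas} of {expected_replicas} replicas known")

    # The reporting Sentinel does not count itself among the "other" Sentinels
    sentinels = int(state.get("num-other-sentinels", 0)) + 1
    if sentinels < expected_sentinels:
        alerts.append(f"only {sentinels} of {expected_sentinels} Sentinels known")
    return alerts


sample = {"flags": "master,s_down", "num-slaves": "2", "num-other-sentinels": "1"}
for message in master_alerts(sample):
    print(message)
```

Alerting on a shrinking Sentinel or replica count matters as much as alerting on the failover itself, because a degraded group may be unable to complete the next failover.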
By combining Redis Sentinel for Redis HA and careful application integration with redis-py, you can architect a resilient system on OVH that automatically handles Redis instance failures, ensuring minimal downtime and high availability for your Python applications.