Disaster Recovery 101: Architecting Auto-Failovers for Redis and C Deployments on Linode

Establishing a High-Availability Redis Cluster with Sentinel

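For critical applications relying on Redis for caching or session management, a single instance is a single point of failure. Redis Sentinel adds automatic failover: when the master becomes unreachable, the Sentinels promote a replica, which keeps downtime to a few seconds. This section details the setup of a Sentinel-managed Redis deployment on Linode with one master, two replicas, and three Sentinel processes. (This is not Redis Cluster, which is a separate sharding feature.)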

We’ll assume you have six Linode instances provisioned (three for Redis and three for Sentinel), each with a static IP address. For this example, let’s use the following placeholder addresses:

Master Node: 192.168.1.10 (redis-master)
Replica Node 1: 192.168.1.11 (redis-replica-1)
Replica Node 2: 192.168.1.12 (redis-replica-2)
Sentinel Node 1: 192.168.1.20 (sentinel-1)
Sentinel Node 2: 192.168.1.21 (sentinel-2)
Sentinel Node 3: 192.168.1.22 (sentinel-3)

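Redis must run as a background service with persistence (RDB or AOF) enabled for data recovery. The configurations below set daemonize yes; some packaged systemd units expect daemonize no or supervised systemd instead, so check the unit file your package ships. Ensure Redis is installed on all three nodes intended for Redis instances.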

Configuring Redis Instances

On each Redis node (redis-master, redis-replica-1, redis-replica-2), edit the Redis configuration file (typically /etc/redis/redis.conf).

Master Node Configuration (redis-master)

Ensure the following settings are present or modified:

# /etc/redis/redis.conf on redis-master
port 6379
daemonize yes
pidfile /var/run/redis_6379.pid
logfile /var/log/redis/redis-server.log
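# Listens on all interfaces; prefer the node's private IP where possible.
# redis.conf does not allow trailing comments, so comments go on their own lines.
bind 0.0.0.0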
appendonly yes
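# Replication settings. They are commented out because these are the defaults.
# If you change them, keep them identical on all three Redis nodes: a failover
# can turn this master into a replica and a replica into the master.
# replica-serve-stale-data yes
# replica-read-only yes
# repl-disable-tcp-nodelay no
# repl-backlog-size 1mb
# repl-backlog-ttl 3600
# replica-priority 100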

Replica Node Configuration (redis-replica-1, redis-replica-2)

On each replica node, add the following lines, pointing to the master’s IP address:

# /etc/redis/redis.conf on redis-replica-1 & redis-replica-2
port 6379
daemonize yes
pidfile /var/run/redis_6379.pid
logfile /var/log/redis/redis-server.log
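# Listens on all interfaces; prefer the node's private IP where possible.
# redis.conf does not allow trailing comments, so comments go on their own lines.
bind 0.0.0.0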
appendonly yes
replica-serve-stale-data yes
replica-read-only yes
repl-disable-tcp-nodelay no
repl-backlog-size 1mb
repl-backlog-ttl 3600
# Default is 100. Sentinel prefers replicas with a lower value when choosing
# which one to promote; 0 means this replica is never promoted.
replica-priority 100

# Replication settings: point to your master's IP and port
replicaof 192.168.1.10 6379

After configuring, restart Redis on all three nodes:

sudo systemctl restart redis-server
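
Before moving on to Sentinel, it is worth confirming that both replicas actually attached to the master. The following is a minimal sketch, assuming you feed it the text printed by redis-cli -p 6379 INFO replication on a node; the function names parse_info and replication_problems are illustrative and not part of any Redis tooling.

```python
def parse_info(text):
    """Turn the text of a Redis INFO reply into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Replication"
        key, _, value = line.partition(":")
        info[key] = value
    return info


def replication_problems(info, expected_replicas):
    """Return a list of problems found; an empty list means the node looks healthy."""
    problems = []
    role = info.get("role")
    if role == "master":
        connected = int(info.get("connected_slaves", "0"))
        if connected < expected_replicas:
            problems.append(
                f"only {connected} of {expected_replicas} replicas connected"
            )
    elif role == "slave":
        # INFO still reports replicas with the legacy role name "slave".
        if info.get("master_link_status") != "up":
            problems.append("link to master is down")
    else:
        problems.append(f"unexpected role: {role!r}")
    return problems
```

On redis-master you would expect replication_problems(parse_info(output), expected_replicas=2) to return an empty list; on a replica, the same call checks that the link to the master is up.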

Setting Up Redis Sentinel

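Redis Sentinel is a separate process that monitors Redis instances and performs automatic failover. Install Redis on the three Sentinel nodes (sentinel-1, sentinel-2, sentinel-3). Sentinel ships with Redis itself, so the same installation works; on Debian and Ubuntu, the redis-sentinel systemd service used below is provided by the redis-sentinel package.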

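Create a Sentinel configuration file, e.g., /etc/redis/sentinel.conf, on each Sentinel node. The file must be writable by the user Sentinel runs as, because Sentinel rewrites it to persist the state it discovers.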

# /etc/redis/sentinel.conf on sentinel-1, sentinel-2, sentinel-3
port 26379
daemonize yes
pidfile /var/run/redis-sentinel.pid
logfile /var/log/redis/redis-sentinel.log
# Or bind to the node's private IP. Trailing comments are not allowed in this
# file, so comments go on their own lines.
bind 0.0.0.0

# Monitor the master Redis instance
# The first argument is the name of the master, the second is its IP, the third is its port,
# and the fourth is the quorum (minimum number of Sentinels that must agree a master is down).
# A quorum of 2 is sufficient for 3 Sentinels.
sentinel monitor mymaster 192.168.1.10 6379 2

# Time in milliseconds an instance must be unreachable before this Sentinel
# marks it as subjectively down (SDOWN).
sentinel down-after-milliseconds mymaster 5000

# Number of replicas that may be reconfigured to sync from the new master at
# the same time after a failover. This is a count, not a duration.
sentinel parallel-syncs mymaster 1

# Failover timeout in milliseconds. Among other things, it bounds how long a
# failover attempt may run and how soon a failed attempt can be retried.
# The default is 180000 (3 minutes); 10000 is an aggressive value.
sentinel failover-timeout mymaster 10000

# Replica preference during failover is not configured here: Sentinel reads
# the replica-priority value from each replica's redis.conf.
#
# Sentinel discovers the replicas through the master and records them itself
# as "sentinel known-replica" lines when it rewrites this file. Do not add
# them by hand.

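# Optional: Authentication for Redis instances.
# requirepass and masterauth belong in redis.conf, not in this file. Set both
# on all three Redis nodes so that any node can act as master or replica.
# Only the following line belongs in sentinel.conf:
# sentinel auth-pass mymaster your_redis_password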
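
The quorum comment in the configuration above covers only half of the picture: the quorum decides when the master is flagged as objectively down, but the Sentinel that carries out the failover must also be authorized by a majority of all known Sentinels. The sketch below works through that arithmetic; the function names are hypothetical and only model the rule, they do not talk to Sentinel.

```python
def votes_needed(total_sentinels, quorum):
    """Votes a Sentinel must collect to lead a failover.

    It needs a majority of all Sentinels, and never fewer than the quorum.
    """
    majority = total_sentinels // 2 + 1
    return max(majority, quorum)


def failover_possible(total_sentinels, reachable_sentinels, quorum):
    """True if the Sentinels that can still talk to each other are enough
    both to flag the master as down and to authorize a failover."""
    return reachable_sentinels >= votes_needed(total_sentinels, quorum)


# Three Sentinels with quorum 2, as configured in this section:
print(votes_needed(3, 2))          # 2
print(failover_possible(3, 2, 2))  # True: one Sentinel may be lost
print(failover_possible(3, 1, 2))  # False: a lone Sentinel cannot fail over
```

This is why three Sentinels on separate nodes is the usual minimum: with only two, losing either one leaves no majority.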

Start the Sentinel service on each Sentinel node:

sudo systemctl start redis-sentinel
sudo systemctl enable redis-sentinel

Verify Sentinel status:

redis-cli -p 26379 INFO Sentinel

You should see output indicating the monitored master, along with the number of replicas and Sentinels known for it. The Sentinels discover each other through the master and begin monitoring immediately; they elect a leader only when a failover is actually needed. To test failover, stop the Redis master process:

# On redis-master
sudo systemctl stop redis-server

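Monitor the Sentinel logs (/var/log/redis/redis-sentinel.log) on the Sentinel nodes. Once the master has been unreachable for the configured 5 seconds and the Sentinels agree, one of the replicas is promoted to master and the remaining replica is reconfigured to follow it. When the old master comes back, Sentinel reconfigures it as a replica of the new master. Note that Sentinel does not proxy or redirect client connections: your application must ask a Sentinel for the current master address (SENTINEL get-master-addr-by-name mymaster) and reconnect, which Sentinel-aware client libraries do automatically.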
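
When reading the Sentinel log during a failover test, the line to look for is the +switch-master event, which names the old and the new master address. As a rough sketch, the helper below pulls those events out of log lines; find_failovers is an illustrative name, and the timestamp prefix in the sample line is only an assumed example of what a log line may look like.

```python
import re

# Event format: +switch-master <name> <old-ip> <old-port> <new-ip> <new-port>
SWITCH_RE = re.compile(
    r"\+switch-master (?P<name>\S+) (?P<old_ip>\S+) (?P<old_port>\d+) "
    r"(?P<new_ip>\S+) (?P<new_port>\d+)"
)


def find_failovers(log_lines):
    """Yield (master_name, old_address, new_address) for each +switch-master event."""
    for line in log_lines:
        match = SWITCH_RE.search(line)
        if match:
            yield (
                match["name"],
                f'{match["old_ip"]}:{match["old_port"]}',
                f'{match["new_ip"]}:{match["new_port"]}',
            )


# Illustrative log line; the prefix before "+switch-master" is an assumption.
sample = [
    "1234:X 01 Jan 2024 10:00:05.123 # +switch-master mymaster "
    "192.168.1.10 6379 192.168.1.11 6379",
]
for name, old, new in find_failovers(sample):
    print(f"{name}: {old} -> {new}")
```

To scan a real log, pass an open file object instead of the sample list, for example find_failovers(open("/var/log/redis/redis-sentinel.log")).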

Automating C Application Failover with Systemd and HAProxy

For stateless C applications that need high availability, we can leverage systemd for process management and automatic restarts, combined with HAProxy as a load balancer and health checker. This setup assumes your C application is designed to be stateless or can manage its state externally (e.g., via Redis, as configured above).

We’ll deploy two instances of the C application on separate Linode instances (app-1, app-2) and use a third Linode instance (lb-1) for HAProxy. The application will listen on a specific port (e.g., 8080).

Application Deployment and Systemd Service

On each application node (app-1, app-2), ensure your compiled C application binary is in a standard location (e.g., /usr/local/bin/my_c_app). Create a systemd service file to manage the application.

# /etc/systemd/system/my_c_app.service on app-1 and app-2
[Unit]
Description=My C Application Service
After=network.target

[Service]
ExecStart=/usr/local/bin/my_c_app --port 8080 --config /etc/my_c_app/config.conf
Restart=always
RestartSec=5
User=my_app_user
Group=my_app_user
WorkingDirectory=/opt/my_c_app
Environment="MY_APP_ENV=production"

[Install]
WantedBy=multi-user.target

Create the user and group, and set up the application directory:

sudo groupadd -r my_app_user
sudo useradd -r -g my_app_user -s /usr/sbin/nologin my_app_user
sudo mkdir -p /opt/my_c_app
sudo chown -R my_app_user:my_app_user /opt/my_c_app
sudo mkdir -p /etc/my_c_app
sudo chown -R my_app_user:my_app_user /etc/my_c_app

Place your compiled C application binary and any necessary configuration files in the respective directories. Then, enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable my_c_app
sudo systemctl start my_c_app

Check the status:

sudo systemctl status my_c_app

Configuring HAProxy for Load Balancing and Health Checks

Install HAProxy on the load balancer node (lb-1).

sudo apt update && sudo apt install haproxy -y

Edit the HAProxy configuration file (/etc/haproxy/haproxy.cfg).

# /etc/haproxy/haproxy.cfg on lb-1
global
    log /dev/log    local0
    log /dev/log    local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    timeout connect 5000
    timeout client  50000
    timeout server  50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

frontend http_frontend
    bind *:80
    mode http
    default_backend http_backend

backend http_backend
    mode http
    balance roundrobin
    option httpchk GET /healthz # Assuming your C app has a /healthz endpoint
    http-check expect status 200
    server app1 192.168.1.30:8080 check # IP of app-1
    server app2 192.168.1.31:8080 check # IP of app-2

# If your C app is not HTTP-based, use TCP mode
# frontend tcp_frontend
#    bind *:8080
#    mode tcp
#    default_backend tcp_backend

# backend tcp_backend
#    mode tcp
#    balance roundrobin
#    option tcp-check # Basic TCP connection check
#    server app1 192.168.1.30:8080 check port 8080 # IP of app-1
#    server app2 192.168.1.31:8080 check port 8080 # IP of app-2

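Note: If your C application doesn’t expose an HTTP health check endpoint, use the TCP variant instead. In TCP mode, the check keyword on a server line already performs a basic TCP connect check; option tcp-check adds value mainly when you script additional send/expect steps. Ensure your C application is configured to listen on the specified port (e.g., 8080).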
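
If the C application does not have a /healthz endpoint yet, a throwaway stand-in can be useful for validating the HAProxy configuration first. The sketch below is such a stand-in, not the application itself: it answers 200 on /healthz and 404 elsewhere. HealthHandler and make_server are hypothetical names, and port 8080 simply mirrors the example above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz with 200 and everything else with 404."""

    def do_GET(self):
        if self.path == "/healthz":
            status, body = 200, b"ok\n"
        else:
            status, body = 404, b"not found\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # stay quiet; HAProxy checks arrive every few seconds


def make_server(port=8080, host="0.0.0.0"):
    """Create (but do not start) the stand-in server."""
    return HTTPServer((host, port), HealthHandler)
```

Running make_server().serve_forever() on app-1 or app-2 should be enough for HAProxy to mark the server as up; stopping it should mark the server as down. The real C application only needs to reproduce the same behavior: a 200 response on /healthz while it is able to serve traffic.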

Enable and start HAProxy:

sudo systemctl enable haproxy
sudo systemctl start haproxy

To test the failover, stop the C application service on one of the application nodes:

# On app-1
sudo systemctl stop my_c_app

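HAProxy will detect that the instance is unhealthy (either via HTTP health check or TCP check) and will stop sending traffic to it. With HAProxy's default check settings (one check every 2 seconds, three consecutive failures to mark a server down), this takes roughly six seconds. Traffic will be automatically routed to the healthy instance (app-2). When you restart the service on app-1, HAProxy will re-add it to the pool after it passes health checks. Note that systemctl stop is a deliberate stop, so Restart=always does not bring the service back; to exercise the automatic restart instead, kill the process (for example with sudo systemctl kill my_c_app).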
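
How quickly a server leaves and rejoins the pool depends on three server-line settings: inter (time between checks), fall (consecutive failures before a server is marked down), and rise (consecutive successes before it is marked up again). The sketch below is a simplified model of that behavior using HAProxy's default values (inter 2s, fall 3, rise 2); ServerHealth is an illustrative class, not HAProxy's actual implementation.

```python
class ServerHealth:
    """Simplified model of how consecutive check results flip a server up or down."""

    def __init__(self, rise=2, fall=3):
        self.rise = rise
        self.fall = fall
        self.up = True
        self.streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed):
        """Feed one health-check result; return whether the server is up."""
        if check_passed == self.up:
            self.streak = 0
        else:
            self.streak += 1
            threshold = self.fall if self.up else self.rise
            if self.streak >= threshold:
                self.up = not self.up
                self.streak = 0
        return self.up


def detection_window_ms(inter_ms=2000, fall=3):
    """Approximate time until a dead server is marked down."""
    return inter_ms * fall


health = ServerHealth()
print([health.record(False) for _ in range(3)])  # [True, True, False]
print([health.record(True) for _ in range(2)])   # [False, True]
print(detection_window_ms())                     # 6000
```

To shorten the window, the settings can be given on each server line, for example check inter 1s fall 2 rise 2, at the cost of more check traffic and a higher chance of flapping on brief hiccups.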

Integrating Redis and C Application Failover Strategies

The most robust disaster recovery strategy involves combining these two approaches. Your C application instances, managed by systemd and load-balanced by HAProxy, should connect to the highly available Redis cluster managed by Sentinel.

When configuring your C application (or its connection logic), it should be aware of the Redis Sentinel endpoint. Many Redis client libraries support Sentinel discovery. If your application’s client library doesn’t directly support Sentinel, you can implement a simple discovery mechanism:

Client-Side Redis Sentinel Discovery (Conceptual Python Example)

This Python snippet illustrates how a client might discover the current Redis master via Sentinel using the redis-py library. Your C application would need similar logic, either through a C Redis client library that supports Sentinel or through a proxy layer that implements it.

from redis.exceptions import RedisError
from redis.sentinel import Sentinel

# List of Sentinel nodes
SENTINELS = [('192.168.1.20', 26379), ('192.168.1.21', 26379), ('192.168.1.22', 26379)]
MASTER_NAME = 'mymaster'

def get_redis_master():
    """Return the current master's address and a client for it, or None on failure."""
    try:
        # Initialize a Sentinel client
        sentinel = Sentinel(SENTINELS, socket_timeout=0.5)

        # Ask the Sentinels for the current master; returns a (host, port) tuple
        host, port = sentinel.discover_master(MASTER_NAME)

        # Get a client for the master. It looks the master up again through
        # Sentinel whenever it has to reconnect.
        master = sentinel.master_for(MASTER_NAME, socket_timeout=0.5)

        # Test connection by pinging
        master.ping()

        # Return connection details together with the Sentinel-managed client
        return {'host': host, 'port': port, 'client': master}
    except RedisError as e:
        # Covers MasterNotFoundError, connection errors, and timeouts
        print(f"Error connecting to Redis Sentinel or master: {e}")
        return None

if __name__ == "__main__":
    redis_info = get_redis_master()
    if redis_info:
        print(f"Current Redis Master: {redis_info['host']}:{redis_info['port']}")
        # In a real application, keep using the Sentinel-managed client rather
        # than a fixed host and port, so reconnects after a failover reach the
        # new master. For example:
        # redis_info['client'].set('mykey', 'myvalue')
    else:
        print("Failed to get Redis master information.")

In a C application, you would typically keep the Sentinel addresses and the master name in a configuration file, rather than a fixed Redis host and port. When a connection error occurs, the application should query the Sentinels again for the current master and reconnect. This logic should be robust: retry with a short delay, try each Sentinel in turn, and expect a window of a few seconds during a failover in which no master is reachable.
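
For a client library without Sentinel support, the discovery step itself is small: send SENTINEL get-master-addr-by-name to any Sentinel and read back a two-element reply holding the host and port. The sketch below shows that exchange at the protocol level using only sockets, so it maps fairly directly onto C. The function names are illustrative, and it assumes unauthenticated Sentinels and a reply that arrives in a single read.

```python
import socket


def encode_command(*args):
    """Encode a command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = str(arg).encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def parse_master_reply(reply):
    """Parse the reply into (host, port); return None for nil or malformed replies."""
    lines = reply.split(b"\r\n")
    # Expected layout: *2, $<len>, host, $<len>, port, ''
    if len(lines) < 5 or lines[0] != b"*2":
        return None
    return lines[2].decode(), int(lines[4])


def discover_master(sentinels, master_name, timeout=0.5):
    """Ask each Sentinel in turn; return the first (host, port) answer, or None."""
    request = encode_command("SENTINEL", "get-master-addr-by-name", master_name)
    for host, port in sentinels:
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(request)
                # Simplification: assumes the short reply arrives in one read.
                reply = sock.recv(4096)
        except OSError:
            continue  # this Sentinel is unreachable; try the next one
        address = parse_master_reply(reply)
        if address:
            return address
    return None
```

A call such as discover_master([("192.168.1.20", 26379), ("192.168.1.21", 26379), ("192.168.1.22", 26379)], "mymaster") returns the current master address, or None if no Sentinel answered. The application would call it at startup and again after any connection error to the master.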

Monitoring and Alerting

Effective disaster recovery is incomplete without comprehensive monitoring. Implement checks for:

Redis Sentinel health (number of masters down, number of sentinels available).
Redis master and replica status (connected, replication lag).
HAProxy backend health (number of available servers).
Application-level metrics (request latency, error rates).
System resource utilization (CPU, memory, disk I/O) on all nodes.
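
As an example of the first check in the list, the output of redis-cli -p 26379 INFO Sentinel contains one masterN line per monitored master, with its status and the number of replicas and Sentinels known for it. The sketch below turns that line into alert messages; the function names and the expected counts (2 replicas, 3 Sentinels) are assumptions that match the setup in this article.

```python
def parse_sentinel_masters(info_text):
    """Parse the 'masterN:' lines of INFO Sentinel output into a list of dicts."""
    masters = []
    for line in info_text.splitlines():
        prefix, sep, fields = line.strip().partition(":")
        # Only lines such as "master0:name=...,status=..." are of interest.
        if not sep or not prefix.startswith("master"):
            continue
        if not prefix[len("master"):].isdigit():
            continue
        pairs = (pair.split("=", 1) for pair in fields.split(",") if "=" in pair)
        masters.append(dict(pairs))
    return masters


def sentinel_alerts(masters, expected_replicas=2, expected_sentinels=3):
    """Return alert messages; an empty list means nothing to report."""
    alerts = []
    for master in masters:
        name = master.get("name", "?")
        if master.get("status") != "ok":
            alerts.append(f"{name}: status is {master.get('status')}")
        if int(master.get("slaves", 0)) < expected_replicas:
            alerts.append(f"{name}: only {master.get('slaves', 0)} replicas known")
        if int(master.get("sentinels", 0)) < expected_sentinels:
            alerts.append(f"{name}: only {master.get('sentinels', 0)} sentinels known")
    return alerts
```

A cron job or a monitoring agent could run the command on each Sentinel node, pass the output through these two functions, and raise an alert whenever the returned list is not empty.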

Tools like Prometheus with Alertmanager, Datadog, or Nagios can be integrated to provide real-time insights and trigger alerts for any anomalies, allowing for proactive intervention before a full-blown disaster occurs.