Server Monitoring Best Practices: Keeping Your Ruby App and Redis Clusters Alive on OVH

Proactive Redis Cluster Health Checks with `redis-cli` and Custom Scripts

Maintaining the health of your Redis clusters, especially in a distributed environment like OVH, requires more than just basic uptime checks. We need to go deeper, monitoring key performance indicators (KPIs) and ensuring cluster integrity. A common pitfall is relying solely on external HTTP checks, which tell you nothing about Redis’s internal state or its ability to serve requests efficiently. This section details how to leverage `redis-cli` and custom scripting for robust, proactive Redis monitoring.

Our primary tool for this is `redis-cli`. We’ll use it to execute commands that reveal critical cluster information. For automated checks, we’ll wrap these commands in shell scripts that can be scheduled via cron or integrated into a larger monitoring system like Prometheus with `node_exporter`’s textfile collector.

Essential `redis-cli` Commands for Cluster Health

First, ensure you have `redis-cli` installed on your monitoring server or a bastion host that has network access to your Redis cluster nodes. The commands below assume a standard Redis cluster setup with at least one master and one replica per shard.

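1. Cluster State: The most fundamental check is whether the cluster is in an `ok` state. This means the queried node sees every hash slot served by a reachable node and will accept queries. In the commands below, `<host>` and `<port>` stand for any node of your cluster.

redis-cli -c -h <host> -p <port> cluster info | grep cluster_state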

Expected Output: `cluster_state:ok`

2. Node Count and Status: Verify that the expected number of nodes are online and participating in the cluster.

redis-cli -c -h <host> -p <port> cluster nodes | grep -cw connected

This counts the nodes whose cluster bus link is `connected` from the queried node's point of view (the node's own `myself` line always counts). If you have 6 nodes (3 masters, 3 replicas), this should output `6`. You can also parse the full `cluster nodes` output to check each node's link state and its `fail` flags.
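Parsing the full `cluster nodes` output can be sketched as a small shell function. This is a minimal sketch that assumes the standard field order (flags in field 3, link state in field 8); the function name `unhealthy_nodes` and the output format are made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: reads `cluster nodes` output on stdin and prints
# "<node-id> <address> <reason>" for every node that looks unhealthy.
unhealthy_nodes() {
    awk '
        # Field 3 is the comma-separated flag list. Match "fail" or "fail?"
        # as a whole flag so that "nofailover" is not reported by mistake.
        $3 ~ /(^|,)fail\??(,|$)/ { print $1, $2, "flagged:" $3; next }
        # Field 8 is the cluster bus link state seen by the queried node.
        $8 != "connected"        { print $1, $2, "link:" $8 }
    '
}

# Usage against a live cluster (host and port are placeholders):
#   redis-cli -h <host> -p <port> cluster nodes | unhealthy_nodes
```

An empty result means every node is linked and none is flagged as failing, which makes the function easy to wrap in a cron check that alerts on any output.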

3. Slot Distribution: Ensure all hash slots are covered and assigned to masters.

redis-cli -c -h <host> -p <port> cluster info | grep cluster_slots_assigned

Expected Output: `cluster_slots_assigned:16384` (the total number of slots in a Redis cluster).
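Cluster state and slot coverage are both reported by `cluster info`, so one call can serve both checks. The sketch below assumes the reply is passed in as text; the helper names `info_field` and `cluster_healthy` are hypothetical. It also strips the carriage returns that Redis puts at the end of each reply line, which otherwise make string comparisons fail silently.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one "name:value" field from
# `cluster info` (or INFO) output read on stdin. Replies use CRLF line
# endings, so the carriage returns are removed first.
info_field() {
    tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Hypothetical check: succeed only if the reply reports state "ok"
# and all 16384 hash slots as assigned.
cluster_healthy() {
    local info=$1
    [ "$(printf '%s\n' "$info" | info_field cluster_state)" = "ok" ] &&
    [ "$(printf '%s\n' "$info" | info_field cluster_slots_assigned)" = "16384" ]
}

# Usage against a live cluster (host and port are placeholders):
#   cluster_healthy "$(redis-cli -h <host> -p <port> cluster info)" || echo "ALERT"
```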

4. Master/Replica Synchronization: For high availability, replicas must be in sync with their masters. This is crucial to prevent data loss during failovers.

redis-cli -c -h <host> -p <port> cluster replicas <master-node-id> | grep -cw connected

This covers one master at a time and only reports link state as seen by the queried node. A better approach is to iterate through each master and ask its replicas directly:

#!/bin/bash

REDIS_HOST=""   # any reachable cluster node
REDIS_PORT=""

# Snapshot the topology once so every check below sees the same view
NODES=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" cluster nodes 2>/dev/null)
if [ -z "$NODES" ]; then
    echo "ALERT: Could not read cluster topology from $REDIS_HOST:$REDIS_PORT."
    exit 2
fi

# Field 3 holds the flags; masters carry the "master" flag
MASTER_NODES=$(echo "$NODES" | awk '$3 ~ /master/ {print $1}')

ALL_MASTERS_SYNCED=true

for MASTER_ID in $MASTER_NODES; do
    # Replicas list their master's node ID in field 4; field 2 is ip:port@cport
    REPLICAS=$(echo "$NODES" | awk -v id="$MASTER_ID" '$4 == id {print $2}')
    if [ -z "$REPLICAS" ]; then
        echo "ALERT: Master $MASTER_ID has no replicas!"
        ALL_MASTERS_SYNCED=false
        continue
    fi

    for REPLICA_ADDR in $REPLICAS; do
        # Drop the cluster bus port (and optional hostname), then split host and port
        REPLICA_ADDR=${REPLICA_ADDR%%@*}
        REPLICA_HOST=${REPLICA_ADDR%:*}
        REPLICA_PORT=${REPLICA_ADDR##*:}

        # Ask the replica itself. INFO is read-only, unlike CLUSTER REPLICATE,
        # which would reconfigure the node it is sent to.
        REPLICA_INFO=$(redis-cli -h "$REPLICA_HOST" -p "$REPLICA_PORT" info replication 2>/dev/null | tr -d '\r')
        if echo "$REPLICA_INFO" | grep -q "^master_link_status:"; then
            if ! echo "$REPLICA_INFO" | grep -q "^master_link_status:up"; then
                echo "ALERT: Replica $REPLICA_ADDR for master $MASTER_ID is not connected."
                ALL_MASTERS_SYNCED=false
            fi
            # Lag is not measured here: that needs the master's master_repl_offset
            # compared with this replica's slave_repl_offset.
        else
            echo "ALERT: Could not retrieve replication info for replica $REPLICA_ADDR."
            ALL_MASTERS_SYNCED=false
        fi
    done
done

if $ALL_MASTERS_SYNCED; then
    echo "All replicas report an active link to their master."
    exit 0
else
    echo "Redis cluster synchronization issues detected."
    exit 1
fi

Note: The above script is a starting point. Accurately determining replication lag requires fetching the master’s current `master_repl_offset` and comparing it with the offset each replica reports. A simpler, albeit less precise, check is to confirm that each replica’s `master_link_status` is `up`.
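The offset comparison can be sketched as follows. This assumes you have already captured the `info replication` output of a master and of one of its replicas as text; the helper names `info_field` and `replication_lag_bytes` are hypothetical. The two offsets are sampled at slightly different moments, so a small non-zero result on a busy cluster is expected and the alert threshold is a judgment call.

```shell
#!/bin/bash
# Hypothetical helper: print one "name:value" field from INFO output on stdin.
info_field() {
    tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Hypothetical helper: given the master's and the replica's
# `info replication` text, print how many bytes the replica is behind.
replication_lag_bytes() {
    local master_info=$1 replica_info=$2
    local master_offset replica_offset
    master_offset=$(printf '%s\n' "$master_info" | info_field master_repl_offset)
    replica_offset=$(printf '%s\n' "$replica_info" | info_field slave_repl_offset)
    # Refuse to guess if either side did not report an offset
    [ -n "$master_offset" ] && [ -n "$replica_offset" ] || return 1
    echo $(( master_offset - replica_offset ))
}

# Usage (hosts and ports are placeholders):
#   replication_lag_bytes \
#       "$(redis-cli -h <master-host> -p <port> info replication)" \
#       "$(redis-cli -h <replica-host> -p <port> info replication)"
```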

Integrating with Prometheus via `node_exporter`

To feed these metrics into Prometheus, we can use the `node_exporter`’s textfile collector. The collector does not run scripts itself: on every scrape it reads the files ending in `.prom` from a designated directory (e.g., `/var/lib/node_exporter/textfile_collector/`) and exposes their contents, which must be in Prometheus text format. Our script's job is to write such a file.

Create a script, for example, `/opt/scripts/redis_cluster_exporter.sh`:

#!/bin/bash

REDIS_HOST=""
REDIS_PORT=""
TEXTFILE_DIR="/var/lib/node_exporter/textfile_collector"
PROM_FILE="$TEXTFILE_DIR/redis_cluster.prom"
LABELS="host=\"$REDIS_HOST\",port=\"$REDIS_PORT\""

# Ensure the textfile directory exists
mkdir -p "$TEXTFILE_DIR"

# Run a command on the seed node; replies use CRLF line endings, so strip \r
rcli() { redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" "$@" 2>/dev/null | tr -d '\r'; }

# Print the value of one "name:value" field from the reply on stdin
field() { awk -F: -v key="$1" '$1 == key {print $2}'; }

# Everything echoed inside this group goes to a temporary file
{
    # --- Cluster State ---
    CLUSTER_INFO=$(rcli cluster info)
    if [ "$(echo "$CLUSTER_INFO" | field cluster_state)" = "ok" ]; then
        echo "redis_cluster_state{$LABELS} 1"
    else
        echo "redis_cluster_state{$LABELS} 0"
    fi

    # --- Node Count (nodes whose cluster bus link is up) ---
    NODES=$(rcli cluster nodes)
    NODE_COUNT=$(echo "$NODES" | grep -cw connected)
    echo "redis_cluster_nodes_total{$LABELS} $NODE_COUNT"

    # --- Slot Count (assigned hash slots, 16384 when fully covered) ---
    SLOT_COUNT=$(echo "$CLUSTER_INFO" | field cluster_slots_assigned)
    echo "redis_cluster_slots_total{$LABELS} ${SLOT_COUNT:-0}"

    # --- Replication Status (Simplified) ---
    # Accurate lag is harder to measure; this only checks each replica's link to its master.
    MASTERS=$(echo "$NODES" | awk '$3 ~ /master/ {print $1}')
    REPLICA_FAILURES=0
    for MASTER_ID in $MASTERS; do
        # Replicas carry their master's ID in field 4; emit "id=ip:port@cport"
        REPLICAS=$(echo "$NODES" | awk -v id="$MASTER_ID" '$4 == id {print $1 "=" $2}')
        if [ -z "$REPLICAS" ]; then
            # No replicas for this master; keep the label set consistent
            echo "redis_cluster_replica_sync_status{$LABELS,master_id=\"$MASTER_ID\",replica_id=\"\"} 0"
            REPLICA_FAILURES=$((REPLICA_FAILURES + 1))
            continue
        fi
        for REPLICA in $REPLICAS; do
            REPLICA_ID=${REPLICA%%=*}
            REPLICA_ADDR=${REPLICA#*=}
            REPLICA_ADDR=${REPLICA_ADDR%%@*}
            # Ask the replica itself; INFO is read-only and leaves the topology alone
            REPLICA_INFO=$(redis-cli -h "${REPLICA_ADDR%:*}" -p "${REPLICA_ADDR##*:}" info replication 2>/dev/null)
            if echo "$REPLICA_INFO" | grep -q "master_link_status:up"; then
                echo "redis_cluster_replica_sync_status{$LABELS,master_id=\"$MASTER_ID\",replica_id=\"$REPLICA_ID\"} 1"
            else
                echo "redis_cluster_replica_sync_status{$LABELS,master_id=\"$MASTER_ID\",replica_id=\"$REPLICA_ID\"} 0"
                REPLICA_FAILURES=$((REPLICA_FAILURES + 1))
            fi
        done
    done

    # Overall replica sync status
    if [ "$REPLICA_FAILURES" -gt 0 ]; then
        echo "redis_cluster_overall_replica_sync{$LABELS} 0"
    else
        echo "redis_cluster_overall_replica_sync{$LABELS} 1"
    fi

    # --- Basic Performance Metrics (Optional, seed node only) ---
    # Each metric is skipped when the node did not answer, rather than reported as 0.
    MEMORY_USAGE=$(rcli info memory | field used_memory)
    [ -n "$MEMORY_USAGE" ] && echo "redis_memory_bytes{$LABELS} $MEMORY_USAGE"

    CONNECTED_CLIENTS=$(rcli info clients | field connected_clients)
    [ -n "$CONNECTED_CLIENTS" ] && echo "redis_connected_clients{$LABELS} $CONNECTED_CLIENTS"

    # total_commands_processed is a running total; export it as a counter
    # and let Prometheus derive the per-second rate with rate().
    TOTAL_COMMANDS=$(rcli info stats | field total_commands_processed)
    [ -n "$TOTAL_COMMANDS" ] && echo "redis_commands_processed_total{$LABELS} $TOTAL_COMMANDS"
} > "$PROM_FILE.$$"

# Rename atomically so node_exporter never scrapes a half-written file
mv "$PROM_FILE.$$" "$PROM_FILE"

Make the script executable:

chmod +x /opt/scripts/redis_cluster_exporter.sh

`node_exporter` only reads the resulting file; it does not run the script. Schedule the script with a cron job:

crontab -e
# Add this line to run every minute
* * * * * /opt/scripts/redis_cluster_exporter.sh > /dev/null 2>&1

Ensure your `node_exporter` configuration includes the textfile collector and points to the correct directory. In `node_exporter`’s startup command, you’d typically see something like:

/usr/local/bin/node_exporter --collector.textfile.directory="/var/lib/node_exporter/textfile_collector"

Monitoring Ruby Application Performance with `Prometheus.rb` and `Rack::Metric`

Your Ruby application is the consumer of your Redis cluster. Its performance is directly tied to Redis health and its own internal efficiency. We need to instrument the application to expose relevant metrics.

Instrumenting with `Prometheus.rb`

The `prometheus-client` gem (developed in the `prometheus/client_ruby` repository and referred to here as `Prometheus.rb`) is the standard for instrumenting Ruby applications for Prometheus. It allows you to define various metric types (Counters, Gauges, Histograms, Summaries) and expose them via an HTTP endpoint.

Add the gem to your `Gemfile`:

gem 'prometheus-client'

Then, run `bundle install`.

Initialize the client and define your metrics. This is typically done in an initializer file (e.g., `config/initializers/prometheus.rb` for Rails):

require 'prometheus/client'

# Initialize the client
# (assumes a client version whose config object exposes a logger setting)
Prometheus::Client.config.logger = Rails.logger # Or your preferred logger

# Define metrics
# Counter for total requests
REQUEST_COUNTER = Prometheus::Client::Counter.new(
  :http_requests_total,
  docstring: 'Total HTTP requests processed',
  labels: [:method, :path, :status]
)

# Histogram for request duration
REQUEST_DURATION_HISTOGRAM = Prometheus::Client::Histogram.new(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration in seconds',
  labels: [:method, :path, :status],
  buckets: Prometheus::Client::Histogram::DEFAULT_BUCKETS # Or define custom buckets
)

# Gauge for current active connections (example)
ACTIVE_CONNECTIONS = Prometheus::Client::Gauge.new(
  :app_active_connections,
  docstring: 'Number of active application connections'
)

# Register metrics
Prometheus::Client.registry.register(REQUEST_COUNTER)
Prometheus::Client.registry.register(REQUEST_DURATION_HISTOGRAM)
Prometheus::Client.registry.register(ACTIVE_CONNECTIONS)

# Example: Incrementing active connections on connection establishment
# (This would depend on your connection pooling mechanism)
# MyConnectionPool.on_connect { ACTIVE_CONNECTIONS.increment(labels: {}) }
# MyConnectionPool.on_disconnect { ACTIVE_CONNECTIONS.decrement(labels: {}) }

# Example: Incrementing active connections in a Rack middleware
# (See Rack::Metric below)

Using `Rack::Metric` for Automatic Request Metrics

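The `rack-metric` gem, built on top of the Prometheus Ruby client, automatically instruments your Rack application (including Rails) to collect HTTP request metrics. If that gem is not an option for you, the client gem ships its own Rack middleware, `Prometheus::Middleware::Collector`, which records comparable request counters and duration histograms under an `http_server` prefix by default.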

Add to `Gemfile`:

gem 'rack-metric'

Configure it in `config/application.rb` or `config/environments/*.rb`:

# config/application.rb or config/environments/production.rb
module YourApp
  class Application < Rails::Application
    # ... other configurations ...

    config.middleware.use Rack::Metric,
      registry: Prometheus::Client.registry,
      prefix: 'rails' # Optional prefix for metrics
  end
end

This middleware will automatically:

Increment `rails_http_requests_total` (or your configured prefix) for each request.
Measure and record `rails_http_request_duration_seconds` for each request.
It can also be configured to track other metrics like active connections if you provide custom collectors.

Exposing Metrics Endpoint

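You need an endpoint for Prometheus to scrape. For Rails, you can add a route that hands the request to the client gem's `Prometheus::Middleware::Exporter`:

# config/routes.rb
require 'prometheus/middleware/exporter'

Rails.application.routes.draw do
  # ... other routes ...

  # Prometheus metrics endpoint. The exporter answers /metrics itself and
  # passes any other path to the fallback app given here.
  not_found = ->(_env) { [404, { 'content-type' => 'text/plain' }, ['Not Found']] }
  get '/metrics', to: Prometheus::Middleware::Exporter.new(not_found)
end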

Now, Prometheus can scrape `http://your-app-host:port/metrics` to collect application-level metrics.

OVH Specific Considerations: Network, Security, and Resource Limits

Operating in a cloud environment like OVH introduces specific challenges and best practices for monitoring. Network latency, security group configurations, and resource quotas can all impact your application’s performance and availability.

Network Latency and Bandwidth Monitoring

OVH instances are provisioned within specific regions and availability zones. High latency between your application servers and your Redis cluster, or between Redis nodes themselves, can degrade performance significantly.

Inter-Instance Latency: Use tools like `ping` and `mtr` from your application servers to your Redis nodes. Monitor these metrics over time.
Bandwidth Saturation: Monitor network traffic on your instances using `iftop`, `nload`, or cloud provider-specific tools. High bandwidth usage can indicate inefficient data transfer or potential denial-of-service attacks.
OVH Network Monitoring: Familiarize yourself with OVH’s network monitoring tools within the OVHcloud Control Panel. These can provide insights into traffic patterns and potential network issues at the infrastructure level.
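To monitor latency over time rather than by eye, the `ping` summary can be converted into Prometheus text format and written to the textfile collector directory. The sketch below assumes the summary format printed by Linux iputils `ping`; the function name `ping_to_prom` and both metric names are made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: convert the summary lines of Linux iputils `ping`
# output (read on stdin) into Prometheus text-format lines for one target.
ping_to_prom() {
    local target=$1
    awk -v t="$target" '
        /packet loss/ {
            # e.g. "5 packets transmitted, 5 received, 0% packet loss, time 4005ms"
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { sub(/%/, "", $i); loss = $i }
            printf "redis_ping_packet_loss_percent{target=\"%s\"} %s\n", t, loss
        }
        /^rtt / {
            # e.g. "rtt min/avg/max/mdev = 0.045/0.052/0.061/0.007 ms"
            split($4, rtt, "/")
            printf "redis_ping_rtt_avg_ms{target=\"%s\"} %s\n", t, rtt[2]
        }
    '
}

# Usage (the address is a placeholder):
#   ping -c 5 -q <redis-node-ip> | ping_to_prom <redis-node-ip>
```

When every packet is lost, `ping` prints no `rtt` line, so only the packet loss metric is emitted for that run.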

Security Groups and Firewall Rules

Incorrectly configured security groups or firewalls are a common cause of connectivity issues. Ensure that:

Your application servers can reach the Redis cluster on the configured port (default 6379).
Redis nodes can reach each other for cluster communication on the cluster bus port, which defaults to the client port plus 10000 (16379 for the default 6379).
Your monitoring server/bastion host can reach the Redis cluster for `redis-cli` checks.
Your Prometheus server can reach the application’s `/metrics` endpoint.

Regularly audit your firewall rules. A common mistake is allowing access from `0.0.0.0/0` for Redis, which is a major security risk. Restrict access to only the necessary IP ranges or security groups.
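A quick way to confirm that a firewall rule behaves as intended is to test the TCP ports from the host that needs access. The following is a minimal sketch using bash's `/dev/tcp` redirection and coreutils `timeout`; the function names `port_open` and `check_redis_ports` are hypothetical. The cluster bus port only needs to be reachable between Redis nodes, so run that part from a Redis node rather than from an application server.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within the timeout (seconds). Nothing is sent to the remote side.
port_open() {
    local host=$1 port=$2 wait=${3:-2}
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Hypothetical helper: check the client port and the cluster bus port
# (client port + 10000 by default) of one Redis node.
check_redis_ports() {
    local host=$1 port=${2:-6379} rc=0
    port_open "$host" "$port" ||
        { echo "BLOCKED $host:$port"; rc=1; }
    port_open "$host" "$((port + 10000))" ||
        { echo "BLOCKED $host:$((port + 10000))"; rc=1; }
    return $rc
}

# Usage (the address is a placeholder):
#   check_redis_ports <redis-node-ip> 6379
```

The same function is useful in the other direction: run it from a host that should not have access and treat a successful connection as the alert condition.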

Resource Limits and Quotas

OVH, like any cloud provider, has resource limits (CPU, RAM, disk I/O, network egress). Exceeding these limits can lead to performance throttling or outright service disruption.

Monitoring Instance Resources:

CPU/Memory: Use `node_exporter` on your application and Redis instances to collect CPU, memory, and swap usage. Set up alerts in Prometheus when these metrics approach critical thresholds (e.g., > 80% for sustained periods).
Disk I/O: Monitor disk I/O wait times and throughput. High I/O wait can indicate a bottleneck, especially for Redis if it’s configured to persist data to disk frequently.
OVH Instance Metrics: Leverage OVH’s built-in instance monitoring to get a high-level view of resource utilization.

Redis Specific Resource Monitoring:

Memory Usage: As shown in the `redis_cluster_exporter.sh` script, monitor `used_memory`. Configure alerts for high memory usage, especially if approaching `maxmemory` limits.
CPU Usage: While Redis is single-threaded for command execution, high CPU can still occur due to network I/O, background saving (RDB/AOF), or replication. Monitor CPU usage on Redis instances.
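Alerting on memory relative to `maxmemory` needs both values, which `INFO memory` reports together. A minimal sketch, with the hypothetical function name `memory_used_percent`, could look like this; it deliberately fails when `maxmemory` is 0, because that value means no limit is configured and a percentage would be meaningless.

```shell
#!/bin/bash
# Hypothetical helper: read `INFO memory` output on stdin and print used
# memory as a whole-number percentage of maxmemory. Prints nothing and
# returns non-zero when maxmemory is 0 or a field is missing.
memory_used_percent() {
    tr -d '\r' | awk -F: '
        $1 == "used_memory" { used = $2 }
        $1 == "maxmemory"   { limit = $2 }
        END {
            if (used == "" || limit + 0 == 0) exit 1
            printf "%d\n", used * 100 / limit
        }
    '
}

# Usage (host and port are placeholders):
#   redis-cli -h <host> -p <port> info memory | memory_used_percent
```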

Alerting Strategy

A robust alerting strategy is paramount. Combine Prometheus Alertmanager with your monitoring setup.

Key Alerts to Configure:

Redis Cluster Unreachable: Alert if `redis_cluster_state` metric drops to 0.
Redis Replica Sync Issues: Alert if `redis_cluster_overall_replica_sync` metric drops to 0, or if individual `redis_cluster_replica_sync_status` metrics are 0 for a sustained period.
High Redis Memory Usage: Alert when `redis_memory_bytes` exceeds a defined percentage of `maxmemory` or total instance RAM.
High Application CPU/Memory Usage: Alert when `node_exporter` metrics for application instances exceed thresholds.
High Request Latency: Alert on `rails_http_request_duration_seconds` histogram when p95 or p99 latencies exceed acceptable values.
High Error Rate: Alert when `rails_http_requests_total` with status codes 5xx or 4xx increases significantly.
Network Latency/Packet Loss: Integrate network monitoring tools or custom scripts that periodically ping/trace to critical endpoints and alert on increased latency or packet loss.

By implementing these detailed monitoring practices, you can ensure the stability and performance of your Ruby applications and Redis clusters on OVH, moving from reactive firefighting to proactive system management.