Server Monitoring Best Practices: Keeping Your Ruby App and Redis Clusters Alive on Linode

Proactive Redis Cluster Health Checks with `redis-cli` and Custom Scripts

Maintaining the health of your Redis cluster is paramount for any high-throughput Ruby application. Relying solely on Linode’s basic CPU/RAM metrics is insufficient. We need granular checks that understand Redis’s internal state, particularly for clustered deployments. A common pitfall is overlooking network partitions or node failures that `redis-cli` can readily detect.

The `redis-cli --cluster check <host:port>` command is your first line of defense. Pointed at any reachable node, it verifies that the nodes agree on the slot configuration, that no slots are left open in a migrating or importing state, and that all 16384 hash slots are covered. However, this command is often run manually. For continuous monitoring, we’ll integrate it into a script that can be scheduled via cron.

Automating Redis Cluster Checks with a Bash Script

Let’s craft a robust bash script that connects to a seed node in your Redis cluster and performs the necessary checks. This script will output specific error codes or messages that can be parsed by a monitoring system like Prometheus or Nagios.

First, ensure you have `redis-cli` installed on the monitoring server. On Debian/Ubuntu systems, this is typically part of the `redis-tools` package.

Here’s the script. We’ll define the seed node’s IP and port, and then execute the cluster check. The output is then filtered to identify potential issues. We’ll also add a basic check for node connectivity using `redis-cli ping`.

#!/bin/bash

# Configuration
REDIS_HOST="192.168.1.100" # Replace with an IP of one of your Redis nodes
REDIS_PORT="7000"
# --cluster check takes the node as a host:port argument, not via -h/-p
CLUSTER_CHECK_CMD="redis-cli --cluster check $REDIS_HOST:$REDIS_PORT"
PING_CMD="redis-cli -h $REDIS_HOST -p $REDIS_PORT ping"

# --- Basic Node Connectivity Check ---
echo "--- PING Check ---"
# Compare the reply itself: a healthy node answers PONG
if [ "$($PING_CMD 2>/dev/null)" != "PONG" ]; then
    echo "ERROR: Redis node $REDIS_HOST:$REDIS_PORT is unreachable."
    exit 1
fi
echo "SUCCESS: Redis node $REDIS_HOST:$REDIS_PORT is reachable."

# --- Cluster Health Check ---
echo ""
echo "--- Cluster Check ---"
CLUSTER_CHECK_OUTPUT=$($CLUSTER_CHECK_CMD 2>&1)
CLUSTER_CHECK_EXIT_CODE=$?

if [ $CLUSTER_CHECK_EXIT_CODE -ne 0 ]; then
    echo "ERROR: redis-cli --cluster check failed with exit code $CLUSTER_CHECK_EXIT_CODE."
    echo "Output:"
    echo "$CLUSTER_CHECK_OUTPUT"
    exit 1
fi

# Parse output for common issues
if echo "$CLUSTER_CHECK_OUTPUT" | grep -q "ERR"; then
    echo "WARNING: redis-cli --cluster check reported errors."
    echo "$CLUSTER_CHECK_OUTPUT" | grep "ERR"
    # Depending on your monitoring, you might want to exit 1 here
    # exit 1
fi

if echo "$CLUSTER_CHECK_OUTPUT" | grep -q "fail"; then
    echo "CRITICAL: redis-cli --cluster check reported node failures."
    echo "$CLUSTER_CHECK_OUTPUT" | grep "fail"
    exit 1
fi

# A healthy cluster prints "[OK] All 16384 slots covered."
if ! echo "$CLUSTER_CHECK_OUTPUT" | grep -q "All 16384 slots covered"; then
    echo "WARNING: Not all 16384 hash slots are covered."
    echo "$CLUSTER_CHECK_OUTPUT" | grep -i "slots"
    # exit 1
fi

echo "SUCCESS: Redis cluster check passed."
echo "$CLUSTER_CHECK_OUTPUT"
exit 0

To make this script executable, run:

chmod +x /path/to/your/redis_cluster_check.sh

Schedule this script using cron. For example, to run it every 5 minutes:

*/5 * * * * /path/to/your/redis_cluster_check.sh >> /var/log/redis_cluster_check.log 2>&1
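
If the check feeds Nagios rather than a log file, the plugin convention of distinct exit codes (0 OK, 1 WARNING, 2 CRITICAL) is more useful than a single `exit 1`. The following is a minimal Ruby sketch of that classification, not a drop-in replacement for the script above. The sample texts are abbreviated illustrations of the `[OK]`, `[WARNING]`, and `[ERR]` line prefixes that `redis-cli --cluster check` prints, and the function and constant names are our own.

```ruby
# Nagios-style plugin exit codes.
NAGIOS_OK = 0
NAGIOS_WARNING = 1
NAGIOS_CRITICAL = 2

# Classify the text printed by `redis-cli --cluster check`.
# Returns [exit_code, message].
def classify_cluster_check(output)
  lines = output.lines.map(&:strip)
  errors = lines.select { |line| line.start_with?('[ERR]') }
  warnings = lines.select { |line| line.start_with?('[WARNING]') }

  return [NAGIOS_CRITICAL, errors.join('; ')] unless errors.empty?
  unless output.include?('All 16384 slots covered')
    return [NAGIOS_CRITICAL, 'slot coverage confirmation not found']
  end
  return [NAGIOS_WARNING, warnings.join('; ')] unless warnings.empty?

  [NAGIOS_OK, 'cluster check passed']
end

# Abbreviated sample outputs (illustrative, not captured from a live cluster).
HEALTHY_CHECK = <<~OUT
  >>> Performing Cluster Check (using node 192.168.1.100:7000)
  [OK] All nodes agree about slots configuration.
  >>> Check slots coverage...
  [OK] All 16384 slots covered.
OUT

BROKEN_CHECK = <<~OUT
  >>> Check slots coverage...
  [ERR] Not all 16384 slots are covered by nodes.
OUT

[HEALTHY_CHECK, BROKEN_CHECK].each do |sample|
  code, message = classify_cluster_check(sample)
  puts "exit #{code}: #{message}"
end
```

In a real plugin you would capture the command output (for example with `Open3.capture2e`) and finish with `exit code`.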

Monitoring Ruby Application Performance with `rack-mini-profiler` and APM Tools

Your Ruby application’s performance is directly tied to its interaction with Redis and other services. While `rack-mini-profiler` is excellent for development and staging, production monitoring requires more sophisticated Application Performance Monitoring (APM) tools. These tools provide deep insights into request latency, database queries, external service calls, and, crucially, Redis interactions.

For a typical Rails application, integrating an APM agent is straightforward. We’ll focus on common patterns and what to look for in the APM dashboard.

Key Metrics to Monitor in Your Ruby App

Request Latency: Overall time taken to serve requests. Look for spikes and long-tail latency.
Database Query Time: Time spent executing SQL queries. N+1 query problems often manifest here.
External Service Calls: Latency and error rates for calls to external APIs, including your Redis cluster.
Redis Command Latency: Specific metrics for `GET`, `SET`, `HGETALL`, etc. This is critical for identifying slow Redis operations.
Throughput: Requests per minute (RPM).
Error Rates: Percentage of requests resulting in errors (5xx, 4xx).
Memory Usage: Application heap size and garbage collection activity.
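
To see why long-tail latency deserves its own line on the dashboard, here is a small Ruby sketch that compares the mean with nearest-rank percentiles. The sample durations are made up for illustration, and APM tools compute percentiles for you, usually with more sophisticated estimators.

```ruby
# Nearest-rank percentile: the smallest sample such that at least pct percent
# of all samples are less than or equal to it.
def percentile(samples, pct)
  raise ArgumentError, 'samples must not be empty' if samples.empty?

  sorted = samples.sort
  rank = (pct / 100.0 * sorted.length).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical request durations in milliseconds, with one slow outlier.
LATENCIES_MS = [12, 15, 11, 14, 13, 16, 12, 480, 15, 13].freeze

mean = LATENCIES_MS.sum / LATENCIES_MS.length.to_f
puts "mean=#{mean}ms p50=#{percentile(LATENCIES_MS, 50)}ms p99=#{percentile(LATENCIES_MS, 99)}ms"
```

One slow request drags the mean far above what a typical user sees, while the median stays put and the p99 exposes the outlier directly.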

Popular APM tools for Ruby include New Relic, Datadog APM, AppSignal, and Scout APM. The setup generally involves adding a gem to your `Gemfile` and configuring an agent.

Example using Datadog APM (requires Datadog agent running on your Linode instance):

1. Add the gem:

# Gemfile
gem 'ddtrace' # 1.x series; from version 2.0 the library is published as the 'datadog' gem

2. Run `bundle install`.

3. Configure the agent. This often involves setting environment variables or a configuration file. For Rails, you might initialize it in `config/application.rb` or an initializer:

# config/initializers/datadog.rb
if Rails.env.production?
  Datadog.configure do |c|
    c.service = 'my-ruby-app'
    c.env = 'production'
    c.version = '1.2.3' # Optional: your app version
    c.tracing.enabled = true
    c.logger.level = ::Logger::INFO

    # Enable the integrations whose spans should appear in traces
    c.tracing.instrument :rails
    c.tracing.instrument :redis

    # If Datadog agent is running on the same host, default agent address is used.
    # Otherwise, specify the agent host and port:
    # c.agent.host = 'datadog-agent.example.com'
    # c.agent.port = 8126
  end
end

Once integrated, the Datadog dashboard will show detailed traces for your requests, including time spent in Redis operations. You can set up monitors based on these metrics, such as alerting if the average latency for `GET` commands on your Redis cluster exceeds a certain threshold (e.g., 50ms) for more than 5 minutes.

Linode Infrastructure Monitoring and Alerting

Linode provides essential infrastructure metrics, but these need to be augmented with application-specific and service-specific monitoring. We’ll use Linode’s built-in monitoring for CPU, RAM, Disk I/O, and Network traffic, and integrate it with a more robust alerting mechanism.

Configuring Linode Alerts

Linode’s Cloud Manager allows you to set up alerts for various resource utilization thresholds. It’s crucial to set these thoughtfully to avoid alert fatigue.

CPU Utilization: Set an alert for sustained high CPU usage (e.g., > 80% for 15 minutes) on your application servers and Redis nodes. High CPU on Redis can indicate heavy load, inefficient commands, or memory fragmentation.
Memory Utilization: Alert when RAM usage approaches capacity (e.g., > 85%). For Redis, this is critical as it can lead to swapping, which severely degrades performance, or eviction policies kicking in unexpectedly.
Disk I/O: Monitor I/O wait times and disk utilization. High disk I/O on Redis nodes can be a bottleneck, especially if persistence (RDB or AOF) is enabled and heavily used.
Network Traffic: High inbound/outbound traffic can indicate a DDoS attack or simply a surge in application usage.

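When configuring alerts in Linode, specify the target resource (e.g., a specific Linode instance), the metric, the condition (e.g., `greater than`), the threshold value, and the duration. Check which metrics and notification channels your account offers: the per-instance alert settings in Cloud Manager have traditionally covered CPU, disk I/O rate, and network traffic with email notifications, so memory alerts and Slack webhooks may have to come from Longview or from the Prometheus and Grafana stack described later.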
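
The duration part of an alert condition is what keeps short spikes from paging anyone. As a rough illustration of the logic (not how Linode implements it), this Ruby sketch fires only when every sample in the trailing window exceeds the threshold. The method name, the sample format, and the data are assumptions made for the example.

```ruby
# samples: array of [unix_timestamp, value] pairs, oldest first.
# Returns true only if the samples reach back to the start of the window
# and every sample inside the window is above the threshold.
def sustained_breach?(samples, threshold:, duration_s:, now:)
  return false if samples.empty?

  window_start = now - duration_s
  # Without a sample at or before the window start we cannot claim the
  # condition held for the whole duration.
  return false if samples.first[0] > window_start

  recent = samples.select { |timestamp, _| timestamp >= window_start }
  !recent.empty? && recent.all? { |_, value| value > threshold }
end

# Hypothetical CPU percentages sampled every 5 minutes.
CPU_SPIKE     = [[0, 50], [300, 95], [600, 60], [900, 55]].freeze
CPU_SUSTAINED = [[0, 85], [300, 90], [600, 88], [900, 92]].freeze

puts sustained_breach?(CPU_SPIKE, threshold: 80, duration_s: 900, now: 900)
puts sustained_breach?(CPU_SUSTAINED, threshold: 80, duration_s: 900, now: 900)
```

The single 95% reading in the first series does not trigger, which is the behavior you want from a "> 80% for 15 minutes" rule.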

Integrating Prometheus and Grafana for Centralized Observability

For a truly comprehensive monitoring solution, a combination of Prometheus for metrics collection and Grafana for visualization and alerting is a de facto standard. This allows you to aggregate metrics from Linode, your Ruby app (via APM or custom exporters), and your Redis cluster into a single pane of glass.

Exposing Redis Metrics to Prometheus

The `redis-exporter` is a popular choice for exposing Redis metrics in a Prometheus-compatible format. You can run this as a separate service on a dedicated Linode or alongside your Redis instances.

1. **Install `redis-exporter`:** Download the latest release from its GitHub repository (e.g., `redis_exporter-vX.Y.Z.linux-amd64.tar.gz`). Extract it and place the binary in a suitable location (e.g., `/usr/local/bin`).

2. **Run `redis-exporter`:**

./redis_exporter --redis.addr=redis://192.168.1.100:6379 --redis.password=your_redis_password --web.listen-address=":9121"

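Replace `192.168.1.100:6379` with the address of the Redis node you want to scrape, and pass `--redis.password` only if authentication is enabled. Setting the `REDIS_PASSWORD` environment variable instead keeps the password out of the process list. The exporter listens on port 9121 by default, so `--web.listen-address` is shown here only for clarity.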

3. **Configure Prometheus:** Add a scrape job to your `prometheus.yml` configuration:

scrape_configs:
  - job_name: 'redis'
    static_configs:
      - targets: ['192.168.1.100:9121', '192.168.1.101:9121', '192.168.1.102:9121'] # List all your redis-exporter instances
    # Optional: If using service discovery, this would be dynamic

Restart Prometheus, or send it a `SIGHUP` to reload the configuration without a restart.

Integrating Ruby App Metrics with Prometheus

While APM tools are powerful, for direct Prometheus integration you can use the official Ruby client, published as the `prometheus-client` gem. It lets you expose custom application metrics.

1. **Add the gem:**

# Gemfile
gem 'prometheus-client'

2. **Instrument your code:**

# config/initializers/prometheus.rb
require 'prometheus/client'

# Define custom metrics
# Example: Track Redis command counts
REDIS_COMMAND_TOTAL = Prometheus::Client::Counter.new(
  :redis_commands_total,
  docstring: 'Total number of Redis commands executed',
  labels: [:command, :db]
)

# Register metrics with the default registry
Prometheus::Client.registry.register(REDIS_COMMAND_TOTAL)

# In your Redis client wrapper or service object:
# Example using a hypothetical Redis client wrapper
class MyRedisClient
  def get(key)
    result = perform_redis_operation { $redis.get(key) } # Assuming $redis is your Redis client instance
    REDIS_COMMAND_TOTAL.increment(labels: { command: 'GET', db: 'cache' })
    result
  end

  def set(key, value, **options)
    result = perform_redis_operation { $redis.set(key, value, **options) }
    REDIS_COMMAND_TOTAL.increment(labels: { command: 'SET', db: 'cache' })
    result
  end

  private

  def perform_redis_operation
    # Add latency tracking here as well if needed
    yield
  rescue Redis::BaseError => e
    # Log the error (and potentially increment an error counter) before re-raising
    Rails.logger.error("Redis operation failed: #{e.class}: #{e.message}")
    raise
  end
end

# Expose the metrics endpoint with the gem's Rack middleware (serves /metrics by default)
# config/application.rb
# require 'prometheus/middleware/exporter'
# config.middleware.use Prometheus::Middleware::Exporter

3. **Configure Prometheus to scrape your app’s metrics endpoint:**

scrape_configs:
  - job_name: 'ruby_app'
    static_configs:
      - targets: ['your_app_server_ip:3000'] # Assuming metrics endpoint is exposed on port 3000
    metrics_path: '/metrics'
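
The comment in `perform_redis_operation` mentions latency tracking. Here is one way that could look, sketched with the standard library only so it runs without any gems. The `LatencyHistogram` class and its bucket boundaries are hypothetical; with the `prometheus-client` gem you would normally use its `Prometheus::Client::Histogram` and call `observe` instead of maintaining buckets yourself.

```ruby
# Minimal in-process latency histogram with cumulative buckets,
# similar in spirit to a Prometheus histogram.
class LatencyHistogram
  BUCKETS_S = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5].freeze

  attr_reader :counts, :sum, :count

  def initialize
    @counts = Hash.new(0) # upper bound in seconds => observations at or below it
    @sum = 0.0
    @count = 0
    @lock = Mutex.new
  end

  def observe(seconds)
    @lock.synchronize do
      BUCKETS_S.each { |upper| @counts[upper] += 1 if seconds <= upper }
      @sum += seconds
      @count += 1
    end
  end

  # Time a block with the monotonic clock and record the duration,
  # even when the block raises. Returns the block's value.
  def time
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    observe(Process.clock_gettime(Process::CLOCK_MONOTONIC) - started)
  end
end

REDIS_GET_LATENCY = LatencyHistogram.new

# `sleep` stands in for a real call such as $redis.get(key).
cached = REDIS_GET_LATENCY.time { sleep 0.002; 'cached-value' }
puts "value=#{cached} observations=#{REDIS_GET_LATENCY.count}"
```

The monotonic clock is used rather than `Time.now` so that wall-clock adjustments cannot produce negative or inflated durations.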

Grafana Dashboards and Alerting

Once Prometheus is collecting metrics, set up Grafana dashboards to visualize them. You can import pre-built dashboards for Redis or create custom ones. Key panels to include:

Redis Cluster Health (from `redis-cli --cluster check` output parsed by a custom exporter or script)
Redis Memory Usage (used_memory, used_memory_rss, mem_fragmentation_ratio)
Redis Network Traffic (total_net_output_bytes, total_net_input_bytes)
Redis Command Operations (per command type)
Redis Latency (usec_per_call from INFO commandstats; latency_percentiles_usec_* from INFO latencystats on Redis 7.0 and later)
Ruby App Request Latency, Throughput, Error Rates (from APM or custom metrics)
Linode Instance CPU, RAM, Disk, Network (using the Prometheus Node Exporter or Linode’s API integration)
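
The first panel in this list needs the cluster check result as a metric. One option, sketched here in Ruby under a few assumptions, is to render a gauge in the Prometheus text exposition format and let the Node Exporter's textfile collector pick it up. The metric name `redis_cluster_check_ok` and the `seed` label are invented for this example.

```ruby
# Turn the output of `redis-cli --cluster check` into a single gauge in the
# Prometheus text exposition format.
def cluster_health_metric(check_output, seed:)
  healthy = check_output.include?('All 16384 slots covered') &&
            !check_output.include?('[ERR]')
  <<~METRIC
    # HELP redis_cluster_check_ok 1 if redis-cli --cluster check passed, 0 otherwise.
    # TYPE redis_cluster_check_ok gauge
    redis_cluster_check_ok{seed="#{seed}"} #{healthy ? 1 : 0}
  METRIC
end

# Abbreviated sample output (illustrative).
SAMPLE_CHECK_OUTPUT = <<~OUT
  [OK] All nodes agree about slots configuration.
  [OK] All 16384 slots covered.
OUT

puts cluster_health_metric(SAMPLE_CHECK_OUTPUT, seed: '192.168.1.100:7000')
```

Written to a `.prom` file in the directory passed to the Node Exporter's `--collector.textfile.directory` flag (ideally via a temporary file and a rename, so a scrape never sees a half-written file), the gauge can then drive both a Grafana panel and an alert.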

Alerting rules can be defined in Grafana or in a Prometheus rules file. Prometheus 2.x and later expect rules in YAML; the older `ALERT ... IF ...` syntax is no longer accepted. For example, a rule for high Redis memory usage:

# Prometheus alerting rule (e.g., rules/redis.yml, referenced from rule_files in prometheus.yml)
groups:
  - name: redis
    rules:
      - alert: RedisHighMemoryUsage
        expr: redis_memory_used_bytes{job="redis"} / redis_total_system_memory_bytes{job="redis"} * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Redis memory usage is high on {{ $labels.instance }}'
          description: 'Redis instance {{ $labels.instance }} is using {{ $value | printf "%.2f" }}% of system memory.'

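This rule fires a warning when any Redis instance uses more than 85% of its host's memory for 5 minutes. If you cap Redis with `maxmemory`, compare against `redis_memory_max_bytes` instead. Route the alerts to Slack, PagerDuty, or email through Alertmanager or Grafana contact points.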
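
To sanity-check the threshold against a live node before trusting the alert, you can compute the same ratio from `redis-cli INFO memory`. The Ruby sketch below parses the `key:value` lines and reproduces the rule's arithmetic; the sample text is a trimmed, made-up excerpt and the helper names are our own.

```ruby
# Parse the "key:value" lines of `redis-cli INFO` output into a Hash,
# skipping blank lines and "# Section" headers.
def parse_info(text)
  text.each_line.with_object({}) do |line, fields|
    line = line.strip
    next if line.empty? || line.start_with?('#')

    key, value = line.split(':', 2)
    fields[key] = value
  end
end

# Same ratio as the alert rule: used_memory over total_system_memory.
def memory_usage_percent(info)
  used = Float(info.fetch('used_memory'))
  total = Float(info.fetch('total_system_memory'))
  (used / total * 100).round(2)
end

# Trimmed, illustrative excerpt of `INFO memory` (1.5 GiB used on a 2 GiB host).
SAMPLE_INFO_MEMORY = <<~INFO
  # Memory
  used_memory:1610612736
  used_memory_rss:1707081728
  total_system_memory:2147483648
  mem_fragmentation_ratio:1.06
INFO

usage = memory_usage_percent(parse_info(SAMPLE_INFO_MEMORY))
puts "memory usage: #{usage}% (alert threshold: 85%)"
```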

Advanced: Redis Persistence and AOF Rewriting Monitoring

For production Redis deployments, persistence (RDB snapshots and AOF logging) is often enabled. Monitoring the health of these mechanisms is crucial to prevent data loss and performance degradation.

AOF Rewriting Issues

Redis automatically rewrites the Append Only File (AOF) to shrink its size. This process can be resource-intensive. If the AOF file grows too large or the rewrite fails, it can impact performance and increase disk usage.

Metrics to watch:

`aof_enabled`: Should be `1` if AOF is active.
`aof_rewrite_in_progress`: Normally `0`, and `1` only while a rewrite is running. If it stays at `1` for an extended period, the rewrite is likely stuck.
`aof_last_bgrewrite_status`: Should be `ok`. Any other status indicates a failure.
`aof_current_size` and `aof_base_size`: Monitor these to understand AOF growth and rewrite effectiveness.

You can expose these via `redis-exporter` and alert on them in Grafana/Prometheus. For instance, an alert for `aof_rewrite_in_progress == 1` for more than 30 minutes would be a critical indicator.
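
For a quick check outside Prometheus, the same conditions can be evaluated directly on the fields of `INFO persistence`. This is a minimal Ruby sketch: the 30-minute limit mirrors the alert suggested above, the growth ratio of 3 is an arbitrary assumption, and `aof_current_rewrite_time_sec` is the INFO field that reports how long the current rewrite has been running.

```ruby
# Inspect AOF fields from `INFO persistence` (values arrive as strings).
# Returns a list of problems; an empty list means nothing looks wrong.
def aof_problems(info, max_rewrite_s: 1800, max_growth_ratio: 3.0)
  return ['AOF is disabled'] unless info['aof_enabled'] == '1'

  problems = []
  status = info['aof_last_bgrewrite_status']
  problems << "last rewrite status is #{status}" unless status == 'ok'

  rewrite_s = info.fetch('aof_current_rewrite_time_sec', '-1').to_i
  if info['aof_rewrite_in_progress'] == '1' && rewrite_s > max_rewrite_s
    problems << "rewrite has been running for #{rewrite_s}s"
  end

  base = info.fetch('aof_base_size', '0').to_f
  current = info.fetch('aof_current_size', '0').to_f
  if base.positive? && current / base > max_growth_ratio
    problems << format('AOF is %.1fx its size after the last rewrite', current / base)
  end

  problems
end

# Hypothetical field values for illustration.
HEALTHY_AOF = {
  'aof_enabled' => '1', 'aof_rewrite_in_progress' => '0',
  'aof_last_bgrewrite_status' => 'ok', 'aof_current_rewrite_time_sec' => '-1',
  'aof_base_size' => '1000000', 'aof_current_size' => '1400000'
}.freeze

STUCK_AOF = HEALTHY_AOF.merge(
  'aof_rewrite_in_progress' => '1', 'aof_current_rewrite_time_sec' => '2400'
).freeze

puts aof_problems(HEALTHY_AOF).inspect
puts aof_problems(STUCK_AOF).inspect
```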

RDB Snapshotting Issues

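Similarly, RDB snapshots can cause latency spikes during the fork operation required to save the dataset, and the fork takes longer as the dataset grows. Watch `rdb_last_bgsave_status` (should be `ok`), `rdb_last_save_time`, and `latest_fork_usec` to catch failed, stale, or slow snapshots.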
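
As a starting point, the snapshot checks can be sketched in Ruby against three INFO fields: `rdb_last_bgsave_status` and `rdb_last_save_time` from the persistence section, and `latest_fork_usec` from the stats section. The one-hour maximum snapshot age and the 500 ms fork limit are assumptions to tune for your dataset, not recommended values.

```ruby
# Inspect RDB-related INFO fields (values arrive as strings).
# now: current Unix time in seconds. Returns a list of problems.
def rdb_problems(info, now:, max_snapshot_age_s: 3600, max_fork_usec: 500_000)
  problems = []

  status = info['rdb_last_bgsave_status']
  problems << "last bgsave status is #{status}" unless status == 'ok'

  age_s = now - info.fetch('rdb_last_save_time').to_i
  problems << "last snapshot is #{age_s}s old" if age_s > max_snapshot_age_s

  fork_usec = info.fetch('latest_fork_usec', '0').to_i
  problems << "last fork took #{fork_usec / 1000}ms" if fork_usec > max_fork_usec

  problems
end

# Hypothetical field values for illustration.
SAMPLE_RDB = {
  'rdb_last_bgsave_status' => 'ok',
  'rdb_last_save_time' => '1700000000',
  'latest_fork_usec' => '820000'
}.freeze

puts rdb_problems(SAMPLE_RDB, now: 1_700_000_600).inspect
```

In production you would pass `now: Time.now.to_i` and feed in the parsed output of `redis-cli INFO`.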