Server Monitoring Best Practices: Keeping Your Shopify App and Redis Clusters Alive on OVH

Establishing Robust Redis Cluster Monitoring on OVH

Maintaining the health and performance of Redis clusters, especially those powering critical Shopify applications, demands a proactive and granular monitoring strategy. On OVH infrastructure, this often involves a combination of native OVH tools and custom solutions tailored to Redis’s unique characteristics. We’ll focus on key metrics and actionable alerts that prevent downtime and performance degradation.

Key Redis Metrics for OVH Deployments

Beyond basic CPU and memory utilization, Redis-specific metrics are paramount. For a cluster, we need to monitor:

Memory Usage: used_memory, used_memory_rss, and mem_fragmentation_ratio (used_memory_rss divided by used_memory). A ratio well above 1 points to allocator fragmentation; a ratio below 1 usually means part of the dataset has been swapped to disk.
Network Traffic: total_net_input_bytes and total_net_output_bytes. Spikes can signal heavy load or potential DDoS attacks.
Command Operations: total_commands_processed. A sudden drop or stagnation might indicate a blocked event loop or network issues.
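Latency: Redis’s built-in latency monitoring is crucial. Track latest_fork_usec (how long the most recent fork for a background save or AOF rewrite took, in microseconds) and general command latency via the LATENCY and SLOWLOG commands.
Replication Status: For master-replica setups, compare master_repl_offset on the master with slave_repl_offset on each replica to confirm replicas are in sync, and check that master_link_status is up on every replica.
Evictions: evicted_keys. A rising count means the maxmemory limit is being reached and Redis is discarding keys according to its maxmemory-policy, which amounts to data loss if your application treats Redis as more than a cache.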
Connections: connected_clients and rejected_connections. A surge in rejected connections points to hitting the maxclients limit.
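
All of the fields above come from the INFO command, which returns key:value lines grouped under "# Section" headers. The following Python sketch shows one way to turn those raw fields into the ratios that are usually easier to reason about. The helper names (parse_info, derive_indicators) and the sample payload are made up for illustration; only the field names come from Redis itself.

```python
def parse_info(raw: str) -> dict:
    """Parse the text returned by Redis INFO into a flat dict of strings."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# Section" headers
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def derive_indicators(fields: dict) -> dict:
    """Compute a few derived health indicators from parsed INFO fields."""
    used = int(fields["used_memory"])
    rss = int(fields["used_memory_rss"])
    maxmemory = int(fields.get("maxmemory", 0))  # 0 means "no limit configured"
    return {
        "fragmentation_ratio": round(rss / used, 2) if used else None,
        "memory_used_pct": round(100 * used / maxmemory, 1) if maxmemory else None,
        "evicted_keys": int(fields.get("evicted_keys", 0)),
        "rejected_connections": int(fields.get("rejected_connections", 0)),
    }


# Illustrative sample only; real INFO output contains many more fields.
SAMPLE_INFO = """\
# Memory
used_memory:1048576
used_memory_rss:1572864
maxmemory:4194304
# Stats
evicted_keys:12
rejected_connections:0
"""

print(derive_indicators(parse_info(SAMPLE_INFO)))
```

In practice the raw text would come from redis-cli INFO or a client library rather than a hard-coded string, and the exporter described in the next section does this collection for you.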

Implementing Redis Monitoring with Prometheus and Grafana on OVH

Prometheus is an excellent choice for time-series data collection, and Grafana provides powerful visualization. We’ll deploy the redis_exporter to expose Redis metrics.

Deploying redis_exporter

On each Redis node (or a dedicated monitoring host that can reach them), install and configure redis_exporter. A common approach is to run it as a systemd service.

Systemd Service File Example

Create a file like /etc/systemd/system/redis_exporter.service:

[Unit]
Description=Redis Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=redis_exporter
Group=redis_exporter
Type=simple
ExecStart=/usr/local/bin/redis_exporter --redis.addr=redis://localhost:6379 --web.listen-address=:9121
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Create a dedicated redis_exporter user and group for the service, and adjust --redis.addr if your Redis instance is not on localhost:6379. For a Redis cluster, either run one exporter per node, or run a single exporter and have Prometheus pass each node’s address to the exporter’s /scrape endpoint as a target parameter.
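
Once the service is running, it is worth confirming that the exporter can actually reach Redis before wiring it into Prometheus. The sketch below, in Python, fetches the metrics page and checks the redis_up gauge that redis_exporter publishes. The URL assumes the listen address from the unit file above; the function names are hypothetical, and the parser assumes samples carry no trailing timestamp.

```python
from urllib.request import urlopen


def fetch_metrics(url: str = "http://localhost:9121/metrics", timeout: float = 3.0) -> str:
    """Download the exporter's metrics page as text (URL is an assumption)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_exposition(text: str) -> dict:
    """Map each sample name (including any labels) to its float value.

    Assumes the 'name value' form without a trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def exporter_sees_redis(text: str) -> bool:
    """True when the exporter reports a successful connection to Redis."""
    return parse_exposition(text).get("redis_up") == 1.0
```

A quick manual check would be exporter_sees_redis(fetch_metrics()) from a Python shell on the Redis node; a False result usually means --redis.addr or the Redis password needs attention.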

Prometheus Configuration

Add a scrape configuration to your prometheus.yml to collect metrics from the exporter:

scrape_configs:
  - job_name: 'redis_cluster'
    static_configs:
      - targets:
          - 'redis-node-1:9121'
          - 'redis-node-2:9121'
          - 'redis-node-3:9121'
          # Add all your Redis nodes here
    metrics_path: /metrics

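Validate the file with promtool check config prometheus.yml, then reload Prometheus: systemctl reload prometheus. This relies on the unit defining ExecReload; sending SIGHUP to the Prometheus process has the same effect.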
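
If the node list changes often, maintaining static_configs by hand gets tedious. One alternative is Prometheus’s file-based service discovery, where a file_sd_configs entry points at a JSON file of targets that Prometheus re-reads on change. The Python sketch below generates such a file. The node names, the cluster label, and the helper names are assumptions for illustration; only the JSON shape (a list of objects with targets and labels) is defined by Prometheus.

```python
import json


def build_file_sd(nodes, port=9121, labels=None):
    """Build one file_sd target group covering all exporter endpoints."""
    return [
        {
            "targets": [f"{node}:{port}" for node in nodes],
            "labels": dict(labels or {}),
        }
    ]


def render_file_sd(nodes, port=9121, labels=None) -> str:
    """Serialize the target group as JSON suitable for a file_sd file."""
    return json.dumps(build_file_sd(nodes, port, labels), indent=2)


# Hypothetical node names matching the static example above.
print(render_file_sd(
    ["redis-node-1", "redis-node-2", "redis-node-3"],
    labels={"cluster": "shopify-cache"},
))
```

Writing the output to a path referenced by file_sd_configs lets you add or remove nodes without reloading Prometheus.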

Grafana Dashboards

Import a pre-built Redis dashboard (e.g., from Grafana.com, search for “Redis Exporter”) or create a custom one. Key panels should include:

Memory Usage (used_memory vs. maxmemory)
Key Eviction Rate
Command Throughput
Network I/O
Replication Lag
Connected Clients
Latency (if available via exporter or custom instrumentation)
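
Several of these panels (eviction rate, command throughput, network I/O) are built from counters that only ever increase, so the dashboard has to convert them into per-second rates. The Python sketch below illustrates the idea, including the counter reset that happens when Redis restarts. It is a simplified illustration, not a reimplementation of the PromQL rate() function, which also extrapolates to the edges of the window.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there is not enough data to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. Redis restarted),
        # so the new value is counted as an increase from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Evicted-key counter scraped every 15 seconds (made-up values).
print(per_second_rate([(0, 100), (15, 130), (30, 160)]))
```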

OVH Specific Considerations for Shopify App Monitoring

Your Shopify app likely interacts with Redis for caching, session management, or background job queues. Monitoring the application’s perspective is equally vital.

Application-Level Metrics

Instrument your PHP (or other language) Shopify app to emit custom metrics. This can be done using libraries that integrate with Prometheus clients.

PHP Example: Tracking Cache Operations

Using the prometheus_client_php library:

<?php
require 'vendor/autoload.php'; // composer require promphp/prometheus_client_php

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\APC;

// Use a storage adapter shared across PHP-FPM workers (APCu here; a Redis
// adapter also exists). InMemory storage is reset on every request, so
// counters would never accumulate between the app and /metrics.php.
$registry = new CollectorRegistry(new APC());

// Counter exported as myapp_cache_operations_total{type="hit"|"miss"}.
// Arguments: namespace, name, help text, label names.
$cache_counter = $registry->getOrRegisterCounter(
    'myapp',
    'cache_operations_total',
    'Total cache operations (hits and misses)',
    ['type']
);

// --- In your application logic ---

function get_from_redis($key) {
    // Assumes $redis_client is a connected phpredis client, whose get()
    // returns false for a missing key (Predis returns null instead).
    global $redis_client, $cache_counter;

    $value = $redis_client->get($key);
    if ($value === false) {
        $cache_counter->inc(['miss']); // label values are positional
        // Fetch from primary source, store in Redis, return
        return fetch_and_cache($key);
    }
    $cache_counter->inc(['hit']);
    return $value;
}

// --- Expose metrics endpoint ---
// In a separate script or route (e.g., /metrics.php)
$renderer = new RenderTextFormat();
header('Content-Type: ' . RenderTextFormat::MIME_TYPE);
echo $renderer->render($registry->getMetricFamilySamples());

?>

Configure Prometheus to scrape this /metrics.php endpoint on your application servers.

OVH Network and Firewall Monitoring

OVH’s network infrastructure is a critical layer. Ensure you are monitoring:

Network Latency: Use tools like ping, mtr, or dedicated network monitoring agents to check latency between your app servers and Redis cluster, and from OVH to external services your app depends on.
Firewall Logs: If you’re using OVH’s firewall services or custom iptables rules, monitor for excessive denied connections, especially to your Redis ports (default 6379). This could indicate misconfiguration or an attack.
Bandwidth Usage: OVH provides network traffic statistics. Monitor these for unexpected spikes that might correlate with Redis traffic or application load.
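
ping and mtr measure ICMP round trips, which firewalls sometimes treat differently from application traffic. A complementary check is to time a TCP handshake against the Redis port itself from an app server. The Python sketch below does that with the standard library only; the function names are hypothetical, and the host and port you probe are your own.

```python
import socket
import time


def tcp_connect_latency(host: str, port: int, timeout: float = 2.0):
    """Return seconds taken to complete a TCP handshake, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return None


def summarize(latencies):
    """Summarize a series of probe results (None entries are failures)."""
    ok = [x for x in latencies if x is not None]
    return {
        "attempts": len(latencies),
        "failures": len(latencies) - len(ok),
        "worst_ms": round(max(ok) * 1000, 2) if ok else None,
    }
```

Running a handful of probes, for example summarize([tcp_connect_latency("redis-node-1", 6379) for _ in range(5)]), gives a rough picture of both reachability and worst-case connection time between the two hosts.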

Alerting Strategies for Critical Failures

Effective alerting is about catching problems *before* they impact users. Define the alert rules in Prometheus, and use Alertmanager to deduplicate, group, and route the alerts those rules produce.

Example Alerting Rules (PromQL)

Add these to your Prometheus rules file (e.g., rules.yml):

groups:
- name: redis_alerts
  rules:
  - alert: RedisHighMemoryUsage
    expr: |
      (redis_memory_used_bytes / redis_memory_max_bytes) * 100 > 85 and redis_memory_max_bytes > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis memory usage is high on {{ $labels.instance }}"
      description: "Redis instance {{ $labels.instance }} is using {{ $value | printf \"%.2f\" }}% of its allocated memory."

  - alert: RedisEvictionsOccurred
    expr: |
      rate(redis_evicted_keys_total[5m]) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis is evicting keys on {{ $labels.instance }}"
      description: "Redis instance {{ $labels.instance }} is actively evicting keys, indicating memory pressure."

  - alert: RedisReplicationLagging
    # Metric names assume oliver006/redis_exporter; confirm them against your /metrics output.
    expr: |
      (redis_master_repl_offset - on (instance) group_right () redis_connected_slave_offset_bytes) > 1024000 # Replica is roughly 1 MB or more behind
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Redis replication lag detected for master {{ $labels.instance }}"
      description: "Replica {{ $labels.slave_ip }} of {{ $labels.instance }} is lagging by {{ $value | printf \"%.0f\" }} bytes."

  - alert: RedisHighClientConnections
    expr: |
      redis_connected_clients > (redis_config_maxclients * 0.9)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis approaching max client connections on {{ $labels.instance }}"
      description: "Redis instance {{ $labels.instance }} has {{ $value | printf \"%.0f\" }} connected clients, nearing the limit."

  - alert: AppCacheMissRateHigh
    expr: |
      sum by (job) (rate(myapp_cache_operations_total{type="miss"}[5m])) / sum by (job) (rate(myapp_cache_operations_total[5m])) > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High cache miss rate for Shopify app {{ $labels.job }}"
      description: "The cache miss rate for app {{ $labels.job }} has exceeded 50% over the last 10 minutes."

# Add more rules for network latency, rejected connections, etc.

Configure Alertmanager to route these alerts to your team via Slack, PagerDuty, or email. Remember to tune the `for` duration and thresholds based on your application’s tolerance for latency and data loss.
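
When tuning the for duration, it helps to remember how it behaves: an alert becomes pending the first time its expression is true, fires only once the expression has stayed true for the whole duration, and drops back to inactive the moment the expression is false at an evaluation. The Python sketch below models that behaviour for a simple threshold rule. The function name and sample values are illustrative, and it ignores details such as evaluation jitter and missing scrapes.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the evaluation timestamp at which the alert would start firing.

    samples: list of (timestamp_seconds, value), one per rule evaluation,
    oldest first. Returns None if the alert never fires.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # alert enters the pending state
            if ts - pending_since >= for_seconds:
                return ts  # condition held for the full duration
        else:
            pending_since = None  # any false evaluation resets the timer
    return None


# Memory usage (%) evaluated once a minute; threshold 85, for: 5m.
usage = [(0, 80), (60, 88), (120, 90), (180, 84), (240, 87), (300, 89),
         (360, 91), (420, 92), (480, 93), (540, 95)]
print(first_firing_time(usage, threshold=85, for_seconds=300))
```

In this made-up series the dip at 180 seconds resets the timer, so the alert fires later than a first glance at the data would suggest. A shorter for reacts faster but is more likely to page on brief spikes.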

Proactive Health Checks and Diagnostics

Beyond automated monitoring, periodic manual checks and diagnostic procedures are essential.

Redis CLI Diagnostics

Connect to your Redis instances regularly and run:

redis-cli
INFO memory
INFO stats
INFO replication
INFO clients
CLIENT LIST
SLOWLOG GET 10
CONFIG GET maxmemory
CONFIG GET maxclients

Pay close attention to used_memory_rss vs. used_memory (fragmentation), evicted_keys, rejected_connections, and any commands appearing in SLOWLOG.

OVH Instance Health

Use OVH’s control panel and API to check the overall health of your underlying virtual machines or bare-metal servers. Monitor:

CPU Steal Time: High steal time indicates resource contention on the hypervisor.
Disk I/O Wait: Excessive wait times can bottleneck Redis performance.
Network Interface Errors: Dropped packets or errors on the NIC can cause intermittent connectivity issues.
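
On Linux guests, steal time and I/O wait can be read straight from the aggregate cpu line in /proc/stat, whose first eight columns are user, nice, system, idle, iowait, irq, softirq, and steal. The Python sketch below computes both as a percentage of CPU time between two readings. The helper names and the sample lines are made up; a monitoring agent such as node_exporter reports the same counters if you prefer not to script it.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Return the aggregate 'cpu' line, which is the first line of /proc/stat."""
    with open(path) as fh:
        return fh.readline()


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate cpu line into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before: dict, after: dict):
    """Steal and iowait as a share of all CPU time between two readings."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return None
    return {
        "steal_pct": round(100 * deltas["steal"] / total, 1),
        "iowait_pct": round(100 * deltas["iowait"] / total, 1),
    }


# Two illustrative readings taken a few seconds apart.
first = parse_cpu_line("cpu  1000 0 500 8000 100 0 50 50 0 0")
second = parse_cpu_line("cpu  1400 0 700 8300 150 0 50 100 0 0")
print(cpu_percentages(first, second))
```

On a live host, call parse_cpu_line(read_cpu_line()) twice with a short sleep in between. Sustained steal of more than a few percent on a Redis node is worth raising with OVH support or addressing by moving to a different instance type.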

By combining granular Redis metrics, application-level insights, and OVH infrastructure monitoring, you can build a resilient system that keeps your Shopify app performing optimally.