Server Monitoring Best Practices: Keeping Your Shopify App and Redis Clusters Alive on OVH
Establishing Robust Redis Cluster Monitoring on OVH
Maintaining the health and performance of Redis clusters, especially those powering critical Shopify applications, demands a proactive and granular monitoring strategy. On OVH infrastructure, this often involves a combination of native OVH tools and custom solutions tailored to Redis’s unique characteristics. We’ll focus on key metrics and actionable alerts that prevent downtime and performance degradation.
Key Redis Metrics for OVH Deployments
Beyond basic CPU and memory utilization, Redis-specific metrics are paramount. For a cluster, we need to monitor:
- Memory Usage: used_memory, used_memory_rss, and mem_fragmentation_ratio. A ratio well above 1 usually points to allocator fragmentation, while a ratio below 1 suggests the operating system has swapped part of Redis’s memory to disk.
- Network Traffic: total_net_input_bytes and total_net_output_bytes. Spikes can signal heavy load or potential DDoS attacks.
- Command Operations: total_commands_processed. A sudden drop or stagnation might indicate a blocked event loop or network issues.
- Latency: Redis’s built-in latency monitoring is crucial. Track latest_fork_usec (the duration of the most recent fork, which background saves trigger) and general command latency.
- Replication Status: For master-replica setups, monitor master_repl_offset and slave_repl_offset to ensure replicas are in sync, and check master_link_status.
- Evictions: evicted_keys. A high eviction rate means the maxmemory limit is being hit and keys are being removed under the configured maxmemory-policy, potentially leading to data loss for your application.
- Connections: connected_clients and rejected_connections. A surge in rejected connections points to hitting the maxclients limit.
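As a rough illustration of how the fields above fit together, the following Python sketch parses the raw text returned by the INFO command and derives a few of the ratios discussed. The function names, the sample payload, and the max_clients argument are hypothetical; in practice the text would come from redis-cli INFO or a client library.

```python
def parse_info(raw: str) -> dict[str, str]:
    """Parse the text returned by the Redis INFO command into a dict."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def health_summary(info: dict[str, str], max_clients: int) -> dict[str, float]:
    """Derive a few health indicators from parsed INFO fields."""
    used = int(info["used_memory"])
    rss = int(info["used_memory_rss"])
    return {
        # RSS divided by allocated memory, the same idea as mem_fragmentation_ratio.
        "fragmentation_ratio": rss / used if used else 0.0,
        "client_utilization": int(info["connected_clients"]) / max_clients,
        "evicted_keys": float(info.get("evicted_keys", 0)),
        "rejected_connections": float(info.get("rejected_connections", 0)),
    }


# Hypothetical sample payload, shaped like real INFO output.
SAMPLE_INFO = """\
# Memory
used_memory:1000000
used_memory_rss:1500000
# Clients
connected_clients:50
# Stats
evicted_keys:12
rejected_connections:0
"""

if __name__ == "__main__":
    print(health_summary(parse_info(SAMPLE_INFO), max_clients=10000))
```

Counters such as evicted_keys and rejected_connections are cumulative since the server started, so it is the change between two samples that matters, not the absolute value.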
Implementing Redis Monitoring with Prometheus and Grafana on OVH
Prometheus is an excellent choice for time-series data collection, and Grafana provides powerful visualization. We’ll deploy the redis_exporter to expose Redis metrics.
Deploying redis_exporter
On each Redis node (or a dedicated monitoring host that can reach them), install and configure redis_exporter. A common approach is to run it as a systemd service.
Systemd Service File Example
Create a file like /etc/systemd/system/redis_exporter.service:
[Unit]
Description=Redis Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=redis_exporter
Group=redis_exporter
Type=simple
ExecStart=/usr/local/bin/redis_exporter --redis.addr=redis://localhost:6379 --web.listen-address=:9121
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
Ensure a dedicated user and group exist for the exporter, and adjust --redis.addr if your Redis instance is not on localhost:6379. For a Redis cluster, either run one exporter per node, or run a single exporter and have Prometheus pass each node as a target through the exporter’s /scrape endpoint.
Prometheus Configuration
Add a scrape configuration to your prometheus.yml to collect metrics from the exporter:
scrape_configs:
  - job_name: 'redis_cluster'
    static_configs:
      - targets:
          - 'redis-node-1:9121'
          - 'redis-node-2:9121'
          - 'redis-node-3:9121'
          # Add all your Redis nodes here
    metrics_path: /metrics
Reload Prometheus configuration: systemctl reload prometheus.
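Before relying on the scrape job, it can help to confirm that each exporter actually reports a live Redis connection. The sketch below is one way to do that in Python: it reads an exporter's /metrics output and looks for the redis_up gauge that redis_exporter publishes. The helper names and the example URL are assumptions, and the parser assumes sample lines carry no trailing timestamp.

```python
from urllib.request import urlopen


def metric_values(exposition: str, name: str) -> list[float]:
    """Return every sample value for `name` in Prometheus text exposition format."""
    values = []
    for line in exposition.splitlines():
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # A sample line is "<name> <value>" or "<name>{labels} <value>".
        series, _, value = line.rpartition(" ")
        if series == name or series.startswith(name + "{"):
            values.append(float(value))
    return values


def exporter_sees_redis(url: str, timeout: float = 3.0) -> bool:
    """Fetch an exporter's metrics page and check that redis_up is 1."""
    with urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return metric_values(body, "redis_up") == [1.0]


if __name__ == "__main__":
    # Hypothetical host name taken from the scrape config above.
    print(exporter_sees_redis("http://redis-node-1:9121/metrics"))
```

The same check is visible in the Prometheus UI under Status > Targets, so this is mainly useful from a deployment script or a cron job.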
Grafana Dashboards
Import a pre-built Redis dashboard (e.g., from Grafana.com, search for “Redis Exporter”) or create a custom one. Key panels should include:
- Memory Usage (used_memory vs. maxmemory)
- Key Eviction Rate
- Command Throughput
- Network I/O
- Replication Lag
- Connected Clients
- Latency (if available via exporter or custom instrumentation)
OVH Specific Considerations for Shopify App Monitoring
Your Shopify app likely interacts with Redis for caching, session management, or background job queues. Monitoring the application’s perspective is equally vital.
Application-Level Metrics
Instrument your PHP (or other language) Shopify app to emit custom metrics. This can be done with a Prometheus client library for your language.
PHP Example: Tracking Cache Operations
Using the prometheus_client_php library:
<?php
require 'vendor/autoload.php';

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\Redis as RedisStorage;

// InMemory storage is discarded at the end of every PHP request, so counters
// would never reach the metrics endpoint. Use a shared backend instead
// (Redis here, which needs the phpredis extension; adjust host and port).
$registry = new CollectorRegistry(
    new RedisStorage(['host' => '127.0.0.1', 'port' => 6379])
);

// Counter for cache hits and misses, exposed as myapp_cache_operations_total
$cache_counter = $registry->getOrRegisterCounter(
    'myapp',                                    // namespace
    'cache_operations_total',                   // metric name
    'Total cache operations (hits and misses)', // help text
    ['type']                                    // label: 'hit' or 'miss'
);

// --- In your application logic ---
function get_from_redis($key) {
    global $redis_client, $cache_counter; // Assume $redis_client is a phpredis connection
    $value = $redis_client->get($key);
    if ($value === false) { // phpredis returns false when the key is missing
        $cache_counter->inc(['miss']); // label values are passed positionally
        // Fetch from primary source, store in Redis, return
        return fetch_and_cache($key);
    }
    $cache_counter->inc(['hit']);
    return $value;
}

// --- Expose metrics endpoint ---
// In a separate script or route (e.g., /metrics.php)
$renderer = new RenderTextFormat();
header('Content-Type: ' . RenderTextFormat::MIME_TYPE);
echo $renderer->render($registry->getMetricFamilySamples());
?>
Configure Prometheus to scrape this /metrics.php endpoint on your application servers.
OVH Network and Firewall Monitoring
OVH’s network infrastructure is a critical layer. Ensure you are monitoring:
- Network Latency: Use tools like ping, mtr, or dedicated network monitoring agents to check latency between your app servers and Redis cluster, and from OVH to external services your app depends on.
- Firewall Logs: If you’re using OVH’s firewall services or custom iptables rules, monitor for excessive denied connections, especially to your Redis ports (default 6379). This could indicate misconfiguration or an attack.
- Bandwidth Usage: OVH provides network traffic statistics. Monitor these for unexpected spikes that might correlate with Redis traffic or application load.
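Where ICMP is filtered, timing a TCP handshake against the Redis port gives a comparable latency signal from the application server's point of view. The following Python sketch is one way to take that measurement; the function name, sample count, and host name are illustrative assumptions, and it measures connection setup only, not Redis command latency.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host: str, port: int, samples: int = 5,
                           timeout: float = 2.0) -> list[float]:
    """Time `samples` TCP handshakes to host:port and return each duration in ms."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection raises OSError if the port is closed or filtered.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


if __name__ == "__main__":
    # Hypothetical Redis node name; replace with one of your own hosts.
    results = tcp_connect_latency_ms("redis-node-1", 6379)
    print(f"median={statistics.median(results):.2f} ms max={max(results):.2f} ms")
```

A connection error here is itself a useful signal, since it often means a firewall rule or security group is blocking the path rather than Redis being slow.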
Alerting Strategies for Critical Failures
Effective alerting is about catching problems *before* they impact users. Define alert rules in Prometheus, and let Alertmanager handle grouping, silencing, and routing the resulting notifications.
Example Alerting Rules (PromQL)
Add these to your Prometheus rules file (e.g., rules.yml):
groups:
  - name: redis_alerts
    rules:
      - alert: RedisHighMemoryUsage
        # The "> 0" guard skips instances without a maxmemory limit (reported as 0)
        expr: |
          (redis_memory_used_bytes / redis_memory_max_bytes) * 100 > 85
          and redis_memory_max_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage is high on {{ $labels.instance }}"
          description: "Redis instance {{ $labels.instance }} is using {{ $value | printf \"%.2f\" }}% of its allocated memory."
      - alert: RedisEvictionsOccurred
        expr: |
          rate(redis_evicted_keys_total[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis is evicting keys on {{ $labels.instance }}"
          description: "Redis instance {{ $labels.instance }} is actively evicting keys, indicating memory pressure."
      - alert: RedisReplicationLagging
        # Lag = master offset minus each replica's offset, as reported by the master.
        # Assumes redis_exporter's redis_connected_slave_offset_bytes metric.
        # Threshold is roughly 1 MB.
        expr: |
          (redis_master_repl_offset - on (instance) group_right () redis_connected_slave_offset_bytes) > 1024000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis replication lag detected for master {{ $labels.instance }}"
          description: "Replica {{ $labels.slave_ip }} of {{ $labels.instance }} is lagging by {{ $value | printf \"%.0f\" }} bytes."
      - alert: RedisHighClientConnections
        expr: |
          redis_connected_clients > (redis_config_maxclients * 0.9)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis approaching max client connections on {{ $labels.instance }}"
          description: "Redis instance {{ $labels.instance }} has {{ $value | printf \"%.0f\" }} connected clients, nearing the limit."
      - alert: AppCacheMissRateHigh
        expr: |
          sum by (job) (rate(myapp_cache_operations_total{type="miss"}[5m])) / sum by (job) (rate(myapp_cache_operations_total[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High cache miss rate for Shopify app {{ $labels.job }}"
          description: "The cache miss rate for app {{ $labels.job }} has exceeded 50% over the last 10 minutes."
      # Add more rules for network latency, rejected connections, etc.
Configure Alertmanager to route these alerts to your team via Slack, PagerDuty, or email. Remember to tune the `for` duration and thresholds based on your application’s tolerance for latency and data loss.
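When tuning thresholds, it can help to reason about what a threshold plus a `for` duration actually requires. The Python sketch below is a simplified model, not Prometheus's implementation: it computes a miss ratio from two cumulative counter snapshots and checks whether a condition has held continuously for a given duration. All names and sample values are hypothetical, and counter resets are ignored.

```python
from typing import Optional


def miss_ratio(prev: dict[str, float], curr: dict[str, float]) -> Optional[float]:
    """Miss ratio between two snapshots of cumulative 'hit' and 'miss' counters."""
    hits = curr["hit"] - prev["hit"]
    misses = curr["miss"] - prev["miss"]
    total = hits + misses
    # No traffic in the window: there is no meaningful ratio to report.
    return misses / total if total > 0 else None


def would_fire(samples: list[tuple[float, float]], threshold: float,
               for_seconds: float) -> bool:
    """Return True if value > threshold held from some sample through the last
    one, for at least `for_seconds`. `samples` is (timestamp, value) in order."""
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
        else:
            pending_since = None  # any dip resets the pending timer
    if pending_since is None:
        return False
    return samples[-1][0] - pending_since >= for_seconds


if __name__ == "__main__":
    print(miss_ratio({"hit": 100, "miss": 20}, {"hit": 130, "miss": 90}))
    print(would_fire([(0, 0.6), (300, 0.7), (600, 0.8)], 0.5, 600))
```

The reset-on-dip behaviour is the reason a longer `for` suppresses flapping: a single evaluation below the threshold restarts the wait.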
Proactive Health Checks and Diagnostics
Beyond automated monitoring, periodic manual checks and diagnostic procedures are essential.
Redis CLI Diagnostics
Connect to your Redis instances regularly and run:
redis-cli INFO memory
redis-cli INFO stats
redis-cli INFO replication
redis-cli INFO clients
redis-cli CLIENT LIST
redis-cli SLOWLOG GET 10
redis-cli CONFIG GET maxmemory
redis-cli CONFIG GET maxclients
Pay close attention to used_memory_rss vs. used_memory (fragmentation), evicted_keys, rejected_connections, and any commands appearing in SLOWLOG.
OVH Instance Health
Use OVH’s control panel and API to check the overall health of your underlying virtual machines or bare-metal servers. Monitor:
- CPU Steal Time: High steal time indicates resource contention on the hypervisor.
- Disk I/O Wait: Excessive wait times can bottleneck Redis performance.
- Network Interface Errors: Dropped packets or errors on the NIC can cause intermittent connectivity issues.
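On Linux guests, steal time and I/O wait can also be read directly from /proc/stat, which is what most node-level exporters do. The Python sketch below shows the arithmetic under that assumption: it takes two snapshots of the aggregate cpu line and reports what share of the elapsed CPU time went to a given field. The function names and the five-second interval are illustrative.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text: str) -> dict[str, int]:
    """Extract the aggregate CPU counters from the contents of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            values = [int(v) for v in line.split()[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def cpu_share_percent(before: dict[str, int], after: dict[str, int],
                      field: str) -> float:
    """Percentage of CPU time spent in `field` between two snapshots."""
    total = sum(after.values()) - sum(before.values())
    if total <= 0:
        return 0.0
    return 100.0 * (after[field] - before[field]) / total


if __name__ == "__main__":
    with open("/proc/stat") as handle:
        first = parse_cpu_line(handle.read())
    time.sleep(5)  # sampling interval; adjust as needed
    with open("/proc/stat") as handle:
        second = parse_cpu_line(handle.read())
    print(f"steal={cpu_share_percent(first, second, 'steal'):.1f}% "
          f"iowait={cpu_share_percent(first, second, 'iowait'):.1f}%")
```

Guest time is already counted inside user and nice, which is why only the first eight fields are summed for the total.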
By combining granular Redis metrics, application-level insights, and OVH infrastructure monitoring, you can build a resilient system that keeps your Shopify app performing optimally.