Server Monitoring Best Practices: Keeping Your WooCommerce App and Redis Clusters Alive on OVH

Proactive Redis Cluster Health Checks with `redis-cli` and Custom Scripts

Maintaining the health of Redis clusters, especially those powering critical applications like WooCommerce, requires more than basic uptime checks: we need to monitor internal cluster state, replication lag, memory usage, and potential bottlenecks. For Redis clusters that you run yourself on OVH instances, SSH access to the nodes is typically available, allowing us to use `redis-cli` for granular insights.

A fundamental check involves verifying the cluster’s `CLUSTER INFO` output. This provides a snapshot of the cluster’s state, including the number of nodes, slots covered, and overall status. We can script this to run periodically and alert on any deviation from expected values.

Automated `CLUSTER INFO` Monitoring

We’ll create a simple Bash script that connects to one of the Redis nodes (any reachable cluster node will do, since every node reports the cluster state as it sees it; Sentinel is not used with Redis Cluster) and executes `CLUSTER INFO`. The output is then parsed to check for specific health indicators.

Script: `check_redis_cluster_info.sh`

#!/bin/bash
# Nagios-style exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

REDIS_HOST="your_redis_master_ip" # Replace with the IP of any reachable cluster node
REDIS_PORT="6379"
REDIS_PASSWORD="your_redis_password" # Leave empty if no password is set

# Check if redis-cli is available
if ! command -v redis-cli &> /dev/null
then
    echo "Error: redis-cli could not be found. Please install redis-tools."
    exit 3
fi

# Pass the password through the REDISCLI_AUTH environment variable so that it
# does not appear in the process list and redis-cli does not print its "-a"
# warning into the output we parse.
if [ -n "$REDIS_PASSWORD" ]; then
    export REDISCLI_AUTH="$REDIS_PASSWORD"
fi
REDIS_CMD=(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT")

# Get cluster info; Redis ends each line with \r\n, so strip the carriage returns
CLUSTER_INFO=$("${REDIS_CMD[@]}" CLUSTER INFO 2>&1 | tr -d '\r')

# Check for connection errors
if echo "$CLUSTER_INFO" | grep -q "Could not connect"; then
    echo "CRITICAL: Could not connect to Redis at $REDIS_HOST:$REDIS_PORT."
    exit 2
fi

# Any other failure (authentication error, cluster mode disabled) yields no cluster_state line
if ! echo "$CLUSTER_INFO" | grep -q "^cluster_state:"; then
    echo "CRITICAL: Unexpected CLUSTER INFO reply: $CLUSTER_INFO"
    exit 2
fi

# Parse relevant metrics
get_field() {
    echo "$CLUSTER_INFO" | awk -F: -v key="$1" '$1 == key {print $2}'
}
CLUSTER_STATE=$(get_field cluster_state)
KNOWN_NODES=$(get_field cluster_known_nodes) # Masters and replicas known to this node
MASTER_NODES=$(get_field cluster_size)       # Masters serving at least one slot
SLOTS_OK=$(get_field cluster_slots_ok)
SLOTS_PFAIL=$(get_field cluster_slots_pfail) # Slots on nodes suspected of failing
SLOTS_FAIL=$(get_field cluster_slots_fail)   # Slots on nodes confirmed as failed

# Define expected values (adjust as per your cluster size)
EXPECTED_MASTER_NODES=3 # Example: for a 3-master cluster
EXPECTED_SLOTS_OK=16384 # Total slots in a Redis cluster

# Perform checks
ALERT_MSG=""
STATUS=0

if [ "$CLUSTER_STATE" != "ok" ]; then
    ALERT_MSG+="CRITICAL: Redis cluster state is '$CLUSTER_STATE'. "
    STATUS=2
fi

if [ "${SLOTS_FAIL:-0}" -gt 0 ]; then
    ALERT_MSG+="CRITICAL: Redis cluster has $SLOTS_FAIL slots in FAIL state. "
    STATUS=2
fi

if [ "${SLOTS_PFAIL:-0}" -gt 0 ]; then
    ALERT_MSG+="WARNING: Redis cluster has $SLOTS_PFAIL slots in PFAIL state. "
    [ "$STATUS" -lt 1 ] && STATUS=1
fi

# cluster_size only counts masters that serve slots. For a per-node view,
# query CLUSTER NODES and inspect each node's flags.
if [ "${MASTER_NODES:-0}" -lt "$EXPECTED_MASTER_NODES" ]; then
    ALERT_MSG+="WARNING: Expected at least $EXPECTED_MASTER_NODES master nodes, found ${MASTER_NODES:-0}. "
    [ "$STATUS" -lt 1 ] && STATUS=1
fi

if [ "${SLOTS_OK:-0}" -ne "$EXPECTED_SLOTS_OK" ]; then
    ALERT_MSG+="WARNING: Expected $EXPECTED_SLOTS_OK slots OK, found ${SLOTS_OK:-0}. "
    [ "$STATUS" -lt 1 ] && STATUS=1
fi

if [ -n "$ALERT_MSG" ]; then
    echo "$ALERT_MSG"
else
    echo "OK: Redis cluster is healthy ($MASTER_NODES masters, $KNOWN_NODES known nodes)."
fi
exit "$STATUS"

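To integrate this script into your monitoring system, ensure it’s executable (`chmod +x check_redis_cluster_info.sh`) and scheduled to run at regular intervals (e.g., via cron). Nagios and Zabbix can consume its exit code and message directly; for Prometheus with `node_exporter`’s textfile collector, the result first has to be written to a `.prom` file in Prometheus format. Remember to replace placeholders like `your_redis_master_ip` and `your_redis_password` with your actual cluster details, and restrict read access to the script because it contains the Redis password.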
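For a per-node view of master health, `CLUSTER NODES` lists every node together with its flags. The following is a minimal sketch, assuming the standard `CLUSTER NODES` line layout in which the third field holds the comma-separated flags; the function name `count_healthy_masters` is hypothetical and not part of the script above.

```shell
# Count masters that are not flagged as failing.
# Reads `CLUSTER NODES` output on stdin; field 3 holds flags such as
# "myself,master", "slave", or "master,fail".
count_healthy_masters() {
    awk '
        {
            n = split($3, flags, ",")
            is_master = 0
            is_failed = 0
            for (i = 1; i <= n; i++) {
                if (flags[i] == "master") is_master = 1
                if (flags[i] == "fail" || flags[i] == "fail?") is_failed = 1
            }
            if (is_master && !is_failed) count++
        }
        END { print count + 0 }'
}
```

It could be fed with something like `redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" CLUSTER NODES | count_healthy_masters` and the result compared against `EXPECTED_MASTER_NODES`.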

Monitoring WooCommerce Application Performance with APM Tools

WooCommerce, being a PHP application, can benefit immensely from Application Performance Monitoring (APM) tools. These tools go beyond simple server resource monitoring by tracing requests through your application stack, identifying slow database queries, inefficient PHP code, and external API call bottlenecks. For a production environment on OVH, consider solutions like New Relic, Datadog APM, or even self-hosted options like Jaeger or Zipkin integrated with a PHP agent.

Key WooCommerce Metrics to Track

Request Latency: Average and percentile (e.g., 95th, 99th) response times for key WooCommerce endpoints (product pages, cart, checkout, API calls).
Error Rate: Percentage of requests resulting in errors (HTTP 5xx, uncaught exceptions).
Database Query Performance: Slowest queries, query count per request, and time spent in database operations.
External Service Calls: Latency and error rates for calls to payment gateways, shipping APIs, etc.
PHP Execution Time: Time spent within the PHP interpreter for specific transactions.
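APM tools compute percentile latencies for you, but a small calculation makes the metric concrete. The sketch below uses the nearest-rank method on a list of response times, one per line; the function name `percentile` is hypothetical, and the usage note assumes an Nginx `log_format` that writes `$request_time` as the last field, which is not the default.

```shell
# Print the p-th percentile (nearest-rank method) of numbers read from stdin.
# Usage: percentile 95 < values.txt
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
            if (rank < 1) rank = 1
            print v[rank]
        }'
}
```

Under that log-format assumption, `awk '{print $NF}' /var/log/nginx/access.log | percentile 95` would give a rough 95th percentile response time in seconds to compare against what the APM dashboard reports.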

Implementing an APM agent typically involves installing a PHP extension and configuring it to send data to your APM backend. For example, with New Relic, you’d install the `newrelic` PHP extension and configure `newrelic.ini`.

Example `newrelic.ini` Configuration Snippet

[newrelic]
; Required settings
newrelic.license = "YOUR_NEW_RELIC_LICENSE_KEY"
newrelic.appname = "WooCommerce Production"

; Optional settings
; newrelic.high_security = false
; newrelic.enabled = true
; newrelic.loglevel = "info"
; newrelic.transaction_tracer.enabled = true
; newrelic.transaction_tracer.threshold = "500ms" ; Trace transactions slower than 500ms
; newrelic.cross_application_tracer.enabled = true ; Legacy; superseded by distributed tracing
; newrelic.distributed_tracing_enabled = true
; newrelic.error_collector.enabled = true
; newrelic.error_collector.ignore_errors = E_NOTICE | E_WARNING

After configuring the agent and restarting PHP (e.g., the PHP-FPM service behind Nginx), APM data will start flowing. You can then set up dashboards and alerts within your APM provider’s UI to monitor these critical WooCommerce metrics. For instance, an alert could be triggered if the 95th percentile request latency for `/checkout/` exceeds 2 seconds, or if the error rate for API calls surpasses 1%.

OVH Server Resource Monitoring with `node_exporter` and Prometheus

For the underlying OVH virtual machines or bare-metal servers hosting your WooCommerce application and Redis, robust system-level monitoring is essential. Prometheus, coupled with `node_exporter`, is a powerful and widely adopted solution for this. `node_exporter` exposes a wide range of hardware and OS metrics that Prometheus can scrape.

Deploying `node_exporter` on OVH Instances

On each server (WooCommerce web server, Redis nodes), download and run `node_exporter`. For persistent deployment, consider running it as a systemd service.

Installation and Systemd Service (Example for Ubuntu/Debian)

# Download a pinned release (v1.7.0 here; check the releases page for the current version)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Move binary to /usr/local/bin
sudo mv node_exporter /usr/local/bin/

# Create systemd service file
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.textfile.directory=/var/lib/node_exporter/textfile-collector

[Install]
WantedBy=multi-user.target
EOF

# Create textfile collector directory
sudo mkdir -p /var/lib/node_exporter/textfile-collector

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Check status
sudo systemctl status node_exporter

Once `node_exporter` is running, configure Prometheus to scrape these targets. In your `prometheus.yml` configuration:

Prometheus Configuration (`prometheus.yml`)

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - 'your_web_server_ip:9100'
          - 'your_redis_node1_ip:9100'
          - 'your_redis_node2_ip:9100'
          # Add all your server IPs
    metrics_path: /metrics

Key metrics to monitor from `node_exporter` for WooCommerce and Redis include:

CPU Usage: `node_cpu_seconds_total` (breakdown by mode: user, system, idle, iowait). High iowait usually indicates storage contention, on local disks or on network-attached volumes.
Memory Usage: `node_memory_MemAvailable_bytes`, `node_memory_MemFree_bytes`, `node_memory_Buffers_bytes`, `node_memory_Cached_bytes`. Monitor available memory to prevent OOM killer events.
Disk I/O: `node_disk_io_time_seconds_total` (rate of time spent on disk I/O). High values suggest disk bottlenecks.
Network Traffic: `node_network_receive_bytes_total`, `node_network_transmit_bytes_total`. Monitor for saturation or unusual spikes.
Filesystem Usage: `node_filesystem_avail_bytes`, `node_filesystem_size_bytes`. Ensure sufficient free space.
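Before building dashboards, it can help to spot-check these metrics directly on a server. The sketch below reads `node_exporter`’s text output on stdin and derives the percentage of available memory; the function name `mem_available_pct` is hypothetical, and `node_memory_MemTotal_bytes` is assumed to be exposed alongside `node_memory_MemAvailable_bytes`, as it is on typical Linux hosts.

```shell
# Print available memory as a percentage of total memory.
# Reads Prometheus text-format metrics (as served on /metrics) from stdin.
mem_available_pct() {
    awk '
        $1 == "node_memory_MemAvailable_bytes" { avail = $2 }
        $1 == "node_memory_MemTotal_bytes"     { total = $2 }
        END {
            if (total <= 0) exit 1   # metrics missing or unparsable
            printf "%.1f\n", 100 * avail / total
        }'
}
```

On a host running the exporter, `curl -s http://localhost:9100/metrics | mem_available_pct` would print a single number, which also confirms that the exporter is reachable on port 9100.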

For Redis, memory metrics such as `used_memory` (reported by `redis-cli INFO memory`, or exposed as `redis_memory_used_bytes` by the Redis exporter) are crucial. You can expose custom metrics via `node_exporter`’s textfile collector, which reads `*.prom` files from its configured directory but does not execute anything itself. For example, a script run from cron could periodically run `redis-cli INFO memory` and write the result in Prometheus format to a file in `/var/lib/node_exporter/textfile-collector/`.

Custom Redis Memory Metric via Textfile Collector

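#!/bin/bash
# Filename: redis_memory.sh
# Run from cron; writes redis_memory.prom for node_exporter's textfile collector.

REDIS_HOST="your_redis_master_ip"
REDIS_PORT="6379"
REDIS_PASSWORD="your_redis_password"
TEXTFILE_DIR="/var/lib/node_exporter/textfile-collector"
OUTPUT_FILE="$TEXTFILE_DIR/redis_memory.prom"

# Check if redis-cli is available
if ! command -v redis-cli &> /dev/null
then
    # The previous .prom file is left untouched; alert on
    # node_textfile_mtime_seconds to catch stale data.
    exit 1
fi

# Pass the password through the environment instead of the command line
if [ -n "$REDIS_PASSWORD" ]; then
    export REDISCLI_AUTH="$REDIS_PASSWORD"
fi

# Get memory info; Redis ends each line with \r\n, so strip the carriage returns
MEMORY_INFO=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" INFO memory 2>&1 | tr -d '\r')

if echo "$MEMORY_INFO" | grep -q "Could not connect"; then
    exit 1
fi

# Extract used memory
USED_MEMORY=$(echo "$MEMORY_INFO" | grep "^used_memory:" | awk -F: '{print $2}')

# Write to a temporary file and rename it, so node_exporter never reads a
# partial file. The temporary name does not end in .prom and is ignored.
if [[ "$USED_MEMORY" =~ ^[0-9]+$ ]]; then
    TMP_FILE=$(mktemp "$OUTPUT_FILE.XXXXXX") || exit 1
    {
        echo "# HELP redis_used_memory_bytes Memory allocated by Redis in bytes."
        echo "# TYPE redis_used_memory_bytes gauge"
        echo "redis_used_memory_bytes{host=\"$REDIS_HOST\"} $USED_MEMORY"
    } > "$TMP_FILE"
    chmod 644 "$TMP_FILE" # node_exporter runs as an unprivileged user
    mv "$TMP_FILE" "$OUTPUT_FILE"
fi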

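Save this script outside the collector directory (e.g., as `/usr/local/bin/redis_memory.sh`), make it executable, and schedule it with cron (e.g., once a minute). The textfile collector does not execute scripts; it only reads `*.prom` files from the directory passed via `--collector.textfile.directory`, so the script’s output must end up in a file such as `redis_memory.prom` in `/var/lib/node_exporter/textfile-collector/`. Prometheus then picks up the custom metric on its next scrape of `node_exporter`.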
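If alerts should compare usage against the configured limit rather than a fixed number, the same `INFO memory` output also contains `maxmemory`. The following sketch converts both fields into Prometheus exposition lines; the function name `info_memory_to_prom` and the metric name `redis_maxmemory_bytes` are illustrative choices, not names defined by Redis or `node_exporter`.

```shell
# Convert selected fields of `INFO memory` output (read from stdin) into
# Prometheus exposition lines. $1 is the value for the "host" label.
info_memory_to_prom() {
    # Redis ends each line with \r\n, so strip carriage returns first.
    tr -d '\r' | awk -F: -v host="$1" '
        $1 == "used_memory" { printf "redis_used_memory_bytes{host=\"%s\"} %s\n", host, $2 }
        $1 == "maxmemory"   { printf "redis_maxmemory_bytes{host=\"%s\"} %s\n", host, $2 }
    '
}
```

Its output would be redirected into a `.prom` file in the collector directory. Note that `maxmemory` is `0` when no limit is configured, so a ratio-based alert needs to exclude that case.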

Alerting Strategy: Combining Prometheus Alertmanager and APM Alerts

A comprehensive alerting strategy combines system-level alerts from Prometheus/Alertmanager with application-level alerts from your APM tool. This ensures you’re notified of both infrastructure issues and application performance degradations.

Prometheus Alertmanager Configuration Example

Define alert rules in Prometheus (e.g., in `rules.yml`) and configure Alertmanager to route these alerts to appropriate channels (Slack, PagerDuty, email).

groups:
- name: redis_alerts
  rules:
  # redis_up is exposed by the Redis exporter, which must be scraped as its own job
  - alert: RedisClusterDown
    expr: redis_up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Redis cluster node {{ $labels.instance }} is down."
      description: "Redis node {{ $labels.instance }} has been unreachable for 5 minutes."

  - alert: HighRedisMemoryUsage
    expr: redis_used_memory_bytes / (1024*1024) > 80 # Alert if memory > 80% of a hypothetical 100MB limit
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High Redis memory usage on {{ $labels.instance }}."
      description: "Redis instance {{ $labels.instance }} is using {{ $value | printf \"%.2f\" }} MB, exceeding 80% threshold."

- name: node_alerts
  rules:
  - alert: HighCpuLoad
    expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU load on {{ $labels.instance }}."
      description: "CPU load on {{ $labels.instance }} is high (idle rate < 20%) for 10 minutes."

  - alert: LowDiskSpace
    expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on {{ $labels.instance }}."
      description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} has only {{ $value | printf \"%.2f\" }}% free space."

Configure Alertmanager’s `alertmanager.yml` to define receivers (e.g., Slack webhook URL, PagerDuty API key) and routing rules. For example:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'

receivers:
- name: 'default-receiver'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#alerts'
    send_resolved: true

# Example for PagerDuty
# - name: 'pagerduty-receiver'
#   pagerduty_configs:
#   - service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'

Complement these infrastructure alerts with alerts configured directly within your APM tool. For instance, set up an alert in New Relic if the average response time for the checkout process exceeds 3 seconds, or if the error rate for payment gateway integrations goes above 2%. This layered approach covers both the underlying infrastructure and the application logic, so that problems in either your WooCommerce app or your Redis clusters on OVH are caught early.