Server Monitoring Best Practices: Keeping Your WordPress App and Redis Clusters Alive on DigitalOcean

Proactive Redis Cluster Health Checks with `redis-cli` and Custom Scripts

Maintaining the health of your Redis clusters, especially in a distributed setup on DigitalOcean, requires more than just basic uptime checks. We need to go deeper, monitoring key performance indicators (KPIs) and cluster state to preemptively address issues before they impact your WordPress application. This involves leveraging `redis-cli` for direct introspection and augmenting it with custom scripting for automated, actionable insights.

A fundamental check is the cluster’s overall status. For Redis Sentinel, this means ensuring Sentinels are aware of the master and replicas, and that failover mechanisms are ready. For Redis Cluster, it’s about node connectivity and slot distribution.

Sentinel Cluster Status Verification

Connect to one of your Sentinel instances and execute the following commands:

First, check the master’s status and its known replicas:

redis-cli -h <sentinel-host> -p 26379 SENTINEL master mymaster

This command returns the master's attributes, including its IP, port, and `flags`. Crucially, look at the `num-slaves` and `num-other-sentinels` values. If `num-slaves` is zero or lower than the number of replicas you deployed, replica synchronization or connectivity is broken. If `num-other-sentinels` is low, your Sentinel quorum is at risk: a failover must be authorized by a majority of all Sentinels, not just by the configured `quorum`. Running `SENTINEL ckquorum mymaster` confirms that both conditions can currently be met.

Next, verify the health of individual replicas:

redis-cli -h <sentinel-host> -p 26379 SENTINEL replicas mymaster

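This will list all known replicas for `mymaster` (on Redis versions before 5.0 the command is `SENTINEL slaves mymaster`). Examine each replica's status. Look for `master-link-status`, which should be `ok`, and `master-link-down-time`, which should be 0 for healthy replicas. A non-zero value indicates a broken replication link.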
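The two Sentinel checks above can be turned into a small script. The following is a minimal sketch, not a definitive implementation: the function names, the default thresholds, and the assumption that redis-py returns Sentinel state as dicts keyed by the raw field names (`num-slaves`, `master-link-down-time`) are illustrative and should be verified against your redis-py version.

```python
from typing import Dict, List


def evaluate_sentinel_state(master: Dict, replicas: List[Dict],
                            expected_replicas: int = 2,
                            min_other_sentinels: int = 2) -> List[str]:
    """Return a list of problems; an empty list means the checks passed.

    The default thresholds are placeholders for a 1-master, 2-replica,
    3-Sentinel deployment and should be adjusted to your topology.
    """
    problems = []

    num_replicas = int(master.get("num-slaves", 0))
    if num_replicas < expected_replicas:
        problems.append(
            f"master reports {num_replicas} replicas, expected {expected_replicas}")

    num_sentinels = int(master.get("num-other-sentinels", 0))
    if num_sentinels < min_other_sentinels:
        problems.append(
            f"only {num_sentinels} other Sentinels known, need {min_other_sentinels}")

    for replica in replicas:
        link_down = int(replica.get("master-link-down-time", 0))
        if link_down > 0:
            problems.append(
                f"replica {replica.get('ip')}:{replica.get('port')} has "
                f"master-link-down-time={link_down}")
    return problems


def fetch_and_evaluate(sentinel_host: str, service: str = "mymaster") -> List[str]:
    """Query a live Sentinel (port 26379 assumed) and evaluate its view."""
    import redis  # imported lazily so the evaluation logic works without redis-py

    sentinel = redis.Redis(host=sentinel_host, port=26379, decode_responses=True)
    return evaluate_sentinel_state(
        sentinel.sentinel_master(service),
        sentinel.sentinel_slaves(service),
    )
```

Keeping the evaluation separate from the connection code means the thresholds can be exercised with hand-written dicts before the script is pointed at a real Sentinel.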

Redis Cluster Node and Slot Status

For Redis Cluster deployments, the focus shifts to node connectivity and the distribution of hash slots.

Connect to any node in the cluster and run:

redis-cli -h <cluster-node-host> -p 6379 CLUSTER INFO

Key metrics to monitor here are:

cluster_state: Should be ok. If it’s fail, the node is refusing queries, typically because some slots are uncovered or a majority of masters is unreachable.
cluster_slots_assigned: Should equal 16384 (the total number of slots). If it’s less, some slots are unassigned, and keys that hash to them cannot be served.
cluster_slots_ok: Should equal 16384.
cluster_slots_pfail: Should be 0. A non-zero value counts slots served by a node flagged PFAIL (“possible failure”): this node cannot reach it, but the failure has not yet been confirmed by the other masters.
cluster_slots_fail: Should be 0. A non-zero value counts slots served by a node flagged FAIL, meaning a majority of masters agree it is unreachable.
cluster_known_nodes: The total number of nodes this node knows about. Ensure this matches your expected count.
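As an illustration of how these fields can be checked mechanically, here is a minimal sketch that parses the raw `key:value` text printed by `CLUSTER INFO` and applies the thresholds listed above. The function names are hypothetical, and the node count is read from `cluster_known_nodes`, which is the field name `CLUSTER INFO` uses for it.

```python
TOTAL_SLOTS = 16384


def parse_cluster_info(raw: str) -> dict:
    """Turn the 'key:value' lines of CLUSTER INFO into a dict of strings."""
    info = {}
    for line in raw.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:  # skip blank or malformed lines
            info[key] = value
    return info


def cluster_info_problems(info: dict, expected_nodes: int) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if info.get("cluster_state") != "ok":
        problems.append(f"cluster_state is {info.get('cluster_state')!r}")

    # Both counters must cover the full slot space.
    for field in ("cluster_slots_assigned", "cluster_slots_ok"):
        value = int(info.get(field, 0))
        if value != TOTAL_SLOTS:
            problems.append(f"{field} is {value}, expected {TOTAL_SLOTS}")

    # Any slot on a PFAIL or FAIL node is a problem.
    for field in ("cluster_slots_pfail", "cluster_slots_fail"):
        value = int(info.get(field, 0))
        if value != 0:
            problems.append(f"{field} is {value}, expected 0")

    known = int(info.get("cluster_known_nodes", 0))
    if known != expected_nodes:
        problems.append(f"cluster_known_nodes is {known}, expected {expected_nodes}")
    return problems
```

The parser works on the output of `redis-cli` as captured by any wrapper, so it does not depend on a particular client library.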

To get a detailed view of slot distribution and node status, use:

redis-cli -h <cluster-node-host> -p 6379 CLUSTER NODES

This output is crucial. Each line represents a node. Look for:

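Link state: The eighth field on each line should read connected for every node; disconnected means this node has lost its cluster bus link to that peer.
master/slave roles: Verify the correct master-replica relationships in the flags field and the master ID that follows it.
Slot assignments: Check that the slot ranges listed on the master lines together cover all 16384 slots and that no node carries the fail flag.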
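Checking slot coverage by eye gets tedious beyond a few nodes. The sketch below parses raw `CLUSTER NODES` text and reports disconnected links, failure flags, and uncovered slots. It assumes the documented line layout (ID, address as `ip:port@cport`, flags, master ID, ping-sent, pong-recv, config-epoch, link-state, then slots); the function names are illustrative.

```python
TOTAL_SLOTS = 16384


def parse_cluster_nodes(raw: str) -> list:
    """Parse CLUSTER NODES text into a list of per-node dicts."""
    nodes = []
    for line in raw.strip().splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # not a complete node line
        slots = set()
        for token in parts[8:]:
            if token.startswith("["):
                continue  # migrating/importing marker, not an owned slot
            start, _, end = token.partition("-")
            slots.update(range(int(start), int(end or start) + 1))
        nodes.append({
            "id": parts[0],
            "addr": parts[1].split("@")[0],  # drop the cluster bus port
            "flags": parts[2].split(","),
            "master_id": parts[3],
            "link_state": parts[7],
            "slots": slots,
        })
    return nodes


def cluster_nodes_problems(nodes: list) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    covered = set()
    for node in nodes:
        covered |= node["slots"]
        if node["link_state"] != "connected":
            problems.append(f"{node['addr']} link is {node['link_state']}")
        if "fail" in node["flags"] or "fail?" in node["flags"]:
            problems.append(f"{node['addr']} is flagged {','.join(node['flags'])}")
    missing = TOTAL_SLOTS - len(covered)
    if missing:
        problems.append(f"{missing} slots are not assigned to any node")
    return problems
```

The `fail?` flag is how `CLUSTER NODES` displays a PFAIL node, so the check treats both suspected and confirmed failures as problems; relax that if PFAIL should only be a warning in your setup.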

Automating Health Checks with Python and `redis-py`

Manually running these commands is insufficient for production systems. We need automated checks that can trigger alerts. A Python script using the `redis-py` library is an excellent choice for this.

First, ensure you have `redis-py` installed:

pip install redis

Here’s a Python script to check Redis Cluster health. This script connects to a cluster node and inspects `CLUSTER INFO` and `CLUSTER NODES` output. It can be easily adapted for Sentinel.

import redis
import sys

# Configuration
REDIS_HOST = 'your_redis_cluster_node_ip'
REDIS_PORT = 6379
EXPECTED_NODES = 3 # Example: For a 3-node cluster
EXPECTED_SLOTS = 16384

def check_redis_cluster_health(host, port):
    try:
        r = redis.Redis(host=host, port=port, decode_responses=True)
        
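        # Check CLUSTER INFO. The "cluster" section of INFO only reports
        # cluster_enabled, so the CLUSTER INFO command is issued directly.
        cluster_info = r.execute_command('CLUSTER INFO')
        if isinstance(cluster_info, str):
            # Some client versions return the raw reply; parse key:value lines
            cluster_info = dict(
                line.split(':', 1) for line in cluster_info.splitlines() if ':' in line
            )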
        
        if cluster_info.get('cluster_state') != 'ok':
            print(f"CRITICAL: Cluster state is not 'ok'. Current state: {cluster_info.get('cluster_state')}", file=sys.stderr)
            return False
        
        assigned_slots = int(cluster_info.get('cluster_slots_assigned', 0))
        if assigned_slots != EXPECTED_SLOTS:
            print(f"CRITICAL: Not all slots assigned. Assigned: {assigned_slots}, Expected: {EXPECTED_SLOTS}", file=sys.stderr)
            return False
            
        pfail_slots = int(cluster_info.get('cluster_slots_pfail', 0))
        if pfail_slots > 0:
            print(f"WARNING: {pfail_slots} slots are in PFAIL state. Investigating further.", file=sys.stderr)
            # Depending on policy, this might be critical or warning
            
        fail_slots = int(cluster_info.get('cluster_slots_fail', 0))
        if fail_slots > 0:
            print(f"CRITICAL: {fail_slots} slots are in FAIL state.", file=sys.stderr)
            return False

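        # Check CLUSTER NODES for connectivity and node count.
        # Depending on the redis-py version, the reply is either the raw text
        # or a dict already parsed by the client, so both shapes are handled.
        nodes_output = r.execute_command('CLUSTER NODES')
        nodes = {}
        if isinstance(nodes_output, dict):
            for addr, details in nodes_output.items():
                nodes[details.get('node_id', addr)] = {
                    'addr': addr,
                    'flags': str(details.get('flags', '')).split(','),
                    'connected': bool(details.get('connected')),
                }
        else:
            for line in nodes_output.strip().splitlines():
                parts = line.split()
                if len(parts) < 8:
                    continue
                nodes[parts[0]] = {
                    # Address format is ip:port@cport[,hostname]
                    'addr': parts[1].split('@')[0],
                    'flags': parts[2].split(','),
                    'master_id': parts[3],
                    # Link state is the eighth field, not one of the flags
                    'connected': parts[7] == 'connected',
                }

        for node_id, node in nodes.items():
            if not node['connected'] or 'fail' in node['flags']:
                print(f"CRITICAL: Node {node_id} ({node['addr']}) is disconnected or flagged as failed.", file=sys.stderr)
                return False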

        if len(nodes) != EXPECTED_NODES:
            print(f"WARNING: Node count mismatch. Found: {len(nodes)}, Expected: {EXPECTED_NODES}", file=sys.stderr)
            # This might be a warning if nodes are temporarily down but expected to recover

        print("Redis Cluster health check passed.")
        return True

    except redis.exceptions.ConnectionError as e:
        print(f"CRITICAL: Could not connect to Redis at {host}:{port}. Error: {e}", file=sys.stderr)
        return False
    except Exception as e:
        print(f"CRITICAL: An unexpected error occurred: {e}", file=sys.stderr)
        return False

if __name__ == "__main__":
    if check_redis_cluster_health(REDIS_HOST, REDIS_PORT):
        sys.exit(0)
    else:
        sys.exit(1)

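This script can be scheduled using `cron` on a monitoring server or even on one of your less critical application nodes, for example every minute. Keep in mind that cron does nothing useful with a non-zero exit code by itself. To turn a failure into an alert, either run the script as a check under Nagios or Zabbix (Nagios-style plugins treat exit code 1 as WARNING and 2 as CRITICAL), or have it publish its result as a metric that Prometheus scrapes and Alertmanager alerts on.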
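One way to bridge a cron job and Prometheus is the Node Exporter textfile collector, which exposes any `*.prom` file found in the directory given by `--collector.textfile.directory`. The sketch below writes the health result as a gauge; the function name, metric names, and directory are assumptions chosen for illustration.

```python
import os
import tempfile
import time


def write_health_metric(healthy: bool, directory: str,
                        name: str = "redis_cluster_health") -> str:
    """Atomically write <name>.prom into the textfile collector directory."""
    final_path = os.path.join(directory, f"{name}.prom")
    body = (
        f"# HELP {name}_ok 1 if the last Redis Cluster health check passed.\n"
        f"# TYPE {name}_ok gauge\n"
        f"{name}_ok {1 if healthy else 0}\n"
        f"# HELP {name}_last_run_timestamp_seconds Unix time of the last check.\n"
        f"# TYPE {name}_last_run_timestamp_seconds gauge\n"
        f"{name}_last_run_timestamp_seconds {int(time.time())}\n"
    )
    # Write to a temporary file first so the exporter never reads a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter must be able to read it
    os.replace(tmp_path, final_path)
    return final_path
```

In the health check script, this could be called as `write_health_metric(check_redis_cluster_health(REDIS_HOST, REDIS_PORT), "/var/lib/node_exporter/textfile")`, with the path being whatever directory your Node Exporter is configured to read. Alerting on the timestamp gauge as well catches the case where the cron job itself stops running.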

WordPress Application Performance Monitoring (APM) with New Relic and Query Analysis

For the WordPress application itself, basic server metrics (CPU, RAM, disk I/O) are essential but insufficient. We need to understand application-level performance, identify slow database queries, and pinpoint bottlenecks within the PHP execution. New Relic is a powerful APM tool that provides deep insights into WordPress performance.

New Relic Agent Installation and Configuration

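On each of your WordPress web servers (e.g., Droplets running Nginx/Apache and PHP-FPM), install the New Relic PHP agent. The exact installation steps vary with your OS and PHP version, and New Relic has changed its installers over time, so confirm the download URL and arguments against the current PHP agent installation documentation before piping anything into a root shell. In general, installation involves downloading and running an installer script.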

# Example for Ubuntu/Debian with PHP 8.1
curl -Ls https://download.newrelic.com/install/newrelic-php5/agent/newrelic-install.sh | sudo bash /dev/stdin YOUR_LICENSE_KEY YOUR_APP_NAME

Replace YOUR_LICENSE_KEY with your actual New Relic license key and YOUR_APP_NAME with a descriptive name for your WordPress application (e.g., my-wordpress-prod).

After installation, the agent is typically loaded through a `newrelic.ini` file in PHP's `conf.d` directory (or through entries added to `php.ini`), which also holds the license key and application name as `newrelic.license` and `newrelic.appname`. You’ll need to restart your web server and PHP-FPM service for the changes to take effect:

sudo systemctl restart nginx # or apache2
sudo systemctl restart php8.1-fpm # Adjust version as needed

Analyzing Slow WordPress Database Queries

Once the New Relic agent is active, you’ll see your WordPress application appear in the New Relic dashboard. Navigate to the “Databases” section for your application. This view is invaluable for identifying slow SQL queries originating from WordPress plugins, themes, or core functionality.

Look for queries with high “Average duration” or those that appear frequently in the “Slowest queries” list. Common culprits include:

Inefficient `WP_Query` calls in custom code or plugins that perform complex joins or fetch excessive data.
Plugins that repeatedly query for the same data without proper caching.
Theme functions that execute database queries on every page load.
Lack of proper indexing on custom database tables if you’re using them.

When you identify a slow query, the next step is to investigate its origin. New Relic often provides a “Trace” or “Transaction” link associated with the slow query. Clicking this will show you the specific PHP function calls that led to that query being executed. This is where you can pinpoint the problematic plugin or theme code.

For example, you might see a query like this in New Relic:

SELECT option_value FROM wp_options WHERE option_name = 'my_plugin_setting' LIMIT 1;

If this query is slow and executed frequently, you’d trace it back to the `my_plugin` code. The solution might involve:

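Caching the value with the Transients API. With a persistent object cache such as Redis in place, transients are served from the cache instead of the `wp_options` table.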
Ensuring the setting is only retrieved when necessary, not on every page load.
Optimizing the plugin’s logic if it’s fetching data in an inefficient loop.

Leveraging New Relic Alerts for Proactive Intervention

New Relic’s alerting capabilities are crucial for proactive monitoring. Configure alerts for:

Apdex Score Degradation: A sudden drop in your application’s Apdex score indicates a widespread performance issue.
High Error Rate: PHP errors, fatal errors, or uncaught exceptions.
Slow Transaction Traces: Alerts when specific transactions exceed a defined duration threshold.
Database Call Thresholds: Alerts when the number of database calls or their total duration exceeds a limit.
External Service Latency: If your WordPress app relies on external APIs, monitor their response times.

These alerts can be configured to notify your team via email, Slack, PagerDuty, or other channels, allowing for rapid response to performance degradations or outages.
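When choosing an Apdex alert threshold, it helps to remember how the score is derived: requests at or below the threshold T count as satisfied, requests between T and 4T count as tolerating and are weighted by half, and anything slower counts as frustrated. The sketch below reproduces that standard formula; the function name and the 0.5 second threshold in the usage note are only examples.

```python
def apdex(response_times, threshold):
    """Compute the Apdex score for response times given in seconds.

    satisfied:  t <= T
    tolerating: T < t <= 4T (counted as half)
    frustrated: t > 4T (counted as zero)
    """
    if not response_times:
        raise ValueError("at least one response time is required")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)
```

For example, with T = 0.5 and response times of 0.2, 0.4, 0.6, 1.5, and 2.5 seconds, two requests are satisfied, two are tolerating, and one is frustrated, giving (2 + 2/2) / 5 = 0.6.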

DigitalOcean Droplet and Load Balancer Metrics with `doctl` and Prometheus

Beyond application-specific monitoring, we need to keep an eye on the underlying infrastructure: the DigitalOcean Droplets and Load Balancers. While DigitalOcean provides a basic metrics dashboard, integrating with a more robust monitoring solution like Prometheus offers greater flexibility and deeper insights.

Exporting Droplet Metrics to Prometheus

The standard way to get system-level metrics into Prometheus is by running node exporters on each Droplet. For DigitalOcean, we can also leverage the DigitalOcean API to pull metrics about Droplets and Load Balancers.

1. Node Exporter on Droplets:

Install the Prometheus Node Exporter on each of your WordPress web servers and Redis nodes. This exporter provides metrics like CPU usage, memory, disk I/O, network traffic, and more.

# Download the latest release (check Prometheus website for current version)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Move to /usr/local/bin
sudo mv node_exporter /usr/local/bin/

# Create a systemd service file
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
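# Group= is omitted so systemd uses the user's primary group
# (Debian/Ubuntu have no "nobody" group; it is called "nogroup" there)
User=nobody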
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter

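Ensure that port 9100 is reachable from your Prometheus server, and only from it: restrict the firewall rule to the Prometheus server's address (ideally on the private VPC network) rather than opening the port to the internet, since the exporter serves its metrics without authentication by default.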

2. DigitalOcean Exporter:

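To pull metrics directly from the DigitalOcean API, you can run a community-maintained `digitalocean_exporter`. This requires a DigitalOcean API token with read-only access. Release URLs, flags, default ports, and metric names differ between exporters and versions, so treat the values below as placeholders to check against the README of the exporter you pick.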

# Example installation and configuration for digitalocean_exporter
# (Refer to the exporter's GitHub repository for the most up-to-date instructions)

# Download and install
wget https://github.com/prometheus-community/digitalocean_exporter/releases/download/v0.7.0/digitalocean_exporter-0.7.0.linux-amd64.tar.gz
tar xvfz digitalocean_exporter-0.7.0.linux-amd64.tar.gz
sudo mv digitalocean_exporter /usr/local/bin/

# Create a systemd service file (the quoted 'EOF' stops the shell from expanding $DO_API_TOKEN)
sudo tee /etc/systemd/system/digitalocean_exporter.service <<'EOF'
[Unit]
Description=DigitalOcean Exporter
Wants=network-online.target
After=network-online.target

[Service]
# Group= is omitted so systemd uses the user's primary group
User=nobody
Type=simple
Environment="DO_API_TOKEN=YOUR_DIGITALOCEAN_API_TOKEN"
ExecStart=/usr/local/bin/digitalocean_exporter --digitalocean.client-id=YOUR_CLIENT_ID --digitalocean.api-token=$DO_API_TOKEN
# If using personal access token, client-id might not be needed or can be empty.
# Check exporter documentation.

[Install]
WantedBy=multi-user.target
EOF

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable digitalocean_exporter
sudo systemctl start digitalocean_exporter
sudo systemctl status digitalocean_exporter

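Replace YOUR_DIGITALOCEAN_API_TOKEN and YOUR_CLIENT_ID with your actual credentials. The listen port depends on the exporter and its flags; the examples here assume 9400, so confirm the actual port in the exporter's startup log or documentation.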

Configuring Prometheus Scrape Jobs

In your Prometheus configuration file (e.g., /etc/prometheus/prometheus.yml), add scrape jobs for both the node exporters and the DigitalOcean exporter:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['<droplet-1-ip>:9100', '<droplet-2-ip>:9100', '<droplet-3-ip>:9100', '<droplet-4-ip>:9100'] # Add all your Droplet IPs

  - job_name: 'digitalocean'
    static_configs:
      - targets: ['<exporter-droplet-ip>:9400'] # IP of the Droplet running digitalocean_exporter

After updating the Prometheus configuration, validate it with `promtool check config /etc/prometheus/prometheus.yml`, then reload it:

sudo systemctl reload prometheus
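Maintaining a hand-edited list of Droplet IPs gets error-prone as the fleet changes. One alternative, sketched below under assumptions, is to generate a target file for Prometheus's file-based service discovery, which a scrape job can reference with `file_sd_configs` instead of `static_configs` and which Prometheus re-reads without a reload. The function name, file path, and labels are hypothetical.

```python
import json
import os
import tempfile


def write_file_sd_targets(ips, path, port=9100, labels=None):
    """Write a file_sd target group for the given IPs and return it.

    The output is a JSON list of {"targets": [...], "labels": {...}} objects,
    which is the format Prometheus expects for file-based discovery.
    """
    groups = [{
        "targets": [f"{ip}:{port}" for ip in ips],
        "labels": labels or {},
    }]
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file and rename, so Prometheus never sees a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; Prometheus must be able to read it
    os.replace(tmp_path, path)
    return groups
```

The list of IPs could come from an inventory file or from the DigitalOcean API; the target file must end in `.json`, `.yml`, or `.yaml` for Prometheus to accept it.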

Monitoring DigitalOcean Load Balancers

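The `digitalocean_exporter` will expose metrics related to your Load Balancers. Exact metric names depend on the exporter and version, so check its `/metrics` output; they look something like: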

digitalocean_loadbalancer_requests_total: Total number of requests.
digitalocean_loadbalancer_bytes_total: Total bytes transferred.
digitalocean_loadbalancer_health_status: Health status of backend Droplets.
digitalocean_loadbalancer_forwarding_rule_health_status: Health status of forwarding rules.

These metrics are invaluable for understanding traffic patterns, identifying potential overload on your load balancers, and detecting issues with backend Droplet health as seen by the load balancer.

Alerting with Prometheus Alertmanager

Configure Prometheus to send alerts to Alertmanager based on these metrics. For example, you might want alerts for:

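High CPU utilization on WordPress Droplets (e.g., avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.1, meaning less than 10% idle, sustained for 5 minutes). The metric is a counter, so it has to be wrapped in rate() before it can be compared against a threshold.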
Low disk space on any Droplet (e.g., node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10).
Redis nodes becoming unreachable (if using a Redis exporter or checking `node_exporter` metrics for Redis process).
Load balancer backend Droplets failing health checks (e.g., digitalocean_loadbalancer_health_status == 0).
High network traffic spikes that might indicate a DDoS attack or misconfiguration.
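A note on the CPU alert: `node_cpu_seconds_total` is a cumulative counter of seconds spent in each mode, so an alert has to compare how fast the idle counter grows rather than its raw value. In PromQL that is what `rate(node_cpu_seconds_total{mode="idle"}[5m])` does. The arithmetic behind it can be sketched as follows; the function names and sample numbers are illustrative.

```python
def idle_fraction(prev_idle_seconds, curr_idle_seconds, elapsed_seconds, cpu_count):
    """Fraction of CPU time spent idle between two counter samples.

    prev/curr are the idle counters summed over all cores; each core can
    accumulate at most elapsed_seconds of idle time in the interval.
    """
    if elapsed_seconds <= 0 or cpu_count <= 0:
        raise ValueError("elapsed_seconds and cpu_count must be positive")
    return (curr_idle_seconds - prev_idle_seconds) / (elapsed_seconds * cpu_count)


def cpu_busy_percent(prev_idle_seconds, curr_idle_seconds, elapsed_seconds, cpu_count):
    """CPU utilization in percent, derived from the idle fraction."""
    fraction = idle_fraction(prev_idle_seconds, curr_idle_seconds,
                             elapsed_seconds, cpu_count)
    return 100.0 * (1.0 - fraction)
```

For a 2-core Droplet whose summed idle counter moves from 1000 to 1060 seconds over a 300 second window, the idle fraction is 60 / 600 = 0.1, so the Droplet was about 90% busy and right at the alert threshold.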

By combining these layers of monitoring—application-level (New Relic), cluster-level (Redis CLI/scripts), and infrastructure-level (Prometheus/Node Exporter)—you build a robust, proactive monitoring strategy for your WordPress application and Redis clusters on DigitalOcean.