Server Monitoring Best Practices: Keeping Your WordPress App and Elasticsearch Clusters Alive on DigitalOcean

Proactive Elasticsearch Health Checks with `curl` and `jq`

Maintaining the health of an Elasticsearch cluster, especially one supporting a high-traffic WordPress application, requires more than just basic CPU and memory monitoring. Elasticsearch has its own internal metrics that, when tracked, can predict and prevent issues before they impact your application. We’ll leverage `curl` to query the Elasticsearch API and `jq` for parsing the JSON output to extract critical health indicators.

A fundamental check is the cluster health API. This endpoint provides a snapshot of the cluster’s status (green, yellow, or red), the number of nodes, and shard allocation status. A ‘red’ status indicates that some primary shards are not allocated, meaning data is unavailable. A ‘yellow’ status means all primary shards are allocated, but some replica shards are missing, posing a risk of data loss if a node fails.

Automating Cluster Health Checks

We can script these checks to run periodically. The following Bash script uses `curl` to hit the `_cluster/health` endpoint and `jq` to extract the status. It then exits with a non-zero status code if the cluster is not in a ‘green’ state, making it suitable for integration with monitoring systems like Nagios, Zabbix, or even a simple cron job with alerting.

Ensure you replace http://localhost:9200 with the actual endpoint of your Elasticsearch cluster. If your cluster uses authentication, you’ll need to add appropriate headers to the `curl` command (e.g., -u 'elastic:changeme' or -H 'Authorization: Basic ...').

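The script first fetches the cluster health, then uses `jq` to select the ‘status’ field. Because the script is meant to plug into systems like Nagios, its exit codes follow the Nagios plugin convention: 0 for ‘green’, 1 (warning) for ‘yellow’, and 2 (critical) for ‘red’ or when the cluster cannot be reached at all.

#!/bin/bash

# Elasticsearch cluster health check script
# Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL (Nagios plugin convention)

ES_URL="http://localhost:9200" # Replace with your Elasticsearch endpoint

# Fetch cluster health and extract status
# --fail makes curl return nothing on HTTP errors; --max-time avoids hanging on a dead node
HEALTH_STATUS=$(curl -s --fail --max-time 10 "${ES_URL}/_cluster/health" | jq -r '.status')

case "$HEALTH_STATUS" in
  green)
    echo "OK: Elasticsearch cluster is green."
    exit 0
    ;;
  yellow)
    echo "WARNING: Elasticsearch cluster is yellow (some replica shards are unassigned)."
    exit 1
    ;;
  *)
    # Covers 'red' as well as an empty or unparseable response (cluster unreachable)
    echo "CRITICAL: Elasticsearch cluster is not healthy. Status: ${HEALTH_STATUS:-unknown}"
    exit 2
    ;;
esac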
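When a check like this runs from cron, a single failed poll can be a transient network blip rather than a real outage. A minimal sketch of one way to soften that, assuming a hypothetical helper named `run_with_retries` (not part of any monitoring tool), is to re-run the check a few times and only report failure if every attempt fails. The `flaky_check` function below is invented purely to demonstrate the behavior.

```bash
#!/bin/bash
# Hypothetical helper: run a check up to <max_attempts> times, pausing
# <delay_seconds> between attempts. Returns 0 on the first success,
# otherwise the exit code of the last failed attempt.
run_with_retries() {
  local max_attempts=$1 delay=$2 attempt=1 rc=0
  shift 2
  while (( attempt <= max_attempts )); do
    "$@" && return 0
    rc=$?
    if (( attempt < max_attempts )); then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return "$rc"
}

# Invented stand-in for a real check: fails twice, then succeeds.
calls=0
flaky_check() {
  calls=$((calls + 1))
  (( calls >= 3 ))
}

if run_with_retries 5 0 flaky_check; then
  echo "Check passed after $calls attempt(s)"
fi
```

In a cron job you might wrap the health check script the same way, for example `run_with_retries 3 10 /path/to/your/health_check.sh`, where the path is a placeholder for wherever you saved the script.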

Monitoring Elasticsearch Node Statistics

Beyond cluster-wide health, individual node statistics are crucial. High JVM heap usage, excessive garbage collection activity, or a disproportionate share of shards on a specific node can indicate underlying problems (unassigned shards, by contrast, belong to no node and show up in the cluster health output). The `_nodes/stats` API provides detailed metrics for each node.

We can monitor JVM heap usage, which is a common bottleneck. High heap usage can lead to frequent garbage collection pauses, impacting query latency and overall cluster responsiveness. A threshold of 80-90% is often considered a warning level.

#!/bin/bash

# Elasticsearch JVM heap usage check script

ES_URL="http://localhost:9200" # Replace with your Elasticsearch endpoint
HEAP_THRESHOLD=90 # Percentage

# Fetch node stats and iterate through each node
curl -s -X GET "${ES_URL}/_nodes/stats/jvm" | jq -c '.nodes[] | {host: .host, heap_used_percent: .jvm.mem.heap_used_percent}' | while read -r NODE_STATS; do
  HOST=$(echo "$NODE_STATS" | jq -r '.host')
  HEAP_USED=$(echo "$NODE_STATS" | jq -r '.heap_used_percent')

  if (( $(echo "$HEAP_USED > $HEAP_THRESHOLD" | bc -l) )); then
    echo "WARNING: Elasticsearch node $HOST has high JVM heap usage: ${HEAP_USED}%"
    # In a real monitoring setup, you might want to exit with a non-zero code here
    # or collect these warnings and report them collectively.
  else
    echo "OK: Elasticsearch node $HOST JVM heap usage: ${HEAP_USED}%"
  fi
done

This script iterates through each node, extracts its hostname and JVM heap usage percentage, and flags nodes exceeding the defined threshold. Elasticsearch reports `heap_used_percent` as a whole number, so a plain integer test would also work; `bc -l` is used so the comparison stays valid if you set a fractional threshold.
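One caveat with the loop above: because `while` sits on the right-hand side of a pipe, it runs in a subshell, so an `exit` or a counter set inside it never reaches the parent script. A minimal sketch of one way around this is to put the loop in a function that reads from standard input and returns non-zero when any node is over the threshold. The function name `check_heap_usage` and the sample host data are invented for illustration; in practice the input could come from something like `jq -r '.nodes[] | [.host, .jvm.mem.heap_used_percent] | @tsv'`.

```bash
#!/bin/bash
# Sketch: report every node over the heap threshold and return non-zero
# if at least one was found.
# Input: one "host percent" pair per line on stdin (integer percentages).
check_heap_usage() {
  local threshold=$1 offenders=0 host pct
  while read -r host pct; do
    [ -z "$host" ] && continue
    if (( pct > threshold )); then
      echo "WARNING: node $host heap at ${pct}%"
      offenders=$((offenders + 1))
    fi
  done
  echo "Nodes over ${threshold}%: $offenders"
  (( offenders == 0 ))
}

# Invented sample data standing in for real node statistics.
check_heap_usage 90 <<'EOF' || echo "At least one node needs attention"
es-node-1 72
es-node-2 93
es-node-3 95
EOF
```

The function's return status is what a monitoring system would act on, while the printed lines give the operator the full list of affected nodes in one report.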

WordPress Application Monitoring on DigitalOcean

For the WordPress application itself, standard server metrics are essential, but we also need to consider WordPress-specific performance indicators. On DigitalOcean, this typically involves monitoring Droplet resource utilization (CPU, RAM, Disk I/O, Network) and then diving deeper into the web server (Nginx/Apache), PHP-FPM, and MySQL performance.

Nginx/Apache Performance Metrics

Web server logs and status endpoints are a goldmine for performance insights. Monitoring the number of active connections, request rates, and error rates (4xx, 5xx) can quickly reveal issues. For Nginx, the `stub_status` module exposes live connection and request counters.

First, ensure your Nginx build includes the `stub_status` module (`--with-http_stub_status_module`, present in most distribution packages). Then add the following to a `server` block in your `nginx.conf` or a site-specific configuration file:

# Inside a server block
location /nginx_status {
    stub_status;
    allow 127.0.0.1; # Restrict access to localhost for security
    deny all;
}

Then, you can use `curl` to fetch these metrics:

#!/bin/bash

# Nginx stub_status check

NGINX_STATUS_URL="http://localhost/nginx_status" # Adjust if Nginx is not on localhost or path is different

# Fetch Nginx status
STATUS_OUTPUT=$(curl -s "$NGINX_STATUS_URL")

# Parse the output
# Line 1 reads "Active connections: N"; line 3 holds the accepts, handled, and requests counters
ACTIVE_CONNECTIONS=$(echo "$STATUS_OUTPUT" | awk '/^Active connections:/ {print $3}')
REQUESTS=$(echo "$STATUS_OUTPUT" | awk 'NR == 3 {print $3}')
# Handle potential errors in parsing if output format changes
if [ -z "$ACTIVE_CONNECTIONS" ] || [ -z "$REQUESTS" ]; then
  echo "ERROR: Could not parse Nginx status output."
  exit 1
fi

echo "Nginx Active Connections: $ACTIVE_CONNECTIONS"
echo "Nginx Requests: $REQUESTS"

# Example: Alert if active connections exceed a threshold
MAX_CONNECTIONS=500 # Adjust as needed
if [ "$ACTIVE_CONNECTIONS" -gt "$MAX_CONNECTIONS" ]; then
  echo "WARNING: Nginx active connections ($ACTIVE_CONNECTIONS) exceed threshold ($MAX_CONNECTIONS)."
  # exit 1 # Uncomment to make this a critical alert
fi
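The `stub_status` page carries more than the two values used above. As a rough sketch, the function below (the name `parse_stub_status` is made up here) pulls every counter out of the four-line output with `awk` and also derives the number of dropped connections as accepts minus handled, which Nginx documents as diverging only when a resource limit such as `worker_connections` has been hit. The sample numbers are invented.

```bash
#!/bin/bash
# Sketch: parse the full stub_status output from stdin into key=value pairs.
parse_stub_status() {
  awk '
    /^Active connections:/ { active = $3 }
    NR == 3                { accepts = $1; handled = $2; requests = $3 }
    /^Reading:/            { reading = $2; writing = $4; waiting = $6 }
    END {
      printf "active=%d accepts=%d handled=%d requests=%d dropped=%d reading=%d writing=%d waiting=%d\n",
             active, accepts, handled, requests, accepts - handled, reading, writing, waiting
    }'
}

# Invented sample in the format that stub_status produces.
parse_stub_status <<'EOF'
Active connections: 12 
server accepts handled requests
 1000 998 4200 
Reading: 1 Writing: 3 Waiting: 8 
EOF
```

Against a live server the same function can be fed with `curl -s http://localhost/nginx_status | parse_stub_status`. Note that accepts, handled, and requests are cumulative since Nginx started, so a rate requires comparing two samples taken some time apart.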

PHP-FPM Monitoring

PHP-FPM’s performance is critical for WordPress. Monitoring the number of active processes, idle processes, and slow requests can help identify bottlenecks. PHP-FPM can serve a status page over the same Unix socket or TCP port it uses for FastCGI requests.

First, configure PHP-FPM to expose its status. In your `php-fpm.conf` or `pool.d/*.conf` file, add or modify the following:

; For TCP socket (e.g., on port 9000)
listen = 127.0.0.1:9000
pm.status_path = /status

; For Unix socket (e.g., /run/php/php7.4-fpm.sock)
; listen = /run/php/php7.4-fpm.sock
; pm.status_path = /status

You’ll also need to configure your web server (Nginx/Apache) to proxy requests to this status path. For Nginx:

location ~ ^/status(/.*)?$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass unix:/run/php/php7.4-fpm.sock; # Or your TCP listener: 127.0.0.1:9000
    # Restrict to local requests; "internal" would reject the curl check with a 404
    allow 127.0.0.1;
    deny all;
}

With the status endpoint configured, you can fetch the metrics. PHP-FPM speaks FastCGI rather than HTTP, so `curl` must request the status path through the web server instead of connecting to the PHP-FPM socket or port directly.

#!/bin/bash

# PHP-FPM status check

# Assuming Nginx is configured to proxy /status to PHP-FPM
PHP_FPM_STATUS_URL="http://localhost/status?json" # 'json' for easier parsing; append '&full' for per-process details
MAX_CHILDREN=50 # Placeholder: set to pm.max_children from your pool config (the status page does not report it)

# Fetch PHP-FPM status
STATUS_OUTPUT=$(curl -s "$PHP_FPM_STATUS_URL")

# Parse JSON output
if ! echo "$STATUS_OUTPUT" | jq -e . > /dev/null 2>&1; then
  echo "ERROR: Could not fetch or parse PHP-FPM status JSON."
  exit 1
fi

# The JSON status page is a flat object whose keys contain spaces
TOTAL_PROCESSES=$(echo "$STATUS_OUTPUT" | jq '."total processes"')
ACTIVE_PROCESSES=$(echo "$STATUS_OUTPUT" | jq '."active processes"')
IDLE_PROCESSES=$(echo "$STATUS_OUTPUT" | jq '."idle processes"')
MAX_CHILDREN_REACHED=$(echo "$STATUS_OUTPUT" | jq '."max children reached"')

echo "PHP-FPM Total Processes: $TOTAL_PROCESSES"
echo "PHP-FPM Active Processes: $ACTIVE_PROCESSES"
echo "PHP-FPM Idle Processes: $IDLE_PROCESSES"
echo "PHP-FPM Max Children Reached (count since start): $MAX_CHILDREN_REACHED"

# Example: Alert if active processes have reached max_children
if [ "$ACTIVE_PROCESSES" -ge "$MAX_CHILDREN" ]; then
  echo "CRITICAL: PHP-FPM active processes ($ACTIVE_PROCESSES) reached max_children ($MAX_CHILDREN)."
  exit 1
fi
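Alerting only once the pool is completely full leaves little time to react. A small sketch of a graded check follows, assuming a hypothetical function `fpm_pool_state` and an 80% warning level chosen here for illustration: it turns the active process count and the configured `pm.max_children` into a utilization percentage and returns Nagios-style codes (0 OK, 1 warning, 2 critical, 3 unknown).

```bash
#!/bin/bash
# Sketch: classify PHP-FPM pool pressure from two integers.
# Usage: fpm_pool_state <active_processes> <max_children> [warn_percent]
fpm_pool_state() {
  local active=$1 max_children=$2 warn_pct=${3:-80}
  if (( max_children <= 0 )); then
    echo "UNKNOWN: max_children must be a positive integer"
    return 3
  fi
  local pct=$(( active * 100 / max_children ))
  if (( active >= max_children )); then
    echo "CRITICAL: ${active}/${max_children} workers busy (${pct}%)"
    return 2
  elif (( pct >= warn_pct )); then
    echo "WARNING: ${active}/${max_children} workers busy (${pct}%)"
    return 1
  fi
  echo "OK: ${active}/${max_children} workers busy (${pct}%)"
  return 0
}

# Example with invented numbers: 42 busy workers out of 50 allowed.
fpm_pool_state 42 50 || true
```

With the values shown, 42 of 50 workers is 84%, so the function reports a warning while there is still headroom before requests start queueing.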

MySQL Performance Tuning and Monitoring

The MySQL database is often the bottleneck for WordPress applications. Monitoring key performance indicators (KPIs) such as query throughput, slow queries, connection counts, and buffer pool hit ratio is essential.

Slow Query Log Analysis

The slow query log records queries that take longer than a specified time to execute. Analyzing this log helps identify inefficient SQL statements that can be optimized. Ensure the slow query log is enabled in your MySQL configuration (`my.cnf` or `my.ini`):

[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 2 # Log queries longer than 2 seconds
log_queries_not_using_indexes = 1 # Optionally log queries not using indexes

You can then use tools like `pt-query-digest` from the Percona Toolkit to analyze the log file. For automated monitoring, you can periodically check the log file size or use `inotifywait` to trigger analysis when new entries appear.

#!/bin/bash

# MySQL slow query log monitoring script

# Ensure Percona Toolkit is installed: apt-get install percona-toolkit
SLOW_QUERY_LOG="/var/log/mysql/mysql-slow.log"
TEMP_REPORT="/tmp/mysql_slow_query_report.txt"
MAX_LOG_SIZE_MB=100 # Threshold for log file size in MB

# Check if log file exists
if [ ! -f "$SLOW_QUERY_LOG" ]; then
  echo "ERROR: Slow query log file not found at $SLOW_QUERY_LOG"
  exit 1
fi

# Check log file size
CURRENT_SIZE_KB=$(du -k "$SLOW_QUERY_LOG" | cut -f1)
MAX_SIZE_KB=$((MAX_LOG_SIZE_MB * 1024))

if [ "$CURRENT_SIZE_KB" -gt "$MAX_SIZE_KB" ]; then
  echo "WARNING: MySQL slow query log file is large ($((CURRENT_SIZE_KB / 1024)) MB). Consider rotation or analysis."
  # You might want to trigger pt-query-digest here if the log is too large
  # pt-query-digest --limit 100% "$SLOW_QUERY_LOG" > "$TEMP_REPORT"
  # echo "Generated slow query report: $TEMP_REPORT"
  # Rotate log after analysis; mysqld keeps writing to the old file until logs are flushed
  # mv "$SLOW_QUERY_LOG" "$SLOW_QUERY_LOG.old" && mysqladmin flush-logs
fi

# More advanced: Use pt-query-digest to find top N slow queries
# This is often run less frequently (e.g., daily)
# pt-query-digest --limit 5 "$SLOW_QUERY_LOG" > "$TEMP_REPORT"
# if [ -s "$TEMP_REPORT" ]; then
#   echo "Top 5 slow queries:"
#   cat "$TEMP_REPORT"
#   # Add alerting logic here based on the report content
# fi
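File size is a coarse signal. For a lightweight check between full `pt-query-digest` runs, one option is to count the entries whose `Query_time` is at or above some limit. The sketch below assumes the standard slow log layout, in which each entry has a header line beginning with `# Query_time:`; the function name `count_slow_entries` and the sample entries are invented for illustration.

```bash
#!/bin/bash
# Sketch: count slow log entries whose Query_time is >= a limit in seconds.
# Usage: count_slow_entries <log_file> <min_seconds>
count_slow_entries() {
  local log_file=$1 min_seconds=$2
  awk -v min="$min_seconds" '
    /^# Query_time:/ { if ($3 + 0 >= min + 0) n++ }
    END { print n + 0 }
  ' "$log_file"
}

# Invented sample log written to a temporary file for demonstration.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
# Time: 2024-01-15T10:00:01.000000Z
# Query_time: 2.104512  Lock_time: 0.000120 Rows_sent: 10  Rows_examined: 48213
SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes';
# Time: 2024-01-15T10:03:17.000000Z
# Query_time: 5.731209  Lock_time: 0.000098 Rows_sent: 1  Rows_examined: 912044
SELECT COUNT(*) FROM wp_postmeta WHERE meta_key = '_thumbnail_id';
# Time: 2024-01-15T10:09:42.000000Z
# Query_time: 12.300481  Lock_time: 0.001302 Rows_sent: 20  Rows_examined: 2210345
SELECT * FROM wp_posts WHERE post_status = 'publish' ORDER BY post_date DESC;
EOF

echo "Entries at or above 5s: $(count_slow_entries "$sample_log" 5)"
rm -f "$sample_log"
```

Comparing this count between two runs gives a simple "new slow queries since last check" figure that can feed an alert threshold.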

MySQL Status Variables

MySQL exposes numerous status variables that provide insights into its operation. Key variables to monitor include:

Threads_connected: Number of currently open connections. High values can indicate connection leaks or insufficient connection pooling.
Threads_running: Number of threads actively executing queries. High values relative to CPU cores suggest CPU contention.
Slow_queries: Counter for slow queries.
Innodb_buffer_pool_read_requests and Innodb_buffer_pool_reads: Used to calculate the InnoDB buffer pool hit ratio. A ratio below 95-99% can indicate insufficient buffer pool size.
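
The hit ratio is (read requests - disk reads) × 100 / read requests. Both counters are cumulative since the server started, so the lifetime ratio can mask a recent drop. As a hedged illustration with invented numbers, the sketch below defines a hypothetical `hit_ratio` helper and applies it both to the lifetime totals and to the difference between two samples.

```bash
#!/bin/bash
# Sketch: InnoDB buffer pool hit ratio from two counters.
# Usage: hit_ratio <read_requests> <disk_reads>
hit_ratio() {
  local read_requests=$1 disk_reads=$2
  awk -v req="$read_requests" -v reads="$disk_reads" 'BEGIN {
    if (req <= 0) { print "N/A"; exit }
    printf "%.2f\n", (req - reads) * 100 / req
  }'
}

# Two invented samples of the cumulative counters, taken some time apart.
req_1=1000000; reads_1=20000
req_2=1200000; reads_2=40000

echo "Lifetime hit ratio: $(hit_ratio "$req_2" "$reads_2")%"
echo "Interval hit ratio: $(hit_ratio $((req_2 - req_1)) $((reads_2 - reads_1)))%"
```

Here the lifetime figure still looks healthy at 96.67%, while the interval between the two samples works out to 90.00%, which is the number that reflects current behavior.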

You can query these variables using the MySQL client:

#!/bin/bash

# MySQL status variables check

DB_USER="your_db_user"
DB_PASS="your_db_password" # Visible in the process list; prefer a client options file (~/.my.cnf) in production

# Thresholds
MAX_CONNECTIONS=200
MIN_BUFFER_POOL_HIT_RATIO=95

# Get status variables
STATUS_OUTPUT=$(mysql -u"$DB_USER" -p"$DB_PASS" -e "SHOW GLOBAL STATUS;" 2>/dev/null)

if [ $? -ne 0 ]; then
  echo "ERROR: Failed to connect to MySQL or retrieve status."
  exit 1
fi

# Extract specific variables
THREADS_CONNECTED=$(echo "$STATUS_OUTPUT" | grep Threads_connected | awk '{print $2}')
THREADS_RUNNING=$(echo "$STATUS_OUTPUT" | grep Threads_running | awk '{print $2}')
SLOW_QUERIES=$(echo "$STATUS_OUTPUT" | grep Slow_queries | awk '{print $2}')
INNODB_READ_REQUESTS=$(echo "$STATUS_OUTPUT" | grep Innodb_buffer_pool_read_requests | awk '{print $2}')
INNODB_READS=$(echo "$STATUS_OUTPUT" | grep Innodb_buffer_pool_reads | awk '{print $2}')

echo "MySQL Threads Connected: $THREADS_CONNECTED"
echo "MySQL Threads Running: $THREADS_RUNNING"
echo "MySQL Slow Queries: $SLOW_QUERIES"

# Calculate InnoDB buffer pool hit ratio
# Zero disk reads is valid (a 100% hit ratio); only zero read requests means there is no data
if [ "${INNODB_READ_REQUESTS:-0}" -gt 0 ]; then
  HIT_RATIO=$(echo "scale=2; ($INNODB_READ_REQUESTS - ${INNODB_READS:-0}) * 100 / $INNODB_READ_REQUESTS" | bc)
  echo "MySQL InnoDB Buffer Pool Hit Ratio: ${HIT_RATIO}%"
else
  HIT_RATIO=""
  echo "MySQL InnoDB Buffer Pool Hit Ratio: N/A (Insufficient data)"
fi

# Alerting logic
if [ "$THREADS_CONNECTED" -gt "$MAX_CONNECTIONS" ]; then
  echo "WARNING: MySQL Threads Connected ($THREADS_CONNECTED) exceeds threshold ($MAX_CONNECTIONS)."
fi

# Skip the hit ratio alert when no ratio could be calculated
if [ -n "$HIT_RATIO" ] && (( $(echo "$HIT_RATIO < $MIN_BUFFER_POOL_HIT_RATIO" | bc -l) )); then
  echo "WARNING: MySQL InnoDB Buffer Pool Hit Ratio (${HIT_RATIO}%) is below threshold (${MIN_BUFFER_POOL_HIT_RATIO}%)."
fi

# You can add checks for Threads_running vs CPU cores here as well.
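
A possible shape for that `Threads_running` check is sketched below. The function name `check_threads_running` and the "twice the core count" limit are assumptions made for illustration, not an established rule; tune the multiplier to your workload. When the core count is omitted, it falls back to `nproc`.

```bash
#!/bin/bash
# Sketch: compare Threads_running against the number of CPU cores.
# Usage: check_threads_running <threads_running> [cores]
check_threads_running() {
  local threads_running=$1
  local cores=${2:-$(nproc)}
  local limit=$(( cores * 2 )) # Assumed multiplier; adjust as needed
  if (( threads_running > limit )); then
    echo "WARNING: Threads_running ($threads_running) exceeds 2x CPU cores ($cores)"
    return 1
  fi
  echo "OK: Threads_running ($threads_running) is within 2x CPU cores ($cores)"
}

# Example with invented values: 3 running threads on a 4-core Droplet.
check_threads_running 3 4
```

Inside the script above, the call would be `check_threads_running "$THREADS_RUNNING"`, letting `nproc` supply the core count of the machine running the check (which should be the database host).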

DigitalOcean Droplet Resource Monitoring

While the above focuses on application-level metrics, robust infrastructure monitoring is the foundation. DigitalOcean provides basic metrics through its control panel, but for deeper insights and automated alerting, consider agents like Prometheus Node Exporter, Telegraf, or Datadog Agent.

If you’re using Prometheus, Node Exporter is a standard choice. It exposes hardware and OS metrics via an HTTP endpoint. You’d then configure Prometheus to scrape these endpoints and Grafana to visualize the data.

# Example: Installing and running Prometheus Node Exporter on Ubuntu/Debian
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo useradd -rs /bin/false node_exporter

# Create a systemd service file
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

Once Node Exporter is running, Prometheus can scrape its metrics from http://your_droplet_ip:9100/metrics. Key metrics to monitor include:

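node_cpu_seconds_total: Cumulative CPU time in seconds, per core and per mode (idle, user, system, iowait); apply `rate()` in Prometheus to turn it into utilization.
node_memory_MemAvailable_bytes: Available memory.
node_disk_io_time_seconds_total: Cumulative seconds each disk has spent doing I/O.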
node_network_receive_bytes_total and node_network_transmit_bytes_total: Network traffic.
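
Prometheus and Grafana are the usual consumers of these metrics, but the `/metrics` endpoint is plain text, so a quick ad hoc check is possible from the shell too. The sketch below, with an invented function name `mem_available_percent` and invented sample values, reads the exposition format from stdin and computes available memory as a percentage of `node_memory_MemTotal_bytes`.

```bash
#!/bin/bash
# Sketch: compute available memory (%) from Node Exporter text output on stdin.
mem_available_percent() {
  awk '
    $1 == "node_memory_MemAvailable_bytes" { avail = $2 }
    $1 == "node_memory_MemTotal_bytes"     { total = $2 }
    END {
      if (total <= 0) { print "N/A"; exit 1 }
      printf "%.1f\n", avail * 100 / total
    }'
}

# Invented sample in the Prometheus text exposition format.
mem_available_percent <<'EOF'
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 1.073741824e+09
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 4.294967296e+09
EOF
```

Against a running exporter the input would come from `curl -s http://localhost:9100/metrics | mem_available_percent`. The sample above describes 1 GiB available out of 4 GiB.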

Alerting Strategy and Tools

Having scripts and metrics is only half the battle. An effective alerting strategy ensures that you are notified promptly when issues arise. Consider the following:

Severity Levels: Differentiate between critical (e.g., cluster down, data loss risk), warning (e.g., high resource usage, nearing thresholds), and informational alerts.
Alert Fatigue: Avoid overwhelming your team with too many alerts. Tune thresholds carefully and use aggregation/deduplication features in your monitoring system.
Actionable Alerts: Alerts should provide enough context to understand the problem and suggest next steps.
Escalation Policies: Define who gets alerted and when, with escalation paths for unresolved issues.
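
For script-based checks that run from cron, alert fatigue can be reduced with a simple cooldown. The sketch below is one possible approach under stated assumptions: the helper name `should_alert`, the alert key, and the state directory layout are all invented here. It records when an alert was last sent for a given key and suppresses repeats until the cooldown has elapsed.

```bash
#!/bin/bash
# Sketch: return 0 if an alert for <key> may be sent now, 1 if it is
# still inside the cooldown window. Keys must be safe to use as file names.
# Usage: should_alert <key> <cooldown_seconds> [state_dir]
should_alert() {
  local key=$1 cooldown=$2 state_dir=${3:-/tmp/alert-state}
  local stamp_file="$state_dir/$key.last" now last=0
  mkdir -p "$state_dir"
  now=$(date +%s)
  if [ -f "$stamp_file" ]; then
    last=$(cat "$stamp_file")
  fi
  if (( now - last >= cooldown )); then
    echo "$now" > "$stamp_file"
    return 0
  fi
  return 1
}

# Demonstration: three consecutive failures of the same (invented) check.
state_dir=$(mktemp -d)
for attempt in 1 2 3; do
  if should_alert "es-cluster-red" 300 "$state_dir"; then
    echo "attempt $attempt: alert sent"
  else
    echo "attempt $attempt: suppressed (within cooldown)"
  fi
done
rm -rf "$state_dir"
```

Only the first of the three attempts sends an alert; the others fall inside the 300-second window. Dedicated alert managers provide the same idea with grouping and routing on top.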

Tools like Prometheus Alertmanager, Grafana Alerting, or the alert policies built into DigitalOcean Monitoring can be configured to manage these alerts. For instance, Prometheus Alertmanager can group, route, and silence alerts based on defined rules.

By combining application-specific metrics for WordPress, Elasticsearch, and MySQL with robust infrastructure monitoring on DigitalOcean, you can build a comprehensive and proactive monitoring system that keeps your critical services running smoothly.