Server Monitoring Best Practices: Keeping Your WordPress App and Elasticsearch Clusters Alive on DigitalOcean
Proactive Elasticsearch Health Checks with `curl` and `jq`
Maintaining the health of an Elasticsearch cluster, especially one supporting a high-traffic WordPress application, requires more than just basic CPU and memory monitoring. Elasticsearch has its own internal metrics that, when tracked, can predict and prevent issues before they impact your application. We’ll leverage `curl` to query the Elasticsearch API and `jq` for parsing the JSON output to extract critical health indicators.
A fundamental check is the cluster health API. This endpoint provides a snapshot of the cluster’s status (green, yellow, or red), the number of nodes, and shard allocation status. A ‘red’ status indicates that some primary shards are not allocated, meaning data is unavailable. A ‘yellow’ status means all primary shards are allocated, but some replica shards are missing, posing a risk of data loss if a node fails.
Automating Cluster Health Checks
We can script these checks to run periodically. The following Bash script uses `curl` to hit the `_cluster/health` endpoint and `jq` to extract the status. It then exits with a non-zero status code if the cluster is not in a ‘green’ state, making it suitable for integration with monitoring systems like Nagios, Zabbix, or even a simple cron job with alerting.
Ensure you replace http://localhost:9200 with the actual endpoint of your Elasticsearch cluster. If your cluster uses authentication, you’ll need to add appropriate headers to the `curl` command (e.g., -u 'elastic:changeme' or -H 'Authorization: Basic ...').
The script first fetches the cluster health. Then, it uses `jq` to select the ‘status’ field. If the status is anything other than ‘green’, it prints a message and exits with code 1. Otherwise, it prints a success message and exits with code 0.
#!/bin/bash
# Elasticsearch cluster health check script
ES_URL="http://localhost:9200" # Replace with your Elasticsearch endpoint
# Fetch cluster health and extract status
HEALTH_STATUS=$(curl -s "${ES_URL}/_cluster/health" | jq -r '.status')
if [ "$HEALTH_STATUS" != "green" ]; then
    # An empty status means the request failed or returned no JSON
    echo "CRITICAL: Elasticsearch cluster is not green. Status: ${HEALTH_STATUS:-unknown}"
    exit 1
else
    echo "OK: Elasticsearch cluster is green."
    exit 0
fi
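The check above treats yellow and red identically. Nagios-style plugins conventionally distinguish severities through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch of that mapping follows; the function name `es_status_to_exit_code` is hypothetical and not part of the original script.

```shell
#!/bin/bash
# Map an Elasticsearch cluster status string to a Nagios-style exit code.
# es_status_to_exit_code is a hypothetical helper name used for illustration.
es_status_to_exit_code() {
    case "$1" in
        green)  return 0 ;;  # OK
        yellow) return 1 ;;  # WARNING: some replica shards are unallocated
        red)    return 2 ;;  # CRITICAL: some primary shards are unallocated
        *)      return 3 ;;  # UNKNOWN: empty or unexpected value
    esac
}
```

One way to use it is to end a check with `es_status_to_exit_code "$HEALTH_STATUS"; exit $?`, so the monitoring system can page on red while only flagging yellow.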
Monitoring Elasticsearch Node Statistics
Beyond cluster-wide health, individual node statistics are crucial. High JVM heap usage, excessive garbage collection activity, or a disproportionate share of shards on a single node can indicate underlying problems. The `_nodes/stats` API provides detailed metrics for each node.
We can monitor JVM heap usage, which is a common bottleneck. High heap usage can lead to frequent garbage collection pauses, impacting query latency and overall cluster responsiveness. A threshold of 80-90% is often considered a warning level.
#!/bin/bash
# Elasticsearch JVM heap usage check script
ES_URL="http://localhost:9200" # Replace with your Elasticsearch endpoint
HEAP_THRESHOLD=90 # Percentage
# Fetch node stats and iterate through each node
curl -s "${ES_URL}/_nodes/stats/jvm" | jq -c '.nodes[] | {host: .host, heap_used_percent: .jvm.mem.heap_used_percent}' | while read -r NODE_STATS; do
    HOST=$(echo "$NODE_STATS" | jq -r '.host')
    HEAP_USED=$(echo "$NODE_STATS" | jq -r '.heap_used_percent')
    if (( $(echo "$HEAP_USED > $HEAP_THRESHOLD" | bc -l) )); then
        echo "WARNING: Elasticsearch node $HOST has high JVM heap usage: ${HEAP_USED}%"
        # This loop runs in a pipeline subshell, so an `exit` here would end only the loop,
        # not the script. In a real monitoring setup, collect these warnings and report them collectively.
    else
        echo "OK: Elasticsearch node $HOST JVM heap usage: ${HEAP_USED}%"
    fi
done
This script iterates through each node, extracts its hostname and JVM heap usage percentage, and flags nodes exceeding the defined threshold. The `bc -l` comparison also copes with fractional values. Elasticsearch reports `heap_used_percent` as a whole number, so a plain integer test such as `[ "$HEAP_USED" -gt "$HEAP_THRESHOLD" ]` would work as well.
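A loop on the receiving end of a pipe runs in a subshell, so an `exit` inside it does not end the script. One way around this is to let `jq` do the filtering and test the result in the parent shell. The sketch below assumes the `_nodes/stats/jvm` response shape used above; `nodes_over_heap_threshold` is a hypothetical helper name.

```shell
#!/bin/bash
# Print "host percent" for every node whose heap usage exceeds the threshold.
# Reads a _nodes/stats/jvm JSON document on stdin.
# nodes_over_heap_threshold is a hypothetical helper name used for illustration.
nodes_over_heap_threshold() {
    local threshold="$1"
    jq -r --argjson t "$threshold" \
        '.nodes[] | select(.jvm.mem.heap_used_percent > $t) | "\(.host) \(.jvm.mem.heap_used_percent)"'
}

# Example use (commented out so the sketch has no side effects):
# OFFENDERS=$(curl -s "http://localhost:9200/_nodes/stats/jvm" | nodes_over_heap_threshold 90)
# if [ -n "$OFFENDERS" ]; then
#     echo "WARNING: high JVM heap usage on: $OFFENDERS"
#     exit 1
# fi
```

Because the result lands in a variable in the parent shell, the final `exit 1` reaches the monitoring system.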
WordPress Application Monitoring on DigitalOcean
For the WordPress application itself, standard server metrics are essential, but we also need to consider WordPress-specific performance indicators. On DigitalOcean, this typically involves monitoring Droplet resource utilization (CPU, RAM, Disk I/O, Network) and then diving deeper into the web server (Nginx/Apache), PHP-FPM, and MySQL performance.
Nginx/Apache Performance Metrics
Web server logs are a goldmine for performance insights. Monitoring the number of active connections, request rates, and error rates (4xx, 5xx) can quickly reveal issues. For Nginx, the `stub_status` module is invaluable.
First, ensure the `stub_status` module is enabled in your Nginx configuration. Add the following to your `nginx.conf` or a site-specific configuration file:
# Inside a server block
location /nginx_status {
    stub_status;
    allow 127.0.0.1; # Restrict access to localhost for security
    deny all;
}
Then, you can use `curl` to fetch these metrics:
#!/bin/bash
# Nginx stub_status check
NGINX_STATUS_URL="http://localhost/nginx_status" # Adjust if Nginx is not on localhost or path is different
# Fetch Nginx status
STATUS_OUTPUT=$(curl -s "$NGINX_STATUS_URL")
# Parse the output. stub_status prints "Active connections: N" on the first line
# and the accepts/handled/requests counters on the third line.
ACTIVE_CONNECTIONS=$(echo "$STATUS_OUTPUT" | awk '/^Active connections:/ {print $3}')
REQUESTS=$(echo "$STATUS_OUTPUT" | awk 'NR == 3 {print $3}')
# Handle potential errors in parsing if output format changes
if [ -z "$ACTIVE_CONNECTIONS" ] || [ -z "$REQUESTS" ]; then
    echo "ERROR: Could not parse Nginx status output."
    exit 1
fi
echo "Nginx Active Connections: $ACTIVE_CONNECTIONS"
echo "Nginx Requests: $REQUESTS"
# Example: Alert if active connections exceed a threshold
MAX_CONNECTIONS=500 # Adjust as needed
if [ "$ACTIVE_CONNECTIONS" -gt "$MAX_CONNECTIONS" ]; then
    echo "WARNING: Nginx active connections ($ACTIVE_CONNECTIONS) exceed threshold ($MAX_CONNECTIONS)."
    # exit 1 # Uncomment to make this a critical alert
fi
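The requests figure in `stub_status` is a cumulative counter since Nginx started, so a request rate has to be derived from two samples. The following is a rough sketch of that calculation; the helper names `nginx_total_requests` and `nginx_request_rate` are hypothetical, and the rate is rounded down to whole requests per second.

```shell
#!/bin/bash
# Extract the cumulative request counter from stub_status output on stdin.
# The third line holds three counters: accepts, handled, requests.
nginx_total_requests() {
    awk 'NR == 3 {print $3}'
}

# Compute whole requests per second from two counter readings.
# Arguments: previous_count current_count interval_seconds
nginx_request_rate() {
    echo $(( ($2 - $1) / $3 ))
}
```

A check could take one reading, `sleep 60`, take a second reading, and pass both to `nginx_request_rate` with an interval of 60.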
PHP-FPM Monitoring
PHP-FPM’s performance is critical for WordPress. Monitoring the number of active processes, idle processes, and slow requests can help identify bottlenecks. PHP-FPM can serve a status page over its FastCGI listener, which is either a Unix socket or a TCP port.
First, configure PHP-FPM to expose its status. In your `php-fpm.conf` or `pool.d/*.conf` file, add or modify the following:
; For TCP socket (e.g., on port 9000)
listen = 127.0.0.1:9000
pm.status_path = /status
; For Unix socket (e.g., /run/php/php7.4-fpm.sock)
; listen = /run/php/php7.4-fpm.sock
; pm.status_path = /status
You’ll also need to configure your web server (Nginx/Apache) to proxy requests to this status path. For Nginx:
location ~ ^/status(/.*)?$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass unix:/run/php/php7.4-fpm.sock; # Or your TCP listener: 127.0.0.1:9000
    allow 127.0.0.1; # Restrict access to localhost for security
    deny all;
}
With the status endpoint configured, you can fetch the metrics. PHP-FPM speaks FastCGI rather than HTTP, so `curl` cannot query it directly; send the request to the web server, which passes it on to PHP-FPM.
#!/bin/bash
# PHP-FPM status check
# Assuming Nginx is configured to pass /status to PHP-FPM
PHP_FPM_STATUS_URL="http://localhost/status?json" # 'json' for easier parsing; add 'full' for per-process details
# Fetch PHP-FPM status
STATUS_OUTPUT=$(curl -s "$PHP_FPM_STATUS_URL")
# Validate the JSON output
if ! echo "$STATUS_OUTPUT" | jq -e . > /dev/null 2>&1; then
    echo "ERROR: Could not fetch or parse PHP-FPM status JSON."
    exit 1
fi
# The status page describes one pool as a flat object whose keys contain spaces
TOTAL_PROCESSES=$(echo "$STATUS_OUTPUT" | jq '.["total processes"]')
ACTIVE_PROCESSES=$(echo "$STATUS_OUTPUT" | jq '.["active processes"]')
IDLE_PROCESSES=$(echo "$STATUS_OUTPUT" | jq '.["idle processes"]')
MAX_CHILDREN_REACHED=$(echo "$STATUS_OUTPUT" | jq '.["max children reached"]')
echo "PHP-FPM Total Processes: $TOTAL_PROCESSES"
echo "PHP-FPM Active Processes: $ACTIVE_PROCESSES"
echo "PHP-FPM Idle Processes: $IDLE_PROCESSES"
echo "PHP-FPM Max Children Reached: $MAX_CHILDREN_REACHED"
# Example: Alert if the pool has hit pm.max_children since PHP-FPM started.
# The status page does not report pm.max_children itself, only how often the limit was reached.
if [ "$MAX_CHILDREN_REACHED" -gt 0 ]; then
    echo "CRITICAL: PHP-FPM reached pm.max_children $MAX_CHILDREN_REACHED time(s) since start."
    exit 1
fi
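Without the `json` query parameter, the status page returns plain `label: value` lines. On hosts where `jq` is unavailable, a sketch like the one below can read a single field with `awk`. The helper name `fpm_status_field` is hypothetical, and the field labels are assumed to match the plain-text output of your PHP-FPM version.

```shell
#!/bin/bash
# Read one field from PHP-FPM's plain-text status output on stdin.
# fpm_status_field is a hypothetical helper; $1 is the exact field label,
# for example "listen queue" or "active processes".
fpm_status_field() {
    awk -F': *' -v key="$1" '$1 == key {print $2}'
}
```

For example, `curl -s http://localhost/status | fpm_status_field "listen queue"` prints the number of requests waiting for a free worker; a value that stays above zero suggests the pool is undersized.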
MySQL Performance Tuning and Monitoring
The MySQL database is often the bottleneck for WordPress applications. Monitoring key performance indicators (KPIs) such as query throughput, slow queries, connection counts, and buffer pool hit ratio is essential.
Slow Query Log Analysis
The slow query log records queries that take longer than a specified time to execute. Analyzing this log helps identify inefficient SQL statements that can be optimized. Ensure the slow query log is enabled in your MySQL configuration (`my.cnf` or `my.ini`):
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 2 # Log queries longer than 2 seconds
log_queries_not_using_indexes = 1 # Optionally log queries not using indexes
You can then use tools like `pt-query-digest` from the Percona Toolkit to analyze the log file. For automated monitoring, you can periodically check the log file size or use `inotifywait` to trigger analysis when new entries appear.
#!/bin/bash
# MySQL slow query log monitoring script
# Ensure Percona Toolkit is installed: apt-get install percona-toolkit
SLOW_QUERY_LOG="/var/log/mysql/mysql-slow.log"
TEMP_REPORT="/tmp/mysql_slow_query_report.txt"
MAX_LOG_SIZE_MB=100 # Threshold for log file size in MB
# Check if log file exists
if [ ! -f "$SLOW_QUERY_LOG" ]; then
    echo "ERROR: Slow query log file not found at $SLOW_QUERY_LOG"
    exit 1
fi
# Check log file size
CURRENT_SIZE_KB=$(du -k "$SLOW_QUERY_LOG" | cut -f1)
MAX_SIZE_KB=$((MAX_LOG_SIZE_MB * 1024))
if [ "$CURRENT_SIZE_KB" -gt "$MAX_SIZE_KB" ]; then
    echo "WARNING: MySQL slow query log file is large ($((CURRENT_SIZE_KB / 1024)) MB). Consider rotation or analysis."
    # You might want to trigger pt-query-digest here if the log is too large
    # pt-query-digest --limit 100% "$SLOW_QUERY_LOG" > "$TEMP_REPORT"
    # echo "Generated slow query report: $TEMP_REPORT"
    # mv "$SLOW_QUERY_LOG" "$SLOW_QUERY_LOG.old" # Rotate log after analysis
fi
# More advanced: Use pt-query-digest to find top N slow queries
# This is often run less frequently (e.g., daily)
# pt-query-digest --limit 5 "$SLOW_QUERY_LOG" > "$TEMP_REPORT"
# if [ -s "$TEMP_REPORT" ]; then
#     echo "Top 5 slow queries:"
#     cat "$TEMP_REPORT"
#     # Add alerting logic here based on the report content
# fi
MySQL Status Variables
MySQL exposes numerous status variables that provide insights into its operation. Key variables to monitor include:
- Threads_connected: Number of currently open connections. High values can indicate connection leaks or insufficient connection pooling.
- Threads_running: Number of threads actively executing queries. High values relative to CPU cores suggest CPU contention.
- Slow_queries: Counter for slow queries.
- Innodb_buffer_pool_read_requests and Innodb_buffer_pool_reads: Used to calculate the InnoDB buffer pool hit ratio. A ratio below 95-99% can indicate insufficient buffer pool size.
You can query these variables using the MySQL client:
#!/bin/bash
# MySQL status variables check
DB_USER="your_db_user"
DB_PASS="your_db_password"
DB_NAME="your_db_name" # Optional, for specific database stats
# Thresholds
MAX_CONNECTIONS=200
MIN_BUFFER_POOL_HIT_RATIO=95
# Get status variables
STATUS_OUTPUT=$(mysql -u"$DB_USER" -p"$DB_PASS" -e "SHOW GLOBAL STATUS;" 2>/dev/null)
if [ $? -ne 0 ]; then
echo "ERROR: Failed to connect to MySQL or retrieve status."
exit 1
fi
# Extract specific variables
THREADS_CONNECTED=$(echo "$STATUS_OUTPUT" | grep Threads_connected | awk '{print $2}')
THREADS_RUNNING=$(echo "$STATUS_OUTPUT" | grep Threads_running | awk '{print $2}')
SLOW_QUERIES=$(echo "$STATUS_OUTPUT" | grep Slow_queries | awk '{print $2}')
INNODB_READ_REQUESTS=$(echo "$STATUS_OUTPUT" | grep Innodb_buffer_pool_read_requests | awk '{print $2}')
INNODB_READS=$(echo "$STATUS_OUTPUT" | grep Innodb_buffer_pool_reads | awk '{print $2}')
echo "MySQL Threads Connected: $THREADS_CONNECTED"
echo "MySQL Threads Running: $THREADS_RUNNING"
echo "MySQL Slow Queries: $SLOW_QUERIES"
# Calculate InnoDB buffer pool hit ratio.
# Zero disk reads is a valid result (a 100% hit ratio), so only the request count must be non-zero.
if [ "${INNODB_READ_REQUESTS:-0}" -gt 0 ]; then
    HIT_RATIO=$(echo "scale=2; ($INNODB_READ_REQUESTS - ${INNODB_READS:-0}) * 100 / $INNODB_READ_REQUESTS" | bc)
    echo "MySQL InnoDB Buffer Pool Hit Ratio: ${HIT_RATIO}%"
else
    HIT_RATIO=""
    echo "MySQL InnoDB Buffer Pool Hit Ratio: N/A (Insufficient data)"
fi
# Alerting logic
if [ "$THREADS_CONNECTED" -gt "$MAX_CONNECTIONS" ]; then
    echo "WARNING: MySQL Threads Connected ($THREADS_CONNECTED) exceeds threshold ($MAX_CONNECTIONS)."
fi
# Skip the ratio check when no ratio could be calculated
if [ -n "$HIT_RATIO" ] && (( $(echo "$HIT_RATIO < $MIN_BUFFER_POOL_HIT_RATIO" | bc -l) )); then
    echo "WARNING: MySQL InnoDB Buffer Pool Hit Ratio (${HIT_RATIO}%) is below threshold (${MIN_BUFFER_POOL_HIT_RATIO}%)."
fi
# You can add checks for Threads_running vs CPU cores here as well.
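The Threads_running check mentioned in that last comment could look roughly like this. It is only a sketch: `threads_running_high` is a hypothetical helper, and the default factor of 2 threads per core is an assumed starting point to tune, not a MySQL recommendation.

```shell
#!/bin/bash
# Succeed (return 0) when Threads_running exceeds cores * factor.
# Arguments: threads_running [cores] [factor]
# cores defaults to the output of nproc; factor defaults to 2 (an assumption to tune).
threads_running_high() {
    local threads_running="$1"
    local cores="${2:-$(nproc)}"
    local factor="${3:-2}"
    [ "$threads_running" -gt $(( cores * factor )) ]
}
```

In the status script this might be used as `if threads_running_high "$THREADS_RUNNING"; then echo "WARNING: MySQL Threads Running ($THREADS_RUNNING) is high relative to CPU cores."; fi`.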
DigitalOcean Droplet Resource Monitoring
While the above focuses on application-level metrics, robust infrastructure monitoring is the foundation. DigitalOcean provides basic metrics through its control panel, but for deeper insights and automated alerting, consider agents like Prometheus Node Exporter, Telegraf, or Datadog Agent.
If you’re using Prometheus, Node Exporter is a standard choice. It exposes hardware and OS metrics via an HTTP endpoint. You’d then configure Prometheus to scrape these endpoints and Grafana to visualize the data.
# Example: Installing and running Prometheus Node Exporter on Ubuntu/Debian
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo useradd -rs /bin/false node_exporter
# Create a systemd service file
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
Once Node Exporter is running, Prometheus can scrape its metrics from http://your_droplet_ip:9100/metrics. Key metrics to monitor include:
- node_cpu_seconds_total: CPU usage by mode (idle, user, system, iowait).
- node_memory_MemAvailable_bytes: Available memory.
- node_disk_io_time_seconds_total: Disk I/O time.
- node_network_receive_bytes_total and node_network_transmit_bytes_total: Network traffic.
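Before Prometheus is wired up, the raw `/metrics` text can be spot-checked from the shell. The sketch below derives the percentage of available memory from `node_memory_MemAvailable_bytes` and `node_memory_MemTotal_bytes`; the second metric is not listed above but is exposed alongside the first on Linux. The helper name `mem_available_percent` is hypothetical.

```shell
#!/bin/bash
# Compute the percentage of available memory from Node Exporter metrics text on stdin.
# mem_available_percent is a hypothetical helper name used for illustration.
mem_available_percent() {
    awk '
        $1 == "node_memory_MemAvailable_bytes" {avail = $2}
        $1 == "node_memory_MemTotal_bytes"     {total = $2}
        END {
            if (total > 0) printf "%.1f\n", avail * 100 / total
            else exit 1  # Metrics missing: signal failure to the caller
        }
    '
}
```

For example, `curl -s http://localhost:9100/metrics | mem_available_percent` prints a single number that a cron job could compare against a threshold.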
Alerting Strategy and Tools
Having scripts and metrics is only half the battle. An effective alerting strategy ensures that you are notified promptly when issues arise. Consider the following:
- Severity Levels: Differentiate between critical (e.g., cluster down, data loss risk), warning (e.g., high resource usage, nearing thresholds), and informational alerts.
- Alert Fatigue: Avoid overwhelming your team with too many alerts. Tune thresholds carefully and use aggregation/deduplication features in your monitoring system.
- Actionable Alerts: Alerts should provide enough context to understand the problem and suggest next steps.
- Escalation Policies: Define who gets alerted and when, with escalation paths for unresolved issues.
Tools like Prometheus Alertmanager, Grafana Alerting, or the alert policies in DigitalOcean Monitoring can be configured to manage these alerts. For instance, Prometheus Alertmanager can group, route, and silence alerts based on defined rules.
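For cron-driven scripts that run without a full alerting stack, a simple cooldown can limit alert fatigue. The sketch below suppresses repeats of the same alert inside a time window by keeping a timestamp file per alert key. The function name `should_alert`, the `/tmp/alert_state` directory, and the 900-second default are all hypothetical choices for illustration.

```shell
#!/bin/bash
# Decide whether to send an alert, suppressing repeats inside a cooldown window.
# Arguments: alert_key [cooldown_seconds] [state_dir]
# should_alert, the state directory, and the 900-second default are assumptions.
should_alert() {
    local key="$1"
    local cooldown="${2:-900}"
    local state_dir="${3:-/tmp/alert_state}"
    local stamp_file="$state_dir/$key.last"
    local now last
    now=$(date +%s)
    mkdir -p "$state_dir"
    if [ -f "$stamp_file" ]; then
        last=$(cat "$stamp_file")
        if [ $(( now - last )) -lt "$cooldown" ]; then
            return 1  # Still inside the cooldown window: stay quiet
        fi
    fi
    # Record the time of this alert and allow it through
    echo "$now" > "$stamp_file"
    return 0
}
```

A check script might wrap its notification as `if should_alert es_cluster_red; then send_notification; fi`, where `send_notification` stands in for whatever mail or webhook command you use.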
By combining application-specific metrics for WordPress, Elasticsearch, and MySQL with robust infrastructure monitoring on DigitalOcean, you can build a comprehensive and proactive monitoring system that keeps your critical services running smoothly.