Server Monitoring Best Practices: Keeping Your WordPress App and Elasticsearch Clusters Alive on AWS

Proactive Health Checks for WordPress on AWS EC2

Maintaining a high-availability WordPress deployment on AWS EC2 necessitates a multi-layered monitoring strategy. Beyond basic CPU and memory utilization, we need to inspect application-level metrics and critical system processes. This involves leveraging both AWS CloudWatch and custom scripting for granular insights.

EC2 Instance Monitoring with CloudWatch Agent

The CloudWatch Agent is indispensable for collecting system-level metrics beyond the default EC2 metrics. It allows us to push custom metrics, log files, and detailed system performance data to CloudWatch. For a WordPress server, we’ll focus on disk I/O, network traffic, and specific process monitoring.

First, install the CloudWatch Agent on your EC2 instance. The installation process varies slightly by OS, but generally involves downloading the package and running the installer.

CloudWatch Agent Configuration

The agent’s configuration is defined in a JSON file, typically located at /opt/aws/amazon-cloudwatch-agent/bin/config.json. Here’s a sample configuration focusing on WordPress-relevant metrics:

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "WordPress/EC2",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "aggregation_dimensions": [
      ["InstanceId", "path"]
    ],
    "metrics_collected": {
      "disk": {
        "measurement": [
          "used_percent",
          "free",
          "total",
          "inodes_free",
          "inodes_used"
        ],
        "resources": [
          "/",
          "/var/www/html"
        ],
        "ignore_file_system_types": [
          "tmpfs",
          "devtmpfs"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent",
          "mem_available_percent"
        ]
      },
      "swap": {
        "measurement": [
          "swap_used_percent"
        ]
      },
      "net": {
        "measurement": [
          "bytes_sent",
          "bytes_recv",
          "packets_sent",
          "packets_recv"
        ]
      },
      "statsd": {
        "service_address": ":8125"
      },
      "procstat": [
        {
          "exe": "php-fpm",
          "measurement": [
            "pid_count",
            "cpu_usage",
            "memory_rss"
          ]
        },
        {
          "exe": "nginx",
          "measurement": [
            "pid_count",
            "cpu_usage",
            "memory_rss"
          ]
        },
        {
          "exe": "mysqld",
          "measurement": [
            "pid_count",
            "cpu_usage",
            "memory_rss"
          ]
        }
      ]
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "WordPress/EC2/Nginx/Access",
            "log_stream_name": "{instance_id}/nginx-access"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "WordPress/EC2/Nginx/Error",
            "log_stream_name": "{instance_id}/nginx-error"
          },
          {
            "file_path": "/var/log/php-fpm/error.log",
            "log_group_name": "WordPress/EC2/PHP-FPM/Error",
            "log_stream_name": "{instance_id}/php-fpm-error"
          },
          {
            "file_path": "/var/log/mysql/error.log",
            "log_group_name": "WordPress/EC2/MySQL/Error",
            "log_stream_name": "{instance_id}/mysql-error"
          }
        ]
      }
    }
  }
}
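Because a stray comma or a missing key only surfaces when the agent refuses to start, it can help to sanity-check the file before loading it. The following is a minimal sketch, not an official validator: the helper name check_agent_config and the two checks it performs are assumptions chosen for illustration, and the sample is a trimmed-down stand-in for the configuration above.

```python
import json


def check_agent_config(text: str) -> list:
    """Return a list of human-readable problems found in an agent config."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    problems = []
    # Without an explicit namespace the agent publishes under its default, CWAgent.
    if "namespace" not in config.get("metrics", {}):
        problems.append("metrics.namespace is not set")

    collect_list = (
        config.get("logs", {})
        .get("logs_collected", {})
        .get("files", {})
        .get("collect_list", [])
    )
    # Every log entry needs a file_path, otherwise there is nothing to tail.
    for index, entry in enumerate(collect_list):
        if "file_path" not in entry:
            problems.append(f"collect_list[{index}] is missing file_path")
    return problems


# Hypothetical sample: the second log entry deliberately lacks file_path.
sample = """
{
  "metrics": {"namespace": "WordPress/EC2", "metrics_collected": {}},
  "logs": {"logs_collected": {"files": {"collect_list": [
    {"file_path": "/var/log/nginx/error.log", "log_group_name": "WordPress/EC2/Nginx/Error"},
    {"log_group_name": "WordPress/EC2/PHP-FPM/Error"}
  ]}}}
}
"""
print(check_agent_config(sample))
```

An empty list means the file parsed and passed both checks; it says nothing about whether the agent accepts every section, so the agent's own log remains the final word.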

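After saving the configuration, load it into the agent and enable the service. The -s flag on fetch-config already restarts the agent with the new configuration; the systemctl commands make sure it also starts at boot: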

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
sudo systemctl enable amazon-cloudwatch-agent
sudo systemctl start amazon-cloudwatch-agent

Application-Level Monitoring with StatsD and Prometheus (Optional but Recommended)

For deeper application insights, WordPress itself can be instrumented to emit custom metrics over StatsD. In this setup the CloudWatch Agent's built-in StatsD listener receives those metrics and forwards them to CloudWatch. Prometheus does not speak StatsD natively, so scraping the same metrics with Prometheus would require a bridge such as statsd_exporter.

WordPress PHP-FPM Metrics via StatsD

PHP-FPM has no built-in StatsD support, so the metrics are emitted from PHP code. A common approach is a client library installed via Composer; the example below uses domnikl/statsd and hooks into WordPress actions rather than timing code inline.

<?php
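// Example: in a custom plugin or mu-plugin.
// Assumes the domnikl/statsd package is installed via Composer next to this file.
require_once __DIR__ . '/vendor/autoload.php';

use Domnikl\Statsd\Client;
use Domnikl\Statsd\Connection\UdpSocket;

$connection = new UdpSocket('127.0.0.1', 8125);
$statsd = new Client($connection);

// Track request duration: start the clock when the plugin loads, report at shutdown.
$startTime = microtime(true);
add_action('shutdown', function () use ($statsd, $startTime) {
    $duration = microtime(true) - $startTime;
    $statsd->timing('request_duration_ms', $duration * 1000);
});

// Track successful logins: wp_login fires once per successful authentication.
add_action('wp_login', function () use ($statsd) {
    $statsd->increment('user_logins');
});

// Track fatal errors: inspect the last error when PHP shuts down.
register_shutdown_function(function () use ($statsd) {
    $error = error_get_last();
    $fatalTypes = [E_ERROR, E_PARSE, E_CORE_ERROR, E_COMPILE_ERROR];
    if ($error !== null && in_array($error['type'], $fatalTypes, true)) {
        $statsd->increment('php_errors');
    }
});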
?>

Configure the CloudWatch Agent’s statsd section (as shown previously) to listen on UDP port 8125. This will forward these custom metrics to CloudWatch under the WordPress/EC2 namespace.
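To check that the listener is reachable without touching WordPress, you can hand-craft the same datagrams a StatsD client sends. The sketch below shows the plain-text StatsD line format (name:value|type) and a fire-and-forget UDP send. The function names are hypothetical, and the host and port assume the listener configured above.

```python
import socket


def format_metric(name: str, value: float, metric_type: str) -> bytes:
    """Build one StatsD datagram of the form <name>:<value>|<type>."""
    # c = counter, ms = timer, g = gauge
    if metric_type not in ("c", "ms", "g"):
        raise ValueError(f"unsupported StatsD type: {metric_type}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one datagram over UDP; UDP gives no delivery confirmation."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# The same two metrics the PHP example emits, with made-up values.
login = format_metric("user_logins", 1, "c")
timing = format_metric("request_duration_ms", 182.5, "ms")
print(login, timing)
```

Calling send_metric(login) on the instance should make a user_logins metric appear in CloudWatch after the next collection interval, which is a quick way to separate agent problems from plugin problems.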

Elasticsearch Cluster Monitoring on AWS

Monitoring an Elasticsearch cluster on AWS, whether self-managed on EC2 or run through Amazon OpenSearch Service (formerly Amazon Elasticsearch Service), requires a different set of tools and metrics. We’ll focus on self-managed EC2 instances for this example.

Elasticsearch Node and Cluster Health Metrics

Elasticsearch exposes a wealth of information via its REST API. We can use tools like curl, or more effectively, a dedicated monitoring solution like Prometheus with the elasticsearch_exporter.
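As a quick illustration of the REST API route, the sketch below reads the _cluster/health endpoint and condenses the response into one line. The function names are hypothetical, the URL assumes an unsecured node on localhost:9200, and the sample dictionary is made-up data that uses the field names the endpoint returns.

```python
import json
import urllib.request


def fetch_cluster_health(base_url: str = "http://localhost:9200") -> dict:
    """GET /_cluster/health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=5) as response:
        return json.load(response)


def summarize_health(health: dict) -> str:
    """Condense the fields most relevant for alerting into one line."""
    return (
        f"{health['cluster_name']}: status={health['status']}, "
        f"nodes={health['number_of_nodes']}, "
        f"unassigned_shards={health['unassigned_shards']}, "
        f"pending_tasks={health['number_of_pending_tasks']}"
    )


# Made-up response for illustration; a live call would be
# summarize_health(fetch_cluster_health()).
sample = {
    "cluster_name": "my-es-cluster",
    "status": "yellow",
    "number_of_nodes": 3,
    "unassigned_shards": 2,
    "number_of_pending_tasks": 0,
}
print(summarize_health(sample))
```

A script like this works for ad hoc checks, but it only reports the moment it runs, which is why the exporter approach is preferable for continuous monitoring.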

Using Elasticsearch Exporter with Prometheus

The elasticsearch_exporter is a Prometheus exporter that queries Elasticsearch for metrics and exposes them in Prometheus format. Install it on a dedicated monitoring node or one of the Elasticsearch nodes (though a separate node is preferred).

# Download and install the exporter (example for Linux)
# Replace X.Y.Z with the release you want; the tag carries a "v" prefix, the archive name does not
wget https://github.com/prometheus-community/elasticsearch_exporter/releases/download/vX.Y.Z/elasticsearch_exporter-X.Y.Z.linux-amd64.tar.gz
tar xvfz elasticsearch_exporter-X.Y.Z.linux-amd64.tar.gz
sudo mv elasticsearch_exporter-X.Y.Z.linux-amd64/elasticsearch_exporter /usr/local/bin/

# Configure the exporter (e.g., via systemd service)
# Create a systemd service file: /etc/systemd/system/elasticsearch_exporter.service
# Example content:
[Unit]
Description=Prometheus Elasticsearch Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/elasticsearch_exporter \
  --es.uri=http://localhost:9200 \
  --web.listen-address=:9114 \
  --es.all \
  --es.indices

[Install]
WantedBy=multi-user.target

# Reload systemd, enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable elasticsearch_exporter
sudo systemctl start elasticsearch_exporter

Ensure your Prometheus configuration scrapes this exporter:

# prometheus.yml
scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['your-es-exporter-host:9114']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '$1'
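The relabel rule above strips the port from the scrape address so that the instance label holds only the host name. Prometheus anchors relabel regexes at both ends, which Python's re.fullmatch mimics. The sketch below is an illustration of that one rule, not a reimplementation of Prometheus relabeling, and the function name is hypothetical.

```python
import re


def relabel_instance(address: str, regex: str = r"([^:]+):.*") -> str:
    """Return the value the instance label would end up with."""
    match = re.fullmatch(regex, address)
    if match is None:
        # No match means the rule does nothing, so instance keeps
        # its default value, the full address.
        return address
    return match.group(1)  # corresponds to replacement: '$1'


print(relabel_instance("your-es-exporter-host:9114"))
print(relabel_instance("no-port-here"))
```

The first call yields the bare host name; the second shows that an address without a port passes through unchanged.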

Key Elasticsearch Metrics to Monitor

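Cluster Health: elasticsearch_cluster_health_status is exposed once per color (label color="green", "yellow" or "red"), with value 1 for the current state and 0 otherwise. Alert when the red or yellow series equals 1.
Node Status: elasticsearch_cluster_health_number_of_nodes reports how many nodes have joined the cluster. Alert when it drops below the expected count.
JVM Heap Usage: elasticsearch_jvm_memory_used_bytes{area="heap"} divided by elasticsearch_jvm_memory_max_bytes{area="heap"}. Keep sustained usage below 80-85% to avoid frequent garbage collection pauses.
Disk Usage: elasticsearch_filesystem_data_available_bytes and elasticsearch_filesystem_data_size_bytes. Monitor free disk space closely, especially for indices with high write rates.
Indexing Rate: rate(elasticsearch_indices_indexing_index_total[5m]). A drop under steady load, or a rate far above the usual baseline, is worth investigating.
Search Rate: rate(elasticsearch_indices_search_query_total[5m]). Spikes can indicate inefficient queries or heavy load.
Pending Tasks: elasticsearch_cluster_health_number_of_pending_tasks. A growing number indicates the cluster is struggling to keep up.
Replication Lag: elasticsearch_cluster_health_unassigned_shards and elasticsearch_cluster_health_initializing_shards show replicas that are missing or still catching up.
These names follow recent elasticsearch_exporter releases; confirm them against the exporter's /metrics output for the version you run.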
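The heap and disk thresholds are percentages derived from pairs of raw byte values. The worked example below shows the arithmetic with made-up figures: the 85% heap ceiling comes from the guidance above, the 15% free-disk floor is an assumed example threshold, and the node sizes are hypothetical.

```python
GIB = 1024 ** 3


def percent(part: int, whole: int) -> float:
    """Express part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part * 100 / whole


# Hypothetical node: 7 GiB of an 8 GiB heap in use, 60 GiB free on a 500 GiB volume.
heap_used_pct = percent(7 * GIB, 8 * GIB)
disk_free_pct = percent(60 * GIB, 500 * GIB)

heap_alert = heap_used_pct > 85   # heap ceiling from the guidance above
disk_alert = disk_free_pct < 15   # assumed example floor for free disk space

print(f"heap used: {heap_used_pct:.1f}% alert={heap_alert}")
print(f"disk free: {disk_free_pct:.1f}% alert={disk_alert}")
```

Both conditions trip for this node, which is the situation the corresponding alert rules are meant to catch before it becomes an outage.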

Alerting Strategies

Effective alerting is crucial for proactive issue resolution. We’ll use AWS CloudWatch Alarms for EC2 metrics and Prometheus Alertmanager for Elasticsearch and application-level metrics.

CloudWatch Alarms for WordPress EC2

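Create CloudWatch Alarms based on the metrics collected by the CloudWatch Agent. CloudWatch matches an alarm against the exact dimension set of a metric, so --dimensions must list every dimension the metric is published with: by default the agent attaches device and fstype to disk metrics, and InstanceId is only present when append_dimensions adds it. Either extend --dimensions accordingly or aggregate on InstanceId and path in the agent configuration. For example, to alert when disk usage on the WordPress root volume stays at or above 85% for two consecutive 5-minute periods: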

# Using AWS CLI
aws cloudwatch put-metric-alarm \
    --alarm-name "WordPress-HighDiskUsage" \
    --alarm-description "Alarm when WordPress root disk usage is high" \
    --metric-name "disk_used_percent" \
    --namespace "WordPress/EC2" \
    --statistic "Average" \
    --period 300 \
    --threshold 85 \
    --comparison-operator "GreaterThanOrEqualToThreshold" \
    --dimensions "Name=path,Value=/" "Name=InstanceId,Value=i-0abcdef1234567890" \
    --evaluation-periods 2 \
    --datapoints-to-alarm 2 \
    --treat-missing-data "notBreaching" \
    --alarm-actions "arn:aws:sns:us-east-1:123456789012:MyAlertsTopic"
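The interplay of --evaluation-periods, --datapoints-to-alarm and --treat-missing-data is easier to see in a small simulation. The sketch below is a deliberately simplified model of the "M out of N" rule, with missing datapoints counted as not breaching; the function name is hypothetical, and CloudWatch's real evaluation handles missing data and the INSUFFICIENT_DATA state in more detail.

```python
from typing import Optional, Sequence


def alarm_state(
    datapoints: Sequence[Optional[float]],
    threshold: float = 85,
    evaluation_periods: int = 2,
    datapoints_to_alarm: int = 2,
) -> str:
    """Return ALARM if enough of the most recent datapoints breach the threshold."""
    window = datapoints[-evaluation_periods:]
    # None stands for a missing datapoint and is treated as not breaching.
    breaching = sum(1 for value in window if value is not None and value >= threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Each value is one 5-minute average of disk_used_percent (made-up data).
print(alarm_state([70, 86, 91]))   # last two periods both breach
print(alarm_state([90, 80]))       # only one of the last two breaches
print(alarm_state([86, None]))     # latest datapoint is missing
```

With both parameters set to 2, a single spike never triggers the alarm, which keeps short-lived bursts such as a backup run from paging anyone.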

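Similarly, set alarms for high CPU utilization (CPUUtilization in the AWS/EC2 namespace), low available memory (mem_available_percent), and critical log errors. The last of these requires a CloudWatch Logs metric filter, for example one that counts “Fatal error” in the PHP-FPM error log group, so that the alarm has a metric to evaluate.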

Prometheus Alertmanager for Elasticsearch

Alerting rules are defined in Prometheus, which evaluates them and hands firing alerts to Alertmanager for routing. Example rules based on Elasticsearch metrics:

# prometheus/alert.rules.yml
groups:
- name: elasticsearch_alerts
  rules:
  - alert: ElasticsearchClusterRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster {{ $labels.cluster }} is RED"
      description: "The Elasticsearch cluster {{ $labels.cluster }} has been in a RED health state for 5 minutes."

  - alert: ElasticsearchHighJVMHeap
    expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} * 100 > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch JVM heap usage high on {{ $labels.instance }}"
      description: 'Elasticsearch node {{ $labels.instance }} has JVM heap usage above 85% (current: {{ printf "%.1f" $value }}%).'

  - alert: ElasticsearchLowDiskSpace
    expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 15
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch low disk space on {{ $labels.instance }}"
      description: 'Elasticsearch node {{ $labels.instance }} has less than 15% free disk space (current: {{ printf "%.2f" $value }}%).'

Ensure Alertmanager is configured to route these alerts to your desired notification channels (Slack, PagerDuty, email, etc.).

Conclusion: A Layered Approach

A robust server monitoring strategy for WordPress and Elasticsearch on AWS involves combining AWS-native tools like CloudWatch with open-source solutions like Prometheus and its ecosystem. By monitoring system resources, application processes, and critical service health metrics, and by implementing intelligent alerting, you can ensure the stability, performance, and availability of your critical applications.