Server Monitoring Best Practices: Keeping Your WordPress App and MySQL Clusters Alive on AWS

Proactive Health Checks for WordPress on EC2

Maintaining a high-availability WordPress deployment on AWS hinges on robust, proactive monitoring. Beyond basic CPU and memory utilization, we need to inspect application-level metrics and critical system processes. For WordPress running on EC2 instances, this means ensuring the web server (typically Apache or Nginx) and PHP-FPM are responsive, and that WordPress itself isn’t encountering fatal errors.

Web Server Process Monitoring (Nginx Example)

A common failure point is the web server process crashing or becoming unresponsive. We can use `systemd`’s built-in monitoring capabilities or external tools like `monit` or `supervisord`. For `systemd`, we can configure a service to restart automatically if it fails.

Systemd Service Unit for Nginx

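Create or edit the Nginx systemd service file. The packaged unit usually lives at /usr/lib/systemd/system/nginx.service (the exact path varies by AMI); avoid editing it in place, because package updates overwrite it. Either place a full replacement at /etc/systemd/system/nginx.service, as shown here, or add only the restart directives as a drop-in with `systemctl edit nginx`. Run `systemctl daemon-reload` after changing unit files by hand.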

[Unit]
Description=The Nginx HTTP Server
After=syslog.target network.target remote-fs.target nss-lookup.target

[Service]
Type=forking
PIDFile=/var/run/nginx.pid
ExecStart=/usr/sbin/nginx -c /etc/nginx/nginx.conf
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s QUIT $MAINPID
PrivateTmp=true
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

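The key directives here are Restart=on-failure and RestartSec=5s. With Restart=on-failure, systemd restarts Nginx when the process exits with a non-zero status, is terminated by a signal, or hits a timeout, and RestartSec=5s makes it wait 5 seconds before each attempt. This is a fundamental layer of resilience, but it only covers a process that has actually exited: a hung Nginx that is still running is not restarted, so pair it with an HTTP-level check.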
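Since a restart policy reacts only to process exits, a small external check can tell a stopped service apart from one that is running but not answering. The following is a minimal sketch under stated assumptions, not a replacement for `monit` or a load balancer health check: the unit name `nginx`, the probe URL `http://127.0.0.1/`, and the function names are all illustrative.

```python
import subprocess
import urllib.error
import urllib.request


def service_is_active(unit, run=subprocess.run):
    """Return True when `systemctl is-active --quiet <unit>` exits with status 0."""
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


def http_responds(url, timeout=3.0):
    """Return True when the URL answers with a status below 500 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as exc:
        # 4xx still proves the web server is answering requests.
        return exc.code < 500
    except (urllib.error.URLError, OSError):
        return False


def nginx_health(unit="nginx", url="http://127.0.0.1/",
                 run=subprocess.run, probe=http_responds):
    """Classify the service as 'down' (not active), 'hung' (active, no answer) or 'ok'."""
    if not service_is_active(unit, run=run):
        return "down"
    if not probe(url):
        return "hung"
    return "ok"


if __name__ == "__main__":
    print(nginx_health())
```

The `run` and `probe` parameters exist only so the logic can be exercised without a real systemd or web server; a cron job or monitoring agent would call `nginx_health()` with the defaults and alert on anything other than `ok`.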

PHP-FPM Health and Performance

If your WordPress site uses PHP-FPM, its health is equally critical. PHP-FPM can become overloaded, leading to slow response times or outright failures. We need to monitor its process count and error logs.

Monitoring PHP-FPM Status with `pm.status_path`

PHP-FPM exposes a status page that provides real-time metrics. Ensure this is enabled in your PHP-FPM pool configuration (e.g., /etc/php/7.4/fpm/pool.d/www.conf).

; Ensure this is uncommented and set to a path accessible by your web server
pm.status_path = /fpm_status

; Optional: Limit access to the status page for security
; For Nginx, you can add a location block in your server config to restrict access
; For example:
; location ~ ^/fpm_status$ {
;     allow 127.0.0.1;
;     deny all;
;     fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
;     include fastcgi_params;
;     fastcgi_pass unix:/var/run/php/php7.4-fpm.sock;
; }

With this enabled, you can request http://your-wordpress-domain.com/fpm_status (only if the location block permits external access) or, more safely, use `curl` from the server itself. The output, abridged here, provides valuable data:

pool:                 www
process manager:      dynamic
start since:          123
accepted conn:        12345
listen queue:         0
max listen queue:     0
idle processes:       6
active processes:     2
total processes:      8
max active processes: 4
slow requests:        0

Key metrics to monitor:

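active processes: If this consistently approaches pm.max_children (defined in your pool config), you’re likely hitting resource limits and need to scale up or optimize your PHP code. The status page also reports a max children reached counter that records how often the limit was actually hit.
listen queue: A non-zero value indicates requests are waiting to be processed by PHP-FPM.
slow requests: A direct indicator of performance bottlenecks. This counter only increments when request_slowlog_timeout is set in the pool config, and it is cumulative since the pool started, so alert on its rate of increase rather than its absolute value.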

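You can feed these values into Prometheus with a dedicated exporter such as `php-fpm_exporter`, or with a small script that writes them to a `.prom` file for `node_exporter`’s textfile collector (the collector reads files from a directory; it does not scrape the status page itself). Alternatively, a simple `curl` and `grep` script can be run via cron for basic alerting.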
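As a rough illustration of the cron-style approach, the sketch below parses the plain-text status page and applies the three checks discussed above. The function names, the 80% busy ratio, and the status URL are assumptions made for the example; `max_children` has to be supplied from your own pool config.

```python
import urllib.request


def fetch_status(url="http://127.0.0.1/fpm_status", timeout=3.0):
    """Fetch the plain-text status page (the URL is an assumption; match your config)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_fpm_status(text):
    """Turn 'key: value' lines into a dict; purely numeric values become int."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'key: value' pairs
        value = value.strip()
        status[key.strip()] = int(value) if value.isdigit() else value
    return status


def fpm_alerts(status, max_children, busy_ratio=0.8, previous_slow=0):
    """Return human-readable alerts for queueing, worker saturation and slow requests."""
    alerts = []
    if status.get("listen queue", 0) > 0:
        alerts.append("requests are queuing")
    if status.get("active processes", 0) >= busy_ratio * max_children:
        alerts.append("workers near pm.max_children")
    # 'slow requests' is cumulative, so compare against the previous sample.
    if status.get("slow requests", 0) > previous_slow:
        alerts.append("slow requests recorded")
    return alerts
```

A cron job could call `fpm_alerts(parse_fpm_status(fetch_status()), max_children=...)`, persist the last `slow requests` value between runs, and send any non-empty result to your alerting channel.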

WordPress Application-Level Monitoring

Beyond server processes, we need to know if WordPress itself is functioning correctly. This includes checking for fatal errors and ensuring critical pages load.

Monitoring WordPress Error Logs

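WordPress logs errors to wp-content/debug.log when both WP_DEBUG and WP_DEBUG_LOG are enabled in wp-config.php; WP_DEBUG_LOG has no effect while WP_DEBUG is false. In production, keep WP_DEBUG and WP_DEBUG_LOG enabled and disable WP_DEBUG_DISPLAY, so errors are logged but never shown to visitors. Because wp-content/debug.log sits under the web root by default, block public access to it or point WP_DEBUG_LOG at a path outside the web root. Regularly shipping these logs to a centralized logging system (like CloudWatch Logs, ELK stack, or Splunk) is crucial.

define( 'WP_DEBUG', true );          // Required: WP_DEBUG_LOG and WP_DEBUG_DISPLAY are ignored when this is false
define( 'WP_DEBUG_LOG', true );      // Writes to wp-content/debug.log; a string value sets a custom path (WordPress 5.1+)
define( 'WP_DEBUG_DISPLAY', false ); // Never print errors into rendered pages
@ini_set( 'display_errors', 0 );
define( 'SCRIPT_DEBUG', false );     // true loads unminified core CSS/JS; useful in development only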

Use tools like Fluentd, Fluent Bit, or the CloudWatch Agent to tail these logs and send them to your chosen aggregation service. Set up alerts for specific error patterns (e.g., “Fatal error”, “Parse error”, database connection errors).
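Before wiring up a full log pipeline, the alert patterns themselves can be prototyped locally. The sketch below counts matching lines per category; the pattern set and category names are assumptions chosen for illustration and should be tuned against what your own debug.log actually contains.

```python
import re
from collections import Counter

# Illustrative patterns; extend or tighten them for your own plugins and themes.
ALERT_PATTERNS = {
    "fatal": re.compile(r"PHP Fatal error"),
    "parse": re.compile(r"PHP Parse error"),
    "database": re.compile(
        r"WordPress database error|Error establishing a database connection"
    ),
}


def count_alerts(lines):
    """Count log lines per alert category; categories with no matches read as 0."""
    counts = Counter()
    for line in lines:
        for name, pattern in ALERT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # The path is an assumption; adjust it to your WordPress install.
    with open("/var/www/html/wp-content/debug.log", encoding="utf-8", errors="replace") as log:
        print(dict(count_alerts(log)))
```

The same regular expressions can usually be reused as filter patterns in whichever aggregation service ends up receiving the logs.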

Synthetic Monitoring of Key Pages

Implement synthetic checks that periodically request critical pages (homepage, a specific post, checkout page if applicable) and assert expected content or response times. Tools like Prometheus’s `blackbox_exporter` are excellent for this. It can probe endpoints over HTTP, HTTPS, TCP, ICMP, and DNS, and its HTTP probe can send basic authentication credentials and match the response body against regular expressions.
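To make the idea of a synthetic check concrete, here is a minimal standard-library sketch of what such a probe asserts. It is a stand-in for illustration, not how `blackbox_exporter` is implemented; the function names, the 2-second budget, and the example URL are assumptions.

```python
import time
import urllib.error
import urllib.request


def evaluate(status, body, elapsed, expected_text, max_seconds=2.0):
    """Return a list of failed assertions for one page load (empty list means healthy)."""
    failures = []
    if status != 200:
        failures.append(f"unexpected status {status}")
    if expected_text not in body:
        failures.append("expected text missing")
    if elapsed > max_seconds:
        failures.append(f"slow response: {elapsed:.2f}s")
    return failures


def check_page(url, expected_text, timeout=10.0, max_seconds=2.0):
    """Request the page once and evaluate status, content and response time."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        status, body = exc.code, ""
    except (urllib.error.URLError, OSError) as exc:
        return [f"request failed: {exc}"]
    elapsed = time.monotonic() - started
    return evaluate(status, body, elapsed, expected_text, max_seconds)


if __name__ == "__main__":
    # Hypothetical URL and marker text; use a string that only a healthy page renders.
    print(check_page("https://your-wordpress-domain.com/", "</html>"))
```

Asserting on page content matters for WordPress in particular, because a database outage can produce an error page that a status-only check may not distinguish from a healthy response.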

MySQL Cluster Health on AWS RDS/Aurora

For managed database services like AWS RDS or Aurora, much of the underlying infrastructure monitoring is handled by AWS. However, we still need to focus on database-specific metrics that impact application performance and availability.

Key RDS/Aurora Metrics to Monitor

AWS CloudWatch provides a rich set of metrics for RDS and Aurora instances. Ensure these are configured for alarm thresholds:

CPUUtilization: High CPU can indicate inefficient queries or insufficient instance size.
DatabaseConnections: Monitor the number of active connections. If it approaches the max_connections limit, you’ll need to optimize queries, connection pooling, or scale up.
ReadIOPS and WriteIOPS: For provisioned IOPS instances, ensure you’re not exceeding your provisioned limits. For general purpose instances, monitor these to understand I/O load.
ReadLatency and WriteLatency: High latency is a direct indicator of performance issues.
FreeableMemory: Low freeable memory can lead to increased disk I/O as the database swaps data.
DiskQueueDepth: A consistently high queue depth indicates the storage subsystem cannot keep up with the I/O requests.
Aurora specific: AuroraReplicaLag (for read replicas), ServerlessDatabaseCapacity (for Aurora Serverless).

Set up CloudWatch Alarms on these metrics with appropriate thresholds. For example, an alarm on CPUUtilization exceeding 80% for 15 minutes, or DatabaseConnections exceeding 90% of the maximum.
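CloudWatch alarm thresholds are absolute numbers, so "90% of the maximum" has to be computed from your instance’s max_connections value. The sketch below builds the arguments for the two example alarms as plain dictionaries; the helper names and alarm names are hypothetical, and applying them assumes the third-party `boto3` package and suitable IAM permissions.

```python
def connection_alarm(db_instance_id, max_connections, percent=90):
    """Arguments for an alarm on DatabaseConnections above `percent` of max_connections."""
    return {
        "AlarmName": f"{db_instance_id}-DatabaseConnections-High",
        "Namespace": "AWS/RDS",
        "MetricName": "DatabaseConnections",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": float(max_connections * percent // 100),
        "ComparisonOperator": "GreaterThanThreshold",
    }


def cpu_alarm(db_instance_id, percent=80):
    """Arguments for an alarm on CPUUtilization above `percent` for 15 minutes (3 x 300 s)."""
    return {
        "AlarmName": f"{db_instance_id}-CPUUtilization-High",
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": float(percent),
        "ComparisonOperator": "GreaterThanThreshold",
    }


def apply_alarms(alarms):
    """Create or update the alarms; boto3 is imported lazily because it is third-party."""
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    for alarm in alarms:
        cloudwatch.put_metric_alarm(**alarm)
```

For example, `apply_alarms([cpu_alarm("your-db-instance-name"), connection_alarm("your-db-instance-name", max_connections=1000)])` would register both alarms; add `AlarmActions` with an SNS topic ARN if the alarms should notify someone.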

Query Performance and Slow Query Logs

Even with adequate instance resources, poorly optimized SQL queries can cripple your WordPress site. Monitoring slow queries is paramount.

Enabling and Analyzing Slow Query Logs

For RDS, the slow query log is controlled by parameters in a custom DB parameter group (the default group cannot be modified), which you can change in the RDS console or with the AWS CLI. With log_output set to FILE, the log can then be published to CloudWatch Logs.

# Enable the slow query log in the custom parameter group attached to the instance.
# long_query_time is in seconds; log_output must be FILE (not TABLE) for CloudWatch Logs export.
aws rds modify-db-parameter-group \
    --db-parameter-group-name your-parameter-group-name \
    --parameters \
        "ParameterName=slow_query_log,ParameterValue=1,ApplyMethod=immediate" \
        "ParameterName=long_query_time,ParameterValue=2,ApplyMethod=immediate" \
        "ParameterName=log_output,ParameterValue=FILE,ApplyMethod=immediate"

# Publish the slow query log to CloudWatch Logs.
aws rds modify-db-instance \
    --db-instance-identifier your-db-instance-name \
    --cloudwatch-logs-export-configuration '{"EnableLogTypes":["slowquery"]}' \
    --apply-immediately

# Optional and independent of the slow query log: enable Performance Insights.
aws rds modify-db-instance \
    --db-instance-identifier your-db-instance-name \
    --enable-performance-insights \
    --performance-insights-kms-key-id YOUR_KMS_KEY_ID \
    --performance-insights-retention-period 7

Once enabled and streamed to CloudWatch Logs, you can use CloudWatch Logs Insights or a dedicated query analysis tool to identify problematic queries. Look for queries with high execution counts and long average execution times.

AWS Performance Insights

AWS Performance Insights is a powerful tool for diagnosing database performance issues. It provides a visual dashboard to identify bottlenecks, analyze query load, and understand wait events without needing to manually parse log files.

Navigate to the RDS console, select your database instance, and go to the “Performance Insights” tab.
Analyze the “Top SQL” and “Wait Events” sections.
Filter by time range to correlate performance degradation with specific events or traffic spikes.

If Performance Insights reveals consistently slow queries, you’ll need to:

Optimize the SQL query itself (e.g., rewrite joins or subqueries).
Add appropriate database indexes. WordPress plugins can sometimes create inefficient queries or lack necessary indexes. Use `EXPLAIN` to analyze query plans and confirm that an index is actually used.
Consider caching strategies at the WordPress application level (e.g., object caching with Redis/Memcached, page caching plugins) to reduce database load.

Database Replication Lag and Failover Readiness

For high availability, especially with read replicas or Aurora clusters, monitoring replication lag is critical. Excessive lag means read replicas are out of sync: they serve stale data, and a lagging RDS replica that is promoted may be missing the most recent writes.

Monitoring `AuroraReplicaLag` and `ReplicaLag`

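CloudWatch provides the AuroraReplicaLag metric for Aurora clusters and ReplicaLag for standard RDS MySQL replicas. Note the units: ReplicaLag is reported in seconds, while AuroraReplicaLag is reported in milliseconds, so thresholds are not interchangeable between the two. Set up alarms for these metrics to fire when lag exceeds a few seconds (the acceptable threshold depends on your application’s tolerance for stale data).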

# Example CloudWatch Alarm configuration (conceptual, using AWS CLI)
aws cloudwatch put-metric-alarm \
    --alarm-name "RDS-MySQL-ReplicaLag-High" \
    --metric-name "ReplicaLag" \
    --namespace "AWS/RDS" \
    --statistic Average \
    --period 300 \
    --threshold 10 \
    --comparison-operator GreaterThanThreshold \
    --dimensions "Name=DBInstanceIdentifier,Value=your-read-replica-db-instance-name" \
    --evaluation-periods 2 \
    --alarm-description "High replication lag detected on RDS read replica." \
    --treat-missing-data notBreaching

Investigate high lag by checking:

Network connectivity between the primary and replica.
The load on the replica instance (CPU, I/O).
The load on the primary instance (especially write activity).
Long-running queries on the primary that might be blocking replication threads.
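When comparing lag across RDS and Aurora, keep in mind that ReplicaLag is reported in seconds and AuroraReplicaLag in milliseconds. The sketch below normalizes both to seconds and applies a "breaching for N consecutive samples" rule similar in spirit to the alarm’s evaluation periods; it is a simplified illustration with hypothetical function names, not a reimplementation of CloudWatch’s alarm evaluation.

```python
# Metric units per second: ReplicaLag is in seconds, AuroraReplicaLag in milliseconds.
LAG_UNITS_PER_SECOND = {"ReplicaLag": 1, "AuroraReplicaLag": 1000}


def lag_seconds(metric_name, value):
    """Convert a raw metric value to seconds."""
    return value / LAG_UNITS_PER_SECOND[metric_name]


def lag_breaches(metric_name, datapoints, threshold_seconds=10.0, evaluation_periods=2):
    """True when the most recent `evaluation_periods` samples all exceed the threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False  # not enough data to decide
    return all(lag_seconds(metric_name, value) > threshold_seconds for value in recent)
```

With these defaults, an Aurora series of `[12, 15]` (milliseconds) is healthy, while the same numbers from an RDS MySQL replica (seconds) would breach a 10-second threshold.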

Centralized Logging and Alerting Strategy

A fragmented monitoring approach is insufficient. All relevant logs (web server, PHP-FPM, WordPress debug logs, MySQL slow query logs) and metrics should be aggregated into a central system. AWS CloudWatch Logs is a natural fit within the AWS ecosystem, offering:

Log Ingestion: Using the CloudWatch Agent to collect logs from EC2 instances and stream them.
Log Analysis: CloudWatch Logs Insights for querying and analyzing logs.
Alerting: Creating CloudWatch Alarms based on log patterns or metric thresholds.
Dashboards: Visualizing key metrics and log data.

For more advanced analysis or integration with external ticketing systems, consider shipping logs from CloudWatch Logs to services like Elasticsearch (via Kinesis Firehose) or Splunk.

Automated Recovery and Remediation

Beyond alerting, implement automated recovery actions where feasible. For instance:

Auto Scaling Groups: Configure EC2 Auto Scaling to replace unhealthy instances automatically.
Lambda Functions: Trigger Lambda functions via CloudWatch Alarms to perform specific remediation tasks (e.g., restarting a service, clearing cache, scaling up database read replicas).
RDS Automated Backups and Snapshots: Ensure these are configured and regularly tested.
Multi-AZ Deployments: For RDS, leverage Multi-AZ for automatic failover.
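As one possible shape for the Lambda-based remediation above, the sketch below restarts PHP-FPM on the affected instance through AWS Systems Manager Run Command. It rests on several assumptions: the alarm publishes to an SNS topic that invokes the function, the alarm carries an `InstanceId` dimension, the instance runs the SSM agent, and the service is named `php7.4-fpm`. The `ssm` parameter is an illustrative hook for supplying a client; by default the third-party `boto3` client is created lazily.

```python
import json


def instance_id_from_alarm(message):
    """Extract the InstanceId dimension from a CloudWatch alarm notification, if present."""
    for dimension in message.get("Trigger", {}).get("Dimensions", []):
        if dimension.get("name") == "InstanceId":
            return dimension.get("value")
    return None


def handler(event, context, ssm=None):
    """Lambda entry point: restart PHP-FPM on the instance named in an ALARM notification."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message.get("NewStateValue") != "ALARM":
        return {"action": "none"}  # ignore OK and INSUFFICIENT_DATA transitions

    instance_id = instance_id_from_alarm(message)
    if instance_id is None:
        return {"action": "none"}

    if ssm is None:
        import boto3  # available in the Lambda Python runtime

        ssm = boto3.client("ssm")

    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        # Service name is an assumption; match it to your distribution and PHP version.
        Parameters={"commands": ["systemctl restart php7.4-fpm"]},
    )
    return {"action": "restart", "instance": instance_id}
```

Automated restarts can mask a recurring fault, so it is worth logging every invocation and alerting a human if the same instance is restarted repeatedly.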

A well-architected monitoring strategy is not just about detecting failures, but about preventing them and ensuring rapid, automated recovery when they do occur, minimizing downtime for your WordPress application.