Server Monitoring Best Practices: Keeping Your WordPress App and Redis Clusters Alive on Google Cloud
Proactive WordPress & Redis Monitoring on Google Cloud
Maintaining the health and performance of a production WordPress application, especially when coupled with a Redis cluster for caching and session management, demands a robust and proactive monitoring strategy. On Google Cloud Platform (GCP), this translates to leveraging a combination of native GCP services and specialized tools. This guide focuses on actionable steps and configurations for ensuring high availability and rapid issue detection.
GCP Monitoring Fundamentals: Cloud Monitoring & Logging
Google Cloud’s native monitoring suite, Cloud Monitoring (formerly Stackdriver), is the foundational layer. It provides metrics, logs, and alerting for all GCP resources. For WordPress and Redis, we’ll focus on key metrics and custom log ingestion.
WordPress Application Metrics
While Cloud Monitoring captures VM-level metrics (CPU, memory, disk I/O), application-level insights are crucial. We’ll use the Ops Agent to collect custom metrics and logs from our WordPress instances.
Ops Agent Configuration for WordPress
The Ops Agent lets us collect logs and metrics beyond the standard VM-level offerings. We’ll configure it to ingest PHP-FPM and Apache logs and to scrape Apache’s status page. At the time of writing, the Ops Agent’s documented integrations do not include a PHP-FPM status receiver, so PHP-FPM pool metrics need a separate exporter or a dedicated APM agent such as New Relic or Datadog.
First, ensure the Ops Agent is installed on your WordPress Compute Engine instances. Then, create a configuration file, typically located at /etc/google-cloud-ops-agent/config.yaml. Here’s a sample configuration focusing on PHP-FPM logs, web server logs, and Apache status metrics:
# /etc/google-cloud-ops-agent/config.yaml
logging:
  receivers:
    php_fpm_log:
      type: files
      include_paths:
        - /var/log/php*/fpm/www.error.log
        - /var/log/php*/fpm/www.access.log # If configured
    apache_access_log:
      type: files
      include_paths:
        - /var/log/apache2/access.log
        - /var/log/httpd/access_log
    apache_error_log:
      type: files
      include_paths:
        - /var/log/apache2/error.log
        - /var/log/httpd/error_log
  processors:
    # Example: parse PHP-FPM access logs for request duration.
    # Assumes an access.format that produces this layout; adjust the regex to yours.
    parse_php_fpm_access:
      type: parse_regex
      field: message
      regex: '^(?<remote_ip>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<verb>\S+) (?<request>\S+) (?<protocol>\S+)" (?<status>\d+) (?<bytes>\d+) "(?<referer>.*?)" "(?<user_agent>.*?)" (?<request_time>[\d\.]+)$'
  service:
    pipelines:
      wordpress_logs:
        receivers: [php_fpm_log, apache_access_log, apache_error_log]
        # For simplicity, logs are forwarded without parsing here.
        # To parse, add: processors: [parse_php_fpm_access]
        # More advanced analysis can be done with log-based metrics.
metrics:
  receivers:
    apache_status:
      type: apache
      server_status_url: http://127.0.0.1:80/server-status?auto # Or your Apache port
      collection_interval: 60s # Collect metrics every minute
  service:
    pipelines:
      apache:
        receivers: [apache_status]
After updating the configuration, restart the Ops Agent:
sudo systemctl restart google-cloud-ops-agent
sudo systemctl status google-cloud-ops-agent
Redis Cluster Metrics
For Redis, we’ll leverage the Ops Agent’s built-in Redis receiver and also collect Redis-specific metrics via the Redis CLI. Cloud Monitoring can ingest these metrics.
Ops Agent Configuration for Redis
Add the following to your /etc/google-cloud-ops-agent/config.yaml:
# Merge these keys into the existing metrics section if you already have one.
metrics:
  receivers:
    redis:
      type: redis
      address: 127.0.0.1:6379 # Adjust to your Redis host and port
      # Optional: authentication
      # password: "your_redis_password"
  service:
    pipelines:
      redis:
        receivers: [redis]
Key Redis metrics to monitor include:
- redis/instantaneous_ops_per_sec: Commands processed per second. High values might indicate heavy load.
- redis/connected_clients: Number of connected clients. Spikes could indicate connection leaks or DoS.
- redis/used_memory: Memory usage. Crucial for avoiding OOM errors.
- redis/evicted_keys: Number of keys evicted due to memory limits. Indicates memory pressure.
- redis/rejected_connections: Number of rejected connections.
- redis/keyspace_hits and redis/keyspace_misses: Cache hit/miss ratio. Low hit ratio might mean Redis isn’t effective or needs more memory.
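Redis reports hits and misses as two separate counters, so the hit ratio has to be derived from them. The following is a minimal sketch of that calculation, assuming the text of redis-cli INFO stats is piped in on stdin; the function name redis_hit_ratio is hypothetical and the sample values are made up.

```shell
# Hypothetical helper: compute the cache hit ratio (percent) from INFO stats text on stdin.
redis_hit_ratio() {
    # INFO lines end in \r\n, so strip carriage returns before parsing.
    tr -d '\r' | awk -F: '
        $1 == "keyspace_hits"   { hits = $2 }
        $1 == "keyspace_misses" { misses = $2 }
        END {
            total = hits + misses
            if (total == 0) { print "n/a"; exit }
            printf "%.2f\n", 100 * hits / total
        }'
}

# Example with captured INFO output (illustrative values only).
printf '# Stats\r\nkeyspace_hits:900\r\nkeyspace_misses:100\r\n' | redis_hit_ratio
```

Against a live server this would be invoked as redis-cli INFO stats | redis_hit_ratio. Both counters are cumulative since the last restart, so the result is a lifetime average rather than a current rate.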
Log-Based Metrics for Deeper Insights
Beyond raw metrics, we can define log-based metrics in Cloud Logging, which then appear in Cloud Monitoring, to track specific events or error patterns. Examples include counting specific WordPress error messages or entries in the MySQL slow query log.
Example: WordPress Fatal Error Count
Assuming your WordPress error logs (e.g., wp-content/debug.log if enabled) contain lines like:
[2023-10-27 10:30:00] /var/www/html/wp-includes/plugin.php:123 - PHP Fatal error: Uncaught Error: Call to undefined function some_function() in /var/www/html/wp-content/plugins/my-plugin/my-plugin.php:456
You can create a log-based metric in the GCP Console (Logging -> Log-based Metrics -> Create Metric). Use a filter like:
-- Adjust the log name based on your Ops Agent config
resource.type="gce_instance"
logName="projects/YOUR_PROJECT_ID/logs/php_errors"
textPayload:"PHP Fatal error:"
This will give you a time-series metric of fatal errors, which can be used for alerting.
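The same metric can also be created from the command line. The sketch below builds the filter string and shows, as a commented-out call, how it might be passed to gcloud logging metrics create. The helper name fatal_error_filter, the metric name wordpress_fatal_errors, and the log name php_errors are illustrative assumptions, not values taken from your environment.

```shell
# Hypothetical helper: build the Cloud Logging filter for PHP fatal errors.
# Arguments: project ID and log name (placeholders you must supply).
fatal_error_filter() {
    printf 'resource.type="gce_instance" AND logName="projects/%s/logs/%s" AND textPayload:"PHP Fatal error:"\n' "$1" "$2"
}

# Print the filter so it can be reviewed before creating the metric.
filter=$(fatal_error_filter "YOUR_PROJECT_ID" "php_errors")
echo "$filter"

# Uncomment to create the metric (requires gcloud and suitable permissions):
# gcloud logging metrics create wordpress_fatal_errors \
#     --description="Count of PHP fatal errors in WordPress logs" \
#     --log-filter="$filter"
```

Keeping the filter in a function makes it easy to reuse the identical string in the Logs Explorer when checking that it matches the entries you expect.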
Alerting Strategies with Cloud Monitoring
Effective alerting is about notifying the right people about the right problems at the right time, without causing alert fatigue. We’ll configure alerting policies based on the metrics and logs we’re collecting.
Key Alerting Thresholds
- CPU Utilization: Alert when average CPU > 80% for 5 minutes (for critical instances).
- Memory Utilization: Alert when memory usage > 85% for 5 minutes.
- Disk I/O Wait: Alert when I/O wait time > 10% for 5 minutes.
- Redis Memory Usage: Alert when redis/used_memory > 80% of maxmemory.
- Redis Evictions: Alert when redis/evicted_keys increases significantly over a short period (e.g., > 100 in 1 minute).
- HTTP Error Rate: Alert when the rate of 5xx errors (from web server access logs or Cloud Load Balancing logs) exceeds a threshold (e.g., > 1% of total requests for 5 minutes).
- PHP Fatal Errors: Alert when the fatal-error log-based metric counts more than 5 errors in 10 minutes.
- Redis Latency: If you’re using Redis Enterprise or have custom latency monitoring, alert on p99 latency exceeding 50ms.
- Unhealthy Redis Nodes: Monitor Redis cluster health commands (e.g., redis-cli cluster nodes) and alert if nodes are marked as `fail` or `noaddr`.
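To make the HTTP error rate threshold concrete, here is a rough sketch that computes the share of 5xx responses from access-log lines and compares it with a limit. It assumes the Apache common or combined log format, where the status code is the ninth whitespace-separated field; the helper names http_5xx_rate and exceeds_threshold are hypothetical.

```shell
# Hypothetical helper: percentage of 5xx responses in access-log lines read from stdin.
# Assumes the Apache common/combined format, where the status code is field 9.
http_5xx_rate() {
    awk '
        $9 ~ /^[0-9][0-9][0-9]$/ { total++; if ($9 >= 500) errors++ }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * errors / total
        }'
}

# Hypothetical helper: exit 0 when the rate (arg 1) is above the limit (arg 2), both in percent.
exceeds_threshold() {
    awk -v rate="$1" -v limit="$2" 'BEGIN { exit !(rate > limit) }'
}

# Example with two illustrative log lines: one success, one 502.
rate=$(printf '%s\n' \
    '10.0.0.1 - - [27/Oct/2023:10:30:00 +0000] "GET / HTTP/1.1" 200 512' \
    '10.0.0.2 - - [27/Oct/2023:10:30:01 +0000] "GET /wp-login.php HTTP/1.1" 502 0' \
    | http_5xx_rate)
if exceeds_threshold "$rate" 1; then
    echo "5xx rate ${rate}% is above the 1% threshold"
fi
```

In production the same ratio is better expressed as a log-based metric or a load balancer metric, but a local check like this is handy for validating a threshold against real traffic before committing it to a policy.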
Configuring Alerting Policies
In Cloud Monitoring, navigate to “Alerting” and create new policies. For each policy, define:
- Condition: The metric and threshold (e.g., compute.googleapis.com/instance/cpu/utilization > 0.8).
- Trigger: How many data points must breach the threshold (e.g., “any” or “all”).
- Duration: For how long the condition must be met (e.g., “5 minutes”).
- Notification Channel: Where alerts are sent (e.g., Email, PagerDuty, Slack via Pub/Sub).
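These fields map onto the JSON form of an alerting policy, which is convenient if you keep policies in version control. The sketch below generates a policy file for the CPU example; the function name cpu_alert_policy and the display names are hypothetical, and notification channels are left out because their IDs are specific to your project.

```shell
# Hypothetical helper: emit an alerting policy definition for high CPU utilization.
# Arguments: threshold as a fraction (e.g. 0.8) and duration in seconds (e.g. 300).
cpu_alert_policy() {
    cat <<EOF
{
  "displayName": "High CPU utilization",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "CPU above $1 for $2s",
      "conditionThreshold": {
        "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": $1,
        "duration": "$2s",
        "trigger": { "count": 1 },
        "aggregations": [
          { "alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN" }
        ]
      }
    }
  ]
}
EOF
}

# Write the policy for "CPU > 80% for 5 minutes" to a file.
cpu_alert_policy 0.8 300 > cpu-policy.json

# Uncomment to create it, assuming this command is available in your gcloud version:
# gcloud alpha monitoring policies create --policy-from-file=cpu-policy.json
```

The trigger count of 1 corresponds to the “any” setting in the console: a single time series breaching the threshold for the full duration is enough to open an incident.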
Example Alerting Policy: High Redis Memory Usage
1. Go to Cloud Monitoring -> Alerting -> Create Policy.
2. Click “Add Condition”.
3. Search for the metric: redis/used_memory.
4. Filter by the specific Redis instance or cluster group (if using resource labels).
5. Set the condition: “is above” a byte value equal to 80% of maxmemory. redis/used_memory is reported in bytes, so a ratio such as 0.8 only works if you alert on used memory divided by maxmemory instead of the raw metric.
6. Set the duration: “for 5 minutes”.
7. Click “Next”.
8. Configure Notification Channels (e.g., select your PagerDuty service).
9. Name the policy (e.g., “High Redis Memory Usage – Production Cluster”) and provide documentation.
10. Click “Save Policy”.
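Because the threshold in step 5 depends on maxmemory, it helps to compute the byte value instead of estimating it. A small sketch, assuming maxmemory is known in bytes (for example from redis-cli CONFIG GET maxmemory); the function name memory_threshold_bytes is hypothetical.

```shell
# Hypothetical helper: bytes corresponding to a percentage of maxmemory.
# Arguments: maxmemory in bytes, percentage as an integer.
memory_threshold_bytes() {
    if [ "$1" -le 0 ]; then
        # maxmemory 0 means "no limit", so a percentage threshold is undefined.
        echo "maxmemory is 0 (no limit set); choose an absolute byte threshold" >&2
        return 1
    fi
    echo $(( $1 * $2 / 100 ))
}

# Example: 80% of a 4 GiB maxmemory.
memory_threshold_bytes 4294967296 80   # prints 3435973836
```

On a live node the first argument could come from redis-cli CONFIG GET maxmemory | tail -n 1, which prints the configured value on its last line.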
Redis Cluster Specific Monitoring & Health Checks
For Redis clusters (especially if using Redis Cluster mode or Sentinel), basic metrics aren’t enough. We need to actively check cluster health.
Automated Redis Cluster Health Checks
We can use a simple script run via cron or Cloud Scheduler to perform deeper checks.
Example Bash Script for Redis Cluster Health
#!/bin/bash
# Checks Redis and Redis Cluster health and reports the results to Cloud Monitoring.
REDIS_CLI="/usr/bin/redis-cli"
REDIS_HOST="127.0.0.1" # Or your primary Redis node
REDIS_PORT="6379"
REDIS_PASSWORD="your_redis_password" # If applicable

# Look up instance details from the metadata server.
METADATA_URL="http://metadata.google.internal/computeMetadata/v1"
metadata() { curl -s -H "Metadata-Flavor: Google" "${METADATA_URL}/$1"; }
PROJECT_ID=$(metadata "project/project-id")
INSTANCE_ID=$(metadata "instance/id")
ZONE=$(metadata "instance/zone" | cut -d'/' -f4)

# Run redis-cli with the shared connection settings (stderr is dropped to hide the -a warning).
redis_cmd() {
    "${REDIS_CLI}" -h "${REDIS_HOST}" -p "${REDIS_PORT}" -a "${REDIS_PASSWORD}" "$@" 2>/dev/null
}

# Function to send custom metric to Cloud Monitoring
send_custom_metric() {
    local metric_type="$1"
    local value="$2"
    local description="$3"
    curl -s -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" -d @- <<EOF
{
  "timeSeries": [
    {
      "metric": {
        "type": "custom.googleapis.com/${metric_type}",
        "labels": { "description": "${description}" }
      },
      "resource": {
        "type": "gce_instance",
        "labels": {
          "project_id": "${PROJECT_ID}",
          "instance_id": "${INSTANCE_ID}",
          "zone": "${ZONE}"
        }
      },
      "points": [
        {
          "interval": { "endTime": "$(date -u +%Y-%m-%dT%H:%M:%SZ)" },
          "value": { "doubleValue": ${value} }
        }
      ]
    }
  ]
}
EOF
}

# Check Redis connection (a healthy server answers PING with PONG)
if ! redis_cmd PING | grep -q 'PONG'; then
    echo "Error: Could not connect to Redis."
    send_custom_metric "redis/health_check_failed" 1 "Redis connection failed"
    exit 1
fi

# Check Redis Cluster Status (if in cluster mode)
if redis_cmd CLUSTER INFO | grep -q 'cluster_state:ok'; then
    echo "Redis cluster state is OK."
    send_custom_metric "redis/cluster_health_ok" 1 "Redis cluster state is OK"
    # Count nodes whose flags (third field) include exactly "fail".
    # A plain grep for 'fail' would also match the "fail?" and "nofailover" flags.
    FAILING_NODES=$(redis_cmd CLUSTER NODES | awk '$3 ~ /(^|,)fail(,|$)/ { n++ } END { print n + 0 }')
    if [ "$FAILING_NODES" -gt 0 ]; then
        echo "Warning: Found $FAILING_NODES failing Redis nodes."
    fi
    send_custom_metric "redis/failing_nodes_count" "$FAILING_NODES" "Number of failing Redis nodes detected"
else
    echo "Error: Redis cluster is not in OK state."
    send_custom_metric "redis/cluster_health_failed" 1 "Redis cluster state is NOT OK"
    exit 1
fi

# Check for slow operations (requires CONFIG SET slowlog-log-slower-than)
# SLOW_OPS=$(redis_cmd SLOWLOG LEN)
# if [ "$SLOW_OPS" -gt 5 ]; then # Example threshold
#     echo "Warning: High number of slow Redis operations detected."
#     send_custom_metric "redis/slow_operations_detected" 1 "High number of slow Redis operations"
# fi

echo "Redis health check completed successfully."
send_custom_metric "redis/health_check_passed" 1 "Redis health check script completed"
exit 0
Explanation:
- This script connects to Redis, checks the PING response, verifies the CLUSTER INFO state, and counts nodes marked as fail in CLUSTER NODES.
- It sends custom metrics (e.g., custom.googleapis.com/redis/health_check_failed) to Cloud Monitoring. These custom metrics can then be used to create alerts.
- Ensure the script has execute permissions (chmod +x check_redis_health.sh) and is run periodically (e.g., every 5 minutes) via cron or Cloud Scheduler.
- The script reads instance details such as the zone from the instance metadata service, making it portable across instances.
- Replace placeholders like your_redis_password and adjust host/port if necessary.
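Node state in CLUSTER NODES is carried in a comma-separated flags field, so counting by exact flag is more reliable than matching substrings, and it also covers states such as noaddr or fail? (suspected failure). A minimal sketch follows; count_nodes_with_flag is a hypothetical helper and the node lines are shortened, made-up samples.

```shell
# Hypothetical helper: count nodes carrying an exact flag in CLUSTER NODES output on stdin.
# The flags are the comma-separated third field of each line.
count_nodes_with_flag() {
    awk -v flag="$1" '
        {
            n = split($3, flags, ",")
            for (i = 1; i <= n; i++) {
                if (flags[i] == flag) { count++; break }
            }
        }
        END { print count + 0 }'
}

# Illustrative CLUSTER NODES output with shortened node IDs.
sample='07c3 10.0.0.1:6379@16379 myself,master - 0 0 1 connected 0-5460
67ed 10.0.0.2:6379@16379 master,fail - 1700000000000 1700000000000 2 disconnected 5461-10922
292f 10.0.0.3:6379@16379 master,nofailover - 0 1700000000000 3 connected 10923-16383
e7d1 10.0.0.4:6379@16379 slave,fail? 07c3 0 1700000000000 1 connected'

printf '%s\n' "$sample" | count_nodes_with_flag fail     # prints 1
printf '%s\n' "$sample" | count_nodes_with_flag noaddr   # prints 0
```

The node flagged nofailover is not counted as failing here, which is the false positive a substring match on fail would produce.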
Monitoring Redis Sentinel
If you are using Redis Sentinel for high availability, monitor Sentinel itself:
- Sentinel Processes: Ensure Sentinel processes are running on all designated nodes.
- Master/Replica Status: Use redis-cli -p 26379 SENTINEL master mymaster (replace mymaster with your master name) to check the status of the master and its replicas. Look for num-slaves and num-other-sentinels.
- Sentinel Logs: Ingest Sentinel logs into Cloud Logging and create alerts for critical events like master failovers.
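When run non-interactively, redis-cli prints the SENTINEL master reply as alternating lines of field name and value, which makes individual fields easy to extract in a script. The sketch below assumes that layout; sentinel_field is a hypothetical helper and the sample reply is abbreviated.

```shell
# Hypothetical helper: print the value of one field from "SENTINEL master <name>" output on stdin.
# Assumes alternating lines: field name on odd lines, its value on the following even line.
sentinel_field() {
    awk -v key="$1" '
        NR % 2 == 1 { name = $0; next }
        name == key { print; exit }'
}

# Abbreviated, illustrative reply.
reply='name
mymaster
flags
master
num-slaves
2
num-other-sentinels
2'

replicas=$(printf '%s\n' "$reply" | sentinel_field num-slaves)
echo "replicas: ${replicas}"
if [ "$replicas" -lt 1 ]; then
    echo "Warning: master has no replicas"
fi
```

Against a running Sentinel this would be redis-cli -p 26379 SENTINEL master mymaster | sentinel_field num-slaves. The flags field is worth checking the same way, since Sentinel adds markers such as s_down or o_down to it when it considers the master unreachable.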
WordPress Application Performance Monitoring (APM) Integration
While Cloud Monitoring provides infrastructure and basic application metrics, true APM offers deep insights into code execution, database queries, and external service calls. Consider integrating a dedicated APM solution.
Options for APM on GCP
- Google Cloud Trace & Profiler: Native GCP services that can provide distributed tracing and performance profiling. Requires instrumenting your application (e.g., using OpenTelemetry SDKs for PHP).
- Third-Party APM Tools: Datadog, New Relic, Dynatrace, AppDynamics. These typically involve installing an agent on your VMs and configuring it to send data to their platform. They often offer more comprehensive features out-of-the-box for WordPress than native GCP tools alone.
Integrating a Third-Party APM Agent
The process varies by vendor, but generally involves:
- Installing the vendor’s agent package (e.g., via apt or yum).
- Configuring the agent with your API key and specifying which applications/processes to monitor.
- Restarting the web server (Apache/Nginx) and PHP-FPM to load the APM extension.
- Ensuring network connectivity from your VMs to the APM vendor’s collection endpoints.
Once integrated, you’ll gain access to dashboards showing:
- Transaction traces (individual requests) with breakdown by time spent in PHP, database, external calls.
- Database query performance analysis.
- Error tracking and reporting.
- Server-side performance metrics specific to WordPress (e.g., plugin execution time).
Centralized Logging & Analysis
Aggregating logs from all WordPress and Redis instances into a central location is vital for troubleshooting and historical analysis. Cloud Logging is the natural choice on GCP.
Log Ingestion & Retention
The Ops Agent, as configured earlier, forwards logs to Cloud Logging. Ensure your log retention policies in Cloud Logging are adequate for your compliance and debugging needs (e.g., 30-90 days for production logs).
Log-Based Alerts & Dashboards
As shown with fatal errors, log-based metrics are powerful. Create dashboards in Cloud Monitoring to visualize key log events and error rates. Use log-based alerts for critical log patterns that don’t have corresponding metrics.
Example: Alert on WordPress Security Plugin Block
If your security plugin logs blocked IPs (e.g., “IP [1.2.3.4] blocked for attempting SQL injection”), create a log-based metric and alert:
-- Custom log name; adjust to match your setup
resource.type="gce_instance"
logName="projects/YOUR_PROJECT_ID/logs/wordpress_security"
textPayload:"blocked for attempting SQL injection"
Alert if the count of this metric exceeds 10 in 15 minutes.
Conclusion: A Multi-Layered Approach
Effective server monitoring for a WordPress and Redis stack on GCP is not a single tool but a layered strategy. It involves:
- Leveraging Cloud Monitoring for infrastructure and core application metrics via the Ops Agent.
- Implementing custom metrics and log-based metrics for specific WordPress and Redis behaviors.
- Configuring intelligent alerting policies to minimize noise and maximize actionable insights.
- Utilizing specialized scripts for deep health checks on critical components like Redis clusters.
- Considering APM solutions for granular application performance visibility.
- Centralizing logs in Cloud Logging for comprehensive analysis and troubleshooting.
By combining these elements, you can build a resilient monitoring system that keeps your WordPress application and Redis clusters healthy, performant, and available.