Server Monitoring Best Practices: Keeping Your WordPress App and PostgreSQL Clusters Alive on Linode
Establishing a Baseline: Essential Metrics for WordPress and PostgreSQL
Effective server monitoring hinges on understanding what “normal” looks like for your specific application stack. For a WordPress site backed by a PostgreSQL cluster on Linode, this means tracking key performance indicators (KPIs) across both the web server layer and the database layer. Without a baseline, anomaly detection and proactive issue resolution become guesswork.
For the WordPress application itself, we’ll focus on metrics that indicate responsiveness and resource utilization. These include:
- HTTP Request Rate: The number of requests per second hitting your web server (Nginx or Apache).
- Response Time (Average & Percentiles): How long it takes for the server to respond to requests. P95 and P99 are crucial for identifying outlier latency.
- Error Rate (HTTP 5xx): The percentage of requests resulting in server-side errors.
- CPU Utilization: Overall CPU load on the web server instances.
- Memory Usage: RAM consumption, paying close attention to swap usage.
- Disk I/O: Read/write operations per second and latency, especially important for serving static assets and WordPress uploads.
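The following Python sketch shows why the percentiles matter more than the average. It computes nearest-rank percentiles over a list of response times. The sample values are hypothetical. In production, Prometheus estimates percentiles from histogram buckets rather than raw samples, but the intuition is the same.

```python
import math


def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    q * 100% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical response times in seconds: mostly fast, two slow outliers.
latencies = [0.12] * 90 + [0.15] * 8 + [2.4, 3.1]
mean = sum(latencies) / len(latencies)
p95 = percentile(latencies, 0.95)
p99 = percentile(latencies, 0.99)
print(f"mean={mean:.3f}s p95={p95}s p99={p99}s")
```

In this sample, the mean stays under 0.2 s and P95 is 0.15 s. P99 (2.4 s) is what exposes the slow tail that some users actually experience.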
For the PostgreSQL cluster, the focus shifts to database performance and health:
- Query Throughput: Transactions per second (TPS) or queries per second (QPS).
- Query Latency (Average & Percentiles): Time taken to execute queries.
- Connection Count: Number of active and idle connections. Once `max_connections` is reached, PostgreSQL refuses new connections, which makes this a common failure point.
- Replication Lag: For high-availability setups, the delay between the primary and replica(s).
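- CPU & Memory Usage: Tracked as on the web servers. On a database node, memory pressure also shrinks the OS page cache that PostgreSQL relies on alongside `shared_buffers`.
- Disk I/O: Many database workloads are I/O-bound, so watch IOPS, throughput, and latency.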
- Cache Hit Ratio: Effectiveness of PostgreSQL’s shared buffer cache.
- Lock Contention: Identifying queries waiting for locks.
- WAL (Write-Ahead Log) Activity: Rate of WAL generation and archiving.
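`blks_hit` and `blks_read` in `pg_stat_database` are cumulative counters. Because of this, a hit ratio computed from a single snapshot mostly reflects history since the last statistics reset. The minimal Python sketch below computes the ratio for just one interval. It assumes you take two snapshots some minutes apart, for example with `SELECT sum(blks_hit), sum(blks_read) FROM pg_stat_database;`. The sample numbers are hypothetical.

```python
from typing import Optional


def cache_hit_ratio(hit_before: int, read_before: int,
                    hit_after: int, read_after: int) -> Optional[float]:
    """Shared-buffer hit ratio between two snapshots of pg_stat_database.
    blks_hit = blocks found in shared_buffers; blks_read = blocks requested
    from the OS (which may still be served from the OS page cache)."""
    hits = hit_after - hit_before
    reads = read_after - read_before
    if hits < 0 or reads < 0:
        raise ValueError("counter went backwards; statistics were probably reset")
    total = hits + reads
    if total == 0:
        return None  # no block accesses in the interval, so the ratio is undefined
    return hits / total


# Hypothetical snapshots taken five minutes apart.
ratio = cache_hit_ratio(1_000_000, 50_000, 1_190_000, 60_000)
print(f"cache hit ratio: {ratio:.2%}")
```

For OLTP workloads such as WordPress, a commonly cited rule of thumb is to aim for roughly 99%. The right target depends on how large your working set is relative to memory.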
Agent-Based Monitoring with Prometheus and Node Exporter
Prometheus is a de facto standard for time-series monitoring in cloud-native environments. Its pull-based model and powerful query language (PromQL) make it ideal for collecting and analyzing metrics. We’ll deploy Node Exporter on each Linode instance to expose system-level metrics.
1. Install Node Exporter on each Linode instance:
Download the latest release from the Prometheus GitHub repository. For example, on a Debian/Ubuntu system:
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo useradd -rs /bin/false node_exporter
2. Create a systemd service for Node Exporter:
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
Save this as /etc/systemd/system/node_exporter.service and then enable and start it:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter
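Before wiring the exporter into Prometheus, it is worth confirming that it actually serves metrics. The Python sketch below fetches `/metrics` (Node Exporter listens on port 9100 by default) and parses the Prometheus text format into a dictionary. The parser is deliberately minimal. It skips comments, ignores optional timestamps, and assumes label values do not contain `}`.

```python
import urllib.request


def parse_metrics(text: str) -> dict[str, float]:
    """Map each sample line's 'name{labels}' (or bare name) to its value."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, HELP or TYPE comment
        if "}" in line:
            # Label values may contain spaces, so split at the closing brace.
            key, _, rest = line.rpartition("}")
            key += "}"
        else:
            key, _, rest = line.partition(" ")
        # rest is "value [timestamp]"; float() also accepts NaN, +Inf and -Inf.
        samples[key] = float(rest.split()[0])
    return samples


def fetch_metrics(url: str, timeout: float = 5.0) -> dict[str, float]:
    """Download and parse an exporter's /metrics page."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

For example, `fetch_metrics("http://192.0.2.10:9100/metrics").get("node_load1")` should return the one-minute load average of that host. The IP is the placeholder used in the scrape configuration.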
3. Configure Prometheus to scrape Node Exporter:
On your Prometheus server, edit the prometheus.yml configuration file. Add scrape jobs for each Linode instance running Node Exporter.
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.
scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Scrape WordPress web servers
  - job_name: 'wordpress_webservers'
    static_configs:
      - targets:
          - '192.0.2.10:9100' # Replace with your Linode IP for Web Server 1
          - '192.0.2.11:9100' # Replace with your Linode IP for Web Server 2
          # Add more web servers as needed
  # Scrape PostgreSQL primary node
  - job_name: 'postgresql_primary'
    static_configs:
      - targets:
          - '192.0.2.20:9100' # Replace with your Linode IP for PostgreSQL Primary
  # Scrape PostgreSQL replica nodes
  - job_name: 'postgresql_replicas'
    static_configs:
      - targets:
          - '192.0.2.21:9100' # Replace with your Linode IP for PostgreSQL Replica 1
          - '192.0.2.22:9100' # Replace with your Linode IP for PostgreSQL Replica 2
          # Add more replicas as needed
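Reload the Prometheus configuration. The /-/reload endpoint only works when Prometheus is started with the `--web.enable-lifecycle` flag. Otherwise, send the process a SIGHUP or restart the service: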
curl -X POST http://localhost:9090/-/reload
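If you push configuration changes from a script, a small sketch like the following can check the file before reloading. It validates with `promtool check config`, which ships in the Prometheus release tarball, and only then calls the reload endpoint. The config path is an assumption. The reload call only succeeds when Prometheus runs with `--web.enable-lifecycle`.

```python
import subprocess
import urllib.request


def config_is_valid(path: str = "/etc/prometheus/prometheus.yml") -> bool:
    """Run promtool against the config file (the default path is an assumption)."""
    result = subprocess.run(["promtool", "check", "config", path],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stdout, result.stderr)
    return result.returncode == 0


def build_reload_request(prometheus_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Equivalent of: curl -X POST http://localhost:9090/-/reload"""
    return urllib.request.Request(prometheus_url.rstrip("/") + "/-/reload", method="POST")


def reload_prometheus(prometheus_url: str = "http://localhost:9090",
                      config_path: str = "/etc/prometheus/prometheus.yml") -> int:
    """Validate first, then trigger the reload; returns the HTTP status code."""
    if not config_is_valid(config_path):
        raise RuntimeError("prometheus.yml failed promtool validation; not reloading")
    with urllib.request.urlopen(build_reload_request(prometheus_url), timeout=10) as resp:
        return resp.status
```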
PostgreSQL Metrics with `postgres_exporter`
Node Exporter provides system-level metrics. To get detailed PostgreSQL metrics, we’ll use postgres_exporter. This exporter connects to your PostgreSQL instances and exposes metrics via an HTTP endpoint, which Prometheus can then scrape.
1. Install `postgres_exporter` on your PostgreSQL nodes:
Download the latest release from its GitHub repository, which is now maintained under prometheus-community. The steps are similar to Node Exporter:
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.13.0/postgres_exporter-0.13.0.linux-amd64.tar.gz
tar xvfz postgres_exporter-0.13.0.linux-amd64.tar.gz
sudo mv postgres_exporter-0.13.0.linux-amd64/postgres_exporter /usr/local/bin/
sudo useradd -rs /bin/false postgres_exporter
2. Configure `postgres_exporter` to connect to PostgreSQL:
Create a PostgreSQL user for the exporter with minimal privileges, and avoid granting superuser privileges. On PostgreSQL 10 and later, the built-in `pg_monitor` role gives read access to `pg_stat_activity`, `pg_stat_replication`, `pg_stat_statements`, and the other statistics views the exporter reads. Plain SELECT grants on these views are not enough. Without `pg_read_all_stats` (included in `pg_monitor`), a non-superuser cannot see the details of other users' sessions.
-- Connect to your PostgreSQL database as a superuser
CREATE USER monitoring_user WITH PASSWORD 'your_strong_password';
GRANT CONNECT ON DATABASE your_database TO monitoring_user;
-- PostgreSQL 10+: read access to all statistics views and functions
GRANT pg_monitor TO monitoring_user;
-- Add other necessary grants based on the exporter's documentation and your needs
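Optionally, keep the password out of the unit file by creating a .pgpass file for the system user that runs the exporter:
# /etc/postgres-exporter/.pgpass (or ~/.pgpass of the user running the exporter)
# Format: hostname:port:database:username:password
localhost:5432:your_database:monitoring_user:your_strong_password
Ensure the file belongs to that user and has strict permissions, because PostgreSQL client libraries ignore a .pgpass that group or others can read. A file outside the user's home directory is only found if the PGPASSFILE environment variable points to it:
sudo chown postgres_exporter:postgres_exporter /etc/postgres-exporter/.pgpass
sudo chmod 0600 /etc/postgres-exporter/.pgpass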
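Provisioning scripts that generate .pgpass files often miss two details. First, a literal `:` or `\` inside a field must be escaped with a backslash. Second, the file must not be readable by group or others. The minimal Python sketch below handles both. Paths and credentials are placeholders.

```python
import os


def pgpass_escape(field: str) -> str:
    # Escape backslashes first so the escapes added for ':' are not doubled.
    return field.replace("\\", "\\\\").replace(":", "\\:")


def pgpass_line(host: str, port: int, database: str, user: str, password: str) -> str:
    """One hostname:port:database:username:password entry."""
    return ":".join(pgpass_escape(f) for f in (host, str(port), database, user, password))


def write_pgpass(path: str, lines: list[str]) -> None:
    """Write the file with mode 0600; chown it to the exporter user separately."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    os.chmod(path, 0o600)  # os.open's mode only applies when the file is newly created


entry = pgpass_line("localhost", 5432, "your_database", "monitoring_user", "p:ss\\word")
print(entry)
```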
3. Create a systemd service for `postgres_exporter`:
[Unit]
Description=PostgreSQL Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=postgres_exporter
Group=postgres_exporter
Type=simple
# Adjust DATA_SOURCE_NAME to match your PostgreSQL connection string.
# To use a .pgpass file instead, drop the password from the URI and add
# Environment=PGPASSFILE=/etc/postgres-exporter/.pgpass
Environment="DATA_SOURCE_NAME=postgresql://monitoring_user:your_strong_password@localhost:5432/your_database?sslmode=disable"
ExecStart=/usr/local/bin/postgres_exporter --web.listen-address=:9187
[Install]
WantedBy=multi-user.target
Save this as /etc/systemd/system/postgres_exporter.service. The built-in collectors provide the standard metrics out of the box. You only need /etc/postgres-exporter/queries.yaml if you want custom queries. In that case, create the file and append --extend.queries-path=/etc/postgres-exporter/queries.yaml to ExecStart. Recent releases mark this flag as deprecated, so check the documentation for your version. Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable postgres_exporter
sudo systemctl start postgres_exporter
sudo systemctl status postgres_exporter
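The URI form of `DATA_SOURCE_NAME` breaks if the password contains characters such as `@`, `:`, `/` or `#`, because those delimit URI components. The Python sketch below percent-encodes the credentials. Host, port and database default to the placeholders used above.

```python
from urllib.parse import quote


def build_dsn(user: str, password: str, host: str = "localhost", port: int = 5432,
              dbname: str = "your_database", sslmode: str = "disable") -> str:
    """postgresql:// URI for DATA_SOURCE_NAME with percent-encoded credentials."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(dbname, safe='')}?sslmode={sslmode}"
    )


print(build_dsn("monitoring_user", "p@ss:w/rd#1"))
```

Be careful if you paste an encoded value into an `Environment=` line. systemd expands `%` specifiers there, so each `%` must be written as `%%`. Putting the variable in an `EnvironmentFile=` with mode 0600 avoids that problem and also keeps the password out of the unit file.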
4. Configure Prometheus to scrape `postgres_exporter`:
Update your prometheus.yml to include scrape jobs for the PostgreSQL exporter running on port 9187 on your database nodes.
# ... (global settings) ...
scrape_configs:
  # ... (existing jobs: Prometheus, WordPress webservers, Node Exporter on PostgreSQL nodes) ...
  # Scrape PostgreSQL primary node metrics
  - job_name: 'postgresql_primary_metrics'
    static_configs:
      - targets:
          - '192.0.2.20:9187' # Replace with your Linode IP for PostgreSQL Primary
  # Scrape PostgreSQL replica nodes metrics
  - job_name: 'postgresql_replicas_metrics'
    static_configs:
      - targets:
          - '192.0.2.21:9187' # Replace with your Linode IP for PostgreSQL Replica 1
          - '192.0.2.22:9187' # Replace with your Linode IP for PostgreSQL Replica 2
          # Add more replicas as needed
Reload the Prometheus configuration again.
Application-Level Metrics with `php-fpm_exporter` and Custom WordPress Checks
Node Exporter provides system-level insight and `postgres_exporter` covers the database. We still need metrics directly from the PHP-FPM processes serving WordPress, plus custom checks for WordPress itself.
1. Exposing PHP-FPM Metrics:
PHP-FPM exposes its status via a status page. We can use `php-fpm_exporter` to scrape this and expose it in Prometheus format. First, ensure your PHP-FPM pool configuration (e.g., /etc/php/8.1/fpm/pool.d/www.conf) has the status page enabled:
pm.status_path = /status
ping.path = /ping
ping.response = pong
Restart PHP-FPM:
sudo systemctl restart php8.1-fpm
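Appending `?json` to the status path makes PHP-FPM return the same counters as JSON, which is handy for a quick manual check of pool saturation. The Python sketch below assumes you have already fetched that page, for example through a restricted web-server location that forwards `/status` to the pool. The sample document is hypothetical and trimmed to the fields used.

```python
import json


def fpm_pool_summary(status_json: str) -> dict:
    """Summarise a PHP-FPM status page requested with ?json."""
    s = json.loads(status_json)
    total = s["total processes"]
    return {
        "pool": s["pool"],
        # Fraction of workers busy right now; sustained values near 1.0 mean saturation.
        "busy_ratio": s["active processes"] / total if total else 0.0,
        # Requests currently waiting for a free worker.
        "listen_queue": s["listen queue"],
        # Cumulative count of times pm.max_children was hit since FPM started.
        "max_children_reached": s["max children reached"],
    }


sample = json.dumps({
    "pool": "www",
    "active processes": 4,
    "idle processes": 1,
    "total processes": 5,
    "listen queue": 0,
    "max children reached": 2,
})
print(fpm_pool_summary(sample))
```

A `max children reached` value that keeps increasing is usually the first sign of trouble. It means either `pm.max_children` is too low for the traffic or individual requests are too slow.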
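Install a PHP-FPM exporter such as hipages/php-fpm_exporter (the process is similar to Node Exporter/postgres_exporter) and point it at your PHP-FPM status page. With that exporter, you do this with the --phpfpm.scrape-uri flag. The exporter speaks FastCGI to the pool directly, so the status path does not have to be exposed through Nginx/Apache. Flag names differ between exporter projects, so check the --help output of the one you install.
# Example systemd service snippet for hipages/php-fpm_exporter
[Service]
User=php-fpm_exporter
Group=php-fpm_exporter
Type=simple
# Default Ubuntu socket for PHP 8.1; use tcp://127.0.0.1:9000/status if the pool listens on TCP.
# The exporter user needs access to the socket (e.g., membership in the www-data group).
ExecStart=/usr/local/bin/php-fpm_exporter server --web.listen-address=:9124 "--phpfpm.scrape-uri=unix:///run/php/php8.1-fpm.sock;/status"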
Add a scrape job for `php-fpm_exporter` (e.g., on port 9124) to your prometheus.yml.
2. Custom WordPress Health Checks:
For deeper WordPress insights, consider a simple PHP script that checks critical functionality and reports the results in the Prometheus text format. Prometheus can then scrape it like any other target, or a custom exporter can consume it.
Create a script like /var/www/html/healthcheck.php (this example assumes it sits in the WordPress root):
<?php
// Output is Prometheus text exposition format, not JSON
header('Content-Type: text/plain; version=0.0.4');
$response = ['status' => 'ok', 'checks' => []];
$startTime = microtime(true);
// Check database connection (basic)
try {
    $db = new PDO('pgsql:host=localhost;dbname=your_database', 'monitoring_user', 'your_strong_password');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->query('SELECT 1'); // Simple round-trip query
    $response['checks']['database_connection'] = 'ok';
} catch (PDOException $e) {
    $response['status'] = 'error';
    $response['checks']['database_connection'] = 'error: ' . $e->getMessage();
}
// Check WordPress core files (optional)
// ABSPATH is only defined after WordPress boots, so resolve wp-load.php
// relative to this script instead
if (file_exists(__DIR__ . '/wp-load.php')) {
    $response['checks']['wp_core_files'] = 'ok';
} else {
    $response['status'] = 'error';
    $response['checks']['wp_core_files'] = 'error: wp-load.php not found';
}
// Add more checks: external API calls, cron job status, etc.
$response['duration_seconds'] = microtime(true) - $startTime;
// Keep HTTP 200 even on failure: a non-2xx response fails the whole scrape,
// and the 0 values below would never reach Prometheus
echo "# HELP wordpress_healthcheck_status Overall health status (1=ok, 0=error)\n";
echo "# TYPE wordpress_healthcheck_status gauge\n";
echo "wordpress_healthcheck_status " . ($response['status'] === 'ok' ? '1' : '0') . "\n";
echo "# HELP wordpress_healthcheck_duration_seconds Health check duration in seconds\n";
echo "# TYPE wordpress_healthcheck_duration_seconds gauge\n";
echo "wordpress_healthcheck_duration_seconds " . $response['duration_seconds'] . "\n";
// One gauge per check; compare exactly, because an error message may itself contain "ok"
foreach ($response['checks'] as $key => $value) {
    $metric = "wordpress_healthcheck_{$key}_status";
    echo "# HELP {$metric} Status of {$key} (1=ok, 0=error)\n";
    echo "# TYPE {$metric} gauge\n";
    echo $metric . ' ' . ($value === 'ok' ? '1' : '0') . "\n";
}
// Output JSON for debugging/other consumers if needed
// echo json_encode($response);
exit(0); // Ensure script exits cleanly
?>
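Configure your web server (Nginx/Apache) to pass this script to PHP-FPM, and restrict access to it, for example by allowing only your Prometheus server's IP. Because the script already emits the Prometheus text format, the simplest option is to add it as a scrape target with `metrics_path: /healthcheck.php`. The `wordpress_healthcheck_status` metric used by the alerting rules below comes from this direct scrape. If you only need an up/down signal, `blackbox_exporter`'s HTTP probe with `fail_if_body_not_matches_regexp` can look for `wordpress_healthcheck_status 1`. Alternatively, write a small Go/Python exporter that fetches this script’s output and converts it into Prometheus metrics.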
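For the exporter route, a minimal Python sketch could look like the following. It assumes the health-check script has been changed to emit its `$response` array as JSON (the commented-out `json_encode` line) and is reachable at the hypothetical URL below. The exporter port 9125 is also an arbitrary choice. Instead of one metric per check, it uses a single gauge with a `check` label, which keeps the metric name stable as checks are added.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTH_URL = "http://127.0.0.1/healthcheck.php"  # hypothetical; must return the JSON variant
LISTEN_PORT = 9125  # hypothetical port for this exporter


def escape_label(value: str) -> str:
    # Backslash, double quote and newline must be escaped in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(health: dict) -> str:
    """Convert the health-check JSON into Prometheus text format."""
    lines = [
        "# HELP wordpress_healthcheck_status Overall health status (1=ok, 0=error)",
        "# TYPE wordpress_healthcheck_status gauge",
        f"wordpress_healthcheck_status {1 if health.get('status') == 'ok' else 0}",
        "# HELP wordpress_healthcheck_check_status Per-check status (1=ok, 0=error)",
        "# TYPE wordpress_healthcheck_check_status gauge",
    ]
    for name, result in sorted(health.get("checks", {}).items()):
        ok = 1 if result == "ok" else 0
        lines.append(f'wordpress_healthcheck_check_status{{check="{escape_label(name)}"}} {ok}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                health = json.loads(resp.read())
        except (OSError, ValueError):
            # Unreachable endpoint or malformed JSON: report unhealthy instead of failing the scrape.
            health = {"status": "error", "checks": {}}
        body = render(health).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve() -> None:
    """Blocks forever, serving /metrics on LISTEN_PORT."""
    HTTPServer(("0.0.0.0", LISTEN_PORT), MetricsHandler).serve_forever()
```

Call `serve()` (for example from a systemd unit) and add a scrape job for port 9125. `curl http://localhost:9125/metrics` should then show the rendered gauges.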
Alerting with Alertmanager
Collecting metrics is only half the battle. Alerting ensures you’re notified *before* users are impacted. Prometheus integrates with Alertmanager for sophisticated alert routing, grouping, and silencing.
1. Configure Prometheus Alerting Rules:
Define alerting rules in a separate file (e.g., alerts.yml) and include it in your prometheus.yml.
groups:
  - name: wordpress_alerts
    rules:
      - alert: HighCpuUsage
        # Average idle fraction below 0.2 means the CPUs are more than 80% busy
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} has been above 80% for the last 10 minutes."
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High Memory Usage on {{ $labels.instance }}"
          description: "Memory usage on {{ $labels.instance }} has been above 90% for the last 5 minutes."
      - alert: HighHttp5xxErrorRate
        # Assumes your web tier exports http_requests_total with a `code` label
        # (Node Exporter does not); rename to match your Nginx/Apache exporter.
        expr: sum by (instance) (rate(http_requests_total{code=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP 5xx error rate on {{ $labels.instance }}"
          description: "More than 5% of requests to {{ $labels.instance }} are resulting in 5xx errors."
      - alert: PostgreSQLHighConnectionCount
        # pg_stat_activity_count is split by datname/state, so sum it per instance before comparing
        expr: sum by (instance) (pg_stat_activity_count) > on (instance) (pg_settings_max_connections * 0.9)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High PostgreSQL connection count on {{ $labels.instance }}"
          description: "PostgreSQL on {{ $labels.instance }} has {{ $value }} connections, above 90% of max_connections."
      - alert: PostgreSQLReplicationLag
        # The lag metric is a gauge in seconds, so compare it directly (no rate())
        expr: pg_replication_lag_seconds > 60
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL replication lag detected on {{ $labels.instance }}"
          description: "PostgreSQL replica {{ $labels.instance }} is lagging by more than 60 seconds."
      - alert: WordPressHealthCheckFailed
        expr: wordpress_healthcheck_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "WordPress health check failed on {{ $labels.instance }}"
          description: "The WordPress application on {{ $labels.instance }} is reporting a health check failure."
Ensure your prometheus.yml includes:
rule_files:
  - "alerts.yml"
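The `HighCpuUsage` expression can look inverted at first, because it alerts when the average *idle* rate drops below 0.2. The Python sketch below reproduces that arithmetic for one instance from two hypothetical readings of `node_cpu_seconds_total{mode="idle"}`. It ignores the counter-reset handling that PromQL's `rate()` adds.

```python
def avg_idle_rate(idle_start: dict[str, float], idle_end: dict[str, float],
                  window_seconds: float) -> float:
    """Idle seconds gained per second, averaged across CPUs.
    Keys are CPU ids ("0", "1", ...); values are cumulative idle seconds."""
    rates = [(idle_end[cpu] - idle_start[cpu]) / window_seconds for cpu in idle_start]
    return sum(rates) / len(rates)


# Hypothetical 2-CPU host over a 5-minute window: CPU 0 idled 45 s, CPU 1 idled 15 s.
idle = avg_idle_rate({"0": 1000.0, "1": 2000.0}, {"0": 1045.0, "1": 2015.0}, 300.0)
busy = 1 - idle
fires = idle < 0.2  # same comparison as the alert expression
print(f"avg idle={idle:.2f} busy={busy:.0%} alert condition met: {fires}")
```

The `for: 10m` clause adds a second requirement. The condition must hold on every evaluation for ten minutes before the alert moves from pending to firing.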
2. Configure Alertmanager:
Set up Alertmanager to receive alerts from Prometheus and route them to your desired notification channels (Slack, PagerDuty, email, etc.). The configuration involves defining receivers and routes.
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver if no specific route matches
receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
  # Example for PagerDuty
  # - name: 'pagerduty-critical'
  #   pagerduty_configs:
  #     - service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
inhibit_rules:
  # Mute warning-level alerts while a matching critical alert is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
Configure Prometheus to send alerts to Alertmanager (typically running on port 9093):
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093' # Address of your Alertmanager instance
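You can check routing end to end without waiting for a real incident by posting a synthetic alert to Alertmanager's v2 API (`POST /api/v2/alerts`). A Python sketch follows. The alert name and labels are hypothetical. With the configuration above, a `severity: warning` alert goes to the default Slack receiver.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname: str = "SyntheticTestAlert", severity: str = "warning",
                     minutes: int = 5) -> list[dict]:
    """A single alert that resolves by itself after `minutes`."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": severity, "instance": "test"},
        "annotations": {"summary": "Synthetic alert to verify Alertmanager routing"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(alerts: list[dict], alertmanager_url: str = "http://localhost:9093") -> int:
    """POST the alerts; returns the HTTP status code (200 on success)."""
    req = urllib.request.Request(
        alertmanager_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

`send_alerts(build_test_alert())` should return 200. The Slack message should arrive after `group_wait` (30s in the configuration above).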
Visualization with Grafana
Dashboards are essential for visualizing trends, understanding historical performance, and quickly diagnosing issues. Grafana is the go-to tool for this, integrating seamlessly with Prometheus.
1. Set up Grafana:
Install Grafana on a separate server or on your Prometheus server. Add Prometheus as a data source in Grafana, pointing to your Prometheus instance’s URL (e.g., http://localhost:9090).
2. Import Pre-built Dashboards:
Many excellent pre-built dashboards are available on Grafana.com for Node Exporter, PostgreSQL Exporter, and general system monitoring. Search for “Node Exporter Full” and “PostgreSQL” dashboards. Import these via the Grafana UI (in recent versions, Dashboards -> New -> Import) by entering the dashboard ID or pasting its JSON.
3. Create Custom Dashboards:
For WordPress-specific metrics or combined views, create custom dashboards. Key panels to include:
- Web Server Overview: Request rate, error rate (5xx, 4xx), response time percentiles (P95, P99).
- PHP-FPM Performance: Active processes, idle processes, request duration, slow requests.
- PostgreSQL Cluster Health: Replication lag, active connections, query latency, cache hit ratio, CPU/Memory usage per node.
- System Resources: CPU, Memory, Disk I/O, Network traffic per instance.
- WordPress Health Check Status: A simple gauge showing the overall health status from your custom script.
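Use PromQL queries to populate these panels. For example, the P95 response time for your web servers is shown below. This assumes your web tier exports a request-duration histogram named `http_server_requests_seconds`. Node Exporter does not provide one, so substitute the metric your Nginx/Apache exporter or application instrumentation exposes: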
histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, instance))
And to show PostgreSQL replication lag (already a gauge in seconds, so no rate() is needed):
max by (instance) (pg_replication_lag_seconds)
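`histogram_quantile` does not see individual requests. It interpolates linearly inside the bucket where the requested rank falls, so its accuracy depends on your bucket boundaries. The Python sketch below mirrors that interpolation for a single series. It assumes the lowest bucket starts at 0, as Prometheus does for a positive lowest bound, and it omits some of PromQL's edge-case handling. The bucket counts are hypothetical.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (le, cumulative_count) pairs, one of which must be (math.inf, total).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    # First bucket whose cumulative count reaches the rank.
    i = next(idx for idx, (_, cum) in enumerate(buckets) if cum >= rank)
    upper, cum = buckets[i]
    if upper == math.inf:
        # Rank falls in the +Inf bucket: return the highest finite bound.
        return buckets[-2][0]
    lower, prev_cum = (0.0, 0.0) if i == 0 else buckets[i - 1]
    if cum == prev_cum:
        return upper  # empty first bucket, only possible when q == 0
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - prev_cum) / (cum - prev_cum)


# Hypothetical request-duration buckets (seconds) accumulated over 5 minutes.
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

With these buckets, the P95 lands exactly on the 0.5 s boundary. Any quantile whose rank falls past the 1.0 s bucket is reported as 1.0 s, however slow the tail really is. This is why bucket boundaries should bracket your latency targets.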
Advanced Considerations and Best Practices
1. High Availability for Monitoring: Ensure your Prometheus and Alertmanager instances are themselves highly available. A common pattern is to run two identical Prometheus servers scraping the same targets, both sending alerts to a cluster of Alertmanager replicas that deduplicate notifications. Add Thanos or Cortex for long-term storage and a global view.
2. Service Discovery: For dynamic environments, use Prometheus’s service discovery mechanisms instead of static configurations. For example, `linode_sd_configs` queries the Linode API directly, and `kubernetes_sd_configs` discovers targets in Kubernetes.
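Your inventory may live somewhere Prometheus cannot query directly. In that case, `file_sd_configs` is a middle ground. Prometheus watches JSON (or YAML) files listing target groups and picks up changes without a reload. The Python sketch below writes such a file from a hypothetical role-to-IP mapping, using the same placeholder IPs as the scrape configuration.

```python
import json
import os
import tempfile


def build_target_groups(inventory: dict[str, list[str]], port: int) -> list[dict]:
    """inventory maps a hypothetical role name to host IPs; each role becomes
    one target group carrying a `role` label."""
    return [
        {"targets": [f"{ip}:{port}" for ip in sorted(ips)], "labels": {"role": role}}
        for role, ips in sorted(inventory.items())
    ]


def write_targets_atomically(path: str, groups: list[dict]) -> None:
    """Write to a temp file in the same directory, then rename it over the
    target, so Prometheus never reads a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(groups, fh, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; Prometheus must be able to read it
    os.replace(tmp_path, path)


inventory = {
    "wordpress_web": ["192.0.2.10", "192.0.2.11"],
    "postgresql_primary": ["192.0.2.20"],
    "postgresql_replica": ["192.0.2.21", "192.0.2.22"],
}
groups = build_target_groups(inventory, 9100)
print(json.dumps(groups, indent=2))
```

A job would then reference the file with a `file_sd_configs` entry whose `files` list matches something like `/etc/prometheus/targets/*.json` (a hypothetical path). The `.tmp` suffix keeps the half-written file out of that glob.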
3. Log Aggregation: Complement metrics with centralized logging (e.g., ELK stack, Loki). Metrics tell you *what* is happening, logs tell you *why*.
4. Performance Tuning: Use the collected metrics to identify bottlenecks. For PostgreSQL, this might involve tuning shared_buffers, work_mem, or analyzing slow queries with `pg_stat_statements`. For WordPress, optimize PHP configuration, caching plugins, and consider a CDN.
5. Security: Secure your monitoring endpoints. Use firewalls to restrict access to Prometheus, Alertmanager, and exporters. Consider authentication for sensitive metrics.
6. Regular Review: Periodically review your alerts and dashboards. Are they still relevant? Are there too many false positives? Adjust thresholds and rules as your application evolves.