Server Monitoring Best Practices: Keeping Your WordPress App and Elasticsearch Clusters Alive on OVH
Core WordPress Application Monitoring on OVH
Maintaining a robust WordPress deployment on OVH requires a multi-layered monitoring strategy. Beyond basic uptime checks, we need to delve into application-level metrics, resource utilization, and potential bottlenecks. This section focuses on essential checks for the WordPress application itself.
PHP-FPM Performance Tuning and Monitoring
PHP-FPM is the workhorse for WordPress. Monitoring its performance is critical. We’ll focus on key metrics like active processes, idle processes, and request performance. A common approach is to expose these metrics via a status page and scrape them with a monitoring agent.
First, ensure your PHP-FPM configuration (typically php-fpm.conf or files within php-fpm.d/) enables the status page. Locate the pool configuration file (e.g., www.conf) and add or modify the following:
; /etc/php/8.1/fpm/pool.d/www.conf (example path)
pm = dynamic
pm.max_children = 100
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 20
pm.process_idle_timeout = 10s
; Enable status page
pm.status_path = /fpm-status
; PHP-FPM does not restrict access to the status path by itself. Limit it to your
; monitoring server's IP or a trusted internal network in the web server that
; forwards the request (for example with allow/deny rules in Nginx).
; access.log = /var/log/php-fpm/www.access.log
; request_slowlog_timeout = 10s
; slowlog = /var/log/php-fpm/www-slow.log
After restarting PHP-FPM (e.g., sudo systemctl restart php8.1-fpm), you can access the status page via a web browser or curl, provided your web server forwards requests for /fpm-status to the PHP-FPM pool. The plain-text output looks like this:
pool: www
process manager: dynamic
process id: 12345
start time: 01/Jan/2023:10:00:00 +0000
start since: 01/Jan/2023
accepted conn: 10000
listen queue: 0
max listen queue: 0
listen queue len: 0
idle processes: 5
active processes: 15
total processes: 20
max active processes: 18
max children reached: 2
slow requests: 5
To effectively monitor this, we can use tools like Prometheus with the php-fpm_exporter or a custom script that parses this output and sends it to your chosen monitoring backend (e.g., Datadog, Nagios, Zabbix). For Prometheus, the exporter typically scrapes the /fpm-status endpoint directly.
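For the custom-script route, the status output can be parsed as simple "key: value" lines. The following is a minimal sketch, not a replacement for an exporter: the function names parse_fpm_status and pool_utilization are hypothetical, and the sample text is a shortened copy of the output shown above.

```python
def parse_fpm_status(text: str) -> dict:
    """Parse the plain-text PHP-FPM status page into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in values survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        value = value.strip()
        # Purely numeric fields become ints; everything else stays a string.
        status[key.strip()] = int(value) if value.isdigit() else value
    return status


def pool_utilization(status: dict) -> float:
    """Fraction of currently spawned workers that are busy."""
    total = status["total processes"]
    return status["active processes"] / total if total else 0.0


sample = """pool: www
process manager: dynamic
start time: 01/Jan/2023:10:00:00 +0000
listen queue: 0
idle processes: 5
active processes: 15
total processes: 20
max children reached: 2
"""

status = parse_fpm_status(sample)
print(f"utilization={pool_utilization(status):.0%}")
print(f"max children reached={status['max children reached']}")
```

A non-zero "max children reached" or a growing "listen queue" is usually a stronger saturation signal than utilization alone, so a script like this would typically forward those fields as well.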
Nginx Performance and Error Monitoring
Nginx acts as the front-end for our WordPress application. Monitoring its access logs, error logs, and active connections is crucial. We’ll configure Nginx to expose basic performance metrics and ensure error logging is robust.
Enable the Nginx status module (ngx_http_stub_status_module) by adding the following to your Nginx configuration (e.g., within a server block or a dedicated location block):
# /etc/nginx/sites-available/your-wordpress-site.conf
server {
    listen 80;
    server_name your-domain.com;
    # ... other configurations ...
    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1; # Allow localhost
        allow 192.168.1.0/24; # Allow internal network
        deny all;
    }
    # ... other configurations ...
}
After reloading Nginx (sudo systemctl reload nginx), accessing /nginx_status will yield:
Active connections: 123
server accepts handled requests
 16661 16661 123456
Reading: 1 Writing: 2 Waiting: 120
These metrics (active connections, total accepted connections, handled requests, reading, writing, waiting) are invaluable for understanding traffic load and potential Nginx worker process saturation. Tools like Prometheus with nginx-exporter can scrape this endpoint.
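If you are not using an exporter, the stub_status text is regular enough to parse with one expression. This is a sketch under that assumption; parse_stub_status and dropped_connections are hypothetical helper names, and the sample mirrors the output shown above.

```python
import re

# Matches the four-line stub_status layout, tolerating any whitespace.
STUB_STATUS_RE = re.compile(
    r"Active connections:\s*(?P<active>\d+)\s+"
    r"server accepts handled requests\s+"
    r"(?P<accepts>\d+)\s+(?P<handled>\d+)\s+(?P<requests>\d+)\s+"
    r"Reading:\s*(?P<reading>\d+)\s+"
    r"Writing:\s*(?P<writing>\d+)\s+"
    r"Waiting:\s*(?P<waiting>\d+)"
)


def parse_stub_status(text: str) -> dict:
    """Return the stub_status counters as a dict of ints."""
    match = STUB_STATUS_RE.search(text)
    if match is None:
        raise ValueError("unrecognized stub_status output")
    return {name: int(value) for name, value in match.groupdict().items()}


def dropped_connections(metrics: dict) -> int:
    """Connections accepted but never handled (e.g. a resource limit was hit)."""
    return metrics["accepts"] - metrics["handled"]


sample = """Active connections: 123
server accepts handled requests
 16661 16661 123456
Reading: 1 Writing: 2 Waiting: 120
"""

metrics = parse_stub_status(sample)
print(metrics)
print("dropped:", dropped_connections(metrics))
```

The accepts, handled, and requests values are cumulative since Nginx started, so a dashboard would normally plot their change between scrapes rather than the raw numbers.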
Furthermore, ensure your Nginx error logs are configured to capture critical events. A common setup:
# /etc/nginx/nginx.conf
http {
    # ... other http configurations ...
    error_log /var/log/nginx/error.log warn; # Or info, error, crit
    access_log /var/log/nginx/access.log;
    # ... other http configurations ...
}
Log analysis tools (e.g., ELK stack, Splunk, Graylog) should be configured to ingest both logs and to alert on specific HTTP status codes in the access log (e.g., 5xx errors, 403 Forbidden, unexpected 404 Not Found) as well as on error-level entries in the error log.
Database Monitoring (MySQL/MariaDB)
WordPress relies heavily on its database. Monitoring MySQL/MariaDB is non-negotiable. Key areas include query performance, connection counts, buffer pool usage, and disk I/O.
We can leverage the mysqld_exporter for Prometheus to gather a comprehensive set of metrics. Essential MySQL variables to monitor include:
- Threads_connected/max_connections: To detect connection exhaustion.
- Slow_queries: To identify inefficient queries.
- Innodb_buffer_pool_wait_free: Indicates buffer pool contention.
- Innodb_row_lock_waits: Highlights locking issues.
- Created_tmp_disk_tables/Created_tmp_tables: To understand temporary table usage.
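Two of these entries are ratios rather than single values. As a rough sketch, assuming the status and variable values have already been fetched into dictionaries (the function names and the sample numbers below are illustrative, not taken from a real server), they can be computed like this:

```python
def connection_utilization(status: dict, variables: dict) -> float:
    """Threads_connected as a fraction of max_connections."""
    return int(status["Threads_connected"]) / int(variables["max_connections"])


def tmp_disk_table_ratio(status: dict) -> float:
    """Share of temporary tables that spilled to disk."""
    total = int(status["Created_tmp_tables"])
    return int(status["Created_tmp_disk_tables"]) / total if total else 0.0


# MySQL returns status values as strings, hence the int() conversions above.
status = {
    "Threads_connected": "120",
    "Created_tmp_tables": "4000",
    "Created_tmp_disk_tables": "1000",
}
variables = {"max_connections": "150"}

print(f"connections: {connection_utilization(status, variables):.0%}")
print(f"tmp tables on disk: {tmp_disk_table_ratio(status):.0%}")
```

Which ratio counts as unhealthy depends on the workload; the point of the sketch is only that both inputs must be sampled together for the ratio to be meaningful.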
A basic monitoring check for slow queries can be implemented directly via SQL:
-- Cumulative count of slow queries since the server started.
-- Sample it periodically and compare readings to get a per-hour figure.
SELECT
variable_value
FROM
performance_schema.global_status
WHERE
variable_name = 'Slow_queries';
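Because Slow_queries is a cumulative counter, a useful check compares two readings taken some time apart. The helper below is a minimal sketch of that calculation; counter_rate is a hypothetical name, and treating a decreasing value as a server restart is an assumption about how you want resets handled.

```python
def counter_rate(prev_value: int, prev_ts: float,
                 curr_value: int, curr_ts: float) -> float:
    """Per-second rate of a cumulative counter between two samples."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards: assume a restart reset it to zero,
        # so everything counted so far happened since the restart.
        delta = curr_value
    return delta / elapsed


# Two readings of Slow_queries taken one hour (3600 s) apart.
rate = counter_rate(prev_value=1200, prev_ts=0.0, curr_value=1230, curr_ts=3600.0)
print(f"slow queries per hour: {rate * 3600:.1f}")
```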
For more detailed query analysis, enabling the slow query log in MySQL is essential. Configure it in my.cnf or my.ini:
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 2 # Log queries longer than 2 seconds
log_queries_not_using_indexes = 1 # Optional: Log queries that don't use indexes
Log analysis tools should parse this file for problematic queries. Tools like pt-query-digest from Percona Toolkit are invaluable for summarizing slow query logs.
Elasticsearch Cluster Monitoring
For advanced WordPress sites, especially those using Elasticsearch for search, robust cluster monitoring is paramount. We need to track node health, cluster status, indexing performance, and query latency.
Node and Cluster Health
The Elasticsearch Cluster Health API (GET _cluster/health) provides a high-level overview. Key fields to monitor:
- status: Should be green. yellow means every primary shard is assigned but one or more replica shards are not (data is fully available, with reduced redundancy), and red means at least one primary shard is unassigned (part of the data is unavailable until that shard recovers).
- number_of_nodes: Ensure all expected nodes are present.
- number_of_data_nodes: Crucial for data availability.
- active_shards, relocating_shards, initializing_shards, unassigned_shards: Provide insight into cluster activity and potential issues.
{
"cluster_name" : "my-es-cluster",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 10,
"active_shards" : 20,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue" : 0,
"active_shards_percent_as_number" : 100.0
}
The Elasticsearch Node Stats API (GET _nodes/stats) provides detailed metrics per node, including JVM heap usage, CPU utilization, disk I/O, and network traffic. Monitoring JVM heap usage is critical to prevent OutOfMemory errors. Aim to keep heap usage below 75-80%.
// Example snippet from GET _nodes/stats
{
  "nodes": {
    "node_id_1": {
      "jvm": {
        "mem": {
          "heap_used_in_bytes": 1073741824,
          "heap_max_in_bytes": 1572864000,
          // ... other heap stats
        },
        // ... other JVM stats
      },
      "os": {
        "cpu": {
          "load_average": {
            "1m": 2.5,
            "5m": 2.1,
            "15m": 1.9
          }
        },
        // ... other OS stats
      },
      "fs": {
        "data": [
          {
            "path": "/var/lib/elasticsearch",
            "total_in_bytes": 500000000000,
            "free_in_bytes": 100000000000,
            "available_in_bytes": 90000000000
          }
        ]
      },
      // ... other node stats
    }
    // ... other nodes
  }
}
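To turn a response like this into numbers you can alert on, the per-node entry can be reduced to a few percentages. The sketch below assumes the response has already been decoded into a Python dict with the fields shown in the snippet; summarize_node and the output key names are hypothetical.

```python
def summarize_node(node: dict) -> dict:
    """Reduce one node's stats to heap, disk, and load figures."""
    mem = node["jvm"]["mem"]
    heap_used_pct = 100.0 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
    # Report the tightest data path, since that one fills up first.
    disk_avail_pct = min(
        100.0 * disk["available_in_bytes"] / disk["total_in_bytes"]
        for disk in node["fs"]["data"]
    )
    return {
        "heap_used_pct": heap_used_pct,
        "disk_avail_pct": disk_avail_pct,
        "load_1m": node["os"]["cpu"]["load_average"]["1m"],
    }


# Same values as the example snippet above.
node = {
    "jvm": {"mem": {"heap_used_in_bytes": 1073741824,
                    "heap_max_in_bytes": 1572864000}},
    "os": {"cpu": {"load_average": {"1m": 2.5, "5m": 2.1, "15m": 1.9}}},
    "fs": {"data": [{"path": "/var/lib/elasticsearch",
                     "total_in_bytes": 500000000000,
                     "free_in_bytes": 100000000000,
                     "available_in_bytes": 90000000000}]},
}

summary = summarize_node(node)
print({key: round(value, 1) for key, value in summary.items()})
```

With the example values the heap sits at roughly 68% and the data path has 18% of its capacity available, so heap usage is below the 75-80% guideline.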
Tools like the official Elasticsearch Exporter for Prometheus or commercial APM solutions can collect and visualize these metrics. Set up alerts for:
- Cluster status becoming yellow or red.
- unassigned_shards count greater than 0.
- JVM heap usage consistently above 80%.
- Low disk space on data nodes.
- High CPU load averages.
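The alert list above can be expressed as a small rule function. This is only a sketch of the logic, not a substitute for your alerting tool: evaluate_alerts and the per-node keys are hypothetical, the disk and load thresholds are assumed values to tune, and a single-sample check like this does not capture "consistently" (that needs the condition to hold across several samples).

```python
HEAP_PCT_MAX = 80.0        # threshold taken from the list above
DISK_AVAIL_PCT_MIN = 15.0  # assumption: tune to your disk watermarks
LOAD_1M_MAX = 4.0          # assumption: roughly the node's core count


def evaluate_alerts(health: dict, nodes: dict) -> list:
    """Return human-readable alerts for one health/node-stats sample."""
    alerts = []
    if health["status"] in ("yellow", "red"):
        alerts.append(f"cluster status is {health['status']}")
    if health["unassigned_shards"] > 0:
        alerts.append(f"{health['unassigned_shards']} unassigned shards")
    for name, node in nodes.items():
        if node["heap_used_pct"] > HEAP_PCT_MAX:
            alerts.append(f"{name}: heap at {node['heap_used_pct']:.0f}%")
        if node["disk_avail_pct"] < DISK_AVAIL_PCT_MIN:
            alerts.append(f"{name}: only {node['disk_avail_pct']:.0f}% disk available")
        if node["load_1m"] > LOAD_1M_MAX:
            alerts.append(f"{name}: 1m load average {node['load_1m']}")
    return alerts


# Illustrative sample: one healthy node, one under pressure.
health = {"status": "yellow", "unassigned_shards": 2}
nodes = {
    "node-1": {"heap_used_pct": 68.3, "disk_avail_pct": 18.0, "load_1m": 2.5},
    "node-2": {"heap_used_pct": 86.0, "disk_avail_pct": 9.5, "load_1m": 1.2},
}

alerts = evaluate_alerts(health, nodes)
for alert in alerts:
    print(alert)
```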
Indexing and Search Performance
For a responsive WordPress search, indexing and query performance are key. Both are exposed by the index stats API, filtered by metric: GET _stats/indexing,segments for indexing activity and GET _stats/search for search activity. The same counters are available per node under indices in GET _nodes/stats.
Key metrics from GET _stats/indexing,segments:
- indexing.index_total: Total documents indexed.
- indexing.index_time_in_millis: Cumulative time spent indexing.
- indexing.throttle_time_in_millis: Cumulative time indexing was throttled due to resource constraints.
- segments.count: Number of segments. High segment counts can impact search performance.
Key metrics from GET _stats/search:
- search.query_total: Total query-phase operations.
- search.query_time_in_millis: Cumulative time spent in the query phase.
- search.fetch_total, search.fetch_time_in_millis: Count of, and cumulative time spent in, fetch-phase operations.
These values are cumulative counters rather than ready-made averages, so average search latency over an interval is the change in query_time_in_millis divided by the change in query_total. Alerting on a rising average query latency, or on throttle_time_in_millis growing during indexing, helps identify performance degradation early. Regularly review index settings, especially refresh intervals and merge policies, based on these metrics.
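One way to derive an average latency from the cumulative counters is to difference two samples of the stats. The function below is a sketch under that assumption; average_latency_ms is a hypothetical name, and the dictionaries stand in for the "search" (or "indexing") section of two successive stats responses.

```python
from typing import Optional


def average_latency_ms(prev: dict, curr: dict, kind: str = "query") -> Optional[float]:
    """Average milliseconds per operation between two counter samples.

    kind selects the counter pair, e.g. "query" or "fetch" for search stats,
    or "index" for indexing stats (index_total / index_time_in_millis).
    """
    ops = curr[f"{kind}_total"] - prev[f"{kind}_total"]
    millis = curr[f"{kind}_time_in_millis"] - prev[f"{kind}_time_in_millis"]
    if ops <= 0:
        return None  # no operations in the interval, or counters were reset
    return millis / ops


# Illustrative samples taken one scrape interval apart.
prev = {"query_total": 1000, "query_time_in_millis": 50000}
curr = {"query_total": 1500, "query_time_in_millis": 90000}

print(average_latency_ms(prev, curr))
```

Returning None for an idle interval keeps a quiet period from being reported as zero latency, which would otherwise drag down any moving average built on top of it.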
Log Aggregation and Alerting Strategy
A centralized logging system is indispensable for diagnosing issues across your WordPress application and Elasticsearch cluster. We’ll use a collection agent (Fluentd, Logstash, or Filebeat) to ship logs, Elasticsearch for storage, and Kibana for visualization and alerting.
Log Collection Agents
On your WordPress servers (web servers, PHP-FPM), configure Fluentd or Filebeat to tail relevant log files:
# Example Fluentd configuration for WordPress logs
# /etc/fluentd/conf.d/wordpress.conf
<source>
  @type tail
  path /var/log/nginx/access.log
  pos_file /var/log/fluentd/nginx-access.pos
  tag nginx.access
  <parse>
    @type nginx
  </parse>
</source>
<source>
  @type tail
  path /var/log/nginx/error.log
  pos_file /var/log/fluentd/nginx-error.pos
  tag nginx.error
  <parse>
    @type regexp
    # Basic regexp for Nginx errors, adjust as needed
    expression /^(?<time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[(?<log_level>\w+)\] (?<pid>\d+).(?<tid>\d+): (?<message>.*)$/
    time_format %Y/%m/%d %H:%M:%S
  </parse>
</source>
For Elasticsearch nodes, use Filebeat or Fluentd to collect Elasticsearch logs (gc.log, elasticsearch.log) and metrics. Ensure your Elasticsearch cluster is configured to expose metrics for scraping by your monitoring system.
Kibana for Visualization and Alerting
Once logs are aggregated in Elasticsearch, Kibana provides powerful tools for analysis and alerting. Create dashboards to visualize key metrics:
- WordPress HTTP error rates (4xx, 5xx).
- PHP-FPM error counts and types.
- Database slow query indicators.
- Elasticsearch cluster health status and node resource utilization.
- Elasticsearch indexing and search latency.
Kibana’s Alerting feature can be configured to trigger notifications (email, Slack, PagerDuty) based on thresholds or patterns in your logs and metrics. For example, an alert can be set for:
- More than 5 Nginx 5xx errors in 5 minutes.
- PHP-FPM error log entries containing “Fatal error”.
- Elasticsearch cluster status changing from green to yellow or red.
- Average search latency exceeding 500ms for 10 minutes.
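Threshold rules such as "more than 5 errors in 5 minutes" boil down to counting events in a sliding time window. Kibana evaluates such rules for you; the sketch below only illustrates the underlying logic, and the WindowedCounter class and its method names are hypothetical.

```python
from collections import deque


class WindowedCounter:
    """Counts events inside a sliding time window."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()

    def record(self, timestamp: float) -> bool:
        """Record one event; return True when the window count exceeds the threshold."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.threshold


# "More than 5 Nginx 5xx errors in 5 minutes": window of 300 s, threshold 5.
counter = WindowedCounter(window_seconds=300, threshold=5)
results = [counter.record(t) for t in (0, 10, 20, 30, 40, 50)]
print(results)
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the rule easy to replay against historical log data.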
This comprehensive monitoring strategy, combining application-specific metrics, infrastructure health, and centralized logging, is crucial for maintaining a stable and performant WordPress deployment on OVH, especially when integrated with complex systems like Elasticsearch.