Server Monitoring Best Practices: Keeping Your WordPress App and Elasticsearch Clusters Alive on OVH

Core WordPress Application Monitoring on OVH

Maintaining a robust WordPress deployment on OVH requires a multi-layered monitoring strategy. Beyond basic uptime checks, we need application-level metrics, resource utilization figures, and visibility into potential bottlenecks. This section focuses on essential checks for the WordPress application itself.

PHP-FPM Performance Tuning and Monitoring

PHP-FPM is the workhorse for WordPress. Monitoring its performance is critical. We’ll focus on key metrics like active processes, idle processes, and request performance. A common approach is to expose these metrics via a status page and scrape them with a monitoring agent.

First, ensure your PHP-FPM configuration (typically php-fpm.conf or files within php-fpm.d/) enables the status page. Locate the pool configuration file (e.g., www.conf) and add or modify the following:

; /etc/php/8.1/fpm/pool.d/www.conf (example path)
pm = dynamic
pm.max_children = 100
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 20
; Only takes effect when pm = ondemand; ignored with pm = dynamic
pm.process_idle_timeout = 10s
; Enable status page
pm.status_path = /fpm-status
; Access to the status page is controlled by the web server that passes
; /fpm-status to this pool, not by this file. Restrict it there to your
; monitoring server's IP or a trusted internal network.
; access.log = /var/log/php-fpm/www.access.log
; request_slowlog_timeout = 10s
; slowlog = /var/log/php-fpm/www-slow.log

After restarting PHP-FPM (e.g., sudo systemctl restart php8.1-fpm) and configuring your web server to pass /fpm-status to the PHP-FPM pool, you can access the status page via a web browser or curl. The output will be in a format like this:

pool: www
process manager: dynamic
start time: 01/Jan/2023:10:00:00 +0000
start since: 86400
accepted conn: 10000
listen queue: 0
max listen queue: 0
listen queue len: 0
idle processes: 5
active processes: 15
total processes: 20
max active processes: 18
max children reached: 2
slow requests: 5

To effectively monitor this, we can use tools like Prometheus with the php-fpm_exporter or a custom script that parses this output and sends it to your chosen monitoring backend (e.g., Datadog, Nagios, Zabbix). For Prometheus, the exporter typically scrapes the /fpm-status endpoint directly.
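
As a rough illustration of the custom-script route, the sketch below parses the plain-text status output into a dictionary and derives a pool saturation figure. The function names, the sample text, and the max_children value of 100 (taken from the example pool configuration) are illustrative assumptions, not part of any exporter.

```python
def parse_fpm_status(text: str) -> dict:
    """Parse the plain-text PHP-FPM status page into a dict.

    Purely numeric values become ints; everything else stays a string.
    """
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        value = value.strip()
        status[key.strip()] = int(value) if value.isdigit() else value
    return status


def pool_saturation(status: dict, max_children: int) -> float:
    """Fraction of pm.max_children that is currently busy (0.0 to 1.0)."""
    return status["active processes"] / max_children


# Hypothetical sample, shortened from the status output shown above
FPM_SAMPLE = """\
pool:                 www
process manager:      dynamic
accepted conn:        10000
listen queue:         0
idle processes:       5
active processes:     15
total processes:      20
max children reached: 2
slow requests:        5
"""

fpm_status = parse_fpm_status(FPM_SAMPLE)
print(fpm_status["active processes"], pool_saturation(fpm_status, max_children=100))
```

A saturation that stays close to 1.0, or a max children reached counter that keeps growing, suggests pm.max_children is too low for the traffic.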

Nginx Performance and Error Monitoring

Nginx acts as the front-end for our WordPress application. Monitoring its access logs, error logs, and active connections is crucial. We’ll configure Nginx to expose basic performance metrics and ensure error logging is robust.

Enable the Nginx status module (ngx_http_stub_status_module) by adding the following to your Nginx configuration (e.g., within a server block or a dedicated location block):

# /etc/nginx/sites-available/your-wordpress-site.conf
server {
    listen 80;
    server_name your-domain.com;

    # ... other configurations ...

    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1; # Allow localhost
        allow 192.168.1.0/24; # Allow internal network
        deny all;
    }

    # ... other configurations ...
}

After reloading Nginx (sudo systemctl reload nginx), accessing /nginx_status will yield:

Active connections: 123
server accepts handled requests
 16661 16661 123456
Reading: 1 Writing: 2 Waiting: 120

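These metrics (active connections; accepted connections, handled connections, and total requests; and the reading, writing, and waiting counts) are invaluable for understanding traffic load and potential Nginx worker process saturation. A gap between accepts and handled means connections were dropped, typically because a resource limit such as worker_connections was reached. Tools like Prometheus with nginx-prometheus-exporter can scrape this endpoint.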
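
If you are not using an exporter, the stub_status text is simple enough to parse directly. The following is a minimal sketch; the function name and the derived dropped field are assumptions made for illustration.

```python
import re

# Matches the four-line stub_status output, tolerant of extra whitespace
STUB_STATUS_RE = re.compile(
    r"Active connections:\s*(?P<active>\d+)\s+"
    r"server accepts handled requests\s+"
    r"(?P<accepts>\d+)\s+(?P<handled>\d+)\s+(?P<requests>\d+)\s+"
    r"Reading:\s*(?P<reading>\d+)\s+"
    r"Writing:\s*(?P<writing>\d+)\s+"
    r"Waiting:\s*(?P<waiting>\d+)"
)


def parse_stub_status(text: str) -> dict:
    """Turn stub_status output into a dict of integers."""
    match = STUB_STATUS_RE.search(text)
    if match is None:
        raise ValueError("unrecognized stub_status output")
    stats = {name: int(value) for name, value in match.groupdict().items()}
    # Connections that were accepted but never handled were dropped
    stats["dropped"] = stats["accepts"] - stats["handled"]
    return stats


NGINX_SAMPLE = """Active connections: 123
server accepts handled requests
 16661 16661 123456
Reading: 1 Writing: 2 Waiting: 120
"""

print(parse_stub_status(NGINX_SAMPLE))
```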

Furthermore, ensure your Nginx error logs are configured to capture critical events. A common setup:

# /etc/nginx/nginx.conf
http {
    # ... other http configurations ...

    error_log /var/log/nginx/error.log warn; # Or info, error, crit
    access_log /var/log/nginx/access.log;

    # ... other http configurations ...
}

Log analysis tools (e.g., ELK stack, Splunk, Graylog) should ingest both logs. Alert on error-log entries at the error level or above, and on HTTP status codes in the access log (e.g., 5xx responses, or unexpected spikes in 403 Forbidden and 404 Not Found).

Database Monitoring (MySQL/MariaDB)

WordPress relies heavily on its database. Monitoring MySQL/MariaDB is non-negotiable. Key areas include query performance, connection counts, buffer pool usage, and disk I/O.

We can leverage the mysqld_exporter for Prometheus to gather a comprehensive set of metrics. Essential MySQL status variables to monitor include:

Threads_connected / max_connections: To detect connection exhaustion.
Slow_queries: To identify inefficient queries.
Innodb_buffer_pool_wait_free: Indicates buffer pool contention.
Innodb_row_lock_waits: Highlights locking issues.
Created_tmp_disk_tables / Created_tmp_tables: To understand temporary table usage.
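
Two of the pairs in this list are most useful as ratios. The sketch below assumes the values have already been read from SHOW GLOBAL STATUS and SHOW GLOBAL VARIABLES into dictionaries; the function name and the sample numbers are hypothetical.

```python
def mysql_ratios(status: dict, variables: dict) -> dict:
    """Derive connection usage and on-disk temp table ratio.

    MySQL returns these values as strings, so they are converted first.
    """
    threads = int(status["Threads_connected"])
    max_conn = int(variables["max_connections"])
    tmp_total = int(status["Created_tmp_tables"])
    tmp_disk = int(status["Created_tmp_disk_tables"])
    return {
        # Share of the connection limit currently in use
        "connection_usage": threads / max_conn,
        # Share of internal temporary tables that spilled to disk
        "tmp_disk_ratio": tmp_disk / tmp_total if tmp_total else 0.0,
    }


# Hypothetical sample values
sample_status = {
    "Threads_connected": "45",
    "Created_tmp_tables": "2000",
    "Created_tmp_disk_tables": "500",
}
sample_variables = {"max_connections": "150"}

print(mysql_ratios(sample_status, sample_variables))
```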

A basic monitoring check for slow queries can be implemented directly via SQL:

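-- Slow_queries is a cumulative counter since server start. Sample it
-- periodically and compare values to get the count for a window (e.g., the last hour).
-- On MariaDB, query information_schema.GLOBAL_STATUS instead.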
SELECT
    variable_value
FROM
    performance_schema.global_status
WHERE
    variable_name = 'Slow_queries';
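
Because Slow_queries only ever increases while the server is running, a check usually compares two samples rather than reading the raw value. A minimal sketch of that calculation follows; the function name and sample numbers are hypothetical.

```python
def counter_rate(prev: int, curr: int, interval_s: float) -> float:
    """Per-second rate of a cumulative counter between two samples.

    A current value lower than the previous one means the server
    restarted, so the current value is treated as the whole increase.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr - prev if curr >= prev else curr
    return delta / interval_s


# 30 new slow queries over a 60-second sampling interval
print(counter_rate(prev=1200, curr=1230, interval_s=60))
```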

For more detailed query analysis, enabling the slow query log in MySQL is essential. Configure it in my.cnf or my.ini:

[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
# Log queries longer than 2 seconds
long_query_time = 2
# Optional: log queries that don't use indexes
log_queries_not_using_indexes = 1

Log analysis tools should parse this file for problematic queries. Tools like pt-query-digest from Percona Toolkit are invaluable for summarizing slow query logs.

Elasticsearch Cluster Monitoring

For advanced WordPress sites, especially those using Elasticsearch for search, robust cluster monitoring is paramount. We need to track node health, cluster status, indexing performance, and query latency.

Node and Cluster Health

The Elasticsearch Cluster Health API (GET _cluster/health) provides a high-level overview. Key fields to monitor:

status: Should be green. yellow means all primary shards are assigned but one or more replica shards are not, so redundancy is reduced. red means at least one primary shard is unassigned, so some data is unavailable.
number_of_nodes: Ensure all expected nodes are present.
number_of_data_nodes: Crucial for data availability.
active_shards, relocating_shards, initializing_shards, unassigned_shards: Provide insight into cluster activity and potential issues.

{
  "cluster_name" : "my-es-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 10,
  "active_shards" : 20,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue" : 0,
  "active_shards_percent_as_number" : 100.0
}
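
A health check built on this response can be quite small. The sketch below evaluates an already-fetched response; the function name, the message wording, and the expected_nodes parameter are assumptions for illustration.

```python
import json


def health_problems(health: dict, expected_nodes: int) -> list:
    """Return human-readable problems found in a _cluster/health response."""
    problems = []
    if health["status"] != "green":
        problems.append(f"cluster status is {health['status']}")
    if health["number_of_nodes"] < expected_nodes:
        missing = expected_nodes - health["number_of_nodes"]
        problems.append(f"{missing} node(s) missing")
    if health["unassigned_shards"] > 0:
        problems.append(f"{health['unassigned_shards']} unassigned shard(s)")
    if health["timed_out"]:
        problems.append("health request timed out")
    return problems


# Trimmed copy of the example response above
HEALTH_JSON = """
{"cluster_name": "my-es-cluster", "status": "green", "timed_out": false,
 "number_of_nodes": 3, "number_of_data_nodes": 3, "unassigned_shards": 0}
"""

print(health_problems(json.loads(HEALTH_JSON), expected_nodes=3))
```

An empty list means the response passed every check.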

The Elasticsearch Node Stats API (GET _nodes/stats) provides detailed metrics per node, including JVM heap usage, CPU utilization, disk I/O, and network traffic. Monitoring JVM heap usage is critical to prevent OutOfMemory errors. Aim to keep heap usage below 75-80%.

// Example snippet from GET _nodes/stats
{
  "nodes": {
    "node_id_1": {
      "jvm": {
        "mem": {
          "heap_used_in_bytes": 1073741824,
          "heap_max_in_bytes": 1572864000,
          // ... other heap stats
        },
        // ... other JVM stats
      },
      "os": {
        "cpu": {
          "load_average": {
            "1m": 2.5,
            "5m": 2.1,
            "15m": 1.9
          }
        },
        // ... other OS stats
      },
      "fs": {
        "data": [
          {
            "path": "/var/lib/elasticsearch",
            "total_in_bytes": 500000000000,
            "free_in_bytes": 100000000000,
            "available_in_bytes": 90000000000
          }
        ]
      },
      // ... other node stats
    }
    // ... other nodes
  }
}

Tools like the community-maintained elasticsearch_exporter for Prometheus or commercial APM solutions can collect and visualize these metrics. Set up alerts for:

Cluster status becoming yellow or red.
unassigned_shards count greater than 0.
JVM heap usage consistently above 80%.
Low disk space on data nodes.
High CPU load averages.
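
The heap and disk checks in this list can be derived from a single node entry of the _nodes/stats response. The sketch below is one way to do it; the function name and the 15% free-disk threshold are assumptions, while the 80% heap limit follows the guidance above.

```python
def node_alerts(node: dict, heap_limit: float = 0.80,
                min_free_disk: float = 0.15) -> list:
    """Check one node's stats against heap and disk thresholds."""
    alerts = []
    mem = node["jvm"]["mem"]
    heap = mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
    if heap > heap_limit:
        alerts.append(f"JVM heap at {heap:.0%}")
    for data_path in node["fs"]["data"]:
        free = data_path["available_in_bytes"] / data_path["total_in_bytes"]
        if free < min_free_disk:
            alerts.append(f"{data_path['path']} has only {free:.0%} free")
    return alerts


# Values taken from the example _nodes/stats snippet above
sample_node = {
    "jvm": {"mem": {"heap_used_in_bytes": 1073741824,
                    "heap_max_in_bytes": 1572864000}},
    "fs": {"data": [{"path": "/var/lib/elasticsearch",
                     "total_in_bytes": 500000000000,
                     "available_in_bytes": 90000000000}]},
}

print(node_alerts(sample_node))
```

A production check would normally require the condition to hold across several consecutive samples before alerting, since heap usage fluctuates with garbage collection.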

Indexing and Search Performance

For a responsive WordPress search, indexing and query performance are key. Both are exposed by the index stats API: GET _stats/indexing,segments for indexing activity and GET _stats/search for search activity. The same counters are available per node via GET _nodes/stats/indices.

Key metrics from GET _stats/indexing,segments:

indexing.index_total: Total documents indexed.
indexing.index_time_in_millis: Cumulative time spent indexing.
indexing.throttle_time_in_millis: Cumulative time indexing was throttled due to resource constraints.
segments.count: Number of segments. High segment counts can impact search performance.

Key metrics from GET _stats/search:

search.query_total: Total query-phase operations (counted per shard).
search.query_time_in_millis: Cumulative time spent in the query phase.
search.fetch_total, search.fetch_time_in_millis: Count of fetch-phase operations and cumulative time spent fetching results.

These are cumulative counters, so Elasticsearch does not report an average latency directly. Derive it by dividing the change in query_time_in_millis by the change in query_total between two samples. Alerting on a rising average query latency or a growing throttle_time_in_millis during indexing can help proactively identify performance degradation. Regularly review index settings, especially refresh intervals and merge policies, based on these metrics.
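
An average search latency for an interval can be derived from two samples of the cumulative query_total and query_time_in_millis counters. A minimal sketch, with a hypothetical function name and sample values:

```python
def avg_query_latency_ms(prev: dict, curr: dict):
    """Average query latency in ms between two samples of search stats.

    Returns None when no queries ran in the interval (or the counters
    were reset), since no meaningful average exists then.
    """
    queries = curr["query_total"] - prev["query_total"]
    millis = curr["query_time_in_millis"] - prev["query_time_in_millis"]
    if queries <= 0:
        return None
    return millis / queries


# Hypothetical samples taken a few minutes apart
earlier = {"query_total": 1000, "query_time_in_millis": 50000}
later = {"query_total": 1200, "query_time_in_millis": 170000}

print(avg_query_latency_ms(earlier, later))
```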

Log Aggregation and Alerting Strategy

A centralized logging system is indispensable for diagnosing issues across your WordPress application and Elasticsearch cluster. We’ll use a combination of Fluentd/Logstash for collection, Elasticsearch for storage, and Kibana for visualization and alerting.

Log Collection Agents

On your WordPress servers (web servers, PHP-FPM), configure Fluentd or Filebeat to tail relevant log files:

# Example Fluentd configuration for WordPress logs
# /etc/fluentd/conf.d/wordpress.conf
<source>
  @type tail
  path /var/log/nginx/access.log
  pos_file /var/log/fluentd/nginx-access.pos
  tag nginx.access
  <parse>
    @type nginx
  </parse>
</source>

<source>
  @type tail
  path /var/log/nginx/error.log
  pos_file /var/log/fluentd/nginx-error.pos
  tag nginx.error
  <parse>
    @type regexp
    # Basic regexp for Nginx errors, adjust as needed
    expression /^(?<time>\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2})\s+\[(?<level>[^\]]+)\]\s+(?<message>.*)$/
    time_format %Y/%m/%d %H:%M:%S
  </parse>
</source>

<source>
  @type tail
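  # Adjust path as per your PHP-FPM config
  path /var/log/php-fpm/www-error.log
  pos_file /var/log/fluentd/php-fpm-error.pos
  tag php-fpm.error
  <parse>
    @type regexp
    # Basic regexp for lines such as "[01-Jan-2023 10:00:00] WARNING: ..."
    # or "[01-Jan-2023 10:00:00 UTC] PHP Fatal error: ...", adjust as needed
    expression /^\[(?<time>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})(?: [^\]]+)?\]\s+(?<level>[^:]+):\s+(?<message>.*)$/
    time_format %d-%b-%Y %H:%M:%S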
  </parse>
</source>

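<match **>
  @type forward
  # Failure-detection timings, adjust as needed
  recover_wait 10s
  heartbeat_interval 1s
  <buffer>
    @type file
    path /var/log/fluentd/buffer/wordpress
    flush_interval 5s
  </buffer>
  <server>
    # The receiving host must run a Fluentd forward input
    host your-elasticsearch-ingest-node-ip
    # Default Fluentd forward port
    port 24224
  </server>
</match>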
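
Before deploying a parse expression, it helps to test it against a real log line. The sketch below checks an Nginx error-log pattern in Python. Note that Python writes named groups as (?P<name>...) where Fluentd's Ruby regexps use (?<name>...). The group names and the sample line are illustrative assumptions.

```python
import re

# Python equivalent of a timestamp / [level] / message pattern for Nginx errors
NGINX_ERROR_RE = re.compile(
    r"^(?P<time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"\[(?P<level>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_nginx_error(line: str):
    """Return the captured fields as a dict, or None if the line does not match."""
    match = NGINX_ERROR_RE.match(line)
    return match.groupdict() if match else None


# Hypothetical error-log line
sample_line = (
    "2023/01/01 10:00:00 [error] 1234#0: *1 open() "
    '"/var/www/html/missing.php" failed (2: No such file or directory)'
)

print(parse_nginx_error(sample_line))
```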

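For Elasticsearch nodes, use Filebeat or Fluentd to collect the Elasticsearch logs (gc.log and the main server log, which is named after the cluster, e.g., elasticsearch.log). Metrics are collected separately, for example by Metricbeat or by a Prometheus exporter that polls the stats APIs, so ensure those APIs are reachable from your monitoring system.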

Kibana for Visualization and Alerting

Once logs are aggregated in Elasticsearch, Kibana provides powerful tools for analysis and alerting. Create dashboards to visualize key metrics:

WordPress HTTP error rates (4xx, 5xx).
PHP-FPM error counts and types.
Database slow query indicators.
Elasticsearch cluster health status and node resource utilization.
Elasticsearch indexing and search latency.

Kibana’s Alerting feature can be configured to trigger notifications (email, Slack, PagerDuty) based on thresholds or patterns in your logs and metrics. For example, an alert can be set for:

More than 5 Nginx 5xx errors in 5 minutes.
PHP-FPM error log entries containing “Fatal error”.
Elasticsearch cluster status changing from green to yellow or red.
Average search latency exceeding 500ms for 10 minutes.
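
Threshold rules such as "more than 5 errors in 5 minutes" come down to counting events in a sliding time window. The sketch below shows that logic in isolation. It is not how Kibana implements its rules, and the class name and defaults are assumptions for illustration.

```python
from collections import deque


class ErrorWindow:
    """Counts events in a sliding time window.

    Timestamps are seconds and must be recorded in non-decreasing order.
    """

    def __init__(self, threshold: int = 5, window_s: float = 300.0):
        self.threshold = threshold
        self.window_s = window_s
        self._events = deque()

    def record(self, timestamp: float) -> bool:
        """Add one event; return True if the window now exceeds the threshold."""
        self._events.append(timestamp)
        # Discard events that have fallen out of the window
        while self._events and self._events[0] <= timestamp - self.window_s:
            self._events.popleft()
        return len(self._events) > self.threshold


window = ErrorWindow(threshold=5, window_s=300)
# Six 5xx responses within one minute: only the sixth trips the alert
print([window.record(t) for t in (0, 10, 20, 30, 40, 50)])
```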

This comprehensive monitoring strategy, combining application-specific metrics, infrastructure health, and centralized logging, is crucial for maintaining a stable and performant WordPress deployment on OVH, especially when integrated with complex systems like Elasticsearch.