Server Monitoring Best Practices: Keeping Your WordPress App and MongoDB Clusters Alive on Linode

Proactive MongoDB Cluster Health Checks

Maintaining the health of a MongoDB replica set or sharded cluster is paramount for any production WordPress application relying on it for data persistence. Beyond basic resource utilization, we need to monitor the internal state of MongoDB itself. This involves checking replication lag, oplog status, and the health of individual nodes.

Monitoring Replication Lag

Replication lag is a critical indicator of potential data inconsistency or performance bottlenecks. We can query the replica set status to identify the delay between the primary and secondaries. A common approach is to use a script that periodically checks the optimeDate of the secondaries against the primary.

Here’s a Python script that can be scheduled via cron to check replication lag:

import sys

import pymongo

# Configuration
MONGO_URI = "mongodb://user:password@mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/?replicaSet=myReplicaSet&authSource=admin"
LAG_THRESHOLD_SECONDS = 60  # Alert if lag exceeds 60 seconds

def check_replication_lag(mongo_uri, lag_threshold):
    try:
        client = pymongo.MongoClient(mongo_uri)
        db = client.admin
        
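        # Fetch the replica set status once; it describes every member
        status = db.command('replSetGetStatus')

        # Locate the primary by state instead of assuming it is member 0
        primary = next((m for m in status['members'] if m['stateStr'] == 'PRIMARY'), None)
        if primary is None:
            print("ALERT: No primary found in replica set status.", file=sys.stderr)
            sys.exit(1)
        primary_optime = primary['optimeDate']

        # Check each secondary; arbiters and members in other states are skipped
        for member in status['members']:
            if member['stateStr'] == 'SECONDARY':
                secondary_optime = member['optimeDate']
                lag = (primary_optime - secondary_optime).total_seconds()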
                
                print(f"Secondary {member['name']} lag: {lag:.2f} seconds")
                
                if lag > lag_threshold:
                    print(f"ALERT: Replication lag on {member['name']} exceeds threshold ({lag_threshold}s). Current lag: {lag:.2f}s", file=sys.stderr)
                    # In a real-world scenario, you'd trigger an alert here (e.g., PagerDuty, Slack)
                    sys.exit(1) # Indicate an error

    except pymongo.errors.ConnectionFailure as e:
        print(f"ERROR: Could not connect to MongoDB: {e}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"ERROR: An unexpected error occurred: {e}", file=sys.stderr)
        sys.exit(1)
    finally:
        if 'client' in locals() and client:
            client.close()

if __name__ == "__main__":
    check_replication_lag(MONGO_URI, LAG_THRESHOLD_SECONDS)
    print("Replication lag is within acceptable limits.")
    sys.exit(0)

This script connects to the replica set, retrieves the replication status, and iterates through each member. It calculates the time difference between the primary’s last operation and the secondary’s last operation. If this difference exceeds the defined LAG_THRESHOLD_SECONDS, an alert is printed to stderr, and the script exits with a non-zero status, which can be caught by cron or a monitoring agent.
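The lag calculation can also be separated from the database connection so it can be exercised without a live cluster. The following is a minimal sketch, assuming the status document carries the same members, stateStr, name, and optimeDate fields the script reads; compute_lags, lagging_members, and the sample data are illustrative names, not part of PyMongo.

```python
from datetime import datetime, timedelta


def compute_lags(status):
    """Return {member_name: lag_seconds} for every SECONDARY in a status document."""
    members = status.get("members", [])
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        raise RuntimeError("no PRIMARY found in replica set status")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m.get("stateStr") == "SECONDARY"
    }


def lagging_members(status, threshold_seconds):
    """Return the sorted names of secondaries whose lag exceeds the threshold."""
    lags = compute_lags(status)
    return sorted(name for name, lag in lags.items() if lag > threshold_seconds)


# Hand-built sample shaped like the fields used above (host names are made up).
now = datetime(2024, 1, 1, 12, 0, 0)
sample_status = {
    "members": [
        {"name": "mongo2.example.com:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=5)},
        {"name": "mongo1.example.com:27017", "stateStr": "PRIMARY",
         "optimeDate": now},
        {"name": "mongo3.example.com:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=90)},
    ]
}

print(lagging_members(sample_status, 60))  # ['mongo3.example.com:27017']
```

Because the primary is found by state rather than by position, the sample deliberately lists it second.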

Oplog Monitoring

The oplog (operation log) is the heart of MongoDB replication. It is a capped collection of fixed size, so the metric to watch is not its growth but its window: the time span between the oldest and newest entries it currently holds. A shrinking window points to a heavier write load, and a secondary that falls further behind than the window allows can no longer catch up and needs a full resync.

You can inspect the oplog’s status directly on the primary node (on older installations, use the legacy mongo shell in place of mongosh):

mongosh --quiet --eval "db.getReplicationInfo().logSizeMB"
mongosh --quiet --eval "db.getReplicationInfo().timeDiff"

The logSizeMB value is the configured maximum oplog size in MB. The timeDiff value is the oplog window in seconds, that is, the span between the first and last operations currently in the oplog. A common practice is to size the oplog so that it holds at least 24 hours of operations, or more, depending on your write patterns and expected maintenance windows.

To automate this, you can extend the Python script or create a separate one to query these values and alert if timeDiff drops below your target window.
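One way to automate the window check is sketched below. It assumes the monitoring user can read the local database, and it takes the oldest and newest entries of local.oplog.rs to derive the window; the function names and the 24-hour default are illustrative choices, not fixed recommendations.

```python
def oplog_window_hours(first_ts_seconds, last_ts_seconds):
    """Hours of history between the oldest and newest oplog entries."""
    return max(0, last_ts_seconds - first_ts_seconds) / 3600.0


def check_oplog_window(first_ts_seconds, last_ts_seconds, min_hours=24):
    """Return (is_ok, window_hours) for the given oplog bounds."""
    hours = oplog_window_hours(first_ts_seconds, last_ts_seconds)
    return hours >= min_hours, hours


def fetch_oplog_bounds(mongo_uri):
    """Read the oldest and newest oplog timestamps (epoch seconds) from a member."""
    import pymongo  # imported lazily so the helpers above work without the driver

    client = pymongo.MongoClient(mongo_uri)
    try:
        oplog = client.local["oplog.rs"]
        first = oplog.find_one(sort=[("$natural", 1)])
        last = oplog.find_one(sort=[("$natural", -1)])
        if first is None or last is None:
            raise RuntimeError("oplog is empty or not readable")
        # 'ts' is a BSON Timestamp; its .time attribute is seconds since the epoch
        return first["ts"].time, last["ts"].time
    finally:
        client.close()


# Offline example with made-up bounds: a 36-hour window and a 6-hour window.
start = 1_700_000_000
print(check_oplog_window(start, start + 36 * 3600))  # (True, 36.0)
print(check_oplog_window(start, start + 6 * 3600))   # (False, 6.0)
```

In a cron job, the two values returned by fetch_oplog_bounds would be passed to check_oplog_window, and a False result would trigger the same alerting path as the lag script.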

WordPress Application Performance Monitoring (APM) on Linode

For WordPress, performance is often tied to database query speed, external API calls, and PHP execution time. Generic server metrics (CPU, RAM, Disk I/O) are necessary but insufficient. We need to dive into the application layer.

Leveraging New Relic or Datadog Agents

Commercial APM solutions like New Relic or Datadog provide comprehensive insights into WordPress performance. Their agents typically hook into PHP’s execution flow, instrumenting functions, database queries, and external HTTP requests.

Installation on a Linode server usually involves:

Installing the respective agent package (e.g., apt-get install newrelic-php5 after adding the New Relic repository, or running the Datadog agent install script).
Configuring the agent with your license key and specifying the application name.
Configuring the PHP agent (e.g., editing /etc/php/X.Y/mods-available/newrelic.ini for New Relic, or the Datadog PHP tracer’s INI settings or DD_* environment variables). This often involves enabling the extension and setting specific parameters.
Restarting your web server (e.g., Nginx or Apache) and PHP-FPM.

Once configured, these tools provide dashboards showing:

Slowest PHP transactions
Database query performance (including slow queries to MongoDB)
External service call latency
Error rates and traces
Server resource utilization correlated with application performance

Custom WordPress Health Checks

Beyond APM, implementing custom health checks within WordPress itself can provide application-specific insights. This can be achieved by creating a custom plugin or by leveraging existing health check plugins that allow for custom checks.

A simple custom check could verify if WordPress can successfully connect to its MongoDB database and perform a basic read/write operation. This can be exposed via a dedicated endpoint.

<?php
/*
Plugin Name: Custom Health Check
Description: Provides a custom health check endpoint for WordPress and MongoDB.
Version: 1.0
Author: Your Name
*/

add_action('rest_api_init', function () {
    register_rest_route('custom-health/v1', '/check', array(
        'methods' => 'GET',
        'callback' => 'custom_health_check_endpoint',
        'permission_callback' => '__return_true' // For simplicity, allow public access. In production, restrict this.
    ));
});

function custom_health_check_endpoint() {
    $results = array(
        'wordpress_status' => 'ok',
        'mongodb_status' => 'unknown',
        'message' => ''
    );

    // Basic WordPress check (e.g., can we access options?)
    if (get_option('siteurl') === false) {
        $results['wordpress_status'] = 'error';
        $results['message'] .= 'WordPress core check failed. ';
    }

    // MongoDB Check (requires a MongoDB PHP driver and connection details)
    // Assuming you have a way to get MongoDB connection details, e.g., from wp-config.php or constants
    // For demonstration, we'll use a placeholder connection string.
    // In a real scenario, you'd use a robust MongoDB client library.
    $mongo_uri = defined('MONGO_URI') ? MONGO_URI : 'mongodb://localhost:27017/wordpress_db'; // Example

    try {
        // This is a simplified check using MongoDB\Client, which requires both the mongodb
        // PHP extension (pecl install mongodb) and the mongodb/mongodb Composer library.
        $client = new MongoDB\Client($mongo_uri);
        $db = $client->selectDatabase('admin'); // Select a database to check connection
        $db->command(['ping' => 1]); // Ping command to verify connection

        // Attempt a simple read/write operation
        $collection = $client->selectCollection('health_check', 'status');
        $timestamp = new MongoDB\BSON\UTCDateTime();
        $result = $collection->updateOne(
            ['_id' => 'health'],
            ['$set' => ['last_checked' => $timestamp]],
            ['upsert' => true]
        );

        // A matched document counts as success even if the timestamp did not change
        if ($result->getMatchedCount() === 1 || $result->getUpsertedCount() === 1) {
            $results['mongodb_status'] = 'ok';
            $results['message'] .= 'MongoDB connection and basic operation successful. ';
        } else {
            $results['mongodb_status'] = 'error';
            $results['message'] .= 'MongoDB connection successful, but write operation failed. ';
        }

    } catch (MongoDB\Driver\Exception\Exception $e) {
        $results['mongodb_status'] = 'error';
        $results['message'] .= 'MongoDB connection error: ' . $e->getMessage() . ' ';
    } catch (Exception $e) {
        $results['mongodb_status'] = 'error';
        $results['message'] .= 'An unexpected error occurred during MongoDB check: ' . $e->getMessage() . ' ';
    }

    // Determine overall status
    if ($results['wordpress_status'] === 'ok' && $results['mongodb_status'] === 'ok') {
        $status_code = 200;
        $results['overall_status'] = 'ok';
    } else {
        $status_code = 503; // Service Unavailable
        $results['overall_status'] = 'error';
    }

    return new WP_REST_Response($results, $status_code);
}
?>

This plugin registers a REST API endpoint (e.g., /wp-json/custom-health/v1/check). When accessed, it performs a basic WordPress check and attempts to connect to MongoDB, ping it, and perform a simple upsert operation. The response includes the status of each component and an overall health status. This endpoint can be polled by external monitoring systems like Nagios, Zabbix, or even a simple cron job with curl.
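As an alternative to curl, a small poller can interpret the JSON body as well as the HTTP status. This is a sketch using only the Python standard library; it assumes the response shape produced by the plugin above (overall_status and message keys), and evaluate_health and poll are illustrative names.

```python
import json
import urllib.error
import urllib.request


def evaluate_health(status_code, body_text):
    """Return (healthy, reason) from the endpoint's HTTP status and JSON body."""
    try:
        payload = json.loads(body_text)
    except ValueError:
        return False, "response body is not valid JSON"
    if not isinstance(payload, dict):
        return False, "response body is not a JSON object"
    overall = payload.get("overall_status")
    if status_code == 200 and overall == "ok":
        return True, "ok"
    detail = (payload.get("message") or "").strip()
    return False, f"HTTP {status_code}, overall_status={overall}: {detail}"


def poll(url, timeout=5):
    """Fetch the health endpoint and evaluate the response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return evaluate_health(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # A 503 is raised as HTTPError but still carries the JSON body
        return evaluate_health(err.code, err.read().decode("utf-8", "replace"))
    except (urllib.error.URLError, OSError) as err:
        return False, f"request failed: {err}"


# Offline examples using bodies shaped like the plugin's response.
print(evaluate_health(200, '{"overall_status": "ok", "message": ""}'))
print(evaluate_health(503, '{"overall_status": "error", "message": "MongoDB connection error: timeout "}'))
```

From cron, the script would call poll with the site's /wp-json/custom-health/v1/check URL and exit non-zero when the first element of the result is False.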

Server-Level Monitoring on Linode

While application and database monitoring are critical, foundational server metrics on Linode are non-negotiable. This includes CPU, memory, disk I/O, network traffic, and process status.

Configuring Prometheus Node Exporter

Prometheus is a popular open-source monitoring and alerting system. The node_exporter is a powerful tool that exposes a wide range of hardware and OS metrics from *nix kernels. It’s an excellent choice for monitoring your Linode instances.

Installation and configuration:

Download the latest release of node_exporter from the Prometheus download page.

Extract the archive and move the binary to a suitable location (e.g., /usr/local/bin/).

Create a systemd service file to manage the exporter.

# Download and extract (example for amd64 Linux)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

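# Create the textfile collector directory referenced by the service below
sudo mkdir -p /var/lib/node_exporter/textfile-collector

# Create systemd service file
sudo nano /etc/systemd/system/node_exporter.service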

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.textfile.directory=/var/lib/node_exporter/textfile-collector

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter

By default, node_exporter listens on port 9100. You’ll need to configure your Prometheus server to scrape metrics from this endpoint. For custom metrics (e.g., specific application health checks not covered by the default collectors), you can use the textfile collector. Create files ending in .prom in /var/lib/node_exporter/textfile-collector/ with metrics in the Prometheus text format, and make sure they are readable by the user the exporter runs as.
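A script that feeds the textfile collector could look like the sketch below. The metric name wordpress_health_ok is hypothetical, and label values are assumed to contain no quotes or backslashes. The file is written to a temporary name and then renamed, so the exporter never reads a half-written file.

```python
import os
import tempfile


def format_metric(name, value, help_text, metric_type="gauge", labels=None):
    """Render one metric in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name}{label_str} {value}\n"
    )


def write_prom_file(directory, filename, content):
    """Write content atomically into the textfile collector directory."""
    if not filename.endswith(".prom"):
        raise ValueError("the textfile collector only reads files ending in .prom")
    target = os.path.join(directory, filename)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter user must read it
        os.replace(tmp_path, target)  # atomic rename within the same directory
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target


# Example: render a health metric (1 = healthy) for a hypothetical site label.
text = format_metric(
    "wordpress_health_ok", 1,
    "1 if the WordPress health endpoint reported ok", labels={"site": "blog"},
)
print(text, end="")
```

On the server, the rendered text would be passed to write_prom_file with /var/lib/node_exporter/textfile-collector as the directory, typically from a cron job running as a user with write access to it.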

Monitoring MongoDB with Prometheus

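The Prometheus ecosystem also includes a dedicated exporter for MongoDB: mongodb_exporter. It is a third-party project (the widely used build is maintained by Percona) rather than part of Prometheus itself. This exporter queries MongoDB’s internal metrics and exposes them in a Prometheus-compatible format.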

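Installation typically involves downloading the binary and running it with the MongoDB connection details. Percona’s releases read the connection string from the --mongodb.uri flag or the MONGODB_URI environment variable, so check which mechanism your build supports before relying on a configuration file.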

# Download and extract (example for amd64 Linux)
wget https://github.com/percona/mongodb_exporter/releases/download/v0.35.0/mongodb_exporter-0.35.0.linux-amd64.tar.gz
tar xvfz mongodb_exporter-0.35.0.linux-amd64.tar.gz
sudo mv mongodb_exporter-0.35.0.linux-amd64/mongodb_exporter /usr/local/bin/

# Create a user for MongoDB exporter (optional but recommended)
sudo useradd -rs /bin/false mongodb_exporter

# Create a configuration file (e.g., /etc/mongodb_exporter.yml)
sudo nano /etc/mongodb_exporter.yml

mongodb:
  uri: "mongodb://exporter_user:password@mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/?replicaSet=myReplicaSet&authSource=admin"
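  # You can specify which collectors to enable/disable. Collector names differ between
  # exporter releases, so treat the list below as illustrative and confirm it against
  # the output of mongodb_exporter --help.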
  # collectors:
  #   - "replSet"
  #   - "db"
  #   - "collection"
  #   - "oplog"
  #   - "stats"
  #   - "index"
  #   - "profile"
  #   - "log"
  #   - "session"
  #   - "auth"
  #   - "command"
  #   - "storage"
  #   - "wiredTiger"
  #   - "network"
  #   - "locks"
  #   - "query_plan"
  #   - "query_sample"
  #   - "query_operation"
  #   - "query_aggregation"
  #   - "query_command"
  #   - "query_insert"
  #   - "query_update"
  #   - "query_delete"
  #   - "query_getmore"
  #   - "query_killcursors"
  #   - "query_other"
  #   - "query_total"
  #   - "query_failed"
  #   - "query_slow"
  #   - "query_read_lock_wait"
  #   - "query_write_lock_wait"
  #   - "query_total_time"
  #   - "query_lock_time"
  #   - "query_total_docs"
  #   - "query_docs_returned"
  #   - "query_docs_inserted"
  #   - "query_docs_updated"
  #   - "query_docs_deleted"
  #   - "query_docs_returned_per_sec"
  #   - "query_docs_inserted_per_sec"
  #   - "query_docs_updated_per_sec"
  #   - "query_docs_deleted_per_sec"
  #   - "query_total_time_per_sec"
  #   - "query_lock_time_per_sec"
  #   - "query_read_lock_wait_per_sec"
  #   - "query_write_lock_wait_per_sec"
  #   - "query_read_lock_wait_total"
  #   - "query_write_lock_wait_total"
  #   - "query_read_lock_wait_avg"
  #   - "query_write_lock_wait_avg"
  #   - "query_read_lock_wait_max"
  #   - "query_write_lock_wait_max"
  #   - "query_read_lock_wait_min"
  #   - "query_write_lock_wait_min"
  #   - "query_read_lock_wait_p95"
  #   - "query_write_lock_wait_p95"
  #   - "query_read_lock_wait_p99"
  #   - "query_write_lock_wait_p99"
  #   - "query_read_lock_wait_p50"
  #   - "query_write_lock_wait_p50"
  #   - "query_read_lock_wait_p75"
  #   - "query_write_lock_wait_p75"
  #   - "query_read_lock_wait_p90"
  #   - "query_write_lock_wait_p90"
  #   - "query_read_lock_wait_p999"
  #   - "query_write_lock_wait_p999"
  #   - "query_read_lock_wait_p0"
  #   - "query_write_lock_wait_p0"
  #   - "query_read_lock_wait_p100"
  #   - "query_write_lock_wait_p100"
  #   - "query_read_lock_wait_p10"
  #   - "query_write_lock_wait_p10"
  #   - "query_read_lock_wait_p20"
  #   - "query_write_lock_wait_p20"
  #   - "query_read_lock_wait_p30"
  #   - "query_write_lock_wait_p30"
  #   - "query_read_lock_wait_p40"
  #   - "query_write_lock_wait_p40"
  #   - "query_read_lock_wait_p60"
  #   - "query_write_lock_wait_p60"
  #   - "query_read_lock_wait_p80"
  #   - "query_write_lock_wait_p80"