Server Monitoring Best Practices: Keeping Your WooCommerce App and MongoDB Clusters Alive on OVH

Proactive MongoDB Cluster Health Checks with `mongostat` and `mongotop`

Maintaining the health of your MongoDB clusters, especially those backing a high-traffic WooCommerce application, requires more than just reactive alerts. Proactive, granular monitoring of key performance indicators (KPIs) is crucial. We’ll focus on leveraging the built-in MongoDB tools, `mongostat` and `mongotop`, for real-time insights into cluster performance and resource utilization on your OVH infrastructure.

Real-time Connection and Operation Monitoring with `mongostat`

`mongostat` provides a live, tabular view of MongoDB server statistics. It’s invaluable for quickly assessing the current load, identifying bottlenecks, and understanding the nature of incoming requests. Running this directly on your MongoDB nodes or from a dedicated monitoring host is the first step.

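To get a comprehensive view of your MongoDB replica set, run `mongostat` against each member, or pass `--discover` so it finds and reports on every member from a single seed host. Watching the primary and the secondaries side by side shows how reads and writes are distributed, and the `repl` column identifies each member’s role. `mongostat` does not report replication lag directly, so check that separately (for example with `rs.printSecondaryReplicationInfo()` in the shell).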

Essential `mongostat` Metrics and Interpretation

insert, query, update, delete: These columns show the number of operations per second for each type. A sudden spike in one category can indicate an issue with a specific WooCommerce feature (e.g., a surge in `insert` operations during a flash sale).
getmore: Cursor batch fetches per second. High `getmore` counts can suggest inefficient queries or large result sets being fetched incrementally.
dirty: Percentage of the WiredTiger cache holding dirty data (modified but not yet written to disk). Sustained high values point to a write load that outpaces flushing.
used: Percentage of the WiredTiger cache currently in use. This is not the same as the RAM used by the process.
res: Resident memory (RAM) used by the `mongod` process.
qrw: Number of clients queued waiting to read or write, shown as reads|writes. Sustained non-zero values mean operations are waiting and are a strong indicator of a bottleneck.
arw: Number of clients actively performing reads or writes, shown as reads|writes.
net_in, net_out: Network traffic in and out. Useful for identifying network saturation.
conn: Number of open client connections. Exceeding connection limits can lead to application errors.
idx miss: Shown only by older `mongostat` releases against the MMAPv1 storage engine, where it gave the percentage of index accesses that required a page fault. Current releases running WiredTiger do not print this column. To find queries that scan whole collections instead of using indexes, use `explain()` or the database profiler.

Here’s a typical command to run `mongostat` with a 5-second polling interval. The interval is a positional argument at the end of the command line; `mongostat` has no `--interval` or `--columns` option. Add `--rowcount N` to stop after N rows, and use `-o` (which takes `serverStatus` field paths rather than the short column names above) if you need to customize the columns.

Example `mongostat` Command

mongostat --host mongodb1.yourdomain.com:27017 --username admin --password 'your_secure_password' --authenticationDatabase admin 5

Note: Replace mongodb1.yourdomain.com:27017, admin, and your_secure_password with your actual cluster details. For production, use environment variables or a secure credential management system instead of hardcoding passwords.
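If you run `mongostat` with its `--json` flag, each output line can be checked programmatically. The sketch below is a minimal example of that idea, not a complete parser: it assumes the JSON line is keyed by host and that `qrw` and `dirty` arrive as strings such as "7|2" and "3.1%" (verify this against your `mongostat` version). The function names and both thresholds are hypothetical, chosen only for illustration.

```python
import json

def parse_pair(value):
    """Split a mongostat 'reads|writes' field such as '3|1' into two ints."""
    reads, writes = value.split("|")
    return int(reads), int(writes)

def parse_percent(value):
    """Convert a percentage field such as '4.2%' to a float."""
    return float(value.rstrip("%"))

def check_row(line, max_queue=5, max_dirty_pct=20.0):
    """Return a list of warnings for one line of `mongostat --json` output.

    max_queue and max_dirty_pct are illustrative thresholds, not recommendations.
    """
    warnings = []
    for host, row in json.loads(line).items():
        queued_reads, queued_writes = parse_pair(row["qrw"])
        if queued_reads + queued_writes > max_queue:
            warnings.append(f"{host}: {queued_reads} reads / {queued_writes} writes queued")
        if parse_percent(row["dirty"]) > max_dirty_pct:
            warnings.append(f"{host}: dirty cache at {row['dirty']}")
    return warnings

# Illustrative row, hand-written in the assumed format (not captured from a real server).
sample = ('{"mongodb1.yourdomain.com:27017": '
          '{"qrw": "7|2", "arw": "1|0", "dirty": "3.1%", "used": "79.5%", "conn": "212"}}')
print(check_row(sample))
```

In practice you would pipe `mongostat --json ... 5` into a small loop that calls `check_row` on every line and forwards any warnings to your alerting system.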

Deep Dive into Disk I/O and Query Performance with `mongotop`

While `mongostat` gives a broad overview, `mongotop` shows how much time the server spends reading from and writing to each collection. It does not measure CPU or disk usage directly, but it is indispensable for pinpointing which collections are behind slow queries or inefficient data access patterns within your WooCommerce application.

Interpreting `mongotop` Output

ns: The namespace (database.collection) being accessed.
total: Total time, in milliseconds, spent on operations for this collection during the polling interval.
read: Time, in milliseconds, spent on read operations.
write: Time, in milliseconds, spent on write operations.

`mongotop` does not print a percentage column. To judge a collection’s share of the load, compare its total against the other namespaces and against the length of the polling interval.

A collection whose total is consistently high relative to the others, especially if it is dominated by read time, suggests that queries against this collection are inefficient. This could be due to missing indexes, poorly written queries from the WooCommerce backend, or large data scans.

Example `mongotop` Command

mongotop --host mongodb1.yourdomain.com:27017 --username admin --password 'your_secure_password' --authenticationDatabase admin 10

This command displays statistics every 10 seconds. As with `mongostat`, the interval is a positional argument rather than an `--interval` option. Watch for collections that consistently show high read or write times, or that account for most of the time across all namespaces. These are your prime candidates for optimization.

Automating MongoDB Health Checks with Python and `pymongo`

Manual execution of `mongostat` and `mongotop` is useful for ad-hoc analysis, but for continuous monitoring, automation is key. We can use Python with the `pymongo` library to programmatically collect these metrics and send them to our monitoring system (e.g., Prometheus, InfluxDB, Datadog).

Python Script for `mongostat`-like Metrics

This script connects to a MongoDB instance and retrieves server status information, which can be mapped to `mongostat`’s output. We’ll focus on connection counts, operation rates, and memory usage.

import time

from pymongo import MongoClient
from pymongo.errors import ConnectionFailure, PyMongoError

# --- Configuration ---
# Placeholder URI: substitute your own host and credentials, ideally loaded from the environment.
MONGO_URI = "mongodb://admin:your_secure_password@mongodb1.yourdomain.com:27017/?authSource=admin"
MONITOR_INTERVAL_SECONDS = 15
# ---------------------

# Cumulative counters (since server start) that are reported as per-second rates.
COUNTER_KEYS = ('insert', 'query', 'update', 'delete', 'getmore',
                'network_bytesIn', 'network_bytesOut')

def get_server_stats(client):
    """Return the serverStatus document, or None if the command fails."""
    try:
        return client.admin.command('serverStatus')
    except PyMongoError as e:
        print(f"Error fetching server status: {e}")
        return None

def extract_metrics(stats):
    """Flatten the serverStatus fields we care about into one dict."""
    ops = stats['opcounters']
    queue = stats['globalLock']['currentQueue']
    return {
        # Counters
        'insert': ops['insert'],
        'query': ops['query'],
        'update': ops['update'],
        'delete': ops['delete'],
        'getmore': ops['getmore'],  # serverStatus spells this 'getmore'
        'network_bytesIn': stats['network']['bytesIn'],
        'network_bytesOut': stats['network']['bytesOut'],
        # Gauges (point-in-time values, never converted to rates)
        'connections_current': stats['connections']['current'],
        'mem_resident_mib': stats['mem']['resident'],  # already reported in MiB
        'queue_total': queue['total'],
        'queue_readers': queue['readers'],
        'queue_writers': queue['writers'],
    }

def calculate_rates(current, previous, elapsed_seconds):
    """Per-second rates for the cumulative counters between two samples."""
    rates = {}
    for key in COUNTER_KEYS:
        # A negative delta means the counters were reset (e.g., mongod restarted).
        delta = max(current[key] - previous[key], 0)
        rates[key] = delta / elapsed_seconds
    return rates

def main():
    client = None
    previous = None
    previous_time = None

    try:
        client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
        client.admin.command('ping') # Check connection
        print("Connected to MongoDB.")

        while True:
            stats = get_server_stats(client)
            now = time.monotonic()
            if stats:
                current = extract_metrics(stats)
                if previous:
                    # Use the measured time between samples, not the nominal interval
                    rates = calculate_rates(current, previous, now - previous_time)

                    print(f"--- Stats ({time.strftime('%Y-%m-%d %H:%M:%S')}) ---")
                    print(f"Connections: {current['connections_current']}")
                    print(f"Ops/sec: Insert={rates['insert']:.2f}, Query={rates['query']:.2f}, Update={rates['update']:.2f}, Delete={rates['delete']:.2f}, GetMore={rates['getmore']:.2f}")
                    print(f"Network: In={rates['network_bytesIn']/1024:.2f} KB/s, Out={rates['network_bytesOut']/1024:.2f} KB/s")
                    print(f"Memory: Resident={current['mem_resident_mib']} MiB")
                    print(f"Queues: Total={current['queue_total']}, Readers={current['queue_readers']}, Writers={current['queue_writers']}")
                    # Add more metrics as needed and send to your monitoring system

                previous, previous_time = current, now # Store for next iteration

            time.sleep(MONITOR_INTERVAL_SECONDS)

    except ConnectionFailure as e:
        print(f"Could not connect to MongoDB: {e}")
    except KeyboardInterrupt:
        print("\nMonitoring stopped.")
    finally:
        if client:
            client.close()

if __name__ == "__main__":
    main()

This script provides a foundation. In practice, you would replace the `print` statements with calls to your metrics client (e.g., `prometheus_client` for Prometheus, or the DogStatsD client in the `datadog` package for Datadog) so that the values reach your central monitoring dashboard.
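As one possible integration path that avoids a client library, the values can be rendered in the Prometheus text exposition format and written to a file for node_exporter's textfile collector to pick up. The sketch below is a minimal illustration under that assumption; the function names, the `mongodb_` prefix, and the metric names are hypothetical, and every value is exported as a gauge because the rates have already been computed.

```python
def escape_label(value):
    """Escape backslash, double quote, and newline as the text format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def to_prometheus_text(metrics, labels=None, prefix="mongodb_"):
    """Render a flat {name: number} dict in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items()))
        label_str = "{" + pairs + "}"

    lines = []
    for name, value in sorted(metrics.items()):
        metric = prefix + name
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric}{label_str} {float(value)}")
    return "\n".join(lines) + "\n"

# Illustrative values only.
print(to_prometheus_text({"connections": 212, "insert_per_sec": 35.5}, {"host": "mongodb1"}))
```

Writing the returned text to a `.prom` file in the textfile collector's directory on each loop iteration (ideally via a temporary file and an atomic rename) would let an existing node_exporter expose the metrics without running a separate HTTP endpoint.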

Python Script for `mongotop`-like Metrics

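A full replica of `mongotop` needs per-collection timing data, which is harder to obtain than server-wide counters. As a starting point, the script below gathers basic collection-level statistics (document count, data size, storage size) for every non-system collection using the `collStats` command. Note that this command can be expensive on deployments with many collections.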

import time

from pymongo import MongoClient
from pymongo.errors import ConnectionFailure, PyMongoError

# --- Configuration ---
# Placeholder URI: substitute your own host and credentials, ideally loaded from the environment.
MONGO_URI = "mongodb://admin:your_secure_password@mongodb1.yourdomain.com:27017/?authSource=admin"
MONITOR_INTERVAL_SECONDS = 30
SYSTEM_DATABASES = ('local', 'config', 'admin')
# ---------------------

def get_collection_stats(client):
    """Return {namespace: basic stats} for every collection outside the system databases."""
    collection_metrics = {}
    try:
        for db_name in client.list_database_names():
            if db_name in SYSTEM_DATABASES: # Skip system databases
                continue

            db = client[db_name]
            for col_name in db.list_collection_names():
                ns = f"{db_name}.{col_name}"
                try:
                    # PyMongo collections have no stats() method, so run the
                    # collStats command directly. This can be resource intensive.
                    col_stats = db.command('collStats', col_name)
                except PyMongoError as e:
                    print(f"Error getting stats for {ns}: {e}")
                    continue

                # These are size and count figures only. collStats does not report
                # time spent reading or writing; that requires the database
                # profiler or the `top` admin command.
                collection_metrics[ns] = {
                    'count': col_stats.get('count', 0),
                    'size': col_stats.get('size', 0),
                    'avgObjSize': col_stats.get('avgObjSize', 0),
                    'storageSize': col_stats.get('storageSize', 0),
                    'nindexes': col_stats.get('nindexes', 0),
                }
        return collection_metrics
    except PyMongoError as e:
        print(f"Error fetching database/collection stats: {e}")
        return None

def main():
    client = None

    try:
        client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
        client.admin.command('ping') # Check connection
        print("Connected to MongoDB for collection stats.")

        while True:
            current_collection_data = get_collection_stats(client)
            if current_collection_data:
                print(f"--- Collection Stats ({time.strftime('%Y-%m-%d %H:%M:%S')}) ---")
                # Sort by document count, largest collections first
                sorted_collections = sorted(current_collection_data.items(), key=lambda item: item[1]['count'], reverse=True)

                for ns, metrics in sorted_collections:
                    print(f"  Namespace: {ns}")
                    print(f"    Count: {metrics['count']}")
                    print(f"    Size: {metrics['size'] / (1024*1024):.2f} MB")
                    print(f"    Avg Obj Size: {metrics['avgObjSize']:.2f} bytes")
                    print(f"    Storage Size: {metrics['storageSize'] / (1024*1024):.2f} MB")
                    print(f"    Indexes: {metrics['nindexes']}")
                    # Send these metrics to your monitoring system

            time.sleep(MONITOR_INTERVAL_SECONDS)

    except ConnectionFailure as e:
        print(f"Could not connect to MongoDB: {e}")
    except KeyboardInterrupt:
        print("\nMonitoring stopped.")
    finally:
        if client:
            client.close()

if __name__ == "__main__":
    main()

Important Consideration for `mongotop` Replication: The *time* spent on reads and writes per collection cannot be derived from `serverStatus` alone. Its `opcounters` are server-wide totals since startup, with no per-collection breakdown and no timing information. `mongotop` itself relies on the `top` administrative command, which reports cumulative time (in microseconds) and operation counts per namespace. To reproduce `mongotop`’s time-based metrics, you’d need to:

Run the `top` command against the `admin` database of each `mongod` at a fixed interval.
Subtract the previous sample from the current one for each namespace to get the read and write time spent during that interval.
Rank the namespaces by the resulting deltas to find the busiest collections.

The provided Python script for `mongotop` focuses on basic collection statistics (size, count) as a starting point. For deeper performance analysis, integrating with MongoDB’s profiler or using a monitoring tool that captures per-operation timings is recommended.
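The per-namespace timings that `mongotop` prints can be approximated by sampling MongoDB's `top` admin command twice and subtracting the samples. The sketch below shows only the subtraction step, on plain dictionaries, so it runs without a server. It assumes the `totals` document maps each namespace to `total`, `readLock`, and `writeLock` entries with a cumulative `time` in microseconds, plus a non-dict `note` entry; the function name and the sample figures are hypothetical.

```python
def top_deltas(previous, current):
    """Milliseconds spent per namespace between two `top` samples, busiest first.

    Both arguments are the `totals` sub-document of a `top` command result.
    """
    result = {}
    for ns, cur in current.items():
        prev = previous.get(ns)
        # Skip the informational 'note' entry and namespaces with no baseline yet.
        if not isinstance(cur, dict) or not isinstance(prev, dict):
            continue
        result[ns] = {
            "total_ms": (cur["total"]["time"] - prev["total"]["time"]) / 1000,
            "read_ms": (cur["readLock"]["time"] - prev["readLock"]["time"]) / 1000,
            "write_ms": (cur["writeLock"]["time"] - prev["writeLock"]["time"]) / 1000,
        }
    return dict(sorted(result.items(), key=lambda kv: kv[1]["total_ms"], reverse=True))

# Illustrative samples (made-up numbers in the assumed shape).
before = {
    "note": "all times in microseconds",
    "shop.orders": {"total": {"time": 1000000, "count": 10},
                    "readLock": {"time": 600000, "count": 6},
                    "writeLock": {"time": 400000, "count": 4}},
}
after = {
    "note": "all times in microseconds",
    "shop.orders": {"total": {"time": 1250000, "count": 14},
                    "readLock": {"time": 800000, "count": 9},
                    "writeLock": {"time": 450000, "count": 5}},
}
print(top_deltas(before, after))
```

Against a live server, each sample would come from something like `client.admin.command("top")["totals"]` with PyMongo, taken one polling interval apart. The `top` command is served by `mongod`, so in a sharded deployment it would have to be run against the shard members rather than `mongos`.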

WooCommerce Application-Level Monitoring

Beyond the database, monitoring the WooCommerce application itself is paramount. This involves tracking request latency, error rates, resource utilization (CPU, memory, network) of your web servers, and application-specific metrics.

Key WooCommerce Application Metrics

HTTP Request Latency: Average and percentile (p95, p99) response times for key WooCommerce endpoints (e.g., product pages, cart, checkout, API calls).
Error Rates: Percentage of HTTP 5xx and 4xx errors. Monitor specific WooCommerce error codes.
PHP-FPM/Web Server Metrics: Active processes, request queue length, slow requests.
Application Cache Hit/Miss Ratio: For Redis or Memcached used by WooCommerce.
Background Job Status: Monitoring the success/failure rate of WooCommerce background tasks (e.g., order processing, email sending).
Resource Utilization: CPU, memory, disk I/O, and network traffic on your web servers.

Nginx Configuration for Request Logging and Metrics

Nginx can be configured to log detailed request information, which can then be parsed by tools like `goaccess` or ingested by your monitoring system. We’ll focus on enabling a custom log format that captures relevant metrics.

# In your nginx.conf or site-specific conf file
http {
    # ... other http configurations ...

    # Define a custom log format for detailed request logging
    log_format wc_detailed '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for" '
                          '$request_time $upstream_response_time $upstream_addr';

    server {
        listen 80;
        server_name your-woocommerce-domain.com;
        root /var/www/html/your-woocommerce-app;
        index index.php index.html index.htm;

        access_log /var/log/nginx/woocommerce_access.log wc_detailed;
        error_log /var/log/nginx/woocommerce_error.log;

        location / {
            try_files $uri $uri/ /index.php?$args;
        }

        location ~ \.php$ {
            # Standard FastCGI parameters shipped with Nginx
            include fastcgi_params;
            # Adjust fastcgi_pass to your PHP-FPM socket or address
            fastcgi_pass unix:/var/run/php/php8.1-fpm.sock;
            fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
            fastcgi_index index.php;
        }

        # ... other server configurations ...
    }
}

With this configuration, your /var/log/nginx/woocommerce_access.log will contain entries like:

192.168.1.100 - - [10/Oct/2023:10:30:00 +0000] "GET /shop/product/awesome-widget HTTP/1.1" 200 1543 "-" "Mozilla/5.0 ..." "-" 0.123 0.110 unix:/var/run/php/php8.1-fpm.sock

The $request_time and $upstream_response_time fields are critical for latency monitoring. You can use tools like Filebeat or Fluentd to ship these logs to a centralized logging system (e.g., Elasticsearch, Splunk) for analysis and alerting.
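Before wiring up a full log pipeline, the `wc_detailed` format can be parsed with a short script to get latency percentiles and a 5xx error rate. The following is a minimal sketch that assumes the exact field order defined in the `log_format` above; the regular expression, function names, and nearest-rank percentile method are illustrative choices, and lines that do not match (for example, requests that touched several upstreams) are simply skipped.

```python
import math
import re

# Mirrors the field order of the wc_detailed log_format.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)" "(?P<forwarded_for>[^"]*)" '
    r'(?P<request_time>[\d.]+) (?P<upstream_response_time>\S+) (?P<upstream_addr>\S+)$'
)

def parse_line(line):
    """Return the fields needed for monitoring, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "request": match["request"],
        "status": int(match["status"]),
        "request_time": float(match["request_time"]),
    }

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(lines):
    """Request count, 5xx error rate, and p95/p99 latency for a batch of log lines."""
    entries = [entry for entry in map(parse_line, lines) if entry]
    if not entries:
        return None
    times = [entry["request_time"] for entry in entries]
    errors = sum(1 for entry in entries if entry["status"] >= 500)
    return {
        "requests": len(entries),
        "error_rate_5xx": errors / len(entries),
        "p95": percentile(times, 95),
        "p99": percentile(times, 99),
    }

sample = ('192.168.1.100 - - [10/Oct/2023:10:30:00 +0000] '
          '"GET /shop/product/awesome-widget HTTP/1.1" 200 1543 "-" '
          '"Mozilla/5.0 ..." "-" 0.123 0.110 127.0.0.1:9000')
print(parse_line(sample))
```

Calling `summarize` on the lines written during the last minute or so gives numbers that map directly onto latency and error-rate alert thresholds.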

PHP-FPM Monitoring

PHP-FPM is the workhorse for PHP applications. Monitoring its status is vital. You can enable its status page for real-time insights.

; In your php-fpm pool configuration (e.g., /etc/php/8.1/fpm/pool.d/www.conf)
; Enable the status page
pm.status_path = /status

; For security, restrict access to the status page.
; Example Nginx location (in your site configuration) that forwards /status
; to PHP-FPM and limits who can reach it. The location must match pm.status_path.
;
; location = /status {
;     include fastcgi_params;
;     fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
;     fastcgi_pass unix:/var/run/php/php8.1-fpm.sock;
;
;     # Restrict access to specific IPs or the internal network
;     allow 192.168.0.0/16;
;     deny all;
; }

Accessing http://your-woocommerce-domain.com/status (or the proxied URL) will show metrics like:

pool:                 www
process manager:      dynamic
accepted conn:        13000
listen queue:         0
max listen queue:     0
listen queue len:     0
idle processes:       0
active processes:     10
total processes:      10
max children reached: 500
slow requests:        0

Key indicators here are max children reached (indicates you might need to increase pm.max_children), listen queue (requests waiting for a free process), and slow requests.
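These indicators are easy to check automatically. The sketch below is one way to do it, assuming the plain-text `key: value` layout of the status page (the page can also return JSON when requested with `?json`); the function names and the wording of the warnings are hypothetical.

```python
def parse_fpm_status(text):
    """Parse the plain-text PHP-FPM status page into a dict, converting integers."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        value = value.strip()
        status[key.strip()] = int(value) if value.isdigit() else value
    return status

def fpm_warnings(status):
    """Return human-readable warnings for the key indicators."""
    warnings = []
    if status.get("listen queue", 0) > 0:
        warnings.append(f"{status['listen queue']} requests waiting in the listen queue")
    if status.get("max children reached", 0) > 0:
        warnings.append(f"max children reached {status['max children reached']} times")
    if status.get("slow requests", 0) > 0:
        warnings.append(f"{status['slow requests']} slow requests logged")
    if status.get("idle processes", 1) == 0:
        warnings.append("no idle worker processes")
    return warnings

# Illustrative status text in the assumed layout.
sample_status = """pool:                 www
process manager:      dynamic
accepted conn:        13000
listen queue:         0
idle processes:       0
active processes:     10
total processes:      10
max children reached: 500
slow requests:        0
"""
print(fpm_warnings(parse_fpm_status(sample_status)))
```

Note that `max children reached` and `slow requests` are cumulative since PHP-FPM started, so an alert is usually better based on how much they grew between two polls than on the absolute value.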

OVH-Specific Considerations and Alerts

OVH provides its own set of monitoring tools and metrics for its infrastructure. Integrating these with your application-level monitoring gives a holistic view.

OVH Public Cloud Monitoring Metrics

Instance Metrics: CPU utilization, network I/O, disk I/O, memory usage for your VMs.
Load Balancer Metrics: Request rates, backend health, latency.
Database Service Metrics: For managed PostgreSQL, MySQL, or MongoDB services (if used).

You can access these metrics via the OVHcloud Control Panel or programmatically using the OVHcloud API. For automated alerting, consider setting up the alerts below.

Essential Alerts to Configure

MongoDB Cluster Health:
- Primary/Secondary status changes (failover events).
- Replication lag exceeding a defined threshold (e.g., 60 seconds).
- High CPU/Memory utilization on MongoDB nodes (e.g., > 85% for sustained periods).
- Disk space running low on MongoDB data/log volumes.
- Connection errors or high connection counts.
WooCommerce Application Health:
- High HTTP 5xx error rates (e.g., > 1% of total requests).
- High average/p99 request latency (e.g., > 2 seconds for checkout).
- PHP-FPM listen queue consistently above zero.
- Nginx worker process issues or high error logs.
- Application cache (Redis/Memcached) high memory usage or low hit ratio.
- Critical background job failures.
OVH Infrastructure Health:
- VM CPU/Memory utilization exceeding critical thresholds (e.g., > 90%).
- Network ingress/egress saturation.
- Disk I/O wait times or saturation.
- Load Balancer backend health checks failing.
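As an illustration of how one of these alerts might be evaluated, the sketch below computes replication lag from the `members` array returned by MongoDB's `replSetGetStatus` command and flags secondaries above the 60-second example threshold. It relies on the `name`, `stateStr`, and `optimeDate` fields of each member; the function names and host names are hypothetical, and the input is a plain list so the logic can be exercised without a cluster.

```python
from datetime import datetime, timedelta

def replication_lag_seconds(members):
    """Seconds each SECONDARY is behind the PRIMARY, or None if there is no primary."""
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        return None  # a missing primary is itself an alert condition
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }

def lagging_members(members, threshold_seconds=60):
    """Names of secondaries whose lag exceeds the threshold, sorted alphabetically."""
    lags = replication_lag_seconds(members)
    if lags is None:
        return None
    return sorted(name for name, lag in lags.items() if lag > threshold_seconds)

# Illustrative replica set state (made-up hosts and timestamps).
now = datetime(2023, 10, 10, 10, 30, 0)
members = [
    {"name": "mongodb1:27017", "stateStr": "PRIMARY", "optimeDate": now},
    {"name": "mongodb2:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=5)},
    {"name": "mongodb3:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=90)},
]
print(lagging_members(members))
```

With PyMongo, the live input would come from something like `client.admin.command("replSetGetStatus")["members"]`. On a replica set with very few writes, the measured lag can look larger than it really is, so it may be worth requiring the threshold to be exceeded on several consecutive checks before alerting.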

By combining granular MongoDB metrics, detailed application logs, and OVH infrastructure-level data, you can build a robust monitoring strategy that keeps your WooCommerce application stable and performant.