Server Monitoring Best Practices: Keeping Your PHP App and MongoDB Clusters Alive on Linode

Proactive Health Checks for PHP Applications

Maintaining the health of a PHP application goes beyond simply checking if the web server is responding. We need to ensure the application itself is functioning correctly, processing requests efficiently, and not leaking resources. This involves a multi-layered approach, starting with basic HTTP checks and extending to application-specific metrics.

HTTP Endpoint Health Checks

A fundamental check is to ensure your web server (e.g., Nginx or Apache) is serving your PHP application. This can be done by periodically fetching a dedicated health check endpoint. This endpoint should perform minimal, quick checks like database connectivity and essential configuration validation.

For example, create a healthcheck.php file in your application’s public directory:

<?php
header('Content-Type: application/json');

$response = ['status' => 'ok', 'timestamp' => time()];

// Basic check: Database connectivity (example using PDO)
try {
    // Replace with your actual database connection details
    $db = new PDO('mysql:host=localhost;dbname=your_db', 'user', 'password', [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
        PDO::ATTR_TIMEOUT => 2 // 2-second timeout
    ]);
    $db->query('SELECT 1'); // Simple query to test connection
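} catch (PDOException $e) {
    // Log the details server-side; do not expose connection errors to clients
    error_log('Health check: database connection failed: ' . $e->getMessage());
    http_response_code(503); // Service Unavailable
    echo json_encode(['status' => 'error', 'message' => 'Database connection failed']);
    exit;
}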

// Add other critical checks here (e.g., cache connectivity, essential service availability)

echo json_encode($response);
?>

You can then use a tool like curl or a dedicated monitoring agent to poll this endpoint. For instance, using curl in a cron job:

#!/bin/bash

HEALTHCHECK_URL="http://your-app-domain.com/healthcheck.php"
EXPECTED_STATUS_CODE=200
EXPECTED_JSON_KEY="status"
EXPECTED_JSON_VALUE="ok"
LOG_FILE="/var/log/php_healthcheck.log"

echo "$(date '+%Y-%m-%d %H:%M:%S') - Checking health of $HEALTHCHECK_URL" >> "$LOG_FILE"

# Fetch the body and status code in a single request; curl appends the code as the last line
RESPONSE=$(curl -s --max-time 10 -w '\n%{http_code}' "$HEALTHCHECK_URL")
HTTP_RESPONSE=$(printf '%s\n' "$RESPONSE" | tail -n 1)
JSON_RESPONSE=$(printf '%s\n' "$RESPONSE" | sed '$d')

if [ "$HTTP_RESPONSE" != "$EXPECTED_STATUS_CODE" ]; then
    echo "$(date '+%Y-%m-%d %H:%M:%S') - ERROR: Unexpected HTTP status code: $HTTP_RESPONSE" >> "$LOG_FILE"
    # Trigger alert here (e.g., send email, Slack notification)
    exit 1
fi

# Verify that the JSON body reports the expected status
if ! printf '%s' "$JSON_RESPONSE" | grep -q "\"$EXPECTED_JSON_KEY\":\"$EXPECTED_JSON_VALUE\""; then
    echo "$(date '+%Y-%m-%d %H:%M:%S') - ERROR: Unexpected JSON response: $JSON_RESPONSE" >> "$LOG_FILE"
    # Trigger alert here
    exit 1
fi

echo "$(date '+%Y-%m-%d %H:%M:%S') - SUCCESS: Application health check passed." >> "$LOG_FILE"
exit 0

This script should be scheduled via cron to run at regular intervals (e.g., every minute).
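The grep match in the script above depends on the exact byte layout of the JSON (no spaces after the colon). Parsing the body is a little more forgiving. The following is a minimal sketch using only the Python standard library; the function names evaluate_health and poll are illustrative and not part of any existing tool, and the expected payload shape is assumed from healthcheck.php above.

```python
import json
import urllib.error
import urllib.request


def evaluate_health(status_code, body):
    """Return (healthy, reason) for an HTTP status code and response body."""
    if status_code != 200:
        return False, f"unexpected HTTP status {status_code}"
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return False, "response body is not valid JSON"
    if not isinstance(payload, dict) or payload.get("status") != "ok":
        return False, f"unexpected payload: {payload!r}"
    return True, "ok"


def poll(url, timeout=5.0):
    """Fetch the endpoint once; network failures count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return evaluate_health(resp.status, body)
    except urllib.error.HTTPError as exc:
        # Non-2xx responses (such as the 503 from healthcheck.php) arrive here.
        return evaluate_health(exc.code, exc.read().decode("utf-8", "replace"))
    except (urllib.error.URLError, OSError) as exc:
        return False, f"request failed: {exc}"
```

Calling poll("http://your-app-domain.com/healthcheck.php") from a scheduled job returns a (healthy, reason) pair that can be logged or passed to an alerting hook.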

Application Performance Monitoring (APM) Integration

For deeper insights, integrate an Application Performance Monitoring (APM) tool. Tools like New Relic, Datadog, or even open-source solutions like Prometheus with the PHP-FPM exporter can provide invaluable metrics on request latency, error rates, transaction traces, and resource utilization (CPU, memory) specific to your PHP processes.

Key PHP Metrics to Monitor:

Request Latency: Average and percentile (p95, p99) response times for key transactions.
Error Rate: Percentage of requests resulting in PHP errors (e.g., E_ERROR, E_WARNING, E_NOTICE) or HTTP 5xx errors.
Throughput: Requests per minute (RPM).
PHP-FPM Pool Metrics: Active processes, idle processes, queue length, slow requests.
Memory Usage: Peak memory usage per request and overall PHP process memory.
CPU Usage: CPU time consumed by PHP processes.

Configure your APM agent to collect these metrics and set up alerts for anomalies. For example, an alert for a sustained increase in average request latency or a spike in error rates is crucial.
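To make the latency and error-rate metrics concrete, here is a small sketch of how they could be computed for one monitoring window. It assumes you already have a list of per-request latencies in milliseconds and the matching HTTP status codes; the function names and the nearest-rank percentile method are illustrative choices, and an APM agent would normally do this for you.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (0 < pct <= 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, status_codes):
    """Summarize one window of request latencies and HTTP status codes."""
    errors = sum(1 for code in status_codes if code >= 500)
    return {
        "avg_ms": sum(latencies_ms) / len(latencies_ms),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "error_rate": errors / len(status_codes),
    }
```

Percentiles are worth tracking alongside the average because a handful of very slow requests can leave the average nearly unchanged while p99 climbs sharply.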

MongoDB Cluster Health and Performance Monitoring

MongoDB clusters, especially replica sets and sharded clusters, require diligent monitoring to ensure data availability, consistency, and optimal performance. A managed MongoDB service simplifies some aspects, but understanding the underlying metrics is vital for advanced troubleshooting and capacity planning.

Replica Set Health

The health of a replica set is paramount. Key indicators include the state of each member, replication lag, and oplog window.

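Use the rs.status() command in the MongoDB shell to get a comprehensive overview. Current MongoDB releases ship mongosh; the legacy mongo shell was removed in MongoDB 6.0. Omitting --password makes the shell prompt for it, which keeps the password out of your shell history and process list:

mongosh --host mongodb0.example.net --port 27017 --username admin --authenticationDatabase admin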

rs.status()

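Key fields to scrutinize in rs.status() output:

members[].stateStr: Should be ‘PRIMARY’, ‘SECONDARY’, or ‘ARBITER’. States such as ‘STARTUP2’ or ‘RECOVERING’ are normal for a short time during initial sync or after a restart; if they persist, or if a member is in ‘ROLLBACK’ or is down (reported as ‘(not reachable/healthy)’), investigate.
members[].health: Should be 1 (healthy). 0 indicates that the member is unreachable or unhealthy.
members[].optimeDate: The time of the last oplog entry the member has applied. rs.status() does not report lag directly; the replication lag of a secondary is the primary’s optimeDate minus the secondary’s optimeDate. A consistently high or increasing lag is a major concern. rs.printSecondaryReplicationInfo() prints this value for each secondary.

Oplog details are not part of rs.status(). Use rs.printReplicationInfo() or db.getReplicationInfo() instead:

logSizeMB and usedMB: The configured and currently used oplog size.
tFirst and tLast: The timestamps of the oldest and newest entries in the oplog.
timeDiff: The oplog window in seconds, i.e., tLast minus tFirst. If a secondary falls further behind than this window, it can no longer catch up from the oplog and needs a full resync, so a small window is risky when writes are heavy.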

Automate the polling of these metrics. You can script this using the MongoDB shell or a driver (e.g., Python’s pymongo) and send the data to your monitoring system (e.g., Prometheus, InfluxDB).

Example Python script snippet using pymongo:

from pymongo import MongoClient

# Replace with your MongoDB connection string
client = MongoClient('mongodb://admin:your_password@mongodb0.example.net:27017/?authSource=admin')

try:
    rs_status = client.admin.command('replSetGetStatus')
    members = rs_status['members']

    # Lag is measured against the primary's last applied operation, not the wall clock
    primary = next((m for m in members if m['stateStr'] == 'PRIMARY'), None)

    for member in members:
        member_id = member['_id']
        state_str = member['stateStr']
        health = member['health']

        # Calculate lag for secondaries (optimeDate is returned as a datetime)
        replication_lag_sec = 0.0
        if state_str == 'SECONDARY' and primary is not None:
            replication_lag_sec = (primary['optimeDate'] - member['optimeDate']).total_seconds()

        print(f"Member: {member_id}, State: {state_str}, Health: {health}, Lag: {replication_lag_sec:.2f}s")

        # Send metrics to your monitoring system (e.g., Prometheus Pushgateway, InfluxDB)
        # Example: push_metric('mongodb_member_health', {'member': member_id, 'state': state_str}, health)
        # Example: push_metric('mongodb_member_replication_lag_seconds', {'member': member_id}, replication_lag_sec)

    # Oplog window: time span between the oldest and newest entries in local.oplog.rs
    oplog = client.local['oplog.rs']
    first_entry = oplog.find_one(sort=[('$natural', 1)])
    last_entry = oplog.find_one(sort=[('$natural', -1)])
    # 'ts' is a BSON Timestamp; .time is seconds since the epoch
    oplog_window_sec = last_entry['ts'].time - first_entry['ts'].time

    print(f"Oplog Window: {oplog_window_sec}s")
    # Example: push_metric('mongodb_oplog_window_seconds', {}, oplog_window_sec)

except Exception as e:
    print(f"Error fetching replica set status: {e}")
finally:
    client.close()
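Once the status document is available, the checks described in this section can be turned into alert conditions. The sketch below works on a plain dictionary shaped like the replSetGetStatus result, so it can be tested without a live cluster. The function name replica_set_alerts and the 30-second default lag threshold are assumptions for illustration; lag is measured as the primary's optimeDate minus the secondary's.

```python
from datetime import datetime, timedelta, timezone

HEALTHY_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}


def replica_set_alerts(rs_status, max_lag_sec=30.0):
    """Return a list of problems found in a replSetGetStatus-style document."""
    alerts = []
    members = rs_status.get("members", [])
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        alerts.append("no PRIMARY in replica set")
    for m in members:
        name = m.get("name", m.get("_id"))
        if m.get("health") != 1:
            alerts.append(f"{name}: unhealthy")
        if m.get("stateStr") not in HEALTHY_STATES:
            alerts.append(f"{name}: state {m.get('stateStr')}")
        if primary is not None and m.get("stateStr") == "SECONDARY":
            lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
            if lag > max_lag_sec:
                alerts.append(f"{name}: replication lag {lag:.0f}s")
    return alerts


if __name__ == "__main__":
    # Hand-built sample document; host names are placeholders.
    now = datetime.now(timezone.utc)
    sample = {
        "members": [
            {"name": "mongodb0.example.net:27017", "health": 1,
             "stateStr": "PRIMARY", "optimeDate": now},
            {"name": "mongodb1.example.net:27017", "health": 1,
             "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=90)},
        ]
    }
    for alert in replica_set_alerts(sample):
        print(alert)
```

Returning a list of strings keeps the function independent of any particular alerting backend; each entry can be forwarded to email, Slack, or a metrics label as needed.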

Performance Metrics

Beyond replica set status, monitor core performance indicators:

Query Performance: Slow queries (use the slowms setting and monitor the slow query log), query execution times, and index usage.
Read/Write Operations: Operations per second (reads, writes), insert/update/delete latency.
Network Traffic: Inbound and outbound network bandwidth.
Disk I/O: Read/write operations per second, latency, and queue depth.
Memory Usage: Resident set size (RSS), virtual memory size (VMS), cache hit rates (e.g., WiredTiger cache).
CPU Usage: System and user CPU time consumed by MongoDB processes.
Connections: Number of active connections, connection pool usage.
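Several of these indicators are exposed by the serverStatus command as cumulative counters, so a rate has to be derived from two snapshots. The following is a rough sketch, assuming the opcounters and connections sub-documents of serverStatus have been read into plain dictionaries; the helper names are illustrative.

```python
OPCOUNTER_KEYS = ("insert", "query", "update", "delete", "getmore", "command")


def ops_per_second(prev, curr, interval_sec):
    """Turn two serverStatus 'opcounters' snapshots into per-second rates."""
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    rates = {}
    for key in OPCOUNTER_KEYS:
        delta = curr.get(key, 0) - prev.get(key, 0)
        # Counters restart from zero when mongod restarts; report 0 for that interval.
        rates[key] = max(delta, 0) / interval_sec
    return rates


def connection_utilization(connections):
    """Fraction of the connection limit in use, from serverStatus 'connections'."""
    current = connections["current"]
    available = connections["available"]
    total = current + available
    return current / total if total else 0.0
```

With pymongo, the snapshots would come from client.admin.command('serverStatus')['opcounters'], taken interval_sec apart.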

Tools like mongostat and mongotop provide real-time insights, but for historical trending and alerting, integrate with a time-series database and dashboarding tool. Linode’s Cloud Manager may offer some basic metrics, but for production environments, a dedicated solution is recommended.

# Real-time statistics for all discovered replica set members, refreshed every 5 seconds
mongostat --host mongodb0.example.net --port 27017 --username admin --password 'your_password' --authenticationDatabase admin --discover 5

# Per-collection read/write time, refreshed every 10 seconds
mongotop --host mongodb0.example.net --port 27017 --username admin --password 'your_password' --authenticationDatabase admin 10

For automated collection, consider using the MongoDB Atlas monitoring tools (if applicable) or setting up the MongoDB Exporter for Prometheus. This exporter can scrape metrics from your MongoDB instances and expose them in a Prometheus-compatible format.

System-Level Monitoring on Linode

Don’t forget the foundational layer: the Linode instances themselves. Comprehensive system monitoring ensures that the underlying infrastructure is healthy and not a bottleneck.

Key Linode Instance Metrics

CPU Utilization: Overall CPU load and per-core usage. High sustained CPU can indicate inefficient application code or insufficient resources.
Memory Usage: RAM usage, swap usage. Excessive swapping is a strong indicator of memory pressure.
Disk I/O: Read/write operations, disk latency, and disk space utilization. Running out of disk space is a critical failure point.
Network Traffic: Inbound/outbound bandwidth, packet loss, network errors.
Process Monitoring: Ensure critical processes (PHP-FPM, Nginx/Apache, MongoDB daemons) are running and not consuming excessive resources.

Linode provides basic monitoring through its Cloud Manager. For more advanced, granular, and centralized monitoring, deploy agents like:

Node Exporter (Prometheus): Collects a wide range of hardware and OS metrics.
Telegraf (InfluxDB): A plugin-driven server agent that can collect metrics and send them to various outputs, including InfluxDB.
Datadog Agent / New Relic Infrastructure Agent: Commercial agents that integrate deeply with their respective platforms.

Configure these agents to collect the metrics listed above and forward them to your central monitoring system. Set up alerts for thresholds like CPU > 90% for 5 minutes, Memory Usage > 85%, Disk Space < 10% free, or critical processes being down.
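A threshold such as "CPU > 90% for 5 minutes" differs from a single-sample check: the alert should fire only when every reading in the window exceeds the limit. The sketch below illustrates that logic, along with a disk-space check, using the Python standard library. The class and function names are hypothetical, and monitoring systems such as Prometheus express the same idea in their own rule syntax.

```python
import shutil
from collections import deque


class SustainedThreshold:
    """Fires only when each of the last `window` readings exceeds `limit`."""

    def __init__(self, limit, window):
        self.limit = limit
        self.samples = deque(maxlen=window)

    def observe(self, value):
        """Record a reading and return True if the alert condition holds."""
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        return full and all(v > self.limit for v in self.samples)


def disk_free_percent(path="/"):
    """Percentage of free space on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100.0


def disk_alert(free_percent, minimum=10.0):
    """True when free space has dropped below `minimum` percent."""
    return free_percent < minimum
```

With one CPU reading per minute, SustainedThreshold(limit=90.0, window=5) corresponds to the five-minute rule above; a single reading below the limit resets the condition until the window fills with high readings again.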

Log Aggregation and Analysis

Centralized logging is indispensable for debugging and identifying root causes of issues. Collect logs from your web server (Nginx/Apache), PHP-FPM, MongoDB, and application logs into a single, searchable location.

Common options include:

ELK Stack (Elasticsearch, Logstash, Kibana): A powerful open-source solution.
Loki (Grafana): A log aggregation system inspired by Prometheus, designed for cost efficiency and ease of operation.
Datadog Logs / Splunk: Commercial log management platforms.

Configure your servers to forward logs using agents like Filebeat (for ELK), Promtail (for Loki), or the respective agents for commercial solutions. Ensure your application logs are structured (e.g., JSON) to facilitate easier parsing and searching.

Example Log Forwarding (using Filebeat for ELK):

# /etc/filebeat/filebeat.yml

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/nginx/*.log
    - /var/log/php-fpm/*.log
    - /var/log/mongodb/mongod.log
    - /var/log/your_app/*.log # Assuming your app logs here

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~

output.elasticsearch:
  hosts: ["your-elasticsearch-host:9200"]
  # username: "elastic"
  # password: "changeme"

# Or for Logstash:
# output.logstash:
#   hosts: ["your-logstash-host:5044"]

Regularly review logs for recurring errors, unusual patterns, or security-related events. Set up alerts based on specific log messages (e.g., critical errors, repeated failed login attempts).
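As an illustration of a log-based alert, the sketch below counts failed login attempts per client IP in a batch of JSON-formatted log lines. The field names event and ip, the value login_failed, and the threshold of 5 are assumptions about the application's log schema rather than anything the tools above define; in practice this rule would usually live in the log platform's own alerting feature.

```python
import json
from collections import Counter


def failed_logins_by_ip(lines, threshold=5):
    """Return {ip: count} for IPs with at least `threshold` failed logins."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # Skip lines that are not valid JSON
        if isinstance(event, dict) and event.get("event") == "login_failed":
            counts[event.get("ip", "unknown")] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}
```

Structured logs make this kind of rule straightforward; with free-form text the same check would need a fragile regular expression per message format.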