Server Monitoring Best Practices: Keeping Your PHP App and MongoDB Clusters Alive on AWS

Proactive MongoDB Cluster Health Checks with CloudWatch Metrics

Maintaining the health of a MongoDB cluster on AWS, especially when serving a high-traffic PHP application, requires more than just reactive alerts. We need to establish baseline metrics and set up proactive anomaly detection. Amazon CloudWatch is our primary tool for this. Beyond the standard EC2 instance metrics, we need to focus on MongoDB-specific operational data. This involves configuring agents to push custom metrics or leveraging CloudWatch Logs to parse MongoDB’s diagnostic output.

A critical set of metrics to monitor includes:

Network In/Out (Bytes): Essential for understanding data transfer volume to/from the cluster.
Disk Read/Write Operations: High I/O can indicate performance bottlenecks.
Disk Read/Write Bytes: Correlate with operations to gauge data throughput.
CPU Utilization (%): Standard but crucial for identifying overloaded nodes.
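Memory Utilization (%): Shows how much headroom remains for the WiredTiger cache and the filesystem cache, and warns of impending swapping. EC2 does not publish this metric by default; the CloudWatch Agent has to collect it.
Network Packets In/Out: Can reveal issues with network saturation or packet loss.
MongoDB WiredTiger Cache Usage (%): A key indicator of how well the working set fits in RAM. WiredTiger starts evicting pages at roughly 80% of the configured cache by default, so usage that sits at or above that level, together with rising reads into the cache, suggests insufficient RAM or queries that scan too much data.
MongoDB WiredTiger Cache Dirty Pages (%): Modified data in the cache that has not yet been written to disk. A persistently high percentage means checkpoints and eviction cannot keep up with the write load, which can eventually throttle application threads.
MongoDB WiredTiger Cache Read/Write Operations: Pages read into and written from the cache, which gives direct insight into disk I/O driven by the storage engine.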
MongoDB Operations (Read/Write/Command): Tracks the overall request load on the database.
MongoDB Network (In/Out): Specific to MongoDB’s network traffic.
MongoDB Locks (Global/Database/Collection): High lock contention is a common cause of performance degradation.
MongoDB Connections (Current/Available): Warns before the connection limit is exhausted.
MongoDB Replication Lag: Critical for ensuring data consistency across replica sets.

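To collect these, we’ll deploy the CloudWatch Agent on the EC2 instances hosting MongoDB. The agent collects host-level metrics itself; MongoDB-specific values have to be fed to it, either through its StatsD or collectd inputs or by shipping logs to CloudWatch Logs. WiredTiger statistics come from the serverStatus command rather than the log file, so they need a small collector or exporter, while log collection is useful for replication events and slow operations.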
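As a rough sketch of where the MongoDB-specific values in the list above come from, the helper below derives cache usage, dirty-cache share, and connection headroom from a serverStatus document. The field names are the ones serverStatus reports; the function name summarize_server_status and the sample numbers are made up for illustration.

```python
def summarize_server_status(status):
    """Derive a few health ratios from a serverStatus document (a dict)."""
    cache = status["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    conns = status["connections"]
    total_conns = conns["current"] + conns["available"]
    return {
        "cache_used_percent": 100.0 * cache["bytes currently in the cache"] / max_bytes,
        "cache_dirty_percent": 100.0 * cache["tracked dirty bytes in the cache"] / max_bytes,
        "connections_used_percent": 100.0 * conns["current"] / total_conns,
    }


# Hypothetical sample shaped like the relevant parts of serverStatus output.
sample_status = {
    "wiredTiger": {"cache": {
        "maximum bytes configured": 8_000_000_000,
        "bytes currently in the cache": 6_000_000_000,
        "tracked dirty bytes in the cache": 400_000_000,
    }},
    "connections": {"current": 200, "available": 800},
}

summary = summarize_server_status(sample_status)
```

Against a live server, the status dict would come from client.admin.command('serverStatus') with pymongo, and each ratio could then be published as a custom metric.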

Configuring CloudWatch Agent for MongoDB Metrics

The CloudWatch Agent configuration file (typically /opt/aws/amazon-cloudwatch-agent/bin/config.json) needs to be updated to collect system-level metrics and ship MongoDB logs. We’ll enable the `statsd` input so that a collector can push WiredTiger metrics to the agent, and configure log file collection for replication status.

System and WiredTiger Metrics via StatsD or collectd

First, ensure the CloudWatch Agent is installed and running, then create or modify its configuration. The agent’s `collectd` input only receives what a separately installed collectd daemon sends it, so on that route you install collectd, configure its MongoDB plugin (typically in /etc/collectd/plugins/mongodb.conf or similar), and have collectd forward its values to the agent.

For this example we take the simpler route: a small collector pushes MongoDB metrics to the agent’s `statsd` input, and the agent ships the MongoDB log to CloudWatch Logs. Another option is a dedicated exporter such as the Prometheus `mongodb_exporter`, which the CloudWatch Agent can scrape through its Prometheus support.

Let’s start with replication lag taken from the log. MongoDB logs replication events, and these can be turned into metrics, although polling the replica set status directly, or sampling `mongostat` or `mongotop` output and forwarding it as custom metrics, is more robust.
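If a collector pushes values to the agent’s StatsD input, the wire format is a short text line sent over UDP. The sketch below builds a gauge line and sends it to the listener address used in this article’s configuration. The metric and tag names are hypothetical, and the `#key:value` tag suffix is the convention the agent is assumed to turn into CloudWatch dimensions.

```python
import socket


def format_statsd_gauge(name, value, tags=None):
    """Build one StatsD gauge line, with optional tags as '#k:v,k:v'."""
    line = f"{name}:{value}|g"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    return line


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send a single StatsD line over UDP (fire-and-forget, no delivery check)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Illustrative names only; MongoDB does not emit these itself.
example_line = format_statsd_gauge(
    "mongodb.wt_cache_used_percent", 75.0, {"replica_set": "myReplicaSet"}
)
```

Calling send_statsd(example_line) on the MongoDB host would hand the value to the agent, which aggregates it and publishes it on its next collection interval.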

Log Parsing for Replication Lag

We’ll configure the CloudWatch Agent to tail MongoDB’s log files and extract replication lag information. This requires defining a log group and a log pattern to capture the relevant data. Assuming your MongoDB logs are in /var/log/mongodb/mongod.log and contain lines like:

2023-10-27T10:30:00.123+0000 I REPL [ReplicationCoordinator] replSetReplicationCoordinator: member has lag of ms

Here’s a snippet of the CloudWatch Agent configuration (config.json) to achieve this:

{
    "agent": {
        "metrics_collection_interval": 60,
        "run_as_user": "cwagent"
    },
    "metrics": {
        "namespace": "MongoDB/Cluster",
        "metrics_collected": {
            "cpu": {
                "measurement": [
                    "cpu_usage_idle",
                    "cpu_usage_user",
                    "cpu_usage_system"
                ],
                "totalcpu": true
            },
            "mem": {
                "measurement": [
                    "mem_used_percent"
                ]
            },
            "diskio": {
                "resources": ["*"],
                "measurement": [
                    "reads",
                    "writes",
                    "read_bytes",
                    "write_bytes"
                ]
            },
            "net": {
                "resources": ["*"],
                "measurement": [
                    "bytes_recv",
                    "bytes_sent"
                ]
            },
            "statsd": {
                "service_address": "127.0.0.1:8125",
                "metrics_collection_interval": 60,
                "metrics_aggregation_interval": 60
            }
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/mongodb/mongod.log",
                        "log_group_name": "MongoDB/Cluster/Logs",
                        "log_stream_name": "{instance_id}",
                        "timezone": "UTC",
                        "multi_line_start_pattern": "^\\d{4}-\\d{2}-\\d{2}T"
                    }
                ]
            }
        },
        "log_stream_name": "{instance_id}"
    }
}

This configuration collects CPU, memory, disk I/O, and network metrics from the host and enables the StatsD listener. The crucial part is the logs.logs_collected.files section: it points to the MongoDB log file, names the log group and stream, and uses multi_line_start_pattern to keep multi-line entries together. Because the agent runs as cwagent, that user needs read access to the log file. The pattern shown matches the timestamp that starts each line in MongoDB’s plain-text log format; MongoDB 4.4 and later write structured JSON logs, where each entry starts with { instead. The agent ships log lines as they are and does not extract values from them, so turning a lag value into a metric requires a CloudWatch Logs metric filter on the log group or a separate parser. A common alternative is a tool that runs rs.status() periodically and pushes the resulting metrics.
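To make the log-parsing idea concrete, here is a minimal sketch of a parser for the sample line above. It assumes the real line carries a numeric value before "ms", which the sample does not show, so the pattern and the parse_lag_ms helper are illustrative only and need adjusting to what your mongod version actually logs.

```python
import re

# Assumed format: the sample line with a number in front of "ms".
LAG_PATTERN = re.compile(r"member has lag of (?P<lag_ms>\d+)\s*ms")


def parse_lag_ms(line):
    """Return the lag in milliseconds from one log line, or None if absent."""
    match = LAG_PATTERN.search(line)
    return int(match.group("lag_ms")) if match else None
```

Lines without a numeric lag value return None, so the caller can skip them rather than publish a misleading zero.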

Advanced MongoDB Replication Monitoring

Replication lag is a critical indicator of data consistency. Relying solely on log parsing can be brittle. A more robust method involves periodically executing rs.status() and processing its output. We can script this using Python and the pymongo library.

Python Script for Replication Status Metrics

This Python script connects to a MongoDB replica set, retrieves the status, and calculates the lag for each secondary member. It then pushes these metrics to CloudWatch using the boto3 SDK.

import logging
import time

import boto3
import pymongo
from pymongo.errors import ConnectionFailure

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# AWS CloudWatch configuration
CLOUDWATCH_NAMESPACE = "MongoDB/Cluster"
CLOUDWATCH_REGION = "us-east-1"  # Replace with your AWS region

# MongoDB connection details
# Ensure your EC2 instance has an IAM role with cloudwatch:PutMetricData permission.
# If MongoDB authentication is enabled, the user also needs a role that permits
# replSetGetStatus, such as clusterMonitor.
MONGO_URI = "mongodb://localhost:27017/"  # Or your replica set connection string
REPLICA_SET_NAME = "myReplicaSet"  # Used as a CloudWatch dimension value


def find_primary(repl_status):
    """Returns the member document of the current primary, or None."""
    # Look the primary up by state; its position in the members list is not fixed.
    for member in repl_status.get('members', []):
        if member.get('stateStr') == 'PRIMARY':
            return member
    return None


def get_replication_lag(primary, member):
    """Calculates how far a secondary is behind the primary, in seconds."""
    primary_optime = primary.get('optimeDate')
    member_optime = member.get('optimeDate')
    if member.get('stateStr') != 'SECONDARY' or not primary_optime or not member_optime:
        return None  # Not a secondary, or an optime is missing
    # Lag cannot be negative; clamp small negative values from timestamp granularity
    return max(0.0, (primary_optime - member_optime).total_seconds())


def push_metric_to_cloudwatch(cloudwatch, metric_name, value, unit='Count', dimensions=None):
    """Pushes a single metric to CloudWatch."""
    try:
        cloudwatch.put_metric_data(
            Namespace=CLOUDWATCH_NAMESPACE,
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Value': value,
                    'Unit': unit,  # 'Seconds', 'Count', 'Percent', etc.
                    'Dimensions': dimensions if dimensions else []
                },
            ]
        )
        logging.info(f"Pushed metric: {metric_name}={value} to CloudWatch.")
    except Exception as e:
        logging.error(f"Failed to push metric {metric_name} to CloudWatch: {e}")


def monitor_replication():
    """Connects to MongoDB, checks replication status, and pushes metrics."""
    client = None
    try:
        client = pymongo.MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
        # The ping command is cheap and does not require auth.
        client.admin.command('ping')
        logging.info("Successfully connected to MongoDB.")

        # replSetGetStatus reports optimeDate for every member, so no second
        # connection to the primary is needed.
        repl_status = client.admin.command('replSetGetStatus')
        primary = find_primary(repl_status)
        if primary is None:
            logging.warning("No primary member found in replica set.")
            return

        cloudwatch = boto3.client('cloudwatch', region_name=CLOUDWATCH_REGION)

        for member in repl_status.get('members', []):
            member_name = member.get('name')
            member_state = member.get('stateStr')
            member_dimensions = [
                {'Name': 'ReplicaSetName', 'Value': REPLICA_SET_NAME},
                {'Name': 'MemberName', 'Value': member_name}
            ]

            if member_state == 'PRIMARY':
                # Push primary status
                push_metric_to_cloudwatch(cloudwatch, "ReplicaSetPrimary", 1, dimensions=member_dimensions)
                continue

            # Push member state (secondaries, arbiters, and members in other states)
            push_metric_to_cloudwatch(
                cloudwatch, "ReplicaSetMemberState",
                1,  # Value is arbitrary, state is in dimensions
                dimensions=member_dimensions + [{'Name': 'MemberState', 'Value': member_state}]
            )

            lag_seconds = get_replication_lag(primary, member)
            if lag_seconds is not None:
                push_metric_to_cloudwatch(
                    cloudwatch, "ReplicationLag", lag_seconds,
                    unit='Seconds', dimensions=member_dimensions
                )
            elif member_state == 'SECONDARY':
                logging.warning(f"Could not calculate lag for member {member_name}.")

    except ConnectionFailure as e:
        logging.error(f"Could not connect to MongoDB: {e}")
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
    finally:
        if client is not None:
            client.close()


if __name__ == "__main__":
    # This script should be run periodically, e.g., via cron or systemd timer
    # For demonstration, we run it once. In production, loop or schedule.
    monitor_replication()

    # Example of running periodically (e.g., every 60 seconds)
    # while True:
    #     monitor_replication()
    #     time.sleep(60)

To deploy this script:

Install necessary libraries: pip install pymongo boto3
Ensure the EC2 instance has an IAM role attached with permissions for cloudwatch:PutMetricData.
Configure the script with your MongoDB URI and desired AWS region.
Schedule the script to run at regular intervals (e.g., every 1-5 minutes) using cron or a systemd timer.

This provides granular, actionable metrics for replication lag, allowing for precise alerting when lag exceeds acceptable thresholds.

PHP Application Performance Monitoring with CloudWatch Logs and Alarms

Your PHP application’s performance is directly tied to the database’s responsiveness. Monitoring the application’s error rates, response times, and resource utilization is crucial. We can leverage CloudWatch Logs to collect application logs and then set up alarms based on log patterns.

Structured Logging in PHP

To effectively monitor your PHP application, implement structured logging. Using a library like Monolog with a JSON formatter is highly recommended. This ensures log entries are machine-readable and easily parsable by CloudWatch Logs.

<?php
require 'vendor/autoload.php'; // Assuming Monolog is installed via Composer

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

// Create a log channel
$log = new Logger('app');

// Create a stream handler that writes to a file the CloudWatch Agent can tail
$handler = new StreamHandler('/var/log/php-app/app.log', Logger::DEBUG);

// Set the formatter to JSON
$handler->setFormatter(new JsonFormatter());

$log->pushHandler($handler);

// Example log entries
$log->info('Application started', ['version' => '1.2.0']);

try {
    // Simulate a database query
    // $db = new PDO(...);
    // $stmt = $db->query('SELECT * FROM users WHERE id = 1');
    // $user = $stmt->fetch();

    // Simulate an error on 20% of requests
    if (random_int(1, 10) <= 2) {
        throw new Exception('Database connection failed');
    }

    // Simulate a successful operation
    $log->info('User data fetched successfully', ['user_id' => 1, 'query_time_ms' => 150]);

} catch (Exception $e) {
    $log->error('An error occurred during user data fetch', [
        'user_id' => 1,
        'error_message' => $e->getMessage(),
        'error_code' => $e->getCode(),
        'trace' => $e->getTraceAsString() // Be cautious with sensitive trace info in production logs
    ]);
}

$log->info('Application finished processing request');
?>

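Ensure your CloudWatch Agent configuration collects logs from the file your PHP application writes to. The agent tails files, not process output, so an application that logs to php://stdout (common in Docker) needs its output redirected to a file or shipped by the container’s log driver (such as awslogs) instead. Add an entry to the collect_list in your agent’s config.json: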

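{
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/php-app/app.log",
                        "log_group_name": "PHPApp/ApplicationLogs",
                        "log_stream_name": "{instance_id}",
                        "timezone": "UTC",
                        "multi_line_start_pattern": "^\\{"
                    }
                ]
            }
        }
    }
}

Only the new entry is shown; keep the other settings and log files already present in your config.json, and change file_path to wherever your logs are written. The agent’s configuration file is plain JSON, so it cannot contain comments. The pattern ^\{ treats every line that begins with { as a new entry, which suits Monolog’s one-JSON-object-per-line output. No JSON parsing needs to be configured in the agent: CloudWatch Logs metric filters can address the fields of JSON log events directly.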

CloudWatch Alarms for Application Errors

Once logs are flowing into CloudWatch Logs, we can create Metric Filters to extract metrics from these logs and then set up Alarms based on these metrics. For example, to count error occurrences.

Step 1: Create a Metric Filter

Navigate to your CloudWatch log group (e.g., PHPApp/ApplicationLogs) in the AWS Console. On the “Metric filters” tab, click “Create metric filter.”

Filter Pattern:

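{ $.level_name = "ERROR" }

This pattern matches log entries whose JSON field level_name equals "ERROR". Monolog’s JsonFormatter writes the severity name to level_name and the numeric severity to level, so a pattern such as { $.level >= 400 } also catches CRITICAL, ALERT, and EMERGENCY entries.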

Metric Details:

Metric Namespace: PHPApp/Metrics
Metric Name: ErrorCount
Metric Value: 1 (Each matching log entry increments the count by 1)
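The same metric filter can be created from code instead of the console. The sketch below builds the arguments for the CloudWatch Logs PutMetricFilter call through boto3; the filter name is a made-up placeholder, and the pattern assumes Monolog’s JsonFormatter output, where the severity name appears in level_name.

```python
def build_error_metric_filter(log_group="PHPApp/ApplicationLogs"):
    """Return keyword arguments for the CloudWatch Logs put_metric_filter call."""
    return {
        "logGroupName": log_group,
        "filterName": "php-app-error-count",  # hypothetical name
        "filterPattern": '{ $.level_name = "ERROR" }',
        "metricTransformations": [{
            "metricName": "ErrorCount",
            "metricNamespace": "PHPApp/Metrics",
            "metricValue": "1",   # each matching entry adds 1
            "defaultValue": 0.0,  # report 0 when nothing matches
        }],
    }


def create_error_metric_filter(region="us-east-1"):
    """Create the filter; needs logs:PutMetricFilter permission."""
    import boto3  # imported lazily so building the arguments needs no AWS SDK
    logs = boto3.client("logs", region_name=region)
    logs.put_metric_filter(**build_error_metric_filter())
```

Setting defaultValue to 0 makes the metric report a value even in quiet periods, which keeps alarms on it from sitting in an insufficient-data state.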

Step 2: Create a CloudWatch Alarm

After creating the metric filter, go to “Alarms” in CloudWatch. Click “Create alarm.”

Select Metric: Choose the metric you just created (e.g., PHPApp/Metrics, ErrorCount).

Define conditions:

Statistic: Sum
Period: 5 minutes (or your desired interval)
Threshold type: Static
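Whenever ErrorCount is: Greater than
Threshold value: 0 (or a specific threshold, e.g., 10 errors in 5 minutes)
Missing data treatment: Treat missing data as good (not breaching), since the metric filter may publish nothing during periods with no matching log entries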

Configure actions: Set up notifications to an SNS topic for alerts (e.g., email, Slack integration).
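The console steps above correspond to a single PutMetricAlarm call. As a sketch, the helper below assembles the arguments with the values from this section; the alarm name is hypothetical, and the SNS topic ARN has to be supplied by the caller.

```python
def build_error_alarm(sns_topic_arn, threshold=0):
    """Return keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": "php-app-error-count",  # hypothetical name
        "Namespace": "PHPApp/Metrics",
        "MetricName": "ErrorCount",
        "Statistic": "Sum",
        "Period": 300,                       # 5 minutes, in seconds
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no matching logs means no errors
        "AlarmActions": [sns_topic_arn],
    }


def create_error_alarm(sns_topic_arn, region="us-east-1"):
    """Create the alarm; needs cloudwatch:PutMetricAlarm permission."""
    import boto3  # imported lazily so building the arguments needs no AWS SDK
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**build_error_alarm(sns_topic_arn))
```

Keeping the argument construction separate from the API call makes it easy to review or unit-test alarm definitions before they reach an AWS account.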

EC2 Instance Health and Performance Tuning

While MongoDB and the PHP application are critical, the underlying EC2 instances must also be healthy. Standard EC2 metrics are a good starting point, but we need to correlate them with application and database performance.

Key EC2 Metrics and Thresholds

Use CloudWatch’s default EC2 metrics, but set custom alarms with appropriate thresholds:

CPU Utilization: Alarm if consistently above 80-90% for extended periods (e.g., 15 minutes). This often indicates an application or database bottleneck.
Memory Utilization: While EC2 doesn’t expose memory usage directly by default (requires CloudWatch Agent), if you are collecting it, alarm if consistently above 85-90%. High memory usage can lead to swapping and severe performance degradation.
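Disk I/O (Read/Write Ops/Bytes): Monitor for unusually high rates that correlate with slow application responses. High I/O can indicate inefficient queries or insufficient IOPS on EBS volumes. Note that the EC2 DiskReadOps and DiskWriteOps metrics cover instance store volumes only; for EBS-backed data volumes, use the EBS volume metrics.
Network In/Out: Alarm on sustained high network traffic that approaches instance or EBS network limits.
Disk Queue Length: Not an EC2 metric; for EBS volumes it is published as VolumeQueueLength. A sustained queue length greater than 2-3 per disk can indicate I/O saturation.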

EBS Volume Performance

For MongoDB, EBS volume performance is paramount. Monitor these EBS-specific metrics:

Volume Read/Write Ops: Track actual IOPS against provisioned IOPS (for io1/gp3) or burst credits (for gp2).
Volume Read/Write Bytes: Track throughput against provisioned throughput (for gp3) or instance limits.
Volume Queue Length: As mentioned, a sustained queue length indicates I/O bottlenecks.
Volume Idle Time: Low idle time suggests the volume is constantly busy.
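CloudWatch reports the EBS operation counters as totals per period, so they have to be divided by the period length before they can be compared with provisioned IOPS. The sketch below shows that arithmetic; the function name and the sample figures are illustrative, and the inputs are assumed to be the Sum statistic of VolumeReadOps, VolumeWriteOps, and VolumeIdleTime for one period.

```python
def ebs_utilization(read_ops, write_ops, idle_seconds, period_seconds, provisioned_iops):
    """Summarize one CloudWatch period of EBS volume activity.

    read_ops / write_ops: Sum of VolumeReadOps / VolumeWriteOps for the period.
    idle_seconds: Sum of VolumeIdleTime for the period.
    """
    avg_iops = (read_ops + write_ops) / period_seconds
    return {
        "avg_iops": avg_iops,
        "iops_used_percent": 100.0 * avg_iops / provisioned_iops,
        "busy_percent": 100.0 * (period_seconds - idle_seconds) / period_seconds,
    }


# Hypothetical 5-minute sample for a volume provisioned with 6000 IOPS.
sample = ebs_utilization(540_000, 360_000, 30, 300, 6000)
```

With these sample numbers the volume averages 3000 IOPS, half of what is provisioned, while being busy 90% of the time, a combination that points toward large or slow requests rather than an exhausted IOPS budget.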

If you observe persistent I/O bottlenecks, consider:

Upgrading EBS volume type (e.g., from gp2 to gp3 or io1/io2).
Increasing provisioned IOPS (gp3, io1, io2) or throughput (gp3).
Optimizing MongoDB queries to reduce I/O load.
Ensuring your EC2 instance type has sufficient network bandwidth for EBS traffic.

Automated Recovery and Health Checks

Proactive monitoring is essential, but automated recovery mechanisms are vital for maintaining high availability.

Auto Scaling Groups and Health Checks

For your PHP application servers, leverage AWS Auto Scaling Groups. Configure:

EC2 Health Checks: Auto Scaling Groups can perform EC2 status checks. If an instance fails, it’s terminated and replaced.
ELB Health Checks: If using an Elastic Load Balancer (ELB), configure health checks against an endpoint your application exposes (e.g., a /health endpoint). The ELB stops sending traffic to unhealthy instances, and the Auto Scaling Group replaces them once its health check type is set to ELB; by default it only acts on EC2 status checks.

For MongoDB, direct Auto Scaling is more complex because the nodes are stateful. Use Auto Scaling Groups for the application tier and rely on the replica set’s own failover for the database tier. If you’re considering a managed service, Amazon DocumentDB (with MongoDB compatibility) offers automatic failover and storage that grows automatically, though it implements the MongoDB API rather than running MongoDB itself, so check feature compatibility first.

Automated MongoDB Failover

MongoDB’s replica set mechanism handles failover automatically. When a primary becomes unreachable, the remaining secondaries elect a new primary. Ensure your replica set configuration is robust:

Sufficient Members: A minimum of 3 voting members is recommended for automatic failover (e.g., Primary, Secondary, Secondary).
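Arbiter: Consider using an arbiter if you cannot have an odd number of data-bearing nodes, but be aware of its limitations: it votes without holding data, so it adds no redundancy, and majority writes can stall when a data-bearing member is down.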
Priority: Configure member priorities to influence failover elections.
Network Latency: Ensure low latency between replica set members, especially across Availability Zones.
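The membership rules above can be checked automatically. The following sketch inspects a replica set configuration document, as returned under the config key of the replSetGetConfig command, and reports deviations; the function name and warning texts are illustrative.

```python
def check_replica_set_config(config):
    """Return a list of warnings for a replica set config document (a dict)."""
    members = config.get("members", [])
    voters = [m for m in members if m.get("votes", 1) > 0]
    # Electable members vote, hold data, and have a non-zero priority.
    electable = [
        m for m in voters
        if not m.get("arbiterOnly", False) and m.get("priority", 1) > 0
    ]
    warnings = []
    if len(voters) < 3:
        warnings.append(f"only {len(voters)} voting member(s); 3 or more are recommended")
    if len(voters) % 2 == 0:
        warnings.append("even number of voting members adds no extra fault tolerance")
    if len(electable) < 2:
        warnings.append("fewer than 2 electable members; no failover target")
    return warnings


# Hypothetical configs using this article's example hostnames.
three_nodes = {"members": [{"host": f"mongo{i}.example.com:27017"} for i in (1, 2, 3)]}
two_nodes = {"members": [{"host": f"mongo{i}.example.com:27017"} for i in (1, 2)]}
```

With pymongo, the input would be client.admin.command('replSetGetConfig')['config']. A check like this fits well into the same scheduled job that reports replication lag.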

Your PHP application should be configured to connect to the replica set using a connection string that allows it to discover the current primary and automatically reconnect. For example:

// Example using MongoDB PHP Driver
$mongoClient = new MongoDB\Client(
    "mongodb://mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/?replicaSet=myReplicaSet&readPreference=primary"
);

// The driver will automatically find the primary and reconnect if needed.
// For read operations, you might use readPreference=secondaryPreferred
// to distribute read load, accepting that secondaries can return stale
// data while replication lags.

By combining comprehensive monitoring with automated recovery strategies, you can build a resilient and highly available PHP application powered by MongoDB on AWS.