Server Monitoring Best Practices: Keeping Your PHP App and MongoDB Clusters Alive on AWS
Proactive MongoDB Cluster Health Checks with CloudWatch Metrics
Maintaining the health of a MongoDB cluster on AWS, especially when serving a high-traffic PHP application, requires more than just reactive alerts. We need to establish baseline metrics and set up proactive anomaly detection. Amazon CloudWatch is our primary tool for this. Beyond the standard EC2 instance metrics, we need to focus on MongoDB-specific operational data. This involves configuring agents to push custom metrics or leveraging CloudWatch Logs to parse MongoDB’s diagnostic output.
A critical set of metrics to monitor includes:
- Network In/Out (Bytes): Essential for understanding data transfer volume to/from the cluster.
- Disk Read/Write Operations: High I/O can indicate performance bottlenecks.
- Disk Read/Write Bytes: Correlate with operations to gauge data throughput.
- CPU Utilization (%): Standard but crucial for identifying overloaded nodes.
- Memory Utilization (%): Especially important for spotting memory pressure and potential swapping. EC2 does not publish it by default, so it has to come from the CloudWatch Agent.
- Network Packets In/Out: Can reveal issues with network saturation or packet loss.
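- MongoDB WiredTiger Cache Usage (%): A key indicator of how effectively MongoDB is using RAM for data caching. Usage that stays near the configured cache limit, together with heavy eviction, might mean insufficient RAM or inefficient queries that pull too much data into cache.
- MongoDB WiredTiger Cache Dirty Pages (%): A sustained high percentage suggests writes are arriving faster than WiredTiger can flush them to disk, which eventually slows application operations.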
- MongoDB WiredTiger Cache Read/Write Operations: Direct insight into disk I/O driven by the storage engine.
- MongoDB Operations (Read/Write/Command): Tracks the overall request load on the database.
- MongoDB Network (In/Out): Specific to MongoDB’s network traffic.
- MongoDB Locks (Global/Database/Collection): High lock contention is a common cause of performance degradation.
- MongoDB Connections (Current/Available): Prevents connection exhaustion.
- MongoDB Replication Lag: Critical for ensuring data consistency across replica sets.
To collect these, we’ll deploy the CloudWatch Agent on our EC2 instances hosting MongoDB. For custom metrics, we can use the agent’s StatsD or collectd input plugins, or parse logs. Let’s focus on log parsing for WiredTiger metrics and replication lag, as these are often logged at a granular level.
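As a rough illustration of where the WiredTiger cache and connection figures in the list above come from, the sketch below derives them from a document shaped like the output of MongoDB's serverStatus command. The field names are the ones serverStatus reports; the sample numbers are hypothetical, and in practice the document would come from something like pymongo's `client.admin.command("serverStatus")`.

```python
def wiredtiger_cache_percentages(server_status):
    """Return (used_pct, dirty_pct) from a serverStatus-style document."""
    cache = server_status["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    if max_bytes <= 0:
        raise ValueError("configured cache size must be positive")
    used_pct = 100.0 * cache["bytes currently in the cache"] / max_bytes
    dirty_pct = 100.0 * cache["tracked dirty bytes in the cache"] / max_bytes
    return used_pct, dirty_pct


def connection_utilization(server_status):
    """Return the percentage of the connection limit currently in use."""
    conns = server_status["connections"]
    total = conns["current"] + conns["available"]
    return 100.0 * conns["current"] / total if total else 0.0


# Hypothetical sample values, shaped like the output of db.serverStatus()
sample = {
    "wiredTiger": {"cache": {
        "maximum bytes configured": 8 * 1024**3,
        "bytes currently in the cache": 6 * 1024**3,
        "tracked dirty bytes in the cache": 512 * 1024**2,
    }},
    "connections": {"current": 200, "available": 600},
}

used_pct, dirty_pct = wiredtiger_cache_percentages(sample)
conn_pct = connection_utilization(sample)
print(f"cache used {used_pct:.2f}%, dirty {dirty_pct:.2f}%, connections {conn_pct:.2f}%")
```

Publishing percentages rather than raw byte counts keeps alarm thresholds portable across instance sizes.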
Configuring CloudWatch Agent for MongoDB Metrics
The CloudWatch Agent configuration file (typically /opt/aws/amazon-cloudwatch-agent/bin/config.json) needs to be updated to collect system-level metrics, accept custom MongoDB metrics, and ship MongoDB logs. Custom metrics such as WiredTiger statistics can arrive through the agent’s `collectd` or `statsd` input; the example configuration in this section uses `statsd`, and a log file entry covers replication status.
System and WiredTiger Metrics via collectd
First, ensure the CloudWatch Agent is installed and running. Then, create or modify the agent configuration. System metrics come from the agent itself; the MongoDB metrics need a separate source. With collectd, you would typically install `collectd`, configure a MongoDB plugin in /etc/collectd/plugins/mongodb.conf (or similar), and have it send to the agent’s `collectd` input. If your MongoDB monitoring tooling can emit StatsD, the agent’s `statsd` input is often the more direct route.
For this example, let’s assume MongoDB metrics arrive via StatsD and that the MongoDB log file is collected for replication events.
Let’s illustrate log collection for replication lag. MongoDB logs replication events, and we can ship and filter these. A more robust method is to sample `mongostat` or `mongotop` output in a small script that forwards the values as StatsD metrics, or to run a dedicated exporter such as the Prometheus `mongodb_exporter`, which the CloudWatch Agent can then scrape through its Prometheus support.
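To make the StatsD route concrete, here is a minimal sketch of a sender that formats gauges in the plain StatsD wire format and sends them over UDP to the address the agent's `statsd` listener would use. The metric names are hypothetical, and the host and port are assumptions that must match the `service_address` in your agent configuration.

```python
import socket


def statsd_gauge(name, value):
    """Format one StatsD gauge datagram: <name>:<value>|g"""
    return f"{name}:{value}|g".encode("ascii")


def send_gauges(metrics, host="127.0.0.1", port=8125):
    """Send each metric as its own UDP datagram and return the payloads sent."""
    payloads = [statsd_gauge(name, value) for name, value in metrics.items()]
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for payload in payloads:
            sock.sendto(payload, (host, port))
    return payloads


if __name__ == "__main__":
    # Hypothetical metric names and values
    send_gauges({
        "mongodb.wiredtiger.cache_used_pct": 75.0,
        "mongodb.connections.current": 200,
    })
```

UDP is fire-and-forget, so a stopped agent does not block or crash the sender; the trade-off is that dropped datagrams go unnoticed.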
Log Parsing for Replication Lag
We’ll configure the CloudWatch Agent to tail MongoDB’s log files and extract replication lag information. This requires defining a log group and a log pattern to capture the relevant data. Assuming your MongoDB logs are in /var/log/mongodb/mongod.log and contain lines like:
2023-10-27T10:30:00.123+0000 I REPL [ReplicationCoordinator] replSetReplicationCoordinator: member
Here’s a snippet of the CloudWatch Agent configuration (config.json) to achieve this:
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MongoDB/Cluster",
    "metrics_collected": {
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_user", "cpu_usage_system"],
        "totalcpu": true
      },
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "diskio": {
        "measurement": ["reads", "writes", "read_bytes", "write_bytes"],
        "resources": ["*"]
      },
      "net": {
        "measurement": ["bytes_recv", "bytes_sent"],
        "resources": ["*"]
      },
      "statsd": {
        "service_address": "127.0.0.1:8125",
        "metrics_collection_interval": 60,
        "metrics_aggregation_interval": 60
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/mongodb/mongod.log",
            "log_group_name": "MongoDB/Cluster/Logs",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC",
            "multi_line_start_pattern": "^\\d{4}-\\d{2}-\\d{2}T"
          }
        ]
      }
    },
    "log_stream_name": "{instance_id}"
  }
}
This configuration collects host-level metrics and enables the StatsD listener. The crucial part is the logs.logs_collected.files section. It points to the MongoDB log file, defines a log group, and uses a regular expression on the leading timestamp to decide where each log event starts. The agent only ships log lines; it does not turn them into numbers. To derive replication lag from logs, we’d need a metric filter on the log group or a custom processor with a more sophisticated regex. A common alternative is a tool that parses the output of rs.status() periodically and pushes those metrics.
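As a starting point for the more sophisticated regex mentioned above, the sketch below splits a plain-text mongod log line of the shape shown earlier into timestamp, severity, component, context, and message. It assumes the pre-4.4 text log format; MongoDB 4.4 and later write structured JSON logs, for which `json.loads` would replace the regex. The function name is hypothetical.

```python
import re
from datetime import datetime

# timestamp, one-letter severity, component, [context], then the message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<severity>[A-Z])\s+(?P<component>\S+)\s+"
    r"\[(?P<context>[^\]]+)\]\s+(?P<message>.*)$"
)


def parse_mongod_line(line):
    """Return the fields of a text-format mongod log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Example timestamp: 2023-10-27T10:30:00.123+0000
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%dT%H:%M:%S.%f%z"
    )
    return fields


sample = (
    "2023-10-27T10:30:00.123+0000 I REPL [ReplicationCoordinator] "
    "replSetReplicationCoordinator: member"
)
parsed = parse_mongod_line(sample)
print(parsed["component"], parsed["context"], parsed["message"])
```

Filtering on the REPL component first keeps any lag-specific pattern small, since only replication messages need further matching.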
Advanced MongoDB Replication Monitoring
Replication lag is a critical indicator of data consistency. Relying solely on log parsing can be brittle. A more robust method involves periodically executing rs.status() and processing its output. We can script this using Python and the pymongo library.
Python Script for Replication Status Metrics
This Python script connects to a MongoDB replica set, retrieves the status, and calculates the lag for each secondary member. It then pushes these metrics to CloudWatch using the boto3 SDK.
import logging
import time  # Only needed for the optional polling loop at the bottom

import boto3
import pymongo
from pymongo.errors import ConnectionFailure

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# AWS CloudWatch configuration
CLOUDWATCH_NAMESPACE = "MongoDB/Cluster"
CLOUDWATCH_REGION = "us-east-1"  # Replace with your AWS region

# MongoDB connection details
# Ensure your EC2 instance has an IAM role with CloudWatch PutMetricData permissions
MONGO_URI = "mongodb://localhost:27017/"  # Or your replica set connection string
REPLICA_SET_NAME = "myReplicaSet"  # Used as a CloudWatch dimension value


def get_replication_lag(repl_status, secondary_host):
    """Returns the lag in seconds for a secondary member, or None if unavailable."""
    members = repl_status.get('members', [])
    # Locate the primary by state; its position in the members array is not guaranteed.
    primary = next((m for m in members if m.get('stateStr') == 'PRIMARY'), None)
    member = next((m for m in members if m.get('name') == secondary_host), None)
    if primary is None or member is None or member.get('stateStr') != 'SECONDARY':
        return None  # No primary, member not found, or not a secondary
    primary_optime = primary.get('optimeDate')
    secondary_optime = member.get('optimeDate')
    if not primary_optime or not secondary_optime:
        return None
    # Clamp at zero: heartbeat timing can make a secondary appear slightly ahead.
    return max(0.0, (primary_optime - secondary_optime).total_seconds())


def push_metric_to_cloudwatch(metric_name, value, dimensions=None, unit='Count'):
    """Pushes a single metric to CloudWatch."""
    cloudwatch = boto3.client('cloudwatch', region_name=CLOUDWATCH_REGION)
    try:
        cloudwatch.put_metric_data(
            Namespace=CLOUDWATCH_NAMESPACE,
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Value': value,
                    'Unit': unit,  # e.g. 'Seconds', 'Count', 'Percent'
                    'Dimensions': dimensions if dimensions else []
                },
            ]
        )
        logging.info(f"Pushed metric: {metric_name}={value} to CloudWatch.")
    except Exception as e:
        logging.error(f"Failed to push metric {metric_name} to CloudWatch: {e}")


def monitor_replication():
    """Connects to MongoDB, checks replication status, and pushes metrics."""
    client = None
    primary_client = None
    try:
        client = pymongo.MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
        # The ping command is cheap and does not require auth.
        client.admin.command('ping')
        logging.info("Successfully connected to MongoDB.")
        repl_status = client.admin.command('replSetGetStatus')

        primary_member_name = next(
            (m.get('name') for m in repl_status.get('members', [])
             if m.get('stateStr') == 'PRIMARY'),
            None,
        )
        if not primary_member_name:
            logging.warning("No primary member found in replica set.")
            return

        # Re-read the status from the primary itself so optimeDate values are as fresh as possible
        primary_client = pymongo.MongoClient(
            f"mongodb://{primary_member_name}/",
            directConnection=True,
            serverSelectionTimeoutMS=5000,
        )
        repl_status = primary_client.admin.command('replSetGetStatus')

        base_dimensions = [{'Name': 'ReplicaSetName', 'Value': REPLICA_SET_NAME}]

        # Push primary status
        push_metric_to_cloudwatch(
            metric_name="ReplicaSetPrimary",
            value=1,
            dimensions=base_dimensions + [{'Name': 'MemberName', 'Value': primary_member_name}],
        )

        for member in repl_status.get('members', []):
            member_name = member.get('name')
            member_state = member.get('stateStr')
            if member_state == 'PRIMARY':
                continue
            # Push member state (covers secondaries, arbiters, and unhealthy states)
            push_metric_to_cloudwatch(
                metric_name="ReplicaSetMemberState",
                value=1,  # Value is arbitrary, state is in dimensions
                dimensions=base_dimensions + [
                    {'Name': 'MemberName', 'Value': member_name},
                    {'Name': 'MemberState', 'Value': member_state},
                ],
            )
            if member_state != 'SECONDARY':
                continue
            # Calculate lag against the primary's optimeDate
            lag_seconds = get_replication_lag(repl_status, member_name)
            if lag_seconds is None:
                logging.warning(f"Could not calculate replication lag for member {member_name}.")
                continue
            push_metric_to_cloudwatch(
                metric_name="ReplicationLag",
                value=lag_seconds,
                unit='Seconds',
                dimensions=base_dimensions + [{'Name': 'MemberName', 'Value': member_name}],
            )
    except ConnectionFailure as e:
        logging.error(f"Could not connect to MongoDB: {e}")
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
    finally:
        if client:
            client.close()
        if primary_client:
            primary_client.close()


if __name__ == "__main__":
    # This script should be run periodically, e.g., via cron or systemd timer.
    # For demonstration, we run it once. In production, loop or schedule.
    monitor_replication()
    # Example of running periodically (e.g., every 60 seconds)
    # while True:
    #     monitor_replication()
    #     time.sleep(60)
To deploy this script:
- Install necessary libraries: pip install pymongo boto3
- Ensure the EC2 instance has an IAM role attached with permissions for cloudwatch:PutMetricData.
- Configure the script with your MongoDB URI and desired AWS region.
- Schedule the script to run at regular intervals (e.g., every 1-5 minutes) using cron or a systemd timer.
This provides granular, actionable metrics for replication lag, allowing for precise alerting when lag exceeds acceptable thresholds.
PHP Application Performance Monitoring with CloudWatch Logs and Alarms
Your PHP application’s performance is directly tied to the database’s responsiveness. Monitoring the application’s error rates, response times, and resource utilization is crucial. We can leverage CloudWatch Logs to collect application logs and then set up alarms based on log patterns.
Structured Logging in PHP
To effectively monitor your PHP application, implement structured logging. Using a library like Monolog with a JSON formatter is highly recommended. This ensures log entries are machine-readable and easily parsable by CloudWatch Logs.
<?php
require 'vendor/autoload.php'; // Assuming Monolog is installed via Composer

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

// Create a log channel
$log = new Logger('app');

// Create a stream handler for stdout. The CloudWatch Agent tails files, so
// redirect stdout to a log file or pass a file path here instead.
$handler = new StreamHandler('php://stdout', Logger::DEBUG);

// Set the formatter to JSON
$handler->setFormatter(new JsonFormatter());
$log->pushHandler($handler);

// Example log entries
$log->info('Application started', ['version' => '1.2.0']);

try {
    // Simulate a database query
    // $db = new PDO(...);
    // $stmt = $db->query('SELECT * FROM users WHERE id = 1');
    // $user = $stmt->fetch();

    // Simulate an error
    if (rand(0, 9) < 2) { // 20% chance of error
        throw new Exception('Database connection failed');
    }

    // Simulate a successful operation
    $log->info('User data fetched successfully', ['user_id' => 1, 'query_time_ms' => 150]);
} catch (Exception $e) {
    $log->error('An error occurred during user data fetch', [
        'user_id' => 1,
        'error_message' => $e->getMessage(),
        'error_code' => $e->getCode(),
        'trace' => $e->getTraceAsString() // Be cautious with sensitive trace info in production logs
    ]);
}

$log->info('Application finished processing request');
?>
The CloudWatch Agent tails files, so configure it to collect logs from wherever your PHP application writes them. If the application logs to php://stdout (common in Docker), either redirect that stream to a file or ship it with the container’s log driver instead. The agent configuration must be valid JSON, so merge the entry below into the existing collect_list in your config.json and adjust file_path to wherever your logs are written:
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/php-app/app.log",
            "log_group_name": "PHPApp/ApplicationLogs",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          }
        ]
      }
    }
  }
}
Each line Monolog writes is a complete JSON object, so no multi-line start pattern is needed, and CloudWatch Logs recognizes JSON events on its own when you filter or query them.
CloudWatch Alarms for Application Errors
Once logs are flowing into CloudWatch Logs, we can create Metric Filters to extract metrics from these logs and then set up Alarms based on these metrics. For example, to count error occurrences.
Step 1: Create a Metric Filter
Navigate to your CloudWatch Log Group (e.g., PHPApp/ApplicationLogs) in the AWS Console. On the “Metric filters” tab, click “Create metric filter.”
Filter Pattern:
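{ $.level_name = "ERROR" }
This pattern filters for log entries where the JSON field level_name is equal to "ERROR". Monolog’s JsonFormatter writes the numeric severity to level and the upper-case name to level_name, so a pattern that compares level to "error" would match nothing.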
Metric Details:
- Metric Namespace: PHPApp/Metrics
- Metric Name: ErrorCount
- Metric Value: 1 (each matching log entry increments the count by 1)
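The same metric filter can also be created from code. The sketch below builds the arguments for the CloudWatch Logs `put_metric_filter` call and includes a small local stand-in for the pattern so sample lines can be checked before deploying. It matches on `level_name` because Monolog's JsonFormatter writes the upper-case level name to that field; the filter name and sample lines are hypothetical.

```python
import json


def is_error_record(line):
    """Local stand-in for the filter pattern { $.level_name = "ERROR" }."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and record.get("level_name") == "ERROR"


def metric_filter_params(log_group="PHPApp/ApplicationLogs"):
    """Build keyword arguments for the CloudWatch Logs put_metric_filter call."""
    return {
        "logGroupName": log_group,
        "filterName": "php-app-errors",  # hypothetical name
        "filterPattern": '{ $.level_name = "ERROR" }',
        "metricTransformations": [{
            "metricName": "ErrorCount",
            "metricNamespace": "PHPApp/Metrics",
            "metricValue": "1",
            "defaultValue": 0,
        }],
    }


# Hypothetical lines in the shape Monolog's JsonFormatter produces
sample_lines = [
    '{"message":"Application started","level":200,"level_name":"INFO","channel":"app"}',
    '{"message":"An error occurred during user data fetch","level":400,"level_name":"ERROR","channel":"app"}',
    "not json at all",
]
error_count = sum(is_error_record(line) for line in sample_lines)
print(f"{error_count} of {len(sample_lines)} sample lines would be counted")

# To create the filter (requires AWS credentials and the logs:PutMetricFilter permission):
# import boto3
# boto3.client("logs").put_metric_filter(**metric_filter_params())
```

Setting `defaultValue` to 0 makes the metric report zero in periods where logs arrive without errors, which keeps alarm evaluation from stalling on missing data.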
Step 2: Create a CloudWatch Alarm
After creating the metric filter, go to “Alarms” in CloudWatch. Click “Create alarm.”
Select Metric: Choose the metric you just created (e.g., PHPApp/Metrics, ErrorCount).
Define conditions:
- Statistic: Sum
- Period: 5 minutes (or your desired interval)
- Threshold type: Static
- Whenever ErrorCount is: Greater than
- than: 0 (or a specific threshold, e.g., 10 errors in 5 minutes)
Configure actions: Set up notifications to an SNS topic for alerts (e.g., email, Slack integration).
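For teams that prefer code over console clicks, the conditions above map onto the CloudWatch `put_metric_alarm` call roughly as sketched here. The alarm name and the SNS topic ARN are placeholders, and treating missing data as not breaching is an assumption that suits a metric which only has datapoints when logs arrive.

```python
def error_alarm_params(sns_topic_arn, threshold=0):
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": "php-app-error-count",  # hypothetical name
        "Namespace": "PHPApp/Metrics",
        "MetricName": "ErrorCount",
        "Statistic": "Sum",
        "Period": 300,  # 5 minutes, in seconds
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder ARN; substitute the topic that feeds your email or Slack integration
alarm_params = error_alarm_params(
    "arn:aws:sns:us-east-1:123456789012:ops-alerts", threshold=10
)
print(alarm_params["AlarmName"], alarm_params["Threshold"])

# To create the alarm (requires AWS credentials and the cloudwatch:PutMetricAlarm permission):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```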
EC2 Instance Health and Performance Tuning
While MongoDB and the PHP application are critical, the underlying EC2 instances must also be healthy. Standard EC2 metrics are a good starting point, but we need to correlate them with application and database performance.
Key EC2 Metrics and Thresholds
Use CloudWatch’s default EC2 metrics, but set custom alarms with appropriate thresholds:
- CPU Utilization: Alarm if consistently above 80-90% for extended periods (e.g., 15 minutes). This often indicates an application or database bottleneck.
- Memory Utilization: While EC2 doesn’t expose memory usage directly by default (requires CloudWatch Agent), if you are collecting it, alarm if consistently above 85-90%. High memory usage can lead to swapping and severe performance degradation.
- Disk I/O (Read/Write Ops/Bytes): Monitor for unusually high rates that correlate with slow application responses. High I/O can indicate inefficient queries or insufficient IOPS on EBS volumes.
- Network In/Out: Alarm on sustained high network traffic that approaches instance or EBS network limits.
- Disk Queue Length: Reported per EBS volume (VolumeQueueLength) rather than as an EC2 instance metric. A sustained queue length greater than 2-3 per disk can indicate I/O saturation.
EBS Volume Performance
For MongoDB, EBS volume performance is paramount. Monitor these EBS-specific metrics:
- Volume Read/Write Ops: Track actual IOPS against provisioned IOPS (for io1/gp3) or burst credits (for gp2).
- Volume Read/Write Bytes: Track throughput against provisioned throughput (for gp3) or instance limits.
- Volume Queue Length: As mentioned, a sustained queue length indicates I/O bottlenecks.
- Volume Idle Time: Low idle time suggests the volume is constantly busy.
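Tracking actual IOPS against provisioned IOPS takes one small conversion, because the EBS read and write operation metrics are totals for the period rather than rates. The sketch below shows that arithmetic with hypothetical numbers for a single 5-minute datapoint; the function names are illustrative.

```python
def average_iops(read_ops, write_ops, period_seconds):
    """Convert per-period operation totals (Sum statistic) into average IOPS."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return (read_ops + write_ops) / period_seconds


def iops_utilization(read_ops, write_ops, period_seconds, provisioned_iops):
    """Return average IOPS as a percentage of the provisioned IOPS."""
    if provisioned_iops <= 0:
        raise ValueError("provisioned_iops must be positive")
    return 100.0 * average_iops(read_ops, write_ops, period_seconds) / provisioned_iops


# Hypothetical datapoint: 540,000 reads and 270,000 writes over 300 seconds
# on a volume provisioned for 3,000 IOPS
iops = average_iops(540_000, 270_000, 300)
utilization = iops_utilization(540_000, 270_000, 300, 3000)
print(f"average IOPS {iops:.0f}, {utilization:.0f}% of provisioned")
```

An average over five minutes hides short bursts, so a shorter period or the queue length metric is a useful cross-check when latency spikes are brief.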
If you observe persistent I/O bottlenecks, consider:
- Upgrading EBS volume type (e.g., from gp2 to gp3 or io1/io2).
- Increasing provisioned IOPS or throughput for gp3/io1 volumes.
- Optimizing MongoDB queries to reduce I/O load.
- Ensuring your EC2 instance type has sufficient network bandwidth for EBS traffic.
Automated Recovery and Health Checks
Proactive monitoring is essential, but automated recovery mechanisms are vital for maintaining high availability.
Auto Scaling Groups and Health Checks
For your PHP application servers, leverage AWS Auto Scaling Groups. Configure:
- EC2 Health Checks: Auto Scaling Groups can perform EC2 status checks. If an instance fails, it’s terminated and replaced.
- ELB Health Checks: If using an Elastic Load Balancer (ELB), configure health checks that your application exposes (e.g., a /health endpoint). The ELB will stop sending traffic to unhealthy instances, and the Auto Scaling Group will replace them if they remain unhealthy.
For MongoDB, direct Auto Scaling is more complex due to statefulness. However, you can use Auto Scaling Groups for your application tier and rely on automated failover for your MongoDB replica set. If you’re considering a managed service, Amazon DocumentDB (with MongoDB compatibility) handles storage scaling and failover for you, though it is MongoDB-compatible rather than MongoDB itself, so verify that the features your application uses are supported.
Automated MongoDB Failover
MongoDB’s replica set mechanism handles failover automatically. When a primary becomes unreachable, the remaining secondaries elect a new primary. Ensure your replica set configuration is robust:
- Sufficient Members: A minimum of 3 voting members is recommended for automatic failover (e.g., Primary, Secondary, Secondary).
- Arbiter: Consider using an arbiter if you cannot have an odd number of data-bearing nodes, but be aware of its limitations.
- Priority: Configure member priorities to influence failover elections.
- Network Latency: Ensure low latency between replica set members, especially across Availability Zones.
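The checklist above can be turned into a quick sanity check against the replica set configuration. The sketch below inspects a document shaped like the `config` field returned by the replSetGetConfig command; the specific warnings are a judgment call based on the points listed, and the hostnames are examples.

```python
def check_replica_set_config(config):
    """Return a list of warnings for a replSetGetConfig-style 'config' document."""
    members = config.get("members", [])
    # votes and priority default to 1 when omitted
    voting = [m for m in members if m.get("votes", 1) > 0]
    arbiters = [m for m in voting if m.get("arbiterOnly", False)]
    electable = [
        m for m in voting
        if not m.get("arbiterOnly", False) and m.get("priority", 1) > 0
    ]
    warnings = []
    if len(voting) < 3:
        warnings.append("fewer than 3 voting members")
    if len(voting) % 2 == 0:
        warnings.append("even number of voting members")
    if len(arbiters) > 1:
        warnings.append("more than one arbiter")
    if len(electable) < 2:
        warnings.append("fewer than 2 electable data-bearing members")
    return warnings


healthy = {"members": [
    {"_id": 0, "host": "mongo1.example.com:27017", "priority": 2},
    {"_id": 1, "host": "mongo2.example.com:27017"},
    {"_id": 2, "host": "mongo3.example.com:27017"},
]}
fragile = {"members": [
    {"_id": 0, "host": "mongo1.example.com:27017"},
    {"_id": 1, "host": "mongo2.example.com:27017"},
]}

print(check_replica_set_config(healthy))
print(check_replica_set_config(fragile))
```

With pymongo, the input would come from `client.admin.command("replSetGetConfig")["config"]`, so the check can run alongside the replication lag script.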
Your PHP application should be configured to connect to the replica set using a connection string that allows it to discover the current primary and automatically reconnect. For example:
// Example using MongoDB PHP Driver
$mongoClient = new MongoDB\Client(
"mongodb://mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/?replicaSet=myReplicaSet&readPreference=primary"
);
// The driver will automatically find the primary and reconnect if needed.
// For read operations, you might use readPreference=secondaryPreferred
// to distribute read load.
By combining comprehensive monitoring with automated recovery strategies, you can build a resilient and highly available PHP application powered by MongoDB on AWS.