Server Monitoring Best Practices: Keeping Your Python App and MongoDB Clusters Alive on AWS
Establishing a Robust Monitoring Foundation with AWS CloudWatch
Effective server monitoring for a Python application and its associated MongoDB clusters on AWS hinges on a multi-layered approach. We’ll leverage AWS CloudWatch as our primary telemetry ingestion and alerting platform, supplemented by application-specific metrics and external checks. This ensures comprehensive visibility into system health, performance, and potential issues before they impact end-users.
Monitoring the Python Application Layer
For our Python application, we’ll focus on key performance indicators (KPIs) that directly reflect its operational status and user experience. This includes request latency, error rates, and resource utilization (CPU, memory). We’ll use the CloudWatch Agent to push custom metrics from our EC2 instances.
Configuring the CloudWatch Agent for Custom Metrics
First, ensure the CloudWatch Agent is installed and configured on your EC2 instances running the Python application. The agent can collect system-level metrics (CPU, disk, network) and custom application metrics. We’ll define a configuration file (e.g., /opt/aws/amazon-cloudwatch-agent/bin/config.json) to specify what to collect.
Here’s an example configuration snippet focusing on custom metrics derived from application logs or internal instrumentation:
{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "namespace": "MyPythonApp",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "aggregation_dimensions": [
      ["InstanceId"]
    ],
    "metrics_collected": {
      "statsd": {
        "service_address": "127.0.0.1:8125",
        "metrics_collection_interval": 60
      },
      "collectd": {
        "service_address": "udp://127.0.0.1:25826",
        "collectd_typesdb": ["/usr/share/collectd/types.db"],
        "metrics_aggregation_interval": 60
      }
    }
  }
}
With this configuration the agent itself listens for StatsD datagrams on 127.0.0.1:8125, so the application only needs a StatsD client. The collectd section makes the agent accept metrics forwarded by a collectd daemon over its network plugin. The agent does not load collectd plugins, so a Python plugin (for example, a custom_metrics module under /opt/my_app/cloudwatch_plugins with a 60-second interval) is configured in collectd's own configuration rather than in this JSON file. For instance, using the statsd library in Python:
import time

import statsd

# Points at the agent's StatsD listener (service_address in the agent config).
client = statsd.StatsClient('127.0.0.1', 8125)

def process_request(request_data):
    start_time = time.time()
    try:
        # ... process request ...
        client.incr('requests_processed')
        return "Success"
    except Exception:
        client.incr('requests_failed')
        client.incr('errors_total')
        return "Error"
    finally:
        # Record latency for failed requests too, in milliseconds.
        duration = time.time() - start_time
        client.timing('request_latency', duration * 1000)

# Example usage
# process_request({"data": "..."})
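To make it clearer what the agent's StatsD listener actually receives, here is a minimal sketch of the StatsD line protocol using only the standard library. The helper names format_metric and send_metric are illustrative and are not part of the statsd package; the address is assumed to match service_address in the agent configuration.

```python
import socket

def format_metric(name, value, metric_type):
    """Render one StatsD datagram, e.g. 'requests_processed:1|c'."""
    return f"{name}:{value}|{metric_type}"

def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; StatsD never acknowledges datagrams."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))

# A counter increment ("c") and a timing sample in milliseconds ("ms").
counter = format_metric("requests_processed", 1, "c")
timing = format_metric("request_latency", 42, "ms")

# Uncomment on a host where the agent is listening:
# send_metric(counter)
# send_metric(timing)
```

Because the transport is UDP, a stopped agent does not raise an error in the application; missing datapoints in CloudWatch are the only symptom.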
You would then start the CloudWatch agent with this configuration:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
Setting Up CloudWatch Alarms
Once metrics are flowing into CloudWatch under the MyPythonApp namespace, we can define alarms. Critical alarms should trigger notifications via SNS to a dedicated DevOps channel or email distribution list.
aws cloudwatch put-metric-alarm \
--alarm-name "HighRequestLatency-MyPythonApp" \
--alarm-description "Alarm when average request latency exceeds 500ms for 5 minutes" \
--metric-name "request_latency" \
--namespace "MyPythonApp" \
--statistic Average \
--period 300 \
--threshold 500 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=InstanceId,Value=$(curl -s http://169.254.169.254/latest/meta-data/instance-id) \
--evaluation-periods 1 \
--datapoints-to-alarm 1 \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:my-devops-alerts-topic
Similarly, configure alarms for:
- requests_failed (count > 10 in 5 minutes)
- errors_total (count > 5 in 5 minutes)
- System metrics like CPU utilization (Average > 80% for 10 minutes), if you are not relying on the default EC2 CPUUtilization metric.
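When several alarms share most of their settings, it can be easier to generate them than to repeat the CLI call. The sketch below builds keyword arguments for boto3's put_metric_alarm from the thresholds listed above. The build_alarm and create_alarms helpers, the alarm naming scheme, and the instance ID are assumptions made for illustration, and the Sum statistic is one reasonable way to express "count in 5 minutes".

```python
def build_alarm(metric_name, threshold, statistic, instance_id,
                period=300, namespace="MyPythonApp",
                topic_arn="arn:aws:sns:us-east-1:123456789012:my-devops-alerts-topic"):
    """Return keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"High-{metric_name}-{namespace}",
        "MetricName": metric_name,
        "Namespace": namespace,
        "Statistic": statistic,
        "Period": period,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "AlarmActions": [topic_arn],
    }

# Placeholder instance ID; thresholds follow the list above.
alarms = [
    build_alarm("requests_failed", 10, "Sum", "i-0123456789abcdef0"),
    build_alarm("errors_total", 5, "Sum", "i-0123456789abcdef0"),
]

def create_alarms(alarm_specs):
    """Create the alarms; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder works without boto3
    cloudwatch = boto3.client("cloudwatch")
    for spec in alarm_specs:
        cloudwatch.put_metric_alarm(**spec)
```

Keeping the builder free of AWS calls means the alarm definitions can be reviewed or unit-tested before anything is created in the account.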
Monitoring the MongoDB Cluster on AWS (e.g., using EC2 or DocumentDB)
Monitoring MongoDB requires a different set of metrics, focusing on database operations, connection pooling, replication lag, and disk I/O. If running MongoDB on EC2, we’ll again use the CloudWatch Agent. For AWS DocumentDB, CloudWatch metrics are automatically integrated.
Monitoring MongoDB on EC2 Instances
We’ll use the mongostat and mongotop tools, and potentially a collectd plugin for MongoDB, to gather detailed database metrics. The CloudWatch Agent can be configured to collect these.
{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "namespace": "MyMongoDBCluster",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "statsd": {
        "service_address": "127.0.0.1:8125",
        "metrics_collection_interval": 60
      },
      "collectd": {
        "service_address": "udp://127.0.0.1:25826",
        "metrics_aggregation_interval": 60
      }
    }
  }
}
The agent only receives what a collectd daemon forwards to it. The MongoDB plugin itself is installed and configured in collectd, including the host (127.0.0.1), port (27017), 60-second interval, and the metric groups to collect, such as connections, network, opcounters, locks, and global_lock. Depending on the plugin and on the agent's metric name prefix, it will expose metrics like:
- mongodb_connections_current
- mongodb_opcounters_insert_total, mongodb_opcounters_query_total, etc.
- mongodb_replication_lag_seconds (if applicable)
- mongodb_buffer_pool_stats_pages_dirty
These metrics will be sent to CloudWatch under the MyMongoDBCluster namespace. Alarms should be set for:
- High replication lag (e.g., mongodb_replication_lag_seconds > 60 for 5 minutes)
- Low available connections (e.g., mongodb_connections_available < 10 for 5 minutes)
- High disk usage (via the CloudWatch Agent's disk metrics, such as disk_used_percent, since the default EC2 metrics do not report filesystem usage)
- High CPU/Memory utilization on database instances.
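If no plugin reports replication lag directly, it can be derived from the replica set status. The sketch below assumes the status document comes from pymongo, for example MongoClient(...).admin.command("replSetGetStatus"), and that each member carries name, stateStr, and optimeDate fields. The replication_lag_seconds helper and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

def replication_lag_seconds(status):
    """Map each SECONDARY member's name to its lag behind the PRIMARY.

    Raises StopIteration if the replica set currently has no PRIMARY.
    """
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }

# Hand-written sample shaped like replSetGetStatus output (not real data).
now = datetime(2024, 1, 1, 12, 0, 0)
sample_status = {
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "db2:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=75)},
    ]
}

lags = replication_lag_seconds(sample_status)
# Members that would breach the 60-second threshold mentioned above.
breaching = [name for name, lag in lags.items() if lag > 60]
```

Each value could then be published as a StatsD gauge (for instance with the statsd client's gauge method) so that it lands in the MyMongoDBCluster namespace alongside the other metrics.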
Monitoring AWS DocumentDB
DocumentDB automatically publishes a rich set of metrics to CloudWatch under the AWS/DocDB namespace. Key metrics include:
- DatabaseConnections
- DiskQueueDepth
- CPUUtilization
- ReadIOPS, WriteIOPS
- DBInstanceReplicaLag (reported in milliseconds)
Setting up alarms for DocumentDB is similar to other AWS services:
aws cloudwatch put-metric-alarm \
--alarm-name "HighDocumentDBReplicationLag" \
--alarm-description "Alarm when DocumentDB replica lag exceeds 30 seconds (30000 ms)" \
--metric-name "DBInstanceReplicaLag" \
--namespace "AWS/DocDB" \
--statistic Maximum \
--period 60 \
--threshold 30000 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=DBInstanceIdentifier,Value=my-documentdb-instance \
--evaluation-periods 2 \
--datapoints-to-alarm 2 \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:my-devops-alerts-topic
Log Aggregation and Analysis with CloudWatch Logs
Beyond metrics, centralized log aggregation is crucial for debugging and root cause analysis. We’ll configure the CloudWatch Agent to stream application and database logs to CloudWatch Logs.
Configuring Log Streaming
Add the following to your CloudWatch Agent configuration file (/opt/aws/amazon-cloudwatch-agent/bin/config.json):
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my_python_app.log",
            "log_group_name": "MyPythonApp/Logs",
            "log_stream_name": "{instance_id}/app"
          },
          {
            "file_path": "/var/log/mongodb/mongod.log",
            "log_group_name": "MyMongoDBCluster/Logs",
            "log_stream_name": "{instance_id}/mongod"
          }
        ]
      }
    }
  }
}
After updating the agent configuration, restart it:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
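The agent can only stream what the application writes to file_path. As a minimal sketch, the application could attach a file handler whose lines include the level name, so that a later search for ERROR has something to match. The configure_app_logging helper, the logger name, and the line format are assumptions; the demo writes to a temporary file so it runs without access to /var/log.

```python
import logging
import os
import tempfile

def configure_app_logging(path="/var/log/my_python_app.log"):
    """Attach a file handler whose lines contain the level name."""
    logger = logging.getLogger("my_python_app")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger

# Demo against a temporary file instead of the production path.
demo_path = os.path.join(tempfile.mkdtemp(), "my_python_app.log")
log = configure_app_logging(demo_path)
log.error("payment service timeout")

with open(demo_path) as fh:
    first_line = fh.readline()
```

In production the function would be called once at startup with the default path, and the process needs write permission on that file.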
Leveraging CloudWatch Logs Insights
CloudWatch Logs Insights provides a powerful query language to analyze your logs. For example, to find all Python application errors:
fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20
And to analyze MongoDB slow queries:
fields @timestamp, @message | filter @message like /query: / and @message like /planSummary: / | stats count(*) as slow_queries by bin(5m) | sort slow_queries desc
External Health Checks and Synthetic Monitoring
While internal metrics and logs are vital, external checks simulate user behavior and verify end-to-end availability. AWS Route 53 Health Checks and CloudWatch Synthetics (Canaries) are excellent for this.
Route 53 Health Checks
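Configure Route 53 health checks to request a specific health endpoint on your Python application (e.g., /healthz). If the endpoint does not return a 2xx or 3xx status code within the timeout, Route 53 marks it as unhealthy. When the health check is attached to a DNS record (for example, with failover routing), Route 53 then stops returning that record, so new traffic is directed elsewhere. Load balancers such as ELB/ALB run their own target health checks in addition to this.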
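A health endpoint for such a check can be quite small. The sketch below uses only the standard library and returns 503 when any dependency check fails. The run_checks function is a hypothetical placeholder for real probes, and port 8080 is an assumption; a framework route in the existing application would serve the same purpose.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def health_response(checks):
    """Return (status_code, body) from a dict of check name -> bool."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    if failed:
        return 503, "unhealthy: " + ",".join(failed)
    return 200, "ok"

def run_checks():
    """Hypothetical dependency checks; replace with real probes."""
    return {"app": True}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        code, body = health_response(run_checks())
        payload = body.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port=8080):
    """Blocks forever; call from the application's entry point."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Keeping the endpoint cheap matters because health checkers poll it frequently from several locations.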
CloudWatch Synthetics Canaries
Canaries are scripts (written in Node.js or Python) that run on a schedule to simulate user interactions. You can create canaries to:
- Perform a GET request to your application’s homepage and assert the response code.
- Simulate a user login flow.
- Check critical API endpoints.
- Verify MongoDB connectivity from an external perspective (though this is less common and usually handled by internal checks).
Canaries publish metrics such as SuccessPercent to the CloudWatchSynthetics namespace. An alarm on those metrics fires when a canary fails, providing an early warning of external-facing issues.
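The kind of assertion a canary makes can be sketched in plain Python. This example deliberately does not use the Synthetics runtime library; it only illustrates the check logic (status code plus a latency budget). The fetch_status and run_check helpers, the 2000 ms budget, and the URL are assumptions.

```python
import time
import urllib.error
import urllib.request

def fetch_status(url, timeout=5):
    """Return the HTTP status code for a GET, including 4xx/5xx codes.

    Connection failures (urllib.error.URLError) propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

def run_check(url, fetch=fetch_status, max_latency_ms=2000):
    """Pass only if the endpoint answers 200 within the latency budget."""
    start = time.monotonic()
    status = fetch(url)
    elapsed_ms = (time.monotonic() - start) * 1000
    return {
        "url": url,
        "status": status,
        "elapsed_ms": elapsed_ms,
        "passed": status == 200 and elapsed_ms <= max_latency_ms,
    }

# Offline demonstration with a stubbed fetch; the URL is a placeholder.
result = run_check("https://example.com/", fetch=lambda url: 200)
```

Passing the fetch function as a parameter keeps the check testable without network access; a real canary would use the default fetch_status or the Synthetics helpers.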
Proactive Maintenance and Incident Response
A robust monitoring strategy isn’t just about detecting failures; it’s about preventing them and responding effectively. Regularly review your dashboards, analyze trends in metrics and logs, and refine your alerting thresholds. Implement runbooks for common alert scenarios to ensure swift and consistent incident response.