Server Monitoring Best Practices: Keeping Your Python App and DynamoDB Clusters Alive on AWS
Proactive Health Checks for Python Applications on EC2
Maintaining the health of Python applications deployed on EC2 instances requires a multi-layered monitoring approach. Beyond basic CPU and memory utilization, we need to inspect application-specific metrics and ensure critical processes are running. A common pattern involves using a combination of system-level tools and custom application probes.
For system-level checks, the CloudWatch Agent is indispensable. It collects system metrics and application logs from EC2 instances, along with custom metrics emitted by the Python application. First, make sure the agent is installed and configured on each instance. The configuration file typically resides at /opt/aws/amazon-cloudwatch-agent/bin/config.json.
Configuring CloudWatch Agent for Application Metrics
A robust configuration will include collecting standard system metrics, application logs, and custom metrics. For Python applications, we often want to monitor request latency, error rates, and the number of active worker processes. Let’s assume your Python application logs errors to /var/log/my_python_app/error.log and emits custom metrics via a StatsD endpoint on port 8125.
Here’s a sample config.json for the CloudWatch Agent:
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my_python_app/error.log",
            "log_group_name": "/aws/my-python-app/errors",
            "log_stream_name": "{instance_id}/error.log",
            "log_group_class": "STANDARD"
          }
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "totalcpu": true
      },
      "mem": {
        "measurement": [
          "mem_used_percent",
          "mem_available_percent"
        ]
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "resources": [
          "/"
        ]
      },
      "statsd": {
        "service_address": "127.0.0.1:8125",
        "metrics_collection_interval": 60
      }
    }
  }
}
After updating the configuration file, restart the CloudWatch Agent:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
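On the application side, emitting metrics to the agent's StatsD listener can be as simple as sending UDP datagrams. The following is a minimal sketch, assuming the listener on 127.0.0.1:8125 described above. The metric names (myapp.*) and the helper functions format_metric and send_metric are hypothetical, and a real application might prefer an established StatsD client library.

```python
import socket

# Assumed to match the agent's StatsD listener from the sample configuration
STATSD_ADDR = ("127.0.0.1", 8125)


def format_metric(name: str, value: float, metric_type: str) -> bytes:
    """Build one StatsD datagram of the form name:value|type."""
    # c = counter, g = gauge, ms = timer
    if metric_type not in {"c", "g", "ms"}:
        raise ValueError(f"unsupported StatsD type: {metric_type}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name: str, value: float, metric_type: str, addr=STATSD_ADDR) -> bytes:
    """Fire-and-forget UDP send; socket errors are ignored so that
    metric emission never breaks request handling."""
    payload = format_metric(name, value, metric_type)
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, addr)
    except OSError:
        pass
    return payload


# Hypothetical metrics for the three signals mentioned earlier
send_metric("myapp.requests.errors", 1, "c")       # error count
send_metric("myapp.request_latency_ms", 42.5, "ms")  # request latency
send_metric("myapp.workers.active", 4, "g")        # active worker processes
```

Because UDP is connectionless, these calls do not block or fail when the agent is down, which is usually the desired trade-off for metrics: some datapoints may be lost, but the application keeps serving requests.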
Application-Level Health Checks with Python
To complement system metrics, implement application-level health checks. This typically involves a dedicated endpoint (e.g., /health) that your Python application exposes. This endpoint should not only check if the web server is responsive but also verify connectivity to critical dependencies like databases or external services.
For Flask applications, a simple health check endpoint might look like this:
from flask import Flask, jsonify
import redis

app = Flask(__name__)

# Assume Redis is running on localhost:6379. Short timeouts keep the
# health check from hanging when Redis is unreachable.
redis_client = redis.Redis(
    host='localhost', port=6379, db=0,
    socket_connect_timeout=1, socket_timeout=1,
)


@app.route('/health', methods=['GET'])
def health_check():
    status = {
        "status": "healthy",
        "dependencies": {}
    }

    # Check Redis connection
    try:
        redis_client.ping()
        status["dependencies"]["redis"] = "connected"
    except redis.exceptions.RedisError as e:
        # RedisError covers timeouts as well as refused connections
        status["status"] = "unhealthy"
        status["dependencies"]["redis"] = f"connection_error: {e}"

    # Add checks for other dependencies (e.g., database, external APIs) here

    if status["status"] == "healthy":
        return jsonify(status), 200
    else:
        return jsonify(status), 503  # Service Unavailable


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
You can then have an external check poll this /health endpoint, for example an Elastic Load Balancer health check, a script run from a CodeDeploy ValidateService lifecycle hook, or a third-party monitoring tool. For EC2 instances not behind a load balancer, a simple script run by cron or a systemd timer can check the endpoint periodically and trigger an alert whenever it returns a non-2xx status code or does not respond at all.
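A polling script of that kind might look like the sketch below. It uses only the standard library and assumes the Flask example's port 5000. The function names and the alerting behaviour (write to stderr and exit non-zero) are placeholders for whatever alert mechanism you actually use.

```python
import sys
import urllib.error
import urllib.request


def check_health(url: str, timeout: float = 5.0):
    """Return (is_healthy, detail). Only a 2xx response counts as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # urlopen raises for 4xx/5xx, including the 503 returned when unhealthy
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, timeout, and similar
        return False, f"unreachable: {exc}"


def main(url: str) -> int:
    healthy, detail = check_health(url)
    if not healthy:
        # Placeholder alert: report on stderr and signal failure via exit code
        print(f"health check failed for {url}: {detail}", file=sys.stderr)
        return 1
    return 0


# Usage: python3 check_health.py http://127.0.0.1:5000/health
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

A non-zero exit code lets cron mail the stderr output or lets a systemd unit with OnFailure= start a notification unit, so the script itself stays free of alerting logic.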
Monitoring DynamoDB Clusters on AWS
DynamoDB, being a managed service, abstracts away much of the underlying infrastructure. However, effective monitoring is still crucial for performance tuning, cost optimization, and ensuring application availability. AWS provides a rich set of CloudWatch metrics for DynamoDB.
Key DynamoDB CloudWatch Metrics to Watch
Focus on metrics that indicate potential bottlenecks or over-provisioning. These can be viewed in the AWS Management Console under CloudWatch -> Metrics -> DynamoDB.
- ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits: These show the actual capacity consumed by your operations.
- ProvisionedReadCapacityUnits and ProvisionedWriteCapacityUnits: These represent the capacity you have configured.
- ReadThrottleEvents and WriteThrottleEvents: Crucial for identifying when requests are being throttled due to exceeding provisioned capacity.
- SuccessfulRequestLatency: The latency of successful requests, in milliseconds, reported per operation (for example GetItem or Query). High latency can indicate underlying issues.
- SystemErrors: Indicates errors originating from the DynamoDB service itself.
- ReturnedItemCount: The number of items returned by a scan or query operation. High values can indicate inefficient queries.
- ConditionalCheckFailedRequests: For applications using conditional writes, this indicates failed conditions.
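One detail is easy to miss when comparing the first two metric pairs: consumed capacity is reported as a total over the period, while provisioned capacity is a per-second rate. To get a utilization percentage, divide the consumed Sum by the number of seconds in the period first. The function below is a small worked example of that arithmetic; the function name and the sample numbers are illustrative only.

```python
def capacity_utilization_percent(
    consumed_sum: float, period_seconds: int, provisioned_units: float
) -> float:
    """Utilization = (consumed units per second) / (provisioned units per second) * 100."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    if provisioned_units <= 0:
        # On-demand tables have no provisioned capacity to compare against
        raise ValueError("provisioned_units must be positive")
    return 100.0 * (consumed_sum / period_seconds) / provisioned_units


# Illustrative datapoint: 24,000 RCUs consumed during a 300-second period
# on a table provisioned with 100 RCUs -> 80 RCU/s out of 100 RCU/s
print(capacity_utilization_percent(24_000, 300, 100))  # 80.0
```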
Setting Up CloudWatch Alarms for DynamoDB
Proactive alerting is key. Configure CloudWatch Alarms to notify you when specific thresholds are breached. Here are some recommended alarm configurations:
- High Read/Write Throttling: Alarm when ReadThrottleEvents or WriteThrottleEvents are greater than 0 for a sustained period (e.g., 5 minutes). This is a direct indicator of performance degradation.
- Sustained High Capacity Utilization: Alarm when ConsumedReadCapacityUnits / ProvisionedReadCapacityUnits (or the write equivalents) is consistently above 80% for an extended period (e.g., 15 minutes). This suggests a need to scale up provisioned capacity or optimize queries.
- High Latency: Alarm when SuccessfulRequestLatency (average or p95/p99) exceeds a defined threshold (e.g., 100ms) for a sustained period.
- System Errors: Alarm on any occurrence of SystemErrors.
You can create these alarms via the AWS CLI:
# Example: Alarm for Read Throttling
aws cloudwatch put-metric-alarm \
--alarm-name "DynamoDB-High-Read-Throttle-Events" \
--alarm-description "High number of read throttle events for table X" \
--metric-name ReadThrottleEvents \
--namespace "AWS/DynamoDB" \
--statistic Sum \
--period 300 \
--threshold 0 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=TableName,Value=YourTableName \
--evaluation-periods 2 \
--datapoints-to-alarm 2 \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:MyAlertsTopic
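# Example: Alarm for High Read Capacity Utilization
# Uses metric math: consumed units per second as a percentage of provisioned units.
# Three 5-minute periods above 80% correspond to 15 minutes of sustained load.
aws cloudwatch put-metric-alarm \
  --alarm-name "DynamoDB-High-Read-Capacity-Utilization" \
  --alarm-description "Sustained high read capacity utilization for table X" \
  --metrics '[
    {"Id": "consumed", "ReturnData": false,
     "MetricStat": {"Metric": {"Namespace": "AWS/DynamoDB", "MetricName": "ConsumedReadCapacityUnits",
       "Dimensions": [{"Name": "TableName", "Value": "YourTableName"}]},
       "Period": 300, "Stat": "Sum"}},
    {"Id": "provisioned", "ReturnData": false,
     "MetricStat": {"Metric": {"Namespace": "AWS/DynamoDB", "MetricName": "ProvisionedReadCapacityUnits",
       "Dimensions": [{"Name": "TableName", "Value": "YourTableName"}]},
       "Period": 300, "Stat": "Average"}},
    {"Id": "utilization", "Label": "Read capacity utilization (%)", "ReturnData": true,
     "Expression": "100 * (consumed / PERIOD(consumed)) / provisioned"}
  ]' \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --datapoints-to-alarm 3 \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:MyAlertsTopic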
Remember to replace YourTableName and the SNS topic ARN with your specific values. For Auto Scaling, you’ll use these metrics to define scaling policies.
Leveraging DynamoDB Streams and Lambda for Real-time Monitoring
For more advanced, real-time monitoring and automated responses, DynamoDB Streams combined with AWS Lambda can be powerful. You can capture item-level modifications and trigger Lambda functions to perform custom analysis or take immediate action.
For instance, you could:
- Monitor the rate of specific item updates to detect anomalies.
- Trigger alerts based on the content of updated items (e.g., a status field changing to “critical”).
- Perform real-time data validation or transformation.
To set this up:
- Enable DynamoDB Streams on your table (e.g., NEW_AND_OLD_IMAGES).
- Create a Lambda function that is triggered by the DynamoDB Stream.
- Within the Lambda function, process the stream records to extract relevant information and implement your monitoring logic.
- Send custom metrics to CloudWatch from your Lambda function using the AWS SDK.
Example Python Lambda function snippet:
import boto3
import json

cloudwatch = boto3.client('cloudwatch')


def lambda_handler(event, context):
    for record in event['Records']:
        if record['eventName'] == 'MODIFY':
            # Both images are present only with the NEW_AND_OLD_IMAGES view type
            new_image = record['dynamodb']['NewImage']
            old_image = record['dynamodb']['OldImage']

            # Example: Monitor changes to a 'status' field
            if 'status' in new_image and 'status' in old_image:
                # .get('S') yields None if the attribute is not stored as a string
                new_status = new_image['status'].get('S')
                old_status = old_image['status'].get('S')

                if new_status and old_status and new_status != old_status:
                    print(f"Status changed from {old_status} to {new_status}")

                    # Publish custom metric to CloudWatch
                    cloudwatch.put_metric_data(
                        Namespace='MyDynamoDBApp/StatusChanges',
                        MetricData=[
                            {
                                'MetricName': 'StatusChangeCount',
                                'Dimensions': [
                                    # Stream ARN ends in table/<TableName>/stream/<label>
                                    {'Name': 'TableName', 'Value': record['eventSourceARN'].split('/')[1]},
                                    {'Name': 'NewStatus', 'Value': new_status},
                                    {'Name': 'OldStatus', 'Value': old_status}
                                ],
                                'Value': 1,
                                'Unit': 'Count'
                            },
                        ]
                    )

    return {
        'statusCode': 200,
        'body': json.dumps('Processed DynamoDB stream records.')
    }
This approach allows for granular, real-time insights into your DynamoDB data and operations, enabling faster detection and resolution of issues.
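One possible refinement, sketched here as an assumption rather than a required pattern, is to aggregate status transitions across the whole stream batch and publish one datapoint per distinct transition. This avoids a CloudWatch API call for every changed item. The helper names count_status_changes and to_metric_data are hypothetical, and the code assumes the same 'status' string attribute as the handler above.

```python
from collections import Counter


def count_status_changes(records):
    """Count (old_status, new_status) transitions across MODIFY records in one batch."""
    transitions = Counter()
    for record in records:
        if record.get("eventName") != "MODIFY":
            continue
        images = record.get("dynamodb", {})
        old = images.get("OldImage", {}).get("status", {}).get("S")
        new = images.get("NewImage", {}).get("status", {}).get("S")
        # Skip items without a string status or without an actual change
        if old is not None and new is not None and old != new:
            transitions[(old, new)] += 1
    return transitions


def to_metric_data(transitions, table_name):
    """Shape the counts as entries for the MetricData list of put_metric_data."""
    return [
        {
            "MetricName": "StatusChangeCount",
            "Dimensions": [
                {"Name": "TableName", "Value": table_name},
                {"Name": "NewStatus", "Value": new},
                {"Name": "OldStatus", "Value": old},
            ],
            "Value": count,
            "Unit": "Count",
        }
        for (old, new), count in sorted(transitions.items())
    ]
```

Inside the handler, the result of to_metric_data(count_status_changes(event['Records']), table_name) could be passed as MetricData to a single cloudwatch.put_metric_data call, skipping the call when the list is empty. Keep in mind that every distinct combination of dimension values creates a separate CloudWatch metric, so status values should come from a small, fixed set.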