Server Monitoring Best Practices: Keeping Your Python App and Redis Clusters Alive on AWS
Proactive Health Checks for Python Applications on EC2
Maintaining the health of Python applications deployed on AWS EC2 instances requires a multi-layered approach to monitoring. Beyond basic CPU and memory utilization, we need to ensure the application itself is responsive and its critical dependencies are functioning correctly. This involves implementing both external and internal health checks.
External Health Checks with ELB/ALB Target Groups
AWS Elastic Load Balancing (ELB), including the Application Load Balancer (ALB), provides built-in health checks for registered targets. For a Python web application (e.g., Flask, Django, FastAPI), this typically means configuring the load balancer to periodically request a dedicated health check endpoint exposed by the application.
A robust health check endpoint should not only return a 200 OK status but also perform a quick validation of core application functionality. For instance, it might check database connectivity or the availability of a critical external service.
Example Flask Health Check Endpoint
Here’s a simple Flask example. Ensure this endpoint is accessible by the load balancer but ideally not exposed directly to the public internet without authentication.
from flask import Flask, jsonify
import redis  # Assuming Redis is a dependency

app = Flask(__name__)

# Replace with your actual Redis connection details
REDIS_HOST = 'your-redis-host'
REDIS_PORT = 6379

def check_redis_connection():
    """Return True if Redis answers PING within the one-second timeouts."""
    try:
        r = redis.StrictRedis(host=REDIS_HOST, port=REDIS_PORT, socket_connect_timeout=1, socket_timeout=1)
        r.ping()
        return True
    except redis.exceptions.RedisError:
        # RedisError covers both ConnectionError and TimeoutError
        return False

@app.route('/health')
def health_check():
    redis_ok = check_redis_connection()
    if redis_ok:
        return jsonify({"status": "ok", "dependencies": {"redis": "healthy"}}), 200
    else:
        return jsonify({"status": "degraded", "dependencies": {"redis": "unhealthy"}}), 503  # Service Unavailable

if __name__ == '__main__':
    # In production, use a WSGI server like Gunicorn or uWSGI
    # Example: gunicorn -w 4 -b 0.0.0.0:5000 app:app
    # Keep debug off when binding to 0.0.0.0: the Flask debugger allows remote code execution
    app.run(debug=False, host='0.0.0.0', port=5000)
When configuring your ALB/ELB Target Group, you would set the Health Check Path to /health, the protocol to HTTP, and adjust the interval, timeout, and healthy/unhealthy thresholds according to your application’s tolerance for latency and downtime.
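The interval and thresholds together determine how quickly a failing target leaves rotation, so it can help to compute that before choosing values. The sketch below is illustrative only: the `HealthCheckSettings` class and its default values are hypothetical, and the detection times are approximations (interval multiplied by threshold). The dictionary keys follow the parameters of boto3's `elbv2` `modify_target_group` call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheckSettings:
    """Hypothetical container for target group health check settings."""
    path: str = "/health"
    interval_seconds: int = 15
    timeout_seconds: int = 5
    healthy_threshold: int = 3
    unhealthy_threshold: int = 2

    def to_elbv2_kwargs(self, target_group_arn: str) -> dict:
        """Build keyword arguments for boto3's elbv2 modify_target_group."""
        return {
            "TargetGroupArn": target_group_arn,
            "HealthCheckProtocol": "HTTP",
            "HealthCheckPath": self.path,
            "HealthCheckIntervalSeconds": self.interval_seconds,
            "HealthCheckTimeoutSeconds": self.timeout_seconds,
            "HealthyThresholdCount": self.healthy_threshold,
            "UnhealthyThresholdCount": self.unhealthy_threshold,
            "Matcher": {"HttpCode": "200"},
        }

    def seconds_to_mark_unhealthy(self) -> int:
        """Approximate time before a failing target is taken out of rotation."""
        return self.interval_seconds * self.unhealthy_threshold

    def seconds_to_mark_healthy(self) -> int:
        """Approximate time before a recovered target receives traffic again."""
        return self.interval_seconds * self.healthy_threshold
```

With boto3 installed and credentials configured, the settings could be applied with `boto3.client("elbv2").modify_target_group(**settings.to_elbv2_kwargs(arn))`. Note that the 503 returned by the example endpoint when Redis is down counts as a failed check, so a Redis outage would eventually remove every target from rotation.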
Internal Application Monitoring with Prometheus and Node Exporter
For deeper insights into the EC2 instance and the Python application’s resource consumption, integrating Prometheus with the Node Exporter is a standard practice. Node Exporter provides a wide range of system-level metrics, while custom exporters or application instrumentation can expose application-specific metrics.
Setting up Node Exporter on EC2
Download and run Node Exporter on each EC2 instance. It exposes metrics on port 9100 by default.
# Download the latest version (check for the latest release on GitHub)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
# Run Node Exporter
./node_exporter
For production, you’ll want to run this as a systemd service. Create a file like /etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/path/to/node_exporter/node_exporter --collector.textfile.directory=/path/to/node_exporter/textfile_collector

[Install]
WantedBy=multi-user.target
Then enable and start the service:
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
Instrumenting Python Applications for Prometheus
Use the prometheus_client Python library to expose application-specific metrics. These can include request counts, latency histograms, error rates, and custom business metrics.
from flask import Flask, Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CollectorRegistry, CONTENT_TYPE_LATEST
import time
import random

app = Flask(__name__)
registry = CollectorRegistry()

# Example Metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP Requests', ['method', 'endpoint', 'status_code'], registry=registry)
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP Request Latency', ['endpoint'], registry=registry)
ACTIVE_USERS = Gauge('app_active_users', 'Number of active users', registry=registry)

@app.route('/')
def index():
    start_time = time.time()
    try:
        # Simulate some work
        time.sleep(random.uniform(0.1, 0.5))
        status_code = 200
        return "Hello, World!"
    except Exception as e:
        status_code = 500
        return str(e), 500
    finally:
        # Record the outcome whether the request succeeded or failed
        duration = time.time() - start_time
        REQUEST_COUNT.labels(method='GET', endpoint='/', status_code=status_code).inc()
        REQUEST_LATENCY.labels(endpoint='/').observe(duration)

@app.route('/metrics')
def metrics():
    # Update Gauge metric (random placeholder value for demonstration)
    ACTIVE_USERS.set(random.randint(10, 100))
    # CONTENT_TYPE_LATEST is the content type Prometheus expects for the text exposition format
    return Response(generate_latest(registry), mimetype=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    # In production, use a WSGI server like Gunicorn or uWSGI
    app.run(host='0.0.0.0', port=5001)  # Expose metrics on a different port or path
Configure Prometheus to scrape both the Node Exporter (port 9100) and your application’s metrics endpoint (e.g., port 5001/metrics) on each EC2 instance. This typically involves adding jobs to your prometheus.yml configuration.
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['ec2-instance-1:9100', 'ec2-instance-2:9100']  # Replace with actual IPs/hostnames
  - job_name: 'python_app_metrics'
    static_configs:
      - targets: ['ec2-instance-1:5001', 'ec2-instance-2:5001']  # Replace with actual IPs/hostnames
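Static target lists have to be edited by hand whenever instances are replaced. One alternative, not part of the configuration above, is Prometheus's file-based service discovery (`file_sd_configs`), which reads target groups from a JSON or YAML file and picks up changes without a restart. The following is a minimal sketch of generating such a file; the function names, host names, and label values are hypothetical.

```python
import json
from pathlib import Path


def build_file_sd(hosts, port, labels=None):
    """Return a file_sd target group list for the given hosts and port."""
    return [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": dict(labels or {}),
        }
    ]


def write_file_sd(path, groups):
    """Write the target groups via a temporary file, then rename it into place.

    The rename avoids Prometheus reading a half-written file.
    """
    final_path = Path(path)
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    tmp_path.write_text(json.dumps(groups, indent=2))
    tmp_path.replace(final_path)


if __name__ == "__main__":
    groups = build_file_sd(["ec2-instance-1", "ec2-instance-2"], 9100, {"env": "prod"})
    print(json.dumps(groups, indent=2))
```

A job would then reference the generated file through `file_sd_configs` instead of `static_configs`. The file name must end in `.json`, `.yml`, or `.yaml` for Prometheus to load it.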
Redis Cluster Monitoring on AWS ElastiCache
For Redis clusters managed by AWS ElastiCache, monitoring shifts from instance-level metrics to ElastiCache-specific metrics and Redis commands. AWS provides CloudWatch metrics for ElastiCache, which are crucial for understanding cluster health and performance.
Key ElastiCache Metrics to Monitor
- CPUUtilization: High CPU can indicate heavy load or inefficient queries.
- DatabaseMemoryUsagePercentage: Crucial for anticipating evictions and out-of-memory errors.
- CacheHits and CacheMisses: Monitor cache efficiency. A high miss rate might indicate insufficient cache size or poor data access patterns.
- Evictions: Indicates the cache is full and data is being removed. Frequent evictions degrade performance.
- CurrConnections: Number of active client connections. Spikes often point to connection leaks or missing connection pooling in the application.
- NewConnections: Rate of new connections.
- ReplicationLag: How far, in seconds, a read replica is behind its primary. It is reported only for nodes running as replicas.
- EngineCPUUtilization: Specific to Redis, this shows CPU usage by the Redis engine itself.
Set up CloudWatch Alarms on these metrics. For example, an alarm for DatabaseMemoryUsagePercentage exceeding 85% or Evictions greater than 0 over a 5-minute period can alert you to potential issues before they impact your application.
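As a rough sketch of how such alarms might be defined in code, the function below builds the keyword arguments for boto3's CloudWatch `put_metric_alarm` call. The function name, the alarm naming scheme, the cluster ID, and the SNS topic ARN are hypothetical; the example uses the CloudWatch metric names `DatabaseMemoryUsagePercentage` and `Evictions` in the `AWS/ElastiCache` namespace.

```python
def elasticache_alarm_kwargs(cluster_id, metric_name, threshold, statistic, sns_topic_arn):
    """Build put_metric_alarm arguments for one ElastiCache metric.

    The alarm fires when the statistic over a single 5-minute period
    is greater than the threshold.
    """
    return {
        "AlarmName": f"{cluster_id}-{metric_name}-high",  # hypothetical naming scheme
        "Namespace": "AWS/ElastiCache",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "Statistic": statistic,
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def example_alarms(cluster_id, sns_topic_arn):
    """The two alarms described in the text: memory above 85% and any evictions."""
    return [
        elasticache_alarm_kwargs(
            cluster_id, "DatabaseMemoryUsagePercentage", 85, "Average", sns_topic_arn
        ),
        elasticache_alarm_kwargs(cluster_id, "Evictions", 0, "Sum", sns_topic_arn),
    ]
```

With boto3 installed and credentials configured, each dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`.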
Advanced Redis Monitoring with Redis CLI and CloudWatch Agent
While CloudWatch provides essential metrics, direct interaction with Redis via the CLI can offer deeper, real-time insights. For metrics not exposed by default by ElastiCache, you can use the CloudWatch Agent to collect custom metrics.
Using Redis CLI for Diagnostics
Connect to your ElastiCache Redis endpoint using redis-cli. You’ll need to retrieve the primary endpoint from the AWS console.
# Install redis-tools if not already present
# sudo apt-get install redis-tools (Debian/Ubuntu)
# sudo yum install redis (CentOS/RHEL)
redis-cli -h your-elasticache-redis-primary-endpoint.cache.amazonaws.com -p 6379
Once connected, use commands like:
INFO memory
INFO persistence
INFO stats
INFO clients
MONITOR           # Use with extreme caution in production, very verbose!
SLOWLOG GET 10    # View slow commands
These commands provide granular details about memory allocation, command execution statistics, and client connections. ElastiCache restricts the CONFIG command, so read maxmemory and maxmemory_policy from the INFO memory output, and change maxmemory-policy through the cluster’s parameter group. Analyzing SLOWLOG is particularly useful for identifying performance bottlenecks caused by specific Redis commands.
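INFO output is plain text made of `key:value` lines grouped under `# Section` headers, which makes it easy to post-process. The sketch below parses that text and derives a cache hit ratio from the `keyspace_hits` and `keyspace_misses` fields of `INFO stats`. The function names are hypothetical, and the sample text is made up for illustration.

```python
def parse_info(raw):
    """Parse the text returned by the Redis INFO command into a dict of strings."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory"
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def hit_ratio(stats):
    """Return hits / (hits + misses), or None when there have been no lookups."""
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


if __name__ == "__main__":
    sample = "# Stats\r\nkeyspace_hits:90\r\nkeyspace_misses:10\r\n"
    print(hit_ratio(parse_info(sample)))
```

The counters are cumulative since the server started, so for trend analysis it is usually more informative to compute the ratio from the difference between two samples.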
Custom Metrics with CloudWatch Agent
If you need to monitor specific Redis metrics not available through ElastiCache’s default CloudWatch integration (e.g., detailed command latency per command type), you can deploy the CloudWatch Agent on an EC2 instance within the same VPC and configure it to execute custom scripts that gather and push metrics.
Create a custom script (e.g., /opt/cloudwatch-agent/scripts/redis_custom_metrics.sh) that uses redis-cli to fetch data and format it for the agent.
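#!/bin/bash
REDIS_HOST="your-elasticache-redis-primary-endpoint.cache.amazonaws.com"
REDIS_PORT="6379"

# Example: Get number of keys
# DBSIZE is O(1); KEYS "*" scans every key and blocks the server while it runs
NUM_KEYS=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" DBSIZE)

# Example: Get memory usage
# INFO lines look like "used_memory:1048576" and end in \r, so strip it and split on ":"
MEMORY_USAGE_MB=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" INFO memory | tr -d '\r' | awk -F: '$1 == "used_memory" {print $2 / 1024 / 1024}')

# Output the values as JSON. The layout follows the CloudWatch embedded metric format:
# the "_aws" block declares the metrics and the top-level keys carry their values.
cat <<EOF
{
  "_aws": {
    "Timestamp": $(($(date +%s) * 1000)),
    "CloudWatchMetrics": [
      {
        "Namespace": "MyRedisMetrics",
        "Dimensions": [
          ["RedisInstance"]
        ],
        "Metrics": [
          {
            "Name": "NumberOfKeys",
            "Unit": "Count"
          },
          {
            "Name": "MemoryUsageMB",
            "Unit": "Megabytes"
          }
        ]
      }
    ]
  },
  "RedisInstance": "$REDIS_HOST",
  "NumberOfKeys": $NUM_KEYS,
  "MemoryUsageMB": $MEMORY_USAGE_MB
}
EOF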
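Configure the CloudWatch Agent’s amazon-cloudwatch-agent.json file to collect these metrics, and ensure the agent runs with IAM permissions to write to CloudWatch (cloudwatch:PutMetricData). Treat the exec section below as a sketch and verify it against the agent configuration reference for your version: the agent’s documented custom-metric inputs are StatsD, collectd, and embedded metric format logs. If exec is not supported, the script can publish its values directly with aws cloudwatch put-metric-data.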
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyRedisMetrics",
    "metrics_collected": {
      "exec": [
        {
          "command_file": "/opt/cloudwatch-agent/scripts/redis_custom_metrics.sh",
          "timeout": 10
        }
      ]
    }
  }
}
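Custom values can also be published without going through the agent, using the CloudWatch `PutMetricData` API. The sketch below builds the `MetricData` payload for the two example metrics and passes it to boto3's `put_metric_data`. The function names are hypothetical, the namespace and dimension name are reused from the example above, and the publishing step assumes boto3 is installed with credentials that allow `cloudwatch:PutMetricData`.

```python
def build_metric_data(instance, num_keys, used_memory_bytes):
    """Build the MetricData list for CloudWatch put_metric_data."""
    dimensions = [{"Name": "RedisInstance", "Value": instance}]
    return [
        {
            "MetricName": "NumberOfKeys",
            "Dimensions": dimensions,
            "Value": float(num_keys),
            "Unit": "Count",
        },
        {
            "MetricName": "MemoryUsageMB",
            "Dimensions": dimensions,
            # Convert bytes (as reported by used_memory) to megabytes
            "Value": used_memory_bytes / (1024 * 1024),
            "Unit": "Megabytes",
        },
    ]


def publish(metric_data, namespace="MyRedisMetrics"):
    """Send the metrics to CloudWatch. Requires boto3 and AWS credentials."""
    import boto3  # third-party; imported here so the builder works without it

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=metric_data
    )
```

A cron job or systemd timer could call `publish(build_metric_data(...))` once a minute with values read from `DBSIZE` and `INFO memory`.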
Alerting and Incident Response Strategy
Effective monitoring is incomplete without a robust alerting and incident response strategy. Define clear Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for your application and Redis clusters. Alerts should be actionable and routed to the appropriate teams.
Alerting with Prometheus Alertmanager
Prometheus Alertmanager handles deduplication, grouping, and routing of alerts generated by Prometheus. Configure alert rules in Prometheus based on the metrics collected.
# prometheus.rules.yml
groups:
  - name: python_app_alerts
    rules:
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{endpoint="/"}[5m])) by (le, endpoint)) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request latency on / endpoint"
          description: "95th percentile latency for / endpoint is {{ $value }}s, exceeding SLO."
      - alert: AppUnhealthy
        expr: up{job="python_app_metrics"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Python application is down"
          description: "The python_app_metrics job is not reachable."
  # The Redis rules assume the CloudWatch metrics are exported to Prometheus
  # (for example through a CloudWatch exporter); metric and label names depend on that setup.
  - name: redis_alerts
    rules:
      - alert: RedisHighMemoryUsage
        # MemoryUsageMB is an absolute value; set the threshold relative to the node's maxmemory
        expr: avg(MemoryUsageMB{job="elasticache_redis"}) by (RedisInstance) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage high"
          description: "Redis instance {{ $labels.RedisInstance }} is using {{ $value }}MB, above the 85MB threshold."
      - alert: RedisEvictionsOccurred
        expr: sum(rate(aws_elasticache_evictions[5m])) by (CacheClusterId) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Redis evictions detected"
          description: "Redis cluster {{ $labels.CacheClusterId }} is experiencing evictions, indicating memory pressure."
Configure Alertmanager to route these alerts to Slack, PagerDuty, or email. Ensure alert severity levels (e.g., warning, critical) are appropriately mapped to escalation policies.
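The HighRequestLatency rule depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw observations. The sketch below is a simplified pure-Python version of that estimate for classic histograms, assuming linear interpolation within a bucket and a lowest bucket that starts at 0. It is meant to show why the result depends on bucket boundaries and is not a reimplementation of the Prometheus function; the bucket values are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with the last bucket having upper_bound = math.inf.
    """
    if not buckets or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile falls in the +Inf bucket: the best available
                # answer is the highest finite bucket boundary.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return math.nan


if __name__ == "__main__":
    example = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
    print(histogram_quantile(0.5, example))
    print(histogram_quantile(0.95, example))
```

In the example, the 95th percentile falls in the `+Inf` bucket, so the estimate is capped at the highest finite boundary (1.0) and a `> 1.0` comparison would never fire. Bucket boundaries should therefore extend past any threshold used in an alert expression.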
Incident Response Playbooks
Develop clear, documented playbooks for common alert scenarios. For example:
- High Redis Memory Usage: Playbook might involve checking SLOWLOG, identifying memory-intensive keys, scaling up the ElastiCache node size, or adjusting the maxmemory-policy.
- Application Unreachable: Playbook could include checking ELB health checks, restarting the application process on EC2, examining application logs for errors, and verifying underlying EC2 instance health.
- High Request Latency: Playbook might involve analyzing Prometheus metrics for resource contention (CPU, memory), checking Redis performance, and profiling the Python application code.
Regularly review and update these playbooks based on post-incident analyses. Automating parts of the response, where safe and feasible, can significantly reduce Mean Time To Recovery (MTTR).