Server Monitoring Best Practices: Keeping Your Python App and DynamoDB Clusters Alive on OVH
Proactive Health Checks for Python Applications
Maintaining the health of your Python applications, especially those interacting with distributed systems like DynamoDB, requires a multi-layered monitoring strategy. Beyond basic process uptime, we need to inspect application-level metrics, request latency, error rates, and resource utilization. For OVH-hosted environments, this often involves a combination of OS-level tools and application-specific instrumentation.
Application-Level Metrics with Prometheus and Exporters
Prometheus is a de facto standard for metrics collection. For Python applications, the prometheus_client library is indispensable. We’ll instrument our application to expose key metrics, such as request counts, latency histograms, and custom business logic counters. This data is then scraped by a Prometheus server.
Instrumenting a Flask Application
Consider a simple Flask application that interacts with DynamoDB. We’ll add instrumentation to track API endpoint performance and DynamoDB call durations.
from flask import Flask, request, jsonify
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import boto3
import os
# Initialize Flask app
app = Flask(__name__)
# Initialize Prometheus metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status_code'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request duration in seconds', ['method', 'endpoint'])
DYNAMODB_LATENCY = Histogram('dynamodb_operation_duration_seconds', 'DynamoDB operation duration in seconds', ['operation'])
DYNAMODB_ERRORS = Counter('dynamodb_errors_total', 'Total DynamoDB errors', ['operation'])
ACTIVE_CONNECTIONS = Gauge('active_connections', 'Number of active connections to external services')
# Initialize DynamoDB client
# Ensure AWS credentials and region are configured. EC2 instance roles are not available on OVH instances, so supply credentials via environment variables or a shared credentials file.
region = os.environ.get("AWS_REGION", "eu-west-1")
dynamodb = boto3.resource('dynamodb', region_name=region)
table_name = os.environ.get("DYNAMODB_TABLE", "my-app-data")
table = dynamodb.Table(table_name)
# Simulate active connections
ACTIVE_CONNECTIONS.set(1) # Example: Set to 1 for a single DynamoDB connection pool
# Time every call to get_items(); the decorator records the duration when the function returns.
@REQUEST_LATENCY.labels(method='GET', endpoint='/items').time()
def get_items():
    try:
        start_time = time.time()
        # A single scan() call returns at most 1 MB of data; larger tables need pagination.
        response = table.scan()
        duration = time.time() - start_time
        DYNAMODB_LATENCY.labels(operation='scan').observe(duration)
        # Count the request exactly once, with the status code actually returned.
        REQUEST_COUNT.labels(method='GET', endpoint='/items', status_code=200).inc()
        return jsonify(response.get('Items', [])), 200
    except Exception as e:
        DYNAMODB_ERRORS.labels(operation='scan').inc()
        REQUEST_COUNT.labels(method='GET', endpoint='/items', status_code=500).inc()
        return jsonify({"error": str(e)}), 500

@app.route('/items', methods=['GET'])
def items_handler():
    return get_items()
# Add more routes and instrumentation as needed...
if __name__ == '__main__':
    # Start Prometheus metrics server on a separate port (e.g., 9091)
    # This is crucial: don't expose metrics on the same port as your application if possible,
    # or ensure proper routing/security.
    start_http_server(9091)
    print("Prometheus metrics server started on port 9091")
    # Start Flask app on default port 5000
    app.run(host='0.0.0.0', port=5000, debug=False)
In this example, we’re tracking:
- Total HTTP requests per method, endpoint, and status code.
- Latency distribution for HTTP requests.
- Latency distribution for DynamoDB operations (specifically ‘scan’ in this case).
- Count of errors encountered during DynamoDB operations.
- A gauge for the number of active connections (a simplified example).
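The timing and error-counting pattern used around `table.scan()` can be factored into a reusable helper. The following is a minimal sketch, not part of the original application: `timed_operation` is a hypothetical name, and the two list-based recorders are stand-ins so the example runs without `prometheus_client` installed.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed_operation(
    observe: Callable[[float], None],
    on_error: Callable[[], None],
) -> Iterator[None]:
    """Time the wrapped block, report its duration, and count failures."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        on_error()  # count the failure, then let the caller handle it
        raise
    finally:
        # Runs on success and on failure, so slow failures are measured too.
        observe(time.perf_counter() - start)


# Stand-in recorders (assumptions for this sketch only).
durations: list[float] = []
errors: list[str] = []

with timed_operation(durations.append, lambda: errors.append("scan")):
    time.sleep(0.01)  # placeholder for table.scan()
```

In the Flask application, the two callbacks would typically be `DYNAMODB_LATENCY.labels(operation='scan').observe` and `DYNAMODB_ERRORS.labels(operation='scan').inc`, which keeps every DynamoDB call site down to a single `with` statement.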
Configuring Prometheus Scrape Targets
On your OVH instance(s) running the Python application, you’ll need a Prometheus agent or the main Prometheus server configured to scrape these metrics. Assuming Prometheus is running elsewhere (e.g., a dedicated monitoring VM or a managed service), you’d add a scrape configuration like this to your prometheus.yml:
scrape_configs:
  - job_name: 'python_app_metrics'
    static_configs:
      - targets: ['your_app_instance_ip:9091'] # Replace with the actual IP and port
        labels:
          environment: 'production'
          application: 'my-python-app'
          instance: 'app-server-01' # Or use service discovery
If you’re using dynamic service discovery (e.g., Consul, Kubernetes), you’d replace `static_configs` with the matching discovery block, such as `consul_sd_configs` or `kubernetes_sd_configs`.
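A lighter-weight option than Consul or Kubernetes is Prometheus's file-based discovery (`file_sd_configs`), which reads target groups from JSON or YAML files and picks up changes without a restart. The sketch below generates such a file. The function names, host addresses, and output path are assumptions made for illustration.

```python
import json
from pathlib import Path


def build_target_groups(hosts, port, labels):
    """Return target groups in the JSON layout file_sd_configs expects."""
    targets = [f"{host}:{port}" for host in sorted(hosts)]
    return [{"targets": targets, "labels": dict(labels)}]


def write_targets_file(path: Path, groups) -> None:
    """Write the file atomically so Prometheus never reads a partial file."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(groups, indent=2))
    tmp.replace(path)


# Hypothetical application hosts exposing metrics on port 9091.
example_groups = build_target_groups(
    ["10.0.0.6", "10.0.0.5"],
    9091,
    {"environment": "production", "application": "my-python-app"},
)
```

Calling `write_targets_file(Path("/etc/prometheus/targets/python_app.json"), example_groups)` (a hypothetical path) and listing that path under `file_sd_configs` in the scrape job would replace the hard-coded target list.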
Monitoring DynamoDB Performance and Health
DynamoDB, being a managed service, abstracts away much of the underlying infrastructure. However, its performance and cost are directly tied to your provisioned throughput, consumed capacity, and query patterns. Monitoring these aspects is critical.
Key DynamoDB Metrics to Watch
AWS CloudWatch provides a wealth of metrics for DynamoDB. We need to focus on metrics that indicate throttling, high latency, and inefficient resource utilization.
- ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits: Essential for understanding your actual usage against provisioned capacity. Spikes here can indicate performance issues or cost overruns.
- ProvisionedReadCapacityUnits and ProvisionedWriteCapacityUnits: Your configured limits.
- ReadThrottleEvents and WriteThrottleEvents: Direct indicators of throttling. These should ideally be zero.
- SuccessfulRequestLatency: The average latency for successful requests. High latency suggests contention or inefficient queries.
- SystemErrors: Indicates issues on the AWS side, though rare.
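- ItemCount and TableSizeBytes: Useful for understanding data volume and growth. Note that these come from the DynamoDB DescribeTable API rather than CloudWatch, and are refreshed roughly every six hours.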
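Consumed capacity is most useful when compared against the provisioned limit. CloudWatch reports the consumed metrics as a total over the period, so the `Sum` has to be divided by the period length before comparing. A minimal sketch of that arithmetic, with a hypothetical function name and made-up sample numbers:

```python
def capacity_utilization(
    consumed_sum: float,
    period_seconds: int,
    provisioned_units: float,
) -> float:
    """Fraction of provisioned capacity used, averaged over one period.

    consumed_sum is the CloudWatch 'Sum' statistic for the period, so it is
    converted to units per second before comparing with the provisioned
    per-second limit.
    """
    if period_seconds <= 0 or provisioned_units <= 0:
        raise ValueError("period and provisioned capacity must be positive")
    consumed_per_second = consumed_sum / period_seconds
    return consumed_per_second / provisioned_units


# Illustrative numbers: 24,000 read units consumed in 5 minutes against
# 100 provisioned read capacity units.
utilization = capacity_utilization(24_000, 300, 100)
```

This only applies to tables in provisioned mode; on-demand tables have no provisioned figure to divide by. Because the result is an average over the period, short bursts can still be throttled even when the ratio looks comfortable.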
Integrating CloudWatch Metrics with Prometheus/Grafana
While you can monitor CloudWatch directly, integrating these metrics into your primary monitoring dashboard (e.g., Grafana) provides a unified view. The cloudwatch_exporter is a popular tool for this. You’ll deploy it on an instance that has AWS credentials and configure it to scrape specific DynamoDB metrics.
Deploying and Configuring cloudwatch_exporter
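First, install the exporter. The command below assumes a pip-installable build; the official Prometheus cloudwatch_exporter is distributed as a Java JAR and Docker image instead, so follow the installation steps for the exporter you actually deploy: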
pip install cloudwatch_exporter
Next, create a configuration file (e.g., cloudwatch_exporter.yml) to specify which metrics to collect. Ensure the instance running this exporter has appropriate IAM permissions to access CloudWatch metrics for your DynamoDB tables.
# cloudwatch_exporter.yml
# Key names follow this article's example schema; verify them against your exporter's documentation.
region: 'eu-west-1' # Your AWS region
metrics:
  - name: 'ConsumedReadCapacityUnits'
    namespace: 'AWS/DynamoDB'
    dimensions:
      - name: 'TableName'
        value: 'my-app-data' # Replace with your table name
    statistics: ['Sum']
    period: 300 # 5 minutes
  - name: 'ConsumedWriteCapacityUnits'
    namespace: 'AWS/DynamoDB'
    dimensions:
      - name: 'TableName'
        value: 'my-app-data'
    statistics: ['Sum']
    period: 300
  - name: 'ReadThrottleEvents'
    namespace: 'AWS/DynamoDB'
    dimensions:
      - name: 'TableName'
        value: 'my-app-data'
    statistics: ['Sum']
    period: 300
  - name: 'WriteThrottleEvents'
    namespace: 'AWS/DynamoDB'
    dimensions:
      - name: 'TableName'
        value: 'my-app-data'
    statistics: ['Sum']
    period: 300
  - name: 'SuccessfulRequestLatency'
    namespace: 'AWS/DynamoDB'
    dimensions:
      - name: 'TableName'
        value: 'my-app-data'
      - name: 'Operation' # CloudWatch reports this metric per table and operation
        value: 'Scan'
    statistics: ['Average']
    period: 300
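Before relying on the exporter, it can help to confirm that CloudWatch actually returns datapoints for a given metric and dimension set. The sketch below builds the arguments for boto3's `get_metric_statistics` call. `metric_query` and `fetch_datapoints` are hypothetical helper names, and the fetch step assumes boto3 is installed and credentials with CloudWatch read access are configured.

```python
from datetime import datetime, timedelta, timezone


def metric_query(table_name, metric_name, statistic,
                 period_seconds=300, lookback_minutes=30, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/DynamoDB",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "StartTime": end - timedelta(minutes=lookback_minutes),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": [statistic],
    }


def fetch_datapoints(query, region="eu-west-1"):
    """Run the query and return datapoints ordered by time."""
    import boto3  # imported lazily so the builder works without AWS access

    client = boto3.client("cloudwatch", region_name=region)
    response = client.get_metric_statistics(**query)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

For example, `fetch_datapoints(metric_query("my-app-data", "ReadThrottleEvents", "Sum"))` should return a list of datapoints; an empty list usually means the metric had no activity in the window or the dimensions do not match what DynamoDB publishes.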
Run the exporter, typically as a systemd service:
# Example systemd service file (e.g., /etc/systemd/system/cloudwatch_exporter.service)
[Unit]
Description=CloudWatch Exporter
After=network.target

[Service]
User=your_user
Group=your_group
ExecStart=/usr/local/bin/cloudwatch_exporter --config cloudwatch_exporter.yml
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
Reload systemd, then enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable cloudwatch_exporter
sudo systemctl start cloudwatch_exporter
sudo systemctl status cloudwatch_exporter
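Then, configure Prometheus to scrape the exporter's endpoint. This example uses port 9118; the official Prometheus cloudwatch_exporter listens on 9106 by default, so use whichever port your exporter is started with: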
scrape_configs:
  - job_name: 'dynamodb_metrics'
    static_configs:
      - targets: ['your_exporter_instance_ip:9118'] # IP of the instance running cloudwatch_exporter
        labels:
          environment: 'production'
          service: 'dynamodb'
          table: 'my-app-data'
OS-Level Monitoring on OVH Instances
Even with robust application and database monitoring, the underlying operating system needs attention. OVH instances, like any server, can suffer from resource exhaustion, network issues, or unexpected process failures.
Essential OS Metrics
- CPU Utilization: High CPU can indicate inefficient code, runaway processes, or insufficient resources.
- Memory Usage: Swapping to disk is a performance killer. Monitor free memory and swap usage.
- Disk I/O: High disk I/O wait times can bottleneck applications, especially those performing frequent disk operations.
- Network Traffic: Monitor bandwidth usage and network errors (e.g., dropped packets).
- Process Uptime and Resource Usage: Ensure your Python application process is running and not consuming excessive resources.
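To make the memory item concrete, the sketch below computes "used" memory the same way the `HighMemoryUsage` alert later in this article does, as total minus available, but directly from the text of `/proc/meminfo`. It is Linux-specific; the function names and sample values are assumptions for illustration.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo content into a mapping of field name to bytes.

    Fields reported in kB are converted to bytes; unit-less fields
    (such as HugePages_Total) are kept as plain counts.
    """
    values: dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        value = int(fields[0])
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        values[key.strip()] = value
    return values


def memory_used_percent(meminfo: dict[str, int]) -> float:
    """Share of memory in use, based on MemAvailable rather than MemFree."""
    total = meminfo["MemTotal"]
    return (total - meminfo["MemAvailable"]) / total * 100


# Sample input with made-up values; on a real host, read /proc/meminfo instead.
SAMPLE_MEMINFO = """MemTotal:        8000000 kB
MemFree:          500000 kB
MemAvailable:    2000000 kB
HugePages_Total:       0
"""

used_percent = memory_used_percent(parse_meminfo(SAMPLE_MEMINFO))
```

Using `MemAvailable` instead of `MemFree` matters because Linux counts reclaimable page cache as "not free", which would otherwise make a healthy host look nearly full.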
Node Exporter for Prometheus
The node_exporter is the standard Prometheus exporter for hardware and OS metrics. It’s straightforward to deploy on your OVH instances.
Installation and Configuration
Download a release from the node_exporter GitHub repository. The commands below pin v1.7.0 for Linux amd64; check the releases page for the current version:
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo rm -rf node_exporter-1.7.0.linux-amd64*
Create a systemd service file (e.g., /etc/systemd/system/node_exporter.service):
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
Reload systemd, then enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter
Configure Prometheus to scrape the node_exporter (default port 9100):
scrape_configs:
  - job_name: 'node_metrics'
    static_configs:
      - targets: ['your_app_instance_ip:9100'] # IP of the OVH instance
        labels:
          environment: 'production'
          instance: 'app-server-01'
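The CPU series scraped from this target (`node_cpu_seconds_total`) are cumulative counters per core and mode, so utilization has to be derived from the change between two samples. The sketch below mirrors the arithmetic of the `100 - avg(rate(idle)) * 100` expression used for CPU alerting. The function name and sample values are illustrative assumptions, and counter resets are ignored for brevity.

```python
def cpu_busy_percent(idle_before, idle_after, elapsed_seconds):
    """Average busy percentage across cores between two counter samples.

    idle_before and idle_after hold the cumulative idle seconds of each
    core at the start and end of the window.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if not idle_before or len(idle_before) != len(idle_after):
        raise ValueError("need matching, non-empty per-core samples")
    # Per-core idle rate: idle seconds gained per wall-clock second (0..1).
    idle_rates = [
        (after - before) / elapsed_seconds
        for before, after in zip(idle_before, idle_after)
    ]
    average_idle = sum(idle_rates) / len(idle_rates)
    return 100 - average_idle * 100


# Two cores sampled 60 seconds apart (made-up values): the cores were idle
# for 6 and 12 of those seconds respectively.
busy = cpu_busy_percent([1000.0, 2000.0], [1006.0, 2012.0], 60.0)
```

Averaging across cores is what makes the figure comparable between instance sizes, but it also hides a single saturated core, which matters for a single-threaded Python process.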
Alerting Strategies
Collecting metrics is only half the battle. Proactive alerting ensures you’re notified *before* users are impacted. Prometheus Alertmanager is the standard companion for Prometheus.
Defining Alerting Rules
Alerting rules are defined in Prometheus itself, typically in separate rule files referenced in prometheus.yml. Here are some critical alerts for our setup:
groups:
  - name: python_app_alerts
    rules:
      - alert: HighHttpRequestLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="python_app_metrics"}[5m])) by (le, endpoint, method)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.endpoint }} ({{ $labels.method }})"
          description: "95th percentile latency for {{ $labels.endpoint }} ({{ $labels.method }}) is above 2s for 5 minutes."
      - alert: HighDynamoDBScanLatency
        expr: histogram_quantile(0.95, sum(rate(dynamodb_operation_duration_seconds_bucket{job="python_app_metrics", operation="scan"}[5m])) by (le)) > 1.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for DynamoDB scan operations"
          description: "95th percentile latency for DynamoDB scan operations is above 1.5s for 5 minutes."
      # dynamodb_errors_total counts every failed scan, not only throttled ones,
      # so this alert is named for what it actually measures.
      - alert: DynamoDBScanErrors
        expr: sum(rate(dynamodb_errors_total{job="python_app_metrics", operation="scan"}[5m])) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "DynamoDB errors detected for scan operations"
          description: "DynamoDB scan operations are failing. Throttling is one possible cause; check consumed vs provisioned capacity."
      - alert: HighCPUUtilization
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU utilization on {{ $labels.instance }}"
          description: "CPU utilization on {{ $labels.instance }} is above 90% for 5 minutes."
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage on {{ $labels.instance }} is above 85% for 5 minutes."
      - alert: PythonAppNotRunning
        expr: up{job="python_app_metrics"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Python application is down on {{ $labels.instance }}"
          description: "The Python application scrape target is down."
These rules should be added to your Prometheus configuration and reloaded. Prometheus will then evaluate these expressions and send firing alerts to Alertmanager.
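The latency alerts depend on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below re-implements that idea in plain Python to show why bucket boundaries limit the precision of the result. It is an illustration rather than Prometheus's own code, and the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must contain at least one finite bucket plus the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need a finite bucket and a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # The +Inf bucket always satisfies the condition, so the loop breaks.
    for index, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The rank falls beyond the last finite bucket: report its bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Made-up cumulative counts: 50 requests <= 0.5s, 80 <= 1s, 98 <= 2.5s, 100 total.
example_buckets = [(0.5, 50), (1.0, 80), (2.5, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example_buckets)  # roughly 2.25 seconds
```

In the alert expressions the inputs are per-second bucket rates rather than raw counts, but the interpolation is the same. With a wide bucket such as 1s to 2.5s, a 2s threshold sits in the middle of an interpolated range, so choosing bucket boundaries near your alert thresholds gives more trustworthy alerts.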
Configuring Alertmanager
Alertmanager handles deduplication, grouping, and routing of alerts to various receivers (email, Slack, PagerDuty, etc.). A basic alertmanager.yml might look like this:
global:
  # The SMTP server used to send email alerts
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com' # Replace with your sender address
  smtp_auth_username: 'alertmanager@example.com' # Replace with your SMTP username
  smtp_auth_password: 'your_smtp_password'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver if no specific route matches
  routes:
    - receiver: 'critical-alerts'
      matchers:
        - severity="critical"
      continue: true # Allow further routing if needed

receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
        channel: '#alerts-general'
  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
        channel: '#alerts-critical'
    pagerduty_configs:
      - service_key: 'your_pagerduty_integration_key'
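Ensure your Prometheus configuration points to your Alertmanager instance via the alerting.alertmanagers section of prometheus.yml (Alertmanager listens on port 9093 by default).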
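The effect of `group_by: ['alertname', 'job']` is easiest to see on a small example: alerts that share the same values for those labels are batched into one notification. The sketch below imitates only that grouping step, not Alertmanager's actual implementation; the function name and sample alerts are assumptions.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels.

    Each alert is a dict of labels; a missing label is treated as empty.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts from two instances.
firing = [
    {"alertname": "HighCPUUtilization", "job": "node_metrics", "instance": "app-server-01"},
    {"alertname": "HighCPUUtilization", "job": "node_metrics", "instance": "app-server-02"},
    {"alertname": "PythonAppNotRunning", "job": "python_app_metrics", "instance": "app-server-01"},
]

notifications = group_alerts(firing, ["alertname", "job"])
```

Here the two CPU alerts collapse into a single notification listing both instances, while the application-down alert is sent separately. Adding `instance` to `group_by` would instead produce one notification per host, which is noisier during a fleet-wide incident.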
Log Aggregation and Analysis
While metrics tell you *what* is happening, logs tell you *why*. Centralized log aggregation is crucial for debugging issues that metrics alone can’t explain.
Log Shipping with Fluentd/Filebeat
On your OVH instances, you can use agents like Fluentd or Filebeat to tail application logs and ship them to a central logging backend (e.g., Elasticsearch, Loki, Splunk).
Filebeat Configuration Example
Assuming your Python application logs to /var/log/my-python-app/app.log, configure Filebeat’s filebeat.yml:
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/my-python-app/app.log
    json.keys_under_root: true # If your app logs in JSON format
    # If not JSON, you might need to parse logs with processors or use multiline settings

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~

output.elasticsearch: # Or output.logstash, output.redis, etc.
  hosts: ["your_elasticsearch_host:9200"]
  # username: "elastic"
  # password: "changeme"

# If using Logstash for parsing/filtering:
# output.logstash:
#   hosts: ["your_logstash_host:5044"]
Start and enable the Filebeat service:
sudo systemctl enable filebeat
sudo systemctl start filebeat
sudo systemctl status filebeat
Analyzing Logs
Once logs are aggregated, use your chosen backend’s query language (e.g., KQL for Elasticsearch, LogQL for Loki) to search for errors, trace requests across services, and understand application behavior. For example, search for `level:error` or `trace_id:xyz` in your aggregated logs.
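Queries such as `level:error` or `trace_id:xyz` only work if the application emits those fields in a structured form. One way to do that with the standard library is a JSON formatter for the `logging` module, sketched below. The field names, the `JsonFormatter` class, and the `build_logger` helper are assumptions for illustration, not part of the original application.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        trace_id = getattr(record, "trace_id", None)
        if trace_id is not None:
            payload["trace_id"] = trace_id
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("my-python-app")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `logger.error("scan failed", extra={"trace_id": "xyz"})` then produces a line whose `level` and `trace_id` keys can be searched directly. In production the stream handler would typically be swapped for a `logging.FileHandler` pointing at the file your log shipper tails.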
Conclusion: A Holistic Approach
Effective server monitoring for a Python application interacting with DynamoDB on OVH is not a single tool or metric. It’s a layered strategy encompassing application-level instrumentation, database-specific insights, OS health, and centralized logging. By combining Prometheus for metrics, CloudWatch for managed service insights, Node Exporter for OS health, and a log aggregation system, you build a resilient monitoring framework that allows for proactive issue detection and rapid resolution.