Server Monitoring Best Practices: Keeping Your Python App and PostgreSQL Clusters Alive on DigitalOcean
Establishing a Baseline: Essential Metrics for Python Apps and PostgreSQL
Effective server monitoring begins with understanding what “normal” looks like for your specific stack. For a Python application, this means tracking request latency, error rates, and resource utilization (CPU, RAM, disk I/O). For a PostgreSQL cluster, the focus shifts to query performance, connection counts, replication lag, and disk space. DigitalOcean’s built-in monitoring provides a good starting point, but for production environments, we need deeper insights and proactive alerting.
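One way to make “normal” concrete is to summarize historical samples into a baseline and compare new readings against it. The sketch below is a minimal illustration rather than a recommended detector: the function names, the sample latencies, and the three-sigma threshold are all assumptions chosen for the example.

```python
import statistics

def build_baseline(samples):
    """Summarize historical latency samples (in seconds) into a baseline."""
    ordered = sorted(samples)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
    p95 = statistics.quantiles(ordered, n=100)[94]
    return {
        "mean": statistics.fmean(ordered),
        "stdev": statistics.stdev(ordered),
        "p95": p95,
    }

def is_anomalous(value, baseline, sigmas=3.0):
    """Flag a reading more than `sigmas` standard deviations above the mean."""
    return value > baseline["mean"] + sigmas * baseline["stdev"]

# Hypothetical request latencies collected during a quiet week
history = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13]
baseline = build_baseline(history)

print(is_anomalous(0.14, baseline))  # a typical reading
print(is_anomalous(0.90, baseline))  # a clear outlier
```

In practice the baseline would come from weeks of real data, and a fixed multiple of the standard deviation is only a rough starting point for skewed metrics such as latency.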
Python Application Monitoring with Prometheus and Grafana
Prometheus is the de facto standard for time-series monitoring in cloud-native environments. We’ll instrument our Python application to expose metrics that Prometheus can scrape.
First, install the necessary Prometheus client library for Python:
pip install prometheus_client
Next, modify your Python application (e.g., using Flask or Django) to expose a `/metrics` endpoint. Here’s a simple Flask example:
from time import perf_counter

from flask import Flask, Response, g, request
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)

# Custom metrics. Counter and Histogram are the directly instrumentable types;
# the *MetricFamily classes are intended for custom collectors.
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
error_count = Counter('http_errors_total', 'Total HTTP errors', ['method', 'endpoint'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request latency in seconds', ['method', 'endpoint'])

@app.before_request
def start_timer():
    # Remember when the request started so latency can be observed later
    g.start_time = perf_counter()

@app.after_request
def record_metrics(response):
    # Label by route pattern, not raw path, to keep label cardinality low
    endpoint = request.url_rule.rule if request.url_rule else 'unmatched'
    request_count.labels(request.method, endpoint, str(response.status_code)).inc()
    request_duration.labels(request.method, endpoint).observe(perf_counter() - g.start_time)
    if response.status_code >= 500:
        error_count.labels(request.method, endpoint).inc()
    return response

@app.route('/metrics')
def metrics():
    # Serve every metric in the default registry in the Prometheus text format
    return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)

@app.route('/')
def hello_world():
    return 'Hello, World!'

@app.route('/error')
def simulate_error():
    # Simulate an error; record_metrics() counts it as a 5xx response
    return 'Internal Server Error', 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Now, configure Prometheus to scrape this endpoint. Assuming you have Prometheus installed on a separate server or within a Docker container, create or update your `prometheus.yml` configuration:
scrape_configs:
  - job_name: 'python_app'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['YOUR_APP_SERVER_IP:5000']  # Replace with your app server's IP and port
After restarting Prometheus, you should see your Python application targets appearing in the Prometheus UI under “Targets”.
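The same check can be scripted against the Prometheus HTTP API, which reports target health at `/api/v1/targets`. The sketch below only parses a response that has already been fetched; the function name and the sample payload are made up for illustration, and the payload includes only the fields the function reads.

```python
def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for every target not reporting 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", "unknown"),
            target.get("scrapeUrl", ""),
            target.get("lastError", ""),
        )
        for target in targets
        if target.get("health") != "up"
    ]

# Hypothetical payload shaped like a response from GET /api/v1/targets
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "python_app", "instance": "10.0.0.2:5000"},
                "scrapeUrl": "http://10.0.0.2:5000/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "python_app", "instance": "10.0.0.3:5000"},
                "scrapeUrl": "http://10.0.0.3:5000/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
        ]
    },
}

for job, url, error in unhealthy_targets(sample):
    print(f"{job}: {url} is down ({error})")
```

To run this against a live server, load the JSON from your Prometheus address (for example with `urllib.request.urlopen`) and pass the decoded dictionary to `unhealthy_targets`.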
PostgreSQL Cluster Monitoring with `postgres_exporter` and Prometheus
For PostgreSQL, the `postgres_exporter` is an excellent choice. It queries PostgreSQL’s system catalogs and exposes metrics in a Prometheus-compatible format.
First, deploy `postgres_exporter`. The easiest way is often via Docker:
docker run -d \
  --name postgres_exporter \
  -p 9187:9187 \
  -e DATA_SOURCE_NAME="postgresql://user:password@your_postgres_host:5432/your_database?sslmode=disable" \
  quay.io/prometheuscommunity/postgres-exporter:latest
Replace `user`, `password`, `your_postgres_host`, and `your_database` with your actual PostgreSQL credentials and connection details. Ensure the user has sufficient privileges to query `pg_stat_activity`, `pg_stat_database`, `pg_locks`, etc. On PostgreSQL 10 and later, membership in the built-in `pg_monitor` role is usually sufficient. For a cluster, run one exporter per PostgreSQL node (the primary and each replica) so that every node is scraped individually.
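Passwords containing characters such as `@`, `:` or `/` can break a connection URI unless they are percent-encoded. The sketch below shows one way to build the `DATA_SOURCE_NAME` value safely. The helper name `build_dsn` and the sample credentials are hypothetical, and it assumes the exporter accepts standard percent-encoded URIs, which is worth verifying against the version you deploy.

```python
from urllib.parse import quote

def build_dsn(user, password, host, database, port=5432, sslmode="disable"):
    """Build a PostgreSQL connection URI with percent-encoded credentials."""
    # quote(..., safe="") also encodes "/", which quote() leaves alone by default
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}?sslmode={sslmode}"
    )

# Hypothetical credentials containing characters that need encoding
dsn = build_dsn("monitor", "p@ss:w/rd", "10.0.0.5", "appdb")
print(dsn)
# postgresql://monitor:p%40ss%3Aw%2Frd@10.0.0.5:5432/appdb?sslmode=disable
```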
Update your `prometheus.yml` to include the `postgres_exporter` targets:
scrape_configs:
  - job_name: 'python_app'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['YOUR_APP_SERVER_IP:5000']
  - job_name: 'postgres_cluster'
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'POSTGRES_NODE1_IP:9187'  # Replace with your PostgreSQL node IPs
          - 'POSTGRES_NODE2_IP:9187'
          - 'POSTGRES_NODE3_IP:9187'
Restart Prometheus. You should now see both your Python apps and PostgreSQL nodes being scraped.
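As the cluster grows, hand-editing target lists becomes error-prone. One alternative, shown here only as a sketch, is to generate a target file for Prometheus's file-based service discovery (`file_sd_configs`), which reads a JSON list of target groups with optional labels. The function name, the node list, and the `role` label values are assumptions for illustration.

```python
import json

def file_sd_targets(nodes, port=9187):
    """Group (host, role) pairs into file_sd target groups labelled by role."""
    groups = {}
    for host, role in nodes:
        groups.setdefault(role, []).append(f"{host}:{port}")
    return [
        {"targets": targets, "labels": {"role": role}}
        for role, targets in groups.items()
    ]

# Hypothetical cluster layout; replace with your own node addresses
nodes = [
    ("POSTGRES_NODE1_IP", "primary"),
    ("POSTGRES_NODE2_IP", "replica"),
    ("POSTGRES_NODE3_IP", "replica"),
]
target_groups = file_sd_targets(nodes)
print(json.dumps(target_groups, indent=2))
```

Writing this output to a file and listing that file under `file_sd_configs` in the `postgres_cluster` job lets Prometheus pick up changes without a restart, and the `role` label then becomes available to queries and alert rules.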
Visualizing with Grafana: Dashboards for Insight
Grafana is the perfect companion to Prometheus for visualizing metrics and creating dashboards. Install Grafana and add Prometheus as a data source.
For Python applications, common metrics to visualize include:
- HTTP Request Rate (per endpoint, per method)
- HTTP Request Latency (average, p95, p99)
- HTTP Error Rate (4xx, 5xx)
- CPU/Memory Usage of the Python process
- Disk I/O of the application server
For PostgreSQL, essential visualizations include:
- Active Connections (total, by state)
- Query Throughput (statements per second)
- Slow Queries (count, duration)
- Replication Lag (if applicable)
- Disk Usage (total, per database)
- Cache Hit Ratios
- Lock Contention
You can find many pre-built Grafana dashboards for Prometheus and PostgreSQL on Grafana.com. Import them and adapt them to your specific needs. For example, a good PostgreSQL dashboard will often show `pg_stat_activity` to identify long-running queries or idle connections.
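To make one of these panels concrete: the cache hit ratio is derived from the `blks_hit` and `blks_read` columns of `pg_stat_database`. The sketch below shows the arithmetic only, with made-up numbers and a hypothetical function name.

```python
def cache_hit_ratio(blks_hit, blks_read):
    """Fraction of block requests served from shared buffers, or None if idle."""
    total = blks_hit + blks_read
    return blks_hit / total if total else None

# Hypothetical counters read from pg_stat_database
print(cache_hit_ratio(990, 10))
print(cache_hit_ratio(0, 0))
```

A dashboard would normally compute this over a time window from the rates of the corresponding exporter counters rather than from lifetime totals, since lifetime totals hide recent regressions.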
Alerting: Proactive Problem Detection with Alertmanager
Monitoring is only half the battle; alerting is crucial for proactive incident response. Prometheus integrates with Alertmanager to handle alerts.
Define alerting rules in Prometheus. Create a `rules.yml` file:
groups:
  - name: python_app_alerts
    rules:
      - alert: HighHttpRequestErrorRate
        expr: |
          sum(rate(http_errors_total{job="python_app"}[5m])) by (endpoint, method)
          /
          sum(rate(http_requests_total{job="python_app"}[5m])) by (endpoint, method)
          > 0.05  # More than 5% error rate over 5 minutes
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.endpoint }} ({{ $labels.method }})"
          description: "The endpoint {{ $labels.endpoint }} is experiencing an error rate above 5%."
      - alert: HighHttpRequestLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="python_app"}[5m])) by (le, endpoint, method))
          > 2  # 95th percentile latency exceeds 2 seconds
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High request latency on {{ $labels.endpoint }} ({{ $labels.method }})"
          description: "95th percentile latency for {{ $labels.endpoint }} is {{ $value }}s."
  - name: postgres_alerts
    rules:
      - alert: HighPostgresConnections
        expr: sum(pg_stat_activity_count{job="postgres_cluster"}) by (datname) > 100  # More than 100 connections to a database
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High number of PostgreSQL connections to {{ $labels.datname }}"
          description: "Database {{ $labels.datname }} has {{ $value }} connections."
      - alert: HighReplicationLag
        # Assumes the scrape targets carry a role label (for example, set in the scrape config)
        expr: pg_replication_lag_seconds{job="postgres_cluster", role="replica"} > 60  # Replication lag exceeds 60 seconds
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High replication lag on replica"
          description: "Replication lag for {{ $labels.instance }} is {{ $value }} seconds."
      - alert: LowDiskSpace
        expr: node_filesystem_avail_bytes{job="node_exporter", mountpoint="/"} / node_filesystem_size_bytes{job="node_exporter", mountpoint="/"} * 100 < 10  # Less than 10% free space on root partition
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          # Single quotes outside so the double quotes in the template do not end the YAML string
          description: 'Filesystem on {{ $labels.instance }} has only {{ printf "%.2f" $value }}% free space.'
Make sure to include this `rules.yml` in your `prometheus.yml`:
rule_files:
  - "rules.yml"  # Path to your rules file
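The `for:` field in each rule means the expression must stay true across consecutive evaluations for that long before the alert fires; until then the alert is only pending, and a single evaluation below the threshold resets the clock. The following sketch simulates that behaviour in a simplified form. It is not how Prometheus is implemented, and the function name and sample values are invented for the example.

```python
def alert_states(samples, threshold, for_seconds):
    """Classify each (timestamp, value) evaluation as inactive, pending or firing."""
    states = []
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            held_for = timestamp - pending_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            pending_since = None  # any dip below the threshold resets the timer
            states.append("inactive")
    return states

# Hypothetical error-rate readings, one evaluation every 60 seconds
samples = [(0, 0.02), (60, 0.08), (120, 0.09), (180, 0.01),
           (240, 0.07), (300, 0.07), (360, 0.07), (420, 0.07)]
states = alert_states(samples, threshold=0.05, for_seconds=120)
print(states)
```

The brief spike at 60 to 120 seconds never fires because it clears before two minutes have passed, which is why a longer `for:` value reduces noise at the cost of slower detection.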
Configure Alertmanager to receive alerts from Prometheus and route them to your preferred notification channels (Slack, PagerDuty, email, etc.).
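If none of the built-in integrations fit, Alertmanager can also POST alerts as JSON to a webhook receiver. The sketch below turns such a payload into one-line notifications. It reads only a few commonly used fields (`alerts`, `status`, `labels`, `annotations`); the function name and sample payload are hypothetical, and a real receiver would also need an HTTP endpoint to accept the request.

```python
def format_alerts(payload):
    """Render each alert in an Alertmanager webhook payload as a one-line message."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        severity = labels.get("severity", "none")
        name = labels.get("alertname", "unknown")
        summary = annotations.get("summary", "(no summary)")
        lines.append(f"[{status}] {severity}: {name} - {summary}")
    return lines

# Hypothetical payload containing one firing and one resolved alert
sample_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "LowDiskSpace", "severity": "critical"},
            "annotations": {"summary": "Low disk space on 10.0.0.5:9100"},
        },
        {
            "status": "resolved",
            "labels": {"alertname": "HighPostgresConnections", "severity": "warning"},
            "annotations": {},
        },
    ],
}

messages = format_alerts(sample_payload)
for message in messages:
    print(message)
```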
DigitalOcean Specifics: Droplet Health and Networking
Beyond application and database metrics, monitor the underlying Droplet health. DigitalOcean’s control panel provides basic CPU, RAM, and network traffic graphs. For more granular OS-level metrics, deploy the node_exporter on each Droplet.
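# Example using Docker for node_exporter
# Mounting the host root and pid namespace lets the exporter report on the Droplet, not the container
docker run -d \
  --name node_exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter:latest \
  --path.rootfs=/host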
Add `node_exporter` to your Prometheus configuration:
scrape_configs:
  # ... other jobs ...
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - 'DROPLET1_IP:9100'  # Replace with your Droplet IPs
          - 'DROPLET2_IP:9100'
          - 'DROPLET3_IP:9100'
Key metrics from `node_exporter` to watch include:
- CPU Usage (overall and per core)
- Memory Usage (free, used, buffers, cache)
- Disk I/O (reads/writes per second, latency)
- Network Traffic (bytes sent/received per second)
- Filesystem Usage (available space)
For PostgreSQL clusters, especially those using synchronous replication, monitor network latency between nodes. High inter-node latency can cause application slowdowns or replication failures. DigitalOcean’s private networking is generally reliable, but it’s good practice to have metrics for it.
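A rough way to sample inter-node latency is to time how long a TCP connection to the peer's PostgreSQL port takes to open. The sketch below measures a single connect; the function name is hypothetical, and connect time is only an approximation of network round-trip time, not a measure of replication delay.

```python
import socket
import time

def tcp_connect_latency(host, port, timeout=2.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return None
```

For example, `tcp_connect_latency('POSTGRES_NODE2_IP', 5432)` called from the primary on a schedule gives a series you can export as a gauge. Take several samples and look at the median, since a single measurement is noisy.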
Log Aggregation and Analysis
Metrics tell you *what* is happening, but logs tell you *why*. Implement a centralized log aggregation system. Tools like Loki, Elasticsearch/Logstash/Kibana (ELK), or Splunk are common choices. For a simpler setup, consider Fluentd or Filebeat to ship logs from your Python app and PostgreSQL instances to a central store.
Configure your Python application to log to `stdout`/`stderr` (if running in containers) or to a dedicated log file. For PostgreSQL, configure `postgresql.conf` to log errors and slow queries:
# In postgresql.conf
log_destination = 'stderr'              # Or 'csvlog' for structured logs
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_min_duration_statement = '250ms'    # Log statements longer than 250ms
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = default
log_lock_waits = on
log_temp_files = 0                      # Log temporary files larger than 0 bytes
log_autovacuum_min_duration = 0         # Log autovacuum actions
Use tools like pg_tail or configure your log shipper to monitor these log files and send them to your aggregation system. This allows you to correlate metric spikes with specific error messages or slow query patterns.
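With `log_min_duration_statement` set, slow statements appear in the log as lines containing `duration: <n> ms` followed by the statement text. The sketch below extracts them so they can be ranked or forwarded. The regular expression, function name, and sample lines are assumptions based on the common stderr log format; a different `log_line_prefix` or the `csvlog` destination would need a different parser.

```python
import re

# Matches e.g. "duration: 312.456 ms  statement: SELECT ..." and
# "duration: 1520.900 ms  execute <unnamed>: UPDATE ..."
DURATION_RE = re.compile(
    r"duration: (?P<ms>\d+(?:\.\d+)?) ms\s+(?:statement|execute [^:]+): (?P<sql>.*)"
)

def slow_statements(lines, min_ms=1000.0):
    """Return (duration_ms, sql) pairs at or above min_ms, slowest first."""
    hits = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match and float(match.group("ms")) >= min_ms:
            hits.append((float(match.group("ms")), match.group("sql").strip()))
    return sorted(hits, reverse=True)

# Hypothetical log lines in the default stderr format
sample_log = [
    "2024-05-01 12:00:00.123 UTC [1234] LOG:  duration: 312.456 ms  statement: SELECT * FROM orders WHERE id = 42",
    "2024-05-01 12:00:02.500 UTC [1235] LOG:  duration: 1520.900 ms  execute <unnamed>: UPDATE orders SET status = $1",
    "2024-05-01 12:00:03.000 UTC [1236] LOG:  connection received: host=10.0.0.2 port=51234",
]

for duration_ms, sql in slow_statements(sample_log, min_ms=250.0):
    print(f"{duration_ms:.1f} ms  {sql}")
```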
High Availability and Disaster Recovery Considerations
Monitoring is a key component of HA/DR. Ensure your monitoring system itself is highly available. If your Prometheus or Grafana instances go down, you lose visibility. Consider running multiple Prometheus instances or using a managed Prometheus service.
For PostgreSQL clusters, monitor replication status rigorously. Alerts for replication lag or broken replication are critical. Ensure you have automated failover mechanisms in place, and that your monitoring system can detect and alert on failover events.
Regularly test your monitoring and alerting. Simulate failures (e.g., stop a PostgreSQL node, introduce errors in your app) to ensure alerts fire correctly and notifications reach the right people. This validation is as important as the initial setup.