Server Monitoring Best Practices: Keeping Your Python App and MongoDB Clusters Alive on Linode
Proactive Health Checks for Python Applications
Maintaining the health of your Python web applications, especially those interacting with MongoDB, requires a multi-layered monitoring approach. Beyond basic uptime checks, we need to inspect application-level metrics, resource utilization, and potential bottlenecks. This section details setting up robust health checks that go beyond simple HTTP 200 responses.
Implementing a Custom Health Endpoint
A dedicated health check endpoint within your Python application provides granular insights. This endpoint should not only confirm the application is running but also verify its critical dependencies, such as database connectivity. For a Flask application, this might look like:
from flask import Flask, jsonify
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

app = Flask(__name__)

# Configuration for MongoDB connection
MONGO_URI = "mongodb://your_mongo_host:27017/"
MONGO_DB_NAME = "your_database"


def check_mongo_connection(uri, db_name):
    """Return (ok, message) describing whether MongoDB answers a ping."""
    client = None
    try:
        client = MongoClient(uri, serverSelectionTimeoutMS=5000)  # 5-second timeout
        client.admin.command('ping')  # A lightweight command to check the connection
        db = client[db_name]
        # Optionally, check if a specific collection exists or perform a small query
        # if db.my_collection.count_documents({}) > 0:
        #     return True, "MongoDB connected and collection accessible."
        return True, "MongoDB connected successfully."
    except ConnectionFailure as e:
        return False, f"MongoDB connection failed: {e}"
    except Exception as e:
        return False, f"An unexpected error occurred with MongoDB: {e}"
    finally:
        # Close the per-check client so repeated probes do not leak connections
        if client is not None:
            client.close()


@app.route('/health')
def health_check():
    is_mongo_ok, mongo_message = check_mongo_connection(MONGO_URI, MONGO_DB_NAME)
    if is_mongo_ok:
        return jsonify({
            "status": "ok",
            "message": "Application and MongoDB are healthy.",
            "dependencies": {
                "mongodb": {
                    "status": "ok",
                    "details": mongo_message
                }
            }
        }), 200
    else:
        return jsonify({
            "status": "error",
            "message": "Application has critical dependency issues.",
            "dependencies": {
                "mongodb": {
                    "status": "error",
                    "details": mongo_message
                }
            }
        }), 503  # Service Unavailable


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
This endpoint provides a JSON response indicating the overall health and the status of the MongoDB connection. A 503 Service Unavailable status code is crucial for load balancers and external monitoring systems to correctly interpret application unavailability.
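As more dependencies are added, the same response shape can be built from a table of checks rather than one branch per dependency. The following is a minimal sketch of that idea, not part of the application above: build_health_response and the "cache" entry are hypothetical names, and each check is assumed to return an (ok, details) pair like check_mongo_connection does.

```python
from typing import Callable, Dict, Tuple

# A check returns (ok, details); the names used here are illustrative only.
Check = Callable[[], Tuple[bool, str]]


def build_health_response(checks: Dict[str, Check]) -> Tuple[dict, int]:
    """Run every dependency check and build the JSON payload and HTTP status."""
    dependencies = {}
    for name, check in checks.items():
        try:
            ok, details = check()
        except Exception as exc:  # a crashing check counts as a failed dependency
            ok, details = False, f"check raised {exc!r}"
        dependencies[name] = {"status": "ok" if ok else "error", "details": details}

    healthy = all(dep["status"] == "ok" for dep in dependencies.values())
    payload = {
        "status": "ok" if healthy else "error",
        "message": ("All dependencies are healthy." if healthy
                    else "Application has critical dependency issues."),
        "dependencies": dependencies,
    }
    # 503 tells load balancers and probes that the service is not usable
    return payload, (200 if healthy else 503)


if __name__ == "__main__":
    payload, status = build_health_response({
        "mongodb": lambda: (True, "MongoDB connected successfully."),
        "cache": lambda: (False, "hypothetical cache is unreachable"),
    })
    print(status, payload["status"])
```

In a Flask view this could be returned as jsonify(payload), status, which keeps the status code and the JSON body derived from the same set of results.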
Resource Monitoring with Prometheus and Node Exporter
To gain visibility into the underlying infrastructure, we’ll deploy Prometheus for metrics collection and Node Exporter to expose system-level metrics from your Linode instances. This is essential for identifying resource exhaustion before it impacts your Python application or MongoDB cluster.
Installing and Configuring Node Exporter
On each Linode server hosting your Python app or MongoDB nodes, download and run Node Exporter. A common approach is to run it as a systemd service.
# Download the latest release (adjust version as needed)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Create a systemd service file
sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile-collector

[Install]
WantedBy=multi-user.target
Create the user and directory, then enable and start the service:
sudo useradd -rs /bin/false prometheus
sudo mkdir -p /var/lib/node_exporter/textfile-collector
sudo mv node_exporter /usr/local/bin/
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
sudo systemctl status node_exporter
Node Exporter will now expose metrics on port 9100. Ensure this port is accessible from your Prometheus server.
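The service file above also enables the textfile collector, which publishes any *.prom file found in /var/lib/node_exporter/textfile-collector alongside the built-in metrics. As a rough sketch of how a cron job or script might feed it, the helper below writes one gauge and renames the file into place so the exporter never reads a partial write. The function name, file name, and metric name are all hypothetical.

```python
import os
import tempfile


def write_textfile_metric(directory: str, filename: str, name: str,
                          value: float, help_text: str) -> str:
    """Atomically write a single gauge in the Prometheus text format."""
    if not filename.endswith(".prom"):
        raise ValueError("the textfile collector only reads *.prom files")
    body = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )
    # Write to a temp file in the same directory, then rename it into place,
    # so a scrape never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter user must read it
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:  # stand-in for the collector directory
        path = write_textfile_metric(
            demo_dir, "backup.prom",
            "app_last_backup_success",  # hypothetical metric name
            1, "Whether the last backup run succeeded.",
        )
        print(open(path).read())
```

The script that calls this needs write access to the collector directory, which the commands above create as root.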
Configuring Prometheus Server
On your Prometheus server, edit the prometheus.yml configuration file to scrape metrics from your Node Exporter instances and your Python application’s health endpoint.
global:
  scrape_interval: 15s # Scrape targets every 15 seconds. The default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape Node Exporter instances
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - 'your_app_server_ip:9100'
          - 'your_mongo_node1_ip:9100'
          - 'your_mongo_node2_ip:9100'
          # Add all your Linode instances here

  # Probe the Python application health endpoint through Blackbox Exporter
  # (assumes a Blackbox Exporter instance with its stock http_2xx module)
  - job_name: 'python_app_health'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - 'http://your_app_server_ip:5000/health' # Assuming your Flask app runs on port 5000
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target # Pass the health URL to the exporter as ?target=
      - source_labels: [__param_target]
        target_label: instance # Keep the probed URL as the instance label
      - target_label: __address__
        replacement: 'your_blackbox_exporter_ip:9115' # The address Prometheus actually scrapes

  # Scrape MongoDB exporter (if deployed)
  # - job_name: 'mongodb_exporter'
  #   static_configs:
  #     - targets: ['your_mongo_exporter_ip:9274'] # Must match the exporter's --web.listen-address
The python_app_health job needs some explanation. The /health endpoint returns JSON, and Prometheus can only ingest the text exposition format, so scraping /health directly would fail to parse. The job therefore goes through Blackbox Exporter, which this example assumes is running at your_blackbox_exporter_ip:9115 with the http_2xx module available. The relabel_configs copy the health URL into the target query parameter, keep that URL as the instance label, and then point __address__ at the exporter. Blackbox Exporter requests /health on Prometheus's behalf and reports probe_success as 1 for a 2xx response and 0 otherwise, so the 503 returned during a MongoDB outage shows up as a failed probe.
Monitoring MongoDB Clusters
Monitoring MongoDB requires specific metrics related to performance, replication, and resource usage. We’ll cover deploying mongodb_exporter and integrating its metrics into Prometheus.
Deploying MongoDB Exporter
The MongoDB exporter (mongodb_exporter) is a Prometheus exporter for MongoDB. It can be run as a standalone binary or as a Docker container. For a production setup, running it as a systemd service on a dedicated monitoring node or one of your MongoDB nodes is recommended.
# Download the latest release (adjust version as needed)
wget https://github.com/mongodb-developer/mongodb_exporter/releases/download/v0.35.0/mongodb_exporter-v0.35.0.linux-amd64.tar.gz
tar xvfz mongodb_exporter-v0.35.0.linux-amd64.tar.gz
cd mongodb_exporter-v0.35.0.linux-amd64

# Create a systemd service file
sudo nano /etc/systemd/system/mongodb_exporter.service
[Unit]
Description=MongoDB Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=mongodb_exporter
Group=mongodb_exporter
Type=simple
# Ensure MONGO_URI is set correctly for your MongoDB deployment
# Example for a replica set: mongodb://user:password@host1:27017,host2:27017/admin?replicaSet=rs0
# Example for standalone: mongodb://user:password@host:27017/admin
Environment="MONGO_URI=mongodb://your_mongo_user:your_mongo_password@your_mongo_host:27017/admin?replicaSet=your_replica_set_name"
ExecStart=/usr/local/bin/mongodb_exporter --mongodb.uri="${MONGO_URI}" --web.listen-address=":9274"
[Install]
WantedBy=multi-user.target
Create the user, install the binary, then start and enable the service:
sudo useradd -rs /bin/false mongodb_exporter
sudo mv mongodb_exporter /usr/local/bin/
sudo systemctl daemon-reload
sudo systemctl start mongodb_exporter
sudo systemctl enable mongodb_exporter
sudo systemctl status mongodb_exporter
The exporter will now be available on port 9274. Update your prometheus.yml to include this job, as shown in the commented-out section of the previous YAML block.
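Metric names can differ between exporter releases, so it is worth confirming that the names you plan to alert on actually appear in the exporter's output before writing rules against them. A small sketch of such a check follows; metric_names and missing_metrics are hypothetical helpers, and the sample text is made up for illustration rather than captured from a real exporter.

```python
import re
from typing import Iterable, Set

# A sample line starts with the metric name, optionally followed by {labels}
_SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition_text: str) -> Set[str]:
    """Collect the metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text: str, required: Iterable[str]) -> Set[str]:
    """Return the required metric names that the output does not contain."""
    return set(required) - metric_names(exposition_text)


if __name__ == "__main__":
    # Illustrative sample only; real output comes from the exporter's /metrics page
    sample = (
        "# HELP mongodb_up Whether MongoDB is reachable.\n"
        "# TYPE mongodb_up gauge\n"
        "mongodb_up 1\n"
        'mongodb_connections{state="current"} 12\n'
    )
    print(sorted(missing_metrics(sample, ["mongodb_up", "mongodb_connections_current"])))
```

To run it against a live exporter, the text could be fetched with urllib.request.urlopen from http://your_mongo_exporter_ip:9274/metrics and passed to missing_metrics.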
Alerting with Alertmanager
Collecting metrics is only half the battle; you need to be notified when things go wrong. Alertmanager integrates with Prometheus to handle alerts, deduplicate them, group them, and route them to the correct receivers (e.g., Slack, PagerDuty, email).
Defining Alerting Rules in Prometheus
Create a separate file for your alerting rules, e.g., alerts.yml, and include it in your prometheus.yml:
# In prometheus.yml
rule_files:
  - 'alerts.yml'
# In alerts.yml
groups:
  - name: general.rules
    rules:
      # Alert if a Node Exporter instance is down for more than 5 minutes
      - alert: NodeExporterDown
        expr: up{job="node_exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node Exporter {{ $labels.instance }} is down"
          description: "The Node Exporter on {{ $labels.instance }} has been down for more than 5 minutes."

      # Alert if CPU usage is consistently high
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} has been above 85% for the last 10 minutes."

      # Alert if disk space is running low
      - alert: LowDiskSpace
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "The root filesystem on {{ $labels.instance }} has less than 10% free space."

  - name: python_app.rules
    rules:
      # Alert if the Python app health check fails (non-2xx status such as 503)
      # probe_success is reported by Blackbox Exporter, so the health endpoint must be probed through it
      - alert: PythonAppUnhealthy
        expr: probe_success{job="python_app_health"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Python application {{ $labels.instance }} is unhealthy"
          description: "The health check endpoint for the Python application on {{ $labels.instance }} returned an error."

      # Alert if application response time is too high (requires application instrumentation or specific exporters)
      # This is a placeholder; actual implementation depends on how you expose app latency.
      # Example using a hypothetical 'http_request_duration_seconds' metric:
      # - alert: HighAppResponseTime
      #   expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, instance)) > 2
      #   for: 5m
      #   labels:
      #     severity: warning
      #   annotations:
      #     summary: "High response time for Python app {{ $labels.instance }}"
      #     description: "95th percentile response time for {{ $labels.instance }} is over 2 seconds."

  # Metric names differ between mongodb_exporter versions; confirm each one
  # against your exporter's /metrics output before relying on these rules.
  - name: mongodb.rules
    rules:
      # Alert if the exporter cannot reach MongoDB
      - alert: MongoDBDown
        expr: mongodb_up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MongoDB is unreachable from the exporter on {{ $labels.instance }}"
          description: "The MongoDB exporter on {{ $labels.instance }} has been unable to connect to MongoDB for more than 5 minutes."

      # Alert if replication lag is too high
      # State 2 is SECONDARY; a primary (state 1) does not lag behind itself
      - alert: MongoReplicationLag
        expr: mongodb_replset_member_state == 2 and mongodb_replset_member_optime_lag > 60 # Lag in seconds
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MongoDB replication lag on {{ $labels.instance }}"
          description: "MongoDB replica set member {{ $labels.instance }} has a replication lag of more than 60 seconds."

      # Alert if MongoDB connections are too high
      - alert: HighMongoConnections
        expr: mongodb_connections_current > 800 # Adjust threshold based on your setup
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High number of MongoDB connections on {{ $labels.instance }}"
          description: "MongoDB instance {{ $labels.instance }} has had more than 800 open connections for 10 minutes."
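The for: field in each rule is what keeps short spikes from paging anyone: an alert stays pending until its expression has returned a result at every evaluation for the whole duration, and it resets as soon as one evaluation comes back empty. The following simplified model of that behavior is only a sketch (alert_states is a hypothetical helper, and real Prometheus tracks this per label set), but it shows how the timing works out.

```python
from typing import List, Tuple


def alert_states(samples: List[Tuple[float, bool]], for_seconds: float) -> List[str]:
    """Model one alert's state at each evaluation.

    samples holds (evaluation time in seconds, whether the expression matched).
    """
    states = []
    active_since = None  # time of the first evaluation in the current matching run
    for timestamp, condition in samples:
        if not condition:
            active_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # Evaluations one minute apart against a rule with for: 2m
    evaluations = [(0, True), (60, True), (120, True), (180, False), (240, True)]
    print(alert_states(evaluations, for_seconds=120))
```

With a 15-second evaluation_interval and for: 5m, a condition therefore has to hold across roughly twenty consecutive evaluations before the alert reaches Alertmanager.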
Configuring Alertmanager
Ensure your Prometheus server is configured to send alerts to Alertmanager. In prometheus.yml:
# In prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['your_alertmanager_ip:9093'] # Address of your Alertmanager instance
And configure Alertmanager itself (alertmanager.yml) to define receivers (e.g., Slack):
global:
  slack_api_url: '<YOUR_SLACK_WEBHOOK_URL>'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications' # Default receiver

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#your-alerts-channel'
        send_resolved: true
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'

inhibit_rules:
  # While a critical alert is firing, mute the matching warning alert
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
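An inhibit rule mutes a target alert while at least one firing source alert carries the same values for every label listed under equal. The sketch below models that decision for exact-match label matchers only; is_inhibited is a hypothetical helper, and it leaves out regex matchers and Alertmanager's other edge cases.

```python
from typing import Dict, Iterable, List

Labels = Dict[str, str]


def _matches(labels: Labels, matchers: Labels) -> bool:
    """True if every matcher label is present with exactly the given value."""
    return all(labels.get(key) == value for key, value in matchers.items())


def is_inhibited(target: Labels, firing: Iterable[Labels], source_match: Labels,
                 target_match: Labels, equal: List[str]) -> bool:
    """Decide whether a firing source alert mutes the target alert."""
    if not _matches(target, target_match):
        return False  # the rule does not apply to this alert at all
    for source in firing:
        if not _matches(source, source_match):
            continue
        # A label missing on both sides counts as equal (both empty)
        if all(source.get(label, "") == target.get(label, "") for label in equal):
            return True
    return False


if __name__ == "__main__":
    critical = {"alertname": "DiskUsage", "severity": "critical", "instance": "db1"}
    warning = {"alertname": "DiskUsage", "severity": "warning", "instance": "db1"}
    print(is_inhibited(warning, [critical],
                       source_match={"severity": "critical"},
                       target_match={"severity": "warning"},
                       equal=["alertname", "cluster", "service"]))
```

Because alertname is in the equal list, the rule only has an effect when the same alert name is defined at both severities. None of the example rules in alerts.yml do that, so you may want to define paired warning and critical thresholds, or compare on a label such as instance instead.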
With this setup, Prometheus collects host, application health, and MongoDB metrics from your Linode instances, and Alertmanager notifies you in Slack when one of the rules above fires. Treat the thresholds and metric names as starting points and tune them against your own workload.