Server Monitoring Best Practices: Keeping Your Python App and MongoDB Clusters Alive on Linode

Proactive Health Checks for Python Applications

Maintaining the health of your Python web applications, especially those interacting with MongoDB, requires a multi-layered monitoring approach. Beyond basic uptime checks, we need to inspect application-level metrics, resource utilization, and potential bottlenecks. This section details setting up robust health checks that go beyond simple HTTP 200 responses.

Implementing a Custom Health Endpoint

A dedicated health check endpoint within your Python application provides granular insights. This endpoint should not only confirm the application is running but also verify its critical dependencies, such as database connectivity. For a Flask application, this might look like:

from flask import Flask, jsonify
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

app = Flask(__name__)

# Configuration for MongoDB connection
MONGO_URI = "mongodb://your_mongo_host:27017/"
MONGO_DB_NAME = "your_database"

def check_mongo_connection(uri, db_name):
    """Return (is_healthy, message) after pinging MongoDB."""
    client = None
    try:
        client = MongoClient(uri, serverSelectionTimeoutMS=5000)  # 5-second timeout
        client.admin.command('ping')  # Lightweight round trip to the server
        # Optionally, verify that the application database is usable as well:
        # if client[db_name].my_collection.count_documents({}, limit=1) > 0:
        #     return True, "MongoDB connected and collection accessible."
        return True, "MongoDB connected successfully."
    except ConnectionFailure as e:
        return False, f"MongoDB connection failed: {e}"
    except Exception as e:
        return False, f"An unexpected error occurred with MongoDB: {e}"
    finally:
        # Close the per-check client so repeated probes do not leak connections
        if client is not None:
            client.close()

@app.route('/health')
def health_check():
    is_mongo_ok, mongo_message = check_mongo_connection(MONGO_URI, MONGO_DB_NAME)

    if is_mongo_ok:
        return jsonify({
            "status": "ok",
            "message": "Application and MongoDB are healthy.",
            "dependencies": {
                "mongodb": {
                    "status": "ok",
                    "details": mongo_message
                }
            }
        }), 200
    else:
        return jsonify({
            "status": "error",
            "message": "Application has critical dependency issues.",
            "dependencies": {
                "mongodb": {
                    "status": "error",
                    "details": mongo_message
                }
            }
        }), 503 # Service Unavailable

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

This endpoint provides a JSON response indicating the overall health and the status of the MongoDB connection. A 503 Service Unavailable status code is crucial for load balancers and external monitoring systems to correctly interpret application unavailability.
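
To illustrate how a client might consume this response, here is a minimal sketch of a checker that reads the status code and the dependencies block. The names HEALTH_URL, summarize_health, and fetch_health are assumptions made for this example, and only the standard library is used.

```python
import json
import urllib.error
import urllib.request

# Hypothetical URL; point it at wherever the Flask app is reachable.
HEALTH_URL = "http://your_app_server_ip:5000/health"


def summarize_health(status_code, body):
    """Return (healthy, failing_dependencies) for a /health response."""
    payload = json.loads(body)
    failing = sorted(
        name
        for name, info in payload.get("dependencies", {}).items()
        if info.get("status") != "ok"
    )
    healthy = status_code == 200 and payload.get("status") == "ok" and not failing
    return healthy, failing


def fetch_health(url=HEALTH_URL, timeout=5):
    """Fetch the endpoint and return (status_code, body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # urllib raises on a 503, but the JSON body is still readable
        return err.code, err.read().decode("utf-8")
```

Calling summarize_health(*fetch_health()) returns a boolean plus the list of failing dependencies, which is convenient for a cron job or a deployment smoke test.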

Resource Monitoring with Prometheus and Node Exporter

To gain visibility into the underlying infrastructure, we’ll deploy Prometheus for metrics collection and Node Exporter to expose system-level metrics from your Linode instances. This is essential for identifying resource exhaustion before it impacts your Python application or MongoDB cluster.

Installing and Configuring Node Exporter

On each Linode server hosting your Python app or MongoDB nodes, download and run Node Exporter. A common approach is to run it as a systemd service.

# Download the latest release (adjust version as needed)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Create a systemd service file
sudo nano /etc/systemd/system/node_exporter.service

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile-collector

[Install]
WantedBy=multi-user.target

Create the user and directory, then enable and start the service:

sudo useradd -rs /bin/false prometheus
sudo mkdir -p /var/lib/node_exporter/textfile-collector
sudo mv node_exporter /usr/local/bin/
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
sudo systemctl status node_exporter

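Node Exporter will now expose metrics on port 9100. Ensure this port is reachable from your Prometheus server, and preferably only from it (for example, through a firewall rule), because the metrics endpoint is unauthenticated by default.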
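
The unit file above enables the textfile collector, which publishes any metrics found in *.prom files in the configured directory. As a hedged sketch of how a Python job could publish a custom gauge this way, the helper below writes the file atomically. The function name and the example metric are assumptions, and the script needs write access to the directory.

```python
import os
import tempfile

# Matches the --collector.textfile.directory flag in the unit file above.
TEXTFILE_DIR = "/var/lib/node_exporter/textfile-collector"


def write_textfile_metric(name, value, help_text, directory=TEXTFILE_DIR):
    """Atomically write a single gauge to <directory>/<name>.prom."""
    content = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )
    # Write to a temp file in the same directory, then rename it, so that
    # Node Exporter never reads a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        # mkstemp creates the file with mode 0600; make it readable
        # by the user that Node Exporter runs as.
        os.chmod(tmp_path, 0o644)
        final_path = os.path.join(directory, f"{name}.prom")
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

For example, a backup script could call write_textfile_metric("app_last_backup_timestamp_seconds", int(time.time()), "Unix time of the last backup.") after each run (with time imported), and an alert could then fire when the value becomes too old.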

Configuring Prometheus Server

On your Prometheus server, edit the prometheus.yml configuration file to scrape metrics from your Node Exporter instances and your Python application’s health endpoint.

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape Node Exporter instances
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - 'your_app_server_ip:9100'
          - 'your_mongo_node1_ip:9100'
          - 'your_mongo_node2_ip:9100'
          # Add all your Linode instances here

  # Probe the Python application health endpoint through the Blackbox Exporter.
  # Prometheus cannot parse the JSON returned by /health directly, so the
  # Blackbox Exporter requests the URL and reports the result as probe_success.
  # Assumes a Blackbox Exporter on port 9115 with its default http_2xx module.
  - job_name: 'python_app_health'
    metrics_path: /probe
    params:
      module: [http_2xx] # Succeeds only on a 2xx response, so a 503 fails the probe
    static_configs:
      - targets:
          - 'http://your_app_server_ip:5000/health' # Assuming your Flask app runs on port 5000
    relabel_configs:
      # Pass the health URL to the exporter as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the Blackbox Exporter
      - target_label: __address__
        replacement: 'your_blackbox_exporter_ip:9115'

  # Scrape MongoDB exporter (if deployed)
  # - job_name: 'mongodb_exporter'
  #   static_configs:
  #     - targets: ['your_mongo_exporter_ip:9274'] # Port set with --web.listen-address; the exporter's default is 9216

The relabel_configs for the Python app health job deserve a closer look. Prometheus only understands its own text exposition format, so it cannot scrape the JSON /health endpoint directly. Instead, the job points at a Blackbox Exporter (an additional component assumed here, listening on port 9115), which requests the health URL on Prometheus’s behalf and exposes the outcome as the probe_success metric. The relabeling copies the health URL into the target query parameter, keeps it as the instance label, and rewrites __address__ so that the scrape itself goes to the exporter. A 503 from the health endpoint then appears as probe_success == 0.
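
After reloading Prometheus, it can help to confirm that every configured target is actually being scraped. The following is a minimal sketch that reads Prometheus’s /api/v1/targets endpoint and lists targets whose health is not "up". PROMETHEUS_URL and the function names are assumptions for illustration.

```python
import json
import urllib.request

# Hypothetical address; replace it with your Prometheus server.
PROMETHEUS_URL = "http://localhost:9090"


def unhealthy_targets(targets_response):
    """Return (job, instance, last_error) for each active target that is not up."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (
            target["labels"].get("job"),
            target["labels"].get("instance"),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5):
    """Call the targets API and decode the JSON response."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)
```

Running unhealthy_targets(fetch_targets()) on the Prometheus host returns an empty list when all jobs are being scraped successfully; otherwise the last_error field usually points at the cause, such as a refused connection or an unparsable response.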

Monitoring MongoDB Clusters

Monitoring MongoDB requires specific metrics related to performance, replication, and resource usage. We’ll use mongodb_exporter, a community-maintained Prometheus exporter for MongoDB, and integrate its metrics into Prometheus.

Deploying MongoDB Exporter

The MongoDB exporter (mongodb_exporter) is a Prometheus exporter for MongoDB. It can be run as a standalone binary or as a Docker container. For a production setup, running it as a systemd service on a dedicated monitoring node or one of your MongoDB nodes is recommended.

# Download a release from the Percona mongodb_exporter project
# (adjust the version, and confirm the asset name on the releases page)
wget https://github.com/percona/mongodb_exporter/releases/download/v0.35.0/mongodb_exporter-0.35.0.linux-amd64.tar.gz
tar xvfz mongodb_exporter-0.35.0.linux-amd64.tar.gz
cd mongodb_exporter-0.35.0.linux-amd64

# Create a systemd service file
sudo nano /etc/systemd/system/mongodb_exporter.service

[Unit]
Description=MongoDB Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=mongodb_exporter
Group=mongodb_exporter
Type=simple
# Ensure MONGO_URI is set correctly for your MongoDB deployment
# Example for a replica set: mongodb://user:password@host1:27017,host2:27017/admin?replicaSet=rs0
# Example for standalone: mongodb://user:password@host:27017/admin
Environment="MONGO_URI=mongodb://your_mongo_user:your_mongo_password@your_mongo_host:27017/admin?replicaSet=your_replica_set_name"
ExecStart=/usr/local/bin/mongodb_exporter --mongodb.uri="${MONGO_URI}" --web.listen-address=":9274"

[Install]
WantedBy=multi-user.target

Create the user, install the binary, then enable and start the service:

sudo useradd -rs /bin/false mongodb_exporter
sudo mv mongodb_exporter /usr/local/bin/
sudo systemctl daemon-reload
sudo systemctl start mongodb_exporter
sudo systemctl enable mongodb_exporter
sudo systemctl status mongodb_exporter

The exporter will now be available on port 9274. Update your prometheus.yml to include this job by uncommenting the mongodb_exporter section of the Prometheus configuration shown earlier.
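
Before wiring the exporter into Prometheus, you may want to check that it is serving metrics and can reach MongoDB. The sketch below fetches the /metrics page and extracts a single sample such as mongodb_up. The helper names are assumptions, and the parser is deliberately simple rather than a full implementation of the exposition format.

```python
import urllib.request

# Hypothetical address; the port matches --web.listen-address in the unit file.
EXPORTER_URL = "http://localhost:9274/metrics"


def read_metric(exposition_text, metric_name):
    """Return the first sample value for metric_name, or None if it is absent."""
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Sample with labels: name{label="value"} 1
            name = line[: line.index("{")]
            remainder = line[line.rindex("}") + 1:]
        else:
            # Sample without labels: name 1
            name, _, remainder = line.partition(" ")
        if name == metric_name:
            return float(remainder.split()[0])
    return None


def fetch_metrics(url=EXPORTER_URL, timeout=5):
    """Download the exporter's metrics page as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A result of 1.0 from read_metric(fetch_metrics(), "mongodb_up") suggests the exporter is connected; 0.0 or None usually means the connection URI or credentials need attention.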

Alerting with Alertmanager

Collecting metrics is only half the battle; you need to be notified when things go wrong. Alertmanager integrates with Prometheus to handle alerts, deduplicate them, group them, and route them to the correct receivers (e.g., Slack, PagerDuty, email).

Defining Alerting Rules in Prometheus

Create a separate file for your alerting rules, e.g., alerts.yml, and include it in your prometheus.yml:

# In prometheus.yml
rule_files:
  - 'alerts.yml'

# In alerts.yml
groups:
  - name: general.rules
    rules:
      # Alert if a Node Exporter instance is down for more than 5 minutes
      - alert: NodeExporterDown
        expr: up{job="node_exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node Exporter {{ $labels.instance }} is down"
          description: "The Node Exporter on {{ $labels.instance }} has been down for more than 5 minutes."

      # Alert if CPU usage is consistently high
      - alert: HighCpuUsage
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} is above 85% for the last 10 minutes."

      # Alert if disk space is running low
      - alert: LowDiskSpace
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Filesystem on {{ $labels.instance }} has less than 10% free space."

  - name: python_app.rules
    rules:
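      # Alert if the Python app health check fails (for example, a 503 response).
      # probe_success is exposed by the Blackbox Exporter, so this rule needs
      # the health endpoint to be probed through that exporter.
      - alert: PythonAppUnhealthy
        expr: probe_success{job="python_app_health"} == 0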
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Python application {{ $labels.instance }} is unhealthy"
          description: "The health check endpoint for Python application on {{ $labels.instance }} returned an error."

      # Alert if application response time is too high (requires application instrumentation or specific exporters)
      # This is a placeholder; actual implementation depends on how you expose app latency.
      # Example using a hypothetical 'http_request_duration_seconds' metric:
      # - alert: HighAppResponseTime
      #   expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, instance)) > 2
      #   for: 5m
      #   labels:
      #     severity: warning
      #   annotations:
      #     summary: "High response time for Python app {{ $labels.instance }}"
      #     description: "95th percentile response time for {{ $labels.instance }} is over 2 seconds."

  - name: mongodb.rules
    rules:
      # Alert if the exporter cannot reach MongoDB
      - alert: MongoDBDown
        expr: mongodb_up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MongoDB is unreachable on {{ $labels.instance }}"
          description: "The MongoDB exporter on {{ $labels.instance }} has been unable to reach MongoDB for more than 5 minutes."

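      # Alert if a secondary's replication lag is too high
      # (metric names vary between mongodb_exporter versions; check your exporter's /metrics output)
      - alert: MongoReplicationLag
        expr: mongodb_replset_member_state == 2 and mongodb_replset_member_optime_lag > 60 # State 2 = SECONDARY; lag in seconds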
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MongoDB replication lag on {{ $labels.instance }}"
          description: "MongoDB replica set member {{ $labels.instance }} has a replication lag of more than 60 seconds."

      # Alert if MongoDB connections are too high
      - alert: HighMongoConnections
        expr: mongodb_connections_current > 800 # Adjust threshold based on your setup
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High number of MongoDB connections on {{ $labels.instance }}"
          description: "MongoDB instance {{ $labels.instance }} has more than 800 active connections."
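
To make the HighCpuUsage expression more concrete, here is a small worked example of the same arithmetic in Python. It is a simplification under stated assumptions: it takes per-core idle counter readings at the start and end of a window and ignores the extrapolation that Prometheus’s rate() performs. The function name is hypothetical.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Mirror 100 - avg(rate(idle)) * 100 for a list of per-core idle counters."""
    # rate(): per-second increase of each core's idle-seconds counter
    idle_rates = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    # avg by (instance): average idle fraction across cores
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Two cores over a 5-minute window: one idles for 30 s, the other for 60 s.
usage = cpu_usage_percent([1000.0, 2000.0], [1030.0, 2060.0], 300)
print(f"CPU usage: {usage:.1f}%")
```

With these readings the average idle fraction is 0.15, so usage sits right at the 85% threshold; the alert fires only once the expression stays above that value for the full 10-minute "for" period.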

Configuring Alertmanager

Ensure your Prometheus server is configured to send alerts to Alertmanager. In prometheus.yml:

# In prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['your_alertmanager_ip:9093'] # Address of your Alertmanager instance

And configure Alertmanager itself (alertmanager.yml) to define receivers (e.g., Slack):

global:
  slack_api_url: '<YOUR_SLACK_WEBHOOK_URL>'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications' # Default receiver

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#your-alerts-channel'
        send_resolved: true
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'

inhibit_rules:
  # While a critical alert is firing, mute warnings that share the same labels
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
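
Once Alertmanager is running, a synthetic alert is a quick way to confirm that routing and the Slack receiver work end to end. The sketch below posts one alert to Alertmanager’s v2 API. The alert name RoutingTest, the label values, and the function names are assumptions made for this example.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Hypothetical address; matches the placeholder used in prometheus.yml above.
ALERTMANAGER_URL = "http://your_alertmanager_ip:9093"


def build_test_alert(now=None, duration_minutes=5):
    """Build the JSON payload (a list with one alert) for a synthetic alert."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "RoutingTest",  # Hypothetical name, not one of the rules above
            "severity": "warning",
            "instance": "manual-test",
        },
        "annotations": {"summary": "Synthetic alert to verify Alertmanager routing"},
        "startsAt": now.isoformat(),
        # The alert resolves on its own once endsAt has passed
        "endsAt": (now + timedelta(minutes=duration_minutes)).isoformat(),
    }]


def send_alerts(alerts, base_url=ALERTMANAGER_URL, timeout=5):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Calling send_alerts(build_test_alert()) should produce a Slack message after the configured group_wait, followed by a resolved notification once the alert expires, since send_resolved is enabled.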

With this setup, you have a comprehensive monitoring stack for your Python applications and MongoDB clusters on Linode, ensuring proactive detection and alerting for potential issues.