Server Monitoring Best Practices: Keeping Your Python App and MongoDB Clusters Alive on DigitalOcean

Proactive Health Checks for Python Applications

Maintaining the health of a Python application, especially one serving critical traffic, requires more than basic uptime checks. It calls for application-level metrics and alerts tied to them. For a typical Flask or Django application, this means exposing internal metrics and setting up external probes.

A common pattern is to expose a `/health` or `/metrics` endpoint within the application itself. This endpoint can report on database connectivity, cache status, and internal worker queues. Let’s consider a simple Flask example:

Flask Health Endpoint Example

from flask import Flask, jsonify
import redis
import pymongo

app = Flask(__name__)

# Configuration (ideally from environment variables)
MONGO_URI = "mongodb://mongo1:27017,mongo2:27017/?replicaSet=rs0"
REDIS_HOST = "redis_cache"
REDIS_PORT = 6379

def check_mongo_connection(uri):
    client = None
    try:
        client = pymongo.MongoClient(uri, serverSelectionTimeoutMS=5000)
        # 'ping' is cheap and does not require auth; the older 'ismaster'
        # command is deprecated in favour of 'hello'.
        client.admin.command('ping')
        return True, "MongoDB connection successful"
    except pymongo.errors.ConnectionFailure as e:
        return False, f"MongoDB connection failed: {e}"
    except Exception as e:
        return False, f"An unexpected error occurred with MongoDB: {e}"
    finally:
        # Close the per-check client so repeated probes do not leak connections.
        if client is not None:
            client.close()

def check_redis_connection(host, port):
    try:
        r = redis.StrictRedis(host=host, port=port, socket_connect_timeout=2, socket_timeout=2)
        r.ping()
        return True, "Redis connection successful"
    except redis.exceptions.ConnectionError as e:
        return False, f"Redis connection failed: {e}"
    except Exception as e:
        return False, f"An unexpected error occurred with Redis: {e}"

@app.route('/health')
def health_check():
    mongo_ok, mongo_msg = check_mongo_connection(MONGO_URI)
    redis_ok, redis_msg = check_redis_connection(REDIS_HOST, REDIS_PORT)

    status = {
        "status": "unhealthy",
        "dependencies": {
            "mongodb": {"ok": mongo_ok, "message": mongo_msg},
            "redis": {"ok": redis_ok, "message": redis_msg}
        }
    }

    if mongo_ok and redis_ok:
        status["status"] = "healthy"
        return jsonify(status), 200
    else:
        return jsonify(status), 503 # Service Unavailable

if __name__ == '__main__':
    # In production, use a proper WSGI server like Gunicorn
    app.run(host='0.0.0.0', port=5000)

This endpoint returns 200 OK when every dependency is reachable and 503 Service Unavailable otherwise. Load balancers and external monitoring tools rely on these status codes to decide whether an instance should keep receiving traffic. For more detailed metrics (request latency, error rates, memory usage), consider integrating a library such as `prometheus_client` and exposing a `/metrics` endpoint.
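As a rough illustration of the external probe side, the sketch below polls a health URL with the standard library and lists the dependencies reported as failing. It assumes the JSON shape returned by the `/health` endpoint above; the function names `failing_dependencies` and `probe`, and the labels "app" and "unreachable", are hypothetical choices for this example.

```python
import json
import urllib.error
import urllib.request


def failing_dependencies(status_code, body):
    """Return the names of dependencies reported as not OK.

    Assumes the JSON shape produced by the /health endpoint above.
    """
    if status_code == 200:
        return []
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # Non-JSON body (for example a proxy error page): blame the app itself.
        return ["app"]
    if not isinstance(payload, dict):
        return ["app"]
    deps = payload.get("dependencies", {})
    failed = [name for name, info in deps.items() if not info.get("ok")]
    return failed or ["app"]


def probe(url, timeout=3.0):
    """Fetch the health URL and return (healthy, failing_dependency_names)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, body = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the 503 body still carries the details.
        status, body = err.code, err.read()
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False, ["unreachable"]
    failed = failing_dependencies(status, body)
    return not failed, failed
```

A scheduler could call `probe("http://localhost:5000/health")` and alert only after several consecutive failures, which avoids paging on a single slow response.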

Monitoring MongoDB Clusters on DigitalOcean

DigitalOcean’s Managed MongoDB service simplifies cluster management, but robust monitoring is still essential. We need to track not just basic availability but also performance indicators like query latency, replication lag, and disk usage.

The primary tools for this are MongoDB’s diagnostic commands, accessible via the `mongosh` shell or programmatically through a driver. For automated monitoring, we’ll use a dedicated monitoring agent or a script that periodically queries these metrics.

Key MongoDB Metrics to Monitor

Replication Lag: Critical for ensuring data consistency across nodes. Use rs.status().
Query Performance: Track slow queries and overall query execution times. Use db.serverStatus() and db.currentOp().
Disk Usage: Prevent outages due to full disks. Use db.stats() or system-level tools.
Connections: Monitor active and available connections to avoid connection exhaustion. Use db.serverStatus().
Memory Usage: Keep an eye on RAM consumption, especially for WiredTiger cache. Use db.serverStatus().
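To make the connection and memory items above more concrete, here is a minimal sketch that derives two ratios from a `serverStatus`-style document. It assumes the `connections.current`, `connections.available`, and WiredTiger cache fields that `serverStatus` reports on a WiredTiger deployment; the function name and the sample numbers are made up for illustration.

```python
def summarize_server_status(doc):
    """Derive utilisation ratios from a serverStatus-style document."""
    conn = doc["connections"]
    # 'available' counts unused connections, so the limit is the sum of both.
    total = conn["current"] + conn["available"]
    cache = doc["wiredTiger"]["cache"]
    used = cache["bytes currently in the cache"]
    limit = cache["maximum bytes configured"]
    return {
        "connection_utilization": conn["current"] / total if total else 0.0,
        "cache_fill_ratio": used / limit if limit else 0.0,
    }


# Hypothetical sample values, not taken from a real cluster.
sample = {
    "connections": {"current": 120, "available": 680},
    "wiredTiger": {"cache": {
        "bytes currently in the cache": 3 * 1024**3,
        "maximum bytes configured": 4 * 1024**3,
    }},
}
print(summarize_server_status(sample))
```

Working with ratios rather than raw counts keeps alert thresholds meaningful when the cluster is resized.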

Automated MongoDB Health Checks Script (Python)

This Python script connects to a MongoDB replica set and checks for replication lag and basic server status. It’s designed to be run periodically by a scheduler like cron or a systemd timer.

import pymongo
import time
import sys
import os

# Configuration from environment variables
MONGO_URI = os.environ.get("MONGO_URI", "mongodb://user:password@mongo1:27017,mongo2:27017/?replicaSet=rs0")
REPLICATION_LAG_THRESHOLD_SECONDS = int(os.environ.get("REPLICATION_LAG_THRESHOLD_SECONDS", 60))
DISK_USAGE_THRESHOLD_PERCENT = int(os.environ.get("DISK_USAGE_THRESHOLD_PERCENT", 85))

def check_replication_lag(client):
    try:
        rs_status = client.admin.command('replSetGetStatus')
        members = rs_status['members']

        # Look up the primary once; during an election there may be none.
        primary = next((m for m in members if m['stateStr'] == 'PRIMARY'), None)
        if primary is None:
            print("CRITICAL: No PRIMARY found in replica set.", file=sys.stderr)
            return False

        # Members with health 0 are unreachable and have no usable optime.
        unhealthy = [m['name'] for m in members if m.get('health') != 1]
        if unhealthy:
            print(f"CRITICAL: Unreachable replica set members: {', '.join(unhealthy)}", file=sys.stderr)
            return False

        # Only SECONDARY members replicate data; arbiters are skipped.
        max_lag = 0.0
        for member in members:
            if member['stateStr'] != 'SECONDARY':
                continue
            lag = (primary['optimeDate'] - member['optimeDate']).total_seconds()
            max_lag = max(max_lag, lag)

        if max_lag > REPLICATION_LAG_THRESHOLD_SECONDS:
            print(f"CRITICAL: Replication lag detected. Max lag: {max_lag:.2f}s (Threshold: {REPLICATION_LAG_THRESHOLD_SECONDS}s)", file=sys.stderr)
            return False
        else:
            print(f"OK: Replication lag is within acceptable limits (Max lag: {max_lag:.2f}s)")
            return True
    except pymongo.errors.OperationFailure as e:
        print(f"ERROR: Failed to get replication status: {e}", file=sys.stderr)
        return False
    except Exception as e:
        print(f"ERROR: Unexpected error during replication check: {e}", file=sys.stderr)
        return False

def check_disk_usage(client):
    try:
        # dbStats is per database (here: admin). Its fsUsedSize/fsTotalSize
        # fields describe the filesystem holding the data files, where the
        # server reports them.
        db_stats = client.admin.command('dbStats')
        fs_used = db_stats.get('fsUsedSize')
        fs_total = db_stats.get('fsTotalSize')

        if not fs_used or not fs_total:
            # Fields unavailable: for DigitalOcean Managed Databases rely on
            # DigitalOcean's own disk metrics; on a self-hosted Droplet use
            # `df -h` or similar.
            print("INFO: Filesystem size not reported by dbStats. Rely on DigitalOcean's provided metrics.")
            return True

        used_percent = fs_used / fs_total * 100
        if used_percent > DISK_USAGE_THRESHOLD_PERCENT:
            print(f"CRITICAL: Disk usage at {used_percent:.1f}% (Threshold: {DISK_USAGE_THRESHOLD_PERCENT}%)", file=sys.stderr)
            return False
        print(f"OK: Disk usage at {used_percent:.1f}%")
        return True

    except pymongo.errors.OperationFailure as e:
        print(f"ERROR: Failed to get database stats: {e}", file=sys.stderr)
        return False
    except Exception as e:
        print(f"ERROR: Unexpected error during disk usage check: {e}", file=sys.stderr)
        return False

def check_server_status(client):
    try:
        server_status = client.admin.command('serverStatus')
        connections = server_status['connections']
        network = server_status['network']
        # WiredTiger statistics live under the top-level 'wiredTiger' key and
        # use space-separated field names.
        cache = server_status.get('wiredTiger', {}).get('cache', {})

        print(f"INFO: Connections - Current: {connections['current']}, Available: {connections['available']}")
        print(f"INFO: Network - Bytes In: {network['bytesIn']}, Bytes Out: {network['bytesOut']}")
        print(f"INFO: WiredTiger Cache - Bytes Used: {cache.get('bytes currently in the cache', 0):,}, Pages Read into Cache: {cache.get('pages read into cache', 0):,}")

        # 'available' is the number of unused connections, so the limit is
        # current + available. Add further thresholds (cache usage etc.) as needed.
        max_connections = connections['current'] + connections['available']
        if max_connections and connections['current'] > max_connections * 0.9:
            print(f"WARNING: High connection usage: {connections['current']}/{max_connections}", file=sys.stderr)

        return True
    except pymongo.errors.OperationFailure as e:
        print(f"ERROR: Failed to get server status: {e}", file=sys.stderr)
        return False
    except Exception as e:
        print(f"ERROR: Unexpected error during server status check: {e}", file=sys.stderr)
        return False


if __name__ == "__main__":
    try:
        client = pymongo.MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
        client.admin.command('ping') # Verify connection
        print("INFO: Successfully connected to MongoDB.")

        all_checks_ok = True

        if not check_replication_lag(client):
            all_checks_ok = False
        if not check_disk_usage(client): # Note: Simplified for DO Managed DBs
            all_checks_ok = False
        if not check_server_status(client):
            all_checks_ok = False

        if all_checks_ok:
            print("INFO: All MongoDB health checks passed.")
            sys.exit(0)
        else:
            print("ERROR: One or more MongoDB health checks failed.", file=sys.stderr)
            sys.exit(1)

    except pymongo.errors.ConnectionFailure as e:
        # Do not print MONGO_URI here: it may contain credentials.
        print(f"FATAL: Could not connect to MongoDB: {e}", file=sys.stderr)
        sys.exit(2)
    except Exception as e:
        print(f"FATAL: An unexpected error occurred: {e}", file=sys.stderr)
        sys.exit(3)
    finally:
        if 'client' in locals() and client:
            client.close()

To use this script effectively:

Set the MONGO_URI environment variable with your DigitalOcean Managed MongoDB connection string.
Configure REPLICATION_LAG_THRESHOLD_SECONDS and DISK_USAGE_THRESHOLD_PERCENT as needed.
Schedule this script using cron or a systemd timer to run every 1-5 minutes.
Pipe the output to a log file and configure your alerting system (e.g., Prometheus Alertmanager, PagerDuty) to trigger on non-zero exit codes or specific error messages.
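One possible way to act on the exit codes is a small wrapper that runs the check and maps the code to a severity. The sketch below relies on the exit codes used by the script above (0, 1, 2, 3); the severity labels and the names `classify_exit_code` and `run_check` are assumptions made for this example, not part of any alerting tool.

```python
import subprocess
import sys

# Exit codes used by the health check script: 0 all checks passed,
# 1 a check failed, 2 connection failure, 3 unexpected error.
# The severity labels are an assumption for this sketch.
SEVERITY_BY_EXIT_CODE = {0: "ok", 1: "warning", 2: "critical", 3: "critical"}


def classify_exit_code(code):
    """Map a process exit code to a severity label."""
    return SEVERITY_BY_EXIT_CODE.get(code, "unknown")


def run_check(command, timeout=60):
    """Run a check command and return (severity, stderr_text)."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # A hung check is treated as a failure rather than silently ignored.
        return "critical", f"check timed out after {timeout}s"
    return classify_exit_code(result.returncode), result.stderr.strip()
```

For instance, `run_check([sys.executable, "mongo_health.py"])` (with `mongo_health.py` standing in for wherever you saved the script) returns a severity that can be forwarded to whichever notification channel you use.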

Integrating with DigitalOcean Monitoring & Alerting

DigitalOcean’s built-in monitoring provides a good baseline. For Droplets running your Python apps, and any self-hosted MongoDB (though Managed MongoDB is recommended), ensure the DigitalOcean metrics agent (do-agent) is installed and running.

Key metrics to monitor via the DO dashboard:

CPU Utilization: High CPU can indicate inefficient code or heavy load.
Memory Usage: Crucial for Python apps and MongoDB’s cache.
Disk I/O: Bottlenecks here severely impact database performance.
Network Traffic: Monitor for unusual spikes or drops.

Setting Up Alerts in DigitalOcean

DigitalOcean allows you to set up alerts directly on Droplet and Managed Database metrics. This is your first line of defense.

Example alert configuration:

Resource: Droplet CPU Usage
Condition: Greater than 90% for 15 minutes
Alerts To: Your email, Slack integration (via webhooks)

Resource: Managed MongoDB Disk Usage
Condition: Greater than 85% for 30 minutes
Alerts To: Your email, PagerDuty
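Conditions of the form "greater than X% for N minutes" only fire when the breach is sustained. The sketch below shows one way such a rule can be evaluated over timestamped samples. It is an illustration of the idea, not DigitalOcean's implementation, and the `sustained_breach` function and its `min_samples` parameter are hypothetical.

```python
from datetime import datetime, timedelta


def sustained_breach(samples, threshold, window, now, min_samples=1):
    """Return True if every sample in the trailing window exceeds threshold.

    samples: iterable of (timestamp, value) pairs.
    min_samples: minimum number of samples required in the window, so a
    single recent spike does not count as a sustained breach.
    """
    start = now - window
    recent = [value for ts, value in samples if start <= ts <= now]
    if len(recent) < min_samples:
        # Too little data: treated as "no breach" here; missing data is
        # usually worth a separate alert.
        return False
    return all(value > threshold for value in recent)


# One CPU sample per minute for the last 15 minutes, all at 95%.
now = datetime(2024, 1, 1, 12, 0)
hot = [(now - timedelta(minutes=i), 95.0) for i in range(15)]
print(sustained_breach(hot, 90, timedelta(minutes=15), now, min_samples=15))
```

A single dip below the threshold inside the window resets the condition, which is why longer windows produce fewer but more trustworthy alerts.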

For more sophisticated alerting based on your custom application and database scripts (like the Python examples above), consider integrating with tools like Prometheus and Alertmanager. Prometheus can scrape your application’s `/metrics` endpoint directly. The results of your custom scripts can be exposed as Prometheus metrics too, for example through a small exporter, so that alert rules evaluated by Prometheus are routed through Alertmanager.
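One way to hand script results to Prometheus is to write them in the text exposition format to a file that node_exporter's textfile collector picks up. The sketch below assumes that setup; the metric names and the helpers `render_prometheus` and `write_atomically` are hypothetical.

```python
import os
import tempfile


def render_prometheus(metrics):
    """Render {name: (help_text, value)} as Prometheus text exposition format.

    Every metric is emitted as a gauge, which suits point-in-time check results.
    """
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write text to path via a temp file and rename, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file as 0600; relax it so the collector can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)


# Hypothetical metric names for the MongoDB health check results.
text = render_prometheus({
    "mongodb_health_check_ok": ("1 if all checks passed", 1),
    "mongodb_replication_lag_seconds": ("Max secondary lag", 2.5),
})
print(text)
```

The rename step matters because the collector may read the file at any moment; writing in place could expose a half-written sample.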

Advanced: Centralized Logging and Tracing

Beyond metrics, logs and traces are invaluable for diagnosing issues. Centralizing logs from all your application instances and database nodes allows for easier searching and correlation.

Consider using a stack like:

Log Collection: Fluentd, Logstash, or Vector
Log Storage/Search: Elasticsearch, Loki
Visualization: Kibana, Grafana

For distributed tracing, integrate libraries like OpenTelemetry into your Python application. This allows you to visualize the path of a request through your system, identifying latency bottlenecks across services and databases.

Example: Python Logging Configuration

import logging
import logging.handlers
import sys
import os

LOG_LEVEL = os.environ.get('LOG_LEVEL', 'INFO').upper()
LOG_FILE = os.environ.get('LOG_FILE', '/var/log/myapp/app.log')
LOG_MAX_BYTES = int(os.environ.get('LOG_MAX_BYTES', 10 * 1024 * 1024)) # 10MB
LOG_BACKUP_COUNT = int(os.environ.get('LOG_BACKUP_COUNT', 5))

formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

# Get the root logger
logger = logging.getLogger()
logger.setLevel(LOG_LEVEL)

# Setup console handler first (for Docker/Kubernetes environments), so
# logging works even if the file handler cannot be created
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)

try:
    # Ensure log directory exists; exist_ok avoids a separate existence check
    log_dir = os.path.dirname(LOG_FILE) or '.'
    os.makedirs(log_dir, exist_ok=True)

    # Setup rotating file handler
    file_handler = logging.handlers.RotatingFileHandler(
        LOG_FILE,
        maxBytes=LOG_MAX_BYTES,
        backupCount=LOG_BACKUP_COUNT
    )
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)
    logging.info(f"Logging configured. Level: {LOG_LEVEL}, File: {LOG_FILE}")
except OSError as e:
    # Fall back to console-only logging if the directory or file is not writable.
    # Exit here instead if file logging is critical, e.g. sys.exit(1)
    logging.error(f"Cannot write to {LOG_FILE}: {e}. Logging to console only.")

# Example usage
def my_function():
    logging.info("Executing my_function.")
    try:
        # Simulate an error
        result = 1 / 0
    except ZeroDivisionError:
        logging.error("Encountered a ZeroDivisionError!", exc_info=True) # exc_info=True logs traceback
    logging.info("Finished my_function.")

if __name__ == "__main__":
    my_function()

Configure your log shipping agent (e.g., Fluentd) on each Droplet to collect logs from /var/log/myapp/app.log and forward them to your centralized logging system. For MongoDB, ensure its log configuration is also set up to output to a file that your agent can read.
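Log shippers generally parse structured lines more reliably than free-form text. As a sketch of one option, the formatter below emits each record as a single JSON object per line using only the standard library. The class name `JsonFormatter` and the chosen field names are assumptions for this example, and your agent's parser configuration would need to match them.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback in one field so multi-line errors stay in one event.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


# Attach the formatter to a handler in place of the plain-text one.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
```

Keeping tracebacks inside a JSON field also avoids the multi-line parsing rules that plain-text stack traces usually require.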

Conclusion: A Multi-Layered Approach

Effective server monitoring is not a single tool or configuration but a layered strategy. It starts with basic infrastructure metrics provided by DigitalOcean, extends to application-level health checks and custom database monitoring scripts, and is further enhanced by centralized logging and distributed tracing. By implementing these practices, you build resilience and gain deep visibility into your Python application and MongoDB cluster’s health, ensuring stability and performance.