Server Monitoring Best Practices: Keeping Your Python App and MongoDB Clusters Alive on DigitalOcean
Proactive Health Checks for Python Applications
Keeping a Python application healthy, especially one serving critical traffic, takes more than basic uptime checks. You also need application-level metrics and alerts tied to them. For a typical Flask or Django application, that means exposing internal metrics from the app and probing it from outside.
A common pattern is to expose a `/health` or `/metrics` endpoint within the application itself. This endpoint can report on database connectivity, cache status, and internal worker queues. Let’s consider a simple Flask example:
Flask Health Endpoint Example
from flask import Flask, jsonify
import pymongo
import redis

app = Flask(__name__)

# Configuration (ideally from environment variables)
MONGO_URI = "mongodb://mongo1:27017,mongo2:27017/?replicaSet=rs0"
REDIS_HOST = "redis_cache"
REDIS_PORT = 6379


def check_mongo_connection(uri):
    # A client per check keeps the example simple; a long-lived,
    # module-level MongoClient avoids reconnecting on every probe.
    client = None
    try:
        client = pymongo.MongoClient(uri, serverSelectionTimeoutMS=5000)
        # The ping command is cheap and does not require auth.
        # (ismaster is deprecated in recent MongoDB versions.)
        client.admin.command('ping')
        return True, "MongoDB connection successful"
    except pymongo.errors.ConnectionFailure as e:
        return False, f"MongoDB connection failed: {e}"
    except Exception as e:
        return False, f"An unexpected error occurred with MongoDB: {e}"
    finally:
        # Release sockets so repeated probes do not leak connections
        if client is not None:
            client.close()


def check_redis_connection(host, port):
    try:
        r = redis.Redis(host=host, port=port, socket_connect_timeout=2, socket_timeout=2)
        r.ping()
        return True, "Redis connection successful"
    except redis.exceptions.ConnectionError as e:
        return False, f"Redis connection failed: {e}"
    except Exception as e:
        return False, f"An unexpected error occurred with Redis: {e}"


@app.route('/health')
def health_check():
    mongo_ok, mongo_msg = check_mongo_connection(MONGO_URI)
    redis_ok, redis_msg = check_redis_connection(REDIS_HOST, REDIS_PORT)
    healthy = mongo_ok and redis_ok
    status = {
        "status": "healthy" if healthy else "unhealthy",
        "dependencies": {
            "mongodb": {"ok": mongo_ok, "message": mongo_msg},
            "redis": {"ok": redis_ok, "message": redis_msg}
        }
    }
    # 503 Service Unavailable tells load balancers to stop routing traffic here
    return jsonify(status), 200 if healthy else 503


if __name__ == '__main__':
    # In production, use a proper WSGI server like Gunicorn
    app.run(host='0.0.0.0', port=5000)
This endpoint returns 200 OK when every dependency is reachable and 503 Service Unavailable otherwise. The status code is the signal that load balancers and external monitoring tools act on, so it matters more than the response body. For more detailed metrics (request latency, error rates, memory usage), consider a library such as `prometheus_client` and a separate `/metrics` endpoint.
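An external probe for this endpoint can be sketched with the standard library alone. The function names `interpret_health` and `probe`, the timeout, and the `invalid-response` label are illustrative assumptions. The response shape is the one produced by the Flask example above.

```python
import json
import urllib.error
import urllib.request


def interpret_health(status_code, body):
    """Return (healthy, failing_dependencies) for a /health response."""
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # A non-JSON body usually means a proxy or error page answered instead of the app
        return False, ["invalid-response"]
    if not isinstance(payload, dict):
        return False, ["invalid-response"]
    deps = payload.get("dependencies", {})
    failing = sorted(
        name for name, dep in deps.items()
        if not (isinstance(dep, dict) and dep.get("ok"))
    )
    healthy = status_code == 200 and payload.get("status") == "healthy" and not failing
    return healthy, failing


def probe(url, timeout=5.0):
    """Fetch the health endpoint; treat network errors as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_health(resp.status, resp.read())
    except urllib.error.HTTPError as exc:
        # urllib raises on 5xx, but the 503 body still names the failing dependencies
        return interpret_health(exc.code, exc.read())
    except (urllib.error.URLError, OSError) as exc:
        return False, [f"unreachable: {exc}"]
```

Calling `probe("http://localhost:5000/health")` from a scheduler on a separate host gives a check that still works when the application host itself is down, which an in-process check cannot do.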
Monitoring MongoDB Clusters on DigitalOcean
DigitalOcean’s Managed MongoDB service simplifies cluster management, but robust monitoring is still essential. We need to track not just basic availability but also performance indicators like query latency, replication lag, and disk usage.
The primary tools for this are MongoDB’s diagnostic commands, available from the `mongosh` shell or programmatically through a driver. For automated monitoring, use a dedicated monitoring agent or a script that queries these metrics periodically.
Key MongoDB Metrics to Monitor
- Replication Lag: Critical for ensuring data consistency across nodes. Use `rs.status()`.
- Query Performance: Track slow queries and overall query execution times. Use `db.serverStatus()` and `db.currentOp()`.
- Disk Usage: Prevent outages due to full disks. Use `db.stats()` or system-level tools.
- Connections: Monitor active and available connections to avoid connection exhaustion. Use `db.serverStatus()`.
- Memory Usage: Keep an eye on RAM consumption, especially for WiredTiger cache. Use `db.serverStatus()`.
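The connection and memory figures are easier to alert on as ratios than as raw counters. The sketch below derives two ratios from a `serverStatus`-shaped document. The function name `summarize_server_status` and the sample values are made up for illustration; the key names are the ones `serverStatus` reports for connections and the WiredTiger cache.

```python
def summarize_server_status(server_status):
    """Derive utilization ratios from a serverStatus-style document."""
    conns = server_status["connections"]
    # 'available' counts unused connections, so the limit is current + available
    total_conns = conns["current"] + conns["available"]
    cache = server_status["wiredTiger"]["cache"]
    cache_max = cache["maximum bytes configured"]
    return {
        "connection_utilization": conns["current"] / total_conns if total_conns else 0.0,
        "cache_fill_ratio": cache["bytes currently in the cache"] / cache_max if cache_max else 0.0,
    }


# Hypothetical sample, shaped like the relevant parts of a serverStatus response
sample = {
    "connections": {"current": 90, "available": 910},
    "wiredTiger": {
        "cache": {
            "bytes currently in the cache": 512 * 1024**2,
            "maximum bytes configured": 1024**3,
        }
    },
}
summary = summarize_server_status(sample)
```

With a live client, the input would come from `client.admin.command('serverStatus')`.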
Automated MongoDB Health Checks Script (Python)
This Python script connects to a MongoDB replica set and checks for replication lag and basic server status. It’s designed to be run periodically by a scheduler like cron or a systemd timer.
import os
import sys

import pymongo

# Configuration from environment variables
MONGO_URI = os.environ.get("MONGO_URI", "mongodb://user:password@mongo1:27017,mongo2:27017/?replicaSet=rs0")
REPLICATION_LAG_THRESHOLD_SECONDS = int(os.environ.get("REPLICATION_LAG_THRESHOLD_SECONDS", 60))
DISK_USAGE_THRESHOLD_PERCENT = int(os.environ.get("DISK_USAGE_THRESHOLD_PERCENT", 85))

def check_replication_lag(client):
    try:
        rs_status = client.admin.command('replSetGetStatus')
        members = rs_status['members']
        # Locate the primary once; without one there is nothing to measure lag against
        primary = next((m for m in members if m['stateStr'] == 'PRIMARY'), None)
        if primary is None:
            print("CRITICAL: No PRIMARY found in replica set.", file=sys.stderr)
            return False
        max_lag = 0.0
        for member in members:
            # Arbiters hold no data, and the primary cannot lag behind itself
            if member['stateStr'] in ('PRIMARY', 'ARBITER'):
                continue
            lag = (primary['optimeDate'] - member['optimeDate']).total_seconds()
            max_lag = max(max_lag, lag)
        if max_lag > REPLICATION_LAG_THRESHOLD_SECONDS:
            print(f"CRITICAL: Replication lag detected. Max lag: {max_lag:.2f}s (Threshold: {REPLICATION_LAG_THRESHOLD_SECONDS}s)", file=sys.stderr)
            return False
        print(f"OK: Replication lag is within acceptable limits (Max lag: {max_lag:.2f}s)")
        return True
    except pymongo.errors.OperationFailure as e:
        print(f"ERROR: Failed to get replication status: {e}", file=sys.stderr)
        return False
    except Exception as e:
        print(f"ERROR: Unexpected error during replication check: {e}", file=sys.stderr)
        return False

def check_disk_usage(client):
    try:
        # dbStats reports filesystem totals (fsUsedSize / fsTotalSize) next to the
        # per-database sizes, so any database works for a volume-level check.
        db_stats = client.admin.command('dbStats')
        fs_used = db_stats.get('fsUsedSize')
        fs_total = db_stats.get('fsTotalSize')
        if not fs_used or not fs_total:
            # Fields missing or zero: fall back to the provider's own metrics.
            # For self-hosted MongoDB on a Droplet, `df` on the data volume also works.
            print("INFO: Filesystem stats unavailable. For managed databases, rely on DigitalOcean's provided metrics.")
            return True
        used_percent = fs_used / fs_total * 100
        if used_percent > DISK_USAGE_THRESHOLD_PERCENT:
            print(f"CRITICAL: Disk usage at {used_percent:.1f}% (Threshold: {DISK_USAGE_THRESHOLD_PERCENT}%)", file=sys.stderr)
            return False
        print(f"OK: Disk usage at {used_percent:.1f}%")
        return True
    except pymongo.errors.OperationFailure as e:
        print(f"ERROR: Failed to get database stats: {e}", file=sys.stderr)
        return False
    except Exception as e:
        print(f"ERROR: Unexpected error during disk usage check: {e}", file=sys.stderr)
        return False

def check_server_status(client):
    try:
        server_status = client.admin.command('serverStatus')
        connections = server_status['connections']
        network = server_status['network']
        # WiredTiger statistics sit at the top level, with spaces in their key names
        cache = server_status.get('wiredTiger', {}).get('cache', {})
        print(f"INFO: Connections - Current: {connections['current']}, Available: {connections['available']}")
        print(f"INFO: Network - Bytes In: {network['bytesIn']}, Bytes Out: {network['bytesOut']}")
        if cache:
            print(f"INFO: WiredTiger Cache - Bytes Used: {cache['bytes currently in the cache']:,}, Pages Read into Cache: {cache['pages read into cache']:,}")
        # Add specific thresholds for connections, cache usage etc. if needed.
        # 'available' counts unused connections, so the limit is current + available.
        total = connections['current'] + connections['available']
        if total and connections['current'] > total * 0.9:
            print(f"WARNING: High connection usage: {connections['current']}/{total}", file=sys.stderr)
        return True
    except pymongo.errors.OperationFailure as e:
        print(f"ERROR: Failed to get server status: {e}", file=sys.stderr)
        return False
    except Exception as e:
        print(f"ERROR: Unexpected error during server status check: {e}", file=sys.stderr)
        return False

if __name__ == "__main__":
    client = None
    try:
        client = pymongo.MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
        client.admin.command('ping')  # Verify connection
        print("INFO: Successfully connected to MongoDB.")
        # Run every check (no short-circuit) so a single run reports all failures
        results = [
            check_replication_lag(client),
            check_disk_usage(client),  # Note: Simplified for DO Managed DBs
            check_server_status(client),
        ]
        if all(results):
            print("INFO: All MongoDB health checks passed.")
            sys.exit(0)
        else:
            print("ERROR: One or more MongoDB health checks failed.", file=sys.stderr)
            sys.exit(1)
    except pymongo.errors.ConnectionFailure as e:
        # MONGO_URI is deliberately not printed: it may contain credentials
        print(f"FATAL: Could not connect to MongoDB: {e}", file=sys.stderr)
        sys.exit(2)
    except Exception as e:
        print(f"FATAL: An unexpected error occurred: {e}", file=sys.stderr)
        sys.exit(3)
    finally:
        if client is not None:
            client.close()
To use this script effectively:
- Set the `MONGO_URI` environment variable with your DigitalOcean Managed MongoDB connection string.
- Configure `REPLICATION_LAG_THRESHOLD_SECONDS` and `DISK_USAGE_THRESHOLD_PERCENT` as needed.
- Schedule this script using `cron` or a systemd timer to run every 1-5 minutes.
- Pipe the output to a log file and configure your alerting system (e.g., Prometheus Alertmanager, PagerDuty) to trigger on non-zero exit codes or specific error messages.
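Triggering on exit codes can be sketched as a small wrapper that runs the check and maps the result to a severity. The severity names, the `run_check` helper, and the script file name in the usage note are assumptions made for illustration. The exit codes 0 to 3 are the ones the script above uses.

```python
import subprocess

# Exit codes of the health check script: 0 = all checks passed, 1 = a check failed,
# 2 = could not connect, 3 = unexpected error. Severity names are illustrative.
SEVERITY_BY_EXIT_CODE = {0: "ok", 1: "critical", 2: "critical", 3: "unknown"}


def severity_for(returncode):
    """Map a process exit code to an alert severity."""
    return SEVERITY_BY_EXIT_CODE.get(returncode, "unknown")


def run_check(command, timeout=60):
    """Run a check command and return (severity, combined_output)."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # A hung check is itself a signal worth alerting on
        return "unknown", f"check timed out after {timeout}s"
    except OSError as exc:
        return "unknown", f"could not start check: {exc}"
    return severity_for(proc.returncode), (proc.stdout + proc.stderr).strip()
```

A scheduler job could call `run_check([sys.executable, "mongo_health_check.py"])` (with `import sys`, and the file name standing in for wherever the script is saved) and forward anything other than `"ok"` to the alerting system.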
Integrating with DigitalOcean Monitoring & Alerting
DigitalOcean’s built-in monitoring provides a good baseline. For your Droplets running Python apps and potentially self-hosted MongoDB (though Managed MongoDB is recommended), ensure the DigitalOcean agent is installed and configured.
Key metrics to monitor via the DO dashboard:
- CPU Utilization: High CPU can indicate inefficient code or heavy load.
- Memory Usage: Crucial for Python apps and MongoDB’s cache.
- Disk I/O: Bottlenecks here severely impact database performance.
- Network Traffic: Monitor for unusual spikes or drops.
Setting Up Alerts in DigitalOcean
DigitalOcean allows you to set up alerts directly on Droplet and Managed Database metrics. This is your first line of defense.
Example alert configuration:
- Resource: Droplet CPU Usage
- Condition: Greater than 90% for 15 minutes
- Alerts To: Your email, Slack integration (via webhooks)
- Resource: Managed MongoDB Disk Usage
- Condition: Greater than 85% for 30 minutes
- Alerts To: Your email, PagerDuty
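A condition such as "greater than 90% for 15 minutes" fires only when the metric stays above the threshold for the whole window, so a brief spike does not page anyone. The sketch below illustrates that semantics for your own scripts. It is not DigitalOcean’s implementation, and the function name and sample layout are assumptions.

```python
from datetime import datetime, timedelta


def breaches_for_duration(samples, threshold, duration, now):
    """Return True if every sample in the trailing window exceeds the threshold.

    samples is a list of (timestamp, value) pairs; the history must reach back
    at least as far as the start of the window.
    """
    window_start = now - duration
    if not samples or min(ts for ts, _ in samples) > window_start:
        return False  # not enough history to cover the whole window
    in_window = [value for ts, value in samples if window_start <= ts <= now]
    return bool(in_window) and all(value > threshold for value in in_window)


# Hypothetical CPU samples: one per minute for the last 20 minutes, all at 95%
now = datetime(2024, 1, 1, 12, 0)
steady = [(now - timedelta(minutes=m), 95.0) for m in range(20)]
# Same series with a single dip 5 minutes ago
dipped = [(ts, 50.0 if ts == now - timedelta(minutes=5) else v) for ts, v in steady]
```

Here `breaches_for_duration(steady, 90, timedelta(minutes=15), now)` is true, while the `dipped` series does not fire because one sample inside the window fell below the threshold.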
For more sophisticated alerting based on your custom application and database checks (like the Python examples above), consider Prometheus with Alertmanager. Prometheus can scrape your application’s `/metrics` endpoint directly. Custom scripts cannot be scraped as they are, so their results have to be published as metrics first, for example through the Pushgateway or the node_exporter textfile collector, before Alertmanager can route alerts on them.
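A script’s results become readable by Prometheus once they are rendered in the text exposition format, for instance into a file picked up by the node_exporter textfile collector. The following is a minimal sketch under that assumption. The metric names and both helper functions are hypothetical.

```python
import os
import tempfile


def render_gauges(gauges):
    """Render {name: (help_text, value)} in Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(gauges.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write via a temp file and rename, so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        # mkstemp creates the file as 0600; the collector may run as another user
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Hypothetical metric names for the MongoDB health check results
text = render_gauges({
    "mongodb_replication_lag_seconds": ("Max secondary lag", 12.5),
    "mongodb_health_check_ok": ("1 if all checks passed", 1),
})
```

The rename step matters because the collector may read the directory at any moment. Alert rules can then compare `mongodb_replication_lag_seconds` against a threshold instead of parsing log lines.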
Advanced: Centralized Logging and Tracing
Beyond metrics, logs and traces are invaluable for diagnosing issues. Centralizing logs from all your application instances and database nodes allows for easier searching and correlation.
Consider using a stack like:
- Log Collection: Fluentd, Logstash, or Vector
- Log Storage/Search: Elasticsearch, Loki
- Visualization: Kibana, Grafana
For distributed tracing, integrate libraries like OpenTelemetry into your Python application. This allows you to visualize the path of a request through your system, identifying latency bottlenecks across services and databases.
Example: Python Logging Configuration
import logging
import logging.handlers
import os
import sys

LOG_LEVEL = os.environ.get('LOG_LEVEL', 'INFO').upper()
LOG_FILE = os.environ.get('LOG_FILE', '/var/log/myapp/app.log')
LOG_MAX_BYTES = int(os.environ.get('LOG_MAX_BYTES', 10 * 1024 * 1024))  # 10MB
LOG_BACKUP_COUNT = int(os.environ.get('LOG_BACKUP_COUNT', 5))

formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

# Get the root logger
logger = logging.getLogger()
logger.setLevel(LOG_LEVEL)

# Setup console handler (for Docker/Kubernetes environments); always attached
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)

# Setup rotating file handler; fall back to console-only logging if that fails
try:
    # Ensure log directory exists (exist_ok avoids a check-then-create race)
    log_dir = os.path.dirname(LOG_FILE)
    if log_dir:
        os.makedirs(log_dir, exist_ok=True)
    file_handler = logging.handlers.RotatingFileHandler(
        LOG_FILE,
        maxBytes=LOG_MAX_BYTES,
        backupCount=LOG_BACKUP_COUNT
    )
except OSError as e:
    logging.error(f"Cannot write to {LOG_FILE}: {e}. Logging to console only.")
    # Exit or handle appropriately if file logging is critical
    # sys.exit(1)
else:
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

logging.info(f"Logging configured. Level: {LOG_LEVEL}, File: {LOG_FILE}")


# Example usage
def my_function():
    logging.info("Executing my_function.")
    try:
        # Simulate an error
        result = 1 / 0
    except ZeroDivisionError:
        logging.error("Encountered a ZeroDivisionError!", exc_info=True)  # exc_info=True logs traceback
    logging.info("Finished my_function.")


if __name__ == "__main__":
    my_function()
Configure your log shipping agent (e.g., Fluentd) on each Droplet to collect logs from `/var/log/myapp/app.log` and forward them to your centralized logging system. For MongoDB, ensure its log configuration is also set up to output to a file that your agent can read.
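Log shippers generally have an easier time with one JSON object per line than with free-form text, because fields such as level and logger name no longer need a parsing regex. The formatter below is a minimal sketch of that idea. The class name `JsonFormatter` and the field names are choices made for illustration, and it assumes your pipeline is configured to parse JSON lines.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback inside the same JSON object so it ships as one event
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


# Build a record by hand to show the output shape without touching global logging state
record = logging.LogRecord(
    name="myapp", level=logging.ERROR, pathname="example.py", lineno=1,
    msg="query failed for %s", args=("orders",), exc_info=None,
)
line = JsonFormatter().format(record)
```

To use it, pass an instance to `setFormatter` on the file handler in place of the plain-text formatter. Multi-line tracebacks then arrive as a single event instead of being split across records.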
Conclusion: A Multi-Layered Approach
Effective server monitoring is not a single tool or configuration but a layered strategy. It starts with basic infrastructure metrics provided by DigitalOcean, extends to application-level health checks and custom database monitoring scripts, and is further enhanced by centralized logging and distributed tracing. By implementing these practices, you build resilience and gain deep visibility into your Python application and MongoDB cluster’s health, ensuring stability and performance.