Server Monitoring Best Practices: Keeping Your Python App and MongoDB Clusters Alive on Google Cloud
Proactive Health Checks for Python Applications on GKE
Maintaining the health of Python applications deployed on Google Kubernetes Engine (GKE) requires a multi-layered approach to monitoring. Beyond basic liveness and readiness probes, we need to instrument our applications to expose internal metrics and implement robust external checks that simulate user behavior.
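An external check of the kind mentioned above can be as small as a script that requests a page the way a user would and records the outcome. The following is a minimal sketch using only the standard library. The function name probe and its injectable opener parameter are illustrative assumptions, not part of any existing tooling.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Request `url` once and report whether it answered successfully.

    `opener` is injectable so the check can be exercised without a network;
    by default it is urllib.request.urlopen.
    """
    start = time.monotonic()
    status = None
    error = None
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            response.read()  # Pull the body so latency covers the full response
    except urllib.error.HTTPError as exc:
        status = exc.code  # The server answered, but with a 4xx/5xx status
        error = str(exc)
    except (urllib.error.URLError, OSError) as exc:
        error = str(exc)  # DNS failure, refused connection, timeout, ...
    latency = time.monotonic() - start
    ok = status is not None and 200 <= status < 400
    return {"url": url, "ok": ok, "status": status,
            "latency_seconds": latency, "error": error}
```

Run on a schedule from outside the cluster, a check like this catches failures that in-pod probes miss, such as a broken load balancer or DNS record.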
Application-Level Metrics with Prometheus
Prometheus is the de facto standard for metrics collection in Kubernetes. For Python applications, the prometheus_client library is indispensable. We’ll instrument our Flask application to expose key performance indicators (KPIs) like request latency, error rates, and active user counts.
Instrumenting a Flask Application
First, install the necessary library:
pip install prometheus_client Flask
Next, integrate it into your Flask app. We’ll create a `/metrics` endpoint that Prometheus can scrape.
Consider a simple Flask application:
app.py
from flask import Flask, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest, Counter, Histogram, Gauge
import threading
import time
import random

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP Requests', ['method', 'endpoint', 'status_code'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP Request Latency', ['method', 'endpoint'])
ACTIVE_USERS = Gauge('app_active_users', 'Number of Active Users')


@app.route('/')
def index():
    start_time = time.time()
    try:
        # Simulate some work
        time.sleep(random.uniform(0.1, 0.5))
        status_code = 200
        return "Hello, World!"
    except Exception as e:
        status_code = 500
        return str(e), 500
    finally:
        # Record latency and outcome whether the request succeeded or failed
        duration = time.time() - start_time
        REQUEST_LATENCY.labels(method='GET', endpoint='/').observe(duration)
        REQUEST_COUNT.labels(method='GET', endpoint='/', status_code=status_code).inc()


@app.route('/metrics')
def metrics():
    # Serve all registered metrics in the Prometheus text exposition format
    return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)


if __name__ == '__main__':
    # Simulate user activity
    def update_users():
        while True:
            ACTIVE_USERS.set(random.randint(10, 100))
            time.sleep(5)

    # Daemon thread so it does not block interpreter shutdown
    user_thread = threading.Thread(target=update_users, daemon=True)
    user_thread.start()
    app.run(host='0.0.0.0', port=5000)
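Repeating the try/finally timing block in every route gets tedious. One option, sketched here with only the standard library, is a decorator that times a handler and hands the result to a callback. In the Flask app above, that callback would call REQUEST_LATENCY.labels(...).observe(...) and REQUEST_COUNT.labels(...).inc(). The names track_request and record are hypothetical, and the sketch assumes handlers return either a body or a (body, status) tuple.

```python
import functools
import time


def track_request(endpoint, record, method="GET"):
    """Decorator factory: time a handler and report (method, endpoint, status, seconds)."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status_code = 500  # Reported if the handler raises
            try:
                result = handler(*args, **kwargs)
                # Flask-style handlers may return (body, status); a bare body means 200
                if isinstance(result, tuple) and len(result) > 1 and isinstance(result[1], int):
                    status_code = result[1]
                else:
                    status_code = 200
                return result
            finally:
                record(method, endpoint, status_code, time.perf_counter() - start)
        return wrapper
    return decorator
```

When combining this with Flask, apply @app.route above @track_request so that Flask registers the wrapped function.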
Deploying Prometheus to GKE
We’ll use the Prometheus Operator for a streamlined deployment. This involves applying a set of Kubernetes manifests that define the Prometheus server, Alertmanager, and custom resource definitions (CRDs) for configuring scraping and alerting.
First, install the Prometheus Operator CRDs, followed by the operator bundle:
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/bundle.yaml
If kubectl rejects one of the larger CRDs with a "metadata.annotations: Too long" error, rerun that command with kubectl apply --server-side.
Next, create a ServiceMonitor resource to tell Prometheus how to scrape your application’s metrics endpoint. A ServiceMonitor selects Services (not Pods) by label, so this assumes the Service my-flask-app-service carries the label app: my-flask-app and exposes a port named http in front of your Flask pods.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-flask-app-monitor
  labels:
    release: prometheus # Must match the label selector of your Prometheus instance
spec:
  selector:
    matchLabels:
      app: my-flask-app # Label on the Service to scrape
  namespaceSelector:
    matchNames:
      - default # Namespace where your application is deployed
  endpoints:
    - port: http # Name of the port in your Service
      interval: 30s
      path: /metrics
Apply this manifest:
kubectl apply -f service-monitor.yaml
Finally, configure your Prometheus instance to pick up this ServiceMonitor. This is typically done by ensuring the serviceMonitorSelector in your Prometheus custom resource (CR) matches the labels on the ServiceMonitor (e.g., release: prometheus).
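The label matching involved here follows the usual Kubernetes matchLabels rule: every key/value pair in the selector must be present on the object, and extra labels on the object are ignored. A small sketch of that rule, with an illustrative function name and sample labels:

```python
def matches(match_labels, object_labels):
    """Return True if every selector pair appears in the object's labels."""
    return all(object_labels.get(key) == value for key, value in match_labels.items())


# The Prometheus CR's label selector picks up the ServiceMonitor ...
print(matches({"release": "prometheus"}, {"release": "prometheus", "team": "web"}))
# ... while a ServiceMonitor selecting app: my-flask-app skips a Service labeled differently.
print(matches({"app": "my-flask-app"}, {"app": "other-app"}))
```

The first call prints True and the second False. A mismatch at either level (Prometheus to ServiceMonitor, or ServiceMonitor to Service) is a common reason a target never appears in the Prometheus targets list.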
MongoDB Cluster Health on Google Cloud
Monitoring MongoDB clusters, especially in a distributed environment on GKE, requires attention to both instance-level metrics and cluster-wide operational status. Google Cloud’s operations suite (formerly Stackdriver) offers powerful tools, but we’ll also leverage MongoDB’s native monitoring capabilities.
Leveraging Google Cloud Operations Suite
The Ops Agent for Google Cloud can collect system and application logs and metrics, including MongoDB metrics, but it is designed for Compute Engine VMs. For MongoDB pods on GKE, the same role is played by a collector or agent running in the cluster, typically as a DaemonSet, configured to collect the relevant MongoDB metrics.
A typical configuration enables metrics and logs collection for your MongoDB pods. You can supply it by creating a ConfigMap and mounting it into the agent’s deployment.
Example ConfigMap for Ops Agent (simplified):
apiVersion: v1
kind: ConfigMap
metadata:
  name: ops-agent-config
  namespace: google-cloud-ops-agent
data:
  config-otel-collector.yaml: |
    receivers:
      hostmetrics:
        collection_interval: 10s
        scrapers:
          cpu:
          memory:
          disk:
          network:
      # Add specific MongoDB receiver if available or use logs/exec
    exporters:
      googlecloud:
        project: YOUR_GCP_PROJECT_ID
    service:
      pipelines:
        metrics:
          receivers: [hostmetrics]
          exporters: [googlecloud]
You would then reference this ConfigMap in your Ops Agent DaemonSet or Deployment. The specific metrics to collect include CPU utilization, memory usage, disk I/O, network traffic, and importantly, MongoDB-specific metrics like connection counts, query performance, replication lag, and oplog size.
MongoDB Native Monitoring Tools
For deeper insights, mongostat, mongotop, and the MongoDB diagnostic commands (accessible via the mongo shell, or mongosh in MongoDB 5.0 and later) are invaluable. We can integrate these into our monitoring strategy by running them periodically and exporting their output.
Replication Lag Monitoring
Replication lag is a critical indicator of cluster health. We can monitor this using the rs.status() command.
mongo --host mongodb-0.mongodb-headless.default.svc.cluster.local --port 27017 --username replUser --password replPassword --authenticationDatabase admin --quiet --eval "rs.status()"
This command outputs a JSON-like document describing every member of the replica set. We can parse it to compute each secondary’s lag by comparing its last applied operation time with the primary’s. A simple Python script can automate this:
from pymongo import MongoClient
import os

# Connect to any node in the replica set.
# Ensure you have appropriate credentials and connection settings.
MONGO_HOST = os.environ.get("MONGO_HOST", "mongodb-0.mongodb-headless.default.svc.cluster.local")
MONGO_PORT = int(os.environ.get("MONGO_PORT", 27017))
MONGO_USER = os.environ.get("MONGO_USER", "replUser")
MONGO_PASS = os.environ.get("MONGO_PASS", "replPassword")
AUTH_DB = os.environ.get("AUTH_DB", "admin")

client = MongoClient(
    host=MONGO_HOST,
    port=MONGO_PORT,
    username=MONGO_USER,
    password=MONGO_PASS,
    authSource=AUTH_DB,
)

try:
    rs_status = client.admin.command('replSetGetStatus')
    primary_member = None
    secondaries = []
    for member in rs_status['members']:
        if member['stateStr'] == 'PRIMARY':
            primary_member = member
        elif member['stateStr'] == 'SECONDARY':
            secondaries.append(member)

    if primary_member:
        for secondary in secondaries:
            # Lag is the gap between the last operation applied on the primary
            # and the last operation applied on this secondary. optimeDate holds
            # the time of that operation as a datetime, so the two subtract directly.
            lag = primary_member['optimeDate'] - secondary['optimeDate']
            lag_seconds = max(lag.total_seconds(), 0.0)
            print(f"Secondary {secondary['name']} lag: {lag_seconds:.2f} seconds")
            # Here you would typically push this metric to Prometheus or another monitoring system
            # e.g., replication_lag_seconds.labels(member=secondary['name']).set(lag_seconds)
    else:
        print("No primary found in replica set status.")
except Exception as e:
    print(f"Error fetching replica set status: {e}")
finally:
    client.close()
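This script can be run as a Kubernetes CronJob. Because a CronJob is short-lived and Prometheus scrapes metrics endpoints rather than logs, the job should push its result to a Prometheus Pushgateway or directly to a time-series database. Alternatively, run the same check in a long-lived exporter that exposes its own /metrics endpoint.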
Oplog Monitoring
The size and time window of the oplog are crucial for replication. The oplog is a capped collection, so old entries are overwritten as new ones arrive; a secondary that falls further behind than the oplog window can no longer catch up and needs a full resync. We can watch write activity with mongostat and inspect the oplog itself by querying the oplog.rs collection in the local database.
# Using mongostat for a one-row snapshot of operation counters (requires mongostat executable in the pod)
mongostat --host mongodb-0.mongodb-headless.default.svc.cluster.local --port 27017 -u replUser -p replPassword --authenticationDatabase admin --rowcount 1
# Using mongo shell to query oplog.rs
mongo --host mongodb-0.mongodb-headless.default.svc.cluster.local --port 27017 --username replUser --password replPassword --authenticationDatabase admin --quiet --eval "db.getSiblingDB('local').oplog.rs.stats()"
The output of oplog.rs.stats() provides information on the oplog’s current size, maximum size, and entry count. Again, a script can parse this and expose it as metrics.
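As a sketch of what such a script might compute, the function below derives two numbers worth exporting: how much of the oplog’s capacity is in use, and the oplog window, meaning the time span between its oldest and newest entries. It assumes the stats output exposes size and maxSize in bytes and that the timestamps of the first and last oplog entries have already been read as seconds since the epoch. The function and metric names are hypothetical; check the field names against your MongoDB version.

```python
def oplog_metrics(stats, first_ts_seconds, last_ts_seconds):
    """Summarize oplog health from collection stats and entry timestamps."""
    max_size = stats["maxSize"]
    used_ratio = stats["size"] / max_size if max_size else 0.0
    # Window: how far back in time the oplog currently reaches
    window_hours = max(last_ts_seconds - first_ts_seconds, 0) / 3600
    return {
        "oplog_used_ratio": used_ratio,
        "oplog_window_hours": window_hours,
    }


# Made-up sample values: a 1 GiB oplog that is half full and spans one day
sample_stats = {"size": 512 * 1024**2, "maxSize": 1024**3, "count": 120000}
print(oplog_metrics(sample_stats, first_ts_seconds=1_700_000_000, last_ts_seconds=1_700_086_400))
```

For this sample the ratio is 0.5 and the window is 24.0 hours. The window is usually the more useful number to alert on: it should comfortably exceed the longest downtime you expect a secondary to recover from.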
Alerting Strategies
Effective alerting is paramount. We should define alerts based on thresholds that indicate potential issues before they impact users. This involves configuring Prometheus Alertmanager.
Alerting on Application Health
Examples of Prometheus alerting rules for the Python application:
groups:
  - name: python_app_alerts
    rules:
      - alert: HighHttpRequestErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (endpoint)
          /
          sum(rate(http_requests_total[5m])) by (endpoint)
          > 0.05 # More than 5% error rate
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP request error rate for {{ $labels.endpoint }}"
          description: "The endpoint {{ $labels.endpoint }} has an error rate exceeding 5% over the last 5 minutes."
      - alert: HighHttpRequestLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)) > 2 # 95th percentile latency > 2 seconds
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP request latency for {{ $labels.endpoint }}"
          description: "The 95th percentile latency for endpoint {{ $labels.endpoint }} is above 2 seconds for the last 10 minutes."
      - alert: LowActiveUsers
        expr: app_active_users < 10 # Less than 10 active users
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Low active user count"
          description: "The number of active users has dropped below 10 for the last 15 minutes."
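The latency rule relies on histogram_quantile, which estimates a percentile from cumulative bucket counts rather than from raw observations. The following simplified sketch mirrors the linear interpolation Prometheus applies within the bucket that contains the target rank. It assumes 0 < q <= 1 and at least one finite bucket, ignores edge cases such as NaN inputs and negative bounds, and uses made-up bucket values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from [(upper_bound, cumulative_count), ...].

    The last bucket must be the +Inf bucket, as in Prometheus histograms.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound
        return buckets[-2][0]
    lower, below = (0.0, 0) if index == 0 else buckets[index - 1]
    # Interpolate linearly between the bucket's bounds
    return lower + (upper - lower) * (rank - below) / (count - below)


# 100 requests: 60 under 0.5s, 90 under 1s, 98 under 2.5s, 2 slower than that
latency_buckets = [(0.5, 60), (1.0, 90), (2.5, 98), (math.inf, 100)]
print(histogram_quantile(0.95, latency_buckets))
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries. If the alert threshold is 2 seconds, it helps to have a bucket boundary at or near 2 seconds in the Histogram definition.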
Alerting on MongoDB Cluster Health
Alerting rules for MongoDB can be defined similarly, assuming you’ve exported MongoDB metrics to Prometheus (e.g., via a custom exporter built on the scripts above). If using Google Cloud Operations, you’d configure alerts directly within that platform.
groups:
  - name: mongodb_alerts
    rules:
      - alert: MongoDBReplicationLag
        # This assumes a metric 'mongodb_replication_lag_seconds' is exposed
        expr: mongodb_replication_lag_seconds > 60 # Replication lag exceeds 60 seconds
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MongoDB replication lag detected on {{ $labels.member }}"
          description: "Replica set member {{ $labels.member }} has a replication lag of {{ $value }} seconds."
      - alert: HighMongoDBOplogWriteRate
        # This assumes a counter like 'mongodb_oplog_entries_total' or similar is exposed
        expr: rate(mongodb_oplog_entries_total[5m]) > 1000 # High oplog write rate
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High MongoDB oplog write rate"
          description: "MongoDB oplog write rate is exceeding 1000 entries/sec."
      - alert: MongoDBDiskSpaceLow
        # This assumes node_exporter's filesystem metrics; adjust the mountpoint to your data volume
        expr: node_filesystem_avail_bytes{mountpoint="/data/db"} < 1024*1024*1024*10 # Less than 10GB free disk space
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on MongoDB node"
          description: "MongoDB node {{ $labels.instance }} has less than 10GB of free disk space on /data/db."
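These rules should be applied to your Prometheus instance via a PrometheusRule custom resource in Kubernetes: place the groups under spec.groups and give the resource labels that match the ruleSelector of your Prometheus CR.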
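The for: clause in these rules means the expression must stay true across consecutive evaluations for the whole duration before the alert fires, which filters out brief spikes. A simplified model of that pending/firing behavior follows; the function name and sample data are made up, and real Prometheus evaluates rules on its own configured interval.

```python
def alert_states(samples, threshold, for_seconds):
    """Yield (timestamp, state) for each (timestamp, value) sample.

    state is 'inactive', 'pending' (condition true, but not yet for long
    enough), or 'firing'.
    """
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            yield timestamp, "firing" if held >= for_seconds else "pending"
        else:
            active_since = None  # Any false evaluation resets the timer
            yield timestamp, "inactive"


# Replication lag sampled every 60 seconds, alerting above 60s for 120s
lag_samples = [(0, 10), (60, 90), (120, 95), (180, 20), (240, 80), (300, 85), (360, 70)]
for timestamp, state in alert_states(lag_samples, threshold=60, for_seconds=120):
    print(timestamp, state)
```

In this run the first spike (at 60 and 120 seconds) never fires because the lag recovers at 180 seconds; only the second, sustained breach reaches the firing state at 360 seconds. Choosing for: is a trade-off between noise and detection delay.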
Conclusion
A comprehensive monitoring strategy for Python applications and MongoDB clusters on GKE involves a blend of application-level instrumentation, Kubernetes-native tooling, cloud provider services, and database-specific diagnostics. By proactively monitoring key metrics and setting up intelligent alerts, you can ensure the stability, performance, and availability of your critical services.