Server Monitoring Best Practices: Keeping Your Python App and MongoDB Clusters Alive on Google Cloud

Proactive Health Checks for Python Applications on GKE

Maintaining the health of Python applications deployed on Google Kubernetes Engine (GKE) requires a multi-layered approach to monitoring. Beyond basic liveness and readiness probes, we need to instrument our applications to expose internal metrics and implement robust external checks that simulate user behavior.

Application-Level Metrics with Prometheus

Prometheus is the de facto standard for metrics collection in Kubernetes. For Python applications, the prometheus_client library is indispensable. We’ll instrument our Flask application to expose key performance indicators (KPIs) like request latency, error rates, and active user counts.

Instrumenting a Flask Application

First, install the necessary library:

pip install prometheus_client Flask

Next, integrate it into your Flask app. We’ll create a `/metrics` endpoint that Prometheus can scrape.

Consider a simple Flask application:

app.py

from flask import Flask, Response
from prometheus_client import generate_latest, Counter, Histogram, Gauge
import time
import random

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP Requests', ['method', 'endpoint', 'status_code'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP Request Latency', ['method', 'endpoint'])
ACTIVE_USERS = Gauge('app_active_users', 'Number of Active Users')

@app.route('/')
def index():
    start_time = time.time()
    try:
        # Simulate some work
        time.sleep(random.uniform(0.1, 0.5))
        status_code = 200
        return "Hello, World!"
    except Exception as e:
        status_code = 500
        return str(e), 500
    finally:
        duration = time.time() - start_time
        REQUEST_LATENCY.labels(method='GET', endpoint='/').observe(duration)
        REQUEST_COUNT.labels(method='GET', endpoint='/', status_code=status_code).inc()

@app.route('/metrics')
def metrics():
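    # Serve the Prometheus text exposition format with its full content type
    return Response(generate_latest(), content_type='text/plain; version=0.0.4; charset=utf-8')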

if __name__ == '__main__':
    # Simulate user activity
    def update_users():
        while True:
            ACTIVE_USERS.set(random.randint(10, 100))
            time.sleep(5)
    
    import threading
    user_thread = threading.Thread(target=update_users)
    user_thread.daemon = True
    user_thread.start()

    app.run(host='0.0.0.0', port=5000)
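The try/finally bookkeeping in index() has to be repeated in every view. One way to avoid that is a decorator that records latency and status for any view function. The following is a minimal sketch, not part of the original app: the name instrumented and its helper are hypothetical, and it only assumes that the metric objects passed in behave like prometheus_client's Histogram and Counter (they offer .labels(...).observe() and .labels(...).inc()).

```python
import functools
import time


def _status_of(result):
    """Best-effort status code for a Flask-style return value."""
    # Flask views may return (body, status); a bare body means 200.
    if isinstance(result, tuple) and len(result) > 1 and isinstance(result[1], int):
        return result[1]
    # Response-like objects carry a status_code attribute; plain strings do not.
    return getattr(result, "status_code", 200)


def instrumented(method, endpoint, latency, counter):
    """Wrap a view so that every call is timed and counted (hypothetical helper)."""
    def decorator(view):
        @functools.wraps(view)  # keeps the view's name, which Flask uses for routing
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status_code = 500  # assume failure until the view returns normally
            try:
                result = view(*args, **kwargs)
                status_code = _status_of(result)
                return result
            finally:
                duration = time.perf_counter() - start
                latency.labels(method=method, endpoint=endpoint).observe(duration)
                counter.labels(
                    method=method, endpoint=endpoint, status_code=status_code
                ).inc()
        return wrapper
    return decorator


# Intended usage with the metrics defined in app.py:
#
#   @app.route('/')
#   @instrumented('GET', '/', REQUEST_LATENCY, REQUEST_COUNT)
#   def index():
#       return "Hello, World!"
```

An unhandled exception is counted as a 500 and then re-raised, so Flask's own error handling still runs.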

Deploying Prometheus to GKE

We’ll use the Prometheus Operator for a streamlined deployment. This involves applying Kubernetes manifests that install the operator and its custom resource definitions (CRDs); the operator then manages Prometheus and Alertmanager instances, along with the scraping and alerting configuration declared through those CRDs.

First, install the Prometheus Operator. The release bundle already contains the CRDs together with the operator Deployment and its RBAC objects, so a single manifest is enough. Use server-side apply, because some of the CRDs are too large for the annotation that a client-side kubectl apply writes:

kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/bundle.yaml

Next, create a ServiceMonitor resource to tell Prometheus how to scrape your application’s metrics endpoint. A ServiceMonitor selects Services, not Pods, so this assumes your Flask application is exposed by a Service named my-flask-app-service that carries the label app: my-flask-app and has a port named http.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-flask-app-monitor
  labels:
    release: prometheus # This label should match your Prometheus instance's label
spec:
  selector:
    matchLabels:
      app: my-flask-app # Label on the Service (not the Pods) to select
  namespaceSelector:
    matchNames:
      - default # Namespace where your application is deployed
  endpoints:
  - port: http # Name of the port in your Service
    interval: 30s
    path: /metrics

Apply this manifest:

kubectl apply -f service-monitor.yaml

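Finally, configure your Prometheus instance to pick up this ServiceMonitor. This is done through the serviceMonitorSelector field of your Prometheus Custom Resource (CR): its label selector must match the labels on the ServiceMonitor (e.g., release: prometheus). If the ServiceMonitor lives in a different namespace than Prometheus, serviceMonitorNamespaceSelector must allow that namespace as well.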
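To confirm that the ServiceMonitor was picked up, you can ask Prometheus which targets it is scraping. The sketch below reads the /api/v1/targets endpoint of the Prometheus HTTP API and reports the health of every target for one job. Two things are assumptions here: the base URL (for example a local kubectl port-forward to port 9090), and that the job label equals the Service name, which is the operator default when the ServiceMonitor sets no jobLabel. The function names are illustrative.

```python
import json
import urllib.request


def target_health(payload, job):
    """Map scrape URL -> health ('up', 'down', 'unknown') for one job.

    `payload` is the decoded JSON body of GET /api/v1/targets.
    An empty result means Prometheus has no target for that job,
    which usually points at a selector or label mismatch.
    """
    targets = payload.get("data", {}).get("activeTargets", [])
    return {
        t.get("scrapeUrl"): t.get("health")
        for t in targets
        if t.get("labels", {}).get("job") == job
    }


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from Prometheus (base_url is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


# Intended usage, with a port-forward to the Prometheus pod in place:
#
#   health = target_health(fetch_targets(), "my-flask-app-service")
#   for url, state in health.items():
#       print(url, state)
```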

MongoDB Cluster Health on Google Cloud

Monitoring MongoDB clusters, especially in a distributed environment on GKE, requires attention to both instance-level metrics and cluster-wide operational status. Google Cloud’s operations suite (formerly Stackdriver) offers powerful tools, but we’ll also leverage MongoDB’s native monitoring capabilities.

Leveraging Google Cloud Operations Suite

The Ops Agent can collect system and application logs and metrics from MongoDB instances, but it targets Compute Engine VMs rather than GKE nodes. For MongoDB pods on GKE, a comparable approach is to run an OpenTelemetry Collector as a DaemonSet and export its metrics to Cloud Monitoring.

A typical configuration enables host metrics collection and an exporter for Cloud Monitoring. You can supply it by creating a ConfigMap and mounting it into the agent DaemonSet.

Example agent ConfigMap (simplified):

apiVersion: v1
kind: ConfigMap
metadata:
  name: ops-agent-config
  namespace: google-cloud-ops-agent
data:
  config-otel-collector.yaml: |
    receivers:
      hostmetrics:
        collection_interval: 10s
        scrapers:
          cpu:
          memory:
          disk:
          network:
      # Add specific MongoDB receiver if available or use logs/exec
    exporters:
      googlecloud:
        project: YOUR_GCP_PROJECT_ID
    service:
      pipelines:
        metrics:
          receivers: [hostmetrics]
          exporters: [googlecloud]

You would then reference this ConfigMap in the agent DaemonSet or Deployment. Beyond host-level CPU utilization, memory usage, disk I/O, and network traffic, the metrics worth collecting are MongoDB-specific: connection counts, query performance, replication lag, and oplog size. The hostmetrics receiver shown above covers only the host-level part, so the MongoDB-specific values need a dedicated receiver or exporter.
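Several of the MongoDB-specific values are available from the serverStatus command. The sketch below pulls connection counts and operation counters out of a serverStatus document. It assumes the documented connections and opcounters sections are present; the function name and the derived connections_used_ratio are illustrative, not a MongoDB field.

```python
OPCOUNTER_KEYS = ("insert", "query", "update", "delete", "getmore", "command")


def extract_server_metrics(server_status):
    """Pick connection and opcounter values out of a serverStatus document."""
    conns = server_status.get("connections", {})
    ops = server_status.get("opcounters", {})
    current = conns.get("current", 0)
    available = conns.get("available", 0)
    total = current + available
    return {
        "connections_current": current,
        "connections_available": available,
        # Share of the connection limit that is currently in use.
        "connections_used_ratio": current / total if total else 0.0,
        # Cumulative counters since the mongod process started.
        "opcounters": {key: ops.get(key, 0) for key in OPCOUNTER_KEYS},
    }


# Intended usage with a pymongo MongoClient named `client`:
#
#   metrics = extract_server_metrics(client.admin.command("serverStatus"))
```

The opcounters are cumulative, so a monitoring system should treat them as counters and compute rates over time.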

MongoDB Native Monitoring Tools

For deeper insights, mongostat, mongotop, and the MongoDB diagnostic commands (accessible via the mongo shell, or mongosh on MongoDB 5.0 and later) are invaluable. We can integrate these into our monitoring strategy by running them periodically and exporting their output.

Replication Lag Monitoring

Replication lag is a critical indicator of cluster health. We can monitor this using the rs.status() command.

mongo --host mongodb-0.mongodb-headless.default.svc.cluster.local --port 27017 --username replUser --password replPassword --authenticationDatabase admin --quiet --eval "rs.status()"

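This command prints a status document with one entry per replica set member, including each member’s state and the timestamp of the last operation it applied (optimeDate). Comparing a secondary’s optimeDate with the primary’s gives that secondary’s lag. A simple Python script can automate this: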

from pymongo import MongoClient
import time
import os

# Connect to the primary node (or any node in the replica set)
# Ensure you have appropriate credentials and connection string
MONGO_HOST = os.environ.get("MONGO_HOST", "mongodb-0.mongodb-headless.default.svc.cluster.local")
MONGO_PORT = int(os.environ.get("MONGO_PORT", 27017))
MONGO_USER = os.environ.get("MONGO_USER", "replUser")
MONGO_PASS = os.environ.get("MONGO_PASS", "replPassword")
AUTH_DB = os.environ.get("AUTH_DB", "admin")

client = MongoClient(
    host=MONGO_HOST,
    port=MONGO_PORT,
    username=MONGO_USER,
    password=MONGO_PASS,
    authSource=AUTH_DB
)

try:
    rs_status = client.admin.command('replSetGetStatus')
    primary_member = None
    secondaries = []

    for member in rs_status['members']:
        if member['stateStr'] == 'PRIMARY':
            primary_member = member
        elif member['stateStr'] == 'SECONDARY':
            secondaries.append(member)

    if primary_member:
        for secondary in secondaries:
            # Lag is the gap between the last operation applied on the primary
            # and the last operation applied on this secondary. pymongo returns
            # optimeDate as a datetime, so the difference is a timedelta.
            lag_seconds = (primary_member['optimeDate'] - secondary['optimeDate']).total_seconds()
            
            print(f"Secondary {secondary['name']} lag: {lag_seconds:.2f} seconds")
            
            # Here you would typically push this metric to Prometheus or another monitoring system
            # e.g., replication_lag_seconds.labels(member=secondary['name']).set(lag_seconds)

    else:
        print("No primary found in replica set status.")

except Exception as e:
    print(f"Error fetching replica set status: {e}")
finally:
    client.close()

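This script can be run as a Kubernetes CronJob. Because a CronJob pod is short-lived and Prometheus scrapes metrics endpoints rather than logs, the job should push its result to a Prometheus Pushgateway or write it directly to a time-series database. Alternatively, run the same logic in a long-lived exporter pod that Prometheus scrapes.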
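To make the lag calculation testable without a live cluster, the logic can be split into a pure function over the replSetGetStatus document plus a renderer for the Prometheus text exposition format. This is a sketch under assumptions: the function names are hypothetical, and the metric name mongodb_replication_lag_seconds with a member label is a naming choice, not something MongoDB or Prometheus prescribes.

```python
def replication_lag_seconds(rs_status):
    """Return {member name: lag in seconds} for every SECONDARY.

    Expects the document returned by replSetGetStatus, where each member
    has 'name', 'stateStr' and a datetime 'optimeDate'.
    """
    members = rs_status.get("members", [])
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        return {}
    return {
        m["name"]: max((primary["optimeDate"] - m["optimeDate"]).total_seconds(), 0.0)
        for m in members
        if m.get("stateStr") == "SECONDARY"
    }


def to_exposition(lags, metric="mongodb_replication_lag_seconds"):
    """Render the lag values as Prometheus text exposition format."""
    lines = [
        f"# HELP {metric} Seconds a secondary trails the primary.",
        f"# TYPE {metric} gauge",
    ]
    for member, lag in sorted(lags.items()):
        lines.append(f'{metric}{{member="{member}"}} {lag}')
    return "\n".join(lines) + "\n"


# Intended usage with a pymongo MongoClient named `client`:
#
#   lags = replication_lag_seconds(client.admin.command("replSetGetStatus"))
#   body = to_exposition(lags)  # send this text to a Pushgateway, for example
```

Clamping negative values to zero covers the brief moments when a secondary reports a newer optimeDate than the snapshot taken from the primary.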

Oplog Monitoring

The size of the oplog determines how far a secondary can fall behind and still catch up. The oplog is a capped collection, so it never fills up; instead the oldest entries are overwritten, and a secondary that lags beyond the oplog window has to be fully resynced. We can monitor this with rs.printReplicationInfo() or by querying the oplog.rs collection in the local database. mongostat complements this with per-node throughput and replica set state.

# Using mongostat for a one-row snapshot (requires the mongostat executable in the pod)
mongostat --host mongodb-0.mongodb-headless.default.svc.cluster.local --port 27017 -u replUser -p replPassword --authenticationDatabase admin --rowcount 1

# Using the mongo shell to read the collection statistics of local.oplog.rs
mongo --host mongodb-0.mongodb-headless.default.svc.cluster.local --port 27017 --username replUser --password replPassword --authenticationDatabase admin --quiet --eval "db.getSiblingDB('local').oplog.rs.stats()"

The output of oplog.rs.stats() provides information on the oplog’s configured maximum size, current size, and number of entries. Again, a script can parse this and expose it as metrics.
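Two numbers are usually worth exporting from the oplog: how much of its configured size is in use, and the time window it currently covers. The sketch below derives both. It assumes the collection statistics contain size and maxSize (as they do for capped collections) and that you have already read the ts values, in seconds, of the oldest and newest oplog entries. The function names are hypothetical.

```python
def oplog_used_ratio(stats):
    """Fraction of the configured oplog size currently in use."""
    max_size = stats.get("maxSize", 0)
    return stats.get("size", 0) / max_size if max_size else 0.0


def oplog_window_hours(first_ts_seconds, last_ts_seconds):
    """Hours of history between the oldest and newest oplog entries."""
    return max(last_ts_seconds - first_ts_seconds, 0) / 3600.0


# Intended usage with a pymongo MongoClient named `client`:
#
#   oplog = client.local["oplog.rs"]
#   first = oplog.find_one(sort=[("$natural", 1)])["ts"].time
#   last = oplog.find_one(sort=[("$natural", -1)])["ts"].time
#   stats = client.local.command("collStats", "oplog.rs")
#   window = oplog_window_hours(first, last)
#   used = oplog_used_ratio(stats)
```

A window that shrinks toward your longest expected secondary outage is the signal to alert on, since a secondary that stays down longer than the window cannot catch up from the oplog alone.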

Alerting Strategies

Effective alerting is paramount. We should define alerts based on thresholds that indicate potential issues before they impact users. The alerting rules themselves are evaluated by Prometheus; Alertmanager then deduplicates, groups, and routes the resulting notifications.

Alerting on Application Health

Examples of Prometheus alerting rules for the Python application:

groups:
- name: python_app_alerts
  rules:
  - alert: HighHttpRequestErrorRate
    expr: |
      sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (endpoint)
      /
      sum(rate(http_requests_total[5m])) by (endpoint)
      > 0.05 # More than 5% error rate
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High HTTP request error rate for {{ $labels.endpoint }}"
      description: "The endpoint {{ $labels.endpoint }} has an error rate exceeding 5% over the last 5 minutes."

  - alert: HighHttpRequestLatency
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)) > 2 # 95th percentile latency > 2 seconds
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High HTTP request latency for {{ $labels.endpoint }}"
      description: "The 95th percentile latency for endpoint {{ $labels.endpoint }} is above 2 seconds for the last 10 minutes."

  - alert: LowActiveUsers
    expr: app_active_users < 10 # Less than 10 active users
    for: 15m
    labels:
      severity: info
    annotations:
      summary: "Low active user count"
      description: "The number of active users has dropped below 10 for the last 15 minutes."
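The latency rule above relies on histogram_quantile, which estimates a percentile from bucket counts instead of from individual requests. The sketch below mirrors the linear interpolation described in the Prometheus documentation, in simplified form (it ignores native histograms and other edge cases). It is meant to show why the result depends on the bucket boundaries: with the default prometheus_client buckets, the 2 second threshold falls inside the 1.0 to 2.5 second bucket, so the reported p95 there is an interpolated estimate.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs and
    must include the +Inf bucket, as Prometheus histograms do.
    """
    if not buckets:
        return math.nan
    buckets = sorted(buckets)
    top_bound, total = buckets[-1]
    if total == 0 or not math.isinf(top_bound):
        return math.nan

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan

    lower, previous = buckets[index - 1] if index > 0 else (0.0, 0)
    in_bucket = count - previous
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - previous) / in_bucket


# 90 of 100 requests finished within 1s, all of them within 2.5s.
p95 = histogram_quantile(0.95, [(1.0, 90), (2.5, 100), (math.inf, 100)])
print(f"estimated p95: {p95:.2f}s")
```

If precise alerting around 2 seconds matters, passing explicit buckets to Histogram with a boundary at 2.0 makes the estimate sharper at the threshold.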

Alerting on MongoDB Cluster Health

Alerting rules for MongoDB can be defined similarly, assuming you’ve exported MongoDB metrics to Prometheus (e.g., via a MongoDB exporter or a custom script such as the replication lag check). If using Google Cloud Operations, you’d configure alerts directly within that platform.

groups:
- name: mongodb_alerts
  rules:
  - alert: MongoDBReplicationLag
    # This assumes a metric 'mongodb_replication_lag_seconds' is exposed
    expr: mongodb_replication_lag_seconds > 60 # Replication lag exceeds 60 seconds
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "MongoDB replication lag detected on {{ $labels.member }}"
      description: "Replica set member {{ $labels.member }} has a replication lag of {{ $value }} seconds."

  - alert: HighMongoDBOplogWriteRate
    # This assumes your exporter exposes a counter named 'mongodb_oplog_entries_total' or similar
    expr: rate(mongodb_oplog_entries_total[5m]) > 1000 # High oplog write rate
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High MongoDB oplog write rate"
      description: "MongoDB oplog write rate is exceeding 1000 entries/sec."

  - alert: MongoDBDiskSpaceLow
    # This assumes node_exporter's 'node_filesystem_avail_bytes' metric, with the MongoDB data volume mounted at /data/db
    expr: node_filesystem_avail_bytes{mountpoint="/data/db"} < 10 * 1024 * 1024 * 1024 # Less than 10GB free disk space
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on MongoDB node"
      description: "MongoDB node {{ $labels.instance }} has less than 10GB of free disk space on /data/db."

These rules should be applied to your Prometheus instance via a PrometheusRule custom resource in Kubernetes.

Conclusion

A comprehensive monitoring strategy for Python applications and MongoDB clusters on GKE involves a blend of application-level instrumentation, Kubernetes-native tooling, cloud provider services, and database-specific diagnostics. By proactively monitoring key metrics and setting up intelligent alerts, you can ensure the stability, performance, and availability of your critical services.