Server Monitoring Best Practices: Keeping Your Python App and MySQL Clusters Alive on OVH
Proactive Health Checks for Python Applications
Maintaining the health of Python applications, especially those deployed on cloud infrastructure like OVH, requires a multi-layered monitoring approach. Beyond basic uptime checks, we need to delve into application-specific metrics and error detection. A common pattern is to expose a dedicated health check endpoint within the application itself.
For a Flask application, this might look like:
from flask import Flask, jsonify
import logging
app = Flask(__name__)
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
def check_database_connection():
    # Placeholder for an actual DB connection check.
    # In a real scenario, this would run a quick query or open a connection.
    try:
        # Example using SQLAlchemy (replace with your actual ORM/DB library):
        # from sqlalchemy import text
        # from your_app.database import db_session
        # db_session.execute(text("SELECT 1"))  # SQLAlchemy 2.x requires text()
        return True, "Database connection is healthy."
    except Exception as e:
        logging.error(f"Database connection check failed: {e}")
        return False, f"Database connection error: {e}"
@app.route('/healthz')
def health_check():
    db_healthy, db_message = check_database_connection()
    # Add other checks as needed: cache, external services, etc.
    # For example, checking a Redis connection:
    # redis_healthy, redis_message = check_redis_connection()
    if db_healthy:  # and redis_healthy:
        return jsonify({
            "status": "ok",
            "message": "Application is healthy.",
            "database": {"status": "ok", "message": db_message},
            # "redis": {"status": "ok", "message": redis_message},
        }), 200
    else:
        return jsonify({
            "status": "error",
            "message": "Application is unhealthy.",
            "database": {"status": "error", "message": db_message},
            # "redis": {"status": "error", "message": redis_message},
        }), 503  # Service Unavailable
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
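The `check_database_connection()` placeholder always reports success. One way to fill it in is to pass in a connection factory and run `SELECT 1`. This sketch assumes your database driver follows Python's DB-API 2.0, as PyMySQL and mysqlclient do. The helper name `run_db_check` is hypothetical.

```python
import logging


def run_db_check(connect):
    """Run a trivial query through a DB-API 2.0 connection factory.

    `connect` is any zero-argument callable that returns a new connection,
    e.g. a lambda wrapping your driver's connect() with a short timeout.
    Returns (healthy, message) in the shape the /healthz handler expects.
    """
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute("SELECT 1")
            cur.fetchone()
        finally:
            # Always release the connection, even if the query failed
            conn.close()
        return True, "Database connection is healthy."
    except Exception as e:  # drivers raise their own DB-API Error subclasses
        logging.error("Database connection check failed: %s", e)
        return False, f"Database connection error: {e}"
```

With PyMySQL, for instance, the placeholder could become `return run_db_check(lambda: pymysql.connect(host=..., user=..., password=..., connect_timeout=2))`. A short connect timeout keeps a hung database from also hanging the health endpoint.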
The `/healthz` endpoint should be polled by an external monitoring system (e.g., Prometheus via the Blackbox Exporter, Nagios, or even a simple cron job with `curl`). The status code (200 when healthy, 503 when a dependency check fails) drives alerting, and the JSON payload shows which dependency failed.
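For the cron option, a small standard-library probe can stand in for `curl`. To match the Flask example, this sketch assumes that a healthy service answers HTTP 200 with a JSON body whose `status` is `ok`. `HEALTH_URL` is a placeholder.

```python
import json
import urllib.error
import urllib.request

# Hypothetical default; point this at your own application's health endpoint.
HEALTH_URL = "http://127.0.0.1:5000/healthz"


def probe(url: str, timeout: float = 5.0) -> tuple[bool, dict]:
    """Return (healthy, payload) for a /healthz-style endpoint.

    Assumes "healthy" means HTTP 200 plus a JSON object whose "status" is "ok".
    Connection errors, timeouts, other status codes and non-JSON bodies all
    count as unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, body = resp.status, resp.read()
    except urllib.error.HTTPError as e:
        # Non-2xx responses such as 503 raise HTTPError but still carry a body
        status, body = e.code, e.read()
    except OSError as e:
        # URLError, refused connections and timeouts: the app is unreachable
        return False, {"status": "error", "message": f"unreachable: {e}"}
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None
    if not isinstance(payload, dict):
        return False, {"status": "error", "message": f"unexpected body (HTTP {status})"}
    return status == 200 and payload.get("status") == "ok", payload


def main(url: str = HEALTH_URL) -> int:
    """Print the payload and return a shell-style exit code (0 = healthy)."""
    healthy, payload = probe(url)
    print(json.dumps(payload))
    return 0 if healthy else 1
```

A cron job could run a script that ends with `raise SystemExit(main())`. A non-zero exit status then signals an unhealthy or unreachable app to whatever wraps the job.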
Leveraging Prometheus for Application and Infrastructure Metrics
Prometheus is a de facto standard for metrics-based monitoring. For Python applications, the prometheus_client library is invaluable. We can instrument our application to expose custom metrics.
First, install the library:
pip install prometheus_client
Then, integrate it into your Flask app:
from flask import Flask, Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time
import random
app = Flask(__name__)
# Custom metrics. Latency is a Histogram so Prometheus can estimate percentiles
# with histogram_quantile() from the exported _bucket series.
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP Requests', ['method', 'endpoint'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP Request Latency', ['endpoint'])
@app.route('/metrics')
def metrics():
    # Expose every metric in the default registry in Prometheus text format
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)
@app.route('/')
@REQUEST_LATENCY.labels(endpoint='/').time()
def index():
    REQUEST_COUNT.labels(method='GET', endpoint='/').inc()
    time.sleep(random.uniform(0.1, 0.5))  # Simulate work
    return "Hello, World!"
@app.route('/api/data')
@REQUEST_LATENCY.labels(endpoint='/api/data').time()
def get_data():
    REQUEST_COUNT.labels(method='GET', endpoint='/api/data').inc()
    time.sleep(random.uniform(0.2, 1.0))  # Simulate work
    return {"data": "some_value"}  # Flask serializes dicts to JSON
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
This exposes a `/metrics` endpoint that Prometheus can scrape, tracking request counts and per-endpoint latency. Besides `Counter`, the other metric types are:
- `Gauge` for values that can go up or down (e.g., the current number of active connections).
- `Histogram` for distributions of values such as response times. Its buckets let Prometheus estimate percentiles with `histogram_quantile()`.
- `Summary` for tracking the count and sum of observations. In some client libraries it also computes configurable quantiles over a sliding time window, but the Python client does not expose quantiles.
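A histogram counts observations into fixed buckets and exports them cumulatively, so each `le` bucket includes everything below it. It also exports a `_sum` and a `_count` series. The toy class below is a stdlib-only illustration of that data model, not prometheus_client's implementation. The bucket bounds are an assumption.

```python
from bisect import bisect_left

# Assumed bucket upper bounds (seconds). prometheus_client's Histogram has its
# own defaults; you can pass buckets=... to choose boundaries that suit you.
BOUNDS = (0.1, 0.25, 0.5, 1.0, 2.5, 5.0, float("inf"))


class ToyHistogram:
    """Stdlib-only model of the data a Prometheus histogram exports."""

    def __init__(self, bounds=BOUNDS):
        self.bounds = bounds
        self.counts = [0] * len(bounds)  # per-bucket (non-cumulative) counts
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Bucket i receives values with bounds[i-1] < value <= bounds[i]
        self.counts[bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def exposition(self, name: str) -> list[str]:
        # Exported buckets are cumulative: le="x" counts every observation <= x
        lines, running = [], 0
        for bound, n in zip(self.bounds, self.counts):
            running += n
            le = "+Inf" if bound == float("inf") else repr(bound)
            lines.append(f'{name}_bucket{{le="{le}"}} {running}')
        lines.append(f"{name}_sum {self.sum}")
        lines.append(f"{name}_count {self.count}")
        return lines
```

Every bucket is a monotonically increasing counter. Prometheus can therefore apply `rate()` to each bucket and estimate percentiles from the result, which the Python client's Summary cannot offer.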
Monitoring MySQL Clusters with Percona Monitoring and Management (PMM)
For MySQL clusters, especially in a production environment, a robust solution like Percona Monitoring and Management (PMM) is highly recommended. PMM provides deep insights into MySQL performance, availability, and query analysis.
Deployment typically involves:
- Deploying the PMM Server (a Docker container or VM).
- Installing the PMM Client on each MySQL node.
The PMM Client uses the mysqld_exporter to collect metrics from MySQL instances. Configuration involves ensuring the exporter can connect to your MySQL instances with appropriate credentials.
On your MySQL server (e.g., running on OVH’s dedicated servers or VPS), you’ll need to create a dedicated monitoring user:
CREATE USER 'pmm_monitor'@'localhost' IDENTIFIED BY 'your_strong_password';
GRANT PROCESS, REPLICATION CLIENT, SHOW DATABASES, SELECT ON *.* TO 'pmm_monitor'@'localhost';
FLUSH PRIVILEGES;
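Next, run the PMM Client and connect it to the PMM Server. In Docker, the server address and credentials are passed as environment variables. Each MySQL instance is then registered with `pmm-admin add mysql` using the monitoring user created above, and PMM configures mysqld_exporter for it.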
# Example using Docker Compose for the PMM 2 Client
version: '3'
services:
  pmm-client:
    image: percona/pmm-client:2
    container_name: pmm-client
    environment:
      - PMM_AGENT_SERVER_ADDRESS=your_pmm_server_ip_or_hostname:443
      - PMM_AGENT_SERVER_USERNAME=admin
      - PMM_AGENT_SERVER_PASSWORD=your_pmm_server_password
      - PMM_AGENT_SETUP=1  # Register this node with the PMM Server on start
    volumes:
      # Persist the agent configuration only; the exporters ship inside the
      # image, and bind-mounting host directories over them would hide them
      - /usr/local/percona/pmm2/config:/usr/local/percona/pmm2/config
      - /var/lib/mysql:/var/lib/mysql  # For socket file access if needed
    network_mode: host  # Or configure a specific network
    restart: always
# Add your MySQL service here if running in the same compose file.
# For existing MySQL instances, run the pmm-client container separately, then
# register each instance, e.g. `docker exec pmm-client pmm-admin add mysql ...`,
# with the pmm_monitor credentials and your MySQL host/port.
Once the PMM client is connected and the MySQL service has been added, PMM starts collecting metrics automatically. You can then use the PMM web UI to visualize these metrics, set up alerts, and analyze query performance.
OVH Specific Considerations: Network and Firewall
When deploying monitoring agents or exposing metrics endpoints on OVH infrastructure, firewall rules are critical. Ensure that:
- Your Prometheus server can reach the `/metrics` endpoints of your Python applications (typically port 5000 or 8000).
- Your PMM clients can reach the PMM server, usually on port 443. The client's pmm-agent opens this connection to the server, not the other way round. If you use PMM's pull metrics mode, the server must also be able to reach the exporter ports on each client.
- Your PMM clients can reach your MySQL cluster nodes (e.g., port 3306).
- Your application health check endpoints (e.g., `/healthz`) are accessible by your monitoring probes.
OVH’s control panel provides tools to manage security groups and firewall rules. For instance, to allow Prometheus (running on a specific IP) to scrape your application on port 5000:
# Example using the OVH API or control panel to add a rule:
# Allow TCP traffic on port 5000 from the Prometheus server IP (e.g., 1.2.3.4)
# to your application server IP.
Similarly, for PMM, ensure the necessary ports are open between the PMM server, PMM clients, and your MySQL instances.
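To verify the firewall rules, test them from the machine that needs access. A minimal TCP reachability check using only the standard library might look like this. The hostnames are placeholders. A successful connect only proves that the port is open, not that the service behind it is healthy.

```python
import socket

# Hypothetical targets: replace with the host/port pairs from your firewall plan.
TARGETS = [
    ("app.example.com", 5000),           # Prometheus -> app /metrics
    ("pmm.example.com", 443),            # PMM client -> PMM server
    ("mysql-node-1.example.com", 3306),  # PMM client -> MySQL
]


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts and DNS failures all land here
        return False


def check_all(targets=TARGETS, timeout: float = 3.0) -> dict[str, bool]:
    """Map "host:port" to reachability for each target."""
    return {f"{host}:{port}": port_open(host, port, timeout) for host, port in targets}
```

Run it on the Prometheus host for the application ports, and on each PMM client for port 443 and the MySQL port. A refused connection usually means the host answered but nothing is listening. A timeout often points to a firewall rule silently dropping the traffic.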
Alerting Strategies: From Prometheus Alertmanager to PMM Alerts
Effective monitoring is incomplete without a robust alerting strategy. Prometheus integrates with Alertmanager for sophisticated alert routing and deduplication.
Define alerting rules in a rule file (e.g., `rules.yml`) and reference it under `rule_files` in `prometheus.yml`:
groups:
  - name: python_app_alerts
    rules:
      - alert: HighRequestLatency
        # Requires http_request_duration_seconds to be a Histogram (_bucket series)
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{endpoint="/api/data"}[5m])) by (le, endpoint)) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected for /api/data endpoint"
          description: "95th percentile latency for /api/data is {{ $value }}s, exceeding threshold."
      - alert: AppUnhealthy
        # Assumes a Blackbox Exporter job named my_python_app_healthcheck probing /healthz
        expr: probe_success{job="my_python_app_healthcheck"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Python application health check failed"
          description: "The /healthz endpoint for my_python_app is returning an error."
  - name: mysql_alerts
    rules:
      - alert: MySQLHighCPU
        # Busy-CPU fraction derived from node_exporter's idle-time counter;
        # adjust the instance selector to the labels in your PMM/Prometheus setup
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle", instance=~"mysql-node-.*"}[5m])) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on MySQL instance"
          description: "MySQL instance {{ $labels.instance }} is experiencing high CPU usage ({{ $value }})."
      - alert: MySQLReplicationLag
        # Lag is reported by replicas, not the source. mysqld_exporter exports
        # Seconds_Behind_Master as mysql_slave_status_seconds_behind_master;
        # check your /metrics output, as names vary by MySQL/exporter version
        expr: mysql_slave_status_seconds_behind_master{instance=~"mysql-replica-.*"} > 60
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MySQL replication lag detected"
          description: "Replica {{ $labels.instance }} is {{ $value }} seconds behind its source."
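The HighRequestLatency rule relies on `histogram_quantile()`. This function estimates a percentile from cumulative bucket counts by assuming that observations are spread evenly within each bucket. Below is a simplified Python re-implementation of that linear interpolation. It covers classic histograms only and omits some edge cases; the bucket counts in the usage note are hypothetical.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative (upper_bound, count) pairs.

    Follows the linear interpolation Prometheus uses for classic histograms.
    The last bucket must be +Inf, as in real histogram exposition.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first finite bucket whose cumulative count reaches the rank
    for i, (upper, cum) in enumerate(buckets[:-1]):
        if cum >= rank:
            break
    else:
        # Rank lies in the +Inf bucket: fall back to the highest finite bound
        return buckets[-2][0]
    if i == 0 and upper <= 0:
        return upper
    # Interpolate linearly between the previous bound and this one
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    return lower + (upper - lower) * (rank - below) / (cum - below)
```

Take the buckets `[(0.5, 10), (1.0, 50), (2.5, 90), (5.0, 100), (inf, 100)]`. The p95 estimate is 3.75 s, halfway through the 2.5 to 5 s bucket, even if every request in that bucket took just over 2.5 s. The estimate can only be as precise as the bucket layout, so with a 2.0 s threshold it is worth giving the Histogram a bucket boundary at or near 2.0.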
Configure Alertmanager to route these alerts to Slack, PagerDuty, or email. PMM also has its own alerting capabilities, often integrated with Alertmanager or providing direct notification channels.
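Alertmanager deduplicates identical alerts, routes each alert to a receiver by matching its labels, and batches alerts that share the `group_by` labels into a single notification. The toy function below models that flow in plain Python to show why consistent labels like `severity` matter. It isn't Alertmanager code, and the receiver names are hypothetical. In a real deployment you express the same logic declaratively in `alertmanager.yml`: a `route` tree with `group_by`, child `routes` matching on `severity`, and a list of `receivers`.

```python
from collections import defaultdict

# Hypothetical routing table: severity label -> receiver name
ROUTES = {"critical": "pagerduty", "warning": "slack"}
DEFAULT_RECEIVER = "email"


def route_and_group(alerts, group_by=("alertname",)):
    """Toy model of Alertmanager's dedup -> route -> group pipeline.

    alerts: iterable of label dicts, one per firing alert.
    Returns {(receiver, *group_by values): [label dicts]}.
    """
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        # Identical label sets are the same alert: notify only once
        fingerprint = frozenset(labels.items())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Route on the severity label set by the Prometheus rules
        receiver = ROUTES.get(labels.get("severity"), DEFAULT_RECEIVER)
        # Alerts sharing the group_by labels end up in one notification
        key = (receiver,) + tuple(labels.get(k, "") for k in group_by)
        groups[key].append(labels)
    return dict(groups)
```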
For MySQL, PMM's Query Analytics is crucial for identifying slow queries that may be hurting performance. Alerts on specific slow-query patterns or a sustained spike in `Threads_running` can give early warning before a slow query cascades into a wider outage.
Log Aggregation and Analysis
Metrics tell you *what* is happening, but logs tell you *why*. Centralized log aggregation is essential for debugging issues across distributed systems.
Tools like Elasticsearch, Logstash, and Kibana (ELK stack), or Grafana Loki, are commonly used. On OVH, you might deploy these yourself or use managed services.
For Python applications, ensure your logging is configured to output in a structured format (e.g., JSON). This makes parsing and searching much easier.
import logging
import json
# Configure JSON logging
class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            "timestamp": self.formatTime(record, self.datefmt),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "pathname": record.pathname,
            "lineno": record.lineno,
        }
        # Add exception info if present
        if record.exc_info:
            log_entry['exc_info'] = self.formatException(record.exc_info)
        return json.dumps(log_entry)
logger = logging.getLogger('my_app')
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.info("Application started successfully.")
try:
    result = 1 / 0
except ZeroDivisionError:
    logger.exception("An error occurred during calculation.")
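Request-scoped context, such as request IDs, is what makes these logs searchable across services. The formatter above drops any fields passed through logging's `extra=` argument. A variant that keeps such fields might look like this; the class name is hypothetical and this is a sketch rather than a drop-in library.

```python
import json
import logging

# Attribute names every LogRecord carries; anything else arrived via `extra=`.
_STANDARD_ATTRS = set(vars(logging.makeLogRecord({}))) | {"message", "asctime"}


class JsonFormatterWithExtras(logging.Formatter):
    """JSON formatter that also emits custom fields passed through `extra=`."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, self.datefmt),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Copy non-standard attributes, e.g. request_id from extra={...}
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                entry[key] = value
        if record.exc_info:
            entry["exc_info"] = self.formatException(record.exc_info)
        # default=str keeps non-JSON-serializable extras from breaking logging
        return json.dumps(entry, default=str)
```

Usage: `logger.info("Order created", extra={"request_id": "abc123"})`. The `logging` module raises a `KeyError` if an `extra` key collides with a built-in record attribute such as `message` or `lineno`.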
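Configure Logstash or Fluentd agents on your OVH servers to collect these logs and forward them to your central logging cluster. For MySQL, make sure the error log and the slow query log are collected. The general query log records every statement and is usually too expensive to leave enabled in production.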
By combining application-level health checks, Prometheus metrics, PMM for database insights, robust alerting, and centralized logging, you build a resilient monitoring strategy capable of keeping your Python applications and MySQL clusters healthy and performant on OVH infrastructure.