Server Monitoring Best Practices: Keeping Your Python App and MySQL Clusters Alive on DigitalOcean

Taken together, DigitalOcean’s infrastructure insights, `pt-heartbeat` for MySQL replication, Prometheus with Node Exporter and custom application metrics, and Alertmanager for actionable notifications form a monitoring stack that covers the droplet, the database, and the application, which is what you need to keep your Python applications and MySQL clusters alive and performing well on DigitalOcean.

DigitalOcean’s built-in monitoring provides dashboards for CPU, memory, disk, and network I/O. While useful for a quick glance, it lacks the granularity and customizability of Prometheus. However, it’s a good first line of defense and can alert you to critical issues before Prometheus might be configured to do so.

To leverage DigitalOcean’s monitoring effectively:

Regularly review the “Graphs” section for each Droplet in your DigitalOcean control panel.
Set up alerts within DigitalOcean for key metrics like CPU utilization exceeding 90% for a sustained period, or disk I/O hitting critical levels. This provides a safety net.
Correlate DigitalOcean alerts with Prometheus data. If DigitalOcean flags high CPU, use Prometheus to drill down into which process (your Python app, MySQL, or system processes) is consuming the CPU.
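
To make the correlation step concrete, the sketch below asks the Prometheus HTTP API for non-idle CPU time per instance and mode. It is only an illustration: the server address is a placeholder, the helper names are made up for this example, and it assumes Node Exporter metrics are already being scraped. Node Exporter reports CPU time per mode (user, system, iowait) rather than per process, so this narrows down the kind of load; attributing it to a specific process would need an additional exporter.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical address of the Prometheus server; adjust to your setup.
PROMETHEUS_URL = "http://prometheus_droplet_ip:9090"

# CPU seconds per second spent in each non-idle mode, per instance.
CPU_BY_MODE = 'sum by (instance, mode) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))'


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_vector(payload: dict) -> list:
    """Extract (labels, value) pairs from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def query(base_url: str, promql: str) -> list:
    """Run an instant query against Prometheus and return parsed samples."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return parse_vector(json.load(resp))
```

Calling `query(PROMETHEUS_URL, CPU_BY_MODE)` and sorting the result by value gives a quick view of which instance and mode dominate while a DigitalOcean CPU alert is firing.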

Alerting Strategy with Alertmanager

Collecting metrics is only half the battle; acting on them is crucial. Prometheus integrates with Alertmanager to handle alerts generated by Prometheus rules. Alertmanager deduplicates, groups, and routes alerts to the correct receiver (e.g., email, Slack, PagerDuty).

First, set up Alertmanager. You can run it as a Docker container or a standalone binary. For this example, we’ll assume a basic configuration.

# alertmanager.yml
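global:
  resolve_timeout: 5m
  # email_configs below needs SMTP settings; these values are placeholders
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'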

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver

receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'oncall@example.com' # Placeholder; use your team's address
    send_resolved: true

# Example for Slack (uncomment and configure)
# - name: 'slack-receiver'
#   slack_configs:
#   - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
#     channel: '#alerts'
#     send_resolved: true

# Example for PagerDuty (uncomment and configure)
# - name: 'pagerduty-receiver'
#   pagerduty_configs:
#   - service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
#     send_resolved: true

# Define specific routes for different alert severities or types.
# This block belongs under `route:` above, and the named receivers must be defined.
#   routes:
#   - receiver: 'critical-alerts'
#     matchers:
#     - severity="critical"
#     continue: true # Keep evaluating the sibling routes below after this one matches
#   - receiver: 'warning-alerts'
#     matchers:
#     - severity="warning"
#     continue: true

Run Alertmanager:

# Example using Docker
docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v /path/to/your/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager --config.file=/etc/alertmanager/alertmanager.yml
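
Once the container is up, it can help to push a synthetic alert through it and confirm that the receiver actually delivers. The sketch below builds a request for Alertmanager's v2 API (`POST /api/v2/alerts`). The address, alert name, and label values are placeholders chosen for illustration, not part of the setup above.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Hypothetical Alertmanager address; adjust to your droplet.
ALERTMANAGER_URL = "http://alertmanager_droplet_ip:9093"


def build_test_alert(now: datetime, minutes: int = 5) -> list:
    """Return a one-element alert list that expires after `minutes`."""
    return [{
        "labels": {"alertname": "TestAlert", "severity": "warning", "instance": "manual-test"},
        "annotations": {"summary": "Synthetic alert to verify Alertmanager routing"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def build_request(base_url: str, alerts: list) -> urllib.request.Request:
    """Build the POST request that submits alerts to Alertmanager."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_alert(base_url: str = ALERTMANAGER_URL) -> int:
    """Send the synthetic alert and return the HTTP status code."""
    request = build_request(base_url, build_test_alert(datetime.now(timezone.utc)))
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.status
```

If `send_test_alert()` returns 200 and a notification arrives after the configured `group_wait`, the receiver is wired up correctly.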

Now, configure Prometheus to send alerts to Alertmanager. In your prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager_droplet_ip:9093'] # IP of your Alertmanager server

Define alerting rules in Prometheus. Create a separate file (e.g., alerts.yml) and reference it under `rule_files` in your prometheus.yml:

# alerts.yml
groups:
- name: mysql_alerts
  rules:
  - alert: MySQLHighReplicationLag
    expr: avg(mysql_slave_status_seconds_behind_master{job="mysql_exporter"}) by (instance) > 60 # Assuming you have mysql_exporter running
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High MySQL replication lag on {{ $labels.instance }}"
      description: "MySQL replica {{ $labels.instance }} is lagging by more than 60 seconds for 5 minutes."

- name: python_app_alerts
  rules:
  - alert: PythonAppHighLatency
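    expr: sum by (instance, endpoint) (rate(http_request_latency_seconds_sum{job="my_python_app"}[5m])) / sum by (instance, endpoint) (rate(http_request_latency_seconds_count{job="my_python_app"}[5m])) > 2.0 # Average latency over 2 seconds; assumes a histogram or summary that exposes _sum and _count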
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High latency for Python app {{ $labels.endpoint }}"
      description: "Average request latency for {{ $labels.endpoint }} on {{ $labels.instance }} has been above 2s for 10 minutes."

  - alert: PythonAppHighErrorRate
    expr: sum by (instance) (rate(http_requests_total{job="my_python_app", code=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total{job="my_python_app"}[5m])) > 0.05 # 5% error rate per instance
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate for Python app"
      description: "The error rate for {{ $labels.instance }} is above 5% for 5 minutes."

- name: system_alerts
  rules:
  - alert: HighCPUUsage
    expr: node_load1 > 0.8 * count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) # 1-minute load above 80% of the CPU core count
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU load on {{ $labels.instance }} has been high for 15 minutes."

  - alert: LowDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10 # Less than 10% disk space remaining
    for: 30m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"
      description: "Filesystem on {{ $labels.instance }} has less than 10% free space."
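
An average-latency rule boils down to dividing the request time accumulated over a window by the number of requests served in that window. As a worked illustration of that arithmetic (not how Prometheus implements it internally), the sketch below takes two readings of a hypothetical pair of counters and computes the average latency between them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    """One scrape of two monotonically increasing counters."""
    latency_seconds_sum: float  # total seconds spent serving requests
    request_count: float        # total requests served


def average_latency(earlier: Reading, later: Reading) -> Optional[float]:
    """Average seconds per request between two readings, or None if no traffic."""
    requests = later.request_count - earlier.request_count
    if requests <= 0:
        return None  # no requests (or a counter reset) in the window
    return (later.latency_seconds_sum - earlier.latency_seconds_sum) / requests


def should_alert(earlier: Reading, later: Reading, threshold_seconds: float = 2.0) -> bool:
    """True when the window's average latency exceeds the threshold."""
    latency = average_latency(earlier, later)
    return latency is not None and latency > threshold_seconds
```

For example, going from 100 s over 50 requests to 400 s over 150 requests means 300 s were spent on 100 requests, an average of 3 s, which is above the 2 s threshold.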

Reload Prometheus configuration for the new rules and Alertmanager target to take effect.
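
Prometheus re-reads its configuration when it receives a SIGHUP, or on an HTTP POST to `/-/reload` if it was started with `--web.enable-lifecycle`. A minimal sketch of the HTTP variant follows; the server address is a placeholder and the function names are illustrative.

```python
import urllib.error
import urllib.request

# Hypothetical address of the Prometheus server; adjust to your setup.
PROMETHEUS_URL = "http://prometheus_droplet_ip:9090"


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build the POST request for the lifecycle reload endpoint."""
    return urllib.request.Request(f"{base_url.rstrip('/')}/-/reload", data=b"", method="POST")


def reload_prometheus(base_url: str = PROMETHEUS_URL) -> bool:
    """Return True if Prometheus accepted the reload request."""
    try:
        with urllib.request.urlopen(build_reload_request(base_url), timeout=10) as resp:
            return resp.status == 200
    except urllib.error.URLError as exc:
        print(f"reload failed: {exc}")
        return False
```

Running `promtool check rules alerts.yml` before reloading catches syntax errors in the rules file early.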

System-Level Monitoring with Node Exporter and DigitalOcean Monitoring

Beyond application-specific metrics, comprehensive server monitoring requires understanding the underlying system resources. DigitalOcean’s built-in monitoring provides a good overview, but for deeper insights and integration with Prometheus, we’ll deploy the Node Exporter.

Node Exporter is a Prometheus exporter for hardware and OS metrics exposed by *NIX systems, including CPU usage, memory, network statistics, disk I/O, and more. It’s essential for understanding resource contention that might affect your Python app or MySQL clusters.

Download and install Node Exporter on each of your DigitalOcean droplets (app servers, database servers).

# On each droplet, download the latest release (check Prometheus website for latest URL)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Move the binary to /usr/local/bin
sudo mv node_exporter /usr/local/bin/

# Create a systemd service file for Node Exporter
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify status
sudo systemctl status node_exporter

Node Exporter listens on port 9100 by default. Allow inbound traffic on this port in your DigitalOcean firewall, ideally only from your Prometheus server, since the metrics endpoint is unauthenticated by default.

Add Node Exporter targets to your Prometheus configuration:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['app_droplet_1_ip:9100', 'app_droplet_2_ip:9100', 'mysql_droplet_1_ip:9100', 'mysql_droplet_2_ip:9100']
    # Add more targets for all your droplets
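
Before relying on the scrape, it can be useful to confirm that each exporter answers on `/metrics`. The sketch below fetches the text exposition output and pulls out a single unlabelled metric such as `node_load1`. The parsing is deliberately simplified and the helper names are illustrative; it is a quick reachability check, not a full parser for the exposition format.

```python
import urllib.request
from typing import Optional


def extract_metric(exposition_text: str, name: str) -> Optional[float]:
    """Return the value of an unlabelled metric from Prometheus text format."""
    for line in exposition_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        parts = line.split()
        if parts[0] == name and len(parts) >= 2:
            return float(parts[1])
    return None


def fetch_metrics(target: str, timeout: float = 5.0) -> str:
    """Fetch raw metrics text from a host:port target."""
    with urllib.request.urlopen(f"http://{target}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def check_targets(targets: list) -> dict:
    """Map each target to its 1-minute load, or None if it is unreachable."""
    results = {}
    for target in targets:
        try:
            results[target] = extract_metric(fetch_metrics(target), "node_load1")
        except OSError:
            results[target] = None
    return results
```

Passing the same `host:port` strings used in `static_configs` to `check_targets` shows at a glance which droplets are blocked by the firewall or are not running the exporter.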

# Assuming your Flask app is in a file named app.py and the Flask instance is named 'app'
gunicorn -c gunicorn_config.py app:app --workers 4 --bind 0.0.0.0:8000

This command starts `gunicorn` with 4 worker processes, binding to port 8000. The `gunicorn_config.py` will attempt to start a Prometheus metrics server for each worker. Note that managing ports for multiple workers can be complex. A more robust approach is to use a reverse proxy (like Nginx) to route all `/metrics` requests to a single endpoint or to a specific worker.

Configure Prometheus to scrape these metrics. In your prometheus.yml:

scrape_configs:
  - job_name: 'my_python_app'
    static_configs:
      - targets: ['your_app_droplet_ip:9100', 'your_app_droplet_ip:9101', 'your_app_droplet_ip:9102', 'your_app_droplet_ip:9103'] # Adjust ports based on worker PIDs or your strategy; Node Exporter also defaults to 9100, so use a different range on droplets that run both
    # If using a reverse proxy:
    # - targets: ['your_app_droplet_ip:8080'] # Assuming Nginx on 8080 proxies /metrics

Remember to adjust the `targets` based on how you’ve configured your Prometheus metrics server. If you’re using a reverse proxy, you’ll point Prometheus to the proxy’s address and port.
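
If each worker exposes metrics on its own port, the target list follows mechanically from a base port and the worker count. The sketch below assumes, hypothetically, that worker i listens on `base_port + i`; your `gunicorn_config.py` may assign ports differently, in which case the mapping needs to change to match.

```python
def build_targets(host: str, base_port: int, workers: int) -> list:
    """Return 'host:port' scrape targets, one per worker, on consecutive ports."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return [f"{host}:{base_port + i}" for i in range(workers)]


def as_yaml_list(targets: list) -> str:
    """Format targets as the inline list style used in prometheus.yml."""
    return "[" + ", ".join(f"'{t}'" for t in targets) + "]"
```

With a host of `your_app_droplet_ip`, a base port of 9100, and 4 workers, `as_yaml_list(build_targets(...))` reproduces the `targets` line shown above, which keeps the scrape config in step with the `--workers` value.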

System-Level Monitoring with Node Exporter and DigitalOcean Monitoring

Download and install Node Exporter on each of your DigitalOcean droplets (app servers, database servers).

# On each droplet, download the latest release (check Prometheus website for latest URL)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Move the binary to /usr/local/bin
sudo mv node_exporter /usr/local/bin/

# Create a systemd service file for Node Exporter
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify status
sudo systemctl status node_exporter

Node Exporter typically runs on port 9100. Ensure this port is open in your DigitalOcean firewall and accessible by your Prometheus server.

Add Node Exporter targets to your Prometheus configuration:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['app_droplet_1_ip:9100', 'app_droplet_2_ip:9100', 'mysql_droplet_1_ip:9100', 'mysql_droplet_2_ip:9100']
    # Add more targets for all your droplets

To leverage DigitalOcean’s monitoring effectively:

Regularly review the “Graphs” section for each Droplet in your DigitalOcean control panel.
Set up alerts within DigitalOcean for key metrics like CPU utilization exceeding 90% for a sustained period, or disk I/O hitting critical levels. This provides a safety net.
Correlate DigitalOcean alerts with Prometheus data. If DigitalOcean flags high CPU, use Prometheus to drill down into which process (your Python app, MySQL, or system processes) is consuming the CPU.

Alerting Strategy with Alertmanager

First, set up Alertmanager. You can run it as a Docker container or a standalone binary. For this example, we’ll assume a basic configuration.

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver

receivers:
- name: 'default-receiver'
  email_configs:
  - to: '[email protected]'
    send_resolved: true

# Example for Slack (uncomment and configure)
# - name: 'slack-receiver'
#   slack_configs:
#   - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
#     channel: '#alerts'
#     send_resolved: true

# Example for PagerDuty (uncomment and configure)
# - name: 'pagerduty-receiver'
#   pagerduty_configs:
#   - service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
#     send_resolved: true

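# Define specific routes for different alert severities or types.
# To enable them, nest `routes:` under the top-level `route:` key above and
# define the 'critical-alerts' and 'warning-alerts' receivers first.
# routes:
# - receiver: 'critical-alerts'
#   matchers:
#   - severity="critical"
#   continue: true # Keep evaluating the sibling routes below after this match
# - receiver: 'warning-alerts'
#   matchers:
#   - severity="warning"
#   continue: true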

Run Alertmanager:

# Example using Docker
docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v /path/to/your/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager --config.file=/etc/alertmanager/alertmanager.yml
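
Once Alertmanager is running, one way to confirm that notifications flow end to end is to post a synthetic alert to its v2 HTTP API (`/api/v2/alerts`). The following is a minimal sketch using only the Python standard library; the helper names, the `TestAlert` label values, and the `localhost:9093` address are assumptions for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname="TestAlert", severity="warning", minutes=5):
    """Build the JSON body for one synthetic alert that expires after `minutes`."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": severity},
        "annotations": {"summary": "Synthetic alert to test notification routing"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_test_alert(base_url="http://localhost:9093"):
    """POST the synthetic alert to Alertmanager and return the HTTP status code."""
    body = json.dumps(build_test_alert()).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling `send_test_alert()` on the Alertmanager host should make the alert appear in the web UI on port 9093 and, after `group_wait`, reach the configured receiver. Setting the `severity` label to match a route is a simple way to exercise that route.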

Now, configure Prometheus to send alerts to Alertmanager. In your prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager_droplet_ip:9093'] # IP of your Alertmanager server

Define alerting rules in Prometheus. Create a separate file (e.g., alerts.yml) and list its path under the `rule_files` key in your prometheus.yml:

# alerts.yml
groups:
- name: mysql_alerts
  rules:
  - alert: MySQLHighReplicationLag
    expr: avg(mysql_slave_status_seconds_behind_master{job="mysql_exporter"}) by (instance) > 60 # Assuming you have mysql_exporter running
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High MySQL replication lag on {{ $labels.instance }}"
      description: "MySQL replica {{ $labels.instance }} is lagging by more than 60 seconds for 5 minutes."

- name: python_app_alerts
  rules:
  - alert: PythonAppHighLatency
    expr: rate(http_request_latency_seconds_sum{job="my_python_app"}[5m]) / rate(http_request_latency_seconds_count{job="my_python_app"}[5m]) > 2.0 # Average latency over 2 seconds; requires a Histogram or Summary metric, which exports _sum and _count
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High latency for Python app {{ $labels.endpoint }}"
      description: "Average request latency for {{ $labels.endpoint }} on {{ $labels.instance }} has been above 2s for 10 minutes."

  - alert: PythonAppHighErrorRate
    expr: sum by (instance) (rate(http_requests_total{job="my_python_app", code=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total{job="my_python_app"}[5m])) > 0.05 # 5% error rate per instance; requires a 'code' label on the counter
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate for Python app"
      description: "The error rate for {{ $labels.instance }} is above 5% for 5 minutes."

- name: system_alerts
  rules:
  - alert: HighCPUUsage
    expr: node_load1 > 0.8 * count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) # 1-minute load average above 80% of the CPU core count
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU load on {{ $labels.instance }} has been high for 15 minutes."

  - alert: LowDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10 # Less than 10% disk space remaining
    for: 30m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"
      description: "Filesystem on {{ $labels.instance }} has less than 10% free space."

Reload the Prometheus configuration (for example, by restarting the service or sending the process a SIGHUP) so that the new rules and the Alertmanager target take effect.
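
The latency rule above boils down to dividing the time spent serving requests by the number of requests handled in the same window. As a worked illustration of that arithmetic (not something Prometheus requires you to write), the sketch below computes the average from two hypothetical samples of cumulative counters; the function names and sample values are assumptions.

```python
def average_latency(sum_start, sum_end, count_start, count_end):
    """Average seconds per request between two samples of cumulative counters.

    sum_*   : cumulative seconds spent handling requests
    count_* : cumulative number of requests handled
    Returns None when no requests arrived in the window.
    """
    requests = count_end - count_start
    if requests <= 0:
        return None
    return (sum_end - sum_start) / requests


def should_alert(latency, threshold=2.0):
    """Mirror the rule's comparison: fire only when latency exceeds the threshold."""
    return latency is not None and latency > threshold
```

For example, if the latency sum grows from 120 s to 450 s while the request count grows from 1000 to 1100, the window saw 330 s across 100 requests, or 3.3 s per request, which is above the 2 s threshold. Unlike this sketch, the `rate()` function in Prometheus also compensates for counter resets when a process restarts.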

Install the required packages:

pip install flask prometheus_client gunicorn

Next, instrument your Python application to expose metrics. Here’s a simple example using a Flask application:

from flask import Flask, Response
from prometheus_client import Counter, Gauge, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time
import random

app = Flask(__name__)

# Define some Prometheus metrics.
# The 'code' label records the HTTP status so alert rules can match code=~"5..".
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP Requests', ['method', 'endpoint', 'code'])
# A Histogram exports _sum and _count series, which average-latency queries need.
# A Gauge would only keep the most recent request's latency.
REQUEST_LATENCY = Histogram('http_request_latency_seconds', 'HTTP Request Latency', ['method', 'endpoint'])
ACTIVE_USERS = Gauge('app_active_users', 'Number of currently active users')

# Simulate active users
def update_active_users():
    ACTIVE_USERS.set(random.randint(10, 100))

@app.route('/')
def index():
    start_time = time.time()
    update_active_users()
    time.sleep(random.uniform(0.1, 0.5)) # Simulate work
    REQUEST_LATENCY.labels(method='GET', endpoint='/').observe(time.time() - start_time)
    REQUEST_COUNT.labels(method='GET', endpoint='/', code='200').inc()
    return "Hello, World!"

@app.route('/api/data')
def api_data():
    start_time = time.time()
    update_active_users()
    time.sleep(random.uniform(0.5, 1.5)) # Simulate heavier work
    REQUEST_LATENCY.labels(method='GET', endpoint='/api/data').observe(time.time() - start_time)
    REQUEST_COUNT.labels(method='GET', endpoint='/api/data', code='200').inc()
    return {"data": "some_data", "value": random.randint(1, 1000)}

@app.route('/metrics')
def metrics():
    update_active_users() # Ensure metrics are up-to-date
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    # For development, run with Flask's built-in server.
    # In production, use gunicorn.
    # Debug mode enables an interactive debugger, so bind to localhost only.
    app.run(debug=True, host='127.0.0.1', port=5000)

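To run this with `gunicorn` and expose metrics, keep in mind that every worker process holds its own in-memory metrics registry, so each worker needs a way to serve its metrics. One approach is the `post_worker_init` server hook, which runs inside each worker after it starts.

Create a `gunicorn` configuration file (e.g., gunicorn_config.py):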

from prometheus_client import start_http_server

# Base port for the per-worker metrics servers. It must differ from the app port
# and be reachable by your Prometheus server (check your DigitalOcean firewall rules).
# Node Exporter also defaults to 9100, so pick a different base if both run
# on the same droplet.
METRICS_PORT = 9100 # Or another available port

def post_worker_init(worker):
    # This hook runs in each worker process after it has been initialized.
    # start_http_server() serves this worker's metrics from a background thread.
    # The port is derived from the worker PID, so it falls between METRICS_PORT and
    # METRICS_PORT + 999 and changes whenever a worker restarts. Two workers can
    # still collide if their PIDs share the same last three digits.
    port = METRICS_PORT + worker.pid % 1000
    try:
        start_http_server(port)
        print(f"Prometheus metrics server started on port {port} for worker {worker.pid}")
    except OSError as e:
        print(f"Could not start Prometheus metrics server on port {port}: {e}")
        print("Consider using a reverse proxy or ensuring port availability.")

# With several workers, a single /metrics endpoint behind a reverse proxy such as Nginx
# is easier to scrape than one port per worker. In that setup you do not need to start
# a server here, but each proxied request reaches only one worker, so combine it with
# prometheus_client's multiprocess mode (PROMETHEUS_MULTIPROC_DIR) to aggregate
# metrics across workers.

# Example of how to run gunicorn with this config:
# gunicorn -c gunicorn_config.py your_app_module:app
# e.g., gunicorn -c gunicorn_config.py main:app

Now, run your application using `gunicorn` with this configuration:

# Assuming your Flask app is in a file named app.py and the Flask instance is named 'app'
gunicorn -c gunicorn_config.py app:app --workers 4 --bind 0.0.0.0:8000

Configure Prometheus to scrape these metrics. In your prometheus.yml:

scrape_configs:
  - job_name: 'my_python_app'
    static_configs:
      - targets: ['your_app_droplet_ip:9100', 'your_app_droplet_ip:9101', 'your_app_droplet_ip:9102', 'your_app_droplet_ip:9103'] # Placeholder ports: each worker listens on 9100 + (PID % 1000), so take the real ports from the gunicorn startup log
    # If using a reverse proxy:
    # - targets: ['your_app_droplet_ip:8080'] # Assuming Nginx on 8080 proxies /metrics

Remember to adjust the `targets` based on how you’ve configured your Prometheus metrics server. If you’re using a reverse proxy, you’ll point Prometheus to the proxy’s address and port.
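
Because the configuration above derives each worker's metrics port from its process ID, the scrape targets change whenever gunicorn restarts a worker. The sketch below reproduces that arithmetic to show which targets a given set of PIDs would produce; the PIDs are made up, and `metrics_port` and `targets_for_workers` are hypothetical helpers rather than part of gunicorn or prometheus_client.

```python
METRICS_PORT = 9100  # Same base port as in gunicorn_config.py


def metrics_port(pid, base=METRICS_PORT):
    """Port a worker with this PID would bind: base + (pid % 1000)."""
    return base + pid % 1000


def targets_for_workers(host, pids):
    """Build sorted 'host:port' scrape targets for the given worker PIDs."""
    return sorted(f"{host}:{metrics_port(pid)}" for pid in pids)
```

For instance, workers with PIDs 12345 and 12346 would listen on ports 9445 and 9446, not on 9100 and 9101. A worker with PID 22345 would also map to 9445 and fail to bind, which is one reason a single proxied endpoint tends to be easier to operate.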

System-Level Monitoring with Node Exporter and DigitalOcean Monitoring

Download and install Node Exporter on each of your DigitalOcean droplets (app servers, database servers).

# On each droplet, download the latest release (check Prometheus website for latest URL)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Move the binary to /usr/local/bin
sudo mv node_exporter /usr/local/bin/

# Create a systemd service file for Node Exporter
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify status
sudo systemctl status node_exporter

Node Exporter listens on port 9100 by default. Open this port in your DigitalOcean firewall only to your Prometheus server; the metrics endpoint is unauthenticated by default and does not need to be publicly reachable.
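
Before adding a droplet to Prometheus, it can help to confirm from the Prometheus server that the exporter port is actually reachable. The following is a minimal sketch using only the Python standard library; the function name `exporter_is_up` and the host values are illustrative assumptions, not part of Node Exporter.

```python
import urllib.error
import urllib.request


def exporter_is_up(host: str, port: int = 9100, timeout: float = 3.0) -> bool:
    """Return True if http://host:port/metrics answers with HTTP 200."""
    url = f"http://{host}:{port}/metrics"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, DNS failures and HTTP errors.
        return False

# Example call (placeholder address, replace with a real droplet IP):
# exporter_is_up("app_droplet_1_ip")
```

Running this from the Prometheus droplet rather than from your laptop tests the same network path that the scrape will use.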

Add Node Exporter targets to your Prometheus configuration:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['app_droplet_1_ip:9100', 'app_droplet_2_ip:9100', 'mysql_droplet_1_ip:9100', 'mysql_droplet_2_ip:9100']
    # Add more targets for all your droplets
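
As the number of droplets grows, a hand-edited target list becomes easy to get wrong. Prometheus can also read targets from JSON files through `file_sd_configs`. The sketch below generates such a file's content from a hypothetical inventory dictionary; the `role` label and the function name are assumptions made for illustration.

```python
import json


def build_target_groups(inventory: dict[str, list[str]], port: int = 9100) -> list[dict]:
    """Turn {role: [hosts]} into the target-group list that file_sd_configs expects."""
    groups = []
    for role, hosts in sorted(inventory.items()):
        groups.append({
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"role": role},  # hypothetical label for filtering in queries
        })
    return groups


# Placeholder inventory matching the droplet names used above.
inventory = {
    "app": ["app_droplet_1_ip", "app_droplet_2_ip"],
    "mysql": ["mysql_droplet_1_ip", "mysql_droplet_2_ip"],
}
print(json.dumps(build_target_groups(inventory), indent=2))
```

Write the output to a file and reference that file from a `file_sd_configs` entry in the `node_exporter` job; Prometheus picks up changes to the file without a reload.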

To leverage DigitalOcean’s monitoring effectively:

Regularly review the “Graphs” section for each Droplet in your DigitalOcean control panel.
Set up alerts within DigitalOcean for key metrics like CPU utilization exceeding 90% for a sustained period, or disk I/O hitting critical levels. This provides a safety net.
Correlate DigitalOcean alerts with Prometheus data. If DigitalOcean flags high CPU, use Prometheus to narrow down the cause: Node Exporter breaks CPU time down by mode (user, system, iowait), and your Python app and MySQL metrics show whether request volume or query load rose at the same time.
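
To drill down after a DigitalOcean CPU alert, you can ask Prometheus for the CPU breakdown by mode through its HTTP query API (`/api/v1/query`). This is a minimal sketch; the helper names and the Prometheus address are assumptions, and the query only relies on Node Exporter's `node_cpu_seconds_total` metric.

```python
import json
import urllib.parse
import urllib.request


def cpu_by_mode_query(instance: str, window: str = "5m") -> str:
    """PromQL: CPU seconds spent per second, grouped by mode, for one instance."""
    return f'sum by (mode) (rate(node_cpu_seconds_total{{instance="{instance}"}}[{window}]))'


def build_query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for a Prometheus server."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(base_url: str, promql: str, timeout: float = 5.0) -> list:
    """Execute the query and return the result list from the JSON response."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)["data"]["result"]

# Example call (placeholder addresses):
# run_query("http://prometheus_droplet_ip:9090", cpu_by_mode_query("app_droplet_1_ip:9100"))
```

A high `iowait` share points at disk pressure, while a high `user` share points at application or query work.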

Alerting Strategy with Alertmanager

First, set up Alertmanager. You can run it as a Docker container or a standalone binary. For this example, we’ll assume a basic configuration.

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver

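receivers:
- name: 'default-receiver'
  # email_configs needs SMTP settings (smtp_smarthost, smtp_from), set under `global:` or on this entry
  email_configs:
  - to: '[email protected]'
    send_resolved: true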

# Example for Slack (uncomment and configure)
# - name: 'slack-receiver'
#   slack_configs:
#   - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
#     channel: '#alerts'
#     send_resolved: true

# Example for PagerDuty (uncomment and configure)
# - name: 'pagerduty-receiver'
#   pagerduty_configs:
#   - service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
#     send_resolved: true

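# Define specific routes for different alert severities or types.
# Nest this block under the top-level `route:` key above, and define the
# 'critical-alerts' and 'warning-alerts' receivers before enabling it.
#   routes:
#   - receiver: 'critical-alerts'
#     matchers:
#     - severity="critical"
#     continue: true # Keep evaluating the sibling routes below after this one matches
#   - receiver: 'warning-alerts'
#     matchers:
#     - severity="warning"
#     continue: true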

Run Alertmanager:

# Example using Docker
docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v /path/to/your/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager --config.file=/etc/alertmanager/alertmanager.yml
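
Once Alertmanager is running, it is worth sending it a hand-made alert to confirm that notifications actually arrive. The sketch below builds a request for Alertmanager's v2 API (`POST /api/v2/alerts`); the alert name, labels and helper names are hypothetical and chosen only for the test.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(severity: str = "warning", minutes: int = 5) -> list:
    """Create one short-lived alert in the JSON shape the v2 API accepts."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": "ManualTestAlert", "severity": severity, "instance": "test"},
        "annotations": {"summary": "Manual test alert; safe to ignore"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def build_request(alertmanager_url: str, alerts: list) -> urllib.request.Request:
    """Prepare the POST request without sending it."""
    return urllib.request.Request(
        f"{alertmanager_url.rstrip('/')}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example call (placeholder address):
# urllib.request.urlopen(build_request("http://alertmanager_droplet_ip:9093", build_test_alert()), timeout=5)
```

With the configuration above, the notification should arrive after `group_wait` (30 seconds) has elapsed.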

Now, configure Prometheus to send alerts to Alertmanager. In your prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager_droplet_ip:9093'] # IP of your Alertmanager server

Define alerting rules in Prometheus. Create a separate file (e.g., alerts.yml) and list it under `rule_files` in your prometheus.yml:

# alerts.yml
groups:
- name: mysql_alerts
  rules:
  - alert: MySQLHighReplicationLag
    expr: avg(mysql_slave_status_seconds_behind_master{job="mysql_exporter"}) by (instance) > 60 # Assuming you have mysql_exporter running
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High MySQL replication lag on {{ $labels.instance }}"
      description: "MySQL replica {{ $labels.instance }} is lagging by more than 60 seconds for 5 minutes."

- name: python_app_alerts
  rules:
  - alert: PythonAppHighLatency
    expr: rate(http_request_latency_seconds_sum{job="my_python_app"}[5m]) / rate(http_request_latency_seconds_count{job="my_python_app"}[5m]) > 2.0 # Average latency over 2 seconds; needs a Histogram or Summary metric, which provides the _sum and _count series
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High latency for Python app {{ $labels.endpoint }}"
      description: "Average request latency for {{ $labels.endpoint }} on {{ $labels.instance }} has been above 2s for 10 minutes."

  - alert: PythonAppHighErrorRate
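    expr: sum by (instance) (rate(http_requests_total{job="my_python_app", code=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total{job="my_python_app"}[5m])) > 0.05 # 5% error rate; assumes http_requests_total carries a 'code' label with the HTTP status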
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate for Python app"
      description: "The error rate for {{ $labels.instance }} is above 5% for 5 minutes."

- name: system_alerts
  rules:
  - alert: HighCPUUsage
    expr: node_load1 > 0.8 * count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) # 1-minute load average above 80% of the CPU core count
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU load on {{ $labels.instance }} has been high for 15 minutes."

  - alert: LowDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10 # Less than 10% disk space remaining
    for: 30m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"
      description: "Filesystem on {{ $labels.instance }} has less than 10% free space."
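
The thresholds in these rules are easier to tune once the arithmetic behind them is explicit. The sketch below reproduces three of the calculations in plain Python with made-up sample values. It is a simplification: Prometheus' `rate()` also handles counter resets and extrapolation, which this ignores.

```python
from typing import Optional


def average_latency(sum_start: float, sum_end: float,
                    count_start: float, count_end: float) -> Optional[float]:
    """Mean seconds per request over a window, from _sum and _count counter readings."""
    requests = count_end - count_start
    if requests <= 0:
        return None  # no traffic in the window, so no meaningful average
    return (sum_end - sum_start) / requests


def free_space_percent(avail_bytes: float, size_bytes: float) -> float:
    """Percentage of the filesystem still available."""
    return avail_bytes / size_bytes * 100


def load_threshold(cpu_cores: int, factor: float = 0.8) -> float:
    """Load average above which HighCPUUsage starts counting toward its 'for' period."""
    return factor * cpu_cores


# Made-up readings: 630 s of latency accumulated over 300 requests in the window.
latency = average_latency(120.0, 750.0, 400, 700)
print(f"average latency: {latency:.2f}s, alert condition met: {latency > 2.0}")
print(f"free space: {free_space_percent(4 * 2**30, 50 * 2**30):.1f}%")
print(f"load threshold on 4 cores: {load_threshold(4):.1f}")
```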

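Reload the Prometheus configuration so the new rules and Alertmanager target take effect: restart Prometheus, send it a SIGHUP, or POST to its /-/reload endpoint if it runs with --web.enable-lifecycle.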
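
If Prometheus was started with `--web.enable-lifecycle`, the reload can be scripted over HTTP. A minimal sketch, assuming Prometheus listens on its default port 9090; the helper names are illustrative.

```python
import urllib.request


def build_reload_request(prometheus_url: str) -> urllib.request.Request:
    """Prepare the POST to /-/reload without sending it."""
    return urllib.request.Request(
        f"{prometheus_url.rstrip('/')}/-/reload", data=b"", method="POST"
    )


def reload_prometheus(prometheus_url: str, timeout: float = 10.0) -> bool:
    """Trigger a configuration reload; True on HTTP 200."""
    with urllib.request.urlopen(build_reload_request(prometheus_url), timeout=timeout) as resp:
        return resp.status == 200

# Example call, run on the Prometheus droplet:
# reload_prometheus("http://localhost:9090")
```

A failed reload (for example, a syntax error in alerts.yml) leaves the previous configuration active, so check the Prometheus log if the new rules do not show up.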

chmod +x /opt/scripts/check_replication_lag.sh
# On each read replica, edit your crontab
crontab -e

# Add the following line:
* * * * * /opt/scripts/check_replication_lag.sh >> /var/log/replication_lag.log 2>&1

This setup provides a foundational layer for monitoring MySQL replication. For more advanced scenarios, consider integrating with a centralized monitoring system like Prometheus or Datadog, which can scrape metrics from `pt-heartbeat` or directly from MySQL’s `SHOW REPLICA STATUS` output.

Python Application Performance Monitoring with Prometheus and `gunicorn`

Monitoring Python applications is crucial for identifying performance bottlenecks, errors, and resource utilization. For applications deployed on DigitalOcean using `gunicorn` as the WSGI server, integrating Prometheus for metrics collection offers a powerful, scalable solution.

We’ll use the prometheus_client Python library to expose application-specific metrics. `gunicorn` has no built-in Prometheus exporter, so a server hook in its configuration file starts a metrics endpoint for each worker.

First, install the necessary libraries:

pip install prometheus_client gunicorn

Next, instrument your Python application to expose metrics. Here’s a simple example using a Flask application:

from flask import Flask, Response
from prometheus_client import Counter, Gauge, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time
import random

app = Flask(__name__)

# Define some Prometheus metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP Requests', ['method', 'endpoint'])
# A Histogram exposes _sum, _count and _bucket series, which average-latency queries need;
# a Gauge would only keep the latency of the most recent request.
REQUEST_LATENCY = Histogram('http_request_latency_seconds', 'HTTP Request Latency', ['method', 'endpoint'])
ACTIVE_USERS = Gauge('app_active_users', 'Number of currently active users')

# Simulate active users
def update_active_users():
    ACTIVE_USERS.set(random.randint(10, 100))

@app.route('/')
def index():
    start_time = time.perf_counter() # Monotonic clock, unaffected by system time changes
    REQUEST_COUNT.labels(method='GET', endpoint='/').inc()
    update_active_users()
    time.sleep(random.uniform(0.1, 0.5)) # Simulate work
    latency = time.perf_counter() - start_time
    REQUEST_LATENCY.labels(method='GET', endpoint='/').observe(latency)
    return "Hello, World!"

@app.route('/api/data')
def api_data():
    start_time = time.perf_counter()
    REQUEST_COUNT.labels(method='GET', endpoint='/api/data').inc()
    update_active_users()
    time.sleep(random.uniform(0.5, 1.5)) # Simulate heavier work
    latency = time.perf_counter() - start_time
    REQUEST_LATENCY.labels(method='GET', endpoint='/api/data').observe(latency)
    return {"data": "some_data", "value": random.randint(1, 1000)}

@app.route('/metrics')
def metrics():
    update_active_users() # Ensure metrics are up-to-date
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    # For development, run with Flask's built-in server
    # In production, use gunicorn
    app.run(debug=True, host='0.0.0.0', port=5000)

Create a `gunicorn` configuration file (e.g., gunicorn_config.py):

from prometheus_client import start_http_server

# Base port for the per-worker Prometheus metrics servers. It must differ from the
# app port and be reachable from your Prometheus server (check your DigitalOcean
# droplet setup and firewall rules). Node Exporter also defaults to 9100, so pick
# another base port if both run on the same droplet.
#
# Each gunicorn worker is a separate process with its own metrics, so with direct
# scraping every worker needs its own port and Prometheus must scrape each one.
# With several workers, a reverse proxy (like Nginx) that routes /metrics is
# usually the more robust approach.
METRICS_PORT = 9100 # Or another available port

def post_worker_init(worker):
    # This hook runs after a worker process has been initialized.
    # start_http_server serves this worker's metrics from a background thread.
    # The port is derived from the worker PID to reduce conflicts between workers;
    # it changes on every restart, and two PIDs can still map to the same port.
    port = METRICS_PORT + worker.pid % 1000
    try:
        start_http_server(port)
        worker.log.info("Prometheus metrics server started on port %s for worker %s", port, worker.pid)
    except OSError as e:
        worker.log.error("Could not start Prometheus metrics server on port %s: %s", port, e)
        worker.log.error("Consider using a reverse proxy or ensuring port availability.")

# If you are using a reverse proxy like Nginx, you might not need to start the server here.
# Instead, you'd configure Nginx to proxy requests to the /metrics endpoint of your app.
# However, for direct scraping, starting the server is necessary.

# Example of how to run gunicorn with this config:
# gunicorn -c gunicorn_config.py your_app_module:app
# e.g., gunicorn -c gunicorn_config.py main:app
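
An alternative to one port per worker is prometheus_client's multiprocess mode, in which every worker writes its samples to files in a shared directory and a single `/metrics` response aggregates them. The sketch below is a minimal illustration under stated assumptions: it requires the `prometheus_client` package, the metric name is hypothetical, and the temporary directory stands in for a directory you would configure and clear yourself on each restart.

```python
import os
import tempfile

# Assumption: in production you would export PROMETHEUS_MULTIPROC_DIR in the service
# environment and empty that directory on every restart. The temporary directory
# here only makes the sketch runnable on its own. The variable must be set before
# prometheus_client is imported.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", tempfile.mkdtemp(prefix="prom_multiproc_"))

from prometheus_client import CollectorRegistry, Counter, generate_latest, multiprocess

# Hypothetical metric, used only to show the aggregation.
DEMO_REQUESTS = Counter("demo_requests_total", "Requests handled by any worker")


def render_metrics() -> bytes:
    """Aggregate the samples written by all worker processes into one exposition."""
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry)


def child_exit(server, worker):
    """gunicorn hook (belongs in gunicorn_config.py): clean up after an exited worker."""
    multiprocess.mark_process_dead(worker.pid)
```

In the Flask app, the `/metrics` route would then return `render_metrics()` instead of calling `generate_latest()` on the default registry, and Prometheus would scrape the application port as a single target.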

# On each read replica, create a monitoring script, e.g., /opt/scripts/check_replication_lag.sh
#!/bin/bash

# MySQL connection details for the replica
REPLICA_USER="your_mysql_user"
REPLICA_PASSWORD="your_mysql_password"
REPLICA_HOST="127.0.0.1" # Or the replica's IP
REPLICA_DB="heartbeat"

# The primary keeps the heartbeat row fresh with `pt-heartbeat --update`;
# the replica only needs to read its own replicated copy of that row.

# Threshold for replication lag in seconds
LAG_THRESHOLD=60

# Run pt-heartbeat once (--check prints the current lag in seconds and exits)
LAG=$(pt-heartbeat --host="$REPLICA_HOST" --user="$REPLICA_USER" --password="$REPLICA_PASSWORD" --database="$REPLICA_DB" --table=ping --master-server-id=1 --check 2>&1)

# Check if pt-heartbeat ran successfully and returned a lag value
if [[ $LAG =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
    # Convert lag to integer for comparison
    LAG_INT=$(echo "$LAG" | cut -d. -f1)
    if [ "$LAG_INT" -gt "$LAG_THRESHOLD" ]; then
        echo "ALERT: Replication lag on replica $REPLICA_HOST is ${LAG}s, exceeding threshold of ${LAG_THRESHOLD}s."
        # Add your alerting mechanism here (e.g., send email, trigger PagerDuty)
        # Example: echo "Replication lag alert for $REPLICA_HOST" | mail -s "MySQL Replication Alert" [email protected]
    else
        echo "OK: Replication lag on replica $REPLICA_HOST is ${LAG}s."
    fi
else
    echo "ERROR: Failed to get replication lag on replica $REPLICA_HOST. Output: $LAG"
    # Add alerting for script failure
fi

Make the script executable and add it to cron on each replica to run, for instance, every minute:

chmod +x /opt/scripts/check_replication_lag.sh
# On each read replica, edit your crontab
crontab -e

# Add the following line:
* * * * * /opt/scripts/check_replication_lag.sh >> /var/log/replication_lag.log 2>&1
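
The decision logic of the check script is easy to get subtly wrong, so it can be useful to express it as a small, testable function. The Python sketch below mirrors the script's three outcomes; the function name and return values are illustrative assumptions. Unlike the shell version, which truncates the lag to an integer before comparing, it compares the full decimal value.

```python
import re

# Accepts plain non-negative numbers such as "0.00" or "75".
LAG_PATTERN = re.compile(r"^\d+(\.\d+)?$")


def classify_lag(output: str, threshold_seconds: float = 60.0) -> str:
    """Map the checker's raw output to OK, ALERT, or ERROR."""
    value = output.strip()
    if not LAG_PATTERN.match(value):
        return "ERROR"  # not a number: the tool failed or printed a message
    return "ALERT" if float(value) > threshold_seconds else "OK"


print(classify_lag("0.00\n"))
print(classify_lag("75.20"))
print(classify_lag("connection refused"))
```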

# On your primary MySQL server, edit your crontab
crontab -e

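# Add the following line. cron cannot schedule anything more often than once per minute,
# and `pt-heartbeat --update` keeps running and refreshes the heartbeat row every second by default,
# so start it once at boot as a daemon instead of on a repeating schedule.
# (If MySQL may not be up yet when cron starts, launch it from a service that starts after MySQL.)
@reboot pt-heartbeat --host=127.0.0.1 --user=your_mysql_user --password=your_mysql_password --database=heartbeat --table=ping --update --daemonize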

Replace your_mysql_user and your_mysql_password with appropriate credentials. It’s highly recommended to use a dedicated MySQL user with minimal privileges for monitoring. For enhanced security, consider using MySQL proxy or socket authentication if possible.

On each of your read replicas, use `pt-heartbeat` to measure the lag. It compares the replicated heartbeat timestamp with the replica's current time, so keep the servers' clocks synchronized (for example with NTP). We’ll wrap it in a simple shell script that runs periodically, checks the output, and triggers alerts.

#!/bin/bash
# Save this on each read replica as, e.g., /opt/scripts/check_replication_lag.sh

# MySQL connection details for the replica
REPLICA_USER="your_mysql_user"
REPLICA_PASSWORD="your_mysql_password"
REPLICA_HOST="127.0.0.1" # Or the replica's IP
REPLICA_DB="heartbeat"

# server_id of the primary that writes the heartbeat row (assumed to be 1 here)
PRIMARY_SERVER_ID=1

# Threshold for replication lag in seconds
LAG_THRESHOLD=60

# --check reads the replicated heartbeat row once, prints the lag in seconds, and exits.
# It only needs a connection to the replica, not to the primary.
LAG=$(pt-heartbeat --host="$REPLICA_HOST" --user="$REPLICA_USER" --password="$REPLICA_PASSWORD" --database="$REPLICA_DB" --table=ping --master-server-id="$PRIMARY_SERVER_ID" --check 2>&1)

# Check if pt-heartbeat ran successfully and returned a lag value
if [[ $LAG =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
    # Convert lag to integer for comparison
    LAG_INT=$(echo "$LAG" | cut -d. -f1)
    if [ "$LAG_INT" -gt "$LAG_THRESHOLD" ]; then
        echo "ALERT: Replication lag on replica $REPLICA_HOST is ${LAG}s, exceeding threshold of ${LAG_THRESHOLD}s."
        # Add your alerting mechanism here (e.g., send email, trigger PagerDuty)
        # Example: echo "Replication lag alert for $REPLICA_HOST" | mail -s "MySQL Replication Alert" [email protected]
    else
        echo "OK: Replication lag on replica $REPLICA_HOST is ${LAG}s."
    fi
else
    echo "ERROR: Failed to get replication lag on replica $REPLICA_HOST. Output: $LAG"
    # Add alerting for script failure
fi

Make the script executable and add it to cron on each replica to run, for instance, every minute:

chmod +x /opt/scripts/check_replication_lag.sh
# On each read replica, edit your crontab
crontab -e

# Add the following line:
* * * * * /opt/scripts/check_replication_lag.sh >> /var/log/replication_lag.log 2>&1
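
The decision the shell script makes can also be written as a small function, which is easier to unit-test than a cron job. The sketch below is illustrative only: `classify_lag` and its return strings are hypothetical names, and it assumes the lag checker prints a bare number of seconds (such as 0.00), which is what the script's regular expression expects.

```python
import re

# Same pattern the shell script uses: digits with an optional fractional part.
LAG_PATTERN = re.compile(r"^[0-9]+(\.[0-9]+)?$")


def classify_lag(output: str, threshold: int = 60) -> str:
    """Classify raw checker output as OK, ALERT, or ERROR (hypothetical helper)."""
    text = output.strip()
    if not LAG_PATTERN.match(text):
        # Anything that is not a plain number is treated as a failed check.
        return "ERROR"
    # Mirror the script: drop the fraction, then compare like `-gt`.
    lag_int = int(text.split(".")[0])
    return "ALERT" if lag_int > threshold else "OK"


if __name__ == "__main__":
    for sample in ("0.00", "60.90", "61.00", "connection refused"):
        print(f"{sample!r}: {classify_lag(sample)}")
```

Because the fraction is dropped before comparing, a lag of 60.90 seconds still counts as OK with a 60-second threshold. Compare the full floating-point value instead if that margin matters to you.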

Python Application Performance Monitoring with Prometheus and `gunicorn`

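We’ll use the prometheus_client Python library to expose application-specific metrics and serve the app with `gunicorn`. `gunicorn` has no built-in Prometheus exporter, so the metrics endpoint comes from prometheus_client, started through a hook in the `gunicorn` configuration file.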

First, install the necessary libraries:

pip install flask prometheus_client gunicorn

Next, instrument your Python application to expose metrics. Here’s a simple example using a Flask application:

from flask import Flask, Response
from prometheus_client import Counter, Gauge, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time
import random

app = Flask(__name__)

# Define some Prometheus metrics
# The 'code' label lets an error-rate alert select 5xx responses.
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP Requests', ['method', 'endpoint', 'code'])
# A Histogram exports _sum and _count series, which an average-latency alert can divide.
# (A Gauge would only keep the latency of the most recent request.)
REQUEST_LATENCY = Histogram('http_request_latency_seconds', 'HTTP Request Latency', ['method', 'endpoint'])
ACTIVE_USERS = Gauge('app_active_users', 'Number of currently active users')

# Simulate active users
def update_active_users():
    ACTIVE_USERS.set(random.randint(10, 100))

@app.route('/')
def index():
    start_time = time.time()
    update_active_users()
    time.sleep(random.uniform(0.1, 0.5)) # Simulate work
    REQUEST_LATENCY.labels(method='GET', endpoint='/').observe(time.time() - start_time)
    REQUEST_COUNT.labels(method='GET', endpoint='/', code='200').inc()
    return "Hello, World!"

@app.route('/api/data')
def api_data():
    start_time = time.time()
    update_active_users()
    time.sleep(random.uniform(0.5, 1.5)) # Simulate heavier work
    REQUEST_LATENCY.labels(method='GET', endpoint='/api/data').observe(time.time() - start_time)
    REQUEST_COUNT.labels(method='GET', endpoint='/api/data', code='200').inc()
    return {"data": "some_data", "value": random.randint(1, 1000)}

@app.route('/metrics')
def metrics():
    update_active_users() # Ensure metrics are up-to-date
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    # For development only, run with Flask's built-in server.
    # In production, use gunicorn. Never combine debug=True with a public bind
    # address, because the interactive debugger allows code execution.
    app.run(debug=False, host='0.0.0.0', port=5000)

Create a `gunicorn` configuration file (e.g., gunicorn_config.py):

from prometheus_client import start_http_server

# Base port for the per-worker metrics servers. Keep it distinct from the app
# port and make sure your Prometheus server can reach it through the firewall.
# Node Exporter also defaults to 9100, so choose another base port if both run
# on the same droplet.
METRICS_PORT = 9100 # Or another available port

def post_worker_init(worker):
    # This hook runs in each worker process after it has been initialized.
    # Every worker is a separate process with its own metric values, so each
    # one serves its own metrics on its own port and must be scraped separately.
    # The port is derived from the PID to avoid conflicts between workers; it
    # changes whenever a worker restarts, and two PIDs can still collide.
    port = METRICS_PORT + worker.pid % 1000
    try:
        # start_http_server serves the metrics from a background thread.
        start_http_server(port)
        print(f"Prometheus metrics server started on port {port} for worker {worker.pid}")
    except OSError as e:
        print(f"Could not start Prometheus metrics server on port {port}: {e}")
        print("Consider using a reverse proxy or ensuring port availability.")

# If you are using a reverse proxy like Nginx, you might not need this hook.
# Instead, you'd configure Nginx to proxy requests to the /metrics endpoint of your app.
# For direct scraping of each worker, starting the server here is necessary.

# Example of how to run gunicorn with this config:
# gunicorn -c gunicorn_config.py your_app_module:app
# e.g., gunicorn -c gunicorn_config.py main:app

Now, run your application using `gunicorn` with this configuration:

# Assuming your Flask app is in a file named app.py and the Flask instance is named 'app'
gunicorn -c gunicorn_config.py app:app --workers 4 --bind 0.0.0.0:8000

Configure Prometheus to scrape these metrics. In your prometheus.yml:

scrape_configs:
  - job_name: 'my_python_app'
    static_configs:
      - targets: ['your_app_droplet_ip:9100', 'your_app_droplet_ip:9101', 'your_app_droplet_ip:9102', 'your_app_droplet_ip:9103'] # Example ports only: each worker listens on 9100 + (PID % 1000), so take the real ports from the gunicorn log
    # If using a reverse proxy:
    # - targets: ['your_app_droplet_ip:8080'] # Assuming Nginx on 8080 proxies /metrics

Remember to adjust the `targets` based on how you’ve configured your Prometheus metrics server. If you’re using a reverse proxy, you’ll point Prometheus to the proxy’s address and port.
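
To see which ports the PID-based scheme actually produces, the arithmetic from gunicorn_config.py (METRICS_PORT + worker.pid % 1000) can be reproduced offline. This is a minimal sketch: `metrics_port` and `scrape_targets` are hypothetical helper names, and the PIDs are made up for illustration.

```python
METRICS_PORT = 9100  # base port, as in gunicorn_config.py


def metrics_port(pid: int, base: int = METRICS_PORT) -> int:
    """Port a worker with this PID would bind under the PID-based scheme."""
    return base + pid % 1000


def scrape_targets(host: str, pids: list[int]) -> list[str]:
    """Build host:port scrape targets for the given worker PIDs."""
    return [f"{host}:{metrics_port(pid)}" for pid in sorted(pids)]


if __name__ == "__main__":
    # Made-up PIDs; the real ones appear in the gunicorn startup log.
    print(scrape_targets("your_app_droplet_ip", [20413, 20414, 20415, 20416]))
```

The ports land anywhere between the base port and the base port plus 999, and they change whenever a worker is restarted. That is why a static target list like the one above needs updating after every restart, and why a reverse proxy is often the easier option.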

System-Level Monitoring with Node Exporter and DigitalOcean Monitoring

Download and install Node Exporter on each of your DigitalOcean droplets (app servers, database servers).

# On each droplet, download the latest release (check Prometheus website for latest URL)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Move the binary to /usr/local/bin
sudo mv node_exporter /usr/local/bin/

# Create a systemd service file for Node Exporter
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify status
sudo systemctl status node_exporter

Node Exporter typically runs on port 9100. Ensure this port is open in your DigitalOcean firewall and accessible by your Prometheus server.

Add Node Exporter targets to your Prometheus configuration:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['app_droplet_1_ip:9100', 'app_droplet_2_ip:9100', 'mysql_droplet_1_ip:9100', 'mysql_droplet_2_ip:9100']
    # Add more targets for all your droplets

To leverage DigitalOcean’s monitoring effectively:

Regularly review the “Graphs” section for each Droplet in your DigitalOcean control panel.
Set up alerts within DigitalOcean for key metrics like CPU utilization exceeding 90% for a sustained period, or disk I/O hitting critical levels. This provides a safety net.
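Correlate DigitalOcean alerts with Prometheus data. If DigitalOcean flags high CPU, use Prometheus to see what changed at the same time: request rate and latency for your Python app, MySQL metrics, or the CPU mode breakdown from Node Exporter. Node Exporter does not report per-process CPU, so pinning the load on a specific process takes a tool such as top on the Droplet or an additional per-process exporter.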
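
When a DigitalOcean alert fires, one quick way to correlate is an instant query against the Prometheus HTTP API. The sketch below only builds the request URL and does not send it. It assumes Prometheus listens on its default port 9090 at a placeholder address, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def cpu_by_mode_query(instance: str, window: str = "5m") -> str:
    """PromQL for CPU seconds used per second on one instance, split by mode."""
    return (
        f'sum by (mode) (rate(node_cpu_seconds_total{{instance="{instance}"}}[{window}]))'
    )


def instant_query_url(prometheus_base: str, promql: str) -> str:
    """Build a URL for the /api/v1/query endpoint of the Prometheus HTTP API."""
    return f"{prometheus_base.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    query = cpu_by_mode_query("app_droplet_1_ip:9100")
    print(query)
    print(instant_query_url("http://prometheus_droplet_ip:9090", query))
```

The result separates user, system, iowait, and steal time. That is often enough to tell application load from disk contention, or from a noisy neighbour on shared hardware.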

Alerting Strategy with Alertmanager

First, set up Alertmanager. You can run it as a Docker container or a standalone binary. For this example, we’ll assume a basic configuration.

# alertmanager.yml
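global:
  resolve_timeout: 5m
  # email_configs needs SMTP settings; the two values below are placeholders
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'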

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver

receivers:
- name: 'default-receiver'
  email_configs:
  - to: '[email protected]'
    send_resolved: true

# Example for Slack (uncomment and configure)
# - name: 'slack-receiver'
#   slack_configs:
#   - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
#     channel: '#alerts'
#     send_resolved: true

# Example for PagerDuty (uncomment and configure)
# - name: 'pagerduty-receiver'
#   pagerduty_configs:
#   - service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
#     send_resolved: true

# Define specific routes for different alert severities or types.
# Child routes belong under the top-level "route:" key above, and each
# receiver they name must also be defined under "receivers:".
#   routes:
#   - receiver: 'critical-alerts'
#     matchers:
#     - severity="critical"
#     continue: true # Keep evaluating the sibling routes that follow
#   - receiver: 'warning-alerts'
#     matchers:
#     - severity="warning"
#     continue: true
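
How an alert travels through the routing tree is easier to see in a small model. The sketch below is a simplified illustration, not Alertmanager's implementation: it supports only exact label matches (no regular expressions, grouping, or timing), and the class and function names are hypothetical. It follows the documented behaviour that an alert stops at the first matching child route unless that route sets continue, and falls back to the parent's receiver when no child matches.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict[str, str] = field(default_factory=dict)  # exact label matches only
    continue_: bool = False  # mirrors the "continue" key
    routes: list["Route"] = field(default_factory=list)  # child routes


def matches(route: Route, labels: dict[str, str]) -> bool:
    """A route matches when every one of its label constraints is satisfied."""
    return all(labels.get(name) == value for name, value in route.match.items())


def receivers_for(route: Route, labels: dict[str, str]) -> list[str]:
    """Return the receivers that would be notified for an alert with these labels."""
    if not matches(route, labels):
        return []
    found: list[str] = []
    for child in route.routes:
        hit = receivers_for(child, labels)
        found.extend(hit)
        if hit and not child.continue_:
            break  # first match wins unless the child asks to continue
    # If no child matched, this node's own receiver handles the alert.
    return found or [route.receiver]


if __name__ == "__main__":
    tree = Route(
        receiver="default-receiver",
        routes=[
            Route("critical-alerts", {"severity": "critical"}, continue_=True),
            Route("warning-alerts", {"severity": "warning"}, continue_=True),
        ],
    )
    print(receivers_for(tree, {"alertname": "LowDiskSpace", "severity": "critical"}))
    print(receivers_for(tree, {"alertname": "Custom", "severity": "info"}))
```

In this model a critical alert reaches only critical-alerts. Setting continue lets it be tested against the sibling routes that follow, but it does not send a copy to the parent's default receiver. For that, add a final catch-all child route that names the default receiver.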

Run Alertmanager:

# Example using Docker
docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v /path/to/your/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager --config.file=/etc/alertmanager/alertmanager.yml

Now, configure Prometheus to send alerts to Alertmanager. In your prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager_droplet_ip:9093'] # IP of your Alertmanager server

Define alerting rules in Prometheus. Create a separate file (e.g., alerts.yml) and list it under rule_files in your prometheus.yml:

# alerts.yml
groups:
- name: mysql_alerts
  rules:
  - alert: MySQLHighReplicationLag
    expr: avg(mysql_slave_status_seconds_behind_master{job="mysql_exporter"}) by (instance) > 60 # Assuming you have mysql_exporter running
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High MySQL replication lag on {{ $labels.instance }}"
      description: "MySQL replica {{ $labels.instance }} is lagging by more than 60 seconds for 5 minutes."

- name: python_app_alerts
  rules:
  - alert: PythonAppHighLatency
    expr: sum by (endpoint, instance) (rate(http_request_latency_seconds_sum{job="my_python_app"}[5m])) / sum by (endpoint, instance) (rate(http_request_latency_seconds_count{job="my_python_app"}[5m])) > 2.0 # Average latency over 2 seconds (needs a Histogram or Summary metric)
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High latency for Python app {{ $labels.endpoint }}"
      description: "Average request latency for {{ $labels.endpoint }} on {{ $labels.instance }} has been above 2s for 10 minutes."

  - alert: PythonAppHighErrorRate
    expr: sum by (instance) (rate(http_requests_total{job="my_python_app", code=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total{job="my_python_app"}[5m])) > 0.05 # 5% error rate, kept per instance so the annotation can name it
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate for Python app"
      description: "The error rate for {{ $labels.instance }} is above 5% for 5 minutes."

- name: system_alerts
  rules:
  - alert: HighCPUUsage
    expr: node_load1 > 0.8 * count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) # 1-minute load average above 80% of the CPU core count
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU load on {{ $labels.instance }} has been high for 15 minutes."

  - alert: LowDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10 # Less than 10% disk space remaining
    for: 30m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"
      description: "Filesystem on {{ $labels.instance }} has less than 10% free space."

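Reload the Prometheus configuration for the new rules and Alertmanager target to take effect: send the Prometheus process a SIGHUP, or POST to its /-/reload endpoint if it was started with --web.enable-lifecycle.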
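
The arithmetic behind the two application alerts is worth checking by hand before relying on them. The sketch below works it through for two scrapes of the underlying counters. The function names and sample numbers are hypothetical, and it ignores counter resets, which rate() handles for you in Prometheus.

```python
from typing import Optional


def average_latency(sum_start: float, sum_end: float,
                    count_start: float, count_end: float) -> Optional[float]:
    """Mean seconds per request between two scrapes of the latency sum and request count."""
    requests = count_end - count_start
    if requests <= 0:
        return None  # no traffic in the window, so there is no average
    return (sum_end - sum_start) / requests


def error_ratio(errors_start: float, errors_end: float,
                total_start: float, total_end: float) -> Optional[float]:
    """Share of requests in the window that returned a 5xx status code."""
    total = total_end - total_start
    if total <= 0:
        return None
    return (errors_end - errors_start) / total


if __name__ == "__main__":
    # 250 requests took 630 seconds in total: 2.52 s on average, above the 2.0 s threshold.
    print(average_latency(120.0, 750.0, 400, 650))
    # 30 of 500 requests failed: 0.06, above the 0.05 threshold.
    print(error_ratio(10, 40, 1000, 1500))
```

Both alerts also have a for: clause, so a single window above the threshold is not enough. The condition has to hold continuously for the configured duration before Alertmanager is notified.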

sudo apt-get update
sudo apt-get install percona-toolkit

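Next, create a dedicated database and table on your primary MySQL server for `pt-heartbeat` to use. The table must have the column layout `pt-heartbeat` expects (alternatively, pass --create-table and let the tool create it). This table should be replicated to all your read replicas.

-- On your primary MySQL server
CREATE DATABASE IF NOT EXISTS heartbeat;
USE heartbeat;
CREATE TABLE IF NOT EXISTS ping (
    ts VARCHAR(26) NOT NULL,
    server_id INT UNSIGNED NOT NULL PRIMARY KEY,
    file VARCHAR(255) DEFAULT NULL,
    position BIGINT UNSIGNED DEFAULT NULL,
    relay_master_log_file VARCHAR(255) DEFAULT NULL,
    exec_master_log_pos BIGINT UNSIGNED DEFAULT NULL
);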

Now, configure `pt-heartbeat` to run on your primary MySQL server, where it keeps writing a fresh timestamp to the `heartbeat.ping` table. `cron` cannot schedule a job more often than once a minute, and `pt-heartbeat --update` already loops on its own (once per second by default), so we’ll use `cron` only to start it as a daemon at boot.

# On your primary MySQL server, edit your crontab
crontab -e

# Add the following line to start the heartbeat daemon at boot:
@reboot pt-heartbeat --host=127.0.0.1 --user=your_mysql_user --password=your_mysql_password --database=heartbeat --table=ping --update --daemonize
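
The idea behind a heartbeat table is simple: the primary writes the current time, the row replicates, and a replica's lag is its own clock minus the timestamp it sees. The sketch below shows only that subtraction. `heartbeat_lag_seconds` is a hypothetical helper, and the ISO-style timestamp format is an assumption made for illustration.

```python
from datetime import datetime


def heartbeat_lag_seconds(heartbeat_ts: str, replica_now: datetime) -> float:
    """Seconds between the replicated heartbeat timestamp and the replica's clock."""
    written_at = datetime.fromisoformat(heartbeat_ts)
    return (replica_now - written_at).total_seconds()


if __name__ == "__main__":
    # The primary wrote the heartbeat at 12:00:00.5; the replica reads it at 12:00:03.
    now_on_replica = datetime(2024, 5, 1, 12, 0, 3)
    print(heartbeat_lag_seconds("2024-05-01T12:00:00.500000", now_on_replica))
```

Because the result compares clocks on two different machines, any clock drift between the primary and a replica shows up as apparent lag, so keep both synchronized.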

# On each read replica, create a monitoring script, e.g., /opt/scripts/check_replication_lag.sh
#!/bin/bash

# MySQL connection details for the replica
REPLICA_USER="your_mysql_user"
REPLICA_PASSWORD="your_mysql_password"
REPLICA_HOST="127.0.0.1" # Or the replica's IP
REPLICA_DB="heartbeat"

# Primary MySQL connection details (for pt-heartbeat to query)
PRIMARY_USER="your_mysql_user"
PRIMARY_PASSWORD="your_mysql_password"
PRIMARY_HOST="your_primary_mysql_ip" # IP of your primary MySQL server
PRIMARY_DB="heartbeat"

# Threshold for replication lag in seconds
LAG_THRESHOLD=60

# Run pt-heartbeat to get the lag
LAG=$(pt-heartbeat --host=$REPLICA_HOST --user=$REPLICA_USER --password=$REPLICA_PASSWORD --database=$REPLICA_DB --table=ping --master-server-id=1 --check-replication-lag --master-host=$PRIMARY_HOST --master-user=$PRIMARY_USER --master-password=$PRIMARY_PASSWORD --master-database=$PRIMARY_DB 2>&1)

# Check if pt-heartbeat ran successfully and returned a lag value
if [[ $LAG =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
    # Convert lag to integer for comparison
    LAG_INT=$(echo "$LAG" | cut -d. -f1)
    if [ "$LAG_INT" -gt "$LAG_THRESHOLD" ]; then
        echo "ALERT: Replication lag on replica $REPLICA_HOST is ${LAG}s, exceeding threshold of ${LAG_THRESHOLD}s."
        # Add your alerting mechanism here (e.g., send email, trigger PagerDuty)
        # Example: echo "Replication lag alert for $REPLICA_HOST" | mail -s "MySQL Replication Alert" [email protected]
    else
        echo "OK: Replication lag on replica $REPLICA_HOST is ${LAG}s."
    fi
else
    echo "ERROR: Failed to get replication lag on replica $REPLICA_HOST. Output: $LAG"
    # Add alerting for script failure
fi
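The decision logic of the script can also be expressed in Python, which is handy if you later want to feed the result into another tool. This is a minimal sketch that assumes the check prints a single number such as `0.00` on success; the function name `classify_lag` is hypothetical.

```python
import re

# A successful check is assumed to print one plain number, e.g. "0.00".
_NUMBER = re.compile(r"^\d+(\.\d+)?$")


def classify_lag(output: str, threshold: float = 60.0) -> str:
    """Return "OK", "ALERT" or "ERROR" for the text printed by a lag check."""
    text = output.strip()
    if not _NUMBER.match(text):
        return "ERROR"  # anything else is treated as a failed check
    return "ALERT" if float(text) > threshold else "OK"


print(classify_lag("0.00\n"))                # OK
print(classify_lag("75.25"))                 # ALERT
print(classify_lag("DBI connect() failed"))  # ERROR
```

Unlike the shell script, which truncates the value to an integer before comparing, this sketch compares the full decimal value, so a lag of 60.50 seconds counts as exceeding a 60-second threshold.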

Make the script executable and add it to cron on each replica to run, for instance, every minute:

chmod +x /opt/scripts/check_replication_lag.sh
# On each read replica, edit your crontab
crontab -e

# Add the following line:
* * * * * /opt/scripts/check_replication_lag.sh >> /var/log/replication_lag.log 2>&1

Python Application Performance Monitoring with Prometheus and `gunicorn`

We’ll use the prometheus_client Python library to expose application-specific metrics. `gunicorn` has no built-in Prometheus exporter, so we’ll start a metrics server from a `gunicorn` server hook in each worker.

First, install the necessary libraries:

pip install flask prometheus_client gunicorn

Next, instrument your Python application to expose metrics. Here’s a simple example using a Flask application:

from flask import Flask, Response
from prometheus_client import Counter, Gauge, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time
import random

app = Flask(__name__)

# Define some Prometheus metrics
# The 'code' label carries the HTTP status so error-rate queries can filter on it.
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP Requests', ['method', 'endpoint', 'code'])
# A Histogram exports _sum, _count and _bucket series; a Gauge would only keep the last value.
REQUEST_LATENCY = Histogram('http_request_latency_seconds', 'HTTP Request Latency', ['method', 'endpoint'])
ACTIVE_USERS = Gauge('app_active_users', 'Number of currently active users')

# Simulate active users
def update_active_users():
    ACTIVE_USERS.set(random.randint(10, 100))

@app.route('/')
def index():
    start_time = time.time()
    update_active_users()
    time.sleep(random.uniform(0.1, 0.5)) # Simulate work
    REQUEST_LATENCY.labels(method='GET', endpoint='/').observe(time.time() - start_time)
    REQUEST_COUNT.labels(method='GET', endpoint='/', code='200').inc()
    return "Hello, World!"

@app.route('/api/data')
def api_data():
    start_time = time.time()
    update_active_users()
    time.sleep(random.uniform(0.5, 1.5)) # Simulate heavier work
    REQUEST_LATENCY.labels(method='GET', endpoint='/api/data').observe(time.time() - start_time)
    REQUEST_COUNT.labels(method='GET', endpoint='/api/data', code='200').inc()
    return {"data": "some_data", "value": random.randint(1, 1000)}

@app.route('/metrics')
def metrics():
    update_active_users() # Ensure metrics are up-to-date
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    # For development, run with Flask's built-in server
    # In production, use gunicorn
    app.run(debug=True, host='0.0.0.0', port=5000)
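Average latency over a time window is normally derived from two running totals, the sum of observed durations and the number of observations, which is what Prometheus histogram and summary metrics expose as `_sum` and `_count` series. A gauge that is overwritten on every request keeps only the most recent value. The following standard-library sketch illustrates the idea; the class and function names are hypothetical and this is not the prometheus_client implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LatencyTotals:
    total_seconds: float = 0.0  # running sum, comparable to a `_sum` series
    count: int = 0              # running count, comparable to a `_count` series

    def observe(self, seconds: float) -> None:
        self.total_seconds += seconds
        self.count += 1


def average_between(before: LatencyTotals, after: LatencyTotals) -> Optional[float]:
    """Average latency of the requests observed between two snapshots."""
    requests = after.count - before.count
    if requests == 0:
        return None  # no traffic in the window, so no average exists
    return (after.total_seconds - before.total_seconds) / requests


totals = LatencyTotals()
totals.observe(0.25)
totals.observe(0.5)
snapshot = LatencyTotals(totals.total_seconds, totals.count)  # earlier scrape
totals.observe(1.0)
totals.observe(3.0)
print(average_between(snapshot, totals))  # 2.0
```

Only the two requests seen after the snapshot contribute to the result, which is why a slow period shows up even when the all-time average is still low.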

Create a `gunicorn` configuration file (e.g., gunicorn_config.py):

from prometheus_client import start_http_server

# Each gunicorn worker is a separate process with its own in-memory metrics, so
# each worker serves its metrics on its own port and Prometheus scrapes them all.
# Node Exporter already uses 9100 on the same droplet, so use a different range
# and open it to your Prometheus server only in the DigitalOcean firewall.
# An alternative is a reverse proxy (like Nginx) that routes to the app's /metrics endpoint.
METRICS_BASE_PORT = 9200 # Workers use 9200, 9201, ... (one port per worker)

def post_worker_init(worker):
    # This hook runs in each worker process after it has been initialized.
    # The worker tries the ports in the range in order and keeps the first free one,
    # so in normal operation N workers occupy the same N ports, even after a worker is replaced.
    for port in range(METRICS_BASE_PORT, METRICS_BASE_PORT + worker.cfg.workers):
        try:
            start_http_server(port) # Serves metrics from a background thread
        except OSError:
            continue # Port already taken by another worker
        worker.log.info("Prometheus metrics server started on port %s for worker %s", port, worker.pid)
        return
    worker.log.warning("No free metrics port for worker %s; consider a reverse proxy.", worker.pid)

Now, run your application using `gunicorn` with this configuration:

# Assuming your Flask app is in a file named app.py and the Flask instance is named 'app'
gunicorn -c gunicorn_config.py app:app --workers 4 --bind 0.0.0.0:8000

Configure Prometheus to scrape these metrics. In your prometheus.yml:

scrape_configs:
  - job_name: 'my_python_app'
    static_configs:
      - targets: ['your_app_droplet_ip:9200', 'your_app_droplet_ip:9201', 'your_app_droplet_ip:9202', 'your_app_droplet_ip:9203'] # One target per gunicorn worker
    # If using a reverse proxy:
    # - targets: ['your_app_droplet_ip:8080'] # Assuming Nginx on 8080 proxies /metrics

Remember to adjust the `targets` based on how you’ve configured your Prometheus metrics server. If you’re using a reverse proxy, you’ll point Prometheus to the proxy’s address and port.
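One way to keep per-worker metrics ports predictable is to try a small fixed range and keep the first port that is still free. The sketch below shows that idea with only the standard library `socket` module; the helper name `claim_port` and the loopback default are assumptions for illustration, and it is not tied to gunicorn or prometheus_client.

```python
import socket
from typing import Optional, Tuple


def claim_port(base: int, count: int,
               host: str = "127.0.0.1") -> Optional[Tuple[int, socket.socket]]:
    """Bind the first free port in [base, base + count) and return it with its socket."""
    for port in range(base, base + count):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # port is taken by another process, try the next one
            continue
        sock.listen()
        return port, sock
    return None  # every port in the range is already in use
```

With four workers and a range of four ports, each worker ends up on its own port, and a replacement worker reuses whichever port its predecessor released, so the Prometheus target list does not have to change.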

System-Level Monitoring with Node Exporter and DigitalOcean Monitoring

Download and install Node Exporter on each of your DigitalOcean droplets (app servers, database servers).

# On each droplet, download the latest release (check Prometheus website for latest URL)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Move the binary to /usr/local/bin
sudo mv node_exporter /usr/local/bin/

# Create a systemd service file for Node Exporter
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify status
sudo systemctl status node_exporter

Node Exporter listens on port 9100 by default. Allow inbound traffic to that port in your DigitalOcean firewall from your Prometheus server only, since the metrics are served without authentication by default.

Add Node Exporter targets to your Prometheus configuration:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['app_droplet_1_ip:9100', 'app_droplet_2_ip:9100', 'mysql_droplet_1_ip:9100', 'mysql_droplet_2_ip:9100']
    # Add more targets for all your droplets

To leverage DigitalOcean’s monitoring effectively:

Regularly review the “Graphs” section for each Droplet in your DigitalOcean control panel.
Set up alerts within DigitalOcean for key metrics like CPU utilization exceeding 90% for a sustained period, or disk I/O hitting critical levels. This provides a safety net.
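Correlate DigitalOcean alerts with Prometheus data. If DigitalOcean flags high CPU, use Prometheus to narrow down the cause: Node Exporter breaks CPU time down by mode (user, system, iowait), and your application and MySQL metrics show which workload changed at the same time. Node Exporter does not report per-process CPU usage, so attributing load to a specific process needs an additional exporter or a look at the droplet itself.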

Alerting Strategy with Alertmanager

First, set up Alertmanager. You can run it as a Docker container or a standalone binary. For this example, we’ll assume a basic configuration.

# alertmanager.yml
global:
  resolve_timeout: 5m
  # Email notifications need SMTP settings; the values below are placeholders.
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  # smtp_auth_username: 'your_smtp_user'
  # smtp_auth_password: 'your_smtp_password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Used when no child route matches
  # Routes for specific severities (uncomment once the receivers below are enabled)
  # routes:
  # - receiver: 'pagerduty-receiver'
  #   matchers:
  #   - severity="critical"
  #   continue: true # Keep evaluating the sibling routes that follow
  # - receiver: 'slack-receiver'
  #   matchers:
  #   - severity="warning"

receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'alerts@example.com' # Placeholder; use your own address
    send_resolved: true

# Example for Slack (uncomment and configure)
# - name: 'slack-receiver'
#   slack_configs:
#   - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
#     channel: '#alerts'
#     send_resolved: true

# Example for PagerDuty (uncomment and configure)
# - name: 'pagerduty-receiver'
#   pagerduty_configs:
#   - service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
#     send_resolved: true
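How child routes and the `continue` flag interact is easy to misread, so here is a simplified Python model of a single level of routing. It assumes exact label matching only and is a sketch of the behaviour, not Alertmanager's implementation; the function name and the route dictionaries are hypothetical.

```python
from typing import Dict, List


def receivers_for(labels: Dict[str, str], routes: List[dict], default: str) -> List[str]:
    """Return the receivers that would be notified for an alert with these labels."""
    matched = []
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            matched.append(route["receiver"])
            if not route.get("continue", False):
                break  # without `continue`, the first matching route wins
    # The parent's receiver is only used when no child route matched at all.
    return matched or [default]


routes = [
    {"receiver": "pagerduty-receiver", "match": {"severity": "critical"}, "continue": True},
    {"receiver": "slack-receiver", "match": {"service": "mysql"}},
]
print(receivers_for({"severity": "critical", "service": "mysql"}, routes, "default-receiver"))
print(receivers_for({"severity": "info"}, routes, "default-receiver"))
```

In this model `continue: true` lets a critical MySQL alert reach both PagerDuty and Slack, while an alert that matches no child route falls back to the default receiver.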

Run Alertmanager:

# Example using Docker
docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v /path/to/your/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager --config.file=/etc/alertmanager/alertmanager.yml

Now, configure Prometheus to send alerts to Alertmanager. In your prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager_droplet_ip:9093'] # IP of your Alertmanager server

Define alerting rules in Prometheus. Create a separate file (e.g., alerts.yml) and reference it under `rule_files` in your prometheus.yml:

# alerts.yml
groups:
- name: mysql_alerts
  rules:
  - alert: MySQLHighReplicationLag
    expr: avg(mysql_slave_status_seconds_behind_master{job="mysql_exporter"}) by (instance) > 60 # Assuming you have mysql_exporter running
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High MySQL replication lag on {{ $labels.instance }}"
      description: "MySQL replica {{ $labels.instance }} is lagging by more than 60 seconds for 5 minutes."

- name: python_app_alerts
  rules:
  - alert: PythonAppHighLatency
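    expr: sum by (endpoint, instance) (rate(http_request_latency_seconds_sum{job="my_python_app"}[5m])) / sum by (endpoint, instance) (rate(http_request_latency_seconds_count{job="my_python_app"}[5m])) > 2.0 # Average latency over 2 seconds; needs a histogram or summary metric that exports _sum and _count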
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High latency for Python app {{ $labels.endpoint }}"
      description: "Average request latency for {{ $labels.endpoint }} on {{ $labels.instance }} has been above 2s for 10 minutes."

  - alert: PythonAppHighErrorRate
    expr: sum by (instance) (rate(http_requests_total{job="my_python_app", code=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total{job="my_python_app"}[5m])) > 0.05 # 5% error rate per instance; requires a 'code' label on the counter
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate for Python app"
      description: "The error rate for {{ $labels.instance }} is above 5% for 5 minutes."

- name: system_alerts
  rules:
  - alert: HighCPUUsage
    expr: node_load1 > 0.8 * count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) # 1-minute load average above 80% of the CPU core count
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU load on {{ $labels.instance }} has been high for 15 minutes."

  - alert: LowDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10 # Less than 10% disk space remaining
    for: 30m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"
      description: "Filesystem on {{ $labels.instance }} has less than 10% free space."
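The `for:` clause in each rule means the expression must stay true across consecutive rule evaluations for that long before the alert fires; until then it is only pending, and one false evaluation resets the timer. The sketch below is a simplified Python model of that behaviour with a hypothetical function name, not Prometheus code.

```python
from typing import Iterable, List, Optional


def alert_states(condition_by_eval: Iterable[bool], interval_s: int, for_s: int) -> List[str]:
    """State after each evaluation, given whether the rule expression was true."""
    states = []
    held: Optional[int] = None  # seconds since the condition first became true
    for is_true in condition_by_eval:
        if not is_true:
            held = None  # a single false evaluation resets the timer
            states.append("inactive")
            continue
        held = 0 if held is None else held + interval_s
        states.append("firing" if held >= for_s else "pending")
    return states


# Evaluations every 60 seconds with `for: 5m` (300 seconds).
print(alert_states([True] * 6, interval_s=60, for_s=300))
```

With one-minute evaluations, a rule with `for: 5m` is pending for five evaluations and fires on the sixth consecutive true result, which is why short spikes do not page anyone.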

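Reload the Prometheus configuration so the new rules and the Alertmanager target take effect, for example by sending the Prometheus process a SIGHUP or, if it was started with `--web.enable-lifecycle`, by sending an HTTP POST to its `/-/reload` endpoint.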
