Server Monitoring Best Practices: Keeping Your C++ App and PostgreSQL Clusters Alive on DigitalOcean

Proactive PostgreSQL Cluster Health Checks with `pg_stat_activity` and `pg_stat_replication`

Maintaining the health of a PostgreSQL cluster, especially in a distributed environment on DigitalOcean, requires more than just basic CPU and memory monitoring. We need to dive deep into the database’s internal state. Key views like pg_stat_activity and pg_stat_replication provide invaluable real-time insights into query performance, connection pooling, and replication lag.

For a C++ application interacting with PostgreSQL, understanding long-running queries or idle-in-transaction connections is critical. These can exhaust connection pools and lead to application unresponsiveness. We can script checks to identify and alert on such conditions.

Automated PostgreSQL Health Checks Script (Bash)

This Bash script connects to your PostgreSQL primary and replica nodes, queries relevant statistics, and outputs actionable information. It’s designed to be run periodically by a scheduler like cron.

First, ensure you have psql installed and configured with appropriate connection parameters (e.g., via ~/.pgpass or environment variables like PGHOST, PGUSER, PGDATABASE, PGPASSWORD).

Primary Node Check:

#!/bin/bash

# --- Configuration ---
PRIMARY_HOST="your_primary_db_host"
REPLICA_HOST="your_replica_db_host"
PG_USER="your_db_user"
PG_DB="your_db_name"
ALERT_EMAIL="[email protected]"
LONG_QUERY_THRESHOLD_SEC=60
IDLE_IN_TX_THRESHOLD_SEC=300
REPLICATION_LAG_THRESHOLD_SEC=60

# --- Functions ---
send_alert() {
    local subject="$1"
    local message="$2"
    echo "$message" | mail -s "$subject" "$ALERT_EMAIL"
    echo "$(date): ALERT - $subject"
    echo "$message"
}

check_primary_health() {
    echo "--- Checking Primary: $PRIMARY_HOST ---"

    # Check for long-running queries
    psql -h "$PRIMARY_HOST" -U "$PG_USER" -d "$PG_DB" -t -c "
        SELECT
            pid,
            age(clock_timestamp(), query_start),
            usename,
            query
        FROM pg_stat_activity
        WHERE state = 'active' AND query_start IS NOT NULL AND clock_timestamp() - query_start > interval '${LONG_QUERY_THRESHOLD_SEC}s'
        ORDER BY query_start;" | while IFS='|' read -r pid duration user query; do
        send_alert "PostgreSQL Long Running Query on $PRIMARY_HOST" "PID: $pid\nDuration: $duration\nUser: $user\nQuery: $query"
    done

    # Check for idle-in-transaction connections
    psql -h "$PRIMARY_HOST" -U "$PG_USER" -d "$PG_DB" -t -c "
        SELECT
            pid,
            age(clock_timestamp(), state_change),
            usename,
            client_addr
        FROM pg_stat_activity
        WHERE state = 'idle in transaction' AND state_change IS NOT NULL AND clock_timestamp() - state_change > interval '${IDLE_IN_TX_THRESHOLD_SEC}s'
        ORDER BY state_change;" | while IFS='|' read -r pid duration user client_addr; do
        send_alert "PostgreSQL Idle-in-Transaction on $PRIMARY_HOST" "PID: $pid\nDuration: $duration\nUser: $user\nClient: $client_addr"
    done

    # Check for active connections (optional, for connection pool monitoring)
    active_connections=$(psql -h "$PRIMARY_HOST" -U "$PG_USER" -d "$PG_DB" -t -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';")
    echo "Active connections: $active_connections"
    # Add logic here to alert if active_connections exceeds a certain threshold
}

check_replica_health() {
    echo "--- Checking Replica: $REPLICA_HOST ---"

    # Check replication lag
    lag_info=$(psql -h "$REPLICA_HOST" -U "$PG_USER" -d "$PG_DB" -t -c "
        SELECT
            pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn(),
            pg_is_in_recovery(),
            pg_last_xact_replay_timestamp(),
            pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) AS received_lag_bytes,
            pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) AS replay_lag_bytes
        FROM pg_control_checkpoint();")

    if echo "$lag_info" | grep -q "f"; then
        send_alert "PostgreSQL Replica Not Ready on $REPLICA_HOST" "Replica is not in recovery or WAL receive/replay is not caught up."
        return
    fi

    # Extract lag in seconds (approximate)
    # This requires a bit more parsing or a more complex query depending on PG version
    # For simplicity, we'll use a basic check for now. A more robust check would involve pg_stat_replication.

    # Using pg_stat_replication for more accurate lag
    replication_lag_sec=$(psql -h "$REPLICA_HOST" -U "$PG_USER" -d "$PG_DB" -t -c "
        SELECT
            EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) AS lag_seconds
        FROM pg_stat_replication
        WHERE application_name IS NOT NULL AND pg_is_in_recovery();")

    if [ -n "$replication_lag_sec" ] && [ "$(echo "$replication_lag_sec > $REPLICATION_LAG_THRESHOLD_SEC" | bc)" -eq 1 ]; then
        send_alert "PostgreSQL Replication Lag on $REPLICA_HOST" "Replication lag is ${replication_lag_sec}s, exceeding threshold of ${REPLICATION_LAG_THRESHOLD_SEC}s."
    elif [ -z "$replication_lag_sec" ]; then
        echo "No active replication detected on replica."
    else
        echo "Replication lag: ${replication_lag_sec}s (OK)"
    fi
}

# --- Main Execution ---
check_primary_health
check_replica_health

Cron Job Setup:

To run this script every 5 minutes, add the following line to your crontab (crontab -e):

*/5 * * * * /path/to/your/pg_health_check.sh >> /var/log/pg_health_check.log 2>&1

C++ Application Performance Monitoring with Prometheus and Node Exporter

For the C++ application itself, we need to monitor system-level metrics and application-specific performance indicators. Prometheus is an excellent choice for time-series monitoring, and node_exporter provides a wealth of system metrics. For application-specific metrics, we’ll instrument our C++ code.

Instrumenting C++ for Prometheus Metrics

The prometheus-cpp client library allows you to expose custom metrics from your C++ application. This is crucial for understanding application-specific bottlenecks, such as request processing times, cache hit rates, or error counts.

1. Add the library to your project:

If using CMake, add it to your CMakeLists.txt:

# In your CMakeLists.txt
find_package(prometheus-cpp REQUIRED)
target_link_libraries(your_app_target PRIVATE prometheus-cpp::core)

Example C++ Metrics Exposure

Here’s a simplified example of how to expose a request counter and a histogram for request duration:

#include <prometheus/registry.h>
#include <prometheus/counter.h>
#include <prometheus/histogram.h>
#include <prometheus/exposer.h>
#include <chrono>
#include <thread>
#include <iostream>

// Global registry and exposer (for simplicity in this example)
std::shared_ptr<prometheus::Registry> registry;
prometheus::Exposer exposer;

// Metrics
prometheus::Family<prometheus::Counter>& request_counter_family;
prometheus::Family<prometheus::Histogram>& request_duration_family;

void initialize_metrics() {
    registry = std::make_shared<prometheus::Registry>();
    exposer.RegisterCollectable(registry);
    exposer.StartWebserver(8080); // Expose metrics on port 8080

    auto& metrics_subsystem = registry->AddLabels({{"app", "my_cpp_app"}});

    request_counter_family = metrics_subsystem.AddCounter(
        prometheus::Counter::Family::WithName("http_requests_total")
            .Add({{"method", "GET"}, {"path", "/api/v1/data"}}));

    request_duration_family = metrics_subsystem.AddHistogram(
        prometheus::Histogram::Family::WithName("http_request_duration_seconds")
            .Add({{"method", "GET"}, {"path", "/api/v1/data"}}));
}

void process_request() {
    auto start_time = std::chrono::high_resolution_clock::now();

    // Simulate request processing
    std::this_thread::sleep_for(std::chrono::milliseconds(rand() % 500 + 100));

    auto end_time = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end_time - start_time;

    // Increment counter
    request_counter_family.Get({{"method", "GET"}, {"path", "/api/v1/data"}}).Increment();

    // Observe duration
    request_duration_family.Get({{"method", "GET"}, {"path", "/api/v1/data"}}).Observe(elapsed.count());

    std::cout << "Processed request in " << elapsed.count() << "s" << std::endl;
}

int main() {
    initialize_metrics();

    while (true) {
        process_request();
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    return 0;
}

Compile this code and run it. You can then access the metrics at http://your_app_ip:8080/metrics.

DigitalOcean Droplet Monitoring with Node Exporter and Prometheus

For each DigitalOcean Droplet running your C++ application or PostgreSQL instances, we’ll deploy node_exporter. This provides essential system-level metrics.

Installing Node Exporter

Download the latest release from the Prometheus download page. For a typical Ubuntu droplet:

# Download the latest binary
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

# Extract
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz

# Move to a suitable location
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Clean up
rm -rf node_exporter-1.7.0.linux-amd64*

Running Node Exporter as a Systemd Service

Create a systemd service file to manage node_exporter:

# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Then, enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter

node_exporter will now be listening on port 9100. You can verify by curling it:

curl http://localhost:9100/metrics

Prometheus Server Configuration

Your central Prometheus server needs to be configured to scrape metrics from your C++ applications and the node_exporter instances. Edit your prometheus.yml file.

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. Default is every 1 minute.

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape Node Exporter instances on all Droplets
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - 'droplet_app_1_ip:9100'
          - 'droplet_app_2_ip:9100'
          - 'droplet_db_primary_ip:9100' # If DB nodes also run node_exporter
          - 'droplet_db_replica_ip:9100'

  # Scrape C++ application metrics
  - job_name: 'cpp_application'
    static_configs:
      - targets:
          - 'droplet_app_1_ip:8080' # Port where your C++ app exposes metrics
          - 'droplet_app_2_ip:8080'

  # Scrape PostgreSQL exporter (if you set one up, e.g., pg_exporter)
  # - job_name: 'postgres_exporter'
  #   static_configs:
  #     - targets:
  #         - 'droplet_db_primary_ip:9187' # Default port for pg_exporter
  #         - 'droplet_db_replica_ip:9187'

After updating prometheus.yml, reload Prometheus configuration:

# If running Prometheus directly
kill -HUP $(pidof prometheus)

# If running via systemd
sudo systemctl reload prometheus

Alerting with Alertmanager

Prometheus alone can detect issues, but Alertmanager handles deduplication, grouping, and routing of alerts to your team via email, Slack, PagerDuty, etc. Configure Prometheus to send alerts to Alertmanager.

Prometheus prometheus.yml (alerting section):

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager_ip:9093'] # Your Alertmanager instance address

Alerting Rules (e.g., in a file like rules.yml):

groups:
- name: postgres_alerts
  rules:
  - alert: PostgreSQLHighReplicationLag
    expr: |
      avg by (instance) (
        rate(pg_replication_lag_seconds{job="postgres_exporter"}[5m])
      ) > 60 # Alert if lag is over 60 seconds
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High PostgreSQL replication lag on {{ $labels.instance }}"
      description: "PostgreSQL replication lag on {{ $labels.instance }} is {{ $value }} seconds, exceeding the threshold."

- name: cpp_app_alerts
  rules:
  - alert: CppApplicationHighRequestLatency
    expr: |
      histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="cpp_application"}[5m])) > 2 # 95th percentile latency over 2 seconds
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High request latency for C++ app {{ $labels.instance }}"
      description: "95th percentile request latency for {{ $labels.instance }} is {{ $value }} seconds."

  - alert: CppApplicationHighErrorRate
    expr: |
      sum(rate(http_requests_total{job="cpp_application", status_code=~"5..|4.."}[5m])) by (instance)
      /
      sum(rate(http_requests_total{job="cpp_application"}[5m])) by (instance)
      > 0.05 # More than 5% of requests are errors
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate for C++ app {{ $labels.instance }}"
      description: "Error rate for {{ $labels.instance }} is {{ $value | humanizePercentage }}."

- name: node_exporter_alerts
  rules:
  - alert: HighCpuLoad
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85 # CPU usage > 85%
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU load on {{ $labels.instance }}"
      description: "CPU load on {{ $labels.instance }} is {{ $value }}%."

  - alert: LowDiskSpace
    expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10 # Less than 10% disk space free
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"
      description: "Disk space on {{ $labels.instance }} is running low ({{ $value }}% free)."

Ensure your Prometheus server is configured to load these rules (add - 'rules.yml' under the rule_files directive in prometheus.yml) and that Alertmanager is running and configured to receive alerts from Prometheus.

Log Aggregation and Analysis

While metrics provide a quantitative view, logs offer qualitative insights into application behavior and errors. Centralizing logs from your C++ applications and PostgreSQL instances is essential for debugging and auditing.

Consider using a stack like:

Fluentd/Fluent Bit: Lightweight log forwarders to collect logs from your droplets.
Elasticsearch: For indexing and storing logs.
Kibana: For visualizing and searching logs.

Configure your C++ application to log to stdout or a designated log file, and use Fluentd/Fluent Bit on each droplet to tail these logs and forward them to your Elasticsearch cluster. For PostgreSQL, configure log_destination in postgresql.conf (e.g., stderr) and use a similar log forwarding setup.

Conclusion

A robust server monitoring strategy for a C++ application and PostgreSQL cluster on DigitalOcean involves a multi-layered approach. By combining deep PostgreSQL introspection, application-level instrumentation with Prometheus, system-level metrics via Node Exporter, centralized alerting with Alertmanager, and comprehensive log aggregation, you can ensure high availability, performance, and rapid issue resolution.