Server Monitoring Best Practices: Keeping Your C++ App and PostgreSQL Clusters Alive on DigitalOcean
Proactive PostgreSQL Cluster Health Checks with `pg_stat_activity` and `pg_stat_replication`
Maintaining the health of a PostgreSQL cluster, especially in a distributed environment on DigitalOcean, requires more than just basic CPU and memory monitoring. We need to dive deep into the database’s internal state. Key views like pg_stat_activity and pg_stat_replication provide invaluable real-time insights into query performance, connection pooling, and replication lag.
For a C++ application interacting with PostgreSQL, understanding long-running queries or idle-in-transaction connections is critical. These can exhaust connection pools and lead to application unresponsiveness. We can script checks to identify and alert on such conditions.
Automated PostgreSQL Health Checks Script (Bash)
This Bash script connects to your PostgreSQL primary and replica nodes, queries relevant statistics, and outputs actionable information. It’s designed to be run periodically by a scheduler like cron.
First, ensure you have psql installed and configured with appropriate connection parameters (e.g., via ~/.pgpass or environment variables like PGHOST, PGUSER, PGDATABASE, PGPASSWORD).
Primary Node Check:
#!/bin/bash # --- Configuration --- PRIMARY_HOST="your_primary_db_host" REPLICA_HOST="your_replica_db_host" PG_USER="your_db_user" PG_DB="your_db_name" ALERT_EMAIL="[email protected]" LONG_QUERY_THRESHOLD_SEC=60 IDLE_IN_TX_THRESHOLD_SEC=300 REPLICATION_LAG_THRESHOLD_SEC=60 # --- Functions --- send_alert() { local subject="$1" local message="$2" echo "$message" | mail -s "$subject" "$ALERT_EMAIL" echo "$(date): ALERT - $subject" echo "$message" } check_primary_health() { echo "--- Checking Primary: $PRIMARY_HOST ---" # Check for long-running queries psql -h "$PRIMARY_HOST" -U "$PG_USER" -d "$PG_DB" -t -c " SELECT pid, age(clock_timestamp(), query_start), usename, query FROM pg_stat_activity WHERE state = 'active' AND query_start IS NOT NULL AND clock_timestamp() - query_start > interval '${LONG_QUERY_THRESHOLD_SEC}s' ORDER BY query_start;" | while IFS='|' read -r pid duration user query; do send_alert "PostgreSQL Long Running Query on $PRIMARY_HOST" "PID: $pid\nDuration: $duration\nUser: $user\nQuery: $query" done # Check for idle-in-transaction connections psql -h "$PRIMARY_HOST" -U "$PG_USER" -d "$PG_DB" -t -c " SELECT pid, age(clock_timestamp(), state_change), usename, client_addr FROM pg_stat_activity WHERE state = 'idle in transaction' AND state_change IS NOT NULL AND clock_timestamp() - state_change > interval '${IDLE_IN_TX_THRESHOLD_SEC}s' ORDER BY state_change;" | while IFS='|' read -r pid duration user client_addr; do send_alert "PostgreSQL Idle-in-Transaction on $PRIMARY_HOST" "PID: $pid\nDuration: $duration\nUser: $user\nClient: $client_addr" done # Check for active connections (optional, for connection pool monitoring) active_connections=$(psql -h "$PRIMARY_HOST" -U "$PG_USER" -d "$PG_DB" -t -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';") echo "Active connections: $active_connections" # Add logic here to alert if active_connections exceeds a certain threshold } check_replica_health() { echo "--- Checking Replica: $REPLICA_HOST ---" # Check replication lag lag_info=$(psql -h "$REPLICA_HOST" -U "$PG_USER" -d "$PG_DB" -t -c " SELECT pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn(), pg_is_in_recovery(), pg_last_xact_replay_timestamp(), pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) AS received_lag_bytes, pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) AS replay_lag_bytes FROM pg_control_checkpoint();") if echo "$lag_info" | grep -q "f"; then send_alert "PostgreSQL Replica Not Ready on $REPLICA_HOST" "Replica is not in recovery or WAL receive/replay is not caught up." return fi # Extract lag in seconds (approximate) # This requires a bit more parsing or a more complex query depending on PG version # For simplicity, we'll use a basic check for now. A more robust check would involve pg_stat_replication. # Using pg_stat_replication for more accurate lag replication_lag_sec=$(psql -h "$REPLICA_HOST" -U "$PG_USER" -d "$PG_DB" -t -c " SELECT EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) AS lag_seconds FROM pg_stat_replication WHERE application_name IS NOT NULL AND pg_is_in_recovery();") if [ -n "$replication_lag_sec" ] && [ "$(echo "$replication_lag_sec > $REPLICATION_LAG_THRESHOLD_SEC" | bc)" -eq 1 ]; then send_alert "PostgreSQL Replication Lag on $REPLICA_HOST" "Replication lag is ${replication_lag_sec}s, exceeding threshold of ${REPLICATION_LAG_THRESHOLD_SEC}s." elif [ -z "$replication_lag_sec" ]; then echo "No active replication detected on replica." else echo "Replication lag: ${replication_lag_sec}s (OK)" fi } # --- Main Execution --- check_primary_health check_replica_health
Cron Job Setup:
To run this script every 5 minutes, add the following line to your crontab (crontab -e):
*/5 * * * * /path/to/your/pg_health_check.sh >> /var/log/pg_health_check.log 2>&1
C++ Application Performance Monitoring with Prometheus and Node Exporter
For the C++ application itself, we need to monitor system-level metrics and application-specific performance indicators. Prometheus is an excellent choice for time-series monitoring, and node_exporter provides a wealth of system metrics. For application-specific metrics, we’ll instrument our C++ code.
Instrumenting C++ for Prometheus Metrics
The prometheus-cpp client library allows you to expose custom metrics from your C++ application. This is crucial for understanding application-specific bottlenecks, such as request processing times, cache hit rates, or error counts.
1. Add the library to your project:
If using CMake, add it to your CMakeLists.txt:
# In your CMakeLists.txt find_package(prometheus-cpp REQUIRED) target_link_libraries(your_app_target PRIVATE prometheus-cpp::core)
Example C++ Metrics Exposure
Here’s a simplified example of how to expose a request counter and a histogram for request duration:
#include <prometheus/registry.h>
#include <prometheus/counter.h>
#include <prometheus/histogram.h>
#include <prometheus/exposer.h>
#include <chrono>
#include <thread>
#include <iostream>
// Global registry and exposer (for simplicity in this example)
std::shared_ptr<prometheus::Registry> registry;
prometheus::Exposer exposer;
// Metrics
prometheus::Family<prometheus::Counter>& request_counter_family;
prometheus::Family<prometheus::Histogram>& request_duration_family;
void initialize_metrics() {
registry = std::make_shared<prometheus::Registry>();
exposer.RegisterCollectable(registry);
exposer.StartWebserver(8080); // Expose metrics on port 8080
auto& metrics_subsystem = registry->AddLabels({{"app", "my_cpp_app"}});
request_counter_family = metrics_subsystem.AddCounter(
prometheus::Counter::Family::WithName("http_requests_total")
.Add({{"method", "GET"}, {"path", "/api/v1/data"}}));
request_duration_family = metrics_subsystem.AddHistogram(
prometheus::Histogram::Family::WithName("http_request_duration_seconds")
.Add({{"method", "GET"}, {"path", "/api/v1/data"}}));
}
void process_request() {
auto start_time = std::chrono::high_resolution_clock::now();
// Simulate request processing
std::this_thread::sleep_for(std::chrono::milliseconds(rand() % 500 + 100));
auto end_time = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = end_time - start_time;
// Increment counter
request_counter_family.Get({{"method", "GET"}, {"path", "/api/v1/data"}}).Increment();
// Observe duration
request_duration_family.Get({{"method", "GET"}, {"path", "/api/v1/data"}}).Observe(elapsed.count());
std::cout << "Processed request in " << elapsed.count() << "s" << std::endl;
}
int main() {
initialize_metrics();
while (true) {
process_request();
std::this_thread::sleep_for(std::chrono::seconds(1));
}
return 0;
}
Compile this code and run it. You can then access the metrics at http://your_app_ip:8080/metrics.
DigitalOcean Droplet Monitoring with Node Exporter and Prometheus
For each DigitalOcean Droplet running your C++ application or PostgreSQL instances, we’ll deploy node_exporter. This provides essential system-level metrics.
Installing Node Exporter
Download the latest release from the Prometheus download page. For a typical Ubuntu droplet:
# Download the latest binary wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz # Extract tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz # Move to a suitable location sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/ # Clean up rm -rf node_exporter-1.7.0.linux-amd64*
Running Node Exporter as a Systemd Service
Create a systemd service file to manage node_exporter:
# /etc/systemd/system/node_exporter.service [Unit] Description=Node Exporter Wants=network-online.target After=network-online.target [Service] User=nobody Group=nogroup Type=simple ExecStart=/usr/local/bin/node_exporter [Install] WantedBy=multi-user.target
Then, enable and start the service:
sudo systemctl daemon-reload sudo systemctl enable node_exporter sudo systemctl start node_exporter sudo systemctl status node_exporter
node_exporter will now be listening on port 9100. You can verify by curling it:
curl http://localhost:9100/metrics
Prometheus Server Configuration
Your central Prometheus server needs to be configured to scrape metrics from your C++ applications and the node_exporter instances. Edit your prometheus.yml file.
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. Default is every 1 minute.
scrape_configs:
# Scrape Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Scrape Node Exporter instances on all Droplets
- job_name: 'node_exporter'
static_configs:
- targets:
- 'droplet_app_1_ip:9100'
- 'droplet_app_2_ip:9100'
- 'droplet_db_primary_ip:9100' # If DB nodes also run node_exporter
- 'droplet_db_replica_ip:9100'
# Scrape C++ application metrics
- job_name: 'cpp_application'
static_configs:
- targets:
- 'droplet_app_1_ip:8080' # Port where your C++ app exposes metrics
- 'droplet_app_2_ip:8080'
# Scrape PostgreSQL exporter (if you set one up, e.g., pg_exporter)
# - job_name: 'postgres_exporter'
# static_configs:
# - targets:
# - 'droplet_db_primary_ip:9187' # Default port for pg_exporter
# - 'droplet_db_replica_ip:9187'
After updating prometheus.yml, reload Prometheus configuration:
# If running Prometheus directly kill -HUP $(pidof prometheus) # If running via systemd sudo systemctl reload prometheus
Alerting with Alertmanager
Prometheus alone can detect issues, but Alertmanager handles deduplication, grouping, and routing of alerts to your team via email, Slack, PagerDuty, etc. Configure Prometheus to send alerts to Alertmanager.
Prometheus prometheus.yml (alerting section):
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager_ip:9093'] # Your Alertmanager instance address
Alerting Rules (e.g., in a file like rules.yml):
groups:
- name: postgres_alerts
rules:
- alert: PostgreSQLHighReplicationLag
expr: |
avg by (instance) (
rate(pg_replication_lag_seconds{job="postgres_exporter"}[5m])
) > 60 # Alert if lag is over 60 seconds
for: 5m
labels:
severity: critical
annotations:
summary: "High PostgreSQL replication lag on {{ $labels.instance }}"
description: "PostgreSQL replication lag on {{ $labels.instance }} is {{ $value }} seconds, exceeding the threshold."
- name: cpp_app_alerts
rules:
- alert: CppApplicationHighRequestLatency
expr: |
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="cpp_application"}[5m])) > 2 # 95th percentile latency over 2 seconds
for: 10m
labels:
severity: warning
annotations:
summary: "High request latency for C++ app {{ $labels.instance }}"
description: "95th percentile request latency for {{ $labels.instance }} is {{ $value }} seconds."
- alert: CppApplicationHighErrorRate
expr: |
sum(rate(http_requests_total{job="cpp_application", status_code=~"5..|4.."}[5m])) by (instance)
/
sum(rate(http_requests_total{job="cpp_application"}[5m])) by (instance)
> 0.05 # More than 5% of requests are errors
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate for C++ app {{ $labels.instance }}"
description: "Error rate for {{ $labels.instance }} is {{ $value | humanizePercentage }}."
- name: node_exporter_alerts
rules:
- alert: HighCpuLoad
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85 # CPU usage > 85%
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU load on {{ $labels.instance }}"
description: "CPU load on {{ $labels.instance }} is {{ $value }}%."
- alert: LowDiskSpace
expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10 # Less than 10% disk space free
for: 15m
labels:
severity: critical
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk space on {{ $labels.instance }} is running low ({{ $value }}% free)."
Ensure your Prometheus server is configured to load these rules (add - 'rules.yml' under the rule_files directive in prometheus.yml) and that Alertmanager is running and configured to receive alerts from Prometheus.
Log Aggregation and Analysis
While metrics provide a quantitative view, logs offer qualitative insights into application behavior and errors. Centralizing logs from your C++ applications and PostgreSQL instances is essential for debugging and auditing.
Consider using a stack like:
- Fluentd/Fluent Bit: Lightweight log forwarders to collect logs from your droplets.
- Elasticsearch: For indexing and storing logs.
- Kibana: For visualizing and searching logs.
Configure your C++ application to log to stdout or a designated log file, and use Fluentd/Fluent Bit on each droplet to tail these logs and forward them to your Elasticsearch cluster. For PostgreSQL, configure log_destination in postgresql.conf (e.g., stderr) and use a similar log forwarding setup.
Conclusion
A robust server monitoring strategy for a C++ application and PostgreSQL cluster on DigitalOcean involves a multi-layered approach. By combining deep PostgreSQL introspection, application-level instrumentation with Prometheus, system-level metrics via Node Exporter, centralized alerting with Alertmanager, and comprehensive log aggregation, you can ensure high availability, performance, and rapid issue resolution.