Server Monitoring Best Practices: Keeping Your C++ App and PostgreSQL Clusters Alive on Linode

Proactive PostgreSQL Cluster Health Checks

Maintaining the health of a PostgreSQL cluster, especially in a distributed Linode environment, requires more than just basic uptime checks. We need to monitor key performance indicators (KPIs) that directly impact query performance, data integrity, and replication lag. This involves setting up granular checks that can alert us to potential issues before they cascade into outages.

A robust monitoring strategy for PostgreSQL should include:

Replication Lag: Crucial for high availability and disaster recovery.
Connection Counts: High connection counts can indicate resource exhaustion or denial-of-service attacks.
Query Performance: Slow queries are a primary driver of application performance degradation.
Disk I/O and Space: Essential for preventing data corruption and ensuring write operations can complete.
WAL (Write-Ahead Log) Activity: High WAL generation can impact replication and disk usage.
Cache Hit Ratio: Indicates the effectiveness of PostgreSQL’s shared buffer cache.

Implementing PostgreSQL Monitoring with Prometheus and Node Exporter

Prometheus is our go-to for time-series monitoring. We’ll leverage the official postgres_exporter to expose PostgreSQL metrics and node_exporter for system-level metrics on each PostgreSQL node. Ensure Prometheus is installed and configured to scrape these exporters.

First, install and configure postgres_exporter on each PostgreSQL server. This typically involves creating a dedicated user and granting it necessary permissions.

Create a monitoring user in PostgreSQL:

CREATE USER monitor WITH PASSWORD 'your_secure_password';
GRANT pg_read_all_stats TO monitor;
GRANT pg_read_all_settings TO monitor;
-- For specific metrics, you might need additional grants, e.g.:
-- GRANT SELECT ON pg_stat_replication TO monitor;
-- GRANT SELECT ON pg_stat_activity TO monitor;

Next, configure postgres_exporter. The primary configuration is often done via environment variables or a configuration file. A common approach is to use a .pgpass file for authentication.

Create a .pgpass file for the user running postgres_exporter (e.g., postgres or a dedicated monitoring user):

hostname:port:database:username:password
localhost:5432:*:monitor:your_secure_password

Ensure the permissions are strict:

chmod 0600 ~/.pgpass

Then, configure Prometheus to scrape the exporter. Add the following to your prometheus.yml:

scrape_configs:
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-node-1:9187', 'postgres-node-2:9187', 'postgres-node-3:9187'] # Replace with your actual hostnames/IPs and port
    metric_relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        target_label: instance
        replacement: '$1'

For node_exporter, the configuration is simpler. Install it on each PostgreSQL node and add it to Prometheus’s scrape targets. If you’re using a service manager like systemd, ensure it’s running.

# Example systemd service file for node_exporter
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.textfile.directory=/var/lib/node_exporter/textfile-collector

[Install]
WantedBy=multi-user.target

And in prometheus.yml:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['postgres-node-1:9100', 'postgres-node-2:9100', 'postgres-node-3:9100'] # Replace with your actual hostnames/IPs and port
    metric_relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        target_label: instance
        replacement: '$1'

C++ Application Performance Monitoring

Monitoring your C++ application involves instrumenting it to expose relevant metrics. For applications interacting with PostgreSQL, key metrics include query execution times, connection pool usage, and error rates. We can use libraries like prometheus-cpp to expose these metrics.

First, integrate the prometheus-cpp library into your C++ project. This typically involves adding it as a dependency (e.g., via CMake) and including its headers.

Here’s a simplified example of how to expose metrics:

#include <prometheus/registry.h>
#include <prometheus/exposer.h>
#include <prometheus/counter.h>
#include <prometheus/gauge.h>
#include <prometheus/summary.h>
#include <prometheus/histogram.h>
#include <chrono>
#include <thread>
#include <iostream>

// Global Prometheus registry and exposer
std::shared_ptr<prometheus::Registry> registry;
prometheus::Exposer exposer;

// Metrics
prometheus::Family<prometheus::Counter>& query_count_family =
    prometheus::BuildCounterFamily(*registry, "app_queries_total", "Total number of database queries executed.");

prometheus::Family<prometheus::Summary>& query_duration_family =
    prometheus::BuildSummaryFamily(*registry, "app_query_duration_seconds", "Distribution of database query execution times.",
                                   prometheus::Summary::Quantiles({.05, .50, .90, .99}));

prometheus::Family<prometheus::Gauge>& db_connections_gauge_family =
    prometheus::BuildGaugeFamily(*registry, "app_db_connections_current", "Current number of active database connections.");

void initialize_metrics() {
    registry = std::make_shared<prometheus::Registry>();
    exposer.RegisterCollectable(registry);
    exposer.Start("0.0.0.0", 9400); // Expose metrics on port 9400
    std::cout << "Prometheus metrics exposed on port 9400" << std::endl;
}

void execute_query(const std::string& query) {
    auto& query_counter = query_count_family.Add({{"query", query}});
    auto& query_duration_summary = query_duration_family.Add({{"query", query}});

    // Simulate connection pool usage
    auto& db_connections_gauge = db_connections_gauge_family.Add({{"pool", "main"}});
    db_connections_gauge.Increment();

    auto start_time = std::chrono::high_resolution_clock::now();

    // --- Simulate database query execution ---
    std::this_thread::sleep_for(std::chrono::milliseconds(rand() % 500 + 50)); // Simulate latency
    // --- End simulation ---

    auto end_time = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end_time - start_time;

    query_counter.Increment();
    query_duration_summary.Observe(elapsed.count());

    db_connections_gauge.Decrement();
}

int main() {
    initialize_metrics();

    // Simulate application running and executing queries
    while (true) {
        execute_query("SELECT * FROM users WHERE id = 1");
        execute_query("INSERT INTO logs (message) VALUES ('User logged in')");
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }

    return 0;
}

Compile this code and run it on your application servers. Then, configure Prometheus to scrape these metrics:

scrape_configs:
  - job_name: 'my_cpp_app'
    static_configs:
      - targets: ['app-server-1:9400', 'app-server-2:9400'] # Replace with your actual hostnames/IPs and port
    metric_relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        target_label: instance
        replacement: '$1'

Alerting with Alertmanager

Once metrics are collected, we need to define alerts. Alertmanager is Prometheus’s companion for handling alerts. We’ll define alert rules in Prometheus and configure Alertmanager to route them to appropriate channels (e.g., Slack, PagerDuty).

Example Prometheus alert rules for PostgreSQL:

groups:
  - name: postgresql_alerts
    rules:
      - alert: PostgreSQLReplicationLagging
        expr: |
          sum(pg_replication_lag_seconds{master_role="primary"}) by (instance) > 60
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL replication lag detected on {{ $labels.instance }}"
          description: "Replication lag on {{ $labels.instance }} is greater than 60 seconds."

      - alert: HighPostgreSQLConnections
        expr: |
          sum(pg_stat_activity_count{datname="your_database_name"}) by (instance) > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High number of PostgreSQL connections on {{ $labels.instance }}"
          description: "Database 'your_database_name' on {{ $labels.instance }} has more than 500 active connections."

      - alert: LowPostgreSQLCacheHitRatio
        expr: |
          (pg_stat_database_blks_hit{datname="your_database_name"} / (pg_stat_database_blks_hit{datname="your_database_name"} + pg_stat_database_blks_read{datname="your_database_name"})) * 100 < 95
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low PostgreSQL cache hit ratio on {{ $labels.instance }}"
          description: "Cache hit ratio for database 'your_database_name' on {{ $labels.instance }} is below 95%."

      - alert: PostgreSQLDiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql/data"} / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql/data"} * 100 < 10
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on PostgreSQL data directory on {{ $labels.instance }}"
          description: "Filesystem on {{ $labels.instance }} with mountpoint '/var/lib/postgresql/data' has less than 10% free space."

And for your C++ application:

groups:
  - name: cpp_application_alerts
    rules:
      - alert: HighQueryLatency
        expr: |
          avg(rate(app_query_duration_seconds_sum{job="my_cpp_app"}[5m])) by (instance, query) / avg(rate(app_query_duration_seconds_count{job="my_cpp_app"}[5m])) by (instance, query) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High query latency for {{ $labels.query }} on {{ $labels.instance }}"
          description: "Average query latency for '{{ $labels.query }}' on {{ $labels.instance }} is over 0.5 seconds."

      - alert: ExcessiveDatabaseConnections
        expr: |
          avg(app_db_connections_current{job="my_cpp_app", pool="main"}) by (instance) > 200
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Excessive database connections from app on {{ $labels.instance }}"
          description: "Application on {{ $labels.instance }} is using more than 200 database connections."

      - alert: ApplicationQueryErrors
        expr: |
          sum(rate(app_queries_total{job="my_cpp_app", status="error"}[5m])) by (instance, query) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Application query errors detected on {{ $labels.instance }}"
          description: "Query '{{ $labels.query }}' on {{ $labels.instance }} is returning errors."

Configure Alertmanager to receive these alerts from Prometheus and route them. This involves setting up receivers (e.g., email, Slack webhook) and routes in alertmanager.yml.

Log Aggregation and Analysis

Metrics provide a high-level view, but logs are essential for deep diagnostics. Centralizing logs from your C++ applications and PostgreSQL instances allows for faster troubleshooting. We’ll use a combination of rsyslog/fluentd on the nodes and a central Elasticsearch/Loki instance.

On each PostgreSQL node, configure rsyslog to forward PostgreSQL logs to a central collector. Ensure PostgreSQL is configured to log to syslog.

# postgresql.conf
log_destination = 'syslog'
logging_collector = on
log_directory = 'log' # This might be ignored if log_destination is syslog
log_filename = 'postgresql-%a.log'
log_rotation_age = 1d
log_rotation_size = 0
log_min_duration_statement = '1s' # Log slow queries
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
log_error_verbosity = default
log_statement = 'ddl' # Log DDL statements

Then, configure rsyslog to forward these logs. On your application servers, configure rsyslog to forward application logs.

# /etc/rsyslog.d/99-remote.conf
*.* @@your_log_aggregator_host:514 # For TCP
# or
# *.* @your_log_aggregator_host:514 # For UDP

On the log aggregator side (e.g., using Fluentd), you’ll configure input plugins to receive logs and output plugins to send them to your chosen backend (Elasticsearch, Loki, S3, etc.).

# Example Fluentd configuration for receiving syslog and sending to Elasticsearch
<source>
  @type syslog
  port 514
  bind 0.0.0.0
  protocol_type tcp # or udp
  tag postgresql.log
</source>

<source>
  @type forward
  port 24224
  tag app.log
</source>

<match postgresql.log app.log>
  @type elasticsearch
  host elasticsearch.yourdomain.com
  port 9200
  logstash_format true
  logstash_prefix logstash
  include_tag_key true
  tag_key log_topic
  flush_interval 5s
</match>

For C++ applications, consider structured logging (e.g., JSON format) to make log parsing easier in your central system. Libraries like spdlog can facilitate this.

System Health and Resource Monitoring

Beyond application-specific metrics, fundamental system health is paramount. node_exporter provides CPU, memory, disk, and network statistics. We need to set up alerts for critical thresholds.

Key system metrics to monitor:

CPU Utilization: High sustained CPU can indicate performance bottlenecks or runaway processes.
Memory Usage: OOM (Out-Of-Memory) killer events are catastrophic.
Disk I/O Wait: High I/O wait times point to storage performance issues.
Network Throughput and Errors: Network saturation or errors can disrupt replication and application connectivity.
Load Average: Indicates the number of processes waiting for CPU time.

Example alerts for system resources (these would be added to your Prometheus rules file):

groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: |
          avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 < 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} is below 20% idle for 10 minutes."

      - alert: HighMemoryUsage
        expr: |
          node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Low memory available on {{ $labels.instance }}"
          description: "Less than 10% of memory is available on {{ $labels.instance }}."

      - alert: HighDiskIOWait
        expr: |
          avg by (instance) (rate(node_disk_io_time_seconds_total{device=~"sd.*|nvme.*"}[5m])) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O wait on {{ $labels.instance }}"
          description: "Disk I/O wait time on {{ $labels.instance }} is consistently high."

Regularly review these alerts and tune thresholds based on your application’s baseline performance. Combining these monitoring strategies provides a comprehensive view of your C++ application and PostgreSQL cluster health on Linode, enabling proactive issue resolution and ensuring high availability.

Server Monitoring Best Practices: Keeping Your C++ App and PostgreSQL Clusters Alive on Linode

Proactive PostgreSQL Cluster Health Checks

Implementing PostgreSQL Monitoring with Prometheus and Node Exporter

C++ Application Performance Monitoring

Alerting with Alertmanager

Log Aggregation and Analysis

System Health and Resource Monitoring

Recent Posts

Top Categories

Our Products

Our Services