Server Monitoring Best Practices: Keeping Your C++ App and PostgreSQL Clusters Alive on AWS

Proactive C++ Application Health Checks with Prometheus and Node Exporter

C++ applications running on AWS EC2 instances need health monitoring at both the system and the application level. We’ll use Prometheus for time-series data collection and Node Exporter for system-level metrics. For application-specific metrics, we’ll integrate a C++ client library for Prometheus.

First, ensure Node Exporter is installed and running on your EC2 instances. A common method is to download the latest release and run it as a systemd service.

Installing and Configuring Node Exporter

Download the appropriate binary for your EC2 instance’s architecture (e.g., amd64 for most t3/m5 instances).

wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

Create a systemd service file for Node Exporter, for example at /etc/systemd/system/node_exporter.service.

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Create the Prometheus user and start the service.

sudo useradd -rs /bin/false prometheus
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

Verify Node Exporter is running and accessible via its default port 9100.

curl http://localhost:9100/metrics
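
The endpoint answers with plain text, one sample per line, with comment lines starting with #. As a rough illustration of what to look for in that output, the sketch below pulls a single unlabelled sample out of such a response. The helper name find_sample is hypothetical, and the parsing is deliberately minimal: it assumes whitespace-separated name and value and does not handle labelled samples.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: looks for an unlabelled sample named `metric` in
// Prometheus text-format output and stores its value in `out`.
// Returns false if no matching sample line is found.
bool find_sample(const std::string& body, const std::string& metric, double& out) {
    assert(!metric.empty());
    std::istringstream lines(body);
    std::string line;
    while (std::getline(lines, line)) {
        // Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if (line.empty() || line[0] == '#') continue;
        std::istringstream fields(line);
        std::string name;
        double value = 0.0;
        // A sample line is "<name> <value>"; labelled names never equal `metric`.
        if (fields >> name >> value && name == metric) {
            out = value;
            return true;
        }
    }
    return false;
}
```

A health check could fetch the response body with any HTTP client, call find_sample(body, "node_load1", load), and treat a false return as a sign that the exporter is not reporting as expected.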

Instrumenting C++ Applications with Prometheus Client Library

To expose application-specific metrics, we’ll use prometheus-cpp, a widely used community-maintained C++ client library. This involves including the library headers, defining metrics (e.g., counters, gauges), and exposing an HTTP endpoint for Prometheus to scrape.

First, add the Prometheus C++ client library as a dependency to your build system (e.g., CMake). You’ll typically need to build it from source or use a package manager if available.

Here’s a simplified example of how to define and expose metrics:

#include <prometheus/counter.h>
#include <prometheus/exposer.h>
#include <prometheus/family.h>
#include <prometheus/gauge.h>
#include <prometheus/registry.h>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <thread>

int main() {
    // Create a Prometheus registry to hold our metrics
    auto registry = std::make_shared<prometheus::Registry>();

    // Expose metrics on HTTP port 8080
    prometheus::Exposer exposer{"0.0.0.0:8080"};
    exposer.RegisterCollectable(registry);

    // Define a counter family, then add one unlabelled counter to it
    auto& request_family = prometheus::BuildCounter()
        .Name("my_cpp_app_requests_total")
        .Help("Total number of requests processed by the application")
        .Register(*registry);
    auto& request_counter = request_family.Add({});

    // Define a gauge family, then add one unlabelled gauge to it
    auto& connections_family = prometheus::BuildGauge()
        .Name("my_cpp_app_active_connections")
        .Help("Current number of active connections")
        .Register(*registry);
    auto& active_connections = connections_family.Add({});

    // Simulate application work and metric updates
    int request_count = 0;
    int connections = 0;
    while (true) {
        // Simulate processing a request
        request_count++;
        request_counter.Increment();

        // Simulate connection changes
        if (rand() % 2 == 0) {
            connections++;
            active_connections.Increment();
        } else if (connections > 0) {
            connections--;
            active_connections.Decrement();
        }

        std::cout << "Processed request. Total: " << request_count << ", Active connections: " << connections << std::endl;

        // Sleep for a bit
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    return 0;
}

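Compile this code, linking against the library’s core and pull components (the prometheus-cpp::core and prometheus-cpp::pull targets in CMake), and run it. You can then access your application’s metrics at http://your-ec2-ip:8080/metrics.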
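
In a real server, connection counts change on code paths with early returns and exceptions, where paired Increment/Decrement calls are easy to miss. One possible pattern is an RAII guard that ties the gauge to a scope. The sketch below is an illustration, not part of the client library: FakeGauge is a hypothetical stand-in that only mimics the Increment, Decrement, and Value calls, and ConnectionGuard and handle_connection are made-up names. Because the guard is a template, it should work with any gauge type offering those two methods.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a gauge; it only mimics the calls used here.
class FakeGauge {
public:
    void Increment() { value_.fetch_add(1); }
    void Decrement() { value_.fetch_sub(1); }
    long Value() const { return value_.load(); }
private:
    std::atomic<long> value_{0};
};

// Increments the gauge on construction and decrements it on destruction,
// so early returns and exceptions cannot leave the gauge too high.
template <typename Gauge>
class ConnectionGuard {
public:
    explicit ConnectionGuard(Gauge& gauge) : gauge_(gauge) { gauge_.Increment(); }
    ~ConnectionGuard() { gauge_.Decrement(); }
    ConnectionGuard(const ConnectionGuard&) = delete;
    ConnectionGuard& operator=(const ConnectionGuard&) = delete;
private:
    Gauge& gauge_;
};

// Example handler: returns the gauge value observed while the connection is open.
long handle_connection(FakeGauge& gauge) {
    ConnectionGuard<FakeGauge> guard(gauge);
    assert(gauge.Value() >= 1);  // this connection is counted while in scope
    return gauge.Value();
}
```

The guard is non-copyable on purpose: copying it would decrement the gauge twice for a single connection.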

Configuring Prometheus to Scrape Metrics

On your Prometheus server (which could be another EC2 instance or a managed service like Amazon Managed Service for Prometheus), configure your prometheus.yml to scrape both Node Exporter and your C++ application.

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

scrape_configs:
  # Scrape Node Exporter for system metrics
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['ec2-instance-1-private-ip:9100', 'ec2-instance-2-private-ip:9100'] # Replace with your EC2 instance IPs
        labels:
          environment: 'production'
          region: 'us-east-1'

  # Scrape C++ application metrics
  - job_name: 'my_cpp_app'
    static_configs:
      - targets: ['ec2-instance-1-private-ip:8080', 'ec2-instance-2-private-ip:8080'] # Replace with your EC2 instance IPs and app port
        labels:
          environment: 'production'
          application: 'my_cpp_app'
          region: 'us-east-1'

Restart your Prometheus server after updating the configuration, or send the process a SIGHUP to reload it without a restart.

PostgreSQL Cluster Monitoring on AWS RDS/EC2

Monitoring PostgreSQL clusters, whether managed by AWS RDS or self-hosted on EC2, requires a multi-faceted approach. We’ll focus on key metrics and tools for both scenarios.

Monitoring AWS RDS PostgreSQL Instances

AWS RDS provides a wealth of built-in metrics via CloudWatch. For deeper insights, we can also leverage the pg_stat_statements extension and export these to Prometheus.

Key CloudWatch Metrics to Monitor:

CPUUtilization: High CPU can indicate inefficient queries or insufficient instance size.
DatabaseConnections: Monitor for excessive connections that could exhaust resources.
ReadIOPS and WriteIOPS: Track disk I/O to identify storage bottlenecks.
ReadLatency and WriteLatency: High latency points to slow disk performance.
FreeableMemory: Crucial for caching; low values can degrade performance.
DiskQueueDepth: Indicates I/O pressure.
NetworkReceiveThroughput and NetworkTransmitThroughput: Monitor network traffic.
ReplicaLag (AuroraReplicaLag if using Aurora): Critical for replication health.

You can set up CloudWatch Alarms for these metrics to trigger notifications (e.g., via SNS) when thresholds are breached.
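
CloudWatch alarms can be configured to fire only when a minimum number of datapoints within the evaluation window breach the threshold, which avoids paging on a single spike. The sketch below illustrates that "M out of N" idea in isolation. It is a simplified model rather than CloudWatch's actual evaluation logic: the function name should_alarm is hypothetical, only a greater-than comparison is shown, and a window with too few datapoints is simply treated as not alarming.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when at least `datapoints_to_alarm` of the most recent
// `evaluation_periods` datapoints are above `threshold`.
bool should_alarm(const std::vector<double>& datapoints, double threshold,
                  std::size_t evaluation_periods, std::size_t datapoints_to_alarm) {
    assert(datapoints_to_alarm > 0 && datapoints_to_alarm <= evaluation_periods);
    // Assumption of this sketch: with too little data, stay quiet.
    if (datapoints.size() < evaluation_periods) return false;
    std::size_t breaching = 0;
    for (std::size_t i = datapoints.size() - evaluation_periods; i < datapoints.size(); ++i) {
        if (datapoints[i] > threshold) ++breaching;
    }
    return breaching >= datapoints_to_alarm;
}
```

For CPUUtilization, a call such as should_alarm(samples, 80.0, 5, 3) models "alarm when 3 of the last 5 datapoints exceed 80 percent".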

Leveraging `pg_stat_statements` for Query Analysis

The pg_stat_statements extension tracks execution statistics of all SQL statements run by the server. This is invaluable for identifying slow or frequently executed queries.

Enabling pg_stat_statements on RDS:

Navigate to your RDS instance’s configuration in the AWS console.
Edit the instance and go to “Database options”.
In the “Parameter group” section, create a new custom parameter group or modify an existing custom one; default parameter groups cannot be edited.
Add pg_stat_statements to the shared_preload_libraries parameter.
Set pg_stat_statements.track to all (or top for only top-level statements).
Set pg_stat_statements.max to a sufficiently high value (e.g., 10000).
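Apply the parameter group to your RDS instance and reboot it. shared_preload_libraries is a static parameter, so the change takes effect only after a restart.

Once the instance is back up, run CREATE EXTENSION IF NOT EXISTS pg_stat_statements; in the database you want to inspect. You can then query the view (the column names below apply to PostgreSQL 13 and later; older versions use total_time, mean_time, and stddev_time):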

SELECT
    query,
    calls,
    total_exec_time,
    rows,
    mean_exec_time,
    stddev_exec_time
FROM
    pg_stat_statements
ORDER BY
    total_exec_time DESC
LIMIT 10;

To integrate these metrics with Prometheus, you can use a PostgreSQL exporter. The postgres_exporter is a popular choice.

Using `postgres_exporter` with Prometheus

Deploy postgres_exporter on an instance that can connect to your RDS endpoint. Configure it with connection details and the queries you want to run.

Example custom queries file for postgres_exporter (queries.yaml). Each top-level key becomes the metric name prefix, and each column listed under metrics is exported as its own metric, for example pg_stat_statements_calls_total:

pg_stat_statements_calls:
  query: "SELECT sum(calls) AS total FROM pg_stat_statements"
  metrics:
    - total:
        usage: "COUNTER"
        description: "Total number of calls for all statements"
pg_stat_statements_exec_time:
  query: "SELECT sum(total_exec_time) AS milliseconds_total FROM pg_stat_statements"
  metrics:
    - milliseconds_total:
        usage: "COUNTER"
        description: "Total execution time in ms for all statements"
pg_stat_statements_rows:
  query: "SELECT sum(rows) AS total FROM pg_stat_statements"
  metrics:
    - total:
        usage: "COUNTER"
        description: "Total number of rows returned for all statements"
pg_stat_statements_shared_buffer:
  query: "SELECT sum(shared_blks_hit)::float8 / nullif(sum(shared_blks_hit + shared_blks_read), 0) AS hit_ratio FROM pg_stat_statements"
  metrics:
    - hit_ratio:
        usage: "GAUGE"
        description: "Fraction of shared block requests served from the buffer cache"

Run the exporter, replacing user, password, and your-rds-endpoint with your own values. The connection string is passed through the DATA_SOURCE_NAME environment variable. Note that recent postgres_exporter releases mark the custom queries file as deprecated, so check the documentation for the version you deploy:

docker run --rm -p 9187:9187 \
  -v $(pwd)/queries.yaml:/queries.yaml \
  -e DATA_SOURCE_NAME="postgresql://user:password@your-rds-endpoint:5432/mydatabase?sslmode=require" \
  quay.io/prometheuscommunity/postgres-exporter \
  --extend.query-path="/queries.yaml" \
  --web.listen-address=":9187" \
  --log.level="info"

Add this exporter to your Prometheus configuration:

scrape_configs:
  # ... other jobs
  - job_name: 'rds_postgres'
    static_configs:
      - targets: ['postgres-exporter-ip:9187'] # IP of the instance running postgres_exporter
        labels:
          environment: 'production'
          cluster: 'my-pg-cluster'
          region: 'us-east-1'

Monitoring Self-Hosted PostgreSQL on EC2

For PostgreSQL instances running directly on EC2, you’ll combine Node Exporter (for OS metrics), postgres_exporter (configured with the EC2 instance’s private IP), and potentially custom scripts for specific checks.

Key PostgreSQL Metrics to Monitor (via postgres_exporter or direct queries):

pg_stat_activity: Number of active connections, their states (idle, active, waiting).
pg_locks: Detect lock contention.
pg_stat_database: Transaction rates (commits, rollbacks), cache hit ratios.
pg_stat_bgwriter: Background writer performance.
pg_stat_replication: Replication lag and status for replicas.
pg_settings: Monitor critical configuration parameters (e.g., shared_buffers, work_mem).

You can add custom queries to postgres_exporter’s configuration or run them periodically via cron jobs and push metrics to Prometheus using the Pushgateway.

Alerting Strategies with Alertmanager

Effective alerting is crucial for proactive issue resolution. Prometheus Alertmanager handles deduplication, grouping, and routing of alerts generated by Prometheus rules.

Defining Alerting Rules in Prometheus

Alerting rules are defined in YAML files and loaded by Prometheus. These rules specify conditions that, when met, trigger alerts.

groups:
  - name: cpp_app_alerts
    rules:
      - alert: HighRequestRate
        expr: rate(my_cpp_app_requests_total[5m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request rate detected for my_cpp_app"
          description: "The request rate for my_cpp_app has exceeded 1000 requests/sec for the last 5 minutes."

      - alert: LowActiveConnections
        expr: my_cpp_app_active_connections < 5
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Low active connections for my_cpp_app"
          description: "The number of active connections for my_cpp_app has dropped below 5 for 10 minutes."

  - name: postgres_alerts
    rules:
      - alert: HighPostgresCPU
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle", instance=~"ec2-instance-.*:9100"}[5m])) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU utilization on PostgreSQL instance {{ $labels.instance }}"
          description: "PostgreSQL instance {{ $labels.instance }} is experiencing high CPU utilization (idle time < 10%) for 5 minutes."

      - alert: HighReplicationLag
        expr: aws_rds_replication_lag_seconds > 300 # Assuming you're scraping RDS metrics via a specific exporter or CloudWatch integration
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High replication lag on PostgreSQL replica {{ $labels.instance }}"
          description: "Replication lag on PostgreSQL replica {{ $labels.instance }} has exceeded 300 seconds for 5 minutes."

      - alert: TooManyPostgresConnections
        expr: sum by (instance) (pg_stat_activity_count) > 100 # Assuming pg_stat_activity_count is exposed by postgres_exporter; summed because it is reported per database and state
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Too many connections on PostgreSQL instance {{ $labels.instance }}"
          description: "PostgreSQL instance {{ $labels.instance }} has had more than 100 connections for 5 minutes."

Ensure your prometheus.yml lists these rule files under its rule_files setting.
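
Two pieces of rule semantics are easy to misread: rate() works on counter increases and tolerates counter resets, and the for clause requires the condition to hold continuously before the alert fires. The sketch below models both in a simplified way. It is not Prometheus's implementation: simple_rate skips the extrapolation that the real rate() applies at the window edges, and the names Sample, simple_rate, and PendingAlert are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Sample {
    double timestamp_s;  // scrape time in seconds
    double value;        // counter value at that time
};

// Simplified per-second rate over the samples in a window. A drop in value
// is treated as a counter reset, so the new value counts as the increase.
double simple_rate(const std::vector<Sample>& window) {
    assert(window.size() >= 2);
    double increase = 0.0;
    for (std::size_t i = 1; i < window.size(); ++i) {
        const double delta = window[i].value - window[i - 1].value;
        increase += (delta >= 0.0) ? delta : window[i].value;
    }
    const double elapsed = window.back().timestamp_s - window.front().timestamp_s;
    assert(elapsed > 0.0);
    return increase / elapsed;
}

// Models the `for:` clause: the condition must hold on every evaluation
// for at least `hold_s` seconds before the alert is considered firing.
class PendingAlert {
public:
    explicit PendingAlert(double hold_s) : hold_s_(hold_s), since_s_(0.0), active_(false) {}

    // Feed one rule evaluation; returns true when the alert is firing.
    bool evaluate(double now_s, bool condition) {
        if (!condition) {
            active_ = false;  // any false evaluation resets the pending state
            return false;
        }
        if (!active_) {
            active_ = true;
            since_s_ = now_s;
        }
        return now_s - since_s_ >= hold_s_;
    }

private:
    double hold_s_;
    double since_s_;
    bool active_;
};
```

With a 5m hold, PendingAlert(300.0) stays pending through the first evaluations and a single false result starts the wait over, which is why a short for value makes alerts noisier and a long one delays them.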

Configuring Alertmanager Routing and Receivers

The alertmanager.yml file defines how alerts are grouped, silenced, and sent to various receivers (e.g., Slack, PagerDuty, email).

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver if no specific route matches

  routes:
    - match:
        severity: 'critical'
      receiver: 'pagerduty-critical'
      continue: true # Allows matching other routes if needed

    - match:
        severity: 'warning'
      receiver: 'slack-warnings'
      continue: true

receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts-default'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts-warnings'

Ensure your Prometheus configuration points to your Alertmanager instance through the alerting section of prometheus.yml.
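
The interaction between route order, continue, and the default receiver is often the confusing part of a routing tree. The sketch below models a single level of child routes under simplified assumptions: exact label equality only, no nesting, and no grouping or timing. The names Route and resolve_receivers are hypothetical and not part of Alertmanager.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> Labels;

struct Route {
    Labels match;            // every pair must equal the alert's label
    std::string receiver;
    bool continue_matching;  // mirrors `continue: true`
};

// Returns the receivers an alert would be sent to. Child routes are tried
// in order; evaluation stops at the first match unless that route sets
// continue. If nothing matches, the default receiver is used.
std::vector<std::string> resolve_receivers(const Labels& alert,
                                           const std::vector<Route>& routes,
                                           const std::string& default_receiver) {
    assert(!default_receiver.empty());
    std::vector<std::string> receivers;
    for (std::size_t i = 0; i < routes.size(); ++i) {
        const Route& route = routes[i];
        bool matches = true;
        for (Labels::const_iterator kv = route.match.begin(); kv != route.match.end(); ++kv) {
            const Labels::const_iterator found = alert.find(kv->first);
            if (found == alert.end() || found->second != kv->second) {
                matches = false;
                break;
            }
        }
        if (!matches) continue;
        receivers.push_back(route.receiver);
        if (!route.continue_matching) break;
    }
    if (receivers.empty()) receivers.push_back(default_receiver);
    return receivers;
}
```

Under this model, an alert with severity critical reaches only pagerduty-critical, while an alert with severity info matches no child route and falls back to default-receiver.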

Continuous Improvement and Observability

Monitoring is not a set-and-forget activity. Regularly review your metrics, alerts, and dashboards (e.g., using Grafana). Identify recurring issues, tune alert thresholds, and add new metrics as your application evolves. Consider implementing distributed tracing for deeper insights into request flows across microservices.