Server Monitoring Best Practices: Keeping Your C++ App and PostgreSQL Clusters Alive on AWS
Proactive C++ Application Health Checks with Prometheus and Node Exporter
For C++ applications running on AWS EC2 instances, robust health monitoring is paramount. We’ll leverage Prometheus for time-series data collection and Node Exporter for system-level metrics. For application-specific metrics, we’ll integrate the C++ client library for Prometheus.
First, ensure Node Exporter is installed and running on your EC2 instances. A common method is to download the latest release and run it as a systemd service.
Installing and Configuring Node Exporter
Download the appropriate binary for your EC2 instance’s architecture (e.g., amd64 for most t3/m5 instances).
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
Create a systemd service file for Node Exporter.
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
Create the Prometheus user and start the service.
sudo useradd -rs /bin/false prometheus
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
Verify Node Exporter is running and accessible via its default port 9100.
curl http://localhost:9100/metrics
Instrumenting C++ Applications with Prometheus Client Library
To expose application-specific metrics, we’ll use the official C++ client library. This involves including the library headers, defining metrics (e.g., counters, gauges), and exposing an HTTP endpoint for Prometheus to scrape.
First, add the Prometheus C++ client library as a dependency to your build system (e.g., CMake). You’ll typically need to build it from source or use a package manager if available.
Here’s a simplified example of how to define and expose metrics:
#include <prometheus/counter.h>
#include <prometheus/exposer.h>
#include <prometheus/family.h>
#include <prometheus/gauge.h>
#include <prometheus/registry.h>

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <thread>

int main() {
    // Create a Prometheus registry to hold our metrics
    auto registry = std::make_shared<prometheus::Registry>();

    // Expose metrics over HTTP on port 8080
    prometheus::Exposer exposer{"0.0.0.0:8080"};
    exposer.RegisterCollectable(registry);

    // Define a counter family, then add one unlabeled counter to it.
    // Register() returns a Family; Add({}) returns the metric we can update.
    auto& request_family = prometheus::BuildCounter()
                               .Name("my_cpp_app_requests_total")
                               .Help("Total number of requests processed by the application")
                               .Register(*registry);
    auto& request_counter = request_family.Add({});

    // Define a gauge family and its unlabeled gauge
    auto& connections_family = prometheus::BuildGauge()
                                   .Name("my_cpp_app_active_connections")
                                   .Help("Current number of active connections")
                                   .Register(*registry);
    auto& active_connections = connections_family.Add({});

    // Simulate application work and metric updates
    int request_count = 0;
    int connections = 0;
    while (true) {
        // Simulate processing a request
        request_count++;
        request_counter.Increment();

        // Simulate connection changes
        if (std::rand() % 2 == 0) {
            connections++;
            active_connections.Increment();
        } else if (connections > 0) {
            connections--;
            active_connections.Decrement();
        }

        std::cout << "Processed request. Total: " << request_count
                  << ", Active connections: " << connections << std::endl;

        // Sleep for a bit
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return 0;
}
Compile this code and run it. You can then access your application’s metrics at http://your-ec2-ip:8080/metrics.
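The response from that endpoint is plain text in the Prometheus exposition format. As a rough illustration of what Prometheus receives on each scrape, here is a minimal stdlib-only sketch that renders a single unlabeled metric. The helper name render_metric is hypothetical and is not part of the client library, which handles labels, escaping, and other metric types for you.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of prometheus-cpp): renders one unlabeled
// metric as "# HELP", "# TYPE", and a sample line.
// It does not escape special characters in the help text.
std::string render_metric(const std::string& name, const std::string& type,
                          const std::string& help, double value) {
    std::ostringstream out;
    out.precision(15);  // avoid losing digits on large counter values
    out << "# HELP " << name << ' ' << help << '\n'
        << "# TYPE " << name << ' ' << type << '\n'
        << name << ' ' << value << '\n';
    return out.str();
}
```

Calling render_metric("my_cpp_app_active_connections", "gauge", "Current number of active connections", 3) yields three lines, the last of which is "my_cpp_app_active_connections 3".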
Configuring Prometheus to Scrape Metrics
On your Prometheus server (which could be another EC2 instance or a managed offering such as Amazon Managed Service for Prometheus), configure your prometheus.yml to scrape both Node Exporter and your C++ application.
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

scrape_configs:
  # Scrape Node Exporter for system metrics
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['ec2-instance-1-private-ip:9100', 'ec2-instance-2-private-ip:9100'] # Replace with your EC2 instance IPs
        labels:
          environment: 'production'
          region: 'us-east-1'

  # Scrape C++ application metrics
  - job_name: 'my_cpp_app'
    static_configs:
      - targets: ['ec2-instance-1-private-ip:8080', 'ec2-instance-2-private-ip:8080'] # Replace with your EC2 instance IPs and app port
        labels:
          environment: 'production'
          application: 'my_cpp_app'
          region: 'us-east-1'
After updating the configuration, restart your Prometheus server or send it a SIGHUP so that it reloads the file.
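A counter such as my_cpp_app_requests_total only grows, so it becomes useful once you turn successive scrapes into a per-second rate. The sketch below shows the basic idea under simplifying assumptions. The Sample struct and simple_rate function are illustrative names, and unlike PromQL's rate() the function does not extrapolate to the edges of the window. It requires C++17.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// One scraped sample of a counter: timestamp in seconds and counter value.
struct Sample {
    double timestamp;
    double value;
};

// Simplified per-second rate over a window of samples (oldest first).
// A drop in value is treated as a counter reset (e.g. a process restart),
// so the post-reset value counts as the increase since the reset.
std::optional<double> simple_rate(const std::vector<Sample>& samples) {
    if (samples.size() < 2) return std::nullopt;  // need two points for a rate
    double increase = 0.0;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        const double prev = samples[i - 1].value;
        const double cur = samples[i].value;
        increase += (cur >= prev) ? (cur - prev) : cur;
    }
    const double elapsed = samples.back().timestamp - samples.front().timestamp;
    if (elapsed <= 0.0) return std::nullopt;
    return increase / elapsed;
}
```

With a 15-second scrape interval, samples of 100, 130, and 160 taken at t = 0, 15, and 30 give a rate of 2 requests per second.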
PostgreSQL Cluster Monitoring on AWS RDS/EC2
Monitoring PostgreSQL clusters, whether managed by AWS RDS or self-hosted on EC2, requires a multi-faceted approach. We’ll focus on key metrics and tools for both scenarios.
Monitoring AWS RDS PostgreSQL Instances
AWS RDS provides a wealth of built-in metrics via CloudWatch. For deeper insights, we can also leverage the pg_stat_statements extension and export these to Prometheus.
Key CloudWatch Metrics to Monitor:
- CPUUtilization: High CPU can indicate inefficient queries or insufficient instance size.
- DatabaseConnections: Monitor for excessive connections that could exhaust resources.
- ReadIOPS and WriteIOPS: Track disk I/O to identify storage bottlenecks.
- ReadLatency and WriteLatency: High latency points to slow disk performance.
- FreeableMemory: Crucial for caching; low values can degrade performance.
- DiskQueueDepth: Indicates I/O pressure.
- NetworkReceiveThroughput and NetworkTransmitThroughput: Monitor network traffic.
- ReplicaLag (for RDS read replicas) or AuroraReplicaLag (if using Aurora): Critical for replication health.
You can set up CloudWatch Alarms for these metrics to trigger notifications (e.g., via SNS) when thresholds are breached.
Leveraging pg_stat_statements for Query Analysis
The pg_stat_statements extension tracks execution statistics of all SQL statements run by the server. This is invaluable for identifying slow or frequently executed queries.
Enabling pg_stat_statements on RDS:
- Navigate to your RDS instance’s configuration in the AWS console.
- Edit the instance and go to “Database options”.
- In the “Parameter group” section, create a new parameter group or modify an existing one.
- Add pg_stat_statements to the shared_preload_libraries parameter.
- Set pg_stat_statements.track to all (or top for only top-level statements).
- Set pg_stat_statements.max to a sufficiently high value (e.g., 10000).
- Apply the parameter group to your RDS instance and reboot it for changes to take effect.
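Once the library is loaded, run CREATE EXTENSION IF NOT EXISTS pg_stat_statements; in the target database. You can then query the view (on PostgreSQL 12 and earlier the timing columns are named total_time, mean_time, and stddev_time):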
SELECT
    query,
    calls,
    total_exec_time,
    rows,
    mean_exec_time,
    stddev_exec_time
FROM
    pg_stat_statements
ORDER BY
    total_exec_time DESC
LIMIT 10;
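The values in pg_stat_statements are cumulative since the last statistics reset, so mean_exec_time reflects the whole history of a statement. To see how a statement behaved during a specific interval, you can compare two snapshots. The following is a minimal sketch of that arithmetic. StatementTotals and mean_exec_ms_between are hypothetical names, and the snapshots are assumed to have been read by your own tooling. It requires C++17.

```cpp
#include <optional>

// Cumulative totals for one statement, as read from pg_stat_statements.
struct StatementTotals {
    long long calls;
    double total_exec_time_ms;
};

// Mean execution time (ms) of the calls made between two snapshots.
// Returns nullopt when there were no new calls or when the totals went
// backwards, which suggests the statistics were reset in between.
std::optional<double> mean_exec_ms_between(const StatementTotals& earlier,
                                           const StatementTotals& later) {
    const long long new_calls = later.calls - earlier.calls;
    const double new_time = later.total_exec_time_ms - earlier.total_exec_time_ms;
    if (new_calls <= 0 || new_time < 0.0) return std::nullopt;
    return new_time / static_cast<double>(new_calls);
}
```

For example, going from 100 calls and 5000 ms to 150 calls and 9000 ms means the 50 new calls averaged 80 ms each.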
To integrate these metrics with Prometheus, you can use a PostgreSQL exporter. The postgres_exporter is a popular choice.
Using postgres_exporter with Prometheus
Deploy postgres_exporter on an instance that can connect to your RDS endpoint. Configure it with connection details and the queries you want to run.
Example custom queries file for postgres_exporter (.pg_exporter.yml), in the format read by the --extend.query-path flag. The top-level key is the metric namespace and each selected column becomes one metric, for example pg_stat_statements_calls. Recent exporter releases mark this flag as deprecated, so check the documentation for the version you deploy:
pg_stat_statements:
  query: |
    SELECT
      sum(calls) AS calls,
      sum(total_exec_time) AS exec_time,
      sum(rows) AS rows,
      sum(shared_blks_hit)::float8
        / nullif(sum(shared_blks_hit) + sum(shared_blks_read), 0) AS shared_buffer_hit_ratio
    FROM pg_stat_statements
  metrics:
    - calls:
        usage: "COUNTER"
        description: "Total number of calls for all statements"
    - exec_time:
        usage: "COUNTER"
        description: "Total execution time in ms for all statements"
    - rows:
        usage: "COUNTER"
        description: "Total number of rows returned or affected by all statements"
    - shared_buffer_hit_ratio:
        usage: "GAUGE"
        description: "Fraction (0 to 1) of shared block reads served from the buffer cache"
Run the exporter, passing the connection string through the DATA_SOURCE_NAME environment variable and replacing <your-rds-endpoint> with your instance's endpoint:
docker run --rm -p 9187:9187 \
  -v "$(pwd)/.pg_exporter.yml:/.pg_exporter.yml:ro" \
  -e DATA_SOURCE_NAME="postgresql://user:password@<your-rds-endpoint>:5432/mydatabase?sslmode=require" \
  quay.io/prometheuscommunity/postgres-exporter \
  --extend.query-path="/.pg_exporter.yml" \
  --web.listen-address=":9187" \
  --log.level="info"
Add this exporter to your Prometheus configuration:
scrape_configs:
  # ... other jobs
  - job_name: 'rds_postgres'
    static_configs:
      - targets: ['postgres-exporter-ip:9187'] # IP of the instance running postgres_exporter
        labels:
          environment: 'production'
          cluster: 'my-pg-cluster'
          region: 'us-east-1'
Monitoring Self-Hosted PostgreSQL on EC2
For PostgreSQL instances running directly on EC2, you’ll combine Node Exporter (for OS metrics), postgres_exporter (configured with the EC2 instance’s private IP), and potentially custom scripts for specific checks.
Key PostgreSQL Metrics to Monitor (via postgres_exporter or direct queries):
- pg_stat_activity: Number of connections and their states (active, idle, idle in transaction).
- pg_locks: Detect lock contention.
- pg_stat_database: Transaction rates (commits, rollbacks), cache hit ratios.
- pg_stat_bgwriter: Background writer performance.
- pg_stat_replication: Replication lag and status for replicas.
- pg_settings: Monitor critical configuration parameters (e.g., shared_buffers, work_mem).
You can add custom queries to postgres_exporter's configuration or run them periodically via cron jobs and push metrics to Prometheus using the Pushgateway.
Alerting Strategies with Alertmanager
Effective alerting is crucial for proactive issue resolution. Prometheus Alertmanager handles deduplication, grouping, and routing of alerts generated by Prometheus rules.
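To make the grouping step concrete, here is a small sketch of how alerts that share the same values for a set of labels end up in one notification group. It is only a conceptual model, not Alertmanager's implementation. Labels, group_key, and group_alerts are illustrative names, and a missing label is treated as an empty value.

```cpp
#include <map>
#include <string>
#include <vector>

using Labels = std::map<std::string, std::string>;

// Build a grouping key from the values of the chosen labels.
// A label the alert does not carry contributes an empty value.
std::string group_key(const Labels& alert,
                      const std::vector<std::string>& group_by) {
    std::string key;
    for (const auto& name : group_by) {
        auto it = alert.find(name);
        const std::string value = (it != alert.end()) ? it->second : std::string();
        key += name + "=" + value + ";";
    }
    return key;
}

// Bucket alerts so that each bucket can be sent as a single notification.
std::map<std::string, std::vector<Labels>> group_alerts(
    const std::vector<Labels>& alerts,
    const std::vector<std::string>& group_by) {
    std::map<std::string, std::vector<Labels>> groups;
    for (const auto& alert : alerts) {
        groups[group_key(alert, group_by)].push_back(alert);
    }
    return groups;
}
```

Grouping by alertname, cluster, and severity would, for instance, fold the same CPU alert from two instances into one group, because the instance label is not part of the key.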
Defining Alerting Rules in Prometheus
Alerting rules are defined in YAML files and loaded by Prometheus. These rules specify conditions that, when met, trigger alerts.
groups:
  - name: cpp_app_alerts
    rules:
      - alert: HighRequestRate
        expr: rate(my_cpp_app_requests_total[5m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request rate detected for my_cpp_app"
          description: "The request rate for my_cpp_app has exceeded 1000 requests/sec for the last 5 minutes."
      - alert: LowActiveConnections
        expr: my_cpp_app_active_connections < 5
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Low active connections for my_cpp_app"
          description: "The number of active connections for my_cpp_app has dropped below 5 for 10 minutes."

  - name: postgres_alerts
    rules:
      - alert: HighPostgresCPU
        # node_cpu_seconds_total is a counter, so take its rate to get the idle fraction per CPU
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle", instance=~"ec2-instance-.*:9100"}[5m])) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU utilization on PostgreSQL instance {{ $labels.instance }}"
          description: "PostgreSQL instance {{ $labels.instance }} is experiencing high CPU utilization (idle time < 10%) for 5 minutes."
      - alert: HighReplicationLag
        expr: aws_rds_replication_lag_seconds > 300 # Assuming you're scraping RDS metrics via a specific exporter or CloudWatch integration
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High replication lag on PostgreSQL replica {{ $labels.instance }}"
          description: "Replication lag on PostgreSQL replica {{ $labels.instance }} has exceeded 300 seconds for 5 minutes."
      - alert: TooManyPostgresConnections
        # postgres_exporter exposes pg_stat_activity_count per database and state, so sum it per instance
        expr: sum by (instance) (pg_stat_activity_count) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Too many connections on PostgreSQL instance {{ $labels.instance }}"
          description: "PostgreSQL instance {{ $labels.instance }} has had more than 100 connections for 5 minutes."
Ensure your prometheus.yml lists these rule files under its rule_files setting.
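The for field in a rule means the expression must stay true across evaluations for that long before the alert fires. Until then the alert is pending, and a single false evaluation resets it. A minimal sketch of that state machine, assuming a hypothetical ForClauseTracker class and timestamps in seconds (this models the behavior and is not Prometheus code; it requires C++17):

```cpp
#include <optional>

enum class AlertState { Inactive, Pending, Firing };

// Hypothetical tracker for one alert rule with a `for` duration.
class ForClauseTracker {
public:
    explicit ForClauseTracker(double for_seconds) : for_seconds_(for_seconds) {}

    // Feed the outcome of one rule evaluation taken at time `now`.
    AlertState evaluate(bool condition_true, double now) {
        if (!condition_true) {
            active_since_.reset();  // any false evaluation starts over
            return AlertState::Inactive;
        }
        if (!active_since_) active_since_ = now;  // first true evaluation
        return (now - *active_since_ >= for_seconds_) ? AlertState::Firing
                                                      : AlertState::Pending;
    }

private:
    double for_seconds_;
    std::optional<double> active_since_;
};
```

With for: 5m (300 seconds), a condition that becomes true at t = 0 is pending at t = 150 and firing at t = 300. If it turns false at any point, the timer starts again from the next true evaluation.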
Configuring Alertmanager Routing and Receivers
The alertmanager.yml file defines how alerts are grouped, silenced, and sent to various receivers (e.g., Slack, PagerDuty, email).
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver if no specific route matches
  routes:
    - match:
        severity: 'critical'
      receiver: 'pagerduty-critical'
      continue: true # Allows matching other routes if needed
    - match:
        severity: 'warning'
      receiver: 'slack-warnings'
      continue: true

receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts-default'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts-warnings'
Ensure your Prometheus configuration points to your Alertmanager instance through the alerting.alertmanagers section of prometheus.yml.
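It can help to reason through which receiver an alert reaches before deploying a routing tree. The sketch below models a single level of child routes with exact label matches and the continue flag, falling back to the top-level receiver when nothing matches. It is a simplified model with illustrative names (Route, resolve_receivers), and it does not cover nested routes, regex matchers, or grouping.

```cpp
#include <map>
#include <string>
#include <vector>

using Labels = std::map<std::string, std::string>;

struct Route {
    Labels match;            // every pair must equal the alert's label value
    std::string receiver;
    bool continue_matching;  // mirrors `continue: true`
};

bool matches(const Route& route, const Labels& alert) {
    for (const auto& [name, value] : route.match) {
        auto it = alert.find(name);
        if (it == alert.end() || it->second != value) return false;
    }
    return true;
}

// Child routes are tried in order. Matching stops at the first match unless
// that route sets continue. With no match, the default receiver is used.
std::vector<std::string> resolve_receivers(const std::string& default_receiver,
                                           const std::vector<Route>& routes,
                                           const Labels& alert) {
    std::vector<std::string> receivers;
    for (const auto& route : routes) {
        if (!matches(route, alert)) continue;
        receivers.push_back(route.receiver);
        if (!route.continue_matching) break;
    }
    if (receivers.empty()) receivers.push_back(default_receiver);
    return receivers;
}
```

Under this model, an alert with severity critical resolves to pagerduty-critical only, a warning resolves to slack-warnings, and an alert with severity info falls through to default-receiver.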
Continuous Improvement and Observability
Monitoring is not a set-and-forget activity. Regularly review your metrics, alerts, and dashboards (e.g., using Grafana). Identify recurring issues, tune alert thresholds, and add new metrics as your application evolves. Consider implementing distributed tracing for deeper insights into request flows across microservices.