Server Monitoring Best Practices: Keeping Your C++ App and PostgreSQL Clusters Alive on Linode
Proactive PostgreSQL Cluster Health Checks
Maintaining the health of a PostgreSQL cluster, especially in a distributed Linode environment, requires more than just basic uptime checks. We need to monitor key performance indicators (KPIs) that directly impact query performance, data integrity, and replication lag. This involves setting up granular checks that can alert us to potential issues before they cascade into outages.
A robust monitoring strategy for PostgreSQL should include:
- Replication Lag: Crucial for high availability and disaster recovery.
- Connection Counts: High connection counts can indicate resource exhaustion or denial-of-service attacks.
- Query Performance: Slow queries are a primary driver of application performance degradation.
- Disk I/O and Space: Essential for preventing data corruption and ensuring write operations can complete.
- WAL (Write-Ahead Log) Activity: High WAL generation can impact replication and disk usage.
- Cache Hit Ratio: Indicates the effectiveness of PostgreSQL’s shared buffer cache.
Implementing PostgreSQL Monitoring with Prometheus and Node Exporter
Prometheus is our go-to for time-series monitoring. We’ll leverage the official postgres_exporter to expose PostgreSQL metrics and node_exporter for system-level metrics on each PostgreSQL node. Ensure Prometheus is installed and configured to scrape these exporters.
First, install and configure postgres_exporter on each PostgreSQL server. This typically involves creating a dedicated user and granting it necessary permissions.
Create a monitoring user in PostgreSQL:
CREATE USER monitor WITH PASSWORD 'your_secure_password'; GRANT pg_read_all_stats TO monitor; GRANT pg_read_all_settings TO monitor; -- For specific metrics, you might need additional grants, e.g.: -- GRANT SELECT ON pg_stat_replication TO monitor; -- GRANT SELECT ON pg_stat_activity TO monitor;
Next, configure postgres_exporter. The primary configuration is often done via environment variables or a configuration file. A common approach is to use a .pgpass file for authentication.
Create a .pgpass file for the user running postgres_exporter (e.g., postgres or a dedicated monitoring user):
hostname:port:database:username:password localhost:5432:*:monitor:your_secure_password
Ensure the permissions are strict:
chmod 0600 ~/.pgpass
Then, configure Prometheus to scrape the exporter. Add the following to your prometheus.yml:
scrape_configs:
- job_name: 'postgres'
static_configs:
- targets: ['postgres-node-1:9187', 'postgres-node-2:9187', 'postgres-node-3:9187'] # Replace with your actual hostnames/IPs and port
metric_relabel_configs:
- source_labels: [__address__]
regex: '([^:]+):.*'
target_label: instance
replacement: '$1'
For node_exporter, the configuration is simpler. Install it on each PostgreSQL node and add it to Prometheus’s scrape targets. If you’re using a service manager like systemd, ensure it’s running.
# Example systemd service file for node_exporter [Unit] Description=Node Exporter Wants=network-online.target After=network-online.target [Service] User=node_exporter ExecStart=/usr/local/bin/node_exporter \ --collector.textfile.directory=/var/lib/node_exporter/textfile-collector [Install] WantedBy=multi-user.target
And in prometheus.yml:
scrape_configs:
- job_name: 'node'
static_configs:
- targets: ['postgres-node-1:9100', 'postgres-node-2:9100', 'postgres-node-3:9100'] # Replace with your actual hostnames/IPs and port
metric_relabel_configs:
- source_labels: [__address__]
regex: '([^:]+):.*'
target_label: instance
replacement: '$1'
C++ Application Performance Monitoring
Monitoring your C++ application involves instrumenting it to expose relevant metrics. For applications interacting with PostgreSQL, key metrics include query execution times, connection pool usage, and error rates. We can use libraries like prometheus-cpp to expose these metrics.
First, integrate the prometheus-cpp library into your C++ project. This typically involves adding it as a dependency (e.g., via CMake) and including its headers.
Here’s a simplified example of how to expose metrics:
#include <prometheus/registry.h>
#include <prometheus/exposer.h>
#include <prometheus/counter.h>
#include <prometheus/gauge.h>
#include <prometheus/summary.h>
#include <prometheus/histogram.h>
#include <chrono>
#include <thread>
#include <iostream>
// Global Prometheus registry and exposer
std::shared_ptr<prometheus::Registry> registry;
prometheus::Exposer exposer;
// Metrics
prometheus::Family<prometheus::Counter>& query_count_family =
prometheus::BuildCounterFamily(*registry, "app_queries_total", "Total number of database queries executed.");
prometheus::Family<prometheus::Summary>& query_duration_family =
prometheus::BuildSummaryFamily(*registry, "app_query_duration_seconds", "Distribution of database query execution times.",
prometheus::Summary::Quantiles({.05, .50, .90, .99}));
prometheus::Family<prometheus::Gauge>& db_connections_gauge_family =
prometheus::BuildGaugeFamily(*registry, "app_db_connections_current", "Current number of active database connections.");
void initialize_metrics() {
registry = std::make_shared<prometheus::Registry>();
exposer.RegisterCollectable(registry);
exposer.Start("0.0.0.0", 9400); // Expose metrics on port 9400
std::cout << "Prometheus metrics exposed on port 9400" << std::endl;
}
void execute_query(const std::string& query) {
auto& query_counter = query_count_family.Add({{"query", query}});
auto& query_duration_summary = query_duration_family.Add({{"query", query}});
// Simulate connection pool usage
auto& db_connections_gauge = db_connections_gauge_family.Add({{"pool", "main"}});
db_connections_gauge.Increment();
auto start_time = std::chrono::high_resolution_clock::now();
// --- Simulate database query execution ---
std::this_thread::sleep_for(std::chrono::milliseconds(rand() % 500 + 50)); // Simulate latency
// --- End simulation ---
auto end_time = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = end_time - start_time;
query_counter.Increment();
query_duration_summary.Observe(elapsed.count());
db_connections_gauge.Decrement();
}
int main() {
initialize_metrics();
// Simulate application running and executing queries
while (true) {
execute_query("SELECT * FROM users WHERE id = 1");
execute_query("INSERT INTO logs (message) VALUES ('User logged in')");
std::this_thread::sleep_for(std::chrono::seconds(5));
}
return 0;
}
Compile this code and run it on your application servers. Then, configure Prometheus to scrape these metrics:
scrape_configs:
- job_name: 'my_cpp_app'
static_configs:
- targets: ['app-server-1:9400', 'app-server-2:9400'] # Replace with your actual hostnames/IPs and port
metric_relabel_configs:
- source_labels: [__address__]
regex: '([^:]+):.*'
target_label: instance
replacement: '$1'
Alerting with Alertmanager
Once metrics are collected, we need to define alerts. Alertmanager is Prometheus’s companion for handling alerts. We’ll define alert rules in Prometheus and configure Alertmanager to route them to appropriate channels (e.g., Slack, PagerDuty).
Example Prometheus alert rules for PostgreSQL:
groups:
- name: postgresql_alerts
rules:
- alert: PostgreSQLReplicationLagging
expr: |
sum(pg_replication_lag_seconds{master_role="primary"}) by (instance) > 60
for: 5m
labels:
severity: critical
annotations:
summary: "PostgreSQL replication lag detected on {{ $labels.instance }}"
description: "Replication lag on {{ $labels.instance }} is greater than 60 seconds."
- alert: HighPostgreSQLConnections
expr: |
sum(pg_stat_activity_count{datname="your_database_name"}) by (instance) > 500
for: 10m
labels:
severity: warning
annotations:
summary: "High number of PostgreSQL connections on {{ $labels.instance }}"
description: "Database 'your_database_name' on {{ $labels.instance }} has more than 500 active connections."
- alert: LowPostgreSQLCacheHitRatio
expr: |
(pg_stat_database_blks_hit{datname="your_database_name"} / (pg_stat_database_blks_hit{datname="your_database_name"} + pg_stat_database_blks_read{datname="your_database_name"})) * 100 < 95
for: 15m
labels:
severity: warning
annotations:
summary: "Low PostgreSQL cache hit ratio on {{ $labels.instance }}"
description: "Cache hit ratio for database 'your_database_name' on {{ $labels.instance }} is below 95%."
- alert: PostgreSQLDiskSpaceLow
expr: |
node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql/data"} / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql/data"} * 100 < 10
for: 30m
labels:
severity: critical
annotations:
summary: "Low disk space on PostgreSQL data directory on {{ $labels.instance }}"
description: "Filesystem on {{ $labels.instance }} with mountpoint '/var/lib/postgresql/data' has less than 10% free space."
And for your C++ application:
groups:
- name: cpp_application_alerts
rules:
- alert: HighQueryLatency
expr: |
avg(rate(app_query_duration_seconds_sum{job="my_cpp_app"}[5m])) by (instance, query) / avg(rate(app_query_duration_seconds_count{job="my_cpp_app"}[5m])) by (instance, query) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "High query latency for {{ $labels.query }} on {{ $labels.instance }}"
description: "Average query latency for '{{ $labels.query }}' on {{ $labels.instance }} is over 0.5 seconds."
- alert: ExcessiveDatabaseConnections
expr: |
avg(app_db_connections_current{job="my_cpp_app", pool="main"}) by (instance) > 200
for: 10m
labels:
severity: warning
annotations:
summary: "Excessive database connections from app on {{ $labels.instance }}"
description: "Application on {{ $labels.instance }} is using more than 200 database connections."
- alert: ApplicationQueryErrors
expr: |
sum(rate(app_queries_total{job="my_cpp_app", status="error"}[5m])) by (instance, query) > 0
for: 2m
labels:
severity: critical
annotations:
summary: "Application query errors detected on {{ $labels.instance }}"
description: "Query '{{ $labels.query }}' on {{ $labels.instance }} is returning errors."
Configure Alertmanager to receive these alerts from Prometheus and route them. This involves setting up receivers (e.g., email, Slack webhook) and routes in alertmanager.yml.
Log Aggregation and Analysis
Metrics provide a high-level view, but logs are essential for deep diagnostics. Centralizing logs from your C++ applications and PostgreSQL instances allows for faster troubleshooting. We’ll use a combination of rsyslog/fluentd on the nodes and a central Elasticsearch/Loki instance.
On each PostgreSQL node, configure rsyslog to forward PostgreSQL logs to a central collector. Ensure PostgreSQL is configured to log to syslog.
# postgresql.conf log_destination = 'syslog' logging_collector = on log_directory = 'log' # This might be ignored if log_destination is syslog log_filename = 'postgresql-%a.log' log_rotation_age = 1d log_rotation_size = 0 log_min_duration_statement = '1s' # Log slow queries log_checkpoints = on log_connections = on log_disconnections = on log_lock_waits = on log_temp_files = 0 log_autovacuum_min_duration = 0 log_error_verbosity = default log_statement = 'ddl' # Log DDL statements
Then, configure rsyslog to forward these logs. On your application servers, configure rsyslog to forward application logs.
# /etc/rsyslog.d/99-remote.conf *.* @@your_log_aggregator_host:514 # For TCP # or # *.* @your_log_aggregator_host:514 # For UDP
On the log aggregator side (e.g., using Fluentd), you’ll configure input plugins to receive logs and output plugins to send them to your chosen backend (Elasticsearch, Loki, S3, etc.).
# Example Fluentd configuration for receiving syslog and sending to Elasticsearch <source> @type syslog port 514 bind 0.0.0.0 protocol_type tcp # or udp tag postgresql.log </source> <source> @type forward port 24224 tag app.log </source> <match postgresql.log app.log> @type elasticsearch host elasticsearch.yourdomain.com port 9200 logstash_format true logstash_prefix logstash include_tag_key true tag_key log_topic flush_interval 5s </match>
For C++ applications, consider structured logging (e.g., JSON format) to make log parsing easier in your central system. Libraries like spdlog can facilitate this.
System Health and Resource Monitoring
Beyond application-specific metrics, fundamental system health is paramount. node_exporter provides CPU, memory, disk, and network statistics. We need to set up alerts for critical thresholds.
Key system metrics to monitor:
- CPU Utilization: High sustained CPU can indicate performance bottlenecks or runaway processes.
- Memory Usage: OOM (Out-Of-Memory) killer events are catastrophic.
- Disk I/O Wait: High I/O wait times point to storage performance issues.
- Network Throughput and Errors: Network saturation or errors can disrupt replication and application connectivity.
- Load Average: Indicates the number of processes waiting for CPU time.
Example alerts for system resources (these would be added to your Prometheus rules file):
groups:
- name: system_alerts
rules:
- alert: HighCPUUsage
expr: |
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 < 20
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage on {{ $labels.instance }} is below 20% idle for 10 minutes."
- alert: HighMemoryUsage
expr: |
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
for: 15m
labels:
severity: critical
annotations:
summary: "Low memory available on {{ $labels.instance }}"
description: "Less than 10% of memory is available on {{ $labels.instance }}."
- alert: HighDiskIOWait
expr: |
avg by (instance) (rate(node_disk_io_time_seconds_total{device=~"sd.*|nvme.*"}[5m])) > 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "High disk I/O wait on {{ $labels.instance }}"
description: "Disk I/O wait time on {{ $labels.instance }} is consistently high."
Regularly review these alerts and tune thresholds based on your application’s baseline performance. Combining these monitoring strategies provides a comprehensive view of your C++ application and PostgreSQL cluster health on Linode, enabling proactive issue resolution and ensuring high availability.