Server Monitoring Best Practices: Keeping Your C App and PostgreSQL Clusters Alive on DigitalOcean
Proactive PostgreSQL Cluster Health Checks
Maintaining the health of a PostgreSQL cluster, especially in a distributed cloud environment like DigitalOcean, requires more than just basic uptime checks. We need to go deeper, monitoring key performance indicators (KPIs) that signal potential issues before they impact application availability. This involves a multi-layered approach, starting with essential system metrics and progressing to PostgreSQL-specific diagnostics.
System-Level Monitoring with Prometheus and Node Exporter
For robust system-level monitoring, Prometheus is the de facto standard. We’ll deploy node_exporter on each DigitalOcean Droplet hosting a PostgreSQL instance. This exporter exposes a wealth of system metrics that Prometheus can scrape.
First, ensure you have Prometheus installed and configured. Then, on each PostgreSQL Droplet, install node_exporter. A common method is to download the pre-compiled binary:
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/ sudo useradd -rs /bin/false node_exporter sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
Next, create a systemd service file to manage the node_exporter process:
[Unit] Description=Node Exporter Wants=network-online.target After=network-online.target [Service] User=node_exporter Group=node_exporter Type=simple ExecStart=/usr/local/bin/node_exporter [Install] WantedBy=multi-user.target
Save this as /etc/systemd/system/node_exporter.service, then enable and start it:
sudo systemctl daemon-reload sudo systemctl enable node_exporter sudo systemctl start node_exporter sudo systemctl status node_exporter
Now, configure your Prometheus server to scrape these targets. Add the following to your prometheus.yml configuration:
scrape_configs:
- job_name: 'node_exporter'
static_configs:
- targets: [':9100', ':9100', ':9100'] # Replace with your PostgreSQL Droplet IPs
Restart Prometheus for the changes to take effect. Key metrics to monitor here include:
node_cpu_seconds_total: CPU utilization trends.node_memory_MemAvailable_bytes: Available memory. Low availability is a critical indicator.node_disk_io_time_seconds_total: Disk I/O wait times. High values suggest I/O bottlenecks.node_network_receive_errs_totalandnode_network_transmit_errs_total: Network errors.
PostgreSQL-Specific Monitoring with the PostgreSQL Exporter
While node_exporter gives us system context, we need PostgreSQL-specific metrics. The postgres_exporter is an excellent choice for this. It queries PostgreSQL directly and exposes metrics in a Prometheus-compatible format.
Install postgres_exporter on a dedicated monitoring Droplet or one of the PostgreSQL nodes (though a separate node is preferred for isolation). Download the binary similarly to node_exporter.
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.13.0/postgres_exporter-0.13.0.linux-amd64.tar.gz tar xvfz postgres_exporter-0.13.0.linux-amd64.tar.gz sudo mv postgres_exporter-0.13.0.linux-amd64/postgres_exporter /usr/local/bin/ sudo useradd -rs /bin/false postgres_exporter sudo chown postgres_exporter:postgres_exporter /usr/local/bin/postgres_exporter
Create a PostgreSQL user with minimal privileges for the exporter to connect with. This user should only have `CONNECT` privilege on the databases you want to monitor and `SELECT` on specific system views.
-- Connect to your PostgreSQL instance as a superuser CREATE USER monitoring_user WITH PASSWORD 'your_secure_password'; GRANT CONNECT ON DATABASE your_database_name TO monitoring_user; -- On each database you want to monitor: GRANT USAGE ON SCHEMA pg_catalog TO monitoring_user; GRANT SELECT ON pg_stat_activity TO monitoring_user; GRANT SELECT ON pg_stat_database TO monitoring_user; GRANT SELECT ON pg_stat_replication TO monitoring_user; GRANT SELECT ON pg_stat_statements TO monitoring_user; -- If pg_stat_statements is enabled GRANT SELECT ON pg_locks TO monitoring_user; GRANT SELECT ON pg_settings TO monitoring_user;
Create a .pgpass file for the postgres_exporter user to authenticate:
# ~/.pgpass hostname:port:database:username:password localhost:5432:your_database_name:monitoring_user:your_secure_password
Set appropriate permissions:
sudo chown postgres_exporter:postgres_exporter ~/.pgpass sudo chmod 600 ~/.pgpass
Now, create a systemd service for postgres_exporter. You’ll need to specify the connection string. If running on the same host as PostgreSQL, it might be simpler. For remote connections, use the full DSN.
[Unit] Description=PostgreSQL Exporter Wants=network-online.target After=network-online.target [Service] User=postgres_exporter Group=postgres_exporter Type=simple # Example for a local connection: # ExecStart=/usr/local/bin/postgres_exporter --web.listen-address=":9187" --extend.queries-path=/etc/postgres_exporter/queries.yaml # Example for a remote connection (replace with your actual DSN): ExecStart=/usr/local/bin/postgres_exporter --web.listen-address=":9187" --pg.connstr="postgresql://monitoring_user:your_secure_password@:5432/your_database_name" --extend.queries-path=/etc/postgres_exporter/queries.yaml [Install] WantedBy=multi-user.target
Save this as /etc/systemd/system/postgres_exporter.service. You might also want to create a custom queries.yaml for additional metrics. Then, enable and start the service:
sudo systemctl daemon-reload sudo systemctl enable postgres_exporter sudo systemctl start postgres_exporter sudo systemctl status postgres_exporter
Configure Prometheus to scrape the postgres_exporter:
scrape_configs:
- job_name: 'postgres_exporter'
static_configs:
- targets: [':9187'] # IP of the Droplet running postgres_exporter
Key PostgreSQL metrics to monitor include:
pg_up: Basic connectivity check.pg_stat_database_numbackends: Number of active connections. High numbers can indicate connection pool exhaustion or slow queries.pg_stat_replication_lag_seconds: Replication lag for standby servers. Critical for HA/DR.pg_stat_database_blks_readandpg_stat_database_blks_hit: Cache hit ratio. Low hit ratios suggest insufficient memory or inefficient queries.pg_stat_activity_count: Number of active queries, broken down by state (active, idle, etc.).pg_locks_count: Number of active locks. High contention is a major performance killer.pg_settings_max_connections: Monitor against actual usage.
Application-Level Monitoring with Custom Metrics
Your C application is the direct consumer of PostgreSQL. Monitoring its interaction with the database is crucial. This involves instrumenting your C code to expose custom metrics that can be scraped by Prometheus.
The prometheus-client-c library is a good starting point. You’ll need to integrate it into your C application’s build process.
# Example build integration (using CMake) # Add this to your CMakeLists.txt find_package(prometheus_client_c REQUIRED) target_link_libraries(your_app PRIVATE prometheus_client_c::prometheus_client_c)
In your C code, you can define and update metrics:
#include <prometheus/counter.h>
#include <prometheus/registry.h>
#include <prometheus/exposer.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
// Global registry and exposer
prometheusio_registry_t* registry = NULL;
prometheusio_exposer_t* exposer = NULL;
// Define a counter for successful database operations
prometheusio_counter_t* db_operations_success_counter = NULL;
void initialize_prometheus_metrics() {
// Create a new registry
registry = prometheusio_registry_new();
if (!registry) {
fprintf(stderr, "Failed to create Prometheus registry\n");
exit(EXIT_FAILURE);
}
// Create a counter metric
db_operations_success_counter = prometheusio_counter_new(
"app_db_operations_total",
"Total number of successful database operations.",
prometheusio_labels_new("operation_type", NULL) // Example label
);
if (!db_operations_success_counter) {
fprintf(stderr, "Failed to create success counter metric\n");
prometheusio_registry_destroy(registry);
exit(EXIT_FAILURE);
}
prometheusio_registry_register(registry, db_operations_success_counter);
// Create an HTTP exposer to serve metrics on port 9394
exposer = prometheusio_exposer_new(registry, "0.0.0.0", "9394");
if (!exposer) {
fprintf(stderr, "Failed to create Prometheus exposer\n");
prometheusio_registry_destroy(registry);
exit(EXIT_FAILURE);
}
prometheusio_exposer_start(exposer);
printf("Prometheus metrics exposed on http://0.0.0.0:9394/metrics\n");
}
void increment_db_success(const char* op_type) {
if (db_operations_success_counter) {
prometheusio_counter_inc(db_operations_success_counter, prometheusio_labels_new("operation_type", op_type, NULL));
}
}
int main() {
initialize_prometheus_metrics();
// Simulate database operations
while (1) {
// ... perform database operation ...
printf("Performing a 'read' operation...\n");
increment_db_success("read");
sleep(5);
// ... perform another database operation ...
printf("Performing a 'write' operation...\n");
increment_db_success("write");
sleep(10);
}
// Cleanup (though this loop is infinite in this example)
prometheusio_exposer_destroy(exposer);
prometheusio_registry_destroy(registry);
return 0;
}
Compile and run your C application. Then, configure Prometheus to scrape your application’s metrics endpoint (e.g., http://).
scrape_configs:
- job_name: 'my_c_app'
static_configs:
- targets: [':9394'] # IP of the Droplet running your C app
Custom metrics to consider:
- Database query latency (e.g.,
app_db_query_duration_secondshistogram). - Number of failed database operations (
app_db_operations_failed_totalcounter). - Connection pool usage (
app_db_pool_connections_currentgauge). - Specific application logic timings that depend on database interaction.
Alerting with Alertmanager
Collecting metrics is only half the battle. We need to be notified when things go wrong. Alertmanager integrates with Prometheus to handle alerts. Define alert rules in Prometheus based on the metrics we’re collecting.
Example alert rule for high PostgreSQL replication lag:
groups:
- name: postgresql_alerts
rules:
- alert: HighReplicationLag
expr: pg_replication_lag_seconds > 60 # Alert if lag is over 60 seconds
for: 5m # Condition must be true for 5 minutes
labels:
severity: critical
annotations:
summary: "High replication lag detected on {{ $labels.instance }}"
description: "PostgreSQL replication lag on {{ $labels.instance }} is {{ $value }} seconds, exceeding the threshold."
Example alert rule for low available memory on a PostgreSQL node:
groups:
- name: node_alerts
rules:
- alert: LowAvailableMemory
expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10 # Less than 10% available memory
for: 10m
labels:
severity: warning
annotations:
summary: "Low available memory on {{ $labels.instance }}"
description: "Droplet {{ $labels.instance }} has only {{ printf "%.2f" (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) }}% available memory."
Configure Alertmanager to route these alerts to your preferred notification channels (Slack, PagerDuty, email, etc.). Ensure your Alertmanager configuration points to your Prometheus server.
Log Aggregation and Analysis
Metrics tell you *what* is happening, but logs often tell you *why*. Centralizing logs from your PostgreSQL instances and C application is essential for debugging and incident response.
Consider using a log aggregation tool like Loki, Elasticsearch, or a managed service. For Loki, you’d deploy Promtail on each Droplet to collect logs and forward them to your Loki instance.
Configure Promtail to tail PostgreSQL logs (typically found in /var/log/postgresql/postgresql-*.log) and your C application’s logs.
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: varlogs
__path__: /var/log/postgresql/postgresql-*.log
- job_name: my_c_app_logs
static_configs:
- targets:
- localhost
labels:
job: my_c_app
__path__: /var/log/my_c_app/*.log # Adjust path as needed
In Grafana, you can then query Loki to visualize and search your logs alongside your metrics, providing a holistic view of your system’s health.
Regular Health Checks and Maintenance
Beyond automated monitoring, schedule regular manual checks and maintenance. This includes:
- VACUUM and ANALYZE: Ensure these are running regularly (either via cron jobs or PostgreSQL’s autovacuum). Monitor
pg_stat_user_tablesforlast_autovacuumandlast_autoanalyze. - Disk Space: Proactively monitor disk usage on all Droplets.
- Backups: Verify that your PostgreSQL backups are completing successfully and test restores periodically.
- Connection Pooling: If using a connection pooler like PgBouncer, monitor its metrics as well.
- PostgreSQL Version Updates: Keep your PostgreSQL instances patched and updated.
By implementing this comprehensive monitoring strategy, you can significantly improve the reliability and performance of your C application and PostgreSQL clusters on DigitalOcean, moving from reactive firefighting to proactive system management.