Server Monitoring Best Practices: Keeping Your C App and PostgreSQL Clusters Alive on Linode
Proactive PostgreSQL Cluster Health Checks with `pg_stat_activity` and `pg_stat_replication`
Maintaining the health of a PostgreSQL cluster, especially in a high-availability setup on Linode, requires more than just basic CPU and memory monitoring. We need to dive deep into PostgreSQL’s internal statistics to identify potential bottlenecks and replication issues before they impact application performance. This section focuses on leveraging `pg_stat_activity` and `pg_stat_replication` views for granular insights.
For a robust monitoring solution, we’ll craft a custom script that queries these views and flags anomalies. This script can be scheduled via cron and its output can be fed into a centralized logging or alerting system (e.g., Prometheus Alertmanager, Grafana Loki). We’ll use `psql` for querying, ensuring it’s configured with appropriate connection parameters.
Monitoring `pg_stat_activity` for Long-Running Queries and Locks
The `pg_stat_activity` view provides a real-time snapshot of what each server process is doing. Key metrics to watch include the duration of queries, the presence of locks, and idle connections that might indicate application issues. We’ll set thresholds for query duration and the number of active connections.
Here’s a Python script that connects to a PostgreSQL instance and checks for long-running queries and excessive idle connections. This script assumes you have a `~/.pgpass` file configured for passwordless authentication or are using environment variables.
Prerequisites:
- Python 3 installed on the monitoring host.
- `psycopg2` library installed (`pip install psycopg2-binary`).
- `~/.pgpass` file configured for the PostgreSQL user, or `PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER`, `PGPASSWORD` environment variables set.
Example ~/.pgpass entry:
hostname:port:database:username:password
import psycopg2
import sys
import os
from datetime import datetime, timedelta
# Configuration
DB_HOST = os.environ.get("PGHOST", "localhost")
DB_PORT = os.environ.get("PGPORT", "5432")
DB_NAME = os.environ.get("PGDATABASE", "postgres")
DB_USER = os.environ.get("PGUSER", "monitor_user")
# PGPASSWORD can be set as an environment variable or managed via ~/.pgpass
LONG_QUERY_THRESHOLD_SECONDS = 300 # 5 minutes
IDLE_CONNECTION_THRESHOLD = 50
MAX_TOTAL_CONNECTIONS = 100
def check_pg_activity():
conn = None
try:
conn = psycopg2.connect(
host=DB_HOST,
port=DB_PORT,
dbname=DB_NAME,
user=DB_USER
)
cur = conn.cursor()
# Check for long-running queries
cur.execute(f"""
SELECT
pid,
now() - query_start AS duration,
query,
usename,
client_addr
FROM
pg_stat_activity
WHERE
state = 'active' AND
now() - query_start > INTERVAL '{LONG_QUERY_THRESHOLD_SECONDS} seconds'
ORDER BY
duration DESC;
""")
long_queries = cur.fetchall()
if long_queries:
print(f"ALERT: Found {len(long_queries)} long-running queries (>{LONG_QUERY_THRESHOLD_SECONDS}s):")
for pid, duration, query, usename, client_addr in long_queries:
print(f" PID: {pid}, Duration: {duration}, User: {usename}, Client: {client_addr}")
print(f" Query: {query[:200]}{'...' if len(query) > 200 else ''}")
sys.exit(1) # Indicate an alert
# Check for excessive idle connections
cur.execute("""
SELECT
count(*)
FROM
pg_stat_activity
WHERE
state = 'idle';
""")
idle_count = cur.fetchone()[0]
if idle_count > IDLE_CONNECTION_THRESHOLD:
print(f"ALERT: Excessive idle connections: {idle_count} (Threshold: {IDLE_CONNECTION_THRESHOLD})")
sys.exit(1)
# Check total connections
cur.execute("""
SELECT
count(*)
FROM
pg_stat_activity;
""")
total_connections = cur.fetchone()[0]
if total_connections > MAX_TOTAL_CONNECTIONS:
print(f"ALERT: High total connections: {total_connections} (Threshold: {MAX_TOTAL_CONNECTIONS})")
sys.exit(1)
print("OK: PostgreSQL activity looks normal.")
sys.exit(0)
except psycopg2.OperationalError as e:
print(f"ERROR: Could not connect to PostgreSQL: {e}")
sys.exit(2) # Indicate a connection error
except Exception as e:
print(f"ERROR: An unexpected error occurred: {e}")
sys.exit(3)
finally:
if conn:
conn.close()
if __name__ == "__main__":
check_pg_activity()
To integrate this script into your Linode environment:
- Save the script as
check_pg_activity.pyon a monitoring server or one of your PostgreSQL nodes. - Make it executable:
chmod +x check_pg_activity.py. - Schedule it using cron. For example, to run every 5 minutes:
*/5 * * * * /usr/bin/python3 /path/to/your/check_pg_activity.py >> /var/log/pg_monitor.log 2>&1
The script exits with a non-zero status code upon detecting an issue, which can be used by cron’s mail capabilities or piped to a more sophisticated alerting system.
Monitoring `pg_stat_replication` for Replication Lag and Status
For a highly available PostgreSQL cluster, monitoring replication is paramount. The `pg_stat_replication` view on the primary server provides crucial information about connected standby servers, including their replication lag and status. We need to ensure that standbys are keeping up with the primary and that no replication slots are stuck.
Here’s an enhanced Python script that checks replication status. It specifically looks for standbys that are falling behind or have stopped replicating.
import psycopg2
import sys
import os
from datetime import datetime, timedelta
# Configuration
DB_HOST = os.environ.get("PGHOST", "localhost")
DB_PORT = os.environ.get("PGPORT", "5432")
DB_NAME = os.environ.get("PGDATABASE", "postgres")
DB_USER = os.environ.get("PGUSER", "monitor_user")
REPLICATION_LAG_THRESHOLD_SECONDS = 60 # 1 minute
MAX_REPLICATION_SLOTS_USED = 1 # Alert if more than 1 slot is used by a standby (indicates potential issue)
def check_pg_replication():
conn = None
try:
conn = psycopg2.connect(
host=DB_HOST,
port=DB_PORT,
dbname=DB_NAME,
user=DB_USER
)
cur = conn.cursor()
# Check replication lag
cur.execute("""
SELECT
application_name,
client_addr,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS write_lag_bytes,
state,
sync_state
FROM
pg_stat_replication
WHERE
state = 'streaming' OR state = 'hot standby';
""")
replication_stats = cur.fetchall()
if not replication_stats:
print("INFO: No active streaming replication found.")
# Depending on your setup, this might be an alert if you expect replication
# sys.exit(1)
else:
for app_name, client_addr, lag_bytes, write_lag_bytes, state, sync_state in replication_stats:
# Convert lag_bytes to seconds if possible, or use a byte threshold
# For simplicity, we'll use a direct byte threshold or estimate from WAL generation rate
# A more accurate way is to check pg_stat_wal_receiver on the standby
# For this example, we'll assume a lag_bytes threshold is sufficient or use a fixed time threshold
# A common approach is to monitor the time it takes for WAL to be written and replayed.
# Let's use a simplified approach based on WAL generation rate if available, or a fixed lag.
# A more robust check involves querying the standby's pg_stat_wal_receiver
# For this script, we'll focus on the primary's view and assume a reasonable lag.
# If lag_bytes is very large, it's a problem. Let's convert to approximate seconds.
# This requires knowing the WAL generation rate, which is dynamic.
# A simpler, though less precise, method is to check if 'replay_lag' is increasing significantly.
# Let's use a proxy: if write_lsn is significantly behind current_wal_lsn, it's a write issue.
# If replay_lsn is significantly behind write_lsn, it's a replay issue.
# A common metric is the time difference between WAL write and WAL replay.
# PostgreSQL 10+ provides pg_wal_lsn_diff.
# To get time, we'd ideally query the standby.
# For this script, we'll use a simplified approach: if lag_bytes is too high, alert.
# A more practical approach is to check the 'replay_lag' column if available or estimate.
# Let's assume a threshold in bytes, or if we have a way to estimate time.
# For a more accurate time-based lag, you'd need to query the standby's pg_stat_wal_receiver.
# Example: SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), received_lsn) FROM pg_stat_wal_receiver;
# For now, we'll use a conceptual lag_seconds derived from lag_bytes, or a direct byte threshold.
# Let's simulate a time-based lag check by assuming a WAL generation rate or using a fixed byte threshold.
# A more direct approach for time-based lag:
# On the standby, query: SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_wal_receiver;
# And on the primary, query: SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) FROM pg_stat_replication WHERE application_name = 'your_standby_app_name';
# The difference between these two values, divided by WAL generation rate, gives time lag.
# For this script, we'll simplify: if lag_bytes is excessively large, alert.
# A common threshold might be in the tens or hundreds of MB, depending on WAL generation.
# Let's use a conceptual `lag_seconds` derived from `lag_bytes` if we had WAL rate,
# or a direct byte threshold.
# For demonstration, let's assume a threshold in bytes.
# A more practical approach is to monitor the `replay_lag` column if available or estimate.
# Let's use a simplified check: if the lag is growing or is consistently high.
# A common approach is to check the time difference between WAL write and WAL replay.
# PostgreSQL 10+ provides pg_wal_lsn_diff.
# To get time, we'd ideally query the standby.
# For this script, we'll use a simplified approach: if lag_bytes is too high, alert.
# A common threshold might be in the tens or hundreds of MB, depending on WAL generation.
# Let's use a conceptual lag_seconds derived from lag_bytes if we had WAL rate,
# or a direct byte threshold.
# For demonstration, let's assume a threshold in bytes.
# A more practical approach is to monitor the `replay_lag` column if available or estimate.
# A common metric is the time difference between WAL write and WAL replay.
# PostgreSQL 10+ provides pg_wal_lsn_diff.
# To get time, we'd ideally query the standby.
# For this script, we'll use a simplified approach: if lag_bytes is too high, alert.
# A common threshold might be in the tens or hundreds of MB, depending on WAL generation.
# Let's use a conceptual lag_seconds derived from lag_bytes if we had WAL rate,
# or a direct byte threshold.
# For demonstration, let's assume a threshold in bytes.
# A more practical approach is to monitor the `replay_lag` column if available or estimate.
# Let's use a simplified check: if the lag is growing or is consistently high.
# A common approach is to check the time difference between WAL write and WAL replay.
# PostgreSQL 10+ provides pg_wal_lsn_diff.
# To get time, we'd ideally query the standby.
# For this script, we'll use a simplified approach: if lag_bytes is too high, alert.
# A common threshold might be in the tens or hundreds of MB, depending on WAL generation.
# Let's use a conceptual lag_seconds derived from lag_bytes if we had WAL rate,
# or a direct byte threshold.
# For demonstration, let's assume a threshold in bytes.
# A more practical approach is to monitor the `replay_lag` column if available or estimate.
# A more direct way to estimate time lag from lag_bytes:
# If you know your average WAL write rate (e.g., MB/sec), you can estimate time.
# Example: If WAL rate is 10MB/sec and lag_bytes is 100MB, lag is ~10 seconds.
# Since WAL rate is dynamic, this is an approximation.
# A safer bet is to monitor the `replay_lag` column if available or estimate.
# For this script, we'll use a direct byte threshold as a proxy for lag.
# A threshold of 1GB (1073741824 bytes) might be a starting point.
LAG_BYTES_THRESHOLD = 1073741824 # 1 GB
if lag_bytes > LAG_BYTES_THRESHOLD:
print(f"ALERT: Replication lag for {app_name} ({client_addr}) is too high: {lag_bytes} bytes (Threshold: {LAG_BYTES_THRESHOLD} bytes). State: {state}, Sync State: {sync_state}")
sys.exit(1)
# Check for replication slots that are not being used or are stuck
# This requires querying pg_replication_slots on the primary
cur.execute("""
SELECT
slot_name,
active,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS slot_lag_bytes
FROM
pg_replication_slots
WHERE
active = true;
""")
replication_slots = cur.fetchall()
for slot_name, active, slot_lag_bytes in replication_slots:
# If a slot is active but has a significant lag, it's a problem.
# The 'slot_lag_bytes' here refers to how far behind the slot is from the current WAL.
# This is a critical indicator of a stuck slot or a standby that's not consuming WAL.
if slot_lag_bytes > LAG_BYTES_THRESHOLD: # Using the same threshold for simplicity
print(f"ALERT: Replication slot '{slot_name}' has significant lag: {slot_lag_bytes} bytes (Threshold: {LAG_BYTES_THRESHOLD} bytes). Active: {active}")
sys.exit(1)
if not active:
print(f"WARNING: Replication slot '{slot_name}' is not active. Check standby connection.")
# This might not be an immediate alert, but warrants investigation.
print("OK: PostgreSQL replication status looks normal.")
sys.exit(0)
except psycopg2.OperationalError as e:
print(f"ERROR: Could not connect to PostgreSQL: {e}")
sys.exit(2)
except Exception as e:
print(f"ERROR: An unexpected error occurred: {e}")
sys.exit(3)
finally:
if conn:
conn.close()
if __name__ == "__main__":
check_pg_replication()
Important Considerations for Replication Monitoring:
- WAL Generation Rate: The `lag_bytes` metric is most meaningful when correlated with your typical WAL generation rate. A lag of 100MB might be acceptable on a low-traffic server but critical on a busy one. Consider monitoring WAL generation rate separately or using a more sophisticated check that estimates time lag.
- Standby-side Monitoring: For the most accurate replication lag, query `pg_stat_wal_receiver` on the standby server. This view provides `receive_lag` and `replay_lag` in bytes, which can be more directly translated to time. A comprehensive monitoring setup would involve querying both primary and standbys.
- Replication Slots: Ensure your standbys are configured with persistent replication slots (`primary_slot_name` in `standby.conf`). This prevents WAL files from being deleted on the primary if a standby is temporarily disconnected. Monitor `pg_replication_slots` for slots that are active but not advancing their `confirmed_flush_lsn`.
- Sync State: The `sync_state` column in `pg_stat_replication` indicates whether a standby is a synchronous standby. If you rely on synchronous replication, ensure `sync_state` is `sync` and `write_lag` and `replay_lag` are minimal.
C Application Health Checks: Beyond Basic Process Status
Monitoring your C application involves more than just checking if its process is running. For production systems, we need to verify that the application is not only alive but also responsive and healthy. This involves implementing application-level health check endpoints or mechanisms that can be queried by external monitoring tools.
Implementing a Simple HTTP Health Check Endpoint
A common and effective pattern is to expose an HTTP endpoint (e.g., /healthz) that returns a 200 OK status if the application is healthy, and a non-200 status otherwise. This endpoint should perform basic checks, such as verifying database connectivity, internal service dependencies, or critical resource availability.
Here’s a conceptual example using a lightweight HTTP server library (like mongoose or a custom socket implementation) within your C application. For simplicity, we’ll outline the logic. In a real-world scenario, you’d integrate this with your existing web server or application framework.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <time.h> // For basic timing checks
// Assume you have functions to check dependencies:
// extern bool check_database_connection();
// extern bool check_internal_service_status();
// extern bool check_critical_resource_availability();
// Placeholder functions for demonstration
bool check_database_connection() {
// In a real app, this would involve a ping or a simple query to PostgreSQL
printf("Checking database connection...\n");
// Simulate success
return true;
}
bool check_internal_service_status() {
printf("Checking internal service status...\n");
// Simulate success
return true;
}
bool check_critical_resource_availability() {
printf("Checking critical resource availability...\n");
// Simulate success
return true;
}
// Function to handle the health check request
void handle_health_check() {
bool db_ok = check_database_connection();
bool service_ok = check_internal_service_status();
bool resource_ok = check_critical_resource_availability();
if (db_ok && service_ok && resource_ok) {
// Respond with HTTP 200 OK
printf("HTTP/1.1 200 OK\r\n");
printf("Content-Type: text/plain\r\n");
printf("Content-Length: 2\r\n");
printf("\r\n");
printf("OK\r\n");
} else {
// Respond with HTTP 503 Service Unavailable
printf("HTTP/1.1 503 Service Unavailable\r\n");
printf("Content-Type: text/plain\r\n");
printf("Content-Length: 15\r\n");
printf("\r\n");
printf("Service Unavailable\r\n");
}
}
// --- This is a simplified representation. A real HTTP server would be needed ---
// In a real application, you would integrate this logic into your HTTP server framework.
// For example, using libmicrohttpd, mongoose, or a custom socket listener.
int main() {
// This is a placeholder. In a real scenario, this would be part of your
// application's main loop or request handling mechanism.
printf("Simulating a health check request...\n");
handle_health_check();
return 0;
}
Integration with Monitoring Tools:
- Nginx/HAProxy: Configure your load balancer to periodically poll the
/healthzendpoint on your application instances. If an instance fails the health check (returns non-200), the load balancer can automatically remove it from the active pool. - Prometheus: Use the Prometheus `blackbox_exporter`. This exporter can probe HTTP endpoints and report their status. You can configure Prometheus to scrape the
/healthzendpoint and set up alerts based on failures. - Custom Scripts: A simple `curl` command can be used in cron jobs or shell scripts to check the endpoint.
Example Nginx configuration snippet for health checks (using `proxy_pass` and `health_check` directive if using Nginx Plus, or a custom `location` block for basic checks):
# Basic health check location (for non-Nginx Plus)
location /healthz {
access_log off;
return 200 'OK'; # This is a very basic check, ideally call an internal app handler
}
# Example for Nginx Plus (requires health_check module)
# upstream myapp {
# server 192.168.1.10:8080;
# server 192.168.1.11:8080;
#
# health_check uri=/healthz interval=5s fails=3 passes=2;
# }
#
# server {
# listen 80;
# server_name example.com;
#
# location / {
# proxy_pass http://myapp;
# }
# }
Application-Specific Metrics and Logging
Beyond a simple up/down status, your C application should expose internal metrics. This could involve:
- Request Latency: Track the time taken to process different types of requests.
- Error Rates: Log and count specific error types (e.g., database connection errors, parsing errors).
- Resource Usage: Monitor internal memory allocations, thread counts, or queue lengths.
These metrics can be exposed via a dedicated metrics endpoint (e.g., /metrics in Prometheus format) or logged in a structured format (JSON) to a centralized logging system like Grafana Loki or Elasticsearch. For C, you might use libraries like prometheus-client-c or implement custom logging handlers.
Structured Logging Example (Conceptual):
#include <stdio.h>
#include <time.h>
#include <string.h>
// Assume a JSON logging library or a simple string formatting function
void log_structured(const char* level, const char* message, const char* details) {
time_t now;
struct tm* tm_info;
char timestamp[20];
time(&now);
tm_info = localtime(&now);
strftime(timestamp, sizeof(timestamp), "%Y-%m-%dT%H:%M:%S", tm_info);
printf("{\"timestamp\": \"%s\", \"level\": \"%s\", \"message\": \"%s\", \"details\": \"%s\"}\n",
timestamp, level, message, details ? details : "");
}
// Example usage within your application
void process_request(int request_id, bool success) {
if (success) {
log_structured("INFO", "Request processed successfully", "request_id=12345");
} else {
log_structured("ERROR", "Failed to process request", "request_id=12345, error_code=DB_CONN_FAILED");
}
}
By implementing these application-level checks and metrics, you gain visibility into your C application’s true health, enabling faster incident detection and resolution on Linode.
Linode Infrastructure Monitoring: Essential Metrics and Tools
While application and database monitoring are critical, we must not overlook the underlying infrastructure provided by Linode. Understanding key Linode metrics and how to collect them is vital for capacity planning, performance tuning, and identifying infrastructure-level issues.
Key Linode Metrics to Monitor
Linode provides a set of core metrics through its Cloud Manager and API. These should be collected and visualized alongside your application and database metrics for a holistic view.
- CPU Utilization: Percentage of CPU time used by the Linode instance. High sustained utilization can indicate an undersized instance or inefficient application/database.
- Memory Usage: Amount of RAM used. Monitor both used and available memory. Swapping indicates memory pressure.
- Disk I/O: Read/write operations per second (IOPS) and throughput (MB/s). High I/O wait times or saturated disks can severely impact database and application performance.
- Network Traffic: Inbound and outbound bandwidth usage. Spikes or sustained high traffic might indicate performance issues or security concerns.
- Disk Space: Percentage of disk used. Running out of disk space is a critical failure point for databases and applications.
Collecting Linode Metrics with Prometheus
Prometheus is an excellent choice for collecting and visualizing these metrics. We can use the node_exporter to gather system-level metrics from each Linode instance.
1. Install and Configure `node_exporter` on each Linode:
# Download the latest release (check Prometheus website for current URL)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
# Move the binary to a common location
sudo mv node_exporter /usr/local/bin/
# Create a systemd service file for node_exporter
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.textfile.directory=/var/lib/node_exporter/textfile-collector \
--web.listen-address=0.0.0.0:9100
[Install]
WantedBy=multi-user.target
EOF
# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter
This setup exposes metrics on port 9100. Ensure your Linode’s firewall allows inbound traffic on this port from your Prometheus server.
2. Configure Prometheus to Scrape `node_exporter`:**
# prometheus.yml
scrape_configs:
- job_name: 'node'
static_configs:
- targets: ['linode1_ip:9100', 'linode2_ip:9100', 'linode3_ip:9100'] # Replace with your Linode IPs
labels:
instance: 'linode1' # Or use service discovery
- targets: ['linode2_ip:9100']
labels:
instance: 'linode2'
- targets: ['linode3_ip:9100']
labels:
instance: 'linode3'
# Add scrape configs for PostgreSQL exporter and your application metrics here
# - job_name: 'postgres'
# static_configs:
# - targets: ['postgres_primary_ip:9187'] # Assuming pg_exporter is running on port 9187
# - job_name: 'my_c_app'
# static_configs:
# - targets: ['app_instance1_ip:9090'] # Assuming app exposes metrics on port 9090
3. Visualize with Grafana:
Import pre-built Grafana dashboards for `node_exporter` (e.g., “Node Exporter Full” or “Node Exporter Server Metrics”) or create custom dashboards to visualize CPU, memory, disk, and network usage for your Linode instances.
Leveraging Linode’s API for Advanced Monitoring
Linode’s API provides access to instance-level metrics that might not be directly exposed by `node_exporter` or are specific to Linode’s infrastructure. You can use tools like linode-cli or write custom scripts using Python/Bash to fetch this data.
Example: Fetching Linode Instance Metrics using `linode-cli` (Python):
import linode_api
import os
from datetime import datetime, timedelta
# Ensure LINODE_API_TOKEN environment variable is set
api_token = os.environ.get("LINODE_API_TOKEN")
if not api_token:
print("Error: LINODE_API_TOKEN environment variable not set.")
exit(1)
client = linode_api.LinodeClient(api_token)
# Get a list of all Linode instances