Server Monitoring Best Practices: Keeping Your C App and PostgreSQL Clusters Alive on OVH
Proactive C Application Health Checks with `systemd`
For critical C applications deployed on OVH infrastructure, robust health checking is paramount. We’ll leverage `systemd`’s built-in capabilities to ensure our application is not only running but also responsive. This involves defining a `systemd` service unit with specific health check directives.
Consider a typical C application that listens on a specific port (e.g., 8080) and exposes a health check endpoint (e.g., `/healthz`); the example code below serves that endpoint on a separate port, 8081. We’ll create a `systemd` service file to manage this application and its health monitoring.
`systemd` Service Unit Configuration
Create a file named `my-c-app.service` in `/etc/systemd/system/`:
[Unit]
Description=My Critical C Application
After=network.target

[Service]
ExecStart=/usr/local/bin/my_c_app --config /etc/my_c_app/config.conf
ExecStop=/bin/kill -s TERM $MAINPID
Restart=on-failure
RestartSec=5s
# Health Check Configuration
Type=notify
NotifyAccess=all
WatchdogSec=10s
# User and Group for security
User=appuser
Group=appgroup

[Install]
WantedBy=multi-user.target
In this configuration:
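- `Type=notify`: Tells `systemd` that the application reports its own readiness by sending `READY=1`; the unit is not considered started until that message arrives.
- `NotifyAccess=all`: Accepts notifications from any process in the service’s control group. `NotifyAccess=main` is the tighter choice when only the main process sends them.
- `WatchdogSec=10s`: This is crucial. The application must send a `WATCHDOG=1` keep-alive to `systemd` at least once every 10 seconds. `systemd` does not poll the application; if no keep-alive arrives within the interval, `systemd` treats the service as failed and, because of `Restart=on-failure`, restarts it.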
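The timing behind `WatchdogSec` can be sketched in a few lines of shell. When the watchdog is enabled, `systemd` exports the timeout to the service in the `WATCHDOG_USEC` environment variable, and the usual recommendation is to send the keep-alive at about half that interval. The helper name `watchdog_ping_interval` is hypothetical, and rounding to whole seconds is an assumption made for simplicity; a C application would do the same arithmetic with the value returned by `sd_watchdog_enabled(3)`.

```bash
#!/bin/bash
# Hypothetical helper: print how many seconds to wait between WATCHDOG=1
# keep-alives, based on the WATCHDOG_USEC variable exported by systemd.
# Prints 0 when the watchdog is disabled or the value is not a number.
watchdog_ping_interval() {
    local usec="${WATCHDOG_USEC:-0}"
    case "$usec" in
        ''|*[!0-9]*) echo 0; return 0 ;;
    esac
    if [ "$usec" -eq 0 ]; then
        echo 0
        return 0
    fi
    # Half the timeout in whole seconds (assumption: 1s granularity is enough)
    local secs=$(( usec / 2000000 ))
    if [ "$secs" -lt 1 ]; then
        secs=1
    fi
    echo "$secs"
}
```

With `WatchdogSec=10s`, `WATCHDOG_USEC` is 10000000 and the helper prints 5. Timeouts below two seconds need sub-second timers, which this sketch does not handle.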
Your C application needs to be modified to support `systemd`’s notification protocol, as described next.
C Application Modifications for `systemd` Notification
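Your C application must call `sd_notify(3)` with `READY=1` once it is ready to accept connections, and can send `STATUS=…` messages to provide status updates. For the watchdog, it must also send `WATCHDOG=1` at regular intervals shorter than `WatchdogSec`; `systemd` does not send a signal or message that the application has to answer. Link the program against `libsystemd` (`-lsystemd`).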
#include <systemd/sd-daemon.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define PORT 8080
#define HEALTH_CHECK_PORT 8081 // Separate port for health check endpoint
#define LOOP_SLEEP_USEC 100000 // Main loop tick: 100ms

volatile sig_atomic_t shutdown_requested = 0;

// ExecStop sends SIGTERM; record it so the main loop can exit cleanly.
void sig_handler(int signum) {
    if (signum == SIGTERM) {
        shutdown_requested = 1;
    }
}

// Creates the non-blocking health check listener and returns its descriptor.
int setup_health_check_server(void) {
    int sockfd;
    int reuse = 1;
    struct sockaddr_in serv_addr;

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) {
        perror("ERROR opening socket");
        exit(1);
    }
    // Allow an immediate rebind when systemd restarts the service
    setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));
    // Set socket to non-blocking so accept() never stalls the main loop
    int flags = fcntl(sockfd, F_GETFL, 0);
    if (flags == -1) {
        perror("fcntl F_GETFL");
        exit(1);
    }
    if (fcntl(sockfd, F_SETFL, flags | O_NONBLOCK) == -1) {
        perror("fcntl F_SETFL O_NONBLOCK");
        exit(1);
    }
    memset(&serv_addr, 0, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = htonl(INADDR_ANY);
    serv_addr.sin_port = htons(HEALTH_CHECK_PORT);
    if (bind(sockfd, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
        perror("ERROR on binding");
        close(sockfd);
        exit(1);
    }
    if (listen(sockfd, 5) < 0) {
        perror("ERROR on listen");
        close(sockfd);
        exit(1);
    }
    printf("Health check server listening on port %d\n", HEALTH_CHECK_PORT);
    return sockfd;
}

// Answers any request with a small JSON body. A real server would parse
// the request line and only answer GET /healthz.
void handle_health_check_request(int client_sock) {
    char buffer[1024] = {0};
    char response[256];
    const char *body = "{\"status\": \"ok\"}";
    // Bound the read so a silent client cannot stall the loop past the watchdog
    struct timeval timeout = { .tv_sec = 1, .tv_usec = 0 };
    setsockopt(client_sock, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout));
    // Read request (e.g., GET /healthz HTTP/1.1)
    if (read(client_sock, buffer, sizeof(buffer) - 1) < 0) {
        close(client_sock);
        return;
    }
    // Compute Content-Length from the body instead of hard-coding it
    int len = snprintf(response, sizeof(response),
                       "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n"
                       "Content-Length: %zu\r\nConnection: close\r\n\r\n%s",
                       strlen(body), body);
    if (len > 0) {
        send(client_sock, response, (size_t)len, MSG_NOSIGNAL);
    }
    close(client_sock);
}

int main(int argc, char *argv[]) {
    (void)argc;
    (void)argv;
    // Parse arguments, load config, etc.
    // ...
    signal(SIGTERM, sig_handler);
    // Setup main application server (e.g., listening on PORT 8080)
    // ...
    // Setup health check server and keep its listening descriptor
    int health_check_server_fd = setup_health_check_server();

    // Read the watchdog timeout that systemd derived from WatchdogSec=
    uint64_t watchdog_usec = 0;
    int watchdog_enabled = sd_watchdog_enabled(0, &watchdog_usec) > 0;
    // Ping at half the timeout, counted in main loop ticks
    uint64_t ticks_per_ping = watchdog_usec / 2 / LOOP_SLEEP_USEC;
    uint64_t ticks = 0;

    // Initial notification to systemd
    sd_notify(0, "READY=1\nSTATUS=Processing requests");

    while (!shutdown_requested) {
        // Main application logic
        // ...
        // Send the keep-alive only while the loop is making progress.
        // If the app hangs, the pings stop and systemd's watchdog times out.
        if (watchdog_enabled && ++ticks >= ticks_per_ping) {
            int r = sd_notify(0, "WATCHDOG=1");
            if (r < 0) {
                // sd_notify returns a negative errno-style code on failure
                fprintf(stderr, "sd_notify failed: %s\n", strerror(-r));
            }
            ticks = 0;
        }
        // Example of handling health check requests (simplified)
        struct sockaddr_in client_addr;
        socklen_t client_len = sizeof(client_addr);
        int client_sock = accept(health_check_server_fd, (struct sockaddr *)&client_addr, &client_len);
        if (client_sock >= 0) {
            handle_health_check_request(client_sock);
        } else if (errno != EWOULDBLOCK && errno != EAGAIN && errno != EINTR) {
            perror("accept");
            // Handle error
        }
        usleep(LOOP_SLEEP_USEC); // Sleep for 100ms
    }
    sd_notify(0, "STOPPING=1");
    close(health_check_server_fd);
    return 0;
}
After creating the service file and modifying your C application, reload `systemd`, enable, and start the service:
sudo systemctl daemon-reload
sudo systemctl enable my-c-app.service
sudo systemctl start my-c-app.service
sudo systemctl status my-c-app.service
This setup ensures that `systemd` actively monitors your C application. If the application becomes unresponsive (fails to signal `systemd` within `WatchdogSec`), `systemd` will automatically attempt to restart it.
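Before relying on the watchdog, it can help to confirm that `systemd` actually loaded the settings from the unit file. The sketch below parses `KEY=VALUE` output of the kind `systemctl show` prints. The helper names `get_prop` and `watchdog_configured` are hypothetical, and treating an empty value, `0`, or `infinity` as "disabled" is an assumption rather than documented behaviour.

```bash
#!/bin/bash
# Hypothetical helper: print the value of one KEY from KEY=VALUE lines on stdin.
get_prop() {
    awk -v key="$1" '
        index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }
    '
}

# Hypothetical check: succeed only if the properties describe a notify-type
# service with a watchdog timeout set.
# Assumption: an empty value, "0", or "infinity" means the watchdog is off.
watchdog_configured() {
    local props="$1" svc_type wd
    svc_type="$(printf '%s\n' "$props" | get_prop Type)"
    wd="$(printf '%s\n' "$props" | get_prop WatchdogUSec)"
    [ "$svc_type" = "notify" ] || return 1
    case "$wd" in
        ''|0|infinity) return 1 ;;
    esac
    return 0
}

# Usage against the live unit (needs a host running systemd):
#   props="$(systemctl show my-c-app.service -p Type -p WatchdogUSec -p NotifyAccess)"
#   watchdog_configured "$props" && echo "watchdog active" || echo "watchdog not configured"
```

Keeping the parsing separate from the `systemctl` call means the check can be exercised with canned input on a machine that does not run the service.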
PostgreSQL Cluster Monitoring with `pg_monitor` and `pg_stat_statements`
Monitoring PostgreSQL clusters, especially in a high-availability setup on OVH, requires a multi-faceted approach. We’ll focus on key metrics that indicate performance, availability, and potential issues, using `pg_monitor` (a custom script or tool) and the built-in `pg_stat_statements` extension.
Enabling and Configuring `pg_stat_statements`
`pg_stat_statements` is invaluable for identifying slow queries. Ensure it’s enabled in your `postgresql.conf`.
# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000
pg_stat_statements.save = on
After modifying `postgresql.conf`, you must restart your PostgreSQL instances for the changes to take effect. Then, create the extension in each database you want to monitor:
-- In each database you want to monitor
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
You can then query `pg_stat_statements` to find the most resource-intensive queries:
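-- On PostgreSQL 12 and earlier, the timing columns are named
-- total_time, mean_time, and stddev_time.
SELECT
    s.calls,
    s.total_exec_time,
    s.rows,
    s.mean_exec_time,
    s.stddev_exec_time,
    r.rolname AS "user",
    s.query
FROM
    pg_stat_statements AS s
    LEFT JOIN pg_roles AS r ON r.oid = s.userid
ORDER BY
    s.total_exec_time DESC
LIMIT 20;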
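The timing columns of `pg_stat_statements` were renamed in PostgreSQL 13 (`total_time` became `total_exec_time`, and likewise for the mean and standard deviation), so a script that runs against mixed versions has to pick the right name. A minimal sketch, assuming the server version is read with `SHOW server_version_num;`; the function names `pgss_time_column` and `pgss_top_queries_sql` are hypothetical, as are the connection variables in the usage comment.

```bash
#!/bin/bash
# Hypothetical helper: map a server_version_num (e.g. 130004) to the name of
# the cumulative execution time column in pg_stat_statements.
pgss_time_column() {
    local version_num="$1"
    if [ "$version_num" -ge 130000 ]; then
        echo "total_exec_time"
    else
        echo "total_time"
    fi
}

# Hypothetical helper: build a "top N by total time" query for that version.
# $1 = server_version_num, $2 = row limit (defaults to 20)
pgss_top_queries_sql() {
    local col limit="${2:-20}"
    col="$(pgss_time_column "$1")"
    echo "SELECT calls, ${col}, rows, query FROM pg_stat_statements ORDER BY ${col} DESC LIMIT ${limit};"
}

# Usage (PG_USER, PG_HOST, PG_PORT are assumed connection settings):
#   ver="$(psql -U "$PG_USER" -h "$PG_HOST" -p "$PG_PORT" -tAc 'SHOW server_version_num;')"
#   psql -U "$PG_USER" -h "$PG_HOST" -p "$PG_PORT" -c "$(pgss_top_queries_sql "$ver")"
```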
Custom Monitoring Script (`pg_monitor.sh`)
A shell script can aggregate critical metrics. This script should be run periodically (e.g., via cron or `systemd` timers) on each PostgreSQL node.
#!/bin/bash
# Configuration
PG_USER="monitor_user"
PG_HOST="localhost"
PG_PORT="5432"
LOG_FILE="/var/log/pg_monitor.log"
ALERT_THRESHOLD_CPU=80 # %
ALERT_THRESHOLD_MEM=80 # %
ALERT_THRESHOLD_DISK=90 # %
ALERT_THRESHOLD_CONNECTIONS=500
ALERT_THRESHOLD_SLOTS=0 # Minimum expected number of replication slots; 0 disables the alert
# --- Functions ---
log_message() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}
send_alert() {
local metric="$1"
local value="$2"
local threshold="$3"
local message="ALERT: PostgreSQL cluster issue on ${PG_HOST}:${PG_PORT}. Metric: ${metric}, Value: ${value}, Threshold: ${threshold}"
log_message "$message"
# In a production environment, integrate with your alerting system (e.g., PagerDuty, Slack, Prometheus Alertmanager)
# echo "$message" | mail -s "PostgreSQL Alert on ${PG_HOST}" alerts@example.com  # placeholder address
}
check_replication_status() {
local slots_used=$(psql -U "$PG_USER" -h "$PG_HOST" -p "$PG_PORT" -tAc "SELECT count(*) FROM pg_replication_slots;")
if [ "$slots_used" -lt "$ALERT_THRESHOLD_SLOTS" ]; then
send_alert "Replication Slots" "$slots_used" "$ALERT_THRESHOLD_SLOTS"
fi
# Add checks for streaming replication lag if applicable
}
check_connection_count() {
local current_connections=$(psql -U "$PG_USER" -h "$PG_HOST" -p "$PG_PORT" -tAc "SELECT count(*) FROM pg_stat_activity;")
local max_connections=$(psql -U "$PG_USER" -h "$PG_HOST" -p "$PG_PORT" -tAc "SHOW max_connections;")
if [ "$current_connections" -gt "$ALERT_THRESHOLD_CONNECTIONS" ]; then
send_alert "Current Connections" "$current_connections" "$ALERT_THRESHOLD_CONNECTIONS"
fi
if [ "$current_connections" -gt "$((max_connections * 90 / 100))" ]; then
send_alert "Connection Usage" "$current_connections/$max_connections" "90% of max_connections"
fi
}
check_disk_usage() {
local disk_usage=$(df -P /var/lib/postgresql/data | awk 'NR==2 {print $5}' | sed 's/%//') # -P keeps each filesystem on one line
if [ "$disk_usage" -gt "$ALERT_THRESHOLD_DISK" ]; then
send_alert "Disk Usage" "$disk_usage%" "$ALERT_THRESHOLD_DISK%"
fi
}
check_cpu_usage() {
# This is a simplified check. For accurate PostgreSQL CPU usage,
# consider tools like `pg_top` or more advanced system monitoring.
local cpu_usage=$(top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
if (( $(echo "$cpu_usage > $ALERT_THRESHOLD_CPU" | bc -l) )); then
send_alert "CPU Usage" "${cpu_usage}%" "${ALERT_THRESHOLD_CPU}%"
fi
}
check_memory_usage() {
# Similar to CPU, this is a system-wide check.
local mem_usage=$(free | grep Mem: | awk '{print $3/$2 * 100.0}')
if (( $(echo "$mem_usage > $ALERT_THRESHOLD_MEM" | bc -l) )); then
send_alert "Memory Usage" "${mem_usage}%" "${ALERT_THRESHOLD_MEM}%"
fi
}
check_pg_is_running() {
if ! pg_isready -h "$PG_HOST" -p "$PG_PORT" -U "$PG_USER" >/dev/null 2>&1; then
send_alert "PostgreSQL Service" "Not Running" "Running"
return 1
fi
return 0
}
# --- Main Execution ---
log_message "Starting PostgreSQL health check..."
if ! check_pg_is_running; then
log_message "PostgreSQL is not running on ${PG_HOST}:${PG_PORT}. Exiting."
exit 1
fi
check_cpu_usage
check_memory_usage
check_disk_usage
check_connection_count
check_replication_status
log_message "PostgreSQL health check finished."
exit 0
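The comment in `check_replication_status` leaves streaming replication lag open. One possible sketch for a standby node follows: it asks PostgreSQL how many seconds have passed since the last replayed transaction and compares that with a threshold. The function names, the `ALERT_THRESHOLD_LAG_SECONDS` variable, and its default of 60 are hypothetical; `pg_is_in_recovery()` and `pg_last_xact_replay_timestamp()` are built-in PostgreSQL functions. This time-based figure also grows when the primary is simply idle, so treat it as a rough signal.

```bash
#!/bin/bash
ALERT_THRESHOLD_LAG_SECONDS="${ALERT_THRESHOLD_LAG_SECONDS:-60}" # hypothetical default

# Returns 0 (true) when a lag value in seconds is numeric and above the threshold.
lag_exceeds_threshold() {
    local lag="$1" threshold="$2"
    case "$lag" in
        ''|*[!0-9]*) return 1 ;; # empty or non-numeric: nothing to compare
    esac
    [ "$lag" -gt "$threshold" ]
}

# Prints replay lag in whole seconds on a standby, or an empty line on a primary.
query_replay_lag_seconds() {
    psql -U "$PG_USER" -h "$PG_HOST" -p "$PG_PORT" -tAc \
        "SELECT CASE WHEN pg_is_in_recovery()
                     THEN floor(extract(epoch FROM now() - pg_last_xact_replay_timestamp()))::bigint
                END;"
}

check_replication_lag() {
    local lag
    lag="$(query_replay_lag_seconds)" || return 1
    if lag_exceeds_threshold "$lag" "$ALERT_THRESHOLD_LAG_SECONDS"; then
        # send_alert is the function defined in pg_monitor.sh
        send_alert "Replication Lag (s)" "$lag" "$ALERT_THRESHOLD_LAG_SECONDS"
    fi
}
```

To use it, the three functions would be placed with the other checks in `pg_monitor.sh` and `check_replication_lag` called alongside `check_replication_status`.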
To make this script executable and schedule it:
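sudo chmod +x pg_monitor.sh
# Create a user with read-only permissions for monitoring
sudo -u postgres psql -c "CREATE USER monitor_user WITH PASSWORD 'your_secure_password';"
sudo -u postgres psql -c "GRANT pg_read_all_stats TO monitor_user;"
# The script passes no password; psql reads it from ~/.pgpass or PGPASSWORD of the user running it
# Append an hourly check to root's crontab without replacing existing entries
(sudo crontab -l 2>/dev/null; echo "0 * * * * /path/to/your/pg_monitor.sh") | sudo crontab -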
OVH Specific Considerations
When running on OVH, pay close attention to:
- Network Latency: If your PostgreSQL cluster spans multiple OVH regions or availability zones, monitor inter-node latency. Tools like `ping` and `mtr` can help diagnose network issues.
- Disk I/O: OVH offers various disk types (SSD, NVMe). Monitor I/O wait times and throughput using `iostat` to ensure your chosen storage meets performance requirements.
- Resource Limits: Be aware of CPU and memory limits imposed by your OVH instance type. Use `top`, `htop`, and `free` to monitor resource utilization.
- Firewall Rules: Ensure that PostgreSQL ports (default 5432) are accessible between your application servers and database nodes, and that monitoring tools can reach the database.
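For the network latency point in the list above, the average round-trip time between nodes can be pulled out of the `ping` summary and compared with a limit. A minimal sketch, assuming the Linux iputils summary format (`rtt min/avg/max/mdev = ...`); the function names and the 5 ms limit are hypothetical, and `db-node-2` is a placeholder hostname.

```bash
#!/bin/bash
# Hypothetical helper: read `ping` output on stdin and print the average RTT in ms.
# Assumes a summary line such as: rtt min/avg/max/mdev = 0.412/0.530/0.701/0.100 ms
parse_ping_avg_ms() {
    awk -F'=' '/min\/avg\/max/ {
        gsub(/^ +/, "", $2)      # trim leading spaces before the numbers
        split($2, parts, "/")    # parts[2] is the avg field
        print parts[2]
        exit
    }'
}

# Returns 0 (true) when the average RTT is above the limit (both in ms).
latency_exceeds() {
    awk -v avg="$1" -v limit="$2" 'BEGIN { exit !(avg + 0 > limit + 0) }'
}

# Usage:
#   avg="$(ping -c 5 -q db-node-2 | parse_ping_avg_ms)"
#   latency_exceeds "$avg" 5 && echo "High latency to db-node-2: ${avg} ms"
```

The comparison is done in `awk` because shell integer tests cannot handle the fractional values that `ping` reports.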
By combining `systemd`’s process management with detailed PostgreSQL metrics and custom scripting, you can build a resilient monitoring strategy for your C applications and PostgreSQL clusters on OVH.