Server Monitoring Best Practices: Keeping Your C App and PostgreSQL Clusters Alive on OVH

Proactive C Application Health Checks with `systemd`

For critical C applications deployed on OVH infrastructure, a running process is not necessarily a responsive one. `systemd`’s service watchdog covers that gap: the service unit declares a watchdog timeout, and the application must check in with `systemd` within that timeout or be restarted.

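Consider a typical C application that listens on a specific port (e.g., 8080) and exposes an HTTP health check endpoint (e.g., `/healthz`) for external monitors and load balancers. `systemd` never calls that endpoint itself; it relies on the notification protocol described below. We’ll create a `systemd` service file to manage this application and its health monitoring.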

`systemd` Service Unit Configuration

Create a file named `my-c-app.service` in `/etc/systemd/system/`:

[Unit]
Description=My Critical C Application
After=network.target

[Service]
ExecStart=/usr/local/bin/my_c_app --config /etc/my_c_app/config.conf
ExecStop=/bin/kill -s TERM $MAINPID
Restart=on-failure
RestartSec=5s

# Health Check Configuration
Type=notify
NotifyAccess=all
WatchdogSec=10s

# User and Group for security
User=appuser
Group=appgroup

[Install]
WantedBy=multi-user.target

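In this configuration:

Type=notify: `systemd` treats the service as started only once the application sends `READY=1` through `sd_notify(3)`.
NotifyAccess=all: Accepts notification messages from any process in the service’s control group. If only the main process sends notifications, `NotifyAccess=main` (the default for `Type=notify`) is the tighter choice.
WatchdogSec=10s: This is crucial. The application must send `WATCHDOG=1` to `systemd` at least once every 10 seconds. `systemd` does not poll the application; if no keep-alive arrives within the interval, `systemd` marks the service as failed, and `Restart=on-failure` then restarts it.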
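When `WatchdogSec=` is set, `systemd` exports the timeout to the service in the `WATCHDOG_USEC` environment variable, in microseconds. The following is a minimal sketch of deriving a keep-alive interval from it without linking against libsystemd. The helper name `watchdog_ping_interval_usec` is hypothetical, and pinging at half the timeout follows the recommendation in `sd_watchdog_enabled(3)`. Unlike that function, this sketch does not check `WATCHDOG_PID`.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return how often (in microseconds) the service
 * should send WATCHDOG=1, or 0 if no watchdog is configured.
 * Assumes systemd exported WATCHDOG_USEC, which it does when
 * WatchdogSec= is set in the unit file.
 */
uint64_t watchdog_ping_interval_usec(void) {
    const char *val = getenv("WATCHDOG_USEC");
    if (val == NULL || !isdigit((unsigned char)val[0])) {
        return 0; /* unset or not a plain number: treat as disabled */
    }

    char *end = NULL;
    errno = 0;
    unsigned long long timeout = strtoull(val, &end, 10);
    if (errno != 0 || *end != '\0' || timeout == 0) {
        return 0; /* malformed value: treat as disabled */
    }

    /* Ping at half the timeout so one delayed loop iteration is tolerated. */
    return (uint64_t)(timeout / 2);
}
```

With `WatchdogSec=10s`, `WATCHDOG_USEC` is `10000000`, so the helper returns 5000000 (5 seconds).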

Your C application needs to be modified to support `systemd`’s notification protocol, as the next section describes.

C Application Modifications for `systemd` Notification

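Your C application must call `sd_notify(3)` with `READY=1` once, when it is ready to accept connections, and may send `STATUS=…` messages to publish a human-readable status that `systemctl status` displays. To satisfy the watchdog, it must also send `WATCHDOG=1` at an interval shorter than `WatchdogSec`; `systemd` sends nothing to the application. Build against libsystemd by linking with `-lsystemd`.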

#include <systemd/sd-daemon.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define PORT 8080
#define HEALTH_CHECK_PORT 8081 // Separate port for health check endpoint

// Create a non-blocking listening socket for the health check endpoint.
// Returns the listening file descriptor; exits the process on failure.
int setup_health_check_server(void) {
    int sockfd;
    struct sockaddr_in serv_addr;

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) {
        perror("ERROR opening socket");
        exit(1);
    }

    // Allow a restarted process to rebind the port immediately
    int opt = 1;
    setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    // Set socket to non-blocking so accept() never stalls the main loop
    int flags = fcntl(sockfd, F_GETFL, 0);
    if (flags == -1) {
        perror("fcntl F_GETFL");
        exit(1);
    }
    if (fcntl(sockfd, F_SETFL, flags | O_NONBLOCK) == -1) {
        perror("fcntl F_SETFL O_NONBLOCK");
        exit(1);
    }

    memset(&serv_addr, 0, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = INADDR_ANY;
    serv_addr.sin_port = htons(HEALTH_CHECK_PORT);

    if (bind(sockfd, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
        perror("ERROR on binding");
        close(sockfd);
        exit(1);
    }

    if (listen(sockfd, 5) < 0) {
        perror("ERROR on listen");
        close(sockfd);
        exit(1);
    }

    printf("Health check server listening on port %d\n", HEALTH_CHECK_PORT);
    return sockfd;
}

void handle_health_check_request(int client_sock) {
    // Bound the read so a silent client cannot stall the loop past WatchdogSec
    struct timeval timeout = { .tv_sec = 1, .tv_usec = 0 };
    setsockopt(client_sock, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout));

    char buffer[1024] = {0};
    // Read request (e.g., GET /healthz HTTP/1.1); the contents are ignored here
    (void)read(client_sock, buffer, sizeof(buffer) - 1);

    // Compute Content-Length from the body instead of hard-coding it
    const char *body = "{\"status\": \"ok\"}";
    char response[256];
    int len = snprintf(response, sizeof(response),
                       "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n"
                       "Content-Length: %zu\r\nConnection: close\r\n\r\n%s",
                       strlen(body), body);
    if (len > 0 && (size_t)len < sizeof(response)) {
        send(client_sock, response, (size_t)len, 0);
    }
    close(client_sock);
}

int main(int argc, char *argv[]) {
    (void)argc;
    (void)argv;
    // Parse arguments, load config, etc.
    // ...

    // Setup main application server (e.g., listening on PORT 8080)
    // ...

    // Setup health check server and keep its listening descriptor
    int health_check_server_fd = setup_health_check_server();

    // Tell systemd that startup is complete (required for Type=notify)
    sd_notify(0, "READY=1");

    while (1) {
        // Main application logic
        // ...

        // Send the watchdog keep-alive together with a status update.
        // systemd never polls the application: if this loop hangs, the
        // keep-alives stop and the watchdog times out after WatchdogSec.
        int r = sd_notify(0, "WATCHDOG=1\nSTATUS=Processing requests");
        if (r < 0) {
            // sd_notify returns a negative errno-style code; it does not set errno
            fprintf(stderr, "sd_notify failed: %s\n", strerror(-r));
        }

        // Example of handling health check requests (simplified)
        struct sockaddr_in client_addr;
        socklen_t client_len = sizeof(client_addr);
        int client_sock = accept(health_check_server_fd, (struct sockaddr *)&client_addr, &client_len);
        if (client_sock >= 0) {
            handle_health_check_request(client_sock);
        } else if (errno != EWOULDBLOCK && errno != EAGAIN) {
            perror("accept");
            // Handle error
        }

        usleep(100000); // Sleep for 100ms, well inside the 10s watchdog window
    }

    return 0;
}

After creating the service file and modifying your C application, reload `systemd`, enable, and start the service:

sudo systemctl daemon-reload
sudo systemctl enable my-c-app.service
sudo systemctl start my-c-app.service
sudo systemctl status my-c-app.service

This setup ensures that `systemd` actively monitors your C application. If the application becomes unresponsive (fails to signal `systemd` within `WatchdogSec`), `systemd` will automatically attempt to restart it.
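To make the signalling concrete: a notification is a single datagram sent to the `AF_UNIX` socket whose path `systemd` passes in the `NOTIFY_SOCKET` environment variable. The sketch below shows that wire protocol without libsystemd. The function name `notify_systemd` is hypothetical, and production code should generally prefer `sd_notify(3)` itself.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for sd_notify(): send `state` (e.g. "READY=1" or
 * "WATCHDOG=1") as one datagram to the socket named in $NOTIFY_SOCKET.
 * Returns 1 if sent, 0 if NOTIFY_SOCKET is unset, -1 on error.
 */
int notify_systemd(const char *state) {
    const char *path = getenv("NOTIFY_SOCKET");
    if (path == NULL || *path == '\0') {
        return 0; /* not started by systemd with notifications enabled */
    }

    struct sockaddr_un addr;
    size_t len = strlen(path);
    if (len >= sizeof(addr.sun_path)) {
        return -1; /* path does not fit in sockaddr_un */
    }
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, len);

    socklen_t addr_len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + len);
    if (addr.sun_path[0] == '@') {
        addr.sun_path[0] = '\0'; /* '@' marks a Linux abstract socket */
    } else {
        addr_len += 1; /* include the terminating NUL for pathname sockets */
    }

    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0) {
        return -1;
    }

    size_t state_len = strlen(state);
    ssize_t sent = sendto(fd, state, state_len, 0,
                          (const struct sockaddr *)&addr, addr_len);
    close(fd);
    return sent == (ssize_t)state_len ? 1 : -1;
}
```

Outside `systemd`, `NOTIFY_SOCKET` is unset and the function returns 0, so the same binary can still be run by hand during development.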

PostgreSQL Cluster Monitoring with `pg_monitor` and `pg_stat_statements`

Monitoring PostgreSQL clusters, especially in a high-availability setup on OVH, requires a multi-faceted approach. We’ll focus on key metrics that indicate performance, availability, and potential issues, using a custom script, `pg_monitor.sh` (unrelated to PostgreSQL’s predefined `pg_monitor` role), and the `pg_stat_statements` extension that ships with PostgreSQL.

Enabling and Configuring `pg_stat_statements`

`pg_stat_statements` is invaluable for identifying slow queries. Ensure it’s enabled in your `postgresql.conf`.

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000
pg_stat_statements.save = on

After modifying `postgresql.conf`, you must restart your PostgreSQL instances for the changes to take effect. Then, create the extension in each database you want to monitor:

-- In each database you want to monitor
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

You can then query `pg_stat_statements` to find the most resource-intensive queries:

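-- Column names below are for PostgreSQL 13 and later; on 12 and earlier,
-- use total_time, mean_time, and stddev_time instead.
SELECT
    calls,
    total_exec_time,
    rows,
    mean_exec_time,
    stddev_exec_time,
    userid::regrole AS "user",
    query
FROM
    pg_stat_statements
ORDER BY
    total_exec_time DESC
LIMIT 20;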

Custom Monitoring Script (`pg_monitor.sh`)

A shell script can aggregate critical metrics. This script should be run periodically (e.g., via cron or `systemd` timers) on each PostgreSQL node.

#!/bin/bash

# Configuration
PG_USER="monitor_user"
PG_HOST="localhost"
PG_PORT="5432"
LOG_FILE="/var/log/pg_monitor.log"
ALERT_THRESHOLD_CPU=80 # %
ALERT_THRESHOLD_MEM=80 # %
ALERT_THRESHOLD_DISK=90 # %
ALERT_THRESHOLD_CONNECTIONS=500
ALERT_THRESHOLD_SLOTS=0 # Minimum expected number of replication slots (0 never alerts; set to 1 or more to enable the check)

# --- Functions ---

log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

send_alert() {
    local metric="$1"
    local value="$2"
    local threshold="$3"
    local message="ALERT: PostgreSQL cluster issue on ${PG_HOST}:${PG_PORT}. Metric: ${metric}, Value: ${value}, Threshold: ${threshold}"
    log_message "$message"
    # In a production environment, integrate with your alerting system (e.g., PagerDuty, Slack, Prometheus Alertmanager)
    # echo "$message" | mail -s "PostgreSQL Alert on ${PG_HOST}" "$ALERT_EMAIL"  # ALERT_EMAIL: set this to your recipient address
}

check_replication_status() {
    local slots_used=$(psql -U "$PG_USER" -h "$PG_HOST" -p "$PG_PORT" -tAc "SELECT count(*) FROM pg_replication_slots;")
    if [ "$slots_used" -lt "$ALERT_THRESHOLD_SLOTS" ]; then
        send_alert "Replication Slots" "$slots_used" "$ALERT_THRESHOLD_SLOTS"
    fi
    # Add checks for streaming replication lag if applicable
}

check_connection_count() {
    local current_connections=$(psql -U "$PG_USER" -h "$PG_HOST" -p "$PG_PORT" -tAc "SELECT count(*) FROM pg_stat_activity;")
    local max_connections=$(psql -U "$PG_USER" -h "$PG_HOST" -p "$PG_PORT" -tAc "SHOW max_connections;")
    if [ "$current_connections" -gt "$ALERT_THRESHOLD_CONNECTIONS" ]; then
        send_alert "Current Connections" "$current_connections" "$ALERT_THRESHOLD_CONNECTIONS"
    fi
    if [ "$current_connections" -gt "$((max_connections * 90 / 100))" ]; then
        send_alert "Connection Usage" "$current_connections/$max_connections" "90% of max_connections"
    fi
}

check_disk_usage() {
    # -P keeps each filesystem on a single line so the awk field positions stay stable
    local disk_usage=$(df -P /var/lib/postgresql/data | awk 'NR==2 {print $5}' | sed 's/%//')
    if [ "$disk_usage" -gt "$ALERT_THRESHOLD_DISK" ]; then
        send_alert "Disk Usage" "$disk_usage%" "$ALERT_THRESHOLD_DISK%"
    fi
}

check_cpu_usage() {
    # This is a simplified check. For accurate PostgreSQL CPU usage,
    # consider tools like `pg_top` or more advanced system monitoring.
    local cpu_usage=$(top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
    if (( $(echo "$cpu_usage > $ALERT_THRESHOLD_CPU" | bc -l) )); then
        send_alert "CPU Usage" "${cpu_usage}%" "${ALERT_THRESHOLD_CPU}%"
    fi
}

check_memory_usage() {
    # Similar to CPU, this is a system-wide check.
    local mem_usage=$(free | grep Mem: | awk '{print $3/$2 * 100.0}')
    if (( $(echo "$mem_usage > $ALERT_THRESHOLD_MEM" | bc -l) )); then
        send_alert "Memory Usage" "${mem_usage}%" "${ALERT_THRESHOLD_MEM}%"
    fi
}

check_pg_is_running() {
    if ! pg_isready -h "$PG_HOST" -p "$PG_PORT" -U "$PG_USER" >/dev/null 2>&1; then
        send_alert "PostgreSQL Service" "Not Running" "Running"
        return 1
    fi
    return 0
}

# --- Main Execution ---
log_message "Starting PostgreSQL health check..."

if ! check_pg_is_running; then
    log_message "PostgreSQL is not running on ${PG_HOST}:${PG_PORT}. Exiting."
    exit 1
fi

check_cpu_usage
check_memory_usage
check_disk_usage
check_connection_count
check_replication_status

log_message "PostgreSQL health check finished."
exit 0

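To make this script executable and schedule it:

sudo chmod +x pg_monitor.sh
# Create a user with read-only permissions for monitoring
sudo -u postgres psql -c "CREATE USER monitor_user WITH PASSWORD 'your_secure_password';"
sudo -u postgres psql -c "GRANT pg_read_all_stats TO monitor_user;"
# cron runs non-interactively, so psql must find the password in ~/.pgpass
# (or the file named by PGPASSFILE) of the account that runs the script.
# Append an hourly entry to root's crontab. Piping only the new line into
# `crontab -` would replace every existing entry.
(sudo crontab -l 2>/dev/null; echo "0 * * * * /path/to/your/pg_monitor.sh") | sudo crontab -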
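The script leaves streaming replication lag as a to-do. One way to measure it is to compare the primary’s `pg_current_wal_lsn()` with a standby’s `replay_lsn` from `pg_stat_replication`. PostgreSQL can compute the difference server-side with `pg_wal_lsn_diff()`, but if a C component receives the two LSNs as text, the arithmetic looks roughly like the sketch below. The function names are hypothetical, and the parser assumes each half of the LSN fits in 32 bits.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse a PostgreSQL LSN in its text form "X/Y"
 * (two hexadecimal numbers: the high and low 32 bits) into a 64-bit value.
 * Returns 0 on success, -1 if the string is malformed.
 */
int parse_lsn(const char *text, uint64_t *out) {
    uint32_t hi = 0, lo = 0;
    char trailing;
    /* A third conversion succeeding means there was trailing garbage. */
    if (sscanf(text, "%" SCNx32 "/%" SCNx32 "%c", &hi, &lo, &trailing) != 2) {
        return -1;
    }
    *out = ((uint64_t)hi << 32) | lo;
    return 0;
}

/*
 * Replication lag in bytes between the primary's current WAL position and
 * a standby's replay position. Returns -1 if either LSN cannot be parsed.
 */
int64_t replication_lag_bytes(const char *primary_lsn, const char *replay_lsn) {
    uint64_t primary, replay;
    if (parse_lsn(primary_lsn, &primary) != 0 ||
        parse_lsn(replay_lsn, &replay) != 0) {
        return -1;
    }
    if (replay > primary) {
        return 0; /* positions were sampled at different moments; clamp */
    }
    return (int64_t)(primary - replay);
}
```

For example, `replication_lag_bytes("16/B374D848", "16/B374D000")` returns 2120, that is, 0x848 bytes of WAL not yet replayed.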

OVH Specific Considerations

When running on OVH, pay close attention to:

Network Latency: If your PostgreSQL cluster spans multiple OVH regions or availability zones, monitor inter-node latency. Tools like `ping` and `mtr` can help diagnose network issues.
Disk I/O: OVH offers various disk types (SSD, NVMe). Monitor I/O wait times and throughput using `iostat` to ensure your chosen storage meets performance requirements.
Resource Limits: Be aware of CPU and memory limits imposed by your OVH instance type. Use `top`, `htop`, and `free` to monitor resource utilization.
Firewall Rules: Ensure that PostgreSQL ports (default 5432) are accessible between your application servers and database nodes, and that monitoring tools can reach the database.
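The latency and firewall points can both be checked from the application host by timing a TCP connect to the database port. The sketch below is one way to do that in C. The function name `tcp_connect_ms` is hypothetical, and because it uses a blocking `connect()`, a silently filtered port may take the kernel’s full connect timeout to fail.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical probe: time a TCP connect to host:port.
 * Returns elapsed milliseconds for the first address that accepts the
 * connection, or -1.0 if name resolution or every connect attempt fails.
 */
double tcp_connect_ms(const char *host, const char *port) {
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    if (getaddrinfo(host, port, &hints, &res) != 0) {
        return -1.0;
    }

    double elapsed = -1.0;
    for (struct addrinfo *ai = res; ai != NULL; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) {
            continue;
        }

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        int rc = connect(fd, ai->ai_addr, ai->ai_addrlen);
        clock_gettime(CLOCK_MONOTONIC, &end);
        close(fd);

        if (rc == 0) {
            elapsed = (double)(end.tv_sec - start.tv_sec) * 1000.0 +
                      (double)(end.tv_nsec - start.tv_nsec) / 1e6;
            break;
        }
    }

    freeaddrinfo(res);
    return elapsed;
}
```

Calling `tcp_connect_ms("db-node-1", "5432")` (the host name is a placeholder) from an application server gives a rough connect latency, while a negative result points at a firewall rule, a stopped service, or a DNS problem.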

By combining `systemd`’s process management with detailed PostgreSQL metrics and custom scripting, you can build a resilient monitoring strategy for your C applications and PostgreSQL clusters on OVH.
