Server Monitoring Best Practices: Keeping Your C App and MySQL Clusters Alive on OVH

Proactive C Application Health Checks with `systemd` and `netcat`

Maintaining the uptime of a critical C application, especially one backed by a MySQL cluster, requires more than just basic process monitoring. We need to ensure the application is not only running but also responsive and healthy. A robust approach involves leveraging `systemd` for process management and `netcat` (`nc`) for simple, low-overhead health checks.

The core idea is to expose a simple health check endpoint within your C application that an external probe such as `netcat` can query periodically. This endpoint should perform essential internal checks (e.g., database connection pool status, internal queue depth) and return a clear success or failure status. We’ll then use `systemd`’s watchdog mechanism to restart the application if it stops reporting itself healthy.
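As a rough illustration of what those internal checks might look like, here is a minimal sketch. The `health_snapshot` struct, its field names, and the 90% queue threshold are all assumptions made for this example; your application will track its own state differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of internal state; field names are illustrative. */
struct health_snapshot {
    int db_pool_total;      /* connections the pool is configured for */
    int db_pool_usable;     /* connections that currently pass a liveness test */
    size_t queue_depth;     /* items waiting in the internal work queue */
    size_t queue_capacity;  /* maximum items the queue can hold */
};

enum health_status {
    HEALTH_OK = 0,
    HEALTH_DB_POOL_EXHAUSTED = 1,
    HEALTH_QUEUE_OVERLOADED = 2
};

/* Returns HEALTH_OK, or the first check that fails. */
enum health_status evaluate_health(const struct health_snapshot *s) {
    assert(s != NULL);
    /* Unhealthy if no database connection is usable. */
    if (s->db_pool_total <= 0 || s->db_pool_usable <= 0) {
        return HEALTH_DB_POOL_EXHAUSTED;
    }
    /* Unhealthy if the queue is more than 90% full (assumed threshold). */
    if (s->queue_capacity > 0 &&
        s->queue_depth * 10 > s->queue_capacity * 9) {
        return HEALTH_QUEUE_OVERLOADED;
    }
    return HEALTH_OK;
}

/* Human-readable reason, suitable for an "ERROR: [reason]" style reply. */
const char *health_status_reason(enum health_status st) {
    switch (st) {
    case HEALTH_OK: return "ok";
    case HEALTH_DB_POOL_EXHAUSTED: return "no usable database connections";
    case HEALTH_QUEUE_OVERLOADED: return "work queue above 90% capacity";
    }
    return "unknown";
}
```

Keeping the evaluation separate from the socket code makes the thresholds easy to unit test without a running server.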

Implementing the Health Check Endpoint in C

For this example, let’s assume your C application already listens on a specific TCP port for client connections. We’ll modify it to also respond to a specific command on that same port, or a dedicated health check port, to indicate its status. A common pattern is to listen for a simple string like “HEALTHCHECK\n” and respond with “OK\n” or “ERROR: [reason]\n”.

Here’s a simplified C snippet demonstrating the concept. This is a basic illustration; a production-ready implementation would require more robust error handling, thread safety, and potentially a separate listening socket for health checks to avoid blocking client requests.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define HEALTH_PORT 8081 // Or use the same port as your app
#define BUFFER_SIZE 1024

// Placeholder for your actual health check logic
int perform_internal_health_checks() {
    // Example: Check if database connection pool is healthy
    // Example: Check if critical queues are not overloaded
    // Return 0 for OK, non-zero for error
    return 0; // Assume healthy for this example
}

void handle_health_check(int client_socket) {
    char buffer[BUFFER_SIZE];
    ssize_t bytes_received;

    // Read the request (expecting "HEALTHCHECK\n")
    bytes_received = recv(client_socket, buffer, BUFFER_SIZE - 1, 0);
    if (bytes_received <= 0) {
        // Error or connection closed: release the descriptor before returning
        close(client_socket);
        return;
    }
    buffer[bytes_received] = '\0';

    if (strncmp(buffer, "HEALTHCHECK\n", 12) == 0) {
        if (perform_internal_health_checks() == 0) {
            const char* response = "OK\n";
            send(client_socket, response, strlen(response), 0);
        } else {
            const char* response = "ERROR: Internal check failed\n";
            send(client_socket, response, strlen(response), 0);
        }
    }
    close(client_socket);
}

// ... (rest of your application's main loop and socket handling)
// In your main loop, you'd need to fork or use threads to handle
// health check requests without blocking main client processing.
// For simplicity, this example shows a direct handler.
// A better approach would be a separate thread/process listening on HEALTH_PORT.

int main() {
    // ... (your application's initialization and main client listening logic)

    // Example of setting up a separate health check listener (simplified)
    int health_sockfd, new_sock;
    struct sockaddr_in server_addr, client_addr;
    socklen_t client_len = sizeof(client_addr);

    health_sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (health_sockfd < 0) {
        perror("Health socket creation failed");
        exit(EXIT_FAILURE);
    }

    int enable = 1;
    if (setsockopt(health_sockfd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(int)) < 0) {
        perror("setsockopt(SO_REUSEADDR) failed");
    }

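    // Zero the struct so no field (e.g. sin_zero) holds stack garbage
    memset(&server_addr, 0, sizeof(server_addr));
    server_addr.sin_family = AF_INET;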
    server_addr.sin_addr.s_addr = INADDR_ANY;
    server_addr.sin_port = htons(HEALTH_PORT);

    if (bind(health_sockfd, (struct sockaddr *)&server_addr, sizeof(server_addr)) < 0) {
        perror("Health socket bind failed");
        close(health_sockfd);
        exit(EXIT_FAILURE);
    }

    if (listen(health_sockfd, 5) < 0) {
        perror("Health socket listen failed");
        close(health_sockfd);
        exit(EXIT_FAILURE);
    }

    printf("Health check server listening on port %d\n", HEALTH_PORT);

    while (1) {
        // accept() overwrites client_len, so reset it before every call
        client_len = sizeof(client_addr);
        new_sock = accept(health_sockfd, (struct sockaddr *)&client_addr, &client_len);
        if (new_sock < 0) {
            perror("Health socket accept failed");
            continue;
        }
        // In a real app, fork() or pthread_create() here to handle_health_check
        // For this simplified example, we'll just handle one at a time.
        handle_health_check(new_sock);
    }

    close(health_sockfd);
    return 0;
}
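To exercise the endpoint from outside the process, the probe that `netcat` would perform can also be written in C. The sketch below is one possible client, not part of the application above: `probe_health_fd` and `probe_health_tcp` are hypothetical helper names, the 128-byte reply buffer is an arbitrary choice, and the code assumes Linux (it uses `MSG_NOSIGNAL`).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Sends "HEALTHCHECK\n" on an already-connected socket and reads the reply.
 * Returns 0 if the reply starts with "OK\n", 1 for any other reply,
 * and -1 on I/O error or receive timeout. */
int probe_health_fd(int fd, int timeout_sec) {
    static const char request[] = "HEALTHCHECK\n";
    const ssize_t request_len = (ssize_t)(sizeof(request) - 1);
    char reply[128];
    struct timeval tv;
    ssize_t n;

    assert(fd >= 0);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    /* Bound the wait so a hung server counts as a failure. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
        return -1;
    }
    /* MSG_NOSIGNAL avoids SIGPIPE if the peer has already closed. */
    if (send(fd, request, (size_t)request_len, MSG_NOSIGNAL) != request_len) {
        return -1;
    }
    n = recv(fd, reply, sizeof(reply) - 1, 0);
    if (n <= 0) {
        return -1;
    }
    reply[n] = '\0';
    return strncmp(reply, "OK\n", 3) == 0 ? 0 : 1;
}

/* Connects to ip:port over TCP (IPv4 only) and runs the probe. */
int probe_health_tcp(const char *ip, unsigned short port, int timeout_sec) {
    struct sockaddr_in addr;
    int fd, rc;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        return -1;
    }
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    rc = probe_health_fd(fd, timeout_sec);
    close(fd);
    return rc;
}
```

A monitoring script or agent could call `probe_health_tcp("127.0.0.1", 8081, 2)` and treat any non-zero result as a failed check.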

`systemd` Service Unit and Watchdog Configuration

Once your application is compiled with this health check capability, you need to integrate it with `systemd`. This involves creating a service unit file that defines how to start, stop, and monitor your application. Crucially, we’ll enable the `watchdog` feature.

First, ensure your application is started by `systemd`. Create a file like `/etc/systemd/system/my-c-app.service`:

[Unit]
Description=My Critical C Application
After=network.target mysql.service

[Service]
# User and Group to run the application as
User=appuser
Group=appgroup

# Working directory for the application
WorkingDirectory=/opt/my-c-app

# Path to your compiled C application executable
ExecStart=/opt/my-c-app/bin/my_c_app

# Restart policy: always restart if it crashes or fails checks
Restart=always
RestartSec=5s

# Watchdog support
# WatchdogSec tells systemd to expect a periodic keep-alive from the service:
# the service must call sd_notify(0, "WATCHDOG=1") within each interval.
# The netcat health endpoint does not send these notifications, so leave
# WatchdogSec disabled here; with it enabled, systemd would kill the service
# every 30 seconds.
# WatchdogSec=30s

# Environment variables if needed
# Environment="DB_HOST=127.0.0.1"
# Environment="DB_PORT=3306"

# Standard output and error logging
StandardOutput=journal
StandardError=journal

# Type=simple is default and suitable if ExecStart is the main process.
# If your app forks, consider Type=forking.
Type=simple

[Install]
WantedBy=multi-user.target

`systemd` itself does not poll TCP ports, so an external check against the health endpoint has to be driven by something else: a timer unit that runs a small `netcat` script, or a dedicated monitoring agent. `ExecStartPost` runs only once, right after the service starts, so it can verify startup but not ongoing health. `WatchdogSec`, by contrast, works only when the service itself sends keep-alive notifications via `sd_notify`, which is the preferred route to true watchdog functionality.

A more robust `systemd` watchdog integration involves the application itself signaling its health. If your C application can be modified to use `sd_notify` (part of `libsystemd-dev`), you would:

Link against `libsystemd`.
Call `sd_notify(0, "READY=1")` after initialization (this requires `Type=notify` in the unit file).
Periodically call `sd_notify(0, "WATCHDOG=1")` to signal it’s alive; an empty string does not reset the watchdog timer.
Set `WatchdogSec=10` (or appropriate interval) in the `.service` file.
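Under the hood, the notification in the steps above is a single datagram sent to the Unix socket named in the `NOTIFY_SOCKET` environment variable. If linking `libsystemd` is not an option, the protocol is small enough to sketch by hand. The helper name `notify_systemd` is made up for this example, and the code assumes Linux; it mirrors the return convention of `sd_notify` but is not a drop-in replacement for it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Sends one notification datagram (e.g. "READY=1" or "WATCHDOG=1") to the
 * socket named by $NOTIFY_SOCKET.
 * Returns >0 if sent, 0 if $NOTIFY_SOCKET is unset, negative errno on error. */
int notify_systemd(const char *state) {
    const char *path = getenv("NOTIFY_SOCKET");
    struct sockaddr_un addr;
    socklen_t addr_len;
    size_t path_len;
    ssize_t sent;
    int fd, saved_errno;

    assert(state != NULL);
    if (path == NULL || path[0] == '\0') {
        return 0; /* not started by systemd, or notifications disabled */
    }
    path_len = strlen(path);
    /* Only absolute paths and abstract ("@name") sockets are handled here. */
    if ((path[0] != '/' && path[0] != '@') ||
        path_len >= sizeof(addr.sun_path)) {
        return -EINVAL;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, path_len);
    if (path[0] == '@') {
        addr.sun_path[0] = '\0'; /* Linux abstract socket namespace */
    }
    addr_len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + path_len);

    fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0) {
        return -errno;
    }
    sent = sendto(fd, state, strlen(state), 0,
                  (struct sockaddr *)&addr, addr_len);
    saved_errno = errno;
    close(fd);
    return sent < 0 ? -saved_errno : 1;
}
```

When the process is not started by `systemd`, `NOTIFY_SOCKET` is unset and the function returns 0, so the same binary still runs normally from a shell.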

However, if modifying the C application is not feasible or you prefer an external check, you can use `netcat` in an `ExecStartPost` command to verify the endpoint once at startup and rely on `Restart=always` with `RestartSec` for recovery. Note that `Restart=always` only reacts when the process exits; it will not notice a process that is hung but still running. For true watchdog behavior with `WatchdogSec`, the application *must* signal its liveness.

For ongoing external checks, a `systemd` timer unit can run a script that queries the health port with `netcat` and restarts the service on failure, though this is often better handled by dedicated monitoring agents.

Let’s refine the `systemd` approach to use `Type=notify` and `sd_notify` in C, as this is the idiomatic way to use `WatchdogSec`.

Implementing `sd_notify` in C

Modify your C application to include `systemd` notification:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <systemd/sd-daemon.h> // Include systemd header

#define HEALTH_PORT 8081
#define BUFFER_SIZE 1024

// Function to perform internal health checks
int perform_internal_health_checks() {
    // ... your checks here ...
    return 0; // 0 for OK
}

// Function to send status updates to systemd
void send_systemd_status(const char* status) {
    // sd_notify sends the state string to the socket named in $NOTIFY_SOCKET.
    // The first argument (unset_environment) is 0 so that $NOTIFY_SOCKET stays
    // set and later calls keep working.
    // The second argument is the state string, e.g. "READY=1", "STATUS=...",
    // or "WATCHDOG=1" for periodic keep-alives.
    sd_notify(0, status);
}

// ... (socket handling code as before, but integrated with status updates)

int main() {
    // ... (socket setup for client connections) ...

    // --- Systemd Notification Setup ---
    // Signal that the service is ready after initialization
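    // sd_notify returns >0 on success, 0 if $NOTIFY_SOCKET is not set
    // (e.g. not started by systemd), and a negative errno on failure.
    if (sd_notify(0, "READY=1") <= 0) {
        fprintf(stderr, "systemd notification not enabled or failed.\n");
    } else {
        printf("Service signaled READY=1 to systemd.\n");
    }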

    // --- Periodic Health Check and Notification ---
    // This loop should run in parallel or be integrated with your main event loop.
    // For simplicity, showing a separate thread/loop concept.
    while (1) {
        // Perform internal checks
        if (perform_internal_health_checks() == 0) {
            // Reset the watchdog timer; only "WATCHDOG=1" counts as a keep-alive
            sd_notify(0, "WATCHDOG=1");
        } else {
            // Report the failure in `systemctl status`. No keep-alive is sent,
            // so systemd restarts the service once WatchdogSec elapses.
            sd_notify(0, "STATUS=Internal health check failed");
        }

        // Notify at less than half of WatchdogSec (15s in the unit file below)
        sleep(5);
    }

    // ... (rest of your application logic) ...
    return 0;
}

To compile this, you’ll need to link against `libsystemd`:

gcc my_c_app.c -o my_c_app -lsystemd -pthread -lrt # Add other necessary libs

Now, update your `my-c-app.service` file to use `Type=notify` and set `WatchdogSec`:

[Unit]
Description=My Critical C Application
After=network.target mysql.service

[Service]
User=appuser
Group=appgroup
WorkingDirectory=/opt/my-c-app
ExecStart=/opt/my-c-app/bin/my_c_app

# Use Type=notify to enable sd_notify integration
Type=notify

# Set a watchdog timeout. If the service doesn't notify systemd within this time,
# systemd will consider it failed and restart it.
WatchdogSec=15s

Restart=always
RestartSec=5s

StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

After creating/updating the service file, reload `systemd`, enable, and start your service:

sudo systemctl daemon-reload
sudo systemctl enable my-c-app.service
sudo systemctl start my-c-app.service
sudo systemctl status my-c-app.service

The `systemctl status` output should indicate that the service is `active (running)`. To confirm that the watchdog is configured, run `systemctl show my-c-app.service -p WatchdogUSec`. If `perform_internal_health_checks()` returns an error, the keep-alive notification will not be sent, and `systemd` will kill and restart the service once `WatchdogSec` elapses.
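Rather than hard-coding the sleep interval, the application can derive it from the unit file. When `WatchdogSec` is set, `systemd` passes the timeout to the service in the `WATCHDOG_USEC` environment variable (in microseconds), and the usual advice is to notify at half that interval. `libsystemd` offers `sd_watchdog_enabled()` for this; the sketch below shows the same idea without the library. The function name is invented for this example, and it skips the `WATCHDOG_PID` check that the library performs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns the suggested keep-alive interval in microseconds (half of
 * $WATCHDOG_USEC), or 0 if the watchdog is not enabled or the value
 * is malformed. */
uint64_t watchdog_ping_interval_usec(void) {
    const char *value = getenv("WATCHDOG_USEC");
    char *end = NULL;
    unsigned long long usec;

    if (value == NULL || value[0] < '0' || value[0] > '9') {
        return 0; /* unset, empty, signed, or non-numeric */
    }
    errno = 0;
    usec = strtoull(value, &end, 10);
    if (errno != 0 || *end != '\0' || usec == 0) {
        return 0;
    }
    return (uint64_t)(usec / 2);
}
```

With `WatchdogSec=15s`, `WATCHDOG_USEC` is `15000000` and the function returns `7500000`, i.e. a keep-alive every 7.5 seconds.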

Monitoring MySQL Cluster Health with `mysqldumpslow` and `pt-query-digest`

A healthy C application is only half the battle; it needs a responsive MySQL cluster. Slow queries are a primary indicator of database performance degradation. We’ll use `mysqldumpslow` and Percona Toolkit’s `pt-query-digest` to analyze the slow query log and identify problematic queries.

Configuring MySQL for Slow Query Logging

First, ensure your MySQL server is configured to log slow queries. This is done via the `my.cnf` (or `my.ini`) configuration file. The key parameters are:

[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 1  # Log queries taking longer than 1 second
# Optional: Log queries not using indexes
# log_queries_not_using_indexes = 1

After modifying `my.cnf`, restart the MySQL service:

sudo systemctl restart mysql
# or
sudo service mysql restart

Verify that the log file is being created and written to.

Analyzing Slow Queries with `pt-query-digest`

Percona Toolkit’s `pt-query-digest` is the industry standard for analyzing slow query logs. It aggregates similar queries, calculates statistics, and provides actionable insights.

Install Percona Toolkit if you haven’t already. On Debian/Ubuntu:

sudo apt-get update
sudo apt-get install percona-toolkit

To analyze the slow query log and output a report:

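# The redirect in "sudo cmd > file" runs as your own user, so write the report via tee
sudo pt-query-digest /var/log/mysql/mysql-slow.log | sudo tee /var/log/mysql/mysql-slow-report.txt > /dev/null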

The output report (`mysql-slow-report.txt`) will contain a summary of the slowest queries, grouped by similarity. In the profile table at the top, look for query classes with a high `Response time` or `Calls` count; in the per-query sections, compare `Exec time` and `Rows examine` against `Count`. The report also provides a “fingerprint” for each query type, making it easy to identify patterns.
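To make the fingerprint idea concrete: a fingerprint is the query with its literal values abstracted away, so that `WHERE id = 1` and `WHERE id = 987` fall into the same class. The sketch below is a deliberately simplified approximation and not `pt-query-digest`'s actual algorithm. It lowercases the text, collapses whitespace, and replaces quoted strings and standalone numbers with `?`; it does not handle comments, hex literals, or `IN (...)` list collapsing. Both function names are made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Writes a simplified fingerprint of `sql` into `out` (always NUL-terminated,
 * truncated if needed). Returns the number of characters written. */
size_t fingerprint_query(const char *sql, char *out, size_t out_size) {
    const unsigned char *p = (const unsigned char *)sql;
    size_t o = 0;
    int pending_space = 0;

    assert(sql != NULL && out != NULL && out_size > 0);
    /* Each iteration writes at most two characters, plus the final NUL. */
    while (*p != '\0' && o + 2 < out_size) {
        if (isspace(*p)) {
            pending_space = (o > 0); /* collapse runs, drop leading space */
            p++;
            continue;
        }
        if (pending_space) {
            out[o++] = ' ';
            pending_space = 0;
        }
        if (*p == '\'' || *p == '"') {
            /* Quoted string literal: skip to the closing quote. */
            unsigned char quote = *p++;
            while (*p != '\0' && *p != quote) {
                if (*p == '\\' && p[1] != '\0') {
                    p++; /* skip the escaped character */
                }
                p++;
            }
            if (*p == quote) {
                p++;
            }
            out[o++] = '?';
        } else if (isdigit(*p) &&
                   (o == 0 || !(isalnum((unsigned char)out[o - 1]) ||
                                out[o - 1] == '_'))) {
            /* Standalone number (not part of an identifier such as "t1"). */
            while (isdigit(*p) || *p == '.') {
                p++;
            }
            out[o++] = '?';
        } else {
            out[o++] = (char)tolower(*p);
            p++;
        }
    }
    out[o] = '\0';
    return o;
}

/* Returns 1 if both queries reduce to the same fingerprint. */
int same_query_class(const char *a, const char *b) {
    char fa[512], fb[512];
    fingerprint_query(a, fa, sizeof(fa));
    fingerprint_query(b, fb, sizeof(fb));
    return strcmp(fa, fb) == 0;
}
```

Grouping by a normalized form like this is what lets the report show one row for thousands of executions that differ only in their parameter values.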

Automating Analysis and Alerting

To make this proactive, we need to automate the analysis and set up alerts. A cron job can periodically run `pt-query-digest` and then parse its output for critical thresholds.

Create a script, e.g., `/opt/scripts/analyze_mysql_slowlog.sh`:

#!/bin/bash

LOG_FILE="/var/log/mysql/mysql-slow.log"
REPORT_FILE="/var/log/mysql/mysql-slow-report.txt"
ALERT_THRESHOLD_LATENCY="5" # Seconds: alert if total response time for a query type exceeds this
ALERT_THRESHOLD_EXECS="1000" # Alert if a query type executes more than 1000 times
ALERT_THRESHOLD_ROWS_EXAMINED="100000000" # Alert if total rows examined exceeds 100M
# The simplified check below does not apply these thresholds yet;
# they are intended for a fuller parser of the report.

# Rotate the slow query log if it gets too large or old.
# mysqldumpslow only summarizes the log and does not rotate it; use logrotate for that.
# For pt-query-digest, it's best to analyze a snapshot and then potentially clear/rotate.
# Let's assume logrotate handles the actual file rotation.

# Analyze the log file and report the 20 worst query classes
pt-query-digest --limit 20 "$LOG_FILE" > "$REPORT_FILE"

# Check for critical queries based on thresholds
# This is a simplified check. A more robust script would parse the REPORT_FILE
# more thoroughly using awk or grep with specific patterns.

# Example: pt-query-digest writes a "# Query N: ..." header for each query class it reports
if grep -q "^# Query [0-9]" "$REPORT_FILE"; then
    echo "Potential performance issues detected in $LOG_FILE:"
    # Extract and check specific query types if needed
    # For now, just report the existence of slow queries.
    echo "See $REPORT_FILE for details."

    # --- Alerting Mechanism ---
    # Integrate with your alerting system (e.g., PagerDuty, Slack, email)
    # Example: Send an email alert (ops@example.com is a placeholder address)
    # mail -s "MySQL Slow Query Alert" ops@example.com <<EOF
    # MySQL slow query log analysis detected potential issues.
    # Please review: $REPORT_FILE
    # EOF

    # Example: Post an alert to the Prometheus Alertmanager API (v2 expects a JSON array of alerts;
    # the label and annotation values here are placeholders)
    # curl -X POST -H "Content-Type: application/json" -d '[{"labels": {"alertname": "MySQLSlowQuery", "severity": "warning"}, "annotations": {"summary": "MySQL slow query detected", "report_url": "http://your-monitoring-server/reports/mysql-slow-report.txt"}}]' http://alertmanager.example.com/api/v2/alerts
fi

# Optional: Rotate or clear the log file after analysis if not handled by logrotate.
# mysqld keeps writing to the renamed file until it reopens its logs, so flush them afterwards.
# mv "$LOG_FILE" "/var/log/mysql/mysql-slow.log.processed_$(date +%Y%m%d%H%M%S)"
# mysqladmin flush-logs

Add this script to cron:

# Run analysis every 15 minutes
*/15 * * * * /opt/scripts/analyze_mysql_slowlog.sh
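The per-query threshold checks that the script leaves open can also be done directly against the slow log, since every entry carries a `# Query_time: ...` header line with the timing and row counts. As a hedged sketch, the C helper below parses that header and compares it against limits like the 5-second and 100M-row values used in the script. The struct and function names are illustrative, and the format string assumes the standard header layout; builds that append extra fields after `Rows_examined` should still parse, but that is worth verifying against your own log.

```c
#include <assert.h>
#include <stdio.h>

/* Values taken from one "# Query_time: ..." header line. */
struct slow_entry {
    double query_time;            /* seconds */
    double lock_time;             /* seconds */
    unsigned long rows_sent;
    unsigned long rows_examined;
};

/* Parses a slow-log header line. Returns 1 on success, 0 if the line
 * is not a Query_time header. */
int parse_slow_header(const char *line, struct slow_entry *e) {
    assert(line != NULL && e != NULL);
    /* Whitespace in the format matches any run of spaces in the log. */
    return sscanf(line,
                  "# Query_time: %lf Lock_time: %lf"
                  " Rows_sent: %lu Rows_examined: %lu",
                  &e->query_time, &e->lock_time,
                  &e->rows_sent, &e->rows_examined) == 4;
}

/* Returns 1 if the entry is slower or scans more rows than allowed. */
int exceeds_thresholds(const struct slow_entry *e,
                       double max_seconds,
                       unsigned long max_rows_examined) {
    assert(e != NULL);
    return e->query_time > max_seconds ||
           e->rows_examined > max_rows_examined;
}
```

A small program could read the log line by line with `fgets`, call `parse_slow_header` on each line, and count how many entries trip `exceeds_thresholds` before deciding whether to alert.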

OVH Specific Considerations: Network and Instance Monitoring

When running on OVH, several platform-specific aspects need attention:

Instance Resource Utilization

OVH instances (Public Cloud) provide metrics through their API and control panel. Ensure you’re monitoring CPU, RAM, Disk I/O, and Network I/O. Tools like `node_exporter` (for Prometheus) or Datadog/New Relic agents can collect these metrics.

For basic CLI checks, `top`, `htop`, `iostat`, and `vmstat` are essential:

# Real-time CPU/Mem usage
htop

# Disk I/O statistics
iostat -xz 5 # Report every 5 seconds, extended format, omitting idle devices

# Virtual Memory statistics
vmstat 5 # Report every 5 seconds

Network Connectivity and Latency

Monitor network latency between your application servers and the MySQL cluster, and from your application servers to any external services. OVH’s network is generally reliable, but inter-region or inter-datacenter latency can fluctuate.

Use `ping` and `mtr` for basic checks:

# Ping your MySQL primary node
ping mysql-primary.your-domain.com

# MTR provides traceroute and ping statistics
mtr --report --report-wide mysql-primary.your-domain.com

For more advanced monitoring, consider deploying `blackbox_exporter` (Prometheus) to actively probe your application endpoints and MySQL ports from different locations, simulating user experience.
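If you want the application itself to keep an eye on database reachability, a TCP connect timing is a cheap approximation of what such a probe measures. The following sketch times a single blocking `connect()` to an IPv4 address; `tcp_connect_latency_ms` is a name chosen for this example. It has no connect timeout, so a black-holed host can block for the kernel's default, which a production version would need to address (for instance with a non-blocking socket and `poll`).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Returns the time taken by a TCP connect to ip:port in milliseconds,
 * or -1.0 if the address is invalid or the connection fails. */
double tcp_connect_latency_ms(const char *ip, unsigned short port) {
    struct sockaddr_in addr;
    struct timespec start, end;
    int fd, rc;

    assert(ip != NULL);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        return -1.0;
    }

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1.0;
    }

    /* CLOCK_MONOTONIC is unaffected by wall-clock adjustments. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);
    if (rc < 0) {
        return -1.0;
    }

    return (double)(end.tv_sec - start.tv_sec) * 1000.0 +
           (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}
```

Calling it with the MySQL primary's address and port 3306 at a fixed interval gives a simple latency series that can be logged or exported alongside the other metrics.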

OVH Control Panel and Alerts

Leverage OVH’s built-in alerting for infrastructure-level issues:

Instance Alerts: Configure alerts for high CPU, low disk space, or network anomalies directly in the OVH control panel.
Network Alerts: Monitor bandwidth usage and potential DDoS attacks.
Database Service Alerts: If using OVH managed database services (e.g., Managed Databases for MySQL), configure their specific alerts for performance and availability.

These platform-level alerts are crucial as a first line of defense, complementing your application-specific monitoring.

Centralized Logging and Alerting Strategy

A distributed system requires centralized logging and a unified alerting strategy. Logs from your C application (`journald`), MySQL (`/var/log/mysql/`), and any analysis scripts should be aggregated.

Commonly used tools include:

ELK Stack (Elasticsearch, Logstash, Kibana): Powerful for log aggregation, searching, and visualization.
Prometheus + Alertmanager: Excellent for metrics-based alerting. Prometheus scrapes metrics over HTTP, so expose custom metrics from your C app, or push them from short-lived scripts via the Pushgateway.
Graylog: An open-source alternative to ELK.
Datadog, Splunk, New Relic: Commercial SaaS solutions offering comprehensive monitoring and alerting.
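For the Prometheus option, custom metrics from a C application are just lines of text in the exposition format, served over HTTP for Prometheus to scrape. As a small sketch, the helper below formats one gauge with its `HELP` and `TYPE` lines. The function name and the metric name used in the usage note are invented for this example, and serving the text over HTTP is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats one gauge in the Prometheus text exposition format.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * the buffer is too small. */
int format_gauge(char *out, size_t out_size,
                 const char *name, const char *help, double value) {
    int n;

    assert(out != NULL && name != NULL && help != NULL);
    assert(strlen(name) > 0);
    n = snprintf(out, out_size,
                 "# HELP %s %s\n"
                 "# TYPE %s gauge\n"
                 "%s %g\n",
                 name, help, name, name, value);
    if (n < 0 || (size_t)n >= out_size) {
        return -1;
    }
    return n;
}
```

For example, `format_gauge(buf, sizeof(buf), "myapp_queue_depth", "Items waiting in the work queue.", 42.0)` fills `buf` with three lines ending in `myapp_queue_depth 42`, which an Alertmanager rule can then compare against a threshold.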

For alerting, define clear thresholds and escalation policies. Ensure alerts are actionable and routed to the correct teams. Avoid alert fatigue by tuning thresholds and using severity levels.