Server Monitoring Best Practices: Keeping Your C App and MongoDB Clusters Alive on OVH

Proactive C Application Health Checks with Systemd

For critical C applications running on OVH infrastructure, robust health checking is paramount. Relying solely on external probes can lead to delayed detection of internal application failures. Implementing systemd service units with built-in health check mechanisms provides a more immediate and granular approach to application resilience. This involves leveraging systemd’s `ExecStartPre`, `ExecStartPost`, and crucially, `ExecStart` with a long-running process that periodically signals its health.

Consider a C application that manages network connections and data processing. We can run it as a systemd service whose liveness systemd itself supervises. The key is for the application to report periodically that it is alive, either by sending a keep-alive message over systemd’s notification socket or by updating a timestamp file that a separate checker inspects.

Systemd Service Unit Configuration

Here’s a sample systemd service unit file for a hypothetical C application named `my_c_app`.

[Unit]
Description=My Critical C Application
After=network.target

[Service]
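Type=notify
User=appuser
Group=appgroup
WorkingDirectory=/opt/my_c_app
ExecStartPre=/opt/my_c_app/scripts/pre_start_check.sh
ExecStart=/opt/my_c_app/bin/my_c_app --config /etc/my_c_app/config.conf
ExecStopPost=/opt/my_c_app/scripts/post_start_failure.sh
Restart=on-failure
RestartSec=5
WatchdogSec=30
KillMode=process

[Install]
WantedBy=multi-user.target

In this configuration:

Type=notify makes systemd wait for the application to send READY=1 before it treats the service as started. Use Type=simple (or Type=exec) for an application that stays in the foreground but sends no readiness message, and Type=forking for one that daemonizes itself.
ExecStartPre runs pre-flight checks before the main process starts; a non-zero exit status aborts the start. systemd never runs two instances of the same unit, so no duplicate-instance check is needed. Command lines in a unit file are not passed through a shell, so operators such as || only work inside an explicit /bin/sh -c '...' wrapper.
ExecStart is the main command to launch your C application.
ExecStopPost runs whenever the service stops, including after a failed start or a crash, and the script can inspect $SERVICE_RESULT to act only on failures. ExecStartPost is not suitable for failure handling, because it runs only after the main process has been started successfully.
Restart=on-failure ensures automatic restarts after a non-zero exit, a fatal signal, or a watchdog timeout.
RestartSec=5 adds a small delay before restarting.
WatchdogSec=30 is crucial. systemd expects a keep-alive message (WATCHDOG=1) from the application at least every 30 seconds. If none arrives in time, systemd terminates the service (with SIGABRT by default) and marks it as failed, and Restart=on-failure then starts it again.
KillMode=process ensures only the main process is signalled on stop/restart; any child processes keep running. The systemd documentation advises against this setting for that reason, so prefer the default (control-group) unless the children must survive.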

Implementing the C Application’s Watchdog Signal

Your C application needs to periodically signal systemd that it’s alive. The standard way is to send a short message over the notification socket that systemd passes to the service in the `$NOTIFY_SOCKET` environment variable; the `sd_notify()` function (part of `libsystemd`) does this for you. If your C application cannot link against `libsystemd`, a fallback is to update a timestamp file periodically and have a separate check inspect its age, because systemd cannot use a file as a watchdog source.

Method 1: Using `sd_notify()` (Recommended)

This method requires linking your C application with `libsystemd`. The application calls `sd_notify(0, "READY=1")` once, when initialization is complete, and then calls `sd_notify(0, "WATCHDOG=1")` periodically to show that it is still running and healthy. You can also send a free-form status line such as `STATUS=Processing request X`, which appears in the output of `systemctl status`.

#include <systemd/sd-daemon.h>
#include <unistd.h>
#include <time.h>

// ... your application logic ...

void send_heartbeat(void) {
    // Reset systemd's watchdog timer.
    // The first argument (0) is unset_environment: 0 keeps
    // $NOTIFY_SOCKET set so that later calls still work.
    sd_notify(0, "WATCHDOG=1");

    // Optionally, publish a one-line status for `systemctl status`.
    sd_notifyf(0, "STATUS=Heartbeat at %ld", (long)time(NULL));
}

int main(int argc, char *argv[]) {
    // ... parse arguments, initialize ...

    // Tell systemd that startup is complete. Sent once;
    // required when the unit uses Type=notify.
    sd_notify(0, "READY=1");

    // Main application loop
    while (1) {
        // ... perform application tasks ...

        // Send a heartbeat signal periodically
        send_heartbeat();

        // Sleep for well under WatchdogSec; half the interval is the usual choice.
        sleep(15); // e.g., 15 seconds for WatchdogSec=30
    }

    return 0;
}

To compile this, install the development headers (the `libsystemd-dev` package on Debian/Ubuntu) and link against `libsystemd`:

gcc my_c_app.c -o my_c_app -lsystemd

And in your systemd service file, you need to enable the watchdog:

[Service]
# ... other settings ...
WatchdogSec=30
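
When `WatchdogSec=` is set, systemd also exports the timeout to the service in the `WATCHDOG_USEC` environment variable (in microseconds), so the ping interval does not have to be hard-coded. The following is a minimal sketch of deriving the interval from that value. The helper name `watchdog_ping_interval_usec()` is hypothetical, and the sketch skips the `WATCHDOG_PID` check that libsystemd's `sd_watchdog_enabled()` also performs.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: takes the raw value of $WATCHDOG_USEC (may be NULL)
 * and returns a suggested ping interval in microseconds, i.e. half the
 * configured timeout. Returns 0 if the watchdog is disabled or the value
 * is not a positive decimal number. */
static uint64_t watchdog_ping_interval_usec(const char *watchdog_usec) {
    if (watchdog_usec == NULL || !isdigit((unsigned char)watchdog_usec[0]))
        return 0;

    errno = 0;
    char *end = NULL;
    unsigned long long usec = strtoull(watchdog_usec, &end, 10);
    if (errno != 0 || *end != '\0' || usec == 0)
        return 0;

    /* Pinging at half the timeout leaves room for scheduling delays. */
    return (uint64_t)(usec / 2);
}
```

In the main loop, the application could call `watchdog_ping_interval_usec(getenv("WATCHDOG_USEC"))` once at startup and sleep for that long between heartbeats, falling back to a fixed interval when the result is 0.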

Method 2: File Timestamp Monitoring (Fallback)

If direct `libsystemd` integration is not feasible, you can use a simpler approach: the C application periodically updates a timestamp file, and a separate check compares the file’s modification time with the current time. Note that `WatchdogSec` cannot be used for this. It only reacts to `WATCHDOG=1` notifications, and systemd has no directive that ties the watchdog to a file’s age. `RuntimeDirectory=` is still useful, because it gives the service a private, writable directory under `/run` for the heartbeat file. The check itself has to come from outside the main unit, typically a small script run on a schedule that restarts the service when the file is too old. Where possible, `sd_notify` remains the most idiomatic option.
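
If the only obstacle is the library dependency, it may be worth noting that the notification protocol is simple enough to speak directly: a message is a single datagram sent to the `AF_UNIX` socket named in `$NOTIFY_SOCKET`. The following is a minimal sketch under that assumption. The function name `notify_systemd()` is hypothetical, and the code handles only the path and abstract-socket (`@` prefix) address forms.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: sends one message (e.g. "WATCHDOG=1") to the socket
 * named by $NOTIFY_SOCKET. Returns 1 if sent, 0 if the variable is unset
 * (not running under systemd), or a negative errno value on error. */
static int notify_systemd(const char *message) {
    assert(message != NULL);

    const char *path = getenv("NOTIFY_SOCKET");
    if (path == NULL || path[0] == '\0')
        return 0;

    struct sockaddr_un addr;
    size_t path_len = strlen(path);
    /* Accept only absolute paths and abstract sockets ('@' prefix). */
    if ((path[0] != '/' && path[0] != '@') || path_len >= sizeof(addr.sun_path))
        return -EINVAL;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, path_len);
    if (addr.sun_path[0] == '@')
        addr.sun_path[0] = '\0'; /* abstract namespace: leading NUL byte */

    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -errno;

    socklen_t addr_len =
        (socklen_t)(offsetof(struct sockaddr_un, sun_path) + path_len);
    ssize_t sent = sendto(fd, message, strlen(message), 0,
                          (const struct sockaddr *)&addr, addr_len);
    int result = (sent < 0) ? -errno : 1; /* capture errno before close() */
    close(fd);
    return result;
}
```

With this in place, the application would call `notify_systemd("READY=1")` once after initialization and `notify_systemd("WATCHDOG=1")` from its main loop, keeping `WatchdogSec` usable without linking `libsystemd`.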

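The file-based approach can be built from a separate systemd service that checks the timestamp. Because the service below is `Type=oneshot`, it performs a single check each time it is started, so it has to be triggered repeatedly, for example by a systemd timer unit. It is ordered after the main service but deliberately does not use `Requires=`, since the restart issued by the script would otherwise propagate back to the monitor and interrupt it. This is less ideal than `sd_notify` but can work. The unit comes first, followed by the script it runs (`monitor_heartbeat.sh`):

[Unit]
Description=My C App Heartbeat Monitor
After=my_c_app.service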

[Service]
Type=oneshot
ExecStart=/opt/my_c_app/scripts/monitor_heartbeat.sh /opt/my_c_app/heartbeat.timestamp 30

#!/bin/bash

HEARTBEAT_FILE="$1"
MAX_AGE_SECONDS="$2"

if [ ! -f "$HEARTBEAT_FILE" ]; then
    echo "Heartbeat file $HEARTBEAT_FILE not found."
    exit 1
fi

LAST_MOD_TIME=$(stat -c %Y "$HEARTBEAT_FILE")
CURRENT_TIME=$(date +%s)
AGE=$((CURRENT_TIME - LAST_MOD_TIME))

if [ "$AGE" -gt "$MAX_AGE_SECONDS" ]; then
    echo "Heartbeat file $HEARTBEAT_FILE is too old (age: $AGE seconds, max: $MAX_AGE_SECONDS)."
    # Trigger restart of the main service
    systemctl restart my_c_app.service
    exit 1
else
    echo "Heartbeat OK. Age: $AGE seconds."
    exit 0
fi

The C application would then need to periodically `touch` this file. This is less efficient and adds complexity compared to `sd_notify`.

MongoDB Cluster Monitoring with Prometheus and Grafana on OVH

Monitoring a MongoDB replica set or sharded cluster on OVH requires a multi-faceted approach. Prometheus is an excellent choice for time-series data collection, and Grafana provides powerful visualization and alerting capabilities. We’ll focus on setting up the MongoDB exporter for Prometheus and configuring essential dashboards and alerts.

Setting up the MongoDB Exporter

A widely used MongoDB exporter for Prometheus is `mongodb_exporter`, a community project maintained by Percona rather than an official MongoDB product. It scrapes metrics directly from MongoDB instances. Ensure your MongoDB instances are accessible from where you’ll run the exporter.

Installation and Configuration

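Download the latest release from the exporter’s GitHub releases page. The URL, version, and archive layout below are illustrative, so substitute the values for the release you actually choose. For example, on a Debian/Ubuntu system: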

wget https://github.com/mongodb/mongodb-prometheus-exporter/releases/download/v0.15.0/mongodb_exporter-0.15.0.linux-amd64.tar.gz
tar -xzf mongodb_exporter-0.15.0.linux-amd64.tar.gz
sudo mv mongodb_exporter-0.15.0.linux-amd64/mongodb_exporter /usr/local/bin/
sudo mkdir -p /etc/mongodb_exporter
sudo mv mongodb_exporter-0.15.0.linux-amd64/mongodb_exporter.yml /etc/mongodb_exporter/
rm -rf mongodb_exporter-0.15.0.linux-amd64*

Create a dedicated user and directory for the exporter:

sudo useradd --system --no-create-home mongodb_exporter
sudo mkdir -p /var/lib/mongodb_exporter
sudo chown mongodb_exporter:mongodb_exporter /var/lib/mongodb_exporter
sudo mkdir -p /etc/mongodb_exporter
sudo chown mongodb_exporter:mongodb_exporter /etc/mongodb_exporter/mongodb_exporter.yml

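Configure the exporter to connect to your MongoDB instances. The example below uses a YAML configuration file (`mongodb_exporter.yml`). Option names differ between exporter versions, and some builds accept the connection string only through a command-line flag or the `MONGODB_URI` environment variable, so confirm the supported options with `mongodb_exporter --help`. For a replica set, you’d typically give a seed list of members together with the `replicaSet` name, and the driver discovers the rest. For sharded clusters, you’ll need to configure connections to config servers and shards.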

# Example mongodb_exporter.yml for a replica set
mongodb_uri: "mongodb://exporter_user:exporter_password@mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/?replicaSet=myReplicaSet&authSource=admin"
log_level: "info"
log_format: "json"
# Optional: specify which metrics to collect
# collect_metrics:
#   - "server_status"
#   - "repl_set_status"
#   - "db_stats"
#   - "coll_stats"
#   - "oplog_stats"

Create a MongoDB user with sufficient read-only privileges for the exporter. This user should be able to run commands like `serverStatus`, `replSetGetStatus`, `dbStats`, `collStats`, etc.

use admin
db.createUser({
  user: "exporter_user",
  pwd: "exporter_password",
  roles: [
    { role: "clusterMonitor", db: "admin" },
    { role: "readAnyDatabase", db: "admin" }
  ]
})

Systemd Service for the Exporter

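Create a systemd service file to manage the exporter process. The listen address `0.0.0.0:9216` exposes the metrics on every interface, so on a publicly reachable OVH host either restrict the port with a firewall or bind to a private address.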

[Unit]
Description=MongoDB Prometheus Exporter
Wants=network-online.target
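# Assuming mongodb.service is your MongoDB instance's service
# (the official packages name it mongod.service).
# systemd does not allow trailing comments on a directive line.
After=network-online.target mongodb.service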

[Service]
User=mongodb_exporter
Group=mongodb_exporter
Type=simple
ExecStart=/usr/local/bin/mongodb_exporter \
  --config.path="/etc/mongodb_exporter/mongodb_exporter.yml" \
  --web.listen-address="0.0.0.0:9216" \
  --log.level="info" \
  --log.format="json"

Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable mongodb_exporter.service
sudo systemctl start mongodb_exporter.service
sudo systemctl status mongodb_exporter.service

Verify that the exporter is running and exposing metrics at http://your_exporter_host:9216/metrics.

Prometheus Configuration for Scraping MongoDB Metrics

Add a scrape job to your Prometheus configuration file (`prometheus.yml`) to collect metrics from the exporter.

scrape_configs:
  - job_name: 'mongodb'
    static_configs:
      - targets: ['your_exporter_host_1:9216', 'your_exporter_host_2:9216'] # Add all exporter instances
    metrics_path: '/metrics'
    scrape_interval: '30s'
    scrape_timeout: '10s'

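Reload the Prometheus configuration, for example by sending the Prometheus process a SIGHUP or, if it was started with `--web.enable-lifecycle`, by sending an HTTP POST to its `/-/reload` endpoint.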

Essential MongoDB Metrics to Monitor

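When setting up dashboards and alerts, focus on these key metrics. Metric names vary between exporters and exporter versions, so check the names below against the actual output of your exporter’s `/metrics` endpoint: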

Availability: `mongodb_up` (1 if exporter can connect, 0 otherwise).
Replication Lag: `mongodb_replset_member_optime_diff` (difference in seconds between the primary’s oplog timestamp and a secondary’s). This is critical for ensuring data consistency.
Performance:

`mongodb_mongod_network_in_bytes_total` / `mongodb_mongod_network_out_bytes_total`
`mongodb_mongod_opcounters_insert_total`, `mongodb_mongod_opcounters_query_total`, `mongodb_mongod_opcounters_update_total`, `mongodb_mongod_opcounters_delete_total`
`mongodb_mongod_extra_locks_total` (indicates contention)
`mongodb_mongod_connections_current` / `mongodb_mongod_connections_available`

Resource Usage:

`mongodb_mongod_memory_resident`
`mongodb_mongod_disk_storage_bytes`
`mongodb_mongod_cpu_user_seconds_total` / `mongodb_mongod_cpu_system_seconds_total`

Errors: Monitor the MongoDB logs for specific error messages, and watch for symptoms that can be derived from metrics (e.g., a falling rate of `mongodb_mongod_opcounters_query_total` while `mongodb_mongod_connections_current` keeps climbing might indicate slow queries).

Grafana Dashboards and Alerts

Import pre-built MongoDB dashboards from Grafana’s dashboard repository (e.g., search for “MongoDB Prometheus”). Customize them to include the metrics most relevant to your specific workload and OVH environment. Key alerts to configure:

Replica Set Unhealthy: Alert if `mongodb_up` is 0 for any member.
High Replication Lag: Alert if `mongodb_replset_member_optime_diff` exceeds a defined threshold (e.g., 60 seconds) for a sustained period.
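Low Disk Space: Track the growth of `mongodb_mongod_disk_storage_bytes` against the capacity of the data volume and trigger alerts when approaching it. The metric reports only what MongoDB stores, so the free-space figure itself has to come from a host-level source such as node_exporter’s `node_filesystem_avail_bytes`.
High CPU/Memory Usage: Set thresholds on the rate of `mongodb_mongod_cpu_user_seconds_total` (it is a cumulative counter, so alert on `rate(...[5m])` rather than on the raw value) and on `mongodb_mongod_memory_resident`.
Connection Pool Exhaustion: Alert if `mongodb_mongod_connections_available` approaches zero, or equivalently if `mongodb_mongod_connections_current` makes up most of the sum of the two (e.g., more than 80%). The `available` value counts the unused connections that remain, not the configured limit.
Slow Operations: While direct slow query logging is better, a drop in opcounter rates while connection counts rise can be an indicator.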
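
The connection-exhaustion check comes down to a ratio of two gauges: MongoDB reports the connections currently open and the unused connections still available, so the limit is their sum. The arithmetic is sketched below in C to match the application code earlier in this article; in practice the same ratio would be written as a PromQL expression. The function name `connection_utilization()` and the 80% threshold mentioned afterwards are illustrative assumptions.

```c
#include <assert.h>

/* Hypothetical helper: fraction of the connection limit in use.
 * current   = connections currently open
 * available = unused connections still available
 * Returns a value in [0, 1], or 0 if both inputs are zero. */
static double connection_utilization(double current, double available) {
    assert(current >= 0.0 && available >= 0.0);

    double limit = current + available;
    return (limit > 0.0) ? current / limit : 0.0;
}
```

For example, 750 open connections with 250 still available gives a utilization of 0.75, which would stay below an 80% alert threshold.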

For example, an alert for replication lag:

# Grafana Alert Rule (PromQL)
# Name: High MongoDB Replication Lag
# Condition:
avg_over_time(mongodb_replset_member_optime_diff{job="mongodb", member_state="SECONDARY"}[5m]) > 60
# For: 5m
# Labels:
#   severity: warning
# Annotations:
#   summary: "MongoDB replication lag is high on {{ $labels.instance }} ({{ $value }}s)"
#   description: "The secondary instance {{ $labels.instance }} is {{ $value }} seconds behind the primary."

Regularly review these metrics and alerts, tuning thresholds based on your application’s behavior and OVH instance performance characteristics. This proactive monitoring strategy is essential for maintaining the stability and availability of your C applications and MongoDB clusters.