Server Monitoring Best Practices: Keeping Your C App and MongoDB Clusters Alive on OVH
Proactive C Application Health Checks with Systemd
For critical C applications running on OVH infrastructure, robust health checking is paramount. Relying solely on external probes can lead to delayed detection of internal application failures. Implementing systemd service units with built-in health check mechanisms provides a more immediate and granular approach to application resilience. This involves leveraging systemd’s `ExecStartPre`, `ExecStartPost`, and crucially, `ExecStart` with a long-running process that periodically signals its health.
Consider a C application that manages network connections and data processing. We can wrap this application within a systemd service that tracks its operational status. The key is to have the C application periodically report that it is alive, either through systemd's notification socket or, as a fallback, by updating a timestamp file that a separate check inspects.
Systemd Service Unit Configuration
Here’s a sample systemd service unit file for a hypothetical C application named `my_c_app`.
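[Unit]
Description=My Critical C Application
After=network.target
[Service]
Type=simple
User=appuser
Group=appgroup
WorkingDirectory=/opt/my_c_app
ExecStartPre=/opt/my_c_app/scripts/pre_start_check.sh
ExecStart=/opt/my_c_app/bin/my_c_app --config /etc/my_c_app/config.conf
ExecStopPost=/opt/my_c_app/scripts/post_start_failure.sh
Restart=on-failure
RestartSec=5
WatchdogSec=30
KillMode=process
[Install]
WantedBy=multi-user.target
In this configuration:
- `Type=simple` suits applications that stay in the foreground and do not fork or daemonize themselves. If your C app daemonizes, use `Type=forking` instead; `Type=exec` behaves like `simple` but also reports a failure when the binary cannot be executed.
- `ExecStartPre` runs pre-flight checks before the main process starts, and a non-zero exit aborts the start. systemd runs these commands directly rather than through a shell, so operators such as `||` have no effect on this line; put that kind of logic inside the script. A check for an already-running instance is unnecessary because systemd never starts a second copy of the same unit.
- `ExecStart` is the main command to launch your C application.
- `ExecStopPost` runs a script after the service has stopped, including when it failed to start or was killed by the watchdog. The script can inspect `$SERVICE_RESULT` to tell a failure from a clean stop. (`ExecStartPost`, by contrast, only runs after the main process has been started successfully, so it cannot act as a failure hook.)
- `Restart=on-failure` ensures automatic restarts after a non-zero exit, a fatal signal, a timeout, or a missed watchdog deadline.
- `RestartSec=5` adds a small delay before restarting.
- `WatchdogSec=30` is crucial. This tells systemd to expect a keep-alive message from the application at least every 30 seconds. If the message does not arrive, systemd considers the service failed and terminates it, and `Restart=on-failure` then brings it back.
- `KillMode=process` ensures only the main process is terminated on stop/restart.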
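When `WatchdogSec=` is set, systemd passes the timeout to the service in the `WATCHDOG_USEC` environment variable (in microseconds), usually together with `WATCHDOG_PID`. libsystemd's `sd_watchdog_enabled()` reads these for you; the sketch below does the same by hand so that the ping interval follows the unit file instead of being hard-coded. The function name `watchdog_ping_interval_usec` is hypothetical, and pinging at half the timeout is a common convention rather than a requirement.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns the suggested interval between keep-alive
 * pings in microseconds, or 0 if no watchdog applies to this process. */
uint64_t watchdog_ping_interval_usec(void)
{
    const char *usec_str = getenv("WATCHDOG_USEC");
    if (usec_str == NULL || *usec_str == '\0')
        return 0;

    /* If WATCHDOG_PID is present, the watchdog applies only to that PID. */
    const char *pid_str = getenv("WATCHDOG_PID");
    if (pid_str != NULL && *pid_str != '\0') {
        char *pid_end = NULL;
        errno = 0;
        long pid = strtol(pid_str, &pid_end, 10);
        if (errno != 0 || *pid_end != '\0' || pid != (long)getpid())
            return 0;
    }

    char *end = NULL;
    errno = 0;
    unsigned long long usec = strtoull(usec_str, &end, 10);
    if (errno != 0 || *end != '\0' || usec == 0)
        return 0;

    /* Ping at half the timeout so one delayed loop iteration is tolerated. */
    return (uint64_t)(usec / 2);
}
```

With `WatchdogSec=30`, systemd would export `WATCHDOG_USEC=30000000`, and this helper would suggest a ping every 15 seconds.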
Implementing the C Application’s Watchdog Signal
Your C application needs to periodically signal systemd that it's alive. It does so by sending a short message to the Unix socket that systemd exposes to the service through the `NOTIFY_SOCKET` environment variable. The `sd_notify()` function (part of `libsystemd`) wraps this for you. If your C application cannot link against `libsystemd`, a fallback is to update a timestamp file periodically and have a separate check act on the file's age, since systemd's watchdog itself does not watch files.
Method 1: Using sd_notify() (Recommended)
This method requires linking your C application with `libsystemd`. The application calls `sd_notify(0, "READY=1")` once when startup is complete, then periodically calls `sd_notify(0, "WATCHDOG=1")` to tell systemd it is still healthy. For more detailed status, you can also send messages such as `STATUS=Processing request X`, which appear in the output of `systemctl status`.
#include <systemd/sd-daemon.h>
#include <unistd.h>
#include <stdio.h>
#include <time.h>
// ... your application logic ...
void send_heartbeat(void) {
    // Reset systemd's watchdog timer.
    // The first argument (0) keeps $NOTIFY_SOCKET set so later calls still work.
    sd_notify(0, "WATCHDOG=1");
    // Optionally, send a status update.
    // ctime() is avoided here: its trailing newline would split the message.
    char status_msg[128];
    snprintf(status_msg, sizeof(status_msg),
             "STATUS=Heartbeat at %lld", (long long)time(NULL));
    sd_notify(0, status_msg);
}
int main(int argc, char *argv[]) {
    (void)argc;
    (void)argv;
    // ... parse arguments, initialize ...
    // Tell systemd that startup has finished (required with Type=notify).
    sd_notify(0, "READY=1");
    // Main application loop
    while (1) {
        // ... perform application tasks ...
        // Send a heartbeat signal periodically
        send_heartbeat();
        // Sleep for well under WatchdogSec; half the timeout or less leaves headroom.
        sleep(10); // e.g., 10 seconds for WatchdogSec=30
    }
    return 0;
}
To compile this, you'll need the libsystemd development headers (the `libsystemd-dev` package on Debian/Ubuntu) and must link against `libsystemd`:
gcc my_c_app.c -o my_c_app -lsystemd
And in your systemd service file, you need to enable the watchdog:
[Service]
# ... other settings ...
WatchdogSec=30
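If the only obstacle is linking against `libsystemd`, the notification protocol can also be spoken directly: it is a single datagram sent to the Unix socket named in `NOTIFY_SOCKET`. The following is a minimal sketch of that idea, not the libsystemd implementation. The function name `notify_systemd` is hypothetical, and error handling is reduced to a return code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical stand-in for sd_notify(): sends `state` (e.g. "WATCHDOG=1")
 * to the socket named in $NOTIFY_SOCKET.
 * Returns 1 if sent, 0 if no socket is configured, -1 on error. */
int notify_systemd(const char *state)
{
    const char *path = getenv("NOTIFY_SOCKET");
    if (path == NULL || *path == '\0')
        return 0; /* not running under systemd with notifications enabled */

    struct sockaddr_un addr;
    size_t len = strlen(path);
    if ((path[0] != '/' && path[0] != '@') || len >= sizeof(addr.sun_path))
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, len);

    /* A leading '@' denotes a Linux abstract socket: first byte becomes NUL
     * and the name is not NUL-terminated. Path sockets include the NUL. */
    socklen_t addr_len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + len);
    if (path[0] == '@')
        addr.sun_path[0] = '\0';
    else
        addr_len += 1;

    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    size_t state_len = strlen(state);
    ssize_t sent = sendto(fd, state, state_len, 0,
                          (struct sockaddr *)&addr, addr_len);
    close(fd);
    return sent == (ssize_t)state_len ? 1 : -1;
}
```

Calling `notify_systemd("WATCHDOG=1")` from the main loop would then take the place of the `sd_notify()` call, with `WatchdogSec=` still set in the unit file.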
Method 2: File Timestamp Monitoring (Fallback)
If direct `libsystemd` integration is not feasible, you can use a simpler approach in which the C application periodically updates a timestamp file. Be aware that systemd's `WatchdogSec` is designed for the `sd_notify()` protocol and cannot watch a file, and that tools such as `systemd-analyze verify` only check unit files for errors. For file-based monitoring you therefore need a custom script that checks the file's age and triggers a restart if it is too old. `RuntimeDirectory=` or `StateDirectory=` can give the application a writable location for the file (under `/run` or `/var/lib` respectively). Where it is available, `sd_notify` remains the most idiomatic option.
A more robust file-based approach involves a separate systemd service that monitors the timestamp. Because a `Type=oneshot` unit performs one check and then exits, it has to be triggered repeatedly, for example from a systemd timer unit. This is less ideal than `sd_notify` but can work.
[Unit]
Description=My C App Heartbeat Monitor
Requires=my_c_app.service
After=my_c_app.service
[Service]
Type=oneshot
ExecStart=/opt/my_c_app/scripts/monitor_heartbeat.sh /opt/my_c_app/heartbeat.timestamp 30
#!/bin/bash
HEARTBEAT_FILE="$1"
MAX_AGE_SECONDS="$2"
if [ ! -f "$HEARTBEAT_FILE" ]; then
    echo "Heartbeat file $HEARTBEAT_FILE not found."
    exit 1
fi
LAST_MOD_TIME=$(stat -c %Y "$HEARTBEAT_FILE")
CURRENT_TIME=$(date +%s)
AGE=$((CURRENT_TIME - LAST_MOD_TIME))
if [ "$AGE" -gt "$MAX_AGE_SECONDS" ]; then
    echo "Heartbeat file $HEARTBEAT_FILE is too old (age: $AGE seconds, max: $MAX_AGE_SECONDS)."
    # Trigger restart of the main service
    systemctl restart my_c_app.service
    exit 1
else
    echo "Heartbeat OK. Age: $AGE seconds."
    exit 0
fi
The C application would then need to periodically `touch` this file. This is less efficient and adds complexity compared to `sd_notify`.
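On the application side, "touching" the file does not require spawning the `touch` command. A minimal sketch, assuming a POSIX system and a path the service user can write to, is shown below; the function name `touch_heartbeat` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: creates the heartbeat file if it does not exist and
 * sets its access and modification times to the current time.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int touch_heartbeat(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* Passing NULL asks futimens() to use the current time for both stamps. */
    int rc = futimens(fd, NULL);
    close(fd);
    return rc;
}
```

The main loop would call something like `touch_heartbeat("/opt/my_c_app/heartbeat.timestamp")` at an interval comfortably shorter than the maximum age passed to the monitor script.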
MongoDB Cluster Monitoring with Prometheus and Grafana on OVH
Monitoring a MongoDB replica set or sharded cluster on OVH requires a multi-faceted approach. Prometheus is an excellent choice for time-series data collection, and Grafana provides powerful visualization and alerting capabilities. We’ll focus on setting up the MongoDB exporter for Prometheus and configuring essential dashboards and alerts.
Setting up the MongoDB Exporter
A commonly used Prometheus exporter for MongoDB is `mongodb_exporter`. It collects metrics directly from MongoDB instances and exposes them over HTTP for Prometheus to scrape. Ensure your MongoDB instances are accessible from where you'll run the exporter.
Installation and Configuration
Download the latest release from the official GitHub repository. The version number and archive name below are examples, so check the releases page for the current ones. For example, on a Debian/Ubuntu system:
wget https://github.com/mongodb/mongodb-prometheus-exporter/releases/download/v0.15.0/mongodb_exporter-0.15.0.linux-amd64.tar.gz
tar -xzf mongodb_exporter-0.15.0.linux-amd64.tar.gz
sudo mkdir -p /etc/mongodb_exporter
sudo mv mongodb_exporter-0.15.0.linux-amd64/mongodb_exporter /usr/local/bin/
sudo mv mongodb_exporter-0.15.0.linux-amd64/mongodb_exporter.yml /etc/mongodb_exporter/
rm -rf mongodb_exporter-0.15.0.linux-amd64*
Create a dedicated user and directory for the exporter:
sudo useradd --system --no-create-home mongodb_exporter
sudo mkdir -p /var/lib/mongodb_exporter
sudo chown mongodb_exporter:mongodb_exporter /var/lib/mongodb_exporter
sudo mkdir -p /etc/mongodb_exporter
sudo chown mongodb_exporter:mongodb_exporter /etc/mongodb_exporter/mongodb_exporter.yml
Configure the exporter to connect to your MongoDB instances. The configuration file (`mongodb_exporter.yml`) uses a YAML format. For a replica set, you’d typically specify the connection string for one of the members, and the exporter will discover the rest. For sharded clusters, you’ll need to configure connections to config servers and shards.
# Example mongodb_exporter.yml for a replica set
mongodb_uri: "mongodb://exporter_user:exporter_password@mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/?replicaSet=myReplicaSet&authSource=admin"
log_level: "info"
log_format: "json"
# Optional: specify which metrics to collect
# collect_metrics:
#   - "server_status"
#   - "repl_set_status"
#   - "db_stats"
#   - "coll_stats"
#   - "oplog_stats"
Create a MongoDB user with sufficient read-only privileges for the exporter. This user should be able to run commands like `serverStatus`, `replSetGetStatus`, `dbStats`, `collStats`, etc.
use admin
db.createUser({
  user: "exporter_user",
  pwd: "exporter_password",
  roles: [
    { role: "clusterMonitor", db: "admin" },
    { role: "readAnyDatabase", db: "admin" }
  ]
})
Systemd Service for the Exporter
Create a systemd service file to manage the exporter process.
[Unit]
Description=MongoDB Prometheus Exporter
Wants=network-online.target
# Assuming mongodb.service is your MongoDB instance's service.
# (Comments must be on their own line; systemd does not support trailing comments.)
After=network-online.target mongodb.service
[Service]
User=mongodb_exporter
Group=mongodb_exporter
Type=simple
ExecStart=/usr/local/bin/mongodb_exporter \
    --config.path="/etc/mongodb_exporter/mongodb_exporter.yml" \
    --web.listen-address="0.0.0.0:9216" \
    --log.level="info" \
    --log.format="json"
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable mongodb_exporter.service
sudo systemctl start mongodb_exporter.service
sudo systemctl status mongodb_exporter.service
Verify that the exporter is running and exposing metrics at http://your_exporter_host:9216/metrics.
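The `/metrics` endpoint returns plain text with one sample per line (`name value` or `name{labels} value`), plus `# HELP` and `# TYPE` comment lines. As a rough illustration of what "verifying" can mean beyond opening the URL, the sketch below scans such text for a metric like `mongodb_up`. Fetching the text (for example with `curl`) is left out, the function name `find_metric_value` is hypothetical, and the parser is deliberately simplified: it assumes no `}` appears inside label values.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: finds the first sample of `name` in Prometheus
 * text-format output and stores its value in *out.
 * Returns 1 if a sample was found, 0 otherwise. */
int find_metric_value(const char *text, const char *name, double *out)
{
    size_t name_len = strlen(name);
    const char *line = text;

    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t line_len = eol ? (size_t)(eol - line) : strlen(line);

        /* Comment lines start with '#', so they never match a metric name. */
        if (line_len > name_len && strncmp(line, name, name_len) == 0) {
            const char *p = line + name_len;
            const char *end = line + line_len;
            if (*p == '{') {
                /* Skip the label set (simplified: first '}' on the line). */
                const char *close = memchr(p, '}', (size_t)(end - p));
                p = close ? close + 1 : end;
            }
            /* Requiring a space here rejects longer names sharing the prefix. */
            if (p < end && *p == ' ') {
                char *num_end = NULL;
                double value = strtod(p + 1, &num_end);
                if (num_end != p + 1) {
                    *out = value;
                    return 1;
                }
            }
        }
        if (eol == NULL)
            break;
        line = eol + 1;
    }
    return 0;
}
```

A health probe could treat "metric missing" and "`mongodb_up` equal to 0" as two distinct failure cases: the first points at the exporter, the second at MongoDB.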
Prometheus Configuration for Scraping MongoDB Metrics
Add a scrape job to your Prometheus configuration file (`prometheus.yml`) to collect metrics from the exporter.
scrape_configs:
  - job_name: 'mongodb'
    static_configs:
      - targets: ['your_exporter_host_1:9216', 'your_exporter_host_2:9216'] # Add all exporter instances
    metrics_path: '/metrics'
    scrape_interval: '30s'
    scrape_timeout: '10s'
Reload the Prometheus configuration, for example by sending `SIGHUP` to the Prometheus process or by restarting its service.
Essential MongoDB Metrics to Monitor
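When setting up dashboards and alerts, focus on these key metrics. Exact metric names differ between exporter versions, so confirm them against your exporter's `/metrics` output: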
- Availability: `mongodb_up` (1 if exporter can connect, 0 otherwise).
- Replication Lag: `mongodb_replset_member_optime_diff` (difference in seconds between the primary’s oplog timestamp and a secondary’s). This is critical for ensuring data consistency.
- Performance:
  - `mongodb_mongod_network_in_bytes_total` / `mongodb_mongod_network_out_bytes_total`
  - `mongodb_mongod_opcounters_insert_total`, `mongodb_mongod_opcounters_query_total`, `mongodb_mongod_opcounters_update_total`, `mongodb_mongod_opcounters_delete_total`
  - `mongodb_mongod_extra_locks_total` (indicates contention)
  - `mongodb_mongod_connections_current` / `mongodb_mongod_connections_available`
- Resource Usage:
  - `mongodb_mongod_memory_resident`
  - `mongodb_mongod_disk_storage_bytes`
  - `mongodb_mongod_cpu_user_seconds_total` / `mongodb_mongod_cpu_system_seconds_total`
- Errors: Monitor logs for specific error messages that the exporter might surface or that can be derived from other metrics (e.g., high `mongodb_mongod_opcounters_query_total` with low throughput might indicate slow queries).
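Most of the `_total` metrics above are cumulative counters, so dashboards usually plot their rate of change rather than the raw value. The arithmetic behind that is sketched below for two consecutive scrapes. It is a simplification of what PromQL's `rate()` does, not a reimplementation, and the function name `counter_rate_per_sec` is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: average per-second increase of a counter between two
 * scrapes. If the counter went backwards (for example because mongod
 * restarted), the newer value is treated as the increase since the reset.
 * Returns -1.0 if the timestamps are not strictly increasing. */
double counter_rate_per_sec(double prev_value, double prev_time_s,
                            double curr_value, double curr_time_s)
{
    double elapsed = curr_time_s - prev_time_s;
    if (elapsed <= 0.0)
        return -1.0;

    double increase = (curr_value >= prev_value) ? curr_value - prev_value
                                                 : curr_value;
    return increase / elapsed;
}
```

For instance, an insert counter that moves from 1000 to 1600 across a 30-second scrape interval corresponds to 20 inserts per second.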
Grafana Dashboards and Alerts
Import pre-built MongoDB dashboards from Grafana’s dashboard repository (e.g., search for “MongoDB Prometheus”). Customize them to include the metrics most relevant to your specific workload and OVH environment. Key alerts to configure:
- Replica Set Unhealthy: Alert if `mongodb_up` is 0 for any member.
- High Replication Lag: Alert if `mongodb_replset_member_optime_diff` exceeds a defined threshold (e.g., 60 seconds) for a sustained period.
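- Low Disk Space: Monitor `mongodb_mongod_disk_storage_bytes` against the capacity of the data volume and trigger alerts when usage approaches it.
- High CPU/Memory Usage: Set thresholds for `mongodb_mongod_memory_resident` and for the rate of `mongodb_mongod_cpu_user_seconds_total`. The CPU metric is a cumulative counter, so alert on `rate(...[5m])` rather than on the raw value.
- Connection Pool Exhaustion: Alert if `mongodb_mongod_connections_available` approaches zero, that is, when `mongodb_mongod_connections_current` makes up most of the total (current plus available).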
- Slow Operations: While direct slow query logging is better, high opcounter rates without corresponding throughput can be an indicator.
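For the connection alert, it helps to remember that MongoDB's `serverStatus` reports `available` as the number of unused connections that remain, not as the configured limit. One way to express "approaching exhaustion" is therefore a utilization ratio, sketched below. Both function names are hypothetical, and the threshold is something to tune for your workload.

```c
#include <assert.h>

/* Hypothetical helper: fraction of the connection limit currently in use,
 * computed as current / (current + available).
 * Returns -1.0 if the inputs are negative or both zero. */
double connection_utilization(double current, double available)
{
    double total = current + available;
    if (current < 0.0 || available < 0.0 || total <= 0.0)
        return -1.0;
    return current / total;
}

/* Returns 1 if utilization is valid and at or above `threshold` (0..1). */
int connections_nearly_exhausted(double current, double available,
                                 double threshold)
{
    double utilization = connection_utilization(current, available);
    return utilization >= 0.0 && utilization >= threshold;
}
```

With 900 connections in use and 100 still available, utilization is 0.9, which would trip a threshold of 0.85.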
For example, an alert for replication lag:
# Grafana Alert Rule (PromQL)
# Name: High MongoDB Replication Lag
# Condition:
avg_over_time(mongodb_replset_member_optime_diff{job="mongodb", member_state="SECONDARY"}[5m]) > 60
# For: 5m
# Labels:
# severity: warning
# Annotations:
# summary: "MongoDB replication lag is high on {{ $labels.instance }} ({{ $value }}s)"
# description: "The secondary instance {{ $labels.instance }} is {{ $value }} seconds behind the primary."
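The `For: 5m` clause means the condition must hold continuously for five minutes before the alert fires, so a brief spike in lag does not page anyone. The logic can be sketched as a small state tracker. This is an illustration of the idea, not how Grafana or Prometheus implement it, and the struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical tracker for one secondary; all names are illustrative. */
struct lag_alert {
    double threshold_s;    /* lag above this counts as a breach (60 in the rule) */
    double for_s;          /* how long the breach must last (300 for "5m") */
    double breach_start_s; /* time at which the current breach began */
    int breaching;         /* 1 while lag has stayed above the threshold */
};

/* Feed one evaluation; returns 1 when the alert should fire, else 0. */
int lag_alert_update(struct lag_alert *a, double now_s, double lag_s)
{
    if (lag_s <= a->threshold_s) {
        a->breaching = 0; /* any healthy sample resets the timer */
        return 0;
    }
    if (!a->breaching) {
        a->breaching = 1;
        a->breach_start_s = now_s;
    }
    return (now_s - a->breach_start_s) >= a->for_s;
}
```

A lag of 90 seconds that starts at t=0 would fire at t=300 if it never drops back to 60 seconds or below in between; a single healthy evaluation restarts the clock.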
Regularly review these metrics and alerts, tuning thresholds based on your application’s behavior and OVH instance performance characteristics. This proactive monitoring strategy is essential for maintaining the stability and availability of your C applications and MongoDB clusters.