Server Monitoring Best Practices: Keeping Your C++ App and MySQL Clusters Alive on OVH

Proactive C++ Application Health Checks with `systemd` and `netcat`

Maintaining the health of C++ applications, especially those handling critical data or user traffic, requires more than just basic process monitoring. We need to ensure the application is not only running but also responsive and capable of performing its core functions. For applications deployed on OVH infrastructure, leveraging `systemd` for service management and `netcat` for simple health checks offers a robust, low-overhead solution.

The strategy involves creating a `systemd` service unit that defines how to start, stop, and, crucially, check the health of your C++ application. This health check will typically involve the application listening on a specific network port. We can then use `netcat` (or `nc`) to attempt a connection and potentially send a simple command or just verify port accessibility.

Configuring the `systemd` Service Unit

Let’s assume your C++ application, `my_cpp_app`, is compiled and installed in `/usr/local/bin/my_cpp_app` and listens on port `8080` for health check requests. We’ll create a `systemd` service file at `/etc/systemd/system/my_cpp_app.service`.

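The `[Unit]` section defines metadata about the service, its dependencies, and its description. The `[Service]` section details how to manage the service, including the executable path, the restart policy, and the `ExecStart` and `ExecStartPost` directives. `ExecStartPost` is where we’ll implement our health check. Keep in mind that `ExecStartPost` runs once, right after the main process has been started, so it verifies that the application came up rather than monitoring it continuously.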

The `[Install]` section specifies how the service should be enabled to start on boot.

`my_cpp_app.service` Example

[Unit]
Description=My C++ Application Service
After=network.target

[Service]
Type=simple
User=appuser
Group=appgroup
WorkingDirectory=/opt/my_cpp_app
ExecStart=/usr/local/bin/my_cpp_app --config /etc/my_cpp_app/config.conf
Restart=on-failure
RestartSec=5

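# Startup check: retry for several seconds until port 8080 accepts connections.
# With Type=simple, ExecStartPost runs as soon as the process is spawned,
# so a single probe could fire before the application is listening.
# If every attempt fails, systemd treats the start as failed.
ExecStartPost=/bin/sh -c 'for attempt in 1 2 3 4 5; do nc -z -w 2 127.0.0.1 8080 && exit 0; sleep 1; done; exit 1'

# Optional: If your app has a specific health check endpoint (e.g., HTTP GET /health)
# you might use curl instead. systemd does not run commands through a shell,
# so "|| exit 1" must not be appended; curl's own exit code decides the result.
# ExecStartPost=/usr/bin/curl --fail --silent --max-time 2 http://127.0.0.1:8080/health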

[Install]
WantedBy=multi-user.target

After creating this file, you need to reload the `systemd` daemon, enable, and start the service:

systemctl daemon-reload

systemctl enable my_cpp_app.service

systemctl start my_cpp_app.service

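You can check the status with `systemctl status my_cpp_app.service`. If the `ExecStartPost` command fails (returns a non-zero exit code), `systemd` stops the service, marks it as failed, and attempts to restart it according to the `Restart` policy. Because this check only runs at startup, catching an application that hangs later requires a periodic probe (for example a `systemd` timer or your monitoring system) or the `systemd` watchdog (`WatchdogSec=`), which needs the application to send keep-alive notifications.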
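
The retry idea behind a port check can be sketched as a small shell function. This is a minimal sketch, not part of the unit file above: the function name `wait_for_port` and its default attempt count and delay are assumptions chosen for illustration, and it relies on the same `nc -z -w` options used in the service unit.

```shell
# wait_for_port HOST PORT [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as a TCP connect succeeds, 1 once every attempt has failed.
# (Hypothetical helper; name and defaults are illustrative.)
wait_for_port() {
    local host="$1" port="$2" attempts="${3:-5}" delay="${4:-1}" i=1
    while [ "$i" -le "$attempts" ]; do
        # -z: connect without sending data; -w 2: give up after 2 seconds
        if nc -z -w 2 "$host" "$port" >/dev/null 2>&1; then
            return 0
        fi
        # Pause between attempts, but not after the last one
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

Saved in a script, a call such as `wait_for_port 127.0.0.1 8080 5 1` could serve as a startup probe or be run periodically from a timer. The exit code is the only result, which keeps it easy to plug into other tooling.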

Monitoring MySQL Cluster Health with `percona-monitoring-plugins` and `check_mysql_health`

For MySQL clusters, especially those deployed on OVH, a comprehensive monitoring strategy is paramount. This involves not just checking if the MySQL server process is running, but also verifying its internal state, replication status, and query performance. We’ll focus on using `check_mysql_health` for detailed health checks, integrated with a monitoring system like Nagios or Prometheus (via `node_exporter`’s textfile collector). Note that `check_mysql_health` is a standalone Nagios-compatible plugin published by ConSol Labs. It is not shipped with the Percona Monitoring Plugins, whose own checks are named `pmp-check-mysql-*`, so the two are installed separately.

Installing Percona Monitoring Plugins and `check_mysql_health`

On Debian/Ubuntu-based systems, you can often install these via package managers. If not, manual installation is straightforward.

First, ensure you have a dedicated monitoring user in your MySQL cluster with appropriate privileges. This user should have at least `PROCESS`, `REPLICATION CLIENT`, and `SHOW DATABASES` privileges.

Create a MySQL user (example for a specific host):

CREATE USER 'monitor'@'localhost' IDENTIFIED BY 'your_secure_password';
GRANT PROCESS, REPLICATION CLIENT, SHOW DATABASES ON *.* TO 'monitor'@'localhost';
-- FLUSH PRIVILEGES is not required here: CREATE USER and GRANT take effect immediately.

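Download and install the Percona Monitoring Plugins, which are available from Percona’s repositories or GitHub. Install `check_mysql_health` separately from ConSol Labs, or from your distribution’s packages where available.

Assuming `check_mysql_health` is available in your PATH (e.g., `/usr/local/bin/check_mysql_health`), you can start using it. Run `check_mysql_health --help` to list the options and modes your version supports.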

Configuring `check_mysql_health` for Critical Checks

`check_mysql_health` is highly configurable and can perform a wide array of checks. Here are some essential ones for a production MySQL cluster:

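1. Basic Connectivity Check:

check_mysql_health --hostname=127.0.0.1 --port=3306 --username=monitor --password='your_secure_password' --mode=connection-time

The `connection-time` mode measures how long it takes to connect and log in, so it doubles as a reachability check.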

2. Replication Status Check (for replicas):

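check_mysql_health --hostname=127.0.0.1 --port=3306 --username=monitor --password='your_secure_password' --mode=slave-lag

This checks `Seconds_Behind_Master`. A non-zero value indicates replication lag, and `NULL` means replication is not running. The related `slave-io-running` and `slave-sql-running` modes verify that both replication threads are active.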

3. InnoDB Status Check:

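check_mysql_health --hostname=127.0.0.1 --port=3306 --username=monitor --password='your_secure_password' --mode=innodb-log-waits

This mode reports how often InnoDB had to wait for its log buffer to be flushed. The plugin also offers `innodb-bufferpool-hitrate` and `innodb-bufferpool-wait-free`. Deadlocks and corruption are not covered by these modes; for those, review the MySQL error log and `SHOW ENGINE INNODB STATUS`.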

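4. Custom SQL Checks (e.g., unexpected read-only mode):

check_mysql_health --hostname=127.0.0.1 --port=3306 --username=monitor --password='your_secure_password' --mode=sql --name='SELECT @@GLOBAL.read_only' --name2=read_only --warning=0 --critical=0

The `sql` mode runs a statement that returns a single number and compares it with the `--warning` and `--critical` thresholds; `--name2` only labels the result. Here the check is intended to turn critical when the node reports `read_only = 1`, which is useful for ensuring a node is not in read-only mode unexpectedly (e.g., during failover scenarios). The same mode can watch specific cluster states (like Galera’s `wsrep_cluster_size`), provided the query returns the value as a single number. For slow queries, the plugin has a dedicated `slow-queries` mode. Threshold handling can differ between plugin versions, so confirm the behavior with `--help` before relying on it for alerting.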
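
Nagios-compatible plugins such as `check_mysql_health` report their result through the exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. The sketch below shows one way to wrap any such check so that the state is printed in a uniform way. The function names `nagios_state` and `run_and_report` are hypothetical helpers, not part of the plugin.

```shell
# nagios_state EXIT_CODE: translate a plugin exit code into its state name.
nagios_state() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# run_and_report LABEL COMMAND [ARGS...]
# Runs the command, prints "LABEL: STATE (exit N)", and returns the exit code.
run_and_report() {
    local label="$1" status
    shift
    if "$@" >/dev/null 2>&1; then
        status=0
    else
        status=$?
    fi
    printf '%s: %s (exit %d)\n' "$label" "$(nagios_state "$status")" "$status"
    return "$status"
}
```

A call would look like `run_and_report replication` followed by the full check command and its arguments. Relying on the exit code rather than on the printed text keeps the wrapper independent of each check’s output format.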

Integrating with Prometheus `node_exporter` Textfile Collector

For Prometheus-based monitoring, we can use the `node_exporter`’s textfile collector to expose the output of `check_mysql_health` as Prometheus metrics. This involves creating a script that runs `check_mysql_health` and writes its output to a file in the `node_exporter`’s collector directory (e.g., `/var/lib/node_exporter/textfile_collector/`).

Create a script, for example, `/usr/local/bin/mysql_health_exporter.sh`:

#!/bin/bash
# Writes MySQL health metrics for node_exporter's textfile collector.

# MySQL Credentials (placeholders; passwords on the command line are visible in
# the process list, so consider a client option file readable only by this user)
MYSQL_USER="monitor"
MYSQL_PASSWORD="your_secure_password"
MYSQL_HOST="127.0.0.1"
MYSQL_PORT="3306"

# Output file for Prometheus metrics, plus a temporary file in the same directory.
# node_exporter only reads *.prom files, so the temporary name is ignored.
OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/mysql_health.prom"
TMP_FILE="${OUTPUT_FILE}.$$"

# Run one check_mysql_health mode. Nagios plugins report their result through
# the exit code (0 = OK), so test that instead of the printed text.
run_check() {
    check_mysql_health --hostname="$MYSQL_HOST" --port="$MYSQL_PORT" \
        --username="$MYSQL_USER" --password="$MYSQL_PASSWORD" --mode="$1" >/dev/null 2>&1
}

# --- Perform Checks ---

# Basic connectivity
if run_check connection-time; then PING=1; else PING=0; fi

# Replication Status (only on replicas)
# -1: replication not running or status unavailable, -2: replication check not OK
if run_check slave-lag; then
    # SHOW SLAVE STATUS is deprecated as of MySQL 8.0.22 in favor of
    # SHOW REPLICA STATUS (field Seconds_Behind_Source); adjust for newer servers.
    SECONDS_BEHIND=$(mysql -u"$MYSQL_USER" -p"$MYSQL_PASSWORD" -h"$MYSQL_HOST" -P"$MYSQL_PORT" \
        -e "SHOW SLAVE STATUS\G" 2>/dev/null | awk '/Seconds_Behind_Master:/ {print $2}')
    [[ "$SECONDS_BEHIND" =~ ^[0-9]+$ ]] || SECONDS_BEHIND=-1
else
    SECONDS_BEHIND=-2
fi

# InnoDB Status
if run_check innodb-log-waits; then INNODB=1; else INNODB=0; fi

# --- Write Metrics to File ---
# Sample lines hold only a name and a value; a trailing "# comment" would make
# the whole file unparseable.
{
    echo "# HELP mysql_health_ping MySQL server is reachable."
    echo "# TYPE mysql_health_ping gauge"
    echo "mysql_health_ping $PING"

    echo "# HELP mysql_replication_seconds_behind_master Seconds behind master (-1: not running or unavailable, -2: check failed)."
    echo "# TYPE mysql_replication_seconds_behind_master gauge"
    echo "mysql_replication_seconds_behind_master $SECONDS_BEHIND"

    echo "# HELP mysql_health_innodb InnoDB health status."
    echo "# TYPE mysql_health_innodb gauge"
    echo "mysql_health_innodb $INNODB"
} > "$TMP_FILE"

# Make the file world-readable so the node_exporter user can read it, then
# rename it into place so a scrape never sees a half-written file.
chmod 644 "$TMP_FILE"
mv "$TMP_FILE" "$OUTPUT_FILE"

Make the script executable and set up a cron job to run it periodically (e.g., every minute):

chmod +x /usr/local/bin/mysql_health_exporter.sh
crontab -e
# Add the following line:
* * * * * /usr/local/bin/mysql_health_exporter.sh

Ensure `node_exporter` is configured to read from `/var/lib/node_exporter/textfile_collector/` (set with its `--collector.textfile.directory` flag). Once Prometheus scrapes `node_exporter`, you will have `mysql_health_ping`, `mysql_replication_seconds_behind_master`, and `mysql_health_innodb` metrics available in Prometheus for alerting and dashboarding.
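
In the Prometheus text format, comments are only allowed on lines that start with `#`, so each sample line should contain just a metric name and a value. The following is a rough lint for that rule, written as a sketch: the function name `lint_prom_lines` is hypothetical, and it deliberately ignores labels, timestamps, `NaN`, and `Inf`. For a full check, `promtool check metrics` is the more thorough option.

```shell
# lint_prom_lines: read textfile-collector content on stdin.
# Prints each sample line that is not "name value" and returns 1 if any was found.
# Simplified on purpose: labels, timestamps, NaN and Inf are not recognised.
lint_prom_lines() {
    awk '
        /^#/ || /^[ \t]*$/ { next }                                  # comment or blank line
        /^[a-zA-Z_:][a-zA-Z0-9_:]* +-?[0-9]+(\.[0-9]+)?$/ { next }   # plain sample line
        { print "invalid: " $0; bad = 1 }
        END { exit bad + 0 }'
}
```

Running `lint_prom_lines < /var/lib/node_exporter/textfile_collector/mysql_health.prom` after the exporter script gives a quick signal before Prometheus ever scrapes the file.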

Advanced C++ Application Performance Profiling with `perf`

Beyond basic health checks, understanding the performance characteristics of your C++ application is crucial for optimization and identifying bottlenecks. The Linux `perf` tool is an indispensable utility for this purpose. It leverages hardware performance counters and kernel tracepoints to provide deep insights into CPU usage, cache misses, branch prediction, and more.

Capturing Performance Data

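To profile your C++ application (`my_cpp_app`), you can use `perf record`. Ensure your application is compiled with debug symbols (`-g` flag) for more meaningful output. Because `perf record -g` unwinds stacks through frame pointers by default, also build with `-fno-omit-frame-pointer`, or record with `--call-graph dwarf` instead.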

# Run your application and record performance data
sudo perf record -g -o /tmp/my_cpp_app.perf /usr/local/bin/my_cpp_app --config /etc/my_cpp_app/config.conf

The `-g` flag enables call graph (stack trace) recording, which is essential for understanding the context of performance events. The `-o` flag specifies the output file.

To analyze the recorded data, use `perf report`:

sudo perf report -i /tmp/my_cpp_app.perf

This will launch an interactive TUI (Text User Interface) where you can navigate through functions, see their contribution to CPU time, and drill down into call stacks. Use the arrow keys to navigate and `Enter` to expand/collapse call chains.

Interpreting `perf` Output for Bottlenecks

When analyzing `perf report`, look for the items below. By default `perf record` samples CPU cycles, so the cache, branch, and system-call items only show up if you record those events explicitly with `-e`.

High percentage of CPU time: Functions accounting for a large share of samples are prime candidates for optimization.
Cache Misses (e.g., `cache-misses`, `L1-dcache-load-misses`): High cache miss rates indicate suboptimal memory access patterns. This might suggest issues with data structures, algorithms, or data locality.
Branch Mispredictions (e.g., `branch-misses`): Frequent branch mispredictions can stall the CPU pipeline. This often points to complex conditional logic or data-dependent control flow that is hard for the processor to predict.
System Calls (e.g., the `raw_syscalls:sys_enter` tracepoint): Excessive system calls can indicate I/O bottlenecks or inefficient use of kernel resources.

You can also use `perf stat` to get a summary of performance counters without detailed call graphs:

sudo perf stat -e cpu-cycles,instructions,cache-misses,branch-misses,raw_syscalls:sys_enter -o /tmp/my_cpp_app_stats.txt /usr/local/bin/my_cpp_app --config /etc/my_cpp_app/config.conf

This provides a quick overview of key performance indicators. To make hot paths easier to spot, consider tools like `FlameGraph`, which can process recorded `perf` data into visual flame graphs; generating them from CI/CD pipelines is one way to track performance across builds.
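
A common first number to derive from `perf stat` is instructions per cycle (IPC). The sketch below assumes the machine-readable output produced by `perf stat -x,`, where the counter value is the first field and the event name the third. The function name `perf_ipc` is a hypothetical helper, and event names may differ on some CPUs.

```shell
# perf_ipc: read `perf stat -x,` CSV on stdin and print instructions per cycle.
# Returns 1 (and prints "n/a") when no usable cycle count is present.
perf_ipc() {
    awk -F',' '
        # Field 1 is the counter value, field 3 the event name (e.g. cpu-cycles:u)
        $3 ~ /^(cpu-)?cycles(:.*)?$/  { cycles = $1 + 0 }
        $3 ~ /^instructions(:.*)?$/   { instructions = $1 + 0 }
        END {
            if (cycles > 0) printf "%.2f\n", instructions / cycles
            else { print "n/a"; exit 1 }
        }'
}
```

For example, after `sudo perf stat -x, -e cpu-cycles,instructions -o /tmp/my_cpp_app_stats.csv` followed by the application command, `perf_ipc < /tmp/my_cpp_app_stats.csv` prints the ratio. A low IPC alongside high cache or branch miss counts is a hint to look at memory access patterns or control flow first.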

OVH Specific Considerations: Network and Storage Monitoring

When operating on OVH infrastructure, specific attention must be paid to network and storage performance, as these can be common sources of issues. While general Linux tools apply, understanding OVH’s offerings and potential limitations is key.

Network Monitoring

For network traffic, `iftop`, `nload`, and `iptraf-ng` are excellent command-line tools for real-time monitoring. For historical data and trend analysis, consider integrating with a time-series database like InfluxDB or Prometheus, collecting data via `node_exporter`’s network collectors or dedicated network monitoring agents.

Key metrics to watch:

Bandwidth Usage: Monitor ingress and egress traffic to identify saturation or unexpected spikes.
Packet Loss: High packet loss can severely degrade application performance, especially for latency-sensitive C++ applications. Use `ping` with large packet sizes and `mtr` (My Traceroute) to diagnose connectivity issues.
Latency: Measure round-trip times to critical services (e.g., database, external APIs).

OVH provides network performance metrics through its control panel. Regularly review these alongside your internal metrics to correlate any observed issues.
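
To track packet loss over time rather than reading it off the terminal, the summary line that `ping` prints can be reduced to a single number. This is a minimal sketch that assumes the usual `N% packet loss` wording of the summary; the function name `packet_loss_percent` is hypothetical.

```shell
# packet_loss_percent: read `ping` output on stdin and print the loss percentage
# from its summary line (without the % sign). Returns 1 if no summary was found.
packet_loss_percent() {
    awk '
        /packet loss/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9.]+%$/) {
                    sub(/%/, "", $i)
                    print $i
                    found = 1
                    exit
                }
            }
        }
        END { exit !found }'
}
```

A call such as `ping -c 20 db.internal.example | packet_loss_percent` (the host name is a placeholder) yields a value that can be compared against a threshold or written out as a metric.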

Storage Monitoring

Disk I/O performance is critical for database clusters and applications with heavy disk access. Use `iostat` to monitor disk utilization, read/write speeds, and I/O wait times.

# Monitor disk I/O statistics every 5 seconds
iostat -dx 5

Key metrics:

%util: Percentage of time the disk was busy. Consistently high values (near 100%) indicate a bottleneck.
await: The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent waiting in the queue and the time to serve the request. Recent `sysstat` versions report it separately as `r_await` and `w_await` for reads and writes. High values are a strong indicator of storage performance issues.
r/s, w/s: Reads and writes per second.
rkB/s, wkB/s: Kilobytes read/written per second.
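
The `%util` reading above lends itself to a simple automated check. The sketch below filters `iostat -dx` output for devices at or above a utilization threshold. It looks up the `%util` column by its header name because column positions differ between `sysstat` versions. The function name `busy_devices` and the default threshold of 90 are assumptions, and a locale with `.` as the decimal separator is assumed.

```shell
# busy_devices [THRESHOLD]: read `iostat -dx` output on stdin and print
# "device %util" for every device at or above THRESHOLD (default 90).
busy_devices() {
    awk -v limit="${1:-90}" '
        # Header row: remember which column holds %util
        $1 == "Device" || $1 == "Device:" {
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # Data rows: only evaluated once the header has been seen
        col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }'
}
```

For a one-off check, `iostat -dx 5 2 | busy_devices 90` samples twice and lists any saturated devices. Note that the first `iostat` report covers the time since boot, so a device may appear once for that report and again for the live interval.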

For MySQL, settings such as `innodb_io_capacity` and `innodb_io_capacity_max` in your `my.cnf` should be tuned based on your underlying storage performance. OVH offers various storage solutions (local SSDs, network storage). Understanding the IOPS and throughput capabilities of your chosen solution is vital for setting realistic performance expectations and tuning.