Server Monitoring Best Practices: Keeping Your C++ App and DynamoDB Clusters Alive on OVH

Proactive C++ Application Health Checks with Systemd

For a C++ application running on OVH infrastructure, robust health checking is paramount. We’ll leverage systemd’s capabilities to ensure our application is not only running but also responsive and healthy. This involves defining a custom systemd service unit that includes health check mechanisms.

First, let’s define a basic systemd service file. Assume your C++ application is compiled into an executable named my_cpp_app and resides in /opt/my_cpp_app/bin/. It listens on port 8080.

Systemd Service Unit Configuration

Create a file named /etc/systemd/system/my_cpp_app.service with the following content:

[Unit]
Description=My C++ Application Service
After=network.target

[Service]
Type=simple
ExecStart=/opt/my_cpp_app/bin/my_cpp_app --config /opt/my_cpp_app/etc/config.toml
WorkingDirectory=/opt/my_cpp_app
Restart=on-failure
RestartSec=5
User=my_app_user
Group=my_app_group
StandardOutput=journal
StandardError=journal

# Health Check Configuration
ExecStartPost=/opt/my_cpp_app/bin/health_check.sh
TimeoutStartSec=30

[Install]
WantedBy=multi-user.target

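The ExecStartPost directive runs a script once, right after the main process has been launched. With Type=simple, systemd considers the service started as soon as the process is forked, so the check can run before the application has bound its port and should allow for that. If the script exits non-zero, systemd treats the start as failed and stops the service. Note that this is a startup check only: it does not keep probing the application afterwards. For ongoing liveness monitoring you need something else, such as systemd’s watchdog (WatchdogSec=) or an external probe.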

Health Check Script Implementation

Create the /opt/my_cpp_app/bin/health_check.sh script. This script will attempt to connect to the application’s port and verify its response.

#!/bin/bash

APP_HOST="127.0.0.1"
APP_PORT="8080"
HEALTH_CHECK_URL="/health" # Assuming your app exposes a /health endpoint
TIMEOUT_SECONDS=5
DEADLINE_SECONDS=20 # Keep below TimeoutStartSec=30 in the unit file

# ExecStartPost can run before the app has bound its port, so retry until the deadline.
# -s: Silent mode (no progress output)
# -f: Exit non-zero on HTTP errors (status >= 400)
# -m: Maximum time allowed for the whole request
while (( SECONDS < DEADLINE_SECONDS )); do
    if curl -s -f -m "$TIMEOUT_SECONDS" "http://${APP_HOST}:${APP_PORT}${HEALTH_CHECK_URL}" > /dev/null; then
        echo "C++ Application health check successful."
        exit 0
    fi
    sleep 1
done

echo "C++ Application health check failed. Application may not be responsive." >&2
exit 1

Make the script executable:

sudo chmod +x /opt/my_cpp_app/bin/health_check.sh
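The script assumes the application answers GET /health with a 2xx status when healthy; curl -f treats any status of 400 or above as a failure. A minimal sketch of the application side follows. The helper name BuildHealthResponse, the plain-text body, and the choice of 503 for the unhealthy case are illustrative assumptions, not part of any library; your HTTP layer would call it for /health requests.

```cpp
#include <string>

// Hypothetical helper: builds the raw HTTP/1.1 response for GET /health.
// Returns 200 when the app is healthy and 503 otherwise, so that
// `curl -f` exits non-zero for an unhealthy instance.
std::string BuildHealthResponse(bool healthy) {
    const std::string status = healthy ? "200 OK" : "503 Service Unavailable";
    const std::string body = healthy ? "ok\n" : "unhealthy\n";
    return "HTTP/1.1 " + status + "\r\n"
           "Content-Type: text/plain\r\n"
           "Content-Length: " + std::to_string(body.size()) + "\r\n"
           "Connection: close\r\n"
           "\r\n" + body;
}
```

What counts as healthy is up to the application; a common choice is to report unhealthy while required dependencies (configuration, database connectivity) are unavailable.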

After creating/modifying the service file, reload systemd and start/restart your service:

sudo systemctl daemon-reload
sudo systemctl enable my_cpp_app
sudo systemctl start my_cpp_app

You can monitor the status and logs using:

sudo systemctl status my_cpp_app
sudo journalctl -u my_cpp_app -f
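For liveness checking that continues after startup, systemd offers a watchdog. With WatchdogSec= set in the [Service] section, systemd expects the process to send WATCHDOG=1 regularly over the socket named in the NOTIFY_SOCKET environment variable, and treats the service as failed if the messages stop. The usual way to send that message is sd_notify() from libsystemd. The sketch below instead writes the datagram by hand so that it depends only on POSIX sockets. The function name NotifySystemd is hypothetical, and WatchdogSec= would be an addition to the unit file shown earlier.

```cpp
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper: sends one notification datagram (e.g. "WATCHDOG=1")
// to the socket systemd passes in NOTIFY_SOCKET. Returns false when the
// variable is unset (not running under systemd) or the send fails.
bool NotifySystemd(const std::string& state) {
    const char* path = std::getenv("NOTIFY_SOCKET");
    if (path == nullptr || path[0] == '\0') {
        return false;
    }

    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    const std::size_t len = std::strlen(path);
    if (len >= sizeof(addr.sun_path)) {
        return false;
    }
    std::memcpy(addr.sun_path, path, len);
    if (addr.sun_path[0] == '@') {
        addr.sun_path[0] = '\0';  // A leading '@' denotes an abstract socket.
    }

    const int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0) {
        return false;
    }
    const socklen_t addrLen =
        static_cast<socklen_t>(offsetof(sockaddr_un, sun_path) + len);
    const ssize_t sent =
        sendto(fd, state.data(), state.size(), 0,
               reinterpret_cast<const sockaddr*>(&addr), addrLen);
    close(fd);
    return sent == static_cast<ssize_t>(state.size());
}
```

A worker loop would typically call NotifySystemd("WATCHDOG=1") at about half the WatchdogSec= interval, and only after its own internal checks pass, so that a hung application stops sending and gets restarted.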

Monitoring DynamoDB Performance and Capacity on OVH

While OVH doesn’t directly host AWS DynamoDB, you might be using DynamoDB as a managed service for your C++ application deployed on OVH. Monitoring DynamoDB’s performance and capacity is critical to prevent throttling and ensure low latency. We’ll focus on key metrics and how to collect them.

Key DynamoDB Metrics to Track

Consumed Read Capacity Units (RCUs): The amount of read capacity your application is consuming.
Consumed Write Capacity Units (WCUs): The amount of write capacity your application is consuming.
Provisioned Read Capacity Units: The RCU you have provisioned for your table/index.
Provisioned Write Capacity Units: The WCU you have provisioned for your table/index.
Throttled Requests: The number of requests that were throttled due to exceeding provisioned capacity. This is a critical indicator of potential performance issues.
Latency (e.g., GetItem, PutItem, Query, Scan): Average, p90, p95, and p99 latencies for key operations.
System Errors: Number of 5xx errors returned by DynamoDB.
Table Size: The total size of your table data.

Leveraging AWS CloudWatch and External Monitoring Tools

The primary source for DynamoDB metrics is AWS CloudWatch. To integrate these metrics into a centralized monitoring system (e.g., Prometheus, Grafana, Datadog) that might also be monitoring your OVH infrastructure, you can use various methods:

Option 1: Prometheus CloudWatch Exporter

The AWS CloudWatch agent is not the right tool here: it ships host metrics and logs to CloudWatch rather than reading metrics out of it. To pull DynamoDB metrics from CloudWatch into Prometheus, run a dedicated exporter that calls the CloudWatch API and exposes the results on a Prometheus-compatible endpoint. It can run on an EC2 instance or on a VM on OVH, as long as it has AWS credentials and outbound access to the CloudWatch API.

A common choice is the Prometheus cloudwatch_exporter. You’d typically run this exporter in a Docker container or as a systemd service.

CloudWatch Exporter Configuration Example

The exporter requires AWS credentials and a configuration file specifying which metrics to scrape. Here’s a simplified example of a configuration file (config.yml):

# config.yml (format used by prometheus/cloudwatch_exporter)
region: us-east-1 # Replace with your DynamoDB region
metrics:
  # Consumed capacity is reported as a total per period, so use Sum
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ConsumedReadCapacityUnits
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-dynamodb-table-name] # Replace with your table name
    aws_statistics: [Sum]
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ConsumedWriteCapacityUnits
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-dynamodb-table-name]
    aws_statistics: [Sum]
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ProvisionedReadCapacityUnits
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-dynamodb-table-name]
    aws_statistics: [Average]
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ProvisionedWriteCapacityUnits
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-dynamodb-table-name]
    aws_statistics: [Average]
  # ThrottledRequests and SuccessfulRequestLatency are reported per operation,
  # so they need the Operation dimension in addition to TableName.
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ThrottledRequests
    aws_dimensions: [TableName, Operation]
    aws_dimension_select:
      TableName: [your-dynamodb-table-name]
    aws_statistics: [Sum]
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: SuccessfulRequestLatency
    aws_dimensions: [TableName, Operation]
    aws_dimension_select:
      TableName: [your-dynamodb-table-name]
    aws_statistics: [Average, Maximum]
    aws_extended_statistics: [p90, p95, p99]

This file configures the exporter itself; the scrape job belongs in your Prometheus server’s own configuration. Point Prometheus at the cloudwatch_exporter endpoint (port 9106 by default; port 9100 is conventionally used by the Node Exporter). Ensure your AWS credentials (IAM role or access keys) are securely provided to the exporter, with read-only CloudWatch permissions such as cloudwatch:ListMetrics and cloudwatch:GetMetricStatistics.

Option 2: AWS SDK for Application-Level Metrics

Your C++ application can also fetch DynamoDB metrics itself by querying CloudWatch through the AWS SDK for C++ (the monitoring component). This allows for more granular control and the ability to correlate application behavior with DynamoDB performance. You can then expose these metrics via an HTTP endpoint for Prometheus to scrape.

C++ Example: Fetching DynamoDB Metrics (Conceptual)

This is a conceptual snippet. Actual implementation requires the AWS SDK for C++ and proper error handling.

#include <aws/core/Aws.h>
#include <aws/core/utils/DateTime.h>
#include <aws/monitoring/CloudWatchClient.h>
#include <aws/monitoring/model/Dimension.h>
#include <aws/monitoring/model/GetMetricStatisticsRequest.h>
#include <aws/monitoring/model/Statistic.h>
#include <chrono>
#include <iostream>

// ... (other includes for your web server/metrics exporter)

void CollectDynamoDBMetrics(Aws::CloudWatch::CloudWatchClient& cwClient) {
    using namespace Aws::CloudWatch::Model;
    using Aws::Utils::DateFormat;
    using Aws::Utils::DateTime;

    const auto endTime = DateTime::Now();
    const auto startTime = endTime - std::chrono::minutes(5); // Last 5 minutes

    Dimension table;
    table.SetName("TableName");
    table.SetValue("your-dynamodb-table-name"); // Replace

    // GetMetricStatistics accepts a single metric name, so issue one request per metric.
    // ThrottledRequests is reported per operation and also needs an Operation
    // dimension, so it is omitted from this loop.
    for (const char* metricName : {"ConsumedReadCapacityUnits", "ConsumedWriteCapacityUnits"}) {
        GetMetricStatisticsRequest request;
        request.SetNamespace("AWS/DynamoDB");
        request.SetMetricName(metricName);
        request.SetStartTime(startTime);
        request.SetEndTime(endTime);
        request.SetPeriod(300); // 5 minutes period
        request.AddDimensions(table);
        request.AddStatistics(Statistic::Sum); // Consumed capacity is a total per period
        request.AddStatistics(Statistic::Maximum);

        const auto outcome = cwClient.GetMetricStatistics(request);
        if (!outcome.IsSuccess()) {
            std::cerr << "Error fetching " << metricName << ": "
                      << outcome.GetError().GetMessage() << std::endl;
            continue;
        }

        std::cout << "Metric: " << outcome.GetResult().GetLabel() << std::endl;
        for (const auto& datapoint : outcome.GetResult().GetDatapoints()) {
            std::cout << "  Timestamp: "
                      << datapoint.GetTimestamp().ToGmtString(DateFormat::ISO_8601) << std::endl;
            std::cout << "  Sum: " << datapoint.GetSum() << std::endl;
            std::cout << "  Maximum: " << datapoint.GetMaximum() << std::endl;
        }
        // Expose these metrics via an HTTP endpoint for Prometheus
    }
}

// In your main application or a dedicated metrics service:
// Aws::SDKOptions options;
// Aws::InitAPI(options);
// {
//     Aws::Client::ClientConfiguration config;
//     config.region = "us-east-1"; // Replace with your DynamoDB region
//     Aws::CloudWatch::CloudWatchClient cwClient(config); // Uses the default credential chain
//     CollectDynamoDBMetrics(cwClient);
// } // Destroy the client before shutting down the SDK
// Aws::ShutdownAPI(options);

This C++ code would then need to be integrated into a web server (like Boost.Beast or cpp-httplib) to expose metrics in Prometheus exposition format.
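Prometheus scrapes plain text in which each sample is a line of the form name{label="value"} number, optionally preceded by # HELP and # TYPE comment lines. A minimal sketch of a formatter for one gauge sample follows. The function names and the metric name used in the usage note are hypothetical, and the escaping covers the three characters the format requires in label values (backslash, double quote, newline).

```cpp
#include <map>
#include <sstream>
#include <string>

// Escapes a label value as required by the Prometheus text format.
std::string EscapeLabelValue(const std::string& value) {
    std::string out;
    for (const char c : value) {
        if (c == '\\') {
            out += "\\\\";
        } else if (c == '"') {
            out += "\\\"";
        } else if (c == '\n') {
            out += "\\n";
        } else {
            out += c;
        }
    }
    return out;
}

// Hypothetical helper: renders one gauge sample with its HELP and TYPE lines.
std::string FormatGauge(const std::string& name, const std::string& help,
                        const std::map<std::string, std::string>& labels,
                        double value) {
    std::ostringstream out;
    out.precision(17);  // Enough digits to round-trip a double.
    out << "# HELP " << name << ' ' << help << '\n';
    out << "# TYPE " << name << " gauge\n";
    out << name;
    if (!labels.empty()) {
        out << '{';
        bool first = true;
        for (const auto& label : labels) {
            if (!first) {
                out << ',';
            }
            out << label.first << "=\"" << EscapeLabelValue(label.second) << '"';
            first = false;
        }
        out << '}';
    }
    out << ' ' << value << '\n';
    return out.str();
}
```

A /metrics handler could call, for instance, FormatGauge("dynamodb_consumed_read_capacity_units", "Consumed RCUs over the last period", {{"table", "your-dynamodb-table-name"}}, sum) for each datapoint and concatenate the results into the response body.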

Alerting Strategies

Configure alerts in your monitoring system (e.g., Prometheus Alertmanager, Grafana Alerting) for critical thresholds:

High RCU/WCU Utilization: Alert when consumed capacity is consistently above 80% of provisioned capacity.
Throttled Requests Spike: Immediate alert on any throttled requests.
High Latency: Alert on p95/p99 latencies exceeding acceptable thresholds (e.g., > 100ms for GetItem).
System Errors: Alert on any 5xx errors from DynamoDB.
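
CloudWatch reports ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits as totals per period, while provisioned capacity is expressed in units per second. To compare the two, divide the Sum statistic by the period length in seconds before dividing by the provisioned value. A minimal sketch of that arithmetic, with hypothetical function names and the 80% threshold taken from the list above:

```cpp
#include <stdexcept>

// Fraction of provisioned capacity in use, from a CloudWatch Sum datapoint.
//   consumedSum:          Sum of Consumed{Read,Write}CapacityUnits over the period
//   periodSeconds:        length of the CloudWatch period, e.g. 300
//   provisionedPerSecond: provisioned RCUs or WCUs
double CapacityUtilization(double consumedSum, int periodSeconds,
                           double provisionedPerSecond) {
    if (periodSeconds <= 0 || provisionedPerSecond <= 0.0) {
        throw std::invalid_argument(
            "period and provisioned capacity must be positive");
    }
    const double consumedPerSecond = consumedSum / periodSeconds;
    return consumedPerSecond / provisionedPerSecond;
}

// True when utilization exceeds the alert threshold (80% by default).
bool ShouldAlertOnUtilization(double utilization, double threshold = 0.8) {
    return utilization > threshold;
}
```

For example, a Sum of 27,000 RCUs over a 300-second period against 100 provisioned RCUs works out to 90 RCUs per second, or 90% utilization, which would trigger the alert. In a real alert rule you would also require the condition to hold over several consecutive periods, matching the "consistently above" wording.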

OVH Infrastructure Monitoring: Network and System Resources

Beyond application and database specifics, fundamental infrastructure monitoring on OVH is essential. This includes CPU, memory, disk I/O, and network traffic for your compute instances.

Prometheus Node Exporter for System Metrics

The Prometheus Node Exporter is the de facto standard for collecting hardware and OS metrics from Linux servers. Deploying it on your OVH instances provides a rich dataset for analysis.

Installation and Configuration

Download the latest release from the Prometheus Node Exporter GitHub repository. For example, on a Debian/Ubuntu system:

# Download the binary (adjust version and architecture as needed)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Move the binary to a standard location
sudo mv node_exporter /usr/local/bin/

# Create a systemd service file
sudo nano /etc/systemd/system/node_exporter.service

Paste the following content into /etc/systemd/system/node_exporter.service:

[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
# The filesystem, netdev, and diskstats collectors are enabled by default.
# Sampling frequency is set by the Prometheus scrape interval, not by exporter flags.
ExecStart=/usr/local/bin/node_exporter \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Ensure your firewall (e.g., ufw or OVH’s firewall rules) allows inbound traffic on port 9100 from your Prometheus server.

Network Traffic Monitoring with `iftop` and Prometheus

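For ad hoc, real-time traffic analysis on a host, iftop is invaluable, but it is an interactive tool and its output is awkward to parse. For Prometheus, it is simpler to read the kernel’s per-interface byte counters and publish them through the Node Exporter’s textfile collector.

Custom Script for Network Metrics

The script below reads the counters from /proc/net/dev and writes them in Prometheus exposition format. Note that the Node Exporter’s built-in netdev collector already exposes the same data as node_network_receive_bytes_total and node_network_transmit_bytes_total, so treat this mainly as a template for publishing your own custom metrics.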

#!/bin/bash

# This script is a simplified example of the textfile collector pattern.
# It reads byte counters for one interface from /proc/net/dev and writes them
# in Prometheus exposition format. Run it as root (e.g., from root's crontab).

# Example: Get total bytes received/sent for a specific interface (e.g., eth0)
INTERFACE="eth0"
TEXTFILE_DIR="/var/lib/node_exporter/textfile_collector"
METRIC_FILE="$TEXTFILE_DIR/network_stats.prom"

# Ensure the directory exists
mkdir -p "$TEXTFILE_DIR"

if [ ! -r /proc/net/dev ]; then
    echo "Error: /proc/net/dev not found" >&2
    exit 1
fi

# In /proc/net/dev, $1 is "<interface>:", $2 is RX bytes, and $10 is TX bytes
RX_BYTES=$(awk -v interface="$INTERFACE" '$1 == interface":" {print $2}' /proc/net/dev)
TX_BYTES=$(awk -v interface="$INTERFACE" '$1 == interface":" {print $10}' /proc/net/dev)

if [ -z "$RX_BYTES" ] || [ -z "$TX_BYTES" ]; then
    echo "Error: Could not parse /proc/net/dev for interface $INTERFACE" >&2
    exit 1
fi

# Write to a temporary file and rename it, so node_exporter never reads a partial file
TMP_FILE=$(mktemp "$METRIC_FILE.XXXXXX")
cat > "$TMP_FILE" <<EOF
# HELP ovh_network_bytes_received_total Total bytes received on interface $INTERFACE
# TYPE ovh_network_bytes_received_total counter
ovh_network_bytes_received_total{interface="$INTERFACE"} $RX_BYTES
# HELP ovh_network_bytes_sent_total Total bytes sent on interface $INTERFACE
# TYPE ovh_network_bytes_sent_total counter
ovh_network_bytes_sent_total{interface="$INTERFACE"} $TX_BYTES
EOF

# mktemp creates the file with mode 0600; make it readable by the user node_exporter runs as
chmod 644 "$TMP_FILE"
mv "$TMP_FILE" "$METRIC_FILE"

Schedule this script to run periodically (e.g., every minute) using cron. The Node Exporter will automatically pick up metrics from /var/lib/node_exporter/textfile_collector/.

OVH Firewall and Network Monitoring

Don’t forget to monitor OVH’s own network infrastructure. This includes:

OVH Control Panel Metrics: Regularly check the OVH control panel for network traffic graphs, bandwidth usage, and any reported network anomalies.
Firewall Logs: If you have specific firewall rules configured in OVH, monitor their logs for excessive denied traffic, which could indicate scanning or attack attempts.
Ping/Traceroute Checks: Implement external checks (from a different network location) to ping your OVH instances and perform traceroutes to ensure network path stability and latency.

Centralized Logging and Alerting with ELK Stack or Grafana Loki

Aggregating logs from your C++ application, the systemd journal, and potentially DynamoDB API activity (recorded by AWS CloudTrail, if enabled) is crucial for debugging and incident response. We’ll briefly touch upon two options: the ELK stack (Elasticsearch, Logstash, Kibana) and Grafana Loki.

Log Shipping Agents

On your OVH instances, you’ll need a log shipping agent. Filebeat (part of the Elastic Stack) or Promtail (for Loki) are excellent choices.

Filebeat Configuration for C++ App Logs

Configure Filebeat to tail your application logs (e.g., from journalctl or log files) and send them to Logstash or Elasticsearch.

# filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/my_cpp_app/*.log # If your app logs to files
  json.keys_under_root: true
  json.overwrite_keys: true
  json.message_key: log

# Example for systemd journal
- type: journald
  enabled: true
  # Restrict the input to your app's unit. The filter syntax differs between
  # Filebeat versions, so check the journald input documentation for yours:
  # include_matches:
  #   - "_SYSTEMD_UNIT=my_cpp_app.service"

output.elasticsearch:
  hosts: ["your-elasticsearch-host:9200"]
  # username: "elastic"
  # password: "changeme"

# Or to Logstash
# output.logstash:
#   hosts: ["your-logstash-host:5044"]

Promtail Configuration for Loki

If using Grafana Loki, Promtail is the agent. It’s designed to work seamlessly with Prometheus labels.

# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://your-loki-host:3100/loki/api/v1/push

scrape_configs:
  - job_name: systemd
    journal:
      max_age: 12h
      path: /var/log/journal # Or wherever your journald logs are
      labels:
        job: systemd
    relabel_configs:
      # Keep only entries from my_cpp_app.service
      - source_labels: ['__journal__systemd_unit']
        regex: 'my_cpp_app\.service'
        action: keep
      # Copy the unit name into a queryable label
      - source_labels: ['__journal__systemd_unit']
        target_label: 'app_unit'

  - job_name: my_cpp_app_logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: my_cpp_app
          host: ${HOSTNAME} # Expanded only when Promtail runs with -config.expand-env=true
          __path__: /var/log/my_cpp_app/*.log # Files to tail; required for static_configs
    pipeline_stages:
      - json:
          expressions:
            message: log # Assuming your app logs JSON with a 'log' field
            level: level # Assuming a 'level' field
            time: time # Assuming a 'time' field for timestamp
      - labels:
          level: # Promote the parsed 'level' field to a label
      - timestamp:
          source: time
          format: RFC3339Nano # Adjust format as needed

Deploying and configuring these logging solutions allows for centralized log analysis, enabling faster root cause identification during incidents.
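
Both agent configurations above assume the application writes one JSON object per line with log, level, and time fields, where time is an RFC 3339 timestamp with fractional seconds. A minimal sketch of a formatter that produces such lines follows. The function names are hypothetical, and the escaping covers only quotes, backslashes, and control characters.

```cpp
#include <chrono>
#include <cstdio>
#include <ctime>
#include <string>

// Escapes a string for use inside a JSON string literal.
std::string JsonEscape(const std::string& in) {
    std::string out;
    for (const unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n"; break;
            case '\r': out += "\\r"; break;
            case '\t': out += "\\t"; break;
            default:
                if (c < 0x20) {
                    char buf[8];
                    std::snprintf(buf, sizeof(buf), "\\u%04x",
                                  static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Formats a UTC timestamp as RFC 3339 with nanosecond precision.
std::string FormatRfc3339Nano(std::chrono::system_clock::time_point tp) {
    using namespace std::chrono;
    const auto sinceEpoch = tp.time_since_epoch();
    const auto secs = duration_cast<seconds>(sinceEpoch);
    const auto nanos = duration_cast<nanoseconds>(sinceEpoch - secs).count();
    const std::time_t t = static_cast<std::time_t>(secs.count());
    std::tm utc{};
    gmtime_r(&t, &utc);  // POSIX; thread-safe variant of gmtime
    char date[32];
    std::strftime(date, sizeof(date), "%Y-%m-%dT%H:%M:%S", &utc);
    char out[64];
    std::snprintf(out, sizeof(out), "%s.%09lldZ", date,
                  static_cast<long long>(nanos));
    return out;
}

// Hypothetical helper: one log event as a single-line JSON object.
std::string FormatJsonLogLine(const std::string& level,
                              const std::string& message,
                              std::chrono::system_clock::time_point tp) {
    return "{\"time\":\"" + FormatRfc3339Nano(tp) + "\",\"level\":\"" +
           JsonEscape(level) + "\",\"log\":\"" + JsonEscape(message) + "\"}";
}
```

Writing each line followed by a newline to a file under /var/log/my_cpp_app/ (or to stdout, which the unit file routes to the journal) gives the shipping agents something they can parse without custom patterns.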