Server Monitoring Best Practices: Keeping Your C++ App and DynamoDB Clusters Alive on OVH
Proactive C++ Application Health Checks with Systemd
For a C++ application running on OVH infrastructure, robust health checking is paramount. We’ll leverage systemd’s capabilities to ensure our application is not only running but also responsive and healthy. This involves defining a custom systemd service unit that includes health check mechanisms.
First, let’s define a basic systemd service file. Assume your C++ application is compiled into an executable named my_cpp_app and resides in /opt/my_cpp_app/bin/. It listens on port 8080.
Systemd Service Unit Configuration
Create a file named /etc/systemd/system/my_cpp_app.service with the following content:
[Unit]
Description=My C++ Application Service
After=network.target

[Service]
Type=simple
ExecStart=/opt/my_cpp_app/bin/my_cpp_app --config /opt/my_cpp_app/etc/config.toml
WorkingDirectory=/opt/my_cpp_app
Restart=on-failure
RestartSec=5
User=my_app_user
Group=my_app_group
StandardOutput=journal
StandardError=journal
# Health Check Configuration
ExecStartPost=/opt/my_cpp_app/bin/health_check.sh
TimeoutStartSec=30

[Install]
WantedBy=multi-user.target
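The ExecStartPost directive is crucial here. It runs a script once the main process has been started, and if the script exits non-zero systemd treats the start as failed. Two caveats apply. With Type=simple, systemd considers the service started as soon as the process is spawned, so the script may run before the application is listening on port 8080 and should tolerate a short warm-up. ExecStartPost also runs only once per start, so it verifies startup rather than acting as a recurring probe.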
Health Check Script Implementation
Create the /opt/my_cpp_app/bin/health_check.sh script. This script will attempt to connect to the application’s port and verify its response.
#!/bin/bash
APP_HOST="127.0.0.1"
APP_PORT="8080"
HEALTH_CHECK_URL="/health" # Assuming your app exposes a /health endpoint
TIMEOUT_SECONDS=5
MAX_ATTEMPTS=4 # 4 x (5s timeout + 2s sleep) = 28s, below TimeoutStartSec=30

# Use curl to check the health endpoint.
# -s: Silent mode
# -m: Maximum time allowed for the operation
# -f: Exit non-zero on HTTP errors (status >= 400)
# Retry a few times: ExecStartPost can run before the app is listening.
for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
    if curl -s -f -m "$TIMEOUT_SECONDS" "http://${APP_HOST}:${APP_PORT}${HEALTH_CHECK_URL}" > /dev/null; then
        echo "C++ Application health check successful."
        exit 0
    fi
    sleep 2
done
echo "C++ Application health check failed. Application may not be responsive."
exit 1
Make the script executable:
sudo chmod +x /opt/my_cpp_app/bin/health_check.sh
After creating/modifying the service file, reload systemd and start/restart your service:
sudo systemctl daemon-reload
sudo systemctl enable my_cpp_app
sudo systemctl start my_cpp_app
You can monitor the status and logs using:
sudo systemctl status my_cpp_app
sudo journalctl -u my_cpp_app -f
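The health check script assumes the application answers GET /health with a 2xx status when it is healthy. As a rough sketch of the C++ side, independent of whichever HTTP library actually serves the request, the handler could build its response as follows. HealthStatus, its fields, and BuildHealthResponse are hypothetical names chosen for illustration, not part of the application described here.

```cpp
#include <string>

// Hypothetical snapshot of what the app considers "healthy".
struct HealthStatus {
    bool configLoaded;
    bool dynamoReachable;
};

// Build a complete HTTP/1.1 response for GET /health.
// A 503 on failure matters: curl -f only fails on status codes >= 400.
std::string BuildHealthResponse(const HealthStatus& s) {
    const bool healthy = s.configLoaded && s.dynamoReachable;
    const std::string body =
        healthy ? "{\"status\":\"ok\"}" : "{\"status\":\"degraded\"}";

    std::string response =
        healthy ? "HTTP/1.1 200 OK\r\n" : "HTTP/1.1 503 Service Unavailable\r\n";
    response += "Content-Type: application/json\r\n";
    response += "Content-Length: " + std::to_string(body.size()) + "\r\n";
    response += "Connection: close\r\n\r\n";
    response += body;
    return response;
}
```

Keeping the check cheap (no heavy queries inside the handler) helps it stay within the 5-second curl timeout used by the script.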
Monitoring DynamoDB Performance and Capacity on OVH
While OVH doesn’t directly host AWS DynamoDB, you might be using DynamoDB as a managed service for your C++ application deployed on OVH. Monitoring DynamoDB’s performance and capacity is critical to prevent throttling and ensure low latency. We’ll focus on key metrics and how to collect them.
Key DynamoDB Metrics to Track
- Consumed Read Capacity Units (RCUs): The amount of read capacity your application is consuming.
- Consumed Write Capacity Units (WCUs): The amount of write capacity your application is consuming.
- Provisioned Read Capacity Units: The RCU you have provisioned for your table/index.
- Provisioned Write Capacity Units: The WCU you have provisioned for your table/index.
- Throttled Requests: The number of requests that were throttled due to exceeding provisioned capacity. This is a critical indicator of potential performance issues.
- Latency (e.g., GetItem, PutItem, Query, Scan): Average, p90, p95, and p99 latencies for key operations.
- System Errors: Number of 5xx errors returned by DynamoDB.
- Table Size: The total size of your table data.
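Capacity utilization is derived rather than reported directly. CloudWatch publishes consumed capacity as a total per period, so the Sum statistic has to be divided by the period length before it can be compared with the provisioned per-second value. A minimal sketch of that arithmetic, where CapacityUtilization and its parameter names are illustrative and not SDK functions:

```cpp
#include <cassert>

// Fraction of provisioned capacity in use over one CloudWatch period.
// consumedSum:          Sum statistic of Consumed{Read,Write}CapacityUnits for the period
// periodSeconds:        length of that period in seconds (e.g. 60 or 300)
// provisionedPerSecond: provisioned RCU or WCU of the table or index
double CapacityUtilization(double consumedSum, int periodSeconds,
                           double provisionedPerSecond) {
    assert(periodSeconds > 0);
    if (provisionedPerSecond <= 0.0) {
        return 0.0;  // nothing provisioned (e.g. on-demand mode): ratio is undefined
    }
    const double consumedPerSecond = consumedSum / periodSeconds;
    return consumedPerSecond / provisionedPerSecond;
}
```

For example, a Sum of 24000 consumed RCUs over a 300-second period is 80 RCUs per second, which is 80% of a table provisioned at 100 RCUs.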
Leveraging AWS CloudWatch and External Monitoring Tools
The primary source for DynamoDB metrics is AWS CloudWatch. To integrate these metrics into a centralized monitoring system (e.g., Prometheus, Grafana, Datadog) that might also be monitoring your OVH infrastructure, you can use various methods:
Option 1: Prometheus Exporter for CloudWatch
To pull CloudWatch metrics into Prometheus, you need an exporter that calls the CloudWatch API and exposes the results on a Prometheus-compatible endpoint. (The CloudWatch agent works in the opposite direction: it sends host metrics and logs to CloudWatch.) The exporter can run anywhere that has AWS API access, whether on an EC2 instance or on a VM on OVH if you’re comfortable managing credentials there.
A common approach is to use the cloudwatch_exporter. You’d typically run this exporter in a Docker container or as a systemd service.
CloudWatch Exporter Configuration Example
The exporter requires AWS credentials and a configuration file specifying which metrics to scrape. Here’s a simplified example of a configuration file (config.yml):
# config.yml (cloudwatch_exporter)
region: us-east-1 # Replace with your DynamoDB region
metrics:
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ConsumedReadCapacityUnits
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-dynamodb-table-name] # Replace with your table name
    aws_statistics: [Sum]
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ConsumedWriteCapacityUnits
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-dynamodb-table-name]
    aws_statistics: [Sum]
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ProvisionedReadCapacityUnits
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-dynamodb-table-name]
    aws_statistics: [Average]
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ProvisionedWriteCapacityUnits
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-dynamodb-table-name]
    aws_statistics: [Average]
  # ThrottledRequests and SuccessfulRequestLatency are reported per operation
  # (GetItem, PutItem, Query, ...), so they carry an Operation dimension too.
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ThrottledRequests
    aws_dimensions: [TableName, Operation]
    aws_dimension_select:
      TableName: [your-dynamodb-table-name]
    aws_statistics: [Sum]
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: SuccessfulRequestLatency
    aws_dimensions: [TableName, Operation]
    aws_dimension_select:
      TableName: [your-dynamodb-table-name]
    aws_statistics: [Average, Maximum]
    # Percentiles are only returned if CloudWatch offers them for this metric
    aws_extended_statistics: [p90, p95, p99]

# prometheus.yml (on your Prometheus server)
scrape_configs:
  - job_name: 'dynamodb'
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:9106'] # cloudwatch_exporter's default port
You would then configure your Prometheus server to scrape the cloudwatch_exporter endpoint. Ensure your AWS credentials (IAM role or access keys) are securely provided to the exporter.
Option 2: AWS SDK for Application-Level Metrics
Your C++ application can query DynamoDB’s CloudWatch metrics directly using the AWS SDK for C++. This allows for more granular control and the ability to correlate application behavior with DynamoDB performance. You can then expose these metrics via an HTTP endpoint for Prometheus to scrape. Keep in mind that each CloudWatch API call is billed, so poll at a modest interval.
C++ Example: Fetching DynamoDB Metrics (Conceptual)
This is a conceptual snippet. Actual implementation requires the AWS SDK for C++ and proper error handling.
#include <aws/core/Aws.h>
#include <aws/core/utils/DateTime.h>
#include <aws/monitoring/CloudWatchClient.h>
#include <aws/monitoring/model/Dimension.h>
#include <aws/monitoring/model/GetMetricStatisticsRequest.h>
#include <aws/monitoring/model/Statistic.h>
#include <chrono>
#include <iostream>
// ... (other includes for your web server/metrics exporter)

// GetMetricStatistics returns datapoints for a single metric per call,
// so this function is invoked once per metric name.
void CollectDynamoDBMetric(const Aws::CloudWatch::CloudWatchClient& cwClient,
                           const Aws::String& metricName) {
    using namespace Aws::CloudWatch::Model;
    using namespace Aws::Utils;

    auto endTime = DateTime::Now();
    auto startTime = endTime - std::chrono::minutes(5); // Last 5 minutes

    Dimension table;
    table.SetName("TableName");
    table.SetValue("your-dynamodb-table-name"); // Replace

    GetMetricStatisticsRequest request;
    request.SetNamespace("AWS/DynamoDB");
    request.SetMetricName(metricName);
    request.AddDimensions(table);
    request.SetStartTime(startTime);
    request.SetEndTime(endTime);
    request.SetPeriod(300); // 5 minutes period
    request.AddStatistics(Statistic::Average);
    request.AddStatistics(Statistic::Maximum);

    auto outcome = cwClient.GetMetricStatistics(request);
    if (!outcome.IsSuccess()) {
        std::cerr << "Error fetching " << metricName << ": "
                  << outcome.GetError().GetMessage() << std::endl;
        return;
    }

    for (const auto& datapoint : outcome.GetResult().GetDatapoints()) {
        std::cout << "Metric: " << metricName << std::endl;
        std::cout << "Timestamp: "
                  << datapoint.GetTimestamp().ToGmtString(DateFormat::ISO_8601) << std::endl;
        std::cout << "  Average: " << datapoint.GetAverage() << std::endl;
        std::cout << "  Maximum: " << datapoint.GetMaximum() << std::endl;
    }
    // Expose these metrics via an HTTP endpoint for Prometheus
}

// In your main application or a dedicated metrics service:
// Aws::SDKOptions options;
// Aws::InitAPI(options);
// {
//     Aws::Client::ClientConfiguration config;
//     config.region = "us-east-1"; // Replace with your DynamoDB region
//     Aws::CloudWatch::CloudWatchClient cwClient(config);
//     CollectDynamoDBMetric(cwClient, "ConsumedReadCapacityUnits");
//     CollectDynamoDBMetric(cwClient, "ConsumedWriteCapacityUnits");
//     // ThrottledRequests also needs an Operation dimension to return data.
// } // The client must be destroyed before ShutdownAPI
// Aws::ShutdownAPI(options);
This C++ code would then need to be integrated into a web server (like Boost.Beast or cpp-httplib) to expose metrics in Prometheus exposition format.
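The exposition format itself is plain text: an optional HELP and TYPE line per metric, followed by one sample line per label set. A minimal sketch of rendering a single gauge, usable from whichever HTTP library serves /metrics. The names RenderGauge, EscapeLabelValue, and Labels are illustrative, and the metric name in the usage note is an assumption.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Labels = std::vector<std::pair<std::string, std::string>>;

// Escape a label value as the text format requires: backslash, quote, newline.
std::string EscapeLabelValue(const std::string& v) {
    std::string out;
    for (char c : v) {
        if (c == '\\') out += "\\\\";
        else if (c == '"') out += "\\\"";
        else if (c == '\n') out += "\\n";
        else out += c;
    }
    return out;
}

// Render one gauge sample with its HELP and TYPE lines.
std::string RenderGauge(const std::string& name, const std::string& help,
                        const Labels& labels, double value) {
    std::ostringstream os;
    os.precision(15);
    os << "# HELP " << name << ' ' << help << '\n';
    os << "# TYPE " << name << " gauge\n";
    os << name;
    if (!labels.empty()) {
        os << '{';
        for (std::size_t i = 0; i < labels.size(); ++i) {
            if (i > 0) os << ',';
            os << labels[i].first << "=\""
               << EscapeLabelValue(labels[i].second) << '"';
        }
        os << '}';
    }
    os << ' ' << value << '\n';
    return os.str();
}
```

Calling RenderGauge("dynamodb_consumed_read_capacity_units", "Average consumed RCUs", {{"table", "orders"}}, 12.5) yields three lines ending in dynamodb_consumed_read_capacity_units{table="orders"} 12.5. The response should be served with a text/plain content type.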
Alerting Strategies
Configure alerts in your monitoring system (e.g., Prometheus Alertmanager, Grafana Alerting) for critical thresholds:
- High RCU/WCU Utilization: Alert when consumed capacity is consistently above 80% of provisioned capacity.
- Throttled Requests Spike: Immediate alert on any throttled requests.
- High Latency: Alert on p95/p99 latencies exceeding acceptable thresholds (e.g., > 100ms for GetItem).
- System Errors: Alert on any 5xx errors from DynamoDB.
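"Consistently above 80%" implies a sustained condition rather than a single spike. In Prometheus this is what an alerting rule's for: clause expresses. The same idea, sketched in C++ for cases where the check runs inside your own exporter or application, looks like this (SustainedAbove is a hypothetical helper):

```cpp
#include <cstddef>
#include <vector>

// True when the most recent `window` samples all exceed `threshold`.
// `samples` is ordered oldest to newest, e.g. one utilization ratio per minute.
bool SustainedAbove(const std::vector<double>& samples, double threshold,
                    std::size_t window) {
    if (window == 0 || samples.size() < window) {
        return false;  // not enough history to decide
    }
    for (std::size_t i = samples.size() - window; i < samples.size(); ++i) {
        if (samples[i] <= threshold) {
            return false;  // one sample at or below the threshold resets the condition
        }
    }
    return true;
}
```

With a threshold of 0.8 and a window of 3, the series {0.5, 0.85, 0.9, 0.82} fires, while {0.85, 0.9, 0.7} does not because the latest sample dropped below the threshold.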
OVH Infrastructure Monitoring: Network and System Resources
Beyond application and database specifics, fundamental infrastructure monitoring on OVH is essential. This includes CPU, memory, disk I/O, and network traffic for your compute instances.
Prometheus Node Exporter for System Metrics
The Prometheus Node Exporter is the de facto standard for collecting hardware and OS metrics from Linux servers. Deploying it on your OVH instances provides a rich dataset for analysis.
Installation and Configuration
Download the latest release from the Prometheus Node Exporter GitHub repository. For example, on a Debian/Ubuntu system:
# Download the binary (adjust version and architecture as needed)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
# Move the binary to a standard location
sudo mv node_exporter /usr/local/bin/
# Create a systemd service file
sudo nano /etc/systemd/system/node_exporter.service
Paste the following content into /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Ensure your firewall (e.g., ufw or OVH’s firewall rules) allows inbound traffic on port 9100 from your Prometheus server.
Network Traffic Monitoring with `iftop` and Prometheus
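For real-time, interactive network traffic analysis, iftop is invaluable. For Prometheus, however, you need values that a script can emit on a schedule, and the Node Exporter’s textfile collector is the integration point for that.
Custom Script for Network Metrics
iftop is built for interactive use and its text output is awkward to parse, so the script below takes the simpler route of reading the kernel’s per-interface byte counters from /proc/net/dev and writing them in Prometheus format for the textfile collector. Node Exporter’s built-in netdev collector already exposes similar counters, so a custom script is mainly useful when you want your own metric names or extra derived values.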
#!/bin/bash
# This script is a simplified example.
# A more robust solution might involve parsing 'ip -s link show' or using tools like 'bmon'
# and then formatting for Prometheus.
# Run it as a user that can write to the textfile directory (e.g., root via cron).

# Example: Get total bytes received/sent for a specific interface (e.g., eth0)
INTERFACE="eth0"
TEXTFILE_DIR="/var/lib/node_exporter/textfile_collector"
METRIC_FILE="$TEXTFILE_DIR/network_stats.prom"
TMP_FILE="$METRIC_FILE.$$" # Not *.prom, so the collector ignores it

# Ensure the directory exists
mkdir -p "$TEXTFILE_DIR"

# Get RX/TX bytes using /proc/net/dev
if [ ! -f /proc/net/dev ]; then
    echo "Error: /proc/net/dev not found" >&2
    exit 1
fi

# Extract RX bytes (field 2) and TX bytes (field 10) for the specified interface
RX_BYTES=$(awk -v interface="$INTERFACE" '$1 == interface":" {print $2}' /proc/net/dev)
TX_BYTES=$(awk -v interface="$INTERFACE" '$1 == interface":" {print $10}' /proc/net/dev)

if [ -z "$RX_BYTES" ] || [ -z "$TX_BYTES" ]; then
    echo "Error: Could not parse /proc/net/dev for interface $INTERFACE" >&2
    exit 1
fi

# Write to a temporary file and rename it, so node_exporter never reads a partial file
{
    echo "# HELP ovh_network_bytes_received Total bytes received on interface $INTERFACE"
    echo "# TYPE ovh_network_bytes_received counter"
    echo "ovh_network_bytes_received{interface=\"$INTERFACE\"} $RX_BYTES"
    echo "# HELP ovh_network_bytes_sent Total bytes sent on interface $INTERFACE"
    echo "# TYPE ovh_network_bytes_sent counter"
    echo "ovh_network_bytes_sent{interface=\"$INTERFACE\"} $TX_BYTES"
} > "$TMP_FILE"
mv "$TMP_FILE" "$METRIC_FILE"

# node_exporter runs as 'nobody' in the unit above, so the file must be
# world-readable (the default with umask 022).
Schedule this script to run periodically (e.g., every minute) using cron. The Node Exporter will automatically pick up metrics from /var/lib/node_exporter/textfile_collector/.
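If you would rather have the C++ application (or a small companion exporter) report these counters itself, the same /proc/net/dev parsing can be sketched in C++. The function reads from any input stream so it can be exercised without a real /proc. InterfaceCounters and ReadInterfaceCounters are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

struct InterfaceCounters {
    std::uint64_t rxBytes = 0;
    std::uint64_t txBytes = 0;
};

// Scan /proc/net/dev-formatted text for `iface`.
// Returns false if the interface is absent or its line is malformed.
bool ReadInterfaceCounters(std::istream& in, const std::string& iface,
                           InterfaceCounters& out) {
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;  // the two header lines have no colon
        const std::size_t start = line.find_first_not_of(" \t");
        if (line.substr(start, colon - start) != iface) continue;

        // After the colon: 8 receive columns followed by 8 transmit columns.
        std::istringstream fields(line.substr(colon + 1));
        std::uint64_t v[16];
        for (auto& f : v) {
            if (!(fields >> f)) return false;
        }
        out.rxBytes = v[0];  // receive bytes
        out.txBytes = v[8];  // transmit bytes
        return true;
    }
    return false;
}
```

In production the stream would be a std::ifstream opened on /proc/net/dev. Splitting on the colon rather than on whitespace also handles kernels that print the first counter directly after the interface name.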
OVH Firewall and Network Monitoring
Don’t forget to monitor OVH’s own network infrastructure. This includes:
- OVH Control Panel Metrics: Regularly check the OVH control panel for network traffic graphs, bandwidth usage, and any reported network anomalies.
- Firewall Logs: If you have specific firewall rules configured in OVH, monitor their logs for excessive denied traffic, which could indicate scanning or attack attempts.
- Ping/Traceroute Checks: Implement external checks (from a different network location) to ping your OVH instances and perform traceroutes to ensure network path stability and latency.
Centralized Logging and Alerting with ELK Stack or Grafana Loki
Aggregating logs from your C++ application, the systemd journal, and potentially DynamoDB API activity recorded by AWS CloudTrail (if enabled) is crucial for debugging and incident response. This section briefly covers two common stacks: ELK (Elasticsearch, Logstash, Kibana) and Grafana Loki.
Log Shipping Agents
On your OVH instances, you’ll need a log shipping agent. Filebeat (part of the Elastic Stack) or Promtail (for Loki) are excellent choices.
Filebeat Configuration for C++ App Logs
Configure Filebeat to tail your application logs (e.g., from journalctl or log files) and send them to Logstash or Elasticsearch.
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/my_cpp_app/*.log # If your app logs to files
    json.keys_under_root: true
    json.overwrite_keys: true
    json.message_key: log

  # Example for systemd journal
  - type: journald
    id: my-cpp-app-journal
    enabled: true
    # You might need to filter journald logs to only capture your app's logs.
    # The journald input supports include_matches for this; its exact syntax
    # differs between Filebeat versions, so check the documentation for yours.
    # include_matches:
    #   - "_SYSTEMD_UNIT=my_cpp_app.service"

output.elasticsearch:
  hosts: ["your-elasticsearch-host:9200"]
  # username: "elastic"
  # password: "changeme"

# Or to Logstash (only one output may be enabled at a time)
# output.logstash:
#   hosts: ["your-logstash-host:5044"]
Promtail Configuration for Loki
If using Grafana Loki, Promtail is the agent. It’s designed to work seamlessly with Prometheus labels.
# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml # Use a persistent path in production; /tmp may be cleared on reboot

clients:
  - url: http://your-loki-host:3100/loki/api/v1/push

scrape_configs:
  - job_name: systemd
    journal:
      max_age: 12h
      path: /var/log/journal # Or wherever your journald logs are
      labels:
        job: systemd
    relabel_configs:
      # Keep only logs from my_cpp_app.service
      - source_labels: ['__journal__systemd_unit']
        regex: 'my_cpp_app\.service'
        action: keep
      # Expose the unit name as a queryable label
      - source_labels: ['__journal__systemd_unit']
        target_label: 'app_unit'

  - job_name: my_cpp_app_logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: my_cpp_app
          host: ${HOSTNAME} # Expanded only when Promtail runs with -config.expand-env=true
          __path__: /var/log/my_cpp_app/*.log # Files to tail
    pipeline_stages:
      - json:
          expressions:
            message: log # Assuming your app logs JSON with a 'log' field
            level: level # Assuming a 'level' field
            time: time # Assuming a 'time' field for timestamp
      - labels:
          level: # Promote the parsed level to a label
      - timestamp:
          source: time
          format: RFC3339Nano # Adjust format as needed
Deploying and configuring these logging solutions allows for centralized log analysis, enabling faster root cause identification during incidents.
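Both agent configurations above assume the application writes one JSON object per line with log, level, and time fields. A minimal sketch of producing such a line in C++ without a JSON library follows. JsonEscape and FormatJsonLogLine are hypothetical helpers, and the caller is assumed to supply an RFC 3339 timestamp string.

```cpp
#include <cstdio>
#include <string>

// Escape a string for use inside a JSON string literal.
std::string JsonEscape(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {  // remaining control characters
                    char buf[7];
                    std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Build a single-line JSON log record with the fields the shippers expect.
std::string FormatJsonLogLine(const std::string& timeRfc3339,
                              const std::string& level,
                              const std::string& message) {
    return "{\"time\":\"" + JsonEscape(timeRfc3339) +
           "\",\"level\":\"" + JsonEscape(level) +
           "\",\"log\":\"" + JsonEscape(message) + "\"}";
}
```

Escaping newlines matters here: both Filebeat and Promtail treat each line as one event, so a raw newline inside a message would split the record and break JSON parsing.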