Server Monitoring Best Practices: Keeping Your C++ App and PostgreSQL Clusters Alive on OVH

Core Metrics for C++ Applications on OVH

When running performance-critical C++ applications on OVH infrastructure, a granular understanding of system and application-level metrics is paramount. This isn’t just about uptime; it’s about identifying and mitigating performance bottlenecks before they impact users. We’ll focus on key indicators that directly reflect the health and efficiency of your C++ processes and the underlying OVH compute resources.

Process Resource Utilization

The first line of defense is monitoring the resources consumed by your C++ application’s processes. High CPU, excessive memory allocation, and uncontrolled I/O can signal inefficiencies or outright bugs. We’ll use standard Linux tools, often exposed via monitoring agents.

CPU Usage per Process

A consistently high CPU utilization for your application’s PID (Process ID) is a red flag. This could indicate inefficient algorithms, busy-waiting loops, or insufficient processing power. We’ll monitor the percentage of CPU time consumed by the application’s main process and any significant worker threads.
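As a rough sketch of how an agent derives this figure on Linux, the snippet below reads the utime and stime fields from /proc/[pid]/stat and converts the difference between two samples into a percentage of one core. The function names are illustrative, not part of any library, and the code assumes C++17 and a Linux /proc filesystem.

```cpp
#include <fstream>
#include <optional>
#include <sstream>
#include <string>
#include <unistd.h>

// Parses one line of /proc/[pid]/stat and returns utime + stime (fields 14
// and 15) in clock ticks. The command name (field 2) is wrapped in
// parentheses and may contain spaces, so parsing starts after the last ')'.
std::optional<unsigned long long> cpu_ticks_from_stat_line(const std::string& line) {
    const auto close = line.rfind(')');
    if (close == std::string::npos) return std::nullopt;
    std::istringstream rest(line.substr(close + 1));
    std::string field;
    // Skip fields 3 (state) through 13 (cmajflt).
    for (int i = 3; i <= 13; ++i) {
        if (!(rest >> field)) return std::nullopt;
    }
    unsigned long long utime = 0, stime = 0;
    if (!(rest >> utime >> stime)) return std::nullopt;
    return utime + stime;
}

// Reads the current CPU tick count of a process from /proc.
std::optional<unsigned long long> read_cpu_ticks(pid_t pid) {
    std::ifstream stat("/proc/" + std::to_string(pid) + "/stat");
    std::string line;
    if (!std::getline(stat, line)) return std::nullopt;
    return cpu_ticks_from_stat_line(line);
}

// CPU usage between two samples as a percentage of ONE core
// (a process with four busy threads can therefore report 400%).
double cpu_percent(unsigned long long before, unsigned long long after,
                   double interval_seconds, long ticks_per_second) {
    if (after < before || interval_seconds <= 0.0 || ticks_per_second <= 0) return 0.0;
    return 100.0 * static_cast<double>(after - before) /
           static_cast<double>(ticks_per_second) / interval_seconds;
}
```

To use it, call read_cpu_ticks(getpid()) twice, a known interval apart, and pass both results to cpu_percent() together with sysconf(_SC_CLK_TCK) as ticks_per_second.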

Memory Footprint

Memory leaks or excessive memory allocation can lead to swapping, which drastically degrades performance, and eventually to out-of-memory (OOM) killer events. Monitoring the resident set size (RSS) and virtual memory size (VSZ) of your C++ processes is crucial; they appear as VmRSS and VmSize in /proc/[pid]/status. Tools like pmap and /proc/[pid]/status are invaluable for deep dives.
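A minimal sketch of reading these two values from inside the process is shown below. It parses the VmRSS and VmSize lines of /proc/[pid]/status, which the kernel reports in kB. The struct and function names are made up for illustration.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

struct MemoryUsage {
    long rss_kib = -1;    // VmRSS: resident set size, -1 if not found
    long vsize_kib = -1;  // VmSize: virtual memory size, -1 if not found
};

// Parses the "Key:   value kB" lines of a /proc/[pid]/status stream.
MemoryUsage parse_status(std::istream& in) {
    MemoryUsage usage;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string key;
        long value = 0;
        // Lines whose second field is not a number (e.g. "Name:") are skipped.
        if (!(fields >> key >> value)) continue;
        if (key == "VmRSS:") {
            usage.rss_kib = value;
        } else if (key == "VmSize:") {
            usage.vsize_kib = value;
        }
    }
    return usage;
}

// Reads memory usage for a PID given as text, or for the calling process ("self").
MemoryUsage read_memory_usage(const std::string& pid = "self") {
    std::ifstream status("/proc/" + pid + "/status");
    return parse_status(status);
}
```

Sampling read_memory_usage() periodically and exporting rss_kib as a gauge makes a steadily climbing RSS, the typical signature of a leak, visible long before the OOM killer steps in.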

File Descriptor Usage

C++ applications, especially those handling network connections or numerous files, can exhaust their file descriptor limits. This will manifest as errors like “Too many open files.” Monitoring the number of open file descriptors per process is essential.
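The check can be sketched as follows: count the entries in /proc/self/fd and compare the result with the soft RLIMIT_NOFILE limit. The helper names and the 80% fraction used in the usage note are assumptions for illustration; the code requires C++17 for std::filesystem.

```cpp
#include <filesystem>
#include <system_error>
#include <sys/resource.h>

// Counts the entries of a /proc/[pid]/fd-style directory. Returns -1 on error.
// When scanning the calling process, the count includes the descriptor that
// is temporarily opened to read the directory itself.
long count_open_fds(const std::filesystem::path& fd_dir = "/proc/self/fd") {
    std::error_code ec;
    std::filesystem::directory_iterator it(fd_dir, ec);
    if (ec) return -1;
    const std::filesystem::directory_iterator end;
    long count = 0;
    while (it != end) {
        ++count;
        it.increment(ec);
        if (ec) return -1;
    }
    return count;
}

// Returns the soft limit on open files, or -1 if it is unlimited or unknown.
long soft_fd_limit() {
    rlimit lim{};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return -1;
    if (lim.rlim_cur == RLIM_INFINITY) return -1;
    return static_cast<long>(lim.rlim_cur);
}

// True when open_fds has reached the given fraction of the limit.
bool fd_usage_above(long open_fds, long limit, double fraction) {
    return open_fds >= 0 && limit > 0 &&
           static_cast<double>(open_fds) >= fraction * static_cast<double>(limit);
}
```

A periodic call such as fd_usage_above(count_open_fds(), soft_fd_limit(), 0.8) can drive a log warning or a gauge well before accept() or open() start failing.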

Application-Specific Metrics (C++)

Beyond system-level metrics, instrumenting your C++ code to expose application-specific performance indicators is vital. This requires a robust C++ metrics library and a collection endpoint.

Custom Metrics with Prometheus Client Library for C++

The Prometheus C++ client library is an excellent choice for instrumenting your application. It allows you to expose metrics via an HTTP endpoint, which Prometheus can then scrape.

Example: Exposing Request Latency and Error Counts

Consider a C++ web service. You’d want to track request latency and the number of errors encountered.

#include <prometheus/counter.h>
#include <prometheus/exposer.h>
#include <prometheus/registry.h>
#include <prometheus/summary.h>

#include <chrono>
#include <exception>
#include <memory>
#include <string>

// Global registry and exposer. The registry is a shared_ptr because
// Exposer::RegisterCollectable() takes a std::weak_ptr<Collectable>.
std::shared_ptr<prometheus::Registry> registry;
std::unique_ptr<prometheus::Exposer> exposer;

// Metric families. They are created in initialize_metrics(), once the
// registry exists; registering them at static-initialization time would
// dereference a null registry.
prometheus::Family<prometheus::Summary>* request_latency_family = nullptr;
prometheus::Family<prometheus::Counter>* request_count_family = nullptr;

void initialize_metrics() {
    registry = std::make_shared<prometheus::Registry>();

    request_latency_family = &prometheus::BuildSummary()
        .Name("http_request_duration_seconds")
        .Help("Time spent serving HTTP requests, in seconds.")
        .Register(*registry);

    request_count_family = &prometheus::BuildCounter()
        .Name("http_requests_total")
        .Help("Total number of HTTP requests, by endpoint and status.")
        .Register(*registry);

    exposer = std::make_unique<prometheus::Exposer>("0.0.0.0:9100"); // Expose on port 9100
    exposer->RegisterCollectable(registry);
}

void handle_request(const std::string& endpoint) {
    // steady_clock is monotonic, which makes it the right clock for intervals
    auto start_time = std::chrono::steady_clock::now();
    bool success = false;

    try {
        // ... your request handling logic ...
        // Simulate success
        success = true;
    } catch (const std::exception& e) {
        // Log error
        success = false;
    }

    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start_time;

    // Observe latency. A Summary needs quantile objectives: {quantile, allowed error}.
    static const prometheus::Summary::Quantiles quantiles{
        {0.5, 0.05}, {0.95, 0.01}, {0.99, 0.001}};
    request_latency_family->Add({{"endpoint", endpoint}}, quantiles)
        .Observe(elapsed.count());

    // Count the request, labelled by outcome
    request_count_family
        ->Add({{"endpoint", endpoint}, {"status", success ? "200" : "500"}})
        .Increment();
}

int main() {
    initialize_metrics();
    // ... your application setup ...

    // Example usage within a request loop
    // handle_request("/api/v1/users");

    return 0;
}

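This code snippet demonstrates how to set up a Prometheus registry, expose metrics on a specific port (e.g., 9100), and record request durations and request counts labelled by status, from which error rates can be derived. The Exposer will serve metrics at /metrics on the specified address. Note that 9100 is also the default port of node_exporter, so pick a different port if both run on the same host.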

PostgreSQL Cluster Monitoring on OVH

For PostgreSQL clusters, especially those hosted on OVH’s managed services or self-hosted on their instances, monitoring goes beyond basic connectivity. We need to ensure query performance, replication health, and resource utilization of the database instances.

Key PostgreSQL Metrics

Replication Lag: Critical for high availability setups. Monitor pg_stat_replication on the primary and pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn() on replicas.
Connection Usage: Track active connections, waiting connections, and total connections against max_connections.
Query Performance: Identify slow queries using pg_stat_statements and monitor overall query throughput.
Disk I/O: Monitor read/write operations and latency for the database data directories.
Buffer Cache Hit Ratio: A high hit ratio (e.g., > 95%) indicates effective caching.
Transaction Rate: Monitor xact_commit and xact_rollback.
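To make the buffer cache hit ratio concrete: it is usually computed from the blks_hit and blks_read counters of pg_stat_database as blks_hit / (blks_hit + blks_read). The helper below is a small sketch of that arithmetic (the function name is illustrative). Keep in mind that blks_read counts reads that missed PostgreSQL's shared buffers, some of which may still have been served from the operating system's page cache.

```cpp
#include <cstdint>
#include <optional>

// Computes the buffer cache hit ratio from pg_stat_database counters.
// Returns std::nullopt when no blocks have been requested yet.
std::optional<double> buffer_cache_hit_ratio(std::uint64_t blks_hit,
                                             std::uint64_t blks_read) {
    const std::uint64_t total = blks_hit + blks_read;
    if (total == 0) return std::nullopt;
    return static_cast<double>(blks_hit) / static_cast<double>(total);
}
```

For example, 950 hits and 50 reads give a ratio of 0.95. Because the counters are cumulative since the last statistics reset, computing the ratio on the difference between two samples reflects current behaviour better than the lifetime totals do.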

Monitoring Tools and Techniques

Several tools can be leveraged. For Prometheus users, the postgres_exporter is a robust solution. For more integrated solutions, OVH’s own monitoring dashboards (if using managed services) or third-party APM tools can provide valuable insights.

Configuring Prometheus Exporter for PostgreSQL

The postgres_exporter (originally wrouesnel/postgres_exporter, now maintained as prometheus-community/postgres_exporter on GitHub) can be deployed as a sidecar or a separate service. It connects to your PostgreSQL instances and exposes metrics in Prometheus format.

Example: `.pgpass` and Environment Variables

Securely connecting the exporter to your PostgreSQL instances is key. Using a .pgpass file or environment variables is recommended over embedding credentials directly in configuration.

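# ~/.pgpass file (permissions must be 0600)
hostname:port:database:username:password

# Example environment variables for postgres_exporter
# Option 1: a single connection string (the password ends up in the process environment)
export DATA_SOURCE_NAME="postgresql://user:password@host:port/database?sslmode=require"
# Option 2: keep the credentials out of the URI and read the password from a file
export DATA_SOURCE_URI="your_db_host:5432/your_db_name?sslmode=require"
export DATA_SOURCE_USER="your_db_user"
export DATA_SOURCE_PASS_FILE="/path/to/password_file" # placeholder path
# Use sslmode=disable only on a trusted private network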

The exporter typically runs on a port (e.g., 9187) and exposes metrics at /metrics. You’ll configure Prometheus to scrape this endpoint.

Replication Lag Monitoring

Replication lag is a critical indicator of data consistency and availability in a failover scenario. We’ll query PostgreSQL directly to assess this.

Querying Replication Status

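-- On the Primary Server
-- pg_wal_lsn_diff() returns a byte count. The view's own write_lag, flush_lag,
-- and replay_lag columns (PostgreSQL 10+) are intervals, so the byte values
-- are aliased with a _bytes suffix to avoid confusing the two.
SELECT
    pid,
    application_name,
    client_addr,
    state,
    sync_state,
    pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag_bytes,
    pg_wal_lsn_diff(sent_lsn, flush_lsn) AS flush_lag_bytes,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
    replay_lag
FROM
    pg_stat_replication;

-- On the Replica Server
-- replay_delay keeps growing while the primary is idle, because it measures
-- the time since the last replayed transaction.
SELECT
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn() AS replay_lsn,
    pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_lag_bytes_local,
    now() - pg_last_xact_replay_timestamp() AS replay_delay;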

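These queries show how far behind the replicas are. Note that pg_wal_lsn_diff() returns a byte count, not a duration. For a time-based alert threshold (e.g., 30 seconds), use the built-in replay_lag interval column of pg_stat_replication (PostgreSQL 10 and later) on the primary, or now() - pg_last_xact_replay_timestamp() on the replica.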
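If your agent fetches raw LSN values instead of calling pg_wal_lsn_diff() on the server, the byte difference can be computed client-side. An LSN is printed as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit WAL position. The sketch below (function names are illustrative, C++17 assumed) parses that text form and subtracts two positions.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <optional>
#include <string>

// Parses an LSN in its text form "XXXXXXXX/YYYYYYYY" into a 64-bit position.
std::optional<std::uint64_t> parse_lsn(const std::string& text) {
    const auto slash = text.find('/');
    if (slash == std::string::npos || slash == 0 || slash + 1 >= text.size()) {
        return std::nullopt;
    }
    const std::string hi_text = text.substr(0, slash);
    const std::string lo_text = text.substr(slash + 1);
    try {
        std::size_t hi_used = 0, lo_used = 0;
        const unsigned long long hi = std::stoull(hi_text, &hi_used, 16);
        const unsigned long long lo = std::stoull(lo_text, &lo_used, 16);
        // Reject trailing garbage and halves that do not fit in 32 bits.
        if (hi_used != hi_text.size() || lo_used != lo_text.size()) return std::nullopt;
        if (hi > 0xFFFFFFFFULL || lo > 0xFFFFFFFFULL) return std::nullopt;
        return (static_cast<std::uint64_t>(hi) << 32) | static_cast<std::uint64_t>(lo);
    } catch (const std::exception&) {
        return std::nullopt;  // not a hexadecimal number
    }
}

// Byte distance a - b, mirroring the argument order of pg_wal_lsn_diff(a, b).
std::int64_t lsn_diff_bytes(std::uint64_t a, std::uint64_t b) {
    return static_cast<std::int64_t>(a - b);
}
```

For instance, the distance between 0/3000060 and 0/3000000 is 0x60, that is 96 bytes.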

OVH Infrastructure Monitoring Integration

OVH provides its own monitoring tools and APIs for its cloud instances and managed services. Integrating these with your broader monitoring strategy (e.g., Prometheus, Grafana) provides a holistic view.

Leveraging OVH API for Instance Metrics

OVH’s Public Cloud API allows you to retrieve metrics for your instances, such as CPU utilization, network traffic, and disk I/O. You can write custom scripts or use existing integrations to pull this data into your central monitoring system.

Example: Fetching Instance Metrics with `curl`

You’ll need your OVH API credentials (Application Key, Application Secret, Consumer Key), your Public Cloud project ID, and the instance ID.

# Replace with your actual credentials, project ID, and instance ID
APPLICATION_KEY="YOUR_APPLICATION_KEY"
APPLICATION_SECRET="YOUR_APPLICATION_SECRET"
CONSUMER_KEY="YOUR_CONSUMER_KEY"
PROJECT_ID="YOUR_PROJECT_ID"
INSTANCE_ID="your-instance-id"
API="https://eu.api.ovh.com/1.0" # EU endpoint; other regions use a different host

# The OVH API has no login token: every request is signed with
# "$1$" + SHA1(secret+consumerKey+METHOD+URL+BODY+timestamp), joined by "+".
ovh_get() {
  local url="${API}$1"
  local ts sig
  ts=$(curl -s "${API}/auth/time") # server time avoids clock-skew errors
  sig='$1$'$(printf '%s' "${APPLICATION_SECRET}+${CONSUMER_KEY}+GET+${url}++${ts}" | sha1sum | cut -d' ' -f1)
  curl -s -X GET "${url}" \
    -H "X-Ovh-Application: ${APPLICATION_KEY}" \
    -H "X-Ovh-Consumer: ${CONSUMER_KEY}" \
    -H "X-Ovh-Timestamp: ${ts}" \
    -H "X-Ovh-Signature: ${sig}"
}

# The monitoring path and its period/type parameters below should be checked
# against the OVH API console for your account.

# Fetch CPU metrics (example)
ovh_get "/cloud/project/${PROJECT_ID}/instance/${INSTANCE_ID}/monitoring?period=today&type=cpu:used" | jq '.'

# Fetch network metrics (example)
ovh_get "/cloud/project/${PROJECT_ID}/instance/${INSTANCE_ID}/monitoring?period=today&type=net:rx" | jq '.'

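You would then parse the JSON output from these `curl` commands and make the relevant metrics available to Prometheus. Because Prometheus scrapes targets rather than accepting pushed samples, this means exposing them through a small custom exporter or sending them to the Pushgateway.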
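A custom agent ultimately has to render each value in the Prometheus text exposition format before serving it on /metrics. The sketch below shows that rendering step for a single gauge; the function names are illustrative, and the metric name used in the usage note is a made-up example rather than something OVH or Prometheus defines. It assumes C++17.

```cpp
#include <map>
#include <sstream>
#include <string>

// Escapes a label value as the text exposition format requires:
// backslash, double quote, and newline are backslash-escaped.
std::string escape_label_value(const std::string& raw) {
    std::string out;
    for (char c : raw) {
        switch (c) {
            case '\\': out += "\\\\"; break;
            case '"':  out += "\\\""; break;
            case '\n': out += "\\n";  break;
            default:   out += c;
        }
    }
    return out;
}

// Renders one gauge sample with its HELP and TYPE lines.
// The value is written with the stream's default precision (6 significant digits).
std::string render_gauge(const std::string& name, const std::string& help,
                         const std::map<std::string, std::string>& labels,
                         double value) {
    std::ostringstream out;
    out << "# HELP " << name << ' ' << help << '\n'
        << "# TYPE " << name << " gauge\n"
        << name;
    if (!labels.empty()) {
        out << '{';
        bool first = true;
        for (const auto& [key, val] : labels) {
            if (!first) out << ',';
            out << key << "=\"" << escape_label_value(val) << '"';
            first = false;
        }
        out << '}';
    }
    out << ' ' << value << '\n';
    return out.str();
}
```

Calling render_gauge("ovh_instance_cpu_used_percent", "CPU used by the instance.", {{"instance_id", "abc"}}, 42.5) yields the HELP and TYPE comment lines followed by the sample line ovh_instance_cpu_used_percent{instance_id="abc"} 42.5.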

Alerting Strategies

Effective alerting is crucial. It should be actionable, informative, and minimize alert fatigue. We’ll use Prometheus Alertmanager as a common example.

Alerting on C++ Application Issues

High CPU/Memory: Alert when application process CPU or memory usage exceeds a defined threshold (e.g., 80%) for a sustained period (e.g., 5 minutes).
High File Descriptors: Alert when the number of open file descriptors approaches the per-process limit (e.g., 80% of ulimit -n).
Application Errors: Alert on spikes in error rates (e.g., 5xx errors from your C++ app’s metrics) or when error rates exceed a percentage of total requests.
Latency Degradation: Alert when request latency percentiles (e.g., p95, p99) exceed acceptable limits.
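The "sustained period" condition in the first item is what keeps short spikes from paging anyone. Prometheus implements it with the `for` clause of an alerting rule; the class below is only a sketch of the same idea for code that evaluates thresholds in-process. The class name is hypothetical, and timestamps are passed in explicitly so the logic is easy to test. It assumes C++17.

```cpp
#include <chrono>
#include <optional>

// Fires only after a value has stayed above a threshold for a minimum duration.
class SustainedThreshold {
public:
    using Clock = std::chrono::steady_clock;

    SustainedThreshold(double threshold, std::chrono::seconds hold)
        : threshold_(threshold), hold_(hold) {}

    // Feeds one sample taken at `now`. Returns true while the value has been
    // above the threshold continuously for at least `hold`.
    bool update(double value, Clock::time_point now) {
        if (value <= threshold_) {
            breach_start_.reset();  // condition cleared, start over
            return false;
        }
        if (!breach_start_) breach_start_ = now;
        return now - *breach_start_ >= hold_;
    }

private:
    double threshold_;
    std::chrono::seconds hold_;
    std::optional<Clock::time_point> breach_start_;
};
```

With a threshold of 80 and a hold of 300 seconds, a sample of 90% at t=0 and another of 95% at t=120 s do not fire; a third breaching sample at t=300 s does, and a single sample below 80% resets the timer.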

Alerting on PostgreSQL Cluster Issues

Replication Lag: Alert when replay_lag on any replica exceeds a critical threshold (e.g., 30 seconds).
High Connection Count: Alert when the number of active or waiting connections approaches max_connections.
Low Buffer Cache Hit Ratio: Alert if the hit ratio drops below a critical level (e.g., 90%).
Disk Space Full: Monitor disk usage on database volumes and alert well in advance of capacity being reached.
Unhealthy Primary/Replica: Alert if a replica is not reachable or not replicating.

Example Prometheus Alerting Rules

These rules would be placed in a Prometheus rules file (e.g., alerts.yml) and loaded by Prometheus.

groups:
- name: cpp_app_alerts
  rules:
  # Note: prometheus-cpp does not export process_* metrics by default. These
  # two rules assume your app or a process-level exporter provides them.
  # 100% here means one fully used CPU core.
  - alert: HighCppAppCpuUsage
    expr: avg by (instance, job) (rate(process_cpu_seconds_total{job="your_cpp_app"}[5m])) * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'High CPU usage for {{ $labels.job }} on {{ $labels.instance }}'
      description: 'CPU usage for {{ $labels.job }} on {{ $labels.instance }} is {{ $value | printf "%.2f" }}% for the last 5 minutes.'

  - alert: HighCppAppMemoryUsage
    expr: process_resident_memory_bytes{job="your_cpp_app"} / (1024*1024*1024) > 10 # Example: > 10GB
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: 'High Memory usage for {{ $labels.job }} on {{ $labels.instance }}'
      description: 'Memory usage for {{ $labels.job }} on {{ $labels.instance }} is {{ $value | printf "%.2f" }} GB for the last 10 minutes.'

- name: postgres_alerts
  rules:
  - alert: HighReplicationLag
    expr: pg_replication_lag_seconds{job="postgres_exporter"} > 30
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: 'High PostgreSQL replication lag on {{ $labels.instance }}'
      description: 'PostgreSQL replication lag on {{ $labels.instance }} is {{ $value | printf "%.2f" }} seconds, exceeding the 30-second threshold.'

  - alert: HighPostgresConnections
    # All connection states count against max_connections, which
    # postgres_exporter exposes as pg_settings_max_connections.
    expr: sum by (instance) (pg_stat_activity_count{job="postgres_exporter"}) > on (instance) pg_settings_max_connections{job="postgres_exporter"} * 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'High number of PostgreSQL connections on {{ $labels.instance }}'
      description: 'PostgreSQL connections on {{ $labels.instance }} are at {{ $value | printf "%.0f" }}, above 90% of max_connections.'

These rules define conditions that trigger alerts. Alertmanager then routes these alerts to appropriate notification channels (e.g., Slack, PagerDuty, email).

Conclusion

Maintaining the health and performance of C++ applications and PostgreSQL clusters on OVH requires a multi-layered monitoring approach. By combining system-level metrics, application-specific instrumentation, database-level insights, and leveraging OVH’s infrastructure data, you can build a robust monitoring system that ensures reliability and proactively addresses potential issues.