Server Monitoring Best Practices: Keeping Your C++ App and Elasticsearch Clusters Alive on OVH

Proactive C++ Application Health Checks

Maintaining the stability of a C++ application, especially one that depends on an Elasticsearch cluster for critical data, requires more than basic process monitoring. We need application-level health checks that give granular insight into its operational status. For a C++ application, this often means exposing an internal HTTP endpoint that reports on key metrics and internal states.

Consider a C++ application that uses a custom HTTP server library (e.g., Boost.Beast, cpp-httplib) to expose a health check endpoint. This endpoint should not only confirm the process is running but also verify critical internal components, such as database connections, thread pool status, and cache health.

Implementing a Basic Health Check Endpoint in C++

Here’s a simplified example using a hypothetical C++ HTTP server framework. The goal is to return a 200 OK for a healthy state and a 5xx error for any detected issues. We’ll simulate checking a critical internal resource.

#include <iostream>
#include <string>
#include <chrono>
#include <thread>
#include <atomic>

// Assume a simplified HTTP server class
class HttpServer {
public:
    void start(int port) {
        std::cout << "HTTP server started on port " << port << std::endl;
        // In a real scenario, this would involve socket programming and request handling
        // For this example, we'll just simulate a running server
        running_ = true;
        // Simulate request handling loop
        while(running_) {
            // In a real server, this loop would process incoming requests
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    }
    void stop() {
        running_ = false;
        std::cout << "HTTP server stopped." << std::endl;
    }

    // Simulate handling a GET request to /health
    void handle_health_request() {
        if (is_application_healthy()) {
            std::cout << "Health check: OK" << std::endl;
            // In a real server, this would send HTTP 200 OK response
        } else {
            std::cerr << "Health check: FAILED" << std::endl;
            // In a real server, this would send HTTP 503 Service Unavailable response
        }
    }

private:
    std::atomic<bool> running_{false};

    // Simulate checking a critical internal resource (e.g., database connection pool)
    bool is_application_healthy() {
        // Simulate a check that might fail intermittently
        static int check_counter = 0;
        check_counter++;
        if (check_counter % 10 == 0) { // Simulate failure every 10 checks
            return false;
        }
        return true;
    }
};

// Global instance for demonstration
HttpServer app_server;

// Function to simulate application logic and health check endpoint
void run_application_logic() {
    std::cout << "Application logic running..." << std::endl;
    // Simulate periodic health check endpoint invocation.
    // The loop is bounded so the demo reaches the shutdown code in main();
    // a real service would loop until it receives a shutdown signal.
    for (int i = 0; i < 20; ++i) {
        app_server.handle_health_request();
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }
}

int main() {
    // Start the HTTP server in a separate thread
    std::thread server_thread([](){
        app_server.start(8080); // Listen on port 8080
    });

    // Run the main application logic in the main thread
    run_application_logic();

    // In a real application, you'd have a mechanism to gracefully shut down
    app_server.stop();
    server_thread.join();

    return 0;
}

In a production environment, this `handle_health_request` would be integrated into your HTTP server's request routing. The `is_application_healthy` function would contain actual checks: verifying connectivity to Elasticsearch, checking queue depths, ensuring background worker threads are active, and validating cache availability. The response should ideally include more detailed JSON for easier parsing by monitoring tools.
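As a rough illustration of such a JSON response, the sketch below aggregates per-component results into a status code and a body. The `ComponentStatus` and `HealthResponse` types, the component names, and the JSON field names are all hypothetical, and the component names are assumed to need no JSON escaping.

```cpp
#include <string>
#include <vector>

// Result of one component check; names such as "elasticsearch" are illustrative.
struct ComponentStatus {
    std::string name;
    bool healthy;
};

// HTTP status code plus JSON body for the /health response.
struct HealthResponse {
    int status_code;
    std::string body;
};

// Returns 200 when every component is healthy and 503 otherwise.
// Component names are assumed to be plain identifiers (no escaping is done).
HealthResponse build_health_response(const std::vector<ComponentStatus>& components) {
    bool all_healthy = true;
    std::string checks;
    for (const auto& c : components) {
        if (!checks.empty()) {
            checks += ",";
        }
        checks += "\"" + c.name + "\":\"" + (c.healthy ? "ok" : "failed") + "\"";
        all_healthy = all_healthy && c.healthy;
    }
    const std::string body = std::string("{\"status\":\"") +
                             (all_healthy ? "ok" : "degraded") +
                             "\",\"checks\":{" + checks + "}}";
    return {all_healthy ? 200 : 503, body};
}
```

A request handler could call `build_health_response` with the results of its real checks and write `status_code` and `body` to the HTTP response, so a probe that only looks at the status code and a human reading the body see the same verdict.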

Monitoring the Health Endpoint with Prometheus

Prometheus is an excellent choice for monitoring these health endpoints. Prometheus itself only scrapes metrics in its own text exposition format, so it cannot interpret a bare status code or an arbitrary JSON payload directly. For a basic check, we can use the `blackbox_exporter` to probe the HTTP endpoint and translate the result into metrics.

First, ensure you have the `blackbox_exporter` running. It's designed to probe endpoints over various protocols (HTTP, TCP, ICMP, etc.) and report their status. We'll configure it to probe our C++ application's health endpoint.

Blackbox Exporter Configuration (`blackbox.yml`)

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      # Expect a 2xx status code for a healthy response
      # For more detailed checks, you might parse JSON response body
      # and check specific fields.
      # For now, we rely on the C++ app returning 200 OK.
      valid_status_codes: [] # Defaults to 2xx
      # If your C++ app returns JSON, you can fail the probe on a body match:
      # fail_if_body_matches_regexp:
      #   - "error"
      # fail_if_not_ssl: false
      # tls_config:
      #   insecure_skip_verify: true

Prometheus Configuration (`prometheus.yml`)

Next, configure Prometheus to scrape the `blackbox_exporter` and tell it which targets to probe. The `targets` list holds the application URLs to check. The relabeling rules copy each one into the `target` query parameter and then redirect the scrape itself to the `blackbox_exporter`'s address.

scrape_configs:
  - job_name: 'blackbox_http_app'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Use the module defined in blackbox.yml
    static_configs:
      - targets:
        - http://your-cpp-app-ip:8080/health # The actual target to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115 # Address of your blackbox_exporter

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Add scrape configs for your Elasticsearch nodes here
  # ...

With this setup, Prometheus will periodically scrape the `blackbox_exporter`, which in turn will probe your C++ application's `/health` endpoint. Each probe result is recorded as metrics such as `probe_success` (1 or 0) and `probe_duration_seconds`, allowing you to set up alerts.
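To make the mechanics concrete: each scrape of the blackbox job amounts to an HTTP GET against the exporter's `/probe` path with `module` and `target` query parameters. The sketch below builds that URL. The helper names `url_encode` and `build_probe_url` are hypothetical, and the host names are the placeholders used in this article.

```cpp
#include <cctype>
#include <string>

// Percent-encodes everything except RFC 3986 unreserved characters.
std::string url_encode(const std::string& value) {
    static const char* hex = "0123456789ABCDEF";
    std::string out;
    for (unsigned char ch : value) {
        if (std::isalnum(ch) || ch == '-' || ch == '_' || ch == '.' || ch == '~') {
            out += static_cast<char>(ch);
        } else {
            out += '%';
            out += hex[ch >> 4];
            out += hex[ch & 0x0F];
        }
    }
    return out;
}

// Builds the URL requested for one probe of `target` using `module`.
std::string build_probe_url(const std::string& exporter_address,
                            const std::string& module,
                            const std::string& target) {
    return "http://" + exporter_address + "/probe?module=" + url_encode(module) +
           "&target=" + url_encode(target);
}
```

Requesting the resulting URL by hand (for example with `curl`) is a quick way to see the probe metrics for one target before wiring the job into Prometheus.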

Elasticsearch Cluster Health and Performance Monitoring

Monitoring an Elasticsearch cluster involves more than just checking if the nodes are up. We need to track cluster health status, shard allocation, indexing performance, search latency, and resource utilization (CPU, memory, disk). OVH's infrastructure, like that of any cloud provider, can introduce network latency or resource contention, which makes robust monitoring crucial.

Leveraging Elasticsearch's Monitoring APIs

Elasticsearch exposes detailed cluster and node metrics through REST APIs such as `_cluster/health` and `_nodes/stats`. These APIs return JSON rather than the Prometheus exposition format, so Prometheus cannot scrape them directly. The Elasticsearch Exporter for Prometheus bridges the gap by querying those APIs and re-exposing the results in a Prometheus-friendly format.

Using the Elasticsearch Exporter for Prometheus

The Elasticsearch Exporter is a standalone application that queries Elasticsearch's APIs and exposes the metrics via an HTTP endpoint for Prometheus to scrape. This decouples monitoring from the Elasticsearch cluster itself.
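The translation step can be pictured with a small sketch. It takes the `status` value that `_cluster/health` reports and renders it as Prometheus samples, one per color, which mirrors how the prometheus-community exporter reports cluster status. The function name `render_cluster_status` is hypothetical and the JSON parsing is assumed to have happened already. This illustrates the idea only and is not the exporter's actual code.

```cpp
#include <array>
#include <string>

// Renders a cluster health color ("green", "yellow" or "red") as three
// Prometheus samples, one per color, with value 1 for the current color.
std::string render_cluster_status(const std::string& cluster, const std::string& status) {
    const std::array<const char*, 3> colors = {"green", "yellow", "red"};
    std::string out = "# TYPE elasticsearch_cluster_health_status gauge\n";
    for (const char* color : colors) {
        out += "elasticsearch_cluster_health_status{cluster=\"" + cluster +
               "\",color=\"" + color + "\"} " + (status == color ? "1" : "0") + "\n";
    }
    return out;
}
```

Exposing one series per color keeps every state queryable with a simple label matcher, at the cost of three series instead of one.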

First, download and run the Elasticsearch Exporter, pointing it at your Elasticsearch cluster.

Elasticsearch Exporter Configuration (command-line flags)

For the `prometheus-community/elasticsearch_exporter`, the connection URL and listen address are set with command-line flags. See https://github.com/prometheus-community/elasticsearch_exporter for the full list of options.

# --es.uri                Elasticsearch node address (replace with your own).
#                         If the cluster requires authentication, credentials
#                         can be embedded in the URI, for example
#                         http://monitor_user:monitor_password@host:9200
# --es.all                Collect stats for all nodes, not just the connected one.
# --es.indices            Also export per-index stats (adds overhead).
# --web.listen-address    Address Prometheus scrapes. The exporter's default
#                         is :9114; this guide uses 9118.

Run the exporter:

./elasticsearch_exporter \
  --es.uri="http://your-elasticsearch-node-ip:9200" \
  --es.all \
  --es.indices \
  --web.listen-address=":9118"

Prometheus Configuration for Elasticsearch Exporter

Add a new job to your `prometheus.yml` to scrape the Elasticsearch Exporter.

scrape_configs:
  # ... other scrape configs ...

  - job_name: 'elasticsearch'
    static_configs:
      - targets:
        - 'localhost:9118' # Address of your elasticsearch_exporter

Key Elasticsearch Metrics to Monitor

Once Prometheus is scraping the Elasticsearch Exporter, you'll have access to a wealth of metrics. Here are some critical ones to focus on:

Cluster Health Status: `elasticsearch_cluster_health_status`, exported once per `color` label (`green`, `yellow`, `red`) with a value of 1 for the current state and 0 for the others. Alert on anything other than green.
Shard Allocation: `elasticsearch_cluster_health_active_shards` and `elasticsearch_cluster_health_unassigned_shards`. Unassigned shards are a critical issue.
Indexing Performance: `elasticsearch_indices_indexing_index_total` (documents indexed; apply `rate()`), `elasticsearch_indices_indexing_index_time_seconds_total` (cumulative time spent indexing).
Search Performance: `elasticsearch_indices_search_query_total` (search queries; apply `rate()`), `elasticsearch_indices_search_query_time_seconds` (cumulative time spent on search queries).
JVM Heap Usage: `elasticsearch_jvm_memory_used_bytes{area="heap"}` and `elasticsearch_jvm_memory_max_bytes{area="heap"}`. High heap usage can lead to garbage collection pauses and instability.
Disk Usage: `elasticsearch_filesystem_data_available_bytes` and `elasticsearch_filesystem_data_size_bytes`. Ensure sufficient disk space for indices.
CPU Usage: `elasticsearch_process_cpu_percent`.

Metric names have changed between exporter versions, so confirm them against your exporter's `/metrics` output before building dashboards or alerts.
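Most of these metrics are cumulative counters, so the useful signal is the change between two scrapes. As a worked example, the sketch below derives an average search latency from two samples of a query counter and a query-time counter. The `CounterSample` struct and function name are hypothetical. In PromQL the same idea is expressed by dividing the `rate()` of the time counter by the `rate()` of the query counter.

```cpp
// One scrape of a pair of cumulative counters (field names are illustrative).
struct CounterSample {
    double query_total;         // cumulative number of search queries
    double query_time_seconds;  // cumulative seconds spent executing them
};

// Computes average seconds per query between two scrapes.
// Returns false when no queries ran or a counter went backwards
// (for example after a node restart), leaving the output untouched.
bool average_query_latency(const CounterSample& earlier,
                           const CounterSample& later,
                           double& seconds_per_query) {
    const double queries = later.query_total - earlier.query_total;
    const double seconds = later.query_time_seconds - earlier.query_time_seconds;
    if (queries <= 0.0 || seconds < 0.0) {
        return false;
    }
    seconds_per_query = seconds / queries;
    return true;
}
```

For instance, going from 1000 queries and 50 seconds to 1500 queries and 75 seconds gives 25 seconds over 500 queries, or 0.05 seconds per query.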

Alerting on Elasticsearch Cluster Issues

Define alerting rules in Prometheus based on these metrics; Alertmanager then handles routing and notification. For example, an alert for a red cluster status:

Prometheus Alerting Rules (`alerts.yml`)

groups:
- name: elasticsearch_alerts
  rules:
  - alert: ElasticsearchClusterRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1 # one series per color; 1 marks the current state
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster is RED"
      description: "The Elasticsearch cluster {{ $labels.cluster }} has entered a RED health state. Shards may be unavailable."

  - alert: ElasticsearchClusterYellow
    expr: elasticsearch_cluster_health_status{color="yellow"} == 1 # one series per color; 1 marks the current state
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch cluster is YELLOW"
      description: "The Elasticsearch cluster {{ $labels.cluster }} has entered a YELLOW health state. Some shards are unassigned."

  - alert: HighElasticsearchHeapUsage
    # Compare as a 0-1 ratio so that humanizePercentage renders the value correctly
    expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High Elasticsearch JVM Heap Usage"
      description: "Elasticsearch JVM heap usage on {{ $labels.instance }} is {{ $value | humanizePercentage }}."

  - alert: LowDiskSpaceElasticsearch
    # Compare as a 0-1 ratio so that humanizePercentage renders the value correctly
    expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes < 0.20
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Low Disk Space on Elasticsearch Node"
      description: "Elasticsearch node {{ $labels.instance }} has {{ $value | humanizePercentage }} disk space remaining."

Ensure your `prometheus.yml` loads these alerting rules through its `rule_files` section and points at Alertmanager in its `alerting` section, and that Alertmanager is set up to route these alerts to your preferred notification channels (e.g., Slack, PagerDuty).
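The `for:` clause in each rule is worth understanding: the expression must stay true across evaluations for the whole duration before the alert fires, and until then the alert is only pending. The sketch below models that behavior under simplifying assumptions. The `AlertTracker` class is hypothetical and is not how Prometheus is implemented internally.

```cpp
#include <chrono>

enum class AlertState { Inactive, Pending, Firing };

// Tracks one alert with a hold duration equivalent to the rule's `for:` value.
class AlertTracker {
public:
    explicit AlertTracker(std::chrono::seconds hold) : hold_(hold) {}

    // Feed one rule evaluation: whether the expression matched, and the
    // evaluation time as seconds since an arbitrary starting point.
    AlertState evaluate(bool condition, std::chrono::seconds now) {
        if (!condition) {
            active_ = false;  // any false evaluation resets the timer
            return AlertState::Inactive;
        }
        if (!active_) {
            active_ = true;
            active_since_ = now;
        }
        return (now - active_since_ >= hold_) ? AlertState::Firing : AlertState::Pending;
    }

private:
    std::chrono::seconds hold_;
    std::chrono::seconds active_since_{0};
    bool active_ = false;
};
```

With a five-minute hold, a condition that is true at 0 s and 240 s is still pending, becomes firing at 300 s, and drops back to inactive as soon as one evaluation returns false. This is why a short `for:` makes alerts twitchy and a long one delays notification.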

OVH Specific Considerations and Network Monitoring

When running on a cloud provider like OVH, network performance and reliability are paramount. Latency spikes, packet loss, or intermittent connectivity issues can directly impact your C++ application's performance and Elasticsearch cluster stability.

Network Latency and Packet Loss Monitoring

We can extend our Prometheus monitoring to include network-level checks. The `blackbox_exporter` can be configured to probe TCP ports for Elasticsearch (9200, 9300) and your C++ application (e.g., 8080) to measure latency and success rates.

Blackbox Exporter TCP Probes

modules:
  # ... other modules ...
  tcp_9200:
    prober: tcp
    timeout: 5s
    # The port comes from the probe target (host:port); the tcp prober
    # has no port setting of its own.
    # Optional settings go under a `tcp:` key, for example:
    # tcp:
    #   preferred_ip_protocol: "ip4" # or "ip6"

  tcp_9300:
    prober: tcp
    timeout: 5s

  tcp_cpp_app:
    prober: tcp
    timeout: 5s

Prometheus Configuration for Network Probes

Add new scrape jobs to `prometheus.yml` to probe your Elasticsearch nodes and C++ application instances over TCP.

scrape_configs:
  # ... other scrape configs ...

  - job_name: 'blackbox_elasticsearch_tcp'
    metrics_path: /probe
    params:
      module: [tcp_9200] # Or tcp_9300
    static_configs:
      - targets:
        - 'elasticsearch-node-1.ovh.internal:9200' # Replace with actual ES node IPs/hostnames
        - 'elasticsearch-node-2.ovh.internal:9200'
        # ... for all ES nodes
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115 # Address of your blackbox_exporter

  - job_name: 'blackbox_cpp_app_tcp'
    metrics_path: /probe
    params:
      module: [tcp_cpp_app]
    static_configs:
      - targets:
        - 'your-cpp-app-ip:8080' # Replace with your C++ app IP/hostname
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115 # Address of your blackbox_exporter

These TCP probes will provide metrics like `probe_success` and `probe_duration_seconds`. Alert on `probe_success == 0` or a sustained rise in `probe_duration_seconds` to detect network issues. A TCP connect probe does not measure packet loss directly; an `icmp` module is the better fit for that.
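For intuition about what a TCP probe measures, the sketch below times a single connect attempt. It assumes a POSIX system and an IPv4 address, and omits the timeout a real prober enforces. The `TcpProbeResult` struct and `tcp_probe` function are hypothetical and only approximate what the `blackbox_exporter` does.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <chrono>
#include <cstdint>
#include <string>

struct TcpProbeResult {
    bool success;             // analogous to probe_success
    double duration_seconds;  // analogous to probe_duration_seconds
};

// Attempts one blocking TCP connect to an IPv4 address and times it.
// No timeout is applied here; a production probe would set one.
TcpProbeResult tcp_probe(const std::string& ipv4, std::uint16_t port) {
    const auto start = std::chrono::steady_clock::now();
    bool ok = false;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ipv4.c_str(), &addr.sin_addr) == 1) {
        const int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd >= 0) {
            ok = connect(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr)) == 0;
            close(fd);
        }
    }
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return {ok, elapsed.count()};
}
```

Because the measured time covers the whole connect handshake, a rising duration with `success` still true usually points at latency, while `success` turning false points at reachability.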

Monitoring OVH Infrastructure Metrics

OVH provides its own set of metrics through its API or control panel. While not directly scraped by Prometheus in this setup, these are crucial for understanding the underlying infrastructure. Key metrics include:

Instance CPU/Memory/Disk Usage: Basic resource utilization of your virtual machines or bare-metal servers.
Network Traffic: Ingress/Egress traffic volume, bandwidth utilization.
Hardware Health: For bare-metal servers, check disk SMART status, temperature, etc.
Load Balancer Metrics: If using OVH load balancers, monitor request rates, backend health, and latency.

Integrate these OVH-specific metrics into your overall dashboarding (e.g., Grafana) to correlate application/cluster behavior with infrastructure performance. Set up alerts within OVH's system for critical infrastructure events (e.g., server reboot, disk failure).

Centralized Logging and Trace Analysis

Even with robust monitoring, issues can arise. Centralized logging and distributed tracing are essential for debugging and understanding the root cause of problems, especially in distributed systems like Elasticsearch and complex C++ applications.

Log Aggregation with ELK Stack or Loki

Collect logs from your C++ application and Elasticsearch nodes into a central system. The ELK stack (Elasticsearch, Logstash, Kibana) is a natural fit if you're already using Elasticsearch. Alternatively, Grafana Loki offers a more lightweight, Prometheus-inspired approach to log aggregation.

Ensure your C++ application logs in a structured format (e.g., JSON) to facilitate parsing and searching in your log aggregation system. Elasticsearch nodes also produce detailed logs that should be collected.
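A minimal sketch of structured logging, assuming one JSON object per line: the field names (`ts`, `level`, `component`, `message`) and both helper functions are hypothetical, and the caller supplies the timestamp so the formatter stays deterministic.

```cpp
#include <cstdio>
#include <string>

// Escapes the characters that must not appear raw inside a JSON string.
std::string json_escape(const std::string& text) {
    std::string out;
    for (unsigned char ch : text) {
        switch (ch) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n"; break;
            case '\r': out += "\\r"; break;
            case '\t': out += "\\t"; break;
            default:
                if (ch < 0x20) {
                    // Remaining control characters use the \u00XX form.
                    char buf[7];
                    std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(ch));
                    out += buf;
                } else {
                    out += static_cast<char>(ch);
                }
        }
    }
    return out;
}

// Formats one log event as a single JSON line (without a trailing newline).
std::string format_log_line(const std::string& timestamp, const std::string& level,
                            const std::string& component, const std::string& message) {
    return "{\"ts\":\"" + json_escape(timestamp) + "\",\"level\":\"" + json_escape(level) +
           "\",\"component\":\"" + json_escape(component) + "\",\"message\":\"" +
           json_escape(message) + "\"}";
}
```

Escaping newlines matters in practice: it keeps each event on one line, which is what line-oriented shippers generally expect.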

Distributed Tracing with Jaeger or Zipkin

For deep dives into request flows across your C++ application and Elasticsearch, implement distributed tracing. Libraries like OpenTelemetry can be integrated into your C++ application to send trace data to backends like Jaeger or Zipkin. This allows you to visualize the entire path of a request, identify bottlenecks, and pinpoint errors in specific services or network hops.
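Tracing across services depends on passing trace identifiers along with each request. OpenTelemetry's default propagation format is the W3C Trace Context `traceparent` header, and the sketch below shows only how that header value is laid out. The `TraceContext` struct and `to_traceparent` function are hypothetical; in a real application the tracing SDK generates and injects this header for you.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Identifiers as defined by W3C Trace Context: a 128-bit trace ID
// (split into two 64-bit halves here) and a 64-bit parent span ID.
struct TraceContext {
    std::uint64_t trace_id_high;
    std::uint64_t trace_id_low;
    std::uint64_t span_id;
    bool sampled;
};

// Renders "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
std::string to_traceparent(const TraceContext& ctx) {
    char buf[56];  // 55 characters plus the terminating null
    std::snprintf(buf, sizeof(buf), "00-%016llx%016llx-%016llx-%02x",
                  static_cast<unsigned long long>(ctx.trace_id_high),
                  static_cast<unsigned long long>(ctx.trace_id_low),
                  static_cast<unsigned long long>(ctx.span_id),
                  ctx.sampled ? 1u : 0u);
    return buf;
}
```

Attaching this value as a `traceparent` header on outgoing HTTP calls is what lets a tracing backend stitch the spans from separate services into one request timeline.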

By combining proactive health checks, comprehensive metric monitoring, network diagnostics, and centralized logging/tracing, you can build a resilient and observable system for your C++ applications and Elasticsearch clusters on OVH.