Server Monitoring Best Practices: Keeping Your C++ App and Elasticsearch Clusters Alive on Google Cloud

Proactive C++ Application Health Checks on Google Cloud

Maintaining the health of a C++ application deployed on Google Cloud Platform (GCP) requires a multi-layered monitoring strategy. Beyond basic CPU and memory utilization, we need to inspect application-specific metrics and ensure graceful degradation or immediate alerting under duress. This involves instrumenting your C++ code to expose internal states and leveraging GCP’s monitoring tools.

A common pattern is to expose an HTTP endpoint within your C++ application that reports its status. This endpoint can be polled by external monitoring services. For a robust solution, consider using a lightweight HTTP server library like cpp-httplib or integrating with an existing web framework if your application already has one.

Implementing a Health Check Endpoint in C++

Here’s a simplified example using cpp-httplib to expose a /healthz endpoint. This endpoint will check internal application state, such as the status of critical background threads or the availability of essential data structures.

First, ensure you have cpp-httplib integrated into your build system (e.g., CMake). The core logic involves creating a server instance and defining a handler for the health check route.

Example C++ Health Check Implementation

#include <iostream>
#include <string>
#include <atomic>
#include "httplib.h" // Assuming cpp-httplib is included

// Global atomic flag to simulate application state
std::atomic<bool> g_is_processing_ok(true);

// Function to simulate some background work that might fail
void simulate_work() {
    // In a real app, this would involve database connections, external API calls, etc.
    // For demonstration, we'll toggle the flag.
    // g_is_processing_ok = !g_is_processing_ok; // Uncomment to simulate state changes
}

int main() {
    httplib::Server svr;

    // Health check endpoint
    svr.Get("/healthz", [&](const httplib::Request& req, httplib::Response& res) {
        // In a real application, perform more sophisticated checks:
        // - Check database connection status
        // - Verify critical service availability
        // - Inspect internal queue sizes or worker thread health
        // - Check for recent errors or exceptions

        if (g_is_processing_ok.load()) {
            res.set_content("OK", "text/plain");
            res.status = 200;
        } else {
            res.set_content("Degraded", "text/plain");
            res.status = 503; // Service Unavailable
        }
    });

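    // A simple endpoint to simulate state changes for testing; do not expose it in production.
    svr.Get("/toggle_health", [](const httplib::Request&, httplib::Response& res) {
        // Flip the flag atomically; on failure `expected` is refreshed and the loop retries.
        bool expected = g_is_processing_ok.load();
        while (!g_is_processing_ok.compare_exchange_weak(expected, !expected)) {}
        const bool now_ok = !expected; // `expected` still holds the previous value
        res.set_content(now_ok ? "Health set to OK" : "Health set to Degraded", "text/plain");
        res.status = 200;
    });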

    // Start the server on a specific port (e.g., 8080)
    // In a production environment, this might be a dedicated monitoring port.
    std::cout << "Starting health check server on port 8080..." << std::endl;
    if (!svr.listen("0.0.0.0", 8080)) {
        std::cerr << "Failed to start server." << std::endl;
        return 1;
    }

    return 0;
}

Integrating with Google Cloud Monitoring (Cloud Monitoring)

Once your C++ application exposes a health check endpoint, Cloud Monitoring needs a way to observe it. If the endpoint is reachable from Google's probers, an uptime check is the simplest option. For endpoints bound to localhost or a private network, a small agent can poll the endpoint and publish the result as a custom metric, which is the approach shown here.

We'll create a small agent (e.g., a Python script) that runs on the same Compute Engine instance or Kubernetes Pod as your C++ application. This agent polls the /healthz endpoint and writes a custom metric to Cloud Monitoring.

Python Health Check Polling Agent

import requests
import time
import google.auth
from google.cloud import monitoring_v3
from google.protobuf.timestamp_pb2 import Timestamp

# --- Configuration ---
APP_HEALTH_URL = "http://localhost:8080/healthz" # Adjust if your app is elsewhere
PROJECT_ID = "your-gcp-project-id" # Replace with your GCP Project ID
METRIC_TYPE = "custom.googleapis.com/my_cpp_app/health_status"
INTERVAL_SECONDS = 60 # How often to poll and report
# ---------------------

def get_health_status():
    try:
        response = requests.get(APP_HEALTH_URL, timeout=5)
        if response.status_code == 200 and "OK" in response.text:
            return 1.0 # Healthy
        else:
            print(f"Health check failed: Status {response.status_code}, Body: {response.text}")
            return 0.0 # Unhealthy/Degraded
    except requests.exceptions.RequestException as e:
        print(f"Error connecting to health check endpoint: {e}")
        return 0.0 # Unhealthy

METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/"

def get_gce_metadata(path):
    """Reads one value from the Compute Engine metadata server."""
    response = requests.get(
        METADATA_URL + path, headers={"Metadata-Flavor": "Google"}, timeout=2
    )
    response.raise_for_status()
    return response.text

def write_metric(status_value):
    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{PROJECT_ID}"

    # The metric descriptor is created automatically on the first write,
    # so the agent only needs to describe the time series itself.
    series = monitoring_v3.TimeSeries()
    series.metric.type = METRIC_TYPE

    # Attach a monitored resource so the series can be filtered per instance.
    # Adapt this to your deployment; on GKE, use k8s_pod or k8s_container instead.
    try:
        instance_id = get_gce_metadata("id")
        # The zone is returned as "projects/<number>/zones/<zone>".
        zone = get_gce_metadata("zone").split("/")[-1]
        series.resource.type = "gce_instance"
        # gce_instance accepts only the project_id, instance_id, and zone labels.
        series.resource.labels["project_id"] = PROJECT_ID
        series.resource.labels["instance_id"] = instance_id
        series.resource.labels["zone"] = zone
        print(f"Reporting for GCE instance {instance_id} in zone {zone}")
    except requests.exceptions.RequestException as e:
        print(f"Could not read GCE instance metadata: {e}. Reporting against the global resource.")
        series.resource.type = "global"
        series.resource.labels["project_id"] = PROJECT_ID

    # A GAUGE point carries only an end time.
    now = time.time()
    seconds = int(now)
    nanos = int((now - seconds) * 10**9)
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": seconds, "nanos": nanos}}
    )
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"double_value": status_value}}
    )
    series.points = [point]

    try:
        client.create_time_series(name=project_name, time_series=[series])
        print(f"Reported metric: {METRIC_TYPE} = {status_value}")
    except Exception as e:
        print(f"Failed to write metric to Cloud Monitoring: {e}")

if __name__ == "__main__":
    # Ensure you have authenticated correctly (e.g., via service account or gcloud auth)
    # and that the service account has the 'monitoring.metricWriter' role.
    print(f"Starting health check agent. Polling {APP_HEALTH_URL} every {INTERVAL_SECONDS} seconds.")
    while True:
        health_status = get_health_status()
        write_metric(health_status)
        time.sleep(INTERVAL_SECONDS)

To deploy this agent:

1. Ensure the service account running this agent has the roles/monitoring.metricWriter IAM role in your GCP project.
2. Install the necessary Python libraries: pip install google-cloud-monitoring requests.
3. Replace "your-gcp-project-id" with your actual project ID.
4. Adjust APP_HEALTH_URL if your C++ application is not running on the same host or port.
5. Run this script as a background process or a systemd service on your Compute Engine instance, or as a sidecar container in your Kubernetes Pod.

Creating Cloud Monitoring Dashboards and Alerting

With the custom metric flowing into Cloud Monitoring, you can now visualize it and set up alerts.

Dashboard Configuration

Navigate to the Cloud Monitoring console in GCP and create a new dashboard. Add a chart, then use the metric selector (the same one found in Metrics Explorer) to search for your custom metric (e.g., custom.googleapis.com/my_cpp_app/health_status). Configure it as a "Stacked bar chart" or "Line chart" to show the 0.0 (unhealthy) and 1.0 (healthy) values over time. Filter by the appropriate resource (e.g., your GCE instance or GKE Pod) to see the status of individual application instances.

Alerting Policy

To create an alert:

1. In the Cloud Monitoring console, go to "Alerting" and click "Create Policy."
2. Click "Add Condition."
3. Under "Select a metric," find and select your custom metric (custom.googleapis.com/my_cpp_app/health_status).
4. Under "Filter," add filters to target specific instances or services if needed (e.g., by instance ID, zone, or Kubernetes pod labels).
5. Under "Configure trigger," set the condition to "is below" and the threshold to 0.5 (any value strictly between 0.0 and 1.0 works). The condition is then met whenever the metric drops from 1.0 to 0.0.
6. Set the "For" duration so the alert fires only if the condition persists for a certain period (e.g., 1 minute), which avoids flapping alerts. Because the agent reports once every 60 seconds, shorter durations add nothing.
7. Configure notification channels (e.g., email, PagerDuty, Slack) to receive alerts when the condition is met.
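
The same policy can also be kept as a file under version control. The following is a minimal sketch, assuming the JSON shape of the Cloud Monitoring REST API's AlertPolicy resource; the function name, the display names, and the gce_instance resource filter are illustrative assumptions, and notification channels are left out.

```python
import json


def build_health_alert_policy(metric_type, threshold=0.5, duration_s=60):
    """Builds an alert policy dict in the REST API's JSON shape (a sketch)."""
    return {
        "displayName": "C++ app health check failing",  # hypothetical name
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "health_status below threshold",
                "conditionThreshold": {
                    # Assumes the agent reports against the gce_instance resource.
                    "filter": (
                        f'metric.type="{metric_type}" '
                        'AND resource.type="gce_instance"'
                    ),
                    "comparison": "COMPARISON_LT",
                    "thresholdValue": threshold,
                    # The condition must hold this long before the alert fires.
                    "duration": f"{duration_s}s",
                    "aggregations": [
                        {
                            "alignmentPeriod": "60s",
                            "perSeriesAligner": "ALIGN_MEAN",
                        }
                    ],
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = build_health_alert_policy(
        "custom.googleapis.com/my_cpp_app/health_status"
    )
    print(json.dumps(policy, indent=2))
```

Saving the printed JSON to a file lets you review threshold changes like any other code change before submitting the policy through the API or the gcloud CLI.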

Elasticsearch Cluster Health and Performance Monitoring on Google Cloud

Monitoring Elasticsearch clusters, especially those deployed on GCP (e.g., on Compute Engine or GKE), is critical for maintaining search performance, data integrity, and availability. This involves tracking cluster-wide health, node-specific metrics, JVM performance, and Elasticsearch-specific operational metrics.

Leveraging Elasticsearch's Built-in Monitoring APIs

Elasticsearch exposes a wealth of information through its REST APIs. The most important endpoints include:

- Cluster Health API (_cluster/health): Provides an overview of the cluster's status (green, yellow, red), number of nodes, shards, and pending tasks.
- Node Stats API (_nodes/stats): Detailed statistics for each node, including CPU usage, JVM heap usage, disk I/O, network traffic, and file system usage.
- Index Stats API (_stats): Statistics for indices, such as document counts, store size, cumulative indexing and search totals (from which rates can be derived), and query cache performance.
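
The stats APIs report cumulative counters rather than rates, so an indexing or search rate has to be derived from two samples. Here is a minimal sketch of that calculation, assuming the _all.total.indexing.index_total and _all.total.search.query_total fields of the _stats response; the function names are illustrative.

```python
def extract_totals(stats):
    """Pulls the cumulative indexing and query counters out of a _stats response."""
    total = stats["_all"]["total"]
    return {
        "index_total": total["indexing"]["index_total"],
        "query_total": total["search"]["query_total"],
    }


def per_second_rate(prev_total, curr_total, elapsed_seconds):
    """Converts two counter samples into a per-second rate.

    Returns None when the counter went backwards, which usually means a
    node restarted between the two samples.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = curr_total - prev_total
    if delta < 0:
        return None
    return delta / elapsed_seconds


# Two hypothetical samples taken 60 seconds apart.
previous = {"_all": {"total": {"indexing": {"index_total": 1000},
                               "search": {"query_total": 400}}}}
current = {"_all": {"total": {"indexing": {"index_total": 1600},
                              "search": {"query_total": 520}}}}

prev_totals = extract_totals(previous)
curr_totals = extract_totals(current)
indexing_rate = per_second_rate(
    prev_totals["index_total"], curr_totals["index_total"], 60
)
search_rate = per_second_rate(
    prev_totals["query_total"], curr_totals["query_total"], 60
)
print(f"indexing: {indexing_rate} docs/s, search: {search_rate} queries/s")
```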

Example: Fetching Cluster Health with `curl`

curl -X GET "http://localhost:9200/_cluster/health?pretty"

The output will look something like this:

{
  "cluster_name" : "my-es-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 5,
  "active_shards" : 15,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
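
A response like this is easy to reduce to a numeric score plus a list of problems, which is the form a monitoring agent needs. The sketch below assumes the 2/1/0 mapping for green/yellow/red; the function name and the wording of the problem messages are illustrative.

```python
STATUS_SCORE = {"green": 2.0, "yellow": 1.0, "red": 0.0}


def summarize_cluster_health(health):
    """Reduces a _cluster/health response to a score and a list of problems."""
    status = health.get("status")
    problems = []
    if status != "green":
        problems.append(f"status is {status}")
    if health.get("timed_out"):
        problems.append("health request timed out")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} unassigned shards")
    # Unknown status strings map to -1.0 so they are easy to spot.
    return {"score": STATUS_SCORE.get(status, -1.0), "problems": problems}


healthy = {"status": "green", "timed_out": False, "unassigned_shards": 0}
degraded = {"status": "yellow", "timed_out": False, "unassigned_shards": 5}

print(summarize_cluster_health(healthy))
print(summarize_cluster_health(degraded))
```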

Integrating with Cloud Monitoring and Alerting

Similar to the C++ application, we can use a Python agent to poll Elasticsearch APIs and send custom metrics to Cloud Monitoring. This agent should run on a node within the cluster or have network access to it.

Python Elasticsearch Monitoring Agent

import requests
import time
import google.auth
from google.cloud import monitoring_v3
from google.protobuf.timestamp_pb2 import Timestamp
import json

# --- Configuration ---
ES_HOST = "http://localhost:9200" # Elasticsearch host
PROJECT_ID = "your-gcp-project-id" # Replace with your GCP Project ID
INTERVAL_SECONDS = 60 # How often to poll and report

# Metrics to collect
METRICS_TO_COLLECT = {
    "cluster_health_status": {
        "api": "/_cluster/health",
        "value_path": "status", # 'green': 2, 'yellow': 1, 'red': 0
        "type": "gauge",
        "value_type": "double",
        "description": "Elasticsearch cluster health status (2=green, 1=yellow, 0=red)."
    },
    "cluster_node_count": {
        "api": "/_cluster/health",
        "value_path": "number_of_nodes",
        "type": "gauge",
        "value_type": "double",
        "description": "Number of nodes in the Elasticsearch cluster."
    },
    "cluster_unassigned_shards": {
        "api": "/_cluster/health",
        "value_path": "unassigned_shards",
        "type": "gauge",
        "value_type": "double",
        "description": "Number of unassigned shards in the Elasticsearch cluster."
    },
    "node_jvm_heap_used_percent": {
        "api": "/_nodes/stats/jvm",
        "value_path": "nodes.*.jvm.mem.heap_used_percent", # Wildcard will iterate
        "type": "gauge",
        "value_type": "double",
        "description": "JVM heap usage percentage for Elasticsearch nodes."
    },
    "node_fs_data_free_percent": {
        "api": "/_nodes/stats/fs",
        "value_path": "nodes.*.fs.data.*.path", # Informational only: the percentage is computed from available_in_bytes / total_in_bytes
        "type": "gauge",
        "value_type": "double",
        "description": "Free disk space percentage for Elasticsearch data paths."
    }
}
# ---------------------

def get_nested_value(data, path_keys):
    """Safely retrieves a nested value from a dictionary using a list of keys."""
    current_data = data
    for key in path_keys:
        if isinstance(current_data, dict):
            if key == "*": # Handle wildcard for iterating over nodes/paths
                results = []
                for k, v in current_data.items():
                    nested_results = get_nested_value(v, path_keys[path_keys.index(key)+1:])
                    if nested_results is not None:
                        if isinstance(nested_results, list):
                            results.extend(nested_results)
                        else:
                            results.append(nested_results)
                return results if results else None
            elif key in current_data:
                current_data = current_data[key]
            else:
                return None
        elif isinstance(current_data, list):
            # If we encounter a list and the key is not an index, try to process it
            # This is a simplified handling for cases like 'nodes.*.fs.data.*.path'
            # where '*' might represent multiple data paths on a single node.
            if key == "*":
                results = []
                for item in current_data:
                    nested_results = get_nested_value(item, path_keys[path_keys.index(key)+1:])
                    if nested_results is not None:
                        if isinstance(nested_results, list):
                            results.extend(nested_results)
                        else:
                            results.append(nested_results)
                return results if results else None
            else:
                return None # Cannot index into list with non-index key
        else:
            return None
    return current_data

def map_health_status(status_str):
    if status_str == "green":
        return 2.0
    elif status_str == "yellow":
        return 1.0
    elif status_str == "red":
        return 0.0
    return -1.0 # Unknown

def extract_fs_free_percent(node_stats_data):
    """Returns (node_name, data_path, free_percent) tuples computed from node fs stats."""
    results = []
    if not node_stats_data or 'nodes' not in node_stats_data:
        return None

    for node_id, node_data in node_stats_data['nodes'].items():
        node_name = node_data.get('name', node_id)
        # fs.data is a list with one entry per data path; sizes are reported in bytes.
        for path_data in node_data.get('fs', {}).get('data', []):
            total = path_data.get('total_in_bytes', 0)
            if total > 0 and 'available_in_bytes' in path_data:
                percent = 100.0 * path_data['available_in_bytes'] / total
                results.append((node_name, path_data.get('path', 'unknown'), percent))
    return results if results else None

_client = None

def get_client():
    """Creates the Cloud Monitoring client on first use and reuses it afterwards."""
    global _client
    if _client is None:
        _client = monitoring_v3.MetricServiceClient()
    return _client

def write_metric(metric_name, value, metric_type_info, resource, metric_labels=None):
    """Writes one point. `resource` is a dict with 'type' and 'labels' keys."""
    project_name = f"projects/{PROJECT_ID}"

    # The metric descriptor is created automatically on the first write.
    series = monitoring_v3.TimeSeries()
    series.metric.type = f"custom.googleapis.com/elasticsearch/{metric_name}"
    # Metric labels keep per-node and per-path values in separate time series.
    for key, label_value in (metric_labels or {}).items():
        series.metric.labels[key] = label_value
    series.resource.type = resource["type"]
    for key, label_value in resource["labels"].items():
        series.resource.labels[key] = label_value

    now = time.time()
    seconds = int(now)
    nanos = int((now - seconds) * 10**9)
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": seconds, "nanos": nanos}}
    )
    if metric_type_info["value_type"] == "int64":
        typed_value = {"int64_value": int(value)}
    else:
        typed_value = {"double_value": float(value)}
    series.points = [monitoring_v3.Point({"interval": interval, "value": typed_value})]

    try:
        get_client().create_time_series(name=project_name, time_series=[series])
        print(f"Reported metric: {series.metric.type} = {value}")
    except Exception as e:
        print(f"Failed to write metric {series.metric.type} to Cloud Monitoring: {e}")

def get_metadata(path):
    """Reads one value from the GCP metadata server (reachable from GCE and GKE)."""
    response = requests.get(
        f"http://metadata.google.internal/computeMetadata/v1/{path}",
        headers={"Metadata-Flavor": "Google"},
        timeout=2,
    )
    response.raise_for_status()
    return response.text

def get_gcp_resource():
    """Identifies the monitored resource (GKE pod or GCE instance) as a dict."""
    import os

    # GKE: Kubernetes sets HOSTNAME to the pod name. POD_NAMESPACE is not set
    # automatically; inject it through the Downward API (metadata.namespace).
    pod_name = os.environ.get('HOSTNAME')
    namespace = os.environ.get('POD_NAMESPACE')
    if pod_name and namespace and os.environ.get('KUBERNETES_SERVICE_HOST'):
        try:
            # Assumption: the node metadata exposes the cluster-name and
            # cluster-location attributes; verify this in your cluster.
            resource = {
                "type": "k8s_pod",
                "labels": {
                    "project_id": PROJECT_ID,
                    "location": get_metadata("instance/attributes/cluster-location"),
                    "cluster_name": get_metadata("instance/attributes/cluster-name"),
                    "namespace_name": namespace,
                    "pod_name": pod_name,
                },
            }
            print(f"Reporting for GKE pod: {pod_name} in namespace {namespace}")
            return resource
        except requests.exceptions.RequestException:
            pass # Cluster metadata unavailable; try the next option

    try:
        instance_id = get_metadata("instance/id")
        # The zone is returned as "projects/<number>/zones/<zone>".
        zone = get_metadata("instance/zone").split('/')[-1]
        print(f"Reporting for GCE instance {instance_id} in zone {zone}")
        # gce_instance accepts only the project_id, instance_id, and zone labels.
        return {
            "type": "gce_instance",
            "labels": {"project_id": PROJECT_ID, "instance_id": instance_id, "zone": zone},
        }
    except requests.exceptions.RequestException:
        pass # Not on GCE or metadata unavailable

    print("Could not determine GCP resource. Reporting with generic resource.")
    return {"type": "global", "labels": {"project_id": PROJECT_ID}}

if __name__ == "__main__":
    # Ensure you have authenticated correctly (e.g., via service account or gcloud auth)
    # and that the service account has the 'monitoring.metricWriter' role.
    print(f"Starting Elasticsearch monitoring agent. Polling {ES_HOST} every {INTERVAL_SECONDS} seconds.")
    gcp_resource = get_gcp_resource()

    while True:
        for metric_name, config in METRICS_TO_COLLECT.items():
            try:
                response = requests.get(f"{ES_HOST}{config['api']}", timeout=10)
                response.raise_for_status() # Raise HTTPError for bad responses (4xx or 5xx)
                data = response.json()

                if metric_name == "node_jvm_heap_used_percent":
                    # One time series per node, distinguished by a "node" metric label.
                    for node_id, node_data in data.get("nodes", {}).items():
                        heap_percent = get_nested_value(node_data, ["jvm", "mem", "heap_used_percent"])
                        if heap_percent is not None:
                            labels = {"node": node_data.get("name", node_id)}
                            write_metric(metric_name, heap_percent, config, gcp_resource, labels)
                elif metric_name == "node_fs_data_free_percent":
                    # One time series per node and data path.
                    for node_name, path, percent in extract_fs_free_percent(data) or []:
                        labels = {"node": node_name, "path": path}
                        write_metric(metric_name, percent, config, gcp_resource, labels)
                else:
                    # Handle cluster-wide metrics
                    value = get_nested_value(data, config['value_path'].split('.'))
                    if value is None:
                        print(f"Could not find value for {metric_name} at path {config['value_path']}")
                    elif metric_name == "cluster_health_status":
                        mapped_value = map_health_status(value)
                        if mapped_value != -1.0:
                            write_metric(metric_name, mapped_value, config, gcp_resource)
                    else:
                        write_metric(metric_name, value, config, gcp_resource)

            except requests.exceptions.RequestException as e:
                print(f"Error fetching {config['api']}: {e}")
            except json.JSONDecodeError:
                print(f"Error decoding JSON from {config['api']}")
            except Exception as e:
                print(f"An unexpected error occurred for {metric_name}: {e}")

        time.sleep(INTERVAL_SECONDS)

Key considerations for this agent:

- Authentication: Ensure the agent can authenticate with Elasticsearch. If Elasticsearch is secured (e.g., with X-Pack Security), you'll need to pass credentials (e.g., via basic auth headers or API keys).
- Resource Labels: The get_gcp_resource function attempts to identify the GCP resource (GCE instance or GKE pod) to attach appropriate labels to the metrics. This is crucial for filtering and aggregation in Cloud Monitoring, and you might need to adjust it for your specific deployment. Per-node metrics also need a distinguishing label such as the node name; otherwise points from different nodes land in the same time series.
- Metric Definitions: The METRICS_TO_COLLECT dictionary defines which APIs to call, how to extract values (using dot notation for nested JSON), and the Cloud Monitoring metric details.
- Error Handling: Basic error handling covers network issues, JSON parsing, and Cloud Monitoring API calls. Failures are printed and the loop continues with the next metric.
- Deployment: Deploy this agent similarly to the C++ health check agent, as a background process, systemd service, or sidecar container.
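
For the authentication point, Elasticsearch API keys are sent in an Authorization header of the form "ApiKey <base64(id:api_key)>". The helper below is a minimal sketch of building that header; the function name and the sample credentials are hypothetical, and in practice the values would come from a secret store rather than source code.

```python
import base64


def api_key_headers(key_id, api_key):
    """Builds the Authorization header for an Elasticsearch API key."""
    raw = f"{key_id}:{api_key}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"ApiKey {token}"}


# Hypothetical credentials, for illustration only.
headers = api_key_headers("my-key-id", "my-key-secret")
print(headers)

# With the requests library the header is passed on every call, e.g.:
#   requests.get(f"{ES_HOST}/_cluster/health", headers=headers, timeout=10)
```

When the cluster uses TLS with a private certificate authority, the requests call also needs its verify argument pointed at the CA bundle.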

Elasticsearch Cluster-Level Alerting in Cloud Monitoring

Using the custom metrics collected by the agent, you can set up powerful alerts:

Example Alerting Policies:

Cluster Status Red/Yellow:
- Metric: custom.googleapis.com/elasticsearch/cluster_health_status
- Condition: "is below" 1.5 (triggers if status is yellow or red, because yellow is reported as 1 and red as 0).
- Duration: 1 minute.
High Unassigned Shards:
- Metric: custom.googleapis.com/elasticsearch/cluster_unassigned_shards
- Condition: "is above" 0.
- Duration: 5 minutes (to allow for temporary rebalancing).
High JVM Heap Usage:
- Metric: custom.googleapis.com/elasticsearch/node_jvm_heap_used_percent
- Condition: "is above" 85.
- Duration: 5 minutes.
- Aggregation: Use "mean" or "max" across relevant nodes.
Low Disk Space:
- Metric: custom.googleapis.com/elasticsearch/node_fs_data_free_percent
- Condition: "is below" 20.
- Duration: 10 minutes.
- Aggregation: Use "mean" or "min" across relevant nodes.

Remember to configure appropriate notification channels for these alerts. For critical metrics like cluster status or unassigned shards, consider using PagerDuty or Opsgenie for immediate on-call notification.

Leveraging Elasticsearch's X-Pack Monitoring (Commercial/Basic)

If you are using Elasticsearch with X-Pack (even the basic free tier), it provides a more integrated monitoring solution. X-Pack Monitoring collects metrics and sends them to a dedicated .monitoring-es-* index within your cluster. You can then visualize these metrics using Kibana's Stack Monitoring UI.

To enable X-Pack Monitoring:

# elasticsearch.yml
# Note: xpack.monitoring.enabled was deprecated in 7.8 and removed in 8.0; set it only on older releases.
xpack.monitoring.collection.enabled: true
xpack.monitoring.collection.interval: 10s # Adjust as needed

After enabling and restarting your Elasticsearch nodes, you can access the Stack Monitoring UI in Kibana. This UI provides dashboards for cluster, node, index, and JVM metrics. While convenient, it's still recommended to export key metrics to Cloud Monitoring for centralized alerting and correlation with other GCP services.
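
The collection switch is also a dynamic cluster setting, so it can be changed through the cluster settings API without editing elasticsearch.yml. The following is a minimal sketch that only builds the request; the function name is illustrative, and sending it (for example with requests.put) is left to the caller.

```python
import json


def monitoring_collection_request(es_host, enabled=True):
    """Returns the URL and JSON body that toggle monitoring collection."""
    url = f"{es_host.rstrip('/')}/_cluster/settings"
    # "persistent" settings survive a full cluster restart.
    body = {"persistent": {"xpack.monitoring.collection.enabled": enabled}}
    return url, body


url, body = monitoring_collection_request("http://localhost:9200")
print("PUT", url)
print(json.dumps(body, indent=2))

# To apply the change:
#   requests.put(url, json=body, timeout=10)
```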

To get these metrics into Cloud Monitoring, you can use the Elasticsearch Exporter for Prometheus. It queries the Elasticsearch stats APIs (not the X-Pack monitoring indices) and exposes the results in Prometheus format, which Google Cloud's operations suite can then ingest.

Ingesting Prometheus Metrics into Cloud Monitoring

1. Deploy an exporter: Run the Elasticsearch Exporter alongside your cluster so that it exposes Elasticsearch statistics in Prometheus text format over HTTP. A separate Prometheus server is optional, because the Ops Agent can scrape the exporter directly.

2. Configure the agent: On Compute Engine, install the Ops Agent and add a Prometheus receiver to its configuration file (/etc/google-ops-agent/config.yaml). On GKE, use Google Cloud Managed Service for Prometheus instead of the Ops Agent.

# Example snippet for /etc/google-ops-agent/config.yaml
metrics:
  receivers:
    prometheus_es:
      type: prometheus
      config:
        scrape_configs:
          - job_name: "elasticsearch"
            scrape_interval: 60s
            metrics_path: /metrics
            static_configs:
              # Exporter address; 9114 is assumed to be the exporter's port, adjust if yours differs
              - targets: ["localhost:9114"]
            # Optional: keep only metrics starting with 'elasticsearch_'
            metric_relabel_configs:
              - source_labels: [__name__]
                regex: "elasticsearch_.*"
                action: keep
  service:
    pipelines:
      prometheus_es_pipeline:
        receivers:
          - prometheus_es
  # ... other metrics configurations ...

3. Restart Agent: Restart the Ops Agent (e.g., sudo systemctl restart google-cloud-ops-agent).

4. View in Cloud Monitoring: The scraped metrics appear in Cloud Monitoring with the prometheus.googleapis.com/ metric prefix, allowing you to build dashboards and alerts using the familiar Cloud Monitoring interface.
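
Before wiring an exporter into the agent, it helps to confirm that its endpoint returns the metrics you expect. The sketch below parses Prometheus text exposition output and keeps only samples with a given name prefix. It is a simplified parser (it ignores optional timestamps), and the sample metric names are illustrative rather than a guaranteed list of what your exporter emits.

```python
def parse_prometheus_text(text, prefix="elasticsearch_"):
    """Returns {"name{labels}": value} for samples whose name starts with prefix."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; label values may
        # themselves contain spaces, so split from the right.
        name_part, _, value_part = line.rpartition(" ")
        if not name_part.startswith(prefix):
            continue
        try:
            samples[name_part] = float(value_part)
        except ValueError:
            continue  # Not a numeric sample; ignore it.
    return samples


# Illustrative exporter output.
SAMPLE = """\
# HELP elasticsearch_cluster_health_number_of_nodes Number of nodes.
# TYPE elasticsearch_cluster_health_number_of_nodes gauge
elasticsearch_cluster_health_number_of_nodes{cluster="my-es-cluster"} 3
elasticsearch_cluster_health_status{cluster="my-es-cluster",color="green"} 1
go_goroutines 12
"""

for name, value in parse_prometheus_text(SAMPLE).items():
    print(name, value)
```

In practice the text would come from an HTTP GET against the exporter's /metrics endpoint instead of the SAMPLE string.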