Server Monitoring Best Practices: Keeping Your C++ App and Elasticsearch Clusters Alive on Google Cloud
Proactive C++ Application Health Checks on Google Cloud
Maintaining the health of a C++ application deployed on Google Cloud Platform (GCP) requires a multi-layered monitoring strategy. Beyond basic CPU and memory utilization, we need to inspect application-specific metrics and ensure graceful degradation or immediate alerting under duress. This involves instrumenting your C++ code to expose internal states and leveraging GCP’s monitoring tools.
A common pattern is to expose an HTTP endpoint within your C++ application that reports its status. This endpoint can be polled by external monitoring services. For a robust solution, consider using a lightweight HTTP server library like cpp-httplib or integrating with an existing web framework if your application already has one.
Implementing a Health Check Endpoint in C++
Here’s a simplified example using cpp-httplib to expose a /healthz endpoint. This endpoint will check internal application state, such as the status of critical background threads or the availability of essential data structures.
First, ensure you have cpp-httplib integrated into your build system (e.g., CMake). The core logic involves creating a server instance and defining a handler for the health check route.
Example C++ Health Check Implementation
#include <iostream>
#include <string>
#include <atomic>
#include "httplib.h" // Assuming cpp-httplib is included

// Global atomic flag to simulate application state
std::atomic<bool> g_is_processing_ok(true);

// Function to simulate some background work that might fail
void simulate_work() {
    // In a real app, this would involve database connections, external API calls, etc.
    // For demonstration, we'll toggle the flag.
    // g_is_processing_ok = !g_is_processing_ok; // Uncomment to simulate state changes
}

int main() {
    httplib::Server svr;

    // Health check endpoint
    svr.Get("/healthz", [&](const httplib::Request& req, httplib::Response& res) {
        // In a real application, perform more sophisticated checks:
        // - Check database connection status
        // - Verify critical service availability
        // - Inspect internal queue sizes or worker thread health
        // - Check for recent errors or exceptions
        if (g_is_processing_ok.load()) {
            res.set_content("OK", "text/plain");
            res.status = 200;
        } else {
            res.set_content("Degraded", "text/plain");
            res.status = 503; // Service Unavailable
        }
    });

    // A simple endpoint to simulate state change for testing
    svr.Get("/toggle_health", [&](const httplib::Request& req, httplib::Response& res) {
        g_is_processing_ok = !g_is_processing_ok.load();
        res.set_content(g_is_processing_ok.load() ? "Health set to OK" : "Health set to Degraded", "text/plain");
        res.status = 200;
    });

    // Start the server on a specific port (e.g., 8080)
    // In a production environment, this might be a dedicated monitoring port.
    std::cout << "Starting health check server on port 8080..." << std::endl;
    if (!svr.listen("0.0.0.0", 8080)) {
        std::cerr << "Failed to start server." << std::endl;
        return 1;
    }
    return 0;
}
Integrating with Google Cloud Monitoring (Cloud Monitoring)
Once your C++ application exposes a health check endpoint, you can report its status to Cloud Monitoring. One straightforward method is a custom metric, written by a small agent that periodically checks the endpoint.
We'll create a small agent (e.g., a Python script) that runs on the same Compute Engine instance or Kubernetes Pod as your C++ application. This agent will poll the /healthz endpoint and send a custom metric to Cloud Monitoring.
Python Health Check Polling Agent
import time

import requests
from google.cloud import monitoring_v3

# --- Configuration ---
APP_HEALTH_URL = "http://localhost:8080/healthz"  # Adjust if your app is elsewhere
PROJECT_ID = "your-gcp-project-id"  # Replace with your GCP Project ID
METRIC_TYPE = "custom.googleapis.com/my_cpp_app/health_status"
INTERVAL_SECONDS = 60  # How often to poll and report
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/"
# ---------------------


def get_health_status():
    """Returns 1.0 if /healthz answers 200 with "OK" in the body, else 0.0."""
    try:
        response = requests.get(APP_HEALTH_URL, timeout=5)
        if response.status_code == 200 and "OK" in response.text:
            return 1.0  # Healthy
        print(f"Health check failed: Status {response.status_code}, Body: {response.text}")
        return 0.0  # Unhealthy/Degraded
    except requests.exceptions.RequestException as e:
        print(f"Error connecting to health check endpoint: {e}")
        return 0.0  # Unhealthy


def get_metadata(path):
    """Reads one value from the Compute Engine metadata server."""
    response = requests.get(
        METADATA_URL + path, headers={"Metadata-Flavor": "Google"}, timeout=2
    )
    response.raise_for_status()
    return response.text


def get_monitored_resource():
    """Returns (resource_type, labels) for the time series.

    Adapt this to your deployment; on GKE, k8s_pod or k8s_container fits better.
    """
    try:
        instance_id = get_metadata("id")
        # The zone is returned as "projects/<number>/zones/<zone>".
        zone = get_metadata("zone").split("/")[-1]
        print(f"Reporting for GCE instance {instance_id} in zone {zone}")
        return "gce_instance", {
            "project_id": PROJECT_ID,
            "instance_id": instance_id,
            "zone": zone,
        }
    except requests.exceptions.RequestException as e:
        print(f"Could not read GCE metadata: {e}. Falling back to the global resource.")
        return "global", {"project_id": PROJECT_ID}


def write_metric(client, resource_type, resource_labels, status_value):
    """Writes one GAUGE point (1.0 for OK, 0.0 for Degraded/Unhealthy)."""
    # Cloud Monitoring creates the metric descriptor automatically on first write.
    series = monitoring_v3.TimeSeries()
    series.metric.type = METRIC_TYPE
    series.resource.type = resource_type
    for key, value in resource_labels.items():
        series.resource.labels[key] = value

    now = time.time()
    seconds = int(now)
    nanos = int((now - seconds) * 10**9)
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": seconds, "nanos": nanos}}
    )
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"double_value": status_value}}
    )
    series.points = [point]

    try:
        client.create_time_series(name=f"projects/{PROJECT_ID}", time_series=[series])
        print(f"Reported metric: {METRIC_TYPE} = {status_value}")
    except Exception as e:
        print(f"Failed to write metric to Cloud Monitoring: {e}")


if __name__ == "__main__":
    # Ensure you have authenticated correctly (e.g., via service account or gcloud auth)
    # and that the service account has the 'roles/monitoring.metricWriter' role.
    print(f"Starting health check agent. Polling {APP_HEALTH_URL} every {INTERVAL_SECONDS} seconds.")
    client = monitoring_v3.MetricServiceClient()
    resource_type, resource_labels = get_monitored_resource()
    while True:
        write_metric(client, resource_type, resource_labels, get_health_status())
        time.sleep(INTERVAL_SECONDS)
To deploy this agent:
- Ensure the service account running this agent has the roles/monitoring.metricWriter IAM role in your GCP project.
- Install the necessary Python libraries: pip install google-cloud-monitoring requests google-auth.
- Replace "your-gcp-project-id" with your actual project ID.
- Adjust APP_HEALTH_URL if your C++ application is not running on the same host or port.
- Run this script as a background process or a systemd service on your Compute Engine instance, or as a sidecar container in your Kubernetes Pod.
Creating Cloud Monitoring Dashboards and Alerting
With the custom metric flowing into Cloud Monitoring, you can now visualize it and set up alerts.
Dashboard Configuration
Navigate to the Cloud Monitoring console in GCP and create a new dashboard. Add a chart, open "Metrics Explorer," and search for your custom metric (e.g., custom.googleapis.com/my_cpp_app/health_status). Configure it as a "Stacked bar chart" or "Line chart" to show the 0.0 (unhealthy) and 1.0 (healthy) values over time. Filter by the appropriate resource (e.g., your GCE instance or GKE Pod) to see the status of individual application instances.
Alerting Policy
To create an alert:
- In the Cloud Monitoring console, go to "Alerting" and click "Create Policy."
- Click "Add Condition."
- Under "Select a metric," find and select your custom metric (custom.googleapis.com/my_cpp_app/health_status).
- Under "Filter," add filters to target specific instances or services if needed (e.g., by instance name, zone, or Kubernetes pod labels).
- Under "Configure trigger," set the condition to "is below" and the threshold to 0.5 (any value strictly between 0.0 and 1.0 works). This means if the metric drops from 1.0 to 0.0, the condition is met.
- Set the "For" duration to trigger the alert only if the condition persists for a certain period (e.g., 1 minute) to avoid flapping alerts.
- Configure notification channels (e.g., email, PagerDuty, Slack) to receive alerts when the condition is met.
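The interaction between the threshold and the "For" duration can be easier to reason about in code. The following is a minimal sketch of that idea only; it is a simplified model, not how Cloud Monitoring evaluates policies internally, and the function name and sample format are hypothetical.

```python
def condition_fires(samples, threshold=0.5, for_seconds=60):
    """Simplified model of an "is below" condition with a "For" duration.

    samples: list of (timestamp_seconds, value) pairs in ascending time order.
    Returns True when the value has stayed below the threshold continuously
    for at least for_seconds, measured up to the most recent sample.
    """
    if not samples:
        return False

    breach_start = None  # Timestamp at which the current breach began
    for timestamp, value in samples:
        if value < threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # A healthy sample resets the window

    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


# One healthy report followed by two unhealthy reports 60 seconds apart.
sustained = [(0, 1.0), (60, 0.0), (120, 0.0)]
# A single unhealthy report between healthy ones, then a fresh breach.
flapping = [(0, 1.0), (60, 0.0), (120, 1.0), (180, 0.0)]

print(condition_fires(sustained))
print(condition_fires(flapping))
```

With a 60-second polling interval and a 1-minute duration, a single failed poll does not fire the alert in this model; two consecutive failures do.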
Elasticsearch Cluster Health and Performance Monitoring on Google Cloud
Monitoring Elasticsearch clusters, especially those deployed on GCP (e.g., on Compute Engine or GKE), is critical for maintaining search performance, data integrity, and availability. This involves tracking cluster-wide health, node-specific metrics, JVM performance, and Elasticsearch-specific operational metrics.
Leveraging Elasticsearch's Built-in Monitoring APIs
Elasticsearch exposes a wealth of information through its cluster and stats REST APIs. The most important endpoints include:
- Cluster Health API (_cluster/health): Provides an overview of the cluster's status (green, yellow, red), number of nodes, shards, and pending tasks.
- Node Stats API (_nodes/stats): Detailed statistics for each node, including CPU usage, JVM heap usage, disk I/O, network traffic, and file system usage.
- Index Stats API (_stats): Statistics for indices, such as document counts, size, indexing rate, search rate, and query cache performance.
Example: Fetching Cluster Health with `curl`
curl -X GET "http://localhost:9200/_cluster/health?pretty"
The output will look something like this:
{
"cluster_name" : "my-es-cluster",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 5,
"active_shards" : 15,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue" : 0,
"active_shards_percent_as_number" : 100.0
}
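A monitoring script typically reduces this response to a few yes/no questions. The sketch below assumes a response shaped like the example output and flags the fields most often worth alerting on; the function name and the choice of checks are illustrative, not part of any Elasticsearch client.

```python
import json

# A trimmed copy of the example response, used as sample input.
SAMPLE_RESPONSE = """
{
  "cluster_name": "my-es-cluster",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "unassigned_shards": 0,
  "active_shards_percent_as_number": 100.0
}
"""


def find_health_problems(health):
    """Returns a list of human-readable problems found in a _cluster/health response."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} unassigned shards")
    if health.get("timed_out"):
        problems.append("health request timed out")
    return problems


health = json.loads(SAMPLE_RESPONSE)
print(find_health_problems(health))

# The same check against a degraded cluster.
degraded = dict(health, status="yellow", unassigned_shards=2)
print(find_health_problems(degraded))
```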
Integrating with Cloud Monitoring and Alerting
Similar to the C++ application, we can use a Python agent to poll Elasticsearch APIs and send custom metrics to Cloud Monitoring. This agent should run on a node within the cluster or have network access to it.
Python Elasticsearch Monitoring Agent
import os
import time

import requests
from google.cloud import monitoring_v3

# --- Configuration ---
ES_HOST = "http://localhost:9200"  # Elasticsearch host
PROJECT_ID = "your-gcp-project-id"  # Replace with your GCP Project ID
INTERVAL_SECONDS = 60  # How often to poll and report
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/"

# Metrics to collect. "cluster" metrics read value_path from the response root.
# "node" metrics read it from each entry under "nodes" and are written once per
# node, distinguished by metric labels so every node gets its own time series.
# All metrics are written as GAUGE/DOUBLE; "description" is documentation only,
# because Cloud Monitoring creates the descriptors automatically on first write.
METRICS_TO_COLLECT = {
    "cluster_health_status": {
        "api": "/_cluster/health",
        "scope": "cluster",
        "value_path": "status",  # Mapped to green=2, yellow=1, red=0
        "description": "Elasticsearch cluster health status (2=green, 1=yellow, 0=red).",
    },
    "cluster_node_count": {
        "api": "/_cluster/health",
        "scope": "cluster",
        "value_path": "number_of_nodes",
        "description": "Number of nodes in the Elasticsearch cluster.",
    },
    "cluster_unassigned_shards": {
        "api": "/_cluster/health",
        "scope": "cluster",
        "value_path": "unassigned_shards",
        "description": "Number of unassigned shards in the Elasticsearch cluster.",
    },
    "node_jvm_heap_used_percent": {
        "api": "/_nodes/stats/jvm",
        "scope": "node",
        "value_path": "jvm.mem.heap_used_percent",
        "description": "JVM heap usage percentage for Elasticsearch nodes.",
    },
    "node_fs_data_free_percent": {
        "api": "/_nodes/stats/fs",
        "scope": "node",
        "value_path": "fs.data",  # A list of data paths; the percentage is computed below
        "description": "Free disk space percentage for Elasticsearch data paths.",
    },
}
HEALTH_STATUS_VALUES = {"green": 2.0, "yellow": 1.0, "red": 0.0}
# ---------------------


def get_nested_value(data, dotted_path):
    """Follows a dot-separated path through nested dicts; returns None if absent."""
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_fs_free_percent(fs_data_paths):
    """Yields (path, available percent) for each entry of a node's fs.data list."""
    for entry in fs_data_paths or []:
        total = entry.get("total_in_bytes")
        available = entry.get("available_in_bytes")
        if total and available is not None:
            yield entry.get("path", "unknown"), 100.0 * available / total


def get_metadata(path):
    """Reads one value from the Compute Engine metadata server."""
    response = requests.get(
        METADATA_URL + path, headers={"Metadata-Flavor": "Google"}, timeout=2
    )
    response.raise_for_status()
    return response.text


def get_gcp_resource():
    """Returns (resource_type, labels) for the GKE pod or GCE instance we run on."""
    pod_name = os.environ.get("HOSTNAME")  # Defaults to the pod name inside Kubernetes
    # POD_NAMESPACE is not set automatically; inject it with the Downward API.
    namespace = os.environ.get("POD_NAMESPACE")
    try:
        if pod_name and namespace:
            # k8s_pod also requires the cluster name and location. This assumes GKE
            # publishes them as the instance attributes read below.
            labels = {
                "project_id": PROJECT_ID,
                "location": get_metadata("attributes/cluster-location"),
                "cluster_name": get_metadata("attributes/cluster-name"),
                "namespace_name": namespace,
                "pod_name": pod_name,
            }
            print(f"Reporting for GKE pod: {pod_name} in namespace {namespace}")
            return "k8s_pod", labels
        labels = {
            "project_id": PROJECT_ID,
            "instance_id": get_metadata("id"),
            "zone": get_metadata("zone").split("/")[-1],
        }
        print(f"Reporting for GCE instance {labels['instance_id']} in zone {labels['zone']}")
        return "gce_instance", labels
    except requests.exceptions.RequestException:
        print("Could not determine GCP resource. Reporting with generic resource.")
        return "global", {"project_id": PROJECT_ID}


def write_metric(client, resource, metric_name, value, metric_labels=None):
    """Writes one GAUGE/DOUBLE point for custom.googleapis.com/elasticsearch/<metric_name>."""
    resource_type, resource_labels = resource
    series = monitoring_v3.TimeSeries()
    series.metric.type = f"custom.googleapis.com/elasticsearch/{metric_name}"
    for key, label_value in (metric_labels or {}).items():
        series.metric.labels[key] = label_value
    series.resource.type = resource_type
    for key, label_value in resource_labels.items():
        series.resource.labels[key] = label_value

    now = time.time()
    seconds = int(now)
    nanos = int((now - seconds) * 10**9)
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": seconds, "nanos": nanos}}
    )
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"double_value": float(value)}}
    )
    series.points = [point]

    try:
        client.create_time_series(name=f"projects/{PROJECT_ID}", time_series=[series])
        print(f"Reported metric: {series.metric.type} = {value}")
    except Exception as e:
        print(f"Failed to write metric {series.metric.type} to Cloud Monitoring: {e}")


def collect_once(client, resource):
    """Polls every configured API once and writes the resulting points."""
    for metric_name, config in METRICS_TO_COLLECT.items():
        try:
            response = requests.get(f"{ES_HOST}{config['api']}", timeout=10)
            response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
            data = response.json()
        except (requests.exceptions.RequestException, ValueError) as e:
            print(f"Error fetching {config['api']}: {e}")
            continue

        if config["scope"] == "cluster":
            value = get_nested_value(data, config["value_path"])
            if metric_name == "cluster_health_status":
                value = HEALTH_STATUS_VALUES.get(value)  # None for an unknown status
            if value is None:
                print(f"No usable value for {metric_name} at path {config['value_path']}")
            else:
                write_metric(client, resource, metric_name, value)
            continue

        # Per-node metrics: one time series per node (and per data path for disk).
        for node_id, node_data in data.get("nodes", {}).items():
            labels = {"node_name": node_data.get("name", node_id)}
            value = get_nested_value(node_data, config["value_path"])
            if value is None:
                print(f"No value for {metric_name} on node {node_id}")
            elif metric_name == "node_fs_data_free_percent":
                for path, percent in extract_fs_free_percent(value):
                    write_metric(client, resource, metric_name, percent, {**labels, "path": path})
            else:
                write_metric(client, resource, metric_name, value, labels)


if __name__ == "__main__":
    # Ensure you have authenticated correctly (e.g., via service account or gcloud auth)
    # and that the service account has the 'roles/monitoring.metricWriter' role.
    print(f"Starting Elasticsearch monitoring agent. Polling {ES_HOST} every {INTERVAL_SECONDS} seconds.")
    client = monitoring_v3.MetricServiceClient()
    gcp_resource = get_gcp_resource()
    while True:
        try:
            collect_once(client, gcp_resource)
        except Exception as e:
            print(f"An error occurred in the main loop: {e}")
        time.sleep(INTERVAL_SECONDS)
Key considerations for this agent:
- Authentication: Ensure the agent can authenticate with Elasticsearch. If Elasticsearch is secured (e.g., with X-Pack Security), you'll need to pass credentials (e.g., via basic auth headers or API keys).
- Resource Labels: The get_gcp_resource function attempts to identify the GCP resource (GCE instance or GKE pod) to attach appropriate labels to the metrics. This is crucial for filtering and aggregation in Cloud Monitoring. You might need to adjust this based on your specific deployment. Per-node metrics should carry the node name or ID as a label so that each node gets its own time series.
- Metric Definitions: The METRICS_TO_COLLECT dictionary defines which APIs to call, how to extract values (using dot notation for nested JSON), and what each resulting metric represents.
- Error Handling: Robust error handling is included for network issues, JSON parsing, and Cloud Monitoring API calls.
- Deployment: Deploy this agent similarly to the C++ health check agent – as a background process, systemd service, or sidecar container.
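For the authentication point above, a minimal sketch of building the Authorization header is shown here. It uses only the standard library; the helper name is hypothetical, and the api_key argument is assumed to be the already base64-encoded value that Elasticsearch returns when an API key is created.

```python
import base64


def build_auth_headers(username=None, password=None, api_key=None):
    """Returns HTTP headers for a secured Elasticsearch cluster.

    An API key takes precedence over a username/password pair. With no
    credentials, an empty dict is returned (unsecured cluster).
    """
    if api_key:
        # Elasticsearch expects "ApiKey <encoded key>".
        return {"Authorization": f"ApiKey {api_key}"}
    if username is not None and password is not None:
        # Standard HTTP basic auth: base64 of "username:password".
        raw = f"{username}:{password}".encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        return {"Authorization": f"Basic {token}"}
    return {}


print(build_auth_headers(username="user", password="pass"))
print(build_auth_headers(api_key="example-encoded-key"))
```

The returned dictionary can be passed as the headers argument of the agent's requests.get calls. Load the credentials from a secret store or environment variable rather than hard-coding them, and use HTTPS for ES_HOST so they are not sent in clear text.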
Elasticsearch Cluster-Level Alerting in Cloud Monitoring
Using the custom metrics collected by the agent, you can set up powerful alerts:
Example Alerting Policies:
- Cluster Status Red/Yellow:
  - Metric: custom.googleapis.com/elasticsearch/cluster_health_status
  - Condition: "is below" 2.0 (triggers if status is yellow or red; use 1.0 to alert on red only).
  - Duration: 1 minute.
- High Unassigned Shards:
  - Metric: custom.googleapis.com/elasticsearch/cluster_unassigned_shards
  - Condition: "is above" 0.
  - Duration: 5 minutes (to allow for temporary rebalancing).
- High JVM Heap Usage:
  - Metric: custom.googleapis.com/elasticsearch/node_jvm_heap_used_percent
  - Condition: "is above" 85.
  - Duration: 5 minutes.
  - Aggregation: Use "mean" or "max" across relevant nodes.
- Low Disk Space:
  - Metric: custom.googleapis.com/elasticsearch/node_fs_data_free_percent
  - Condition: "is below" 20.
  - Duration: 10 minutes.
  - Aggregation: Use "mean" or "min" across relevant nodes.
Remember to configure appropriate notification channels for these alerts. For critical metrics like cluster status or unassigned shards, consider using PagerDuty or Opsgenie for immediate on-call notification.
Leveraging Elasticsearch's X-Pack Monitoring (Commercial/Basic)
If you are using Elasticsearch with X-Pack (even the basic free tier), it provides a more integrated monitoring solution. X-Pack Monitoring collects metrics and sends them to a dedicated .monitoring-es-* index within your cluster. You can then visualize these metrics using Kibana's Stack Monitoring UI.
To enable X-Pack Monitoring:
# elasticsearch.yml
xpack.monitoring.enabled: true  # Deprecated in 7.8 and removed in 8.0; omit it on those versions
xpack.monitoring.collection.enabled: true
xpack.monitoring.collection.interval: 10s  # Adjust as needed
After enabling and restarting your Elasticsearch nodes, you can access the Stack Monitoring UI in Kibana. This UI provides dashboards for cluster, node, index, and JVM metrics. While convenient, it's still recommended to export key metrics to Cloud Monitoring for centralized alerting and correlation with other GCP services.
To get these metrics into Cloud Monitoring, you can run the Elasticsearch Exporter for Prometheus, which queries Elasticsearch's stats APIs and exposes the results in the Prometheus text format. Google Cloud's operations suite can then ingest those Prometheus metrics.
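To make "the Prometheus text format" concrete, the sketch below renders a gauge the way an exporter's /metrics endpoint would present it. The function and the metric name are hypothetical and are not taken from the Elasticsearch Exporter; label values are not escaped, which a real exporter would need to handle.

```python
def render_gauge(name, help_text, samples):
    """Renders one gauge metric in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs, one per time series.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Labels are rendered as key="value" pairs inside braces.
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


print(render_gauge(
    "es_jvm_heap_used_percent",  # Hypothetical metric name
    "JVM heap used percent.",
    [({"node": "es-1"}, 42.5), ({"node": "es-2"}, 61.0)],
))
```

Each line after the HELP and TYPE comments is one sample, and a scraper turns every distinct label combination into its own time series.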
Ingesting Prometheus Metrics into Cloud Monitoring
1. Deploy Prometheus and an exporter: Elasticsearch does not serve the Prometheus text format itself, so run an exporter next to the cluster and have Prometheus (or the Ops Agent, as described in the next step) scrape the exporter's /metrics endpoint.
2. Configure the collection agent: On Compute Engine, install the Ops Agent and add a Prometheus receiver to its configuration file (/etc/google-ops-agent/config.yaml). On GKE, use Google Cloud Managed Service for Prometheus instead. The receiver belongs under metrics, takes a standard Prometheus scrape configuration, and must be referenced from a pipeline.
# Example snippet for /etc/google-ops-agent/config.yaml
metrics:
  receivers:
    prometheus_es:
      type: prometheus
      config:
        scrape_configs:
          - job_name: elasticsearch
            scrape_interval: 60s
            metrics_path: /metrics
            static_configs:
              # Assumes the exporter listens on its default port 9114;
              # point this at whichever endpoint serves your Elasticsearch metrics.
              - targets: ["localhost:9114"]
            # Optional: keep only metrics starting with 'elasticsearch_'
            metric_relabel_configs:
              - source_labels: [__name__]
                regex: "elasticsearch_.*"
                action: keep
  service:
    pipelines:
      prometheus_es_pipeline:
        receivers:
          - prometheus_es
3. Restart Agent: Restart the Ops Agent (e.g., sudo systemctl restart google-cloud-ops-agent).
4. View in Cloud Monitoring: The scraped metrics appear in Cloud Monitoring under the prometheus.googleapis.com metric prefix, allowing you to build dashboards and alerts using the familiar Cloud Monitoring interface.