Server Monitoring Best Practices: Keeping Your C++ App and Redis Clusters Alive on Google Cloud
Proactive C++ Application Health Checks with Prometheus and Alertmanager
Maintaining the stability of a high-performance C++ application, especially one serving critical functions, demands more than just reactive error logging. We need to implement proactive health checks that can detect subtle performance degradations or impending failures before they impact users. A robust approach involves instrumenting the C++ application to expose metrics and then leveraging Prometheus for collection and Alertmanager for sophisticated alerting.
Our C++ application will expose metrics via an HTTP endpoint, typically `/metrics`. We’ll use prometheus-cpp, a widely used Prometheus client library for C++. It lets us define counters, gauges, histograms, and summaries, which are crucial for understanding application behavior.
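To make the `/metrics` endpoint less abstract, here is a minimal sketch of what a single counter looks like in the Prometheus text exposition format. The helper `render_counter` is a hypothetical name used only for illustration. In practice the client library produces this output for you, so treat the code as a way to read the format, not as something to ship.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: renders one unlabeled counter sample in the
// Prometheus text exposition format (HELP line, TYPE line, sample line).
std::string render_counter(const std::string& name, const std::string& help,
                           unsigned long long value) {
    assert(!name.empty());  // a metric without a name cannot be scraped
    std::ostringstream out;
    out << "# HELP " << name << ' ' << help << '\n'
        << "# TYPE " << name << " counter\n"
        << name << ' ' << value << '\n';
    return out.str();
}
```

Calling `render_counter("my_cpp_app_requests_total", "Total requests.", 42)` yields three lines: the help text, the type declaration, and the sample `my_cpp_app_requests_total 42`.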
Instrumenting a C++ Application
Let’s consider a simple C++ application that performs some work and needs to report its operational status. We’ll track the number of requests processed and the latency of a critical operation.
Example C++ Code with Prometheus Metrics
#include <chrono>
#include <iostream>
#include <memory>
#include <random>
#include <thread>
#include <prometheus/counter.h>
#include <prometheus/exposer.h>
#include <prometheus/histogram.h>
#include <prometheus/registry.h>

// Function to simulate a critical operation
void perform_critical_operation() {
    // Simulate work with random latency between 50 and 199 ms
    static std::mt19937 rng{std::random_device{}()};
    static std::uniform_int_distribution<int> delay_ms{50, 199};
    std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms(rng)));
}

int main() {
    // Initialize Prometheus registry and exposer
    auto registry = std::make_shared<prometheus::Registry>();
    prometheus::Exposer exposer{"0.0.0.0:9100"};  // Expose metrics on port 9100
    exposer.RegisterCollectable(registry);

    // Define metrics. Register() returns a metric family; Add() creates the
    // concrete time series (here without extra labels).
    auto& request_counter = prometheus::BuildCounter()
        .Name("my_cpp_app_requests_total")
        .Help("Total number of requests processed by the C++ application.")
        .Register(*registry)
        .Add({});
    // A histogram exports the _bucket series that histogram_quantile() needs
    // in alert rules; a summary would not.
    auto& operation_latency = prometheus::BuildHistogram()
        .Name("my_cpp_app_critical_operation_duration_seconds")
        .Help("Latency of the critical operation in seconds.")
        .Register(*registry)
        .Add({}, prometheus::Histogram::BucketBoundaries{0.05, 0.1, 0.25, 0.5, 1.0});

    std::cout << "C++ application started. Metrics exposed on http://0.0.0.0:9100/metrics" << std::endl;

    // Main application loop (runs until the process is stopped)
    while (true) {
        // Simulate receiving a request
        request_counter.Increment();

        // Measure latency of the critical operation with a monotonic clock
        auto start = std::chrono::steady_clock::now();
        perform_critical_operation();
        auto end = std::chrono::steady_clock::now();
        std::chrono::duration<double> elapsed = end - start;
        operation_latency.Observe(elapsed.count());

        // Simulate some other work
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}
To compile this, you’ll need to link against the Prometheus C++ client library. A typical CMakeLists.txt might look like this:
cmake_minimum_required(VERSION 3.10)
project(my_cpp_app)
find_package(prometheus-cpp CONFIG REQUIRED)
add_executable(my_cpp_app main.cpp)
# The Exposer lives in the pull target, which brings in core transitively
target_link_libraries(my_cpp_app PRIVATE prometheus-cpp::pull)
Configuring Prometheus for Collection
Once your C++ application is running and exposing metrics, you need to configure Prometheus to scrape these endpoints. This is done in the prometheus.yml configuration file.
global:
  scrape_interval: 15s # Scrape targets every 15 seconds (Prometheus defaults to 1m).
scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Scrape our C++ application
  - job_name: 'my_cpp_app'
    static_configs:
      - targets: [':9100'] # Replace with the actual IP of your C++ app instance
    metrics_path: '/metrics' # Default, but good to be explicit
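Counters such as `my_cpp_app_requests_total` only ever grow, so Prometheus turns successive scrapes into a per-second rate. The sketch below is a simplified take on that idea, assuming a hypothetical `Sample` struct and ignoring the window extrapolation that the real `rate()` function performs. It does show the key detail: a drop in the counter is treated as a process restart.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One scraped value of a counter (hypothetical type for illustration).
struct Sample {
    double timestamp_s;  // scrape time in seconds
    double value;        // counter value at that time
};

// Simplified per-second rate over a window of scrapes.
// A decrease between two scrapes is treated as a counter reset,
// so the new value counts as the full increase since the restart.
double per_second_rate(const std::vector<Sample>& samples) {
    assert(samples.size() >= 2);
    double increase = 0.0;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        const double prev = samples[i - 1].value;
        const double cur = samples[i].value;
        increase += (cur >= prev) ? (cur - prev) : cur;
    }
    const double window = samples.back().timestamp_s - samples.front().timestamp_s;
    assert(window > 0.0);
    return increase / window;
}
```

With a 15-second scrape interval, three scrapes reading 100, 130, and 160 give an increase of 60 over 30 seconds, or 2 requests per second.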
If your C++ application is running within Google Kubernetes Engine (GKE), you’d typically use the Prometheus Operator or a managed offering such as Google Cloud Managed Service for Prometheus. With the Prometheus Operator, `ServiceMonitor` or `PodMonitor` resources discover and scrape your application pods. For example, a `ServiceMonitor` might look like this:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-cpp-app-monitor
  labels:
    release: prometheus # Assuming your Prometheus Operator is deployed with this release name
spec:
  selector:
    matchLabels:
      app: my-cpp-app # Label on your C++ application's Service
  namespaceSelector:
    matchNames:
      - default # Namespace where your C++ app is running
  endpoints:
    - port: metrics # Name of the port in your Service definition
      interval: 15s
      path: /metrics
Setting Up Alertmanager for C++ App Alerts
Alertmanager handles deduplication, grouping, and routing of alerts generated by Prometheus. We’ll define alert rules in Prometheus and configure Alertmanager to send notifications (e.g., to Slack, PagerDuty).
Prometheus Alerting Rules
groups:
  - name: cpp_app_alerts
    rules:
      - alert: HighRequestLatency
        # 95th percentile latency > 0.5 seconds for 5 minutes
        expr: histogram_quantile(0.95, sum(rate(my_cpp_app_critical_operation_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected for critical operation in C++ app"
          description: "The 95th percentile latency for the critical operation has been above 0.5s for 5 minutes. Current value: {{ $value }}s"
      - alert: ApplicationDown
        expr: up{job="my_cpp_app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "C++ application is down"
          description: "Prometheus cannot scrape metrics from the C++ application. Check if the application is running and accessible."
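The `HighRequestLatency` rule depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following sketch approximates that calculation under simplifying assumptions: the `Bucket` struct and function name are hypothetical, buckets are already sorted, and the last bucket is the `+Inf` bucket. It is meant to show why bucket boundaries near your alert threshold matter, not to reproduce Prometheus exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Upper bound used for the catch-all "+Inf" bucket.
constexpr double kInf = std::numeric_limits<double>::infinity();

// One histogram bucket: observations <= upper_bound, counted cumulatively.
struct Bucket {
    double upper_bound;
    double cumulative_count;
};

// Estimate the q-quantile (0 < q <= 1) from sorted cumulative buckets.
double estimate_quantile(double q, const std::vector<Bucket>& buckets) {
    assert(q > 0.0 && q <= 1.0);
    assert(buckets.size() >= 2);  // at least one finite bucket plus +Inf
    const double total = buckets.back().cumulative_count;
    assert(total > 0.0);
    const double rank = q * total;

    // Find the first bucket whose cumulative count reaches the rank.
    std::size_t i = 0;
    while (i + 1 < buckets.size() && buckets[i].cumulative_count < rank) {
        ++i;
    }
    if (i + 1 == buckets.size()) {
        // The rank falls into +Inf: report the highest finite bound.
        return buckets[buckets.size() - 2].upper_bound;
    }
    const double lower = (i == 0) ? 0.0 : buckets[i - 1].upper_bound;
    const double below = (i == 0) ? 0.0 : buckets[i - 1].cumulative_count;
    const double in_bucket = buckets[i].cumulative_count - below;
    // Assume observations are spread evenly inside the bucket.
    return lower + (buckets[i].upper_bound - lower) * ((rank - below) / in_bucket);
}
```

Because the estimate can never exceed the highest finite bucket bound, a histogram whose largest finite bucket sits below the alert threshold would never trigger the rule. Choose boundaries that bracket the 0.5-second threshold.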
Alertmanager Configuration
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver if no specific route matches
receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: ''
        channel: '#alerts'
        send_resolved: true
        text: '{{ template "slack.default.text" . }}'
inhibit_rules:
  # Inhibit warnings if a critical alert is firing for the same service.
  # The source is the alert that does the muting; the target is the alert that gets muted.
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
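Inhibition rules are easy to get backwards, so here is a rough sketch of the matching logic as I understand it: a target alert is muted when some firing alert matches the source matchers and both alerts agree on every label listed in `equal` (a missing label is treated as an empty value). The types and function names are hypothetical and only model exact-match label comparisons.

```cpp
#include <map>
#include <string>
#include <vector>

using Labels = std::map<std::string, std::string>;

// Hypothetical model of one inhibit rule with exact-match label matchers.
struct InhibitRule {
    Labels source_match;             // the alert that mutes others
    Labels target_match;             // the alert that gets muted
    std::vector<std::string> equal;  // labels that must agree on both
};

// True if every wanted label is present on the alert with the same value.
static bool matches(const Labels& alert, const Labels& wanted) {
    for (const auto& kv : wanted) {
        const auto it = alert.find(kv.first);
        if (it == alert.end() || it->second != kv.second) return false;
    }
    return true;
}

// A missing label compares as an empty string.
static std::string label_or_empty(const Labels& alert, const std::string& name) {
    const auto it = alert.find(name);
    return it == alert.end() ? std::string() : it->second;
}

// Returns true if `target` should be muted given the currently firing alerts.
bool is_inhibited(const Labels& target, const std::vector<Labels>& firing,
                  const InhibitRule& rule) {
    if (!matches(target, rule.target_match)) return false;
    for (const Labels& source : firing) {
        if (!matches(source, rule.source_match)) continue;
        bool all_equal = true;
        for (const std::string& name : rule.equal) {
            if (label_or_empty(source, name) != label_or_empty(target, name)) {
                all_equal = false;
                break;
            }
        }
        if (all_equal) return true;
    }
    return false;
}
```

One consequence worth checking in your own setup: with `alertname` in the `equal` list, a critical `ApplicationDown` alert does not mute a `HighRequestLatency` warning, because their alert names differ. Drop `alertname` from `equal` if you want cross-alert inhibition.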
Ensure your Prometheus server is configured to send alerts to Alertmanager via the alerting section in prometheus.yml:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - ':' # e.g., localhost:9093
Monitoring Redis Clusters with Google Cloud Operations Suite
Redis, whether used as a cache, message broker, or session store, is a critical component. Monitoring its health, performance, and resource utilization is paramount. Google Cloud Operations Suite (formerly Stackdriver) provides powerful tools for this, especially when running Redis on Compute Engine or within GKE.
Key Redis Metrics to Monitor
- Latency: Average and P99 latency for GET/SET operations. High latency indicates potential bottlenecks or overloaded Redis instances.
- Memory Usage: Current memory used, peak memory usage, and available memory. Crucial for preventing OOM (Out Of Memory) errors.
- Connections: Number of connected clients, maximum clients. High connection counts can strain Redis.
- CPU Usage: CPU utilization of the Redis process. High CPU can indicate complex operations or insufficient resources.
- Network Traffic: Bytes received and sent. Useful for identifying network saturation.
- Cache Hit Rate: For caching use cases, a low hit rate might indicate insufficient memory or an ineffective caching strategy.
- Replication Lag: For Redis Sentinel or Cluster setups, monitor the replication lag between master and replicas.
- Evictions: Number of keys evicted due to memory limits. High eviction rates mean your cache is too small or your data set is too large for the allocated memory.
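Two of the metrics above are derived rather than read directly: cache hit rate comes from `keyspace_hits` and `keyspace_misses`, and replication lag can be approximated by comparing the master's replication offset with a replica's. A minimal sketch, with hypothetical function names and offsets measured in bytes:

```cpp
#include <cstdint>

// Computes hits / (hits + misses). Returns false when there have been no
// lookups yet, so that "no traffic" is not reported as a 0% hit rate.
bool cache_hit_ratio(std::uint64_t hits, std::uint64_t misses, double& ratio) {
    const std::uint64_t lookups = hits + misses;
    if (lookups == 0) return false;
    ratio = static_cast<double>(hits) / static_cast<double>(lookups);
    return true;
}

// Approximate replication lag as the byte distance between the master's
// replication offset and the offset a replica has acknowledged.
std::uint64_t replication_lag_bytes(std::uint64_t master_offset,
                                    std::uint64_t replica_offset) {
    return master_offset > replica_offset ? master_offset - replica_offset : 0;
}
```

Both counters are cumulative since server start, so for alerting it is usually more telling to compute the ratio over the increase within a recent window than over lifetime totals.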
Leveraging Cloud Monitoring Agent
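The Ops Agent, Google Cloud's current agent for Compute Engine VMs and the successor to the legacy Cloud Monitoring agent, collects system and application metrics as well as logs. For Redis, we can use its built-in `redis` receiver or configure custom metrics collection.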
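If you go the custom route, the raw material is the output of the Redis `INFO` command: `key:value` lines separated by CRLF, grouped under `# Section` headers. The sketch below parses that text into a map. It assumes you already have the `INFO` reply as a string (fetching it from Redis is out of scope here), and `parse_redis_info` is a hypothetical name.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parses the text reply of the Redis INFO command into key/value pairs.
// Section headers ("# Memory") and blank lines are skipped.
std::map<std::string, std::string> parse_redis_info(const std::string& info) {
    std::map<std::string, std::string> fields;
    std::istringstream lines(info);
    std::string line;
    while (std::getline(lines, line)) {
        // INFO uses CRLF line endings; drop the trailing carriage return.
        if (!line.empty() && line[line.size() - 1] == '\r') {
            line.erase(line.size() - 1);
        }
        if (line.empty() || line[0] == '#') continue;
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        fields[line.substr(0, colon)] = line.substr(colon + 1);
    }
    return fields;
}
```

Values stay as strings because `INFO` mixes numbers with text such as `role:master`. Convert the fields you need, for example with `std::stoull`, at the point of use.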
Ops Agent Configuration for Redis
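First, ensure the Ops Agent is installed and running on the Compute Engine instances that host Redis. (As far as I know, the Ops Agent is not intended to be installed on GKE nodes; for Redis on GKE, rely on the cluster's built-in monitoring or a Prometheus-based exporter instead.) Then, configure its config.yaml (typically located at /etc/google-cloud-ops-agent/config.yaml) to include Redis logs and metrics, and restart the agent so the change takes effect.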
logging:
  receivers:
    redis_logs:
      type: redis
      # Optional: list the log files explicitly if they are not in a default location
      # include_paths: [/var/log/redis/redis-server.log]
  service:
    pipelines:
      redis:
        receivers: [redis_logs]
metrics:
  # Enable the Redis receiver
  receivers:
    redis:
      type: redis
      # The receiver polls a single Redis instance. For Sentinel or Cluster
      # deployments, configure the agent on each Redis node.
      address: localhost:6379 # Adjust if your Redis is elsewhere
      # If using password authentication:
      # password: "YOUR_REDIS_PASSWORD"
      # If using TLS:
      # insecure: false
      collection_interval: 30s # How often to collect metrics
  service:
    pipelines:
      redis:
        receivers: [redis]
# The receiver has no per-metric allow-list. It reports a fixed set of metrics
# derived from Redis INFO fields such as connected_clients, blocked_clients,
# used_memory, used_memory_peak, mem_fragmentation_ratio, keyspace_hits,
# keyspace_misses, evicted_keys, rejected_connections, latest_fork_usec,
# rdb_changes_since_last_save, total_net_input_bytes, total_net_output_bytes,
# master_repl_offset, and connected_slaves. Check the Ops Agent Redis
# documentation for the exact list supported by your agent version.
#
# Metrics are sent to Cloud Monitoring by default, so no exporter section is
# needed for basic usage. On Compute Engine the agent attaches the
# gce_instance resource labels (project, zone, instance) on its own.