Server Monitoring Best Practices: Keeping Your C++ App and Redis Clusters Alive on Google Cloud

Proactive C++ Application Health Checks with Prometheus and Alertmanager

Maintaining the stability of a high-performance C++ application, especially one serving critical functions, demands more than just reactive error logging. We need to implement proactive health checks that can detect subtle performance degradations or impending failures before they impact users. A robust approach involves instrumenting the C++ application to expose metrics and then leveraging Prometheus for collection and Alertmanager for sophisticated alerting.

Our C++ application will expose metrics via an HTTP endpoint, typically `/metrics`. We’ll use the excellent Prometheus C++ client library. This library allows us to define counters, gauges, histograms, and summaries, which are crucial for understanding application behavior.
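To make the `/metrics` endpoint less abstract, here is a minimal sketch of the text a scrape returns for a single counter, following the Prometheus text exposition format. `render_counter` is a hypothetical helper written only for illustration; in practice the client library produces this output for you.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper (not part of any client library): renders one
// unlabelled counter sample in the Prometheus text exposition format.
std::string render_counter(const std::string& name,
                           const std::string& help,
                           std::uint64_t value) {
    std::ostringstream out;
    out << "# HELP " << name << ' ' << help << '\n';  // human-readable description
    out << "# TYPE " << name << " counter\n";         // metric type hint for the scraper
    out << name << ' ' << value << '\n';              // the sample itself
    return out.str();
}
```

Calling `render_counter("my_cpp_app_requests_total", "Total requests.", 3)` yields a HELP line, a TYPE line, and the sample line `my_cpp_app_requests_total 3`.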

Instrumenting a C++ Application

Let’s consider a simple C++ application that performs some work and needs to report its operational status. We’ll track the number of requests processed and the latency of a critical operation.

Example C++ Code with Prometheus Metrics

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <string>
#include <thread>
#include <vector>

#include <prometheus/counter.h>
#include <prometheus/exposer.h>
#include <prometheus/histogram.h>
#include <prometheus/registry.h>

// Function to simulate a critical operation
void perform_critical_operation() {
    // Simulate work with random latency between 50 and 199 ms
    std::this_thread::sleep_for(std::chrono::milliseconds(50 + std::rand() % 150));
}

int main() {
    // Initialize Prometheus registry and exposer
    auto registry = std::make_shared<prometheus::Registry>();
    prometheus::Exposer exposer{"0.0.0.0:9100"}; // Expose metrics on port 9100
    exposer.RegisterCollectable(registry);

    // Define metrics. Register() returns a metric family; Add() creates the
    // concrete time series (here without any extra labels).
    auto& request_family = prometheus::BuildCounter()
                               .Name("my_cpp_app_requests_total")
                               .Help("Total number of requests processed by the C++ application.")
                               .Register(*registry);
    auto& request_counter = request_family.Add({});

    // A histogram exposes the *_bucket series that histogram_quantile() needs.
    auto& latency_family = prometheus::BuildHistogram()
                               .Name("my_cpp_app_critical_operation_duration_seconds")
                               .Help("Latency of the critical operation in seconds.")
                               .Register(*registry);
    auto& operation_latency = latency_family.Add(
        {}, prometheus::Histogram::BucketBoundaries{0.05, 0.1, 0.15, 0.2, 0.25, 0.5, 1.0});

    std::cout << "C++ application started. Metrics exposed on http://0.0.0.0:9100/metrics" << std::endl;

    // Main application loop
    while (true) {
        // Simulate receiving a request
        request_counter.Increment();

        // Measure latency of the critical operation
        auto start = std::chrono::steady_clock::now(); // monotonic clock, safe for measuring intervals
        perform_critical_operation();
        auto end = std::chrono::steady_clock::now();
        std::chrono::duration<double> elapsed = end - start;

        operation_latency.Observe(elapsed.count());

        // Simulate some other work
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    return 0;
}

To compile this, you’ll need to link against the Prometheus C++ client library. A typical CMakeLists.txt might look like this:

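cmake_minimum_required(VERSION 3.10)
project(my_cpp_app CXX)

# Assumes prometheus-cpp was installed together with its CMake package config files
find_package(prometheus-cpp CONFIG REQUIRED)

add_executable(my_cpp_app main.cpp)
# The Exposer lives in the pull component, which also brings in core
target_link_libraries(my_cpp_app PRIVATE prometheus-cpp::pull)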
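The manual start/end timing in the main loop can also be wrapped in a small RAII helper, so that a latency observation is recorded even when the measured code returns early. The following is only a sketch: `ScopedTimer` is a hypothetical class, not part of the Prometheus client library, and it reports to an arbitrary callback rather than to a metric directly.

```cpp
#include <chrono>
#include <functional>
#include <utility>

// Hypothetical helper: measures the lifetime of a scope and hands the
// elapsed time in seconds to a callback when the scope ends.
class ScopedTimer {
public:
    explicit ScopedTimer(std::function<void(double)> on_done)
        : on_done_(std::move(on_done)),
          start_(std::chrono::steady_clock::now()) {}

    // Non-copyable: each timer reports exactly once.
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

    ~ScopedTimer() {
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        // The callback runs inside a destructor, so it should not throw.
        if (on_done_) on_done_(elapsed.count());
    }

private:
    std::function<void(double)> on_done_;
    std::chrono::steady_clock::time_point start_;
};
```

In the application above, this could be used as `{ ScopedTimer t([&](double s) { operation_latency.Observe(s); }); perform_critical_operation(); }`. The sketch uses `steady_clock` because it is monotonic and therefore unaffected by wall-clock adjustments.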

Configuring Prometheus for Collection

Once your C++ application is running and exposing metrics, you need to configure Prometheus to scrape these endpoints. This is done in the prometheus.yml configuration file.

global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape our C++ application
  - job_name: 'my_cpp_app'
    static_configs:
      - targets: ['<CPP_APP_HOST>:9100'] # Replace <CPP_APP_HOST> with the actual IP or hostname of your C++ app instance
    metrics_path: '/metrics' # Default, but good to be explicit

If your C++ application runs in Google Kubernetes Engine (GKE), you would typically use the Prometheus Operator or Google Cloud Managed Service for Prometheus instead of static targets. With the Prometheus Operator, `ServiceMonitor` or `PodMonitor` resources let Prometheus discover and scrape your application pods. For example, a `ServiceMonitor` might look like this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-cpp-app-monitor
  labels:
    release: prometheus # Assuming your Prometheus Operator is deployed with this release name
spec:
  selector:
    matchLabels:
      app: my-cpp-app # Label on your C++ application's Service
  namespaceSelector:
    matchNames:
      - default # Namespace where your C++ app is running
  endpoints:
  - port: metrics # Name of the port in your Service definition
    interval: 15s
    path: /metrics

Setting Up Alertmanager for C++ App Alerts

Alertmanager handles deduplication, grouping, and routing of alerts generated by Prometheus. We’ll define alert rules in Prometheus and configure Alertmanager to send notifications (e.g., to Slack, PagerDuty).

Prometheus Alerting Rules

groups:
- name: cpp_app_alerts
  rules:
  - alert: HighRequestLatency
    expr: histogram_quantile(0.95, sum(rate(my_cpp_app_critical_operation_duration_seconds_bucket[5m])) by (le)) > 0.5 # 95th percentile latency > 0.5 seconds for 5 minutes
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency detected for critical operation in C++ app"
      description: "The 95th percentile latency for the critical operation has been above 0.5s for 5 minutes. Current value: {{ $value }}s"

  - alert: ApplicationDown
    expr: up{job="my_cpp_app"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "C++ application is down"
      description: "Prometheus cannot scrape metrics from the C++ application. Check if the application is running and accessible."

Alertmanager Configuration

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver if no specific route matches

receivers:
- name: 'default-receiver'
  slack_configs:
  - api_url: '<SLACK_WEBHOOK_URL>' # Placeholder: your Slack incoming webhook URL
    channel: '#alerts'
    send_resolved: true
    text: '{{ template "slack.default.text" . }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service'] # Inhibit warnings if a critical alert with the same values for these labels is firing
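The `group_by` setting in the route decides which alerts end up in the same notification: alerts that share the same values for the listed labels form one group. The following sketch illustrates that bucketing; `Labels`, `group_key`, and `group_alerts` are hypothetical names chosen for this example and say nothing about how Alertmanager is implemented internally.

```cpp
#include <map>
#include <string>
#include <vector>

using Labels = std::map<std::string, std::string>;

// Hypothetical helper: builds a grouping key from the values of the
// group_by labels. A label the alert does not carry contributes an empty value.
std::string group_key(const Labels& alert,
                      const std::vector<std::string>& group_by) {
    std::string key;
    for (const auto& name : group_by) {
        const auto it = alert.find(name);
        key += name + "=" + (it != alert.end() ? it->second : std::string()) + ";";
    }
    return key;
}

// Buckets alerts by grouping key; each bucket would become one notification.
std::map<std::string, std::vector<Labels>> group_alerts(
    const std::vector<Labels>& alerts,
    const std::vector<std::string>& group_by) {
    std::map<std::string, std::vector<Labels>> groups;
    for (const auto& alert : alerts) {
        groups[group_key(alert, group_by)].push_back(alert);
    }
    return groups;
}
```

Two `HighRequestLatency` alerts from different instances land in the same group here, because `instance` is not among the `group_by` labels. Adding it to `group_by` would produce one notification per instance instead.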

Ensure your Prometheus server is configured to send alerts to Alertmanager via the alerting section in prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - '<ALERTMANAGER_HOST>:<PORT>' # e.g., localhost:9093

Monitoring Redis Clusters with Google Cloud Operations Suite

Redis, whether used as a cache, message broker, or session store, is a critical component. Monitoring its health, performance, and resource utilization is paramount. Google Cloud Operations Suite (formerly Stackdriver) provides powerful tools for this, especially when running Redis on Compute Engine or within GKE.

Key Redis Metrics to Monitor

Latency: Average and P99 latency for GET/SET operations. High latency indicates potential bottlenecks or overloaded Redis instances.
Memory Usage: Current memory used, peak memory usage, and available memory. Crucial for preventing OOM (Out Of Memory) errors.
Connections: Number of connected clients, maximum clients. High connection counts can strain Redis.
CPU Usage: CPU utilization of the Redis process. High CPU can indicate complex operations or insufficient resources.
Network Traffic: Bytes received and sent. Useful for identifying network saturation.
Cache Hit Rate: For caching use cases, a low hit rate might indicate insufficient memory or an ineffective caching strategy.
Replication Lag: For Redis Sentinel or Cluster setups, monitor the replication lag between master and replicas.
Evictions: Number of keys evicted due to memory limits. High eviction rates mean your cache is too small or your data set is too large for the allocated memory.

Leveraging Cloud Monitoring Agent

The Ops Agent, which replaces the legacy Cloud Monitoring agent, can collect system and application metrics. For Redis, we can use its built-in Redis receiver or configure custom metrics collection.
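For custom collection, the raw material is the reply of the Redis `INFO` command: lines of `field:value` pairs, grouped under `# Section` headers and terminated by CRLF. The sketch below parses such a reply into a map. `parse_info` is a hypothetical helper, and obtaining the reply from Redis (via a client library of your choice) is left out.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: turns the text of a Redis INFO reply into a
// field -> value map. Section headers and blank lines are skipped.
std::map<std::string, std::string> parse_info(const std::string& text) {
    std::map<std::string, std::string> fields;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        // INFO lines end in CRLF; drop the trailing carriage return.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty() || line[0] == '#') continue;

        // Split on the first colon only, since values may contain colons.
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        fields[line.substr(0, colon)] = line.substr(colon + 1);
    }
    return fields;
}
```

Values stay as strings because `INFO` mixes numbers with text fields such as `role`; numeric fields can be converted with `std::stoull` or `std::stod` at the point of use.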

Ops Agent Configuration for Redis

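First, ensure the Ops Agent is installed and running on your Compute Engine instances. The Ops Agent is designed for VMs; for Redis running in GKE, a Prometheus-based setup like the one described earlier is the more usual route. Then configure the agent's config.yaml (typically located at /etc/google-cloud-ops-agent/config.yaml) to include Redis logs and metrics, and restart the agent so the change takes effect.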

logging:
  receivers:
    redis_logs:
      type: redis
      # Optional: specify log file paths if they are not in a default location
      # include_paths: [/var/log/redis/redis-server.log]
  service:
    pipelines:
      redis:
        receivers:
          - redis_logs

metrics:
  # Enable the Redis receiver
  receivers:
    redis:
      type: redis
      # Redis endpoint to monitor (default: localhost:6379).
      # Each receiver watches one endpoint; for Sentinel or Cluster setups,
      # define one receiver per Redis node.
      address: localhost:6379 # Adjust if your Redis is elsewhere
      # If Redis requires a password:
      # password: "YOUR_REDIS_PASSWORD"
      # If using TLS:
      # insecure: false
      # ca_file: /path/to/ca.crt
      collection_interval: 30s # How often to collect metrics
      # The receiver has no per-metric allow list. It reads Redis INFO and
      # reports a fixed set of workload.googleapis.com/redis.* metrics derived
      # from fields such as:
      #   connected_clients, blocked_clients, rejected_connections
      #   used_memory, used_memory_peak, mem_fragmentation_ratio
      #   keyspace_hits, keyspace_misses, evicted_keys
      #   total_net_input_bytes, total_net_output_bytes
      #   latest_fork_usec, rdb_changes_since_last_save
      #   connected_slaves, master_repl_offset, repl_backlog_first_byte_offset
  # The Ops Agent sends collected metrics to Cloud Monitoring on its own, so
  # no exporter section is needed. A pipeline only has to reference the receiver.
  service:
    pipelines:
      redis:
        receivers:
          - redis