Server Monitoring Best Practices: Keeping Your C++ App and MySQL Clusters Alive on Google Cloud

Proactive C++ Application Health Checks with Google Cloud Monitoring

Maintaining the health of C++ applications, especially those handling critical workloads, requires more than just reactive error logging. We need to implement proactive health checks that integrate seamlessly with Google Cloud’s monitoring infrastructure. This involves exposing application-specific metrics and defining custom health check endpoints that Cloud Monitoring can poll.

For a C++ application, we can use a Prometheus client library (the official clients target other languages, but the community-maintained prometheus-cpp library covers C++) or simply expose an HTTP endpoint that returns a status code and basic health information. Let’s consider a simple HTTP health check endpoint.

Implementing a Basic HTTP Health Check Endpoint in C++

We’ll use a lightweight HTTP server library. For demonstration purposes, let’s assume a hypothetical `SimpleHttpServer` class. The goal is to have an endpoint, say `/healthz`, that returns a 200 OK if the application is healthy, and a non-2xx status code otherwise. We can also include basic diagnostic information in the response body.

Example C++ Health Check Handler

This snippet illustrates the core logic of a health check handler. In a real-world scenario, this handler would query internal application states, database connections, or external service dependencies.

#include <chrono>
#include <sstream>
#include <string>

// HttpRequest and HttpResponse are minimal placeholders for the types your
// HTTP library provides; SimpleHttpServer (used below) is likewise hypothetical.
struct HttpRequest {};
struct HttpResponse {
    int statusCode = 200;
    std::string contentType;
    std::string body;
    void setStatusCode(int code) { statusCode = code; }
    void setContentType(const std::string& type) { contentType = type; }
    void setBody(const std::string& text) { body = text; }
};

// Recorded once at startup so uptime comes from a monotonic clock
// rather than from the wall-clock time since the Unix epoch.
static const auto kProcessStart = std::chrono::steady_clock::now();

// Hypothetical function to check application-specific health
bool isApplicationHealthy() {
    // In a real app, this would check:
    // - Database connection status
    // - Cache availability
    // - Internal component health
    // - Resource utilization (e.g., memory, CPU beyond OS limits)
    // For this example, we'll simulate a healthy state.
    return true;
}

// Builds a small JSON document describing the application's state
std::string getApplicationMetrics(bool healthy) {
    const auto uptime = std::chrono::duration_cast<std::chrono::seconds>(
        std::chrono::steady_clock::now() - kProcessStart);
    std::ostringstream ss;
    ss << "{";
    ss << "\"uptime_seconds\": " << uptime.count();
    // Add more metrics like request counts, error rates, etc.
    ss << ", \"is_healthy\": " << (healthy ? "true" : "false");
    ss << "}";
    return ss.str();
}

// This function would be registered with your HTTP server
void healthCheckHandler(HttpRequest& request, HttpResponse& response) {
    (void)request;  // The request itself is not inspected
    const bool healthy = isApplicationHealthy();  // Evaluate once so status and body agree
    response.setStatusCode(healthy ? 200 : 503);  // 503 = Service Unavailable
    response.setContentType("application/json");
    response.setBody(getApplicationMetrics(healthy));  // Sent either way to help debugging
}

// In your main application setup:
// SimpleHttpServer server;
// server.registerHandler("/healthz", healthCheckHandler);
// server.start();
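The `isApplicationHealthy()` stub returns a constant, while a real handler would combine several dependency checks. One possible way to structure that is sketched below; `HealthCheck`, `HealthReport`, and `runHealthChecks` are hypothetical names introduced for illustration, not part of any library.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical named check: probe() returns true when the dependency is usable.
struct HealthCheck {
    std::string name;
    std::function<bool()> probe;
};

struct HealthReport {
    bool healthy = true;
    std::vector<std::string> failed;  // Names of the checks that did not pass
};

// Runs every check without short-circuiting so the report lists all failures.
HealthReport runHealthChecks(const std::vector<HealthCheck>& checks) {
    HealthReport report;
    for (const auto& check : checks) {
        bool ok = false;
        try {
            ok = check.probe();
        } catch (...) {
            ok = false;  // A throwing (or empty) probe counts as a failed check
        }
        if (!ok) {
            report.healthy = false;
            report.failed.push_back(check.name);
        }
    }
    return report;
}
```

Listing the failed check names in the response body makes a 503 easier to triage, because the body then names the dependency that is down rather than only reporting that something is.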

Configuring Google Cloud Monitoring for C++ Health Checks

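Once your application exposes a health check endpoint, Cloud Monitoring can probe it directly with an uptime check, which requests the URL on a schedule and can alert when the check fails. If the application sits behind a load balancer, the load balancer’s own health check can use the same `/healthz` endpoint, but that is a separate Cloud Load Balancing feature that decides which backends receive traffic. For metrics and logs collected on the VM itself, use the Ops Agent.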

Using the Ops Agent

The Ops Agent is the recommended way to collect logs and metrics from your Compute Engine instances. It does not poll arbitrary HTTP endpoints for a status code, but it can scrape metrics served in the Prometheus text exposition format, so the application’s health can be collected that way.

Ops Agent Configuration (`/etc/google-cloud-ops-agent/config.yaml`)

The configuration below tails the application’s log files and adds a `prometheus` metrics receiver. It assumes that, next to `/healthz`, the application serves Prometheus-format metrics at `/metrics` on port 8080, including a gauge named `app_up` (1 when healthy, 0 otherwise); the path, port, and metric name are placeholders. This requires an Ops Agent release recent enough to include the Prometheus receiver, installed and running on your Compute Engine instances.

logging:
  receivers:
    app_logs:
      type: files
      include_paths:
        - /var/log/my_cpp_app/*.log
  service:
    pipelines:
      app_pipeline:
        receivers: [app_logs]

metrics:
  receivers:
    cpp_app_prometheus:
      type: prometheus
      config:
        scrape_configs:
          - job_name: my_cpp_app
            scrape_interval: 30s # How often to scrape the endpoint
            scrape_timeout: 5s
            metrics_path: /metrics
            static_configs:
              - targets: ["localhost:8080"] # Replace with your app's host/port
  service:
    pipelines:
      cpp_app_pipeline:
        receivers: [cpp_app_prometheus]

# The agent sends data to Cloud Logging and Cloud Monitoring on its own, so
# there is no exporter section to configure. The VM's service account needs
# the Logs Writer and Monitoring Metric Writer roles.

After updating the `config.yaml`, restart the Ops Agent:

sudo systemctl restart google-cloud-ops-agent

The scraped metrics then appear in Cloud Monitoring under the `prometheus.googleapis.com/` prefix, for example `prometheus.googleapis.com/app_up/gauge`, where they can be charted and used in alerting policies.
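For health data to be collected as metrics, the application has to serve it in a format a scraper understands, and the Prometheus text exposition format is a common choice. The sketch below renders two gauges in that format; the function name `renderPrometheusMetrics` and the metric names `app_up` and `app_uptime_seconds` are assumptions made for illustration.

```cpp
#include <sstream>
#include <string>

// Renders two gauges in the Prometheus text exposition format.
// Each metric gets a HELP line, a TYPE line, and one sample line.
std::string renderPrometheusMetrics(bool healthy, long long uptimeSeconds) {
    std::ostringstream out;
    out << "# HELP app_up 1 if the application considers itself healthy, 0 otherwise.\n"
        << "# TYPE app_up gauge\n"
        << "app_up " << (healthy ? 1 : 0) << "\n"
        << "# HELP app_uptime_seconds Seconds since the process started.\n"
        << "# TYPE app_uptime_seconds gauge\n"
        << "app_uptime_seconds " << uptimeSeconds << "\n";
    return out.str();
}
```

A handler serving this text would typically set the content type to `text/plain; version=0.0.4`. For anything beyond a handful of gauges, a client library such as prometheus-cpp is likely a better fit than hand-written formatting.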

Monitoring MySQL Clusters on Google Cloud

For MySQL, especially in a clustered or highly available setup (like Cloud SQL or a self-managed cluster on GCE), monitoring needs to cover instance health, replication status, query performance, and resource utilization.

Key MySQL Metrics to Monitor

Replication Lag: Crucial for read replicas. Use `Seconds_Behind_Master` (or `Seconds_Behind_Source` in newer versions).
Connection Count: High connection counts can indicate performance issues or resource exhaustion.
Query Performance: Slow query log analysis, `SHOW PROCESSLIST` for long-running queries.
InnoDB Metrics: Buffer pool hit rate, I/O activity (reads/writes), deadlocks.
Disk I/O: Throughput and latency for the data volumes.
CPU/Memory Usage: Instance-level resource consumption.
Replication Errors: Check `SHOW REPLICA STATUS` (or `SHOW SLAVE STATUS`) for `Last_Error`.
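The buffer pool hit rate in the list above is derived rather than read directly: it is commonly computed from the `Innodb_buffer_pool_read_requests` and `Innodb_buffer_pool_reads` status counters. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the inputs are assumed to be the differences between two samples, since both counters are cumulative since server start.

```cpp
#include <cstdint>

// readRequestsDelta: growth of Innodb_buffer_pool_read_requests (logical reads)
// diskReadsDelta:    growth of Innodb_buffer_pool_reads (reads that went to disk)
// Returns false when no reads happened in the interval, so the rate is undefined.
bool bufferPoolHitRate(std::uint64_t readRequestsDelta,
                       std::uint64_t diskReadsDelta,
                       double& hitRate) {
    if (readRequestsDelta == 0) {
        return false;
    }
    if (diskReadsDelta >= readRequestsDelta) {
        hitRate = 0.0;  // Clamp inconsistent samples instead of going negative
        return true;
    }
    hitRate = 1.0 - static_cast<double>(diskReadsDelta) /
                        static_cast<double>(readRequestsDelta);
    return true;
}
```

Computing the rate over an interval rather than from the raw counters keeps a long-running server's history from masking a recent drop.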

Leveraging Cloud SQL Insights

If you are using Cloud SQL, Cloud SQL Insights provides a managed way to monitor query performance, identify slow queries, and analyze database load. It’s the first line of defense for database performance issues.

Custom MySQL Monitoring with Ops Agent

For more granular control or for self-managed MySQL on GCE, the Ops Agent can be configured to collect specific MySQL metrics. This often involves using tools like `mysqld_exporter` (for Prometheus) and then scraping that exporter with the Ops Agent, or directly querying MySQL for status variables.

Example: Monitoring Replication Lag with Ops Agent

The Ops Agent has no receiver that runs arbitrary scripts, but it ships a built-in `mysql` metrics receiver that connects to the server and reports status and replication metrics. A small script is still useful for checking replication by hand and for cross-checking what the agent reports.

1. Create a MySQL Query Script

#!/bin/bash

# Script to print MySQL replication lag in seconds.
# Assumes ~/.my.cnf exists with credentials for user 'monitor', and that
# this user has the REPLICATION CLIENT privilege.
# On MySQL older than 8.0.22, use SHOW SLAVE STATUS instead.

MYSQL_HOST="localhost"
MYSQL_PORT="3306"

# Rely on the client's exit code rather than grepping for "Error":
# a replica's status output always contains fields such as Last_Error.
if ! REPLICA_STATUS=$(mysql -h "$MYSQL_HOST" -P "$MYSQL_PORT" -e "SHOW REPLICA STATUS\G" 2>/dev/null); then
    echo "ERROR: Could not connect to MySQL or get replica status." >&2
    exit 1
fi

# Extract Seconds_Behind_Source (or Seconds_Behind_Master on older versions)
LAG=$(echo "$REPLICA_STATUS" | awk '/Seconds_Behind_(Source|Master):/ {print $2; exit}')

# If empty, this server is the primary or is not configured for replication
if [ -z "$LAG" ]; then
    echo "INFO: Not a replica or replication not configured. Setting lag to 0." >&2
    LAG=0
fi

# NULL means the replication threads are not running, which is a failure
# rather than a lag value.
if [ "$LAG" = "NULL" ]; then
    echo "ERROR: Replication is not running (lag is NULL)." >&2
    exit 2
fi

echo "mysql.replication_lag $LAG"

Save the script as `/usr/local/bin/check_mysql_lag.sh` and make it executable: `chmod +x /usr/local/bin/check_mysql_lag.sh`

2. Configure Ops Agent (`/etc/google-cloud-ops-agent/config.yaml`)

metrics:
  receivers:
    mysql:
      type: mysql
      endpoint: localhost:3306
      username: monitor
      password: "REPLACE_WITH_PASSWORD" # Placeholder; use the monitor user's password
      collection_interval: 60s
  service:
    pipelines:
      mysql:
        receivers: [mysql]

Restart the Ops Agent after applying changes:

sudo systemctl restart google-cloud-ops-agent

The receiver’s metrics appear in Cloud Monitoring under the `workload.googleapis.com/mysql.` prefix. Agent versions that collect replication metrics report lag as `workload.googleapis.com/mysql.replica.time_behind_source`; check the agent’s MySQL documentation for the version you run. You can then create alerting policies based on this metric (e.g., alert if lag > 60 seconds).
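When alerting on lag, it helps to keep three cases apart in the `SHOW REPLICA STATUS` output: a numeric lag, a `NULL` lag (replication threads not running), and no lag field at all (the server is not a replica). The sketch below parses the `\G`-style output into those cases; `LagState`, `ReplicaLag`, and `parseReplicaLag` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>

enum class LagState { NotReplica, Unknown, Known };

struct ReplicaLag {
    LagState state;
    long long seconds;  // Meaningful only when state == LagState::Known
};

// Scans "Field: value" lines for the lag field under either its new or old name.
ReplicaLag parseReplicaLag(const std::string& statusOutput) {
    static const char* const kKeys[] = {"Seconds_Behind_Source:",
                                        "Seconds_Behind_Master:"};
    std::istringstream lines(statusOutput);
    std::string line;
    while (std::getline(lines, line)) {
        for (const char* key : kKeys) {
            const std::size_t pos = line.find(key);
            if (pos == std::string::npos) continue;
            const std::string rest = line.substr(pos + std::string(key).size());
            const std::size_t first = rest.find_first_not_of(" \t\r");
            if (first == std::string::npos) return {LagState::Unknown, 0};
            const std::size_t last = rest.find_last_not_of(" \t\r");
            const std::string value = rest.substr(first, last - first + 1);
            if (value == "NULL") return {LagState::Unknown, 0};
            try {
                return {LagState::Known, std::stoll(value)};
            } catch (const std::exception&) {
                return {LagState::Unknown, 0};  // Not a number: treat as unknown
            }
        }
    }
    return {LagState::NotReplica, 0};  // Field absent: not a replica
}
```

Treating `NULL` as its own state matters because reporting it as zero would make a stopped replica look perfectly caught up.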

Monitoring MySQL Cluster Health with Cloud Monitoring Alerts

Beyond basic metrics, set up robust alerting. For MySQL clusters, consider alerts for:

Replication Lag Threshold: As configured above.
High Connection Count: Alert if `Threads_connected` exceeds a predefined threshold (e.g., 80% of `max_connections`).
Replication Errors: Monitor `SHOW REPLICA STATUS` for non-empty `Last_Error`. This can be done via a custom script or by parsing MySQL error logs if they are ingested by Cloud Logging.
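Disk Space Full: Alert on usage of the volume that holds the MySQL data directory, for example with the Ops Agent’s `agent.googleapis.com/disk/percent_used` host metric; `df -h` is useful for manual checks.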
Unhealthy Instances: If using GCE, combine instance health checks with MySQL-specific checks.
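The connection-count rule in the list above is a ratio check against `max_connections`. A minimal sketch of that arithmetic, with a hypothetical function name and the 80% figure as the default:

```cpp
// Returns true when threadsConnected exceeds the given fraction of maxConnections.
// A maxConnections of zero is treated as "unknown" and never triggers.
bool connectionsAboveThreshold(unsigned long long threadsConnected,
                               unsigned long long maxConnections,
                               double fraction = 0.8) {
    if (maxConnections == 0) {
        return false;
    }
    return static_cast<double>(threadsConnected) >
           fraction * static_cast<double>(maxConnections);
}
```

With MySQL’s default `max_connections` of 151, the 80% line sits at 120.8, so 121 connections would trigger and 120 would not.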

Example Alerting Policy (Conceptual)

In the Google Cloud Console, navigate to Monitoring > Alerting. Create a new policy:

Condition Type: Metric Threshold
Resource Type: GCE VM Instance (or Cloud SQL Instance)
Metric: the replication lag metric collected by the Ops Agent (or `cloudsql.googleapis.com/database/replication/replica_lag` for Cloud SQL)
Filter: Specify the VM instance name or Cloud SQL instance ID.
Trigger:

Threshold Position: Above
Threshold Value: 60 (seconds)
For: 5 minutes (to avoid flapping alerts)

Configure notification channels (e.g., PagerDuty, Slack, Email) to receive alerts.
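The "For: 5 minutes" setting means a single sample above the threshold is not enough; the violation has to persist for the whole duration. The sketch below is a simplified model of that rule, not Cloud Monitoring’s actual evaluation logic; `Sample` and `thresholdHeldFor` are hypothetical names, and samples are assumed to be sorted by ascending timestamp.

```cpp
#include <vector>

struct Sample {
    long long timestampSeconds;  // Sample time in seconds; any consistent epoch works
    double value;
};

// Walks backwards from the newest sample. Returns true only if an unbroken run
// of samples above the threshold spans at least windowSeconds.
bool thresholdHeldFor(const std::vector<Sample>& samples,
                      double threshold,
                      long long windowSeconds) {
    if (samples.empty()) {
        return false;
    }
    const long long newest = samples.back().timestampSeconds;
    for (auto it = samples.rbegin(); it != samples.rend(); ++it) {
        if (it->value <= threshold) {
            return false;  // Run broken before it covered the window
        }
        if (newest - it->timestampSeconds >= windowSeconds) {
            return true;   // Violation has persisted for the full window
        }
    }
    return false;  // Not enough history yet
}
```

With one sample per minute, a threshold of 60, and a 300-second window, six consecutive samples above 60 are needed before this returns true, so a brief spike does not fire the alert.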

Integrating C++ App and MySQL Monitoring for Holistic Observability

The true power comes from correlating events. When your C++ application experiences increased latency or errors, you need to quickly check if it correlates with MySQL performance degradation (e.g., high replication lag, slow queries) or if the application itself is unhealthy. Cloud Monitoring’s dashboards and incident management features are key here.

Creating a Unified Dashboard

Build a custom dashboard in Cloud Monitoring that displays:

C++ Application Health Status (from `/healthz` endpoint, e.g., count of 503 errors).
C++ Application Latency metrics (if exposed).
MySQL Replication Lag.
MySQL Connection Count.
MySQL Slow Query Rate (from Cloud SQL Insights or custom logs).
GCE VM CPU/Memory Utilization for both app and DB instances.

This dashboard provides a single pane of glass to assess the overall health of your stack. When an alert fires, you can immediately jump to the dashboard to investigate potential root causes across your C++ application and MySQL cluster.