Server Monitoring Best Practices: Keeping Your C App and PostgreSQL Clusters Alive on Google Cloud

Proactive PostgreSQL Cluster Health Checks with pg_cron

Maintaining the health of a PostgreSQL cluster, especially in a distributed cloud environment like Google Cloud, requires more than just reactive alerts. Proactive, automated checks are crucial for identifying potential issues before they impact production. We’ll leverage pg_cron, a PostgreSQL extension, to schedule regular maintenance and health checks directly within the database cluster.

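First, ensure pg_cron is installed and enabled on your PostgreSQL instances. On self-managed servers, add pg_cron to shared_preload_libraries in postgresql.conf and restart PostgreSQL. On Cloud SQL for PostgreSQL, where postgresql.conf is not directly editable, set the cloudsql.enable_pg_cron instance flag to on instead. In both cases, finish by running CREATE EXTENSION pg_cron; in the database named by cron.database_name (postgres by default), since that is where pg_cron keeps its metadata and where jobs are scheduled.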

Once enabled, you can schedule jobs using SQL. A fundamental check is to monitor for long-running queries. These can indicate performance bottlenecks or lock contention (true deadlocks are detected and aborted by PostgreSQL itself). We’ll create a job that runs every 15 minutes to identify queries exceeding a certain threshold (e.g., 5 minutes).

Scheduling Long-Running Query Detection

Connect to your primary PostgreSQL instance using psql or your preferred SQL client and execute the following:

-- Create a schema and table to store findings
CREATE SCHEMA IF NOT EXISTS pg_cron_jobs;
CREATE TABLE IF NOT EXISTS pg_cron_jobs.long_running_queries (
    pid INTEGER,
    user_name VARCHAR(255),
    database_name VARCHAR(255),
    query TEXT,
    start_time TIMESTAMP WITH TIME ZONE,
    duration INTERVAL,
    logged_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Schedule the job to run every 15 minutes
SELECT cron.schedule(
    'detect-long-queries',
    '*/15 * * * *',
    $$
    INSERT INTO pg_cron_jobs.long_running_queries (pid, user_name, database_name, query, start_time, duration)
    SELECT
        s.pid,
        s.usename,
        s.datname,
        s.query,
        s.query_start,
        NOW() - s.query_start AS duration
    FROM pg_stat_activity s
    WHERE s.state = 'active'
      AND s.backend_type = 'client backend' -- skip autovacuum workers, WAL senders, etc.
      AND NOW() - s.query_start > INTERVAL '5 minutes'
      AND s.pid <> pg_backend_pid();
    $$
);

This job records every active query that has been running for more than 5 minutes in a dedicated table. Because it runs every 15 minutes, a query that stays active is logged again on each run, so alert on recent rows (by logged_at) rather than on row count alone. You would then set up external monitoring (e.g., Cloud Monitoring, Prometheus Alertmanager) to alert on new entries in pg_cron_jobs.long_running_queries. Regularly purging old entries from this table is also essential; a separate pg_cron job can handle this.

Automated Vacuum and Analyze Jobs

Stale statistics and bloated tables are common performance killers in PostgreSQL. pg_cron can automate VACUUM and ANALYZE operations. PostgreSQL's autovacuum already does this in the background and can be tuned, but explicit scheduling gives you more control over timing, especially for large, busy tables.

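Consider scheduling a targeted VACUUM ANALYZE on critical tables during off-peak hours (VACUUM FULL is also an option, but use it with caution because of its locking). Two constraints shape the job definition: VACUUM only processes the database the session is connected to, and it cannot run inside a transaction block, a function, or a DO block. The scheduled command must therefore be a plain VACUUM statement. For this example, we’ll schedule a daily VACUUM ANALYZE in the database where pg_cron is installed, plus one extra job per additional user database using cron.schedule_in_database (available in pg_cron 1.4 and later).

-- Schedule a daily VACUUM ANALYZE in the database where pg_cron is installed
SELECT cron.schedule(
    'daily-vacuum-analyze',
    '0 3 * * *', -- Run at 03:00 daily (pg_cron uses GMT by default)
    'VACUUM ANALYZE'
);

-- Repeat for each additional user database ("appdb" is a placeholder name)
SELECT cron.schedule_in_database(
    'daily-vacuum-analyze-appdb',
    '0 3 * * *',
    'VACUUM ANALYZE',
    'appdb'
);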

Important Note on VACUUM FULL: While powerful for reclaiming space, VACUUM FULL rewrites the whole table under an ACCESS EXCLUSIVE lock, blocking all reads and writes until it finishes, and it needs enough free disk space for the new copy. It should be used sparingly and scheduled during maintenance windows. For most cases, a regular VACUUM ANALYZE (or relying on autovacuum with proper tuning) is sufficient.

Monitoring Your C Application with Prometheus and Grafana on GKE

For a custom C application deployed on Google Kubernetes Engine (GKE), integrating Prometheus for metrics collection and Grafana for visualization is a robust solution. This allows you to track application-specific performance indicators, resource utilization, and error rates.

Instrumenting Your C Application

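The first step is to instrument your C application to expose metrics in a format Prometheus can scrape. This walkthrough uses the prometheus/client_c library, which provides C functions to define and update counters, gauges, histograms, and summaries. Prometheus does not maintain an official C client, so check the header names and function signatures below against the library you actually link; community clients such as digitalocean/prometheus-client-c may differ in both.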

Here’s a simplified example of how you might instrument a C application to expose request counts and latency:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include "prometheus/client_c/metrics.h"
#include "prometheus/client_c/expose.h"

// Define metrics
metric_family_t *request_counter_family;
metric_t *request_total;

metric_family_t *request_latency_histogram_family;
metric_t *request_latency_seconds;

// Function to initialize metrics
void init_metrics() {
    // Initialize Prometheus client library
    prom_init();

    // Create a counter for total requests
    request_counter_family = prom_counter_family_new("myapp_requests_total", "Total number of requests received.");
    request_total = prom_counter_new(request_counter_family, "myapp_requests_total", "Total number of requests received.");
    prom_register_metric(request_counter_family);

    // Create a histogram for request latency
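    // Finite bucket upper bounds; this sketch leaves the +Inf bucket to the library
    double buckets[] = {0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0};
    size_t bucket_count = sizeof(buckets) / sizeof(buckets[0]); // 11, derived so it cannot drift from the array
    request_latency_histogram_family = prom_histogram_family_new("myapp_request_duration_seconds", "Request latency in seconds.", bucket_count, buckets);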
    request_latency_seconds = prom_histogram_new(request_latency_histogram_family, "myapp_request_duration_seconds", "Request latency in seconds.");
    prom_register_metric(request_latency_histogram_family);

    // Start the HTTP server to expose metrics
    // This should ideally run in a separate thread or be managed by your application's event loop
    prom_expose_start_server(8080); // Expose on port 8080
}

// Function to simulate processing a request
void process_request() {
    // Increment request counter
    prom_counter_inc(request_total);

    // Measure latency
    struct timespec start_time, end_time;
    clock_gettime(CLOCK_MONOTONIC, &start_time);

    // Simulate work
    usleep((rand() % 500000) + 100000); // Simulate 0.1 to 0.6 seconds of work

    clock_gettime(CLOCK_MONOTONIC, &end_time);
    double duration = (end_time.tv_sec - start_time.tv_sec) + (end_time.tv_nsec - start_time.tv_nsec) / 1e9;

    // Observe latency in histogram
    prom_histogram_observe(request_latency_seconds, duration);

    printf("Request processed in %.3f seconds.\n", duration);
}

int main() {
    init_metrics();

    printf("Application started. Metrics exposed on http://localhost:8080/metrics\n");

    // Main application loop
    while (1) {
        process_request();
        sleep(1); // Simulate receiving requests every second
    }

    return 0;
}

Compile this code with the prometheus/client_c library linked. Call prom_expose_start_server once during startup and keep the process running so the /metrics endpoint stays reachable. In a GKE environment, you’ll also need to expose this port via a Kubernetes Service, using a named port (metrics in the ServiceMonitor example later in this article) so Prometheus can discover it.
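To make concrete what Prometheus receives when it scrapes that endpoint, here is a minimal sketch that renders a single unlabeled counter in the Prometheus text exposition format. The helper name render_counter is hypothetical and not part of any client library; a real client generates this output for you, including labels and escaping, which this sketch leaves out.

```c
#include <assert.h>
#include <stdio.h>

/* Render one unlabeled counter in the Prometheus text exposition format.
 * Returns the number of bytes written, or -1 if the buffer is too small.
 * render_counter is a hypothetical helper, not a client-library function. */
int render_counter(char *buf, size_t buf_len,
                   const char *name, const char *help, unsigned long value) {
    assert(buf != NULL && name != NULL && help != NULL);
    int n = snprintf(buf, buf_len,
                     "# HELP %s %s\n"
                     "# TYPE %s counter\n"
                     "%s %lu\n",
                     name, help, name, name, value);
    if (n < 0 || (size_t)n >= buf_len) {
        return -1; /* encoding error or truncated output */
    }
    return n;
}
```

Calling render_counter with the name myapp_requests_total and a value of 42 yields a HELP line, a TYPE line, and the sample line myapp_requests_total 42, which is the shape of response the /metrics path should return.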

Deploying Prometheus and Grafana on GKE

The most straightforward way to deploy Prometheus and Grafana on GKE is by using the Prometheus Operator, often managed via Helm. This operator simplifies the deployment and management of Prometheus, Alertmanager, and related components.

First, add the Prometheus community Helm repository:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Next, install the kube-prometheus-stack chart, which bundles the Prometheus Operator, Prometheus, Alertmanager, and Grafana. You can customize the installation using a values.yaml file. Key configurations include enabling the Prometheus Operator itself, setting up Prometheus instances, and configuring Grafana.

# prometheus-operator-values.yaml
prometheusOperator:
  enabled: true

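prometheus:
  prometheusSpec:
    # Left empty, these selectors fall back to the chart default: only monitors
    # labeled release: <Helm release name> are scraped. To scrape every monitor,
    # also set serviceMonitorSelectorNilUsesHelmValues and
    # podMonitorSelectorNilUsesHelmValues to false.
    serviceMonitorSelector: {}
    podMonitorSelector: {}
    retention: 10d             # Data retention period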

grafana:
  enabled: true
  adminPassword: "your_strong_admin_password" # Change this!
  persistence:
    enabled: true
    size: 10Gi
  ingress:
    enabled: true
    hosts:
      - grafana.your-domain.com # Configure your Ingress host
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.your-domain.com

# Example of a ServiceMonitor to scrape your C application
# This would typically be in a separate YAML file applied to your cluster
# ---
# apiVersion: monitoring.coreos.com/v1
# kind: ServiceMonitor
# metadata:
#   name: myapp-servicemonitor
#   labels:
#     release: prometheus # This label must match your Prometheus instance's selector
# spec:
#   selector:
#     matchLabels:
#       app: myapp # Label on your Kubernetes Service for the C app
#   namespaceSelector:
#     matchNames:
#       - default # Namespace where your C app is deployed
#   endpoints:
#   - port: metrics # Name of the port in your Kubernetes Service
#     interval: 30s
#     path: /metrics # The path where your C app exposes metrics

Install the chart:

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f prometheus-operator-values.yaml

After installation, you’ll need to create a ServiceMonitor (or PodMonitor if you’re not using a Service) that tells Prometheus how to find and scrape your C application’s metrics endpoint. The example ServiceMonitor is commented out in the values.yaml above; apply it as a separate resource.

Configuring Grafana Dashboards

Once Grafana is up and running (accessible via the Ingress you configured), log in with the admin credentials. Prometheus will be automatically configured as a data source. You can then import or create dashboards to visualize your C application’s metrics. Look for dashboards that can display counters (like request rates) and histograms (for latency distributions).
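A raw counter only ever goes up, so dashboards usually graph its rate, for example with the PromQL expression rate(myapp_requests_total[5m]). As a rough illustration of the idea, the sketch below computes a per-second rate from two samples and treats a drop in value as a counter reset after a restart. The helper counter_rate is hypothetical, and it is a two-sample simplification: Prometheus's own rate() works over every sample in the range and extrapolates to the window edges.

```c
#include <assert.h>

/* Per-second rate between two samples of a monotonically increasing counter.
 * If the counter went backwards, assume the process restarted and the counter
 * began again from zero, so the increase is just the current value.
 * counter_rate is a hypothetical helper for illustration only. */
double counter_rate(double prev_value, double curr_value, double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    double increase = (curr_value >= prev_value)
                          ? curr_value - prev_value
                          : curr_value; /* counter reset detected */
    return increase / elapsed_seconds;
}
```

With samples of 100 and 160 taken 30 seconds apart, the function reports 2 requests per second; if the second sample were 15 instead, it would assume a restart and report 0.5.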

For histograms, Grafana can render them as heatmaps or bar charts, allowing you to see the distribution of request latencies. Key metrics to monitor include:

Request rate (myapp_requests_total counter, often graphed as a rate over time).
Request latency distribution (myapp_request_duration_seconds histogram).
Application-specific error counts.
Resource utilization (CPU, memory) of your application pods, which Prometheus scrapes from the kubelet's cAdvisor endpoint; kube-state-metrics and node-exporter add Kubernetes object state and node-level metrics.
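Latency histograms are stored as cumulative bucket counts, so percentiles such as p95 are estimates rather than exact values. The sketch below shows the usual estimation approach, assuming observations are spread evenly inside each bucket, which is the same idea behind PromQL's histogram_quantile. The function estimate_quantile and its array layout are assumptions made for illustration, not an API of Prometheus or Grafana.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Estimate quantile q (0..1) from cumulative histogram buckets.
 * upper_bounds[i] is the "le" bound of bucket i in ascending order, and the
 * last entry must be INFINITY; cumulative_counts[i] is the number of
 * observations <= upper_bounds[i]. Returns NAN for an empty histogram. */
double estimate_quantile(double q, const double *upper_bounds,
                         const unsigned long *cumulative_counts, size_t n) {
    assert(n >= 2 && q >= 0.0 && q <= 1.0);
    assert(isinf(upper_bounds[n - 1]));
    unsigned long total = cumulative_counts[n - 1];
    if (total == 0) {
        return NAN;
    }
    double rank = q * (double)total;
    size_t i = 0;
    while (i < n - 1 && (double)cumulative_counts[i] < rank) {
        i++;
    }
    if (i == n - 1) {
        /* Rank falls in the +Inf bucket: report the highest finite bound. */
        return upper_bounds[n - 2];
    }
    double lower = (i == 0) ? 0.0 : upper_bounds[i - 1];
    double below = (i == 0) ? 0.0 : (double)cumulative_counts[i - 1];
    double in_bucket = (double)cumulative_counts[i] - below;
    if (in_bucket <= 0.0) {
        return lower; /* nothing to interpolate over */
    }
    /* Linear interpolation inside the bucket that contains the rank. */
    return lower + (upper_bounds[i] - lower) * (rank - below) / in_bucket;
}
```

The practical consequence is that bucket boundaries set the resolution of every percentile you graph: if most requests land in one wide bucket, the estimate can only interpolate across it, so choose boundaries close to the latencies you care about.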

Integrating Cloud Monitoring for PostgreSQL and GKE

While pg_cron and Prometheus provide deep insights into your PostgreSQL cluster and C application respectively, Google Cloud’s native Cloud Monitoring offers a centralized view and powerful alerting capabilities across your entire GCP infrastructure.

Leveraging Cloud Monitoring for PostgreSQL

Google Cloud SQL for PostgreSQL automatically exports a rich set of metrics to Cloud Monitoring. These include:

CPU utilization
Memory utilization
Disk I/O operations
Network traffic
Database connections (active, max)
Replication lag
Transaction throughput
Query performance metrics (though less granular than direct SQL queries or pg_stat_statements).

You can create custom dashboards in Cloud Monitoring to visualize these metrics. For example, a dashboard showing CPU, memory, active connections, and replication lag side-by-side provides a quick health overview.

Alerting on PostgreSQL Metrics:

Set up alerting policies for critical thresholds. Essential alerts include:

High CPU utilization (e.g., > 80% for 15 minutes).
Low disk space (e.g., < 10% free).
High replication lag (e.g., > 60 seconds).
Excessive active connections (e.g., > 90% of max_connections).
High I/O wait times.

To create an alert policy:

Navigate to Cloud Monitoring in the Google Cloud Console.
Go to “Alerting” and click “Create Policy”.
Select the metric (e.g., “Cloud SQL Database” -> “CPU utilization”).
Configure the filter (e.g., specific instance name).
Set the trigger condition (e.g., “above” 80 for “15 minutes”).
Configure notification channels (e.g., email, PagerDuty, Slack).
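The trigger condition in the steps above ("above 80 for 15 minutes") only fires when the metric stays over the threshold for the whole window, which is what keeps short spikes from paging anyone. A minimal sketch of that logic follows, assuming evenly spaced samples ordered oldest to newest. The function threshold_breached_for is hypothetical, and Cloud Monitoring's real evaluation also applies the alignment and aggregation settings of the policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when every sample in the trailing window exceeds the threshold.
 * samples are ordered oldest to newest and spaced sample_interval_s apart.
 * threshold_breached_for is a hypothetical helper for illustration only. */
bool threshold_breached_for(const double *samples, size_t count,
                            double threshold, unsigned window_s,
                            unsigned sample_interval_s) {
    assert(sample_interval_s > 0 && window_s >= sample_interval_s);
    size_t needed = window_s / sample_interval_s;
    if (count < needed) {
        return false; /* not enough history to cover the window yet */
    }
    for (size_t i = count - needed; i < count; i++) {
        if (samples[i] <= threshold) {
            return false; /* one dip inside the window resets the condition */
        }
    }
    return true;
}
```

With CPU samples taken every 300 seconds and a 900-second window, the last three samples must all be above 80 for the condition to hold; a single reading of 79 inside the window prevents the alert, while an older low reading outside it does not.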

Monitoring GKE Pods and Nodes with Cloud Monitoring

GKE integrates seamlessly with Cloud Monitoring. By default, it collects metrics from your cluster’s nodes and pods. This includes:

Node CPU, memory, disk, and network usage.
Pod CPU and memory usage.
Kubernetes API server metrics.
Workload health (e.g., pod restarts).

You can also deploy the Cloud Operations for GKE agent (formerly Stackdriver Kubernetes agent) to collect more detailed metrics, including container-level metrics and logs, directly into Cloud Logging and Cloud Monitoring.

Custom Metrics from C Application to Cloud Monitoring:

While Prometheus is excellent for internal cluster monitoring, you might want to push key metrics from your C application directly to Cloud Monitoring for unified alerting and dashboarding. You can achieve this by:

Using the Cloud Monitoring API client libraries in your C application (requires more development effort).
Having a sidecar container in your GKE pod that scrapes Prometheus metrics and pushes them to Cloud Monitoring using the OpenTelemetry Collector (the successor to the now-retired OpenCensus agent).
Using a Prometheus remote write exporter that targets Cloud Monitoring’s Prometheus ingestion endpoint.

The remote write approach is often the most practical for existing Prometheus setups. You’ll configure your Prometheus instance to forward metrics to the Cloud Monitoring endpoint.

# Example remote write snippet (placed under prometheus.prometheusSpec in the Helm values;
# a standalone prometheus.yml spells the key remote_write)
remoteWrite:
  - url: "https://REMOTE_WRITE_ENDPOINT" # Placeholder: use the ingestion URL documented for your remote write receiver
    # Authentication would typically be handled by the Kubernetes service account
    # or by providing credentials explicitly.
    # Cloud Monitoring's metricDescriptors API is not a remote write receiver, so a
    # more robust solution puts a dedicated exporter or collector in between, such as
    # Google Cloud Managed Service for Prometheus or the OpenTelemetry Collector.

Alerting on GKE Metrics:

Similar to Cloud SQL, set up alerts for GKE resources:

High pod restart counts.
High CPU/memory utilization on nodes.
Pods in `CrashLoopBackOff` or `Error` states.
Deployment/StatefulSet rollout failures.

These alerts can be configured directly within Cloud Monitoring, leveraging the Kubernetes-related metrics available.

Establishing a Unified Monitoring Strategy

The goal is not to have disparate monitoring systems but a cohesive strategy. Cloud Monitoring serves as the central nervous system for infrastructure-level alerts and high-level dashboards. Prometheus provides deep, application-centric metrics for your C application and potentially for PostgreSQL if you deploy exporters like postgres_exporter. pg_cron automates database maintenance and health checks, feeding findings into your alerting pipeline.

Key principles for a unified strategy:

Centralized Alerting: Route critical alerts from all sources (Cloud Monitoring, Prometheus Alertmanager) to a single incident management system (e.g., Opsgenie, VictorOps, PagerDuty).
Layered Dashboards:

Cloud Monitoring: Infrastructure overview, PostgreSQL health, GKE cluster status.
Grafana: Deep dives into C application performance, custom PostgreSQL metrics if exported.

Correlation: Ensure you can correlate events. For example, if Cloud Monitoring alerts on high PostgreSQL CPU, you should be able to quickly pivot to Grafana or Prometheus to see if a specific query or application behavior is the cause.
Automated Remediation: For certain predictable issues (e.g., restarting a pod that’s unhealthy), consider integrating with Cloud Functions or Kubernetes operators to trigger automated recovery actions based on alerts.

By combining the strengths of Google Cloud’s native monitoring, Prometheus for application-level observability, and automated database checks with pg_cron, you build a resilient and observable system capable of handling the demands of production workloads.