Server Monitoring Best Practices: Keeping Your PHP App and PostgreSQL Clusters Alive on Google Cloud

Proactive PostgreSQL Cluster Health Checks with pg_cron and Custom Metrics

Maintaining the health of a PostgreSQL cluster, especially in a distributed cloud environment like Google Cloud, requires more than just reactive alerts. Proactive, automated checks are crucial. We’ll leverage pg_cron, a PostgreSQL extension, to schedule regular, lightweight checks directly within the database. This approach minimizes external dependencies and ensures checks run in the same context as the data they’re inspecting.

Our strategy involves two key components: scheduled vacuuming/analyzing and custom metric collection for critical database states. For vacuuming, PostgreSQL’s autovacuum daemon is essential, but sometimes manual intervention or more granular control is needed. For custom metrics, we’ll identify key indicators of performance degradation or potential issues that standard monitoring tools might miss.

Configuring pg_cron for Scheduled Tasks

First, ensure the pg_cron extension is installed and enabled on your PostgreSQL instances. This typically involves adding pg_cron to shared_preload_libraries in your postgresql.conf and then running CREATE EXTENSION pg_cron; in each database you want to manage.

We’ll schedule a daily `VACUUM FULL` on a specific, less-critical table as a demonstration. Caution: VACUUM FULL locks the table exclusively and rewrites the entire table. Use it judiciously, preferably during maintenance windows or on tables where bloat is a significant, persistent problem. For most scenarios, relying on autovacuum and occasional `VACUUM` (without `FULL`) is sufficient.

The syntax for scheduling jobs with pg_cron uses cron-like expressions. Here’s how to schedule a `VACUUM FULL` on a table named audit_logs every day at 3 AM:

SELECT cron.schedule(
    'daily-vacuum-audit-logs',
    '0 3 * * *',
    $$VACUUM FULL audit_logs;$$
);

To verify the schedule, you can query the cron.job table:

SELECT * FROM cron.job WHERE job_name = 'daily-vacuum-audit-logs';

Custom PostgreSQL Metrics for Cloud Monitoring

Google Cloud’s operations suite (formerly Stackdriver) provides excellent infrastructure and application monitoring. However, for deep PostgreSQL insights, we need to push custom metrics. We can create a PostgreSQL function that gathers specific metrics and then use pg_cron to execute this function periodically, writing its output to a table that a custom script can then scrape.

Let’s create a function to capture the number of dead tuples, bloat estimates, and replication lag. We’ll need the pgstattuple extension for bloat estimation. Install it with CREATE EXTENSION pgstattuple;.

CREATE OR REPLACE FUNCTION collect_pg_metrics()
RETURNS VOID AS $$
DECLARE
    r RECORD;
    replication_lag_sec INT := 0;
BEGIN
    -- Collect bloat and dead tuple information for critical tables
    FOR r IN SELECT
                 schemaname,
                 relname,
                 n_live_tup,
                 n_dead_tup,
                 pg_size_pretty(pg_table_size(oid)) AS table_size,
                 pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
                 (SELECT current_setting('wal_level')) AS wal_level,
                 (SELECT pg_catalog.pg_is_in_recovery()) AS is_replica
             FROM pg_stat_user_tables
             WHERE n_dead_tup > 0 AND n_live_tup > 0 -- Only interested in tables with some activity and dead tuples
             ORDER BY n_dead_tup DESC
             LIMIT 10 -- Limit to top 10 tables with dead tuples
    LOOP
        -- Estimate bloat using pgstattuple
        DECLARE
            t_info pgstattuple%ROWTYPE;
        BEGIN
            SELECT * INTO t_info FROM pgstattuple(r.schemaname || '.' || r.relname);
            INSERT INTO pg_metrics (metric_timestamp, metric_name, metric_value, table_name, schema_name, wal_level, is_replica)
            VALUES (NOW(), 'dead_tuples', r.n_dead_tup, r.relname, r.schemaname, r.wal_level, r.is_replica);
            INSERT INTO pg_metrics (metric_timestamp, metric_name, metric_value, table_name, schema_name, wal_level, is_replica)
            VALUES (NOW(), 'live_tuples', r.n_live_tup, r.relname, r.schemaname, r.wal_level, r.is_replica);
            INSERT INTO pg_metrics (metric_timestamp, metric_name, metric_value, table_name, schema_name, wal_level, is_replica)
            VALUES (NOW(), 'table_size_bytes', pg_table_size(r.schemaname || '.' || r.relname), r.relname, r.schemaname, r.wal_level, r.is_replica);
            INSERT INTO pg_metrics (metric_timestamp, metric_name, metric_value, table_name, schema_name, wal_level, is_replica)
            VALUES (NOW(), 'total_relation_size_bytes', pg_total_relation_size(r.schemaname || '.' || r.relname), r.relname, r.schemaname, r.wal_level, r.is_replica);
            INSERT INTO pg_metrics (metric_timestamp, metric_name, metric_value, table_name, schema_name, wal_level, is_replica)
            VALUES (NOW(), 'estimated_bloat_bytes', t_info.free_percent * pg_table_size(r.oid), r.relname, r.schemaname, r.wal_level, r.is_replica);
        EXCEPTION WHEN OTHERS THEN
            -- Handle cases where pgstattuple might fail (e.g., temporary tables)
            RAISE NOTICE 'Could not get pgstattuple for %.% : %', r.schemaname, r.relname, SQLERRM;
        END;
    END LOOP;

    -- Collect replication lag if this is a replica
    IF (SELECT pg_catalog.pg_is_in_recovery()) THEN
        SELECT COALESCE(EXTRACT(EPOCH FROM (pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn())), 0) INTO replication_lag_sec;
        INSERT INTO pg_metrics (metric_timestamp, metric_name, metric_value, wal_level, is_replica)
        VALUES (NOW(), 'replication_lag_seconds', replication_lag_sec, (SELECT current_setting('wal_level')), TRUE);
    END IF;

    -- Collect connection count
    DECLARE
        current_connections INT;
        max_connections INT;
    BEGIN
        SELECT count(*) INTO current_connections FROM pg_stat_activity;
        SELECT setting::int INTO max_connections FROM pg_settings WHERE name = 'max_connections';
        INSERT INTO pg_metrics (metric_timestamp, metric_name, metric_value, wal_level, is_replica)
        VALUES (NOW(), 'current_connections', current_connections, (SELECT current_setting('wal_level')), (SELECT pg_catalog.pg_is_in_recovery()));
        INSERT INTO pg_metrics (metric_timestamp, metric_name, metric_value, wal_level, is_replica)
        VALUES (NOW(), 'max_connections', max_connections, (SELECT current_setting('wal_level')), (SELECT pg_catalog.pg_is_in_recovery()));
    END;

END;
$$ LANGUAGE plpgsql;

Before creating the function, you need a table to store these metrics. Create it in a dedicated monitoring database or a schema within your main database.

CREATE TABLE IF NOT EXISTS pg_metrics (
    metric_timestamp TIMESTAMP WITH TIME ZONE,
    metric_name VARCHAR(100),
    metric_value NUMERIC,
    table_name VARCHAR(100),
    schema_name VARCHAR(100),
    wal_level VARCHAR(50),
    is_replica BOOLEAN
);

Now, schedule this function to run every 5 minutes using pg_cron:

SELECT cron.schedule(
    'collect-pg-custom-metrics',
    '*/5 * * * *',
    $$SELECT collect_pg_metrics();$$
);

Exporting Custom Metrics to Google Cloud Monitoring

With custom metrics being collected in the pg_metrics table, we need a mechanism to export them to Google Cloud Monitoring. A simple Python script running on a Compute Engine instance or within a GKE pod can achieve this. This script will periodically query the pg_metrics table, transform the data into Google Cloud Monitoring’s `TimeSeries` format, and use the Cloud Monitoring API to write the metrics.

First, ensure you have the Google Cloud client libraries for Python installed:

pip install google-cloud-monitoring psycopg2-binary

Here’s a Python script to export the metrics. This script assumes your PostgreSQL database is accessible from where it’s running (e.g., via a private IP, Cloud SQL proxy, or public IP with appropriate firewall rules). It also assumes you have authenticated your environment to Google Cloud (e.g., via `gcloud auth application-default login` or a service account key).

import os import psycopg2 from google.cloud import monitoring_v3 from google.protobuf.timestamp_pb2 import Timestamp import time from datetime import datetime, timezone # --- Configuration --- DB_HOST = os.environ.get("DB_HOST", "your-postgres-host") DB_PORT = os.environ.get("DB_PORT", "5432") DB_NAME = os.environ.get("DB_NAME", "your-db-name") DB_USER = os.environ.get("DB_USER", "your-db-user") DB_PASSWORD = os.environ.get("DB_PASSWORD", "your-db-password") PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT") # Automatically set in GCE/GKE or via env var MONITORING_INTERVAL_SECONDS = 60 # How often to fetch and send metrics # --- Google Cloud Monitoring Client --- client = monitoring_v3.MetricServiceClient() project_name = f"projects/{PROJECT_ID}" # --- PostgreSQL Connection --- def get_db_connection(): try: conn = psycopg2.connect( host=DB_HOST, port=DB_PORT, dbname=DB_NAME, user=DB_USER, password=DB_PASSWORD ) return conn except psycopg2.Error as e: print(f"Error connecting to PostgreSQL: {e}") return None # --- Metric Conversion --- def create_time_series(metric_type, value, timestamp, resource_type, resource_labels, metric_labels=None): if metric_labels is None: metric_labels = {} series = monitoring_v3.TimeSeries() series.metric.type = metric_type series.resource.type = resource_type series.resource.labels.update(resource_labels) series.metric.labels.update(metric_labels) point = monitoring_v3.Point() point.value.double_value = float(value) point.interval.end_time.seconds = int(timestamp.timestamp()) point.interval.end_time.nanos = timestamp.microsecond * 1000 series.points.append(point) return series # --- Main Export Loop --- def export_metrics(): conn = get_db_connection() if not conn: return cursor = conn.cursor() try: # Fetch metrics from the pg_metrics table # We'll fetch metrics since the last export to avoid duplicates and handle potential restarts # For simplicity here, we fetch all, but a 'last_exported_timestamp' would be better in production cursor.execute(""" SELECT metric_timestamp, metric_name, metric_value, table_name, schema_name, wal_level, is_replica FROM pg_metrics ORDER BY metric_timestamp ASC; """) rows = cursor.fetchall() if not rows: print("No new metrics to export.") return time_series_list = [] now = datetime.now(timezone.utc) # Determine resource type and labels based on environment # This is a simplified example; you'd likely want to detect GCE instance, GKE pod, etc. # For Cloud SQL, use 'cloudsql_database' resource type. # For GCE, use 'gce_instance'. For GKE, use 'k8s_container'. # Assuming this script runs on a GCE instance for demonstration: resource_type = "gce_instance" resource_labels = { "project_id": PROJECT_ID, "instance_id": os.environ.get("INSTANCE_ID", "unknown-instance"), # Get from metadata server if possible "zone": os.environ.get("INSTANCE_ZONE", "unknown-zone") # Get from metadata server if possible } # If running on Cloud SQL, you'd use: # resource_type = "cloudsql_database" # resource_labels = { # "project_id": PROJECT_ID, # "database_id": "your-cloudsql-instance-id" # e.g., "my-project:us-central1:my-db-instance" # } for row in rows: ts, metric_name, metric_value, table_name, schema_name, wal_level, is_replica = row metric_timestamp = ts if ts else now # Use fetched timestamp or current time metric_labels = { "wal_level": wal_level, "is_replica": str(is_replica).lower() # Cloud Monitoring expects string labels } if table_name: metric_labels["table_name"] = table_name if schema_name: metric_labels["schema_name"] = schema_name # Map PostgreSQL metric names to Cloud Monitoring metric types # You'll need to define these metric types in your Cloud Monitoring setup or use custom ones. # Example custom metric types: # custom.googleapis.com/postgres/dead_tuples # custom.googleapis.com/postgres/replication_lag_seconds # custom.googleapis.com/postgres/connection_count # custom.googleapis.com/postgres/bloat_bytes metric_type = f"custom.googleapis.com/postgres/{metric_name}" time_series = create_time_series( metric_type, metric_value, metric_timestamp, resource_type, resource_labels, metric_labels ) time_series_list.append(time_series) if time_series_list: print(f"Sending {len(time_series_list)} time series to Cloud Monitoring...") client.create_time_series(name=project_name, time_series=time_series_list) print("Metrics sent successfully.") # Optional: Clean up the pg_metrics table after successful export # This is crucial to prevent re-exporting and to manage table size. # You might want to keep data for a retention period. # For simplicity, we'll delete all processed rows. A more robust approach # would be to track the last exported timestamp. cursor.execute("DELETE FROM pg_metrics;") conn.commit() print("Cleaned up pg_metrics table.") except Exception as e: print(f"An error occurred during metric export: {e}") conn.rollback() finally: cursor.close() conn.close() if __name__ == "__main__": print("Starting PostgreSQL metrics exporter...") while True: export_metrics() time.sleep(MONITORING_INTERVAL_SECONDS)

To run this script reliably, consider deploying it as a Kubernetes Deployment in GKE, a service on Cloud Run, or a systemd service on a Compute Engine instance. Ensure the service account used by the script has the monitoring.metricWriter IAM role.

PHP Application Monitoring with OpenTelemetry and Cloud Trace

Monitoring your PHP application's performance and tracing requests across distributed systems is vital for debugging and optimization. Google Cloud's operations suite integrates seamlessly with OpenTelemetry, a vendor-neutral standard for observability. We'll focus on instrumenting a PHP application to send traces to Cloud Trace.

Instrumenting PHP with OpenTelemetry SDK

The OpenTelemetry PHP SDK allows you to automatically and manually instrument your code. For automatic instrumentation, it can capture requests, database queries, and other common operations without code changes.

First, install the necessary packages via Composer:

composer require open-telemetry/sdk
composer require open-telemetry/exporter-otlp
composer require open-telemetry/auto-instrumentation-http
composer require open-telemetry/auto-instrumentation-psr18
composer require open-telemetry/auto-instrumentation-pdo

Next, configure the OpenTelemetry SDK. This typically involves setting up a tracer provider and an exporter. For Google Cloud Trace, we'll use the OTLP (OpenTelemetry Protocol) exporter, which can be configured to send data to the Cloud Trace agent or directly to the Cloud Trace API endpoint.

Create a bootstrap file (e.g., opentelemetry_bootstrap.php) that initializes the SDK. This file should be included early in your application's request lifecycle.

<?php
require __DIR__ . '/vendor/autoload.php';

use OpenTelemetry\API\Trace\TracerProviderInterface;
use OpenTelemetry\SDK\Trace\TracerProvider;
use OpenTelemetry\SDK\Trace\SpanProcessor\SimpleSpanProcessor;
use OpenTelemetry\SDK\Trace\SpanExporter\OtlpExporter;
use OpenTelemetry\SDK\Trace\Sampler\ParentBased;
use OpenTelemetry\SDK\Trace\Sampler\TraceIdRatioSampler;
use OpenTelemetry\Context\Context;
use OpenTelemetry\API\Globals;
use OpenTelemetry\Extension\AutoInstrumentation\Http\Psr18\HttpClientInstrumentation;
use OpenTelemetry\Extension\AutoInstrumentation\PDO\PDOInstrumentation;

// --- Configuration ---
// For Google Cloud Trace, you can export directly or via the OpenTelemetry Collector.
// Direct export to Cloud Trace API endpoint:
// $endpoint = 'https://cloudtrace.googleapis.com/v2/trace'; // Not directly supported by OtlpExporter, use collector or specific GCP exporter if available.
// The OtlpExporter typically sends to a collector or a gRPC endpoint.
// For Cloud Trace, it's common to send to the OpenTelemetry Collector, which then forwards to Cloud Trace.
// If running on GCE/GKE, the collector might be configured to use GCP's native trace ingestion.
// For simplicity, let's assume an OTLP collector is running locally or accessible.
$otlp_endpoint = getenv('OTEL_EXPORTER_OTLP_ENDPOINT') ?: 'http://localhost:4318'; // Default to local collector

// Sampler: Decide what percentage of traces to sample. 1.0 means 100%.
$sampler = new ParentBased(new TraceIdRatioSampler(1.0)); // Sample all traces

// Tracer Provider
$tracerProvider = TracerProvider::builder()
    ->withSampler($sampler)
    ->addSpanProcessor(new SimpleSpanProcessor(
        new OtlpExporter($otlp_endpoint)
    ))
    ->build();

// Set the global tracer provider
Globals::setTracerProvider($tracerProvider);

// --- Automatic Instrumentation ---
// HTTP Client Instrumentation (e.g., Guzzle, Symfony HttpClient)
HttpClientInstrumentation::register();

// PDO Instrumentation (for database queries)
PDOInstrumentation::register();

// --- Manual Instrumentation Example (Optional) ---
// You can also manually create spans for specific code blocks.
function my_custom_operation(TracerProviderInterface $tracerProvider) {
    $tracer = $tracerProvider->getTracer('com.example.myapp');
    $span = $tracer->spanBuilder('my_custom_operation')->startSpan();

    try {
        // Your custom logic here
        $span->setAttribute('custom.attribute', 'example_value');
        sleep(1); // Simulate work
    } catch (\Throwable $e) {
        $span->recordException($e);
        throw $e;
    } finally {
        $span->end();
    }
}

// --- Request Handling ---
// This part depends heavily on your framework (e.g., Symfony, Laravel, plain PHP).
// You need to ensure the tracer provider is available and spans are ended.
// For a simple PHP-FastCGI setup, you might do this at the start of your index.php.

// Get the global tracer provider
$globalTracerProvider = Globals::getTracerProvider();

// Start a root span for the incoming request
$requestSpan = $globalTracerProvider->getTracer('com.example.myapp.request')
    ->spanBuilder('HTTP Request')
    ->setSpanKind(OpenTelemetry\API\Trace\SpanKind::KIND_SERVER)
    ->startSpan();

// Set attributes for the request span (e.g., HTTP method, URL)
// This requires access to $_SERVER or your framework's request object.
$requestSpan->setAttribute('http.method', $_SERVER['REQUEST_METHOD'] ?? 'UNKNOWN');
$requestSpan->setAttribute('http.url', $_SERVER['REQUEST_URI'] ?? 'UNKNOWN');
$requestSpan->setAttribute('user.agent', $_SERVER['HTTP_USER_AGENT'] ?? 'UNKNOWN');

// Activate the span in the current context
$rootScope = Context::getRoot()->with($requestSpan);
Context::storage()->attach($rootScope);

// --- Include your application's main logic ---
// For example:
// require __DIR__ . '/src/App/bootstrap.php';
// require __DIR__ . '/public/index.php'; // If using a framework like Laravel/Symfony

// --- End of request processing ---
// Ensure the root span is ended. This should happen after your application logic has finished.
// In a real application, this might be handled by a framework's event listener or middleware.
register_shutdown_function(function () use ($requestSpan, $globalTracerProvider) {
    // Ensure the context is correctly managed if this shutdown function runs in a different context
    $currentContext = Context::storage()->get();
    if ($currentContext !== $rootScope) {
        Context::storage()->attach($rootScope);
    }

    $requestSpan->end();
    $globalTracerProvider->shutdown(); // Flush any remaining spans
});

// Example of calling a manual operation
// my_custom_operation($globalTracerProvider);

?>

In a framework like Symfony or Laravel, you would typically integrate this bootstrap logic into their respective bootstrapping processes, perhaps via a custom service provider or event listener.

Configuring the OpenTelemetry Collector for Google Cloud Trace

The OpenTelemetry Collector acts as a central point for receiving, processing, and exporting telemetry data. For sending traces to Google Cloud Trace, you'll configure the collector to receive OTLP data and export it to Google Cloud.

Here's a sample collector configuration file (otel-collector-config.yaml):

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    send_batch_size: 1000
    timeout: 10s
  memory_limiter:
    check_interval: 1s
    limit_mib: 200
    spike_limit_mib: 50
  attributes:
    actions:
      # Add GCP project ID and cluster name if running in GKE
      - key: gcp.project.id
        action: insert
        value: "your-gcp-project-id" # Replace with your project ID
      - key: k8s.cluster.name
        action: insert
        value: "your-gke-cluster-name" # Replace with your cluster name if applicable
      - key: cloud.provider
        action: insert
        value: "gcp"

exporters:
  googlecloudtrace:
    project: "your-gcp-project-id" # Replace with your project ID
    trace_client_options:
      # Optional: If running outside GCP, you might need to specify credentials.
      # For GCE/GKE, it uses the instance's service account.
      # credentials_file: "/path/to/your/service-account-key.json"

  logging: # Useful for debugging the collector itself
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [googlecloudtrace, logging] # Send to Cloud Trace and log for debugging

You can run the OpenTelemetry Collector as a Docker container or directly from its binary. Ensure the `googlecloudtrace` exporter is configured with the correct GCP project ID. If running outside Google Cloud, you'll need to provide service account credentials.

Once the collector is running and your PHP application is configured to send traces to it (e.g., via http://localhost:4318), you should start seeing traces in the Google Cloud Trace console.

Google Cloud Operations Suite: Alerting and Dashboards

With custom PostgreSQL metrics and application traces flowing into Google Cloud Operations Suite, the final step is to create actionable alerts and informative dashboards.

Setting Up PostgreSQL Alerts

Navigate to the Monitoring section in the Google Cloud Console, then go to "Alerting". Click "Create Policy".

Example Alert: High Replication Lag

Metric: Select your custom metric, e.g., custom.googleapis.com/postgres/replication_lag_seconds.
Filter: Filter by the resource type (e.g., gce_instance or cloudsql_database) and specific labels if needed (e.g., is_replica: true).
Transform: Use the mean or max aggregation over a suitable time window (e.g., 5 minutes).
Condition: Set a threshold. For example, "is above" 300 (seconds).
Trigger: Configure the number of minutes the condition must be met before triggering (e.g., 5 minutes).
Notifications: Configure notification channels (e.g., email, PagerDuty, Slack via Pub/Sub).
Documentation: Add runbooks or links to relevant documentation for the alert.

Example Alert: High Bloat/Dead Tuples

Metric: custom.googleapis.com/postgres/dead_tuples or a calculated metric for bloat percentage.
Filter: Filter by resource and potentially by table_name or schema_name.
Transform: Use mean or max. You might want to alert on tables with a high number of dead tuples relative to live tuples, or a high estimated bloat.
Condition: Alert if dead_tuples is above 1000000 (or a percentage threshold) for 15 minutes.
Notifications: Configure as above.

Building Application Performance Dashboards

Dashboards provide a consolidated view of your system's health. In Google Cloud Monitoring, go to "Dashboards" and click "Create Dashboard".

Key Dashboard Widgets:

PostgreSQL Cluster Health:
- Replication Lag (Line chart)
- Connection Count (Gauge or Line chart)
- CPU/Memory Usage (if monitoring GCE/Cloud SQL directly)
- Disk I/O (if applicable)
Application Performance:
- Request Latency (from Cloud Trace, e.g., p95, p99 latency)
- Error Rate (from Cloud Trace or application logs)
- Number of Spans per Service (from Cloud Trace)
- Custom application metrics (e.g., queue depth, cache hit rate)
Database Performance (from Traces):
- Average duration of database query spans (filtered by db.system=postgresql and span.kind=client)
- Count of slow database queries (e.g., > 500ms)

When creating widgets for custom metrics, ensure you select the correct metric type and apply appropriate filters and aggregations. For trace data, use the "Trace" widget type and filter by service name, span name, or attributes.

By combining proactive database checks with comprehensive application tracing and robust alerting, you can significantly improve the reliability and performance of your PHP applications running on PostgreSQL clusters in Google Cloud.