Server Monitoring Best Practices: Keeping Your PHP App and MySQL Clusters Alive on Google Cloud

Proactive MySQL Cluster Health Checks with `pt-heartbeat`

Maintaining the health and replication lag of MySQL clusters, especially in a distributed cloud environment like Google Cloud, is paramount. Relying solely on Cloud Monitoring’s basic metrics can leave you blind to subtle replication issues or performance degradations that impact application availability. A robust strategy involves deploying specialized tools that provide deeper insights into MySQL’s internal state. One such indispensable tool is Percona Toolkit’s `pt-heartbeat`.

pt-heartbeat monitors replication lag by writing a timestamp to a designated table on the primary and then checking the timestamp on replicas. This provides a real-time, granular view of how far behind each replica is. For a production MySQL cluster managed on Google Cloud (e.g., using Cloud SQL or self-managed on GCE), setting this up is a critical step.

Setting up `pt-heartbeat` on Replicas

First, ensure Percona Toolkit is installed on all your MySQL instances (primary and replicas). On a Debian/Ubuntu-based system, this is typically:

sudo apt-get update
sudo apt-get install percona-toolkit

Next, create a dedicated table on the primary MySQL server to store the heartbeat information. This table should be replicated to all replicas.

-- On the Primary MySQL Server
CREATE DATABASE IF NOT EXISTS heartbeat;
USE heartbeat;
CREATE TABLE IF NOT EXISTS hb (
  id INT PRIMARY KEY,
  ts DATETIME NOT NULL
) ENGINE=InnoDB;

Now, configure pt-heartbeat to run on each replica. It will connect to the replica, read the `heartbeat.hb` table, and calculate the lag based on the `ts` column. The command typically looks like this:

pt-heartbeat --host <replica_host> --user <replication_user> --password <replication_password> --database heartbeat --table hb --update-primary --interval 1 --daemonize

Explanation:

--host: The IP address or hostname of the replica.
--user, --password: Credentials for a user with sufficient privileges to read the `heartbeat.hb` table.
--database heartbeat --table hb: Specifies the table to monitor.
--update-primary: This crucial flag tells pt-heartbeat to also update the timestamp on the primary if it’s running on the primary itself. This is useful for a unified monitoring setup. If you run this *only* on replicas, you’ll need a separate process on the primary to *write* to the table.
--interval 1: Check every 1 second. Adjust as needed.
--daemonize: Run in the background.

To ensure the heartbeat table is consistently updated on the primary, you can run a similar command on the primary, or more commonly, use a cron job or a dedicated script. A simple cron job on the primary could execute:

-- On the Primary Server (via cron)
mysql -u<replication_user> -p<replication_password> -h localhost heartbeat -e "REPLACE INTO hb (id, ts) VALUES (1, NOW())"

This cron job should run frequently, e.g., every 10-30 seconds. The pt-heartbeat on the replicas will then read this timestamp and calculate the lag. The output of pt-heartbeat (when not daemonized) will show the calculated lag in seconds. You’ll want to capture this output and send it to Cloud Monitoring.

Integrating `pt-heartbeat` Lag into Cloud Monitoring

The raw output of pt-heartbeat needs to be ingested into Google Cloud Monitoring (formerly Stackdriver) to be actionable. A common approach is to use a custom agent or script that periodically runs pt-heartbeat (without --daemonize), parses its output, and pushes the metrics to Cloud Monitoring using the Cloud Monitoring API or the Ops Agent.

Using the Ops Agent for Custom Metrics

The Google Cloud Ops Agent is the recommended way to collect logs and metrics from your Compute Engine instances. You can configure it to run custom scripts and parse their output as metrics.

First, ensure the Ops Agent is installed and running on your replica instances. Then, you’ll need to create a configuration file for the metrics collection. This typically involves defining a receiver that executes a command and a processor that parses the output.

# /etc/google-cloud-ops-agent/config.yaml
metrics:
  receivers:
    mysql_heartbeat:
      type: command
      command: "pt-heartbeat --host <replica_host> --user <replication_user> --password <replication_password> --database heartbeat --table hb --interval 0 --quiet" # --interval 0 to run once, --quiet for cleaner output
      timeout: 60s
  processors:
    parse_mysql_heartbeat:
      type: metrics_parser
      record_separator: "\n"
      field_separator: " "
      # Example output: "1397988000.123456" (timestamp) or "0.000000" (lag)
      # We need to parse the lag value. pt-heartbeat outputs the lag directly when not daemonized.
      # If pt-heartbeat outputs "0.000000" it means 0 lag.
      # If it outputs a large number, that's the lag in seconds.
      # Let's assume pt-heartbeat outputs just the lag value when --quiet is used.
      # A more robust parsing might be needed depending on exact pt-heartbeat version and flags.
      # For simplicity, let's assume the output is just the numeric lag.
      metric_type: "custom.googleapis.com/mysql/replication_lag_seconds"
      value_key: "value"
      # If pt-heartbeat outputs something like "lag: 0.000000" you'd use:
      # parse_format: "lag: %{value}"
      # Assuming direct numeric output:
      parse_format: "%{value}"
  service:
    pipelines:
      metrics:
        receivers: [mysql_heartbeat]
        processors: [parse_mysql_heartbeat]

Important Considerations for the Ops Agent Configuration:

Replace placeholders like <replica_host>, <replication_user>, and <replication_password> with your actual credentials. Storing passwords directly in configuration files is a security risk; consider using environment variables or secrets management.
The pt-heartbeat command within the receiver should be adjusted. Using --interval 0 makes it run once per collection interval. --quiet can help simplify parsing.
The parse_format in the processor is critical. You’ll need to inspect the exact output of your pt-heartbeat command to ensure the parsing logic correctly extracts the lag value. If pt-heartbeat outputs the epoch timestamp of the last write, you’ll need to calculate the lag in the processor or by adjusting the pt-heartbeat command itself. A common output when run non-daemonized is just the lag in seconds.
The metric_type should follow the custom metrics naming convention.

After applying this configuration (e.g., by placing it at /etc/google-cloud-ops-agent/config.yaml and restarting the agent: sudo systemctl restart google-cloud-ops-agent), you should start seeing metrics like custom.googleapis.com/mysql/replication_lag_seconds appear in Cloud Monitoring for each replica where the agent is configured.

PHP Application-Level Monitoring with Prometheus Exporters

While infrastructure-level monitoring is crucial, understanding the performance and health of your PHP application from its own perspective is equally important. This includes tracking request latency, error rates, queue depths, and specific business metrics. Prometheus, with its client libraries and exporters, is a powerful choice for this.

Instrumenting PHP Code with Prometheus Client Library

The prometheus_client PHP library allows you to expose application metrics via an HTTP endpoint, which Prometheus can then scrape.

First, install the library using Composer:

composer require prometheus_client/prometheus_client

Next, instrument your PHP application. You’ll typically create metric objects (Counters, Gauges, Histograms) and update them as your application logic executes. You’ll also need an endpoint to expose these metrics.

<?php
require __DIR__ . '/vendor/autoload.php';

use Prometheus\CollectorRegistry;
use Prometheus\Render\CallbackRenderer;
use Prometheus\Storage\InMemory;

// Initialize registry and storage
$registry = new CollectorRegistry(new InMemory());

// Define metrics
$requestCounter = $registry->registerCounter(
    'my_php_app', 'http_requests_total', 'Total HTTP requests', ['method', 'code']
);
$requestLatency = $registry->registerHistogram(
    'my_php_app', 'http_request_duration_seconds', 'HTTP request duration in seconds', ['method', 'code']
);
$queueGauge = $registry->registerGauge(
    'my_php_app', 'message_queue_depth', 'Current depth of the message queue'
);

// --- Example Usage ---

// Simulate receiving a request
$method = $_SERVER['REQUEST_METHOD'] ?? 'UNKNOWN';
$startTime = microtime(true);

try {
    // Your application logic here...
    // Simulate some work
    sleep(rand(0, 2));

    // Simulate a successful response
    $statusCode = 200;
    $queueDepth = rand(0, 100);
    $queueGauge->set($queueDepth);

    // Record metrics
    $requestCounter->inc(['method' => $method, 'code' => $statusCode]);
    $requestLatency->observe(microtime(true) - $startTime, ['method' => $method, 'code' => $statusCode]);

    http_response_code($statusCode);
    echo "Request processed successfully.\n";

} catch (\Exception $e) {
    // Simulate an error response
    $statusCode = 500;
    $requestCounter->inc(['method' => $method, 'code' => $statusCode]);
    $requestLatency->observe(microtime(true) - $startTime, ['method' => $method, 'code' => $statusCode]);

    http_response_code($statusCode);
    echo "An error occurred: " . $e->getMessage() . "\n";
}

// --- Metrics Endpoint ---
// This part should ideally be handled by a dedicated script or a routing mechanism
// to avoid exposing metrics on every application request.
if (isset($_GET['metrics'])) {
    header('Content-Type: text/plain');
    $renderer = new CallbackRenderer($registry);
    echo $renderer->render();
    exit;
}
?>

In a production environment, you would typically have a separate script or a dedicated route (e.g., /metrics) that only serves the Prometheus metrics endpoint. This prevents every application request from also triggering the metrics rendering, which can add overhead.

Setting up Prometheus and Grafana on Google Cloud

To scrape and visualize these metrics, you’ll need a Prometheus server and a Grafana instance. You can deploy these on Google Cloud Compute Engine instances, or leverage managed services if available.

Prometheus Configuration (`prometheus.yml`):

global:
  scrape_interval: 15s # How often to scrape targets

scrape_configs:
  - job_name: 'php_application'
    static_configs:
      - targets: ['<your_php_app_ip>:80'] # Assuming your PHP app runs on port 80 and exposes /metrics
        # If you have a dedicated metrics endpoint script, adjust the port and path accordingly.
        # For example, if your metrics endpoint is at http://<your_php_app_ip>:8080/metrics
        # targets: ['<your_php_app_ip>:8080']
        # metric_path: /metrics

  - job_name: 'mysql_replicas'
    static_configs:
      - targets:
          - '<replica1_ip>:9104' # Assuming mysqld_exporter is running on port 9104
          - '<replica2_ip>:9104'
          # Add all your replica IPs

  - job_name: 'mysql_heartbeat_custom'
    # This job scrapes the custom metrics pushed by the Ops Agent.
    # This requires Prometheus to be configured to scrape the Ops Agent's metrics endpoint,
    # or you might need to configure the Ops Agent to expose metrics in Prometheus format.
    # A simpler approach for custom metrics is often to use the Cloud Monitoring API directly
    # from your application or a dedicated script, rather than relying on Prometheus scraping.
    # However, if you MUST use Prometheus, you'd typically run a Prometheus exporter for Cloud Monitoring
    # or configure the Ops Agent to expose Prometheus metrics.
    # For this example, let's assume you're using mysqld_exporter for MySQL metrics and
    # a separate mechanism for pt-heartbeat lag.
    # If you want to scrape pt-heartbeat lag via Prometheus, you'd need a custom exporter
    # that runs pt-heartbeat and exposes its output as a Prometheus metric.
    # Example custom exporter (Python):
    # https://github.com/prometheus/client_python/blob/master/examples/custom_exporter.py
    # You would then configure Prometheus to scrape this custom exporter.
    static_configs:
      - targets: ['<custom_exporter_ip>:9100'] # Assuming your custom exporter runs on port 9100

Grafana Dashboards:

Once Prometheus is scraping your PHP app and MySQL metrics, you can create dashboards in Grafana. You can find pre-built dashboards for MySQL (e.g., using mysqld_exporter) and create custom ones for your PHP application metrics (request rate, latency percentiles, error rates, queue depth).

Leveraging Cloud Monitoring for Alerting

While Prometheus and Grafana provide excellent visualization and real-time dashboards, Google Cloud Monitoring offers robust alerting capabilities that can integrate with your existing Google Cloud infrastructure and notification channels (e.g., Slack, PagerDuty, email).

Alerting on MySQL Replication Lag

If you’ve successfully ingested the pt-heartbeat lag into Cloud Monitoring as a custom metric (e.g., custom.googleapis.com/mysql/replication_lag_seconds), you can create alerting policies based on this metric.

# Example Alerting Policy Configuration (Conceptual - use Cloud Console UI or gcloud CLI)

# Metric: custom.googleapis.com/mysql/replication_lag_seconds
# Filter: resource.type="gce_instance" AND metric.labels.instance_name="your-replica-instance-name"
# Condition:
#   - Alert if the average replication lag over 5 minutes is greater than 60 seconds.
#   - Alert if the replication lag has been consistently above 30 seconds for 15 minutes.
#   - Trigger an incident when the condition is met.
# Notification Channels: Configure your preferred channels (e.g., Slack, Email, PagerDuty).

For each replica, you’ll want a specific alerting policy or a policy with filters that can distinguish between them. This ensures you get alerted to lag on a specific replica, allowing for targeted investigation.

Alerting on PHP Application Errors and Performance

For metrics collected via Prometheus and potentially pushed to Cloud Monitoring (or if you use Cloud Logging for application logs), you can set up alerts for:

High HTTP Error Rates: Alert when the percentage of 5xx responses exceeds a threshold (e.g., > 5% over 10 minutes).
Increased Latency: Alert when the 95th percentile of request latency exceeds a defined SLO (e.g., > 2 seconds for 15 minutes).
Low Request Volume: Alert if the total request count drops significantly, indicating a potential outage or deployment issue.
High Queue Depth: Alert if message queues become excessively large, signaling a processing bottleneck.

You can configure these alerts directly in the Cloud Monitoring console. If your Prometheus metrics are not directly ingested into Cloud Monitoring, you might need to use a custom exporter that bridges Prometheus metrics to Cloud Monitoring API, or rely on Prometheus’s own alerting manager, which can then trigger external notification systems.

Essential System-Level Metrics on GCE

Beyond application-specific metrics, ensuring the underlying Compute Engine instances are healthy is fundamental. The Ops Agent collects a wealth of system metrics that should be monitored.

Key System Metrics to Monitor

CPU Utilization: High CPU usage on database servers or application servers can indicate performance bottlenecks or resource exhaustion. Set alerts for sustained high CPU (e.g., > 80% for 15 minutes).
Memory Usage: Monitor both overall memory usage and swap usage. High swap usage is a strong indicator of memory pressure.
Disk I/O: For database servers, disk I/O is critical. Monitor read/write operations per second, latency, and queue depth. High I/O wait times can severely impact database performance.
Network Traffic: Monitor inbound and outbound network traffic. Unexpected spikes or sustained high traffic can indicate issues or require scaling.
Disk Space: Ensure sufficient free disk space on all instances, especially database servers where logs and data files reside. Alert when free space drops below a critical threshold (e.g., < 10%).

These metrics are typically available out-of-the-box with the Ops Agent and can be configured for alerting within Cloud Monitoring. For example, monitoring CPU utilization on your primary MySQL instance:

# Example Alerting Policy Configuration (Conceptual)

# Metric: compute.googleapis.com/instance/cpu/utilization
# Filter: resource.type="gce_instance" AND resource.labels.instance_name="your-mysql-primary-instance"
# Condition:
#   - Alert if the average CPU utilization over 5 minutes is greater than 90%.
#   - Trigger an incident when the condition is met.
# Notification Channels: Configure your preferred channels.

By combining infrastructure-level monitoring (MySQL replication, system metrics) with application-level insights (PHP request metrics), and leveraging Cloud Monitoring for centralized alerting, you build a resilient and observable system on Google Cloud.