Server Monitoring Best Practices: Keeping Your PHP App and Elasticsearch Clusters Alive on Google Cloud

Proactive PHP Application Health Checks

Maintaining the health of a PHP application on Google Cloud Platform (GCP) requires more than just basic uptime monitoring. We need to implement deep, application-aware checks that can identify issues before they impact users. This involves a multi-layered approach, starting with simple HTTP probes and escalating to more sophisticated checks that interact with the application’s core logic and dependencies.

HTTP Probes with Custom Headers and Status Codes

Google Cloud’s built-in Load Balancer health checks are a good starting point. However, for PHP applications, especially those behind a reverse proxy like Nginx or Apache, we need to ensure the application itself is responding correctly, not just the web server. This means configuring health checks to hit a specific endpoint within your application that performs its own validation.

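Consider an endpoint like /healthz. It should not only return 200 OK when everything works but also check database connectivity, cache status, or other critical services, and return a non-2xx status such as 503 when any of them fails. Because the endpoint reveals internal state, limit who can call it. If your health checker can send one, a custom request header is one way to mark legitimate probes; accepting requests only from the checker's source IP ranges is another. Either approach prevents accidental hits from external sources and allows more granular control.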
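One way to keep external clients away from the health endpoint is to accept only the source ranges that Google documents for load balancer health check probes (35.191.0.0/16 and 130.211.0.0/22 at the time of writing). The following is a minimal sketch, not a complete access control layer: the function names are hypothetical, it handles IPv4 only, and it assumes the address you pass in is the real client address (behind Nginx or Apache, REMOTE_ADDR may be the proxy itself).

```php
<?php
/**
 * Returns true when an IPv4 address falls inside a CIDR block.
 */
function ipInCidr(string $ip, string $cidr): bool
{
    [$subnet, $bits] = explode('/', $cidr);
    $ipLong = ip2long($ip);
    $subnetLong = ip2long($subnet);
    if ($ipLong === false || $subnetLong === false) {
        return false; // Not a valid IPv4 address
    }
    $bits = (int) $bits;
    // Build a 32-bit network mask, e.g. /16 becomes 0xFFFF0000
    $mask = $bits === 0 ? 0 : (-1 << (32 - $bits)) & 0xFFFFFFFF;
    return ($ipLong & $mask) === ($subnetLong & $mask);
}

/**
 * Hypothetical helper: checks an address against the probe ranges above.
 * Verify the ranges against current Google Cloud documentation before use.
 */
function isGoogleHealthCheckProbe(string $ip): bool
{
    $ranges = ['35.191.0.0/16', '130.211.0.0/22'];
    foreach ($ranges as $cidr) {
        if (ipInCidr($ip, $cidr)) {
            return true;
        }
    }
    return false;
}
```

In the endpoint itself you might return 403 when isGoogleHealthCheckProbe($_SERVER['REMOTE_ADDR']) is false, or enforce the same ranges in a firewall rule instead.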

Implementing a Robust `/healthz` Endpoint in PHP

Here’s a basic example of a PHP script for a /healthz endpoint. This script checks a database connection and returns a JSON response with detailed status. For production, you’d want to expand this to check other dependencies like Redis, Memcached, or external APIs.

<?php
header('Content-Type: application/json');

$response = [
    'status' => 'unhealthy',
    'checks' => [],
    'timestamp' => date('c'),
];

// 1. Database Check
$db_host = getenv('DB_HOST') ?: 'localhost';
$db_name = getenv('DB_NAME') ?: 'myapp_db';
$db_user = getenv('DB_USER') ?: 'user';
$db_pass = getenv('DB_PASS') ?: 'password';

try {
    $pdo = new PDO("mysql:host={$db_host};dbname={$db_name};charset=utf8mb4", $db_user, $db_pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
        PDO::ATTR_TIMEOUT => 5, // 5-second timeout
    ]);
    $stmt = $pdo->query("SELECT 1");
    if ($stmt->fetchColumn()) {
        $response['checks']['database'] = 'ok';
    } else {
        throw new Exception("Database query failed.");
    }
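} catch (Exception $e) {
    // PDOException extends Exception, so one catch covers both failure paths.
    // Log the detail server-side; the raw message can expose hostnames or usernames.
    error_log('healthz database check failed: ' . $e->getMessage());
    $response['checks']['database'] = 'error';
}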

// 2. Add more checks here (e.g., Redis, external API)
// Example: Redis Check (assuming Predis library is installed)
/*
try {
    $redis = new Predis\Client([
        'scheme' => 'tcp',
        'host' => getenv('REDIS_HOST') ?: 'localhost',
        'port' => getenv('REDIS_PORT') ?: 6379,
    ]);
    $redis->ping();
    $response['checks']['redis'] = 'ok';
} catch (Exception $e) {
    $response['checks']['redis'] = 'error: ' . $e->getMessage();
}
*/

// Determine overall status
$all_checks_ok = true;
foreach ($response['checks'] as $service => $status) {
    if (strpos($status, 'error') !== false) {
        $all_checks_ok = false;
        break;
    }
}

if ($all_checks_ok) {
    $response['status'] = 'ok';
    http_response_code(200);
} else {
    http_response_code(503); // Service Unavailable
}

echo json_encode($response, JSON_PRETTY_PRINT);
exit;
?>
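As more dependencies are added, the per-check try/catch blocks start to repeat. One possible way to keep the endpoint uniform is a small runner that takes named callables, treats any thrown exception as a failure, and derives the overall status and HTTP code. This is a sketch under that assumption; runHealthChecks is a hypothetical name, not part of any library.

```php
<?php
/**
 * Runs named check callables and summarises the result.
 * Each callable returns normally on success and throws on failure.
 */
function runHealthChecks(array $checks): array
{
    $results = [];
    $healthy = true;
    foreach ($checks as $name => $check) {
        $start = microtime(true);
        try {
            $check();
            $results[$name] = ['status' => 'ok'];
        } catch (Throwable $e) {
            // Keep the public payload generic; log $e->getMessage() server-side if needed
            $healthy = false;
            $results[$name] = ['status' => 'error'];
        }
        $results[$name]['duration_ms'] = (int) round((microtime(true) - $start) * 1000);
    }
    return [
        'status' => $healthy ? 'ok' : 'unhealthy',
        'http_code' => $healthy ? 200 : 503,
        'checks' => $results,
    ];
}
```

A database check then becomes one entry such as 'database' => function () use ($pdo) { $pdo->query('SELECT 1'); }, and a Redis or external API check is added the same way.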

GCP Load Balancer Health Check Configuration

When configuring your GCP Load Balancer (e.g., HTTP(S) Load Balancer), you’ll set up a health check resource. Here’s how you’d configure it to use the /healthz endpoint:

gcloud compute health-checks create http php-app-healthz \
    --port 80 \
    --request-path=/healthz \
    --check-interval=10s \
    --timeout=5s \
    --unhealthy-threshold=3 \
    --healthy-threshold=2 \
    --global \
    --description="PHP App Health Check" \
    --proxy-header=NONE

Key Parameters:

--request-path=/healthz: Specifies the path to check.
--port 80: The port your backend instances are listening on (adjust if using HTTPS directly on instances).
--check-interval: How often to perform the check.
--timeout: How long to wait for a response.
--unhealthy-threshold: Number of consecutive failures to mark an instance unhealthy.
--healthy-threshold: Number of consecutive successes to mark an instance healthy.
--proxy-header=NONE: Controls whether a PROXY protocol header is prepended to the health check request. NONE is the default; use PROXY_V1 only if your backend is configured to expect the PROXY protocol.

Remember to associate this health check with your backend service. This ensures that unhealthy instances are automatically removed from the load balancer’s rotation.
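As a rough worked example of how the interval and threshold settings interact: a backend changes state only after the configured number of consecutive probe results, so detection takes on the order of check interval × threshold. The sketch below is illustrative arithmetic only; the function name is hypothetical, and real timing also depends on the timeout and on how probes are scheduled.

```php
<?php
/**
 * Rough estimate of how long a state change takes to be detected:
 * the number of consecutive probe results required times the probe interval.
 */
function approxStateChangeSeconds(int $checkIntervalSeconds, int $threshold): int
{
    if ($checkIntervalSeconds < 1 || $threshold < 1) {
        throw new InvalidArgumentException('Interval and threshold must be at least 1.');
    }
    return $checkIntervalSeconds * $threshold;
}

// With the values used above (10s interval, thresholds of 3 and 2):
echo approxStateChangeSeconds(10, 3), "\n"; // about 30 seconds to mark unhealthy
echo approxStateChangeSeconds(10, 2), "\n"; // about 20 seconds to mark healthy again
```

Lower values detect failures faster but make a single slow response more likely to take an instance out of rotation.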

Advanced PHP Application Metrics with Prometheus and Grafana

Beyond basic health checks, we need to collect detailed performance metrics from our PHP application. Prometheus is an excellent choice for this, and Grafana provides powerful visualization. We’ll use a PHP Prometheus client library to expose metrics.

Integrating the PHP Prometheus Client

First, install the Prometheus PHP client library via Composer:

composer require promphp/prometheus_client_php

Next, create an endpoint (e.g., /metrics) in your PHP application to expose these metrics. This endpoint will be scraped by the Prometheus server.

<?php
require 'vendor/autoload.php';

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\InMemory;

// Initialize registry and storage.
// InMemory lives only for the current PHP process; under PHP-FPM use a shared
// adapter such as Prometheus\Storage\APC or Prometheus\Storage\Redis instead.
$adapter = new InMemory();
$registry = new CollectorRegistry($adapter);

// Define custom metrics. The first argument is a namespace that the library
// prefixes to the metric name, so these are exposed as app_<name>.
// Counter for total requests
$counter = $registry->getOrRegisterCounter('app', 'http_requests_total', 'Total HTTP requests', ['method', 'path']);

// Gauge for current active requests (example, might need more complex tracking)
$gauge = $registry->getOrRegisterGauge('app', 'active_requests', 'Number of currently active requests');

// Histogram for request duration
$histogram = $registry->getOrRegisterHistogram('app', 'http_request_duration_seconds', 'HTTP request duration in seconds', ['method', 'path']);

// --- In your application's request handling logic ---
// Before processing a request:
// $gauge->inc(); // Increment active requests
// $counter->inc(['GET', '/some/path']); // Increment total requests for this path/method
// $startTime = microtime(true);

// After processing a request:
// $duration = microtime(true) - $startTime;
// $histogram->observe($duration, ['GET', '/some/path']);
// $gauge->dec(); // Decrement active requests
// ----------------------------------------------------

// In the /metrics endpoint:
header('Content-Type: ' . RenderTextFormat::MIME_TYPE);

$renderer = new RenderTextFormat();
echo $renderer->render($registry->getMetricFamilySamples());
exit;
?>

You’ll need to integrate the metric collection logic (commented out in the example) into your actual request handling flow. Typically you start a timer at the beginning of a request, then record the duration and increment the counters when the request completes, including when it fails.
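One way to make sure the duration is recorded even when a handler throws is to wrap the handler in try/finally. The sketch below is library-agnostic and uses hypothetical names: timeRequest is not part of the Prometheus client, and the $record callback stands in for whatever calls your histogram and counter.

```php
<?php
/**
 * Runs $handler and always reports method, path, and elapsed seconds
 * to $record, even when the handler throws.
 */
function timeRequest(string $method, string $path, callable $handler, callable $record)
{
    $start = microtime(true);
    try {
        return $handler();
    } finally {
        $record($method, $path, microtime(true) - $start);
    }
}
```

With the client library, $record could be a closure that calls $histogram->observe($seconds, [$method, $path]) and $counter->inc([$method, $path]). Use a route template rather than the raw URL for the path label so the number of label values stays bounded.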

Setting up Prometheus Server on GCP

Deploy a Prometheus server, typically as a Kubernetes Deployment or a Compute Engine instance. Configure Prometheus to scrape your PHP application’s /metrics endpoint. This involves adding a scrape configuration to your prometheus.yml file.

scrape_configs:
  - job_name: 'php_app'
    static_configs:
      - targets: ['your-php-app-instance-1:80', 'your-php-app-instance-2:80'] # Or use service discovery
    metrics_path: /metrics
    scheme: http # or https if your app uses TLS

For dynamic environments (like GKE), use Prometheus’s service discovery mechanisms (e.g., Kubernetes SD) to automatically find and scrape your application pods.

Visualizing Metrics with Grafana

Deploy Grafana on GCP and configure it to use your Prometheus server as a data source. Create dashboards to visualize key metrics like request latency, error rates (derived from counters and gauges), and resource utilization.
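To make "derived from counters" concrete: an error rate is the increase in an error counter divided by the increase in the total request counter over the same window. In Grafana you would normally express this with PromQL's rate() function; the PHP below only illustrates the arithmetic, and the function name and the choice to return null on a counter reset are assumptions of this sketch.

```php
<?php
/**
 * Fraction of requests that failed between two samples of monotonically
 * increasing counters. Returns null when the rate cannot be computed.
 */
function errorRate(int $errorsStart, int $errorsEnd, int $totalStart, int $totalEnd): ?float
{
    $total = $totalEnd - $totalStart;
    $errors = $errorsEnd - $errorsStart;
    if ($total <= 0 || $errors < 0) {
        return null; // No traffic in the window, or a counter reset between samples
    }
    return $errors / $total;
}
```

For example, 5 new errors across 100 new requests gives a rate of 0.05, which a dashboard would show as 5%.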

Elasticsearch Cluster Monitoring: Beyond Basic Node Status

Monitoring Elasticsearch clusters requires a deep understanding of its internal workings. We need to track not just node availability but also cluster health, shard status, indexing performance, and query latency. GCP’s operations suite (formerly Stackdriver) can ingest logs and metrics, but for detailed Elasticsearch-specific insights, Prometheus and Grafana are invaluable.

Elasticsearch Exporter for Prometheus

The community-maintained Elasticsearch exporter (prometheus-community/elasticsearch_exporter) is the standard way to get detailed metrics from your Elasticsearch cluster into Prometheus. Deploy this exporter as a separate service that can access your Elasticsearch cluster’s API.

# Example deployment using Docker
docker run -d \
  --name elasticsearch-exporter \
  -p 9114:9114 \
  quay.io/prometheuscommunity/elasticsearch-exporter:latest \
  --es.uri=http://your-elasticsearch-master-node:9200 \
  --es.indices \
  --es.shards

Key Flags:

--es.uri: The HTTP endpoint of your Elasticsearch cluster.
--es.indices: Collect metrics about indices.
--es.shards: Collect metrics about shards.

Cluster health metrics are collected by default and need no flag. In production, pin the image to a specific release tag rather than latest.

Configure your Prometheus server to scrape the exporter’s metrics endpoint (default is :9114).

scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['your-elasticsearch-exporter-host:9114']

Essential Elasticsearch Metrics to Monitor

With the exporter in place, focus on these critical metrics:

Cluster Health: elasticsearch_cluster_health_status, exposed as one series per color label (green, yellow, red) with a value of 1 for the current status. Alert aggressively when the yellow or red series is 1.
Node Status: elasticsearch_cluster_health_number_of_nodes (nodes currently in the cluster) and elasticsearch_cluster_health_up (1 if the exporter's last scrape of the cluster health endpoint succeeded).
Shard Status: elasticsearch_cluster_health_active_shards, elasticsearch_cluster_health_unassigned_shards. Unassigned shards are a major red flag.
Indexing Performance: elasticsearch_indices_indexing_index_total (documents indexed; apply rate() for throughput), elasticsearch_indices_indexing_index_time_seconds_total (cumulative time spent indexing).
Search Performance: elasticsearch_indices_search_query_total (queries executed; apply rate() for throughput), elasticsearch_indices_search_query_time_seconds (cumulative time spent on queries).
JVM Heap Usage: elasticsearch_jvm_memory_used_bytes, elasticsearch_jvm_memory_max_bytes (filter on area="heap"). High heap usage can lead to garbage collection pauses and instability. Aim to keep heap usage below 75-80%.
Disk Usage: elasticsearch_filesystem_data_available_bytes, elasticsearch_filesystem_data_size_bytes. Also monitor disk I/O on your Elasticsearch nodes. Elasticsearch is I/O intensive and requires sufficient disk space.

Metric names can differ between exporter versions, so confirm them against the exporter's /metrics output before building alerts.
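Two of the figures above are derived rather than read directly: heap usage as a fraction of the maximum, and average time per operation as cumulative time divided by operation count over a window. The sketch below shows that arithmetic in PHP for illustration; in practice you would compute it in PromQL, and the function names here are hypothetical.

```php
<?php
/**
 * Heap usage as a fraction of the configured maximum (0.0 to 1.0).
 */
function heapUsageRatio(float $usedBytes, float $maxBytes): float
{
    if ($maxBytes <= 0) {
        throw new InvalidArgumentException('Max heap must be positive.');
    }
    return $usedBytes / $maxBytes;
}

/**
 * Average seconds per operation over a window, from the increase in a
 * cumulative time counter and the increase in an operation counter.
 * Returns null when no operations happened in the window.
 */
function averageSecondsPerOperation(float $timeDeltaSeconds, float $countDelta): ?float
{
    return $countDelta > 0 ? $timeDeltaSeconds / $countDelta : null;
}

// 6 GiB used of an 8 GiB heap sits right at the lower edge of the 75-80% guideline
$ratio = heapUsageRatio(6 * 1024 ** 3, 8 * 1024 ** 3);
echo $ratio >= 0.75 ? "heap pressure\n" : "heap ok\n";
```

The same division applies to search: the increase in cumulative query time divided by the increase in query count gives average query latency for the window.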

GCP Operations Suite Integration for Logs and Basic Metrics

While Prometheus excels at time-series metrics, GCP’s Operations Suite is crucial for log aggregation and basic infrastructure monitoring. Ensure your PHP applications and Elasticsearch nodes are configured to send logs to Cloud Logging.

Log Aggregation for PHP Applications

Use the Ops Agent (the successor to the legacy Cloud Logging agent) on your Compute Engine instances, or configure your GKE pods to write application logs to stdout/stderr, where they are collected automatically. For PHP, this typically means configuring your php.ini or logging framework (like Monolog) to output to stderr or a file that the agent monitors.

; In php.ini or a custom conf.d file
error_log = /var/log/php/app.log
; Or configure Monolog to log to stdout/stderr
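If you want structured logs without pulling in a framework, one option is to emit one JSON object per line. The sketch below is an assumption-laden example rather than a required format: formatJsonLogLine is a hypothetical helper, and the severity, message, and time field names are chosen because JSON-aware log pipelines commonly recognise them; check which fields your agent configuration actually maps.

```php
<?php
/**
 * Formats one log entry as a single-line JSON string.
 * Context keys are merged in but cannot overwrite the core fields.
 */
function formatJsonLogLine(string $severity, string $message, array $context = []): string
{
    $entry = [
        'severity' => strtoupper($severity),
        'message' => $message,
        'time' => gmdate('Y-m-d\TH:i:s\Z'),
    ] + $context;
    return json_encode($entry, JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE) . "\n";
}
```

A call such as file_put_contents('php://stderr', formatJsonLogLine('error', 'DB timeout', ['route' => '/checkout'])) writes one parseable line per event. Monolog's JsonFormatter achieves a similar result if you already use that library.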

Then, configure the Ops Agent to collect these logs. For a Compute Engine instance, this might involve editing /etc/google-cloud-ops-agent/config.yaml:

logging:
  receivers:
    php_app_logs: # The receiver ID is used as the log name in Cloud Logging
      type: files
      include_paths:
        - /var/log/php/app.log
  processors:
    parse_json:
      type: parse_json # Only useful if your app logs in JSON format
  service:
    pipelines:
      php_app_pipeline:
        receivers:
          - php_app_logs
        processors:
          - parse_json

Log Aggregation for Elasticsearch

Elasticsearch itself generates extensive logs (e.g., elasticsearch.log, gc.log). Ensure these are also collected by the Ops Agent. The agent typically has built-in support for common application logs, but you might need to customize its configuration to point to your Elasticsearch log directory.

logging:
  receivers:
    elasticsearch_logs: # The receiver ID is used as the log name in Cloud Logging
      type: files
      include_paths:
        - /var/log/elasticsearch/*.log
  service:
    pipelines:
      elasticsearch_pipeline:
        receivers:
          - elasticsearch_logs

GCP Monitoring for Infrastructure and Alerting

Leverage GCP’s Monitoring service for infrastructure-level metrics (CPU, memory, network) and to set up alerts. You can create custom metrics based on your Prometheus data or use predefined metrics.

Creating Alerting Policies

Use Cloud Monitoring to create alerting policies. For example, you can alert on the elasticsearch_cluster_health_status metric collected from the exporter. The exporter reports this metric as one series per color label (green, yellow, red), with a value of 1 for the current status, so the condition below fires when the red series stays above 0 for 60 seconds.

# Example using gcloud CLI to create an alerting policy
gcloud alpha monitoring policies create \
    --display-name="Elasticsearch Cluster Unhealthy" \
    --condition-display-name="Cluster status is red" \
    --condition-filter='resource.type="prometheus_target" AND metric.type="prometheus.googleapis.com/elasticsearch_cluster_health_status/gauge" AND metric.labels.color="red"' \
    --if="> 0" \
    --duration="60s" \
    --trigger-count=1 \
    --combiner=OR \
    --notification-channels="projects/YOUR_PROJECT_ID/notificationChannels/YOUR_CHANNEL_ID"

This command assumes your Prometheus metrics are ingested into Cloud Monitoring, for example through Google Cloud Managed Service for Prometheus, which exposes them under the prometheus.googleapis.com prefix. The alpha command surface can change, so check gcloud alpha monitoring policies create --help for the flags your SDK version supports. Alternatively, you can create custom metrics directly within GCP Monitoring if you have agents pushing data.

Correlating Logs and Metrics

The true power comes from correlating your application logs with performance metrics. When an alert fires (e.g., high latency from Prometheus), you can quickly jump to Cloud Logging to examine the relevant logs from that time period. This drastically reduces Mean Time To Resolution (MTTR).
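The jump from an alert to the relevant logs can be partly automated by generating a Cloud Logging query for a window around the alert time. The sketch below builds such a filter string; the function name, the project ID, and the log ID are hypothetical placeholders, and the filter uses only the logName and timestamp comparisons from the Logging query language.

```php
<?php
/**
 * Builds a Cloud Logging filter covering $windowSeconds before and after
 * the alert time for one log in one project.
 */
function buildLogFilter(string $projectId, string $logId, DateTimeImmutable $alertTime, int $windowSeconds): string
{
    $utc = new DateTimeZone('UTC');
    $format = 'Y-m-d\TH:i:s\Z';
    $from = $alertTime->setTimezone($utc)->modify("-{$windowSeconds} seconds")->format($format);
    $to = $alertTime->setTimezone($utc)->modify("+{$windowSeconds} seconds")->format($format);

    return sprintf(
        'logName="projects/%s/logs/%s" AND timestamp>="%s" AND timestamp<="%s"',
        $projectId,
        $logId,
        $from,
        $to
    );
}
```

Pasting the resulting string into the Logs Explorer query box, or passing it to gcloud logging read, narrows the view to the minutes surrounding the incident.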

Conclusion: A Layered Monitoring Strategy

A robust monitoring strategy for PHP applications and Elasticsearch on GCP involves multiple layers: application-level health checks, detailed performance metrics via Prometheus, infrastructure monitoring with GCP’s native tools, and comprehensive log aggregation. By combining these approaches, you gain deep visibility into your system’s health, enabling proactive issue detection and rapid incident response.