Server Monitoring Best Practices: Keeping Your PHP App and Redis Clusters Alive on Google Cloud
Proactive PHP Application Health Checks with Cloud Monitoring
Effective monitoring of PHP applications on Google Cloud Platform (GCP) goes beyond basic uptime checks. We need to instrument our applications to expose internal health metrics that Cloud Monitoring can ingest and alert on. This involves creating custom metrics and leveraging the Cloud Monitoring API or the Ops Agent.
For a typical PHP application, critical health indicators include:
- Request latency (average, p95, p99)
- Error rates (HTTP 5xx, application-level exceptions)
- Database connection pool status
- Cache hit/miss ratios
- Background job queue lengths
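As a rough illustration of how the first two indicators can be derived from raw request data, the sketch below computes nearest-rank percentiles and a 5xx error rate. The sample values and the function names are hypothetical, and a production system would normally rely on histogram-based metrics rather than keeping every sample in memory.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses that carry an HTTP 5xx status."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


# Hypothetical samples for one reporting window
latencies_ms = [120, 95, 310, 88, 102, 1500, 97, 110, 105, 99]
statuses = [200, 200, 500, 200, 404, 200, 200, 503, 200, 200]

print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("5xx rate:", error_rate(statuses))
```

Note how a single slow request dominates p95 in a small window while barely moving the median, which is why tail percentiles are tracked alongside the average.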
Implementing Custom Metrics in PHP
The most robust way to expose custom metrics is to write them with the Cloud Monitoring client libraries. For simpler scenarios, or when you would rather not call the API from application code, we can expose metrics via a dedicated health check endpoint that returns a JSON payload. A custom collector can then poll this payload, or the same values can be re-exposed in a format the Ops Agent can scrape.
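To give a sense of what writing a custom metric involves, here is a minimal sketch that builds the JSON body for the Cloud Monitoring REST method projects.timeSeries.create. The project ID, the metric name php/active_db_connections, and the helper name are assumptions for illustration. Actually sending the body requires an authenticated POST to https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries, which the official client libraries handle for you.

```python
import json
from datetime import datetime, timezone


def build_time_series_body(project_id, metric_name, value, when=None):
    """Build a timeSeries.create request body carrying one integer gauge point."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "timeSeries": [{
            # Custom metric types live under the custom.googleapis.com/ prefix
            "metric": {"type": f"custom.googleapis.com/{metric_name}"},
            # 'global' is the simplest monitored resource; others need more labels
            "resource": {"type": "global", "labels": {"project_id": project_id}},
            "points": [{
                "interval": {"endTime": when.strftime("%Y-%m-%dT%H:%M:%SZ")},
                # The JSON encoding of int64 values is a string
                "value": {"int64Value": str(int(value))},
            }],
        }]
    }


# Hypothetical project and metric names
body = build_time_series_body("your-gcp-project-id", "php/active_db_connections", 12)
print(json.dumps(body, indent=2))
```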
Let’s consider a scenario where we want to track the number of active database connections and the count of pending background jobs. We’ll create a simple PHP script that exposes this data.
Health Check Endpoint Example
Assume you have a mechanism to track active database connections (e.g., through your PDO or mysqli connection pool) and a queue for background jobs (e.g., Redis or a dedicated queue system).
/healthz.php
<?php
// Assume these functions are implemented elsewhere in your application
// to retrieve actual metrics.
function get_active_db_connections() {
    // Replace with actual logic to get connection count
    // e.g., querying connection pool status or a dummy value
    return rand(5, 20);
}

function get_pending_background_jobs() {
    // Replace with actual logic to get pending job count from Redis/queue
    // e.g., using Redis `LLEN` command
    // For demonstration, returning a random value
    return rand(0, 50);
}

// Set content type to JSON
header('Content-Type: application/json');

// Prepare the metrics payload
$metrics = [
    'active_db_connections' => get_active_db_connections(),
    'pending_background_jobs' => get_pending_background_jobs(),
    // DateTime::ATOM is used because DateTime::ISO8601 is not strictly ISO 8601 compliant
    'timestamp' => (new DateTime('now', new DateTimeZone('UTC')))->format(DateTime::ATOM),
    'status' => 'ok' // Basic status indicator
];

// Output the JSON
echo json_encode($metrics);
exit;
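A payload like this is easy for a custom collector to read, but Prometheus-compatible scrapers expect the Prometheus text exposition format instead. The sketch below shows one possible translation. The php_app prefix, the extra _up gauge, and the sample payload are assumptions, and the code assumes the JSON keys are already valid metric names (letters, digits, and underscores).

```python
import json


def to_prometheus_text(payload, prefix="php_app"):
    """Render the numeric fields of a health payload as Prometheus gauge lines."""
    lines = []
    for key, value in sorted(payload.items()):
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # skips 'timestamp', 'status', and other non-numeric fields
        name = f"{prefix}_{key}"
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # Map the textual status onto a 1/0 gauge so it can be alerted on
    up = 1 if payload.get("status") == "ok" else 0
    lines.append(f"# TYPE {prefix}_up gauge")
    lines.append(f"{prefix}_up {up}")
    return "\n".join(lines) + "\n"


# Hypothetical response body from /healthz.php
sample = json.loads(
    '{"active_db_connections": 12, "pending_background_jobs": 3, '
    '"timestamp": "2024-01-01T00:00:00+00:00", "status": "ok"}'
)
print(to_prometheus_text(sample), end="")
```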
Configuring the Ops Agent for Metric Collection
The Ops Agent is the recommended way to collect logs and metrics from Compute Engine instances; GKE clusters use their own integrated logging and monitoring instead. We’ll configure the agent to scrape our application’s metrics endpoint.
Ops Agent Configuration (/etc/google-cloud-ops-agent/config.yaml)
The Ops Agent does not include a receiver that parses arbitrary JSON, so it cannot read the /healthz.php payload directly. It does include a prometheus receiver, so the configuration below assumes the same values are also served in Prometheus text format at a hypothetical /metrics.php endpoint. Export to Cloud Monitoring is built into the agent: a receiver only needs to be attached to a pipeline under service. Ensure your agent is installed and running.
metrics:
  # Define receivers for scraping metrics.
  receivers:
    # Receiver for our custom PHP metrics endpoint.
    php_health_check:
      type: prometheus
      config:
        scrape_configs:
          - job_name: "php_health_check"
            scrape_interval: 60s # Scrape every 60 seconds
            metrics_path: /metrics.php # Hypothetical endpoint serving Prometheus text format
            static_configs:
              - targets: ["localhost:80"] # Adjust port if your web server is different
  # Pipelines connect receivers to the agent's built-in Cloud Monitoring export.
  service:
    pipelines:
      # Pipeline for our custom PHP metrics.
      php_metrics_pipeline:
        receivers:
          - php_health_check
# Define logging pipelines.
# logs:
#   ... (your log collection configuration)
After updating the configuration, restart the Ops Agent:
sudo systemctl restart google-cloud-ops-agent
Monitoring Redis Clusters with Cloud Monitoring
Redis clusters, whether managed (Memorystore for Redis) or self-hosted on Compute Engine/GKE, require specific monitoring. Key metrics include:
- Memory usage (used_memory, used_memory_rss)
- CPU utilization
- Network traffic (total_net_input_bytes, total_net_output_bytes)
- Cache hit/miss ratio (derived from the keyspace_hits and keyspace_misses counters)
- Latency of Redis commands
- Number of connected clients
- Replication status (for master/replica setups)
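Several of the metrics above are fields in the output of the Redis INFO command, which is plain text made of "# Section" headers and key:value lines. As a small sketch, the parser below turns that text into a dictionary. The sample is a hypothetical, heavily abbreviated INFO response.

```python
def parse_redis_info(text):
    """Parse the text returned by the Redis INFO command into a flat dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, _, value = line.partition(":")
        info[key] = value
    return info


# Hypothetical, abbreviated INFO output
sample = """# Memory
used_memory:1048576
used_memory_rss:2097152
# Clients
connected_clients:42
# Replication
role:master
connected_slaves:1
"""

info = parse_redis_info(sample)
print(info["used_memory"], info["connected_clients"], info["role"])
```

Values are kept as strings because INFO mixes integers, floats, and free text; convert only the fields you intend to chart.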
Memorystore for Redis Metrics
Memorystore for Redis automatically exposes a comprehensive set of metrics to Cloud Monitoring. You can view these directly in the Cloud Console under “Monitoring” > “Metrics Explorer”. Common metrics include:
- redis.googleapis.com/stats/network_traffic
- redis.googleapis.com/stats/memory/usage
- redis.googleapis.com/stats/memory/usage_ratio
- redis.googleapis.com/stats/cpu_utilization
- redis.googleapis.com/clients/connected
For Memorystore, the primary focus is on setting up appropriate alerting policies based on these built-in metrics.
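As a sketch of what such a policy can look like, the snippet below builds an alert policy definition in the shape used by the Cloud Monitoring AlertPolicy resource and prints it as JSON. The display names, the 0.85 threshold, and the 20-minute duration are illustrative assumptions. It uses Memorystore's memory usage_ratio metric, which reports usage as a fraction of the instance's limit.

```python
import json


def memory_alert_policy(threshold_ratio=0.85, duration_seconds=1200):
    """Build an AlertPolicy body that fires when memory usage stays above a ratio."""
    return {
        "displayName": "Redis memory usage high",
        "combiner": "OR",
        "conditions": [{
            "displayName": f"Memory usage ratio above {threshold_ratio}",
            "conditionThreshold": {
                "filter": (
                    'metric.type="redis.googleapis.com/stats/memory/usage_ratio" '
                    'AND resource.type="redis_instance"'
                ),
                "comparison": "COMPARISON_GT",
                "thresholdValue": threshold_ratio,
                # The condition must hold for this long before the alert fires
                "duration": f"{duration_seconds}s",
                "aggregations": [{
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_MEAN",
                }],
            },
        }],
    }


print(json.dumps(memory_alert_policy(), indent=2))
```

The printed JSON can be saved to a file and used as the input when creating the policy through the API or the gcloud CLI; notification channels would still need to be attached.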
Monitoring Self-Hosted Redis Clusters
For self-hosted Redis, we can leverage the Ops Agent again, this time using its built-in Redis receiver, or we can query Redis ourselves (with `redis-cli` or a client library) and expose those metrics.
Option 1: Using the Ops Agent’s Redis Receiver
The Ops Agent has a built-in receiver for Redis that can scrape metrics directly from a running Redis instance. This is often the simplest approach for self-hosted Redis.
Ops Agent Configuration Snippet (/etc/google-cloud-ops-agent/config.yaml)
metrics:
  receivers:
    redis_metrics:
      type: redis
      # Address of the Redis instance to scrape. The receiver takes a single
      # address; for a Redis cluster, run the agent on every node (each one
      # scraping localhost) or define one receiver per node.
      address: "localhost:6379"
      collection_interval: "30s" # Scrape every 30 seconds
      # Optional: authentication
      # password: "your_redis_password"
  service:
    pipelines:
      redis_pipeline:
        receivers:
          - redis_metrics
Remember to restart the Ops Agent after applying this configuration.
Option 2: Custom Script with `redis-cli` and Ops Agent
If the built-in receiver doesn’t cover specific metrics you need, or for more granular control, you can write a custom script that queries Redis and exposes the results over HTTP (e.g., a JSON endpoint like the one shown for PHP) for a collector to poll.
Example Custom Script (/opt/redis_metrics_exporter.py)
import redis
from datetime import datetime, timezone
from flask import Flask, jsonify  # Requires: pip install Flask redis

# Configuration
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_PASSWORD = None  # Set to 'your_redis_password' if authentication is needed

app = Flask(__name__)


def utc_now():
    return datetime.now(timezone.utc).isoformat()


def get_redis_metrics():
    try:
        # Connect to Redis; a short timeout keeps the endpoint responsive if Redis hangs
        r = redis.Redis(host=REDIS_HOST, port=REDIS_PORT, password=REDIS_PASSWORD,
                        decode_responses=True, socket_timeout=2)
        info = r.info()  # One round trip for all sections; also verifies the connection
        metrics = {}
        # Basic metrics
        metrics['used_memory_bytes'] = int(info['used_memory'])
        metrics['used_memory_rss_bytes'] = int(info['used_memory_rss'])
        metrics['connected_clients'] = int(info['connected_clients'])
        metrics['instantaneous_ops_per_sec'] = int(info['instantaneous_ops_per_sec'])
        # Hits and misses are server-wide counters in the 'stats' section of INFO,
        # not per-database values in the 'keyspace' section
        metrics['keyspace_hits'] = int(info['keyspace_hits'])
        metrics['keyspace_misses'] = int(info['keyspace_misses'])
        # Calculate hit ratio (cumulative since the server started)
        hits = metrics['keyspace_hits']
        misses = metrics['keyspace_misses']
        if (hits + misses) > 0:
            metrics['keyspace_hit_ratio'] = (hits / (hits + misses)) * 100
        else:
            metrics['keyspace_hit_ratio'] = 0.0
        # Add timestamp
        metrics['timestamp'] = utc_now()
        metrics['status'] = 'ok'
        return metrics
    except (redis.exceptions.RedisError, KeyError) as e:
        # RedisError covers connection, timeout, and authentication failures
        return {'status': 'error', 'message': str(e), 'timestamp': utc_now()}


@app.route('/redis_healthz')
def healthz():
    metrics = get_redis_metrics()
    # Return 503 on failure so HTTP-level checks also notice the problem
    return jsonify(metrics), (200 if metrics['status'] == 'ok' else 503)


if __name__ == "__main__":
    # To run this script directly for testing:
    #   python /opt/redis_metrics_exporter.py
    # Then access http://localhost:5000/redis_healthz
    # For production, use a proper WSGI server like Gunicorn.
    app.run(host='127.0.0.1', port=5000)
You would then point a collector at this endpoint (e.g., http://localhost:5000/redis_healthz). Note that the Ops Agent does not parse arbitrary JSON; to collect these values with the agent, serve them in Prometheus text format and scrape them with its prometheus receiver.
Alerting Strategies in Cloud Monitoring
Once metrics are flowing into Cloud Monitoring, the next critical step is setting up alerts. Alerts should be actionable and tuned to prevent false positives while catching genuine issues early.
PHP Application Alerting Examples
Based on the custom metrics we’ve set up:
- High Error Rate: Alert if the rate of HTTP 5xx errors (available from Cloud Load Balancing metrics, App Engine’s built-in metrics, or your own instrumentation) exceeds a threshold (e.g., > 5% over 5 minutes).
- High Latency: Alert if p95 or p99 request latency exceeds a threshold (e.g., > 2 seconds for 10 minutes).
- Database Connection Saturation: Alert if active_db_connections (from our custom metric) is consistently high (e.g., > 90% of max pool size for 15 minutes).
- Background Job Backlog: Alert if pending_background_jobs exceeds a critical threshold (e.g., > 1000 jobs for 30 minutes), indicating a potential processing bottleneck.
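Each example above pairs a threshold with a duration, which is what keeps a brief spike from paging anyone. The sketch below illustrates that "sustained breach" logic on a list of samples. It is a simplified model with hypothetical data, not a description of how Cloud Monitoring evaluates conditions internally.

```python
def breached_for(samples, threshold, duration_seconds):
    """Return True if every sample in the trailing window exceeds the threshold.

    samples: list of (unix_timestamp_seconds, value) tuples in ascending time order.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - duration_seconds
    # Require data that reaches back to the start of the window
    if samples[0][0] > window_start:
        return False
    window = [value for ts, value in samples if ts >= window_start]
    return all(value > threshold for value in window)


# Hypothetical pending_background_jobs samples, one every 5 minutes
backlog = [(0, 400), (300, 1200), (600, 1300), (900, 1250),
           (1200, 1100), (1500, 1400), (1800, 1500), (2100, 1050)]

print(breached_for(backlog, threshold=1000, duration_seconds=1800))  # sustained backlog
print(breached_for([(0, 1200), (300, 900), (600, 1300)], 1000, 600))  # dips below mid-window
```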
Redis Cluster Alerting Examples
For both Memorystore and self-hosted Redis:
- High Memory Usage: Alert if redis.googleapis.com/stats/memory/usage_ratio (or custom equivalent) stays above 0.85, i.e., 85% of the allocated memory, for 20 minutes.
- High CPU Usage: Alert if CPU usage derived from redis.googleapis.com/stats/cpu_utilization (or custom equivalent) exceeds 75% for 15 minutes.
- Low Cache Hit Ratio: Alert if the calculated keyspace_hit_ratio (from the custom script) or Memorystore’s redis.googleapis.com/stats/cache_hit_ratio drops below 70% for 10 minutes.
- Excessive Clients: Alert if redis.googleapis.com/clients/connected (or custom equivalent) exceeds a predefined limit (e.g., > 10000 clients).
- Replication Lag: For self-hosted master/replica setups, monitor replication lag and alert if it exceeds a few seconds.
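One caveat for the hit ratio alert on self-hosted Redis: keyspace_hits and keyspace_misses are cumulative counters, so a ratio computed from the raw totals reflects the whole lifetime of the server and reacts slowly to recent changes. A windowed ratio can be computed from the difference between two successive samples, as in this sketch. The sample numbers are hypothetical.

```python
def windowed_hit_ratio(prev, curr):
    """Hit ratio in percent for the interval between two counter samples.

    prev and curr are dicts holding 'keyspace_hits' and 'keyspace_misses'.
    Returns None if there were no lookups or the counters were reset.
    """
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    if hits < 0 or misses < 0:
        return None  # counters went backwards, e.g. after a restart or CONFIG RESETSTAT
    total = hits + misses
    if total == 0:
        return None
    return 100.0 * hits / total


# Hypothetical samples taken 10 minutes apart
prev = {"keyspace_hits": 1_000_000, "keyspace_misses": 100_000}
curr = {"keyspace_hits": 1_000_600, "keyspace_misses": 100_400}

lifetime = 100.0 * curr["keyspace_hits"] / (curr["keyspace_hits"] + curr["keyspace_misses"])
print(f"lifetime ratio: {lifetime:.1f}%")
print(f"last window:    {windowed_hit_ratio(prev, curr):.1f}%")
```

In this example the lifetime ratio still looks healthy while the most recent window has already fallen below a 70% threshold.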
Advanced Considerations: Distributed Tracing and SLOs
For complex microservice architectures, relying solely on host-level metrics might not be sufficient. Consider:
- Distributed Tracing: Implement OpenTelemetry or use GCP’s operations suite (Cloud Trace) to trace requests across multiple services. This helps pinpoint latency bottlenecks in distributed systems.
- Service Level Objectives (SLOs): Define SLOs for critical user journeys (e.g., “99.9% of login requests complete within 500ms”). Cloud Monitoring can help track Service Level Indicators (SLIs) that feed into SLO compliance.
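To make the SLO idea concrete, here is a small sketch of the arithmetic behind a request-based SLO: the SLI is the fraction of good events, and the error budget is the number of bad events the target allows. The function name and the request counts are hypothetical.

```python
def slo_report(good_events, total_events, target=0.999):
    """Summarise the SLI and remaining error budget for a request-based SLO."""
    if not 0 < target < 1:
        raise ValueError("target must be between 0 and 1, exclusive")
    if total_events == 0:
        return {"sli": None, "allowed_bad": 0.0, "budget_remaining": 1.0}
    bad_events = total_events - good_events
    allowed_bad = (1 - target) * total_events
    return {
        "sli": good_events / total_events,
        "allowed_bad": allowed_bad,
        # 1.0 means untouched budget, 0.0 means exhausted, negative means overspent
        "budget_remaining": 1 - bad_events / allowed_bad,
    }


# Hypothetical month: 1,000,000 login requests, 600 slower than 500 ms
report = slo_report(good_events=999_400, total_events=1_000_000)
print(f"SLI: {report['sli']:.4%}")
print(f"Error budget remaining: {report['budget_remaining']:.0%}")
```

With a 99.9% target, one million requests allow roughly 1,000 bad ones, so 600 slow logins consume about 60% of the budget.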
By combining granular application-level metrics, robust infrastructure monitoring for Redis, and well-defined alerting policies, you can build a resilient and observable PHP application environment on Google Cloud.