Server Monitoring Best Practices: Keeping Your PHP App on Google Cloud and Its DynamoDB Tables Alive

Proactive PHP Application Health Checks with Cloud Monitoring

Effective server monitoring for PHP applications on Google Cloud Platform (GCP) hinges on a multi-layered approach. Beyond basic CPU and memory utilization, we need to inspect the application’s internal state, its dependencies, and its ability to serve requests. This involves instrumenting the PHP application itself and leveraging GCP’s native monitoring tools.

Implementing Custom PHP Health Check Endpoints

A robust health check endpoint within your PHP application is paramount. This endpoint should not only verify the web server is responding but also check critical dependencies like database connections, cache services, and external API reachability. We’ll create a simple but effective health check script.

Consider a file named healthcheck.php placed in your web root:

<?php
header('Content-Type: application/json');

$response = [
    'status' => 'unhealthy',
    'checks' => [],
    'timestamp' => date('c'),
];

// 1. Basic Web Server Check (implicit if script is reached)
$response['checks']['web_server'] = ['status' => 'ok'];

// 2. Database Connection Check (Example for MySQL/MariaDB)
$dbHost = getenv('DB_HOST') ?: 'localhost';
$dbUser = getenv('DB_USER') ?: 'user';
$dbPass = getenv('DB_PASS') ?: 'password';
$dbName = getenv('DB_NAME') ?: 'app_db';
$dbPort = getenv('DB_PORT') ?: '3306';

try {
    $dsn = "mysql:host={$dbHost};port={$dbPort};dbname={$dbName};charset=utf8mb4";
    $options = [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
        PDO::ATTR_TIMEOUT => 5, // 5-second timeout
    ];
    $pdo = new PDO($dsn, $dbUser, $dbPass, $options);
    $response['checks']['database'] = ['status' => 'ok', 'host' => $dbHost];
} catch (PDOException $e) {
    $response['checks']['database'] = ['status' => 'error', 'message' => 'Failed to connect to database', 'host' => $dbHost];
    // Log the error for deeper investigation
    error_log("Database health check failed: " . $e->getMessage());
}

// 3. Cache Service Check (Example for Redis)
$redisHost = getenv('REDIS_HOST') ?: 'localhost';
$redisPort = getenv('REDIS_PORT') ?: '6379';
$redisTimeout = 2; // 2-second timeout

try {
    $redis = new Redis();
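    if ($redis->connect($redisHost, (int) $redisPort, $redisTimeout)) {
        // phpredis 5.0+ returns true from ping(); older releases return '+PONG'.
        $pong = $redis->ping();
        if ($pong === true || $pong === '+PONG') {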
            $response['checks']['cache'] = ['status' => 'ok', 'host' => $redisHost];
        } else {
            $response['checks']['cache'] = ['status' => 'error', 'message' => 'Redis PING failed', 'host' => $redisHost];
            error_log("Redis health check failed: PING command did not return PONG.");
        }
    } else {
        $response['checks']['cache'] = ['status' => 'error', 'message' => 'Failed to connect to Redis', 'host' => $redisHost];
        error_log("Redis health check failed: Connection refused or timed out.");
    }
} catch (RedisException $e) {
    $response['checks']['cache'] = ['status' => 'error', 'message' => 'Redis connection error', 'host' => $redisHost];
    error_log("Redis health check failed: " . $e->getMessage());
}

// 4. External API Check (Example: Google API)
$externalApiUrl = 'https://www.googleapis.com/discovery/v1/apis';
$apiTimeout = 5; // 5-second timeout

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $externalApiUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, $apiTimeout);
curl_setopt($ch, CURLOPT_FAILONERROR, true); // Fail on HTTP codes >= 400

$apiResponse = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$curlError = curl_error($ch);
curl_close($ch);

if ($apiResponse !== false && $httpCode < 400) {
    $response['checks']['external_api'] = ['status' => 'ok', 'url' => $externalApiUrl];
} else {
    $response['checks']['external_api'] = ['status' => 'error', 'message' => 'External API check failed', 'url' => $externalApiUrl, 'http_code' => $httpCode, 'curl_error' => $curlError];
    error_log("External API health check failed for {$externalApiUrl}: HTTP Code {$httpCode}, cURL Error: {$curlError}");
}

// Determine overall status
$isHealthy = true;
foreach ($response['checks'] as $check) {
    if ($check['status'] === 'error') {
        $isHealthy = false;
        break;
    }
}

if ($isHealthy) {
    $response['status'] = 'ok';
    http_response_code(200);
} else {
    $response['status'] = 'unhealthy';
    http_response_code(503); // Service Unavailable
}

echo json_encode($response, JSON_PRETTY_PRINT);
exit;
?>

Key Considerations:

Environment Variables: Use environment variables (e.g., injected from Kubernetes Secrets or Secret Manager) for credentials and connection details. The script falls back to placeholder defaults when a variable is unset, which is only suitable for local testing.
Timeouts: Set aggressive but realistic timeouts for all external dependencies (DB, cache, APIs). A slow dependency shouldn’t block the health check indefinitely.
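Error Logging: Log every failure encountered during health checks. error_log() writes to PHP’s configured error log (or stderr), so make sure that destination is collected into GCP Cloud Logging, for example by the Ops Agent on Compute Engine or through container stderr on GKE. This provides the necessary context for debugging when an unhealthy state is detected.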
HTTP Status Codes: Return 200 OK for healthy and 503 Service Unavailable for unhealthy. This is standard practice and understood by load balancers and monitoring systems.
JSON Output: A structured JSON response is easily parseable by automated systems.
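The script above repeats the same try/catch-and-record pattern for each dependency. One way to factor that out is sketched below. The helper names runCheck and summarizeChecks are hypothetical (they do not come from any library), and the sketch assumes each probe is a callable that returns true on success or throws on failure.

```php
<?php
/**
 * Run one dependency probe and record its status and duration.
 * $probe is any callable that returns true on success; it may also throw.
 */
function runCheck(callable $probe): array
{
    $start = microtime(true);
    try {
        $result = ($probe() === true)
            ? ['status' => 'ok']
            : ['status' => 'error', 'message' => 'Probe reported a failure'];
    } catch (\Throwable $e) {
        // Keep details out of the HTTP response; send them to the log instead.
        error_log('Health check failed: ' . $e->getMessage());
        $result = ['status' => 'error', 'message' => 'Probe threw an exception'];
    }
    $result['duration_ms'] = (int) round((microtime(true) - $start) * 1000);
    return $result;
}

/** Collapse per-check results into an overall status and HTTP code. */
function summarizeChecks(array $checks): array
{
    foreach ($checks as $check) {
        if ($check['status'] !== 'ok') {
            return ['status' => 'unhealthy', 'http_code' => 503];
        }
    }
    return ['status' => 'ok', 'http_code' => 200];
}
```

Recording duration_ms per check makes it easier to see which dependency is eating the time budget when the endpoint starts to respond slowly.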

Configuring GCP Load Balancer Health Checks

Google Cloud Load Balancers (HTTP(S) Load Balancer, Network Load Balancer) can be configured to periodically probe your application instances. This is the first line of defense, preventing traffic from being sent to unhealthy instances.

For an HTTP(S) Load Balancer, you’ll configure a Health Check resource:

gcloud compute health-checks create http php-app-health-check \
    --request-path="/healthcheck.php" \
    --port=80 \
    --check-interval=30s \
    --timeout=5s \
    --unhealthy-threshold=3 \
    --healthy-threshold=2 \
    --global

Explanation:

--request-path: Points to your custom health check endpoint.
--port: The port your application listens on (typically 80 for HTTP).
--check-interval: How often the health check runs (e.g., every 30 seconds).
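--timeout: How long to wait for a response. It cannot exceed the check interval, and it should be longer than the worst case for your PHP script. The checks in healthcheck.php run sequentially with timeouts of 5, 2, and 5 seconds, so the script can take up to 12 seconds; either shorten those timeouts or raise this value, otherwise a slow dependency shows up as a probe timeout instead of a 503.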
--unhealthy-threshold: Number of consecutive failures before marking an instance unhealthy.
--healthy-threshold: Number of consecutive successes before marking an instance healthy.
--global: For global HTTP(S) Load Balancers. Use --region for regional ones.
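The flag values interact, and the arithmetic is worth doing once. The sketch below uses the values from the gcloud command and from healthcheck.php; the "time to unhealthy" figure is a rough estimate (interval times threshold), not a guarantee from the load balancer.

```php
<?php
// Values copied from the gcloud command and healthcheck.php above.
$checkIntervalSec   = 30;
$probeTimeoutSec    = 5;
$unhealthyThreshold = 3;
$healthyThreshold   = 2;
$scriptTimeoutsSec  = ['database' => 5, 'redis' => 2, 'external_api' => 5];

// Rough time for a failing instance to be taken out of rotation,
// and for a recovered instance to be put back.
$secondsToUnhealthy = $checkIntervalSec * $unhealthyThreshold;
$secondsToHealthy   = $checkIntervalSec * $healthyThreshold;

// The checks run one after another, so the worst case is the sum of their timeouts.
$worstCaseScriptSec   = array_sum($scriptTimeoutsSec);
$probeCanTimeOutFirst = $worstCaseScriptSec > $probeTimeoutSec;

printf("Marked unhealthy after roughly %d s, healthy again after roughly %d s\n",
    $secondsToUnhealthy, $secondsToHealthy);
printf("Worst-case script time: %d s (probe timeout: %d s)\n",
    $worstCaseScriptSec, $probeTimeoutSec);
```

With these numbers a broken instance can keep receiving traffic for about a minute and a half, which is a reason to consider a shorter interval for latency-sensitive services.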

After creating the health check, associate it with your backend service:

gcloud compute backend-services update php-backend-service \
    --health-checks php-app-health-check \
    --global

Leveraging Cloud Monitoring for Deeper Insights

GCP Cloud Monitoring (formerly Stackdriver) is essential for aggregating metrics, setting up alerts, and visualizing the health of your infrastructure and applications.

Monitoring PHP-Specific Metrics

While GCP provides OS-level metrics, we need to push application-level metrics. The Ops Agent is the recommended way to collect logs and metrics from Compute Engine instances; GKE clusters use the logging and monitoring integration built into GKE instead.

Install and configure the Ops Agent. For PHP, you’ll want to monitor:

Request Latency: Track how long requests take to process.
Error Rates: Monitor HTTP 5xx errors.
Throughput: Requests per second.
PHP-FPM Pool Status: If using PHP-FPM, monitor active processes, idle processes, and queue lengths.
Memory Usage: Track PHP’s memory consumption.

You can use a library such as promphp/prometheus_client_php (if using Prometheus) or custom scripts to expose metrics that the Ops Agent can scrape. Alternatively, use the Cloud Monitoring API to push custom metrics directly.
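For the "custom script" route, the scrape target only has to return plain text in the Prometheus exposition format. The following is a minimal sketch of such a renderer with no library dependency. The function name renderPrometheusMetrics and the metric names are hypothetical, and the values are hard-coded; because PHP shares nothing between requests, real counters would have to come from shared storage such as APCu or Redis, or from the PHP-FPM status page.

```php
<?php
/**
 * Render metrics in the Prometheus text exposition format.
 * Each entry: ['name' => string, 'type' => 'counter'|'gauge',
 *              'help' => string, 'value' => int|float, 'labels' => array]
 */
function renderPrometheusMetrics(array $metrics): string
{
    $lines = [];
    foreach ($metrics as $m) {
        $lines[] = "# HELP {$m['name']} {$m['help']}";
        $lines[] = "# TYPE {$m['name']} {$m['type']}";

        $labelPairs = [];
        foreach ($m['labels'] ?? [] as $key => $value) {
            // Escape backslash, double quote, and newline in label values.
            $escaped = str_replace(["\\", "\"", "\n"], ["\\\\", "\\\"", "\\n"], (string) $value);
            $labelPairs[] = "{$key}=\"{$escaped}\"";
        }
        $labels = $labelPairs ? '{' . implode(',', $labelPairs) . '}' : '';

        $lines[] = "{$m['name']}{$labels} {$m['value']}";
    }
    return implode("\n", $lines) . "\n";
}

// Usage: serve this from a /metrics endpoint with Content-Type: text/plain.
echo renderPrometheusMetrics([
    ['name' => 'php_app_requests_total', 'type' => 'counter',
     'help' => 'Total requests handled.', 'value' => 42, 'labels' => ['pool' => 'www']],
    ['name' => 'php_fpm_listen_queue', 'type' => 'gauge',
     'help' => 'Requests waiting for a free worker.', 'value' => 0],
]);
```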

// Example using the Cloud Monitoring API (requires the google/cloud-monitoring client library).
// This follows the 1.x client surface; 2.x releases use
// Google\Cloud\Monitoring\V3\Client\MetricServiceClient with request objects instead.
require __DIR__ . '/vendor/autoload.php';

use Google\Api\Metric;
use Google\Api\MetricDescriptor;
use Google\Api\MetricDescriptor\MetricKind;
use Google\Api\MetricDescriptor\ValueType;
use Google\Api\MonitoredResource;
use Google\Cloud\Monitoring\V3\MetricServiceClient;
use Google\Cloud\Monitoring\V3\Point;
use Google\Cloud\Monitoring\V3\TimeInterval;
use Google\Cloud\Monitoring\V3\TimeSeries;
use Google\Cloud\Monitoring\V3\TypedValue;
use Google\Protobuf\Timestamp;

$projectId = 'your-gcp-project-id';
$projectName = 'projects/' . $projectId; // The API expects the full resource name
$metricClient = new MetricServiceClient(['credentials' => '/path/to/your/keyfile.json']);

$metricType = 'custom.googleapis.com/php_app/request_count';
$resourceType = 'gce_instance'; // Or 'k8s_container' for GKE (different labels apply)
$instanceId = 'your-instance-id';
$zone = 'your-instance-zone';

// Define the metric descriptor if it doesn't exist (run this once).
// Cloud Monitoring has no COUNTER kind; the kinds are GAUGE, DELTA, and CUMULATIVE.
// $descriptor = (new MetricDescriptor())
//     ->setType($metricType)
//     ->setMetricKind(MetricKind::GAUGE)
//     ->setValueType(ValueType::INT64)
//     ->setDescription('Requests processed by the PHP application since the last report.');
// $metricClient->createMetricDescriptor($projectName, $descriptor);

// Requests handled since the last report. The API stores the value it is given;
// it does not increment anything. Aggregate in the application and write
// periodically, because points written too frequently to one time series are rejected.
$requestCount = 1;

// A GAUGE point only needs an end time.
$timestamp = new Timestamp();
$timestamp->setSeconds(time());
$timestamp->setNanos(0);

$interval = new TimeInterval();
$interval->setEndTime($timestamp);

$point = new Point();
$point->setInterval($interval);
$point->setValue((new TypedValue())->setInt64Value($requestCount));

$timeSeries = new TimeSeries();
$timeSeries->setMetric((new Metric())->setType($metricType));
$timeSeries->setResource(
    (new MonitoredResource())->setType($resourceType)
        ->setLabels([
            'project_id' => $projectId,
            'instance_id' => $instanceId,
            'zone' => $zone,
        ])
);
$timeSeries->setPoints([$point]);

try {
    $metricClient->createTimeSeries($projectName, [$timeSeries]);
    echo "Successfully wrote metric.\n";
} catch (\Exception $e) {
    echo "Error writing metric: " . $e->getMessage() . "\n";
}

Setting Up Cloud Monitoring Alerts

Alerting is crucial for proactive issue resolution. Configure alerts for:

High Error Rate: Trigger an alert if the HTTP 5xx error rate exceeds a threshold (e.g., 1% of total requests) for more than 5 minutes.
High Latency: Alert if the 95th percentile request latency is consistently above a certain limit (e.g., 2 seconds).
Unhealthy Instances: While the load balancer handles traffic, Cloud Monitoring can alert on instances consistently failing health checks.
Resource Exhaustion: High CPU, low memory, or disk space issues.
Custom Metric Thresholds: e.g., PHP-FPM queue length exceeding a limit.

Navigate to Cloud Monitoring > Alerting > Create Policy in the GCP Console. Select your desired metric, configure the condition (threshold, duration), and set up notification channels (email, PagerDuty, Slack, or Pub/Sub for custom integrations).
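Cloud Monitoring evaluates these conditions for you, but it helps to be clear about what the two most common ones measure. The sketch below computes a 5xx error rate and a nearest-rank 95th percentile from raw samples; the function names and sample data are hypothetical, and the thresholds are the example values from the list above.

```php
<?php
/** Fraction of requests that returned an HTTP 5xx status. */
function errorRate(array $statusCodes): float
{
    if ($statusCodes === []) {
        return 0.0;
    }
    $errors = count(array_filter(
        $statusCodes,
        fn (int $code): bool => $code >= 500 && $code <= 599
    ));
    return $errors / count($statusCodes);
}

/** Nearest-rank percentile: the smallest sample with at least $p percent of samples at or below it. */
function percentile(array $values, float $p): float
{
    if ($values === []) {
        throw new InvalidArgumentException('No samples');
    }
    sort($values);
    $rank = (int) ceil($p / 100 * count($values));
    return (float) $values[max($rank, 1) - 1];
}

// Usage with made-up samples for one evaluation window.
$statuses  = [200, 200, 200, 500];
$latencies = [0.2, 0.4, 0.3, 2.5, 0.1, 0.6, 0.5, 0.9, 0.8, 0.7]; // seconds

$errorRateBreached = errorRate($statuses) > 0.01;        // 1% threshold
$latencyBreached   = percentile($latencies, 95) > 2.0;   // 2-second threshold
```

A percentile condition reacts to the slow tail of requests, which an average would hide; in the sample data nine of ten requests are under one second yet the p95 is 2.5 seconds.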

Monitoring DynamoDB Performance and Health

DynamoDB, being a managed NoSQL database, abstracts away much of the infrastructure management. However, monitoring its performance and cost is vital for application stability and efficiency.

Key DynamoDB Metrics to Monitor

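DynamoDB is an AWS service, so it publishes its metrics to Amazon CloudWatch rather than to GCP Cloud Monitoring. (Global Tables replicate between AWS Regions, not between clouds.) When the PHP application runs on GCP, either monitor DynamoDB in CloudWatch or forward the metrics into Cloud Monitoring as custom metrics. Focus on: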

Consumed Read/Write Capacity Units: The most critical metric. Indicates how much of your provisioned throughput is being used.
Throttled Read/Write Requests: High throttling indicates your provisioned capacity is insufficient.
System Errors: Server-side errors within DynamoDB.
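Latency: SuccessfulRequestLatency, reported per operation (GetItem, PutItem, Query, and so on); watch the average and maximum.
Item Count: Useful for understanding table size and growth. This comes from the DescribeTable API (ItemCount) rather than from CloudWatch, and DynamoDB refreshes it roughly every six hours.
Table Size: Total storage used by the table, also from DescribeTable (TableSizeBytes) and refreshed on the same schedule.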

Configuring DynamoDB Alerts

Set up CloudWatch Alarms (or equivalent in GCP Monitoring) for:

High Throttling: If ReadThrottleEvents or WriteThrottleEvents are consistently above zero for a sustained period (e.g., 5 minutes).
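Approaching Provisioned Capacity: If consumption consistently exceeds 80-90% of provisioned capacity. ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits are reported as totals per period, so divide the Sum statistic by the period in seconds before comparing it with the per-second provisioned value. This is a precursor to throttling.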
High Latency: If average read or write latency exceeds acceptable thresholds (e.g., > 100ms).
System Errors: If SystemErrors count increases.
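The "approaching provisioned capacity" condition involves a unit conversion that is easy to get wrong: CloudWatch reports consumed capacity as a total over the period, while provisioned capacity is a per-second rate. A small sketch of the calculation follows; the function name capacityUtilization and the sample numbers are hypothetical.

```php
<?php
/**
 * Utilization of provisioned throughput for one CloudWatch period.
 *
 * $consumedSum      Sum statistic of Consumed{Read,Write}CapacityUnits for the period
 * $periodSeconds    Length of the period in seconds (e.g. 300)
 * $provisionedUnits Provisioned capacity units per second
 */
function capacityUtilization(float $consumedSum, int $periodSeconds, float $provisionedUnits): float
{
    if ($periodSeconds <= 0 || $provisionedUnits <= 0) {
        throw new InvalidArgumentException('Period and provisioned capacity must be positive');
    }
    $consumedPerSecond = $consumedSum / $periodSeconds;
    return $consumedPerSecond / $provisionedUnits;
}

// Usage: 25,500 write units consumed in a 5-minute period against 100 provisioned WCU.
$utilization = capacityUtilization(25500, 300, 100);
$shouldAlert = $utilization >= 0.8; // 80% threshold from the list above
printf("Write capacity utilization: %.0f%%\n", $utilization * 100);
```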

Example CloudWatch Alarm Configuration (Conceptual):

# Using AWS CLI for CloudWatch Alarms (adapt for GCP Monitoring if needed)
aws cloudwatch put-metric-alarm \
    --alarm-name "DynamoDB-High-Write-Throttle-MyTable" \
    --alarm-description "High write throttling on DynamoDB table MyTable" \
    --metric-name WriteThrottleEvents \
    --namespace "AWS/DynamoDB" \
    --statistic Sum \
    --period 300 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 2 \
    --datapoints-to-alarm 2 \
    --treat-missing-data notBreaching \
    --dimensions Name=TableName,Value=MyTable \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:MySNSTopic

In a multi-cloud or hybrid setup, you might use Prometheus with a CloudWatch exporter, or custom scripts that read CloudWatch and push the values through the Cloud Monitoring API, to achieve unified visibility.

Automated Scaling and Performance Tuning

Monitoring data should feed directly into your scaling and tuning strategies.

PHP Application Scaling

Use GCP’s Managed Instance Groups (MIGs) with Autoscaling. Configure autoscaling based on metrics like:

CPU Utilization: A common and effective metric.
Load Balancer Serving Capacity: Scales based on the load balancer’s perceived load.
Custom Metrics: e.g., queue depth if using a background job system.

gcloud compute instance-groups managed set-autoscaling php-mig \
    --zone=us-central1-a \
    --min-num-replicas=2 \
    --max-num-replicas=10 \
    --target-cpu-utilization=0.7
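To build intuition for what a 0.7 CPU target means, here is a simplified model of target-based scaling: pick the replica count that would bring average utilization back to the target, then clamp it to the configured bounds. This is an illustration only, not the autoscaler's actual algorithm (which also applies stabilization and initialization periods), and the function name recommendedReplicas is hypothetical.

```php
<?php
/**
 * Simplified target-utilization model: how many replicas would bring the
 * average utilization back down (or up) to the target?
 */
function recommendedReplicas(
    int $currentReplicas,
    float $averageUtilization,
    float $targetUtilization,
    int $minReplicas,
    int $maxReplicas
): int {
    if ($targetUtilization <= 0) {
        throw new InvalidArgumentException('Target utilization must be positive');
    }
    // Round before ceil() so floating-point noise does not add a spurious replica.
    $needed = (int) ceil(round($currentReplicas * $averageUtilization / $targetUtilization, 6));
    return max($minReplicas, min($maxReplicas, $needed));
}

// Usage with the bounds and target from the gcloud command above.
echo recommendedReplicas(4, 0.90, 0.7, 2, 10), "\n";  // scale out
echo recommendedReplicas(4, 0.20, 0.7, 2, 10), "\n";  // scale in, held at the minimum
echo recommendedReplicas(10, 0.95, 0.7, 2, 10), "\n"; // capped at the maximum
```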

DynamoDB Autoscaling

DynamoDB tables in provisioned capacity mode can use Application Auto Scaling to adjust provisioned throughput based on actual consumption patterns. This is far more efficient than manual adjustments. (On-demand tables scale on their own and do not need it.)

# Example using AWS CLI for DynamoDB Auto Scaling
# Note: --resource-id must take the form table/<TableName>
aws application-autoscaling register-scalable-target \
    --service-namespace dynamodb \
    --scalable-dimension dynamodb:table:WriteCapacityUnits \
    --resource-id table/MyTable \
    --min-capacity 10 \
    --max-capacity 100

aws application-autoscaling put-scaling-policy \
    --policy-name MyTableWriteScalingPolicy \
    --service-namespace dynamodb \
    --scalable-dimension dynamodb:table:WriteCapacityUnits \
    --resource-id table/MyTable \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration '{
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        }
    }'

Ensure your PHP application’s data access layer is designed to handle potential latency spikes during scaling events and that your DynamoDB table design (partition keys, sort keys) is optimized to distribute load evenly.
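One common way to tolerate throttling and latency spikes in the data access layer is to retry with exponential backoff and jitter. The AWS SDK for PHP already retries throttled requests on its own, so the sketch below is meant for application-level operations that wrap several calls. The function name retryWithBackoff and its defaults are hypothetical, and it retries on any RuntimeException for simplicity; real code would likely narrow that to throttling errors.

```php
<?php
/**
 * Call $operation, retrying with exponential backoff and full jitter when it throws.
 * $sleep receives the delay in milliseconds; it is injectable so tests can skip the wait.
 */
function retryWithBackoff(
    callable $operation,
    int $maxAttempts = 5,
    int $baseDelayMs = 50,
    int $maxDelayMs = 2000,
    ?callable $sleep = null
) {
    $sleep = $sleep ?? static function (int $ms): void {
        usleep($ms * 1000);
    };

    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (\RuntimeException $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // Out of attempts: surface the last error to the caller.
            }
            // The delay ceiling doubles each attempt, up to $maxDelayMs.
            $ceiling = min($maxDelayMs, $baseDelayMs * (2 ** ($attempt - 1)));
            // Full jitter: a random delay in [0, ceiling] keeps clients from retrying in lockstep.
            $sleep(random_int(0, $ceiling));
        }
    }
}
```

Randomizing the delay matters during scaling events in particular, since many PHP workers tend to hit the same throttled table at the same moment.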