Server Monitoring Best Practices: Keeping Your PHP App and PostgreSQL Clusters Alive on AWS

Proactive PostgreSQL Cluster Health Checks with pg_stat_statements and Custom Metrics

Maintaining the health of PostgreSQL clusters, especially in a distributed AWS environment, requires more than just basic CPU and memory monitoring. We need to delve into query performance, connection pooling, and replication lag. A cornerstone for this is the `pg_stat_statements` extension. Ensure it’s enabled in your `postgresql.conf`:

shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000
pg_stat_statements.save = off

After restarting PostgreSQL, you can query `pg_stat_statements` to identify resource-intensive queries. However, simply looking at raw counts or total execution time isn’t enough. We need to contextualize this data. A common pattern is to extract normalized query metrics and push them to a time-series database like Prometheus or CloudWatch Metrics.

Here’s a Python script that can be run periodically (e.g., via cron or a Lambda function) to collect and report key `pg_stat_statements` metrics. This script assumes you have a PostgreSQL user with `SELECT` privileges on `pg_stat_statements` and a way to expose these metrics (e.g., an HTTP endpoint for Prometheus scraping or direct API calls to CloudWatch).

import psycopg2
import time
import os
from prometheus_client import start_http_server, Gauge

# --- Configuration ---
DB_HOST = os.environ.get("DB_HOST", "localhost")
DB_PORT = os.environ.get("DB_PORT", "5432")
DB_NAME = os.environ.get("DB_NAME", "mydatabase")
DB_USER = os.environ.get("DB_USER", "monitor_user")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "secret")

# Prometheus Metrics
QUERY_EXECUTION_TIME = Gauge('pg_stat_statements_execution_time_total', 'Total execution time of queries', ['query', 'user', 'db'])
QUERY_CALL_COUNT = Gauge('pg_stat_statements_calls_total', 'Total number of calls to queries', ['query', 'user', 'db'])
QUERY_ROWS_RETURNED = Gauge('pg_stat_statements_rows_returned_total', 'Total number of rows returned by queries', ['query', 'user', 'db'])
QUERY_BLOCK_TIME = Gauge('pg_stat_statements_block_time_total', 'Total time spent waiting for locks', ['query', 'user', 'db'])
QUERY_WAL_RECORDS = Gauge('pg_stat_statements_wal_records_total', 'Total WAL records generated', ['query', 'user', 'db'])
QUERY_WAL_BYTES = Gauge('pg_stat_statements_wal_bytes_total', 'Total WAL bytes generated', ['query', 'user', 'db'])

# --- Database Connection ---
def get_db_connection():
    try:
        conn = psycopg2.connect(
            host=DB_HOST,
            port=DB_PORT,
            dbname=DB_NAME,
            user=DB_USER,
            password=DB_PASSWORD
        )
        return conn
    except psycopg2.OperationalError as e:
        print(f"Error connecting to database: {e}")
        return None

# --- Metric Collection ---
def collect_pg_stat_statements():
    conn = get_db_connection()
    if not conn:
        return

    cur = conn.cursor()
    try:
        # Resetting statistics is crucial for delta calculations if not using Prometheus's counter behavior
        # For Prometheus, we typically want cumulative counters, so we don't reset here.
        # If you were calculating deltas manually, you'd run `SELECT pg_stat_statements_reset();`
        # and then query again after a short interval.

        cur.execute("""
            SELECT
                query,
                calls,
                total_exec_time,
                rows,
                mean_time,
                stddev_time,
                rows, -- duplicate for clarity, will use for rows_returned
                blk_read_time,
                blk_write_time,
                temp_blk_read_time,
                temp_blk_write_time,
                '{}' AS db_user, -- Placeholder for user if available in pg_stat_statements
                '{}' AS db_name  -- Placeholder for dbname
            FROM pg_stat_statements
            ORDER BY total_exec_time DESC
            LIMIT 50; -- Limit to top N queries for performance and clarity
        """.format(DB_USER, DB_NAME)) # Injecting user and db name for labels

        rows = cur.fetchall()

        for row in rows:
            query, calls, total_exec_time, total_rows, mean_time, stddev_time, _, blk_read_time, blk_write_time, temp_blk_read_time, temp_blk_write_time, db_user, db_name = row

            # Normalize query for better label readability (e.g., remove literals)
            # This is a simplified normalization. Real-world scenarios might need more robust parsing.
            normalized_query = query.split('(')[0].strip() if '(' in query else query
            normalized_query = normalized_query[:100] # Truncate long queries

            # Update Prometheus gauges
            QUERY_EXECUTION_TIME.labels(query=normalized_query, user=db_user, db=db_name).set(total_exec_time)
            QUERY_CALL_COUNT.labels(query=normalized_query, user=db_user, db=db_name).set(calls)
            QUERY_ROWS_RETURNED.labels(query=normalized_query, user=db_user, db=db_name).set(total_rows)
            QUERY_BLOCK_TIME.labels(query=normalized_query, user=db_user, db=db_name).set(blk_read_time + blk_write_time + temp_blk_read_time + temp_blk_write_time)
            # WAL metrics are not directly in pg_stat_statements, would need to be collected differently if needed.
            # For this example, we'll omit them.

    except Exception as e:
        print(f"Error collecting pg_stat_statements: {e}")
    finally:
        cur.close()
        conn.close()

# --- Main Execution ---
if __name__ == '__main__':
    # Start up the server to expose the metrics.
    # In a real deployment, this would likely be a dedicated exporter process.
    # For simplicity, we'll start it here.
    print("Starting Prometheus HTTP server on port 8000...")
    start_http_server(8000)
    print("Server started. Collecting metrics...")

    while True:
        collect_pg_stat_statements()
        print("Metrics collected. Sleeping for 60 seconds...")
        time.sleep(60) # Collect every minute

This script exposes metrics on port 8000. You would configure Prometheus to scrape http://your-exporter-ip:8000/metrics. Key metrics to alert on include:

High pg_stat_statements_execution_time_total for specific queries, indicating slow queries.
A sudden spike in pg_stat_statements_calls_total without a proportional increase in execution time (potential denial-of-service or inefficient query pattern).
High pg_stat_statements_block_time_total, pointing to lock contention issues.

For more advanced analysis, consider integrating with AWS CloudWatch Logs and creating custom metrics from log entries, or using RDS Performance Insights for a managed solution.

PHP Application Performance Monitoring (APM) with OpenTelemetry and AWS X-Ray

Monitoring your PHP application’s performance is critical for identifying bottlenecks that might be impacting your PostgreSQL cluster. Traditional logging can be noisy; distributed tracing offers a clearer picture of request flows and their associated latency.

OpenTelemetry provides a vendor-neutral standard for instrumentation. We can leverage the OpenTelemetry PHP SDK and integrate it with AWS X-Ray for backend tracing. First, install the necessary components:

composer require open-telemetry/sdk
composer require open-telemetry/exporter-aws
composer require open-telemetry/auto-instrumentation-http
composer require open-telemetry/auto-instrumentation-psr18
composer require open-telemetry/auto-instrumentation-pdo

Next, configure the OpenTelemetry SDK. This typically involves setting up a tracer provider and an exporter. For AWS X-Ray, you’ll need the AWS SDK for PHP and appropriate IAM permissions for your EC2 instance or ECS task to send traces to X-Ray.

<?php
require __DIR__ . '/vendor/autoload.php';

use OpenTelemetry\API\Trace\TracerProviderInterface;
use OpenTelemetry\SDK\Trace\TracerProvider;
use OpenTelemetry\SDK\Trace\SpanProcessor\BatchSpanProcessor;
use OpenTelemetry\SDK\Trace\Sampler\AlwaysOnSampler;
use OpenTelemetry\SDK\Common\Export\Http\Psr18ClientFactory;
use OpenTelemetry\SDK\Trace\SpanExporter\ConsoleExporter; // For local debugging
use OpenTelemetry\Extension\Xray\XrayExporter; // AWS X-Ray Exporter

// --- Configuration ---
$serviceName = getenv('OTEL_SERVICE_NAME') ?: 'my-php-app';
$awsRegion = getenv('AWS_REGION') ?: 'us-east-1';
$xrayEndpoint = getenv('XRAY_ENDPOINT') ?: "http://localhost:2000/v1/traces"; // Default X-Ray daemon endpoint

// --- Tracer Provider Setup ---
$tracerProvider = new TracerProvider(
    new AlwaysOnSampler(),
    new BatchSpanProcessor(
        // Use ConsoleExporter for local debugging
        // new ConsoleExporter(),

        // Use XRayExporter for AWS deployment
        new XrayExporter(
            $awsRegion,
            $serviceName,
            Psr18ClientFactory::default()->create(),
            $xrayEndpoint
        )
    )
);

// Register the tracer provider globally
OpenTelemetry\API\GlobalTracerProvider::set($tracerProvider);

// --- Auto-instrumentation ---
// HTTP instrumentation (for incoming requests and outgoing HTTP clients)
OpenTelemetry\AutoInstrumentation\Http\HttpInstrumentation::register();

// PSR-18 client instrumentation (if you use libraries like GuzzleHttp)
OpenTelemetry\AutoInstrumentation\Psr18\Psr18Instrumentation::register();

// PDO instrumentation (crucial for database calls)
OpenTelemetry\AutoInstrumentation\Pdo\PdoInstrumentation::register();

// --- Example Usage (within your application logic) ---
$tracer = OpenTelemetry\API\GlobalTracerProvider::getTracerProvider()->getTracer('my-app-tracer');

// Example: Tracing a specific business logic operation
$span = $tracer->spanBuilder('process_user_request')->startSpan();
try {
    // Your application logic here...
    // Example: Making a database call (will be auto-instrumented by PDOInstrumentation)
    $db = new PDO('pgsql:host=your_db_host;dbname=your_db_name', 'db_user', 'db_password');
    $stmt = $db->prepare("SELECT * FROM users WHERE id = :id");
    $stmt->execute([':id' => 1]);
    $user = $stmt->fetch();

    // Example: Making an external API call
    $client = new \GuzzleHttp\Client();
    $response = $client->request('GET', 'https://api.example.com/data');

    $span->setAttribute('user.id', 1);
    $span->setAttribute('http.status_code', 200);

    $span->end();
} catch (\Throwable $e) {
    $span->recordException($e);
    $span->setStatus(\OpenTelemetry\API\Trace\StatusCode::STATUS_ERROR, $e->getMessage());
    $span->end();
    throw $e; // Re-throw the exception
}

// Ensure spans are flushed before application exit
register_shutdown_function(function () use ($tracerProvider) {
    $tracerProvider->shutdown();
});

?>

In a production AWS environment, you would typically run the X-Ray daemon as a separate container or service on your EC2 instances or within your ECS/EKS cluster. The `XRAY_ENDPOINT` environment variable in the script should point to the X-Ray daemon’s UDP endpoint (default is `localhost:2000`). Ensure your IAM role has permissions like xray:PutTraceSegments and xray:PutTelemetryRecords.

With this setup, you can visualize request flows in the AWS X-Ray console, identify slow database queries (which will appear as spans related to PDO operations), and pinpoint latency in external API calls.

AWS CloudWatch Alarms for PostgreSQL Replication Lag and RDS Instance Metrics

For managed PostgreSQL on AWS RDS, CloudWatch provides built-in metrics that are essential for monitoring. Replication lag is a critical indicator of potential data loss or read-only replicas falling behind. You can set up CloudWatch alarms directly through the AWS console or via Infrastructure as Code (IaC) tools like CloudFormation or Terraform.

Key RDS metrics to monitor and alarm on:

`ReplicationLag` (for Read Replicas): This metric shows the number of seconds behind the primary instance the replica is. Set an alarm when this exceeds a threshold (e.g., 60 seconds) for a sustained period.
`CPUUtilization`: Standard CPU usage. High utilization on the primary can indicate query performance issues or insufficient instance sizing.
`DatabaseConnections`: Monitor the number of active connections. A sudden spike or consistently high number might indicate connection leaks or insufficient connection pooling.
`FreeableMemory`: Low freeable memory can lead to increased swapping and degraded performance.
`DiskQueueDepth`: High disk queue depth indicates I/O bottlenecks.

Here’s an example of how you might define a CloudWatch alarm for replication lag using the AWS CLI:

aws cloudwatch put-metric-alarm \
    --alarm-name "RDS-ReplicationLag-High" \
    --alarm-description "Alarm when PostgreSQL replication lag exceeds 60 seconds" \
    --metric-name "ReplicationLag" \
    --namespace "AWS/RDS" \
    --statistic Average \
    --period 300 \
    --threshold 60 \
    --comparison-operator GreaterThanThreshold \
    --dimensions "Name=DBInstanceIdentifier,Value=your-read-replica-db-instance-id" \
    --evaluation-periods 2 \
    --datapoints-to-alarm 2 \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:your-sns-topic-for-alarms

Remember to replace your-read-replica-db-instance-id and the SNS topic ARN with your actual values. For primary instances, you’ll monitor metrics like CPUUtilization and DatabaseConnections, adjusting the --dimensions accordingly.

Nginx and PHP-FPM Health Checks and Load Balancing Strategies

Your web server (Nginx) and PHP process manager (PHP-FPM) are the front lines of your application. Ensuring their health and efficient operation is paramount.

Nginx Health Checks:

Nginx can perform active health checks on upstream servers (e.g., PHP-FPM pools or other Nginx instances in a load-balanced setup). This is configured within the upstream block.

http {
    upstream php_backend {
        server 127.0.0.1:9000 fail_timeout=5s max_fails=3; # Example for PHP-FPM
        # server backend1.example.com:9000 fail_timeout=5s max_fails=3;
        # server backend2.example.com:9000 fail_timeout=5s max_fails=3;

        # For Nginx load balancing other Nginx instances:
        # server frontend1.internal:80 weight=1 max_fails=3 fail_timeout=30s;
        # server frontend2.internal:80 weight=1 max_fails=3 fail_timeout=30s;

        # Health check endpoint (if you have one)
        # check interval=3000 rise=2 fall=3 timeout=1000 type=http;
        # check_http uri=/healthz method=GET http_expected_status=200;
    }

    server {
        listen 80;
        server_name yourdomain.com;

        location ~ \.php$ {
            include fastcgi_params;
            fastcgi_pass php_backend;
            fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        }

        # Example health check endpoint for Nginx itself
        location /nginx_healthz {
            access_log off;
            return 200 'OK';
            add_header Content-Type text/plain;
        }
    }
}

The `fail_timeout` and `max_fails` parameters are crucial. If an upstream server fails `max_fails` times within `fail_timeout` seconds, Nginx will consider it down and stop sending requests to it for a period defined by `fail_timeout`. For PHP-FPM, direct `server` entries are common. For load-balanced Nginx instances, you might use a dedicated health check endpoint.

PHP-FPM Monitoring:

PHP-FPM exposes its status through a status page. You need to configure Nginx to proxy requests to this status page and enable the status page in PHP-FPM’s configuration.

1. **Enable PHP-FPM Status Page:**

; In your php-fpm.conf or pool configuration (e.g., www.conf)
[www]
pm.status_path = /fpm_status
ping.path = /fpm_ping
ping.response = pong
request_terminate_timeout = 0 ; Or a reasonable value
catch_workers_output = yes
; ... other settings

2. **Configure Nginx to Proxy to Status Page:**

server {
    listen 80;
    server_name yourdomain.com;

    location /fpm_status {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/var/run/php/php7.4-fpm.sock; # Adjust path to your FPM socket
        # Or if using TCP: fastcgi_pass 127.0.0.1:9000;
    }

    location /fpm_ping {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/var/run/php/php7.4-fpm.sock; # Adjust path
        # Or if using TCP: fastcgi_pass 127.0.0.1:9000;
    }

    # ... other locations
}

Accessing http://yourdomain.com/fpm_status will show output like:

pool: www
process manager: dynamic
start for 15000 requests
process.max: 100
process.min: 1
process.num: 5
process.active: 2
process.idle: 3
process.maxLength: 1000
process.maxActive: 5
requests.total: 15000
requests.slow: 0
slowlog: /var/log/php-fpm/slow.log
getpid: 12345
getppid: 12344
fcgi.start: 1678886400
fcgi.restarted: 0

Key metrics to monitor from this output:

`process.active`: Number of actively processing requests. High values indicate a busy FPM pool.
`process.idle`: Number of idle workers. If this is consistently low and `process.active` is high, you might need to increase `pm.max_children`.
`requests.slow`: Number of slow requests. This should ideally be 0.

You can use tools like pmacct, custom scripts, or Prometheus exporters (like php-fpm_exporter) to scrape this status page and feed data into your monitoring system. Alarms should be set for high active processes, low idle processes (indicating potential under-provisioning), and any slow requests.

Server Monitoring Best Practices: Keeping Your PHP App and PostgreSQL Clusters Alive on AWS

Proactive PostgreSQL Cluster Health Checks with pg_stat_statements and Custom Metrics

PHP Application Performance Monitoring (APM) with OpenTelemetry and AWS X-Ray

AWS CloudWatch Alarms for PostgreSQL Replication Lag and RDS Instance Metrics

Nginx and PHP-FPM Health Checks and Load Balancing Strategies

Recent Posts

Top Categories

Our Products

Our Services