Server Monitoring Best Practices: Keeping Your PHP App and PostgreSQL Clusters Alive on AWS
Proactive PostgreSQL Cluster Health Checks with pg_stat_statements and Custom Metrics
Maintaining the health of PostgreSQL clusters, especially in a distributed AWS environment, requires more than just basic CPU and memory monitoring. We need to delve into query performance, connection pooling, and replication lag. A cornerstone for this is the `pg_stat_statements` extension. Ensure it’s enabled in your `postgresql.conf`:
shared_preload_libraries = 'pg_stat_statements' pg_stat_statements.track = all pg_stat_statements.max = 10000 pg_stat_statements.save = off
After restarting PostgreSQL, you can query `pg_stat_statements` to identify resource-intensive queries. However, simply looking at raw counts or total execution time isn’t enough. We need to contextualize this data. A common pattern is to extract normalized query metrics and push them to a time-series database like Prometheus or CloudWatch Metrics.
Here’s a Python script that can be run periodically (e.g., via cron or a Lambda function) to collect and report key `pg_stat_statements` metrics. This script assumes you have a PostgreSQL user with `SELECT` privileges on `pg_stat_statements` and a way to expose these metrics (e.g., an HTTP endpoint for Prometheus scraping or direct API calls to CloudWatch).
import psycopg2
import time
import os
from prometheus_client import start_http_server, Gauge
# --- Configuration ---
DB_HOST = os.environ.get("DB_HOST", "localhost")
DB_PORT = os.environ.get("DB_PORT", "5432")
DB_NAME = os.environ.get("DB_NAME", "mydatabase")
DB_USER = os.environ.get("DB_USER", "monitor_user")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "secret")
# Prometheus Metrics
QUERY_EXECUTION_TIME = Gauge('pg_stat_statements_execution_time_total', 'Total execution time of queries', ['query', 'user', 'db'])
QUERY_CALL_COUNT = Gauge('pg_stat_statements_calls_total', 'Total number of calls to queries', ['query', 'user', 'db'])
QUERY_ROWS_RETURNED = Gauge('pg_stat_statements_rows_returned_total', 'Total number of rows returned by queries', ['query', 'user', 'db'])
QUERY_BLOCK_TIME = Gauge('pg_stat_statements_block_time_total', 'Total time spent waiting for locks', ['query', 'user', 'db'])
QUERY_WAL_RECORDS = Gauge('pg_stat_statements_wal_records_total', 'Total WAL records generated', ['query', 'user', 'db'])
QUERY_WAL_BYTES = Gauge('pg_stat_statements_wal_bytes_total', 'Total WAL bytes generated', ['query', 'user', 'db'])
# --- Database Connection ---
def get_db_connection():
try:
conn = psycopg2.connect(
host=DB_HOST,
port=DB_PORT,
dbname=DB_NAME,
user=DB_USER,
password=DB_PASSWORD
)
return conn
except psycopg2.OperationalError as e:
print(f"Error connecting to database: {e}")
return None
# --- Metric Collection ---
def collect_pg_stat_statements():
conn = get_db_connection()
if not conn:
return
cur = conn.cursor()
try:
# Resetting statistics is crucial for delta calculations if not using Prometheus's counter behavior
# For Prometheus, we typically want cumulative counters, so we don't reset here.
# If you were calculating deltas manually, you'd run `SELECT pg_stat_statements_reset();`
# and then query again after a short interval.
cur.execute("""
SELECT
query,
calls,
total_exec_time,
rows,
mean_time,
stddev_time,
rows, -- duplicate for clarity, will use for rows_returned
blk_read_time,
blk_write_time,
temp_blk_read_time,
temp_blk_write_time,
'{}' AS db_user, -- Placeholder for user if available in pg_stat_statements
'{}' AS db_name -- Placeholder for dbname
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 50; -- Limit to top N queries for performance and clarity
""".format(DB_USER, DB_NAME)) # Injecting user and db name for labels
rows = cur.fetchall()
for row in rows:
query, calls, total_exec_time, total_rows, mean_time, stddev_time, _, blk_read_time, blk_write_time, temp_blk_read_time, temp_blk_write_time, db_user, db_name = row
# Normalize query for better label readability (e.g., remove literals)
# This is a simplified normalization. Real-world scenarios might need more robust parsing.
normalized_query = query.split('(')[0].strip() if '(' in query else query
normalized_query = normalized_query[:100] # Truncate long queries
# Update Prometheus gauges
QUERY_EXECUTION_TIME.labels(query=normalized_query, user=db_user, db=db_name).set(total_exec_time)
QUERY_CALL_COUNT.labels(query=normalized_query, user=db_user, db=db_name).set(calls)
QUERY_ROWS_RETURNED.labels(query=normalized_query, user=db_user, db=db_name).set(total_rows)
QUERY_BLOCK_TIME.labels(query=normalized_query, user=db_user, db=db_name).set(blk_read_time + blk_write_time + temp_blk_read_time + temp_blk_write_time)
# WAL metrics are not directly in pg_stat_statements, would need to be collected differently if needed.
# For this example, we'll omit them.
except Exception as e:
print(f"Error collecting pg_stat_statements: {e}")
finally:
cur.close()
conn.close()
# --- Main Execution ---
if __name__ == '__main__':
# Start up the server to expose the metrics.
# In a real deployment, this would likely be a dedicated exporter process.
# For simplicity, we'll start it here.
print("Starting Prometheus HTTP server on port 8000...")
start_http_server(8000)
print("Server started. Collecting metrics...")
while True:
collect_pg_stat_statements()
print("Metrics collected. Sleeping for 60 seconds...")
time.sleep(60) # Collect every minute
This script exposes metrics on port 8000. You would configure Prometheus to scrape http://your-exporter-ip:8000/metrics. Key metrics to alert on include:
- High
pg_stat_statements_execution_time_totalfor specific queries, indicating slow queries. - A sudden spike in
pg_stat_statements_calls_totalwithout a proportional increase in execution time (potential denial-of-service or inefficient query pattern). - High
pg_stat_statements_block_time_total, pointing to lock contention issues.
For more advanced analysis, consider integrating with AWS CloudWatch Logs and creating custom metrics from log entries, or using RDS Performance Insights for a managed solution.
PHP Application Performance Monitoring (APM) with OpenTelemetry and AWS X-Ray
Monitoring your PHP application’s performance is critical for identifying bottlenecks that might be impacting your PostgreSQL cluster. Traditional logging can be noisy; distributed tracing offers a clearer picture of request flows and their associated latency.
OpenTelemetry provides a vendor-neutral standard for instrumentation. We can leverage the OpenTelemetry PHP SDK and integrate it with AWS X-Ray for backend tracing. First, install the necessary components:
composer require open-telemetry/sdk composer require open-telemetry/exporter-aws composer require open-telemetry/auto-instrumentation-http composer require open-telemetry/auto-instrumentation-psr18 composer require open-telemetry/auto-instrumentation-pdo
Next, configure the OpenTelemetry SDK. This typically involves setting up a tracer provider and an exporter. For AWS X-Ray, you’ll need the AWS SDK for PHP and appropriate IAM permissions for your EC2 instance or ECS task to send traces to X-Ray.
<?php
require __DIR__ . '/vendor/autoload.php';
use OpenTelemetry\API\Trace\TracerProviderInterface;
use OpenTelemetry\SDK\Trace\TracerProvider;
use OpenTelemetry\SDK\Trace\SpanProcessor\BatchSpanProcessor;
use OpenTelemetry\SDK\Trace\Sampler\AlwaysOnSampler;
use OpenTelemetry\SDK\Common\Export\Http\Psr18ClientFactory;
use OpenTelemetry\SDK\Trace\SpanExporter\ConsoleExporter; // For local debugging
use OpenTelemetry\Extension\Xray\XrayExporter; // AWS X-Ray Exporter
// --- Configuration ---
$serviceName = getenv('OTEL_SERVICE_NAME') ?: 'my-php-app';
$awsRegion = getenv('AWS_REGION') ?: 'us-east-1';
$xrayEndpoint = getenv('XRAY_ENDPOINT') ?: "http://localhost:2000/v1/traces"; // Default X-Ray daemon endpoint
// --- Tracer Provider Setup ---
$tracerProvider = new TracerProvider(
new AlwaysOnSampler(),
new BatchSpanProcessor(
// Use ConsoleExporter for local debugging
// new ConsoleExporter(),
// Use XRayExporter for AWS deployment
new XrayExporter(
$awsRegion,
$serviceName,
Psr18ClientFactory::default()->create(),
$xrayEndpoint
)
)
);
// Register the tracer provider globally
OpenTelemetry\API\GlobalTracerProvider::set($tracerProvider);
// --- Auto-instrumentation ---
// HTTP instrumentation (for incoming requests and outgoing HTTP clients)
OpenTelemetry\AutoInstrumentation\Http\HttpInstrumentation::register();
// PSR-18 client instrumentation (if you use libraries like GuzzleHttp)
OpenTelemetry\AutoInstrumentation\Psr18\Psr18Instrumentation::register();
// PDO instrumentation (crucial for database calls)
OpenTelemetry\AutoInstrumentation\Pdo\PdoInstrumentation::register();
// --- Example Usage (within your application logic) ---
$tracer = OpenTelemetry\API\GlobalTracerProvider::getTracerProvider()->getTracer('my-app-tracer');
// Example: Tracing a specific business logic operation
$span = $tracer->spanBuilder('process_user_request')->startSpan();
try {
// Your application logic here...
// Example: Making a database call (will be auto-instrumented by PDOInstrumentation)
$db = new PDO('pgsql:host=your_db_host;dbname=your_db_name', 'db_user', 'db_password');
$stmt = $db->prepare("SELECT * FROM users WHERE id = :id");
$stmt->execute([':id' => 1]);
$user = $stmt->fetch();
// Example: Making an external API call
$client = new \GuzzleHttp\Client();
$response = $client->request('GET', 'https://api.example.com/data');
$span->setAttribute('user.id', 1);
$span->setAttribute('http.status_code', 200);
$span->end();
} catch (\Throwable $e) {
$span->recordException($e);
$span->setStatus(\OpenTelemetry\API\Trace\StatusCode::STATUS_ERROR, $e->getMessage());
$span->end();
throw $e; // Re-throw the exception
}
// Ensure spans are flushed before application exit
register_shutdown_function(function () use ($tracerProvider) {
$tracerProvider->shutdown();
});
?>
In a production AWS environment, you would typically run the X-Ray daemon as a separate container or service on your EC2 instances or within your ECS/EKS cluster. The `XRAY_ENDPOINT` environment variable in the script should point to the X-Ray daemon’s UDP endpoint (default is `localhost:2000`). Ensure your IAM role has permissions like xray:PutTraceSegments and xray:PutTelemetryRecords.
With this setup, you can visualize request flows in the AWS X-Ray console, identify slow database queries (which will appear as spans related to PDO operations), and pinpoint latency in external API calls.
AWS CloudWatch Alarms for PostgreSQL Replication Lag and RDS Instance Metrics
For managed PostgreSQL on AWS RDS, CloudWatch provides built-in metrics that are essential for monitoring. Replication lag is a critical indicator of potential data loss or read-only replicas falling behind. You can set up CloudWatch alarms directly through the AWS console or via Infrastructure as Code (IaC) tools like CloudFormation or Terraform.
Key RDS metrics to monitor and alarm on:
- `ReplicationLag` (for Read Replicas): This metric shows the number of seconds behind the primary instance the replica is. Set an alarm when this exceeds a threshold (e.g., 60 seconds) for a sustained period.
- `CPUUtilization`: Standard CPU usage. High utilization on the primary can indicate query performance issues or insufficient instance sizing.
- `DatabaseConnections`: Monitor the number of active connections. A sudden spike or consistently high number might indicate connection leaks or insufficient connection pooling.
- `FreeableMemory`: Low freeable memory can lead to increased swapping and degraded performance.
- `DiskQueueDepth`: High disk queue depth indicates I/O bottlenecks.
Here’s an example of how you might define a CloudWatch alarm for replication lag using the AWS CLI:
aws cloudwatch put-metric-alarm \
--alarm-name "RDS-ReplicationLag-High" \
--alarm-description "Alarm when PostgreSQL replication lag exceeds 60 seconds" \
--metric-name "ReplicationLag" \
--namespace "AWS/RDS" \
--statistic Average \
--period 300 \
--threshold 60 \
--comparison-operator GreaterThanThreshold \
--dimensions "Name=DBInstanceIdentifier,Value=your-read-replica-db-instance-id" \
--evaluation-periods 2 \
--datapoints-to-alarm 2 \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:your-sns-topic-for-alarms
Remember to replace your-read-replica-db-instance-id and the SNS topic ARN with your actual values. For primary instances, you’ll monitor metrics like CPUUtilization and DatabaseConnections, adjusting the --dimensions accordingly.
Nginx and PHP-FPM Health Checks and Load Balancing Strategies
Your web server (Nginx) and PHP process manager (PHP-FPM) are the front lines of your application. Ensuring their health and efficient operation is paramount.
Nginx Health Checks:
Nginx can perform active health checks on upstream servers (e.g., PHP-FPM pools or other Nginx instances in a load-balanced setup). This is configured within the upstream block.
http {
upstream php_backend {
server 127.0.0.1:9000 fail_timeout=5s max_fails=3; # Example for PHP-FPM
# server backend1.example.com:9000 fail_timeout=5s max_fails=3;
# server backend2.example.com:9000 fail_timeout=5s max_fails=3;
# For Nginx load balancing other Nginx instances:
# server frontend1.internal:80 weight=1 max_fails=3 fail_timeout=30s;
# server frontend2.internal:80 weight=1 max_fails=3 fail_timeout=30s;
# Health check endpoint (if you have one)
# check interval=3000 rise=2 fall=3 timeout=1000 type=http;
# check_http uri=/healthz method=GET http_expected_status=200;
}
server {
listen 80;
server_name yourdomain.com;
location ~ \.php$ {
include fastcgi_params;
fastcgi_pass php_backend;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}
# Example health check endpoint for Nginx itself
location /nginx_healthz {
access_log off;
return 200 'OK';
add_header Content-Type text/plain;
}
}
}
The `fail_timeout` and `max_fails` parameters are crucial. If an upstream server fails `max_fails` times within `fail_timeout` seconds, Nginx will consider it down and stop sending requests to it for a period defined by `fail_timeout`. For PHP-FPM, direct `server` entries are common. For load-balanced Nginx instances, you might use a dedicated health check endpoint.
PHP-FPM Monitoring:
PHP-FPM exposes its status through a status page. You need to configure Nginx to proxy requests to this status page and enable the status page in PHP-FPM’s configuration.
1. **Enable PHP-FPM Status Page:**
; In your php-fpm.conf or pool configuration (e.g., www.conf) [www] pm.status_path = /fpm_status ping.path = /fpm_ping ping.response = pong request_terminate_timeout = 0 ; Or a reasonable value catch_workers_output = yes ; ... other settings
2. **Configure Nginx to Proxy to Status Page:**
server {
listen 80;
server_name yourdomain.com;
location /fpm_status {
include fastcgi_params;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
fastcgi_pass unix:/var/run/php/php7.4-fpm.sock; # Adjust path to your FPM socket
# Or if using TCP: fastcgi_pass 127.0.0.1:9000;
}
location /fpm_ping {
include fastcgi_params;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
fastcgi_pass unix:/var/run/php/php7.4-fpm.sock; # Adjust path
# Or if using TCP: fastcgi_pass 127.0.0.1:9000;
}
# ... other locations
}
Accessing http://yourdomain.com/fpm_status will show output like:
pool: www process manager: dynamic start for 15000 requests process.max: 100 process.min: 1 process.num: 5 process.active: 2 process.idle: 3 process.maxLength: 1000 process.maxActive: 5 requests.total: 15000 requests.slow: 0 slowlog: /var/log/php-fpm/slow.log getpid: 12345 getppid: 12344 fcgi.start: 1678886400 fcgi.restarted: 0
Key metrics to monitor from this output:
- `process.active`: Number of actively processing requests. High values indicate a busy FPM pool.
- `process.idle`: Number of idle workers. If this is consistently low and `process.active` is high, you might need to increase `pm.max_children`.
- `requests.slow`: Number of slow requests. This should ideally be 0.
You can use tools like pmacct, custom scripts, or Prometheus exporters (like php-fpm_exporter) to scrape this status page and feed data into your monitoring system. Alarms should be set for high active processes, low idle processes (indicating potential under-provisioning), and any slow requests.