Server Monitoring Best Practices: Keeping Your PHP App and PostgreSQL Clusters Alive on OVH

Proactive PostgreSQL Cluster Health Checks

Maintaining the health of a PostgreSQL cluster, especially in a distributed environment like OVH, requires more than just basic uptime checks. We need to monitor internal cluster states, replication lag, connection pools, and resource utilization at a granular level. This section details essential checks and how to implement them.

Replication Lag Monitoring

Replication lag is a critical indicator of potential data loss or inconsistencies. We’ll use a combination of PostgreSQL’s built-in functions and external scripting to alert on excessive lag.

First, let’s define a SQL query to fetch replication status. This query should be run on the primary and then on each replica. On the replica, it will show the lag behind the primary.

SQL Query for Replication Status

SELECT
    pg_current_wal_lsn() AS primary_lsn,
    pg_last_wal_receive_lsn() AS replica_receive_lsn,
    pg_last_wal_replay_lsn() AS replica_replay_lsn,
    pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) AS replication_lag_bytes
FROM pg_control_system();

To automate this, we can create a simple Python script that connects to each PostgreSQL instance, executes the query, and checks if the `replication_lag_bytes` exceeds a predefined threshold (e.g., 1GB or 100MB, depending on your write load and tolerance). This script can then send alerts via email, Slack, or your preferred monitoring system (e.g., Prometheus Alertmanager).

Python Script for Replication Lag Alerting

import psycopg2
import smtplib
from email.mime.text import MIMEText
import os

# --- Configuration ---
DB_HOST = os.environ.get('PG_HOST', 'localhost')
DB_PORT = os.environ.get('PG_PORT', '5432')
DB_USER = os.environ.get('PG_USER', 'monitor_user')
DB_PASSWORD = os.environ.get('PG_PASSWORD', 'your_password')
REPLICATION_LAG_THRESHOLD_BYTES = int(os.environ.get('REPLICATION_LAG_THRESHOLD_BYTES', 1073741824)) # 1GB

SMTP_SERVER = os.environ.get('SMTP_SERVER', 'smtp.example.com')
SMTP_PORT = int(os.environ.get('SMTP_PORT', 587))
SMTP_USER = os.environ.get('SMTP_USER', '[email protected]')
SMTP_PASSWORD = os.environ.get('SMTP_PASSWORD', 'smtp_password')
ALERT_RECIPIENTS = os.environ.get('ALERT_RECIPIENTS', '[email protected]').split(',')

# --- SQL Query ---
REPLICATION_LAG_QUERY = """
SELECT
    pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) AS replication_lag_bytes
FROM pg_control_system();
"""

def send_alert(subject, body):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = SMTP_USER
    msg['To'] = ", ".join(ALERT_RECIPIENTS)

    try:
        with smtplib.SMTP(SMTP_SERVER, SMTP_PORT) as server:
            server.starttls()
            server.login(SMTP_USER, SMTP_PASSWORD)
            server.sendmail(SMTP_USER, ALERT_RECIPIENTS, msg.as_string())
        print(f"Alert sent: {subject}")
    except Exception as e:
        print(f"Failed to send alert: {e}")

def check_replication_lag():
    conn = None
    try:
        conn = psycopg2.connect(host=DB_HOST, port=DB_PORT, user=DB_USER, password=DB_PASSWORD)
        cur = conn.cursor()
        cur.execute(REPLICATION_LAG_QUERY)
        lag_bytes = cur.fetchone()[0]
        cur.close()

        print(f"Current replication lag: {lag_bytes} bytes")

        if lag_bytes is not None and lag_bytes > REPLICATION_LAG_THRESHOLD_BYTES:
            subject = f"ALERT: PostgreSQL Replication Lag Exceeded Threshold on {DB_HOST}"
            body = f"Replication lag on {DB_HOST}:{DB_PORT} is {lag_bytes} bytes, which exceeds the threshold of {REPLICATION_LAG_THRESHOLD_BYTES} bytes."
            send_alert(subject, body)
        else:
            print("Replication lag is within acceptable limits.")

    except psycopg2.OperationalError as e:
        subject = f"CRITICAL: Cannot connect to PostgreSQL on {DB_HOST}"
        body = f"Failed to connect to PostgreSQL at {DB_HOST}:{DB_PORT}. Error: {e}"
        send_alert(subject, body)
    except Exception as e:
        subject = f"ERROR: Unexpected error checking PostgreSQL on {DB_HOST}"
        body = f"An unexpected error occurred while checking PostgreSQL on {DB_HOST}:{DB_PORT}. Error: {e}"
        send_alert(subject, body)
    finally:
        if conn:
            conn.close()

if __name__ == "__main__":
    check_replication_lag()

This script should be scheduled to run periodically (e.g., every 5 minutes) using cron or a similar scheduler on a dedicated monitoring host or one of the PostgreSQL nodes (if resource permits and isolation is not a concern).

Connection Pool Monitoring

An exhausted connection pool can cripple your PHP application. We need to monitor the number of active connections, idle connections, and the maximum allowed connections. Tools like PgBouncer or built-in connection pooling in application frameworks can be monitored.

Monitoring PgBouncer Statistics

If you’re using PgBouncer, you can connect to its monitoring database (usually `pgbouncer`) and query its statistics table.

-- Connect to the pgbouncer database
\c pgbouncer

-- Show database-level statistics
SELECT
    dbname,
    SUM(total_requests) AS total_requests,
    SUM(total_received) AS total_received,
    SUM(total_sent) AS total_sent,
    SUM(max_client_conn) AS max_client_conn,
    SUM(active_connections) AS active_connections,
    SUM(idle_connections) AS idle_connections,
    SUM(used_connections) AS used_connections,
    SUM(pool_size) AS pool_size,
    SUM(avg_conn_duration) AS avg_conn_duration
FROM pg_stat_database
GROUP BY dbname;

-- Show server-level statistics (connections to PostgreSQL)
SELECT
    pool_mode,
    SUM(total_query_time) AS total_query_time,
    SUM(total_returned) AS total_returned,
    SUM(total_written) AS total_written,
    SUM(max_client_conn) AS max_client_conn,
    SUM(active_connections) AS active_connections,
    SUM(idle_connections) AS idle_connections,
    SUM(used_connections) AS used_connections,
    SUM(pool_size) AS pool_size,
    SUM(avg_conn_duration) AS avg_conn_duration
FROM pg_stat_servers
GROUP BY pool_mode;

A Python script can query these statistics and trigger alerts if `used_connections` approaches `max_client_conn` or if `idle_connections` drops too low, indicating potential connection churn.

Monitoring Application-Level Connections (PHP)

For PHP applications, you can instrument your code to track connection usage. Using PDO, you can get the number of active connections, though this is often managed by the framework or connection pooler.

<?php
// Assuming you have a PDO connection object $pdo

try {
    // This is a simplified approach; actual connection count might be managed by pooling.
    // For PDO, there isn't a direct method to get 'active' connections in the pool sense.
    // You'd typically monitor the pooler (like PgBouncer) or framework's connection manager.

    // Example: If you want to track connections opened by your script instance
    // This is NOT a pool monitor, but a count of PDO objects created.
    // A better approach is to use a dedicated monitoring tool or framework hooks.

    // Let's simulate tracking connections opened by this script instance
    static $connectionCount = 0;
    if (!isset($pdo) || $pdo === null) {
        // Establish connection (replace with your actual connection details)
        $dsn = "pgsql:host=your_db_host;port=5432;dbname=your_db";
        $user = 'your_db_user';
        $password = 'your_db_password';
        $options = [
            PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
            PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
            PDO::ATTR_EMULATE_PREPARES   => false,
            // Consider setting connection timeout
            PDO::ATTR_TIMEOUT => 5, // seconds
        ];
        $pdo = new PDO($dsn, $user, $password, $options);
        $connectionCount++;
        // Log this event or send a metric
        error_log("New PDO connection established. Total for this script instance: " . $connectionCount);
    }

    // To monitor pool usage, you'd typically query PgBouncer or your framework's metrics.
    // For example, if using a framework like Laravel, you might access its service container
    // and check its database connection manager's state, if exposed.

    // If you need to alert on *total* connections to the DB server, you'd use SQL:
    // SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
    // This would require a separate monitoring script/process.

} catch (\PDOException $e) {
    // Log error and potentially trigger an alert
    error_log("Database connection error: " . $e->getMessage());
    // Trigger alert mechanism here
    // e.g., send_alert_to_monitoring_system("DB Connection Failed", $e->getMessage());
    throw $e; // Re-throw to halt execution if necessary
}

// ... rest of your application logic
?>

For robust application-level connection monitoring, integrate with Prometheus client libraries for PHP or use APM tools that can expose connection pool metrics.

Resource Utilization Monitoring

Beyond basic CPU/RAM, PostgreSQL performance is heavily influenced by I/O, disk space, and specific PostgreSQL parameters like `shared_buffers` and `work_mem`. OVH’s control panel provides some metrics, but we need deeper insights.

Disk I/O and Space

Use `iostat` and `df` on the PostgreSQL servers. Monitor I/O wait times, read/write operations per second, and especially free disk space. Alerts should be triggered well before disks are full (e.g., at 80% and 90% utilization).

# Check disk space
df -hP | grep -vE '^Filesystem|tmpfs|cdrom'

# Check I/O statistics (run for a few seconds)
iostat -xz 5

A script can parse the output of `df -hP` and `iostat` to generate alerts. For example, a Python script using `subprocess` to run these commands and then analyze the output.

PostgreSQL Specific Metrics

PostgreSQL exposes a wealth of performance metrics via `pg_stat_activity`, `pg_stat_statements`, and `pg_stat_database`. Tools like `pg_monitor` (part of `pgBadger`) or direct Prometheus exporters (`postgres_exporter`) are invaluable here.

Using `pg_stat_statements`

Ensure `pg_stat_statements` is enabled in your `postgresql.conf` and loaded via `shared_preload_libraries`. This extension tracks execution statistics for all SQL statements executed by the server.

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all
pg_stat_statements.save = on

Then, query it to find the most expensive or frequently run queries:

SELECT
    calls,
    total_time,
    rows,
    mean_time,
    stddev_time,
    (total_time / calls) AS avg_time,
    substring(query, 1, 60) AS query
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;

SELECT
    calls,
    total_time,
    rows,
    mean_time,
    stddev_time,
    (total_time / calls) AS avg_time,
    substring(query, 1, 60) AS query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;

Alerting on queries that consistently have high `mean_time` or `calls` can help identify performance bottlenecks in your PHP application’s ORM or direct SQL usage.

PHP Application Health Checks

The PostgreSQL cluster is only one part of the equation. The PHP application itself needs robust health checks. This includes application-level errors, response times, and resource usage.

Error and Exception Monitoring

Implement comprehensive error logging. PHP’s `error_log()` function is basic; for production, use a dedicated logging library like Monolog. Configure Monolog to send logs to a centralized logging system (e.g., ELK stack, Graylog, Datadog).

Monolog Configuration Example

<?php
require 'vendor/autoload.php'; // Assuming Monolog is installed via Composer

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Handler\SlackWebhookHandler; // Example for Slack

// Create a logger
$logger = new Logger('app_logger');

// Log to a file
$logger->pushHandler(new StreamHandler('/var/log/php_app/app.log', Logger::DEBUG));

// Log to Slack (replace with your webhook URL)
$slackWebhookUrl = 'YOUR_SLACK_WEBHOOK_URL';
$slackHandler = new SlackWebhookHandler(
    $slackWebhookUrl,
    '#your-channel', // Optional: specify channel
    '@your-bot-name', // Optional: specify bot name
    false, // Optional: use attachment
    null, // Optional: icon emoji
    true, // Optional: use short attachment
    null, // Optional: icon url
    Logger::ERROR // Minimum level for Slack
);
$logger->pushHandler($slackHandler);

// --- Usage ---
try {
    // Your application logic...
    if (some_condition_fails()) {
        throw new \RuntimeException("A critical operation failed.");
    }
} catch (\Throwable $e) {
    // Log the exception with context
    $logger->error('An unexpected error occurred', [
        'message' => $e->getMessage(),
        'code' => $e->getCode(),
        'file' => $e->getFile(),
        'line' => $e->getLine(),
        'trace' => $e->getTraceAsString(),
        // Add any relevant application context (user ID, request ID, etc.)
        'user_id' => $_SESSION['user_id'] ?? 'anonymous',
        'request_id' => $_SERVER['HTTP_X_REQUEST_ID'] ?? 'N/A',
    ]);
}

// Log an informational message
$logger->info('User logged in successfully', ['user_id' => $user_id]);
?>

Configure PHP’s `error_reporting` and `display_errors` appropriately for your environment. In production, `display_errors` should be `Off`, and errors should be logged.

; php.ini
error_reporting = E_ALL & ~E_DEPRECATED & ~E_STRICT
display_errors = Off
log_errors = On
error_log = /var/log/php_errors/php_error.log
memory_limit = 256M
max_execution_time = 60

Application Performance Monitoring (APM)

For deeper insights into application performance, including response times, database query times, and external service calls, APM tools are essential. Options include New Relic, Datadog APM, or open-source solutions like Jaeger/Zipkin integrated with OpenTelemetry.

Basic Response Time Monitoring (PHP)

You can implement basic response time tracking by measuring the time from script start to output. This can be logged or sent as a metric.

<?php
// Start timer at the very beginning of your script (e.g., index.php)
define('APP_START_TIME', microtime(true));

// ... your application logic ...

// At the end of the script, before outputting headers/content
$responseTime = microtime(true) - APP_START_TIME;

// Log this response time or send it as a metric
// Example: Log to a file
file_put_contents('/var/log/php_app/response_times.log', date('Y-m-d H:i:s') . " - " . $responseTime . "s\n", FILE_APPEND);

// Or send to a metrics endpoint (e.g., Prometheus Pushgateway)
// This would require a client library for your metrics system.
// e.g., $metricsClient->sendGauge('http_response_time_seconds', $responseTime);

// You can also set an alert threshold
$response_time_threshold = 2.0; // seconds
if ($responseTime > $response_time_threshold) {
    // Trigger an alert
    // e.g., $logger->warning('High response time detected', ['response_time' => $responseTime]);
}
?>

OVH Specific Considerations

OVH’s infrastructure offers specific tools and considerations for monitoring. Leveraging these can simplify your setup.

OVH Control Panel Metrics

Familiarize yourself with the metrics available in the OVH control panel for your Public Cloud instances (CPU, RAM, Network I/O, Disk I/O). These are good for high-level overviews and initial troubleshooting but are often not granular enough for proactive alerting.

OVH Managed Databases (PostgreSQL)

If you are using OVH’s Managed Databases service, they provide built-in monitoring dashboards and alerting capabilities. Ensure you configure these to your needs. Key metrics to watch include:

CPU Usage
Memory Usage
Disk Usage
Network Traffic
Replication Lag (if applicable)
Connection Count

For custom monitoring beyond what OVH provides, you’ll typically need to install agents or run external checks against the database endpoints. OVH often provides access to the underlying host or allows SSH access for custom configurations.

Network Monitoring

Monitor network latency and packet loss between your application servers and the PostgreSQL cluster. Tools like `ping`, `mtr`, and `tcpdump` can be useful for diagnostics. For continuous monitoring, consider using external network monitoring services or agents that periodically probe connectivity and performance.

Centralized Monitoring Stack

To effectively manage monitoring across your PHP applications and PostgreSQL clusters, a centralized stack is crucial. A common and powerful stack includes:

Prometheus: For time-series metrics collection. Use exporters like `postgres_exporter` for PostgreSQL and application-specific exporters or client libraries for PHP.
Grafana: For visualization and dashboarding. Create dashboards to display key metrics from Prometheus.
Alertmanager: For handling alerts generated by Prometheus, deduplicating, grouping, and routing them to appropriate notification channels (email, Slack, PagerDuty).
ELK Stack (Elasticsearch, Logstash, Kibana) or Graylog: For centralized log management and analysis.

Setting up this stack on OVH Public Cloud instances (e.g., using Docker containers) provides a robust and scalable monitoring solution. Ensure your Prometheus server can reach your PostgreSQL instances (and vice-versa for exporters) and that your application servers can send logs to your log aggregation system.