Server Monitoring Best Practices: Keeping Your Laravel App and PostgreSQL Clusters Alive on OVH

Proactive PostgreSQL Cluster Health Checks

Maintaining the health of a PostgreSQL cluster, especially in a distributed environment like OVH, requires more than just basic uptime checks. We need to delve into resource utilization, replication lag, query performance, and potential deadlock scenarios. This section outlines essential checks and the tools to implement them.

1. Replication Lag Monitoring

Replication lag is a critical indicator of data consistency and availability. High lag can lead to stale reads and potential data loss during failovers. We’ll use a combination of PostgreSQL’s built-in functions and a simple Bash script for periodic checks.

First, let’s define a SQL query to fetch the replication lag. This query should be run on the replica(s).

SELECT
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn() AS replay_lsn,
    pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) AS lag_bytes,
    CASE
        WHEN pg_is_in_recovery() THEN
            EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp()))
        ELSE
            0
    END AS lag_seconds
FROM pg_stat_wal_receiver
WHERE status = 'streaming';

Now, let’s create a Bash script that executes this query and alerts if the lag exceeds a defined threshold. This script can be scheduled via cron.

#!/bin/bash

# Configuration
PG_HOST="your_replica_host"
PG_PORT="5432"
PG_USER="monitor_user"
PG_DB="postgres"
LAG_THRESHOLD_SECONDS=60 # Alert if lag is more than 60 seconds
ALERT_EMAIL="[email protected]"

# SQL query to get replication lag
SQL_QUERY="SELECT
    CASE
        WHEN pg_is_in_recovery() THEN
            EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp()))
        ELSE
            0
    END AS lag_seconds
FROM pg_stat_wal_receiver
WHERE status = 'streaming';"

# Execute the query
LAG_SECONDS=$(psql -h $PG_HOST -p $PG_PORT -U $PG_USER -d $PG_DB -t -c "$SQL_QUERY" 2>&1)

# Check for errors during query execution
if [[ $? -ne 0 ]]; then
    echo "Error executing PostgreSQL query on $PG_HOST: $LAG_SECONDS" | mail -s "PostgreSQL Replication Alert: Query Error on $PG_HOST" $ALERT_EMAIL
    exit 1
fi

# Check if LAG_SECONDS is a valid number
if ! [[ "$LAG_SECONDS" =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
    echo "Invalid lag value received from $PG_HOST: '$LAG_SECONDS'" | mail -s "PostgreSQL Replication Alert: Invalid Lag on $PG_HOST" $ALERT_EMAIL
    exit 1
fi

# Compare lag with threshold
if (( $(echo "$LAG_SECONDS > $LAG_THRESHOLD_SECONDS" | bc -l) )); then
    echo "PostgreSQL replication lag on $PG_HOST is ${LAG_SECONDS} seconds, exceeding threshold of ${LAG_THRESHOLD_SECONDS} seconds." | mail -s "PostgreSQL Replication Alert: High Lag on $PG_HOST" $ALERT_EMAIL
    exit 1
else
    echo "PostgreSQL replication lag on $PG_HOST is within acceptable limits (${LAG_SECONDS} seconds)."
    exit 0
fi

Ensure the monitor_user has the necessary permissions (e.g., `SELECT` on `pg_stat_wal_receiver`) and that the script has execute permissions. Schedule this script using cron, for example, to run every minute:

* * * * * /path/to/your/replication_check.sh

2. Connection and Resource Utilization

Monitoring active connections, CPU, memory, and disk I/O is crucial. We can leverage tools like pg_stat_activity, vmstat, and iostat. For a more integrated approach, consider Prometheus with the PostgreSQL exporter.

Here’s how to check active connections and long-running queries:

-- Total active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Long-running queries (e.g., longer than 5 minutes)
SELECT pid, age(clock_timestamp(), query_start), usename, query
FROM pg_stat_activity
WHERE state = 'active' AND query_start IS NOT NULL AND age(clock_timestamp(), query_start) > interval '5 minutes'
ORDER BY query_start;

For system-level metrics, we can use standard Linux tools. A script to check for excessive CPU or memory usage by PostgreSQL processes:

#!/bin/bash

# Configuration
CPU_THRESHOLD=80 # Percentage
MEM_THRESHOLD=80 # Percentage
ALERT_EMAIL="[email protected]"

# Get PostgreSQL PIDs
PG_PIDS=$(pgrep -f "postgres: ")

if [ -z "$PG_PIDS" ]; then
    echo "No PostgreSQL processes found."
    exit 0
fi

# Check CPU and Memory for each PostgreSQL process
for PID in $PG_PIDS; do
    CPU_USAGE=$(ps -p $PID -o %cpu --no-headers | awk '{print int($1)}')
    MEM_USAGE=$(ps -p $PID -o %mem --no-headers | awk '{print int($1)}')
    PG_CMD=$(ps -p $PID -o comm=)

    if [ "$CPU_USAGE" -gt "$CPU_THRESHOLD" ]; then
        echo "High CPU usage for PostgreSQL process (PID: $PID, Command: $PG_CMD): ${CPU_USAGE}% (Threshold: ${CPU_THRESHOLD}%)" | mail -s "PostgreSQL Resource Alert: High CPU on $(hostname)" $ALERT_EMAIL
    fi

    if [ "$MEM_USAGE" -gt "$MEM_THRESHOLD" ]; then
        echo "High Memory usage for PostgreSQL process (PID: $PID, Command: $PG_CMD): ${MEM_USAGE}% (Threshold: ${MEM_THRESHOLD}%)" | mail -s "PostgreSQL Resource Alert: High Memory on $(hostname)" $ALERT_EMAIL
    fi
done

exit 0

This script can also be cron-scheduled. For disk I/O, iostat -xd 5 will provide detailed statistics every 5 seconds. Look for high `%util` and `await` values.

3. Deadlock Detection

Deadlocks can halt application processes. PostgreSQL logs deadlocks by default if log_lock_waits is enabled and deadlock_timeout is set appropriately (e.g., 1s to 5s). We can monitor the PostgreSQL logs for deadlock messages.

Ensure your postgresql.conf has these settings:

log_lock_waits = on
deadlock_timeout = 1s
log_statement = 'none' # Or 'ddl', 'mod', 'all' depending on your needs, but avoid logging all queries in production for performance reasons.
log_destination = 'stderr' # Or 'syslog'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%Y-%m-%d.log'

Then, use a log monitoring tool like logwatch, goaccess, or a custom script with grep to scan the PostgreSQL logs for “deadlock detected” messages and trigger alerts.

#!/bin/bash

LOG_FILE="/var/log/postgresql/postgresql-*.log" # Adjust path as needed
ALERT_EMAIL="[email protected]"
LAST_CHECK_FILE="/tmp/last_deadlock_check.timestamp"

# Get current timestamp
CURRENT_TIMESTAMP=$(date +%s)

# Get timestamp of the last check, default to 0 if file doesn't exist
LAST_CHECK_TIMESTAMP=$(cat $LAST_CHECK_FILE 2>/dev/null || echo 0)

# Find log entries since the last check
# Using find with -mmin to check files modified within the last minute, then grep
# This is a simplified approach; for robust log parsing, consider tools like ELK stack or Graylog.
find $(dirname $LOG_FILE) -name $(basename $LOG_FILE) -type f -mmin -1 -print0 | xargs -0 grep "deadlock detected" --with-filename --line-number | while IFS= read -r line; do
    # Extract timestamp from log line if available and parse it
    # This part is highly dependent on your log format. Assuming ISO 8601 format for simplicity.
    LOG_TIMESTAMP_STR=$(echo "$line" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' | head -n 1)
    if [ -n "$LOG_TIMESTAMP_STR" ]; then
        LOG_TIMESTAMP=$(date -d "$LOG_TIMESTAMP_STR" +%s)
        if [ "$LOG_TIMESTAMP" -gt "$LAST_CHECK_TIMESTAMP" ]; then
            echo "Deadlock detected: $line" | mail -s "PostgreSQL Alert: Deadlock Detected on $(hostname)" $ALERT_EMAIL
        fi
    else
        # If timestamp parsing fails, alert on any new deadlock message
        echo "Deadlock detected (timestamp parsing failed): $line" | mail -s "PostgreSQL Alert: Deadlock Detected on $(hostname)" $ALERT_EMAIL
    fi
done

# Update the last check timestamp
echo $CURRENT_TIMESTAMP > $LAST_CHECK_FILE

exit 0

This script needs careful tuning based on your log format and rotation strategy. For production, a dedicated log aggregation and alerting system is highly recommended.

Laravel Application Monitoring

Monitoring the Laravel application itself involves tracking request performance, errors, queue performance, and database query efficiency. We’ll focus on practical implementation using built-in Laravel features and external tools.

1. Error and Exception Tracking

Laravel’s robust error handling is a great starting point. For production, you need a centralized error tracking service. Services like Sentry, Bugsnag, or Flare (a Laravel-specific option) are invaluable.

Installation typically involves:

composer require sentry/sentry-laravel
php artisan sentry:install YOUR_DSN

Once configured, exceptions thrown in your Laravel application will be automatically reported to your chosen service. You can also manually report errors:

use Sentry\Laravel\Facade\Sentry;

try {
    // Some operation that might fail
    throw new \Exception("Something went wrong!");
} catch (\Throwable $e) {
    Sentry::captureException($e);
    // Optionally re-throw or handle the exception
    // throw $e;
}

2. Performance Monitoring (APM)

Application Performance Monitoring (APM) tools provide deep insights into request latency, database query times, external HTTP calls, and more. New Relic, Datadog, and Dynatrace are popular choices.

For example, integrating the Datadog APM agent:

# Install the Datadog agent (example for Ubuntu/Debian)
DD_AGENT_MAJOR_VERSION=7 DD_API_KEY=YOUR_DATADOG_API_KEY DD_SITE="datadoghq.eu" bash -c "$(curl -L https://sentry.io/get-started/agent/)"

# Configure the PHP agent
# Edit php.ini or a conf.d file to include the Datadog extension
zend_extension=/path/to/datadog-php-tracer.so
datadog.enabled=true
datadog.service=your-laravel-app-name
datadog.agent_host=127.0.0.1
datadog.agent_port=8126

After restarting your web server (e.g., Nginx/PHP-FPM), requests handled by your Laravel application will be traced. You can then view performance metrics in the Datadog dashboard.

3. Queue Monitoring

Laravel queues are critical for background processing. Monitoring queue length, failed jobs, and processing times is essential. Laravel Horizon provides an excellent dashboard and monitoring capabilities for queues.

Installation:

composer require laravel/horizon
php artisan horizon:install
npm install && npm run dev
php artisan migrate
php artisan horizon

Horizon provides a web UI (accessible at `/horizon`) that shows:

Current number of jobs in each queue.
Number of failed jobs.
Runtime statistics for workers.
Ability to pause/resume queues and retry failed jobs.

For external monitoring, you can use the queue:monitor command provided by Horizon or custom scripts to check queue lengths. For example, to check the number of jobs in the default queue:

php artisan queue:size default

You can integrate this into your cron jobs and alerting system. Set thresholds for queue size to trigger alerts when jobs are backing up.

4. Database Query Optimization within Laravel

Slow database queries are a common bottleneck. Laravel’s query builder and Eloquent ORM can sometimes generate inefficient queries. Monitoring these requires profiling.

Enable Query Logging (Development/Staging):

// In AppServiceProvider or a dedicated debug service provider
public function boot()
{
    if (config('app.debug')) {
        \DB::listen(function($query) {
            \Log::info(
                $query->sql,
                $query->bindings,
                $query->time
            );
        });
    }
}

Production Profiling: For production, use APM tools (like Datadog, New Relic) which automatically capture slow queries. Alternatively, tools like Laravel Debugbar (use with caution in production, ideally only enabled for specific admin routes or via a toggle) can help identify N+1 query problems and slow queries.

Specific Checks:

N+1 Query Detection: Use Eloquent’s with() eager loading to avoid fetching related models in a loop. Debugbar or APM tools will highlight these.
Index Usage: Ensure your database queries are using appropriate indexes. You can analyze query plans using EXPLAIN ANALYZE in PostgreSQL.
Large Result Sets: Avoid fetching thousands of records directly into PHP arrays if not necessary. Use pagination or stream results.

OVH Specific Considerations

When operating on OVH, several factors specific to the cloud provider need attention:

1. Network Latency and Bandwidth

OVH’s network performance can vary. Monitor inter-instance latency, especially between your Laravel application servers and PostgreSQL cluster nodes. Tools like ping, mtr, and iperf3 can help diagnose network issues. High latency can directly impact application response times and replication lag.

2. Disk Performance

The type of storage you choose on OVH (e.g., standard HDD, SSD, NVMe) significantly impacts database performance. Monitor disk I/O metrics (IOPS, throughput, latency) using iostat or cloud provider metrics. Ensure your PostgreSQL data directories are on performant storage, especially for write-heavy workloads.

3. OVH API and Service Status

While not directly application monitoring, staying aware of OVH’s service status and API health is important. Subscribe to OVH’s status page notifications. If you are automating infrastructure tasks via the OVH API, implement retry logic and monitor API response times and error rates.

4. Firewall and Security Groups

Ensure your firewall rules (e.g., UFW on Linux, OVH’s network firewall) are correctly configured to allow necessary traffic (PostgreSQL port 5432, application ports) while blocking unnecessary access. Monitor firewall logs for suspicious activity or denied connections.

Conclusion

A robust monitoring strategy for a Laravel application and its PostgreSQL cluster on OVH involves a multi-layered approach. Combine application-level insights (errors, performance, queues) with database-specific metrics (replication, resource usage, deadlocks) and infrastructure awareness (network, disk). Automate checks, set meaningful alerts, and leverage specialized tools for comprehensive visibility. This proactive approach is key to maintaining high availability and performance.