Server Monitoring Best Practices: Keeping Your Laravel App and PostgreSQL Clusters Alive on OVH
Proactive PostgreSQL Cluster Health Checks
Maintaining the health of a PostgreSQL cluster, especially in a distributed environment like OVH, requires more than just basic uptime checks. We need to delve into resource utilization, replication lag, query performance, and potential deadlock scenarios. This section outlines essential checks and the tools to implement them.
1. Replication Lag Monitoring
Replication lag is a critical indicator of data consistency and availability. High lag can lead to stale reads and potential data loss during failovers. We’ll use a combination of PostgreSQL’s built-in functions and a simple Bash script for periodic checks.
First, let’s define a SQL query to fetch the replication lag. This query should be run on the replica(s).
SELECT
pg_last_wal_receive_lsn() AS receive_lsn,
pg_last_wal_replay_lsn() AS replay_lsn,
pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) AS lag_bytes,
CASE
WHEN pg_is_in_recovery() THEN
EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp()))
ELSE
0
END AS lag_seconds
FROM pg_stat_wal_receiver
WHERE status = 'streaming';
Now, let’s create a Bash script that executes this query and alerts if the lag exceeds a defined threshold. This script can be scheduled via cron.
#!/bin/bash # Configuration PG_HOST="your_replica_host" PG_PORT="5432" PG_USER="monitor_user" PG_DB="postgres" LAG_THRESHOLD_SECONDS=60 # Alert if lag is more than 60 seconds ALERT_EMAIL="[email protected]" # SQL query to get replication lag SQL_QUERY="SELECT CASE WHEN pg_is_in_recovery() THEN EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) ELSE 0 END AS lag_seconds FROM pg_stat_wal_receiver WHERE status = 'streaming';" # Execute the query LAG_SECONDS=$(psql -h $PG_HOST -p $PG_PORT -U $PG_USER -d $PG_DB -t -c "$SQL_QUERY" 2>&1) # Check for errors during query execution if [[ $? -ne 0 ]]; then echo "Error executing PostgreSQL query on $PG_HOST: $LAG_SECONDS" | mail -s "PostgreSQL Replication Alert: Query Error on $PG_HOST" $ALERT_EMAIL exit 1 fi # Check if LAG_SECONDS is a valid number if ! [[ "$LAG_SECONDS" =~ ^[0-9]+(\.[0-9]+)?$ ]]; then echo "Invalid lag value received from $PG_HOST: '$LAG_SECONDS'" | mail -s "PostgreSQL Replication Alert: Invalid Lag on $PG_HOST" $ALERT_EMAIL exit 1 fi # Compare lag with threshold if (( $(echo "$LAG_SECONDS > $LAG_THRESHOLD_SECONDS" | bc -l) )); then echo "PostgreSQL replication lag on $PG_HOST is ${LAG_SECONDS} seconds, exceeding threshold of ${LAG_THRESHOLD_SECONDS} seconds." | mail -s "PostgreSQL Replication Alert: High Lag on $PG_HOST" $ALERT_EMAIL exit 1 else echo "PostgreSQL replication lag on $PG_HOST is within acceptable limits (${LAG_SECONDS} seconds)." exit 0 fi
Ensure the monitor_user has the necessary permissions (e.g., `SELECT` on `pg_stat_wal_receiver`) and that the script has execute permissions. Schedule this script using cron, for example, to run every minute:
* * * * * /path/to/your/replication_check.sh
2. Connection and Resource Utilization
Monitoring active connections, CPU, memory, and disk I/O is crucial. We can leverage tools like pg_stat_activity, vmstat, and iostat. For a more integrated approach, consider Prometheus with the PostgreSQL exporter.
Here’s how to check active connections and long-running queries:
-- Total active connections SELECT count(*) FROM pg_stat_activity WHERE state = 'active'; -- Long-running queries (e.g., longer than 5 minutes) SELECT pid, age(clock_timestamp(), query_start), usename, query FROM pg_stat_activity WHERE state = 'active' AND query_start IS NOT NULL AND age(clock_timestamp(), query_start) > interval '5 minutes' ORDER BY query_start;
For system-level metrics, we can use standard Linux tools. A script to check for excessive CPU or memory usage by PostgreSQL processes:
#!/bin/bash # Configuration CPU_THRESHOLD=80 # Percentage MEM_THRESHOLD=80 # Percentage ALERT_EMAIL="[email protected]" # Get PostgreSQL PIDs PG_PIDS=$(pgrep -f "postgres: ") if [ -z "$PG_PIDS" ]; then echo "No PostgreSQL processes found." exit 0 fi # Check CPU and Memory for each PostgreSQL process for PID in $PG_PIDS; do CPU_USAGE=$(ps -p $PID -o %cpu --no-headers | awk '{print int($1)}') MEM_USAGE=$(ps -p $PID -o %mem --no-headers | awk '{print int($1)}') PG_CMD=$(ps -p $PID -o comm=) if [ "$CPU_USAGE" -gt "$CPU_THRESHOLD" ]; then echo "High CPU usage for PostgreSQL process (PID: $PID, Command: $PG_CMD): ${CPU_USAGE}% (Threshold: ${CPU_THRESHOLD}%)" | mail -s "PostgreSQL Resource Alert: High CPU on $(hostname)" $ALERT_EMAIL fi if [ "$MEM_USAGE" -gt "$MEM_THRESHOLD" ]; then echo "High Memory usage for PostgreSQL process (PID: $PID, Command: $PG_CMD): ${MEM_USAGE}% (Threshold: ${MEM_THRESHOLD}%)" | mail -s "PostgreSQL Resource Alert: High Memory on $(hostname)" $ALERT_EMAIL fi done exit 0
This script can also be cron-scheduled. For disk I/O, iostat -xd 5 will provide detailed statistics every 5 seconds. Look for high `%util` and `await` values.
3. Deadlock Detection
Deadlocks can halt application processes. PostgreSQL logs deadlocks by default if log_lock_waits is enabled and deadlock_timeout is set appropriately (e.g., 1s to 5s). We can monitor the PostgreSQL logs for deadlock messages.
Ensure your postgresql.conf has these settings:
log_lock_waits = on deadlock_timeout = 1s log_statement = 'none' # Or 'ddl', 'mod', 'all' depending on your needs, but avoid logging all queries in production for performance reasons. log_destination = 'stderr' # Or 'syslog' logging_collector = on log_directory = 'log' log_filename = 'postgresql-%Y-%m-%d.log'
Then, use a log monitoring tool like logwatch, goaccess, or a custom script with grep to scan the PostgreSQL logs for “deadlock detected” messages and trigger alerts.
#!/bin/bash LOG_FILE="/var/log/postgresql/postgresql-*.log" # Adjust path as needed ALERT_EMAIL="[email protected]" LAST_CHECK_FILE="/tmp/last_deadlock_check.timestamp" # Get current timestamp CURRENT_TIMESTAMP=$(date +%s) # Get timestamp of the last check, default to 0 if file doesn't exist LAST_CHECK_TIMESTAMP=$(cat $LAST_CHECK_FILE 2>/dev/null || echo 0) # Find log entries since the last check # Using find with -mmin to check files modified within the last minute, then grep # This is a simplified approach; for robust log parsing, consider tools like ELK stack or Graylog. find $(dirname $LOG_FILE) -name $(basename $LOG_FILE) -type f -mmin -1 -print0 | xargs -0 grep "deadlock detected" --with-filename --line-number | while IFS= read -r line; do # Extract timestamp from log line if available and parse it # This part is highly dependent on your log format. Assuming ISO 8601 format for simplicity. LOG_TIMESTAMP_STR=$(echo "$line" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' | head -n 1) if [ -n "$LOG_TIMESTAMP_STR" ]; then LOG_TIMESTAMP=$(date -d "$LOG_TIMESTAMP_STR" +%s) if [ "$LOG_TIMESTAMP" -gt "$LAST_CHECK_TIMESTAMP" ]; then echo "Deadlock detected: $line" | mail -s "PostgreSQL Alert: Deadlock Detected on $(hostname)" $ALERT_EMAIL fi else # If timestamp parsing fails, alert on any new deadlock message echo "Deadlock detected (timestamp parsing failed): $line" | mail -s "PostgreSQL Alert: Deadlock Detected on $(hostname)" $ALERT_EMAIL fi done # Update the last check timestamp echo $CURRENT_TIMESTAMP > $LAST_CHECK_FILE exit 0
This script needs careful tuning based on your log format and rotation strategy. For production, a dedicated log aggregation and alerting system is highly recommended.
Laravel Application Monitoring
Monitoring the Laravel application itself involves tracking request performance, errors, queue performance, and database query efficiency. We’ll focus on practical implementation using built-in Laravel features and external tools.
1. Error and Exception Tracking
Laravel’s robust error handling is a great starting point. For production, you need a centralized error tracking service. Services like Sentry, Bugsnag, or Flare (a Laravel-specific option) are invaluable.
Installation typically involves:
composer require sentry/sentry-laravel php artisan sentry:install YOUR_DSN
Once configured, exceptions thrown in your Laravel application will be automatically reported to your chosen service. You can also manually report errors:
use Sentry\Laravel\Facade\Sentry;
try {
// Some operation that might fail
throw new \Exception("Something went wrong!");
} catch (\Throwable $e) {
Sentry::captureException($e);
// Optionally re-throw or handle the exception
// throw $e;
}
2. Performance Monitoring (APM)
Application Performance Monitoring (APM) tools provide deep insights into request latency, database query times, external HTTP calls, and more. New Relic, Datadog, and Dynatrace are popular choices.
For example, integrating the Datadog APM agent:
# Install the Datadog agent (example for Ubuntu/Debian) DD_AGENT_MAJOR_VERSION=7 DD_API_KEY=YOUR_DATADOG_API_KEY DD_SITE="datadoghq.eu" bash -c "$(curl -L https://sentry.io/get-started/agent/)" # Configure the PHP agent # Edit php.ini or a conf.d file to include the Datadog extension zend_extension=/path/to/datadog-php-tracer.so datadog.enabled=true datadog.service=your-laravel-app-name datadog.agent_host=127.0.0.1 datadog.agent_port=8126
After restarting your web server (e.g., Nginx/PHP-FPM), requests handled by your Laravel application will be traced. You can then view performance metrics in the Datadog dashboard.
3. Queue Monitoring
Laravel queues are critical for background processing. Monitoring queue length, failed jobs, and processing times is essential. Laravel Horizon provides an excellent dashboard and monitoring capabilities for queues.
Installation:
composer require laravel/horizon php artisan horizon:install npm install && npm run dev php artisan migrate php artisan horizon
Horizon provides a web UI (accessible at `/horizon`) that shows:
- Current number of jobs in each queue.
- Number of failed jobs.
- Runtime statistics for workers.
- Ability to pause/resume queues and retry failed jobs.
For external monitoring, you can use the queue:monitor command provided by Horizon or custom scripts to check queue lengths. For example, to check the number of jobs in the default queue:
php artisan queue:size default
You can integrate this into your cron jobs and alerting system. Set thresholds for queue size to trigger alerts when jobs are backing up.
4. Database Query Optimization within Laravel
Slow database queries are a common bottleneck. Laravel’s query builder and Eloquent ORM can sometimes generate inefficient queries. Monitoring these requires profiling.
Enable Query Logging (Development/Staging):
// In AppServiceProvider or a dedicated debug service provider
public function boot()
{
if (config('app.debug')) {
\DB::listen(function($query) {
\Log::info(
$query->sql,
$query->bindings,
$query->time
);
});
}
}
Production Profiling: For production, use APM tools (like Datadog, New Relic) which automatically capture slow queries. Alternatively, tools like Laravel Debugbar (use with caution in production, ideally only enabled for specific admin routes or via a toggle) can help identify N+1 query problems and slow queries.
Specific Checks:
- N+1 Query Detection: Use Eloquent’s
with()eager loading to avoid fetching related models in a loop. Debugbar or APM tools will highlight these. - Index Usage: Ensure your database queries are using appropriate indexes. You can analyze query plans using
EXPLAIN ANALYZEin PostgreSQL. - Large Result Sets: Avoid fetching thousands of records directly into PHP arrays if not necessary. Use pagination or stream results.
OVH Specific Considerations
When operating on OVH, several factors specific to the cloud provider need attention:
1. Network Latency and Bandwidth
OVH’s network performance can vary. Monitor inter-instance latency, especially between your Laravel application servers and PostgreSQL cluster nodes. Tools like ping, mtr, and iperf3 can help diagnose network issues. High latency can directly impact application response times and replication lag.
2. Disk Performance
The type of storage you choose on OVH (e.g., standard HDD, SSD, NVMe) significantly impacts database performance. Monitor disk I/O metrics (IOPS, throughput, latency) using iostat or cloud provider metrics. Ensure your PostgreSQL data directories are on performant storage, especially for write-heavy workloads.
3. OVH API and Service Status
While not directly application monitoring, staying aware of OVH’s service status and API health is important. Subscribe to OVH’s status page notifications. If you are automating infrastructure tasks via the OVH API, implement retry logic and monitor API response times and error rates.
4. Firewall and Security Groups
Ensure your firewall rules (e.g., UFW on Linux, OVH’s network firewall) are correctly configured to allow necessary traffic (PostgreSQL port 5432, application ports) while blocking unnecessary access. Monitor firewall logs for suspicious activity or denied connections.
Conclusion
A robust monitoring strategy for a Laravel application and its PostgreSQL cluster on OVH involves a multi-layered approach. Combine application-level insights (errors, performance, queues) with database-specific metrics (replication, resource usage, deadlocks) and infrastructure awareness (network, disk). Automate checks, set meaningful alerts, and leverage specialized tools for comprehensive visibility. This proactive approach is key to maintaining high availability and performance.