Server Monitoring Best Practices: Keeping Your PHP App and PostgreSQL Clusters Alive on DigitalOcean
Proactive PostgreSQL Cluster Health Checks
Maintaining the health of a PostgreSQL cluster, especially in a distributed environment like DigitalOcean, requires more than just basic uptime checks. We need to monitor internal cluster metrics that indicate potential performance degradation or impending failures. This involves querying PostgreSQL itself for vital statistics.
A crucial metric is the replication lag. High replication lag on a standby server can lead to data inconsistencies and extended downtime during failover. We can query the `pg_stat_replication` view on the primary to get this information.
Monitoring Replication Lag
The following SQL query, executed on the primary PostgreSQL instance, will provide the replication lag in seconds for each connected standby. We’ll focus on the `write_lag` and `flush_lag` as key indicators.
SELECT
application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS write_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS flush_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS write_lag_mb,
pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) / 1024 / 1024 AS flush_lag_mb,
state
FROM
pg_stat_replication
WHERE
state = 'streaming';
To automate this, we can create a simple script that runs this query and triggers an alert if the lag exceeds a predefined threshold (e.g., 60 seconds). This script can be executed via cron.
Automating Replication Lag Checks with Bash and `psql`
Here’s a bash script that connects to the PostgreSQL primary, executes the query, and checks the `write_lag_bytes`. We’ll assume you have `psql` installed and configured to connect without a password using a `.pgpass` file or similar mechanism.
#!/bin/bash
PG_HOST="your_pg_primary_host"
PG_PORT="5432"
PG_USER="monitor_user"
ALERT_THRESHOLD_BYTES=629145600 # 600MB as an example threshold
# SQL query to get replication lag in bytes
SQL_QUERY="SELECT
application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS write_lag_bytes
FROM
pg_stat_replication
WHERE
state = 'streaming';"
# Execute the query and capture output
LAG_DATA=$(PGPASSWORD=$PG_PASSWORD psql -h $PG_HOST -p $PG_PORT -U $PG_USER -d postgres -t -c "$SQL_QUERY")
# Check if the query returned any data
if [ -z "$LAG_DATA" ]; then
echo "No streaming replication found or error executing query."
# Consider sending an alert here for unexpected lack of replication
exit 1
fi
# Process each line of the output
echo "$LAG_DATA" | while IFS='|' read -r app_name write_lag_bytes; do
# Trim whitespace
app_name=$(echo "$app_name" | xargs)
write_lag_bytes=$(echo "$write_lag_bytes" | xargs)
if [ -n "$write_lag_bytes" ] && [ "$write_lag_bytes" -gt "$ALERT_THRESHOLD_BYTES" ]; then
echo "ALERT: Replication lag for '$app_name' is ${write_lag_bytes} bytes, exceeding threshold of ${ALERT_THRESHOLD_BYTES} bytes."
# Trigger your alerting mechanism here (e.g., send to Slack, PagerDuty)
# curl -X POST -H 'Content-type: application/json' --data '{"text":"ALERT: Replication lag for '"$app_name"' is '"$write_lag_bytes"' bytes, exceeding threshold of '"$ALERT_THRESHOLD_BYTES"' bytes."}' YOUR_SLACK_WEBHOOK_URL
else
echo "OK: Replication lag for '$app_name' is ${write_lag_bytes} bytes."
fi
done
exit 0
Remember to set up a dedicated monitoring user in PostgreSQL with appropriate permissions to view `pg_stat_replication`. Also, ensure the `PGPASSWORD` environment variable is handled securely, perhaps by using a secrets management system or a properly configured `.pgpass` file.
PHP Application Performance Monitoring (APM) on DigitalOcean
For PHP applications, monitoring goes beyond just server resource utilization. We need to understand the performance of the application itself: request latency, error rates, slow database queries, and external API call durations. This is where Application Performance Monitoring (APM) tools become indispensable.
Integrating OpenTelemetry with PHP
OpenTelemetry is becoming the de facto standard for observability. It provides a vendor-neutral way to instrument your code and export telemetry data (traces, metrics, logs) to various backends. For PHP, the OpenTelemetry SDK is available.
First, ensure you have the OpenTelemetry PHP extension installed. This typically involves compiling it or using a pre-built package if available for your PHP version and OS.
# Example using PECL (if available and compiled) pecl install opentelemetry
Next, configure your PHP application to use the OpenTelemetry SDK. This involves setting up an exporter to send data to a collector or directly to a backend like Jaeger, Prometheus, or a commercial APM service.
<?php
require 'vendor/autoload.php'; // Assuming you use Composer
use OpenTelemetry\API\Trace\TracerProvider;
use OpenTelemetry\SDK\Trace\SpanProcessor\BatchSpanProcessor;
use OpenTelemetry\SDK\Trace\TracerProvider as SDKTracerProvider;
use OpenTelemetry\SDK\Trace\Exporter\ConsoleExporter; // For debugging, replace with actual exporter
use OpenTelemetry\SDK\Trace\Sampler\AlwaysOnSampler;
use OpenTelemetry\Context\Propagation\TextMapPropagator;
use OpenTelemetry\API\Globals;
// Initialize TracerProvider
$tracerProvider = new SDKTracerProvider(
new AlwaysOnSampler(),
new BatchSpanProcessor(new ConsoleExporter()) // Replace ConsoleExporter with your actual exporter (e.g., JaegerExporter, OTLPExporter)
);
// Set the global TracerProvider
Globals::setTracerProvider($tracerProvider);
// Get a tracer instance
$tracer = $tracerProvider->getTracer('my-php-app');
// Start a span for an incoming request
$span = $tracer->spanBuilder('incoming_request')
->setSpanKind(\OpenTelemetry\API\Trace\SpanKind::KIND_SERVER)
->start();
try {
// Your application logic here
// Example: Simulate a database query
$dbSpan = $tracer->spanBuilder('database_query')
->setSpanKind(\OpenTelemetry\API\Trace\SpanKind::KIND_CLIENT)
->start();
usleep(rand(50000, 200000)); // Simulate query time
$dbSpan->end();
// Example: Simulate an external API call
$apiSpan = $tracer->spanBuilder('external_api_call')
->setSpanKind(\OpenTelemetry\API\Trace\SpanKind::KIND_CLIENT)
->start();
usleep(rand(100000, 500000)); // Simulate API call time
$apiSpan->end();
// End the main request span
$span->end();
} catch (\Throwable $e) {
// Record exception and end span
$span->recordException($e);
$span->setStatus(\OpenTelemetry\API\Trace\StatusCode::STATUS_ERROR, $e->getMessage());
$span->end();
throw $e; // Re-throw the exception
}
// Ensure all spans are flushed before script exit (important for CLI scripts)
$tracerProvider->shutdown();
?>
For web requests, you’ll typically integrate this instrumentation within your framework’s request lifecycle (e.g., in a middleware or event listener). For background jobs or CLI scripts, ensure `shutdown()` is called to flush any buffered spans.
Choosing an APM Backend
For DigitalOcean deployments, consider these options:
- Jaeger: An open-source, end-to-end distributed tracing system. You can deploy Jaeger on a DigitalOcean Droplet or use a managed service.
- Prometheus + Grafana: While primarily for metrics, you can export traces to Tempo or use Prometheus’s tracing capabilities. Grafana provides excellent visualization.
- Commercial APM Services: Datadog, New Relic, Dynatrace offer managed solutions that integrate well with OpenTelemetry. They often provide more advanced features and support.
The choice depends on your budget, operational overhead tolerance, and feature requirements. For a self-hosted solution, deploying a Jaeger agent on your PHP application servers and a collector/query service elsewhere is a common pattern.
System-Level Monitoring with Prometheus and Node Exporter
Beyond application-specific metrics, robust server monitoring requires understanding the underlying system resources: CPU, memory, disk I/O, network traffic. Prometheus, coupled with `node_exporter`, is a powerful combination for this.
Deploying Node Exporter on DigitalOcean Droplets
`node_exporter` is a simple binary that exposes a wide range of system metrics. You can download the latest release from the official Prometheus GitHub repository.
# Download the latest release (adjust version and architecture as needed)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
# Run node_exporter (for testing)
./node_exporter
# For production, run as a systemd service
sudo cp node_exporter /usr/local/bin/
sudo useradd -rs /bin/false node_exporter
# Create systemd service file
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.disable-defaults \
--collector.cpu \
--collector.diskstats \
--collector.filesystem \
--collector.loadavg \
--collector.meminfo \
--collector.netdev \
--collector.stat \
--collector.time \
--web.listen-address=":9100"
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd, enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter
This configuration enables a curated set of collectors. You can customize this further based on your needs. Ensure port 9100 is accessible from your Prometheus server (or the collector if you’re using one).
Configuring Prometheus to Scrape Node Exporter
On your Prometheus server (which could also be a Droplet), you need to configure `prometheus.yml` to scrape the `node_exporter` instances. If you have multiple Droplets, you’ll use service discovery or static configuration.
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
scrape_configs:
# Scrape Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Scrape PostgreSQL exporter (if you're using one, e.g., pg_exporter)
# - job_name: 'postgres'
# static_configs:
# - targets: ['your_pg_exporter_host:9187'] # Default port for pg_exporter
# Scrape Node Exporter for your PHP app servers
- job_name: 'php_app_nodes'
static_configs:
- targets:
- 'droplet_ip_1:9100'
- 'droplet_ip_2:9100'
- 'droplet_ip_3:9100'
# If using service discovery (e.g., DigitalOcean integration with Prometheus Operator)
# you would configure that here instead of static_configs.
# Scrape Node Exporter for your PostgreSQL cluster nodes
- job_name: 'postgres_nodes'
static_configs:
- targets:
- 'pg_primary_ip:9100'
- 'pg_standby_1_ip:9100'
- 'pg_standby_2_ip:9100'
After updating `prometheus.yml`, reload the Prometheus configuration: `curl -X POST http://localhost:9090/-/reload`.
Alerting with Alertmanager
Collecting metrics is only half the battle; you need to be notified when things go wrong. Prometheus integrates with Alertmanager for sophisticated alerting.
Configuring Alerting Rules
Alerting rules are defined in separate YAML files and loaded by Prometheus. Here are examples for PostgreSQL replication lag and high CPU usage.
# rules/postgres_alerts.yml
groups:
- name: postgresql.rules
rules:
- alert: HighReplicationLag
expr: |
avg by (application_name) (
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 > 1024 # 1GB lag threshold
)
for: 5m
labels:
severity: critical
annotations:
summary: "High replication lag detected for {{ $labels.application_name }}"
description: "PostgreSQL standby {{ $labels.application_name }} has a replication lag of {{ $value | humanize1024 }} MB, exceeding the 1GB threshold for 5 minutes."
# rules/node_alerts.yml
groups:
- name: node.rules
rules:
- alert: HighCpuUsage
expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "Instance {{ $labels.instance }} has been experiencing CPU usage above 90% for the last 10 minutes."
Ensure these rule files are referenced in your `prometheus.yml` under the `rule_files` directive.
Configuring Alertmanager
Alertmanager handles deduplication, grouping, and routing of alerts to various receivers like email, Slack, PagerDuty, etc. A basic `alertmanager.yml` might look like this:
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default-receiver' # Default receiver if no specific match
receivers:
- name: 'default-receiver'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK_URL'
channel: '#alerts'
send_resolved: true
title: '[{{ .Status | toUpper }}{{ if .CommonLabels.severity }} - {{ .CommonLabels.severity }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}'
text: '{{ range .Alerts }}*Alert:* {{ .Annotations.summary }}\n*Description:* {{ .Annotations.description }}\n*Details:* {{ range .Labels.SortedPairs }} {{ .Name }}={{ .Value }}{{ end }}\n{{ end }}'
inhibit_rules:
- target_match:
severity: 'critical'
source_match:
severity: 'warning'
equal: ['alertname', 'cluster', 'service']
Prometheus needs to be configured to send alerts to Alertmanager. Add the following to your `prometheus.yml`:
alerting:
alertmanagers:
- static_configs:
- targets:
- 'localhost:9093' # Assuming Alertmanager is running on the same server
Reload Prometheus after updating its configuration.
Log Aggregation and Analysis
Centralized logging is crucial for debugging and auditing. For PHP applications and PostgreSQL, collecting logs from various sources and making them searchable is key.
Using Fluentd/Fluent Bit with DigitalOcean
Fluentd or its lighter sibling, Fluent Bit, are excellent choices for log collection. They can tail log files, parse them, and forward them to various destinations.
On each Droplet running your PHP app or PostgreSQL, you can deploy Fluent Bit as a daemon. It can collect logs from:
- PHP error logs (e.g., `error_log` in PHP-FPM configuration)
- Nginx/Apache access and error logs
- PostgreSQL logs (configured via `postgresql.conf`)
[SERVICE]
Flush 5
Daemon On
Log_Level info
Parsers_File /etc/fluent-bit/parsers.conf
[INPUT]
Name tail
Path /var/log/php-fpm/error.log
Tag php-fpm.error
Parser php_fpm_error
[INPUT]
Name tail
Path /var/log/nginx/access.log
Tag nginx.access
Parser nginx
[INPUT]
Name tail
Path /var/log/postgresql/postgresql-*.log
Tag postgres.log
Parser postgres_log
[OUTPUT]
Name forward
Host your_log_aggregator_host
Port 24224
Retry_Limit False
You’ll need a corresponding `parsers.conf` file to correctly parse these logs. For example:
[PARSER]
Name php_fpm_error
Format regex
Regex ^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+\[(?P<level>\w+)\]\s+(?P<message>.*)$
Time_Key time
Time_Format %Y-%m-%d %H:%M:%S
[PARSER]
Name nginx
Format nginx
[PARSER]
Name postgres_log
Format regex
Regex ^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+\[(?P<level>\w+)\]\s+\[(?P<pid>\d+)\]\s+(?P<message>.*)$
Time_Key time
Time_Format %Y-%m-%d %H:%M:%S
The logs are then forwarded to a central log aggregator. Options include:
- ELK Stack (Elasticsearch, Logstash, Kibana): A powerful, albeit resource-intensive, solution.
- Loki + Grafana: A more lightweight, cost-effective alternative that integrates seamlessly with Prometheus and Grafana.
- Commercial Log Management Services: Splunk, Datadog Logs, Loggly.
For a DigitalOcean environment, Loki is often a compelling choice due to its integration with Grafana and Prometheus, simplifying the observability stack.
Conclusion: A Layered Approach to Observability
Effective server monitoring for a PHP application and PostgreSQL cluster on DigitalOcean is a multi-faceted endeavor. It requires a layered approach, combining:
- Application-level monitoring (OpenTelemetry, APM) to understand request latency, errors, and code performance.
- Database-specific monitoring (SQL queries, `pg_stat_replication`) to ensure PostgreSQL health and replication integrity.
- System-level metrics (Prometheus, Node Exporter) for CPU, memory, disk, and network utilization.
- Centralized logging (Fluent Bit, Loki/ELK) for debugging and auditing.
- Robust alerting (Alertmanager) to proactively notify teams of issues.
By implementing these practices, you move from reactive firefighting to proactive system management, ensuring the stability, performance, and availability of your critical applications and data clusters.