Server Monitoring Best Practices: Keeping Your PHP App and PostgreSQL Clusters Alive on DigitalOcean

Proactive PostgreSQL Cluster Health Checks

Maintaining the health of a PostgreSQL cluster, especially in a distributed environment like DigitalOcean, requires more than just basic uptime checks. We need to monitor internal cluster metrics that indicate potential performance degradation or impending failures. This involves querying PostgreSQL itself for vital statistics.

A crucial metric is the replication lag. High replication lag on a standby server can lead to data inconsistencies and extended downtime during failover. We can query the `pg_stat_replication` view on the primary to get this information.

Monitoring Replication Lag

The following SQL query, executed on the primary PostgreSQL instance, will provide the replication lag in seconds for each connected standby. We’ll focus on the `write_lag` and `flush_lag` as key indicators.

SELECT
    application_name,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS write_lag_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS flush_lag_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS write_lag_mb,
    pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) / 1024 / 1024 AS flush_lag_mb,
    state
FROM
    pg_stat_replication
WHERE
    state = 'streaming';

To automate this, we can create a simple script that runs this query and triggers an alert if the lag exceeds a predefined threshold (e.g., 60 seconds). This script can be executed via cron.

Automating Replication Lag Checks with Bash and `psql`

Here’s a bash script that connects to the PostgreSQL primary, executes the query, and checks the `write_lag_bytes`. We’ll assume you have `psql` installed and configured to connect without a password using a `.pgpass` file or similar mechanism.

#!/bin/bash

PG_HOST="your_pg_primary_host"
PG_PORT="5432"
PG_USER="monitor_user"
ALERT_THRESHOLD_BYTES=629145600 # 600MB as an example threshold

# SQL query to get replication lag in bytes
SQL_QUERY="SELECT
    application_name,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS write_lag_bytes
FROM
    pg_stat_replication
WHERE
    state = 'streaming';"

# Execute the query and capture output
LAG_DATA=$(PGPASSWORD=$PG_PASSWORD psql -h $PG_HOST -p $PG_PORT -U $PG_USER -d postgres -t -c "$SQL_QUERY")

# Check if the query returned any data
if [ -z "$LAG_DATA" ]; then
    echo "No streaming replication found or error executing query."
    # Consider sending an alert here for unexpected lack of replication
    exit 1
fi

# Process each line of the output
echo "$LAG_DATA" | while IFS='|' read -r app_name write_lag_bytes; do
    # Trim whitespace
    app_name=$(echo "$app_name" | xargs)
    write_lag_bytes=$(echo "$write_lag_bytes" | xargs)

    if [ -n "$write_lag_bytes" ] && [ "$write_lag_bytes" -gt "$ALERT_THRESHOLD_BYTES" ]; then
        echo "ALERT: Replication lag for '$app_name' is ${write_lag_bytes} bytes, exceeding threshold of ${ALERT_THRESHOLD_BYTES} bytes."
        # Trigger your alerting mechanism here (e.g., send to Slack, PagerDuty)
        # curl -X POST -H 'Content-type: application/json' --data '{"text":"ALERT: Replication lag for '"$app_name"' is '"$write_lag_bytes"' bytes, exceeding threshold of '"$ALERT_THRESHOLD_BYTES"' bytes."}' YOUR_SLACK_WEBHOOK_URL
    else
        echo "OK: Replication lag for '$app_name' is ${write_lag_bytes} bytes."
    fi
done

exit 0

Remember to set up a dedicated monitoring user in PostgreSQL with appropriate permissions to view `pg_stat_replication`. Also, ensure the `PGPASSWORD` environment variable is handled securely, perhaps by using a secrets management system or a properly configured `.pgpass` file.

PHP Application Performance Monitoring (APM) on DigitalOcean

For PHP applications, monitoring goes beyond just server resource utilization. We need to understand the performance of the application itself: request latency, error rates, slow database queries, and external API call durations. This is where Application Performance Monitoring (APM) tools become indispensable.

Integrating OpenTelemetry with PHP

OpenTelemetry is becoming the de facto standard for observability. It provides a vendor-neutral way to instrument your code and export telemetry data (traces, metrics, logs) to various backends. For PHP, the OpenTelemetry SDK is available.

First, ensure you have the OpenTelemetry PHP extension installed. This typically involves compiling it or using a pre-built package if available for your PHP version and OS.

# Example using PECL (if available and compiled)
pecl install opentelemetry

Next, configure your PHP application to use the OpenTelemetry SDK. This involves setting up an exporter to send data to a collector or directly to a backend like Jaeger, Prometheus, or a commercial APM service.

<?php
require 'vendor/autoload.php'; // Assuming you use Composer

use OpenTelemetry\API\Trace\TracerProvider;
use OpenTelemetry\SDK\Trace\SpanProcessor\BatchSpanProcessor;
use OpenTelemetry\SDK\Trace\TracerProvider as SDKTracerProvider;
use OpenTelemetry\SDK\Trace\Exporter\ConsoleExporter; // For debugging, replace with actual exporter
use OpenTelemetry\SDK\Trace\Sampler\AlwaysOnSampler;
use OpenTelemetry\Context\Propagation\TextMapPropagator;
use OpenTelemetry\API\Globals;

// Initialize TracerProvider
$tracerProvider = new SDKTracerProvider(
    new AlwaysOnSampler(),
    new BatchSpanProcessor(new ConsoleExporter()) // Replace ConsoleExporter with your actual exporter (e.g., JaegerExporter, OTLPExporter)
);

// Set the global TracerProvider
Globals::setTracerProvider($tracerProvider);

// Get a tracer instance
$tracer = $tracerProvider->getTracer('my-php-app');

// Start a span for an incoming request
$span = $tracer->spanBuilder('incoming_request')
    ->setSpanKind(\OpenTelemetry\API\Trace\SpanKind::KIND_SERVER)
    ->start();

try {
    // Your application logic here
    // Example: Simulate a database query
    $dbSpan = $tracer->spanBuilder('database_query')
        ->setSpanKind(\OpenTelemetry\API\Trace\SpanKind::KIND_CLIENT)
        ->start();
    usleep(rand(50000, 200000)); // Simulate query time
    $dbSpan->end();

    // Example: Simulate an external API call
    $apiSpan = $tracer->spanBuilder('external_api_call')
        ->setSpanKind(\OpenTelemetry\API\Trace\SpanKind::KIND_CLIENT)
        ->start();
    usleep(rand(100000, 500000)); // Simulate API call time
    $apiSpan->end();

    // End the main request span
    $span->end();

} catch (\Throwable $e) {
    // Record exception and end span
    $span->recordException($e);
    $span->setStatus(\OpenTelemetry\API\Trace\StatusCode::STATUS_ERROR, $e->getMessage());
    $span->end();
    throw $e; // Re-throw the exception
}

// Ensure all spans are flushed before script exit (important for CLI scripts)
$tracerProvider->shutdown();
?>

For web requests, you’ll typically integrate this instrumentation within your framework’s request lifecycle (e.g., in a middleware or event listener). For background jobs or CLI scripts, ensure `shutdown()` is called to flush any buffered spans.

Choosing an APM Backend

For DigitalOcean deployments, consider these options:

Jaeger: An open-source, end-to-end distributed tracing system. You can deploy Jaeger on a DigitalOcean Droplet or use a managed service.
Prometheus + Grafana: While primarily for metrics, you can export traces to Tempo or use Prometheus’s tracing capabilities. Grafana provides excellent visualization.
Commercial APM Services: Datadog, New Relic, Dynatrace offer managed solutions that integrate well with OpenTelemetry. They often provide more advanced features and support.

The choice depends on your budget, operational overhead tolerance, and feature requirements. For a self-hosted solution, deploying a Jaeger agent on your PHP application servers and a collector/query service elsewhere is a common pattern.

System-Level Monitoring with Prometheus and Node Exporter

Beyond application-specific metrics, robust server monitoring requires understanding the underlying system resources: CPU, memory, disk I/O, network traffic. Prometheus, coupled with `node_exporter`, is a powerful combination for this.

Deploying Node Exporter on DigitalOcean Droplets

`node_exporter` is a simple binary that exposes a wide range of system metrics. You can download the latest release from the official Prometheus GitHub repository.

# Download the latest release (adjust version and architecture as needed)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Run node_exporter (for testing)
./node_exporter

# For production, run as a systemd service
sudo cp node_exporter /usr/local/bin/
sudo useradd -rs /bin/false node_exporter

# Create systemd service file
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.disable-defaults \
    --collector.cpu \
    --collector.diskstats \
    --collector.filesystem \
    --collector.loadavg \
    --collector.meminfo \
    --collector.netdev \
    --collector.stat \
    --collector.time \
    --web.listen-address=":9100"

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd, enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter

This configuration enables a curated set of collectors. You can customize this further based on your needs. Ensure port 9100 is accessible from your Prometheus server (or the collector if you’re using one).

Configuring Prometheus to Scrape Node Exporter

On your Prometheus server (which could also be a Droplet), you need to configure `prometheus.yml` to scrape the `node_exporter` instances. If you have multiple Droplets, you’ll use service discovery or static configuration.

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape PostgreSQL exporter (if you're using one, e.g., pg_exporter)
  # - job_name: 'postgres'
  #   static_configs:
  #     - targets: ['your_pg_exporter_host:9187'] # Default port for pg_exporter

  # Scrape Node Exporter for your PHP app servers
  - job_name: 'php_app_nodes'
    static_configs:
      - targets:
          - 'droplet_ip_1:9100'
          - 'droplet_ip_2:9100'
          - 'droplet_ip_3:9100'
    # If using service discovery (e.g., DigitalOcean integration with Prometheus Operator)
    # you would configure that here instead of static_configs.

  # Scrape Node Exporter for your PostgreSQL cluster nodes
  - job_name: 'postgres_nodes'
    static_configs:
      - targets:
          - 'pg_primary_ip:9100'
          - 'pg_standby_1_ip:9100'
          - 'pg_standby_2_ip:9100'

After updating `prometheus.yml`, reload the Prometheus configuration: `curl -X POST http://localhost:9090/-/reload`.

Alerting with Alertmanager

Collecting metrics is only half the battle; you need to be notified when things go wrong. Prometheus integrates with Alertmanager for sophisticated alerting.

Configuring Alerting Rules

Alerting rules are defined in separate YAML files and loaded by Prometheus. Here are examples for PostgreSQL replication lag and high CPU usage.

# rules/postgres_alerts.yml
groups:
- name: postgresql.rules
  rules:
  - alert: HighReplicationLag
    expr: |
      avg by (application_name) (
        pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 > 1024 # 1GB lag threshold
      )
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High replication lag detected for {{ $labels.application_name }}"
      description: "PostgreSQL standby {{ $labels.application_name }} has a replication lag of {{ $value | humanize1024 }} MB, exceeding the 1GB threshold for 5 minutes."

# rules/node_alerts.yml
groups:
- name: node.rules
  rules:
  - alert: HighCpuUsage
    expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "Instance {{ $labels.instance }} has been experiencing CPU usage above 90% for the last 10 minutes."

Ensure these rule files are referenced in your `prometheus.yml` under the `rule_files` directive.

Configuring Alertmanager

Alertmanager handles deduplication, grouping, and routing of alerts to various receivers like email, Slack, PagerDuty, etc. A basic `alertmanager.yml` might look like this:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver if no specific match

receivers:
- name: 'default-receiver'
  slack_configs:
  - api_url: 'YOUR_SLACK_WEBHOOK_URL'
    channel: '#alerts'
    send_resolved: true
    title: '[{{ .Status | toUpper }}{{ if .CommonLabels.severity }} - {{ .CommonLabels.severity }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}'
    text: '{{ range .Alerts }}*Alert:* {{ .Annotations.summary }}\n*Description:* {{ .Annotations.description }}\n*Details:* {{ range .Labels.SortedPairs }} {{ .Name }}={{ .Value }}{{ end }}\n{{ end }}'

inhibit_rules:
  - target_match:
      severity: 'critical'
    source_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

Prometheus needs to be configured to send alerts to Alertmanager. Add the following to your `prometheus.yml`:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 'localhost:9093' # Assuming Alertmanager is running on the same server

Reload Prometheus after updating its configuration.

Log Aggregation and Analysis

Centralized logging is crucial for debugging and auditing. For PHP applications and PostgreSQL, collecting logs from various sources and making them searchable is key.

Using Fluentd/Fluent Bit with DigitalOcean

Fluentd or its lighter sibling, Fluent Bit, are excellent choices for log collection. They can tail log files, parse them, and forward them to various destinations.

On each Droplet running your PHP app or PostgreSQL, you can deploy Fluent Bit as a daemon. It can collect logs from:

PHP error logs (e.g., `error_log` in PHP-FPM configuration)
Nginx/Apache access and error logs
PostgreSQL logs (configured via `postgresql.conf`)

[SERVICE]
    Flush        5
    Daemon       On
    Log_Level    info
    Parsers_File /etc/fluent-bit/parsers.conf

[INPUT]
    Name              tail
    Path              /var/log/php-fpm/error.log
    Tag               php-fpm.error
    Parser            php_fpm_error

[INPUT]
    Name              tail
    Path              /var/log/nginx/access.log
    Tag               nginx.access
    Parser            nginx

[INPUT]
    Name              tail
    Path              /var/log/postgresql/postgresql-*.log
    Tag               postgres.log
    Parser            postgres_log

[OUTPUT]
    Name            forward
    Host            your_log_aggregator_host
    Port            24224
    Retry_Limit     False

You’ll need a corresponding `parsers.conf` file to correctly parse these logs. For example:

[PARSER]
    Name        php_fpm_error
    Format      regex
    Regex       ^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+\[(?P<level>\w+)\]\s+(?P<message>.*)$
    Time_Key    time
    Time_Format %Y-%m-%d %H:%M:%S

[PARSER]
    Name        nginx
    Format      nginx

[PARSER]
    Name        postgres_log
    Format      regex
    Regex       ^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+\[(?P<level>\w+)\]\s+\[(?P<pid>\d+)\]\s+(?P<message>.*)$
    Time_Key    time
    Time_Format %Y-%m-%d %H:%M:%S

The logs are then forwarded to a central log aggregator. Options include:

ELK Stack (Elasticsearch, Logstash, Kibana): A powerful, albeit resource-intensive, solution.
Loki + Grafana: A more lightweight, cost-effective alternative that integrates seamlessly with Prometheus and Grafana.
Commercial Log Management Services: Splunk, Datadog Logs, Loggly.

For a DigitalOcean environment, Loki is often a compelling choice due to its integration with Grafana and Prometheus, simplifying the observability stack.

Conclusion: A Layered Approach to Observability

Effective server monitoring for a PHP application and PostgreSQL cluster on DigitalOcean is a multi-faceted endeavor. It requires a layered approach, combining:

Application-level monitoring (OpenTelemetry, APM) to understand request latency, errors, and code performance.
Database-specific monitoring (SQL queries, `pg_stat_replication`) to ensure PostgreSQL health and replication integrity.
System-level metrics (Prometheus, Node Exporter) for CPU, memory, disk, and network utilization.
Centralized logging (Fluent Bit, Loki/ELK) for debugging and auditing.
Robust alerting (Alertmanager) to proactively notify teams of issues.

By implementing these practices, you move from reactive firefighting to proactive system management, ensuring the stability, performance, and availability of your critical applications and data clusters.