Server Monitoring Best Practices: Keeping Your PHP App and PostgreSQL Clusters Alive on Linode

Core Metrics for PHP Applications

Effective monitoring of PHP applications hinges on tracking key performance indicators (KPIs) that directly impact user experience and resource utilization. For a typical Linode-hosted PHP application, this includes request latency, error rates, and resource consumption at the process level.

Request Latency and Throughput

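Track percentile response times (p95, p99) alongside the average, because averages hide tail latency. Tools like New Relic, Datadog, or Prometheus with a PHP-FPM exporter such as `php-fpm_exporter` can provide this visibility. Note that PHP-FPM's status page reports pool-level data (accepted connections, active and idle workers, listen queue, slow requests) rather than latency histograms, so request latency itself usually comes from the web server, an APM agent, or application instrumentation. For a self-hosted Prometheus setup, make sure your PHP-FPM configuration exposes the status page.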

In your PHP-FPM pool configuration (e.g., /etc/php/8.1/fpm/pool.d/www.conf), enable the status page:

pm.status_path = /status
ping.path = /ping
ping.response = pong

Then point the `php-fpm_exporter` at this endpoint and have Prometheus scrape the exporter rather than PHP-FPM itself, which speaks FastCGI on its listen port and does not serve Prometheus metrics. A typical Prometheus configuration snippet might look like this:

scrape_configs:
  - job_name: 'php-fpm'
    metrics_path: /metrics
    static_configs:
      - targets: ['php-fpm_exporter_host:9101'] # Address of your php-fpm_exporter; use the port your exporter listens on
        labels:
          app_server: 'your_app_server_ip' # Optional: records which PHP-FPM host this exporter watches
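
Independently of any exporter, the status page can be requested in JSON form by appending ?json to the status path, which is handy for a quick saturation check. The following is a minimal Python sketch, assuming the payload uses PHP-FPM's usual key names; the function name pool_saturation and the SAMPLE payload are hypothetical and only for illustration.

```python
import json


def pool_saturation(status_json: str) -> dict:
    """Summarise a PHP-FPM status payload (JSON form of the status page)."""
    status = json.loads(status_json)
    active = status["active processes"]
    total = status["total processes"]
    return {
        "pool": status["pool"],
        # Share of workers currently handling a request.
        "busy_ratio": active / total if total else 0.0,
        # Requests waiting for a free worker; sustained non-zero values
        # suggest the pool is too small.
        "listen_queue": status["listen queue"],
        "max_children_reached": status["max children reached"],
    }


# Hypothetical sample; a real payload would come from the /status?json endpoint.
SAMPLE = json.dumps({
    "pool": "www",
    "process manager": "dynamic",
    "accepted conn": 12073,
    "listen queue": 0,
    "idle processes": 2,
    "active processes": 8,
    "total processes": 10,
    "max children reached": 1,
    "slow requests": 0,
})

if __name__ == "__main__":
    print(pool_saturation(SAMPLE))
```

A busy ratio that stays close to 1.0, or a max_children_reached counter that keeps growing, is usually a sign that pm.max_children is too low for the traffic.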

Error Rates

PHP application errors, both fatal and recoverable, must be logged and alerted upon. Centralized logging solutions like ELK Stack (Elasticsearch, Logstash, Kibana) or Grafana Loki are essential. Configure PHP’s error logging to a file or syslog, and ensure your logging agent (e.g., Filebeat, Promtail) is shipping these logs.

In your php.ini:

error_reporting = E_ALL & ~E_DEPRECATED & ~E_STRICT
display_errors = Off
log_errors = On
error_log = /var/log/php/php-error.log

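For critical errors, consider implementing a mechanism to send immediate alerts. A simple approach is to register your own handlers with `set_error_handler()`, `set_exception_handler()`, and `register_shutdown_function()` (the last one is needed to catch fatal errors), or to use a dedicated error reporting service like Sentry or Bugsnag.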
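
If you prefer to alert from the log file itself, a small watcher can classify the lines PHP writes. The sketch below is in Python and assumes the default PHP error log line format ("[timestamp] PHP Level:  message"); the names LINE_RE, count_levels, has_critical, and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches lines in PHP's default error log format, for example:
# [12-Mar-2024 10:15:30 UTC] PHP Fatal error:  Uncaught Error: ...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] PHP "
    r"(?P<level>Fatal error|Parse error|Warning|Notice|Deprecated):\s+"
    r"(?P<msg>.*)$"
)


def count_levels(lines):
    """Count log lines per PHP error level; unrecognised lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts


def has_critical(lines):
    """True if any line is a fatal or parse error, i.e. worth paging on."""
    counts = count_levels(lines)
    return counts["Fatal error"] + counts["Parse error"] > 0


# Hypothetical sample lines for illustration only.
SAMPLE_LINES = [
    "[12-Mar-2024 10:15:30 UTC] PHP Fatal error:  Uncaught Error: "
    "Call to a member function save() on null in /var/www/app/Order.php:42",
    "[12-Mar-2024 10:15:31 UTC] PHP Warning:  Undefined variable $total "
    "in /var/www/app/Cart.php on line 17",
    "#0 {main}",
]

if __name__ == "__main__":
    print(dict(count_levels(SAMPLE_LINES)), has_critical(SAMPLE_LINES))
```

In practice the same classification is often done by the log shipper (for example with a Promtail pipeline stage), so treat this as an illustration of the logic rather than a replacement for it.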

Resource Utilization (CPU, Memory, I/O)

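Monitor CPU and memory usage per PHP-FPM worker process and overall system load. Node Exporter is the standard for collecting host-level metrics, but it does not break usage down by process. For per-process figures, run a process-level exporter such as process-exporter alongside it. You can then use PromQL queries to identify high-consuming processes.

Example PromQL query to find PHP-FPM process groups consuming high CPU (metric and label names assume process-exporter's defaults):

sum by (groupname) (rate(namedprocess_namegroup_cpu_seconds_total{mode="user", groupname=~"php-fpm.*"}[5m])) > 0.5

Similarly, for resident memory:

sum by (groupname) (namedprocess_namegroup_memory_bytes{memtype="resident", groupname=~"php-fpm.*"})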

PostgreSQL Cluster Monitoring on Linode

Maintaining the health and performance of a PostgreSQL cluster, especially in a distributed setup, requires a robust monitoring strategy. Key areas include replication lag, query performance, connection pooling, and disk I/O.

Replication Lag

For streaming replication, monitoring the lag between the primary and replica nodes is critical to ensure data consistency and availability. PostgreSQL provides built-in views for this.

On the primary node, query pg_stat_replication (ideally as a dedicated monitoring user):

SELECT
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replication_lag_bytes
FROM
    pg_stat_replication
WHERE
    application_name = 'your_app_name'; -- The replica's application_name, if you've set one in its primary_conninfo

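Alternatively, use the pg_stat_wal_receiver view on the replica. This approximates how far the replica's flushed WAL trails the latest position exchanged with the primary. It does not show how far replay has progressed:

SELECT
    flushed_lsn, -- named received_lsn on PostgreSQL 12 and earlier
    latest_end_lsn,
    pg_wal_lsn_diff(latest_end_lsn, flushed_lsn) AS receive_lag_bytes
FROM
    pg_stat_wal_receiver;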

These metrics can be scraped by Prometheus using the postgres_exporter. Configure the exporter to connect to your PostgreSQL instances and expose these metrics.
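
When lag is computed outside the database, for instance in a custom check that compares LSNs collected from two nodes, the arithmetic behind pg_wal_lsn_diff is straightforward: an LSN is printed as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit byte position. A minimal Python sketch follows; the function names and the threshold are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica is behind the primary (negative if ahead)."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)


def lag_exceeds(primary_lsn: str, replica_lsn: str, max_bytes: int) -> bool:
    """True if the replica trails by more than max_bytes."""
    return wal_lag_bytes(primary_lsn, replica_lsn) > max_bytes


if __name__ == "__main__":
    # Hypothetical LSN values for illustration.
    print(wal_lag_bytes("0/3000060", "0/3000000"))
    print(lag_exceeds("1/0", "0/F0000000", 16 * 1024 * 1024))
```

Byte lag is most useful alongside a time-based measure, since a quiet primary can show zero bytes of lag even when the replica has stopped receiving WAL.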

Query Performance and Slow Queries

Identifying slow-running queries is crucial for optimizing database performance. PostgreSQL’s pg_stat_statements extension is invaluable here. Ensure it’s enabled and configured.

Enable the extension in postgresql.conf (a change to shared_preload_libraries takes effect only after a server restart):

shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000
pg_stat_statements.save = on

Then, create the extension in your database:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

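You can then query pg_stat_statements to find the top queries by total execution time (the total_exec_time column, or total_time before PostgreSQL 13), number of calls, and so on. The postgres_exporter can scrape these statistics if the relevant collector or a custom query is enabled.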

Connection Pooling and Active Connections

High numbers of active connections or connection attempts can indicate issues with your application’s connection management or insufficient connection pool resources. Monitor pg_stat_activity.

SELECT
    datname,
    usename,
    client_addr,
    state,
    query_start,
    now() - query_start AS duration
FROM
    pg_stat_activity
WHERE
    state = 'active' AND pid <> pg_backend_pid()
ORDER BY
    duration DESC;

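If you’re using a connection pooler like PgBouncer, monitor its specific metrics as well. PgBouncer exposes its own statistics through its admin console, where commands such as SHOW POOLS and SHOW STATS report waiting clients, active server connections, and query throughput.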

Disk I/O and Space

Database performance is heavily reliant on disk I/O. Monitor disk read/write operations, latency, and queue depth using Node Exporter. Also, ensure sufficient free disk space to prevent outages.

PromQL for disk I/O (per device):

rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
rate(node_disk_io_time_seconds_total[5m]) # Fraction of each second the device was busy

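PromQL for disk space (percentage free per filesystem, excluding in-memory and overlay filesystems):

node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100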
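
A percentage threshold says nothing about how quickly space is shrinking. Prometheus's predict_linear function addresses this by fitting a straight line to recent samples. The Python sketch below illustrates the same idea with an ordinary least-squares fit; the function name and the sample values are hypothetical, and it is not a reimplementation of Prometheus's exact behaviour.

```python
def seconds_until_full(samples):
    """Estimate seconds from the last sample until free space reaches zero.

    samples: list of (unix_time_seconds, avail_bytes) in time order.
    Returns None if there is too little data or free space is not shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope in bytes per second (negative when filling up).
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return None
    intercept = mean_v - slope * mean_t
    time_at_zero = -intercept / slope
    return time_at_zero - samples[-1][0]


if __name__ == "__main__":
    # Hypothetical samples: 100 bytes free, losing 1 byte per second.
    print(seconds_until_full([(0, 100), (10, 90), (20, 80)]))
```

Alerting on "will be full within N hours" tends to give earlier warning on fast-growing volumes, such as a WAL directory during a replication outage, than a fixed percentage does.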

Alerting Strategies and Tools

A robust alerting strategy is the backbone of proactive system management. It ensures that potential issues are identified and addressed before they impact users or cause downtime.

Prometheus Alertmanager

Prometheus, coupled with Alertmanager, is a powerful combination for defining and routing alerts. Alerts are defined in Prometheus using alerting rules, and Alertmanager handles deduplication, grouping, silencing, and routing to various receivers (email, Slack, PagerDuty, etc.).

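Example Prometheus alerting rule for a high PHP-FPM error rate. PHP-FPM's status page does not itself report request errors, so php_fpm_request_errors_total here stands in for whatever error counter your exporter or application instrumentation exposes: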

groups:
- name: php_alerts
  rules:
  - alert: HighPhpFpmErrorRate
    expr: sum(rate(php_fpm_request_errors_total[5m])) by (instance) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High PHP-FPM error rate on {{ $labels.instance }}"
      description: "PHP-FPM on {{ $labels.instance }} is reporting more than 10 errors per second over the last 5 minutes."
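
The for: clause is what keeps a brief spike from paging anyone: the expression must hold at every evaluation for the whole duration before the alert moves from pending to firing, and a single evaluation below the threshold resets the clock. The following Python sketch is a simplified model of that behaviour, not Prometheus's implementation; the function name and sample data are hypothetical.

```python
def first_firing_time(evaluations, threshold, for_seconds):
    """Return the timestamp at which the alert would start firing, or None.

    evaluations: iterable of (timestamp_seconds, value) in time order,
    one entry per rule evaluation.
    """
    pending_since = None
    for timestamp, value in evaluations:
        if value > threshold:
            if pending_since is None:
                # Condition just became true: the alert is now pending.
                pending_since = timestamp
            if timestamp - pending_since >= for_seconds:
                return timestamp
        else:
            # Condition cleared: the pending state is discarded.
            pending_since = None
    return None


if __name__ == "__main__":
    # Hypothetical error rates, evaluated once a minute.
    rates = [(0, 5), (60, 12), (120, 12), (180, 12),
             (240, 12), (300, 12), (360, 12)]
    print(first_firing_time(rates, threshold=10, for_seconds=300))
```

With a 5-minute rate window and for: 5m, an alert like the one above needs several minutes of sustained errors before it fires, which is a reasonable trade-off for a warning but may be too slow for a critical page.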

Example Alertmanager configuration (alertmanager.yml) for routing to Slack:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'

receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'YOUR_SLACK_WEBHOOK_URL'
    channel: '#alerts'
    send_resolved: true
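
The group_by setting determines which alerts are batched into a single notification: alerts that share the same values for every listed label end up in the same group. The Python sketch below models that idea in a simplified form; the function name, the treatment of missing labels as empty strings, and the sample alerts (including the ReplicationLag alert name) are assumptions for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Group alert label sets by the values of the group_by labels.

    alerts: list of dicts mapping label name to label value.
    Returns a dict keyed by a tuple of the group_by label values.
    """
    groups = defaultdict(list)
    for alert in alerts:
        # Assumption: a missing label is treated as an empty value.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


if __name__ == "__main__":
    alerts = [
        {"alertname": "HighPhpFpmErrorRate", "cluster": "prod",
         "service": "web", "instance": "app1"},
        {"alertname": "HighPhpFpmErrorRate", "cluster": "prod",
         "service": "web", "instance": "app2"},
        {"alertname": "ReplicationLag", "cluster": "prod",
         "service": "db", "instance": "pg2"},
    ]
    for key, members in group_alerts(
            alerts, ["alertname", "cluster", "service"]).items():
        print(key, len(members))
```

Here the two web alerts would arrive as one Slack message and the database alert as another. Adding instance to group_by would split them into three.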

Log-Based Alerting

For issues not easily captured by metrics (e.g., specific application exceptions, security events), log-based alerting is essential. If using Grafana Loki, you can define alerts directly within Grafana based on log queries.

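Example Grafana alert rule for detecting a specific PHP fatal error in logs. Treat it as an illustrative sketch: the exact JSON schema depends on your Grafana version, and the query must be a metric query (here count_over_time) so that it yields a number to compare against the threshold:

{
  "condition": {
    "evaluator": {
      "params": [0, 0],
      "type": "gt"
    },
    "query": {
      "datasourceUid": "loki_datasource_uid",
      "model": {
        "datasource": {"uid": "loki_datasource_uid", "type": "loki"},
        "editorMode": "code",
        "expr": "sum(count_over_time({job=\"php-fpm\", level=\"error\"} |= \"Uncaught Error: Call to a member function\" [5m]))",
        "hide": false,
        "instant": true,
        "legendFormat": "",
        "queryType": "instant",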
        "refId": "A"
      }
    },
    "reducer": {
      "params": [],
      "type": "last"
    },
    "type": "query"
  },
  "data": [
    {
      "expression": "A",
      "refId": "A"
    }
  ],
  "evaluateFor": "5m",
  "evaluateEvery": "1m",
  "for": "5m",
  "labels": {
    "severity": "critical"
  },
  "notifications": [
    {
      "uid": "slack_notification_uid"
    }
  ],
  "ruleName": "PHP Fatal Error Detected",
  "templateVars": [],
  "title": "PHP Fatal Error Detected",
  "type": "logging"
}

System Hardening and Security Monitoring

Beyond performance, keeping your systems secure and resilient is paramount. This involves both preventative measures and active monitoring for suspicious activity.

Firewall Configuration (UFW)

Ensure your Linode instances have a properly configured firewall. UFW (Uncomplicated Firewall) is a user-friendly front-end for iptables.

Example UFW rules:

sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow ssh # Port 22
sudo ufw allow http # Port 80
sudo ufw allow https # Port 443
sudo ufw allow from 203.0.113.10 to any port 5432 proto tcp # PostgreSQL: only from your app server (203.0.113.10 is a placeholder address)
sudo ufw enable

Intrusion Detection Systems (IDS)

Tools like Fail2ban can help mitigate brute-force attacks by monitoring log files for repeated failed login attempts and automatically updating firewall rules to block offending IP addresses.

Install and configure Fail2ban:

sudo apt update && sudo apt install fail2ban
sudo cp /etc/fail2ban/jail.conf /etc/fail2ban/jail.local

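Edit /etc/fail2ban/jail.local to enable specific jails, such as SSH and PostgreSQL. Stock Fail2ban packages typically do not include a PostgreSQL filter, so the postgresql jail below assumes you have created your own filter at /etc/fail2ban/filter.d/postgresql.conf that matches failed-authentication lines in the PostgreSQL log: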

[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 1h

[postgresql]
enabled = true
port = 5432
filter = postgresql
logpath = /var/log/postgresql/postgresql-*.log
maxretry = 5
bantime = 1h
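
The jail settings interact with Fail2ban's findtime option: an address is banned once it accumulates maxretry failures within a findtime window. The Python sketch below is a simplified model of that counting logic, not Fail2ban's code; the function name and the sample addresses (taken from documentation ranges) are hypothetical.

```python
from collections import defaultdict, deque


def banned_ips(failures, maxretry, findtime):
    """Return the set of IPs that reach maxretry failures within findtime.

    failures: iterable of (timestamp_seconds, ip) in time order.
    """
    recent = defaultdict(deque)
    banned = set()
    for timestamp, ip in failures:
        window = recent[ip]
        window.append(timestamp)
        # Forget failures that fell out of the sliding window.
        while window and timestamp - window[0] > findtime:
            window.popleft()
        if len(window) >= maxretry:
            banned.add(ip)
    return banned


if __name__ == "__main__":
    events = [
        (0, "203.0.113.5"), (0, "198.51.100.7"),
        (10, "203.0.113.5"), (20, "203.0.113.5"),
        (1000, "198.51.100.7"), (2000, "198.51.100.7"),
    ]
    print(banned_ips(events, maxretry=3, findtime=600))
```

A slow attacker who spaces attempts further apart than findtime is never banned under this model, which is one reason to pair Fail2ban with key-only SSH authentication and IP restrictions on the database port.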

Security Auditing and Log Analysis

Regularly audit system logs for security-related events. Tools like Auditd can provide detailed system call auditing. Centralizing logs with a SIEM (Security Information and Event Management) solution or a robust log aggregation platform (ELK, Loki) is crucial for effective analysis and threat detection.

Example Auditd rule to log all attempts to modify critical configuration files:

sudo auditctl -w /etc/nginx/nginx.conf -p wa -k nginx_config_changes
sudo auditctl -w /etc/php/ -p wa -k php_config_changes

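Ensure these rules are persisted across reboots by adding them, without the `sudo auditctl` prefix, to a .rules file under /etc/audit/rules.d/ and reloading with `sudo augenrules --load`.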