Server Monitoring Best Practices: Keeping Your Python App and PostgreSQL Clusters Alive on OVH

Proactive PostgreSQL Cluster Health Checks with `pg_isready` and Custom Scripts

Maintaining the health of a PostgreSQL cluster, especially in a distributed setup on a cloud provider like OVH, requires more than just basic CPU/memory monitoring. We need to ensure the database instances are not only reachable but also responsive and capable of serving read/write operations. The built-in `pg_isready` utility is a cornerstone for this, but its output needs to be parsed and acted upon intelligently.

On each PostgreSQL node, we’ll set up a cron job that runs a health-check script at regular intervals. The script uses `pg_isready` and, for more advanced checks, executes a simple query. The output is logged and, crucially, sent to a central monitoring system like Prometheus via an exporter or a custom push mechanism.

Basic Reachability and Status Check

The `pg_isready` command provides a quick way to determine whether a PostgreSQL server is accepting connections. It reports one of four states (`accepting connections`, `rejecting connections`, `no response`, or `no attempt`) and returns a matching exit code from 0 to 3. We’ll wrap this in a shell script that checks the exit code.

`check_pg_status.sh` (Bash Script)

#!/bin/bash

# Configuration
PG_HOST="${1:-localhost}"
PG_PORT="${2:-5432}"
PG_USER="${3:-postgres}" # User for connection check, can be a read-only user
PG_DB="${4:-postgres}"   # Database to connect to for status check

# Log file
LOG_FILE="/var/log/postgres/pg_status_check.log"
mkdir -p "$(dirname "$LOG_FILE")"

# Timestamp
TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")

# Check if pg_isready is available
if ! command -v pg_isready &> /dev/null
then
    echo "$TIMESTAMP - ERROR: pg_isready command not found. Please install PostgreSQL client tools." | tee -a "$LOG_FILE"
    exit 1
fi

# Execute pg_isready
# -h: host, -p: port, -U: user, -d: database, -q: quiet (suppress the status message)
# With -q the exit code is the only signal; drop -q to see the message when debugging.
# The exit code of pg_isready is 0 if the server is accepting connections.
pg_isready -h "$PG_HOST" -p "$PG_PORT" -U "$PG_USER" -d "$PG_DB" -q

PG_STATUS_EXIT_CODE=$?

if [ $PG_STATUS_EXIT_CODE -eq 0 ]; then
    echo "$TIMESTAMP - INFO: PostgreSQL on $PG_HOST:$PG_PORT is ALIVE and READY." | tee -a "$LOG_FILE"
    # Optionally, send a success metric to your monitoring system here
    # Example: echo "pg_status_alive{host=\"$PG_HOST\"} 1" | curl --data-binary @- http://your-monitoring-push-endpoint/metrics
    exit 0
else
    # pg_isready returns non-zero for various states:
    # 1: server is rejecting connections (e.g., still starting up)
    # 2: no response to the connection attempt (e.g., server down, network issue)
    # 3: no attempt was made (e.g., invalid parameters)
    # It does not authenticate, so a wrong user or password is not detected here.
    echo "$TIMESTAMP - CRITICAL: PostgreSQL on $PG_HOST:$PG_PORT is NOT READY. pg_isready exited with code $PG_STATUS_EXIT_CODE." | tee -a "$LOG_FILE"
    # Optionally, send a failure metric
    # Example: echo "pg_status_alive{host=\"$PG_HOST\"} 0" | curl --data-binary @- http://your-monitoring-push-endpoint/metrics
    exit 1
fi

To make this script executable:

chmod +x check_pg_status.sh
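
When a check is driven from a separate monitoring server, it helps to report why a node is not ready rather than a bare failure. The sketch below is one way to do that in Python, mapping the exit codes of `pg_isready` to the states it reports. The names `describe_exit_code` and `check_node` are illustrative, and the 3-second timeout passed via `-t` is an assumption.

```python
import subprocess

# Exit codes documented for pg_isready.
PG_ISREADY_STATES = {
    0: "accepting connections",
    1: "rejecting connections",
    2: "no response",
    3: "no attempt made",
}


def describe_exit_code(code):
    """Return a human-readable state for a pg_isready exit code."""
    return PG_ISREADY_STATES.get(code, f"unexpected exit code {code}")


def check_node(host, port=5432, timeout=3, run=subprocess.run):
    """Run pg_isready against one node and return (is_ready, description).

    `run` is injectable so the function can be exercised without a server.
    """
    cmd = ["pg_isready", "-h", host, "-p", str(port), "-t", str(timeout), "-q"]
    result = run(cmd, check=False)
    return result.returncode == 0, describe_exit_code(result.returncode)
```

Calling `check_node("localhost")` on a host with the PostgreSQL client tools installed returns a pair such as `(True, "accepting connections")`, which can be logged or turned into a metric.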

Scheduling the Health Check

We’ll use cron to run this script at regular intervals. For a cluster, you’d typically run this on each node, targeting itself as the host. For a highly available setup, you might also run checks from a separate monitoring server targeting the primary and standby instances.

Cron Job Entry (e.g., every 5 minutes)

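*/5 * * * * /path/to/your/scripts/check_pg_status.sh > /dev/null

This cron job executes the script every 5 minutes. The script already appends each result to its log file through `tee -a`, so standard output is discarded here; redirecting it into the same log with `>>` would record every line twice. Standard error is left alone so that cron can mail unexpected failures (a missing interpreter, a permissions problem) to the job owner, provided mail delivery is configured.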

Advanced Checks: Querying for Replication Status

For PostgreSQL replication, `pg_isready` only tells us whether a server is accepting connections. To verify replication health (e.g., lag, sync status), we need to query PostgreSQL’s system views. This is particularly important for standby servers.

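We’ll augment our script to perform these checks. This requires a user that can read `pg_stat_replication` (on the primary) and `pg_stat_wal_receiver` (on the standby) in full. Membership in the built-in `pg_monitor` role is typically enough; other unprivileged roles see most columns of these views as NULL.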

`check_pg_replication.sh` (Bash Script for Standby)

#!/bin/bash

# Configuration
PG_HOST="${1:-localhost}"
PG_PORT="${2:-5432}"
PG_USER="${3:-postgres}"
PG_DB="${4:-postgres}"
REPLICATION_USER="${5:-repl_user}" # User for replication checks

# Log file
LOG_FILE="/var/log/postgres/pg_replication_check.log"
mkdir -p "$(dirname "$LOG_FILE")"

# Timestamp
TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")

# Check if psql is available
if ! command -v psql &> /dev/null
then
    echo "$TIMESTAMP - ERROR: psql command not found. Please install PostgreSQL client tools." | tee -a "$LOG_FILE"
    exit 1
fi

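# Check if this is a standby server. Abort if the query itself fails; otherwise an
# unreachable server would fall through to the primary branch and look healthy.
if ! IS_STANDBY=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$PG_USER" -d "$PG_DB" -tAc "SELECT pg_is_in_recovery();"); then
    echo "$TIMESTAMP - CRITICAL: Could not query $PG_HOST:$PG_PORT for its recovery state." | tee -a "$LOG_FILE"
    exit 1
fi

if [ "$IS_STANDBY" = "t" ]; then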
    # This is a standby server, check replication receiver status
    RECV_STATUS=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "SELECT status FROM pg_stat_wal_receiver;")
    RECV_PID=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "SELECT pid FROM pg_stat_wal_receiver;")
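    # pg_stat_wal_receiver has no sent_lsn/write_lsn/flush_lsn/replay_lsn columns (those belong to
    # pg_stat_replication on the primary). On PostgreSQL 13+, compare latest_end_lsn (the sender's
    # WAL end as last reported to this standby) with written_lsn, flushed_lsn, and the replay
    # position. On older versions, use received_lsn in place of written_lsn and flushed_lsn.
    RECV_LAG_QUERY="SELECT pg_wal_lsn_diff(latest_end_lsn, written_lsn) AS write_lag, pg_wal_lsn_diff(latest_end_lsn, flushed_lsn) AS flush_lag, pg_wal_lsn_diff(latest_end_lsn, pg_last_wal_replay_lsn()) AS replay_lag FROM pg_stat_wal_receiver;"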
    RECV_LAG=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "$RECV_LAG_QUERY")

    if [ -z "$RECV_PID" ]; then
        echo "$TIMESTAMP - CRITICAL: Standby $PG_HOST:$PG_PORT replication receiver is not running." | tee -a "$LOG_FILE"
        exit 1
    elif [ "$RECV_STATUS" != "streaming" ]; then # "catchup" is a pg_stat_replication state, not a WAL receiver status
        echo "$TIMESTAMP - CRITICAL: Standby $PG_HOST:$PG_PORT replication receiver status is '$RECV_STATUS'." | tee -a "$LOG_FILE"
        exit 1
    else
        # Parse lag values (in bytes); psql -tA separates columns with '|', not whitespace
        WRITE_LAG=$(echo "$RECV_LAG" | awk -F'|' '{print $1}')
        FLUSH_LAG=$(echo "$RECV_LAG" | awk -F'|' '{print $2}')
        REPLAY_LAG=$(echo "$RECV_LAG" | awk -F'|' '{print $3}')

        # Define a threshold for acceptable lag (e.g., 1GB = 1073741824 bytes)
        LAG_THRESHOLD=1073741824

        if [ -n "$REPLAY_LAG" ] && [ "$REPLAY_LAG" -gt "$LAG_THRESHOLD" ]; then
            echo "$TIMESTAMP - WARNING: Standby $PG_HOST:$PG_PORT replication lag (replay) is high: $REPLAY_LAG bytes." | tee -a "$LOG_FILE"
            # Send warning metric
            exit 0 # Not a critical failure, but a warning
        else
            echo "$TIMESTAMP - INFO: Standby $PG_HOST:$PG_PORT replication is healthy. Status: $RECV_STATUS, Replay Lag: ${REPLAY_LAG:-N/A} bytes." | tee -a "$LOG_FILE"
            # Send success metric
            exit 0
        fi
    fi
else
    # This is a primary server, check replication sender status
    # We can query pg_stat_replication to see connected standbys
    NUM_STANDBYS=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "SELECT count(*) FROM pg_stat_replication;")
    echo "$TIMESTAMP - INFO: Primary $PG_HOST:$PG_PORT has $NUM_STANDBYS connected standbys." | tee -a "$LOG_FILE"
    # Send metric for number of standbys
    exit 0
fi

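This script needs to be run with appropriate credentials. Despite its name, `REPLICATION_USER` does not need the `REPLICATION` attribute, which only governs streaming-replication connections. It needs `LOGIN`, membership in `pg_monitor` to read the statistics views, and a `pg_hba.conf` entry that lets it connect to `PG_DB` from the host running the script. The `PG_USER` is for general connection checks and can be a less privileged user. Because cron cannot answer a password prompt, supply passwords through a `~/.pgpass` file or use peer authentication.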
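
The lag figures produced by `pg_wal_lsn_diff` are plain byte differences between two WAL positions. An LSN such as `16/B374D848` is written as two hexadecimal numbers, the high and low 32 bits of a 64-bit position. The following sketch reproduces that arithmetic in Python, which can be handy for sanity-checking values by hand. The function names are hypothetical, and the 1 GB default simply mirrors `LAG_THRESHOLD` in the script.

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lsn_diff_bytes(lsn_a, lsn_b):
    """Return lsn_a - lsn_b in bytes, like pg_wal_lsn_diff(lsn_a, lsn_b)."""
    return lsn_to_int(lsn_a) - lsn_to_int(lsn_b)


def lag_exceeds(received_lsn, replayed_lsn, threshold_bytes=1073741824):
    """True if the replayed position trails the received one by more than the threshold."""
    return lsn_diff_bytes(received_lsn, replayed_lsn) > threshold_bytes
```

For example, `lsn_diff_bytes("0/3000060", "0/3000000")` is 96, meaning the second position is 96 bytes behind the first.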

Integrating with Prometheus

The most robust way to handle these checks is by integrating them into a Prometheus monitoring stack. You have two primary options:

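Node Exporter with Textfile Collector: Modify the scripts to write metrics in Prometheus text format to a designated directory (e.g., `/var/lib/prometheus/node-exporter/textfile_collector/`). The Node Exporter reads every `*.prom` file in the directory passed to its `--collector.textfile.directory` flag and exposes the contents on its own `/metrics` endpoint.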
Custom Exporter: Write a dedicated exporter (e.g., in Python using `prometheus_client`) that runs the checks and exposes metrics via an HTTP endpoint. This is more flexible but requires more development.

Example: Using Node Exporter Textfile Collector

Let’s adapt `check_pg_status.sh` to output Prometheus metrics.

#!/bin/bash

# Configuration
PG_HOST="${1:-localhost}"
PG_PORT="${2:-5432}"
PG_USER="${3:-postgres}"
PG_DB="${4:-postgres}"

# Prometheus metrics output directory
METRICS_DIR="/var/lib/prometheus/node-exporter/textfile_collector"
METRIC_FILE="${METRICS_DIR}/pg_status_${PG_HOST//./_}.prom" # Sanitize hostname for filename

mkdir -p "$METRICS_DIR"

# Check if pg_isready is available
if ! command -v pg_isready &> /dev/null
then
    echo "# HELP pg_status_alive PostgreSQL server is alive and ready (1=yes, 0=no)." > "$METRIC_FILE"
    echo "# TYPE pg_status_alive gauge" >> "$METRIC_FILE"
    echo "pg_status_alive{host=\"$PG_HOST\",port=\"$PG_PORT\"} 0" >> "$METRIC_FILE"
    exit 1
fi

# Execute pg_isready
pg_isready -h "$PG_HOST" -p "$PG_PORT" -U "$PG_USER" -d "$PG_DB" -q
PG_STATUS_EXIT_CODE=$?

echo "# HELP pg_status_alive PostgreSQL server is alive and ready (1=yes, 0=no)." > "$METRIC_FILE"
echo "# TYPE pg_status_alive gauge" >> "$METRIC_FILE"

if [ $PG_STATUS_EXIT_CODE -eq 0 ]; then
    echo "pg_status_alive{host=\"$PG_HOST\",port=\"$PG_PORT\"} 1" >> "$METRIC_FILE"
    exit 0
else
    echo "pg_status_alive{host=\"$PG_HOST\",port=\"$PG_PORT\"} 0" >> "$METRIC_FILE"
    exit 1
fi
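
One caveat with writing the `.prom` file in place: Node Exporter can read it while the script is still writing and see a partial file. A common remedy is to write to a temporary file in the same directory and rename it over the target, since a rename within one filesystem is atomic. The sketch below shows the idea in Python using only the standard library; `render_gauge` and `write_atomically` are illustrative names, not part of any Prometheus library.

```python
import os
import tempfile


def escape_label_value(value):
    """Escape backslash, double quote and newline as the text format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge family; `samples` is a list of (labels_dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


def write_atomically(path, content):
    """Write to a temp file in the same directory, then rename it over `path`."""
    directory = os.path.dirname(path) or "."
    # The ".tmp" suffix keeps the collector, which only reads *.prom, from picking it up.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter must be able to read it
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The same effect is available in the shell scripts by redirecting output to a temporary file in the metrics directory and finishing with `mv`.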

Similarly, for replication lag:

#!/bin/bash

# Configuration
PG_HOST="${1:-localhost}"
PG_PORT="${2:-5432}"
PG_USER="${3:-postgres}"
PG_DB="${4:-postgres}"
REPLICATION_USER="${5:-repl_user}"

METRICS_DIR="/var/lib/prometheus/node-exporter/textfile_collector"
METRIC_FILE="${METRICS_DIR}/pg_replication_${PG_HOST//./_}.prom"

mkdir -p "$METRICS_DIR"

# Check if psql is available
if ! command -v psql &> /dev/null
then
    echo "# HELP pg_replication_status Replication status (1=OK, 0=Error)." > "$METRIC_FILE"
    echo "# TYPE pg_replication_status gauge" >> "$METRIC_FILE"
    echo "pg_replication_status{host=\"$PG_HOST\",port=\"$PG_PORT\"} 0" >> "$METRIC_FILE"
    exit 1
fi

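# Stop if the recovery-state query fails; otherwise an unreachable node would be
# reported as a healthy primary.
if ! IS_STANDBY=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$PG_USER" -d "$PG_DB" -tAc "SELECT pg_is_in_recovery();"); then
    echo "# HELP pg_replication_status Replication status (1=OK, 0=Error)." > "$METRIC_FILE"
    echo "# TYPE pg_replication_status gauge" >> "$METRIC_FILE"
    echo "pg_replication_status{host=\"$PG_HOST\",port=\"$PG_PORT\"} 0" >> "$METRIC_FILE"
    exit 1
fi

if [ "$IS_STANDBY" = "t" ]; then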
    RECV_STATUS=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "SELECT status FROM pg_stat_wal_receiver;")
    RECV_PID=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "SELECT pid FROM pg_stat_wal_receiver;")
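    # pg_stat_wal_receiver has no sent_lsn/write_lsn/flush_lsn/replay_lsn columns. On PostgreSQL 13+,
    # compare latest_end_lsn with written_lsn, flushed_lsn, and the replay position; on older
    # versions, use received_lsn in place of written_lsn and flushed_lsn.
    RECV_LAG_QUERY="SELECT pg_wal_lsn_diff(latest_end_lsn, written_lsn) AS write_lag, pg_wal_lsn_diff(latest_end_lsn, flushed_lsn) AS flush_lag, pg_wal_lsn_diff(latest_end_lsn, pg_last_wal_replay_lsn()) AS replay_lag FROM pg_stat_wal_receiver;"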
    RECV_LAG=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "$RECV_LAG_QUERY")

    # Write each metric family as one contiguous group (HELP, TYPE, samples),
    # which is what the Prometheus text format expects.
    LAG_THRESHOLD=1073741824 # 1GB
    REPL_OK=1
    {
        # "streaming" is the only healthy receiver status; an empty PID means no receiver.
        if [ -z "$RECV_PID" ] || [ "$RECV_STATUS" != "streaming" ]; then
            REPL_OK=0
        else
            # psql -tA separates columns with '|', not whitespace
            WRITE_LAG=$(echo "$RECV_LAG" | awk -F'|' '{print $1}')
            FLUSH_LAG=$(echo "$RECV_LAG" | awk -F'|' '{print $2}')
            REPLAY_LAG=$(echo "$RECV_LAG" | awk -F'|' '{print $3}')

            echo "# HELP pg_replication_lag_bytes Replication lag in bytes (write, flush, replay)."
            echo "# TYPE pg_replication_lag_bytes gauge"
            echo "pg_replication_lag_bytes{host=\"$PG_HOST\",port=\"$PG_PORT\",lag_type=\"write\"} ${WRITE_LAG:-0}"
            echo "pg_replication_lag_bytes{host=\"$PG_HOST\",port=\"$PG_PORT\",lag_type=\"flush\"} ${FLUSH_LAG:-0}"
            echo "pg_replication_lag_bytes{host=\"$PG_HOST\",port=\"$PG_PORT\",lag_type=\"replay\"} ${REPLAY_LAG:-0}"

            # Treat high replay lag as an error for alerting
            if [ -n "$REPLAY_LAG" ] && [ "$REPLAY_LAG" -gt "$LAG_THRESHOLD" ]; then
                REPL_OK=0
            fi
        fi

        echo "# HELP pg_replication_status Replication status (1=OK, 0=Error)."
        echo "# TYPE pg_replication_status gauge"
        echo "pg_replication_status{host=\"$PG_HOST\",port=\"$PG_PORT\"} $REPL_OK"
    } > "$METRIC_FILE"

    if [ "$REPL_OK" -eq 1 ]; then
        exit 0
    else
        exit 1
    fi
else
    # Primary server: check number of connected standbys
    NUM_STANDBYS=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "SELECT count(*) FROM pg_stat_replication;")
    echo "# HELP pg_replication_status Replication status (1=OK, 0=Error)." > "$METRIC_FILE"
    echo "# TYPE pg_replication_status gauge" >> "$METRIC_FILE"
    echo "pg_replication_status{host=\"$PG_HOST\",port=\"$PG_PORT\"} 1" >> "$METRIC_FILE" # Primary is considered OK if running

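    # HELP and TYPE belong in the metrics file, next to the sample they describe
    echo "# HELP pg_connected_standbys Number of connected standbys." >> "$METRIC_FILE"
    echo "# TYPE pg_connected_standbys gauge" >> "$METRIC_FILE"
    echo "pg_connected_standbys{host=\"$PG_HOST\",port=\"$PG_PORT\"} ${NUM_STANDBYS:-0}" >> "$METRIC_FILE"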
    exit 0
fi

Ensure the cron job for these scripts is configured to run periodically (e.g., every minute) and that the Node Exporter is started with `--collector.textfile.directory` pointing at the `textfile_collector` directory. You’ll then need to set up Prometheus alerts based on these metrics (e.g., `pg_status_alive == 0`, `pg_replication_status == 0`, `pg_replication_lag_bytes{lag_type="replay"} > 1073741824`).

Monitoring Python Application Performance with `psutil` and Prometheus

Your Python application, running on OVH infrastructure, is the other critical component. Monitoring its resource consumption (CPU, memory, network, disk I/O) and internal performance metrics is vital for stability and scalability. The `psutil` library is an excellent cross-platform tool for gathering this information directly from the process.

Gathering Process Metrics with `psutil`

We can write a Python script that uses `psutil` to collect metrics for the main Python application process. This script will then expose these metrics via an HTTP endpoint, making them scrapeable by Prometheus.

`app_metrics_exporter.py` (Python Script)

import psutil
import time
import os
from prometheus_client import start_http_server, Gauge, Counter

# Configuration
APP_PROCESS_NAME = "your_app.py" # Or the name of your main Python script/executable
METRICS_PORT = 9101 # Port for the Prometheus exporter
COLLECT_INTERVAL = 15 # Seconds

# Prometheus Metrics
# Gauge: Current value
app_cpu_percent = Gauge('app_process_cpu_percent', 'CPU usage percentage of the application process', ['pid', 'name'])
app_memory_percent = Gauge('app_process_memory_percent', 'Memory usage percentage of the application process', ['pid', 'name'])
app_memory_rss_bytes = Gauge('app_process_memory_rss_bytes', 'Resident Set Size (RSS) memory usage of the application process', ['pid', 'name'])
app_memory_vms_bytes = Gauge('app_process_memory_vms_bytes', 'Virtual Memory Size (VMS) usage of the application process', ['pid', 'name'])
# No per-process network counters are defined: psutil does not expose network bytes per process.
app_disk_read_bytes_total = Counter('app_process_disk_read_bytes_total', 'Total disk bytes read by the application process', ['pid', 'name', 'path'])
app_disk_write_bytes_total = Counter('app_process_disk_write_bytes_total', 'Total disk bytes written by the application process', ['pid', 'name', 'path'])
app_threads_count = Gauge('app_process_threads_count', 'Number of threads in the application process', ['pid', 'name'])
app_open_files_count = Gauge('app_process_open_files_count', 'Number of open files by the application process', ['pid', 'name'])

# Find the application process
def find_app_process():
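    # os.getlogin() needs a controlling terminal and fails under cron or systemd,
    # so fall back to the user that owns this exporter process.
    expected_user = os.getenv('APP_USER') or psutil.Process().username()
    for proc in psutil.process_iter(['pid', 'name', 'username', 'cmdline']):
        try:
            # A Python app usually shows up with the interpreter as its process name
            # (e.g. "python3"), so also look for the script name in the command line.
            name = proc.info['name'] or ''
            cmdline = ' '.join(proc.info['cmdline'] or [])
            matches = APP_PROCESS_NAME in name or APP_PROCESS_NAME in cmdline
            if matches and proc.info['username'] == expected_user: # Optional: Filter by user
                return proc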
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            pass
    return None

# Collect and expose metrics
def collect_metrics(process):
    if not process:
        print("Application process not found. Skipping metrics collection.")
        return

    pid = str(process.pid)
    name = process.info['name']

    try:
        # CPU and Memory
        cpu_percent = process.cpu_percent(interval=0.1) # Small interval for better accuracy
        memory_info = process.memory_info()
        memory_percent = process.memory_percent()

        app_cpu_percent.labels(pid=pid, name=name).set(cpu_percent)
        app_memory_percent.labels(pid=pid, name=name).set(memory_percent)
        app_memory_rss_bytes.labels(pid=pid, name=name).set(memory_info.rss)
        app_memory_vms_bytes.labels(pid=pid, name=name).set(memory_info.vms)

        # Network
        # psutil does not expose per-process network byte counters: process.io_counters()
        # covers I/O in general, and psutil.net_io_counters() is host-wide.
        # Monitor the host's network interfaces instead, or use a tool such as `nethogs`
        # or custom eBPF if per-process traffic is required.

        # Disk I/O
        # process.io_counters() returns cumulative totals since the process started and does
        # not break them down by path, so we report process-wide totals under path='total'.
        # Counter.inc() expects an increment, so add only the change since the last collection.
        disk_io = process.io_counters()
        last_io = collect_metrics.__dict__.setdefault('last_io', {})  # previous totals per PID
        prev_read, prev_write = last_io.get(pid, (0, 0))
        app_disk_read_bytes_total.labels(pid=pid, name=name, path='total').inc(max(0, disk_io.read_bytes - prev_read))
        app_disk_write_bytes_total.labels(pid=pid, name=name, path='total').inc(max(0, disk_io.write_bytes - prev_write))
        last_io[pid] = (disk_io.read_bytes, disk_io.write_bytes)

        # Threads and Open Files
        app_threads_count.labels(pid=pid, name=name).set(process.num_threads())
        app_open_files_count.labels(pid=pid, name=name).set(process.num_fds())

    except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
        print(f"Process {pid} ({name}) disappeared or access denied. Skipping this collection cycle.")
        # Label sets already exported for this PID stay visible until the main loop
        # calls .clear() on the gauges, which happens once the process is no longer found.
        return

if __name__ == '__main__':
    print(f"Starting application metrics exporter on port {METRICS_PORT}")
    start_http_server(METRICS_PORT)
    print(f"Monitoring process: {APP_PROCESS_NAME}")

    while True:
        app_proc = find_app_process()
        if app_proc:
            collect_metrics(app_proc)
        else:
            print(f"Application process '{APP_PROCESS_NAME}' not found. Retrying in {COLLECT_INTERVAL}s...")
            # Drop all label sets so gauges for the vanished PID stop being exported
            app_cpu_percent.clear()
            app_memory_percent.clear()
            app_memory_rss_bytes.clear()
            app_memory_vms_bytes.clear()
            app_threads_count.clear()
            app_open_files_count.clear()
            # Network and Disk counters are cumulative, so clearing them might be misleading.
            # They will naturally stop incrementing if the process is gone.

        time.sleep(COLLECT_INTERVAL)

To run this exporter:

pip install psutil prometheus_client
python app_metrics_exporter.py &

This script will start an HTTP server on port 9101, exposing metrics that Prometheus can scrape. You’ll need to configure your Prometheus instance to scrape `http://your-app-server-ip:9101/metrics`.
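
A detail worth spelling out: `process.io_counters()` reports totals since the process started, while `Counter.inc()` expects an increment. Passing the running total on every collection would inflate the counter, so some bookkeeping is needed between collections. The helper below is a minimal sketch of that bookkeeping; `CumulativeDelta` is a hypothetical name, and treating a drop in the reading as a restart is an assumption.

```python
class CumulativeDelta:
    """Turn cumulative readings into non-negative increments, per key."""

    def __init__(self):
        self._last = {}

    def update(self, key, current):
        """Return how much `current` grew since the last reading for `key`."""
        previous = self._last.get(key)
        self._last[key] = current
        if previous is None or current < previous:
            # First sample, or the source restarted: count from zero.
            return current
        return current - previous
```

With one instance per metric, the value returned by `update(pid, disk_io.read_bytes)` is what would be passed to `inc()`.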

Application-Specific Metrics

Beyond system-level metrics, your Python application should expose its own business-logic metrics. This could include:

Request latency (using `prometheus_client.Summary` or `Histogram`).
Number of requests processed (using `prometheus_client.Counter`).
Queue sizes.
Cache hit/miss ratios.
Error counts for specific operations.

Integrate these directly into your application code. For example, to track request duration:

from prometheus_client import Summary, Counter, Histogram
import time

# Define metrics at module level or within a class
REQUEST_LATENCY = Summary('http_request_duration_seconds', 'HTTP request duration in seconds', ['endpoint', 'method'])
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests processed', ['endpoint', 'method', 'status_code'])

# Example usage within a web framework (e.g., Flask)
# @app.route('/api/v1/users')
# def get_users():
#     start_time = time.time()
#     endpoint = '/api/v1/users'
#     method = 'GET'
#     status_code = 200
#     try:
#         # ... your application logic ...
#         time.sleep(0.5) # Simulate work
#         # ...
#     except Exception as e:
#         status_code = 500
#         # Log error
#     finally:
#         duration = time.time() - start_time
#         REQUEST_LATENCY.labels(endpoint=endpoint, method=method).observe(duration)
#         REQUEST_COUNT.labels(endpoint=endpoint, method=method, status_code=status_code).inc()
#     return "Users data", status_code
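
To avoid repeating the timing boilerplate in every handler, the same pattern can be wrapped in a decorator. This is a sketch under assumptions: `timed` is a hypothetical helper, and it takes plain callables so that it does not depend on a particular metrics library. In practice you might pass `REQUEST_LATENCY.labels(endpoint='/api/v1/users', method='GET').observe` as `observe`.

```python
import functools
import time


def timed(observe, count=None):
    """Decorator factory: report the wall-clock duration of each call.

    `observe` receives the duration in seconds; `count`, if given, receives
    the outcome ("ok" or "error").
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise  # Re-raise so the framework still sees the failure
            finally:
                observe(time.perf_counter() - start)
                if count is not None:
                    count(outcome)
        return wrapper
    return decorator
```

The duration is recorded in `finally`, so failed requests are measured as well as successful ones.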

Log Aggregation and Analysis on OVH

Centralized logging is non-negotiable for debugging and auditing. On OVH, you can set up a robust log aggregation pipeline. A common pattern involves using Fluentd or Filebeat to collect logs from your application servers and PostgreSQL instances, forwarding them to a central store like Elasticsearch or Loki.
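
Shippers such as Filebeat are easiest to configure when the application writes one JSON object per line. As a rough sketch, the standard `logging` module can be given a formatter that does this. `JsonLineFormatter` and `build_logger` are hypothetical names, the field set is an assumption, and the timestamp is local time without a zone for brevity.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Format each record as a single-line JSON object with a `message` field."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # Avoid duplicate lines via the root logger
    return logger
```

Swapping `StreamHandler` for `logging.FileHandler` pointed at the application's log file gives the shipper a file to tail.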

Filebeat Configuration for PostgreSQL and Application Logs

We’ll configure Filebeat to tail log files and send them to a Logstash instance or directly to Elasticsearch/Loki.

`filebeat.yml` (Filebeat Configuration)

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/postgres/*.log # PostgreSQL logs
  # The health-check logs have their own input below; harvesting one file from two inputs duplicates events
  exclude_files: ['pg_status_check\.log$', 'pg_replication_check\.log$']
  fields_under_root: true
  fields:
    service: postgresql # Tag PostgreSQL logs
    environment: production
  # No json.* options here: PostgreSQL's default log format is plain text

- type: log
  enabled: true
  paths:
    - /var/log/your_app/*.log # Application logs
  fields_under_root: true
  fields:
    service: my_python_app
    environment: production
  json.keys_under_root: true
  json.message_key: message # Assuming your app logs JSON with a 'message' field
# Health-check script logs (plain text, not JSON)
- type: log
  enabled: true
  paths:
    - /var/log/postgres/pg_status_check.log
    - /var/log/postgres/pg_replication_check.log
  fields_under_root: true
  fields:
    service: postgresql_healthcheck
    environment: production
  # No JSON parsing needed for these simple log files

output.elasticsearch:
  hosts: ["your-elasticsearch-host:9200"]
  # username: "elastic"
  # password: "changeme"

# Only one output may be enabled at a time.
# Or for Logstash:
# output.logstash:
#   hosts: ["your-logstash-host:5044"]

# Loki: Filebeat does not ship a native Loki output. Route through Logstash with a
# Loki output plugin, or collect these files with Promtail or Grafana Alloy instead.

# If using Kafka:
# output.kafka:
#   hosts: ["your-kafka-broker:9092"]
#   topic: 'logs'
#   partition.round_robin:
#     reachable_only: false
#   required_acks: 1
#   compression: gzip
#   max_message_bytes: 1000000

# For local testing with file output:
# output.