Server Monitoring Best Practices: Keeping Your Python App and PostgreSQL Clusters Alive on OVH
Proactive PostgreSQL Cluster Health Checks with `pg_isready` and Custom Scripts
Maintaining the health of a PostgreSQL cluster, especially in a distributed setup on a cloud provider like OVH, requires more than just basic CPU/memory monitoring. We need to ensure the database instances are not only reachable but also responsive and capable of serving read/write operations. The built-in `pg_isready` utility is a cornerstone for this, but its output needs to be parsed and acted upon intelligently.
On each PostgreSQL node, we’ll set up a cron job that periodically runs a health-check script. The script uses `pg_isready` and, for more advanced checks, runs a simple query. Its output is logged and, crucially, made available to a central monitoring system such as Prometheus, either through an exporter that Prometheus scrapes or by pushing to a Pushgateway.
Basic Reachability and Status Check
The `pg_isready` command provides a quick way to determine whether a PostgreSQL server is accepting connections. It prints a short status message (`accepting connections`, `rejecting connections`, `no response`, or `no attempt`) and sets a matching exit code (0 to 3). We’ll wrap this in a shell script that checks the exit code.
`check_pg_status.sh` (Bash Script)
#!/bin/bash
# Configuration
PG_HOST="${1:-localhost}"
PG_PORT="${2:-5432}"
PG_USER="${3:-postgres}" # User for connection check, can be a read-only user
PG_DB="${4:-postgres}" # Database to connect to for status check
# Log file
LOG_FILE="/var/log/postgres/pg_status_check.log"
mkdir -p "$(dirname "$LOG_FILE")"
# Timestamp
TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")
# Check if pg_isready is available
if ! command -v pg_isready &> /dev/null
then
echo "$TIMESTAMP - ERROR: pg_isready command not found. Please install PostgreSQL client tools." | tee -a "$LOG_FILE"
exit 1
fi
# Execute pg_isready
# -h: host, -p: port, -U: user, -d: database, -q: quiet (suppress the status message)
# With -q we rely on the exit code alone; remove it when debugging to see the message.
# The exit code of pg_isready is 0 if the server is accepting connections.
pg_isready -h "$PG_HOST" -p "$PG_PORT" -U "$PG_USER" -d "$PG_DB" -q
PG_STATUS_EXIT_CODE=$?
if [ $PG_STATUS_EXIT_CODE -eq 0 ]; then
echo "$TIMESTAMP - INFO: PostgreSQL on $PG_HOST:$PG_PORT is ALIVE and READY." | tee -a "$LOG_FILE"
# Optionally, send a success metric to your monitoring system here
# Example: echo "pg_status_alive{host=\"$PG_HOST\"} 1" | curl --data-binary @- http://your-monitoring-push-endpoint/metrics
exit 0
else
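# pg_isready returns non-zero for various states:
# 1: server is rejecting connections (e.g., still starting up)
# 2: no response to the connection attempt (e.g., server down, network issue)
# 3: no attempt was made (e.g., invalid parameters)
# Valid credentials are not required: the check only asks whether the server accepts connections.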
echo "$TIMESTAMP - CRITICAL: PostgreSQL on $PG_HOST:$PG_PORT is NOT READY. pg_isready exited with code $PG_STATUS_EXIT_CODE." | tee -a "$LOG_FILE"
# Optionally, send a failure metric
# Example: echo "pg_status_alive{host=\"$PG_HOST\"} 0" | curl --data-binary @- http://your-monitoring-push-endpoint/metrics
exit 1
fi
To make this script executable:
chmod +x check_pg_status.sh
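The same exit-code handling can be sketched in Python, which is convenient if the check later moves into a Python exporter. This is a minimal sketch, not a drop-in replacement: the names `PG_ISREADY_STATES`, `describe_exit_code`, and `check_server` are hypothetical, and it assumes `pg_isready` is on the `PATH` (otherwise `subprocess.run` raises `FileNotFoundError`).

```python
import subprocess

# Exit codes documented for pg_isready.
PG_ISREADY_STATES = {
    0: "accepting connections",
    1: "rejecting connections",
    2: "no response",
    3: "no attempt made",
}


def describe_exit_code(code: int) -> str:
    """Translate a pg_isready exit code into a readable state."""
    return PG_ISREADY_STATES.get(code, f"unknown exit code {code}")


def check_server(host: str = "localhost", port: int = 5432, timeout: int = 3) -> str:
    """Run pg_isready quietly and return the state it reports."""
    result = subprocess.run(
        ["pg_isready", "-q", "-h", host, "-p", str(port), "-t", str(timeout)],
        check=False,  # a non-zero exit code is a result here, not an error
    )
    return describe_exit_code(result.returncode)
```

Keeping the mapping separate from the subprocess call makes it easy to log the state name alongside the numeric code, which helps distinguish a server that is still starting up from one that is unreachable.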
Scheduling the Health Check
We’ll use cron to run this script at regular intervals. For a cluster, you’d typically run this on each node, targeting itself as the host. For a highly available setup, you might also run checks from a separate monitoring server targeting the primary and standby instances.
Cron Job Entry (e.g., every 5 minutes)
*/5 * * * * /path/to/your/scripts/check_pg_status.sh > /dev/null 2>> /var/log/postgres/pg_status_check.log
This cron job executes the script every 5 minutes. The script already appends each result to its log file through `tee -a`, so the cron entry discards standard output (`> /dev/null`) to avoid recording every line twice, and appends only standard error to the log (`2>>`) so that unexpected error messages are still captured.
Advanced Checks: Querying for Replication Status
For PostgreSQL replication, `pg_isready` only tells us if a server is running. To verify replication health (e.g., lag, sync status), we need to query PostgreSQL’s system views. This is particularly important for standby servers.
We’ll add a second script to perform these checks. It requires a role that can read the full contents of `pg_stat_replication` (on the primary) and `pg_stat_wal_receiver` (on the standby), for example a member of the built-in `pg_monitor` role.
`check_pg_replication.sh` (Bash Script for Standby)
#!/bin/bash
# Configuration
PG_HOST="${1:-localhost}"
PG_PORT="${2:-5432}"
PG_USER="${3:-postgres}"
PG_DB="${4:-postgres}"
REPLICATION_USER="${5:-repl_user}" # User for replication checks
# Log file
LOG_FILE="/var/log/postgres/pg_replication_check.log"
mkdir -p "$(dirname "$LOG_FILE")"
# Timestamp
TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")
# Check if psql is available
if ! command -v psql &> /dev/null
then
echo "$TIMESTAMP - ERROR: psql command not found. Please install PostgreSQL client tools." | tee -a "$LOG_FILE"
exit 1
fi
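# Check if this is a standby server
IS_STANDBY=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$PG_USER" -d "$PG_DB" -tAc "SELECT pg_is_in_recovery();")
# An empty result means the query failed; do not mistake an unreachable server for a primary
if [ -z "$IS_STANDBY" ]; then
echo "$TIMESTAMP - CRITICAL: Could not query $PG_HOST:$PG_PORT to determine its role." | tee -a "$LOG_FILE"
exit 1
fi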
if [ "$IS_STANDBY" = "t" ]; then
# This is a standby server, check replication receiver status
RECV_STATUS=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "SELECT status FROM pg_stat_wal_receiver;")
RECV_PID=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "SELECT pid FROM pg_stat_wal_receiver;")
# pg_stat_wal_receiver has no sent/replay columns, so lag is measured against latest_end_lsn,
# the last WAL location exchanged with the upstream WAL sender (written_lsn and flushed_lsn require PostgreSQL 13+).
RECV_LAG_QUERY="SELECT pg_wal_lsn_diff(latest_end_lsn, written_lsn) AS write_lag, pg_wal_lsn_diff(latest_end_lsn, flushed_lsn) AS flush_lag, pg_wal_lsn_diff(latest_end_lsn, pg_last_wal_replay_lsn()) AS replay_lag FROM pg_stat_wal_receiver;"
RECV_LAG=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "$RECV_LAG_QUERY")
if [ -z "$RECV_PID" ]; then
echo "$TIMESTAMP - CRITICAL: Standby $PG_HOST:$PG_PORT replication receiver is not running." | tee -a "$LOG_FILE"
exit 1
elif [ "$RECV_STATUS" != "streaming" ]; then
# "streaming" is the healthy receiver state; "catchup" is a pg_stat_replication state on the primary
echo "$TIMESTAMP - CRITICAL: Standby $PG_HOST:$PG_PORT replication receiver status is '$RECV_STATUS'." | tee -a "$LOG_FILE"
exit 1
else
# Parse lag values (in bytes); psql -tA separates columns with "|"
WRITE_LAG=$(echo "$RECV_LAG" | awk -F'|' '{print $1}')
FLUSH_LAG=$(echo "$RECV_LAG" | awk -F'|' '{print $2}')
REPLAY_LAG=$(echo "$RECV_LAG" | awk -F'|' '{print $3}')
# Define a threshold for acceptable lag (e.g., 1GB = 1073741824 bytes)
LAG_THRESHOLD=1073741824
if [ -n "$REPLAY_LAG" ] && [ "$REPLAY_LAG" -gt "$LAG_THRESHOLD" ]; then
echo "$TIMESTAMP - WARNING: Standby $PG_HOST:$PG_PORT replication lag (replay) is high: $REPLAY_LAG bytes." | tee -a "$LOG_FILE"
# Send warning metric
exit 0 # Not a critical failure, but a warning
else
echo "$TIMESTAMP - INFO: Standby $PG_HOST:$PG_PORT replication is healthy. Status: $RECV_STATUS, Replay Lag: ${REPLAY_LAG:-N/A} bytes." | tee -a "$LOG_FILE"
# Send success metric
exit 0
fi
fi
else
# This is a primary server, check replication sender status
# We can query pg_stat_replication to see connected standbys
NUM_STANDBYS=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "SELECT count(*) FROM pg_stat_replication;")
echo "$TIMESTAMP - INFO: Primary $PG_HOST:$PG_PORT has $NUM_STANDBYS connected standbys." | tee -a "$LOG_FILE"
# Send metric for number of standbys
exit 0
fi
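This script needs to be run with appropriate credentials. The `REPLICATION_USER` role needs permission to read the statistics views: membership in the built-in `pg_monitor` role is the usual way to grant this, whereas the `REPLICATION` attribute governs streaming connections rather than access to these views. It also needs a `pg_hba.conf` entry that lets it connect to `PG_DB` from the host running the script, and a non-interactive password source such as `~/.pgpass`, since cron cannot answer a prompt. The `PG_USER` is for general connection checks and can be a less privileged user.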
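Parsing the lag row is the most error-prone step, because `psql -tA` prints columns separated by `|` rather than whitespace. The following minimal Python sketch shows that parsing and the threshold comparison in isolation. The function names are hypothetical, and it assumes a single row with three byte counts in the order write, flush, replay.

```python
LAG_THRESHOLD_BYTES = 1073741824  # 1 GB, the same threshold used in the shell script


def parse_lag_row(row: str) -> dict:
    """Split one unaligned psql row ("write|flush|replay") into integers.

    Empty fields (SQL NULL) become None.
    """
    names = ("write", "flush", "replay")
    fields = row.strip().split("|")
    if len(fields) != len(names):
        raise ValueError(f"expected {len(names)} columns, got {len(fields)}")
    return {name: (int(field) if field else None) for name, field in zip(names, fields)}


def replay_lag_too_high(lag: dict, threshold: int = LAG_THRESHOLD_BYTES) -> bool:
    """Return True when replay lag is known and exceeds the threshold."""
    replay = lag["replay"]
    return replay is not None and replay > threshold
```

Raising on an unexpected column count, instead of silently defaulting to zero, keeps a failed query from being reported as a healthy standby.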
Integrating with Prometheus
The most robust way to handle these checks is by integrating them into a Prometheus monitoring stack. You have two primary options:
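- Node Exporter with Textfile Collector: Modify the scripts to write metrics in Prometheus text format to a designated directory (e.g., `/var/lib/prometheus/node-exporter/textfile_collector/`). The Node Exporter, started with `--collector.textfile.directory` pointing at that directory, then exposes the contents of every `*.prom` file there when Prometheus scrapes it.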
- Custom Exporter: Write a dedicated exporter (e.g., in Python using `prometheus_client`) that runs the checks and exposes metrics via an HTTP endpoint. This is more flexible but requires more development.
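For the textfile approach, one detail worth getting right is that the collector may read a file while it is still being written. A minimal Python sketch of rendering a gauge and replacing the file atomically is shown here. The helper names `render_gauge` and `write_atomically` are hypothetical, only the standard library is used, and label values are assumed not to need escaping.

```python
import os
import tempfile


def render_gauge(name: str, help_text: str, samples: list) -> str:
    """Render one gauge family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


def write_atomically(path: str, content: str) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file with mode 0600
        os.replace(tmp_path, path)  # atomic when both paths are on one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Because the temporary file does not end in `.prom`, the collector ignores it until the rename, so a scrape sees either the old file or the new one and never a half-written mix.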
Example: Using Node Exporter Textfile Collector
Let’s adapt `check_pg_status.sh` to output Prometheus metrics.
#!/bin/bash
# Configuration
PG_HOST="${1:-localhost}"
PG_PORT="${2:-5432}"
PG_USER="${3:-postgres}"
PG_DB="${4:-postgres}"
# Prometheus metrics output directory
METRICS_DIR="/var/lib/prometheus/node-exporter/textfile_collector"
METRIC_FILE="${METRICS_DIR}/pg_status_${PG_HOST//./_}.prom" # Sanitize hostname for filename
mkdir -p "$METRICS_DIR"
# Check if pg_isready is available
if ! command -v pg_isready &> /dev/null
then
echo "# HELP pg_status_alive PostgreSQL server is alive and ready (1=yes, 0=no)." > "$METRIC_FILE"
echo "# TYPE pg_status_alive gauge" >> "$METRIC_FILE"
echo "pg_status_alive{host=\"$PG_HOST\",port=\"$PG_PORT\"} 0" >> "$METRIC_FILE"
exit 1
fi
# Execute pg_isready
pg_isready -h "$PG_HOST" -p "$PG_PORT" -U "$PG_USER" -d "$PG_DB" -q
PG_STATUS_EXIT_CODE=$?
echo "# HELP pg_status_alive PostgreSQL server is alive and ready (1=yes, 0=no)." > "$METRIC_FILE"
echo "# TYPE pg_status_alive gauge" >> "$METRIC_FILE"
if [ $PG_STATUS_EXIT_CODE -eq 0 ]; then
echo "pg_status_alive{host=\"$PG_HOST\",port=\"$PG_PORT\"} 1" >> "$METRIC_FILE"
exit 0
else
echo "pg_status_alive{host=\"$PG_HOST\",port=\"$PG_PORT\"} 0" >> "$METRIC_FILE"
exit 1
fi
Similarly, for replication lag:
#!/bin/bash
# Configuration
PG_HOST="${1:-localhost}"
PG_PORT="${2:-5432}"
PG_USER="${3:-postgres}"
PG_DB="${4:-postgres}"
REPLICATION_USER="${5:-repl_user}"
METRICS_DIR="/var/lib/prometheus/node-exporter/textfile_collector"
METRIC_FILE="${METRICS_DIR}/pg_replication_${PG_HOST//./_}.prom"
mkdir -p "$METRICS_DIR"
# Check if psql is available
if ! command -v psql &> /dev/null
then
echo "# HELP pg_replication_status Replication status (1=OK, 0=Error)." > "$METRIC_FILE"
echo "# TYPE pg_replication_status gauge" >> "$METRIC_FILE"
echo "pg_replication_status{host=\"$PG_HOST\",port=\"$PG_PORT\"} 0" >> "$METRIC_FILE"
exit 1
fi
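IS_STANDBY=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$PG_USER" -d "$PG_DB" -tAc "SELECT pg_is_in_recovery();")
# An empty result means the query failed; report an error instead of assuming a primary
if [ -z "$IS_STANDBY" ]; then
echo "# HELP pg_replication_status Replication status (1=OK, 0=Error)." > "$METRIC_FILE"
echo "# TYPE pg_replication_status gauge" >> "$METRIC_FILE"
echo "pg_replication_status{host=\"$PG_HOST\",port=\"$PG_PORT\"} 0" >> "$METRIC_FILE"
exit 1
fi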
if [ "$IS_STANDBY" = "t" ]; then
RECV_STATUS=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "SELECT status FROM pg_stat_wal_receiver;")
RECV_PID=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "SELECT pid FROM pg_stat_wal_receiver;")
# pg_stat_wal_receiver has no sent/replay columns, so lag is measured against latest_end_lsn,
# the last WAL location exchanged with the upstream WAL sender (written_lsn and flushed_lsn require PostgreSQL 13+).
RECV_LAG_QUERY="SELECT pg_wal_lsn_diff(latest_end_lsn, written_lsn) AS write_lag, pg_wal_lsn_diff(latest_end_lsn, flushed_lsn) AS flush_lag, pg_wal_lsn_diff(latest_end_lsn, pg_last_wal_replay_lsn()) AS replay_lag FROM pg_stat_wal_receiver;"
RECV_LAG=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "$RECV_LAG_QUERY")
# psql -tA separates columns with "|"
WRITE_LAG=$(echo "$RECV_LAG" | awk -F'|' '{print $1}')
FLUSH_LAG=$(echo "$RECV_LAG" | awk -F'|' '{print $2}')
REPLAY_LAG=$(echo "$RECV_LAG" | awk -F'|' '{print $3}')
LAG_THRESHOLD=1073741824 # 1GB
REPL_OK=1
if [ -z "$RECV_PID" ] || [ "$RECV_STATUS" != "streaming" ]; then
REPL_OK=0 # Receiver not running, or not in the healthy "streaming" state
elif [ -n "$REPLAY_LAG" ] && [ "$REPLAY_LAG" -gt "$LAG_THRESHOLD" ]; then
REPL_OK=0 # Treat high lag as an error for alerting
fi
# Write each metric family together with its HELP/TYPE header in a single pass
{
echo "# HELP pg_replication_status Replication status (1=OK, 0=Error)."
echo "# TYPE pg_replication_status gauge"
echo "pg_replication_status{host=\"$PG_HOST\",port=\"$PG_PORT\"} $REPL_OK"
if [ -n "$RECV_PID" ]; then
echo "# HELP pg_replication_lag_bytes Replication lag in bytes (write, flush, replay)."
echo "# TYPE pg_replication_lag_bytes gauge"
echo "pg_replication_lag_bytes{host=\"$PG_HOST\",port=\"$PG_PORT\",lag_type=\"write\"} ${WRITE_LAG:-0}"
echo "pg_replication_lag_bytes{host=\"$PG_HOST\",port=\"$PG_PORT\",lag_type=\"flush\"} ${FLUSH_LAG:-0}"
echo "pg_replication_lag_bytes{host=\"$PG_HOST\",port=\"$PG_PORT\",lag_type=\"replay\"} ${REPLAY_LAG:-0}"
fi
} > "$METRIC_FILE"
if [ "$REPL_OK" -eq 1 ]; then exit 0; else exit 1; fi
else
# Primary server: check number of connected standbys
NUM_STANDBYS=$(psql -h "$PG_HOST" -p "$PG_PORT" -U "$REPLICATION_USER" -d "$PG_DB" -tAc "SELECT count(*) FROM pg_stat_replication;")
echo "# HELP pg_replication_status Replication status (1=OK, 0=Error)." > "$METRIC_FILE"
echo "# TYPE pg_replication_status gauge" >> "$METRIC_FILE"
echo "pg_replication_status{host=\"$PG_HOST\",port=\"$PG_PORT\"} 1" >> "$METRIC_FILE" # Primary is considered OK if running
# The HELP/TYPE header belongs in the metric file, next to the sample it describes
echo "# HELP pg_connected_standbys Number of connected standbys." >> "$METRIC_FILE"
echo "# TYPE pg_connected_standbys gauge" >> "$METRIC_FILE"
echo "pg_connected_standbys{host=\"$PG_HOST\",port=\"$PG_PORT\"} ${NUM_STANDBYS:-0}" >> "$METRIC_FILE"
exit 0
fi
Ensure the cron job for these scripts is configured to run periodically (e.g., every minute) and that the Node Exporter is configured to read the `textfile_collector` directory. You’ll then need to set up Prometheus alerts based on these metrics (e.g., `pg_status_alive == 0`, `pg_replication_status == 0`, `pg_replication_lag_bytes{lag_type="replay"} > 1073741824`).
Monitoring Python Application Performance with `psutil` and Prometheus
Your Python application, running on OVH infrastructure, is the other critical component. Monitoring its resource consumption (CPU, memory, network, disk I/O) and internal performance metrics is vital for stability and scalability. The `psutil` library is an excellent cross-platform tool for gathering this information directly from the process.
Gathering Process Metrics with `psutil`
We can write a Python script that uses `psutil` to collect metrics for the main Python application process. This script will then expose these metrics via an HTTP endpoint, making them scrapeable by Prometheus.
`app_metrics_exporter.py` (Python Script)
import getpass
import os
import time

import psutil
from prometheus_client import start_http_server, Gauge, Counter

# Configuration
APP_PROCESS_NAME = "your_app.py"  # Or the name of your main Python script/executable
METRICS_PORT = 9101  # Port for the Prometheus exporter
COLLECT_INTERVAL = 15  # Seconds

# Prometheus Metrics
# Gauge: current value; Counter: cumulative total
app_cpu_percent = Gauge('app_process_cpu_percent', 'CPU usage percentage of the application process', ['pid', 'name'])
app_memory_percent = Gauge('app_process_memory_percent', 'Memory usage percentage of the application process', ['pid', 'name'])
app_memory_rss_bytes = Gauge('app_process_memory_rss_bytes', 'Resident Set Size (RSS) memory usage of the application process', ['pid', 'name'])
app_memory_vms_bytes = Gauge('app_process_memory_vms_bytes', 'Virtual Memory Size (VMS) usage of the application process', ['pid', 'name'])
app_disk_read_bytes_total = Counter('app_process_disk_read_bytes_total', 'Total disk bytes read by the application process', ['pid', 'name', 'path'])
app_disk_write_bytes_total = Counter('app_process_disk_write_bytes_total', 'Total disk bytes written by the application process', ['pid', 'name', 'path'])
app_threads_count = Gauge('app_process_threads_count', 'Number of threads in the application process', ['pid', 'name'])
app_open_files_count = Gauge('app_process_open_files_count', 'Number of open file descriptors held by the application process', ['pid', 'name'])
# Network: psutil has no per-process traffic counters. Monitor the host's interfaces
# instead, or use a tool such as `nethogs` or eBPF if you need per-process figures.

# Last cumulative I/O reading per PID, used to turn totals into increments
_last_io = {}


# Find the application process
def find_app_process():
    # getpass.getuser() also works without a controlling terminal (cron, systemd), unlike os.getlogin()
    wanted_user = os.getenv('APP_USER') or getpass.getuser()
    for proc in psutil.process_iter(['pid', 'name', 'username', 'cmdline']):
        try:
            # A script started as "python your_app.py" has the process name "python",
            # so match against the full command line as well.
            name = proc.info['name'] or ''
            cmdline = ' '.join(proc.info['cmdline'] or [])
            matches = APP_PROCESS_NAME in name or APP_PROCESS_NAME in cmdline
            if matches and proc.info['username'] == wanted_user:  # Optional: Filter by user
                return proc
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            pass
    return None


# Collect and expose metrics
def collect_metrics(process):
    if not process:
        print("Application process not found. Skipping metrics collection.")
        return
    pid = str(process.pid)
    name = process.info['name']
    try:
        # CPU and Memory
        cpu_percent = process.cpu_percent(interval=0.1)  # Blocks for 0.1s and measures usage over that window
        memory_info = process.memory_info()
        memory_percent = process.memory_percent()
        app_cpu_percent.labels(pid=pid, name=name).set(cpu_percent)
        app_memory_percent.labels(pid=pid, name=name).set(memory_percent)
        app_memory_rss_bytes.labels(pid=pid, name=name).set(memory_info.rss)
        app_memory_vms_bytes.labels(pid=pid, name=name).set(memory_info.vms)

        # Disk I/O: io_counters() returns cumulative totals for the whole process
        # (not available on every platform), so add only the change since the last reading.
        disk_io = process.io_counters()
        previous = _last_io.get(pid)
        if previous is not None:
            app_disk_read_bytes_total.labels(pid=pid, name=name, path='total').inc(max(disk_io.read_bytes - previous.read_bytes, 0))
            app_disk_write_bytes_total.labels(pid=pid, name=name, path='total').inc(max(disk_io.write_bytes - previous.write_bytes, 0))
        _last_io[pid] = disk_io

        # Threads and open file descriptors (num_fds() is Unix-only)
        app_threads_count.labels(pid=pid, name=name).set(process.num_threads())
        app_open_files_count.labels(pid=pid, name=name).set(process.num_fds())
    except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
        print(f"Process {pid} ({name}) disappeared or access denied.")
        # The main loop clears the gauges once the process can no longer be found.
        _last_io.pop(pid, None)
        return


if __name__ == '__main__':
    print(f"Starting application metrics exporter on port {METRICS_PORT}")
    start_http_server(METRICS_PORT)
    print(f"Monitoring process: {APP_PROCESS_NAME}")
    while True:
        app_proc = find_app_process()
        if app_proc:
            collect_metrics(app_proc)
        else:
            print(f"Application process '{APP_PROCESS_NAME}' not found. Retrying in {COLLECT_INTERVAL}s...")
            # Drop the gauge label sets so stale values are not exported
            app_cpu_percent.clear()
            app_memory_percent.clear()
            app_memory_rss_bytes.clear()
            app_memory_vms_bytes.clear()
            app_threads_count.clear()
            app_open_files_count.clear()
            # Disk counters are cumulative, so clearing them might be misleading.
            # They will naturally stop incrementing if the process is gone.
        time.sleep(COLLECT_INTERVAL)
To run this exporter:
pip install psutil prometheus_client
python app_metrics_exporter.py &
This script will start an HTTP server on port 9101, exposing metrics that Prometheus can scrape. You’ll need to configure your Prometheus instance to scrape `http://your-app-server-ip:9101/metrics`.
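One subtlety when exporting process I/O is that `psutil` reports cumulative totals, while a Prometheus counter expects increments. Adding the full total on every collection cycle would inflate the counter. A minimal sketch of the conversion is shown here; the helper name `counter_delta` is hypothetical, and it assumes readings are taken from one process at a time.

```python
def counter_delta(previous, current):
    """Return the increment to add to a monotonic counter.

    previous is the last cumulative reading (None on the first sample);
    current is the new cumulative reading. If the reading went backwards,
    the monitored process was probably restarted, so the whole current
    value counts as new.
    """
    if previous is None:
        return 0
    if current < previous:
        return current
    return current - previous


# Example: four cumulative readings, with a restart before the last one
readings = [1000, 1500, 1500, 200]
previous = None
total_increment = 0
for reading in readings:
    total_increment += counter_delta(previous, reading)
    previous = reading
```

Returning 0 for the first sample is a design choice: bytes transferred before the exporter started are ignored rather than attributed to the first interval.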
Application-Specific Metrics
Beyond system-level metrics, your Python application should expose its own business-logic metrics. This could include:
- Request latency (using `prometheus_client.Summary` or `Histogram`).
- Number of requests processed (using `prometheus_client.Counter`).
- Queue sizes.
- Cache hit/miss ratios.
- Error counts for specific operations.
Integrate these directly into your application code. For example, to track request duration:
from prometheus_client import Summary, Counter, Histogram
import time
# Define metrics at module level or within a class
REQUEST_LATENCY = Summary('http_request_duration_seconds', 'HTTP request duration in seconds', ['endpoint', 'method'])
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests processed', ['endpoint', 'method', 'status_code'])
# Example usage within a web framework (e.g., Flask)
# @app.route('/api/v1/users')
# def get_users():
#     start_time = time.time()
#     endpoint = '/api/v1/users'
#     method = 'GET'
#     status_code = 200
#     try:
#         # ... your application logic ...
#         time.sleep(0.5)  # Simulate work
#         # ...
#     except Exception as e:
#         status_code = 500
#         # Log error
#     finally:
#         duration = time.time() - start_time
#         REQUEST_LATENCY.labels(endpoint=endpoint, method=method).observe(duration)
#         REQUEST_COUNT.labels(endpoint=endpoint, method=method, status_code=status_code).inc()
#     return "Users data", status_code
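The timing logic in the example above can be factored into a reusable decorator so that handlers do not repeat the `try`/`finally` bookkeeping. This is a minimal sketch using only the standard library; the name `timed` is hypothetical, and `observe` can be any callable that accepts a duration in seconds.

```python
import functools
import time


def timed(observe):
    """Decorator factory: call observe(duration_seconds) after each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on exceptions, so failures are timed too
                observe(time.perf_counter() - start)
        return wrapper
    return decorator


# Stand-in sink for illustration; a metric's observe method would go here
durations = []


@timed(durations.append)
def double(value):
    return value * 2
```

With `prometheus_client`, the sink would typically be the bound method `REQUEST_LATENCY.labels(endpoint='/api/v1/users', method='GET').observe`. `time.perf_counter()` is used instead of `time.time()` because it is monotonic and unaffected by system clock adjustments.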
Log Aggregation and Analysis on OVH
Centralized logging is non-negotiable for debugging and auditing. On OVH, you can set up a robust log aggregation pipeline. A common pattern involves using Fluentd or Filebeat to collect logs from your application servers and PostgreSQL instances, forwarding them to a central store like Elasticsearch or Loki.
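Log shippers generally handle structured logs better than free-form text, so it helps if the application writes one JSON object per line. A minimal sketch with the standard `logging` module follows. The class name `JsonFormatter` and the field names (`timestamp`, `level`, `logger`, `message`) are illustrative choices, not a required schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback inside the same JSON object
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, path: str) -> logging.Logger:
    """Create a logger that appends JSON lines to the given file."""
    handler = logging.FileHandler(path)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Whatever key carries the log text (here `message`) is the one the shipper's JSON decoding settings should point at.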
Filebeat Configuration for PostgreSQL and Application Logs
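We’ll configure Filebeat to tail log files and send them to a Logstash instance or directly to Elasticsearch. Filebeat has no built-in Loki output, so Loki is usually fed through Logstash or a Loki-native agent such as Promtail.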
`filebeat.yml` (Filebeat Configuration)
filebeat.inputs:
  # PostgreSQL server logs (plain text)
  - type: log
    enabled: true
    paths:
      - /var/log/postgres/*.log
    # The health-check logs in the same directory have their own input below
    exclude_files: ['pg_status_check\.log$', 'pg_replication_check\.log$']
    fields_under_root: true
    fields:
      service: postgresql # Tag PostgreSQL logs
      environment: production

  # Application logs, assumed to be one JSON object per line
  - type: log
    enabled: true
    paths:
      - /var/log/your_app/*.log
    fields_under_root: true
    fields:
      service: my_python_app
      environment: production
    json.keys_under_root: true
    json.message_key: message # Assuming your app logs JSON with a 'message' field

  # Health-check script logs (plain text, no JSON parsing needed)
  - type: log
    enabled: true
    paths:
      - /var/log/postgres/pg_status_check.log
      - /var/log/postgres/pg_replication_check.log
    fields_under_root: true
    fields:
      service: postgresql_healthcheck
      environment: production

# Only one output may be enabled at a time.
output.elasticsearch:
  hosts: ["your-elasticsearch-host:9200"]
  # username: "elastic"
  # password: "changeme"

# Or for Logstash:
# output.logstash:
#   hosts: ["your-logstash-host:5044"]

# Filebeat has no built-in Loki output; to reach Loki, forward through Logstash
# or run a Loki-native agent such as Promtail on the host.

# If using Kafka:
# output.kafka:
#   hosts: ["your-kafka-broker:9092"]
#   topic: 'logs'
#   partition.round_robin:
#     reachable_only: false
#   required_acks: 1
#   compression: gzip
#   max_message_bytes: 1000000
# For local testing with file output:
# output.