Server Monitoring Best Practices: Keeping Your Ruby App and PostgreSQL Clusters Alive on OVH
Proactive PostgreSQL Cluster Health Checks with `pg_stat_activity` and `pg_locks`
Maintaining the health of a PostgreSQL cluster, especially in a distributed or high-availability setup on OVH, requires more than just basic CPU/memory monitoring. Understanding the internal state of your database is paramount. We’ll focus on two critical system views: `pg_stat_activity` and `pg_locks`. These views provide real-time insights into what your database is actually doing, allowing for early detection of performance bottlenecks and deadlocks.
A common pitfall is to only look at external metrics. While essential, they don’t tell the whole story. A server might have ample CPU and RAM, but if a single long-running query is blocking all other operations, your application will grind to a halt. This is where querying `pg_stat_activity` becomes indispensable.
Monitoring `pg_stat_activity` for Long-Running Queries and Idle Connections
The `pg_stat_activity` view shows one row per server process, detailing its current state. We’re particularly interested in identifying queries that have been running for an unusually long time and connections that are idle but holding resources (e.g., locks).
Here’s a robust query to identify such activities. We’ll set thresholds for “long-running” (e.g., > 5 minutes) and “idle in transaction” states. These thresholds should be tuned based on your application’s typical workload and transaction patterns.
SELECT
pid,
datname,
usename,
client_addr,
backend_start,
query_start,
state_change,
wait_event_type,
wait_event,
state,
query
FROM
pg_stat_activity
WHERE
state = 'active' AND
now() - query_start > interval '5 minutes' -- Adjust threshold as needed
ORDER BY
query_start ASC;
SELECT
pid,
datname,
usename,
client_addr,
backend_start,
state_change,
wait_event_type,
wait_event,
state,
query
FROM
pg_stat_activity
WHERE
state = 'idle in transaction' AND
now() - state_change > interval '1 minute' -- Adjust threshold as needed
ORDER BY
state_change ASC;
To automate this, we can create a simple script. This script will connect to each PostgreSQL instance in our cluster (assuming a primary/replica setup or a sharded environment) and execute these checks. Alerts can be triggered if any results are returned.
Automating PostgreSQL Health Checks with a Python Script
We’ll use Python with the `psycopg2` library for database connectivity. This script can be scheduled via cron on a dedicated monitoring server or one of the application servers. For OVH, ensure your monitoring server has network access to your PostgreSQL instances (e.g., via private network or secured public IP).
The script will iterate through a list of database connection details, execute the monitoring queries, and log any issues. For a production environment, integrate this with your alerting system (e.g., PagerDuty, Slack, Opsgenie).
import psycopg2
import psycopg2.extras
import logging
import os
from datetime import datetime, timedelta
# --- Configuration ---
LOG_FILE = '/var/log/pg_monitor.log'
ALERT_THRESHOLD_ACTIVE_MIN = 5 # Minutes for long-running active queries
ALERT_THRESHOLD_IDLE_TX_MIN = 1 # Minutes for idle in transaction
# Database connection details (replace with your actual credentials and hosts)
# For OVH, consider using private network IPs if available for better security and performance.
DB_CONNECTIONS = [
{
'host': 'pg-primary.yourdomain.com',
'port': 5432,
'database': 'your_db',
'user': 'monitor_user',
'password': 'your_monitor_password'
},
# Add other PostgreSQL instances (replicas, other clusters)
# {
# 'host': 'pg-replica-1.yourdomain.com',
# 'port': 5432,
# 'database': 'your_db',
# 'user': 'monitor_user',
# 'password': 'your_monitor_password'
# },
]
# --- Logging Setup ---
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler(LOG_FILE),
logging.StreamHandler() # Also log to console
]
)
# --- SQL Queries ---
QUERY_ACTIVE_LONG_RUNNING = f"""
SELECT
pid,
datname,
usename,
client_addr,
backend_start,
query_start,
state_change,
wait_event_type,
wait_event,
state,
query
FROM
pg_stat_activity
WHERE
state = 'active' AND
now() - query_start > interval '{ALERT_THRESHOLD_ACTIVE_MIN} minutes'
ORDER BY
query_start ASC;
"""
QUERY_IDLE_IN_TRANSACTION = f"""
SELECT
pid,
datname,
usename,
client_addr,
backend_start,
state_change,
wait_event_type,
wait_event,
state,
query
FROM
pg_stat_activity
WHERE
state = 'idle in transaction' AND
now() - state_change > interval '{ALERT_THRESHOLD_IDLE_TX_MIN} minutes'
ORDER BY
state_change ASC;
"""
def check_postgres_health(db_config):
conn = None
try:
logging.info(f"Connecting to {db_config['host']}:{db_config['port']} database '{db_config['database']}'...")
conn = psycopg2.connect(
host=db_config['host'],
port=db_config['port'],
database=db_config['database'],
user=db_config['user'],
password=db_config['password'],
connect_timeout=5 # Short timeout for connection attempts
)
cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
# Check for long-running active queries
cur.execute(QUERY_ACTIVE_LONG_RUNNING)
active_queries = cur.fetchall()
if active_queries:
logging.warning(f"ALERT: {db_config['host']} - Found {len(active_queries)} long-running active queries (>{ALERT_THRESHOLD_ACTIVE_MIN} min):")
for row in active_queries:
logging.warning(f" PID: {row['pid']}, User: {row['usename']}, Client: {row['client_addr']}, Query Start: {row['query_start']}, Wait Event: {row['wait_event_type']} - {row['wait_event']}")
logging.warning(f" Query: {row['query'][:200]}...") # Truncate for log readability
# TODO: Integrate with alerting system here (e.g., send_slack_alert(...))
# Check for idle in transaction connections
cur.execute(QUERY_IDLE_IN_TRANSACTION)
idle_tx_connections = cur.fetchall()
if idle_tx_connections:
logging.warning(f"ALERT: {db_config['host']} - Found {len(idle_tx_connections)} idle in transaction connections (>{ALERT_THRESHOLD_IDLE_TX_MIN} min):")
for row in idle_tx_connections:
logging.warning(f" PID: {row['pid']}, User: {row['usename']}, Client: {row['client_addr']}, State Change: {row['state_change']}")
logging.warning(f" Query: {row['query'][:200]}...") # Truncate for log readability
# TODO: Integrate with alerting system here
cur.close()
logging.info(f"Successfully checked {db_config['host']}:{db_config['port']}.")
except psycopg2.OperationalError as e:
logging.error(f"Could not connect to {db_config['host']}:{db_config['port']} - {e}")
# TODO: Integrate with alerting system for connection failures
except Exception as e:
logging.error(f"An unexpected error occurred for {db_config['host']}:{db_config['port']} - {e}")
# TODO: Integrate with alerting system for unexpected errors
finally:
if conn:
conn.close()
if __name__ == "__main__":
logging.info("Starting PostgreSQL health check...")
for db in DB_CONNECTIONS:
check_postgres_health(db)
logging.info("PostgreSQL health check finished.")
Important Security Note: The `monitor_user` should have minimal privileges. It only needs `SELECT` access to `pg_stat_activity`. Avoid using superuser credentials for monitoring. On OVH, ensure your firewall rules (OVH Security Firewall or OS-level iptables/ufw) allow the monitoring server to connect to the PostgreSQL ports.
Detecting and Diagnosing Deadlocks with `pg_locks`
Deadlocks are a more severe issue where two or more processes are waiting for each other to release locks, creating a standstill. `pg_stat_activity` can show processes waiting for locks, but `pg_locks` provides the detailed mapping of which process holds which lock and which process is waiting for it.
A deadlock scenario typically involves a cycle in the lock dependencies. We can query `pg_locks` to identify processes that are blocked and the locks they are waiting for. PostgreSQL’s built-in deadlock detector will usually resolve these by terminating one of the processes, but identifying the cause is crucial for application logic fixes.
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocked_activity.client_addr AS blocked_client,
blocked_activity.query AS blocked_query,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocking_activity.client_addr AS blocking_client,
blocking_activity.query AS blocking_query,
blocking_locks.locktype AS lock_type,
blocking_locks.mode AS lock_mode,
blocking_locks.granted AS lock_granted
FROM
pg_catalog.pg_locks blocked_locks
JOIN
pg_catalog.pg_stat_activity blocked_activity
ON
blocked_activity.pid = blocked_locks.pid
JOIN
pg_catalog.pg_locks blocking_locks
ON
blocking_locks.lock = blocked_locks.lock AND blocking_locks.pid != blocked_locks.pid
JOIN
pg_catalog.pg_stat_activity blocking_activity
ON
blocking_activity.pid = blocking_locks.pid
WHERE
NOT blocked_locks.granted
AND
blocking_locks.pid IS NOT NULL
AND
blocked_activity.pid != pg_backend_pid() -- Exclude self
ORDER BY
blocking_locks.pid, blocked_locks.pid;
This query identifies processes that are waiting (`NOT blocked_locks.granted`) and are being blocked by another process (`blocking_locks.pid IS NOT NULL`). It then joins back to `pg_stat_activity` to get the query details for both the blocked and blocking processes.
Integrating this into our Python script is straightforward. We’ll add a new query and conditional logging/alerting.
# ... (previous imports and configuration) ...
QUERY_DEADLOCKS = """
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocked_activity.client_addr AS blocked_client,
blocked_activity.query AS blocked_query,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocking_activity.client_addr AS blocking_client,
blocking_activity.query AS blocking_query,
blocking_locks.locktype AS lock_type,
blocking_locks.mode AS lock_mode,
blocking_locks.granted AS lock_granted
FROM
pg_catalog.pg_locks blocked_locks
JOIN
pg_catalog.pg_stat_activity blocked_activity
ON
blocked_activity.pid = blocked_locks.pid
JOIN
pg_catalog.pg_locks blocking_locks
ON
blocking_locks.lock = blocked_locks.lock AND blocking_locks.pid != blocked_locks.pid
JOIN
pg_catalog.pg_stat_activity blocking_activity
ON
blocking_activity.pid = blocking_locks.pid
WHERE
NOT blocked_locks.granted
AND
blocking_locks.pid IS NOT NULL
AND
blocked_activity.pid != pg_backend_pid() -- Exclude self
ORDER BY
blocking_locks.pid, blocked_locks.pid;
"""
def check_postgres_health(db_config):
conn = None
try:
# ... (connection and previous checks) ...
# Check for deadlocks
cur.execute(QUERY_DEADLOCKS)
deadlocks = cur.fetchall()
if deadlocks:
logging.error(f"ALERT: {db_config['host']} - Potential deadlock detected!")
for row in deadlocks:
logging.error(f" Blocked PID: {row['blocked_pid']} (User: {row['blocked_user']}, Client: {row['blocked_client']})")
logging.error(f" Blocked Query: {row['blocked_query'][:200]}...")
logging.error(f" Blocking PID: {row['blocking_pid']} (User: {row['blocking_user']}, Client: {row['blocking_client']})")
logging.error(f" Blocking Query: {row['blocking_query'][:200]}...")
logging.error(f" Lock Type: {row['lock_type']}, Mode: {row['lock_mode']}")
# TODO: Integrate with alerting system here for critical deadlock alerts
cur.close()
logging.info(f"Successfully checked {db_config['host']}:{db_config['port']}.")
# ... (exception handling and finally block) ...
# ... (if __name__ == "__main__": block) ...
Monitoring Ruby Application Performance with APM Tools
While database monitoring is critical, understanding the performance of your Ruby application itself is equally important. Application Performance Monitoring (APM) tools provide deep insights into request latency, error rates, transaction traces, and external service calls. For a Ruby on Rails application running on OVH, integrating an APM solution is a must.
Popular choices include:
- New Relic: A mature and feature-rich APM with excellent Ruby support.
- Datadog: Offers comprehensive monitoring across infrastructure and applications, with strong APM capabilities.
- AppSignal: A Ruby-focused APM known for its ease of use and detailed insights.
- Scout APM: Another Ruby-centric APM that excels at identifying performance bottlenecks in code.
The integration typically involves adding a gem to your application’s Gemfile and configuring an agent. For example, with the `newrelic_rpm` gem:
# Gemfile gem 'newrelic_rpm'
After running `bundle install`, you’ll need to create a `newrelic.yml` configuration file in your application’s root directory. This file contains your New Relic license key and application name.
# newrelic.yml common: &common license_key: YOUR_NEW_RELIC_LICENSE_KEY app_name: YourRubyApp - Production development: <<: *common monitor_mode: true test: <<: *common monitor_mode: true production: <<: *common monitor_mode: true
Ensure your Ruby application server (e.g., Puma, Unicorn) is configured to load this agent. For Puma, this often means starting it with `bundle exec puma -C config/puma.rb` where `config/puma.rb` might have:
# config/puma.rb (example snippet) require 'newrelic_rpm' # Ensure this is required early # ... other Puma configurations ...
On OVH, your application servers will need outbound network access to the APM provider's collection endpoints. For New Relic, this is typically `collector.newrelic.com` on port 443.
Infrastructure Monitoring on OVH: Beyond Basic Metrics
OVH provides basic infrastructure metrics through its control panel and API. However, for robust monitoring, you'll want to deploy agents that collect more granular data and can be integrated with a centralized monitoring system like Prometheus, Zabbix, or Datadog.
Key infrastructure metrics to monitor for your Ruby app and PostgreSQL cluster:
- CPU Usage: Per-core and overall system load. High CPU on PostgreSQL can indicate inefficient queries or insufficient resources. High CPU on app servers might point to application bottlenecks or traffic spikes.
- Memory Usage: Total RAM, swap usage, and buffer/cache. Excessive swapping on PostgreSQL is a major performance killer.
- Disk I/O: Read/write operations per second, latency, and throughput. Slow disk I/O on PostgreSQL is often a primary cause of performance issues.
- Network Traffic: Inbound/outbound bandwidth, packet loss, and latency. High network latency between app servers and PostgreSQL can degrade performance.
- Disk Space: Free space percentage. Running out of disk space can cause application and database failures.
- Process Count: Number of running processes. An unusually high number might indicate runaway processes or resource exhaustion.
For PostgreSQL, specific metrics like WAL (Write-Ahead Log) generation rate, cache hit ratio, and replication lag are also critical. These are often exposed by PostgreSQL itself or can be collected by specialized exporters (e.g., `postgres_exporter` for Prometheus).
Configuring Prometheus and `postgres_exporter`
Prometheus is a powerful open-source monitoring and alerting system. `postgres_exporter` is a popular tool that exposes PostgreSQL metrics in a format Prometheus can scrape.
1. Install `postgres_exporter`
On your monitoring server or a dedicated host that can reach your PostgreSQL instances:
# Download the latest release (adjust version as needed) wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.12.0/postgres_exporter_v0.12.0_linux_amd64.tar.gz tar xvfz postgres_exporter_v0.12.0_linux_amd64.tar.gz sudo mv postgres_exporter_v0.12.0_linux_amd64/postgres_exporter /usr/local/bin/ # Create a user for the exporter (optional but recommended) sudo useradd -rs /bin/false postgres_exporter # Create a systemd service file sudo nano /etc/systemd/system/postgres_exporter.service
[Unit] Description=PostgreSQL Exporter Wants=network-online.target After=network-online.target [Service] User=postgres_exporter Group=postgres_exporter Type=simple ExecStart=/usr/local/bin/postgres_exporter \ --web.listen-address=":9187" \ --extend.labels="environment=production" \ --extend.queries="queries.yaml" \ --disable-default-metrics \ --metrics.wal \ --metrics.replication \ --metrics.statements \ --metrics.locks \ --metrics.bgwriter \ --metrics.autovacuum \ --metrics.database \ --metrics.connection \ --metrics.host \ --metrics.index \ --metrics.table \ --metrics.tablespace \ --metrics.tables \ --metrics.user \ --metrics.version \ --metrics.query_plan \ --metrics.query_stats \ --metrics.pg_stat_statements \ --log.level=info [Install] # If you want to start it automatically on boot WantedBy=multi-user.target
You'll also need a `queries.yaml` file for custom metrics. A basic one might look like this:
# queries.yaml
# Example custom query for replication lag
- name: replication_lag
query: |
SELECT
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication
WHERE application_name = 'your_app_name_for_replication'; # Adjust if you use a specific app_name
metrics:
- lag_bytes:
usage: GAUGE
description: "Replication lag in bytes"
labels:
- application_name # If you have multiple replication slots
Note: The `postgres_exporter` needs credentials to connect to your PostgreSQL instances. You can provide these via a `.pgpass` file for the `postgres_exporter` user, or via environment variables. For security, using a dedicated read-only user with a `.pgpass` file is recommended.
# Create .pgpass for the postgres_exporter user sudo su - postgres_exporter echo "host:port:database:user:password" > ~/.pgpass chmod 0600 ~/.pgpass exit # Then, in the systemd service file, you might need to add: # Environment="PGHOST=your_pg_host" # Environment="PGPORT=5432" # Environment="PGUSER=your_monitor_user" # Environment="PGPASSWORD=your_monitor_password" # Or rely on the .pgpass file if the exporter user is running the process. # The --web.listen-address=":9187" should be accessible by your Prometheus server.
2. Configure Prometheus to Scrape `postgres_exporter`
Edit your Prometheus configuration file (`prometheus.yml`):
scrape_configs:
- job_name: 'postgres'
static_configs:
- targets: ['postgres-exporter-host:9187'] # Replace with the actual host and port
labels:
instance: 'pg-cluster-1' # Or a more descriptive label
env: 'production'
# If you have multiple PostgreSQL instances, add them here or use service discovery
# - targets: ['postgres-exporter-host-2:9187']
# labels:
# instance: 'pg-cluster-2'
# env: 'production'
Reload Prometheus configuration. You should now see your PostgreSQL metrics appearing in Prometheus. You can then build dashboards in Grafana using these metrics, and set up alerts based on critical thresholds (e.g., replication lag exceeding a certain value, high I/O wait times).
Conclusion: A Multi-Layered Approach to Reliability
Keeping your Ruby application and PostgreSQL clusters healthy on OVH requires a proactive, multi-layered monitoring strategy. This involves deep dives into database internals using `pg_stat_activity` and `pg_locks`, comprehensive application performance monitoring with APM tools, and robust infrastructure monitoring with systems like Prometheus and specialized exporters. By implementing these practices, you can move from reactive firefighting to proactive system management, ensuring the stability and performance of your critical services.