Server Monitoring Best Practices: Keeping Your Ruby App and PostgreSQL Clusters Alive on OVH

Proactive PostgreSQL Cluster Health Checks with `pg_stat_activity` and `pg_locks`

Maintaining the health of a PostgreSQL cluster, especially in a distributed or high-availability setup on OVH, requires more than just basic CPU/memory monitoring. Understanding the internal state of your database is paramount. We’ll focus on two critical system views: `pg_stat_activity` and `pg_locks`. These views provide real-time insights into what your database is actually doing, allowing for early detection of performance bottlenecks and deadlocks.

A common pitfall is to only look at external metrics. While essential, they don’t tell the whole story. A server might have ample CPU and RAM, but if a single long-running query is blocking all other operations, your application will grind to a halt. This is where querying `pg_stat_activity` becomes indispensable.

Monitoring `pg_stat_activity` for Long-Running Queries and Idle Connections

The `pg_stat_activity` view shows one row per server process, detailing its current state. We’re particularly interested in identifying queries that have been running for an unusually long time and connections that are idle but holding resources (e.g., locks).

Here’s a robust query to identify such activities. We’ll set thresholds for “long-running” (e.g., > 5 minutes) and “idle in transaction” states. These thresholds should be tuned based on your application’s typical workload and transaction patterns.

SELECT
    pid,
    datname,
    usename,
    client_addr,
    backend_start,
    query_start,
    state_change,
    wait_event_type,
    wait_event,
    state,
    query
FROM
    pg_stat_activity
WHERE
    state = 'active' AND
    now() - query_start > interval '5 minutes' -- Adjust threshold as needed
ORDER BY
    query_start ASC;

SELECT
    pid,
    datname,
    usename,
    client_addr,
    backend_start,
    state_change,
    wait_event_type,
    wait_event,
    state,
    query
FROM
    pg_stat_activity
WHERE
    state = 'idle in transaction' AND
    now() - state_change > interval '1 minute' -- Adjust threshold as needed
ORDER BY
    state_change ASC;

To automate this, we can create a simple script. This script will connect to each PostgreSQL instance in our cluster (assuming a primary/replica setup or a sharded environment) and execute these checks. Alerts can be triggered if any results are returned.

Automating PostgreSQL Health Checks with a Python Script

We’ll use Python with the `psycopg2` library for database connectivity. This script can be scheduled via cron on a dedicated monitoring server or one of the application servers. For OVH, ensure your monitoring server has network access to your PostgreSQL instances (e.g., via private network or secured public IP).

The script will iterate through a list of database connection details, execute the monitoring queries, and log any issues. For a production environment, integrate this with your alerting system (e.g., PagerDuty, Slack, Opsgenie).

import psycopg2
import psycopg2.extras
import logging
import os
from datetime import datetime, timedelta

# --- Configuration ---
LOG_FILE = '/var/log/pg_monitor.log'
ALERT_THRESHOLD_ACTIVE_MIN = 5  # Minutes for long-running active queries
ALERT_THRESHOLD_IDLE_TX_MIN = 1  # Minutes for idle in transaction

# Database connection details (replace with your actual credentials and hosts)
# For OVH, consider using private network IPs if available for better security and performance.
DB_CONNECTIONS = [
    {
        'host': 'pg-primary.yourdomain.com',
        'port': 5432,
        'database': 'your_db',
        'user': 'monitor_user',
        'password': 'your_monitor_password'
    },
    # Add other PostgreSQL instances (replicas, other clusters)
    # {
    #     'host': 'pg-replica-1.yourdomain.com',
    #     'port': 5432,
    #     'database': 'your_db',
    #     'user': 'monitor_user',
    #     'password': 'your_monitor_password'
    # },
]

# --- Logging Setup ---
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(LOG_FILE),
        logging.StreamHandler() # Also log to console
    ]
)

# --- SQL Queries ---
QUERY_ACTIVE_LONG_RUNNING = f"""
SELECT
    pid,
    datname,
    usename,
    client_addr,
    backend_start,
    query_start,
    state_change,
    wait_event_type,
    wait_event,
    state,
    query
FROM
    pg_stat_activity
WHERE
    state = 'active' AND
    now() - query_start > interval '{ALERT_THRESHOLD_ACTIVE_MIN} minutes'
ORDER BY
    query_start ASC;
"""

QUERY_IDLE_IN_TRANSACTION = f"""
SELECT
    pid,
    datname,
    usename,
    client_addr,
    backend_start,
    state_change,
    wait_event_type,
    wait_event,
    state,
    query
FROM
    pg_stat_activity
WHERE
    state = 'idle in transaction' AND
    now() - state_change > interval '{ALERT_THRESHOLD_IDLE_TX_MIN} minutes'
ORDER BY
    state_change ASC;
"""

def check_postgres_health(db_config):
    conn = None
    try:
        logging.info(f"Connecting to {db_config['host']}:{db_config['port']} database '{db_config['database']}'...")
        conn = psycopg2.connect(
            host=db_config['host'],
            port=db_config['port'],
            database=db_config['database'],
            user=db_config['user'],
            password=db_config['password'],
            connect_timeout=5 # Short timeout for connection attempts
        )
        cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)

        # Check for long-running active queries
        cur.execute(QUERY_ACTIVE_LONG_RUNNING)
        active_queries = cur.fetchall()
        if active_queries:
            logging.warning(f"ALERT: {db_config['host']} - Found {len(active_queries)} long-running active queries (>{ALERT_THRESHOLD_ACTIVE_MIN} min):")
            for row in active_queries:
                logging.warning(f"  PID: {row['pid']}, User: {row['usename']}, Client: {row['client_addr']}, Query Start: {row['query_start']}, Wait Event: {row['wait_event_type']} - {row['wait_event']}")
                logging.warning(f"  Query: {row['query'][:200]}...") # Truncate for log readability
                # TODO: Integrate with alerting system here (e.g., send_slack_alert(...))

        # Check for idle in transaction connections
        cur.execute(QUERY_IDLE_IN_TRANSACTION)
        idle_tx_connections = cur.fetchall()
        if idle_tx_connections:
            logging.warning(f"ALERT: {db_config['host']} - Found {len(idle_tx_connections)} idle in transaction connections (>{ALERT_THRESHOLD_IDLE_TX_MIN} min):")
            for row in idle_tx_connections:
                logging.warning(f"  PID: {row['pid']}, User: {row['usename']}, Client: {row['client_addr']}, State Change: {row['state_change']}")
                logging.warning(f"  Query: {row['query'][:200]}...") # Truncate for log readability
                # TODO: Integrate with alerting system here

        cur.close()
        logging.info(f"Successfully checked {db_config['host']}:{db_config['port']}.")

    except psycopg2.OperationalError as e:
        logging.error(f"Could not connect to {db_config['host']}:{db_config['port']} - {e}")
        # TODO: Integrate with alerting system for connection failures
    except Exception as e:
        logging.error(f"An unexpected error occurred for {db_config['host']}:{db_config['port']} - {e}")
        # TODO: Integrate with alerting system for unexpected errors
    finally:
        if conn:
            conn.close()

if __name__ == "__main__":
    logging.info("Starting PostgreSQL health check...")
    for db in DB_CONNECTIONS:
        check_postgres_health(db)
    logging.info("PostgreSQL health check finished.")

Important Security Note: The `monitor_user` should have minimal privileges. It only needs `SELECT` access to `pg_stat_activity`. Avoid using superuser credentials for monitoring. On OVH, ensure your firewall rules (OVH Security Firewall or OS-level iptables/ufw) allow the monitoring server to connect to the PostgreSQL ports.

Detecting and Diagnosing Deadlocks with `pg_locks`

Deadlocks are a more severe issue where two or more processes are waiting for each other to release locks, creating a standstill. `pg_stat_activity` can show processes waiting for locks, but `pg_locks` provides the detailed mapping of which process holds which lock and which process is waiting for it.

A deadlock scenario typically involves a cycle in the lock dependencies. We can query `pg_locks` to identify processes that are blocked and the locks they are waiting for. PostgreSQL’s built-in deadlock detector will usually resolve these by terminating one of the processes, but identifying the cause is crucial for application logic fixes.

SELECT
    blocked_locks.pid     AS blocked_pid,
    blocked_activity.usename  AS blocked_user,
    blocked_activity.client_addr AS blocked_client,
    blocked_activity.query    AS blocked_query,
    blocking_locks.pid     AS blocking_pid,
    blocking_activity.usename AS blocking_user,
    blocking_activity.client_addr AS blocking_client,
    blocking_activity.query   AS blocking_query,
    blocking_locks.locktype   AS lock_type,
    blocking_locks.mode       AS lock_mode,
    blocking_locks.granted    AS lock_granted
FROM
    pg_catalog.pg_locks blocked_locks
JOIN
    pg_catalog.pg_stat_activity blocked_activity
ON
    blocked_activity.pid = blocked_locks.pid
JOIN
    pg_catalog.pg_locks blocking_locks
ON
    blocking_locks.lock = blocked_locks.lock AND blocking_locks.pid != blocked_locks.pid
JOIN
    pg_catalog.pg_stat_activity blocking_activity
ON
    blocking_activity.pid = blocking_locks.pid
WHERE
    NOT blocked_locks.granted
AND
    blocking_locks.pid IS NOT NULL
AND
    blocked_activity.pid != pg_backend_pid() -- Exclude self
ORDER BY
    blocking_locks.pid, blocked_locks.pid;

This query identifies processes that are waiting (`NOT blocked_locks.granted`) and are being blocked by another process (`blocking_locks.pid IS NOT NULL`). It then joins back to `pg_stat_activity` to get the query details for both the blocked and blocking processes.

Integrating this into our Python script is straightforward. We’ll add a new query and conditional logging/alerting.

# ... (previous imports and configuration) ...

QUERY_DEADLOCKS = """
SELECT
    blocked_locks.pid     AS blocked_pid,
    blocked_activity.usename  AS blocked_user,
    blocked_activity.client_addr AS blocked_client,
    blocked_activity.query    AS blocked_query,
    blocking_locks.pid     AS blocking_pid,
    blocking_activity.usename AS blocking_user,
    blocking_activity.client_addr AS blocking_client,
    blocking_activity.query   AS blocking_query,
    blocking_locks.locktype   AS lock_type,
    blocking_locks.mode       AS lock_mode,
    blocking_locks.granted    AS lock_granted
FROM
    pg_catalog.pg_locks blocked_locks
JOIN
    pg_catalog.pg_stat_activity blocked_activity
ON
    blocked_activity.pid = blocked_locks.pid
JOIN
    pg_catalog.pg_locks blocking_locks
ON
    blocking_locks.lock = blocked_locks.lock AND blocking_locks.pid != blocked_locks.pid
JOIN
    pg_catalog.pg_stat_activity blocking_activity
ON
    blocking_activity.pid = blocking_locks.pid
WHERE
    NOT blocked_locks.granted
AND
    blocking_locks.pid IS NOT NULL
AND
    blocked_activity.pid != pg_backend_pid() -- Exclude self
ORDER BY
    blocking_locks.pid, blocked_locks.pid;
"""

def check_postgres_health(db_config):
    conn = None
    try:
        # ... (connection and previous checks) ...

        # Check for deadlocks
        cur.execute(QUERY_DEADLOCKS)
        deadlocks = cur.fetchall()
        if deadlocks:
            logging.error(f"ALERT: {db_config['host']} - Potential deadlock detected!")
            for row in deadlocks:
                logging.error(f"  Blocked PID: {row['blocked_pid']} (User: {row['blocked_user']}, Client: {row['blocked_client']})")
                logging.error(f"    Blocked Query: {row['blocked_query'][:200]}...")
                logging.error(f"  Blocking PID: {row['blocking_pid']} (User: {row['blocking_user']}, Client: {row['blocking_client']})")
                logging.error(f"    Blocking Query: {row['blocking_query'][:200]}...")
                logging.error(f"  Lock Type: {row['lock_type']}, Mode: {row['lock_mode']}")
                # TODO: Integrate with alerting system here for critical deadlock alerts

        cur.close()
        logging.info(f"Successfully checked {db_config['host']}:{db_config['port']}.")

    # ... (exception handling and finally block) ...

# ... (if __name__ == "__main__": block) ...

Monitoring Ruby Application Performance with APM Tools

While database monitoring is critical, understanding the performance of your Ruby application itself is equally important. Application Performance Monitoring (APM) tools provide deep insights into request latency, error rates, transaction traces, and external service calls. For a Ruby on Rails application running on OVH, integrating an APM solution is a must.

Popular choices include:

New Relic: A mature and feature-rich APM with excellent Ruby support.
Datadog: Offers comprehensive monitoring across infrastructure and applications, with strong APM capabilities.
AppSignal: A Ruby-focused APM known for its ease of use and detailed insights.
Scout APM: Another Ruby-centric APM that excels at identifying performance bottlenecks in code.

The integration typically involves adding a gem to your application’s Gemfile and configuring an agent. For example, with the `newrelic_rpm` gem:

# Gemfile
gem 'newrelic_rpm'

After running `bundle install`, you’ll need to create a `newrelic.yml` configuration file in your application’s root directory. This file contains your New Relic license key and application name.

# newrelic.yml
common: &common
  license_key: YOUR_NEW_RELIC_LICENSE_KEY
  app_name: YourRubyApp - Production

development:
  <<: *common
  monitor_mode: true

test:
  <<: *common
  monitor_mode: true

production:
  <<: *common
  monitor_mode: true

Ensure your Ruby application server (e.g., Puma, Unicorn) is configured to load this agent. For Puma, this often means starting it with `bundle exec puma -C config/puma.rb` where `config/puma.rb` might have:

# config/puma.rb (example snippet)
require 'newrelic_rpm' # Ensure this is required early

# ... other Puma configurations ...

On OVH, your application servers will need outbound network access to the APM provider's collection endpoints. For New Relic, this is typically `collector.newrelic.com` on port 443.

Infrastructure Monitoring on OVH: Beyond Basic Metrics

OVH provides basic infrastructure metrics through its control panel and API. However, for robust monitoring, you'll want to deploy agents that collect more granular data and can be integrated with a centralized monitoring system like Prometheus, Zabbix, or Datadog.

Key infrastructure metrics to monitor for your Ruby app and PostgreSQL cluster:

CPU Usage: Per-core and overall system load. High CPU on PostgreSQL can indicate inefficient queries or insufficient resources. High CPU on app servers might point to application bottlenecks or traffic spikes.
Memory Usage: Total RAM, swap usage, and buffer/cache. Excessive swapping on PostgreSQL is a major performance killer.
Disk I/O: Read/write operations per second, latency, and throughput. Slow disk I/O on PostgreSQL is often a primary cause of performance issues.
Network Traffic: Inbound/outbound bandwidth, packet loss, and latency. High network latency between app servers and PostgreSQL can degrade performance.
Disk Space: Free space percentage. Running out of disk space can cause application and database failures.
Process Count: Number of running processes. An unusually high number might indicate runaway processes or resource exhaustion.

For PostgreSQL, specific metrics like WAL (Write-Ahead Log) generation rate, cache hit ratio, and replication lag are also critical. These are often exposed by PostgreSQL itself or can be collected by specialized exporters (e.g., `postgres_exporter` for Prometheus).

Configuring Prometheus and `postgres_exporter`

Prometheus is a powerful open-source monitoring and alerting system. `postgres_exporter` is a popular tool that exposes PostgreSQL metrics in a format Prometheus can scrape.

1. Install `postgres_exporter`

On your monitoring server or a dedicated host that can reach your PostgreSQL instances:

# Download the latest release (adjust version as needed)
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.12.0/postgres_exporter_v0.12.0_linux_amd64.tar.gz
tar xvfz postgres_exporter_v0.12.0_linux_amd64.tar.gz
sudo mv postgres_exporter_v0.12.0_linux_amd64/postgres_exporter /usr/local/bin/

# Create a user for the exporter (optional but recommended)
sudo useradd -rs /bin/false postgres_exporter

# Create a systemd service file
sudo nano /etc/systemd/system/postgres_exporter.service

[Unit]
Description=PostgreSQL Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=postgres_exporter
Group=postgres_exporter
Type=simple
ExecStart=/usr/local/bin/postgres_exporter \
  --web.listen-address=":9187" \
  --extend.labels="environment=production" \
  --extend.queries="queries.yaml" \
  --disable-default-metrics \
  --metrics.wal \
  --metrics.replication \
  --metrics.statements \
  --metrics.locks \
  --metrics.bgwriter \
  --metrics.autovacuum \
  --metrics.database \
  --metrics.connection \
  --metrics.host \
  --metrics.index \
  --metrics.table \
  --metrics.tablespace \
  --metrics.tables \
  --metrics.user \
  --metrics.version \
  --metrics.query_plan \
  --metrics.query_stats \
  --metrics.pg_stat_statements \
  --log.level=info

[Install]
# If you want to start it automatically on boot
WantedBy=multi-user.target

You'll also need a `queries.yaml` file for custom metrics. A basic one might look like this:

# queries.yaml
# Example custom query for replication lag
- name: replication_lag
  query: |
    SELECT
      pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
    FROM pg_stat_replication
    WHERE application_name = 'your_app_name_for_replication'; # Adjust if you use a specific app_name
  metrics:
    - lag_bytes:
        usage: GAUGE
        description: "Replication lag in bytes"
        labels:
          - application_name # If you have multiple replication slots

Note: The `postgres_exporter` needs credentials to connect to your PostgreSQL instances. You can provide these via a `.pgpass` file for the `postgres_exporter` user, or via environment variables. For security, using a dedicated read-only user with a `.pgpass` file is recommended.

# Create .pgpass for the postgres_exporter user
sudo su - postgres_exporter
echo "host:port:database:user:password" > ~/.pgpass
chmod 0600 ~/.pgpass
exit

# Then, in the systemd service file, you might need to add:
# Environment="PGHOST=your_pg_host"
# Environment="PGPORT=5432"
# Environment="PGUSER=your_monitor_user"
# Environment="PGPASSWORD=your_monitor_password"
# Or rely on the .pgpass file if the exporter user is running the process.
# The --web.listen-address=":9187" should be accessible by your Prometheus server.

2. Configure Prometheus to Scrape `postgres_exporter`

Edit your Prometheus configuration file (`prometheus.yml`):

scrape_configs:
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter-host:9187'] # Replace with the actual host and port
        labels:
          instance: 'pg-cluster-1' # Or a more descriptive label
          env: 'production'
    # If you have multiple PostgreSQL instances, add them here or use service discovery
    # - targets: ['postgres-exporter-host-2:9187']
    #   labels:
    #     instance: 'pg-cluster-2'
    #     env: 'production'

Reload Prometheus configuration. You should now see your PostgreSQL metrics appearing in Prometheus. You can then build dashboards in Grafana using these metrics, and set up alerts based on critical thresholds (e.g., replication lag exceeding a certain value, high I/O wait times).

Conclusion: A Multi-Layered Approach to Reliability

Keeping your Ruby application and PostgreSQL clusters healthy on OVH requires a proactive, multi-layered monitoring strategy. This involves deep dives into database internals using `pg_stat_activity` and `pg_locks`, comprehensive application performance monitoring with APM tools, and robust infrastructure monitoring with systems like Prometheus and specialized exporters. By implementing these practices, you can move from reactive firefighting to proactive system management, ensuring the stability and performance of your critical services.

Server Monitoring Best Practices: Keeping Your Ruby App and PostgreSQL Clusters Alive on OVH

Proactive PostgreSQL Cluster Health Checks with `pg_stat_activity` and `pg_locks`

Monitoring `pg_stat_activity` for Long-Running Queries and Idle Connections

Automating PostgreSQL Health Checks with a Python Script

Detecting and Diagnosing Deadlocks with `pg_locks`

Monitoring Ruby Application Performance with APM Tools

Infrastructure Monitoring on OVH: Beyond Basic Metrics

Configuring Prometheus and `postgres_exporter`

Conclusion: A Multi-Layered Approach to Reliability

Recent Posts

Top Categories

Our Products

Our Services