Server Monitoring Best Practices: Keeping Your Perl App and PostgreSQL Clusters Alive on OVH

Proactive PostgreSQL Cluster Health Checks with Perl Scripts

Maintaining the health of a PostgreSQL cluster, especially in a production environment hosted on OVH, demands more than just reactive alerts. We need proactive, granular checks that can identify potential issues before they impact application availability. For our Perl-driven applications, this often means integrating monitoring directly into our deployment and operational workflows. This section details a robust Perl script designed to perform deep health checks on PostgreSQL instances, focusing on critical metrics that indicate performance degradation or impending failure.

The script leverages the DBI and DBD::Pg modules to connect to PostgreSQL and execute a series of diagnostic queries. Key areas of focus include connection pooling, replication lag, disk space utilization, query performance, and deadlocks.

Core Health Check Script (`pg_health_check.pl`)

This script should be deployed on a dedicated monitoring server or a bastion host with network access to your PostgreSQL cluster. It’s designed to be run via cron at regular intervals (e.g., every 5 minutes).

#!/usr/bin/perl

use strict;
use warnings;
use DBI;
use Time::HiRes qw(time);

# --- Configuration ---
my @db_configs = (
    {
        name     => 'pg_primary_cluster',
        host     => 'your_primary_pg_host.ovh.com',
        port     => 5432,
        database => 'postgres',
        user     => 'monitor_user',
        password => 'monitor_password',
        is_replica => 0,
    },
    {
        name     => 'pg_replica_1',
        host     => 'your_replica_1_pg_host.ovh.com',
        port     => 5432,
        database => 'postgres',
        user     => 'monitor_user',
        password => 'monitor_password',
        is_replica => 1,
        master_host => 'your_primary_pg_host.ovh.com', # For replication lag check
    },
    # Add more replica configurations as needed
);

my $max_replication_lag_seconds = 60; # Alert if replication lag exceeds 60 seconds
my $max_idle_in_transaction_seconds = 300; # Alert if idle in transaction for > 5 minutes
my $max_long_running_query_seconds = 600; # Alert if query runs for > 10 minutes
my $disk_space_threshold_percent = 85; # Alert if disk usage > 85%
my $max_connections_threshold_percent = 90; # Alert if connections > 90% of max_connections

# --- Logging ---
sub log_message {
    my ($level, $message) = @_;
    my $timestamp = localtime();
    print "[$timestamp] [$level] $message\n";
}

# --- Database Connection ---
sub connect_db {
    my ($config) = @_;
    my $dsn = "DBI:Pg:host=$config->{host};port=$config->{port};dbname=$config->{database}";
    my $dbh;
    eval {
        $dbh = DBI->connect($dsn, $config->{user}, $config->{password}, {
            RaiseError => 1,
            PrintError => 0,
            AutoCommit => 1,
        });
    };
    if ($@) {
        log_message('ERROR', "Failed to connect to database '$config->{name}' on $config->{host}:$config->{port} - $@");
        return undef;
    }
    log_message('INFO', "Successfully connected to database '$config->{name}' on $config->{host}:$config->{port}");
    return $dbh;
}

# --- Health Checks ---

# Check 1: Basic Connectivity and Version
sub check_connectivity {
    my ($dbh, $config) = @_;
    my $start_time = time();
    my $version = $dbh->selectrow_array('SELECT version()');
    my $duration = time() - $start_time;
    log_message('INFO', "Connectivity check for '$config->{name}' successful. Version: $version. Query duration: ${duration}s");
    return 1;
}

# Check 2: Replication Lag (for replicas)
sub check_replication_lag {
    my ($dbh, $config) = @_;
    unless ($config->{is_replica}) {
        log_message('DEBUG', "Skipping replication lag check for non-replica '$config->{name}'.");
        return 1;
    }

    my $start_time = time();
    my $lag_sql = q{
        SELECT
            pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
        FROM
            pg_stat_replication
        WHERE
            application_name = (SELECT application_name FROM pg_stat_replication LIMIT 1); -- Assuming one primary app name
    };
    # Alternative for older versions or different setups:
    # SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication WHERE pid = (SELECT pid FROM pg_stat_replication LIMIT 1);

    my $lag_bytes = $dbh->selectrow_array($lag_sql);
    my $duration = time() - $start_time;

    if (defined $lag_bytes) {
        my $lag_seconds = $lag_bytes / (1024 * 1024); # Approximate conversion, actual WAL rate varies
        log_message('INFO', "Replication lag for '$config->{name}': ${lag_seconds}s. Query duration: ${duration}s");
        if ($lag_seconds > $max_replication_lag_seconds) {
            log_message('WARN', "Replication lag EXCEEDED threshold for '$config->{name}'. Lag: ${lag_seconds}s (Threshold: ${max_replication_lag_seconds}s)");
            return 0; # Indicate failure
        }
        return 1;
    } else {
        log_message('WARN', "Could not determine replication lag for '$config->{name}'. Is replication active?");
        return 0; # Indicate potential issue
    }
}

# Check 3: Disk Space Usage
sub check_disk_space {
    my ($dbh, $config) = @_;
    my $start_time = time();
    # This query is OS-dependent and might require specific permissions or extensions.
    # A more portable approach is to run 'df -h' on the server itself via SSH.
    # For simplicity here, we'll assume a common query that works on many Linux systems.
    my $disk_sql = q{
        SELECT
            (pg_catalog.pg_disk_usage($$||'/'||$$)::numeric * 100) AS percent_used
        FROM
            pg_catalog.pg_settings
        WHERE
            name = 'data_directory';
    };
    # A more robust method would be to query pg_tablespace and check each tablespace's disk usage.
    # For a quick check, we'll use the data directory.

    my $percent_used = $dbh->selectrow_array($disk_sql);
    my $duration = time() - $start_time;

    if (defined $percent_used) {
        log_message('INFO', "Disk usage for '$config->{name}' data directory: ${percent_used}%. Query duration: ${duration}s");
        if ($percent_used > $disk_space_threshold_percent) {
            log_message('WARN', "Disk space usage EXCEEDED threshold for '$config->{name}'. Usage: ${percent_used}% (Threshold: ${disk_space_threshold_percent}%)");
            return 0;
        }
        return 1;
    } else {
        log_message('WARN', "Could not determine disk space usage for '$config->{name}'.");
        return 0;
    }
}

# Check 4: Connection Usage
sub check_connections {
    my ($dbh, $config) = @_;
    my $start_time = time();
    my $max_connections_sql = "SHOW max_connections;";
    my $current_connections_sql = "SELECT count(*) FROM pg_stat_activity;";

    my $max_connections = $dbh->selectrow_array($max_connections_sql);
    my $current_connections = $dbh->selectrow_array($current_connections_sql);
    my $duration = time() - $start_time;

    if (defined $max_connections && defined $current_connections) {
        my $connection_percent = ($current_connections / $max_connections) * 100;
        log_message('INFO', "Connection usage for '$config->{name}': ${current_connections}/${max_connections} (${connection_percent}%). Query duration: ${duration}s");
        if ($connection_percent > $max_connections_threshold_percent) {
            log_message('WARN', "Connection usage EXCEEDED threshold for '$config->{name}'. Usage: ${connection_percent}% (Threshold: ${max_connections_threshold_percent}%)");
            return 0;
        }
        return 1;
    } else {
        log_message('WARN', "Could not determine connection usage for '$config->{name}'.");
        return 0;
    }
}

# Check 5: Long Running Transactions (Idle in Transaction)
sub check_idle_in_transaction {
    my ($dbh, $config) = @_;
    my $start_time = time();
    my $idle_sql = q{
        SELECT
            pid,
            datname,
            usename,
            state,
            now() - xact_start AS duration
        FROM
            pg_stat_activity
        WHERE
            state = 'idle in transaction'
            AND now() - xact_start > interval '$max_idle_in_transaction_seconds seconds';
    };

    my $sth = $dbh->prepare($idle_sql);
    $sth->execute();
    my @idle_transactions;
    while (my $row = $sth->fetchrow_hashref) {
        push @idle_transactions, $row;
    }
    my $duration = time() - $start_time;

    if (@idle_transactions) {
        log_message('WARN', "Found " . scalar(@idle_transactions) . " idle transactions exceeding ${max_idle_in_transaction_seconds}s for '$config->{name}'. PIDs: " . join(', ', map { $_->{pid} } @idle_transactions) . ". Query duration: ${duration}s");
        return 0;
    } else {
        log_message('INFO', "No long idle transactions found for '$config->{name}'. Query duration: ${duration}s");
        return 1;
    }
}

# Check 6: Long Running Queries
sub check_long_running_queries {
    my ($dbh, $config) = @_;
    my $start_time = time();
    my $query_sql = q{
        SELECT
            pid,
            datname,
            usename,
            state,
            now() - query_start AS duration,
            query
        FROM
            pg_stat_activity
        WHERE
            state = 'active'
            AND now() - query_start > interval '$max_long_running_query_seconds seconds';
    };

    my $sth = $dbh->prepare($query_sql);
    $sth->execute();
    my @long_queries;
    while (my $row = $sth->fetchrow_hashref) {
        push @long_queries, $row;
    }
    my $duration = time() - $start_time;

    if (@long_queries) {
        log_message('WARN', "Found " . scalar(@long_queries) . " long running queries exceeding ${max_long_running_query_seconds}s for '$config->{name}'. PIDs: " . join(', ', map { $_->{pid} } @long_queries) . ". Query duration: ${duration}s");
        # Optionally log the queries themselves for deeper inspection
        # foreach my $q (@long_queries) {
        #     log_message('DEBUG', "Long Query PID: $q->{pid}, Duration: $q->{duration}, Query: $q->{query}");
        # }
        return 0;
    } else {
        log_message('INFO', "No long running queries found for '$config->{name}'. Query duration: ${duration}s");
        return 1;
    }
}

# --- Main Execution ---
my $overall_status = 1;

foreach my $db_config (@db_configs) {
    log_message('INFO', "Starting checks for database: '$db_config->{name}' on $db_config->{host}:$db_config->{port}");
    my $dbh = connect_db($db_config);

    if ($dbh) {
        my $status = 1;
        $status &= check_connectivity($dbh, $db_config);
        $status &= check_replication_lag($dbh, $db_config);
        $status &= check_disk_space($dbh, $db_config);
        $status &= check_connections($dbh, $db_config);
        $status &= check_idle_in_transaction($dbh, $db_config);
        $status &= check_long_running_queries($dbh, $db_config);

        $dbh->disconnect();

        if ($status) {
            log_message('INFO', "All checks passed for '$db_config->{name}'.");
        } else {
            log_message('ERROR', "One or more checks FAILED for '$db_config->{name}'.");
            $overall_status = 0;
        }
    } else {
        log_message('ERROR', "Failed to connect to '$db_config->{name}'. Skipping checks.");
        $overall_status = 0;
    }
    log_message('INFO', "Finished checks for database: '$db_config->{name}'.\n");
}

exit($overall_status);

Configuration and Deployment

Before running, ensure the following:

Install necessary Perl modules: cpan DBI DBD::Pg Time::HiRes.
Create a dedicated PostgreSQL user (e.g., monitor_user) with minimal privileges. It needs CONNECT and SELECT on pg_stat_activity, pg_settings, and potentially pg_stat_replication. For disk space checks, it might need more depending on the exact query used.
Update the @db_configs array with your actual PostgreSQL hostnames, ports, database names, usernames, and passwords.
Securely store credentials. Avoid hardcoding passwords in production. Consider using environment variables or a secrets management system.
Schedule the script using cron. For example, to run every 5 minutes:

*/5 * * * * /usr/bin/perl /path/to/your/scripts/pg_health_check.pl >> /var/log/pg_health_check.log 2>&1

The output is logged to /var/log/pg_health_check.log. This log should then be ingested by your central logging system (e.g., ELK stack, Splunk, Graylog) for aggregation and alerting.

Integrating with Alerting Systems

The script exits with a status code of 0 on success and 1 on failure. This allows standard monitoring tools (like Nagios, Zabbix, Prometheus Alertmanager via an exporter) to interpret the script’s outcome. For instance, if using Prometheus, you could use the node_exporter‘s textfile_collector to expose the script’s exit code as a metric.

Alternatively, you can modify the script to send alerts directly to a webhook endpoint (e.g., Slack, PagerDuty) using modules like LWP::UserAgent or HTTP::Request when a check fails. This requires careful management to avoid alert storms.

OVH Specific Considerations for PostgreSQL Monitoring

When operating PostgreSQL clusters on OVH, several platform-specific factors influence your monitoring strategy:

Network Latency and Bandwidth

OVH’s network infrastructure, while generally robust, can experience fluctuations. High latency between your application servers and PostgreSQL instances, or between replicas and the primary, can manifest as increased replication lag or slower query responses. Your monitoring should account for this:

Monitor Network Metrics: Use tools like ping, mtr, or dedicated network monitoring agents to track latency and packet loss between critical components. Integrate these metrics into your central monitoring dashboard.
Replication Lag Thresholds: Adjust the $max_replication_lag_seconds in the Perl script based on observed network conditions. If your application can tolerate a few seconds of lag, set the threshold accordingly. For synchronous replication, even minor network hiccups can be critical.
OVH Network Performance Tools: Familiarize yourself with any network performance monitoring tools provided by OVH within your control panel.

Disk I/O and Storage Performance

OVH offers various storage options (e.g., SSD, NVMe). The performance characteristics of these disks directly impact PostgreSQL. Monitoring disk I/O is crucial:

I/O Monitoring Tools: On your PostgreSQL servers, use tools like iostat (part of sysstat) to monitor IOPS, throughput, and wait times.

# Example iostat command to monitor disk activity every 5 seconds
iostat -dx 5

Look for consistently high %util, high await times, and low throughput relative to the disk’s advertised capabilities. These indicate I/O bottlenecks.

Disk Space Check Granularity: The provided Perl script checks the data directory. For more advanced monitoring, consider:

Checking space for each tablespace individually.
Monitoring WAL (Write-Ahead Log) directory size, as excessive WAL buildup can indicate replication issues or slow checkpoints.

OVH Managed Databases vs. Self-Managed Instances

If you are using OVH’s Managed Databases for PostgreSQL, the level of access you have for direct monitoring might be restricted. OVH typically provides its own monitoring dashboards and alerting mechanisms for managed services. In such cases:

Leverage OVH Dashboards: Thoroughly explore the OVH control panel for PostgreSQL metrics. Understand what they offer regarding CPU, RAM, disk, connections, and replication status.
API Integration: Check if OVH provides an API to access monitoring data for managed services. This allows you to pull metrics into your central monitoring system.
Focus on Application-Level Metrics: If direct database access is limited, focus on monitoring the performance of your Perl application. Slowdowns in the application can often be traced back to database issues, even if you can’t directly query the database for the root cause.
Self-Managed: If you are running PostgreSQL on OVH Public Cloud instances (e.g., VMs), you have full control and the Perl script approach described earlier is highly applicable.

Security and Access Control

Accessing your PostgreSQL instances from external monitoring systems requires careful security considerations:

Firewall Rules: Ensure that your OVH firewall rules (Security Groups, Network ACLs) permit traffic from your monitoring server’s IP address to the PostgreSQL port (default 5432) on your database servers.
Dedicated Monitoring User: As mentioned, use a dedicated PostgreSQL user with the principle of least privilege. This user should only have the necessary permissions for monitoring queries.
SSL/TLS Encryption: Configure your PostgreSQL instances to require SSL connections and ensure your monitoring script uses SSL when connecting. This encrypts the credentials and query traffic over the network.

[postgresql]
# In postgresql.conf
ssl = on
ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem' # Replace with your actual certificate path
ssl_key_file = '/etc/ssl/private/ssl-cert-snakeoil.key' # Replace with your actual key path
# In pg_hba.conf
hostssl    all             monitor_user      your_monitoring_server_ip/32   md5

Update your Perl script’s connection string to enforce SSL:

# In connect_db subroutine
my $dsn = "DBI:Pg:host=$config->{host};port=$config->{port};dbname=$config->{database};sslmode=require";

Advanced Monitoring: pgBadger and Query Analysis

While the Perl script provides essential real-time health checks, a comprehensive monitoring strategy also includes analyzing historical performance data. pgBadger is an invaluable tool for this purpose.

Setting up pgBadger

pgBadger analyzes PostgreSQL log files to generate detailed HTML reports on performance, errors, and query patterns. To use it effectively:

Configure Logging: Ensure your PostgreSQL instances are logging queries, especially slow ones. Set log_statement = 'all' or log_min_duration_statement to a reasonable value (e.g., 500ms or 1s) in postgresql.conf. Also, enable log_line_prefix for better context.

# In postgresql.conf
log_destination = 'stderr' # Or 'csvlog'
logging_collector = on
log_directory = 'log' # Relative to data directory
log_filename = 'postgresql-%Y-%m-%d_%H-%M-%S.log'
log_min_duration_statement = 500 # Log statements longer than 500ms
log_statement = 'none' # Or 'ddl', 'mod', 'all' - use 'all' cautiously in production
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ' # Example prefix

Remember to reload PostgreSQL configuration after changes: SELECT pg_reload_conf(); or restart the service.

Install pgBadger:

# On Debian/Ubuntu
sudo apt update && sudo apt install pgbadger

# Or from source: https://github.com/dalibo/pgbadger

Run pgBadger:

# Assuming logs are in /var/lib/postgresql/14/main/log/
# Rotate logs first if not using logrotate
# Example: Concatenate logs for a specific day
cat /var/lib/postgresql/14/main/log/postgresql-*.log > /tmp/postgres_all.log

# Run pgbadger
pgbadger /tmp/postgres_all.log -o /var/www/html/pg_report.html --format html --UTF-8

Schedule this command using cron to run daily or weekly. The generated HTML report provides insights into:

Top N slowest queries
Most frequent queries
Queries with the highest total execution time
Connection/disconnection patterns
Temporary file usage
Index usage and bloat
Error analysis

Integrating pgBadger Reports with Monitoring

While pgBadger provides static reports, you can extract key metrics for real-time dashboards:

Custom Scripts: Parse the pgBadger output (or its JSON output if configured) to extract specific metrics (e.g., number of slow queries, top query execution time) and expose them via a custom exporter for Prometheus or push them to your logging system.
Alerting on Trends: Set up alerts based on trends identified by pgBadger, such as a sudden increase in slow queries or index bloat. This often involves comparing current report metrics against historical baselines.

Application-Level Monitoring for Perl Apps

Database monitoring is only half the battle. Understanding how your Perl application interacts with PostgreSQL and its own performance is equally critical. This involves instrumenting your application code.

Using `DBI` Profiling

The DBI module in Perl offers built-in profiling capabilities that can help identify slow database interactions within your application code.

Enable Profiling: You can enable profiling globally or per-connection. For specific scripts, you can set the PERLDB environment variable or use DBI->trace_callbacks.

# Example: Enable tracing for a specific connection
use DBI;
my $dbh = DBI->connect("dbi:Pg:dbname=mydb", "user", "pass", {
    RaiseError => 1,
    PrintError => 0,
    AutoCommit => 1,
    TraceError => 1, # Trace errors
    Trace      => 9, # Trace level (0-9)
});

# Or using callbacks for more granular control
$dbh->trace_callbacks(1);
$dbh->trace(9, sub {
    my ($pkg, $sub, $msg) = @_;
    # Log or process the trace message
    print STDERR "DBI Trace: $pkg->$sub: $msg\n";
});

# Execute queries...
my $sth = $dbh->prepare("SELECT * FROM some_table WHERE id = ?");
$sth->execute(123);
while (my $row = $sth->fetchrow_hashref) {
    # Process row
}

$dbh->trace_callbacks(0); # Disable callbacks

The output of DBI->trace can be verbose. Directing it to a file and then analyzing it (perhaps using custom scripts or log aggregation tools) is recommended. This helps pinpoint which parts of your Perl application are making slow or frequent database calls.

Application Performance Monitoring (APM) Tools

For more sophisticated application-level monitoring, consider integrating with APM solutions. While dedicated Perl APM agents might be less common than for other languages, you can often achieve similar results by:

Custom Metrics: Instrument your Perl code to emit custom metrics (e.g., number of queries per request, average query time, connection pool usage) using libraries like Prometheus::Client or by sending data to a StatsD endpoint.
Log Correlation: Ensure your application logs include unique request IDs that are also passed to PostgreSQL (via application_name in the connection string). This allows you to correlate application behavior with specific database queries identified by pgBadger or the Perl health check script.

# Example: Setting application_name in connection string
my $app_name = "my_perl_app_v1.2.3_" . get_request_id(); # Assuming get_request_id() provides a unique ID
my $dsn = "DBI:Pg:host=$config->{host};port=$config->{port};dbname=$config->{database};application_name=$app_name";
my $dbh = DBI->connect($dsn, $config->{user}, $config->{password}, { ... });

By combining proactive PostgreSQL health checks, historical log analysis with pgBadger, and application-level instrumentation, you can build a robust monitoring system tailored for your Perl applications and PostgreSQL clusters on OVH, ensuring high availability and performance.