Server Monitoring Best Practices: Keeping Your Perl App and MongoDB Clusters Alive on OVH

Proactive MongoDB Cluster Health Checks with Perl Scripts

Keeping a MongoDB cluster healthy, especially one that backs critical Perl applications, takes more than basic uptime checks. Performance metrics, replication lag, and resource utilization all need watching, and custom scripts can gather granular data that standard tools miss. This section outlines a Perl script that queries MongoDB for key health indicators and shows how to integrate it into a broader monitoring system.

This script connects to a MongoDB instance, checks the status of replica sets, and reports on oplog window, network statistics, and basic server information. It’s designed to be run periodically by a scheduler like cron.

Perl Script for MongoDB Health Monitoring

#!/usr/bin/perl

use strict;
use warnings;
use MongoDB;
use JSON;

# --- Configuration ---
my $mongo_host = '127.0.0.1'; # Or your MongoDB primary host
my $mongo_port = 27017;
my $mongo_db   = 'admin';     # Database used for admin commands and authentication
my $mongo_user = '';          # Set if authentication is enabled
my $mongo_pass = '';          # Set if authentication is enabled

# --- Script Logic ---
my $conn_params = {
    host    => "$mongo_host:$mongo_port",
    db_name => $mongo_db,
};
if ($mongo_user && $mongo_pass) {
    $conn_params->{username} = $mongo_user;
    $conn_params->{password} = $mongo_pass;
}

# MongoDB::Connection was removed in driver 1.0; MongoDB::MongoClient replaces it
# and takes a list of options rather than a hash reference.
# The 1.x/2.x driver connects lazily, so an unreachable server may only
# surface as an error on the first command below.
my $client;
eval {
    $client = MongoDB::MongoClient->new(%$conn_params);
};
if ($@) {
    print STDERR "ERROR: Could not connect to MongoDB: $@\n";
    exit 1;
}

my $db = $client->get_database($mongo_db);

# 1. Replica Set Status
my $repl_status;
eval {
    $repl_status = $db->run_command({ replSetGetStatus => 1 });
};
if ($@) {
    print STDERR "WARNING: Could not get replica set status: $@\n";
    $repl_status = { ok => 0, errmsg => "Failed to get replSetGetStatus: $@" };
}

my $output = {
    timestamp => scalar(localtime),
    replica_set_status => $repl_status,
};

# 2. Server Status (for basic metrics)
my $server_status;
eval {
    $server_status = $db->run_command({ serverStatus => 1 });
};
if ($@) {
    print STDERR "WARNING: Could not get server status: $@\n";
    $server_status = { ok => 0, errmsg => "Failed to get serverStatus: $@" };
}

$output->{server_status} = $server_status;

# 3. Network Statistics
my $network_status;
if ($server_status && $server_status->{ok}) {
    $network_status = $server_status->{network};
}
$output->{network_status} = $network_status || { ok => 0, errmsg => "Server status not available" };

# 4. Oplog Window (if replica set is configured)
my $oplog_window_seconds = 'N/A';
if ($repl_status && $repl_status->{ok} && exists $repl_status->{members}) {
    # The oplog lives in the 'local' database, not in 'admin'.
    my $oplog_coll = $client->get_database('local')->get_collection('oplog.rs');
    my ($latest_oplog, $earliest_oplog);
    eval {
        # find_one takes (filter, projection, options); $natural follows insertion order.
        $latest_oplog   = $oplog_coll->find_one({}, { ts => 1 }, { sort => { '$natural' => -1 } });
        $earliest_oplog = $oplog_coll->find_one({}, { ts => 1 }, { sort => { '$natural' => 1 } });
    };
    print STDERR "WARNING: Could not read oplog: $@\n" if $@;

    if ($latest_oplog && $earliest_oplog) {
        # 'ts' is a BSON timestamp (epoch seconds plus an increment), not a date.
        # Driver 2.x exposes the seconds as ->seconds, driver 1.x as ->sec.
        my $ts_seconds = sub {
            my ($ts) = @_;
            return $ts->can('seconds') ? $ts->seconds : $ts->sec;
        };
        # Window = time span between the newest and the oldest oplog entry.
        $oplog_window_seconds = $ts_seconds->($latest_oplog->{ts}) - $ts_seconds->($earliest_oplog->{ts});
    }
}
$output->{oplog_window_seconds} = $oplog_window_seconds;

# --- Output ---
# For Nagios/Icinga compatibility, we can format output for specific checks.
# For general logging, JSON is useful.

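# Example: Basic JSON output for a log file or API
# The command results contain BSON objects (dates, timestamps, booleans).
# convert_blessed serializes those that provide TO_JSON, and allow_blessed
# writes null for the rest instead of dying.
print to_json($output, { pretty => 1, allow_blessed => 1, convert_blessed => 1 });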

# Example: Nagios-style output for specific checks
# Check 1: Replica Set Health
if ($repl_status && $repl_status->{ok}) {
    my $primary_count = 0;
    my $secondary_count = 0;
    my $arbiter_count = 0;
    my $down_count = 0;
    my $max_lag = 0;

    # Locate the primary by state rather than assuming it is member 0.
    # optimeDate is decoded as a date object; both BSON::Time (driver 2.x)
    # and DateTime (driver 1.x) provide an epoch() method.
    my ($primary) = grep { $_->{stateStr} eq 'PRIMARY' } @{$repl_status->{members}};
    my $primary_epoch = ($primary && $primary->{optimeDate}) ? $primary->{optimeDate}->epoch : undef;

    foreach my $member (@{$repl_status->{members}}) {
        if ($member->{stateStr} eq 'PRIMARY') {
            $primary_count++;
        } elsif ($member->{stateStr} eq 'SECONDARY') {
            $secondary_count++;
            # Replication lag in seconds: primary optime minus this secondary's optime.
            if (defined $primary_epoch && $member->{optimeDate}) {
                my $lag = $primary_epoch - $member->{optimeDate}->epoch;
                $max_lag = $lag if $lag > $max_lag;
            }
        } elsif ($member->{stateStr} eq 'ARBITER') {
            $arbiter_count++;
        } else {
            # Any other state (for example RECOVERING or an unreachable member) counts as down.
            $down_count++;
        }
    }

    if ($primary_count == 1 && $down_count == 0) {
        printf "OK: Replica set healthy. Primary: %d, Secondary: %d, Arbiter: %d, Max lag: %.0fs.\n",
            $primary_count, $secondary_count, $arbiter_count, $max_lag;
    } else {
        print "CRITICAL: Replica set unhealthy. Primary: $primary_count, Secondary: $secondary_count, Arbiter: $arbiter_count, Down: $down_count.\n";
        exit 2; # CRITICAL
    }
} else {
    print "CRITICAL: Failed to get replica set status.\n";
    exit 2; # CRITICAL
}

# Check 2: Oplog Window
# A short window leaves lagging secondaries little time to catch up before
# the oplog entries they still need are overwritten.
if ($oplog_window_seconds =~ /^\d+$/) {
    print "OK: Oplog window is ${oplog_window_seconds}s.\n";
} elsif ($repl_status && $repl_status->{ok} && exists $repl_status->{members}) {
    # No numeric window was calculated; report accessibility only.
    print "OK: Oplog accessible.\n";
} else {
    # This case is already covered by the replica set check, but good for clarity.
    print "WARNING: Oplog status unknown due to replica set issues.\n";
}

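# Check 3: Client Connections (example: high number of incoming connections)
# serverStatus reports connection counts under 'connections', not 'network',
# and the 'network' sub-document carries no 'ok' field of its own.
my $exit_code = 0;
if ($server_status && $server_status->{ok} && exists $server_status->{connections}) {
    my $num_connections = $server_status->{connections}{current};
    if ($num_connections > 500) { # Threshold, adjust as needed
        print "WARNING: High number of client connections: $num_connections.\n";
        $exit_code = 1; # WARNING
    } else {
        print "OK: Client connections ($num_connections) within limits.\n";
    }
} else {
    print "WARNING: Could not retrieve connection statistics.\n";
    $exit_code = 1; # WARNING
}

exit $exit_code; # 0 = OK, 1 = WARNING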

Explanation:

Configuration: Sets connection parameters for MongoDB. Ensure these are correct for your environment, especially if using authentication.
Connection: Establishes a connection to the MongoDB instance using the MongoDB Perl driver. Error handling is crucial here.
replSetGetStatus: This command is vital for replica set health. It reports each member’s state (PRIMARY, SECONDARY, ARBITER), its health, and the time of its last applied operation, from which replication lag can be derived.
serverStatus: Offers a broad range of server metrics, including network traffic, connection counts, memory usage, and operation counts.
Oplog Window: The oplog window is the time span between the earliest and latest entries in the oplog. A short window is the warning sign: a secondary that falls further behind than the window covers can no longer catch up from the oplog and needs a full resync. The script reads the earliest and latest entries; the window is the difference between their timestamps.
Output Formatting: The script can output JSON for general logging or API consumption, and also provides Nagios/Icinga-compatible status messages for direct integration with monitoring systems.
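
The relationship between oplog window and replication lag can be sketched as a small shell helper. This is a minimal sketch, not part of the script above: the function name oplog_headroom and the one-hour default threshold are hypothetical, and both inputs are assumed to be whole seconds that you have already obtained from the health check.

```shell
#!/bin/sh
# Sketch: compare the oplog window with the worst replication lag.
# The function name and the default threshold are hypothetical.

# oplog_headroom WINDOW_SECONDS MAX_LAG_SECONDS [MIN_HEADROOM_SECONDS]
# Prints a Nagios-style line and returns 0 (OK), 1 (WARNING) or 2 (CRITICAL).
oplog_headroom() (
    window=$1
    lag=$2
    min_headroom=${3:-3600}   # assumed default: warn below one hour of headroom

    headroom=$((window - lag))

    if [ "$headroom" -le 0 ]; then
        # The secondary is further behind than the oplog covers,
        # so it can no longer catch up by replaying the oplog.
        echo "CRITICAL: lag ${lag}s exceeds oplog window ${window}s"
        return 2
    elif [ "$headroom" -lt "$min_headroom" ]; then
        echo "WARNING: only ${headroom}s of oplog headroom left"
        return 1
    fi
    echo "OK: ${headroom}s of oplog headroom"
    return 0
)
```

Calling oplog_headroom 86400 30 reports an OK status, while a lag larger than the window returns the CRITICAL exit code. The point of the design is that neither number is meaningful alone: a small window is harmless with near-zero lag, and a large lag is survivable with a long window.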

Integrating Perl Scripts with Cron and Monitoring Systems

Once the Perl script is functional, the next step is to automate its execution and integrate its output into a centralized monitoring platform. Cron is the de facto standard for scheduling tasks on Linux systems.

Cron Job Setup

To run the script every 5 minutes, you would add an entry to your crontab. First, ensure the script is executable and has the correct shebang line (#!/usr/bin/perl).

# Make the script executable
chmod +x /path/to/your/mongo_health_check.pl

# Edit crontab
crontab -e

# Add the following line to run every 5 minutes:
*/5 * * * * /usr/bin/perl /path/to/your/mongo_health_check.pl >> /var/log/mongo_health_check.log 2>&1

This cron job will execute the Perl script every five minutes, appending its standard output and standard error to a log file. This log file can then be monitored for specific patterns or errors.
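
Watching the log for patterns could look like the following. It is a minimal sketch that assumes the status lines start with OK:, WARNING: or CRITICAL: as in the script above; the function name summarize_health_log and the default of 200 lines are hypothetical.

```shell
#!/bin/sh
# Sketch: summarize status lines in the health-check log.
# Assumes lines start with OK:, WARNING: or CRITICAL:.

# summarize_health_log LOGFILE [LINES]
# Inspects the last LINES lines (default 200) and prints counts per status.
# Returns 2 if any CRITICAL line was seen, 1 if any WARNING, 3 if the file
# cannot be read, else 0.
summarize_health_log() (
    logfile=$1
    lines=${2:-200}

    [ -r "$logfile" ] || { echo "UNKNOWN: cannot read $logfile"; return 3; }

    recent=$(tail -n "$lines" "$logfile")
    # grep -c exits non-zero when the count is 0, hence the '|| true'.
    crit=$(printf '%s\n' "$recent" | grep -c '^CRITICAL:' || true)
    warn=$(printf '%s\n' "$recent" | grep -c '^WARNING:' || true)

    echo "critical=$crit warning=$warn"
    if [ "$crit" -gt 0 ]; then return 2; fi
    if [ "$warn" -gt 0 ]; then return 1; fi
    return 0
)
```

You might call summarize_health_log /var/log/mongo_health_check.log from a second cron entry and alert on a non-zero exit status. Limiting the scan to recent lines keeps an old, already-resolved CRITICAL entry from triggering alerts indefinitely.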

Forwarding to a Monitoring System (e.g., Zabbix, Prometheus)

For more advanced alerting and dashboarding, you’ll want to forward the script’s output to a dedicated monitoring system. Here are common approaches:

Option 1: Using a Custom Exporter for Prometheus

You can create a small web server (e.g., using Python’s Flask or Node.js) that runs your Perl script periodically, parses its output, and exposes metrics in Prometheus format. Alternatively, you can adapt the Perl script to output Prometheus-compatible metrics directly.

# Example of modifying the Perl script to output Prometheus metrics
# ... (previous Perl script code) ...

# --- Prometheus Output Section ---
# Assuming $repl_status and $server_status are populated
my $prom_output = "";

# Replica Set Status Metric
if ($repl_status && $repl_status->{ok}) {
    my $primary_count = 0;
    my $secondary_count = 0;
    my $arbiter_count = 0;
    my $down_count = 0;
    my $max_lag_sec = 0; # Placeholder for actual lag calculation

    foreach my $member (@{$repl_status->{members}}) {
        if ($member->{stateStr} eq 'PRIMARY') {
            $primary_count++;
        } elsif ($member->{stateStr} eq 'SECONDARY') {
            $secondary_count++;
            # Add actual lag calculation here if needed
        } elsif ($member->{stateStr} eq 'ARBITER') {
            $arbiter_count++;
        } else {
            $down_count++;
        }
    }
    $prom_output .= "# HELP mongo_replica_set_members_total Total members in the replica set\n";
    $prom_output .= "# TYPE mongo_replica_set_members_total gauge\n";
    $prom_output .= "mongo_replica_set_members_total{state=\"primary\"} $primary_count\n";
    $prom_output .= "mongo_replica_set_members_total{state=\"secondary\"} $secondary_count\n";
    $prom_output .= "mongo_replica_set_members_total{state=\"arbiter\"} $arbiter_count\n";
    $prom_output .= "mongo_replica_set_members_total{state=\"down\"} $down_count\n";

    if ($primary_count < 1 || $down_count > 0) {
        $prom_output .= "mongo_replica_set_health 0\n"; # 0 for critical
    } elsif ($secondary_count == 0) {
        $prom_output .= "mongo_replica_set_health 1\n"; # 1 for warning (no secondaries)
    } else {
        $prom_output .= "mongo_replica_set_health 2\n"; # 2 for OK
    }
} else {
    $prom_output .= "mongo_replica_set_health 0\n"; # CRITICAL if status command fails
}

# Network Metrics
if ($server_status && $server_status->{ok} && exists $server_status->{network}) {
    my $net = $server_status->{network};
    # Connection counts live under serverStatus.connections, not serverStatus.network.
    my $conns = $server_status->{connections} || {};
    $prom_output .= "# HELP mongo_network_connections Number of client connections\n";
    $prom_output .= "# TYPE mongo_network_connections gauge\n";
    $prom_output .= "mongo_network_connections " . ($conns->{current} || 0) . "\n";
    $prom_output .= "# HELP mongo_network_bytes_in Total bytes received\n";
    $prom_output .= "# TYPE mongo_network_bytes_in counter\n";
    $prom_output .= "mongo_network_bytes_in " . ($net->{bytesIn} || 0) . "\n";
    $prom_output .= "# HELP mongo_network_bytes_out Total bytes sent\n";
    $prom_output .= "# TYPE mongo_network_bytes_out counter\n";
    $prom_output .= "mongo_network_bytes_out " . ($net->{bytesOut} || 0) . "\n";
}

# Oplog Window
# Emit the gauge only when the earlier part of the script produced a numeric
# value. A hard-coded 0 would look like an exhausted oplog to any alert rule.
if ($oplog_window_seconds =~ /^\d+$/) {
    $prom_output .= "# HELP mongo_oplog_window_seconds Current oplog window in seconds\n";
    $prom_output .= "# TYPE mongo_oplog_window_seconds gauge\n";
    $prom_output .= "mongo_oplog_window_seconds $oplog_window_seconds\n";
}

# Print Prometheus metrics
print $prom_output;

exit 0;

You would then configure Prometheus to scrape this endpoint (e.g., running a small web server that executes this script and serves the output). The cron job would be replaced by Prometheus’s scraping mechanism.
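
If you would rather not run a web server, one alternative (an assumption on our part, not something the setup above requires) is node_exporter's textfile collector, which reads *.prom files from the directory given by --collector.textfile.directory. In that case cron keeps running the script, and a wrapper writes the metrics file atomically. The function name and any paths below are hypothetical.

```shell
#!/bin/sh
# Sketch: publish metrics through node_exporter's textfile collector.
# Function name and paths are hypothetical.

# write_metrics_atomically OUTFILE COMMAND [ARGS...]
# Runs COMMAND, captures its stdout in a temp file next to OUTFILE, and
# renames it into place only if COMMAND succeeded, so a scrape never reads
# a half-written file. On failure the previous OUTFILE is left untouched.
write_metrics_atomically() (
    outfile=$1
    shift

    tmpfile=$(mktemp "${outfile}.XXXXXX") || return 1
    if "$@" > "$tmpfile"; then
        # mktemp creates the file with mode 600; make it readable for the exporter.
        chmod 644 "$tmpfile"
        mv "$tmpfile" "$outfile"
    else
        status=$?
        rm -f "$tmpfile"
        return "$status"
    fi
)
```

A cron entry could then call write_metrics_atomically with a target such as /var/lib/node_exporter/textfile/mongo.prom followed by the command that prints the Prometheus output. Because the temporary file does not end in .prom, the collector ignores it until the rename.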

Option 2: Using Zabbix Agent UserParameters

If you’re using Zabbix, you can leverage UserParameter in the Zabbix agent configuration to execute your Perl script and return specific values.

# /etc/zabbix/zabbix_agentd.conf
# Add these lines:

UserParameter=mongo.replset.status[*],/usr/bin/perl /path/to/your/mongo_health_check.pl --key replSetGetStatus --host $1 --port $2
UserParameter=mongo.server.connections[*],/usr/bin/perl /path/to/your/mongo_health_check.pl --key connections --host $1 --port $2
UserParameter=mongo.oplog.window[*],/usr/bin/perl /path/to/your/mongo_health_check.pl --key oplogWindow --host $1 --port $2

The script as written does not parse command-line options, so you would need to modify it to accept --key, --host, and --port (for example with Getopt::Long) and to print only the single value for the requested key, rather than the full JSON or Nagios output. Piping the full output through grep is not enough, because a numeric Zabbix item expects a bare number. The Zabbix server then polls these keys.

Monitoring Perl Application Performance on OVH

Beyond infrastructure, monitoring the performance of your Perl applications is critical. This involves tracking request latency, error rates, memory usage, and CPU consumption specific to your application processes.

Application-Level Metrics with Devel::NYTProf and Devel::Cover

For in-depth performance analysis of your Perl code, profiling tools are indispensable. Devel::NYTProf is excellent for identifying performance bottlenecks within your codebase.

# Install Devel::NYTProf
cpanm Devel::NYTProf

# Run your Perl application with profiling enabled
perl -d:NYTProf your_app.pl [args...]

# After execution, generate a report
nytprofhtml

This generates an HTML report detailing execution time per subroutine, line-by-line timings, and call counts. You can integrate this into your development workflow to optimize slow code paths. For code coverage analysis, Devel::Cover is invaluable.

# Install Devel::Cover
cpanm Devel::Cover

# Run your Perl application with coverage enabled
perl -MDevel::Cover your_app.pl [args...]

# After execution, generate a report
cover -report html

These tools are primarily for development and debugging. For production monitoring, you’ll want to collect metrics that reflect real-time application behavior.

System-Level Process Monitoring for Perl Apps

Standard OS-level tools can monitor your Perl application processes. We can use tools like ps, top, or htop, and then parse their output or use more advanced agents.

# Find your Perl application's process ID (PID)
pgrep -f "your_app.pl"

# Get resource usage for the matching PIDs
# (querying by PID avoids matching the grep process itself)
ps -o pid,%cpu,%mem,rss,cmd -p "$(pgrep -d, -f "your_app.pl")"

# Example: Monitoring CPU and Memory usage of a Perl process
# This can be scripted to run periodically and send alerts.

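APP_NAME="your_app.pl"
# -o selects the oldest matching process, so PID holds a single value
# even when the application forks workers.
PID=$(pgrep -o -f "$APP_NAME")

if [ -z "$PID" ]; then
    echo "ERROR: $APP_NAME process not found." >&2
    exit 1
fi

# A trailing '=' suppresses the header line; tr strips the column padding.
# Note that ps reports CPU as an average over the process lifetime.
CPU_USAGE=$(ps -p "$PID" -o %cpu= | tr -d ' ')
MEM_USAGE=$(ps -p "$PID" -o %mem= | tr -d ' ')

echo "Application: $APP_NAME"
echo "PID: $PID"
echo "CPU Usage: $CPU_USAGE%"
echo "Memory Usage: $MEM_USAGE%"

# Add logic here to compare against thresholds and send alerts
# e.g., if [ "$(echo "$CPU_USAGE > 80" | bc -l)" -eq 1 ]; then ...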
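
The threshold comparison hinted at in the last comment could be sketched as follows. This is an illustration only: the function name classify_usage and the threshold values are hypothetical, and awk is used in place of bc because ps prints fractional percentages that the shell's integer tests cannot compare.

```shell
#!/bin/sh
# Sketch: classify a CPU or memory percentage against thresholds.
# Function name and threshold values are hypothetical.

# classify_usage LABEL VALUE WARN CRIT
# VALUE may be fractional (ps prints e.g. "12.5"), so awk does the comparison.
# Prints a status line and returns 0 (OK), 1 (WARNING) or 2 (CRITICAL).
classify_usage() (
    label=$1
    value=$2
    warn=$3
    crit=$4

    # Adding 0 forces awk to compare the values numerically.
    level=$(awk -v v="$value" -v w="$warn" -v c="$crit" \
        'BEGIN { if (v + 0 >= c + 0) print 2; else if (v + 0 >= w + 0) print 1; else print 0 }')

    case $level in
        2) echo "CRITICAL: $label at ${value}%" ;;
        1) echo "WARNING: $label at ${value}%" ;;
        *) echo "OK: $label at ${value}%" ;;
    esac
    return "$level"
)
```

With the variables from the script above, classify_usage CPU "$CPU_USAGE" 80 95 prints one status line and sets an exit status that a scheduler or monitoring agent can act on.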

For more sophisticated monitoring, consider agents like Datadog, New Relic, or even custom solutions using tools like collectd or Telegraf to gather process metrics and send them to a time-series database (e.g., InfluxDB, Graphite).

Logging and Error Aggregation

Centralized logging is paramount. Ensure your Perl application logs errors, warnings, and significant events to a consistent format (e.g., JSON) and location. Tools like rsyslog, Filebeat, or Fluentd can then collect these logs and forward them to a central aggregation system like Elasticsearch, Splunk, or Loki.

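# Example rsyslog configuration to store and forward application logs
# /etc/rsyslog.d/99-app-logs.conf

# Forward to a remote syslog server (e.g., Logstash/Elasticsearch).
# This rule must come before the 'stop' below, otherwise the application's
# messages are discarded before they reach it.
# A single @ forwards over UDP; use @@ for TCP.
*.* @your-log-aggregator.example.com:514

# Assuming your application logs through syslog with the tag 'your_app.pl',
# write its messages to /var/log/your_app.log and stop further processing
if $programname == 'your_app.pl' then /var/log/your_app.log
& stop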

By combining infrastructure monitoring (MongoDB) with application performance monitoring (Perl app), you create a comprehensive observability strategy that ensures the stability and performance of your services on OVH.