Server Monitoring Best Practices: Keeping Your Perl App and Elasticsearch Clusters Alive on OVH

Proactive Perl Application Health Checks

Maintaining the stability of a Perl application, especially one serving critical functions, requires more than just basic process monitoring. We need to ensure the application is not only running but also *functioning correctly*. This involves deep dives into application-specific metrics and error conditions. For a Perl application, this often means checking database connections, internal state, and response times to simulated requests.

A robust approach involves creating a dedicated health check script that the monitoring system can periodically execute. This script should perform a series of checks and exit with a non-zero status code if any check fails. We’ll use a simple Perl script for this, which can be triggered via cron or a dedicated monitoring agent.

Perl Health Check Script Example

#!/usr/bin/perl

use strict;
use warnings;
use DBI;
use LWP::UserAgent;
use Time::HiRes qw(time);

# --- Configuration ---
my $db_dsn      = "dbi:mysql:database=your_app_db;host=127.0.0.1;port=3306";
my $db_user     = "your_db_user";
my $db_password = "your_db_password";
my $app_url     = "http://localhost/your_app/health_check.php"; # Assuming a simple PHP endpoint for HTTP check
my $timeout     = 5; # seconds

# --- Database Connection Check ---
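sub check_db_connection {
    my $start_time = time();
    # eval returns 1 only if every step inside it succeeded
    my $ok = eval {
        my $dbh = DBI->connect($db_dsn, $db_user, $db_password, { RaiseError => 1, PrintError => 0, AutoCommit => 1 });
        # ping() returns false instead of dying, so check its result explicitly
        $dbh->ping or die "ping failed\n";
        $dbh->disconnect;
        1;
    };
    unless ($ok) {
        my $err = $@ || "unknown error\n";
        chomp $err;
        print STDERR "DB Connection Error: $err\n";
        return 0; # Failure
    }
    my $duration = time() - $start_time;
    printf "DB Connection OK (took %.3fs)\n", $duration;
    return 1; # Success
}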

# --- HTTP Endpoint Check ---
sub check_http_endpoint {
    my $ua = LWP::UserAgent->new;
    $ua->timeout($timeout);
    my $start_time = time();
    my $response = $ua->get($app_url);
    my $duration = time() - $start_time;

    unless ($response->is_success) {
        print STDERR "HTTP Endpoint Error: " . $response->status_line . "\n";
        return 0; # Failure
    }
    printf "HTTP Endpoint OK (took %.3fs)\n", $duration;
    return 1; # Success
}

# --- Main Execution ---
my $overall_status = 1;

# Perform checks
$overall_status &= check_db_connection();
$overall_status &= check_http_endpoint();

# Add more checks here: e.g., cache status, queue depth, specific business logic validation

exit( $overall_status ? 0 : 1 );

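This script checks both a database connection and an HTTP endpoint. The database check uses DBI and wraps the connection attempt in `eval`, so a failure is caught rather than aborting the script; the HTTP check uses LWP::UserAgent with a timeout. Each failure prints an informative message to STDERR. The script exits with 0 on success and 1 on failure, which cron wrappers and most monitoring agents can consume directly. Note that Nagios-style plugins interpret exit code 1 as WARNING and 2 as CRITICAL, so change the final exit value to 2 if a failed check should be reported as critical.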

Integrating with Nagios/Icinga

For systems like Nagios or Icinga, you can use the `check_command` directive to execute this Perl script. Ensure the script has execute permissions and is placed in a directory accessible by the monitoring daemon.

# In your Nagios/Icinga host or service definition:
define service {
    use                     generic-service
    host_name               your_perl_app_host
    service_description     Perl App Health Check
    check_command           check_script!/path/to/your/health_check.pl
    check_interval          1
    retry_interval          1
    max_check_attempts      3
}

# In commands.cfg (or equivalent):
define command {
    command_name    check_script
    command_line    $USER1$/check_script -H $HOSTADDRESS$ -c $ARG1$
}

The `check_script` command here is a placeholder, not a standard plugin. If the application runs on a remote host, you would typically execute the script there through an agent such as NRPE (using the `check_nrpe` plugin). If the monitoring daemon runs on the same host as the application and can execute the script itself, the `command_line` can simply point at the script.
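Nagios-compatible plugins are expected to exit with 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. A health check that exits 1 on failure would therefore show up as a warning. The following is a minimal sketch of a wrapper that translates any failure into CRITICAL; the function name `run_as_critical` is hypothetical, not part of Nagios or Icinga.

```shell
#!/bin/sh
# Hypothetical wrapper: run a health check command and translate its exit
# status into Nagios plugin conventions (0 OK, 2 CRITICAL, 3 UNKNOWN).
run_as_critical() {
    if [ "$#" -eq 0 ]; then
        echo "UNKNOWN - no command given"
        return 3
    fi
    # Capture stdout and stderr so the plugin output carries the check's own message
    output=$("$@" 2>&1)
    status=$?
    # Keep only the first line for the one-line plugin summary
    first_line=$(printf '%s\n' "$output" | head -n 1)
    case "$status" in
        0)
            echo "OK - ${first_line:-check passed}"
            return 0 ;;
        126|127)
            # The shell reports 126/127 when the command could not be executed
            echo "UNKNOWN - could not execute: $first_line"
            return 3 ;;
        *)
            echo "CRITICAL - ${first_line:-check failed with status $status}"
            return 2 ;;
    esac
}
```

A plugin script built on this sketch would end with something like `run_as_critical /path/to/your/health_check.pl; exit $?`, and the `command_line` would point at that plugin script.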

Elasticsearch Cluster Monitoring: Beyond Basic Node Status

Monitoring an Elasticsearch cluster involves more than just ensuring nodes are up. We need to track cluster health (green, yellow, red), shard allocation, indexing performance, search latency, and resource utilization (CPU, memory, disk). Elasticsearch provides a rich set of APIs for this, which we can query using tools like `curl` or dedicated monitoring solutions.

Essential Elasticsearch Cluster Health Checks

The cluster health API (`_cluster/health`) is the first port of call. It provides an overall health status and details about shards.

curl -X GET "localhost:9200/_cluster/health?pretty"

The output will look something like this:

{
  "cluster_name" : "my-es-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 10,
  "active_shards" : 30,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue" : 0,
  "active_shards_percent_as_number" : 100.0
}

Key metrics to monitor here are:

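status: Should ideally be green. yellow indicates that all primary shards are assigned but at least one replica is not, so the cluster is serving data with reduced redundancy. red means at least one primary shard is unassigned: the data in those shards is unavailable and searches may return partial results until the shard is recovered.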
unassigned_shards: Should be 0. Any non-zero value indicates a problem with shard allocation.
number_of_nodes: Ensure this matches your expected cluster size.
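The status field maps naturally onto monitoring exit codes. The following is a minimal sketch with two hypothetical helper functions: one pulls the status out of a cluster health response using `sed`, the other turns it into a Nagios-style exit code. A real JSON parser would be more robust than the `sed` pattern used here.

```shell
# Hypothetical helper: read a _cluster/health response on stdin and print
# the value of its "status" field (works for pretty and compact output).
es_status_from_json() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: map a cluster status to a Nagios-style exit code.
es_status_to_exit_code() {
    case "$1" in
        green)  return 0 ;;   # all primaries and replicas assigned
        yellow) return 1 ;;   # primaries assigned, some replicas not
        red)    return 2 ;;   # at least one primary unassigned
        *)      return 3 ;;   # missing or unparseable status
    esac
}
```

It could be used as `status=$(curl -s -f "localhost:9200/_cluster/health" | es_status_from_json); es_status_to_exit_code "$status"`. An empty status (for example when `curl` fails) falls through to the UNKNOWN case.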

Shard Allocation Monitoring

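The allocation explain API (`_cluster/allocation/explain`) is invaluable for diagnosing why shards are unassigned. You can pass it the details of a specific shard in the request body, or call it without a body, in which case it explains the first unassigned shard it finds and returns an error if there are none.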

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

If you have unassigned shards, this API will give you detailed reasons, such as insufficient disk space, shard count limits, or node attribute mismatches.
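To ask about one specific shard, the request takes a JSON body naming the index, the shard number, and whether the primary or a replica is meant. The following is a minimal sketch; the helper name `allocation_explain_body` and the index name `my-index` in the usage note are hypothetical.

```shell
# Hypothetical helper: build the request body for the allocation explain API.
# Arguments: index name, shard number, "true" for the primary or "false" for a replica.
allocation_explain_body() {
    printf '{"index":"%s","shard":%d,"primary":%s}\n' "$1" "$2" "$3"
}
```

The body can then be sent with `curl -s -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d "$(allocation_explain_body my-index 0 true)"`.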

Node Statistics and Performance

To understand individual node performance and resource usage, the nodes stats API (`_nodes/stats`) is essential. You can filter by specific nodes or metrics.

curl -X GET "localhost:9200/_nodes/stats/os,jvm,indices,breaker?pretty"

This command retrieves statistics for the operating system, JVM, indices, and circuit breakers for all nodes. Pay close attention to:

jvm.mem.heap_used_percent: Sustained high heap usage (e.g., > 80-90%) can lead to long garbage collection pauses and instability.
os.cpu.percent: Sustained high CPU utilization on a node can slow both indexing and search.
indices.segments.count: A very high number of segments can indicate inefficient indexing or too-frequent refreshes.
breaker: Circuit breaker statistics. A rising `tripped` count on any breaker indicates memory pressure.
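As an illustration of turning these statistics into a single number to alert on, the following sketch prints the highest heap usage across all nodes. It assumes the response was requested with `?pretty`, so each field sits on its own line; the function name `max_heap_used_percent` is hypothetical.

```shell
# Hypothetical helper: read a pretty-printed _nodes/stats response on stdin
# and print the highest heap_used_percent found across all nodes.
# Exits non-zero if the field does not appear at all.
max_heap_used_percent() {
    awk -F: '/"heap_used_percent"/ {
        gsub(/[^0-9]/, "", $2)          # strip spaces and the trailing comma
        if ($2 + 0 > max) max = $2 + 0
        found = 1
    }
    END { if (found) print max + 0; else exit 1 }'
}
```

For example, `curl -s "localhost:9200/_nodes/stats/jvm?pretty" | max_heap_used_percent` yields one integer that can be compared against a threshold.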

Index Performance and Latency

Monitoring indexing and search performance is critical. The indices stats API (`_stats`) provides detailed information per index.

curl -X GET "localhost:9200/_stats/indexing,search?pretty"

Key metrics include:

indices.*.total.indexing.index_total: Total documents indexed.
indices.*.total.indexing.index_time_in_millis: Time spent indexing.
indices.*.total.search.query_total: Total search requests.
indices.*.total.search.query_time_in_millis: Time spent on searches.

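These values are cumulative counters, so a single reading says little on its own. By sampling them at regular intervals and comparing the differences, you can derive rates and average latencies and identify performance bottlenecks or regressions. For more granular latency, consider using the Profile API or application-level timing.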
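Since the totals only accumulate, the average latency over an interval is the change in time divided by the change in operation count between two samples. The following is a minimal sketch of that arithmetic; the function name `avg_latency_ms` is hypothetical and the sample values in the usage note are made up for illustration.

```shell
# Hypothetical helper: average milliseconds per operation between two samples.
# Usage: avg_latency_ms TIME_BEFORE TIME_AFTER COUNT_BEFORE COUNT_AFTER
# e.g. query_time_in_millis and query_total read at two points in time.
avg_latency_ms() {
    awk -v t0="$1" -v t1="$2" -v c0="$3" -v c1="$4" 'BEGIN {
        ops = c1 - c0
        # No operations in the interval (or a counter reset): nothing to average
        if (ops <= 0) { print "n/a"; exit }
        printf "%.2f\n", (t1 - t0) / ops
    }'
}
```

With a first sample of 120000 ms over 40000 queries and a second of 126000 ms over 42000 queries, `avg_latency_ms 120000 126000 40000 42000` prints `3.00`, that is 6000 ms spread over 2000 queries.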

Disk Space Monitoring

Elasticsearch is heavily reliant on disk I/O and space. Running out of disk space is a common cause of cluster instability and data loss. The nodes stats API provides disk usage per node and per data path.

curl -X GET "localhost:9200/_nodes/stats/fs?pretty"

Look for:

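fs.total.available_in_bytes: Space available to Elasticsearch across all data paths. Ensure this is well above critical thresholds (e.g., > 20-30% of total disk space).
fs.total.total_in_bytes: Total disk space across all data paths.
fs.data: The same byte counts (total_in_bytes, free_in_bytes, available_in_bytes) for each data path. The API reports bytes rather than a percentage, so compute usage as (total - available) / total.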

Set up alerts when disk usage exceeds predefined thresholds (e.g., 80% for warning, 90% for critical).
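The thresholds above can be applied to the byte counts from the nodes stats API with a little arithmetic. The following is a minimal sketch; both function names are hypothetical, and treating everything that is not available as "used" is an assumption of this sketch.

```shell
# Hypothetical helper: percentage of disk used, given total and available bytes.
# Usage: disk_used_percent TOTAL_BYTES AVAILABLE_BYTES
disk_used_percent() {
    awk -v total="$1" -v avail="$2" 'BEGIN {
        if (total <= 0) exit 1
        printf "%.1f\n", (total - avail) * 100 / total
    }'
}

# Hypothetical helper: classify a usage percentage.
# Usage: disk_alert_level PERCENT [WARN] [CRIT]  (defaults: 80 and 90)
disk_alert_level() {
    awk -v p="$1" -v warn="${2:-80}" -v crit="${3:-90}" 'BEGIN {
        if (p >= crit) print "CRITICAL"
        else if (p >= warn) print "WARNING"
        else print "OK"
    }'
}
```

For instance, `disk_alert_level "$(disk_used_percent 1000 150)"` computes 85.0 percent used and prints `WARNING`.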

Alerting Strategy for OVH Infrastructure

On OVH, you’ll likely be managing your own VMs or dedicated servers. This means you’re responsible for the OS-level monitoring as well as the application and cluster-level monitoring. A common setup involves:

OS-Level Monitoring: Tools like node_exporter (for Prometheus) or collectd to gather CPU, memory, disk I/O, and network metrics.
Application Monitoring: The Perl health check script described earlier, executed by a scheduler or monitoring agent.
Elasticsearch Monitoring: Scheduled `curl` checks for the critical signals (cluster status, unassigned shards, disk space), run via cron or a dedicated monitoring agent. For broader coverage, add a more comprehensive solution such as Prometheus with the Elasticsearch Exporter, or the built-in Elasticsearch monitoring features if using X-Pack.

For alerting, consider using a system like Alertmanager (if using Prometheus) or integrating directly with tools like PagerDuty or Opsgenie. Alerts should be actionable and clearly indicate the problem and its severity.

Example Cron Job for Elasticsearch Health Check

A simple cron job can periodically check Elasticsearch cluster health and disk space.

# In root's crontab or a dedicated monitoring user's crontab
# Check cluster health every 5 minutes
*/5 * * * * /usr/bin/curl -s -f "http://localhost:9200/_cluster/health?pretty" | grep -q '"status" *: *"green"' || echo "ES Cluster Status is NOT GREEN" | mail -s "ES Cluster Alert" [email protected]

# Check disk usage on every data node every hour (the threshold of 85 is a percentage)
0 * * * * HIGH=$(/usr/bin/curl -s -f "http://localhost:9200/_cat/allocation?h=disk.percent,node" | awk '$1+0 > 85 { print $2 " at " $1 " percent" }'); if [ -n "$HIGH" ]; then echo "ES Disk Usage is HIGH: $HIGH" | mail -s "ES Disk Alert" [email protected]; fi

This example uses `curl` with `-s` (silent) and `-f` (exit non-zero on HTTP errors instead of printing the error body) and pipes the output to `grep -q` to check for the "green" status. The pattern allows spaces around the colon because the pretty-printed response reads `"status" : "green"`. If `grep` finds no match, or `curl` fails, the pipeline fails and the `||` executes the mail command. The disk check reads the per-node disk usage percentage from the cat allocation API (`_cat/allocation`), because the nodes stats API reports byte counts rather than a percentage, and uses `awk` for the numeric comparison. The commands deliberately avoid the `%` character: cron treats an unescaped `%` in a crontab entry as a newline, so it would have to be written as `\%`. Adjust thresholds and email addresses as needed.
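A check that runs every five minutes will also send mail every five minutes while the problem persists. One common way to reduce the noise is to alert only when the state changes. The following is a minimal sketch of that idea using a state file; the function name `state_changed` and the file path in the usage note are hypothetical.

```shell
# Hypothetical helper: record the latest state in a file and succeed (return 0)
# only when it differs from the previously recorded state.
# Usage: state_changed STATE_FILE NEW_STATE
state_changed() {
    state_file=$1
    new_state=$2
    old_state=""
    if [ -f "$state_file" ]; then
        old_state=$(cat "$state_file")
    fi
    if [ "$new_state" = "$old_state" ]; then
        return 1    # unchanged: stay quiet
    fi
    printf '%s\n' "$new_state" > "$state_file"
    return 0        # changed (or first run): worth a notification
}
```

In a monitoring script this could gate the mail command, for example `state_changed /var/tmp/es_status "$status" && echo "ES status is now $status" | mail ...`. With this design the first run and every transition, including recovery to green, produce exactly one message.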