Server Monitoring Best Practices: Keeping Your Perl App and Elasticsearch Clusters Alive on DigitalOcean

Perl Application Health Checks: Beyond Basic Pings

Monitoring a Perl application on DigitalOcean requires more than just checking if the web server is responding. We need to ensure the application logic itself is sound, its dependencies are met, and it’s not succumbing to common Perl pitfalls like memory leaks or unhandled exceptions. A robust health check should interrogate the application’s core functionalities.

For a typical Perl web application, this might involve a dedicated endpoint that performs a series of internal checks. Let’s consider a simple CGI or PSGI application. We can craft a health check script that verifies database connectivity, checks essential configuration values, and even performs a trivial computation.

Implementing a Perl Health Check Endpoint

Here’s a sample Perl script that can be exposed as a health check endpoint. This script checks for a database connection (assuming a DBI connection) and a critical configuration parameter.

#!/usr/bin/perl

use strict;
use warnings;
use DBI;
use CGI; # Not in core since Perl 5.22, install from CPAN. Or use Plack::Request for PSGI

# --- Configuration ---
my $db_dsn = "dbi:mysql:database=your_app_db;host=127.0.0.1";
my $db_user = "your_db_user";
my $db_pass = "your_db_password";
my $critical_config_key = "FEATURE_TOGGLE_X";
# ---------------------

my $cgi = CGI->new;

# --- Health Check Logic ---
my @errors;

# 1. Database Connectivity Check
# With RaiseError set, connect and query calls die on failure, so a single
# eval catches every DB error. PrintError is off to avoid duplicate warnings.
my $db_ok = eval {
    my $dbh = DBI->connect($db_dsn, $db_user, $db_pass,
        { RaiseError => 1, PrintError => 0, AutoCommit => 1 });
    # Run a trivial query to confirm the connection is functional, not just open
    $dbh->selectrow_array("SELECT 1");
    $dbh->disconnect;
    1;
};
unless ($db_ok) {
    push @errors, "DB check failed: " . ($@ || "unknown error");
}

# 2. Configuration Check (Example: check if a config value is set)
# This assumes you have a mechanism to load configuration, e.g., into a hash
my %app_config = (
    "FEATURE_TOGGLE_X" => 1,
    "ANOTHER_SETTING" => "value",
); # In a real app, load this from a file or env vars

unless (exists $app_config{$critical_config_key} && defined $app_config{$critical_config_key}) {
    push @errors, "Critical configuration '$critical_config_key' is missing or undefined.";
}

# --- Response ---
if (@errors) {
    # Return HTTP 500 Internal Server Error
    print $cgi->header(-status => 500, -type => 'text/plain');
    print "Health check failed:\n";
    print join("\n", @errors), "\n";
} else {
    # Return HTTP 200 OK
    print $cgi->header(-status => 200, -type => 'text/plain');
    print "Health check successful.\n";
}

exit 0;

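To expose this script through Nginx, keep in mind that Nginx cannot execute CGI scripts itself. For a CGI setup, make the script executable, place it in your CGI directory, and have Nginx pass requests to a CGI-capable backend (for example fcgiwrap over FastCGI, or Apache behind a proxy). For PSGI, the script must first be wrapped or rewritten as a PSGI application (for example with Plack::App::WrapCGI); you’d then run it under a PSGI server such as Starman or plackup and configure an Nginx location block to proxy to that server.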
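
Once the endpoint is reachable, an external probe only needs the HTTP status code. The following is a minimal sketch of such a probe in shell, assuming `curl` is installed; the function names and the URL in the example are placeholders, not part of the application above.

```shell
# Map an HTTP status code to a probe result.
# 200 is what the health check script returns on success; anything else
# (including curl's "000" for a failed connection) counts as unhealthy.
classify_status() {
    case "$1" in
        200) echo "healthy" ;;
        *)   echo "unhealthy" ;;
    esac
}

# Probe a health endpoint given its URL.
# --max-time keeps a hung backend from hanging the probe itself.
probe_health() {
    code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1") || code=000
    classify_status "$code"
}

# Example (hypothetical URL):
#   probe_health "http://127.0.0.1/cgi-bin/health.pl"
```

Keeping the classification separate from the network call makes the policy easy to change, for instance to treat redirects as healthy.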

Monitoring Perl Processes with `top` and `ps`

While application-level checks are crucial, system-level monitoring of Perl processes is equally important. Tools like `top` and `ps` are invaluable for identifying runaway processes or excessive resource consumption.

Identifying High CPU/Memory Perl Processes

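You can use `ps` with specific formatting to quickly find Perl processes consuming significant resources. This command sorts processes by CPU usage in descending order; the `[p]erl` pattern stops `grep` from matching its own command line. Note that the %CPU column in `ps` is an average over the process’s lifetime, so use `top` for a live view.

ps aux --sort=-%cpu | grep '[p]erl' | head -n 10

Similarly, to sort by memory usage:

ps aux --sort=-%mem | grep '[p]erl' | head -n 10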

These commands are excellent for interactive debugging. For automated monitoring, you’d integrate these into scripts that periodically check thresholds and trigger alerts.
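
A threshold check of this kind could be sketched as follows. The function name, the thresholds, and the ALERT line format are illustrative assumptions; a real setup would hand the output to whatever alerting channel you use.

```shell
# Print Perl processes whose CPU or memory share exceeds a threshold.
# Reads "PID %CPU %MEM COMMAND..." lines on stdin (the ps format shown below).
# Arguments: CPU threshold, memory threshold (both in percent).
# Matching on the command name ($4) avoids flagging unrelated processes
# whose arguments merely mention "perl".
flag_heavy_perl() {
    awk -v cpu_max="$1" -v mem_max="$2" '
        $4 ~ /perl/ && ($2 + 0 > cpu_max + 0 || $3 + 0 > mem_max + 0) {
            printf "ALERT pid=%s cpu=%s mem=%s\n", $1, $2, $3
        }'
}

# Example: flag anything above 80% CPU or 50% memory.
# The trailing "=" on each -o field suppresses the header line.
#   ps -eo pid=,pcpu=,pmem=,args= | flag_heavy_perl 80 50
```

Run from cron or a systemd timer, an empty output means nothing crossed a threshold.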

Elasticsearch Cluster Health: Beyond the Green Status

An Elasticsearch cluster reporting a “green” status is a good start, but it doesn’t guarantee optimal performance or resilience. Advanced monitoring involves looking at metrics like JVM heap usage, indexing/search latency, disk I/O, and shard allocation status.

Leveraging Elasticsearch’s Monitoring APIs

Elasticsearch provides a rich set of APIs to query its internal state. The `_cluster/health` API is fundamental, but we should also utilize `_nodes/stats` and `_cat` APIs.

Deep Dive with `_cluster/health`

The basic health API gives an overview:

GET /_cluster/health

Key fields to watch:

status: Should ideally be green. yellow means all primary shards are assigned but at least one replica is not (no data is missing, but redundancy is reduced, so losing another node risks data loss). red means at least one primary shard is unassigned, so some data is unavailable and searches may return partial results.
number_of_nodes: Ensure this matches your expected cluster size.
unassigned_shards: Should be 0.
initializing_shards, relocating_shards, number_of_pending_tasks: Sustained high numbers here can indicate a struggling cluster.
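
These checks can be turned into a small script. The sketch below uses Nagios-style exit codes and reads the same values from `_cat/health`, whose plain-text output is easier to handle in shell than the JSON from `_cluster/health`. The function name, the expected node count, and the choice to treat a missing node as critical are assumptions for illustration.

```shell
# Classify cluster health: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# Arguments: STATUS NODE_COUNT UNASSIGNED_SHARDS EXPECTED_NODES
evaluate_health() {
    status=$1; nodes=$2; unassigned=$3; expected=$4
    case "$status" in
        green|yellow|red) ;;
        *) echo "UNKNOWN: could not read cluster status"; return 3 ;;
    esac
    if [ "$status" = "red" ]; then
        echo "CRITICAL: status is red, $unassigned shard(s) unassigned"
        return 2
    fi
    if [ "$nodes" -lt "$expected" ]; then
        echo "CRITICAL: $nodes of $expected node(s) present"
        return 2
    fi
    if [ "$status" = "yellow" ] || [ "$unassigned" -gt 0 ]; then
        echo "WARNING: status is $status, $unassigned shard(s) unassigned"
        return 1
    fi
    echo "OK: status is green, $nodes node(s), no unassigned shards"
    return 0
}

# Example against a local node (assumed URL), expecting 3 nodes:
#   set -- $(curl -s 'http://localhost:9200/_cat/health?h=status,node.total,unassign')
#   evaluate_health "$1" "$2" "$3" 3
```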

Node Statistics (`_nodes/stats`)

This API provides detailed metrics for each node. We’re particularly interested in JVM and filesystem stats.

GET /_nodes/stats/jvm,fs

Critical metrics from this API:

jvm.mem.heap_used_percent: Aim to keep this below 75-80%. Sustained high usage (above 90%) can lead to frequent garbage collection pauses and instability.
fs.total.available_in_bytes: Monitor disk space. Running out of disk space is a common cause of cluster failure. The fs.data array reports the same figures per data path.
fs.total.total_in_bytes: Understand your total storage capacity.
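
For a quick command-line check of the same two pressures, the plain-text `_cat/nodes` API exposes heap and disk usage as percentages. The sketch below is one way to flag nodes over a threshold; the function name and the 80/85 thresholds are assumptions, and it expects node names without spaces.

```shell
# Flag nodes whose heap or disk usage crosses a threshold.
# Reads "NAME HEAP_PERCENT DISK_USED_PERCENT" lines on stdin.
# Arguments: heap threshold, disk-used threshold (both in percent).
flag_node_pressure() {
    awk -v heap_max="$1" -v disk_max="$2" '
        $2 + 0 > heap_max + 0 { printf "%s: heap at %s%%\n", $1, $2 }
        $3 + 0 > disk_max + 0 { printf "%s: disk at %s%% used\n", $1, $3 }'
}

# Example against a local node (assumed URL):
#   curl -s 'http://localhost:9200/_cat/nodes?h=name,heap.percent,disk.used_percent' \
#       | flag_node_pressure 80 85
```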

Shard Allocation and Status (`_cat` APIs)

The `_cat` APIs offer a human-readable, command-line-friendly view of cluster state.

GET /_cat/shards?v
GET /_cat/allocation?v

_cat/shards helps identify shards that are not on their expected node or are in an unusual state (e.g., UNASSIGNED). _cat/allocation shows disk usage per node and shard counts, useful for detecting imbalances.
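
As a sketch of how `_cat/shards` output can be summarized, the function below lists unassigned shards with the reason Elasticsearch reports and counts primaries and replicas separately, since an unassigned primary corresponds to a red cluster and an unassigned replica to a yellow one. The function name and output format are illustrative.

```shell
# Summarize unassigned shards.
# Reads "INDEX SHARD PRIREP STATE [REASON]" lines on stdin, i.e. the output of
# _cat/shards?h=index,shard,prirep,state,unassigned.reason
summarize_unassigned() {
    awk '
        $4 == "UNASSIGNED" {
            kind = ($3 == "p") ? "primary" : "replica"
            count[kind]++
            printf "%s shard %s (%s): %s\n", $1, $2, kind, ($5 == "" ? "unknown" : $5)
        }
        END {
            printf "unassigned primaries=%d replicas=%d\n", count["primary"], count["replica"]
        }'
}

# Example against a local node (assumed URL):
#   curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' \
#       | summarize_unassigned
```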

Setting Up Prometheus for Elasticsearch Monitoring

Prometheus is a de facto standard for metrics collection and alerting. The community-maintained elasticsearch_exporter (from the prometheus-community project, not an Elastic product) is a widely used way to expose Elasticsearch metrics in a Prometheus-compatible format.

Deploying the Elasticsearch Exporter

You can run the exporter as a standalone service. A common approach is to use Docker or deploy it directly on a node.

First, download the latest release from the project’s GitHub repository. For example, on a 64-bit Linux system (replace X.Y.Z with the release version):

wget https://github.com/prometheus-community/elasticsearch_exporter/releases/download/vX.Y.Z/elasticsearch_exporter-X.Y.Z.linux-amd64.tar.gz
tar -xzf elasticsearch_exporter-X.Y.Z.linux-amd64.tar.gz
cd elasticsearch_exporter-X.Y.Z.linux-amd64

Then, run the exporter, pointing it to your Elasticsearch cluster. You can configure it to scrape specific metrics.

./elasticsearch_exporter --es.uri="http://localhost:9200" --es.all --web.listen-address=":9114"

--es.uri: The address of your Elasticsearch node. If you have multiple nodes, point it to one, or use a load balancer. For security, use https:// and provide credentials if necessary.

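--es.all: Queries stats for all nodes in the cluster, rather than only the node the exporter connects to. Other flags enable additional collectors, for example --es.indices for per-index stats and --es.shards for shard-level stats.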

--web.listen-address: The address and port the exporter listens on for Prometheus scrapes (":9114" is the default).
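
Before wiring up Prometheus, it can help to confirm that the exporter is actually producing metrics. The sketch below reads the exporter’s output and prints the current cluster color. It assumes the exporter publishes `elasticsearch_cluster_health_status` with a `color` label and a value of 1 for the active color, so check your exporter’s /metrics output if nothing is printed. The function name is illustrative.

```shell
# Print the cluster color reported by the exporter.
# Reads Prometheus exposition text on stdin and looks for the
# elasticsearch_cluster_health_status series whose value is 1.
cluster_color() {
    awk '
        index($0, "elasticsearch_cluster_health_status{") == 1 && $NF == 1 {
            # Extract the value of the color="..." label
            if (match($0, /color="[a-z]+"/)) {
                print substr($0, RSTART + 7, RLENGTH - 8)
            }
        }'
}

# Example, using the listen address from the command above:
#   curl -s http://localhost:9114/metrics | cluster_color
```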

Configuring Prometheus to Scrape the Exporter

Edit your prometheus.yml configuration file to add a new scrape job:

scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['localhost:9114'] # Replace with the actual IP/hostname of your exporter
    metrics_path: /metrics

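After reloading the Prometheus configuration (for example by sending SIGHUP to the Prometheus process, or by POSTing to /-/reload if Prometheus was started with --web.enable-lifecycle), your Elasticsearch exporter target should appear in the Prometheus UI under “Targets”.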

Alerting on Elasticsearch Cluster Health

Alerting is crucial for proactive issue resolution. We’ll define alerting rules in a Prometheus rule file; Prometheus evaluates them and sends firing alerts to Alertmanager, which handles routing and notifications.

Example Alerting Rules

Create a rule file (e.g., es-alerts.yml) and include it in your Prometheus configuration.

groups:
- name: elasticsearch_alerts
  rules:
  - alert: ElasticsearchClusterRed
    # The exporter exposes one series per color; the active color has value 1
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster is RED"
      description: "At least one primary shard is unassigned in cluster {{ $labels.cluster }}; some data is unavailable."

  - alert: ElasticsearchClusterYellow
    expr: elasticsearch_cluster_health_status{color="yellow"} == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch cluster is YELLOW"
      description: "All primary shards are assigned, but at least one replica shard is unassigned in cluster {{ $labels.cluster }}."

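  - alert: HighElasticsearchHeapUsage
    # Metric names as exposed by elasticsearch_exporter; verify against your exporter's /metrics output
    expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} * 100 > 80
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High Elasticsearch JVM Heap Usage"
      description: "Elasticsearch node {{ $labels.name }} has heap usage above 80% (current: {{ printf \"%.1f\" $value }}%)"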

  - alert: LowDiskSpaceElasticsearch
    # Metric names as exposed by elasticsearch_exporter; verify against your exporter's /metrics output
    expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 15
    for: 30m
    labels:
      severity: critical
    annotations:
      summary: "Low Disk Space on Elasticsearch Node"
      description: "Elasticsearch node {{ $labels.name }} has less than 15% disk space available (current: {{ printf \"%.2f\" $value }}%)"

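  - alert: ElasticsearchExporterDown
    # up == 0 means Prometheus cannot scrape the exporter; the cluster itself may still be healthy
    expr: up{job="elasticsearch"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch exporter is down"
      description: "Prometheus cannot scrape the Elasticsearch exporter at {{ $labels.instance }}, so cluster metrics are not being collected."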

Ensure your Prometheus configuration lists this rule file under rule_files and that Prometheus is configured to send alerts to Alertmanager. You can validate the file with promtool check rules es-alerts.yml before reloading. This setup covers both application-level and cluster-level monitoring for your Perl applications and Elasticsearch clusters on DigitalOcean.