Server Monitoring Best Practices: Keeping Your Perl App and Redis Clusters Alive on Google Cloud
Proactive Redis Cluster Health Checks with Perl Scripts
Maintaining the health and availability of Redis clusters, especially in a distributed environment like Google Cloud, requires more than just relying on GCP’s built-in metrics. We need granular, application-aware checks. For our Perl-based applications that depend on Redis, a custom monitoring script offers the most flexibility. This script will perform essential checks: connectivity, latency, memory usage, and replication status.
We’ll deploy this script on a dedicated monitoring VM or as a cron job on one of the application servers. The script will leverage the Redis Perl module. Ensure it’s installed via CPAN: cpan Redis.
Perl Monitoring Script: `check_redis_cluster.pl`
#!/usr/bin/perl
use strict;
use warnings;
use Redis;
use Time::HiRes qw(time);

# --- Configuration ---
my @redis_hosts = (
    { host => 'redis-master-0.redis-headless.default.svc.cluster.local', port => 6379, name => 'redis-master-0' },
    { host => 'redis-replica-1.redis-headless.default.svc.cluster.local', port => 6379, name => 'redis-replica-1' },
    { host => 'redis-replica-2.redis-headless.default.svc.cluster.local', port => 6379, name => 'redis-replica-2' },
    # Add more nodes as needed for your cluster
);
my $max_latency_ms            = 50; # Maximum acceptable PING latency in milliseconds
my $max_memory_percent        = 85; # Maximum acceptable memory usage percentage
my $replication_lag_threshold = 60; # Maximum seconds since a replica last heard from its master
my $check_interval_seconds    = 60; # How often to run checks (for continuous monitoring)

# --- Logging ---
sub log_message {
    my ($level, $message) = @_;
    my $timestamp = localtime();
    print "[$timestamp] [$level] $message\n";
}

# --- Main Check Function ---
sub check_redis_node {
    my ($node) = @_;
    my ($host, $port, $name) = @{$node}{qw(host port name)};
    my @errors;
    my $latency_ms;

    my $ok = eval {
        # Redis->new dies when the connection fails, so it must run inside the eval.
        my $redis = Redis->new(
            server        => "$host:$port",
            cnx_timeout   => 5,
            read_timeout  => 5,
            write_timeout => 5,
        );

        # 1. Connectivity Check (times only the PING round trip)
        my $start_time = time();
        $redis->ping() or die "PING failed\n";
        $latency_ms = (time() - $start_time) * 1000;
        if ($latency_ms > $max_latency_ms) {
            push @errors, sprintf("High latency: %.2fms (>%dms)", $latency_ms, $max_latency_ms);
        }

        # 2. Memory Usage Check
        # The Redis module returns INFO as a hash reference; both fields are in bytes.
        my $memory_info = $redis->info('memory');
        my $used_memory = $memory_info->{used_memory};
        my $max_memory  = $memory_info->{maxmemory} // 0;
        if (!defined $used_memory) {
            log_message('WARN', "Could not retrieve memory usage for $name.");
        } elsif ($max_memory <= 0) {
            # maxmemory of 0 means "no limit", so a percentage cannot be computed.
            log_message('WARN', "maxmemory is not set for $name; skipping memory check.");
        } else {
            my $memory_percent = ($used_memory / $max_memory) * 100;
            if ($memory_percent > $max_memory_percent) {
                push @errors, sprintf("High memory usage: %.2f%% (>%d%%)", $memory_percent, $max_memory_percent);
            }
        }

        # 3. Replication Status Check (for replicas)
        my $replication_info = $redis->info('replication');
        if (($replication_info->{role} // '') eq 'slave') {
            my $lag = $replication_info->{master_last_io_seconds_ago};
            if (($replication_info->{master_link_status} // '') ne 'up') {
                push @errors, "Replication master link is down.";
            } elsif ($replication_info->{master_sync_in_progress}) {
                push @errors, "Replication sync in progress.";
            } elsif (!defined $lag) {
                log_message('WARN', "Could not retrieve master_last_io_seconds_ago for $name.");
            } elsif ($lag > $replication_lag_threshold) {
                push @errors, "High replication lag: ${lag}s (>${replication_lag_threshold}s)";
            }
        }
        1;
    };
    if (!$ok) {
        my $error = $@ || 'unknown error';
        chomp $error;
        log_message('ERROR', "Exception connecting to or checking $name ($host:$port): $error");
        push @errors, "Connection/Exception error: $error";
    }

    if (@errors) {
        log_message('ERROR', "Node $name ($host:$port) is NOT OK. Errors: " . join('; ', @errors));
        return 0;
    }
    log_message('INFO', sprintf("Node %s (%s:%d) is OK. Latency: %.2fms", $name, $host, $port, $latency_ms));
    return 1;
}

# --- Main Execution ---
# Returns 1 if every node is healthy, 0 otherwise, so it can be called in a loop.
sub main {
    my $all_nodes_ok = 1;
    foreach my $node (@redis_hosts) {
        $all_nodes_ok = 0 unless check_redis_node($node);
    }
    if ($all_nodes_ok) {
        log_message('INFO', "All Redis nodes are healthy.");
    } else {
        log_message('ERROR', "One or more Redis nodes are unhealthy.");
    }
    return $all_nodes_ok;
}

# --- Run the checks ---
# For continuous monitoring, uncomment the loop below and adjust $check_interval_seconds
# while (1) {
#     main();
#     sleep($check_interval_seconds);
# }
# For a single run (e.g., cron job): exit 0 on success, 1 on failure
exit(main() ? 0 : 1);
Explanation:
- Configuration: Defines the Redis nodes (using Kubernetes service discovery names for simplicity within GCP) and the thresholds for latency, memory, and replication lag.
- Logging: A basic logging function for clear output.
- `check_redis_node` function:
  - Establishes a connection to a Redis node with timeouts.
  - Performs a `PING` to check connectivity and measure latency.
  - Retrieves `INFO memory` to check `used_memory` against the `maxmemory` configuration. Both values are reported in bytes, so no unit conversion is needed, and the check is skipped when `maxmemory` is 0 (no limit).
  - For replica nodes, it checks `INFO replication` for `master_link_status` and uses `master_last_io_seconds_ago` as a seconds-based measure of how stale the replica is. The difference between the master's `master_repl_offset` and the replica's `slave_repl_offset` is a byte count, and computing it requires querying the master as well.
  - Handles exceptions during connection or command execution.
  - Returns 1 for success, 0 for failure, logging detailed errors.
- `main` function: Iterates through all configured Redis hosts, calls `check_redis_node` for each, and returns whether all are healthy. The script then exits with status 0 if all nodes are healthy, or 1 if any node fails.
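The threshold logic described above can also be kept in a pure function, which makes it testable without a live Redis server. The following is a minimal sketch under stated assumptions: `evaluate_info` and the `memory_percent` / `lag_seconds` limit keys are hypothetical names, the input is a hash of `INFO` fields that has already been fetched, and `master_last_io_seconds_ago` is used as the staleness signal.

```perl
use strict;
use warnings;

# Hypothetical helper: compare already-fetched INFO fields against thresholds
# and return a list of human-readable problems (empty list means healthy).
sub evaluate_info {
    my ($info, $limits) = @_;
    my @errors;

    # maxmemory of 0 means "no limit", so no percentage can be computed.
    my $max = $info->{maxmemory} // 0;
    if ($max > 0) {
        my $pct = 100 * ($info->{used_memory} // 0) / $max;
        push @errors, sprintf('memory %.1f%% > %d%%', $pct, $limits->{memory_percent})
            if $pct > $limits->{memory_percent};
    }

    # Replication fields only matter on replicas.
    if (($info->{role} // '') eq 'slave') {
        if (($info->{master_link_status} // '') ne 'up') {
            push @errors, 'master link down';
        }
        elsif (($info->{master_last_io_seconds_ago} // 0) > $limits->{lag_seconds}) {
            push @errors, 'replica stale';
        }
    }
    return @errors;
}

# Example with made-up numbers: 900 of 1000 bytes used on a healthy replica.
my @problems = evaluate_info(
    { used_memory => 900, maxmemory => 1000, role => 'slave',
      master_link_status => 'up', master_last_io_seconds_ago => 2 },
    { memory_percent => 85, lag_seconds => 60 },
);
print "$_\n" for @problems;
```

Separating evaluation from data collection means the network code only has to fetch the `INFO` hashes, while the rules can be exercised with fixture data.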
Deployment:
- Save the script as `check_redis_cluster.pl`.
- Make it executable: `chmod +x check_redis_cluster.pl`.
- Test it manually: `./check_redis_cluster.pl`.
- Schedule it using cron: Add a line like `*/1 * * * * /path/to/check_redis_cluster.pl >> /var/log/redis_check.log 2>&1` to your crontab.
- Integrate with GCP Monitoring: Configure a custom metric or alert based on the script’s exit code. A common approach is to define an alerting policy with `gcloud monitoring policies create`, or to run a custom agent that scrapes the script’s output.
Integrating with Google Cloud Monitoring
To make these checks actionable within GCP, we need to feed their results into Cloud Monitoring. The simplest method for script-based checks is to write custom metrics through the Cloud Monitoring API. The `gcloud` command-line tool manages alerting policies and related configuration; it does not write metric data points itself.
Option 1: Emitting JSON for Custom Metrics (Simpler for Cron Jobs)
Modify the Perl script to output metrics in a format that can be easily parsed, or use a wrapper script. For a direct approach, we can have the script write to a file that another process reads.
Let’s adapt the script to output JSON suitable for ingestion, one document per line, with log messages sent to stderr so they do not mix with the JSON.
#!/usr/bin/perl
use strict;
use warnings;
use Redis;
use Time::HiRes qw(time);
use JSON;

# --- Configuration (same as before) ---
my @redis_hosts = (
    { host => 'redis-master-0.redis-headless.default.svc.cluster.local', port => 6379, name => 'redis-master-0' },
    { host => 'redis-replica-1.redis-headless.default.svc.cluster.local', port => 6379, name => 'redis-replica-1' },
    { host => 'redis-replica-2.redis-headless.default.svc.cluster.local', port => 6379, name => 'redis-replica-2' },
);
my $max_latency_ms            = 50;
my $max_memory_percent        = 85;
my $replication_lag_threshold = 60; # Seconds since a replica last heard from its master

# --- Logging (redirected to stderr so stdout carries only JSON) ---
sub log_message {
    my ($level, $message) = @_;
    my $timestamp = localtime();
    print STDERR "[$timestamp] [$level] $message\n";
}

# --- Main Check Function (modified to return data) ---
sub check_redis_node {
    my ($node) = @_;
    my ($host, $port, $name) = @{$node}{qw(host port name)};
    my $result = {
        node_name         => $name,
        host              => $host,
        port              => $port,
        status            => 'UNKNOWN',
        errors            => [],
        latency_ms        => undef,
        memory_percent    => undef,
        replication_lag_s => undef,
    };

    my $ok = eval {
        # Redis->new dies when the connection fails, so it must run inside the eval.
        my $redis = Redis->new(
            server        => "$host:$port",
            cnx_timeout   => 5,
            read_timeout  => 5,
            write_timeout => 5,
        );

        # Connectivity and latency (times only the PING round trip)
        my $start_time = time();
        $redis->ping() or die "PING failed\n";
        my $latency_ms = (time() - $start_time) * 1000;
        $result->{latency_ms} = sprintf("%.2f", $latency_ms);
        if ($latency_ms > $max_latency_ms) {
            push @{$result->{errors}}, sprintf("High latency: %.2fms (>%dms)", $latency_ms, $max_latency_ms);
        }

        # Memory usage: INFO is returned as a hash reference, with both fields in bytes.
        my $memory_info = $redis->info('memory');
        my $used_memory = $memory_info->{used_memory};
        my $max_memory  = $memory_info->{maxmemory} // 0;
        if (!defined $used_memory) {
            log_message('WARN', "Could not retrieve memory usage for $name.");
        } elsif ($max_memory <= 0) {
            # maxmemory of 0 means "no limit", so a percentage cannot be computed.
            log_message('WARN', "maxmemory is not set for $name; skipping memory check.");
        } else {
            my $memory_percent = ($used_memory / $max_memory) * 100;
            $result->{memory_percent} = sprintf("%.2f", $memory_percent);
            if ($memory_percent > $max_memory_percent) {
                push @{$result->{errors}}, sprintf("High memory usage: %.2f%% (>%d%%)", $memory_percent, $max_memory_percent);
            }
        }

        # Replication status (replicas only)
        my $replication_info = $redis->info('replication');
        if (($replication_info->{role} // '') eq 'slave') {
            my $lag = $replication_info->{master_last_io_seconds_ago};
            if (($replication_info->{master_link_status} // '') ne 'up') {
                push @{$result->{errors}}, "Replication master link is down.";
            } elsif ($replication_info->{master_sync_in_progress}) {
                push @{$result->{errors}}, "Replication sync in progress.";
            } elsif (!defined $lag) {
                log_message('WARN', "Could not retrieve master_last_io_seconds_ago for $name.");
            } else {
                $result->{replication_lag_s} = $lag;
                if ($lag > $replication_lag_threshold) {
                    push @{$result->{errors}}, "High replication lag: ${lag}s (>${replication_lag_threshold}s)";
                }
            }
        }
        1;
    };
    if (!$ok) {
        my $error = $@ || 'unknown error';
        chomp $error;
        log_message('ERROR', "Exception connecting to or checking $name ($host:$port): $error");
        push @{$result->{errors}}, "Connection/Exception error: $error";
    }

    $result->{status} = @{$result->{errors}} ? 'ERROR' : 'OK';
    return $result;
}

# --- Main Execution ---
sub main {
    my @all_results;
    my $overall_status = 'OK';
    foreach my $node (@redis_hosts) {
        my $node_result = check_redis_node($node);
        push @all_results, $node_result;
        if ($node_result->{status} eq 'ERROR') {
            $overall_status = 'ERROR';
        }
    }
    # Output one JSON document per run, newline-terminated so log tailers see complete lines
    print encode_json({
        timestamp      => scalar(localtime()),
        overall_status => $overall_status,
        nodes          => \@all_results,
    }), "\n";
    exit($overall_status eq 'OK' ? 0 : 1);
}
main();
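Now, a cron job can execute this script and pipe the JSON output to a custom script that writes time series data through the Cloud Monitoring API (the `projects.timeSeries.create` method) or one of its client libraries. Note that `gcloud monitoring policies create` defines alerting policies; it does not write metric data.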
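As an illustration of that hand-off, here is a minimal sketch that turns the script's JSON report into a request body for the Cloud Monitoring `timeSeries.create` method. It only builds the payload; sending it (an authenticated POST to `https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries`) is left out. The function name `build_time_series`, the metric type prefix `custom.googleapis.com/redis/`, the `global` resource type, and the project ID `my-project-id` are assumptions chosen for the example, not part of the original script.

```perl
use strict;
use warnings;
use JSON::PP;
use POSIX qw(strftime);

# Hypothetical helper: map the check report to a timeSeries.create request body.
# $epoch is the sample time in seconds since the Unix epoch.
sub build_time_series {
    my ($project_id, $report, $epoch) = @_;
    my $end_time = strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($epoch));
    my @series;
    for my $node (@{ $report->{nodes} }) {
        for my $field (qw(latency_ms memory_percent replication_lag_s)) {
            next unless defined $node->{$field};    # skip checks that produced no value
            push @series, {
                metric => {
                    type   => "custom.googleapis.com/redis/$field",   # assumed naming scheme
                    labels => { node_name => $node->{node_name} },
                },
                resource => { type => 'global', labels => { project_id => $project_id } },
                points   => [{
                    interval => { endTime => $end_time },
                    value    => { doubleValue => $node->{$field} + 0 },   # force a JSON number
                }],
            };
        }
    }
    return { timeSeries => \@series };
}

# Example report shaped like the script's output (values are made up).
my $report = {
    nodes => [
        { node_name => 'redis-master-0', latency_ms => '1.23',
          memory_percent => '41.50', replication_lag_s => undef },
    ],
};
print JSON::PP->new->canonical->pretty->encode(
    build_time_series('my-project-id', $report, time())
);
```

Each numeric field becomes its own gauge-style series labelled by node, so a single run of the check yields up to three points per node.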
Option 2: Using the Ops Agent
For more robust and integrated monitoring, the Ops Agent is the recommended approach. It collects logs and metrics from applications and the system. In this setup the work is split as follows:
- A cron job executes the Perl script periodically and appends its JSON output to a log file.
- The Ops Agent tails that file and parses each JSON line.
- Log-based metrics turn the parsed fields (latency, memory usage, replication lag) into Cloud Monitoring metrics.
First, ensure the Ops Agent is installed and running on your monitoring VM or application server. Then, edit its config.yaml file (typically located at /etc/google-cloud-ops-agent/config.yaml).
logging:
  receivers:
    redis_check_log:
      type: files
      include_paths:
        - /var/log/redis_check_output.log # Log file for the Perl script output
  processors:
    parse_redis_json:
      type: parse_json
      field: message # Each line of the file is one JSON document printed by the Perl script
  service:
    pipelines:
      redis_pipeline:
        receivers: [redis_check_log]
        processors: [parse_redis_json]
# --- Metrics ---
# The agent's metrics section collects metrics directly from its receivers.
# For script-based metrics you would need a custom receiver, a custom metrics plugin,
# or an 'exec'-style receiver if one is supported.
# Example using a hypothetical 'exec' receiver (check Ops Agent documentation for current capabilities):
# metrics:
#   receivers:
#     redis_script_exec:
#       type: exec
#       command: '/usr/bin/perl /path/to/check_redis_cluster.pl'
#       interval: '60s'
#       timeout: '30s'
#       data_format: 'json' # Assuming the script outputs JSON
#   exporters:
#     google_cloud_monitoring:
#       # Define metrics to export here
#       metrics_from: redis_script_exec
# --- Alternative: Using log parsing for metrics (less direct but often feasible) ---
# The logging pipeline above captures the JSON written to /var/log/redis_check_output.log
# and parses it. The parsed fields can then be turned into Log-based Metrics in the
# GCP Console or through the API.
# For direct metric collection via the Ops Agent, consider Prometheus receivers if Redis
# exposes metrics that way, or write a custom receiver/plugin.
# --- Example of a Log-based Metric creation in GCP Console ---
# 1. Navigate to Cloud Logging -> Logs Explorer.
# 2. Filter logs from your instance/VM where the script runs, and specifically for the log file.
# 3. Use the JSON payload from the script (e.g., {"node_name": "redis-master-0", "status": "OK", "latency_ms": "1.23", ...})
# 4. Click "Create metric" from the log entry.
# 5. Configure the metric name (e.g., `redis_node_status`; user-defined log-based metrics appear under `logging.googleapis.com/user/`), type (counter/distribution), and filters.
# 6. For numerical values like latency, memory_percent, replication_lag_s, create distribution metrics.
# 7. For status (OK/ERROR), create a counter metric that counts log entries matching a filter (e.g., status is ERROR).
# --- Example of a simple script to run the Perl check and log to a file ---
# Create a file, e.g., /opt/scripts/run_redis_check.sh
# Make it executable: chmod +x /opt/scripts/run_redis_check.sh
# Add to cron (the script appends to the log file itself): */1 * * * * /opt/scripts/run_redis_check.sh
# Ensure the Perl script is at /usr/local/bin/check_redis_cluster.pl
#!/bin/bash
# /opt/scripts/run_redis_check.sh
PERL_SCRIPT="/usr/local/bin/check_redis_cluster.pl"
LOG_FILE="/var/log/redis_check_output.log"
# Separate file for stderr so plain-text log lines do not break JSON parsing
ERR_FILE="/var/log/redis_check_errors.log"
if [ -x "$PERL_SCRIPT" ]; then
    "$PERL_SCRIPT" >> "$LOG_FILE" 2>> "$ERR_FILE"
    EXIT_CODE=$?
    # You can add logic here to write specific metrics if needed,
    # but relying on log-based metrics is often simpler.
    # For example, to write a simple status metric:
    # if [ $EXIT_CODE -eq 0 ]; then
    #     echo "REDIS_STATUS: OK" >> /var/log/redis_status.log
    # else
    #     echo "REDIS_STATUS: ERROR" >> /var/log/redis_status.log
    # fi
    exit $EXIT_CODE
else
    echo "Error: Perl script not found or not executable at $PERL_SCRIPT" >&2
    exit 1
fi
After configuring the Ops Agent to capture the logs, you will create “Log-based Metrics” within the Google Cloud Console. This involves filtering logs (e.g., by resource type and log name) and defining how to extract numerical values (like latency, memory percentage, replication lag) or categorical values (like status) into Cloud Monitoring metrics.
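Extracting values for log-based metrics tends to be easier when each field sits at the top level of a log entry rather than inside a nested array. The sketch below, offered as one possible approach rather than a required step, flattens the script's summary document into one JSON line per node. The function name `node_log_lines` is hypothetical, and the input shape (`timestamp` plus a `nodes` array) follows the JSON output shown earlier.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: emit one newline-terminated JSON line per node so that
# fields such as latency_ms are top-level keys in each parsed log entry.
sub node_log_lines {
    my ($report) = @_;
    my $json = JSON::PP->new->canonical;    # canonical: stable key order
    my @lines;
    for my $node (@{ $report->{nodes} }) {
        push @lines, $json->encode({ %$node, timestamp => $report->{timestamp} }) . "\n";
    }
    return @lines;
}

# Example input with made-up values.
my $report = {
    timestamp => 'Mon Jan  1 00:00:00 2024',
    nodes     => [
        { node_name => 'redis-master-0',  status => 'OK',    latency_ms => '1.23' },
        { node_name => 'redis-replica-1', status => 'ERROR', latency_ms => '75.10' },
    ],
};
print for node_log_lines($report);
```

With this layout, a filter on the `status` field and a distribution metric on `latency_ms` can each be defined against a single top-level key.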
Perl Application Monitoring: Redis Client-Side Metrics
Beyond infrastructure-level checks, it’s crucial to monitor how your Perl application *perceives* Redis performance. This involves instrumenting your application code to track Redis operation latency and error rates from the client’s perspective.
Instrumenting Perl Application Code
We can use a simple wrapper around the Redis client object to add timing and error tracking. This wrapper can then expose metrics that can be scraped by Prometheus or sent directly to Cloud Monitoring via the OpenTelemetry collector or custom agents.
package MyApp::RedisClientWrapper;
use strict;
use warnings;
use Redis;
use Time::HiRes qw(time);
use Scalar::Util qw(blessed);

# --- Configuration ---
my $redis_host = 'redis-master-0.redis-headless.default.svc.cluster.local';
my $redis_port = 6379;
my $redis_db   = 0;

# --- Metrics (Placeholder - replace with your actual metric reporting mechanism) ---
sub report_metric {
    my ($metric_name, $value, $labels) = @_;
    # In a real application, this would send data to Prometheus, Cloud Monitoring, etc.
    # Example: Send to a local Prometheus exporter endpoint or Cloud Monitoring API
    my $label_str = join ',', map { "$_=$labels->{$_}" } sort keys %$labels;
    # printf avoids "$metric_name{...}" being parsed as a lookup in %metric_name.
    printf STDERR "[METRIC] %s{%s}=%s\n", $metric_name, $label_str, $value;
}

# --- Constructor ---
sub new {
    my ($class, $config) = @_;
    $config ||= {};    # Allow new() with no arguments
    my $self = {};
    bless $self, $class;
    $self->{host} = $config->{host} || $redis_host;
    $self->{port} = $config->{port} || $redis_port;
    $self->{db}   = $config->{db}   || $redis_db;
    $self->{redis_client}      = undef;
    $self->{connection_errors} = 0;
    $self->{operation_errors}  = 0;
    $self->{operations_total}  = 0;
    $self->connect();
    return $self;
}

# --- Connection Management ---
sub connect {
    my ($self) = @_;
    my %labels = (host => $self->{host}, port => $self->{port});
    my $ok = eval {
        # Redis->new dies when the connection fails, so it must run inside the eval.
        my $redis = Redis->new(
            server        => "$self->{host}:$self->{port}",
            cnx_timeout   => 5,
            read_timeout  => 5,
            write_timeout => 5,
        );
        # The database is chosen with SELECT after connecting.
        $redis->select($self->{db}) if $self->{db};
        $redis->ping() or die "PING failed\n";
        $self->{redis_client} = $redis;
        1;
    };
    if ($ok) {
        report_metric('redis_connection_status', 1, \%labels);
        return;
    }
    my $error = $@ || 'unknown error';
    $self->{redis_client} = undef;
    # Counters are cumulative, so they are not reset on a later successful connection.
    $self->{connection_errors}++;
    report_metric('redis_connection_status', 0, \%labels);
    report_metric('redis_connection_errors_total', $self->{connection_errors}, \%labels);
    print STDERR "Redis connection error: $error\n";
}

sub _get_client {
    my ($self) = @_;
    my $client = $self->{redis_client};
    # The liveness PING costs one extra round trip per command.
    unless (blessed($client) && eval { $client->ping() }) {
        # Attempt to reconnect if client is gone or ping fails
        $self->connect();
    }
    return $self->{redis_client};
}

# --- Wrapped Redis Methods ---
sub get {
    my ($self, $key) = @_;
    return $self->execute_command('GET', $key);
}

sub set {
    my ($self, $key, $value, @options) = @_;
    # Options are passed through as a list, e.g. ->set('k', 'v', 'EX', 60)
    return $self->execute_command('SET', $key, $value, @options);
}

sub hgetall {
    my ($self, $key) = @_;
    return $self->execute_command('HGETALL', $key);
}
# Add wrappers for other commonly used Redis commands (SETEX, LPUSH, RPUSH, etc.)

# --- Generic Command Execution ---
sub execute_command {
    my ($self, $command, @args) = @_;
    $self->{operations_total}++;
    my %labels = (command => $command, host => $self->{host});
    my $client = $self->_get_client();
    unless ($client) {
        $self->{operation_errors}++;
        report_metric('redis_operation_errors_total', 1, \%labels);
        return undef; # Cannot execute command
    }
    my $method     = lc $command;    # The Redis module exposes commands as lowercase methods
    my $start_time = time();
    my $result;
    my $ok = eval {
        # Dynamically call the Redis method
        $result = $client->$method(@args);
        1;
    };
    my $error   = $@;
    my $elapsed = time() - $start_time;
    if (!$ok) {
        $self->{operation_errors}++;
        report_metric('redis_operation_errors_total', 1, \%labels);
        print STDERR "Redis command '$command' error: $error\n";
        # Invalidate the client on error to force a reconnect on the next call
        $self->{redis_client} = undef;
        report_metric('redis_connection_status', 0, { host => $self->{host}, port => $self->{port} });
        return undef;
    }
    report_metric('redis_operation_latency_seconds', $elapsed, \%labels);
    return $result;
}

# --- Metric Reporting ---
sub report_application_metrics {
    my ($self) = @_;
    my %labels = (host => $self->{host}, port => $self->{port});
    report_metric('redis_connection_errors_total', $self->{connection_errors}, \%labels);
    report_metric('redis_operation_errors_total', $self->{operation_errors}, \%labels);
    report_metric('redis_operations_total', $self->{operations_total}, \%labels);
}

# --- Example Usage ---
# In your application's main logic or initialization:
# my $redis_wrapper = MyApp::RedisClientWrapper->new({ host => 'your_redis_host', port => 6379 });
#
# # In request handlers or background jobs:
# my $value = $redis_wrapper->get('my_key');
# if (defined $value) {
#     # Process value
# } else {
#     # Key is missing, or a Redis error occurred (already logged/counted)
# }
#
# # Periodically (e.g., via a background task or timer):
# $redis_wrapper->report_application_metrics();
1;
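If the metrics are to be scraped by Prometheus, the placeholder reporter eventually has to produce lines in the Prometheus text exposition format. The following sketch shows only that formatting step, under the assumption that label values are plain strings; `format_prometheus` is a hypothetical name, and serving the resulting text over HTTP is out of scope here.

```perl
use strict;
use warnings;

# Hypothetical helper: render one sample as a Prometheus text-format line,
# i.e. name{label="value",...} value
sub format_prometheus {
    my ($name, $value, $labels) = @_;
    my $label_str = join ',', map {
        my $v = $labels->{$_};
        # Escape backslash, double quote, and newline inside label values.
        $v =~ s/\\/\\\\/g;
        $v =~ s/"/\\"/g;
        $v =~ s/\n/\\n/g;
        qq{$_="$v"};
    } sort keys %$labels;
    return length($label_str)
        ? sprintf("%s{%s} %s\n", $name, $label_str, $value)
        : sprintf("%s %s\n", $name, $value);
}

print format_prometheus(
    'redis_operation_errors_total', 3,
    { command => 'GET', host => 'redis-master-0' },
);
```

Sorting the label keys keeps the output stable between runs, which makes diffs and tests straightforward.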
Integration:
- Replace the placeholder `report_metric` function with actual calls to your metrics backend (e.g., Prometheus client library, Cloud Monitoring API client, OpenTelemetry SDK).
- Instantiate the wrapper when your application starts: `my $redis = MyApp::RedisClientWrapper->new();`.
- Use the wrapper object instead of the raw `Redis` object for all Redis interactions.
- Period