Server Monitoring Best Practices: Keeping Your Perl App and Redis Clusters Alive on DigitalOcean
Proactive Redis Cluster Health Checks with `redis-cli` and Custom Scripts
Maintaining the health and availability of Redis clusters, especially in a distributed setup on DigitalOcean, requires more than just basic CPU and memory monitoring. For Perl applications relying on Redis for caching or session management, even brief cluster unavailability can cascade into application errors. We’ll focus on implementing granular health checks that go beyond simple ping responses, ensuring data integrity and node responsiveness.
The primary tool for interacting with Redis is `redis-cli`. While it offers basic commands like PING, we need to leverage its capabilities for more advanced diagnostics. For a Redis cluster, we’ll check the status of each node and the cluster’s overall health.
Cluster State Verification
The CLUSTER INFO command reports the cluster’s state as seen by the node you query. Key fields to monitor include cluster_state (should be ok), cluster_slots_assigned, cluster_slots_ok, cluster_slots_pfail, and cluster_slots_fail. A discrepancy between assigned and OK slots, or a non-zero count of slots in the pfail (possible failure, flagged by a single node) or fail (failure confirmed by a majority of masters) state, indicates an issue that needs immediate attention.
We can script this check with `redis-cli` and standard text tools. CLUSTER INFO returns plain field:value lines terminated by CRLF, not JSON, so `tr` and `awk` are enough to parse it and `jq` is not required. This script can be run periodically via cron or a dedicated monitoring agent.
Here’s a Bash script to check the cluster state:
#!/bin/bash
REDIS_HOST="your_redis_master_node_ip" # Replace with the IP of any node in your cluster
REDIS_PORT="7000" # Client port of that node (7000 is a common convention for cluster setups; the stock Redis default is 6379)
# Fetch cluster info as field:value lines
CLUSTER_INFO=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" cluster info 2>&1)
# Check if redis-cli command failed
if [[ $? -ne 0 ]]; then
echo "ERROR: Failed to connect to Redis cluster at $REDIS_HOST:$REDIS_PORT. Output: $CLUSTER_INFO"
exit 1
fi
# Strip carriage returns, then extract individual fields by name
CLUSTER_INFO=$(tr -d '\r' <<< "$CLUSTER_INFO")
get_field() { awk -F: -v key="$1" '$1 == key {print $2}' <<< "$CLUSTER_INFO"; }
CLUSTER_STATE=$(get_field cluster_state)
SLOTS_ASSIGNED=$(get_field cluster_slots_assigned)
SLOTS_OK=$(get_field cluster_slots_ok)
SLOTS_PFAIL=$(get_field cluster_slots_pfail)
SLOTS_FAIL=$(get_field cluster_slots_fail)
# Define alert thresholds
EXPECTED_SLOTS=16384 # Total slots in a Redis cluster
# Check cluster state
if [[ "$CLUSTER_STATE" != "ok" ]]; then
echo "ALERT: Redis cluster state is '$CLUSTER_STATE'. Expected 'ok'."
exit 1
fi
# Check slot status
if [[ "$SLOTS_ASSIGNED" -ne "$EXPECTED_SLOTS" ]]; then
echo "ALERT: Redis cluster has $SLOTS_ASSIGNED assigned slots, expected $EXPECTED_SLOTS."
exit 1
fi
if [[ "$SLOTS_OK" -ne "$EXPECTED_SLOTS" ]]; then
echo "ALERT: Redis cluster has $SLOTS_OK OK slots, expected $EXPECTED_SLOTS. PFAIL: $SLOTS_PFAIL, FAIL: $SLOTS_FAIL."
exit 1
fi
echo "SUCCESS: Redis cluster is healthy. State: $CLUSTER_STATE, Slots OK: $SLOTS_OK/$EXPECTED_SLOTS."
exit 0
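This script can be scheduled via cron to run every minute. Schedulers that understand Nagios-style checks treat a non-zero exit status as a failure directly. Plain cron does not act on exit codes, so the result has to be forwarded to your monitoring system (e.g., as a metric that Prometheus scrapes and Alertmanager alerts on).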
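One way to hand a check result to Prometheus is the Node Exporter textfile collector, which exposes metrics found in `*.prom` files under the directory given by `--collector.textfile.directory`. The sketch below is a minimal wrapper under that assumption; the function name, metric name, and paths are illustrative and not part of the original setup.

```shell
#!/bin/bash
# Hypothetical wrapper: run a check command and record its result as a
# gauge in a file that the Node Exporter textfile collector can read.

# write_check_metric <metric_name> <output_file> <command> [args...]
write_check_metric() {
  local metric="$1" out="$2"
  shift 2
  local ok=1
  # Run the check; any non-zero exit status counts as a failure.
  "$@" >/dev/null 2>&1 || ok=0
  # Write to a temporary file first and rename it, so a scrape never
  # sees a half-written file.
  local tmp="${out}.$$"
  {
    echo "# HELP ${metric} 1 if the last check run succeeded, 0 otherwise."
    echo "# TYPE ${metric} gauge"
    echo "${metric} ${ok}"
  } > "$tmp" && mv "$tmp" "$out"
}

# Example cron usage (both paths are assumptions):
# write_check_metric redis_cluster_check_ok \
#   /var/lib/node_exporter/textfile/redis_cluster.prom \
#   /usr/local/bin/check_redis_cluster.sh
```

An alert rule on the gauge being 0 then turns a failed cron run into an Alertmanager notification.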
Individual Node Health and Replication Status
Beyond the cluster-wide view, it’s crucial to monitor individual nodes. The CLUSTER NODES command lists all nodes in the cluster, their status, and their role (master/slave). We need to ensure all nodes are connected and that replicas are in sync with their masters.
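As a small illustration of reading CLUSTER NODES output, the sketch below lists masters that no replica points at, which is a common gap after a failover. It assumes the usual line layout (node ID, address, flags, master ID, ...) and works on text piped to it, so it can be tried on captured output; the function name is hypothetical.

```shell
#!/bin/bash
# masters_without_replicas: reads CLUSTER NODES output on stdin and prints
# the ip:port of every master that no replica references as its master.
masters_without_replicas() {
  awk '
    # Field 3 holds comma-separated flags; field 4 is a replica master ID.
    $3 ~ /(^|,)master(,|$)/ { addr[$1] = $2 }
    $3 ~ /(^|,)slave(,|$)/  { has_replica[$4] = 1 }
    END {
      for (id in addr) {
        if (!(id in has_replica)) {
          a = addr[id]
          sub(/@.*/, "", a)   # drop the @cport[,hostname] suffix
          print a
        }
      }
    }
  '
}

# Example against a live node (host and port are placeholders):
# redis-cli -h 10.132.0.1 -p 7000 cluster nodes | masters_without_replicas
```

Masters that hold no slots are reported as well, so an empty result is the healthy case only when every master is expected to have a replica.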
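A key indicator of replication lag is the gap between the master’s master_repl_offset and the replica’s slave_repl_offset, both reported by INFO replication. For a healthy replica, slave_repl_offset should be very close to, or equal to, its master’s master_repl_offset. We can also check the link state (connected or disconnected) that CLUSTER NODES reports for each node.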
Here’s a more advanced script that iterates through each node reported by CLUSTER NODES, checks its status, and measures the replication lag of every replica against its master:
#!/bin/bash
REDIS_HOST="your_redis_master_node_ip" # Replace with the IP of any node in your cluster
REDIS_PORT="7000" # Client port of that node
MAX_REPLICATION_LAG=5000 # Maximum allowed replication lag in bytes (adjust as needed)
# Fetch cluster nodes information
NODES_INFO=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" cluster nodes 2>&1)
if [[ $? -ne 0 ]]; then
echo "ERROR: Failed to connect to Redis cluster at $REDIS_HOST:$REDIS_PORT to get nodes info. Output: $NODES_INFO"
exit 1
fi
# Read one field from INFO replication on a node: repl_field <ip> <port> <field>
# stdin is redirected so redis-cli cannot consume the loop input below
repl_field() {
redis-cli -h "$1" -p "$2" info replication 2>/dev/null < /dev/null | tr -d '\r' | awk -F: -v key="$3" '$1 == key {print $2}'
}
STATUS=0
# Line format: <id> <ip:port@cport[,hostname]> <flags> <master-id> <ping-sent> <pong-recv> <config-epoch> <link-state> [<slot> ...]
# The loop reads from a here-string rather than a pipe, so STATUS is updated in the current shell
while read -r NODE_ID NODE_ADDR NODE_FLAGS MASTER_ID PING_SENT PONG_RECV CONFIG_EPOCH LINK_STATE SLOTS; do
[[ -z "$NODE_ID" ]] && continue
NODE_IP_PORT=${NODE_ADDR%%@*} # Drop the @cport[,hostname] suffix
NODE_IP=${NODE_IP_PORT%:*}
NODE_PORT=${NODE_IP_PORT##*:}
# Check the link state and the failure flags (fail? is a possible failure, fail a confirmed one)
if [[ "$LINK_STATE" != "connected" || ",$NODE_FLAGS," == *",fail,"* || ",$NODE_FLAGS," == *",fail?,"* ]]; then
echo "ALERT: Node $NODE_IP:$NODE_PORT ($NODE_ID) is unhealthy (flags: $NODE_FLAGS, link state: $LINK_STATE)."
STATUS=1
continue
fi
# Only replicas need a lag check
[[ ",$NODE_FLAGS," == *",slave,"* ]] || continue
# Resolve the master's address from its node ID
MASTER_ADDR=$(awk -v id="$MASTER_ID" '$1 == id {print $2}' <<< "$NODES_INFO")
MASTER_IP_PORT=${MASTER_ADDR%%@*}
MASTER_IP=${MASTER_IP_PORT%:*}
MASTER_PORT=${MASTER_IP_PORT##*:}
MASTER_REPL_OFFSET=$(repl_field "$MASTER_IP" "$MASTER_PORT" master_repl_offset)
SLAVE_REPL_OFFSET=$(repl_field "$NODE_IP" "$NODE_PORT" slave_repl_offset)
if [[ ! "$MASTER_REPL_OFFSET" =~ ^[0-9]+$ || ! "$SLAVE_REPL_OFFSET" =~ ^[0-9]+$ ]]; then
echo "WARNING: Could not get replication offsets for slave $NODE_IP:$NODE_PORT (master $MASTER_IP:$MASTER_PORT)."
continue
fi
REPLICATION_LAG=$((MASTER_REPL_OFFSET - SLAVE_REPL_OFFSET))
if [[ "$REPLICATION_LAG" -gt "$MAX_REPLICATION_LAG" ]]; then
echo "ALERT: Replication lag for slave $NODE_IP:$NODE_PORT (connected to master $MASTER_IP:$MASTER_PORT) is $REPLICATION_LAG bytes, exceeding threshold of $MAX_REPLICATION_LAG."
STATUS=1
else
echo "INFO: Replication lag for slave $NODE_IP:$NODE_PORT is $REPLICATION_LAG bytes (OK)."
fi
done <<< "$NODES_INFO"
if [[ "$STATUS" -ne 0 ]]; then
echo "FAILURE: One or more Redis cluster nodes are unhealthy or lagging."
exit 1
fi
echo "SUCCESS: All Redis cluster nodes are connected and replication lag is within limits."
exit 0
Note: CLUSTER NODES does not expose replication offsets, and Redis has no CLUSTER REPL_OFFSET command. The script therefore reads master_repl_offset from each master and slave_repl_offset from each replica with INFO replication and takes the difference. If the cluster requires a password, export it as REDISCLI_AUTH so that `redis-cli` can authenticate without a command-line flag.
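A master’s INFO replication output also carries one slaveN line per attached replica, including the offset that replica last acknowledged, so the lag can be estimated from a single connection to the master. The sketch below parses such output from stdin; the function name is hypothetical and the sample values in the usage comment are invented for illustration.

```shell
#!/bin/bash
# replica_lag_from_master_info: reads a master's INFO replication output on
# stdin and prints "<ip>:<port> <state> <lag_bytes>" for each replica.
replica_lag_from_master_info() {
  tr -d '\r' | awk '
    /^master_repl_offset:/ { master = substr($0, index($0, ":") + 1) }
    /^slave[0-9]+:/        { replicas[++n] = substr($0, index($0, ":") + 1) }
    END {
      for (i = 1; i <= n; i++) {
        ip = ""; port = ""; state = ""; offset = 0
        count = split(replicas[i], kv, ",")
        for (j = 1; j <= count; j++) {
          split(kv[j], pair, "=")
          if (pair[1] == "ip")     ip = pair[2]
          if (pair[1] == "port")   port = pair[2]
          if (pair[1] == "state")  state = pair[2]
          if (pair[1] == "offset") offset = pair[2]
        }
        # Lag in bytes: master offset minus the replica acknowledged offset.
        print ip ":" port, state, master - offset
      }
    }
  '
}

# Example against a live master (host and port are placeholders):
# redis-cli -h 10.132.0.1 -p 7000 info replication | replica_lag_from_master_info
```

The `lag` value inside each slaveN line is a time in seconds since the replica’s last acknowledgement, so it is left out here in favour of the byte difference.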
Perl Application Health Checks: Beyond Basic Connectivity
For Perl applications, simply checking if the Redis client library can connect is insufficient. We need to ensure that the application can perform essential operations and that the data it expects is present and valid. This involves simulating typical application workflows against Redis.
Simulating Key Application Operations
A common pattern is using Redis for session management or caching. A health check script in Perl should attempt to:
- Establish a connection to the Redis cluster (using the appropriate client library, e.g., `Cache::Redis` or `Redis`).
- Set a test key with a specific value and an expiration time.
- Retrieve the test key and verify its value.
- Delete the test key.
- If using Redis Cluster, ensure the client library correctly handles key distribution and redirection.
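The same steps can be sketched from the shell with `redis-cli -c`, which follows cluster redirections, as a way to validate the cluster path independently of the Perl client. This is a minimal sketch: the `REDIS_CLI` override variable and the function name are assumptions introduced here for testability, not features of `redis-cli`.

```shell
#!/bin/bash
# REDIS_CLI may be overridden (for example with a stub during testing).
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# redis_round_trip <host> <port>: SET with expiry, GET, compare, DEL.
redis_round_trip() {
  local host="$1" port="$2"
  local key="app_health_check_key:$$"
  local value="health_check_value_$(date +%s)"
  local reply

  # -c makes redis-cli follow MOVED/ASK redirections to the owning node.
  reply=$("$REDIS_CLI" -c -h "$host" -p "$port" set "$key" "$value" ex 60) || return 1
  [[ "$reply" == "OK" ]] || { echo "SET failed: $reply"; return 1; }

  reply=$("$REDIS_CLI" -c -h "$host" -p "$port" get "$key") || return 1
  [[ "$reply" == "$value" ]] || { echo "GET mismatch: $reply"; return 1; }

  reply=$("$REDIS_CLI" -c -h "$host" -p "$port" del "$key") || return 1
  [[ "$reply" == "1" ]] || echo "WARN: DEL returned $reply"

  echo "OK"
}

# Example (address is a placeholder):
# redis_round_trip 10.132.0.1 7000
```

Because the key name includes the shell’s process ID, concurrent runs from different processes on the same server do not overwrite each other’s test keys.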
Here’s a sample Perl script that performs these checks. This script should be deployed on your application servers and run periodically.
#!/usr/bin/perl
use strict;
use warnings;
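# A cluster-aware client is required. This example relies on the placeholder Redis::Cluster
# package defined at the end of this file. To use a real CPAN client instead, load it here
# with "use" and delete the placeholder.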
use Time::HiRes qw(time);
# --- Configuration ---
my $redis_nodes = "10.132.0.1:7000,10.132.0.2:7000,10.132.0.3:7000"; # Comma-separated list of cluster nodes
my $test_key = "app_health_check_key";
my $test_value = "health_check_value_" . time();
my $expiration_seconds = 60; # Key will expire after 60 seconds
# --- Health Check Logic ---
my $redis = undef;
my $start_time = time();
my $timeout_seconds = 10; # Overall timeout for the health check
eval {
local $SIG{ALRM} = sub { die "Health check timed out after $timeout_seconds seconds\n" };
alarm $timeout_seconds;
# Initialize Redis Cluster client
# Adjust options based on your Redis::Cluster version and needs
$redis = Redis::Cluster->new(
servers => [ split /,/, $redis_nodes ],
# Add any other necessary options like 'timeout', 'reconnect_attempts', etc.
# For example:
# timeout => 2,
# reconnect_attempts => 2,
);
# Ping the cluster to ensure connectivity
my $ping_response = $redis->ping();
if ($ping_response ne 'PONG') {
die "Redis PING failed. Expected PONG, got: $ping_response\n";
}
# Set a test key with expiration
my $set_success = $redis->setex($test_key, $expiration_seconds, $test_value);
if (!$set_success) {
die "Failed to set test key '$test_key' with expiration.\n";
}
# Retrieve the test key
my $retrieved_value = $redis->get($test_key);
if (!defined $retrieved_value) {
die "Failed to retrieve test key '$test_key'. It might have expired prematurely or not been set correctly.\n";
}
# Verify the value
if ($retrieved_value ne $test_value) {
die "Retrieved value for '$test_key' mismatch. Expected '$test_value', got '$retrieved_value'.\n";
}
# Clean up the test key
my $del_count = $redis->del($test_key);
if ($del_count != 1) {
# This might not be a critical failure if the key expired naturally,
# but it's good to log.
warn "Warning: Failed to delete test key '$test_key' or it was already gone (del_count: $del_count).\n";
}
alarm 0; # Disable alarm on success
};
alarm 0; # Cancel any pending alarm if the block died early
my $end_time = time();
my $duration = $end_time - $start_time;
if ($@) {
# An error occurred
print "HEALTH_CHECK_FAILED: $@";
exit 1;
} else {
print "HEALTH_CHECK_OK: Redis operations successful. Duration: ${duration}s\n";
exit 0;
}
# --- Placeholder Redis::Cluster package (illustration only) ---
# This stand-in keeps values in an in-memory hash and never contacts Redis, so it only
# exercises the control flow of the check above. Delete it and load a real cluster-aware
# CPAN client before relying on the result.
package Redis::Cluster;
use strict;
use warnings;
sub new {
my ($class, %args) = @_;
# Remember the node list and create an empty in-memory store instead of connecting
return bless { servers => $args{servers} || [], store => {} }, $class;
}
sub _get_node {
my ($self, $key) = @_;
# A real client hashes the key (CRC16 mod 16384) to a slot and looks up the owning node.
# The placeholder always picks the first configured node.
return $self->{servers}[0]; # This is a simplification!
}
sub setex {
my ($self, $key, $seconds, $value) = @_;
my $node = $self->_get_node($key);
return $self->do_command('SETEX', $key, $seconds, $value, { node => $node });
}
sub get {
my ($self, $key) = @_;
my $node = $self->_get_node($key);
return $self->do_command('GET', $key, { node => $node });
}
sub del {
my ($self, $key) = @_;
my $node = $self->_get_node($key);
return $self->do_command('DEL', $key, { node => $node });
}
# Placeholder dispatcher. A real library would send the command to the chosen node here.
sub do_command {
my ($self, $command, @args) = @_;
# The trailing hashref carries routing options and is not a command argument
my $opts = ref($args[-1]) eq 'HASH' ? pop @args : {};
my $key = $args[0];
my $node = defined $opts->{node} ? $opts->{node} : 'unknown';
print STDERR "Simulating command '$command' for key '$key' on node $node\n";
if ($command eq 'SETEX') {
$self->{store}{$key} = $args[2]; # Arguments are: key, seconds, value
return 'OK';
}
return $self->{store}{$key} if $command eq 'GET';
if ($command eq 'DEL') {
return defined(delete $self->{store}{$key}) ? 1 : 0;
}
return undef; # Unknown command
}
# Placeholder ping. A real client would send PING to a node.
sub ping {
my $self = shift;
return "PONG";
}
# --- End placeholder package ---
Important Considerations for the Perl Script:
- Error Handling: The `eval` block with `alarm` provides a basic timeout mechanism. More sophisticated error handling should be implemented to distinguish between connection errors, command execution errors, and data integrity issues.
- Client Library: Ensure you are using a Redis client library for Perl that properly supports Redis Cluster, including automatic slot redirection and node discovery. The example above uses a placeholder `Redis::Cluster` for illustration; you’ll likely use a well-maintained CPAN module.
- Configuration: Externalize the Redis node list and other parameters into a configuration file or environment variables for easier management.
- Deployment: This script should be run by a process manager (like `systemd`) or a cron job on each application server. Its output should be directed to a log file and potentially parsed by a centralized monitoring system.
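If the node list moves into an environment variable, a wrapper can validate it before the health check runs. The sketch below assumes a hypothetical `REDIS_NODES` variable holding the same comma-separated host:port format as the script’s `$redis_nodes`; the function name is also illustrative, and IPv6 literals are not handled.

```shell
#!/bin/bash
# validate_nodes <comma-separated host:port list>
# Prints one host:port per line; returns 1 on the first malformed entry.
validate_nodes() {
  local list="$1" entry
  local -a entries
  [[ -n "$list" ]] || { echo "node list is empty" >&2; return 1; }
  IFS=',' read -r -a entries <<< "$list"
  for entry in "${entries[@]}"; do
    if [[ "$entry" =~ ^[A-Za-z0-9._-]+:[0-9]+$ ]]; then
      echo "$entry"
    else
      echo "malformed node entry: '$entry'" >&2
      return 1
    fi
  done
}

# Example wrapper usage (variable name and script path are assumptions):
# validate_nodes "${REDIS_NODES:-}" >/dev/null || exit 1
# /usr/local/bin/redis_health_check.pl >> /var/log/redis_health_check.log 2>&1
```

On the Perl side, the script would then read the same value from `$ENV{REDIS_NODES}` instead of a hard-coded string.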
DigitalOcean Specifics: Droplet Monitoring and Networking
Beyond the application and Redis layers, we must consider the underlying infrastructure provided by DigitalOcean. This includes monitoring Droplet resource utilization and network connectivity.
Droplet Resource Monitoring
DigitalOcean provides basic Droplet metrics through its control panel (CPU, Disk I/O, Bandwidth). For more granular and historical data, consider integrating a dedicated monitoring solution:
- Node Exporter (Prometheus): Deploy the Node Exporter on each Droplet running Redis or your application. This exposes detailed system metrics (CPU, memory, disk, network, process stats) that Prometheus can scrape.
- DigitalOcean Monitoring Agent: DigitalOcean offers its own monitoring agent that can be installed on Droplets to collect more detailed metrics and send them to the DigitalOcean control panel.
- Third-Party Solutions: Services like Datadog, New Relic, or SignalFx offer agents that provide comprehensive infrastructure and application performance monitoring.
Key metrics to watch for Redis Droplets:
- CPU Usage: High CPU can indicate heavy load, inefficient queries, or potential issues with the Perl application’s interaction with Redis.
- Memory Usage: Redis is in-memory, so monitoring memory is critical. Watch for usage approaching the configured `maxmemory` limit, at which point Redis either evicts keys according to `maxmemory-policy` or rejects writes with out-of-memory errors.
- Network Traffic: Spikes in inbound/outbound traffic can indicate increased application load or potential DDoS attacks.
- Disk I/O: While Redis is primarily in-memory, persistence (RDB snapshots, AOF) involves disk I/O. High I/O wait times can impact performance.
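For the memory metric in particular, Redis reports both `used_memory` and `maxmemory` in INFO memory, so the headroom can be computed without an agent. The sketch below parses that output from stdin; the function name is hypothetical, and a `maxmemory` of 0 is treated as "no limit configured".

```shell
#!/bin/bash
# memory_usage_percent: reads INFO memory output on stdin and prints
# used_memory as a whole-number percentage of maxmemory, or "unlimited"
# when maxmemory is 0 (no limit configured).
memory_usage_percent() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) print "unlimited"
      else printf "%d\n", used * 100 / max
    }
  '
}

# Example against a live node (host and port are placeholders):
# redis-cli -h 10.132.0.1 -p 7000 info memory | memory_usage_percent
```

A threshold check on the printed percentage (for example alerting above 85) fits the same cron pattern as the cluster checks.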
Network Connectivity and Firewall Rules
Ensure that your Droplets can communicate with each other on the necessary ports. Every Redis Cluster node listens on two TCP ports: the client port (6379 by default, 7000 in the scripts above) and the cluster bus port, which by default is the client port plus 10000 (16379 or 17000). Application servers need only the client port, while Redis nodes must reach each other on both. Access can be restricted with a host firewall such as UFW on Ubuntu or with DigitalOcean Cloud Firewalls.
Example UFW rules on a Redis node:
# Allow connections from application servers on the client port
sudo ufw allow from <app_server_ip> to any port 6379 proto tcp
# Allow the other Redis nodes on the client port (used for replication and slot migration)
sudo ufw allow from <redis_node_ip_1> to any port 6379 proto tcp
sudo ufw allow from <redis_node_ip_2> to any port 6379 proto tcp
# Allow cluster bus communication between Redis nodes (client port + 10000)
sudo ufw allow from <redis_node_ip_1> to any port 16379 proto tcp
sudo ufw allow from <redis_node_ip_2> to any port 16379 proto tcp
# ... and so on for all Redis nodes
# If the nodes use client port 7000, substitute 7000 and 17000 in the rules above
# Reload UFW to apply changes
sudo ufw reload
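With more than a few nodes, writing these rules by hand gets error-prone. The sketch below prints (but does not run) the `ufw` commands for a list of application and Redis node IPs, assuming the default cluster bus offset of client port plus 10000; the function name and argument layout are illustrative.

```shell
#!/bin/bash
# print_redis_ufw_rules <client_port> <app_ip_csv> <redis_ip_csv>
# Prints the ufw commands for review; nothing is applied.
print_redis_ufw_rules() {
  local client_port="$1"
  local bus_port=$(( client_port + 10000 ))  # default cluster bus offset
  local ip
  local -a app_ips redis_ips
  IFS=',' read -r -a app_ips <<< "$2"
  IFS=',' read -r -a redis_ips <<< "$3"
  # Application servers only need the client port.
  for ip in "${app_ips[@]}"; do
    echo "ufw allow from $ip to any port $client_port proto tcp"
  done
  # Redis nodes need both the client port and the cluster bus port.
  for ip in "${redis_ips[@]}"; do
    echo "ufw allow from $ip to any port $client_port proto tcp"
    echo "ufw allow from $ip to any port $bus_port proto tcp"
  done
}

# Example (IPs are placeholders):
# print_redis_ufw_rules 7000 10.132.0.10 10.132.0.1,10.132.0.2,10.132.0.3
```

Reviewing the printed commands before piping them to a shell keeps a typo in the IP list from locking nodes out of each other.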
DigitalOcean Cloud Firewalls: For a more robust solution, use DigitalOcean’s Cloud Firewalls. These are managed at the account level and can be applied to Droplets based on tags. This is generally preferred over host-based firewalls for managing inter-service communication.
A typical Cloud Firewall rule set for a Redis cluster would involve:
- Inbound Rules: Allow TCP traffic on port 6379 (or your configured Redis port) from the IP ranges or tags of your application Droplets.
- Inbound Rules: Allow TCP traffic on both the client port and the cluster bus port (client port plus 10000, e.g., 16379) from the IP ranges or tags of your other Redis cluster Droplets.
- Outbound Rules: Generally, allow all outbound traffic unless you have specific security requirements.
Regularly review these firewall rules to ensure they align with your network topology and security policies. Network latency between Droplets can also impact Redis performance; ensure your Droplets are in the same datacenter region and preferably the same VPC network.