Server Monitoring Best Practices: Keeping Your C App and Redis Clusters Alive on DigitalOcean
Proactive Redis Cluster Health Checks with Custom Scripts
Maintaining the health of a Redis cluster, especially in a distributed environment like DigitalOcean, requires more than just relying on basic service status. We need to actively probe key metrics and cluster state to catch issues before they impact application performance. This involves crafting custom scripts that interact directly with Redis and its sentinel processes.
A critical aspect of Redis cluster health is ensuring all nodes are communicating correctly and that failover mechanisms are functioning as expected. We’ll focus on a Python script that leverages the `redis-py` library to check cluster status, node health, and sentinel responsiveness.
Python Script for Redis Cluster and Sentinel Monitoring
This script connects to a Redis cluster and its sentinels, then performs several checks:
- In Redis Cluster mode only: verifies that the cluster is in a stable state (cluster_state is ok, no pending reconfigurations). Sentinel-managed deployments skip this check.
- Checks the health of each master and replica node.
- Ensures sentinels are aware of the master and can reach it.
- Reports on the number of sentinels monitoring the master.
Save this script as redis_monitor.py on a dedicated monitoring server or one of your application servers (if resource constraints allow). Ensure you have the redis-py library installed: pip install redis.
redis_monitor.py
import sys
import time

import redis
from redis.sentinel import MasterNotFoundError, Sentinel

# --- Configuration ---
REDIS_HOSTS = ['redis-1.example.com:6379', 'redis-2.example.com:6379', 'redis-3.example.com:6379']  # Initial cluster nodes (only used by the optional Redis Cluster check)
SENTINEL_HOSTS = [('sentinel-1.example.com', 26379), ('sentinel-2.example.com', 26379), ('sentinel-3.example.com', 26379)]  # Sentinel hosts and ports
MASTER_NAME = 'mymaster'  # The name of your Redis master set by Sentinel
TIMEOUT = 5  # Connection timeout in seconds
MAX_RETRIES = 3  # Max connection attempts per Redis node
RETRY_DELAY = 1  # Delay between attempts in seconds


# --- Helper Functions ---
def get_redis_connection(host, port, db=0, socket_timeout=TIMEOUT, socket_connect_timeout=TIMEOUT, decode_responses=True):
    """Return a connected client, or None after MAX_RETRIES failed attempts."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            r = redis.Redis(host=host, port=port, db=db, socket_timeout=socket_timeout,
                            socket_connect_timeout=socket_connect_timeout, decode_responses=decode_responses)
            r.ping()
            return r
        except redis.exceptions.ConnectionError as e:
            print(f"ERROR: Could not connect to Redis at {host}:{port} (attempt {attempt}/{MAX_RETRIES}) - {e}", file=sys.stderr)
        except Exception as e:
            print(f"ERROR: Unexpected error connecting to Redis at {host}:{port} - {e}", file=sys.stderr)
            return None
        if attempt < MAX_RETRIES:
            time.sleep(RETRY_DELAY)
    return None


def get_sentinel_connection(host, port, socket_timeout=TIMEOUT, socket_connect_timeout=TIMEOUT):
    """Return a Sentinel object for one sentinel, or None if it does not answer PING."""
    try:
        s = Sentinel([(host, port)], socket_timeout=socket_timeout, socket_connect_timeout=socket_connect_timeout)
        # master_for() connects lazily, so ping the sentinel itself to test the connection
        s.sentinels[0].ping()
        return s
    except redis.exceptions.ConnectionError as e:
        print(f"ERROR: Could not connect to Sentinel at {host}:{port} - {e}", file=sys.stderr)
        return None
    except Exception as e:
        print(f"ERROR: Unexpected error connecting to Sentinel at {host}:{port} - {e}", file=sys.stderr)
        return None


# --- Main Monitoring Logic ---
def monitor_redis_cluster():
    print(f"--- Starting Redis Cluster Monitor ({time.strftime('%Y-%m-%d %H:%M:%S')}) ---")

    # 1. Check Sentinel Connectivity and Master Status
    print("\n[1/4] Checking Sentinel connectivity and master status...")
    master_addr = None  # (host, port) from the first Sentinel that knows the master
    sentinel_connections = []  # (host, port, Sentinel) for every reachable Sentinel
    for s_host, s_port in SENTINEL_HOSTS:
        sentinel = get_sentinel_connection(s_host, s_port)
        if not sentinel:
            print(f"  - Sentinel {s_host}:{s_port} is unreachable.", file=sys.stderr)
            continue
        sentinel_connections.append((s_host, s_port, sentinel))
        try:
            reported = sentinel.discover_master(MASTER_NAME)  # returns a (host, port) tuple
            print(f"  - Sentinel {s_host}:{s_port} reports master: {reported[0]}:{reported[1]}")
            if master_addr is None:
                master_addr = reported
            elif reported != master_addr:
                print(f"  - WARNING: Sentinel {s_host}:{s_port} disagrees about the master (expected {master_addr[0]}:{master_addr[1]}).", file=sys.stderr)
        except MasterNotFoundError:
            print(f"  - WARNING: Sentinel {s_host}:{s_port} does not know about master '{MASTER_NAME}'.", file=sys.stderr)
        except Exception as e:
            print(f"  - ERROR: Sentinel {s_host}:{s_port} failed to get master info: {e}", file=sys.stderr)

    print(f"  - {len(sentinel_connections)}/{len(SENTINEL_HOSTS)} Sentinels reachable.")
    if not sentinel_connections:
        print("CRITICAL: No Sentinels are reachable. Cannot determine master status.", file=sys.stderr)
        return False
    if master_addr is None:
        print(f"CRITICAL: No Sentinel could provide master information for '{MASTER_NAME}'.", file=sys.stderr)
        return False
    master_host, master_port = master_addr
    print(f"  - Current Master (reported by at least one Sentinel): {master_host}:{master_port}")

    # 2. Check Master Node Health
    print(f"\n[2/4] Checking master node health ({master_host}:{master_port})...")
    master_conn = get_redis_connection(master_host, master_port)
    if not master_conn:
        print(f"CRITICAL: Master node {master_host}:{master_port} is unreachable.", file=sys.stderr)
        return False
    try:
        master_info_cmd = master_conn.info('replication')
        if master_info_cmd.get('role') != 'master':
            print(f"CRITICAL: Node {master_host}:{master_port} is reporting role '{master_info_cmd.get('role')}' instead of 'master'.", file=sys.stderr)
            return False
        print(f"  - Master node {master_host}:{master_port} is healthy and reports role: master.")
    except Exception as e:
        print(f"CRITICAL: Failed to get info from master node {master_host}:{master_port} - {e}", file=sys.stderr)
        return False

    # 3. Check Replica Nodes Health and Replication Status
    print("\n[3/4] Checking replica nodes health and replication status...")
    replicas_healthy = 0
    replicas_checked = set()  # (host, port) pairs, so a replica reported by several Sentinels is checked once
    for s_host, s_port, sentinel in sentinel_connections:
        try:
            # SENTINEL SLAVES lists every replica this Sentinel knows, including ones it has flagged as down
            replica_states = sentinel.sentinels[0].sentinel_slaves(MASTER_NAME)
        except Exception as e:
            print(f"  - ERROR: Failed to get replica list from Sentinel {s_host}:{s_port} - {e}", file=sys.stderr)
            continue
        for state in replica_states:
            r_host, r_port = state['ip'], int(state['port'])
            if (r_host, r_port) in replicas_checked:
                continue
            replicas_checked.add((r_host, r_port))
            print(f"  - Checking replica: {r_host}:{r_port}")
            replica_conn = get_redis_connection(r_host, r_port)
            if not replica_conn:
                print(f"  - WARNING: Replica node {r_host}:{r_port} is unreachable.", file=sys.stderr)
                continue
            try:
                replica_info_cmd = replica_conn.info('replication')
            except Exception as e:
                print(f"  - ERROR: Failed to get info from replica node {r_host}:{r_port} - {e}", file=sys.stderr)
                continue
            master_host_replica = replica_info_cmd.get('master_host')
            master_port_replica = replica_info_cmd.get('master_port')
            master_link_status = replica_info_cmd.get('master_link_status')
            if replica_info_cmd.get('role') != 'slave':
                print(f"  - WARNING: Replica node {r_host}:{r_port} is reporting role '{replica_info_cmd.get('role')}' instead of 'slave'.", file=sys.stderr)
            # redis-py parses master_port as an int, so compare both sides as strings
            elif str(master_host_replica) != str(master_host) or str(master_port_replica) != str(master_port):
                print(f"  - WARNING: Replica {r_host}:{r_port} is connected to wrong master ({master_host_replica}:{master_port_replica}). Expected {master_host}:{master_port}.", file=sys.stderr)
            elif master_link_status != 'up':
                print(f"  - WARNING: Replica {r_host}:{r_port} is connected to master but link status is '{master_link_status}'.", file=sys.stderr)
            else:
                replicas_healthy += 1
                print(f"  - Replica {r_host}:{r_port} is healthy, connected to master {master_host}:{master_port}, link status: UP.")

    total_replicas = len(replicas_checked)
    print(f"  - Found {replicas_healthy}/{total_replicas} healthy replicas connected to master {master_host}:{master_port}.")
    if replicas_healthy < total_replicas:
        # Decide if this is a critical failure or just a warning based on your SLOs.
        # For now, we'll consider it a warning.
        print("WARNING: Not all replicas are healthy or connected to the master.", file=sys.stderr)

    # 4. Check Cluster State (for Redis Cluster mode, not Sentinel)
    # This part applies to Redis Cluster (sharded) rather than Sentinel-managed HA.
    # For Sentinel, the checks above cover the essentials.
    print("\n[4/4] Skipping Redis Cluster state check (using Sentinel for HA).")
    # If using Redis Cluster:
    # try:
    #     nodes = [redis.cluster.ClusterNode(h, int(p)) for h, p in (n.split(':') for n in REDIS_HOSTS)]
    #     rc = redis.RedisCluster(startup_nodes=nodes, decode_responses=True)
    #     cluster_info = rc.cluster_info()
    #     if cluster_info.get('cluster_state') == 'ok':
    #         print("  - Redis Cluster state is OK.")
    #     else:
    #         print(f"CRITICAL: Redis Cluster state is '{cluster_info.get('cluster_state')}'.", file=sys.stderr)
    #         return False
    # except Exception as e:
    #     print(f"CRITICAL: Failed to get Redis Cluster info - {e}", file=sys.stderr)
    #     return False

    print("\n--- Redis Cluster Monitor finished successfully. ---")
    return True


if __name__ == "__main__":
    # Exit non-zero on failure so cron or another wrapper can alert on it
    sys.exit(0 if monitor_redis_cluster() else 1)
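The replica check in step 3 comes down to a decision over the dictionary returned by info('replication'), so it can be pulled out and exercised without a live server. The following is a minimal sketch under that assumption; the function name classify_replica and its return labels are hypothetical and not part of redis-py. It normalises the port on both sides because redis-py parses numeric INFO fields into integers.

```python
def classify_replica(info, master_host, master_port):
    """Classify a replica from its INFO replication dict (hypothetical helper).

    Returns 'not_replica', 'wrong_master', 'link_down', or 'healthy'.
    """
    if info.get('role') != 'slave':
        return 'not_replica'
    same_host = str(info.get('master_host')) == str(master_host)
    try:
        # master_port may arrive as int or str depending on who produced it
        same_port = int(info.get('master_port')) == int(master_port)
    except (TypeError, ValueError):
        same_port = False
    if not (same_host and same_port):
        return 'wrong_master'
    if info.get('master_link_status') != 'up':
        return 'link_down'
    return 'healthy'


if __name__ == "__main__":
    sample = {
        'role': 'slave',
        'master_host': '10.0.0.5',
        'master_port': 6379,
        'master_link_status': 'down',
    }
    print(classify_replica(sample, '10.0.0.5', '6379'))
```

Keeping the decision separate from the network calls also makes it easier to decide later which of these states should be warnings and which should fail the run.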
Integrating with DigitalOcean Monitoring and Alerting
To make this script truly effective, it needs to be integrated into a robust monitoring and alerting system. DigitalOcean's built-in monitoring provides basic CPU, memory, and disk I/O metrics. However, for application-level checks like Redis health, we need external tools.
Here are a few common strategies:
1. Cron Job with Email/Slack Notifications
The simplest approach is to schedule the Python script using cron and pipe its output to a notification mechanism. If the script exits with a non-zero status code (indicating an error), we can trigger an alert.
Cron Job Setup
Edit your crontab:
crontab -e
Add a line like this to run the script every 5 minutes:
*/5 * * * * /usr/bin/python3 /path/to/your/redis_monitor.py >> /var/log/redis_monitor.log 2>&1 || echo "Redis Monitor Failed: $(date)" | mail -s "ALERT: Redis Cluster Down" [email protected]
Explanation:
- */5 * * * *: Runs the command every 5 minutes.
- /usr/bin/python3 /path/to/your/redis_monitor.py: Executes your Python script. Adjust the path as necessary.
- >> /var/log/redis_monitor.log 2>&1: Appends standard output and standard error to a log file. This is crucial for debugging.
- || echo "Redis Monitor Failed: $(date)" | mail -s "ALERT: Redis Cluster Down" [email protected]: This is the alerting part. If the preceding command (the Python script) exits with a non-zero status (failure), the || (OR) condition is met, and the email is sent.
You'll need to configure your server's mail transfer agent (like Postfix or Sendmail) to send emails, or integrate with a service like SendGrid or Mailgun. For Slack notifications, you would replace the mail command with a curl request to a Slack incoming webhook.
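For teams that would rather stay in Python than shell out to curl, the Slack notification can be sketched with the standard library alone. This is an illustrative sketch, not a finished notifier: the webhook URL is a placeholder, and the function names build_slack_request and send_slack_alert are hypothetical. It relies on Slack incoming webhooks accepting a JSON body with a text field.

```python
import json
import sys
import urllib.request

# Placeholder: substitute the incoming-webhook URL issued for your workspace.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/REPLACE/WITH/YOURS"


def build_slack_request(webhook_url, message):
    """Build (but do not send) the HTTP POST for a Slack incoming webhook."""
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_alert(webhook_url, message, timeout=5):
    """Send the alert; return True when the webhook answers with HTTP 200."""
    request = build_slack_request(webhook_url, message)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError as exc:  # URLError, HTTPError and socket timeouts are OSError subclasses
        print(f"ERROR: Slack notification failed: {exc}", file=sys.stderr)
        return False
```

One way to wire it in is to call send_slack_alert(SLACK_WEBHOOK_URL, "Redis Monitor Failed") at the end of redis_monitor.py when monitor_redis_cluster() returns False, keeping the non-zero exit code so the cron log still records the failure.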
2. Prometheus and Alertmanager
For more sophisticated monitoring, Prometheus is the de facto standard. You can adapt the Python script to expose metrics that Prometheus can scrape, or use existing Redis exporters.
Option A: Custom Exporter (Node Exporter with a script)
You can use the Node Exporter's --collector.textfile.directory feature. This allows Node Exporter to read metrics from files in a specified directory. Your Python script would then generate a file with Prometheus-formatted metrics.
Python Script Modification for Prometheus Textfile Collector
# ... (previous imports and helper functions) ...
import contextlib


# --- Prometheus Metrics Generation ---
def generate_prometheus_metrics():
    """Run the monitor once and return (metrics_text, is_healthy)."""
    # The monitor prints its progress to stdout. Send that to stderr instead so
    # stdout carries nothing but metrics and can be redirected into a .prom file.
    with contextlib.redirect_stdout(sys.stderr):
        try:
            is_healthy = monitor_redis_cluster()  # Reuse the existing monitoring logic
        except Exception as e:
            print(f"ERROR: Exception during metric generation: {e}", file=sys.stderr)
            is_healthy = False
    metrics = [
        '# HELP redis_cluster_health_status Status of the Redis cluster (1=healthy, 0=unhealthy).',
        '# TYPE redis_cluster_health_status gauge',
        # Basic status metric: 1 for healthy, 0 for unhealthy
        f'redis_cluster_health_status{{master_name="{MASTER_NAME}"}} {int(is_healthy)}',
        # Add more detailed metrics here if monitor_redis_cluster() is extended
        # to return them (for example, replica counts).
    ]
    return "\n".join(metrics) + "\n", is_healthy


if __name__ == "__main__":
    # This replaces the original __main__ block. The metrics go to stdout; the
    # wrapper script redirects them into the textfile collector directory, where
    # Node Exporter picks them up. Alerting is then driven by Prometheus rules
    # and Alertmanager rather than by this script.
    metrics_output, is_healthy = generate_prometheus_metrics()
    print(metrics_output, end='')
    # Still exit with 1 if the monitor failed, so cron can catch it if not using Prometheus
    sys.exit(0 if is_healthy else 1)
Node Exporter Configuration:
# On the server running Node Exporter:
# 1. Create a directory for text files:
sudo mkdir -p /var/lib/node_exporter/textfile_collector
# 2. Create a script that runs your Python monitor and saves output to a .prom file:
sudo nano /usr/local/bin/redis_monitor_exporter.sh
#!/bin/bash
# Run the Python script and save output to the textfile collector directory
/usr/bin/python3 /path/to/your/redis_monitor.py > /var/lib/node_exporter/textfile_collector/redis_health.prom
Make it executable:
sudo chmod +x /usr/local/bin/redis_monitor_exporter.sh
Schedule this script with cron to run frequently (e.g., every minute):
crontab -e
* * * * * /usr/local/bin/redis_monitor_exporter.sh
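One caveat with redirecting stdout straight into the .prom file: the shell truncates the file before the script has finished, so a scrape that lands in that window can see an empty or partial file. A common way around this is to write to a temporary file in the same directory and rename it into place. The sketch below assumes the script writes the file itself instead of relying on the shell redirect; write_prom_file is a hypothetical helper name, and the temporary file deliberately does not end in .prom so that the textfile collector does not pick it up.

```python
import os
import tempfile


def write_prom_file(path, metrics_text):
    """Atomically replace `path` with `metrics_text` (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".redis_health.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(metrics_text)
        # mkstemp creates the file with mode 0600; the exporter's user must be able to read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this in place the script could call write_prom_file('/var/lib/node_exporter/textfile_collector/redis_health.prom', metrics_output), and the wrapper would no longer need the redirect.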
Prometheus Configuration:
Ensure your prometheus.yml includes a scrape config for your Node Exporter:
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['your_node_exporter_host:9100'] # Replace with your Node Exporter IP/hostname
        labels:
          instance: 'your_server_name' # e.g., 'redis-monitor-server'
Alertmanager Configuration:
Define a Prometheus alerting rule in your Prometheus rules file (e.g., alerts.yml):
groups:
  - name: redis_alerts
    rules:
      - alert: RedisClusterUnhealthy
        expr: redis_cluster_health_status{master_name="mymaster"} == 0
        for: 5m # Alert if the status is 0 for 5 minutes
        labels:
          severity: critical
        annotations:
          summary: "Redis cluster '{{ $labels.master_name }}' is unhealthy."
          description: "The Redis cluster monitor reported an unhealthy status for {{ $labels.master_name }} for more than 5 minutes."
Then, configure Alertmanager to receive these alerts and route them to your desired notification channels (email, Slack, PagerDuty, etc.).
Option B: Dedicated Redis Exporter
For a more comprehensive view of Redis metrics, consider using a dedicated Redis exporter like redis_exporter (available on GitHub). This exporter can be configured to connect to your Redis instances and expose a wide range of metrics (memory usage, connections, latency, etc.) directly to Prometheus.
You would typically run this exporter as a separate service, configure Prometheus to scrape it, and then use Alertmanager for alerting based on its metrics.
Monitoring Your C Application
Monitoring your C application involves a layered approach, from system-level metrics to application-specific performance indicators and error logging.
1. System-Level Metrics (CPU, Memory, Network, Disk)
DigitalOcean's Droplet monitoring provides basic visibility into these. For deeper insights and historical data, integrate with Prometheus and Node Exporter as described above. Ensure Node Exporter is running on your application servers.
Key metrics to watch:
- node_cpu_seconds_total: CPU usage per core and overall. Look for sustained high utilization.
- node_memory_MemAvailable_bytes: Available memory. Low available memory can lead to swapping and performance degradation.
- node_network_receive_bytes_total, node_network_transmit_bytes_total: Network traffic. High traffic might indicate load issues or potential DDoS attacks.
- node_disk_io_time_seconds_total: Disk I/O wait times. High values suggest disk bottlenecks.
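Because node_cpu_seconds_total is a cumulative counter of seconds spent in each CPU mode, "utilization" is always derived from the difference between two samples. The sketch below shows that arithmetic on made-up numbers; the function name cpu_utilisation and the sample values are illustrative only. It corresponds roughly to what a PromQL expression built on rate() over the idle mode computes.

```python
def cpu_utilisation(before, after):
    """Fraction of CPU time spent non-idle between two counter samples.

    `before` and `after` map a CPU mode (user, system, idle, ...) to the
    cumulative seconds reported for that mode, summed across cores.
    """
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no time elapsed between samples, or a counter reset
    return 1.0 - deltas.get("idle", 0.0) / total


if __name__ == "__main__":
    # Illustrative samples taken some time apart
    sample_1 = {"user": 100.0, "system": 50.0, "idle": 850.0}
    sample_2 = {"user": 130.0, "system": 60.0, "idle": 910.0}
    print(f"CPU utilisation: {cpu_utilisation(sample_1, sample_2):.0%}")
```

A single high reading is rarely actionable; the "sustained" part comes from evaluating this over a window of several minutes before alerting.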
2. Application-Specific Metrics
This is where custom instrumentation is crucial. Your C application should expose metrics relevant to its function. The most common way to do this for Prometheus is via a custom exporter or by embedding a client library.
Option A: Custom C Exporter (using Prometheus C++ Client)
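The Prometheus C++ client library (prometheus-cpp) allows you to instrument a C++ application directly. A plain C codebase can still use it by routing its metric calls through a small C++ source file that exposes extern "C" functions. You can expose metrics like request counts, latency, error rates, and connection pool usage.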
Example (Conceptual C++):
#include <prometheus/counter.h>
#include <prometheus/exposer.h>
#include <prometheus/family.h>
#include <prometheus/registry.h>

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <string>
#include <thread>

// Global Prometheus objects. The exposer must outlive initialize_prometheus();
// as a local variable it would shut the HTTP endpoint down on return.
std::unique_ptr<prometheus::Exposer> exposer;
std::shared_ptr<prometheus::Registry> registry;
prometheus::Family<prometheus::Counter>* request_counter_family;
prometheus::Family<prometheus::Counter>* error_counter_family;

void initialize_prometheus(const std::string& application_name, int port) {
    registry = std::make_shared<prometheus::Registry>();

    // Expose metrics on HTTP port
    exposer = std::make_unique<prometheus::Exposer>("0.0.0.0:" + std::to_string(port));
    exposer->RegisterCollectable(registry);

    // Define a counter for requests
    request_counter_family = &prometheus::BuildCounter()
        .Name("app_" + application_name + "_requests_total")
        .Help("Total number of requests processed by the application.")
        .Register(*registry);

    // Define a counter for errors
    error_counter_family = &prometheus::BuildCounter()
        .Name("app_" + application_name + "_errors_total")
        .Help("Total number of errors encountered by the application.")
        .Register(*registry);
}

void process_request(const std::string& endpoint) {
    // Increment request counter
    request_counter_family->Add({{"endpoint", endpoint}}).Increment();

    // Simulate some work...
    std::this_thread::sleep_for(std::chrono::milliseconds(50));

    // Simulate an error condition
    if (std::rand() % 100 < 5) { // 5% chance of error
        std::cerr << "Simulated error for endpoint: " << endpoint << std::endl;
        // Increment error counter
        error_counter_family->Add({{"endpoint", endpoint}, {"type", "internal"}}).Increment();
    }
}

int main() {
    // Initialize Prometheus metrics exporter on port 9091
    initialize_prometheus("my_c_app", 9091);
    std::cout << "Prometheus exporter started on port 9091" << std::endl;

    // Simulate application work
    while (true) {
        process_request("/api/v1/data");
        process_request("/api/v1/status");
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return 0;
}
Build and Run:
You'll need to compile this with a C++ compiler and link against the Prometheus C++ client library. Once running, Prometheus can scrape http://your_app_server:9091/metrics.
Option B: Log Aggregation and Analysis
Even with metrics, detailed logs are indispensable for debugging. Implement structured logging in your C application. Instead of plain text, log in a machine-readable format like JSON.
Example (Conceptual C with JSON logging):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Print s as a JSON string literal, escaping quotes, backslashes, and control characters. */
static void print_json_string(const char* s) {
    putchar('"');
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (c == '"' || c == '\\') {
            printf("\\%c", c);
        } else if (c < 0x20) {
            printf("\\u%04x", c);
        } else {
            putchar(c);
        }
    }
    putchar('"');
}

/* Basic JSON structure for logs: one object per line on stdout */
void log_message(const char* level, const char* message, const char* endpoint, const char* error_code) {
    time_t now = time(NULL);
    struct tm* tm_info = gmtime(&now); /* UTC, so the trailing 'Z' is accurate */
    char timestamp[21];                /* "YYYY-MM-DDTHH:MM:SSZ" plus the terminating NUL */
    strftime(timestamp, sizeof(timestamp), "%Y-%m-%dT%H:%M:%SZ", tm_info);

    printf("{\"timestamp\": \"%s\", \"level\": ", timestamp);
    print_json_string(level);
    printf(", \"message\": ");
    print_json_string(message);
    if (endpoint) {
        printf(", \"endpoint\": ");
        print_json_string(endpoint);
    }
    if (error_code) {
        printf(", \"error_code\": ");
        print_json_string(error_code);
    }
    printf("}\n");
    fflush(stdout); /* Ensure log is written immediately */
}

int main(void) {
    srand((unsigned)time(NULL)); /* Seed so the simulated error varies between runs */

    /* Simulate application startup */
    log_message("INFO", "Application started", NULL, NULL);

    /* Simulate processing a request */
    const char* current_endpoint = "/api/v1/process";
    log_message("INFO", "Processing request", current_endpoint, NULL);

    /* Simulate an error */
    int simulated_error = rand() % 2; /* 0 or 1 */
    if (simulated_error) {
        log_message("ERROR", "Failed to process data", current_endpoint, "ERR_DATA_PROC");
    } else {
        log_message("INFO", "Request processed successfully", current_endpoint, NULL);
    }

    log_message("INFO", "Application shutting down", NULL, NULL);
    return 0;
}
Use a log aggregation tool like Fluentd, Logstash, or Vector to collect these JSON logs from your application servers. These tools can parse the JSON, enrich it with metadata (like server hostname, application name), and send it to a centralized logging backend (Elasticsearch, Loki, cloud logging services).
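The payoff of one JSON object per line is that any consumer can parse the logs without regular expressions. As a small illustration of what an aggregator or an ad hoc script can do with them, the sketch below counts ERROR records per endpoint. The field names match the C example above; the function name count_errors_by_endpoint and the sample lines are made up for the demonstration.

```python
import json
from collections import Counter


def count_errors_by_endpoint(lines):
    """Count ERROR-level records per endpoint in JSON-lines log output."""
    errors = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON lines such as crash output
        if isinstance(record, dict) and record.get("level") == "ERROR":
            errors[record.get("endpoint", "unknown")] += 1
    return errors


if __name__ == "__main__":
    sample_lines = [
        '{"timestamp": "2024-01-01T12:00:00Z", "level": "INFO", "message": "Processing request", "endpoint": "/api/v1/process"}',
        '{"timestamp": "2024-01-01T12:00:01Z", "level": "ERROR", "message": "Failed to process data", "endpoint": "/api/v1/process", "error_code": "ERR_DATA_PROC"}',
        'Segmentation fault (core dumped)',
    ]
    for endpoint, count in count_errors_by_endpoint(sample_lines).items():
        print(f"{endpoint}: {count} error(s)")
```

The same function works on a live file handle, for example count_errors_by_endpoint(open(path)), since it only iterates over lines.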
3. Health Check Endpoints
Implement a dedicated HTTP endpoint (e.g., /healthz) in your C application. This endpoint should perform basic checks: is the application running? Can it connect to its dependencies (like Redis)?
Your health check endpoint should return:
- 200 OK if all checks pass.
- 5xx Server Error if any check fails.
You can then use external monitoring tools (like Prometheus's blackbox exporter, or even a simple `curl` in a cron job) to periodically poll this endpoint. This provides a high-level "is the service up?" check.
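As an alternative to curl, the polling side can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the function name check_health is hypothetical, the URL in the usage example is a placeholder, and anything other than HTTP 200 is treated as unhealthy to match the contract described above. It bypasses any proxy settings in the environment so that the check reaches the service directly.

```python
import sys
import urllib.error
import urllib.request

# An opener with an empty ProxyHandler ignores http_proxy/https_proxy variables.
_direct_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def check_health(url, timeout=3):
    """Poll a health endpoint; return (healthy, detail)."""
    try:
        with _direct_opener.open(url, timeout=timeout) as response:
            return response.status == 200, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # 4xx and 5xx responses arrive here as exceptions
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Connection refused, DNS failure, timeout, and similar
        return False, f"unreachable: {exc}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 check_health.py http://your_app_server:8080/healthz
    healthy, detail = check_health(sys.argv[1])
    print(f"{'OK' if healthy else 'FAIL'}: {detail}")
    sys.exit(0 if healthy else 1)
```

Because the script exits non-zero on failure, it slots into the same cron alerting pattern used for the Redis monitor.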
Putting It All Together: A Holistic Strategy
A robust monitoring strategy for your C app and Redis clusters on DigitalOcean involves:
- Redis: Custom Python scripts for deep health checks, integrated with cron/email or Prometheus/Alertmanager for proactive alerting on cluster state, master/replica health, and sentinel responsiveness.
- C Application:
  - System metrics via Node Exporter (Prometheus).
  - Application-specific metrics exposed via a custom C++ exporter or client library (Prometheus).
  - Structured JSON logging aggregated by tools like Fluentd/Vector to a central backend.
  - A dedicated /healthz endpoint for basic service availability checks.
- Alerting: Centralized alerting via Alertmanager (if using Prometheus) or custom scripts for simpler setups, ensuring critical issues are immediately flagged.
- Dashboards: Visualize key metrics and logs in Grafana for at-a-glance understanding of system health and performance trends.
By combining these techniques, you create a multi-layered defense against outages, ensuring both your data store and your application remain available and performant.