Server Monitoring Best Practices: Keeping Your C App and Redis Clusters Alive on OVH
Proactive C Application Health Checks with Systemd and Redis Cluster Monitoring
Maintaining the stability of critical C applications and their underlying Redis clusters on cloud infrastructure, particularly within an OVH environment, demands a robust, multi-layered monitoring strategy. This isn’t about merely reacting to failures; it’s about anticipating them. We’ll focus on deep system-level integration for our C application and specific, actionable metrics for Redis, all designed for production resilience.
Systemd Service Monitoring for C Applications
For C applications managed by systemd, leveraging its built-in capabilities is the first line of defense. We’ll configure systemd to not only restart failed services but also to actively probe their health. This involves defining a custom health check executable that systemd can call periodically.
Creating a C Health Check Executable
This small C program performs a basic liveness check by attempting to connect to the port the application listens on; a more thorough check could make a simple internal API call. It exits with 0 on success and a non-zero code on failure, which is how systemd decides whether the check passed.
Create a file named app_health_check.c:
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

// Define the port your C application listens on
#define APP_PORT 8080

int main(void) {
    int sock_fd;
    struct sockaddr_in serv_addr;

    // Attempt to create a socket
    if ((sock_fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        perror("socket creation failed");
        return 1; // Indicate failure
    }

    // Prepare the sockaddr_in structure (zeroed first so no field is left uninitialized)
    memset(&serv_addr, 0, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); // Check localhost
    serv_addr.sin_port = htons(APP_PORT);

    // Attempt to connect to the application's port
    // This is a simplified check. A real-world scenario might involve sending a specific request.
    if (connect(sock_fd, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
        perror("connection to app failed");
        close(sock_fd);
        return 1; // Indicate failure
    }

    // If connection is successful, the app is likely responsive
    printf("Application health check successful.\n");
    close(sock_fd);
    return 0; // Indicate success
}
Compile this program and place the executable in a known location, e.g., /usr/local/bin/app_health_check.
gcc app_health_check.c -o /usr/local/bin/app_health_check
chmod +x /usr/local/bin/app_health_check
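A blocking connect() can stall for a long time when packets to the port are silently dropped, which is awkward for a check that runs on a schedule. The sketch below shows one way to bound the wait with a non-blocking socket and poll(). The function name probe_tcp and its timeout parameter are illustrative and not part of the program above.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if a TCP connection to ip:port completes
 * within timeout_ms milliseconds, -1 otherwise. */
int probe_tcp(const char *ip, unsigned short port, int timeout_ms) {
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1; /* not a valid IPv4 address */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Switch to non-blocking mode so connect() returns immediately */
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    int rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    if (rc < 0 && errno == EINPROGRESS) {
        /* Wait until the socket becomes writable or the timeout expires */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        rc = -1;
        if (poll(&pfd, 1, timeout_ms) == 1) {
            int err = 0;
            socklen_t len = sizeof(err);
            /* SO_ERROR holds the final result of the connection attempt */
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 && err == 0)
                rc = 0;
        }
    } else if (rc < 0) {
        rc = -1; /* immediate failure, e.g. connection refused */
    }

    close(fd);
    return rc;
}
```

A health check built on this could return probe_tcp("127.0.0.1", 8080, 2000) == 0 ? 0 : 1 from main(), so the exit code convention stays the same.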
Configuring Systemd Service Unit
Now, modify your C application’s systemd service unit file (e.g., /etc/systemd/system/my-c-app.service) to include health checking directives.
[Unit]
Description=My Critical C Application
After=network.target

[Service]
ExecStart=/usr/local/bin/my-c-app
Restart=always
RestartSec=5s

# Health check options:
#
# Option 1: run the check once, right after the process is started.
# This only catches startup failures, and the check can race with the
# application opening its port.
# ExecStartPost=/usr/local/bin/app_health_check
#
# Option 2: systemd's built-in watchdog. This requires Type=notify and an
# application that calls sd_notify() periodically; systemd restarts the
# service if the keep-alive messages stop.
# Type=notify
# WatchdogSec=10s
#
# Option 3: a dedicated health check service triggered by a timer.
# This is more robust than ExecStartPost and needs no application changes.

[Install]
WantedBy=multi-user.target
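If the application can be modified, the watchdog option mentioned in the unit file comments needs it to send periodic keep-alive messages. libsystemd's sd_notify() does this. The sketch below instead speaks the underlying protocol directly, a datagram sent to the Unix socket named in $NOTIFY_SOCKET, which avoids a link-time dependency. The helper name notify_systemd is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: sends one state string (e.g. "READY=1" or
 * "WATCHDOG=1") to the socket systemd passes in $NOTIFY_SOCKET.
 * Returns 1 if sent, 0 if no notification socket is set, -1 on error. */
int notify_systemd(const char *state) {
    const char *path = getenv("NOTIFY_SOCKET");
    if (path == NULL || path[0] == '\0')
        return 0; /* not started by a notify-aware service manager */

    size_t len = strlen(path);
    struct sockaddr_un addr;
    if ((path[0] != '/' && path[0] != '@') || len >= sizeof(addr.sun_path))
        return -1; /* only absolute paths and abstract sockets are valid */

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, len);

    socklen_t addr_len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + len);
    if (path[0] == '@')
        addr.sun_path[0] = '\0'; /* Linux abstract socket namespace */
    else
        addr_len += 1;           /* include the terminating NUL of the path */

    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    size_t state_len = strlen(state);
    ssize_t sent = sendto(fd, state, state_len, 0,
                          (struct sockaddr *)&addr, addr_len);
    close(fd);
    return sent == (ssize_t)state_len ? 1 : -1;
}
```

With Type=notify and WatchdogSec=10s, the application would call notify_systemd("READY=1") once after startup and notify_systemd("WATCHDOG=1") from its main loop, comfortably more often than the watchdog interval (systemd's documentation suggests every half interval).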
To implement periodic health checks reliably, we’ll create a separate systemd service and timer. This allows us to define the frequency of checks independently of the application’s restart policy.
Dedicated Health Check Service and Timer
Create a service file for the health check, e.g., /etc/systemd/system/my-c-app-healthcheck.service:
[Unit]
Description=Health Check for My Critical C Application
Requires=my-c-app.service
After=my-c-app.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/app_health_check
Create a timer file to trigger this service, e.g., /etc/systemd/system/my-c-app-healthcheck.timer:
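[Unit]
Description=Timer for My Critical C Application Health Check

[Timer]
# Run the first check 1 minute after boot
OnBootSec=1min
# Then re-run 30 seconds after each previous activation of the health check service
OnUnitActiveSec=30s
# Timers default to 1-minute accuracy; tighten it so 30-second checks are not coalesced
AccuracySec=1s

[Install]
WantedBy=timers.target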
Reload systemd, enable and start the timer:
sudo systemctl daemon-reload
sudo systemctl enable my-c-app-healthcheck.timer
sudo systemctl start my-c-app-healthcheck.timer
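If my-c-app-healthcheck.service fails (i.e., app_health_check returns non-zero), systemd marks the unit as failed and records the failure in the journal. This does not restart my-c-app.service by itself. To act on a failure, attach an OnFailure= handler to the health check unit, or export the result to an external monitoring stack such as Prometheus with Alertmanager.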
Redis Cluster Monitoring with Redis-CLI and Prometheus Exporter
Monitoring Redis clusters, especially in a sharded or Sentinel-managed setup, requires looking beyond basic latency. We need to track cluster health, memory usage, network traffic, and replication status.
Essential Redis Metrics via redis-cli
Directly querying Redis instances using redis-cli provides immediate insights. We’ll focus on commands that reveal critical operational data.
# Connect to a Redis master node
redis-cli -h <redis-host> -p 6379

# Get general info
INFO ALL

# Key metrics to watch from INFO ALL:
# - connected_clients: Number of connected clients. High numbers might indicate a bottleneck.
# - used_memory_peak: Peak memory usage. Crucial for capacity planning.
# - used_memory_rss: Resident Set Size. Actual memory occupied by Redis.
# - latest_fork_usec: Time taken for the last fork operation. Long forks can block the server.
# - evicted_keys: Number of keys evicted due to memory policy. Indicates memory pressure.
# - keyspace_hits, keyspace_misses: Cache hit ratio.
# - total_commands_processed: Throughput.
# - instantaneous_ops_per_sec: Current throughput.
# - rejected_connections: Number of connections rejected due to maxclients limit.
# - sync_partial_ok, sync_partial_err: Replication status.
# - master_repl_offset, slave_repl_offset: Replication lag.

# Check cluster status (if using Redis Cluster)
CLUSTER INFO

# Key metrics from CLUSTER INFO:
# - cluster_state: Should be 'ok'.
# - cluster_slots_assigned, cluster_slots_ok, cluster_slots_pfail, cluster_slots_fail: Cluster health.

# Check replication status for a specific slave
INFO replication

# Check for slow commands
SLOWLOG GET 10
These commands are invaluable for manual diagnostics. For automated monitoring, we’ll integrate them with Prometheus.
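For quick scripted checks between manual redis-cli sessions and a full Prometheus setup, the INFO text can be parsed directly. INFO replies consist of key:value lines, so the sketch below extracts a numeric field and derives the cache hit ratio as hits / (hits + misses). The function names info_get_ll and cache_hit_ratio are hypothetical, and the caller is assumed to have already fetched the INFO text.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: finds "key:<integer>" at the start of a line in an
 * INFO reply and stores the value in *out. Returns 0 on success, -1 if the
 * key is missing or its value is not an integer. */
int info_get_ll(const char *info, const char *key, long long *out) {
    size_t klen = strlen(key);
    const char *p = info;

    while (p != NULL && *p != '\0') {
        /* Match the whole key, immediately followed by ':' */
        if (strncmp(p, key, klen) == 0 && p[klen] == ':') {
            const char *start = p + klen + 1;
            char *end = NULL;
            errno = 0;
            long long value = strtoll(start, &end, 10);
            if (end == start || errno != 0)
                return -1;
            *out = value;
            return 0;
        }
        /* Advance to the beginning of the next line */
        p = strchr(p, '\n');
        if (p != NULL)
            p++;
    }
    return -1;
}

/* Cache hit ratio in the range [0, 1]; 0 when there were no lookups. */
double cache_hit_ratio(long long hits, long long misses) {
    long long total = hits + misses;
    return total > 0 ? (double)hits / (double)total : 0.0;
}
```

For example, keyspace_hits:900 and keyspace_misses:100 give a ratio of 0.9. A steadily falling ratio alongside rising evicted_keys usually points at memory pressure rather than a change in access patterns.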
Prometheus Redis Exporter Setup
The redis_exporter is the de facto standard for exposing Redis metrics to Prometheus. We’ll deploy it as a systemd service.
Download a release from the official GitHub repository (the example uses v1.47.0; check the releases page for the current version):
# Example for Linux AMD64
wget https://github.com/oliver006/redis_exporter/releases/download/v1.47.0/redis_exporter-v1.47.0.linux-amd64.tar.gz
tar xvfz redis_exporter-v1.47.0.linux-amd64.tar.gz
sudo mv redis_exporter-v1.47.0.linux-amd64/redis_exporter /usr/local/bin/
rm -rf redis_exporter-v1.47.0.linux-amd64*
Create a systemd service file for the exporter, e.g., /etc/systemd/system/redis_exporter.service. This configuration assumes you have a Redis instance running on localhost:6379.
[Unit]
Description=Prometheus Redis Exporter
# Adjust redis.service if Redis is managed by a different unit
After=network.target redis.service

[Service]
# Or a dedicated user for the exporter
User=redis
ExecStart=/usr/local/bin/redis_exporter \
    --redis.addr=redis://localhost:6379 \
    --web.listen-address=:9121 \
    --namespace=redis
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
The exporter reads the INFO output (clients, memory, stats, replication, keyspace) on every scrape by default, so no per-section flags are required. Keep comments on their own lines; systemd does not treat a trailing # on a directive line as a comment.
Scraping a Redis Cluster Through One Exporter
Each exporter process takes a single --redis.addr. To monitor every node of a Redis Cluster from one exporter, use its /scrape endpoint and pass the node address as the target parameter; Prometheus relabeling fills that parameter in for each node. Add a job like the following to prometheus.yml, listing every cluster node you want metrics from:
scrape_configs:
  - job_name: redis_exporter
    metrics_path: /scrape
    static_configs:
      - targets:
          - redis://192.168.1.10:6379
          - redis://192.168.1.11:6379
          - redis://192.168.1.12:6379
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9121
The replacement value is the address of the exporter itself. localhost:9121 assumes Prometheus runs on the same host as the exporter, so adjust it otherwise. The ExecStart line needs no per-node addresses in this mode.
Reload systemd, enable, and start the exporter:
sudo systemctl daemon-reload
sudo systemctl enable redis_exporter.service
sudo systemctl start redis_exporter.service
Finally, confirm that your Prometheus server can reach the exporter on port 9121 and that the target shows as up on its Targets page.
Alerting Strategies with Prometheus and Alertmanager
Effective alerting is crucial. We’ll define Prometheus rules that trigger alerts based on critical metrics, and Alertmanager will handle deduplication, grouping, and routing.
Prometheus Alerting Rules
Create a Prometheus rules file (e.g., /etc/prometheus/rules/redis_app_alerts.yml):
groups:
  - name: c_app_alerts
    rules:
      # Assumes a scrape job named my-c-app (the app itself or a sidecar exposing metrics)
      - alert: CAppUnhealthy
        expr: up{job="my-c-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "C Application {{ $labels.instance }} is down."
          description: "Prometheus has been unable to scrape the C application target {{ $labels.instance }} for 1 minute."
      # Assumes the health check result is exported as probe_success by a
      # blackbox-style prober; the systemd timer alone does not publish this metric
      - alert: CAppHealthCheckFailed
        expr: |
          probe_success{job="my-c-app-healthcheck"} == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "C Application Health Check Failed on {{ $labels.instance }}"
          description: "The periodic health check for C application {{ $labels.instance }} failed. Check logs for details."
  - name: redis_alerts
    rules:
      - alert: RedisClusterDown
        expr: redis_cluster_cluster_state{job="redis_exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Redis Cluster is down on {{ $labels.instance }}"
          description: "Redis node {{ $labels.instance }} reports a cluster state other than 'ok'."
      - alert: RedisHighMemoryUsage
        expr: |
          (redis_memory_used_bytes{job="redis_exporter"} / redis_total_system_memory_bytes{job="redis_exporter"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis High Memory Usage on {{ $labels.instance }}"
          description: "Redis instance {{ $labels.instance }} is using {{ $value | printf \"%.2f\" }}% of its total system memory."
      - alert: RedisEvictedKeys
        expr: |
          increase(redis_evicted_keys_total{job="redis_exporter"}[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis Evicted Keys on {{ $labels.instance }}"
          description: "Redis instance {{ $labels.instance }} has evicted keys in the last 5 minutes, indicating memory pressure."
      - alert: RedisReplicationLag
        expr: |
          redis_replication_master_repl_offset{job="redis_exporter"} - redis_replication_slave_repl_offset{job="redis_exporter"} > 100000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis Replication Lag on {{ $labels.instance }}"
          description: "Redis slave {{ $labels.instance }} is lagging behind its master by {{ $value }} bytes."
      # latest_fork_usec is a gauge (duration of the most recent fork), so it is
      # compared directly rather than wrapped in increase()
      - alert: RedisLongFork
        expr: |
          redis_latest_fork_usec{job="redis_exporter"} > 500000  # 500ms
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis Long Fork Operation on {{ $labels.instance }}"
          description: "Redis instance {{ $labels.instance }} experienced a fork operation taking longer than 500ms."
Add this rules file to your Prometheus configuration and reload Prometheus.
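The arithmetic behind the memory and replication rules is simple enough to check by hand or in a local spot-check tool. The sketch below mirrors the two expressions, used / total * 100 > 85 and master offset minus replica offset > 100000. The function names are hypothetical and the thresholds are copied from the rules file.

```c
#include <stdbool.h>

#define MEMORY_ALERT_PERCENT 85.0   /* threshold used by RedisHighMemoryUsage */
#define LAG_ALERT_BYTES 100000LL    /* threshold used by RedisReplicationLag */

/* Memory in use as a percentage of total system memory; -1 if total is invalid. */
double memory_usage_percent(long long used_bytes, long long total_bytes) {
    if (total_bytes <= 0)
        return -1.0;
    return (double)used_bytes * 100.0 / (double)total_bytes;
}

/* Bytes the replica is behind its master; never negative. */
long long replication_lag_bytes(long long master_offset, long long replica_offset) {
    long long lag = master_offset - replica_offset;
    return lag > 0 ? lag : 0;
}

/* True when the corresponding rule expression would evaluate as firing. */
bool memory_alert(long long used_bytes, long long total_bytes) {
    return memory_usage_percent(used_bytes, total_bytes) > MEMORY_ALERT_PERCENT;
}

bool lag_alert(long long master_offset, long long replica_offset) {
    return replication_lag_bytes(master_offset, replica_offset) > LAG_ALERT_BYTES;
}
```

As a worked case, 900 MB used out of 1000 MB is 90%, which exceeds the 85% threshold, while a master offset of 1,500,000 against a replica offset of 1,350,000 is a lag of 150,000 bytes, above the 100,000-byte limit. In Prometheus the condition must additionally hold for the rule's for: duration before the alert fires.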
Alertmanager Configuration
Configure Alertmanager (alertmanager.yml) to route these alerts to your desired channels (e.g., Slack, PagerDuty, email).
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: ''
        channel: '#alerts-general'
  - name: 'critical-alerts'
    slack_configs:
      - api_url: ''
        channel: '#alerts-critical'
    pagerduty_configs:
      - service_key: ''
Ensure Alertmanager is running and configured to load this configuration. Prometheus should be configured to send alerts to Alertmanager.
OVH Specific Considerations
While the above practices are general, OVH’s infrastructure might have specific nuances:
- Network Latency: Monitor inter-node latency within your Redis cluster and between your application servers and Redis. OVH’s network can be highly performant, but cross-zone or cross-region communication should be carefully observed.
- Instance Types: Choose appropriate instance types on OVH that provide sufficient CPU, RAM, and network bandwidth for your C application and Redis. Monitor resource utilization closely.
- Security Groups/Firewalls: Ensure your monitoring endpoints (e.g., Prometheus exporter on port 9121) are accessible from your Prometheus server, and that your C app’s port is accessible from where the health check runs.
- OVH Monitoring Tools: Complement your custom monitoring with OVH’s native monitoring dashboards for infrastructure-level metrics (CPU, disk I/O, network traffic at the hypervisor level). This provides a broader view and can help distinguish between application-level issues and infrastructure problems.
By combining systemd’s deep integration, detailed Redis metrics, and a robust Prometheus/Alertmanager stack, you establish a proactive, resilient monitoring system capable of keeping your C applications and Redis clusters healthy and available on OVH.