Server Monitoring Best Practices: Keeping Your C App and Redis Clusters Alive on OVH

Proactive C Application Health Checks with Systemd and Redis Cluster Monitoring

Maintaining the stability of critical C applications and their underlying Redis clusters on cloud infrastructure, particularly within an OVH environment, demands a robust, multi-layered monitoring strategy. This isn’t about merely reacting to failures; it’s about anticipating them. We’ll focus on deep system-level integration for our C application and specific, actionable metrics for Redis, all designed for production resilience.

Systemd Service Monitoring for C Applications

For C applications managed by systemd, leveraging its built-in capabilities is the first line of defense. We’ll configure systemd to not only restart failed services but also to actively probe their health. This involves defining a custom health check executable that systemd can call periodically.

Creating a C Health Check Executable

This small C program performs a basic liveness check: it opens a TCP connection to the port the application listens on and exits with 0 on success or non-zero on failure, which is the convention systemd uses to decide whether a command succeeded. A production check would typically go further and send a real request, such as a call to an internal API endpoint.

Create a file named app_health_check.c:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

// Define the port your C application listens on
#define APP_PORT 8080

int main(void) {
    int sock_fd;
    struct sockaddr_in serv_addr;

    // Attempt to create a socket
    if ((sock_fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        perror("socket creation failed");
        return 1; // Indicate failure
    }

    // Prepare the sockaddr_in structure (zeroed first so no field is left uninitialized)
    memset(&serv_addr, 0, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); // Check localhost
    serv_addr.sin_port = htons(APP_PORT);

    // Attempt to connect to the application's port
    // This is a simplified check. A real-world scenario might involve sending a specific request.
    if (connect(sock_fd, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
        perror("connection to app failed");
        close(sock_fd);
        return 1; // Indicate failure
    }

    // If connection is successful, the app is at least accepting connections
    printf("Application health check successful.\n");
    close(sock_fd);
    return 0; // Indicate success
}

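Compile this program and install the executable in a known location, e.g., /usr/local/bin/app_health_check. Writing to /usr/local/bin normally requires root, so compile as a regular user and install with sudo.

gcc -Wall -O2 app_health_check.c -o app_health_check
sudo install -m 0755 app_health_check /usr/local/bin/app_health_check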
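One limitation of the check above is that connect() has no explicit time limit, which matters if the check is ever pointed at a remote host rather than localhost. The following is a minimal sketch of a probe with a bounded wait, using a non-blocking socket and poll(). The function name tcp_probe and its parameters are illustrative assumptions, not part of the original program.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if 127.0.0.1:port accepts a TCP
 * connection within timeout_ms milliseconds, -1 otherwise. */
int tcp_probe(unsigned short port, int timeout_ms) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    /* Switch to non-blocking mode so connect() cannot hang. */
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    int rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    if (rc < 0 && errno == EINPROGRESS) {
        /* Connection is in flight: wait until the socket is writable
         * or reports an error, then read the final status. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        rc = -1;
        if (poll(&pfd, 1, timeout_ms) == 1) {
            int err = 0;
            socklen_t len = sizeof(err);
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 && err == 0)
                rc = 0;
        }
    }
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

Calling tcp_probe(8080, 1000) from main and returning 1 on a non-zero result would slot into the health check above without changing how systemd interprets the exit code.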

Configuring Systemd Service Unit

Now, modify your C application’s systemd service unit file (e.g., /etc/systemd/system/my-c-app.service) to include health checking directives.

[Unit]
Description=My Critical C Application
After=network.target

[Service]
ExecStart=/usr/local/bin/my-c-app
Restart=always
RestartSec=5s

# Health check options, from simplest to most robust:
#
# Option 1: One-off check right after start. It runs only once per start and can
# race with the application opening its port; a failure marks the start as failed.
# ExecStartPost=/usr/local/bin/app_health_check
#
# Option 2: systemd's built-in watchdog. This requires the application to call
# sd_notify() with READY=1 at startup and WATCHDOG=1 at regular intervals.
# Type=notify
# WatchdogSec=10s
#
# Option 3: A dedicated health check service triggered by a timer. It needs no
# changes to the application and is the approach used in the rest of this section.

[Install]
WantedBy=multi-user.target
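For applications that can be modified, the watchdog route mentioned in the unit comments looks roughly like the sketch below. It avoids a libsystemd dependency by speaking the notification protocol directly: systemd passes the address of a Unix datagram socket in the NOTIFY_SOCKET environment variable, and the service writes short messages such as WATCHDOG=1 to it. The function name notify_systemd is hypothetical; sd_notify() from libsystemd does the same job if linking against it is acceptable.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: sends `state` (e.g. "READY=1" or "WATCHDOG=1")
 * to the socket named in NOTIFY_SOCKET.
 * Returns 1 if sent, 0 if not running under systemd notification, -1 on error. */
int notify_systemd(const char *state) {
    const char *path = getenv("NOTIFY_SOCKET");
    if (path == NULL || path[0] == '\0') return 0;

    struct sockaddr_un addr;
    size_t len = strlen(path);
    if ((path[0] != '/' && path[0] != '@') || len >= sizeof(addr.sun_path))
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, len);

    socklen_t addrlen;
    if (path[0] == '@') {
        /* Abstract namespace socket: leading '@' becomes a NUL byte. */
        addr.sun_path[0] = '\0';
        addrlen = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + len);
    } else {
        /* Filesystem socket: include the terminating NUL. */
        addrlen = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + len + 1);
    }

    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0) return -1;

    size_t msg_len = strlen(state);
    ssize_t n = sendto(fd, state, msg_len, 0, (struct sockaddr *)&addr, addrlen);
    close(fd);
    return n == (ssize_t)msg_len ? 1 : -1;
}
```

With Type=notify and WatchdogSec=10s in the unit, the application would call notify_systemd("READY=1") once after startup and notify_systemd("WATCHDOG=1") from its main loop, ideally at about half the watchdog interval. If the pings stop, systemd treats the service as hung and applies the Restart= policy.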

To implement periodic health checks reliably, we’ll create a separate systemd service and timer. This allows us to define the frequency of checks independently of the application’s restart policy.

Dedicated Health Check Service and Timer

Create a service file for the health check, e.g., /etc/systemd/system/my-c-app-healthcheck.service:

[Unit]
Description=Health Check for My Critical C Application
Requires=my-c-app.service
After=my-c-app.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/app_health_check

Create a timer file to trigger this service, e.g., /etc/systemd/system/my-c-app-healthcheck.timer:

[Unit]
Description=Timer for My Critical C Application Health Check

[Timer]
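# Start checking 1 minute after boot
OnBootSec=1min
# Re-run 30 seconds after the health check service was last started
OnUnitActiveSec=30s
# Timers default to 1 minute accuracy, which is too coarse for a 30-second interval
AccuracySec=1s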

[Install]
WantedBy=timers.target

Reload systemd, enable and start the timer:

sudo systemctl daemon-reload
sudo systemctl enable my-c-app-healthcheck.timer
sudo systemctl start my-c-app-healthcheck.timer

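If my-c-app-healthcheck.service fails (i.e., app_health_check returns non-zero), systemd marks the unit as failed and records the failure in the journal, where it can be inspected with journalctl -u my-c-app-healthcheck.service. On its own this only logs the problem. To act on it, attach an OnFailure= unit to the health check service or export the result to an external monitoring stack such as Prometheus and Alertmanager.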
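One way to get the health check result into Prometheus is node_exporter's textfile collector, which reads metric files from a directory given by its --collector.textfile.directory flag. The sketch below assumes node_exporter is already running with that flag, and it writes a single gauge in the Prometheus text format. The metric name my_c_app_health_check_success and the function name write_health_metric are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: writes a one-gauge metrics file for the textfile
 * collector. The file is written to "<path>.tmp" and renamed into place so
 * the collector never reads a half-written file.
 * Returns 0 on success, -1 on error. */
int write_health_metric(const char *path, int healthy) {
    char tmp[512];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp)) return -1;

    FILE *f = fopen(tmp, "w");
    if (f == NULL) return -1;

    int ok = fprintf(f,
        "# HELP my_c_app_health_check_success Whether the last health check passed.\n"
        "# TYPE my_c_app_health_check_success gauge\n"
        "my_c_app_health_check_success %d\n",
        healthy ? 1 : 0) > 0;
    if (fclose(f) != 0) ok = 0;

    if (!ok || rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

The health check program could call this just before exiting, passing 1 or 0 depending on the outcome. An alert rule would then compare the gauge against 0, with the file path chosen to match the collector directory on the host.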

Redis Cluster Monitoring with Redis-CLI and Prometheus Exporter

Monitoring Redis clusters, especially in a sharded or Sentinel-managed setup, requires looking beyond basic latency. We need to track cluster health, memory usage, network traffic, and replication status.

Essential Redis Metrics via redis-cli

Directly querying Redis instances using redis-cli provides immediate insights. We’ll focus on commands that reveal critical operational data.

# Connect to a Redis master node
redis-cli -h <redis-host> -p 6379

# Get general info
INFO ALL

# Key metrics to watch from INFO ALL:
# - connected_clients: Number of connected clients. High numbers might indicate a bottleneck.
# - used_memory_peak: Peak memory usage. Crucial for capacity planning.
# - used_memory_rss: Resident Set Size. Memory occupied by Redis as seen by the operating system.
# - latest_fork_usec: Time taken for the last fork operation. Long forks can block the server.
# - evicted_keys: Number of keys evicted due to memory policy. Indicates memory pressure.
# - keyspace_hits, keyspace_misses: Cache hit ratio.
# - total_commands_processed: Throughput.
# - instantaneous_ops_per_sec: Current throughput.
# - rejected_connections: Number of connections rejected due to maxclients limit.
# - sync_partial_ok, sync_partial_err: Replication status.
# - master_repl_offset, slave_repl_offset: Replication lag.

# Check cluster status (if using Redis Cluster)
CLUSTER INFO

# Key metrics from CLUSTER INFO:
# - cluster_state: Should be 'ok'.
# - cluster_slots_assigned, cluster_slots_ok, cluster_slots_pfail, cluster_slots_fail: Cluster health.

# Check replication status for a specific slave
INFO replication

# Check for slow commands
SLOWLOG GET 10

These commands are invaluable for manual diagnostics. For automated monitoring, we’ll integrate them with Prometheus.
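Between manual inspection and a full exporter, a small program can derive ratios from the INFO output directly. The sketch below assumes the INFO reply has already been read into a string (for example through a Redis client library) and computes the cache hit ratio from keyspace_hits and keyspace_misses. The function names info_get_ll and hit_ratio are illustrative.

```c
#include <stdlib.h>
#include <string.h>

/* Finds a "key:value" line in INFO text and stores the integer value in *out.
 * Returns 0 on success, -1 if the key is missing or not numeric. */
int info_get_ll(const char *info, const char *key, long long *out) {
    size_t klen = strlen(key);
    const char *line = info;
    while (line != NULL && *line != '\0') {
        /* Match only at the start of a line, followed by ':'. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            const char *start = line + klen + 1;
            char *end;
            long long v = strtoll(start, &end, 10);
            if (end == start) return -1;
            *out = v;
            return 0;
        }
        const char *nl = strchr(line, '\n');
        line = nl ? nl + 1 : NULL;
    }
    return -1;
}

/* Returns hits / (hits + misses) in the range 0..1,
 * or -1.0 if the fields are missing or no lookups happened yet. */
double hit_ratio(const char *info) {
    long long hits, misses;
    if (info_get_ll(info, "keyspace_hits", &hits) != 0 ||
        info_get_ll(info, "keyspace_misses", &misses) != 0 ||
        hits + misses <= 0)
        return -1.0;
    return (double)hits / (double)(hits + misses);
}
```

The same lookup helper works for any numeric INFO field, such as evicted_keys or rejected_connections. Note that the hit and miss counters are cumulative since server start, so a ratio over a recent window requires sampling twice and working with the differences.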

Prometheus Redis Exporter Setup

The redis_exporter is the de facto standard for exposing Redis metrics to Prometheus. We’ll deploy it as a systemd service.

Download a release from the official GitHub repository (v1.47.0 is shown here; check the releases page for the current version):

# Example for Linux AMD64
wget https://github.com/oliver006/redis_exporter/releases/download/v1.47.0/redis_exporter-v1.47.0.linux-amd64.tar.gz
tar xvfz redis_exporter-v1.47.0.linux-amd64.tar.gz
sudo mv redis_exporter-v1.47.0.linux-amd64/redis_exporter /usr/local/bin/
rm -rf redis_exporter-v1.47.0.linux-amd64*

Create a systemd service file for the exporter, e.g., /etc/systemd/system/redis_exporter.service. This configuration assumes you have a Redis instance running on localhost:6379. Note that systemd does not support trailing comments on a setting line, so comments go on their own lines.

[Unit]
Description=Prometheus Redis Exporter
# Adjust redis.service if Redis is managed by a different unit
After=network.target redis.service

[Service]
# Or a dedicated user for the exporter
User=redis
# --include-system-metrics exposes total_system_memory_bytes, which the
# memory usage alert later in this article depends on.
ExecStart=/usr/local/bin/redis_exporter \
  --redis.addr=redis://localhost:6379 \
  --web.listen-address=:9121 \
  --namespace=redis \
  --include-system-metrics
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

An exporter process is started with a single --redis.addr. If you are monitoring a Redis Cluster, you can either run one exporter next to every node or let one exporter probe many nodes through its multi-target /scrape endpoint, with the node list kept in the Prometheus configuration.

Scraping Multiple Redis Cluster Nodes Through One Exporter

For larger Redis Cluster setups, the multi-target pattern is cleaner. List every node you want metrics from, masters and replicas alike, as a target in prometheus.yml, and relabel each target into the target parameter of the exporter's /scrape endpoint. The exporter does not discover replicas on its own. The addresses below are examples:

scrape_configs:
  - job_name: redis_exporter
    metrics_path: /scrape
    static_configs:
      - targets:
          - redis://192.168.1.10:6379
          - redis://192.168.1.11:6379
          - redis://192.168.1.12:6379
    relabel_configs:
      # Pass the Redis address to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the Redis address as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual HTTP request to the exporter
      - target_label: __address__
        replacement: <exporter-host>:9121

In this mode the ExecStart line of the systemd service needs no changes: the address given in --redis.addr is only used for plain /metrics requests, and each /scrape request is directed at the node named in its target parameter.

Reload systemd, enable, and start the exporter:

sudo systemctl daemon-reload
sudo systemctl enable redis_exporter.service
sudo systemctl start redis_exporter.service

For a single instance, ensure your Prometheus server is configured to scrape http://<exporter-host>:9121/metrics.

Alerting Strategies with Prometheus and Alertmanager

Effective alerting is crucial. We’ll define Prometheus rules that trigger alerts based on critical metrics, and Alertmanager will handle deduplication, grouping, and routing.

Prometheus Alerting Rules

Create a Prometheus rules file (e.g., /etc/prometheus/rules/redis_app_alerts.yml):

groups:
- name: c_app_alerts
  rules:
  - alert: CAppUnhealthy
    expr: up{job="my-c-app"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "C Application {{ $labels.instance }} is down."
      description: "Prometheus has been unable to scrape the C application on {{ $labels.instance }} for more than 1 minute."

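  # probe_success is exported by blackbox_exporter. This rule assumes a probe job
  # named my-c-app-healthcheck that checks the application's port.
  - alert: CAppHealthCheckFailed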
    expr: |
      probe_success{job="my-c-app-healthcheck"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "C Application Health Check Failed on {{ $labels.instance }}"
      description: "The periodic health check for C application {{ $labels.instance }} failed. Check logs for details."

- name: redis_alerts
  rules:
  - alert: RedisClusterDown
    expr: redis_cluster_state{job="redis_exporter"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Redis Cluster is down on {{ $labels.instance }}"
      description: "The Redis cluster managed by {{ $labels.instance }} is in a 'fail' state."

  - alert: RedisHighMemoryUsage
    expr: |
      (redis_memory_used_bytes{job="redis_exporter"} / redis_total_system_memory_bytes{job="redis_exporter"}) * 100 > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Redis High Memory Usage on {{ $labels.instance }}"
      description: "Redis instance {{ $labels.instance }} is using {{ $value | printf \"%.2f\" }}% of its total system memory."

  - alert: RedisEvictedKeys
    expr: |
      increase(redis_evicted_keys_total{job="redis_exporter"}[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis Evicted Keys on {{ $labels.instance }}"
      description: "Redis instance {{ $labels.instance }} has evicted keys in the last 5 minutes, indicating memory pressure."

  - alert: RedisReplicationLag
    expr: |
      redis_master_repl_offset{job="redis_exporter"}
        - on (instance) group_right ()
      redis_connected_slave_offset_bytes{job="redis_exporter"} > 100000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis Replication Lag on {{ $labels.instance }}"
      description: "Replica {{ $labels.slave_ip }}:{{ $labels.slave_port }} of Redis master {{ $labels.instance }} is lagging behind by {{ $value }} bytes."

  - alert: RedisLongFork
    expr: |
      redis_latest_fork_usec{job="redis_exporter"} > 500000 # 500ms; this is a gauge, so compare it directly
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Redis Long Fork Operation on {{ $labels.instance }}"
      description: "Redis instance {{ $labels.instance }} experienced a fork operation taking longer than 500ms."

Add this rules file to your Prometheus configuration and reload Prometheus.

Alertmanager Configuration

Configure Alertmanager (alertmanager.yml) to route these alerts to your desired channels (e.g., Slack, PagerDuty, email).

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'

  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
    continue: true

receivers:
- name: 'default-receiver'
  slack_configs:
  - api_url: '<slack-webhook-url>'
    channel: '#alerts-general'

- name: 'critical-alerts'
  slack_configs:
  - api_url: '<slack-webhook-url>'
    channel: '#alerts-critical'
  pagerduty_configs:
  - service_key: '<pagerduty-service-key>'

Ensure Alertmanager is running and configured to load this configuration. Prometheus should be configured to send alerts to Alertmanager.

OVH Specific Considerations

While the above practices are general, OVH’s infrastructure might have specific nuances:

Network Latency: Monitor inter-node latency within your Redis cluster and between your application servers and Redis. OVH’s network can be highly performant, but cross-zone or cross-region communication should be carefully observed.
Instance Types: Choose appropriate instance types on OVH that provide sufficient CPU, RAM, and network bandwidth for your C application and Redis. Monitor resource utilization closely.
Security Groups/Firewalls: Ensure your monitoring endpoints (e.g., Prometheus exporter on port 9121) are accessible from your Prometheus server, and that your C app’s port is accessible from where the health check runs.
OVH Monitoring Tools: Complement your custom monitoring with OVH’s native monitoring dashboards for infrastructure-level metrics (CPU, disk I/O, network traffic at the hypervisor level). This provides a broader view and can help distinguish between application-level issues and infrastructure problems.
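For the network latency point above, a rough first measurement is the time a TCP handshake takes between two hosts. The sketch below times a blocking connect() with a monotonic clock. The function name connect_latency_ms is illustrative, the address is expected as an IPv4 literal, and the call has no timeout of its own, so it suits reachable hosts on a private network rather than arbitrary targets.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: returns the TCP connect time to ipv4:port in
 * milliseconds, or -1.0 if the address is invalid or the connection fails. */
double connect_latency_ms(const char *ipv4, unsigned short port) {
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1) return -1.0;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1.0;

    /* CLOCK_MONOTONIC is unaffected by wall-clock adjustments. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    if (rc != 0) return -1.0;

    return (double)(t1.tv_sec - t0.tv_sec) * 1000.0 +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Run periodically from an application server against each Redis node, the result gives a baseline to compare against when cross-zone traffic is suspected of adding delay. A single handshake is a noisy sample, so averaging several measurements is advisable.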

By combining systemd’s deep integration, detailed Redis metrics, and a robust Prometheus/Alertmanager stack, you establish a proactive, resilient monitoring system capable of keeping your C applications and Redis clusters healthy and available on OVH.