Server Monitoring Best Practices: Keeping Your C App and Redis Clusters Alive on AWS

Proactive C Application Health Checks with Prometheus and Node Exporter

Maintaining the health of a C application, especially one deployed on AWS, requires more than just basic process monitoring. We need to expose application-specific metrics and ensure the underlying host is performing optimally. Prometheus, coupled with `node_exporter`, provides a robust solution for this. For C applications, we’ll leverage custom exporters or direct metric exposition.

First, let’s configure `node_exporter` to gather system-level metrics. This is typically run as a systemd service on your EC2 instances.

Systemd Service for Node Exporter

Create a systemd unit file, for example, /etc/systemd/system/node_exporter.service:

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
  --web.listen-address=0.0.0.0:9100

[Install]
WantedBy=multi-user.target

Ensure the user `prometheus` exists and has appropriate permissions for the textfile collector directory. Then, enable and start the service:

sudo useradd -rs /bin/false prometheus
sudo mkdir -p /var/lib/node_exporter/textfile_collector
sudo chown prometheus:prometheus /var/lib/node_exporter/textfile_collector
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter

Exposing C Application Metrics

For C applications, you have three options:

Directly expose metrics: Embed a small HTTP server within your C application that exposes metrics in Prometheus text format. This is ideal for high-frequency, application-specific metrics.
Use a custom exporter: Write a separate process (e.g., in Python or Go) that queries your C application (via IPC, shared memory, or a custom protocol) and exposes these metrics to Prometheus.
Leverage textfile collectors: Have your C application write metric files to the /var/lib/node_exporter/textfile_collector directory. `node_exporter` will pick these up. This is simpler for less dynamic metrics.
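Whichever option you pick, the payload is the same Prometheus text exposition format. The following is a minimal sketch of rendering it from C. The struct app_stats, its fields, the app_requests_total metric, and the render_metrics function are illustrative assumptions, not part of any library; an embedded HTTP handler or a file writer would send the resulting buffer.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical snapshot of application counters; field names are illustrative. */
struct app_stats {
    unsigned long requests_total; /* monotonically increasing counter */
    double        latency_ms;     /* most recent average latency */
};

/*
 * Render the stats into buf using the Prometheus text exposition format.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if buf is too small or formatting fails.
 */
int render_metrics(const struct app_stats *s, char *buf, size_t len) {
    int n = snprintf(buf, len,
        "# HELP app_requests_total Total requests handled.\n"
        "# TYPE app_requests_total counter\n"
        "app_requests_total %lu\n"
        "# HELP app_request_latency_ms Average request latency in milliseconds.\n"
        "# TYPE app_request_latency_ms gauge\n"
        "app_request_latency_ms %.3f\n",
        s->requests_total, s->latency_ms);
    if (n < 0 || (size_t)n >= len) {
        return -1;
    }
    return n;
}
```

Keeping the rendering separate from the transport means the same function can back a direct HTTP endpoint today and a textfile writer tomorrow.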

Let’s illustrate the textfile collector approach. Assume your C application periodically calculates a request latency metric. It can write this to a file like /var/lib/node_exporter/textfile_collector/app_metrics.prom.

Example C code snippet (conceptual, using standard C file I/O):

#include <stdio.h>
#include <stdlib.h>

// Assume 'calculate_request_latency()' returns latency in milliseconds
double calculate_request_latency(void) {
    // ... your latency calculation logic ...
    return (double)(rand() % 1000); // Placeholder
}

void write_app_metrics(const char *filepath) {
    FILE *f = fopen(filepath, "w");
    if (!f) {
        perror("Failed to open metrics file");
        return;
    }

    double latency = calculate_request_latency();

    // No explicit timestamp: the textfile collector rejects files that carry
    // client-side timestamps. No instance label either: Prometheus attaches
    // its own instance label when it scrapes node_exporter.
    fprintf(f, "# HELP app_request_latency_ms Average request latency in milliseconds.\n");
    fprintf(f, "# TYPE app_request_latency_ms gauge\n");
    fprintf(f, "app_request_latency_ms %f\n", latency);

    fclose(f);
}

int main() {
    // ... your application logic ...
    const char* metrics_file = "/var/lib/node_exporter/textfile_collector/app_metrics.prom";
    // Call this periodically, e.g., every 30 seconds
    write_app_metrics(metrics_file);
    // ...
    return 0;
}

Ensure your C application has write permissions to /var/lib/node_exporter/textfile_collector/. Prometheus will scrape http://:9100/metrics, and the textfile collector module will automatically expose the contents of the .prom files.
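One caveat with writing the .prom file in place is that `node_exporter` may read it while it is half written. A common remedy is to write to a temporary file in the same directory and rename it over the target, since rename is atomic within one filesystem. The sketch below assumes that approach; the function name write_metrics_atomic and the .tmp suffix are illustrative choices. The suffix matters because the collector only reads files ending in .prom, so the temporary file is ignored.

```c
#include <stdio.h>

/*
 * Write body to final_path atomically: write "<final_path>.tmp" first,
 * then rename() it over the target so readers never see a partial file.
 * Returns 0 on success, -1 on any error.
 */
int write_metrics_atomic(const char *final_path, const char *body) {
    char tmp_path[4096];
    int n = snprintf(tmp_path, sizeof tmp_path, "%s.tmp", final_path);
    if (n < 0 || (size_t)n >= sizeof tmp_path) {
        return -1; /* path too long for the local buffer */
    }

    FILE *f = fopen(tmp_path, "w");
    if (!f) {
        return -1;
    }

    int ok = (fputs(body, f) != EOF);
    if (fclose(f) != 0) {
        ok = 0; /* fclose flushes buffered data; failure means the file is incomplete */
    }

    if (!ok || rename(tmp_path, final_path) != 0) {
        remove(tmp_path); /* best-effort cleanup of the temporary file */
        return -1;
    }
    return 0;
}
```

A periodic task in the application could build the metrics text in memory and then call write_metrics_atomic("/var/lib/node_exporter/textfile_collector/app_metrics.prom", text).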

Redis Cluster Monitoring with Prometheus Redis Exporter

Monitoring a Redis cluster involves tracking node health, memory usage, command latency, and replication status. The community-maintained Redis Exporter (oliver006/redis_exporter) is the most widely used tool for this.

Deploying Redis Exporter

You can deploy the Redis Exporter as a Docker container or a standalone binary. For AWS, running it on an EC2 instance that can reach your Redis nodes (e.g., within the same VPC or via security group rules) is common. We’ll use a systemd service for robustness.

# Download a pinned release (v1.47.0 shown here; check the releases page for newer versions)
wget https://github.com/oliver006/redis_exporter/releases/download/v1.47.0/redis_exporter-v1.47.0.linux-amd64.tar.gz
tar xvfz redis_exporter-v1.47.0.linux-amd64.tar.gz
sudo mv redis_exporter-v1.47.0.linux-amd64/redis_exporter /usr/local/bin/

# Create a systemd service file: /etc/systemd/system/redis_exporter.service
[Unit]
Description=Prometheus Redis Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=redis_exporter
ExecStart=/usr/local/bin/redis_exporter \
  --redis.addr=redis://<redis-master-ip>:6379 \
  --redis.password=<your-redis-password> \
  --web.listen-address=0.0.0.0:9121 \
  --namespace=redis

[Install]
WantedBy=multi-user.target

Replace <redis-master-ip> and <your-redis-password> with your own values. Each exporter process scrapes the single address given in --redis.addr, so for a Sentinel or Cluster deployment make sure every node is covered: either run one exporter per node, or use the exporter’s multi-target /scrape endpoint and list each node as a Prometheus target.

sudo useradd -rs /bin/false redis_exporter
sudo systemctl daemon-reload
sudo systemctl enable redis_exporter
sudo systemctl start redis_exporter
sudo systemctl status redis_exporter

Redis Exporter Configuration for Clusters

For Redis Cluster, scrape every node, since each scrape reports on one Redis instance. Key metrics to monitor include:

redis_up: Whether the exporter can connect to Redis.
redis_memory_used_bytes: Memory usage of the Redis instance.
redis_connected_clients: Number of connected clients.
redis_commands_processed_total: Total commands processed.
redis_instantaneous_ops_per_sec: Current operations per second.
redis_db_keys: Number of keys per database.
redis_cluster_slots_assigned: Number of slots assigned to a node (for cluster nodes).
redis_cluster_slots_ok: Number of slots in OK state (for cluster nodes).
redis_cluster_slots_pfail: Number of slots in PFAIL state (for cluster nodes).
redis_cluster_slots_fail: Number of slots in FAIL state (for cluster nodes).
redis_connected_slaves: Number of connected replicas for a master.
redis_master_link_up: Whether a replica’s link to its master is up (1) or down (0).

Per-database key counts are exported by default. You can collect additional per-key metrics using flags like --check-keys and --check-single-keys; when using these against a cluster, also set --is-cluster.

AWS CloudWatch Alarms for Critical Metrics

While Prometheus provides deep visibility, AWS CloudWatch is essential for infrastructure-level alarms and integration with AWS services like SNS for notifications. We’ll set up alarms for both EC2 instances running our C app and the Redis cluster nodes.

EC2 Instance Alarms (for C App Host)

Key CloudWatch metrics to alarm on for the EC2 instance hosting your C application:

CPUUtilization: High CPU can indicate application issues or resource starvation. Set a threshold like > 90% for 5 minutes.
mem_used_percent: (Requires the CloudWatch agent; EC2 does not publish memory metrics by default.) High memory usage can lead to OOM kills. Set a threshold like > 85% for 10 minutes.
DiskReadOps / DiskWriteOps: High I/O can indicate performance bottlenecks. These EC2 metrics cover instance store volumes only; use the EBS metrics for EBS-backed disks.
NetworkIn / NetworkOut: Unexpected spikes or drops can signal issues.
StatusCheckFailed: Any failure here (System or Instance) is critical. Alarm on > 0 for 1 minute.
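A threshold such as "> 90% for 5 minutes" means the alarm fires only when every datapoint in the evaluation window breaches the threshold; with a 60-second period that is five consecutive datapoints. The sketch below is a simplified model of that rule, useful for reasoning about thresholds before creating alarms. The function name alarm_breached is hypothetical, and real CloudWatch alarms add options this model ignores, such as "M out of N" datapoints and missing-data handling.

```c
#include <stddef.h>

/*
 * Return 1 if the most recent `periods` samples are all strictly above
 * `threshold`, otherwise 0. Samples are ordered oldest first.
 */
int alarm_breached(const double *samples, size_t count,
                   double threshold, size_t periods) {
    if (periods == 0 || count < periods) {
        return 0; /* not enough data to evaluate the window */
    }
    for (size_t i = count - periods; i < count; i++) {
        if (samples[i] <= threshold) {
            return 0; /* one datapoint at or below the threshold resets the alarm */
        }
    }
    return 1;
}
```

Requiring consecutive breaches is what keeps a single short CPU spike from paging anyone.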

To collect memory metrics such as mem_used_percent, install and configure the CloudWatch agent:

# Example installation (Amazon Linux 2)
sudo yum install amazon-cloudwatch-agent -y

# Create configuration file (e.g., /opt/aws/amazon-cloudwatch-agent/bin/config.json)
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "CApp/EC2",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "totalcpu": true
      },
      "disk": {
        "measurement": [
          "disk_total",
          "disk_used_percent",
          "disk_inodes_free"
        ],
        "resources": [
          "/"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent",
          "mem_total",
          "mem_used"
        ]
      },
      "net": {
        "measurement": [
          "bytes_sent",
          "bytes_recv",
          "packets_sent",
          "packets_recv"
        ]
      }
    }
  }
}

# Start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
sudo systemctl status amazon-cloudwatch-agent

Then, create CloudWatch Alarms via the AWS Console or CLI, targeting the custom namespace (e.g., CApp/EC2) and metrics like mem_used_percent.

Redis Cluster Alarms (via Prometheus Integration)

Define alerting rules in Prometheus and let Alertmanager route the resulting alerts to AWS SNS topics or other notification channels. Here’s a sample Prometheus alerting rule file for Redis cluster health:

groups:
- name: redis_cluster_alerts
  rules:
  - alert: RedisClusterNodeDown
    expr: redis_up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Redis node {{ $labels.instance }} is down."
      description: "The Redis exporter reports that the Redis node {{ $labels.instance }} is unreachable for more than 5 minutes."

  - alert: RedisHighMemoryUsage
    # redis_memory_max_bytes reflects maxmemory; the > 0 filter skips nodes with no limit set
    expr: redis_memory_used_bytes / (redis_memory_max_bytes > 0) * 100 > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Redis node {{ $labels.instance }} has high memory usage."
      description: "Redis node {{ $labels.instance }} is using {{ $value | printf \"%.2f\" }}% of its memory."

  - alert: RedisClusterSlotsDegraded
    expr: redis_cluster_slots_pfail + redis_cluster_slots_fail > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Redis cluster has degraded slots on {{ $labels.instance }}."
      description: "Redis node {{ $labels.instance }} reports {{ $value }} slots in PFAIL or FAIL state."

  - alert: RedisReplicationLinkDown
    expr: redis_master_link_up == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis replication link is down on {{ $labels.instance }}."
      description: "The replication link for Redis node {{ $labels.instance }} is not OK."

Configure Alertmanager to route these alerts to an SNS topic. This allows you to integrate with Slack, PagerDuty, or email notifications.

Advanced Diagnostics and Troubleshooting

When issues arise, a systematic approach is key. Here are common diagnostic steps:

C Application Issues

Check Logs: Ensure your C application logs errors and warnings verbosely. Use journalctl -u your_app.service -f or check log files.
Resource Limits: Verify system resource limits (ulimit) for the application user.
Core Dumps: Configure core dumps for crashes. Analyze them with gdb: gdb /path/to/your_app /path/to/core_dump.
Strace: Trace system calls to understand I/O or network issues: strace -p $(pgrep your_app).
Valgrind: Detect memory leaks and errors: valgrind --leak-check=full ./your_app.
Prometheus Metrics: Analyze app_request_latency_ms, error counters, and resource usage exposed by your app. Look for correlations with system metrics.
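The resource-limit and core-dump checks above can also be done from inside the process, which helps when the service manager applies limits that differ from your shell’s ulimit. This is a minimal sketch using the POSIX getrlimit and setrlimit calls; the helper names print_limit and enable_core_dumps are illustrative. Whether a core file is actually written also depends on system settings such as kernel.core_pattern.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print the soft and hard limit for one resource. Returns 0 on success, -1 on error. */
int print_limit(const char *name, int resource) {
    struct rlimit rl;
    if (getrlimit(resource, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    printf("%s: soft=", name);
    if (rl.rlim_cur == RLIM_INFINITY) printf("unlimited");
    else printf("%llu", (unsigned long long)rl.rlim_cur);
    printf(" hard=");
    if (rl.rlim_max == RLIM_INFINITY) printf("unlimited");
    else printf("%llu", (unsigned long long)rl.rlim_max);
    printf("\n");
    return 0;
}

/*
 * Raise the soft core-file size limit to the hard limit so that a crash
 * is allowed to leave a core file. Returns 0 on success, -1 on error.
 */
int enable_core_dumps(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        return -1;
    }
    rl.rlim_cur = rl.rlim_max; /* an unprivileged process may raise soft up to hard */
    return setrlimit(RLIMIT_CORE, &rl) == 0 ? 0 : -1;
}
```

Calling print_limit("open files", RLIMIT_NOFILE) and enable_core_dumps() early in main, and logging the result, makes limit-related failures much easier to diagnose later.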

Redis Cluster Issues

Redis CLI: Connect directly: redis-cli -c -h -p 6379. Use commands like CLUSTER INFO, CLUSTER NODES, INFO memory, INFO replication, SLOWLOG GET 10.
Redis Exporter Metrics: Check redis_up, redis_cluster_slots_fail, redis_master_link_up.
Network Connectivity: Ensure instances can reach each other on port 6379 (and 16379 for cluster bus). Use telnet 6379 or nc -vz 6379.
Security Groups/NACLs: Verify AWS network rules allow traffic between Redis nodes and between Redis nodes and the Redis exporter/application instances.
Persistence: Check RDB/AOF status and disk space if persistence is enabled.
Client Connection Issues: Monitor redis_connected_clients and check application logs for Redis connection errors.
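If the C application needs to run the same cluster check itself, the CLUSTER INFO reply is easy to parse: it is a series of key:value lines separated by CRLF. The sketch below assumes the reply text has already been fetched (for example through a Redis client library) and only shows the parsing. The helper names info_field and cluster_healthy are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Find `key` in "key:value" text (lines separated by \r\n or \n) and copy
 * its value into out. Returns 0 if found and it fits, -1 otherwise.
 */
int info_field(const char *text, const char *key, char *out, size_t out_len) {
    size_t key_len = strlen(key);
    const char *line = text;
    while (*line) {
        size_t line_len = strcspn(line, "\r\n");
        /* Match the whole key: it must be followed directly by ':' */
        if (line_len > key_len &&
            strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
            size_t val_len = line_len - key_len - 1;
            if (val_len >= out_len) {
                return -1; /* value does not fit in the caller's buffer */
            }
            memcpy(out, line + key_len + 1, val_len);
            out[val_len] = '\0';
            return 0;
        }
        line += line_len;
        line += strspn(line, "\r\n"); /* skip the line terminator(s) */
    }
    return -1;
}

/* Return 1 if cluster_state is "ok" and no slots are in FAIL state, else 0. */
int cluster_healthy(const char *cluster_info) {
    char state[16];
    char fail[16];
    if (info_field(cluster_info, "cluster_state", state, sizeof state) != 0) return 0;
    if (info_field(cluster_info, "cluster_slots_fail", fail, sizeof fail) != 0) return 0;
    return strcmp(state, "ok") == 0 && strcmp(fail, "0") == 0;
}
```

The same info_field helper works on INFO memory and INFO replication output, since those replies use the same key:value layout.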

By combining application-level metrics (Prometheus), infrastructure metrics (Node Exporter, CloudWatch Agent), and cloud-native monitoring (CloudWatch Alarms), you build a resilient system capable of both proactive health management and rapid issue resolution for your C application and Redis clusters on AWS.