Server Monitoring Best Practices: Keeping Your C App and Redis Clusters Alive on AWS
Proactive C Application Health Checks with Prometheus and Node Exporter
Maintaining the health of a C application, especially one deployed on AWS, requires more than just basic process monitoring. We need to expose application-specific metrics and ensure the underlying host is performing optimally. Prometheus, coupled with `node_exporter`, provides a robust solution for this. For C applications, we’ll leverage custom exporters or direct metric exposition.
First, let’s configure `node_exporter` to gather system-level metrics. This is typically run as a systemd service on your EC2 instances.
Systemd Service for Node Exporter
Create a systemd unit file, for example, /etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter \
    --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
    --web.listen-address=0.0.0.0:9100

[Install]
WantedBy=multi-user.target
Ensure the user `prometheus` exists and has appropriate permissions for the textfile collector directory. Then, enable and start the service:
sudo useradd -rs /bin/false prometheus
sudo mkdir -p /var/lib/node_exporter/textfile_collector
sudo chown prometheus:prometheus /var/lib/node_exporter/textfile_collector
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter
Exposing C Application Metrics
For C applications, you can either:
- Directly expose metrics: Embed a small HTTP server within your C application that exposes metrics in Prometheus text format. This is ideal for high-frequency, application-specific metrics.
- Use a custom exporter: Write a separate process (e.g., in Python or Go) that queries your C application (via IPC, shared memory, or a custom protocol) and exposes these metrics to Prometheus.
- Leverage textfile collectors: Have your C application write metric files to the `/var/lib/node_exporter/textfile_collector` directory. `node_exporter` will pick these up. This is simpler for less dynamic metrics.
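Whichever option you choose, the payload is the same Prometheus text format. The following is a minimal sketch of rendering it into a buffer, which an embedded HTTP handler could return or a textfile writer could save to disk. The `app_metrics` struct, the `render_metrics` function, and the `app_requests_total` metric are illustrative assumptions, not part of any library.

```c
#include <stdio.h>

/* Hypothetical metric snapshot; the field names are illustrative only. */
struct app_metrics {
    unsigned long requests_total;  /* monotonically increasing counter */
    double request_latency_ms;     /* most recent average latency */
};

/*
 * Render the snapshot in the Prometheus text exposition format.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if the buffer is too small or formatting fails.
 */
int render_metrics(const struct app_metrics *m, char *buf, size_t buflen) {
    int n = snprintf(buf, buflen,
        "# HELP app_requests_total Total requests handled.\n"
        "# TYPE app_requests_total counter\n"
        "app_requests_total %lu\n"
        "# HELP app_request_latency_ms Average request latency in milliseconds.\n"
        "# TYPE app_request_latency_ms gauge\n"
        "app_request_latency_ms %.3f\n",
        m->requests_total, m->request_latency_ms);
    if (n < 0 || (size_t)n >= buflen) {
        return -1;
    }
    return n;
}
```

Keeping the rendering separate from the transport means the same function can back an HTTP endpoint today and a `.prom` file tomorrow.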
Let’s illustrate the textfile collector approach. Assume your C application periodically calculates a request latency metric. It can write this to a file like /var/lib/node_exporter/textfile_collector/app_metrics.prom.
Example C code snippet (conceptual, using POSIX file I/O):
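#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

// Assume 'calculate_request_latency()' returns latency in milliseconds
double calculate_request_latency(void) {
    // ... your latency calculation logic ...
    return (double)(rand() % 1000); // Placeholder
}

void write_app_metrics(const char *filepath) {
    // Write to a temporary file and rename it into place so that
    // node_exporter never reads a partially written file.
    char tmppath[4096];
    if (snprintf(tmppath, sizeof(tmppath), "%s.tmp", filepath) >= (int)sizeof(tmppath)) {
        fprintf(stderr, "Metrics path too long\n");
        return;
    }
    FILE *f = fopen(tmppath, "w");
    if (!f) {
        perror("Failed to open metrics file");
        return;
    }

    char host[256];
    if (gethostname(host, sizeof(host)) != 0) {
        snprintf(host, sizeof(host), "unknown");
    }
    host[sizeof(host) - 1] = '\0'; // gethostname may not NUL-terminate on truncation

    double latency = calculate_request_latency();
    fprintf(f, "# HELP app_request_latency_ms Average request latency in milliseconds.\n");
    fprintf(f, "# TYPE app_request_latency_ms gauge\n");
    // No timestamp: the textfile collector does not support client-side timestamps.
    // The label is named 'host' because Prometheus attaches its own 'instance' label.
    fprintf(f, "app_request_latency_ms{host=\"%s\"} %f\n", host, latency);

    if (fclose(f) != 0 || rename(tmppath, filepath) != 0) {
        perror("Failed to publish metrics file");
    }
}

int main(void) {
    // ... your application logic ...
    const char *metrics_file = "/var/lib/node_exporter/textfile_collector/app_metrics.prom";
    // Call this periodically, e.g., every 30 seconds
    write_app_metrics(metrics_file);
    // ...
    return 0;
}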
Ensure the user running your C application has write permission on /var/lib/node_exporter/textfile_collector/. Prometheus scrapes `node_exporter` on port 9100, and the textfile collector exposes the contents of every .prom file in that directory alongside the host metrics.
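The latency value itself has to come from somewhere. As a rough sketch, assuming you time each request with `clock_gettime(CLOCK_MONOTONIC, ...)`, the helpers below accumulate observations and produce the average that gets written to the metrics file. The `latency_tracker` struct and its functions are hypothetical names introduced for illustration.

```c
#include <time.h>

/* Hypothetical accumulator for request latencies; not part of any library. */
struct latency_tracker {
    double total_ms;      /* sum of observed latencies since the last reset */
    unsigned long count;  /* number of observations since the last reset */
};

/* Milliseconds elapsed between two clock readings taken from the same clock. */
double elapsed_ms(const struct timespec *start, const struct timespec *end) {
    return (double)(end->tv_sec - start->tv_sec) * 1000.0
         + (double)(end->tv_nsec - start->tv_nsec) / 1e6;
}

/* Record one request's latency. */
void latency_observe(struct latency_tracker *t, double ms) {
    t->total_ms += ms;
    t->count++;
}

/* Return the average since the last reset (0 if nothing was observed), then reset. */
double latency_average_and_reset(struct latency_tracker *t) {
    double avg = t->count ? t->total_ms / (double)t->count : 0.0;
    t->total_ms = 0.0;
    t->count = 0;
    return avg;
}
```

In use, you would take one clock reading before handling a request and one after, pass both to `elapsed_ms`, feed the result to `latency_observe`, and call `latency_average_and_reset` each time the metrics file is rewritten. A multi-threaded application would also need a lock or atomics around the tracker, which this sketch omits.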
Redis Cluster Monitoring with Prometheus Redis Exporter
Monitoring a Redis cluster involves tracking node health, memory usage, command latency, and replication status. The community-maintained `redis_exporter` (oliver006/redis_exporter) is the standard tool for this.
Deploying Redis Exporter
You can deploy the Redis Exporter as a Docker container or a standalone binary. For AWS, running it on an EC2 instance that can reach your Redis nodes (e.g., within the same VPC or via security group rules) is common. We’ll use a systemd service for robustness.
# Download the latest release
wget https://github.com/oliver006/redis_exporter/releases/download/v1.47.0/redis_exporter-v1.47.0.linux-amd64.tar.gz
tar xvfz redis_exporter-v1.47.0.linux-amd64.tar.gz
sudo mv redis_exporter-v1.47.0.linux-amd64/redis_exporter /usr/local/bin/

# Create a systemd service file: /etc/systemd/system/redis_exporter.service
[Unit]
Description=Prometheus Redis Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=redis_exporter
ExecStart=/usr/local/bin/redis_exporter \
    --redis.addr=redis://<redis-master-ip>:6379 \
    --redis.password=<your-redis-password> \
    --web.listen-address=0.0.0.0:9121 \
    --namespace=redis

[Install]
WantedBy=multi-user.target
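Replace <redis-master-ip> and <your-redis-password> with your own values. Keep in mind that the exporter reports on the instance it connects to: pointing it at a Sentinel yields Sentinel metrics, and pointing it at one cluster node yields that node's metrics and its view of the cluster. To cover every node of a Redis Cluster, run one exporter per node or use the exporter's multi-target `/scrape` endpoint, and check the README of the release you deploy for the details.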
sudo useradd -rs /bin/false redis_exporter
sudo systemctl daemon-reload
sudo systemctl enable redis_exporter
sudo systemctl start redis_exporter
sudo systemctl status redis_exporter
Redis Exporter Configuration for Clusters
Each scrape target reports the metrics of one Redis node, including that node's view of the cluster. Metric names can differ between exporter releases, so confirm them against the exporter's /metrics output. Key metrics to monitor include:
- `redis_up`: Whether the exporter can connect to Redis.
- `redis_memory_used_bytes`: Memory usage of the Redis instance.
- `redis_connected_clients`: Number of connected clients.
- `redis_commands_processed_total`: Total commands processed.
- `redis_instantaneous_ops_per_sec`: Current operations per second.
- `redis_db_keys`: Number of keys per database.
- `redis_cluster_slots_assigned`: Number of assigned slots (for cluster nodes).
- `redis_cluster_slots_ok`: Number of slots in OK state (for cluster nodes).
- `redis_cluster_slots_pfail`: Number of slots in PFAIL state (for cluster nodes).
- `redis_cluster_slots_fail`: Number of slots in FAIL state (for cluster nodes).
- `redis_connected_slaves`: Number of connected replicas for a master.
- `redis_master_link_up`: Whether a replica's link to its master is up (1) or down (0).
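You can customize which keys the exporter inspects with flags such as `--check-keys` and `--check-single-keys`, together with `--is-cluster` when those key checks run against a cluster. Per-database key counts come from `INFO keyspace` and are exported without extra flags. Run `redis_exporter --help` to see the flags your release supports.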
AWS CloudWatch Alarms for Critical Metrics
While Prometheus provides deep visibility, AWS CloudWatch is essential for infrastructure-level alarms and integration with AWS services like SNS for notifications. We’ll set up alarms for both EC2 instances running our C app and the Redis cluster nodes.
EC2 Instance Alarms (for C App Host)
Key CloudWatch metrics to alarm on for the EC2 instance hosting your C application:
- CPUUtilization: High CPU can indicate application issues or resource starvation. Set a threshold like > 90% for 5 minutes.
- Memory utilization (`mem_used_percent`, requires the CloudWatch agent): High memory usage can lead to OOM kills. Set a threshold like > 85% for 10 minutes.
- DiskReadOps/DiskWriteOps: High I/O can indicate performance bottlenecks.
- NetworkIn/NetworkOut: Unexpected spikes or drops can signal issues.
- StatusCheckFailed: Any failure here (System or Instance) is critical. Alarm on > 0 for 1 minute.
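The "for N minutes" part of each threshold matters as much as the number: it keeps a single spike from paging anyone. The sketch below illustrates that idea as a counter of consecutive breaching samples. It is only a model of the logic under the assumption of one sample per minute, not CloudWatch's actual implementation, and the `sustained_alarm` struct and `alarm_update` function are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical alarm state; illustrates the logic only. */
struct sustained_alarm {
    double threshold;    /* e.g. 90.0 for "> 90%" */
    unsigned required;   /* consecutive breaching samples needed, e.g. 5 */
    unsigned breaching;  /* current run of breaching samples */
};

/* Feed one sample; returns true while the condition has held long enough. */
bool alarm_update(struct sustained_alarm *a, double sample) {
    if (sample > a->threshold) {
        if (a->breaching < a->required) {
            a->breaching++;
        }
    } else {
        a->breaching = 0;  /* any sample at or below the threshold resets the run */
    }
    return a->breaching >= a->required;
}
```

With a threshold of 90 and five required samples, a sequence such as 95, 96, 97, 80, 99 never fires, while five consecutive readings above 90 do.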
To collect memory metrics such as mem_used_percent, install and configure the CloudWatch agent:
# Example installation (Amazon Linux 2)
sudo yum install amazon-cloudwatch-agent -y
# Create configuration file (e.g., /opt/aws/amazon-cloudwatch-agent/bin/config.json)
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "CApp/EC2",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "totalcpu": true
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "resources": [
          "/"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent",
          "mem_total",
          "mem_used"
        ]
      },
      "net": {
        "measurement": [
          "bytes_sent",
          "bytes_recv",
          "packets_sent",
          "packets_recv"
        ]
      }
    }
  }
}
# Start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
sudo systemctl status amazon-cloudwatch-agent
Then, create CloudWatch Alarms via the AWS Console or CLI, targeting the custom namespace (e.g., CApp/EC2) and metrics like mem_used_percent.
Redis Cluster Alarms (via Prometheus Integration)
Define alerting rules in Prometheus and let Alertmanager route the resulting alerts to AWS SNS topics or other notification channels. Here’s a sample Prometheus alerting rule file for Redis cluster health:
groups:
  - name: redis_cluster_alerts
    rules:
      - alert: RedisClusterNodeDown
        expr: redis_up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Redis node {{ $labels.instance }} is down."
          description: "The Redis exporter reports that the Redis node {{ $labels.instance }} has been unreachable for more than 5 minutes."
      - alert: RedisHighMemoryUsage
        # redis_memory_max_bytes reflects maxmemory; nodes without a limit (0) are skipped
        expr: redis_memory_used_bytes / (redis_memory_max_bytes > 0) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis node {{ $labels.instance }} has high memory usage."
          description: "Redis node {{ $labels.instance }} is using {{ $value | printf \"%.2f\" }}% of its memory limit."
      - alert: RedisClusterSlotsDegraded
        expr: redis_cluster_slots_pfail + redis_cluster_slots_fail > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Redis cluster has degraded slots on {{ $labels.instance }}."
          description: "Redis node {{ $labels.instance }} reports {{ $value }} slots in PFAIL or FAIL state."
      - alert: RedisReplicationLagging
        # The exporter reports the replication link as a 1/0 gauge
        expr: redis_master_link_up == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Redis replication link is down on {{ $labels.instance }}."
          description: "The replication link for Redis node {{ $labels.instance }} is not OK."
Configure Alertmanager to route these alerts to an SNS topic. This allows you to integrate with Slack, PagerDuty, or email notifications.
Advanced Diagnostics and Troubleshooting
When issues arise, a systematic approach is key. Here are common diagnostic steps:
C Application Issues
- Check Logs: Ensure your C application logs errors and warnings verbosely. Use `journalctl -u your_app.service -f` or check log files.
- Resource Limits: Verify system resource limits (`ulimit`) for the application user.
- Core Dumps: Configure core dumps for crashes. Analyze them with `gdb`: `gdb /path/to/your_app /path/to/core_dump`.
- Strace: Trace system calls to understand I/O or network issues: `strace -p $(pgrep your_app)`.
- Valgrind: Detect memory leaks and errors: `valgrind --leak-check=full ./your_app`.
- Prometheus Metrics: Analyze `app_request_latency_ms`, error counters, and resource usage exposed by your app. Look for correlations with system metrics.
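Resource limits and core dumps can also be handled from inside the application at startup. The sketch below uses the POSIX `getrlimit`/`setrlimit` calls to raise the soft core-file limit to the hard limit and to log the open-file limit. The function names `enable_core_dumps` and `log_nofile_limit` are hypothetical, and where the core file ends up still depends on the host (for example the kernel's `core_pattern` setting and any `LimitCORE=` in the systemd unit).

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Raise the soft core-file size limit to the hard limit so that a crash
 * can produce a core dump. Returns 0 on success, -1 on failure.
 */
int enable_core_dumps(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return -1;
    }
    rl.rlim_cur = rl.rlim_max;  /* the soft limit may be raised up to the hard limit */
    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return -1;
    }
    return 0;
}

/* Log the open-file limit, a frequent cause of "too many open files" errors. */
int log_nofile_limit(FILE *out) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit(RLIMIT_NOFILE)");
        return -1;
    }
    fprintf(out, "open files: soft=%llu hard=%llu\n",
            (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
    return 0;
}
```

Calling both early in `main` gives every crash report a known starting point: you know whether a core file was allowed and how many descriptors the process could open.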
Redis Cluster Issues
- Redis CLI: Connect directly: `redis-cli -c -h <redis-node-ip> -p 6379`. Use commands like `CLUSTER INFO`, `CLUSTER NODES`, `INFO memory`, `INFO replication`, `SLOWLOG GET 10`.
- Redis Exporter Metrics: Check `redis_up`, `redis_cluster_slots_fail`, and the replication link status metric.
- Network Connectivity: Ensure instances can reach each other on port 6379 (and 16379 for cluster bus). Use `telnet <redis-node-ip> 6379` or `nc -vz <redis-node-ip> 6379`.
- Security Groups/NACLs: Verify AWS network rules allow traffic between Redis nodes and between Redis nodes and the Redis exporter/application instances.
- Persistence: Check RDB/AOF status and disk space if persistence is enabled.
- Client Connection Issues: Monitor `redis_connected_clients` and check application logs for Redis connection errors.
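If your C application already talks to Redis, it can run the same `CLUSTER INFO` check itself. The following sketch parses the `key:value` lines of a `CLUSTER INFO` reply and decides whether the node considers the cluster healthy. How the reply text is obtained (for example through your Redis client library) is left out, and `cluster_info_get` and `cluster_is_healthy` are hypothetical helper names.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Find "key:value" in CLUSTER INFO-style text and copy the value into out.
 * Returns 0 on success, -1 if the key is absent or the value does not fit.
 */
int cluster_info_get(const char *info, const char *key, char *out, size_t outlen) {
    size_t keylen = strlen(key);
    const char *p = info;
    while (*p) {
        const char *eol = p + strcspn(p, "\r\n");  /* end of the current line */
        if ((size_t)(eol - p) > keylen && strncmp(p, key, keylen) == 0 && p[keylen] == ':') {
            size_t vlen = (size_t)(eol - p) - keylen - 1;
            if (vlen >= outlen) {
                return -1;
            }
            memcpy(out, p + keylen + 1, vlen);
            out[vlen] = '\0';
            return 0;
        }
        p = eol + strspn(eol, "\r\n");  /* skip the line terminator(s) */
    }
    return -1;
}

/* Returns 1 if the node reports a healthy cluster, 0 if not, -1 if fields are missing. */
int cluster_is_healthy(const char *info) {
    char state[16], pfail[16], fail[16];
    if (cluster_info_get(info, "cluster_state", state, sizeof(state)) != 0 ||
        cluster_info_get(info, "cluster_slots_pfail", pfail, sizeof(pfail)) != 0 ||
        cluster_info_get(info, "cluster_slots_fail", fail, sizeof(fail)) != 0) {
        return -1;
    }
    return strcmp(state, "ok") == 0 && atol(pfail) == 0 && atol(fail) == 0;
}
```

The result mirrors the slot-based alert condition: a state other than `ok`, or any slot in PFAIL or FAIL, counts as unhealthy.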
By combining application-level metrics (Prometheus), infrastructure metrics (Node Exporter, CloudWatch Agent), and cloud-native monitoring (CloudWatch Alarms), you build a resilient system capable of both proactive health management and rapid issue resolution for your C application and Redis clusters on AWS.