Server Monitoring Best Practices: Keeping Your C++ App and Redis Clusters Alive on AWS

Proactive C++ Application Health Checks on AWS EC2

Maintaining the health of C++ applications deployed on AWS EC2 instances requires a multi-layered monitoring strategy. Beyond basic CPU and memory utilization, we need to inspect application-specific metrics and ensure critical processes are running. This involves leveraging system-level tools and custom application instrumentation.

Implementing a Process Watchdog with `systemd`

For robust process management, `systemd` is the de facto standard on modern Linux distributions. We can configure it to automatically restart our C++ application if it crashes. This is achieved by creating a `.service` unit file.

Example `systemd` Service File for a C++ Application

Let’s assume your C++ application is compiled into an executable named `my_cpp_app` and resides in `/opt/my_cpp_app/bin/my_cpp_app`. It might also require specific environment variables and a working directory.

[Unit]
Description=My C++ Application Service
After=network.target

[Service]
Type=simple
User=appuser
Group=appgroup
WorkingDirectory=/opt/my_cpp_app
Environment="MY_APP_CONFIG=/etc/my_cpp_app/config.yaml"
ExecStart=/opt/my_cpp_app/bin/my_cpp_app --config $MY_APP_CONFIG
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
SyslogIdentifier=my_cpp_app

[Install]
WantedBy=multi-user.target

To enable and start this service:

sudo cp my_cpp_app.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable my_cpp_app.service
sudo systemctl start my_cpp_app.service
sudo systemctl status my_cpp_app.service

The `Restart=on-failure` directive is crucial here: systemd restarts the service when it exits with a non-zero status or is terminated abnormally by a signal, but not after a clean exit. `RestartSec=5` adds a small delay before attempting a restart, preventing rapid restart loops if the application fails immediately upon startup. `StandardOutput` and `StandardError` are set to `journal` (the older `syslog` value is deprecated in current systemd releases), so application output can be read with `journalctl -u my_cpp_app.service`.
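
To make the interaction with `Restart=on-failure` concrete, here is a minimal sketch rather than the application's actual entry point. It assumes a hypothetical `app_body` callable standing in for the real main loop, and maps any escaping exception to a non-zero exit status, so that a fatal error is logged and then reported to systemd as a failure.

```cpp
#include <cstdlib>
#include <exception>
#include <functional>
#include <iostream>

// Runs the application body and converts any escaping exception into a
// non-zero exit status. `app_body` is a hypothetical stand-in for the
// real application's main loop.
int run_guarded(const std::function<void()> &app_body) {
    try {
        app_body();
        return EXIT_SUCCESS;  // clean exit: on-failure does not restart
    } catch (const std::exception &e) {
        std::cerr << "fatal: " << e.what() << std::endl;
        return EXIT_FAILURE;  // non-zero exit: restarted after RestartSec
    } catch (...) {
        std::cerr << "fatal: unknown exception" << std::endl;
        return EXIT_FAILURE;
    }
}
```

A real `main` could then be reduced to `return run_guarded(run_application);`, where `run_application` is whatever function starts the service. The error message written to standard error ends up in the same log stream as the rest of the service output.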

Application-Level Health Endpoints

Relying solely on process restarts isn’t sufficient. Your C++ application should expose a dedicated health check endpoint. This endpoint can perform deeper checks, such as verifying database connections, external service availability, or internal state consistency.
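
One way to structure those deeper checks is to register each dependency as a named probe and aggregate the results. The sketch below is illustrative only: `HealthCheck`, `HealthReport`, and `run_health_checks` are hypothetical names, and the probes themselves (database ping, downstream API call, and so on) are left to the application.

```cpp
#include <functional>
#include <string>
#include <vector>

// One named dependency probe; `probe` returns true when the dependency is usable.
struct HealthCheck {
    std::string name;
    std::function<bool()> probe;
};

// Overall result plus a JSON body suitable for a health endpoint.
struct HealthReport {
    bool healthy;
    std::string json;
};

// Runs every probe and reports "ok" only if all of them pass.
// Check names are assumed to need no JSON escaping.
HealthReport run_health_checks(const std::vector<HealthCheck> &checks) {
    bool all_ok = true;
    std::string details;
    for (const auto &check : checks) {
        bool ok = false;
        try {
            ok = check.probe();
        } catch (...) {
            ok = false;  // a throwing probe counts as a failed dependency
        }
        all_ok = all_ok && ok;
        if (!details.empty()) {
            details += ", ";
        }
        details += "\"" + check.name + "\": \"" + (ok ? "ok" : "fail") + "\"";
    }
    std::string json = std::string("{\"status\": \"") +
                       (all_ok ? "ok" : "degraded") +
                       "\", \"checks\": {" + details + "}}";
    return {all_ok, json};
}
```

The `healthy` flag maps naturally onto the HTTP status code (200 versus 503), while the per-check detail in the body tells an operator which dependency failed without reading logs.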

Implementing a Simple HTTP Health Check in C++ (using `libmicrohttpd`)

Here’s a simplified example demonstrating how to add an HTTP health check endpoint. This requires a web server library like `libmicrohttpd`.

#include <microhttpd.h>
#include <cstring>
#include <iostream>

// Assume this function checks your application's critical dependencies
bool check_dependencies() {
    // Example: Check database connection, external API status, etc.
    // For demonstration, we'll just return true.
    return true;
}

// Build a JSON response, queue it with the given HTTP status, and release it.
static MHD_Result send_json(struct MHD_Connection *connection,
                            unsigned int status, const char *body) {
    struct MHD_Response *response = MHD_create_response_from_buffer(
        strlen(body), (void *)body, MHD_RESPMEM_MUST_COPY);
    if (response == NULL) {
        return MHD_NO;
    }
    MHD_add_response_header(response, MHD_HTTP_HEADER_CONTENT_TYPE, "application/json");
    MHD_Result ret = MHD_queue_response(connection, status, response);
    MHD_destroy_response(response);
    return ret;
}

// Note: libmicrohttpd 0.9.71 and later expect handlers to return MHD_Result;
// on older versions, change the MHD_Result return types to int.
static MHD_Result health_handler(void *cls, struct MHD_Connection *connection,
                                 const char *url, const char *method,
                                 const char *version, const char *upload_data,
                                 size_t *upload_data_size, void **con_cls) {
    // Answer with a proper status code instead of returning MHD_NO,
    // which would close the connection without any response.
    if (strcmp(method, "GET") != 0) {
        return send_json(connection, MHD_HTTP_METHOD_NOT_ALLOWED,
                         "{\"status\": \"method not allowed\"}");
    }

    if (strcmp(url, "/health") != 0) {
        return send_json(connection, MHD_HTTP_NOT_FOUND,
                         "{\"status\": \"not found\"}");
    }

    if (check_dependencies()) {
        return send_json(connection, MHD_HTTP_OK, "{\"status\": \"ok\"}");
    }
    return send_json(connection, MHD_HTTP_SERVICE_UNAVAILABLE,
                     "{\"status\": \"degraded\"}");
}

int main() {
    struct MHD_Daemon *daemon;

    // The first argument is a flags value; this one runs MHD's own polling thread.
    daemon = MHD_start_daemon(MHD_USE_INTERNAL_POLLING_THREAD, 8080, NULL, NULL,
                              &health_handler, NULL, MHD_OPTION_END);
    if (daemon == NULL) {
        std::cerr << "Failed to start daemon" << std::endl;
        return 1;
    }

    std::cout << "Health check server started on port 8080" << std::endl;

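    // Keep the main thread alive until input arrives on stdin. Under systemd,
    // stdin is /dev/null by default, so a real service should wait on a signal instead.
    std::cin.get();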

    MHD_stop_daemon(daemon);
    return 0;
}

Compile this code (e.g., `g++ health_check.cpp -o health_check -lmicrohttpd`) and integrate it into your application’s startup routine or run it as a separate process managed by `systemd`. You would then configure your monitoring system (e.g., a CloudWatch Synthetics canary or Prometheus) to poll `http://<instance-address>:8080/health`, substituting the address of your instance.

Leveraging AWS CloudWatch for Metrics and Alarms

CloudWatch is indispensable for monitoring AWS resources. We’ll focus on collecting custom metrics from our C++ application and setting up alarms.

Sending Custom Metrics with the CloudWatch Agent

The CloudWatch agent can collect system-level metrics and custom application metrics. For application-specific metrics (e.g., request latency, error counts), you can use the agent’s StatsD or collectd input plugins, or directly use the AWS SDK to publish metrics.

Configuring the CloudWatch Agent for Custom Metrics

Create a configuration file (e.g., `/opt/aws/amazon-cloudwatch-agent/bin/config.json`).

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyCppApp",
    "metrics_collected": {
      "statsd": {
        "service_address": "127.0.0.1:8125",
        "metrics_collection_interval": 60
      },
      "collectd": {
        "service_address": "udp://127.0.0.1:25826",
        "collectd_typesdb": ["/usr/share/collectd/types.db"],
        "metrics_aggregation_interval": 60
      }
    }
  }
}

Your C++ application would then need to send metrics in StatsD or collectd format to `127.0.0.1:8125` or `127.0.0.1:25826` respectively. For example, using StatsD:

# Example: Sending a counter metric for errors
echo "my_cpp_app.errors:1|c" | nc -u 127.0.0.1 8125
# Example: Sending a gauge metric for active connections
echo "my_cpp_app.active_connections:15|g" | nc -u 127.0.0.1 8125
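
From inside the C++ application, the same datagrams can be sent with a plain UDP socket. The following is a minimal sketch assuming a POSIX system and the agent's StatsD listener on `127.0.0.1:8125` as configured above; `format_statsd` and `send_statsd` are hypothetical helper names, not part of any AWS library.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <string>

// Builds a StatsD line such as "my_cpp_app.errors:1|c".
// `type` is the StatsD type suffix: "c" (counter), "g" (gauge), "ms" (timer).
std::string format_statsd(const std::string &name, long value,
                          const std::string &type) {
    return name + ":" + std::to_string(value) + "|" + type;
}

// Sends one datagram to the StatsD listener. Returns true if the whole
// payload was handed to the kernel; UDP itself gives no delivery guarantee.
bool send_statsd(const std::string &line, const char *host = "127.0.0.1",
                 unsigned short port = 8125) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        return false;
    }
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, host, &addr.sin_addr) != 1) {
        close(fd);
        return false;
    }
    ssize_t sent = sendto(fd, line.data(), line.size(), 0,
                          reinterpret_cast<const sockaddr *>(&addr),
                          sizeof(addr));
    close(fd);
    return sent == static_cast<ssize_t>(line.size());
}
```

A call such as `send_statsd(format_statsd("my_cpp_app.errors", 1, "c"))` mirrors the first `nc` command. Opening a socket per metric keeps the sketch short; a busy service would more likely keep one socket open and reuse it.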

After creating the configuration file, start the agent:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s

Setting Up CloudWatch Alarms

Create alarms based on these custom metrics or standard EC2 metrics (CPUUtilization, NetworkIn/Out, DiskReadOps/WriteOps).

Example Alarm: High CPU Utilization

In the AWS Console, navigate to CloudWatch -> Alarms -> Create alarm. Select the EC2 metric `CPUUtilization` for your instance. Set the condition to “Greater than” 80% for 5 consecutive minutes. Configure an action to notify an SNS topic, which can then trigger automated remediation (e.g., scaling out, restarting an instance).

Example Alarm: Application Health Endpoint Unhealthy

If you’re using CloudWatch Synthetics Canaries to poll your `/health` endpoint, you can create an alarm based on the Canary’s success/failure rate. Alternatively, if you’re pushing custom metrics for health checks (e.g., a metric `health_check_status` with value 0 for unhealthy, 1 for healthy), create an alarm on this metric.

Metric: MyCppApp.health_check_status
Statistic: Minimum
Period: 60 seconds
Threshold type: Static
Condition: Less than
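Threshold: 1
Datapoints to alarm: 1 out of 1
Missing data treatment: Treat missing data as bad (breaching threshold)

This alarm triggers if the `health_check_status` metric is ever reported as less than 1 (i.e., 0, indicating unhealthy) within a 1-minute window. Treating missing data as breaching matters because an application that has crashed or hung stops reporting the metric altogether, and the alarm would otherwise sit in the insufficient-data state.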
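
The "Datapoints to alarm: M out of N" setting can be read as a simple counting rule. The sketch below is a simplified model of that rule for the "Less than" condition, not CloudWatch's actual evaluation logic; it ignores missing-data handling, and `breaches_m_of_n` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Returns true when at least `m` of the most recent `n` datapoints are
// strictly below `threshold`. Datapoints are ordered oldest to newest.
bool breaches_m_of_n(const std::vector<double> &datapoints, double threshold,
                     std::size_t m, std::size_t n) {
    if (m == 0 || m > n || datapoints.size() < n) {
        return false;  // not enough data, or an invalid M/N pair
    }
    std::size_t breaching = 0;
    for (std::size_t i = datapoints.size() - n; i < datapoints.size(); ++i) {
        if (datapoints[i] < threshold) {
            ++breaching;
        }
    }
    return breaching >= m;
}
```

With "1 out of 1", a single unhealthy minute is enough to alarm, which suits a binary health signal. Noisier metrics usually benefit from a wider window such as 2 out of 3, which tolerates one isolated bad datapoint.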

Monitoring Redis Clusters on AWS ElastiCache

ElastiCache for Redis provides managed Redis instances. Monitoring focuses on ElastiCache-specific metrics, Redis commands, and cluster health.

Key ElastiCache Metrics to Monitor

CPUUtilization: High CPU can indicate heavy load or inefficient queries.
DatabaseMemoryUsagePercentage: Percentage of the node’s available memory that is in use. Crucial for avoiding eviction and performance degradation.
CacheHits/CacheMisses: Ratio indicates cache efficiency. A high miss rate might mean the cache isn’t sized appropriately or data isn’t being accessed effectively.
CurrConnections: Number of active client connections.
NewConnections: Number of connections accepted during the period. Spikes can indicate clients reconnecting instead of reusing pooled connections, or sudden load.
Evictions: Number of keys evicted due to memory pressure. High evictions degrade performance.
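ReplicationLag: Reported for replica nodes, in seconds; indicates how far a replica is behind its primary. High lag impacts read consistency.
EngineCPUUtilization: CPU utilization of the Redis engine thread. Because Redis executes commands on a single thread, this is often a better saturation signal than CPUUtilization on multi-core nodes.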

These metrics are available directly in CloudWatch under the `AWS/ElastiCache` namespace.
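
The hit ratio mentioned for `CacheHits` and `CacheMisses` is not published as a metric of its own in this list, so it has to be derived from the two sums over the same period. A minimal sketch, with `cache_hit_ratio` as a hypothetical helper name:

```cpp
// Returns hits / (hits + misses) for one period, or 0.0 when there
// were no lookups at all (avoids dividing by zero on an idle cache).
double cache_hit_ratio(double hits, double misses) {
    double total = hits + misses;
    return total > 0.0 ? hits / total : 0.0;
}
```

For example, 75 hits and 25 misses in a period give a ratio of 0.75. What counts as a healthy ratio depends on the workload, so it is worth establishing a baseline before alarming on it.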

Redis Command Monitoring

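While ElastiCache abstracts much of the Redis internals, you can still monitor slow commands. The Redis slow log is available on ElastiCache: `SLOWLOG GET` returns recent entries, the `slowlog-log-slower-than` threshold is set through the parameter group, and on Redis 6.0 and later the slow log can be delivered to CloudWatch Logs or Kinesis Data Firehose. For continuous monitoring, the CloudWatch metrics related to command execution are usually the more practical starting point.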

Monitoring Slow Commands via `MONITOR` (with caution)

The `MONITOR` command in Redis logs every command received by the server. This is extremely verbose and impacts performance significantly. It’s generally **not recommended for production environments** but can be useful for debugging specific issues in a controlled, non-production setting.

# Connect to your ElastiCache endpoint
redis-cli -h your-redis-endpoint.cache.amazonaws.com -p 6379

# Enable MONITOR (use with extreme caution!)
MONITOR

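Instead of `MONITOR`, focus on CloudWatch metrics such as `CacheHits`, `CacheMisses`, and `Evictions`, along with the command latency metrics `GetTypeCmdsLatency` and `SetTypeCmdsLatency`, which point to command performance and cache effectiveness without the overhead.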

ElastiCache Cluster Health and Failover

ElastiCache automatically handles failover for Multi-AZ deployments. Monitor the `ReplicationLag` metric to ensure replicas are healthy. For Redis Cluster mode, monitor the health of individual shards.

Setting Up ElastiCache Alarms in CloudWatch

Create alarms for critical ElastiCache metrics:

Alarm: High Memory Usage

Metric: DatabaseMemoryUsagePercentage
Namespace: AWS/ElastiCache
Statistic: Average
Period: 300 seconds (5 minutes)
Threshold type: Static
Condition: Greater than
Upper threshold: 85
Datapoints to alarm: 2 out of 2

Alarm: High Eviction Rate

Metric: Evictions
Namespace: AWS/ElastiCache
Statistic: Sum
Period: 300 seconds (5 minutes)
Threshold type: Static
Condition: Greater than
Upper threshold: 10000 (adjust based on expected load)
Datapoints to alarm: 1 out of 1

Alarm: Significant Replication Lag

Metric: ReplicationLag
Namespace: AWS/ElastiCache
Statistic: Maximum
Period: 60 seconds
Threshold type: Static
Condition: Greater than
Upper threshold: 10 (adjust based on tolerance)
Datapoints to alarm: 1 out of 1

These alarms should trigger notifications to an SNS topic, allowing for investigation and potential intervention, such as scaling up the ElastiCache node type or optimizing data access patterns in your C++ application.

Integrating Application Logs with CloudWatch Logs

Centralizing logs is critical for debugging and auditing. The CloudWatch agent can tail log files and send them to CloudWatch Logs.

Configuring Log File Tailing

Add a `logs` section to your CloudWatch agent configuration file (`/opt/aws/amazon-cloudwatch-agent/bin/config.json`):

{
  "agent": { ... },
  "metrics": { ... },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my_cpp_app.log",
            "log_group_name": "my_cpp_app_logs",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          },
          {
            "file_path": "/var/log/syslog",
            "log_group_name": "system_logs",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          }
        ]
      }
    }
  }
}

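Ensure your C++ application is configured to log to `/var/log/my_cpp_app.log` (or adjust the `file_path` accordingly), and that the agent’s user can read the file. Note that `/var/log/syslog` exists on Debian and Ubuntu; on Amazon Linux and RHEL-family systems the equivalent file is typically `/var/log/messages`. After updating the agent configuration, reload it and restart the agent: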

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s

You can then create metric filters in CloudWatch Logs based on patterns in your logs (e.g., “ERROR”, “Exception”) to trigger alarms or simply search and analyze logs directly in the CloudWatch Logs console.
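
Metric filters on patterns such as “ERROR” work best when every log line carries its level in a predictable position. The sketch below shows one possible line format, assuming a POSIX system (for `gmtime_r`) and UTC timestamps to match the `timezone` setting above; `format_log_line` is a hypothetical helper, not an existing logging API.

```cpp
#include <ctime>
#include <string>

// Formats one log line as "<UTC ISO-8601 timestamp> <LEVEL> <message>".
std::string format_log_line(std::time_t when, const std::string &level,
                            const std::string &message) {
    std::tm utc{};
    gmtime_r(&when, &utc);  // POSIX, thread-safe variant of gmtime
    char stamp[32];
    std::strftime(stamp, sizeof(stamp), "%Y-%m-%dT%H:%M:%SZ", &utc);
    return std::string(stamp) + " " + level + " " + message;
}
```

The application would append each formatted line to `/var/log/my_cpp_app.log`, for instance through a `std::ofstream` opened in append mode. A filter pattern of `ERROR` then counts only lines logged at that level, provided the word does not also appear in ordinary message text.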