Server Monitoring Best Practices: Keeping Your C++ App and DynamoDB Clusters Alive on AWS

Proactive C++ Application Health Checks on EC2

Maintaining the health of a C++ application deployed on EC2 instances requires more than basic CPU and memory utilization checks. For production-grade systems, we need application-specific health checks that monitoring tooling can poll (for example, a script or canary that publishes the result to CloudWatch) or that a load balancer can use as its health check target. This involves creating a dedicated endpoint within your C++ application that exposes its current operational status.

A common pattern is to expose an HTTP endpoint (e.g., `/healthz`) that returns a simple status code. For more complex applications, this endpoint can perform internal checks, such as verifying database connections, checking cache availability, or ensuring critical background threads are running. We’ll use a simple HTTP server library like `cpprestsdk` for this example, but the principle applies to other libraries or even custom socket implementations.
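The internal checks described above can be aggregated independently of any HTTP library. The following is a minimal sketch, not a definitive implementation: `HealthCheck`, `HealthReport`, `run_health_checks`, and `http_status_for` are hypothetical names introduced for illustration, and the choice of 503 for the unhealthy case is an assumption (any non-2xx status is normally treated as a failure by a load balancer).

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One named probe; `probe` returns true when the dependency looks usable.
// (Hypothetical type, not part of any library.)
struct HealthCheck {
    std::string name;
    std::function<bool()> probe;
};

// Result of running every probe once.
struct HealthReport {
    bool healthy = true;
    std::vector<std::string> failed;  // names of probes that returned false or threw
};

// Runs all probes. A probe that throws is counted as failed so that a broken
// check cannot crash the health endpoint itself.
HealthReport run_health_checks(const std::vector<HealthCheck>& checks) {
    HealthReport report;
    for (const auto& check : checks) {
        assert(check.probe && "every health check needs a callable probe");
        bool ok = false;
        try {
            ok = check.probe();
        } catch (...) {
            ok = false;
        }
        if (!ok) {
            report.healthy = false;
            report.failed.push_back(check.name);
        }
    }
    return report;
}

// Maps the report onto the HTTP status a /healthz handler could return.
// Assumption: 200 when everything passes, 503 otherwise.
int http_status_for(const HealthReport& report) {
    return report.healthy ? 200 : 503;
}
```

A handler would call `run_health_checks` on each request (or on a timer, caching the result) and could include the `failed` names in the response body so that an operator can see which dependency is down.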

Implementing a Health Check Endpoint with cpprestsdk

First, ensure you have `cpprestsdk` installed and configured in your C++ project. The following code snippet demonstrates a basic health check endpoint:

#include <cpprest/http_listener.h>

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

using namespace web;
using namespace web::http;
using namespace web::http::experimental::listener;

// Global flag to simulate application state
std::atomic<bool> is_application_healthy(true);

// Replies 200 while the application is healthy and 500 otherwise.
void handle_get(http_request message) {
    ucout << message.method() << U(" request received from ") << message.remote_address() << std::endl;

    if (is_application_healthy.load()) {
        message.reply(status_codes::OK, U("Application is healthy.\n"));
    } else {
        message.reply(status_codes::InternalError, U("Application is unhealthy.\n"));
    }
}

// utility::string_t is std::string on Linux and std::wstring on Windows,
// so it matches the U() literals on both platforms.
void setup_health_endpoint(const utility::string_t& address) {
    http_listener listener(address);

    listener.support(methods::GET, handle_get);

    try {
        listener
            .open()
            .then([&address]() { ucout << U("Health check listener started on ") << address << std::endl; })
            .wait();

        // Keep the listener running
        while (true) {
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    } catch (const std::exception& e) {
        ucout << U("Error starting health check listener: ") << e.what() << std::endl;
    }
}

int main() {
    // Example: Simulate an unhealthy state after some time
    std::thread health_simulator([]() {
        std::this_thread::sleep_for(std::chrono::minutes(5));
        std::cout << "Simulating application failure..." << std::endl;
        is_application_healthy.store(false);
    });
    health_simulator.detach(); // Detach to let it run independently

    // Start the health check endpoint on port 8080
    setup_health_endpoint(U("http://0.0.0.0:8080/healthz"));

    return 0;
}

To integrate this with AWS, you would typically:

Compile this application and deploy it to your EC2 instances.
Configure your EC2 security groups to allow inbound traffic on port 8080 from your monitoring tools or load balancers.
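Publish the result of polling the `/healthz` endpoint as a custom CloudWatch metric (e.g., from a Lambda function that periodically checks the endpoint) and set a CloudWatch alarm on that metric. CloudWatch alarms evaluate metrics; they do not call endpoints themselves.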
Alternatively, configure an Application Load Balancer (ALB) or Network Load Balancer (NLB) to use this `/healthz` endpoint as its health check target.

DynamoDB Cluster Monitoring on AWS: Beyond Basic Metrics

DynamoDB is a managed service, so what you monitor are tables and their indexes rather than cluster nodes. Doing this well involves understanding not just the raw metrics provided by CloudWatch, but also how to interpret them in the context of your application’s performance and cost. Key metrics to track include:

ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits: Essential for understanding throughput usage and cost.
ProvisionedReadCapacityUnits and ProvisionedWriteCapacityUnits: Crucial for capacity planning and avoiding throttling.
ThrottledRequests: Counts requests that were throttled. This usually points to insufficient provisioned capacity or to traffic concentrated on a hot partition.
SuccessfulRequestLatency: Measures the time taken for successful read/write operations. High latency can indicate underlying issues.
SystemErrors: Tracks server-side errors within DynamoDB.
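ConditionalCheckFailedRequests: Counts conditional writes whose condition evaluated to false. Some failures are expected with optimistic locking; a sudden rise can indicate write contention or an application bug.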

Setting Up Advanced DynamoDB Alarms with CloudWatch

We need to go beyond simple threshold alerts. For instance, a sustained high rate of ThrottledRequests is a critical issue, but a single spike might be acceptable. Similarly, monitoring the ratio of consumed to provisioned capacity can provide early warnings of potential future throttling.

Here’s how you can set up more sophisticated alarms using the AWS CLI:

Alarm for Sustained Throttling

aws cloudwatch put-metric-alarm \
    --alarm-name "DynamoDB-High-ThrottledRequests-Table-XYZ" \
    --alarm-description "High number of throttled requests on Table XYZ" \
    --metric-name ThrottledRequests \
    --namespace AWS/DynamoDB \
    --statistic Sum \
    --period 300 \
    --threshold 100 \
    --comparison-operator GreaterThanOrEqualToThreshold \
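    --dimensions "Name=TableName,Value=XYZ" "Name=Operation,Value=PutItem" \
    --evaluation-periods 3 \
    --datapoints-to-alarm 3 \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:MyAlertsTopic

This alarm triggers if the Sum of ThrottledRequests for PutItem calls on Table XYZ is at least 100 in each of three consecutive 5-minute periods (15 minutes in total). This indicates a persistent problem, not a transient one. ThrottledRequests is published per table and operation, so create one alarm for each operation you care about. For a table-level or GSI-level view, alarm on ReadThrottleEvents or WriteThrottleEvents instead, which accept the TableName dimension and, optionally, GlobalSecondaryIndexName.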
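The difference between a sustained problem and a single spike comes down to how many of the most recent periods breach the threshold. The sketch below illustrates that "M out of N datapoints" idea under simplifying assumptions: `is_breaching` is a hypothetical helper, missing periods are simply absent from the input (mirroring `notBreaching`), and CloudWatch's real evaluation logic has more cases than this.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when at least `datapoints_to_alarm` of the most recent
// `evaluation_periods` per-period sums are >= threshold.
// `sums` is ordered oldest first; with fewer than `evaluation_periods`
// values there is not enough data, which this sketch treats as not breaching.
bool is_breaching(const std::vector<double>& sums, double threshold,
                  std::size_t evaluation_periods, std::size_t datapoints_to_alarm) {
    assert(datapoints_to_alarm >= 1 && datapoints_to_alarm <= evaluation_periods);
    if (sums.size() < evaluation_periods) {
        return false;
    }
    std::size_t breaching = 0;
    for (std::size_t i = sums.size() - evaluation_periods; i < sums.size(); ++i) {
        if (sums[i] >= threshold) {
            ++breaching;
        }
    }
    return breaching >= datapoints_to_alarm;
}
```

With a threshold of 100 and 3 out of 3 periods required, the series 0, 450, 0 does not alarm despite the large spike, while 120, 100, 300 does.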

Alarm for High Capacity Utilization Ratio

This requires a custom metric or a CloudWatch Metric Math expression. Let’s assume we want to alert if consumed read capacity is consistently above 80% of provisioned read capacity for a table.

aws cloudwatch put-metric-alarm \
    --alarm-name "DynamoDB-High-Read-Capacity-Utilization-Table-XYZ" \
    --alarm-description "Read capacity utilization consistently above 80% on Table XYZ" \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --threshold 0.8 \
    --evaluation-periods 2 \
    --datapoints-to-alarm 2 \
    --treat-missing-data notBreaching \
    --actions-enabled \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:MyAlertsTopic \
    --metrics '[
      {"Id": "consumed", "ReturnData": false,
       "MetricStat": {"Metric": {"Namespace": "AWS/DynamoDB", "MetricName": "ConsumedReadCapacityUnits",
                                 "Dimensions": [{"Name": "TableName", "Value": "XYZ"}]},
                      "Period": 600, "Stat": "Sum"}},
      {"Id": "provisioned", "ReturnData": false,
       "MetricStat": {"Metric": {"Namespace": "AWS/DynamoDB", "MetricName": "ProvisionedReadCapacityUnits",
                                 "Dimensions": [{"Name": "TableName", "Value": "XYZ"}]},
                      "Period": 600, "Stat": "Average"}},
      {"Id": "utilization", "Label": "ReadUtilization", "ReturnData": true,
       "Expression": "(consumed / 600) / provisioned"}
    ]'

This alarm is defined on a metric math expression rather than on a single metric, so it uses --metrics in place of --metric-name, --namespace, --statistic, --period, and --dimensions. ConsumedReadCapacityUnits is reported as a total per period, so the expression divides the 10-minute Sum by 600 seconds to get an average per-second rate, then divides that by ProvisionedReadCapacityUnits to obtain a utilization ratio. The alarm fires when the ratio is at least 0.8 in two consecutive 10-minute periods. The approach applies to tables in provisioned capacity mode; an alternative is to publish the ratio yourself as a custom metric and alarm on that.
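The arithmetic behind a consumed-to-provisioned ratio is easy to get wrong because the two metrics use different units: consumed capacity is a total over the period, while provisioned capacity is a per-second rate. A small sketch of the calculation, where `read_utilization` is a hypothetical helper and not part of any AWS SDK:

```cpp
#include <cassert>

// Fraction of provisioned read capacity used during one period.
//   consumed_sum            total read capacity units consumed in the period
//   period_seconds          length of the period in seconds
//   provisioned_per_second  provisioned read capacity units (a per-second rate)
double read_utilization(double consumed_sum, double period_seconds,
                        double provisioned_per_second) {
    assert(period_seconds > 0.0);
    if (provisioned_per_second <= 0.0) {
        return 0.0;  // nothing provisioned, so no meaningful ratio in this sketch
    }
    const double consumed_per_second = consumed_sum / period_seconds;
    return consumed_per_second / provisioned_per_second;
}
```

For example, 48,000 units consumed over a 600-second period is 80 units per second; against 100 provisioned units that is a utilization of 0.8, exactly the alarm threshold discussed above.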

Leveraging CloudWatch Logs for Deeper Insights

While metrics provide a high-level view, CloudWatch Logs can offer granular details about application behavior and potential issues. For your C++ application, ensure that you are logging:

Application startup and shutdown events.
Errors and exceptions with stack traces.
Key business logic execution points.
Performance-critical operations (e.g., database queries, external API calls).
Health check endpoint requests and responses.
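Whatever you log, a consistent line layout makes the logs far easier to filter later. The following sketch shows one possible layout, a UTC timestamp followed by a level and a single-line message. The `LogLevel` enum, `level_name`, and `format_log_line` are hypothetical names, and `gmtime_r` assumes a POSIX system such as the Linux distributions typically run on EC2.

```cpp
#include <cassert>
#include <ctime>
#include <string>

enum class LogLevel { Info, Warn, Error };

// Text form of the level as it appears in the log line.
const char* level_name(LogLevel level) {
    switch (level) {
        case LogLevel::Info:  return "INFO";
        case LogLevel::Warn:  return "WARN";
        case LogLevel::Error: return "ERROR";
    }
    return "UNKNOWN";
}

// Builds "<UTC timestamp> <LEVEL> <message>", one event per line.
// `when` is passed in explicitly so the output is deterministic and testable.
std::string format_log_line(LogLevel level, const std::string& message, std::time_t when) {
    assert(message.find('\n') == std::string::npos && "one event per line");
    std::tm utc{};
    gmtime_r(&when, &utc);  // POSIX, thread-safe variant of gmtime
    char stamp[32];
    std::strftime(stamp, sizeof stamp, "%Y-%m-%dT%H:%M:%SZ", &utc);
    return std::string(stamp) + " " + level_name(level) + " " + message;
}
```

In the application you would pass `std::time(nullptr)` as `when` and write the result to the file or stream that the log shipper forwards to CloudWatch Logs.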

DynamoDB itself does not write per-request logs to CloudWatch Logs. For API call tracking, rely on CloudTrail: control-plane operations (such as CreateTable and UpdateTable) are recorded by default, and data-plane operations (such as GetItem and PutItem) can be captured by enabling data events. Data events can provide very detailed information about request activity, but they should be used judiciously due to potential cost and volume implications.

Setting up Log Metric Filters and Alarms

You can create CloudWatch Alarms based on patterns found in your application logs. For example, to alert on critical errors logged by your C++ application:

# First, create a metric filter for your C++ app's log group
aws logs put-metric-filter \
    --log-group-name "/aws/ecs/my-cpp-app" \
    --filter-name "CriticalErrors" \
    --filter-pattern 'ERROR "Critical failure detected"' \
    --metric-transformations metricName=CriticalErrorCount,metricNamespace=MyCppApp,metricValue=1,defaultValue=0

# Then, create an alarm based on the new metric
aws cloudwatch put-metric-alarm \
    --alarm-name "CppApp-CriticalErrors-Detected" \
    --alarm-description "Critical errors detected in C++ application logs" \
    --metric-name CriticalErrorCount \
    --namespace "MyCppApp" \
    --statistic Sum \
    --period 300 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 1 \
    --datapoints-to-alarm 1 \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:MyAlertsTopic

This setup first defines a filter to count occurrences of a specific “Critical failure detected” log message. The alarm then triggers if that count is greater than zero within a 5-minute period, meaning at least one critical error was logged. Missing data is treated as not breaching, so periods with no log activity do not raise the alarm. You would adjust the filter pattern and alarm logic based on your specific error logging strategy.

Integrating C++ Application Metrics with CloudWatch Agent

For more detailed application-level metrics beyond what a simple HTTP health check provides, you can leverage the CloudWatch Agent. This allows you to collect custom metrics from your C++ application and send them directly to CloudWatch. This is particularly useful for tracking internal application states, queue depths, or custom performance counters.

Configuring the CloudWatch Agent for Custom Metrics

You’ll need to install the CloudWatch Agent on your EC2 instances. The agent is configured via a JSON file. Here’s an example configuration that collects custom metrics from a C++ application that emits StatsD metrics over UDP or exposes a Prometheus endpoint:

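{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "namespace": "MyCppAppMetrics",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "statsd": {
        "service_address": "127.0.0.1:8125",
        "metrics_collection_interval": 60,
        "metrics_aggregation_interval": 60
      }
    }
  },
  "logs": {
    "metrics_collected": {
      "prometheus": {
        "log_group_name": "/aws/ecs/my-cpp-app",
        "prometheus_config_path": "/opt/cloudwatch-agent/prometheus-config.yml",
        "emf_processor": {
          "metric_namespace": "MyCppAppMetrics",
          "metric_declaration": [
            {
              "source_labels": ["job"],
              "label_matcher": "^my-cpp-app$",
              "dimensions": [["job"]],
              "metric_selectors": ["^http_requests_total$"]
            }
          ]
        }
      }
    }
  }
}

In this configuration:

We’ve set a default metrics_collection_interval of 60 seconds.
A custom namespace “MyCppAppMetrics” is defined.
InstanceId is automatically appended as a dimension.
StatsD collection is enabled under metrics (if your C++ app can emit StatsD metrics). The agent listens for UDP only, so service_address is written as ip:port without a protocol prefix.
Prometheus collection is enabled under logs (if your C++ app exposes a Prometheus-compatible metrics endpoint). The agent scrapes the targets listed in the Prometheus config file, where the scrape interval is also set, and converts the samples selected by metric_declaration into CloudWatch metrics. The job name my-cpp-app is an assumption here and must match the scrape job in your Prometheus config file.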

Your C++ application would then need to be instrumented to send metrics. For StatsD, this could involve a UDP client sending strings like my_custom_metric:1|c. For Prometheus, it would expose an HTTP endpoint (e.g., `/metrics`) in the Prometheus text format.
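For the StatsD path, the wire format is simple enough to produce without a client library. The sketch below formats counter and gauge lines and sends one line per UDP datagram using POSIX sockets. The function names are hypothetical, and the default host and port (127.0.0.1:8125) are assumptions that must match the agent's `service_address`.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>
#include <string>

// Formats a StatsD counter line: "<name>:<delta>|c".
std::string statsd_counter(const std::string& name, long long delta) {
    assert(!name.empty() && name.find_first_of(":|\n") == std::string::npos);
    return name + ":" + std::to_string(delta) + "|c";
}

// Formats a StatsD gauge line: "<name>:<value>|g" (e.g. a queue depth).
std::string statsd_gauge(const std::string& name, long long value) {
    assert(!name.empty() && name.find_first_of(":|\n") == std::string::npos);
    assert(value >= 0 && "a signed gauge value means a relative change in StatsD");
    return name + ":" + std::to_string(value) + "|g";
}

// Sends one line as a single UDP datagram. Returns true if the datagram was
// handed to the kernel; UDP gives no delivery guarantee beyond that.
bool send_statsd(const std::string& line, const char* host = "127.0.0.1",
                 std::uint16_t port = 8125) {
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (::inet_pton(AF_INET, host, &addr.sin_addr) != 1) {
        return false;  // host is not a valid IPv4 literal
    }
    const int fd = ::socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        return false;
    }
    const ssize_t sent = ::sendto(fd, line.data(), line.size(), 0,
                                  reinterpret_cast<const sockaddr*>(&addr), sizeof addr);
    ::close(fd);
    return sent == static_cast<ssize_t>(line.size());
}
```

A request handler could then call `send_statsd(statsd_counter("my_custom_metric", 1))`. Opening a socket per call keeps the sketch short; a long-running service would more likely keep one socket open and reuse it.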

Instrumenting C++ for StatsD or Prometheus

Using a library like `prometheus-cpp` for Prometheus metrics:

#include <prometheus/counter.h>
#include <prometheus/exposer.h>
#include <prometheus/registry.h>

#include <chrono>
#include <iostream>
#include <memory>
#include <string>
#include <thread>

// Global Prometheus registry
auto registry = std::make_shared<prometheus::Registry>();

// Define a counter family named http_requests_total and add one labeled series to it
auto& request_counter = prometheus::BuildCounter()
    .Name("http_requests_total")
    .Help("Total number of HTTP requests processed")
    .Register(*registry)
    .Add({{"handler", "main"}});

void start_prometheus_exposer(int port) {
    prometheus::Exposer exposer{"0.0.0.0:" + std::to_string(port)};
    exposer.RegisterCollectable(registry);
    std::cout << "Prometheus metrics exposed on port " << port << std::endl;
    // Keep the exposer running
    while(true) {
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}

int main() {
    // Start the Prometheus metrics exporter in a separate thread
    std::thread exposer_thread(start_prometheus_exposer, 9100); // Expose on port 9100

    // Simulate incoming requests
    while (true) {
        // In a real app, this would be incremented on each incoming request
        request_counter.Increment();
        std::cout << "Request processed. Counter incremented." << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }

    exposer_thread.join(); // This will never be reached in this example
    return 0;
}

Once the CloudWatch Agent is configured and running, and your C++ application is emitting metrics (either via StatsD or exposing a Prometheus endpoint), the agent will collect these and send them to CloudWatch under the `MyCppAppMetrics` namespace. You can then create CloudWatch Alarms based on these custom metrics, just as you would with standard AWS metrics.