Server Monitoring Best Practices: Keeping Your C App and DynamoDB Clusters Alive on OVH

Proactive C Application Health Checks with Systemd

For C applications deployed on OVH infrastructure, robust health checking is essential. We’ll use systemd to confirm that our C services are not only running but also responsive. systemd does not poll HTTP endpoints on its own, so the approach has two parts: the C application exposes a health check endpoint that is probed once at startup, and systemd’s watchdog (or an external prober) covers liveness while the service runs.

Consider a C application that exposes a simple HTTP endpoint for health checks. This endpoint should return a 200 OK status code if the application is healthy, and a non-2xx status code otherwise. For simplicity, let’s assume a basic web server implementation (e.g., using `libmicrohttpd` or a custom socket listener) that handles this.

C Application Health Check Endpoint Example

Here’s a conceptual snippet of how such an endpoint might look in C. This is a simplified representation; a production-ready version would include more sophisticated checks (e.g., database connectivity, internal state validation).

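#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <microhttpd.h> // Example library

#define PORT 8080

// Queue a short static text response with the given HTTP status code
// (libmicrohttpd releases before 0.9.71 use int instead of enum MHD_Result)
static enum MHD_Result send_text(struct MHD_Connection *connection,
                                 unsigned int status, const char *body) {
    struct MHD_Response *mhd_response;
    enum MHD_Result ret;

    mhd_response = MHD_create_response_from_buffer(strlen(body), (void *)body, MHD_RESPMEM_PERSISTENT);
    if (!mhd_response) return MHD_NO;

    ret = MHD_queue_response(connection, status, mhd_response);
    MHD_destroy_response(mhd_response);
    return ret;
}

// Function to handle health check requests
static enum MHD_Result health_check_handler(void *cls, struct MHD_Connection *connection,
                                const char *url, const char *method,
                                const char *version, const char *upload_data,
                                size_t *upload_data_size, void **con_cls) {
    if (strcmp(url, "/health") == 0 && strcmp(method, "GET") == 0) {
        // Perform internal health checks here
        // For demonstration, assume always healthy
        return send_text(connection, MHD_HTTP_OK, "OK");
    }
    // Answer everything else with 404; returning MHD_NO would just drop the connection
    return send_text(connection, MHD_HTTP_NOT_FOUND, "Not Found");
}

int main(void) {
    struct MHD_Daemon *daemon;
    sigset_t signals;
    int received;

    // Block SIGINT/SIGTERM before any thread starts so sigwait() below can collect them
    sigemptyset(&signals);
    sigaddset(&signals, SIGINT);
    sigaddset(&signals, SIGTERM);
    sigprocmask(SIG_BLOCK, &signals, NULL);

    daemon = MHD_start_daemon(MHD_USE_INTERNAL_POLLING_THREAD, PORT, NULL, NULL,
                              &health_check_handler, NULL, MHD_OPTION_END);
    if (daemon == NULL) {
        fprintf(stderr, "Failed to start daemon\n");
        return 1;
    }

    printf("Server started on port %d. Health check at /health\n", PORT);
    fflush(stdout); // stdout is not line-buffered when it goes to the journal

    // Wait for a stop signal. getchar() would return at once under systemd,
    // where stdin is /dev/null, and the service would exit immediately.
    sigwait(&signals, &received);
    MHD_stop_daemon(daemon);
    return 0;
}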
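
The “more sophisticated checks” mentioned above could be organized as a table of check functions whose combined result selects the HTTP status. The following is a minimal sketch that is independent of libmicrohttpd. The names `health_check_fn`, `run_health_checks`, `check_database`, and `check_internal_state` are hypothetical, and the two checks are stubs driven by flags; a real application would replace them with actual probes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// A health check returns true when its dependency or invariant is fine.
typedef bool (*health_check_fn)(void);

// Hypothetical state flags; real code would probe the dependency instead.
static bool database_reachable = true;
static bool internal_state_valid = true;

static bool check_database(void) { return database_reachable; }
static bool check_internal_state(void) { return internal_state_valid; }

// Run every check and map the outcome to an HTTP status code:
// 200 when all checks pass, 503 (Service Unavailable) as soon as one fails.
static unsigned int run_health_checks(const health_check_fn *checks, size_t count) {
    assert(checks != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (!checks[i]()) {
            return 503;
        }
    }
    return 200;
}
```

In the handler, the value returned by `run_health_checks` would take the place of the hard-coded 200 status, so that `curl --fail` and other probes see a failure whenever any check fails.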

Systemd Service Unit Configuration

Next, we configure systemd to manage our C application. This involves a service unit file that defines how to start, stop, and importantly, how to check the health of the application.

Create a file named my-c-app.service in /etc/systemd/system/:

[Unit]
Description=My C Application Service
After=network.target

[Service]
ExecStart=/usr/local/bin/my_c_app
Restart=on-failure
RestartSec=5
User=appuser
Group=appgroup
Environment="PORT=8080"

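# Health Check Configuration
# One-time startup probe: the unit fails if /health does not answer successfully
ExecStartPost=/bin/sh -c 'sleep 5 && curl --fail http://localhost:8080/health || exit 1'
# Watchdog: the application must send WATCHDOG=1 via sd_notify() at least every
# 30 seconds, otherwise systemd kills it. Remove this line if the application
# does not implement the watchdog protocol.
WatchdogSec=30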
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Explanation:

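ExecStartPost: This command runs once, right after the main process has been started. It waits 5 seconds (allowing the app to initialize) and then attempts to `curl` the health check endpoint. The --fail flag makes curl exit with a non-zero status when the server answers with an HTTP status of 400 or higher. If the command fails, systemd stops the service and marks it as failed. This is a startup check only; it is not repeated while the service runs.
WatchdogSec=30: This tells systemd to expect a “keep-alive” ping from the service at least every 30 seconds. The ping is the string WATCHDOG=1 sent with sd_notify(); if none arrives in time, systemd terminates the service (with SIGABRT by default) and treats it as failed. An application that does not implement this protocol will therefore be killed 30 seconds after startup and, with Restart=on-failure, restarted in a loop. Set WatchdogSec only when the C application calls sd_notify(0, "WATCHDOG=1") periodically, typically at half the configured interval. With Type=notify the application also calls sd_notify(0, "READY=1") once initialization is complete. STATUS=... messages are informational and do not reset the watchdog.
Restart=on-failure: Restarts the service when it exits with a non-zero status, is killed by a signal, or hits a timeout, including a watchdog timeout. RestartSec=5 adds a 5-second delay before each restart.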
StandardOutput=journal and StandardError=journal: Directs logs to the systemd journal for centralized logging.
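
For the watchdog described above, the usual route is sd_notify() from libsystemd. Where that dependency is unwanted, the notification protocol itself is small: a datagram sent to the UNIX socket named in the NOTIFY_SOCKET environment variable, with the interval available in WATCHDOG_USEC. The sketch below implements just that much; the function names `notify_systemd` and `watchdog_ping_interval_usec` are hypothetical, and error handling is kept to a minimum.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Send one notification (e.g. "READY=1" or "WATCHDOG=1") to systemd.
// Returns 1 if sent, 0 if NOTIFY_SOCKET is unset, -1 on error.
static int notify_systemd(const char *message) {
    assert(message != NULL);
    const char *path = getenv("NOTIFY_SOCKET");
    if (path == NULL || path[0] == '\0') return 0;

    struct sockaddr_un addr;
    size_t len = strlen(path);
    if (len >= sizeof(addr.sun_path)) return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, len);

    socklen_t addr_len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + len);
    if (path[0] == '@') {
        addr.sun_path[0] = '\0'; // a leading '@' denotes an abstract socket
    } else {
        addr_len += 1;           // include the terminating NUL for path sockets
    }

    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0) return -1;

    size_t message_len = strlen(message);
    ssize_t sent = sendto(fd, message, message_len, 0,
                          (const struct sockaddr *)&addr, addr_len);
    close(fd);
    return sent == (ssize_t)message_len ? 1 : -1;
}

// Suggested ping interval in microseconds: half of WATCHDOG_USEC,
// or 0 when no watchdog is configured for this service.
static unsigned long long watchdog_ping_interval_usec(void) {
    const char *value = getenv("WATCHDOG_USEC");
    if (value == NULL) return 0;

    char *end = NULL;
    errno = 0;
    unsigned long long usec = strtoull(value, &end, 10);
    if (errno != 0 || end == value || *end != '\0') return 0;
    return usec / 2;
}
```

A service would call `notify_systemd("READY=1")` once after startup (relevant with Type=notify) and then `notify_systemd("WATCHDOG=1")` from its main loop at the returned interval, ideally only after its internal health checks pass, so that a hung or unhealthy process stops pinging and gets restarted.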

After creating the file, reload systemd, enable, and start the service:

sudo systemctl daemon-reload
sudo systemctl enable my-c-app.service
sudo systemctl start my-c-app.service
sudo systemctl status my-c-app.service

Monitoring DynamoDB with CloudWatch Metrics and Alarms

DynamoDB is a fully managed AWS service, so there are no cluster nodes to operate yourself; AWS CloudWatch is the primary tool for monitoring its tables. We need to focus on key metrics that indicate performance bottlenecks, throttling, and potential availability issues. OVH does not host DynamoDB, so an application running on OVH reaches it over the network through the AWS API, or through a managed service that abstracts it. Assuming direct AWS DynamoDB usage, here’s how to set up effective monitoring.

Key DynamoDB CloudWatch Metrics to Monitor

Focus on these metrics, especially for tables and global secondary indexes (GSIs):

ConsumedReadCapacityUnits: Tracks the read throughput consumed by your application. Spikes can indicate increased load or inefficient queries.
ConsumedWriteCapacityUnits: Tracks write throughput consumed. Similar to read units, spikes need investigation.
ReadThrottleEvents: Crucial. Indicates requests that were throttled because provisioned or on-demand capacity was exceeded. Persistent throttling degrades application performance.
WriteThrottleEvents: The write equivalent of ReadThrottleEvents.
SystemErrors: Counts internal server errors within DynamoDB. Any non-zero value here is a critical alert.
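SuccessfulRequestLatency: Measures the latency of successful requests in milliseconds, reported per operation (GetItem, Query, PutItem, and so on). High latency, even without throttling, can signal performance issues. Track the tail where you can (95th and 99th percentiles), and fall back to Average and Maximum if percentile statistics are not available for this metric.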
ConditionalCheckFailedRequests: Indicates failed conditional writes. While not always an error, a high rate can point to application logic issues or contention.
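ItemCount: Useful for understanding table size and growth. It is not a CloudWatch metric; it is returned by the DescribeTable API and refreshed only about every six hours, so track it with a scheduled job rather than an alarm.
TableSizeBytes: Tracks the total size of the table. Like ItemCount, it comes from DescribeTable rather than CloudWatch.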

Setting Up CloudWatch Alarms

Create CloudWatch alarms based on these metrics. Alarms should trigger notifications (e.g., via SNS to Slack, PagerDuty, or email) when thresholds are breached. Here are example alarm configurations:

Alarm 1: High Read Throttle Rate

Metric: ReadThrottleEvents
Namespace: AWS/DynamoDB
Statistic: Sum
Period: 1 minute
Threshold type: Static
Condition: > 0 (for any table/index) or a specific threshold if you know your acceptable limits. A value greater than 0 for a sustained period is usually problematic.
Datapoints to alarm: 1 out of 1 (to be sensitive)
Evaluation period: 1 minute
Alarm name: DynamoDB-HighReadThrottleEvents-TableXYZ
Notification: Send to an SNS topic for alerts.

Alarm 2: High Write Throttle Rate

Metric: WriteThrottleEvents
Namespace: AWS/DynamoDB
Statistic: Sum
Period: 1 minute
Threshold type: Static
Condition: > 0 (or a specific threshold)
Datapoints to alarm: 1 out of 1
Evaluation period: 1 minute
Alarm name: DynamoDB-HighWriteThrottleEvents-TableXYZ
Notification: Send to an SNS topic.

Alarm 3: High Latency (95th Percentile)

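Metric: SuccessfulRequestLatency
Namespace: AWS/DynamoDB
Dimensions: TableName and Operation (for example GetItem or Query); this metric is published per operation
Statistic: p95 (use Average or Maximum if percentile statistics are unavailable for this metric)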
Period: 1 minute
Threshold type: Static
Condition: > 500 (the metric is reported in milliseconds, so this is 500 ms; adjust based on your SLOs)
Datapoints to alarm: 1 out of 1
Evaluation period: 1 minute
Alarm name: DynamoDB-HighLatency-p95-TableXYZ
Notification: Send to an SNS topic.

Alarm 4: System Errors

Metric: SystemErrors
Namespace: AWS/DynamoDB
Statistic: Sum
Period: 1 minute
Threshold type: Static
Condition: > 0
Datapoints to alarm: 1 out of 1
Evaluation period: 1 minute
Alarm name: DynamoDB-SystemErrors-TableXYZ
Notification: Send to an SNS topic (this is a critical alert).
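
The “Datapoints to alarm” setting in these alarms follows an M-out-of-N rule: the alarm fires when at least M of the most recent N datapoints breach the threshold. The sketch below models only that rule for a greater-than comparison. It is a simplification rather than CloudWatch’s actual evaluator: it ignores missing-data handling and assumes the alarm stays quiet until N datapoints exist. The function name `alarm_should_fire` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Returns true when at least `datapoints_to_alarm` of the last
// `evaluation_periods` values are strictly greater than `threshold`.
// `values` holds one datapoint per period, oldest first.
static bool alarm_should_fire(const double *values, size_t count,
                              size_t evaluation_periods,
                              size_t datapoints_to_alarm, double threshold) {
    assert(datapoints_to_alarm >= 1 && datapoints_to_alarm <= evaluation_periods);
    if (count < evaluation_periods) {
        return false; // simplifying assumption: not enough data yet
    }
    size_t breaching = 0;
    for (size_t i = count - evaluation_periods; i < count; i++) {
        if (values[i] > threshold) {
            breaching++;
        }
    }
    return breaching >= datapoints_to_alarm;
}
```

With 1 out of 1 and a threshold of 0, a single throttled request in the latest minute is enough to fire. Raising the setting to, say, 3 out of 5 trades sensitivity for fewer alerts on isolated spikes.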

Automating Monitoring with AWS CLI / SDK

For infrastructure-as-code and automated deployment, use the AWS CLI or SDKs to manage CloudWatch alarms. Here’s an example using AWS CLI to create a throttle alarm:

aws cloudwatch put-metric-alarm \
    --alarm-name "DynamoDB-HighWriteThrottleEvents-TableXYZ" \
    --alarm-description "Alarm when write throttle events exceed 0 in 1 minute" \
    --metric-name "WriteThrottleEvents" \
    --namespace "AWS/DynamoDB" \
    --statistic "Sum" \
    --period 60 \
    --threshold 0 \
    --comparison-operator "GreaterThanThreshold" \
    --evaluation-periods 1 \
    --datapoints-to-alarm 1 \
    --treat-missing-data "notBreaching" \
    --dimensions Name=TableName,Value=TableXYZ \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:MyAlertsTopic

Remember to replace TableXYZ, us-east-1, 123456789012, and MyAlertsTopic with your actual values.

Integrating C App Monitoring with Prometheus and Grafana

While systemd provides basic health checks and restarts, a more comprehensive monitoring solution involves exporting metrics from your C application and visualizing them. Prometheus is an excellent choice for time-series metrics collection, and Grafana for dashboarding.

Exposing Metrics from C Application

You can integrate a Prometheus client library into your C application. A common approach is to expose a /metrics HTTP endpoint that Prometheus can scrape.

Using a library such as `prometheus-client-c` (or a similar C/C++ library), you would:

Initialize the Prometheus client.
Define metrics (e.g., counters for requests, gauges for active connections, histograms for latency).
Update these metrics within your application logic.
Start an HTTP server to expose the /metrics endpoint.

Example (conceptual, using a hypothetical Prometheus C client API):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <microhttpd.h>
#include "prometheus/client.h" // Hypothetical library

// Separate port for metrics. 9100 is also node_exporter's default port,
// so pick another one if both run on the same host.
#define METRICS_PORT 9100

// Define metrics
prometheus_counter_t requests_total;
prometheus_gauge_t active_connections;
prometheus_histogram_t request_latency_seconds;

// Handler for /metrics endpoint
// (libmicrohttpd releases before 0.9.71 use int instead of enum MHD_Result)
static enum MHD_Result metrics_handler(void *cls, struct MHD_Connection *connection,
                           const char *url, const char *method,
                           const char *version, const char *upload_data,
                           size_t *upload_data_size, void **con_cls) {
    if (strcmp(url, "/metrics") == 0 && strcmp(method, "GET") == 0) {
        // Hypothetical function, assumed to return a malloc()ed string
        char *metrics_text = prometheus_collect_metrics();
        struct MHD_Response *mhd_response;
        enum MHD_Result ret;

        if (!metrics_text) return MHD_NO;

        // MHD_RESPMEM_MUST_FREE: libmicrohttpd frees the buffer along with the response
        mhd_response = MHD_create_response_from_buffer(strlen(metrics_text), (void *)metrics_text, MHD_RESPMEM_MUST_FREE);
        if (!mhd_response) return MHD_NO;
        MHD_add_response_header(mhd_response, "Content-Type", "text/plain; version=0.0.4");

        ret = MHD_queue_response(connection, MHD_HTTP_OK, mhd_response);
        MHD_destroy_response(mhd_response);
        return ret;
    }
    return MHD_NO; // Anything else is refused and the connection is closed
}

int main(void) {
    sigset_t signals;
    int received;

    // Block SIGINT/SIGTERM before any thread starts so sigwait() below can collect them
    sigemptyset(&signals);
    sigaddset(&signals, SIGINT);
    sigaddset(&signals, SIGTERM);
    sigprocmask(SIG_BLOCK, &signals, NULL);

    // Initialize Prometheus client
    prometheus_init();
    requests_total = prometheus_counter_new("my_c_app_requests_total", "Total number of requests");
    active_connections = prometheus_gauge_new("my_c_app_active_connections", "Number of active connections");
    request_latency_seconds = prometheus_histogram_new("my_c_app_request_latency_seconds", "Latency of requests in seconds", 10, (double[]){0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0});

    // Start metrics server
    struct MHD_Daemon *metrics_daemon;
    metrics_daemon = MHD_start_daemon(MHD_USE_INTERNAL_POLLING_THREAD, METRICS_PORT, NULL, NULL,
                                      &metrics_handler, NULL, MHD_OPTION_END);
    if (metrics_daemon == NULL) {
        fprintf(stderr, "Failed to start metrics daemon\n");
        return 1;
    }

    // ... your main application logic ...
    // Example: Increment counter and record latency
    // prometheus_counter_inc(requests_total);
    // double start_time = get_current_time();
    // ... process request ...
    // double latency = get_current_time() - start_time;
    // prometheus_histogram_observe(request_latency_seconds, latency);

    // Keep the process, and with it the metrics server, alive until asked to stop
    sigwait(&signals, &received);
    MHD_stop_daemon(metrics_daemon);
    return 0;
}
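
If no client library fits, the text exposition format that Prometheus scrapes is simple enough to produce by hand for counters and gauges: a HELP line, a TYPE line, and one sample line per metric. The sketch below shows that idea with a hypothetical helper named `append_metric`; it assumes metric names and help strings contain no newlines or backslashes (which would need escaping) and does not cover labels or histograms.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Append one metric in the Prometheus text exposition format to the
// NUL-terminated string already in `buf`. `type` is "counter" or "gauge".
// Returns 0 on success, or -1 (leaving `buf` unchanged) if it does not fit.
static int append_metric(char *buf, size_t size, const char *name,
                         const char *help, const char *type, double value) {
    size_t used = strlen(buf);
    assert(used < size);

    int n = snprintf(buf + used, size - used,
                     "# HELP %s %s\n"
                     "# TYPE %s %s\n"
                     "%s %.17g\n",
                     name, help, name, type, name, value);
    if (n < 0 || (size_t)n >= size - used) {
        buf[used] = '\0'; // undo the partial write
        return -1;
    }
    return 0;
}
```

A /metrics handler could build its response by calling `append_metric` once per metric on a buffer and handing the result to the HTTP server with a `text/plain; version=0.0.4` content type.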

Prometheus Configuration

Configure Prometheus to scrape your C application’s metrics endpoint. Add the following to your prometheus.yml:

scrape_configs:
  - job_name: 'my-c-app'
    static_configs:
      - targets: ['your_server_ip:9100'] # Replace with your server's IP or hostname
        labels:
          instance: 'c-app-instance-1'

Grafana Dashboard Setup

In Grafana, add Prometheus as a data source. Then, create a new dashboard and add panels using PromQL queries to visualize your C application’s metrics. For example:

Panel 1: Request Rate
Query: rate(my_c_app_requests_total[5m])
Panel 2: Active Connections
Query: my_c_app_active_connections
Panel 3: Request Latency (95th Percentile)
Query: histogram_quantile(0.95, sum(rate(my_c_app_request_latency_seconds_bucket[5m])) by (le))
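
To make the p95 panel easier to interpret, it helps to know that a histogram quantile is an estimate: Prometheus finds the bucket containing the requested rank and interpolates linearly inside it. The sketch below reproduces that idea in C under simplifying assumptions (non-negative bucket bounds, at least two buckets, the last one being +Inf). The function name `estimate_quantile` is hypothetical, and this is not Prometheus’s own code.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

// Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.
// upper_bounds[i] is the `le` value of bucket i (ascending, last one +Inf);
// cumulative[i] is the number of observations <= upper_bounds[i].
static double estimate_quantile(double q, const double *upper_bounds,
                                const double *cumulative, size_t n) {
    assert(n >= 2 && q >= 0.0 && q <= 1.0);
    double total = cumulative[n - 1];
    if (total <= 0.0) return NAN; // no observations

    // Find the first bucket whose cumulative count reaches the rank.
    double rank = q * total;
    size_t b = 0;
    while (b < n - 1 && cumulative[b] < rank) b++;

    // Rank falls into the +Inf bucket: report the highest finite bound.
    if (b == n - 1) return upper_bounds[n - 2];

    double lower = (b == 0) ? 0.0 : upper_bounds[b - 1];
    double below = (b == 0) ? 0.0 : cumulative[b - 1];
    double in_bucket = cumulative[b] - below;
    if (in_bucket <= 0.0) return upper_bounds[b];

    // Linear interpolation between the bucket's lower and upper bound.
    return lower + (upper_bounds[b] - lower) * (rank - below) / in_bucket;
}
```

For buckets 0.1, 0.5, 1.0, +Inf with cumulative counts 50, 90, 100, 100, the 95th percentile lands halfway through the 0.5 to 1.0 bucket, so the estimate is about 0.75 seconds. The accuracy of the panel therefore depends on how closely the bucket boundaries bracket the latencies you care about.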

Combine these Prometheus and Grafana setups with CloudWatch alarms for DynamoDB to achieve comprehensive, multi-layered monitoring for your C application and its data store on OVH.