Server Monitoring Best Practices: Keeping Your C App and Elasticsearch Clusters Alive on AWS

Proactive C Application Health Checks with Systemd

For C applications deployed on AWS EC2 instances, robust health monitoring is paramount. Relying solely on external checks can lead to prolonged downtime if an application crashes or enters an unresponsive state. Integrating health checks directly into the service management layer, such as systemd, provides a more immediate and granular approach. This allows systemd to not only restart a failed service but also to report its status accurately.

We’ll define a systemd service unit that verifies the application is ready at startup and checks its health periodically, restarting it when a check fails. For this example, assume your C application listens on a specific TCP port (e.g., 8080) and has a simple health endpoint that returns “OK” when healthy.

Systemd Service Unit Configuration

Create a file named my-c-app.service in /etc/systemd/system/. This file will define how your C application is managed.

[Unit]
Description=My C Application Service
After=network.target

[Service]
User=appuser
Group=appgroup
WorkingDirectory=/opt/my-c-app
ExecStart=/opt/my-c-app/bin/my_c_app --config /etc/my-c-app/config.conf
# Readiness check: wait up to 5 seconds for the app to open port 8080.
# systemd does not pass Exec lines through a shell, so the loop is wrapped in sh -c.
ExecStartPost=/bin/sh -c 'for i in 1 2 3 4 5; do /usr/bin/nc -z localhost 8080 && exit 0; sleep 1; done; exit 1'
ExecStop=/bin/kill -s TERM $MAINPID
Restart=on-failure
RestartSec=5s

# Health Check Configuration
# systemd has no directive that polls a health endpoint. Its built-in liveness
# mechanism is the watchdog: the application must send WATCHDOG=1 through
# sd_notify() more often than WatchdogSec, or systemd marks the service as
# failed and Restart=on-failure restarts it.
WatchdogSec=10s

[Install]
WantedBy=multi-user.target

Explanation:

Description: A human-readable description of the service.
After=network.target: Orders the service after basic network setup. It does not wait for interfaces to be fully configured; use After= and Wants= with network-online.target if the application needs a routable address at startup.
User, Group, WorkingDirectory: Define the execution context for the application.
ExecStart: The command to start your C application.
ExecStartPost: A post-start readiness check. nc -z localhost 8080 tests whether port 8080 accepts connections, and the loop retries once per second for up to 5 seconds. If the port never opens, the command exits non-zero and systemd treats the start as failed. The check runs after ExecStart because the port cannot be open before the application is running, and it is wrapped in /bin/sh -c because systemd does not interpret shell constructs such as || or loops itself.
ExecStop: Sends SIGTERM to the main process for a graceful stop. This matches systemd’s default stop behavior, so the line is explicit rather than required.
Restart=on-failure: Configures systemd to restart the service if it exits with a non-zero status, is killed by a signal, times out, or misses a watchdog deadline.
RestartSec=5s: The delay before attempting a restart.
WatchdogSec=10s: This is the core of our proactive health check. The application calls sd_notify(0, "WATCHDOG=1") from libsystemd (link with -lsystemd) at regular intervals, typically every half of WatchdogSec. If no notification arrives within 10 seconds, systemd considers the service hung, terminates it, and restarts it under Restart=on-failure. This catches a deadlocked or stalled process that an exit-code check alone would never notice. Only set WatchdogSec once the application sends these notifications; otherwise systemd will restart a healthy service every 10 seconds.
External probing: To test the health endpoint from outside the process instead, run a probe from a systemd timer or cron job that requests the endpoint, checks for “OK”, and calls systemctl restart my-c-app.service on failure.
WantedBy=multi-user.target: Once the unit is enabled, starts the service at boot when the system reaches multi-user.target.
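
An endpoint probe like the one described above can also live in a small script, which is easier to test than a one-line command inside a unit file. The following is a minimal sketch, not a definitive implementation. It assumes the health endpoint speaks HTTP at http://localhost:8080/health (the /health path is hypothetical) and that a healthy reply contains a line that is exactly OK; the function names are illustrative.

```shell
#!/bin/sh
# Sketch of an external health probe for my-c-app.service.
# Assumptions: the app answers HTTP on HEALTH_URL (the /health path is
# hypothetical) and a healthy reply contains a line that is exactly "OK".

HEALTH_URL="${HEALTH_URL:-http://localhost:8080/health}"
SERVICE_NAME="${SERVICE_NAME:-my-c-app.service}"

# Succeeds (exit 0) only if stdin contains a line that is exactly "OK".
# Carriage returns are stripped so that "OK\r\n" also matches.
is_healthy_response() {
    tr -d '\r' | grep -qx 'OK'
}

# Fetches the health endpoint with a 2-second overall timeout.
# --fail makes curl exit non-zero on HTTP 4xx/5xx responses; in that case
# the body is empty, so the check below fails as well.
probe() {
    curl --silent --fail --max-time 2 "$HEALTH_URL" | is_healthy_response
}

# Restarts the service when the probe fails. Intended to be called from
# cron or a systemd timer, running as a user allowed to restart the unit.
probe_and_restart() {
    if ! probe; then
        echo "health probe failed, restarting $SERVICE_NAME" >&2
        systemctl restart "$SERVICE_NAME"
    fi
}
```

Matching the whole line rather than a substring is a deliberate choice: a reply such as NOT OK would pass a plain grep "OK" but fails here. To use the sketch, source the file and call probe_and_restart on whatever schedule suits your recovery target.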

After creating the file, reload systemd, enable, and start your service:

sudo systemctl daemon-reload
sudo systemctl enable my-c-app.service
sudo systemctl start my-c-app.service

You can check the status and logs with:

sudo systemctl status my-c-app.service
sudo journalctl -u my-c-app.service -f

Elasticsearch Cluster Monitoring with Prometheus and Grafana

Monitoring Elasticsearch clusters, especially on AWS, requires a comprehensive approach that captures cluster health, node performance, JVM metrics, and query performance. Prometheus is an excellent choice for metrics collection, and Grafana for visualization. We’ll focus on setting up the Elasticsearch Exporter for Prometheus and integrating it with a Grafana dashboard.

Deploying the Elasticsearch Exporter

The Elasticsearch exporter maintained by the Prometheus community (prometheus-community/elasticsearch_exporter) can be deployed as a Docker container or a standalone binary. For simplicity and ease of management on EC2, we’ll outline the Docker approach. Ensure you have Docker installed on your EC2 instance(s) or use an ECS/EKS setup.

The exporter needs to connect to your Elasticsearch cluster. If your Elasticsearch cluster is within a private VPC, you’ll need to ensure the EC2 instance running the exporter has network access (e.g., via VPC peering, Transit Gateway, or by running the exporter within the same VPC).

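docker run -d \
  --name elasticsearch-exporter \
  -p 9114:9114 \
  quay.io/prometheuscommunity/elasticsearch-exporter:latest \
  --es.uri=http://your-elasticsearch-endpoint:9200 \
  --es.all \
  --es.indices

Configuration Notes:

--es.uri: The HTTP endpoint of your Elasticsearch cluster. Replace your-elasticsearch-endpoint with the actual hostname or IP. If using Amazon OpenSearch Service (formerly Amazon Elasticsearch Service), this will be the domain’s endpoint.
--es.all: Collects stats from every node in the cluster rather than only the node the exporter connects to.
--es.indices: Collects per-index stats for all indices. On clusters with many indices this produces a large number of time series, so enable it deliberately.
Cluster name: The exporter reads the cluster name from Elasticsearch and attaches it to the metrics as a label, so it does not need to be set on the command line.
-p 9114:9114: Exposes the exporter’s metrics endpoint on port 9114.
Image tag: latest is convenient for a first test; pin a specific version tag in production so upgrades are deliberate.

If your Elasticsearch cluster requires authentication, pass credentials through the ES_USERNAME and ES_PASSWORD environment variables (docker run -e) or embed them in the URI. For HTTPS, use an https:// URI, and point --es.ca at your CA certificate if it is not publicly trusted. Flag names have changed between exporter releases, so check the README for the version you deploy.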
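
Once the container is running, reading the exporter’s /metrics output is a quick way to confirm that it can reach the cluster. The sketch below assumes the exporter reports cluster status as one elasticsearch_cluster_health_status series per color label, with value 1 on the current color; if the function prints nothing, inspect the raw output to see what your exporter version emits. The function name is illustrative.

```shell
#!/bin/sh
# Sketch: extract the current cluster color from exporter output.
# Assumption: lines look like
#   elasticsearch_cluster_health_status{cluster="...",color="green"} 1
# with value 1 for the active color and 0 for the other two.

# Reads Prometheus text-format metrics on stdin and prints the color
# whose elasticsearch_cluster_health_status sample equals 1.
cluster_color() {
    sed -n 's/^elasticsearch_cluster_health_status{.*color="\([a-z]*\)".*} 1$/\1/p'
}
```

Typical usage on the exporter host would be: curl --silent http://localhost:9114/metrics | cluster_color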

Prometheus Configuration

Add a scrape configuration to your Prometheus prometheus.yml file to collect metrics from the exporter. Assuming your Prometheus server can reach the EC2 instance running the exporter on port 9114:

scrape_configs:
  - job_name: 'elasticsearch'
    metrics_path: /metrics
    static_configs:
      - targets: ['<EC2_INSTANCE_IP_OR_DNS>:9114']

Replace <EC2_INSTANCE_IP_OR_DNS> with the actual IP address or DNS name of the EC2 instance running the Elasticsearch exporter. After updating prometheus.yml, reload Prometheus configuration (e.g., by sending a SIGHUP signal or restarting the Prometheus service).

Grafana Dashboard for Elasticsearch

Grafana provides numerous pre-built dashboards for Elasticsearch. You can import one by its ID from Grafana.com or create your own. A popular community dashboard is the “Elasticsearch Cluster Monitoring” dashboard (ID: 1427). Ensure your Grafana instance is configured with Prometheus as a data source.

Key metrics to monitor include:

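Cluster Health: elasticsearch_cluster_health_status (one series per color label: red, yellow, and green; the series for the current status has value 1, the others 0)
Node Count: elasticsearch_cluster_health_number_of_nodes
JVM Heap Usage: elasticsearch_jvm_memory_used_bytes divided by elasticsearch_jvm_memory_max_bytes, both filtered to area="heap"
Indexing Rate: elasticsearch_indices_indexing_index_total (rate over time)
Search Rate: elasticsearch_indices_search_query_total (rate over time)
Disk Usage: elasticsearch_filesystem_data_available_bytes, elasticsearch_filesystem_data_size_bytes
CPU Usage: elasticsearch_process_cpu_percent
Request Latency: the rate of elasticsearch_indices_search_query_time_seconds divided by the rate of elasticsearch_indices_search_query_total (mean time per search query)

Metric names differ between exporter versions, so confirm them against your exporter’s /metrics output. The exporter exposes only cumulative query time and query counts, which yield averages. For percentiles or per-request latency, instrument your application or analyze Elasticsearch’s slow logs separately.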
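
An average search latency can still be derived from the exporter’s counters by dividing cumulative query time by query count. The sketch below builds that PromQL expression and runs it through the Prometheus HTTP API. It assumes Prometheus listens on http://localhost:9090 and that your exporter version exposes elasticsearch_indices_search_query_time_seconds and elasticsearch_indices_search_query_total; both the address and the function names are illustrative.

```shell
#!/bin/sh
# Sketch: mean search latency from exporter counters.
# Assumptions: PROM_URL points at your Prometheus server, and the two
# metric names below exist in your exporter version.

PROM_URL="${PROM_URL:-http://localhost:9090}"

# Prints a PromQL expression for mean seconds per search query over the
# given window (for example 5m), summed across all nodes.
avg_search_latency_expr() {
    window="$1"
    printf 'sum(rate(elasticsearch_indices_search_query_time_seconds[%s])) / sum(rate(elasticsearch_indices_search_query_total[%s]))' "$window" "$window"
}

# Runs the expression as an instant query. --get with --data-urlencode
# lets curl handle URL encoding of the brackets and parentheses.
query_avg_search_latency() {
    curl --silent --get "$PROM_URL/api/v1/query" \
        --data-urlencode "query=$(avg_search_latency_expr "${1:-5m}")"
}
```

The same expression can be pasted into a Grafana panel. Because it is a mean, a few slow queries can hide behind many fast ones, which is why slow logs remain useful.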

AWS CloudWatch Alarms for Critical Metrics

While Prometheus and Grafana provide deep insights, AWS CloudWatch is essential for setting up actionable alarms that can trigger automated responses or notify teams. We’ll focus on alarms for both the C application instances and the Elasticsearch cluster.

CloudWatch Alarms for C Application EC2 Instances

EC2 publishes CPU utilization and network traffic to CloudWatch by default. Disk space, memory, and logs require the CloudWatch agent, so ensure it is installed and configured on your instances if you want them. More importantly, we can publish the systemd service status as a custom metric.

Alarm 1: High CPU Utilization

Metric: CPUUtilization
Namespace: AWS/EC2
Statistic: Average
Period: 5 minutes
Threshold: > 80%
Evaluation Periods: 3
Alarm Actions: SNS topic for critical alerts

Alarm 2: Service Unhealthy (using systemd status)

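This requires a custom metric. You can write a small script that checks systemctl is-active my-c-app.service and publishes the result (e.g., MyCAppServiceActive = 1 if active, 0 if not) to CloudWatch using the AWS CLI, the CloudWatch agent, or an AWS SDK. Run it at least once per alarm period, for example every minute from cron or a systemd timer. Note that systemctl is-enabled only reports whether the unit starts at boot, so it is not a health signal.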

# Example script snippet (run once a minute from cron or a systemd timer)
SERVICE_NAME="my-c-app.service"
METRIC_NAME="MyCAppServiceActive"
NAMESPACE="Custom/MyApp"

# is-active exits 0 only while the unit is in the active state
if systemctl is-active --quiet "$SERVICE_NAME"; then
    VALUE=1
else
    VALUE=0
fi

# Publish with the AWS CLI (the instance role needs cloudwatch:PutMetricData)
aws cloudwatch put-metric-data --metric-name "$METRIC_NAME" --namespace "$NAMESPACE" --value "$VALUE" --dimensions "ServiceName=$SERVICE_NAME"

Once the custom metric is published:

Metric: MyCAppServiceActive
Namespace: Custom/MyApp
Statistic: Minimum
Period: 1 minute
Threshold: < 1
Evaluation Periods: 2
Alarm Actions: SNS topic for critical alerts

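This alarm triggers if the service is not active for two consecutive 1-minute periods. Configure the alarm to treat missing data as breaching, so that an instance that stops reporting altogether also raises it.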
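
Both alarms above can be created from the command line with aws cloudwatch put-metric-alarm. The following is a sketch under stated assumptions: the alarm names, the run helper, and the DRY_RUN switch are illustrative, and the SNS topic ARN is supplied by the caller. Note that put-metric-alarm expects dimensions as Name=...,Value=... pairs, and that the service alarm must use the same ServiceName dimension the publishing script sends.

```shell
#!/bin/sh
# Sketch: create the two EC2-side alarms with the AWS CLI.
# Set DRY_RUN=1 to print each command instead of calling AWS.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Alarm 1: average CPU above 80% for three consecutive 5-minute periods.
create_cpu_alarm() {
    instance_id="$1"
    topic_arn="$2"
    run aws cloudwatch put-metric-alarm \
        --alarm-name "my-c-app-high-cpu-$instance_id" \
        --namespace AWS/EC2 \
        --metric-name CPUUtilization \
        --dimensions "Name=InstanceId,Value=$instance_id" \
        --statistic Average \
        --period 300 \
        --evaluation-periods 3 \
        --comparison-operator GreaterThanThreshold \
        --threshold 80 \
        --alarm-actions "$topic_arn"
}

# Alarm 2: service metric below 1 for two consecutive 1-minute periods.
# Missing data counts as breaching, so a dead instance also alarms.
create_service_alarm() {
    service_name="$1"
    topic_arn="$2"
    run aws cloudwatch put-metric-alarm \
        --alarm-name "my-c-app-service-inactive" \
        --namespace Custom/MyApp \
        --metric-name MyCAppServiceActive \
        --dimensions "Name=ServiceName,Value=$service_name" \
        --statistic Minimum \
        --period 60 \
        --evaluation-periods 2 \
        --comparison-operator LessThanThreshold \
        --threshold 1 \
        --treat-missing-data breaching \
        --alarm-actions "$topic_arn"
}
```

Running with DRY_RUN=1 first is a cheap way to review the exact command before it touches your account, for example: DRY_RUN=1 create_service_alarm my-c-app.service "$TOPIC_ARN"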

CloudWatch Alarms for Elasticsearch (Amazon OpenSearch Service)

If you are using Amazon OpenSearch Service (formerly Amazon Elasticsearch Service), many key metrics are automatically published to CloudWatch under the AWS/ES namespace. If you’re self-hosting Elasticsearch on EC2, you’ll need to publish the relevant metrics yourself, with the CloudWatch agent or a custom-metric script like the one in the C app example.

Alarm 1: Cluster Status (for OpenSearch Service)

Metric: ClusterStatus.red
Namespace: AWS/ES
Statistic: Maximum
Period: 5 minutes
Threshold: >= 1
Evaluation Periods: 1
Alarm Actions: SNS topic for critical alerts

You can create similar alarms for ClusterStatus.yellow if yellow status is considered critical for your use case.

Alarm 2: JVM Memory Pressure (for OpenSearch Service)

Metric: JVMMemoryPressure
Namespace: AWS/ES
Statistic: Average
Period: 5 minutes
Threshold: > 85%
Evaluation Periods: 3
Alarm Actions: SNS topic for warnings/investigation

Alarm 3: High Disk Usage (for OpenSearch Service)

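Metric: FreeStorageSpace (DiskQueueDepth counts queued I/O requests and says nothing about capacity)
Namespace: AWS/ES
Statistic: Minimum
Period: 5 minutes
Threshold: < 25% of each node’s storage, expressed in MiB (for example, 25600 for 100 GiB volumes)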
Evaluation Periods: 3
Alarm Actions: SNS topic for warnings/investigation
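
When alarming on free space, the threshold has to be given in the metric’s unit. To my knowledge CloudWatch reports FreeStorageSpace for OpenSearch Service in MiB, while volume sizes are usually quoted in GiB, so the conversion is easy to get wrong by a factor of 1024. The helper below is a small sketch of that arithmetic; its name and the 25% figure are illustrative choices, not service requirements.

```shell
#!/bin/sh
# Sketch: convert "PERCENT of a VOLUME_GIB volume" into MiB for use as a
# FreeStorageSpace alarm threshold. Integer arithmetic, rounds down.
free_storage_threshold_mib() {
    volume_gib="$1"
    percent="$2"
    echo $(( volume_gib * 1024 * percent / 100 ))
}

# Example: 25% of a 100 GiB volume.
free_storage_threshold_mib 100 25   # prints 25600
```

The Minimum statistic pairs naturally with this threshold because it reflects the node with the least free space rather than the cluster average.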

Alarm 4: Unhealthy Nodes (for OpenSearch Service)

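Metric: Nodes
Namespace: AWS/ES
Statistic: Minimum
Period: 5 minutes
Threshold: < the number of nodes configured for the domain
Evaluation Periods: 1
Alarm Actions: SNS topic for critical alerts

To alert on shard allocation rather than node count, create a similar alarm on Shards.unassigned with a threshold of > 0.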

By combining systemd health checks for your C application, Prometheus/Grafana for deep Elasticsearch observability, and CloudWatch for critical AWS-native alerting, you establish a robust, multi-layered monitoring strategy to keep your services operational and performant on AWS.