Server Monitoring Best Practices: Keeping Your C App and Elasticsearch Clusters Alive on AWS
Proactive C Application Health Checks with Systemd
For C applications deployed on AWS EC2 instances, robust health monitoring is paramount. Relying solely on external checks can lead to prolonged downtime if an application crashes or enters an unresponsive state. Integrating health checks directly into the service management layer, such as systemd, provides a more immediate and granular approach. This allows systemd to not only restart a failed service but also to report its status accurately.
We’ll define a systemd service unit file that uses an ExecStartPre command as a pre-flight check, Type=notify so that systemd learns when the application is actually ready, and WatchdogSec so that systemd restarts the application if it stops reporting that it is healthy. systemd has no built-in directive for running a periodic health-check command against a service. Its native liveness mechanism is the watchdog, which expects the service itself to send a keep-alive message at regular intervals. For this example, assume your C application listens on a specific TCP port (e.g., 8080), has a simple health endpoint that returns “OK” when healthy, and calls sd_notify() from libsystemd (<systemd/sd-daemon.h>) to send READY=1 once startup completes and WATCHDOG=1 whenever its internal health check passes.
Systemd Service Unit Configuration
Create a file named my-c-app.service in /etc/systemd/system/. This file will define how your C application is managed.
[Unit]
Description=My C Application Service
After=network.target

[Service]
Type=notify
User=appuser
Group=appgroup
WorkingDirectory=/opt/my-c-app
# Pre-flight check: refuse to start if another process already holds port 8080.
ExecStartPre=/bin/sh -c '! /usr/bin/nc -z localhost 8080'
ExecStart=/opt/my-c-app/bin/my_c_app --config /etc/my-c-app/config.conf
ExecStop=/bin/kill -s TERM $MAINPID
Restart=on-failure
RestartSec=5s
# Watchdog: the application must send WATCHDOG=1 at least every 10 seconds.
WatchdogSec=10s

[Install]
WantedBy=multi-user.target
Explanation:
- Description: A human-readable description of the service.
- After=network.target: Orders the service after basic network setup. Use network-online.target instead if the application needs a fully configured network at startup.
- Type=notify: systemd considers the service started only once the application sends READY=1 via sd_notify(), so the reported status reflects real readiness rather than a successful fork.
- User, Group, WorkingDirectory: Define the execution context for the application.
- ExecStartPre: A pre-start command. It runs before the application exists, so it cannot test the application's own port. Here it verifies that port 8080 is free: nc -z localhost 8080 succeeds only if something is already listening, and the leading ! inverts that result. systemd does not interpret shell operators itself, which is why the check is wrapped in /bin/sh -c. If the command returns a non-zero exit code, the service start is aborted.
- ExecStart: The command to start your C application.
- ExecStop: How to gracefully stop the application.
- Restart=on-failure: Configures systemd to restart the service if it exits with a non-zero status, is killed by a signal, times out, or misses a watchdog deadline.
- RestartSec=5s: The delay before attempting a restart.
- WatchdogSec=10s: This is the core of our proactive health check. systemd passes the interval to the application in the WATCHDOG_USEC environment variable, and the application should send WATCHDOG=1 at roughly half that interval, but only after its own health check succeeds. If no keep-alive arrives within 10 seconds, systemd terminates the service (with SIGABRT by default) and Restart=on-failure brings it back.
- WantedBy=multi-user.target: Starts the service at boot once the system reaches the multi-user target.
After creating the file, reload systemd, enable, and start your service:
sudo systemctl daemon-reload
sudo systemctl enable my-c-app.service
sudo systemctl start my-c-app.service
You can check the status and logs with:
sudo systemctl status my-c-app.service
sudo journalctl -u my-c-app.service -f
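The health endpoint assumed earlier (a TCP listener on port 8080 that answers “OK”) can also be probed from a script, for instance from an external checker or a cron job. The following is a minimal sketch in Python, assuming the application writes its reply as soon as a client connects and that the reply fits in one short read. The helper name probe_health and its defaults are hypothetical.

```python
import socket


def probe_health(host="localhost", port=8080, timeout=2.0, expected=b"OK"):
    """Return True if the endpoint answers with `expected` within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # Assumption: the application sends its reply on connect,
            # and the reply arrives in a single short read.
            data = conn.recv(64)
    except OSError:
        # Connection refused, timed out, or reset: treat all as unhealthy.
        return False
    return expected in data
```

Calling probe_health() returns True only when the reply contains “OK”, so a wrapper script can map the result to an exit code or to a metric value.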
Elasticsearch Cluster Monitoring with Prometheus and Grafana
Monitoring Elasticsearch clusters, especially on AWS, requires a comprehensive approach that captures cluster health, node performance, JVM metrics, and query performance. Prometheus is an excellent choice for metrics collection, and Grafana for visualization. We’ll focus on setting up the Elasticsearch Exporter for Prometheus and integrating it with a Grafana dashboard.
Deploying the Elasticsearch Exporter
The community-maintained Elasticsearch Exporter (prometheus-community/elasticsearch_exporter) can be deployed as a Docker container or a standalone binary. For simplicity and ease of management on EC2, we’ll outline the Docker approach. Ensure you have Docker installed on your EC2 instance(s) or use an ECS/EKS setup.
The exporter needs to connect to your Elasticsearch cluster. If your Elasticsearch cluster is within a private VPC, you’ll need to ensure the EC2 instance running the exporter has network access (e.g., via VPC peering, Transit Gateway, or by running the exporter within the same VPC).
docker run -d \
  --name elasticsearch-exporter \
  -p 9114:9114 \
  quay.io/prometheuscommunity/elasticsearch-exporter:latest \
  --es.uri=http://your-elasticsearch-endpoint:9200 \
  --es.all \
  --es.indices
Configuration Notes:
- --es.uri: The HTTP endpoint of your Elasticsearch cluster. Replace your-elasticsearch-endpoint with the actual hostname or IP. If using AWS Elasticsearch Service (now OpenSearch Service), this will be its endpoint.
- --es.all: Collects statistics for every node in the cluster rather than only the node the exporter connects to.
- --es.indices: Collects statistics for all indices in the cluster.
- Cluster name: There is no flag for this. The exporter reads the name from the cluster and attaches it to the metrics as the cluster label.
- -p 9114:9114: Exposes the exporter’s metrics endpoint on port 9114.
If your Elasticsearch cluster requires basic authentication, embed the credentials in --es.uri or pass them to the container as the ES_USERNAME and ES_PASSWORD environment variables. For HTTPS, use an https:// URI, and add --es.ca if the cluster certificate is signed by a private CA. Flag names have changed between exporter releases, so confirm them with --help for the version you deploy.
Prometheus Configuration
Add a scrape configuration to your Prometheus prometheus.yml file to collect metrics from the exporter. Assuming your Prometheus server can reach the EC2 instance running the exporter on port 9114:
scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['<EC2_INSTANCE_IP_OR_DNS>:9114']
    metrics_path: /metrics
    # params:
    #   Optional: Filter metrics if needed, e.g., by index name
    #   index: ['my-index-*']
Replace <EC2_INSTANCE_IP_OR_DNS> with the actual IP address or DNS name of the EC2 instance running the Elasticsearch exporter. After updating prometheus.yml, reload Prometheus configuration (e.g., by sending a SIGHUP signal or restarting the Prometheus service).
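Before relying on the scrape job, it can help to confirm that the exporter is actually serving metrics. The following is a rough sketch in Python, assuming the exporter's output carries no trailing timestamps, so the value is always the last field on a line. The helper names parse_metrics and fetch_metrics are hypothetical.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text exposition into {series: value}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no samples
        # Assumption: no trailing timestamp, so the value is the last field.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(url, timeout=5):
    """Download and parse a metrics page, e.g. http://<host>:9114/metrics."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

An empty result, or a connection error from fetch_metrics, points at the exporter or the network path rather than at Prometheus.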
Grafana Dashboard for Elasticsearch
Grafana provides numerous pre-built dashboards for Elasticsearch. You can import one by its ID from Grafana.com or create your own. A popular community dashboard is the “Elasticsearch Cluster Monitoring” dashboard (ID: 1427). Ensure your Grafana instance is configured with Prometheus as a data source.
Key metrics to monitor include:
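- Cluster Health: elasticsearch_cluster_health_status (one series per color label; the series for the current status has the value 1)
- Node Count: elasticsearch_cluster_health_number_of_nodes
- JVM Heap Usage: elasticsearch_jvm_memory_used_bytes{area="heap"} divided by elasticsearch_jvm_memory_max_bytes{area="heap"}
- Indexing Rate: elasticsearch_indices_indexing_index_total (rate over time)
- Search Rate: elasticsearch_indices_search_query_total (rate over time)
- Disk Usage: elasticsearch_filesystem_data_available_bytes, elasticsearch_filesystem_data_size_bytes
- CPU Usage: elasticsearch_process_cpu_percent
- Request Latency: elasticsearch_indices_search_query_time_seconds relative to elasticsearch_indices_search_query_total (an indirect, average-only indicator)
These names follow the prometheus-community exporter. Metric names differ between exporter versions, so confirm them against your exporter's /metrics output.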
For latency, you might need to instrument your application or use Elasticsearch’s slow logs and analyze them separately, as direct latency metrics from the exporter can be limited.
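A coarse average can still be derived from cumulative counters. The sketch below shows the arithmetic in Python, assuming two samples of a cumulative query count and a cumulative query time in seconds (in the prometheus-community exporter these are elasticsearch_indices_search_query_total and elasticsearch_indices_search_query_time_seconds; verify the names on your /metrics page). The function name is hypothetical.

```python
def average_latency_seconds(earlier, later):
    """Average seconds per query between two (query_count, query_time_seconds) samples."""
    queries = later[0] - earlier[0]
    seconds = later[1] - earlier[1]
    if queries <= 0:
        # No queries in the window, or the counters were reset by a restart.
        return None
    return seconds / queries
```

For example, 50 additional queries that consumed 1.0 additional second of query time average 20 ms per query. This is a mean over the window and hides tail latency, which is why slow logs remain useful.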
AWS CloudWatch Alarms for Critical Metrics
While Prometheus and Grafana provide deep insights, AWS CloudWatch is essential for setting up actionable alarms that can trigger automated responses or notify teams. We’ll focus on alarms for both the C application instances and the Elasticsearch cluster.
CloudWatch Alarms for C Application EC2 Instances
Ensure the CloudWatch agent is installed and configured on your EC2 instances to send custom metrics and logs. For basic health, we can monitor CPU utilization, network traffic, and disk space. More importantly, we can leverage the systemd service status.
Alarm 1: High CPU Utilization
- Metric: CPUUtilization
- Namespace: AWS/EC2
- Statistic: Average
- Period: 5 minutes
- Threshold: > 80%
- Evaluation Periods: 3
- Alarm Actions: SNS topic for critical alerts
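The same alarm can be created programmatically. The following is a minimal sketch in Python, assuming boto3 is used to call CloudWatch. The helper name cpu_alarm_params, the alarm naming scheme, and both arguments are hypothetical. The function only builds the request parameters, so it runs without AWS credentials.

```python
def cpu_alarm_params(instance_id, sns_topic_arn):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # 5 minutes, in seconds
        "EvaluationPeriods": 3,
        "Threshold": 80.0,        # percent
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

With credentials in place, the parameters would be passed as boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm_params(instance_id, topic_arn)).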
Alarm 2: Service Unhealthy (using systemd status)
This requires a custom metric. You can write a small script that checks systemctl is-active my-c-app.service and then publishes the result as a custom metric (e.g., MyCAppServiceActive = 1 if active, 0 if not) to CloudWatch using the AWS CLI, the CloudWatch agent, or an AWS SDK. Run it at least once per minute so that every alarm period contains a data point.
# Example script snippet (to be run periodically and pushed to CloudWatch)
SERVICE_NAME="my-c-app.service"
METRIC_NAME="MyCAppServiceActive"
NAMESPACE="Custom/MyApp"

# is-active --quiet prints nothing and exits 0 only when the unit is active
if systemctl is-active --quiet "$SERVICE_NAME"; then
  VALUE=1
else
  VALUE=0
fi

# Publish with the AWS CLI (requires the cloudwatch:PutMetricData IAM permission)
aws cloudwatch put-metric-data \
  --metric-name "$METRIC_NAME" \
  --namespace "$NAMESPACE" \
  --value "$VALUE" \
  --dimensions ServiceName="$SERVICE_NAME"
Once the custom metric is published:
- Metric: MyCAppServiceActive
- Namespace: Custom/MyApp
- Statistic: Minimum
- Period: 1 minute
- Threshold: < 1
- Evaluation Periods: 2
- Alarm Actions: SNS topic for critical alerts
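This alarm triggers if the service is not active for two consecutive 1-minute periods. A stopped or unreachable instance publishes nothing at all, so consider setting the alarm's missing-data treatment to “breaching” so that silence also raises the alarm.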
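The evaluation rule can be illustrated with a simplified model. The sketch below is written in Python and is not CloudWatch's actual algorithm, which has more nuanced handling of evaluation ranges and missing data. The function name and the missing_is_breaching parameter (loosely mirroring the alarm's missing-data setting) are assumptions made for illustration.

```python
def alarm_fires(minimums, threshold=1, evaluation_periods=2, missing_is_breaching=True):
    """Return True if the most recent `evaluation_periods` values all breach.

    `minimums` holds one Minimum statistic per 1-minute period, oldest first,
    with None standing for a period in which no data point was published.
    """
    recent = minimums[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False  # not enough history to evaluate yet

    def breaching(value):
        if value is None:
            return missing_is_breaching
        return value < threshold

    return all(breaching(value) for value in recent)
```

In this model a single failed minute followed by a healthy one does not fire the alarm, while two failed minutes in a row do.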
CloudWatch Alarms for Elasticsearch (AWS OpenSearch Service)
If you are using AWS OpenSearch Service (formerly Elasticsearch Service), many key metrics are automatically published to CloudWatch. If you’re self-hosting Elasticsearch on EC2, you’ll need to use the CloudWatch agent to collect relevant metrics (similar to the C app example).
Alarm 1: Cluster Status (for OpenSearch Service)
- Metric: ClusterStatus.red
- Namespace: AWS/ES
- Statistic: Maximum
- Period: 5 minutes
- Threshold: = 1
- Evaluation Periods: 1
- Alarm Actions: SNS topic for critical alerts
You can create similar alarms for ClusterStatus.yellow if yellow status is considered critical for your use case.
Alarm 2: JVM Memory Pressure (for OpenSearch Service)
- Metric: JVMMemoryPressure
- Namespace: AWS/ES
- Statistic: Average
- Period: 5 minutes
- Threshold: > 85%
- Evaluation Periods: 3
- Alarm Actions: SNS topic for warnings/investigation
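Alarm 3: High Disk Usage (for OpenSearch Service)
- Metric: FreeStorageSpace
- Namespace: AWS/ES
- Statistic: Minimum
- Period: 5 minutes
- Threshold: below roughly 25% of each node's storage, expressed in MiB (for example, < 20480 on nodes with 80 GiB volumes)
- Evaluation Periods: 3
- Alarm Actions: SNS topic for warnings/investigation
DiskQueueDepth, by contrast, counts pending I/O requests against the EBS volume and says nothing about how full the disk is. Alarm on it separately if I/O saturation is a concern.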
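Alarm 4: Unassigned Shards (for OpenSearch Service)
- Metric: Shards.unassigned
- Namespace: AWS/ES
- Statistic: Maximum
- Period: 5 minutes
- Threshold: > 0
- Evaluation Periods: 1
- Alarm Actions: SNS topic for critical alerts
Unassigned shards are often the first visible symptom of a lost node. To alarm on node loss directly, watch the Nodes metric against your expected node count.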
By combining systemd health checks for your C application, Prometheus/Grafana for deep Elasticsearch observability, and CloudWatch for critical AWS-native alerting, you establish a robust, multi-layered monitoring strategy to keep your services operational and performant on AWS.