Server Monitoring Best Practices: Keeping Your C App and Elasticsearch Clusters Alive on OVH
Proactive C Application Health Checks with Systemd
Maintaining the stability of a critical C application, especially one serving high-traffic endpoints, requires more than just basic process monitoring. We need to implement deep health checks that go beyond mere existence. For applications managed by systemd, this means leveraging its robust service management capabilities to define comprehensive health indicators. This approach ensures that systemd doesn’t just restart a crashed process, but also flags and potentially isolates services that are technically running but functionally impaired.
A common pattern is to use systemd’s ExecStartPre, ExecStartPost, and crucially, ExecStart with a health-checking mechanism. For a C application, this could involve a dedicated health check endpoint (e.g., via HTTP or a simple TCP port) or a command-line utility that probes internal states. Let’s consider a scenario where our C application exposes a health check endpoint on port 9090.
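As a rough illustration of the health endpoint described above, here is a minimal sketch in C. It assumes a plain-text HTTP response on the loopback interface; the names build_health_response and serve_health, and the is_healthy callback, are hypothetical and stand in for whatever internal checks your application performs.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write a minimal HTTP/1.1 health response into buf.
 * Returns the number of bytes written, or -1 if buf is too small. */
int build_health_response(int healthy, char *buf, size_t buflen)
{
    assert(buf != NULL);
    const char *status = healthy ? "200 OK" : "503 Service Unavailable";
    const char *body = healthy ? "ok\n" : "unhealthy\n";
    int n = snprintf(buf, buflen,
                     "HTTP/1.1 %s\r\n"
                     "Content-Type: text/plain\r\n"
                     "Content-Length: %zu\r\n"
                     "Connection: close\r\n"
                     "\r\n"
                     "%s",
                     status, strlen(body), body);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}

/* Hypothetical serve loop: listens on 127.0.0.1:port and answers every
 * request with the result of is_healthy(). The request path is not parsed,
 * so any URL receives the health response. Returns -1 on a fatal error. */
int serve_health(unsigned short port, int (*is_healthy)(void))
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    if (srv < 0)
        return -1;

    int one = 1;
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(srv, 16) < 0) {
        close(srv);
        return -1;
    }

    for (;;) {
        int conn = accept(srv, NULL, NULL);
        if (conn < 0) {
            if (errno == EINTR)
                continue;
            close(srv);
            return -1;
        }
        char req[1024];
        ssize_t got = read(conn, req, sizeof req); /* content is ignored */
        if (got > 0) {
            char resp[256];
            int n = build_health_response(is_healthy(), resp, sizeof resp);
            if (n > 0) {
                /* MSG_NOSIGNAL avoids SIGPIPE if the client has gone away */
                (void)send(conn, resp, (size_t)n, MSG_NOSIGNAL);
            }
        }
        close(conn);
    }
}
```

In a real application, serve_health(9090, my_check) would typically run in a dedicated thread so that the health endpoint stays responsive while the main work loop is busy. Returning 503 when the check fails is what lets curl --fail detect an impaired but still running process.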
Systemd Service Unit for C Application with Health Check
Here’s a sample systemd service unit file. We’ll use curl in ExecStartPost to verify the health endpoint after the application starts. We’ll also configure Restart=on-failure and WatchdogSec for more aggressive monitoring.
[Unit]
Description=My Critical C Application
After=network.target

[Service]
Type=simple
User=appuser
Group=appgroup
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/bin/my_c_app --config /etc/myapp.conf

# Run a health check once the main process has been started.
# We expect a 200 OK response; curl --fail exits non-zero on HTTP 4xx/5xx.
# With Type=simple this command runs as soon as the process is forked, so
# retry (including on refused connections) while the application binds its port.
ExecStartPost=/usr/bin/curl --silent --fail --retry 5 --retry-delay 1 --retry-connrefused http://localhost:9090/health

# If the process exits uncleanly or the start-up health check fails,
# the service enters the failed state and systemd restarts it.
Restart=on-failure

# Watchdog keep-alive: the application must send WATCHDOG=1 via sd_notify()
# at least once every 30 seconds. If the messages stop, systemd terminates
# the service and Restart=on-failure applies. Remove this line if the
# application never calls sd_notify(), otherwise it is killed at every timeout.
WatchdogSec=30

# Upper bound on start-up, including the ExecStartPost command.
# TimeoutStartSec=60

[Install]
WantedBy=multi-user.target
In this configuration:
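- Type=simple is suitable for applications that don’t fork and don’t explicitly notify systemd of their readiness. systemd treats the service as started as soon as the main process has been forked.
- ExecStartPost runs a curl command. The --fail option makes curl return a non-zero exit code if the HTTP status code is 4xx or 5xx; a failed connection also produces a non-zero exit code. This is crucial for triggering the Restart=on-failure mechanism. Because Type=simple does not wait for the application to be ready, the check can run before the port is open; curl’s --retry and --retry-connrefused options cover that window.
- Restart=on-failure ensures that if the application crashes or if ExecStartPost fails, systemd will attempt to restart it.
- WatchdogSec=30 enables systemd’s watchdog. It is a keep-alive, not a start-up timeout: the application has to send WATCHDOG=1 through sd_notify() at least once every 30 seconds. If the messages stop, systemd terminates the service (with SIGABRT by default) and Restart=on-failure restarts it. An application that never calls sd_notify() should not set WatchdogSec, because it would be killed at every timeout. The ExecStartPost health check is independent of the watchdog and only runs at start-up.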
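For the watchdog to be useful, the C application has to send the keep-alive itself. The usual route is sd_notify(0, "WATCHDOG=1") from libsystemd. The sketch below instead speaks the notification protocol directly (a datagram sent to the socket named in the NOTIFY_SOCKET environment variable), so it has no library dependency. The function names notify_systemd and watchdog_ping_interval_usec are hypothetical, and the code is an illustration under those assumptions rather than a drop-in replacement for libsystemd.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: send one state string (e.g. "WATCHDOG=1") to systemd.
 * Returns 1 if sent, 0 if NOTIFY_SOCKET is not set (not running under
 * systemd notification), and -1 on error. */
int notify_systemd(const char *state)
{
    assert(state != NULL);
    const char *path = getenv("NOTIFY_SOCKET");
    if (path == NULL || path[0] == '\0')
        return 0;

    struct sockaddr_un addr;
    size_t len = strlen(path);
    /* Only absolute paths and abstract sockets ('@' prefix) are handled. */
    if ((path[0] != '/' && path[0] != '@') || len >= sizeof addr.sun_path)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, len);
    if (path[0] == '@')
        addr.sun_path[0] = '\0'; /* abstract namespace socket */

    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    socklen_t addrlen = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + len);
    size_t msglen = strlen(state);
    ssize_t sent = sendto(fd, state, msglen, 0,
                          (const struct sockaddr *)&addr, addrlen);
    close(fd);
    return (sent == (ssize_t)msglen) ? 1 : -1;
}

/* Hypothetical helper: systemd exports the watchdog timeout in microseconds
 * as WATCHDOG_USEC. Returns half of it as a ping interval, or 0 if the
 * watchdog is not enabled or the value cannot be parsed. */
unsigned long long watchdog_ping_interval_usec(void)
{
    const char *s = getenv("WATCHDOG_USEC");
    if (s == NULL || s[0] == '\0')
        return 0;
    char *end = NULL;
    errno = 0;
    unsigned long long usec = strtoull(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0')
        return 0;
    return usec / 2;
}
```

A typical use is to call notify_systemd("WATCHDOG=1") from the main loop once per ping interval, and only when the application's own internal checks pass. That way a process that is alive but stuck stops sending keep-alives and systemd restarts it.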
To enable and start this service:
sudo systemctl daemon-reload
sudo systemctl enable myapp.service
sudo systemctl start myapp.service
sudo systemctl status myapp.service
This setup provides a basic but effective layer of automated health checking for your C application, ensuring that systemd is aware of its functional status, not just its process ID.
Elasticsearch Cluster Health Monitoring with Prometheus and Alertmanager
Monitoring Elasticsearch clusters, especially in a distributed environment on OVH, requires a robust solution that can handle metrics collection, visualization, and alerting. Prometheus, coupled with Elasticsearch’s built-in metrics endpoints and Alertmanager, is a de facto standard for this. We’ll focus on collecting key cluster health metrics and setting up alerts for common failure scenarios.
Prometheus Exporter for Elasticsearch
Prometheus doesn’t natively scrape Elasticsearch. We need an exporter. The prometheus-community/elasticsearch_exporter is a popular choice. It queries the Elasticsearch REST API (cluster health and node stats, among others) and republishes the results in Prometheus format.
First, ensure your Elasticsearch nodes are reachable from the exporter. The exporter reads from Elasticsearch’s REST API (port 9200 by default), so no additional metrics plugin is required. If security is enabled on the cluster, give the exporter credentials that are allowed to call the cluster health and node stats endpoints.
We’ll deploy the exporter as a Docker container or a systemd service. Here’s a systemd service example:
[Unit]
Description=Prometheus Elasticsearch Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
# --es.uri takes a single address and cannot be repeated. --es.all makes the
# exporter query stats for every node in the cluster rather than only the
# node it connects to.
ExecStart=/usr/local/bin/elasticsearch_exporter \
    --es.uri=http://elasticsearch-node1:9200 \
    --es.all \
    --web.listen-address=:9114
Restart=always

[Install]
WantedBy=multi-user.target
Replace elasticsearch-nodeX:9200 with your actual Elasticsearch node addresses. The exporter will run on port 9114. You’ll need to create the prometheus user and group, and ensure the binary is in place.
Prometheus Configuration for Scraping Elasticsearch Exporter
Next, configure Prometheus to scrape the Elasticsearch exporter. Add the following to your prometheus.yml:
scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['elasticsearch-exporter-host:9114'] # Replace with your exporter's host
    metrics_path: /metrics
    # Optional: Add relabeling if you need to tag metrics with cluster names etc.
    # relabel_configs:
    #   - source_labels: [__address__]
    #     target_label: cluster
    #     regex: '([^:]+):.*'
    #     replacement: 'my-production-cluster'
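Reload the Prometheus configuration. The HTTP endpoint below only responds if Prometheus was started with --web.enable-lifecycle; otherwise send SIGHUP to the Prometheus process. Port 9090 is the Prometheus default and is also the port used for the C application’s health endpoint earlier, so change one of them if both run on the same host: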
curl -X POST http://localhost:9090/-/reload
Key Elasticsearch Metrics to Monitor
The exporter provides a wealth of metrics. Focus on these for cluster health:
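- elasticsearch_cluster_health_status: The overall health of the cluster, exposed as one series per colour (color="green", "yellow" or "red"). The series for the current state has the value 1 and the others 0. This is critical.
- elasticsearch_cluster_health_number_of_nodes: Number of nodes in the cluster.
- Number of indices: tracked through the per-index metrics the exporter emits when started with --es.indices.
- elasticsearch_cluster_health_active_shards: Total number of active shards (primaries and replicas).
- elasticsearch_cluster_health_active_primary_shards: Number of active primary shards.
- elasticsearch_cluster_health_unassigned_shards: Number of unassigned shards.
- elasticsearch_jvm_memory_used_bytes and elasticsearch_jvm_memory_max_bytes (with area="heap"): Heap usage and heap limit per node.
- elasticsearch_process_cpu_percent: CPU usage per node.
- elasticsearch_indices_indexing_index_total: Cumulative indexing operations; apply rate() to get the indexing rate.
- elasticsearch_indices_search_query_total: Cumulative search queries; apply rate() to get the search rate.
Metric names have changed between exporter releases, so confirm them against the exporter’s /metrics output before building dashboards or alerts.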
Alerting Rules with Alertmanager
Configure Prometheus alerting rules. Create a file like alert.rules.yml:
groups:
  - name: elasticsearch_alerts
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster is RED"
          description: "The Elasticsearch cluster {{ $labels.cluster }} is in RED health status. At least one primary shard is unassigned, so some data is unavailable."
      - alert: ElasticsearchClusterYellow
        expr: elasticsearch_cluster_health_status{color="yellow"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch cluster is YELLOW"
          description: "The Elasticsearch cluster {{ $labels.cluster }} is in YELLOW health status. All primary shards are allocated but some replica shards are not, so redundancy is reduced."
      - alert: ElasticsearchUnassignedShards
        expr: elasticsearch_cluster_health_unassigned_shards > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch has unassigned shards"
          description: "There are {{ $value }} unassigned shards in the Elasticsearch cluster {{ $labels.cluster }}. This could be due to node failures or insufficient resources."
      - alert: HighElasticsearchHeapUsage
        expr: elasticsearch_jvm_memory_used_bytes{job="elasticsearch", area="heap"} / elasticsearch_jvm_memory_max_bytes{job="elasticsearch", area="heap"} * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High Elasticsearch heap usage on node {{ $labels.name }}"
          # Single-quoted so the double quotes inside the template stay valid YAML.
          description: 'Node {{ $labels.name }} in cluster {{ $labels.cluster }} has {{ printf "%.2f" $value }}% heap usage, exceeding the 85% threshold.'
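Ensure Prometheus is configured to load these rules: list the file under rule_files in prometheus.yml as the entry - alert.rules.yml. Running promtool check rules alert.rules.yml validates the file before you reload. Also make sure the alerting section of prometheus.yml points at your Alertmanager instance, and that Alertmanager routes alerts to your desired notification channels (email, Slack, PagerDuty, etc.).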
OVH Specific Considerations
When deploying on OVH, consider the following:
- Network Segmentation: Use OVH’s network features (vRack private networks, plus firewall rules or security groups where your product offers them) to isolate your Elasticsearch cluster and monitoring infrastructure. Only allow necessary traffic between nodes and from your monitoring system.
- Instance Sizing: Properly size your C application instances and Elasticsearch nodes based on expected load. OVH offers a range of instance types. Monitor resource utilization (CPU, RAM, Disk I/O) closely.
- Disk Performance: Elasticsearch is I/O intensive. Choose OVH instance types with fast local SSDs or consider network-attached storage solutions if local storage is insufficient. Monitor disk latency and throughput.
- High Availability: For Elasticsearch, deploy multiple nodes across different availability zones within an OVH region if possible, and configure replication appropriately. For your C application, consider load balancing and redundant instances.
- Logging: Centralize logs from your C application and Elasticsearch nodes using a service like ELK stack (which you’re monitoring!) or a cloud-native logging solution. This is invaluable for debugging issues that metrics alone don’t reveal.
- OVH Monitoring Tools: While Prometheus is powerful, don’t neglect OVH’s built-in monitoring tools for infrastructure-level metrics (e.g., network traffic, disk usage at the hypervisor level). These can provide early warnings of underlying hardware or network issues.
By combining robust application-level health checks with comprehensive cluster monitoring and leveraging OVH’s infrastructure capabilities, you can build a highly resilient and observable system.