Server Monitoring Best Practices: Keeping Your C App and Elasticsearch Clusters Alive on OVH

Proactive C Application Health Checks with Systemd

Maintaining the stability of a critical C application, especially one serving high-traffic endpoints, requires more than just basic process monitoring. We need to implement deep health checks that go beyond mere existence. For applications managed by systemd, this means leveraging its robust service management capabilities to define comprehensive health indicators. This approach ensures that systemd doesn’t just restart a crashed process, but also flags and potentially isolates services that are technically running but functionally impaired.

A common pattern is to use systemd’s ExecStartPre, ExecStartPost, and crucially, ExecStart with a health-checking mechanism. For a C application, this could involve a dedicated health check endpoint (e.g., via HTTP or a simple TCP port) or a command-line utility that probes internal states. Let’s consider a scenario where our C application exposes a health check endpoint on port 9090.
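As a rough illustration of the HTTP variant, here is a minimal sketch of such an endpoint in C. The function names (build_health_response, open_health_listener, serve_health_once) are hypothetical, and the healthy flag stands in for whatever internal checks your application actually performs. The sketch binds to loopback only and answers every request the same way without parsing the path, so treat it as a starting point rather than a finished server.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Format a complete HTTP/1.1 response for the health endpoint.
 * Returns the number of bytes written (excluding the NUL terminator),
 * or -1 if buf is too small. */
int build_health_response(int healthy, char *buf, size_t buflen)
{
    const char *status = healthy ? "200 OK" : "503 Service Unavailable";
    const char *body = healthy ? "ok\n" : "unhealthy\n";
    int n = snprintf(buf, buflen,
                     "HTTP/1.1 %s\r\n"
                     "Content-Type: text/plain\r\n"
                     "Content-Length: %zu\r\n"
                     "Connection: close\r\n"
                     "\r\n"
                     "%s",
                     status, strlen(body), body);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}

/* Open a listening TCP socket bound to 127.0.0.1:port.
 * Returns the socket descriptor, or -1 on failure. */
int open_health_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Accept one connection and answer it. The request is read but not
 * parsed, so every path receives the same response.
 * Returns 0 on success, -1 on failure. */
int serve_health_once(int listen_fd, int healthy)
{
    int client = accept(listen_fd, NULL, NULL);
    if (client < 0)
        return -1;

    char req[1024];
    ssize_t got = read(client, req, sizeof req);
    (void)got;

    char resp[256];
    int n = build_health_response(healthy, resp, sizeof resp);
    int rc = (n > 0 && write(client, resp, (size_t)n) == (ssize_t)n) ? 0 : -1;
    close(client);
    return rc;
}
```

In the application, a dedicated thread could call open_health_listener(9090) once and then loop on serve_health_once(), passing the result of the real internal checks as the healthy argument. Returning 503 when those checks fail is what lets an HTTP probe distinguish "running" from "working".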

Systemd Service Unit for C Application with Health Check

Here’s a sample systemd service unit file. We’ll use curl in ExecStartPost to verify the health endpoint after the application starts. We’ll also configure Restart=on-failure and WatchdogSec for more aggressive monitoring.

[Unit]
Description=My Critical C Application
After=network.target

[Service]
Type=simple
User=appuser
Group=appgroup
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/bin/my_c_app --config /etc/myapp.conf
# Execute a health check after the application starts.
# With Type=simple this runs as soon as the main process is forked, so retry
# while the application is still opening its listener.
# We expect a 200 OK response from the health endpoint.
ExecStartPost=/usr/bin/curl --fail --silent --show-error --retry 5 --retry-delay 2 --retry-connrefused http://localhost:9090/health
# If the health check fails (non-zero exit code), the service will be considered failed.
Restart=on-failure
# Set a watchdog timeout. The application must send WATCHDOG=1 through
# sd_notify() at least once every 30 seconds; if the pings stop, systemd
# terminates the service and Restart=on-failure restarts it.
# Remove this line if the application does not send watchdog pings.
WatchdogSec=30
# ExecStartPost counts against the start timeout (90s by default); adjust if needed.
# TimeoutStartSec=60

[Install]
WantedBy=multi-user.target

In this configuration:

Type=simple is suitable for applications that don’t fork and don’t explicitly notify systemd of their readiness.
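ExecStartPost runs a curl command. The --fail option makes curl exit non-zero (code 22) when the server answers with an HTTP status of 400 or above; a failed connection exits non-zero with or without it. Either way, the non-zero exit marks the start as failed, which is what triggers Restart=on-failure. Because Type=simple runs ExecStartPost as soon as the main process is forked, the probe can race the application's startup; curl's --retry and --retry-connrefused options cover that window.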
Restart=on-failure ensures that if the application crashes or if ExecStartPost fails, systemd will attempt to restart it.
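WatchdogSec=30 arms systemd's watchdog. It does not measure startup time: once the service is running, the application must send WATCHDOG=1 via sd_notify() at least every 30 seconds (systemd passes the interval to the process in the WATCHDOG_USEC environment variable). If the pings stop, systemd terminates the service and Restart=on-failure restarts it. An application that never sends the ping is therefore killed about 30 seconds after each start, so omit WatchdogSec until the ping is implemented. The ExecStartPost health check is independent of the watchdog: it gates startup only, while the watchdog covers the rest of the service's lifetime.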
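The watchdog ping itself is small. The sketch below speaks the notification protocol directly over the datagram socket named in NOTIFY_SOCKET, so it does not need to link against libsystemd; if libsystemd is available, its sd_notify() does the same job. The function names notify_systemd and watchdog_ping_interval_usec are hypothetical helpers, not part of any library.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Send one state string (for example "READY=1" or "WATCHDOG=1") to the
 * datagram socket named by $NOTIFY_SOCKET.
 * Returns 1 if the message was sent, 0 if the variable is unset (the
 * process is not supervised by systemd), and -1 on error. */
int notify_systemd(const char *state)
{
    const char *path = getenv("NOTIFY_SOCKET");
    if (path == NULL || path[0] == '\0')
        return 0;

    struct sockaddr_un addr;
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;

    size_t len = strlen(path);
    if (len >= sizeof addr.sun_path)
        return -1;
    memcpy(addr.sun_path, path, len);
    if (addr.sun_path[0] == '@')
        addr.sun_path[0] = '\0'; /* leading '@' means abstract namespace */

    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    socklen_t addrlen = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + len);
    size_t msglen = strlen(state);
    ssize_t sent = sendto(fd, state, msglen, 0,
                          (struct sockaddr *)&addr, addrlen);
    close(fd);
    return sent == (ssize_t)msglen ? 1 : -1;
}

/* systemd exports WATCHDOG_USEC when WatchdogSec= is set. Return half of
 * that value, the ping interval the systemd documentation recommends,
 * or 0 if the variable is unset or not a number. */
unsigned long long watchdog_ping_interval_usec(void)
{
    const char *s = getenv("WATCHDOG_USEC");
    if (s == NULL || *s == '\0')
        return 0;

    char *end = NULL;
    errno = 0;
    unsigned long long usec = strtoull(s, &end, 10);
    if (errno != 0 || *end != '\0')
        return 0;
    return usec / 2;
}
```

A typical main loop would call notify_systemd("WATCHDOG=1") only after its own internal checks pass, at least once per interval returned by watchdog_ping_interval_usec(). That way a hung or degraded process stops pinging and systemd restarts it.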

To enable and start this service:

sudo systemctl daemon-reload
sudo systemctl enable myapp.service
sudo systemctl start myapp.service
sudo systemctl status myapp.service

This setup provides a basic but effective layer of automated health checking for your C application, ensuring that systemd is aware of its functional status, not just its process ID.
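If curl is not installed on the host, the command-line probe mentioned earlier can be a small C utility instead. The sketch below is one possible shape, with hypothetical function names; it handles HTTP/1.x status lines only, has no timeout of its own (it relies on systemd's start timeout), and borrows curl's exit codes 7 and 22 purely as a convention.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Extract the status code from a line such as "HTTP/1.1 200 OK".
 * Returns -1 if the line is not a valid HTTP/1.x status line. */
int http_status_from_line(const char *line)
{
    int major, minor, code;
    if (sscanf(line, "HTTP/%d.%d %d", &major, &minor, &code) != 3)
        return -1;
    if (code < 100 || code > 599)
        return -1;
    return code;
}

/* Map a status code to a process exit code: 0 for anything below 400,
 * 22 otherwise (the code curl --fail uses for HTTP errors). */
int probe_exit_code(int status)
{
    return (status >= 100 && status < 400) ? 0 : 22;
}

/* Connect to 127.0.0.1:port, request /health, and return an exit code:
 * 0 when healthy, 7 when the connection fails, 22 on an HTTP error or
 * unparsable reply, 1 if the request could not be sent. */
int probe_health(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return 7;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return 7;
    }

    static const char req[] = "GET /health HTTP/1.0\r\nHost: localhost\r\n\r\n";
    if (write(fd, req, sizeof req - 1) != (ssize_t)(sizeof req - 1)) {
        close(fd);
        return 1;
    }

    /* Read until the first line is complete or the buffer is full. */
    char line[128];
    size_t used = 0;
    while (used < sizeof line - 1) {
        ssize_t n = read(fd, line + used, sizeof line - 1 - used);
        if (n <= 0)
            break;
        used += (size_t)n;
        if (memchr(line, '\n', used) != NULL)
            break;
    }
    line[used] = '\0';
    close(fd);
    return probe_exit_code(http_status_from_line(line));
}
```

Wrapped in a main() that returns probe_health(9090), the resulting binary can replace the curl command in ExecStartPost, since systemd only looks at the exit status.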

Elasticsearch Cluster Health Monitoring with Prometheus and Alertmanager

Monitoring Elasticsearch clusters, especially in a distributed environment on OVH, requires a robust solution that can handle metrics collection, visualization, and alerting. Prometheus, coupled with Elasticsearch’s built-in metrics endpoints and Alertmanager, is a de facto standard for this. We’ll focus on collecting key cluster health metrics and setting up alerts for common failure scenarios.

Prometheus Exporter for Elasticsearch

Prometheus doesn’t natively scrape Elasticsearch. We need an exporter. The prometheus-community/elasticsearch_exporter is a popular choice. It queries Elasticsearch’s REST API, chiefly the cluster health API (_cluster/health) and the nodes stats API (_nodes/stats), and re-exposes the results in Prometheus format.

First, ensure your Elasticsearch nodes are reachable from the exporter host. Elasticsearch needs no extra configuration to expose these statistics, since they are part of its standard REST API. If security is enabled, give the exporter credentials that are allowed to read the monitoring endpoints (typically the monitor cluster privilege).

We’ll deploy the exporter as a Docker container or a systemd service. Here’s a systemd service example:

[Unit]
Description=Prometheus Elasticsearch Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/elasticsearch_exporter \
  --es.uri=http://elasticsearch-node1:9200 \
  --es.all \
  --web.listen-address=:9114
Restart=always

[Install]
WantedBy=multi-user.target

Replace elasticsearch-node1:9200 with the address of one of your Elasticsearch nodes, or of a load balancer in front of them. The exporter accepts a single --es.uri; the --es.all flag tells it to report stats for every node in the cluster rather than only the node it connects to. The exporter will listen on port 9114. You’ll need to create the prometheus user and group, and ensure the binary is in place.

Prometheus Configuration for Scraping Elasticsearch Exporter

Next, configure Prometheus to scrape the Elasticsearch exporter. Add the following to your prometheus.yml:

scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['elasticsearch-exporter-host:9114'] # Replace with your exporter's host
    metrics_path: /metrics
    # Optional: Add relabeling if you need to tag metrics with cluster names etc.
    # relabel_configs:
    #   - source_labels: [__address__]
    #     target_label: cluster
    #     regex: '([^:]+):.*'
    #     replacement: 'my-production-cluster'

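Reload the Prometheus configuration. The HTTP endpoint below only works if Prometheus was started with --web.enable-lifecycle; otherwise send the Prometheus process a SIGHUP. Note that Prometheus listens on port 9090 by default, so the C application’s health endpoint needs a different port if both run on the same host.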

curl -X POST http://localhost:9090/-/reload

Key Elasticsearch Metrics to Monitor

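The exporter provides a wealth of metrics. Metric names differ between exporter versions, so confirm each one against your exporter’s /metrics output. Focus on these for cluster health: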

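elasticsearch_cluster_health_status: The overall health of the cluster, exposed as one series per color label (green, yellow, red). The series for the current state has the value 1 and the others 0. This is critical.
elasticsearch_cluster_health_number_of_nodes: Number of nodes in the cluster.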
elasticsearch_cluster_indices_count: Number of indices.
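elasticsearch_cluster_health_active_shards: Number of active shards (primaries and replicas).
elasticsearch_cluster_health_active_primary_shards: Number of active primary shards.
elasticsearch_cluster_health_unassigned_shards: Number of unassigned shards.
elasticsearch_jvm_memory_used_bytes{area="heap"}: Heap usage per node; compare it with elasticsearch_jvm_memory_max_bytes{area="heap"}.
elasticsearch_process_cpu_percent: CPU usage of the Elasticsearch process per node.
elasticsearch_indices_indexing_index_total: Cumulative count of indexing operations; apply rate() to get the indexing rate.
elasticsearch_indices_search_query_total: Cumulative count of search queries; apply rate() to get the search rate.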

Alerting Rules with Alertmanager

Configure Prometheus alerting rules. Create a file like alert.rules.yml:

groups:
- name: elasticsearch_alerts
  rules:
  - alert: ElasticsearchClusterRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster is RED"
      description: "The Elasticsearch cluster {{ $labels.cluster }} is in RED health status. At least one primary shard is unassigned, so some data is unavailable."

  - alert: ElasticsearchClusterYellow
    expr: elasticsearch_cluster_health_status{color="yellow"} == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch cluster is YELLOW"
      description: "The Elasticsearch cluster {{ $labels.cluster }} is in YELLOW health status. All primary shards are assigned but at least one replica is not, so redundancy is reduced."

  - alert: ElasticsearchUnassignedShards
    expr: elasticsearch_cluster_health_unassigned_shards > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch has unassigned shards"
      description: "There are {{ $value }} unassigned shards in the Elasticsearch cluster {{ $labels.cluster }}. This could be due to node failures or insufficient resources."

  - alert: HighElasticsearchHeapUsage
    expr: elasticsearch_jvm_memory_used_bytes{job="elasticsearch",area="heap"} / elasticsearch_jvm_memory_max_bytes{job="elasticsearch",area="heap"} * 100 > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High Elasticsearch heap usage on node {{ $labels.name }}"
      description: 'Node {{ $labels.name }} in cluster {{ $labels.cluster }} has {{ printf "%.2f" $value }}% heap usage, exceeding the 85% threshold.'

Ensure Prometheus is configured to load these rules (list alert.rules.yml as an entry under the rule_files key of prometheus.yml) and that the alerting section of prometheus.yml points at your Alertmanager. Alertmanager in turn must be set up to route the alerts to your desired notification channels (email, Slack, PagerDuty, etc.).

OVH Specific Considerations

When deploying on OVH, consider the following:

Network Segmentation: Use OVH’s network features (vRack private networks, plus security groups or firewall rules where the product offers them) to isolate your Elasticsearch cluster and monitoring infrastructure. Only allow necessary traffic between nodes and from your monitoring system.
Instance Sizing: Properly size your C application instances and Elasticsearch nodes based on expected load. OVH offers a range of instance types. Monitor resource utilization (CPU, RAM, Disk I/O) closely.
Disk Performance: Elasticsearch is I/O intensive. Choose OVH instance types with fast local SSDs or consider network-attached storage solutions if local storage is insufficient. Monitor disk latency and throughput.
High Availability: For Elasticsearch, deploy multiple nodes across different availability zones within an OVH region if possible, and configure replication appropriately. For your C application, consider load balancing and redundant instances.
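Logging: Centralize logs from your C application and Elasticsearch nodes using a service like the ELK stack or a cloud-native logging solution. If you use ELK, prefer a cluster separate from the one you are monitoring, so that an outage does not take its own logs down with it. Centralized logs are invaluable for debugging issues that metrics alone don’t reveal.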
OVH Monitoring Tools: While Prometheus is powerful, don’t neglect OVH’s built-in monitoring tools for infrastructure-level metrics (e.g., network traffic, disk usage at the hypervisor level). These can provide early warnings of underlying hardware or network issues.

By combining robust application-level health checks with comprehensive cluster monitoring and leveraging OVH’s infrastructure capabilities, you can build a highly resilient and observable system.
