Server Monitoring Best Practices: Keeping Your C App and Elasticsearch Clusters Alive on Linode

Proactive C Application Health Checks with Systemd

Maintaining the stability of a C application, especially one serving critical functions, requires robust health monitoring. For applications managed by systemd, we can leverage its built-in capabilities for self-reporting and automated recovery. This involves defining a systemd service unit that includes a health check mechanism.

Consider a C application that exposes a simple HTTP endpoint for health checks. systemd does not poll HTTP endpoints on its own, but a service unit can probe the endpoint when the service starts and, through the watchdog, require the application to keep reporting that it is alive. Both checks verify the application’s actual responsiveness, which is far more effective than simply checking if the process is running.
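When a probe needs more logic than a single HTTP request, it can be scripted. The following is a minimal sketch in Python, not part of the original setup: the function names fetch_status and probe are hypothetical, and the retry count and delay are illustrative assumptions. It treats any status below 400 as healthy and retries while the endpoint is unreachable or returning errors.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=2.0):
    """Return the HTTP status code of a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def probe(url, attempts=10, delay=1.0, fetch=fetch_status, sleep=time.sleep):
    """Return True as soon as one attempt yields a status below 400."""
    for attempt in range(attempts):
        status = fetch(url)
        if status is not None and status < 400:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next attempt
    return False
```

A wrapper script would call probe("http://localhost:8080/health") and exit with status 0 or 1 accordingly, so that systemd or any other supervisor can act on the result. The fetch and sleep parameters exist only so the retry logic can be exercised without a network.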

Systemd Service Unit Configuration

Let’s define a systemd service file, typically located at /etc/systemd/system/my-c-app.service. This file will specify how to start, stop, and monitor our C application.

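The key directives here are ExecStartPost, which runs a command after the main process has been started, Restart, which controls automated recovery, and WatchdogSec, which adds ongoing liveness checks. Note that systemd unit files do not support trailing comments: a # only starts a comment at the beginning of a line, so the comments below sit on their own lines.

[Unit]
Description=My Critical C Application
After=network.target

[Service]
Type=simple
User=appuser
Group=appgroup
WorkingDirectory=/opt/my-c-app
ExecStart=/opt/my-c-app/bin/my_c_app --config /etc/my-c-app/config.conf
# Probe the health endpoint after the process has been started, retrying while the app initializes
ExecStartPost=/usr/bin/curl --fail --silent --head --retry 10 --retry-delay 1 --retry-connrefused http://localhost:8080/health
Restart=on-failure
RestartSec=10
# Expect a keep-alive ping at least every 30 seconds; remove this line if the app never sends WATCHDOG=1
WatchdogSec=30
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

In this configuration:

ExecStartPost=/usr/bin/curl --fail --silent --head --retry 10 --retry-delay 1 --retry-connrefused http://localhost:8080/health: This command requests the HTTP headers of the application’s health check endpoint. The probe belongs in ExecStartPost rather than ExecStartPre, because ExecStartPre runs before the application is listening and a probe there would always fail. The --fail option makes curl return a non-zero exit code when the HTTP status is 400 or above, and the --retry options give the application roughly 10 seconds to start accepting connections. If the probe still fails, systemd treats the start as failed, stops the main process, and applies the Restart policy. This probe runs once per start, not periodically.
WatchdogSec=30: This directive tells systemd to consider the service failed if the main process does not send a “keep-alive” notification (WATCHDOG=1, usually via sd_notify()) within 30 seconds. It works with Type=simple as well as Type=notify, but in both cases the application itself must implement the pinging, conventionally at half the configured interval. If your C app doesn’t support this, remove WatchdogSec; otherwise systemd will terminate the service (with SIGABRT by default) each time the 30-second deadline passes. In that case, rely on the ExecStartPost probe and Restart=on-failure.
Restart=on-failure and RestartSec=10: These ensure that systemd restarts the service after a 10-second delay when the application exits with a non-zero status code, is killed by a signal, hits a timeout, or misses its watchdog deadline, and also when the ExecStartPost probe fails.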
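The watchdog keep-alive is a small datagram sent to the Unix socket that systemd names in the NOTIFY_SOCKET environment variable. A C application would normally call sd_notify(0, "WATCHDOG=1") from libsystemd; the sketch below shows the same protocol in Python purely to make the wire format visible. The function names are hypothetical, and halving the interval is a common convention rather than a requirement.

```python
import os
import socket


def sd_notify(state, environ=os.environ):
    """Send one state string (e.g. "WATCHDOG=1") to systemd's notify socket.

    Returns False when NOTIFY_SOCKET is unset, i.e. not started by systemd.
    """
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), path)
    return True


def watchdog_interval(environ=os.environ):
    """Return a ping interval in seconds, or None if no watchdog is configured.

    systemd passes WatchdogSec to the service as WATCHDOG_USEC (microseconds);
    pinging at half that value leaves headroom for scheduling delays.
    """
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2
```

With WatchdogSec=30, watchdog_interval() yields 15 seconds, so the main loop would call sd_notify("WATCHDOG=1") at least that often, and only while the application considers itself healthy.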

After creating or modifying this file, reload the systemd daemon and start/enable the service:

sudo systemctl daemon-reload
sudo systemctl enable my-c-app.service
sudo systemctl start my-c-app.service

To check the status and logs:

sudo systemctl status my-c-app.service
sudo journalctl -u my-c-app.service -f

Elasticsearch Cluster Monitoring with Prometheus and Alertmanager

Monitoring an Elasticsearch cluster involves tracking not just the availability of nodes but also key performance indicators like JVM heap usage, disk I/O, query latency, and indexing rates. Prometheus is an excellent choice for collecting these metrics, and Alertmanager for handling notifications.

Prometheus Exporter Setup

Elasticsearch doesn’t expose metrics in a Prometheus-native format by default, so we need an exporter. A widely used option is elasticsearch_exporter, which is maintained by the prometheus-community project rather than by Elastic.

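First, download and install the exporter. The commands below work on any amd64 Linux host, including Debian/Ubuntu. Check the project’s releases page for the version you want and adjust the version number in the URL and file names to match: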

wget https://github.com/prometheus-community/elasticsearch_exporter/releases/download/v0.12.1/elasticsearch_exporter-0.12.1.linux-amd64.tar.gz
tar xvfz elasticsearch_exporter-0.12.1.linux-amd64.tar.gz
sudo mv elasticsearch_exporter-0.12.1.linux-amd64/elasticsearch_exporter /usr/local/bin/
sudo rm -rf elasticsearch_exporter-0.12.1.linux-amd64*

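Next, create a systemd service for the exporter, for example at /etc/systemd/system/elasticsearch_exporter.service. The exporter queries your Elasticsearch cluster and exposes the results on port 9114 for Prometheus to scrape.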

[Unit]
Description=Prometheus Elasticsearch Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/elasticsearch_exporter \
  --es.uri=http://localhost:9200 \
  --es.timeout=5s \
  --web.listen-address=":9114" \
  --log.level=info

[Install]
WantedBy=multi-user.target

Ensure the prometheus user and group exist. If not, create them:

sudo groupadd --system prometheus
sudo useradd --system --no-create-home --gid prometheus prometheus
sudo mkdir -p /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

Now, enable and start the exporter service:

sudo systemctl daemon-reload
sudo systemctl enable elasticsearch_exporter.service
sudo systemctl start elasticsearch_exporter.service
sudo systemctl status elasticsearch_exporter.service

Prometheus Configuration for Elasticsearch

Add the Elasticsearch exporter as a scrape target in your Prometheus configuration (e.g., /etc/prometheus/prometheus.yml). If you have multiple Elasticsearch nodes, configure Prometheus to scrape each exporter instance. Keep all of the targets under a single job so that alert rules selecting job="elasticsearch" cover every node.

scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['localhost:9114'] # Assuming exporter is on the same host as Prometheus
        labels:
          cluster: 'my-es-cluster-1'
          instance: 'es-node-1'
      - targets: ['es-node-2-ip:9114'] # If exporter is on a different node
        labels:
          cluster: 'my-es-cluster-1'
          instance: 'es-node-2'

Reload Prometheus configuration:

sudo systemctl reload prometheus

Alerting Rules for Elasticsearch

Define alerting rules in Prometheus to notify Alertmanager of potential issues. These rules should be placed in a separate file, e.g., /etc/prometheus/alert.rules.yml, and referenced in prometheus.yml.

groups:
- name: elasticsearch_alerts
  rules:
  - alert: ElasticsearchDown
    expr: up{job="elasticsearch"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch exporter target {{ $labels.instance }} is down."
      description: "Prometheus has failed to scrape the Elasticsearch exporter for {{ $labels.instance }} for 5 minutes."

  - alert: ElasticsearchHighHeapUsage
    expr: elasticsearch_jvm_memory_used_bytes{job="elasticsearch", area="heap"} / elasticsearch_jvm_memory_max_bytes{job="elasticsearch", area="heap"} * 100 > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch node {{ $labels.instance }} has high heap usage."
      description: 'Elasticsearch node {{ $labels.instance }} is using {{ $value | printf "%.2f" }}% of its JVM heap.'

  - alert: ElasticsearchDiskSpaceLow
    expr: elasticsearch_filesystem_data_free_bytes{job="elasticsearch"} / elasticsearch_filesystem_data_size_bytes{job="elasticsearch"} * 100 < 20
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch node {{ $labels.instance }} has low disk space."
      description: 'Elasticsearch node {{ $labels.instance }} has only {{ $value | printf "%.2f" }}% free disk space remaining.'

  - alert: ElasticsearchIndexingRateLow
    expr: rate(elasticsearch_indices_indexing_index_total{job="elasticsearch"}[5m]) < 10
    for: 10m
    labels:
      severity: info
    annotations:
      summary: "Elasticsearch indexing rate is low."
      description: 'Elasticsearch node {{ $labels.instance }} has an indexing rate of {{ $value | printf "%.2f" }} documents/sec over the last 5 minutes.'
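Each rule’s for: clause delays the alert until its expression has stayed true across consecutive evaluations for that long; a single false evaluation resets the timer. The sketch below is a simplified model of that pending-to-firing behaviour, not Prometheus code. The function name is hypothetical, and the one-minute spacing in the usage note is an assumption (the real spacing is the evaluation interval configured in prometheus.yml).

```python
def first_firing_time(samples, hold_seconds):
    """Return the timestamp at which an alert would start firing, or None.

    samples: (timestamp_seconds, condition_is_true) pairs, one per rule
    evaluation, in chronological order.
    """
    active_since = None
    for timestamp, is_true in samples:
        if not is_true:
            active_since = None  # condition cleared: the pending state resets
            continue
        if active_since is None:
            active_since = timestamp  # the alert enters the pending state
        if timestamp - active_since >= hold_seconds:
            return timestamp  # held long enough: the alert fires
    return None
```

With evaluations every 60 seconds and for: 5m (300 seconds), a condition that becomes true at t=0 and stays true fires at t=300. If one evaluation in between comes back false, the five minutes start over from the next true evaluation.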

Ensure your prometheus.yml includes the alert rules file:

rule_files:
  - "/etc/prometheus/alert.rules.yml"

And configure Alertmanager in the same prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-ip:9093'] # Your Alertmanager instance

Reload Prometheus again after adding alert rules and Alertmanager configuration.

Advanced Diagnostics: Tracing C Application Issues

When your C application experiences unexpected behavior, deep diagnostics are crucial. Beyond basic logging, tracing can provide invaluable insights into the execution flow and identify performance bottlenecks or deadlocks.

Using `strace` for System Call Analysis

strace is a powerful utility that intercepts and records the system calls made by a process and the signals it receives. This can help pinpoint issues related to file access, network operations, or memory management.

To trace a running C application:

sudo strace -p "$(pgrep -o -f my_c_app)" -s 1024 -o /tmp/my_c_app.strace.log

Explanation:

-p "$(pgrep -o -f my_c_app)": Attaches to the process ID (PID) of your C application. pgrep -f matches against the full command line, and -o selects only the oldest matching process so that strace receives a single PID.
-s 1024: Raises the maximum printed string length from the default of 32 to 1024 characters, so that string arguments to system calls are not truncated.
-o /tmp/my_c_app.strace.log: Writes the output to a specified file.

If the application is crashing, you can start it with strace:

sudo strace -f -o /tmp/my_c_app_startup.strace.log /opt/my-c-app/bin/my_c_app --config /etc/my-c-app/config.conf

-f traces child processes as well.
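strace logs grow quickly, so a first pass that counts failing calls often points to the problem faster than reading the file line by line. The sketch below assumes the default strace output format as written by -o; the function name is hypothetical. It ignores calls that strace splits across lines as unfinished and resumed, which can appear when tracing with -f.

```python
import re
from collections import Counter

# Matches e.g.:
#   1234 openat(AT_FDCWD, "/x", O_RDONLY) = -1 ENOENT (No such file or directory)
# The leading PID column is only present when strace ran with -f.
FAILED_CALL = re.compile(r"^(?:\d+\s+)?(\w+)\(.*\)\s+= -1 (E[A-Z0-9]+)")


def count_failed_syscalls(lines):
    """Count (syscall, errno) pairs for calls that returned -1."""
    counts = Counter()
    for line in lines:
        match = FAILED_CALL.match(line)
        if match:
            counts[match.groups()] += 1
    return counts
```

Passing an open file object for /tmp/my_c_app.strace.log and calling most_common(10) on the result lists the ten most frequent failures, for example repeated ENOENT results from openat on a missing configuration file.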

Using `perf` for Performance Profiling

For performance-related issues, the Linux perf tool is indispensable. It uses hardware performance counters and kernel tracepoints to provide detailed insights into CPU usage, cache misses, branch prediction, and more.

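To profile your C application for a specific duration:

sudo perf record -g -p "$(pgrep -o -f my_c_app)" -o /tmp/my_c_app.perf.data -- sleep 60
sudo perf report -i /tmp/my_c_app.perf.data

Explanation:

perf record -g -p "$(pgrep -o -f my_c_app)" -o /tmp/my_c_app.perf.data -- sleep 60: Records performance data from the running application for 60 seconds. -p attaches perf to the application’s PID; without it, perf would profile only the sleep command. sleep 60 simply bounds the recording time. -g enables call graph (stack trace) recording, which is crucial for understanding where time is spent; call graphs are most useful when the binary is built with frame pointers (-fno-omit-frame-pointer) or recorded with --call-graph dwarf.
perf report -i /tmp/my_c_app.perf.data: Analyzes the recorded data and presents it in an interactive TUI.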

For a live view of a running process instead of a recording:

sudo perf top -p "$(pgrep -o -f my_c_app)"

This will show a real-time view of the most active functions in your application.

Elasticsearch Cluster Health Checks and Diagnostics

Beyond Prometheus metrics, direct checks on the Elasticsearch cluster’s health are vital. This includes verifying cluster status, node health, and basic query functionality.

Cluster Health API

The Elasticsearch Cluster Health API provides a high-level overview of the cluster’s status. A healthy cluster should report a status of green.

curl -X GET "localhost:9200/_cluster/health?pretty"

Example output for a healthy cluster:

{
  "cluster_name" : "my-es-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 10,
  "active_shards" : 30,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

If the status is yellow, it means primary shards are allocated, but some replicas are not. If it’s red, it indicates that some primary shards are not allocated, meaning data might be unavailable.
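The status field maps naturally onto an exit code for cron jobs or external monitors. The following is a minimal sketch: the function names are hypothetical, and the 0/1/2/3 exit codes follow the common Nagios-style convention, which is an assumption here rather than anything Elasticsearch defines.

```python
import json
import urllib.request

# Nagios-style exit codes: 0 = OK, 1 = warning, 2 = critical.
EXIT_CODES = {"green": 0, "yellow": 1, "red": 2}


def classify_health(health):
    """Map a _cluster/health response dict to (exit_code, message)."""
    status = health.get("status")
    code = EXIT_CODES.get(status, 3)  # 3 = unknown or missing status
    message = "status={} nodes={} unassigned_shards={}".format(
        status, health.get("number_of_nodes"), health.get("unassigned_shards"))
    return code, message


def fetch_health(base_url="http://localhost:9200", timeout=5.0):
    """Fetch and decode the cluster health document."""
    url = base_url + "/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A check script would call classify_health(fetch_health()), print the message, and exit with the returned code. Treating yellow as a warning rather than a failure suits clusters where replicas are expected to be briefly unassigned during restarts.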

Node Info and Stats

To check individual node health and resource utilization:

# Get information about all nodes
curl -X GET "localhost:9200/_cat/nodes?v"

# Get stats for a specific node (replace node_name with the actual node name)
curl -X GET "localhost:9200/_nodes/node_name/stats/jvm,fs,indices?pretty"

The _cat/nodes endpoint provides a quick overview of node names, IPs, roles, load, and heap usage. The _nodes/stats endpoint offers detailed metrics for JVM, filesystem, and index statistics.
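The JSON returned by _nodes/stats can be checked against the same limits used for alerting. This sketch assumes the response has already been decoded into a dict and reads the jvm.mem.heap_used_percent and fs.total fields of each node. The function name and the default limits of 85% heap and 20% free disk are illustrative assumptions.

```python
def nodes_over_limits(stats, heap_limit=85, min_free_disk=20):
    """Return {node_name: [problems]} for nodes breaching either limit."""
    problems = {}
    for node in stats.get("nodes", {}).values():
        found = []
        heap = node["jvm"]["mem"]["heap_used_percent"]
        if heap > heap_limit:
            found.append("heap {}%".format(heap))
        total = node["fs"]["total"]
        free_pct = 100.0 * total["free_in_bytes"] / total["total_in_bytes"]
        if free_pct < min_free_disk:
            found.append("disk free {:.1f}%".format(free_pct))
        if found:
            problems[node["name"]] = found
    return problems
```

An empty result means every node is within both limits. Requesting only the jvm and fs metric groups keeps the response small when the indices statistics are not needed.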

Index Health and Shard Allocation

Unassigned shards are a common cause of cluster instability. The _cat/shards API can help diagnose this.

curl -X GET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason&s=state"

Look for shards in a state other than STARTED, and pay attention to the unassigned.reason column. Common reasons include disk space issues, allocation filtering, or cluster-wide throttling.
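On a large cluster it helps to group problem shards by reason rather than scan the table by eye. The sketch below assumes the same request with format=json appended, so that each row arrives as a dict keyed by column name; the function name is hypothetical.

```python
from collections import defaultdict


def unassigned_by_reason(rows):
    """Group non-STARTED shards by their unassigned.reason value."""
    grouped = defaultdict(list)
    for row in rows:
        if row.get("state") == "STARTED":
            continue
        # The reason is empty for shards that are merely relocating or initializing.
        reason = row.get("unassigned.reason") or "n/a"
        label = "{}[{}]{}".format(row["index"], row["shard"], row["prirep"])
        grouped[reason].append(label)
    return dict(grouped)
```

A result dominated by one reason, such as NODE_LEFT, usually points to a single underlying event rather than many independent failures.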

Querying for Slow Queries

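Slow queries can degrade cluster performance. The search slow log is disabled by default; Elasticsearch starts logging slow queries once you set thresholds on an index, such as index.search.slowlog.threshold.query.warn, and it writes them to log files on each node rather than to an index. If you ship those logs into an index, you can query them like any other data. For real-time analysis of a single query, use the Profile API.

# Example of querying shipped slow search logs (the index name and the took field depend on your ingestion setup)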
curl -X GET "localhost:9200/.logs-application-slowlog-2023.10.27/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1h/m",
        "lt": "now/m"
      }
    }
  },
  "sort": [
    { "took": { "order": "desc" } }
  ]
}
'

For more granular performance analysis of a specific query, use the Profile API by setting "profile": true in the request body:

curl -X GET "localhost:9200/your_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "profile": true,
  "query": {
    "match_all": {}
  }
}
'

The output of the profiler provides detailed timing information for each phase of the query execution, helping to identify bottlenecks within the search process itself.
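Profile output is deeply nested, with one tree of query components per search on each shard. The sketch below flattens those trees and ranks components by time_in_nanos, assuming the response has already been decoded into a dict; the function name is hypothetical. A parent component’s time includes the time of its children, so the totals should not be summed.

```python
def slowest_components(response, top=5):
    """Return the `top` slowest query components as (nanos, type, description)."""
    found = []

    def walk(component):
        found.append((component["time_in_nanos"],
                      component["type"],
                      component["description"]))
        for child in component.get("children", []):
            walk(child)

    for shard in response.get("profile", {}).get("shards", []):
        for search in shard.get("searches", []):
            for component in search.get("query", []):
                walk(component)
    return sorted(found, reverse=True)[:top]
```

Running this on a real response typically shows the outermost query first, followed by whichever clause accounts for most of its time.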