Server Monitoring Best Practices: Keeping Your C App and Elasticsearch Clusters Alive on Linode
Proactive C Application Health Checks with Systemd
Maintaining the stability of a C application, especially one serving critical functions, requires robust health monitoring. For applications managed by systemd, we can leverage its built-in capabilities for self-reporting and automated recovery. This involves defining a systemd service unit that includes a health check mechanism.
Consider a C application that exposes a simple HTTP endpoint for health checks. We can write a systemd service unit that probes this endpoint when the service starts and restarts the service if the probe fails. This is far more effective than merely checking that the process is running, because it verifies that the application actually responds.
Systemd Service Unit Configuration
Let’s define a systemd service file, typically located at /etc/systemd/system/my-c-app.service. This file will specify how to start, stop, and monitor our C application.
The key directives are ExecStartPost, which runs commands after the main process has been started, and WatchdogSec, which enables ongoing liveness checks. A health probe does not belong in ExecStartPre: those commands run before the application exists, so a probe placed there always fails and the service never starts. Note also that systemd does not support trailing comments on a directive line, so comments go on their own lines.
[Unit]
Description=My Critical C Application
After=network.target

[Service]
Type=simple
User=appuser
Group=appgroup
WorkingDirectory=/opt/my-c-app
ExecStart=/opt/my-c-app/bin/my_c_app --config /etc/my-c-app/config.conf
# Give the app a moment to fully initialize, then probe the health endpoint
ExecStartPost=/bin/sleep 5
ExecStartPost=/usr/bin/curl --fail --silent --head http://localhost:8080/health
Restart=on-failure
RestartSec=10
# The app must send a keep-alive at least every 30 seconds (remove if unsupported)
WatchdogSec=30
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
In this configuration:
ExecStartPost=/usr/bin/curl --fail --silent --head http://localhost:8080/health: This command requests the HTTP headers of the application’s health check endpoint once the process is running. The --fail option makes curl return a non-zero exit code if the HTTP status code is 4xx or 5xx. A failing ExecStartPost command marks the service start as failed, which in turn triggers the restart policy.
WatchdogSec=30: This directive tells systemd to consider the service failed if it doesn’t receive a “keep-alive” notification within 30 seconds. The application itself has to send these notifications, typically by calling sd_notify(0, "WATCHDOG=1") from libsystemd at regular intervals. If your C app doesn’t do this, remove WatchdogSec; otherwise systemd will terminate a healthy service every 30 seconds. In that case, rely on the ExecStartPost probe and Restart=on-failure.
Restart=on-failure and RestartSec=10: These ensure that if the application exits with a non-zero status code, is killed by the watchdog, or fails its startup probe, systemd will attempt to restart it after a 10-second delay.
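A fixed delay before probing is fragile when startup time varies. One alternative is to retry the probe a bounded number of times. The following is a minimal sketch of such a helper; the function name retry and its argument order are illustrative assumptions, not part of systemd or curl.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted,
# sleeping DELAY seconds between attempts.
# (Hypothetical helper; name and interface are assumptions.)
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Only sleep if another attempt will follow
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

Saved as an executable script that ends with a call such as retry 10 1 curl --fail --silent --head http://localhost:8080/health, it could serve as the startup probe in place of a fixed sleep. The probe command is passed as arguments, so the same helper works for any check that signals success through its exit code.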
After creating or modifying this file, reload the systemd daemon and start/enable the service:
sudo systemctl daemon-reload
sudo systemctl enable my-c-app.service
sudo systemctl start my-c-app.service
To check the status and logs:
sudo systemctl status my-c-app.service
sudo journalctl -u my-c-app.service -f
Elasticsearch Cluster Monitoring with Prometheus and Alertmanager
Monitoring an Elasticsearch cluster involves tracking not just the availability of nodes but also key performance indicators like JVM heap usage, disk I/O, query latency, and indexing rates. Prometheus is an excellent choice for collecting these metrics, and Alertmanager for handling notifications.
Prometheus Exporter Setup
Elasticsearch doesn’t expose metrics in a Prometheus-native format by default, so we need an exporter. The most common solution is elasticsearch_exporter, which is maintained by the prometheus-community project rather than by Elastic.
First, download and install the exporter. The version set below is only an example; check the project’s GitHub releases page for the current one. On a Debian/Ubuntu system:
# Example version; replace with a release listed on the project's releases page
VERSION=1.7.0
wget https://github.com/prometheus-community/elasticsearch_exporter/releases/download/v${VERSION}/elasticsearch_exporter-${VERSION}.linux-amd64.tar.gz
tar xvfz elasticsearch_exporter-${VERSION}.linux-amd64.tar.gz
sudo mv elasticsearch_exporter-${VERSION}.linux-amd64/elasticsearch_exporter /usr/local/bin/
sudo rm -rf elasticsearch_exporter-${VERSION}.linux-amd64*
Next, create a systemd service for the exporter, for example at /etc/systemd/system/elasticsearch_exporter.service (the unit name used by the commands below). The exporter queries your Elasticsearch cluster and exposes the results on an HTTP endpoint for Prometheus to scrape.
[Unit]
Description=Prometheus Elasticsearch Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/elasticsearch_exporter \
    --es.uri=http://localhost:9200 \
    --es.timeout=5s \
    --web.listen-address=":9114" \
    --log.level=info

[Install]
WantedBy=multi-user.target
Ensure the prometheus user and group exist. If not, create them:
sudo groupadd --system prometheus
sudo useradd --system --no-create-home --gid prometheus prometheus
sudo mkdir -p /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
Now, enable and start the exporter service:
sudo systemctl daemon-reload
sudo systemctl enable elasticsearch_exporter.service
sudo systemctl start elasticsearch_exporter.service
sudo systemctl status elasticsearch_exporter.service
Prometheus Configuration for Elasticsearch
Add the Elasticsearch exporter as a scrape target in your Prometheus configuration (e.g., /etc/prometheus/prometheus.yml). If you have multiple Elasticsearch nodes, configure Prometheus to scrape each exporter instance. Keep all exporters under a single job so that alerting rules selecting job="elasticsearch" cover every node.
scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['localhost:9114']  # Assuming exporter is on the same host as Prometheus
        labels:
          cluster: 'my-es-cluster-1'
          instance: 'es-node-1'
      - targets: ['es-node-2-ip:9114']  # If exporter is on a different node
        labels:
          cluster: 'my-es-cluster-1'
          instance: 'es-node-2'
Reload Prometheus configuration:
sudo systemctl reload prometheus
Alerting Rules for Elasticsearch
Define alerting rules in Prometheus to notify Alertmanager of potential issues. These rules should be placed in a separate file, e.g., /etc/prometheus/alert.rules.yml, and referenced in prometheus.yml.
groups:
  - name: elasticsearch_alerts
    rules:
      - alert: ElasticsearchDown
        expr: up{job="elasticsearch"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch node {{ $labels.instance }} is down."
          description: "Prometheus failed to scrape the exporter for Elasticsearch node {{ $labels.instance }} for 5 minutes."
      - alert: ElasticsearchHighHeapUsage
        # area="heap" restricts the ratio to heap memory; non-heap has no meaningful maximum
        expr: elasticsearch_jvm_memory_used_bytes{job="elasticsearch", area="heap"} / elasticsearch_jvm_memory_max_bytes{job="elasticsearch", area="heap"} * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch node {{ $labels.instance }} has high heap usage."
          description: 'Elasticsearch node {{ $labels.instance }} is using {{ $value | printf "%.2f" }}% of its JVM heap.'
      - alert: ElasticsearchDiskSpaceLow
        expr: elasticsearch_filesystem_data_available_bytes{job="elasticsearch"} / elasticsearch_filesystem_data_size_bytes{job="elasticsearch"} * 100 < 20
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch node {{ $labels.instance }} has low disk space."
          description: 'Elasticsearch node {{ $labels.instance }} has only {{ $value | printf "%.2f" }}% free disk space remaining.'
      - alert: ElasticsearchIndexingRateLow
        expr: rate(elasticsearch_indices_indexing_index_total{job="elasticsearch"}[5m]) < 10
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Elasticsearch indexing rate is low."
          description: 'Elasticsearch node {{ $labels.instance }} has an indexing rate of {{ $value | printf "%.2f" }} documents/sec over the last 5 minutes.'
Ensure your prometheus.yml includes the alert rules file:
rule_files:
  - "/etc/prometheus/alert.rules.yml"
And configure Alertmanager in the same prometheus.yml:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-ip:9093']  # Your Alertmanager instance
Reload Prometheus again after adding alert rules and Alertmanager configuration.
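Before reloading, it helps to validate both files, since a syntax error in either one prevents Prometheus from applying the new configuration. The sketch below wraps the promtool check config and promtool check rules commands that ship with Prometheus. The function name validate_prometheus and the default paths are assumptions for illustration, and promtool is assumed to be on the PATH.

```shell
#!/bin/sh
# Validate the main configuration and the alert rules file.
# Returns non-zero if either check fails.
# (Hypothetical helper; defaults mirror the paths used in this article.)
validate_prometheus() {
    config=${1:-/etc/prometheus/prometheus.yml}
    rules=${2:-/etc/prometheus/alert.rules.yml}
    promtool check config "$config" && promtool check rules "$rules"
}
```

A typical use would be validate_prometheus && sudo systemctl reload prometheus, so the reload only happens when both checks pass.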
Advanced Diagnostics: Tracing C Application Issues
When your C application experiences unexpected behavior, deep diagnostics are crucial. Beyond basic logging, tracing can provide invaluable insights into the execution flow and identify performance bottlenecks or deadlocks.
Using `strace` for System Call Analysis
strace is a powerful utility that intercepts and records the system calls made by a process and the signals it receives. This can help pinpoint issues related to file access, network operations, or memory management.
To trace a running C application:
sudo strace -p $(pgrep -f my_c_app) -s 1024 -o /tmp/my_c_app.strace.log
Explanation:
-p $(pgrep -f my_c_app): Attaches to the process ID (PID) of your C application. pgrep -f finds the PID by matching against the full command line; if several processes match, add -o to select only the oldest one.
-s 1024: Sets the maximum printed string length to 1024 characters (the default is 32), so that string arguments to system calls are not truncated early.
-o /tmp/my_c_app.strace.log: Writes the output to the specified file.
If the application is crashing, you can start it with strace:
sudo strace -f -o /tmp/my_c_app_startup.strace.log /opt/my-c-app/bin/my_c_app --config /etc/my-c-app/config.conf
-f traces child processes as well.
Using `perf` for Performance Profiling
For performance-related issues, the Linux perf tool is indispensable. It uses hardware performance counters and kernel tracepoints to provide detailed insights into CPU usage, cache misses, branch prediction, and more.
To profile your C application for a specific duration:
sudo perf record -g -p $(pgrep -f my_c_app) -o /tmp/my_c_app.perf.data -- sleep 60
sudo perf report -i /tmp/my_c_app.perf.data
Explanation:
perf record -g -p $(pgrep -f my_c_app) -o /tmp/my_c_app.perf.data -- sleep 60: Samples the application’s process for as long as the sleep 60 command runs, i.e. 60 seconds. Without -p (or -a for system-wide sampling), perf would profile only the sleep command itself. -g enables call graph (stack trace) recording, which is crucial for understanding where time is spent.
perf report -i /tmp/my_c_app.perf.data: Analyzes the recorded data and presents it in an interactive TUI.
For a live view of a running process instead of a recording:
sudo perf top -p $(pgrep -f my_c_app)
This will show a real-time view of the most active functions in your application.
Elasticsearch Cluster Health Checks and Diagnostics
Beyond Prometheus metrics, direct checks on the Elasticsearch cluster’s health are vital. This includes verifying cluster status, node health, and basic query functionality.
Cluster Health API
The Elasticsearch Cluster Health API provides a high-level overview of the cluster’s status. A healthy cluster should report a status of green.
curl -X GET "localhost:9200/_cluster/health?pretty"
Example output for a healthy cluster:
{
  "cluster_name" : "my-es-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 10,
  "active_shards" : 30,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
If the status is yellow, it means primary shards are allocated, but some replicas are not. If it’s red, it indicates that some primary shards are not allocated, meaning data might be unavailable.
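The three status values map naturally onto exit codes for use in cron jobs or external check scripts. The following sketch reads a Cluster Health response on standard input and converts the status field into an exit code. The function name cluster_status_code and the choice of codes (0, 1, 2, 3) are assumptions for illustration, and the sed-based extraction is a simplification rather than a full JSON parser.

```shell
#!/bin/sh
# Read a _cluster/health JSON response from stdin and return:
#   0 = green, 1 = yellow, 2 = red, 3 = status missing or unrecognized.
# (Hypothetical helper; the exit code mapping is an assumption.)
cluster_status_code() {
    status=$(sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1)
    case "$status" in
        green)  return 0 ;;
        yellow) return 1 ;;
        red)    return 2 ;;
        *)      return 3 ;;
    esac
}
```

It could be combined with the request above as curl --silent "localhost:9200/_cluster/health" | cluster_status_code. An empty response, for instance when the node is unreachable, falls through to the final case and yields 3.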
Node Info and Stats
To check individual node health and resource utilization:
# Get information about all nodes
curl -X GET "localhost:9200/_cat/nodes?v"
# Get stats for a specific node (replace node_name with the actual node name)
curl -X GET "localhost:9200/_nodes/node_name/stats/jvm,fs,indices?pretty"
The _cat/nodes endpoint provides a quick overview of node names, IPs, roles, load, and heap usage. The _nodes/stats endpoint offers detailed metrics for JVM, filesystem, and index statistics.
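For a quick manual check, the _cat/nodes output can be filtered for nodes whose heap usage exceeds a threshold. The sketch below assumes the request is limited to two columns with h=name,heap.percent and that node names contain no spaces; the function name nodes_over_heap is hypothetical.

```shell
#!/bin/sh
# nodes_over_heap LIMIT
# Reads "name heap.percent" lines from stdin and prints the nodes
# whose heap usage is strictly above LIMIT percent.
# (Hypothetical helper; assumes node names without spaces.)
nodes_over_heap() {
    awk -v limit="$1" 'NF >= 2 && $2 + 0 > limit + 0 { print $1, $2 }'
}
```

It could be used as curl --silent "localhost:9200/_cat/nodes?h=name,heap.percent" | nodes_over_heap 85, which prints nothing when every node is below the limit.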
Index Health and Shard Allocation
Unassigned shards are a common cause of cluster instability. The _cat/shards API can help diagnose this.
curl -X GET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason&s=state"
Look for shards in a state other than STARTED, and pay attention to the unassigned.reason column. Common reasons include disk space issues, allocation filtering, or cluster-wide throttling.
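On clusters with many shards, scanning this output by eye is tedious. The sketch below keeps only the rows whose state is not STARTED, assuming the column order requested above (index, shard, prirep, state, unassigned.reason) and no header row. The function name non_started_shards is hypothetical.

```shell
#!/bin/sh
# Reads _cat/shards rows (index shard prirep state [unassigned.reason])
# from stdin and prints those not in the STARTED state.
# A missing reason is shown as "-".
# (Hypothetical helper; relies on the column order given in the request.)
non_started_shards() {
    awk 'NF >= 4 && $4 != "STARTED" { print $1, $2, $3, $4, ($5 == "" ? "-" : $5) }'
}
```

It could be fed from the request above, for example curl --silent "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | non_started_shards. Empty output means every shard is started.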
Querying for Slow Queries
Slow queries can degrade cluster performance. Elasticsearch records them in its slow log, but only once thresholds such as index.search.slowlog.threshold.query.warn are configured for an index; the slow log is disabled by default. If you ship those logs into an index, you can query them there, or you can use the Profile API to analyze a specific query.
# Example query against an index that holds shipped slow logs
# (the index name and field names depend on your log pipeline)
curl -X GET "localhost:9200/.logs-application-slowlog-2023.10.27/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1h/m",
        "lt": "now/m"
      }
    }
  },
  "sort": [
    { "took": { "order": "desc" } }
  ]
}
'
For more granular performance analysis of a specific query, use the Profile API by setting "profile": true in the request body:
curl -X GET "localhost:9200/your_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "profile": true,
  "query": {
    "match_all": {}
  }
}
'
The output of the profiler provides detailed timing information for each phase of the query execution, helping to identify bottlenecks within the search process itself.