Server Monitoring Best Practices: Keeping Your C App and Elasticsearch Clusters Alive on DigitalOcean
Proactive C Application Health Checks with Systemd
For a C application running on DigitalOcean, robust health checking is paramount. Relying solely on external HTTP probes can miss critical internal states. Systemd’s built-in service management offers a powerful, low-level mechanism for this. We’ll first configure systemd to supervise the C application’s process and restart it when it crashes, then add health checks that catch a process that is still running but no longer responding.
Assume your C application is compiled and installed at /opt/myapp/myapp_server. We’ll create a systemd service file to manage it.
Creating the Systemd Service Unit
Create a new service file, for example, /etc/systemd/system/myapp.service:
[Unit]
Description=My C Application Server
After=network.target

[Service]
Type=simple
User=myappuser
Group=myappgroup
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/myapp_server --config /opt/myapp/myapp.conf
Restart=on-failure
RestartSec=5
KillMode=process
StandardOutput=journal
StandardError=journal
SyslogIdentifier=myapp

[Install]
WantedBy=multi-user.target

Let’s break down the key directives:
- Description: A human-readable description of the service.
- After=network.target: Orders the service after the network stack has been brought up. This does not guarantee that an interface is fully configured; if the application needs a routable address at startup, use After=network-online.target together with Wants=network-online.target instead.
- Type=simple: The default type, suitable for applications that stay in the foreground and don’t fork.
- User and Group: Run the application as a non-privileged user for security. Ensure this user and group exist.
- WorkingDirectory: Sets the current directory for the application.
- ExecStart: The command to execute to start your application. Include any necessary arguments.
- Restart=on-failure: This is crucial. Systemd automatically restarts the service when the process exits with a non-zero status code, is terminated by a signal (for example after a segmentation fault), or runs into a timeout. Other options include always, on-success, and on-abnormal.
- RestartSec=5: Wait 5 seconds before attempting a restart. This prevents rapid restart loops if the application fails immediately upon startup.
- KillMode=process: When stopping the service, only send signals to the main process. Child processes are left running, so prefer the default (control-group) if the application spawns helpers.
- StandardOutput and StandardError: Send stdout and stderr to the journal, where journalctl can read them. This is vital for debugging. (journal is the default; the older syslog value is deprecated in recent systemd releases.)
- SyslogIdentifier: A tag to identify log messages from this service in the logs.
- WantedBy=multi-user.target: Ensures the service is started when the system reaches multi-user.target, the systemd equivalent of the multi-user runlevel.
Enabling and Managing the Service
After creating the service file, reload the systemd daemon:
sudo systemctl daemon-reload
Enable the service to start on boot:
sudo systemctl enable myapp.service
Start the service:
sudo systemctl start myapp.service
Check its status:
sudo systemctl status myapp.service
To view logs, use journalctl:
sudo journalctl -u myapp.service -f
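If you want to check the service state from a script rather than by reading the status output, the machine-readable `systemctl show` output is easier to parse. The following is a minimal sketch, not part of the original setup: the helper names are made up for illustration, and it assumes a systemd version recent enough to expose the NRestarts property.

```python
import subprocess


def parse_systemctl_show(output: str) -> dict[str, str]:
    """Parse the KEY=VALUE lines printed by `systemctl show`."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            props[key] = value
    return props


def service_state(unit: str = "myapp.service") -> dict[str, str]:
    """Return selected properties of a unit (needs systemd on the host)."""
    result = subprocess.run(
        ["systemctl", "show", unit,
         "--property=ActiveState,SubState,NRestarts"],
        capture_output=True, text=True, check=True,
    )
    return parse_systemctl_show(result.stdout)
```

Calling `service_state()` on the Droplet returns a dictionary such as ActiveState, SubState, and NRestarts; a steadily growing NRestarts value is a hint that the application is crash-looping.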
Advanced Health Checks: Liveness and Readiness Probes
While Restart=on-failure handles crashes, it doesn’t detect a “hung” application that’s still running but not processing requests. For this, we can leverage systemd’s ExecStartPost and a simple health check script, or integrate with an external monitoring tool. A common pattern is to have your C application expose a simple HTTP endpoint (e.g., /healthz) that returns 200 OK if healthy.
Let’s assume your C app listens on port 8080 and has a /healthz endpoint.
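To try out the probes without the real C application, a throwaway stand-in that serves the same endpoint can help. The sketch below is only a test double written in Python with the standard library; the handler and function names are hypothetical, and the real endpoint would of course live inside the C application.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers 200 on /healthz and 404 on every other path."""

    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok\n"
            self.send_response(200)
        else:
            body = b"not found\n"
            self.send_response(404)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def make_server(port: int = 8080) -> ThreadingHTTPServer:
    """Create (but do not start) a stand-in server bound to localhost."""
    return ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
```

Running `make_server().serve_forever()` in a terminal gives the health checks something to talk to on port 8080; stopping it simulates an unresponsive application.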
Option 1: Systemd’s ExecStartPost with curl
This is a basic check that runs *after* the main process starts. It’s not a continuous probe but verifies initial startup health.
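[Service]
# ... other directives ...
ExecStartPost=/usr/bin/curl --fail --silent --show-error --retry 5 --retry-delay 1 --retry-connrefused http://localhost:8080/healthz
Restart=on-failure
RestartSec=5
# ... rest of the service file ...

The --fail option makes curl exit with a non-zero code when the server answers with an HTTP status of 400 or above; a failed connection is an error regardless of that option. With Type=simple, systemd runs ExecStartPost as soon as the main process has been spawned, so the retry options give the application a few seconds to start listening. If the check still fails, systemd treats the start as failed and Restart=on-failure triggers a restart.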
Option 2: Dedicated Health Check Daemon (custom script)
For more sophisticated, continuous health checking, consider a separate process. A simple approach is a Python script that periodically polls the health endpoint and asks systemd to restart the application after several consecutive failures. (Simply exiting would only restart the checker itself, not the application.) This script can be managed by another systemd service.
Example Python health checker script (/opt/myapp/health_checker.py):
import socket
import subprocess
import sys
import time

import requests

HEALTH_URL = "http://localhost:8080/healthz"
APP_UNIT = "myapp.service"
CHECK_INTERVAL = 10  # seconds
TIMEOUT = 5  # seconds
MAX_FAILURES = 3


def restart_app():
    # Restarting a unit is a privileged operation: run the checker as root
    # or grant its user this right (for example with a polkit rule).
    result = subprocess.run(["systemctl", "restart", APP_UNIT])
    if result.returncode != 0:
        print(f"Restart of {APP_UNIT} failed. Exiting.")
        sys.exit(1)


def check_health():
    failures = 0
    while True:
        try:
            response = requests.get(HEALTH_URL, timeout=TIMEOUT)
            if response.status_code == 200:
                if failures > 0:
                    print(f"Health check OK. Recovered after {failures} failures.")
                    failures = 0
                else:
                    print("Health check OK.")
            else:
                print(f"Health check failed: HTTP {response.status_code}")
                failures += 1
        except requests.exceptions.RequestException as e:
            print(f"Health check failed: {e}")
            failures += 1
        if failures >= MAX_FAILURES:
            print(f"Exceeded maximum failures ({MAX_FAILURES}). Restarting {APP_UNIT}.")
            restart_app()
            failures = 0  # give the restarted application a fresh budget
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    # Flush each line so messages reach the journal without delay
    sys.stdout.reconfigure(line_buffering=True)
    # Ensure the main app is accepting connections before starting checks.
    # This is a basic check; a more robust solution might use PID files or socket checks.
    try:
        with socket.create_connection(("localhost", 8080), TIMEOUT):
            pass
    except OSError as e:
        print(f"Application port not listening: {e}. Exiting so systemd retries later.")
        sys.exit(1)
    print("Application port is listening. Starting health checks.")
    check_health()
And its systemd service file (/etc/systemd/system/myapp-healthcheck.service):
[Unit]
Description=My C Application Health Checker
After=myapp.service

[Service]
Type=simple
# No User=/Group= here: the checker runs as root so that it may restart
# myapp.service. To run it unprivileged, add User= and Group= and grant
# that user the right to manage the unit (for example with a polkit rule).
ExecStart=/usr/bin/python3 /opt/myapp/health_checker.py
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=myapp-healthcheck

[Install]
WantedBy=myapp.service
Make sure the script is readable by the user running it and that the requests package is installed for that interpreter, then reload systemd and enable and start this new service.
Monitoring Elasticsearch Clusters on DigitalOcean
Elasticsearch clusters require a different monitoring approach, focusing on cluster health, node status, indexing performance, and resource utilization. DigitalOcean’s built-in graphs cover basic resource usage, but for deeper insights and custom alerting, we’ll use Prometheus and Grafana.
Setting up Prometheus Node Exporter
First, deploy the Prometheus Node Exporter on each DigitalOcean Droplet hosting an Elasticsearch node. This provides system-level metrics.
Download the latest release:
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
Create a systemd service for Node Exporter (/etc/systemd/system/node_exporter.service):
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter --collector.filesystem --collector.cpu --collector.meminfo --collector.netdev --collector.diskstats

[Install]
WantedBy=multi-user.target
Enable and start it:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Setting up Elasticsearch Exporter
To get Elasticsearch-specific metrics, we’ll use a community-maintained exporter, since Elasticsearch does not expose metrics in the Prometheus format itself. A popular choice is prometheus-community/elasticsearch_exporter.
You can run this as a Docker container or a standalone binary. Here’s an example using Docker:
docker run -d \
  --name elasticsearch-exporter \
  -p 9114:9114 \
  quay.io/prometheuscommunity/elasticsearch-exporter:latest \
  --es.uri=http://YOUR_ELASTICSEARCH_HOST:9200

Replace YOUR_ELASTICSEARCH_HOST with the actual hostname or IP of your Elasticsearch instance. For a managed or otherwise secured cluster, use the connection string it provides and supply the credentials it requires. In production, pin a specific image tag instead of latest so that upgrades are deliberate.
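Before wiring the exporter into Prometheus, it can be worth confirming that it actually reports cluster health. The sketch below fetches the exporter’s /metrics page and extracts the current status. It rests on an assumption you should verify against your own exporter’s output: that the prometheus-community exporter publishes elasticsearch_cluster_health_status as one series per color label, with value 1 on the series matching the current state. The function names are made up for this example.

```python
import re
import urllib.request

# Matches lines such as:
# elasticsearch_cluster_health_status{cluster="es-prod",color="green"} 1
SAMPLE_RE = re.compile(
    r'^elasticsearch_cluster_health_status\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def cluster_colors(metrics_text: str) -> dict[str, str]:
    """Map each cluster name to the color whose series has value 1."""
    colors = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if float(match.group("value")) == 1:
            colors[labels.get("cluster", "")] = labels.get("color", "")
    return colors


def fetch_metrics(url: str = "http://localhost:9114/metrics",
                  timeout: float = 5.0) -> str:
    """Download the exporter's metrics page as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

On a host where the exporter is reachable, `cluster_colors(fetch_metrics())` returns a small dictionary of cluster name to color; an empty result suggests the exporter cannot reach Elasticsearch or uses different metric names.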
Configuring Prometheus
Your Prometheus server needs to scrape these exporters. Edit your prometheus.yml configuration:
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape Node Exporter on each Elasticsearch node
  - job_name: 'elasticsearch_nodes'
    static_configs:
      - targets:
          - 'es-node-1.yourdomain.com:9100'
          - 'es-node-2.yourdomain.com:9100'
          - 'es-node-3.yourdomain.com:9100'
    # Optional: Add relabeling if you need to add metadata like cluster name
    # relabel_configs:
    #   - source_labels: [__address__]
    #     target_label: instance
    #     regex: '([^:]+):.*'
    #     replacement: '$1'

  # Scrape Elasticsearch Exporter
  - job_name: 'elasticsearch_exporter'
    static_configs:
      - targets:
          - 'es-node-1.yourdomain.com:9114'
          - 'es-node-2.yourdomain.com:9114'
          - 'es-node-3.yourdomain.com:9114'
    # Optional: Add relabeling for cluster name
    # relabel_configs:
    #   - source_labels: [__address__]
    #     target_label: instance
    #     regex: '([^:]+):.*'
    #     replacement: '$1'
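Reload the Prometheus configuration after changes, for example by sending SIGHUP to the Prometheus process or, if it was started with --web.enable-lifecycle, by sending an HTTP POST request to its /-/reload endpoint.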
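After a reload, it helps to confirm that every target is actually being scraped. Prometheus exposes this through its HTTP API at /api/v1/targets. The following sketch lists targets that are not healthy; the function names and the default base URL are assumptions for illustration.

```python
import json
import urllib.request


def unhealthy_targets(targets_response: dict) -> list[tuple[str, str, str]]:
    """Return (job, instance, last_error) for active targets that are not up."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", ""),
            target.get("labels", {}).get("instance", ""),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]


def fetch_targets(base_url: str = "http://localhost:9090",
                  timeout: float = 5.0) -> dict:
    """Query the Prometheus targets API and decode the JSON response."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)
```

Running `unhealthy_targets(fetch_targets())` on the Prometheus host should return an empty list once all exporters are reachable; any entry shows the job, the instance, and the last scrape error.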
Setting up Grafana Dashboards
Import pre-built Grafana dashboards for Elasticsearch and Node Exporter. You can find excellent community dashboards on Grafana.com. Search for “Elasticsearch” and “Node Exporter”.
Key Elasticsearch metrics to monitor:
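- Cluster Health (elasticsearch_cluster_health_status, exposed as one series per color label with value 1 on the series matching the current status)
- Node Status (elasticsearch_cluster_health_number_of_nodes, plus the up series of each exporter target)
- Indexing Rate (rate of elasticsearch_indices_indexing_index_total)
- Search Rate (rate of elasticsearch_indices_search_query_total)
- JVM Heap Usage (elasticsearch_jvm_memory_used_bytes, elasticsearch_jvm_memory_max_bytes)
- Disk Usage (node_filesystem_avail_bytes, node_filesystem_size_bytes)
- Network Traffic (node_network_receive_bytes_total, node_network_transmit_bytes_total)
- CPU Usage (node_cpu_seconds_total)
The elasticsearch_* names above are those used by the prometheus-community exporter. Names can differ between exporters and versions, so confirm them against your exporter’s /metrics output.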
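Several of these metrics, node_cpu_seconds_total among them, are counters that only ever grow, so dashboards display their rate of change rather than the raw value. In Prometheus this is done with the rate() function; the Python below is only a simplified illustration of the arithmetic (it ignores the extrapolation Prometheus applies), with function names invented for the example.

```python
def counter_rate(v0: float, v1: float, t0: float, t1: float) -> float:
    """Per-second increase of a counter between two samples."""
    if t1 <= t0:
        raise ValueError("second sample must be newer than the first")
    delta = v1 - v0
    if delta < 0:
        # The counter was reset (e.g. the process restarted): count only
        # what has accumulated since the reset.
        delta = v1
    return delta / (t1 - t0)


def cpu_busy_percent(idle0: float, idle1: float, t0: float, t1: float,
                     cores: int) -> float:
    """CPU utilisation derived from idle seconds summed over all cores."""
    idle_rate = counter_rate(idle0, idle1, t0, t1)  # idle seconds per second
    return 100.0 * (1.0 - idle_rate / cores)
```

For example, on a 2-core Droplet whose idle counter (summed over both cores) grew by 90 seconds during a 60-second window, the cores were idle 75% of the time, so `cpu_busy_percent(1000.0, 1090.0, 0, 60, 2)` yields 25% utilisation.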
Alerting with Prometheus Alertmanager
Configure Prometheus Alertmanager to send notifications based on critical alerts. Define alert rules in Prometheus’s rule files (e.g., rules.yml).
Example alert rule for Elasticsearch cluster status:
# Metric names follow the prometheus-community elasticsearch_exporter;
# verify them against your exporter's /metrics output.
groups:
  - name: elasticsearch_alerts
    rules:
      - alert: ElasticsearchClusterRed
        # The exporter reports one series per color; the active one has value 1
        expr: elasticsearch_cluster_health_status{job="elasticsearch_exporter", color="red"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster is RED!"
          description: "The Elasticsearch cluster {{ $labels.cluster }} is in a RED state for more than 5 minutes."
      - alert: ElasticsearchNodeNotReady
        # 3 is the expected node count in this example; adjust it to your cluster
        expr: elasticsearch_cluster_health_number_of_nodes{job="elasticsearch_exporter"} < 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch cluster is missing nodes."
          description: "Cluster {{ $labels.cluster }} has reported fewer than 3 nodes for 10 minutes."
      - alert: HighJVMPoolUsage
        expr: (elasticsearch_jvm_memory_used_bytes{job="elasticsearch_exporter", area="heap"} / elasticsearch_jvm_memory_max_bytes{job="elasticsearch_exporter", area="heap"}) * 100 > 85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High JVM Heap Usage on Elasticsearch node."
          # Single quotes outside so the double quotes in printf stay valid YAML
          description: 'Elasticsearch node {{ $labels.instance }} has {{ $value | printf "%.2f" }}% JVM heap usage.'
Ensure your Alertmanager is configured with receivers for Slack, PagerDuty, or email, and that Prometheus is configured to send alerts to Alertmanager.
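To verify that the receivers really deliver notifications, you can push a synthetic alert straight into Alertmanager through its v2 HTTP API instead of waiting for a real incident. This is a sketch under assumptions: Alertmanager is taken to listen on localhost:9093, and the alert name, labels, and function names are invented for the test.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(name: str = "MonitoringTestAlert",
                     severity: str = "warning",
                     minutes: int = 5) -> list[dict]:
    """Build the JSON payload for one short-lived test alert."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": "Test alert to verify Alertmanager routing."},
        "startsAt": now.isoformat(),
        # The alert resolves by itself once endsAt has passed
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(alerts: list[dict],
                base_url: str = "http://localhost:9093",
                timeout: float = 5.0) -> int:
    """POST alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Calling `send_alerts(build_test_alert(severity="critical"))` should produce a notification on whichever receiver your routing tree assigns to critical alerts; if nothing arrives, check the route matchers before suspecting the receiver itself.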
Conclusion
By combining systemd’s process management for your C application with Prometheus and Grafana for your Elasticsearch cluster, you establish a layered monitoring strategy. This proactive approach helps you detect failures quickly on DigitalOcean and limits downtime and performance degradation.