Server Monitoring Best Practices: Keeping Your C App and Elasticsearch Clusters Alive on DigitalOcean

Proactive C Application Health Checks with Systemd

For a C application running on DigitalOcean, robust health checking matters. External HTTP probes alone can miss critical internal states, and systemd’s built-in service management offers a low-level mechanism to complement them. We’ll configure systemd to supervise our C application’s process and restart it if it crashes, then add checks that catch an application that is still running but unresponsive.

Assume your C application is compiled and installed at /opt/myapp/myapp_server. We’ll create a systemd service file to manage it.

Creating the Systemd Service Unit

Create a new service file, for example, /etc/systemd/system/myapp.service:

[Unit]
Description=My C Application Server
After=network.target

[Service]
Type=simple
User=myappuser
Group=myappgroup
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/myapp_server --config /opt/myapp/myapp.conf
Restart=on-failure
RestartSec=5
KillMode=process
StandardOutput=journal
StandardError=journal
SyslogIdentifier=myapp

[Install]
WantedBy=multi-user.target

Let’s break down the key directives:

Description: A human-readable description of the service.
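After=network.target: Orders the service after the network management stack has been started. It does not guarantee that interfaces are configured; if the application needs a routable address at startup, use After=network-online.target together with Wants=network-online.target.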
Type=simple: The default type, suitable for most applications that don’t fork.
User and Group: Run the application as a non-privileged user for security. Ensure this user and group exist.
WorkingDirectory: Sets the current directory for the application.
ExecStart: The command to execute to start your application. Include any necessary arguments.
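Restart=on-failure: This is crucial. Systemd will automatically restart the service when the process exits with a non-zero status code, is killed by a signal other than SIGHUP, SIGINT, SIGTERM, or SIGPIPE, or hits an operation timeout or watchdog timeout. Other options include always, on-success, on-abnormal, on-abort, and on-watchdog.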
RestartSec=5: Wait 5 seconds before attempting a restart. This prevents rapid restart loops if the application fails immediately upon startup.
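KillMode=process: When stopping the service, only the main process is signalled; any child processes are left running. The systemd documentation discourages this setting for that reason, so prefer the default (control-group) unless your application specifically needs it.
StandardOutput and StandardError: Where stdout and stderr go. The journal value sends them to the systemd journal, which is vital for debugging; the older syslog value is deprecated on current systemd releases and is treated the same way.
SyslogIdentifier: A tag to identify log messages from this service in the logs.
WantedBy=multi-user.target: Once the service is enabled, starts it when the system reaches multi-user.target, the systemd equivalent of the traditional multi-user runlevel.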
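The Restart= options differ in which exit causes they react to. As a rough aid, the table from the systemd.service manual page can be encoded as a small lookup. This is only a Python sketch: the function name should_restart and the cause labels such as "unclean_exit" are made up for illustration and are not systemd identifiers.

```python
# Which Restart= settings react to which exit cause, following the table
# in systemd.service(5). The cause names are illustrative labels only.
RESTART_TABLE = {
    "clean_exit":     {"always", "on-success"},
    "unclean_exit":   {"always", "on-failure"},
    "unclean_signal": {"always", "on-failure", "on-abnormal", "on-abort"},
    "timeout":        {"always", "on-failure", "on-abnormal"},
    "watchdog":       {"always", "on-failure", "on-abnormal", "on-watchdog"},
}


def should_restart(policy: str, cause: str) -> bool:
    """Return True if a unit with Restart=<policy> is restarted after <cause>."""
    return policy in RESTART_TABLE[cause]


if __name__ == "__main__":
    # Show what Restart=on-failure covers.
    for cause in RESTART_TABLE:
        print(f"{cause:15} on-failure -> {should_restart('on-failure', cause)}")
```

The lookup makes one point visible: on-failure does not restart a process that exits cleanly with status 0, so an application that "gives up" with exit code 0 stays down.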

Enabling and Managing the Service

After creating the service file, reload the systemd daemon:

sudo systemctl daemon-reload

Enable the service to start on boot:

sudo systemctl enable myapp.service

Start the service:

sudo systemctl start myapp.service

Check its status:

sudo systemctl status myapp.service

To view logs, use journalctl:

sudo journalctl -u myapp.service -f

Advanced Health Checks: Liveness and Readiness Probes

While Restart=on-failure handles crashes, it doesn’t detect a “hung” application that’s still running but not processing requests. For this, we can leverage systemd’s ExecStartPost and a simple health check script, or integrate with an external monitoring tool. A common pattern is to have your C application expose a simple HTTP endpoint (e.g., /healthz) that returns 200 OK if healthy.

Let’s assume your C app listens on port 8080 and has a /healthz endpoint.
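Before the C endpoint exists, it can help to test the probes below against a stand-in. The following is a minimal Python stub, not the application's real handler: the class name HealthHandler and the helper start_stub are hypothetical, and the response body "ok" is an arbitrary choice.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz with 200 and every other path with 404."""

    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok\n"
            self.send_response(200)
        else:
            body = b"not found\n"
            self.send_response(404)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the stub quiet


def start_stub(port: int = 0) -> HTTPServer:
    """Start the stub on 127.0.0.1 in a background thread.

    Port 0 lets the OS pick a free port; pass 8080 to mimic the assumed app port.
    """
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_stub()
    url = f"http://127.0.0.1:{server.server_port}/healthz"
    with urllib.request.urlopen(url, timeout=5) as resp:
        print(resp.status)
    server.shutdown()
```

A real /healthz handler in the C application should check something meaningful (for example, that its worker loop has made progress recently) rather than returning 200 unconditionally as this stub does.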

Option 1: Systemd’s `ExecStartPost` with `curl`

This is a basic check that runs *after* the main process starts. It’s not a continuous probe but verifies initial startup health.

[Service]
# ... other directives ...
ExecStartPost=/usr/bin/curl --fail --silent --show-error --retry 5 --retry-delay 1 --retry-connrefused http://localhost:8080/healthz
Restart=on-failure
RestartSec=5
# ... rest of the service file ...

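The --fail option makes curl exit with a non-zero code when the server returns an HTTP error (status 400 or above); a failed connection produces a non-zero exit code regardless. With Type=simple, ExecStartPost runs as soon as the main process has been started, so the application may not be listening yet. Retry options such as --retry and --retry-connrefused help avoid a false failure in that window. If the command still fails, the unit start is treated as failed and Restart=on-failure applies.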
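If curl is not available, or the startup check needs more logic than a one-liner, the same "retry until healthy, then give up" idea can be sketched in Python with only the standard library. The function names http_ok and wait_until_healthy are hypothetical, and the attempt count and delay are arbitrary defaults.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True only for an HTTP 200 response; any error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until_healthy(probe, attempts: int = 10, delay: float = 1.0,
                       sleep=time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

A script used from ExecStartPost could end with sys.exit(0 if wait_until_healthy(lambda: http_ok("http://localhost:8080/healthz")) else 1), so that systemd sees a non-zero exit code only after all attempts have failed.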

Option 2: Dedicated Health Check Daemon (e.g., a custom script)

For more sophisticated, continuous health checking, consider a separate process. A simple approach is a Python script that periodically polls the health endpoint and asks systemd to restart the application after repeated failures. Simply exiting the checker would only cause systemd to restart the checker, not the application, so the script calls systemctl restart myapp.service itself. That call needs authorization for the user the checker runs as (for example a polkit rule or a narrowly scoped sudoers entry). This script could be managed by another systemd service.

Example Python health checker script (/opt/myapp/health_checker.py):

import socket
import subprocess
import sys
import time

import requests

HEALTH_URL = "http://localhost:8080/healthz"
APP_UNIT = "myapp.service"
CHECK_INTERVAL = 10  # seconds
TIMEOUT = 5  # seconds
MAX_FAILURES = 3

def restart_app():
    # Exiting this script would only restart the checker itself, so ask
    # systemd to restart the application unit. The checker's user must be
    # authorized to do this (e.g. via a polkit rule or sudoers entry).
    result = subprocess.run(["systemctl", "restart", APP_UNIT])
    if result.returncode != 0:
        print(f"systemctl restart {APP_UNIT} failed with code {result.returncode}")

def check_health():
    failures = 0
    while True:
        try:
            response = requests.get(HEALTH_URL, timeout=TIMEOUT)
            if response.status_code == 200:
                if failures > 0:
                    print(f"Health check OK. Recovered after {failures} failures.")
                    failures = 0
                else:
                    print("Health check OK.")
            else:
                print(f"Health check failed: HTTP {response.status_code}")
                failures += 1
        except requests.exceptions.RequestException as e:
            print(f"Health check failed: {e}")
            failures += 1

        if failures >= MAX_FAILURES:
            print(f"Exceeded maximum failures ({MAX_FAILURES}). Restarting {APP_UNIT}.")
            restart_app()
            failures = 0  # give the restarted application a fresh budget

        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    # Confirm the application port accepts connections before polling.
    # This is a basic check; a more robust solution might use PID files.
    try:
        with socket.create_connection(("localhost", 8080), TIMEOUT):
            pass
    except OSError as e:
        # Exit non-zero so systemd retries the checker after RestartSec.
        print(f"Application port not listening: {e}. Retrying after the restart delay.")
        sys.exit(1)
    print("Application port is listening. Starting health checks.")
    check_health()

And its systemd service file (/etc/systemd/system/myapp-healthcheck.service):

[Unit]
Description=My C Application Health Checker
After=myapp.service

[Service]
Type=simple
User=myappuser
Group=myappgroup
ExecStart=/usr/bin/python3 -u /opt/myapp/health_checker.py
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=myapp-healthcheck

[Install]
WantedBy=myapp.service

Make sure to set appropriate permissions for the Python script and enable/start this new service.

Monitoring Elasticsearch Clusters on DigitalOcean

Elasticsearch clusters require a different monitoring approach, focusing on cluster health, node status, indexing performance, and resource utilization. DigitalOcean’s built-in Droplet graphs cover basic resource metrics, but for deeper insights and custom alerting, we’ll use Prometheus and Grafana.

Setting up Prometheus Node Exporter

First, deploy the Prometheus Node Exporter on each DigitalOcean Droplet hosting an Elasticsearch node. This provides system-level metrics.

Download a release (v1.7.0 is shown here; check the project’s releases page for the current version):

wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

Create a systemd service for Node Exporter (/etc/systemd/system/node_exporter.service):

[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter --collector.filesystem --collector.cpu --collector.meminfo --collector.netdev --collector.diskstats

[Install]
WantedBy=multi-user.target

Enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Setting up Elasticsearch Exporter

To get Elasticsearch-specific metrics, we’ll use a community-maintained exporter. A popular choice is prometheus-community/elasticsearch_exporter.

You can run this as a Docker container or a standalone binary. Here’s an example using Docker:

docker run -d \
  --name elasticsearch-exporter \
  -p 9114:9114 \
  quay.io/prometheuscommunity/elasticsearch-exporter:latest \
  --es.uri=http://YOUR_ELASTICSEARCH_HOST:9200

Replace YOUR_ELASTICSEARCH_HOST with the actual hostname or IP of your Elasticsearch instance. If you connect to a managed or otherwise secured cluster, use the connection details it provides: an https:// URI and, where basic authentication is required, credentials in the form https://user:password@host:9200.

Configuring Prometheus

Your Prometheus server needs to scrape these exporters. Edit your prometheus.yml configuration:

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape Node Exporter on each Elasticsearch node
  - job_name: 'elasticsearch_nodes'
    static_configs:
      - targets:
          - 'es-node-1.yourdomain.com:9100'
          - 'es-node-2.yourdomain.com:9100'
          - 'es-node-3.yourdomain.com:9100'
    # Optional: Add relabeling if you need to add metadata like cluster name
    # relabel_configs:
    #   - source_labels: [__address__]
    #     target_label: instance
    #     regex: '([^:]+):.*'
    #     replacement: '$1'

  # Scrape Elasticsearch Exporter
  - job_name: 'elasticsearch_exporter'
    static_configs:
      - targets:
          - 'es-node-1.yourdomain.com:9114'
          - 'es-node-2.yourdomain.com:9114'
          - 'es-node-3.yourdomain.com:9114'
    # Optional: Add relabeling for cluster name
    # relabel_configs:
    #   - source_labels: [__address__]
    #     target_label: instance
    #     regex: '([^:]+):.*'
    #     replacement: '$1'

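Reload the Prometheus configuration after changes, either by sending SIGHUP to the Prometheus process or, if Prometheus was started with --web.enable-lifecycle, by sending an HTTP POST to /-/reload.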
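After a reload it is worth confirming that every exporter is actually being scraped. Prometheus lists its targets at /api/v1/targets, and the response can be filtered for anything that is not "up". The sketch below assumes Prometheus listens on localhost:9090 as in the config above; the function names and SAMPLE_PAYLOAD are hypothetical, and the sample is hand-written rather than captured from a real server.

```python
import json
import urllib.request


def unhealthy_targets(payload: dict) -> list:
    """Return (job, scrapeUrl, lastError) for each active target whose health is not "up"."""
    result = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            result.append((
                target.get("labels", {}).get("job", ""),
                target.get("scrapeUrl", ""),
                target.get("lastError", ""),
            ))
    return result


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the target list from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Hand-written sample in the shape of the API response, for illustration only.
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "elasticsearch_nodes"},
                "scrapeUrl": "http://es-node-1.yourdomain.com:9100/metrics",
                "lastError": "",
                "health": "up",
            },
            {
                "labels": {"job": "elasticsearch_exporter"},
                "scrapeUrl": "http://es-node-2.yourdomain.com:9114/metrics",
                "lastError": "connection refused",
                "health": "down",
            },
        ]
    },
}

if __name__ == "__main__":
    for job, url, error in unhealthy_targets(SAMPLE_PAYLOAD):
        print(f"{job}: {url} is not up ({error})")
```

Against a live server, unhealthy_targets(fetch_targets()) returns an empty list when all targets are being scraped successfully.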

Setting up Grafana Dashboards

Import pre-built Grafana dashboards for Elasticsearch and Node Exporter. You can find excellent community dashboards on Grafana.com. Search for “Elasticsearch” and “Node Exporter”.

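Key metrics to monitor (names as exposed by the prometheus-community exporter and Node Exporter; confirm them against your exporters’ /metrics output):

Cluster Health (elasticsearch_cluster_health_status: one series per color label, green, yellow, and red; the series for the current state has the value 1 and the others 0)
Node Count (elasticsearch_cluster_health_number_of_nodes, elasticsearch_cluster_health_number_of_data_nodes)
Indexing Rate (rate of elasticsearch_indices_indexing_index_total)
Search Rate (rate of elasticsearch_indices_search_query_total)
JVM Heap Usage (elasticsearch_jvm_memory_used_bytes, elasticsearch_jvm_memory_max_bytes, both with area="heap")
Disk Usage (node_filesystem_avail_bytes, node_filesystem_size_bytes)
Network Traffic (node_network_receive_bytes_total, node_network_transmit_bytes_total)
CPU Usage (node_cpu_seconds_total)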
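Because the cluster health metric encodes the state in a color label rather than in its value, it is easy to misread. The following Python sketch pulls the current color out of exporter output in the Prometheus text format. The function name cluster_color is hypothetical, the regular expressions handle only simple label values, and SAMPLE is hand-written rather than real exporter output.

```python
import re

# Matches sample lines such as: metric_name{label="value",...} 1
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def cluster_color(metrics_text: str):
    """Return the color whose elasticsearch_cluster_health_status sample is 1, or None."""
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != "elasticsearch_cluster_health_status":
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if float(match.group("value")) == 1.0:
            return labels.get("color")
    return None


# Hand-written sample for illustration; the cluster name "demo" is made up.
SAMPLE = """\
# TYPE elasticsearch_cluster_health_status gauge
elasticsearch_cluster_health_status{cluster="demo",color="green"} 0
elasticsearch_cluster_health_status{cluster="demo",color="red"} 0
elasticsearch_cluster_health_status{cluster="demo",color="yellow"} 1
"""

if __name__ == "__main__":
    print(cluster_color(SAMPLE))  # prints: yellow
```

The same shape explains why alert expressions should select a color label and compare against 1 instead of comparing the metric value against a numeric state code.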

Alerting with Prometheus Alertmanager

Configure Prometheus Alertmanager to send notifications based on critical alerts. Define alert rules in Prometheus’s rule files (e.g., rules.yml).

Example alert rule for Elasticsearch cluster status:

groups:
- name: elasticsearch_alerts
  rules:
  - alert: ElasticsearchClusterRed
    expr: elasticsearch_cluster_health_status{job="elasticsearch_exporter", color="red"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster is RED!"
      description: "The Elasticsearch cluster {{ $labels.cluster }} has been in a RED state for more than 5 minutes."

  - alert: ElasticsearchNodeMissing
    # 3 is the node count assumed from the scrape config; adjust it to your cluster.
    expr: elasticsearch_cluster_health_number_of_nodes{job="elasticsearch_exporter"} < 3
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch cluster has fewer nodes than expected."
      description: "Cluster {{ $labels.cluster }} has reported {{ $value }} nodes for 10 minutes; 3 are expected."

  - alert: HighJVMPoolUsage
    expr: (elasticsearch_jvm_memory_used_bytes{job="elasticsearch_exporter", area="heap"} / elasticsearch_jvm_memory_max_bytes{job="elasticsearch_exporter", area="heap"}) * 100 > 85
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High JVM Heap Usage on Elasticsearch node."
      description: 'Elasticsearch node {{ $labels.instance }} has {{ $value | printf "%.2f" }}% JVM heap usage.'

Ensure your Alertmanager is configured with receivers for Slack, PagerDuty, or email, and that Prometheus is configured to send alerts to Alertmanager.
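The for: clause in each rule is what keeps a brief blip from paging anyone: the expression must stay true across consecutive evaluations for the whole duration before the alert fires. The Python sketch below is a simplified model of that state machine, not Prometheus code. The function name alert_states is hypothetical, and the model ignores details such as keep_firing_for and missed evaluations.

```python
def alert_states(condition_samples, for_seconds, eval_interval):
    """Yield 'inactive', 'pending', or 'firing' for each rule evaluation.

    condition_samples: iterable of booleans, one per evaluation, telling
    whether the alert expression returned a result at that evaluation.
    """
    active_since = None  # evaluation time at which the condition became true
    for i, is_true in enumerate(condition_samples):
        now = i * eval_interval
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            yield "inactive"
            continue
        if active_since is None:
            active_since = now
        yield "firing" if now - active_since >= for_seconds else "pending"


if __name__ == "__main__":
    # Evaluations every 15 s with "for: 30s"; the condition drops out once.
    samples = [True, True, True, False, True]
    print(list(alert_states(samples, for_seconds=30, eval_interval=15)))
    # prints: ['pending', 'pending', 'firing', 'inactive', 'pending']
```

The reset on a single false evaluation is the practical takeaway: a flapping condition may never fire with a long for: value, so choose the duration with the scrape and evaluation intervals in mind.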

Conclusion

By combining systemd’s process management for your C application with Prometheus and Grafana for your Elasticsearch cluster, you establish a multi-layered monitoring strategy. This proactive approach supports high availability and faster issue detection on DigitalOcean, helping to reduce downtime and performance degradation.