Server Monitoring Best Practices: Keeping Your Ruby App and Elasticsearch Clusters Alive on DigitalOcean

Proactive Monitoring for Ruby on Rails and Elasticsearch on DigitalOcean

Maintaining high availability for critical services like Ruby on Rails applications and Elasticsearch clusters on DigitalOcean requires a robust, multi-layered monitoring strategy. This isn’t about setting up a few basic alerts; it’s about building a system that anticipates issues, provides deep diagnostic insights, and allows for rapid, informed remediation. We’ll cover essential metrics, configuration examples for common tools, and specific considerations for both Rails and Elasticsearch.

Core Metrics for Ruby on Rails Applications

A healthy Rails application is more than just “up” or “down.” We need to monitor resource utilization, application performance, and potential error conditions. Key metrics include:

CPU Usage: High CPU can indicate inefficient code, runaway processes, or insufficient resources.
Memory Usage: Leaks or excessive memory consumption can lead to swapping and performance degradation.
Request Latency: The time it takes for your application to respond to a request is a direct measure of user experience.
Error Rate: Tracking HTTP 5xx errors and application-level exceptions is crucial for identifying bugs.
Database Connection Pool: Exhausted connection pools are a common bottleneck.
Background Job Queues: For applications using Sidekiq, Resque, or similar, queue depth and processing times are vital.
Disk I/O: Especially important if your application performs heavy file operations or logging.
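To make the latency and error-rate metrics above concrete, here is a minimal sketch in plain Ruby of how a percentile and a 5xx error rate can be derived from raw samples. The sample data and the helper names (percentile, error_rate) are hypothetical and only illustrate why a p95 figure surfaces slow requests that a mean hides; a real setup would let Prometheus compute these.

```ruby
# Nearest-rank percentile: the smallest sample such that at least
# `pct` percent of all samples are less than or equal to it.
def percentile(samples, pct)
  raise ArgumentError, 'samples must not be empty' if samples.empty?

  sorted = samples.sort
  rank = (pct / 100.0 * sorted.length).ceil
  sorted[[rank, 1].max - 1]
end

# Fraction of responses that carry a 5xx status code.
def error_rate(status_codes)
  return 0.0 if status_codes.empty?

  status_codes.count { |code| code >= 500 }.fdiv(status_codes.length)
end

# Hypothetical samples: request durations in seconds and HTTP status codes.
latencies = [0.12, 0.09, 0.15, 0.11, 0.10, 0.13, 0.14, 0.08, 0.10, 2.40]
statuses  = [200, 200, 201, 500, 200, 404, 200, 503, 200, 200]

puts format('mean latency: %.3fs', latencies.sum / latencies.length)
puts format('p50 latency:  %.3fs', percentile(latencies, 50))
puts format('p95 latency:  %.3fs', percentile(latencies, 95))
puts format('error rate:   %.1f%%', error_rate(statuses) * 100)
```

In this sample a single slow request barely moves the median but dominates the p95, which is why the alerting rules later in this article look at a high percentile rather than an average.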

Setting Up Node Exporter for System Metrics

Prometheus is a de facto standard for metrics collection. Node Exporter is a simple way to expose hardware and OS metrics from your DigitalOcean Droplets. We’ll use it to gather CPU, memory, disk, and network statistics.

Installation and Configuration on a Debian/Ubuntu Droplet

First, download the latest release of Node Exporter. Replace X.Y.Z with the current version.

wget https://github.com/prometheus/node_exporter/releases/download/vX.Y.Z/node_exporter-X.Y.Z.linux-amd64.tar.gz
tar xvfz node_exporter-X.Y.Z.linux-amd64.tar.gz
sudo mv node_exporter-X.Y.Z.linux-amd64/node_exporter /usr/local/bin/

Next, create a systemd service file to manage Node Exporter. This ensures it starts on boot and can be controlled with systemctl.

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Save this content to /etc/systemd/system/node_exporter.service. Then, enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter

Node Exporter will now be listening on port 9100. You can verify by curling its metrics endpoint:

curl http://localhost:9100/metrics
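The endpoint returns metrics in the Prometheus text exposition format: one sample per line, with comment lines starting with #. As a rough illustration of what Prometheus reads from it, the sketch below parses a small hand-written excerpt and derives memory usage from it. The sample values are made up, the helper names are hypothetical, and the parser assumes lines carry no trailing timestamp.

```ruby
# Hand-written excerpt in the Prometheus text exposition format (values are made up).
SAMPLE_METRICS = <<~TEXT
  # HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
  # TYPE node_memory_MemAvailable_bytes gauge
  node_memory_MemAvailable_bytes 1.073741824e+09
  # HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
  # TYPE node_memory_MemTotal_bytes gauge
  node_memory_MemTotal_bytes 4.294967296e+09
  node_filesystem_avail_bytes{device="/dev/vda1",fstype="ext4",mountpoint="/"} 2.0e+10
TEXT

# Convert a sample value, including the special values the format allows.
def parse_value(raw)
  case raw
  when 'NaN'  then Float::NAN
  when '+Inf' then Float::INFINITY
  when '-Inf' then -Float::INFINITY
  else Float(raw)
  end
end

# Parse exposition lines into { "name{labels}" => Float }.
# Assumption: no trailing timestamps, so the value is the last field.
def parse_metrics(text)
  text.each_line.with_object({}) do |line, metrics|
    line = line.strip
    next if line.empty? || line.start_with?('#')

    series, _, value = line.rpartition(' ')
    metrics[series] = parse_value(value)
  end
end

# Same arithmetic as the memory alert rule later in this article.
def memory_used_percent(metrics)
  total = metrics.fetch('node_memory_MemTotal_bytes')
  available = metrics.fetch('node_memory_MemAvailable_bytes')
  (total - available) / total * 100
end

metrics = parse_metrics(SAMPLE_METRICS)
puts format('memory used: %.1f%%', memory_used_percent(metrics))
```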

Monitoring Rails Application Performance with Prometheus and an Exporter

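To get application-specific metrics from your Rails app, you’ll need to instrument it. The prometheus-client gem (loaded as prometheus/client, under the Prometheus::Client namespace) is a popular choice for Ruby. It lets you define custom metrics within your application code and ships Rack middleware to expose them.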

Integrating `prometheus-client` into your Rails App

Add the gem to your Gemfile:

gem 'prometheus-client'

Run bundle install. Then, configure the exporter in an initializer (e.g., config/initializers/prometheus.rb):

require 'prometheus/client'
require 'prometheus/middleware/exporter'

# Initialize the Prometheus client
# Use a dedicated registry for better organization if you have many metrics
$prometheus_registry = Prometheus::Client::Registry.new

# Example: A counter for total requests, labeled so alerts can filter by status
$request_counter = $prometheus_registry.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests processed',
  labels: [:controller, :action, :status]
)

# Example: A histogram for request duration
$request_duration_histogram = $prometheus_registry.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration in seconds',
  labels: [:controller, :action]
)

# Example: A gauge for active database connections (requires custom logic or a DB exporter)
# $db_connections_gauge = $prometheus_registry.gauge(:database_connections_active, docstring: 'Number of active database connections')

# Example: A counter for application-specific errors, labeled by exception class
$app_error_counter = $prometheus_registry.counter(
  :application_errors_total,
  docstring: 'Total application errors encountered',
  labels: [:type]
)

# Mount the exporter middleware to expose the registry at /metrics
# Note: multi-process servers (e.g., Puma in clustered mode) need a shared data store such as DirectFileStore
Rails.application.config.middleware.use Prometheus::Middleware::Exporter, registry: $prometheus_registry

Now, instrument your application code. For example, to increment the request counter and record duration:

# In app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  # around_action (not before_action) is required: only an around callback
  # receives a block, so only it can `yield` to the action and time it.
  around_action :record_request_metrics

  private

  def record_request_metrics
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status = 500 # Assumed status if the action raises before rendering
    yield
    status = response.status
  ensure
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
    labels = { controller: controller_name, action: action_name }

    # Every label declared in the initializer must be supplied here
    $request_counter.increment(labels: labels.merge(status: status.to_s))
    $request_duration_histogram.observe(duration, labels: labels)
  end

  # Example of incrementing an error counter; call this from your
  # rescue_from handlers or other error-handling code
  def handle_error(error)
    error_type = error.class.name
    $app_error_counter.increment(labels: { type: error_type })
    # ... other error handling logic
  end
end

Prometheus::Middleware::Exporter serves the registered metrics at the /metrics endpoint of your Rails application. Make sure Prometheus can reach this path through your web server (e.g., Nginx), and restrict it to your monitoring hosts rather than exposing it publicly.
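It helps to know what a histogram actually stores, because that determines how precise a p95 alert can be. The following is a simplified, self-contained sketch (not the gem's implementation) of cumulative buckets and of the linear interpolation that Prometheus's histogram_quantile performs. The class name SketchHistogram, the bucket bounds, and the observations are all hypothetical.

```ruby
# A toy histogram with cumulative buckets: each bucket counts every
# observation less than or equal to its upper bound ("le").
class SketchHistogram
  attr_reader :counts, :sum, :count

  def initialize(upper_bounds)
    @upper_bounds = upper_bounds.sort + [Float::INFINITY]
    @counts = @upper_bounds.to_h { |le| [le, 0] }
    @sum = 0.0
    @count = 0
  end

  def observe(value)
    @upper_bounds.each { |le| @counts[le] += 1 if value <= le }
    @sum += value
    @count += 1
  end

  # Estimate a quantile by interpolating linearly inside the bucket that
  # contains the target rank. Assumes the lowest bucket starts at 0.
  def quantile(q)
    raise ArgumentError, 'q must be in (0, 1]' unless q > 0 && q <= 1
    return Float::NAN if @count.zero?

    rank = q * @count
    lower_bound = 0.0
    lower_count = 0
    @upper_bounds.each do |le|
      bucket_count = @counts[le]
      if bucket_count >= rank
        # The +Inf bucket has no width, so fall back to the highest finite bound.
        return lower_bound if le == Float::INFINITY

        fraction = (rank - lower_count).fdiv(bucket_count - lower_count)
        return lower_bound + (le - lower_bound) * fraction
      end
      lower_bound = le
      lower_count = bucket_count
    end
  end
end

histogram = SketchHistogram.new([0.1, 0.5, 1.0, 2.0])
[0.05, 0.2, 0.3, 0.4, 0.7, 0.9, 1.5, 1.8, 0.08, 3.0].each { |d| histogram.observe(d) }

puts format('p50 estimate: %.2fs', histogram.quantile(0.5))
puts format('p95 estimate: %.2fs', histogram.quantile(0.95))
```

Note that the p95 estimate here cannot exceed the highest finite bucket bound, even though one observation took 3.0 seconds. When choosing buckets for a real histogram, it is worth making sure they extend past any latency threshold you intend to alert on.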

Elasticsearch Cluster Monitoring

Elasticsearch is notoriously resource-intensive and can be complex to troubleshoot when performance degrades. Monitoring its internal state is paramount.

Key Elasticsearch Metrics

Cluster Health: Status (green, yellow, red), number of nodes, shards.
Node Stats: CPU usage, JVM heap usage, disk usage, network traffic.
Indexing Rate: Documents indexed per second.
Search Rate: Searches per second.
Query Latency: Average and p95/p99 latency for search queries.
Shard Status: Number of unassigned shards, relocating shards.
JVM Memory Pressure: Indicates potential garbage collection issues.
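Disk Watermarks: Once a node crosses the low watermark, Elasticsearch stops allocating new shards to it; past the high watermark it relocates shards away; past the flood-stage watermark it marks every index with a shard on that node read-only.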
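As a rough sketch of how these signals can be evaluated, the Ruby below interprets a cluster health document and classifies disk usage against watermarks. The JSON is a hand-written sample shaped like a _cluster/health response, the helper names are hypothetical, and the thresholds assume Elasticsearch's default watermark percentages (85, 90, and 95); adjust them if your cluster overrides those settings.

```ruby
require 'json'

# Assumed defaults for cluster.routing.allocation.disk.watermark.* (percent used).
DEFAULT_WATERMARKS = { low: 85.0, high: 90.0, flood_stage: 95.0 }.freeze

# Hand-written sample shaped like a GET _cluster/health response.
SAMPLE_HEALTH = <<~BODY
  {
    "cluster_name": "example-cluster",
    "status": "yellow",
    "number_of_nodes": 3,
    "active_primary_shards": 12,
    "active_shards": 21,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 3
  }
BODY

# Turn a parsed health document into a list of human-readable findings.
def assess_cluster_health(health)
  issues = []
  issues << 'status is red: at least one primary shard is unassigned' if health['status'] == 'red'
  issues << 'status is yellow: at least one replica shard is unassigned' if health['status'] == 'yellow'
  issues << "#{health['unassigned_shards']} unassigned shard(s)" if health['unassigned_shards'].to_i.positive?
  issues << "#{health['relocating_shards']} relocating shard(s)" if health['relocating_shards'].to_i.positive?
  issues
end

# Classify disk usage (percent used) against the watermark thresholds.
def disk_watermark_state(used_percent, watermarks = DEFAULT_WATERMARKS)
  return :flood_stage if used_percent >= watermarks[:flood_stage]
  return :high if used_percent >= watermarks[:high]
  return :low if used_percent >= watermarks[:low]

  :ok
end

health = JSON.parse(SAMPLE_HEALTH)
assess_cluster_health(health).each { |issue| puts "health: #{issue}" }
puts "disk at 91.5% used: #{disk_watermark_state(91.5)}"
```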

Using Metricbeat for Elasticsearch Metrics

Metricbeat is part of the Elastic Stack and is excellent for collecting metrics from Elasticsearch itself. It can be installed on a dedicated monitoring node or on your application nodes.

Installation and Configuration

Download and install Metricbeat on your chosen node. The exact commands depend on your OS. For Debian/Ubuntu:

curl -L -O https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-X.Y.Z-amd64.deb
sudo dpkg -i metricbeat-X.Y.Z-amd64.deb

Edit the main configuration file, /etc/metricbeat/metricbeat.yml. You’ll need to configure the Elasticsearch output and enable the Elasticsearch module.

metricbeat.modules:
- module: elasticsearch
  period: 10s
  hosts: ["YOUR_ELASTICSEARCH_HOST:9200"] # Replace with your ES cluster endpoint
  # Optional: If your ES cluster requires authentication
  # username: "elastic"
  # password: "changeme"

# Configure the Elasticsearch output (where Metricbeat sends its own data)
output.elasticsearch:
  hosts: ["YOUR_ELASTICSEARCH_HOST:9200"] # Replace with your ES cluster endpoint
  # Optional: If your ES cluster requires authentication
  # username: "elastic"
  # password: "changeme"

# If you are using Kibana for visualization, configure it here
# setup.kibana:
#   host: "YOUR_KIBANA_HOST:5601"

# Note: Metricbeat has no Prometheus output. If you alert from Prometheus,
# run a dedicated Elasticsearch exporter (for example the community
# elasticsearch_exporter) and have Prometheus scrape that instead.

Enable and start the Metricbeat service:

sudo systemctl daemon-reload
sudo systemctl enable metricbeat
sudo systemctl start metricbeat
sudo systemctl status metricbeat

Metricbeat will now collect Elasticsearch metrics and send them to the configured Elasticsearch output, ideally a separate monitoring cluster so the data stays available when the monitored cluster is unhealthy. You’ll typically use Kibana to create dashboards for visualization.

Alerting Strategies with Prometheus Alertmanager

Collecting metrics is only half the battle; you need to be alerted when things go wrong. Prometheus Alertmanager is the standard companion for Prometheus.

Configuring Prometheus to Scrape Your Services

In your Prometheus configuration file (e.g., /etc/prometheus/prometheus.yml), add scrape configurations for your Rails app and Node Exporter:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['YOUR_RAILS_DROPLET_IP:9100', 'YOUR_ES_NODE_1_IP:9100', 'YOUR_ES_NODE_2_IP:9100'] # Add all nodes running node_exporter

  - job_name: 'rails_app'
    static_configs:
      - targets: ['YOUR_RAILS_DROPLET_IP:3000'] # Or the port your Rails app is exposed on

  - job_name: 'elasticsearch'
    # Metricbeat has no Prometheus output, so Prometheus needs a dedicated exporter.
    # This assumes the community elasticsearch_exporter on its default port 9114.
    static_configs:
      - targets: ['YOUR_ES_EXPORTER_HOST:9114']

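Reload your Prometheus configuration. The HTTP reload endpoint only works if Prometheus was started with --web.enable-lifecycle; otherwise send SIGHUP to the process or restart the service: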

curl -X POST http://localhost:9090/-/reload
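After a reload it is worth confirming that every target is actually being scraped. Prometheus reports this through its /api/v1/targets HTTP API; the sketch below parses a hand-written sample shaped like that response and lists targets that are not healthy. The helper name unhealthy_targets, the addresses, and the error message are hypothetical.

```ruby
require 'json'

# Hand-written sample shaped like a GET /api/v1/targets response.
SAMPLE_TARGETS = <<~BODY
  {
    "status": "success",
    "data": {
      "activeTargets": [
        {
          "labels": { "instance": "10.0.0.5:9100", "job": "node_exporter" },
          "scrapeUrl": "http://10.0.0.5:9100/metrics",
          "lastError": "",
          "health": "up"
        },
        {
          "labels": { "instance": "10.0.0.5:3000", "job": "rails_app" },
          "scrapeUrl": "http://10.0.0.5:3000/metrics",
          "lastError": "connection refused",
          "health": "down"
        }
      ]
    }
  }
BODY

# Return the active targets whose last scrape did not succeed.
def unhealthy_targets(response_body)
  payload = JSON.parse(response_body)
  raise "Prometheus API error: #{payload['error']}" unless payload['status'] == 'success'

  payload.dig('data', 'activeTargets')
         .reject { |target| target['health'] == 'up' }
         .map do |target|
           {
             job: target.dig('labels', 'job'),
             instance: target.dig('labels', 'instance'),
             error: target['lastError']
           }
         end
end

unhealthy_targets(SAMPLE_TARGETS).each do |target|
  puts "#{target[:job]} #{target[:instance]} is not up: #{target[:error]}"
end
```

Against a live server, the response body could be fetched with Net::HTTP.get(URI('http://localhost:9090/api/v1/targets')) and passed to the same function.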

Setting Up Prometheus Alerting Rules

Define alerting rules in a separate file (e.g., /etc/prometheus/alert.rules.yml) and reference it in prometheus.yml.

groups:
- name: rails_alerts
  rules:
  - alert: HighCpuUsageRails
    expr: avg by (instance) (rate(node_cpu_seconds_total{job="node_exporter", mode="idle", instance="YOUR_RAILS_DROPLET_IP:9100"}[5m])) < 0.2
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage on Rails Droplet {{ $labels.instance }}"
      description: "CPU usage has been above 80% for the last 5 minutes on {{ $labels.instance }}."

  - alert: HighMemoryUsageRails
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Memory Usage on Rails Droplet {{ $labels.instance }}"
      description: "Memory usage is over 90% for the last 5 minutes on {{ $labels.instance }}."

  - alert: HighRequestLatencyRails
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="rails_app"}[5m])) by (le, instance)) > 2.0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High P95 Request Latency on Rails App {{ $labels.instance }}"
      description: "P95 request latency for Rails app on {{ $labels.instance }} is above 2 seconds for 5 minutes."

  - alert: HighErrorRateRails
    expr: sum(rate(http_requests_total{job="rails_app", status=~"5.."}[5m])) by (instance) * 60 > 5
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High 5xx Error Rate on Rails App {{ $labels.instance }}"
      description: "Rails app on {{ $labels.instance }} is experiencing more than five 5xx errors per minute."

- name: elasticsearch_alerts
  rules:
  - alert: ElasticsearchClusterRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1 # Metric name assumes the community elasticsearch_exporter
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster is RED on {{ $labels.instance }}"
      description: "Elasticsearch cluster health is RED. Shard allocation issues are likely."

  - alert: ElasticsearchHighJVMMemoryPressure
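    expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.8 # Heap usage as a proxy for memory pressure; names assume the community elasticsearch_exporter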
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High JVM Memory Pressure on Elasticsearch node {{ $labels.instance }}"
      description: "JVM memory pressure on Elasticsearch node {{ $labels.instance }} is above 80%."

  - alert: ElasticsearchDiskWatermarkLow
    expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10 # Names assume the community elasticsearch_exporter
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Low Disk Space on Elasticsearch node {{ $labels.instance }}"
      description: "Elasticsearch node {{ $labels.instance }} has less than 10% free disk space."
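The CPU rule depends on a detail that is easy to miss: node_cpu_seconds_total is a counter that only ever grows, so it has to pass through rate() before it can be compared with a threshold. The sketch below shows the underlying arithmetic in plain Ruby under simplifying assumptions: it handles counter resets but skips the extrapolation that Prometheus applies, and the sample values and helper names are hypothetical.

```ruby
# Per-second increase of a counter across [timestamp, value] samples.
# A drop in value is treated as a counter reset (restart from zero).
def counter_rate(samples)
  return nil if samples.length < 2

  elapsed = samples.last[0] - samples.first[0]
  return nil unless elapsed.positive?

  increase = 0.0
  samples.each_cons(2) do |(_, previous), (_, current)|
    increase += current >= previous ? current - previous : current
  end
  increase / elapsed
end

# CPU usage as 1 minus the average idle rate across cores.
# An idle rate of 1.0 means the core spent the whole window idle.
def cpu_usage_ratio(idle_samples_by_core)
  idle_rates = idle_samples_by_core.values.map { |samples| counter_rate(samples) }.compact
  return nil if idle_rates.empty?

  1.0 - idle_rates.sum / idle_rates.length
end

# Hypothetical idle-seconds counters for two cores, sampled 60 seconds apart.
idle_samples = {
  'cpu0' => [[0, 1000.0], [60, 1006.0]],
  'cpu1' => [[0, 2000.0], [60, 2012.0]]
}

usage = cpu_usage_ratio(idle_samples)
puts format('cpu usage: %.0f%%', usage * 100)
puts 'would fire HighCpuUsageRails' if 1.0 - usage < 0.2
```

Comparing the raw counter with 0.2 would almost never match, since the counter's value is the total idle seconds since boot rather than a ratio.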

Ensure your prometheus.yml includes this rule file:

rule_files:
  - "/etc/prometheus/alert.rules.yml"

Configure Alertmanager (/etc/alertmanager/alertmanager.yml) to route these alerts to your desired receivers (e.g., Slack, PagerDuty, email).

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver

receivers:
- name: 'default-receiver'
  slack_configs:
  - api_url: 'YOUR_SLACK_WEBHOOK_URL'
    channel: '#alerts'
    send_resolved: true

inhibit_rules:
  # A firing critical alert suppresses the matching warning alert
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

Reload Alertmanager:

curl -X POST http://localhost:9093/-/reload
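To build intuition for how group_by and inhibit_rules interact, here is a simplified model in plain Ruby. It is not Alertmanager's implementation; it assumes the intent is that a firing critical alert silences a warning with the same alertname, cluster, and service, and it treats a missing label as equal to another missing label. The alert names and instances are hypothetical.

```ruby
# True when the alert carries every label/value pair in the matcher set.
def matches?(alert, matchers)
  matchers.all? { |label, value| alert[label] == value }
end

# Alerts sharing the same values for the group_by labels are batched
# into one notification.
def group_alerts(alerts, group_by)
  alerts.group_by { |alert| group_by.map { |label| alert[label] } }
end

# A target alert is inhibited when some other alert matches the source
# side of the rule and agrees with it on every label listed in `equal`.
def inhibited?(target, alerts, rule)
  return false unless matches?(target, rule[:target_match])

  alerts.any? do |source|
    !source.equal?(target) &&
      matches?(source, rule[:source_match]) &&
      rule[:equal].all? { |label| source[label] == target[label] }
  end
end

INHIBIT_RULE = {
  source_match: { 'severity' => 'critical' },
  target_match: { 'severity' => 'warning' },
  equal: %w[alertname cluster service]
}.freeze

# Hypothetical firing alerts.
alerts = [
  { 'alertname' => 'DiskSpaceLow', 'severity' => 'warning', 'instance' => 'es-1' },
  { 'alertname' => 'DiskSpaceLow', 'severity' => 'critical', 'instance' => 'es-1' },
  { 'alertname' => 'HighMemoryUsageRails', 'severity' => 'warning', 'instance' => 'rails-1' }
]

notifiable = alerts.reject { |alert| inhibited?(alert, alerts, INHIBIT_RULE) }
group_alerts(notifiable, %w[alertname cluster service]).each do |key, group|
  puts "#{key.compact.join('/')}: #{group.map { |a| a['severity'] }.join(', ')}"
end
```

In this model the DiskSpaceLow warning is dropped because the critical alert of the same name is firing, while the unrelated memory warning is still delivered.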

Advanced Considerations and Best Practices

Centralized Logging: Integrate a centralized logging solution (e.g., ELK stack, Loki) to aggregate logs from all your Droplets. This is invaluable for debugging issues that metrics alone can’t explain.
Distributed Tracing: For complex Rails applications, consider implementing distributed tracing (e.g., Jaeger, Zipkin) to understand request flows across different services and identify latency bottlenecks.
Health Checks: Implement dedicated health check endpoints in your Rails application that can be polled by load balancers or monitoring systems.
Resource Limits: On DigitalOcean, ensure your Droplet sizes are appropriate. For Elasticsearch, consider dedicated nodes with sufficient RAM and fast SSDs.
Automated Deployments: Integrate monitoring into your CI/CD pipeline. Failed health checks or sudden metric spikes post-deployment should trigger rollbacks.
Regular Review: Periodically review your alerts and dashboards. Are they noisy? Are they missing critical issues? Tune them based on operational experience.
Security: Secure your monitoring endpoints. Use firewalls to restrict access to Prometheus, Alertmanager, and metric endpoints (e.g., 9100, 9090, 9093, Rails /metrics).
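As an illustration of the health check item above, the following is a minimal Rack-style sketch in plain Ruby: a callable that runs a set of dependency checks and answers 200 or 503. The check names and the build_health_app helper are hypothetical, and the lambdas stand in for real probes such as a database ping or an Elasticsearch request.

```ruby
require 'json'

# Build a Rack-compatible app from a hash of check name => callable.
# A check passes when its callable returns a truthy value without raising.
def build_health_app(checks)
  lambda do |_env|
    results = checks.transform_values do |check|
      begin
        check.call ? 'ok' : 'failed'
      rescue StandardError => e
        "error: #{e.class}"
      end
    end

    healthy = results.values.all? { |result| result == 'ok' }
    body = JSON.generate(status: healthy ? 'ok' : 'degraded', checks: results)
    [healthy ? 200 : 503, { 'content-type' => 'application/json' }, [body]]
  end
end

# Hypothetical checks; replace the lambdas with real probes.
app = build_health_app(
  'database' => -> { true },
  'elasticsearch' => -> { raise IOError, 'connection refused' }
)

status, _headers, body = app.call({})
puts "#{status} #{body.first}"
```

Returning 503 when any dependency fails lets a load balancer take the instance out of rotation. Whether a failing secondary dependency should do that is a design choice; some teams expose a separate liveness endpoint that only reports whether the process itself is running.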

By implementing these layered monitoring strategies, you can move from reactive firefighting to proactive system management, ensuring the stability and performance of your Ruby on Rails applications and Elasticsearch clusters on DigitalOcean.