Server Monitoring Best Practices: Keeping Your Ruby App and Elasticsearch Clusters Alive on DigitalOcean
Proactive Monitoring for Ruby on Rails and Elasticsearch on DigitalOcean
Maintaining high availability for critical services like Ruby on Rails applications and Elasticsearch clusters on DigitalOcean requires a robust, multi-layered monitoring strategy. This isn’t about setting up a few basic alerts; it’s about building a system that anticipates issues, provides deep diagnostic insights, and allows for rapid, informed remediation. We’ll cover essential metrics, configuration examples for common tools, and specific considerations for both Rails and Elasticsearch.
Core Metrics for Ruby on Rails Applications
A healthy Rails application is more than just “up” or “down.” We need to monitor resource utilization, application performance, and potential error conditions. Key metrics include:
- CPU Usage: High CPU can indicate inefficient code, runaway processes, or insufficient resources.
- Memory Usage: Leaks or excessive memory consumption can lead to swapping and performance degradation.
- Request Latency: The time it takes for your application to respond to a request is a direct measure of user experience.
- Error Rate: Tracking HTTP 5xx errors and application-level exceptions is crucial for identifying bugs.
- Database Connection Pool: Exhausted connection pools are a common bottleneck.
- Background Job Queues: For applications using Sidekiq, Resque, or similar, queue depth and processing times are vital.
- Disk I/O: Especially important if your application performs heavy file operations or logging.
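Two of the metrics above, connection pool usage and queue depth, reduce to simple ratios once the raw numbers are available. The following is a minimal sketch, assuming a stats hash shaped like the one returned by ActiveRecord's connection_pool.stat (keys :size, :busy, :waiting). The helper names and sample numbers are hypothetical and only illustrate the arithmetic.

```ruby
# Hypothetical helpers. The hash shape mirrors ActiveRecord's
# ConnectionPool#stat (:size, :busy, :waiting), but nothing here needs Rails.

# Fraction of pool connections currently checked out (0.0..1.0).
def pool_utilization(stat)
  size = stat.fetch(:size)
  return 0.0 if size.zero?

  stat.fetch(:busy).to_f / size
end

# A pool counts as saturated when every connection is busy
# and at least one thread is waiting for a connection.
def pool_saturated?(stat)
  pool_utilization(stat) >= 1.0 && stat.fetch(:waiting, 0) > 0
end

# Rough time needed to drain a job queue at the current processing rate.
# Returns Float::INFINITY when nothing is being processed.
def queue_drain_seconds(depth, jobs_per_second)
  return 0.0 if depth.zero?
  return Float::INFINITY if jobs_per_second <= 0

  depth / jobs_per_second.to_f
end

sample = { size: 5, busy: 5, waiting: 2 } # illustrative numbers
puts "pool utilization: #{pool_utilization(sample)}"
puts "pool saturated:   #{pool_saturated?(sample)}"
puts "queue drains in:  #{queue_drain_seconds(1200, 40)}s"
```

Values like these are good candidates for gauges, since they can go up as well as down.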
Setting Up Node Exporter for System Metrics
Prometheus is a de facto standard for metrics collection. Node Exporter is a simple way to expose hardware and OS metrics from your DigitalOcean Droplets. We’ll use it to gather CPU, memory, disk, and network statistics.
Installation and Configuration on a Debian/Ubuntu Droplet
First, download the latest release of Node Exporter. Replace X.Y.Z with the current version.
wget https://github.com/prometheus/node_exporter/releases/download/vX.Y.Z/node_exporter-X.Y.Z.linux-amd64.tar.gz
tar xvfz node_exporter-X.Y.Z.linux-amd64.tar.gz
sudo mv node_exporter-X.Y.Z.linux-amd64/node_exporter /usr/local/bin/
Next, create a systemd service file to manage Node Exporter. This ensures it starts on boot and can be controlled with systemctl.
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
Save this content to /etc/systemd/system/node_exporter.service. Then, enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter
Node Exporter will now be listening on port 9100. You can verify by curling its metrics endpoint:
curl http://localhost:9100/metrics
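The endpoint returns plain text in the Prometheus exposition format: one sample per line, with comment lines starting with #. As a rough illustration of what that output contains, here is a small Ruby sketch that parses it into a hash. The function names are hypothetical, and the parser is deliberately simplified (it assumes no trailing timestamps, which Node Exporter does not emit).

```ruby
require 'net/http'
require 'uri'

# Convert a sample value, including the special values Float() rejects.
def parse_value(raw)
  case raw
  when 'NaN' then Float::NAN
  when '+Inf', 'Inf' then Float::INFINITY
  when '-Inf' then -Float::INFINITY
  else Float(raw)
  end
end

# Parse exposition-format text into { "name{labels}" => Float }.
# Comment lines (# HELP / # TYPE) and blank lines are skipped.
def parse_metrics(text)
  text.each_line.with_object({}) do |line, samples|
    line = line.strip
    next if line.empty? || line.start_with?('#')

    # The value is the last space-separated field; label values may
    # themselves contain spaces, so split from the right.
    series, _, value = line.rpartition(' ')
    samples[series] = parse_value(value)
  end
end

# Fetch the raw text from a running exporter (not called in this sketch).
def fetch_metrics(url = 'http://localhost:9100/metrics')
  Net::HTTP.get(URI(url))
end

sample = <<~'TEXT'
  # HELP node_load1 1m load average.
  # TYPE node_load1 gauge
  node_load1 0.42
  node_cpu_seconds_total{cpu="0",mode="idle"} 1.5e+03
TEXT

parse_metrics(sample).each { |series, value| puts "#{series} => #{value}" }
```

On a Droplet with Node Exporter running, parse_metrics(fetch_metrics) would return the live samples instead of the inline sample text.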
Monitoring Rails Application Performance with Prometheus and an Exporter
To get application-specific metrics from your Rails app, you’ll need an exporter. The prometheus-client gem, the official Prometheus client library for Ruby, is a popular choice. It lets you define custom metrics within your application code and ships Rack middleware for exposing them.
Integrating prometheus-client into your Rails App
Add the gem to your Gemfile:
gem 'prometheus-client', require: 'prometheus/client'
Run bundle install. Then, configure the exporter in an initializer (e.g., config/initializers/prometheus.rb):
require 'prometheus/client'
require 'prometheus/middleware/exporter'

# Use a dedicated registry so only the metrics defined here are exported
$prometheus_registry = Prometheus::Client::Registry.new

# Example: A counter for total requests
$request_counter = $prometheus_registry.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests processed',
  labels: [:controller, :action, :status]
)

# Example: A histogram for request duration
$request_duration_histogram = $prometheus_registry.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration in seconds',
  labels: [:controller, :action]
)

# Example: A gauge for active database connections (requires custom logic or a DB exporter)
# $db_connections_gauge = $prometheus_registry.gauge(:database_connections_active, docstring: 'Number of active database connections')

# Example: A counter for application-specific errors, labelled by exception class
$app_error_counter = $prometheus_registry.counter(
  :application_errors_total,
  docstring: 'Total application errors encountered',
  labels: [:type]
)

# Mount the exporter middleware to expose metrics at /metrics
# Ensure this is placed *after* other middleware that might modify responses
Rails.application.config.middleware.use Prometheus::Middleware::Exporter, registry: $prometheus_registry
Now, instrument your application code. For example, to increment the request counter and record duration:
# In app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  # around_action (not before_action) is needed because the callback yields to the action
  around_action :record_request_metrics
  rescue_from StandardError, with: :handle_error

  private

  def record_request_metrics
    # A monotonic clock is immune to wall-clock adjustments
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status = 500 # Reported if the action raises before completing
    yield # Runs the controller action
    status = response.status
  ensure
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
    labels = { controller: controller_name, action: action_name }
    $request_counter.increment(labels: labels.merge(status: status.to_s))
    $request_duration_histogram.observe(duration, labels: labels)
  end

  # Example of incrementing an error counter
  def handle_error(error)
    error_type = error.class.name
    $app_error_counter.increment(labels: { type: error_type })
    # ... other error handling logic
    raise error # Re-raise so Rails' default error handling still applies
  end
end
The exporter middleware serves the collected metrics at the /metrics endpoint of your Rails application. Make sure your web server (e.g., Nginx) lets the Prometheus server reach this path while keeping it closed to the public internet.
Elasticsearch Cluster Monitoring
Elasticsearch is notoriously resource-intensive and can be complex to troubleshoot when performance degrades. Monitoring its internal state is paramount.
Key Elasticsearch Metrics
- Cluster Health: Status (green, yellow, red), number of nodes, shards.
- Node Stats: CPU usage, JVM heap usage, disk usage, network traffic.
- Indexing Rate: Documents indexed per second.
- Search Rate: Searches per second.
- Query Latency: Average and p95/p99 latency for search queries.
- Shard Status: Number of unassigned shards, relocating shards.
- JVM Memory Pressure: Indicates potential garbage collection issues.
- Disk Watermarks: Once a node crosses the flood-stage watermark (95% disk usage by default), Elasticsearch marks the indices with shards on that node as read-only.
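Several of these values (status, node count, unassigned and relocating shards) come straight from the _cluster/health API. The sketch below shows one way to reduce that response to the fields listed above. The function names and the numeric status encoding are assumptions made for this example, and the sample JSON is illustrative rather than captured from a real cluster.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Numeric encoding chosen for this sketch so the status can be graphed.
STATUS_CODES = { 'green' => 2, 'yellow' => 1, 'red' => 0 }.freeze

# Reduce a _cluster/health response body to the fields worth tracking.
def summarize_health(json_body)
  health = JSON.parse(json_body)
  status = health.fetch('status')
  {
    status: status,
    status_code: STATUS_CODES.fetch(status),
    nodes: health.fetch('number_of_nodes'),
    unassigned_shards: health.fetch('unassigned_shards'),
    relocating_shards: health.fetch('relocating_shards')
  }
end

# Fetch the health document from a node (not called in this sketch).
# base_url is a placeholder such as "http://YOUR_ELASTICSEARCH_HOST:9200".
def fetch_health(base_url)
  Net::HTTP.get(URI("#{base_url}/_cluster/health"))
end

sample = <<~'JSON'
  {
    "cluster_name": "example",
    "status": "yellow",
    "number_of_nodes": 3,
    "relocating_shards": 0,
    "unassigned_shards": 2
  }
JSON

puts summarize_health(sample)
```

A yellow status with a non-zero unassigned_shards count, as in the sample, typically means replicas are missing while all primaries are still available.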
Using Metricbeat for Elasticsearch Metrics
Metricbeat is part of the Elastic Stack and is excellent for collecting metrics from Elasticsearch itself. It can be installed on a dedicated monitoring node or on your application nodes.
Installation and Configuration
Download and install Metricbeat on your chosen node. The exact commands depend on your OS. For Debian/Ubuntu:
curl -L -O https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-X.Y.Z-amd64.deb
sudo dpkg -i metricbeat-X.Y.Z-amd64.deb
Edit the main configuration file, /etc/metricbeat/metricbeat.yml. You’ll need to configure the Elasticsearch output and enable the Elasticsearch module.
metricbeat.modules:
  - module: elasticsearch
    period: 10s
    hosts: ["YOUR_ELASTICSEARCH_HOST:9200"] # Replace with your ES cluster endpoint
    # Optional: If your ES cluster requires authentication
    # username: "elastic"
    # password: "changeme"

# Configure the Elasticsearch output (where Metricbeat sends the collected data)
output.elasticsearch:
  hosts: ["YOUR_ELASTICSEARCH_HOST:9200"] # Replace with your ES cluster endpoint
  # Optional: If your ES cluster requires authentication
  # username: "elastic"
  # password: "changeme"

# If you are using Kibana for visualization, configure it here
# setup.kibana:
#   host: "YOUR_KIBANA_HOST:5601"

# Note: Metricbeat has no Prometheus output. To alert on Elasticsearch from
# Prometheus, run a dedicated exporter (e.g., the community elasticsearch_exporter)
# alongside Metricbeat.
Enable and start the Metricbeat service:
sudo systemctl daemon-reload
sudo systemctl enable metricbeat
sudo systemctl start metricbeat
sudo systemctl status metricbeat
Metricbeat will now collect Elasticsearch metrics and send them to the configured Elasticsearch output. Ideally this is a separate monitoring cluster, so the data survives an outage of the cluster being monitored. You’ll typically use Kibana to create dashboards on top of this data.
Alerting Strategies with Prometheus Alertmanager
Collecting metrics is only half the battle; you need to be alerted when things go wrong. Prometheus Alertmanager is the standard companion for Prometheus.
Configuring Prometheus to Scrape Your Services
In your Prometheus configuration file (e.g., /etc/prometheus/prometheus.yml), add scrape configurations for your Rails app and Node Exporter:
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['YOUR_RAILS_DROPLET_IP:9100', 'YOUR_ES_NODE_1_IP:9100', 'YOUR_ES_NODE_2_IP:9100'] # Add all nodes running node_exporter
  - job_name: 'rails_app'
    static_configs:
      - targets: ['YOUR_RAILS_DROPLET_IP:3000'] # Or the port your Rails app is exposed on
  - job_name: 'elasticsearch'
    # Prometheus cannot scrape Metricbeat. This job assumes a dedicated exporter
    # such as elasticsearch_exporter (default port 9114) is running.
    static_configs:
      - targets: ['YOUR_ES_EXPORTER_HOST:9114']
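Reload your Prometheus configuration. The HTTP reload endpoint is only available when Prometheus was started with --web.enable-lifecycle; otherwise, send the process a SIGHUP: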
curl -X POST http://localhost:9090/-/reload
Setting Up Prometheus Alerting Rules
Define alerting rules in a separate file (e.g., /etc/prometheus/alert.rules.yml) and reference it in prometheus.yml.
groups:
  - name: rails_alerts
    rules:
      - alert: HighCpuUsageRails
        # node_cpu_seconds_total is a counter, so take its rate and invert the idle share
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{job="node_exporter", mode="idle", instance="YOUR_RAILS_DROPLET_IP:9100"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on Rails Droplet {{ $labels.instance }}"
          description: "CPU usage has been above 80% for the last 5 minutes on {{ $labels.instance }}."
      - alert: HighMemoryUsageRails
        expr: (node_memory_MemTotal_bytes{instance="YOUR_RAILS_DROPLET_IP:9100"} - node_memory_MemAvailable_bytes{instance="YOUR_RAILS_DROPLET_IP:9100"}) / node_memory_MemTotal_bytes{instance="YOUR_RAILS_DROPLET_IP:9100"} * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Memory Usage on Rails Droplet {{ $labels.instance }}"
          description: "Memory usage has been over 90% for the last 5 minutes on {{ $labels.instance }}."
      - alert: HighRequestLatencyRails
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="rails_app"}[5m])) by (le, instance)) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P95 Request Latency on Rails App {{ $labels.instance }}"
          description: "P95 request latency for Rails app on {{ $labels.instance }} has been above 2 seconds for 5 minutes."
      - alert: HighErrorRateRails
        # rate() is per second; multiply by 60 to compare against a per-minute threshold
        expr: sum by (instance) (rate(http_requests_total{job="rails_app", status=~"5.."}[5m])) * 60 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High 5xx Error Rate on Rails App {{ $labels.instance }}"
          description: "Rails app on {{ $labels.instance }} is returning more than 5 5xx errors per minute."
  - name: elasticsearch_alerts
    # Metric names below are the ones exposed by elasticsearch_exporter
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster is RED on {{ $labels.instance }}"
          description: "Elasticsearch cluster health is RED: at least one primary shard is unassigned."
      - alert: ElasticsearchHighJVMMemoryPressure
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High JVM Memory Pressure on Elasticsearch node {{ $labels.instance }}"
          description: "JVM heap usage on Elasticsearch node {{ $labels.instance }} is above 80%."
      - alert: ElasticsearchDiskWatermarkLow
        expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low Disk Space on Elasticsearch node {{ $labels.instance }}"
          description: "Elasticsearch node {{ $labels.instance }} has less than 10% free disk space."
Ensure your prometheus.yml includes this rule file:
rule_files:
  - "/etc/prometheus/alert.rules.yml"
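The latency rule above depends on histogram_quantile, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the requested rank. The Ruby sketch below imitates that calculation under simplifying assumptions (non-negative bucket bounds, a final +Inf bucket). It is an approximation written for this article, not Prometheus's own code, and the bucket bounds and counts are illustrative.

```ruby
# buckets: array of [upper_bound, cumulative_count] pairs, mirroring the
# le label and value of the *_bucket series. The last bucket must be +Inf.
def histogram_quantile(q, buckets)
  buckets = buckets.sort_by(&:first)
  total = buckets.last[1]
  return Float::NAN if total.zero?

  rank = q * total
  idx = buckets.index { |_, count| count >= rank }
  upper, count = buckets[idx]

  # Observations above the highest finite bound cannot be interpolated,
  # so the highest finite bound is returned instead.
  return buckets[-2][0] if upper == Float::INFINITY

  # The first bucket is assumed to start at 0.
  lower, below = idx.zero? ? [0.0, 0] : buckets[idx - 1]
  lower + (upper - lower) * (rank - below) / (count - below).to_f
end

buckets = [
  [0.5, 80], [1.0, 90], [2.5, 96], [5.0, 99], [Float::INFINITY, 100]
]

puts "estimated p95: #{histogram_quantile(0.95, buckets)}"
puts "estimated p50: #{histogram_quantile(0.50, buckets)}"
```

Because the result is interpolated, its accuracy depends on where the bucket boundaries sit relative to the alert threshold. A 2-second threshold is estimated more reliably when a bucket boundary lies at or near 2 seconds.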
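Configure Alertmanager (/etc/alertmanager/alertmanager.yml) to route these alerts to your desired receivers (e.g., Slack, PagerDuty, email). Prometheus also needs an alerting.alertmanagers entry in prometheus.yml that points at your Alertmanager instance (port 9093 by default); without it, firing alerts never leave Prometheus.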
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver
receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
        send_resolved: true
inhibit_rules:
  # A firing critical alert (source) mutes the matching warning alert (target)
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
Reload Alertmanager:
curl -X POST http://localhost:9093/-/reload
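Inhibition is easy to configure backwards, so it helps to spell out the semantics: a target alert is muted while some source alert is firing and both carry the same values for every label listed under equal. The sketch below models that decision with plain label hashes. It is a simplified illustration with hypothetical function and alert names, and it leaves out some of Alertmanager's edge-case handling.

```ruby
# True when every matcher label has the expected value on the alert.
def matches?(labels, matchers)
  matchers.all? { |name, value| labels[name] == value }
end

# Decide whether `target` is muted by any alert in `firing_alerts`.
# `rule` mirrors one inhibit_rules entry from alertmanager.yml.
def inhibited?(target, firing_alerts, rule)
  return false unless matches?(target, rule[:target_match])

  firing_alerts.any? do |source|
    !source.equal?(target) &&
      matches?(source, rule[:source_match]) &&
      rule[:equal].all? { |label| source[label] == target[label] }
  end
end

rule = {
  source_match: { 'severity' => 'critical' },
  target_match: { 'severity' => 'warning' },
  equal: %w[alertname cluster service]
}

# Hypothetical alerts sharing an alertname but differing in severity.
critical = { 'alertname' => 'DiskFull', 'severity' => 'critical', 'cluster' => 'es' }
warning  = { 'alertname' => 'DiskFull', 'severity' => 'warning',  'cluster' => 'es' }
other    = { 'alertname' => 'HighCpu',  'severity' => 'warning',  'cluster' => 'es' }
firing   = [critical, warning, other]

puts "warning muted: #{inhibited?(warning, firing, rule)}"
puts "other muted:   #{inhibited?(other, firing, rule)}"
```

Note that alertname is part of the equal list, so a critical alert only mutes a warning that shares its name. Alerts with different names, like the ones defined earlier, do not inhibit each other under this rule.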
Advanced Considerations and Best Practices
- Centralized Logging: Integrate a centralized logging solution (e.g., ELK stack, Loki) to aggregate logs from all your Droplets. This is invaluable for debugging issues that metrics alone can’t explain.
- Distributed Tracing: For complex Rails applications, consider implementing distributed tracing (e.g., Jaeger, Zipkin) to understand request flows across different services and identify latency bottlenecks.
- Health Checks: Implement dedicated health check endpoints in your Rails application that can be polled by load balancers or monitoring systems.
- Resource Limits: On DigitalOcean, ensure your Droplet sizes are appropriate. For Elasticsearch, consider dedicated nodes with sufficient RAM and fast SSDs.
- Automated Deployments: Integrate monitoring into your CI/CD pipeline. Failed health checks or sudden metric spikes post-deployment should trigger rollbacks.
- Regular Review: Periodically review your alerts and dashboards. Are they noisy? Are they missing critical issues? Tune them based on operational experience.
- Security: Secure your monitoring endpoints. Use firewalls to restrict access to Prometheus, Alertmanager, and metric endpoints (e.g., 9100, 9090, 9093, Rails /metrics).
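The health check item above can be sketched as a small Rack-compatible object that a load balancer or monitoring system polls. The class name, the check names, and the lambdas standing in for real dependency probes are all hypothetical, and only Ruby's standard library is used.

```ruby
require 'json'

# Minimal Rack-compatible health endpoint. `checks` maps a name to a
# callable that returns true when the dependency is reachable.
class HealthCheck
  def initialize(checks)
    @checks = checks
  end

  # Rack interface: returns [status, headers, body].
  def call(_env)
    results = @checks.transform_values do |check|
      begin
        check.call ? 'ok' : 'failing'
      rescue StandardError => e
        # A raising check is reported rather than crashing the endpoint
        "error: #{e.class}"
      end
    end

    healthy = results.values.all? { |value| value == 'ok' }
    body = JSON.generate({ status: healthy ? 'ok' : 'degraded', checks: results })
    [healthy ? 200 : 503, { 'content-type' => 'application/json' }, [body]]
  end
end

# Placeholder checks; a real app would ping its database, cache, etc.
app = HealthCheck.new(
  database: -> { true },
  elasticsearch: -> { raise IOError, 'connection refused' }
)

status, _headers, body = app.call({})
puts status
puts body.first
```

In a Rails application, an object like this could be mounted in config/routes.rb, for example at a path such as /healthz. Returning 503 when any check fails lets a load balancer take the Droplet out of rotation.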
By implementing these layered monitoring strategies, you can move from reactive firefighting to proactive system management, ensuring the stability and performance of your Ruby on Rails applications and Elasticsearch clusters on DigitalOcean.