Server Monitoring Best Practices: Keeping Your Ruby App and PostgreSQL Clusters Alive on DigitalOcean
Establishing Core Metrics for Ruby on Rails Applications
Effective server monitoring for a Ruby on Rails application on DigitalOcean hinges on a multi-layered approach. We need to track not just the underlying infrastructure but also the application’s performance and health from an end-user perspective. For the Rails app itself, key metrics include request latency, error rates, and throughput. These provide immediate insight into application responsiveness and stability.
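To make the three headline metrics concrete, here is a minimal plain-Ruby sketch that derives throughput, error rate, and 95th-percentile latency from a window of request samples. The `Sample` struct, the `summarize` helper, and the sample values are hypothetical and only illustrate the arithmetic; in practice Prometheus computes these from scraped counters and histograms.

```ruby
# Hypothetical request sample: duration in seconds plus the HTTP status returned.
Sample = Struct.new(:duration, :status)

# Summarize one window of samples into throughput, error rate, and p95 latency.
def summarize(samples, window_seconds)
  return { throughput: 0.0, error_rate: 0.0, p95_latency: nil } if samples.empty?

  durations = samples.map(&:duration).sort
  errors = samples.count { |s| s.status >= 500 }
  # Nearest-rank method for the 95th percentile.
  rank = (0.95 * durations.size).ceil - 1

  {
    throughput: samples.size / window_seconds.to_f, # requests per second
    error_rate: errors / samples.size.to_f,         # fraction of 5xx responses
    p95_latency: durations[rank]
  }
end

samples = [
  Sample.new(0.12, 200), Sample.new(0.08, 200),
  Sample.new(0.95, 500), Sample.new(0.20, 302)
]
summary = summarize(samples, 60)
```

With these four samples over a 60-second window, one request in four is a server error, so the error rate is 0.25.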
Instrumenting Your Rails Application with Prometheus
Prometheus is a de facto standard for application-level metrics. For Rails, the official prometheus-client gem is a solid choice: it ships Rack middleware that records request metrics and exposes them, along with any custom metrics you register, for a Prometheus server to scrape.
First, add the gem to your Gemfile:
# Gemfile
gem 'prometheus-client'
Next, set up the metrics registry. A common place for this is an initializer file, e.g., config/initializers/prometheus.rb:
# config/initializers/prometheus.rb
require 'prometheus/client'

# All metrics live in a registry; the default one is shared with the Rack middleware.
registry = Prometheus::Client.registry
Rails.application.config.x.prometheus_registry = registry

# The request counter and latency histogram are not defined by hand here.
# Prometheus::Middleware::Collector (mounted in config/application.rb) registers
# and updates them as:
#   http_requests_total            (counter, labels: code, method, path)
#   http_request_duration_seconds  (histogram, labels: method, path)
# Register additional custom metrics on `registry` as needed.
Then add the middleware to your Rails application’s middleware stack. In config/application.rb:
# config/application.rb
# Place these requires after Bundler.require(*Rails.groups).
require 'prometheus/middleware/collector'
require 'prometheus/middleware/exporter'

module YourApp
  class Application < Rails::Application
    # ... other configurations ...
    # metrics_prefix: 'http' yields http_requests_total instead of the
    # default http_server_requests_total.
    config.middleware.use Prometheus::Middleware::Collector, metrics_prefix: 'http'
    # Serves the registry's metrics at /metrics.
    config.middleware.use Prometheus::Middleware::Exporter
    # ... other configurations ...
  end
end
This setup exposes metrics at a /metrics endpoint, which your Prometheus server can then scrape. The collector labels each request with its HTTP status code, method, and path; for per-controller or per-action breakdowns, register custom metrics with those labels.
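To show what request-metrics middleware does internally, the following is a minimal sketch in plain Ruby with no gems. `RequestMetrics` and its bucket bounds are hypothetical and chosen for illustration; this is not the gem's implementation, only the general pattern of timing each request, counting it by label, and updating cumulative histogram buckets.

```ruby
# Minimal Rack-style middleware that counts requests and buckets their latency.
class RequestMetrics
  # Illustrative bucket upper bounds, in seconds.
  BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10].freeze

  attr_reader :counts, :bucket_counts

  def initialize(app)
    @app = app
    @lock = Mutex.new
    @counts = Hash.new(0)        # { [method, status] => request count }
    @bucket_counts = Hash.new(0) # { upper bound => cumulative count }
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    record(env['REQUEST_METHOD'], status, elapsed)
    [status, headers, body]
  end

  private

  def record(method, status, elapsed)
    @lock.synchronize do
      @counts[[method, status.to_s]] += 1
      # Histogram buckets are cumulative: a sample counts in every bucket
      # whose upper bound is greater than or equal to the observed value.
      BUCKETS.each { |le| @bucket_counts[le] += 1 if elapsed <= le }
      @bucket_counts[Float::INFINITY] += 1
    end
  end
end

# Usage with a trivial Rack-compatible app.
app = ->(_env) { [200, { 'content-type' => 'text/plain' }, ['ok']] }
metrics = RequestMetrics.new(app)
2.times { metrics.call('REQUEST_METHOD' => 'GET', 'PATH_INFO' => '/') }
```

Keeping the label set small (method and status here) matters because every distinct label combination becomes its own time series in Prometheus.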
Monitoring PostgreSQL Clusters with pg_exporter
For PostgreSQL, we need to monitor database performance, connection pooling, replication status, and resource utilization. pg_exporter is a Prometheus exporter for PostgreSQL that provides a comprehensive set of metrics.
First, install pg_exporter on your database nodes or a dedicated monitoring server. The installation process varies by OS, but often involves downloading a binary or using a package manager.
Next, configure pg_exporter to connect to your PostgreSQL instances. This typically involves a .pgpass file for authentication and a configuration file for pg_exporter itself.
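Create a .pgpass file in the home directory of the user running pg_exporter, and restrict it to mode 0600 (chmod 600 ~/.pgpass), because libpq ignores a password file that is readable by group or others:
# ~/.pgpass
hostname:port:database:username:password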
Then, create a configuration file for pg_exporter (e.g., pg_exporter.yml). This file specifies which collectors to enable and connection string details.
# pg_exporter.yml
log_level: info
web_listen_address: "0.0.0.0:9187" # Default pg_exporter port
collectors:
  - pg_stat_activity
  - pg_stat_database
  - pg_stat_replication
  - pg_stat_statements # Requires pg_stat_statements extension enabled in PostgreSQL
  - pg_locks
  - pg_settings
  - pg_replication_slots
  - pg_database_size
  - pg_postmaster_start_time
  - pg_stat_bgwriter
  - pg_stat_user_tables
  - pg_stat_user_indexes
# Example for a single PostgreSQL instance.
# If you have multiple instances, you'll need to configure them accordingly,
# potentially using environment variables or multiple pg_exporter instances.
# For simplicity, we'll assume a single connection string here.
# pg_exporter can also be configured to connect to multiple databases.
# See pg_exporter documentation for advanced configurations.
# The connection string format is typically:
#   postgresql://user:password@host:port/database?sslmode=disable
# Or using environment variables: PG_EXPORTER_CONNSTR
# For this example, we'll use a direct connection string.
# NOTE: Storing passwords directly in config is not recommended for production.
# Use environment variables or secrets management.
# For demonstration:
# connstr: "postgresql://monitor_user:your_password@your_db_host:5432/postgres?sslmode=disable"
Start pg_exporter. It’s recommended to run this as a systemd service for robustness.
# Example systemd service file (e.g., /etc/systemd/system/pg_exporter.service)
# systemd only treats # as a comment at the start of a line, so comments
# sit on their own lines rather than after a value.
[Unit]
Description=PostgreSQL Exporter
Wants=network-online.target
After=network-online.target

[Service]
# Or the user running pg_exporter
User=postgres
Group=postgres
Type=simple
ExecStart=/usr/local/bin/pg_exporter --config.file=/etc/pg_exporter/pg_exporter.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
Then, enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable pg_exporter
sudo systemctl start pg_exporter
This will expose PostgreSQL metrics on port 9187, ready for Prometheus to scrape.
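What Prometheus scrapes from an exporter is plain text, one sample per line. As a rough illustration of that format, the sketch below parses a few exposition lines in plain Ruby. The `parse_metrics` helper is hypothetical and the sample text is made up for the example (the `datname` label value in particular). It handles only simple lines and does not cover escaped quotes inside label values or trailing timestamps.

```ruby
# Matches: metric_name{label="value",...} value
LINE = /\A(?<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?<labels>.*)\})?\s+(?<value>\S+)/

# Convert a sample value, including the special tokens the format allows.
def to_number(token)
  case token
  when 'NaN' then Float::NAN
  when '+Inf', 'Inf' then Float::INFINITY
  when '-Inf' then -Float::INFINITY
  else Float(token)
  end
end

# Parse exposition text into an array of { name:, labels:, value: } hashes.
def parse_metrics(text)
  parsed = text.each_line.map do |raw|
    line = raw.strip
    next if line.empty? || line.start_with?('#') # skip HELP/TYPE comments

    match = LINE.match(line)
    next unless match

    labels = (match[:labels] || '').scan(/(\w+)="([^"]*)"/).to_h
    { name: match[:name], labels: labels, value: to_number(match[:value]) }
  end
  parsed.compact
end

# Made-up sample output for illustration.
sample = <<~TEXT
  # HELP pg_up Sample help text.
  # TYPE pg_up gauge
  pg_up 1
  pg_stat_activity_count{datname="app_production",state="active"} 12
TEXT
metrics = parse_metrics(sample)
```

A quick manual check of the same data is `curl http://localhost:9187/metrics` on the exporter host.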
Configuring Prometheus for Scraping
Your Prometheus server needs to be configured to discover and scrape metrics from your Rails application and PostgreSQL instances. This is done via the prometheus.yml configuration file.
Here’s a sample configuration snippet for scraping:
# prometheus.yml
scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape Rails Application instances
  - job_name: 'rails_app'
    # Targets must be host:port only; the path is set here (/metrics is the default).
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'app_server_1_ip:3000' # Assuming your app runs on port 3000
          - 'app_server_2_ip:3000'
    # If using a reverse proxy like Nginx, you might scrape the proxy's metrics endpoint
    # or configure Prometheus to scrape the app directly.
    # For direct scraping, ensure the /metrics endpoint is accessible.

  # Scrape PostgreSQL instances via pg_exporter
  - job_name: 'postgres'
    static_configs:
      - targets:
          - 'db_node_1_ip:9187'
          - 'db_node_2_ip:9187'
          - 'db_node_3_ip:9187'
      # For PostgreSQL clusters, you might want to add labels to identify primary/replica
      # or specific roles.
      # Example with labels:
      # - targets: ['db_node_1_ip:9187']
      #   labels:
      #     role: 'primary'
      # - targets: ['db_node_2_ip:9187']
      #   labels:
      #     role: 'replica'

  # Example for scraping Nginx status (if used as a reverse proxy)
  # Requires nginx-module-vts or similar
  # - job_name: 'nginx'
  #   static_configs:
  #     - targets: ['app_server_1_ip:8080'] # Assuming Nginx exporter is on port 8080
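After updating prometheus.yml, reload the Prometheus configuration. The HTTP reload endpoint is only available when Prometheus was started with --web.enable-lifecycle; otherwise, send SIGHUP to the Prometheus process instead.
curl -X POST http://localhost:9090/-/reload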
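When the list of Droplets changes often, it can be less error-prone to generate the scrape jobs than to edit them by hand. The sketch below, using only Ruby's standard `yaml` library, builds the `scrape_configs` section from host lists. The `scrape_job` helper is hypothetical, and the host names are the placeholders used in this article. Note that each target is `host:port` only, with the path carried by `metrics_path`.

```ruby
require 'yaml'

# Build one scrape job. Targets are host:port; the path lives in metrics_path.
def scrape_job(name, hosts, port, metrics_path: '/metrics')
  {
    'job_name' => name,
    'metrics_path' => metrics_path,
    'static_configs' => [{ 'targets' => hosts.map { |host| "#{host}:#{port}" } }]
  }
end

config = {
  'scrape_configs' => [
    scrape_job('rails_app', %w[app_server_1_ip app_server_2_ip], 3000),
    scrape_job('postgres', %w[db_node_1_ip db_node_2_ip db_node_3_ip], 9187)
  ]
}

# Serialize to YAML, ready to be written into prometheus.yml.
yaml = config.to_yaml
```

Running `promtool check config prometheus.yml` on the result before reloading catches syntax mistakes early.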
Alerting with Alertmanager
Metrics are only useful if they trigger alerts when something goes wrong. Alertmanager, integrated with Prometheus, handles this. Define alerting rules in Prometheus’s rule files.
Example alerting rule for high Rails application error rate (in a file like rules.yml):
# rules.yml
groups:
  - name: rails_app_alerts
    rules:
      - alert: HighRailsErrorRate
        # `code` is the status label set by prometheus-client's Rack collector;
        # use whatever label name your instrumentation exposes.
        expr: |
          sum(rate(http_requests_total{code=~"5..|401|403"}[5m])) by (instance)
          /
          sum(rate(http_requests_total[5m])) by (instance)
          > 0.05 # More than 5% errors over 5 minutes
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "The Rails application on {{ $labels.instance }} is experiencing a high error rate (over 5%)."
      - alert: HighRequestLatency
        # Mean latency = total seconds spent / total requests, per instance.
        expr: |
          sum(rate(http_request_duration_seconds_sum{job="rails_app"}[5m])) by (instance)
          /
          sum(rate(http_request_duration_seconds_count{job="rails_app"}[5m])) by (instance)
          > 2 # Average request latency exceeds 2 seconds
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "Average request latency on {{ $labels.instance }} is above 2 seconds."
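The error-rate expression divides one per-second rate by another. A small worked example in plain Ruby, with made-up counter readings taken five minutes apart, shows the arithmetic. The `rate` helper is a simplification: the real PromQL `rate()` also handles counter resets and extrapolates to the window edges.

```ruby
# Per-second increase of a counter between two readings (no reset handling).
def rate(first_value, last_value, seconds)
  (last_value - first_value) / seconds.to_f
end

window = 300 # seconds, matching the [5m] range in the rule

# Hypothetical counter readings at the start and end of the window.
error_rate = rate(40, 70, window)         # 30 errors in the window
total_rate = rate(10_000, 10_500, window) # 500 requests in the window

error_ratio = error_rate / total_rate     # 30 / 500, about 0.06
alert_condition = error_ratio > 0.05      # exceeds the 5% threshold
```

Because both rates cover the same window, the window length cancels out and the ratio is simply errors divided by requests.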
Example alerting rule for PostgreSQL replication lag:
# rules.yml (continued)
  - name: postgres_alerts
    rules:
      - alert: PostgreSQLReplicationLagging
        # The role label is not set by the exporter; it must be attached
        # to the replica targets in the Prometheus scrape configuration.
        expr: |
          max_over_time(pg_replication_lag_seconds{role="replica"}[5m]) > 60 # Replication lag is over 60 seconds
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL replication lag detected on {{ $labels.instance }}"
          description: "Replication lag on {{ $labels.instance }} is {{ $value }} seconds, exceeding the 60-second threshold."
      - alert: PostgreSQLHighConnectionCount
        expr: |
          pg_stat_activity_count{state="active"} > 100 # More than 100 active connections
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High PostgreSQL active connection count on {{ $labels.instance }}"
          description: "PostgreSQL on {{ $labels.instance }} has {{ $value }} active connections, exceeding the threshold of 100."
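The `for: 5m` clause in these rules means an alert only fires once its expression has been true continuously for five minutes; until then it is pending, and a single false evaluation resets it. The following plain-Ruby sketch models that behavior. The `AlertState` class is hypothetical and is not how Prometheus is implemented, only a way to reason about the three states.

```ruby
# Tracks how long a condition has held and reports :inactive, :pending, or :firing.
class AlertState
  def initialize(hold_seconds)
    @hold_seconds = hold_seconds
    @active_since = nil
  end

  # condition_true: result of the rule expression at this evaluation.
  # now: evaluation time in seconds.
  def evaluate(condition_true, now)
    unless condition_true
      @active_since = nil # any false evaluation resets the timer
      return :inactive
    end

    @active_since ||= now
    now - @active_since >= @hold_seconds ? :firing : :pending
  end
end

state = AlertState.new(300)
timeline = [
  state.evaluate(true, 0),    # condition just became true
  state.evaluate(true, 300),  # held for the full five minutes
  state.evaluate(false, 360), # condition cleared
  state.evaluate(true, 420)   # true again, timer restarts
]
```

This is why a short spike in replication lag or connections does not page anyone: the condition has to persist across the whole hold period.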
Ensure your prometheus.yml includes these rule files:
# prometheus.yml
rule_files:
  - "rules.yml"
Configure Alertmanager to receive alerts from Prometheus and route them to your desired notification channels (Slack, PagerDuty, email, etc.).
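If none of the built-in notification channels fit, Alertmanager can also POST alerts to a webhook receiver as JSON. The sketch below turns such a payload into one-line messages using only the Ruby standard library. The `format_alerts` helper is hypothetical, the payload is abbreviated to the fields used here, and the instance value is a placeholder.

```ruby
require 'json'

# Turn an Alertmanager webhook payload into human-readable one-line messages.
def format_alerts(payload_json)
  payload = JSON.parse(payload_json)
  payload.fetch('alerts', []).map do |alert|
    labels = alert.fetch('labels', {})
    summary = alert.dig('annotations', 'summary') || labels['alertname']
    status = alert.fetch('status', 'unknown').upcase
    "[#{status}] #{labels['severity']}: #{summary}"
  end
end

# Abbreviated example payload with a single firing alert.
payload = JSON.generate(
  status: 'firing',
  alerts: [
    {
      status: 'firing',
      labels: { alertname: 'HighRailsErrorRate', severity: 'critical',
                instance: 'app_server_1_ip:3000' },
      annotations: { summary: 'High error rate on app_server_1_ip:3000' }
    }
  ]
)
messages = format_alerts(payload)
```

A small Rack endpoint could call a function like this and forward the messages to whatever chat or ticketing system the team already uses.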
Infrastructure Monitoring with DigitalOcean Monitoring and Node Exporter
Beyond application-specific metrics, we need to monitor the underlying DigitalOcean infrastructure. DigitalOcean’s built-in monitoring provides CPU, memory, disk I/O, and network traffic for Droplets. However, for more granular OS-level metrics that Prometheus can scrape, we use node_exporter.
Install node_exporter on each Droplet. Similar to pg_exporter, this often involves downloading a binary or using a package manager.
Run node_exporter. Again, a systemd service is recommended:
# Example systemd service file for node_exporter
# systemd only treats # as a comment at the start of a line, so comments
# sit on their own lines rather than after a value.
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
# Create a dedicated user
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($$|/.*)" \
  --collector.netdev.device-exclude="^(veth|docker|lo)"
Restart=on-failure

[Install]
WantedBy=multi-user.target
Start and enable the service:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
By default, node_exporter listens on port 9100. Add this to your Prometheus configuration:
# prometheus.yml (adding node_exporter)
scrape_configs:
  # ... other jobs ...
  - job_name: 'node'
    static_configs:
      - targets:
          - 'app_server_1_ip:9100'
          - 'app_server_2_ip:9100'
          - 'db_node_1_ip:9100'
          - 'db_node_2_ip:9100'
          - 'db_node_3_ip:9100'
This provides essential OS-level metrics like CPU load, memory usage, disk space, and network I/O, which are crucial for diagnosing performance bottlenecks and capacity planning.
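For capacity planning, a common question is how long until a disk fills up. The sketch below fits a least-squares line through free-space samples and extrapolates to zero, which is similar in spirit to PromQL's `predict_linear`. The `seconds_until_full` helper and the sample readings are hypothetical; real disk usage is rarely perfectly linear, so treat the result as a rough estimate.

```ruby
# samples: array of [timestamp_seconds, available_bytes] pairs, oldest first.
# Returns the estimated seconds from the last sample until free space hits
# zero, or nil if free space is not shrinking.
def seconds_until_full(samples)
  n = samples.size.to_f
  mean_t = samples.sum { |t, _| t } / n
  mean_v = samples.sum { |_, v| v } / n

  covariance = samples.sum { |t, v| (t - mean_t) * (v - mean_v) }
  variance = samples.sum { |t, _| (t - mean_t)**2 }
  return nil if variance.zero?

  slope = covariance / variance # bytes per second
  return nil unless slope.negative?

  intercept = mean_v - slope * mean_t
  zero_at = -intercept / slope  # time at which the fitted line crosses zero
  zero_at - samples.last[0]
end

# Hypothetical readings: 10 GB of free space lost per hour.
readings = [
  [0, 100_000_000_000],
  [3600, 90_000_000_000],
  [7200, 80_000_000_000]
]
remaining = seconds_until_full(readings) # about 28,800 seconds (8 hours)
```

At a steady loss of 10 GB per hour, the remaining 80 GB lasts roughly eight more hours after the last reading.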
Log Aggregation and Analysis
Metrics tell you *what* is happening, but logs tell you *why*. A robust log aggregation strategy is vital. For a production environment, consider solutions like:
- ELK Stack (Elasticsearch, Logstash, Kibana): Powerful but resource-intensive.
- Loki with Promtail and Grafana: A more lightweight, cloud-native approach that integrates well with Prometheus and Grafana.
- Datadog, Splunk, or similar SaaS solutions: Offer managed services but come with recurring costs.
For this setup, let’s assume you’re using Loki. You’ll need to deploy Promtail agents on your Droplets to collect logs and forward them to a Loki instance.
Promtail configuration (promtail-local-config.yaml):
# promtail-local-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://your-loki-instance:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*.log

  - job_name: rails_app_logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: rails
          # Adjust the path to your Rails application's log files
          __path__: /path/to/your/rails/app/log/*.log

  - job_name: postgres_logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: postgres
          # Adjust the path to your PostgreSQL log files
          __path__: /var/log/postgresql/*.log
Run Promtail as a systemd service. Once configured, Promtail will tail your log files and send them to Loki. You can then query and visualize these logs in Grafana, correlating them with your Prometheus metrics.
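Logs are easier to filter once they are structured. As a minimal sketch, assuming you are free to change the log format, the following uses Ruby's standard `Logger` with a JSON formatter so that each line is a self-contained JSON object (which Loki's LogQL `json` parser can then split into fields). The `json_logger` helper and the field names are hypothetical.

```ruby
require 'json'
require 'logger'
require 'stringio'
require 'time'

# Build a logger that writes one JSON object per line to the given IO.
def json_logger(io)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, progname, msg|
    JSON.generate(
      ts: time.utc.iso8601(3), # millisecond-precision UTC timestamp
      level: severity,
      app: progname,
      message: msg.to_s
    ) + "\n"
  end
  logger
end

# Usage: write to an in-memory buffer here; in production this would be a log file.
io = StringIO.new
log = json_logger(io)
log.progname = 'rails'
log.error('PG::ConnectionBad: could not connect to server')
line = JSON.parse(io.string.lines.last)
```

In a Rails app the same formatter could be assigned to `config.log_formatter`, though dedicated gems exist for this and may be a better fit.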
Health Checks and Synthetic Monitoring
Beyond passive monitoring, active health checks and synthetic monitoring ensure your application is not only running but also performing as expected from an external perspective. This can involve:
- HTTP Health Checks: A simple endpoint (e.g., /health) on your Rails app that returns 200 OK if the app is healthy. This can be checked by load balancers or external monitoring tools.
- Synthetic Transactions: Tools like blackbox_exporter (for Prometheus) or external services can simulate user actions (e.g., logging in, performing a search) to verify end-to-end functionality.
For a basic HTTP health check in Rails, you could add a route and controller action:
# config/routes.rb
get '/health', to: 'health#show'

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  # If using Devise: skip authentication so probes can reach the endpoint.
  # raise: false avoids an error when the callback is not defined.
  skip_before_action :authenticate_user!, raise: false

  def show
    # Verify database connectivity; add cache or queue checks as needed.
    ActiveRecord::Base.connection.execute('SELECT 1')
    render json: { status: 'ok' }, status: :ok
  rescue StandardError => e
    render json: { status: 'error', message: e.message }, status: :internal_server_error
  end
end
This /health endpoint can then be monitored by DigitalOcean’s load balancer health checks or by an external monitoring service.
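As the number of dependencies grows, it helps to run each check separately and report which one failed. The sketch below is a framework-free illustration of that idea in plain Ruby. The `run_health_checks` helper and the check names are hypothetical, and the lambdas stand in for real probes such as a database query or a cache ping.

```ruby
require 'json'

# Run named checks and return [http_status, json_body].
# A check passes if its callable returns a truthy value without raising.
def run_health_checks(checks)
  results = checks.to_h do |name, check|
    begin
      [name, check.call ? 'ok' : 'failed']
    rescue StandardError => e
      [name, "error: #{e.message}"]
    end
  end

  healthy = results.values.all? { |result| result == 'ok' }
  body = JSON.generate(status: healthy ? 'ok' : 'error', checks: results)
  [healthy ? 200 : 503, body]
end

# Stand-in checks: the database responds, the cache times out.
checks = {
  database: -> { true },
  cache: -> { raise 'timeout' }
}
status, body = run_health_checks(checks)
report = JSON.parse(body)
```

Returning 503 rather than 200 with an error body matters because load balancer health checks typically look only at the status code.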
Putting It All Together: Grafana for Visualization
Grafana is the perfect tool to bring all these metrics and logs together into a unified dashboard. Configure Grafana to use Prometheus and Loki as data sources.
Create dashboards that visualize:
- Rails App Performance: Request latency, error rates (5xx, 4xx), throughput (requests per second).
- PostgreSQL Cluster Health: Replication lag, active connections, query performance (if pg_stat_statements is enabled), disk usage, CPU/memory on DB nodes.
- Infrastructure Metrics: CPU utilization, memory usage, disk I/O, network traffic for all Droplets.
- Log Trends: Error counts over time, specific error messages, log volume.
By correlating these different data sources in Grafana, you gain a holistic view of your system’s health, enabling faster troubleshooting and proactive maintenance.