Server Monitoring Best Practices: Keeping Your Ruby App and PostgreSQL Clusters Alive on DigitalOcean

Establishing Core Metrics for Ruby on Rails Applications

Effective server monitoring for a Ruby on Rails application on DigitalOcean hinges on a multi-layered approach. We need to track not just the underlying infrastructure but also the application’s performance and health from an end-user perspective. For the Rails app itself, key metrics include request latency, error rates, and throughput. These provide immediate insight into application responsiveness and stability.
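To make these three metrics concrete, here is a minimal Ruby sketch that computes throughput, error rate, and 95th-percentile latency from a handful of request records. The Request struct, the summarize helper, and the sample data are all hypothetical and exist only for illustration; in production these figures come from your metrics backend rather than from application code.

```ruby
# Each record is one finished request: HTTP status and duration in seconds.
# The struct, helper name, and sample data are invented for illustration.
Request = Struct.new(:status, :duration)

def summarize(requests, window_seconds)
  durations = requests.map(&:duration).sort
  errors = requests.count { |r| r.status >= 500 }
  # Nearest-rank percentile: the smallest sample that at least 95% of samples do not exceed.
  p95_index = (0.95 * durations.size).ceil - 1
  {
    throughput_rps: requests.size.to_f / window_seconds,
    error_rate: errors.to_f / requests.size,
    p95_latency: durations[p95_index]
  }
end

sample = [
  Request.new(200, 0.12), Request.new(200, 0.08), Request.new(500, 1.40),
  Request.new(200, 0.20), Request.new(404, 0.05), Request.new(200, 0.10),
  Request.new(200, 0.30), Request.new(503, 2.10), Request.new(200, 0.09),
  Request.new(200, 0.11)
]

# Ten requests observed over a 5-second window.
puts summarize(sample, 5)
```

Note how two slow failures dominate the p95 figure while barely moving an average, which is why percentile latency is usually the better alerting signal.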

Instrumenting Your Rails Application with Prometheus

Prometheus is a de facto standard for application-level metrics. For Rails, the official Ruby client, the prometheus-client gem, is a solid choice. It ships Rack middleware that records per-request metrics and exposes them, along with any custom metrics you register, for a Prometheus server to scrape.

First, add the gem to your Gemfile:

# Gemfile
gem 'prometheus-client'

Next, configure the client. A common place for this is an initializer file, e.g., config/initializers/prometheus.rb:

# config/initializers/prometheus.rb
require 'prometheus/client'

# Send the client's own log output to the Rails logger
Prometheus::Client.config.logger = Rails.logger

# The request counter and latency histogram used in this article are
# registered by the collector middleware (see config/application.rb),
# so do not register metrics with the same names here.
# Custom metrics belong on the default registry: Prometheus::Client.registry

Then add the gem's middleware to your Rails application’s middleware stack. In config/application.rb:

# config/application.rb
# ... existing requires and Bundler.require ...
require 'prometheus/middleware/collector'
require 'prometheus/middleware/exporter'

module YourApp
  class Application < Rails::Application
    # ... other configurations ...

    # Collector records a request counter and a latency histogram.
    # metrics_prefix: 'http' names them http_requests_total and
    # http_request_duration_seconds (the default prefix is http_server).
    config.middleware.use Prometheus::Middleware::Collector, metrics_prefix: 'http'

    # Exporter serves everything in the default registry at /metrics.
    config.middleware.use Prometheus::Middleware::Exporter

    # ... other configurations ...
  end
end

This setup exposes metrics at a /metrics endpoint, which your Prometheus server can then scrape. The collector labels the request counter with code (the HTTP status), method, and path. For more granular insights, such as per-controller or per-action breakdowns, register custom metrics on the registry.
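To show what a request-metrics middleware does under the hood, here is a dependency-free sketch of the Rack pattern: wrap the downstream app, time the call, and record the outcome. RequestStats and TimingMiddleware are hypothetical names written for this article, not classes from any gem, and a real client library additionally handles histogram buckets and multi-process servers.

```ruby
# A dependency-free sketch of a metrics-collecting Rack middleware.
# RequestStats and TimingMiddleware are hypothetical names, not part of any gem.
class RequestStats
  attr_reader :counts, :duration_sums

  def initialize
    @counts = Hash.new(0)          # {[method, code] => number of requests}
    @duration_sums = Hash.new(0.0) # {method => total seconds spent}
    @lock = Mutex.new              # threaded app servers call record concurrently
  end

  def record(method, code, seconds)
    @lock.synchronize do
      @counts[[method, code]] += 1
      @duration_sums[method] += seconds
    end
  end
end

class TimingMiddleware
  def initialize(app, stats)
    @app = app
    @stats = stats
  end

  # Rack invokes call once per request with the request environment hash.
  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @stats.record(env['REQUEST_METHOD'], status.to_s, elapsed)
    [status, headers, body]
  end
end

# Exercise the middleware with a stand-in app instead of a real Rails stack.
stats = RequestStats.new
app = ->(_env) { [200, { 'content-type' => 'text/plain' }, ['ok']] }
stack = TimingMiddleware.new(app, stats)
3.times { stack.call('REQUEST_METHOD' => 'GET') }
puts stats.counts.inspect
```

The monotonic clock is used instead of Time.now so that system clock adjustments cannot produce negative or inflated durations.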

Monitoring PostgreSQL Clusters with pg_exporter

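For PostgreSQL, we need to monitor database performance, connection pooling, replication status, and resource utilization. pg_exporter is a Prometheus exporter for PostgreSQL that provides a comprehensive set of metrics. The port and metric names used in this article (9187, pg_stat_activity_count, pg_replication_lag_seconds) match those of the Prometheus community's postgres_exporter; configuration options differ between exporters, so verify the file and flag names shown here against the documentation of the one you install.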

First, install pg_exporter on your database nodes or a dedicated monitoring server. The installation process varies by OS, but often involves downloading a binary or using a package manager.

Next, configure pg_exporter to connect to your PostgreSQL instances. This typically involves a .pgpass file for authentication and a configuration file for pg_exporter itself.

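Create a .pgpass file in the home directory of the user running pg_exporter, and restrict it to mode 0600 (chmod 600 ~/.pgpass); libpq ignores a password file that is readable by group or others: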

# ~/.pgpass
hostname:port:database:username:password
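If you generate this file from a provisioning script, two details are easy to miss: a literal colon or backslash inside a field must be backslash-escaped, and the file must end up with mode 0600. The following Ruby sketch handles both; pgpass_line and write_pgpass are hypothetical helper names, and the credentials shown are placeholders.

```ruby
# Formats one ~/.pgpass entry. A literal ':' or '\' inside a field
# must be escaped with a backslash. Helper names are hypothetical.
def pgpass_line(host:, port:, database:, username:, password:)
  [host, port, database, username, password]
    .map { |field| field.to_s.gsub(/[\\:]/) { |c| "\\#{c}" } }
    .join(':')
end

# Writes the entries and restricts the file to its owner.
def write_pgpass(path, lines)
  File.write(path, lines.join("\n") + "\n")
  File.chmod(0o600, path)
end

# Placeholder credentials for illustration only.
puts pgpass_line(host: 'your_db_host', port: 5432, database: 'postgres',
                 username: 'monitor_user', password: 'pa:ss')
```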

Then, create a configuration file for pg_exporter (e.g., pg_exporter.yml). This file specifies which collectors to enable and connection string details.

# pg_exporter.yml
log_level: info
web_listen_address: "0.0.0.0:9187" # Default pg_exporter port
collectors:
  - pg_stat_activity
  - pg_stat_database
  - pg_stat_replication
  - pg_stat_statements # Requires pg_stat_statements extension enabled in PostgreSQL
  - pg_locks
  - pg_settings
  - pg_replication_slots
  - pg_database_size
  - pg_postmaster_start_time
  - pg_stat_bgwriter
  - pg_stat_user_tables
  - pg_stat_user_indexes

# Connection settings for a single PostgreSQL instance. For several instances,
# run one pg_exporter per instance or see the pg_exporter documentation for
# multi-database and other advanced configurations.
# The connection string format is typically:
# postgresql://user:password@host:port/database?sslmode=disable
# It can also be supplied through an environment variable: PG_EXPORTER_CONNSTR
# NOTE: Do not store passwords in this file in production; use environment
# variables or secrets management. For demonstration only:
# connstr: "postgresql://monitor_user:your_password@your_db_host:5432/postgres?sslmode=disable"
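One way to keep the password out of the config file is to assemble the connection string at deploy time from values your secrets tooling injects. The Ruby sketch below assumes hypothetical PGMON_* variable names, and it percent-encodes the user and password so that characters such as @, : and / cannot break the URI. It defaults to sslmode=require, which is an assumption on my part and differs from the sslmode=disable shown in the demonstration string.

```ruby
require 'erb'

# Builds a PostgreSQL connection URI from environment-style settings.
# The PGMON_* names are hypothetical; use whatever your secrets tooling provides.
def monitoring_connstr(env)
  user = ERB::Util.url_encode(env.fetch('PGMON_USER'))
  password = ERB::Util.url_encode(env.fetch('PGMON_PASSWORD'))
  host = env.fetch('PGMON_HOST', 'localhost')
  port = env.fetch('PGMON_PORT', '5432')
  database = env.fetch('PGMON_DATABASE', 'postgres')
  sslmode = env.fetch('PGMON_SSLMODE', 'require')
  "postgresql://#{user}:#{password}@#{host}:#{port}/#{database}?sslmode=#{sslmode}"
end

# A plain Hash stands in for ENV here; ENV responds to fetch in the same way.
settings = {
  'PGMON_USER' => 'monitor_user',
  'PGMON_PASSWORD' => 'p@ss:word/1',
  'PGMON_HOST' => 'your_db_host'
}
puts monitoring_connstr(settings)
```

Because fetch raises KeyError when a required variable is missing, a misconfigured deployment fails at startup instead of connecting with an empty password.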

Start pg_exporter. It’s recommended to run this as a systemd service for robustness.

# Example systemd service file (e.g., /etc/systemd/system/pg_exporter.service)
[Unit]
Description=PostgreSQL Exporter
Wants=network-online.target
After=network-online.target

[Service]
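# Run as postgres, or as whichever user owns the .pgpass file.
# systemd does not support trailing comments on the same line as a setting.
User=postgres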
Group=postgres
Type=simple
ExecStart=/usr/local/bin/pg_exporter --config.file=/etc/pg_exporter/pg_exporter.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then, enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable pg_exporter
sudo systemctl start pg_exporter

This will expose PostgreSQL metrics on port 9187, ready for Prometheus to scrape.

Configuring Prometheus for Scraping

Your Prometheus server needs to be configured to discover and scrape metrics from your Rails application and PostgreSQL instances. This is done via the prometheus.yml configuration file.

Here’s a sample configuration snippet for scraping:

# prometheus.yml

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape Rails Application instances
  - job_name: 'rails_app'
    metrics_path: /metrics # The default path, shown for clarity
    static_configs:
      - targets: # host:port only; the URL path is set by metrics_path
          - 'app_server_1_ip:3000' # Assuming your app runs on port 3000
          - 'app_server_2_ip:3000'
    # If using a reverse proxy like Nginx, you might scrape the proxy's metrics endpoint
    # or configure Prometheus to scrape the app directly.
    # For direct scraping, ensure the /metrics endpoint is accessible.

  # Scrape PostgreSQL instances via pg_exporter
  - job_name: 'postgres'
    static_configs:
      - targets:
          - 'db_node_1_ip:9187'
          - 'db_node_2_ip:9187'
          - 'db_node_3_ip:9187'
    # For PostgreSQL clusters, you might want to add labels to identify primary/replica
    # or specific roles.
    # Example with labels:
    # - targets: ['db_node_1_ip:9187']
    #   labels:
    #     role: 'primary'
    # - targets: ['db_node_2_ip:9187']
    #   labels:
    #     role: 'replica'

  # Example for scraping Nginx status (if used as a reverse proxy)
  # Requires nginx-module-vts or similar
  # - job_name: 'nginx'
  #   static_configs:
  #     - targets: ['app_server_1_ip:8080'] # Assuming Nginx exporter is on port 8080

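After updating prometheus.yml, reload the Prometheus configuration. The HTTP reload endpoint is only available when Prometheus was started with the --web.enable-lifecycle flag; otherwise, send the Prometheus process a SIGHUP instead: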

curl -X POST http://localhost:9090/-/reload
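When the list of Droplets changes often, generating the scrape jobs from a host inventory avoids hand-editing mistakes. The Ruby sketch below is one way to do that with the standard library's YAML support; scrape_job is a hypothetical helper, and the host names are the placeholders used in this article. Note that each target is a host:port pair only, with the URL path carried separately in metrics_path.

```ruby
require 'yaml'

# Builds one scrape job. Targets are host:port pairs only;
# the URL path goes in metrics_path. scrape_job is a hypothetical helper.
def scrape_job(name, hosts, port, metrics_path: '/metrics')
  {
    'job_name' => name,
    'metrics_path' => metrics_path,
    'static_configs' => [{ 'targets' => hosts.map { |h| "#{h}:#{port}" } }]
  }
end

config = {
  'scrape_configs' => [
    scrape_job('rails_app', %w[app_server_1_ip app_server_2_ip], 3000),
    scrape_job('postgres', %w[db_node_1_ip db_node_2_ip db_node_3_ip], 9187)
  ]
}

# Print the fragment; redirect it into prometheus.yml or a file it includes.
puts config.to_yaml
```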

Alerting with Alertmanager

Metrics are only useful if they trigger alerts when something goes wrong. Prometheus evaluates the alerting rules defined in its rule files and sends firing alerts to Alertmanager, which deduplicates, groups, and routes them to your notification channels.

Example alerting rule for high Rails application error rate (in a file like rules.yml):

# rules.yml
groups:
- name: rails_app_alerts
  rules:
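  # Assumes the request counter carries a `code` label holding the HTTP status;
  # adjust the label name to match your instrumentation.
  - alert: HighRailsErrorRate
    expr: |
      sum(rate(http_requests_total{code=~"5..|401|403"}[5m])) by (instance)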
      /
      sum(rate(http_requests_total[5m])) by (instance)
      > 0.05 # More than 5% errors over 5 minutes
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate on {{ $labels.instance }}"
      description: "The Rails application on {{ $labels.instance }} is experiencing a high error rate (over 5%)."

  - alert: HighRequestLatency
    expr: |
      avg(rate(http_request_duration_seconds_sum{job="rails_app"}[5m])) by (instance)
      /
      avg(rate(http_request_duration_seconds_count{job="rails_app"}[5m])) by (instance)
      > 2 # Average request latency exceeds 2 seconds
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "Average request latency on {{ $labels.instance }} is above 2 seconds."
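The error-rate expression divides one per-second rate by another, so the result is the fraction of requests that failed during the window. The Ruby sketch below works through that arithmetic with invented counter readings taken five minutes apart. It is a simplification: the real rate() function also compensates for counter resets and extrapolates to the window edges, which this sketch ignores.

```ruby
# Counter readings taken 5 minutes (300 s) apart; the numbers are invented.
# Unlike PromQL's rate(), this ignores counter resets and extrapolation.
def per_second_rate(first_value, last_value, window_seconds)
  (last_value - first_value).to_f / window_seconds
end

def error_ratio(errors_start, errors_end, total_start, total_end, window_seconds = 300)
  error_rate = per_second_rate(errors_start, errors_end, window_seconds)
  total_rate = per_second_rate(total_start, total_end, window_seconds)
  return 0.0 if total_rate.zero? # no traffic means nothing to alert on

  error_rate / total_rate
end

# 90 new errors out of 1,200 new requests in the window.
ratio = error_ratio(40, 130, 10_000, 11_200)
puts format('error ratio: %.3f, breaches 5%% threshold: %s', ratio, ratio > 0.05)
```

Because both counters only ever increase, the absolute values (40 or 10,000) do not matter; only the growth within the window does.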

Example alerting rule for PostgreSQL replication lag:

# rules.yml (continued)
- name: postgres_alerts
  rules:
  - alert: PostgreSQLReplicationLagging
    expr: |
      max_over_time(pg_replication_lag_seconds{role="replica"}[5m]) > 60 # Replication lag is over 60 seconds
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "PostgreSQL replication lag detected on {{ $labels.instance }}"
      description: "Replication lag on {{ $labels.instance }} is {{ $value }} seconds, exceeding the 60-second threshold."

  - alert: PostgreSQLHighConnectionCount
    expr: |
      pg_stat_activity_count{state="active"} > 100 # More than 100 active connections
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High PostgreSQL active connection count on {{ $labels.instance }}"
      description: "PostgreSQL on {{ $labels.instance }} has {{ $value }} active connections, exceeding the threshold of 100."
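The for: clause in these rules is what keeps a brief spike from paging anyone: the condition has to hold on every evaluation for the whole duration before the alert fires. The Ruby sketch below models that pending-to-firing behavior for a 60-second lag threshold. The alert_states helper, the one-minute evaluation interval, and the sample data are all assumptions made for illustration, and the sketch compares raw samples rather than reproducing max_over_time.

```ruby
# Simplified model of an alert's `for:` clause: the condition must hold on
# every evaluation for the full duration before pending becomes firing.
# alert_states is a hypothetical helper, not part of Prometheus.
def alert_states(samples, threshold:, for_seconds:, interval_seconds:)
  breached_since = nil
  samples.each_with_index.map do |value, i|
    now = i * interval_seconds
    if value > threshold
      breached_since ||= now
      now - breached_since >= for_seconds ? :firing : :pending
    else
      breached_since = nil # any healthy evaluation resets the timer
      :inactive
    end
  end
end

# Replication lag in seconds, sampled once a minute (invented data).
lag = [10, 75, 80, 20, 90, 95, 99, 120, 130, 140]
p alert_states(lag, threshold: 60, for_seconds: 300, interval_seconds: 60)
```

In this data the first breach lasts only two evaluations and never fires, while the second one fires once it has persisted for five minutes.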

Ensure your prometheus.yml includes these rule files:

# prometheus.yml
rule_files:
  - "rules.yml"

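Finally, list your Alertmanager instance under the alerting section of prometheus.yml (Alertmanager listens on port 9093 by default) so that Prometheus knows where to send alerts, then configure Alertmanager's routes and receivers for your desired notification channels (Slack, PagerDuty, email, etc.).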

Infrastructure Monitoring with DigitalOcean Monitoring and Node Exporter

Beyond application-specific metrics, we need to monitor the underlying DigitalOcean infrastructure. DigitalOcean’s built-in monitoring provides CPU, memory, disk I/O, and network traffic for Droplets. However, for more granular OS-level metrics that Prometheus can scrape, we use node_exporter.

Install node_exporter on each Droplet. Similar to pg_exporter, this often involves downloading a binary or using a package manager.

Run node_exporter. Again, a systemd service is recommended:

# Example systemd service file for node_exporter
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
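# Create a dedicated, unprivileged node_exporter user for this service.
# systemd does not support trailing comments on the same line as a setting.
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($$|/.*)" --collector.netdev.device-exclude="^(veth.*|docker.*|lo)$$"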
Restart=on-failure

[Install]
WantedBy=multi-user.target

Start and enable the service:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

By default, node_exporter listens on port 9100. Add this to your Prometheus configuration:

# prometheus.yml (adding node_exporter)
scrape_configs:
  # ... other jobs ...

  - job_name: 'node'
    static_configs:
      - targets:
          - 'app_server_1_ip:9100'
          - 'app_server_2_ip:9100'
          - 'db_node_1_ip:9100'
          - 'db_node_2_ip:9100'
          - 'db_node_3_ip:9100'

This provides essential OS-level metrics like CPU load, memory usage, disk space, and network I/O, which are crucial for diagnosing performance bottlenecks and capacity planning.
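Exporters serve these metrics in Prometheus' plain-text exposition format, which is simple enough to inspect by hand when debugging. As an illustration, the Ruby sketch below parses a small sample and derives disk usage from node_filesystem_size_bytes and node_filesystem_avail_bytes. The parse_metrics and disk_used_percent helpers are hypothetical, the sample values are invented, and the parser deliberately ignores edge cases such as escaped quotes in label values.

```ruby
# Parses sample lines of the Prometheus text exposition format into
# [name, labels, value] triples. Comment lines (# HELP / # TYPE) are skipped.
# Sketch only: escaped quotes in label values are not handled, and
# non-numeric values such as NaN are dropped.
def parse_metrics(text)
  text.each_line.filter_map do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#')

    match = line.match(/\A([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/)
    next unless match

    value = Float(match[3], exception: false)
    next unless value

    labels = (match[2] || '').scan(/(\w+)="([^"]*)"/).to_h
    [match[1], labels, value]
  end
end

def disk_used_percent(samples, mountpoint)
  find = lambda do |name|
    samples.find { |n, labels, _| n == name && labels['mountpoint'] == mountpoint }&.last
  end
  size = find.call('node_filesystem_size_bytes')
  avail = find.call('node_filesystem_avail_bytes')
  return nil unless size && avail && size.positive?

  (100.0 * (size - avail) / size).round(1)
end

# Invented sample in the shape node_exporter serves on port 9100.
SAMPLE = <<~METRICS
  # HELP node_filesystem_size_bytes Filesystem size in bytes.
  # TYPE node_filesystem_size_bytes gauge
  node_filesystem_size_bytes{device="/dev/vda1",fstype="ext4",mountpoint="/"} 5.0e+10
  node_filesystem_avail_bytes{device="/dev/vda1",fstype="ext4",mountpoint="/"} 1.25e+10
METRICS

puts disk_used_percent(parse_metrics(SAMPLE), '/')
```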

Log Aggregation and Analysis

Metrics tell you *what* is happening, but logs tell you *why*. A robust log aggregation strategy is vital. For a production environment, consider solutions like:

ELK Stack (Elasticsearch, Logstash, Kibana): Powerful but resource-intensive.
Loki with Promtail and Grafana: A more lightweight, cloud-native approach that integrates well with Prometheus and Grafana.
Datadog, Splunk, or similar SaaS solutions: Offer managed services but come with recurring costs.

For this setup, let’s assume you’re using Loki. You’ll need to deploy Promtail agents on your Droplets to collect logs and forward them to a Loki instance.

Promtail configuration (promtail-local-config.yaml):

# promtail-local-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
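  # /tmp is used here for brevity; in production, point this at a persistent
  # directory so Promtail keeps its read offsets across reboots.
  filename: /tmp/positions.yaml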

clients:
  - url: http://your-loki-instance:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*.log

  - job_name: rails_app_logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: rails
          # Adjust the path to your Rails application's log files
          __path__: /path/to/your/rails/app/log/*.log

  - job_name: postgres_logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: postgres
          # Adjust the path to your PostgreSQL log files
          __path__: /var/log/postgresql/*.log

Run Promtail as a systemd service. Once configured, Promtail will tail your log files and send them to Loki. You can then query and visualize these logs in Grafana, correlating them with your Prometheus metrics.
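Logs are much easier to filter in Loki when each line is a structured record, since LogQL's json parser can then extract fields such as the level. As a sketch of that idea, the Ruby snippet below configures a standard-library Logger to emit one JSON object per line. The json_logger helper and the field names are assumptions for illustration; Rails applications often achieve the same result with a logging gem instead.

```ruby
require 'json'
require 'logger'
require 'stringio'
require 'time'

# Returns a Logger that writes one JSON object per line.
# json_logger and the field names are hypothetical choices.
def json_logger(io)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, progname, message|
    JSON.generate(
      level: severity,
      time: time.utc.iso8601(3),
      app: progname,
      message: message.to_s
    ) + "\n"
  end
  logger
end

# Write to an in-memory buffer here; a real app would log to a file Promtail tails.
buffer = StringIO.new
log = json_logger(buffer)
log.progname = 'rails'
log.error('PG::ConnectionBad: could not connect to server')
puts buffer.string
```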

Health Checks and Synthetic Monitoring

Beyond passive monitoring, active health checks and synthetic monitoring ensure your application is not only running but also performing as expected from an external perspective. This can involve:

HTTP Health Checks: A simple endpoint (e.g., /health) on your Rails app that returns 200 OK if the app is healthy. This can be checked by load balancers or external monitoring tools.
Synthetic Transactions: Tools like blackbox_exporter (for Prometheus) or external services can simulate user actions (e.g., logging in, performing a search) to verify end-to-end functionality.

For a basic HTTP health check in Rails, you could add a route and controller action:

# config/routes.rb
get '/health', to: 'health#show'

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  # If using Devise: skip authentication for this endpoint.
  # raise: false avoids an error when the callback is not defined.
  skip_before_action :authenticate_user!, raise: false

  def show
    # Verify database connectivity; add cache or queue checks here as needed.
    ActiveRecord::Base.connection.execute('SELECT 1')
    render json: { status: 'ok' }, status: :ok
  rescue StandardError => e
    render json: { status: 'error', message: e.message }, status: :internal_server_error
  end
end

This /health endpoint can then be monitored by DigitalOcean’s load balancer health checks or by an external monitoring service.
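As the number of dependencies grows, it helps to run each check independently and report which one failed. The framework-free Ruby sketch below shows one way to aggregate named checks into a status code and a response body. The run_health_checks helper and the stand-in checks are hypothetical, and returning 503 for an unhealthy result is a design choice of this sketch rather than a requirement.

```ruby
# Runs named checks and reduces them to one overall status.
# Each check is a callable returning truthy when healthy; exceptions count
# as failures. run_health_checks is a hypothetical helper.
def run_health_checks(checks)
  results = checks.to_h do |name, check|
    ok = begin
      check.call ? true : false
    rescue StandardError
      false
    end
    [name, ok ? 'ok' : 'error']
  end
  healthy = results.values.all?('ok')
  {
    http_status: healthy ? 200 : 503,
    body: { status: healthy ? 'ok' : 'error', checks: results }
  }
end

# Stand-in checks; in a Rails app these would query the database, cache, etc.
checks = {
  database: -> { true },
  cache: -> { raise IOError, 'connection refused' }
}
p run_health_checks(checks)
```

In a controller, the returned hash maps directly onto render json: result[:body], status: result[:http_status].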

Putting It All Together: Grafana for Visualization

Grafana is the perfect tool to bring all these metrics and logs together into a unified dashboard. Configure Grafana to use Prometheus and Loki as data sources.

Create dashboards that visualize:

Rails App Performance: Request latency, error rates (5xx, 4xx), throughput (requests per second).
PostgreSQL Cluster Health: Replication lag, active connections, query performance (if pg_stat_statements is enabled), disk usage, CPU/memory on DB nodes.
Infrastructure Metrics: CPU utilization, memory usage, disk I/O, network traffic for all Droplets.
Log Trends: Error counts over time, specific error messages, log volume.

By correlating these different data sources in Grafana, you gain a holistic view of your system’s health, enabling faster troubleshooting and proactive maintenance.