Server Monitoring Best Practices: Keeping Your Ruby App and PostgreSQL Clusters Alive on Linode

Proactive PostgreSQL Monitoring with pg_stat_statements

Maintaining the health and performance of your PostgreSQL clusters is paramount, especially when they underpin critical Ruby applications. Relying solely on basic CPU and memory metrics is insufficient. A deeper dive into query performance is essential for identifying bottlenecks and preventing outages. The pg_stat_statements extension is your first line of defense.

First, ensure the extension is enabled in your postgresql.conf. You’ll need to add or modify these lines:

shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000
pg_stat_statements.save = off

After restarting your PostgreSQL server, you can create the extension within your database:

CREATE EXTENSION pg_stat_statements;

Now, you can query the pg_stat_statements view to identify the most resource-intensive queries. Focus on metrics like total_time (cumulative time spent executing the statement) and calls (number of times the statement was executed). A high total_time coupled with a moderate to high calls count often indicates a query that needs optimization.

SELECT
    query,
    calls,
    total_time,
    rows,
    mean_time,
    stddev_time
FROM
    pg_stat_statements
ORDER BY
    total_time DESC
LIMIT 20;

To automate this, we can create a simple Ruby script that connects to PostgreSQL, fetches this data, and sends alerts if certain thresholds are breached. This script can be run periodically via cron.

require 'pg'
require 'json'

# Configuration
DB_HOST = ENV.fetch('DB_HOST', 'localhost')
DB_PORT = ENV.fetch('DB_PORT', '5432')
DB_NAME = ENV.fetch('DB_NAME', 'your_db')
DB_USER = ENV.fetch('DB_USER', 'postgres')
DB_PASSWORD = ENV.fetch('DB_PASSWORD', '')

# Thresholds (in seconds)
TOTAL_TIME_THRESHOLD = 600 # e.g., 10 minutes of cumulative execution time
CALLS_THRESHOLD = 10000 # e.g., 10,000 executions

def monitor_pg_stats
  conn = PG.connect(host: DB_HOST, port: DB_PORT, dbname: DB_NAME, user: DB_USER, password: DB_PASSWORD)
  
  begin
    result = conn.exec(%{
      SELECT
          query,
          calls,
          total_time,
          rows,
          mean_time,
          stddev_time
      FROM
          pg_stat_statements
      ORDER BY
          total_time DESC
      LIMIT 50;
    })

    alerts = []
    result.each do |row|
      total_time_sec = row['total_time'].to_f / 1000.0 # pg_stat_statements reports in milliseconds
      calls = row['calls'].to_i

      if total_time_sec > TOTAL_TIME_THRESHOLD && calls > 100 # Only alert if it's also called a reasonable amount
        alerts << {
          query: row['query'],
          calls: calls,
          total_time_sec: total_time_sec,
          mean_time_sec: row['mean_time'].to_f / 1000.0,
          stddev_time_sec: row['stddev_time'].to_f / 1000.0
        }
      end
    end

    if !alerts.empty?
      puts "ALERT: High-performance queries detected!"
      puts JSON.pretty_generate(alerts)
      # In a real-world scenario, send this to an alerting system (e.g., PagerDuty, Slack webhook)
      # send_alert_to_slack(alerts)
    else
      puts "PostgreSQL query performance looks good."
    end

  rescue PG::Error => e
    $stderr.puts "Database error: #{e.message}"
    # Trigger an alert for database connection issues
    # send_alert_to_slack("PostgreSQL connection error: #{e.message}")
  ensure
    conn.close if conn
  end
end

# Placeholder for actual alerting mechanism
# def send_alert_to_slack(alerts_data)
#   # Implement Slack integration here
# end

monitor_pg_stats

Schedule this script using cron. For example, to run it every 5 minutes:

*/5 * * * * /usr/bin/env ruby /path/to/your/monitor_script.rb >> /var/log/pg_monitor.log 2>&1

System-Level Monitoring with Prometheus and Node Exporter

While application-specific metrics are crucial, robust system-level monitoring is non-negotiable. Prometheus, coupled with Node Exporter, provides a powerful, time-series-based monitoring solution for your Linode instances. This setup allows you to collect a wide array of system metrics and visualize them in Grafana.

First, install Node Exporter on each of your Linode servers. The easiest way is to download the pre-compiled binary.

# On each Linode server
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
rm -rf node_exporter-1.7.0.linux-amd64*

Next, set up Node Exporter to run as a systemd service for automatic startup and management.

# Create a systemd service file
sudo nano /etc/systemd/system/node_exporter.service

# Paste the following content:
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter

Node Exporter will now listen on port 9100. You can verify this by curling its metrics endpoint:

curl http://localhost:9100/metrics

On your Prometheus server, configure it to scrape these Node Exporter endpoints. Edit your prometheus.yml:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['linode1_ip:9100', 'linode2_ip:9100', 'linode3_ip:9100'] # Replace with your Linode IPs
        labels:
          instance: 'linode1' # Or use hostname
      - targets: ['linode2_ip:9100']
        labels:
          instance: 'linode2'
      - targets: ['linode3_ip:9100']
        labels:
          instance: 'linode3'

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres_master_ip:9187', 'postgres_replica1_ip:9187'] # Assuming pg_exporter on port 9187
        labels:
          instance: 'postgres_master'
      - targets: ['postgres_replica1_ip:9187']
        labels:
          instance: 'postgres_replica1'

Restart Prometheus for the changes to take effect. You should now see your Linode instances and PostgreSQL instances (if you’re using a PostgreSQL exporter like pg_exporter) appearing under the “Targets” page in the Prometheus UI.

Application-Level Metrics with Prometheus Client Libraries

To gain visibility into your Ruby application’s behavior, integrate Prometheus client libraries. This allows you to expose custom metrics directly from your application, such as request latency, error rates, and queue depths.

Add the prometheus-client gem to your application’s Gemfile:

gem 'prometheus-client', require: 'prometheus_client/middleware'

Then, initialize the client and register your metrics. For a Rails application, you can do this in an initializer:

# config/initializers/prometheus_client.rb
require 'prometheus_client/middleware'
require 'prometheus_client/registry'

# Initialize the default registry
PrometheusClient.configure do |config|
  config.initial_value_calculators = PrometheusClient::ValueCalculators::Simple.new
  config.milliseconds_to_seconds = true
end

# Define custom metrics
# Example: HTTP Request Counter
$http_requests_total = PrometheusClient::Counter.new(
  name: 'http_requests_total',
  documentation: 'Total number of HTTP requests',
  labels: [:method, :path, :status]
)
PrometheusClient.register($http_requests_total)

# Example: HTTP Request Duration Histogram
$http_request_duration_seconds = PrometheusClient::Histogram.new(
  name: 'http_request_duration_seconds',
  documentation: 'HTTP request duration in seconds',
  labels: [:method, :path, :status]
)
PrometheusClient.register($http_request_duration_seconds)

# Example: Background Job Counter
$background_jobs_total = PrometheusClient::Counter.new(
  name: 'background_jobs_total',
  documentation: 'Total number of background jobs processed',
  labels: [:job_name, :status]
)
PrometheusClient.register($background_jobs_total)

# Example: Database Query Duration Histogram
$db_query_duration_seconds = PrometheusClient::Histogram.new(
  name: 'db_query_duration_seconds',
  documentation: 'Database query duration in seconds',
  labels: [:query_name]
)
PrometheusClient.register($db_query_duration_seconds)

# Mount the middleware in your Rails application
# config/application.rb or config/environments/*.rb
Rails.application.config.middleware.use PrometheusClient::Middleware,
  metrics_path: '/metrics',
  registry: PrometheusClient.registry,
  requests_counter: $http_requests_total,
  duration_histogram: $http_request_duration_seconds

You’ll also need to instrument your background job processing (e.g., Sidekiq) and database queries manually.

# Example for Sidekiq (in an initializer or a custom middleware)
module Sidekiq
  module Middleware
    module Server
      class PrometheusMetrics
        def initialize(options = {})
          @jobs_total = options.fetch(:jobs_total)
          @job_duration = options.fetch(:job_duration)
        end

        def call(worker_class, msg, queue)
          start_time = Time.now
          begin
            yield
            status = 'success'
          rescue StandardError => e
            status = 'failure'
            raise e # Re-raise the exception so Sidekiq handles it
          ensure
            duration = Time.now - start_time
            @jobs_total.increment(job_name: worker_class.name, status: status)
            @job_duration.observe({ job_name: worker_class.name, status: status }, duration)
          end
        end
      end
    end
  end
end

# In config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add Sidekiq::Middleware::Server::PrometheusMetrics,
              jobs_total: $background_jobs_total,
              job_duration: $background_job_duration_seconds # Assuming you defined this histogram
  end
end

# Example for database queries (using ActiveRecord)
# In an initializer or a custom ActiveRecord extension
module ActiveRecord
  class Base
    class << self
      alias_method :original_connection, :connection
      def connection
        conn = original_connection
        # This is a simplified example; a more robust solution would involve
        # instrumenting specific query methods or using ActiveSupport::Notifications.
        # For demonstration, let's assume we wrap a hypothetical 'execute_query' method.
        # In reality, you'd hook into `ActiveSupport::Notifications.subscribe('sql.active_record')`
        # and extract query details.
        conn
      end
    end
  end
end

# Example using ActiveSupport::Notifications (more idiomatic)
ActiveSupport::Notifications.subscribe('sql.active_record') do |name, start, finish, id, payload|
  duration = finish - start
  query_name = payload[:name] || 'unknown' # e.g., 'User Load', 'Post Fetch'
  
  # You might want to sanitize or categorize query_name for better metrics
  $db_query_duration_seconds.observe({ query_name: query_name }, duration)
end

Configure Prometheus to scrape your application's metrics endpoint (e.g., http://your_app_ip:3000/metrics). You'll need to add a new job to your prometheus.yml:

scrape_configs:
  # ... other jobs ...
  - job_name: 'ruby_app'
    static_configs:
      - targets: ['your_app_ip:3000'] # Replace with your app's IP and port
        labels:
          instance: 'app_server_1'

Alerting with Alertmanager

Collecting metrics is only half the battle; you need to be alerted when things go wrong. Alertmanager handles alerts sent by Prometheus, deduplicates them, groups them, and routes them to the correct receiver (e.g., Slack, PagerDuty, email).

Here's a sample alertmanager.yml configuration:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver if no specific route matches

  routes:
    - receiver: 'critical-alerts'
      match:
        severity: 'critical'
      continue: true # Allows matching other routes if needed

    - receiver: 'warning-alerts'
      match:
        severity: 'warning'
      continue: true

receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL' # Replace with your Slack webhook URL
        channel: '#monitoring-alerts'
        send_resolved: true
        title: '[{{ .Status | toUpper }}{{ if .CommonLabels.alertname }} - {{ .CommonLabels.alertname }}{{ end }}]'
        text: '{{ range .Alerts }}*Alert:* {{ .Annotations.summary }}\n*Description:* {{ .Annotations.description }}\n*Details:* {{ range .Labels.SortedPairs }} {{ .Name }}={{ .Value }}{{ end }}\n{{ end }}'

  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#critical-alerts-channel'
        send_resolved: true
        title: '[{{ .Status | toUpper }}{{ if .CommonLabels.alertname }} - {{ .CommonLabels.alertname }}{{ end }}]'
        text: '{{ range .Alerts }}*Alert:* {{ .Annotations.summary }}\n*Description:* {{ .Annotations.description }}\n*Details:* {{ range .Labels.SortedPairs }} {{ .Name }}={{ .Value }}{{ end }}\n{{ end }}'

  - name: 'warning-alerts'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#warning-alerts-channel'
        send_resolved: true
        title: '[{{ .Status | toUpper }}{{ if .CommonLabels.alertname }} - {{ .CommonLabels.alertname }}{{ end }}]'
        text: '{{ range .Alerts }}*Alert:* {{ .Annotations.summary }}\n*Description:* {{ .Annotations.description }}\n*Details:* {{ range .Labels.SortedPairs }} {{ .Name }}={{ .Value }}{{ end }}\n{{ end }}'

In Prometheus, you'll define alerting rules in separate YAML files (e.g., rules.yml) and point Prometheus to them in prometheus.yml.

# prometheus.yml
rule_files:
  - "rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager_ip:9093'] # Replace with your Alertmanager IP and port

# rules.yml
groups:
  - name: postgres_alerts
    rules:
      - alert: HighPostgresQueryTime
        expr: sum(rate(pg_stat_statements_total_time_seconds[5m])) by (query) > 60 # Total time spent on a query in the last 5 mins exceeds 60s
        for: 5m
        labels:
          severity: warning
          service: postgres
        annotations:
          summary: "High query execution time detected for query: {{ $labels.query }}"
          description: "PostgreSQL query '{{ $labels.query }}' has been running for a cumulative total of over 60 seconds in the last 5 minutes."

      - alert: PostgresReplicationLag
        expr: pg_replication_lag_seconds > 300 # Replication lag exceeds 5 minutes
        for: 2m
        labels:
          severity: critical
          service: postgres
        annotations:
          summary: "PostgreSQL replication lag is high on {{ $labels.instance }}"
          description: "Replication lag on instance {{ $labels.instance }} is {{ $value }} seconds, exceeding the 300-second threshold."

  - name: node_alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90 # CPU usage over 90% for 5 minutes
        for: 5m
        labels:
          severity: warning
          service: node
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "Instance {{ $labels.instance }} is experiencing CPU usage above 90%."

      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10 # Less than 10% disk space available
        for: 10m
        labels:
          severity: critical
          service: node
        annotations:
          summary: "Low disk space on {{ $labels.instance }} for filesystem {{ $labels.mountpoint }}"
          description: "Filesystem {{ $labels.mountpoint }} on instance {{ $labels.instance }} has less than 10% free space remaining."

  - name: ruby_app_alerts
    rules:
      - alert: HighHttpRequestLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, method, path, status)) > 5 # 95th percentile latency > 5 seconds
        for: 5m
        labels:
          severity: warning
          service: ruby_app
        annotations:
          summary: "High HTTP request latency for {{ $labels.method }} {{ $labels.path }} ({{ $labels.status }})"
          description: "The 95th percentile latency for {{ $labels.method }} {{ $labels.path }} ({{ $labels.status }}) is {{ $value }} seconds, exceeding the 5-second threshold."

      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5 # More than 5% of requests are 5xx errors
        for: 5m
        labels:
          severity: critical
          service: ruby_app
        annotations:
          summary: "High HTTP error rate on {{ $labels.instance }}"
          description: "More than 5% of HTTP requests are resulting in 5xx errors."

Ensure your Prometheus server is configured to scrape Alertmanager, and Alertmanager is configured with your notification endpoints. Test your alerts by temporarily lowering thresholds or triggering conditions manually.

Server Monitoring Best Practices: Keeping Your Ruby App and PostgreSQL Clusters Alive on Linode

Proactive PostgreSQL Monitoring with pg_stat_statements

System-Level Monitoring with Prometheus and Node Exporter

Application-Level Metrics with Prometheus Client Libraries

Alerting with Alertmanager

Recent Posts

Top Categories

Our Products

Our Services