Server Monitoring Best Practices: Keeping Your Ruby App and PostgreSQL Clusters Alive on Linode
Proactive PostgreSQL Monitoring with pg_stat_statements
Maintaining the health and performance of your PostgreSQL clusters is paramount, especially when they underpin critical Ruby applications. Relying solely on basic CPU and memory metrics is insufficient. A deeper dive into query performance is essential for identifying bottlenecks and preventing outages. The pg_stat_statements extension is your first line of defense.
First, ensure the extension is enabled in your postgresql.conf. You’ll need to add or modify these lines:
shared_preload_libraries = 'pg_stat_statements' pg_stat_statements.track = all pg_stat_statements.max = 10000 pg_stat_statements.save = off
After restarting your PostgreSQL server, you can create the extension within your database:
CREATE EXTENSION pg_stat_statements;
Now, you can query the pg_stat_statements view to identify the most resource-intensive queries. Focus on metrics like total_time (cumulative time spent executing the statement) and calls (number of times the statement was executed). A high total_time coupled with a moderate to high calls count often indicates a query that needs optimization.
SELECT
query,
calls,
total_time,
rows,
mean_time,
stddev_time
FROM
pg_stat_statements
ORDER BY
total_time DESC
LIMIT 20;
To automate this, we can create a simple Ruby script that connects to PostgreSQL, fetches this data, and sends alerts if certain thresholds are breached. This script can be run periodically via cron.
require 'pg'
require 'json'
# Configuration
DB_HOST = ENV.fetch('DB_HOST', 'localhost')
DB_PORT = ENV.fetch('DB_PORT', '5432')
DB_NAME = ENV.fetch('DB_NAME', 'your_db')
DB_USER = ENV.fetch('DB_USER', 'postgres')
DB_PASSWORD = ENV.fetch('DB_PASSWORD', '')
# Thresholds (in seconds)
TOTAL_TIME_THRESHOLD = 600 # e.g., 10 minutes of cumulative execution time
CALLS_THRESHOLD = 10000 # e.g., 10,000 executions
def monitor_pg_stats
conn = PG.connect(host: DB_HOST, port: DB_PORT, dbname: DB_NAME, user: DB_USER, password: DB_PASSWORD)
begin
result = conn.exec(%{
SELECT
query,
calls,
total_time,
rows,
mean_time,
stddev_time
FROM
pg_stat_statements
ORDER BY
total_time DESC
LIMIT 50;
})
alerts = []
result.each do |row|
total_time_sec = row['total_time'].to_f / 1000.0 # pg_stat_statements reports in milliseconds
calls = row['calls'].to_i
if total_time_sec > TOTAL_TIME_THRESHOLD && calls > 100 # Only alert if it's also called a reasonable amount
alerts << {
query: row['query'],
calls: calls,
total_time_sec: total_time_sec,
mean_time_sec: row['mean_time'].to_f / 1000.0,
stddev_time_sec: row['stddev_time'].to_f / 1000.0
}
end
end
if !alerts.empty?
puts "ALERT: High-performance queries detected!"
puts JSON.pretty_generate(alerts)
# In a real-world scenario, send this to an alerting system (e.g., PagerDuty, Slack webhook)
# send_alert_to_slack(alerts)
else
puts "PostgreSQL query performance looks good."
end
rescue PG::Error => e
$stderr.puts "Database error: #{e.message}"
# Trigger an alert for database connection issues
# send_alert_to_slack("PostgreSQL connection error: #{e.message}")
ensure
conn.close if conn
end
end
# Placeholder for actual alerting mechanism
# def send_alert_to_slack(alerts_data)
# # Implement Slack integration here
# end
monitor_pg_stats
Schedule this script using cron. For example, to run it every 5 minutes:
*/5 * * * * /usr/bin/env ruby /path/to/your/monitor_script.rb >> /var/log/pg_monitor.log 2>&1
System-Level Monitoring with Prometheus and Node Exporter
While application-specific metrics are crucial, robust system-level monitoring is non-negotiable. Prometheus, coupled with Node Exporter, provides a powerful, time-series-based monitoring solution for your Linode instances. This setup allows you to collect a wide array of system metrics and visualize them in Grafana.
First, install Node Exporter on each of your Linode servers. The easiest way is to download the pre-compiled binary.
# On each Linode server wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/ rm -rf node_exporter-1.7.0.linux-amd64*
Next, set up Node Exporter to run as a systemd service for automatic startup and management.
# Create a systemd service file sudo nano /etc/systemd/system/node_exporter.service # Paste the following content: [Unit] Description=Node Exporter Wants=network-online.target After=network-online.target [Service] User=nobody Group=nogroup Type=simple ExecStart=/usr/local/bin/node_exporter [Install] WantedBy=multi-user.target
# Enable and start the service sudo systemctl daemon-reload sudo systemctl enable node_exporter sudo systemctl start node_exporter sudo systemctl status node_exporter
Node Exporter will now listen on port 9100. You can verify this by curling its metrics endpoint:
curl http://localhost:9100/metrics
On your Prometheus server, configure it to scrape these Node Exporter endpoints. Edit your prometheus.yml:
scrape_configs:
- job_name: 'node_exporter'
static_configs:
- targets: ['linode1_ip:9100', 'linode2_ip:9100', 'linode3_ip:9100'] # Replace with your Linode IPs
labels:
instance: 'linode1' # Or use hostname
- targets: ['linode2_ip:9100']
labels:
instance: 'linode2'
- targets: ['linode3_ip:9100']
labels:
instance: 'linode3'
- job_name: 'postgres'
static_configs:
- targets: ['postgres_master_ip:9187', 'postgres_replica1_ip:9187'] # Assuming pg_exporter on port 9187
labels:
instance: 'postgres_master'
- targets: ['postgres_replica1_ip:9187']
labels:
instance: 'postgres_replica1'
Restart Prometheus for the changes to take effect. You should now see your Linode instances and PostgreSQL instances (if you’re using a PostgreSQL exporter like pg_exporter) appearing under the “Targets” page in the Prometheus UI.
Application-Level Metrics with Prometheus Client Libraries
To gain visibility into your Ruby application’s behavior, integrate Prometheus client libraries. This allows you to expose custom metrics directly from your application, such as request latency, error rates, and queue depths.
Add the prometheus-client gem to your application’s Gemfile:
gem 'prometheus-client', require: 'prometheus_client/middleware'
Then, initialize the client and register your metrics. For a Rails application, you can do this in an initializer:
# config/initializers/prometheus_client.rb require 'prometheus_client/middleware' require 'prometheus_client/registry' # Initialize the default registry PrometheusClient.configure do |config| config.initial_value_calculators = PrometheusClient::ValueCalculators::Simple.new config.milliseconds_to_seconds = true end # Define custom metrics # Example: HTTP Request Counter $http_requests_total = PrometheusClient::Counter.new( name: 'http_requests_total', documentation: 'Total number of HTTP requests', labels: [:method, :path, :status] ) PrometheusClient.register($http_requests_total) # Example: HTTP Request Duration Histogram $http_request_duration_seconds = PrometheusClient::Histogram.new( name: 'http_request_duration_seconds', documentation: 'HTTP request duration in seconds', labels: [:method, :path, :status] ) PrometheusClient.register($http_request_duration_seconds) # Example: Background Job Counter $background_jobs_total = PrometheusClient::Counter.new( name: 'background_jobs_total', documentation: 'Total number of background jobs processed', labels: [:job_name, :status] ) PrometheusClient.register($background_jobs_total) # Example: Database Query Duration Histogram $db_query_duration_seconds = PrometheusClient::Histogram.new( name: 'db_query_duration_seconds', documentation: 'Database query duration in seconds', labels: [:query_name] ) PrometheusClient.register($db_query_duration_seconds) # Mount the middleware in your Rails application # config/application.rb or config/environments/*.rb Rails.application.config.middleware.use PrometheusClient::Middleware, metrics_path: '/metrics', registry: PrometheusClient.registry, requests_counter: $http_requests_total, duration_histogram: $http_request_duration_seconds
You’ll also need to instrument your background job processing (e.g., Sidekiq) and database queries manually.
# Example for Sidekiq (in an initializer or a custom middleware)
module Sidekiq
module Middleware
module Server
class PrometheusMetrics
def initialize(options = {})
@jobs_total = options.fetch(:jobs_total)
@job_duration = options.fetch(:job_duration)
end
def call(worker_class, msg, queue)
start_time = Time.now
begin
yield
status = 'success'
rescue StandardError => e
status = 'failure'
raise e # Re-raise the exception so Sidekiq handles it
ensure
duration = Time.now - start_time
@jobs_total.increment(job_name: worker_class.name, status: status)
@job_duration.observe({ job_name: worker_class.name, status: status }, duration)
end
end
end
end
end
end
# In config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
config.server_middleware do |chain|
chain.add Sidekiq::Middleware::Server::PrometheusMetrics,
jobs_total: $background_jobs_total,
job_duration: $background_job_duration_seconds # Assuming you defined this histogram
end
end
# Example for database queries (using ActiveRecord)
# In an initializer or a custom ActiveRecord extension
module ActiveRecord
class Base
class << self
alias_method :original_connection, :connection
def connection
conn = original_connection
# This is a simplified example; a more robust solution would involve
# instrumenting specific query methods or using ActiveSupport::Notifications.
# For demonstration, let's assume we wrap a hypothetical 'execute_query' method.
# In reality, you'd hook into `ActiveSupport::Notifications.subscribe('sql.active_record')`
# and extract query details.
conn
end
end
end
end
# Example using ActiveSupport::Notifications (more idiomatic)
ActiveSupport::Notifications.subscribe('sql.active_record') do |name, start, finish, id, payload|
duration = finish - start
query_name = payload[:name] || 'unknown' # e.g., 'User Load', 'Post Fetch'
# You might want to sanitize or categorize query_name for better metrics
$db_query_duration_seconds.observe({ query_name: query_name }, duration)
end
Configure Prometheus to scrape your application's metrics endpoint (e.g., http://your_app_ip:3000/metrics). You'll need to add a new job to your prometheus.yml:
scrape_configs:
# ... other jobs ...
- job_name: 'ruby_app'
static_configs:
- targets: ['your_app_ip:3000'] # Replace with your app's IP and port
labels:
instance: 'app_server_1'
Alerting with Alertmanager
Collecting metrics is only half the battle; you need to be alerted when things go wrong. Alertmanager handles alerts sent by Prometheus, deduplicates them, groups them, and routes them to the correct receiver (e.g., Slack, PagerDuty, email).
Here's a sample alertmanager.yml configuration:
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default-receiver' # Default receiver if no specific route matches
routes:
- receiver: 'critical-alerts'
match:
severity: 'critical'
continue: true # Allows matching other routes if needed
- receiver: 'warning-alerts'
match:
severity: 'warning'
continue: true
receivers:
- name: 'default-receiver'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK_URL' # Replace with your Slack webhook URL
channel: '#monitoring-alerts'
send_resolved: true
title: '[{{ .Status | toUpper }}{{ if .CommonLabels.alertname }} - {{ .CommonLabels.alertname }}{{ end }}]'
text: '{{ range .Alerts }}*Alert:* {{ .Annotations.summary }}\n*Description:* {{ .Annotations.description }}\n*Details:* {{ range .Labels.SortedPairs }} {{ .Name }}={{ .Value }}{{ end }}\n{{ end }}'
- name: 'critical-alerts'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK_URL'
channel: '#critical-alerts-channel'
send_resolved: true
title: '[{{ .Status | toUpper }}{{ if .CommonLabels.alertname }} - {{ .CommonLabels.alertname }}{{ end }}]'
text: '{{ range .Alerts }}*Alert:* {{ .Annotations.summary }}\n*Description:* {{ .Annotations.description }}\n*Details:* {{ range .Labels.SortedPairs }} {{ .Name }}={{ .Value }}{{ end }}\n{{ end }}'
- name: 'warning-alerts'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK_URL'
channel: '#warning-alerts-channel'
send_resolved: true
title: '[{{ .Status | toUpper }}{{ if .CommonLabels.alertname }} - {{ .CommonLabels.alertname }}{{ end }}]'
text: '{{ range .Alerts }}*Alert:* {{ .Annotations.summary }}\n*Description:* {{ .Annotations.description }}\n*Details:* {{ range .Labels.SortedPairs }} {{ .Name }}={{ .Value }}{{ end }}\n{{ end }}'
In Prometheus, you'll define alerting rules in separate YAML files (e.g., rules.yml) and point Prometheus to them in prometheus.yml.
# prometheus.yml
rule_files:
- "rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager_ip:9093'] # Replace with your Alertmanager IP and port
# rules.yml
groups:
- name: postgres_alerts
rules:
- alert: HighPostgresQueryTime
expr: sum(rate(pg_stat_statements_total_time_seconds[5m])) by (query) > 60 # Total time spent on a query in the last 5 mins exceeds 60s
for: 5m
labels:
severity: warning
service: postgres
annotations:
summary: "High query execution time detected for query: {{ $labels.query }}"
description: "PostgreSQL query '{{ $labels.query }}' has been running for a cumulative total of over 60 seconds in the last 5 minutes."
- alert: PostgresReplicationLag
expr: pg_replication_lag_seconds > 300 # Replication lag exceeds 5 minutes
for: 2m
labels:
severity: critical
service: postgres
annotations:
summary: "PostgreSQL replication lag is high on {{ $labels.instance }}"
description: "Replication lag on instance {{ $labels.instance }} is {{ $value }} seconds, exceeding the 300-second threshold."
- name: node_alerts
rules:
- alert: HighCpuUsage
expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90 # CPU usage over 90% for 5 minutes
for: 5m
labels:
severity: warning
service: node
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "Instance {{ $labels.instance }} is experiencing CPU usage above 90%."
- alert: LowDiskSpace
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10 # Less than 10% disk space available
for: 10m
labels:
severity: critical
service: node
annotations:
summary: "Low disk space on {{ $labels.instance }} for filesystem {{ $labels.mountpoint }}"
description: "Filesystem {{ $labels.mountpoint }} on instance {{ $labels.instance }} has less than 10% free space remaining."
- name: ruby_app_alerts
rules:
- alert: HighHttpRequestLatency
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, method, path, status)) > 5 # 95th percentile latency > 5 seconds
for: 5m
labels:
severity: warning
service: ruby_app
annotations:
summary: "High HTTP request latency for {{ $labels.method }} {{ $labels.path }} ({{ $labels.status }})"
description: "The 95th percentile latency for {{ $labels.method }} {{ $labels.path }} ({{ $labels.status }}) is {{ $value }} seconds, exceeding the 5-second threshold."
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5 # More than 5% of requests are 5xx errors
for: 5m
labels:
severity: critical
service: ruby_app
annotations:
summary: "High HTTP error rate on {{ $labels.instance }}"
description: "More than 5% of HTTP requests are resulting in 5xx errors."
Ensure your Prometheus server is configured to scrape Alertmanager, and Alertmanager is configured with your notification endpoints. Test your alerts by temporarily lowering thresholds or triggering conditions manually.