Server Monitoring Best Practices: Keeping Your Ruby App and PostgreSQL Clusters Alive on AWS

Proactive PostgreSQL Monitoring on AWS RDS

Keeping PostgreSQL clusters on AWS RDS healthy and fast is essential for any production Ruby application. Beyond basic CPU and memory utilization, we need PostgreSQL-specific metrics that reveal bottlenecks or impending failures. CloudWatch provides many of these metrics; the key is knowing which ones to prioritize and how to set effective alarms.

The most critical metrics to monitor for PostgreSQL on RDS include:

Database Connections: High numbers of active connections can exhaust resources.
Read/Write IOPS: Spikes or sustained high IOPS can indicate inefficient queries or storage contention.
Read/Write Latency: Increasing latency is a direct indicator of performance degradation.
CPU Utilization: While a general metric, sustained high CPU on the RDS instance points to processing load.
Freeable Memory: Insufficient free memory can lead to increased swapping and reduced performance.
Disk Queue Depth: A growing queue depth suggests the storage subsystem is struggling to keep up with I/O requests.
Replication Lag: For read replicas, monitoring lag is crucial for data consistency.
Transaction Rate: Sudden changes can signal application issues or performance regressions.
Deadlocks: Less frequent, but each one forces PostgreSQL to abort a transaction, which surfaces as an application error.

Setting Up Essential CloudWatch Alarms

We’ll configure alarms using the AWS CLI for consistency and automation. These alarms should trigger notifications via SNS to a dedicated DevOps channel or email distribution list.

First, ensure you have an SNS topic created (e.g., arn:aws:sns:us-east-1:123456789012:devops-alerts). Then, create alarms for key metrics:

High Database Connections:

aws cloudwatch put-metric-alarm \
    --alarm-name "RDS-Postgres-HighConnections" \
    --alarm-description "High number of active database connections" \
    --metric-name "DatabaseConnections" \
    --namespace "AWS/RDS" \
    --statistic Average \
    --period 300 \
    --threshold 150 \
    --comparison-operator GreaterThanThreshold \
    --dimensions "Name=DBInstanceIdentifier,Value=your-rds-instance-id" \
    --evaluation-periods 2 \
    --datapoints-to-alarm 2 \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:devops-alerts

High Read Latency:

aws cloudwatch put-metric-alarm \
    --alarm-name "RDS-Postgres-HighReadLatency" \
    --alarm-description "High read latency on RDS instance" \
    --metric-name "ReadLatency" \
    --namespace "AWS/RDS" \
    --statistic Average \
    --period 300 \
    --threshold 0.05 \
    --comparison-operator GreaterThanThreshold \
    --dimensions "Name=DBInstanceIdentifier,Value=your-rds-instance-id" \
    --evaluation-periods 2 \
    --datapoints-to-alarm 2 \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:devops-alerts

Low Freeable Memory:

aws cloudwatch put-metric-alarm \
    --alarm-name "RDS-Postgres-LowFreeableMemory" \
    --alarm-description "Low freeable memory on RDS instance" \
    --metric-name "FreeableMemory" \
    --namespace "AWS/RDS" \
    --statistic Average \
    --period 300 \
    --threshold 1073741824 \
    --comparison-operator LessThanThreshold \
    --dimensions "Name=DBInstanceIdentifier,Value=your-rds-instance-id" \
    --evaluation-periods 3 \
    --datapoints-to-alarm 3 \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:devops-alerts

Adjust threshold values based on your instance size and typical load. For read replicas, add alarms for ReplicaLag.
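Because the three alarms above differ only in a handful of values, the commands can be generated from a small table. The following Ruby sketch assembles the argument list for aws cloudwatch put-metric-alarm and prints it; it does not call AWS. The helper name alarm_args, the alarm naming scheme, and the 30-second ReplicaLag threshold are illustrative assumptions, not recommendations.

```ruby
require 'shellwords'

SNS_TOPIC = 'arn:aws:sns:us-east-1:123456789012:devops-alerts'

# One row per alarm. Thresholds are placeholders to tune per instance.
ALARMS = [
  { name: 'HighConnections', metric: 'DatabaseConnections', threshold: 150,
    op: 'GreaterThanThreshold', periods: 2 },
  { name: 'HighReadLatency', metric: 'ReadLatency', threshold: 0.05,
    op: 'GreaterThanThreshold', periods: 2 },
  { name: 'LowFreeableMemory', metric: 'FreeableMemory', threshold: 1_073_741_824,
    op: 'LessThanThreshold', periods: 3 },
  # Assumed threshold: replica more than 30 seconds behind its source.
  # Point this alarm at the read replica's instance identifier.
  { name: 'HighReplicaLag', metric: 'ReplicaLag', threshold: 30,
    op: 'GreaterThanThreshold', periods: 2 }
].freeze

# Builds the CLI arguments for one alarm as an array of strings.
def alarm_args(alarm, instance_id:, topic: SNS_TOPIC)
  [
    'aws', 'cloudwatch', 'put-metric-alarm',
    '--alarm-name', "RDS-Postgres-#{alarm[:name]}-#{instance_id}",
    '--metric-name', alarm[:metric],
    '--namespace', 'AWS/RDS',
    '--statistic', 'Average',
    '--period', '300',
    '--threshold', alarm[:threshold].to_s,
    '--comparison-operator', alarm[:op],
    '--dimensions', "Name=DBInstanceIdentifier,Value=#{instance_id}",
    '--evaluation-periods', alarm[:periods].to_s,
    '--datapoints-to-alarm', alarm[:periods].to_s,
    '--treat-missing-data', 'notBreaching',
    '--alarm-actions', topic
  ]
end

# Print shell-escaped commands for review before running them.
ALARMS.each do |alarm|
  puts Shellwords.join(alarm_args(alarm, instance_id: 'your-rds-instance-id'))
end
```

Since put-metric-alarm creates or updates an alarm by name, a stable name derived from the metric and the instance lets the same script be rerun safely after a threshold change.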

Monitoring Ruby Application Performance with Prometheus and Grafana

While RDS metrics are crucial, understanding the performance of your Ruby application itself is equally important. We’ll use Prometheus for metrics collection and Grafana for visualization and alerting. The prometheus-client gem can expose application-level metrics.

Exposing Application Metrics

In your Ruby application (e.g., Rails, Sinatra), integrate the Prometheus client library. Here’s a simplified example using the prometheus-client gem:

# Gemfile
gem 'prometheus-client'

# config/initializers/prometheus.rb (for Rails)
require 'prometheus/client'
require 'prometheus/client/formats/text'

# Metrics are registered on the client's default registry
PROMETHEUS_REGISTRY = Prometheus::Client.registry

# Define and register custom metrics
# A counter for total requests
REQUEST_TOTAL = PROMETHEUS_REGISTRY.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests processed',
  labels: [:method, :path, :status]
)

# A histogram for request duration (exposes _bucket, _sum, and _count series)
REQUEST_DURATION_SECONDS = PROMETHEUS_REGISTRY.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration in seconds',
  labels: [:method, :path, :status]
)

# Rack middleware to collect metrics (for Rails)
class MetricsMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # Use the monotonic clock so wall-clock adjustments cannot skew durations
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time

    # Raw paths can create one time series per distinct URL; normalize if needed
    labels = {
      method: env['REQUEST_METHOD'],
      path: env['PATH_INFO'],
      status: status.to_s # e.g., "200", "404", "500"
    }

    REQUEST_TOTAL.increment(labels: labels)
    REQUEST_DURATION_SECONDS.observe(duration, labels: labels)

    [status, headers, body]
  end
end

# Register the middleware, e.g., at the end of this initializer:
# Rails.application.config.middleware.use MetricsMiddleware

# Expose metrics endpoint (e.g., /metrics)
# In Rails, you might add a route:
# get '/metrics', to: proc { |_env| [200, { 'Content-Type' => Prometheus::Client::Formats::Text::CONTENT_TYPE }, [Prometheus::Client::Formats::Text.marshal(Prometheus::Client.registry)]] }

This middleware will automatically record the total number of requests and their durations, categorized by HTTP method, path, and status code. The /metrics endpoint will then serve these metrics in a Prometheus-readable format.
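The path label deserves some care: if raw request paths are used, every distinct URL (for example each user ID) becomes its own time series. One way to limit this is to collapse variable segments before labeling. The sketch below is a minimal illustration; the helper name normalize_path and the two patterns (numeric IDs and UUIDs) are assumptions to adapt to your own routes.

```ruby
# Collapse variable path segments so that /users/42 and /users/97 share one
# label value. The patterns are assumptions; extend them for your routes.
UUID_SEGMENT = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/
NUMERIC_SEGMENT = /\A\d+\z/

def normalize_path(path)
  # Drop any query string, then inspect each segment separately.
  without_query = path.to_s.split('?', 2).first.to_s
  normalized = without_query.split('/').map do |segment|
    case segment
    when NUMERIC_SEGMENT then ':id'
    when UUID_SEGMENT then ':uuid'
    else segment
    end
  end
  result = normalized.join('/')
  result.empty? ? '/' : result
end

puts normalize_path('/users/42/orders/7') # /users/:id/orders/:id
puts normalize_path('/')                  # /
```

Inside the middleware, the label would then be built from normalize_path(env['PATH_INFO']) instead of the raw path. In Rails, the matched route pattern is another possible source for this label.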

Setting Up Prometheus Server

Deploy a Prometheus server (e.g., on an EC2 instance or using Kubernetes). Configure its prometheus.yml to scrape your application instances.

global:
  scrape_interval: 15s # Scrape targets every 15 seconds (the Prometheus default is 1m).

scrape_configs:
  - job_name: 'ruby_app'
    static_configs:
      - targets: ['your-app-instance-1.example.com:3000', 'your-app-instance-2.example.com:3000']
        labels:
          env: 'production'
          app: 'my-ruby-app'
    metrics_path: /metrics # Default is /metrics

Ensure your application instances are accessible from the Prometheus server. If running on EC2, security groups must allow inbound traffic on port 3000 (or your app’s port) from the Prometheus server’s IP/security group.

Configuring Grafana for Visualization and Alerting

Add your Prometheus instance as a data source in Grafana. Then, create dashboards to visualize key application metrics:

Request Rate: rate(http_requests_total[5m])
Average Request Duration: sum(rate(http_request_duration_seconds_sum[5m])) by (method, path) / sum(rate(http_request_duration_seconds_count[5m])) by (method, path)
HTTP 5xx Error Rate: sum(rate(http_requests_total{status=~"5.."}[5m])) by (method, path)
Request Duration Percentiles (e.g., 95th): histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, method, path))
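To make the average-duration query concrete: it divides how fast the _sum series grows by how fast the _count series grows over the same window. The Ruby sketch below reproduces that arithmetic on two hypothetical scrapes taken 300 seconds apart. The sample values are invented, and the real rate() function additionally handles counter resets and extrapolation, which this simplification ignores.

```ruby
# Two hypothetical scrapes of the same series, 5 minutes (300 s) apart.
Sample = Struct.new(:timestamp, :sum_seconds, :count)

# Simplified per-second increase of a counter between two samples.
def per_second_rate(earlier_value, later_value, elapsed_seconds)
  (later_value - earlier_value) / elapsed_seconds.to_f
end

# Average request duration over the window: rate(sum) / rate(count).
def average_duration(earlier, later)
  elapsed = later.timestamp - earlier.timestamp
  sum_rate = per_second_rate(earlier.sum_seconds, later.sum_seconds, elapsed)
  count_rate = per_second_rate(earlier.count, later.count, elapsed)
  return nil if count_rate.zero? # no requests in the window

  sum_rate / count_rate
end

earlier = Sample.new(0, 120.0, 1_000)
later = Sample.new(300, 180.0, 1_600)

puts per_second_rate(earlier.count, later.count, 300) # 2.0 requests per second
puts average_duration(earlier, later)                 # 0.1 seconds per request
```

Here 600 requests took 60 seconds in total, so the request rate is 2 per second and the average duration is 0.1 seconds. Note that rate() always reports per-second values, which matters when an alert threshold is stated per minute.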

For alerting in Grafana, configure alert rules based on these queries, for instance an alert on a sustained increase in 5xx errors or a significant rise in request latency.

# Example alert rule (conceptual), written in Prometheus alerting-rule YAML;
# the same expression and duration can be used in a Grafana alert rule.
# Alert if 5xx errors exceed 10 per minute for any endpoint for 5 minutes.
# rate() returns a per-second value, hence the multiplication by 60.
groups:
  - name: ruby_app_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (method, path) * 60 > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP 5xx error rate detected on {{ $labels.method }} {{ $labels.path }}"
          description: "The rate of 5xx errors for {{ $labels.method }} {{ $labels.path }} has exceeded 10 per minute for the last 5 minutes."

Configure Grafana to send these alerts to Slack, PagerDuty, or email.

System-Level Monitoring on EC2 Instances

For EC2 instances hosting your Ruby application or acting as Prometheus/Grafana servers, standard system-level monitoring is essential. This includes CPU, memory, disk I/O, and network traffic.

Leveraging the CloudWatch Agent

The CloudWatch Agent provides a robust way to collect system metrics and logs from your EC2 instances. Install and configure it to send custom metrics and logs to CloudWatch.

First, install the agent:

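# For Amazon Linux 2 (the agent is available in the default repositories)
sudo yum install -y amazon-cloudwatch-agent

# For Ubuntu/Debian (download the package first; use debian/ in place of ubuntu/ for Debian)
wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb

# For CentOS/RHEL (download the package first; use redhat/ in place of centos/ for RHEL)
wget https://amazoncloudwatch-agent.s3.amazonaws.com/centos/amd64/latest/amazon-cloudwatch-agent.rpm
sudo rpm -U ./amazon-cloudwatch-agent.rpm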

Next, create a configuration file (e.g., /opt/aws/cloudwatch/agent/config.json). This example collects basic system metrics and application logs:

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyRubyApp/EC2",
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_user",
          "cpu_usage_system",
          "cpu_usage_iowait"
        ],
        "totalcpu": true
      },
      "mem": {
        "measurement": [
          "mem_used_percent",
          "mem_available_percent"
        ]
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "resources": [
          "/"
        ]
      },
      "net": {
        "measurement": [
          "bytes_sent",
          "bytes_recv",
          "packets_sent",
          "packets_recv"
        ]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my-ruby-app/production.log",
            "log_group_name": "my-ruby-app/production",
            "log_stream_name": "{instance_id}/production"
          },
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "nginx/access",
            "log_stream_name": "{instance_id}/access"
          }
        ]
      }
    }
  }
}

Start the agent:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/cloudwatch/agent/config.json -s

This configuration will send system metrics under the MyRubyApp/EC2 namespace and application/web server logs to CloudWatch Logs. You can then create CloudWatch Alarms based on these EC2 metrics (e.g., high CPU utilization, low disk space).

Database Connection Pooling and Health Checks

Efficient database connection management is critical for Ruby applications. Using a connection pooler like PgBouncer or the built-in pooling in ActiveRecord (though less robust for high concurrency) is essential. Monitoring the pool’s health is as important as monitoring the database itself.

Monitoring PgBouncer

PgBouncer exposes its statistics through an admin console: connect to the virtual pgbouncer database with a PostgreSQL client, as a user listed in admin_users or stats_users, and issue SHOW commands. The console does not accept ordinary SELECT queries.

-- Connect to the PgBouncer admin console (the database is named 'pgbouncer'),
-- e.g.: psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer
-- Then run these commands:

-- Pool statistics, one row per database/user pair. Columns include:
-- cl_active, cl_waiting, sv_active, sv_idle, sv_used, maxwait, pool_mode
SHOW POOLS;

-- Server connection statistics (connections from PgBouncer to PostgreSQL)
SHOW SERVERS;

-- Client connection statistics (connections from applications to PgBouncer)
SHOW CLIENTS;

-- Aggregate traffic statistics per database
SHOW STATS;

-- Configured limits such as max_client_conn, default_pool_size, and max_db_connections
SHOW CONFIG;

Key metrics to watch:

cl_waiting: A consistently high number indicates the pool is saturated and clients are waiting for connections.
sv_active vs. the pool size limit (default_pool_size, capped by max_db_connections if set): Once active server connections reach the limit, new queries queue behind them. Separately, once client connections reach max_client_conn, new clients are refused.
sv_idle: A persistently high count suggests the pool is over-provisioned for the current load.

You can set up a script (e.g., Python with psycopg2) to query these stats and push them to Prometheus as custom metrics, or use them to trigger alerts directly.
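As a rough illustration of such a script, here written in Ruby to match the application, the sketch below turns pool rows into Prometheus text exposition format. It assumes the rows come from PgBouncer's SHOW POOLS admin command, whose columns include database, user, cl_active, cl_waiting, sv_active, and sv_idle. The pgbouncer_pool_* metric names and the helper names are hypothetical, and the sample row is invented.

```ruby
# Maps SHOW POOLS columns to (hypothetical) Prometheus gauge names.
POOL_GAUGES = {
  'cl_active' => 'pgbouncer_pool_client_active',
  'cl_waiting' => 'pgbouncer_pool_client_waiting',
  'sv_active' => 'pgbouncer_pool_server_active',
  'sv_idle' => 'pgbouncer_pool_server_idle'
}.freeze

# Escapes backslashes, double quotes, and newlines in a label value.
def escape_label(value)
  value.to_s.gsub('\\') { '\\\\' }.gsub('"') { '\\"' }.gsub("\n") { '\\n' }
end

# Renders one gauge sample per pool row and metric.
def pools_to_prometheus(rows)
  lines = []
  POOL_GAUGES.each do |column, metric|
    lines << "# TYPE #{metric} gauge"
    rows.each do |row|
      labels = format('database="%s",user="%s"',
                      escape_label(row['database']), escape_label(row['user']))
      lines << "#{metric}{#{labels}} #{Integer(row[column])}"
    end
  end
  lines.join("\n") + "\n"
end

# In a real script the rows would be fetched from the admin console, for
# example with the pg gem (an assumption; any PostgreSQL client would do).
rows = [
  { 'database' => 'app_production', 'user' => 'app', 'cl_active' => '12',
    'cl_waiting' => '3', 'sv_active' => '10', 'sv_idle' => '2' }
]
puts pools_to_prometheus(rows)
```

The resulting text could be served over HTTP for Prometheus to scrape, or written to a file for node_exporter's textfile collector. Existing PgBouncer exporters are also worth evaluating before maintaining a custom script.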

Application-Level Health Checks

Implement a dedicated health check endpoint in your Ruby application (e.g., /health). This endpoint should:

Check basic application responsiveness.
Attempt a quick database query (e.g., SELECT 1) to verify database connectivity and responsiveness.
Check the status of external dependencies (e.g., Redis, message queues).

# Example Rails Controller for Health Check
class HealthController < ApplicationController
  skip_before_action :authenticate_user!, raise: false # Skip auth checks if defined

  def show
    status = {
      database: database_healthy?,
      redis: redis_healthy?,
      app: true # Assume app is healthy if we reach here
    }

    if status.values.all?
      render json: status, status: :ok
    else
      render json: status, status: :service_unavailable
    end
  end

  private

  def database_healthy?
    ActiveRecord::Base.connection.execute('SELECT 1')
    true
  rescue ActiveRecord::ActiveRecordError, PG::Error => e # ActiveRecord wraps most driver errors
    Rails.logger.error "Database health check failed: #{e.message}"
    false
  end

  def redis_healthy?
    # Assuming you use Redis with a client like 'redis-rb'
    # $redis is a global instance, adjust as per your setup
    $redis.ping
    true
  rescue Redis::BaseError => e # Covers connection failures and timeouts
    Rails.logger.error "Redis health check failed: #{e.message}"
    false
  end
end

Configure your load balancer (e.g., AWS ELB/ALB) or container orchestrator (e.g., ECS, Kubernetes) to use this /health endpoint for health checks. This ensures traffic is only sent to healthy application instances.
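Because load balancers apply their own timeout to health checks, each dependency probe should fail fast rather than hang. The sketch below wraps a probe with Ruby's standard Timeout module. The helper name check_with_timeout and the 1-second default are assumptions; choose a limit below your load balancer's health check timeout.

```ruby
require 'timeout'

# Runs a dependency probe and converts slow or failing probes into `false`.
# Timeout::Error is a StandardError, so one rescue clause covers both cases.
def check_with_timeout(seconds = 1.0)
  Timeout.timeout(seconds) { yield }
  true
rescue StandardError
  false
end

puts check_with_timeout(0.1) { sleep 1 }                        # false (too slow)
puts check_with_timeout { 1 + 1 }                               # true
puts check_with_timeout { raise IOError, 'connection refused' } # false (error)
```

In the controller, a probe could then read check_with_timeout { $redis.ping }. Since Timeout interrupts the block from another thread, client-level connect and read timeouts are generally preferable where the Redis or database client offers them; treat this wrapper as a fallback.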

Log Aggregation and Analysis

Centralized logging is indispensable for debugging issues across your distributed system. Logs from your Ruby application, web servers (Nginx/Apache), and database (if not using RDS’s managed logs) should be aggregated.

AWS CloudWatch Logs, Elasticsearch/OpenSearch with Kibana, or Datadog are common choices. As shown in the CloudWatch Agent section, you can configure the agent to ship application and web server logs.

For PostgreSQL on RDS, you can enable log exports to CloudWatch Logs. This allows you to search for slow queries, errors, and other database events.

# AWS CLI to enable PostgreSQL logging to CloudWatch Logs
aws rds modify-db-instance \
    --db-instance-identifier your-rds-instance-id \
    --cloudwatch-logs-export-configuration '{"EnableLogTypes":["postgresql","upgrade"]}' \
    --apply-immediately

Once logs are aggregated, set up alerts for critical log patterns:

PostgreSQL errors (e.g., “FATAL”, “PANIC”).
Application exceptions (e.g., “RuntimeError”, “StandardError”).
Nginx 5xx errors.
Slow query logs exceeding a certain threshold.

Tools like CloudWatch Logs Insights, Kibana Query Language (KQL), or Datadog’s log query language can be used to define these alerts.
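Before encoding these patterns in a particular query language, it can help to prototype them against sample log lines. The Ruby sketch below classifies lines into the four alert categories listed above. The category names, the 500 ms slow-query threshold, and the sample lines are assumptions; the duration pattern relies on PostgreSQL's "duration: ... ms" log output, which appears when log_min_duration_statement is set.

```ruby
# Slow-query threshold in milliseconds (an assumption; tune to your workload).
SLOW_QUERY_MS = 500.0

# Returns a category symbol for lines that should raise an alert, else nil.
def classify_log_line(line)
  case line
  when /\b(FATAL|PANIC):/ then :postgres_error
  when /duration: (\d+(?:\.\d+)?) ms/
    Regexp.last_match(1).to_f > SLOW_QUERY_MS ? :slow_query : nil
  when /\b(RuntimeError|StandardError)\b/ then :app_exception
  when /" 5\d\d / then :nginx_5xx # status field in the default access log format
  end
end

samples = [
  '2024-05-01 12:00:00 UTC:[123]:FATAL:  password authentication failed',
  '2024-05-01 12:00:01 UTC:[124]:LOG:  duration: 1234.567 ms  statement: SELECT 1',
  'RuntimeError (something broke): app/models/order.rb:10',
  '10.0.0.1 - - [01/May/2024:12:00:02 +0000] "GET /orders HTTP/1.1" 502 157 "-" "curl/8.0"',
  '10.0.0.1 - - [01/May/2024:12:00:03 +0000] "GET /orders HTTP/1.1" 200 512 "-" "curl/8.0"'
]

# Count alert-worthy lines per category, ignoring lines that match nothing.
counts = samples.map { |line| classify_log_line(line) }.compact.tally
puts counts
```

Once the patterns behave as expected on real samples, the same regular expressions can be carried over to metric filters or saved queries in whichever log platform you use.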