Server Monitoring Best Practices: Keeping Your Ruby App and DynamoDB Clusters Alive on OVH

Proactive Health Checks for Ruby Applications on OVH

Keeping a Ruby application healthy on OVH infrastructure requires monitoring at several layers. Basic uptime checks are not enough: we also need application-specific metrics and resource utilization. This section focuses on health checks that verify the application's dependencies rather than only returning an HTTP 200 response.

A common approach is to expose a dedicated health check endpoint within your Ruby application. This endpoint should not only verify that the web server is responding but also perform critical internal checks. For Rails applications, this can be a simple controller action.

Rails Health Check Endpoint

Create a controller to handle health checks. This controller can be configured to run specific checks, such as database connectivity, cache status, or the availability of external services.

# app/controllers/health_check_controller.rb
class HealthCheckController < ApplicationController
  skip_before_action :authenticate_user!, raise: false # Adjust as per your auth setup; raise: false avoids an error if the callback is not defined

  def show
    checks = {
      database: check_database_connection,
      cache: check_cache_connection,
      # Add more checks as needed, e.g., external_api: check_external_api
    }

    if checks.values.all?
      render json: { status: 'ok', checks: checks }, status: :ok
    else
      render json: { status: 'error', checks: checks }, status: :internal_server_error
    end
  end

  private

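  def check_database_connection
    ActiveRecord::Base.connection.execute('SELECT 1')
    true
  rescue StandardError => e
    # Rescue broadly: ActiveRecord often wraps driver errors (e.g. ActiveRecord::StatementInvalid),
    # and naming PG::Error would tie this check to PostgreSQL.
    Rails.logger.error "Database connection check failed: #{e.message}"
    false
  end

  def check_cache_connection
    # Assumes Redis is used for caching and a global client ($redis) is set up in an initializer
    $redis.ping
    true
  rescue Redis::BaseError => e
    # Redis::BaseError also covers timeouts, not only refused connections
    Rails.logger.error "Cache connection check failed: #{e.message}"
    false
  end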

  # def check_external_api
  #   # Implement logic to check an external API
  #   # e.g., Net::HTTP.get_response(URI('http://example.com/health'))
  #   true
  # rescue StandardError => e
  #   Rails.logger.error "External API check failed: #{e.message}"
  #   false
  # end
end
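
Each check in the controller blocks the request until it returns, so a hung dependency can make the health endpoint itself time out. The following plain-Ruby sketch (no Rails required) shows one way to run named checks with a per-check time limit and collapse them into the same ok/error shape. The helper name run_health_checks and the 2-second default are illustrative assumptions, not part of the controller above.

```ruby
require 'timeout'

# Hypothetical helper: runs each named check (a callable returning truthy/falsy)
# with a time limit. Any exception or timeout counts as a failed check.
def run_health_checks(checks, timeout_seconds: 2)
  results = checks.transform_values do |check|
    begin
      Timeout.timeout(timeout_seconds) { check.call ? true : false }
    rescue StandardError
      false
    end
  end

  { status: results.values.all? ? 'ok' : 'error', checks: results }
end

# Usage with stand-in checks; real checks would ping the database, cache, etc.
report = run_health_checks({
  database: -> { true },
  cache: -> { raise IOError, 'connection refused' }
})
puts report[:status]
```

Note that the checks hash is passed with explicit braces because the method also accepts a keyword argument.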

Routing for Health Check

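Define a route for this health check endpoint. A dedicated, simple path such as /_health keeps it clear of regular application routes. An unusual path is not access control, so restrict the endpoint at the load balancer or firewall if its output is sensitive.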

# config/routes.rb
Rails.application.routes.draw do
  # ... other routes
  get '/_health', to: 'health_check#show'
  # ...
end

OVH Load Balancer and Health Checks

OVH’s load balancers (e.g., HAProxy-based) can be configured to use this endpoint for their health checks. This ensures that unhealthy application instances are automatically removed from the load balancing pool.

When configuring your OVH load balancer, set the health check URL to http://<instance_ip>/_health. The expected HTTP status code for a healthy instance is 200 OK. For unhealthy instances, the application should return 500 Internal Server Error.
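
To test the endpoint the way a load balancer would see it, a small probe script can help. The sketch below uses only Ruby's standard library; the function names and the 2-second timeouts are assumptions chosen for illustration, and the URL must be replaced with a real instance address.

```ruby
require 'net/http'
require 'uri'

# Mirrors the rule described above: only HTTP 200 counts as healthy.
def healthy_status?(status_code)
  status_code.to_i == 200
end

# Probes the health endpoint. Connection errors and timeouts count as unhealthy.
# The timeout default is an illustrative assumption.
def probe_health(url, timeout_seconds: 2)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: timeout_seconds,
                             read_timeout: timeout_seconds) do |http|
    http.get(uri.request_uri)
  end
  healthy_status?(response.code)
rescue StandardError
  false
end
```

Calling probe_health('http://<instance_ip>/_health') with a real address returns true only when the instance answers 200 within the timeout.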

Monitoring DynamoDB Performance and Capacity on OVH

DynamoDB is an AWS managed service, not something OVH hosts or manages, but applications running on OVH often depend on it. Effective monitoring of DynamoDB is crucial for application performance and cost management. We'll focus on key metrics and how to collect them.

Key DynamoDB Metrics to Monitor

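ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits: Essential for understanding actual usage versus provisioned capacity. CloudWatch reports these as totals per period, so divide the Sum statistic by the period length in seconds before comparing it with provisioned units per second. Spikes here can indicate performance bottlenecks or inefficient queries.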
ProvisionedReadCapacityUnits and ProvisionedWriteCapacityUnits: Track these to ensure you’re not over-provisioning (wasting money) or under-provisioning (causing throttling).
ThrottledRequests: A direct indicator of insufficient capacity. High throttling rates mean requests are being rejected, impacting application responsiveness.
SuccessfulRequestLatency: Measures the time taken for successful requests, reported in milliseconds. High latency can point to inefficient scans, large items, or hot partitions.
SystemErrors: Count of internal server errors from DynamoDB.
ConditionalCheckFailedRequests: Indicates issues with conditional writes, which can be a source of application logic errors.
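
As a worked example of comparing consumed with provisioned capacity, the sketch below converts a per-period Sum of consumed units into units per second and expresses it as a fraction of provisioned capacity. The method name and the sample numbers are made up for illustration.

```ruby
# Converts a CloudWatch Sum of consumed capacity units over one period into
# average units per second, then divides by the provisioned units per second.
def capacity_utilization(consumed_sum:, period_seconds:, provisioned_units:)
  raise ArgumentError, 'period_seconds must be positive' unless period_seconds.positive?
  raise ArgumentError, 'provisioned_units must be positive' unless provisioned_units.positive?

  consumed_per_second = consumed_sum.to_f / period_seconds
  consumed_per_second / provisioned_units
end

# Sample values: 27,000 units consumed in a 300-second period against 100 provisioned units.
utilization = capacity_utilization(consumed_sum: 27_000, period_seconds: 300, provisioned_units: 100)
puts format('%.0f%% of provisioned read capacity', utilization * 100) # prints "90% of provisioned read capacity"
```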

Collecting DynamoDB Metrics with AWS CloudWatch

AWS CloudWatch is the primary service for collecting and visualizing DynamoDB metrics. You can access these metrics via the AWS Management Console, AWS CLI, or SDKs.

Using AWS CLI for Metric Retrieval

You can fetch specific metrics using the AWS CLI. This is useful for scripting or integrating with external monitoring tools.

# Get consumed read capacity for a table over the last 5 minutes.
# Sum is the total units consumed in the period; divide by the period (300 s) for units per second.
# Note: `date -d` is GNU date syntax (Linux); BSD/macOS date uses different flags.
aws cloudwatch get-metric-statistics \
    --namespace AWS/DynamoDB \
    --metric-name ConsumedReadCapacityUnits \
    --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
    --period 300 \
    --statistics Sum \
    --dimensions Name=TableName,Value=YourTableName \
    --region us-east-1

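# Get throttled requests for a table in the last hour.
# ThrottledRequests is published per operation; if this returns no datapoints,
# add an Operation dimension (for example Name=Operation,Value=GetItem).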
aws cloudwatch get-metric-statistics \
    --namespace AWS/DynamoDB \
    --metric-name ThrottledRequests \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
    --period 600 \
    --statistics Sum \
    --dimensions Name=TableName,Value=YourTableName \
    --region us-east-1
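
When scripting around these commands, the JSON they print can be post-processed in Ruby. The sketch below totals the Sum statistic across the returned datapoints. The sample document is hand-written in the shape of get-metric-statistics output, not real data, and the method name is an assumption.

```ruby
require 'json'

# Sums the "Sum" statistic across all datapoints in the JSON printed by
# `aws cloudwatch get-metric-statistics --statistics Sum`.
def total_from_datapoints(cli_json)
  datapoints = JSON.parse(cli_json).fetch('Datapoints', [])
  datapoints.sum { |point| point.fetch('Sum', 0.0) }
end

# Hand-written sample in the shape of the CLI output (not real measurements).
sample = <<~JSON
  {
    "Label": "ThrottledRequests",
    "Datapoints": [
      { "Timestamp": "2024-01-01T00:00:00Z", "Sum": 3.0, "Unit": "Count" },
      { "Timestamp": "2024-01-01T00:10:00Z", "Sum": 1.0, "Unit": "Count" }
    ]
  }
JSON

puts total_from_datapoints(sample) # prints 4.0
```

In a real script the input would come from the command itself, for example by capturing its standard output.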

Integrating CloudWatch Metrics with External Monitoring Systems

For a unified view, especially when hosting on OVH, you’ll want to pull CloudWatch metrics into your primary monitoring system (e.g., Prometheus, Datadog, Grafana). The Prometheus AWS Exporter is a common choice.

Prometheus AWS Exporter Configuration

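Deploy the AWS Exporter and configure it to scrape DynamoDB metrics. Ensure the IAM role or user associated with the exporter has permissions for cloudwatch:GetMetricStatistics and dynamodb:ListTables; exporters that discover metrics automatically typically also need cloudwatch:ListMetrics. The region, table, and metric parameters shown here are illustrative: many exporters read these settings from their own configuration file rather than from scrape parameters, so check your exporter's documentation.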

# Example prometheus.yml configuration snippet for AWS Exporter
scrape_configs:
  - job_name: 'aws-dynamodb'
    static_configs:
      - targets: ['aws-exporter.example.com:9108'] # Replace with your exporter's address
    metrics_path: '/metrics'
    params:
      region: ['us-east-1'] # Specify your AWS region
      tables: ['YourTableName1', 'YourTableName2'] # List tables you want to monitor
      metrics:
        - ConsumedReadCapacityUnits
        - ConsumedWriteCapacityUnits
        - ProvisionedReadCapacityUnits
        - ProvisionedWriteCapacityUnits
        - ThrottledRequests
        - SuccessfulRequestLatency
        - SystemErrors
        - ConditionalCheckFailedRequests

Once configured, you can query these metrics in Prometheus and visualize them in Grafana dashboards. This allows correlation of DynamoDB performance with your Ruby application’s behavior on OVH.

System-Level Monitoring on OVH Instances

Even with application-level and database monitoring, robust system-level metrics are foundational. For servers running your Ruby application on OVH, we need to monitor CPU, memory, disk I/O, and network traffic.

Essential System Metrics

CPU Utilization: High CPU can indicate inefficient Ruby code, heavy background jobs, or insufficient instance sizing.
Memory Usage: Monitor both RAM and swap. Excessive swapping is a strong indicator of memory pressure.
Disk I/O Wait: High I/O wait times suggest storage bottlenecks, which can impact application responsiveness.
Network Traffic: Monitor inbound and outbound traffic to detect unusual patterns or potential network saturation.
Load Average: A general indicator of system load.
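
To make the memory metric concrete, the sketch below derives a used-memory percentage from /proc/meminfo-style text, using the same MemTotal and MemAvailable fields that Node Exporter exposes. The sample text and method names are illustrative assumptions; on a Linux host the input could come from File.read('/proc/meminfo').

```ruby
# Parses /proc/meminfo-style text into a hash of field name => kilobytes.
# Lines that do not look like "Name:   12345 kB" are skipped.
def parse_meminfo(text)
  text.each_line.with_object({}) do |line, fields|
    match = line.match(/\A(\w+):\s+(\d+)/)
    fields[match[1]] = match[2].to_i if match
  end
end

# Percentage of RAM in use, based on MemTotal and MemAvailable.
def memory_used_percent(meminfo)
  total = meminfo.fetch('MemTotal')
  available = meminfo.fetch('MemAvailable')
  (total - available) * 100.0 / total
end

# Made-up sample values for illustration.
sample = <<~MEMINFO
  MemTotal:       16000000 kB
  MemFree:         1000000 kB
  MemAvailable:    4000000 kB
MEMINFO

puts memory_used_percent(parse_meminfo(sample)) # prints 75.0
```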

Implementing Node Exporter for Prometheus

Node Exporter is the de facto standard for collecting hardware and OS metrics for Prometheus. Deploying it on each OVH instance provides the necessary data.

Installation and Configuration (Ubuntu/Debian)

Download the latest release, extract it, and run it as a service.

# Download the latest release (check https://prometheus.io/download/ for latest version)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Create a systemd service file
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Create user and group for node_exporter
sudo groupadd --system prometheus
sudo useradd --system --no-create-home --shell /bin/false -g prometheus prometheus

# Copy the binary
sudo cp node_exporter /usr/local/bin/

# Reload systemd, enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify status
sudo systemctl status node_exporter

Prometheus Scrape Configuration

Add a job to your Prometheus configuration to scrape the Node Exporter instances running on your OVH servers.

# prometheus.yml
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['ovh-server-1.example.com:9100', 'ovh-server-2.example.com:9100'] # Replace with your OVH server IPs/hostnames
    metrics_path: '/metrics'

Alerting Strategies for Production Readiness

Effective monitoring is incomplete without a robust alerting strategy. Alerts should be actionable, timely, and minimize noise. We’ll focus on setting up alerts for critical conditions across our Ruby app, DynamoDB, and OVH instances.

Alerting on Ruby Application Health

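Alert when the health check endpoint returns an error or times out. This indicates a severe application issue. The rule below assumes the endpoint is probed by the Prometheus Blackbox Exporter, which exposes the probe_success metric.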

# Prometheus Alerting Rule
groups:
- name: ruby_app_alerts
  rules:
  - alert: RubyAppUnhealthy
    expr: probe_success{job="your_ruby_app_job", instance="your_app_instance:port"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Ruby application on {{ $labels.instance }} is unhealthy."
      description: "The health check endpoint for {{ $labels.instance }} has failed for 5 minutes."

Additionally, monitor application-specific error rates or latency if your application exposes custom metrics (e.g., via a Prometheus client library).
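
As a rough illustration of what exposing custom metrics means, the sketch below implements a tiny counter that renders the Prometheus text exposition format by hand. It is a teaching aid under stated assumptions: the class and metric names are hypothetical, label values are not escaped, and a production application would normally use a maintained client library instead.

```ruby
# A tiny in-process counter that renders the Prometheus text exposition format.
# Hypothetical class for illustration; label values are not escaped.
class SimpleCounter
  def initialize(name, help)
    @name = name
    @help = help
    @values = Hash.new(0)
    @lock = Mutex.new
  end

  # Increments the counter for the given label set (an optional hash).
  def increment(labels = {})
    @lock.synchronize { @values[labels] += 1 }
  end

  # Returns the metric as text that a Prometheus server can scrape.
  def render
    lines = ["# HELP #{@name} #{@help}", "# TYPE #{@name} counter"]
    @lock.synchronize do
      @values.each do |labels, value|
        label_text = labels.map { |key, val| %(#{key}="#{val}") }.join(',')
        lines << (label_text.empty? ? "#{@name} #{value}" : "#{@name}{#{label_text}} #{value}")
      end
    end
    lines.join("\n") + "\n"
  end
end

errors = SimpleCounter.new('app_errors_total', 'Total application errors')
errors.increment(status: '500')
errors.increment(status: '500')
puts errors.render
```

Serving the rendered text from a /metrics route would let Prometheus scrape it, and an alert on the counter's rate could then complement the health check rule.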

Alerting on DynamoDB Capacity and Performance

Alert when DynamoDB is being throttled or experiencing high latency. These are direct indicators of potential user impact.

# Prometheus Alerting Rules for DynamoDB
groups:
- name: dynamodb_alerts
  rules:
  - alert: DynamoDBThrottled
    expr: sum(aws_dynamodb_throttled_requests_sum{job="aws-dynamodb", table="YourTableName"}) by (table) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "DynamoDB table {{ $labels.table }} is experiencing throttling."
      description: "Throttled requests for table {{ $labels.table }} have been detected for 10 minutes."

  - alert: DynamoDBHighLatency
    # CloudWatch reports SuccessfulRequestLatency in milliseconds, so 1000 means 1 second. Adjust threshold as needed.
    expr: avg(aws_dynamodb_successful_request_latency_average{job="aws-dynamodb", table="YourTableName"}) by (table) > 1000
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High latency on DynamoDB table {{ $labels.table }}."
      description: "Average request latency for table {{ $labels.table }} has exceeded 1 second (1000 ms) for 15 minutes."

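  - alert: DynamoDBLowReadCapacity
    # ConsumedReadCapacityUnits is a per-period total, so the Sum statistic is divided by the
    # CloudWatch period (assumed here to be 60 seconds; adjust to your exporter's setting)
    # to get units per second before comparing with provisioned capacity.
    expr: sum(aws_dynamodb_consumed_read_capacity_units_sum{job="aws-dynamodb", table="YourTableName"}) by (table) / 60 >= avg(aws_dynamodb_provisioned_read_capacity_units_average{job="aws-dynamodb", table="YourTableName"}) by (table) * 0.9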
    for: 30m
    labels:
      severity: info
    annotations:
      summary: "DynamoDB table {{ $labels.table }} is nearing its read capacity."
      description: "Consumed read capacity for table {{ $labels.table }} is consistently above 90% of provisioned capacity for 30 minutes."

Alerting on System Resource Exhaustion

Alert on critical system resource thresholds to prevent application downtime due to infrastructure limitations.

# Prometheus Alerting Rules for System Resources
groups:
- name: system_alerts
  rules:
  - alert: HighCPUUsage
    expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}."
      description: "CPU usage on {{ $labels.instance }} has been above 80% for 10 minutes."

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}."
      description: "Memory usage on {{ $labels.instance }} has been above 90% for 10 minutes."

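  - alert: HighDiskIOUtilization
    # node_disk_io_time_seconds_total measures how long a device was busy, i.e. utilization.
    # For CPU iowait, use node_cpu_seconds_total{mode="iowait"} instead.
    expr: avg by (instance) (rate(node_disk_io_time_seconds_total{device=~"sd.*|nvme.*"}[5m])) > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High disk I/O utilization on {{ $labels.instance }}."
      description: "Disks on {{ $labels.instance }} have been busy more than 50% of the time, on average, for 10 minutes."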
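
The HighCPUUsage rule works on the idle fraction of CPU time: an idle rate below 0.2 means the CPU was busy more than 80% of the time. The sketch below reproduces that arithmetic from two samples of per-mode CPU second counters, the same kind of data behind node_cpu_seconds_total. The sample values and the method name are made up for illustration.

```ruby
# Computes the busy fraction of CPU time between two samples of per-mode
# CPU second counters (hashes of mode => cumulative seconds).
def cpu_busy_fraction(earlier, later)
  deltas = later.to_h { |mode, seconds| [mode, seconds - earlier.fetch(mode, 0.0)] }
  total = deltas.values.sum
  return 0.0 if total.zero?

  1.0 - deltas.fetch('idle', 0.0) / total
end

# Made-up counter samples taken some time apart.
earlier = { 'idle' => 1000.0, 'user' => 200.0, 'system' => 50.0 }
later   = { 'idle' => 1030.0, 'user' => 440.0, 'system' => 80.0 }

busy = cpu_busy_fraction(earlier, later)
puts format('%.0f%% busy', busy * 100) # prints "90% busy"
puts 'would breach the 80% threshold' if busy > 0.8
```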

Alertmanager Configuration

Configure Alertmanager to route these alerts to the appropriate teams via email, Slack, PagerDuty, etc. Ensure alert routing rules are well-defined to avoid alert fatigue.

# alertmanager.yml
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'

  routes:
  - receiver: 'critical-alerts'
    match:
      severity: 'critical'
    continue: true

  - receiver: 'warning-alerts'
    match:
      severity: 'warning'
    continue: true

receivers:
- name: 'default-receiver'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/...'
    channel: '#monitoring-alerts'

- name: 'critical-alerts'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/...'
    channel: '#oncall-critical'
  pagerduty_configs:
  - service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'

- name: 'warning-alerts'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/...'
    channel: '#monitoring-alerts'
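
To reason about which receiver an alert ends up at, it can help to model the routing tree. The sketch below is a simplified simulation of one level of child routes with exact-match label matchers and the continue flag. It is not Alertmanager's implementation, and the Route struct and receivers_for helper are hypothetical names.

```ruby
# Simplified model of a child route: receiver name, exact-match labels, continue flag.
Route = Struct.new(:receiver, :match, :continue, keyword_init: true)

# Returns the receivers an alert is delivered to. Child routes are tried in order;
# a match stops the search unless the route sets continue. If no child matches,
# the root receiver handles the alert.
def receivers_for(alert_labels, root_receiver:, routes:)
  matched = []
  routes.each do |route|
    next unless route.match.all? { |label, value| alert_labels[label] == value }

    matched << route.receiver
    break unless route.continue
  end
  matched.empty? ? [root_receiver] : matched
end

# Mirrors the routes in the alertmanager.yml above.
routes = [
  Route.new(receiver: 'critical-alerts', match: { 'severity' => 'critical' }, continue: true),
  Route.new(receiver: 'warning-alerts', match: { 'severity' => 'warning' }, continue: true)
]

%w[critical warning info].each do |severity|
  receivers = receivers_for({ 'severity' => severity }, root_receiver: 'default-receiver', routes: routes)
  puts "#{severity}: #{receivers.join(', ')}"
end
```

Under this model, alerts with severity info (such as DynamoDBLowReadCapacity) match no child route and fall through to default-receiver.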