Server Monitoring Best Practices: Keeping Your Ruby App and DynamoDB Clusters Alive on OVH
Proactive Health Checks for Ruby Applications on OVH
Keeping a Ruby application healthy on OVH infrastructure requires monitoring at several layers. Basic uptime checks are not enough; application-specific metrics and resource utilization matter too. This section covers health checks that go beyond a simple HTTP 200 response.
A common approach is to expose a dedicated health check endpoint within your Ruby application. This endpoint should not only verify that the web server is responding but also perform critical internal checks. For Rails applications, this can be a simple controller action.
Rails Health Check Endpoint
Create a controller to handle health checks. This controller can be configured to run specific checks, such as database connectivity, cache status, or the availability of external services.
# app/controllers/health_check_controller.rb
class HealthCheckController < ApplicationController
  # raise: false avoids an error when the callback is not defined.
  # Adjust as per your auth setup.
  skip_before_action :authenticate_user!, raise: false

  def show
    checks = {
      database: check_database_connection,
      cache: check_cache_connection,
      # Add more checks as needed, e.g., external_api: check_external_api
    }

    if checks.values.all?
      render json: { status: 'ok', checks: checks }, status: :ok
    else
      render json: { status: 'error', checks: checks }, status: :internal_server_error
    end
  end

  private

  def check_database_connection
    ActiveRecord::Base.connection.execute('SELECT 1')
    true
  rescue StandardError => e
    # StandardError covers adapter-specific errors (PG::Error, Mysql2::Error, ...)
    # without referencing a driver constant that may not be loaded.
    Rails.logger.error "Database connection check failed: #{e.message}"
    false
  end

  def check_cache_connection
    # Assuming Redis is used for caching and $redis is set up in an initializer
    $redis.ping
    true
  rescue Redis::BaseError => e
    # Redis::BaseError also covers timeouts, not only refused connections.
    Rails.logger.error "Cache connection check failed: #{e.message}"
    false
  end

  # def check_external_api
  #   # Implement logic to check an external API, e.g.:
  #   response = Net::HTTP.get_response(URI('http://example.com/health'))
  #   response.is_a?(Net::HTTPSuccess)
  # rescue StandardError => e
  #   Rails.logger.error "External API check failed: #{e.message}"
  #   false
  # end
end
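A slow dependency can make the health endpoint itself hang, so it may help to bound each check with a time limit. The following is a minimal, framework-free sketch of that idea; the method name `run_health_checks` and the two-second default are assumptions for illustration, not part of Rails.

```ruby
require 'timeout'

# Runs each named check with a time limit and reports overall health.
# `checks` maps a name to a callable that returns truthy when healthy.
# (Hypothetical helper; the name and defaults are illustrative.)
def run_health_checks(checks, timeout_seconds: 2)
  results = checks.transform_values do |check|
    begin
      Timeout.timeout(timeout_seconds) { check.call ? true : false }
    rescue StandardError
      # A raised error and a timeout both count as a failed check.
      false
    end
  end

  healthy = results.values.all?
  {
    status: healthy ? 'ok' : 'error',
    http_status: healthy ? 200 : 500,
    checks: results
  }
end
```

In a controller, the returned `:http_status` could be passed to `render` in place of the hard-coded symbols.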
Routing for Health Check
Define a route for this health check endpoint. A short, distinctive path such as /_health keeps it from colliding with application routes. An unusual path is not access control, though, so keep sensitive details out of the response.
# config/routes.rb
Rails.application.routes.draw do
  # ... other routes
  get '/_health', to: 'health_check#show'
  # ...
end
OVH Load Balancer and Health Checks
OVH’s load balancers (e.g., HAProxy-based) can be configured to use this endpoint for their health checks. This ensures that unhealthy application instances are automatically removed from the load balancing pool.
When configuring your OVH load balancer, set the health check URL to http://<instance_ip>/_health. The expected HTTP status code for a healthy instance is 200 OK. For unhealthy instances, the application should return 500 Internal Server Error.
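Before pointing the load balancer at an instance, it can be useful to probe the endpoint the same way the balancer would. The sketch below uses only Ruby's standard library; the method name `instance_healthy?` and the timeout default are assumptions, and it treats anything other than a 200 response as unhealthy.

```ruby
require 'net/http'
require 'uri'

# Returns true only when the endpoint answers 200 within the time limit,
# mirroring how a load balancer probe would judge the instance.
# (Hypothetical helper for manual verification.)
def instance_healthy?(url, timeout_seconds: 2)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             open_timeout: timeout_seconds,
                             read_timeout: timeout_seconds) do |http|
    http.get(uri.request_uri)
  end
  response.code == '200'
rescue StandardError
  # Refused connections, timeouts, and DNS failures all count as unhealthy.
  false
end
```

For example, `instance_healthy?('http://<instance_ip>/_health')` with the placeholder replaced by a real address returns `true` or `false`.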
Monitoring DynamoDB Performance and Capacity on OVH
While OVH doesn’t directly manage AWS DynamoDB, many applications hosted on OVH will interact with AWS services. Effective monitoring of DynamoDB is crucial for application performance and cost management. We’ll focus on key metrics and how to collect them.
Key DynamoDB Metrics to Monitor
- ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits: Essential for understanding actual usage versus provisioned capacity. Spikes here can indicate performance bottlenecks or inefficient queries.
- ProvisionedReadCapacityUnits and ProvisionedWriteCapacityUnits: Track these to ensure you’re not over-provisioning (wasting money) or under-provisioning (causing throttling).
- ThrottledRequests: A direct indicator of insufficient capacity. High throttling rates mean requests are being rejected, impacting application responsiveness.
- SuccessfulRequestLatency: Measures the time taken for successful requests, reported in milliseconds. High latency can point to inefficient scans, large items, or hot partitions.
- SystemErrors: Count of internal server errors from DynamoDB.
- ConditionalCheckFailedRequests: Counts conditional writes whose condition evaluated to false. Some failures are expected (for example, with optimistic locking), but a sudden rise can point to application logic errors.
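Consumed and provisioned capacity are only comparable once they are in the same unit. CloudWatch reports consumed capacity as a total per period, so the Sum has to be divided by the period length to get units per second. A small sketch of that arithmetic, with a hypothetical method name:

```ruby
# Converts a CloudWatch Sum of consumed capacity units over one period into
# an average per-second rate, then compares it with provisioned capacity.
# Returns a fraction: 0.9 means 90% of provisioned capacity is in use.
def capacity_utilization(consumed_sum:, period_seconds:, provisioned_units:)
  raise ArgumentError, 'period_seconds must be positive' unless period_seconds.positive?
  raise ArgumentError, 'provisioned_units must be positive' unless provisioned_units.positive?

  consumed_per_second = consumed_sum.to_f / period_seconds
  consumed_per_second / provisioned_units
end
```

For instance, 27,000 units consumed over a 300-second period is 90 units per second, which is 90% of a table provisioned at 100 read capacity units.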
Collecting DynamoDB Metrics with AWS CloudWatch
AWS CloudWatch is the primary service for collecting and visualizing DynamoDB metrics. You can access these metrics via the AWS Management Console, AWS CLI, or SDKs.
Using AWS CLI for Metric Retrieval
You can fetch specific metrics using the AWS CLI. This is useful for scripting or integrating with external monitoring tools.
# Get consumed read capacity for a table in the last 5 minutes.
# Use Sum and divide by the period (300 s) to get average units consumed per second.
# Note: `date -d` is GNU date syntax (Linux).
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ConsumedReadCapacityUnits \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Sum \
  --dimensions Name=TableName,Value=YourTableName \
  --region us-east-1
# Get throttled requests for a table in the last hour.
# ThrottledRequests is reported per operation, so include the Operation
# dimension (PutItem here is an example; repeat for the operations you use).
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ThrottledRequests \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 600 \
  --statistics Sum \
  --dimensions Name=TableName,Value=YourTableName Name=Operation,Value=PutItem \
  --region us-east-1
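When scripting around these commands, the JSON output has to be parsed before it can be used. The sketch below assumes the CLI was run with JSON output (the default unless configured otherwise) and that the response has the usual `Datapoints` array; the method name is hypothetical. CloudWatch does not guarantee datapoint order, so the sketch sorts by timestamp.

```ruby
require 'json'
require 'time'

# Parses `aws cloudwatch get-metric-statistics` JSON output and returns
# [timestamp, value] pairs sorted by time for the requested statistic.
# (Hypothetical helper; `statistic` must match the --statistics flag used.)
def datapoints_from_cli(json_text, statistic: 'Sum')
  payload = JSON.parse(json_text)
  points = payload.fetch('Datapoints', []).map do |point|
    [Time.parse(point.fetch('Timestamp')), point.fetch(statistic).to_f]
  end
  points.sort_by(&:first)
end
```

A script could capture the CLI output as a string and then total the hour with `datapoints_from_cli(output).sum { |_, value| value }`.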
Integrating CloudWatch Metrics with External Monitoring Systems
For a unified view, especially when hosting on OVH, you’ll want to pull CloudWatch metrics into your primary monitoring system (e.g., Prometheus, Datadog, Grafana). The Prometheus AWS Exporter is a common choice.
Prometheus AWS Exporter Configuration
Deploy the AWS Exporter and configure it to scrape DynamoDB metrics. Ensure the IAM role or user associated with the exporter has permissions for cloudwatch:GetMetricStatistics and dynamodb:ListTables.
# Example prometheus.yml configuration snippet for AWS Exporter
scrape_configs:
  - job_name: 'aws-dynamodb'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['aws-exporter.example.com:9108'] # Replace with your exporter's address
    # params are sent as URL query parameters; they only take effect if
    # your exporter reads them.
    params:
      region: ['us-east-1'] # Specify your AWS region
      tables: ['YourTableName1', 'YourTableName2'] # List tables you want to monitor
    # A scrape config has no `metrics` key. Select the CloudWatch metrics
    # in the exporter's own configuration instead:
    #   - ConsumedReadCapacityUnits
    #   - ConsumedWriteCapacityUnits
    #   - ProvisionedReadCapacityUnits
    #   - ProvisionedWriteCapacityUnits
    #   - ThrottledRequests
    #   - SuccessfulRequestLatency
    #   - SystemErrors
    #   - ConditionalCheckFailedRequests
Once configured, you can query these metrics in Prometheus and visualize them in Grafana dashboards. This allows correlation of DynamoDB performance with your Ruby application’s behavior on OVH.
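To confirm that the exporter's metrics actually reach Prometheus, one option is to query the Prometheus HTTP API from a script. The sketch below builds an instant-query URL for the `/api/v1/query` endpoint and parses the JSON response; the base URL and method names are assumptions, and the HTTP request itself is left out so the parsing logic stays easy to test.

```ruby
require 'json'
require 'uri'

# Builds the URL for a Prometheus instant query.
def instant_query_url(base_url, promql)
  uri = URI.join(base_url, '/api/v1/query')
  uri.query = URI.encode_www_form(query: promql)
  uri.to_s
end

# Parses an instant-query response into an array of label/value hashes.
# Prometheus encodes each sample as [unix_timestamp, "value as string"].
def parse_instant_vector(json_text)
  body = JSON.parse(json_text)
  raise "query failed: #{body['error']}" unless body['status'] == 'success'

  body.dig('data', 'result').map do |series|
    { labels: series['metric'], value: series['value'][1].to_f }
  end
end
```

The response body for the built URL can be fetched with `Net::HTTP.get(URI(url))` and passed straight to `parse_instant_vector`.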
System-Level Monitoring on OVH Instances
Even with application-level and database monitoring, robust system-level metrics are foundational. For servers running your Ruby application on OVH, we need to monitor CPU, memory, disk I/O, and network traffic.
Essential System Metrics
- CPU Utilization: High CPU can indicate inefficient Ruby code, heavy background jobs, or insufficient instance sizing.
- Memory Usage: Monitor both RAM and swap. Excessive swapping is a strong indicator of memory pressure.
- Disk I/O Wait: High I/O wait times suggest storage bottlenecks, which can impact application responsiveness.
- Network Traffic: Monitor inbound and outbound traffic to detect unusual patterns or potential network saturation.
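- Load Average: The number of processes running or waiting to run (on Linux, also those in uninterruptible I/O wait), averaged over 1, 5, and 15 minutes. Compare it against the number of CPU cores.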
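As a concrete illustration of the memory metric, the sketch below computes used memory from the contents of Linux's /proc/meminfo as total minus available, divided by total. The method name is hypothetical, and it assumes a kernel that reports `MemAvailable`.

```ruby
# Computes the percentage of memory in use from /proc/meminfo text,
# using (MemTotal - MemAvailable) / MemTotal.
def memory_used_percent(meminfo_text)
  fields = meminfo_text.each_line.each_with_object({}) do |line, acc|
    name, value = line.split(':', 2)
    next if value.nil?

    # Values look like "   16000000 kB"; to_i reads the leading number.
    acc[name.strip] = value.to_i
  end

  total = fields.fetch('MemTotal')
  available = fields.fetch('MemAvailable')
  (total - available) * 100.0 / total
end
```

On a Linux host this can be called as `memory_used_percent(File.read('/proc/meminfo'))`.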
Implementing Node Exporter for Prometheus
Node Exporter is the de facto standard for collecting hardware and OS metrics for Prometheus. Deploying it on each OVH instance provides the necessary data.
Installation and Configuration (Ubuntu/Debian)
Download the latest release, extract it, and run it as a service.
# Download the latest release (check https://prometheus.io/download/ for latest version)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Create a systemd service file
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Create user and group for node_exporter
sudo groupadd --system prometheus
sudo useradd --system --no-create-home --shell /bin/false -g prometheus prometheus

# Copy the binary
sudo cp node_exporter /usr/local/bin/

# Reload systemd, enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify status
sudo systemctl status node_exporter
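Beyond checking the service status, it can be worth confirming that the exporter actually serves metrics on port 9100. The sketch below parses the Prometheus text exposition format and looks up one unlabelled metric such as `node_load1`. The method name is hypothetical, and the parser is deliberately simple: it ignores labelled series and does not handle special values like `NaN`.

```ruby
# Finds the value of an unlabelled metric in Prometheus text exposition
# output. Returns nil when the metric is not present.
# (Simplified: labelled series and NaN/Inf values are not handled.)
def metric_value(exposition_text, metric_name)
  exposition_text.each_line do |line|
    next if line.start_with?('#') # HELP and TYPE comment lines

    name, value = line.split(' ', 2)
    return Float(value.split.first) if name == metric_name && value
  end
  nil
end
```

The text can be fetched with `Net::HTTP.get(URI('http://localhost:9100/metrics'))` on the instance itself.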
Prometheus Scrape Configuration
Add a job to your Prometheus configuration to scrape the Node Exporter instances running on your OVH servers.
# prometheus.yml
scrape_configs:
  - job_name: 'node_exporter'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['ovh-server-1.example.com:9100', 'ovh-server-2.example.com:9100'] # Replace with your OVH server IPs/hostnames
Alerting Strategies for Production Readiness
Effective monitoring is incomplete without a robust alerting strategy. Alerts should be actionable, timely, and minimize noise. We’ll focus on setting up alerts for critical conditions across our Ruby app, DynamoDB, and OVH instances.
Alerting on Ruby Application Health
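Alert when the health check endpoint returns an error or times out. This indicates a severe application issue. The probe_success metric used here is exposed by the Prometheus Blackbox Exporter, so the rule assumes that exporter is probing the /_health endpoint.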
# Prometheus Alerting Rule
groups:
  - name: ruby_app_alerts
    rules:
      - alert: RubyAppUnhealthy
        expr: probe_success{job="your_ruby_app_job", instance="your_app_instance:port"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Ruby application on {{ $labels.instance }} is unhealthy."
          description: "The health check endpoint for {{ $labels.instance }} has failed for 5 minutes."
Additionally, monitor application-specific error rates or latency if your application exposes custom metrics (e.g., via a Prometheus client library).
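To make the idea of custom metrics concrete, here is a dependency-free sketch of a request counter that renders itself in the Prometheus text exposition format. It is a stand-in for illustration only; in a real application a maintained client library would normally do this work. The class name and the metric name `app_http_requests_total` are assumptions.

```ruby
# A minimal thread-safe counter, labelled by HTTP status, that renders
# itself in the Prometheus text exposition format.
# (Illustrative stand-in for a real Prometheus client library.)
class RequestCounter
  def initialize(name, help)
    @name = name
    @help = help
    @counts = Hash.new(0)
    @lock = Mutex.new
  end

  def increment(status:)
    @lock.synchronize { @counts[status.to_s] += 1 }
  end

  def render
    lines = ["# HELP #{@name} #{@help}", "# TYPE #{@name} counter"]
    @lock.synchronize do
      @counts.sort.each do |status, count|
        lines << %(#{@name}{status="#{status}"} #{count})
      end
    end
    lines.join("\n") + "\n"
  end
end
```

An error-rate alert could then be built on the rate of the series with `status="500"` relative to the total.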
Alerting on DynamoDB Capacity and Performance
Alert when DynamoDB is being throttled or experiencing high latency. These are direct indicators of potential user impact.
# Prometheus Alerting Rules for DynamoDB
# Metric and label names depend on your exporter; adjust them to match what
# it actually exposes. The job label must match the scrape job name.
groups:
  - name: dynamodb_alerts
    rules:
      - alert: DynamoDBThrottled
        expr: sum(aws_dynamodb_throttled_requests_sum{job="aws-dynamodb", table="YourTableName"}) by (table) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DynamoDB table {{ $labels.table }} is experiencing throttling."
          description: "Throttled requests for table {{ $labels.table }} have been detected for 10 minutes."
      - alert: DynamoDBHighLatency
        # CloudWatch reports SuccessfulRequestLatency in milliseconds, so 1000 = 1 second.
        expr: avg(aws_dynamodb_successful_request_latency_average{job="aws-dynamodb", table="YourTableName"}) by (table) > 1000 # Adjust threshold as needed
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High latency on DynamoDB table {{ $labels.table }}."
          description: "Average request latency for table {{ $labels.table }} has exceeded 1 second for 15 minutes."
      - alert: DynamoDBLowReadCapacity
        # Assumes consumed capacity is exported as a per-second rate. If your exporter
        # exposes the CloudWatch Sum instead, divide it by the period in seconds first.
        expr: avg(aws_dynamodb_consumed_read_capacity_units_average{job="aws-dynamodb", table="YourTableName"}) by (table) >= avg(aws_dynamodb_provisioned_read_capacity_units_average{job="aws-dynamodb", table="YourTableName"}) by (table) * 0.9
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "DynamoDB table {{ $labels.table }} is nearing its read capacity."
          description: "Consumed read capacity for table {{ $labels.table }} is consistently above 90% of provisioned capacity for 30 minutes."
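The `for:` clauses in these rules are what keep brief spikes from paging anyone: an alert fires only after its expression has been true at every evaluation for the whole duration. The sketch below is a simplified model of that behavior, not Prometheus's actual implementation; the method name and the fixed evaluation interval are assumptions.

```ruby
# Simplified model of a Prometheus `for:` clause.
# `samples` is the list of expression results (true/false), one per rule
# evaluation, oldest first. The alert fires once the condition has held
# continuously from the first true evaluation until `for_seconds` later.
def alert_firing?(samples, for_seconds:, interval_seconds:)
  # Covering for_seconds at a fixed interval takes one extra evaluation,
  # e.g. 600 s at 60 s intervals spans 11 evaluations.
  needed = (for_seconds.to_f / interval_seconds).ceil + 1
  return false if samples.length < needed

  samples.last(needed).all?
end
```

With a 60-second evaluation interval and `for: 10m`, a single false evaluation in the window resets the wait, which is why short throttling bursts would not trigger the DynamoDBThrottled rule.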
Alerting on System Resource Exhaustion
Alert on critical system resource thresholds to prevent application downtime due to infrastructure limitations.
# Prometheus Alerting Rules for System Resources
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}."
          description: "CPU usage on {{ $labels.instance }} has been above 80% for 10 minutes."
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}."
          description: "Memory usage on {{ $labels.instance }} has been above 90% for 10 minutes."
      - alert: HighDiskIOWait
        # node_disk_io_time_seconds_total measures how long the device was busy
        # (utilization). For CPU time spent waiting on I/O, use
        # node_cpu_seconds_total{mode="iowait"} instead.
        expr: avg by (instance) (rate(node_disk_io_time_seconds_total{device=~"sd.*|nvme.*"}[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O load on {{ $labels.instance }}."
          description: "Disks on {{ $labels.instance }} have been busy more than 50% of the time for 10 minutes."
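The HighCPUUsage expression can be read as "the average idle fraction across cores is below 20%". The sketch below works through the same arithmetic on two scrapes of the per-core idle counters; the method name and the sample numbers are illustrative assumptions.

```ruby
# Computes CPU usage as 1 minus the average per-core idle rate.
# idle_before / idle_after hold cumulative idle seconds per core at two
# scrapes taken elapsed_seconds apart.
def cpu_usage_fraction(idle_before, idle_after, elapsed_seconds)
  idle_rates = idle_before.zip(idle_after).map do |before, after|
    (after - before) / elapsed_seconds.to_f
  end
  1.0 - idle_rates.sum / idle_rates.length
end
```

With two cores that were idle for 30 and 60 seconds of a 300-second window, the idle rates are 0.1 and 0.2, the average is 0.15, and usage is 85%, which is above the rule's 80% threshold.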
Alertmanager Configuration
Configure Alertmanager to route these alerts to the appropriate teams via email, Slack, PagerDuty, etc. Ensure alert routing rules are well-defined to avoid alert fatigue.
# alertmanager.yml
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - receiver: 'critical-alerts'
      match:
        severity: 'critical'
      continue: true
    - receiver: 'warning-alerts'
      match:
        severity: 'warning'
      continue: true

receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#monitoring-alerts'
  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#oncall-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
  - name: 'warning-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#monitoring-alerts'
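Routing trees are easier to reason about with a small model. The sketch below mimics how a single level of child routes is evaluated: routes are tried in order, a match stops the search unless `continue` is set, and the parent's receiver is used when nothing matches. It is a simplified illustration with hypothetical names, not Alertmanager's implementation, and it ignores nested routes and regex matchers.

```ruby
# One child route: a receiver name, exact-match labels, and a continue flag.
Route = Struct.new(:receiver, :match, :continue)

# Returns the receivers an alert with the given labels would be sent to.
def receivers_for(labels, routes, default_receiver)
  matched = []
  routes.each do |route|
    next unless route.match.all? { |key, value| labels[key] == value }

    matched << route.receiver
    break unless route.continue # stop at first match unless continue is set
  end
  matched.empty? ? [default_receiver] : matched
end

ROUTES = [
  Route.new('critical-alerts', { 'severity' => 'critical' }, true),
  Route.new('warning-alerts', { 'severity' => 'warning' }, true)
].freeze
```

Under this model, an alert labelled `severity: info` (such as DynamoDBLowReadCapacity) matches neither child route and falls through to `default-receiver`.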