Server Monitoring Best Practices: Keeping Your Ruby App and MongoDB Clusters Alive on DigitalOcean
Proactive Health Checks for Ruby Applications
Maintaining the health of a Ruby application on DigitalOcean isn’t just about reacting to downtime; it’s about building a robust, proactive monitoring strategy. This involves deep inspection of application-level metrics, not just server resource utilization. We’ll focus on essential checks that can be implemented using readily available tools and custom scripts.
Application Performance Monitoring (APM) Integration
While not strictly “server monitoring,” integrating an APM solution is paramount for understanding application behavior. Tools like New Relic, AppSignal, or Scout APM provide invaluable insights into request latency, error rates, database query performance, and slow transactions. Configure these agents to send critical alerts for:
- Sustained high error rates (e.g., > 1% of requests).
- Unusually high average response times (e.g., > 500ms for critical endpoints).
- Specific transaction traces exceeding defined thresholds.
- Memory leaks or excessive garbage collection cycles.
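To make the first two thresholds concrete, here is a minimal Ruby sketch of the arithmetic an alert rule performs over a window of requests. The `RequestSample` struct, the `evaluate_window` function, and the default limits are hypothetical names chosen for illustration; they are not part of any APM vendor's API.

```ruby
# Hypothetical sample of one handled request: HTTP status and duration in ms.
RequestSample = Struct.new(:status, :duration_ms)

# Returns a list of human-readable breaches for a window of samples.
# The 1% error-rate and 500 ms latency limits mirror the examples in the list above.
def evaluate_window(samples, max_error_rate: 0.01, max_avg_ms: 500.0)
  return [] if samples.empty?

  errors = samples.count { |s| s.status >= 500 }
  error_rate = errors.to_f / samples.size
  avg_ms = samples.sum(&:duration_ms).to_f / samples.size

  breaches = []
  if error_rate > max_error_rate
    breaches << format('error rate %.1f%% exceeds %.1f%%', error_rate * 100, max_error_rate * 100)
  end
  if avg_ms > max_avg_ms
    breaches << format('average latency %.0f ms exceeds %.0f ms', avg_ms, max_avg_ms)
  end
  breaches
end

window = [
  RequestSample.new(200, 120),
  RequestSample.new(200, 90),
  RequestSample.new(500, 1500),
  RequestSample.new(200, 110)
]
puts evaluate_window(window)
```

In practice the APM service does this evaluation for you; the sketch only shows why a "sustained" window matters, since a single slow request barely moves the average but immediately moves the error rate.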
Custom Health Check Endpoints
Beyond APM, a dedicated health check endpoint within your Ruby application provides a simple, yet effective, way for external monitoring systems to verify basic functionality. This endpoint should ideally:
- Respond quickly (e.g., < 100ms).
- Check essential dependencies (e.g., database connectivity, Redis connection).
- Return a 200 OK status code on success and a non-2xx status code on failure.
Here’s a basic implementation using Sinatra, which can be easily adapted for Rails:
    require 'sinatra'
    require 'json'   # Needed for Hash#to_json
    require 'sequel' # Or your preferred DB adapter

    # Assume DB connection is established elsewhere and available as $DB
    # $DB = Sequel.connect('postgres://user:password@host:port/database')

    get '/health' do
      content_type :json
      begin
        # Check database connectivity (test_connection raises on failure,
        # which is handled by the rescue below)
        if $DB.nil? || !$DB.test_connection
          status 503
          return { status: 'error', message: 'Database connection failed' }.to_json
        end

        # Add other critical dependency checks here (e.g., Redis, external APIs)
        # redis_client = Redis.new(url: ENV['REDIS_URL'])
        # unless redis_client.ping == 'PONG'
        #   status 503
        #   return { status: 'error', message: 'Redis connection failed' }.to_json
        # end

        status 200
        { status: 'ok', message: 'Application is healthy' }.to_json
      rescue StandardError => e
        status 500
        { status: 'error', message: "Internal server error: #{e.message}" }.to_json
      end
    end
This endpoint can then be polled by external monitoring services like UptimeRobot, Pingdom, or even a custom Nagios/Prometheus check.
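If you would rather script the poll yourself, the following is a minimal sketch using only the Ruby standard library. The `ProbeResult` struct and the `classify_probe` and `probe` functions are hypothetical names, and the sketch assumes the endpoint returns the JSON shape shown above (`{"status": "ok"}` on success).

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical result type for one probe of the health endpoint.
ProbeResult = Struct.new(:healthy, :reason)

# Pure decision logic: healthy only on a 200 with status "ok" within the time budget.
def classify_probe(code, body, elapsed_s, max_elapsed_s: 0.1)
  return ProbeResult.new(false, "HTTP #{code}") unless code.to_i == 200
  return ProbeResult.new(false, format('slow response: %.3fs', elapsed_s)) if elapsed_s > max_elapsed_s

  payload = JSON.parse(body.to_s)
  status = payload.is_a?(Hash) ? payload['status'] : nil
  return ProbeResult.new(false, "status=#{status.inspect}") unless status == 'ok'

  ProbeResult.new(true, 'ok')
rescue JSON::ParserError
  ProbeResult.new(false, 'response body is not valid JSON')
end

# Performs one HTTP GET with short timeouts and classifies the outcome.
def probe(url, timeout_s: 2)
  uri = URI(url)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: timeout_s,
                             read_timeout: timeout_s) do |http|
    http.get(uri.request_uri)
  end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  classify_probe(response.code, response.body, elapsed, max_elapsed_s: timeout_s)
rescue SystemCallError, SocketError, Timeout::Error => e
  # Connection refused, DNS failure, or timeout all count as unhealthy.
  ProbeResult.new(false, "#{e.class}: #{e.message}")
end
```

A call such as `probe('http://localhost:4567/health')` (4567 being Sinatra's default port, an assumption about your setup) returns a `ProbeResult` you can log or turn into an exit code for a Nagios-style check. Keeping the decision logic separate from the network call makes it easy to unit test.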
Log Aggregation and Analysis
Centralized logging is non-negotiable. Deploying a log shipper like Fluentd, Filebeat, or Logstash to collect logs from your Ruby application instances and forward them to a centralized store (e.g., Elasticsearch, Loki, Splunk) is crucial. Configure your log shipper to:
- Collect application logs (e.g., production.log, error logs).
- Collect system logs (e.g., syslog, auth.log).
- Parse logs to extract structured data (timestamps, log levels, request IDs).
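Parsing is far easier when the application emits structured logs in the first place. As a minimal sketch, the standard library `Logger` can be given a formatter that writes one JSON object per line. The `build_json_logger` helper and the field names (`ts`, `level`, `request_id`, `msg`) are illustrative choices, not a standard your shipper requires.

```ruby
require 'logger'
require 'json'
require 'time'
require 'stringio'

# Builds a Logger that writes one JSON object per line to the given IO.
def build_json_logger(io)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, msg|
    # Accept either a Hash of fields or a plain message string.
    payload = msg.is_a?(Hash) ? msg : { msg: msg.to_s }
    JSON.generate({ ts: time.getutc.iso8601(3), level: severity }.merge(payload)) + "\n"
  end
  logger
end

out = StringIO.new # Stand-in for a log file or $stdout
log = build_json_logger(out)
log.info({ request_id: 'req-123', msg: 'GET /health 200' })
log.error('Database connection failed')
puts out.string
```

With output in this shape, the shipper can decode each line as JSON instead of relying on regular expressions, and the level and request ID arrive as separate fields in the central store.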
In your monitoring dashboard (e.g., Kibana, Grafana), set up alerts for:
- High frequency of ERROR or FATAL log messages.
- Specific critical error patterns (e.g., database connection errors, authentication failures).
- Sudden spikes in log volume.
Resource Monitoring with Prometheus and Node Exporter
While DigitalOcean provides basic resource graphs, a more granular and alertable system is needed. Prometheus, coupled with node_exporter, offers a powerful solution for collecting system-level metrics.
Setting up Node Exporter
On each of your Ruby application servers (and MongoDB nodes), install and run node_exporter. This exposes a metrics endpoint typically on port 9100.
    # Download the latest release (adjust version as needed)
    wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
    tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
    cd node_exporter-1.7.0.linux-amd64
    # Run it (consider running as a systemd service for production)
    ./node_exporter
For production, create a systemd service file (e.g., /etc/systemd/system/node_exporter.service):
    [Unit]
    Description=Node Exporter
    Wants=network-online.target
    After=network-online.target

    [Service]
    User=node_exporter
    Group=node_exporter
    Type=simple
    # Adjust path if installed elsewhere (systemd does not allow trailing comments on a directive line)
    ExecStart=/usr/local/bin/node_exporter

    [Install]
    WantedBy=multi-user.target
Then create the service user, install the binary, and enable and start the service (run from the extracted release directory):
    sudo useradd -rs /bin/false node_exporter
    sudo mv node_exporter /usr/local/bin/
    sudo systemctl daemon-reload
    sudo systemctl enable node_exporter
    sudo systemctl start node_exporter
Configuring Prometheus Scrape Targets
Configure your Prometheus server to scrape these endpoints. In your prometheus.yml:
    scrape_configs:
      - job_name: 'node_exporter'
        static_configs:
          # Labels apply to every target in a group, so app and db hosts get separate groups
          - targets: ['your_ruby_app_server_1:9100', 'your_ruby_app_server_2:9100']
            labels:
              env: 'production'
              role: 'app'
          - targets: ['your_mongodb_node_1:9100', 'your_mongodb_node_2:9100']
            labels:
              env: 'production'
              role: 'db'
Key Metrics to Monitor and Alert On
With Prometheus collecting data, set up alerts in Alertmanager for critical system metrics:
- CPU Usage: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90 (CPU utilization above 90%, averaged over 5 minutes).
- Memory Usage: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90 (High memory utilization).
- Disk I/O Saturation: rate(node_disk_io_time_seconds_total[5m]) > 0.8 (A device busy with I/O more than 80% of the time. This measures device utilization; for CPU time spent waiting on I/O, look at node_cpu_seconds_total{mode="iowait"} instead).
- Network Traffic: Monitor for unusual spikes or drops in network throughput.
- Filesystem Usage: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10 (Low disk space remaining).
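The CPU and memory expressions are easier to tune once the arithmetic behind them is clear. The sketch below reproduces it in plain Ruby on made-up numbers. The function names are hypothetical, and the calculation is simplified: the real `rate()` function also handles counter resets and extrapolates to the window edges.

```ruby
# Each sample maps a CPU core to its cumulative idle seconds,
# which is what node_cpu_seconds_total{mode="idle"} reports per core.
def cpu_usage_percent(earlier, later, interval_s)
  # Per-core idle rate: idle seconds gained per wall-clock second (0.0 to 1.0).
  idle_rates = earlier.keys.map do |cpu|
    (later.fetch(cpu) - earlier.fetch(cpu)) / interval_s.to_f
  end
  avg_idle = idle_rates.sum / idle_rates.size
  100 - avg_idle * 100
end

# Mirrors (MemTotal - MemAvailable) / MemTotal * 100.
def memory_usage_percent(total_bytes, available_bytes)
  (total_bytes - available_bytes).to_f / total_bytes * 100
end

earlier = { 'cpu0' => 1000.0, 'cpu1' => 2000.0 }
later   = { 'cpu0' => 1030.0, 'cpu1' => 2015.0 } # Sampled 300 seconds later

puts format('CPU usage: %.1f%%', cpu_usage_percent(earlier, later, 300))
puts format('Memory usage: %.1f%%', memory_usage_percent(4_000_000_000, 500_000_000))
```

Here the two cores were idle for 30 and 15 of the 300 seconds, so the average idle fraction is 7.5% and the usage figure crosses the 90% threshold, while memory at 87.5% stays just under it.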
MongoDB Cluster Monitoring with MongoDB Exporter
Monitoring MongoDB requires specific metrics related to database operations, replication, and performance. The mongodb_exporter is an excellent choice for this.
Setting up MongoDB Exporter
Similar to node_exporter, deploy mongodb_exporter on a server that can access your MongoDB instances (ideally not on the MongoDB nodes themselves to avoid resource contention, but can be co-located if necessary). It typically runs on port 9216.
# Download the latest release from the Percona repository (adjust version as needed;
# confirm the exact asset name on the project's releases page)
wget https://github.com/percona/mongodb_exporter/releases/download/v0.35.0/mongodb_exporter-0.35.0.linux-amd64.tar.gz
tar xvfz mongodb_exporter-0.35.0.linux-amd64.tar.gz
cd mongodb_exporter-0.35.0.linux-amd64
# Create a MongoDB user for monitoring
# Connect to your MongoDB instance (e.g., using mongosh)
# use admin
# db.createUser({ user: "monitor_user", pwd: "your_secure_password", roles: [ { role: "clusterMonitor", db: "admin" }, { role: "readAnyDatabase", db: "admin" } ] })
# Run the exporter, pointing to your MongoDB URI
./mongodb_exporter --mongodb.uri="mongodb://monitor_user:your_secure_password@your_mongodb_host:27017/?authSource=admin"
For production, create a systemd service file similar to the node_exporter example, ensuring the --mongodb.uri flag is correctly configured.
Configuring Prometheus to Scrape MongoDB Exporter
Add a new job to your prometheus.yml:
    scrape_configs:
      # ...existing jobs (e.g., 'node_exporter') stay here...
      - job_name: 'mongodb_exporter'
        static_configs:
          - targets: ['your_mongodb_exporter_host:9216']
            labels:
              env: 'production'
              cluster: 'main_mongo_cluster'
If you have multiple MongoDB clusters or instances, you’ll need to adjust the targets and potentially use service discovery or more complex configuration.
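One lightweight form of service discovery is Prometheus's file-based discovery, where a job's `file_sd_configs` entry points at JSON files listing target groups. As a minimal sketch, the Ruby script below generates such a file from an inventory hash. The `file_sd_groups` function, the inventory contents, and the second cluster name are hypothetical placeholders.

```ruby
require 'json'

# Converts a hypothetical inventory (cluster name => exporter addresses)
# into the list-of-groups structure that file-based discovery reads:
# each group has "targets" and an optional "labels" map.
def file_sd_groups(inventory, env: 'production')
  inventory.map do |cluster, targets|
    { 'targets' => targets, 'labels' => { 'env' => env, 'cluster' => cluster } }
  end
end

inventory = {
  'main_mongo_cluster' => ['your_mongodb_exporter_host:9216'],
  'analytics_mongo_cluster' => ['your_other_exporter_host:9216']
}

# Write this output to a file referenced by the job's file_sd_configs.
puts JSON.pretty_generate(file_sd_groups(inventory))
```

Prometheus watches the referenced files for changes, so regenerating the file when a cluster is added avoids editing and reloading prometheus.yml by hand.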
Essential MongoDB Metrics and Alerts
Key metrics to monitor for MongoDB clusters are listed below. Exact metric names vary between mongodb_exporter versions and flags, so confirm them against your exporter's /metrics output before writing rules:
- Replication Lag: mongodb_replset_member_oplog_lag_seconds (Alert if lag exceeds a few minutes).
- Connection Count: mongodb_connections_current (Alert on unusually high or low connection counts).
- Query Performance: mongodb_opcounters_query (Monitor rates of different query types). Operation counters show volume, not latency, so pair them with MongoDB's slow query log or profiler to spot slow queries.
- Lock Percentage: mongodb_locks_percentage (High global lock percentages indicate contention).
- Disk Usage: mongodb_storage_data_size_bytes (Monitor data size growth).
- Network Traffic: mongodb_network_bytes_in_total, mongodb_network_bytes_out_total.
- OOM Killer Events: While not directly from mongodb_exporter, monitor system logs for OOM killer events on MongoDB nodes using your log aggregation system.
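Because metric names differ between exporter versions, it helps to check what your exporter actually exposes before writing alert rules. The following is a minimal sketch of a parser for the Prometheus text exposition format. It is deliberately simplified (it ignores timestamps and assumes label values contain no escaped quotes), and the sample input is made up for illustration.

```ruby
# Matches "name{labels} value" or "name value" at the start of a line.
SAMPLE_LINE = /\A([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/

# Prometheus allows NaN and +/-Inf, which Ruby's Float() does not parse.
def parse_value(raw)
  case raw
  when 'NaN' then Float::NAN
  when '+Inf', 'Inf' then Float::INFINITY
  when '-Inf' then -Float::INFINITY
  else Float(raw)
  end
end

# Returns [name, labels_hash, value] triples, skipping comments and blank lines.
def parse_metrics(text)
  text.each_line.filter_map do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#')

    match = SAMPLE_LINE.match(line)
    next unless match

    labels = (match[2] || '').scan(/([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/).to_h
    [match[1], labels, parse_value(match[3])]
  end
end

# Hypothetical excerpt of an exporter's /metrics output.
sample = <<~METRICS
  # HELP mongodb_connections_current Hypothetical sample metric
  # TYPE mongodb_connections_current gauge
  mongodb_connections_current{instance="db1"} 42
  mongodb_up 1
METRICS

puts parse_metrics(sample).map(&:first).uniq
```

Feeding the body of a real `/metrics` response into `parse_metrics` gives a quick list of metric names to compare against the ones used in your rules.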
Alerting Strategy with Alertmanager
Prometheus rules define when alerts fire, and Alertmanager handles deduplication, grouping, and routing of these alerts to the appropriate channels (e.g., Slack, PagerDuty, email). A well-defined alerting strategy is crucial to avoid alert fatigue.
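To illustrate what deduplication and grouping mean in practice, here is a conceptual Ruby sketch. It is not how Alertmanager is implemented; it only mimics the effect of a route's `group_by` setting on a batch of alerts. The `group_alerts` function and the alert data are hypothetical.

```ruby
# Each alert is represented by its label set. Alerts with identical labels
# are the same alert, so duplicates collapse; the rest are bucketed by the
# values of the chosen grouping labels.
def group_alerts(alerts, group_by)
  alerts.uniq.group_by { |alert| alert.values_at(*group_by) }
end

alerts = [
  { alertname: 'HighCpuUsage', instance: 'app1:9100', severity: 'warning' },
  { alertname: 'HighCpuUsage', instance: 'app2:9100', severity: 'warning' },
  { alertname: 'HighCpuUsage', instance: 'app1:9100', severity: 'warning' }, # Duplicate
  { alertname: 'HighReplicationLag', instance: 'db2:9216', severity: 'critical' }
]

# One notification per group instead of one per alert.
group_alerts(alerts, %i[alertname severity]).each do |key, members|
  puts "#{key.join('/')}: #{members.map { |a| a[:instance] }.join(', ')}"
end
```

Four incoming alerts become two notifications: one covering both app servers with high CPU, and one for the lagging replica. Choosing grouping labels that match how your team triages (by alert name, cluster, or environment) is the main lever against alert fatigue.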
Example Prometheus Alerting Rules (Ruby App)
    groups:
      - name: ruby_app_alerts
        rules:
          - alert: HighCpuUsage
            expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle", job="node_exporter"}[5m])) * 100) > 90
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "Instance {{ $labels.instance }} has been running at over 90% CPU for 5 minutes."
          - alert: HighMemoryUsage
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High Memory Usage on {{ $labels.instance }}"
              description: "Instance {{ $labels.instance }} is using over 90% of memory for 5 minutes."
          - alert: AppHealthCheckFailed
            # 'up' only reports whether Prometheus could scrape the target, so this rule
            # assumes a scrape job named 'your_ruby_app_job' exists for the application.
            # This is a simplified example. To alert on non-2xx responses from /health,
            # a more robust approach is to probe the endpoint with the blackbox exporter.
            expr: up{job="your_ruby_app_job"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Ruby application health check failed on {{ $labels.instance }}"
              description: "Prometheus could not scrape {{ $labels.instance }} for 2 minutes; the application is unreachable or returning an error."
Example Prometheus Alerting Rules (MongoDB)
    groups:
      - name: mongodb_alerts
        rules:
          - alert: HighReplicationLag
            expr: mongodb_replset_member_oplog_lag_seconds{job="mongodb_exporter"} > 300 # 5 minutes
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "High MongoDB replication lag on {{ $labels.instance }}"
              description: "Replica set member {{ $labels.instance }} has replication lag of over 5 minutes."
          - alert: HighConnectionCount
            expr: mongodb_connections_current{job="mongodb_exporter"} > 1000 # Adjust threshold based on your capacity
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High MongoDB connection count on {{ $labels.instance }}"
              description: "Instance {{ $labels.instance }} has exceeded 1000 active connections."
          - alert: HighLockPercentage
            expr: mongodb_locks_percentage{job="mongodb_exporter", lock_type="Global"} > 50 # Adjust threshold
            for: 3m
            labels:
              severity: critical
            annotations:
              summary: "High MongoDB Global Lock Percentage on {{ $labels.instance }}"
              description: "Instance {{ $labels.instance }} is experiencing high global lock contention (over 50%)."
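For intuition about the HighReplicationLag rule, replication lag is the gap between the last operation applied on the primary and the last one applied on each secondary. The Ruby sketch below computes it from a trimmed-down, hypothetical member list (loosely modeled on the state and optime fields that a replica set status report contains); the `Member` struct and `replication_lag_seconds` function are illustrative names.

```ruby
# Hypothetical view of a replica set member: name, state, and the
# timestamp of the last operation it applied.
Member = Struct.new(:name, :state, :optime)

# Lag of each secondary = primary optime - secondary optime, in whole seconds.
def replication_lag_seconds(members)
  primary = members.find { |m| m.state == 'PRIMARY' }
  return {} unless primary # No primary (e.g., mid-election): lag is undefined

  members.select { |m| m.state == 'SECONDARY' }
         .to_h { |m| [m.name, (primary.optime - m.optime).round] }
end

members = [
  Member.new('db1:27017', 'PRIMARY',   Time.utc(2024, 1, 1, 12, 0, 0)),
  Member.new('db2:27017', 'SECONDARY', Time.utc(2024, 1, 1, 11, 59, 58)),
  Member.new('db3:27017', 'SECONDARY', Time.utc(2024, 1, 1, 11, 53, 0))
]

lags = replication_lag_seconds(members)
# Same threshold as the alert rule: more than 300 seconds behind.
puts lags.select { |_name, lag| lag > 300 }
```

In this made-up data, db2 is 2 seconds behind and db3 is 420 seconds behind, so only db3 would trip the 300-second threshold.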
DigitalOcean Specific Considerations
When deploying on DigitalOcean, remember to:
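- Firewall Rules: Ensure your DigitalOcean Cloud Firewalls or UFW rules allow your Prometheus server to reach the monitoring ports (e.g., 9100 for node_exporter, 9216 for mongodb_exporter, and the application port). Restrict these ports to the Prometheus server's address rather than opening them publicly, since the exporters serve metrics without authentication by default.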
- Droplet Sizing: Monitor resource utilization closely to right-size your Droplets. Over-provisioning is costly, while under-provisioning leads to performance issues and alerts.
- Managed Databases: If you opt for DigitalOcean’s Managed MongoDB, the monitoring and alerting capabilities are built-in, simplifying some aspects. However, you’ll still need to monitor your application servers and potentially integrate with their metrics.
- VPC Networking: If your MongoDB cluster and application servers are in different VPCs or private networks, ensure proper routing and firewall configurations are in place for monitoring agents to communicate.
Conclusion
A comprehensive server monitoring strategy for your Ruby applications and MongoDB clusters on DigitalOcean involves a multi-layered approach. Combine application-level insights from APM and custom health checks with robust system metrics from Prometheus and specialized exporters. Centralized logging and a well-tuned alerting system are essential for proactive issue detection and rapid response, ensuring the stability and availability of your critical services.