Server Monitoring Best Practices: Keeping Your Ruby App and MongoDB Clusters Alive on DigitalOcean

Proactive Health Checks for Ruby Applications

Maintaining the health of a Ruby application on DigitalOcean isn’t just about reacting to downtime; it’s about building a robust, proactive monitoring strategy. This involves deep inspection of application-level metrics, not just server resource utilization. We’ll focus on essential checks that can be implemented using readily available tools and custom scripts.

Application Performance Monitoring (APM) Integration

While not strictly “server monitoring,” integrating an APM solution is paramount for understanding application behavior. Tools like New Relic, AppSignal, or Scout APM provide invaluable insights into request latency, error rates, database query performance, and slow transactions. Configure these agents to send critical alerts for:

Sustained high error rates (e.g., > 1% of requests).
Unusually high average response times (e.g., > 500ms for critical endpoints).
Specific transaction traces exceeding defined thresholds.
Memory leaks or excessive garbage collection cycles.
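
The first two thresholds above reduce to simple arithmetic over a window of request samples. The sketch below only illustrates that arithmetic; `RequestSample` and `breaches` are hypothetical names, not part of any APM agent's API, and a real APM tool computes these values for you.

```ruby
# Hypothetical request sample; real APM agents collect this data for you.
RequestSample = Struct.new(:duration_ms, :error)

# Returns the thresholds breached within one evaluation window.
# Defaults mirror the examples above: 1% error rate, 500 ms average latency.
def breaches(samples, max_error_rate: 0.01, max_avg_ms: 500.0)
  return [] if samples.empty?

  error_rate = samples.count(&:error).to_f / samples.size
  avg_ms = samples.sum(&:duration_ms).to_f / samples.size

  found = []
  found << :error_rate if error_rate > max_error_rate
  found << :avg_latency if avg_ms > max_avg_ms
  found
end

window = [
  RequestSample.new(120, false),
  RequestSample.new(900, true),
  RequestSample.new(700, false),
  RequestSample.new(400, false)
]
p breaches(window) # error rate 25%, average 530 ms
```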

Custom Health Check Endpoints

Beyond APM, a dedicated health check endpoint within your Ruby application provides a simple, yet effective, way for external monitoring systems to verify basic functionality. This endpoint should ideally:

Respond quickly (e.g., < 100ms).
Check essential dependencies (e.g., database connectivity, Redis connection).
Return a 200 OK status code on success and a non-2xx status code on failure.

Here’s a basic implementation using Sinatra, which can be easily adapted for Rails:

require 'sinatra'
require 'json'
require 'sequel' # Or your preferred DB adapter

# Assume DB connection is established elsewhere and available as $DB
# $DB = Sequel.connect('postgres://user:password@host:port/database')

get '/health' do
  content_type :json
  begin
    # Check database connectivity. Sequel's test_connection returns true on
    # success and raises on failure; the rescue below covers the raise.
    unless $DB && $DB.test_connection
      halt 503, { status: 'error', message: 'Database connection failed' }.to_json
    end

    # Add other critical dependency checks here (e.g., Redis, external APIs)
    # redis_client = Redis.new(url: ENV['REDIS_URL'])
    # unless redis_client.ping == 'PONG'
    #   halt 503, { status: 'error', message: 'Redis connection failed' }.to_json
    # end

    status 200
    { status: 'ok', message: 'Application is healthy' }.to_json
  rescue StandardError => e
    # Log the details server-side; expose only the error class to callers
    warn "Health check failed: #{e.class}: #{e.message}"
    status 503
    { status: 'error', message: "Dependency check failed: #{e.class}" }.to_json
  end
end

This endpoint can then be polled by external monitoring services like UptimeRobot, Pingdom, or even a custom Nagios/Prometheus check.
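
For a custom check, the polling logic could look roughly like the following sketch. It uses only Ruby's standard library; `classify_probe` and `probe` are hypothetical helper names, and the 100 ms budget is taken from the response-time target above rather than from any monitoring product.

```ruby
require 'net/http'
require 'uri'

# Classifies one probe result: non-2xx is :down, a slow 2xx is :slow.
def classify_probe(status_code, elapsed_ms, budget_ms: 100)
  return :down unless (200..299).cover?(status_code)

  elapsed_ms > budget_ms ? :slow : :healthy
end

# Performs one probe against a health URL and returns [state, elapsed_ms].
# Connection errors and timeouts are reported as [:down, nil].
def probe(url, timeout_s: 2)
  uri = URI(url)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: timeout_s,
                             read_timeout: timeout_s) do |http|
    http.get(uri.request_uri)
  end
  elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
  [classify_probe(response.code.to_i, elapsed_ms), elapsed_ms.round]
rescue StandardError
  [:down, nil]
end
```

Calling `probe('http://your_ruby_app_server_1/health')` from a cron job or a small loop would give a state that can be forwarded to whatever alerting channel you already use; the host name here is a placeholder.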

Log Aggregation and Analysis

Centralized logging is non-negotiable. Deploying a log shipper like Fluentd, Filebeat, or Logstash to collect logs from your Ruby application instances and forward them to a centralized store (e.g., Elasticsearch, Loki, Splunk) is crucial. Configure your log shipper to:

Collect application logs (e.g., production.log, error logs).
Collect system logs (e.g., syslog, auth.log).
Parse logs to extract structured data (timestamps, log levels, request IDs).
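
To make the parsing step concrete, here is a minimal sketch of extracting structured fields from one log line. It assumes the layout of Ruby's standard `Logger::Formatter` with an optional leading `[request-id]` tag; your application's format may differ, and in practice the log shipper's own parser would usually do this work. `parse_log_line` is a hypothetical helper name.

```ruby
# Assumed line layout (Ruby's standard Logger::Formatter):
#   E, [2024-05-01T12:00:00.123456 #4321] ERROR -- : [request-id] message
LINE_PATTERN = /
  \A[A-Z],\s
  \[(?<timestamp>\S+)\s*\#(?<pid>\d+)\]\s+
  (?<level>[A-Z]+)\s--\s(?<progname>[^:]*):\s
  (?:\[(?<request_id>[^\]]+)\]\s)?
  (?<message>.*)\z
/x

# Returns a hash of structured fields, or nil if the line does not match.
def parse_log_line(line)
  match = LINE_PATTERN.match(line.chomp)
  return nil unless match

  {
    timestamp: match[:timestamp],
    pid: match[:pid].to_i,
    level: match[:level],
    request_id: match[:request_id], # nil when the line carries no tag
    message: match[:message]
  }
end

sample = 'E, [2024-05-01T12:00:00.123456 #4321] ERROR -- : [3f2a9c] Database connection lost'
p parse_log_line(sample)
```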

In your monitoring dashboard (e.g., Kibana, Grafana), set up alerts for:

High frequency of ERROR or FATAL log messages.
Specific critical error patterns (e.g., database connection errors, authentication failures).
Sudden spikes in log volume.

Resource Monitoring with Prometheus and Node Exporter

While DigitalOcean provides basic resource graphs, a more granular and alertable system is needed. Prometheus, coupled with node_exporter, offers a powerful solution for collecting system-level metrics.

Setting up Node Exporter

On each of your Ruby application servers (and MongoDB nodes), install and run node_exporter. This exposes a metrics endpoint typically on port 9100.

# Download the latest release (adjust version as needed)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Run it (consider running as a systemd service for production)
./node_exporter

For production, create a systemd service file (e.g., /etc/systemd/system/node_exporter.service):

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
# Adjust the path if node_exporter is installed elsewhere
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Then create the service user, move the binary into place, and enable and start the service:

sudo useradd -rs /bin/false node_exporter
sudo mv node_exporter /usr/local/bin/
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
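
Once the service is running, it can be useful to confirm that the endpoint really serves metrics. The sketch below parses the Prometheus text exposition format into a hash; `parse_metrics` is a hypothetical helper, and it assumes sample lines carry no trailing timestamp, which is how node_exporter output typically looks.

```ruby
SPECIAL_VALUES = {
  'NaN' => Float::NAN, '+Inf' => Float::INFINITY, '-Inf' => -Float::INFINITY
}.freeze

# Parses exposition-format text into { "name{labels}" => Float }.
# Comment lines (# HELP, # TYPE) and blank lines are skipped.
def parse_metrics(text)
  text.each_line.with_object({}) do |line, samples|
    line = line.strip
    next if line.empty? || line.start_with?('#')

    # Split on the last space, since label values may contain spaces
    series, _, value = line.rpartition(' ')
    samples[series] = SPECIAL_VALUES.fetch(value) { Float(value) }
  end
end

sample = <<~METRICS
  # HELP node_load1 1m load average.
  # TYPE node_load1 gauge
  node_load1 0.42
  node_cpu_seconds_total{cpu="0",mode="idle"} 1.5e+06
METRICS

metrics = parse_metrics(sample)
puts "parsed #{metrics.size} samples, node_load1=#{metrics['node_load1']}"
```

Against a live server, the input text could come from `Net::HTTP.get(URI('http://localhost:9100/metrics'))`.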

Configuring Prometheus Scrape Targets

Configure your Prometheus server to scrape these endpoints. In your prometheus.yml:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      # One entry per role so each target gets the correct label
      - targets: ['your_ruby_app_server_1:9100', 'your_ruby_app_server_2:9100']
        labels:
          env: 'production'
          role: 'app'
      - targets: ['your_mongodb_node_1:9100', 'your_mongodb_node_2:9100']
        labels:
          env: 'production'
          role: 'db'
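
As the number of servers grows, the scrape configuration can be generated from an inventory instead of edited by hand. This is a minimal sketch using Ruby's standard `yaml` library; the `INVENTORY` hash and `node_exporter_job` helper are hypothetical, and the host names are the placeholders from the example above.

```ruby
require 'yaml'

# Hypothetical inventory keyed by role; host names are placeholders.
INVENTORY = {
  'app' => %w[your_ruby_app_server_1 your_ruby_app_server_2],
  'db' => %w[your_mongodb_node_1 your_mongodb_node_2]
}.freeze

# Builds one static_configs entry per role so every target carries
# the role label that matches its purpose.
def node_exporter_job(inventory, port: 9100, env: 'production')
  static_configs = inventory.map do |role, hosts|
    {
      'targets' => hosts.map { |host| "#{host}:#{port}" },
      'labels' => { 'env' => env, 'role' => role }
    }
  end
  { 'scrape_configs' => [{ 'job_name' => 'node_exporter', 'static_configs' => static_configs }] }
end

config = node_exporter_job(INVENTORY)
puts config.to_yaml
```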

Key Metrics to Monitor and Alert On

With Prometheus collecting data, define alerting rules in Prometheus for critical system metrics; Alertmanager then routes the resulting alerts:

CPU Usage: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90 (High CPU utilization for 5 minutes).
Memory Usage: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90 (High memory utilization).
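Disk I/O Utilization: rate(node_disk_io_time_seconds_total[5m]) > 0.8 (The device was busy more than 80% of the time. This measures device saturation; for CPU time spent waiting on I/O, use node_cpu_seconds_total{mode="iowait"} instead).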
Network Traffic: Monitor for unusual spikes or drops in network throughput.
Filesystem Usage: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10 (Low disk space remaining).
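
To sanity-check thresholds before committing them to a rule file, the same arithmetic can be worked through by hand. The sketch below mirrors the CPU, memory, and filesystem expressions above with made-up sample values; the helper names are hypothetical.

```ruby
# CPU busy percentage from the per-second idle rate (0.0 to 1.0).
def cpu_busy_percent(idle_rate)
  100 - idle_rate * 100
end

# Percentage of memory in use, from MemTotal and MemAvailable.
def memory_used_percent(total_bytes, available_bytes)
  (total_bytes - available_bytes).to_f / total_bytes * 100
end

# Percentage of filesystem space still available.
def filesystem_avail_percent(avail_bytes, size_bytes)
  avail_bytes.to_f / size_bytes * 100
end

GIB = 1024**3
cpu = cpu_busy_percent(0.25)                          # idle 25% of the time
mem = memory_used_percent(4 * GIB, 1 * GIB)           # 4 GiB total, 1 GiB available
disk = filesystem_avail_percent(4 * GIB, 80 * GIB)    # 4 GiB free of 80 GiB

puts format('cpu busy: %.1f%% (alert: %s)', cpu, cpu > 90)
puts format('memory used: %.1f%% (alert: %s)', mem, mem > 90)
puts format('disk available: %.1f%% (alert: %s)', disk, disk < 10)
```

With these sample values only the filesystem check would alert, since 5% available is below the 10% threshold.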

MongoDB Cluster Monitoring with MongoDB Exporter

Monitoring MongoDB requires specific metrics related to database operations, replication, and performance. The mongodb_exporter is an excellent choice for this.

Setting up MongoDB Exporter

Similar to node_exporter, deploy mongodb_exporter on a server that can access your MongoDB instances (ideally not on the MongoDB nodes themselves to avoid resource contention, but can be co-located if necessary). It typically runs on port 9216.

# Download a release of the Percona mongodb_exporter (adjust version as needed;
# confirm the exact asset name on the project's releases page)
wget https://github.com/percona/mongodb_exporter/releases/download/v0.35.0/mongodb_exporter-0.35.0.linux-amd64.tar.gz
tar xvfz mongodb_exporter-0.35.0.linux-amd64.tar.gz
cd mongodb_exporter-0.35.0.linux-amd64

# Create a MongoDB user for monitoring
# Connect to your MongoDB instance (e.g., using mongosh)
# use admin
# db.createUser({ user: "monitor_user", pwd: "your_secure_password", roles: [ { role: "clusterMonitor", db: "admin" }, { role: "readAnyDatabase", db: "admin" } ] })

# Run the exporter, pointing to your MongoDB URI
./mongodb_exporter --mongodb.uri="mongodb://monitor_user:your_secure_password@your_mongodb_host:27017/?authSource=admin"

For production, create a systemd service file similar to the node_exporter example, ensuring the --mongodb.uri flag is correctly configured.
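
One detail that often trips up the URI: special characters in the username or password (such as `@`, `:` or `/`) must be percent-encoded, or the connection string will not parse. A minimal sketch of building the URI safely with Ruby's standard library follows; `mongodb_uri` is a hypothetical helper and the credentials are placeholders.

```ruby
require 'erb'

# Builds a MongoDB connection URI, percent-encoding the credentials so that
# reserved characters in them do not break URI parsing.
def mongodb_uri(user:, password:, host:, port: 27017, auth_source: 'admin')
  encoded_user = ERB::Util.url_encode(user)
  encoded_password = ERB::Util.url_encode(password)
  "mongodb://#{encoded_user}:#{encoded_password}@#{host}:#{port}/?authSource=#{auth_source}"
end

puts mongodb_uri(user: 'monitor_user', password: 'p@ss:w/rd', host: 'your_mongodb_host')
# prints mongodb://monitor_user:p%40ss%3Aw%2Frd@your_mongodb_host:27017/?authSource=admin
```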

Configuring Prometheus to Scrape MongoDB Exporter

Add a new job to your prometheus.yml:

scrape_configs:
  - job_name: 'mongodb_exporter'
    static_configs:
      - targets: ['your_mongodb_exporter_host:9216']
        labels:
          env: 'production'
          cluster: 'main_mongo_cluster'

If you have multiple MongoDB clusters or instances, you’ll need to adjust the targets and potentially use service discovery or more complex configuration.

Essential MongoDB Metrics and Alerts

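Key metrics to monitor for MongoDB clusters (metric names differ between exporter versions and flags, so confirm each one against the exporter’s /metrics output before alerting on it):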

Replication Lag: mongodb_replset_member_oplog_lag_seconds (Alert if lag exceeds a few minutes).
Connection Count: mongodb_connections_current (Alert on unusually high or low connection counts).
Operation Rates: mongodb_opcounters_query (Monitor rates of different operation types). Opcounters measure throughput only; to find slow queries, use the MongoDB profiler or the slow query log.
Lock Percentage: mongodb_locks_percentage (High global lock percentages indicate contention).
Disk Usage: mongodb_storage_data_size_bytes (Monitor data size growth).
Network Traffic: mongodb_network_bytes_in_total, mongodb_network_bytes_out_total.
OOM Killer Events: While not directly from mongodb_exporter, monitor system logs for OOM killer events on MongoDB nodes using your log aggregation system.
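
Replication lag is the least intuitive metric in this list: it is the difference between the primary's latest operation time and each secondary's. The sketch below shows that calculation on sample data shaped loosely like the members of a `replSetGetStatus` result; the hash keys and host names are illustrative assumptions, not the exporter's implementation.

```ruby
# Computes per-secondary lag in whole seconds relative to the primary.
# Each member hash mimics the name, state and optime fields of replSetGetStatus.
def replication_lag_seconds(members)
  primary = members.find { |m| m[:state] == 'PRIMARY' }
  return {} unless primary

  members
    .select { |m| m[:state] == 'SECONDARY' }
    .to_h { |m| [m[:name], (primary[:optime] - m[:optime]).round] }
end

members = [
  { name: 'mongo-1:27017', state: 'PRIMARY',   optime: Time.utc(2024, 5, 1, 12, 0, 0) },
  { name: 'mongo-2:27017', state: 'SECONDARY', optime: Time.utc(2024, 5, 1, 11, 59, 58) },
  { name: 'mongo-3:27017', state: 'SECONDARY', optime: Time.utc(2024, 5, 1, 11, 53, 20) }
]

lag = replication_lag_seconds(members)
lag.each do |name, seconds|
  puts "#{name}: #{seconds}s behind#{' (over 5 minutes)' if seconds > 300}"
end
```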

Alerting Strategy with Alertmanager

Prometheus rules define when alerts fire, and Alertmanager handles deduplication, grouping, and routing of these alerts to the appropriate channels (e.g., Slack, PagerDuty, email). A well-defined alerting strategy is crucial to avoid alert fatigue.
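
To see why grouping reduces noise, consider the following conceptual sketch. It collapses duplicate alerts and groups the rest by a chosen set of labels, similar in spirit to Alertmanager's `group_by` setting; it is an illustration only, not how Alertmanager is implemented, and `group_alerts` is a hypothetical helper.

```ruby
# Drops exact duplicates, then groups alerts by the given label names.
# Each resulting group would become a single notification.
def group_alerts(alerts, group_by:)
  alerts.uniq.group_by { |labels| labels.values_at(*group_by) }
end

alerts = [
  { alertname: 'HighCpuUsage', instance: 'app-1:9100', severity: 'warning' },
  { alertname: 'HighCpuUsage', instance: 'app-2:9100', severity: 'warning' },
  { alertname: 'HighCpuUsage', instance: 'app-1:9100', severity: 'warning' }, # duplicate
  { alertname: 'HighReplicationLag', instance: 'mongo-2:9216', severity: 'critical' }
]

groups = group_alerts(alerts, group_by: %i[alertname severity])
groups.each do |key, grouped|
  puts "#{key.join('/')}: #{grouped.size} alert(s) in one notification"
end
```

Four incoming alerts become two notifications here, which is the effect a sensible `group_by` has on a noisy incident.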

Example Prometheus Alerting Rules (Ruby App)

groups:
- name: ruby_app_alerts
  rules:
  - alert: HighCpuUsage
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle", job="node_exporter"}[5m])) * 100) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "Instance {{ $labels.instance }} has been running at over 90% CPU for 5 minutes."

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High Memory Usage on {{ $labels.instance }}"
      description: "Instance {{ $labels.instance }} is using over 90% of memory for 5 minutes."

  - alert: AppHealthCheckFailed
    # `up` is 0 when Prometheus cannot scrape the target, so this rule only
    # detects an unreachable metrics endpoint for the job named below.
    # To alert on /health returning a non-2xx status, probe it with the
    # blackbox exporter and alert on probe_success == 0 instead.
    expr: up{job="your_ruby_app_job"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Ruby application health check failed on {{ $labels.instance }}"
      description: "The health check endpoint for {{ $labels.instance }} is unreachable or returning an error."
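
The `for` clause in these rules is what keeps brief spikes from paging anyone: an alert stays pending until its expression has been true continuously for the stated duration. The following sketch simulates that behavior over a series of evaluations; it is a simplified model for intuition, and `alert_states` is a hypothetical helper rather than Prometheus code.

```ruby
# Simulates the `for` clause. Each evaluation is [timestamp_seconds, condition_true].
# An alert is :pending once the condition holds and :firing only after it has
# held without interruption for at least for_seconds.
def alert_states(evaluations, for_seconds:)
  active_since = nil
  evaluations.map do |timestamp, condition_true|
    if condition_true
      active_since ||= timestamp
      timestamp - active_since >= for_seconds ? :firing : :pending
    else
      active_since = nil
      :inactive
    end
  end
end

# One evaluation per minute; the condition dips to false at t=180.
evaluations = [[0, true], [60, true], [120, true], [180, false], [240, true],
               [300, true], [360, true], [420, true], [480, true], [540, true]]

states = alert_states(evaluations, for_seconds: 300)
evaluations.zip(states).each { |(t, _), state| puts "t=#{t}s #{state}" }
```

The dip at t=180 resets the timer, so the alert only fires at t=540, five minutes after the condition became true again at t=240.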

Example Prometheus Alerting Rules (MongoDB)

groups:
- name: mongodb_alerts
  rules:
  - alert: HighReplicationLag
    expr: mongodb_replset_member_oplog_lag_seconds{job="mongodb_exporter"} > 300 # 5 minutes
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High MongoDB replication lag on {{ $labels.instance }}"
      description: "Replica set member {{ $labels.instance }} has replication lag of over 5 minutes."

  - alert: HighConnectionCount
    expr: mongodb_connections_current{job="mongodb_exporter"} > 1000 # Adjust threshold based on your capacity
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High MongoDB connection count on {{ $labels.instance }}"
      description: "Instance {{ $labels.instance }} has exceeded 1000 active connections."

  - alert: HighLockPercentage
    expr: mongodb_locks_percentage{job="mongodb_exporter", lock_type="Global"} > 50 # Adjust threshold
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "High MongoDB Global Lock Percentage on {{ $labels.instance }}"
      description: "Instance {{ $labels.instance }} is experiencing high global lock contention (over 50%)."

DigitalOcean Specific Considerations

When deploying on DigitalOcean, remember to:

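Firewall Rules: Ensure your DigitalOcean Cloud Firewalls or UFW rules allow the Prometheus server to reach your monitoring ports (e.g., 9100 for node_exporter, 9216 for mongodb_exporter, and the application port), and restrict those ports to that source rather than opening them to the public internet.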
Droplet Sizing: Monitor resource utilization closely to right-size your Droplets. Over-provisioning is costly, while under-provisioning leads to performance issues and alerts.
Managed Databases: If you opt for DigitalOcean’s Managed MongoDB, the monitoring and alerting capabilities are built-in, simplifying some aspects. However, you’ll still need to monitor your application servers and potentially integrate with their metrics.
VPC Networking: If your MongoDB cluster and application servers are in different VPCs or private networks, ensure proper routing and firewall configurations are in place for monitoring agents to communicate.
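
When a scrape target shows as down, a quick reachability test run from the Prometheus server helps separate firewall or routing problems from exporter problems. This is a minimal sketch using Ruby's standard `socket` library; `port_open?` is a hypothetical helper name.

```ruby
require 'socket'

# Returns true if a TCP connection to host:port succeeds within the timeout.
# Refused connections, DNS failures and timeouts all count as unreachable.
def port_open?(host, port, timeout_s: 2)
  Socket.tcp(host, port, connect_timeout: timeout_s) { true }
rescue StandardError
  false
end
```

For example, `port_open?('your_mongodb_node_1', 9100)` returning false while the exporter is running on that node points to a firewall or VPC routing issue rather than a problem with the exporter itself.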

Conclusion

A comprehensive server monitoring strategy for your Ruby applications and MongoDB clusters on DigitalOcean involves a multi-layered approach. Combine application-level insights from APM and custom health checks with robust system metrics from Prometheus and specialized exporters. Centralized logging and a well-tuned alerting system are essential for proactive issue detection and rapid response, ensuring the stability and availability of your critical services.