Server Monitoring Best Practices: Keeping Your Ruby App and MySQL Clusters Alive on DigitalOcean

Proactive MySQL Replication Lag Monitoring

MySQL replication lag is a silent killer of data consistency and application availability. On DigitalOcean, managing a cluster of MySQL instances, especially for a Ruby application, demands vigilant monitoring of replication status. We’ll focus on a practical, script-driven approach using standard MySQL tools and a simple shell script, deployable via cron.

The core of this monitoring lies in running `SHOW REPLICA STATUS` (MySQL 8.0.22 and later) or `SHOW SLAVE STATUS` (older versions) on each replica. We’re particularly interested in `Seconds_Behind_Source` and `Replica_IO_Running` / `Replica_SQL_Running`, which the older statement reports as `Seconds_Behind_Master` and `Slave_IO_Running` / `Slave_SQL_Running`. A non-zero lag value means the replica is behind, a `NULL` value means the lag cannot be measured (usually because a replication thread is stopped), and any thread status other than `Yes` signifies a broken or stalled replication stream.
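As a rough illustration of how these fields can be pulled out of the status output, here is a minimal sketch. The helper name `replica_field` is hypothetical, and it assumes the vertical `Name: value` layout that the `mysql` client prints for a `\G` query. It takes both the newer and the older field name so the same call works on either server version.

```shell
#!/bin/sh
# replica_field NEW_NAME OLD_NAME   (hypothetical helper, not part of MySQL)
# Reads "SHOW REPLICA STATUS\G" or "SHOW SLAVE STATUS\G" output on stdin and
# prints the value of whichever of the two field names appears first.
replica_field() {
    awk -v new="$1" -v old="$2" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)        # \G output right-aligns field names
            sep = index(line, ": ")
            if (sep == 0) next              # row separators and blank lines
            name = substr(line, 1, sep - 1)
            if (!found && (name == new || name == old)) {
                print substr(line, sep + 2)
                found = 1
            }
        }'
}

# Example (requires a reachable replica):
#   mysql -h mysql-replica-1.doapp.com -u monitor -p -e 'SHOW REPLICA STATUS\G' \
#     | replica_field Seconds_Behind_Source Seconds_Behind_Master
```

Matching on the exact field name keeps `Replica_SQL_Running` from being confused with `Replica_SQL_Running_State`, which shares the same prefix.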

Automated Replication Health Check Script

Let’s craft a robust shell script that connects to each replica, checks its status, and alerts if issues are detected. This script will be executed periodically by cron.

First, ensure you have a dedicated monitoring user in MySQL with appropriate privileges. For example:

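-- Create this user on every replica. The host part must match the machine the
-- script connects from: 'localhost' only works when the script runs on the replica itself.
-- 'monitoring_server_ip' is a placeholder for your monitoring server's address.
CREATE USER 'monitor'@'monitoring_server_ip' IDENTIFIED BY 'your_secure_password';
GRANT REPLICATION CLIENT ON *.* TO 'monitor'@'monitoring_server_ip';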
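Once the monitoring user exists, its credentials can live in a MySQL client option file instead of on a command line, where other local users could read them from the process list. The sketch below is one way to do that. The function name `write_mysql_client_cnf` and the file path in the usage comment are hypothetical, and the sketch assumes the password contains no double quotes.

```shell
#!/bin/sh
# write_mysql_client_cnf PATH USER PASSWORD   (hypothetical helper)
# Writes a [client] option file that only its owner can read.
write_mysql_client_cnf() {
    cnf_path="$1"
    (
        umask 077   # new files are created without group/other permissions
        printf '[client]\nuser=%s\npassword="%s"\n' "$2" "$3" > "$cnf_path"
    )
    chmod 600 "$cnf_path"   # also tighten a file that already existed
}

# Example:
#   write_mysql_client_cnf "$HOME/.monitor.cnf" monitor 'your_secure_password'
#   mysql --defaults-extra-file="$HOME/.monitor.cnf" -h mysql-replica-1.doapp.com \
#       -e 'SHOW REPLICA STATUS\G'
```

Note that `--defaults-extra-file` has to be the first option on the `mysql` command line.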

Now, the monitoring script. Save this as `check_mysql_replication.sh` on your monitoring server or one of the MySQL nodes.

#!/bin/bash

# --- Configuration ---
MYSQL_USER="monitor"
MYSQL_PASSWORD="your_secure_password"
# Array of replica hostnames or IPs
REPLICAS=("mysql-replica-1.doapp.com" "mysql-replica-2.doapp.com" "mysql-replica-3.doapp.com")
# Threshold for replication lag in seconds
LAG_THRESHOLD=300 # 5 minutes
# Alerting mechanism (e.g., email, Slack webhook)
ALERT_EMAIL="alerts@example.com" # Placeholder: replace with your team's address
ALERT_SUBJECT="CRITICAL: MySQL Replication Lag Detected"
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX" # Optional

# --- Functions ---
send_alert() {
    local message="$1"
    echo "$(date '+%Y-%m-%d %H:%M:%S') - ALERT: $message"

    # Email Alert
    echo "$message" | mail -s "$ALERT_SUBJECT" "$ALERT_EMAIL"

    # Slack Alert (if configured)
    if [ -n "$SLACK_WEBHOOK_URL" ]; then
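        # Escape backslashes and double quotes so the message stays valid JSON
        local escaped=${message//\\/\\\\}
        escaped=${escaped//\"/\\\"}
        local slack_message="payload={\"text\": \"$escaped\"}"
        curl -s -X POST --data-urlencode "$slack_message" "$SLACK_WEBHOOK_URL" > /dev/null 2>&1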
    fi
}

# --- Main Logic ---
echo "Starting MySQL replication check at $(date '+%Y-%m-%d %H:%M:%S')"

for REPLICA_HOST in "${REPLICAS[@]}"; do
    echo "Checking replica: $REPLICA_HOST"

    # Query replication status; fall back to the pre-8.0.22 statement on a syntax error
    REPLICA_STATUS=$(mysql -h "$REPLICA_HOST" -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" -e "SHOW REPLICA STATUS\G" 2>&1)
    if [[ $REPLICA_STATUS == *"ERROR 1064"* ]]; then
        REPLICA_STATUS=$(mysql -h "$REPLICA_HOST" -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" -e "SHOW SLAVE STATUS\G" 2>&1)
    fi

    if [[ $REPLICA_STATUS == *"ERROR"* ]]; then
        send_alert "Failed to query MySQL replica $REPLICA_HOST. Error: $REPLICA_STATUS"
        continue # Move to the next replica
    fi

    # Extract relevant metrics. Field names differ between the two statements:
    # SHOW REPLICA STATUS reports Seconds_Behind_Source and Replica_*_Running,
    # SHOW SLAVE STATUS reports Seconds_Behind_Master and Slave_*_Running.
    # The thread state can also be "Connecting", so capture any word.
    SECONDS_BEHIND_MASTER=$(echo "$REPLICA_STATUS" | grep -oP "Seconds_Behind_(Source|Master): \K\d+")
    REPLICA_IO_RUNNING=$(echo "$REPLICA_STATUS" | grep -oP "(Replica|Slave)_IO_Running: \K\w+")
    REPLICA_SQL_RUNNING=$(echo "$REPLICA_STATUS" | grep -oP "(Replica|Slave)_SQL_Running: \K\w+")

    # Check for broken replication threads
    if [[ "$REPLICA_IO_RUNNING" != "Yes" || "$REPLICA_SQL_RUNNING" != "Yes" ]]; then
        send_alert "MySQL replication broken on $REPLICA_HOST. IO thread: $REPLICA_IO_RUNNING, SQL thread: $REPLICA_SQL_RUNNING."
        continue
    fi

    # Check for replication lag
    if [[ -n "$SECONDS_BEHIND_MASTER" && "$SECONDS_BEHIND_MASTER" -gt "$LAG_THRESHOLD" ]]; then
        send_alert "MySQL replication lag detected on $REPLICA_HOST. Lag: $SECONDS_BEHIND_MASTER seconds (Threshold: $LAG_THRESHOLD seconds)."
        continue
    fi

    echo "Replica $REPLICA_HOST is healthy. Lag: ${SECONDS_BEHIND_MASTER:-N/A}s, IO: $REPLICA_IO_RUNNING, SQL: $REPLICA_SQL_RUNNING"
done

echo "MySQL replication check finished at $(date '+%Y-%m-%d %H:%M:%S')"
exit 0
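The decision logic in the script can also be factored into a small pure function, which makes it possible to test the thresholds without a live replica. This is only a sketch: the name `classify_replica` and the four result labels are hypothetical, and it treats a missing or `NULL` lag value as its own state rather than as healthy.

```shell
#!/bin/sh
# classify_replica IO_STATE SQL_STATE LAG THRESHOLD   (hypothetical helper)
# Prints one of: BROKEN, UNKNOWN, LAGGING, OK
classify_replica() {
    io="$1"; sql="$2"; lag="$3"; threshold="$4"

    # Any thread state other than "Yes" (No, Connecting, empty) is a failure
    if [ "$io" != "Yes" ] || [ "$sql" != "Yes" ]; then
        echo "BROKEN"
        return
    fi

    # A non-numeric lag (NULL or empty) cannot be compared to the threshold
    case "$lag" in
        ''|*[!0-9]*) echo "UNKNOWN"; return ;;
    esac

    if [ "$lag" -gt "$threshold" ]; then
        echo "LAGGING"
    else
        echo "OK"
    fi
}

# Example:
#   classify_replica Yes Yes 301 300
```

Keeping the classification separate from the `mysql` call means the alerting rules can be changed and checked on a laptop before they reach cron.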

Make the script executable:

chmod +x check_mysql_replication.sh

To automate this, add a cron job. For example, to run it every 5 minutes:

# Edit crontab for the user running the script
crontab -e

# Add the following line:
*/5 * * * * /path/to/your/check_mysql_replication.sh >> /var/log/mysql_replication_check.log 2>&1
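With a five-minute schedule, a run that hangs on an unreachable replica can still be going when cron starts the next one. One way to avoid overlapping runs is a simple lock, sketched below with an atomic `mkdir`. The function name `run_exclusive`, the lock path, and the exit code 75 are all assumptions for illustration, and a lock left behind by a killed process has to be removed by hand.

```shell
#!/bin/sh
# run_exclusive LOCK_DIR COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND only if LOCK_DIR can be created; mkdir is atomic, so two
# concurrent callers cannot both succeed.
run_exclusive() {
    lock_dir="$1"
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "another run holds $lock_dir, skipping" >&2
        return 75
    fi
    "$@"
    rc=$?
    rmdir "$lock_dir"
    return "$rc"
}

# Example, inside a wrapper script called from cron:
#   run_exclusive /tmp/check_mysql_replication.lock /path/to/your/check_mysql_replication.sh
```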

Remember to replace placeholder values like passwords, hostnames, and email addresses. For Slack integration, you’ll need to set up an incoming webhook in your Slack workspace.
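Slack incoming webhooks accept a JSON body, and alert text taken from MySQL error output can contain quotes, backslashes, and line breaks that would make a hand-built payload invalid. The following sketch shows one way to escape such text with standard tools. The function names `json_escape` and `slack_payload` are hypothetical, tabs are simply converted to spaces, and other control characters are not handled.

```shell
#!/bin/sh
# json_escape: reads text on stdin and prints it escaped for use inside a
# JSON string (backslashes, double quotes, and newlines).
json_escape() {
    tr '\t' ' ' |
        sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
        awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# slack_payload MESSAGE: prints a {"text": "..."} JSON document
slack_payload() {
    printf '{"text": "%s"}' "$(printf '%s' "$1" | json_escape)"
}

# Example (SLACK_WEBHOOK_URL as configured in the monitoring script):
#   curl -s -X POST -H 'Content-Type: application/json' \
#       --data "$(slack_payload "$message")" "$SLACK_WEBHOOK_URL"
```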

Monitoring Ruby Application Performance with New Relic

For your Ruby application, a robust Application Performance Monitoring (APM) tool is essential. New Relic is a popular choice that provides deep insights into transaction traces, database queries, error rates, and more. Deploying the New Relic agent is straightforward.

First, sign up for a New Relic account and obtain your license key. Then, add the `newrelic_rpm` gem to your application’s `Gemfile`:

# Gemfile
gem 'newrelic_rpm'

Run `bundle install`. Next, create a `newrelic.yml` configuration file, conventionally in your application’s `config/` directory. This file tells the agent how to connect to New Relic and what to monitor.

# newrelic.yml
common: &common
  license_key: YOUR_NEW_RELIC_LICENSE_KEY

development:
  <<: *common
  app_name: MyRubyApp (Development)

production:
  <<: *common
  app_name: MyRubyApp (Production)
  log_level: info

staging:
  <<: *common
  app_name: MyRubyApp (Staging)
  log_level: debug

Replace `YOUR_NEW_RELIC_LICENSE_KEY` with your actual key. The `app_name` should be unique for each environment.
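A forgotten placeholder is easy to miss until data fails to appear in New Relic, so a deploy script could check for it first. The sketch below is a plain text check, not a YAML parser. The function name `newrelic_key_is_set` is hypothetical, and the path in the usage comment assumes the file lives under `config/`.

```shell
#!/bin/sh
# newrelic_key_is_set FILE   (hypothetical helper)
# Succeeds when FILE has a non-empty license_key entry and the tutorial
# placeholder no longer appears anywhere in the file.
newrelic_key_is_set() {
    grep -Eq '^[[:space:]]*license_key:[[:space:]]*[^[:space:]#]' "$1" || return 1
    ! grep -q 'YOUR_NEW_RELIC_LICENSE_KEY' "$1"
}

# Example:
#   newrelic_key_is_set config/newrelic.yml || echo "license key not configured" >&2
```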

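In a Rails application the agent starts automatically once the gem is in the bundle, so the startup command does not need a wrapper. Settings can also be supplied through environment variables, which take precedence over `newrelic.yml` and keep the license key out of version control. For example, with `puma`: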

# Example for Puma
NEW_RELIC_LICENSE_KEY=YOUR_NEW_RELIC_LICENSE_KEY NEW_RELIC_APP_NAME="MyRubyApp (Production)" bundle exec puma -C config/puma.rb

You can also define these variables in your process manager's environment (for example, a systemd unit) instead of prefixing the command. The `newrelic_rpm` gem automatically instruments many common Ruby libraries and frameworks, including ActiveRecord. This means you'll see detailed breakdowns of your SQL queries directly in the New Relic UI, allowing you to correlate application performance issues with specific database operations.

DigitalOcean Droplet Resource Monitoring

Beyond application-specific metrics, fundamental Droplet resource utilization is critical. High CPU, memory, or disk I/O can cripple your application and database performance. DigitalOcean provides basic monitoring metrics through its control panel, but for deeper analysis and alerting, we'll use `node_exporter` and Prometheus.

This setup involves installing `node_exporter` on each Droplet (application and database servers) and setting up a Prometheus server to scrape these metrics. For simplicity, we'll assume you have a dedicated Droplet for Prometheus or are using a managed Prometheus service.

Installing and Configuring Node Exporter

Download a `node_exporter` release for your Droplet's architecture. The example below pins v1.7.0 for amd64; check the project's releases page for the current version.

# On each Droplet (app and DB servers)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
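Before moving a downloaded binary into `/usr/local/bin`, it is worth comparing the tarball against the project's published checksums. The sketch below assumes the release page offers a `sha256sums.txt` file in the usual `hash  filename` format and that GNU `sha256sum` is installed. The function name `verify_release_checksum` is hypothetical.

```shell
#!/bin/sh
# verify_release_checksum TARBALL SUMS_FILE   (hypothetical helper)
# Succeeds only when SUMS_FILE lists TARBALL's file name with a SHA-256 hash
# that matches the file on disk.
verify_release_checksum() {
    tarball="$1"
    sums="$2"
    name=$(basename "$tarball")
    expected=$(awk -v n="$name" '$2 == n { print $1 }' "$sums")
    [ -n "$expected" ] || return 1          # file name not listed at all
    actual=$(sha256sum "$tarball" | awk '{ print $1 }')
    [ "$expected" = "$actual" ]
}

# Example (after downloading sha256sums.txt from the same release):
#   verify_release_checksum node_exporter-1.7.0.linux-amd64.tar.gz sha256sums.txt \
#       || echo "checksum mismatch, do not install" >&2
```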

Create a systemd service file to manage `node_exporter`.

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Save this as `/etc/systemd/system/node_exporter.service` and then enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

By default, `node_exporter` listens on port 9100. Ensure this port is accessible from your Prometheus server (adjust DigitalOcean firewall rules accordingly).
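A quick way to confirm the exporter is serving data is to fetch `/metrics` and look for a metric you expect, such as `node_cpu_seconds_total`. The sketch below counts the samples for one metric name in the Prometheus text format. The function name `metric_sample_count` is hypothetical, and the `curl` call in the usage comment assumes the default port on the local machine.

```shell
#!/bin/sh
# metric_sample_count NAME   (hypothetical helper)
# Reads Prometheus text exposition format on stdin and prints how many
# sample lines belong to the metric NAME.
metric_sample_count() {
    awk -v m="$1" '
        /^#/ { next }                       # skip HELP and TYPE comment lines
        {
            name = $0
            sub(/[{ ].*/, "", name)         # drop labels and the sample value
            if (name == m) n++
        }
        END { print n + 0 }'
}

# Example:
#   curl -s http://localhost:9100/metrics | metric_sample_count node_cpu_seconds_total
```

A count of zero from the Prometheus server, when the same command works on the Droplet itself, usually points at a firewall rule rather than at the exporter.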

Configuring Prometheus to Scrape Node Exporter

On your Prometheus server, edit the `prometheus.yml` configuration file to add scrape targets for your Droplets.

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

scrape_configs:
  # Job for Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Job for Application Droplets
  - job_name: 'ruby_app_droplets'
    static_configs:
      - targets:
          - 'app-droplet-1.yourdomain.com:9100'
          - 'app-droplet-2.yourdomain.com:9100'
        labels:
          env: 'production'
          role: 'app'

  # Job for MySQL Droplets (assuming they are separate)
  - job_name: 'mysql_droplets'
    static_configs:
      - targets:
          - 'mysql-replica-1.doapp.com:9100'
          - 'mysql-replica-2.doapp.com:9100'
          - 'mysql-master.doapp.com:9100' # If master also runs node_exporter
        labels:
          env: 'production'
          role: 'db'

Restart the Prometheus service after updating the configuration. You can then access the Prometheus UI (usually `http://your-prometheus-server-ip:9090`) to verify that your targets are being scraped successfully.

Alerting with Prometheus Alertmanager

Prometheus collects metrics and evaluates alerting rules, but it does not send notifications itself; that is Alertmanager's job. Configure Prometheus to send alerts to Alertmanager, and then configure Alertmanager to route those alerts to your team via email, Slack, PagerDuty, etc.

In `prometheus.yml`, add the Alertmanager configuration:

# ... (previous scrape_configs) ...

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # Replace with your Alertmanager instance address
          - 'alertmanager.yourdomain.com:9093'

Your Alertmanager configuration (`alertmanager.yml`) would then define receivers and routing rules. For example, to send critical alerts to Slack:

global:
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX' # Your Slack webhook URL

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications' # Default receiver

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        text: '{{ template "slack.default.text" . }}'

Alert rules themselves live on the Prometheus side, in a separate rules file such as `rules.yml`. An example rule for high CPU usage:

groups:
  - name: host_alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "Instance {{ $labels.instance }} has been running at over 85% CPU for 5 minutes."

Ensure your Prometheus configuration lists this `rules.yml` file under `rule_files`. With this setup, you'll receive timely alerts for critical resource exhaustion on your DigitalOcean Droplets, allowing you to scale or troubleshoot before your Ruby application or MySQL cluster is impacted.
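To make the `HighCpuUsage` expression less opaque, here is the same arithmetic worked by hand for a single core. `node_cpu_seconds_total{mode="idle"}` is a counter of idle seconds, so its per-second rate is the fraction of time spent idle, and `100 - rate * 100` is the busy percentage. The function name `cpu_busy_percent` and the sample numbers are illustrative assumptions; the real rule additionally averages across all cores of an instance.

```shell
#!/bin/sh
# cpu_busy_percent IDLE_START IDLE_END ELAPSED_SECONDS   (hypothetical helper)
# Takes two readings of an idle-seconds counter and the time between them,
# and prints the busy percentage over that window.
cpu_busy_percent() {
    awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN {
        idle_rate = (b - a) / t             # idle seconds per second, 0..1
        printf "%.1f\n", 100 - idle_rate * 100
    }'
}

# Example: the idle counter grew by 15 seconds during a 60-second window,
# so the core was idle 25% of the time.
#   cpu_busy_percent 1000 1015 60
```

With the 85% threshold from the rule, a core would have to accumulate fewer than 45 idle seconds over the 5-minute rate window to count as overloaded.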
