Server Monitoring Best Practices: Keeping Your Ruby App and MySQL Clusters Alive on DigitalOcean
Proactive MySQL Replication Lag Monitoring
MySQL replication lag is a silent killer of data consistency and application availability. On DigitalOcean, managing a cluster of MySQL instances, especially for a Ruby application, demands vigilant monitoring of replication status. We’ll focus on a practical, script-driven approach using standard MySQL tools and a simple shell script, deployable via cron.
The core of this monitoring is the `SHOW REPLICA STATUS` statement (`SHOW SLAVE STATUS` before MySQL 8.0.22), run on each replica. Three fields matter most: `Seconds_Behind_Source`, `Replica_IO_Running`, and `Replica_SQL_Running` (`Seconds_Behind_Master`, `Slave_IO_Running`, and `Slave_SQL_Running` in the older output). A non-zero lag value means the replica is behind, a `NULL` lag value means the lag cannot be measured (typically because a replication thread is stopped), and any value other than `Yes` for the IO or SQL thread means the replication stream is not running normally.
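Because the field names differ between MySQL releases, a small helper that tries the newer name first and falls back to the legacy one keeps the parsing in one place. The following is only a sketch: `replica_field` is a hypothetical helper name, and the sample text stands in for real `\G`-formatted status output.

```bash
# replica_field NAME [FALLBACK_NAME...]
# Reads \G-formatted status output on stdin and prints the value of the
# first listed field that is present. Hypothetical helper, not part of MySQL.
replica_field() {
    awk -v names="$*" '
        BEGIN { n = split(names, want, " ") }
        {
            # Lines look like "        Seconds_Behind_Source: 12"
            line = $0
            sub(/^[ \t]+/, "", line)
            for (i = 1; i <= n; i++) {
                prefix = want[i] ": "
                if (index(line, prefix) == 1 && !(i in found)) {
                    found[i] = substr(line, length(prefix) + 1)
                }
            }
        }
        END {
            # Names are checked in the order given, so the newer name wins
            for (i = 1; i <= n; i++) if (i in found) { print found[i]; exit }
        }'
}

# Sample output in the newer naming scheme (illustrative values)
sample='*************************** 1. row ***************************
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
        Seconds_Behind_Source: NULL'

printf '%s\n' "$sample" | replica_field Seconds_Behind_Source Seconds_Behind_Master
printf '%s\n' "$sample" | replica_field Replica_SQL_Running Slave_SQL_Running
```

With the sample above, the first call prints the lag value (`NULL`) and the second prints the SQL thread state (`No`); the same calls work unchanged against legacy output.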
Automated Replication Health Check Script
Let’s craft a robust shell script that connects to each replica, checks its status, and alerts if issues are detected. This script will be executed periodically by cron.
First, ensure you have a dedicated monitoring user in MySQL with appropriate privileges. For example:
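-- Replace monitoring-host-ip with the address the monitoring script connects from
-- (use 'localhost' only if the script runs on the replica itself)
CREATE USER 'monitor'@'monitoring-host-ip' IDENTIFIED BY 'your_secure_password';
GRANT REPLICATION CLIENT ON *.* TO 'monitor'@'monitoring-host-ip';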
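Passing the password with `-p` on the command line can expose it in process listings, so one option is to keep the credentials in a client option file readable only by the monitoring user and pass it with `--defaults-extra-file`. This is a minimal sketch: `write_client_cnf` and the file path in the usage comment are hypothetical, and passwords containing double quotes or backslashes would need extra escaping.

```bash
# write_client_cnf PATH USER PASSWORD
# Writes a MySQL client option file with owner-only permissions.
# Hypothetical helper; adjust the path and user to your environment.
write_client_cnf() {
    local cnf_path="$1" user="$2" password="$3"
    (
        umask 077   # new files are created without group/other access
        printf '[client]\nuser=%s\npassword="%s"\n' "$user" "$password" > "$cnf_path"
    ) && chmod 600 "$cnf_path"   # also tighten a file that already existed
}

# Usage (the option must come first on the mysql command line):
#   write_client_cnf "$HOME/.replication-monitor.cnf" monitor 'your_secure_password'
#   mysql --defaults-extra-file="$HOME/.replication-monitor.cnf" \
#         -h mysql-replica-1.doapp.com -e 'SHOW REPLICA STATUS\G'
```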
Now, the monitoring script. Save this as `check_mysql_replication.sh` on your monitoring server or one of the MySQL nodes.
#!/bin/bash
# --- Configuration ---
MYSQL_USER="monitor"
MYSQL_PASSWORD="your_secure_password"
# Array of replica hostnames or IPs
REPLICAS=("mysql-replica-1.doapp.com" "mysql-replica-2.doapp.com" "mysql-replica-3.doapp.com")
# Threshold for replication lag in seconds
LAG_THRESHOLD=300 # 5 minutes
# Alerting mechanism (e.g., email, Slack webhook)
ALERT_EMAIL="alerts@example.com" # Placeholder address
ALERT_SUBJECT="CRITICAL: MySQL Replication Lag Detected"
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX" # Optional
# --- Functions ---
send_alert() {
    local message="$1"
    echo "$(date '+%Y-%m-%d %H:%M:%S') - ALERT: $message"
    # Email Alert
    echo "$message" | mail -s "$ALERT_SUBJECT" "$ALERT_EMAIL"
    # Slack Alert (if configured)
    if [ -n "$SLACK_WEBHOOK_URL" ]; then
        local slack_message="payload={\"text\": \"$message\"}"
        curl -X POST --data-urlencode "$slack_message" "$SLACK_WEBHOOK_URL" > /dev/null 2>&1
    fi
}
# --- Main Logic ---
echo "Starting MySQL replication check at $(date '+%Y-%m-%d %H:%M:%S')"
for REPLICA_HOST in "${REPLICAS[@]}"; do
    echo "Checking replica: $REPLICA_HOST"
    # Query replication status; fall back to the legacy statement on servers older than 8.0.22
    if ! REPLICA_STATUS=$(mysql -h "$REPLICA_HOST" -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" -e "SHOW REPLICA STATUS\G" 2>&1); then
        if ! REPLICA_STATUS=$(mysql -h "$REPLICA_HOST" -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" -e "SHOW SLAVE STATUS\G" 2>&1); then
            send_alert "Failed to query MySQL replica $REPLICA_HOST. Error: $REPLICA_STATUS"
            continue # Move to the next replica
        fi
    fi
    # Extract relevant metrics (field names used by MySQL 8.0.22 and later)
    SECONDS_BEHIND_MASTER=$(echo "$REPLICA_STATUS" | grep -oP "Seconds_Behind_Source: \K\d+")
    REPLICA_IO_RUNNING=$(echo "$REPLICA_STATUS" | grep -oP "Replica_IO_Running: \K\w+")
    REPLICA_SQL_RUNNING=$(echo "$REPLICA_STATUS" | grep -oP "Replica_SQL_Running: \K\w+")
    # Handle older MySQL versions (e.g., 5.7), which report the legacy field names
    if [ -z "$REPLICA_IO_RUNNING" ]; then
        REPLICA_IO_RUNNING=$(echo "$REPLICA_STATUS" | grep -oP "Slave_IO_Running: \K\w+")
        REPLICA_SQL_RUNNING=$(echo "$REPLICA_STATUS" | grep -oP "Slave_SQL_Running: \K\w+")
        SECONDS_BEHIND_MASTER=$(echo "$REPLICA_STATUS" | grep -oP "Seconds_Behind_Master: \K\d+")
    fi
    # Check for broken replication threads (anything other than "Yes", including "Connecting")
    if [[ "$REPLICA_IO_RUNNING" != "Yes" || "$REPLICA_SQL_RUNNING" != "Yes" ]]; then
        send_alert "MySQL replication broken on $REPLICA_HOST. IO thread: ${REPLICA_IO_RUNNING:-unknown}, SQL thread: ${REPLICA_SQL_RUNNING:-unknown}."
        continue
    fi
    # Check for replication lag (the value is empty when the server reports NULL)
    if [[ -n "$SECONDS_BEHIND_MASTER" && "$SECONDS_BEHIND_MASTER" -gt "$LAG_THRESHOLD" ]]; then
        send_alert "MySQL replication lag detected on $REPLICA_HOST. Lag: $SECONDS_BEHIND_MASTER seconds (Threshold: $LAG_THRESHOLD seconds)."
        continue
    fi
    echo "Replica $REPLICA_HOST is healthy. Lag: ${SECONDS_BEHIND_MASTER:-N/A}s, IO: $REPLICA_IO_RUNNING, SQL: $REPLICA_SQL_RUNNING"
done
echo "MySQL replication check finished at $(date '+%Y-%m-%d %H:%M:%S')"
exit 0
Make the script executable:
chmod +x check_mysql_replication.sh
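Before handing the script to cron, it can help to exercise the decision rules in isolation, without a database. The sketch below restates those rules as a standalone function; `classify_replica` and its output labels are hypothetical names chosen for this illustration, not part of the script above.

```bash
# classify_replica IO_STATE SQL_STATE LAG THRESHOLD
# Prints BROKEN, UNKNOWN, LAGGING or OK. Hypothetical helper for testing
# the alerting rules with hand-written values.
classify_replica() {
    local io="$1" sql="$2" lag="$3" threshold="$4"
    # Either thread not reporting "Yes" means replication is not running
    if [ "$io" != "Yes" ] || [ "$sql" != "Yes" ]; then
        echo BROKEN
        return
    fi
    # A NULL, empty, or otherwise non-numeric lag cannot be compared
    case "$lag" in
        ''|*[!0-9]*) echo UNKNOWN; return ;;
    esac
    if [ "$lag" -gt "$threshold" ]; then
        echo LAGGING
    else
        echo OK
    fi
}

classify_replica Yes Yes 12 300     # healthy replica
classify_replica Yes No NULL 300    # SQL thread stopped
classify_replica Yes Yes 900 300    # 15 minutes behind
```

The three sample calls print `OK`, `BROKEN`, and `LAGGING` respectively.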
To automate this, add a cron job. For example, to run it every 5 minutes:
# Edit crontab for the user running the script
crontab -e
# Add the following line:
*/5 * * * * /path/to/your/check_mysql_replication.sh >> /var/log/mysql_replication_check.log 2>&1
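A check that runs every 5 minutes will also repeat the same alert every 5 minutes for as long as a problem persists. One possible way to limit that, sketched here under assumptions, is a per-host cooldown based on timestamp files. The `should_alert` helper, the default state directory, and the optional `NOW` argument (useful for testing) are all hypothetical.

```bash
# should_alert KEY COOLDOWN_SECONDS [STATE_DIR] [NOW]
# Returns 0 if an alert for KEY should be sent now, 1 if one was already
# sent within the cooldown window. Hypothetical helper.
should_alert() {
    local key="$1" cooldown="$2"
    local state_dir="${3:-/tmp/replication-alerts}"
    local now="${4:-$(date +%s)}"
    local stamp_file last

    # If state cannot be stored, fail open and send the alert anyway
    mkdir -p "$state_dir" || return 0
    stamp_file="$state_dir/$key"

    if [ -f "$stamp_file" ]; then
        last=$(cat "$stamp_file")
        if [ $((now - last)) -lt "$cooldown" ]; then
            return 1
        fi
    fi
    echo "$now" > "$stamp_file"
    return 0
}

# Usage inside the monitoring loop (one alert per host per hour):
#   if should_alert "$REPLICA_HOST" 3600; then
#       send_alert "MySQL replication broken on $REPLICA_HOST."
#   fi
```

Removing the timestamp file once a replica is healthy again would let the next incident alert immediately.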
Remember to replace placeholder values like passwords, hostnames, and email addresses. For Slack integration, you’ll need to set up an incoming webhook in your Slack workspace.
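One caveat with the Slack call: the alert text is placed directly inside a JSON string, so a message containing double quotes, backslashes, or line breaks (for example, raw MySQL error output) produces an invalid payload. A minimal sketch of escaping the message first follows; `json_escape` and `slack_payload` are hypothetical helpers, and only the most common special characters are handled.

```bash
# json_escape TEXT
# Escapes backslashes, double quotes, tabs and newlines for use inside a
# JSON string. Other control characters are not handled in this sketch.
json_escape() {
    local tab
    tab=$(printf '\t')
    printf '%s' "$1" \
        | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' -e "s/$tab/\\\\t/g" \
        | awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# slack_payload TEXT
# Builds the JSON body for an incoming webhook.
slack_payload() {
    printf '{"text": "%s"}' "$(json_escape "$1")"
}

slack_payload 'Replica "mysql-replica-1" is lagging'
echo

# Sending it as a JSON body instead of a form-encoded payload:
#   curl -sS -X POST -H 'Content-Type: application/json' \
#        --data "$(slack_payload "$message")" "$SLACK_WEBHOOK_URL"
```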
Monitoring Ruby Application Performance with New Relic
For your Ruby application, a robust Application Performance Monitoring (APM) tool is essential. New Relic is a popular choice that provides deep insights into transaction traces, database queries, error rates, and more. Deploying the New Relic agent is straightforward.
First, sign up for a New Relic account and obtain your license key. Then, add the `newrelic_rpm` gem to your application’s `Gemfile`:
# Gemfile
gem 'newrelic_rpm'
Run `bundle install`. Next, create a `newrelic.yml` configuration file in the root of your application. This file tells the agent how to connect to New Relic and what to monitor.
# newrelic.yml
common: &common
  license_key: YOUR_NEW_RELIC_LICENSE_KEY
development:
  <<: *common
  app_name: MyRubyApp (Development)
production:
  <<: *common
  app_name: MyRubyApp (Production)
  log_level: info
staging:
  <<: *common
  app_name: MyRubyApp (Staging)
  log_level: debug
Replace `YOUR_NEW_RELIC_LICENSE_KEY` with your actual key. The `app_name` should be unique for each environment.
In a Rails application the agent loads automatically once the gem is bundled, so no wrapper command is required. You can, however, supply settings through environment variables such as `NEW_RELIC_LICENSE_KEY` and `NEW_RELIC_APP_NAME`, which take precedence over `newrelic.yml` and keep the license key out of version control. For example, when starting Puma:
# Example for Puma
NEW_RELIC_LICENSE_KEY=YOUR_NEW_RELIC_LICENSE_KEY NEW_RELIC_APP_NAME="MyRubyApp (Production)" bundle exec puma -C config/puma.rb
The `newrelic_rpm` gem automatically instruments many common Ruby libraries and frameworks, including ActiveRecord. This means you'll see detailed breakdowns of your SQL queries directly in the New Relic UI, allowing you to correlate application performance issues with specific database operations.
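If the agent is configured through environment variables, a start script can refuse to boot when they are missing, which is easier to notice than an application that silently reports nothing. This is a small sketch under that assumption; `require_env` is a hypothetical helper and not part of the New Relic agent.

```bash
# require_env NAME...
# Succeeds only if every named variable is exported and non-empty, which
# is what a child process such as Puma would actually see.
require_env() {
    local name
    for name in "$@"; do
        if [ -z "$(printenv "$name")" ]; then
            echo "Missing required environment variable: $name" >&2
            return 1
        fi
    done
}

# Usage in a start script:
#   require_env NEW_RELIC_LICENSE_KEY NEW_RELIC_APP_NAME || exit 1
#   exec bundle exec puma -C config/puma.rb
```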
DigitalOcean Droplet Resource Monitoring
Beyond application-specific metrics, fundamental Droplet resource utilization is critical. High CPU, memory, or disk I/O can cripple your application and database performance. DigitalOcean provides basic monitoring metrics through its control panel, but for deeper analysis and alerting, we'll use `node_exporter` and Prometheus.
This setup involves installing `node_exporter` on each Droplet (application and database servers) and setting up a Prometheus server to scrape these metrics. For simplicity, we'll assume you have a dedicated Droplet for Prometheus or are using a managed Prometheus service.
Installing and Configuring Node Exporter
Download a `node_exporter` release binary for your Droplet's architecture (e.g., amd64). The commands below pin v1.7.0; check the project's releases page for the current version.
# On each Droplet (app and DB servers)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
Create a systemd service file to manage `node_exporter`.
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
Save this as `/etc/systemd/system/node_exporter.service` and then enable and start it:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
By default, `node_exporter` listens on port 9100. Ensure this port is accessible from your Prometheus server (adjust DigitalOcean firewall rules accordingly).
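A quick way to confirm the exporter is serving data is to fetch its metrics endpoint and look for a metric you expect, such as `node_cpu_seconds_total`. The sketch below separates the check from the fetch so it can be tried on sample text; `has_metric` is a hypothetical helper and the sample lines are illustrative values.

```bash
# has_metric NAME
# Reads Prometheus text-format metrics on stdin and succeeds if at least
# one sample line for exactly NAME is present. Hypothetical helper.
has_metric() {
    grep -Eq "^$1(\{| )"
}

# Illustrative excerpt of exporter output
sample='# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_load1 0.42'

if printf '%s\n' "$sample" | has_metric node_cpu_seconds_total; then
    echo "CPU metrics present"
fi

# Against a live exporter, run from the Prometheus server to test the firewall too:
#   curl -fsS http://app-droplet-1.yourdomain.com:9100/metrics | has_metric node_cpu_seconds_total
```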
Configuring Prometheus to Scrape Node Exporter
On your Prometheus server, edit the `prometheus.yml` configuration file to add scrape targets for your Droplets.
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
scrape_configs:
  # Job for Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Job for Application Droplets
  - job_name: 'ruby_app_droplets'
    static_configs:
      - targets:
          - 'app-droplet-1.yourdomain.com:9100'
          - 'app-droplet-2.yourdomain.com:9100'
        labels:
          env: 'production'
          role: 'app'
  # Job for MySQL Droplets (assuming they are separate)
  - job_name: 'mysql_droplets'
    static_configs:
      - targets:
          - 'mysql-replica-1.doapp.com:9100'
          - 'mysql-replica-2.doapp.com:9100'
          - 'mysql-master.doapp.com:9100' # If master also runs node_exporter
        labels:
          env: 'production'
          role: 'db'
Restart the Prometheus service after updating the configuration. You can then access the Prometheus UI (usually `http://your-prometheus-server-ip:9090`) to verify that your targets are being scraped successfully.
Alerting with Prometheus Alertmanager
Prometheus collects metrics and evaluates alert rules, but it relies on Alertmanager to deduplicate, group, and deliver the resulting notifications. Configure Prometheus to send alerts to Alertmanager, and then configure Alertmanager to route those alerts to your team via email, Slack, PagerDuty, etc.
In `prometheus.yml`, add the Alertmanager configuration:
# ... (previous scrape_configs) ...
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Replace with your Alertmanager instance address
            - 'alertmanager.yourdomain.com:9093'
Your Alertmanager configuration (`alertmanager.yml`) would then define receivers and routing rules. For example, to route all alerts to a Slack channel by default:
global:
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX' # Your Slack webhook URL
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications' # Default receiver
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        text: '{{ template "slack.default.text" . }}'
# Example alert rule for high CPU usage
# In a separate rules file, e.g., rules.yml, referenced in prometheus.yml
groups:
  - name: host_alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "Instance {{ $labels.instance }} has been running at over 85% CPU for 5 minutes."
Reference this file from `prometheus.yml` with a `rule_files` entry (for example, `rule_files: ['rules.yml']`) so that Prometheus loads it. With this setup, you'll receive timely alerts for critical resource exhaustion on your DigitalOcean Droplets, allowing you to scale or troubleshoot before your Ruby application or MySQL cluster is impacted.
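To make the `HighCpuUsage` expression less opaque, the arithmetic can be worked through by hand. `rate(node_cpu_seconds_total{mode="idle"}[5m])` yields, per core, the fraction of each second spent idle; averaging across cores and subtracting from 100 gives the busy percentage. The sketch below reproduces that calculation from made-up counter deltas; `cpu_busy_pct` is a hypothetical helper, not a Prometheus tool.

```bash
# cpu_busy_pct WINDOW_SECONDS IDLE_DELTA...
# WINDOW_SECONDS is the length of the rate window; each IDLE_DELTA is how
# much one core's idle counter grew during that window.
cpu_busy_pct() {
    local window="$1"
    shift
    printf '%s\n' "$@" | LC_ALL=C awk -v window="$window" '
        { sum += $1 / window; n++ }                      # per-core idle rate
        END { printf "%.1f\n", 100 - (sum / n) * 100 }   # average, then invert
    '
}

# Two cores over a 5-minute (300 s) window: one idle for 30 s, one for 60 s
cpu_busy_pct 300 30 60
```

In this example the idle rates are 0.1 and 0.2, their average is 0.15, and the result is 85.0, which sits exactly on the threshold and would not fire the alert because the rule requires a value strictly greater than 85.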