Server Monitoring Best Practices: Keeping Your Ruby App and MongoDB Clusters Alive on AWS

Proactive Health Checks for Ruby Applications on EC2

Maintaining the health of Ruby applications deployed on AWS EC2 instances requires a multi-layered approach to monitoring. Beyond basic CPU and memory utilization, we need to inspect application-specific metrics and ensure critical processes are running. This involves leveraging both AWS CloudWatch and application-level tools.

1. Application Process Monitoring with `monit`

monit is a powerful, lightweight utility for managing and monitoring Unix systems. It can automatically perform actions (like restarting a process) when a service fails. We’ll configure it to watch our Puma or Unicorn process.

First, install monit on your EC2 instance:

On Ubuntu/Debian:

sudo apt-get update
sudo apt-get install monit

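On Amazon Linux/CentOS/RHEL (monit is typically packaged in the EPEL repository, so enable EPEL first if the package is not found):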

sudo yum update
sudo yum install monit

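Next, create a configuration file for your Ruby application. The path below is the Debian/Ubuntu include directory; on RHEL-family systems it is usually /etc/monit.d/ instead. Assuming your application runs via Puma and its PID file is located at /var/www/my_ruby_app/shared/pids/puma.pid: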

# /etc/monit/conf.d/puma.conf
check process puma with pidfile /var/www/my_ruby_app/shared/pids/puma.pid
  start program = "/bin/systemctl start puma"
  stop program = "/bin/systemctl stop puma"
  if not exist then restart
  if 5 restarts within 5 cycles then timeout

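Ensure your Puma service is managed by systemd and that Puma actually writes its PID file to the path monit watches (the pidfile setting in puma.rb). Note that Restart=always already makes systemd restart a crashed Puma, so monit acts as a second line of defense and as the place to hang further checks. A typical systemd service file (/etc/systemd/system/puma.service) might look like this: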

[Unit]
Description=Puma HTTP Server
After=network.target

[Service]
Type=simple
User=deploy
Group=deploy
WorkingDirectory=/var/www/my_ruby_app/current
ExecStart=/usr/local/bin/bundle exec puma -C /var/www/my_ruby_app/shared/puma.rb
ExecStop=/bin/kill -s TERM $MAINPID
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

After creating or modifying the monit configuration, test it and reload:

sudo monit -t
sudo systemctl reload monit

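You can check the status of your monitored processes via monit's web interface or the command line. Both rely on monit's built-in HTTP server, so the set httpd directive must be enabled in the main monit configuration: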

sudo monit status
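
monit's view of the process stays on the instance, so a CloudWatch alarm cannot see it directly. One possible bridge is a small script, run from cron or a systemd timer, that checks the same PID file and emits a 0/1 gauge to a StatsD listener on localhost:8125. The sketch below is only an illustration in Python: the metric name PumaProcessStatus and the function names are assumptions made for this example, not something monit or Puma provide.

```python
import os
import socket

PID_FILE = "/var/www/my_ruby_app/shared/pids/puma.pid"  # same path monit watches
STATSD_ADDR = ("127.0.0.1", 8125)  # assumed local StatsD listener


def process_status(pid_file: str) -> int:
    """Return 1 if the PID stored in pid_file belongs to a live process, else 0."""
    try:
        with open(pid_file) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return 0  # missing, unreadable, or malformed PID file
    if pid <= 0:
        return 0  # never signal process groups by accident
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return 0
    except PermissionError:
        return 1  # the process exists but belongs to another user
    return 1


def gauge_datagram(name: str, value: int) -> bytes:
    """Encode a StatsD gauge line such as b'PumaProcessStatus:1|g'."""
    return f"{name}:{value}|g".encode("ascii")


def report(pid_file: str = PID_FILE) -> int:
    """Check the process and send the result as a gauge over UDP."""
    status = process_status(pid_file)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(gauge_datagram("PumaProcessStatus", status), STATSD_ADDR)
    return status


if __name__ == "__main__":
    print(report())
```

Because the transport is UDP, the script does not fail when nothing is listening; pair it with an alarm that treats missing data as a problem.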

2. Custom Application Metrics with CloudWatch Agent

monit handles process availability, but we need deeper insights into application performance. The CloudWatch Agent allows us to collect custom metrics, logs, and system-level metrics beyond the default EC2 metrics.

Install the CloudWatch Agent. Refer to the official AWS documentation for the latest installation instructions for your specific OS. For Amazon Linux 2, it’s typically:

wget https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
sudo rpm -U ./amazon-cloudwatch-agent.rpm # Installs the agent, or upgrades an older version in place

Create a configuration file for the agent. This example collects standard system metrics, accepts StatsD metrics from the application, and ships a hypothetical Ruby application log file to CloudWatch Logs. Log files are configured in a top-level logs section, not under metrics. The agent does not extract metrics from log lines itself; to turn logged values such as request_time, db_time, or view_time into metrics, define CloudWatch Logs metric filters on the log group. First, create a JSON configuration file (e.g., /opt/aws/amazon-cloudwatch/agent/config.json):

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyRubyApp",
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_user",
          "cpu_usage_system",
          "cpu_usage_idle"
        ],
        "totalcpu": true
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ]
      },
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "resources": [
          "/"
        ]
      },
      "statsd": {
        "service_address": ":8125",
        "metrics_collection_interval": 10,
        "metrics_aggregation_interval": 60
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my_ruby_app/production.log",
            "log_group_name": "/aws/ecs/my_ruby_app/production",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          }
        ]
      }
    }
  }
}
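
A structural mistake in this file usually surfaces only when the agent refuses to start, so a quick pre-flight check can save a restart cycle. The sketch below, in Python, checks only a few layout rules this article relies on (metrics under metrics.metrics_collected, log files under logs.logs_collected.files.collect_list). The function names are hypothetical, and it is not a substitute for the agent's own schema validation.

```python
import json


def check_agent_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    collected = config.get("metrics", {}).get("metrics_collected", {})
    if not collected:
        problems.append("metrics.metrics_collected is missing or empty")
    if "log_file" in collected:
        problems.append(
            "log files belong under logs.logs_collected.files.collect_list"
        )
    files = config.get("logs", {}).get("logs_collected", {}).get("files", {})
    for index, entry in enumerate(files.get("collect_list", [])):
        if "file_path" not in entry:
            problems.append(f"collect_list[{index}] has no file_path")
    return problems


def check_agent_config_file(path: str) -> list:
    """Load a config file and check it; invalid JSON raises ValueError."""
    with open(path) as fh:
        return check_agent_config(json.load(fh))


if __name__ == "__main__":
    path = "/opt/aws/amazon-cloudwatch/agent/config.json"
    for problem in check_agent_config_file(path):
        print("config problem:", problem)
```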

To send custom application metrics (like request latency, database query times, etc.) to CloudWatch, you’ll need to instrument your Ruby application. A common approach is to use a StatsD client library. For example, using the statsd-ruby gem:

# Gemfile
gem 'statsd-ruby'

# In your Ruby application (e.g., in middleware or a service object)
require 'statsd'

# Initialize StatsD client (assuming agent is listening on localhost:8125)
statsd = Statsd.new('localhost', 8125)

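# Example: Timing a database query
# (the monotonic clock is not affected by wall-clock adjustments such as NTP corrections)
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
# ... perform database query ...
db_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
statsd.timing('db.query_time', (db_duration * 1000).round) # Send in milliseconds

# Example: Incrementing a counter for successful requests
statsd.increment('requests.success')

# Example: Setting a gauge for active users
# (current_user_count is a placeholder for however your app counts active users)
statsd.gauge('users.active', current_user_count)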
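
When metrics do not show up, it helps to confirm what the application actually puts on the wire. StatsD datagrams are plain text of the form name:value|type, so a throwaway listener is enough to inspect them. The Python sketch below is a debugging aid under stated assumptions: the function names are made up for this example, and it must run where nothing else is bound to port 8125 (for instance on a development machine, or with the agent stopped).

```python
import socket


def parse_statsd_line(line: str):
    """Split 'name:value|type' into (name, float(value), type)."""
    name, _, rest = line.partition(":")
    value, _, metric_type = rest.partition("|")
    metric_type = metric_type.split("|")[0]  # drop an optional '|@rate' suffix
    return name, float(value), metric_type


def listen(host: str = "127.0.0.1", port: int = 8125, count: int = 5) -> None:
    """Print the parsed contents of the first `count` datagrams received."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        for _ in range(count):
            data, _addr = sock.recvfrom(4096)
            for line in data.decode("utf-8", "replace").splitlines():
                print(parse_statsd_line(line))


if __name__ == "__main__":
    listen()
```

With the listener running, the timing call above should appear as a tuple whose type is "ms", the counter as "c", and the gauge as "g".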

Start the CloudWatch Agent with your configuration:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch/agent/config.json -s

Verify that metrics are appearing in CloudWatch under the “MyRubyApp” namespace and logs are flowing to the specified log groups.

MongoDB Cluster Health on AWS: Managed Service vs. Self-Managed EC2

Monitoring MongoDB clusters on AWS can be done either through a managed service or by managing your own instances on EC2. Amazon RDS does not offer a MongoDB engine; the AWS-managed option is Amazon DocumentDB (with MongoDB compatibility), and MongoDB Atlas can also be deployed on AWS infrastructure. Each approach has distinct monitoring requirements and tools.

1. Monitoring Amazon DocumentDB

DocumentDB simplifies management by handling patching, backups, and underlying infrastructure. Monitoring primarily relies on CloudWatch metrics published by AWS (in the AWS/DocDB namespace) and Performance Insights.

Key CloudWatch Metrics to Monitor:

CPUUtilization: High CPU can indicate inefficient queries or insufficient instance size.
DatabaseConnections: Monitor the number of open connections. Exceeding the instance's limit can cause application failures.
ReadIOPS / WriteIOPS: Track disk I/O operations. Spikes might correlate with heavy read/write loads or slow queries.
ReadLatency / WriteLatency: Crucial for understanding query performance. High latency points to performance bottlenecks.
NetworkReceiveThroughput / NetworkTransmitThroughput: Monitor data transfer rates.
FreeableMemory / VolumeBytesUsed: Cluster storage grows automatically, so watch available memory on each instance and the volume size (for cost and growth trends) instead of free disk space.
DBInstanceReplicaLag / DBClusterReplicaLagMaximum: The replication delay, in milliseconds, between the primary and replica instances. High lag can lead to stale reads and data consistency issues.

Set up CloudWatch Alarms for these metrics. For example, an alarm for DBInstanceReplicaLag exceeding 10 seconds (10,000 ms) on any replica is a critical alert.
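
As a rough illustration of the replica-lag alarm, the Python sketch below builds the parameters for CloudWatch's put_metric_alarm call. It assumes the managed cluster is Amazon DocumentDB, which reports lag as DBInstanceReplicaLag (milliseconds) in the AWS/DocDB namespace; adjust the namespace and metric name for a different service. The alarm name, evaluation window, and function names are illustrative choices, and boto3 is imported lazily so the parameter builder can be inspected without AWS credentials.

```python
def replica_lag_alarm(instance_id: str, topic_arn: str,
                      threshold_ms: float = 10_000) -> dict:
    """Build put_metric_alarm parameters for one replica instance."""
    return {
        "AlarmName": f"{instance_id}-replica-lag",  # hypothetical naming scheme
        "AlarmDescription": f"Replica lag above {threshold_ms:.0f} ms",
        "Namespace": "AWS/DocDB",
        "MetricName": "DBInstanceReplicaLag",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # seconds per datapoint
        "EvaluationPeriods": 3,  # three consecutive breaching minutes
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [topic_arn],
    }


def create_alarm(params: dict) -> None:
    """Create or update the alarm; requires boto3 and AWS credentials."""
    import boto3  # third-party dependency, imported only when actually used

    boto3.client("cloudwatch").put_metric_alarm(**params)


if __name__ == "__main__":
    params = replica_lag_alarm(
        "my-docdb-replica-1",  # hypothetical instance identifier
        "arn:aws:sns:us-east-1:123456789012:MyCriticalAlertsTopic",
    )
    print(params)
```

Requiring several consecutive breaching periods trades a few minutes of detection delay for fewer pages caused by short lag spikes.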

Performance Insights:

Enable Performance Insights for your managed database instances. It provides a dashboard that visualizes database load and highlights the query patterns contributing most to it, which helps pinpoint slow queries and resource-intensive operations.

2. Monitoring Self-Managed MongoDB on EC2

Managing MongoDB on EC2 gives you full control but also full responsibility for monitoring and maintenance. This requires a combination of OS-level monitoring, MongoDB-specific tools, and potentially third-party solutions.

2.1. OS-Level Monitoring (CloudWatch Agent)

Use the CloudWatch Agent as described previously to collect standard EC2 metrics (CPU, Memory, Disk, Network). For MongoDB, pay close attention to:

Disk I/O: MongoDB is I/O intensive. Monitor iostat output or equivalent CloudWatch metrics for high wait times or queue lengths.
Memory Usage: Ensure sufficient RAM is available, especially for the WiredTiger cache.
Network Traffic: Monitor inter-node communication for replica sets and sharded clusters.

2.2. MongoDB Native Tools (`mongostat`, `mongotop`)

mongostat provides a real-time overview of MongoDB server statistics. It’s invaluable for quick health checks.

mongostat --host mongodb.example.com:27017 --username myuser --authenticationDatabase admin --rowcount 10

Omitting --password makes the tool prompt for it, which keeps the password out of the shell history and the process list.

Key fields to watch:

insert, query, update, delete: Operations per second.
getmore: Indicates cursor operations, often related to large result sets or inefficient queries.
qrw / arw: Queued and active readers/writers. A growing queue indicates contention.
net_in / net_out: Network traffic in/out.
res: Resident memory usage.
dirty / used: Percentage of the WiredTiger cache that is dirty and in use, respectively.
repl: Replication state of the member (for example PRI or SEC).

mongotop provides a per-collection view of the time spent on reads and writes. The trailing number is the reporting interval in seconds.

mongotop --host mongodb.example.com:27017 --username myuser --authenticationDatabase admin 5

This helps identify which collections the server spends the most time reading from and writing to.

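2.3. MongoDB Metrics via `serverStatus` and a Prometheus Exporter

mongod does not ship a built-in HTTP metrics endpoint. The legacy HTTP status interface (net.http.enabled) was deprecated in MongoDB 3.2 and removed in 3.6, so a modern mongod rejects that setting as an unrecognized option. Detailed metrics come from the serverStatus command instead, and a separate exporter process can translate them for monitoring systems like Prometheus.

Keep the net section of your mongod.conf limited to supported settings:

# /etc/mongod.conf
net:
  port: 27017
  bindIp: 0.0.0.0 # listens on all interfaces; restrict access with security groups or list specific addresses

Restart mongod after changing the configuration.

You can then read metrics with mongosh:

mongosh --host mongodb.example.com:27017 --username myuser --authenticationDatabase admin --quiet --eval 'printjson(db.serverStatus().connections)'

For Prometheus, run an exporter such as Percona's mongodb_exporter alongside each mongod. It serves serverStatus-derived metrics in Prometheus exposition format (9216 is its usual default port, but confirm this against the exporter's documentation):

curl http://localhost:9216/metrics

You can scrape this endpoint with the Prometheus server, or use a custom script that polls serverStatus and sends metrics to CloudWatch or another monitoring backend.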
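
For a custom agent, the serverStatus command can also be polled directly from a driver and reduced to a handful of numbers. The Python sketch below shows one way to do that, assuming the WiredTiger storage engine. The function names and the selection of fields are choices made for this example, and pymongo is imported lazily so the summarizing logic can be exercised without a database.

```python
def summarize_server_status(status: dict) -> dict:
    """Reduce a serverStatus document to a few headline metrics."""
    cache = status.get("wiredTiger", {}).get("cache", {})
    max_bytes = cache.get("maximum bytes configured", 0)

    def percent_of_cache(part: float) -> float:
        return round(100.0 * part / max_bytes, 2) if max_bytes else 0.0

    return {
        "connections_current": status.get("connections", {}).get("current", 0),
        "cache_used_percent": percent_of_cache(
            cache.get("bytes currently in the cache", 0)
        ),
        "cache_dirty_percent": percent_of_cache(
            cache.get("tracked dirty bytes in the cache", 0)
        ),
        "opcounters": dict(status.get("opcounters", {})),
    }


def fetch_server_status(uri: str) -> dict:
    """Run serverStatus against a live server; requires pymongo."""
    from pymongo import MongoClient  # third-party dependency

    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    try:
        return client.admin.command("serverStatus")
    finally:
        client.close()


if __name__ == "__main__":
    # Hypothetical URI; supply real credentials through your own configuration.
    print(summarize_server_status(fetch_server_status("mongodb://localhost:27017")))
```

The two cache percentages correspond to the used and dirty columns that mongostat prints.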

2.4. Replication Lag Monitoring

Replication lag is critical. You can check it using the rs.status() command in mongosh (the legacy mongo shell is no longer shipped with current MongoDB releases).

mongosh --host mongodb.example.com:27017 --username myuser --authenticationDatabase admin
rs.status()

Compare the optimeDate of the primary with the optimeDate of each secondary. The difference is the replication lag. Automate this check with a script, for example in Python using pymongo, and send the lag as a custom metric to CloudWatch.

from datetime import timedelta

import pymongo
from pymongo.errors import ConnectionFailure

# Connection details
MONGO_HOST = "mongodb.example.com"
MONGO_PORT = 27017
MONGO_USER = "myuser"
MONGO_PASS = "mypassword"
AUTH_DB = "admin"

# CloudWatch client (assuming boto3 is configured)
# import boto3
# cw_client = boto3.client('cloudwatch')

def get_replication_lag():
    """Return the largest secondary lag in seconds, or None on failure."""
    client = None
    try:
        client = pymongo.MongoClient(
            host=MONGO_HOST,
            port=MONGO_PORT,
            username=MONGO_USER,
            password=MONGO_PASS,
            authSource=AUTH_DB,
            serverSelectionTimeoutMS=5000
        )
        client.admin.command('ping') # Check connection

        repl_status = client.admin.command('replSetGetStatus')
        members = repl_status.get('members', [])

        # optimeDate is a datetime; optime itself is a {ts, t} document
        primary_optime = None
        for member in members:
            if member.get('stateStr') == 'PRIMARY':
                primary_optime = member.get('optimeDate')
                break

        if primary_optime is None:
            print("Could not find primary node.")
            return None

        max_lag = timedelta(0)
        for member in members:
            # Only secondaries replicate data; this skips the primary and arbiters
            if member.get('stateStr') != 'SECONDARY':
                continue
            secondary_optime = member.get('optimeDate')
            if secondary_optime is None:
                continue
            lag = max(primary_optime - secondary_optime, timedelta(0))
            if lag > max_lag:
                max_lag = lag

            # Example: Send lag to CloudWatch
            # cw_client.put_metric_data(
            #     Namespace='MyMongoDBCluster',
            #     MetricData=[
            #         {
            #             'MetricName': 'ReplicaLagSeconds',
            #             'Dimensions': [
            #                 {'Name': 'ReplicaSetName', 'Value': repl_status.get('set')},
            #                 {'Name': 'MemberId', 'Value': str(member.get('_id'))},
            #                 {'Name': 'MemberHost', 'Value': member.get('name')}
            #             ],
            #             'Value': lag.total_seconds(),
            #             'Unit': 'Seconds'
            #         },
            #     ]
            # )
            print(f"Lag for {member.get('name')}: {lag.total_seconds()} seconds")

        print(f"Max replication lag: {max_lag.total_seconds()} seconds")
        return max_lag.total_seconds()

    except ConnectionFailure as e:
        print(f"Could not connect to MongoDB: {e}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None
    finally:
        if client is not None:
            client.close()

if __name__ == "__main__":
    get_replication_lag()

Alerting Strategy and Best Practices

Effective alerting is crucial to prevent outages. A well-defined strategy ensures that the right people are notified about the right issues at the right time.

1. Define Alerting Tiers

Categorize alerts based on severity and required action:

Critical Alerts: Immediate action required. These should trigger PagerDuty, Opsgenie, or similar on-call alerting systems. Examples: Application down, database unreachable, critical error rate spike, disk full.
Warning Alerts: Indicate a potential problem that needs investigation soon, but not necessarily immediate intervention. These can go to a team Slack channel or email distribution list. Examples: High but not critical CPU usage, increasing error rates, nearing disk capacity.
Informational Alerts: For awareness. Examples: Deployment success/failure, scheduled maintenance start/end.
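
The tiers can be made explicit in code so that every check routes through the same mapping. The Python sketch below is one possible shape, not a prescribed design: the thresholds, the warning topic ARN, and the function names are hypothetical, while the critical and informational topic ARNs reuse the example ARNs from this article.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Where each tier is delivered; the warning topic is a made-up example.
TOPIC_BY_SEVERITY = {
    Severity.CRITICAL: "arn:aws:sns:us-east-1:123456789012:MyCriticalAlertsTopic",
    Severity.WARNING: "arn:aws:sns:us-east-1:123456789012:MyWarningAlertsTopic",
    Severity.INFO: "arn:aws:sns:us-east-1:123456789012:MyInformationalAlertsTopic",
}


def classify_disk_usage(used_percent: float, warn_at: float = 80.0,
                        crit_at: float = 95.0) -> Optional[Severity]:
    """Map a disk usage reading to a tier, or None when no alert is needed."""
    if used_percent >= crit_at:
        return Severity.CRITICAL
    if used_percent >= warn_at:
        return Severity.WARNING
    return None


def topic_for(severity: Severity) -> str:
    """Return the SNS topic ARN that handles the given tier."""
    return TOPIC_BY_SEVERITY[severity]


if __name__ == "__main__":
    for reading in (42.0, 85.0, 97.5):
        tier = classify_disk_usage(reading)
        print(reading, tier, topic_for(tier) if tier else "no alert")
```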

2. Actionable Alerts

Each alert should provide enough context for an engineer to understand the problem and begin troubleshooting. Include:

The affected service and environment (e.g., “Production Ruby App on EC2-us-east-1a”).
The specific metric that triggered the alert (e.g., “Puma process not running”).
The current value of the metric and the threshold that was breached.
Links to relevant dashboards (CloudWatch, Grafana, Kibana) for further investigation.
Potential runbooks or troubleshooting guides.
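
One way to make sure no alert goes out without this context is to build every notification from a single structure. The sketch below is a minimal Python illustration: the Alert class, its field names, and the example URL are assumptions for this example rather than part of any alerting product.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    environment: str
    metric: str
    current_value: float
    threshold: float
    dashboard_url: str
    runbook_url: str = ""  # optional, but strongly encouraged

    def render(self) -> str:
        """Return the alert as a plain-text message, one fact per line."""
        lines = [
            f"Service: {self.service} ({self.environment})",
            f"Metric: {self.metric}",
            f"Value: {self.current_value} (threshold {self.threshold})",
            f"Dashboard: {self.dashboard_url}",
        ]
        if self.runbook_url:
            lines.append(f"Runbook: {self.runbook_url}")
        return "\n".join(lines)


if __name__ == "__main__":
    alert = Alert(
        service="Production Ruby App",
        environment="EC2-us-east-1a",
        metric="Puma process not running",
        current_value=0,
        threshold=1,
        dashboard_url="https://example.com/dashboards/puma",  # placeholder URL
    )
    print(alert.render())
```

Making the context fields required constructor arguments means an alert that lacks them fails at build time instead of paging someone with an empty message.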

3. Alerting Tools and Integration

Leverage AWS CloudWatch Alarms as the primary mechanism for triggering alerts based on collected metrics. Integrate these alarms with notification services like:

Amazon SNS (Simple Notification Service): To fan out notifications to various endpoints (email, SMS, SQS, Lambda).
PagerDuty / Opsgenie: For robust on-call scheduling and escalation policies. Configure SNS topics to publish to these services.
Slack / Microsoft Teams: For team-wide visibility. Use SNS to trigger a Lambda function that posts messages to your chat channels.
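
The SNS-to-chat path can be sketched as a small Lambda function. The Python example below assumes the function is subscribed to the SNS topic that receives CloudWatch alarm notifications and that a Slack incoming-webhook URL is supplied in an environment variable named SLACK_WEBHOOK_URL; that variable name and the message layout are choices made for this example.

```python
import json
import os
import urllib.request


def format_alarm(sns_message: str) -> str:
    """Turn the JSON body of a CloudWatch alarm notification into one line."""
    alarm = json.loads(sns_message)
    return "[{state}] {name}: {reason}".format(
        state=alarm.get("NewStateValue", "UNKNOWN"),
        name=alarm.get("AlarmName", "unnamed alarm"),
        reason=alarm.get("NewStateReason", "no reason given"),
    )


def lambda_handler(event, context):
    """Post each SNS record in the event to the Slack webhook."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # assumed configuration
    records = event.get("Records", [])
    for record in records:
        text = format_alarm(record["Sns"]["Message"])
        request = urllib.request.Request(
            webhook_url,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=5) as response:
            response.read()
    return {"posted": len(records)}
```

Only the standard library is used, so the function can be deployed without bundling extra packages.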

Example CloudWatch Alarm configuration (conceptual):

# CloudWatch Alarm: Puma Process Down
# Assumes a custom metric that reports 1 while Puma is running and 0 when it is down
MetricName: PumaProcessStatus
Namespace: MyRubyApp
Statistic: Minimum
Period: 300 # 5 minutes
EvaluationPeriods: 1
Threshold: 1
ComparisonOperator: LessThanThreshold # Fires when the minimum drops to 0
TreatMissingData: breaching # An instance that stops reporting is treated as down
AlarmActions:
  - arn:aws:sns:us-east-1:123456789012:MyCriticalAlertsTopic
OKActions:
  - arn:aws:sns:us-east-1:123456789012:MyInformationalAlertsTopic
AlarmDescription: "The Puma process is not running on an EC2 instance."

Ensure your monitoring and alerting systems are regularly reviewed and tested. False positives can lead to alert fatigue, while missed alerts can result in prolonged outages. A mature monitoring strategy is an ongoing process of refinement.