Server Monitoring Best Practices: Keeping Your Ruby App and MongoDB Clusters Alive on AWS
Proactive Health Checks for Ruby Applications on EC2
Maintaining the health of Ruby applications deployed on AWS EC2 instances requires a multi-layered approach to monitoring. Beyond basic CPU and memory utilization, we need to inspect application-specific metrics and ensure critical processes are running. This involves leveraging both AWS CloudWatch and application-level tools.
1. Application Process Monitoring with `monit`
monit is a powerful, lightweight utility for managing and monitoring Unix systems. It can automatically perform actions (like restarting a process) when a service fails. We’ll configure it to watch our Puma or Unicorn process.
First, install monit on your EC2 instance:
On Ubuntu/Debian:
sudo apt-get update
sudo apt-get install monit
On Amazon Linux/CentOS/RHEL:
sudo yum update
sudo yum install monit
Next, create a configuration file for your Ruby application. Assuming your application runs via Puma and its PID file is located at /var/www/my_ruby_app/shared/pids/puma.pid:
# /etc/monit/conf.d/puma.conf
check process puma with pidfile /var/www/my_ruby_app/shared/pids/puma.pid
  start program = "/bin/systemctl start puma"
  stop program = "/bin/systemctl stop puma"
  if not exist then restart
  if 5 restarts within 5 cycles then timeout
Ensure your Puma service is managed by systemd. A typical systemd service file (/etc/systemd/system/puma.service) might look like this:
[Unit]
Description=Puma HTTP Server
After=network.target

[Service]
Type=simple
User=deploy
Group=deploy
WorkingDirectory=/var/www/my_ruby_app/current
ExecStart=/usr/local/bin/bundle exec puma -C /var/www/my_ruby_app/shared/puma.rb
ExecStop=/bin/kill -s TERM $MAINPID
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
After creating or modifying the monit configuration, test it and reload:
sudo monit -t
sudo systemctl reload monit
You can check the status of your monitored processes via the monit status page (if enabled) or the command line:
sudo monit status
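Conceptually, a process existence check like the one monit performs comes down to reading the PID file and asking the kernel whether that process is alive. The following is a minimal Python sketch of that idea for Unix systems, not monit's actual implementation; the function name process_status is hypothetical. It returns 1 or 0, so the result could also feed a status metric such as the PumaProcessStatus metric assumed in the alarm example later in this article.

```python
import os
from pathlib import Path


def process_status(pidfile: str) -> int:
    """Return 1 if the PID recorded in pidfile belongs to a live process, else 0."""
    try:
        pid = int(Path(pidfile).read_text().strip())
    except (OSError, ValueError):
        return 0  # missing, unreadable, or malformed PID file
    if pid <= 0:
        return 0
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return 0  # no such process: the PID file is stale
    except PermissionError:
        return 1  # the process exists but belongs to another user
    return 1


if __name__ == "__main__":
    # PID file path taken from the monit example above.
    print(process_status("/var/www/my_ruby_app/shared/pids/puma.pid"))
```

A stale PID file (left behind after a crash) is reported as 0, which is the case where a restart is needed.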
2. Custom Application Metrics with CloudWatch Agent
monit handles process availability, but we need deeper insights into application performance. The CloudWatch Agent allows us to collect custom metrics, logs, and system-level metrics beyond the default EC2 metrics.
Install the CloudWatch Agent. Refer to the official AWS documentation for the latest installation instructions for your specific OS. For Amazon Linux 2, it’s typically:
wget https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
sudo rpm -e amazon-cloudwatch-agent # If an older version is installed
sudo rpm -i amazon-cloudwatch-agent.rpm
Create a configuration file for the agent. This example collects standard system metrics, accepts custom application metrics over StatsD, and ships a hypothetical Ruby application log file to CloudWatch Logs. The agent forwards log lines as they are; to derive metrics such as request time from them, define CloudWatch Logs metric filters on the log group. Save the configuration as a JSON file (e.g., /opt/aws/amazon-cloudwatch/agent/config.json):
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyRubyApp",
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_user",
          "cpu_usage_system",
          "cpu_usage_idle"
        ],
        "totalcpu": true
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ]
      },
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "resources": [
          "/"
        ]
      },
      "statsd": {
        "service_address": ":8125",
        "metrics_collection_interval": 60
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my_ruby_app/production.log",
            "log_group_name": "/aws/ecs/my_ruby_app/production",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          }
        ]
      }
    }
  }
}
To send custom application metrics (like request latency, database query times, etc.) to CloudWatch, you’ll need to instrument your Ruby application. A common approach is to use a StatsD client library. For example, using the statsd-ruby gem:
# Gemfile
gem 'statsd-ruby'
# In your Ruby application (e.g., in middleware or a service object)
require 'statsd'
# Initialize StatsD client (assuming agent is listening on localhost:8125)
statsd = Statsd.new('localhost', 8125)
# Example: Timing a database query
start_time = Time.now
# ... perform database query ...
db_duration = Time.now - start_time
statsd.timing('db.query_time', db_duration * 1000) # Send in milliseconds
# Example: Incrementing a counter for successful requests
statsd.increment('requests.success')
# Example: Setting a gauge for active users
statsd.gauge('users.active', current_user_count)
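It can help to know what the client actually puts on the wire, for instance when debugging with tcpdump or emitting metrics from a script that has no StatsD library. The sketch below shows the plain-text StatsD line format (name:value|type) for the three metric types used above. It is written in Python to match the monitoring script later in this article; the function names format_statsd and send_statsd are hypothetical, and the host and port assume the agent's StatsD listener from the configuration above.

```python
import socket


def format_statsd(name: str, value: float, metric_type: str) -> bytes:
    """Encode one metric in the plain-text StatsD line format: name:value|type."""
    if metric_type not in {"c", "ms", "g"}:  # counter, timer, gauge
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_statsd(payload: bytes, host: str = "localhost", port: int = 8125) -> None:
    """Send one datagram over UDP; delivery is not confirmed by the protocol."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Equivalent to statsd.timing('db.query_time', 12.5) in the Ruby example.
    send_statsd(format_statsd("db.query_time", 12.5, "ms"))
```

Because UDP is connectionless, a stopped agent does not slow down or break the application; the metrics are simply lost, which is the usual trade-off with StatsD.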
Start the CloudWatch Agent with your configuration:
sudo /opt/aws/amazon-cloudwatch/agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch/agent/config.json -s
Verify that metrics are appearing in CloudWatch under the “MyRubyApp” namespace and logs are flowing to the specified log groups.
MongoDB Cluster Health on AWS: Amazon DocumentDB vs. Self-Managed EC2
MongoDB-compatible clusters on AWS can run either on a managed service or on instances you operate yourself on EC2. Amazon RDS has no MongoDB engine; the AWS-managed option is Amazon DocumentDB (with MongoDB compatibility). Each approach has distinct monitoring requirements and tools.
1. Monitoring Amazon DocumentDB
DocumentDB simplifies management by handling patching, backups, and underlying infrastructure. Monitoring primarily relies on the CloudWatch metrics AWS publishes in the AWS/DocDB namespace and on Performance Insights.
Key CloudWatch Metrics to Monitor:
- CPUUtilization: High CPU can indicate inefficient queries or insufficient instance size.
- DatabaseConnections: Monitor the number of active connections. Exceeding limits can cause application failures.
- ReadIOPS / WriteIOPS: Track disk I/O operations. Spikes might correlate with heavy read/write loads or slow queries.
- ReadLatency / WriteLatency: Crucial for understanding query performance. High latency points to performance bottlenecks.
- NetworkReceiveThroughput / NetworkTransmitThroughput: Monitor data transfer rates.
- FreeableMemory / FreeLocalStorage / VolumeBytesUsed: Cluster storage grows automatically, so watch instance memory and local temporary storage for exhaustion, and track volume growth for cost. Set alarms to trigger well before a limit is reached.
- DBInstanceReplicaLag: The replication delay, in milliseconds, between the primary and a replica instance. High lag can lead to stale reads and data consistency issues.
Set up CloudWatch Alarms for these metrics. For example, an alarm for DBInstanceReplicaLag exceeding 10,000 ms (10 seconds) on any replica is a critical alert.
Performance Insights:
Enable Performance Insights for your DocumentDB instances. This feature provides a dashboard to visualize database load and identify the query patterns contributing most to it. It helps pinpoint slow queries and resource-intensive operations.
2. Monitoring Self-Managed MongoDB on EC2
Managing MongoDB on EC2 gives you full control but also full responsibility for monitoring and maintenance. This requires a combination of OS-level monitoring, MongoDB-specific tools, and potentially third-party solutions.
2.1. OS-Level Monitoring (CloudWatch Agent)
Use the CloudWatch Agent as described previously to collect standard EC2 metrics (CPU, Memory, Disk, Network). For MongoDB, pay close attention to:
- Disk I/O: MongoDB is I/O intensive. Monitor iostat output or equivalent CloudWatch metrics for high wait times or queue lengths.
- Memory Usage: Ensure sufficient RAM is available, especially for the WiredTiger cache.
- Network Traffic: Monitor inter-node communication for replica sets and sharded clusters.
2.2. MongoDB Native Tools (`mongostat`, `mongotop`)
mongostat provides a real-time overview of MongoDB server statistics. It’s invaluable for quick health checks.
mongostat --host mongodb.example.com:27017 --username myuser --password mypassword --authenticationDatabase admin --rowcount 10
Key fields to watch:
- insert, query, update, delete: Operations per second.
- getmore: Indicates cursor operations, often related to large result sets or inefficient queries.
- qrw / arw: Queued and active readers and writers. Persistent queues indicate contention.
- net_in / net_out: Network traffic in/out.
- res: Resident memory usage.
- dirty / used: Percentage of the WiredTiger cache that is dirty and in use.
- repl: The member's replication state (e.g., PRI, SEC). mongostat does not report oplog lag; use rs.status() for that.
mongotop provides a per-collection view of the time spent on reads and writes.
mongotop --host mongodb.example.com:27017 --username myuser --password mypassword --authenticationDatabase admin
This helps identify which collections account for the most read and write time.
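2.3. MongoDB Metrics via `serverStatus` and a Prometheus Exporter
Current MongoDB releases do not expose metrics over a built-in HTTP endpoint. The legacy HTTP interface (net.http.enabled) was deprecated in MongoDB 3.2 and removed in 3.6, so a mongod.conf that still sets it is rejected. Detailed metrics come from the serverStatus command instead, and an exporter can republish them for monitoring systems like Prometheus or custom agents.
Keep the net section of mongod.conf free of http settings:
# /etc/mongod.conf
net:
  port: 27017
  bindIp: 0.0.0.0  # prefer listing specific addresses and restricting access with security groups
Restart mongod after changing the configuration.
You can inspect the raw metrics with:
mongosh --host mongodb.example.com:27017 --username myuser --password mypassword --authenticationDatabase admin --eval "db.serverStatus()"
To expose them over HTTP, run a Prometheus exporter next to mongod, for example Percona's mongodb_exporter, which listens on port 9216 by default:
curl http://localhost:9216/metrics
This output is in Prometheus exposition format. You can scrape this endpoint with the Prometheus server or use a custom script to parse it and send metrics to CloudWatch or another monitoring backend.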
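If you take the custom-script route, parsing the exposition format is mostly line splitting. The sketch below is a deliberately small parser that handles plain samples with optional labels and skips comments; it is an illustration, not a complete implementation of the format, and the function name parse_exposition and the metric names in the usage example are hypothetical. For production use, the prometheus_client library ships a full parser.

```python
import re
from typing import Dict, Iterator, Tuple

# Metric name, optional {labels}, value; an optional trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (name, labels, value) for each sample line in exposition-format text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = _SAMPLE.match(line)
        if match is None:
            continue  # not a sample this sketch understands
        name, raw_labels, raw_value = match.groups()
        try:
            value = float(raw_value)  # also accepts NaN, +Inf, -Inf
        except ValueError:
            continue
        yield name, dict(_LABEL.findall(raw_labels or "")), value


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    sample = 'mongodb_connections{state="current"} 12\nmongodb_up 1\n'
    for name, labels, value in parse_exposition(sample):
        print(name, labels, value)
```

Each parsed sample maps naturally onto one CloudWatch datum: the name becomes the metric name, the labels become dimensions, and the value is sent as-is.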
2.4. Replication Lag Monitoring
Replication lag is critical. You can check it by running rs.status() in mongosh (or in the legacy mongo shell on releases before MongoDB 6.0):
mongosh --host mongodb.example.com:27017 --username myuser --password mypassword --authenticationDatabase admin
rs.status()
Compare the optimeDate on the primary with the optimeDate on each secondary. The difference is the replication lag. Automate this check with a script that sends the lag as a custom metric to CloudWatch, for example with pymongo:
from datetime import timedelta

import pymongo
from pymongo.errors import ConnectionFailure

# Connection details
MONGO_HOST = "mongodb.example.com"
MONGO_PORT = 27017
MONGO_USER = "myuser"
MONGO_PASS = "mypassword"
AUTH_DB = "admin"

# CloudWatch client (assuming boto3 is installed and credentials are configured)
# import boto3
# cw_client = boto3.client('cloudwatch')


def get_replication_lag():
    """Return the largest secondary lag in seconds, or None on failure."""
    try:
        client = pymongo.MongoClient(
            host=MONGO_HOST,
            port=MONGO_PORT,
            username=MONGO_USER,
            password=MONGO_PASS,
            authSource=AUTH_DB,
            serverSelectionTimeoutMS=5000
        )
        client.admin.command('ping')  # Check connection
        repl_status = client.admin.command('replSetGetStatus')
        members = repl_status.get('members', [])

        # optimeDate is the wall-clock time of the last oplog entry a member applied
        primary_optime = None
        for member in members:
            if member.get('stateStr') == 'PRIMARY':
                primary_optime = member['optimeDate']
                break
        if primary_optime is None:
            print("Could not find primary node.")
            return None

        max_lag = timedelta(0)
        for member in members:
            if member.get('stateStr') != 'SECONDARY':
                continue  # Skip the primary, arbiters, and members that are down
            lag = max(primary_optime - member['optimeDate'], timedelta(0))
            max_lag = max(max_lag, lag)
            # Example: Send lag to CloudWatch
            # cw_client.put_metric_data(
            #     Namespace='MyMongoDBCluster',
            #     MetricData=[
            #         {
            #             'MetricName': 'ReplicaLagSeconds',
            #             'Dimensions': [
            #                 {'Name': 'ReplicaSetName', 'Value': repl_status['set']},
            #                 {'Name': 'MemberId', 'Value': str(member['_id'])},
            #                 {'Name': 'MemberHost', 'Value': member['name']}
            #             ],
            #             'Value': lag.total_seconds(),
            #             'Unit': 'Seconds'
            #         },
            #     ]
            # )
            print(f"Lag for {member.get('name')}: {lag.total_seconds()} seconds")

        print(f"Max replication lag: {max_lag.total_seconds()} seconds")
        return max_lag.total_seconds()
    except ConnectionFailure as e:
        print(f"Could not connect to MongoDB: {e}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None


if __name__ == "__main__":
    get_replication_lag()
Alerting Strategy and Best Practices
Effective alerting is crucial to prevent outages. A well-defined strategy ensures that the right people are notified about the right issues at the right time.
1. Define Alerting Tiers
Categorize alerts based on severity and required action:
- Critical Alerts: Immediate action required. These should trigger PagerDuty, Opsgenie, or similar on-call alerting systems. Examples: Application down, database unreachable, critical error rate spike, disk full.
- Warning Alerts: Indicate a potential problem that needs investigation soon, but not necessarily immediate intervention. These can go to a team Slack channel or email distribution list. Examples: High but not critical CPU usage, increasing error rates, nearing disk capacity.
- Informational Alerts: For awareness. Examples: Deployment success/failure, scheduled maintenance start/end.
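The tiers above can be encoded as data so that routing decisions are explicit and testable. The following is a minimal sketch under stated assumptions: the names Thresholds, ROUTES, and classify, the threshold values, and the routing targets are all hypothetical placeholders for whatever your on-call tooling uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Values at or above which a metric enters the warning or critical tier."""
    warning: float
    critical: float


# Hypothetical routing table: tier -> notification target.
ROUTES = {
    "critical": "on-call pager",
    "warning": "team chat channel",
    "informational": "email digest",
}


def classify(value: float, thresholds: Thresholds) -> str:
    """Map a metric value to 'critical', 'warning', or 'ok' (no alert)."""
    if value >= thresholds.critical:
        return "critical"
    if value >= thresholds.warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    disk_used = Thresholds(warning=80.0, critical=95.0)  # example percentages
    tier = classify(91.0, disk_used)
    print(tier, "->", ROUTES.get(tier, "no notification"))
```

Informational alerts are usually event-driven (deployments, maintenance windows) rather than threshold-driven, so they appear in the routing table but not in classify.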
2. Actionable Alerts
Each alert should provide enough context for an engineer to understand the problem and begin troubleshooting. Include:
- The affected service and environment (e.g., “Production Ruby App on EC2-us-east-1a”).
- The specific metric that triggered the alert (e.g., “Puma process not running”).
- The current value of the metric and the threshold that was breached.
- Links to relevant dashboards (CloudWatch, Grafana, Kibana) for further investigation.
- Potential runbooks or troubleshooting guides.
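One way to enforce this checklist is to build alert text from a structure that has a field for each item, so an alert cannot be sent without its context. This is a sketch only; the Alert class and the example.com URLs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alert:
    service: str
    environment: str
    metric: str
    current_value: float
    threshold: float
    dashboards: List[str] = field(default_factory=list)
    runbook: str = ""

    def render(self) -> str:
        """Format the alert as plain text suitable for email, chat, or a pager."""
        lines = [
            f"[{self.environment}] {self.service}: {self.metric}",
            f"Current value: {self.current_value} (threshold: {self.threshold})",
        ]
        lines += [f"Dashboard: {url}" for url in self.dashboards]
        if self.runbook:
            lines.append(f"Runbook: {self.runbook}")
        return "\n".join(lines)


if __name__ == "__main__":
    alert = Alert(
        service="Ruby App on EC2-us-east-1a",
        environment="Production",
        metric="Puma process not running",
        current_value=0,
        threshold=1,
        dashboards=["https://example.com/dashboards/puma"],  # placeholder URL
        runbook="https://example.com/runbooks/puma-down",  # placeholder URL
    )
    print(alert.render())
```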
3. Alerting Tools and Integration
Leverage AWS CloudWatch Alarms as the primary mechanism for triggering alerts based on collected metrics. Integrate these alarms with notification services like:
- Amazon SNS (Simple Notification Service): To fan out notifications to various endpoints (email, SMS, SQS, Lambda).
- PagerDuty / Opsgenie: For robust on-call scheduling and escalation policies. Configure SNS topics to publish to these services.
- Slack / Microsoft Teams: For team-wide visibility. Use SNS to trigger a Lambda function that posts messages to your chat channels.
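The SNS-to-Lambda-to-chat path can be sketched as a small Lambda function. CloudWatch delivers the alarm to SNS as a JSON string in the record's Message field, with keys such as AlarmName, NewStateValue, and NewStateReason. The sketch below assumes a Slack incoming webhook whose URL is stored in an environment variable named SLACK_WEBHOOK_URL; that variable name and the function names are assumptions for illustration.

```python
import json
import os
import urllib.request


def build_slack_payload(sns_event: dict) -> dict:
    """Turn the first SNS record carrying a CloudWatch alarm into a Slack message."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    text = (
        f"{message['AlarmName']} is {message['NewStateValue']}: "
        f"{message['NewStateReason']}"
    )
    return {"text": text}


def handler(event, context):
    """Lambda entry point: post the alarm summary to the configured webhook."""
    payload = json.dumps(build_slack_payload(event)).encode("utf-8")
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # assumed environment variable
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return {"status": response.status}
```

Keeping the payload construction separate from the HTTP call makes the formatting logic easy to test without network access.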
Example CloudWatch Alarm configuration (conceptual):
# CloudWatch Alarm: Puma Process Down
MetricName: PumaProcessStatus
Namespace: MyRubyApp
Statistic: Minimum
Period: 300 # 5 minutes
EvaluationPeriods: 1
Threshold: 1 # Assuming status is 1 if running, 0 if down
ComparisonOperator: LessThanThreshold
AlarmActions:
  - arn:aws:sns:us-east-1:123456789012:MyCriticalAlertsTopic
OKActions:
  - arn:aws:sns:us-east-1:123456789012:MyInformationalAlertsTopic
AlarmDescription: "The Puma process is not running on an EC2 instance."
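To create such an alarm programmatically, the same fields map onto the keyword arguments of boto3's put_metric_alarm call. The sketch below only builds the arguments so it can run without AWS credentials; the function name puma_down_alarm, the alarm naming scheme, and the InstanceId dimension are assumptions (the dimension must match whatever your custom metric is published with).

```python
from typing import Any, Dict


def puma_down_alarm(instance_id: str, critical_topic: str, ok_topic: str) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"puma-down-{instance_id}",
        "AlarmDescription": "The Puma process is not running on an EC2 instance.",
        "Namespace": "MyRubyApp",
        "MetricName": "PumaProcessStatus",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 300,  # 5 minutes
        "EvaluationPeriods": 1,
        # Status is 1 when running and 0 when down, so "less than 1" means down.
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        # If the instance dies, the metric stops arriving; count that as a breach.
        "TreatMissingData": "breaching",
        "AlarmActions": [critical_topic],
        "OKActions": [ok_topic],
    }


if __name__ == "__main__":
    kwargs = puma_down_alarm(
        "i-0123456789abcdef0",  # placeholder instance ID
        "arn:aws:sns:us-east-1:123456789012:MyCriticalAlertsTopic",
        "arn:aws:sns:us-east-1:123456789012:MyInformationalAlertsTopic",
    )
    print(kwargs["AlarmName"])
    # With boto3 installed and credentials configured:
    # import boto3
    # boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

Treating missing data as breaching is a design choice: it catches the case where the whole instance, and not only Puma, has gone away, at the cost of alarming during planned shutdowns.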
Ensure your monitoring and alerting systems are regularly reviewed and tested. False positives can lead to alert fatigue, while missed alerts can result in prolonged outages. A mature monitoring strategy is an ongoing process of refinement.