Server Monitoring Best Practices: Keeping Your Ruby App and Elasticsearch Clusters Alive on AWS
Proactive Health Checks for Ruby Applications on AWS EC2
Maintaining the health of Ruby applications deployed on AWS EC2 instances requires a multi-layered approach to monitoring. Beyond basic CPU and memory utilization, we need to inspect application-specific metrics and ensure critical processes are running. This involves leveraging both AWS native tools and external monitoring solutions.
Application-Level Health Endpoints
A fundamental practice is to expose a dedicated health check endpoint within your Ruby application. This endpoint should perform checks against its dependencies (database, external services, cache) and return a clear status. For Rails applications, a simple controller action can suffice:
# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def show
    # Example: Check database connection
    begin
      ActiveRecord::Base.connection.execute('SELECT 1')
      db_status = :ok
    rescue ActiveRecord::ConnectionNotEstablished, PG::Error => e
      db_status = :error
      Rails.logger.error("Database health check failed: #{e.message}")
    end

    # Example: Check external service (e.g., Redis)
    begin
      $redis.ping # Assuming $redis is a globally accessible Redis client
      redis_status = :ok
    rescue Redis::BaseConnectionError => e # Covers connection refusals and timeouts
      redis_status = :error
      Rails.logger.error("Redis health check failed: #{e.message}")
    end

    if db_status == :ok && redis_status == :ok
      render json: { status: 'ok', database: db_status, redis: redis_status }, status: :ok
    else
      render json: { status: 'degraded', database: db_status, redis: redis_status }, status: :service_unavailable
    end
  end
end
Ensure this controller is routed correctly. For Rails 5+, add to config/routes.rb:
# config/routes.rb
Rails.application.routes.draw do
  get 'health', to: 'health#show'
  # ... other routes
end
This endpoint can then be polled by external monitoring services or AWS Elastic Load Balancer (ELB) health checks.
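The pattern in the controller above (run each dependency check, record its status, derive an overall status) can be factored into a small framework-independent helper. The following is a minimal sketch, not a definitive implementation: `run_health_checks` and the sample checks are hypothetical names, and the two-second budget per check is an assumed default. A per-check timeout matters because a hung dependency would otherwise make the health endpoint itself hang until the load balancer gives up.

```ruby
require 'timeout'

# Runs each named check with its own time budget and reports :ok or :error.
# `checks` maps a name to a callable; any StandardError (including a timeout)
# counts as a failure.
def run_health_checks(checks, timeout_seconds: 2)
  results = checks.transform_values do |check|
    begin
      Timeout.timeout(timeout_seconds) { check.call }
      :ok
    rescue StandardError
      :error
    end
  end
  overall = results.values.all? { |result| result == :ok } ? 'ok' : 'degraded'
  { status: overall, checks: results }
end

# Hypothetical checks standing in for the database and Redis probes.
sample_checks = {
  database: -> { true },
  redis: -> { raise 'connection refused' }
}
report = run_health_checks(sample_checks)
puts report[:status]
```

In a controller, the returned hash could be rendered as JSON with `:ok` or `:service_unavailable` depending on `report[:status]`.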
AWS ELB Health Checks Configuration
For applications behind an ELB (Application Load Balancer or Classic Load Balancer), configuring health checks is crucial for automatically removing unhealthy instances from the load balancing pool. Target your application’s health endpoint.
Application Load Balancer (ALB) Example:
# AWS Console -> EC2 -> Load Balancers -> [Your ALB] -> Listeners -> View/edit rules -> [Your Rule] -> Edit action -> Forward to [Your Target Group]
# Target Group Settings:
Protocol: HTTP
Port: 80
Path: /health
Healthy threshold: 3
Unhealthy threshold: 2
Timeout: 5 seconds
Interval: 30 seconds
Success codes: 200
Classic Load Balancer (CLB) Example:
# AWS Console -> EC2 -> Load Balancers -> [Your CLB] -> Listeners -> [Your Listener] -> Edit -> Health Checks
# Health Check Settings:
Ping Protocol: HTTP
Ping Port: 80
Ping Path: /health
Response Timeout: 5
Interval: 30
Unhealthy Threshold: 2
Healthy Threshold: 3
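The interval and threshold settings together determine how quickly the load balancer reacts. As a rough back-of-the-envelope sketch (it ignores the response timeout and where in the interval the failure happens, and `seconds_to_state_change` is a hypothetical helper), the time to change a target's state is about the interval multiplied by the threshold:

```ruby
# Approximate time for a load balancer to change a target's state:
# `threshold` consecutive checks, spaced `interval_seconds` apart.
def seconds_to_state_change(interval_seconds:, threshold:)
  interval_seconds * threshold
end

# Values taken from the example settings above.
unhealthy_after = seconds_to_state_change(interval_seconds: 30, threshold: 2) # 60
healthy_after = seconds_to_state_change(interval_seconds: 30, threshold: 3)   # 90

puts "Removed from rotation after roughly #{unhealthy_after}s of failures"
puts "Returned to rotation after roughly #{healthy_after}s of successes"
```

With these settings a failing instance can keep receiving traffic for about a minute, so a shorter interval may be worth considering for latency-sensitive services.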
Process Monitoring with `monit`
While ELB handles instance-level health, we need to ensure the Ruby application process itself (e.g., Puma, Unicorn) is running. `monit` is a robust, open-source process and service monitoring tool that can restart failed processes and alert administrators.
Installation (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install monit
Configuration for Puma:
Create a configuration file for your application’s process. Assuming your Puma PID file is located at /path/to/your/app/tmp/pids/puma.pid and your application runs from /path/to/your/app:
# /etc/monit/conf.d/my_ruby_app
check process puma with pidfile /path/to/your/app/tmp/pids/puma.pid
  start program = "/bin/su - deploy -c 'cd /path/to/your/app && bundle exec puma -C /path/to/your/app/config/puma.rb'"
  # monit does not run commands through a shell, so command substitution needs an explicit sh -c
  stop program = "/bin/sh -c 'kill -s QUIT $(cat /path/to/your/app/tmp/pids/puma.pid)'"
  if failed port 3000 protocol http then restart
  if 5 restarts within 5 cycles then timeout
  group ruby_app
Enabling and Testing `monit`:
sudo monit reload
sudo monit status
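This configuration tells monit to watch the Puma process through its PID file, defines the commands used to start and stop it, and restarts it if it stops answering HTTP requests on its listening port (e.g., 3000). If 5 restarts happen within 5 monitoring cycles, monit stops trying (`timeout`) so that a crash-looping process is not restarted forever. The `group ruby_app` directive is useful for managing multiple related services.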
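To make the PID-file check less of a black box, here is a minimal Ruby sketch of the same idea: read the PID from the file and probe the process with signal 0. This illustrates the general technique, not monit's actual implementation, and `process_alive?` is a hypothetical helper name.

```ruby
# Returns true when the PID recorded in `pidfile` belongs to a live process.
def process_alive?(pidfile)
  pid = Integer(File.read(pidfile).strip)
  return false unless pid.positive?

  Process.kill(0, pid) # Signal 0 only checks for existence; nothing is delivered
  true
rescue Errno::ENOENT, ArgumentError, Errno::ESRCH
  false # Missing PID file, unparsable contents, or no such process
rescue Errno::EPERM
  true # The process exists but belongs to another user
end

puts process_alive?('/path/to/your/app/tmp/pids/puma.pid')
```

A stale PID file (left behind after a crash) is the case this check catches that a simple "does the file exist" test would miss.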
Integrating with AWS CloudWatch
AWS CloudWatch is essential for collecting metrics, logging, and setting up alarms. We’ll use it to monitor system-level metrics and application-specific logs.
System Metrics and Alarms
CloudWatch automatically collects basic EC2 metrics like CPUUtilization, NetworkIn, and NetworkOut. Memory and disk usage are not among them; collecting those requires the CloudWatch Agent. Configure alarms on these metrics to trigger notifications or auto-scaling actions.
# AWS Console -> CloudWatch -> Alarms -> Create alarm
# Metric: EC2 -> Per-Instance Metrics -> [Your Instance ID] -> CPUUtilization
# Threshold type: Static
# Whenever CPUUtilization is Greater/Equal than 80 for 5 consecutive periods of 1 minute (1-minute periods require detailed monitoring; basic monitoring reports every 5 minutes)
# Actions: Send notification to SNS topic (e.g., for email/Slack alerts) or trigger Auto Scaling action.
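The "N consecutive periods" condition is easy to misread, so here is a simplified Ruby sketch of how such a rule evaluates a series of datapoints. It is an illustration only: real CloudWatch alarms also support "M out of N" evaluation and configurable handling of missing data, and `alarm_breached?` is a hypothetical name.

```ruby
# True when the most recent `periods` datapoints are all at or above `threshold`.
def alarm_breached?(datapoints, threshold:, periods:)
  recent = datapoints.last(periods)
  recent.length == periods && recent.all? { |value| value >= threshold }
end

# One CPUUtilization datapoint per minute, oldest first (made-up sample values).
sustained_load = [42, 85, 91, 88, 95, 83]
brief_spike = [42, 85, 91, 60, 95, 83]

puts alarm_breached?(sustained_load, threshold: 80, periods: 5) # true
puts alarm_breached?(brief_spike, threshold: 80, periods: 5)    # false
```

Requiring several consecutive breaches is what keeps a short spike from paging anyone, at the cost of a slower reaction to a real problem.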
Application Log Aggregation
Centralizing application logs is critical for debugging and analysis. The CloudWatch Agent can be configured to stream logs from your EC2 instances to CloudWatch Logs.
Install CloudWatch Agent: Follow AWS documentation for your specific OS. For Amazon Linux 2:
sudo yum install -y amazon-cloudwatch-agent
Configure the Agent: Create a configuration file (e.g., /opt/aws/amazon-cloudwatch/agent/config.json) to specify which logs to collect. Ensure the IAM role attached to your EC2 instance allows logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents; the AWS managed policy CloudWatchAgentServerPolicy covers these.
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/path/to/your/app/log/production.log",
            "log_group_name": "/aws/ecs/your-app/production",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/syslog",
            "log_group_name": "/aws/ecs/your-app/syslog",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
Start the Agent:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch/agent/config.json -s
Now, your application logs will stream to CloudWatch Logs, allowing you to create metric filters and alarms based on log content (e.g., error messages).
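Metric filters are easier to write when each log line is a JSON object, because CloudWatch can then match on fields (for example `{ $.level = "ERROR" }`) instead of free text. The following is a minimal sketch of a JSON formatter for Ruby's standard `Logger`; `JsonLogFormatter` is a hypothetical class name, and the field names are assumptions you would adapt to your own conventions.

```ruby
require 'json'
require 'logger'
require 'time'

# Formats each log entry as one JSON object per line.
class JsonLogFormatter
  def call(severity, time, progname, message)
    text = message.is_a?(Exception) ? "#{message.class}: #{message.message}" : message.to_s
    entry = {
      level: severity,
      time: time.utc.iso8601(3),
      progname: progname,
      message: text
    }
    JSON.generate(entry) + "\n"
  end
end

logger = Logger.new($stdout)
logger.formatter = JsonLogFormatter.new
logger.error('Database health check failed: connection refused')
```

In a Rails application a formatter like this could be assigned through `config.log_formatter` in the environment configuration, though tagged logging and other logger wrappers may need extra care.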
Monitoring Elasticsearch Clusters on AWS (OpenSearch Service)
For Elasticsearch (or AWS OpenSearch Service), monitoring focuses on cluster health, node performance, and query performance. AWS OpenSearch Service provides built-in metrics and logging capabilities.
Key OpenSearch Service Metrics
AWS OpenSearch Service automatically publishes metrics to CloudWatch. Essential metrics to monitor include:
- Cluster Status: ClusterStatus.red, ClusterStatus.yellow (indicates issues with shards)
- Node Resource Utilization: JVMMemoryPressure, CPUUtilization
- Disk Usage: FreeStorageSpace (ensure sufficient space)
- Indexing Performance: IndexingRate, ThreadpoolWriteRejected (rejections indicate overload)
- Search Performance: SearchRate, SearchLatency
- Shards: Shards.unassigned
Set up CloudWatch alarms on these metrics. For instance, an alarm on Shards.unassigned or ClusterStatus.red should immediately notify your team.
# AWS Console -> CloudWatch -> Alarms -> Create alarm
# Metric: OpenSearch Service -> Per-Domain Metrics -> [Your Domain Name] -> Shards.unassigned
# Threshold type: Static
# Whenever Shards.unassigned is Greater than 0 for 1 consecutive period of 5 minutes
# Actions: Notify SNS topic.
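The same red/yellow and unassigned-shard signals are also available directly from the cluster's `_cluster/health` API, which can be handy for ad-hoc checks or for self-managed clusters. The sketch below only interprets a response body; fetching it, authentication, and the severity mapping are left out or assumed, `classify_cluster_health` is a hypothetical helper, and the sample response is abbreviated and made up.

```ruby
require 'json'

# Maps a _cluster/health response body to a coarse alert severity.
def classify_cluster_health(body)
  health = JSON.parse(body)
  status = health.fetch('status')
  unassigned = health.fetch('unassigned_shards', 0)

  severity =
    if status == 'red'
      :critical # At least one primary shard is unavailable
    elsif status == 'yellow' || unassigned.positive?
      :warning  # Replicas are missing, but all primaries are allocated
    else
      :ok
    end

  { severity: severity, status: status, unassigned_shards: unassigned }
end

# Abbreviated, illustrative response body.
sample_body = '{"cluster_name":"example","status":"yellow","unassigned_shards":3}'
puts classify_cluster_health(sample_body)[:severity]
```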
OpenSearch Service Slow Logs
To diagnose performance issues, enable and monitor slow logs for indexing and search operations. OpenSearch Service publishes these logs to CloudWatch Logs.
Enabling Slow Logs (AWS Console):
# AWS Console -> OpenSearch Service -> Domains -> [Your Domain Name] -> Actions -> Edit
# Under "Advanced options" -> "Slow log publishing"
# Enable "Index slow logs" and "Search slow logs"
# Specify log destination (e.g., CloudWatch Logs Log Group)
# Set thresholds (e.g., 1000ms for index, 5000ms for search) in the index settings through the OpenSearch API; enabling publishing alone does not set them
Once enabled, you can create CloudWatch Metric Filters from these logs to track the frequency of slow operations and set alarms.
# AWS Console -> CloudWatch -> Log groups -> [Your Log Group] -> Metric filters -> Create metric filter
# Filter pattern: ERROR (for general errors) or specific patterns for slow logs.
# Example for slow search logs (adjust pattern based on log format):
# Pattern: "took_millis" (slow logs are plain text, so match on terms rather than JSON fields)
# Metric Name: SlowSearchOperations
# Metric Namespace: MyOpenSearchMetrics
# Default Value: 0
# Actions: Create alarm on this metric (e.g., if SlowSearchOperations > 10 in 5 minutes).
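When tuning thresholds it can help to look at the actual durations in exported slow log lines, not only at how many there are. The sketch below extracts the `took_millis[...]` field with a regular expression; the sample lines are illustrative assumptions about the log layout, and `slow_operations` is a hypothetical helper, so adjust the pattern to the lines your domain really emits.

```ruby
TOOK_MILLIS = /took_millis\[(\d+)\]/

# Returns the durations (in ms) of all log lines at or above `threshold_ms`.
def slow_operations(lines, threshold_ms:)
  lines.filter_map do |line|
    match = TOOK_MILLIS.match(line)
    next unless match

    millis = match[1].to_i
    millis if millis >= threshold_ms
  end
end

# Illustrative lines; the exact layout of real slow logs may differ.
sample_lines = [
  '[WARN ][index.search.slowlog.query] [node-1] [products][0] took[6.2s], took_millis[6200], total_shards[5]',
  '[WARN ][index.search.slowlog.query] [node-1] [products][1] took[5.1s], took_millis[5100], total_shards[5]',
  '[INFO ][cluster.service] unrelated log line'
]

durations = slow_operations(sample_lines, threshold_ms: 6000)
puts "#{durations.length} operation(s) at or above 6000 ms"
```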
Node-Level Monitoring (if self-managed)
If you are not using AWS OpenSearch Service and manage your own Elasticsearch cluster on EC2, you’ll need to install and configure monitoring agents like Prometheus with the `node_exporter` and `elasticsearch_exporter`, or use tools like `Filebeat` and `Metricbeat` to send data to a central Elasticsearch/Kibana instance.
Example `metricbeat.yml` configuration for Elasticsearch metrics:
metricbeat.modules:
  - module: elasticsearch
    period: 10s
    hosts: ["http://localhost:9200"] # Or your Elasticsearch host
    xpack.enabled: true # If using X-Pack monitoring
    # Optional: If using security features
    # username: "elastic"
    # password: "changeme"
  - module: system # Host-level metrics (Metricbeat has no "node" module)
    period: 10s
    metricsets:
      - cpu
      - memory
      - diskio
      - network
output.elasticsearch:
  hosts: ["http://localhost:9200"] # Or your Elasticsearch host
  # username: "elastic"
  # password: "changeme"
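Where possible, point Metricbeat's output at a separate, healthy Elasticsearch cluster rather than the cluster being monitored, so that an outage does not take the monitoring data down with it. Likewise, ship Metricbeat's own logs to that separate cluster or to CloudWatch Logs so you can monitor the monitoring system itself.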
Alerting and Notification Strategy
A robust alerting strategy is paramount. Use AWS Simple Notification Service (SNS) to fan out alerts to various endpoints:
- Email: For immediate, human-readable notifications.
- Slack/PagerDuty: Integrate via SNS subscriptions or Lambda functions for on-call engineer alerts.
- AWS Lambda: Trigger automated remediation actions (e.g., restarting a service, scaling up an instance) based on specific alarms.
Define clear alert thresholds and escalation policies. Avoid alert fatigue by tuning alarms to be actionable and relevant. Regularly review and refine your monitoring and alerting setup as your application and infrastructure evolve.
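As one possible shape for the Lambda integration, the sketch below turns a CloudWatch alarm delivered through SNS into a short chat message. It assumes the standard SNS event envelope (`Records[].Sns.Message`) and the alarm fields `AlarmName`, `NewStateValue`, and `NewStateReason`; `alarm_notification` is a hypothetical helper, and actually posting the payload to a Slack or PagerDuty endpoint is deliberately left out.

```ruby
require 'json'

# Builds a chat-webhook payload from a CloudWatch alarm delivered through SNS.
def alarm_notification(event)
  message = event.fetch('Records').first.fetch('Sns').fetch('Message')
  alarm = JSON.parse(message)

  state = alarm.fetch('NewStateValue')
  name = alarm.fetch('AlarmName')
  reason = alarm.fetch('NewStateReason')

  { text: "[#{state}] #{name}: #{reason}" }
end

# Made-up event for illustration.
sample_alarm = {
  'AlarmName' => 'ruby-app-cpu-high',
  'NewStateValue' => 'ALARM',
  'NewStateReason' => 'Threshold Crossed'
}
sample_event = { 'Records' => [{ 'Sns' => { 'Message' => JSON.generate(sample_alarm) } }] }

puts alarm_notification(sample_event)[:text]
```

Keeping the message to the alarm name, state, and reason makes it readable on a phone; links to dashboards or runbooks could be appended once the basics work.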