Server Monitoring Best Practices: Keeping Your Ruby App and MySQL Clusters Alive on AWS
Proactive Health Checks for Ruby Applications on EC2
Maintaining the health of Ruby applications deployed on AWS EC2 instances requires a multi-layered approach to monitoring. Beyond basic CPU and memory utilization, we need to inspect application-specific metrics and ensure critical processes are running. This involves setting up robust health checks that can automatically detect and, where possible, remediate issues.
A common pattern is to expose an HTTP endpoint on your application that signals its readiness and liveness. For Rails applications, this can be a simple controller action. For more complex scenarios, consider using a dedicated gem like rack-healthcheck or health_monitor.
Implementing a Basic Rails Health Check Endpoint
Create a controller to handle the health check request. This endpoint should ideally check database connectivity and the status of any essential background job processors.
# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def show
    # Basic check: ensure the database is accessible.
    # `execute` raises when the database is unreachable (it never returns
    # a falsy value), so failures are handled in the rescue clauses below.
    ActiveRecord::Base.connection.execute("SELECT 1")

    # Add more checks here: e.g., Redis connectivity, background job queue status
    # For example, checking Sidekiq:
    # unless Sidekiq.redis { |conn| conn.ping == "PONG" }
    #   render json: { status: "error", message: "Sidekiq Redis connection failed" }, status: 503
    #   return
    # end

    render json: { status: "ok", message: "Application is healthy" }, status: 200
  rescue ActiveRecord::ActiveRecordError
    render json: { status: "error", message: "Database connection failed" }, status: 503
  rescue StandardError => e
    render json: { status: "error", message: "An unexpected error occurred: #{e.message}" }, status: 500
  end
end
Next, define a route for this endpoint.
# config/routes.rb
Rails.application.routes.draw do
  get 'health', to: 'health#show'
  # ... other routes
end
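As the number of dependencies grows, it can help to pull the checks out of the controller into a small object that runs each one and aggregates the results. The following is a minimal sketch, not part of Rails or any gem: `HealthCheckRunner`, its `Result` struct, and the check names are hypothetical, and each check is assumed to be a callable that returns a truthy value when healthy.

```ruby
# Runs named health checks and aggregates them into a single result.
# Each check is a callable returning truthy when healthy; an exception
# raised by one check marks only that check as failed.
class HealthCheckRunner
  Result = Struct.new(:healthy, :checks) do
    # HTTP status the endpoint would return for this result.
    def http_status
      healthy ? 200 : 503
    end

    # Hash suitable for `render json:`.
    def body
      { status: healthy ? "ok" : "error", checks: checks }
    end
  end

  def initialize(checks)
    @checks = checks
  end

  def run
    outcomes = @checks.to_h do |name, check|
      begin
        [name, check.call ? "ok" : "failed"]
      rescue StandardError => e
        [name, "failed: #{e.class}"]
      end
    end
    Result.new(outcomes.values.all?("ok"), outcomes)
  end
end

# Possible wiring inside HealthController#show (assumes Sidekiq is in use):
#   result = HealthCheckRunner.new({
#     database: -> { ActiveRecord::Base.connection.execute("SELECT 1") },
#     sidekiq_redis: -> { Sidekiq.redis { |conn| conn.ping == "PONG" } }
#   }).run
#   render json: result.body, status: result.http_status
```

Reporting each check by name keeps the response useful to a human reading it, while the load balancer only needs the status code.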
Leveraging AWS Services for Health Monitoring
AWS provides several services that can actively poll this health endpoint and react to failures. Elastic Load Balancing (ELB) is the primary tool for this.
Configuring ELB Health Checks
When setting up your ELB (Application Load Balancer or Classic Load Balancer) for your EC2 instances running the Ruby app, configure its health checks to target your new /health endpoint. Key parameters to tune:
- Port: The port your Ruby application listens on (e.g., 80, 3000, 8080).
- Protocol: HTTP or HTTPS.
- Path: /health.
- Interval: How often to perform the check (e.g., 30 seconds).
- Timeout: How long to wait for a response (e.g., 5 seconds).
- Healthy threshold: Number of consecutive successful checks to consider an instance healthy (e.g., 2).
- Unhealthy threshold: Number of consecutive failed checks to consider an instance unhealthy (e.g., 3).
An ELB automatically marks an instance as unhealthy once it fails the configured number of consecutive checks, and stops sending traffic to it. If the instance belongs to an Auto Scaling group whose health check type is set to ELB, the group can also terminate and replace it.
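To get a feel for how the interval and the two thresholds interact, the sketch below simulates the consecutive-check logic described above. It is an illustration only, not AWS code: `TargetHealth` and `seconds_to_unhealthy` are hypothetical names, and the detection time is a rough estimate that ignores the timeout and the exact moment a failure starts.

```ruby
# Simulates how consecutive health check results flip an instance
# between healthy and unhealthy.
class TargetHealth
  attr_reader :state

  def initialize(healthy_threshold: 2, unhealthy_threshold: 3, state: :healthy)
    @healthy_threshold = healthy_threshold
    @unhealthy_threshold = unhealthy_threshold
    @state = state
    @streak = 0 # consecutive results that contradict the current state
  end

  # Records one check result (true = passed) and returns the resulting state.
  def record(passed)
    contradicts = (passed && @state == :unhealthy) || (!passed && @state == :healthy)
    @streak = contradicts ? @streak + 1 : 0
    needed = @state == :healthy ? @unhealthy_threshold : @healthy_threshold
    if @streak >= needed
      @state = @state == :healthy ? :unhealthy : :healthy
      @streak = 0
    end
    @state
  end
end

# Rough number of seconds of continuous failure before the instance
# is taken out of rotation.
def seconds_to_unhealthy(interval:, unhealthy_threshold:)
  interval * unhealthy_threshold
end
```

With the example values from the list (30-second interval, unhealthy threshold of 3), `seconds_to_unhealthy(interval: 30, unhealthy_threshold: 3)` gives 90, so a failing instance can keep receiving traffic for about a minute and a half. A single passing check in between resets the failure streak.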
System-Level Monitoring with CloudWatch Agent
While ELB handles application-level HTTP checks, system-level metrics are crucial. The CloudWatch Agent allows you to collect detailed system metrics (CPU, memory, disk, network) and application logs from your EC2 instances and send them to CloudWatch.
Installing and Configuring the CloudWatch Agent
First, install the agent on your EC2 instances. The installation process varies slightly by OS. For Amazon Linux 2:
sudo yum install -y amazon-cloudwatch-agent
Next, create a configuration file. This JSON file defines which metrics and logs to collect. Ensure the agent has an IAM role attached to the EC2 instance with permissions to write to CloudWatch Logs and Metrics.
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "YourApp/EC2",
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "totalcpu": true
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "resources": [
          "/",
          "/var/log"
        ],
        "ignore_file_system_types": [
          "sysfs",
          "devtmpfs",
          "tmpfs",
          "devfs",
          "iso9660",
          "overlay",
          "aufs",
          "squashfs"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ]
      },
      "swap": {
        "measurement": [
          "swap_used_percent"
        ]
      },
      "net": {
        "measurement": [
          "bytes_recv",
          "bytes_sent",
          "packets_recv",
          "packets_sent"
        ]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "YourApp/Nginx/Access",
            "log_stream_name": "{instance_id}/nginx-access"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "YourApp/Nginx/Error",
            "log_stream_name": "{instance_id}/nginx-error"
          },
          {
            "file_path": "/var/log/your_app.log",
            "log_group_name": "YourApp/Rails/App",
            "log_stream_name": "{instance_id}/rails-app"
          }
        ]
      }
    }
  }
}
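Before loading a hand-edited config into the agent, a quick local sanity check can catch JSON typos and half-finished entries. The sketch below is only that: `agent_config_problems` and `EXPECTED_LOG_KEYS` are hypothetical names, the key list reflects the convention used in the config above rather than what the agent strictly requires, and the agent's own validation remains the authority.

```ruby
require "json"

# Keys this article's config sets on every log entry (a local convention,
# not a statement about which keys the agent itself requires).
EXPECTED_LOG_KEYS = %w[file_path log_group_name log_stream_name].freeze

# Returns a list of human-readable problems found in an agent config document.
def agent_config_problems(json_text)
  config = JSON.parse(json_text)
  problems = []

  sections = config.dig("metrics", "metrics_collected") || {}
  problems << "no metrics sections configured" if sections.empty?
  sections.each do |name, settings|
    problems << "#{name}: empty measurement list" if Array(settings["measurement"]).empty?
  end

  entries = config.dig("logs", "logs_collected", "files", "collect_list") || []
  entries.each_with_index do |entry, index|
    missing = EXPECTED_LOG_KEYS - entry.keys
    problems << "collect_list[#{index}]: missing #{missing.join(', ')}" unless missing.empty?
  end
  problems
rescue JSON::ParserError => e
  ["invalid JSON: #{e.message}"]
end

# Example (path is a placeholder):
#   puts agent_config_problems(File.read("/path/to/your/config.json"))
```

An empty array means the file parsed and every section looks populated; it does not guarantee the agent will accept every measurement name.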
Start the agent and configure it to run on boot:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/path/to/your/config.json -s
sudo systemctl enable amazon-cloudwatch-agent
Setting Up CloudWatch Alarms
Once metrics are flowing into CloudWatch, create alarms to notify you of potential issues. These alarms can trigger actions like sending notifications via SNS or even invoking Lambda functions for automated remediation.
Example Alarms for Ruby Apps
- High CPU Utilization: Trigger if CPUUtilization in the AWS/EC2 namespace exceeds 80% for 15 minutes.
- Low Disk Space: Trigger if free space on / falls below 10% (the agent's disk_used_percent metric above 90%) for 30 minutes.
- High Memory Usage: Trigger if mem_used_percent exceeds 85% for 10 minutes.
- Application Errors (from logs): Create CloudWatch Logs metric filters for specific error patterns in your application logs (e.g., “Uncaught exception” or “5xx error”) and alarm on the resulting metrics.
Configure these alarms via the AWS Management Console or programmatically using AWS CLI or Infrastructure as Code tools like CloudFormation or Terraform.
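If you script alarm creation from Ruby, the "exceeds X for N minutes" wording has to be translated into a period and a number of evaluation periods. The helper below is a minimal sketch of that translation: `alarm_params` is a hypothetical function, the 300-second default period and the Average statistic are assumptions, and the dimensions must match whatever your agent actually publishes. The resulting hash uses parameter names accepted by `put_metric_alarm` in the aws-sdk-cloudwatch gem.

```ruby
# Builds a parameter hash in the shape accepted by
# Aws::CloudWatch::Client#put_metric_alarm.
def alarm_params(name:, namespace:, metric:, threshold:, minutes:,
                 comparison: "GreaterThanThreshold", period: 300,
                 dimensions: {}, topic_arn: nil)
  unless ((minutes * 60) % period).zero?
    raise ArgumentError, "minutes must be a multiple of the period"
  end

  {
    alarm_name: name,
    namespace: namespace,
    metric_name: metric,
    statistic: "Average",                       # assumption: average over each period
    period: period,                             # seconds per data point
    evaluation_periods: (minutes * 60) / period, # consecutive breaching periods
    threshold: threshold.to_f,
    comparison_operator: comparison,
    dimensions: dimensions.map { |key, value| { name: key.to_s, value: value.to_s } },
    alarm_actions: [topic_arn].compact          # e.g., an SNS topic ARN
  }
end

# Example: memory above 85% for 10 minutes (instance ID is a placeholder).
#   params = alarm_params(name: "high-memory", namespace: "YourApp/EC2",
#                         metric: "mem_used_percent", threshold: 85, minutes: 10,
#                         dimensions: { InstanceId: "i-0123456789abcdef0" })
#   Aws::CloudWatch::Client.new.put_metric_alarm(params) # requires aws-sdk-cloudwatch
```

With the defaults, "for 10 minutes" becomes two 300-second periods and "for 15 minutes" becomes three.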
Monitoring MySQL Clusters on AWS RDS or EC2
Database health is paramount. For MySQL, whether running on Amazon RDS or self-managed on EC2, monitoring key performance indicators (KPIs) and resource utilization is critical to prevent performance degradation and outages.
RDS MySQL Monitoring with Enhanced Monitoring
Amazon RDS provides built-in monitoring capabilities. Enabling “Enhanced Monitoring” provides more granular, near real-time OS-level metrics for your RDS instances. This is essential for diagnosing performance bottlenecks.
Key RDS Metrics to Monitor
- CPUUtilization: High CPU can indicate inefficient queries, insufficient instance size, or high traffic.
- DatabaseConnections: Monitor the number of active connections. A sudden spike or consistently high number might indicate connection leaks or insufficient connection pooling in your application.
- FreeableMemory: Low freeable memory can lead to increased disk I/O as the OS swaps memory.
- ReadIOPS / WriteIOPS: High I/O operations can signal performance issues with queries or data volume.
- ReadLatency / WriteLatency: Increasing latency directly impacts application performance.
- DiskQueueDepth: A consistently high queue depth indicates the storage system cannot keep up with the read/write requests.
- NetworkReceiveThroughput / NetworkTransmitThroughput: Monitor network traffic to ensure it aligns with expectations.
These metrics are available directly in CloudWatch under the AWS/RDS namespace. Set up CloudWatch Alarms on these metrics, similar to EC2 instance monitoring, to proactively address issues.
Self-Managed MySQL on EC2: System and MySQL-Specific Metrics
If you’re managing MySQL on EC2 instances, you’ll need to combine system-level monitoring (via CloudWatch Agent as described above) with MySQL-specific performance metrics.
Collecting MySQL Metrics with Prometheus and mysqld_exporter
A robust approach is to use Prometheus for time-series data collection and visualization, with mysqld_exporter to expose MySQL metrics in a Prometheus-compatible format.
1. Install mysqld_exporter:
# Download the latest release from https://github.com/prometheus/mysqld_exporter/releases
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.0/mysqld_exporter-0.15.0.linux-amd64.tar.gz
tar xvfz mysqld_exporter-0.15.0.linux-amd64.tar.gz
cd mysqld_exporter-0.15.0.linux-amd64
sudo mv mysqld_exporter /usr/local/bin/
2. Create a MySQL User for the Exporter:
-- Connect to MySQL as root
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'your_strong_password';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
3. Configure the Exporter: Create a .my.cnf file for the user running the exporter (e.g., `prometheus` user) or set the DSN directly.
[client]
user=exporter
password=your_strong_password
host=localhost
port=3306
4. Run mysqld_exporter:
# Run as the user that has access to .my.cnf
/usr/local/bin/mysqld_exporter --config.my-cnf=/home/prometheus/.my.cnf &
5. Configure Prometheus to Scrape: Add a scrape configuration to your prometheus.yml.
scrape_configs:
  - job_name: 'mysql'
    static_configs:
      - targets: ['your_ec2_instance_ip:9104'] # Default port for mysqld_exporter
Key MySQL Metrics from mysqld_exporter
- mysql_global_status_threads_connected: Number of connected threads.
- mysql_global_status_threads_running: Number of running threads.
- mysql_global_status_slow_queries: Count of slow queries.
- mysql_global_status_innodb_buffer_pool_wait_free: Number of times InnoDB had to wait for free pages in the buffer pool.
- mysql_global_status_innodb_row_lock_waits: Number of times a row lock wait occurred.
- mysql_global_status_questions: Total number of statements executed by the server.
- mysql_global_status_commands_total (with command labels such as select, insert, update, delete): Breakdown of DML operations.
- mysql_up: Indicates if the exporter could connect to MySQL.
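Outside of Prometheus, it is sometimes handy to read the exporter's output directly, for example from a deploy script that wants to confirm mysql_up before continuing. The sketch below parses the plain-text exposition format into a hash. It is a simplified illustration: `parse_prometheus_text` and `mysql_reachable?` are hypothetical helpers, and the parser assumes samples carry no trailing timestamp and skips values Ruby cannot parse as floats (such as NaN).

```ruby
# Parses Prometheus text exposition output into { "metric{labels}" => Float }.
# Comment lines (# HELP / # TYPE) and blank lines are ignored.
def parse_prometheus_text(text)
  text.each_line.with_object({}) do |line, samples|
    line = line.strip
    next if line.empty? || line.start_with?("#")

    # Split on the last space so label values containing spaces survive.
    name, value = line.rpartition(" ").values_at(0, 2)
    number = Float(value, exception: false)
    samples[name] = number unless name.empty? || number.nil?
  end
end

# True when the exporter reported a successful connection to MySQL.
def mysql_reachable?(samples)
  samples["mysql_up"] == 1.0
end

# Example fetch (host is the placeholder used in the scrape config):
#   require "net/http"
#   text = Net::HTTP.get(URI("http://your_ec2_instance_ip:9104/metrics"))
#   abort "MySQL is down" unless mysql_reachable?(parse_prometheus_text(text))
```

For anything beyond a one-off check, querying Prometheus itself is usually the better route, since it already stores the history needed to spot trends.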
MySQL Cluster Monitoring (e.g., Galera, InnoDB Cluster)
For multi-node MySQL clusters, monitor inter-node communication, replication status, and cluster health. Specific metrics depend on the clustering technology:
- Galera: Monitor wsrep_cluster_size (should match the number of nodes), wsrep_local_state_comment (should be ‘Synced’), wsrep_incoming_addresses, and wsrep_flow_control_paused.
- InnoDB Cluster: Monitor Group Replication status, such as performance_schema.replication_group_members and performance_schema.replication_group_member_stats.
These metrics can often be exposed via mysqld_exporter with appropriate configuration or require custom exporters.
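For Galera, the expectations listed above translate into a small rule set. The following sketch evaluates a hash of wsrep status values, such as the rows returned by SHOW GLOBAL STATUS LIKE 'wsrep_%', and reports what looks wrong. `galera_problems` is a hypothetical helper, and the 10% flow-control limit is an assumed starting point rather than a recommended value.

```ruby
# Evaluates Galera status variables (name => string value) for one node and
# returns a list of problems; an empty list means the node looks healthy.
def galera_problems(status, expected_nodes:, max_paused: 0.1)
  problems = []

  size = status["wsrep_cluster_size"].to_i
  problems << "cluster size #{size}, expected #{expected_nodes}" unless size == expected_nodes

  state = status["wsrep_local_state_comment"]
  problems << "node state is #{state.inspect}, expected \"Synced\"" unless state == "Synced"

  # wsrep_flow_control_paused is the fraction of time replication was paused.
  paused = status["wsrep_flow_control_paused"].to_f
  if paused > max_paused
    problems << "flow control paused #{(paused * 100).round(1)}% of the time"
  end

  problems
end

# Example with hand-written values for a three-node cluster:
#   galera_problems(
#     { "wsrep_cluster_size" => "2",
#       "wsrep_local_state_comment" => "Synced",
#       "wsrep_flow_control_paused" => "0.0" },
#     expected_nodes: 3
#   )
```

Running the same rules against every node matters, because a partitioned node can report a smaller cluster size while the remaining nodes still look healthy.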
Centralized Logging and Alerting Strategy
A robust monitoring strategy is incomplete without effective centralized logging and a well-defined alerting system. Aggregating logs from all your application and database instances into a single location simplifies troubleshooting and analysis.
AWS CloudWatch Logs for Centralization
As demonstrated with the CloudWatch Agent, you can stream application logs, web server logs (Nginx/Apache), and system logs directly to CloudWatch Logs. This gives you a searchable, queryable log repository with configurable retention for long-term storage.
Leveraging CloudWatch Logs Insights
CloudWatch Logs Insights is a powerful tool for querying your log data. You can use it to:
- Identify error patterns and count occurrences.
- Analyze request latency from web server logs.
- Trace specific user requests across different log streams.
- Test the error patterns you later turn into CloudWatch Logs metric filters, which generate custom CloudWatch Metrics.
For example, to find all occurrences of “Uncaught exception” in your Rails app logs:
fields @timestamp, @message
| filter @message like /Uncaught exception/
| sort @timestamp desc
| limit 20
Alerting with CloudWatch Alarms and SNS
Combine CloudWatch Alarms with Amazon Simple Notification Service (SNS) for a flexible alerting system. When an alarm state changes (e.g., from OK to ALARM), it can publish a message to an SNS topic.
This SNS topic can then have subscriptions for:
- Email: For immediate notification to on-call engineers.
- SMS: For critical alerts requiring urgent attention.
- SQS Queues: To trigger automated remediation workflows via Lambda functions or other consumers.
- HTTP/S Endpoints: To integrate with third-party incident management tools (e.g., PagerDuty, Opsgenie).
Ensure your alarms are tuned to minimize false positives while ensuring critical issues are not missed. Define clear escalation policies based on the severity and type of alert.
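A consumer behind an HTTP/S or SQS subscription has to unpack the SNS envelope before it can act on the alarm. The sketch below shows one way to do that in Ruby. `AlarmEvent` and `parse_alarm_notification` are hypothetical names, only a handful of the alarm fields are extracted, and signature verification and subscription confirmation are deliberately left out, so treat it as a starting point rather than a production handler.

```ruby
require "json"

# The subset of a CloudWatch alarm notification used for routing.
AlarmEvent = Struct.new(:name, :old_state, :new_state, :reason, keyword_init: true) do
  # True when the alarm has just entered the ALARM state.
  def firing?
    new_state == "ALARM"
  end
end

# Parses the JSON body delivered by SNS. Returns nil for anything that is
# not a notification (e.g., a SubscriptionConfirmation request).
def parse_alarm_notification(body)
  envelope = JSON.parse(body)
  return nil unless envelope["Type"] == "Notification"

  # The alarm details arrive as a JSON string nested in "Message".
  message = JSON.parse(envelope["Message"])
  AlarmEvent.new(
    name: message["AlarmName"],
    old_state: message["OldStateValue"],
    new_state: message["NewStateValue"],
    reason: message["NewStateReason"]
  )
end

# Example routing (page_on_call is a hypothetical method):
#   event = parse_alarm_notification(request_body)
#   page_on_call(event) if event&.firing?
```

Keeping the parsing separate from the routing decision makes it easier to apply different escalation rules per alarm name or severity.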