Server Monitoring Best Practices: Keeping Your Ruby App and Elasticsearch Clusters Alive on AWS

Proactive Health Checks for Ruby Applications on AWS EC2

Maintaining the health of Ruby applications deployed on AWS EC2 instances requires a multi-layered approach to monitoring. Beyond basic CPU and memory utilization, we need to inspect application-specific metrics and ensure critical processes are running. This involves leveraging both AWS native tools and external monitoring solutions.

Application-Level Health Endpoints

A fundamental practice is to expose a dedicated health check endpoint within your Ruby application. This endpoint should perform checks against its dependencies (database, external services, cache) and return a clear status. For Rails applications, a simple controller action can suffice:

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def show
    # Example: Check database connection
    begin
      ActiveRecord::Base.connection.execute('SELECT 1')
      db_status = :ok
    rescue StandardError => e # any failure here means the database is unusable
      db_status = :error
      Rails.logger.error("Database health check failed: #{e.message}")
    end

    # Example: Check external service (e.g., Redis)
    begin
      $redis.ping # Assuming $redis is a globally accessible Redis client
      redis_status = :ok
    rescue Redis::BaseConnectionError => e # covers refused connections and timeouts
      redis_status = :error
      Rails.logger.error("Redis health check failed: #{e.message}")
    end

    if db_status == :ok && redis_status == :ok
      render json: { status: 'ok', database: db_status, redis: redis_status }, status: :ok
    else
      render json: { status: 'degraded', database: db_status, redis: redis_status }, status: :service_unavailable
    end
  end
end

Ensure this controller is routed correctly. For Rails 5+, add to config/routes.rb:

# config/routes.rb
Rails.application.routes.draw do
  get 'health', to: 'health#show'
  # ... other routes
end

This endpoint can then be polled by external monitoring services or by Elastic Load Balancing (ELB) health checks.

AWS ELB Health Checks Configuration

For applications behind an ELB (Application Load Balancer or Classic Load Balancer), configuring health checks is crucial for automatically removing unhealthy instances from the load balancing pool. Target your application’s health endpoint.

Application Load Balancer (ALB) Example:

# AWS Console -> EC2 -> Target Groups -> [Your Target Group] -> Health checks -> Edit
# Target Group Settings:
Protocol: HTTP
Port: 80
Path: /health
Healthy threshold: 3
Unhealthy threshold: 2
Timeout: 5 seconds
Interval: 30 seconds
Success codes: 200

Classic Load Balancer (CLB) Example:

# AWS Console -> EC2 -> Load Balancers -> [Your CLB] -> Health checks -> Edit
# Health Check Settings:
Ping Protocol: HTTP
Ping Port: 80
Ping Path: /health
Response Timeout: 5
Interval: 30
Unhealthy Threshold: 2
Healthy Threshold: 3
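The interval and threshold settings together determine how quickly the load balancer reacts. As a rough sketch (it ignores the response timeout and assumes every check in the window has the same outcome), the detection time is approximately the interval multiplied by the relevant threshold. The `HealthCheckSettings` struct is a hypothetical name used only for this illustration.

```ruby
# Hypothetical struct for reasoning about health check timing.
HealthCheckSettings = Struct.new(
  :interval, :healthy_threshold, :unhealthy_threshold,
  keyword_init: true
) do
  # Approximate seconds of consecutive failures before a target is removed.
  def seconds_to_mark_unhealthy
    interval * unhealthy_threshold
  end

  # Approximate seconds of consecutive successes before a target is restored.
  def seconds_to_mark_healthy
    interval * healthy_threshold
  end
end

settings = HealthCheckSettings.new(
  interval: 30, healthy_threshold: 3, unhealthy_threshold: 2
)
puts settings.seconds_to_mark_unhealthy # 60
puts settings.seconds_to_mark_healthy   # 90
```

With the values shown above, a failing instance keeps receiving traffic for roughly a minute, which is worth weighing against the risk of removing instances on a transient failure.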

Process Monitoring with `monit`

While ELB handles instance-level health, we need to ensure the Ruby application process itself (e.g., Puma, Unicorn) is running. `monit` is a robust, open-source process and service monitoring tool that can restart failed processes and alert administrators.

Installation (Ubuntu/Debian):

sudo apt-get update
sudo apt-get install monit

Configuration for Puma:

Create a configuration file for your application’s process. Assuming your Puma PID file is located at /path/to/your/app/tmp/pids/puma.pid and your application runs from /path/to/your/app:

# /etc/monit/conf.d/my_ruby_app
check process puma with pidfile /path/to/your/app/tmp/pids/puma.pid
  start program = "/bin/su - deploy -c 'cd /path/to/your/app && bundle exec puma -C /path/to/your/app/config/puma.rb'"
  # monit does not run commands through a shell, so wrap the command substitution in sh -c
  stop program  = "/bin/sh -c 'kill -s TERM $(cat /path/to/your/app/tmp/pids/puma.pid)'"
  if failed port 3000 protocol http then restart
  if 5 restarts within 5 cycles then timeout
  group ruby_app

Enabling and Testing `monit`:

sudo monit -t
sudo monit reload
sudo monit status

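This configuration checks that the Puma process named in the PID file is running, tells monit how to start and stop it, and restarts it if it stops answering HTTP requests on its listening port (3000 in this example). After 5 restarts within 5 cycles, monit gives up and stops monitoring the service, so a crash loop surfaces as a timed-out service instead of restarting indefinitely. The `group ruby_app` directive is useful for managing multiple related services together.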
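The PID file check that monit performs can be approximated in a few lines of Ruby, which is handy in deploy scripts or for debugging a stale PID file. This is a sketch of the general technique (sending signal 0 to test for a process), not a description of how monit is implemented; `process_alive?` is a hypothetical helper name.

```ruby
# Returns true when the pidfile exists and names a live process.
def process_alive?(pidfile)
  pid = Integer(File.read(pidfile).strip)
  return false unless pid.positive?

  Process.kill(0, pid) # signal 0 only tests for existence; nothing is delivered
  true
rescue Errno::ENOENT, ArgumentError, Errno::ESRCH
  false # missing pidfile, malformed contents, or no such process
rescue Errno::EPERM
  true  # the process exists but belongs to another user
end

# Path taken from the monit example above.
puts process_alive?('/path/to/your/app/tmp/pids/puma.pid')
```

A PID file that points at a dead process is a common reason for a start command to refuse to run, so checking it explicitly can save time when diagnosing restart failures.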

Integrating with AWS CloudWatch

AWS CloudWatch is essential for collecting metrics, logging, and setting up alarms. We’ll use it to monitor system-level metrics and application-specific logs.

System Metrics and Alarms

CloudWatch automatically collects basic EC2 metrics like CPUUtilization, NetworkIn, and NetworkOut. Configure alarms on these metrics to trigger notifications or auto-scaling actions.

# AWS Console -> CloudWatch -> Alarms -> Create alarm
# Metric: EC2 -> Per-Instance Metrics -> [Your Instance ID] -> CPUUtilization
# Threshold type: Static
# Whenever CPUUtilization is greater than or equal to 80 for 5 consecutive periods of 1 minute
# (1-minute periods require detailed monitoring; basic monitoring reports every 5 minutes)
# Actions: Send notification to SNS topic (e.g., for email/Slack alerts) or trigger Auto Scaling action.
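The "N consecutive periods" rule is what keeps a single CPU spike from paging anyone. The logic can be sketched as a pure function to make the behavior easy to reason about. This is a simplified model, not CloudWatch's actual evaluation (which also handles missing data and "M out of N" settings); `alarm_state` is a hypothetical name.

```ruby
# Returns :alarm when the most recent `periods` datapoints all meet or exceed
# the threshold, :ok when at least one does not, and :insufficient_data when
# there are fewer datapoints than periods.
def alarm_state(datapoints, threshold:, periods:)
  return :insufficient_data if datapoints.size < periods

  datapoints.last(periods).all? { |value| value >= threshold } ? :alarm : :ok
end

# One-minute CPUUtilization samples, oldest first.
puts alarm_state([40, 95, 85, 90, 82, 88, 91], threshold: 80, periods: 5) # alarm
puts alarm_state([85, 90, 30, 88, 91, 92, 93], threshold: 80, periods: 5) # ok
```

The second series contains a dip to 30 inside the most recent window of five, so the alarm does not fire until five uninterrupted breaching samples accumulate.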

Application Log Aggregation

Centralizing application logs is critical for debugging and analysis. The CloudWatch Agent can be configured to stream logs from your EC2 instances to CloudWatch Logs.

Install CloudWatch Agent: Follow AWS documentation for your specific OS. For Amazon Linux 2:

sudo yum install -y amazon-cloudwatch-agent

Configure the Agent: Create a configuration file (e.g., /opt/aws/amazon-cloudwatch/agent/config.json) to specify which logs to collect. Ensure the IAM role attached to your EC2 instance allows logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents, and logs:DescribeLogStreams; the AWS managed policy CloudWatchAgentServerPolicy grants these.

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/path/to/your/app/log/production.log",
            "log_group_name": "/aws/ec2/your-app/production",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/syslog",
            "log_group_name": "/aws/ec2/your-app/syslog",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}

Start the Agent:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch/agent/config.json -s

Now, your application logs will stream to CloudWatch Logs, allowing you to create metric filters and alarms based on log content (e.g., error messages).
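Metric filters are easier to write and less brittle when the application emits one JSON object per log line, because CloudWatch Logs can then match on fields with a pattern such as `{ $.level = "ERROR" }`. The formatter below is a minimal sketch using only the Ruby standard library; `JsonLineFormatter` and its field names are hypothetical choices for illustration.

```ruby
require 'json'
require 'logger'
require 'time'

# Hypothetical formatter that writes each log entry as a single JSON line.
class JsonLineFormatter < Logger::Formatter
  def call(severity, time, progname, msg)
    JSON.generate(
      level: severity,
      time: time.utc.iso8601,
      progname: progname,
      message: msg.is_a?(String) ? msg : msg.inspect
    ) + "\n"
  end
end

logger = Logger.new($stdout)
logger.formatter = JsonLineFormatter.new
logger.error('Database health check failed: connection refused')
```

In a Rails application, a formatter like this could be assigned through `config.log_formatter` in the environment configuration, assuming the default logger setup is in use.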

Monitoring Elasticsearch Clusters on AWS (OpenSearch Service)

For Elasticsearch (or Amazon OpenSearch Service), monitoring focuses on cluster health, node performance, and query performance. OpenSearch Service provides built-in metrics and logging capabilities.

Key OpenSearch Service Metrics

Amazon OpenSearch Service automatically publishes metrics to CloudWatch. Essential metrics to monitor include:

Cluster Status: ClusterStatus.red, ClusterStatus.yellow (red means at least one primary shard is unassigned; yellow means replica shards are unassigned)
Node Resources: CPUUtilization, JVMMemoryPressure
Disk Usage: FreeStorageSpace (ensure sufficient space)
Indexing Performance: IndexingRate, IndexingLatency, ThreadpoolWriteRejected (rejections indicate overload)
Search Performance: SearchRate, SearchLatency
Shards: Shards.unassigned

Set up CloudWatch alarms on these metrics. For instance, an alarm on Shards.unassigned or ClusterStatus.red should immediately notify your team.

# AWS Console -> CloudWatch -> Alarms -> Create alarm
# Metric: AWS/ES namespace -> Per-Domain, Per-Client Metrics -> [Your Domain Name] -> Shards.unassigned
# Threshold type: Static
# Whenever Shards.unassigned is greater than 0 for 1 consecutive period of 5 minutes
# Actions: Notify SNS topic.
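The same signals are available directly from the cluster through the `_cluster/health` API, which can be useful for an ad hoc check or a deploy gate. The sketch below only interprets a response body and leaves fetching it to the caller; `cluster_severity` and the severity labels are hypothetical, and the sample JSON is an abbreviated, made-up response.

```ruby
require 'json'

# Maps a _cluster/health response body to a coarse severity.
def cluster_severity(health_json)
  health = JSON.parse(health_json)
  severity =
    case health['status']
    when 'green'  then :ok
    when 'yellow' then :warning  # replicas unassigned, data still available
    when 'red'    then :critical # at least one primary shard unavailable
    else :unknown
    end

  { severity: severity, unassigned_shards: health['unassigned_shards'].to_i }
end

# Abbreviated sample response, invented for illustration.
sample = '{"cluster_name":"demo","status":"yellow","number_of_nodes":3,"unassigned_shards":2}'
p cluster_severity(sample)
```

Treating yellow as a warning and red as critical mirrors the alarm priorities described above; the exact mapping is a judgment call for each team.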

OpenSearch Service Slow Logs

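To diagnose performance issues, enable and monitor slow logs for indexing and search operations. OpenSearch Service publishes these logs to CloudWatch Logs. Publishing is enabled on the domain, while the thresholds that decide what counts as slow are index settings applied through the OpenSearch API.

Enabling Slow Logs (AWS Console):

# AWS Console -> OpenSearch Service -> Domains -> [Your Domain Name] -> Logs tab
# Enable "Search slow logs" and "Index slow logs"
# Choose or create the destination CloudWatch Logs log group and accept the resource policy
# Then set thresholds per index through the index settings API (e.g., 1000ms for index, 5000ms for search):
#   index.indexing.slowlog.threshold.index.warn: 1s
#   index.search.slowlog.threshold.query.warn: 5s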

Once enabled, you can create CloudWatch Metric Filters from these logs to track the frequency of slow operations and set alarms.

# AWS Console -> CloudWatch -> Log groups -> [Your Log Group] -> Metric filters -> Create metric filter
# Filter pattern: ERROR (for general errors) or a specific pattern for slow logs.
# Slow logs are typically plain-text lines rather than JSON, so match on a quoted term (adjust to your log format):
# Pattern: "index.search.slowlog.query"
# Metric Name: SlowSearchOperations
# Metric Namespace: MyOpenSearchMetrics
# Metric Value: 1
# Default Value: 0
# Then create an alarm on this metric (e.g., if SlowSearchOperations > 10 in 5 minutes).
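When investigating exported slow logs offline, it can help to pull the duration out of each line. The sketch below assumes the common plain-text slow log layout, in which a bracketed level and logger name are followed by a `took_millis[...]` field; the sample line is invented for illustration and the pattern may need adjusting for your engine version. `parse_slowlog` is a hypothetical helper.

```ruby
# Assumed layout: [timestamp][LEVEL ][index.<kind>.slowlog.<phase>] ... took_millis[N] ...
SLOWLOG_PATTERN = /
  \[(?<level>[A-Z]+)\s*\]
  \[index\.(?<kind>search|indexing)\.slowlog\.[a-z]+\]
  .*?took_millis\[(?<millis>\d+)\]
/x

# Returns a hash describing the slow operation, or nil if the line does not match.
def parse_slowlog(line)
  match = SLOWLOG_PATTERN.match(line)
  return nil unless match

  {
    level: match[:level],
    kind: match[:kind].to_sym,
    took_millis: match[:millis].to_i
  }
end

# Invented sample line in the assumed format.
sample = '[2024-01-15T10:00:00,123][WARN ][index.search.slowlog.query] ' \
         '[node-1] [products][0] took[5.2s], took_millis[5200], total_hits[42]'
p parse_slowlog(sample)
```

Aggregating `took_millis` by index or by hour is often enough to tell whether slow operations are isolated or part of a sustained regression.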

Node-Level Monitoring (if self-managed)

If you manage your own Elasticsearch cluster on EC2 instead of using Amazon OpenSearch Service, you need to install and configure monitoring agents yourself. Common options are Prometheus with `node_exporter` and `elasticsearch_exporter`, or `Filebeat` and `Metricbeat` shipping data to a central Elasticsearch/Kibana instance.

Example `metricbeat.yml` configuration for Elasticsearch metrics:

metricbeat.modules:
- module: elasticsearch
  period: 10s
  hosts: ["http://localhost:9200"] # Or your Elasticsearch host
  xpack.enabled: true # If using X-Pack monitoring
  # Optional: If using security features
  # username: "elastic"
  # password: "changeme"

- module: system # host-level metrics; Metricbeat has no "node" module
  period: 10s
  metricsets:
    - cpu
    - memory
    - diskio
    - network

output.elasticsearch:
  hosts: ["http://localhost:9200"] # Ideally a dedicated monitoring cluster, not the one being monitored
  # username: "elastic"
  # password: "changeme"

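Send Metricbeat's monitoring data to a separate, healthy Elasticsearch cluster rather than the cluster being monitored, and ship Metricbeat's own logs somewhere independent (such as CloudWatch Logs), so that the monitoring system stays observable when the monitored cluster is down.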

Alerting and Notification Strategy

A robust alerting strategy is paramount. Use AWS Simple Notification Service (SNS) to fan out alerts to various endpoints:

Email: For immediate, human-readable notifications.
Slack/PagerDuty: Integrate via SNS subscriptions or Lambda functions for on-call engineer alerts.
AWS Lambda: Trigger automated remediation actions (e.g., restarting a service, scaling up an instance) based on specific alarms.

Define clear alert thresholds and escalation policies. Avoid alert fatigue by tuning alarms to be actionable and relevant. Regularly review and refine your monitoring and alerting setup as your application and infrastructure evolve.
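One simple way to reduce alert fatigue in a custom notification path (for example, inside a Lambda function or a small relay service) is to suppress repeats of the same alert within a cooldown window. The class below is a minimal in-memory sketch; `AlertThrottle`, the alert key format, and the injected clock are hypothetical, and a real deployment would need shared state if more than one process sends alerts.

```ruby
# Hypothetical throttle that allows each alert key through at most once
# per cooldown window.
class AlertThrottle
  def initialize(cooldown_seconds:, clock: -> { Time.now })
    @cooldown = cooldown_seconds
    @clock = clock      # injectable so the behavior can be tested without sleeping
    @last_sent = {}
  end

  # Returns true if the alert should be delivered now, false if it is a
  # repeat inside the cooldown window.
  def deliver?(alert_key)
    now = @clock.call
    last = @last_sent[alert_key]
    return false if last && (now - last) < @cooldown

    @last_sent[alert_key] = now
    true
  end
end

throttle = AlertThrottle.new(cooldown_seconds: 300)
puts throttle.deliver?('opensearch/cluster-red') # true
puts throttle.deliver?('opensearch/cluster-red') # false
```

Keying on the alarm name rather than the message text keeps distinct problems from suppressing one another while still collapsing repeated notifications for the same condition.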