Server Monitoring Best Practices: Keeping Your Ruby App and Redis Clusters Alive on AWS

Establishing a Robust Monitoring Baseline for Ruby on AWS

Effective server monitoring for a Ruby application on AWS hinges on a multi-layered approach. We need to go beyond basic CPU and memory utilization. For a production Ruby environment, this means deep dives into application-level metrics, garbage collection (GC) performance, and request latency. AWS CloudWatch is our foundational tool, but it needs to be augmented with application-specific instrumentation.

Application Performance Monitoring (APM) with New Relic

While CloudWatch provides infrastructure metrics, it lacks granular insight into Ruby application behavior. An APM tool such as New Relic (or a similar tool like Datadog or AppSignal) fills that gap. The agent needs to be installed and configured as part of your application’s deployment process. For a typical Rails application, this involves adding the gem and supplying its license key and application name.

Ensure your `Gemfile` includes the New Relic agent:

gem 'newrelic_rpm'

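Under Rails the agent normally starts on its own and reads `NEW_RELIC_LICENSE_KEY` and `NEW_RELIC_APP_NAME` from the environment (or from `config/newrelic.yml`). If you have disabled auto-start or want explicit control over when the agent starts, create an initializer (e.g., `config/initializers/newrelic.rb`):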

# config/initializers/newrelic.rb
if ENV['NEW_RELIC_LICENSE_KEY']
  NewRelic::Agent.manual_start(
    license_key: ENV['NEW_RELIC_LICENSE_KEY'],
    app_name: ENV['NEW_RELIC_APP_NAME'] || 'MyRubyApp-Production'
  )
end

Crucially, set the `NEW_RELIC_LICENSE_KEY` and `NEW_RELIC_APP_NAME` environment variables in your EC2 instance’s environment or within your Elastic Beanstalk configuration. This ensures the agent connects to your New Relic account and reports under the correct application name. Key metrics to monitor via New Relic include:

Request throughput (Requests per minute)
Average response time (request latency)
Error rate (percentage of requests returning errors)
Transaction traces (identifying slow database queries, external calls, and Ruby code execution)
GC activity (time spent in GC, number of collections)
Memory usage (heap size, object counts)
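The GC and memory figures in the list above ultimately come from counters that Ruby itself exposes. As a minimal sketch (not how the New Relic agent is implemented), the following uses core Ruby's `GC.stat` to show what those counters look like around a block of work. The helper names `gc_snapshot` and `measure_gc` are made up for this illustration.

```ruby
# Snapshot a few GC counters. GC.stat is part of core Ruby; the :time key
# (milliseconds spent in GC) only exists on Ruby 3.1+, so it may be nil.
def gc_snapshot
  stats = GC.stat
  {
    collections: stats[:count],
    major_collections: stats[:major_gc_count],
    minor_collections: stats[:minor_gc_count],
    live_slots: stats[:heap_live_slots],
    allocated_objects: stats[:total_allocated_objects],
    gc_time_ms: stats[:time]
  }
end

# Runs the block and returns its result plus the change in each counter.
def measure_gc
  before = gc_snapshot
  result = yield
  after = gc_snapshot
  delta = after.to_h do |key, value|
    [key, value && before[key] ? value - before[key] : nil]
  end
  [result, delta]
end

_, delta = measure_gc { Array.new(200_000) { |i| "object-#{i}" } }
puts delta
```

A large jump in `allocated_objects` or `major_collections` for a single request is usually the first thing to look at when GC time climbs in the APM dashboard.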

Leveraging CloudWatch for Infrastructure and Application Logs

CloudWatch is essential for collecting system-level metrics and application logs. For EC2 instances running your Ruby app, ensure the CloudWatch agent is installed and configured to stream:

System metrics (CPU Utilization, Memory Utilization, Disk I/O, Network Traffic)
Application logs (e.g., `production.log`, `error.log`)
Web server logs (e.g., Nginx or Apache access and error logs)

The CloudWatch agent configuration (typically `amazon-cloudwatch-agent.json`) should look something like this:

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyRubyApp/EC2",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "totalcpu": true
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ]
      },
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "resources": [
          "/"
        ]
      },
      "net": {
        "measurement": [
          "bytes_sent",
          "bytes_recv",
          "packets_sent",
          "packets_recv"
        ]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "MyRubyApp/Nginx/Access",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "MyRubyApp/Nginx/Error",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/app/current/log/production.log",
            "log_group_name": "MyRubyApp/Rails/Production",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/app/current/log/sidekiq.log",
            "log_group_name": "MyRubyApp/Sidekiq/Logs",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}

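After deploying this configuration, you can set up CloudWatch Alarms on key metrics. Metrics collected by the agent land in the custom `MyRubyApp/EC2` namespace under the names listed in the configuration (for example `cpu_usage_user`), while `CPUUtilization` is the built-in EC2 metric published in the `AWS/EC2` namespace. For instance, an alarm for high CPU utilization on the built-in metric:

aws cloudwatch put-metric-alarm \
    --alarm-name "HighCPUUtilization-MyRubyApp" \
    --alarm-description "Alarm when CPU exceeds 80% for 5 minutes" \
    --metric-name CPUUtilization \
    --namespace "AWS/EC2" \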
    --statistic Average \
    --period 300 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --evaluation-periods 1 \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:MyAlertsTopic

Similarly, configure alarms for low disk space, high error rates (via metric filters on the Nginx/Apache log groups), and high request latency if your application logs it. For log analysis, CloudWatch Logs Insights lets you query log groups directly to find patterns in application errors.
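To make the "error rate from Nginx logs" idea concrete, here is a small Ruby sketch that computes the share of 5xx responses from access log lines. It assumes Nginx's default "combined" log format; the sample lines and the `error_rate` helper are invented for illustration, and in production the equivalent counting would typically be done by a CloudWatch metric filter rather than by application code.

```ruby
# In the "combined" format the status code is the first 3-digit field
# that follows the closing quote of the request line.
STATUS_PATTERN = /"\s(\d{3})\s/

# Returns the percentage of lines whose status is 5xx (0.0 if no lines parse).
def error_rate(lines)
  statuses = lines.filter_map { |line| line[STATUS_PATTERN, 1]&.to_i }
  return 0.0 if statuses.empty?

  server_errors = statuses.count { |status| status >= 500 }
  (100.0 * server_errors / statuses.size).round(2)
end

sample = [
  '10.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /users HTTP/1.1" 200 512 "-" "curl/8.0"',
  '10.0.0.2 - - [10/Oct/2024:13:55:37 +0000] "POST /orders HTTP/1.1" 502 0 "-" "curl/8.0"',
  '10.0.0.3 - - [10/Oct/2024:13:55:38 +0000] "GET /missing HTTP/1.1" 404 153 "-" "curl/8.0"',
  '10.0.0.1 - - [10/Oct/2024:13:55:39 +0000] "GET /health HTTP/1.1" 200 2 "-" "curl/8.0"'
]

puts error_rate(sample) # prints 25.0
```

Only 5xx responses are counted as errors here; whether 4xx responses should count is a policy choice that depends on your application.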

Monitoring Redis Clusters: Performance and Availability

Redis, often used as a cache or message broker for Ruby applications, requires its own dedicated monitoring strategy. AWS ElastiCache for Redis provides managed infrastructure, but we still need to monitor its performance and availability from the application’s perspective and leverage ElastiCache’s built-in metrics.

ElastiCache Metrics in CloudWatch

ElastiCache automatically publishes a rich set of metrics to CloudWatch. These are crucial for understanding the health of your Redis cluster. Key metrics to monitor include:

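CPUUtilization: High CPU can indicate heavy load or inefficient queries. Because Redis executes commands largely on a single thread, also watch EngineCPUUtilization, which can saturate while host-level CPU still looks moderate on multi-core nodes.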
FreeableMemory: A declining trend suggests memory pressure.
CacheHits and CacheMisses: A low hit ratio (CacheHits / (CacheHits + CacheMisses)) indicates the cache isn’t effective, leading to increased load on your primary data store.
CurrConnections: Monitor for unexpected spikes or sustained high connection counts.
Evictions: High eviction rates mean data is being removed from the cache due to memory pressure.
NetworkBytesIn and NetworkBytesOut: Monitor traffic patterns and potential bottlenecks.
ReplicationLag (for read replicas): Crucial for ensuring data consistency.
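As a worked example of the hit-ratio formula given in the list above, the following Ruby sketch computes the ratio from raw hit and miss counts. The counts are made-up sample values and `cache_hit_ratio` is a hypothetical helper, not part of any AWS SDK.

```ruby
# hits / (hits + misses); nil when there was no traffic in the period,
# because the ratio is undefined rather than zero in that case.
def cache_hit_ratio(hits, misses)
  total = hits + misses
  return nil if total.zero?

  hits.to_f / total
end

ratio = cache_hit_ratio(8_200, 1_800)
puts format('hit ratio: %.2f', ratio) # prints "hit ratio: 0.82"
puts 'below 70% target' if ratio < 0.7 # prints nothing for these counts
```

Returning `nil` for an idle period keeps a quiet cache from being reported as a 0% hit ratio.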

Set up CloudWatch Alarms on these metrics. For example, an alarm for a low cache hit ratio:

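aws cloudwatch put-metric-alarm \
    --alarm-name "LowCacheHitRatio-MyRedisCluster" \
    --alarm-description "Alarm when Cache Hit Ratio drops below 70% for 10 minutes" \
    --comparison-operator LessThanThreshold \
    --threshold 0.7 \
    --evaluation-periods 1 \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:MyAlertsTopic \
    --metrics '[
      {"Id": "hits", "ReturnData": false, "MetricStat": {"Period": 600, "Stat": "Sum", "Metric": {"Namespace": "AWS/ElastiCache", "MetricName": "CacheHits", "Dimensions": [{"Name": "CacheClusterId", "Value": "my-redis-cluster"}]}}},
      {"Id": "misses", "ReturnData": false, "MetricStat": {"Period": 600, "Stat": "Sum", "Metric": {"Namespace": "AWS/ElastiCache", "MetricName": "CacheMisses", "Dimensions": [{"Name": "CacheClusterId", "Value": "my-redis-cluster"}]}}},
      {"Id": "ratio", "Expression": "hits / (hits + misses)", "Label": "CacheHitRatio", "ReturnData": true}
    ]'

Note: CloudWatch has no hit-ratio statistic of its own (extended statistics are percentiles such as p99), so the alarm derives the ratio with metric math from `CacheHits` and `CacheMisses`. If your engine version publishes a `CacheHitRate` metric, alarming on it directly may be simpler. Adjust the `threshold` and `Period` based on your application’s baseline behavior. A sustained high `Evictions` count is also a strong indicator that your Redis instance is undersized or your cache invalidation strategy needs review.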

Application-Level Redis Monitoring

Beyond ElastiCache metrics, your Ruby application needs to be aware of Redis health. Implement connection pooling and health checks within your application code. The `redis` gem (redis-rb) provides the client and its error classes; pooling is usually added with the separate `connection_pool` gem.

Example of a basic health check within a Rails initializer:

# config/initializers/redis_health_check.rb
if defined?(Redis)
  # Assuming you use Redis for Sidekiq or caching
  # Short timeouts so an unreachable Redis cannot stall application boot
  redis_client = Redis.new(url: ENV['REDIS_URL'], connect_timeout: 1, timeout: 1)

  # Perform a simple PING to check connectivity and responsiveness
  begin
    redis_client.ping
    Rails.logger.info "Redis connection successful."
  rescue Redis::CannotConnectError => e
    Rails.logger.error "Redis connection failed: #{e.message}"
    # Consider triggering an alert or a more drastic action here
  rescue Redis::TimeoutError => e
    Rails.logger.error "Redis ping timed out: #{e.message}"
    # This indicates high latency or a blocked Redis instance
  ensure
    redis_client.close # Release the one-off connection used for this check
  end

  # For more advanced scenarios, you might want to monitor queue depths for Sidekiq
  # or implement periodic cache read/write tests.
end
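A boot-time check only runs once, so it is often useful to have a probe that can be called periodically and that reports latency as well as reachability. The sketch below is one way to do that under stated assumptions: `probe_redis` and the 50 ms threshold are hypothetical, and `FakeRedis` is a stand-in defined here so the example runs without a Redis server. In an application you would pass a real client, such as the `Redis.new(url: ENV['REDIS_URL'])` object from the initializer.

```ruby
# Pings the given client (any object responding to #ping) and classifies
# the result as :ok, :slow, :unexpected_reply, or :down.
def probe_redis(client, slow_threshold_ms: 50)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  reply = client.ping
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  latency_ms = (elapsed * 1000).round(2)

  status =
    if reply != 'PONG'
      :unexpected_reply
    elsif latency_ms > slow_threshold_ms
      :slow
    else
      :ok
    end
  { status: status, latency_ms: latency_ms }
rescue StandardError => e
  # Connection and timeout errors raised by the client end up here.
  { status: :down, error: "#{e.class}: #{e.message}" }
end

# Stand-in client for demonstration only; it sleeps to simulate latency.
FakeRedis = Struct.new(:delay_seconds) do
  def ping
    sleep(delay_seconds)
    'PONG'
  end
end

puts probe_redis(FakeRedis.new(0))[:status]                           # prints ok
puts probe_redis(FakeRedis.new(0.02), slow_threshold_ms: 5)[:status]  # prints slow
```

The returned hash can be logged or exposed through a health endpoint; alerting on `:slow` as well as `:down` catches a blocked Redis instance before connections start failing outright.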

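For Sidekiq, monitor its own metrics. Sidekiq exposes job and queue statistics through its Ruby API (`Sidekiq::Stats`, `Sidekiq::Queue`), which an exporter can publish to Prometheus or another monitoring system. Exact metric names depend on the exporter you use; typical ones include: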

sidekiq_workers_size
sidekiq_queue_size (for each queue)
sidekiq_processed (count of processed jobs)
sidekiq_failed (count of failed jobs)
sidekiq_enqueued (count of enqueued jobs)

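If you’re using Prometheus, `redis_exporter` covers Redis itself. For Sidekiq, one option is to serve a `/metrics` endpoint alongside the Sidekiq Web UI using the Rack middleware from the `prometheus-client` gem:

# In config.ru or an initializer for the process that serves Sidekiq::Web
require 'sidekiq/web'
require 'sidekiq/api'
require 'prometheus/middleware/exporter' # provided by the prometheus-client gem

# Protect the Sidekiq Web UI (and the metrics endpoint served with it)
Sidekiq::Web.use Rack::Auth::Basic do |user, password|
  Rack::Utils.secure_compare(user, ENV['SIDEKIQ_MONITOR_USER']) &
    Rack::Utils.secure_compare(password, ENV['SIDEKIQ_MONITOR_PASSWORD'].to_s)
end if ENV['SIDEKIQ_MONITOR_USER']

# Serve the default Prometheus registry at <mount path>/metrics.
# The registry only contains metrics you register yourself, for example
# gauges refreshed from Sidekiq::Stats.
Sidekiq::Web.use Rack::Deflater
Sidekiq::Web.use Prometheus::Middleware::Exporter

# Mount Sidekiq Web UI if needed (in config/routes.rb)
# mount Sidekiq::Web => '/sidekiq'

Ensure your Prometheus server is configured to scrape that `/metrics` endpoint (under the path where `Sidekiq::Web` is mounted, with basic-auth credentials if you enabled them). This allows for correlation between application performance and background job processing.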
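To show what Prometheus actually receives when it scrapes such an endpoint, the sketch below renders queue sizes in the Prometheus text exposition format using plain Ruby. The `prometheus_gauge` helper and the sample queue sizes are invented for this illustration; in a real process the numbers would come from something like `Sidekiq::Queue.all.map { |q| [q.name, q.size] }`, and a client library would normally do the rendering for you.

```ruby
# Renders one gauge metric with labelled samples as Prometheus text format.
# samples is an array of [labels_hash, numeric_value] pairs.
# Note: label values are not escaped here, so keep them to simple strings.
def prometheus_gauge(name, help, samples)
  lines = ["# HELP #{name} #{help}", "# TYPE #{name} gauge"]
  samples.each do |labels, value|
    label_text = labels.map { |key, val| %(#{key}="#{val}") }.join(',')
    lines << (label_text.empty? ? "#{name} #{value}" : "#{name}{#{label_text}} #{value}")
  end
  lines.join("\n")
end

# Made-up sample data standing in for live queue sizes.
queue_sizes = [
  [{ queue: 'default' }, 12],
  [{ queue: 'mailers' }, 3]
]

puts prometheus_gauge('sidekiq_queue_size', 'Jobs waiting per queue', queue_sizes)
# Output:
# # HELP sidekiq_queue_size Jobs waiting per queue
# # TYPE sidekiq_queue_size gauge
# sidekiq_queue_size{queue="default"} 12
# sidekiq_queue_size{queue="mailers"} 3
```

Using a `queue` label rather than one metric name per queue keeps dashboards and alert rules working when queues are added or renamed.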

Advanced Alerting and Incident Response

A comprehensive monitoring strategy is incomplete without a robust alerting and incident response plan. Alerts should be actionable and routed to the appropriate teams. AWS Simple Notification Service (SNS) is a common choice for centralizing alerts.

Alerting Strategy

Define clear thresholds for your CloudWatch Alarms. Avoid alert fatigue by setting sensible thresholds and using appropriate `EvaluationPeriods`. For critical services like Redis, consider composite alarms that combine several metric alarms. For example, an alarm that triggers only if `CPUUtilization` is high AND `Evictions` are occurring. A composite alarm references existing alarms by name, so `HighCPUUtilization-MyRedisCluster` and `HighEvictions-MyRedisCluster` must be created first:

aws cloudwatch put-composite-alarm \
    --alarm-name "RedisPerformanceDegradation-MyRedisCluster" \
    --alarm-description "Alarm for critical Redis performance issues" \
    --actions-enabled \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:MyAlertsTopic arn:aws:sns:us-east-1:123456789012:MyIncidentResponseTeamTopic \
    --alarm-rule 'ALARM("HighCPUUtilization-MyRedisCluster") AND ALARM("HighEvictions-MyRedisCluster")'
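The effect of evaluation periods and of combining alarms with AND can be sketched in a few lines of Ruby. This is a simplified model for intuition only, not CloudWatch's actual evaluation logic (it ignores missing-data handling, for instance). The `breaching?` helper and the datapoints are hypothetical.

```ruby
# True when, within the most recent `evaluation_periods` datapoints, at least
# `datapoints_to_alarm` values exceed the threshold.
def breaching?(datapoints, threshold:, evaluation_periods:, datapoints_to_alarm: evaluation_periods)
  window = datapoints.last(evaluation_periods)
  return false if window.size < evaluation_periods

  window.count { |value| value > threshold } >= datapoints_to_alarm
end

cpu_percent = [62, 91, 88, 93] # made-up samples, one per period
evictions   = [0, 0, 14, 27]

cpu_alarm = breaching?(cpu_percent, threshold: 80, evaluation_periods: 3)
eviction_alarm = breaching?(evictions, threshold: 0, evaluation_periods: 3, datapoints_to_alarm: 2)

puts cpu_alarm && eviction_alarm # prints true: both conditions hold

# A single spike does not fire when three consecutive breaches are required.
puts breaching?([40, 95, 42], threshold: 80, evaluation_periods: 3) # prints false
```

Requiring several breaching datapoints trades a few minutes of detection delay for far fewer pages caused by short spikes.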

Integrate your SNS topics with incident management tools like PagerDuty or Opsgenie. This ensures that critical alerts are escalated and acknowledged promptly. For less critical alerts, consider routing them to Slack channels.
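The routing decision described above can be illustrated with a small Ruby function that inspects the JSON message CloudWatch publishes to SNS when an alarm changes state. `AlarmName`, `NewStateValue`, and `NewStateReason` are fields of that message; the `PAGING_ALARMS` list, the `route_alarm` helper, and the `:pager` / `:chat` destinations are assumptions made for this sketch, and actually delivering the notification to PagerDuty, Opsgenie, or Slack is left out.

```ruby
require 'json'

# Hypothetical policy: alarms named here page the on-call engineer when they
# enter ALARM; everything else is posted to a chat channel.
PAGING_ALARMS = ['RedisPerformanceDegradation-MyRedisCluster'].freeze

def route_alarm(sns_message_json)
  alarm = JSON.parse(sns_message_json)
  name = alarm.fetch('AlarmName')
  state = alarm.fetch('NewStateValue')

  destination = state == 'ALARM' && PAGING_ALARMS.include?(name) ? :pager : :chat
  { destination: destination, text: "[#{state}] #{name}: #{alarm['NewStateReason']}" }
end

# Trimmed-down sample message with made-up values.
message = JSON.generate(
  'AlarmName' => 'RedisPerformanceDegradation-MyRedisCluster',
  'NewStateValue' => 'ALARM',
  'NewStateReason' => 'Both child alarms are in ALARM state'
)

puts route_alarm(message)[:destination] # prints pager
```

Keeping the paging list short and explicit is one way to enforce the rule that only actionable, critical alerts wake someone up.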

Log Analysis and Root Cause Identification

When an incident occurs, quick root cause analysis is paramount. CloudWatch Logs Insights is your best friend here. Practice writing queries to quickly diagnose common issues.

Example: Find all PUMA errors in your Rails logs (with the query time range set to the last hour):

fields @timestamp, @message
| filter @message like /PUMA ERROR/
| sort @timestamp desc
| limit 50

Example: Find slow Redis commands (if logged by your application or Redis itself):

fields @timestamp, @message
| filter @message like /slowlog/
| sort @timestamp desc
| limit 50

Regularly review your logs and APM data (New Relic, etc.) to proactively identify performance bottlenecks or recurring errors before they escalate into major incidents. Establish runbooks for common alert types, detailing diagnostic steps and potential remediation actions.