Server Monitoring Best Practices: Keeping Your Ruby App and Elasticsearch Clusters Alive on Google Cloud

Proactive Health Checks for Ruby Applications on Google Cloud Compute Engine

Maintaining the health of Ruby applications deployed on Google Cloud Compute Engine (GCE) requires a multi-layered monitoring strategy. Beyond basic CPU and memory utilization, we need to ensure the application server itself is responsive and that critical background jobs are executing as expected. This involves instrumenting the application and leveraging external health check mechanisms.

Application-Level Health Endpoint

A fundamental practice is to expose a dedicated health check endpoint within your Ruby application. This endpoint should perform checks against critical dependencies like the database and any external services. For Rails applications, a simple controller action suffices:

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def show
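    # Check database connection. `execute` raises on failure instead of
    # returning a falsy value, so rescue the error rather than test the result.
    begin
      ActiveRecord::Base.connection.execute('SELECT 1')
    rescue StandardError
      render json: { status: 'error', message: 'Database connection failed' }, status: 503
      return
    end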

    # Add checks for other critical services (e.g., Redis, external APIs)
    # Example for Redis:
    # unless $redis.ping == 'PONG'
    #   render json: { status: 'error', message: 'Redis connection failed' }, status: 503
    #   return
    # end

    render json: { status: 'ok', message: 'Application is healthy' }, status: 200
  rescue StandardError => e
    render json: { status: 'error', message: "An unexpected error occurred: #{e.message}" }, status: 500
  end
end

Ensure this endpoint is routed correctly in your Rails application:

# config/routes.rb
Rails.application.routes.draw do
  get 'health', to: 'health#show'
  # ... other routes
end

Google Cloud Load Balancer Health Checks

Google Cloud Load Balancers (both HTTP(S) and TCP) can be configured to periodically probe this health endpoint. This is crucial for automatically removing unhealthy instances from the load balancing pool.

When setting up an HTTP(S) Load Balancer backend service, configure the health check as follows:

# Example gcloud CLI command for creating an HTTP health check
gcloud compute health-checks create http ruby-app-health-check \
    --request-path=/health \
    --port=8080 \
    --check-interval=5s \
    --timeout=5s \
    --unhealthy-threshold=2 \
    --healthy-threshold=2 \
    --global

Note: Replace 8080 with the port your Ruby application actually listens on (e.g., 3000 for Puma under a default Rails setup, 8080 for Unicorn's default listener, or the proxy's port if the app sits behind a reverse proxy like Nginx). The --global flag is for global HTTP(S) load balancers; use --region=[REGION] for regional ones.
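To reason about how the two threshold flags interact, it can help to model them. The following is a minimal sketch of the consecutive-probe logic, assuming the state flips only after the configured number of consecutive contradicting probes; it is a simplified illustration, not Google's implementation, and the function name track_health is hypothetical.

```python
from typing import Iterable, List


def track_health(probe_results: Iterable[bool],
                 healthy_threshold: int = 2,
                 unhealthy_threshold: int = 2,
                 initially_healthy: bool = True) -> List[bool]:
    """Return the instance's health state after each probe result.

    The state changes only after N consecutive probes that disagree
    with the current state, where N is the matching threshold.
    """
    healthy = initially_healthy
    streak = 0  # consecutive probes contradicting the current state
    states = []
    for ok in probe_results:
        if ok == healthy:
            streak = 0  # a confirming probe resets the count
        else:
            streak += 1
            needed = healthy_threshold if ok else unhealthy_threshold
            if streak >= needed:
                healthy = ok
                streak = 0
        states.append(healthy)
    return states


# One failed probe is tolerated; two in a row mark the instance unhealthy.
print(track_health([True, False, True]))
print(track_health([True, False, False, True, True]))
```

With a 5-second interval and thresholds of 2, this model suggests an instance is pulled from rotation roughly 10 seconds after it starts failing, which is the trade-off to tune against probe noise.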

Monitoring Background Job Queues (Sidekiq Example)

For asynchronous tasks managed by Sidekiq, monitoring queue depth and worker availability is paramount. Sidekiq exposes a web UI, but for programmatic monitoring, we can query its Redis-backed statistics.

A simple script can check the number of jobs in critical queues and the number of active workers:

# monitor_sidekiq.rb
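require 'redis'       # Provides Redis::CannotConnectError for the rescue below
require 'sidekiq/api' # Loads the Sidekiq monitoring API (Sidekiq::Stats, Sidekiq::Queue)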

# Configure Sidekiq client to connect to Redis
# Ensure this matches your Sidekiq configuration
redis_url = ENV['REDIS_URL'] || 'redis://localhost:6379/0'
Sidekiq.configure_client do |config|
  config.redis = { url: redis_url }
end

# Get Redis client instance
redis_client = Redis.new(url: redis_url)

# Define critical queues and thresholds
CRITICAL_QUEUES = {
  'default' => 1000, # Max jobs in queue before alerting
  'high'    => 500
}.freeze

MAX_INACTIVE_WORKERS = 0 # Alert if the total configured worker concurrency is at or below this value

begin
  # Get Sidekiq stats and the live Sidekiq processes registered in Redis
  stats = Sidekiq::Stats.new
  processes = Sidekiq::ProcessSet.new.to_a

  alert_messages = []

  # Check queue depths
  CRITICAL_QUEUES.each do |queue_name, threshold|
    queue = Sidekiq::Queue.new(queue_name)
    if queue.size > threshold
      alert_messages << "CRITICAL: Sidekiq queue '#{queue_name}' has #{queue.size} jobs (threshold: #{threshold})."
    end
  end

  # Check worker capacity: sum the configured thread count across live processes
  active_workers = processes.sum { |p| p['concurrency'].to_i }
  if processes.empty?
    alert_messages << "CRITICAL: No Sidekiq processes found."
  elsif active_workers <= MAX_INACTIVE_WORKERS
    alert_messages << "WARNING: No Sidekiq worker threads available. #{processes.size} processes found but total concurrency is #{active_workers}."
  end

  if alert_messages.empty?
    puts "Sidekiq health check passed. Queues: #{stats.queues.keys.join(', ')}, Jobs: #{stats.enqueued}, Processes: #{processes.size}"
    exit 0
  else
    puts "Sidekiq health check failed:"
    alert_messages.each { |msg| puts "- #{msg}" }
    exit 1
  end

rescue Redis::CannotConnectError => e
  puts "CRITICAL: Could not connect to Redis at #{redis_url}. Error: #{e.message}"
  exit 1
rescue StandardError => e
  puts "CRITICAL: An unexpected error occurred during Sidekiq monitoring: #{e.message}"
  e.backtrace.each { |line| puts line }
  exit 1
end

This script can be run periodically by a cron job or a monitoring agent (for example, the Prometheus Node Exporter's textfile collector fed by a small wrapper), with its exit code used to trigger alerts in systems like Cloud Monitoring or PagerDuty.
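One way to turn the exit code into something an agent can scrape is a small wrapper that writes a gauge in the Prometheus text exposition format. This is a sketch under assumptions: the metric name and the helper functions below are hypothetical, and the output directory is whatever you pass to Node Exporter's --collector.textfile.directory flag.

```python
import os
import subprocess
import tempfile
from typing import List


def run_check(command: List[str], timeout_seconds: int = 60) -> int:
    """Run a check command and return its exit code.

    A command that cannot be started or that times out counts as a failure (1).
    """
    try:
        completed = subprocess.run(command, capture_output=True, timeout=timeout_seconds)
        return completed.returncode
    except (OSError, subprocess.TimeoutExpired):
        return 1


def render_metric(name: str, value: int, help_text: str) -> str:
    """Format a single gauge in the Prometheus text exposition format."""
    return (f"# HELP {name} {help_text}\n"
            f"# TYPE {name} gauge\n"
            f"{name} {value}\n")


def write_textfile(path: str, content: str) -> None:
    """Write via a temp file and rename so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, path)
```

A cron-driven script could then call run_check(["ruby", "monitor_sidekiq.rb"]), pass 1 or 0 to render_metric depending on whether the exit code was non-zero, and write the result to a .prom file in the collector directory.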

Elasticsearch Cluster Health and Performance Monitoring on Google Cloud

Elasticsearch clusters, especially those managed on GCE or Google Kubernetes Engine (GKE), require diligent monitoring to ensure data integrity, query performance, and availability. Key metrics include cluster health status, node resource utilization, indexing rates, and search latency.

Leveraging Elasticsearch APIs for Health Checks

Elasticsearch provides a comprehensive set of APIs to query its internal state. The _cluster/health API is the primary tool for understanding the overall health of the cluster.

# Example using curl to check cluster health
curl -X GET "http://your-elasticsearch-host:9200/_cluster/health?pretty"

The output will contain a status field, which can be one of green, yellow, or red.

green: All primary and replica shards are allocated. The cluster is healthy.
yellow: All primary shards are allocated, but some replica shards are not. Data is safe, but redundancy is reduced.
red: Some primary shards are not allocated. Data might be unavailable. This is a critical state.

We can automate checks for this status. A simple script can poll this endpoint and alert on non-green statuses:

import requests
import json
import sys

ES_HOST = "http://your-elasticsearch-host:9200"
ALERT_ON_STATUS = ["red", "yellow"]

try:
    response = requests.get(f"{ES_HOST}/_cluster/health", timeout=10)  # Fail fast if the node hangs
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)
    health_data = response.json()

    cluster_status = health_data.get("status")
    unassigned_shards = health_data.get("unassigned_shards", 0)
    initializing_shards = health_data.get("initializing_shards", 0)
    relocating_shards = health_data.get("relocating_shards", 0)

    if cluster_status in ALERT_ON_STATUS:
        print(f"CRITICAL: Elasticsearch cluster status is '{cluster_status}'.")
        print(f"Unassigned Shards: {unassigned_shards}, Initializing Shards: {initializing_shards}, Relocating Shards: {relocating_shards}")
        sys.exit(1)
    else:
        print(f"Elasticsearch cluster status is '{cluster_status}'. Unassigned: {unassigned_shards}, Initializing: {initializing_shards}, Relocating: {relocating_shards}")
        sys.exit(0)

except json.JSONDecodeError:
    # Checked first: in recent versions of requests, the JSON error raised by
    # response.json() is also a RequestException.
    print("CRITICAL: Failed to decode JSON response from Elasticsearch.")
    sys.exit(1)
except requests.exceptions.RequestException as e:
    print(f"CRITICAL: Request to Elasticsearch at {ES_HOST} failed. Error: {e}")
    sys.exit(1)
except Exception as e:
    print(f"CRITICAL: An unexpected error occurred: {e}")
    sys.exit(1)
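Because yellow and red differ in urgency, a checker may want to report them with different severities rather than a single failure code. The sketch below assumes the Nagios-style exit code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function name classify_health is hypothetical, and the input is the parsed JSON body of _cluster/health.

```python
from typing import Tuple

# Map each cluster status to an (exit code, label) pair.
SEVERITY = {
    "green": (0, "OK"),
    "yellow": (1, "WARNING"),
    "red": (2, "CRITICAL"),
}


def classify_health(health: dict) -> Tuple[int, str]:
    """Return (exit_code, message) for a parsed _cluster/health response."""
    status = health.get("status")
    # Anything other than green/yellow/red (including a missing field) is UNKNOWN.
    code, label = SEVERITY.get(status, (3, "UNKNOWN"))
    unassigned = health.get("unassigned_shards", 0)
    message = f"{label}: cluster status is '{status}', {unassigned} unassigned shard(s)"
    return code, message


code, message = classify_health({"status": "yellow", "unassigned_shards": 3})
print(code, message)
```

Routing code 1 to a chat channel and code 2 to a pager keeps a temporarily unallocated replica from waking anyone up, while a lost primary still does.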

Monitoring Node-Level Metrics

Individual node health and resource usage are critical. The _nodes/stats API provides detailed metrics.

# Example: Get JVM heap usage for all nodes (heap fields live under jvm.mem)
curl -X GET "http://your-elasticsearch-host:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem&pretty"

Key metrics to monitor programmatically include:

JVM Heap Usage: High heap usage can lead to Garbage Collection pauses and instability. Set alerts when usage exceeds 80-90%.
CPU Usage: High CPU can indicate heavy indexing or search load.
Disk I/O and Space: Elasticsearch is I/O intensive. Monitor disk space to prevent outages and I/O wait times.
Network Traffic: High network traffic can indicate inter-node communication issues or large search result transfers.

For comprehensive monitoring, consider using the Elasticsearch Exporter for Prometheus. This exporter scrapes metrics from Elasticsearch APIs and exposes them in a Prometheus-compatible format, allowing integration with Grafana for visualization and Alertmanager for sophisticated alerting.

Indexing and Search Performance

Slow indexing or search queries directly impact application performance. Monitor these metrics:

Indexing Rate: Documents per second. A sudden drop might indicate issues.
Search Rate: Queries per second.
Indexing Latency: Time taken to index a document.
Search Latency: Time taken to execute a search query.

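The _stats API exposes the underlying data as cumulative counters (operation counts and total time in milliseconds), reported per index and in aggregate. Rates and average latencies are derived by comparing two samples taken some interval apart.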

# Example: Get indexing and search stats for a specific index
curl -X GET "http://your-elasticsearch-host:9200/your-index-name/_stats/indexing,search?pretty"

Monitoring these metrics allows for proactive identification of performance bottlenecks, such as inefficient queries, poorly designed mappings, or insufficient cluster resources.
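Since the stats counters are cumulative, the four metrics listed above have to be computed from the difference between two samples. The following sketch assumes each input is the "total" object from a _stats response (for example response["_all"]["total"]), sampled elapsed_seconds apart; the function name derive_rates is hypothetical.

```python
from typing import Dict, Optional


def derive_rates(before: dict, after: dict, elapsed_seconds: float) -> Dict[str, Optional[float]]:
    """Compute rates and average latencies from two cumulative _stats samples."""

    def delta(section: str, field: str) -> int:
        return after[section][field] - before[section][field]

    docs = delta("indexing", "index_total")
    index_ms = delta("indexing", "index_time_in_millis")
    queries = delta("search", "query_total")
    query_ms = delta("search", "query_time_in_millis")

    return {
        "indexing_rate_per_sec": docs / elapsed_seconds,
        "search_rate_per_sec": queries / elapsed_seconds,
        # Average latency is undefined when no operations ran in the window.
        "indexing_latency_ms": index_ms / docs if docs else None,
        "search_latency_ms": query_ms / queries if queries else None,
    }
```

Counters reset when a node restarts, so a production version would also need to treat a negative delta as a reset rather than as a real measurement.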

Integrating with Google Cloud Monitoring (Stackdriver)

Google Cloud Monitoring (formerly Stackdriver) is the native solution for collecting, visualizing, and alerting on metrics within Google Cloud. It’s essential for consolidating your monitoring efforts.

Ingesting Custom Metrics

For the custom scripts monitoring Ruby apps (e.g., Sidekiq queue depth) and Elasticsearch health, you can use the Cloud Monitoring API or the Ops Agent to ingest these metrics.

Using the Ops Agent: The Ops Agent is a unified agent for collecting logs and metrics from Compute Engine instances. It supports custom metrics collection via a Prometheus receiver. (On GKE, Google Cloud Managed Service for Prometheus plays the equivalent role.)

# Example Ops Agent configuration snippet (/etc/google-cloud-ops-agent/config.yaml)
# The Prometheus receiver takes a standard Prometheus scrape configuration.
metrics:
  receivers:
    prometheus:
      type: prometheus
      config:
        scrape_configs:
          - job_name: 'local_exporter'
            scrape_interval: 60s
            static_configs:
              - targets: ['localhost:9100'] # Example for a Prometheus exporter
  service:
    pipelines:
      prometheus_pipeline:
        receivers:
          - prometheus

Run your Prometheus exporters (such as the Elasticsearch Exporter) on your GCE instances and list their scrape endpoints as targets in the Ops Agent's Prometheus receiver.

Using the Cloud Monitoring API (Programmatic): For scripts that don’t expose Prometheus endpoints, you can directly push metrics using the client libraries.

# Example using Python client library to push a custom metric
from google.cloud import monitoring_v3
from google.protobuf.timestamp_pb2 import Timestamp
import time

client = monitoring_v3.MetricServiceClient()
project_id = "your-gcp-project-id"
project_name = f"projects/{project_id}"

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/ruby_app/sidekiq_queue_size" # Your custom metric type
series.resource.type = "gce_instance"
series.resource.labels["instance_id"] = "your-instance-id" # Or use GKE container/pod
series.resource.labels["project_id"] = project_id
series.resource.labels["zone"] = "your-instance-zone"

now = time.time()
seconds = int(now)
nanos = int((now - seconds) * 10**9)
timestamp = Timestamp(seconds=seconds, nanos=nanos)

point = monitoring_v3.Point(
    value=monitoring_v3.TypedValue(int64_value=150),  # Point values are TypedValue messages
    interval=monitoring_v3.TimeInterval(end_time=timestamp),
)
series.points = [point]

try:
    client.create_time_series(name=project_name, time_series=[series])
    print("Successfully wrote time series.")
except Exception as e:
    print(f"Error writing time series: {e}")

Setting Up Alerting Policies

Once metrics are flowing into Cloud Monitoring, configure alerting policies based on thresholds or anomalies.

Example Alerting Policy Configuration (Conceptual):

Metric: custom.googleapis.com/ruby_app/sidekiq_queue_size
Condition: Threshold
Trigger: When the metric value is > 1000 for 5 minutes.
Notification Channel: PagerDuty, Slack, Email.
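To make the "for 5 minutes" part of the trigger concrete, here is a minimal sketch of a threshold-with-duration condition evaluated over timestamped samples. It is an illustrative model with a hypothetical function name, not how Cloud Monitoring evaluates policies internally; it assumes the samples are sorted by time.

```python
from typing import Iterable, Tuple


def breaches_for_duration(samples: Iterable[Tuple[float, float]],
                          threshold: float,
                          duration_seconds: float) -> bool:
    """Return True if the value stays above the threshold for the full duration.

    samples: (timestamp_seconds, value) pairs in ascending time order.
    """
    breach_start = None  # timestamp at which the current breach began
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= duration_seconds:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the window
    return False


# Queue depth sampled once a minute; the policy is "> 1000 for 5 minutes".
sustained = [(0, 1200), (60, 1300), (120, 1100), (180, 1500), (240, 1250), (300, 1400)]
print(breaches_for_duration(sustained, threshold=1000, duration_seconds=300))
```

The duration window is what separates a short burst of enqueued jobs from a backlog that workers are failing to drain.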

Similarly, for Elasticsearch metrics ingested via Prometheus or directly:

Metric: elasticsearch_cluster_health_status (assuming the prometheus-community Elasticsearch Exporter, which reports the cluster status as a gauge with a color label; check the metric names your exporter version emits)
Condition: Threshold
Trigger: When elasticsearch_cluster_health_status with color="red" is > 0 for 1 minute.
Notification Channel: PagerDuty.

Leveraging Google Cloud Monitoring provides a centralized view and robust alerting capabilities, ensuring you are notified of issues before they impact your users.