Server Monitoring Best Practices: Keeping Your Ruby App and MongoDB Clusters Alive on Google Cloud
Establishing a Robust Monitoring Foundation with Google Cloud Operations Suite
Effectively monitoring a production Ruby application, especially when coupled with a distributed MongoDB cluster on Google Cloud Platform (GCP), demands a multi-layered approach. We’ll leverage Google Cloud Operations Suite (formerly Stackdriver) as our primary observability platform, focusing on key metrics, logging, and alerting for both application and database tiers. This isn’t about basic uptime checks; it’s about deep visibility into performance bottlenecks, error rates, and resource utilization that directly impact user experience and operational stability.
Monitoring the Ruby Application: Key Metrics and Instrumentation
For our Ruby application, we’ll focus on metrics that indicate application health and performance. This includes request latency, error rates (HTTP 5xx), throughput, and garbage collection (GC) activity. While GCP’s built-in metrics provide a good starting point, custom instrumentation is crucial for application-specific insights.
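Of the metrics listed above, GC activity is the one the Ruby VM exposes directly. The following is a minimal sketch of sampling it with the standard `GC.stat` call; the helper names `gc_snapshot` and `gc_delta` and the choice of counters are illustrative assumptions, not part of any monitoring library.

```ruby
# Snapshot a few garbage-collection counters from the running Ruby VM.
# The method names and the selected keys are illustrative choices.
def gc_snapshot
  stats = GC.stat
  {
    gc_runs: stats[:count],                  # total GC runs since boot
    major_gc_runs: stats[:major_gc_count],   # full (major) collections
    minor_gc_runs: stats[:minor_gc_count],   # young-generation collections
    heap_live_slots: stats[:heap_live_slots] # object slots currently in use
  }
end

# Difference between two snapshots, e.g. taken before and after a request.
def gc_delta(before, after)
  after.to_h { |key, value| [key, value - before[key]] }
end

before = gc_snapshot
100_000.times { +'temporary string' } # allocate short-lived garbage
GC.start                              # force a collection for the demo
after = gc_snapshot

puts gc_delta(before, after)
```

Taking the delta around a request or a background job shows how much collection work that unit of work triggered, and the resulting values could be reported as custom metrics.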
Application Performance Monitoring (APM) with OpenTelemetry
OpenTelemetry is the de facto standard for distributed tracing and custom metrics. We’ll instrument our Ruby application to send traces and metrics to Google Cloud Trace and Google Cloud Monitoring. This involves adding the necessary gems and configuring them to export data.
First, add the required gems to your Gemfile:
gem 'opentelemetry-sdk'
gem 'opentelemetry-instrumentation-all' # Includes instrumentation for common frameworks like Rails, Sinatra
gem 'opentelemetry-exporter-otlp'
gem 'opentelemetry-instrumentation-net_http' # For HTTP client requests
gem 'opentelemetry-instrumentation-redis' # If using Redis
Next, configure the OpenTelemetry SDK in your application’s initialization. For a Rails application, this typically goes into an initializer file (e.g., config/initializers/opentelemetry.rb).
# config/initializers/opentelemetry.rb
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/instrumentation/all'

# Configure the SDK
OpenTelemetry::SDK.configure do |c|
  # Set the service name
  c.service_name = 'my-ruby-app'

  # Configure the OTLP exporter to send traces to an OpenTelemetry Collector.
  # The collector is typically deployed as a sidecar or as a separate service.
  # If running on GKE, the collector might be accessible via a service name.
  # For simplicity here, we assume a local collector reachable via localhost.
  # In a real-world GKE setup, you'd use the appropriate service discovery.
  exporter = OpenTelemetry::Exporter::OTLP::Exporter.new(
    endpoint: ENV.fetch('OTEL_EXPORTER_OTLP_TRACES_ENDPOINT', 'http://localhost:4318/v1/traces')
  )
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(exporter)
  )

  # Enable all available instrumentation
  c.use_all
end

# Optional: custom metrics. The Ruby metrics SDK and its OTLP exporter ship as
# separate gems (opentelemetry-metrics-sdk, opentelemetry-exporter-otlp-metrics)
# and are configured separately from the trace pipeline above; check their
# documentation for the current API before relying on them in production.
Ensure your OpenTelemetry Collector is configured to receive OTLP data and export it to Google Cloud. A minimal collector configuration might look like this (otel-collector-config.yaml):
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  googlecloud:
    project: "your-gcp-project-id"
    metric:
      # Optional: Specify endpoint if not using default
      # endpoint: "monitoring.googleapis.com"
    trace:
      # Optional: Specify endpoint if not using default
      # endpoint: "cloudtrace.googleapis.com"
processors:
  batch:
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [googlecloud]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [googlecloud]
Deploy this collector as a sidecar container in your GKE pods or as a separate Deployment. The OTEL_EXPORTER_OTLP_TRACES_ENDPOINT and OTEL_EXPORTER_OTLP_METRICS_ENDPOINT environment variables in your Ruby app should point to the collector’s service address.
Logging with Fluentd/Fluent Bit and Google Cloud Logging
Structured logging is essential for debugging. We’ll use Fluentd or Fluent Bit as a log forwarder to capture application logs and send them to Google Cloud Logging. This allows for powerful querying and analysis.
Ensure your Ruby application logs to stdout and stderr in a structured format (e.g., JSON). For Rails, you can configure this in config/environments/production.rb:
# config/environments/production.rb
config.log_level = :info
config.logger = ActiveSupport::Logger.new($stdout)
config.logger.formatter = ::Logger::JSONFormatter.new # From a JSON formatter gem, or a class you define yourself
If you’re not using a JSON formatter gem, you can create a simple one:
# lib/json_formatter.rb
require 'json'
require 'logger'
require 'time' # Provides Time#iso8601

class Logger::JSONFormatter < ::Logger::Formatter
  def call(severity, time, progname, msg)
    # Convert message to string if it's not already
    message = msg2str(msg)
    # Attempt to parse if it's already JSON, otherwise treat as plain string
    log_entry = if message.start_with?('{') && message.end_with?('}')
                  begin
                    JSON.parse(message)
                  rescue JSON::ParserError
                    { message: message }
                  end
                else
                  { message: message }
                end
    log_entry.merge!(
      timestamp: time.utc.iso8601(3),
      severity: severity,
      progname: progname
    )
    log_entry.to_json + "\n"
  end
end
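One detail worth handling when emitting a `severity` field: Ruby's Logger labels (WARN, FATAL, ANY) do not all match Cloud Logging's LogSeverity names (WARNING, CRITICAL, DEFAULT). The sketch below shows one way to translate them inside a formatter. The `CLOUD_SEVERITY` table and the proc-based formatter are illustrative assumptions, and whether your log forwarder promotes the `severity` field to the entry's severity depends on its configuration.

```ruby
require 'json'
require 'logger'
require 'stringio'
require 'time'

# Map Ruby Logger severity labels to Cloud Logging LogSeverity names.
# (Illustrative constant; not part of any library.)
CLOUD_SEVERITY = {
  'DEBUG' => 'DEBUG',
  'INFO'  => 'INFO',
  'WARN'  => 'WARNING',
  'ERROR' => 'ERROR',
  'FATAL' => 'CRITICAL',
  'ANY'   => 'DEFAULT' # Logger's label for UNKNOWN-level messages
}.freeze

# A formatter is any object responding to call(severity, time, progname, msg).
json_formatter = proc do |severity, time, progname, msg|
  JSON.generate(
    severity: CLOUD_SEVERITY.fetch(severity, 'DEFAULT'),
    timestamp: time.utc.iso8601(3),
    progname: progname,
    message: msg.to_s
  ) + "\n"
end

# Log into an in-memory buffer so the output can be inspected.
buffer = StringIO.new
logger = Logger.new(buffer)
logger.formatter = json_formatter
logger.warn('cache miss ratio above 40%')
logger.fatal('worker pool exhausted')

puts buffer.string
```

Each call produces one JSON object per line, which is the shape line-oriented log forwarders expect.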
On GKE, deploy Fluent Bit as a DaemonSet. The configuration (fluent-bit.conf) should tail your application logs and forward them to Cloud Logging:
[SERVICE]
    Flush 5
    Daemon Off
    Log_Level info
    Parsers_File parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
[INPUT]
    Name tail
    Path /var/log/containers/*.log
    # Use the cri parser instead on containerd-based nodes
    Parser docker
    Tag kube.*
    Mem_Buf_Limit 5MB
    Skip_Long_Lines On
    Refresh_Interval 10
[FILTER]
    Name kubernetes
    Match kube.*
    Kube_URL https://kubernetes.default.svc:443
    Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log On
    K8S-Logging.Parser On
    K8S-Logging.Exclude Off
[FILTER]
    Name nest
    Match kube.*
    Operation lift
    Nested_under kubernetes
    Add_prefix k8s.
[OUTPUT]
    # The stackdriver output plugin writes to Cloud Logging.
    # Cloud Logging is not Elasticsearch-compatible, so the es plugin will not work here.
    Name stackdriver
    Match kube.*
    resource k8s_container
    k8s_cluster_name your-gke-cluster
    k8s_cluster_location your-cluster-location
    Retry_Limit False
    # k8s_container also needs tags (or a local_resource_id field) that the plugin
    # can map to namespace, pod, and container; see the plugin documentation.
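Ensure your Fluent Bit deployment has the RBAC permissions it needs to read Kubernetes metadata, and that it runs under a Google service account (for example via Workload Identity) with the Logs Writer role (roles/logging.logWriter) so it can write to the Cloud Logging API.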
Monitoring MongoDB Clusters: Performance and Health
Monitoring MongoDB requires a focus on query performance, replication lag, disk I/O, memory usage, and connection counts. For a cluster, we also need to monitor the health of individual nodes and the overall cluster state.
Leveraging MongoDB’s Built-in Metrics
MongoDB exposes a wealth of metrics via the serverStatus command and the MongoDB Cloud Manager/Ops Manager (if used). We can collect these using a Prometheus exporter and then scrape them into Google Cloud Monitoring.
Deploy the mongodb_exporter as a sidecar or a separate Deployment. Configure it to connect to your MongoDB instances. A basic configuration might involve setting environment variables:
# Example environment variables for mongodb_exporter
# Variable names other than MONGODB_URI can differ between exporter versions; check your exporter's documentation.
export MONGODB_URI="mongodb://user:password@mongodb-0.mongodb-service.namespace.svc.cluster.local:27017,mongodb-1.mongodb-service.namespace.svc.cluster.local:27017/?replicaSet=rs0"
export MONGODB_EXPORTER_COLLECTORS="serverStatus,replSetGetStatus,dbStats,collStats,oplog"
# For specific database stats:
# export MONGODB_EXPORTER_DB_NAMES="admin,local,mydatabase"
Then, configure Prometheus to scrape the exporter’s metrics endpoint (usually /metrics on port 9216).
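Before wiring up a scraper, it helps to know what the exporter's /metrics endpoint returns: plain text with one sample per line, plus `# HELP` and `# TYPE` comment lines. The following is a minimal sketch of parsing that format in Ruby, useful for ad hoc checks. The `parse_prometheus_text` helper is hypothetical, the sample metric names are illustrative rather than a guaranteed part of the exporter's output, and the parser ignores optional timestamps.

```ruby
# Matches "metric_name{labels} value"; the label block is optional.
SAMPLE_LINE = /\A(?<series>[a-zA-Z_:][a-zA-Z0-9_:]*(?:\{.*\})?)\s+(?<value>\S+)/

# Parse Prometheus text exposition output into { "name{labels}" => Float }.
# Values that are not plain numbers (e.g. NaN) are stored as nil.
def parse_prometheus_text(body)
  body.each_line.with_object({}) do |raw_line, samples|
    line = raw_line.strip
    next if line.empty? || line.start_with?('#') # skip HELP/TYPE comments

    match = SAMPLE_LINE.match(line)
    next unless match

    samples[match[:series]] = Float(match[:value], exception: false)
  end
end

# Illustrative payload; real metric names depend on the exporter version.
payload = <<~METRICS
  # HELP mongodb_up Whether MongoDB is reachable.
  # TYPE mongodb_up gauge
  mongodb_up 1
  mongodb_connections{state="current"} 42
  mongodb_connections{state="available"} 838818
METRICS

samples = parse_prometheus_text(payload)
samples.each { |series, value| puts "#{series} = #{value}" }
```

In practice the payload would come from an HTTP GET against the exporter; fetching it by hand like this is a quick way to confirm which series exist before building dashboards or alerts on them.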
Collecting MongoDB Metrics with Google Cloud Monitoring Agent
The Ops Agent (the successor to the legacy Cloud Monitoring agent) can be configured to collect metrics from various sources, including Prometheus exporters. We’ll use the agent to scrape the mongodb_exporter and send the data to Cloud Monitoring.
Install the Ops Agent on the Compute Engine VMs that run your MongoDB nodes. The Ops Agent targets VMs rather than GKE nodes; for MongoDB running on GKE, Google Cloud Managed Service for Prometheus is the usual way to scrape the exporter. Then, configure the agent's config.yaml to include a Prometheus receiver. The agent exports to Cloud Monitoring by default.
# /etc/google-cloud-ops-agent/config.yaml
logging:
  receivers:
    mongodb_logs:
      type: files
      include_paths:
        - /var/log/mongodb/mongod.log # Adjust path as needed
      record_log_file_path: true
  service:
    pipelines:
      mongodb_pipeline:
        receivers: [mongodb_logs]
metrics:
  receivers:
    prometheus:
      type: prometheus
      config:
        # Scrape metrics from mongodb_exporter
        scrape_configs:
          - job_name: 'mongodb_exporter'
            static_configs:
              - targets: ['localhost:9216'] # Assuming mongodb_exporter runs on the same node
            # Add relabel_configs if needed to filter or add labels
            relabel_configs:
              - source_labels: [__address__]
                regex: '(.*):9216'
                target_label: instance
                replacement: '$1'
  service:
    pipelines:
      prometheus_pipeline:
        receivers: [prometheus]
# Logs and metrics are sent to Cloud Logging and Cloud Monitoring by default,
# so no exporter or destination needs to be defined.
Restart the Ops Agent after applying the configuration: sudo systemctl restart google-cloud-ops-agent.
Alerting Strategies for Proactive Incident Response
Effective alerting is about minimizing Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR). We’ll set up alerts in Google Cloud Monitoring based on critical thresholds for both application and database metrics.
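To make these two measures concrete, here is a small sketch that computes them from incident timestamps. The `Incident` struct, the sample data, and the choice to measure both intervals from the start of the incident are assumptions for illustration; teams define MTTR in more than one way.

```ruby
require 'time'

# Hypothetical incident record: when the fault began, when an alert fired,
# and when service was restored.
Incident = Struct.new(:started_at, :detected_at, :resolved_at, keyword_init: true)

# Mean of a per-incident duration (in seconds), reported in minutes.
def mean_minutes(incidents)
  return 0.0 if incidents.empty?

  total_seconds = incidents.sum { |incident| yield(incident) }
  (total_seconds / incidents.size / 60.0).round(1)
end

incidents = [
  Incident.new(started_at:  Time.utc(2024, 1, 10, 10, 0),
               detected_at: Time.utc(2024, 1, 10, 10, 4),
               resolved_at: Time.utc(2024, 1, 10, 10, 34)),
  Incident.new(started_at:  Time.utc(2024, 2, 2, 14, 0),
               detected_at: Time.utc(2024, 2, 2, 14, 10),
               resolved_at: Time.utc(2024, 2, 2, 15, 0))
]

mttd = mean_minutes(incidents) { |i| i.detected_at - i.started_at }
mttr = mean_minutes(incidents) { |i| i.resolved_at - i.started_at }

puts "MTTD: #{mttd} min, MTTR: #{mttr} min"
```

Tracking these figures per quarter gives a simple way to judge whether alert tuning is actually shortening detection time.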
Application Alerts
Key application alerts include:
- High HTTP 5xx Error Rate: Trigger when the rate of server errors exceeds a defined threshold (e.g., > 5% of total requests over 5 minutes).
- High Request Latency: Alert on P95 or P99 latency exceeding acceptable limits (e.g., P99 latency > 2 seconds for 10 minutes).
- Application Throughput Drop: Notify if the request rate falls below a baseline, indicating a potential outage or severe performance degradation.
- OpenTelemetry Collector Unhealthy: Monitor the health of the OpenTelemetry collector itself (e.g., high error rates in its logs, or lack of recent trace/metric data).
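The first two thresholds in the list above can be sketched in plain Ruby to show exactly what is being computed. This is an illustration of the arithmetic only, not how Cloud Monitoring evaluates a policy: the `Request` struct, the helper names, and the sample data are assumptions, and the evaluation window (5 or 10 minutes) is represented simply by whichever requests are passed in.

```ruby
# Hypothetical request sample: HTTP status and duration in seconds.
Request = Struct.new(:status, :duration_s)

# Fraction of requests that returned a 5xx status.
def error_rate(requests)
  return 0.0 if requests.empty?

  requests.count { |r| r.status >= 500 }.fdiv(requests.size)
end

# Nearest-rank percentile: the smallest value such that at least
# pct percent of the samples are less than or equal to it.
def percentile(values, pct)
  return nil if values.empty?

  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Return the names of the thresholds breached by this window of requests.
def breached_alerts(requests, max_error_rate: 0.05, max_p99_s: 2.0)
  alerts = []
  alerts << :high_5xx_rate if error_rate(requests) > max_error_rate

  p99 = percentile(requests.map(&:duration_s), 99)
  alerts << :high_p99_latency if p99 && p99 > max_p99_s
  alerts
end

# One window of traffic: 90 fast successes, 8 server errors, 2 slow successes.
window = Array.new(90) { Request.new(200, 0.1) } +
         Array.new(8)  { Request.new(500, 0.3) } +
         Array.new(2)  { Request.new(200, 3.0) }

puts "error rate: #{error_rate(window)}"
puts "p99: #{percentile(window.map(&:duration_s), 99)}s"
puts "breached: #{breached_alerts(window).inspect}"
```

Note how two slow requests out of a hundred are enough to push P99 over the limit while barely moving the mean, which is why the latency alert targets high percentiles.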
Example Alerting Policy in Cloud Monitoring (using `gcloud` CLI):
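# Alert for high HTTP 5xx error rate.
# The metric type below is a placeholder: substitute the request-count metric your instrumentation actually exports.
# Flag-based conditions filter a single metric, so this alerts on the 5xx request rate itself.
# Expressing 5xx as a fraction of all requests (e.g., > 5%) requires a ratio condition defined in a policy file and passed with --policy-from-file.
gcloud alpha monitoring policies create \
  --display-name="High HTTP 5xx Error Rate - My Ruby App" \
  --combiner=OR \
  --condition-display-name="5xx Errors" \
  --condition-filter='resource.type="k8s_container" AND metric.type="workload.googleapis.com/kubernetes.container.requests_count" AND metric.labels.response_code_class="5xx"' \
  --aggregation='{"alignmentPeriod": "300s", "perSeriesAligner": "ALIGN_RATE"}' \
  --if="> 1" \
  --duration=300s \
  --trigger-count=1 \
  --notification-channels=projects/your-gcp-project-id/notificationChannels/your-channel-id \
  --documentation="High rate of HTTP 5xx errors detected on the Ruby application. Check application logs and traces for root cause."
# The threshold of 1 error per second is illustrative; tune it to your traffic.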
MongoDB Cluster Alerts
Critical MongoDB alerts include:
- High Replication Lag: Notify if secondary nodes fall behind the primary by more than a few seconds, or if the oplog window shrinks to the point where a lagging secondary could fall off the end of it.
- High Disk I/O Wait Time: Alert if disk I/O becomes a bottleneck, impacting read/write performance.
- Low Disk Space: Proactive alert before disk fills up, which can cause database instability.
- High Connection Count: Monitor for unusually high connection counts that might indicate connection leaks or resource exhaustion.
- Replica Set Member Unhealthy: Alert if a replica set member goes offline or unexpectedly enters a state other than PRIMARY or SECONDARY (for example, RECOVERING or DOWN).
Example Alerting Policy for Replication Lag:
# Alert for MongoDB replication lag.
# The resource and metric types below are placeholders: use the names under which your exporter's replication-lag metric actually appears in Cloud Monitoring.
gcloud alpha monitoring policies create \
  --display-name="MongoDB Replication Lag - Replica Set RS0" \
  --combiner=OR \
  --condition-display-name="Replication Lag" \
  --condition-filter='resource.type="mongodb_instance" AND metric.type="mongodb.googleapis.com/server/replication_lag" AND resource.labels.cluster_name="your-mongo-cluster-name"' \
  --if="> 60" \
  --duration=600s \
  --trigger-count=1 \
  --notification-channels=projects/your-gcp-project-id/notificationChannels/your-channel-id \
  --documentation="MongoDB replication lag is exceeding 60 seconds. Investigate the health of secondary nodes and network connectivity."
Advanced Diagnostics and Troubleshooting Workflows
When incidents occur, having a structured diagnostic approach is key. Here are some common workflows:
Debugging Application Performance Issues
1. Check Application Logs: Use Cloud Logging to filter logs for the affected service and time range. Look for recurring errors, exceptions, or slow query logs.
gcloud logging read 'resource.type="k8s_container" AND resource.labels.cluster_name="your-gke-cluster" AND resource.labels.namespace_name="your-namespace" AND resource.labels.container_name="my-ruby-app" AND severity>=ERROR' --limit=100 --format="table(timestamp, severity, jsonPayload.message)"
2. Analyze Traces: Navigate to Cloud Trace and filter by your service name. Identify slow spans, external service calls, or database queries contributing to high latency.
3. Review Resource Utilization: Check CPU, memory, and network usage for your application pods in Cloud Monitoring. Look for resource contention.
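For step 1, once log lines have been exported or read locally, finding recurring errors is a matter of filtering by severity and counting messages. The sketch below assumes each line is a JSON object with `severity` and `message` fields using Ruby Logger severity labels; the `top_errors` helper and the sample lines are hypothetical.

```ruby
require 'json'

# Ruby Logger severity labels in ascending order of importance.
SEVERITY_RANK = %w[DEBUG INFO WARN ERROR FATAL].each_with_index.to_h.freeze

# Parse one log line, returning nil for anything that is not valid JSON.
def parse_entry(line)
  JSON.parse(line)
rescue JSON::ParserError
  nil
end

# Count the most frequent messages at or above min_severity.
def top_errors(log_lines, min_severity: 'ERROR', limit: 3)
  threshold = SEVERITY_RANK.fetch(min_severity)

  messages = log_lines.filter_map do |line|
    entry = parse_entry(line)
    next unless entry.is_a?(Hash)
    next unless SEVERITY_RANK.fetch(entry['severity'], -1) >= threshold

    entry['message']
  end

  messages.tally.sort_by { |_message, count| -count }.first(limit)
end

# Illustrative log lines, including one that is not JSON.
lines = [
  '{"severity":"ERROR","message":"Mongo::Error::NoServerAvailable"}',
  '{"severity":"INFO","message":"GET /health 200"}',
  '{"severity":"ERROR","message":"Mongo::Error::NoServerAvailable"}',
  '{"severity":"FATAL","message":"worker pool exhausted"}',
  'plain text line from a boot script',
  '{"severity":"ERROR","message":"Mongo::Error::NoServerAvailable"}'
]

top_errors(lines).each { |message, count| puts "#{count}x #{message}" }
```

A message that dominates the tally is usually the best starting point for the trace analysis in step 2.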
Troubleshooting MongoDB Cluster Problems
1. Check MongoDB Logs: Use Cloud Logging to collect and analyze MongoDB’s own logs for errors, warnings, or performance-related messages.
gcloud logging read 'resource.type="k8s_container" AND resource.labels.cluster_name="your-gke-cluster" AND resource.labels.namespace_name="your-namespace" AND resource.labels.container_name="mongodb" AND textPayload=~"ERROR|WARNING"' --limit=100 --format="table(timestamp, textPayload)"
2. Inspect Replication Status: Use the MongoDB shell to check the replica set status.
mongosh --host mongodb-0.mongodb-service.namespace.svc.cluster.local --port 27017 -u <username> -p
> rs.status()
(On older MongoDB releases, use the legacy mongo shell in place of mongosh.)
3. Analyze Query Performance: Use db.collection.explain() or enable the MongoDB profiler to identify slow-running queries. Operation latency metrics exported to Cloud Monitoring (for example, those derived from the opLatencies section of serverStatus) can also point to slow queries.
4. Monitor Node Health: Check individual node metrics in Cloud Monitoring for CPU, memory, disk I/O, and network saturation. Ensure all nodes in the replica set are healthy and reachable.
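For step 2, the number to read out of rs.status() is how far each secondary's optimeDate trails the primary's. The following is a minimal sketch of that calculation over member data shaped like the rs.status() output (the `name`, `stateStr`, and `optimeDate` fields); the `replication_lag_seconds` helper, the shortened host names, and the timestamps are hypothetical, and the 60-second cutoff simply mirrors the alert threshold used earlier.

```ruby
# Seconds each SECONDARY trails the PRIMARY, keyed by member name.
# Expects hashes with 'name', 'stateStr', and 'optimeDate' (a Time).
def replication_lag_seconds(members)
  primary = members.find { |m| m['stateStr'] == 'PRIMARY' }
  return {} unless primary # no primary elected: lag is undefined

  members
    .select { |m| m['stateStr'] == 'SECONDARY' }
    .to_h { |m| [m['name'], (primary['optimeDate'] - m['optimeDate']).round] }
end

# Illustrative member list, as it might look after fetching rs.status().
members = [
  { 'name' => 'mongodb-0:27017', 'stateStr' => 'PRIMARY',
    'optimeDate' => Time.utc(2024, 3, 1, 12, 0, 0) },
  { 'name' => 'mongodb-1:27017', 'stateStr' => 'SECONDARY',
    'optimeDate' => Time.utc(2024, 3, 1, 11, 59, 58) },
  { 'name' => 'mongodb-2:27017', 'stateStr' => 'SECONDARY',
    'optimeDate' => Time.utc(2024, 3, 1, 11, 58, 45) }
]

lag = replication_lag_seconds(members)
lagging = lag.select { |_name, seconds| seconds > 60 }

puts "lag by member: #{lag}"
puts "over threshold: #{lagging.keys.join(', ')}"
```

A member that stays over the threshold across several checks, rather than spiking once during a burst of writes, is the one to investigate.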
Conclusion: Continuous Improvement in Observability
Implementing comprehensive monitoring for your Ruby application and MongoDB clusters on GCP is an ongoing process. Regularly review your alerts, refine your metrics, and update your diagnostic playbooks. By combining Google Cloud Operations Suite with robust instrumentation and well-defined alerting strategies, you can ensure the stability, performance, and availability of your critical services.