Server Monitoring Best Practices: Keeping Your Ruby App and PostgreSQL Clusters Alive on Google Cloud
Proactive PostgreSQL Health Checks with pg_cron
Maintaining the health of a PostgreSQL cluster, especially in a distributed cloud environment like Google Cloud, requires more than just reactive alerts. Proactive, automated checks are crucial for identifying and mitigating potential issues before they impact your Ruby application. We’ll leverage pg_cron, a PostgreSQL extension, to schedule routine maintenance and health checks directly within the database.
First, ensure pg_cron is installed and enabled on your PostgreSQL instances. This typically involves modifying your postgresql.conf file and restarting the PostgreSQL service. On Google Cloud SQL, this is often managed via instance flags.
Configuring pg_cron for PostgreSQL Maintenance
Once installed, you can schedule jobs using SQL. A common and vital task is to regularly run VACUUM VERBOSE on your databases. This process reclaims storage occupied by dead tuples and can prevent table bloat, which directly impacts query performance for your Ruby app.
Scheduling a Daily VACUUM
The following SQL command schedules a daily VACUUM VERBOSE for all databases on the instance. It’s important to run this during off-peak hours to minimize performance impact.
-- Connect to a database that has pg_cron installed (e.g., postgres)
-- Schedule a job to run VACUUM VERBOSE on all databases daily at 3 AM
SELECT cron.schedule(
'daily-vacuum',
'0 3 * * *', -- Cron syntax: Minute Hour DayOfMonth Month DayOfWeek
$$
DO $$
BEGIN
FOR db_name IN (SELECT datname FROM pg_database WHERE datistemplate = false) LOOP
RAISE NOTICE 'Running VACUUM VERBOSE on database: %', db_name;
EXECUTE format('VACUUM VERBOSE %I', db_name);
END LOOP;
END;
$$;
$$
);
To monitor the execution of these cron jobs, you can query the cron.job and cron.job_run_details tables.
-- View scheduled jobs SELECT * FROM cron.job; -- View recent job run history SELECT * FROM cron.job_run_details ORDER BY start_time DESC LIMIT 10;
For more advanced scenarios, consider scheduling ANALYZE to update table statistics, which is critical for the query planner’s efficiency. You might also want to schedule checks for replication lag if you’re using read replicas.
Monitoring Ruby Application Performance with Prometheus and Grafana
Your Ruby application’s performance is directly tied to the health of your PostgreSQL cluster and its own resource utilization. We’ll set up Prometheus to scrape metrics from your Ruby application and Grafana for visualization and alerting.
Instrumenting Your Ruby Application
The prometheus-client-ruby gem is the standard for instrumenting Ruby applications. You’ll need to add it to your Gemfile and initialize it.
# Gemfile gem 'prometheus-client-ruby'
# config/initializers/prometheus.rb (for Rails applications)
require 'prometheus/client'
Prometheus::Client.configure do |config|
config.logger = Rails.logger
end
# Define custom metrics
$metrics = {}
$metrics[:http_requests_total] = Prometheus::Client::Counter.new(
:http_requests_total,
'Total number of HTTP requests processed'
)
$metrics[:http_request_duration_seconds] = Prometheus::Client::Histogram.new(
:http_request_duration_seconds,
'HTTP request latency in seconds'
)
$metrics[:db_query_duration_seconds] = Prometheus::Client::Histogram.new(
:db_query_duration_seconds,
'Database query latency in seconds'
)
# Register metrics with the default registry
Prometheus::Client.default.register($metrics[:http_requests_total])
Prometheus::Client.default.register($metrics[:http_request_duration_seconds])
Prometheus::Client.default.register($metrics[:db_query_duration_seconds])
# Expose metrics endpoint (e.g., /metrics)
# In Rails, you can use a Rack middleware or a dedicated controller action.
# For simplicity, let's assume a Rack middleware is configured in config/application.rb
# or config/routes.rb for non-Rails apps.
# Example for Rails:
# config.middleware.use Prometheus::Client::Rack::Metrics, path: '/metrics'
You’ll also want to instrument your database queries. This can be done by monkey-patching your ORM (like ActiveRecord) or by manually recording durations around your query execution.
# Example for ActiveRecord (in an initializer or a custom instrumentation file)
module ActiveRecord
module ConnectionAdapters
module PostgreSQLAdapter
def execute(sql, name = nil, parameters = nil)
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
result = super
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
$metrics[:db_query_duration_seconds].observe({ query: name || 'unknown' }, duration)
result
rescue => e
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
$metrics[:db_query_duration_seconds].observe({ query: name || 'unknown', error: e.class.to_s }, duration)
raise e
end
end
end
end
Configuring Prometheus Scraper
On your Google Cloud Compute Engine instances running your Ruby application, configure Prometheus to scrape the /metrics endpoint. If you’re using Kubernetes Engine (GKE), you’ll use service discovery.
# prometheus.yml
scrape_configs:
- job_name: 'ruby_app'
static_configs:
- targets: ['YOUR_APP_INSTANCE_IP:9090'] # Replace with your app's IP and port where metrics are exposed
labels:
environment: 'production'
app: 'my-ruby-app'
- job_name: 'postgres'
static_configs:
- targets: ['YOUR_POSTGRES_INSTANCE_IP:9187'] # Assuming you're using pg_exporter on port 9187
labels:
environment: 'production'
cluster: 'main-pg-cluster'
For PostgreSQL, you’ll typically use the postgres_exporter. Ensure it’s installed and configured to connect to your PostgreSQL instances. The exporter will expose metrics on a specific port (default is 9187).
Visualizing and Alerting with Grafana
Grafana is your go-to for visualizing the metrics collected by Prometheus and for setting up alerts. Connect Grafana to your Prometheus data source.
Key Dashboards and Alerts for Ruby/PostgreSQL
Here are some essential metrics and corresponding Grafana queries/alerts:
- Application Error Rate: Monitor the rate of exceptions.
Grafana Query (PromQL):
rate(http_requests_total{status=~"5..", environment="production"}[5m])
Alert: Trigger if the error rate exceeds a threshold (e.g., 1% of total requests) for more than 5 minutes.
- Request Latency: Track the duration of HTTP requests.
Grafana Query (PromQL):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, environment, app))
Alert: Notify if the 95th percentile latency exceeds a defined SLO (e.g., 500ms) for a sustained period.
- Database Query Latency: Identify slow database queries.
Grafana Query (PromQL):
histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket[5m])) by (le, query, environment))
Alert: Alert on high 95th percentile query latency, especially for critical queries identified by the query label.
- PostgreSQL Connection Count: Monitor active connections to prevent exhaustion.
Grafana Query (PromQL – assuming pg_exporter):
pg_stat_activity_count{environment="production"}
Alert: Trigger if the connection count approaches the configured max_connections limit.
- PostgreSQL Replication Lag: Crucial for high availability and disaster recovery.
Grafana Query (PromQL – assuming pg_exporter):
pg_replication_lag_seconds{environment="production"}
Alert: Notify immediately if replication lag exceeds a few seconds (depending on your RPO/RTO requirements).
Google Cloud Operations Suite (formerly Stackdriver) Integration
While Prometheus and Grafana provide deep visibility into your application and database metrics, Google Cloud Operations Suite offers centralized logging and monitoring for your GCP resources. It’s essential for capturing system-level metrics and logs from your Compute Engine instances and Cloud SQL instances.
Leveraging Cloud Logging and Monitoring
Ensure the Cloud Logging and Cloud Monitoring agents are installed and running on your Compute Engine instances. For Cloud SQL, logs and metrics are automatically collected.
# On Compute Engine instances, verify agent status sudo systemctl status google-fluentd sudo systemctl status google-cloud-monitoring-agent
You can create custom metrics in Cloud Monitoring based on log entries. For example, if your Ruby application logs specific error messages, you can create a metric to count these occurrences.
Creating Log-Based Metrics in Cloud Monitoring
Navigate to Cloud Monitoring > Logs-based Metrics in the Google Cloud Console. Create a new metric:
- Metric Type: Counter
- Log Filter: Example:
resource.type="gce_instance" AND jsonPayload.message:"FATAL" AND jsonPayload.database:"your_db_name"(for PostgreSQL FATAL errors) - Metric Name: e.g.,
postgresql_fatal_errors
This allows you to correlate application-level metrics from Prometheus with infrastructure-level metrics and logs from Google Cloud Operations Suite, providing a holistic view of your system’s health.
High Availability and Disaster Recovery Considerations
Monitoring is a cornerstone of HA/DR. Ensure your monitoring setup itself is resilient. For critical alerts, consider using multiple notification channels (e.g., PagerDuty, Slack, email).
For PostgreSQL, leverage Cloud SQL’s built-in HA features (regional replicas) and configure automated backups. Monitor replication lag closely, as discussed earlier, to ensure failover readiness. Your Ruby application should be designed to handle temporary database unavailability gracefully, perhaps with retry mechanisms and circuit breakers.
Regularly test your failover procedures and disaster recovery plans. Monitoring provides the data to validate that these processes are working as expected.