Server Monitoring Best Practices: Keeping Your Shopify App and PostgreSQL Clusters Alive on DigitalOcean
Proactive PostgreSQL Monitoring with Prometheus and Grafana on DigitalOcean
Maintaining the health and performance of PostgreSQL clusters is paramount for any application, especially those with high transaction volumes like Shopify apps. Relying solely on DigitalOcean’s basic metrics can lead to reactive firefighting rather than proactive problem resolution. This section details a robust monitoring setup using Prometheus for metric collection and Grafana for visualization, specifically tailored for PostgreSQL on DigitalOcean droplets.
1. Deploying Prometheus and Grafana
The most straightforward approach for a production environment is to deploy Prometheus and Grafana as Docker containers. This ensures isolation, easy updates, and portability. We’ll use Docker Compose for orchestration.
Create a docker-compose.yml file:
version: '3.8'
services:
prometheus:
image: prom/prometheus:v2.48.0
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus:/etc/prometheus
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
restart: unless-stopped
grafana:
image: grafana/grafana:10.3.3
container_name: grafana
ports:
- "3000:3000"
volumes:
- grafana_data:/var/lib/grafana
restart: unless-stopped
volumes:
prometheus_data:
grafana_data:
Next, configure Prometheus to scrape PostgreSQL metrics. This requires the postgres_exporter. We’ll add it to our docker-compose.yml and configure Prometheus to scrape it.
version: '3.8'
services:
prometheus:
image: prom/prometheus:v2.48.0
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus:/etc/prometheus
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
restart: unless-stopped
grafana:
image: grafana/grafana:10.3.3
container_name: grafana
ports:
- "3000:3000"
volumes:
- grafana_data:/var/lib/grafana
restart: unless-stopped
postgres_exporter:
image: wrouesnel/postgres_exporter:latest
container_name: postgres_exporter
environment:
# Replace with your actual PostgreSQL connection string
# Example: "postgresql://user:password@host:port/database?sslmode=disable"
DATA_SOURCE_NAME: "postgresql://monitor_user:your_secure_password@your_postgres_host:5432/your_database?sslmode=disable"
ports:
- "9187:9187" # Default port for postgres_exporter
restart: unless-stopped
volumes:
prometheus_data:
grafana_data:
Create the Prometheus configuration file prometheus/prometheus.yml:
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'postgres'
static_configs:
- targets: ['postgres_exporter:9187']
metrics_path: /metrics
params:
# Optional: Specify which collectors to enable/disable
# See https://github.com/wrouesnel/postgres_exporter#configuration
# Example:
# collect[]: ['pg_stat_activity', 'pg_replication', 'pg_locks']
Before starting, ensure you have a dedicated monitoring user in your PostgreSQL cluster with appropriate permissions. For example:
CREATE USER monitor_user WITH PASSWORD 'your_secure_password'; GRANT CONNECT ON DATABASE your_database TO monitor_user; -- Grant read-only access to relevant system views GRANT pg_read_all_stats TO monitor_user; -- For replication metrics, you might need: GRANT pg_read_replication TO monitor_user;
Now, deploy the stack:
docker-compose up -d
Access Prometheus at http://your_do_droplet_ip:9090 and Grafana at http://your_do_droplet_ip:3000. Log in to Grafana with default credentials (admin/admin) and change the password. Add Prometheus as a data source (Configuration -> Data Sources -> Add data source -> Prometheus). Use http://prometheus:9090 as the URL.
2. Key PostgreSQL Metrics to Monitor
Here are critical PostgreSQL metrics to track for performance and stability, along with their Prometheus metric names (prefixed with pg_):
- Connection Management:
pg_stat_activity_count: Number of active connections. High numbers can indicate connection leaks or insufficient pooling.pg_stat_database_numbackends: Total number of backends connected to a database.
- Replication Status:
pg_replication_lag_seconds: Replication lag in seconds. Crucial for high availability setups.pg_replication_state: Status of replication (e.g., ‘streaming’, ‘catchup’).
- Resource Utilization:
pg_stat_database_blks_read/pg_stat_database_blks_hit: Block read/hit ratio. A low hit ratio indicates heavy I/O.pg_stat_bgwriter_buffers_backend/pg_stat_bgwriter_buffers_clean: Background writer activity.pg_stat_database_tup_returned/pg_stat_database_tup_fetched: Tuple operations.
- Locking:
pg_locks_count: Number of active locks. High counts or long lock waits can cause application slowdowns.
- Transaction Activity:
pg_stat_database_xact_commit/pg_stat_database_xact_rollback: Commit/rollback rates. High rollback rates are a red flag.
- Database Size and Growth:
pg_database_size_bytes: Total size of databases.
3. Setting up Grafana Dashboards
Import pre-built Grafana dashboards for PostgreSQL. A popular choice is the “PostgreSQL by EnterpriseDB” dashboard (ID 7362) or “PostgreSQL Overview” (ID 1209). You can find these on Grafana.com/dashboards. After importing, select your Prometheus data source.
For custom dashboards, focus on visualizing the key metrics mentioned above. For example, a panel to monitor replication lag:
SELECT avg(pg_replication_lag_seconds) FROM pg_replication_slots WHERE active = true
And a panel for connection counts:
sum(pg_stat_activity_count{datname="your_database"}) by (datname)
4. Alerting with Alertmanager
To make monitoring truly proactive, set up alerting. Prometheus can send alerts to Alertmanager, which then routes them to your preferred notification channels (Slack, PagerDuty, email).
Add Alertmanager to your docker-compose.yml and configure Prometheus to send alerts to it.
version: '3.8'
services:
# ... (prometheus, grafana, postgres_exporter services as above) ...
alertmanager:
image: prom/alertmanager:v0.26.0
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager:/etc/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
restart: unless-stopped
volumes:
prometheus_data:
grafana_data:
Create prometheus/prometheus.yml to include Alertmanager:
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'postgres'
static_configs:
- targets: ['postgres_exporter:9187']
metrics_path: /metrics
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
Create alertmanager/alertmanager.yml for basic routing (e.g., to Slack):
global: resolve_timeout: 5m slack_api_url: '' # Replace with your Slack webhook URL route: group_by: ['alertname', 'cluster', 'service'] group_wait: 30s group_interval: 5m repeat_interval: 4h receiver: 'slack-notifications' receivers: - name: 'slack-notifications' slack_configs: - channel: '#alerts' # Replace with your alert channel send_resolved: true text: '{{ range .Alerts }}*Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`\n*Description:* {{ .Annotations.description }}\n*Instance:* {{ .Labels.instance }}\n*Started:* {{ .StartsAt | humanize }}\n{{ end }}'
Define alerting rules in Prometheus. Create prometheus/alert.rules.yml:
groups:
- name: postgres_alerts
rules:
- alert: HighReplicationLag
expr: pg_replication_lag_seconds > 60 # Alert if lag is over 60 seconds
for: 5m
labels:
severity: critical
annotations:
summary: "High PostgreSQL replication lag on {{ $labels.instance }}"
description: "Replication lag for database {{ $labels.datname }} on instance {{ $labels.instance }} is {{ $value }} seconds, exceeding the threshold of 60 seconds."
- alert: HighConnectionCount
expr: pg_stat_activity_count{datname="your_database"} > 100 # Alert if connections exceed 100
for: 10m
labels:
severity: warning
annotations:
summary: "High PostgreSQL connection count on {{ $labels.instance }}"
description: "Database {{ $labels.datname }} on instance {{ $labels.instance }} has {{ $value }} active connections, exceeding the threshold of 100."
- alert: HighRollbackRate
expr: rate(pg_stat_database_xact_rollback{datname="your_database"}[5m]) * 60 > 10 # Alert if rollback rate exceeds 10 per minute
for: 5m
labels:
severity: warning
annotations:
summary: "High PostgreSQL rollback rate on {{ $labels.instance }}"
description: "Database {{ $labels.datname }} on instance {{ $labels.instance }} has a rollback rate of {{ $value | printf \"%.2f\" }} per minute, exceeding the threshold of 10."
Update prometheus/prometheus.yml to load these rules:
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'postgres'
static_configs:
- targets: ['postgres_exporter:9187']
metrics_path: /metrics
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
rule_files:
- "/etc/prometheus/alert.rules.yml"
Restart your Docker containers for the changes to take effect:
docker-compose restart
5. Monitoring the Shopify App Itself
Beyond the database, monitoring your Shopify application’s health and performance is critical. This involves tracking application-level metrics, error rates, and response times.
5.1. Application Performance Monitoring (APM) Tools
For a Shopify app, especially one built with PHP (e.g., Laravel, Symfony), integrating an APM solution is highly recommended. Tools like:
- New Relic: Comprehensive APM, infrastructure monitoring, and error tracking.
- Datadog APM: Powerful distributed tracing and performance analysis.
- Sentry: Primarily error tracking but offers performance monitoring features.
These tools typically involve installing an agent or a PHP extension (like newrelic.so or the Datadog PHP tracer) on your application servers. They automatically instrument your code to capture request traces, database query times, external API calls, and errors.
5.2. Custom Application Metrics with Prometheus Client Libraries
You can expose custom application metrics directly to Prometheus. For PHP, use libraries like prometheus_client_php.
<?php
require 'vendor/autoload.php';
use Prometheus\CollectorRegistry;
use Prometheus\Renderers\RendererInterface;
use Prometheus\Storage\InMemory;
$adapter = new InMemory();
$registry = new CollectorRegistry($adapter);
// Counter for total API requests
$counter = $registry->registerCounter('shopify_app', 'api_requests_total', 'Total number of API requests received', ['endpoint']);
// Gauge for current active background jobs
$gauge = $registry->registerGauge('shopify_app', 'background_jobs_active', 'Number of active background jobs', ['job_type']);
// Histogram for request durations
$histogram = $registry->registerHistogram('shopify_app', 'request_duration_seconds', 'Histogram of request durations', ['endpoint']);
// --- In your application's request handling logic ---
// Example: Increment API request counter
$endpoint = $_SERVER['REQUEST_URI']; // Or a more specific route identifier
$counter->inc(['endpoint' => $endpoint]);
// Example: Track request duration
$startTime = microtime(true);
// ... your application logic ...
$duration = microtime(true) - $startTime;
$histogram->observe($duration, ['endpoint' => $endpoint]);
// Example: Update active background job gauge
// $gauge->set(5, ['job_type' => 'order_processing']); // When jobs start
// $gauge->set(4, ['job_type' => 'order_processing']); // When jobs finish
// --- Expose metrics endpoint ---
// In a separate script (e.g., /metrics.php) or via a dedicated route
header('Content-Type: ' . RendererInterface::CONTENT_TYPE_TEXT_PLAIN);
$renderer = new \Prometheus\Renderers\TextRenderer();
echo $renderer->render($registry);
?>
Configure Prometheus to scrape this /metrics.php endpoint. Add a new job to prometheus/prometheus.yml:
scrape_configs:
# ... other jobs ...
- job_name: 'shopify_app'
static_configs:
- targets: ['your_app_server_ip:80'] # Replace with your app server's IP and port
metrics_path: /metrics.php # Or the actual path to your metrics endpoint
scheme: http # or https
5.3. Log Aggregation and Analysis
Centralized logging is crucial for debugging. Use tools like:
- ELK Stack (Elasticsearch, Logstash, Kibana): Powerful for large-scale log management.
- Loki (with Promtail and Grafana): A more lightweight, Prometheus-native logging system.
- DigitalOcean Log Management: A managed service if you prefer simplicity.
Configure your application to log in a structured format (e.g., JSON). Use tools like Promtail to ship logs from your DigitalOcean droplets to your chosen log aggregation system. In Grafana, you can then query and visualize these logs alongside your metrics.
6. DigitalOcean Droplet and Infrastructure Monitoring
Don’t neglect the underlying infrastructure. DigitalOcean provides basic metrics for CPU, memory, disk I/O, and network traffic. However, for deeper insights, consider:
- Node Exporter: Deploy
node_exporteras a Docker container or directly on your droplets to expose detailed host-level metrics (CPU, RAM, disk usage, network stats) to Prometheus. - Blackbox Exporter: Monitor the availability and performance of your application endpoints (HTTP, HTTPS, TCP) from an external perspective.
- DigitalOcean Alerts: Configure basic alerts within the DigitalOcean control panel for critical resource thresholds (e.g., CPU utilization > 90% for 1 hour).
Add Node Exporter to your docker-compose.yml:
services:
# ... other services ...
node_exporter:
image: prom/node-exporter:v1.7.0
container_name: node_exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
restart: unless-stopped
And update prometheus/prometheus.yml to scrape it:
scrape_configs:
# ... other jobs ...
- job_name: 'node_exporter'
static_configs:
- targets: ['node_exporter:9100'] # If running on the same Docker host as Prometheus
# If node_exporter is on a different droplet, use its IP:
# - targets: ['your_app_server_ip:9100']
7. Regular Audits and Tuning
Monitoring is not a set-and-forget solution. Regularly review your dashboards and alerts. Analyze historical data to identify performance bottlenecks and optimize queries, indexing strategies, and PostgreSQL configuration parameters (e.g., shared_buffers, work_mem). Use tools like pg_stat_statements to find slow queries.
Periodically adjust alert thresholds based on observed performance patterns and business requirements. Ensure your monitoring setup scales with your application’s growth.