Server Monitoring Best Practices: Keeping Your Shopify App and PostgreSQL Clusters Alive on DigitalOcean

Proactive PostgreSQL Monitoring with Prometheus and Grafana on DigitalOcean

Maintaining the health and performance of PostgreSQL clusters is paramount for any application, especially those with high transaction volumes like Shopify apps. Relying solely on DigitalOcean’s basic metrics can lead to reactive firefighting rather than proactive problem resolution. This section details a robust monitoring setup using Prometheus for metric collection and Grafana for visualization, specifically tailored for PostgreSQL on DigitalOcean droplets.

1. Deploying Prometheus and Grafana

The most straightforward approach for a production environment is to deploy Prometheus and Grafana as Docker containers. This ensures isolation, easy updates, and portability. We’ll use Docker Compose for orchestration.

Create a docker-compose.yml file:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.3.3
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Next, configure Prometheus to scrape PostgreSQL metrics. This requires the postgres_exporter. We’ll add it to our docker-compose.yml and configure Prometheus to scrape it.

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.3.3
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

  postgres_exporter:
    image: wrouesnel/postgres_exporter:latest
    container_name: postgres_exporter
    environment:
      # Replace with your actual PostgreSQL connection string
      # Example: "postgresql://user:password@host:port/database?sslmode=disable"
      DATA_SOURCE_NAME: "postgresql://monitor_user:your_secure_password@your_postgres_host:5432/your_database?sslmode=disable"
    ports:
      - "9187:9187" # Default port for postgres_exporter
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Create the Prometheus configuration file prometheus/prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres_exporter:9187']
    metrics_path: /metrics
    params:
      # Optional: Specify which collectors to enable/disable
      # See https://github.com/wrouesnel/postgres_exporter#configuration
      # Example:
      # collect[]: ['pg_stat_activity', 'pg_replication', 'pg_locks']

Before starting, ensure you have a dedicated monitoring user in your PostgreSQL cluster with appropriate permissions. For example:

CREATE USER monitor_user WITH PASSWORD 'your_secure_password';
GRANT CONNECT ON DATABASE your_database TO monitor_user;
-- Grant read-only access to relevant system views
GRANT pg_read_all_stats TO monitor_user;
-- For replication metrics, you might need:
GRANT pg_read_replication TO monitor_user;

Now, deploy the stack:

docker-compose up -d

Access Prometheus at http://your_do_droplet_ip:9090 and Grafana at http://your_do_droplet_ip:3000. Log in to Grafana with default credentials (admin/admin) and change the password. Add Prometheus as a data source (Configuration -> Data Sources -> Add data source -> Prometheus). Use http://prometheus:9090 as the URL.

2. Key PostgreSQL Metrics to Monitor

Here are critical PostgreSQL metrics to track for performance and stability, along with their Prometheus metric names (prefixed with pg_):

Connection Management:
- pg_stat_activity_count: Number of active connections. High numbers can indicate connection leaks or insufficient pooling.
- pg_stat_database_numbackends: Total number of backends connected to a database.
Replication Status:
- pg_replication_lag_seconds: Replication lag in seconds. Crucial for high availability setups.
- pg_replication_state: Status of replication (e.g., ‘streaming’, ‘catchup’).
Resource Utilization:
- pg_stat_database_blks_read / pg_stat_database_blks_hit: Block read/hit ratio. A low hit ratio indicates heavy I/O.
- pg_stat_bgwriter_buffers_backend / pg_stat_bgwriter_buffers_clean: Background writer activity.
- pg_stat_database_tup_returned / pg_stat_database_tup_fetched: Tuple operations.
Locking:
- pg_locks_count: Number of active locks. High counts or long lock waits can cause application slowdowns.
Transaction Activity:
- pg_stat_database_xact_commit / pg_stat_database_xact_rollback: Commit/rollback rates. High rollback rates are a red flag.
Database Size and Growth:
- pg_database_size_bytes: Total size of databases.

3. Setting up Grafana Dashboards

Import pre-built Grafana dashboards for PostgreSQL. A popular choice is the “PostgreSQL by EnterpriseDB” dashboard (ID 7362) or “PostgreSQL Overview” (ID 1209). You can find these on Grafana.com/dashboards. After importing, select your Prometheus data source.

For custom dashboards, focus on visualizing the key metrics mentioned above. For example, a panel to monitor replication lag:

SELECT avg(pg_replication_lag_seconds) FROM pg_replication_slots WHERE active = true

And a panel for connection counts:

sum(pg_stat_activity_count{datname="your_database"}) by (datname)

4. Alerting with Alertmanager

To make monitoring truly proactive, set up alerting. Prometheus can send alerts to Alertmanager, which then routes them to your preferred notification channels (Slack, PagerDuty, email).

Add Alertmanager to your docker-compose.yml and configure Prometheus to send alerts to it.

version: '3.8'

services:
  # ... (prometheus, grafana, postgres_exporter services as above) ...

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Create prometheus/prometheus.yml to include Alertmanager:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres_exporter:9187']
    metrics_path: /metrics

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Create alertmanager/alertmanager.yml for basic routing (e.g., to Slack):

global:
  resolve_timeout: 5m
  slack_api_url: '' # Replace with your Slack webhook URL

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts' # Replace with your alert channel
        send_resolved: true
        text: '{{ range .Alerts }}*Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`\n*Description:* {{ .Annotations.description }}\n*Instance:* {{ .Labels.instance }}\n*Started:* {{ .StartsAt | humanize }}\n{{ end }}'

Define alerting rules in Prometheus. Create prometheus/alert.rules.yml:

groups:
- name: postgres_alerts
  rules:
  - alert: HighReplicationLag
    expr: pg_replication_lag_seconds > 60 # Alert if lag is over 60 seconds
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High PostgreSQL replication lag on {{ $labels.instance }}"
      description: "Replication lag for database {{ $labels.datname }} on instance {{ $labels.instance }} is {{ $value }} seconds, exceeding the threshold of 60 seconds."

  - alert: HighConnectionCount
    expr: pg_stat_activity_count{datname="your_database"} > 100 # Alert if connections exceed 100
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High PostgreSQL connection count on {{ $labels.instance }}"
      description: "Database {{ $labels.datname }} on instance {{ $labels.instance }} has {{ $value }} active connections, exceeding the threshold of 100."

  - alert: HighRollbackRate
    expr: rate(pg_stat_database_xact_rollback{datname="your_database"}[5m]) * 60 > 10 # Alert if rollback rate exceeds 10 per minute
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High PostgreSQL rollback rate on {{ $labels.instance }}"
      description: "Database {{ $labels.datname }} on instance {{ $labels.instance }} has a rollback rate of {{ $value | printf \"%.2f\" }} per minute, exceeding the threshold of 10."

Update prometheus/prometheus.yml to load these rules:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres_exporter:9187']
    metrics_path: /metrics

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "/etc/prometheus/alert.rules.yml"

Restart your Docker containers for the changes to take effect:

docker-compose restart

5. Monitoring the Shopify App Itself

Beyond the database, monitoring your Shopify application’s health and performance is critical. This involves tracking application-level metrics, error rates, and response times.

5.1. Application Performance Monitoring (APM) Tools

For a Shopify app, especially one built with PHP (e.g., Laravel, Symfony), integrating an APM solution is highly recommended. Tools like:

New Relic: Comprehensive APM, infrastructure monitoring, and error tracking.
Datadog APM: Powerful distributed tracing and performance analysis.
Sentry: Primarily error tracking but offers performance monitoring features.

These tools typically involve installing an agent or a PHP extension (like newrelic.so or the Datadog PHP tracer) on your application servers. They automatically instrument your code to capture request traces, database query times, external API calls, and errors.

5.2. Custom Application Metrics with Prometheus Client Libraries

You can expose custom application metrics directly to Prometheus. For PHP, use libraries like prometheus_client_php.

<?php
require 'vendor/autoload.php';

use Prometheus\CollectorRegistry;
use Prometheus\Renderers\RendererInterface;
use Prometheus\Storage\InMemory;

$adapter = new InMemory();
$registry = new CollectorRegistry($adapter);

// Counter for total API requests
$counter = $registry->registerCounter('shopify_app', 'api_requests_total', 'Total number of API requests received', ['endpoint']);

// Gauge for current active background jobs
$gauge = $registry->registerGauge('shopify_app', 'background_jobs_active', 'Number of active background jobs', ['job_type']);

// Histogram for request durations
$histogram = $registry->registerHistogram('shopify_app', 'request_duration_seconds', 'Histogram of request durations', ['endpoint']);

// --- In your application's request handling logic ---

// Example: Increment API request counter
$endpoint = $_SERVER['REQUEST_URI']; // Or a more specific route identifier
$counter->inc(['endpoint' => $endpoint]);

// Example: Track request duration
$startTime = microtime(true);
// ... your application logic ...
$duration = microtime(true) - $startTime;
$histogram->observe($duration, ['endpoint' => $endpoint]);

// Example: Update active background job gauge
// $gauge->set(5, ['job_type' => 'order_processing']); // When jobs start
// $gauge->set(4, ['job_type' => 'order_processing']); // When jobs finish

// --- Expose metrics endpoint ---
// In a separate script (e.g., /metrics.php) or via a dedicated route

header('Content-Type: ' . RendererInterface::CONTENT_TYPE_TEXT_PLAIN);

$renderer = new \Prometheus\Renderers\TextRenderer();
echo $renderer->render($registry);
?>

Configure Prometheus to scrape this /metrics.php endpoint. Add a new job to prometheus/prometheus.yml:

scrape_configs:
  # ... other jobs ...

  - job_name: 'shopify_app'
    static_configs:
      - targets: ['your_app_server_ip:80'] # Replace with your app server's IP and port
    metrics_path: /metrics.php # Or the actual path to your metrics endpoint
    scheme: http # or https

5.3. Log Aggregation and Analysis

Centralized logging is crucial for debugging. Use tools like:

ELK Stack (Elasticsearch, Logstash, Kibana): Powerful for large-scale log management.
Loki (with Promtail and Grafana): A more lightweight, Prometheus-native logging system.
DigitalOcean Log Management: A managed service if you prefer simplicity.

Configure your application to log in a structured format (e.g., JSON). Use tools like Promtail to ship logs from your DigitalOcean droplets to your chosen log aggregation system. In Grafana, you can then query and visualize these logs alongside your metrics.

6. DigitalOcean Droplet and Infrastructure Monitoring

Don’t neglect the underlying infrastructure. DigitalOcean provides basic metrics for CPU, memory, disk I/O, and network traffic. However, for deeper insights, consider:

Node Exporter: Deploy node_exporter as a Docker container or directly on your droplets to expose detailed host-level metrics (CPU, RAM, disk usage, network stats) to Prometheus.
Blackbox Exporter: Monitor the availability and performance of your application endpoints (HTTP, HTTPS, TCP) from an external perspective.
DigitalOcean Alerts: Configure basic alerts within the DigitalOcean control panel for critical resource thresholds (e.g., CPU utilization > 90% for 1 hour).

Add Node Exporter to your docker-compose.yml:

services:
  # ... other services ...

  node_exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node_exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped

And update prometheus/prometheus.yml to scrape it:

scrape_configs:
  # ... other jobs ...

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node_exporter:9100'] # If running on the same Docker host as Prometheus
      # If node_exporter is on a different droplet, use its IP:
      # - targets: ['your_app_server_ip:9100']

7. Regular Audits and Tuning

Monitoring is not a set-and-forget solution. Regularly review your dashboards and alerts. Analyze historical data to identify performance bottlenecks and optimize queries, indexing strategies, and PostgreSQL configuration parameters (e.g., shared_buffers, work_mem). Use tools like pg_stat_statements to find slow queries.

Periodically adjust alert thresholds based on observed performance patterns and business requirements. Ensure your monitoring setup scales with your application’s growth.