Server Monitoring Best Practices: Keeping Your Shopify App and PostgreSQL Clusters Alive on OVH

Proactive PostgreSQL Monitoring with Prometheus and Grafana on OVH

Maintaining the health and performance of PostgreSQL clusters, especially those backing critical Shopify applications, demands a robust monitoring strategy. On OVH infrastructure, this often involves self-managed PostgreSQL instances. We’ll focus on a Prometheus-based stack, leveraging exporters and Grafana for visualization and alerting.

Deploying the PostgreSQL Exporter

The postgres_exporter is essential for exposing PostgreSQL metrics to Prometheus. It requires a dedicated PostgreSQL user with specific privileges. Ensure this user has read-only access to relevant system catalogs and statistics views.

First, create a monitoring user in your PostgreSQL cluster:

-- Connect to your PostgreSQL instance as a superuser
CREATE USER monitor WITH PASSWORD 'your_strong_password';
-- pg_monitor (PostgreSQL 10+) bundles pg_read_all_stats, pg_read_all_settings, and pg_stat_scan_tables
GRANT pg_monitor TO monitor;
-- For specific database monitoring, grant access to that database
GRANT CONNECT ON DATABASE your_app_db TO monitor;
GRANT USAGE ON SCHEMA pg_catalog TO monitor;
GRANT SELECT ON pg_stat_activity TO monitor;
GRANT SELECT ON pg_stat_database TO monitor;
GRANT SELECT ON pg_stat_replication TO monitor;
-- Requires the pg_stat_statements extension to be installed in this database
GRANT SELECT ON pg_stat_statements TO monitor;
GRANT SELECT ON pg_locks TO monitor;
GRANT SELECT ON pg_settings TO monitor;
GRANT SELECT ON pg_stat_user_tables TO monitor;
GRANT SELECT ON pg_stat_user_indexes TO monitor;

Next, install and configure the postgres_exporter. This can be done via Docker or directly on a host. For this example, we’ll assume a Docker deployment on a dedicated monitoring host or within your OVH cloud environment.

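Create a .pgpass file for the user running the exporter so that psql and other clients that read it can connect without credentials on the command line. Note that the Docker command further down passes the password through DATA_SOURCE_NAME and does not read this file: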

# ~/.pgpass
# Format: hostname:port:database:username:password
your_pg_host:5432:*:monitor:your_strong_password

Set appropriate permissions for the .pgpass file:

chmod 0600 ~/.pgpass
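
If you provision this file from a script rather than by hand, two details are easy to miss: colons and backslashes inside a field must be escaped with a backslash, and the file should be created with 0600 from the start. The following is a minimal sketch of that idea; the function names write_pgpass and escape_pgpass_field are hypothetical helpers, not part of any PostgreSQL tooling.

```python
import os
import tempfile


def escape_pgpass_field(value: str) -> str:
    """Escape backslashes and colons, which .pgpass treats as special."""
    return value.replace("\\", "\\\\").replace(":", "\\:")


def write_pgpass(path: str, host: str, port: int, database: str,
                 user: str, password: str) -> None:
    """Write a single-entry .pgpass file that only its owner can read."""
    fields = [host, str(port), database, user, password]
    line = ":".join(escape_pgpass_field(field) for field in fields) + "\n"
    # Create the file with mode 0600 so the password is never briefly
    # readable by other users.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(line)
    # Tighten the mode even if the file already existed with looser bits.
    os.chmod(path, 0o600)


if __name__ == "__main__":
    # Demo in a temporary directory; point `path` at ~/.pgpass in real use.
    demo_path = os.path.join(tempfile.mkdtemp(), ".pgpass")
    write_pgpass(demo_path, "your_pg_host", 5432, "*", "monitor",
                 "your_strong_password")
    with open(demo_path) as handle:
        print(handle.read(), end="")
```

Writing the file through os.open with an explicit mode avoids the window in which a freshly created file carries the default umask permissions.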

Run the exporter using Docker:

docker run -d \
  --name postgres_exporter \
  -p 9187:9187 \
  -e DATA_SOURCE_NAME="postgresql://monitor:your_strong_password@your_pg_host:5432/postgres?sslmode=disable" \
  quay.io/prometheuscommunity/postgres-exporter:latest

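Note: Replace your_pg_host with the actual hostname or IP of your PostgreSQL cluster, and adjust sslmode to match your PostgreSQL configuration. sslmode=disable sends credentials and data unencrypted, so for production use sslmode=verify-full with proper certificates. Anyone who can inspect the container can read the password in DATA_SOURCE_NAME; the exporter also accepts DATA_SOURCE_URI, DATA_SOURCE_USER, and DATA_SOURCE_PASS_FILE if you prefer to mount the password as a file. Pin a specific image tag instead of latest for reproducible deployments.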
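
A password containing characters such as @, / or : will break a connection URI unless it is percent-encoded. As a minimal sketch, assuming you assemble DATA_SOURCE_NAME in a deployment script, the hypothetical helper build_dsn below encodes the credentials with the standard library before the URI is handed to the exporter.

```python
from urllib.parse import quote


def build_dsn(user: str, password: str, host: str, port: int = 5432,
              database: str = "postgres", sslmode: str = "verify-full") -> str:
    """Return a postgresql:// URI with user and password percent-encoded."""
    # safe="" makes quote() also encode "/", which it leaves alone by default
    # and which would otherwise change how the URI is parsed.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"postgresql://{credentials}@{host}:{port}/{database}?sslmode={sslmode}"


if __name__ == "__main__":
    # Placeholder host and password, as in the Docker command above.
    print(build_dsn("monitor", "p@ss/w:rd#1", "your_pg_host"))
```

The default of verify-full here is a design choice that follows the production recommendation above; pass sslmode="disable" only for local testing.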

Configuring Prometheus to Scrape PostgreSQL Metrics

Edit your Prometheus configuration file (typically prometheus.yml) to include a scrape job for the PostgreSQL exporter.

scrape_configs:
  - job_name: 'postgres'
    static_configs:
      - targets: ['your_exporter_host:9187'] # Replace with your exporter's host and port
    metrics_path: /metrics # Default path; the exporter serves every enabled collector here
    # Choose what is collected with the exporter's own
    # --collector.<name> / --no-collector.<name> flags rather than in Prometheus

After updating the configuration, reload or restart your Prometheus server:

# If running Prometheus as a systemd service
sudo systemctl reload prometheus

# Or restart if needed
sudo systemctl restart prometheus

Verify that Prometheus is scraping the PostgreSQL exporter by navigating to its UI (usually http://your_prometheus_host:9090/targets) and checking the status of the ‘postgres’ job.
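
The same check can be scripted against the Prometheus HTTP API, which reports target health at /api/v1/targets. The sketch below is one possible way to do it with the standard library only; the helper names and the host your_prometheus_host are placeholders, and the parsing assumes the usual response shape with data.activeTargets entries carrying labels, health, and lastError.

```python
import json
import urllib.request


def unhealthy_targets(payload: dict, job: str) -> list[tuple[str, str]]:
    """Return (instance, lastError) for targets of `job` whose health is not 'up'."""
    problems = []
    for target in payload.get("data", {}).get("activeTargets", []):
        labels = target.get("labels", {})
        if labels.get("job") != job:
            continue
        if target.get("health") != "up":
            problems.append((labels.get("instance", "?"),
                             target.get("lastError", "")))
    return problems


def fetch_targets(base_url: str) -> dict:
    """Fetch the target list from a Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder address; replace with your Prometheus host.
    payload = fetch_targets("http://your_prometheus_host:9090")
    for instance, error in unhealthy_targets(payload, "postgres"):
        print(f"{instance} is not up: {error}")
```

Keeping the parsing separate from the HTTP call makes the check easy to run against a saved response when the server is not reachable.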

Key PostgreSQL Metrics for Shopify Applications

When monitoring PostgreSQL for a Shopify app, prioritize metrics that indicate performance bottlenecks, resource contention, and potential failures. Here are some critical ones:

pg_stat_activity_count: Number of connections, labeled by database and state. Sustained high counts can indicate connection pool exhaustion or slow queries.
pg_stat_database_numbackends: Number of backends currently connected to each database.
pg_replication_lag_seconds: Replication lag in seconds for standby servers. Crucial for high availability and disaster recovery.
pg_stat_statements_calls: Number of times a statement has been executed. Helps identify frequently run queries. Requires the pg_stat_statements extension.
pg_stat_statements_total_time_seconds: Total time spent executing a statement. Highlights performance-critical queries.
pg_locks_count: Number of locks held, labeled by database and lock mode. High counts point to lock contention and query slowdowns.
pg_database_size_bytes: Size of each database in bytes. Important for capacity planning.
pg_up: 1 if the exporter can reach the PostgreSQL instance, 0 otherwise.

Exact metric names, particularly for pg_stat_statements and replication lag, differ between postgres_exporter versions, so confirm them against your exporter's /metrics output before building dashboards or alerts.
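
Before building dashboards on these names, it can help to confirm that your exporter actually emits them. The following is a minimal sketch, assuming you have saved the exporter's /metrics output as text; metric_names and missing_metrics are hypothetical helpers, and the parser only handles the plain sample lines of the text exposition format.

```python
import re

# A sample line starts with the metric name, followed by "{" or whitespace.
_SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{|\s)")


def metric_names(exposition_text: str) -> set[str]:
    """Collect the metric names that appear on sample lines."""
    names = set()
    for line in exposition_text.splitlines():
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text: str, expected: list[str]) -> list[str]:
    """Return the expected names that are absent, preserving their order."""
    present = metric_names(exposition_text)
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    sample = (
        "# HELP pg_up Whether the last scrape could connect\n"
        "# TYPE pg_up gauge\n"
        "pg_up 1\n"
        'pg_stat_activity_count{datname="your_app_db",state="active"} 3\n'
    )
    print(missing_metrics(sample, ["pg_up", "pg_stat_activity_count",
                                   "pg_locks_count"]))
```

Any name reported as missing is either disabled in the exporter, unavailable on your PostgreSQL version, or named differently in the exporter release you run.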

Setting Up Grafana Dashboards and Alerts

Grafana provides a powerful interface for visualizing PostgreSQL metrics and setting up alerts. You can import pre-built dashboards or create custom ones.

Importing a Dashboard:

Grafana’s dashboard repository (grafana.com/grafana/dashboards/) has excellent PostgreSQL dashboards. Search for “PostgreSQL” and import a highly-rated one (e.g., ID 9628, “PostgreSQL Database”), checking on its page that it is built for postgres_exporter. Ensure your Prometheus data source is configured in Grafana.

Creating Custom Dashboards:

For specific needs, create a new dashboard and add panels. For example, to visualize active connections:

Query:
sum(pg_stat_activity_count{job="postgres", state="active"}) by (datname)

Visualization:
Time series or Stat

Title:
Active Connections per Database

Alerting Rules:

Define alerting rules in Prometheus (in a separate file such as alert.rules.yml, referenced from the rule_files section of prometheus.yml) or directly within Grafana. Here’s an example of a Prometheus alert rule for high replication lag:

groups:
- name: postgresql.rules
  rules:
  - alert: PostgreSQLReplicationLagging
    expr: pg_replication_lag_seconds{job="postgres"} > 60 # Alert if lag is over 60 seconds
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "PostgreSQL replication lag detected on {{ $labels.instance }}"
      description: "Replication lag on {{ $labels.instance }} is {{ $value }} seconds, exceeding the threshold."

Ensure your Prometheus server is configured to send alerts to Alertmanager, which then routes them to your preferred notification channels (Slack, PagerDuty, email, etc.).
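
The for: 5m clause in the rule above means the alert only fires once the expression has been true at every evaluation for five minutes; a single dip below the threshold resets the timer. The sketch below simulates that behavior on a list of samples to help when choosing thresholds. It is an illustration of the semantics only, not Prometheus code, and first_firing_time is a hypothetical name.

```python
from typing import Optional


def first_firing_time(samples: list[tuple[int, float]], threshold: float,
                      for_seconds: int) -> Optional[int]:
    """Return the timestamp at which the alert would start firing, or None.

    `samples` holds (timestamp_seconds, value) pairs, one per rule
    evaluation, in chronological order.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            # The first breach puts the alert into the pending state.
            if pending_since is None:
                pending_since = timestamp
            # It fires once it has been pending for the whole `for` window.
            if timestamp - pending_since >= for_seconds:
                return timestamp
        else:
            # Any evaluation below the threshold resets the pending state.
            pending_since = None
    return None


if __name__ == "__main__":
    # One evaluation per minute; lag in seconds.
    lag = [70, 80, 90, 30, 65, 70, 75, 80, 85, 90]
    samples = [(i * 60, value) for i, value in enumerate(lag)]
    print(first_firing_time(samples, threshold=60, for_seconds=300))
```

In the demo the early breach is cancelled by the dip to 30 seconds, so the alert only fires after the second, sustained breach.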

Monitoring Shopify App Performance with Prometheus Node Exporter

Beyond the database, your Shopify application servers themselves require monitoring. The Prometheus node_exporter is the standard for collecting hardware and OS metrics.

Install node_exporter on each of your application servers. This can be done by downloading the binary or using a package manager.

# Example for 64-bit Linux (e.g., Debian/Ubuntu); check the releases page for the current version
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo rm -rf node_exporter-1.7.0.linux-amd64*

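Create a systemd service file for node_exporter. The unit runs as the prometheus user, so create that account first if it does not exist (for example with sudo useradd --system --no-create-home --shell /usr/sbin/nologin prometheus):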

# /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter \
  --collector.cpu \
  --collector.diskstats \
  --collector.filesystem \
  --collector.meminfo \
  --collector.netdev \
  --collector.stat \
  --collector.time \
  --collector.loadavg \
  --collector.textfile \
  --collector.vmstat

Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Configure Prometheus to scrape these instances:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'app_server_1_ip:9100' # Replace with your app server IPs
          - 'app_server_2_ip:9100'
          # ... add all your app servers
    metrics_path: /metrics

Application-Specific Metrics for Shopify Apps

For a Shopify application, you’ll likely need to instrument your code to expose custom metrics. This could include:

API request latency (broken down by endpoint).
Number of Shopify API calls (and their success/failure rates).
Background job queue lengths and processing times.
Cache hit/miss ratios.
Error rates (e.g., exceptions caught).

You can use Prometheus client libraries for your application’s language (e.g., Python, Ruby, PHP) to expose these metrics via an HTTP endpoint (typically /metrics) on each application server. This endpoint will then be scraped by Prometheus.

Example (Python with Flask and Prometheus client):

import random
import time

from flask import Flask, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest

app = Flask(__name__)

# Define custom metrics
shopify_api_calls = Counter('shopify_api_calls_total', 'Total number of Shopify API calls', ['endpoint', 'method', 'status'])
# The client library appends the +Inf bucket automatically
request_latency = Histogram('shopify_app_request_latency_seconds', 'Shopify app request latency', buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10])
background_job_queue_size = Gauge('shopify_background_job_queue_size', 'Current size of the background job queue')

SHOPIFY_ENDPOINT = '/admin/api/2023-10/products.json'

@app.route('/')
def index():
    # Time the whole request handler
    with request_latency.time():
        # Simulate some work
        time.sleep(random.uniform(0.01, 0.5))
        # Simulate a Shopify API call
        try:
            # Replace with actual Shopify API call logic
            status_code = 200
        except Exception as e:
            status_code = 500
            # Log the error
            print(f"Error calling Shopify API: {e}")
        # Record the call once, whatever the outcome
        shopify_api_calls.labels(endpoint=SHOPIFY_ENDPOINT, method='GET', status=str(status_code)).inc()

    # Simulate background job queue update
    background_job_queue_size.set(random.randint(0, 100))

    return "Hello, Shopify App!"

@app.route('/metrics')
def metrics():
    # Serve the metrics with the content type Prometheus expects
    return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    # Run on a different port if your app runs on 80/443
    app.run(host='0.0.0.0', port=5001)

Add a scrape job for these application metrics in your prometheus.yml:

scrape_configs:
  - job_name: 'shopify_app'
    static_configs:
      - targets:
          - 'app_server_1_ip:5001' # Port your app metrics are exposed on
          - 'app_server_2_ip:5001'
    metrics_path: /metrics

OVH Specific Considerations

When operating on OVH, keep these points in mind:

Networking: Ensure your Prometheus server can reach every exporter and application metrics endpoint, and that the postgres_exporter host can reach PostgreSQL. This might involve configuring OVH Security Groups or Firewall rules to allow traffic on specific ports (e.g., 5432 for PostgreSQL, 9187 for postgres_exporter, 9100 for node_exporter, your app’s metrics port). Restrict these rules to the monitoring hosts rather than opening the ports publicly.
Instance Types: Choose appropriate OVH instance types for your PostgreSQL clusters and application servers based on CPU, RAM, and I/O requirements. Monitoring helps validate these choices.
Managed Databases: If you opt for OVH’s managed PostgreSQL services, the monitoring approach might differ. You’ll need to check what metrics are exposed by OVH’s managed service and if they integrate with Prometheus or require a different tool. Often, you can still deploy exporters within your application’s network space to monitor the managed endpoint.
High Availability: For PostgreSQL, implement streaming replication and monitor replication lag closely. Ensure your monitoring setup can detect failover events and alert on them.
Cost: Be mindful of data transfer costs between OVH regions or out to the internet if your monitoring infrastructure is external.
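
For the networking point above, a quick TCP check run from the Prometheus host can tell you whether a firewall rule is blocking a scrape before you start debugging the exporter itself. This is a minimal sketch using only the standard library; port_is_open is a hypothetical helper and the host names are placeholders.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder hosts; the ports match the ones used in this article.
    checks = [
        ("your_exporter_host", 9187),  # postgres_exporter
        ("app_server_1_ip", 9100),     # node_exporter
        ("app_server_1_ip", 5001),     # application metrics
    ]
    for host, port in checks:
        state = "reachable" if port_is_open(host, port) else "blocked or down"
        print(f"{host}:{port} {state}")
```

A successful connection only shows that the port is reachable; it does not confirm that the exporter behind it is serving valid metrics.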

Conclusion

A comprehensive monitoring strategy using Prometheus and Grafana is crucial for keeping your Shopify application and its PostgreSQL backend healthy and performant on OVH. By focusing on key database and system metrics, and instrumenting your application for custom insights, you can proactively identify and resolve issues before they impact your users. Regularly review your dashboards and alert thresholds to adapt to your application’s evolving needs.