Server Monitoring Best Practices: Keeping Your Shopify App and DynamoDB Clusters Alive on Linode

Establishing a Robust Monitoring Foundation with Prometheus and Grafana

For a high-availability Shopify app hosted on Linode and the DynamoDB tables it depends on, proactive and granular monitoring is essential. We’ll use Prometheus for metrics collection and alerting, and Grafana for visualization. Together they expose application performance, resource utilization, and likely failure points.

Deploying Prometheus on Linode

A dedicated Linode instance is recommended for Prometheus to ensure its stability and performance are not impacted by application workloads. We’ll use Docker for easy deployment and management.

Prometheus Configuration (`prometheus.yml`)

The core of Prometheus configuration is its `prometheus.yml` file, which defines global settings, scrape targets, and the rule files and Alertmanager endpoints used for alerting. For our Shopify app, we’ll scrape metrics from the application servers and from the Linode hosts themselves. DynamoDB is an AWS-managed service, so its metrics come from AWS CloudWatch by way of a Prometheus exporter.

First, let’s set up a basic `prometheus.yml` for scraping our application instances. Assume your Shopify app instances are running on Linode and accessible via IP addresses or hostnames. We’ll also include a scrape job for the Linode host’s node exporter.

global:
  scrape_interval: 15s # How frequently to scrape targets by default.
  evaluation_interval: 15s # How frequently to evaluate rules.

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape Linode host metrics via node_exporter
  - job_name: 'linode_node_exporter'
    static_configs:
      - targets: ['<LINODE_HOST_IP_1>:9100', '<LINODE_HOST_IP_2>:9100'] # Replace with your Linode IPs

  # Scrape Shopify App instances
  - job_name: 'shopify_app'
    static_configs:
      - targets: ['<APP_SERVER_IP_1>:8080', '<APP_SERVER_IP_2>:8080'] # Assuming app metrics are exposed on port 8080
    # If your app uses a different path for metrics, configure it here:
    # metrics_path: /metrics

  # Scrape DynamoDB metrics via a CloudWatch exporter (e.g., aws-cloudwatch-exporter)
  # This requires a separate deployment of the exporter.
  - job_name: 'dynamodb_cloudwatch'
    static_configs:
      - targets: ['<CLOUDWATCH_EXPORTER_IP>:9119'] # Assuming exporter runs on port 9119
    # You'll need to configure the cloudwatch exporter to pull specific DynamoDB metrics.
    # Example metrics to monitor: ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests, TableSizeBytes

Note: Replace placeholders like <LINODE_HOST_IP_1>, <APP_SERVER_IP_1>, and <CLOUDWATCH_EXPORTER_IP> with your actual IP addresses or hostnames. Ensure your application servers expose metrics on the specified port (e.g., 8080) and that the Linode hosts have node_exporter running and accessible on port 9100.
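Each entry in a `targets` list is expected to be a `host:port` pair, so a quick check can catch entries where a placeholder was left empty or a port was mistyped. The following is a minimal sketch, not part of Prometheus itself; the helper names `is_valid_target` and `invalid_targets` are hypothetical.

```python
def is_valid_target(target: str) -> bool:
    """Return True if target looks like 'host:port' with a non-empty host."""
    # Split on the last colon so bracketed IPv6 hosts such as [::1]:9100 work.
    host, sep, port = target.rpartition(":")
    if not sep or not host:
        return False
    return port.isascii() and port.isdigit() and 1 <= int(port) <= 65535


def invalid_targets(targets):
    """Return the entries that would not make usable scrape targets."""
    return [t for t in targets if not is_valid_target(t)]
```

Running `invalid_targets` over the lists you intend to paste into `static_configs` before deploying is a cheap way to spot an unfilled placeholder.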

Deploying Prometheus with Docker

Create a `docker-compose.yml` file to manage the Prometheus container. Mount your `prometheus.yml` configuration file into the container.

version: '3.7'

services:
  prometheus:
    image: prom/prometheus:v2.45.0 # Use a specific, stable version
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    restart: unless-stopped

volumes:
  prometheus_data:

To start Prometheus, navigate to the directory containing `docker-compose.yml` and `prometheus.yml`, and run:

docker-compose up -d
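Once the container is running, Prometheus reports the state of every scrape target through its HTTP API at `/api/v1/targets`. As a rough sketch, assuming Prometheus is reachable at `http://localhost:9090` (adjust for your host), the helpers below list targets that are not healthy. The function names are hypothetical.

```python
import json
import urllib.request


def down_targets(payload: dict) -> list:
    """Return (job, instance, lastError) for each active target not reporting 'up'."""
    results = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            labels = target.get("labels", {})
            results.append(
                (labels.get("job"), labels.get("instance"), target.get("lastError", ""))
            )
    return results


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the targets API response from a running Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

Calling `down_targets(fetch_targets())` after startup should return an empty list once node_exporter, the app, and the CloudWatch exporter are all reachable.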

Instrumenting Your Shopify App for Metrics

Your Shopify app needs to expose metrics that Prometheus can scrape. The specific implementation depends on your app’s language and framework. Here are examples for common stacks.

PHP (using a Prometheus client library)

For a PHP application, you can use a client library like prometheus_client_php. This involves initializing the client and registering custom metrics.

<?php
require_once 'vendor/autoload.php'; // Assuming you've installed promphp/prometheus_client_php via Composer

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\APC;

// Use shared storage (APCu here; requires the APCu extension) so values persist
// across requests. In-memory storage is reset at the end of every PHP request.
$registry = new CollectorRegistry(new APC());

// Expose the metrics endpoint (e.g., /metrics) before any request accounting,
// so that Prometheus scrapes are not counted as application traffic.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH) ?: '/';
if ($path === '/metrics') {
    header('Content-Type: ' . RenderTextFormat::MIME_TYPE);
    $renderer = new RenderTextFormat();
    echo $renderer->render($registry->getMetricFamilySamples());
    exit;
}

// Register custom metrics (getOrRegister* is safe to call on every request)
$requestCounter = $registry->getOrRegisterCounter(
    'shopify_app', // Namespace
    'requests_total', // Metric name
    'Total number of HTTP requests received',
    ['method', 'endpoint'] // Label names
);

$requestDuration = $registry->getOrRegisterHistogram(
    'shopify_app',
    'request_duration_seconds',
    'Duration of HTTP requests in seconds',
    ['method', 'endpoint']
);

// Example: In your request handling logic
$method = $_SERVER['REQUEST_METHOD'];
$endpoint = $path; // Prefer a route template (e.g., /orders/{id}) to keep label cardinality low

$start_time = microtime(true);

// ... your application logic ...

$duration = microtime(true) - $start_time;

// Label values are passed as a list, in the same order as the label names
$requestCounter->inc([$method, $endpoint]);
$requestDuration->observe($duration, [$method, $endpoint]);
?>

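Ensure your web server (e.g., Nginx) is configured to route requests to /metrics to this PHP script and that Prometheus is configured to scrape this endpoint. Because each PHP request starts with a clean process state, metric values must live in shared storage such as APCu or Redis; an in-memory adapter loses its values when the request ends.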

Python (using a Prometheus client library)

For a Python application (e.g., Flask or Django), use the prometheus_client library.

import time

from flask import Flask, g, request
from prometheus_client import Counter, Histogram, start_http_server

# Initialize metrics
REQUEST_COUNT = Counter('shopify_app_requests_total', 'Total number of HTTP requests received', ['method', 'endpoint'])
REQUEST_DURATION = Histogram('shopify_app_request_duration_seconds', 'Duration of HTTP requests in seconds', ['method', 'endpoint'])

app = Flask(__name__)


@app.before_request
def start_timer():
    # Remember when the request started so the duration can be computed later.
    g.start_time = time.perf_counter()


@app.after_request
def record_metrics(response):
    duration = time.perf_counter() - g.start_time
    # Label by route template rather than raw path to keep label cardinality bounded.
    endpoint = request.url_rule.rule if request.url_rule else 'unmatched'
    REQUEST_COUNT.labels(method=request.method, endpoint=endpoint).inc()
    REQUEST_DURATION.labels(method=request.method, endpoint=endpoint).observe(duration)
    return response


@app.route('/api/v1/resource')
def get_resource():
    return {"data": "some_resource"}


if __name__ == '__main__':
    # Serve /metrics on port 8000 from a background thread, then start the app.
    start_http_server(8000)
    print("Metrics exposed on port 8000")
    app.run(host='0.0.0.0', port=8080)

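start_http_server already serves metrics from a background daemon thread, so it can be called once at startup alongside your web framework. Under a multi-process server such as Gunicorn, each worker keeps its own registry, so use the client library’s multiprocess mode instead. Ensure Prometheus is configured to scrape the port where metrics are exposed (e.g., 8000).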
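Before pointing Prometheus at a new endpoint, it can be useful to confirm that the response actually contains the series the dashboards and alerts expect. The sketch below parses the plain-text exposition format just enough to extract sample names; it is a simplification (it ignores escaping inside label values), and the function names are hypothetical.

```python
def metric_names(exposition_text: str) -> set:
    """Collect the sample names found in a Prometheus text-format payload."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments.
        # The sample name ends at the first '{' (labels) or space (value).
        end = len(line)
        for sep in ("{", " "):
            idx = line.find(sep)
            if idx != -1:
                end = min(end, idx)
        names.add(line[:end])
    return names


def missing_metrics(exposition_text: str, expected) -> list:
    """Return the expected sample names that the payload does not contain."""
    return sorted(set(expected) - metric_names(exposition_text))
```

Feeding it the body of a request to the metrics port and the list of names used in your queries shows at a glance whether, for example, an errors counter has not been instrumented yet.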

Monitoring DynamoDB with AWS CloudWatch Exporter

Prometheus does not natively integrate with AWS DynamoDB. The standard approach is to use the AWS CloudWatch Exporter, which queries CloudWatch metrics and exposes them in Prometheus format. This exporter needs to be deployed and configured to pull relevant DynamoDB metrics.

Deploying AWS CloudWatch Exporter

You can deploy the CloudWatch Exporter as a Docker container on a Linode instance. A Linode instance has no IAM instance profile, so supply credentials for an IAM user that has read-only access to CloudWatch.

version: '3.7'

services:
  aws-cloudwatch-exporter:
    image: docker.io/nerdswords/aws-cloudwatch-exporter:latest # Pin a specific version tag in production instead of latest
    container_name: aws-cloudwatch-exporter
    ports:
      - "9119:9119" # Default Prometheus exposition port
    environment:
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
      - AWS_REGION=${AWS_REGION}
    volumes:
      - ./cloudwatch_config.yml:/etc/cloudwatch_exporter/config.yml
    restart: unless-stopped

You’ll need to create a cloudwatch_config.yml file that specifies which CloudWatch metrics to collect. Field names vary between CloudWatch exporter implementations, so match the keys below to the schema documented for the image you deploy. Ensure the IAM user behind these credentials has the cloudwatch:GetMetricStatistics and cloudwatch:ListMetrics permissions.

# cloudwatch_config.yml
discovery:
  region: us-east-1 # Your AWS region
  # Optional: Specify specific EC2 instances if needed, but for DynamoDB, we focus on services.

metrics:
  - namespace: AWS/DynamoDB
    name: ConsumedReadCapacityUnits
    dimensions:
      - name: TableName
        value: your-dynamodb-table-name # Replace with your table name
    statistics:
      - Average
      - Sum
    period: 60 # seconds

  - namespace: AWS/DynamoDB
    name: ConsumedWriteCapacityUnits
    dimensions:
      - name: TableName
        value: your-dynamodb-table-name
    statistics:
      - Average
      - Sum
    period: 60

  - namespace: AWS/DynamoDB
    name: ThrottledRequests
    dimensions:
      - name: TableName
        value: your-dynamodb-table-name
    statistics:
      - Sum
    period: 60

  - namespace: AWS/DynamoDB
    name: TableSizeBytes
    dimensions:
      - name: TableName
        value: your-dynamodb-table-name
    statistics:
      - Average
    period: 300 # Less frequent for size

  # Add more metrics as needed, e.g., for Global Secondary Indexes (GSIs)
  # - namespace: AWS/DynamoDB
  #   name: ConsumedReadCapacityUnits
  #   dimensions:
  #     - name: TableName
  #       value: your-dynamodb-table-name
  #     - name: IndexName
  #       value: your-gsi-name
  #   statistics:
  #     - Average
  #     - Sum
  #   period: 60

Important: Replace your-dynamodb-table-name and your-gsi-name with your actual DynamoDB table and index names. To monitor multiple tables or indexes, repeat the metric configuration block with a different value for TableName or IndexName.
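Repeating the same block by hand for every table is error-prone, so the configuration can be generated instead. The sketch below mirrors the key names used in the `cloudwatch_config.yml` shown in this section; the table names, the `SPECS` list, and the helper names are hypothetical, and the keys should be adapted if your exporter expects a different schema.

```python
import json

# (metric name, statistics, period in seconds), matching the example config.
SPECS = [
    ("ConsumedReadCapacityUnits", ["Average", "Sum"], 60),
    ("ConsumedWriteCapacityUnits", ["Average", "Sum"], 60),
    ("ThrottledRequests", ["Sum"], 60),
]


def build_metric_blocks(tables, metric_specs):
    """Return one metric block per (table, metric) combination."""
    blocks = []
    for table in tables:
        for name, statistics, period in metric_specs:
            blocks.append({
                "namespace": "AWS/DynamoDB",
                "name": name,
                "dimensions": [{"name": "TableName", "value": table}],
                "statistics": list(statistics),
                "period": period,
            })
    return blocks


def render_config(tables, region="us-east-1"):
    """Serialize the config as JSON, which most YAML parsers also accept."""
    config = {
        "discovery": {"region": region},
        "metrics": build_metric_blocks(tables, SPECS),
    }
    return json.dumps(config, indent=2)
```

For example, `render_config(["orders", "sessions"])` yields six metric blocks, three per table, which can be written to the config file in place of the hand-maintained list.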

To run the exporter:

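# Create a .env file for AWS credentials (the first redirect overwrites any existing file)
echo "AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY" > .env
echo "AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY" >> .env
echo "AWS_REGION=us-east-1" >> .env
chmod 600 .env # Restrict access and keep this file out of version control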

# Then run docker-compose
docker-compose up -d

Configuring Grafana for Visualization

Grafana provides a user-friendly interface to visualize the metrics collected by Prometheus. We’ll add Prometheus as a data source and create dashboards.

Adding Prometheus Data Source in Grafana

1. **Install Grafana:** Deploy Grafana on a Linode instance, typically using Docker.

version: '3.7'

services:
  grafana:
    image: grafana/grafana:10.2.1 # Use a specific version
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

volumes:
  grafana_data:

2. **Access Grafana:** Open your browser to http://<GRAFANA_SERVER_IP>:3000. Log in with default credentials (admin/admin) and change the password.

3. **Add Data Source:** In Grafana 10, navigate to Connections -> Data sources -> Add data source (older versions place this under the Configuration gear icon). Select “Prometheus”.

4. **Configure Prometheus URL:** Enter the URL of your Prometheus instance, typically http://<PROMETHEUS_SERVER_IP>:9090. If Grafana and Prometheus are on the same Docker network, you might use the service name: http://prometheus:9090.

5. **Save & Test:** Click “Save & Test”. Grafana should confirm that it can query the Prometheus API.
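The same data source can also be created through Grafana’s HTTP API (`POST /api/datasources`), which is handy when the setup has to be repeatable. The following is a sketch under assumptions: Grafana at `http://localhost:3000`, a service account token you have created yourself, and hypothetical helper names. It only builds the request; sending it is left to the caller.

```python
import json
import urllib.request


def datasource_payload(prometheus_url: str, name: str = "Prometheus") -> dict:
    """Build the JSON body describing a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend performs the queries.
        "isDefault": True,
    }


def build_request(grafana_url: str, token: str, prometheus_url: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that creates the data source."""
    body = json.dumps(datasource_payload(prometheus_url)).encode("utf-8")
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Passing the result of `build_request("http://localhost:3000", token, "http://prometheus:9090")` to `urllib.request.urlopen` would submit it; the manual steps above remain the simplest route for a one-off setup.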

Creating Dashboards

You can create custom dashboards or import pre-built ones from Grafana’s dashboard repository. Here are key metrics to visualize:

Application Performance: Request rate (shopify_app_requests_total), request duration (shopify_app_request_duration_seconds), error rates (if you instrument them).
Resource Utilization: CPU usage, memory usage, disk I/O (from node_exporter).
DynamoDB Performance: Consumed Read/Write Capacity Units (aws_dynamodb_consumed_read_capacity_units, aws_dynamodb_consumed_write_capacity_units), throttled requests (aws_dynamodb_throttled_requests), table size (aws_dynamodb_table_size_bytes). The exported series carry a suffix for the configured CloudWatch statistic, such as _sum or _average.
System Health: Network traffic, open file descriptors.

Example Grafana Query for DynamoDB Throttled Requests (the exporter exposes each CloudWatch statistic as a gauge, so no rate() is applied):

sum(aws_dynamodb_throttled_requests_sum{job="dynamodb_cloudwatch", TableName="your-dynamodb-table-name"}) by (TableName)

Example Grafana Query for Application Request Latency (95th Percentile):

histogram_quantile(0.95, sum(rate(shopify_app_request_duration_seconds_bucket{job="shopify_app"}[5m])) by (le, endpoint, method))
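To make the latency query less of a black box, here is a simplified sketch of how a quantile is estimated from cumulative histogram buckets: find the bucket that contains the target rank, then interpolate linearly inside it. It assumes positive bucket bounds and a `+Inf` bucket, and it is an illustration rather than Prometheus’s actual implementation.

```python
import math


def histogram_quantile(q: float, buckets) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs."""
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]  # Target position among all observations.
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket.
                return buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan
```

With buckets `[(0.5, 50), (1.0, 90), (2.5, 100), (inf, 100)]`, the 95th percentile falls halfway through the 1.0 to 2.5 bucket, giving 1.75 seconds. The estimate can only be as precise as the bucket boundaries, which is why bucket bounds should bracket your latency targets.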

Setting Up Alerting with Prometheus Alertmanager

Alerting is crucial for proactive issue resolution. Prometheus Alertmanager handles alerts generated by Prometheus, deduplicates them, groups them, and routes them to the correct receivers (e.g., Slack, PagerDuty).

Alerting Rules (`alert.rules.yml`)

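Define alerting rules in a separate YAML file. Each rule is a PromQL expression that Prometheus evaluates at every evaluation interval. An alert becomes active for every series the expression returns, and it fires once the expression has kept returning that series for the duration given in `for`. An empty result means the alert is inactive.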

groups:
- name: shopify_app_alerts
  rules:
  - alert: HighRequestLatency
    expr: histogram_quantile(0.95, sum(rate(shopify_app_request_duration_seconds_bucket{job="shopify_app"}[5m])) by (le, endpoint, method)) > 2 # Alert if 95th percentile latency exceeds 2 seconds
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High request latency detected for {{ $labels.endpoint }}"
      description: 'The 95th percentile request latency for {{ $labels.method }} {{ $labels.endpoint }} has been above 2s for 5 minutes. Current value: {{ $value | printf "%.2f" }}s'

  - alert: HighErrorRate
    # Assuming you have an 'errors_total' counter metric
    expr: |
      sum(rate(shopify_app_errors_total{job="shopify_app"}[5m])) by (endpoint)
      /
      sum(rate(shopify_app_requests_total{job="shopify_app"}[5m])) by (endpoint)
      > 0.05 # Alert if error rate exceeds 5%
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate for {{ $labels.endpoint }}"
      description: 'The error rate for {{ $labels.endpoint }} has exceeded 5% for 5 minutes. Current value: {{ $value | humanizePercentage }}'

- name: dynamodb_alerts
  rules:
  - alert: DynamoDBThrottledRequests
    expr: sum(aws_dynamodb_throttled_requests_sum{job="dynamodb_cloudwatch"}) by (TableName) > 0 # Exporter metrics are gauges, so no rate() is applied
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "DynamoDB throttled requests detected for {{ $labels.TableName }}"
      description: "Throttled requests for DynamoDB table '{{ $labels.TableName }}' have been detected for 10 minutes. This indicates potential capacity issues."

  - alert: DynamoDBHighReadCapacityUtilization
    # CloudWatch statistics are exported as gauges, so rate() does not apply.
    # The Sum over a 60s period divided by 60 gives consumed units per second.
    expr: |
      sum(aws_dynamodb_consumed_read_capacity_units_sum{job="dynamodb_cloudwatch"}) by (TableName) / 60
      /
      avg(aws_dynamodb_provisioned_read_capacity_units_average{job="dynamodb_cloudwatch"}) by (TableName)
      > 0.8 # Alert if read capacity utilization exceeds 80%
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High DynamoDB read capacity utilization for {{ $labels.TableName }}"
      description: "DynamoDB table '{{ $labels.TableName }}' is experiencing over 80% read capacity utilization for 15 minutes. Consider scaling up provisioned capacity."

  - alert: DynamoDBHighWriteCapacityUtilization
    expr: |
      sum(aws_dynamodb_consumed_write_capacity_units_sum{job="dynamodb_cloudwatch"}) by (TableName) / 60
      /
      avg(aws_dynamodb_provisioned_write_capacity_units_average{job="dynamodb_cloudwatch"}) by (TableName)
      > 0.8 # Alert if write capacity utilization exceeds 80%
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High DynamoDB write capacity utilization for {{ $labels.TableName }}"
      description: "DynamoDB table '{{ $labels.TableName }}' is experiencing over 80% write capacity utilization for 15 minutes. Consider scaling up provisioned capacity."

Note: The utilization rules assume a 60-second period for the consumed-capacity metrics. For the provisioned-capacity series, add ProvisionedReadCapacityUnits and ProvisionedWriteCapacityUnits with the Average statistic to cloudwatch_config.yml, and adjust the divisor if you change the period.
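The `for:` clause in these rules is what separates a brief spike from a sustained problem. As a simplified sketch of that behaviour for a single series (the function name is hypothetical, and real Prometheus tracks this per label set), the state at each evaluation can be modelled like this:

```python
def alert_states(results, interval_s: int, for_s: int) -> list:
    """Map per-evaluation booleans to 'inactive', 'pending', or 'firing'."""
    states = []
    active_since = None  # Time at which the expression first became true.
    for step, is_true in enumerate(results):
        now = step * interval_s
        if not is_true:
            active_since = None  # Any gap resets the pending timer.
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states
```

With a 15-second evaluation interval and `for: 60s`, five consecutive true results yield four `pending` states followed by `firing`, and a single false result drops the alert back to `inactive`. This is why a long `for:` value trades detection speed for fewer false alarms.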

Configuring Alertmanager

Deploy Alertmanager similarly to Prometheus, using Docker. Its configuration file (`alertmanager.yml`) defines notification receivers and routing rules.

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications' # Default receiver

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL>' # Replace with your Slack webhook URL
        channel: '#alerts' # Replace with your alert channel
        send_resolved: true
        title: '[{{ .Status | toUpper }}{{ if .CommonLabels.severity }} - {{ .CommonLabels.severity | toUpper }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}'
        text: |-
          {{ range .Alerts }}*Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} - {{ .Name }}: {{ .Value }}
          {{ end }}
          {{ end }}

# Example for PagerDuty (uncomment and configure if needed)
#  - name: 'pagerduty-notifications'
#    pagerduty_configs:
#      - service_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY>'
#        send_resolved: true
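The `group_by` setting decides which alerts are bundled into one notification. The sketch below models that grouping in a simplified way, treating a missing label as an empty value; the alert dictionaries and the function name are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (dicts of labels) by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # A label the alert does not carry contributes an empty value.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)
```

For alerts that carry no `cluster` or `service` label, as with the rules in this guide, grouping by `['alertname', 'cluster', 'service']` behaves the same as grouping by `alertname` alone: two `HighRequestLatency` alerts for different endpoints arrive in a single Slack message.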

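Update your prometheus.yml to load the alerting rules and point to Alertmanager:

# ... (previous prometheus.yml content) ...

rule_files:
  - /etc/prometheus/alert.rules.yml # Mount alert.rules.yml into the Prometheus container at this path

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['<ALERTMANAGER_IP>:9093'] # Or 'alertmanager:9093' if on same Docker network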

Ensure your Prometheus and Alertmanager Docker configurations include volumes for their respective configuration files and data. Restart Prometheus and Alertmanager after updating configurations.

Linode Specific Considerations

When running on Linode, pay attention to:

Network Security: Configure Linode’s firewall (or use `ufw`) to only allow necessary ports (e.g., 9090 for Prometheus, 9093 for Alertmanager, 3000 for Grafana, 9100 for node_exporter, 9119 for CloudWatch exporter) from trusted IP ranges.
Resource Allocation: Ensure your Linode instances hosting Prometheus, Grafana, and the CloudWatch exporter have sufficient CPU, RAM, and disk I/O. Prometheus can become resource-intensive with many targets and long retention periods.
High Availability: For critical production environments, consider deploying Prometheus and Alertmanager in a highly available setup (e.g., multiple Prometheus instances scraping the same targets, federated Prometheus, or using Thanos/Cortex).
Log Management: Centralize logs from your application servers and monitoring components using a log aggregation system (e.g., ELK stack, Loki) for easier debugging.
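For the resource allocation point, the Prometheus documentation gives a rule of thumb for disk sizing: retention time in seconds, multiplied by ingested samples per second, multiplied by bytes per sample (roughly 1 to 2 bytes after compression). The sketch below applies that formula; the function name and the example figures are hypothetical.

```python
def estimate_disk_bytes(active_series: int, scrape_interval_s: float,
                        retention_days: float, bytes_per_sample: float = 2.0) -> float:
    """Rough Prometheus disk estimate: retention * ingestion rate * sample size."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample
```

For instance, 100,000 active series scraped every 15 seconds and kept for 15 days at 2 bytes per sample come to about 17.3 GB, so the Linode volume backing `prometheus_data` should be sized with headroom above that.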

By implementing this monitoring stack, you gain visibility into your Shopify app on Linode and the DynamoDB tables behind it, so you can detect issues early, optimize performance, and maintain high availability.