Server Monitoring Best Practices: Keeping Your Shopify App and Elasticsearch Clusters Alive on DigitalOcean

Proactive Elasticsearch Health Checks with `curl` and `jq`

Maintaining the health of an Elasticsearch cluster, especially one supporting a critical Shopify app, requires more than just basic CPU and memory monitoring. Elasticsearch has its own internal metrics that, when tracked, can predict and prevent issues before they impact your application. We’ll leverage the Elasticsearch REST API, accessible via curl, and process its JSON output with jq for actionable insights.

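A fundamental check is the cluster health API. It gives a high-level overview of the cluster’s status (green, yellow, red), the number of nodes, and shard allocation. Yellow means every primary shard is assigned but at least one replica is not; red means at least one primary shard is unassigned, so some data is unavailable. Red needs immediate attention, and a yellow status that persists should be investigated.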

Cluster Health Status

To get the cluster health, execute the following command. Replace ELASTICSEARCH_HOST with your Elasticsearch endpoint (e.g., elasticsearch.yourdomain.com or a DigitalOcean Droplet IP).

curl -X GET "http://ELASTICSEARCH_HOST:9200/_cluster/health?pretty"

To programmatically check for a healthy cluster status (i.e., ‘green’), we can pipe the output to jq. This is ideal for scripting and automated alerts.

curl -s "http://ELASTICSEARCH_HOST:9200/_cluster/health" | jq -r '.status'

This command will output green, yellow, or red. In a monitoring script, you would check if the output is not equal to green and trigger an alert.
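
The check described above could be sketched in Python roughly as follows. This is a minimal illustration using only the standard library, not a definitive implementation: ES_URL is the same placeholder host used in the curl examples, and the function names and message format are assumptions made for this example.

```python
import json
import urllib.request
from typing import Optional

# Placeholder endpoint, as in the curl examples; replace with your own host.
ES_URL = "http://ELASTICSEARCH_HOST:9200"


def fetch_cluster_health(base_url: str = ES_URL, timeout: float = 5.0) -> dict:
    """Call GET /_cluster/health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def health_alert(health: dict) -> Optional[str]:
    """Return an alert message for a non-green cluster, or None when it is green."""
    status = health.get("status", "unknown")
    if status == "green":
        return None
    unassigned = health.get("unassigned_shards", 0)
    # Red means a primary shard is missing, so treat it as the more severe case.
    severity = "critical" if status == "red" else "warning"
    return f"[{severity}] cluster status is {status}, {unassigned} unassigned shard(s)"
```

A cron job or monitoring loop would call health_alert(fetch_cluster_health()) and forward any non-None message to its alerting channel.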

Node Statistics and Shard Allocation

Beyond cluster health, understanding individual node status and shard allocation is crucial. Unassigned replica shards turn the cluster yellow, and unassigned primary shards turn it red; either way, data is not replicated or available as expected. The _cat/shards API shows which node each shard lives on and what state it is in.

curl -X GET "http://ELASTICSEARCH_HOST:9200/_cat/shards?v"

To specifically identify unassigned shards, we can filter this output on the state column, which is the fourth field in the default column order (index, shard, prirep, state, and so on). Common causes of unassigned shards are insufficient disk space on nodes and misconfigured shard allocation rules.

curl -s "http://ELASTICSEARCH_HOST:9200/_cat/shards" | awk '$4 == "UNASSIGNED" {print $0}'

If this command returns any lines, it signifies unassigned shards. Further investigation into node disk usage and Elasticsearch logs would be necessary.
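
For scripted checks, the same filter can be applied to the JSON form of the cat API, which avoids depending on column positions. The sketch below assumes the format=json and h= query parameters of _cat/shards with the index, shard, prirep, state, and unassigned.reason columns; the function names are illustrative only.

```python
import json
import urllib.request

# Placeholder endpoint, as in the curl examples; replace with your own host.
ES_URL = "http://ELASTICSEARCH_HOST:9200"
CAT_SHARDS_PATH = "/_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason"


def fetch_shards(base_url: str = ES_URL, timeout: float = 5.0) -> list:
    """Return one dict per shard copy, keyed by the requested column names."""
    with urllib.request.urlopen(base_url + CAT_SHARDS_PATH, timeout=timeout) as resp:
        return json.load(resp)


def unassigned_shards(rows: list) -> list:
    """Keep only UNASSIGNED rows, listing primaries before replicas."""
    found = [row for row in rows if row.get("state") == "UNASSIGNED"]
    # False sorts before True, so rows with prirep == "p" come first.
    return sorted(found, key=lambda row: row.get("prirep") != "p")
```

Each returned row carries the unassigned.reason value reported by Elasticsearch, which is a useful starting point before digging into disk usage and logs.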

Disk Usage Monitoring

Elasticsearch nodes are sensitive to disk space. When disks fill up, indexing and searching performance degrades, and shards can become unassigned. The _nodes/stats/fs API provides detailed filesystem statistics for each node.

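curl -s "http://ELASTICSEARCH_HOST:9200/_nodes/stats/fs" | jq '.nodes | to_entries[] | {node: .value.name, total_disk: (.value.fs.data[0].total_in_bytes / (1024*1024*1024)), free_disk: (.value.fs.data[0].free_in_bytes / (1024*1024*1024)), usable_disk: (.value.fs.data[0].available_in_bytes / (1024*1024*1024))}'

This jq query extracts the node name plus the total, free, and usable disk space in GiB for the node’s first data path. The API reports sizes in the total_in_bytes, free_in_bytes, and available_in_bytes fields, and the available figure is the space the Elasticsearch process can actually write to. You’d typically set thresholds (e.g., alert if usable_disk is below 20% of total_disk) in your monitoring system. That leaves a margin before Elasticsearch’s default disk watermarks take effect: at 85% used it stops allocating new shards to the node, at 90% it starts relocating shards away, and at 95% it blocks writes to the affected indices.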
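
The threshold check itself might look like the sketch below. It works on the decoded JSON of _nodes/stats/fs and, as an assumption of this example, uses the per-node fs.total aggregate (total_in_bytes and available_in_bytes) rather than a single data path. The 20% default mirrors the example threshold above, and the function name is hypothetical.

```python
GIB = 1024 ** 3


def low_disk_nodes(stats: dict, min_available_pct: float = 20.0) -> list:
    """Return nodes whose available disk space is below min_available_pct of total."""
    alerts = []
    for node in stats.get("nodes", {}).values():
        totals = node.get("fs", {}).get("total", {})
        total = totals.get("total_in_bytes", 0)
        available = totals.get("available_in_bytes", 0)
        if total <= 0:
            continue  # no filesystem data reported for this node
        pct = available / total * 100
        if pct < min_available_pct:
            alerts.append({
                "node": node.get("name"),
                "available_gib": round(available / GIB, 1),
                "available_pct": round(pct, 1),
            })
    return alerts
```

Feeding it the parsed response body returns an empty list while every node has enough headroom.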

Shopify App Performance Metrics and Alerting

Your Shopify app’s performance is directly tied to the responsiveness of your Elasticsearch cluster. Monitoring key application-level metrics and setting up timely alerts is paramount. We’ll focus on common indicators of application health and how to instrument them.

Request Latency and Error Rates

For a PHP-based Shopify app, integrating application performance monitoring (APM) is crucial. Tools like New Relic, Datadog, or even custom Prometheus exporters can capture these metrics. If you’re not using a full APM solution, you can implement basic timing and error logging within your PHP application.

Here’s a simplified example of how you might measure the latency of an Elasticsearch query within a PHP script and log errors:

<?php
// Assume $elasticsearchClient is an initialized Elasticsearch client object
$startTime = microtime(true);
$error = null;
$response = null;

try {
    // Example: Searching for products
    $params = [
        'index' => 'products',
        'body'  => [
            'query' => [
                'match' => ['title' => 'Awesome T-Shirt']
            ]
        ]
    ];
    $response = $elasticsearchClient->search($params);
    $endTime = microtime(true);
    $latency = ($endTime - $startTime) * 1000; // Latency in milliseconds

    // Log successful request metrics
    error_log(sprintf("ES_SEARCH_SUCCESS: index=products, query_time_ms=%f", $latency));

    // Check for Elasticsearch-level errors in the response if applicable
    if (isset($response['error'])) {
        $error = $response['error']['type'] . ': ' . $response['error']['reason'];
        error_log(sprintf("ES_SEARCH_ERROR_IN_RESPONSE: %s", $error));
        // Trigger application-level alert for ES error
        trigger_error("Elasticsearch query returned an error: " . $error, E_USER_WARNING);
    }

} catch (\Exception $e) {
    $endTime = microtime(true);
    $latency = ($endTime - $startTime) * 1000; // Latency in milliseconds
    $error = $e->getMessage();
    error_log(sprintf("ES_SEARCH_EXCEPTION: query_time_ms=%f, error=%s", $latency, $error));
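    // Trigger application-level alert for exception.
    // E_USER_WARNING rather than E_USER_ERROR: with the default error handler,
    // E_USER_ERROR halts the script, so the checks below would never run.
    trigger_error("Elasticsearch query failed: " . $error, E_USER_WARNING);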
}

// Example of how to use the latency and error for alerting
if ($latency > 5000) { // Alert if query takes longer than 5 seconds
    error_log(sprintf("ES_SEARCH_HIGH_LATENCY: query_time_ms=%f", $latency));
    // Send alert to monitoring system (e.g., PagerDuty, Slack)
}

if ($error) {
    // Send alert for specific errors
}
?>

These logs can be aggregated by a log management system (like ELK stack, Splunk, or Datadog Logs) and used to create dashboards and alerts. For instance, an alert could be triggered if the average search latency exceeds a threshold for a sustained period, or if the rate of ES_SEARCH_EXCEPTION logs spikes.
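
If no log management system is in place yet, the same aggregation can be approximated with a small script. The sketch below assumes log lines in the ES_SEARCH_SUCCESS and ES_SEARCH_EXCEPTION formats written by the PHP example; the function name and the shape of the summary are illustrative choices, not an established tool.

```python
import re
from statistics import mean

# Matches the two log prefixes from the PHP example and captures query_time_ms.
LOG_RE = re.compile(r"(ES_SEARCH_SUCCESS|ES_SEARCH_EXCEPTION):.*?query_time_ms=([0-9.]+)")


def summarize(lines) -> dict:
    """Compute request count, average successful latency, and exception rate."""
    success_latencies = []
    exceptions = 0
    for line in lines:
        match = LOG_RE.search(line)
        if not match:
            continue  # unrelated log line
        if match.group(1) == "ES_SEARCH_EXCEPTION":
            exceptions += 1
        else:
            success_latencies.append(float(match.group(2)))
    total = len(success_latencies) + exceptions
    return {
        "requests": total,
        "avg_latency_ms": mean(success_latencies) if success_latencies else 0.0,
        "exception_rate": exceptions / total if total else 0.0,
    }
```

Running it over the last few minutes of log lines gives the two numbers the alert conditions above are based on: average search latency and the share of requests that ended in an exception.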

Queue Depth and Background Jobs

Many Shopify app functionalities, such as data synchronization, order processing, or report generation, rely on background job queues. If these queues back up, it indicates that your workers are not keeping pace with the incoming tasks, which can lead to stale data or delayed processing. For a PHP application, common queueing systems include Redis (with libraries like Predis or PhpRedis) or RabbitMQ.

Monitoring the size of your Redis queues is a good starting point. You can use the LLEN command to get the length of a list (which often represents a queue).

# Connect to Redis CLI
redis-cli

# Check the length of a specific queue (e.g., 'product_sync_queue')
LLEN product_sync_queue

In a monitoring script (e.g., a Python script running on a DigitalOcean Droplet), you would connect to Redis and periodically check queue lengths. If a queue length exceeds a predefined threshold (e.g., 1000 jobs), an alert should be triggered.

import redis
import time
import os

REDIS_HOST = os.environ.get('REDIS_HOST', 'localhost')
REDIS_PORT = int(os.environ.get('REDIS_PORT', 6379))
QUEUE_NAME = 'product_sync_queue'
QUEUE_THRESHOLD = 1000
ALERT_INTERVAL_SECONDS = 300 # Only alert once every 5 minutes

last_alert_time = 0

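try:
    # redis.Redis is the current client class; StrictRedis is kept only as a legacy alias
    r = redis.Redis(host=REDIS_HOST, port=REDIS_PORT, db=0, decode_responses=True)
    r.ping() # Check connection
    print(f"Successfully connected to Redis at {REDIS_HOST}:{REDIS_PORT}")
except redis.exceptions.ConnectionError as e:
    print(f"Error connecting to Redis: {e}")
    raise SystemExit(1) # Exit with a non-zero status so a supervisor can restart the script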

while True:
    try:
        queue_length = r.llen(QUEUE_NAME)
        print(f"Queue '{QUEUE_NAME}' length: {queue_length}")

        if queue_length > QUEUE_THRESHOLD:
            current_time = time.time()
            if current_time - last_alert_time > ALERT_INTERVAL_SECONDS:
                print(f"ALERT: Queue '{QUEUE_NAME}' has {queue_length} jobs, exceeding threshold of {QUEUE_THRESHOLD}.")
                # Here you would integrate with your alerting system (e.g., PagerDuty API, Slack webhook)
                # send_alert_to_slack(f"High queue depth for {QUEUE_NAME}: {queue_length} jobs")
                last_alert_time = current_time
        else:
            # Optionally reset last_alert_time if queue is healthy, to ensure immediate alerts if it grows again
            pass

    except redis.exceptions.RedisError as e:
        print(f"Redis error: {e}")
        # Potentially trigger an alert for Redis connectivity issues

    time.sleep(60) # Check every minute
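
The script leaves the "reset when healthy" branch as a comment. One way to fill it in is to pull the cooldown logic into a small helper, sketched below. The class name AlertThrottle and its interface are hypothetical; the clock is injectable only so the behavior can be exercised without waiting.

```python
import time


class AlertThrottle:
    """Allow one alert per interval while breached; re-arm as soon as the queue recovers."""

    def __init__(self, interval_seconds: float, clock=time.time):
        self.interval = interval_seconds
        self.clock = clock
        self.last_alert = None  # None means no alert is currently being suppressed

    def should_alert(self, breached: bool) -> bool:
        if not breached:
            # Queue is healthy again: forget the last alert so the next breach fires at once.
            self.last_alert = None
            return False
        now = self.clock()
        if self.last_alert is None or now - self.last_alert >= self.interval:
            self.last_alert = now
            return True
        return False
```

In the monitoring loop, the nested time comparison would become a single call such as throttle.should_alert(queue_length > QUEUE_THRESHOLD), with throttle created once as AlertThrottle(ALERT_INTERVAL_SECONDS).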

DigitalOcean Infrastructure Monitoring and Maintenance

While Elasticsearch and your application have their own health metrics, the underlying DigitalOcean infrastructure is the foundation. Proactive monitoring and maintenance of your Droplets and managed databases are essential for stability.

Droplet Resource Utilization

DigitalOcean provides basic metrics for Droplets (CPU, Memory, Disk I/O, Network). However, for deeper insights and automated alerting, it’s best to deploy a dedicated monitoring agent. node_exporter for Prometheus is a popular choice, providing detailed system-level metrics.

First, install Prometheus node_exporter on your Elasticsearch and application Droplets. On a Debian/Ubuntu system:

wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
sudo mv node_exporter /usr/local/bin/
sudo useradd -rs /bin/false node_exporter
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

Once running, node_exporter will expose metrics on port 9100. You would then configure your Prometheus server to scrape these targets. Key metrics to monitor include:

node_cpu_seconds_total: A per-CPU, per-mode counter of CPU time. Take rate() over it and watch user, system, and iowait time; high sustained usage can indicate an overloaded system.
node_memory_MemAvailable_bytes: Track available memory. Low available memory can lead to swapping and performance degradation.
node_disk_io_time_seconds_total: Cumulative seconds each disk has spent servicing I/O. Its rate() approximates device utilization, and values close to 1 suggest a disk bottleneck.
node_network_receive_errs_total and node_network_transmit_errs_total: Network errors can indicate connectivity issues or faulty network hardware/configuration.
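
To make the metric names above more concrete, the sketch below parses the plain-text output that node_exporter serves at /metrics and derives the percentage of memory still available. It is a deliberately small parser that handles only unlabelled samples, and it assumes node_memory_MemTotal_bytes is exposed alongside node_memory_MemAvailable_bytes. In practice Prometheus does this scraping and arithmetic for you.

```python
def parse_metrics(text: str) -> dict:
    """Parse unlabelled 'name value' samples from Prometheus text exposition output."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and labelled series such as per-CPU counters.
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            try:
                samples[parts[0]] = float(parts[1])
            except ValueError:
                continue
    return samples


def memory_available_pct(samples: dict) -> float:
    """Available memory as a percentage of total memory."""
    total = samples["node_memory_MemTotal_bytes"]
    return samples["node_memory_MemAvailable_bytes"] / total * 100
```

The equivalent PromQL expression would divide the two gauges directly, which is what an alerting rule for low memory would normally use.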

Alerting rules are defined in Prometheus rule files and evaluated by Prometheus itself; Alertmanager then handles grouping, routing, and delivery of the resulting notifications. For example, an alert for high CPU usage on an Elasticsearch node might look like this:

groups:
- name: host_alerts
  rules:
  - alert: HighCpuUsage
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "Instance {{ $labels.instance }} has been running with CPU usage above 85% for the last 10 minutes."
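
The arithmetic behind that expression can be worked through by hand. The sketch below is a simplified stand-in for what rate() and avg by (instance) compute: it takes the idle counter for each CPU at the start and end of a window, ignores counter resets and extrapolation, and returns the resulting usage percentage. The function and argument names are assumptions for illustration.

```python
def cpu_usage_pct(idle_start: dict, idle_end: dict, elapsed_seconds: float) -> float:
    """Approximate 100 - avg(rate(idle)) * 100 from two samples per CPU."""
    # Per-CPU idle rate: seconds spent idle per second of wall-clock time (0.0 to 1.0).
    rates = [
        (idle_end[cpu] - idle_start[cpu]) / elapsed_seconds
        for cpu in idle_start
    ]
    avg_idle = sum(rates) / len(rates)
    return 100 - avg_idle * 100


# Two CPUs over a 5-minute (300 s) window: idle for 15 s and 45 s respectively.
usage = cpu_usage_pct(
    {"cpu0": 1000.0, "cpu1": 2000.0},
    {"cpu0": 1015.0, "cpu1": 2045.0},
    300.0,
)
```

Here the idle rates are 0.05 and 0.15, the average is 0.10, and usage comes out at roughly 90%, which is above the 85 threshold; the rule would fire once that condition had held for the full 10 minutes.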

DigitalOcean Managed Databases

If you’re using DigitalOcean’s Managed Databases (e.g., for PostgreSQL or Redis), leverage their built-in monitoring and alerting features. These services abstract away much of the infrastructure management, but you still need to monitor application-specific performance.

Key metrics for managed databases include:

Connection Count: High connection counts can exhaust database resources.
Query Performance: Monitor slow queries and overall query throughput.
Replication Lag: If using read replicas, monitor the lag between the primary and replicas.
Disk Usage: Ensure the database disk doesn’t fill up.
CPU/Memory Usage: While managed by DO, high utilization can still impact performance.

DigitalOcean’s control panel provides dashboards for these metrics. You can also set up alerts directly within the DO control panel for thresholds like disk usage exceeding 80% or replication lag surpassing a few seconds.

Automated Backups and Disaster Recovery

Regular, automated backups are non-negotiable. For Elasticsearch, consider snapshotting to external S3-compatible object storage such as DigitalOcean Spaces. For your application’s database (if separate from Elasticsearch), ensure regular backups are configured and that restores are tested.

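Elasticsearch snapshots can be configured via the Snapshot API. You’ll need to register a repository first. Current Elasticsearch versions do not accept access_key and secret_key in the repository settings: credentials are secure settings that belong in the Elasticsearch keystore on each node, and the endpoint for S3-compatible storage is an S3 client setting (as is the region, where your version supports it).

# On every node: store the S3 credentials in the keystore, not in the request body
bin/elasticsearch-keystore add s3.client.default.access_key
bin/elasticsearch-keystore add s3.client.default.secret_key

# In elasticsearch.yml on every node: point the default S3 client at your endpoint
#   s3.client.default.endpoint: your-s3-endpoint

# Register a repository (e.g., to S3-compatible storage) that uses the default client
curl -X PUT "http://ELASTICSEARCH_HOST:9200/_snapshot/my_s3_repository" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "your-s3-bucket-name",
    "client": "default"
  }
}
'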

Then, you can trigger manual or scheduled snapshots. Automating this process with a cron job or a dedicated orchestration tool is recommended. Regularly test your restore process to ensure backups are valid and the recovery procedure is well-documented.
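
A scheduled snapshot could be triggered from a cron-driven script along the lines of the sketch below. It assumes the my_s3_repository name from the registration step and the create-snapshot endpoint PUT /_snapshot/<repository>/<snapshot>; the naming scheme, the request body, and the function names are illustrative choices, and a cluster with security enabled would also need authentication headers.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Placeholder endpoint and the repository name used in the registration example.
ES_URL = "http://ELASTICSEARCH_HOST:9200"
REPOSITORY = "my_s3_repository"


def snapshot_name(now: datetime) -> str:
    """Build a lowercase, timestamped snapshot name."""
    return now.strftime("snapshot-%Y%m%d-%H%M%S")


def build_snapshot_request(name: str, base_url: str = ES_URL,
                           repository: str = REPOSITORY) -> urllib.request.Request:
    """Prepare the PUT request that starts a snapshot without waiting for it to finish."""
    body = json.dumps({"indices": "*", "include_global_state": False}).encode()
    return urllib.request.Request(
        f"{base_url}/_snapshot/{repository}/{name}?wait_for_completion=false",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def create_snapshot(base_url: str = ES_URL) -> dict:
    """Send the request and return Elasticsearch's JSON acknowledgement."""
    request = build_snapshot_request(snapshot_name(datetime.now(timezone.utc)), base_url)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)
```

Elasticsearch also offers snapshot lifecycle management for scheduling and retention inside the cluster, which may be preferable to an external cron job where it is available.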

Conclusion: A Layered Approach to Resilience

Keeping a Shopify app and its Elasticsearch cluster alive on DigitalOcean is a multi-faceted challenge. It requires a layered monitoring strategy that encompasses:

Elasticsearch Internal Health: Proactive checks on cluster status, shard allocation, and disk usage.
Application Performance: Monitoring request latency, error rates, and background job queue depths.
Infrastructure Stability: Tracking Droplet resource utilization and leveraging DigitalOcean’s managed service monitoring.
Data Durability: Implementing and testing automated backup and recovery procedures.

By combining these approaches, you can build a robust monitoring system that not only detects issues but also predicts and prevents them, ensuring your Shopify app remains performant and available.