Server Monitoring Best Practices: Keeping Your Shopify App and Elasticsearch Clusters Alive on Linode

Proactive Elasticsearch Health Checks with `curl` and `jq`

When an Elasticsearch cluster backs a production Shopify application, cluster downtime translates directly into lost revenue and frustrated users. Elasticsearch’s built-in monitoring tools are powerful, but small scripted checks that run outside the cluster add an independent layer of defense. We’ll use `curl` to call the Elasticsearch REST API and `jq` to parse the JSON responses.

A fundamental check is the cluster health API. This endpoint provides a high-level overview of the cluster’s status (green, yellow, or red) and details about shards. A ‘red’ status indicates that some primary shards are not allocated, meaning data is unavailable. A ‘yellow’ status means all data is available, but some replica shards are not yet allocated, posing a risk during node failures.
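
One way to make these three states actionable in a script is to map them to exit codes. The sketch below assumes the common monitoring-plugin convention (0 = OK, 1 = warning, 2 = critical, 3 = unknown); the function name `es_status_to_exit_code` is hypothetical.

```shell
#!/usr/bin/env bash
# Map an Elasticsearch cluster status string to a monitoring-style exit code.
# The 0/1/2/3 convention is an assumption borrowed from Nagios-style plugins.
es_status_to_exit_code() {
    case "$1" in
        green)  return 0 ;;  # all primary and replica shards allocated
        yellow) return 1 ;;  # all primaries allocated, some replicas missing
        red)    return 2 ;;  # at least one primary shard unallocated
        *)      return 3 ;;  # empty or unexpected value
    esac
}
```

Called with the `status` field of the cluster health response, the function returns 2 for a red cluster, which a scheduler or wrapper script can then act on.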

Basic Cluster Health Check Script

This Bash script connects to your Elasticsearch cluster and checks the overall health. It’s designed to be run periodically via cron or a similar scheduler.

First, ensure you have `curl` and `jq` installed on your monitoring server. On most Debian/Ubuntu systems:

sudo apt update
sudo apt install -y curl jq

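Now, let’s craft the script. Set the ELASTICSEARCH_HOST variable to your cluster’s endpoint (e.g., http://localhost:9200 or https://your-es-domain.com). The script sends its alerts with the `mail` command, so the monitoring server also needs a working mail setup (on Debian/Ubuntu, the mailutils package provides `mail`).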

#!/bin/bash

# Configuration
ELASTICSEARCH_HOST="http://localhost:9200" # Or your Elasticsearch endpoint
ALERT_EMAIL="alerts@yourdomain.com" # Placeholder; replace with your alert address
CLUSTER_NAME="MyShopifyESCluster"

# --- Cluster Health Check ---
HEALTH_URL="${ELASTICSEARCH_HOST}/_cluster/health?pretty"

echo "Checking Elasticsearch cluster health..."
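# --fail makes curl exit non-zero on HTTP errors; --max-time stops a hung node from blocking the check
HEALTH_RESPONSE=$(curl -sS --fail --max-time 10 "$HEALTH_URL")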

if [ $? -ne 0 ]; then
    echo "ERROR: Failed to connect to Elasticsearch at ${ELASTICSEARCH_HOST}." | mail -s "ALERT: Elasticsearch Connection Failed - ${CLUSTER_NAME}" "$ALERT_EMAIL"
    exit 1
fi

CLUSTER_STATUS=$(echo "$HEALTH_RESPONSE" | jq -r '.status')

echo "Cluster Status: ${CLUSTER_STATUS}"

if [ "$CLUSTER_STATUS" == "red" ]; then
    echo "CRITICAL: Elasticsearch cluster status is RED!" | mail -s "ALERT: Elasticsearch Cluster RED - ${CLUSTER_NAME}" "$ALERT_EMAIL"
    exit 1
elif [ "$CLUSTER_STATUS" == "yellow" ]; then
    echo "WARNING: Elasticsearch cluster status is YELLOW." | mail -s "ALERT: Elasticsearch Cluster YELLOW - ${CLUSTER_NAME}" "$ALERT_EMAIL"
    # Optionally, you might want to investigate further here, e.g., check unassigned shards
fi

# --- Node Count Check ---
NODE_COUNT_URL="${ELASTICSEARCH_HOST}/_cat/nodes?h=ip,name,heap.percent,load_1m&format=json"
NODE_RESPONSE=$(curl -sS --fail --max-time 10 "$NODE_COUNT_URL")

if [ $? -ne 0 ]; then
    echo "ERROR: Failed to retrieve node information from Elasticsearch." | mail -s "ALERT: Elasticsearch Node Info Failed - ${CLUSTER_NAME}" "$ALERT_EMAIL"
    exit 1
fi

TOTAL_NODES=$(echo "$NODE_RESPONSE" | jq 'length')
EXPECTED_NODES=3 # Define your expected number of nodes

echo "Total Nodes: ${TOTAL_NODES}"

if [ "$TOTAL_NODES" -lt "$EXPECTED_NODES" ]; then
    echo "WARNING: Elasticsearch cluster has fewer nodes than expected (${TOTAL_NODES}/${EXPECTED_NODES})." | mail -s "ALERT: Elasticsearch Node Count Low - ${CLUSTER_NAME}" "$ALERT_EMAIL"
fi

# --- High Heap Usage Check ---
HIGH_HEAP_THRESHOLD=85 # Percentage
echo "Checking for nodes with high heap usage (>${HIGH_HEAP_THRESHOLD}%)..."
# _cat/nodes returns "heap.percent" as a literal JSON key with a string value,
# so quote the key and convert the value to a number before comparing.
HIGH_HEAP_NODES=$(echo "$NODE_RESPONSE" | jq -r --argjson threshold "$HIGH_HEAP_THRESHOLD" '.[] | select((.["heap.percent"] // "0" | tonumber) > $threshold) | "\(.name) (Heap: \(.["heap.percent"])%)"')

if [ -n "$HIGH_HEAP_NODES" ]; then
    # Send the heading and the node list in a single email
    printf 'WARNING: Nodes with high heap usage detected:\n%s\n' "$HIGH_HEAP_NODES" | mail -s "ALERT: Elasticsearch High Heap Usage - ${CLUSTER_NAME}" "$ALERT_EMAIL"
fi

echo "Elasticsearch health checks completed."
exit 0

Save the script as check_es_health.sh and make it executable:

chmod +x check_es_health.sh

And to schedule it, add an entry to your crontab:

# Run every 5 minutes
*/5 * * * * /path/to/your/script/check_es_health.sh >> /var/log/es_health_check.log 2>&1
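
When the cluster reports yellow or red, the next question is usually which shards are unassigned. A minimal sketch, assuming the `_cat/shards` API with JSON output; the function name `list_unassigned_shards` is hypothetical and is not part of the script above.

```shell
#!/usr/bin/env bash
# Print one line per unassigned shard from _cat/shards JSON read on stdin.
# Column names containing dots (e.g. "unassigned.reason") are literal JSON keys.
list_unassigned_shards() {
    jq -r '.[]
        | select(.state == "UNASSIGNED")
        | "\(.index) shard=\(.shard) type=\(.prirep) reason=\(.["unassigned.reason"] // "unknown")"'
}

# Usage against a live cluster (the endpoint is an assumption):
# curl -s "http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason&format=json" \
#     | list_unassigned_shards
```

The output could be appended to the yellow or red alert email so the person on call sees the affected indices immediately.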

Monitoring Shopify App Performance Metrics

Your Shopify app’s performance is directly tied to its underlying infrastructure. Key metrics to monitor include:

Request Latency: How long does it take for your app to respond to requests?
Error Rates: What percentage of requests result in errors (e.g., 5xx, 4xx)?
Resource Utilization: CPU, memory, and disk I/O on your Linode instances.
Database Performance: Query times, connection counts, and slow queries (if applicable).
External Service Dependencies: Latency and error rates for API calls to Shopify, payment gateways, etc.
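
As a rough illustration of the first two metrics, the sketch below summarizes request count, 5xx error rate, and 95th-percentile latency from log lines. The two-column input format (`<status> <latency_ms>`) and the function name `summarize_requests` are assumptions for illustration; a real access log would need to be reduced to that shape first.

```shell
#!/usr/bin/env bash
# Summarize lines of "<http_status> <latency_ms>" read on stdin.
# The input format is an assumption, not a standard log format.
summarize_requests() {
    sort -k2,2n | awk '
        { lat[NR] = $2; if ($1 >= 500) errors++ }
        END {
            if (NR == 0) { print "requests=0"; exit }
            idx = int((NR * 95 + 99) / 100)   # nearest-rank 95th percentile
            printf "requests=%d error_rate=%.2f%% p95_ms=%d\n", NR, errors * 100 / NR, lat[idx]
        }'
}
```

Sorting by latency first lets awk pick the percentile by position. For continuous monitoring, the same numbers are better collected as Prometheus metrics than recomputed from logs.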

Linode Instance Monitoring with `node_exporter` and Prometheus

Prometheus is a de facto standard for time-series monitoring. We’ll deploy node_exporter on each Linode instance running your Shopify app and Elasticsearch nodes to expose system-level metrics.

1. Install `node_exporter` on Linode Instances:

Download the latest release from the official Prometheus GitHub repository. Adjust the version and architecture as needed.

# Example for amd64
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
sudo mv node_exporter /usr/local/bin/
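
Before installing a downloaded binary, it is worth checking the tarball against the digest published with the release (the node_exporter release page includes a sha256sums.txt file). A minimal sketch; the function name `verify_sha256` is hypothetical.

```shell
#!/usr/bin/env bash
# Verify a file against an expected SHA-256 digest.
verify_sha256() {
    local file="$1" expected="$2" actual
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# Usage: take the expected digest from the release's sha256sums.txt
# expected=$(grep 'linux-amd64.tar.gz' sha256sums.txt | awk '{ print $1 }')
# verify_sha256 node_exporter-1.7.0.linux-amd64.tar.gz "$expected"
```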

2. Create a Systemd Service for `node_exporter`:

This ensures `node_exporter` runs as a service and restarts automatically.

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

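[Service]
# On Debian/Ubuntu the unprivileged group is "nogroup"; RHEL-family systems use "nobody".
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter
# Restart the exporter if it exits unexpectedly
Restart=on-failure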

[Install]
WantedBy=multi-user.target

Save this content to /etc/systemd/system/node_exporter.service and then enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter

By default, node_exporter listens on port 9100. Make sure this port is reachable from your Prometheus server, and ideally only from it, for example with a Linode Cloud Firewall rule or a host firewall such as ufw.
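
A quick way to confirm the exporter is reachable is to fetch its /metrics endpoint from the Prometheus server and pull out one known metric. The sketch below parses the Prometheus text format with awk; the function name `metric_value` is hypothetical, and it only handles metrics without labels.

```shell
#!/usr/bin/env bash
# Print the value of an unlabelled metric from Prometheus text-format input on stdin.
# Exits non-zero if the metric is not present.
metric_value() {
    awk -v name="$1" '
        $1 == name { print $2; found = 1; exit }
        END { if (!found) exit 1 }'
}

# Usage (the hostname is an assumption):
# curl -s --max-time 5 http://es-node-1.yourdomain.com:9100/metrics | metric_value node_load1
```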

Configuring Prometheus for Scraping

On your Prometheus server, edit the prometheus.yml configuration file to include your Linode instances.

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. Default is every 1 minute.

scrape_configs:
  # Job for Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Job for Elasticsearch nodes
  - job_name: 'elasticsearch'
    static_configs:
      - targets:
          - 'es-node-1.yourdomain.com:9100' # Replace with your actual Linode IPs/hostnames
          - 'es-node-2.yourdomain.com:9100'
          - 'es-node-3.yourdomain.com:9100'
    # Add labels for easier querying, e.g., environment, role
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '$1'
      - target_label: cluster
        replacement: 'elasticsearch'
      - target_label: environment
        replacement: 'production'

  # Job for Shopify App nodes
  - job_name: 'shopify_app'
    static_configs:
      - targets:
          - 'app-node-1.yourdomain.com:9100'
          - 'app-node-2.yourdomain.com:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '$1'
      - target_label: role
        replacement: 'app'
      - target_label: environment
        replacement: 'production'

After updating prometheus.yml, reload or restart your Prometheus service.
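
It is safer to validate the file before reloading, since a syntax error can prevent Prometheus from starting. A minimal sketch using `promtool check config`, which ships with Prometheus; the function name, the default config path, and a systemd unit named `prometheus` that supports reload are all assumptions about your setup.

```shell
#!/usr/bin/env bash
# Validate the Prometheus config and reload the service only if the check passes.
# Run as root (or via sudo). Assumes a systemd unit called "prometheus" with ExecReload defined.
reload_prometheus() {
    local config="${1:-/etc/prometheus/prometheus.yml}"
    if ! promtool check config "$config"; then
        echo "Config check failed; not reloading." >&2
        return 1
    fi
    systemctl reload prometheus
}
```

If the unit does not define a reload action, sending SIGHUP to the Prometheus process or restarting the service achieves the same result.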

Alerting with Prometheus Alertmanager

Alerting is crucial for proactive issue resolution. Prometheus integrates with Alertmanager to handle alerts.

1. Define Alerting Rules:

Create a rule file (e.g., alerts.yml) and reference it in your prometheus.yml.

groups:
- name: general.rules
  rules:
  - alert: HighCpuLoad
    expr: node_load1 > 1.5 # 1-minute load average; scale the threshold to the instance's vCPU count
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU load on {{ $labels.instance }}"
      description: "Instance {{ $labels.instance }} has a load average of {{ $value }} for more than 5 minutes."

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85 # 85% memory usage
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Instance {{ $labels.instance }} is using {{ printf \"%.2f\" $value }}% of memory."

  - alert: ElasticsearchClusterRed
    # This requires a custom exporter or a script that feeds the status into Prometheus.
    # elasticsearch_cluster_status is a hypothetical metric used here for simplicity;
    # a dedicated Elasticsearch exporter is the more robust option.
    expr: elasticsearch_cluster_status == 2 # Assuming 0 for green, 1 for yellow, 2 for red
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster is RED on {{ $labels.instance }}"
      description: "Elasticsearch cluster {{ $labels.instance }} is in a RED state. Data may be unavailable."

  - alert: ElasticsearchNodeDown
    # This alert fires if node_exporter is not reachable for a node
    expr: up{job="elasticsearch"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch node {{ $labels.instance }} is down"
      description: "Prometheus cannot scrape metrics from Elasticsearch node {{ $labels.instance }}."

Ensure your prometheus.yml includes the alerting rules configuration:

rule_files:
  - "alerts.yml" # Path to your alert rules file
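
The ElasticsearchClusterRed rule needs something to publish the status metric. One option is node_exporter's textfile collector: a cron job writes a small .prom file and node_exporter exposes it alongside the system metrics. The sketch below uses the metric name and the 0/1/2 encoding assumed in the rule; the function name, the -1 value for an unknown status, and the directory path are further assumptions.

```shell
#!/usr/bin/env bash
# Convert a cluster status string into Prometheus text exposition format.
# -1 marks an empty or unexpected status (an assumption for this sketch).
es_status_metric() {
    local value
    case "$1" in
        green)  value=0 ;;
        yellow) value=1 ;;
        red)    value=2 ;;
        *)      value=-1 ;;
    esac
    printf '# HELP elasticsearch_cluster_status Cluster health (0=green, 1=yellow, 2=red).\n'
    printf '# TYPE elasticsearch_cluster_status gauge\n'
    printf 'elasticsearch_cluster_status %s\n' "$value"
}

# Usage from cron; node_exporter must be started with
# --collector.textfile.directory=/var/lib/node_exporter/textfile (path is an assumption):
# status=$(curl -s --max-time 10 http://localhost:9200/_cluster/health | jq -r '.status')
# es_status_metric "$status" > /var/lib/node_exporter/textfile/es_status.prom.tmp
# mv /var/lib/node_exporter/textfile/es_status.prom.tmp /var/lib/node_exporter/textfile/es_status.prom
```

Writing to a temporary file and renaming it keeps node_exporter from reading a half-written file.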

2. Configure Alertmanager:

Alertmanager needs to be configured with receivers (e.g., email, Slack). A basic alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver if no specific route matches

receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'alerts@yourdomain.com' # Placeholder; replace with your alert address
    from: 'alertmanager@yourdomain.com' # Placeholder sender address
    smarthost: 'smtp.yourdomain.com:587'
    auth_username: 'smtp_user'
    auth_password: 'smtp_password'
    require_tls: true

  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#alerts'
    send_resolved: true

Ensure Prometheus is configured to send alerts to Alertmanager in its prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 'localhost:9093' # Address of your Alertmanager instance
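
To check the notification path end to end without waiting for a real incident, you can post a manual alert to Alertmanager's v2 API. A minimal sketch; the function name `build_test_alert` and the alert name used in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Build a JSON payload for Alertmanager's POST /api/v2/alerts endpoint.
# Label values are inserted verbatim, so they must not contain quotes or backslashes.
build_test_alert() {
    local alertname="$1" severity="$2"
    printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Manual test alert"}}]\n' \
        "$alertname" "$severity"
}

# Usage (the Alertmanager address is an assumption):
# build_test_alert PipelineTest warning \
#     | curl -s -X POST -H 'Content-Type: application/json' --data @- http://localhost:9093/api/v2/alerts
```

If the receivers are configured correctly, the test alert should arrive by email and in Slack after the `group_wait` delay.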

Application-Specific Metrics and Tracing

Beyond system metrics, instrumenting your Shopify app itself is vital. This involves:

Custom Metrics: Track business-specific events like orders processed, API calls to Shopify, cache hit rates, etc. Libraries like Prometheus Client for PHP or prometheus_client for Python can be used.
Distributed Tracing: Tools like Jaeger or Zipkin can help pinpoint bottlenecks across microservices or complex application flows.
Log Aggregation: Centralize logs from all your Linode instances (app, Elasticsearch, Nginx, etc.) using tools like ELK stack (Elasticsearch, Logstash, Kibana) or Loki/Promtail/Grafana. This is invaluable for debugging.

For a Shopify app, monitoring the performance of webhooks is also critical. Ensure your webhook endpoints are responsive and handle incoming requests efficiently. Implement retry mechanisms and dead-letter queues for failed webhook deliveries.
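
The retry and dead-letter idea can be sketched in a few lines of shell. In a real app this logic would normally live in the job queue or worker framework; the function name and the `RETRY_BASE_DELAY` and `DEAD_LETTER_FILE` variables below are hypothetical.

```shell
#!/usr/bin/env bash
# Run a command up to max_attempts times with exponential backoff.
# After the final failure, append the command line to a dead-letter file for later replay.
retry_with_backoff() {
    local max_attempts="$1"; shift
    local delay="${RETRY_BASE_DELAY:-1}" attempt=1
    while true; do
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            printf '%s\n' "$*" >> "${DEAD_LETTER_FILE:-/tmp/webhook_dead_letter.log}"
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))       # 1s, 2s, 4s, ...
        attempt=$((attempt + 1))
    done
}

# Usage (process_webhook.sh is a hypothetical handler script):
# retry_with_backoff 5 ./process_webhook.sh /var/spool/webhooks/order-1001.json
```

Doubling the delay gives a struggling dependency time to recover, and the dead-letter file keeps failed work visible instead of silently dropping it.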

Linode Specific Considerations

Linode provides its own set of monitoring tools accessible via the Cloud Manager. While these offer a good overview, they are often reactive. Integrating them with your Prometheus/Alertmanager setup can provide a more unified view.

Linode API Monitoring: You can use `curl` or a dedicated client library to query the Linode API for instance status, resource usage (CPU, network, disk I/O), and even trigger actions based on alerts. This can be integrated into your Prometheus setup using a custom exporter or a script.

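# Example: Get the most recent CPU utilization sample for a specific Linode
LINODE_ID="1234567"
API_TOKEN="your_linode_api_token"

# The stats endpoint returns roughly the last 24 hours of data. .data.cpu is a list of
# [timestamp_ms, percent] pairs, so the last element is the newest sample.
curl -s -H "Authorization: Bearer ${API_TOKEN}" \
     "https://api.linode.com/v4/linode/instances/${LINODE_ID}/stats" | jq '.data.cpu[-1]'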

# This output can be scraped by Prometheus if exposed via a custom exporter.

Network Monitoring: Pay close attention to network traffic between your app servers and Elasticsearch cluster, as well as external API calls. Linode’s network graphs are a starting point, but Prometheus can provide more granular, time-series data.

Conclusion

A comprehensive monitoring strategy for your Shopify app and Elasticsearch cluster on Linode layers several tools. Scripted `curl` and `jq` checks for Elasticsearch health, system metrics collected with `node_exporter` and Prometheus, and alert routing through Alertmanager form the foundation. Instrument your application code as well for deeper insight. Together, these layers help you catch problems before they affect availability and performance.