Server Monitoring Best Practices: Keeping Your Shopify App and Elasticsearch Clusters Alive on OVH

Proactive Elasticsearch Cluster Health Checks with `curl` and `jq`

Maintaining the stability of an Elasticsearch cluster, especially one supporting a high-traffic Shopify app, requires constant vigilance. While dedicated monitoring tools are essential, a robust set of quick, scriptable checks can provide immediate insights and form the backbone of automated alerting. We’ll focus on leveraging `curl` for API interaction and `jq` for parsing JSON responses to build these checks.

The primary goal is to ensure the cluster is not only up but also healthy, with all nodes participating and no critical errors accumulating. We’ll start with the cluster health API, a fundamental endpoint for understanding the overall state.

Basic Cluster Health Endpoint Check

A simple `curl` command to the `_cluster/health` endpoint is the first step. This returns a JSON object detailing the cluster’s status (green, yellow, or red), the number of nodes, shards, and pending tasks. We’ll pipe this through `jq` to extract key metrics.

Consider a scenario where your Elasticsearch cluster is accessible at http://elasticsearch.your-domain.com:9200. The following command checks the overall health and status:

curl -s "http://elasticsearch.your-domain.com:9200/_cluster/health" | jq .

The output will look something like this:

{
  "cluster_name" : "my-production-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 10,
  "active_shards" : 30,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue" : 0,
  "active_shards_percent_as_number" : 100.0
}

For automated monitoring, we’re particularly interested in the status field. A red status means at least one primary shard is unassigned, so some data is unavailable; this is a critical failure. A yellow status means all primary shards are assigned but one or more replica shards are not, which reduces resilience but does not by itself halt operations. We want to alert on anything other than green.
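When the full JSON is more than a check needs, `jq` can reduce the response to a single line. The following is a minimal sketch; the function name `summarize_health` and the choice of fields are illustrative assumptions, not part of any Elasticsearch tooling.

```shell
#!/bin/bash

# Read a _cluster/health JSON response on stdin and print a one-line summary.
# The function name and the selected fields are illustrative choices.
summarize_health() {
    jq -r '"\(.status) nodes=\(.number_of_nodes) unassigned=\(.unassigned_shards) pending=\(.number_of_pending_tasks)"'
}

# Usage against the example host from this article:
# curl -s "http://elasticsearch.your-domain.com:9200/_cluster/health" | summarize_health
```

A compact line like this is easier to grep in a log file than the pretty-printed JSON.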

Automated Health Status Alerting Script

Let’s create a simple Bash script that checks the cluster status and exits with a non-zero status code if it’s not green. This is ideal for integration with monitoring systems like Nagios, Zabbix, or even a simple cron job that sends alerts.

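#!/bin/bash

ES_HOST="http://elasticsearch.your-domain.com:9200"
ALERT_THRESHOLD="yellow" # Informational only: the script alerts on yellow or red

# --max-time keeps the check from hanging if the cluster is unreachable.
# "// empty" yields an empty string when the status field is missing.
HEALTH_STATUS=$(curl -s --max-time 10 "${ES_HOST}/_cluster/health" | jq -r '.status // empty')

if [[ "$HEALTH_STATUS" == "green" ]]; then
    echo "Elasticsearch cluster health is GREEN."
    exit 0
elif [[ -z "$HEALTH_STATUS" ]]; then
    # No usable response: treat an unreachable cluster as a failure.
    echo "Could not read cluster health from ${ES_HOST}. Alerting!"
    exit 1
else
    echo "Elasticsearch cluster health is ${HEALTH_STATUS} (Threshold: ${ALERT_THRESHOLD}). Alerting!"
    exit 1
fi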

Save the script as check_es_health.sh and make it executable:

chmod +x check_es_health.sh

You can then run it manually or schedule it with cron:

# Example cron entry to run every 5 minutes
*/5 * * * * /path/to/your/scripts/check_es_health.sh >> /var/log/es_health_check.log 2>&1
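Nagios-style plugins distinguish warnings from critical failures through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). If your monitoring system follows that convention, the status could be mapped onto it roughly as follows. This is a sketch: the function name `status_to_exit_code` is hypothetical, and whether yellow should count as a warning or a critical state depends on your setup.

```shell
#!/bin/bash

# Map an Elasticsearch cluster status to a Nagios-style plugin exit code.
# Treating yellow as WARNING is an assumption; adjust to your own policy.
status_to_exit_code() {
    case "$1" in
        green)  echo 0 ;;  # OK
        yellow) echo 1 ;;  # WARNING
        red)    echo 2 ;;  # CRITICAL
        *)      echo 3 ;;  # UNKNOWN (empty or unexpected value)
    esac
}

# Usage at the end of a check script, where HEALTH_STATUS holds the status:
# exit "$(status_to_exit_code "$HEALTH_STATUS")"
```

An empty status (for example, when `curl` cannot reach the cluster) falls through to UNKNOWN rather than being mistaken for a healthy cluster.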

Monitoring Node Status and JVM Heap Usage

Beyond the overall cluster health, it’s crucial to monitor individual nodes. Node failures can degrade performance or lead to data loss if not detected promptly. The _nodes/stats endpoint provides detailed statistics for each node, including JVM heap usage, which is a common bottleneck.

We can use `jq` to flag nodes with excessive JVM heap usage. Let’s define a threshold for JVM heap usage, say 85%.

#!/bin/bash

ES_HOST="http://elasticsearch.your-domain.com:9200"
JVM_HEAP_THRESHOLD_PERCENT=85

# Collect each node's ID, name, and JVM heap usage into a single JSON array
# (the surrounding [...] matters: the filters below expect an array)
NODE_STATS=$(curl -s "${ES_HOST}/_nodes/stats/jvm" | jq '[.nodes | to_entries[] | {key: .key, name: .value.name, value: .value.jvm.mem.heap_used_percent}]')

# If the request or parsing failed, alert instead of reporting a false "all clear"
if [[ -z "$NODE_STATS" ]]; then
    echo "ALERT: Could not retrieve node stats from ${ES_HOST}."
    exit 1
fi

# Count the nodes whose heap usage exceeds the threshold
HIGH_HEAP_NODES=$(echo "$NODE_STATS" | jq --argjson threshold "$JVM_HEAP_THRESHOLD_PERCENT" '[.[] | select(.value > $threshold)] | length')

if [[ "$HIGH_HEAP_NODES" -gt 0 ]]; then
    echo "ALERT: High JVM heap usage detected on ${HIGH_HEAP_NODES} node(s)."
    # Print the offending nodes for the log
    echo "$NODE_STATS" | jq --argjson threshold "$JVM_HEAP_THRESHOLD_PERCENT" '.[] | select(.value > $threshold)'
    exit 1
else
    echo "All nodes reporting acceptable JVM heap usage."
    exit 0
fi

This script iterates through each node, extracts its JVM heap usage percentage, and compares it against the defined threshold. If any node exceeds the threshold, it logs the alert and the specific nodes affected.
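Heap usage only covers nodes that respond. To catch a node that has dropped out of the cluster entirely, one option is to compare `number_of_nodes` from the health API with the count you expect. The sketch below assumes a hypothetical function name `check_node_count` and an expected count that you supply (3 in the sample output shown earlier).

```shell
#!/bin/bash

# Read a _cluster/health JSON response on stdin and compare the node count
# with the expected number passed as the first argument.
# Returns 0 when all nodes are present, 1 when nodes are missing,
# and 3 when the count cannot be read.
check_node_count() {
    local expected actual
    expected="$1"
    actual=$(jq -r '.number_of_nodes // empty' 2>/dev/null)

    # An empty or non-numeric value means the response was unusable.
    case "$actual" in
        ''|*[!0-9]*)
            echo "UNKNOWN: could not read number_of_nodes"
            return 3
            ;;
    esac

    if [ "$actual" -lt "$expected" ]; then
        echo "ALERT: ${actual}/${expected} nodes in cluster"
        return 1
    fi
    echo "OK: ${actual}/${expected} nodes in cluster"
}

# Usage:
# curl -s "http://elasticsearch.your-domain.com:9200/_cluster/health" | check_node_count 3
```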

Detecting Unassigned Shards

Unassigned shards are a direct indicator of problems, whether it’s a node failure, disk space issues, or allocation rules preventing assignment. The cluster health API already reports the count, but we can get more granular information using the _cluster/allocation/explain endpoint, though this is more for debugging specific shard issues. For proactive monitoring, we’ll focus on the count of unassigned shards from the health API.

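#!/bin/bash

ES_HOST="http://elasticsearch.your-domain.com:9200"

# "// empty" yields an empty string when the field is missing
UNASSIGNED_SHARDS=$(curl -s "${ES_HOST}/_cluster/health" | jq -r '.unassigned_shards // empty')

# An empty or non-numeric value means the request failed; do not treat it as zero
if [[ ! "$UNASSIGNED_SHARDS" =~ ^[0-9]+$ ]]; then
    echo "ALERT: Could not read unassigned shard count from ${ES_HOST}."
    exit 1
elif [[ "$UNASSIGNED_SHARDS" -gt 0 ]]; then
    echo "ALERT: ${UNASSIGNED_SHARDS} unassigned shard(s) detected."
    # For more detail, you might query _cat/shards?h=index,shard,prirep,state,unassigned.reason
    # and parse that output. For simplicity, we'll just alert on the count.
    exit 1
else
    echo "No unassigned shards detected."
    exit 0
fi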

This script is straightforward: it fetches the `unassigned_shards` count and alerts if it’s greater than zero. For more detailed diagnostics when this alert fires, you could integrate calls to the _cat/shards API, filtering for shards in an “UNASSIGNED” state and examining their reasons.
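As a sketch of that follow-up step, the `_cat/shards` API can return JSON when called with `format=json`, which `jq` can then filter. The function name `list_unassigned` is hypothetical, and the column list is the one mentioned in the script comment above.

```shell
#!/bin/bash

# Read _cat/shards JSON output on stdin and print one line per unassigned shard.
# The key "unassigned.reason" contains a dot, so it needs bracket syntax in jq.
list_unassigned() {
    jq -r '.[] | select(.state == "UNASSIGNED")
           | "\(.index) shard=\(.shard) \(.prirep) reason=\(.["unassigned.reason"])"'
}

# Usage against the example host from this article:
# curl -s "http://elasticsearch.your-domain.com:9200/_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason" | list_unassigned
```

In the output, `p` marks a primary shard and `r` a replica, which tells you at a glance whether the cluster is red or only yellow.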

Shopify App Integration: API Endpoint Health

Your Shopify app likely interacts with Elasticsearch for search, indexing, or analytics. Monitoring the health of these specific API endpoints is critical. If your app’s search endpoint becomes slow or unresponsive, it directly impacts the user experience on Shopify.

We can simulate a typical search request to your app’s Elasticsearch-backed API. Assume your app exposes a search endpoint at http://your-app-api.com/search which proxies requests to Elasticsearch.

#!/bin/bash

APP_SEARCH_ENDPOINT="http://your-app-api.com/search"
SEARCH_QUERY='{"query": {"match": {"title": "example"}}}' # A sample query
TIMEOUT_SECONDS=5

# Use curl to hit the app's search endpoint
# -s to be silent
# -o /dev/null to discard the response body (we only need the status code)
# -w "%{http_code}" to print the HTTP status code
# --max-time to abort if the request takes longer than the timeout
# -X POST to send the query body
# -H "Content-Type: application/json" to specify the content type
# --data to send the query
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time "$TIMEOUT_SECONDS" -X POST -H "Content-Type: application/json" --data "$SEARCH_QUERY" "$APP_SEARCH_ENDPOINT")

# Check HTTP status code. 200 is typically OK. Adjust if your API uses other success codes.
# curl reports 000 when no HTTP response was received at all.
if [[ "$HTTP_CODE" == "200" ]]; then
    echo "Shopify App Search Endpoint is RESPONSIVE (HTTP $HTTP_CODE)."
    exit 0
elif [[ "$HTTP_CODE" == "000" ]]; then
    echo "Shopify App Search Endpoint is UNREACHABLE (Timeout or connection error)."
    exit 1
else
    echo "Shopify App Search Endpoint returned an ERROR (HTTP $HTTP_CODE)."
    exit 1
fi

This script checks if the application’s search endpoint is reachable and returns a successful HTTP status code within a specified timeout. If the endpoint times out or returns an error code (e.g., 5xx), it indicates a problem with either the application itself or its connection to Elasticsearch.
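A response can arrive inside the timeout and still be too slow for a good search experience. curl's `%{time_total}` write-out variable reports the total request time in seconds, so a check could classify slow responses separately. The sketch below is one way to do that; the function name `classify_probe` and the 1-second threshold in the usage line are assumptions.

```shell
#!/bin/bash

# Classify a probe result from its HTTP status code, the measured request
# time in seconds, and the slowest acceptable time in seconds.
# Prints ERROR, SLOW, or OK.
classify_probe() {
    local code seconds max_seconds
    code="$1"
    seconds="$2"
    max_seconds="$3"

    if [ "$code" != "200" ]; then
        echo "ERROR"
    # awk handles the floating-point comparison that plain shell tests cannot.
    elif awk -v t="$seconds" -v m="$max_seconds" 'BEGIN { exit !(t > m) }'; then
        echo "SLOW"
    else
        echo "OK"
    fi
}

# Usage (endpoint and query as in the script above; 1 second is an example threshold):
# read -r CODE SECONDS_TAKEN < <(curl -s -o /dev/null -w "%{http_code} %{time_total}" \
#     --max-time 5 -X POST -H "Content-Type: application/json" \
#     --data '{"query": {"match": {"title": "example"}}}' "http://your-app-api.com/search")
# classify_probe "$CODE" "$SECONDS_TAKEN" 1
```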

OVH Specific Considerations: Network and Firewall

When running these checks on OVH infrastructure, always consider network accessibility and firewall rules. Ensure that the monitoring server or the cron job’s execution environment has network access to your Elasticsearch cluster’s IP address and port (default 9200). If your Elasticsearch cluster is within a private network or behind an OVH firewall, you’ll need to configure security groups or firewall rules to allow inbound traffic from your monitoring source.

For example, if your Elasticsearch cluster is in a dedicated server or a public cloud instance, you might need to adjust the OVH firewall rules via the OVH Control Panel or API to permit TCP traffic on port 9200 from the IP address of your monitoring agent.

Additionally, if you’re using OVH’s managed Elasticsearch service, consult their documentation for specific endpoint URLs and any authentication requirements (e.g., API keys or tokens) that need to be included in your `curl` commands.
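If credentials are required, a small wrapper keeps them out of each individual check. The sketch below is hedged: the environment variable names (`ES_API_KEY`, `ES_USER`, `ES_PASS`, `CURL_BIN`) and the function name `es_curl` are hypothetical. It shows two schemes that Elasticsearch itself supports (an `Authorization: ApiKey` header and HTTP basic auth); a managed service may require something different, so check the provider's documentation.

```shell
#!/bin/bash

# Wrap curl so that every check sends credentials the same way.
# ES_API_KEY, ES_USER, ES_PASS, and CURL_BIN are hypothetical variable names;
# CURL_BIN exists only so the wrapper can be dry-run with another command.
es_curl() {
    if [ -n "${ES_API_KEY:-}" ]; then
        # Elasticsearch API key authentication
        set -- -H "Authorization: ApiKey ${ES_API_KEY}" "$@"
    elif [ -n "${ES_USER:-}" ]; then
        # HTTP basic authentication
        set -- -u "${ES_USER}:${ES_PASS:-}" "$@"
    fi
    "${CURL_BIN:-curl}" -s --max-time 10 "$@"
}

# Usage (use https:// whenever credentials are sent):
# ES_API_KEY="..." es_curl "https://elasticsearch.your-domain.com:9200/_cluster/health" | jq -r '.status'
```

Reading credentials from the environment, rather than hard-coding them, also keeps them out of the script files that cron runs.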

Advanced Monitoring with Prometheus and Grafana

While the `curl` and `jq` scripts are excellent for basic health checks and simple alerting, a production environment demands more sophisticated tooling. Prometheus, coupled with Grafana, provides a powerful combination for time-series monitoring and visualization.

Prometheus Configuration:

You’ll need an Elasticsearch Exporter to expose metrics in a Prometheus-readable format. A popular choice is the `elasticsearch_exporter`.

1. Install Elasticsearch Exporter: Download and run the exporter binary. It typically listens on port 9114.

# Example using Docker
docker run -d \
  -p 9114:9114 \
  --name elasticsearch_exporter \
  quay.io/prometheuscommunity/elasticsearch-exporter:latest \
  --es.uri="http://elasticsearch.your-domain.com:9200"

2. Configure Prometheus to Scrape the Exporter: Add the exporter to your `prometheus.yml` configuration.

scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['elasticsearch.your-domain.com:9114'] # Or the IP/hostname of your exporter

3. Restart Prometheus: Apply the configuration changes.
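Before relying on the scrape job, it can help to confirm that the exporter is actually serving cluster metrics. The sketch below parses the exporter's `/metrics` output and prints the currently active health color. It assumes the exporter exposes an `elasticsearch_cluster_health_status` gauge with a `color` label set to 1 for the active state; verify the metric name against your exporter version's output. The function name `active_health_color` is hypothetical.

```shell
#!/bin/bash

# Read Prometheus text-format metrics on stdin and print the color whose
# elasticsearch_cluster_health_status sample has the value 1.
# The metric name is an assumption about the exporter's output.
active_health_color() {
    awk '/^elasticsearch_cluster_health_status[{]/ && $NF == 1 {
             # Extract the value of the color="..." label
             if (match($0, /color="[a-z]+"/))
                 print substr($0, RSTART + 7, RLENGTH - 8)
         }'
}

# Usage against the exporter from step 1:
# curl -s "http://elasticsearch.your-domain.com:9114/metrics" | active_health_color
```

If this prints nothing, the exporter is either unreachable or cannot connect to the cluster, and Prometheus will have nothing useful to scrape.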

Grafana Dashboards:

Once Prometheus is scraping metrics, you can import pre-built Elasticsearch dashboards into Grafana or create your own. Search for “Elasticsearch” dashboards on Grafana.com. These dashboards typically visualize:

Cluster Health (status, nodes, shards)
JVM Heap Usage
CPU and Memory Usage per Node
Indexing and Search Latency
Disk I/O and Space Usage
Network Traffic

These visualizations provide a historical view, allowing you to identify trends, predict potential issues, and optimize performance proactively. Setting up alerting rules within Prometheus (using Alertmanager) based on these metrics is the next logical step for robust, automated incident response.
