Server Monitoring Best Practices: Keeping Your Shopify App and Elasticsearch Clusters Alive on OVH
Proactive Elasticsearch Cluster Health Checks with `curl` and `jq`
Maintaining the stability of an Elasticsearch cluster, especially one supporting a high-traffic Shopify app, requires constant vigilance. While dedicated monitoring tools are essential, a robust set of quick, scriptable checks can provide immediate insights and form the backbone of automated alerting. We’ll focus on leveraging `curl` for API interaction and `jq` for parsing JSON responses to build these checks.
The primary goal is to ensure the cluster is not only up but also healthy, with all nodes participating and no critical errors accumulating. We’ll start with the cluster health API, a fundamental endpoint for understanding the overall state.
Basic Cluster Health Endpoint Check
A simple `curl` command to the `_cluster/health` endpoint is the first step. This returns a JSON object detailing the cluster’s status (green, yellow, or red), the number of nodes, shards, and pending tasks. We’ll pipe this through `jq` to extract key metrics.
Consider a scenario where your Elasticsearch cluster is accessible at http://elasticsearch.your-domain.com:9200. The following command checks the overall health and status:
curl -s "http://elasticsearch.your-domain.com:9200/_cluster/health" | jq .
The output will look something like this:
{
  "cluster_name" : "my-production-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 10,
  "active_shards" : 30,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
For automated monitoring, we’re particularly interested in the `status` field. A red status means at least one primary shard is unassigned, which is a critical failure because part of your data is unavailable. A yellow status means all primary shards are assigned but one or more replica shards are not, which reduces resilience without immediately halting operations. We want to alert on anything other than green.
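Because `jq .` only pretty-prints the response, a narrower filter is usually more convenient in scripts. The following is a minimal sketch of one way to reduce the health document to the fields discussed above. The function name `summarize_health` is hypothetical, and it reads JSON from standard input so it can be tried against a saved response without touching the cluster.

```bash
#!/bin/bash
# summarize_health: hypothetical helper that reads a _cluster/health JSON
# document on stdin and prints "status nodes unassigned_shards" on one line.
summarize_health() {
  jq -r '[.status, .number_of_nodes, .unassigned_shards] | map(tostring) | join(" ")'
}

# Example wiring (assumes the host used earlier in this article):
# curl -s "http://elasticsearch.your-domain.com:9200/_cluster/health" | summarize_health
```

Keeping the filter in a function that reads stdin separates the network call from the parsing, which makes the parsing easy to test in isolation.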
Automated Health Status Alerting Script
Let’s create a simple Bash script that checks the cluster status and exits with a non-zero status code if it’s not green. This is ideal for integration with monitoring systems like Nagios, Zabbix, or even a simple cron job that sends alerts.
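#!/bin/bash
ES_HOST="http://elasticsearch.your-domain.com:9200"
ALERT_THRESHOLD="yellow" # Informational: the check below alerts on any non-green status (yellow or red)
# --max-time stops the check from hanging if the cluster is unresponsive
HEALTH_STATUS=$(curl -s --max-time 10 "${ES_HOST}/_cluster/health" | jq -r '.status')
if [[ "$HEALTH_STATUS" == "green" ]]; then
  echo "Elasticsearch cluster health is GREEN."
  exit 0
else
  # An empty status means curl or jq failed, so report it as unknown
  echo "Elasticsearch cluster health is ${HEALTH_STATUS:-unknown} (Threshold: ${ALERT_THRESHOLD}). Alerting!"
  exit 1
fi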
To make this script executable, run:
chmod +x check_es_health.sh
You can then run it manually or schedule it with cron:
# Example cron entry to run every 5 minutes
*/5 * * * * /path/to/your/scripts/check_es_health.sh >> /var/log/es_health_check.log 2>&1
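The script above collapses yellow and red into a single exit code. Monitoring systems that follow the Nagios plugin convention distinguish 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN), so a variant that maps each health status to its own code may be useful. This is a sketch under that assumption; `status_to_exit_code` is a hypothetical helper name.

```bash
#!/bin/bash
# status_to_exit_code: hypothetical helper mapping an Elasticsearch health
# status to the conventional Nagios plugin exit codes
# (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
status_to_exit_code() {
  case "$1" in
    green)  return 0 ;;
    yellow) return 1 ;;
    red)    return 2 ;;
    *)      return 3 ;;  # empty or unexpected value, e.g. curl or jq failed
  esac
}

# Example wiring (commented out so the helper can be sourced on its own):
# status=$(curl -s --max-time 10 "$ES_HOST/_cluster/health" | jq -r '.status')
# status_to_exit_code "$status"; exit $?
```

Treating an empty or unrecognized status as UNKNOWN rather than OK means a failed request never masquerades as a healthy cluster.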
Monitoring Node Status and JVM Heap Usage
Beyond the overall cluster health, it’s crucial to monitor individual nodes. Node failures can degrade performance or lead to data loss if not detected promptly. The `_nodes/stats` endpoint provides detailed statistics for each node, including JVM heap usage, which is a common bottleneck.
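We can use `jq` to flag nodes with excessive JVM heap usage. Let’s define a threshold for JVM heap usage, say 85%. Note that a node that has dropped out of the cluster does not appear in this response at all, so to catch missing nodes, compare `number_of_nodes` from the health API against the count you expect.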
#!/bin/bash
ES_HOST="http://elasticsearch.your-domain.com:9200"
JVM_HEAP_THRESHOLD_PERCENT=85
# Collect each node's ID and JVM heap usage into a single JSON array
NODE_STATS=$(curl -s --max-time 10 "${ES_HOST}/_nodes/stats/jvm" | jq '[.nodes | to_entries[] | {key: .key, value: .value.jvm.mem.heap_used_percent}]')
# An empty result means curl or jq failed; do not report that as healthy
if [[ -z "$NODE_STATS" ]]; then
  echo "ALERT: Could not retrieve node stats from ${ES_HOST}."
  exit 1
fi
# Count the nodes whose heap usage exceeds the threshold
HIGH_HEAP_NODES=$(echo "$NODE_STATS" | jq --argjson threshold "$JVM_HEAP_THRESHOLD_PERCENT" '[.[] | select(.value > $threshold)] | length')
if [[ "$HIGH_HEAP_NODES" -gt 0 ]]; then
  echo "ALERT: High JVM heap usage detected on ${HIGH_HEAP_NODES} node(s)."
  # Print the offending nodes (node ID and heap percentage)
  echo "$NODE_STATS" | jq --argjson threshold "$JVM_HEAP_THRESHOLD_PERCENT" '.[] | select(.value > $threshold)'
  exit 1
else
  echo "All nodes reporting acceptable JVM heap usage."
  exit 0
fi
This script iterates through each node, extracts its JVM heap usage percentage, and compares it against the defined threshold. If any node exceeds the threshold, it logs the alert and the specific nodes affected.
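Node IDs are opaque strings, so an alert that names the node is usually easier to act on. The sketch below shows one way to print the node name next to its heap percentage. It assumes the `name` field that the node stats response includes for each node; the function name `nodes_over_heap_threshold` is hypothetical. It reads the `_nodes/stats/jvm` response from standard input.

```bash
#!/bin/bash
# nodes_over_heap_threshold: hypothetical helper. Reads a _nodes/stats/jvm
# JSON response on stdin and prints "node_name heap_percent" for every node
# whose heap usage is above the threshold given as the first argument.
nodes_over_heap_threshold() {
  local threshold="$1"
  jq -r --argjson threshold "$threshold" '
    .nodes
    | to_entries[]
    | select(.value.jvm.mem.heap_used_percent > $threshold)
    | "\(.value.name) \(.value.jvm.mem.heap_used_percent)"
  '
}

# Example wiring (assumes the host used earlier in this article):
# curl -s "http://elasticsearch.your-domain.com:9200/_nodes/stats/jvm" | nodes_over_heap_threshold 85
```

An empty output means no node exceeded the threshold, so a caller can test the result with `[[ -n "$output" ]]` to decide whether to alert.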
Detecting Unassigned Shards
Unassigned shards are a direct indicator of problems, whether the cause is a node failure, a full disk, or allocation rules preventing assignment. The `_cluster/allocation/explain` endpoint gives granular detail on why a specific shard is unassigned, but it is better suited to debugging than to routine checks. For proactive monitoring, we’ll focus on the count of unassigned shards that the cluster health API already reports.
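#!/bin/bash
ES_HOST="http://elasticsearch.your-domain.com:9200"
UNASSIGNED_SHARDS=$(curl -s --max-time 10 "${ES_HOST}/_cluster/health" | jq -r '.unassigned_shards')
# A non-numeric value (empty or "null") means the request failed; do not report that as healthy
if ! [[ "$UNASSIGNED_SHARDS" =~ ^[0-9]+$ ]]; then
  echo "ALERT: Could not read the unassigned shard count from ${ES_HOST}."
  exit 1
elif [[ "$UNASSIGNED_SHARDS" -gt 0 ]]; then
  echo "ALERT: ${UNASSIGNED_SHARDS} unassigned shard(s) detected."
  # For more detail, you might query _cat/shards?h=index,shard,prirep,state,unassigned.reason
  # and parse that output. For simplicity, we'll just alert on the count.
  exit 1
else
  echo "No unassigned shards detected."
  exit 0
fi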
This script is straightforward: it fetches the `unassigned_shards` count and alerts if it’s greater than zero. For more detailed diagnostics when this alert fires, you could integrate calls to the _cat/shards API, filtering for shards in an “UNASSIGNED” state and examining their reasons.
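The `_cat/shards` follow-up can be sketched with a small `awk` filter. The helper below assumes the column order `index,shard,prirep,state,unassigned.reason` from the query string mentioned in the script comment; the function name `list_unassigned_shards` is hypothetical. It reads the plain-text response from standard input.

```bash
#!/bin/bash
# list_unassigned_shards: hypothetical helper. Reads the plain-text output of
# _cat/shards?h=index,shard,prirep,state,unassigned.reason on stdin and prints
# index, shard number, primary/replica flag, and reason for every row whose
# state column (the fourth) is UNASSIGNED.
list_unassigned_shards() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3, $5 }'
}

# Example wiring (assumes the host used earlier in this article):
# curl -s "http://elasticsearch.your-domain.com:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" \
#   | list_unassigned_shards
```

Including this output in the alert message tells the on-call engineer which index is affected and why, without a second round trip to the cluster.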
Shopify App Integration: API Endpoint Health
Your Shopify app likely interacts with Elasticsearch for search, indexing, or analytics. Monitoring the health of these specific API endpoints is critical. If your app’s search endpoint becomes slow or unresponsive, it directly impacts the user experience on Shopify.
We can simulate a typical search request to your app’s Elasticsearch-backed API. Assume your app exposes a search endpoint at http://your-app-api.com/search which proxies requests to Elasticsearch.
#!/bin/bash
APP_SEARCH_ENDPOINT="http://your-app-api.com/search"
SEARCH_QUERY='{"query": {"match": {"title": "example"}}}' # A sample query
TIMEOUT_SECONDS=5
# Use curl to hit the app's search endpoint
# -X POST to send the query body
# -H "Content-Type: application/json" to specify the content type
# --data to send the query
# -w "%{http_code}" to print the HTTP status code
# -o /dev/null to discard the response body (we only care about the status code)
# -s to be silent
# --max-time to abort the request after TIMEOUT_SECONDS
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time "$TIMEOUT_SECONDS" -X POST -H "Content-Type: application/json" --data "$SEARCH_QUERY" "$APP_SEARCH_ENDPOINT")
# Check HTTP status code. 200 is typically OK. Adjust if your API uses other success codes.
if [[ "$HTTP_CODE" == "200" ]]; then
  echo "Shopify App Search Endpoint is RESPONSIVE (HTTP $HTTP_CODE)."
  exit 0
elif [[ "$HTTP_CODE" == "000" ]]; then
  # curl prints 000 when no HTTP response was received
  echo "Shopify App Search Endpoint is UNREACHABLE (Timeout or connection error)."
  exit 1
else
  echo "Shopify App Search Endpoint returned an ERROR (HTTP $HTTP_CODE)."
  exit 1
fi
This script checks if the application’s search endpoint is reachable and returns a successful HTTP status code within a specified timeout. If the endpoint times out or returns an error code (e.g., 5xx), it indicates a problem with either the application itself or its connection to Elasticsearch.
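A slow endpoint can hurt the user experience well before it times out, so it may be worth recording the response time as well as the status code. curl can print both through the `-w` format variables `%{http_code}` and `%{time_total}`. The sketch below classifies a result against a latency budget; the function name `classify_response` and the one-second budget in the example are assumptions for illustration.

```bash
#!/bin/bash
# classify_response: hypothetical helper. Takes an HTTP status code, the total
# request time in seconds (as printed by curl's %{time_total}), and a latency
# budget in seconds. Prints ERROR for any non-200 code, SLOW when the request
# succeeded but exceeded the budget, and OK otherwise.
classify_response() {
  local code="$1" seconds="$2" budget="$3"
  if [ "$code" != "200" ]; then
    echo "ERROR"
  elif awk -v t="$seconds" -v b="$budget" 'BEGIN { exit !(t > b) }'; then
    # awk handles the floating-point comparison that bash cannot do natively
    echo "SLOW"
  else
    echo "OK"
  fi
}

# Example wiring (assumes the endpoint and query from the script above):
# read -r code seconds < <(curl -s -o /dev/null -w "%{http_code} %{time_total}" \
#   --max-time 5 -X POST -H "Content-Type: application/json" \
#   --data "$SEARCH_QUERY" "$APP_SEARCH_ENDPOINT")
# classify_response "$code" "$seconds" 1
```

Alerting on SLOW separately from ERROR makes it possible to warn on degradation without paging anyone until the endpoint actually fails.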
OVH Specific Considerations: Network and Firewall
When running these checks on OVH infrastructure, always consider network accessibility and firewall rules. Ensure that the monitoring server or the cron job’s execution environment has network access to your Elasticsearch cluster’s IP address and port (default 9200). If your Elasticsearch cluster is within a private network or behind an OVH firewall, you’ll need to configure security groups or firewall rules to allow inbound traffic from your monitoring source.
For example, if your Elasticsearch cluster is in a dedicated server or a public cloud instance, you might need to adjust the OVH firewall rules via the OVH Control Panel or API to permit TCP traffic on port 9200 from the IP address of your monitoring agent.
Additionally, if you’re using OVH’s managed Elasticsearch service, consult their documentation for specific endpoint URLs and any authentication requirements (e.g., API keys or tokens) that need to be included in your `curl` commands.
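If the cluster does require credentials, they can be added to the checks without hard-coding secrets in each script. The following sketch builds the extra `curl` options from environment variables. The variable names `ES_API_KEY`, `ES_USER`, and `ES_PASS` and the function name `build_auth_args` are hypothetical, and the header format shown is the one Elasticsearch's own API key authentication uses; a managed service may expect something different, so check its documentation.

```bash
#!/bin/bash
# build_auth_args: hypothetical helper that fills the global array AUTH_ARGS
# with curl options based on environment variables. ES_API_KEY, ES_USER, and
# ES_PASS are illustrative names, not something OVH or Elasticsearch mandates.
build_auth_args() {
  AUTH_ARGS=()
  if [[ -n "${ES_API_KEY:-}" ]]; then
    # Elasticsearch API keys are sent as "Authorization: ApiKey <encoded key>"
    AUTH_ARGS=(-H "Authorization: ApiKey ${ES_API_KEY}")
  elif [[ -n "${ES_USER:-}" ]]; then
    # Fall back to HTTP basic authentication
    AUTH_ARGS=(-u "${ES_USER}:${ES_PASS:-}")
  fi
}

# Example wiring (commented out so the helper can be sourced on its own):
# build_auth_args
# curl -s --max-time 10 "${AUTH_ARGS[@]}" "$ES_HOST/_cluster/health" | jq -r '.status'
```

When credentials are in use, point `ES_HOST` at an `https://` URL so they are not sent in clear text.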
Advanced Monitoring with Prometheus and Grafana
While the `curl` and `jq` scripts are excellent for basic health checks and simple alerting, a production environment demands more sophisticated tooling. Prometheus, coupled with Grafana, provides a powerful combination for time-series monitoring and visualization.
Prometheus Configuration:
You’ll need an Elasticsearch Exporter to expose metrics in a Prometheus-readable format. A popular choice is the `elasticsearch_exporter`.
1. Install Elasticsearch Exporter: Download and run the exporter binary. It typically listens on port 9114.
# Example using Docker
docker run -d \
  -p 9114:9114 \
  --name elasticsearch_exporter \
  quay.io/prometheuscommunity/elasticsearch-exporter:latest \
  --es.uri="http://elasticsearch.your-domain.com:9200"
2. Configure Prometheus to Scrape the Exporter: Add the exporter to your `prometheus.yml` configuration.
scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['elasticsearch.your-domain.com:9114'] # Or the IP/hostname of your exporter
3. Restart Prometheus: Apply the configuration changes.
Grafana Dashboards:
Once Prometheus is scraping metrics, you can import pre-built Elasticsearch dashboards into Grafana or create your own. Search for “Elasticsearch” dashboards on Grafana.com. These dashboards typically visualize:
- Cluster Health (status, nodes, shards)
- JVM Heap Usage
- CPU and Memory Usage per Node
- Indexing and Search Latency
- Disk I/O and Space Usage
- Network Traffic
These visualizations provide a historical view, allowing you to identify trends, predict potential issues, and optimize performance proactively. Setting up alerting rules within Prometheus (using Alertmanager) based on these metrics is the next logical step for robust, automated incident response.
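As a starting point for those alerting rules, the fragment below sketches a Prometheus rules file with two alerts that mirror the shell checks from earlier: cluster status not green, and JVM heap above 85%. The metric names are assumed to be the ones exposed by the prometheus-community `elasticsearch_exporter`; verify them against your exporter's `/metrics` output before relying on them. The alert names, the 5-minute `for` duration, and the severity labels are illustrative choices.

```yaml
groups:
  - name: elasticsearch
    rules:
      # Fires when the exporter reports that the cluster status is not green
      - alert: ElasticsearchClusterNotGreen
        expr: elasticsearch_cluster_health_status{color="green"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch cluster is not green"
      # Fires when a node's JVM heap stays above 85% of its maximum
      - alert: ElasticsearchHeapUsageHigh
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap usage above 85% on an Elasticsearch node"
```

The file would be referenced from the `rule_files` section of `prometheus.yml`, with Alertmanager handling routing and notification.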