Server Monitoring Best Practices: Keeping Your Magento 2 App and Elasticsearch Clusters Alive on OVH
Proactive Elasticsearch Health Checks with `curl` and `jq`
When Elasticsearch powers a Magento 2 storefront, cluster health directly affects revenue: if search and catalog browsing go down, customers leave. Dedicated monitoring solutions remain essential, but a small set of `curl` commands, with `jq` parsing the JSON output, gives you a scriptable, low-overhead layer of checks. These can run from cron jobs or plug into your existing monitoring pipeline.
We’ll focus on key metrics: cluster health status, node count, and shard allocation. This approach allows for rapid detection of issues before they cascade.
Cluster Health Status
The most critical endpoint is the cluster health API. It provides an immediate overview of the cluster’s state. A `green` status indicates all primary and replica shards are allocated and operational. `yellow` means all primary shards are allocated, but some replicas are not, which is acceptable for read operations but a risk for failover. `red` signifies that one or more primary shards are not allocated, leading to data unavailability for those shards.
Here’s a `curl` command to fetch this information. We’ll pipe the output to `jq` to extract the `status` field.
Assuming your Elasticsearch is running on localhost:9200 and requires basic authentication (replace user and password with your actual credentials):
curl -s -u "user:password" "http://localhost:9200/_cluster/health" | jq -r '.status'
To automate this, you can create a simple shell script. This script will check the status and exit with a non-zero code if the status is not `green`, signaling a problem to your monitoring system.
check_es_health.sh:
#!/bin/bash
ES_HOST="localhost:9200"
ES_USER="user"
ES_PASS="password"
HEALTH_STATUS=$(curl -s -u "${ES_USER}:${ES_PASS}" "http://${ES_HOST}/_cluster/health" | jq -r '.status')
if [ "$HEALTH_STATUS" != "green" ]; then
echo "Elasticsearch cluster health is NOT green: ${HEALTH_STATUS}"
exit 1
else
echo "Elasticsearch cluster health is green."
exit 0
fi
Make this script executable:
chmod +x check_es_health.sh
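The script above treats every non-green status the same way. Nagios-compatible agents conventionally distinguish exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN), so a finer mapping may be useful. The following is a minimal sketch, assuming that convention suits your monitoring system; the function name `status_to_exit_code` is illustrative and not part of any tool.

```shell
#!/bin/bash
# Hypothetical helper: map an Elasticsearch health status to a
# Nagios-style exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
status_to_exit_code() {
    case "$1" in
        green)  echo 0 ;;
        yellow) echo 1 ;;
        red)    echo 2 ;;
        *)      echo 3 ;;  # empty or unexpected value, e.g. curl failed
    esac
}

# Example wiring (commented out so the snippet makes no network call):
# HEALTH_STATUS=$(curl -s -u "user:password" "http://localhost:9200/_cluster/health" | jq -r '.status')
# exit "$(status_to_exit_code "$HEALTH_STATUS")"
```

Treating `yellow` as a warning rather than a failure avoids paging anyone while replicas are merely being reallocated, for example after a node restart.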
Node Count and Shard Allocation
Beyond the overall health status, it’s crucial to monitor the number of nodes in your cluster and how shards are distributed. A sudden drop in node count can indicate a node failure. Unassigned shards (which put the cluster in a `yellow` or `red` state, since a `green` cluster has none) can point to underlying issues like disk space exhaustion or network problems preventing allocation.
We can extend the cluster health API call to get more granular details. The following command retrieves the number of nodes and the count of unassigned shards.
curl -s -u "user:password" "http://localhost:9200/_cluster/health" | jq '{nodes: .number_of_nodes, unassigned_shards: .unassigned_shards}'
This output provides a JSON object with two key-value pairs. You can integrate checks for these values into your monitoring scripts. For instance, you might want to alert if the node count drops below a certain threshold or if there are any unassigned shards.
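The threshold check described above could look roughly like this. It is a sketch under assumptions: the helper name `check_cluster_counts` and the expected cluster size of three nodes are made up for illustration, so substitute your own values.

```shell
#!/bin/bash
# Hypothetical helper: compare node count and unassigned shards with
# expectations. Prints one line per problem; returns 1 if any were found.
# Args: <number_of_nodes> <unassigned_shards> <minimum_expected_nodes>
check_cluster_counts() {
    local nodes="$1" unassigned="$2" min_nodes="$3" problems=0
    if [ "$nodes" -lt "$min_nodes" ]; then
        echo "CRITICAL: only ${nodes} node(s), expected at least ${min_nodes}"
        problems=1
    fi
    if [ "$unassigned" -gt 0 ]; then
        echo "WARNING: ${unassigned} unassigned shard(s)"
        problems=1
    fi
    return "$problems"
}

# Example wiring (commented out so the snippet makes no network call).
# The minimum of 3 nodes is an assumed cluster size; adjust to your own.
# read -r NODES UNASSIGNED < <(curl -s -u "user:password" "http://localhost:9200/_cluster/health" \
#     | jq -r '"\(.number_of_nodes) \(.unassigned_shards)"')
# check_cluster_counts "$NODES" "$UNASSIGNED" 3
```

Keeping the comparison separate from the HTTP call makes the logic easy to exercise with fixed numbers before pointing it at a live cluster.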
Node-Specific Metrics
To diagnose issues at a node level, the _nodes/stats API is invaluable. It provides detailed metrics on CPU usage, memory, disk I/O, JVM heap usage, and more for each node.
To list the IDs of all nodes in the cluster:
curl -s -u "user:password" "http://localhost:9200/_nodes/stats" | jq '.nodes | keys[]'
To read a specific metric, such as JVM heap usage, for one node, replace NODE_ID with an ID from the previous command (keep the double quotes, since `jq` needs the ID as a string key):
curl -s -u "user:password" "http://localhost:9200/_nodes/stats/jvm" | jq '.nodes["NODE_ID"].jvm.mem.heap_used_percent'
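A more comprehensive script might iterate through all nodes and check for high JVM heap usage or low disk space. Disk usage is particularly critical: once a node crosses the high disk watermark (90% by default), Elasticsearch starts relocating shards away from it, and at the flood-stage watermark (95% by default) it marks the affected indices read-only, which blocks indexing.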
Example script snippet to check JVM heap usage across all nodes:
#!/bin/bash
ES_HOST="localhost:9200"
ES_USER="user"
ES_PASS="password"
HEAP_THRESHOLD=85 # Alert if heap usage is above 85%
NODES_STATS=$(curl -s -u "${ES_USER}:${ES_PASS}" "http://${ES_HOST}/_nodes/stats/jvm" | jq '.nodes')
echo "$NODES_STATS" | jq -c 'to_entries[]' | while read -r node_entry; do
NODE_NAME=$(echo "$node_entry" | jq -r '.value.name') # human-readable node name (the entry key is the node ID)
HEAP_USED_PERCENT=$(echo "$node_entry" | jq -r '.value.jvm.mem.heap_used_percent')
if [ "$HEAP_USED_PERCENT" -gt "$HEAP_THRESHOLD" ]; then
echo "WARNING: Node ${NODE_NAME} has high JVM heap usage: ${HEAP_USED_PERCENT}%"
# In a real scenario, you'd trigger an alert here
fi
done
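The disk-space concern mentioned earlier can be checked in a similar way using the `_cat/allocation` API, which reports `disk.percent` per node. This is a sketch, not a definitive implementation: the helper name `check_disk_percent` is hypothetical, and the threshold of 85 is an assumption chosen to match the default low watermark.

```shell
#!/bin/bash
# Hypothetical helper: read "<disk.percent> <node>" lines on stdin and
# print a warning for every node at or above the given threshold.
# Returns 1 if at least one node is over the threshold.
check_disk_percent() {
    local threshold="$1" percent node over=0
    while read -r percent node; do
        # Skip rows without a numeric percentage (e.g. the UNASSIGNED row)
        case "$percent" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$percent" -ge "$threshold" ]; then
            echo "WARNING: Node ${node} disk usage is ${percent}%"
            over=1
        fi
    done
    return "$over"
}

# Example wiring (commented out so the snippet makes no network call):
# curl -s -u "user:password" "http://localhost:9200/_cat/allocation?h=disk.percent,node" \
#     | check_disk_percent 85
```

Requesting the percentage column first means a row with no disk figure is read as a non-numeric value and skipped instead of being misparsed.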
Magento 2 Specific Considerations
For Magento 2, Elasticsearch is not just a search engine; it’s a critical component for product catalog indexing, layered navigation, and more. When Elasticsearch experiences issues, these core functionalities degrade or fail entirely.
Indexing Status: Elasticsearch itself does not know whether Magento’s indexers are up to date, so no single `curl` call to the Elasticsearch API answers this. You can infer indexing health from Magento’s admin panel or from the document counts in Elasticsearch: a count that stays flat while catalog changes are expected suggests that indexing has stalled. For deeper inspection, use Magento CLI commands such as `bin/magento indexer:status`, or query Magento’s database.
Index Size and Document Counts: Monitoring the size of your Elasticsearch indices and the number of documents within them can help detect anomalies. If products you add or remove are not reflected in the document count after reindexing, it’s a strong indicator of an indexing failure. Edits to existing products replace documents in place, so they leave the count unchanged.
curl -s -u "user:password" "http://localhost:9200/_cat/indices?v&h=index,docs.count,store.size" | grep "magento2_"
This command lists indices whose names contain `magento2_`, showing their document count and size. You can script checks against these values. For example, if the document count for a product index such as `magento2_product_1_v1` (the exact name depends on your index prefix and store ID) hasn’t changed in an hour and you expect products to be added or removed, it’s a red flag.
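One way to script the "count has not changed" check is to remember the previous value in a small state file and compare on each run. The sketch below assumes an hourly cron schedule; the helper name `detect_stalled_count`, the state file path, and `INDEX_NAME` are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: compare the current document count with the one
# saved by the previous run. Prints a warning and returns 1 when the count
# is unchanged; always stores the current count for the next run.
# Args: <state_file> <current_count>
detect_stalled_count() {
    local state_file="$1" current="$2" previous="" stalled=0
    if [ -f "$state_file" ]; then
        previous=$(cat "$state_file")
    fi
    if [ -n "$previous" ] && [ "$previous" = "$current" ]; then
        echo "WARNING: document count unchanged at ${current} since last check"
        stalled=1
    fi
    echo "$current" > "$state_file"
    return "$stalled"
}

# Example wiring (commented out so the snippet makes no network call).
# INDEX_NAME and the state file path are placeholders.
# COUNT=$(curl -s -u "user:password" "http://localhost:9200/_cat/count/INDEX_NAME?h=count")
# detect_stalled_count /var/tmp/es_doc_count.state "$COUNT"
```

A flat count is only a hint: treat it as a trigger to look at Magento's indexers rather than as proof of failure.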
OVH Specifics and Network Monitoring
When running on OVH, network latency and connectivity between your Magento application servers and the Elasticsearch cluster are critical. Ensure your security groups and firewall rules on OVH are configured to allow traffic on port 9200 (and 9300 for inter-node communication if applicable) between your instances. Network issues can manifest as timeouts when querying Elasticsearch, leading to Magento errors.
Basic Connectivity Test: From your Magento application server, a simple `ping` to the Elasticsearch server’s IP (if ICMP is allowed) or a `curl` to a known endpoint can verify basic network reachability.
# From Magento App Server
curl -s -o /dev/null -w "%{http_code}\n" -u "user:password" "http://ELASTICSEARCH_IP:9200/_cluster/health"
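If this command prints anything other than 200, something is wrong. A code of 000 means `curl` received no HTTP response at all (connection refused, timeout, or DNS failure), which points to a network or firewall issue on OVH; check your OVH Control Panel for instance firewall rules and security group configurations. A 401 or 403, by contrast, means the network path works and the credentials or permissions are the problem.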
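To make the result easier to act on in a script, the printed code can be translated into a short diagnosis. This is a minimal sketch; the helper name `classify_http_code` and its output labels are invented for illustration, and the 5-second timeout is an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical helper: turn the HTTP code printed by curl's -w "%{http_code}"
# into a short diagnosis.
classify_http_code() {
    case "$1" in
        200)     echo "ok" ;;
        000)     echo "no-response" ;;   # refused, timed out, or DNS failure
        401|403) echo "auth-error" ;;    # network path works, credentials do not
        *)       echo "unexpected" ;;
    esac
}

# Example wiring (commented out so the snippet makes no network call).
# --max-time caps the whole request at 5 seconds.
# CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 -u "user:password" \
#     "http://ELASTICSEARCH_IP:9200/_cluster/health")
# echo "Elasticsearch reachability: $(classify_http_code "$CODE")"
```

Adding `--max-time` matters in cron jobs, where a hanging connection would otherwise let check processes pile up.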
OVH Instance Monitoring: Leverage OVH’s built-in monitoring tools in the Control Panel. These provide insights into CPU, RAM, disk I/O, and network traffic for your instances. Correlate spikes or sustained high usage with Elasticsearch performance issues.
Integrating with a Full Monitoring Stack
While the `curl` and `jq` approach is excellent for targeted checks and scripting, it’s not a replacement for a comprehensive monitoring solution. Tools like Prometheus with the Elasticsearch Exporter, Zabbix, or Datadog offer:
- Centralized dashboards for visualizing metrics over time.
- Alerting rules with sophisticated notification channels (email, Slack, PagerDuty).
- Historical data retention for trend analysis and capacity planning.
- Automated discovery of nodes and services.
- Log aggregation and analysis.
Your `curl` scripts can feed data into these systems. For example, a script that exits with a non-zero status can be run by Nagios-compatible agents (like NRPE), while a script that prints metrics in the Prometheus text format can be picked up by the `textfile` collector of Prometheus’s `node_exporter`.
Example using Prometheus `textfile` collector:
Place the following script in /etc/node_exporter/textfile_collector/es_health.sh (ensure the directory exists and is readable by the user running `node_exporter`):
#!/bin/bash
ES_HOST="localhost:9200"
ES_USER="user"
ES_PASS="password"
# Query the health endpoint once and reuse the response for both metrics
HEALTH_JSON=$(curl -s -u "${ES_USER}:${ES_PASS}" "http://${ES_HOST}/_cluster/health")
HEALTH_STATUS=$(echo "$HEALTH_JSON" | jq -r '.status')
# Emit one series per possible status: 1 for the current status, 0 for the others.
# Each metric name and label combination must appear only once in the output.
for status in green yellow red; do
if [ "$HEALTH_STATUS" = "$status" ]; then value=1; else value=0; fi
echo "elasticsearch_cluster_health_status{status=\"${status}\"} ${value}"
done
# Add more metrics as needed, e.g., unassigned shards
UNASSIGNED_SHARDS=$(echo "$HEALTH_JSON" | jq -r '.unassigned_shards')
echo "elasticsearch_cluster_unassigned_shards ${UNASSIGNED_SHARDS}"
Make it executable and ensure it runs periodically (e.g., via cron, updating a file in /var/lib/node_exporter/textfile_collector/):
chmod +x es_health.sh
# Add to cron: */1 * * * * /path/to/es_health.sh > /var/lib/node_exporter/textfile_collector/es_health.prom
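Provided `node_exporter` is started with `--collector.textfile.directory` pointing at that directory, Prometheus will then scrape the `elasticsearch_cluster_health_status` and `elasticsearch_cluster_unassigned_shards` metrics.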
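Redirecting output straight into the `.prom` file, as the cron line above does, leaves a short window in which `node_exporter` could read a half-written file. A common remedy is to write to a temporary file and rename it into place. The following is a sketch of that pattern; the helper name `write_prom_atomically` is hypothetical, and it would need to live in a wrapper script, since cron cannot call a shell function directly.

```shell
#!/bin/bash
# Hypothetical helper: read metrics on stdin and publish them atomically.
# The temp file lives in the target's directory so that mv is a rename
# on the same filesystem rather than a copy.
# Args: <target .prom file>
write_prom_atomically() {
    local target="$1" tmp
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if cat > "$tmp"; then
        chmod 644 "$tmp"        # mktemp creates the file with mode 600
        mv "$tmp" "$target"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example usage inside a wrapper script (paths follow the ones used above):
# /path/to/es_health.sh | write_prom_atomically /var/lib/node_exporter/textfile_collector/es_health.prom
```

Because the temporary file name does not end in `.prom`, the textfile collector ignores it until the rename completes.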