Server Monitoring Best Practices: Keeping Your PHP App and Elasticsearch Clusters Alive on DigitalOcean

Proactive Elasticsearch Health Checks with Custom Scripts

Relying solely on DigitalOcean’s basic droplet metrics for Elasticsearch clusters is a recipe for disaster. Elasticsearch’s internal state is far more nuanced. We need to go deeper. A robust monitoring strategy involves custom scripts that query Elasticsearch’s APIs directly, providing insights into shard health, cluster status, and resource utilization beyond what the OS level exposes.

Here’s a Python script that checks for unassigned shards, any non-green cluster status, and high JVM heap usage on each node. Schedule it via cron on a dedicated monitoring node or on one of the Elasticsearch nodes themselves (a separate node is preferred, because a check that runs on a failing node fails with it).

Elasticsearch Health Check Script (Python)

import requests
import sys
import time

# --- Configuration ---
ELASTICSEARCH_HOST = "http://localhost:9200"  # Or your Elasticsearch endpoint
ALERT_THRESHOLD_JVM_HEAP_PERCENT = 85  # Alert if JVM heap usage exceeds this percentage
ALERT_THRESHOLD_UNASSIGNED_SHARDS = 0  # Alert if there are any unassigned shards
REQUIRED_CLUSTER_STATUS = "green"  # Acceptable cluster status

# --- Functions ---
def check_cluster_health():
    try:
        response = requests.get(f"{ELASTICSEARCH_HOST}/_cluster/health", timeout=10)
        response.raise_for_status()  # Raise an exception for bad status codes
        health_data = response.json()
        return health_data
    except requests.exceptions.RequestException as e:
        print(f"ERROR: Could not connect to Elasticsearch cluster health API: {e}", file=sys.stderr)
        return None

def check_node_stats():
    try:
        response = requests.get(f"{ELASTICSEARCH_HOST}/_nodes/stats/jvm", timeout=10)
        response.raise_for_status()
        stats_data = response.json()
        return stats_data
    except requests.exceptions.RequestException as e:
        print(f"ERROR: Could not connect to Elasticsearch nodes stats API: {e}", file=sys.stderr)
        return None

def alert(message):
    # In a production environment, this would integrate with PagerDuty, Slack, etc.
    # For now, we'll just print to stderr.
    print(f"ALERT: {message}", file=sys.stderr)

# --- Main Logic ---
if __name__ == "__main__":
    print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] Running Elasticsearch health check...")

    # 1. Check Cluster Health
    health = check_cluster_health()
    if health:
        cluster_status = health.get("status")
        unassigned_shards = health.get("unassigned_shards", 0)
        initializing_shards = health.get("initializing_shards", 0)
        relocating_shards = health.get("relocating_shards", 0)

        print(f"  Cluster Status: {cluster_status} ({unassigned_shards} unassigned, {initializing_shards} initializing, {relocating_shards} relocating)")

        if cluster_status != REQUIRED_CLUSTER_STATUS:
            alert(f"Elasticsearch cluster status is '{cluster_status}', expected '{REQUIRED_CLUSTER_STATUS}'. Unassigned shards: {unassigned_shards}")

        if unassigned_shards > ALERT_THRESHOLD_UNASSIGNED_SHARDS:
            alert(f"Found {unassigned_shards} unassigned shards. Threshold is {ALERT_THRESHOLD_UNASSIGNED_SHARDS}.")

    # 2. Check Node JVM Heap Usage
    node_stats = check_node_stats()
    if node_stats:
        nodes = node_stats.get("nodes", {})
        for node_id, stats in nodes.items():
            node_name = stats.get("name", node_id)
            jvm_heap_used_percent = stats.get("jvm", {}).get("heap_used_percent")
            jvm_heap_max_bytes = stats.get("jvm", {}).get("heap_max_in_bytes", 0)
            jvm_heap_used_bytes = stats.get("jvm", {}).get("heap_used_in_bytes", 0)

            if jvm_heap_used_percent is not None:
                print(f"  Node '{node_name}': JVM Heap Usage = {jvm_heap_used_percent}% ({jvm_heap_used_bytes / (1024*1024):.2f}MB / {jvm_heap_max_bytes / (1024*1024):.2f}MB)")
                if jvm_heap_used_percent > ALERT_THRESHOLD_JVM_HEAP_PERCENT:
                    alert(f"Node '{node_name}' JVM heap usage is at {jvm_heap_used_percent}%, exceeding threshold of {ALERT_THRESHOLD_JVM_HEAP_PERCENT}%.")
            else:
                print(f"  Node '{node_name}': JVM heap usage data not available.")

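    print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] Elasticsearch health check finished.")
    # Exit non-zero if either API call failed, so cron wrappers can detect a broken check.
    # Alerts themselves are reported through alert() and do not change the exit code.
    sys.exit(1 if health is None or node_stats is None else 0)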

To integrate this script:

Save the script as es_health_check.py.
Install the requests library: pip install requests.
Configure the ELASTICSEARCH_HOST variable.
Schedule it using cron: crontab -e and add a line like: */5 * * * * /usr/bin/python3 /path/to/your/es_health_check.py >> /var/log/es_health_check.log 2>&1. This runs it every 5 minutes and logs output.
Modify the alert() function to send notifications to your preferred alerting system (e.g., PagerDuty, Slack webhook, email).
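
As one way to approach the last step, here is a minimal sketch of an alert() replacement that forwards messages to a Slack incoming webhook using only the standard library. The SLACK_WEBHOOK_URL value and the build_slack_payload helper are placeholders invented for this example; the only Slack behavior assumed is that an incoming webhook accepts a JSON body with a "text" field.

```python
import json
import sys
import urllib.request

# Hypothetical placeholder: replace with your own Slack incoming-webhook URL.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/REPLACE/WITH/YOURS"


def build_slack_payload(message):
    """Return the JSON request body for a Slack incoming webhook, as bytes."""
    return json.dumps({"text": f"ALERT: {message}"}).encode("utf-8")


def alert(message, webhook_url=SLACK_WEBHOOK_URL):
    """Print the alert locally, then try to forward it to Slack.

    Returns True if Slack answered with HTTP 200, False otherwise.
    """
    print(f"ALERT: {message}", file=sys.stderr)
    request = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(message),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        print(f"WARNING: could not deliver alert to Slack: {exc}", file=sys.stderr)
        return False
```

Keeping the stderr print means the cron log still records every alert even when the webhook call fails.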

Monitoring PHP Application Performance with New Relic

For PHP applications, especially those running on DigitalOcean droplets, New Relic provides invaluable deep-dive performance insights. It goes beyond basic CPU/memory monitoring to trace transactions, identify slow database queries, and pinpoint bottlenecks within your PHP code.

The core of New Relic’s PHP monitoring is the agent. Installation typically involves downloading and running an installer script provided by New Relic. Once installed, it needs to be enabled in your php.ini configuration.

New Relic Agent Configuration

After installing the agent (on Debian or Ubuntu this typically means installing the newrelic-php5 package from New Relic’s apt repository and then running sudo newrelic-install install; check New Relic’s documentation for the current steps), you’ll need to ensure the agent is loaded and configured. This usually involves editing your main php.ini file or a dedicated New Relic configuration file.

; In your php.ini or a file included by it (e.g., /etc/php/7.4/fpm/conf.d/newrelic.ini)

; Enable the New Relic extension
extension=newrelic.so

; Your New Relic license key
newrelic.license_key = "YOUR_LICENSE_KEY_HERE"

; The name of your application as it will appear in New Relic
newrelic.app_name = "YourAppName-Production"

; Optional: Set to true to enable high-security mode (disables certain data collection)
; newrelic.high_security = true

; Optional: If the agent's daemon must reach New Relic through an egress proxy
; (credentials, if any, are part of the same value)
; newrelic.daemon.proxy = "proxy_user:proxy_password@proxy.example.com:8080"

; Optional: Log level for New Relic agent
; newrelic.loglevel = "info"

After modifying php.ini, you must restart your PHP-FPM service (or Apache, if using mod_php) for the changes to take effect.

sudo systemctl restart php7.4-fpm
# Or for Apache:
# sudo systemctl restart apache2

Leveraging New Relic for PHP Performance Analysis

Once the agent is active, New Relic will automatically start collecting data. Key areas to monitor include:

Transactions: Identify the slowest web requests and background jobs. Look for transactions with high average response times or high throughput that are consuming excessive resources.
Databases: Analyze slow database queries. New Relic can show you the exact SQL statements that are taking too long, helping you optimize indexes or query logic.
External Services: Monitor calls to external APIs or services. High latency here can significantly impact your application’s perceived performance.
Errors: Track PHP exceptions and errors. New Relic provides stack traces and context for debugging.
JVM (for Elasticsearch): The PHP agent does not monitor the Elasticsearch JVM. For that, use the New Relic Infrastructure agent with its Elasticsearch integration.

When investigating a slow transaction, drill down into the “Breakdown” tab. This will show you the percentage of time spent in different parts of your application stack (e.g., framework code, database calls, external calls, custom instrumentation). This is crucial for pinpointing where to focus optimization efforts.

DigitalOcean Droplet Metrics and Alerting with Prometheus & Alertmanager

While New Relic and custom Elasticsearch scripts cover application-level and cluster-specific health, we still need to monitor the underlying infrastructure: the DigitalOcean droplets. Prometheus, coupled with Alertmanager, offers a powerful, open-source solution for collecting metrics and managing alerts.

The standard way to get system metrics from Linux hosts into Prometheus is via the node_exporter. This exporter runs as a service on each droplet and exposes metrics like CPU usage, memory, disk I/O, and network traffic via an HTTP endpoint.

Deploying Node Exporter on DigitalOcean Droplets

You can download pre-compiled binaries or build node_exporter from source. For simplicity, let’s use a common installation method.

# On each DigitalOcean droplet you want to monitor
NODE_EXPORTER_VERSION="1.5.0" # Check for the latest version
wget "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
tar xvfz "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
sudo mv "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter" /usr/local/bin/
rm -rf "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64" "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"

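# Create a systemd service file
# (the quoted 'EOF' stops the shell from expanding $$ and from joining the backslash-continued lines)
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'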
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($$|/.*)" \
    --collector.netdev.device-exclude="^(veth|lo|docker)" \
    --web.listen-address=":9100"

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd, enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter

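Ensure that port 9100 is reachable from your Prometheus server, and only from it. node_exporter serves its metrics without authentication by default, so restrict the port to the Prometheus droplet’s IP (or your VPC network) in your DigitalOcean Cloud Firewall or ufw rules. If you are running Prometheus on a separate DigitalOcean droplet, it also needs outbound access to port 9100 on every target.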
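
Before adding a droplet to Prometheus, it can help to confirm from the Prometheus host that the exporter actually answers. The sketch below fetches /metrics and pulls out the samples for one metric name. The function names fetch_metrics and parse_metric are made up for this example, and the parser is deliberately simple: it assumes samples without trailing timestamps, which is what node_exporter normally emits.

```python
import urllib.request


def fetch_metrics(host, port=9100, timeout=5):
    """Download the raw exposition text from a node_exporter instance."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metric(text, name):
    """Return {label_string: value} for every sample of `name`.

    Samples without labels are stored under the empty string.
    """
    samples = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        base, _, labels = metric.partition("{")
        if base == name:
            samples[labels.rstrip("}")] = float(value)
    return samples


# Example usage (replace the host with one of your droplets):
# text = fetch_metrics("your_droplet_ip_1")
# print(parse_metric(text, "node_load1"))
```

A connection error or timeout from fetch_metrics usually points at the firewall rule rather than at the exporter itself.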

Prometheus Configuration for Scraping

Your Prometheus server’s prometheus.yml configuration needs to include scrape jobs for your droplets.

scrape_configs:
  - job_name: 'digitalocean_nodes'
    static_configs:
      - targets:
          - 'your_droplet_ip_1:9100'
          - 'your_droplet_ip_2:9100'
          - 'your_droplet_ip_3:9100'
        labels:
          env: 'production'
          role: 'webserver' # Or 'elasticsearch_node' etc.

  - job_name: 'elasticsearch_cluster'
    # Elasticsearch does not serve Prometheus metrics on port 9200, so scraping it
    # directly will fail. Point this job at an exporter instead; the targets below
    # assume elasticsearch_exporter listening on its default port, 9114.
    static_configs:
      - targets:
          - 'elasticsearch_node_1:9114'
          - 'elasticsearch_node_2:9114'
          - 'elasticsearch_node_3:9114'
        labels:
          env: 'production'
          role: 'elasticsearch'
    # For basic health checks, the custom script is often sufficient.

  # Add scrape jobs for other services as needed (e.g., your PHP-FPM exporter)

After updating prometheus.yml, reload the Prometheus configuration:

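# Assuming Prometheus runs as a systemd service whose unit defines ExecReload.
# Otherwise, send the process a SIGHUP: sudo kill -HUP "$(pidof prometheus)"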
sudo systemctl reload prometheus

Setting Up Alertmanager for Notifications

Alertmanager handles deduplication, grouping, and routing of alerts generated by Prometheus. A basic Alertmanager configuration file (alertmanager.yml) might look like this:

global:
  # How long Alertmanager waits before marking an alert as resolved
  # when it stops receiving updates for it.
  # Global SMTP, Slack, or PagerDuty settings would also go here in a real setup.
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'env', 'role']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver

receivers:
  - name: 'default-receiver'
    webhook_configs:
      - url: 'http://your-alert-handler-service:8080/alerts' # Replace with your actual webhook endpoint

# Example of routing specific alerts: nest `routes` under the `route` block above
# and append the extra receiver to the existing `receivers` list.
# route:
#   ...
#   routes:
#     - receiver: 'elasticsearch-alerts'
#       matchers:
#         - role="elasticsearch"
#       continue: true # Keep evaluating later sibling routes after this one matches
#
# receivers:
#   ...
#   - name: 'elasticsearch-alerts'
#     slack_configs:
#       - api_url: 'YOUR_SLACK_WEBHOOK_URL'
#         channel: '#elasticsearch-alerts'
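
The webhook receiver above posts to /alerts on port 8080, so something has to listen there. Below is a minimal sketch of such a handler using only Python’s standard library. It relies on the documented shape of Alertmanager’s webhook payload (a top-level "alerts" list whose items carry "status", "labels", and "annotations"); the names summarize_alerts, AlertHandler, and serve are invented for this example, and a production handler would add authentication and forward the messages somewhere useful.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alerts(payload):
    """Turn an Alertmanager webhook payload into one readable line per alert."""
    lines = []
    for item in payload.get("alerts", []):
        labels = item.get("labels", {})
        annotations = item.get("annotations", {})
        lines.append(
            "[{status}] {name} on {instance}: {summary}".format(
                status=item.get("status", "unknown").upper(),
                name=labels.get("alertname", "unnamed"),
                instance=labels.get("instance", "n/a"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/alerts":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        for line in summarize_alerts(payload):
            print(line, flush=True)
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Block forever, handling Alertmanager webhook calls on the given port."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling serve() starts the listener; Alertmanager treats any 2xx response as a successful delivery.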

You’ll need to configure Prometheus to send alerts to Alertmanager:

# In prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager_host:9093' # Your Alertmanager instance address

Example Prometheus Alerting Rules

Create a rule file (e.g., /etc/prometheus/rules/node_alerts.yml) and add it to your prometheus.yml under the rule_files directive.

groups:
  - name: node_alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "Instance {{ $labels.instance }} has been running at over 90% CPU for the last 10 minutes."

      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Instance {{ $labels.instance }} has had less than 15% disk space remaining on the root filesystem for the last 15 minutes."

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Instance {{ $labels.instance }} has been using over 85% of memory for the last 10 minutes."

  - name: elasticsearch_alerts
    rules:
      - alert: ElasticsearchClusterRed
        # This rule assumes you have an exporter that exposes cluster status,
        # or you've adapted the Python script to expose metrics to Prometheus.
        # For simplicity, let's assume a hypothetical metric `es_cluster_status_code`
        # where 0=green, 1=yellow, 2=red.
        # If using the Python script, you'd need to modify it to expose metrics.
        expr: es_cluster_status_code{role="elasticsearch"} > 1 # Alert on RED status
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster is RED on {{ $labels.instance }}"
          description: "The Elasticsearch cluster on {{ $labels.instance }} is in a RED state. Shard allocation issues may be present."

      - alert: ElasticsearchUnassignedShards
        # Similar assumption as above for `es_unassigned_shards` metric
        expr: es_unassigned_shards{role="elasticsearch"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unassigned Elasticsearch shards detected on {{ $labels.instance }}"
          description: "There are {{ $value }} unassigned shards in the Elasticsearch cluster on {{ $labels.instance }}."
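
To see what the HighCpuUsage expression computes, here is a minimal sketch of the same arithmetic in plain Python. It assumes exactly two samples of the idle counter per core and ignores the extrapolation that Prometheus’s rate() performs, so treat it as an illustration rather than an exact reimplementation. The function name cpu_usage_percent is made up for this example.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate `100 - avg(rate(idle[window])) * 100` for one instance.

    idle_start and idle_end hold, per core, the value of
    node_cpu_seconds_total{mode="idle"} at the start and end of the window.
    """
    # rate(): idle seconds gained per second of wall-clock time, per core
    idle_rates = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    # avg by (instance): mean idle fraction across all cores
    avg_idle_fraction = sum(idle_rates) / len(idle_rates)
    # everything that is not idle counts as usage
    return 100 - avg_idle_fraction * 100


# Two cores over a 5-minute (300 s) window:
# core 0 was idle for 30 s, core 1 for 15 s.
usage = cpu_usage_percent([1000.0, 2000.0], [1030.0, 2015.0], 300)
print(f"{usage:.1f}% busy, alert condition met: {usage > 90}")
```

In this example the cores were idle 10% and 5% of the time, so the instance was about 92.5% busy and the expression would be true; the alert only fires if that stays true for the full 10 minutes set by the for clause.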

Remember to add the rule file to your prometheus.yml:

# In prometheus.yml
rule_files:
  - '/etc/prometheus/rules/*.yml'
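
The Elasticsearch rules above rely on the hypothetical es_cluster_status_code and es_unassigned_shards metrics. One way to produce them without a dedicated exporter is to have the health check script write a .prom file for node_exporter’s textfile collector, which is enabled with the --collector.textfile.directory flag. The sketch below assumes that approach; the function names, the file name es_health.prom, and the choice to treat an unknown status as red are all decisions made for this example.

```python
import os
import tempfile

STATUS_CODES = {"green": 0, "yellow": 1, "red": 2}


def render_metrics(health):
    """Render a /_cluster/health response as Prometheus exposition text."""
    # An unknown or missing status is treated as red so that it still alerts.
    status_code = STATUS_CODES.get(health.get("status"), 2)
    unassigned = int(health.get("unassigned_shards", 0))
    return (
        "# HELP es_cluster_status_code Cluster status (0=green, 1=yellow, 2=red).\n"
        "# TYPE es_cluster_status_code gauge\n"
        f"es_cluster_status_code {status_code}\n"
        "# HELP es_unassigned_shards Number of unassigned shards.\n"
        "# TYPE es_unassigned_shards gauge\n"
        f"es_unassigned_shards {unassigned}\n"
    )


def write_textfile(health, directory, filename="es_health.prom"):
    """Write the metrics atomically so a scrape never sees a partial file."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(render_metrics(health))
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path
```

Note that the rules filter on role="elasticsearch", so the droplet running this script must be scraped by a job that attaches that label.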

Correlating Logs with Application Performance

Metrics tell you *what* is happening, but logs tell you *why*. For a comprehensive monitoring strategy, logs are indispensable. Centralizing logs from your PHP application and Elasticsearch nodes allows for faster debugging and correlation between events.

A common stack for log aggregation is the ELK stack (Elasticsearch, Logstash, Kibana) or its more modern successor, the Elastic Stack (Elasticsearch, Beats, Logstash, Kibana). For DigitalOcean deployments, using Filebeat to collect logs and send them to a central Elasticsearch cluster is a highly effective approach.

Filebeat Configuration for PHP and Elasticsearch Logs

On each server (PHP app servers and Elasticsearch nodes), install Filebeat. Then, configure it to tail relevant log files.

# /etc/filebeat/filebeat.yml

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/php/app.log  # Your PHP application logs
      # - /var/log/apache2/error.log # If using Apache
      # - /var/log/nginx/error.log   # If using Nginx
    fields_under_root: true
    fields:
      environment: production
      app_name: "YourAppName"
    # Optional: Parse JSON logs
    # json:
    #   keys_under_root: true
    #   overwrite_keys: true

  - type: log
    enabled: true
    paths:
      - /var/log/elasticsearch/your_cluster_name.log # Adjust path for Elasticsearch logs
    fields_under_root: true
    fields:
      environment: production
      app_name: "Elasticsearch"
    # Elasticsearch logs are often structured, consider JSON parsing if applicable

output.elasticsearch:
  hosts: ["your_central_elasticsearch_host:9200"]
  # If using authentication:
  # username: "elastic"
  # password: "changeme"

# Optional: If you want to process logs with Logstash before sending to Elasticsearch
# output.logstash:
#   hosts: ["your_logstash_host:5044"]

# Optional: Enable monitoring of Filebeat itself
# monitoring.enabled: true
# monitoring.elasticsearch:
#   hosts: ["your_central_elasticsearch_host:9200"]

After configuring filebeat.yml, restart the Filebeat service:

sudo systemctl restart filebeat
sudo systemctl status filebeat

Kibana for Log Visualization and Analysis

With logs flowing into Elasticsearch, Kibana becomes your primary tool for searching, visualizing, and analyzing them. Create dashboards to:

Monitor PHP error rates and types.
Track specific transaction IDs across your application and Elasticsearch logs.
Visualize Elasticsearch cluster events (e.g., shard rebalancing, indexing failures).
Correlate application errors with specific Elasticsearch queries or responses.

For instance, if New Relic flags a slow transaction, you can filter logs in Kibana by a request or correlation ID, provided your PHP application writes that ID to its own logs and passes it along to Elasticsearch (for example in the X-Opaque-Id header, which Elasticsearch includes in its slow logs). Searching both log sources for the same ID gives you the context needed to diagnose the root cause.
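
The same lookup can be scripted against the central cluster when Kibana is not at hand. The sketch below builds a search body that filters on a single ID and sorts the hits chronologically. The field name transaction_id is an assumption (use whatever field your application actually logs, mapped as a keyword), filebeat-* is Filebeat’s default index pattern, and the function names are invented for this example.

```python
import json
import urllib.request


def build_correlation_query(correlation_id, field="transaction_id", size=100):
    """Build a search body matching every log line that carries the given ID."""
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "asc"}}],  # oldest first, to read as a timeline
        "query": {"bool": {"filter": [{"term": {field: correlation_id}}]}},
    }


def search_logs(host, correlation_id, index="filebeat-*", timeout=10):
    """POST the query to the _search endpoint and return the list of hits."""
    request = urllib.request.Request(
        f"{host}/{index}/_search",
        data=json.dumps(build_correlation_query(correlation_id)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["hits"]["hits"]


# Example usage (host and ID are placeholders):
# for hit in search_logs("http://your_central_elasticsearch_host:9200", "abc123"):
#     print(hit["_source"].get("app_name"), hit["_source"].get("message"))
```

Because both Filebeat inputs set app_name, the output shows at a glance which lines came from the PHP application and which from Elasticsearch.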