Server Monitoring Best Practices: Keeping Your PHP App and Elasticsearch Clusters Alive on AWS

Establishing a Robust Monitoring Foundation with CloudWatch

For any production PHP application hosted on AWS, a comprehensive monitoring strategy is paramount. This begins with leveraging Amazon CloudWatch effectively. Beyond the default metrics, we need to push custom metrics and logs to gain granular insights into application performance and infrastructure health.

A common pitfall is relying solely on CPU utilization and network traffic. While important, these high-level indicators often mask deeper application-level issues. We need to monitor things like PHP-FPM process counts, request latency, error rates, and memory usage per process.

Custom Metrics for PHP-FPM and Application Performance

The CloudWatch Agent can collect custom metrics, but for PHP-FPM a small script that polls the status page and publishes key values through the CloudWatch API is often simpler. Let’s consider a Python script that reports active processes, idle processes, and slow requests.

First, ensure PHP-FPM’s `pm.status_path` is configured and accessible (e.g., `/fpm-status`). If not, you might need to enable it in your PHP-FPM pool configuration:

; /etc/php/7.4/fpm/pool.d/www.conf
pm.status_path = /fpm-status

Then, create a Python script (e.g., `/opt/scripts/monitor_php_fpm.py`) to collect and send metrics to CloudWatch:

import boto3
import requests
import os

# --- Configuration ---
PHP_FPM_STATUS_URL = "http://localhost/fpm-status" # Adjust if not on localhost or using a different path
NAMESPACE = "MyPHPApp"
REGION = os.environ.get("AWS_REGION", "us-east-1") # Get region from environment or default
# --- End Configuration ---

cloudwatch = boto3.client('cloudwatch', region_name=REGION)

def get_php_fpm_stats():
    try:
        response = requests.get(PHP_FPM_STATUS_URL, timeout=5)
        response.raise_for_status() # Raise an exception for bad status codes
        data = response.text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching PHP-FPM status: {e}")
        return None

    stats = {}
    lines = data.splitlines()
    for line in lines:
        if ":" in line:
            key, value = line.split(":", 1)
            key = key.strip()
            value = value.strip()
            if key == "pool":
                stats.setdefault("pool", []).append(value)
            elif key == "process manager":
                stats["process_manager"] = value
            elif key == "start time":
                stats["start_time"] = value
            elif key == "accepted conn":
                stats["accepted_conn"] = int(value)
            elif key == "active processes":
                stats["active_processes"] = int(value)
            elif key == "idle processes":
                stats["idle_processes"] = int(value)
            elif key == "requests":
                # Only appears per process in ?full output; the pool summary has no "requests" line
                stats["requests"] = int(value)
            elif key == "slow requests":
                stats["slow_requests"] = int(value)
    return stats

def put_metric(metric_name, value, dimensions=None):
    try:
        cloudwatch.put_metric_data(
            Namespace=NAMESPACE,
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Value': value,
                    'Unit': 'Count', # Adjust unit as needed (e.g., 'Seconds' for latency)
                    'Dimensions': dimensions if dimensions else []
                },
            ]
        )
        print(f"Put metric: {metric_name}={value}")
    except Exception as e:
        print(f"Error putting metric {metric_name}: {e}")

if __name__ == "__main__":
    stats = get_php_fpm_stats()
    if stats:
        # Example: Sending metrics for the 'www' pool (adjust if you have multiple pools)
        # In a real-world scenario, you'd iterate through pools if multiple exist
        pool_name = "www" # Assuming a single pool named 'www'
        dimensions = [{'Name': 'Pool', 'Value': pool_name}]

        if "active_processes" in stats:
            put_metric("ActiveProcesses", stats["active_processes"], dimensions)
        if "idle_processes" in stats:
            put_metric("IdleProcesses", stats["idle_processes"], dimensions)
        if "slow_requests" in stats:
            put_metric("SlowRequests", stats["slow_requests"], dimensions)
        if "requests" in stats:
            put_metric("TotalRequests", stats["requests"], dimensions)
        if "accepted_conn" in stats:
            put_metric("AcceptedConnections", stats["accepted_conn"], dimensions)

        # Example: Application-level metrics (e.g., request latency, error count)
        # These would typically be instrumented within your PHP application itself
        # For demonstration, let's assume we have these values available
        # app_request_latency_seconds = 0.15
        # app_error_count = 2
        # put_metric("RequestLatency", app_request_latency_seconds, dimensions + [{'Name': 'Endpoint', 'Value': '/api/v1/users'}])
        # put_metric("ErrorCount", app_error_count, dimensions + [{'Name': 'ErrorType', 'Value': '5xx'}])
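
PHP-FPM's status page can also return JSON when `?json` is appended to the URL, which avoids the line-by-line parsing above. The sketch below assumes that output and builds a single `MetricData` list, so all values can go out in one `put_metric_data` call instead of one call per metric. The helper name `build_metric_data` and the metric names are illustrative choices, not part of any library.

```python
import json

# Status-page keys (as PHP-FPM names them) mapped to CloudWatch metric names.
# The metric names on the right are illustrative choices.
FPM_KEYS = {
    "active processes": "ActiveProcesses",
    "idle processes": "IdleProcesses",
    "listen queue": "ListenQueue",
    "max children reached": "MaxChildrenReached",
    "slow requests": "SlowRequests",
}

def build_metric_data(status_json, pool_name):
    """Turn the JSON status document into a MetricData list for one API call."""
    status = json.loads(status_json)
    dimensions = [{"Name": "Pool", "Value": pool_name}]
    return [
        {
            "MetricName": metric,
            "Value": float(status[key]),
            "Unit": "Count",
            "Dimensions": dimensions,
        }
        for key, metric in FPM_KEYS.items()
        if key in status  # skip keys this PHP-FPM version does not report
    ]

if __name__ == "__main__":
    sample = '{"pool": "www", "active processes": 3, "idle processes": 5, "slow requests": 0}'
    for datum in build_metric_data(sample, "www"):
        print(datum["MetricName"], datum["Value"])
```

The resulting list can be passed as `MetricData` to the same `cloudwatch.put_metric_data` call the script already uses.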




To run this script periodically, we can use cron. Configure cron to run this script every minute:



# crontab -e
* * * * * /usr/bin/python3 /opt/scripts/monitor_php_fpm.py >> /var/log/monitor_php_fpm.log 2>&1



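The script above publishes directly through the CloudWatch API, so the agent is not involved in that path. The CloudWatch Agent complements it by shipping log files and by accepting metrics your application emits via StatsD, Prometheus, or the embedded metric format (EMF). Create a configuration file (e.g., `/opt/aws/amazon-cloudwatch-agent/bin/config.json`):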



{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyPHPApp",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "statsd": {
        "service_address": "127.0.0.1:8125",
        "metrics_collection_interval": 60
      }
    }
  },
  "logs": {
    "metrics_collected": {
      "prometheus": {
        "prometheus_config_path": "/opt/aws/amazon-cloudwatch-agent/prometheus_config.yml",
        "log_group_name": "/aws/ecs/containerinsights/my-app"
      },
      "emf": {}
    },
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/php-fpm/www-error.log",
            "log_group_name": "/aws/php-fpm/www/error",
            "log_stream_name": "{instance_id}/php-fpm-error"
          },
          {
            "file_path": "/var/log/php-fpm/www-access.log",
            "log_group_name": "/aws/php-fpm/www/access",
            "log_stream_name": "{instance_id}/php-fpm-access"
          }
        ]
      }
    }
  }
}
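
The `statsd` listener in the configuration above accepts plain StatsD lines over UDP on port 8125. As a rough illustration of the wire format (a PHP application would normally use a StatsD client library rather than hand-rolled sockets), the sketch below formats one timer and one counter. The metric names are made up for the example.

```python
import socket

STATSD_ADDR = ("127.0.0.1", 8125)  # host and port of the agent's statsd listener

def format_statsd(name, value, metric_type):
    """Build a StatsD line: <name>:<value>|<type> (c=counter, g=gauge, ms=timer)."""
    if metric_type not in ("c", "g", "ms"):
        raise ValueError(f"unsupported StatsD type: {metric_type}")
    return f"{name}:{value}|{metric_type}"

def send_statsd(line, addr=STATSD_ADDR):
    """Send one line over UDP; delivery is fire-and-forget and not confirmed."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), addr)

if __name__ == "__main__":
    # Print the lines instead of sending them so the example runs anywhere.
    print(format_statsd("myapp.request_latency", 150, "ms"))
    print(format_statsd("myapp.errors_5xx", 1, "c"))
```

Passing either line to `send_statsd` would hand it to the agent, which aggregates the values over its collection interval before publishing them.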



And a Prometheus configuration file (`/opt/aws/amazon-cloudwatch-agent/prometheus_config.yml`) to scrape metrics from applications that expose them in Prometheus format (e.g., using a Prometheus client library in your PHP app):



global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'php_app'
    static_configs:
      - targets: ['localhost:9100'] # Assuming your PHP app exposes metrics on port 9100
        labels:
          application: 'my-php-app'



Start and enable the agent:



sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
sudo systemctl enable amazon-cloudwatch-agent
sudo systemctl start amazon-cloudwatch-agent



Log Aggregation and Analysis with CloudWatch Logs Insights



Beyond metrics, centralized logging is crucial. Configure the CloudWatch Agent to stream your PHP error logs, access logs, and any application-specific logs to CloudWatch Logs. This enables powerful querying with CloudWatch Logs Insights.



For PHP error logs, ensure they are configured to log to a file. In your `php.ini`:



error_log = /var/log/php-fpm/www-error.log



And for access logs (if using PHP-FPM's access log feature or a web server like Nginx):



; Example for PHP-FPM access log (less common, usually web server logs)
access.log = /var/log/php-fpm/www-access.log



The CloudWatch Agent configuration snippet above already includes directives for collecting these files. Once logs are flowing into CloudWatch, you can use Logs Insights to query them. For example, to find recent fatal errors (the query has no time filter, so set the time range to the last hour when you run it):



fields @timestamp, @message
| filter @message like 'Fatal error'
| sort @timestamp desc
| limit 50
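
The same query can be run from a script, for example in a daily report. The sketch below uses the CloudWatch Logs `start_query` and `get_query_results` operations through a boto3-style client that is passed in; the helper name `run_insights_query` and the polling interval are assumptions made for illustration.

```python
import time

FATAL_ERRORS_QUERY = """
fields @timestamp, @message
| filter @message like 'Fatal error'
| sort @timestamp desc
| limit 50
"""

def run_insights_query(logs_client, log_group, query, lookback_seconds=3600, poll_seconds=1.0):
    """Run a Logs Insights query over the last `lookback_seconds` and return rows as dicts."""
    now = int(time.time())
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=now - lookback_seconds,
        endTime=now,
        queryString=query,
    )["queryId"]

    # Queries run asynchronously, so poll until the status leaves Scheduled/Running.
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)

    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")

    # Each result row is a list of {"field": ..., "value": ...} pairs.
    return [{cell["field"]: cell["value"] for cell in row} for row in response["results"]]
```

With real credentials this might be called as `run_insights_query(boto3.client("logs"), "/aws/php-fpm/www/error", FATAL_ERRORS_QUERY)`.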



Or to analyze request latency for 5xx responses from Nginx access logs (assuming a custom `log_format` that appends `$request_time` and `$upstream_response_time` to the combined format; adjust the parse pattern to match yours):



# Assumed line layout:
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_time $upstream_response_time
parse @message '* - * [*] "* * *" * * "*" "*" * *' as client, remote_user, time_local, method, request, protocol, status, bytes, referer, user_agent, request_time, upstream_response_time
| filter status like /^5/ # Filter for 5xx errors
| stats avg(request_time) as avg_request_time, avg(upstream_response_time) as avg_upstream_response_time by status, bin(5m)



Monitoring Elasticsearch Clusters on AWS



Monitoring Elasticsearch, especially on AWS (whether self-managed on EC2 or using Amazon OpenSearch Service, formerly Elasticsearch Service), requires a different set of tools and considerations. The primary focus shifts to cluster health, shard status, indexing performance, and query latency.



Key Elasticsearch Metrics to Track



Regardless of deployment method, certain metrics are universally important:



Cluster Health: Status (green, yellow, red), number of nodes, number of shards (total, unassigned, relocating).
Node Stats: CPU utilization, JVM heap usage, disk I/O, disk space remaining.
Indexing Performance: Index rate (docs/sec), indexing latency (ms).
Search Performance: Search rate (queries/sec), search latency (ms).
JVM Metrics: Garbage collection activity, thread pool usage.
Shard Allocation: Number of shards that are unassigned or relocating.



Leveraging CloudWatch for OpenSearch Service



If you are using Amazon OpenSearch Service, AWS provides a set of built-in CloudWatch metrics. These are automatically collected and available in your OpenSearch Service domain's monitoring tab or directly via the CloudWatch API/console.



Essential OpenSearch Service metrics include:



ClusterStatus.red, ClusterStatus.yellow, ClusterStatus.green
Nodes
CPUUtilization, JVMMemoryPressure (data nodes)
MasterCPUUtilization, MasterJVMMemoryPressure
IndexingRate, SearchRate
IndexingLatency, SearchLatency
Shards.unassigned



You can create CloudWatch Alarms based on these metrics. For instance, an alarm on `ClusterStatus.red`, or on `Shards.unassigned` exceeding a small threshold (e.g., 0), is critical.



# Example CloudWatch Alarm creation via AWS CLI
# AWS/ES metrics carry two dimensions: DomainName and ClientId (your AWS account ID)
aws cloudwatch put-metric-alarm \
    --alarm-name "OpenSearch-Cluster-Red-Status" \
    --alarm-description "Alarm when OpenSearch cluster status is red" \
    --metric-name ClusterStatus.red \
    --namespace "AWS/ES" \
    --statistic Maximum \
    --period 300 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --dimensions Name=DomainName,Value=your-opensearch-domain-name Name=ClientId,Value=123456789012 \
    --evaluation-periods 1 \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:my-ops-topic
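
If alarms are managed from code rather than the CLI, the same definition can be expressed as keyword arguments for boto3's `put_metric_alarm`. This is a minimal sketch: the helper name `cluster_red_alarm_params` is hypothetical, it uses the `Maximum` statistic so that a single red data point in the period trips the alarm, and it includes the `ClientId` dimension (the AWS account ID) that `AWS/ES` metrics are published with.

```python
def cluster_red_alarm_params(domain_name, account_id, sns_topic_arn, period=300):
    """Keyword arguments for CloudWatch put_metric_alarm on ClusterStatus.red."""
    return {
        "AlarmName": f"OpenSearch-{domain_name}-Cluster-Red-Status",
        "AlarmDescription": "Alarm when OpenSearch cluster status is red",
        "Namespace": "AWS/ES",
        "MetricName": "ClusterStatus.red",
        # Both dimensions are needed for the alarm to match the published metric.
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": "Maximum",
        "Period": period,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }

if __name__ == "__main__":
    params = cluster_red_alarm_params(
        "your-opensearch-domain-name",
        "123456789012",
        "arn:aws:sns:us-east-1:123456789012:my-ops-topic",
    )
    print(params["AlarmName"], params["Dimensions"])
```

Creating the alarm would then be `boto3.client("cloudwatch").put_metric_alarm(**params)`.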



Monitoring Self-Managed Elasticsearch on EC2



For self-managed Elasticsearch clusters on EC2, you'll need to deploy agents to collect metrics. The CloudWatch Agent is again a good choice, but you'll also want to leverage Elasticsearch's own monitoring APIs and potentially tools like Prometheus with the Elasticsearch exporter.



1. Elasticsearch Monitoring APIs:



Elasticsearch exposes a wealth of information via its REST APIs. The `_cat` APIs are particularly useful for quick checks, while `_nodes/stats` and `_cluster/stats` provide detailed metrics.



# Check cluster health
curl -X GET "http://localhost:9200/_cluster/health?pretty"

# Get node statistics
curl -X GET "http://localhost:9200/_nodes/stats?pretty"

# Get cluster statistics
curl -X GET "http://localhost:9200/_cluster/stats?pretty"



You can script these API calls to push custom metrics to CloudWatch. A Python script similar to the PHP-FPM monitor can be adapted:



import boto3
import requests
import os

# --- Configuration ---
ES_HOST = "http://localhost:9200"
NAMESPACE = "MyElasticsearchCluster"
REGION = os.environ.get("AWS_REGION", "us-east-1")
CLUSTER_NAME = "my-es-cluster" # Identify your cluster
# --- End Configuration ---

cloudwatch = boto3.client('cloudwatch', region_name=REGION)

def get_es_stats():
    stats = {}
    try:
        # Cluster Health
        health_resp = requests.get(f"{ES_HOST}/_cluster/health", timeout=5)
        health_resp.raise_for_status()
        health_data = health_resp.json()
        stats["cluster_status"] = health_data["status"]
        stats["unassigned_shards"] = health_data["unassigned_shards"]
        stats["relocating_shards"] = health_data["relocating_shards"]
        stats["active_shards"] = health_data["active_shards"]
        # The health API has no "total_shards" field; derive it from the shard states
        # (relocating shards are already counted as active)
        stats["total_shards"] = (
            health_data["active_shards"]
            + health_data["initializing_shards"]
            + health_data["unassigned_shards"]
        )

        # Node Stats (aggregating across all nodes for simplicity, could be per-node)
        nodes_resp = requests.get(f"{ES_HOST}/_nodes/stats", timeout=5)
        nodes_resp.raise_for_status()
        nodes_data = nodes_resp.json()

        total_cpu = 0
        total_heap_used_percent = 0
        total_disk_used_percent = 0
        num_nodes = 0

        for node_id, node_info in nodes_data["nodes"].items():
            num_nodes += 1
            total_cpu += node_info["os"]["cpu"]["load_average"]["1m"] # 1m load average (reported on Linux)
            total_heap_used_percent += node_info["jvm"]["mem"]["heap_used_percent"]
            # fs.total reports byte counts only, so compute the used percentage here
            fs_total = node_info["fs"]["total"]
            used_bytes = fs_total["total_in_bytes"] - fs_total["available_in_bytes"]
            total_disk_used_percent += 100.0 * used_bytes / fs_total["total_in_bytes"]

        if num_nodes > 0:
            stats["avg_cpu_load_1m"] = total_cpu / num_nodes
            stats["avg_heap_used_percent"] = total_heap_used_percent / num_nodes
            stats["avg_disk_used_percent"] = total_disk_used_percent / num_nodes
            stats["node_count"] = num_nodes

        # Indexing/Search counters come from the index stats API;
        # _cluster/stats does not expose index_total or query_total
        index_stats_resp = requests.get(f"{ES_HOST}/_stats/indexing,search", timeout=5)
        index_stats_resp.raise_for_status()
        all_stats = index_stats_resp.json()["_all"]
        stats["total_indexing_rate"] = all_stats["primaries"]["indexing"]["index_total"] # Cumulative count on primaries, need rate calculation over time
        stats["total_search_rate"] = all_stats["total"]["search"]["query_total"] # Cumulative count

    except requests.exceptions.RequestException as e:
        print(f"Error fetching Elasticsearch stats: {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None
    return stats

def put_metric(metric_name, value, dimensions=None):
    try:
        cloudwatch.put_metric_data(
            Namespace=NAMESPACE,
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Value': value,
                    'Unit': 'Count', # Default unit, adjust as needed
                    'Dimensions': dimensions if dimensions else []
                },
            ]
        )
        print(f"Put metric: {metric_name}={value}")
    except Exception as e:
        print(f"Error putting metric {metric_name}: {e}")

if __name__ == "__main__":
    stats = get_es_stats()
    if stats:
        dimensions = [{'Name': 'ClusterName', 'Value': CLUSTER_NAME}]

        # Map ES status to numerical values for CloudWatch
        status_map = {"green": 0, "yellow": 1, "red": 2}
        if "cluster_status" in stats:
            # CloudWatch metric values must be numeric, so only the mapped value is sent
            put_metric("ClusterStatusNumeric", status_map.get(stats["cluster_status"], -1), dimensions)

        if "unassigned_shards" in stats:
            put_metric("UnassignedShards", stats["unassigned_shards"], dimensions)
        if "relocating_shards" in stats:
            put_metric("RelocatingShards", stats["relocating_shards"], dimensions)
        if "active_shards" in stats:
            put_metric("ActiveShards", stats["active_shards"], dimensions)
        if "total_shards" in stats:
            put_metric("TotalShards", stats["total_shards"], dimensions)

        if "node_count" in stats:
            put_metric("NodeCount", stats["node_count"], dimensions)
        if "avg_cpu_load_1m" in stats:
            put_metric("AvgCpuLoad1m", stats["avg_cpu_load_1m"], dimensions + [{'Name': 'MetricType', 'Value': 'LoadAverage'}])
        if "avg_heap_used_percent" in stats:
            put_metric("AvgJvmHeapUsedPercent", stats["avg_heap_used_percent"], dimensions)
        if "avg_disk_used_percent" in stats:
            put_metric("AvgDiskUsedPercent", stats["avg_disk_used_percent"], dimensions)

        # For rates, you'd typically calculate the delta between two collection intervals.
        # This requires storing the previous value. For simplicity, we'll just report cumulative here.
        # A more robust solution would involve a stateful collector or using Prometheus.
        if "total_indexing_rate" in stats:
            put_metric("TotalIndexingDocs", stats["total_indexing_rate"], dimensions)
        if "total_search_rate" in stats:
            put_metric("TotalSearchQueries", stats["total_search_rate"], dimensions)




Schedule this script using cron, similar to the PHP-FPM monitor.
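
Because the script reports `index_total` and `query_total` as ever-growing counters, a rate needs the previous reading. One way to get it from a cron-driven script is a small state file, sketched below. The function name `counter_delta` and the state file location are assumptions, and a counter that goes backwards (for example after a node restart) is treated as a reset rather than a negative rate.

```python
import json
import os
import tempfile

def counter_delta(name, current, state_path):
    """Return the increase in a cumulative counter since the previous run.

    Returns None on the first run or when the counter went backwards,
    so the caller can skip publishing for that interval.
    """
    try:
        with open(state_path) as fh:
            state = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}

    previous = state.get(name)
    state[name] = current

    # Write to a temp file and rename, so a crash never leaves a half-written state file.
    directory = os.path.dirname(os.path.abspath(state_path))
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        json.dump(state, tmp)
    os.replace(tmp.name, state_path)

    if previous is None or current < previous:
        return None
    return current - previous

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "es_counters.json")
    print(counter_delta("index_total", 1000, path))  # first run: nothing to compare against
    print(counter_delta("index_total", 1600, path))  # 600 documents since the previous run
```

Dividing the delta by the collection interval (60 seconds with the cron schedule used earlier) gives a per-second rate that can be published instead of the raw counter.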



2. CloudWatch Agent for Logs and System Metrics:



Configure the CloudWatch Agent to collect Elasticsearch logs (e.g., `elasticsearch.log`, `gc.log`) and system-level metrics (CPU, memory, disk) from your EC2 instances. The agent configuration would look similar to the PHP example, but with different log file paths and potentially system metrics enabled.



{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyElasticsearchCluster",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}",
      "ClusterName": "my-es-cluster"
    },
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_user",
          "cpu_usage_system",
          "cpu_usage_idle"
        ],
        "totalcpu": true
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "resources": [
          "/"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/elasticsearch/elasticsearch.log",
            "log_group_name": "/aws/elasticsearch/cluster.log",
            "log_stream_name": "{instance_id}/elasticsearch"
          },
          {
            "file_path": "/var/log/elasticsearch/gc.log",
            "log_group_name": "/aws/elasticsearch/gc.log",
            "log_stream_name": "{instance_id}/gc"
          }
        ]
      }
    }
  }
}



Alerting Strategies



Effective alerting is the culmination of your monitoring efforts. Define clear thresholds for critical metrics and configure CloudWatch Alarms to notify your team via SNS, PagerDuty, Slack, etc.



PHP Application:
High PHP-FPM error rate (e.g., > 5 errors/minute).
Low available PHP-FPM workers (e.g., active processes > 90% of max_children).
Sustained high request latency (e.g., P95 latency > 500ms for 5 minutes).
Application-specific critical errors logged.

Elasticsearch Cluster:
Cluster status is yellow or red.
Unassigned shards > 0 for more than a few minutes.
High JVM heap usage (e.g., > 85%).
Low disk space remaining (e.g., < 20%).
High indexing or search latency (e.g., P95 latency > 1 second for 5 minutes).
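
The PHP-FPM worker threshold above is a ratio, which a CloudWatch alarm cannot compute from a raw process count on its own. One option is to publish the ratio as its own metric, as in this sketch. The metric name `WorkerSaturation` is an illustrative choice, and `max_children` is assumed to be read from the pool's `pm.max_children` setting.

```python
def worker_saturation_percent(active_processes, max_children):
    """Share of the pool's worker slots that are busy, as a percentage."""
    if max_children <= 0:
        raise ValueError("max_children must be positive")
    return 100.0 * active_processes / max_children

def saturation_datum(active_processes, max_children, pool_name):
    """One MetricData entry, so an alarm can use a plain '> 90' threshold."""
    return {
        "MetricName": "WorkerSaturation",
        "Value": worker_saturation_percent(active_processes, max_children),
        "Unit": "Percent",
        "Dimensions": [{"Name": "Pool", "Value": pool_name}],
    }

if __name__ == "__main__":
    print(saturation_datum(46, 50, "www"))
```

An alarm on this metric with a threshold of 90 over several evaluation periods then matches the "active processes > 90% of max_children" rule without hard-coding the pool size in the alarm itself.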



Remember to tune your alarm thresholds based on historical data and acceptable performance levels. Avoid alert fatigue by focusing on actionable alerts that truly indicate a problem requiring immediate attention.
