Server Monitoring Best Practices: Keeping Your PHP App and Elasticsearch Clusters Alive on AWS
Establishing a Robust Monitoring Foundation with CloudWatch
For any production PHP application hosted on AWS, a comprehensive monitoring strategy is paramount. This begins with leveraging Amazon CloudWatch effectively. Beyond the default metrics, we need to push custom metrics and logs to gain granular insights into application performance and infrastructure health.
A common pitfall is relying solely on CPU utilization and network traffic. While important, these high-level indicators often mask deeper application-level issues. We need to monitor things like PHP-FPM process counts, request latency, error rates, and memory usage per process.
Custom Metrics for PHP-FPM and Application Performance
The PHP-FPM metrics in this section come from a small script rather than from the CloudWatch Agent's built-in collectors. PHP-FPM exposes a status page when enabled; the Python script below polls it once a minute and reports active processes, idle processes, and slow requests.
First, ensure PHP-FPM's `pm.status_path` is configured (e.g., `/fpm-status`) and that your web server forwards that path to PHP-FPM, ideally for local requests only. If it is not set, enable it in your PHP-FPM pool configuration:
; /etc/php/7.4/fpm/pool.d/www.conf
pm.status_path = /fpm-status
Then, create a Python script (e.g., `/opt/scripts/monitor_php_fpm.py`) to collect and send metrics to CloudWatch:
import os

import boto3
import requests

# --- Configuration ---
PHP_FPM_STATUS_URL = "http://localhost/fpm-status"  # Adjust if not on localhost or using a different path
NAMESPACE = "MyPHPApp"
REGION = os.environ.get("AWS_REGION", "us-east-1")  # Get region from environment or default
# --- End Configuration ---

cloudwatch = boto3.client("cloudwatch", region_name=REGION)

# Numeric fields on the plain-text status page, mapped to the keys used below.
INT_FIELDS = {
    "accepted conn": "accepted_conn",
    "active processes": "active_processes",
    "idle processes": "idle_processes",
    "slow requests": "slow_requests",
}


def get_php_fpm_stats():
    """Fetch the PHP-FPM status page and parse its "key: value" lines."""
    try:
        response = requests.get(PHP_FPM_STATUS_URL, timeout=5)
        response.raise_for_status()  # Raise an exception for bad status codes
    except requests.exceptions.RequestException as e:
        print(f"Error fetching PHP-FPM status: {e}")
        return None

    stats = {}
    for line in response.text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "pool":
            stats["pool"] = value
        elif key in INT_FIELDS and value.isdigit():
            stats[INT_FIELDS[key]] = int(value)
    return stats


def put_metric(metric_name, value, dimensions=None, unit="Count"):
    """Publish one data point; pass unit="Seconds" etc. for non-count metrics."""
    try:
        cloudwatch.put_metric_data(
            Namespace=NAMESPACE,
            MetricData=[
                {
                    "MetricName": metric_name,
                    "Value": value,
                    "Unit": unit,
                    "Dimensions": dimensions if dimensions else [],
                },
            ],
        )
        print(f"Put metric: {metric_name}={value}")
    except Exception as e:
        print(f"Error putting metric {metric_name}: {e}")


if __name__ == "__main__":
    stats = get_php_fpm_stats()
    if stats:
        # Use the pool name PHP-FPM reports, falling back to 'www'.
        # With multiple pools, poll each pool's status URL separately.
        dimensions = [{"Name": "Pool", "Value": stats.get("pool", "www")}]

        # Gauges: values at the moment of the poll.
        if "active_processes" in stats:
            put_metric("ActiveProcesses", stats["active_processes"], dimensions)
        if "idle_processes" in stats:
            put_metric("IdleProcesses", stats["idle_processes"], dimensions)

        # Counters: cumulative since PHP-FPM started, so graph their change over time.
        if "slow_requests" in stats:
            put_metric("SlowRequests", stats["slow_requests"], dimensions)
        if "accepted_conn" in stats:
            put_metric("AcceptedConnections", stats["accepted_conn"], dimensions)

        # Example: Application-level metrics (e.g., request latency, error count)
        # These would typically be instrumented within your PHP application itself
        # For demonstration, let's assume we have these values available
        # app_request_latency_seconds = 0.15
        # app_error_count = 2
        # put_metric("RequestLatency", app_request_latency_seconds, dimensions + [{"Name": "Endpoint", "Value": "/api/v1/users"}], unit="Seconds")
        # put_metric("ErrorCount", app_error_count, dimensions + [{"Name": "ErrorType", "Value": "5xx"}])
To run this script periodically, we can use cron. Configure cron to run this script every minute:
# crontab -e
* * * * * /usr/bin/python3 /opt/scripts/monitor_php_fpm.py >> /var/log/monitor_php_fpm.log 2>&1
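PHP-FPM can also return the status page as JSON when `?json` is appended to the status URL, which avoids line-by-line parsing. The following is a minimal sketch, not part of the script above: the function name `summarize_fpm_status` and the `max_children` argument are hypothetical, and `max_children` has to be copied from your pool configuration because the status page does not report it.

```python
import json


def summarize_fpm_status(status_json, max_children):
    """Reduce PHP-FPM's JSON status output to a few alert-friendly numbers."""
    if max_children <= 0:
        raise ValueError("max_children must be positive")
    status = json.loads(status_json)
    active = int(status["active processes"])
    idle = int(status["idle processes"])
    return {
        "pool": status["pool"],
        "active": active,
        "idle": idle,
        # Share of the configured worker limit that is busy right now.
        "worker_utilization_percent": round(100.0 * active / max_children, 1),
        # Requests waiting for a free worker; absent keys are treated as 0.
        "listen_queue": int(status.get("listen queue", 0)),
    }


if __name__ == "__main__":
    # Hand-written sample in the shape PHP-FPM returns for /fpm-status?json
    sample = json.dumps({
        "pool": "www",
        "process manager": "dynamic",
        "accepted conn": 1200,
        "listen queue": 0,
        "idle processes": 3,
        "active processes": 9,
        "total processes": 12,
        "slow requests": 4,
    })
    print(summarize_fpm_status(sample, max_children=10))
```

Publishing the utilization percentage as its own metric makes a "workers nearly exhausted" alarm a simple threshold check instead of arithmetic across two metrics.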
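The script above publishes its metrics directly through the CloudWatch API, so it does not depend on the agent. The CloudWatch Agent complements it by shipping log files and by accepting StatsD, Prometheus, and embedded metric format (EMF) data from the application. Create a configuration file (e.g., `/opt/aws/amazon-cloudwatch-agent/bin/config.json`):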
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyPHPApp",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "statsd": {
        "service_address": "127.0.0.1:8125",
        "metrics_collection_interval": 60
      }
    }
  },
  "logs": {
    "metrics_collected": {
      "prometheus": {
        "prometheus_config_path": "/opt/aws/amazon-cloudwatch-agent/prometheus_config.yml",
        "log_group_name": "/aws/ecs/containerinsights/my-app"
      },
      "emf": {}
    },
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/php-fpm/www-error.log",
            "log_group_name": "/aws/php-fpm/www/error",
            "log_stream_name": "{instance_id}/php-fpm-error"
          },
          {
            "file_path": "/var/log/php-fpm/www-access.log",
            "log_group_name": "/aws/php-fpm/www/access",
            "log_stream_name": "{instance_id}/php-fpm-access"
          }
        ]
      }
    }
  }
}
And a Prometheus configuration file (`/opt/aws/amazon-cloudwatch-agent/prometheus_config.yml`) to scrape metrics from applications that expose them in Prometheus format (e.g., using a Prometheus client library in your PHP app):
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'php_app'
    static_configs:
      - targets: ['localhost:9100']  # Assuming your PHP app exposes metrics on port 9100
        labels:
          application: 'my-php-app'
Start and enable the agent:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
sudo systemctl enable amazon-cloudwatch-agent
sudo systemctl start amazon-cloudwatch-agent
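Once the agent is listening for StatsD traffic, any process on the instance can emit a metric by sending one UDP datagram. The sketch below shows the wire format in Python for illustration; a PHP application would send the same text through its own StatsD client. The metric name `php.request_latency`, the `endpoint` tag, and both function names are hypothetical, and the port assumes the `8125` listener from the agent configuration.

```python
import socket


def format_statsd(name, value, metric_type="g", tags=None):
    """Build one StatsD line: name:value|type, plus optional |#key:value tags."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # DogStatsD-style tags, sorted so the output is deterministic.
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_str}"
    return line


def send_statsd(line, host="127.0.0.1", port=8125):
    """Fire-and-forget: UDP gives no delivery guarantee and no error if nobody listens."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # "ms" marks a timing in milliseconds; "c" would be a counter, "g" a gauge.
    line = format_statsd("php.request_latency", 150, "ms", {"endpoint": "/api/v1/users"})
    print(line)
    send_statsd(line)
```

Because the datagram is sent without waiting for a reply, instrumenting a request this way adds very little latency, at the cost of silently dropping data if the agent is down.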
Log Aggregation and Analysis with CloudWatch Logs Insights
Beyond metrics, centralized logging is crucial. Configure the CloudWatch Agent to stream your PHP error logs, access logs, and any application-specific logs to CloudWatch Logs. This enables powerful querying with CloudWatch Logs Insights.
For PHP error logs, ensure they are configured to log to a file. In your `php.ini`:
error_log = /var/log/php-fpm/www-error.log
And for access logs (if using PHP-FPM's access log feature or a web server like Nginx):
; Example for PHP-FPM access log (less common, usually web server logs)
access.log = /var/log/php-fpm/www-access.log
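The CloudWatch Agent configuration snippet above already includes directives for collecting these files. Once logs are flowing into CloudWatch, you can use Logs Insights to query them. The time window is chosen in the console (or passed to the API) rather than written in the query, so to find fatal errors from the last hour, set the range to one hour and run: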
fields @timestamp, @message
| filter @message like 'Fatal error'
| sort @timestamp desc
| limit 50
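The same query can be run outside the console, where the one-hour window becomes explicit `startTime` and `endTime` values in epoch seconds. This is a minimal sketch using boto3's `start_query` and `get_query_results`; the helper names are hypothetical, and the log group is the error log group from the agent configuration above.

```python
import time

FATAL_ERROR_QUERY = """fields @timestamp, @message
| filter @message like 'Fatal error'
| sort @timestamp desc
| limit 50"""


def build_query_params(log_group, query, window_seconds=3600, now=None):
    """Return start_query arguments covering the last window_seconds."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - window_seconds,
        "endTime": end,
        "queryString": query,
    }


def run_query(params, poll_seconds=1.0):
    """Start the query and poll until it reaches a terminal status."""
    import boto3  # imported here so build_query_params works without AWS access

    logs = boto3.client("logs")
    query_id = logs.start_query(**params)["queryId"]
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            return result
        time.sleep(poll_seconds)


if __name__ == "__main__":
    params = build_query_params("/aws/php-fpm/www/error", FATAL_ERROR_QUERY)
    print(params)
    # With AWS credentials configured: print(run_query(params)["results"])
```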
Or to analyze latency for failing requests from Nginx access logs. The default `combined` format does not record timings, so this query assumes a custom `log_format` that appends `$request_time` and `$upstream_response_time` to the combined fields:
parse @message '* - * [*] "* * *" * * "*" "*" * *' as client, user, time_local, method, request, protocol, status, bytes, referer, user_agent, request_time, upstream_response_time
# Keep only 5xx responses
| filter status like /^5/
| stats avg(request_time) as avg_request_time, avg(upstream_response_time) as avg_upstream_response_time by status, bin(5m)
| sort avg_request_time desc
Monitoring Elasticsearch Clusters on AWS
Monitoring Elasticsearch, especially on AWS (whether self-managed on EC2 or using Amazon OpenSearch Service, formerly Elasticsearch Service), requires a different set of tools and considerations. The primary focus shifts to cluster health, shard status, indexing performance, and query latency.
Key Elasticsearch Metrics to Track
Regardless of deployment method, certain metrics are universally important:
- Cluster Health: Status (green, yellow, red), number of nodes, number of shards (total, unassigned, relocating).
- Node Stats: CPU utilization, JVM heap usage, disk I/O, disk space remaining.
- Indexing Performance: Index rate (docs/sec), indexing latency (ms).
- Search Performance: Search rate (queries/sec), search latency (ms).
- JVM Metrics: Garbage collection activity, thread pool usage.
- Shard Allocation: Number of shards that are unassigned or relocating.
Leveraging CloudWatch for OpenSearch Service
If you are using Amazon OpenSearch Service, AWS provides a set of built-in CloudWatch metrics. These are automatically collected and available in your OpenSearch Service domain's monitoring tab or directly via the CloudWatch API/console.
Essential OpenSearch Service metrics include:
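- Cluster status: `ClusterStatus.red`, `ClusterStatus.yellow`, `ClusterStatus.green`
- Node count: `Nodes`
- Data nodes: `CPUUtilization`, `JVMMemoryPressure`
- Dedicated master nodes: `MasterCPUUtilization`, `MasterJVMMemoryPressure`
- Throughput: `IndexingRate`, `SearchRate`
- Latency: `IndexingLatency`, `SearchLatency`
- Shards: `Shards.unassigned`
You can create CloudWatch Alarms based on these metrics. For instance, an alarm that fires when `ClusterStatus.red` reports 1, or when `Shards.unassigned` stays above 0, should be treated as critical. Metrics in the `AWS/ES` namespace carry both a `DomainName` and a `ClientId` (AWS account ID) dimension, and an alarm must specify both to match any data.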
# Example CloudWatch Alarm creation via AWS CLI
aws cloudwatch put-metric-alarm \
  --alarm-name "OpenSearch-Cluster-Red-Status" \
  --alarm-description "Alarm when OpenSearch cluster status is red" \
  --metric-name ClusterStatus.red \
  --namespace "AWS/ES" \
  --statistic Sum \
  --period 300 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --dimensions Name=DomainName,Value=your-opensearch-domain-name Name=ClientId,Value=123456789012 \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-ops-topic
Monitoring Self-Managed Elasticsearch on EC2
For self-managed Elasticsearch clusters on EC2, you'll need to deploy agents to collect metrics. The CloudWatch Agent is again a good choice, but you'll also want to leverage Elasticsearch's own monitoring APIs and potentially tools like Prometheus with the Elasticsearch exporter.
1. Elasticsearch Monitoring APIs:
Elasticsearch exposes a wealth of information via its REST APIs. The `_cat` APIs are particularly useful for quick checks, while `_nodes/stats` and `_cluster/stats` provide detailed metrics.
# Check cluster health
curl -X GET "http://localhost:9200/_cluster/health?pretty"
# Get node statistics
curl -X GET "http://localhost:9200/_nodes/stats?pretty"
# Get cluster statistics
curl -X GET "http://localhost:9200/_cluster/stats?pretty"
You can script these API calls to push custom metrics to CloudWatch. A Python script similar to the PHP-FPM monitor can be adapted:
import os

import boto3
import requests

# --- Configuration ---
ES_HOST = "http://localhost:9200"
NAMESPACE = "MyElasticsearchCluster"
REGION = os.environ.get("AWS_REGION", "us-east-1")
CLUSTER_NAME = "my-es-cluster"  # Identify your cluster
# --- End Configuration ---

cloudwatch = boto3.client("cloudwatch", region_name=REGION)


def get_json(path):
    """GET an Elasticsearch API path and return the decoded JSON body."""
    resp = requests.get(f"{ES_HOST}{path}", timeout=5)
    resp.raise_for_status()
    return resp.json()


def get_es_stats():
    stats = {}
    try:
        # Cluster health
        health = get_json("/_cluster/health")
        stats["cluster_status"] = health["status"]
        stats["unassigned_shards"] = health["unassigned_shards"]
        stats["relocating_shards"] = health["relocating_shards"]
        stats["active_shards"] = health["active_shards"]
        # The health API has no single total, so sum the shard states it reports
        # (relocating shards are already counted as active).
        stats["total_shards"] = (
            health["active_shards"]
            + health["initializing_shards"]
            + health["unassigned_shards"]
        )

        # Node stats (averaged across all nodes for simplicity, could be per-node)
        nodes = get_json("/_nodes/stats/os,jvm,fs")["nodes"]
        stats["node_count"] = len(nodes)
        if nodes:
            total_load = total_heap = total_disk = 0.0
            for node in nodes.values():
                # 1-minute load average (reported on Linux hosts)
                total_load += node["os"]["cpu"]["load_average"]["1m"]
                total_heap += node["jvm"]["mem"]["heap_used_percent"]
                # fs.total reports bytes, so derive the used percentage.
                fs_total = node["fs"]["total"]
                used = fs_total["total_in_bytes"] - fs_total["available_in_bytes"]
                total_disk += 100.0 * used / fs_total["total_in_bytes"]
            stats["avg_cpu_load_1m"] = total_load / len(nodes)
            stats["avg_heap_used_percent"] = total_heap / len(nodes)
            stats["avg_disk_used_percent"] = total_disk / len(nodes)

        # Indexing/search counters come from the index stats API.
        # Both are cumulative counts; a rate needs the delta between two polls.
        totals = get_json("/_stats/indexing,search")["_all"]["total"]
        stats["indexing_total"] = totals["indexing"]["index_total"]
        stats["search_query_total"] = totals["search"]["query_total"]
    except requests.exceptions.RequestException as e:
        print(f"Error fetching Elasticsearch stats: {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None
    return stats


def put_metric(metric_name, value, dimensions=None, unit="Count"):
    """Publish one numeric data point (CloudWatch does not accept string values)."""
    try:
        cloudwatch.put_metric_data(
            Namespace=NAMESPACE,
            MetricData=[
                {
                    "MetricName": metric_name,
                    "Value": value,
                    "Unit": unit,
                    "Dimensions": dimensions if dimensions else [],
                },
            ],
        )
        print(f"Put metric: {metric_name}={value}")
    except Exception as e:
        print(f"Error putting metric {metric_name}: {e}")


if __name__ == "__main__":
    stats = get_es_stats()
    if stats:
        dimensions = [{"Name": "ClusterName", "Value": CLUSTER_NAME}]

        # Map ES status to numerical values for CloudWatch
        status_map = {"green": 0, "yellow": 1, "red": 2}
        put_metric("ClusterStatusNumeric", status_map.get(stats["cluster_status"], -1), dimensions, unit="None")

        put_metric("UnassignedShards", stats["unassigned_shards"], dimensions)
        put_metric("RelocatingShards", stats["relocating_shards"], dimensions)
        put_metric("ActiveShards", stats["active_shards"], dimensions)
        put_metric("TotalShards", stats["total_shards"], dimensions)
        put_metric("NodeCount", stats["node_count"], dimensions)

        if "avg_cpu_load_1m" in stats:
            put_metric("AvgCpuLoad1m", stats["avg_cpu_load_1m"], dimensions, unit="None")
            put_metric("AvgJvmHeapUsedPercent", stats["avg_heap_used_percent"], dimensions, unit="Percent")
            put_metric("AvgDiskUsedPercent", stats["avg_disk_used_percent"], dimensions, unit="Percent")

        # For rates, you'd typically calculate the delta between two collection intervals.
        # This requires storing the previous value. For simplicity, we'll just report cumulative here.
        # A more robust solution would involve a stateful collector or using Prometheus.
        put_metric("TotalIndexingDocs", stats["indexing_total"], dimensions)
        put_metric("TotalSearchQueries", stats["search_query_total"], dimensions)
Schedule this script using cron, similar to the PHP-FPM monitor.
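Because cron starts a fresh process on every run, turning the cumulative indexing and search counters into rates means persisting the previous sample between runs. The following is one possible sketch under that assumption; the function names and the state file location are hypothetical, and a counter that goes backwards (for example after a node restart) is treated as "no rate available" rather than as a negative rate.

```python
import json
import os
import tempfile


def counter_rate(previous, value, timestamp):
    """Return the per-second rate since the previous sample, or None."""
    if previous is None:
        return None  # first run: nothing to compare against
    elapsed = timestamp - previous["timestamp"]
    delta = value - previous["value"]
    if elapsed <= 0 or delta < 0:
        return None  # clock went backwards or the counter was reset
    return delta / elapsed


def load_sample(path):
    """Read the previous sample, or None if it is missing or unreadable."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except (OSError, ValueError):
        return None


def save_sample(path, value, timestamp):
    """Write to a temp file and rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(path) or "."
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as fh:
        json.dump({"value": value, "timestamp": timestamp}, fh)
    os.replace(fh.name, path)


if __name__ == "__main__":
    state_file = os.path.join(tempfile.gettempdir(), "es_index_total.json")
    save_sample(state_file, 1000, 100.0)  # pretend this was the previous cron run
    # 600 documents indexed over 60 seconds
    print(counter_rate(load_sample(state_file), 1600, 160.0))
```

In the monitor script, the returned rate would be published with a unit such as `Count/Second`, and the publish step skipped whenever the function returns `None`.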
2. CloudWatch Agent for Logs and System Metrics:
Configure the CloudWatch Agent to collect Elasticsearch logs (e.g., `elasticsearch.log`, `gc.log`) and system-level metrics (CPU, memory, disk) from your EC2 instances. The agent configuration would look similar to the PHP example, but with different log file paths and potentially system metrics enabled.
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyElasticsearchCluster",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_user",
          "cpu_usage_system",
          "cpu_usage_idle"
        ],
        "totalcpu": true,
        "append_dimensions": {
          "ClusterName": "my-es-cluster"
        }
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "resources": [
          "/"
        ],
        "append_dimensions": {
          "ClusterName": "my-es-cluster"
        }
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "append_dimensions": {
          "ClusterName": "my-es-cluster"
        }
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/elasticsearch/elasticsearch.log",
            "log_group_name": "/aws/elasticsearch/cluster.log",
            "log_stream_name": "{instance_id}/elasticsearch"
          },
          {
            "file_path": "/var/log/elasticsearch/gc.log",
            "log_group_name": "/aws/elasticsearch/gc.log",
            "log_stream_name": "{instance_id}/gc"
          }
        ]
      }
    }
  }
}
Alerting Strategies
Effective alerting is the culmination of your monitoring efforts. Define clear thresholds for critical metrics and configure CloudWatch Alarms to notify your team via SNS, PagerDuty, Slack, etc.
- PHP Application:
  - High PHP-FPM error rate (e.g., > 5 errors/minute).
  - Low available PHP-FPM workers (e.g., active processes > 90% of max_children).
  - Sustained high request latency (e.g., P95 latency > 500ms for 5 minutes).
  - Application-specific critical errors logged.
- Elasticsearch Cluster:
  - Cluster status is yellow or red.
  - Unassigned shards > 0 for more than a few minutes.
  - High JVM heap usage (e.g., > 85%).
  - Low disk space remaining (e.g., < 20%).
  - High indexing or search latency (e.g., P95 latency > 1 second for 5 minutes).
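Percentile thresholds such as "P95 latency > 500ms for 5 minutes" map onto CloudWatch's `ExtendedStatistic` alarm parameter rather than the usual `Statistic`. The sketch below builds the arguments for boto3's `put_metric_alarm`; it assumes the application publishes a `RequestLatency` metric in seconds to the `MyPHPApp` namespace, and the function names and alarm naming scheme are hypothetical.

```python
def p95_latency_alarm(namespace, metric_name, dimensions, sns_topic_arn,
                      threshold_seconds=0.5, minutes=5):
    """Build put_metric_alarm arguments for a sustained p95 latency breach."""
    return {
        "AlarmName": f"{namespace}-{metric_name}-p95-high",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "ExtendedStatistic": "p95",      # percentile instead of Average/Sum/etc.
        "Period": 60,                    # evaluate one-minute buckets
        "EvaluationPeriods": minutes,    # look at the last N minutes...
        "DatapointsToAlarm": minutes,    # ...and require all of them to breach
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # quiet periods do not page anyone
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(alarm):
    """Create or update the alarm; requires AWS credentials."""
    import boto3  # imported here so the builder can be used without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**alarm)


if __name__ == "__main__":
    alarm = p95_latency_alarm(
        "MyPHPApp",
        "RequestLatency",
        [{"Name": "Pool", "Value": "www"}],
        "arn:aws:sns:us-east-1:123456789012:my-ops-topic",
    )
    print(alarm)
    # create_alarm(alarm)
```

Requiring every datapoint in the window to breach is a deliberate choice here: it trades a few minutes of detection delay for fewer pages on short spikes.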
Remember to tune your alarm thresholds based on historical data and acceptable performance levels. Avoid alert fatigue by focusing on actionable alerts that truly indicate a problem requiring immediate attention.