Vengala Vinay

Server Monitoring Best Practices: Keeping Your PHP App and Elasticsearch Clusters Alive on AWS

Establishing a Robust Monitoring Foundation with CloudWatch

Any production PHP application hosted on AWS needs a comprehensive monitoring strategy, and Amazon CloudWatch is the natural place to start. The default metrics are not enough on their own: we also need to push custom metrics and logs to gain granular insight into application performance and infrastructure health.

A common pitfall is relying solely on CPU utilization and network traffic. While important, these high-level indicators often mask deeper application-level issues. We need to monitor things like PHP-FPM process counts, request latency, error rates, and memory usage per process.

Custom Metrics for PHP-FPM and Application Performance

Custom metrics can reach CloudWatch through the CloudWatch Agent or directly through the PutMetricData API. For PHP-FPM, a simple option is a small script that periodically reads the status page and reports key metrics. Let’s consider a Python script that reports active processes, idle processes, and slow requests.

First, ensure PHP-FPM’s `pm.status_path` is configured and accessible (e.g., `/fpm-status`). The status page is served over FastCGI, so the web server must also route that path to PHP-FPM, and access should be restricted to localhost. If it is not enabled yet, enable it in your PHP-FPM pool configuration:

; /etc/php/7.4/fpm/pool.d/www.conf
pm.status_path = /fpm-status

Then, create a Python script (e.g., `/opt/scripts/monitor_php_fpm.py`) to collect and send metrics to CloudWatch:

import boto3
import requests
import os

# --- Configuration ---
PHP_FPM_STATUS_URL = "http://localhost/fpm-status" # Adjust if not on localhost or using a different path
NAMESPACE = "MyPHPApp"
REGION = os.environ.get("AWS_REGION", "us-east-1") # Get region from environment or default
# --- End Configuration ---

cloudwatch = boto3.client('cloudwatch', region_name=REGION)

def get_php_fpm_stats():
    try:
        response = requests.get(PHP_FPM_STATUS_URL, timeout=5)
        response.raise_for_status() # Raise an exception for bad status codes
        data = response.text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching PHP-FPM status: {e}")
        return None

    stats = {}
    lines = data.splitlines()
    for line in lines:
        if ":" in line:
            key, value = line.split(":", 1)
            key = key.strip()
            value = value.strip()
            if key == "pool":
                stats.setdefault("pool", []).append(value)
            elif key == "process manager":
                stats["process_manager"] = value
            elif key == "start time":
                stats["start_time"] = value
            elif key == "accepted conn":
                stats["accepted_conn"] = int(value)
            elif key == "active processes":
                stats["active_processes"] = int(value)
            elif key == "idle processes":
                stats["idle_processes"] = int(value)
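            elif key == "listen queue":
                # Requests waiting for a free worker; the summary page has no "requests" field
                stats["listen_queue"] = int(value)
            elif key == "max children reached":
                stats["max_children_reached"] = int(value)
            elif key == "slow requests":
                stats["slow_requests"] = int(value)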
    return stats

def put_metric(metric_name, value, dimensions=None):
    try:
        cloudwatch.put_metric_data(
            Namespace=NAMESPACE,
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Value': value,
                    'Unit': 'Count', # Adjust unit as needed (e.g., 'Seconds' for latency)
                    'Dimensions': dimensions if dimensions else []
                },
            ]
        )
        print(f"Put metric: {metric_name}={value}")
    except Exception as e:
        print(f"Error putting metric {metric_name}: {e}")

if __name__ == "__main__":
    stats = get_php_fpm_stats()
    if stats:
        # Example: Sending metrics for the 'www' pool (adjust if you have multiple pools)
        # In a real-world scenario, you'd iterate through pools if multiple exist
        pool_name = "www" # Assuming a single pool named 'www'
        dimensions = [{'Name': 'Pool', 'Value': pool_name}]

        if "active_processes" in stats:
            put_metric("ActiveProcesses", stats["active_processes"], dimensions)
        if "idle_processes" in stats:
            put_metric("IdleProcesses", stats["idle_processes"], dimensions)
        if "slow_requests" in stats:
            put_metric("SlowRequests", stats["slow_requests"], dimensions)
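        # The summary status page does not report a total request count,
        # so report queue depth and worker exhaustion when they are available
        if "listen_queue" in stats:
            put_metric("ListenQueue", stats["listen_queue"], dimensions)
        if "max_children_reached" in stats:
            put_metric("MaxChildrenReached", stats["max_children_reached"], dimensions)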
        if "accepted_conn" in stats:
            put_metric("AcceptedConnections", stats["accepted_conn"], dimensions)

        # Example: Application-level metrics (e.g., request latency, error count)
        # These would typically be instrumented within your PHP application itself
        # For demonstration, let's assume we have these values available
        # app_request_latency_seconds = 0.15
        # app_error_count = 2
        # put_metric("RequestLatency", app_request_latency_seconds, dimensions + [{'Name': 'Endpoint', 'Value': '/api/v1/users'}])
        # put_metric("ErrorCount", app_error_count, dimensions + [{'Name': 'ErrorType', 'Value': '5xx'}])




To run this script periodically, we can use cron. Configure cron to run this script every minute:

# crontab -e
* * * * * /usr/bin/python3 /opt/scripts/monitor_php_fpm.py >> /var/log/monitor_php_fpm.log 2>&1
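The script above makes one PutMetricData call per metric. A single call can carry several metrics, which reduces API requests when the script runs every minute. The following is a minimal sketch, assuming the `stats` dictionary produced by `get_php_fpm_stats()`; `build_metric_data` and `publish` are hypothetical helper names, and `cloudwatch` is expected to be a client created with `boto3.client('cloudwatch')`.

```python
# Hypothetical helpers: send all PHP-FPM metrics in one PutMetricData call.
METRIC_NAMES = {
    "active_processes": "ActiveProcesses",
    "idle_processes": "IdleProcesses",
    "slow_requests": "SlowRequests",
    "accepted_conn": "AcceptedConnections",
}


def build_metric_data(stats, pool_name="www"):
    """Turn the parsed status values into a MetricData list."""
    dimensions = [{"Name": "Pool", "Value": pool_name}]
    return [
        {
            "MetricName": metric_name,
            "Value": float(stats[key]),
            "Unit": "Count",
            "Dimensions": dimensions,
        }
        for key, metric_name in METRIC_NAMES.items()
        if key in stats  # skip values the status page did not report
    ]


def publish(cloudwatch, namespace, metric_data):
    """Send the batch in one request; do nothing for an empty batch."""
    if metric_data:
        cloudwatch.put_metric_data(Namespace=namespace, MetricData=metric_data)
```

In the main block this would replace the individual `put_metric` calls with `publish(cloudwatch, NAMESPACE, build_metric_data(stats))`.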

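The script above publishes its metrics directly through the CloudWatch API, so the agent is not involved in that path. The CloudWatch Agent complements it by accepting StatsD metrics from the application, scraping Prometheus endpoints, and shipping log files. Create a configuration file (e.g., `/opt/aws/amazon-cloudwatch-agent/bin/config.json`):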

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyPHPApp",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "statsd": {
        "service_address": "127.0.0.1:8125",
        "metrics_collection_interval": 60,
        "metrics_aggregation_interval": 60
      }
    }
  },
  "logs": {
    "metrics_collected": {
      "prometheus": {
        "prometheus_config_path": "/opt/aws/amazon-cloudwatch-agent/prometheus_config.yml",
        "log_group_name": "/aws/ecs/containerinsights/my-app"
      },
      "emf": {}
    },
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/php-fpm/www-error.log",
            "log_group_name": "/aws/php-fpm/www/error",
            "log_stream_name": "{instance_id}/php-fpm-error"
          },
          {
            "file_path": "/var/log/php-fpm/www-access.log",
            "log_group_name": "/aws/php-fpm/www/access",
            "log_stream_name": "{instance_id}/php-fpm-access"
          }
        ]
      }
    }
  }
}

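The agent also needs a Prometheus configuration file (`/opt/aws/amazon-cloudwatch-agent/prometheus_config.yml`) to scrape metrics from applications that expose them in Prometheus format (e.g., using a Prometheus client library in your PHP app). The agent reads its Prometheus settings from the `metrics_collected` block of the `logs` section and writes scraped samples to the configured log group as embedded metric format events; turning them into CloudWatch metrics typically also requires an `emf_processor` block with metric declarations, which is not shown here: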

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'php_app'
    static_configs:
      - targets: ['localhost:9100'] # Assuming your PHP app exposes metrics on port 9100
        labels:
          application: 'my-php-app'

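Load the configuration and start the agent. The `-s` flag of `fetch-config` already starts the agent with the new configuration; the `systemctl` commands make sure it also comes up after a reboot: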

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
sudo systemctl enable amazon-cloudwatch-agent
sudo systemctl start amazon-cloudwatch-agent
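Once the agent is running, anything on the instance can send it StatsD datagrams. The following is a minimal sketch of a sender, assuming the agent's StatsD listener is reachable on `127.0.0.1:8125` as configured above. The metric names (`php.requests`, `php.request_time`) and the `endpoint` tag are hypothetical, and the `|#key:value` tag suffix is the tag extension that the agent maps to dimensions.

```python
import socket


def format_statsd(name, value, metric_type="c", tags=None):
    """Build one StatsD line, e.g. 'php.requests:1|c|#endpoint:users'."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(f"{key}:{val}" for key, val in tags.items())
    return line


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send a single StatsD line over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Count one request and record its duration in milliseconds
    send_statsd(format_statsd("php.requests", 1, "c", {"endpoint": "users"}))
    send_statsd(format_statsd("php.request_time", 152, "ms", {"endpoint": "users"}))
```

Because UDP is connectionless, the sender does not fail when the agent is down, which keeps instrumentation from affecting the request path.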

Log Aggregation and Analysis with CloudWatch Logs Insights

Beyond metrics, centralized logging is crucial. Configure the CloudWatch Agent to stream your PHP error logs, access logs, and any application-specific logs to CloudWatch Logs. This enables powerful querying with CloudWatch Logs Insights.

For PHP error logs, ensure they are configured to log to a file. In your `php.ini`:

error_log = /var/log/php-fpm/www-error.log

And for access logs (if using PHP-FPM's access log feature or a web server like Nginx):

; Example for PHP-FPM access log (less common, usually web server logs)
access.log = /var/log/php-fpm/www-access.log

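The CloudWatch Agent configuration snippet above already includes directives for collecting these files. Once logs are flowing into CloudWatch, you can use Logs Insights to query them. The time range is chosen in the query editor (or passed to the API), not in the query text. For example, with the range set to the last hour, this returns the 50 most recent fatal errors: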

fields @timestamp, @message
| filter @message like 'Fatal error'
| sort @timestamp desc
| limit 50
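The same query can be run from a script, for example to feed a daily report. The following is a minimal sketch using the CloudWatch Logs `start_query` and `get_query_results` calls; `run_insights_query` is a hypothetical helper, `logs_client` is expected to be `boto3.client('logs')`, and the log group name is taken from the agent configuration above.

```python
import time

FATAL_ERRORS_QUERY = """fields @timestamp, @message
| filter @message like 'Fatal error'
| sort @timestamp desc
| limit 50"""


def run_insights_query(logs_client, log_group, query, minutes=60,
                       poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query over the last `minutes` and return rows as dicts."""
    end = int(time.time())
    start = end - minutes * 60
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]

    # Queries run asynchronously, so poll until the status leaves Scheduled/Running
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Query {query_id} did not finish in time")

    if response["status"] != "Complete":
        raise RuntimeError(f"Query {query_id} ended with status {response['status']}")

    # Each result row is a list of {"field": ..., "value": ...} pairs
    return [{cell["field"]: cell["value"] for cell in row} for row in response["results"]]
```

A call such as `run_insights_query(boto3.client('logs'), '/aws/php-fpm/www/error', FATAL_ERRORS_QUERY)` would return one dictionary per matching log event.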

Or to analyze the latency of 5xx responses from Nginx access logs. This assumes a custom `log_format` that appends `$request_time` and `$upstream_response_time` to the combined format, because the default combined format records neither value:

# Assumed log_format: '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_time $upstream_response_time'
parse @message '* - * [*] "* * *" * * "*" "*" * *' as client, remote_user, time_local, method, request, protocol, status, bytes, referer, user_agent, request_time, upstream_response_time
| filter status like /^5/
| stats avg(request_time) as avg_request_time, avg(upstream_response_time) as avg_upstream_response_time by status, bin(5m)

Monitoring Elasticsearch Clusters on AWS

Monitoring Elasticsearch, especially on AWS (whether self-managed on EC2 or using Amazon OpenSearch Service, formerly Elasticsearch Service), requires a different set of tools and considerations. The primary focus shifts to cluster health, shard status, indexing performance, and query latency.

Key Elasticsearch Metrics to Track

Regardless of deployment method, certain metrics are universally important:

  • Cluster Health: Status (green, yellow, red), number of nodes, number of shards (total, unassigned, relocating).
  • Node Stats: CPU utilization, JVM heap usage, disk I/O, disk space remaining.
  • Indexing Performance: Index rate (docs/sec), indexing latency (ms).
  • Search Performance: Search rate (queries/sec), search latency (ms).
  • JVM Metrics: Garbage collection activity, thread pool usage.
  • Shard Allocation: Number of shards that are unassigned or relocating.

Leveraging CloudWatch for OpenSearch Service

If you are using Amazon OpenSearch Service, AWS provides a set of built-in CloudWatch metrics. These are automatically collected and available in your OpenSearch Service domain's monitoring tab or directly via the CloudWatch API/console.

Essential OpenSearch Service metrics include:

  • ClusterStatus.red, ClusterStatus.yellow, ClusterStatus.green
  • Nodes
  • CPUUtilization, JVMMemoryPressure (data nodes)
  • MasterCPUUtilization, MasterJVMMemoryPressure
  • IndexingRate, SearchRate
  • IndexingLatency, SearchLatency
  • Shards.unassigned

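You can create CloudWatch Alarms based on these metrics. For instance, an alarm that fires when `ClusterStatus.red` reaches 1, or when `Shards.unassigned` stays above 0, is critical. OpenSearch Service metrics carry two dimensions, `DomainName` and `ClientId` (your AWS account ID), and an alarm must specify both to match the metric.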

# Example CloudWatch Alarm creation via AWS CLI
aws cloudwatch put-metric-alarm \
    --alarm-name "OpenSearch-Cluster-Red-Status" \
    --alarm-description "Alarm when OpenSearch cluster status is red" \
    --metric-name ClusterStatus.red \
    --namespace "AWS/ES" \
    --statistic Maximum \
    --period 60 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --dimensions Name=DomainName,Value=your-opensearch-domain-name Name=ClientId,Value=123456789012 \
    --evaluation-periods 1 \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:my-ops-topic
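The same kind of alarm can be created from Python, which is convenient when several domains need identical alarms. The following is a minimal sketch for unassigned shards, assuming the `Shards.unassigned` metric in the `AWS/ES` namespace; `unassigned_shards_alarm` and `create_alarm` are hypothetical helpers, the five-minute window is an arbitrary starting point, and `cloudwatch` is expected to be `boto3.client('cloudwatch')`.

```python
def unassigned_shards_alarm(domain_name, account_id, sns_topic_arn, minutes=5):
    """Build put_metric_alarm parameters: shards unassigned for `minutes` in a row."""
    return {
        "AlarmName": f"OpenSearch-{domain_name}-Unassigned-Shards",
        "AlarmDescription": "Unassigned shards present for several consecutive minutes",
        "Namespace": "AWS/ES",
        "MetricName": "Shards.unassigned",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},  # AWS account ID
        ],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": minutes,  # one-minute periods, all must breach
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(cloudwatch, params):
    """Create or update the alarm with the given parameters."""
    cloudwatch.put_metric_alarm(**params)
```

Requiring several consecutive breaching periods avoids paging on the brief unassigned state that occurs during normal shard relocation.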

Monitoring Self-Managed Elasticsearch on EC2

For self-managed Elasticsearch clusters on EC2, you'll need to deploy agents to collect metrics. The CloudWatch Agent is again a good choice, but you'll also want to leverage Elasticsearch's own monitoring APIs and potentially tools like Prometheus with the Elasticsearch exporter.

1. Elasticsearch Monitoring APIs:

Elasticsearch exposes a wealth of information via its REST APIs. The `_cat` APIs are particularly useful for quick checks, while `_nodes/stats` and `_cluster/stats` provide detailed metrics.

# Check cluster health
curl -X GET "http://localhost:9200/_cluster/health?pretty"

# Get node statistics
curl -X GET "http://localhost:9200/_nodes/stats?pretty"

# Get cluster statistics
curl -X GET "http://localhost:9200/_cluster/stats?pretty"

You can script these API calls to push custom metrics to CloudWatch. A Python script similar to the PHP-FPM monitor can be adapted:

import boto3
import requests
import os

# --- Configuration ---
ES_HOST = "http://localhost:9200"
NAMESPACE = "MyElasticsearchCluster"
REGION = os.environ.get("AWS_REGION", "us-east-1")
CLUSTER_NAME = "my-es-cluster" # Identify your cluster
# --- End Configuration ---

cloudwatch = boto3.client('cloudwatch', region_name=REGION)

def get_es_stats():
    stats = {}
    try:
        # Cluster Health
        health_resp = requests.get(f"{ES_HOST}/_cluster/health", timeout=5)
        health_resp.raise_for_status()
        health_data = health_resp.json()
        stats["cluster_status"] = health_data["status"]
        stats["unassigned_shards"] = health_data["unassigned_shards"]
        stats["relocating_shards"] = health_data["relocating_shards"]
        stats["active_shards"] = health_data["active_shards"]
        # _cluster/health has no total_shards field, so derive it from the shard states
        # (relocating shards are already counted as active)
        stats["total_shards"] = (
            health_data["active_shards"]
            + health_data["initializing_shards"]
            + health_data["unassigned_shards"]
        )

        # Node Stats (averaged across all nodes for simplicity, could be per-node)
        nodes_resp = requests.get(f"{ES_HOST}/_nodes/stats/os,jvm,fs", timeout=5)
        nodes_resp.raise_for_status()
        nodes_data = nodes_resp.json()

        total_load = 0
        total_heap_used_percent = 0
        total_disk_used_percent = 0
        num_nodes = 0

        for node_id, node_info in nodes_data["nodes"].items():
            num_nodes += 1
            # 1m load average (reported on Linux); treated as 0 when absent
            total_load += node_info["os"]["cpu"].get("load_average", {}).get("1m", 0)
            total_heap_used_percent += node_info["jvm"]["mem"]["heap_used_percent"]
            # fs.total reports byte counts rather than a percentage, so compute it
            fs_total = node_info["fs"]["total"]
            if fs_total.get("total_in_bytes"):
                total_disk_used_percent += 100 * (1 - fs_total["available_in_bytes"] / fs_total["total_in_bytes"])

        if num_nodes > 0:
            stats["avg_cpu_load_1m"] = total_load / num_nodes
            stats["avg_heap_used_percent"] = total_heap_used_percent / num_nodes
            stats["avg_disk_used_percent"] = total_disk_used_percent / num_nodes
            stats["node_count"] = num_nodes

        # Indexing/Search counters (from the index stats API)
        index_stats_resp = requests.get(f"{ES_HOST}/_stats/indexing,search", timeout=5)
        index_stats_resp.raise_for_status()
        totals = index_stats_resp.json()["_all"]["total"]
        stats["total_indexing_rate"] = totals["indexing"]["index_total"] # Cumulative count, a rate needs a delta over time
        stats["total_search_rate"] = totals["search"]["query_total"] # Cumulative count

    except requests.exceptions.RequestException as e:
        print(f"Error fetching Elasticsearch stats: {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None
    return stats

def put_metric(metric_name, value, dimensions=None):
    try:
        cloudwatch.put_metric_data(
            Namespace=NAMESPACE,
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Value': value,
                    'Unit': 'Count', # Default unit, adjust as needed
                    'Dimensions': dimensions if dimensions else []
                },
            ]
        )
        print(f"Put metric: {metric_name}={value}")
    except Exception as e:
        print(f"Error putting metric {metric_name}: {e}")

if __name__ == "__main__":
    stats = get_es_stats()
    if stats:
        dimensions = [{'Name': 'ClusterName', 'Value': CLUSTER_NAME}]

        # Map ES status to a numeric value; CloudWatch metric values must be numbers,
        # so the status string itself cannot be published as a metric
        status_map = {"green": 0, "yellow": 1, "red": 2}
        if "cluster_status" in stats:
            put_metric("ClusterStatusNumeric", status_map.get(stats["cluster_status"], -1), dimensions)

        if "unassigned_shards" in stats:
            put_metric("UnassignedShards", stats["unassigned_shards"], dimensions)
        if "relocating_shards" in stats:
            put_metric("RelocatingShards", stats["relocating_shards"], dimensions)
        if "active_shards" in stats:
            put_metric("ActiveShards", stats["active_shards"], dimensions)
        if "total_shards" in stats:
            put_metric("TotalShards", stats["total_shards"], dimensions)

        if "node_count" in stats:
            put_metric("NodeCount", stats["node_count"], dimensions)
        if "avg_cpu_load_1m" in stats:
            put_metric("AvgCpuLoad1m", stats["avg_cpu_load_1m"], dimensions + [{'Name': 'MetricType', 'Value': 'LoadAverage'}])
        if "avg_heap_used_percent" in stats:
            put_metric("AvgJvmHeapUsedPercent", stats["avg_heap_used_percent"], dimensions)
        if "avg_disk_used_percent" in stats:
            put_metric("AvgDiskUsedPercent", stats["avg_disk_used_percent"], dimensions)

        # For rates, you'd typically calculate the delta between two collection intervals.
        # This requires storing the previous value. For simplicity, we'll just report cumulative here.
        # A more robust solution would involve a stateful collector or using Prometheus.
        if "total_indexing_rate" in stats:
            put_metric("TotalIndexingDocs", stats["total_indexing_rate"], dimensions)
        if "total_search_rate" in stats:
            put_metric("TotalSearchQueries", stats["total_search_rate"], dimensions)




Schedule this script using cron, similar to the PHP-FPM monitor.
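Because cron starts a fresh process on every run, turning the cumulative `index_total` and `query_total` counters into rates requires persisting the previous sample. The following is a minimal sketch of such a stateful calculation; `counter_rate` is a hypothetical helper and the state file location is an assumption to adapt to your environment.

```python
import json
import os
import time


def counter_rate(name, current, state_path, now=None):
    """Return the per-second rate of a cumulative counter since the previous run.

    Returns None on the first run, or when the counter went backwards
    (for example after a node restart), so no misleading value is published.
    """
    now = time.time() if now is None else now

    # Load the previous samples, if any
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as handle:
            state = json.load(handle)

    previous = state.get(name)

    # Persist the current sample for the next run
    state[name] = {"value": current, "ts": now}
    with open(state_path, "w") as handle:
        json.dump(state, handle)

    if previous is None:
        return None
    elapsed = now - previous["ts"]
    delta = current - previous["value"]
    if elapsed <= 0 or delta < 0:
        return None
    return delta / elapsed
```

In the main block, a value such as `counter_rate("index_total", stats["total_indexing_rate"], "/var/tmp/es_monitor_state.json")` could then be published as an indexing rate whenever it is not `None`.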

2. CloudWatch Agent for Logs and System Metrics:

Configure the CloudWatch Agent to collect Elasticsearch logs (e.g., `elasticsearch.log`, `gc.log`) and system-level metrics (CPU, memory, disk) from your EC2 instances. The agent configuration would look similar to the PHP example, but with different log file paths and potentially system metrics enabled.

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "MyElasticsearchCluster",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_user",
          "cpu_usage_system",
          "cpu_usage_idle"
        ],
        "totalcpu": true,
        "append_dimensions": {
          "ClusterName": "my-es-cluster"
        }
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "resources": [
          "/"
        ],
        "append_dimensions": {
          "ClusterName": "my-es-cluster"
        }
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "append_dimensions": {
          "ClusterName": "my-es-cluster"
        }
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/elasticsearch/elasticsearch.log",
            "log_group_name": "/aws/elasticsearch/cluster.log",
            "log_stream_name": "{instance_id}/elasticsearch"
          },
          {
            "file_path": "/var/log/elasticsearch/gc.log",
            "log_group_name": "/aws/elasticsearch/gc.log",
            "log_stream_name": "{instance_id}/gc"
          }
        ]
      }
    }
  }
}

Alerting Strategies

Effective alerting is the culmination of your monitoring efforts. Define clear thresholds for critical metrics and configure CloudWatch Alarms to publish to an SNS topic, which can then notify your team by email or through integrations such as PagerDuty and Slack.

  • PHP Application:
    • High PHP-FPM error rate (e.g., > 5 errors/minute).
    • Low available PHP-FPM workers (e.g., active processes > 90% of max_children).
    • Sustained high request latency (e.g., P95 latency > 500ms for 5 minutes).
    • Application-specific critical errors logged.
  • Elasticsearch Cluster:
    • Cluster status is yellow or red.
    • Unassigned shards > 0 for more than a few minutes.
    • High JVM heap usage (e.g., > 85%).
    • Low disk space remaining (e.g., < 20%).
    • High indexing or search latency (e.g., P95 latency > 1 second for 5 minutes).

Remember to tune your alarm thresholds based on historical data and acceptable performance levels. Avoid alert fatigue by focusing on actionable alerts that truly indicate a problem requiring immediate attention.




Copyright © 2026 · Vinay Vengala