Server Monitoring Best Practices: Keeping Your Shopify App and MongoDB Clusters Alive on AWS

Establishing a Robust Monitoring Foundation with AWS CloudWatch

For any production Shopify app hosted on AWS, particularly those leveraging MongoDB clusters, a comprehensive monitoring strategy is non-negotiable. AWS CloudWatch serves as the foundational layer for this, providing essential metrics, logs, and alarms. We’ll focus on key areas: EC2 instance health, RDS/DocumentDB performance (if applicable), and application-level metrics.

EC2 Instance Health Monitoring

Beyond the default CloudWatch metrics (CPU Utilization, Network In/Out, Disk Read/Write Ops), we need to ensure our application servers are truly healthy. This involves custom metrics and log analysis.

Key Metrics to Monitor:

CPU Utilization: Standard, but essential. High sustained CPU can indicate inefficient code or insufficient resources.
Memory Utilization: Crucial for applications, but not reported by EC2 by default. The CloudWatch agent must be installed and configured to send memory metrics.
Disk I/O Operations: High I/O can bottleneck applications, especially database-intensive ones.
Network Traffic: Monitor for unusual spikes or drops that might indicate network issues or DoS attacks.
Process Health: Ensure critical application processes (e.g., your PHP-FPM workers, Node.js processes) are running. This often requires custom scripting and sending custom metrics.
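As a rough illustration of the custom scripting mentioned under Process Health, the sketch below counts running processes with `pgrep` and publishes the counts as a custom metric. The `ShopifyApp/Processes` namespace, the `ProcessCount` metric name, and the `ProcessName` dimension are assumptions made for this example, not anything CloudWatch defines. The CloudWatch agent's `procstat` plugin is an alternative that avoids custom code.

```python
import subprocess


def count_processes(name):
    """Count running processes whose name matches `name` exactly (Linux, needs pgrep)."""
    # pgrep -c prints the number of matches and exits non-zero when there are none.
    result = subprocess.run(
        ["pgrep", "-c", "-x", name], capture_output=True, text=True
    )
    output = result.stdout.strip()
    return int(output) if output.isdigit() else 0


def build_process_metrics(counts, instance_id):
    """Turn {process_name: count} into CloudWatch MetricData entries."""
    return [
        {
            "MetricName": "ProcessCount",  # hypothetical metric name
            "Dimensions": [
                {"Name": "InstanceId", "Value": instance_id},
                {"Name": "ProcessName", "Value": name},
            ],
            "Value": float(count),
            "Unit": "Count",
        }
        for name, count in sorted(counts.items())
    ]


def publish_process_health(process_names, instance_id,
                           namespace="ShopifyApp/Processes"):
    """Count each process and send the results to CloudWatch."""
    import boto3  # imported here so the helpers above work without AWS access

    counts = {name: count_processes(name) for name in process_names}
    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace,
        MetricData=build_process_metrics(counts, instance_id),
    )
    return counts
```

Calling `publish_process_health(["php-fpm", "node"], instance_id)` once a minute would let an alarm on `ProcessCount < 1` flag a process that has died.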

Configuring the CloudWatch Agent for Enhanced Metrics

The CloudWatch agent allows us to collect system-level metrics (like memory utilization) and custom application metrics. Here’s a sample configuration file for an EC2 instance running a typical web application stack.

Create a file named amazon-cloudwatch-agent.json on your EC2 instance.

Example amazon-cloudwatch-agent.json:

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "ShopifyApp/EC2",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "aggregation_dimensions": [
      [ "InstanceId" ]
    ],
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "totalcpu": true
      },
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "resources": [
          "/"
        ],
        "ignore_file_system_types": [
          "sysfs",
          "devtmpfs",
          "tmpfs",
          "devfs",
          "iso9660",
          "overlay",
          "aufs",
          "squashfs"
        ]
      },
      "diskio": {
        "measurement": [
          "reads",
          "writes",
          "read_bytes",
          "write_bytes"
        ],
        "resources": [
          "xvda",
          "xvdb"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent",
          "mem_available_percent"
        ]
      },
      "net": {
        "measurement": [
          "bytes_recv",
          "bytes_sent",
          "packets_recv",
          "packets_sent"
        ]
      },
      "statsd": {
        "service_address": ":8125",
        "metrics_collection_interval": 60
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "ShopifyApp/Nginx/Access",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "ShopifyApp/Nginx/Error",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/php-fpm/www-error.log",
            "log_group_name": "ShopifyApp/PHP-FPM/Error",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}

After creating the file, install and start the agent:

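sudo yum install amazon-cloudwatch-agent -y  # Amazon Linux 2 / Amazon Linux 2023
# or, on Ubuntu (x86-64), where the agent is not in the default repositories:
wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb
# On other distributions, download the package AWS publishes for your platform.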

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/path/to/your/amazon-cloudwatch-agent.json -s

MongoDB Cluster Monitoring on AWS (DocumentDB or EC2-hosted)

Monitoring your MongoDB cluster is paramount. The approach differs depending on whether you’re using Amazon DocumentDB (AWS’s managed, MongoDB-compatible service) or self-hosting MongoDB on EC2 instances.

DocumentDB Monitoring

DocumentDB integrates seamlessly with CloudWatch. Key metrics are available by default. Ensure you’re monitoring the following:

CPUUtilization
DatabaseConnections
ReadIOPS, WriteIOPS
ReadLatency, WriteLatency
FreeableMemory
NetworkReceiveThroughput, NetworkTransmitThroughput
DiskQueueDepth

Beyond these, consider enabling Performance Insights for deeper query analysis and setting up alarms on critical thresholds.

Self-Hosted MongoDB on EC2 Monitoring

For self-hosted MongoDB, you’ll need to leverage the CloudWatch agent’s custom metrics capabilities and potentially external tools.

Key MongoDB Metrics to Collect:

Operations per Second (inserts, queries, updates, deletes)
Connection Count
Memory Usage (resident, virtual)
Disk Usage
Replication Lag (if applicable)
Lock Percentages
Network Traffic

You can collect these using the MongoDB `mongostat` and `mongotop` commands, or by querying the `serverStatus` command and sending the data as custom CloudWatch metrics. A common pattern is to use a Python script that runs periodically.

import time

import boto3
import pymongo

# --- Configuration ---
MONGO_HOST = "your_mongodb_host"
MONGO_PORT = 27017
CLOUDWATCH_NAMESPACE = "ShopifyApp/MongoDB"
INTERVAL_SECONDS = 60  # Collect metrics every 60 seconds
# IAM role for EC2 instance should have CloudWatch PutMetricData permissions

# --- CloudWatch Client ---
cloudwatch = boto3.client('cloudwatch')

# Gauges describe the current state and are published as-is.
# Format: metric name -> (serverStatus section, key, CloudWatch unit)
GAUGES = {
    'ConnectionsCurrent': ('connections', 'current', 'Count'),
    'ConnectionsAvailable': ('connections', 'available', 'Count'),
    'MemoryResident': ('mem', 'resident', 'Megabytes'),
    'MemoryVirtual': ('mem', 'virtual', 'Megabytes'),
}

# Counters are cumulative since mongod started, so the script publishes the
# change between consecutive samples rather than the ever-growing totals.
COUNTERS = {
    'OperationsInsert': ('opcounters', 'insert', 'Count'),
    'OperationsQuery': ('opcounters', 'query', 'Count'),
    'OperationsUpdate': ('opcounters', 'update', 'Count'),
    'OperationsDelete': ('opcounters', 'delete', 'Count'),
    'NetworkIn': ('network', 'bytesIn', 'Bytes'),
    'NetworkOut': ('network', 'bytesOut', 'Bytes'),
}

def get_replication_lag(client):
    """Return how far this member is behind the primary in seconds, or None."""
    try:
        # serverStatus has no per-member optimes; replSetGetStatus does.
        status = client.admin.command('replSetGetStatus')
    except pymongo.errors.OperationFailure:
        return None  # Not a replica set member, or not authorized
    members = status['members']
    primary = next((m for m in members if m.get('stateStr') == 'PRIMARY'), None)
    me = next((m for m in members if m.get('self')), None)
    if primary is None or me is None:
        return None
    lag = (primary['optimeDate'] - me['optimeDate']).total_seconds()
    return max(lag, 0.0)

def get_mongo_stats(previous_counters):
    """Return (metrics, counters); both are None if collection fails."""
    client = None
    try:
        # directConnection (PyMongo 3.11+) keeps the commands on MONGO_HOST
        # instead of routing them to the replica set primary.
        client = pymongo.MongoClient(MONGO_HOST, MONGO_PORT,
                                     serverSelectionTimeoutMS=5000,
                                     directConnection=True)
        stats = client.admin.command('serverStatus')  # Also tests the connection

        metrics = [
            {'MetricName': name, 'Value': float(stats[section][key]), 'Unit': unit}
            for name, (section, key, unit) in GAUGES.items()
        ]

        counters = {name: stats[section][key]
                    for name, (section, key, _unit) in COUNTERS.items()}
        if previous_counters is not None:
            for name, (_section, _key, unit) in COUNTERS.items():
                delta = counters[name] - previous_counters[name]
                if delta >= 0:  # A negative delta means mongod restarted
                    metrics.append({'MetricName': name, 'Value': float(delta), 'Unit': unit})

        # Replication Lag (if replica set)
        lag = get_replication_lag(client)
        if lag is not None:
            metrics.append({'MetricName': 'ReplicationLag', 'Value': lag, 'Unit': 'Seconds'})

        return metrics, counters

    except pymongo.errors.ConnectionFailure as e:
        print(f"Could not connect to MongoDB: {e}")
        return None, None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None, None
    finally:
        if client:
            client.close()

def put_metrics_to_cloudwatch(metrics_data):
    if not metrics_data:
        return

    try:
        response = cloudwatch.put_metric_data(
            Namespace=CLOUDWATCH_NAMESPACE,
            MetricData=metrics_data
        )
        print(f"Successfully put metrics to CloudWatch: {response}")
    except Exception as e:
        print(f"Error putting metrics to CloudWatch: {e}")

if __name__ == "__main__":
    print("Starting MongoDB monitoring script...")
    previous_counters = None  # The first sample only establishes a baseline
    while True:
        mongo_metrics, previous_counters = get_mongo_stats(previous_counters)
        if mongo_metrics:
            put_metrics_to_cloudwatch(mongo_metrics)
        else:
            print("Failed to retrieve MongoDB metrics.")

        print("Waiting for next interval...")
        time.sleep(INTERVAL_SECONDS)

To run this script, you’ll need to install the Boto3 and PyMongo libraries:

pip install boto3 pymongo

Then, make sure your EC2 instance’s IAM role allows cloudwatch:PutMetricData. Because the script loops indefinitely, run it as a long-lived process such as a systemd service rather than from cron, which would launch an additional copy on every run.

Application-Level Metrics and Logging

Monitoring the underlying infrastructure is only half the battle. We need visibility into our Shopify app’s performance and behavior.

Custom Application Metrics

Instrument your application code to emit custom metrics. For a PHP application, this might involve using a library that can send metrics to a StatsD endpoint, which the CloudWatch agent can then collect. If using Node.js, libraries like @aws-sdk/client-cloudwatch can be used directly.

Example (Conceptual PHP using a StatsD client):

<?php
// Assuming you have a StatsD client library configured
// e.g., using `php-statsd-client` or similar

$statsd = new StatsDClient(['host' => '127.0.0.1', 'port' => 8125]);

// Example: Track API request duration
$startTime = microtime(true);

// ... your API request processing logic ...

$endTime = microtime(true);
$duration = ($endTime - $startTime) * 1000; // in milliseconds
$statsd->timing('api.request.duration', $duration);

// Example: Track successful order creations
if ($orderWasCreatedSuccessfully) {
    $statsd->increment('shopify.orders.created');
} else {
    $statsd->increment('shopify.orders.failed');
}

// Example: Track specific shopify API call latency
$shopifyApiStartTime = microtime(true);
// ... call Shopify API ...
$shopifyApiEndTime = microtime(true);
$shopifyApiDuration = ($shopifyApiEndTime - $shopifyApiStartTime) * 1000;
$statsd->timing('shopify.api.call.latency', $shopifyApiDuration);

?>

Ensure your amazon-cloudwatch-agent.json is configured to collect StatsD metrics (as shown in the earlier EC2 configuration example). The metrics will appear under the namespace set in the agent's "metrics" section, which is `ShopifyApp/EC2` in that example; if no namespace is set, the agent uses its default, `CWAgent`.
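To make the StatsD path concrete, the sketch below shows what a client actually sends: one plain-text UDP datagram per metric in the form `name:value|type`, where `c` marks a counter and `ms` a timer. It is written in Python rather than PHP, uses only the standard library, and assumes the agent's StatsD listener is reachable on `127.0.0.1:8125`. The `StatsdSender` class and `timed_call` helper are hypothetical names for this example; the metric names come from the PHP example above.

```python
import socket
import time


def format_statsd(name, value, metric_type):
    """Build a StatsD datagram such as b'api.request.duration:12.5|ms'."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


class StatsdSender:
    """Fire-and-forget UDP sender; delivery is not guaranteed."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.address = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def increment(self, name, count=1):
        self.sock.sendto(format_statsd(name, count, "c"), self.address)

    def timing(self, name, milliseconds):
        payload = format_statsd(name, round(milliseconds, 3), "ms")
        self.sock.sendto(payload, self.address)

    def close(self):
        self.sock.close()


def timed_call(sender, name, func, *args, **kwargs):
    """Run func and report its wall-clock duration in milliseconds."""
    start = time.monotonic()
    try:
        return func(*args, **kwargs)
    finally:
        # Reported even if func raises, so failures still show up in latency.
        sender.timing(name, (time.monotonic() - start) * 1000)
```

Because UDP is connectionless, a stopped agent does not slow down or break the application; the metrics are simply dropped.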

Centralized Logging with CloudWatch Logs

Application logs are invaluable for debugging. Configure your web server (Nginx/Apache), PHP-FPM, and your application itself to log to files that the CloudWatch agent monitors. The agent will then stream these logs to CloudWatch Logs, enabling centralized searching, filtering, and analysis.

Key Log Files to Monitor:

Nginx Access and Error Logs
PHP-FPM Error Logs
Application-specific logs (e.g., storage/logs/laravel.log for Laravel)
MongoDB logs (if self-hosted)

The amazon-cloudwatch-agent.json configuration includes examples for Nginx and PHP-FPM. For application logs, add entries like this to the "collect_list" array inside the "files" section, separated from the preceding entry by a comma:

          {
            "file_path": "/var/www/your-app/storage/logs/laravel.log",
            "log_group_name": "ShopifyApp/Laravel/App",
            "log_stream_name": "{instance_id}"
          }

Alerting Strategies with CloudWatch Alarms

Metrics and logs are only useful if they trigger action when something goes wrong. CloudWatch Alarms are essential for proactive issue detection.

Critical Alarms to Configure:

High CPU Utilization: e.g., > 80% for 15 minutes on any web server.
Low Memory Availability: e.g., < 10% available memory for 10 minutes.
High Disk Latency or I/O Wait: e.g., > 50 ms average I/O latency, or a sustained high cpu_usage_iowait percentage, for 5 minutes.
High Error Rates: Monitor Nginx error logs or application error metrics. Use CloudWatch Logs metric filters to count error occurrences.
Database Connection Issues: High number of failed connections or exceeding connection limits.
Replication Lag (MongoDB): If replication lag exceeds a defined threshold (e.g., > 60 seconds).
Application-Specific Thresholds: e.g., API request latency exceeding acceptable limits, high rate of failed Shopify API calls.
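A sketch of how the first two thresholds above might be expressed with boto3's `put_metric_alarm`. "More than 80% for 15 minutes" becomes three consecutive 5-minute periods. The alarm names, the SNS topic ARN argument, and the choice to treat missing data as breaching are assumptions for this example; the memory alarm assumes the agent publishes `mem_available_percent` to the `ShopifyApp/EC2` namespace with an `InstanceId` dimension, as in the agent configuration earlier in this article.

```python
def sustained_alarm(name, namespace, metric, instance_id, comparison,
                    threshold, minutes, topic_arn, period=300):
    """Build put_metric_alarm arguments for a threshold held for `minutes`."""
    if (minutes * 60) % period != 0:
        raise ValueError("minutes must be a whole number of periods")
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period,
        "EvaluationPeriods": minutes * 60 // period,
        "Threshold": float(threshold),
        "ComparisonOperator": comparison,
        # An instance that stops reporting is treated as unhealthy.
        "TreatMissingData": "breaching",
        "AlarmActions": [topic_arn],
    }


def build_alarms(instance_id, topic_arn):
    """Alarm definitions for high CPU and low available memory."""
    return [
        sustained_alarm(f"{instance_id}-high-cpu", "AWS/EC2", "CPUUtilization",
                        instance_id, "GreaterThanThreshold", 80, 15, topic_arn),
        sustained_alarm(f"{instance_id}-low-memory", "ShopifyApp/EC2",
                        "mem_available_percent", instance_id,
                        "LessThanThreshold", 10, 10, topic_arn),
    ]


def create_alarms(instance_id, topic_arn):
    """Create or update the alarms in CloudWatch."""
    import boto3  # imported here so build_alarms can be used without AWS access

    cloudwatch = boto3.client("cloudwatch")
    for alarm in build_alarms(instance_id, topic_arn):
        cloudwatch.put_metric_alarm(**alarm)
```

Keeping the definitions as plain dictionaries makes them easy to review and to unit-test before anything is sent to AWS.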

Example CloudWatch Logs Metric Filter for Nginx Errors:

In the CloudWatch console, navigate to your Nginx error log group (e.g., `ShopifyApp/Nginx/Error`) and create a metric filter. The filter pattern could be:

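"[error]"

The double quotes make CloudWatch Logs match the literal text “[error]”, which is how Nginx tags error-level entries; unquoted square brackets would instead be interpreted as a space-delimited field pattern. Filter patterns are case-sensitive, and you can create more sophisticated patterns. Then, create an alarm based on this new metric (e.g., “Error Count > 10 in 5 minutes”).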
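The same filter and alarm can also be created from code. The following sketch uses boto3's `put_metric_filter` and `put_metric_alarm`; the filter name, alarm name, `NginxErrorCount` metric name, and `ShopifyApp/Nginx` namespace are assumptions for this example. The pattern is wrapped in double quotes so that the brackets are matched literally.

```python
ERROR_PATTERN = '"[error]"'  # quoted so the square brackets are literal text


def metric_filter_args(log_group, namespace="ShopifyApp/Nginx",
                       metric_name="NginxErrorCount"):
    """Arguments for logs.put_metric_filter: emit 1 per matching log line."""
    return {
        "logGroupName": log_group,
        "filterName": "nginx-error-lines",  # hypothetical filter name
        "filterPattern": ERROR_PATTERN,
        "metricTransformations": [{
            "metricName": metric_name,
            "metricNamespace": namespace,
            "metricValue": "1",
            "defaultValue": 0.0,  # report 0 when no line matches
        }],
    }


def error_alarm_args(topic_arn, namespace="ShopifyApp/Nginx",
                     metric_name="NginxErrorCount", max_errors=10):
    """Arguments for cloudwatch.put_metric_alarm: more than max_errors in 5 minutes."""
    return {
        "AlarmName": "nginx-error-count",  # hypothetical alarm name
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": float(max_errors),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


def apply_error_monitoring(log_group, topic_arn):
    """Create the metric filter, then the alarm that watches its metric."""
    import boto3  # imported here so the builders above work without AWS access

    boto3.client("logs").put_metric_filter(**metric_filter_args(log_group))
    boto3.client("cloudwatch").put_metric_alarm(**error_alarm_args(topic_arn))
```

For example, `apply_error_monitoring("ShopifyApp/Nginx/Error", topic_arn)` would cover the log group used in the agent configuration.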

Beyond CloudWatch: Advanced Considerations

While CloudWatch is powerful, consider these for a truly resilient system:

Distributed Tracing: For complex microservice architectures or deep debugging, tools like AWS X-Ray or open-source alternatives (Jaeger, Zipkin) can trace requests across multiple services.
Synthetic Monitoring: Use CloudWatch Synthetics Canaries or similar tools to simulate user interactions with your Shopify app (e.g., adding to cart, checkout) to proactively detect availability issues.
APM Tools: Application Performance Monitoring tools (e.g., Datadog, New Relic, Dynatrace) offer deeper insights into application code performance, database query analysis, and user experience monitoring, often with more intuitive dashboards than raw CloudWatch.
Automated Remediation: Integrate alarms with AWS Lambda or Systems Manager Automation documents to automatically restart services, scale instances, or trigger other recovery actions.
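As one possible shape for the automated remediation item, the sketch below is a Lambda handler that restarts a service through Systems Manager Run Command when an alarm fires. It assumes the alarm notifies an SNS topic that invokes the function, that the notification lists the alarm's dimensions under `Trigger.Dimensions` with lowercase `name`/`value` keys (worth verifying against a real notification), and that the instances run the SSM agent. The `php-fpm` service name is hypothetical.

```python
import json

RESTART_COMMAND = "systemctl restart php-fpm"  # hypothetical service name


def instance_ids_in_alarm(event):
    """Extract InstanceId dimensions from SNS-delivered alarms in ALARM state."""
    ids = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue  # ignore OK and INSUFFICIENT_DATA transitions
        for dim in message.get("Trigger", {}).get("Dimensions", []):
            if dim.get("name") == "InstanceId":
                ids.append(dim["value"])
    return ids


def lambda_handler(event, context):
    """Restart the service on every instance named in the alarm."""
    import boto3  # imported here so the parser above can be tested without AWS access

    instance_ids = instance_ids_in_alarm(event)
    if instance_ids:
        boto3.client("ssm").send_command(
            InstanceIds=instance_ids,
            DocumentName="AWS-RunShellScript",
            Parameters={"commands": [RESTART_COMMAND]},
        )
    return {"restarted": instance_ids}
```

Automatic restarts can mask a recurring fault, so it is worth pairing this kind of handler with a notification to a human as well.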

By implementing a layered monitoring strategy encompassing infrastructure, database, and application-level metrics, coupled with robust alerting and logging, you can significantly improve the reliability and performance of your Shopify app and its underlying MongoDB clusters on AWS.