Server Monitoring Best Practices: Keeping Your Shopify App and DynamoDB Clusters Alive on AWS

Establishing a Robust Monitoring Foundation with CloudWatch

For any production Shopify app hosted on AWS, a comprehensive monitoring strategy is paramount. This begins with leveraging Amazon CloudWatch, the foundational observability service. We’ll focus on key metrics for both your application instances (e.g., EC2, ECS, EKS) and your DynamoDB clusters. The goal is not just to collect data, but to establish actionable alarms that prevent outages and performance degradation.

Application Instance Monitoring (EC2/ECS/EKS)

EC2 publishes basic instance metrics (CPU, network, disk I/O, and status checks) to CloudWatch by default, with no agent required. For deeper insight, such as memory and disk-space usage, and especially within containerized environments like ECS or EKS, we need to go further: install the CloudWatch agent, publish custom metrics, and aggregate logs.

Key EC2 Metrics and Alarms

Start with the basics. Ensure you have alarms configured for:

CPUUtilization: A sustained high CPU (e.g., > 80% for 15 minutes) indicates a potential bottleneck.
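MemoryUtilization: Not a default EC2 metric. The CloudWatch agent can collect it (published as mem_used_percent in the CWAgent namespace by default). High memory usage can lead to swapping and performance issues.
DiskReadOps/DiskWriteOps: Spikes or sustained high I/O can signal disk contention. These EC2 metrics cover instance store volumes only; for EBS volumes, use the equivalent metrics in the AWS/EBS namespace (for example, VolumeReadOps and VolumeWriteOps).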
NetworkIn/NetworkOut: Excessive traffic can indicate unexpected load or potential DDoS activity.
StatusCheckFailed: A critical alarm indicating instance-level or system-level issues.

Here’s a sample CloudWatch alarm configuration for CPU Utilization using the AWS CLI:

Replace the alarm name, YOUR_EC2_INSTANCE_ID, and the SNS topic ARN (region, account ID, and topic name) with your own values.

aws cloudwatch put-metric-alarm \
    --alarm-name "HighCPUUtilization-MyApp" \
    --alarm-description "Alarm when CPU exceeds 80% for 15 minutes" \
    --metric-name CPUUtilization \
    --namespace AWS/EC2 \
    --statistic Average \
    --period 300 \
    --threshold 80 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --dimensions Name=InstanceId,Value=YOUR_EC2_INSTANCE_ID \
    --evaluation-periods 3 \
    --datapoints-to-alarm 3 \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:YOUR_SNS_TOPIC_ARN
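
The interaction between --period, --evaluation-periods, and --datapoints-to-alarm is easy to misread, so here is a minimal Python sketch of the "M out of N" rule. It is a simplification for illustration, not CloudWatch's actual evaluator: the function names are hypothetical, and missing datapoints (which the real service handles through --treat-missing-data) are ignored.

```python
from typing import Sequence


def alarm_window_seconds(period: int, evaluation_periods: int) -> int:
    """Length of the window the alarm looks at, in seconds."""
    return period * evaluation_periods


def would_alarm(
    datapoints: Sequence[float],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> bool:
    """Simplified M-out-of-N check over the most recent datapoints.

    Uses >= to mirror GreaterThanOrEqualToThreshold. Each datapoint stands
    for one period's statistic (for example, a 5-minute CPU average).
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False  # not enough datapoints to fill the window yet
    breaching = sum(1 for value in recent if value >= threshold)
    return breaching >= datapoints_to_alarm


if __name__ == "__main__":
    # 300-second periods evaluated three at a time give a 15-minute window.
    print(alarm_window_seconds(300, 3))
    # Last three averages are 85, 90, 82: all at or above 80.
    print(would_alarm([70.0, 85.0, 90.0, 82.0], 80, 3, 3))
    # One of the last three averages is below 80, so 3-of-3 is not met.
    print(would_alarm([85.0, 90.0, 75.0], 80, 3, 3))
```

Lowering datapoints_to_alarm to 2 while keeping evaluation_periods at 3 would make the alarm tolerate one quiet period inside an otherwise busy window.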

Containerized Applications (ECS/EKS) – Custom Metrics and Logs

For containerized workloads, standard EC2 metrics are insufficient. You need to monitor container-specific metrics and aggregate application logs.

ECS: Enable CloudWatch Container Insights, or run the CloudWatch Agent alongside your tasks, to collect container-level CPU and memory utilization. For application logs, set awslogs as the log driver in your container definitions so that logs go to CloudWatch Logs. The fragment below shows only the logging settings; other container settings are omitted.

{
  "containerDefinitions": [
    {
      "name": "my-app-container",
      "image": "my-docker-repo/my-app:latest",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/my-app",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

EKS: Deploy the CloudWatch Agent as a DaemonSet to collect node-level and pod-level metrics. For application logs, use a Fluentd or Fluent Bit DaemonSet to collect logs from pods and forward them to CloudWatch Logs. You can also leverage Prometheus and Grafana, integrating with CloudWatch Container Insights for a more advanced observability stack.

DynamoDB Cluster Monitoring

DynamoDB is a critical component for many Shopify apps. Proactive monitoring of its performance and capacity is essential to prevent read/write throttling and ensure low latency.

Key DynamoDB Metrics and Alarms

Focus on these core DynamoDB metrics:

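ConsumedReadCapacityUnits/ConsumedWriteCapacityUnits: Monitor these against provisioned capacity. CloudWatch reports them as a sum over the period, so divide by the period length in seconds before comparing with per-second provisioned units.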
ReadThrottleEvents/WriteThrottleEvents: These are critical indicators of insufficient capacity.
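SuccessfulRequestLatency: Track latency for reads and writes, ideally at a high percentile such as P90 or P99. The metric is reported per operation (an Operation dimension such as GetItem, Query, or PutItem), so watch the operations your app depends on. High latency directly impacts user experience.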
SystemErrors: Monitor for any server-side errors.
ThrottledRequests: A broader metric for throttling across different operations.
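
Comparing consumed capacity with provisioned capacity needs one conversion step, because the consumed metrics arrive as a sum per period while provisioned capacity is expressed per second. The sketch below shows that arithmetic; the function name and the sample numbers are illustrative, and it applies only to tables in provisioned mode (on-demand tables have no provisioned figure to compare against).

```python
def capacity_utilization(
    consumed_sum: float,
    period_seconds: int,
    provisioned_per_second: float,
) -> float:
    """Fraction of provisioned capacity used during one CloudWatch period.

    consumed_sum is the Sum statistic of ConsumedReadCapacityUnits or
    ConsumedWriteCapacityUnits for that period.
    """
    if period_seconds <= 0 or provisioned_per_second <= 0:
        raise ValueError("period and provisioned capacity must be positive")
    average_per_second = consumed_sum / period_seconds
    return average_per_second / provisioned_per_second


if __name__ == "__main__":
    # 24,000 units consumed in a 300-second period is 80 units per second.
    # Against 100 provisioned units per second that is 80% utilization.
    print(capacity_utilization(24_000, 300, 100))
```

A value that stays near 1.0 is a signal to raise provisioned capacity or review auto scaling settings before throttling starts.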

Here’s an example alarm for DynamoDB Write Throttling:

aws cloudwatch put-metric-alarm \
    --alarm-name "DynamoDBWriteThrottling-MyAppTable" \
    --alarm-description "Alarm when WriteThrottleEvents exceed 0 for 5 minutes" \
    --metric-name WriteThrottleEvents \
    --namespace AWS/DynamoDB \
    --statistic Sum \
    --period 300 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=TableName,Value=YOUR_DYNAMODB_TABLE_NAME \
    --evaluation-periods 1 \
    --datapoints-to-alarm 1 \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:YOUR_SNS_TOPIC_ARN
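
If you manage several tables, it can be easier to generate these alarms from code than to repeat the CLI call. The following Python sketch builds the same settings as keyword arguments for boto3's put_metric_alarm, covering both read and write throttling. The helper names and the alarm naming scheme are assumptions for illustration; boto3 is imported only inside the function that actually calls AWS.

```python
from typing import Any, Dict, List


def throttle_alarm_params(table_name: str, sns_topic_arn: str) -> List[Dict[str, Any]]:
    """Build put_metric_alarm keyword arguments for read and write throttling."""
    alarms = []
    for metric in ("ReadThrottleEvents", "WriteThrottleEvents"):
        alarms.append({
            "AlarmName": f"DynamoDB{metric}-{table_name}",
            "AlarmDescription": f"Alarm when {metric} exceed 0 for 5 minutes",
            "MetricName": metric,
            "Namespace": "AWS/DynamoDB",
            "Statistic": "Sum",
            "Period": 300,
            "Threshold": 0.0,
            "ComparisonOperator": "GreaterThanThreshold",
            "Dimensions": [{"Name": "TableName", "Value": table_name}],
            "EvaluationPeriods": 1,
            "DatapointsToAlarm": 1,
            "TreatMissingData": "notBreaching",
            "AlarmActions": [sns_topic_arn],
        })
    return alarms


def create_alarms(params: List[Dict[str, Any]]) -> None:
    """Create or update each alarm. Requires AWS credentials and boto3."""
    import boto3  # imported lazily so the builder works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    for kwargs in params:
        cloudwatch.put_metric_alarm(**kwargs)
```

Calling create_alarms(throttle_alarm_params("YOUR_DYNAMODB_TABLE_NAME", "YOUR_SNS_TOPIC_ARN")) would create both alarms; put_metric_alarm updates an alarm in place when the name already exists.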

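For latency, you would typically alarm on a high percentile such as P90 or P99. Percentiles are requested with --extended-statistic rather than --statistic, and SuccessfulRequestLatency is reported per operation, so the alarm needs an Operation dimension alongside TableName. The example below alarms if P90 GetItem latency exceeds 100 ms for 10 minutes (two consecutive 5-minute periods). Check in the CloudWatch console that percentile data is available for this metric on your tables; if it is not, use --statistic Average instead.

aws cloudwatch put-metric-alarm \
    --alarm-name "HighReadLatency-MyAppTable" \
    --alarm-description "Alarm when P90 GetItem latency exceeds 100ms for 10 minutes" \
    --metric-name SuccessfulRequestLatency \
    --namespace AWS/DynamoDB \
    --extended-statistic p90 \
    --period 300 \
    --threshold 100 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=TableName,Value=YOUR_DYNAMODB_TABLE_NAME Name=Operation,Value=GetItem \
    --evaluation-periods 2 \
    --datapoints-to-alarm 2 \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:YOUR_SNS_TOPIC_ARN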

Implementing Application Performance Monitoring (APM)

While CloudWatch provides infrastructure and service-level metrics, Application Performance Monitoring (APM) tools offer deep visibility into your application’s code execution, database queries, and external service calls. For a PHP-based Shopify app, integrating a robust APM solution is crucial for pinpointing performance bottlenecks that CloudWatch alone cannot reveal.

Choosing and Integrating an APM Tool

Popular APM solutions include Datadog APM, New Relic, Dynatrace, and AWS X-Ray. For this example, let’s consider integrating Datadog, a widely adopted platform offering comprehensive tracing and metrics.

Datadog Integration for PHP Applications

The primary mechanism for APM in PHP is a native extension loaded into the PHP runtime. Datadog provides one through its dd-trace-php project.

Installation:

PECL: If you have PECL installed, this is the simplest method:

pecl install datadog_trace

Manual Compilation: Download the source, compile, and install.
Docker: If using Docker, you’ll typically install it within your Dockerfile.

Configuration:

Once installed, you need to enable the extension in your php.ini file. Add the following lines:

[datadog]
extension=ddtrace.so

datadog.agent_host = 127.0.0.1
datadog.trace.agent_port = 8126
datadog.service = my-shopify-app
datadog.env = production
datadog.version = 1.0.0
datadog.trace.enabled = true

Ensure your PHP-FPM or Apache configuration reloads to pick up these changes. The datadog.agent_host and datadog.trace.agent_port settings assume the Datadog Agent is running on the same host or accessible via that address. If using ECS/EKS, the Datadog Agent might be a sidecar container or a DaemonSet. Setting names have changed between tracer releases, so check them against the documentation for the version you install.

Shopify App Specific Tracing

The dd-trace-php extension automatically instruments many common PHP functions and frameworks. For Shopify apps, you’ll want to ensure:

Framework Instrumentation: If using Laravel, Symfony, or another framework, ensure the Datadog tracer is configured to instrument its components (e.g., routing, ORM, templating).
HTTP Client Tracing: Crucially, trace outgoing HTTP requests made by your app to the Shopify API. The tracer should automatically pick up cURL and Guzzle requests.
Database Query Tracing: Ensure your database interactions (e.g., with MySQL, PostgreSQL, or even DynamoDB via an ORM) are traced.
Custom Traces: For critical business logic or specific API endpoints, you can add custom spans to provide more granular insights.

<?php
require_once '/path/to/vendor/autoload.php'; // Assuming Composer is installed

// Example of custom tracing for a critical function.
function get_shopify_products($shopDomain, $accessToken) {
    $url = "https://{$shopDomain}/admin/api/2023-10/products.json";
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        "X-Shopify-Access-Token: {$accessToken}",
        "Content-Type: application/json"
    ]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($httpCode !== 200) {
        // Tag the active span so the failure is visible on the trace.
        // The tag name follows Datadog's error.* convention; verify it
        // against the tracer version you run.
        if (function_exists('DDTrace\active_span') && ($span = \DDTrace\active_span())) {
            $span->meta['error.message'] = "Shopify API Error: {$httpCode}";
        }
        return false;
    }

    return json_decode($response, true);
}

// Register the custom span: every call to get_shopify_products() is wrapped
// in a span named shopify.api.get_products. The guard keeps the code working
// when the extension is not loaded (e.g., in local development).
if (function_exists('DDTrace\trace_function')) {
    \DDTrace\trace_function('get_shopify_products', function (\DDTrace\SpanData $span) {
        $span->name = 'shopify.api.get_products';
    });
}

// In your controller or service:
$products = get_shopify_products('my-shop.myshopify.com', 'your_token');
?>

This custom span will appear in Datadog, allowing you to see the duration, success/failure, and associated metadata of this specific Shopify API call.

Log Aggregation and Analysis

Centralized logging is indispensable for debugging production issues, auditing, and understanding application behavior. For a distributed system like a Shopify app on AWS, logs from various instances and services must be aggregated into a single, searchable location.

Leveraging CloudWatch Logs

CloudWatch Logs is the native AWS solution for log aggregation. It’s well-integrated with other AWS services and provides features for log retention, filtering, and metric-based alarms.

Configuring Log Forwarding

EC2: Install the CloudWatch Agent and configure it to tail your application log files. The agent can then stream these logs to CloudWatch Logs log groups.

{
  "agent": {
    "metrics_collection_interval": 60
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my-app/app.log",
            "log_group_name": "/aws/my-shopify-app/app-logs",
            "log_stream_name": "{instance_id}/app"
          },
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/aws/my-shopify-app/nginx-access",
            "log_stream_name": "{instance_id}/nginx-access"
          }
        ]
      }
    }
  }
}

ECS/EKS: As mentioned earlier, configure your container definitions (ECS) or DaemonSets (EKS with Fluentd/Fluent Bit) to use the awslogs log driver or forward logs to CloudWatch Logs.

Log Analysis and Alerting

Once logs are in CloudWatch Logs, you can:

Create Metric Filters: Extract numerical data from log events to create CloudWatch Metrics. This is powerful for alerting on specific error patterns. For example, to count occurrences of the exact phrase “FATAL ERROR” in your application logs (the inner double quotes make the pattern match the phrase rather than the two words independently):

aws logs put-metric-filter \
    --log-group-name "/aws/my-shopify-app/app-logs" \
    --filter-name "FatalErrorCount" \
    --filter-pattern '"FATAL ERROR"' \
    --metric-transformations metricName=FatalErrors,metricNamespace=MyApp/Logs,metricValue=1,defaultValue=0

You can then create a CloudWatch Alarm on the FatalErrors metric.

Create Log Metric Filters for DynamoDB Errors: If your application logs DynamoDB errors or throttling events, you can create metric filters to capture these and alarm on them.
Use CloudWatch Logs Insights: For ad-hoc querying and analysis, Logs Insights provides a powerful query language to search across your log data.
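
As a rough illustration of an ad-hoc Logs Insights query, the Python sketch below counts "FATAL ERROR" events in 5-minute buckets through boto3's start_query and get_query_results calls. The helper names, the one-hour lookback, and the polling interval are assumptions; the log group name is whatever you configured above. boto3 is imported inside run_query so the pure helpers can be used without AWS access.

```python
import time
from typing import Dict, List


def fatal_error_query(bin_minutes: int = 5) -> str:
    """Logs Insights query counting FATAL ERROR events per time bucket."""
    return (
        "fields @timestamp, @message\n"
        "| filter @message like /FATAL ERROR/\n"
        f"| stats count(*) as fatal_count by bin({bin_minutes}m)"
    )


def rows_to_dicts(results: List[List[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Flatten the field/value pairs of each result row into one dict per row."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


def run_query(log_group: str, query: str, lookback_seconds: int = 3600,
              poll_seconds: float = 1.0) -> List[Dict[str, str]]:
    """Run the query and wait for it to finish. Requires AWS credentials."""
    import boto3  # imported lazily; only needed when actually querying AWS

    logs = boto3.client("logs")
    end = int(time.time())
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=end - lookback_seconds,
        endTime=end,
        queryString=query,
    )["queryId"]

    # Queries run asynchronously, so poll until the status settles.
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)

    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    return rows_to_dicts(response["results"])
```

For example, run_query("/aws/my-shopify-app/app-logs", fatal_error_query()) would return one dict per 5-minute bucket that contained at least one match.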

Alternative: Centralized Logging with Elasticsearch/OpenSearch

For very large-scale applications or when advanced search capabilities are required, consider a dedicated log aggregation platform like Amazon OpenSearch Service (formerly Amazon Elasticsearch Service). This typically involves using Fluentd or Fluent Bit as log forwarders from your instances/containers to your OpenSearch cluster.

Proactive Health Checks and Synthetic Monitoring

Beyond reactive monitoring (alarms triggered by metrics or logs), proactive health checks and synthetic monitoring ensure your application is not only running but also performing its critical functions from an end-user perspective.

Application Health Endpoints

Implement a dedicated health check endpoint in your Shopify app (e.g., /health or /status). This endpoint should:

Check the status of critical dependencies: Database connections, cache services, external API reachability (e.g., a quick call to a non-critical Shopify API endpoint).
Return a 200 OK status code if all checks pass, and a non-2xx status code (e.g., 503 Service Unavailable) if any check fails.
Optionally, return a JSON payload with details about the status of each dependency.

<?php
// Example for a simple PHP/Slim Framework health check.
// Assumes $app (the Slim application) and $pdo (a PDO instance) already exist.
use Psr\Http\Message\ResponseInterface as Response;
use Psr\Http\Message\ServerRequestInterface as Request;

$app->get('/health', function (Request $request, Response $response, array $args) use ($pdo) {
    $dependencies = [
        'database' => checkDatabaseConnection($pdo),
        'shopify_api' => checkShopifyApiReachability(),
        // Add other critical dependencies
    ];

    // Healthy only if every dependency check returned true.
    $allHealthy = !in_array(false, $dependencies, true);

    $response->getBody()->write(json_encode([
        'status' => $allHealthy ? 'ok' : 'degraded',
        'dependencies' => $dependencies,
    ]));

    return $response
        ->withHeader('Content-Type', 'application/json')
        ->withStatus($allHealthy ? 200 : 503);
});

function checkDatabaseConnection(PDO $pdo): bool {
    // A trivial query confirms the connection is alive.
    try {
        return $pdo->query('SELECT 1') !== false;
    } catch (\PDOException $e) {
        // Log the error
        return false;
    }
}

function checkShopifyApiReachability(): bool {
    // Lightweight reachability check with a short timeout. No access token is
    // sent, so the Admin API may answer 401; any response below 500 still
    // shows that Shopify is reachable.
    $ch = curl_init("https://your-shop.myshopify.com/admin/api/2023-10/shop.json");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5); // Short timeout
    curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    // An HTTP code of 0 means no response at all (DNS failure, timeout, ...).
    return ($httpCode > 0 && $httpCode < 500);
}
?>

Synthetic Monitoring with CloudWatch Synthetics

CloudWatch Synthetics allows you to create "canaries" – configurable scripts that run on a schedule to simulate user interactions. For your Shopify app, you can create canaries to:

Hit the Health Endpoint: Regularly poll your application's health check endpoint.
Simulate User Flows: For critical user journeys (e.g., product search, adding to cart, checkout initiation), create more complex canaries using Node.js scripts.
Test API Endpoints: Directly test key API endpoints your app relies on, including Shopify's API.

Configure alarms on canary failures or increased latency. This provides an external perspective on your application's availability and performance, independent of internal metrics.
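
To make the idea concrete, here is a minimal Python sketch of the kind of check a health-endpoint canary performs: request the URL, record the status code and latency, and decide pass or fail. It is a plain standard-library stand-in, not the CloudWatch Synthetics runtime API; the names ProbeResult and probe and the 2-second latency budget are assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple, Optional


class ProbeResult(NamedTuple):
    healthy: bool
    status: Optional[int]  # None when no HTTP response was received
    latency_ms: float


def _http_status(url: str, timeout: float) -> int:
    """Return the HTTP status code, including non-2xx codes such as 503."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib raises for 4xx/5xx; the code is still useful


def probe(
    url: str,
    timeout: float = 5.0,
    max_latency_ms: float = 2000.0,
    fetch_status: Callable[[str, float], int] = _http_status,
) -> ProbeResult:
    """Check one endpoint: healthy means HTTP 200 within the latency budget."""
    started = time.monotonic()
    try:
        status: Optional[int] = fetch_status(url, timeout)
    except OSError:
        # Covers connection errors, DNS failures, and timeouts.
        status = None
    latency_ms = (time.monotonic() - started) * 1000.0
    healthy = status == 200 and latency_ms <= max_latency_ms
    return ProbeResult(healthy, status, latency_ms)
```

The fetch_status parameter exists so the check can be exercised without network access; in a scheduled job you would call probe("https://your-app.example.com/health") and publish the result as a metric or fail the run when healthy is false.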

Automated Remediation and Incident Response

The ultimate goal of monitoring is to maintain uptime. This involves not just detecting issues but also responding to them effectively. Automation plays a key role here.

Leveraging AWS Lambda for Automated Actions

AWS Lambda functions can be triggered by CloudWatch Alarms via SNS topics. This allows for automated remediation actions:

Auto-Scaling: If CPU or memory utilization is consistently high, trigger an Auto Scaling action to add more instances.
Restarting Services: For transient issues, a Lambda function could attempt to restart a specific service or container.
Draining Traffic: If an instance is unhealthy, a Lambda function can remove it from the load balancer's target group.
Notifying On-Call Engineers: While SNS can do this directly, Lambda can enrich notifications or route them to specific on-call systems (PagerDuty, Opsgenie) via their APIs.

import json
import boto3

autoscaling = boto3.client('autoscaling')
sns = boto3.client('sns')

ASG_NAME = "your-app-autoscaling-group-name"  # Replace with your ASG name

def lambda_handler(event, context):
    # Alarms delivered through SNS arrive as a JSON string in Records[].Sns.Message.
    for record in event['Records']:
        alarm = json.loads(record['Sns']['Message'])
        alarm_name = alarm['AlarmName']
        new_state_value = alarm['NewStateValue']
        trigger = alarm.get('Trigger', {})
        metric_name = trigger.get('MetricName', '')
        dimensions = trigger.get('Dimensions', [])

        print(f"Alarm '{alarm_name}' changed state to '{new_state_value}'")

        if new_state_value != 'ALARM':
            continue

        if metric_name == 'CPUUtilization':
            # Dimensions is a list of {"name": ..., "value": ...} objects.
            instance_id = next((d['value'] for d in dimensions if d['name'] == 'InstanceId'), None)
            if instance_id:
                print(f"High CPU detected on {instance_id}. Considering scaling up.")
                # In a real scenario, you'd likely trigger an Auto Scaling Group action
                # or notify a human operator. Directly modifying ASG capacity here
                # can be risky without proper safeguards.
                # Example: Triggering a specific ASG scale-up action
                try:
                    group = autoscaling.describe_auto_scaling_groups(
                        AutoScalingGroupNames=[ASG_NAME]
                    )['AutoScalingGroups'][0]
                    # Never request more than the group's configured maximum.
                    desired = min(group['DesiredCapacity'] + 1, group['MaxSize'])
                    autoscaling.set_desired_capacity(
                        AutoScalingGroupName=ASG_NAME,
                        DesiredCapacity=desired
                    )
                    print(f"Set desired capacity for {ASG_NAME} to {desired}.")
                except Exception as e:
                    print(f"Error scaling ASG: {e}")

        # Add more remediation logic for other alarms (e.g., memory, throttling)

        # Optionally, send a notification to a different SNS topic for processed events
        # sns.publish(
        #     TopicArn='arn:aws:sns:us-east-1:123456789012:ProcessedAlarmsTopic',
        #     Message=json.dumps(alarm),
        #     Subject=f"Processed Alarm: {alarm_name} - {new_state_value}"
        # )

    return {
        'statusCode': 200,
        'body': json.dumps('Remediation process initiated.')
    }

# Example event structure for an alarm delivered via SNS (simplified).
# Note that Sns.Message is a JSON-encoded string, shown here already decoded.
# {
#   "Records": [
#     {
#       "EventSource": "aws:sns",
#       "Sns": {
#         "Subject": "ALARM: \"HighCPUUtilization-MyApp\" in US East (N. Virginia)",
#         "Message": {
#           "AlarmName": "HighCPUUtilization-MyApp",
#           "AlarmDescription": "Alarm when CPU exceeds 80% for 15 minutes",
#           "AWSAccountId": "123456789012",
#           "OldStateValue": "OK",
#           "NewStateValue": "ALARM",
#           "NewStateReason": "...",
#           "StateChangeTime": "...",
#           "Trigger": {
#             "MetricName": "CPUUtilization",
#             "Namespace": "AWS/EC2",
#             "Dimensions": [{"name": "InstanceId", "value": "i-0123456789abcdef0"}],
#             "Period": 300,
#             "EvaluationPeriods": 3,
#             "ComparisonOperator": "GreaterThanOrEqualToThreshold",
#             "Threshold": 80.0
#           }
#         }
#       }
#     }
#   ]
# }

Incident Management Playbooks

Even with automation, human intervention is often required. Develop clear incident management playbooks that outline:

Roles and Responsibilities: Who is on call? Who is the incident commander?
Escalation Paths: When and how to escalate to senior engineers or external teams.
Communication Channels: How to communicate status internally and externally (e.g., status page).
Runbooks: Step-by-step guides for diagnosing and resolving common incidents (e.g., "DynamoDB Throttling," "High Application Latency").

Integrate your monitoring alerts with your incident management platform (e.g., PagerDuty, Opsgenie) to ensure timely notifications and efficient response.
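
As one possible shape for that integration, the Python sketch below converts a CloudWatch alarm message (the JSON that SNS delivers) into a PagerDuty Events API v2 event, so that ALARM opens an incident and OK resolves it. The function names, the fixed "critical" severity, and the choice of the alarm name as dedup key are assumptions to adapt; routing_key is the integration key from your PagerDuty service.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def pagerduty_event(alarm: Dict[str, Any], routing_key: str) -> Optional[Dict[str, Any]]:
    """Map a CloudWatch alarm message to an Events API v2 body.

    Returns None for states that should not page (e.g. INSUFFICIENT_DATA).
    """
    actions = {"ALARM": "trigger", "OK": "resolve"}
    action = actions.get(alarm["NewStateValue"])
    if action is None:
        return None

    alarm_name = alarm["AlarmName"]
    reason = alarm.get("NewStateReason", "state changed")
    return {
        "routing_key": routing_key,
        "event_action": action,
        # Reusing the alarm name lets the OK event resolve the same incident.
        "dedup_key": alarm_name,
        "payload": {
            "summary": f"{alarm_name}: {reason}"[:1024],
            "source": alarm.get("AWSAccountId", "aws-cloudwatch"),
            "severity": "critical",
        },
    }


def send_event(event: Dict[str, Any], timeout: float = 5.0) -> int:
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Inside a Lambda handler subscribed to the alarm's SNS topic, you would decode each record's message, pass it to pagerduty_event, and call send_event when the result is not None.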