Server Monitoring Best Practices: Keeping Your Shopify App and DynamoDB Clusters Alive on AWS
Establishing a Robust Monitoring Foundation with CloudWatch
For any production Shopify app hosted on AWS, monitoring starts with Amazon CloudWatch, the platform's native observability service. We’ll focus on key metrics for both your application compute (EC2, ECS, or EKS) and your DynamoDB tables. The goal is not just to collect data, but to set actionable alarms that catch outages and performance degradation early.
Application Instance Monitoring (EC2/ECS/EKS)
EC2 publishes basic instance metrics (CPU, network, disk I/O, status checks) to CloudWatch without any agent. OS-level metrics such as memory and disk-space usage require the CloudWatch agent, and containerized environments like ECS or EKS additionally call for custom metrics and log aggregation.
Key EC2 Metrics and Alarms
Start with the basics. Ensure you have alarms configured for:
- CPUUtilization: A sustained high CPU (e.g., > 80% for 15 minutes) indicates a potential bottleneck.
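- MemoryUtilization: Not a default EC2 metric; install the CloudWatch agent to collect it (by default the agent reports it as mem_used_percent in the CWAgent namespace). High memory usage can lead to swapping and performance issues.
- DiskReadOps/DiskWriteOps: Spikes or sustained high I/O can signal disk contention. These metrics cover instance store volumes only; for EBS-backed instances, watch the EBS volume metrics instead (for example VolumeReadOps/VolumeWriteOps in the AWS/EBS namespace).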
- NetworkIn/NetworkOut: Excessive traffic can indicate unexpected load or potential DDoS activity.
- StatusCheckFailed: A critical alarm indicating instance-level or system-level issues.
Here’s a sample CloudWatch alarm configuration for CPU utilization using the AWS CLI.
Replace YOUR_EC2_INSTANCE_ID and the SNS topic ARN (region, account ID, and topic name) with your own values, and adjust the alarm name to suit.
aws cloudwatch put-metric-alarm \
--alarm-name "HighCPUUtilization-MyApp" \
--alarm-description "Alarm when CPU exceeds 80% for 15 minutes" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanOrEqualToThreshold \
--dimensions Name=InstanceId,Value=YOUR_EC2_INSTANCE_ID \
--evaluation-periods 3 \
--datapoints-to-alarm 3 \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:YOUR_SNS_TOPIC_ARN
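The same alarm can be defined from code, which makes it easier to stamp out one alarm per instance. The sketch below is a minimal illustration, assuming boto3 is available at deploy time; the helper names (`build_cpu_alarm`, `alarm_window_minutes`, `create_alarm`) are hypothetical and not part of any AWS library. It also makes the arithmetic behind "80% for 15 minutes" explicit: the window is the period multiplied by the number of evaluation periods.

```python
def build_cpu_alarm(instance_id: str, sns_topic_arn: str,
                    threshold: float = 80.0, period_s: int = 300,
                    evaluation_periods: int = 3) -> dict:
    """Return keyword arguments for CloudWatch's PutMetricAlarm API call."""
    window_min = period_s * evaluation_periods // 60
    return {
        "AlarmName": f"HighCPUUtilization-{instance_id}",
        "AlarmDescription": f"CPU at or above {threshold}% for {window_min} minutes",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_s,
        "EvaluationPeriods": evaluation_periods,
        # Require every evaluated datapoint to breach, as in the CLI example.
        "DatapointsToAlarm": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def alarm_window_minutes(alarm: dict) -> float:
    """How long the metric must breach before the alarm fires."""
    return alarm["Period"] * alarm["EvaluationPeriods"] / 60


def create_alarm(alarm: dict) -> None:
    """Send the definition to CloudWatch (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed
    boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Calling `create_alarm(build_cpu_alarm(instance_id, topic_arn))` in a loop over your instance IDs would create one alarm each; for fleets behind an Auto Scaling group, an alarm on the group-level metric is usually the simpler choice.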
Containerized Applications (ECS/EKS) – Custom Metrics and Logs
For containerized workloads, standard EC2 metrics are insufficient. You need to monitor container-specific metrics and aggregate application logs.
ECS: Use the CloudWatch Agent to collect metrics from your ECS tasks. Configure the agent to collect container-level CPU and memory utilization. For application logs, ensure your container definitions point to awslogs as the log driver, sending logs to CloudWatch Logs.
{
"containerDefinitions": [
{
"name": "my-app-container",
"image": "my-docker-repo/my-app:latest",
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/my-app",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
// ... other container settings
}
]
}
EKS: Deploy the CloudWatch Agent as a DaemonSet to collect node-level and pod-level metrics. For application logs, use a Fluentd or Fluent Bit DaemonSet to collect logs from pods and forward them to CloudWatch Logs. You can also leverage Prometheus and Grafana, integrating with CloudWatch Container Insights for a more advanced observability stack.
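One way to publish custom application metrics from a container without extra API calls is the CloudWatch Embedded Metric Format (EMF): the application writes a structured JSON line to its log stream and CloudWatch extracts the metric from it. The sketch below shows the shape of such a record in Python for illustration; the same JSON can be produced from PHP. The namespace, dimension, and metric names are placeholders, and whether metrics are extracted automatically depends on how your logs reach CloudWatch Logs, so treat this as a starting point to verify in your own setup.

```python
import json
import time


def emf_record(namespace, service, metric, value,
               unit="Milliseconds", timestamp_ms=None):
    """Build one Embedded Metric Format record as a plain dict."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # Each inner list names the top-level keys used as dimensions.
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        "Service": service,  # dimension value
        metric: value,       # metric value, keyed by the metric name
    }


def emit(record):
    """Write the record as a single JSON line on stdout."""
    print(json.dumps(record), flush=True)


if __name__ == "__main__":
    # Hypothetical metric: latency of one Shopify API call, in milliseconds.
    emit(emf_record("MyApp/Custom", "my-app-container", "ShopifyApiLatency", 182.5))
```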
DynamoDB Cluster Monitoring
DynamoDB is a critical component for many Shopify apps. Proactive monitoring of its performance and capacity is essential to prevent read/write throttling and ensure low latency.
Key DynamoDB Metrics and Alarms
Focus on these core DynamoDB metrics:
- ConsumedReadCapacityUnits/ConsumedWriteCapacityUnits: Monitor these against provisioned capacity.
- ReadThrottleEvents/WriteThrottleEvents: These are critical indicators of insufficient capacity.
- SuccessfulRequestLatency: Track the P90 or P99 latency for reads and writes. High latency directly impacts user experience.
- SystemErrors: Monitor for any server-side errors.
- ThrottledRequests: A broader metric for throttling across different operations.
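Comparing consumed against provisioned capacity takes one unit conversion: the consumed metrics are totals over the period when read with the Sum statistic, while provisioned capacity is expressed per second. A small worked sketch, assuming a table in provisioned mode (on-demand tables have no provisioned figure to compare against); the function name is illustrative.

```python
def capacity_utilization(consumed_sum: float, period_s: int,
                         provisioned_per_s: float) -> float:
    """Fraction of provisioned capacity used during one CloudWatch period.

    consumed_sum      -- Sum of ConsumedReadCapacityUnits (or Write) over the period
    period_s          -- period length in seconds
    provisioned_per_s -- provisioned RCU or WCU for the table
    """
    if period_s <= 0 or provisioned_per_s <= 0:
        raise ValueError("period and provisioned capacity must be positive")
    per_second = consumed_sum / period_s
    return per_second / provisioned_per_s


# 13,500 read units consumed in a 5-minute period is 45 units per second;
# against 50 provisioned RCU that is 90% utilization.
print(f"{capacity_utilization(13_500, 300, 50):.0%}")
```

An alarm at roughly 80% sustained utilization leaves room to raise capacity before throttling starts; the exact threshold is a judgment call for your traffic pattern.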
Here’s an example alarm for DynamoDB Write Throttling:
aws cloudwatch put-metric-alarm \
--alarm-name "DynamoDBWriteThrottling-MyAppTable" \
--alarm-description "Alarm when WriteThrottleEvents exceed 0 for 5 minutes" \
--metric-name WriteThrottleEvents \
--namespace AWS/DynamoDB \
--statistic Sum \
--period 300 \
--threshold 0 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=TableName,Value=YOUR_DYNAMODB_TABLE_NAME \
--evaluation-periods 1 \
--datapoints-to-alarm 1 \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:YOUR_SNS_TOPIC_ARN
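For latency, you’d typically alarm on a high percentile such as P90 or P99 rather than the average. Three details matter in the command: percentiles are requested with --extended-statistic (not --statistic), SuccessfulRequestLatency is published per operation, so the alarm needs an Operation dimension alongside TableName, and a 10-minute window is two 5-minute periods. The DynamoDB metrics reference documents Minimum, Maximum, Average, and SampleCount for this metric, so confirm that percentile data is actually returned for your table; if it is not, use --statistic Average instead. For example, to alarm if P90 GetItem latency exceeds 100ms for 10 minutes:
aws cloudwatch put-metric-alarm \
--alarm-name "HighReadLatency-MyAppTable" \
--alarm-description "Alarm when P90 GetItem latency exceeds 100ms for 10 minutes" \
--metric-name SuccessfulRequestLatency \
--namespace AWS/DynamoDB \
--extended-statistic p90 \
--period 300 \
--threshold 100 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=TableName,Value=YOUR_DYNAMODB_TABLE_NAME Name=Operation,Value=GetItem \
--evaluation-periods 2 \
--datapoints-to-alarm 2 \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:YOUR_SNS_TOPIC_ARN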
Implementing Application Performance Monitoring (APM)
While CloudWatch provides infrastructure and service-level metrics, Application Performance Monitoring (APM) tools offer deep visibility into your application’s code execution, database queries, and external service calls. For a PHP-based Shopify app, integrating a robust APM solution is crucial for pinpointing performance bottlenecks that CloudWatch alone cannot reveal.
Choosing and Integrating an APM Tool
Popular APM solutions include Datadog APM, New Relic, Dynatrace, and AWS X-Ray. For this example, let’s consider integrating Datadog, a widely adopted platform offering comprehensive tracing and metrics.
Datadog Integration for PHP Applications
APM for PHP is typically delivered as a PHP extension. Datadog's tracer is the dd-trace-php project, which installs an extension named ddtrace.
Installation:
- PECL: If you have PECL installed, this is the simplest method:
pecl install datadog_trace
- Manual Compilation: Download the source, compile, and install.
- Docker: If using Docker, you’ll typically install it within your Dockerfile.
Configuration:
Once installed, enable the extension in your php.ini file. Add the following lines (setting names can differ between tracer versions, so check them against the version you install):
[datadog]
extension=ddtrace.so
datadog.agent_host = 127.0.0.1
datadog.trace.agent_port = 8126
datadog.service = my-shopify-app
datadog.env = production
datadog.version = 1.0.0
datadog.trace.enabled = true
Restart or reload PHP-FPM or Apache so the changes take effect. The datadog.agent_host and datadog.trace.agent_port settings assume the Datadog Agent is running on the same host or is reachable at that address. If using ECS/EKS, the Datadog Agent might be a sidecar container or a DaemonSet.
Shopify App Specific Tracing
The dd-trace-php extension automatically instruments many common PHP functions and frameworks. For Shopify apps, you’ll want to ensure:
- Framework Instrumentation: If using Laravel, Symfony, or another framework, ensure the Datadog tracer is configured to instrument its components (e.g., routing, ORM, templating).
- HTTP Client Tracing: Crucially, trace outgoing HTTP requests made by your app to the Shopify API. The tracer should automatically pick up cURL and Guzzle requests.
- Database Query Tracing: Ensure your database interactions (e.g., with MySQL, PostgreSQL, or even DynamoDB via an ORM) are traced.
- Custom Traces: For critical business logic or specific API endpoints, you can add custom spans to provide more granular insights.
<?php
require_once '/path/to/vendor/autoload.php'; // Assuming Composer is installed

// The function to trace: fetches products from the Shopify Admin API.
function get_shopify_products(string $shopDomain, string $accessToken)
{
    $url = "https://{$shopDomain}/admin/api/2023-10/products.json";
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        "X-Shopify-Access-Token: {$accessToken}",
        "Content-Type: application/json",
    ]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($response === false || $httpCode !== 200) {
        error_log("Shopify API Error: HTTP {$httpCode}");
        return false;
    }
    return json_decode($response, true);
}

// Wrap the function in a custom span. The closure runs after each call and
// receives the span, the call arguments, and the return value.
// The error.* tag names follow Datadog's span tag conventions; verify them
// against the tracer version you run.
\DDTrace\trace_function('get_shopify_products', function (\DDTrace\SpanData $span, array $args, $retval) {
    $span->name = 'shopify.api.get_products';
    $span->resource = $args[0]; // shop domain
    if ($retval === false) {
        $span->meta['error.message'] = 'Shopify API request failed';
        $span->meta['error.type'] = 'ShopifyApiError';
    }
});

// In your controller or service:
$products = get_shopify_products('my-shop.myshopify.com', 'your_token');
?>
This custom span will appear in Datadog, allowing you to see the duration, success/failure, and associated metadata of this specific Shopify API call.
Log Aggregation and Analysis
Centralized logging is indispensable for debugging production issues, auditing, and understanding application behavior. For a distributed system like a Shopify app on AWS, logs from various instances and services must be aggregated into a single, searchable location.
Leveraging CloudWatch Logs
CloudWatch Logs is the native AWS solution for log aggregation. It’s well-integrated with other AWS services and provides features for log retention, filtering, and metric-based alarms.
Configuring Log Forwarding
EC2: Install the CloudWatch Agent and configure it to tail your application log files. The agent can then stream these logs to CloudWatch Logs log groups.
{
  "agent": {
    "metrics_collection_interval": 60
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my-app/app.log",
            "log_group_name": "/aws/my-shopify-app/app-logs",
            "log_stream_name": "{instance_id}/app"
          },
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/aws/my-shopify-app/nginx-access",
            "log_stream_name": "{instance_id}/nginx-access"
          }
        ]
      }
    }
  }
}
ECS/EKS: As mentioned earlier, configure your container definitions (ECS) or DaemonSets (EKS with Fluentd/Fluent Bit) to use the awslogs log driver or forward logs to CloudWatch Logs.
Log Analysis and Alerting
Once logs are in CloudWatch Logs, you can:
- Create Metric Filters: Extract numerical data from log events to create CloudWatch Metrics. This is powerful for alerting on specific error patterns. For example, to count occurrences of the exact phrase “FATAL ERROR” in your application logs (the inner double quotes make the pattern match the phrase rather than the two words independently):
aws logs put-metric-filter \
--log-group-name "/aws/my-shopify-app/app-logs" \
--filter-name "FatalErrorCount" \
--filter-pattern '"FATAL ERROR"' \
--metric-transformations metricName=FatalErrors,metricNamespace=MyApp/Logs,metricValue=1,defaultValue=0
You can then create a CloudWatch Alarm on the FatalErrors metric.
- Create Log Metric Filters for DynamoDB Errors: If your application logs DynamoDB errors or throttling events, you can create metric filters to capture these and alarm on them.
- Use CloudWatch Logs Insights: For ad-hoc querying and analysis, Logs Insights provides a powerful query language to search across your log data.
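For ad-hoc analysis, a Logs Insights query can also be run from a script. The sketch below assumes the log group from the agent configuration above and a boto3 CloudWatch Logs client passed in by the caller; `run_insights_query` and the query constant are illustrative names. It starts a query, polls until it finishes, and flattens each result row into a dictionary.

```python
import time

# Count log lines containing "FATAL ERROR" in 5-minute buckets.
FATAL_ERRORS_PER_5M = (
    "fields @timestamp, @message"
    " | filter @message like /FATAL ERROR/"
    " | stats count() as fatal_errors by bin(5m)"
)


def run_insights_query(logs_client, log_group, query, start_ts, end_ts,
                       poll_s=1.0, max_polls=30):
    """Run a Logs Insights query and return rows as a list of dicts.

    start_ts and end_ts are Unix timestamps in seconds.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_ts,
        endTime=end_ts,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"Logs Insights query ended with status {status}")
        time.sleep(poll_s)
    raise TimeoutError("Logs Insights query did not finish in time")
```

With boto3 installed, a call might look like `run_insights_query(boto3.client("logs"), "/aws/my-shopify-app/app-logs", FATAL_ERRORS_PER_5M, start, end)`, where `start` and `end` bound the time range to search.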
Alternative: Centralized Logging with Elasticsearch/OpenSearch
For very large-scale applications or when advanced search capabilities are required, consider a dedicated log aggregation platform like Amazon OpenSearch Service (formerly Amazon Elasticsearch Service). This typically involves using Fluentd or Fluent Bit as log forwarders from your instances/containers to your OpenSearch cluster.
Proactive Health Checks and Synthetic Monitoring
Beyond reactive monitoring (alarms triggered by metrics or logs), proactive health checks and synthetic monitoring ensure your application is not only running but also performing its critical functions from an end-user perspective.
Application Health Endpoints
Implement a dedicated health check endpoint in your Shopify app (e.g., /health or /status). This endpoint should:
- Check the status of critical dependencies: Database connections, cache services, external API reachability (e.g., a quick call to a non-critical Shopify API endpoint).
- Return a 200 OK status code if all checks pass, and a non-2xx status code (e.g., 503 Service Unavailable) if any check fails.
- Optionally, return a JSON payload with details about the status of each dependency.
<?php
// Example for a simple PHP/Slim Framework health check.
// Assumes $app (the Slim app) and $pdo (a PDO instance) are created during bootstrap.
use Psr\Http\Message\ResponseInterface as Response;
use Psr\Http\Message\ServerRequestInterface as Request;

$app->get('/health', function (Request $request, Response $response, array $args) use ($pdo) {
    $dependencies = [
        'database' => checkDatabaseConnection($pdo),
        'shopify_api' => checkShopifyApiReachability(),
        // Add other critical dependencies
    ];

    // Healthy only if every dependency check returned true
    $allHealthy = !in_array(false, $dependencies, true);

    $response->getBody()->write(json_encode([
        'status' => $allHealthy ? 'ok' : 'degraded',
        'dependencies' => $dependencies,
    ]));
    return $response
        ->withHeader('Content-Type', 'application/json')
        ->withStatus($allHealthy ? 200 : 503);
});

function checkDatabaseConnection(\PDO $pdo): bool
{
    // PDO has no ping method, so run a trivial query instead.
    // Relies on PDO::ERRMODE_EXCEPTION (the default since PHP 8.0).
    try {
        $pdo->query('SELECT 1');
        return true;
    } catch (\PDOException $e) {
        error_log('Health check: database unreachable: ' . $e->getMessage());
        return false;
    }
}

function checkShopifyApiReachability(): bool
{
    // Lightweight, unauthenticated check with a short timeout.
    // This endpoint requires auth, so a 401 is expected here; any response
    // below 500 still shows that the API is reachable.
    $ch = curl_init("https://your-shop.myshopify.com/admin/api/2023-10/shop.json");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5); // Short timeout
    curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $httpCode > 0 && $httpCode < 500;
}
?>
Synthetic Monitoring with CloudWatch Synthetics
CloudWatch Synthetics allows you to create "canaries" – configurable scripts that run on a schedule to simulate user interactions. For your Shopify app, you can create canaries to:
- Hit the Health Endpoint: Regularly poll your application's health check endpoint.
- Simulate User Flows: For critical user journeys (e.g., product search, adding to cart, checkout initiation), create more complex canaries using Node.js or Python scripts.
- Test API Endpoints: Directly test key API endpoints your app relies on, including Shopify's API.
Configure alarms on canary failures or increased latency. This provides an external perspective on your application's availability and performance, independent of internal metrics.
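The logic a health-endpoint canary applies is small: request the URL, record the status and the elapsed time, and fail if either is out of bounds. The standalone sketch below illustrates that logic with the Python standard library only; it is not a CloudWatch Synthetics script (those use the Synthetics runtime libraries), and the URL and latency budget are placeholders.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    status: int        # HTTP status code, or 0 if no response was received
    elapsed_ms: float  # wall-clock time for the request


def probe(url: str, timeout_s: float = 5.0) -> ProbeResult:
    """Issue one GET request and record status and latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code          # server answered with a 4xx/5xx
    except (urllib.error.URLError, OSError):
        status = 0                 # DNS failure, refused connection, timeout
    return ProbeResult(status, (time.monotonic() - start) * 1000)


def is_healthy(result: ProbeResult, max_latency_ms: float = 2000.0) -> bool:
    """Pass only on a 2xx response delivered within the latency budget."""
    return 200 <= result.status < 300 and result.elapsed_ms <= max_latency_ms


if __name__ == "__main__":
    # Placeholder URL; point this at your own /health endpoint.
    result = probe("https://app.example.com/health")
    print(result, "healthy" if is_healthy(result) else "unhealthy")
```

Treating slow successes as failures matters here: a health endpoint that returns 200 after four seconds usually signals trouble that a status-only check would miss.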
Automated Remediation and Incident Response
The ultimate goal of monitoring is to maintain uptime. This involves not just detecting issues but also responding to them effectively. Automation plays a key role here.
Leveraging AWS Lambda for Automated Actions
AWS Lambda functions can be triggered by CloudWatch Alarms via SNS topics. This allows for automated remediation actions:
- Auto-Scaling: If CPU or memory utilization is consistently high, trigger an Auto Scaling action to add more instances.
- Restarting Services: For transient issues, a Lambda function could attempt to restart a specific service or container.
- Draining Traffic: If an instance is unhealthy, a Lambda function can remove it from the load balancer's target group.
- Notifying On-Call Engineers: While SNS can do this directly, Lambda can enrich notifications or route them to specific on-call systems (PagerDuty, Opsgenie) via their APIs.
import json

import boto3

autoscaling = boto3.client('autoscaling')
sns = boto3.client('sns')

ASG_NAME = "your-app-autoscaling-group-name"  # Replace with your ASG name


def lambda_handler(event, context):
    # Alarms delivered through SNS arrive as a JSON string in
    # Records[].Sns.Message (see the example message below).
    for record in event['Records']:
        alarm = json.loads(record['Sns']['Message'])
        alarm_name = alarm['AlarmName']
        new_state_value = alarm['NewStateValue']
        trigger = alarm.get('Trigger', {})
        metric_name = trigger.get('MetricName', '')
        dimensions = trigger.get('Dimensions', [])

        print(f"Alarm '{alarm_name}' changed state to '{new_state_value}'")

        if new_state_value != 'ALARM':
            continue

        # Dimensions is a list of {"name": ..., "value": ...} objects
        instance_id = next(
            (d['value'] for d in dimensions if d['name'] == 'InstanceId'), None
        )
        if metric_name == 'CPUUtilization' and instance_id:
            print(f"High CPU detected on {instance_id}. Considering scaling up.")
            # In a real scenario, you'd likely trigger an Auto Scaling policy
            # or notify a human operator. Directly modifying ASG capacity here
            # can be risky without proper safeguards.
            scale_up(ASG_NAME)

        # Add more remediation logic for other alarms (e.g., memory, throttling)

        # Optionally, send a notification to a different SNS topic for processed events
        # sns.publish(
        #     TopicArn='arn:aws:sns:us-east-1:123456789012:ProcessedAlarmsTopic',
        #     Message=json.dumps(alarm),
        #     Subject=f"Processed Alarm: {alarm_name} - {new_state_value}"
        # )

    return {
        'statusCode': 200,
        'body': json.dumps('Remediation process initiated.')
    }


def scale_up(asg_name):
    """Add one instance to the ASG without exceeding its MaxSize."""
    try:
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name]
        )['AutoScalingGroups'][0]
        desired = min(group['DesiredCapacity'] + 1, group['MaxSize'])
        if desired == group['DesiredCapacity']:
            print(f"{asg_name} is already at MaxSize; not scaling.")
            return
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=asg_name,
            DesiredCapacity=desired
        )
        print(f"Increased desired capacity for {asg_name} to {desired}.")
    except Exception as e:
        print(f"Error scaling ASG: {e}")


# Example SNS message body for a CloudWatch alarm (simplified)
# {
#   "AlarmName": "HighCPUUtilization-MyApp",
#   "AlarmDescription": "Alarm when CPU exceeds 80% for 15 minutes",
#   "AWSAccountId": "123456789012",
#   "NewStateValue": "ALARM",
#   "OldStateValue": "OK",
#   "StateChangeTime": "...",
#   "Trigger": {
#     "MetricName": "CPUUtilization",
#     "Namespace": "AWS/EC2",
#     "Dimensions": [{"name": "InstanceId", "value": "i-0123456789abcdef0"}],
#     "Period": 300,
#     "EvaluationPeriods": 3,
#     "ComparisonOperator": "GreaterThanOrEqualToThreshold",
#     "Threshold": 80.0
#   }
# }
Incident Management Playbooks
Even with automation, human intervention is often required. Develop clear incident management playbooks that outline:
- Roles and Responsibilities: Who is on call? Who is the incident commander?
- Escalation Paths: When and how to escalate to senior engineers or external teams.
- Communication Channels: How to communicate status internally and externally (e.g., status page).
- Runbooks: Step-by-step guides for diagnosing and resolving common incidents (e.g., "DynamoDB Throttling," "High Application Latency").
Integrate your monitoring alerts with your incident management platform (e.g., PagerDuty, Opsgenie) to ensure timely notifications and efficient response.