How We Audited a High-Traffic Shopify Enterprise Stack on DigitalOcean and Mitigated Race Conditions During High-Concurrency Payment Processing
Deep Dive: Auditing a High-Traffic Shopify Enterprise Stack on DigitalOcean
Our engagement involved a high-traffic Shopify Plus enterprise deployment hosted on DigitalOcean. The primary concern was a series of intermittent but critical race conditions during peak payment processing periods. These led to duplicate orders, failed transactions, and significant customer dissatisfaction. The stack comprised a custom Shopify theme, several third-party Shopify apps, a backend API layer built with Node.js and Express, a PostgreSQL database, and a Redis cache, all orchestrated in a Kubernetes cluster on DigitalOcean Kubernetes (DOKS).
Identifying the Root Cause: Race Conditions in Payment Processing Logic
The initial investigation focused on the payment gateway integration and the Shopify webhook handlers. We observed that during high concurrency, multiple webhook events for the same order could be processed almost simultaneously. The critical section of code involved updating the order status and initiating fulfillment requests. A typical flow looked like this:
- Shopify webhook `orders/paid` is received.
- Backend API retrieves order details from Shopify.
- Backend API checks if payment has already been processed (e.g., by querying its internal order status table).
- If not processed, the API marks the order as paid, initiates fulfillment, and updates the internal status.
- If processed, the webhook is ignored.
The race condition occurred when two `orders/paid` webhooks for the same order arrived concurrently. Both instances of the API handler would execute the check for payment status. If the database update hadn’t yet been committed by the first instance, the second instance would also see the order as unpaid and proceed to process the payment again, leading to duplicate fulfillment requests or incorrect inventory deductions.
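The interleaving described above can be reproduced in a few lines. The following is a minimal sketch rather than the production handler: the in-memory `orders` map stands in for the internal order status table, and `handleOrderPaid`, `delay`, and the order ID are hypothetical names chosen for illustration.

```javascript
// In-memory stand-in for the internal order status table (hypothetical).
const orders = new Map([['1234567890', { status: 'pending' }]]);
const fulfillments = [];

// Simulates a database round trip so that two handlers can interleave.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Naive check-then-act handler: the read and the write are separate steps.
async function handleOrderPaid(orderId) {
  const statusSeen = orders.get(orderId).status; // read current status
  await delay(10); // latency between the read and the decision
  if (statusSeen === 'paid') {
    return 'ignored';
  }
  await delay(10); // latency before the write is committed
  orders.get(orderId).status = 'paid';
  fulfillments.push(orderId); // side effect that must happen only once
  return 'processed';
}

// Two deliveries for the same order arrive at nearly the same time.
const raceResult = Promise.all([
  handleOrderPaid('1234567890'),
  handleOrderPaid('1234567890'),
]).then((outcomes) => {
  console.log(outcomes, 'fulfillment requests:', fulfillments.length);
  return { outcomes, fulfillmentCount: fulfillments.length };
});
```

Both calls read the status before either has written it, so both report `processed` and two fulfillment requests are created for one order.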
Diagnostic Workflow and Tooling
Our diagnostic approach involved a multi-pronged strategy:
1. Log Analysis and Correlation
We leveraged a centralized logging system (ELK stack: Elasticsearch, Logstash, Kibana) to aggregate logs from all microservices and the Kubernetes cluster. Key metrics and log patterns we searched for included:
- Timestamps of incoming `orders/paid` webhooks.
- API request/response times for order status checks and updates.
- Database transaction logs, specifically for order status modifications.
- Any errors or exceptions related to payment processing or order updates.
- Correlation IDs, which were crucial for tracing a single order’s lifecycle across different services.
A sample log snippet illustrating the problem:
{
  "timestamp": "2023-10-27T10:30:01.123Z",
  "level": "info",
  "message": "Received Shopify webhook",
  "webhook_topic": "orders/paid",
  "order_id": "gid://shopify/Order/1234567890",
  "correlation_id": "abc-123"
}
{
  "timestamp": "2023-10-27T10:30:01.150Z",
  "level": "info",
  "message": "Processing payment for order",
  "order_id": "gid://shopify/Order/1234567890",
  "correlation_id": "abc-123"
}
{
  "timestamp": "2023-10-27T10:30:01.160Z",
  "level": "info",
  "message": "Received Shopify webhook",
  "webhook_topic": "orders/paid",
  "order_id": "gid://shopify/Order/1234567890",
  "correlation_id": "def-456"
}
{
  "timestamp": "2023-10-27T10:30:01.180Z",
  "level": "info",
  "message": "Order status check: Not processed",
  "order_id": "gid://shopify/Order/1234567890",
  "correlation_id": "def-456"
}
{
  "timestamp": "2023-10-27T10:30:01.200Z",
  "level": "info",
  "message": "Processing payment for order",
  "order_id": "gid://shopify/Order/1234567890",
  "correlation_id": "def-456"
}
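A pattern like the one in these entries can also be detected programmatically. The sketch below assumes the log entries have already been parsed into objects with the same fields as the sample; `findDuplicateProcessing` is a hypothetical helper, not part of the ELK tooling.

```javascript
// Parsed log entries with the same fields as the sample (trimmed to what the check needs).
const logEntries = [
  { message: 'Received Shopify webhook', order_id: 'gid://shopify/Order/1234567890', correlation_id: 'abc-123' },
  { message: 'Processing payment for order', order_id: 'gid://shopify/Order/1234567890', correlation_id: 'abc-123' },
  { message: 'Received Shopify webhook', order_id: 'gid://shopify/Order/1234567890', correlation_id: 'def-456' },
  { message: 'Order status check: Not processed', order_id: 'gid://shopify/Order/1234567890', correlation_id: 'def-456' },
  { message: 'Processing payment for order', order_id: 'gid://shopify/Order/1234567890', correlation_id: 'def-456' },
];

// Returns the orders whose payment was processed under more than one correlation ID.
function findDuplicateProcessing(entries) {
  const correlationIdsByOrder = new Map();
  for (const entry of entries) {
    if (entry.message !== 'Processing payment for order') continue;
    const ids = correlationIdsByOrder.get(entry.order_id) || new Set();
    ids.add(entry.correlation_id);
    correlationIdsByOrder.set(entry.order_id, ids);
  }
  return [...correlationIdsByOrder]
    .filter(([, ids]) => ids.size > 1)
    .map(([orderId, ids]) => ({ orderId, correlationIds: [...ids] }));
}

const duplicates = findDuplicateProcessing(logEntries);
console.log(JSON.stringify(duplicates, null, 2));
```

For the sample entries this reports a single order, processed under both `abc-123` and `def-456`.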
2. Application Performance Monitoring (APM)
We integrated Datadog APM to gain deep visibility into the Node.js application’s performance. This allowed us to:
- Identify slow database queries.
- Trace requests across microservices.
- Pinpoint the exact lines of code responsible for the contention.
- Monitor the latency of Shopify API calls.
3. Database-Level Analysis
For PostgreSQL, we enabled detailed logging and used `pg_stat_statements` to find the most expensive statements, together with the `pg_stat_activity` and `pg_locks` views to spot long-running transactions and lock contention. We also analyzed the query plans for the order status update operations.
-- Example query to find slow queries
-- (on PostgreSQL 13 and later the column is total_exec_time rather than total_time)
SELECT
    query,
    calls,
    total_time,
    rows
FROM
    pg_stat_statements
ORDER BY
    total_time DESC
LIMIT 10;
Mitigation Strategy: Implementing Atomic Operations and Idempotency
The core of the solution was to ensure that the critical section of updating order status and processing payments was atomic and idempotent. We explored several approaches:
1. Database-Level Locking (Optimistic and Pessimistic)
Initially, we considered pessimistic locking (e.g., `SELECT … FOR UPDATE`) on the order record in our internal database. However, this can lead to deadlocks and reduced throughput under high concurrency. Instead, we opted for an optimistic locking strategy combined with a unique transaction identifier.
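As an illustration of the optimistic approach, the sketch below simulates a version-checked update in memory. The `version` column, the `updateIfVersionMatches` helper, and the SQL in the comment are assumptions made for this example; the actual schema is not shown here.

```javascript
// In-memory stand-in for one row of the internal orders table (hypothetical schema).
const orderRow = { id: '1234567890', status: 'pending', version: 1 };

// Stand-in for a statement such as:
//   UPDATE orders SET status = $1, version = version + 1
//   WHERE id = $2 AND version = $3
// Returns the number of rows affected (0 means another writer got there first).
function updateIfVersionMatches(row, expectedVersion, newStatus) {
  if (row.version !== expectedVersion) {
    return 0;
  }
  row.status = newStatus;
  row.version += 1;
  return 1;
}

// Two handlers read the same version before either one writes.
const versionSeenByA = orderRow.version;
const versionSeenByB = orderRow.version;

const rowsUpdatedByA = updateIfVersionMatches(orderRow, versionSeenByA, 'paid');
const rowsUpdatedByB = updateIfVersionMatches(orderRow, versionSeenByB, 'paid');

console.log({ rowsUpdatedByA, rowsUpdatedByB, finalVersion: orderRow.version });
```

The second writer sees zero affected rows and can treat the order as already handled, without either writer holding a lock while it works.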
2. Idempotent Webhook Processing with Unique Transaction IDs
The most effective solution involved making the webhook handler idempotent. Shopify includes an `X-Shopify-Webhook-Id` header on each webhook request, which identifies that webhook. We stored this ID alongside the order processing record in our database. Before processing any payment-related action, we checked whether a record with that `X-Shopify-Webhook-Id` already existed.
Here’s a conceptual Node.js snippet demonstrating this:
const express = require('express');
const bodyParser = require('body-parser');
const crypto = require('crypto'); // For HMAC verification (essential for security)
const db = require('./db'); // Your database connection module

const app = express();

// Parse JSON and keep the raw bytes: the HMAC must be computed over the exact
// payload Shopify sent, not over the re-serialized object.
app.use(bodyParser.json({
  verify: (req, res, buf) => {
    req.rawBody = buf;
  },
}));

// Middleware for HMAC verification (crucial for security)
const verifyShopifyWebhook = (req, res, next) => {
  const hmacHeader = req.headers['x-shopify-hmac-sha256'];
  const sharedSecret = process.env.SHOPIFY_SHARED_SECRET; // Load from environment variables
  if (!hmacHeader || !sharedSecret || !req.rawBody) {
    return res.status(400).send('HMAC header, shared secret, or request body missing.');
  }
  const expected = crypto.createHmac('sha256', sharedSecret).update(req.rawBody).digest();
  const received = Buffer.from(hmacHeader, 'base64');
  // timingSafeEqual throws if the lengths differ, so check the length first
  if (received.length !== expected.length || !crypto.timingSafeEqual(received, expected)) {
    return res.status(401).send('HMAC verification failed.');
  }
  next();
};

// Endpoint for orders/paid webhooks
app.post('/webhooks/orders/paid', verifyShopifyWebhook, async (req, res) => {
  const webhookId = req.headers['x-shopify-webhook-id'];
  const orderData = req.body; // Parsed JSON body
  const orderId = orderData.id; // Shopify's internal order ID
  if (!webhookId) {
    return res.status(400).send('Webhook ID header missing.');
  }
  let inTransaction = false;
  try {
    // 1. Fast path: skip webhooks that have already been recorded.
    //    Two concurrent requests can both pass this check; the PRIMARY KEY on
    //    webhook_id then makes the second recordWebhook() call fail and roll back.
    const existingWebhook = await db.getWebhookById(webhookId);
    if (existingWebhook) {
      console.log(`Webhook ${webhookId} for order ${orderId} already processed. Skipping.`);
      return res.status(200).send('Already processed.');
    }

    // 2. Start a database transaction for atomicity
    await db.beginTransaction();
    inTransaction = true;

    // 3. Check internal order status (belt-and-suspenders for deliveries with different webhook IDs).
    //    This only closes the race if getOrderById locks the row or updateOrderStatus
    //    is a conditional update; two plain reads can both see an unpaid order.
    const internalOrder = await db.getOrderById(orderId);
    if (internalOrder && internalOrder.status === 'paid') {
      console.warn(`Order ${orderId} already marked as paid internally. Recording webhook ${webhookId}.`);
      // Still record the webhook to prevent future processing if this was a race condition
      await db.recordWebhook(webhookId, orderId, 'duplicate_internal_status');
      await db.commitTransaction();
      return res.status(200).send('Already paid internally.');
    }

    // 4. Mark order as paid and initiate fulfillment
    await db.updateOrderStatus(orderId, 'paid');
    await db.createFulfillmentRequest(orderId, orderData.line_items); // Example

    // 5. Record the webhook ID to ensure idempotency
    await db.recordWebhook(webhookId, orderId, 'success');

    // 6. Commit the transaction
    await db.commitTransaction();
    console.log(`Successfully processed payment for order ${orderId} with webhook ${webhookId}.`);
    res.status(200).send('Webhook processed successfully.');
  } catch (error) {
    // Roll back only if a transaction was actually started
    if (inTransaction) {
      await db.rollbackTransaction().catch((rollbackError) => {
        console.error(`Rollback failed for webhook ${webhookId}:`, rollbackError);
      });
    }
    console.error(`Error processing webhook ${webhookId} for order ${orderId}:`, error);
    // A non-2xx response lets Shopify retry the delivery later
    res.status(500).send('Internal Server Error.');
  }
});

// Placeholder database functions (implement these in ./db.js).
// The transaction helpers are assumed to use a dedicated connection per request.
// db.getWebhookById(webhookId)
// db.getOrderById(orderId)
// db.updateOrderStatus(orderId, status)
// db.createFulfillmentRequest(orderId, lineItems)
// db.recordWebhook(webhookId, orderId, status)
// db.beginTransaction()
// db.commitTransaction()
// db.rollbackTransaction()

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Server listening on port ${PORT}`);
});
The database schema for tracking webhooks would be simple:
CREATE TABLE shopify_webhook_log (
    webhook_id VARCHAR(255) PRIMARY KEY,
    order_id VARCHAR(255) NOT NULL,
    processed_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) NOT NULL
    -- other relevant fields
);

CREATE INDEX idx_shopify_webhook_log_order_id ON shopify_webhook_log (order_id);
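Because `webhook_id` is the primary key of `shopify_webhook_log`, a given webhook can be recorded only once. If the same webhook is delivered twice, even concurrently, only the first transaction to insert its `webhook_id` commits; the other fails on the primary-key constraint and rolls back, and any later delivery is rejected by the initial lookup. Deliveries that carry different webhook IDs for the same order are caught by the internal order status check instead, provided that check and the status update cannot interleave (for example, by using a row lock or a conditional update).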
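One way to picture the role of the primary key is as an atomic "claim" on the webhook ID. The sketch below is an in-memory simulation, not the production code: the `Set` stands in for `shopify_webhook_log`, and `claimWebhook`, `handleDelivery`, and the IDs are hypothetical. In PostgreSQL the equivalent claim could be an `INSERT ... ON CONFLICT (webhook_id) DO NOTHING` whose affected-row count tells the handler whether it won.

```javascript
// Stand-in for shopify_webhook_log, where webhook_id is the primary key.
const webhookLog = new Set();
const fulfilledOrders = [];

// Simulates the latency of the work done after the claim.
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Atomic claim: returns true only for the first caller with a given webhook ID.
// Mirrors an INSERT that either adds the row or reports that it already exists.
function claimWebhook(webhookId) {
  if (webhookLog.has(webhookId)) {
    return false;
  }
  webhookLog.add(webhookId);
  return true;
}

async function handleDelivery(webhookId, orderId) {
  if (!claimWebhook(webhookId)) {
    return 'duplicate'; // another handler already owns this webhook
  }
  await pause(10); // update order status, create the fulfillment request, and so on
  fulfilledOrders.push(orderId);
  return 'processed';
}

// The same webhook is delivered twice at nearly the same time.
const claimResult = Promise.all([
  handleDelivery('wh-1', '1234567890'),
  handleDelivery('wh-1', '1234567890'),
]).then((outcomes) => {
  console.log(outcomes, 'fulfillment requests:', fulfilledOrders.length);
  return { outcomes, fulfilledCount: fulfilledOrders.length };
});
```

Because the claim happens before any side effect, the second delivery stops immediately and only one fulfillment request is created. In a real database the claim and the side effects should share one transaction, so that a failed attempt releases the claim and a retry can succeed.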
3. Transactional Integrity and Database Transactions
Within the Node.js application, we enforced transactional integrity using PostgreSQL’s transaction capabilities. All operations related to updating order status, creating fulfillment requests, and logging the webhook processing were wrapped in a single database transaction. This guarantees that either all operations succeed, or none of them do, preventing partial updates and maintaining data consistency.
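The all-or-nothing behavior can be captured in a small wrapper. This is a sketch under stated assumptions: `withTransaction` is a hypothetical helper, and `pool` is assumed to expose `connect()`, `query()`, and `release()` in the style of node-postgres. The fake pool exists only so that the example runs without a database.

```javascript
// Runs `work` inside BEGIN/COMMIT and rolls back if it throws.
async function withTransaction(pool, work) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const result = await work(client);
    await client.query('COMMIT');
    return result;
  } catch (error) {
    await client.query('ROLLBACK');
    throw error;
  } finally {
    client.release(); // always return the connection to the pool
  }
}

// Minimal fake pool that records the statements it receives (illustration only).
function createFakePool() {
  const statements = [];
  let released = 0;
  const client = {
    query: async (text) => {
      statements.push(text);
      return { rows: [] };
    },
    release: () => {
      released += 1;
    },
  };
  return { connect: async () => client, statements, releasedCount: () => released };
}

// Successful run: every statement is committed together.
const okPool = createFakePool();
const okRun = withTransaction(okPool, async (client) => {
  await client.query("UPDATE orders SET status = 'paid' WHERE id = 1");
  return 'done';
});

// Failing run: the error triggers a rollback and is passed on to the caller.
const failPool = createFakePool();
const failRun = withTransaction(failPool, async () => {
  throw new Error('fulfillment failed');
}).catch((error) => error.message);
```

Using one client per transaction matters here: statements sent through different pooled connections would not belong to the same transaction.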
Deployment and Monitoring on DigitalOcean
The updated application was deployed to the existing DigitalOcean Kubernetes (DOKS) cluster. Key considerations for deployment and ongoing monitoring included:
1. Kubernetes Configuration
We ensured that the webhook processing service had appropriate resource limits and requests defined in its Kubernetes deployment manifests to handle peak loads. Horizontal Pod Autoscaler (HPA) was configured to scale the number of pods based on CPU utilization or custom metrics (e.g., queue depth if using a message queue).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-processor
spec:
  replicas: 3 # Initial replicas
  selector:
    matchLabels:
      app: payment-processor
  template:
    metadata:
      labels:
        app: payment-processor
    spec:
      containers:
        - name: app
          image: your-docker-repo/payment-processor:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: payment-processor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-processor
  minReplicas: 3
  maxReplicas: 15
  targetCPUUtilizationPercentage: 70
2. DigitalOcean Load Balancer Configuration
The DigitalOcean Load Balancer was configured to distribute incoming webhook traffic across the available pods of the payment processing service. Health checks were essential to ensure traffic was only sent to healthy instances.
# Example DigitalOcean Load Balancer Health Check Configuration (Conceptual)
health_check:
  protocol: "HTTP"
  port: 3000
  path: "/healthz" # A simple /healthz endpoint in the app
  interval_seconds: 15
  timeout_seconds: 5
  unhealthy_threshold: 3
  healthy_threshold: 2
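The `/healthz` endpoint itself can be very small. The following sketch uses Node's built-in `http` module rather than the Express app, to stay self-contained; `createHealthHandler` and the `checkDependencies` callback are hypothetical names, and the response format is an assumption.

```javascript
const http = require('http');

// Builds a request handler that answers GET /healthz.
// checkDependencies is a hypothetical async callback (for example, a trivial
// database query); if it rejects, the instance reports itself as unavailable.
function createHealthHandler(checkDependencies) {
  return async (req, res) => {
    if (req.method !== 'GET' || req.url !== '/healthz') {
      res.writeHead(404, { 'Content-Type': 'text/plain' });
      res.end('Not found');
      return;
    }
    try {
      await checkDependencies();
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ status: 'ok' }));
    } catch (error) {
      res.writeHead(503, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ status: 'unavailable' }));
    }
  };
}

// Example wiring; the dependency check here is a no-op placeholder.
const healthServer = http.createServer(createHealthHandler(async () => {}));
// healthServer.listen(3000); // uncomment to serve the endpoint on port 3000
```

Whether the check should touch the database is a judgment call: a shared dependency failing would mark every pod unhealthy at once, so some teams keep this endpoint limited to process-level checks.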
3. Enhanced Monitoring and Alerting
Post-deployment, we intensified monitoring. Alerts were configured in Datadog for:
- Increased error rates in the payment processing service.
- High latency in database transactions related to order updates.
- Failed health checks on the payment processing pods.
- Any rejected inserts of an existing `webhook_id` into the `shopify_webhook_log` table (primary-key violations), which indicate that a duplicate delivery got past the initial idempotency check.
A specific alert was set up to monitor the rate of `duplicate_internal_status` entries in the `shopify_webhook_log` table, which would signal a potential regression or a new race condition scenario.
Conclusion and Future Considerations
By systematically auditing the stack, identifying the precise race condition in the payment processing logic, and implementing a robust idempotency mechanism using unique webhook IDs and database transactions, we successfully stabilized the high-concurrency payment flow. The use of DigitalOcean’s managed Kubernetes services and integrated monitoring tools facilitated both the diagnosis and the ongoing operational stability of the solution. Future considerations include exploring Shopify’s new Order Editing API for more granular control and potentially offloading more complex order state management to a dedicated event-driven architecture using services like Kafka or RabbitMQ for even greater resilience.