Disaster Recovery 101: Architecting Auto-Failovers for DynamoDB and Shopify Deployments on Linode

Automated Cross-Region Failover for DynamoDB with AWS Lambda and SNS

Achieving true disaster recovery for critical applications necessitates automated failover mechanisms. For applications leveraging Amazon DynamoDB, a common strategy involves replicating data to a secondary region and orchestrating a seamless switchover when the primary region becomes unavailable. This section details a robust, automated failover architecture using AWS Lambda, Amazon SNS, and DynamoDB Global Tables.

DynamoDB Global Tables replicate data across regions, and every replica accepts reads and writes. However, nothing redirects your application when a region fails: clients keep calling whichever regional endpoint they were configured with. Our approach augments Global Tables with a detection and failover system.

Architecture Overview

The core components of this automated failover system are:

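DynamoDB Global Tables: Configured for multi-region replication (e.g., us-east-1 and us-west-2). In the default configuration, replication is asynchronous, so replicas are eventually consistent (usually within about a second) and concurrent writes to the same item are reconciled with last-writer-wins. Writes that had not yet replicated when the primary failed may be missing from the secondary.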
AWS Lambda Function (Health Checker): Periodically polls a critical DynamoDB table in the primary region. If the table is unresponsive or exhibits high latency beyond a threshold, it triggers the failover process.
Amazon SNS Topic: Receives notifications from the Lambda health checker. This topic will fan out alerts to various endpoints, including a Lambda function responsible for initiating the failover.
AWS Lambda Function (Failover Orchestrator): Subscribed to the SNS topic. This function is responsible for updating application configurations (e.g., DNS records, API endpoints) to point to the secondary region’s DynamoDB endpoint.
Amazon Route 53 (or equivalent DNS provider): Used to manage DNS records that direct application traffic to the appropriate regional endpoint.

Implementing the Health Checker Lambda

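This Lambda function periodically checks the health of a specific DynamoDB table. The walkthrough deploys it in the primary region, but a checker that lives in the same region as the table can go down with it during a regional outage. Running it (or a second copy) in another region, with the DynamoDB client pinned to the primary region, avoids that blind spot. We’ll use Python for its ease of use with the AWS SDK (Boto3).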

Prerequisites:

An IAM role with permissions for dynamodb:GetItem (or a similar read operation on your critical table) and sns:Publish.
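A DynamoDB table in the primary region, containing a health check item whose partition key id matches HEALTH_CHECK_ITEM_KEY and whose status attribute matches HEALTH_CHECK_ITEM_VALUE.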
An SNS topic ARN.
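
The health checker below reads one item and compares its status attribute, so that item has to exist first. Here is a minimal sketch for seeding it, assuming the table's partition key is a string attribute named id (the same assumption the function code makes). The helper names build_health_check_item and seed_health_check_item are hypothetical.

```python
def build_health_check_item(key_value="health_check_id", status="ok"):
    """Return the health check item in DynamoDB's low-level attribute-value format."""
    return {
        "id": {"S": key_value},   # assumes a string partition key named 'id'
        "status": {"S": status},  # the attribute the health checker compares
    }


def seed_health_check_item(table_name, region_name):
    """Write the item once; with Global Tables it replicates to the other regions."""
    import boto3  # imported lazily so the builder above works without the SDK installed

    client = boto3.client("dynamodb", region_name=region_name)
    client.put_item(TableName=table_name, Item=build_health_check_item())
```

Calling seed_health_check_item with your table name and primary region writes the item. The defaults match the defaults of HEALTH_CHECK_ITEM_KEY and HEALTH_CHECK_ITEM_VALUE.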

Lambda Function Code (Python):

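import boto3
import os
import json
import time
from botocore.config import Config

# Environment Variables
PRIMARY_REGION_DDB_TABLE = os.environ['PRIMARY_REGION_DDB_TABLE']
SNS_TOPIC_ARN = os.environ['SNS_TOPIC_ARN']
HEALTH_CHECK_ITEM_KEY = os.environ.get('HEALTH_CHECK_ITEM_KEY', 'health_check_id') # Partition key value of the health check item
HEALTH_CHECK_ITEM_VALUE = os.environ.get('HEALTH_CHECK_ITEM_VALUE', 'ok') # Expected value of the item's 'status' attribute
MAX_LATENCY_MS = int(os.environ.get('MAX_LATENCY_MS', 500)) # Default to 500ms

# Short timeouts and no retries, so an unresponsive table fails the check quickly
# instead of outliving the Lambda timeout (3 seconds by default) before the
# notification can be published. The 1-second values are illustrative.
ddb_config = Config(connect_timeout=1, read_timeout=1, retries={'max_attempts': 0})
dynamodb = boto3.client('dynamodb', config=ddb_config)
sns = boto3.client('sns')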

def lambda_handler(event, context):
    start_time = time.time()
    try:
        # Perform a simple read operation to check latency and availability
        response = dynamodb.get_item(
            TableName=PRIMARY_REGION_DDB_TABLE,
            Key={
                'id': {'S': HEALTH_CHECK_ITEM_KEY} # Assuming 'id' is your partition key
            }
        )

        # Check if the expected item is returned (optional but good practice)
        if 'Item' not in response or response['Item'].get('status', {}).get('S') != HEALTH_CHECK_ITEM_VALUE:
            raise Exception("Health check item missing or has unexpected status.")

        end_time = time.time()
        latency_ms = (end_time - start_time) * 1000

        print(f"DynamoDB health check successful. Latency: {latency_ms:.2f}ms")

        if latency_ms > MAX_LATENCY_MS:
            print(f"Latency {latency_ms:.2f}ms exceeds threshold of {MAX_LATENCY_MS}ms. Triggering failover.")
            publish_failover_notification(f"DynamoDB latency exceeded threshold ({latency_ms:.2f}ms).")
            return {
                'statusCode': 503,
                'body': json.dumps('Service Unavailable due to high latency')
            }

        return {
            'statusCode': 200,
            'body': json.dumps('DynamoDB health check passed')
        }

    except Exception as e:
        print(f"DynamoDB health check failed: {e}")
        publish_failover_notification(f"DynamoDB health check failed: {str(e)}")
        return {
            'statusCode': 503,
            'body': json.dumps('Service Unavailable due to DynamoDB error')
        }

def publish_failover_notification(message):
    try:
        sns.publish(
            TopicArn=SNS_TOPIC_ARN,
            Message=message,
            Subject='DynamoDB Primary Region Unhealthy - Initiating Failover'
        )
        print(f"Published notification to SNS topic: {SNS_TOPIC_ARN}")
    except Exception as e:
        print(f"Failed to publish to SNS: {e}")

Configuration:

Set the following environment variables in your Lambda function:

PRIMARY_REGION_DDB_TABLE: The name of your DynamoDB table in the primary region.
SNS_TOPIC_ARN: The ARN of the SNS topic to publish notifications to.
HEALTH_CHECK_ITEM_KEY (Optional): The partition key value used for the health check item. Defaults to ‘health_check_id’.
HEALTH_CHECK_ITEM_VALUE (Optional): The expected status value of the health check item. Defaults to ‘ok’.
MAX_LATENCY_MS (Optional): The maximum acceptable latency in milliseconds. Defaults to 500ms.

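Trigger: Configure an Amazon EventBridge (formerly CloudWatch Events) rule to trigger this Lambda function on a schedule (e.g., every 1 minute, using the schedule expression rate(1 minute)). Set the function timeout high enough to cover both the DynamoDB read and the SNS publish.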
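
As written, a single slow or failed read publishes a failover notification. A transient blip is not a regional outage, so it may be worth requiring several consecutive failures first. The sketch below shows only the decision logic. FAILURE_THRESHOLD and should_trigger_failover are hypothetical names, and where the check history is persisted between invocations is left open.

```python
from collections import deque

FAILURE_THRESHOLD = 3  # hypothetical: consecutive failed checks required before failover


def should_trigger_failover(recent_results, threshold=FAILURE_THRESHOLD):
    """Return True only if the last `threshold` checks all failed.

    `recent_results` is an iterable of booleans, oldest first, where True
    means the check passed.
    """
    window = list(recent_results)[-threshold:]
    return len(window) == threshold and not any(window)


# Example: keep a bounded history of check outcomes.
history = deque(maxlen=FAILURE_THRESHOLD)
for passed in [True, False, False]:
    history.append(passed)
print(should_trigger_failover(history))  # False: only two consecutive failures so far
history.append(False)
print(should_trigger_failover(history))  # True: three failures in a row
```

Lambda invocations do not reliably share memory, so in practice the history would need to live outside the function, ideally somewhere that does not depend on the primary region.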

Implementing the Failover Orchestrator Lambda

This Lambda function will be subscribed to the SNS topic. Upon receiving a notification, it will execute the failover logic. This typically involves updating DNS records to point to the secondary region’s infrastructure.

Prerequisites:

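An IAM role with permissions for route53:ListResourceRecordSets and route53:ChangeResourceRecordSets (if using Route 53), plus permissions for any other services used to update application configuration (e.g., ssm:PutParameter if you store the active endpoint in Parameter Store).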
An SNS topic (the same one used by the health checker).
A Route 53 hosted zone and record set for your application’s endpoint.
The secondary region’s DynamoDB endpoint (if not using Global Tables, or for specific configurations).

Lambda Function Code (Python):

import boto3
import json
import os

# Environment Variables
PRIMARY_REGION_HOSTED_ZONE_ID = os.environ['PRIMARY_REGION_HOSTED_ZONE_ID'] # Route 53 Hosted Zone ID for primary
SECONDARY_REGION_HOSTED_ZONE_ID = os.environ['SECONDARY_REGION_HOSTED_ZONE_ID'] # Route 53 Hosted Zone ID for secondary
APPLICATION_DNS_NAME = os.environ['APPLICATION_DNS_NAME'] # e.g., app.yourdomain.com
PRIMARY_REGION_NAME = os.environ.get('PRIMARY_REGION_NAME', 'us-east-1')
SECONDARY_REGION_NAME = os.environ.get('SECONDARY_REGION_NAME', 'us-west-2')

route53 = boto3.client('route53')

def lambda_handler(event, context):
    print(f"Received event: {json.dumps(event)}")

    # Extract message from SNS event
    message = event['Records'][0]['Sns']['Message']
    print(f"SNS Message: {message}")

    # Basic check to prevent re-triggering if already failed over.
    # Records with a simple routing policy have no SetIdentifier, so use .get()
    # rather than indexing, which would raise KeyError.
    current_record = get_current_dns_record()
    if current_record and 'secondary' in current_record.get('SetIdentifier', '').lower():
        print("Failover already in progress or completed. Skipping.")
        return {'statusCode': 200, 'body': 'Already failed over.'}

    try:
        print("Initiating failover to secondary region...")
        failover_to_secondary()
        print("Failover initiated successfully.")
        return {
            'statusCode': 200,
            'body': json.dumps('Failover initiated.')
        }
    except Exception as e:
        print(f"Error during failover: {e}")
        # Consider sending an alert here if failover fails
        return {
            'statusCode': 500,
            'body': json.dumps(f'Failover failed: {str(e)}')
        }

def get_current_dns_record():
    """
    Retrieves the current DNS record for the application.
    Assumes a weighted or failover routing policy with SetIdentifier.
    """
    try:
        response = route53.list_resource_record_sets(
            HostedZoneId=PRIMARY_REGION_HOSTED_ZONE_ID, # Check primary zone first
            StartRecordName=APPLICATION_DNS_NAME,
            StartRecordType='A',
            MaxItems='1'
        )
        if response['ResourceRecordSets']:
            record = response['ResourceRecordSets'][0]
            if record['Name'] == APPLICATION_DNS_NAME + '.': # Route 53 appends a dot
                return record
        
        # If not found in primary, check secondary (this logic might need refinement based on your setup)
        response = route53.list_resource_record_sets(
            HostedZoneId=SECONDARY_REGION_HOSTED_ZONE_ID,
            StartRecordName=APPLICATION_DNS_NAME,
            StartRecordType='A',
            MaxItems='1'
        )
        if response['ResourceRecordSets']:
            record = response['ResourceRecordSets'][0]
            if record['Name'] == APPLICATION_DNS_NAME + '.':
                return record

        return None
    except Exception as e:
        print(f"Error retrieving DNS record: {e}")
        return None

def failover_to_secondary():
    """
    Updates DNS to point to the secondary region.
    This is a simplified placeholder: the actual change depends on your
    Route 53 routing policy (simple, weighted, failover, latency-based).
    Endpoints behind ELB or CloudFront would need those resources updated instead.
    """
    print(f"Simulating DNS update to point {APPLICATION_DNS_NAME} to secondary region.")

    # In a real scenario, you would use the route53.change_resource_record_sets API call.
    # Example for weighted routing. SetIdentifier is only valid together with a
    # routing policy attribute such as Weight or Failover.
    # change_batch = {
    #     'Changes': [
    #         {
    #             'Action': 'UPSERT',
    #             'ResourceRecordSet': {
    #                 'Name': APPLICATION_DNS_NAME,
    #                 'Type': 'A', # Or CNAME, depending on your setup
    #                 'TTL': 60,
    #                 'SetIdentifier': 'app-secondary', # Example identifier
    #                 'Weight': 100, # Example weight; also set the primary record's weight to 0
    #                 'ResourceRecords': [
    #                     {'Value': 'SECONDARY_REGION_APP_IP_OR_ENDPOINT'}
    #                 ]
    #             }
    #         }
    #     ]
    # }
    # route53.change_resource_record_sets(HostedZoneId=PRIMARY_REGION_HOSTED_ZONE_ID, ChangeBatch=change_batch)

    # With DynamoDB Global Tables, the application must also be told which
    # region's DynamoDB endpoint to use, if it is not already targeting the
    # secondary. This might involve updating an SSM Parameter Store value,
    # a configuration file, or an environment variable that the application reads.

    # Example: Update an SSM parameter
    # ssm = boto3.client('ssm')
    # ssm.put_parameter(
    #     Name='/app/config/dynamodb_endpoint',
    #     Value=f'dynamodb.{SECONDARY_REGION_NAME}.amazonaws.com',
    #     Type='String',
    #     Overwrite=True
    # )

Configuration:

Set the following environment variables in your Lambda function:

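PRIMARY_REGION_HOSTED_ZONE_ID: The ID of the Route 53 hosted zone that holds the record for your primary endpoint. Route 53 hosted zones are global rather than regional, so the "region" in the name refers to where the record points, not where the zone lives.
SECONDARY_REGION_HOSTED_ZONE_ID: The ID of the hosted zone that holds the record for your secondary endpoint. If both records live in the same zone, which is the common case, use the same ID for both.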
APPLICATION_DNS_NAME: The DNS name of your application (e.g., api.example.com).
PRIMARY_REGION_NAME (Optional): The name of your primary AWS region. Defaults to ‘us-east-1’.
SECONDARY_REGION_NAME (Optional): The name of your secondary AWS region. Defaults to ‘us-west-2’.

Trigger: Subscribe this Lambda function to the SNS topic created earlier. Ensure the IAM role has the necessary permissions to modify Route 53 records or other configuration endpoints.
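
For reference, here is a minimal sketch of the event structure Lambda receives from an SNS subscription and how the fields can be pulled out. The sample event is trimmed to the fields used here, the MessageId is a dummy value, and extract_sns_messages is a hypothetical helper.

```python
def extract_sns_messages(event):
    """Return (message_id, subject, message) tuples for every SNS record in a Lambda event."""
    results = []
    for record in event.get("Records", []):
        sns = record.get("Sns", {})
        results.append((sns.get("MessageId"), sns.get("Subject"), sns.get("Message")))
    return results


# Trimmed sample of the event a Lambda function receives from SNS.
sample_event = {
    "Records": [
        {
            "EventSource": "aws:sns",
            "Sns": {
                "MessageId": "00000000-0000-0000-0000-000000000000",
                "Subject": "DynamoDB Primary Region Unhealthy - Initiating Failover",
                "Message": "DynamoDB health check failed: timeout",
            },
        }
    ]
}

for message_id, subject, message in extract_sns_messages(sample_event):
    print(message_id, subject, message)
```

Iterating over every record, rather than reading only the first, costs little. The MessageId is also a convenient key if you want to detect redelivered notifications.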

Important Considerations for Failover Orchestrator:

DNS Strategy: The example uses Route 53. If you use a different DNS provider, adapt the code accordingly. Consider using latency-based, failover, or weighted routing policies in Route 53 for more sophisticated traffic management.
Application Reconfiguration: For DynamoDB Global Tables, the application itself needs to be aware of the active region. This might involve updating configuration parameters (e.g., via AWS Systems Manager Parameter Store, AWS AppConfig) that your application polls. The Lambda function should orchestrate these updates.
Idempotency: Ensure the failover logic is idempotent to prevent multiple failover attempts if the SNS message is delivered more than once.
Rollback: Implement a mechanism for rolling back to the primary region once it recovers. This could be another Lambda function triggered manually or by a separate health check in the secondary region.
Testing: Thoroughly test your failover and rollback procedures in a staging environment. Simulate regional outages to validate the automation.
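
To make the DNS strategy, idempotency, and rollback points concrete, here is a sketch that builds a Route 53 ChangeBatch for a weighted routing setup. It assumes two weighted A records already exist with the set identifiers app-primary and app-secondary. Those identifiers, the function name build_weighted_shift, and the example IP addresses are all hypothetical.

```python
def build_weighted_shift(dns_name, primary_ip, secondary_ip, to_secondary=True, ttl=60):
    """Build a Route 53 ChangeBatch that moves all weight to one region.

    With weighted routing, a record whose weight is 0 receives no traffic
    as long as at least one other record has a non-zero weight.
    """
    primary_weight, secondary_weight = (0, 100) if to_secondary else (100, 0)

    def record(set_identifier, ip, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": dns_name,
                "Type": "A",
                "SetIdentifier": set_identifier,  # hypothetical identifiers
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            },
        }

    return {
        "Comment": "Shift traffic to secondary" if to_secondary else "Shift traffic to primary",
        "Changes": [
            record("app-primary", primary_ip, primary_weight),
            record("app-secondary", secondary_ip, secondary_weight),
        ],
    }


# Failover batch; pass to_secondary=False to build the rollback batch.
batch = build_weighted_shift("app.example.com", "192.0.2.10", "198.51.100.10")
print(batch["Changes"][1]["ResourceRecordSet"]["Weight"])  # 100
```

The resulting dictionary can be passed as the ChangeBatch argument of route53.change_resource_record_sets. Because both changes are UPSERTs that set absolute weights, applying the same batch twice leaves the records in the same state, which helps with duplicate SNS deliveries.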

Shopify Deployment Considerations on Linode

When deploying Shopify applications or services on Linode, the principles of disaster recovery and automated failover remain critical, though the specific tooling differs from AWS. For a Linode-based deployment, consider the following:

High Availability within a Linode Region

Before cross-region failover, ensure high availability within a single Linode region:

Multiple Linode Instances: Deploy your application on at least two Linode instances within the same region.
Load Balancer: Utilize a Linode NodeBalancer or a software load balancer (like HAProxy or Nginx) running on a dedicated instance to distribute traffic across your application instances. Configure health checks on the load balancer to automatically remove unhealthy instances from rotation.
Database Replication: If using a managed database service (like Linode Managed Databases for MySQL or PostgreSQL) or self-hosting, configure replication. For MySQL, use asynchronous or semi-synchronous replication. For PostgreSQL, use streaming replication. Ensure read replicas are available for read traffic and can be promoted to primary in case of failure.

Cross-Region Failover Strategy for Linode

Automating failover across Linode regions requires a similar detection and orchestration mechanism as described for AWS, but adapted for Linode’s API and services.

1. Data Replication

Database:

Manual Replication Setup: For databases not managed by Linode, you’ll need to set up cross-region replication yourself. This typically involves configuring database instances in two different Linode regions and establishing a replication channel between them. For MySQL, this might involve setting up a primary instance in Region A and a read replica in Region B, which can then be promoted.
Data Synchronization Tools: For file-based data or object storage, consider tools like rsync, Syncthing, or cloud-native object storage replication if you’re using a hybrid approach.

2. Health Monitoring and Detection

External Monitoring Service: Use an external monitoring service (e.g., UptimeRobot, Pingdom, or a custom solution using services like AWS CloudWatch/EventBridge if you have hybrid infrastructure) to periodically check the health of your application endpoints in the primary region. This service should be geographically independent of your Linode deployments.

Custom Health Check Endpoint: Ensure your application exposes a dedicated health check endpoint (e.g., /healthz) that performs critical checks, including database connectivity.

Alerting: Configure the monitoring service to send alerts (e.g., via webhooks, email, Slack) to an orchestration system when the primary region becomes unresponsive.
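
A health check endpoint of this kind can be sketched with only the Python standard library. The check_database function is a hypothetical placeholder for a cheap query against your database. The port and the response shape are also assumptions, not requirements of any monitoring service.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database():
    """Hypothetical placeholder: replace with a cheap query such as SELECT 1."""
    return True


def health_response(checks):
    """Run each named check and return (status_code, body_dict)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as failed rather than crashing the endpoint.
            results[name] = False
    healthy = all(results.values())
    body = {"status": "ok" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        code, body = health_response({"database": check_database})
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the health endpoint; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Returning 503 when any check fails lets both the load balancer and the external monitor treat the instance as unhealthy without parsing the body.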

3. Failover Orchestration

This is the most complex part and requires custom scripting or a dedicated orchestration tool.

Webhook Receiver: Set up a small service (e.g., a lightweight Flask/Django app, or a serverless function if using a cloud provider for orchestration) that listens for webhook alerts from your monitoring service.
Linode API Interaction: This service will interact with the Linode API to perform failover actions.
DNS Updates: The primary mechanism for directing traffic is DNS. You’ll need to update your domain’s DNS records to point to the IP address of your application stack in the secondary region. This can be done programmatically using the Linode API to manage DNS records within a Linode DNS Zone.
Database Promotion: If using database replication, the orchestration script must promote the replica in the secondary region to become the new primary. This involves stopping replication and potentially reconfiguring other replicas to point to the new primary.
Application Configuration Updates: If your application relies on configuration files or environment variables for database connection strings or API endpoints, these will need to be updated in the secondary region and potentially pushed to newly launched instances.
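
The DNS update step can also be sketched in Python with only the standard library, using the same Linode API v4 records endpoint as the Bash script that follows. This version assumes the API accepts an update body containing only the target field; if it does not, send the full record as the Bash script does. The function names are hypothetical.

```python
import json
import urllib.request

LINODE_API = "https://api.linode.com/v4"


def build_dns_update_request(domain_id, record_id, new_ip, token):
    """Build (but do not send) a PUT request that retargets one DNS record."""
    body = json.dumps({"target": new_ip}).encode("utf-8")
    return urllib.request.Request(
        url=f"{LINODE_API}/domains/{domain_id}/records/{record_id}",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def update_dns_record(domain_id, record_id, new_ip, token, timeout=10):
    """Send the update and return the decoded JSON response.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so a
    returned value means the API accepted the request.
    """
    request = build_dns_update_request(domain_id, record_id, new_ip, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes the payload easy to inspect in a dry run before the webhook receiver is allowed to change live DNS.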

Example Script Snippet (Bash with curl and jq against the Linode API):

#!/bin/bash

# --- Configuration ---
PRIMARY_REGION="us-east"
SECONDARY_REGION="us-west"
PRIMARY_LINODE_APP_IP="YOUR_PRIMARY_APP_IP" # IP of your load balancer or app server in primary
SECONDARY_LINODE_APP_IP="YOUR_SECONDARY_APP_IP" # IP of your load balancer or app server in secondary
DOMAIN_NAME="app.yourdomain.com"
LINODE_DNS_ZONE_ID="YOUR_DNS_ZONE_ID" # Found in Linode Cloud Manager -> Domains
LINODE_API_TOKEN="${LINODE_API_TOKEN:?Set LINODE_API_TOKEN in the environment}" # Read from the environment; never hardcode

# --- Functions ---

# Function to update a DNS record via the Linode API (using curl and jq)
update_dns_record() {
    local domain_id=$1
    local record_id=$2 # ID of the specific A record to update
    local new_ip=$3
    local name=$4 # The record's name within the zone, e.g., "app"

    echo "Updating DNS record '${name}' (ID ${record_id}) to ${new_ip}..."

    # Fetch the existing record so its unchanged fields can be sent back as-is
    local record_data
    record_data=$(curl -s -H "Authorization: Bearer ${LINODE_API_TOKEN}" \
        "https://api.linode.com/v4/domains/${domain_id}/records/${record_id}")

    if [ "$(echo "$record_data" | jq -r '.errors | length')" -gt 0 ]; then
        echo "Error fetching DNS record: $(echo "$record_data" | jq -r '.errors[0].message')"
        return 1
    fi

    local current_target
    current_target=$(echo "$record_data" | jq -r '.target')
    if [ "$current_target" == "$new_ip" ]; then
        echo "DNS record already points to ${new_ip}. No update needed."
        return 0
    fi

    # Build the update payload with jq so values are JSON-escaped correctly.
    # (An indented heredoc terminator would never close the heredoc.)
    local payload
    payload=$(echo "$record_data" | jq -c --arg ip "$new_ip" \
        '{target: $ip, type: .type, name: .name, ttl_sec: .ttl_sec, priority: .priority}')

    local update_response
    update_response=$(curl -s -X PUT \
        -H "Authorization: Bearer ${LINODE_API_TOKEN}" \
        -H "Content-Type: application/json" \
        -d "$payload" \
        "https://api.linode.com/v4/domains/${domain_id}/records/${record_id}")

    # curl exits 0 even when the API returns an error body, so confirm that
    # the returned record actually carries the new target.
    if echo "$update_response" | jq -e --arg ip "$new_ip" '.target == $ip' > /dev/null; then
        echo "DNS update successful."
        return 0
    else
        echo "DNS update failed: ${update_response}"
        return 1
    fi
}

# --- Main Failover Logic ---

echo "Received failover alert for primary region ${PRIMARY_REGION}."

# 1. Get the DNS record ID for the domain
# This requires fetching the zone first, then finding the specific record.
# For simplicity, let's assume you know the record ID or can fetch it.
# You'd typically query for records where 'name' matches your subdomain (e.g., 'app')
# and 'type' is 'A'.

# Example: Fetching records for the domain
echo "Fetching DNS records for zone ${LINODE_DNS_ZONE_ID}..."
RECORDS_RESPONSE=$(curl -s -H "Authorization: Bearer ${LINODE_API_TOKEN}" \
    "https://api.linode.com/v4/domains/${LINODE_DNS_ZONE_ID}/records")

if [ "$(echo "$RECORDS_RESPONSE" | jq -r '.errors | length')" -gt 0 ]; then
    echo "Error fetching DNS records: $(echo "$RECORDS_RESPONSE" | jq -r '.errors[0].message')"
    exit 1
fi

# Find the A record to update.
# A `while read` loop fed by a pipe runs in a subshell, so variables assigned
# inside it are lost when the loop ends. Select the record with jq instead.
RECORD_LABEL="app" # The record's name within the zone (the "app" in DOMAIN_NAME); adjust for your record
RECORD=$(echo "$RECORDS_RESPONSE" | jq -c --arg name "$RECORD_LABEL" \
    '[.data[] | select(.type == "A" and .name == $name)] | .[0] // empty')

if [ -z "$RECORD" ]; then
    echo "Could not find an A record named '${RECORD_LABEL}'. Please check your DNS configuration."
    exit 1
fi

RECORD_ID=$(echo "$RECORD" | jq -r '.id')
RECORD_NAME=$(echo "$RECORD" | jq -r '.name')
RECORD_TYPE=$(echo "$RECORD" | jq -r '.type')
TARGET_IP=$(echo "$RECORD" | jq -r '.target')
echo "Found A record: ID=${RECORD_ID}, Name=${RECORD_NAME}, Target=${TARGET_IP}"

# 2. Check if already failed over
if [ "$TARGET_IP" == "$SECONDARY_LINODE_APP_IP" ]; then
    echo "DNS record already points to the secondary IP (${SECONDARY_LINODE_APP_IP}). Failover already completed."
    exit 0
fi

# 3. Perform DNS update
echo "Initiating failover: Updating DNS to point to ${SECONDARY_LINODE_APP_IP}..."
if update_dns_record "${LINODE_DNS_ZONE_ID}" "${RECORD_ID}" "${SECONDARY_LINODE_APP_IP}" "${RECORD_NAME}"; then
    echo "DNS failover initiated successfully."

    # 4. (Optional) Trigger database promotion if applicable
    # echo "Promoting database replica in ${SECONDARY_REGION}..."
    # ssh user@secondary-db-host "sudo pg_ctl promote" # Example for PostgreSQL

    # 5. (Optional) Notify other systems or teams
    # curl -X POST -H "Content-Type: application/json" --data '{"text":"Shopify app failed over to secondary region: '${SECONDARY_REGION}'"}' YOUR_SLACK_WEBHOOK_URL

else
    echo "DNS failover failed."
    exit 1
fi

exit 0

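Security Note: API tokens should be stored securely, not directly in scripts. Inject them through environment variables at runtime or use a dedicated secrets management solution (for example, HashiCorp Vault), and scope the token to the minimum permissions the script needs.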

Rollback Strategy

A robust DR plan includes a clear rollback procedure. Once the primary region is restored:

Data Synchronization: Ensure data written to the secondary region’s database is replicated back to the primary region (if your replication strategy supports this, e.g., bidirectional replication or a final sync).
DNS Reversal: Execute a script (similar to the failover script but reversing the IPs) to point DNS back to the primary region’s infrastructure.
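Database Role Reversal: If a replica was promoted, the original primary has usually diverged from it and cannot simply resume its old role. Rebuild or resynchronize the original primary as a replica of the promoted node, then perform a planned switchover if you want it to become primary again.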

Conclusion

Architecting automated failover for critical applications like those interacting with DynamoDB or hosted on Linode requires careful planning, robust monitoring, and precise orchestration. By leveraging cloud-native services like AWS Lambda and SNS, or by building custom solutions with API interactions on platforms like Linode, you can significantly reduce downtime and ensure business continuity in the face of regional disruptions.