Disaster Recovery 101: Architecting Auto-Failovers for MongoDB and Shopify Deployments on DigitalOcean

Automated MongoDB Failover with DigitalOcean Managed Databases

High availability matters for any production MongoDB deployment, including those hosted on DigitalOcean. DigitalOcean’s Managed MongoDB service offers built-in replication and failover, but critical applications benefit from understanding the underlying mechanisms and knowing how to augment them. This section details how to architect for automated failover, focusing on scenarios where you might need custom logic or integration with external monitoring.

DigitalOcean Managed MongoDB, by default, configures a replica set. This means you have a primary node and one or more secondary nodes. In the event of a primary node failure, the replica set automatically elects a new primary from the available secondaries. This process is generally robust and requires minimal intervention for standard deployments. However, for mission-critical applications, we often need to go a step further by implementing external health checks and automated remediation workflows.

Leveraging DigitalOcean’s Built-in Failover

When you provision a Managed MongoDB cluster on DigitalOcean, you choose how many nodes it has. For high availability, a minimum of three nodes is recommended (one primary, two secondaries) so that a quorum can be maintained even if one node fails: with three voting members, a majority is two. The internal MongoDB election process relies on this quorum. If the primary becomes unreachable, the remaining nodes initiate an election. A secondary that holds sufficiently up-to-date data and can be reached by a majority of the replica set members is elected as the new primary.
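
The quorum arithmetic behind the three-node recommendation can be sketched in a few lines. This is an illustrative calculation only: the function names are made up for this example, and it assumes every node is a voting member.

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: strictly more than half."""
    if voting_members < 1:
        raise ValueError("a replica set needs at least one voting member")
    return voting_members // 2 + 1


def tolerated_failures(voting_members: int) -> int:
    """How many voting members can be lost while a majority remains."""
    return voting_members - majority(voting_members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} nodes: majority={majority(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under these assumptions a two-node set tolerates no failures at all, which is why three nodes is the smallest size that survives the loss of one member.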

You can monitor the health of your replica set via the DigitalOcean control panel. It will indicate the status of each node and the overall cluster health. For programmatic access and integration into your CI/CD or monitoring pipelines, DigitalOcean provides an API.

External Health Checks and Alerting

While DigitalOcean’s internal failover is automatic, you might want to be proactively alerted or trigger custom actions before a full failure event, or to verify the failover has completed successfully. This can be achieved using external monitoring tools that periodically check the health of the MongoDB primary. Tools like Prometheus with `mongodb_exporter`, Datadog, or even custom scripts can be employed.

A simple health check script can connect to the MongoDB instance and execute a read operation. If the operation fails or times out, an alert can be triggered. For more advanced checks, you can query the replica set status (`rs.status()` in the shell, or the `replSetGetStatus` command from a driver) to ensure a primary is available and that secondaries are in sync.
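
As a rough sketch of the more advanced check, the function below evaluates a replica set status document of the kind returned by `client.admin.command("replSetGetStatus")` in pymongo. The fields it reads (`members`, `name`, `stateStr`, `optimeDate`) are part of that command's output; the function name, the lag threshold, and the sample data are assumptions for illustration. Whether your database user is allowed to run the command on a managed cluster is also something to verify.

```python
from datetime import datetime, timedelta


def summarize_replica_set(status: dict, max_lag_seconds: float = 10.0) -> dict:
    """Report the primary and any member that is down or lagging too far."""
    members = status.get("members", [])
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    problems = []
    for m in members:
        state = m.get("stateStr")
        if m is primary or state == "ARBITER":
            continue  # Arbiters hold no data, so there is no lag to measure
        if state != "SECONDARY":
            problems.append(m["name"])  # e.g. unreachable or still recovering
        elif primary is not None:
            lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
            if lag > max_lag_seconds:
                problems.append(m["name"])
    return {
        "primary": primary["name"] if primary else None,
        "problem_members": problems,
        "healthy": primary is not None and not problems,
    }


if __name__ == "__main__":
    # Hand-written sample; a real document comes from replSetGetStatus.
    now = datetime(2024, 1, 1, 12, 0, 0)
    sample = {"members": [
        {"name": "node-a:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "node-b:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=2)},
    ]}
    print(summarize_replica_set(sample))
```

Keeping the evaluation separate from the database call makes the logic easy to test with hand-written documents before pointing it at a live cluster.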

Example: Python Health Check Script

This Python script uses the pymongo library to check MongoDB replica set health. It connects to the primary and verifies its status.

import os
import sys
import time

import pymongo
from pymongo.errors import ConnectionFailure, OperationFailure

# --- Configuration ---
# Placeholder URI; copy the real connection string (including any TLS options)
# from the DigitalOcean control panel.
MONGO_URI = os.environ.get("MONGO_URI", "mongodb://user:password@your_do_mongo_host:27017/?replicaSet=your_replica_set_name")
READ_PREFERENCE = pymongo.ReadPreference.PRIMARY
TIMEOUT_MS = 5000  # 5 seconds
RETRY_DELAY_SEC = 10
MAX_RETRIES = 3

def check_mongo_health(uri, read_preference, timeout_ms):
    """
    Checks that a writable primary is reachable.
    Returns True if healthy, False otherwise.
    """
    client = None
    try:
        # read_preference takes a ReadPreference object; the camelCase
        # readPreference option expects a mode string such as "primary".
        client = pymongo.MongoClient(uri, read_preference=read_preference, serverSelectionTimeoutMS=timeout_ms)
        # The hello command is cheap and does not require auth. It replaces
        # the deprecated ismaster command on current MongoDB versions.
        reply = client.admin.command('hello')
        if not reply.get('isWritablePrimary', False):
            print("Connected, but the selected node is not a writable primary.")
            return False
        print("MongoDB connection successful.")
        return True
    except ConnectionFailure as e:
        # Also covers server selection timeouts (no primary found in time).
        print(f"MongoDB connection failed: {e}")
        return False
    except OperationFailure as e:
        print(f"MongoDB operation failed: {e}")
        return False
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return False
    finally:
        if client is not None:
            client.close()

if __name__ == "__main__":
    for attempt in range(1, MAX_RETRIES + 1):
        if check_mongo_health(MONGO_URI, READ_PREFERENCE, TIMEOUT_MS):
            print("MongoDB primary is healthy.")
            sys.exit(0)  # Success
        print(f"MongoDB primary is unhealthy. ({attempt}/{MAX_RETRIES})")
        if attempt < MAX_RETRIES:
            # Sleep only when another attempt will follow.
            print(f"Retrying in {RETRY_DELAY_SEC} seconds...")
            time.sleep(RETRY_DELAY_SEC)

    print("MongoDB primary remains unhealthy after multiple retries.")
    sys.exit(1)  # Failure

Automating Failover Actions (Advanced)

For true automation, you’d integrate this health check script with an alerting system (e.g., Alertmanager, PagerDuty) and potentially a remediation system. If the health check consistently fails, an alert is fired. A separate automation service could then:

Verify Failover: After an alert, wait a short period (e.g., 30-60 seconds) to allow DigitalOcean’s internal failover to complete. Then, re-run the health check.
Notify Application Teams: If the failover is successful, notify relevant teams via Slack or email.
Trigger Application Restart/Reconfiguration: If your application instances need to be explicitly pointed to the new primary (though most MongoDB drivers handle this automatically if configured with the replica set name), this is where you’d trigger that action. This might involve updating application configuration files and restarting application services.
Escalate: If the failover still hasn’t completed successfully after a predefined period, escalate the issue to on-call engineers.

This level of automation typically involves a combination of:

Monitoring Agent: Running the health check script periodically (e.g., via cron or a dedicated monitoring agent).
Alerting System: Receiving alerts from the monitoring agent.
Automation/Orchestration Tool: A system like Ansible, Rundeck, or a custom-built service that reacts to alerts and executes remediation playbooks/scripts.
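
The verify, notify, and escalate steps described above could be wired together roughly as follows. This is a minimal sketch under stated assumptions: `handle_primary_alert` and its parameters are hypothetical names, the three callables stand in for your own health check, chat or email notifier, and paging integration, and the timings are placeholders to tune.

```python
import time
from typing import Callable


def handle_primary_alert(
    check_health: Callable[[], bool],
    notify: Callable[[str], None],
    escalate: Callable[[str], None],
    grace_period_sec: float = 45.0,
    recheck_interval_sec: float = 15.0,
    deadline_sec: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the replica set recovered, False if the alert was escalated."""
    # Give the built-in election time to finish before re-checking.
    sleep(grace_period_sec)
    waited = grace_period_sec
    while True:
        if check_health():
            notify(f"MongoDB failover completed after about {waited:.0f}s.")
            return True
        if waited >= deadline_sec:
            escalate(f"MongoDB still has no healthy primary after {waited:.0f}s.")
            return False
        sleep(recheck_interval_sec)
        waited += recheck_interval_sec
```

Passing `sleep` as a parameter is a design choice that lets the workflow be exercised in tests without real waiting. In production the default `time.sleep` applies.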

Architecting Shopify Deployments for Resilience

Shopify, as a platform, is inherently designed for high availability. However, when you’re building custom applications, themes, or integrations that interact with Shopify’s APIs, or when you’re hosting your own backend services that power your Shopify store, you need to architect for resilience. This involves understanding Shopify’s API rate limits, implementing robust error handling, and designing your own infrastructure for failover.

Shopify API Resilience Patterns

Interacting with Shopify’s REST and GraphQL APIs requires careful consideration of potential failures:

Rate Limiting: Shopify imposes rate limits on API requests. Exceeding these limits results in `429 Too Many Requests` errors. Honor the `Retry-After` response header when it is present, and otherwise retry with exponential backoff.
Network Issues: Transient network problems can cause requests to fail. Implement retry logic for `5xx` server errors and network-related exceptions.
API Downtime: While rare, Shopify’s APIs can experience downtime. Your application should degrade gracefully, perhaps by serving cached data or displaying a maintenance message, rather than crashing.
Data Consistency: When multiple API operations must succeed together (e.g., creating an order and updating inventory), remember that separate API calls do not share a transaction. Implement compensating transactions in your application logic, and make retried writes idempotent so a repeated request does not create duplicates.
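
The compensating-transaction idea from the last item can be sketched independently of any API client. The helper below is an assumption-laden illustration: `run_with_compensation` and the `Step` tuple are hypothetical names, and in a real integration each action and compensation would wrap one of your own Shopify API calls.

```python
from typing import Callable, List, Tuple

# (name, action, compensation): the compensation undoes the action.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> None:
    """Run steps in order; if one fails, undo the completed ones in reverse."""
    completed: List[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
        except Exception:
            for done_name, _, compensate in reversed(completed):
                try:
                    compensate()
                except Exception as undo_error:
                    # A failed undo needs manual follow-up; keep unwinding.
                    print(f"Compensation for {done_name!r} failed: {undo_error}")
            raise  # Surface the original failure to the caller
        completed.append(step)
```

A caller might pass steps such as ("create_order", create, cancel) followed by ("adjust_inventory", adjust, restore). If the inventory step fails, the order step is cancelled and the original error is raised again.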

Example: Python Shopify API Retry Logic

This Python snippet demonstrates a basic retry mechanism for Shopify API calls using the requests library. For production, consider using a dedicated Shopify API client library that might already implement these patterns.

import requests
import time
import os
import json

# --- Configuration ---
SHOPIFY_STORE_DOMAIN = os.environ.get("SHOPIFY_STORE_DOMAIN", "your-store.myshopify.com")
SHOPIFY_API_VERSION = "2023-10" # Example only; use a currently supported API version
SHOPIFY_ACCESS_TOKEN = os.environ.get("SHOPIFY_ACCESS_TOKEN", "shpat_your_private_app_token")

API_ENDPOINT = f"https://{SHOPIFY_STORE_DOMAIN}/admin/api/{SHOPIFY_API_VERSION}/orders.json"

MAX_RETRIES = 5
INITIAL_BACKOFF_SEC = 1
MAX_BACKOFF_SEC = 60

def make_shopify_request(method, url, **kwargs):
    """
    Makes a Shopify API request with retry logic for rate limiting, server
    errors, and network failures. Other 4xx client errors are raised at once.
    Note: retrying a non-idempotent call (e.g., POST) can create duplicates.
    """
    headers = {
        "X-Shopify-Access-Token": SHOPIFY_ACCESS_TOKEN,
        "Content-Type": "application/json"
    }
    headers.update(kwargs.get("headers", {}))
    kwargs["headers"] = headers
    kwargs.setdefault("timeout", 30) # 30-second timeout unless the caller overrides it

    retries = 0
    backoff_time = INITIAL_BACKOFF_SEC

    while retries < MAX_RETRIES:
        wait_time = backoff_time
        try:
            response = requests.request(method, url, **kwargs)
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout) as e:
            # Only transient network failures are retried here.
            print(f"Network error: {e}. Retrying in {wait_time} seconds.")
        else:
            if response.status_code == 429:
                # Retry-After is given in seconds and may be fractional (e.g., "2.0").
                try:
                    wait_time = max(wait_time, float(response.headers.get("Retry-After", wait_time)))
                except ValueError:
                    pass # Unparseable header; keep the exponential backoff value
                call_limit = response.headers.get("X-Shopify-Shop-Api-Call-Limit", "unknown") # Example: "40/40"
                print(f"Rate limit hit. Retrying in {wait_time} seconds. Limit: {call_limit}")
            elif response.status_code >= 500:
                print(f"Server error ({response.status_code}). Retrying in {wait_time} seconds.")
            else:
                # Raises HTTPError for the remaining 4xx client errors, which are not retried.
                response.raise_for_status()
                return response

        time.sleep(wait_time)
        backoff_time = min(backoff_time * 2, MAX_BACKOFF_SEC)
        retries += 1

    print(f"Max retries ({MAX_RETRIES}) reached for URL: {url}")
    raise Exception("Failed to complete Shopify API request after multiple retries.")

if __name__ == "__main__":
    # Example: Fetching orders
    try:
        response = make_shopify_request("GET", API_ENDPOINT)
        orders = response.json()
        print(f"Successfully fetched {len(orders.get('orders', []))} orders.")
        # print(json.dumps(orders, indent=2))

        # Example: Creating a simple order (requires appropriate permissions)
        # new_order_data = {
        #     "order": {
        #         "email": "[email protected]",
        #         "financial_status": "pending",
        #         "line_items": [
        #             {
        #                 "variant_id": 1234567890, # Replace with a valid variant ID
        #                 "quantity": 1
        #             }
        #         ]
        #     }
        # }
        # create_response = make_shopify_request("POST", API_ENDPOINT, json=new_order_data)
        # created_order = create_response.json()
        # print(f"Successfully created order: {created_order.get('order', {}).get('id')}")

    except Exception as e:
        print(f"Error during Shopify API interaction: {e}")
        exit(1)
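
The `X-Shopify-Shop-Api-Call-Limit` header that appears in the retry code reports bucket usage as used/limit (for example "32/40"), so it can also be used to slow down before a `429` occurs. The helper below is a small sketch of that idea. The function names, the 80% threshold, and the one-second pause are illustrative assumptions, not Shopify recommendations.

```python
def parse_call_limit(header_value: str) -> tuple:
    """Split a value such as "32/40" into (used, limit)."""
    used, limit = header_value.split("/", 1)
    return int(used), int(limit)


def suggested_pause(header_value: str, threshold: float = 0.8,
                    pause_sec: float = 1.0) -> float:
    """Return a pause in seconds when usage is at or above threshold, else 0."""
    try:
        used, limit = parse_call_limit(header_value)
    except ValueError:
        return 0.0  # Missing or malformed header: do not throttle on a guess
    if limit <= 0:
        return 0.0
    return pause_sec if used / limit >= threshold else 0.0


if __name__ == "__main__":
    for value in ("10/40", "36/40", ""):
        print(repr(value), "->", suggested_pause(value))
```

A caller could read the header from each successful response and sleep for the suggested pause before issuing the next request.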

Hosting Your Own Backend Services for Shopify

If your Shopify store relies on custom backend services (e.g., for complex inventory management, custom pricing rules, or integrations with third-party systems), these services must be architected for high availability and failover. DigitalOcean Droplets, Kubernetes, or App Platform can host these services.

Example: Nginx Load Balancer with Health Checks

A common pattern is to use a load balancer in front of multiple instances of your backend application. Nginx is a popular choice for this. Open-source Nginx performs passive health checks: it stops routing to a backend server after requests to it fail repeatedly. Active health checks, where Nginx itself probes the servers, require Nginx Plus or an external tool.

# /etc/nginx/nginx.conf or a file in /etc/nginx/conf.d/

# Define your backend application servers
upstream backend_app {
    # Passive health check parameters (open-source Nginx)
    # max_fails=3: Number of failed attempts within fail_timeout before the server is marked unavailable.
    # fail_timeout=5s: Window in which those failures are counted, and how long the server is then skipped.
    # backup: Receives traffic only when the primary servers are unavailable.
    server app1.yourdomain.com:8080 fail_timeout=5s max_fails=3;
    server app2.yourdomain.com:8080 fail_timeout=5s max_fails=3;
    server app3.yourdomain.com:8080 fail_timeout=5s max_fails=3;

    # Optional: Add backup servers if needed
    # server app_backup1.yourdomain.com:8080 backup;
    # server app_backup2.yourdomain.com:8080 backup;
}

server {
    listen 80;
    server_name your-shopify-backend.com;

    location / {
        proxy_pass http://backend_app;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Active health checks (the 'health_check' directive) require Nginx Plus (commercial feature).
        # Open-source Nginx relies on the passive max_fails/fail_timeout checks set in the upstream block.
        # A common open-source approach is to have a dedicated health check endpoint
        # on your application servers and use a separate monitoring system (like Prometheus)
        # to probe these endpoints and potentially trigger actions if they fail.

        # Example of a basic health check endpoint your app should expose:
        # GET /healthz -> returns 200 OK if healthy, 500 Internal Server Error otherwise.
    }

    # Optional: Health check endpoint for the load balancer itself (if needed)
    # location /nginx_health {
    #     default_type text/plain;
    #     return 200 "OK";
    # }
}

# For Nginx Open Source, you'd typically use a separate tool to monitor
# the /healthz endpoint of your backend servers and then use that tool's
# alerting to trigger actions (e.g., restart a failed instance, notify ops).
# Or, if using a service like DigitalOcean's Load Balancer, it has built-in health checks.

Note on Nginx Open Source Health Checks: Active health checks (where Nginx itself probes backend servers) are available only in Nginx Plus. Open-source Nginx performs passive checks: the `max_fails` and `fail_timeout` parameters mark a server as unavailable after real client requests to it fail. For active probing with open-source Nginx, integrate an external monitoring tool or explore community-maintained modules.
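
Whichever tool does the probing, each backend needs the `/healthz` endpoint mentioned in the configuration comments. The sketch below shows one way to expose it using only the Python standard library. `dependencies_ok` is a hypothetical placeholder for your own readiness logic, and a real application would more likely add the route to its existing web framework.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def dependencies_ok() -> bool:
    """Hypothetical readiness probe; replace with real checks (database, queue)."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            healthy = dependencies_ok()
            body = b"OK" if healthy else b"UNHEALTHY"
            self.send_response(200 if healthy else 500)
        else:
            body = b"Not Found"
            self.send_response(404)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep frequent health probes out of the log


def start_health_server(host: str = "0.0.0.0", port: int = 8080) -> ThreadingHTTPServer:
    """Serve the health endpoint on a background thread and return the server."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_health_server()` during application startup makes the endpoint available to probes; `server.shutdown()` stops it during a graceful exit.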

DigitalOcean Load Balancer Integration

For managed load balancing on DigitalOcean, use their Load Balancer service. It provides built-in health checks that are configurable through the control panel or API. This is often simpler and more integrated than managing Nginx yourself for basic load balancing and failover.

Configuring DigitalOcean Load Balancer Health Checks

When setting up a DigitalOcean Load Balancer:

Protocol: Choose HTTP, HTTPS, or TCP. For web applications, HTTP/HTTPS is common.
Port: The port your backend application listens on (e.g., 8080).
Path: A specific URL path on your application servers that should return a 200 OK status code when the server is healthy (e.g., `/healthz`).
Check Interval: How often the load balancer should check the health of each backend server.
Response Timeout: How long to wait for a response before considering the check failed.
Healthy Threshold: The number of consecutive successful checks required to mark a server as healthy.
Unhealthy Threshold: The number of consecutive failed checks required to mark a server as unhealthy.

These settings ensure that the DigitalOcean Load Balancer automatically routes traffic away from unhealthy application instances, providing a seamless failover experience for your Shopify-integrated backend services.
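
To make the settings above concrete, the sketch below builds a health check object using the field names of the DigitalOcean Load Balancer API (`protocol`, `port`, `path`, `check_interval_seconds`, `response_timeout_seconds`, `healthy_threshold`, `unhealthy_threshold`) and estimates how long a failed backend might keep receiving traffic. The default values and the timing formula are assumptions for illustration, so check the current API reference for actual defaults and limits.

```python
def build_health_check(
    path: str = "/healthz",
    port: int = 8080,
    protocol: str = "http",
    check_interval_seconds: int = 10,
    response_timeout_seconds: int = 5,
    healthy_threshold: int = 5,
    unhealthy_threshold: int = 3,
) -> dict:
    """Build a health_check object for a load balancer create/update request."""
    check = {
        "protocol": protocol,
        "port": port,
        "check_interval_seconds": check_interval_seconds,
        "response_timeout_seconds": response_timeout_seconds,
        "healthy_threshold": healthy_threshold,
        "unhealthy_threshold": unhealthy_threshold,
    }
    if protocol != "tcp":
        check["path"] = path  # A path only makes sense for HTTP(S) checks
    return check


def approx_detection_seconds(check: dict) -> int:
    """Rough estimate: consecutive failed checks plus one response timeout."""
    return (check["unhealthy_threshold"] * check["check_interval_seconds"]
            + check["response_timeout_seconds"])


if __name__ == "__main__":
    hc = build_health_check()
    print(hc)
    print("approximate detection time (s):", approx_detection_seconds(hc))
```

Lower intervals and thresholds shorten detection time but make the check more sensitive to brief hiccups, so the values are a trade-off rather than a fixed recommendation.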