Automating Multi-Region Redundancy for Shopify Architectures on OVH

Establishing a Multi-Region Disaster Recovery Strategy for Shopify on OVHcloud

This document outlines a robust, automated multi-region disaster recovery (DR) strategy for Shopify architectures hosted on OVHcloud infrastructure. The primary objective is to minimize Recovery Time Objective (RTO) and Recovery Point Objective (RPO) in the event of a regional outage, ensuring business continuity for critical e-commerce operations.

Core Components and Architecture Overview

Our DR strategy hinges on maintaining a near-real-time replica of the primary Shopify environment in a secondary OVHcloud region. This involves replicating:

Application Servers: Stateless web servers running the Shopify application stack (e.g., Ruby on Rails, Nginx, Puma).
Database: A replicated PostgreSQL instance (or equivalent) storing all transactional data.
Cache: A replicated Redis instance for session management and performance optimization.
Static Assets: Object storage (e.g., OVHcloud Object Storage) for product images, themes, and other static content.
Configuration: Centralized configuration management for seamless failover.

The primary region will host the active production environment. The secondary region will host a standby, synchronized replica. Failover will be a manual or semi-automated process triggered by monitoring alerts or direct intervention.

Database Replication: PostgreSQL Streaming Replication

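For PostgreSQL, we will implement streaming replication to keep RPO low. This involves setting up a primary instance in Region A and a standby replica in Region B. The standby continuously receives and replays write-ahead log (WAL) records from the primary. Streaming replication is asynchronous by default, so a small window of recently committed transactions can be lost if Region A fails abruptly; RPO is low but not zero.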

Primary Region (Region A) PostgreSQL Configuration

On the primary PostgreSQL server, modify postgresql.conf and pg_hba.conf.

`postgresql.conf` (Primary)

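# WAL level required for streaming replication
wal_level = replica

# WAL archiving (optional but recommended for point-in-time recovery)
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f' # Adjust path as needed

# Replication connections and slots (the defaults are usually sufficient on PG10+)
# max_wal_senders = 3
# max_replication_slots = 1
# Logical replication, if needed for specific integrations, additionally requires wal_level = logical

# hot_standby only takes effect on a standby; keeping it set lets the same
# configuration be reused after a role switch
hot_standby = on
wal_sender_timeout = 60s
listen_addresses = '*'
port = 5432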

`pg_hba.conf` (Primary)

# Allow replication connections from the standby server's IP
host    replication     replicator      <standby_ip_address>/32       md5
# Allow application connections from web servers
host    all             all             <web_server_subnet>/24        md5

Create a replication user and grant necessary privileges:

CREATE USER replicator WITH REPLICATION LOGIN PASSWORD 'your_replication_password';
GRANT CONNECT ON DATABASE your_shopify_db TO replicator;
-- If using logical replication, additional grants might be needed.

Restart PostgreSQL on the primary:

sudo systemctl restart postgresql

Secondary Region (Region B) PostgreSQL Configuration

On the standby PostgreSQL server, ensure postgresql.conf is configured for standby mode. Its pg_hba.conf should allow local connections and, so that it is ready for failover, connections from the Region B application servers.

`postgresql.conf` (Standby)

hot_standby = on
wal_receiver_status_interval = 10s
wal_receiver_timeout = 60s
listen_addresses = '*'
port = 5432

# Recovery settings: for PostgreSQL 12 and later these live in postgresql.conf;
# earlier versions read them from a separate recovery.conf
# restore_command only works if the primary's WAL archive is reachable from this host (e.g., shared or synced storage)
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
primary_conninfo = 'host=<primary_ip_address> port=5432 user=replicator password=your_replication_password'
recovery_target_timeline = 'latest'

Important: For PostgreSQL versions prior to 12, recovery settings are typically in a separate recovery.conf file. For PostgreSQL 12+, these are integrated into postgresql.conf.

Initial Data Synchronization

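Before starting the standby, seed its data directory with a physical base backup of the primary. Use pg_basebackup, which takes a consistent copy of the whole cluster while the primary stays online; a logical dump (pg_dump) cannot be used to seed a streaming replica.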

# On the standby server:
sudo systemctl stop postgresql

# Clean any existing data directory
sudo rm -rf /var/lib/postgresql/data/*

# Perform base backup from primary
# -R writes the replication settings for the standby; -C creates the replication slot named by -S on the primary (PostgreSQL 11+)
sudo -u postgres pg_basebackup -h <primary_ip_address> -U replicator -D /var/lib/postgresql/data -P -v -R -C -S standby_slot_name

# For PG12+, -R creates standby.signal and appends the connection settings to postgresql.auto.conf
# For older versions, -R writes a minimal recovery.conf instead

# Ensure correct ownership
sudo chown -R postgres:postgres /var/lib/postgresql/data

# Create standby.signal file for PG12+ if not created by -R
# sudo touch /var/lib/postgresql/data/standby.signal

# Start PostgreSQL on the standby
sudo systemctl start postgresql

Verify replication status on the primary:

SELECT * FROM pg_stat_replication;

And on the standby:

SELECT pg_is_in_recovery(); -- Should return t (true)
SELECT * FROM pg_stat_wal_receiver;
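
The LSN columns returned by these views (for example replay_lsn in pg_stat_replication, compared against pg_current_wal_lsn() on the primary) are printed as two hexadecimal numbers separated by a slash. A minimal sketch of turning two such values into a lag in bytes is shown here; the function names and the sample LSNs are illustrative only, and PostgreSQL's own pg_wal_lsn_diff() performs the same subtraction in SQL.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a pg_lsn string such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash holds the upper 32 bits, the part after the lower 32.
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the standby has not replayed yet (never negative)."""
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(replay_lsn))


if __name__ == "__main__":
    # Hypothetical sample values, as they might be read from the primary.
    print(replication_lag_bytes("0/3000148", "0/3000060"))  # prints 232
```

A lag that keeps growing over several samples is a stronger signal than any single reading, since short spikes are normal under write bursts.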

Application Server Deployment and Synchronization

Application servers should be stateless. Configuration and code deployment should be managed via a CI/CD pipeline that targets both regions. For DR, we’ll deploy identical application stacks in both regions.

Infrastructure as Code (IaC)

Utilize tools like Terraform or Ansible to provision and configure identical server environments in both regions. This ensures consistency and simplifies deployment.

Code Deployment

The CI/CD pipeline should be configured to deploy application code to both regions. During normal operations, traffic is directed to Region A. In a DR scenario, traffic is redirected to Region B.

Configuration Management

Centralize application configuration (database credentials, API keys, etc.) using a tool like HashiCorp Consul, AWS Systems Manager Parameter Store (if using AWS components), or a secure Git repository with access controls. Ensure this configuration is accessible from both regions.
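
One way to keep failover simple is to store the settings for both regions side by side and let a single switch select the active set. The sketch below illustrates that idea in Python, independent of the application's own language; the ACTIVE_REGION variable, the region names, and the IP addresses are all hypothetical, and real values would come from the central store rather than a hard-coded dictionary.

```python
import os

# Hypothetical per-region connection settings. In practice these values would
# be fetched from the central configuration store rather than hard-coded.
REGION_SETTINGS = {
    "region-a": {"db_host": "10.0.1.50", "redis_host": "10.0.1.60"},
    "region-b": {"db_host": "10.0.2.50", "redis_host": "10.0.2.60"},
}


def settings_for_active_region(env=None):
    """Return the connection settings for the region named in ACTIVE_REGION."""
    env = os.environ if env is None else env
    region = env.get("ACTIVE_REGION", "region-a")
    if region not in REGION_SETTINGS:
        raise ValueError(f"Unknown region: {region!r}")
    return REGION_SETTINGS[region]


if __name__ == "__main__":
    # Simulate the switch that a failover would flip.
    print(settings_for_active_region({"ACTIVE_REGION": "region-b"})["db_host"])
```

With this layout, pointing the application at Region B means changing one value instead of editing several credentials under time pressure.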

Cache Replication (Redis)

For Redis, we can leverage Redis Sentinel for high availability within a region, but for multi-region DR, Redis Replication (master-replica) is the primary mechanism. A replica in Region B will asynchronously replicate data from the master in Region A.

Redis Configuration (Master – Region A)

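# redis.conf
port 6379
# Binding to all interfaces exposes Redis widely; prefer the private interface address where possible
bind 0.0.0.0
# A master needs no replication directive (do not set replicaof/slaveof)
# protected-mode no removes the default safeguard against unauthenticated remote access,
# so pair it with requirepass or ACLs and with firewall rules limited to the replica and application hosts
protected-mode no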

Redis Configuration (Replica – Region B)

# redis.conf
port 6379
bind 0.0.0.0
# Configure replication (Redis 5+)
replicaof <redis_master_ip_region_a> 6379
# For Redis versions before 5, use slaveof
# slaveof <redis_master_ip_region_a> 6379
# If the master sets requirepass, also set masterauth on the replica

# Redis does not promote a replica on its own without Sentinel or Cluster;
# for DR, promotion is a manual step (REPLICAOF NO ONE)

# protected-mode no removes the default safeguard against unauthenticated remote access,
# so pair it with requirepass or ACLs and with firewall rules
protected-mode no

Restart Redis on both master and replica. Monitor replication status using INFO replication on the replica.
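
The output of INFO replication is a list of key:value lines, so a monitoring job can evaluate it without a Redis client library once the text has been captured. The following is a rough sketch of such a check; the 10-second threshold and the sample text are assumptions, while role, master_link_status, and master_last_io_seconds_ago are fields Redis reports on a replica.

```python
def parse_info(raw: str) -> dict:
    """Parse the 'key:value' lines of a Redis INFO section into a dict."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as '# Replication'
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_is_healthy(raw: str, max_io_gap_seconds: int = 10) -> bool:
    """True if this node is a replica with a live, recently active link to its master."""
    info = parse_info(raw)
    if info.get("role") != "slave":  # Redis reports replicas as 'slave' in INFO
        return False
    if info.get("master_link_status") != "up":
        return False
    # master_last_io_seconds_ago is -1 while the link is down.
    last_io = int(info.get("master_last_io_seconds_ago", "-1"))
    return 0 <= last_io <= max_io_gap_seconds


if __name__ == "__main__":
    # Hypothetical sample of INFO replication output captured on the replica.
    sample = (
        "# Replication\r\n"
        "role:slave\r\n"
        "master_host:10.0.1.60\r\n"
        "master_port:6379\r\n"
        "master_link_status:up\r\n"
        "master_last_io_seconds_ago:1\r\n"
    )
    print(replica_is_healthy(sample))  # prints True
```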

Static Asset Synchronization (OVHcloud Object Storage)

Shopify typically uses a CDN for static assets. If you are self-hosting static assets (e.g., product images, theme files) in OVHcloud Object Storage, you need a strategy to synchronize these between regions.

Option 1: Cross-Region Replication (if supported by OVHcloud Object Storage)

Check if OVHcloud Object Storage offers native cross-region replication. If so, configure it to automatically replicate objects from the primary bucket to a secondary bucket in Region B.

Option 2: Manual/Automated Sync Script

If native replication is not available, implement a script (e.g., Python with Boto3, or using s3cmd/rclone) that periodically syncs objects. This script would run on a schedule (e.g., cron job) or be triggered by events.

import os
import sys

import boto3
from botocore.exceptions import ClientError

# Configure your OVHcloud Object Storage credentials and endpoints
# Ensure you have separate credentials/keys for each region
primary_region_endpoint = os.environ.get("OVH_PRIMARY_ENDPOINT")
primary_bucket_name = os.environ.get("OVH_PRIMARY_BUCKET")
secondary_region_endpoint = os.environ.get("OVH_SECONDARY_ENDPOINT")
secondary_bucket_name = os.environ.get("OVH_SECONDARY_BUCKET")

# Use appropriate access keys and secret keys
s3_primary = boto3.client('s3', endpoint_url=primary_region_endpoint,
                          aws_access_key_id=os.environ.get("OVH_PRIMARY_ACCESS_KEY"),
                          aws_secret_access_key=os.environ.get("OVH_PRIMARY_SECRET_KEY"))

s3_secondary = boto3.client('s3', endpoint_url=secondary_region_endpoint,
                            aws_access_key_id=os.environ.get("OVH_SECONDARY_ACCESS_KEY"),
                            aws_secret_access_key=os.environ.get("OVH_SECONDARY_SECRET_KEY"))

def copy_object(key):
    """Stream one object from the primary bucket to the secondary bucket."""
    # The buckets sit behind different endpoints and credentials, so a
    # server-side copy is not possible; the data is streamed through this host.
    source = s3_primary.get_object(Bucket=primary_bucket_name, Key=key)
    s3_secondary.upload_fileobj(
        source['Body'], secondary_bucket_name, key,
        ExtraArgs={'ContentType': source.get('ContentType', 'binary/octet-stream')})
    print(f"Copied {key} to secondary bucket.")

def needs_copy(obj):
    """Return True if the object is missing from, or older in, the secondary bucket."""
    try:
        secondary = s3_secondary.head_object(Bucket=secondary_bucket_name, Key=obj['Key'])
    except ClientError as e:
        if e.response['Error']['Code'] == '404':
            return True  # Object does not exist in the secondary bucket
        raise
    return obj['LastModified'] > secondary['LastModified']

def sync_objects():
    """Copy new or updated objects to the secondary bucket; return the number of failures.

    Deletions in the primary bucket are not propagated.
    """
    failures = 0
    paginator = s3_primary.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=primary_bucket_name):
        for obj in page.get('Contents', []):
            key = obj['Key']
            try:
                if needs_copy(obj):
                    print(f"Object {key} is missing or outdated in secondary. Copying...")
                    copy_object(key)
                else:
                    print(f"Object {key} is up-to-date in secondary bucket.")
            except Exception as e:
                # Keep going so that one bad object does not block the rest of the run
                failures += 1
                print(f"Error syncing object {key}: {e}", file=sys.stderr)
    return failures

if __name__ == "__main__":
    # Ensure environment variables are set for endpoints and buckets
    if not all([primary_region_endpoint, primary_bucket_name, secondary_region_endpoint, secondary_bucket_name]):
        sys.exit("Error: Missing environment variables for OVHcloud Object Storage configuration.")
    # A non-zero exit code lets cron or a monitoring wrapper detect failed runs
    sys.exit(1 if sync_objects() else 0)

This script would need to be scheduled to run frequently (e.g., every 5-15 minutes) depending on RPO requirements. Ensure the credentials used have read access to the primary bucket and write access to the secondary bucket.

DNS and Load Balancing for Failover

A critical part of DR is redirecting traffic to the standby region. This is typically achieved using DNS failover or a global load balancer.

Option 1: DNS Failover (e.g., OVHcloud Managed DNS)

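Configure DNS records (e.g., A, CNAME) for your Shopify domain with a short TTL (for example, 60 seconds), since resolvers keep serving the old address until the cached record expires. Set up health checks for the primary region’s load balancer or application servers. If the DNS provider supports health-checked failover, it updates the record to point to the secondary region’s IP address when the checks fail; otherwise, the record must be updated through the provider’s API or control panel as part of the failover procedure.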

Option 2: Global Load Balancer (e.g., Cloudflare, Akamai)

If using a third-party global load balancer, configure it with origin servers in both regions. Implement health checks and automatic failover policies. This is generally more sophisticated and offers lower failover times than DNS-only solutions.

OVHcloud Load Balancer Configuration (Example)

OVHcloud offers Load Balancer services. You would configure a load balancer in Region A pointing to your application servers there. In Region B, you would have a similar setup. For DR, you’d manage the public-facing IP address or DNS record to point to the appropriate region’s load balancer.

# Conceptual OVHcloud Load Balancer Configuration (via API/CLI/Dashboard)

# Region A Load Balancer:
# - Frontend IP: e.g., 192.0.2.10
# - Backend Pool: Points to application servers in Region A (e.g., 10.0.1.10, 10.0.1.11)
# - Health Check: TCP on port 80/443, or HTTP GET /health

# Region B Load Balancer:
# - Frontend IP: e.g., 192.0.2.20 (or use same IP if DNS is managed externally)
# - Backend Pool: Points to application servers in Region B (e.g., 10.0.2.10, 10.0.2.11)
# - Health Check: TCP on port 80/443, or HTTP GET /health

# DNS Configuration (External):
# - yourdomain.com (zone apex, "@") -> A 192.0.2.10 (Primary)
# - Health check on 192.0.2.10. If fails, update DNS to point to 192.0.2.20.
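
The "if fails, update DNS" rule is usually softened so that a single missed probe does not trigger a region switch. A minimal sketch of that decision logic follows, assuming a threshold of three consecutive failures; the class name and the threshold are illustrative, and the IP addresses are the example values from the configuration above. The actual DNS update call is provider-specific and is left out.

```python
from dataclasses import dataclass


@dataclass
class FailoverDecider:
    """Tracks health-check results and decides which IP DNS should serve."""
    primary_ip: str
    secondary_ip: str
    failure_threshold: int = 3  # assumed value; tune to your probe interval
    consecutive_failures: int = 0
    failed_over: bool = False

    def observe(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the IP to publish."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # Latch the decision: failback is a planned, manual event.
                self.failed_over = True
        return self.secondary_ip if self.failed_over else self.primary_ip


if __name__ == "__main__":
    decider = FailoverDecider("192.0.2.10", "192.0.2.20")
    for healthy in [True, False, False, False, True]:
        print(healthy, "->", decider.observe(healthy))
```

Once the threshold is reached, the decider keeps returning the secondary address even if the primary recovers, which matches treating failback as a separate, scheduled procedure.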

Monitoring and Alerting

Comprehensive monitoring is essential for detecting failures and triggering alerts. Implement checks for:

Database replication lag (pg_stat_replication, pg_stat_wal_receiver).
Redis replication status.
Application server health checks (HTTP 200 OK on a dedicated endpoint).
Load balancer health checks.
Object storage sync status.
Network connectivity between regions.
Resource utilization (CPU, memory, disk I/O) in both regions.

Utilize OVHcloud’s monitoring tools, Prometheus/Grafana, or third-party solutions like Datadog or New Relic. Configure alerts to notify the operations team via email, Slack, or PagerDuty.
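
For the inter-region connectivity check, a plain TCP connection attempt from a host in one region to a service port in the other is often enough as a first signal. The sketch below uses only the Python standard library; the host, port, and timeout in the usage line are placeholders to replace with your own replication endpoints.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


if __name__ == "__main__":
    # Placeholder target: the PostgreSQL port of the primary as seen from Region B.
    print(tcp_reachable("127.0.0.1", 5432, timeout=1.0))
```

A successful connection only shows that the port is open; it says nothing about replication health, so it complements the lag and link-status checks rather than replacing them.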

Failover and Failback Procedures

Failover Steps (Manual Trigger)

Verify Outage: Confirm the primary region is indeed unavailable via monitoring dashboards and direct checks.
Stop Writes to Primary: If possible, prevent any further writes to the primary database to avoid split-brain scenarios.
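Promote Standby Database: Promote the PostgreSQL replica in Region B to become the new primary by running pg_ctl promote on the standby, or SELECT pg_promote(); on PG12+. Promotion ends recovery and moves the server onto a new timeline without a restart. Avoid promoting by deleting standby.signal (or recovery.conf on older versions) and restarting, since that bypasses the normal promotion path.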
Update Application Configuration: Reconfigure application servers in Region B to point to the newly promoted primary database.
Update DNS/Load Balancer: Redirect all traffic to Region B’s application servers. This is the most critical step for user-facing failover.
Verify Application Functionality: Thoroughly test the application in Region B to ensure all critical functions are working.
Promote Redis Replica (if applicable): If Redis was replicated, promote the replica in Region B to master with REPLICAOF NO ONE (SLAVEOF NO ONE before Redis 5) so that it accepts writes, and confirm that the application servers in Region B point to it.
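
Because the order of these steps matters, it can help to encode the runbook as an ordered list that halts at the first failure instead of pressing on with a half-promoted environment. The following is only a skeleton under that assumption: the step names mirror the list above, and each lambda is a placeholder for a real check or command that returns True on success.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first one that fails.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed: List[str] = []
    for name, action in steps:
        print(f"Running step: {name}")
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    # Placeholder actions: each lambda stands in for a real check or command.
    steps: List[Step] = [
        ("verify_outage", lambda: True),
        ("stop_writes_to_primary", lambda: True),
        ("promote_standby_database", lambda: True),
        ("update_application_configuration", lambda: True),
        ("update_dns_or_load_balancer", lambda: True),
        ("verify_application", lambda: True),
    ]
    completed, failed = run_failover(steps)
    print("completed:", completed)
    print("failed at:", failed)
```

Returning the list of completed steps gives the operator a precise starting point for manual recovery when a step fails partway through.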

Failback Steps (Planned Event)

Failback is generally performed during a maintenance window and involves reversing the failover process. It requires careful planning to minimize downtime.

Prepare Region A: Rebuild Region A’s PostgreSQL instance as a standby of Region B’s current primary, using a fresh pg_basebackup, or pg_rewind if the old data directory is intact (pg_rewind requires wal_log_hints or data checksums).
Synchronize Data: Wait until Region A’s database has fully caught up with Region B’s current primary.
Stop Writes in Region B: At the start of the maintenance window, stop application writes to Region B and confirm that replication lag has reached zero.
Promote Region A Database: Promote Region A’s PostgreSQL instance to primary and reconfigure application servers in Region A to point to their local database.
Redirect Traffic: Shift traffic back to Region A using DNS or load balancer updates.
Rebuild Region B as Standby: PostgreSQL cannot demote a primary in place, so re-create Region B’s database as a replica of Region A once Region A is handling all traffic.
Verify Functionality: Test thoroughly in Region A.

Security Considerations

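Ensure secure network configurations (private networks, firewalls) are in place in both regions, and restrict the PostgreSQL and Redis ports to replication peers and application subnets. Use strong credentials for replication users and access keys for object storage, and keep them out of world-readable configuration files. Encrypt sensitive data in transit, including replication traffic that crosses regions, and at rest.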

Conclusion

Implementing a multi-region DR strategy for a Shopify architecture on OVHcloud requires careful planning and execution across database, application, cache, and asset layers. By leveraging replication technologies and robust automation, you can significantly reduce downtime and data loss in the face of regional disasters.