Disaster Recovery 101: Architecting Auto-Failovers for DynamoDB and C Deployments on Google Cloud
Multi-Region DynamoDB Architecture for High Availability
Achieving true disaster recovery for critical applications necessitates a robust data strategy. For DynamoDB, this means leveraging its Global Tables feature. Global Tables provide a fully managed, multi-region, multi-active database solution. Writes to any region are replicated automatically to all other regions, with eventual consistency. This is the foundational layer for our auto-failover strategy.
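As a rough illustration, adding a replica region to an existing table can be sketched with the `UpdateTable` API and its `ReplicaUpdates` parameter. The helper `add_replica`, the stand-in `RecordingClient`, the table name, and the region are hypothetical; `client` is assumed to behave like `boto3.client("dynamodb")` for the table's home region, and the table is assumed to already meet the Global Tables prerequisites.

```python
def add_replica(client, table_name, region):
    """Ask DynamoDB to create one replica of table_name in region.

    One replica is requested per call; the caller should wait for the table
    to return to ACTIVE before requesting another.
    """
    return client.update_table(
        TableName=table_name,
        ReplicaUpdates=[{"Create": {"RegionName": region}}],
    )


class RecordingClient:
    """Stand-in that records the request instead of calling AWS."""

    def __init__(self):
        self.calls = []

    def update_table(self, **kwargs):
        self.calls.append(kwargs)
        return {"TableDescription": {"TableStatus": "UPDATING"}}


fake = RecordingClient()
add_replica(fake, "YourTableName", "us-west-2")
print(fake.calls[0])
```

In a real deployment the stand-in would be replaced by a boto3 DynamoDB client, and the call would be followed by polling the table status.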
The key to enabling auto-failover is ensuring that your application can switch its read and write endpoints to a healthy region without manual intervention. This takes two pieces: region-aware routing at the application level and a reliable health-checking mechanism.
Implementing Application-Level Region Routing
Your application’s data access layer must be aware of regional endpoints and health status. We can achieve this by maintaining a configuration that maps regions to their respective DynamoDB endpoints and a health status flag for each region. A common pattern is to use a centralized configuration store (like AWS Systems Manager Parameter Store or even a small, highly available Redis instance) that the application can query.
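One possible shape for that configuration is a small JSON document, parsed and validated before use. The sketch below assumes the same `endpoint` and `healthy` fields used throughout this article; the function names and the `eu-west-1` entry are hypothetical and only there for illustration.

```python
import json


def load_region_config(raw_json):
    """Parse the region map and reject entries missing required fields."""
    config = json.loads(raw_json)
    for region, entry in config.items():
        if "endpoint" not in entry or not isinstance(entry.get("healthy"), bool):
            raise ValueError(f"Malformed entry for region {region!r}")
    return config


def healthy_regions(config, preferred):
    """Return healthy regions, with the preferred (local) region first."""
    healthy = [region for region, entry in config.items() if entry["healthy"]]
    # sorted() is stable, so the remaining regions keep their config order
    return sorted(healthy, key=lambda region: region != preferred)


SAMPLE = json.dumps({
    "us-east-1": {"endpoint": "dynamodb.us-east-1.amazonaws.com", "healthy": False},
    "us-west-2": {"endpoint": "dynamodb.us-west-2.amazonaws.com", "healthy": True},
    "eu-west-1": {"endpoint": "dynamodb.eu-west-1.amazonaws.com", "healthy": True},
})

config = load_region_config(SAMPLE)
print(healthy_regions(config, preferred="eu-west-1"))  # ['eu-west-1', 'us-west-2']
```

Validating the document up front means a malformed update in the store fails loudly instead of silently removing a region from rotation.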
Consider a Python application using the Boto3 SDK. We can abstract the DynamoDB client creation and region selection logic:
```python
import logging
import os

import boto3
from botocore.exceptions import BotoCoreError, ClientError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# In a real-world scenario, this would be fetched from a config service
# or environment variables, and would include health status.
REGIONAL_ENDPOINTS = {
    "us-east-1": {"endpoint": "dynamodb.us-east-1.amazonaws.com", "healthy": True},
    "us-west-2": {"endpoint": "dynamodb.us-west-2.amazonaws.com", "healthy": True},
    # Add other regions as needed
}


class MultiRegionDynamoDBClient:
    def __init__(self, table_name):
        self.table_name = table_name
        self.clients = {}
        # Region this instance runs in, with a default for local runs
        self.current_region = os.environ.get("AWS_REGION", "us-east-1")
        self._initialize_clients()

    def _initialize_clients(self):
        # Only regions marked healthy at startup get a Table handle
        for region, config in REGIONAL_ENDPOINTS.items():
            if config["healthy"]:
                try:
                    session = boto3.Session(region_name=region)
                    self.clients[region] = session.resource(
                        "dynamodb", endpoint_url=f"https://{config['endpoint']}"
                    ).Table(self.table_name)
                    logger.info(f"Initialized DynamoDB client for region: {region}")
                except Exception as e:
                    logger.error(f"Failed to initialize client for {region}: {e}")
                    REGIONAL_ENDPOINTS[region]["healthy"] = False  # Mark as unhealthy

    def _get_healthy_client(self):
        # Prioritize current region if healthy
        if (self.current_region in self.clients
                and REGIONAL_ENDPOINTS.get(self.current_region, {}).get("healthy", False)):
            return self.clients[self.current_region]

        # Fallback to any other healthy region
        for region, client in self.clients.items():
            if REGIONAL_ENDPOINTS.get(region, {}).get("healthy", False):
                logger.warning(f"Switching to DynamoDB region: {region}")
                self.current_region = region  # Update current region for future calls
                return client

        raise ConnectionError("No healthy DynamoDB regions available.")

    def get_item(self, key):
        try:
            client = self._get_healthy_client()
            response = client.get_item(Key=key)
            return response.get("Item")
        except (ClientError, BotoCoreError) as e:
            # ClientError: DynamoDB returned an error response.
            # BotoCoreError: the request never completed (e.g. endpoint unreachable).
            logger.error(f"DynamoDB get_item failed: {e}")
            # Implement retry logic or region health update here
            raise
        except ConnectionError as e:
            logger.error(f"DynamoDB connection error: {e}")
            # Trigger failover process
            raise

    def put_item(self, item):
        try:
            client = self._get_healthy_client()
            response = client.put_item(Item=item)
            return response
        except (ClientError, BotoCoreError) as e:
            logger.error(f"DynamoDB put_item failed: {e}")
            # Implement retry logic or region health update here
            raise
        except ConnectionError as e:
            logger.error(f"DynamoDB connection error: {e}")
            # Trigger failover process
            raise

    # Add other DynamoDB operations (scan, query, delete, etc.) similarly
```
As written, `REGIONAL_ENDPOINTS` is a module-level constant; in practice the application would periodically reload it from the configuration store, which a separate health-checking process keeps up to date. The `_get_healthy_client` method prefers the application’s current region and falls back to any other healthy region. If no healthy region remains, it raises a `ConnectionError`, which the application’s health check endpoint can surface as a failed check for the failover orchestrator to act on.
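One way the application could pick up health changes without restarting is a small time-based cache around whatever function fetches the region map. This is a minimal sketch under stated assumptions: `RegionConfigCache`, the `loader` callable, and the 15-second default are hypothetical and not part of any SDK.

```python
import time


class RegionConfigCache:
    """Cache a region map and re-fetch it once it is older than ttl_seconds."""

    def __init__(self, loader, ttl_seconds=15.0, clock=time.monotonic):
        self._loader = loader          # callable returning the region map
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so the cache can be tested
        self._config = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if stale:
            try:
                self._config = self._loader()
                self._fetched_at = now
            except Exception:
                # Keep serving the last known map if the store is unreachable;
                # only fail when there is nothing cached at all.
                if self._config is None:
                    raise
        return self._config
```

With something like this in place, `_get_healthy_client` could read `cache.get()` instead of the module-level dictionary. Serving the last known map during a config store outage is a deliberate choice here: it avoids turning a configuration outage into a database outage.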
Automated Health Checking and Failover Orchestration
A dedicated service is required to continuously monitor the health of each DynamoDB region and, by extension, the health of the application instances deployed in those regions. This orchestrator will update the `REGIONAL_ENDPOINTS` configuration and, if necessary, trigger a global traffic shift.
We can leverage Google Cloud’s native services for this. Cloud Monitoring can be configured to probe application endpoints. When a probe fails consistently for a specific region, Cloud Monitoring can trigger a Pub/Sub notification. A Cloud Function subscribed to this topic will then act as our failover orchestrator.
Cloud Monitoring Probe Configuration
Define an Uptime Check in Cloud Monitoring that targets the health check endpoint exposed by your application in each region. This endpoint should perform a basic read operation against DynamoDB and return a 200 OK if successful, or a non-2xx status code if it fails.

Configure the Uptime Check to alert on consecutive failures. The alert policy should then publish a message to a designated Pub/Sub topic (e.g., `dynamodb-failover-alerts`).
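To make the idea of alerting on consecutive failures concrete, here is a toy model of the debounce logic. It is not how Cloud Monitoring evaluates alert policies internally; the class name and the threshold of three are assumptions chosen for illustration.

```python
class ConsecutiveFailureDetector:
    """Fire once when a region reaches `threshold` failed probes in a row."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self._failures = {}  # region -> current run of failed probes

    def record(self, region, success):
        """Record one probe result; return True when an alert should fire."""
        if success:
            self._failures[region] = 0
            return False
        self._failures[region] = self._failures.get(region, 0) + 1
        # Equality (not >=) so the alert fires once per outage, not per probe
        return self._failures[region] == self.threshold


detector = ConsecutiveFailureDetector(threshold=3)
results = [detector.record("us-east-1", ok) for ok in (False, False, True, False, False, False)]
print(results)  # [False, False, False, False, False, True]
```

The single success in the middle resets the count, which is what keeps one dropped probe from triggering a regional failover.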
Cloud Function Failover Orchestrator
This Cloud Function will receive messages from the Pub/Sub topic. Upon receiving an alert, it will:
- Identify the region that has become unhealthy based on the alert payload.
- Update the `REGIONAL_ENDPOINTS` configuration (in this example, a JSON file in Cloud Storage) to mark that region as unhealthy.
- If this is the primary region, initiate a global traffic shift.
```python
import base64
import json
import logging
import os

import google.auth
from google.api_core import exceptions
from google.auth.exceptions import DefaultCredentialsError
from google.cloud import storage

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize Google Cloud credentials
try:
    credentials, project_id = google.auth.default()
except DefaultCredentialsError:
    logger.error(
        "Could not automatically determine credentials. Ensure you are running "
        "in a GCP environment or have set GOOGLE_APPLICATION_CREDENTIALS."
    )
    raise

# Configuration
HEALTH_CHECK_CONFIG_BUCKET = os.environ.get("HEALTH_CHECK_CONFIG_BUCKET", "your-config-bucket-name")
HEALTH_CHECK_CONFIG_FILE = "dynamodb_regions.json"
# The current primary region; traffic fails over AWAY from it when it is unhealthy
PRIMARY_REGION = os.environ.get("PRIMARY_REGION", "us-east-1")


def update_region_health(region_to_update, is_healthy):
    """Updates the health status of a region in GCS."""
    storage_client = storage.Client(project=project_id, credentials=credentials)
    bucket = storage_client.bucket(HEALTH_CHECK_CONFIG_BUCKET)
    blob = bucket.blob(HEALTH_CHECK_CONFIG_FILE)

    try:
        config_data = json.loads(blob.download_as_text())
    except exceptions.NotFound:
        logger.warning(f"Config file {HEALTH_CHECK_CONFIG_FILE} not found. Creating new.")
        config_data = {}
    except Exception as e:
        logger.error(f"Error downloading or parsing config file: {e}")
        raise

    if region_to_update not in config_data:
        logger.warning(f"Region {region_to_update} not found in config. Adding it.")
        # Assume the default endpoint if not present, with the requested health status
        config_data[region_to_update] = {
            "endpoint": f"dynamodb.{region_to_update}.amazonaws.com",
            "healthy": is_healthy,
        }
    else:
        config_data[region_to_update]["healthy"] = is_healthy

    try:
        blob.upload_from_string(json.dumps(config_data, indent=2), content_type="application/json")
        logger.info(f"Updated health for {region_to_update} to {is_healthy}. New config: {config_data}")
        return config_data
    except Exception as e:
        logger.error(f"Error uploading updated config: {e}")
        raise


def trigger_global_traffic_shift(new_primary_region):
    """
    Initiates a global traffic shift. This is highly dependent on your
    load balancing and DNS strategy. Example uses Google Cloud Load Balancing.
    """
    logger.info(f"Initiating traffic shift to new primary region: {new_primary_region}")
    # This is a placeholder. Actual implementation depends on your GCP setup, e.g.:
    # - updating a Global External HTTP(S) Load Balancer's backend service to
    #   point to the new region's instance group or NEG;
    # - if the app runs on Cloud Run with one service per region, changing which
    #   regional services sit behind the load balancer;
    # - updating a global DNS record.
    # Each of these means calling the corresponding GCP API with your own
    # resource names, so no client call is made here.
    logger.info(f"Traffic shift to {new_primary_region} initiated (conceptual).")


def failover_handler(event, context):
    """
    Pub/Sub message handler for DynamoDB failover alerts.
    """
    logger.info(f"Received Pub/Sub message: {event}")
    try:
        pubsub_message = base64.b64decode(event['data']).decode('utf-8')
        alert_data = json.loads(pubsub_message)
        logger.info(f"Alert data: {alert_data}")

        # Cloud Monitoring notifications nest their details under "incident";
        # fall back to the top level for simplified or hand-built payloads.
        incident = alert_data.get("incident", alert_data)

        # A closed incident signals recovery, not failure, so do not fail over on it
        if incident.get("state") == "closed":
            logger.info("Incident closed. No failover action taken.")
            return

        # Extract region from alert data (this structure depends on your alert
        # configuration). This example assumes a "region" resource label.
        unhealthy_region = incident.get("resource", {}).get("labels", {}).get("region")
        if not unhealthy_region:
            logger.error("Could not determine unhealthy region from alert data.")
            return

        # Update the health status of the unhealthy region
        new_config = update_region_health(unhealthy_region, False)

        # Check if the primary region has become unhealthy
        if unhealthy_region == PRIMARY_REGION:
            logger.warning(f"Primary region {PRIMARY_REGION} is unhealthy. Initiating failover.")
            # Find a new healthy region to failover to
            available_regions = [
                r for r, cfg in new_config.items()
                if cfg.get("healthy", False) and r != PRIMARY_REGION
            ]
            if not available_regions:
                logger.critical("No healthy regions available for failover. System is in critical state.")
                # Implement critical alert mechanism here
                return

            # Simple strategy: pick the first available healthy region
            new_primary = available_regions[0]
            logger.info(f"Failing over to new primary region: {new_primary}")
            # Update the global configuration to reflect the new primary.
            # This might involve updating another parameter or triggering a DNS update.
            # For now, we just log it.
            logger.info(f"New primary region set to: {new_primary}")

            # Trigger the actual traffic shift
            trigger_global_traffic_shift(new_primary)
        else:
            logger.info(f"Non-primary region {unhealthy_region} is unhealthy. No global failover needed.")
    except Exception as e:
        logger.error(f"Error processing Pub/Sub message: {e}")
        # Consider dead-lettering the message or retrying
    logger.info("Failover handler finished.")
```
The `update_region_health` function interacts with Google Cloud Storage (GCS) to maintain a JSON file containing the health status of each region. This file serves as the central source of truth for the application's regional clients. The `trigger_global_traffic_shift` function is a placeholder; its implementation is highly dependent on your specific GCP networking setup (e.g., Global External HTTP(S) Load Balancer, Cloud DNS, or regional load balancers with traffic management). The core idea is to reroute traffic away from the unhealthy region.
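The handler above simply takes the first healthy region it finds, which depends on the order of keys in the JSON file. A slightly more deliberate selection could use an explicit priority list, sketched below. The function name and the `FAILOVER_PRIORITY` list are hypothetical; the config shape matches the JSON file described above.

```python
def choose_failover_target(config, failed_region, priority):
    """Return the first healthy region in priority order, or None if none is left."""
    for region in priority:
        if region == failed_region:
            continue
        if config.get(region, {}).get("healthy", False):
            return region
    return None


# Hypothetical ordering: prefer the region closest to most users first
FAILOVER_PRIORITY = ["us-east-1", "us-west-2", "eu-west-1"]

sample_config = {
    "us-east-1": {"healthy": False},
    "us-west-2": {"healthy": False},
    "eu-west-1": {"healthy": True},
}
print(choose_failover_target(sample_config, "us-east-1", FAILOVER_PRIORITY))  # eu-west-1
```

Returning `None` rather than raising leaves it to the caller to decide how to escalate when no region is available.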
Application Deployment and Health Checks
Your application instances must be deployed in a multi-region active-active or active-passive configuration. For active-active, you'll have instances running in multiple regions, each configured to use the DynamoDB endpoint for its respective region. For active-passive, instances run in a primary region, with a standby deployment in a secondary region ready to be activated.
Each application instance must expose a health check endpoint (e.g., `/healthz`). This endpoint should:
- Perform a quick, low-impact read operation against the local DynamoDB endpoint.
- Return HTTP 200 OK if the read is successful.
- Return HTTP 5xx if the read fails or times out.
```python
# Example Flask health check endpoint
import os

from flask import Flask, jsonify

# Assumes the MultiRegionDynamoDBClient class shown earlier is importable;
# the module name `multi_region_client` is a placeholder.
from multi_region_client import MultiRegionDynamoDBClient

app = Flask(__name__)

# Create the client at import time so it also exists under a WSGI server
dynamodb_client = MultiRegionDynamoDBClient(table_name="YourTableName")


@app.route('/healthz', methods=['GET'])
def health_check():
    try:
        # Perform a simple, low-impact read operation.
        # Replace the key below with your table's actual key structure.
        # This should ideally target a frequently accessed but not critical item
        # or a dedicated health check item.
        key_to_check = {"id": "health_check_item"}  # Example key
        item = dynamodb_client.get_item(key=key_to_check)
        if item is not None:  # Or check for specific expected content
            return jsonify({"status": "ok", "region": os.environ.get("AWS_REGION")}), 200
        else:
            # Item not found might be a transient issue or a sign of a deeper problem.
            # For a health check, we treat it as an error because it is unexpected.
            return jsonify({"status": "error", "message": "Health check item not found"}), 503
    except ConnectionError as e:
        # This indicates a problem connecting to DynamoDB, likely a regional outage
        app.logger.error(f"DynamoDB connection error during health check: {e}")
        return jsonify({"status": "error", "message": "Cannot connect to DynamoDB"}), 503
    except Exception as e:
        app.logger.error(f"Unexpected error during health check: {e}")
        return jsonify({"status": "error", "message": "Internal server error"}), 500


if __name__ == '__main__':
    # In production, use a proper WSGI server like Gunicorn
    app.run(host='0.0.0.0', port=8080)
```
Cloud Monitoring Uptime Checks will poll this `/healthz` endpoint for each regional deployment. When an instance or an entire region consistently fails these checks, the alert policy triggers the Pub/Sub notification, initiating the failover process.
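Uptime checks are only useful if `/healthz` answers promptly even when the database call hangs. One way to enforce that is to run the read under a deadline and translate the outcome into a status code. The helper name `probe_with_deadline`, the pool size, and the one-second default are assumptions for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small shared pool; a timed-out read keeps running in its worker thread
# until it returns, so the pool bounds how many can pile up.
_executor = ThreadPoolExecutor(max_workers=4)


def probe_with_deadline(read_fn, timeout_seconds=1.0):
    """Run read_fn in a worker thread and map the outcome to an HTTP status."""
    future = _executor.submit(read_fn)
    try:
        future.result(timeout=timeout_seconds)
        return 200
    except FutureTimeout:
        return 503  # the read did not finish in time
    except Exception:
        return 503  # the read itself raised


print(probe_with_deadline(lambda: "ok"))                                   # 200
print(probe_with_deadline(lambda: time.sleep(0.5), timeout_seconds=0.05))  # 503
```

Inside the Flask handler, `read_fn` would wrap the DynamoDB read. Setting short connect and read timeouts on the SDK client itself is a complementary measure, since it also stops the underlying request.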
DNS and Load Balancing for Global Traffic Management
The final piece of the puzzle is how global traffic is directed to your application. This is typically managed by a global load balancer (such as Google Cloud's Global External HTTP(S) Load Balancer, now called the global external Application Load Balancer) or a managed DNS service with health-checking capabilities (such as Cloud DNS routing policies with health checks).
With a Global Load Balancer, you configure backend services that point to your application deployments in each region (for example, via Network Endpoint Groups, or NEGs). The load balancer itself performs health checks on these backends. When a backend becomes unhealthy, the load balancer automatically stops sending traffic to it. The Cloud Function orchestrator can then be used to dynamically update the load balancer's configuration if a more complex failover strategy is needed (e.g., shifting traffic entirely to a new primary region).
```shell
# Conceptual example of updating a Global Load Balancer backend service.
# Assumes 'gcloud' is configured and authenticated.
# Replace the placeholders with your actual resource names.

# Get current backend service details
gcloud compute backend-services describe YOUR_BACKEND_SERVICE_NAME --global

# Drain the unhealthy region by scaling its backend's capacity to zero.
# This example assumes a zonal NEG backend.
gcloud compute backend-services update-backend YOUR_BACKEND_SERVICE_NAME --global \
    --network-endpoint-group=YOUR_UNHEALTHY_REGION_NEG \
    --network-endpoint-group-zone=YOUR_UNHEALTHY_ZONE \
    --capacity-scaler=0.0

# A more direct approach: remove the unhealthy region's NEG from the
# backend service so that only healthy regions remain.
gcloud compute backend-services remove-backend YOUR_BACKEND_SERVICE_NAME --global \
    --network-endpoint-group=YOUR_UNHEALTHY_REGION_NEG \
    --network-endpoint-group-zone=YOUR_UNHEALTHY_ZONE
```
Alternatively, Cloud DNS routing policies with health checks let you create DNS records that point to the IP addresses of your regional deployments. Cloud DNS monitors the health of those targets and, when a region's IP becomes unhealthy, stops returning it in DNS responses, directing new lookups to healthy regions. Clients and resolvers that have cached the old record keep using it until its TTL expires, so keep TTLs short on these records.
Considerations and Advanced Scenarios
Data Consistency: DynamoDB Global Tables are eventually consistent across regions, and concurrent writes to the same item in different regions are reconciled with a last-writer-wins rule. During a failover, there might be a brief period where data written to the old primary region hasn't yet replicated to the new primary. Applications must be designed to tolerate this. If strong consistency is paramount, consider alternative data stores or application-level conflict resolution.
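Global Tables reconcile concurrent writes to the same item with a last-writer-wins rule, and a toy model makes the consequence visible. This is not DynamoDB's implementation: the `written_at` field is a hypothetical stand-in for the service's internal write timestamp, which applications do not set directly.

```python
def last_writer_wins(local, incoming):
    """Return whichever version of the item carries the later write timestamp."""
    return incoming if incoming["written_at"] > local["written_at"] else local


# Two regions update the same item within the replication window
east = {"id": "order-1", "status": "shipped", "written_at": 1700000000.120}
west = {"id": "order-1", "status": "cancelled", "written_at": 1700000000.450}

merged = last_writer_wins(east, west)
print(merged["status"])  # cancelled
```

The "shipped" update is silently discarded. If that outcome is unacceptable for a given item, the application needs its own conflict handling, for example by routing all writes for that item to a single region.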
Stateful Applications: If your application instances maintain local state, failover becomes more complex. You'll need mechanisms to synchronize or rehydrate this state in the new region, or ensure that state is managed externally (e.g., in DynamoDB itself, ElastiCache, etc.).
Failback Strategy: Define a clear process for failing back to the original primary region once it has recovered. This might involve manual intervention or an automated process triggered by the same orchestrator.
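An automated failback usually benefits from some hysteresis, so that a region flapping between healthy and unhealthy does not pull traffic back and forth. A minimal sketch, assuming the orchestrator feeds it periodic health observations; the class name and the five-minute window are hypothetical.

```python
class FailbackGate:
    """Allow failback only after a region has stayed healthy for a full window."""

    def __init__(self, stable_seconds=300.0):
        self.stable_seconds = stable_seconds
        self._healthy_since = None  # time of the first healthy observation in the current run

    def observe(self, healthy, now):
        """Record one health observation; return True once failback is allowed."""
        if not healthy:
            self._healthy_since = None  # any failure restarts the window
            return False
        if self._healthy_since is None:
            self._healthy_since = now
        return now - self._healthy_since >= self.stable_seconds


gate = FailbackGate(stable_seconds=300)
print(gate.observe(True, now=0))     # False
print(gate.observe(True, now=200))   # False
print(gate.observe(False, now=250))  # False
print(gate.observe(True, now=300))   # False
print(gate.observe(True, now=600))   # True
```

The failed observation at 250 seconds resets the window, so failback is only allowed once the region has been continuously healthy from 300 to 600 seconds.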
Testing: Rigorous testing of your failover mechanism is non-negotiable. Simulate region outages, network partitions, and other failure scenarios to validate that your auto-failover works as expected and within your Recovery Time Objective (RTO).
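Before running failover drills, it helps to have a back-of-the-envelope estimate to compare measured results against. The sketch below simply adds up the stages described in this article; the function name and all numbers are hypothetical inputs, not measurements.

```python
def estimate_failover_seconds(check_interval, failures_to_alert,
                              orchestration_delay, dns_ttl=0):
    """Sum detection, orchestration, and DNS cache time into a worst-case estimate."""
    # Detection: the alert needs this many failed probes in a row
    detection = check_interval * failures_to_alert
    return detection + orchestration_delay + dns_ttl


# Example: 60 s probes, 3 failures to alert, 30 s for the alert and function
# to run, and a 60 s DNS TTL.
worst_case = estimate_failover_seconds(60, 3, 30, dns_ttl=60)
print(worst_case)  # 270
```

If the estimate already exceeds the RTO, the levers are visible in the formula: a shorter probe interval, fewer required failures, or a shorter TTL.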
Cost: Multi-region deployments and services like Global Load Balancers incur additional costs. Factor these into your architecture planning.