Disaster Recovery 101: Architecting Auto-Failovers for DynamoDB and PHP Deployments on Linode
Establishing Multi-Region DynamoDB Replication
Automated failover for critical applications hinges on resilient data stores. For DynamoDB, this means using its built-in global tables feature. This isn’t merely about backups; it’s about active-active replication across distinct AWS regions, where every replica accepts both reads and writes. Replication between regions is asynchronous: changes typically reach the other replicas within about a second, and concurrent writes to the same item are reconciled on a last-writer-wins basis. This provides the foundation for a failover strategy. The setup is managed via the AWS CLI or SDKs. We’ll focus on the CLI because it is easy to script.
First, ensure your DynamoDB table exists in your primary region. Let’s assume a table named user_profiles with a partition key user_id.
Creating the Global Table
To create a global table, you first need to enable DynamoDB Streams on your existing table with the NEW_AND_OLD_IMAGES view type. This stream captures item-level modifications and is what replication is built on. Then you add a replica in each additional region you want to replicate to. For this example, we’ll replicate from us-east-1 to eu-west-1.
Step 1: Enable DynamoDB Streams
aws dynamodb update-table --table-name user_profiles --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES --region us-east-1
Step 2: Create the Global Table Replica in a New Region
aws dynamodb update-table --table-name user_profiles --replica-updates '[{"Create": {"RegionName": "eu-west-1"}}]' --region us-east-1
This command is issued against the table in the *primary* region (--region us-east-1) and asks DynamoDB to create a replica in eu-west-1. It uses the current global tables version (2019.11.21), in which replicas are added through update-table rather than a separate global-table command. DynamoDB then provisions the replica table in eu-west-1 and begins replicating data. You can monitor progress with aws dynamodb describe-table --table-name user_profiles --region us-east-1 and wait for the replica’s ReplicaStatus to become ACTIVE.
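If you script the setup, you will probably want to wait for the replica before sending traffic to it. The following is a minimal sketch, assuming a current-version global table whose DescribeTable output lists replicas under Table.Replicas with a ReplicaStatus field. The helper names replica_statuses and replicas_ready are hypothetical, and the sample dictionary is hand-written rather than real output.

```python
def replica_statuses(describe_table_response):
    """Map region name -> replica status from a DescribeTable response dict."""
    replicas = describe_table_response.get("Table", {}).get("Replicas", [])
    return {r["RegionName"]: r.get("ReplicaStatus", "UNKNOWN") for r in replicas}


def replicas_ready(describe_table_response, required_regions):
    """Return True only if every required region has a replica in ACTIVE state."""
    statuses = replica_statuses(describe_table_response)
    return all(statuses.get(region) == "ACTIVE" for region in required_regions)


# Hand-written sample shaped like `aws dynamodb describe-table` JSON output.
sample = {
    "Table": {
        "TableName": "user_profiles",
        "Replicas": [{"RegionName": "eu-west-1", "ReplicaStatus": "CREATING"}],
    }
}
print(replicas_ready(sample, ["eu-west-1"]))  # False
```

In a real script you would feed it the parsed JSON from describe-table (or the equivalent SDK call) in a polling loop, and proceed once it returns True.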
Architecting PHP Application Failover on Linode
For our PHP application deployed on Linode, we’ll employ a multi-region strategy. This involves deploying identical application stacks in at least two Linode regions. The core of the failover mechanism will be a DNS-based approach, leveraging Linode’s DNS Manager and potentially a health check service.
Infrastructure Setup
Assume we have two identical Linode instances, one in us-east (e.g., Newark) and another in eu-central (e.g., Frankfurt). Each instance runs a standard LAMP/LEMP stack, with PHP connecting to its *nearest* DynamoDB replica: us-east-1 for Newark and eu-west-1 for Frankfurt. Connecting to the nearest replica minimizes latency during normal operation.
Application Configuration:
Your PHP application’s database configuration must be dynamic. Instead of hardcoding endpoint URLs, use environment variables or a configuration file that can be updated during a failover event. For DynamoDB, the endpoint is region-specific. The AWS SDK for PHP handles this automatically if the region is correctly configured.
<?php
// config/database.php
return [
    'dynamodb' => [
        'region' => getenv('AWS_REGION') ?: 'us-east-1', // Default to primary region
        'version' => 'latest',
        'credentials' => [
            'key' => getenv('AWS_ACCESS_KEY_ID'),
            'secret' => getenv('AWS_SECRET_ACCESS_KEY'),
        ],
    ],
];
?>
The application server’s environment variables (AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) dictate which DynamoDB endpoint it connects to. In normal operation each stack keeps AWS_REGION set to its nearest replica, so a failover between Linode regions requires no database change. You would only change AWS_REGION on a server if its DynamoDB region itself became unavailable, pointing it at the surviving replica, and change it back when failing back.
DNS Failover Strategy
We’ll use Linode’s DNS Manager to manage the primary A record for our application (e.g., app.yourdomain.com). This record will initially point to the IP address of the Linode instance in the primary region (us-east).
Step 1: Configure DNS Records in Linode DNS Manager
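Create an A record for app.yourdomain.com pointing to the IP of your us-east Linode, and set its TTL as low as DNS Manager allows so that resolvers pick up a changed target quickly. Create a second A record for a health check subdomain, e.g., health.app.yourdomain.com, pointing to the IP of your eu-central Linode, which gives the standby a stable hostname of its own. This is a common pattern for active-passive DNS failover.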
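The same record can be created through the Linode API instead of the web UI. This is a rough sketch that only builds the request, assuming the v4 domain records endpoint and a 300-second TTL; the domain ID 12345, the token, and the helper names build_a_record and create_record_request are placeholders for illustration. It uses the standard library so it runs without extra packages.

```python
import json
import urllib.request

LINODE_API = "https://api.linode.com/v4"


def build_a_record(name, target_ip, ttl_sec=300):
    """Return the JSON body for an A record; name is relative to the zone."""
    return {"type": "A", "name": name, "target": target_ip, "ttl_sec": ttl_sec}


def create_record_request(domain_id, token, record):
    """Build (but do not send) the POST request that creates a domain record."""
    return urllib.request.Request(
        f"{LINODE_API}/domains/{domain_id}/records",
        data=json.dumps(record).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: 12345 and "example-token" are not real credentials.
req = create_record_request(12345, "example-token", build_a_record("app", "192.0.2.1"))
print(req.get_method(), req.full_url)
# Send it with urllib.request.urlopen(req) once real values are filled in.
```

Keeping the request construction separate from sending makes it easy to inspect the payload before touching a live zone.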
Step 2: Implement Health Checks
On each Linode instance, run a simple HTTP server that responds with a 200 OK status code if the application is healthy, and a non-200 status code (e.g., 503 Service Unavailable) if it’s unhealthy. This health check endpoint should verify connectivity to its local DynamoDB replica.
Example PHP health check script (/var/www/html/health.php):
<?php
// Assumes Composer's vendor/ directory sits beside config/, one level above the web root
require __DIR__ . '/../vendor/autoload.php';

use Aws\DynamoDb\DynamoDbClient;
use Aws\Exception\AwsException;

// Load configuration
$config = require __DIR__ . '/../config/database.php';
$dbConfig = $config['dynamodb'];

// Set region from environment variable or default
$region = getenv('AWS_REGION') ?: $dbConfig['region'];

try {
    $dynamoDb = new DynamoDbClient([
        'region' => $region,
        'version' => $dbConfig['version'],
        'credentials' => $dbConfig['credentials'],
    ]);

    // Attempt a simple DynamoDB operation to check connectivity,
    // here DescribeTable for the user_profiles table
    $dynamoDb->describeTable(['TableName' => 'user_profiles']);

    // If no exception was thrown, the connection is good
    http_response_code(200);
    echo "OK";
} catch (AwsException $e) {
    // Log the error for debugging
    error_log("DynamoDB Health Check Failed: " . $e->getMessage());
    http_response_code(503);
    echo "Service Unavailable";
} catch (Exception $e) {
    error_log("General Health Check Error: " . $e->getMessage());
    http_response_code(503);
    echo "Service Unavailable";
}
?>
Ensure your web server (Nginx/Apache) is configured to serve this script and that the AWS_REGION environment variable is correctly set for the PHP process on each server.
Automating DNS Updates
The crucial part is automating the DNS record update when a failure is detected. This can be achieved using a monitoring service or a custom script that periodically checks the health endpoints and updates DNS via the Linode API.
Step 1: Obtain Linode API Credentials
Generate an API token from your Linode Cloud Manager account with sufficient permissions to manage DNS records.
Step 2: Create a Monitoring Script (Python Example)
This script will run on a separate, highly available monitoring server (or even a scheduled cron job on one of the Linode instances, though less ideal for true disaster recovery). It checks the health of both regions and updates the DNS A record accordingly.
import requests
import os
import json
import time
# --- Configuration ---
LINODE_API_TOKEN = os.environ.get("LINODE_API_TOKEN")
PRIMARY_REGION_IP = "YOUR_PRIMARY_LINODE_IP" # e.g., 192.0.2.1
SECONDARY_REGION_IP = "YOUR_SECONDARY_LINODE_IP" # e.g., 198.51.100.1
PRIMARY_HEALTH_URL = f"http://{PRIMARY_REGION_IP}/health.php"
SECONDARY_HEALTH_URL = f"http://{SECONDARY_REGION_IP}/health.php"
DOMAIN_NAME = "app.yourdomain.com"
LINODE_ZONE_ID = "YOUR_LINODE_DNS_ZONE_ID" # Found in Linode DNS Manager URL or via API
RECORD_ID = None # Numeric ID of the A record for app.yourdomain.com; leave as None to look it up at startup
CHECK_INTERVAL_SECONDS = 60
REQUEST_TIMEOUT = 5
# --- End Configuration ---
HEADERS = {
    "Authorization": f"Bearer {LINODE_API_TOKEN}",
    "Content-Type": "application/json"
}
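def get_dns_record_id(domain, zone_id):
    """Fetches the ID of the A record for the given domain."""
    url = f"https://api.linode.com/v4/domains/{zone_id}/records"
    try:
        response = requests.get(url, headers=HEADERS, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
        data = response.json()
        for record in data.get("data", []):
            name = record.get("name") or ""
            # Record names are usually relative to the zone ("app"), so accept
            # either the full name or a leading label match. Verify the result
            # if your zone has several similarly named records.
            matches = name == domain or (name and domain.startswith(name + "."))
            if record.get("type") == "A" and matches:
                return record.get("id")
        print(f"Error: A record for {domain} not found in zone {zone_id}.")
        return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching DNS records: {e}")
        return None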
def update_dns_record(zone_id, record_id, target_ip):
    """Updates a DNS A record with a new IP address."""
    url = f"https://api.linode.com/v4/domains/{zone_id}/records/{record_id}"
    payload = {
        "target": target_ip
    }
    try:
        response = requests.put(url, headers=HEADERS, data=json.dumps(payload),
                                timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
        print(f"Successfully updated DNS record {record_id} to {target_ip}")
        return True
    except requests.exceptions.RequestException as e:
        print(f"Error updating DNS record {record_id}: {e}")
        return False
def check_health(url):
    """Checks the health endpoint of a given URL."""
    try:
        response = requests.get(url, timeout=REQUEST_TIMEOUT)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False
def main():
    global RECORD_ID
    if not LINODE_API_TOKEN:
        print("Error: LINODE_API_TOKEN environment variable not set.")
        return
    if not LINODE_ZONE_ID:
        print("Error: LINODE_ZONE_ID not configured.")
        return
    # Dynamically fetch RECORD_ID if not hardcoded
    if not RECORD_ID:
        RECORD_ID = get_dns_record_id(DOMAIN_NAME, LINODE_ZONE_ID)
        if not RECORD_ID:
            return  # Error message already printed by get_dns_record_id
    print(f"Starting health checks. Interval: {CHECK_INTERVAL_SECONDS}s")
    record_url = f"https://api.linode.com/v4/domains/{LINODE_ZONE_ID}/records/{RECORD_ID}"
    while True:
        primary_healthy = check_health(PRIMARY_HEALTH_URL)
        secondary_healthy = check_health(SECONDARY_HEALTH_URL)
        current_target_ip = None
        try:
            # Fetch current DNS record to determine current state.
            # A single-record GET returns the record object itself, not a "data" wrapper.
            response = requests.get(record_url, headers=HEADERS, timeout=REQUEST_TIMEOUT)
            response.raise_for_status()
            current_target_ip = response.json().get("target")
        except requests.exceptions.RequestException as e:
            print(f"Could not fetch current DNS record: {e}")
        if current_target_ip is None:
            # Without a known current state, an update could flap DNS needlessly
            print("Current DNS target unknown. Skipping update this cycle.")
        elif primary_healthy and current_target_ip != PRIMARY_REGION_IP:
            print("Primary region is healthy, but DNS points elsewhere. Failing back to primary.")
            update_dns_record(LINODE_ZONE_ID, RECORD_ID, PRIMARY_REGION_IP)
        elif not primary_healthy and secondary_healthy and current_target_ip != SECONDARY_REGION_IP:
            print("Primary region is unhealthy, secondary is healthy. Failing over to secondary.")
            update_dns_record(LINODE_ZONE_ID, RECORD_ID, SECONDARY_REGION_IP)
        elif not primary_healthy and not secondary_healthy:
            print("Both regions are unhealthy. No DNS change made.")
        else:
            print("System is stable. No changes needed.")
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
Deployment:
- Install Python and the requests library on your monitoring server: pip install requests.
- Set the LINODE_API_TOKEN environment variable.
- Fill in the configuration variables (IP addresses, domain, zone ID, record ID). You can find the LINODE_ZONE_ID in the URL when you view your domain in Linode DNS Manager (e.g., /dns/manage/12345, where 12345 is the ID). The RECORD_ID can be found by inspecting the network requests in your browser’s developer tools when viewing the DNS records, or by using the get_dns_record_id function.
- Run the script: python your_monitor_script.py. For production, run it using a process manager like systemd or supervisor.
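One thing the monitoring loop does not do is guard against a single failed probe: one timeout is enough to move DNS. A common refinement is to require several consecutive failures first. The sketch below is one possible way to do that, not part of the script above; HealthTracker, choose_target, and the threshold of 3 are all illustrative choices.

```python
class HealthTracker:
    """Counts consecutive failed checks so one bad probe does not trigger failover."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, healthy):
        """Record one probe result and return whether the target now counts as down."""
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1
        return self.is_down()

    def is_down(self):
        return self.consecutive_failures >= self.failure_threshold


def choose_target(primary_down, secondary_down, current_ip, primary_ip, secondary_ip):
    """Return the IP the A record should point to, or None to leave DNS alone."""
    if not primary_down:
        desired = primary_ip
    elif not secondary_down:
        desired = secondary_ip
    else:
        return None  # both down: changing DNS would not help
    return None if desired == current_ip else desired


tracker = HealthTracker(failure_threshold=3)
results = [tracker.record(ok) for ok in (False, False, True, False, False, False)]
print(results)  # [False, False, False, False, False, True]
```

With one tracker per region, the loop would pass each tracker’s result to choose_target and call update_dns_record only when it returns an IP. The trade-off is slower detection: with a 60-second interval and a threshold of 3, failover starts roughly three minutes after the first failed check.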
Application-Level Failover Considerations
While DNS failover handles traffic redirection, your PHP application needs to be aware of its operational region. If the application relies on region-specific services (e.g., S3 buckets, SQS queues), its configuration must be updated to reflect the new active region. This can be achieved by:
- Updating environment variables on the newly active Linode instance (e.g., AWS_REGION). This can be done via SSH commands executed by the monitoring script after DNS update, or through a configuration management tool like Ansible.
- Restarting the PHP-FPM service or web server to pick up the new environment variables.
Example of updating environment variables and restarting PHP-FPM via SSH (to be added to the Python script):
import paramiko

def update_remote_env_and_restart(hostname, username, password, region_var_value):
    client = paramiko.SSHClient()
    # AutoAddPolicy trusts unknown host keys; pin known hosts in production
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(hostname, username=username, password=password)
        # Update the environment variable in a custom env file. This is a simplified
        # example; a robust solution might update a systemd service file instead.
        # Any existing AWS_REGION line is removed first so repeated failovers
        # do not leave conflicting entries behind.
        env_file_path = "/opt/your_app/.env"
        update_command = (
            f"touch {env_file_path} && sed -i '/^AWS_REGION=/d' {env_file_path} "
            f"&& echo 'AWS_REGION={region_var_value}' >> {env_file_path}"
        )
        stdin, stdout, stderr = client.exec_command(update_command)
        print(f"STDOUT: {stdout.read().decode()}")
        print(f"STDERR: {stderr.read().decode()}")
        # Restart PHP-FPM (adjust service name if necessary)
        restart_command = "sudo systemctl restart php8.1-fpm"  # Example for PHP 8.1
        stdin, stdout, stderr = client.exec_command(restart_command)
        print(f"STDOUT: {stdout.read().decode()}")
        print(f"STDERR: {stderr.read().decode()}")
        print(f"Successfully updated environment and restarted PHP-FPM on {hostname}")
        return True
    except Exception as e:
        print(f"Error connecting to {hostname} or executing commands: {e}")
        return False
    finally:
        client.close()
# In the main loop, after updating DNS (PRIMARY_LINODE_HOSTNAME and
# SECONDARY_LINODE_HOSTNAME are placeholders you would define yourself):
# if update_dns_record(...):
#     if current_target_ip == SECONDARY_REGION_IP:  # Failing back to primary
#         update_remote_env_and_restart(PRIMARY_LINODE_HOSTNAME, 'root', 'YOUR_SSH_PASSWORD', 'us-east-1')
#     else:  # Failing over to secondary; eu-west-1 is where the DynamoDB replica lives
#         update_remote_env_and_restart(SECONDARY_LINODE_HOSTNAME, 'root', 'YOUR_SSH_PASSWORD', 'eu-west-1')
Note: Storing SSH passwords directly in scripts is insecure. Use SSH keys for authentication and consider a secrets management solution.
Testing and Validation
Thorough testing is paramount. Simulate failures by:
- Stopping the web server or PHP-FPM on the primary Linode instance.
- Simulating network partitions.
- Manually triggering the health check script to return an error.
Monitor the DNS propagation time and verify that traffic is correctly routed to the secondary region. Check application logs on both regions to ensure data consistency and proper operation. Perform a failback test to ensure the primary region can resume its role seamlessly.
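To put a number on propagation time during a drill, you can poll name resolution until the new address appears. This is a small sketch under stated assumptions: wait_for_dns is a hypothetical helper, and the demo uses a fake resolver and a fake clock so that it runs without touching the network. In a real test you would call it with the defaults.

```python
import socket
import time


def wait_for_dns(hostname, expected_ip, timeout=600, interval=5,
                 resolve=socket.gethostbyname, sleep=time.sleep, clock=time.monotonic):
    """Poll until hostname resolves to expected_ip; return elapsed seconds or None."""
    start = clock()
    while clock() - start <= timeout:
        try:
            if resolve(hostname) == expected_ip:
                return clock() - start
        except OSError:
            pass  # resolution errors count as "not switched yet"
        sleep(interval)
    return None


# Demo with a fake resolver and clock: the answer flips on the third lookup.
answers = iter(["192.0.2.1", "192.0.2.1", "198.51.100.1"])
fake_now = [0.0]
elapsed = wait_for_dns(
    "app.yourdomain.com", "198.51.100.1",
    resolve=lambda host: next(answers),
    sleep=lambda seconds: fake_now.__setitem__(0, fake_now[0] + seconds),
    clock=lambda: fake_now[0],
)
print(elapsed)  # 10.0
```

Note that this measures propagation only as seen by the resolver of the machine running it, so it is worth running from more than one network.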