Automating Multi-Region Redundancy for Laravel Architectures on OVH

Establishing Multi-Region Infrastructure with OVHcloud

Achieving true disaster recovery for a Laravel application necessitates a multi-region strategy. This isn’t merely about having a backup; it’s about maintaining active-passive or active-active availability across geographically distinct data centers. OVHcloud’s global network provides the foundational infrastructure for this. We’ll focus on a common scenario: deploying a primary region (e.g., GRA) and a secondary, standby region (e.g., RBX) for failover.

Database Replication Strategy: PostgreSQL with Streaming Replication

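For relational data, PostgreSQL’s built-in streaming replication is a robust and performant choice. We’ll configure a primary instance in GRA and a hot standby (read-only until promoted) in RBX. Replication is asynchronous by default, so the standby normally trails the primary by a small amount of WAL. A failover can therefore lose the most recent transactions unless synchronous replication is enabled, which in turn adds cross-region latency to every write.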

PostgreSQL Primary Configuration (GRA)

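On your primary PostgreSQL server in GRA, modify postgresql.conf and pg_hba.conf. The examples below assume that a role named replicator with the REPLICATION and LOGIN attributes already exists on the primary.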

postgresql.conf:

wal_level = replica
max_wal_senders = 5
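wal_keep_size = 1024 # In MB when no unit is given (1 GB here); adjust based on network latency and WAL generation rate
archive_mode = on
archive_command = 'cd .' # No-op placeholder that always succeeds, so no WAL is actually archived; replace it with a real command if you need point-in-time recovery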
listen_addresses = '*' # Or specific IPs for security
shared_buffers = 1GB # Example, tune based on server RAM
effective_cache_size = 3GB # Example
maintenance_work_mem = 256MB # Example
random_page_cost = 1.1 # Tune for SSDs
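
One way to pick a starting value for wal_keep_size is to multiply the WAL generation rate by the longest standby outage you want to survive without re-seeding. The following is a minimal sketch of that arithmetic; the function name, the safety factor, and the sample workload are assumptions for illustration, and the 16 MB figure is PostgreSQL’s default WAL segment size.

```python
import math


def estimate_wal_keep_size_mb(wal_rate_mb_per_min: float,
                              max_standby_outage_min: float,
                              safety_factor: float = 1.5) -> int:
    """Return a wal_keep_size value in MB, rounded up to whole 16 MB WAL segments."""
    if wal_rate_mb_per_min < 0 or max_standby_outage_min < 0:
        raise ValueError("rate and outage duration must be non-negative")
    needed_mb = wal_rate_mb_per_min * max_standby_outage_min * safety_factor
    segment_mb = 16  # default WAL segment size
    return math.ceil(needed_mb / segment_mb) * segment_mb


if __name__ == "__main__":
    # Hypothetical workload: 20 MB of WAL per minute, standby offline for up to 30 minutes.
    print(estimate_wal_keep_size_mb(20, 30))
```

Measure the real WAL rate on your own primary before relying on a number like this; write-heavy batch jobs can push it far above the daily average.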

pg_hba.conf (ensure the standby server’s IP is allowed for replication):

# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     replicator      <RBX_REPLICA_IP>/32     md5
host    all             all             0.0.0.0/0               md5 # Accepts any IPv4 source; restrict to your application subnets in production

Restart PostgreSQL after these changes:

sudo systemctl restart postgresql

PostgreSQL Standby Configuration (RBX)

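On the standby server in RBX, ensure PostgreSQL is installed but not running. From PostgreSQL 12 onward there is no separate recovery.conf: standby mode is enabled by a standby.signal file in the data directory plus connection settings in postgresql.auto.conf, and pg_basebackup writes both when run with -R. First, take a base backup from the primary.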

# On the RBX standby server
sudo systemctl stop postgresql

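# Ensure PGDATA is empty or backed up. The deletion runs as postgres via find because
# a shell glob (main/*) is expanded by the calling user, who usually cannot read PGDATA.
sudo -u postgres find /var/lib/postgresql/14/main -mindepth 1 -delete # Adjust path as per your PostgreSQL version and installation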

# Perform base backup (run this from the RBX server, connecting to GRA)
sudo -u postgres pg_basebackup -h <GRA_PRIMARY_IP> -U replicator -D /var/lib/postgresql/14/main -P -v -R

# With -R, pg_basebackup creates a standby.signal file and appends primary_conninfo
# to postgresql.auto.conf in the data directory, so no recovery.conf is needed (PostgreSQL 12+).
# listen_addresses on the standby controls which local interfaces accept connections;
# set it so the RBX application servers can reach this instance.
# Ensure shared_buffers, etc., are appropriately sized for the RBX server.

# Set correct ownership
sudo chown -R postgres:postgres /var/lib/postgresql/14/main

Start PostgreSQL on the standby:

sudo systemctl start postgresql

Monitor the logs on both servers to confirm replication is active. On the standby, you should see messages indicating it’s streaming WAL from the primary.
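
Beyond the logs, replication progress can be quantified by comparing WAL positions (LSNs), for example the result of SELECT pg_current_wal_lsn() on the primary and SELECT pg_last_wal_replay_lsn() on the standby. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash. The sketch below only converts two such strings into a byte difference; the helper names and sample values are illustrative, and fetching the values from the servers is left out.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the standby has not replayed yet (0 when fully caught up)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replay_lsn))


if __name__ == "__main__":
    # Hypothetical values copied from the two queries mentioned above.
    print(replication_lag_bytes("0/3000060", "0/3000000"))
```

A lag that keeps growing while the primary is under normal load is usually a sign that the link between regions or the standby’s disk cannot keep up.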

Laravel Application Deployment and Configuration

Your Laravel application needs to be deployed to both regions. A common pattern is to use a Git repository and a CI/CD pipeline (e.g., GitLab CI, GitHub Actions, Jenkins) to automate deployments to both GRA and RBX instances.

Environment Configuration for Multi-Region

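The key is to manage environment variables dynamically. For database connections, each region points to its *local* database instance. Until the RBX standby is promoted it is read-only, so the RBX application can serve reads but any write will fail. If a failover leaves an application node without a usable local database, its connection settings will need to be updated.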

// config/database.php (simplified)
'pgsql' => [
    'driver' => 'pgsql',
    'host' => env('DB_HOST', '127.0.0.1'), // This will point to the local PG instance
    'port' => env('DB_PORT', '5432'),
    'database' => env('DB_DATABASE', 'your_db'),
    'username' => env('DB_USERNAME', 'your_user'),
    'password' => env('DB_PASSWORD', 'your_password'),
    'charset' => 'utf8',
    'prefix' => '',
    'schema' => 'public',
    'sslmode' => 'prefer',
],

In your .env file for the GRA deployment:

DB_HOST=127.0.0.1
DB_PORT=5432
DB_DATABASE=your_db
DB_USERNAME=your_user
DB_PASSWORD=your_password

And for the RBX deployment (initially, this will also point to its local DB, which is a replica):

DB_HOST=127.0.0.1
DB_PORT=5432
DB_DATABASE=your_db
DB_USERNAME=your_user
DB_PASSWORD=your_password
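
If a deployment step or failover script has to repoint an application node at a different database host, the change comes down to rewriting one key in the .env file. Here is a minimal sketch of that rewrite as a pure function; the function name and the 10.0.0.5 address are hypothetical, and reading and writing the actual file is left to the caller.

```python
def set_env_value(env_text: str, key: str, value: str) -> str:
    """Return env_text with KEY=value replaced, or appended if KEY is absent."""
    lines = env_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        # Compare only the part before the first '=' so values containing '=' are safe.
        if line.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    original = "DB_HOST=127.0.0.1\nDB_PORT=5432\n"
    print(set_env_value(original, "DB_HOST", "10.0.0.5"), end="")
```

Keep in mind that if the configuration was cached with php artisan config:cache, Laravel will not see the new value until the cache is rebuilt.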

Load Balancing and Failover Orchestration

This is where the “automation” truly comes into play. We need a mechanism to detect failures in the primary region and redirect traffic to the secondary. OVHcloud’s Load Balancer service is a good candidate, but for true multi-region orchestration, external DNS-level or dedicated load balancing solutions are often preferred.

DNS-Based Failover with Health Checks

A common and effective strategy is to use a managed DNS service that supports health checks and automatic record updates. Services like AWS Route 53, Cloudflare DNS, or OVHcloud’s own DNS with advanced features can be leveraged.

The concept:

Configure a primary A record for your domain pointing to the load balancer or IP of your application in GRA.
Configure a secondary A record pointing to the load balancer or IP in RBX.
Set up health checks that send an HTTP request to a specific endpoint on your Laravel application (e.g., /health) in each region.
If the health check for the GRA endpoint fails, the DNS service automatically updates the primary A record to point to the RBX IP. Keep the record’s TTL short (for example, 60 seconds) so that resolvers pick up the change quickly.
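
How long clients keep reaching a dead primary under this scheme depends on the health-check settings and on the DNS TTL. The sketch below gives a rough upper bound; the parameter names are assumptions, and it ignores provider-side propagation delay and resolvers that do not honor TTLs.

```python
def worst_case_dns_failover_seconds(check_interval: int,
                                    failure_threshold: int,
                                    check_timeout: int,
                                    dns_ttl: int) -> int:
    """Rough upper bound on the time clients keep resolving to a failed primary."""
    # Detection: the required number of consecutive failed checks, plus the
    # time the last check may spend waiting before it times out.
    detection = check_interval * failure_threshold + check_timeout
    # Propagation: cached answers stay valid for up to one TTL after the update.
    return detection + dns_ttl


if __name__ == "__main__":
    # Hypothetical settings: checks every 60 s, 3 failures required, 5 s timeout, 300 s TTL.
    print(worst_case_dns_failover_seconds(60, 3, 5, 300))
```

With these sample numbers the TTL dominates the total, which is why lowering it is usually the cheapest way to shorten a DNS-based failover.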

Implementing a Health Check Endpoint in Laravel

Create a simple controller and route for health checks. This endpoint should ideally check critical dependencies like database connectivity.

// app/Http/Controllers/HealthCheckController.php
namespace App\Http\Controllers;

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

class HealthCheckController extends Controller
{
    public function show()
    {
        try {
            // Attempt to connect to the database
            DB::connection()->getPdo();
            $databaseStatus = 'OK';
        } catch (\Throwable $e) { // Throwable also covers PHP Error types, not only exceptions
            Log::error("Database connection failed for health check: " . $e->getMessage());
            $databaseStatus = 'ERROR';
        }

        // Add checks for other critical services if necessary (e.g., Redis, SQS)

        if ($databaseStatus === 'OK') {
            return response()->json(['status' => 'UP', 'database' => $databaseStatus], 200);
        } else {
            return response()->json(['status' => 'DOWN', 'database' => $databaseStatus], 503); // Service Unavailable
        }
    }
}

// routes/web.php
use App\Http\Controllers\HealthCheckController;

Route::get('/health', [HealthCheckController::class, 'show']);
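
A monitor that consumes this endpoint can be stricter than checking the status code alone, since a proxy or maintenance page may answer 200 with unrelated content. Below is a minimal sketch of such an interpretation, assuming the JSON shape returned by the controller above; the function name is hypothetical.

```python
import json


def region_is_up(status_code: int, body: str) -> bool:
    """True only for HTTP 200 with a JSON object whose 'status' field is 'UP'."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # e.g. an HTML error or maintenance page
    return isinstance(payload, dict) and payload.get("status") == "UP"


if __name__ == "__main__":
    print(region_is_up(200, '{"status": "UP", "database": "OK"}'))
    print(region_is_up(503, '{"status": "DOWN", "database": "ERROR"}'))
```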

Automating Failover with OVHcloud API and Scripting

While DNS-level failover is common, you might want more granular control or to integrate with OVHcloud’s Load Balancer API. This involves scripting the process of detecting a failure and reconfiguring the load balancer or DNS records.

Here’s a conceptual Python script using the OVHcloud SDK to update a load balancer’s frontend target. This assumes you have an OVHcloud Load Balancer already configured with two targets (GRA and RBX) and a health check.

import ovh
import time

# --- Configuration ---
GRA_TARGET_ID = "your_gra_target_id"  # ID of the GRA target in OVH LB
RBX_TARGET_ID = "your_rbx_target_id"  # ID of the RBX target in OVH LB
LB_ID = "your_loadbalancer_id"       # Your OVH Load Balancer ID
FRONTEND_ID = "your_frontend_id"     # The frontend ID to manage
HEALTH_CHECK_ENDPOINT = "/health"
PRIMARY_REGION_HEALTH_URL = "http://your-gra-app.com/health" # Public URL for GRA health check
SECONDARY_REGION_HEALTH_URL = "http://your-rbx-app.com/health" # Public URL for RBX health check
CHECK_INTERVAL = 60  # Seconds between checks
FAILOVER_THRESHOLD = 3 # Number of consecutive failures before failover

# --- OVH API Client Initialization ---
# Ensure you have OVH API credentials configured (e.g., via environment variables)
# export OVH_ENDPOINT='ovh-eu'
# export OVH_APPLICATION_KEY='...'
# export OVH_APPLICATION_SECRET='...'
# export OVH_CONSUMER_KEY='...'
client = ovh.Client()

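def get_target_status(target_id):
    """Retrieves the status of a specific load balancer target.

    Not called by the monitoring loop; kept as a starting point for querying the load balancer's own view.
    """
    try:
        # Illustrative path only: confirm the real route for your load balancer product in the OVHcloud API console.
        status = client.get(f"/cloud/loadBalancer/{LB_ID}/frontend/{FRONTEND_ID}/backend/target/{target_id}/status")
        return status
    except Exception as e:
        print(f"Error getting status for target {target_id}: {e}")
        return None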

def set_frontend_target(target_id):
    """Sets the active target for the frontend (simulated; no API call is made yet)."""
    try:
        print(f"Attempting to set frontend {FRONTEND_ID} to target {target_id}...")
        # Conceptual placeholder: the real call is usually a PUT on the frontend's configuration
        # that changes its default backend (or a similar field). The path and field name below
        # are illustrative, not taken from the OVHcloud API. Note that python-ovh sends keyword
        # arguments as the JSON body:
        # client.put(f"/cloud/loadBalancer/{LB_ID}/frontend/{FRONTEND_ID}", defaultBackend=target_id)
        # Until that call is filled in, this function only simulates a successful update.
        print(f"[simulated] Updated frontend {FRONTEND_ID} to target {target_id}.")
        return True
    except Exception as e:
        print(f"Error setting frontend target to {target_id}: {e}")
        return False

def check_health(url):
    """Performs a simple HTTP GET health check."""
    import requests
    try:
        response = requests.get(url, timeout=5)
        return response.status_code == 200
    except requests.exceptions.RequestException as e:
        print(f"Health check failed for {url}: {e}")
        return False

def monitor_and_failover():
    gra_failures = 0
    rbx_failures = 0
    current_active_target = None # Track the currently active target

    while True:
        print("Running health checks...")

        # Check GRA
        gra_healthy = check_health(PRIMARY_REGION_HEALTH_URL)
        if gra_healthy:
            gra_failures = 0
            print("GRA health check: OK")
        else:
            gra_failures += 1
            print(f"GRA health check: FAILED ({gra_failures}/{FAILOVER_THRESHOLD})")

        # Check RBX
        rbx_healthy = check_health(SECONDARY_REGION_HEALTH_URL)
        if rbx_healthy:
            rbx_failures = 0
            print("RBX health check: OK")
        else:
            rbx_failures += 1
            print(f"RBX health check: FAILED ({rbx_failures}/{FAILOVER_THRESHOLD})")

        # --- Failover Logic ---
        # If GRA is down and RBX is up, and we haven't failed over yet
        if gra_failures >= FAILOVER_THRESHOLD and rbx_healthy and current_active_target != RBX_TARGET_ID:
            print("GRA is unhealthy, attempting failover to RBX...")
            if set_frontend_target(RBX_TARGET_ID):
                current_active_target = RBX_TARGET_ID
                print("Failover to RBX successful.")
            else:
                print("Failover to RBX failed.")

        # --- Failback Logic ---
        # If GRA is healthy again, and RBX is the active target.
        # Caution: only fail back once the GRA database is a current copy of RBX again
        # (see the failback procedure); otherwise traffic returns to stale data.
        elif gra_healthy and current_active_target == RBX_TARGET_ID:
            print("GRA is healthy again, attempting failback to GRA...")
            if set_frontend_target(GRA_TARGET_ID):
                current_active_target = GRA_TARGET_ID
                print("Failback to GRA successful.")
            else:
                print("Failback to GRA failed.")

        # If GRA is healthy and RBX is healthy, and GRA is the active target
        elif gra_healthy and rbx_healthy and current_active_target == GRA_TARGET_ID:
            pass # All good, primary is active

        # If both are down, we're in a degraded state. The script can't fix this.
        elif not gra_healthy and not rbx_healthy:
            print("Both regions are unhealthy. Manual intervention required.")

        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    # Initial check to set the starting active target if needed
    # This part would need refinement to reliably determine initial state
    print("Starting monitoring loop...")
    monitor_and_failover()
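
The branching inside the loop is easier to reason about, and to unit test, when it is isolated from the network calls. The following is a minimal sketch of that idea rather than a drop-in replacement: the function name and the "gra"/"rbx" labels are hypothetical stand-ins for the target IDs, and it assumes GRA is the active target when monitoring starts.

```python
GRA, RBX = "gra", "rbx"  # hypothetical labels standing in for the target IDs


def next_active(active: str, gra_failures: int, gra_healthy: bool,
                rbx_healthy: bool, threshold: int = 3) -> str:
    """Return which region should serve traffic after one round of health checks."""
    if active == GRA and gra_failures >= threshold and rbx_healthy:
        return RBX  # fail over: primary has failed repeatedly, standby is reachable
    if active == RBX and gra_healthy:
        return GRA  # fail back: primary has recovered
    return active  # otherwise keep the current target


if __name__ == "__main__":
    print(next_active(GRA, gra_failures=3, gra_healthy=False, rbx_healthy=True))
    print(next_active(GRA, gra_failures=2, gra_healthy=False, rbx_healthy=True))
```

Failback is immediate here only to mirror the script. In practice you would likely gate it on the GRA database having been resynchronized, and possibly on several consecutive healthy checks to avoid flapping.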

Important Considerations for the Script:

API Credentials: Securely manage your OVH API credentials. Environment variables are a good practice.
Target IDs: You’ll need to find the specific IDs for your load balancer targets and frontends within the OVHcloud control panel or via API calls.
API Call Precision: The set_frontend_target function is a placeholder. You must consult the OVHcloud Load Balancer API documentation to determine the exact endpoint and payload for updating the active backend/target. This often involves updating a frontend’s configuration object.
State Management: The script needs to track the current active target to prevent redundant API calls and to manage failback logic.
Error Handling: Robust error handling and retry mechanisms are crucial for production.
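Deployment: This script needs to run on a reliable server with network access to the OVHcloud API and to both application endpoints. Host it outside the primary region (ideally in a third location), so that a GRA outage does not take the monitor down along with the application.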
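
For the error-handling point, a small retry helper with exponential backoff is often enough to ride out transient API errors. Below is a minimal sketch; the helper name and default values are assumptions, and the sleep function is injectable so the behavior can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(action: Callable[[], T],
                      attempts: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run action(), retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1x, 2x, 4x, ... base_delay
    raise AssertionError("unreachable")
```

It would typically wrap the real API call inside set_frontend_target, for example call_with_retries(lambda: client.put(...)), so that a single timeout does not abort a failover.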

Data Consistency and Failback Procedures

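A traffic failover does not promote the database by itself: the RBX PostgreSQL standby stays read-only until it is promoted (for example with pg_ctl promote or SELECT pg_promote()). Once RBX has been promoted, the old GRA primary is on a diverged timeline and cannot simply resume where it left off. It has to be resynchronized, either with pg_rewind (which requires wal_log_hints or data checksums to be enabled) or with a fresh pg_basebackup. For failback, you have a few options:

Promote Standby, Reconfigure Replication: Promote the RBX instance to primary. Then, rebuild the GRA PostgreSQL instance as a standby that replicates from RBX. This is the most common approach.
Downtime for Sync: Schedule a maintenance window, stop writes to RBX, wait for GRA to catch up (once it is replicating from RBX), then switch back. This is less ideal for high-availability systems.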
Logical Replication (More Complex): For more advanced scenarios, consider logical replication, which can offer more flexibility but adds complexity.

The failback process should be as automated as the failover. This involves:

Stopping writes to the current primary (RBX).
Ensuring the old primary (GRA), now running as a standby of RBX, has replayed all WAL from RBX (or performing a manual data sync if necessary).
Promoting GRA to primary again and reconfiguring RBX as a standby that replicates from GRA (or leaving the roles as they are if RBX is to remain primary).
Updating DNS/Load Balancer to point back to GRA.
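
Because the order of these steps matters, an automated failback should stop at the first step that fails rather than carry on. The sketch below shows one way to structure that; the runner and the placeholder step functions are hypothetical, and each lambda would be replaced by a real check or action.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that returns False.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, action in steps:
        if not action():
            break  # later steps depend on this one, so do not continue
        completed.append(name)
    return completed


if __name__ == "__main__":
    failback = [
        ("stop writes on RBX", lambda: True),        # placeholder actions
        ("confirm GRA has caught up", lambda: True),
        ("promote GRA, demote RBX", lambda: True),
        ("point traffic back to GRA", lambda: True),
    ]
    print(run_runbook(failback))
```

Returning the list of completed steps makes it clear to an operator where a partially executed failback stopped.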

Testing Your Disaster Recovery Plan

A DR plan is useless if not tested. Regularly simulate failures:

Network Isolation: Block traffic to your primary region’s servers.
Database Shutdown: Stop the primary PostgreSQL instance.
Application Server Failure: Terminate application instances in the primary region.

Document the entire failover and failback process, including the time taken and any issues encountered. Refine your automation scripts and procedures based on these tests. Aim for a Recovery Time Objective (RTO) and Recovery Point Objective (RPO) that meets your business requirements.
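
To compare a drill against your targets, record three timestamps: the last commit known to have reached the standby, the moment the failure was injected, and the moment service was restored. The sketch below derives the achieved RPO and RTO from them; the function name and the sample timestamps are illustrative.

```python
from datetime import datetime, timedelta
from typing import Tuple


def drill_metrics(last_replicated_commit: datetime,
                  failure_at: datetime,
                  service_restored_at: datetime) -> Tuple[timedelta, timedelta]:
    """Return (achieved RPO, achieved RTO) for one failover drill."""
    # RPO: window of writes that never reached the standby before the failure.
    rpo = max(failure_at - last_replicated_commit, timedelta(0))
    # RTO: how long the service was unavailable.
    rto = service_restored_at - failure_at
    return rpo, rto


if __name__ == "__main__":
    rpo, rto = drill_metrics(
        datetime(2024, 1, 1, 9, 59, 56),   # hypothetical drill timestamps
        datetime(2024, 1, 1, 10, 0, 0),
        datetime(2024, 1, 1, 10, 6, 30),
    )
    print(f"RPO: {rpo.total_seconds():.0f}s, RTO: {rto.total_seconds():.0f}s")
```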