Automating Multi-Region Redundancy for Perl Architectures on OVH

Establishing Multi-Region Redundancy for Perl Applications on OVH

This document outlines a robust, automated strategy for achieving multi-region redundancy for Perl-based architectures hosted on OVHcloud. The focus is on minimizing Recovery Time Objective (RTO) and Recovery Point Objective (RPO) through a combination of infrastructure-as-code, automated data synchronization, and intelligent traffic management.

Infrastructure Provisioning with Terraform

We’ll leverage Terraform to provision identical infrastructure stacks across two distinct OVHcloud regions (e.g., GRA and RBX). This ensures consistency and repeatability. The core components include:

Compute instances (e.g., Public Cloud Instances)
Managed Databases (e.g., Managed PostgreSQL)
Load Balancers (e.g., HAProxy Load Balancer)
Object Storage (e.g., OVHcloud Object Storage)

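A simplified Terraform configuration for a single region might look like this. Resource types and attribute names are illustrative and should be verified against the current OVH Terraform provider documentation before use: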

Terraform Configuration Snippet (`main.tf`)

# main.tf

terraform {
  required_providers {
    ovh = {
      source  = "ovh/ovh"
      version = "~> 1.0"
    }
  }
  required_version = ">= 1.0"
}

provider "ovh" {
  endpoint = "ovh-eu" # Or ovh-us, ovh-ca, etc.
}

variable "region" {
  description = "The OVHcloud region to deploy resources in."
  type        = string
}

variable "service_name" {
  description = "A unique name for the service."
  type        = string
  default     = "my-perl-app"
}

# Example: Provisioning a Public Cloud Instance
resource "ovh_cloud_project_instance" "app_server" {
  service_name = var.service_name
  region       = var.region
  name         = "${var.service_name}-app-${var.region}"
  flavor_id    = "s1-2" # Example flavor
  image_id     = "ubuntu-2004" # Example image
  ssh_key_name = "my-ssh-key" # Ensure this key is uploaded to OVHcloud
  network_id   = ovh_cloud_project_network.private_network.id
}

# Example: Provisioning a Managed PostgreSQL Database
resource "ovh_db_postgresql_service" "db_service" {
  service_name = "${var.service_name}-db-${var.region}"
  version      = "13"
  plan         = "professional"
  region       = var.region
  disk_size    = 100 # GB
}

# Example: Private Network
resource "ovh_cloud_project_network" "private_network" {
  service_name = var.service_name
  region       = var.region
  name         = "${var.service_name}-net-${var.region}"
  subnet       = "192.168.1.0/24" # Example subnet
  vlan_id      = 0 # Default VLAN
}

# Output database connection details (sensitive, handle with care)
output "db_host" {
  description = "Managed PostgreSQL hostname."
  value       = ovh_db_postgresql_service.db_service.host
  sensitive   = true
}

output "db_port" {
  description = "Managed PostgreSQL port."
  value       = ovh_db_postgresql_service.db_service.port
}

output "db_user" {
  description = "Managed PostgreSQL username."
  value       = ovh_db_postgresql_service.db_service.users[0].name
  sensitive   = true
}

output "db_password" {
  description = "Managed PostgreSQL password."
  value       = ovh_db_postgresql_service.db_service.users[0].password
  sensitive   = true
}

To deploy to two regions, you would typically use Terraform workspaces or separate configuration files, applying the configuration for each region:

# Initialize Terraform
terraform init

# Set up workspace for GRA
terraform workspace select gra || terraform workspace new gra
terraform apply -var="region=GRA" -auto-approve

# Set up workspace for RBX
terraform workspace select rbx || terraform workspace new rbx
terraform apply -var="region=RBX" -auto-approve
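The per-region commands can also be generated rather than typed by hand, which helps when more regions are added later. The following Python sketch only builds the shell command strings; it assumes one workspace per region, named after the lowercased region code (a convention chosen here for illustration, not something Terraform requires). The function name is hypothetical.

```python
def terraform_commands(regions):
    """Return, for each region, the shell commands that select (or create)
    its workspace and apply the configuration with that region."""
    plan = {}
    for region in regions:
        workspace = region.lower()  # assumed convention: one workspace per region
        plan[region] = [
            f"terraform workspace select {workspace} || terraform workspace new {workspace}",
            f'terraform apply -var="region={region}" -auto-approve',
        ]
    return plan


if __name__ == "__main__":
    for region, commands in terraform_commands(["GRA", "RBX"]).items():
        print(f"# {region}")
        for command in commands:
            print(command)
```

A CI job could pass each string to a shell step; keeping command generation separate from execution makes the region list easy to review.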

Automated Data Synchronization

Maintaining data consistency between regions is paramount. For databases, we’ll implement asynchronous replication. For file-based data (uploads, static assets), we’ll use rsync or a cloud-native object storage replication mechanism.

PostgreSQL Asynchronous Replication

OVHcloud Managed PostgreSQL services offer built-in replication capabilities. The primary database will reside in the primary region (e.g., GRA), and a replica will be set up in the secondary region (e.g., RBX). This is typically configured via the OVHcloud API or control panel. For automation, we can use the OVH Terraform provider or a custom script interacting with the OVH API.

A conceptual Terraform snippet for setting up a replica (assuming the primary service already exists):

# Example: Setting up a replica (conceptual, actual resource might differ)
# This assumes you have a way to reference the primary DB service's ID/details

resource "ovh_db_postgresql_replica" "secondary_replica" {
  service_name = "${var.service_name}-db-replica-${var.region}" # Unique name for the replica service
  region       = var.region # The region where the replica will be deployed
  master_id    = "primary-db-service-id" # ID of the primary DB service
  plan         = "professional" # Match plan or choose appropriately
  version      = "13" # Match version
}

Important Note: Direct replication setup via Terraform might require specific resource types or manual steps post-provisioning depending on the OVH API’s current capabilities for managed replicas. Always consult the latest OVH Terraform provider documentation.
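Because replication is asynchronous, the replica can lag behind the primary, and that lag bounds the achievable RPO. One way to watch it is to compare WAL positions: PostgreSQL reports them as LSN strings such as 16/B374D848 (for example from pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on the replica). The Python sketch below converts two such strings into a byte difference. How the LSNs are fetched, and whether a managed service exposes these functions to your database user, are assumptions left outside the sketch.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' into an absolute byte position.

    The part before the slash is the high 32 bits, the part after is the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Return how many bytes of WAL the replica still has to replay (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))


if __name__ == "__main__":
    print(replication_lag_bytes("0/3000000", "0/2000000"))
```

A monitoring job could alert when this value stays above a chosen limit for several consecutive checks.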

Object Storage Replication

OVHcloud Object Storage (S3-compatible) can be used for storing user uploads, static assets, etc. For cross-region redundancy, we can configure replication:

Option 1: Scripted rsync: Periodically run rsync from a designated instance in the primary region to an instance in the secondary region. rsync operates on filesystems, so this suits data kept on instance disks; for data held in object storage, an S3-aware tool such as rclone plays the same role. Either way, this is less ideal for real-time needs.
Option 2: S3 Replication (if supported/configured): If using S3-compatible endpoints, investigate whether OVHcloud offers native cross-region replication for its Object Storage. If not, third-party tools like s3sync or custom scripts using the AWS SDK (which often works with S3-compatible APIs) can be employed.
Option 3: Application-Level Replication: Modify the Perl application to write critical data to both regions’ object storage endpoints simultaneously or asynchronously. This adds complexity to the application logic.

For automated, periodic synchronization, a cron job can run a script that uses an S3-compatible client (for example, the Perl module Net::Amazon::S3 or a Python script with boto3) to copy objects between buckets in different regions. How close this comes to real time depends on the cron interval. Ensure that the credentials and access policies in place allow access to the buckets in both regions.
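The core of such a sync script is deciding which objects need to be copied. The Python sketch below shows that step in isolation: it takes two listings, each a dict mapping object key to a (size, etag) tuple, and returns the keys to copy and the keys to delete. The listing format and function name are assumptions for illustration; in practice the listings might be built from a boto3 list_objects_v2 paginator. ETags of multipart uploads may differ between buckets, so comparing on size alone can be the safer choice in that case.

```python
def plan_sync(source, destination):
    """Compare two bucket listings and return (to_copy, to_delete).

    Each listing maps object key -> (size_in_bytes, etag).
    to_copy: keys missing from the destination or whose metadata differs.
    to_delete: keys present only in the destination.
    """
    to_copy = sorted(
        key for key, meta in source.items() if destination.get(key) != meta
    )
    to_delete = sorted(key for key in destination if key not in source)
    return to_copy, to_delete


if __name__ == "__main__":
    primary = {"uploads/a.png": (1024, "abc"), "uploads/b.png": (2048, "def")}
    secondary = {"uploads/a.png": (1024, "abc"), "uploads/old.png": (10, "zzz")}
    print(plan_sync(primary, secondary))
```

Whether to act on the delete list is a policy decision; keeping deleted objects in the secondary region for a while can protect against accidental deletions being replicated.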

Application Deployment and Configuration

The Perl application needs to be deployed consistently across both regions. This can be achieved using CI/CD pipelines (e.g., GitLab CI, GitHub Actions, Jenkins) that build and deploy artifacts to instances in both regions.

Configuration Management

Configuration files (database credentials, API keys, region-specific endpoints) must be managed securely and dynamically. Tools like Ansible, Chef, or Puppet can be used. Alternatively, environment variables injected during deployment or a secrets management system (like HashiCorp Vault) are recommended.

A sample Perl configuration snippet demonstrating dynamic database connection:

# config.pl
use strict;
use warnings;

my %config;

# Load from environment variables
$config{DB}{HOST} = $ENV{DB_HOST} || 'localhost';
$config{DB}{PORT} = $ENV{DB_PORT} || '5432';
$config{DB}{NAME} = $ENV{DB_NAME} || 'myapp_db';
$config{DB}{USER} = $ENV{DB_USER} || 'app_user';
$config{DB}{PASS} = $ENV{DB_PASS} || '';

# Other configurations
$config{APP}{LOG_LEVEL} = $ENV{LOG_LEVEL} || 'info';

# Return a hash reference so callers can load it with: my $config = do './config.pl';
\%config;

During deployment, the CI/CD pipeline would ensure these environment variables are set correctly on the target instances for each region, pulling values from a secure source.
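One possible way for the pipeline to obtain those values is to read them from Terraform. The command terraform output -json prints a JSON object in which each output name maps to an object with a value field. The Python sketch below turns that JSON into the environment variables read by config.pl. The mapping between output names and variable names is an assumption based on the snippets in this document, and the function name is hypothetical.

```python
import json

# Assumed mapping from the Terraform outputs defined earlier to the
# environment variables that config.pl reads.
OUTPUT_TO_ENV = {
    "db_host": "DB_HOST",
    "db_port": "DB_PORT",
    "db_user": "DB_USER",
    "db_password": "DB_PASS",
}


def env_from_terraform_outputs(outputs_json: str) -> dict:
    """Build a dict of environment variables from `terraform output -json` text."""
    outputs = json.loads(outputs_json)
    env = {}
    for output_name, env_name in OUTPUT_TO_ENV.items():
        if output_name in outputs:
            env[env_name] = str(outputs[output_name]["value"])
    return env


if __name__ == "__main__":
    sample = '{"db_host": {"value": "db.example.internal"}, "db_port": {"value": 5432}}'
    print(env_from_terraform_outputs(sample))
```

The resulting values include secrets, so they should be passed to the deployment step through the CI system's protected variables or a secrets manager rather than written to logs.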

Intelligent Traffic Management and Failover

The final piece is directing user traffic and orchestrating failover. This involves DNS and load balancing.

DNS-Based Failover (e.g., OVHcloud Managed DNS)

Utilize OVHcloud’s Managed DNS service. Configure multiple A records pointing to the IP addresses of the load balancers in each region. Implement health checks for each load balancer. If the primary region’s load balancer becomes unresponsive, DNS resolution can be updated (manually or via automation) to point solely to the secondary region’s load balancer. Keep the TTL on these records low (for example, 60 seconds), because clients keep using a cached answer until it expires and failover cannot complete faster than that.

For more advanced, automated DNS failover, consider third-party services like AWS Route 53, Cloudflare, or Akamai, which offer sophisticated health checking and automatic failover capabilities based on latency or endpoint health.
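With DNS-based failover, the time to shift traffic is roughly the time needed to detect the failure plus the record's TTL. The Python sketch below computes a rough upper bound from four inputs; all parameter names are illustrative, and the estimate ignores resolvers that disregard TTLs and the time the DNS provider needs to publish the change.

```python
def worst_case_failover_seconds(check_interval, failure_threshold, check_timeout, dns_ttl):
    """Rough upper bound on the time from an outage to clients reaching the other region.

    Detection: the outage may begin right after a successful check, so up to
    failure_threshold rounds of (wait for the next check + the check timing out)
    pass before the threshold is reached. Clients may then keep a cached DNS
    answer for up to dns_ttl seconds.
    """
    detection = failure_threshold * (check_interval + check_timeout)
    return detection + dns_ttl


if __name__ == "__main__":
    # 30 s between checks, 3 consecutive failures, 5 s timeout, 60 s TTL
    print(worst_case_failover_seconds(30, 3, 5, 60))  # prints 165
```

Comparing this figure with the target RTO shows whether the check interval, threshold, or TTL needs to be tightened.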

Load Balancer Configuration (HAProxy)

Deploy HAProxy instances in front of your application servers in each region. Configure HAProxy to perform health checks on the application instances within its region. If all instances in a region fail, HAProxy can return a specific error or redirect traffic (though this is less common for regional failover; DNS is typically the first line).

The primary mechanism for failover will be at the DNS level, directing traffic *away* from a failing region entirely. The HAProxy instances within each region ensure high availability *within* that region.

# haproxy.cfg (Simplified Example)
global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

frontend http_frontend
    bind *:80
    acl is_static url_beg /static/
    use_backend static_backend if is_static
    default_backend app_backend

backend app_backend
    balance roundrobin
    option httpchk GET /healthcheck.txt HTTP/1.1\r\nHost:\ localhost
    server app1 192.168.1.10:80 check
    server app2 192.168.1.11:80 check
    # Add more servers as needed

backend static_backend
    server static_files 192.168.1.20:80

Automated Failover Orchestration

The “automation” in failover is critical. This involves:

Health Monitoring: Implement comprehensive health checks not just on application servers but also on database replicas, load balancers, and critical services. Tools like Prometheus with Alertmanager, Nagios, or Zabbix can be used.
Alerting: Configure alerts to notify the operations team immediately when a region experiences critical failures.
Automated Failover Trigger: This is the most complex part. It can be achieved through:
- Custom Scripts: A script monitoring health check endpoints. Upon detecting a regional failure (e.g., multiple consecutive failures of the primary region’s load balancer), it triggers an API call to update DNS records (e.g., via OVH API or a third-party DNS provider API) to shift traffic to the secondary region.
- Orchestration Tools: Platforms like Kubernetes with custom controllers or specialized disaster recovery orchestration tools.
- Third-Party Services: Utilizing the automated failover features of managed DNS providers.

A simplified conceptual Python script using the python-ovh client, with the actual DNS API calls left as placeholders:

import ovh
import time
import requests # For health checks

# --- Configuration ---
PRIMARY_REGION_LB_IP = "YOUR_PRIMARY_LB_IP"
SECONDARY_REGION_LB_IP = "YOUR_SECONDARY_LB_IP"
DNS_ZONE_ID = "YOUR_DNS_ZONE_ID"
DNS_RECORD_NAME = "app.yourdomain.com." # FQDN of the record to update
HEALTH_CHECK_URL_PRIMARY = "http://YOUR_PRIMARY_LB_IP/health"
HEALTH_CHECK_URL_SECONDARY = "http://YOUR_SECONDARY_LB_IP/health"
FAILOVER_THRESHOLD = 3 # Consecutive failures to trigger failover
RECOVERY_THRESHOLD = 5 # Consecutive successes to consider recovery
CHECK_INTERVAL = 30 # Seconds

# --- OVH API Client Initialization ---
# Ensure you have OVH credentials configured (e.g., via environment variables)
client = ovh.Client()

def check_health(url):
    try:
        response = requests.get(url, timeout=5)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

def get_current_dns_ip():
    # This is a placeholder. You'd need to query the OVH API
    # to get the current IP address for DNS_RECORD_NAME.
    # Example: client.get('/domain/zone/%s/record' % DNS_ZONE_ID, fieldType='A', subDomain='app')
    print("INFO: Fetching current DNS IP (placeholder)...")
    # Simulate fetching current IP
    return PRIMARY_REGION_LB_IP # Assume primary is current

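def update_dns_record(new_ip):
    print(f"INFO: Attempting to update DNS record {DNS_RECORD_NAME} to {new_ip}")
    # This is a placeholder for the actual OVH API calls. With python-ovh this would
    # likely be an update of the record's target followed by a zone refresh, e.g.:
    #   client.put('/domain/zone/%s/record/%s' % (DNS_ZONE_ID, record_id), target=new_ip)
    #   client.post('/domain/zone/%s/refresh' % DNS_ZONE_ID)
    # record_id is hypothetical here; it would come from the record lookup.
    # Verify the endpoints against the OVH API console before relying on them.
    print(f"SUCCESS: DNS record updated to {new_ip} (placeholder, no API call made)")
    return True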

def main():
    primary_failures = 0
    secondary_failures = 0
    primary_successes = 0  # Consecutive healthy checks, used to decide on failback
    current_ip = get_current_dns_ip()
    failover_active = (current_ip == SECONDARY_REGION_LB_IP)

    while True:
        primary_healthy = check_health(HEALTH_CHECK_URL_PRIMARY)
        secondary_healthy = check_health(HEALTH_CHECK_URL_SECONDARY)

        if primary_healthy:
            primary_failures = 0
            primary_successes += 1
        else:
            primary_failures += 1
            primary_successes = 0

        if secondary_healthy:
            secondary_failures = 0
        else:
            secondary_failures += 1

        print(f"Status: Primary Healthy={primary_healthy} ({primary_failures} failures), Secondary Healthy={secondary_healthy} ({secondary_failures} failures), Failover Active={failover_active}")

        # --- Both regions down: there is nowhere to send traffic ---
        if primary_failures >= FAILOVER_THRESHOLD and secondary_failures >= FAILOVER_THRESHOLD:
            print("CRITICAL ALERT: Both regions are unhealthy. Manual intervention required!")
            # Implement more aggressive alerting here

        # --- Failover Logic ---
        elif not failover_active and primary_failures >= FAILOVER_THRESHOLD and secondary_healthy:
            print("ALERT: Primary region unhealthy. Initiating failover to secondary.")
            if update_dns_record(SECONDARY_REGION_LB_IP):
                current_ip = SECONDARY_REGION_LB_IP
                failover_active = True
                primary_failures = 0  # Reset failures after successful failover
            else:
                print("ERROR: Failed to update DNS for failover.")

        # --- Recovery Logic ---
        # Fail back only after the primary has been healthy for RECOVERY_THRESHOLD
        # consecutive checks, to avoid flapping between regions.
        elif failover_active and primary_successes >= RECOVERY_THRESHOLD:
            print("INFO: Primary region has recovered. Initiating failback.")
            if update_dns_record(PRIMARY_REGION_LB_IP):
                current_ip = PRIMARY_REGION_LB_IP
                failover_active = False
                primary_successes = 0
            else:
                print("ERROR: Failed to update DNS for failback.")

        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()

This script requires significant refinement, including robust error handling, proper OVH API integration for DNS record management, and potentially a more sophisticated state management mechanism. It serves as a conceptual blueprint.
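One refinement in the direction of better state management is to separate the decision logic from the network calls, so that it can be unit tested without touching DNS or real endpoints. The Python sketch below is one way to do that; the class and function names are hypothetical, and the thresholds default to the values used in the script above.

```python
from dataclasses import dataclass


@dataclass
class MonitorState:
    failover_active: bool = False
    primary_failures: int = 0   # consecutive failed checks of the primary
    primary_successes: int = 0  # consecutive healthy checks of the primary


def next_action(state, primary_ok, secondary_ok, fail_threshold=3, recover_threshold=5):
    """Record one round of health checks and return 'failover', 'failback', or None.

    The state is updated in place, including failover_active, so the caller
    should revert that flag if the DNS update for the returned action fails.
    """
    if primary_ok:
        state.primary_failures = 0
        state.primary_successes += 1
    else:
        state.primary_failures += 1
        state.primary_successes = 0

    if not state.failover_active and state.primary_failures >= fail_threshold and secondary_ok:
        state.failover_active = True
        return "failover"
    if state.failover_active and state.primary_successes >= recover_threshold:
        state.failover_active = False
        return "failback"
    return None


if __name__ == "__main__":
    state = MonitorState()
    # Three failed primary checks in a row, secondary healthy throughout.
    print([next_action(state, False, True) for _ in range(3)])
```

The monitoring loop then only has to run the health checks, call next_action, and perform the DNS update when an action is returned.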

Testing and Validation

Regularly test the failover process. This includes:

Simulating failures: Shutting down instances, stopping database services, or blocking network access to a region.
Verifying data integrity after failover.
Testing the failback process to the primary region.
Measuring the actual RTO and RPO achieved during tests.
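The achieved RTO and RPO of a drill can be computed from three timestamps recorded during the test. The Python sketch below assumes that the moment of failure, the time of the last write known to have reached the secondary region, and the moment service was restored are all captured by the test harness; the function and parameter names are illustrative.

```python
from datetime import datetime


def drill_metrics(failure_at, last_replicated_write_at, service_restored_at):
    """Return (rto, rpo) as timedeltas for one failover drill.

    rto: how long the service was unavailable.
    rpo: the window of writes made before the failure that never reached the secondary.
    """
    rto = service_restored_at - failure_at
    rpo = failure_at - last_replicated_write_at
    return rto, rpo


if __name__ == "__main__":
    rto, rpo = drill_metrics(
        failure_at=datetime(2024, 1, 1, 12, 0, 0),
        last_replicated_write_at=datetime(2024, 1, 1, 11, 59, 40),
        service_restored_at=datetime(2024, 1, 1, 12, 4, 30),
    )
    print(f"RTO: {rto.total_seconds():.0f}s, RPO: {rpo.total_seconds():.0f}s")
```

Recording these figures for every drill makes it easy to see whether changes to the automation improve or degrade recovery over time.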

Document all procedures and test results. This is crucial for compliance and for refining the automation scripts and infrastructure.