Automating Multi-Region Redundancy for Python Architectures on OVH
OVH Public Cloud: Multi-Region Strategy for Python Applications
Achieving robust disaster recovery for Python applications necessitates a multi-region deployment strategy. This post details a practical approach using OVH Public Cloud, focusing on automated failover and data synchronization for a typical web application stack. We’ll cover infrastructure provisioning, application deployment, database replication, and automated health checks.
Infrastructure as Code: Terraform for OVH Deployment
Managing infrastructure across multiple regions manually is error-prone and time-consuming. Terraform provides a declarative way to define and provision your cloud resources. We’ll define two distinct regions, each with its own set of compute instances, load balancers, and databases.
First, ensure you have the OVH provider configured for Terraform. This typically involves setting up environment variables for your OVH API credentials:
export OVH_APPLICATION_KEY="YOUR_OVH_APPLICATION_KEY"
export OVH_APPLICATION_SECRET="YOUR_OVH_APPLICATION_SECRET"
export OVH_CONSUMER_KEY="YOUR_OVH_CONSUMER_KEY"
Next, define your Terraform configuration. This example uses two regions, ‘GRA’ (Gravelines) and ‘RBX’ (Roubaix), for demonstration. Each region will have a private network, a set of instances, and a managed PostgreSQL database.
# main.tf
terraform {
  required_providers {
    ovh = {
      source = "ovh/ovh" # The OVH provider is not in the default hashicorp namespace
    }
  }
}

provider "ovh" {
  endpoint = "ovh-eu"
}

variable "region_primary" {
  description = "Primary OVH region"
  type        = string
  default     = "GRA"
}

variable "region_secondary" {
  description = "Secondary OVH region"
  type        = string
  default     = "RBX"
}

# --- Primary Region Resources ---
module "primary_region" {
  source      = "./modules/region"
  region      = var.region_primary
  name_prefix = "primary"
}

# --- Secondary Region Resources ---
module "secondary_region" {
  source      = "./modules/region"
  region      = var.region_secondary
  name_prefix = "secondary"
}

# --- Global Resources (e.g., DNS, if managed externally) ---
# For simplicity, we'll assume DNS is managed separately or via OVH's DNS service
# which would require additional provider configuration.
The `modules/region` directory would contain the reusable infrastructure for a single region:
# modules/region/main.tf
# Argument names differ between OVH provider versions; verify each one
# against the documentation for the version you pin.
terraform {
  required_providers {
    ovh = {
      source = "ovh/ovh"
    }
  }
}

variable "region" {
  description = "OVH region for this module"
  type        = string
}

variable "name_prefix" {
  description = "Prefix for resources in this region"
  type        = string
}

resource "ovh_cloud_project_network_private" "app_network" {
  service_name = "YOUR_OVH_PROJECT_ID" # Replace with your actual project ID
  region       = var.region
  name         = "${var.name_prefix}-app-net"
  vlan_id      = 100 # Example VLAN ID
}

resource "ovh_cloud_project_instance" "app_server" {
  count        = 2 # Deploy two instances for redundancy within the region
  service_name = "YOUR_OVH_PROJECT_ID"
  region       = var.region
  # Number each instance so the two copies get distinct names (-01, -02)
  name         = "${var.name_prefix}-app-server-${format("%02d", count.index + 1)}"
  flavor_name  = "b2-7" # Example flavor
  image_name   = "Debian 11"
  network_id   = ovh_cloud_project_network_private.app_network.network_id
  ssh_key_name = "your-ssh-key-name" # Ensure this SSH key exists in your OVH account
}

resource "ovh_cloud_project_database" "app_db" {
  service_name = "YOUR_OVH_PROJECT_ID"
  region       = var.region
  engine       = "postgresql"
  version      = "13"
  plan_name    = "professional-1" # Example plan
  name         = "${var.name_prefix}-app-db"
  disk_size    = 100 # GB
}

output "instance_ips" {
  value = [for instance in ovh_cloud_project_instance.app_server : instance.public_ip]
}

output "db_endpoint" {
  value = ovh_cloud_project_database.app_db.endpoint
}
After defining your Terraform files, initialize and apply the configuration:
terraform init
terraform plan
terraform apply
Automated Application Deployment with Ansible
Once the infrastructure is provisioned, we need to deploy our Python application consistently across all instances in both regions. Ansible is an excellent choice for this. We’ll create an Ansible playbook that installs dependencies, copies application code, configures environment variables, and starts the application service.
First, gather the IP addresses of your deployed instances. You can output these from Terraform or query the OVH API. For this example, let’s assume you have an Ansible inventory file:
# inventory.ini
[all:vars]
# Or your chosen SSH user
ansible_user=debian

[primary_app_servers]
primary-app-server-01 ansible_host=PRIMARY_APP_SERVER_01_IP
primary-app-server-02 ansible_host=PRIMARY_APP_SERVER_02_IP

[secondary_app_servers]
secondary-app-server-01 ansible_host=SECONDARY_APP_SERVER_01_IP
secondary-app-server-02 ansible_host=SECONDARY_APP_SERVER_02_IP

[all_app_servers:children]
primary_app_servers
secondary_app_servers
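If you would rather generate the inventory than maintain it by hand, a small script can build it from the data printed by `terraform output -json`. The following is a minimal sketch, not a definitive implementation. It assumes the root module exposes two outputs named `primary_instance_ips` and `secondary_instance_ips`; those names are hypothetical and are not defined in the configuration above, so you would need to add them (for example by forwarding each module's `instance_ips` output).

```python
def render_inventory(tf_outputs: dict, ssh_user: str = "debian") -> str:
    """Render an Ansible INI inventory from parsed `terraform output -json` data."""
    lines = ["[all:vars]", f"ansible_user={ssh_user}", ""]
    groups = []
    for prefix in ("primary", "secondary"):
        # Hypothetical output names: adjust them to match your root module.
        ips = tf_outputs[f"{prefix}_instance_ips"]["value"]
        group = f"{prefix}_app_servers"
        groups.append(group)
        lines.append(f"[{group}]")
        for index, ip in enumerate(ips, start=1):
            lines.append(f"{prefix}-app-server-{index:02d} ansible_host={ip}")
        lines.append("")
    # Parent group that the playbook targets.
    lines.append("[all_app_servers:children]")
    lines.extend(groups)
    return "\n".join(lines) + "\n"


# Example input shaped like the JSON that `terraform output -json` prints.
sample_outputs = {
    "primary_instance_ips": {"value": ["203.0.113.10", "203.0.113.11"]},
    "secondary_instance_ips": {"value": ["198.51.100.10", "198.51.100.11"]},
}
print(render_inventory(sample_outputs), end="")
```

In practice you would pass the result of `json.loads` on the real Terraform output instead of the sample dictionary and write the returned text to `inventory.ini`.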
Now, create the Ansible playbook. This playbook assumes your Python application uses Gunicorn and systemd for process management.
# deploy_app.yml
---
- name: Deploy Python Application
  hosts: all_app_servers
  become: yes
  vars:
    app_repo: "git@YOUR_GIT_HOST:your-org/your-python-app.git" # Replace YOUR_GIT_HOST with your Git server
    app_path: "/opt/your_app"
    venv_path: "{{ app_path }}/venv"
    gunicorn_workers: 3
    gunicorn_bind: "0.0.0.0:8000"
    db_host: "{{ hostvars[groups['all'][0]]['primary_app_db_endpoint'] }}" # Example: Use primary DB endpoint by default
    db_name: "your_db_name"
    db_user: "your_db_user"
    db_password: "your_db_password"

  tasks:
    - name: Update apt cache
      apt:
        update_cache: yes

    - name: Install system dependencies (python3, pip, venv, git, etc.)
      apt:
        name:
          - python3
          - python3-pip
          - python3-venv
          - git
          - build-essential
        state: present

    - name: Create application directory
      file:
        path: "{{ app_path }}"
        state: directory
        owner: www-data # Or your application user
        group: www-data

    - name: Clone or update application repository
      git:
        repo: "{{ app_repo }}"
        dest: "{{ app_path }}"
        version: main # Or a specific tag/branch
        force: yes

    - name: Create virtual environment
      pip:
        name: pip # The pip module requires a package name or a requirements file
        virtualenv: "{{ venv_path }}"
        virtualenv_command: /usr/bin/python3 -m venv

    - name: Install Python dependencies from requirements.txt
      pip:
        requirements: "{{ app_path }}/requirements.txt"
        virtualenv: "{{ venv_path }}"

    - name: Copy Gunicorn systemd service file
      template:
        src: templates/gunicorn.service.j2
        dest: /etc/systemd/system/gunicorn.service
      notify: Restart Gunicorn

    - name: Ensure Gunicorn service is enabled and started
      systemd:
        name: gunicorn
        enabled: yes
        state: started
        daemon_reload: yes

  handlers:
    - name: Restart Gunicorn
      systemd:
        name: gunicorn
        state: restarted
        daemon_reload: yes
You’ll also need a Jinja2 template for the Gunicorn systemd service:
# templates/gunicorn.service.j2
[Unit]
Description=Gunicorn instance to serve your_app
After=network.target

[Service]
User=www-data
Group=www-data
WorkingDirectory={{ app_path }}
# Adjust 'your_app.wsgi:application' to your app's entry point.
# systemd does not support trailing comments, so keep this note on its own line.
ExecStart={{ venv_path }}/bin/gunicorn \
    --workers {{ gunicorn_workers }} \
    --bind {{ gunicorn_bind }} \
    your_app.wsgi:application

[Install]
WantedBy=multi-user.target
Run the playbook:
ansible-playbook -i inventory.ini deploy_app.yml
Database Replication and Failover
For disaster recovery, database availability is paramount. OVH Managed Databases for PostgreSQL support logical replication. We’ll configure the primary database in ‘GRA’ to replicate to the secondary database in ‘RBX’.
Step 1: Enable Logical Replication on Primary DB
You can do this via the OVH Control Panel or the OVH API. Ensure the following parameters are set for your primary PostgreSQL instance:
wal_level = logical
max_replication_slots = 5 # Adjust as needed
max_wal_senders = 5 # Adjust as needed
Restart the primary database instance for these changes to take effect.
Step 2: Create a Replication User
Connect to your primary database and create a user with replication privileges:
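-- Connect to your primary database
CREATE USER replicator WITH REPLICATION PASSWORD 'your_replication_password';
-- The initial table copy needs read access to the published tables
-- (this assumes they live in the public schema)
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replicator;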
Step 3: Configure Logical Replication Slot on Primary DB
On the primary database, create a replication slot. This slot tracks the WAL (Write-Ahead Log) data that needs to be sent to the replica.
-- Connect to your primary database
SELECT pg_create_logical_replication_slot('app_replication_slot', 'pgoutput');
Step 4: Configure Publication on Primary DB
Define what data to replicate. For a full database replication, you can publish all tables.
-- Connect to your primary database
CREATE PUBLICATION app_publication FOR ALL TABLES;
Step 5: Configure Subscription on Secondary DB
Connect to your secondary database instance (in ‘RBX’) and create a subscription to pull data from the primary.
-- Connect to your secondary database
CREATE SUBSCRIPTION app_subscription
    CONNECTION 'host=PRIMARY_DB_ENDPOINT port=5432 user=replicator password=your_replication_password dbname=your_db_name'
    PUBLICATION app_publication
    -- create_slot = false because the slot already exists on the primary;
    -- slot_name must match it, otherwise it defaults to the subscription name
    WITH (copy_data = true, create_slot = false, slot_name = 'app_replication_slot');
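Replace PRIMARY_DB_ENDPOINT with the actual endpoint of your primary database. With copy_data = true, the subscription copies the existing rows of every published table before it starts streaming changes. Logical replication does not replicate DDL, so create the same table definitions on the secondary database before you create the subscription.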
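A password that contains spaces or quotes will break a hand-written connection string. The sketch below shows one way to build the CONNECTION value programmatically, following the libpq rule that a quoted value escapes backslashes and single quotes with a backslash, and then wrapping the result as a standard SQL string literal. The helper names are hypothetical, and the sketch assumes the server uses the default `standard_conforming_strings = on`.

```python
def quote_conninfo_value(value: str) -> str:
    """Quote one libpq keyword value: backslash-escape \\ and ', then wrap in quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_conninfo(**params: object) -> str:
    """Join keyword=value pairs into a libpq connection string."""
    return " ".join(
        f"{key}={quote_conninfo_value(str(val))}" for key, val in params.items()
    )


def sql_literal(text: str) -> str:
    """Wrap text as a standard SQL string literal by doubling single quotes."""
    return "'" + text.replace("'", "''") + "'"


conninfo = build_conninfo(
    host="PRIMARY_DB_ENDPOINT",
    port=5432,
    user="replicator",
    password="pass word with 'quotes'",  # Deliberately awkward example value
    dbname="your_db_name",
)
# The printed literal can be pasted after the CONNECTION keyword.
print(sql_literal(conninfo))
```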
Step 6: Monitoring Replication Lag
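Measure replication lag on the primary database by comparing its current WAL position with the position the subscriber has confirmed for the replication slot. The secondary's own WAL position is unrelated to the primary's, so the two cannot be compared there. On the secondary, the srsubstate column of pg_subscription_rel shows whether each table has finished its initial synchronization ('r' means ready).
-- Connect to your primary database
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS replication_lag_bytes
FROM pg_replication_slots
WHERE slot_name = 'app_replication_slot';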
A consistently high replication_lag_bytes indicates an issue. You might need to adjust network bandwidth, database instance sizes, or the replication slot configuration.
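To make "consistently high" concrete, a monitoring job can keep a short window of lag samples and alert only when every sample in the window exceeds a threshold. The sketch below is illustrative: `lsn_to_int` mirrors the arithmetic behind `pg_wal_lsn_diff` (an LSN is two hexadecimal halves of a 64-bit byte position), while `LagMonitor`, the threshold, and the window size are hypothetical choices and not part of PostgreSQL or OVH tooling.

```python
from collections import deque


def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(primary_lsn: str, confirmed_lsn: str) -> int:
    """Bytes of WAL the subscriber has not yet confirmed."""
    return lsn_to_int(primary_lsn) - lsn_to_int(confirmed_lsn)


class LagMonitor:
    """Flags lag only when a full window of samples exceeds the threshold."""

    def __init__(self, threshold_bytes: int, window: int = 5):
        self.threshold_bytes = threshold_bytes
        self.samples = deque(maxlen=window)

    def record(self, lag: int) -> bool:
        """Store a sample; return True when the whole window is above the threshold."""
        self.samples.append(lag)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold_bytes for s in self.samples)


# Assumed values: alert after 3 consecutive samples above 16 MiB.
monitor = LagMonitor(threshold_bytes=16 * 1024 * 1024, window=3)
for primary, confirmed in [("2/0", "1/0"), ("2/10", "1/20"), ("2/30", "1/40")]:
    if monitor.record(lag_bytes(primary, confirmed)):
        print("Replication lag has stayed above the threshold")
```

A single spike during a bulk write will not trigger the alert, because the window resets its verdict as soon as one sample falls back under the threshold.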
Global Load Balancing and Health Checks
To enable seamless failover, we need a mechanism to direct traffic to the healthy region. OVH’s Load Balancer service can be configured for this. For true multi-region failover, consider a global DNS solution with health checking capabilities, or leverage OVH’s Global Load Balancing if available and suitable for your needs.
For this example, we’ll assume a simpler setup with a single OVH Load Balancer, where failover is triggered either manually or by a script. A more advanced setup would involve a global DNS provider (like Cloudflare, AWS Route 53, or OVH’s DNS) that can perform health checks across regions and update DNS records.
Step 1: Configure OVH Load Balancer
Create an OVH Load Balancer and configure backend pools pointing to the public IPs of your application servers in each region. Define health check probes (e.g., HTTP GET on a health endpoint like /healthz).
Step 2: Implement Application Health Endpoint
Your Python application should expose a /healthz endpoint that returns a 200 OK status if the application is healthy and can connect to its database. This is crucial for the load balancer’s health checks.
# Example Flask health check endpoint
from flask import Flask, jsonify
from your_app.database import get_db_connection  # Assume this function exists

app = Flask(__name__)


@app.route('/healthz')
def health_check():
    try:
        # Attempt to get a database connection to verify DB health
        conn = get_db_connection()
        conn.close()
        return jsonify({"status": "ok"}), 200
    except Exception as e:
        return jsonify({"status": "error", "message": str(e)}), 503
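As the application gains more dependencies (a cache, a message queue), the same pattern can be generalized. The following framework-independent sketch runs a set of named checks and produces the status code and body for the endpoint. The function name `run_health_checks` and the check names are hypothetical. It reports only the exception type, on the assumption that you do not want to expose internal error messages on a public endpoint.

```python
from typing import Callable, Dict, Tuple


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run each check; a check passes if it returns without raising."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            # Report only the exception type so internal details stay private.
            results[name] = f"error: {type(exc).__name__}"
    healthy = all(outcome == "ok" for outcome in results.values())
    body = {"status": "ok" if healthy else "error", "checks": results}
    return (200 if healthy else 503), body


def database_check() -> None:
    """Stand-in for opening and closing a real database connection."""


def cache_check() -> None:
    """Stand-in that simulates an unreachable cache."""
    raise ConnectionError("cache unreachable")


status_code, body = run_health_checks({"database": database_check, "cache": cache_check})
print(status_code, body)
```

Inside the Flask view you would return `jsonify(body), status_code`, so the load balancer still sees a plain 200 or 503.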
Step 3: Automated Failover Script
A script can monitor the health of the primary region. If health checks fail consistently, it can trigger a failover. This script would typically:
- Query the health status of the primary region’s load balancer or application servers.
- If primary is unhealthy, update the DNS record (if using global DNS) to point to the secondary region’s load balancer or IPs.
- Alternatively, if using OVH Load Balancer directly, you might need to reconfigure its backend pools to disable the primary region and enable the secondary. This can be done via the OVH API.
- Send notifications (e.g., Slack, email) about the failover event.
Here’s a conceptual Python script using the OVH API to disable a backend pool (representing the primary region):
import ovh
import time
import requests  # For health checks

# --- Configuration ---
OVH_ENDPOINT = "ovh-eu"  # python-ovh expects an endpoint name, not a URL
APP_KEY = "YOUR_OVH_APPLICATION_KEY"
APP_SECRET = "YOUR_OVH_APPLICATION_SECRET"
CONSUMER_KEY = "YOUR_OVH_CONSUMER_KEY"
PROJECT_ID = "YOUR_OVH_PROJECT_ID"
LOADBALANCER_ID = "YOUR_LOADBALANCER_ID"  # The ID of your OVH Load Balancer
PRIMARY_POOL_ID = "YOUR_PRIMARY_POOL_ID"  # The ID of the backend pool for the primary region
SECONDARY_POOL_ID = "YOUR_SECONDARY_POOL_ID"  # The ID of the backend pool for the secondary region
# This URL must reach the primary region directly. A global URL would be served
# by the secondary region after failover and would trigger an immediate failback.
HEALTH_CHECK_URL = "http://your-app.example.com/healthz"

# --- Initialize OVH Client ---
client = ovh.Client(
    endpoint=OVH_ENDPOINT,
    application_key=APP_KEY,
    application_secret=APP_SECRET,
    consumer_key=CONSUMER_KEY,
)


def is_primary_region_healthy():
    """Checks if the primary region is healthy via an external health check URL."""
    try:
        response = requests.get(HEALTH_CHECK_URL, timeout=5)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False


def update_loadbalancer_pool(pool_id, enabled):
    """Enables or disables a backend pool on the OVH Load Balancer."""
    try:
        # Conceptual route: confirm the actual load balancer path and body
        # for your project in the OVH API console before using this.
        client.put(f"/cloud/project/{PROJECT_ID}/loadbalancer/{LOADBALANCER_ID}/pool/{pool_id}",
                   status="active" if enabled else "disabled")
        print(f"Pool {pool_id} set to {'active' if enabled else 'disabled'}")
        return True
    except Exception as e:
        print(f"Error updating pool {pool_id}: {e}")
        return False


def send_notification(message):
    """Placeholder for sending notifications (e.g., Slack, email)."""
    print(f"NOTIFICATION: {message}")
    # Implement your notification logic here


def perform_failover():
    """Initiates failover to the secondary region."""
    print("Primary region is unhealthy. Initiating failover...")
    if update_loadbalancer_pool(PRIMARY_POOL_ID, False):
        print("Primary pool disabled. Traffic should now be directed to the secondary pool.")
        # Optionally, enable secondary pool if it was disabled
        # update_loadbalancer_pool(SECONDARY_POOL_ID, True)
        # Send notification
        send_notification("Failover initiated: Primary region disabled.")
    else:
        print("Failed to disable primary pool. Manual intervention required.")
        send_notification("CRITICAL: Failover failed. Could not disable primary pool.")


def perform_failback():
    """Initiates failback to the primary region."""
    print("Primary region is healthy. Initiating failback...")
    if update_loadbalancer_pool(PRIMARY_POOL_ID, True):
        print("Primary pool re-enabled. Traffic should now be directed back to the primary pool.")
        # Optionally, disable secondary pool if it was enabled
        # update_loadbalancer_pool(SECONDARY_POOL_ID, False)
        send_notification("Failback initiated: Primary region re-enabled.")
    else:
        print("Failed to re-enable primary pool. Manual intervention required.")
        send_notification("WARNING: Failback failed. Could not re-enable primary pool.")


def monitor_region_health():
    """Monitors health and triggers failover/failback."""
    primary_was_healthy = True
    while True:
        is_healthy = is_primary_region_healthy()
        if not is_healthy and primary_was_healthy:
            perform_failover()
            primary_was_healthy = False
        elif is_healthy and not primary_was_healthy:
            perform_failback()
            primary_was_healthy = True
        else:
            status = "healthy" if is_healthy else "unhealthy"
            print(f"Primary region is currently {status}. No action needed.")
        time.sleep(60)  # Check every 60 seconds


if __name__ == "__main__":
    # Ensure you have configured your OVH API credentials and have the correct IDs
    # You might want to run this script on a separate monitoring instance.
    print("Starting region health monitoring...")
    monitor_region_health()
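The monitoring loop reacts to a single failed check, so one dropped request can move traffic between regions. One way to require that checks "fail consistently" is to count consecutive results before changing state. The sketch below is an illustration under assumed thresholds, not part of the OVH API: the class name `FailoverDecider` and its default values are hypothetical.

```python
from typing import Optional


class FailoverDecider:
    """Turns a stream of health results into debounced failover decisions."""

    def __init__(self, failure_threshold: int = 3, recovery_threshold: int = 5):
        self.failure_threshold = failure_threshold
        self.recovery_threshold = recovery_threshold
        self.primary_active = True
        self._streak = 0  # Consecutive results that contradict the current state

    def observe(self, healthy: bool) -> Optional[str]:
        """Return 'failover', 'failback', or None for one health check result."""
        if healthy == self.primary_active:
            # The result agrees with the current state, so any streak is broken.
            self._streak = 0
            return None
        self._streak += 1
        needed = self.failure_threshold if self.primary_active else self.recovery_threshold
        if self._streak < needed:
            return None
        self._streak = 0
        self.primary_active = not self.primary_active
        return "failback" if self.primary_active else "failover"


decider = FailoverDecider(failure_threshold=2, recovery_threshold=3)
for result in [True, False, True, False, False, True, True, True]:
    action = decider.observe(result)
    if action:
        print(f"Health result {result} triggered: {action}")
```

In the monitoring loop you would call `observe(is_primary_region_healthy())` on each iteration and invoke `perform_failover()` or `perform_failback()` only when it returns an action. Using a higher recovery threshold than failure threshold makes failback more conservative than failover, which helps avoid flapping while the primary region is still unstable.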
Considerations for Production
- Data Consistency: While logical replication is powerful, ensure your application handles potential data inconsistencies during failover, especially if replication lag is significant.
- Stateful Applications: If your application stores state locally (e.g., file uploads, cache), ensure this state is also replicated or managed in a shared, multi-region accessible storage solution.
- DNS Propagation: If using DNS for failover, be mindful of DNS propagation delays. Lowering TTLs can help but increases DNS query load.
- Testing: Regularly test your failover and failback procedures. Simulate region outages to validate your automation and recovery time objectives (RTO).
- Monitoring and Alerting: Implement comprehensive monitoring for all components: infrastructure health, application performance, database replication lag, and load balancer status. Set up alerts for critical events.
- Security: Secure your API credentials, SSH keys, and database access. Use network security groups and firewalls to restrict access.
- Cost Management: Running infrastructure in multiple regions incurs higher costs. Optimize instance sizes and database plans.
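For the DNS propagation and RTO points above, a back-of-the-envelope calculation helps when choosing check intervals and TTLs. The sketch below gives a rough upper bound on how long clients may keep hitting a failed region. It assumes the outage starts just after a successful check, that each failing check runs until its timeout, and that resolvers honor the record's TTL. The function name and the 300-second TTL are illustrative assumptions; the 60-second interval and 5-second timeout match the monitoring script.

```python
def worst_case_failover_seconds(
    check_interval: float,
    failure_threshold: int,
    check_timeout: float,
    dns_ttl: float,
    api_latency: float = 0.0,
) -> float:
    """Rough upper bound on the time between an outage and traffic moving away."""
    # Each required failing check costs one wait interval plus one full timeout.
    detection = failure_threshold * (check_interval + check_timeout)
    # After detection, the API call must complete and cached DNS answers must expire.
    return detection + api_latency + dns_ttl


# One failed check triggers failover, with an assumed 300 s DNS TTL.
print(worst_case_failover_seconds(60, 1, 5, 300))
# Three consecutive failures required, with the TTL lowered to 60 s.
print(worst_case_failover_seconds(60, 3, 5, 60))
```

Comparing the two calls shows the trade-off: requiring more consecutive failures slows detection, and lowering the TTL can win that time back at the cost of more DNS queries.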
By combining Infrastructure as Code, automated deployment, robust database replication, and intelligent load balancing, you can build a highly available and resilient Python architecture on OVH Public Cloud, ensuring business continuity even in the face of regional failures.