Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Laravel Deployments on OVH

Leveraging OVH’s Public Cloud for High Availability PostgreSQL with Automatic Failover

Achieving robust disaster recovery for PostgreSQL, especially in a cloud environment like OVH, hinges on a well-architected automatic failover strategy. This isn’t about manual intervention; it’s about building a system that detects failures and seamlessly transitions operations to a standby instance with minimal to no downtime. We’ll focus on a primary-replica setup with synchronous or asynchronous replication, managed by a robust monitoring and orchestration layer.

OVH Instance and PostgreSQL Configuration for Replication

The foundation of our HA setup involves provisioning at least two OVH Public Cloud instances. One will serve as the primary PostgreSQL server, and the other as the hot standby. For optimal performance and resilience, consider instances with dedicated SSDs and sufficient RAM. Network latency between these instances is critical; placing them within the same OVH region and availability zone is highly recommended.

On the primary PostgreSQL server, we need to configure `postgresql.conf` and `pg_hba.conf` to enable streaming replication. Ensure `wal_level` is set to `replica` (or `logical` if you plan for logical replication later), `max_wal_senders` is sufficient for your replica count, and `wal_keep_segments` (or `wal_keep_size` in newer versions) is large enough to prevent the replica from falling too far behind. For synchronous replication, `synchronous_commit` should be `on` and `synchronous_standby_names` configured to point to your standby.

Primary PostgreSQL Server Configuration Snippets

Edit /etc/postgresql/[version]/main/postgresql.conf on the primary:

wal_level = replica
max_wal_senders = 10
wal_keep_size = 1024MB  # Or wal_keep_segments for older versions
synchronous_commit = on
synchronous_standby_names = 'replica1' # Name of your standby server as defined in recovery.conf/postgresql.conf
listen_addresses = '*'

Edit /etc/postgresql/[version]/main/pg_hba.conf on the primary to allow replication connections:

# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     replicator      [replica_private_ip]/32   md5

Restart PostgreSQL on the primary after these changes:

sudo systemctl restart postgresql

Setting Up the PostgreSQL Standby Server

On the standby server, we’ll use `pg_basebackup` to create an initial copy of the primary’s data directory. This command will also set up the necessary replication configuration. Ensure the user running `pg_basebackup` has replication privileges on the primary.

Performing Base Backup and Configuring Standby

First, stop PostgreSQL on the standby and ensure its data directory is empty or removed:

sudo systemctl stop postgresql
sudo rm -rf /var/lib/postgresql/[version]/main/*

Execute `pg_basebackup` from the standby server:

sudo -u postgres pg_basebackup -h [primary_private_ip] -p 5432 -U replicator -D /var/lib/postgresql/[version]/main -Fp -Xs -P -R

The -R flag is crucial here. It automatically creates the standby.signal file and a postgresql.auto.conf (or recovery.conf for older versions) with the necessary connection details and recovery settings. If you are not using synchronous replication, you might need to adjust synchronous_commit on the primary and ensure primary_conninfo is correctly set in the standby’s configuration.

Ensure the PostgreSQL user on the standby has the correct ownership of the data directory:

sudo chown -R postgres:postgres /var/lib/postgresql/[version]/main

Start PostgreSQL on the standby:

sudo systemctl start postgresql

Verify replication status on the primary using:

SELECT * FROM pg_stat_replication;

And on the standby, check the logs for “consistent recovery applied” messages.

Implementing Automatic Failover with Patroni

While PostgreSQL replication is the backbone, automatic failover requires an intelligent orchestration layer. Patroni is a leading choice for this. It’s a template for PostgreSQL HA and is designed to run on multiple nodes, managing the cluster state using a distributed configuration store like etcd, Consul, or ZooKeeper. For OVH, etcd is a common and robust choice.

Setting Up etcd Cluster

Deploy a small, highly available etcd cluster (e.g., 3 nodes) on separate OVH instances. Ensure these instances are accessible from your PostgreSQL nodes.

# Example etcdctl command to check cluster health
ETCDCTL_API=3 etcdctl endpoint health --endpoints=http://[etcd1_ip]:2379,http://[etcd2_ip]:2379,http://[etcd3_ip]:2379

Configuring and Running Patroni

Install Patroni and its dependencies (e.g., `python3-pip`, `python3-dev`, `build-essential`) on both your primary and standby PostgreSQL servers. Then, create a Patroni configuration file (e.g., patroni.yml).

# patroni.yml
scope: my_pg_cluster
namespace: /service/
restapi:
  listen: 0.0.0.0:8008
  connect_address: !!python/name:__main__:get_private_ip:
    - 8008
etcd:
  hosts:
    - [etcd1_ip]:2379
    - [etcd2_ip]:2379
    - [etcd3_ip]:2379
  protocol: http
postgresql:
  listen: 0.0.0.0:5432
  connect_address: !!python/name:__main__:get_private_ip:
    - 5432
  data_dir: /var/lib/postgresql/[version]/main
  bin_dir: /usr/lib/postgresql/[version]/bin
  config_dir: /etc/postgresql/[version]/main
  pg_hba:
    - host    replication     replicator      0.0.0.0/0       md5
    - host    all             all             0.0.0.0/0       md5
  replication:
    username: replicator
    password: your_replication_password
    ssl: false
  parameters:
    wal_level: replica
    max_wal_senders: 10
    wal_keep_size: 1024MB
    synchronous_commit: on
    synchronous_standby_names: 'replica1' # This will be managed by Patroni
  authentication:
    replication:
      username: replicator
      password: your_replication_password
    superuser:
      username: postgres
      password: your_postgres_password
  replication_mode: sync # or async

The !!python/name:__main__:get_private_ip is a placeholder for a small Python helper function that dynamically determines the instance’s private IP. You’ll need to include this helper function in your Patroni startup script or ensure Patroni can resolve it.

Start Patroni as a systemd service:

# Create a systemd service file, e.g., /etc/systemd/system/patroni.service
[Unit]
Description=Patroni PostgreSQL High-Availability
After=network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

# Then enable and start
sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni

Integrating with Laravel Application

Your Laravel application needs to be aware of the active PostgreSQL endpoint. This is typically managed via environment variables. During a failover, the DNS record or a load balancer pointing to the primary PostgreSQL instance must be updated. Patroni provides a REST API that can be queried to determine the current primary.

Dynamic Endpoint Resolution for Laravel

A common pattern is to use a service discovery mechanism or a simple script that polls Patroni’s API. For a direct Laravel integration, you can create a custom database manager or use a service that abstracts the database connection details.

Consider a simple PHP script that checks Patroni’s API:

<?php
// db_resolver.php
function get_current_pg_primary_host() {
    $patroni_api_url = 'http://[patroni_node_ip]:8008/primary'; // Use any Patroni node
    $response = @file_get_contents($patroni_api_url);

    if ($response === FALSE) {
        // Fallback or error handling
        error_log("Failed to connect to Patroni API at {$patroni_api_url}");
        return null;
    }

    $data = json_decode($response, true);

    if (json_last_error() !== JSON_ERROR_NONE || !isset($data['host'])) {
        error_log("Invalid response from Patroni API: " . $response);
        return null;
    }

    return $data['host'];
}

// Example usage in Laravel's database config (config/database.php)
// You would typically call this function to dynamically set the 'host'
// For simplicity, this is illustrative. A more robust solution might involve
// a dedicated service or a cron job updating environment variables.

// In config/database.php:
// 'pgsql' => [
//     'driver' => 'pgsql',
//     'host' => env('DB_HOST', get_current_pg_primary_host()), // Dynamic resolution
//     'port' => env('DB_PORT', 5432),
//     'database' => env('DB_DATABASE', 'your_db'),
//     'username' => env('DB_USERNAME', 'your_user'),
//     'password' => env('DB_PASSWORD', 'your_password'),
//     'charset' => 'utf8',
//     'prefix' => '',
//     'schema' => 'public',
//     'sslmode' => 'prefer',
// ],
?>

A more production-ready approach involves a load balancer (like HAProxy or Nginx) configured to point to the current primary, or a dedicated service that updates a DNS record. Patroni’s API can be used by an external script to trigger these updates.

Testing Failover Scenarios

Thorough testing is non-negotiable. Simulate various failure modes:

Stop the PostgreSQL process on the primary.
Stop the Patroni process on the primary.
Simulate network partition between primary and standby.
Reboot the primary instance.

Monitor Patroni’s logs and etcd cluster state during these tests. Verify that the standby is promoted and that your application can connect to the new primary within acceptable downtime objectives (RTO/RPO).

OVH Specific Considerations

Networking: Ensure your OVH security groups and firewall rules allow traffic between PostgreSQL nodes, etcd nodes, and your application servers on the necessary ports (5432, 2379, 8008, etc.). Use private IP addresses for inter-instance communication within the same region for better performance and security.

Instance Sizing: Over-provisioning slightly for the standby can help it catch up quickly after a failover. Monitor disk I/O and network throughput on your OVH instances.

Monitoring: Integrate OVH’s monitoring tools with your Patroni and PostgreSQL metrics. Set up alerts for replication lag, Patroni health, and etcd cluster status.

By combining PostgreSQL’s robust replication capabilities with Patroni’s intelligent orchestration and careful OVH infrastructure configuration, you can build a highly available PostgreSQL deployment with automated failover, significantly enhancing your application’s resilience.