Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and PHP Deployments on OVH
Automated PostgreSQL Failover with Patroni and OVH Public Cloud
Achieving true high availability for PostgreSQL databases, especially in a cloud environment like OVH Public Cloud, necessitates an automated failover strategy. Manual intervention during an outage is a non-starter for production systems. Patroni emerges as a robust, open-source solution for automating PostgreSQL high availability and failover. It leverages distributed consensus systems like etcd, Consul, or ZooKeeper to manage cluster state and orchestrate failover.
This guide details the architecture and implementation of an automated failover system for PostgreSQL using Patroni on OVH Public Cloud, focusing on a three-node cluster for quorum and resilience. We’ll cover the setup of Patroni, its integration with PostgreSQL, and considerations for application-level failover.
Patroni Architecture and OVH Deployment Strategy
Our strategy involves deploying three PostgreSQL instances, each running Patroni. One instance will act as the primary, while the others will be replicas. Patroni monitors the health of the primary. If it detects a failure, it initiates a consensus-based election process among the remaining nodes to promote one of the replicas to become the new primary. This process is critical for minimizing downtime.
For the distributed consensus system, we’ll opt for etcd. OVH Public Cloud allows for the deployment of multiple virtual machines (VMs) within a private network, which is ideal for hosting both the PostgreSQL cluster and the etcd cluster. We’ll deploy etcd on separate VMs to ensure its availability is independent of the PostgreSQL nodes.
Setting up the etcd Cluster
A minimum of three etcd nodes is recommended for a fault-tolerant cluster. These nodes should reside on the same OVH private network as the PostgreSQL instances.
1. Install etcd:
On each etcd node (e.g., etcd-01, etcd-02, etcd-03), download and install etcd. The exact method may vary based on your chosen OS (e.g., Ubuntu, Debian). Using package managers is generally preferred.
For example, on Ubuntu:
sudo apt update && sudo apt install etcd-server
Configuring etcd Members
Each etcd node needs a configuration file, typically located at /etc/etcd/etcd.conf.yml. Ensure the --initial-cluster and --listen-peer-urls/--listen-client-urls are correctly set to point to the private IPs of your etcd nodes.
Example /etc/etcd/etcd.conf.yml for etcd-01 (replace IPs with actual private IPs):
name: etcd-01 data-dir: /var/lib/etcd listen-peer-urls: http://10.10.0.1:2380 listen-client-urls: http://10.10.0.1:2379,http://127.0.0.1:2379 advertise-client-urls: http://10.10.0.1:2379 initial-advertise-peer-urls: http://10.10.0.1:2380 initial-cluster: etcd-01=http://10.10.0.1:2380,etcd-02=http://10.10.0.2:2380,etcd-03=http://10.10.0.3:2380 initial-cluster-state: new proxy: false
Repeat this configuration for etcd-02 and etcd-03, adjusting the name and IP addresses accordingly.
Starting and Verifying etcd
Start the etcd service and enable it to start on boot:
sudo systemctl start etcd sudo systemctl enable etcd
Verify the cluster status. On any etcd node:
ETCDCTL_API=3 etcdctl member list
You should see all three members listed. If not, check network connectivity between etcd nodes and review the configuration files and logs (journalctl -u etcd).
Setting up the PostgreSQL Cluster with Patroni
We’ll deploy three PostgreSQL instances (e.g., pg-01, pg-02, pg-03) on separate OVH VMs. Each VM will run a PostgreSQL instance and Patroni.
1. Install PostgreSQL and Patroni:
On each PostgreSQL node:
# Install PostgreSQL (e.g., PostgreSQL 14) sudo apt update && sudo apt install postgresql postgresql-contrib # Install Patroni and dependencies (e.g., python3-pip, python3-dev, build-essential) sudo apt install python3-pip python3-dev build-essential sudo pip3 install --upgrade pip sudo pip3 install "patroni[etcd]" psycopg2-binary
Configuring Patroni
Patroni’s configuration is typically stored in a YAML file, e.g., /etc/patroni/patroni.yml. This file defines the PostgreSQL settings, etcd connection details, and replication parameters.
Example /etc/patroni/patroni.yml for pg-01 (replace IPs with actual private IPs):
# Global settings
scope: my_pg_cluster
namespace: /service/
name: pg-01 # Unique name for this node
# PostgreSQL configuration
postgresql:
listen: 0.0.0.0:5432
data_dir: /var/lib/postgresql/14/main
bin_dir: /usr/lib/postgresql/14/bin
config_dir: /etc/postgresql/14/main
pg_hba: /etc/postgresql/14/main/pg_hba.conf
authentication:
replication:
username: replicator
password: YourReplicationPassword
superuser:
username: postgres
password: YourPostgresPassword
parameters:
max_connections: 100
shared_buffers: 256MB
wal_level: replica
hot_standby: "on"
max_wal_senders: 10
max_replication_slots: 10
# etcd configuration
etcd:
host: 10.10.0.1:2379,10.10.0.2:2379,10.10.0.3:2379
protocol: http
# Replication configuration
replication:
synchronous_mode: false # Set to true for synchronous replication if latency permits
synchronous_node_count: 1 # If synchronous_mode is true
# Tags for node identification
tags:
nofailover: false
clonefrom: false
Important Notes:
- Replace placeholder IPs with the actual private IPs of your etcd and PostgreSQL nodes.
- Ensure the
scopeis identical across all PostgreSQL nodes in the cluster. - The
namemust be unique for each Patroni instance. - Configure PostgreSQL’s
pg_hba.confto allow replication and superuser access from other nodes in the cluster and potentially from your application servers. - The
replicationuser and password must be set up in PostgreSQL. synchronous_mode: falseis recommended for initial setup to avoid potential blocking during network partitions. Consider enabling it later if your network guarantees low latency and high reliability.
Create a similar configuration file for pg-02 and pg-03, changing only the name parameter to pg-02 and pg-03 respectively. Ensure the etcd.host list includes all etcd nodes.
Configuring PostgreSQL for Patroni
Patroni manages PostgreSQL configuration files. However, you need to ensure PostgreSQL is set up to be managed by Patroni. This often involves creating the necessary directories and ensuring the PostgreSQL service can be started and stopped by the user running Patroni.
1. Create PostgreSQL Data Directory:
sudo mkdir -p /var/lib/postgresql/14/main sudo chown -R postgres:postgres /var/lib/postgresql/14/main
2. Configure pg_hba.conf:
Edit /etc/postgresql/14/main/pg_hba.conf on each node. Allow replication and superuser access from your private network range. For example:
# TYPE DATABASE USER ADDRESS METHOD # Allow replication connections from any node in the cluster host replication replicator 10.10.0.0/24 md5 # Allow superuser access from any node in the cluster host all postgres 10.10.0.0/24 md5 # Allow application access (adjust as needed) host all all 10.10.0.0/24 md5
3. Create Replication User:
On one of the nodes (it doesn’t matter which one initially, as Patroni will manage it), connect as the postgres user and create the replication user:
sudo -u postgres psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'YourReplicationPassword';"
Starting Patroni and PostgreSQL
Patroni can manage the PostgreSQL service. Create a systemd service file for Patroni, e.g., /etc/systemd/system/patroni.service:
[Unit] Description=Patroni - Highly Available PostgreSQL After=syslog.target network.target [Service] User=postgres Group=postgres ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml ExecStop=/bin/kill -HUP $MAINPID KillMode=process TimeoutSec=30 Restart=on-failure RestartSec=5s [Install] WantedBy=multi-user.target
Ensure the ExecStart path points to your Patroni executable. Reload systemd, start, and enable Patroni:
sudo systemctl daemon-reload sudo systemctl start patroni sudo systemctl enable patroni
Patroni will now attempt to start PostgreSQL. The first node to acquire a lock in etcd will become the initial primary. Subsequent nodes will start as replicas and connect to the primary for replication.
Verifying the PostgreSQL Cluster
Check the status of the Patroni service on each node:
sudo systemctl status patroni
You can also query etcd directly to see the cluster state:
ETCDCTL_API=3 etcdctl get /service/my_pg_cluster --prefix --keys-only
This should show keys like /service/my_pg_cluster/leader and keys for each node. The node listed as the leader is the current primary.
Connect to the PostgreSQL primary and verify replication status:
SELECT * FROM pg_stat_replication;
On the replica nodes, check the replication lag:
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
Simulating a Failover
To test the failover mechanism, you can stop the Patroni service on the current primary node. Patroni on the other nodes will detect the primary’s unavailability and initiate a new election.
On the current primary node (e.g., pg-01):
sudo systemctl stop patroni
Monitor the logs of the other Patroni instances (journalctl -u patroni -f). Within seconds, one of the replicas should be promoted, and the cluster should elect a new leader.
After the failover, connect to the new primary and verify that it’s serving read/write traffic and that the other nodes are replicating from it.
Application-Level Failover with PHP
While Patroni handles database failover, your PHP application needs to be aware of the current primary. Simply pointing your application to a single PostgreSQL IP address will result in downtime during a failover. Here are common strategies:
1. DNS-Based Failover
Maintain a DNS record (e.g., db.yourdomain.com) that always points to the IP address of the current PostgreSQL primary. When a failover occurs, you need a mechanism to update this DNS record.
Mechanism: A script or a dedicated service monitors the Patroni cluster state (e.g., by querying etcd or Patroni’s REST API) and updates the DNS record via your DNS provider’s API (e.g., OVH’s API, Cloudflare API). This requires careful consideration of DNS TTL to minimize propagation delays.
PHP Implementation: Your PHP application connects to db.yourdomain.com. No code changes are strictly necessary if DNS propagation is fast enough.
2. Load Balancer / Proxy Layer
Deploy a load balancer or proxy (e.g., HAProxy, Nginx with TCP proxying) in front of your PostgreSQL cluster. The load balancer’s health checks can be configured to monitor the PostgreSQL primary. When the primary fails, the load balancer stops sending traffic to it and directs it to the new primary.
Mechanism: HAProxy can be configured to perform health checks against PostgreSQL. Patroni can expose an API that allows external systems to query the current primary. A custom script could update HAProxy’s configuration or use its runtime API to reconfigure it upon failover.
PHP Implementation: Your PHP application connects to the load balancer’s IP address. Similar to DNS, no application code changes are needed if the load balancer is managed correctly.
3. Application-Level Connection Pooling and Retry Logic
This is often the most practical approach for PHP applications, especially when dealing with shared hosting or environments where external DNS/LB updates are complex. You can maintain a list of all PostgreSQL node IPs in your PHP application’s configuration and implement retry logic.
PHP Configuration Example (e.g., in a config file):
<?php
return [
'db' => [
'hosts' => [
'10.10.0.10', // pg-01
'10.10.0.11', // pg-02
'10.10.0.12', // pg-03
],
'port' => 5432,
'database' => 'your_database',
'username' => 'app_user',
'password' => 'YourAppPassword',
'retry_attempts' => 3,
'retry_delay_ms' => 1000, // 1 second
],
];
?>
PHP Connection Logic with Retry:
<?php
// Assuming you are using PDO
function connect_to_database(array $config) {
$hosts = $config['db']['hosts'];
$attempts = $config['db']['retry_attempts'];
$delay = $config['db']['retry_delay_ms'];
$last_exception = null;
for ($i = 0; $i <= $attempts; $i++) {
shuffle($hosts); // Randomize host order to try different ones
foreach ($hosts as $host) {
$dsn = "pgsql:host={$host};port={$config['db']['port']};dbname={$config['db']['database']}";
try {
$pdo = new PDO($dsn, $config['db']['username'], $config['db']['password'], [
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
PDO::ATTR_TIMEOUT => 5, // Connection timeout in seconds
]);
// Optionally, check if it's a read-write connection
// $pdo->query("SELECT pg_is_in_recovery()"); // This would return false on primary
return $pdo;
} catch (PDOException $e) {
$last_exception = $e;
// Log the failed attempt to a specific host
error_log("Database connection failed to {$host}: " . $e->getMessage());
}
}
// If all hosts failed in this iteration, wait before retrying
if ($i < $attempts) {
usleep($delay * 1000); // Convert ms to microseconds
}
}
// If all attempts failed
throw new RuntimeException("Failed to connect to the database after multiple attempts.", 0, $last_exception);
}
// --- Usage ---
$dbConfig = require 'config/database.php'; // Load your config
try {
$pdo = connect_to_database($dbConfig);
echo "Successfully connected to the database!\n";
// Proceed with database operations
} catch (RuntimeException $e) {
echo "Database connection error: " . $e->getMessage() . "\n";
// Handle the error, e.g., show a maintenance page
}
?>
This approach allows the application to automatically try connecting to other available PostgreSQL nodes if the current primary is unreachable. The shuffle($hosts) ensures that on each retry, a different host order is attempted, increasing the chance of hitting the new primary quickly. The PDO::ATTR_TIMEOUT is crucial to prevent long hangs.
Monitoring and Maintenance
Continuous monitoring is essential. Key metrics to track include:
- Patroni health status (via its REST API or logs).
- etcd cluster health.
- PostgreSQL replication lag.
- Disk space on database nodes.
- CPU and memory utilization.
- Network connectivity between nodes.
Tools like Prometheus with the `postgres_exporter` and a Patroni exporter, or commercial solutions, can provide comprehensive monitoring. Regularly test your failover procedure (e.g., quarterly) to ensure it functions as expected and to train your operations team.
Consider implementing automated backups using tools like pg_dump or pg_basebackup, storing them off-site or in a separate OVH storage solution. Patroni itself does not handle backups.
Conclusion
Implementing automated failover for PostgreSQL on OVH Public Cloud using Patroni provides a significant boost to your application’s availability. By combining Patroni’s robust cluster management with a well-defined application-level strategy for handling database endpoint changes, you can achieve near-zero downtime during PostgreSQL node failures. Remember that high availability is a layered approach, and while Patroni handles the database, your application’s resilience is equally critical.