Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and C++ Deployments on Linode
Establishing a High-Availability PostgreSQL Cluster
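Achieving automated failover for PostgreSQL requires a robust, multi-node architecture. We’ll focus on a streaming replication setup in which an external consensus store arbitrates leader election to prevent split-brain scenarios. This example uses Patroni, a popular template for PostgreSQL HA, with etcd as its distributed configuration store. Patroni replicates asynchronously by default; synchronous replication has to be enabled explicitly through its `synchronous_mode` setting.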
First, ensure you have three PostgreSQL nodes (e.g., `pg-primary`, `pg-replica-1`, `pg-replica-2`) and one etcd node (`etcd-01`). All nodes should be running a recent PostgreSQL version and have network connectivity.
Configuring etcd for Consensus
etcd is crucial for Patroni to manage leader election and cluster state. Install etcd on `etcd-01` and initialize a cluster. For production, consider a multi-node etcd cluster for its own HA.
On `etcd-01`:
# Install etcd (example for Ubuntu/Debian)
sudo apt update && sudo apt install etcd -y

# Configure etcd for a single-node cluster (for simplicity, production needs more)
sudo nano /etc/etcd/etcd.conf.yml
Add the following configuration to `/etc/etcd/etcd.conf.yml`:
name: etcd-01
data-dir: /var/lib/etcd
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://<ETCD_NODE_IP>:2379
listen-peer-urls: http://0.0.0.0:2380
initial-advertise-peer-urls: http://<ETCD_NODE_IP>:2380
initial-cluster: etcd-01=http://<ETCD_NODE_IP>:2380
initial-cluster-token: my-etcd-cluster
initial-cluster-state: new
Replace <ETCD_NODE_IP> with the actual IP address of `etcd-01`. Then, start and enable the etcd service:
sudo systemctl start etcd
sudo systemctl enable etcd
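To make the earlier remark about a multi-node etcd cluster concrete: etcd needs a majority of its members to agree before it accepts a write, so the member count determines how many failures the cluster survives. The helper names below (`quorumSize`, `faultTolerance`) are illustrative and not part of etcd or Patroni; this is only a sketch of the arithmetic.

```cpp
#include <cstddef>

// Votes needed for a majority in a cluster of `members` nodes.
constexpr std::size_t quorumSize(std::size_t members) {
    return members / 2 + 1;
}

// Number of members that can fail while a majority is still reachable.
constexpr std::size_t faultTolerance(std::size_t members) {
    return members == 0 ? 0 : (members - 1) / 2;
}

// A single etcd node tolerates no failures; three nodes tolerate one.
static_assert(quorumSize(1) == 1 && faultTolerance(1) == 0, "single node");
static_assert(quorumSize(3) == 2 && faultTolerance(3) == 1, "three nodes");
```

The single-node etcd used in this walkthrough is therefore a single point of failure for leader election, and an even member count adds no tolerance over the odd count just below it.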
Deploying Patroni on PostgreSQL Nodes
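Install Patroni’s build dependencies (like `python3-pip`, `python3-dev`, `build-essential`) on each PostgreSQL node, then install Patroni itself via pip. Note that the `etcd` extra talks to etcd through its v2 API, which etcd 3.4 and later disable by default. With such versions, either start etcd with `enable-v2: true` or install `patroni[etcd3]` and use an `etcd3:` section in the Patroni configuration.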
On each PostgreSQL node (e.g., `pg-primary`, `pg-replica-1`, `pg-replica-2`):
# Install dependencies (example for Ubuntu/Debian)
sudo apt update && sudo apt install python3 python3-pip python3-dev build-essential -y

# Install Patroni and psycopg2
sudo pip3 install "patroni[etcd]" psycopg2-binary
Configuring Patroni for PostgreSQL HA
Create a Patroni configuration file (e.g., `/etc/patroni/patroni.yml`) on each PostgreSQL node. The configuration needs to point to the etcd cluster and define PostgreSQL settings.
# /etc/patroni/patroni.yml
scope: my_postgres_cluster
namespace: /service/
name: <NODE_NAME>  # Must be unique per node, e.g. pg-primary

# etcd configuration
etcd:
  host: <ETCD_NODE_IP>:2379
  protocol: http

# PostgreSQL configuration
postgresql:
  listen: 0.0.0.0:5432
  connect_address: <POSTGRES_NODE_IP>:5432
  data_dir: /var/lib/postgresql/14/main  # Adjust version as needed
  bin_dir: /usr/lib/postgresql/14/bin    # Debian/Ubuntu location; adjust as needed
  pg_hba:
    - host replication replicator 0.0.0.0/0 md5  # Restrict the CIDR in production
    - host all all 0.0.0.0/0 md5
  authentication:
    replication:
      username: replicator
      password: <REPLICATION_PASSWORD>
      sslmode: prefer  # Use verify-full with certificates in production
    superuser:
      username: postgres
      password: <SUPERUSER_PASSWORD>
  parameters:
    max_connections: 100
    shared_buffers: 128MB
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 10
    max_replication_slots: 10

# Patroni REST API configuration
restapi:
  listen: 0.0.0.0:8008
  connect_address: <POSTGRES_NODE_IP>:8008

# Tags for node identification
tags:
  nofailover: false
  clonefrom: false
Key points:
- `scope`: A unique identifier for this PostgreSQL cluster.
- `namespace`: The etcd path prefix for cluster state.
- `etcd.host`: The IP and port of your etcd node.
- `postgresql.connect_address`: The address this node advertises for PostgreSQL connections.
- Replication `username`/`password`: Credentials for replication. Patroni creates this user when it bootstraps a new cluster; if you attach Patroni to an existing PostgreSQL instance, create the user yourself.
- `postgresql.parameters`: Essential PostgreSQL settings for replication.
- `restapi.connect_address`: The address at which other nodes and clients reach this node’s Patroni API.
Replace <ETCD_NODE_IP> and <POSTGRES_NODE_IP> with the respective IP addresses. If the replication user already exists, make sure it has the `REPLICATION` privilege.
Initializing the PostgreSQL Cluster with Patroni
Start Patroni as a systemd service on each PostgreSQL node. The first node to start will attempt to initialize the cluster.
Create a systemd service file (e.g., `/etc/systemd/system/patroni.service`):
[Unit]
Description=Patroni PostgreSQL High-Availability
After=network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
Reload systemd, start, and enable Patroni:
sudo systemctl daemon-reload
sudo systemctl start patroni
sudo systemctl enable patroni
Monitor the logs (`journalctl -u patroni -f`) on each node. The first node will initialize PostgreSQL, create the replication user, and become the primary. Subsequent nodes will join as replicas, cloning data if necessary.
Integrating C++ Applications with Auto-Failover
Your C++ application needs to be aware of the PostgreSQL cluster’s primary node. Instead of hardcoding a single database connection string, implement a dynamic connection strategy. A common approach is to query Patroni’s REST API to discover the current primary.
First, ensure Patroni’s REST API is accessible. You might need to configure firewall rules to allow access from your application servers to port 8008 on the PostgreSQL nodes.
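Before parsing any JSON, a client can already learn a node’s role from the HTTP status of Patroni’s health-check endpoints: `GET /primary` answers 200 only on the node that currently holds the leader lock, and `GET /replica` answers 200 only on a node running as a replica; otherwise both answer 503. The sketch below shows that mapping. `NodeRole`, `classifyNode`, and `toString` are illustrative names, and the convention that a status of 0 means "request failed" is an assumption of this example.

```cpp
#include <string>

enum class NodeRole { Primary, Replica, Unknown };

// statusPrimary: HTTP status returned by GET /primary on the node.
// statusReplica: HTTP status returned by GET /replica on the same node.
// In this sketch, 0 stands for a request that failed (timeout, refused, ...).
NodeRole classifyNode(long statusPrimary, long statusReplica) {
    if (statusPrimary == 200) {
        return NodeRole::Primary;
    }
    if (statusReplica == 200) {
        return NodeRole::Replica;
    }
    return NodeRole::Unknown;
}

// Human-readable label, e.g. for log messages.
std::string toString(NodeRole role) {
    switch (role) {
        case NodeRole::Primary: return "primary";
        case NodeRole::Replica: return "replica";
        default:                return "unknown";
    }
}
```

With libcurl, the status code of a finished request is available through `curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &code)`, where `code` is a `long`.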
C++ Client for Patroni API
Here’s a simplified C++ example using `libcurl` to fetch the primary node information. This code would typically run within your application or as a separate service that your application queries.
#include <iostream>
#include <string>
#include <curl/curl.h>
#include <nlohmann/json.hpp> // Using nlohmann/json for JSON parsing

using json = nlohmann::json;

// Callback for libcurl: appends each received chunk to the std::string in userdata.
size_t WriteCallback(char* contents, size_t size, size_t nmemb, void* userdata) {
    const size_t newLength = size * nmemb;
    try {
        static_cast<std::string*>(userdata)->append(contents, newLength);
    } catch (const std::bad_alloc&) {
        return 0; // A short count makes libcurl abort the transfer
    }
    return newLength;
}

// Queries Patroni's /cluster endpoint and returns the host of the member whose
// role is "leader", or an empty string if no leader could be determined.
// (GET /primary only answers 200/503 for the queried node and does not name the leader.)
std::string getPrimaryPostgresNode(const std::string& patroniClusterUrl) {
    std::string readBuffer;
    std::string primaryNodeAddress;

    CURL* curl = curl_easy_init();
    if (!curl) {
        std::cerr << "curl_easy_init() failed." << std::endl;
        return primaryNodeAddress;
    }
    curl_easy_setopt(curl, CURLOPT_URL, patroniClusterUrl.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);
    curl_easy_setopt(curl, CURLOPT_TIMEOUT, 5L); // 5 second timeout

    const CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    if (res != CURLE_OK) {
        std::cerr << "curl_easy_perform() failed: " << curl_easy_strerror(res) << std::endl;
        return primaryNodeAddress;
    }

    try {
        const auto data = json::parse(readBuffer);
        // Each entry in "members" describes one node (name, role, state, host, port, ...).
        for (const auto& member : data.value("members", json::array())) {
            if (member.value("role", "") == "leader") {
                primaryNodeAddress = member.value("host", "");
                break;
            }
        }
        if (primaryNodeAddress.empty()) {
            std::cerr << "No primary found in Patroni response." << std::endl;
        }
    } catch (const json::exception& e) {
        std::cerr << "Error processing JSON: " << e.what() << std::endl;
    }
    return primaryNodeAddress;
}

int main() {
    // Initialize libcurl once per process rather than once per request.
    curl_global_init(CURL_GLOBAL_DEFAULT);

    // Example: Querying the API of one of the Patroni nodes
    // In a real app, you'd have a list of Patroni nodes to query if the first fails.
    const std::string patroniClusterUrl = "http://<POSTGRES_NODE_IP_1>:8008/cluster";
    const std::string primaryDbHost = getPrimaryPostgresNode(patroniClusterUrl);

    int exitCode = 0;
    if (!primaryDbHost.empty()) {
        std::cout << "Current PostgreSQL primary is: " << primaryDbHost << std::endl;
        // Use this primaryDbHost to establish your PostgreSQL connection
        // e.g., connect_to_postgres(primaryDbHost, "5432", "user", "password", "database");
    } else {
        std::cerr << "Failed to determine PostgreSQL primary." << std::endl;
        // Implement retry logic or fail gracefully
        exitCode = 1;
    }

    curl_global_cleanup();
    return exitCode;
}
To compile this, you’ll need `libcurl` and `nlohmann/json` (a header-only library, easily integrated). For example, using g++:
g++ your_app.cpp -o your_app -lcurl -std=c++17
Replace <POSTGRES_NODE_IP_1> with the IP of one of your PostgreSQL nodes. In a production scenario, you would maintain a list of Patroni API endpoints and iterate through them until a successful response is received. This function should be called periodically or when a connection attempt fails.
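Iterating over several Patroni endpoints could look like the sketch below. The lookup is passed in as a function so the loop stays independent of libcurl; `PrimaryLookup` and `findPrimary` are illustrative names, and the convention that an empty string means "no answer" mirrors `getPrimaryPostgresNode` above.

```cpp
#include <functional>
#include <string>
#include <vector>

// Takes a Patroni API URL and returns the primary's host, or "" on failure.
using PrimaryLookup = std::function<std::string(const std::string&)>;

// Tries each Patroni endpoint in order and returns the first non-empty answer.
// Returns "" if every endpoint fails or the list is empty.
std::string findPrimary(const std::vector<std::string>& patroniUrls,
                        const PrimaryLookup& lookup) {
    for (const auto& url : patroniUrls) {
        std::string host = lookup(url);
        if (!host.empty()) {
            return host;
        }
    }
    return {};
}
```

In the application, `getPrimaryPostgresNode` can be passed directly as `lookup`, with one URL per PostgreSQL node in `patroniUrls`. Because each lookup carries its own timeout, the worst case for a full pass is roughly the number of endpoints times that timeout.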
Implementing Connection Pooling and Retry Logic
Directly querying the Patroni API on every connection attempt can be inefficient. A more robust strategy involves:
- Connection Pooling: Use a C++ connection pool library. When the pool needs to establish a new connection, it calls your `getPrimaryPostgresNode` function to get the current primary’s address.
- Retry Mechanism: If a connection to the determined primary fails, your application should:
  - Mark the current primary as potentially down.
  - Query the Patroni API again (perhaps from a different Patroni node if the first one is unresponsive).
  - Attempt to connect to the newly identified primary.
  - Implement exponential backoff for repeated failures.
- Read Replicas: For read-heavy workloads, Patroni’s REST API can also identify healthy replicas (for example, `GET /replica` returns HTTP 200 only on a node running as a replica). Your application can then distribute read queries across these replicas and direct writes only to the primary discovered via the API.
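The retry steps above can be sketched as a single loop that rediscovers the primary before every attempt and doubles the wait after each failure. Everything here is an assumption made for illustration: `RetryPolicy`, `connectWithBackoff`, and the three callbacks are hypothetical names, and the default delays are arbitrary starting points rather than recommended values.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <string>

struct RetryPolicy {
    int maxAttempts = 5;
    std::chrono::milliseconds initialDelay{200};
    std::chrono::milliseconds maxDelay{5000};
};

// discover():    returns the current primary's host, or "" if it is unknown.
// connect(host): returns true once a connection to host has been established.
// sleepFor(d):   waits for d; injected so that tests can skip real sleeping.
bool connectWithBackoff(const RetryPolicy& policy,
                        const std::function<std::string()>& discover,
                        const std::function<bool(const std::string&)>& connect,
                        const std::function<void(std::chrono::milliseconds)>& sleepFor) {
    std::chrono::milliseconds delay = policy.initialDelay;
    for (int attempt = 1; attempt <= policy.maxAttempts; ++attempt) {
        // Rediscover on every attempt: the primary may have changed since the last failure.
        const std::string host = discover();
        if (!host.empty() && connect(host)) {
            return true;
        }
        if (attempt < policy.maxAttempts) {
            sleepFor(delay);
            // Double the delay, capped at maxDelay.
            delay = std::min<std::chrono::milliseconds>(delay * 2, policy.maxDelay);
        }
    }
    return false;
}
```

A connection pool would call this whenever it has to open a new connection, passing a `discover` callback that wraps the Patroni lookup and a `sleepFor` callback that wraps `std::this_thread::sleep_for`. Adding random jitter to the delay is a common refinement when many application instances reconnect at the same time.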
Testing Failover Scenarios
Automated failover is only as good as its testing. Regularly simulate failures:
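- Stop PostgreSQL: Patroni, not the distribution’s `postgresql` systemd unit, manages the PostgreSQL process, so `sudo systemctl stop postgresql` may have no effect. Kill the postmaster process on the primary instead. Patroni normally tries to restart PostgreSQL locally first and promotes a replica only if that recovery does not succeed in time, so observe Patroni’s logs to see which path it takes. Test application connectivity.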
- Network Partition: Simulate network issues between nodes. This is where etcd’s quorum and Patroni’s logic are critical.
- Node Reboot: Reboot the primary node.
- Patroni Service Stop: Stop the Patroni service on the primary.
After each test, ensure the cluster state is stable, data integrity is maintained, and your C++ application can reconnect and resume operations seamlessly.
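To put a number on each test, a small probe can poll for the primary and report how long it took for a different node to take over. The sketch below assumes C++17 (for `std::optional`) and uses hypothetical names throughout; `discover` would wrap the Patroni lookup and `sleepFor` would wrap `std::this_thread::sleep_for`.

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <string>

// Polls discover() until it reports a primary other than oldPrimary.
// Returns the accumulated polling time, or std::nullopt if the timeout passes first.
// Only the sleep intervals are counted, so the result is accurate to about one
// poll interval plus the time spent inside discover().
std::optional<std::chrono::milliseconds> waitForNewPrimary(
        const std::string& oldPrimary,
        const std::function<std::string()>& discover,
        const std::function<void(std::chrono::milliseconds)>& sleepFor,
        std::chrono::milliseconds pollInterval,
        std::chrono::milliseconds timeout) {
    if (pollInterval <= std::chrono::milliseconds::zero()) {
        return std::nullopt; // A non-positive interval would never make progress
    }
    std::chrono::milliseconds waited{0};
    while (waited <= timeout) {
        const std::string current = discover();
        if (!current.empty() && current != oldPrimary) {
            return waited;
        }
        sleepFor(pollInterval);
        waited += pollInterval;
    }
    return std::nullopt;
}
```

Recording this value for every scenario in the list gives a baseline, so a configuration change that lengthens failover shows up in the next test run.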