Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and PHP Deployments on Google Cloud
Leveraging Google Cloud’s Managed PostgreSQL for Automated Failover
For mission-critical applications, a robust disaster recovery strategy is non-negotiable. When architecting for high availability with PostgreSQL and PHP on Google Cloud Platform (GCP), Cloud SQL for PostgreSQL’s managed failover reduces operational overhead and shortens RTO (Recovery Time Objective). If the primary instance or its zone fails, Cloud SQL automatically fails over to a standby instance in a different zone of the same region. This section details the configuration steps and considerations for setting up a highly available PostgreSQL instance.
Configuring Cloud SQL for PostgreSQL High Availability
The core of our automated failover strategy for PostgreSQL is the High Availability (HA) option on your Cloud SQL instance. Enabling it provisions a primary instance and a standby instance in a different zone of the same region. Every write is replicated synchronously to persistent disks in both zones before it is acknowledged, so a failover does not lose committed transactions. The standby does not serve traffic: it cannot be used for reads, and applications cannot connect to it directly. If you need read scaling, add read replicas, which are separate instances with their own connection details. In this initial setup, all traffic goes to the primary, and the application only has to reconnect after a failover.
Here’s how to enable HA via the `gcloud` CLI, which is ideal for infrastructure-as-code practices:
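gcloud sql instances patch YOUR_INSTANCE_NAME \
  --availability-type=REGIONAL \
  --backup-start-time=03:00 \
  --project=YOUR_PROJECT_ID
Replace YOUR_INSTANCE_NAME and YOUR_PROJECT_ID with your specific values. The --availability-type=REGIONAL flag is what enables HA. No region flag is passed here: an instance’s region is chosen when it is created, and the standby is placed in another zone of that same region. Expect a short interruption while an existing instance is reconfigured, so apply the change in a maintenance window. Setting --backup-start-time turns on scheduled automated backups, which you should have in place alongside HA: failover protects against instance and zone failures, not against data corruption or accidental deletes.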
Application-Level Failover for PHP Applications
While Cloud SQL handles the database instance failover, your PHP application still has to cope with it. An HA instance keeps the same IP address and instance connection name across a failover: Cloud SQL moves them to the new primary, and the standby never has a separately addressable endpoint. What the application does see is its existing connections being dropped and new connections failing until the failover completes.
Because the address is stable, you do not need a DNS record or a load balancer in front of the instance to track the *current* primary. Pointing the application at the instance’s IP address (preferably a private IP) or at the Cloud SQL Auth Proxy is enough.
Read replicas are a different matter. Each replica is a separate instance with its own IP address and connection name, so if you add replicas for read scaling, the application has to know which endpoint to use for reads and which for writes. You would typically manage this through application configuration or a dedicated proxy.
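As a rough illustration of handling read and write endpoints through application configuration, the sketch below builds a pdo_pgsql DSN from an array of settings. The function name and the DB_HOST, DB_READ_HOST, DB_PORT, and DB_NAME keys are hypothetical names chosen for this example, not something Cloud SQL defines. It assumes PHP 8 or later.

```php
<?php
declare(strict_types=1);

/**
 * Build a pdo_pgsql DSN for either the primary or a read replica.
 * All array keys used here are illustrative names, not a standard.
 */
function buildDsn(array $env, bool $readOnly = false): string
{
    $primary = $env['DB_HOST'] ?? '127.0.0.1';
    // Fall back to the primary when no replica endpoint is configured.
    $host = $readOnly ? ($env['DB_READ_HOST'] ?? $primary) : $primary;
    $port = $env['DB_PORT'] ?? '5432';
    $name = $env['DB_NAME'] ?? 'postgres';

    // libpq treats a host starting with "/" as a Unix socket directory,
    // e.g. /cloudsql/PROJECT:REGION:INSTANCE when a proxy exposes a socket.
    if (str_starts_with($host, '/')) {
        return "pgsql:host={$host};dbname={$name}";
    }
    return "pgsql:host={$host};port={$port};dbname={$name}";
}
```

Calling `buildDsn(getenv())` yields the DSN for writes, and `buildDsn(getenv(), true)` the one for reads. Falling back to the primary keeps the application working if the replica endpoint is removed from the configuration.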
PHP Database Connection Strategy
Your PHP application’s database connection logic must be resilient. Keep the host out of the code: read it from configuration, and point it at the instance’s stable IP address or at a local Cloud SQL Auth Proxy. Because that address survives a failover, recovery on the application side comes down to detecting a dropped connection and reconnecting to the same address.
Consider a connection pooler like PgBouncer if you have a high volume of short-lived connections, which is common in PHP because each request usually opens its own connection. For direct application connections, make sure your connection logic includes retry mechanisms. Here’s a simplified PHP example using PDO:
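To reconnect only when the connection itself was lost, and not when a query is simply wrong, one option is to inspect the SQLSTATE reported with the exception. The helper below is a minimal sketch with a hypothetical function name; the codes it checks are standard PostgreSQL SQLSTATE values (class 08 for connection exceptions, 57P01 to 57P03 for server shutdown and startup). Drivers may also report a generic code for a dropped socket, so treat this list as a starting point.

```php
<?php
declare(strict_types=1);

/**
 * Decide whether a PostgreSQL SQLSTATE suggests the connection was lost
 * (worth reconnecting) rather than the statement being at fault.
 */
function isConnectionLossSqlState(?string $sqlState): bool
{
    if ($sqlState === null || strlen($sqlState) !== 5) {
        return false;
    }
    // Class 08 covers connection exceptions (e.g. 08006 connection_failure).
    if (str_starts_with($sqlState, '08')) {
        return true;
    }
    // 57P01 admin_shutdown, 57P02 crash_shutdown, 57P03 cannot_connect_now.
    return in_array($sqlState, ['57P01', '57P02', '57P03'], true);
}
```

Inside a `catch (PDOException $e)` block, the SQLSTATE is usually available as `$e->errorInfo[0] ?? null`. A constraint violation such as 23505 returns false here, so it is not retried.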
<?php
// config.php
// DB_HOST must be an address libpq can reach: the instance's IP address,
// 127.0.0.1 when the Cloud SQL Auth Proxy listens on TCP, or the socket
// directory /cloudsql/PROJECT:REGION:INSTANCE when it listens on a Unix socket.
define('DB_HOST', getenv('DB_HOST') ?: '127.0.0.1');
define('DB_USER', getenv('DB_USER') ?: 'your_db_user');
define('DB_PASS', getenv('DB_PASS') ?: 'your_db_password');
define('DB_NAME', getenv('DB_NAME') ?: 'your_database');
define('DB_PORT', '5432'); // Default PostgreSQL port

$dsn = "pgsql:host=" . DB_HOST . ";port=" . DB_PORT . ";dbname=" . DB_NAME;
$options = [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
    PDO::ATTR_EMULATE_PREPARES => false,
];

$pdo = null;
$max_retries = 5;
$retry_delay_ms = 2000; // 2 seconds

// One initial attempt plus up to $max_retries retries
for ($i = 0; $i <= $max_retries; $i++) {
    try {
        $pdo = new PDO($dsn, DB_USER, DB_PASS, $options);
        break; // Exit loop on success
    } catch (PDOException $e) {
        if ($i === $max_retries) {
            // Log the error and potentially trigger an alert
            error_log("Database connection failed after {$max_retries} retries: " . $e->getMessage());
            // In a real application, you might redirect to an error page or show a maintenance message
            http_response_code(503);
            die("Database connection failed. Please try again later.");
        }
        // Log to the error log rather than echoing into the HTTP response
        error_log("Connection failed. Retrying... (" . ($i + 1) . "/" . $max_retries . ")");
        // Wait before retrying
        usleep($retry_delay_ms * 1000); // usleep expects microseconds
    }
}

// Now $pdo is your database connection object
// Example query:
// $stmt = $pdo->query('SELECT version()');
// $row = $stmt->fetch();
// print_r($row);
?>
In this example, DB_HOST has to be an address that the PostgreSQL client library can reach. PDO cannot resolve a Cloud SQL instance connection name (e.g., your-project:us-central1:your-instance) on its own. That name is what you pass to the Cloud SQL Auth Proxy, which then exposes the instance on 127.0.0.1 or on a Unix socket under /cloudsql/ and handles authentication and encryption using the service account’s IAM permissions. Connecting directly to the instance’s private IP also works. Either way, the retry logic is crucial for handling transient network issues or the brief unavailability during a failover event.
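A fixed two-second delay works, but when many PHP workers lose their connections at the same moment they also retry at the same moment. A common refinement is exponential backoff with random jitter. The following is a minimal sketch of that technique, not a drop-in library; the function name and default values are assumptions made for illustration, and it requires PHP 8 or later.

```php
<?php
declare(strict_types=1);

/**
 * Call $connect until it succeeds or $maxAttempts is reached, sleeping with
 * exponential backoff plus random jitter between attempts.
 * $sleep is injectable so the function can be exercised without real delays.
 */
function connectWithBackoff(
    callable $connect,
    int $maxAttempts = 5,
    int $baseDelayMs = 500,
    int $maxDelayMs = 8000,
    ?callable $sleep = null
): mixed {
    $sleep ??= static fn (int $ms) => usleep($ms * 1000);

    for ($attempt = 1; ; $attempt++) {
        try {
            return $connect();
        } catch (RuntimeException $e) { // PDOException extends RuntimeException
            if ($attempt >= $maxAttempts) {
                throw $e; // Out of attempts: let the caller decide what to do.
            }
            // The ceiling doubles each attempt (500, 1000, 2000, ...) up to the cap.
            $ceiling = min($maxDelayMs, $baseDelayMs * (2 ** ($attempt - 1)));
            // Full jitter: sleep a random time between 0 and the ceiling.
            $sleep(random_int(0, $ceiling));
        }
    }
}
```

With the variables from the example above, usage could look like `$pdo = connectWithBackoff(fn () => new PDO($dsn, DB_USER, DB_PASS, $options));`. The jitter spreads reconnect attempts out so that the new primary is not hit by every worker at once right after a failover.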
Deploying PHP Applications for High Availability
Your PHP application deployment strategy also plays a role in resilience. Deploying multiple instances of your PHP application across different zones within the same region as your Cloud SQL instance is a standard practice. This ensures that if one zone experiences an outage, other instances in different zones can continue serving traffic.
Google Kubernetes Engine (GKE) is an excellent platform for this. You can configure your deployments to have Pods spread across multiple nodes in different zones. A Google Cloud Load Balancer (HTTP(S) Load Balancer or Network Load Balancer) can then distribute incoming traffic to these healthy Pods.
GKE Deployment Example
Here’s a simplified GKE deployment manifest that targets multiple zones and uses a Service to expose the application, which would then be fronted by a Google Cloud Load Balancer.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-php-app
  labels:
    app: php
spec:
  replicas: 3 # Start with 3 replicas, ideally one per zone
  selector:
    matchLabels:
      app: php
  template:
    metadata:
      labels:
        app: php
    spec:
      # Spread Pods evenly across zones; DoNotSchedule makes this a hard requirement
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: php
      containers:
        - name: php-apache
          image: php:8.1-apache # Replace with your own image; it must include the pdo_pgsql extension
          ports:
            - containerPort: 80
          env:
            - name: DB_HOST
              value: "127.0.0.1" # Assumes a Cloud SQL Auth Proxy sidecar (not shown); otherwise use the instance's private IP
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: username
            - name: DB_PASS
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
            - name: DB_NAME
              value: "your_database"
          # Add readiness and liveness probes for robust health checking
          readinessProbe:
            httpGet:
              path: /healthz # A simple endpoint that checks DB connection
              port: 80
            initialDelaySeconds: 15
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz # Consider a DB-independent endpoint here so a failover does not restart every Pod
              port: 80
            initialDelaySeconds: 30
            periodSeconds: 20
---
apiVersion: v1
kind: Service
metadata:
  name: my-php-app-service
spec:
  selector:
    app: php
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: ClusterIP # Exposed through the Ingress below; GKE Ingress needs container-native load balancing (NEGs) for ClusterIP, otherwise use NodePort
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-php-app-ingress
  annotations:
    kubernetes.io/ingress.class: "gce" # For Google Cloud Load Balancer
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-php-app-service
                port:
                  number: 80
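The topologySpreadConstraints block, with maxSkew: 1 and whenUnsatisfiable: DoNotSchedule, requires the scheduler to keep the number of Pods in each zone within one of each other. This only helps if the cluster is regional or has nodes in several zones. The readinessProbe and livenessProbe are critical. A readiness probe that checks database connectivity will prevent traffic from being sent to a Pod that cannot reach the database, especially during a failover event before the application has re-established its connection. Be careful about pointing the liveness probe at the same database check, though: during a failover every Pod would fail it and be restarted, which slows recovery instead of helping it. A liveness check that only confirms the PHP process is responding is usually safer.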
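The readiness endpoint itself can be very small. The sketch below shows one possible shape for the logic behind /healthz; the function name and the JSON body are assumptions for illustration. The probe is passed in as a callable, which keeps the function independent of how the connection is created.

```php
<?php
declare(strict_types=1);

/**
 * Run a database probe and translate the outcome into an HTTP status
 * code and a small JSON body for a readiness endpoint.
 *
 * @param callable $probe Runs a cheap query and throws on failure.
 * @return array{0:int,1:string} [status code, JSON body]
 */
function readinessResponse(callable $probe): array
{
    try {
        $probe();
        return [200, json_encode(['status' => 'ok'])];
    } catch (Throwable $e) {
        // A non-2xx/3xx status fails the HTTP probe, so the Pod is taken out
        // of rotation. Details go to the log, not to the response body.
        error_log('Readiness probe failed: ' . $e->getMessage());
        return [503, json_encode(['status' => 'unavailable'])];
    }
}
```

In the script serving /healthz, this might be used as `[$code, $body] = readinessResponse(fn () => $pdo->query('SELECT 1')); http_response_code($code); echo $body;`. Keeping the probe query trivial matters because the kubelet calls the endpoint every few seconds on every Pod.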
Monitoring and Alerting
Automated failover is only effective if you are alerted when it occurs and can verify its success. Google Cloud’s operations suite (formerly Stackdriver) provides robust monitoring and alerting for Cloud SQL and GKE.
- Cloud SQL Metrics: Monitor `cloudsql.googleapis.com/database/cpu/utilization`, `cloudsql.googleapis.com/database/disk/bytes_used`, and, if you run read replicas, `cloudsql.googleapis.com/database/replication/replica_lag` (it does not apply to the synchronous HA standby). For HA itself, `cloudsql.googleapis.com/database/available_for_failover` indicates whether a failover is currently possible. Set up alerts for high CPU, low disk space, and any metrics indicating instance health degradation.
- Cloud SQL Logs: Enable query insights and audit logging. Monitor PostgreSQL logs for errors that might precede a failure.
- GKE Health Checks: Ensure your application’s health check endpoints are reliable and accurately reflect the application’s ability to connect to the database.
- GKE Node and Pod Health: Monitor GKE node status and Pod restarts.
- Custom Alerts: Implement custom alerts for critical database operations or application errors that might indicate a problem even if the failover itself was successful. For instance, an alert if the number of active database connections drops significantly or if specific error rates spike in your application logs.
By combining Cloud SQL’s managed HA with resilient application design and deployment patterns on GKE, you can achieve a robust, automated disaster recovery solution for your PostgreSQL and PHP applications on Google Cloud.