Server Monitoring Best Practices: Keeping Your Laravel App and DynamoDB Clusters Alive on DigitalOcean
Proactive Laravel Application Health Checks
Maintaining the health of a Laravel application goes beyond simply checking if the web server is responding. We need to ensure the application itself is functioning correctly, processing requests efficiently, and not experiencing internal errors. This involves implementing deep health checks that can be integrated into your monitoring stack.
A robust health check endpoint should verify several critical components:
- Database connectivity and basic query execution.
- Cache driver accessibility.
- Queue worker status (though this is often a separate, more involved check).
- Key external API dependencies (if applicable).
- Application-level errors (e.g., recent exceptions).
Let’s create a custom health check route in Laravel. This route will be polled by our monitoring system.
Implementing the Laravel Health Check Endpoint
First, define a route in routes/api.php (or routes/web.php if you prefer, but API routes are generally better for machine-to-machine communication).
<?php
// routes/api.php

use App\Http\Controllers\HealthCheckController;
use Illuminate\Support\Facades\Route;

// Routes in this file get the /api prefix by default, so this is served at /api/health.
Route::get('/health', [HealthCheckController::class, 'index']);
Next, create the HealthCheckController.
<?php
// app/Http/Controllers/HealthCheckController.php

namespace App\Http\Controllers;

use Illuminate\Http\JsonResponse;
use Illuminate\Routing\Controller as BaseController;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;
use Throwable;

class HealthCheckController extends BaseController
{
    public function index(): JsonResponse
    {
        $status = 'healthy';
        $checks = [];

        // 1. Database check: open the connection and run a trivial query.
        try {
            DB::connection()->getPdo();
            DB::select('SELECT 1');
            $checks['database'] = 'ok';
        } catch (Throwable $e) {
            $status = 'unhealthy';
            $checks['database'] = 'error: ' . $e->getMessage();
            Log::error('Database connection failed for health check', ['exception' => $e]);
        }

        // 2. Cache check: write a throwaway key, read it back, then remove it.
        try {
            $cacheKey = 'health_check_cache_test_' . uniqid();
            Cache::put($cacheKey, 'test', 10); // TTL in seconds (Laravel 5.8+)
            if (Cache::get($cacheKey) !== 'test') {
                throw new \RuntimeException('Cache read/write failed');
            }
            Cache::forget($cacheKey); // Clean up
            $checks['cache'] = 'ok';
        } catch (Throwable $e) {
            $status = 'unhealthy';
            $checks['cache'] = 'error: ' . $e->getMessage();
            Log::error('Cache connection/operation failed for health check', ['exception' => $e]);
        }

        // 3. Add more checks as needed (e.g., external APIs, specific service availability)
        // Example: External API check (simplified)
        /*
        try {
            $client = new \GuzzleHttp\Client();
            $response = $client->request('GET', config('services.external_api.url') . '/health', ['timeout' => 5]);
            if ($response->getStatusCode() === 200) {
                $checks['external_api'] = 'ok';
            } else {
                throw new \Exception('External API returned non-200 status');
            }
        } catch (Throwable $e) {
            $status = 'unhealthy';
            $checks['external_api'] = 'error: ' . $e->getMessage();
            Log::error('External API health check failed', ['exception' => $e]);
        }
        */

        // 4. Application Exception Check (e.g., check logs for recent critical errors)
        // This is more complex and might involve parsing log files or using a dedicated logging service.
        // For simplicity, we'll assume a basic check or rely on external monitoring of logs.
        // A more advanced approach might query a logging service like Elasticsearch or Datadog.

        return response()->json([
            'status' => $status,
            'checks' => $checks,
            'timestamp' => now()->toIso8601String(),
        ], $status === 'healthy' ? 200 : 503); // 503 Service Unavailable for unhealthy
    }
}
Ensure your .env file has the correct database and cache configurations. For production, you’ll likely be using Redis for caching and a managed database service like DigitalOcean Managed Databases (PostgreSQL/MySQL).
Monitoring the Health Endpoint with UptimeRobot/Prometheus
You can use external services like UptimeRobot for basic HTTP checks, but for more granular insights and integration with your alerting system, Prometheus is a standard choice. You’ll need a Prometheus exporter that can scrape your Laravel application’s health endpoint.
A simple approach is to use a generic HTTP exporter or write a small custom exporter. For instance, you could deploy a small Python application using the prometheus_client library that periodically scrapes your health endpoint and exposes metrics to Prometheus.
# exporter/app.py
import os
import time

import requests
from prometheus_client import Counter, Gauge, start_http_server

HEALTH_URL = os.environ.get("LARAVEL_HEALTH_URL", "http://localhost/api/health")
EXPORTER_PORT = int(os.environ.get("EXPORTER_PORT", "9101"))
# Checks the Laravel endpoint reports; add "external_api" once that check is enabled.
KNOWN_CHECKS = ("database", "cache")

# Metrics
app_health_status = Gauge(
    "laravel_app_health_status",
    "Laravel application health status (1 for healthy, 0 for unhealthy)",
    ["check"],
)
# A Counter rather than a Gauge, because this value only ever increases.
app_health_checks_total = Counter(
    "laravel_app_health_checks_total",
    "Total number of health checks performed",
    ["check"],
)


def scrape_health_endpoint():
    # Start from "unhealthy" so a failed scrape never leaves stale healthy values.
    statuses = {name: 0 for name in ("overall",) + KNOWN_CHECKS}
    try:
        response = requests.get(HEALTH_URL, timeout=10)
        # The endpoint answers 503 when unhealthy but still returns the JSON body,
        # so only other error codes count as a failed scrape.
        if response.status_code not in (200, 503):
            response.raise_for_status()
        data = response.json()
        statuses["overall"] = 1 if data.get("status") == "healthy" else 0
        for check_name, check_status in data.get("checks", {}).items():
            statuses[check_name] = 1 if check_status == "ok" else 0
            if check_status != "ok":
                print(f"Health check failed for {check_name}: {check_status}")
        print(f"Scraped {HEALTH_URL}. Status: {data.get('status')}")
    except requests.exceptions.RequestException as e:
        print(f"Error scraping health endpoint {HEALTH_URL}: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

    for check_name, value in statuses.items():
        app_health_status.labels(check=check_name).set(value)
        app_health_checks_total.labels(check=check_name).inc()


if __name__ == "__main__":
    print(f"Starting Prometheus exporter on port {EXPORTER_PORT}")
    start_http_server(EXPORTER_PORT)
    print(f"Scraping Laravel health endpoint at {HEALTH_URL}")
    while True:
        scrape_health_endpoint()
        time.sleep(60)  # Scrape every 60 seconds
Deploy this exporter on a separate DigitalOcean Droplet or within your Kubernetes cluster, and configure Prometheus to scrape its metrics endpoint (e.g., http://exporter-ip:9101/metrics).
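Before pointing Prometheus at the exporter, it can help to confirm that the /metrics output contains what you expect. The following is a minimal sketch using only the standard library; the helper names parse_health_status and failing_checks are illustrative, and the regular expression assumes the single check label used by the exporter above.

```python
import re

# Matches samples such as: laravel_app_health_status{check="database"} 1.0
_SAMPLE = re.compile(r'^laravel_app_health_status\{check="([^"]+)"\}\s+(\S+)$')


def parse_health_status(metrics_text: str) -> dict:
    """Return {check_name: value} for every laravel_app_health_status sample."""
    statuses = {}
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if match:
            statuses[match.group(1)] = float(match.group(2))
    return statuses


def failing_checks(metrics_text: str) -> list:
    """List the checks whose gauge is not 1, sorted by name."""
    statuses = parse_health_status(metrics_text)
    return sorted(name for name, value in statuses.items() if value != 1.0)


if __name__ == "__main__":
    # Hypothetical exporter output; in practice, fetch http://exporter-ip:9101/metrics.
    sample = (
        "# TYPE laravel_app_health_status gauge\n"
        'laravel_app_health_status{check="overall"} 0.0\n'
        'laravel_app_health_status{check="database"} 1.0\n'
        'laravel_app_health_status{check="cache"} 0.0\n'
    )
    print(failing_checks(sample))  # ['cache', 'overall']
```

Comment and TYPE lines are skipped because they never match the sample pattern, so the same function works on the full exporter output.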
DynamoDB Cluster Health and Performance Monitoring
DynamoDB is a fully managed service, so there are no nodes to watch; monitoring happens at the table and index level and covers both operational health (availability) and performance (throughput, latency, errors). AWS CloudWatch is the primary source for these metrics, and we'll pull them into our monitoring stack via Prometheus.
Key DynamoDB Metrics to Monitor
Focus on these critical metrics, available via AWS CloudWatch:
- ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits: Track actual usage against provisioned capacity. Spikes can indicate performance issues or inefficient queries.
- ReadThrottleEvents / WriteThrottleEvents: Crucial for identifying when requests are being throttled due to exceeding provisioned capacity. This directly impacts application performance.
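- SuccessfulRequestLatency: Measures the latency of successful requests, reported in milliseconds and broken down per operation (GetItem, Query, PutItem, and so on). High latency indicates potential issues within DynamoDB or network problems.
- SystemErrors: Count of requests that failed with an HTTP 500 error inside DynamoDB, reported per table and operation. Any non-zero value requires immediate investigation.
- UserErrors: Count of requests rejected with an HTTP 400 error (e.g., invalid parameters or an incorrect request signature). CloudWatch aggregates this metric per account and Region rather than per table, and failed conditional writes are reported separately as ConditionalCheckFailedRequests. While some user errors are expected, a sudden surge can point to application bugs.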
- ReturnedItemCount: Useful for understanding the volume of data being returned by queries.
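The consumed-capacity metrics are easiest to reason about as a rate. CloudWatch reports them as a Sum over the period, so dividing by the period length gives average units per second, which can then be compared with provisioned capacity. The sketch below shows the arithmetic; the function names and the sample figures are illustrative, and it only applies to tables in provisioned capacity mode.

```python
def consumed_per_second(period_sum: float, period_seconds: int) -> float:
    """Average capacity units consumed per second over one CloudWatch period."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return period_sum / period_seconds


def utilization_percent(period_sum: float, period_seconds: int, provisioned_units: float) -> float:
    """Consumed capacity as a percentage of provisioned capacity."""
    if provisioned_units <= 0:
        raise ValueError("provisioned_units must be positive")
    return 100.0 * consumed_per_second(period_sum, period_seconds) / provisioned_units


if __name__ == "__main__":
    # Hypothetical table: 27,000 read units consumed in a 300-second period,
    # against 100 provisioned read capacity units.
    print(utilization_percent(27_000, 300, 100))  # 90.0
```

A table that sits near 100% for long stretches is a candidate for throttling, which is why the throttle metrics are worth alerting on directly.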
Integrating CloudWatch Metrics with Prometheus
To bring CloudWatch metrics into Prometheus, we use the cloudwatch_exporter. This allows Prometheus to scrape metrics exposed by the exporter, which in turn pulls data from CloudWatch.
First, set up the cloudwatch_exporter. You’ll need AWS credentials configured (e.g., via environment variables or an IAM role if running on EC2/ECS).
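# cloudwatch_exporter/config.yml
# Example configuration for DynamoDB metrics, in the format used by the official
# Prometheus cloudwatch_exporter (one entry per CloudWatch metric).
# Replace your-laravel-table-name with your actual table name.
---
region: us-east-1 # Or your AWS region
metrics:
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ConsumedReadCapacityUnits
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Sum]
    period_seconds: 300 # 5 minutes
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ConsumedWriteCapacityUnits
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Sum]
    period_seconds: 300
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ReadThrottleEvents
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Sum]
    period_seconds: 60 # 1 minute for throttles
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: WriteThrottleEvents
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Sum]
    period_seconds: 60
  # The next three metrics are reported per table and per operation.
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: SuccessfulRequestLatency
    aws_dimensions: [TableName, Operation]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Average, Maximum]
    period_seconds: 300
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: SystemErrors
    aws_dimensions: [TableName, Operation]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Sum]
    period_seconds: 60
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ReturnedItemCount
    aws_dimensions: [TableName, Operation]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Sum]
    period_seconds: 300
  # UserErrors is aggregated per account and Region, so it has no table dimension.
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: UserErrors
    aws_statistics: [Sum]
    period_seconds: 60
  # Add more tables or global secondary indexes as needed.
  # Example for a specific index:
  # - aws_namespace: AWS/DynamoDB
  #   aws_metric_name: ConsumedReadCapacityUnits
  #   aws_dimensions: [TableName, GlobalSecondaryIndexName]
  #   aws_dimension_select:
  #     TableName: [your-laravel-table-name]
  #     GlobalSecondaryIndexName: [your-index-name]
  #   aws_statistics: [Sum]
  #   period_seconds: 300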
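Deploy the cloudwatch_exporter (e.g., as a Docker container) and configure Prometheus to scrape its metrics endpoint. The official exporter listens on port 9106 by default (e.g., http://cloudwatch-exporter-ip:9106/metrics), which also keeps it clear of Node Exporter's port 9100.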
Alerting on DynamoDB Throttles and Latency
Alerting is paramount. Define alerting rules in Prometheus that fire on specific thresholds; Alertmanager then takes care of grouping, routing, and notification.
# prometheus/alert.rules.yml
# Metric and label names assume the official Prometheus cloudwatch_exporter, which
# exposes each CloudWatch statistic as a gauge named aws_dynamodb_<metric>_<statistic>
# with a table_name label. Because these are gauges, rate() is not applied to them.
groups:
  - name: dynamodb_alerts
    rules:
      - alert: HighDynamoDBReadThrottleRate
        expr: sum by (table_name) (aws_dynamodb_read_throttle_events_sum{job="cloudwatch_exporter"}) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Read throttling detected on DynamoDB table {{ $labels.table_name }}"
          description: "DynamoDB table {{ $labels.table_name }} has reported read throttle events continuously for the last 5 minutes."
      - alert: HighDynamoDBWriteThrottleRate
        expr: sum by (table_name) (aws_dynamodb_write_throttle_events_sum{job="cloudwatch_exporter"}) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Write throttling detected on DynamoDB table {{ $labels.table_name }}"
          description: "DynamoDB table {{ $labels.table_name }} has reported write throttle events continuously for the last 5 minutes."
      - alert: HighDynamoDBReadLatency
        # CloudWatch reports SuccessfulRequestLatency in milliseconds, so 500 means 0.5 seconds.
        expr: max by (table_name) (aws_dynamodb_successful_request_latency_maximum{job="cloudwatch_exporter", operation=~"GetItem|BatchGetItem|Query|Scan"}) > 500
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High DynamoDB read latency on table {{ $labels.table_name }}"
          description: "DynamoDB table {{ $labels.table_name }} has a maximum read latency exceeding 0.5 seconds for the last 10 minutes."
      - alert: DynamoDBSystemErrors
        expr: sum by (table_name) (aws_dynamodb_system_errors_sum{job="cloudwatch_exporter"}) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "System errors detected in DynamoDB table {{ $labels.table_name }}"
          description: "DynamoDB table {{ $labels.table_name }} is reporting system errors."
These alerts should be routed through Alertmanager to your preferred notification channels (Slack, PagerDuty, email).
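If one of your channels is a custom webhook receiver, Alertmanager POSTs a JSON document containing an alerts array, where each alert carries its status, labels, and annotations. As a rough sketch of what such a receiver might do with that payload, the function below turns it into one-line messages. The function name format_alerts and the sample payload are illustrative; the HTTP server around it is left out.

```python
def format_alerts(payload: dict) -> list:
    """Turn an Alertmanager webhook payload into one short line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        severity = labels.get("severity", "none")
        name = labels.get("alertname", "unnamed")
        summary = annotations.get("summary", "(no summary)")
        lines.append(f"[{status}] {severity} {name}: {summary}")
    return lines


if __name__ == "__main__":
    # Hypothetical payload, trimmed to the fields used above.
    payload = {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "DynamoDBSystemErrors", "severity": "critical"},
                "annotations": {"summary": "System errors detected in DynamoDB table orders"},
            }
        ],
    }
    for line in format_alerts(payload):
        print(line)
```

Keeping the formatting in a pure function like this makes it easy to unit test the message text separately from whatever transport delivers it.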
Server Resource Monitoring on DigitalOcean Droplets
For your Laravel application servers (DigitalOcean Droplets), standard system resource monitoring is essential. Node Exporter is the de facto standard for exposing host-level metrics to Prometheus.
Deploying Node Exporter
Node Exporter can be deployed as a systemd service or a Docker container. Here’s a typical systemd service file:
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
# Runs as the prometheus user; a dedicated user works as well.
# (systemd does not allow comments on the same line as a setting.)
User=prometheus
ExecStart=/usr/local/bin/node_exporter --web.listen-address=":9100"

[Install]
WantedBy=multi-user.target
After placing this file, enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Configure Prometheus to scrape the node_exporter targets (e.g., http://your-droplet-ip:9100/metrics).
Key Server Metrics and Alerts
Focus on:
- CPU Usage (node_cpu_seconds_total): Monitor overall CPU load and per-core usage. High sustained CPU can indicate inefficient code or insufficient resources.
- Memory Usage (node_memory_MemAvailable_bytes, node_memory_MemTotal_bytes): Track available memory. Running out of memory leads to swapping and severe performance degradation.
- Disk I/O (node_disk_io_time_seconds_total): Monitor disk read/write activity. High I/O wait times can bottleneck applications, especially databases.
- Network Traffic (node_network_receive_bytes_total, node_network_transmit_bytes_total): Track network throughput.
- Load Average (node_load1, node_load5, node_load15): A general indicator of system load.
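Most of these metrics are counters, so what matters is how fast they grow rather than their absolute value. CPU usage is the usual example: node_cpu_seconds_total{mode="idle"} counts idle seconds per core, and usage is whatever fraction of the window was not idle, averaged across cores. The sketch below reproduces that arithmetic by hand; the function name and sample values are illustrative.

```python
def cpu_usage_percent(idle_start: dict, idle_end: dict, elapsed_seconds: float) -> float:
    """CPU usage over a window, from two samples of the per-core idle counter.

    idle_start and idle_end map a CPU id to its
    node_cpu_seconds_total{mode="idle"} value at the start and end of the window.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    cpus = idle_start.keys() & idle_end.keys()
    if not cpus:
        raise ValueError("no CPU appears in both samples")
    # Per-core idle fraction: idle seconds gained divided by wall-clock seconds.
    idle_fractions = [(idle_end[c] - idle_start[c]) / elapsed_seconds for c in cpus]
    return 100.0 - 100.0 * sum(idle_fractions) / len(idle_fractions)


if __name__ == "__main__":
    # Hypothetical two-core Droplet sampled 300 seconds apart:
    # core 0 was idle for 75 s, core 1 for 150 s.
    start = {"0": 1000.0, "1": 2000.0}
    end = {"0": 1075.0, "1": 2150.0}
    print(cpu_usage_percent(start, end, 300.0))  # 62.5
```

Prometheus does the same thing with rate() over the idle counter, which is why CPU alerts are written as 100 minus the averaged idle rate.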
Example Prometheus alerts for server resources:
# prometheus/alert.rules.yml
# If you append to the existing rules file, add only the server_alerts group
# under its existing groups: key rather than repeating the key.
groups:
  - name: server_alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} has been above 90% for the last 10 minutes."
      - alert: LowAvailableMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Low available memory on {{ $labels.instance }}"
          description: "Available memory on {{ $labels.instance }} has been below 10% for the last 15 minutes."
      - alert: HighDiskUtilization
        # node_disk_io_time_seconds_total counts seconds the device spent busy with I/O,
        # so its rate is the fraction of time the disk was busy. Droplets typically
        # expose the root disk as vda; adjust the device and threshold as needed.
        expr: rate(node_disk_io_time_seconds_total{device="vda"}[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High disk utilization on {{ $labels.instance }}"
          description: "Disk {{ $labels.device }} on {{ $labels.instance }} has been busy with I/O more than 80% of the time for the last 10 minutes."
These alerts, coupled with the application and database monitoring, provide a comprehensive view of your system’s health, enabling proactive intervention before issues impact users.