Server Monitoring Best Practices: Keeping Your Laravel App and DynamoDB Clusters Alive on DigitalOcean

Proactive Laravel Application Health Checks

Maintaining the health of a Laravel application goes beyond simply checking if the web server is responding. We need to ensure the application itself is functioning correctly, processing requests efficiently, and not experiencing internal errors. This involves implementing deep health checks that can be integrated into your monitoring stack.

A robust health check endpoint should verify several critical components:

Database connectivity and basic query execution.
Cache driver accessibility.
Queue worker status (though this is often a separate, more involved check).
Key external API dependencies (if applicable).
Application-level errors (e.g., recent exceptions).

Let’s create a custom health check route in Laravel. This route will be polled by our monitoring system.

Implementing the Laravel Health Check Endpoint

First, define a route in routes/api.php (or routes/web.php if you prefer, but API routes are generally better for machine-to-machine communication).

// routes/api.php
use App\Http\Controllers\HealthCheckController;
use Illuminate\Support\Facades\Route;

Route::get('/health', [HealthCheckController::class, 'index']);

Next, create the HealthCheckController.

// app/Http/Controllers/HealthCheckController.php
namespace App\Http\Controllers;

use Illuminate\Http\JsonResponse;
use Illuminate\Routing\Controller as BaseController;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;
use Throwable;

class HealthCheckController extends BaseController
{
    public function index(): JsonResponse
    {
        $status = 'healthy';
        $checks = [];

        // 1. Database Check: open the connection, then run a trivial query
        try {
            DB::connection()->getPdo();
            DB::select('SELECT 1');
            $checks['database'] = 'ok';
        } catch (Throwable $e) {
            $status = 'unhealthy';
            $checks['database'] = 'error: ' . $e->getMessage();
            Log::error('Database connection failed for health check', ['exception' => $e]);
        }

        // 2. Cache Check
        try {
            $cacheKey = 'health_check_cache_test_' . uniqid();
            Cache::put($cacheKey, 'test', 10); // TTL in seconds (Laravel 5.8+)
            if (Cache::get($cacheKey) === 'test') {
                $checks['cache'] = 'ok';
                Cache::forget($cacheKey); // Clean up
            } else {
                throw new \Exception('Cache read/write failed');
            }
        } catch (Throwable $e) {
            $status = 'unhealthy';
            $checks['cache'] = 'error: ' . $e->getMessage();
            Log::error('Cache connection/operation failed for health check', ['exception' => $e]);
        }

        // 3. Add more checks as needed (e.g., external APIs, specific service availability)
        // Example: External API check (simplified)
        /*
        try {
            $client = new \GuzzleHttp\Client();
            $response = $client->request('GET', config('services.external_api.url') . '/health', ['timeout' => 5]);
            if ($response->getStatusCode() === 200) {
                $checks['external_api'] = 'ok';
            } else {
                throw new \Exception('External API returned non-200 status');
            }
        } catch (Throwable $e) {
            $status = 'unhealthy';
            $checks['external_api'] = 'error: ' . $e->getMessage();
            Log::error('External API health check failed', ['exception' => $e]);
        }
        */

        // 4. Application Exception Check (e.g., check logs for recent critical errors)
        // This is more complex and might involve parsing log files or using a dedicated logging service.
        // For simplicity, we'll assume a basic check or rely on external monitoring of logs.
        // A more advanced approach might query a logging service like Elasticsearch or Datadog.

        return response()->json([
            'status' => $status,
            'checks' => $checks,
            'timestamp' => now()->toIso8601String(),
        ], $status === 'healthy' ? 200 : 503); // 503 Service Unavailable for unhealthy
    }
}
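
To show how a client might interpret this response, here is a minimal Python sketch. It is not part of the Laravel application; it assumes the JSON shape returned by the controller above, and the helper names failing_checks and fetch_health are hypothetical.

```python
import json
import urllib.error
import urllib.request


def failing_checks(payload: dict) -> list:
    """Return the names of checks whose value is anything other than 'ok'."""
    checks = payload.get("checks", {})
    return sorted(name for name, value in checks.items() if value != "ok")


def fetch_health(url: str, timeout: float = 5.0):
    """Fetch the endpoint and return (HTTP status, decoded JSON body).

    urllib raises HTTPError for the 503 that an unhealthy app returns, so in
    that case the JSON body is read from the exception object instead.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as error:
        return error.code, json.loads(error.read())


# Example with a payload shaped like the controller's unhealthy response
sample = {
    "status": "unhealthy",
    "checks": {"database": "ok", "cache": "error: Connection refused"},
}
print(failing_checks(sample))
```

Against a running application, fetch_health("http://localhost/api/health") (an assumed URL) would return the status code and body to pass to failing_checks, which is useful in a deploy script that should stop when any dependency is down.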

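Ensure your .env file has the correct database and cache configurations. For production, you’ll likely be using Redis for caching and a managed database service like DigitalOcean Managed Databases (PostgreSQL/MySQL). Because the response includes raw exception messages, restrict access to the endpoint (for example, with a firewall rule or an IP allow list) so that connection details are not exposed publicly.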

Monitoring the Health Endpoint with UptimeRobot/Prometheus

You can use external services like UptimeRobot for basic HTTP checks, but for more granular insights and integration with your alerting system, Prometheus is a standard choice. Prometheus cannot interpret the endpoint's JSON body on its own, so you need an exporter that polls the endpoint and translates the result into metrics.

A simple approach is to use a generic HTTP prober such as the Prometheus Blackbox Exporter, which reports only whether the endpoint answered with a successful status, or to write a small custom exporter that also exposes the per-check results. For instance, you could deploy a small Python application using the prometheus_client library that periodically polls your health endpoint and exposes metrics to Prometheus.

# exporter/app.py
import os
import time

import requests
from prometheus_client import Counter, Gauge, start_http_server

HEALTH_URL = os.environ.get("LARAVEL_HEALTH_URL", "http://localhost/api/health")
EXPORTER_PORT = int(os.environ.get("EXPORTER_PORT", "9101"))

# Checks the endpoint reports; all are marked unhealthy when polling itself fails
KNOWN_CHECKS = ['database', 'cache']  # Add 'external_api' if you enable that check

# Metrics
app_health_status = Gauge('laravel_app_health_status', 'Laravel application health status (1 for healthy, 0 for unhealthy)', ['check'])
# A Counter rather than a Gauge, because this value only ever increases
app_health_checks_total = Counter('laravel_app_health_checks_total', 'Total number of health check polls performed')


def mark_all_unhealthy():
    for check_name in ['overall'] + KNOWN_CHECKS:
        app_health_status.labels(check=check_name).set(0)


def scrape_health_endpoint():
    app_health_checks_total.inc()
    try:
        response = requests.get(HEALTH_URL, timeout=10)
        # An unhealthy app answers 503 with a JSON body, so accept that status
        # and parse the body instead of calling raise_for_status()
        if response.status_code not in (200, 503):
            raise ValueError(f"unexpected HTTP status {response.status_code}")
        data = response.json()
        if not isinstance(data, dict):
            raise ValueError("health endpoint did not return a JSON object")
    except (requests.exceptions.RequestException, ValueError) as e:
        # Covers connection errors, timeouts, unexpected statuses and invalid JSON
        print(f"Error scraping health endpoint {HEALTH_URL}: {e}")
        mark_all_unhealthy()
        return

    app_health_status.labels(check='overall').set(1 if data.get('status') == 'healthy' else 0)

    for check_name, check_status in data.get('checks', {}).items():
        app_health_status.labels(check=check_name).set(1 if check_status == 'ok' else 0)
        if check_status != 'ok':
            # Log the error text reported by the failing check
            print(f"Health check failed for {check_name}: {check_status}")

    print(f"Scraped {HEALTH_URL} successfully. Status: {data.get('status')}")


if __name__ == '__main__':
    print(f"Starting Prometheus exporter on port {EXPORTER_PORT}")
    start_http_server(EXPORTER_PORT)
    print(f"Scraping Laravel health endpoint at {HEALTH_URL}")

    while True:
        scrape_health_endpoint()
        time.sleep(60)  # Poll every 60 seconds

Deploy this exporter on a separate DigitalOcean Droplet or within your Kubernetes cluster, and configure Prometheus to scrape its metrics endpoint (e.g., http://exporter-ip:9101/metrics).

DynamoDB Cluster Health and Performance Monitoring

Monitoring DynamoDB involves looking at both operational health (availability) and performance metrics (throughput, latency, errors). AWS CloudWatch is the primary tool for this, and we’ll integrate its metrics into our monitoring stack, likely via Prometheus.

Key DynamoDB Metrics to Monitor

Focus on these critical metrics, available via AWS CloudWatch:

ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits: Track actual usage against provisioned capacity. Spikes can indicate performance issues or inefficient queries.
ReadThrottleEvents / WriteThrottleEvents: Crucial for identifying when requests are being throttled due to exceeding provisioned capacity. This directly impacts application performance.
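SuccessfulRequestLatency: Measures the latency of successful requests in milliseconds, reported per operation (for example GetItem, Query, or PutItem). It covers only the time spent inside DynamoDB, so network and client-side delays must be measured in the application itself.
SystemErrors: Count of requests that failed with an HTTP 500 status inside DynamoDB. The AWS SDKs retry these automatically, but a sustained non-zero value requires investigation.
UserErrors: Count of requests rejected with an HTTP 400 status (e.g., validation errors or invalid parameters), reported for the whole account and region rather than per table. Throttling and conditional check failures are counted in separate metrics. While some user errors are expected, a sudden surge can point to application bugs.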
ReturnedItemCount: Useful for understanding the volume of data being returned by queries.
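
To compare consumption against provisioned capacity, keep in mind that CloudWatch reports ConsumedReadCapacityUnits as a Sum over the whole period, while provisioned capacity is expressed per second. A minimal Python sketch of that arithmetic, using a hypothetical function name and made-up numbers:

```python
def capacity_utilization(consumed_sum: float, period_seconds: int, provisioned_units: float) -> float:
    """Return the fraction of provisioned capacity used during one period.

    consumed_sum: the Sum statistic of ConsumedReadCapacityUnits (or the
    write equivalent) for the period.
    provisioned_units: the table's provisioned capacity units per second.
    """
    if period_seconds <= 0 or provisioned_units <= 0:
        raise ValueError("period_seconds and provisioned_units must be positive")
    average_units_per_second = consumed_sum / period_seconds
    return average_units_per_second / provisioned_units


# 15,000 units consumed over a 5-minute period on a table provisioned for
# 100 read capacity units averages 50 units per second, or half the capacity.
utilization = capacity_utilization(15_000, 300, 100)
print(f"{utilization:.0%}")
```

Because this is an average over the period, short bursts above the provisioned rate can still be throttled even when the result looks comfortable, which is why the throttle metrics are worth watching alongside it.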

Integrating CloudWatch Metrics with Prometheus

To bring CloudWatch metrics into Prometheus, we use the cloudwatch_exporter. This allows Prometheus to scrape metrics exposed by the exporter, which in turn pulls data from CloudWatch.

First, set up the cloudwatch_exporter. You’ll need AWS credentials configured (e.g., via environment variables or an IAM role if running on EC2/ECS).

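# cloudwatch_exporter/config.yml
# Example configuration for DynamoDB metrics, written for the schema of the
# prometheus/cloudwatch_exporter project (one entry per CloudWatch metric).
# That exporter exposes each metric as aws_dynamodb_<metric_in_snake_case>_<statistic>,
# e.g. aws_dynamodb_read_throttle_events_sum, with a table_name label.
---
region: us-east-1 # Or your AWS region
metrics:
  # Table-level metrics (dimension: TableName)
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ConsumedReadCapacityUnits
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-laravel-table-name] # Replace with your actual table name
    aws_statistics: [Sum]
    period_seconds: 300 # 5 minutes
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ConsumedWriteCapacityUnits
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Sum]
    period_seconds: 300
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ReadThrottleEvents
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Sum]
    period_seconds: 60 # 1 minute for throttles
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: WriteThrottleEvents
    aws_dimensions: [TableName]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Sum]
    period_seconds: 60
  # Per-operation metrics (dimensions: TableName and Operation)
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: SuccessfulRequestLatency
    aws_dimensions: [TableName, Operation]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Average, Maximum]
    period_seconds: 300
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: SystemErrors
    aws_dimensions: [TableName, Operation]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Sum]
    period_seconds: 60
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: ReturnedItemCount
    aws_dimensions: [TableName, Operation]
    aws_dimension_select:
      TableName: [your-laravel-table-name]
    aws_statistics: [Sum]
    period_seconds: 300
  # Account-level metric (no TableName dimension)
  - aws_namespace: AWS/DynamoDB
    aws_metric_name: UserErrors
    aws_statistics: [Sum]
    period_seconds: 60
# Add more tables or global secondary indexes as needed
# Example for a specific index:
#  - aws_namespace: AWS/DynamoDB
#    aws_metric_name: ConsumedReadCapacityUnits
#    aws_dimensions: [TableName, GlobalSecondaryIndexName]
#    aws_dimension_select:
#      TableName: [your-laravel-table-name]
#      GlobalSecondaryIndexName: [your-index-name]
#    aws_statistics: [Sum]
#    period_seconds: 300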
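Deploy the cloudwatch_exporter (e.g., as a Docker container) and configure Prometheus to scrape its metrics endpoint (e.g., http://cloudwatch-exporter-ip:9106/metrics). Port 9106 is the default for the Prometheus project's cloudwatch_exporter, and it avoids a clash with Node Exporter, which listens on 9100.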
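
If the exporter returns no DynamoDB series, it can help to query CloudWatch directly to rule out credential or dimension problems. The sketch below assumes boto3 is installed and that AWS credentials are available to it; the helper names metric_query and fetch_datapoints are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def metric_query(table_name, metric_name, statistic="Sum", period=300, minutes=30, now=None):
    """Build keyword arguments for CloudWatch's GetMetricStatistics call."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/DynamoDB",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds
        "Statistics": [statistic],
    }


def fetch_datapoints(region, table_name, metric_name, **options):
    """Return the datapoints for one table-level metric, oldest first."""
    import boto3  # Imported here so metric_query() works without boto3 installed

    client = boto3.client("cloudwatch", region_name=region)
    response = client.get_metric_statistics(**metric_query(table_name, metric_name, **options))
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

Calling fetch_datapoints("us-east-1", "your-laravel-table-name", "ConsumedReadCapacityUnits") should return recent datapoints for an active table. An access error points at credentials or IAM permissions, while an empty list usually means the table name or dimensions do not match, or that the table had no activity in the window.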

Alerting on DynamoDB Throttles and Latency

Alerting is paramount. Define alerting rules in Prometheus that fire when specific thresholds are crossed; Alertmanager then takes care of grouping and delivering the notifications.

# prometheus/alert.rules.yml
# Metric and label names follow the naming of prometheus/cloudwatch_exporter
# (aws_dynamodb_<metric>_<statistic>, table_name); confirm them against your
# exporter's /metrics output. These series are gauges holding the CloudWatch
# statistic for the latest period, so they are compared directly, not via rate().
groups:
- name: dynamodb_alerts
  rules:
  - alert: HighDynamoDBReadThrottleRate
    expr: sum by (table_name) (aws_dynamodb_read_throttle_events_sum{job="cloudwatch_exporter"}) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Read throttling detected on DynamoDB table {{ $labels.table_name }}"
      description: "DynamoDB table {{ $labels.table_name }} has reported throttled read requests continuously for the last 5 minutes."

  - alert: HighDynamoDBWriteThrottleRate
    expr: sum by (table_name) (aws_dynamodb_write_throttle_events_sum{job="cloudwatch_exporter"}) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Write throttling detected on DynamoDB table {{ $labels.table_name }}"
      description: "DynamoDB table {{ $labels.table_name }} has reported throttled write requests continuously for the last 5 minutes."

  - alert: HighDynamoDBRequestLatency
    expr: max by (table_name, operation) (aws_dynamodb_successful_request_latency_maximum{job="cloudwatch_exporter"}) > 500 # Latency in milliseconds
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "High DynamoDB latency on table {{ $labels.table_name }} ({{ $labels.operation }})"
      description: "DynamoDB table {{ $labels.table_name }} has had a maximum {{ $labels.operation }} latency above 500 ms for the last 10 minutes."

  - alert: DynamoDBSystemErrors
    expr: sum by (table_name) (aws_dynamodb_system_errors_sum{job="cloudwatch_exporter"}) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "System errors detected in DynamoDB table {{ $labels.table_name }}"
      description: "DynamoDB table {{ $labels.table_name }} has been reporting system errors for the last 5 minutes."

These alerts should be routed through Alertmanager to your preferred notification channels (Slack, PagerDuty, email).
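
The for clause in each rule means a condition must hold continuously before the alert fires, which is why a single bad period does not notify anyone. The following Python sketch models that behavior in a simplified way, with one sample per rule evaluation and a hypothetical function name; it is an illustration, not Prometheus's actual implementation.

```python
def alert_states(samples, threshold, for_evaluations):
    """Return the alert state after each rule evaluation.

    samples: the metric value seen at each evaluation, oldest first.
    for_evaluations: the rule's `for` duration divided by the evaluation
    interval (for: 5m with a 1m interval gives 5).
    """
    states = []
    streak = 0  # Consecutive evaluations in which the condition was true
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak == 0:
            states.append("inactive")
        elif streak > for_evaluations:
            # The condition has now held for the full `for` duration
            states.append("firing")
        else:
            states.append("pending")
    return states


# A one-off spike resets to inactive; a sustained breach eventually fires.
for state in alert_states([0, 3, 0, 2, 2, 2, 2], threshold=0, for_evaluations=2):
    print(state)
```

A shorter for duration notifies sooner but is noisier; a longer one filters out brief bursts at the cost of slower detection.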

Server Resource Monitoring on DigitalOcean Droplets

For your Laravel application servers (DigitalOcean Droplets), standard system resource monitoring is essential. Node Exporter is the de facto standard for exposing host-level metrics to Prometheus.

Deploying Node Exporter

Node Exporter can be deployed as a systemd service or a Docker container. Here’s a typical systemd service file:

# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
# Run as an unprivileged user (a dedicated node_exporter user also works)
User=prometheus
ExecStart=/usr/local/bin/node_exporter --web.listen-address=":9100"
Restart=on-failure

[Install]
WantedBy=multi-user.target

After placing this file, enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Configure Prometheus to scrape the node_exporter targets (e.g., http://your-droplet-ip:9100/metrics).

Key Server Metrics and Alerts

Focus on:

CPU Usage (node_cpu_seconds_total): Monitor overall CPU load and per-core usage. High sustained CPU can indicate inefficient code or insufficient resources.
Memory Usage (node_memory_MemAvailable_bytes, node_memory_MemTotal_bytes): Track available memory. Running out of memory leads to swapping and severe performance degradation.
Disk I/O (node_disk_io_time_seconds_total): Tracks the time each device spent servicing I/O; its per-second rate is the device's utilization. A device that is busy most of the time can bottleneck applications, especially databases.
Network Traffic (node_network_receive_bytes_total, node_network_transmit_bytes_total): Track network throughput.
Load Average (node_load1, node_load5, node_load15): A general indicator of system load.
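
node_cpu_seconds_total is a counter of seconds spent in each mode per core, so CPU usage has to be derived from how quickly the idle counter grows. Here is a minimal Python sketch of the arithmetic that a PromQL expression of the form 100 - avg(rate(idle)) * 100 performs, using a hypothetical function name and made-up sample values:

```python
def cpu_usage_percent(idle_start, idle_end, elapsed_seconds):
    """Return average CPU usage in percent across all cores.

    idle_start / idle_end: per-core idle counters (in seconds) sampled at the
    start and end of the window, in the same core order.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if not idle_start or len(idle_start) != len(idle_end):
        raise ValueError("idle_start and idle_end must be non-empty and the same length")
    # Fraction of the window each core spent idle (what rate() yields per core)
    idle_fractions = [(end - start) / elapsed_seconds for start, end in zip(idle_start, idle_end)]
    average_idle = sum(idle_fractions) / len(idle_fractions)
    return 100.0 - average_idle * 100.0


# Two cores over a 300-second window: one idle for 270 s, the other for 30 s
usage = cpu_usage_percent([1000.0, 2000.0], [1270.0, 2030.0], 300)
print(f"{usage:.1f}%")
```

Averaging across cores hides a single saturated core, so on multi-core Droplets it can be worth looking at per-core values as well when investigating a slow single-threaded process.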

Example Prometheus alerts for server resources:

# prometheus/alert.rules.yml (when adding to the existing file, append only the
# server_alerts group under the existing top-level groups: key)
groups:
- name: server_alerts
  rules:
  - alert: HighCpuUsage
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage on {{ $labels.instance }} has been above 90% for the last 10 minutes."

  - alert: LowAvailableMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "Low available memory on {{ $labels.instance }}"
      description: "Available memory on {{ $labels.instance }} is below 10% for the last 15 minutes."

  - alert: HighDiskUtilization
    expr: rate(node_disk_io_time_seconds_total{device="vda"}[5m]) > 0.8 # Adjust device (Droplets typically expose vda) and threshold
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High disk utilization on {{ $labels.instance }}"
      description: "Disk {{ $labels.device }} on {{ $labels.instance }} has been busy more than 80% of the time for the last 10 minutes."

These alerts, coupled with the application and database monitoring, provide a comprehensive view of your system’s health, enabling proactive intervention before issues impact users.