Server Monitoring Best Practices: Keeping Your Laravel App and DynamoDB Clusters Alive on Linode

Proactive Health Checks for Laravel Applications

Maintaining the health of a Laravel application goes beyond simply checking if the web server is responding. We need to ensure the application itself is functioning correctly, processing requests efficiently, and not succumbing to common pitfalls like memory leaks or database connection exhaustion. This involves implementing a multi-layered monitoring strategy.

Application-Level Health Endpoint

The first line of defense is a dedicated health check endpoint within your Laravel application. This endpoint should perform critical checks, such as verifying database connectivity, cache availability, and the status of essential external services. A simple implementation can be achieved by creating a controller and a route.

Create a new controller:

php artisan make:controller HealthCheckController

Now, implement the health check logic within this controller. This example checks database connectivity, cache availability, and a hypothetical external API.

<?php

namespace App\Http\Controllers;

use Illuminate\Http\JsonResponse;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Http;
use Illuminate\Routing\Controller as BaseController;

class HealthCheckController extends BaseController
{
    /**
     * Perform a comprehensive health check.
     *
     * @return \Illuminate\Http\JsonResponse
     */
    public function index(): JsonResponse
    {
        $status = 'ok';
        $checks = [];

        // 1. Database Connection Check
        try {
            DB::connection()->getPdo();
            $checks['database'] = ['status' => 'ok', 'message' => 'Database connection successful.'];
        } catch (\Exception $e) {
            $status = 'error';
            $checks['database'] = ['status' => 'error', 'message' => 'Database connection failed: ' . $e->getMessage()];
        }

        // 2. Cache Check (e.g., Redis)
        try {
            Cache::put('health_check_key', 'health_check_value', 1);
            if (Cache::get('health_check_key') === 'health_check_value') {
                $checks['cache'] = ['status' => 'ok', 'message' => 'Cache is accessible.'];
            } else {
                $status = 'error';
                $checks['cache'] = ['status' => 'error', 'message' => 'Cache read/write failed.'];
            }
        } catch (\Exception $e) {
            $status = 'error';
            $checks['cache'] = ['status' => 'error', 'message' => 'Cache connection failed: ' . $e->getMessage()];
        }

        // 3. External Service Check (Example: a hypothetical API)
        try {
            // Replace with a real, reliable external service endpoint
            $response = Http::timeout(5)->get('https://api.example.com/health');
            if ($response->successful()) {
                $checks['external_service'] = ['status' => 'ok', 'message' => 'External service is responsive.'];
            } else {
                $status = 'error';
                $checks['external_service'] = ['status' => 'error', 'message' => 'External service returned an error: ' . $response->status()];
            }
        } catch (\Exception $e) {
            $status = 'error';
            $checks['external_service'] = ['status' => 'error', 'message' => 'External service check failed: ' . $e->getMessage()];
        }

        // Add more checks as needed (e.g., queue worker status, specific business logic checks)

        return response()->json([
            'status' => $status,
            'checks' => $checks,
        ], $status === 'ok' ? 200 : 503); // 503 Service Unavailable
    }
}

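Register a route for this health check endpoint in routes/api.php or routes/web.php, depending on your application’s structure. For API-based health checks, routes/api.php is generally preferred. Note that Laravel prefixes routes in that file with /api by default, so the route below is served at /api/health rather than /health.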

// routes/api.php
use App\Http\Controllers\HealthCheckController;

Route::get('/health', [HealthCheckController::class, 'index']);

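This endpoint should be accessible without authentication for monitoring tools. Because the response includes raw exception messages, which can reveal hostnames or connection details, configure your web server (Nginx/Apache) or firewall to allow access to this specific route only from your monitoring IP addresses or networks.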
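
As a rough illustration of how a monitoring client might interpret the endpoint's response, here is a minimal Python sketch. The function name evaluate_health is hypothetical; it only assumes the JSON shape produced by the controller above (a top-level status plus a checks map) and treats anything other than HTTP 200 with all checks ok as unhealthy.

```python
import json


def evaluate_health(status_code: int, body: str) -> tuple[bool, list[str]]:
    """Return (healthy, names of failing checks) for one health response."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # e.g. an HTML error page from the web server instead of JSON
        return False, ["unparseable_response"]
    if not isinstance(payload, dict):
        return False, ["unparseable_response"]

    checks = payload.get("checks", {})
    if not isinstance(checks, dict):
        checks = {}

    failing = sorted(
        name for name, result in checks.items()
        if not isinstance(result, dict) or result.get("status") != "ok"
    )
    healthy = status_code == 200 and payload.get("status") == "ok" and not failing
    return healthy, failing


# Illustrative response body, not real output
sample = '{"status":"error","checks":{"database":{"status":"ok"},"cache":{"status":"error"}}}'
print(evaluate_health(503, sample))
```

Returning the failing check names alongside the boolean lets an alert message say which dependency broke rather than only that something did.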

System-Level Monitoring with Prometheus and Node Exporter

While application-level checks are crucial, they don’t tell the whole story. We need to monitor the underlying infrastructure: CPU usage, memory, disk I/O, network traffic, and process status. Prometheus, coupled with Node Exporter, provides a robust solution for this.

Deploying Prometheus and Node Exporter on Linode

You can deploy Prometheus and Node Exporter as Docker containers for ease of management. Ensure your Linode instances have Docker installed.

Create a docker-compose.yml file for Prometheus:

version: '3.7'

services:
  prometheus:
    image: prom/prometheus:v2.45.0
    container_name: prometheus
    ports:
      - "9090:9090"
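    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus:/prometheus # Persist metric data in the named volume declared at the bottom of this file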
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    restart: unless-stopped

  node_exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node_exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped

volumes:
  prometheus:

Create the Prometheus configuration file prometheus/prometheus.yml. This configuration will scrape metrics from Node Exporter running on the same host and your Laravel application’s health endpoint.

global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate alerting and recording rules every 15 seconds.

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape Node Exporter for system metrics
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node_exporter:9100'] # Compose service name; 'localhost' inside the Prometheus container would refer to Prometheus itself

  # Scrape Laravel application health
  # The health endpoint returns JSON, which Prometheus cannot parse directly, so this
  # job scrapes an intermediary (a reverse proxy or small exporter, assumed here to
  # listen on port 9091) that translates the health status into Prometheus metrics.
  # Replace 'your_laravel_app_ip' with the actual IP of your Laravel server.
  - job_name: 'laravel_app_health'
    metrics_path: '/metrics' # Path where the intermediary serves metrics; adjust to your setup
    scheme: 'http' # or 'https'
    static_configs:
      - targets: ['your_laravel_app_ip:9091']
    relabel_configs:
      # Drop the port so the instance label shows only the server address
      - source_labels: [__address__]
        regex: '([^:]+)(:[0-9]+)?'
        target_label: instance
        replacement: '$1'

# Example for scraping a reverse proxy that exposes the health endpoint as metrics
# If you have Nginx configured to expose /health as metrics (e.g., using a custom module or a sidecar exporter)
# - job_name: 'laravel_app_proxy'
#   static_configs:
#     - targets: ['your_laravel_app_ip:9101'] # Port where Nginx/proxy exposes metrics
#   metrics_path: '/metrics' # Assuming the proxy exposes metrics at /metrics

Prometheus can only scrape targets that serve metrics in its text exposition format, and the Laravel health endpoint returns JSON, so you typically need an intermediary. A common pattern is a reverse proxy (like Nginx) that exposes the health status as Prometheus metrics, or a dedicated exporter. For simplicity, this example assumes a reverse proxy setup. If you don’t have such a proxy, you’d need to build a small exporter that polls the health endpoint and exposes metrics.
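
A small exporter of that kind could look roughly like the following Python sketch, which uses only the standard library. It is an illustration under assumptions, not a production exporter: the metric names laravel_health_up and laravel_health_check_up, the HEALTH_URL value, and the choice to serve on port 9091 at /metrics are all hypothetical, and the scrape job's target and metrics_path would need to match whatever you pick.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Hypothetical URL; use /api/health if the route lives in routes/api.php
HEALTH_URL = "http://127.0.0.1/health"


def render_metrics(payload: dict) -> str:
    """Translate the health JSON into Prometheus text exposition format."""
    lines = [
        "# HELP laravel_health_up 1 if the overall health status is ok, else 0.",
        "# TYPE laravel_health_up gauge",
        f"laravel_health_up {1 if payload.get('status') == 'ok' else 0}",
        "# HELP laravel_health_check_up 1 if the named check is ok, else 0.",
        "# TYPE laravel_health_check_up gauge",
    ]
    # Check names are assumed to be simple identifiers that need no escaping
    for name, result in sorted(payload.get("checks", {}).items()):
        value = 1 if result.get("status") == "ok" else 0
        lines.append(f'laravel_health_check_up{{check="{name}"}} {value}')
    return "\n".join(lines) + "\n"


def fetch_health(url: str = HEALTH_URL) -> dict:
    """Poll the health endpoint; any failure is reported as an error status."""
    try:
        with urlopen(url, timeout=5) as resp:
            return json.load(resp)
    except HTTPError as err:
        # A 503 from the controller still carries the JSON body
        try:
            return json.load(err)
        except ValueError:
            return {"status": "error", "checks": {}}
    except (URLError, ValueError, OSError):
        return {"status": "error", "checks": {}}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(fetch_health()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9091) -> None:
    """Blocks forever; run this under a process supervisor."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the exporter. Polling the health endpoint on every scrape keeps the sketch stateless, at the cost of one health request per scrape interval.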

Start the containers:

docker-compose up -d

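Access the Prometheus UI at http://your_linode_ip:9090 and open Status > Targets. Your Node Exporter and Laravel application targets should appear as ‘UP’. Because ports 9090 and 9100 are published on the host, restrict them with a firewall so they are not reachable from the public internet.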

Monitoring DynamoDB Clusters

DynamoDB, being a managed service, offloads much of the operational burden. However, monitoring its performance and cost is still critical. AWS CloudWatch is the primary tool for this. We’ll focus on key metrics and how to set up alerts.

Key DynamoDB Metrics to Monitor

ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits: Essential for understanding your provisioned throughput utilization and identifying potential throttling.
ProvisionedReadCapacityUnits and ProvisionedWriteCapacityUnits: Track your provisioned capacity.
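ThrottledRequests: A direct indicator of requests being rejected due to exceeding provisioned throughput. CloudWatch reports it per table and operation (for example, PutItem or Query).
SuccessfulRequestLatency: Measures the time taken for successful requests, in milliseconds. High latency can indicate performance issues.
SystemErrors: Count of requests that failed with an internal server error (HTTP 500).
UserErrors: Count of client-side errors (HTTP 400, e.g., validation errors), aggregated per account and Region rather than per table.
ItemCount: Number of items in a table. Useful for understanding data volume. This value comes from the DescribeTable API rather than CloudWatch and is refreshed approximately every six hours.
TableSizeBytes: Size of the table, also reported by DescribeTable with the same refresh interval.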

Setting Up CloudWatch Alarms

You can set up CloudWatch Alarms via the AWS Management Console, AWS CLI, or Infrastructure as Code tools like Terraform or CloudFormation.

Here’s an example using the AWS CLI to create an alarm for throttled requests on a specific table. DynamoDB publishes ThrottledRequests per table and operation, so the alarm must specify both dimensions or it will never receive data. This example watches PutItem; create one alarm for each operation you care about.

aws cloudwatch put-metric-alarm \
    --alarm-name "DynamoDB-ThrottledRequests-PutItem-High" \
    --alarm-description "Alarm when throttled PutItem requests exceed a threshold" \
    --metric-name "ThrottledRequests" \
    --namespace "AWS/DynamoDB" \
    --statistic Sum \
    --period 300 \
    --threshold 10 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --dimensions \
        Name=TableName,Value=YourDynamoDBTableName \
        Name=Operation,Value=PutItem \
    --evaluation-periods 2 \
    --datapoints-to-alarm 2 \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:YourSNSTopicForAlerts

Explanation:

--alarm-name: A descriptive name for the alarm.
--metric-name: The specific DynamoDB metric to monitor.
--namespace: The AWS service namespace.
--statistic: How to aggregate the metric data (e.g., Sum for counts, Average for rates).
--period: The length of time, in seconds, over which the statistic is computed for each datapoint.
--threshold: The value against which the metric is compared.
--comparison-operator: The operator to use for comparison (e.g., GreaterThanOrEqualToThreshold).
--dimensions: Filters the metric data to a specific resource (e.g., a DynamoDB table).
--evaluation-periods: The number of most recent periods considered when determining the alarm state.
--datapoints-to-alarm: The number of datapoints within those periods that must be breaching to trigger the alarm.
--treat-missing-data: How to handle missing data points (notBreaching counts them as within the threshold).
--alarm-actions: The ARN of an SNS topic to notify when the alarm enters the ALARM state.
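
The interaction between --evaluation-periods, --datapoints-to-alarm, and --treat-missing-data can be pictured with a small Python sketch. This is a deliberately simplified model for intuition, with a hypothetical function name; CloudWatch's real evaluation of missing data is more involved than what is shown here.

```python
from typing import Optional, Sequence


def is_alarming(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> bool:
    """Simplified 'M out of N' check with GreaterThanOrEqualToThreshold.

    None stands for a missing datapoint and is treated as notBreaching.
    """
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value is not None and value >= threshold)
    return breaching >= datapoints_to_alarm


# Sum of ThrottledRequests per 300-second period, oldest first (illustrative values)
print(is_alarming([0, 12, 15], threshold=10, evaluation_periods=2, datapoints_to_alarm=2))
print(is_alarming([12, None], threshold=10, evaluation_periods=2, datapoints_to_alarm=2))
```

With both options set to 2, as in the CLI example, a single noisy period is not enough to trigger the alarm; two consecutive breaching periods are required.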

Consider setting up alarms for:

High Consumed Capacity vs. Provisioned Capacity: To proactively scale up or optimize queries.
Sustained Throttled Requests: Indicates a critical performance bottleneck.
High Latency: For successful requests, signaling potential issues.
Increasing System Errors: To catch unexpected service-side problems.
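
When comparing consumed with provisioned capacity, keep in mind that the consumed metrics are totals over the period (when read with the Sum statistic), while provisioned capacity is expressed per second. A minimal Python sketch of the arithmetic, with a hypothetical function name and illustrative numbers:

```python
def capacity_utilization(
    consumed_sum: float, period_seconds: int, provisioned_per_second: float
) -> float:
    """Fraction of provisioned throughput used during one CloudWatch period."""
    if period_seconds <= 0 or provisioned_per_second <= 0:
        raise ValueError("period and provisioned capacity must be positive")
    average_per_second = consumed_sum / period_seconds
    return average_per_second / provisioned_per_second


# 24,000 read capacity units consumed in a 300-second period,
# against a table provisioned with 100 read capacity units per second
utilization = capacity_utilization(24_000, 300, 100)
print(f"{utilization:.0%}")
```

An alarm threshold for this comparison is then a utilization fraction, for instance alerting when it stays above 0.8 for several periods.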

Log Aggregation and Analysis

Centralized logging is indispensable for debugging and understanding application behavior. Tools like ELK Stack (Elasticsearch, Logstash, Kibana) or Grafana Loki are excellent choices.

Laravel Logging Configuration

Ensure your Laravel application logs are configured to output in a structured format, preferably JSON, which makes parsing by log aggregators much easier. Modify config/logging.php.

// config/logging.php

'channels' => [
    // ... other channels

    'json' => [
        'driver' => 'single',
        'path' => storage_path('logs/laravel.log'),
        'level' => env('LOG_LEVEL', 'debug'),
        'formatter' => Monolog\Formatter\JsonFormatter::class,
    ],

    'stack' => [
        'driver' => 'stack',
        'channels' => ['json'], // Use the JSON formatter for your primary logs
        'ignore_exceptions' => false,
    ],
],

'default' => env('LOG_CHANNEL', 'stack'),

Then, in your .env file, set LOG_CHANNEL=json (or use the ‘stack’ channel which directs to ‘json’ in this example).
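
To see why JSON output simplifies downstream processing, consider how little code a consumer needs to pull fields out of a log line. The Python sketch below parses one line in the shape Monolog's JsonFormatter produces (message, context, level_name, channel, datetime, and so on); the sample line itself is made up for illustration.

```python
import json

# Illustrative log line, not captured from a real application
SAMPLE_LINE = (
    '{"message":"Payment gateway timeout","context":{"order_id":42},'
    '"level":400,"level_name":"ERROR","channel":"production",'
    '"datetime":"2024-01-15T10:30:00.000000+00:00","extra":{}}'
)


def parse_log_line(line: str) -> dict:
    """Extract the fields most useful for searching and alerting."""
    record = json.loads(line)
    return {
        "level": record.get("level_name"),
        "message": record.get("message"),
        "context": record.get("context", {}),
        "timestamp": record.get("datetime"),
    }


entry = parse_log_line(SAMPLE_LINE)
print(entry["level"], entry["message"], entry["context"])
```

With the default line format, the same extraction would need a regular expression that breaks whenever a message contains brackets or newlines.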

Log Shipping with Fluentd or Filebeat

On your Linode servers, you’ll need a log shipper to collect these logs and send them to your aggregation system. Fluentd or Filebeat are common choices.

Example configuration for Filebeat (filebeat.yml) to tail Laravel logs:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/www/your-laravel-app/storage/logs/laravel.log # Adjust path as per your deployment
  json.keys_under_root: true
  json.overwrite_keys: true
  json.message_key: message # Monolog's JsonFormatter writes the log text to the 'message' field

output.elasticsearch:
  hosts: ["your-elasticsearch-host:9200"] # Replace with your Elasticsearch endpoint
  # username: "elastic"
  # password: "changeme"

# Or for Logstash:
# output.logstash:
#   hosts: ["your-logstash-host:5044"]

Ensure Filebeat is installed and configured to run as a service on your Linode instances. This setup allows you to search, visualize, and alert on log events effectively.

Alerting Strategy and Notification Channels

A robust monitoring system is only effective if it triggers timely and actionable alerts. Define clear thresholds and escalation policies.

Integrating Prometheus Alerts with Alertmanager

Prometheus itself can evaluate alerting rules, but Alertmanager is used to handle these alerts: deduplicating, grouping, and routing them to the correct receiver (e.g., Slack, PagerDuty, email).

Add Alertmanager configuration to your docker-compose.yml and create alertmanager/alertmanager.yml.

# docker-compose.yml (add Alertmanager service)
  alertmanager:
    image: prom/alertmanager:v0.25.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications' # Default receiver

receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'YOUR_SLACK_WEBHOOK_URL' # Replace with your Slack webhook URL
    channel: '#alerts'
    send_resolved: true
    title: '{{ template "slack.default.title" . }}'
    text: '{{ template "slack.default.text" . }}'

# Prometheus configuration (prometheus.yml) needs to point to Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093'] # Assuming Alertmanager is on the same Docker network
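
The effect of group_by in the route above can be sketched in a few lines of Python. This is only a conceptual model with a hypothetical function name: alerts that share the same values for the listed labels end up in one group and therefore in one notification, and a label that is absent is treated as an empty value.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], group_by: list[str]) -> dict[tuple, list[dict]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Illustrative alerts; label values are made up
alerts = [
    {"labels": {"alertname": "HighCpuUsage", "service": "web", "instance": "10.0.0.1"}},
    {"labels": {"alertname": "HighCpuUsage", "service": "web", "instance": "10.0.0.2"}},
    {"labels": {"alertname": "DiskAlmostFull", "service": "web", "instance": "10.0.0.1"}},
]
for key, members in group_alerts(alerts, ["alertname", "cluster", "service"]).items():
    print(key, len(members))
```

Here the two HighCpuUsage alerts differ only in instance, which is not a grouping label, so they would arrive as a single Slack message instead of two.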

Define alerting rules in Prometheus (e.g., in a file like prometheus/rules.yml and include it in prometheus.yml):

# prometheus/rules.yml
groups:
- name: laravel_alerts
  rules:
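  # Assumes the application exports an http_requests_total counter with a status label
  # (for example, via a Prometheus client library); the health endpoint alone does not provide it.
  - alert: HighLaravelAppErrorRate
    expr: |
      sum(rate(http_requests_total{job="laravel_app_health", status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="laravel_app_health"}[5m]))
      > 0.05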
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected for Laravel app ({{ $value | humanizePercentage }})"
      description: "The error rate for the Laravel application has exceeded 5% over the last 5 minutes."

  - alert: LaravelAppUnreachable
    expr: up{job="laravel_app_health"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Laravel application is unreachable"
      description: "The Laravel application target '{{ $labels.instance }}' is down."

- name: dynamodb_alerts
  rules:
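  # Assumes a CloudWatch exporter makes DynamoDB metrics available in Prometheus;
  # the metric and label names below depend on the exporter you use.
  - alert: HighDynamoDBThrottledRequests
    expr: sum(aws_dynamodb_throttled_requests{TableName="YourDynamoDBTableName"}) by (TableName) > 10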
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High throttled requests for DynamoDB table {{ $labels.TableName }}"
      description: "DynamoDB table '{{ $labels.TableName }}' is experiencing a high number of throttled requests."

# Include this rules file in prometheus.yml
# rule_files:
#   - "rules.yml"

Ensure your Prometheus configuration (prometheus.yml) includes the rule_files directive pointing to your rules file and the alerting section pointing to Alertmanager.
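
The for clause in these rules delays firing until the expression has been true continuously for the stated duration. A minimal Python sketch of that behavior, assuming one boolean result per rule evaluation (the function name is hypothetical):

```python
from typing import Optional, Sequence


def first_firing_index(
    evaluations: Sequence[bool], interval_seconds: int, for_seconds: int
) -> Optional[int]:
    """Return the index of the evaluation at which the alert starts firing.

    The alert is pending from the first true evaluation and fires once the
    condition has held for at least for_seconds; a false result resets it.
    """
    active_since: Optional[int] = None
    for index, condition_met in enumerate(evaluations):
        if not condition_met:
            active_since = None
            continue
        if active_since is None:
            active_since = index
        if (index - active_since) * interval_seconds >= for_seconds:
            return index
    return None


# 15-second evaluation interval with 'for: 2m' (120 seconds)
print(first_firing_index([True] * 9, 15, 120))
print(first_firing_index([True] * 5 + [False] + [True] * 9, 15, 120))
```

A single false evaluation resets the pending state, which is why short for values suit hard failures such as an unreachable target and longer ones suit noisy signals such as CPU usage.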

Monitoring Linode Infrastructure

Linode provides basic infrastructure monitoring through its Cloud Manager. This includes CPU utilization, network traffic, and disk I/O for your compute instances. While useful for an overview, it’s often insufficient for granular, application-aware monitoring.

Leveraging Node Exporter Metrics

The Node Exporter deployed earlier provides a wealth of system-level metrics that Prometheus collects. Key metrics to watch include:

node_cpu_seconds_total: CPU time spent in various modes.
node_memory_MemAvailable_bytes: Available memory.
node_disk_io_time_seconds_total: Disk I/O time.
node_network_receive_bytes_total and node_network_transmit_bytes_total: Network traffic.
node_filesystem_avail_bytes: Available disk space.
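
Most of these metrics are raw counters or gauges, so useful figures have to be derived from them. For example, node_cpu_seconds_total only ever increases, and CPU usage comes from how quickly the idle counter grows between two samples. The Python sketch below walks through that arithmetic with made-up sample values; it ignores the counter resets and extrapolation that PromQL's rate() handles.

```python
from typing import Sequence


def cpu_usage_percent(
    idle_start: Sequence[float], idle_end: Sequence[float], elapsed_seconds: float
) -> float:
    """Average CPU usage across cores from two samples of the idle counter.

    idle_start and idle_end hold one node_cpu_seconds_total{mode="idle"}
    value per core, taken elapsed_seconds apart.
    """
    if elapsed_seconds <= 0 or not idle_start or len(idle_start) != len(idle_end):
        raise ValueError("need matching per-core samples and a positive interval")
    idle_rates = [
        (end - start) / elapsed_seconds for start, end in zip(idle_start, idle_end)
    ]
    average_idle = sum(idle_rates) / len(idle_rates)
    return 100 - average_idle * 100


# Two cores, each idle for 30 of the last 300 seconds: about 90 percent busy
usage = cpu_usage_percent([1000.0, 2000.0], [1030.0, 2030.0], 300)
print(round(usage, 1))
```

Averaging the per-core idle rates mirrors what an avg by (instance) over rate(node_cpu_seconds_total{mode="idle"}[5m]) does in a Prometheus expression.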

Set up Prometheus alerts for critical infrastructure conditions:

# prometheus/rules.yml (add to existing file)
groups:
# ... other groups

- name: linode_infrastructure_alerts
  rules:
  - alert: HighCpuUsage
    expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on Linode instance {{ $labels.instance }}"
      description: "CPU usage on {{ $labels.instance }} is above 90% for the last 10 minutes."

  - alert: LowMemoryAvailable
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "Low available memory on Linode instance {{ $labels.instance }}"
      description: "Available memory on {{ $labels.instance }} is below 10% for the last 15 minutes."

  - alert: DiskAlmostFull
    expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100 < 5
    for: 30m
    labels:
      severity: critical
    annotations:
      summary: "Disk almost full on Linode instance {{ $labels.instance }}"
      description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} has less than 5% free space."

Regularly review these metrics and alerts in Grafana (if you integrate Prometheus with Grafana for visualization) to ensure the stability and performance of your Linode infrastructure.