Server Monitoring Best Practices: Keeping Your Magento 2 App and MongoDB Clusters Alive on DigitalOcean

Proactive MongoDB Cluster Health Checks

Maintaining the health of a MongoDB replica set is paramount for Magento 2’s performance and availability. Beyond basic CPU/RAM, we need to monitor MongoDB-specific metrics that indicate potential issues before they impact the application. This involves querying the MongoDB server directly and setting up alerts based on these metrics.

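A critical metric is the oplog window: the time difference between the oldest and newest entries in the operation log. The window bounds how far a secondary can fall behind and still catch up. Once a secondary's replication lag exceeds the window, the operations it needs have already been overwritten and it requires a full resync. A shrinking window and a growing lag therefore both warrant attention. Per-member lag can be derived from the rs.status() command, and the window from the oplog itself (rs.printReplicationInfo() reports it in the shell).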

Monitoring Oplog Lag

We’ll use a simple Python script to connect to the MongoDB replica set, fetch the oplog status, and calculate the lag. This script can be scheduled via cron or a systemd timer.

First, ensure you have the pymongo library installed:

pip install pymongo

Here’s the Python script:

import sys

import pymongo

# --- Configuration ---
MONGO_URI = "mongodb://user:password@mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/?replicaSet=myReplicaSet&authSource=admin"
OPLOG_LAG_THRESHOLD_MINUTES = 15 # Alert if a secondary lags the primary by more than 15 minutes
# ---------------------

def check_oplog_lag(mongo_uri, lag_threshold_minutes):
    client = None
    try:
        # With a replica set URI, reads go to the primary by default,
        # so no second connection is needed to query the oplog.
        client = pymongo.MongoClient(mongo_uri)
        # pymongo raises OperationFailure if the command itself fails
        rs_status = client.admin.command('replSetGetStatus')

        members = rs_status.get('members', [])
        primary = next((m for m in members if m.get('stateStr') == 'PRIMARY'), None)
        if primary is None:
            print("Error: No primary found in replica set.")
            sys.exit(1)

        # Oplog window: time between the oldest and newest entries in the primary's oplog
        oplog_collection = client.local['oplog.rs']
        latest_oplog_entry = oplog_collection.find_one(sort=[('$natural', pymongo.DESCENDING)])
        oldest_oplog_entry = oplog_collection.find_one(sort=[('$natural', pymongo.ASCENDING)])

        if not latest_oplog_entry or not oldest_oplog_entry:
            print("Warning: Could not retrieve oplog entries. Possibly empty oplog.")
            return

        # A BSON timestamp holds 32 bits of seconds since the epoch plus a 32-bit increment;
        # .time is the seconds part.
        oplog_window_minutes = (latest_oplog_entry['ts'].time - oldest_oplog_entry['ts'].time) / 60
        print(f"Oplog Window: {oplog_window_minutes:.2f} minutes")

        # Replication lag: how far each secondary's last applied operation trails the primary's
        alert = False
        for member in members:
            if member.get('stateStr') != 'SECONDARY':
                continue
            lag_minutes = (primary['optimeDate'] - member['optimeDate']).total_seconds() / 60
            print(f"Replication lag for {member.get('name')}: {lag_minutes:.2f} minutes")
            if lag_minutes > lag_threshold_minutes:
                print(f"ALERT: Oplog lag on {member.get('name')} ({lag_minutes:.2f} minutes) exceeds threshold ({lag_threshold_minutes} minutes).")
                alert = True

        # A secondary that lags by more than the oplog window cannot catch up and needs a full resync
        if oplog_window_minutes < lag_threshold_minutes:
            print(f"ALERT: Oplog window ({oplog_window_minutes:.2f} minutes) is shorter than the lag threshold ({lag_threshold_minutes} minutes).")
            alert = True

        if alert:
            # In a real-world scenario, you'd send an alert here (e.g., via PagerDuty, Slack, email)
            sys.exit(2) # Exit with a non-zero code to indicate an alert

    except pymongo.errors.ConnectionFailure as e:
        print(f"Error: Could not connect to MongoDB: {e}")
        sys.exit(1)
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        sys.exit(1)
    finally:
        if client:
            client.close()

if __name__ == "__main__":
    check_oplog_lag(MONGO_URI, OPLOG_LAG_THRESHOLD_MINUTES)

To integrate this with a monitoring system like Prometheus, you could adapt the script to expose metrics via an HTTP endpoint (using Flask or FastAPI) or use a dedicated MongoDB exporter. For simpler setups, a cron job that runs the script and reacts to its exit code, for example by posting to a webhook with `curl`, is sufficient.
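
As a rough illustration of the HTTP-endpoint option, the sketch below serves a single gauge in the Prometheus text exposition format using only the Python standard library rather than Flask or FastAPI. The metric name `mongodb_oplog_window_seconds`, the `get_window_seconds` helper, and the port passed to `serve` are hypothetical placeholders, not names taken from any exporter.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(oplog_window_seconds):
    """Format one gauge in the Prometheus text exposition format."""
    name = "mongodb_oplog_window_seconds"  # hypothetical metric name
    return (
        f"# HELP {name} Time span covered by the primary's oplog.\n"
        f"# TYPE {name} gauge\n"
        f"{name} {float(oplog_window_seconds)}\n"
    )


def get_window_seconds():
    """Hypothetical stand-in: replace with the oplog query from the script above."""
    return 86400.0


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(get_window_seconds()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve(9105)` (an arbitrary port) and adding the host and port as a scrape target would let Prometheus collect the value on its normal scrape interval.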

Key MongoDB Metrics to Monitor

Oplog Window: As detailed above, crucial for replica set health.
Network In/Out: High traffic can indicate replication issues or heavy application load.
Disk I/O: MongoDB is I/O intensive. Monitor read/write operations per second and latency.
Memory Usage: Track resident memory and cache hit rates.
Connections: Monitor active connections and connection pool usage.
Query Performance: Track slow queries (using MongoDB’s profiler) and overall query latency.
Replication Lag (per member): While oplog window is global, individual member lag is also important.
Disk Space: Ensure sufficient free space for data, oplog, and temporary files.
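
The cache hit rate mentioned in the list above can be approximated from two WiredTiger counters in the `serverStatus` output. The sketch below assumes the WiredTiger storage engine, and the sample numbers are invented for illustration. Both counters are cumulative since startup, so for trending you would compare deltas between two samples rather than the raw totals.

```python
def wiredtiger_cache_hit_ratio(server_status):
    """Return the WiredTiger cache hit ratio (0.0 to 1.0) from a serverStatus document."""
    cache = server_status["wiredTiger"]["cache"]
    requested = cache["pages requested from the cache"]
    read_in = cache["pages read into cache"]
    if requested == 0:
        return None  # no cache activity yet, so the ratio is undefined
    # Every page that had to be read in from disk counts as a miss.
    return 1.0 - (read_in / requested)


# Hypothetical sample values, shaped like the output of db.command("serverStatus").
sample = {
    "wiredTiger": {
        "cache": {
            "pages requested from the cache": 200_000,
            "pages read into cache": 5_000,
        }
    }
}
ratio = wiredtiger_cache_hit_ratio(sample)
print(f"Cache hit ratio: {ratio:.1%}")
```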

DigitalOcean’s Managed Databases for MongoDB provide some of these metrics out-of-the-box. For self-hosted clusters, consider tools like mongostat and mongotop, and borrow ideas from the MongoDB Atlas monitoring views even if you do not host on Atlas. Feed the results into your chosen monitoring stack (e.g., Prometheus + Grafana, Datadog, New Relic).

Magento 2 Application Performance Monitoring (APM)

Magento 2 is a complex application with many moving parts. Effective monitoring requires looking beyond basic server resource utilization to understand application-level performance bottlenecks. This includes tracking request latency, error rates, database query times, and external service dependencies.

Leveraging New Relic for Deep Insights

New Relic is a powerful APM tool that provides granular visibility into Magento 2 applications. Its PHP agent can automatically instrument your code, capturing transaction traces, database queries, external calls, and errors.

Installation and Configuration (PHP Agent):

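1. **Download the agent:** Obtain the latest New Relic PHP agent for Linux from the New Relic website or via `wget`. Release archives are versioned and the PHP agent tarball ships an installer script named `newrelic-install`, so treat the file names below as illustrative and adjust them to match the release you download.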

wget https://download.newrelic.com/daemon/newrelic-daemon-x64.tar.gz
tar -zxvf newrelic-daemon-x64.tar.gz
cd newrelic-daemon-x64
sudo ./install.sh

2. **Configure `newrelic.ini`:** The installer will prompt for your New Relic license key and application name. The configuration file typically lives in the directory PHP scans for additional .ini files (for example `/etc/php/8.1/fpm/conf.d/newrelic.ini`); run `php --ini` to see which directories your installation uses.

; Your New Relic license key
newrelic.license = "YOUR_NEW_RELIC_LICENSE_KEY"
; The application name shown in the New Relic UI
newrelic.appname = "Magento2 Production Server"
; Set to true to enable the agent
newrelic.enabled = true
; The agent's log level and log file
newrelic.loglevel = "info"
newrelic.logfile = "/var/log/newrelic/newrelic-php5.log"
; The path to the daemon's log file
newrelic.daemon.logfile = "/var/log/newrelic/newrelic-daemon.log"
; The path to the daemon's pid file
newrelic.daemon.pidfile = "/var/run/newrelic-daemon.pid"

3. **Enable the extension in `php.ini`:** Ensure the `newrelic.so` extension is loaded. This is usually handled by the installer, but verify in your `php.ini` (or relevant `conf.d` file).

extension=newrelic.so

4. **Restart your web server (Nginx/Apache) and PHP-FPM:**

sudo systemctl restart nginx
sudo systemctl restart php8.1-fpm # Adjust PHP version as needed

Key Magento 2 Metrics in New Relic

Transaction Traces: Identify slow pages, API endpoints, or background tasks. Look for Magento-specific components like EAV queries, collection loading, or plugin execution.
Database Queries: Pinpoint inefficient SQL queries. Magento’s EAV model can lead to complex and slow queries if not optimized.
External Services: Monitor latency and errors when calling third-party APIs (payment gateways, shipping providers, ERP integrations).
Errors: Track PHP exceptions and fatal errors. Filter by Magento error codes or specific modules.
Throughput: Requests per minute.
Apdex Score: A measure of user satisfaction based on response times.
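
To make the Apdex score concrete, here is a small sketch of the standard formula: requests at or below a threshold T count as satisfied, those between T and 4T count as tolerating and are weighted by half, and anything slower is frustrated. The threshold of 0.5 seconds and the sample response times are assumptions for illustration; use whatever T you configure for your application.

```python
def apdex(response_times, t=0.5):
    """Compute an Apdex score for response times in seconds, given threshold t."""
    if not response_times:
        return None
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    # Frustrated requests (slower than 4t) contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical response times (seconds) for a product page.
samples = [0.2, 0.3, 0.4, 0.9, 1.5, 2.5]
print(f"Apdex: {apdex(samples):.2f}")
```

With these samples, three requests are satisfied, two are tolerating, and one is frustrated, giving (3 + 2/2) / 6.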

By correlating New Relic’s APM data with MongoDB metrics, you can diagnose issues like slow product page loads caused by inefficient MongoDB queries or replication lag impacting checkout processes.

Server-Level Monitoring on DigitalOcean

DigitalOcean provides built-in monitoring for Droplets, offering a good baseline. However, for production environments, a more robust, centralized monitoring solution is essential. We’ll focus on setting up Prometheus and Grafana for comprehensive metrics collection and visualization.

Prometheus & Grafana Stack Deployment

We’ll deploy Prometheus for time-series data collection and Grafana for dashboarding. This can be done directly on a dedicated Droplet or, preferably, using Docker Compose for easier management and isolation.

Using Docker Compose:

Create a docker-compose.yml file:

version: '3.7'

services:
  prometheus:
    image: prom/prometheus:v2.40.0 # Use a specific, stable version
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.0.0 # Use a specific, stable version
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

  node_exporter:
    image: prom/node-exporter:v1.6.0 # Use a specific, stable version
    container_name: node_exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host/rootfs:ro # Required by the --path.rootfs flag below
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/host/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped

  mongodb_exporter:
    image: percona/mongodb_exporter:latest # Consider pinning to a specific version
    container_name: mongodb_exporter
    environment:
      - MONGODB_URI=mongodb://user:password@mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/?replicaSet=myReplicaSet&authSource=admin
    ports:
      - "9216:9216" # Default port for mongodb_exporter
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Prometheus Configuration (`prometheus.yml`):

global:
  scrape_interval: 15s # How frequently to scrape targets
  evaluation_interval: 15s # How frequently to evaluate rules

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000'] # Use service name if on same Docker network

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node_exporter:9100'] # Use service name if on same Docker network

  - job_name: 'mongodb_exporter'
    static_configs:
      - targets: ['mongodb_exporter:9216'] # Use service name if on same Docker network
    metrics_path: /metrics
    # You might want to add specific scrape configs for each MongoDB node if not using a single exporter
    # that aggregates, or if you have multiple replica sets.

Deployment Steps:

1. Create a directory for your configuration: mkdir -p prometheus-grafana/prometheus && cd prometheus-grafana
2. Save docker-compose.yml in this directory and prometheus.yml in the prometheus/ subdirectory, matching the volume mount in the Compose file.
3. Run: docker-compose up -d
4. Access Grafana at http://your_droplet_ip:3000 (default login: admin/admin; change the password on first login).
5. Add Prometheus as a data source in Grafana (URL: http://prometheus:9090).
6. Import pre-built Grafana dashboards for Node Exporter and MongoDB Exporter (many are available on Grafana.com).
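
After the stack is up, it helps to confirm that every scrape target is healthy. The sketch below reads the Prometheus `/api/v1/targets` endpoint and lists any target whose health is not `up`. The base URL and the sample payload are assumptions for illustration; the payload only mimics the shape of a real response.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload):
    """Return (job, instance, lastError) for every active target that is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t["labels"].get("job"), t["labels"].get("instance"), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Query the Prometheus HTTP API; base_url is an assumption for this sketch."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Hypothetical payload with the same shape as the /api/v1/targets response.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node_exporter", "instance": "node_exporter:9100"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "mongodb_exporter", "instance": "mongodb_exporter:9216"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}
for job, instance, error in unhealthy_targets(sample):
    print(f"{job} at {instance} is down: {error}")
```

Against a live server, `unhealthy_targets(fetch_targets())` would perform the same check on real data.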

Essential Server Metrics & Dashboards

For your Magento 2 Droplets, focus on:

CPU Usage: Overall, per-core, and per-process (especially PHP-FPM, Nginx).
Memory Usage: Total, free, cached, buffered. Monitor swap usage closely.
Disk I/O: Read/write operations, latency, queue depth.
Network Traffic: In/out bytes, packets, errors.
PHP-FPM Status: Active processes, requests, slow requests.
Nginx Status: Active connections, requests per second, error rates (4xx, 5xx).
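
Node Exporter reports CPU time as ever-increasing counters (`node_cpu_seconds_total`), so CPU usage has to be derived from the change between two samples rather than read directly. The sketch below shows that arithmetic for the idle counter; the sample numbers are invented, and in practice a PromQL `rate()` over the idle series performs the same calculation.

```python
def cpu_usage_percent(idle_start, idle_end, elapsed_seconds, cores):
    """Approximate CPU usage from two samples of the summed idle counter.

    idle_start and idle_end are the sum over all cores of
    node_cpu_seconds_total{mode="idle"} at the start and end of the interval.
    """
    # Total CPU time available in the interval is elapsed wall time times core count.
    idle_fraction = (idle_end - idle_start) / (elapsed_seconds * cores)
    return 100.0 * (1.0 - idle_fraction)


# Hypothetical: 4 cores accumulate 48 idle seconds over a 60-second interval.
usage = cpu_usage_percent(10_000.0, 10_048.0, 60, 4)
print(f"CPU usage: {usage:.1f}%")
```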

For your MongoDB Droplets (if self-hosted), the mongodb_exporter will provide crucial metrics. Ensure your Grafana dashboards visualize:

Oplog status (if not using the Python script).
Replication lag.
Query performance (reads/writes per second, latency).
Cache hit rates.
Network traffic.
Disk I/O.
Connections.

Alerting Strategy

Proactive alerting is key to preventing outages. We’ll use Prometheus Alertmanager to handle alerts generated by Prometheus rules.

Configuring Alertmanager

Add an Alertmanager service to your docker-compose.yml:

  alertmanager:
    image: prom/alertmanager:v0.25.0 # Use a specific, stable version
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

Create alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m
  # Email configuration (example)
  # smtp_smarthost: 'smtp.example.com:587'
  # smtp_from: 'YOUR_FROM_ADDRESS'
  # smtp_auth_username: 'YOUR_SMTP_USERNAME'
  # smtp_auth_password: 'YOUR_SMTP_PASSWORD'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver if no specific route matches

  routes:
    - receiver: 'slack-notifications'
      matchers:
        - severity =~ "critical|warning"
      continue: true # Allows matching other routes if needed

receivers:
  - name: 'default-receiver'
    webhook_configs:
      - url: 'http://your-alert-webhook-url/path' # e.g., for PagerDuty, Opsgenie, or a custom handler

  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
        send_resolved: true
        title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}'
        text: >-
          {{ range .Alerts }}
            *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
            *Description:* {{ .Annotations.description }}
            *Details:*
            {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
            {{ end }}
          {{ end }}
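
If you point the `default-receiver` webhook at a custom handler, that handler receives a JSON document containing an `alerts` array, where each alert carries its own `status`, `labels`, and `annotations`. The sketch below turns such a payload into one line per alert. The sample payload is hypothetical and includes only the fields the function reads.

```python
import json


def summarize_alerts(payload):
    """Turn an Alertmanager webhook payload into one line of text per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            f"[{alert.get('status', 'unknown').upper()}] "
            f"{labels.get('alertname', '?')} "
            f"({labels.get('severity', 'none')}): "
            f"{annotations.get('summary', '')}"
        )
    return lines


# Hypothetical request body, trimmed to the fields used above.
body = json.dumps({
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "LowDiskSpace", "severity": "critical"},
            "annotations": {"summary": "Low disk space on Magento server 10.0.0.5:9100"},
        }
    ],
})
for line in summarize_alerts(json.loads(body)):
    print(line)
```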

Update your prometheus.yml to include the Alertmanager:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093'] # Use service name if on same Docker network

Example Prometheus Alerting Rules

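Create a file like prometheus/rules.yml, mount it into the Prometheus container alongside prometheus.yml (for example at /etc/prometheus/rules.yml), and reference that path in your prometheus.yml under `rule_files`: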

groups:
  - name: MagentoAlerts
    rules:
      - alert: HighCpuUsage
        # node_cpu_seconds_total is a counter, so derive usage from the rate of idle time
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle", instance="your_magento_droplet_ip:9100"}[5m]))) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on Magento server {{ $labels.instance }}"
          description: "CPU usage has been above 90% for 5 minutes on {{ $labels.instance }}."

      - alert: LowDiskSpace
        expr: node_filesystem_avail_bytes{mountpoint="/", instance="your_magento_droplet_ip:9100"} < 1024 * 1024 * 1024 # Less than 1GB
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on Magento server {{ $labels.instance }}"
          description: "Filesystem '/' on {{ $labels.instance }} has less than 1GB free space."

      - alert: HighPhpFpmSlowRequests
        # This requires PHP-FPM's status page to be enabled and scraped by Prometheus
        # You'll need a php-fpm exporter or configure Prometheus to scrape the status page directly
        # Example assumes a php-fpm exporter is running and accessible
        # The metric is a counter, so alert on its increase rather than its absolute value
        expr: increase(php_fpm_slow_requests_total{instance="your_php_fpm_exporter_ip:9000"}[5m]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High number of slow PHP-FPM requests on {{ $labels.instance }}"
          description: "PHP-FPM on {{ $labels.instance }} logged more than 5 slow requests in the last 5 minutes."

  - name: MongoAlerts
    rules:
      - alert: MongoOplogWindowTooSmall
        # This rule uses the output of the Python script if it's exposed via an exporter,
        # or directly queries MongoDB if the mongodb_exporter supports it.
        # Assuming mongodb_exporter exposes a metric like 'mongodb_replset_oplog_window_seconds'
        # A short window is the risk: a secondary that lags beyond it needs a full resync
        expr: mongodb_replset_oplog_window_seconds{job="mongodb_exporter"} < 900 # 15 minutes in seconds
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MongoDB oplog window is too small on {{ $labels.instance }}"
          description: "The oplog window for replica set {{ $labels.replset }} is {{ $value }} seconds, below the 900-second threshold."

      - alert: MongoNetworkError
        # Example: no network traffic for a period. The metric is a counter, so check its rate.
        # Metric names differ between exporter versions; confirm them in your exporter's /metrics output.
        expr: rate(mongodb_network_in_bytes_total{job="mongodb_exporter"}[5m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No MongoDB network traffic detected on {{ $labels.instance }}"
          description: "No incoming network traffic detected on MongoDB instance {{ $labels.instance }} for 10 minutes. Potential network issue or node down."
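
The `for:` clause in these rules is easy to misread: an alert only fires once its expression has been true at every evaluation for the whole duration, and a single false evaluation resets the timer. The sketch below is a simplified model of that behaviour, not Prometheus's implementation; the evaluation interval, threshold, and sample values are invented for illustration.

```python
def firing_times(samples, threshold, for_seconds):
    """Return the evaluation timestamps at which a 'value > threshold' alert fires.

    samples is a list of (timestamp_seconds, value) pairs in evaluation order.
    """
    pending_since = None
    firing = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # the alert becomes pending
            if ts - pending_since >= for_seconds:
                firing.append(ts)  # held long enough, the alert fires
        else:
            pending_since = None  # condition cleared, the timer resets
    return firing


# Hypothetical CPU usage evaluated every 60 seconds, with one dip at t=120.
values = [95, 95, 50, 95, 95, 95, 95, 95, 95]
samples = [(i * 60, v) for i, v in enumerate(values)]
print(firing_times(samples, threshold=90, for_seconds=300))
```

In this run the dip at 120 seconds resets the timer, so the alert does not fire until 300 seconds after the condition becomes true again at 180 seconds.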

Remember to replace placeholder IPs and URLs with your actual configurations. Regularly review and refine your alerting rules to minimize false positives and ensure critical issues are flagged promptly.