Server Monitoring Best Practices: Keeping Your Magento 2 App and MongoDB Clusters Alive on DigitalOcean
Proactive MongoDB Cluster Health Checks
Maintaining the health of a MongoDB replica set is paramount for Magento 2’s performance and availability. Beyond basic CPU/RAM, we need to monitor MongoDB-specific metrics that indicate potential issues before they impact the application. This involves querying the MongoDB server directly and setting up alerts based on these metrics.
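A critical metric is the oplog window: the time span between the oldest and newest entries in the primary's operation log, which is how long a secondary can lag or be offline and still catch up. A shrinking window, or replication lag that approaches the window, means a secondary risks falling off the end of the oplog and needing a full resync; stale reads and failover problems appear well before that point. The window is reported by rs.printReplicationInfo(), while per-member lag can be derived from the optime fields returned by rs.status().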
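A secondary can only catch up while its lag stays smaller than the oplog window. The sketch below makes that relationship concrete; the function name `replication_headroom` and the 50% warning ratio are illustrative assumptions, not MongoDB settings, and both inputs are plain seconds you would obtain from the oplog and the replica set status.

```python
def replication_headroom(oplog_window_s: float, secondary_lag_s: float,
                         warn_ratio: float = 0.5) -> tuple[float, str]:
    """Return (headroom_seconds, status) for one secondary.

    Headroom is how much further the secondary can fall behind before the
    oplog entries it still needs are overwritten on the primary.
    warn_ratio is a hypothetical threshold: warn once the lag consumes
    this share of the window.
    """
    headroom = oplog_window_s - secondary_lag_s
    if headroom <= 0:
        status = "resync-required"
    elif secondary_lag_s >= warn_ratio * oplog_window_s:
        status = "warning"
    else:
        status = "ok"
    return headroom, status
```

Calling `replication_headroom(7200, 120)` describes a two-hour window with a secondary two minutes behind, which leaves plenty of room; the same lag against a five-minute window would not.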
Monitoring Oplog Lag
We’ll use a simple Python script to connect to the MongoDB replica set, fetch the oplog status, and calculate the lag. This script can be scheduled via cron or a systemd timer.
First, ensure you have the pymongo library installed:
pip install pymongo
Here’s the Python script:
import sys

import pymongo
from pymongo.errors import ConnectionFailure

# --- Configuration ---
MONGO_URI = "mongodb://user:password@mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/?replicaSet=myReplicaSet&authSource=admin"
OPLOG_LAG_THRESHOLD_MINUTES = 15  # Alert if a secondary lags the primary by more than 15 minutes
# ---------------------


def check_oplog_lag(mongo_uri, lag_threshold_minutes):
    client = None
    try:
        client = pymongo.MongoClient(mongo_uri, serverSelectionTimeoutMS=10000)
        rs_status = client.admin.command("replSetGetStatus")
        if not rs_status.get("ok"):
            print(f"Error: Could not get replica set status. Response: {rs_status}")
            sys.exit(1)

        members = rs_status.get("members", [])
        primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
        if primary is None:
            print("Error: No primary found in replica set.")
            sys.exit(1)

        # With a replica set URI, reads go to the primary by default,
        # so the same client can query the oplog.
        oplog_collection = client.local["oplog.rs"]

        # $natural order returns the true oldest and newest entries
        # without counting or skipping through the collection.
        oldest_oplog_entry = oplog_collection.find_one(sort=[("$natural", pymongo.ASCENDING)])
        latest_oplog_entry = oplog_collection.find_one(sort=[("$natural", pymongo.DESCENDING)])
        if not latest_oplog_entry or not oldest_oplog_entry:
            print("Warning: Could not retrieve oplog entries. Possibly empty oplog.")
            return

        # A BSON timestamp stores seconds since the epoch in .time and an ordinal in .inc
        oplog_window_seconds = latest_oplog_entry["ts"].time - oldest_oplog_entry["ts"].time
        oplog_window_minutes = oplog_window_seconds / 60
        print(f"Oplog Window: {oplog_window_minutes:.2f} minutes")

        # Replication lag: how far each secondary's last applied operation trails the primary's
        lags = {
            m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds() / 60
            for m in members
            if m.get("stateStr") == "SECONDARY"
        }
        max_lag_minutes = max(lags.values(), default=0.0)
        print(f"Max Replication Lag: {max_lag_minutes:.2f} minutes")

        # In a real-world scenario, you'd send an alert here (e.g., via PagerDuty, Slack, email)
        if max_lag_minutes > lag_threshold_minutes:
            print(f"ALERT: Replication lag ({max_lag_minutes:.2f} minutes) exceeds threshold ({lag_threshold_minutes} minutes).")
            sys.exit(2)  # Exit with a non-zero code to indicate an alert
        if oplog_window_minutes < lag_threshold_minutes:
            print(f"ALERT: Oplog window ({oplog_window_minutes:.2f} minutes) is shorter than the tolerated lag ({lag_threshold_minutes} minutes).")
            sys.exit(2)
    except ConnectionFailure as e:
        print(f"Error: Could not connect to MongoDB: {e}")
        sys.exit(1)
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        sys.exit(1)
    finally:
        if client is not None:
            client.close()


if __name__ == "__main__":
    check_oplog_lag(MONGO_URI, OPLOG_LAG_THRESHOLD_MINUTES)
To integrate this with a monitoring system like Prometheus, you could adapt the script to expose metrics via an HTTP endpoint (using Flask or FastAPI) or use a dedicated MongoDB exporter. For simpler setups, cron jobs with `curl` and a basic alert script are sufficient.
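As a lighter alternative to Flask or FastAPI, the same idea can be sketched with Python's standard library alone. The metric names, the port 9105, and the `collect` placeholder are assumptions for illustration; in practice `collect` would return the values computed by the oplog check.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values: dict[str, float]) -> str:
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(values.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect() -> dict[str, float]:
    """Placeholder: replace with numbers taken from the oplog check."""
    return {
        "mongo_oplog_window_seconds": 0.0,       # hypothetical metric name
        "mongo_replication_lag_seconds": 0.0,    # hypothetical metric name
    }


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve only the path Prometheus scrapes by default
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9105) -> None:
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Running `serve()` under systemd and adding the host and port as a scrape target would be enough for Prometheus to pick the values up.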
Key MongoDB Metrics to Monitor
- Oplog Window: As detailed above, crucial for replica set health.
- Network In/Out: High traffic can indicate replication issues or heavy application load.
- Disk I/O: MongoDB is I/O intensive. Monitor read/write operations per second and latency.
- Memory Usage: Track resident memory and cache hit rates.
- Connections: Monitor active connections and connection pool usage.
- Query Performance: Track slow queries (using MongoDB’s profiler) and overall query latency.
- Replication Lag (per member): While oplog window is global, individual member lag is also important.
- Disk Space: Ensure sufficient free space for data, oplog, and temporary files.
DigitalOcean’s Managed Databases for MongoDB provide some of these metrics out-of-the-box. For self-hosted clusters, consider using tools like mongostat, mongotop, or the MongoDB Atlas monitoring tools (even if not using Atlas for hosting, their concepts are valuable) and integrating them with your chosen monitoring stack (e.g., Prometheus + Grafana, Datadog, New Relic).
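Per-member replication lag, mentioned in the list above, can be derived from the `members` array that `replSetGetStatus` returns. The following is a minimal sketch operating on that document as a plain dictionary; `member_lag_seconds` and `example_status` are hypothetical helper names, and the sample data is hand-written rather than taken from a real cluster.

```python
from datetime import datetime, timedelta, timezone


def member_lag_seconds(rs_status: dict) -> dict[str, float]:
    """Map each SECONDARY's name to its lag behind the primary, in seconds."""
    members = rs_status.get("members", [])
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no PRIMARY member in replica set status")
    lags = {}
    for member in members:
        if member.get("stateStr") != "SECONDARY":
            continue  # skip arbiters and members that are starting up or recovering
        delta = primary["optimeDate"] - member["optimeDate"]
        lags[member["name"]] = max(0.0, delta.total_seconds())
    return lags


def example_status() -> dict:
    """Hand-written sample shaped like replSetGetStatus output (not real data)."""
    now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    return {
        "members": [
            {"name": "mongo1.example.com:27017", "stateStr": "PRIMARY", "optimeDate": now},
            {"name": "mongo2.example.com:27017", "stateStr": "SECONDARY",
             "optimeDate": now - timedelta(seconds=4)},
            {"name": "mongo3.example.com:27017", "stateStr": "SECONDARY",
             "optimeDate": now - timedelta(seconds=90)},
        ]
    }
```

With pymongo, the input would come from `client.admin.command("replSetGetStatus")`.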
Magento 2 Application Performance Monitoring (APM)
Magento 2 is a complex application with many moving parts. Effective monitoring requires looking beyond basic server resource utilization to understand application-level performance bottlenecks. This includes tracking request latency, error rates, database query times, and external service dependencies.
Leveraging New Relic for Deep Insights
New Relic is a powerful APM tool that provides granular visibility into Magento 2 applications. Its PHP agent can automatically instrument your code, capturing transaction traces, database queries, external calls, and errors.
Installation and Configuration (PHP Agent):
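1. **Download the agent:** Obtain the latest New Relic PHP agent installer for Linux from the New Relic website or via `wget`. Confirm the exact archive URL and installer name on New Relic's download page before running these commands; current PHP agent releases are typically packaged as `newrelic-php5-<version>-linux.tar.gz` and ship a `newrelic-install` script.
wget https://download.newrelic.com/daemon/newrelic-daemon-x64.tar.gz
tar -zxvf newrelic-daemon-x64.tar.gz
cd newrelic-daemon-x64
sudo ./install.sh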
2. **Configure `newrelic.ini`:** The installer will prompt for your New Relic license key and application name. The configuration file usually lives in PHP's additional ini directory (for example `/etc/php/8.1/fpm/conf.d/newrelic.ini`); adjust the path to your PHP version and SAPI. The agent reads standard PHP ini directives prefixed with `newrelic.`:
[newrelic]
newrelic.license = "YOUR_NEW_RELIC_LICENSE_KEY"
newrelic.appname = "Magento2 Production Server"
; Set to true to enable the agent
newrelic.enabled = true
; Agent log level and log file
newrelic.loglevel = "info"
newrelic.logfile = "/var/log/newrelic/newrelic-php5.log"
; The path to the daemon's log file
newrelic.daemon.logfile = "/var/log/newrelic/newrelic-daemon.log"
; The path to the daemon's pid file
newrelic.daemon.pidfile = "/var/run/newrelic-daemon.pid"
3. **Enable the extension in `php.ini`:** Ensure the `newrelic.so` extension is loaded. This is usually handled by the installer, but verify in your `php.ini` (or relevant `conf.d` file).
extension=newrelic.so
4. **Restart your web server (Nginx/Apache) and PHP-FPM:**
sudo systemctl restart nginx
sudo systemctl restart php8.1-fpm # Adjust PHP version as needed
Key Magento 2 Metrics in New Relic
- Transaction Traces: Identify slow pages, API endpoints, or background tasks. Look for Magento-specific components like EAV queries, collection loading, or plugin execution.
- Database Queries: Pinpoint inefficient SQL queries. Magento’s EAV model can lead to complex and slow queries if not optimized.
- External Services: Monitor latency and errors when calling third-party APIs (payment gateways, shipping providers, ERP integrations).
- Errors: Track PHP exceptions and fatal errors. Filter by Magento error codes or specific modules.
- Throughput: Requests per minute.
- Apdex Score: A measure of user satisfaction based on response times.
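For reference, the Apdex score is computed from a response-time threshold T: requests at or under T count as satisfied, requests between T and 4T as tolerating (half weight), and anything slower as frustrated. A minimal sketch of that arithmetic follows; the function name and the 0.5-second threshold in the usage note are illustrative choices.

```python
def apdex(response_times_s: list[float], threshold_s: float) -> float:
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not response_times_s:
        raise ValueError("at least one sample is required")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(1 for t in response_times_s
                     if threshold_s < t <= 4 * threshold_s)
    return (satisfied + tolerating / 2) / len(response_times_s)
```

For example, with T = 0.5 s, five requests taking 0.1, 0.3, 0.6, 1.5 and 3.0 seconds give two satisfied, two tolerating and one frustrated request, so the score is (2 + 1) / 5 = 0.6.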
By correlating New Relic’s APM data with MongoDB metrics, you can diagnose issues like slow product page loads caused by inefficient MongoDB queries or replication lag impacting checkout processes.
Server-Level Monitoring on DigitalOcean
DigitalOcean provides built-in monitoring for Droplets, offering a good baseline. However, for production environments, a more robust, centralized monitoring solution is essential. We’ll focus on setting up Prometheus and Grafana for comprehensive metrics collection and visualization.
Prometheus & Grafana Stack Deployment
We’ll deploy Prometheus for time-series data collection and Grafana for dashboarding. This can be done directly on a dedicated Droplet or, preferably, using Docker Compose for easier management and isolation.
Using Docker Compose:
Create a docker-compose.yml file:
version: '3.7'
services:
  prometheus:
    image: prom/prometheus:v2.40.0 # Use a specific, stable version
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    restart: unless-stopped
  grafana:
    image: grafana/grafana:10.0.0 # Use a specific, stable version
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped
  node_exporter:
    image: prom/node-exporter:v1.6.0 # Use a specific, stable version
    container_name: node_exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host/rootfs:ro # Required by --path.rootfs below
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/host/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped
  mongodb_exporter:
    image: percona/mongodb_exporter:latest # Consider pinning to a specific version
    container_name: mongodb_exporter
    environment:
      - MONGODB_URI=mongodb://user:password@mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/?replicaSet=myReplicaSet&authSource=admin
    command:
      - '--web.listen-address=:9204' # The exporter listens on 9216 by default; 9204 matches the scrape target in prometheus.yml
    ports:
      - "9204:9204"
    restart: unless-stopped
volumes:
  prometheus_data:
  grafana_data:
Prometheus Configuration (`prometheus.yml`):
global:
  scrape_interval: 15s # How frequently to scrape targets
  evaluation_interval: 15s # How frequently to evaluate rules
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000'] # Use service name if on same Docker network
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node_exporter:9100'] # Use service name if on same Docker network
  - job_name: 'mongodb_exporter'
    static_configs:
      - targets: ['mongodb_exporter:9204'] # Use service name if on same Docker network
    metrics_path: /metrics
    # You might want to add specific scrape configs for each MongoDB node if not using a single exporter
    # that aggregates, or if you have multiple replica sets.
Deployment Steps:
- Create a directory for your configuration, with a prometheus subdirectory to match the volume mount in docker-compose.yml:
mkdir -p prometheus-grafana/prometheus && cd prometheus-grafana
- Save docker-compose.yml in this directory and prometheus.yml in the prometheus subdirectory.
- Run:
docker-compose up -d
- Access Grafana at http://your_droplet_ip:3000 (default login: admin/admin). Change the default password on first login.
- Add Prometheus as a data source in Grafana (URL: http://prometheus:9090).
- Import pre-built Grafana dashboards for Node Exporter and MongoDB Exporter (many are available on Grafana.com).
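Once the stack is up, it is worth confirming that every scrape target is healthy. Prometheus reports this through its HTTP API at `/api/v1/targets`; the sketch below fetches that document and lists targets whose health is not "up". The helper names and the `http://localhost:9090` base URL are assumptions for illustration.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload: dict) -> list[tuple[str, str]]:
    """Return (job, scrape URL) for every active target that is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "unknown"), t.get("scrapeUrl", "unknown"))
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Download the target list from a running Prometheus server."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

Running `unhealthy_targets(fetch_targets())` on the monitoring Droplet should return an empty list when all exporters are reachable.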
Essential Server Metrics & Dashboards
For your Magento 2 Droplets, focus on:
- CPU Usage: Overall, per-core, and per-process (especially PHP-FPM, Nginx).
- Memory Usage: Total, free, cached, buffered. Monitor swap usage closely.
- Disk I/O: Read/write operations, latency, queue depth.
- Network Traffic: In/out bytes, packets, errors.
- PHP-FPM Status: Active processes, requests, slow requests.
- Nginx Status: Active connections, requests per second, error rates (4xx, 5xx).
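The Nginx figures in this list usually come from the `stub_status` module, which an exporter scrapes and converts into metrics. To show what that page contains, here is a minimal parsing sketch; it assumes `stub_status` is enabled on some internal location of your choosing, and the function name is hypothetical.

```python
import re


def parse_stub_status(text: str) -> dict[str, int]:
    """Parse the plain-text output of Nginx's stub_status page."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    # The counters line holds three integers: accepts, handled, requests
    counters = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    states = re.search(r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text)
    if not (active and counters and states):
        raise ValueError("unexpected stub_status format")
    return {
        "active": int(active.group(1)),
        "accepts": int(counters.group(1)),
        "handled": int(counters.group(2)),
        "requests": int(counters.group(3)),
        "reading": int(states.group(1)),
        "writing": int(states.group(2)),
        "waiting": int(states.group(3)),
    }
```

`accepts`, `handled` and `requests` are cumulative counters, so requests per second is the difference between two samples divided by the time between them.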
For your MongoDB Droplets (if self-hosted), the mongodb_exporter will provide crucial metrics. Ensure your Grafana dashboards visualize:
- Oplog status (if not using the Python script).
- Replication lag.
- Query performance (reads/writes per second, latency).
- Cache hit rates.
- Network traffic.
- Disk I/O.
- Connections.
Alerting Strategy
Proactive alerting is key to preventing outages. We’ll use Prometheus Alertmanager to handle alerts generated by Prometheus rules.
Configuring Alertmanager
Add an Alertmanager service to your docker-compose.yml:
  alertmanager:
    image: prom/alertmanager:v0.25.0 # Use a specific, stable version
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped
Create alertmanager/alertmanager.yml:
global:
  resolve_timeout: 5m
  # Email configuration (example)
  # smtp_smarthost: 'smtp.example.com:587'
  # smtp_from: 'YOUR_FROM_ADDRESS'
  # smtp_auth_username: 'YOUR_SMTP_USERNAME'
  # smtp_auth_password: 'YOUR_SMTP_PASSWORD'
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver' # Default receiver if no specific route matches
  routes:
    - receiver: 'slack-notifications'
      matchers:
        - severity =~ "critical|warning"
      continue: true # Allows matching other routes if needed
receivers:
  - name: 'default-receiver'
    webhook_configs:
      - url: 'http://your-alert-webhook-url/path' # e.g., for PagerDuty, Opsgenie, or a custom handler
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
        send_resolved: true
        title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}
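If the default receiver points at a custom handler, that handler receives a JSON document containing an `alerts` array, where each alert carries `status`, `labels` and `annotations`. The sketch below turns such a payload into one-line summaries; the function name and output format are illustrative, and a real handler would wrap this in an HTTP endpoint.

```python
def summarize_alerts(payload: dict) -> list[str]:
    """Build one human-readable line per alert in an Alertmanager webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} ({severity}): {summary}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "unnamed"),
                severity=labels.get("severity", "none"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines
```

The `severity` label read here is the same one the route's matcher tests, so alerts without it fall through to the default receiver only.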
Update your prometheus.yml to include the Alertmanager:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093'] # Use service name if on same Docker network
Example Prometheus Alerting Rules
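Create a file such as prometheus/rules.yml, mount it into the Prometheus container alongside prometheus.yml (for example at /etc/prometheus/rules.yml), and list that container path under `rule_files` in your prometheus.yml: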
groups:
  - name: MagentoAlerts
    rules:
      - alert: HighCpuUsage
        # node_cpu_seconds_total is a cumulative counter, so usage is derived from the rate of idle time
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle", instance="your_magento_droplet_ip:9100"}[5m]))) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on Magento server {{ $labels.instance }}"
          description: "CPU usage has been above 90% for 5 minutes on {{ $labels.instance }}."
      - alert: LowDiskSpace
        expr: node_filesystem_avail_bytes{mountpoint="/", instance="your_magento_droplet_ip:9100"} < 1024 * 1024 * 1024 # Less than 1GB
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on Magento server {{ $labels.instance }}"
          description: "Filesystem '/' on {{ $labels.instance }} has less than 1GB free space."
      - alert: HighPhpFpmSlowRequests
        # This requires PHP-FPM's status page to be enabled and scraped by Prometheus
        # You'll need a php-fpm exporter or configure Prometheus to scrape the status page directly
        # Example assumes a php-fpm exporter is running and accessible; the metric name depends on the exporter
        # The slow-request count is cumulative, so alert on its increase rather than its absolute value
        expr: increase(php_fpm_slow_requests_total{instance="your_php_fpm_exporter_ip:9000"}[5m]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High number of slow PHP-FPM requests on {{ $labels.instance }}"
          description: "PHP-FPM on {{ $labels.instance }} reported more than 5 slow requests in the last 5 minutes."
  - name: MongoAlerts
    rules:
      - alert: MongoOplogWindowTooSmall
        # This rule uses the output of the Python script if it's exposed via an exporter,
        # or directly queries MongoDB if the mongodb_exporter supports it.
        # Assuming mongodb_exporter exposes a metric like 'mongodb_replset_oplog_window_seconds'
        # A short window leaves secondaries little time to catch up before a full resync is needed
        expr: mongodb_replset_oplog_window_seconds{job="mongodb_exporter"} < 900 # 15 minutes in seconds
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MongoDB oplog window is too small on {{ $labels.instance }}"
          description: "The oplog window for replica set {{ $labels.replset }} is {{ $value }} seconds, below the 900-second threshold."
      - alert: MongoNetworkError
        # The byte count is cumulative, so check its rate; the metric name depends on the exporter version
        expr: rate(mongodb_network_in_bytes_total{job="mongodb_exporter"}[5m]) == 0 # Example: No network traffic for a period
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No MongoDB network traffic detected on {{ $labels.instance }}"
          description: "No incoming network traffic detected on MongoDB instance {{ $labels.instance }} for 10 minutes. Potential network issue or node down."
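A note on CPU rules: `node_cpu_seconds_total` is a cumulative counter of seconds spent in each mode, so CPU usage has to be derived from how fast the idle counter grows rather than from its raw value. The arithmetic that a PromQL `rate()` over the idle series performs can be sketched as follows; the function name and sample numbers are illustrative.

```python
def cpu_usage_percent(idle_start_s: float, idle_end_s: float,
                      elapsed_s: float, cores: int = 1) -> float:
    """CPU usage over an interval, from two samples of the idle-seconds counter.

    idle_start_s / idle_end_s: idle counter summed over all cores at the
    start and end of the interval. elapsed_s: wall-clock interval length.
    """
    if elapsed_s <= 0 or cores < 1:
        raise ValueError("elapsed_s must be positive and cores at least 1")
    idle_fraction = (idle_end_s - idle_start_s) / (elapsed_s * cores)
    return 100.0 * (1.0 - idle_fraction)
```

On a single core, an idle counter that advances by 30 seconds over a 300-second window means the CPU was idle 10% of the time, which is 90% usage.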
Remember to replace placeholder IPs and URLs with your actual configurations. Regularly review and refine your alerting rules to minimize false positives and ensure critical issues are flagged promptly.