Troubleshooting Systemd Journald Log Spikes and Disk Space Exhaustion on RHEL 9 under Intense Network Scrapes

Diagnosing Journald Log Volume Surges on RHEL 9

Enterprise environments running RHEL 9 often encounter unexpected spikes in systemd-journald log volume, particularly when subjected to intense network scraping or high-throughput application activity. This can rapidly lead to disk space exhaustion, impacting system stability and availability. This post details a systematic approach to diagnosing and mitigating these issues, focusing on practical, production-ready techniques.

Identifying the Source of Excessive Logging

The first step is to pinpoint which services or system components are generating the bulk of the log data. journalctl provides powerful filtering and aggregation capabilities for this.

Real-time Log Monitoring and Filtering

To get a live view of log activity and identify noisy processes, use journalctl -f. On a busy host this can be overwhelming, so a more targeted approach is to filter by priority and service. For example, journalctl -p warning -f follows only entries of warning severity or higher.

To view logs from a specific service, such as a web server or a custom application, use:

journalctl -u nginx.service -f

To identify the top N log sources (by syslog identifier) over a specific period (e.g., the last hour), we can combine journalctl’s export output with some shell utilities:

journalctl --since "1 hour ago" -o export --output-fields=SYSLOG_IDENTIFIER,PRIORITY | awk -F= '/^PRIORITY=/{p=$2} /^SYSLOG_IDENTIFIER=/{i=$2} /^$/{if (i != "") print i, p; i=""; p=""}' | sort | uniq -c | sort -nr | head -n 10

This command:

journalctl --since "1 hour ago" -o export --output-fields=SYSLOG_IDENTIFIER,PRIORITY: Dumps logs from the last hour in the export format, which prints one FIELD=value pair per line and ends each entry with a blank line.
awk -F= '/^PRIORITY=/{p=$2} /^SYSLOG_IDENTIFIER=/{i=$2} /^$/{if (i != "") print i, p; i=""; p=""}': Remembers the syslog identifier and priority of the current entry and prints them as a pair at each blank line, that is, once per log entry.
sort | uniq -c | sort -nr | head -n 10: Counts occurrences of each identifier and priority pair, sorts the counts in descending order, and displays the top 10.

A similar approach can be used to identify specific messages that are being logged excessively:

journalctl --since "1 hour ago" -o cat | sort | uniq -c | sort -nr | head -n 20
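Entry counts can understate services that log a few very large messages. The following is a minimal sketch, not a definitive tool, that sums message bytes per systemd unit instead. The function name message_bytes_by_unit is hypothetical, and it assumes single-line text messages; multi-line or binary messages are serialized differently in the export format and are not counted.

```shell
#!/usr/bin/env bash
# Sum MESSAGE bytes per systemd unit from a journal export stream on stdin.
# Export entries are blocks of FIELD=value lines terminated by a blank line.
# Assumption: messages are single-line text; others are not counted here.
message_bytes_by_unit() {
    LC_ALL=C awk '
        # Add the current entry to its unit total, then reset per-entry state.
        function commit() {
            if (seen) total[(unit == "" ? "(no unit)" : unit)] += bytes
            unit = ""; bytes = 0; seen = 0
        }
        /^$/              { commit(); next }
        /^_SYSTEMD_UNIT=/ { unit = substr($0, 15) }      # text after "_SYSTEMD_UNIT="
        /^MESSAGE=/       { bytes += length($0) - 8 }    # text after "MESSAGE="
                          { seen = 1 }
        END {
            commit()
            for (u in total) printf "%d %s\n", total[u], u
        }
    ' | sort -nr
}

# Usage (on the affected host):
#   journalctl --since "1 hour ago" -o export \
#       --output-fields=_SYSTEMD_UNIT,MESSAGE | message_bytes_by_unit | head -n 10
```

The output lists total bytes followed by the unit name, largest first, which makes it easy to compare against the per-identifier counts above.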

Analyzing Journald Configuration for Disk Usage

systemd-journald has built-in mechanisms to control disk usage. The primary configuration file is /etc/systemd/journald.conf. Understanding its directives is crucial for managing log volume.

Key Configuration Directives

The most relevant directives for disk space management are:

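Storage: Controls where journal data is stored. Options are auto (the default: persistent if /var/log/journal/ exists, otherwise volatile), volatile (kept under /run/log/journal/ and lost on reboot), persistent (kept under /var/log/journal/, which is created if needed), and none (nothing is stored). For production servers, persistent is usually desired.
SystemMaxUse: Sets the maximum disk space the persistent journal may use (e.g., 1G, 500M). When this limit is reached, the oldest archived journal files are removed. The default is 10% of the file system, capped at 4G.
SystemKeepFree: Sets how much disk space journald leaves free for other uses (e.g., 1G, 500M). Sizes are given in bytes or with K, M, G, or T suffixes. journald respects both this and SystemMaxUse and applies whichever is more restrictive. The default is 15% of the file system, capped at 4G.
MaxFileSec: Sets the maximum time to store entries in a single journal file before rotating to a new one (default: 1 month). It controls rotation, not deletion.
MaxRetentionSec: Sets the maximum time to keep journal entries. Journal files whose entries are older than this are deleted; the default of 0 disables time-based cleanup.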
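As an illustration of how these directives fit together, the sketch below writes them to a drop-in file rather than editing journald.conf directly. The function name, the file name 90-disk-limits.conf, and every value are assumptions chosen for the example, not recommendations; size them against the actual /var file system.

```shell
#!/usr/bin/env bash
# Write a journald drop-in with disk limits.
# Argument: target directory (defaults to /etc/systemd/journald.conf.d).
# All values below are illustrative assumptions.
write_journald_dropin() {
    local dir="${1:-/etc/systemd/journald.conf.d}"
    mkdir -p "$dir" || return 1
    cat > "$dir/90-disk-limits.conf" <<'EOF'
[Journal]
Storage=persistent
SystemMaxUse=1G
SystemKeepFree=2G
MaxFileSec=1day
MaxRetentionSec=2weeks
EOF
}

# Usage (as root):
#   write_journald_dropin
#   systemd-analyze cat-config systemd/journald.conf   # show the merged configuration
```

A drop-in keeps local overrides separate from the packaged journald.conf, which tends to make them easier to track in configuration management.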

To apply changes to journald.conf, the service must be restarted:

sudo systemctl restart systemd-journald

It’s important to check the current disk usage of the journal:

journalctl --disk-usage

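To clean up old journal files if necessary, vacuum down to a target size. Vacuuming removes only archived journal files, so run sudo journalctl --rotate first if the active file holds most of the data: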

sudo journalctl --vacuum-size=500M

Or to remove logs older than a certain time:

sudo journalctl --vacuum-time=2weeks
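For an emergency cleanup that only acts when needed, the two steps can be wrapped in a guard. This is a rough sketch under stated assumptions: the function name vacuum_if_over and its arguments are hypothetical, it is meant to run as root, and the directory argument is used only for measuring, since journalctl always vacuums its own default journal location.

```shell
#!/usr/bin/env bash
# Rotate and vacuum the journal only when its directory exceeds a byte limit.
# Arguments: journal directory to measure, limit in bytes, vacuum target (e.g. 500M).
vacuum_if_over() {
    local dir="$1" limit_bytes="$2" target="$3" used
    [ -d "$dir" ] || return 0                  # no persistent journal, nothing to do
    used=$(du -sb "$dir" | awk '{print $1}')   # apparent size in bytes (GNU du)
    if [ "$used" -gt "$limit_bytes" ]; then
        journalctl --rotate                    # archive the active files first
        journalctl --vacuum-size="$target"     # then trim archived files
    else
        echo "journal uses $used bytes, under limit of $limit_bytes"
    fi
}

# Usage (as root): act once /var/log/journal exceeds 2 GiB
#   vacuum_if_over /var/log/journal 2147483648 500M
```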

Troubleshooting Network Scraping Impact

When network scraping tools (e.g., Prometheus exporters, Nagios checks, custom scripts) are configured to poll aggressively, they can inadvertently trigger verbose logging from applications or the system itself. This is often due to:

Increased request rates leading to more access logs.
Error conditions being triggered by the scraping process (e.g., timeouts, malformed requests) which then get logged.
The scraping tool itself generating excessive logs if misconfigured or if it encounters issues.

Identifying Scraper-Related Log Entries

Look for log entries that correlate with the scraping interval or the IP addresses of your monitoring infrastructure. If your scraping tool uses a specific user agent or logs identifiable messages, filter for those.

For example, if Prometheus is scraping an application and you suspect it’s causing log spikes, you might filter for entries originating from the Prometheus server’s IP or related to the application’s metrics endpoint:

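# Assuming the Prometheus server IP is 192.168.1.100. The -g option takes a regular
# expression, so the dots are escaped. Add _TRANSPORT=kernel to restrict the search
# to kernel messages (for example, firewall log entries).
journalctl -g '192\.168\.1\.100' --since "1 hour ago"
journalctl -u myapp.service -g "metrics" --since "1 hour ago"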
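To check whether matching entries follow the scrape interval, it can help to bucket them per minute. The helper below is a small sketch (the name entries_per_minute is hypothetical) that assumes journalctl's short-iso output, where each entry starts with an ISO 8601 timestamp.

```shell
#!/usr/bin/env bash
# Count journal entries per minute from `journalctl -o short-iso` output on stdin.
# Lines that do not start with an ISO timestamp (boot markers, "-- No entries --",
# continuation lines of multi-line messages) are ignored.
entries_per_minute() {
    awk '/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T/ { print substr($1, 1, 16) }' \
        | sort | uniq -c
}

# Usage:
#   journalctl -u myapp.service -g "metrics" --since "1 hour ago" -o short-iso \
#       | entries_per_minute
```

A near-constant count per minute that matches the scrape schedule (for example, about four entries per minute with a 15-second interval) points toward the scraper; irregular bursts suggest another cause.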

Application-Level Logging Adjustments

Often, the root cause lies within the application being scraped. If an application is logging excessively, journald will simply relay that volume. Adjusting the application’s logging level is usually the most effective solution.

Example: Nginx Access Log Verbosity

By default Nginx writes access logs to files under /var/log/nginx/, so they reach the journal only when access_log is directed to syslog. If Nginx access logs are contributing significantly, especially if detailed request information is being logged unnecessarily, consider simplifying the log format or disabling access logging for endpoints that are scraped heavily and don’t require detailed tracking.

In /etc/nginx/nginx.conf or within a site-specific configuration, the log_format directive controls this. A very verbose format might include:

# "combined" is predefined by Nginx and cannot be redefined, so use a custom name.
log_format verbose '$remote_addr - $remote_user [$time_local] "$request" '
                   '$status $body_bytes_sent "$http_referer" '
                   '"$http_user_agent" "$http_x_forwarded_for" '
                   '"$http_cookie"';

If this level of detail is not required for scraped endpoints, consider a simpler format or conditional logging.
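Two common ways to do this are sketched below as a configuration fragment. The /metrics path and the Prometheus user-agent match are assumptions about the scraped endpoint and the scraper; adjust both to your setup.

```nginx
# Option 1 (server context): no access logging for the scraped endpoint.
location = /metrics {
    access_log off;
    # ... existing proxy_pass or handler directives ...
}

# Option 2 (http context): skip entries from a known scraper user agent.
map $http_user_agent $loggable {
    default        1;
    ~*prometheus   0;
}
access_log /var/log/nginx/access.log combined if=$loggable;
```

Option 1 is simpler when the scraped path is dedicated to metrics. Option 2 keeps logging for other clients that hit the same path.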

Example: Custom Application Logging

For custom applications, review their logging configuration. Many frameworks (e.g., Python’s logging module, PHP’s Monolog) allow dynamic adjustment of log levels. If your application logs to stdout/stderr and is captured by journald, you might need to:

Modify the application’s configuration to reduce log verbosity (e.g., from DEBUG to INFO or WARNING).
If the application supports it, configure it to log to a file and manage that file’s rotation and size independently, rather than relying solely on journald.

For a hypothetical PHP application using Monolog, you might adjust the handler’s level:

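<?php
require __DIR__ . '/vendor/autoload.php'; // Composer autoloader (path is an assumption)

use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$log = new Logger('my_app');
// The second argument is the minimum level this handler records.
// Raising it (e.g., from Logger::DEBUG to Logger::INFO or Logger::WARNING) reduces volume.
$log->pushHandler(new StreamHandler('/var/log/my_app/app.log', Logger::INFO));

// ... application logic ...
?>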

Ensure that the application’s log directory (e.g., /var/log/my_app/) is properly configured with appropriate permissions and rotation policies (e.g., using logrotate).
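A rotation policy for that directory might look like the following logrotate fragment, saved for example as /etc/logrotate.d/my_app. The file name, the retention values, and the apache user and group are assumptions; use the account your application actually runs as.

```
/var/log/my_app/*.log {
    # Rotate daily, or earlier if a file exceeds maxsize when logrotate runs.
    daily
    maxsize 100M
    # Keep one week of rotated files, compressing all but the newest.
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    # Recreate the log with these permissions (user and group are assumptions).
    create 0640 apache apache
}
```

Note that maxsize is only evaluated when logrotate itself runs, so a burst between runs can still exceed it.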

Proactive Monitoring and Alerting

To prevent future disk space exhaustion, implement proactive monitoring of journald’s disk usage and overall system disk utilization.

Monitoring Journald Disk Usage

Use a monitoring agent (e.g., Prometheus Node Exporter, Datadog Agent) to scrape journalctl --disk-usage or monitor the size of /var/log/journal/. A simple script for Prometheus Node Exporter’s textfile collector could look like this:

#!/bin/bash
set -euo pipefail

# Path to the persistent journal directory
JOURNAL_DIR="/var/log/journal"

echo "# HELP journald_disk_usage_bytes Size of the persistent journal directory in bytes."
echo "# TYPE journald_disk_usage_bytes gauge"

# Report 0 if the journal is not persistent (directory missing)
if [ ! -d "$JOURNAL_DIR" ]; then
    echo "journald_disk_usage_bytes 0"
    exit 0
fi

# du -sb prints the apparent size of the directory tree in bytes.
# journalctl --disk-usage reports a human-readable figure that would need parsing.
DISK_USAGE=$(du -sb "$JOURNAL_DIR" | awk '{print $1}')

echo "journald_disk_usage_bytes $DISK_USAGE"

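Save this script to a file such as /usr/local/bin/journald_disk_usage.sh and ensure it’s executable. Node Exporter’s textfile collector does not run scripts; it reads *.prom files from the directory passed via --collector.textfile.directory. Run the script periodically (for example from cron or a systemd timer) and write its output to a .prom file in that directory.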
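Because the textfile collector reads *.prom files rather than executing scripts, the script's output has to land in such a file, ideally without Node Exporter ever seeing a half-written version. The sketch below shows one way to do that; the function name write_prom_atomically and the paths in the usage comment are assumptions.

```shell
#!/usr/bin/env bash
# Write stdin to a .prom file atomically so a reader never sees partial content.
# Argument: final path of the .prom file.
write_prom_atomically() {
    local target="$1" tmp
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if cat > "$tmp" && chmod 0644 "$tmp"; then
        mv -f "$tmp" "$target"    # rename within one directory replaces the file in one step
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage inside a wrapper run by cron or a systemd timer (paths are assumptions):
#   /usr/local/bin/journald_disk_usage.sh \
#       | write_prom_atomically /var/lib/node_exporter/textfile/journald.prom
```

The temporary file is created next to the target so the final rename stays on the same file system.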

Alerting on High Log Volume

Beyond disk space, consider alerting on unusually high log rates. This can be achieved by:

Using a log aggregation system (e.g., ELK stack, Splunk, Grafana Loki) to monitor log throughput per service or per host.
Setting up alerts in your monitoring system for specific error messages or high volumes of logs from critical services.

For instance, in Grafana Loki, you could create a metric that counts log lines per minute for a specific application and alert if it exceeds a threshold.
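A Loki ruler rule along these lines could express that alert. Treat it as a sketch: the job and unit labels depend on how your log shipper relabels journal fields, and the threshold of 5000 lines per minute, the rule names, and the severity label are placeholders.

```yaml
groups:
  - name: log-volume
    rules:
      - alert: HighJournalLogRate
        # Lines per minute for one unit; label names and threshold are assumptions.
        expr: |
          sum by (unit) (
            count_over_time({job="systemd-journal", unit="myapp.service"}[1m])
          ) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.unit }} is logging more than 5000 lines per minute"
```

The for clause delays the alert until the rate has stayed above the threshold for five minutes, which filters out short bursts.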

Conclusion

Troubleshooting journald log spikes under heavy network load requires a multi-faceted approach. By systematically identifying log sources, understanding and configuring journald.conf, analyzing the impact of scraping tools, and adjusting application-level logging, you can effectively manage disk space and maintain system stability. Proactive monitoring and alerting are key to preventing recurrence.