Troubleshooting High Load Average and I/O Wait Spikes on Rocky Linux 9: Tuning ext4 and XFS Mount Parameters

Identifying the Root Cause: Load Average vs. I/O Wait

High load average on a Linux system, including Rocky Linux 9, is a symptom, not the disease. On Linux, the load average counts processes that are running or waiting to run, plus processes in uninterruptible sleep (state D), which are usually blocked on disk I/O. A load average consistently higher than the number of CPU cores suggests a bottleneck, but it does not say whether the CPU or the storage is saturated, so distinguishing CPU-bound from I/O-bound load is critical for effective troubleshooting. High I/O wait (wa in `top`, %iowait in `iostat`) is the share of time the CPUs sat idle while at least one I/O request, usually disk I/O, was outstanding. This guide focuses on tuning filesystem mount options for ext4 and XFS to mitigate these I/O-related spikes.
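As a quick first check, the comparison between load average and core count can be scripted. The sketch below is a minimal helper, assuming the standard Linux /proc/loadavg layout; the function name load_per_core is hypothetical, and the optional arguments exist only so the function can be pointed at saved data.

```shell
# load_per_core [LOADAVG_FILE] [CORES]
# Prints the 1-minute load average divided by the number of CPU cores.
# Defaults: /proc/loadavg and the core count reported by nproc.
load_per_core() {
    loadavg_file="${1:-/proc/loadavg}"
    cores="${2:-$(nproc)}"
    # Field 1 of /proc/loadavg is the 1-minute load average
    awk -v cores="$cores" '{ printf "%.2f\n", $1 / cores }' "$loadavg_file"
}
```

A result that stays above 1.00 means there are more runnable or I/O-blocked tasks than cores; it does not yet say which of the two is responsible.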

Initial Diagnostics: Gathering System Metrics

Before altering any configurations, establish a baseline and pinpoint the problematic I/O patterns. The following commands provide essential insights:

System-wide Performance Overview

The top command offers a real-time snapshot. Pay close attention to the load average (load average: X.XX, Y.YY, Z.ZZ) and the wa value in the %Cpu(s) line.

top - 10:30:00 up 10 days, 2:15,  1 user,  load average: 1.50, 1.60, 1.70
Tasks: 250 total,   1 running, 249 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.0 us,  2.0 sy,  0.0 ni, 92.0 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  16000.0 total,   8000.0 free,   4000.0 used,   4000.0 buff/cache
MiB Swap:   2000.0 total,   2000.0 free,      0.0 used.  11000.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1234 mysql     20   0  500000  10000   5000 S   0.5   0.1  12:34:56 mysqld
   5678 nginx     20   0   20000   5000   2000 S   0.2   0.0   5:01:23 nginx
   9012 appuser   20   0   10000   3000   1000 R   0.1   0.0   0:00:01 script.sh
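For scripted checks, the wa figure can be pulled out of top's batch output. The following is a minimal sketch that assumes the summary-line format shown above; extract_iowait is a hypothetical helper name.

```shell
# extract_iowait: read top output on stdin and print the wa value
# from the %Cpu(s) summary line (the percentage of time spent in I/O wait).
extract_iowait() {
    awk -F',' '/^%Cpu/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ / wa$/) {
                value = $i
                sub(/ wa$/, "", value)      # drop the trailing label
                gsub(/^[ \t]+/, "", value)  # drop leading padding
                print value
            }
        }
    }'
}

# Usage (one batch iteration of top):
#   top -bn1 | extract_iowait
```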

The iostat command provides more granular disk I/O statistics. Use it with a short interval to capture spikes.

iostat -dx 5
Linux 5.14.0-284.11.1.el9_2.x86_64 (rocky9-prod-web) 	04/15/2024 	_x86_64_	(8 CPU)

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm   r_await w_await aqu-sz  await  %util
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00   0.00   0.00
sdb              0.00   10.00      0.00     40.00     0.00     0.00   0.00   0.00    0.00    5.00   0.00   5.00  10.00
sdc              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00   0.00   0.00
...

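Focus on %util (the percentage of time the device had at least one request in flight), await (average time per request in milliseconds, including time spent queued; current sysstat releases report it separately as r_await and w_await), and aqu-sz (average queue length). High values here, especially when correlated with load average spikes, indicate disk I/O as the primary bottleneck. On SSDs, NVMe devices, and RAID arrays that serve requests in parallel, %util can sit near 100% without the device being saturated, so weigh it against the await and aqu-sz values.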
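To catch spikes without watching the terminal, the iostat output can be filtered for busy devices. This is a minimal sketch: flag_busy_devices is a hypothetical helper, and it locates the %util column from the header line rather than assuming a fixed position, because the column layout differs between sysstat versions.

```shell
# flag_busy_devices [THRESHOLD]
# Reads "iostat -dx" output on stdin and prints the name and %util of every
# device whose %util exceeds THRESHOLD (default: 80).
flag_busy_devices() {
    threshold="${1:-80}"
    awk -v limit="$threshold" '
        # Header line: remember which column holds %util
        $1 == "Device" || $1 == "Device:" {
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # Data line: compare the %util column numerically
        col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
    '
}

# Usage (three reports, five seconds apart):
#   iostat -dx 5 3 | flag_busy_devices 80
```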

Tuning Filesystem Mount Options: ext4

Rocky Linux 9 installs XFS by default, but ext4 remains common on data volumes and migrated systems. For ext4, several mount options can influence I/O behavior. The most impactful for reducing I/O wait are related to journaling and writeback behavior.

Understanding Key ext4 Mount Options

data=writeback: This is the least strict journaling mode. Metadata is journaled, but data blocks are written to the filesystem with no ordering relative to the metadata commit. This reduces I/O overhead, but after a crash, recently written files can contain stale or garbage data even though the filesystem metadata is consistent. It suits performance-critical workloads where data integrity is managed at the application level or where the data can be regenerated.
data=ordered (default): Metadata is journaled, and data blocks are written to the filesystem before the corresponding metadata is committed to the journal. This provides a good balance between performance and data integrity.
data=journal: Both data and metadata are journaled, so data is written twice. This offers the highest level of data integrity but incurs the most I/O overhead.
commit=seconds: This option sets how often ext4 commits its journal, syncing data and metadata to disk. The default is 5 seconds. Increasing this value (e.g., to 30 or 60) reduces the frequency of commits and lets more writes be merged, which can smooth out frequent small I/O spikes, although each commit then flushes more data at once. It also increases the window of potential data loss in case of a crash.
nobarrier: Write barriers make ext4 issue cache-flush commands so that journal commits reach stable storage in the right order. On storage with a non-volatile write cache, such as a hardware RAID controller with a battery-backed cache or an SSD with power-loss protection, those flushes are redundant and disabling them can improve performance. Use this with extreme caution: on a device with a volatile write cache, a power failure with nobarrier set can lead to severe filesystem corruption.

Applying ext4 Mount Options

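To test these options without a reboot, you can remount the filesystem, with one restriction: ext4 does not allow the data= journaling mode to be changed on a remount, so only options such as commit can be changed in place. Switching to data=writeback requires unmounting the filesystem and mounting it again (for the root filesystem, the mode has to be passed at boot, for example with rootflags=data=writeback on the kernel command line). For example, to test both on a filesystem mounted at /data:

# First, check current mount options
mount | grep /data

# Example output:
# /dev/sdb1 on /data type ext4 (rw,relatime,data=ordered)

# commit can be changed on a live filesystem with a remount
sudo mount -o remount,commit=30 /data

# data= cannot be changed by a remount: stop the services using /data,
# then unmount and mount again (replace /dev/sdb1 with your actual device)
sudo umount /data
sudo mount -o data=writeback,commit=30 /dev/sdb1 /data

# Verify the change
mount | grep /data
# Expected output:
# /dev/sdb1 on /data type ext4 (rw,relatime,data=writeback,commit=30)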
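When the verification step needs to run from a script rather than by eye, the option list can be checked programmatically. The helper below is a sketch, not part of any standard tool: has_mount_option is a hypothetical name, and the optional third argument exists only so the function can be run against a saved copy of the mount table.

```shell
# has_mount_option MOUNTPOINT OPTION [MOUNTS_FILE]
# Succeeds (exit status 0) if OPTION appears in the option list of MOUNTPOINT.
# MOUNTS_FILE defaults to /proc/mounts, whose 2nd field is the mount point
# and whose 4th field is the comma-separated option list.
has_mount_option() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${3:-/proc/mounts}"
}

# Usage:
#   has_mount_option /data commit=30 && echo "commit interval applied"
```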

To make these changes permanent, edit the /etc/fstab file. Add or modify the options for the relevant filesystem entry. For example:

# /etc/fstab
UUID=...  /  ext4  defaults,errors=remount-ro  0 1
UUID=...  /data  ext4  rw,relatime,data=writeback,commit=30  0 2
UUID=...  /var/log  ext4  rw,relatime,data=ordered,commit=60  0 2

Caution: Using data=writeback or a very high commit value increases the risk of data loss or corruption during unexpected shutdowns. Assess your application’s tolerance for this risk. For critical data, stick to data=ordered or consider XFS.
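Because a mistyped /etc/fstab entry can stop a system from booting cleanly, it is worth reading the edited entry back before the next reboot. The following sketch prints the options field for one mount point; fstab_options is a hypothetical helper name, and the second argument is there only to allow testing against a copy of the file.

```shell
# fstab_options MOUNTPOINT [FSTAB_FILE]
# Prints the options field (4th column) of the fstab entry for MOUNTPOINT,
# ignoring comment lines. FSTAB_FILE defaults to /etc/fstab.
fstab_options() {
    awk -v mp="$1" '$1 !~ /^#/ && $2 == mp { print $4 }' "${2:-/etc/fstab}"
}

# After editing, util-linux can also sanity-check the whole table:
#   sudo findmnt --verify
```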

Tuning Filesystem Mount Options: XFS

XFS is known for its high performance, especially with large files and concurrent I/O. It journals metadata only, so it has no equivalent of ext4’s data= modes, and it offers different tuning parameters.

Understanding Key XFS Mount Options

logbufs=N: Specifies the number of in-memory log buffers. Valid values range from 2 to 8 and the default is 8, so on most systems this option is already at its maximum; lowering it saves a little memory at the cost of metadata write throughput.
logbsize=N: Sets the size of each in-memory log buffer. Valid sizes are 16k and 32k, plus 64k, 128k, and 256k on version 2 logs; the value must be a multiple of the log stripe unit chosen at mkfs time. The default is 32k, or the log stripe unit if that is larger. Raising it (e.g., to 256k) lets more metadata changes be batched into each log write, which can help metadata-heavy write workloads.
noatime / relatime: These options control the update of file access times. noatime disables access time updates entirely, while relatime (the default) updates the access time only if it is older than the file's modification or change time, or more than 24 hours old. Using noatime can reduce small writes, especially on busy systems with many reads.
swalloc: A mount option (it takes no value) that rounds data allocations up to stripe-width boundaries when a file larger than the stripe width is being extended. It is relevant only on striped storage such as RAID, where the stripe geometry was set at filesystem creation (mkfs.xfs).
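Since xfs(5) documents a fixed set of accepted values for the log buffer options (logbufs from 2 to 8; logbsize of 16k, 32k, 64k, 128k, or 256k), a candidate pair can be checked before it goes into a mount command or /etc/fstab. This is a sketch with a hypothetical function name, valid_xfs_log_opts; it does not check the additional rule that logbsize must be a multiple of the log stripe unit.

```shell
# valid_xfs_log_opts LOGBUFS LOGBSIZE
# Succeeds if both values fall in the ranges documented in xfs(5):
# logbufs 2-8, logbsize one of 16k/32k/64k/128k/256k
# (64k and larger require a version 2 log).
valid_xfs_log_opts() {
    case "$1" in
        [2-8]) ;;
        *) echo "logbufs must be between 2 and 8: $1" >&2; return 1 ;;
    esac
    case "$2" in
        16k|32k|64k|128k|256k) ;;
        *) echo "logbsize must be 16k, 32k, 64k, 128k or 256k: $2" >&2; return 1 ;;
    esac
}

# Usage:
#   valid_xfs_log_opts 8 256k && echo "values look acceptable"
```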

Applying XFS Mount Options

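As with ext4, you can test options on an XFS filesystem without a reboot. noatime can be changed by a remount, but the XFS log buffer options are applied only when the filesystem is mounted, so changing them means unmounting and mounting again. Note also that logbufs cannot exceed 8. For example, to enlarge the log buffers and disable access time updates on a filesystem mounted at /data:

# Check current mount options
mount | grep /data

# Example output:
# /dev/sdc1 on /data type xfs (rw,relatime,attr2,inode64,noquota)

# noatime alone can be applied to a live filesystem with a remount
sudo mount -o remount,noatime /data

# Log buffer options take effect only on a fresh mount: stop the services
# using /data first (replace /dev/sdc1 with your actual device)
sudo umount /data
sudo mount -o noatime,logbufs=8,logbsize=256k /dev/sdc1 /data

# Verify the change
mount | grep /data
# Expected output:
# /dev/sdc1 on /data type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)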

To make these changes permanent, edit /etc/fstab. For XFS, geometry-level settings such as the log size and stripe unit are fixed at filesystem creation by mkfs.xfs, but mount options like noatime, logbufs, and logbsize are set in fstab.

# /etc/fstab
UUID=...  /  xfs  defaults,inode64,noquota  0 1
UUID=...  /data  xfs  rw,noatime,attr2,inode64,noquota,logbufs=8,logbsize=256k  0 2

Note: logbufs and logbsize are mount options, not mkfs.xfs options. What is fixed at filesystem creation time is the size and stripe unit of the on-disk log (mkfs.xfs -l), and the log stripe unit constrains which logbsize values are usable. A log that is too small for a write-heavy workload cannot be fixed by mount options alone. For existing filesystems, consider the impact of noatime on applications that rely on access times.

Advanced Considerations and Further Tuning

Beyond mount options, several other factors contribute to I/O performance and can be tuned:

I/O Scheduler Tuning

The I/O scheduler determines the order in which I/O requests are sent to the storage device. For SSDs and NVMe devices, the none or mq-deadline schedulers are often recommended; kyber is another option designed for fast devices. For HDDs, mq-deadline or bfq is usually the better choice. You can check and set the scheduler for a device:

# Check current scheduler for a device (e.g., sdb)
cat /sys/block/sdb/queue/scheduler

# Example output:
# mq-deadline [bfq] none

# Set scheduler (e.g., to none); the write must happen as root, so pipe through sudo tee
echo none | sudo tee /sys/block/sdb/queue/scheduler

# To make this persistent across reboots, use udev rules.
# Create a file like /etc/udev/rules.d/60-ioschedulers.rules

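# /etc/udev/rules.d/60-ioschedulers.rules
# Only non-rotational sd devices (SSDs) and NVMe namespaces are switched to none; HDDs keep their default
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
# Apply without rebooting: sudo udevadm control --reload-rules && sudo udevadm trigger --subsystem-match=block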
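To confirm which scheduler each device ended up with, the bracketed entry in the sysfs file can be extracted. This is a small sketch; active_scheduler is a hypothetical helper name.

```shell
# active_scheduler: read a queue/scheduler line on stdin and print the
# scheduler shown in square brackets, which is the one currently in use.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Usage: report the active scheduler of every block device
#   for f in /sys/block/*/queue/scheduler; do
#       printf '%s: %s\n' "$f" "$(active_scheduler < "$f")"
#   done
```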

Swappiness and Memory Management

Excessive swapping to disk can cause severe I/O wait. While not directly a filesystem tuning parameter, reducing the kernel’s tendency to swap can alleviate I/O pressure.

# Check current swappiness
cat /proc/sys/vm/swappiness

# Set swappiness temporarily (e.g., to 10)
sudo sysctl vm.swappiness=10

# Make permanent by editing /etc/sysctl.conf or a file in /etc/sysctl.d/
# Example: /etc/sysctl.d/99-swappiness.conf
# vm.swappiness = 10
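Before lowering swappiness, it helps to confirm that swapping is actually happening during the spikes. The sketch below reads the kernel's cumulative swap counters; swap_counters is a hypothetical helper name, and the optional argument is there only so it can be run against a saved copy of /proc/vmstat.

```shell
# swap_counters [VMSTAT_FILE]
# Prints the cumulative number of pages swapped in and out since boot,
# read from the pswpin and pswpout counters (default file: /proc/vmstat).
swap_counters() {
    awk '$1 == "pswpin"  { in_pages = $2 }
         $1 == "pswpout" { out_pages = $2 }
         END { printf "swapped_in=%d swapped_out=%d\n", in_pages, out_pages }' \
        "${1:-/proc/vmstat}"
}
```

Running it twice, a few seconds apart during a spike, shows whether the counters are moving; if they are not, swap is unlikely to be the source of the I/O wait.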

Application-Level Tuning

Ultimately, the most effective tuning often occurs at the application level. Database tuning (e.g., buffer pool sizes, query optimization), web server configuration (e.g., caching, connection limits), and application code can significantly reduce I/O load. For instance, a database application might benefit from larger buffer pools to keep frequently accessed data in RAM, reducing disk reads.
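To decide which application to look at first, it helps to see which processes are blocked during a spike. Processes in uninterruptible sleep (state D in ps) are typically waiting on disk I/O. The filter below is a minimal sketch with a hypothetical name, blocked_processes, and it assumes the three-column ps format given in its usage line.

```shell
# blocked_processes: read "ps -eo state,pid,comm" output on stdin and print
# the PID and command of every process in uninterruptible sleep (state D).
blocked_processes() {
    # NR > 1 skips the ps header line
    awk 'NR > 1 && $1 ~ /^D/ { print $2, $3 }'
}

# Usage:
#   ps -eo state,pid,comm | blocked_processes
```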

Conclusion and Best Practices

Troubleshooting high load average and I/O wait spikes on Rocky Linux 9 requires a systematic approach. Start with diagnostics to confirm I/O as the bottleneck. Then, carefully evaluate filesystem mount options for ext4 and XFS, understanding the trade-offs between performance and data integrity. For ext4, data=writeback and adjusted commit intervals are powerful but risky. For XFS, tuning log buffers and using noatime can yield benefits. Always test changes in a staging environment before applying them to production. Remember to consider I/O scheduler settings and application-level optimizations for a holistic performance tuning strategy.