Upgrading Rocky Linux 9 kernel parameters to enable active NVMe write caching for database speedups

Understanding NVMe Write Caching on Linux

Modern NVMe SSDs offer significant performance advantages, but by default, Linux kernel parameters might not fully leverage their write caching capabilities, especially in high-I/O workloads like database operations. This can lead to suboptimal write throughput and increased latency. The primary mechanism we’ll focus on is the drive’s volatile write cache together with the kernel’s matching `write back` cache mode, exposed as the `queue/write_cache` sysfs attribute. With the cache active, writes are acknowledged to the application before they are physically committed to the NAND flash. This significantly reduces perceived write latency but introduces a risk of data loss in case of sudden power failure.

Identifying NVMe Devices and Current Settings

Before making any changes, it’s crucial to identify your NVMe devices and inspect their current kernel-level settings. We’ll use `lsblk` to list block devices and `nvme-cli` for detailed NVMe information. If `nvme-cli` is not installed, it can typically be found in your distribution’s repositories (e.g., `dnf install nvme-cli` on Rocky Linux).

First, list all block devices to identify your NVMe drives. They are usually named `/dev/nvmeXnY`.

lsblk -o NAME,SIZE,MODEL,TYPE,MOUNTPOINT

Next, use `nvme-cli` to query the controller and namespace information. This will give us the device path (e.g., `/dev/nvme0n1`).

nvme list

To check the current write cache settings, we can query the device’s features. The relevant NVMe command is `get-feature` for Feature ID `0x06` (Volatile Write Cache). Feature ID `0x02` is Power Management and does not report cache state.

nvme get-feature /dev/nvme0n1 --feature-id=0x06 --human-readable

The output will indicate the current `WCE` (Write Cache Enable) status in bit 0 of the reported value: `1` means the volatile write cache is enabled, and `0` means it’s disabled. We are aiming to ensure this is enabled and, just as importantly, that the kernel’s view of the device (the `queue/write_cache` sysfs attribute) agrees with it.
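To see how the kernel currently classifies each drive’s cache, the sysfs attribute can be read for every NVMe namespace at once. The following is a minimal sketch, assuming a POSIX-style shell with `local` support (bash, dash); the function name `report_write_cache` and its optional root-directory argument are hypothetical conveniences, not part of any tool.

```shell
#!/usr/bin/env bash
# Print the kernel's write cache view for every NVMe namespace.
# The optional first argument overrides the sysfs root (default: /sys);
# it exists only so the function can be tried against a scratch directory.
report_write_cache() {
    local root="${1:-/sys}"
    local attr dev found=0
    for attr in "$root"/block/nvme*n*/queue/write_cache; do
        # Skip the unexpanded glob when no NVMe namespaces exist.
        [ -r "$attr" ] || continue
        dev="${attr#"$root"/block/}"   # strip the leading path
        dev="${dev%%/*}"                # keep only the device name
        printf '%s: %s\n' "$dev" "$(cat "$attr")"
        found=1
    done
    [ "$found" -eq 1 ] || echo "no NVMe namespaces found under $root/block" >&2
}
```

Calling `report_write_cache` with no arguments reads the live `/sys` tree and prints one line per namespace, such as `nvme0n1: write back` or `nvme0n1: write through`.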

Configuring Kernel Parameters for NVMe Write Caching

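The kernel setting that governs how a block device’s write cache is treated, NVMe included, is the per-device `queue/write_cache` sysfs attribute rather than the I/O scheduler. It takes two values: `write back` (the device has a volatile cache, so the kernel issues flush commands when data must be durable) and `write through` (no volatile cache is assumed). We want to ensure the device is treated as capable of `write back` caching.

There is no kernel boot parameter for this on NVMe; `libata.force`, for instance, applies only to ATA devices. The kernel’s default behavior is to follow the cache information the controller reports, so the attribute normally already reflects whether the drive has a volatile write cache. Where it does not, we can explicitly set the `write_cache` sysfs attribute. Writing to it changes the kernel’s view of the device, not the drive’s own cache setting, and device-specific `udev` rules make the change persistent.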

Leveraging `udev` for Persistent NVMe Settings

The most robust and recommended method for applying persistent kernel-level settings to specific devices is through `udev` rules. This ensures that the desired configuration is applied automatically on boot and whenever the device is detected.

We will create a new `udev` rule file. Let’s assume our NVMe device is `/dev/nvme0n1`. We need to identify its unique attributes, such as the vendor and model, or its subsystem device path. Using `udevadm info -a -p $(udevadm info -q path -n /dev/nvme0n1)` can help find these attributes.

Create a new `udev` rule file, for example, `/etc/udev/rules.d/99-nvme-write-cache.rules`. The content of this file will depend on how you want to identify the device. A common approach is to use the subsystem and device name.

ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", SUBSYSTEM=="block", ATTR{queue/write_cache}=="write through", ATTR{queue/write_cache}="write back"

Explanation:

ACTION=="add|change": Applies the rule when the device is added or its state changes.
KERNEL=="nvme[0-9]*n[0-9]*": Matches any NVMe block device (e.g., nvme0n1, nvme1n1).
SUBSYSTEM=="block": Ensures we are targeting block devices.
ATTR{queue/write_cache}=="write through": This is a conditional. It only applies the subsequent action if the current setting is `write through`. This prevents unnecessary writes if it’s already `write back`. Note that the kernel spells both values with a space.
ATTR{queue/write_cache}="write back": This is the core action. It attempts to set the `write_cache` attribute to `write back`.
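If the rule should apply to one drive model rather than to every NVMe namespace, the match can be narrowed with the `model` attribute of the parent controller. The sketch below writes such a rule file. The function name `write_nvme_cache_rule`, the model string used in the usage note, and the assumption that `ATTRS{model}` appears in your `udevadm info -a` output are all illustrative; check the attribute on your own hardware before relying on it.

```shell
#!/usr/bin/env bash
# Write a udev rule that sets the kernel cache view to "write back"
# only for NVMe namespaces whose parent controller reports the given model.
#   $1: model string as shown by `udevadm info -a` (assumed to be present)
#   $2: output path (defaults to the rule file used in this article)
write_nvme_cache_rule() {
    local model="$1"
    local out="${2:-/etc/udev/rules.d/99-nvme-write-cache.rules}"
    cat > "$out" <<EOF
# Treat matching NVMe namespaces as having a volatile (write back) cache.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="nvme[0-9]*n[0-9]*", ATTRS{model}=="$model", ATTR{queue/write_cache}=="write through", ATTR{queue/write_cache}="write back"
EOF
}
```

For example, running `write_nvme_cache_rule "ExampleModel 1TB"` as root would create the rule file under `/etc/udev/rules.d/`; passing a second argument writes it elsewhere for review first.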

After creating or modifying the `udev` rule file, you need to reload the `udev` rules and trigger them for the existing devices.

sudo udevadm control --reload-rules
sudo udevadm trigger

Verify the change by re-checking the `write_cache` sysfs attribute for your NVMe device.

cat /sys/block/nvme0n1/queue/write_cache

The output should now be `write back`.

Alternative: Direct Sysfs Modification (Less Persistent)

While `udev` is preferred for persistence, you can temporarily change the `write_cache` setting directly via sysfs. This is useful for testing but will be reset on reboot.

echo "write back" | sudo tee /sys/block/nvme0n1/queue/write_cache

Again, verify the change:

cat /sys/block/nvme0n1/queue/write_cache
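When experimenting this way, it helps to record the previous value so the setting can be put back without a reboot. A minimal sketch, assuming it is run from a root shell so that writes to sysfs succeed; the function names and the backup file path are hypothetical.

```shell
#!/usr/bin/env bash
# Set an attribute file to a new value, saving the old value first.
#   $1: attribute path, $2: new value, $3: file to store the old value in
set_attr_with_backup() {
    local attr="$1" value="$2" backup="$3"
    cat "$attr" > "$backup" || return 1   # abort if the attribute is unreadable
    printf '%s\n' "$value" > "$attr"
}

# Write the saved value back to the attribute.
#   $1: attribute path, $2: file holding the saved value
restore_attr() {
    local attr="$1" backup="$2"
    cat "$backup" > "$attr"
}
```

A test session might call `set_attr_with_backup /sys/block/nvme0n1/queue/write_cache "write back" /root/write_cache.bak`, run the benchmark, and then call `restore_attr` with the same two paths.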

Impact on Database Performance and Data Safety

Enabling `writeback` caching can dramatically improve database write performance by reducing the latency of individual write operations. Applications, including databases, will see acknowledgments for writes much faster, as the kernel and device controller buffer these writes. This is particularly beneficial for transactional workloads with many small writes.

However, this comes with a significant caveat: data loss risk. If the system loses power unexpectedly (e.g., a power outage, a sudden hardware failure) before the cached writes are flushed to the NAND flash, that data will be lost. NVMe drives with power loss protection (PLP) mechanisms (like capacitors on the drive itself) can mitigate this risk to some extent, but they are not foolproof against all scenarios.

Recommendations for mitigating risk:

Use UPS: Ensure all database servers are connected to a reliable Uninterruptible Power Supply (UPS) with proper shutdown procedures configured.
RAID/Replication: For critical data, ensure you have robust RAID configurations (e.g., RAID 10) or synchronous replication to other database instances. This provides redundancy at a higher level.
Monitor Drive Health: Regularly monitor the health of your NVMe drives using SMART tools and `nvme-cli` to detect potential failures early.
Understand Your Workload: Assess if the performance gains from `writeback` caching outweigh the potential data loss risk for your specific application. For highly sensitive, non-replicated data, `writethrough` might still be a safer choice.
Test Thoroughly: Before deploying to production, conduct extensive performance testing and simulate power loss scenarios (if feasible in a test environment) to understand the actual impact and recovery procedures.

Verifying Performance Improvements

After enabling `writeback` caching, it’s essential to measure the actual performance impact. Tools like `fio` (Flexible I/O Tester) are invaluable for this.

First, establish a baseline with `writeback` disabled (or in its default state). Then, enable `writeback` and re-run the tests.

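# Example fio job file (write_cache_test.fio)
[global]
ioengine=libaio
direct=1
bs=4k
rw=randwrite
numjobs=8
runtime=60
time_based
# Report one combined result for all eight jobs
group_reporting
# Size of the test file; fio needs this when the file does not exist yet
size=1G
# Ensure this path exists and is on your NVMe device
filename=/mnt/nvme0n1/fio_test_file

[rand-write]
stonewall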

Run `fio`:

fio write_cache_test.fio

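Compare the IOPS (Input/Output Operations Per Second) and latency figures between the two configurations. The expected result is higher random write IOPS and lower average write latency with `writeback` caching active, but the size of the change depends on the drive and on how often the workload forces flushes (for example via `fsync`), so treat the numbers from your own hardware as the deciding evidence.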
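To put the two runs side by side, the relative change can be computed from the figures read off each `fio` report. A small sketch using `awk`; the function name `pct_change` is hypothetical, and the numbers in the usage note are placeholders rather than measured results.

```shell
#!/usr/bin/env bash
# Print the percentage change from a baseline value to a new value.
#   $1: baseline (e.g. IOPS before the change), $2: new value
pct_change() {
    awk -v base="$1" -v new="$2" 'BEGIN {
        # Avoid dividing by zero when no baseline was recorded.
        if (base == 0) { print "n/a"; exit }
        printf "%+.1f%%\n", (new - base) / base * 100
    }'
}
```

With placeholder inputs, `pct_change 40000 50000` prints `+25.0%`, and `pct_change 120 90` prints `-25.0%`; a positive number is an improvement for IOPS, while a negative number is an improvement for latency.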

Conclusion and Best Practices

Enabling NVMe `writeback` caching via `udev` rules is a powerful technique for boosting database write performance on Rocky Linux 9. However, it’s a trade-off that requires careful consideration of data safety. Always implement this change in conjunction with robust power protection and data redundancy strategies. Thorough testing and monitoring are paramount to ensuring both performance gains and data integrity in a production environment.
