How to Disable ext4 Write Barriers with External Journal

ext4, filesystems, mdadm, nvme, software-raid

I'm currently experimenting with different ways of using fast NVMe devices to improve write speeds to a fairly large software RAID (mdadm) array of rotating disks on Debian.

I found that using a mirrored pair of such devices (raid1) to store the filesystem's journal yields interesting performance benefits. The mount options I am using to achieve this are noatime,journal_async_commit,data=journal.
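One way to double-check that these options actually took effect is to read them back from /proc/mounts. The following is only a minimal sketch: the device /dev/md0, the mount point /mnt/array and both helper names are hypothetical, and the sample line stands in for what a real system would report.

```python
def mount_options(mounts_line):
    """Return the option field of one /proc/mounts line as a dict."""
    # Fields are: device, mount point, fs type, options, dump, pass.
    fields = mounts_line.split()
    opts = {}
    for item in fields[3].split(","):
        key, _, value = item.partition("=")
        # Flag-style options (e.g. noatime) carry no value; store True.
        opts[key] = value or True
    return opts


def missing_options(mounts_line, wanted):
    """List the wanted options that the mounts line does not show."""
    opts = mount_options(mounts_line)
    missing = []
    for item in wanted:
        key, _, value = item.partition("=")
        if opts.get(key) != (value or True):
            missing.append(item)
    return missing


WANTED = ["noatime", "journal_async_commit", "data=journal"]

# Hypothetical sample line; on a live system, read it from /proc/mounts.
sample = "/dev/md0 /mnt/array ext4 rw,noatime,journal_async_commit,data=journal 0 0"
print(missing_options(sample, WANTED))  # []
```

An empty list means every requested option shows up in the kernel's view of the mount; anything listed was either rejected or silently dropped at mount time.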

In my tests, I've also discovered that adding the barrier=0 option offers significant benefits in terms of write performance. However, I'm not certain that this option is safe to use in my particular filesystem configuration. This is what the kernel documentation says about ext4 write barriers:

Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use, at some performance penalty. If your disks are battery-backed in one way or another, disabling barriers may safely improve performance.

The specific NVMe device I'm using is an Intel DC P3700, which has built-in power-loss protection. In the event of an unexpected shutdown, any data still present in temporary buffers is safely committed to NAND storage thanks to reserve energy storage.

So my question is, can I safely disable ext4 write barriers if the journal is stored on a device with battery-backed cache, while the rest of the filesystem itself sits on disks which don't have this feature?

Best Answer

I'm writing a new answer because after further analysis, I don't think the previous answer is correct.

If we look at the write_dirty_buffer function, it issues a write request with the REQ_SYNC flag, but that does not cause a cache flush (barrier) to be issued. The flush is done by the blkdev_issue_flush call, which is gated by a check of the JBD2_BARRIER flag, and that flag is only set when the filesystem is mounted with barriers enabled.
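The gating described above can be pictured with a toy model. This is not kernel code: checkpoint_cleanup is a hypothetical name, the flag value is illustrative, and the "requests" are just strings recorded in a list.

```python
JBD2_BARRIER = 0x020  # illustrative flag value for this sketch only


def checkpoint_cleanup(journal_flags, issued):
    """Record the requests a simplified checkpoint sends to the data device."""
    # Buffers go out with a synchronous hint. On its own this does not
    # force the drive to empty its volatile write cache.
    issued.append("write(REQ_SYNC)")
    # The explicit cache flush is only sent when barriers are enabled.
    if journal_flags & JBD2_BARRIER:
        issued.append("flush")
    return issued


print(checkpoint_cleanup(JBD2_BARRIER, []))  # ['write(REQ_SYNC)', 'flush']
print(checkpoint_cleanup(0, []))             # ['write(REQ_SYNC)']
```

With the flag cleared (as with barrier=0), the model issues the same writes but never the flush, which is the behaviour the paragraph above attributes to the real code.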

So if we look back at checkpoint.c, barriers only matter when a transaction is dropped from the journal. The comments in the code are informative here, telling us that this write barrier is unlikely to be necessary, but is there anyway as a safeguard. I think the assumption here is that by the time a transaction is dropped from the journal, the data itself is unlikely to be still lingering in the drive's cache, and not yet committed to permanent storage. But since it's only an assumption, the write barrier is issued anyway.

So why aren't barriers used when writing data to the main filesystem? I think the key here is that as long as the journal is coherent, metadata that's missing from the filesystem (e.g. because it was lost in a power-loss event) is normally recovered during the journal replay, thus avoiding filesystem corruption. Furthermore, the use of data=journal should also guarantee consistency of actual filesystem data because, as I understand it, the recovery process will also write out data blocks that were committed to the journal as part of its replay mechanism.
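To make the replay argument concrete, here is a deliberately simplified model of recovery. It assumes a journal is just a list of transactions with a committed marker and a block map; the real on-disk format and replay logic are far more involved, and every name here is hypothetical.

```python
def replay(journal, disk):
    """Copy the blocks of every committed transaction from journal to disk."""
    for txn in journal:
        # Uncommitted (partially written) transactions are ignored.
        if txn["committed"]:
            disk.update(txn["blocks"])
    return disk


# Block 7 was rewritten before the crash, but the new contents never left
# the data drive's volatile cache, so the on-disk copy is stale.
disk_after_crash = {7: "old", 8: "old"}

journal = [
    {"committed": True, "blocks": {7: "new"}},
    {"committed": False, "blocks": {8: "half-written"}},
]

print(replay(journal, dict(disk_after_crash)))  # {7: 'new', 8: 'old'}

# If the transaction had already been dropped from the journal,
# replay has nothing to restore and block 7 stays stale.
print(replay([], dict(disk_after_crash)))       # {7: 'old', 8: 'old'}
```

The second call is the case that matters for this question: once a transaction has left the journal, only the data drive's own cache flush stands between the write and a power loss.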

So while ext4 does not actually flush disk caches at the end of a checkpoint, some steps should be taken to maximize recoverability in case of a power-loss:

  1. The filesystem should be mounted with data=journal, and not data=writeback (data=ordered is not an option here, because ext4 refuses to mount with journal_async_commit in data=ordered mode). This one should be obvious: we want a copy of all incoming data blocks inside the journal, since those are the ones likely to be lost in a power-loss event. The extra journal writes cost little, because they land on the fast NVMe mirror rather than on the rotating disks.

  2. The maximum journal size of 102400 blocks (400MB when using 4K filesystem blocks) should be used, so as to maximize the amount of data that's recoverable in a journal replay. This shouldn't be an issue, since NVMe devices are typically far larger than that.

  3. Problems may still arise in case an unexpected shutdown happens during a write-intensive operation. If transactions get dropped from the journal device faster than the data drives are able to flush their caches on their own, unrecoverable data loss or filesystem corruption could occur.
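The numbers in points 2 and 3 can be put side by side with a little arithmetic. With data=journal every data block passes through the journal, so at a sustained write rate the journal's contents turn over in roughly size divided by rate. The 200 MiB/s rate below is a hypothetical figure chosen only for illustration, and the result is a rough estimate: the journal likely starts releasing space before it is completely full, so the real window may be shorter.

```python
MiB = 1024 * 1024


def journal_bytes(blocks=102400, block_size=4096):
    """Journal size in bytes for a given block count and block size."""
    return blocks * block_size


def turnover_seconds(journal_size_bytes, write_rate_bytes_per_s):
    """Rough time for sustained writes to cycle through the whole journal."""
    return journal_size_bytes / write_rate_bytes_per_s


size = journal_bytes()
print(size // MiB)                        # 400
print(turnover_seconds(size, 200 * MiB))  # 2.0
```

Under these assumptions, a write has about two seconds of journal protection before its transaction may be dropped, and the data disks would need to have flushed it from their caches on their own within that time.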

So the bottom line, in my view, is that it's not 100% safe to disable write barriers, although some precautions (#1 and #2) can make this setup a little safer.