Is Ceph waiting until the next journal commit when it hits an fsync call?
Yes, mostly. But it acts a little differently depending on the backend.
Under FileStore, the journal can act as a small write-burst cache, but it's tiny. And yes, once it fills, writes block while it flushes to the data device, stalling the whole PG.
Under BlueStore, there's no such buffer. And yes, BlueStore blocks on every write until it is fsync'd to the WAL on every OSD in the PG. This is how BlueStore stays very consistent and predictable in IOPS and write latency. Under BlueStore, you want to move at least the DB off to an enterprise SSD, because BlueStore will place the WAL on the same device as the DB if there's enough room (you don't even have to specify the WAL, just the DB).
Use enterprise SSDs as your WAL/DB/journal devices, because they can safely ignore fsync.
But the real issue in this cluster is that you are using sub-optimal HDDs as journals, which block on very slow fsyncs when they get flushed.
Even consumer-grade SSDs have serious issues with Ceph's fsync frequency when used as journals/WAL, because consumer SSDs have only transactional logs and no real power-loss protection.
Enterprise SSDs have large capacitors that allow the drive to finish in-flight writes after a power loss. Thus, they can guarantee a successful write even during a power-loss event.
The added benefit is that enterprise SSDs typically ignore fsync commands from the OS! Because they can already guarantee the write will complete, they acknowledge the fsync request from the OS immediately.
Thus, you get major performance gains when using an Enterprise-grade SSD as your WAL/DB/Journal.
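You can estimate a drive's suitability as a journal/WAL device yourself by timing fsync-per-write, which is the pattern Ceph subjects the device to. This is a minimal sketch (plain Python against a local file, not Ceph itself); point the path at a file on the device under test:

```python
import os
import time
import tempfile

def fsync_latency(path, writes=100, block=b"x" * 4096):
    """Average time for a durable 4 KiB write (write + fsync each time)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, block)
            os.fsync(fd)  # the drive must confirm the data is durable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / writes  # average seconds per durable write

# Create the test file on the device you are evaluating.
with tempfile.NamedTemporaryFile() as tmp:
    avg = fsync_latency(tmp.name)
    print(f"avg fsync-per-write latency: {avg * 1000:.3f} ms")
```

An enterprise SSD with power-loss protection acknowledges each fsync almost immediately; an HDD or consumer SSD will show per-write latencies in the milliseconds, which is exactly the stall this cluster is hitting.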
Under FileStore, you'll see those delays go away, but performance will still be inconsistent: bursts while the journal absorbs writes, then a drop back down while it flushes.
This is where BlueStore comes in, as BlueStore will deliver consistent IOPS and write latency across the board. But you need the WAL/DB on an enterprise SSD so those fsyncs are absorbed.
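To see why a buffering journal smooths bursts while per-write fsync exposes raw device latency, here is a hedged sketch (again plain Python on a local file, only an analogy for the two backends' flush patterns) comparing fsync-after-every-write with one fsync per batch:

```python
import os
import time
import tempfile

BLOCK = b"x" * 4096

def timed_writes(path, writes, fsync_every):
    """Write `writes` 4 KiB blocks, fsyncing every `fsync_every` writes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for i in range(1, writes + 1):
            os.write(fd, BLOCK)
            if i % fsync_every == 0:
                os.fsync(fd)
        os.fsync(fd)  # final flush so both runs end fully durable
        return time.perf_counter() - start
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile() as tmp:
    per_write = timed_writes(tmp.name, 200, fsync_every=1)    # BlueStore-like
    batched = timed_writes(tmp.name, 200, fsync_every=200)    # buffered-journal-like
    print(f"fsync per write: {per_write:.4f}s, one fsync per batch: {batched:.4f}s")
```

On a slow device, the per-write run pays the full fsync cost 200 times, while the batched run pays it once; on an enterprise SSD that acknowledges fsync immediately, the two converge, which is why BlueStore stays consistent there.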
At the time of writing, Intel S3700s can be had on the used market for about $40 each - a tiny investment for the massive performance gain of unblocked fsyncs.
Some quotes (https://yourcmc.ru/wiki/index.php?title=Ceph_performance&mobileaction=toggle_view_desktop#Bluestore_vs_Filestore):
Filestore writes everything to the journal and only starts to flush it to the data device when the journal fills up to the configured percent. This is very convenient because it makes journal act as a «temporary buffer» that absorbs random write bursts.
Bluestore can’t do the same even when you put its WAL+DB on SSD. It also has sort of a «journal» which is called «deferred write queue», but it’s very small (only 64 requests) and it lacks any kind of background flush threads. So you actually can increase the maximum number of deferred requests, but after the queue fills up the performance will drop until OSD restarts.
And: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/
The BlueStore journal will always be placed on the fastest device available, so using a DB device will provide the same benefit that the WAL device would while also allowing additional metadata to be stored there (if it will fit). This means that if a DB device is specified but an explicit WAL device is not, the WAL will be implicitly colocated with the DB on the faster device.