Is Ceph waiting until the next journal commit when it hits an fsync call?
Yes, mostly. But it acts a little differently depending on the backend.
Under FileStore, there's a small journal buffer that can absorb short write bursts, but it's tiny. And yes, once it fills, writes block while it flushes - stalling the affected PGs (and everything in the cluster that touches them).
Under BlueStore, there's no such buffer. And yes, BlueStore blocks on every write until it is fsynced to the journal - to all journals in the PG's acting set. This is how BlueStore stays so consistent and predictable in IOPS and throughput. Under BlueStore, you want to move at least the DB off to an enterprise SSD - BlueStore will place the write-ahead log (WAL) and the journal on the same fast partition as the DB if there's enough room (you don't even have to specify a separate WAL, just the DB).
Use enterprise SSDs as WAL/DB/Journals, because they can safely ignore fsync
But the real issue in this cluster is that you are using sub-optimal HDDs as journals, which block on very slow fsyncs when they get flushed.
Even consumer-grade SSDs have serious issues with Ceph's fsync frequency when used as journals/WAL, because consumer SSDs only have transactional logs and no real power-loss protection.
It's enterprise SSDs that have large capacitors allowing the drive to keep operating long enough after a power loss to finish in-flight writes. Thus, they can guarantee a successful write even through a power-loss event.
The added benefit is that enterprise SSDs typically ignore fsync commands from the OS! Because they can already guarantee the success of the write, they acknowledge the fsync request from the OS immediately.
Thus, you get major performance gains when using an Enterprise-grade SSD as your WAL/DB/Journal.
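You can check this yourself before buying: measure how long a synced write actually takes on the device you plan to use as a journal/WAL. Here's a rough probe in the spirit of ioping (the name `fsync_latency` and the defaults are mine, not from any Ceph tool); point it at a file on the candidate device:

```python
import os
import time

def fsync_latency(path, iters=100, block=4096):
    """Measure average fsync latency by repeatedly writing one block
    and forcing it to stable storage - the same call Ceph journals block on."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    buf = b"\0" * block
    lats = []
    try:
        for _ in range(iters):
            os.write(fd, buf)
            t0 = time.perf_counter()
            os.fsync(fd)  # blocks until the device reports the data is durable
            lats.append(time.perf_counter() - t0)
    finally:
        os.close(fd)
    return sum(lats) / len(lats)
```

On a spinning HDD expect several milliseconds per fsync; on an enterprise SSD with power-loss protection, typically tens of microseconds, because the drive acknowledges immediately.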
Under FileStore, you'll see those delays go away, but performance will burst while the journal buffer absorbs writes and then drop back down as it flushes.
This is where BlueStore comes in, as BlueStore will guarantee consistent IOPS and throughput across the board. But you need the WAL/DB/Journal on an enterprise SSD to ignore those fsyncs.
At the time of writing, Intel S3700s can be had on the used market for about $40 each - a tiny investment for the massive performance gain of unblocked fsyncs.
Some quotes (https://yourcmc.ru/wiki/index.php?title=Ceph_performance&mobileaction=toggle_view_desktop#Bluestore_vs_Filestore):
Filestore writes everything to the journal and only starts to flush it to the data device when the journal fills up to the configured percent. This is very convenient because it makes journal act as a «temporary buffer» that absorbs random write bursts.
Bluestore can’t do the same even when you put its WAL+DB on SSD. It also has sort of a «journal» which is called «deferred write queue», but it’s very small (only 64 requests) and it lacks any kind of background flush threads. So you actually can increase the maximum number of deferred requests, but after the queue fills up the performance will drop until OSD restarts.
And: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/
The BlueStore journal will always be placed on the fastest device available, so using a DB device will provide the same benefit that the WAL device would while also allowing additional metadata to be stored there (if it will fit). This means that if a DB device is specified but an explicit WAL device is not, the WAL will be implicitly colocated with the DB on the faster device.
Journal/data separation
If you have just these four drives per OSD host, and all drives have similar performance, then the usual/recommended setup would be to have one OSD per disk (i.e. 4 per server), and each OSD would have its journal file on the same disk as the data.
Another popular (at least historically) setup is to have journals on separate drives that are optimized for write throughput and latency; usually SSDs, ideally SSDs with "power loss protection" so that they can acknowledge "sync" writes quickly without necessarily writing to the flash array (which can be somewhat slow). In this setup it is common to share a journal SSD between multiple OSD (data) drives. For example, our OSD servers have 8 or 10 spinning-rust drives for Ceph OSDs, and the journals are distributed over two SSDs.
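For sizing those journals, the old Ceph FileStore docs give a rule of thumb: the journal should be at least twice the product of the expected throughput and `filestore max sync interval`. A quick sketch of the arithmetic (the drive numbers are example values, not recommendations):

```python
def journal_size_mb(throughput_mb_s, max_sync_interval_s):
    """FileStore rule of thumb from the Ceph docs:
    osd journal size >= 2 * expected throughput * filestore max sync interval."""
    return 2 * throughput_mb_s * max_sync_interval_s

# One spinning OSD pushing ~120 MB/s with the default 5 s sync interval:
per_osd = journal_size_mb(120, 5)   # 1200 MB per journal
# An SSD shared by 5 OSDs must hold 5 such journals:
ssd_partition = 5 * per_osd         # 6000 MB, i.e. ~6 GB minimum
```

This is roughly why shared journal SSDs end up partitioned in the tens of gigabytes, as described below.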
Partitions
If your data and journal are on the same physical disk, I personally would put them on the same partition/file system. Mostly because I would be worried that if they were on separate partitions, then there would be a lot of head movement when the OSD/file system alternates between journal and (background) data writes. I'm not sure this is actually an issue, and on SSDs it certainly isn't. In general, separate partitions give you some optimization opportunities, i.e. different file system parameters or even file system types, or no file system at all for the journal. This comes at the cost of operational complexity; for example, adding a journal or changing its size requires repartitioning the disk.
In our setup with data on spinning disks and journals on (fewer) separate SSDs, we have a single partition per spinning disk (OSD), and a dedicated "journal" partition on each SSD; each partition contains 4–5 journals as files. Our journal files are sized at 6 GiB each, so the journal partitions are 40 GB or so.
Caveat emptor
This setup has evolved based on a few years of experience and considerations of SSD lifetime and file system/SSD efficiency (latency, throughput). It's not necessarily the optimum, but then it's a tricky area... OSD journals have a peculiar access pattern: write only to a circular buffer, with frequent "sync"s. And SSDs can have large variations in (especially write) latency depending on usage (and controller and file system smartness); and latency peaks can be exacerbated by the fact that Ceph only ACKs a write when N (typically 3) writes have been committed to stable storage. In general, I think this is still a little bit of a (dark?) science, and you definitely need to take the expected usage patterns into account, so take all recommendations with a grain of salt, especially these here.
Oh and everything I said is for the "classical" Ceph where the data is stored in a file system such as XFS/ext4/... With the upcoming "BlueStore" these considerations may not (all) apply anymore.
Best Answer
It depends on the type of data access: Ceph can store data as block devices (RBD), as an S3 object store (RGW), or as a filesystem (CephFS). I assume CephFS here as you mentioned it and Gluster, both of which are filesystem abstractions.
In a three-node configuration, Ceph would have one or more OSD daemons running at each site (one per disk drive). The data is striped across the OSDs in the cluster, and your CephFS client (kernel, FUSE, or Windows) will algorithmically compute the right node to store data on; no gateway is needed. The full mechanism would take a while to explain, but essentially it is a distributed-hash-table-style mapping, with additional cluster state kept server-side in the MON daemons.
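The essential property can be sketched in a few lines. This is a deliberately simplified stand-in - Ceph's real algorithm is CRUSH, which also accounts for weights and failure domains - but it shows why no gateway is needed: any client with the cluster map can compute placement locally (function and parameter names here are made up):

```python
import hashlib

def place(obj_name, pg_num, osds, replicas=3):
    """Toy placement: hash the object name to a placement group (PG),
    then map the PG deterministically onto `replicas` distinct OSDs.
    Every client computes the same answer - no lookup service involved."""
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    pg = h % pg_num
    # pick `replicas` distinct OSDs starting at a PG-derived offset
    return pg, [osds[(pg + i) % len(osds)] for i in range(replicas)]
```

Because the mapping is a pure function of the object name and the cluster map, two independent clients always agree on where an object lives.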
The data path of CephFS is straight, from your client to the OSD, with no gateways interposed.
The filesystem makes use of an additional daemon type, the MDS, which stores your filesystem metadata. If an operation changes filesystem metadata (e.g. creating a directory), the MDS will be accessed instead of the OSDs.
However, specifically to your intended use case, Ceph is a synchronous storage system, and its performance will decline the farther you stretch the distance between the nodes. It is generally recommended you keep a stretched configuration to within 10ms of round-trip latency between nodes. In other words, Ceph clusters like to live in one datacenter, but you can stretch them across a city or some small country if you have very good links.
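The arithmetic behind that recommendation is simple. A replicated Ceph write is acknowledged only after the primary OSD has heard back from every replica, so the slowest inter-node link sets the floor for every synced write (a rough model; the function name and RTT values are illustrative):

```python
def write_latency_ms(client_primary_rtt, primary_replica_rtts):
    """Rough model of a replicated Ceph write: the client contacts the
    primary OSD, which waits for all replicas to commit before ACKing.
    The slowest replica link dominates the total."""
    return client_primary_rtt + max(primary_replica_rtts)

# Everything in one datacenter (~0.2 ms links): each write costs well under 1 ms.
local = write_latency_ms(0.2, [0.2, 0.2])
# Stretched, with one replica 10 ms away: every write now pays >= 10 ms,
# regardless of how fast the local disks are.
stretched = write_latency_ms(0.2, [0.2, 10.0])
```

This is why stretching beyond ~10 ms RTT degrades the whole cluster rather than just the remote site.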