The 'MAX AVAIL' column represents the amount of data that can be written to a pool before the first OSD becomes full. It takes into account the projected distribution of data across OSDs from the CRUSH map and uses the 'first OSD to fill up' as the limit.
It also factors in the replication size: if your data pool has a larger replication size than the other pools, that would explain the difference.
You can check the replication size like this:
ceph osd pool get default.rgw.buckets.data size
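As a rough illustration (the numbers here are made up): if 30 TiB of raw space remains before the first OSD fills, a size-3 pool will show a MAX AVAIL of roughly 10 TiB while a size-2 pool will show roughly 15 TiB. You can compare all pools at once:

ceph df detail              # shows USED and MAX AVAIL per pool
ceph osd pool ls detail     # shows each pool's replicated size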
Is Ceph waiting until the next journal commit when it hits an fsync call?
Yes, mostly. But it acts a little differently depending on the backend.
Under FileStore, there's a journal buffer that can act as a small write-burst cache, but it's tiny. And yes, once it fills, writes block until it flushes - across the entire cluster or PG.
Under BlueStore, there's no such buffer: every write blocks on an fsync to the journal - to all journals in the PG. This is how BlueStore remains very consistent and predictable in IOPS and writes. Under BlueStore, you want to move at least the write-ahead log (WAL) off to an enterprise SSD, because BlueStore will move the journal and DB onto that same WAL partition if there's enough room (you don't even have to specify them, just the WAL).
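As a sketch (the device paths here are placeholders), provisioning an OSD this way with ceph-volume looks like:

# HDD for data, enterprise SSD for the WAL; DB/journal follow it if room allows
ceph-volume lvm create --data /dev/sdb --block.wal /dev/nvme0n1p1

# Or give it an explicit DB device, in which case the WAL colocates with the DB
ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1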
Use enterprise SSDs as WAL/DB/journal devices, because they can safely ignore fsync.
But the real issue in this cluster is that you are using sub-optimal HDDs as journals, which block on very slow fsyncs when they get flushed.
Even consumer-grade SSDs have serious issues with Ceph's fsync frequency when used as journals/WAL, because consumer SSDs have only transactional logs and no real power-loss protection.
Enterprise SSDs, on the other hand, have large capacitors that allow the drive to keep operating briefly after a power loss, so they can guarantee a successful write even in a power-loss event.
The added benefit is that enterprise SSDs typically ignore fsync commands from the OS: because they can already guarantee the write will succeed, they acknowledge the fsync request immediately.
Thus, you get major performance gains when using an Enterprise-grade SSD as your WAL/DB/Journal.
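If you want to verify this on your own drives, a common test (a sketch; /dev/sdX is a placeholder, and the run overwrites the device) is single-threaded 4k sync writes with fio:

# WARNING: destroys data on /dev/sdX.
# An enterprise SSD with capacitors sustains thousands of these fsync'd IOPS;
# a consumer SSD typically collapses to a few hundred or less.
fio --name=journal-test --filename=/dev/sdX --ioengine=libaio --direct=1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 --fsync=1 --runtime=60 --time_based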
Under FileStore, you'll see those delays go away, but performance will be inconsistent: bursts while the journal caches writes, then drops back down as it flushes.
This is where BlueStore comes in, as BlueStore delivers consistent IOPS and write latency across the board. But you need the WAL/DB/journal on an enterprise SSD to ignore those fsyncs.
At this time, Intel S3700s can be had on the used market for about $40 each - a tiny investment for the massive performance gain of unblocked fsyncs.
Some quotes (https://yourcmc.ru/wiki/index.php?title=Ceph_performance&mobileaction=toggle_view_desktop#Bluestore_vs_Filestore):
Filestore writes everything to the journal and only starts to flush it to the data device when the journal fills up to the configured percent. This is very convenient because it makes journal act as a «temporary buffer» that absorbs random write bursts.
Bluestore can’t do the same even when you put its WAL+DB on SSD. It also has sort of a «journal» which is called «deferred write queue», but it’s very small (only 64 requests) and it lacks any kind of background flush threads. So you actually can increase the maximum number of deferred requests, but after the queue fills up the performance will drop until OSD restarts.
And: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/
The BlueStore journal will always be placed on the fastest device available, so using a DB device will provide the same benefit that the WAL device would while also allowing additional metadata to be stored there (if it will fit). This means that if a DB device is specified but an explicit WAL device is not, the WAL will be implicitly colocated with the DB on the faster device.
Best Answer
Well, I was wondering about the usage and mechanism of CephFS snapshots, and the search results brought me here.
Firstly, snapshots in CephFS are available, but not yet stable. With allow_new_snaps set, snapshots are enabled in CephFS, and making one is as easy as creating a directory. Besides being unstable, what I've found is that files in a snapshot still seem to change as the files in the filesystem change, but I haven't got a clue why.
Snapshotting the pool seems to be a reliable way to do backups, but keep in mind that you have to snapshot both the data pool and the metadata pool, and both snapshots need to be taken at the same time, in order to get a consistent snapshot of the filesystem. What's worse, you will need to combine both snapshots and make a new filesystem with them in order to get a single file or directory back from the snapshot, but multi-fs is not yet implemented in Ceph, AFAIK. So your only way to recover may be to overwrite the current filesystem with the snapshot entirely.
I'm using the allow_new_snaps approach, which seems to be more promising.
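For reference, here's a minimal sketch of both approaches (the mount point and snapshot names are made up; the first command is the old pre-Mimic syntax, while newer releases use "ceph fs set <fs_name> allow_new_snaps true"):

ceph mds set allow_new_snaps true --yes-i-really-mean-it

# Snapshot a directory by creating a name under its hidden .snap directory
mkdir /mnt/cephfs/mydir/.snap/backup-1

# Pool-level alternative: snapshot the data and metadata pools back to back
ceph osd pool mksnap cephfs_data snap-1
ceph osd pool mksnap cephfs_metadata snap-1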