I experienced the same issue in very much the same environment. I finally tracked down the problem to a messed-up OSD UUID. What gave it away was the following line in the MON log (not the OSD log!):
... mon.minion-001@0(leader).osd e75 preprocess_boot from osd.0 10.208.66.2:6800/3427 clashes with existing osd: different fsid (ours: 71b33e7f-b464-4ba9-96b3-8c814921fea2 ; theirs: 5401be6f-b4ff-42ef-8531-78ee73772d5b)
I resolved the problem by manually removing the OSD, destroying its file system, and re-creating it from scratch. How the problem came into existence is something I still have to track down.
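For reference, the removal/re-creation part is roughly the standard OSD removal procedure (a sketch; osd.0 and /dev/sdb1 stand in for whichever OSD and data partition are affected):

ceph osd out 0
systemctl stop ceph-osd@0        # or: service ceph stop osd.0, on older init systems
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0
mkfs.xfs -f /dev/sdb1            # destroy the old file system
# then provision the OSD again with whatever tooling you normally use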
Given that I used Puppet to set up the OSDs, and that the cause of the mess-up is probably particular to my environment, the issue you are experiencing is likely a different one. Still, it may be worth checking your MON log anyway. You will have to enable debugging on the MON first, by putting something like this in ceph.conf:
[mon]
debug mon = 9
The message in question is logged at level 7, so this gives you some more details without making everything terribly chatty.
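If you'd rather not restart the MON, you should also be able to raise the level at runtime via injectargs (a sketch; adjust the mon target as needed):

ceph tell mon.* injectargs '--debug-mon 9'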
@LoicDachary: wouldn't it make sense to log this error/warning message at level 0? I would certainly have spotted this issue earlier had it been logged right away.
I did eventually work out what was wrong: I had to manually change 'type host' to 'type osd' in our crushmap, which is different from Spongman's suggestion.
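In case it helps anyone else, the usual way to make that edit is to round-trip the crushmap through crushtool (a sketch, assuming the default replicated rule; file names are arbitrary):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# in the replication rule, change
#   step chooseleaf firstn 0 type host
# to
#   step chooseleaf firstn 0 type osd
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new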
After booting rgw, I found that the radosgw process was owned by "root" rather than "ceph", and "ceph -s" also reported "100.000% pgs not active". Searching for the clue "100.000% pgs not active" led me to the post https://www.cnblogs.com/boshen-hzb/p/13305560.html, which explains the fix: change 'type host' to 'type osd' in the crushmap. After that, "ceph -s" reports "HEALTH_OK", the radosgw process is owned by "ceph", and the rgw web service (port 7480) is listening.
Journal/data separation
If you have just these four drives per OSD host, and all drives have similar performance, then the usual/recommended setup would be to have one OSD per disk (i.e. 4 per server), and each OSD would have its journal file on the same disk as the data.
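With the pre-BlueStore tooling, creating such an OSD was typically a one-liner per disk (a sketch; /dev/sdb is a placeholder, and ceph-disk has since been superseded by ceph-volume):

ceph-disk prepare /dev/sdb     # creates data and journal partitions on the same disk
ceph-disk activate /dev/sdb1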
Another popular (at least historically) setup is to have journals on separate drives that are optimized for write throughput and latency; usually SSDs, ideally SSDs with "power loss protection" so that they can acknowledge "sync" writes quickly without necessarily writing to the flash array (which can be somewhat slow). In this setup it is common to share a journal SSD between multiple OSD (data) drives. For example, our OSD servers have 8 or 10 spinning-rust drives for Ceph OSDs, and the journals are distributed over two SSDs.
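With the same tooling, the shared-SSD layout just names the journal device explicitly; repeating this for each data disk should allocate an additional journal partition on the SSD each time (again a sketch with placeholder device names):

ceph-disk prepare /dev/sdb /dev/sde    # data on /dev/sdb, journal partition on SSD /dev/sde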
Partitions
If your data and journal are on the same physical disk, I personally would put them on the same partition/file system, mostly because I would be worried that separate partitions would cause a lot of head movement as the OSD/file system alternates between journal and (background) data writes. I'm not sure this is actually an issue, and on SSDs it certainly isn't. In general, separate partitions give you some optimization opportunities, e.g. different file system parameters or even file system types, or no file system at all for the journal. This comes at the cost of operational complexity; for example, adding a journal or changing its size would require repartitioning the disk.
In our setup with data on spinning disks and journals on (fewer) separate SSDs, we have a single partition per spinning disk (OSD), and a dedicated "journal" partition on each SSD; each partition contains 4–5 journals as files. Our journal files are sized at 6 GiB each, so the journal partitions are 40 GB or so.
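In ceph.conf terms, pointing an OSD at a journal file on such a partition looks roughly like this (a sketch; the path is hypothetical, and osd journal size is given in MB):

[osd.12]
osd journal = /srv/ssd-journals/osd.12.journal   # hypothetical path on the SSD's journal partition
osd journal size = 6144                          # MB, i.e. the 6 GiB mentioned above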
Caveat emptor
This setup has evolved based on a few years of experience and considerations of SSD lifetime and file system/SSD efficiency (latency, throughput). It's not necessarily the optimum, but then it's a tricky area... OSD journals have a peculiar access pattern: write-only, to a circular buffer, with frequent "sync"s. And SSDs can have large variations in (especially write) latency depending on usage (and controller and file system smartness); latency peaks can be exacerbated by the fact that Ceph only ACKs a write once N (typically 3) copies have been committed to stable storage. In general, I think this is still a little bit of a (dark?) science, and you definitely need to take the expected usage patterns into account, so take all recommendations with a grain of salt, especially these here.
Oh and everything I said is for the "classical" Ceph where the data is stored in a file system such as XFS/ext4/... With the upcoming "BlueStore" these considerations may not (all) apply anymore.