Is Ceph waiting until the next journal commit when it hits an fsync call?
Yes, mostly. But it acts a little differently depending on the backend.
Under FileStore, the journal can act as a small write-burst cache, but it's tiny. And yes, once it fills, writes block while it flushes to the data device, stalling the whole PG.
Under BlueStore, there's no such buffer. And yes, BlueStore blocks on every write until it is fsync'd to the WAL on every OSD in the PG. This is how BlueStore stays very consistent and predictable in IOPS and write latency. Under BlueStore, you want to move at least the DB off to an enterprise SSD, because BlueStore will place the WAL on the same device as the DB if there's enough room (you don't even have to specify the WAL, just the DB).
Use enterprise SSDs as your WAL/DB/journal devices, because they can safely ignore fsync.
But the real issue in this cluster is that you are using sub-optimal HDDs as journals, which block on very slow fsyncs when they get flushed.
Even consumer-grade SSDs have serious issues with Ceph's fsync frequency when used as journals/WAL, because consumer SSDs have only transactional logs and no real power-loss protection.
Enterprise SSDs have large capacitors that allow the drive to finish in-flight writes after a power loss. Thus, they can guarantee a successful write even during a power-loss event.
The added benefit is that enterprise SSDs typically ignore fsync commands from the OS! Because they can already guarantee the write will complete, they acknowledge the fsync request from the OS immediately.
Thus, you get major performance gains when using an Enterprise-grade SSD as your WAL/DB/Journal.
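You can estimate a drive's suitability as a journal/WAL device yourself by timing fsync-per-write, which is the pattern Ceph subjects the device to. This is a minimal sketch (plain Python against a local file, not Ceph itself); point the path at a file on the device under test:

```python
import os
import time
import tempfile

def fsync_latency(path, writes=100, block=b"x" * 4096):
    """Average time for a durable 4 KiB write (write + fsync each time)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, block)
            os.fsync(fd)  # the drive must confirm the data is durable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / writes  # average seconds per durable write

# Create the test file on the device you are evaluating.
with tempfile.NamedTemporaryFile() as tmp:
    avg = fsync_latency(tmp.name)
    print(f"avg fsync-per-write latency: {avg * 1000:.3f} ms")
```

An enterprise SSD with power-loss protection acknowledges each fsync almost immediately; an HDD or consumer SSD will show per-write latencies in the milliseconds, which is exactly the stall this cluster is hitting.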
Under FileStore, you'll see those delays go away, but performance will still be inconsistent: bursts while the journal absorbs writes, then a drop back down while it flushes.
This is where BlueStore comes in, as BlueStore will deliver consistent IOPS and write latency across the board. But you need the WAL/DB on an enterprise SSD so those fsyncs are absorbed.
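To see why a buffering journal smooths bursts while per-write fsync exposes raw device latency, here is a hedged sketch (again plain Python on a local file, only an analogy for the two backends' flush patterns) comparing fsync-after-every-write with one fsync per batch:

```python
import os
import time
import tempfile

BLOCK = b"x" * 4096

def timed_writes(path, writes, fsync_every):
    """Write `writes` 4 KiB blocks, fsyncing every `fsync_every` writes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for i in range(1, writes + 1):
            os.write(fd, BLOCK)
            if i % fsync_every == 0:
                os.fsync(fd)
        os.fsync(fd)  # final flush so both runs end fully durable
        return time.perf_counter() - start
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile() as tmp:
    per_write = timed_writes(tmp.name, 200, fsync_every=1)    # BlueStore-like
    batched = timed_writes(tmp.name, 200, fsync_every=200)    # buffered-journal-like
    print(f"fsync per write: {per_write:.4f}s, one fsync per batch: {batched:.4f}s")
```

On a slow device, the per-write run pays the full fsync cost 200 times, while the batched run pays it once; on an enterprise SSD that acknowledges fsync immediately, the two converge, which is why BlueStore stays consistent there.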
At the time of writing, Intel S3700s can be had on the used market for about $40 each - a tiny investment for the massive performance gain of unblocked fsyncs.
Some quotes (https://yourcmc.ru/wiki/index.php?title=Ceph_performance&mobileaction=toggle_view_desktop#Bluestore_vs_Filestore):
Filestore writes everything to the journal and only starts to flush it to the data device when the journal fills up to the configured percent. This is very convenient because it makes journal act as a «temporary buffer» that absorbs random write bursts.
Bluestore can’t do the same even when you put its WAL+DB on SSD. It also has sort of a «journal» which is called «deferred write queue», but it’s very small (only 64 requests) and it lacks any kind of background flush threads. So you actually can increase the maximum number of deferred requests, but after the queue fills up the performance will drop until OSD restarts.
And: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/
The BlueStore journal will always be placed on the fastest device available, so using a DB device will provide the same benefit that the WAL device would while also allowing additional metadata to be stored there (if it will fit). This means that if a DB device is specified but an explicit WAL device is not, the WAL will be implicitly colocated with the DB on the faster device.