Ceph Cluster – Reduced data availability: 96 pgs inactive and all OSD nodes are down

ceph, health-check, object-storage

I set up my Ceph cluster by following this document. I have one manager node, one monitor node, and three OSD nodes. Right after I finished setting up the cluster, ceph health returned HEALTH_OK on all three nodes. However, when I came back to the cluster later, it was no longer OK. This is the output of the health check:

HEALTH_WARN Reduced data availability: 96 pgs inactive
PG_AVAILABILITY Reduced data availability: 96 pgs inactive
    pg 0.0 is stuck inactive for 35164.889973, current state unknown, last acting []
    pg 0.1 is stuck inactive for 35164.889973, current state unknown, last acting []
    pg 0.2 is stuck inactive for 35164.889973, current state unknown, last acting []

and similarly for all the other PGs.
I'm new to Ceph and I don't know why this happened. I'm running Ceph version 13.2.10 mimic (stable). I have searched for an answer, but others who seem to have the same problem aren't experiencing node failure. All of my OSD nodes are down, and this is the output of ceph -s:

  cluster:
    id:     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx
    health: HEALTH_WARN
            Reduced data availability: 96 pgs inactive

  services:
    mon: 1 daemons, quorum server-1
    mgr: server-1(active)
    osd: 3 osds: 0 up, 0 in

  data:
    pools:   2 pools, 96 pgs
    objects: 0  objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             96 unknown
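
For reference, these are the sort of commands I understand can show how the monitors see each individual OSD (just a sketch of the commands, I haven't included the output here):

# How the monitors see each OSD: up/down, in/out, weight, and host
ceph osd tree
# Short summary of OSD counts and the current osdmap epoch
ceph osd stat
# Full osdmap, including per-OSD state and addresses
ceph osd dump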

I have also checked the OSD logs and I don't understand what the problem is, but these few lines suggest that the problem is with my version of Ceph and that I have to upgrade to Luminous, even though I'm already running a newer release:

2021-02-18 22:01:11.994 7fb070e25c00  0 osd.1 14 done with init, starting boot process
2021-02-18 22:01:11.994 7fb070e25c00  1 osd.1 14 start_boot
2021-02-18 22:01:11.998 7fb049add700 -1 osd.1 14 osdmap require_osd_release < luminous; please upgrade to luminous
2021-02-18 22:11:00.706 7fb050aeb700 -1 osd.1 15 osdmap require_osd_release < luminous; please upgrade to luminous
2021-02-18 22:35:52.276 7fb050aeb700 -1 osd.1 16 osdmap require_osd_release < luminous; please upgrade to luminous
2021-02-18 22:36:08.836 7fb050aeb700 -1 osd.1 17 osdmap require_osd_release < luminous; please upgrade to luminous
2021-02-19 04:05:00.895 7fb0512ec700  1 bluestore(/var/lib/ceph/osd/ceph-1) _balance_bluefs_freespace gifting 0x1f00000~100000 to bluefs
2021-02-19 04:05:00.931 7fb0512ec700  1 bluefs add_block_extent bdev 1 0x1f00000~100000
2021-02-19 04:23:51.208 7fb0512ec700  1 bluestore(/var/lib/ceph/osd/ceph-1) _balance_bluefs_freespace gifting 0x2400000~400000 to bluefs
2021-02-19 04:23:51.244 7fb0512ec700  1 bluefs add_block_extent bdev 1 0x2400000~400000
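
From what I've read, that message refers to the require_osd_release flag stored in the osdmap rather than to the installed packages, so something like the following should show it and, if it really is older than Luminous, raise it (this is only my guess at what applies here, so treat it as a sketch):

# Show which release the osdmap currently requires
ceph osd dump | grep require_osd_release
# Raise the flag, but only once every daemon actually runs mimic or newer
ceph osd require-osd-release mimic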

I also checked the OSD versions with ceph tell osd.* version, and this is the output:

Error ENXIO: problem getting command descriptions from osd.0
osd.0: problem getting command descriptions from osd.0
Error ENXIO: problem getting command descriptions from osd.1
osd.1: problem getting command descriptions from osd.1
Error ENXIO: problem getting command descriptions from osd.2
osd.2: problem getting command descriptions from osd.2

while ceph-osd --version returns Ceph version 13.2.10 mimic (stable).
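
As far as I understand, the ENXIO errors just mean the monitor cannot reach the OSD daemons at all, so checks along these lines on each OSD host might narrow it down (the id 0 and the monitor address are placeholders):

# Is the daemon actually running on this host?
systemctl status ceph-osd@0
# Recent log lines for that daemon
journalctl -u ceph-osd@0 --no-pager -n 50
# Can this host reach the monitor? 6789 is the default monitor port
nc -zv <monitor-ip> 6789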

I can't understand what the problem could be. I also tried systemctl start -l ceph-osd@# and it didn't work. I have no clue what else I can try or why this happened in the first place.

Best Answer

I remember experiencing the same issue a couple of times. Once the problem was iptables: I had forgotten to open the ports for the cluster network on both the monitors and the OSDs. The other time it was because my CRUSH map failure domain was set to host while I was running an all-in-one cluster; that one was solved by changing the failure domain to osd.
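
Roughly, the two fixes looked like this; it's a sketch rather than the exact commands I ran, and the rule and pool names are just examples:

# Case 1: open the Ceph ports on every node
# (6789 for the monitor, 6800-7300 for the OSD and MGR daemons)
iptables -A INPUT -p tcp --dport 6789 -j ACCEPT
iptables -A INPUT -p tcp -m multiport --dports 6800:7300 -j ACCEPT

# Case 2: single-host cluster - create a replicated CRUSH rule whose failure
# domain is osd instead of host, then point each pool at it
ceph osd crush rule create-replicated replicated_osd default osd
ceph osd pool set rbd crush_rule replicated_osd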
