I experienced the same issue in very much the same environment. I finally tracked down the problem to a messed-up OSD UUID. What gave it away was the following line in the MON log (not the OSD log!):
... mon.minion-001@0(leader).osd e75 preprocess_boot from osd.0 10.208.66.2:6800/3427 clashes with existing osd: different fsid (ours: 71b33e7f-b464-4ba9-96b3-8c814921fea2 ; theirs: 5401be6f-b4ff-42ef-8531-78ee73772d5b)
I resolved the problem by manually removing the OSD, destroying its file system, and re-creating it from scratch. How the problem came about in the first place is something I still have to track down.
Given that I used Puppet to set up the OSDs, and that the reason it went wrong is probably particular to my environment, the issue you are experiencing is likely a different one, but it may be worth checking your MON log anyway. You will need to enable debugging on the MON first, by adding something like this to ceph.conf:
[mon]
debug mon = 9
The message in question is logged at level 7, so this gives you some more details without making everything terribly chatty.
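If you would rather not edit ceph.conf and restart the mon, you can usually raise the log level at runtime instead; a sketch, assuming the mon name minion-001 from the log excerpt above and the default log path:

```shell
# Raise mon debug logging to level 9 at runtime (no restart needed).
# "minion-001" is the mon name from the log excerpt above; substitute your own.
ceph tell mon.minion-001 injectargs '--debug-mon 9'

# Watch the MON log for the tell-tale preprocess_boot line:
tail -f /var/log/ceph/ceph-mon.minion-001.log | grep -i 'clashes with existing osd'
```

Remember to set it back afterwards (e.g. injectargs '--debug-mon 1/5'), or the log will stay chatty.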
@LoicDachary: wouldn't it make sense to log this error/warning message at level 0? I would certainly have spotted this issue earlier had it been logged right away.
I've finally sorted this out but the documentation regarding this is fairly obscure so I'll answer my own question. It appears the host which went down had also filled up its disk, which is why it was behaving differently to the other two hosts and why its mon wasn't starting up. I solved that by clearing old logs and unnecessary packages. That then meant the three hosts behaved identically because all three mons could start up.
To troubleshoot the cluster, I found the easiest place to start is getting the mon_status of each monitor. I use cephadm, so the commands below are for Docker containers; in a "normal" setup you would instead run sudo ceph tell mon.s64-ceph mon_status.
ceph --admin-daemon /run/ceph/9ea4d206-baec-11ea-b970-2165cf493db2/ceph-mon.<mon_name>.asok mon_status
That will give you something like:
{
  "name": "s64-ceph",
  "rank": 0,
  "state": "leader",
  "election_epoch": 25568,
  "quorum": [
    0,
    1
  ],
  "quorum_age": 17,
  "features": {
    "required_con": "2449958747315978244",
    "required_mon": [
      "kraken",
      "luminous",
      "mimic",
      "osdmap-prune",
      "nautilus",
      "octopus"
    ],
    "quorum_con": "4540138292836696063",
    "quorum_mon": [
      "kraken",
      "luminous",
      "mimic",
      "osdmap-prune",
      "nautilus",
      "octopus"
    ]
  },
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": {
    "epoch": 5,
    "fsid": "9ea4d206-baec-11ea-b970-2165cf493db2",
    "modified": "2020-07-15T12:13:10.390355Z",
    "created": "2020-06-30T16:15:22.596364Z",
    "min_mon_release": 15,
    "min_mon_release_name": "octopus",
    "features": {
      "persistent": [
        "kraken",
        "luminous",
        "mimic",
        "osdmap-prune",
        "nautilus",
        "octopus"
      ],
      "optional": []
    },
    "mons": [
      {
        "rank": 0,
        "name": "s64-ceph",
        "public_addrs": {
          "addrvec": [
            {
              "type": "v2",
              "addr": "10.2.64.2:3300",
              "nonce": 0
            },
            {
              "type": "v1",
              "addr": "10.2.64.2:6789",
              "nonce": 0
            }
          ]
        },
        "addr": "10.2.64.2:6789/0",
        "public_addr": "10.2.64.2:6789/0",
        "priority": 0,
        "weight": 0
      },
      {
        "rank": 1,
        "name": "s63-ceph",
        "public_addrs": {
          "addrvec": [
            {
              "type": "v2",
              "addr": "10.2.63.2:3300",
              "nonce": 0
            },
            {
              "type": "v1",
              "addr": "10.2.63.2:6789",
              "nonce": 0
            }
          ]
        },
        "addr": "10.2.63.2:6789/0",
        "public_addr": "10.2.63.2:6789/0",
        "priority": 0,
        "weight": 0
      },
      {
        "rank": 2,
        "name": "s65-ceph",
        "public_addrs": {
          "addrvec": [
            {
              "type": "v2",
              "addr": "10.2.65.2:3300",
              "nonce": 0
            },
            {
              "type": "v1",
              "addr": "10.2.65.2:6789",
              "nonce": 0
            }
          ]
        },
        "addr": "10.2.65.2:6789/0",
        "public_addr": "10.2.65.2:6789/0",
        "priority": 0,
        "weight": 0
      }
    ]
  },
  "feature_map": {
    "mon": [
      {
        "features": "0x3f01cfb8ffadffff",
        "release": "luminous",
        "num": 1
      }
    ],
    "client": [
      {
        "features": "0x27018fb86aa42ada",
        "release": "jewel",
        "num": 1
      }
    ]
  }
}
If you look at the quorum field, it only lists two out of three monitors as in the quorum. This is because s65-ceph was the one whose disk had filled up and whose mon wouldn't start up. When you do get the third host's mon up, it will show all three monitors are in the quorum.
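To compare all three monitors' views of the quorum at once, a small loop can help; a sketch, using the mon names from the monmap above (adjust for your cluster, and note that jq must be installed):

```shell
# Print each mon's name, state and quorum view side by side.
# Mon names are the ones from the monmap above; substitute your own.
for m in s64-ceph s63-ceph s65-ceph; do
  echo "== mon.$m =="
  sudo ceph tell mon.$m mon_status | jq '{name, state, quorum}'
done
```

A monitor that is down won't answer ceph tell at all, so for that one you have to query the admin socket on its host, as shown earlier.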
Ordinarily, Ceph should be able to run (albeit not in a healthy state) with only 2/3 monitors up, because 2/3 is a majority and the remaining monitors can still form a quorum. That was not the case here, though. If you examine the journal on each host (look for lines containing "calling for election"), you may see what I saw: the monitors were calling elections very frequently, about every 5-10 seconds. They were switching leaders before the cluster became available to clients again, which is why the cluster always appeared to be down.
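A quick way to gauge the election churn is to grep the journal on each host; a sketch (the unit name in the second command assumes cephadm-style systemd units, ceph-&lt;fsid&gt;@mon.&lt;name&gt;, with the fsid and mon name from my cluster):

```shell
# Count election calls in the last hour; dozens of hits indicates churn.
sudo journalctl --since "1 hour ago" | grep -c "calling for election"

# Or follow the mon unit live (cephadm-style unit name; adjust fsid and mon name):
sudo journalctl -f -u 'ceph-9ea4d206-baec-11ea-b970-2165cf493db2@mon.s64-ceph.service'
```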
When troubleshooting, I keep Glances open, and I noticed very high RAM utilisation along with network and disk read/write spikes whenever the mons held an election. That made me suspect the frequent monitor switching was causing the high I/O, and that paging was making the I/O problem worse. I found one blog post which seemed to support this.
I can't add more RAM to any of the hosts to test that, but I did find that if a monitor is very slow, the other monitors will call for elections. In my case the HDDs aren't fast enough for constant monitor switching (read: frequent bursts of random reads/writes): once a monitor had been elected leader it would write to its HDD for a few seconds and be extremely unresponsive while doing so. The other monitors would then call for an election, the next leader would hit the same problem, and the cycle continued, in a sort of positive feedback loop.
I eventually found a parameter called mon_lease, which defaults to 5.0 seconds. It controls how long the other monitors will wait for a given monitor to respond before calling for elections again. Five seconds is fine on reasonably fast servers, but my cluster runs much slower because it's built from three very old recycled laptops. I set mon_lease to 30 s with the command below, which made the frequent switching go away; I'm also not running much software on top of Ceph, so I'm not concerned about reads/writes timing out during a mon switch. After changing mon_lease, wait a few minutes and THEN check your journal logs: you should find none of the hosts are constantly switching monitors. Make sure your cluster is working as expected, and ideally reboot all Ceph hosts to confirm everything will also work on the next boot.
ceph --admin-daemon /run/ceph/9ea4d206-baec-11ea-b970-2165cf493db2/ceph-mon.s64-ceph.asok config set mon_lease 30.0
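Note that a value set through the admin socket does not survive a daemon restart. On Nautilus and later you can persist it in the central config database instead; a sketch (verify with ceph config get afterwards):

```shell
# Persist the longer lease for all mons across restarts.
sudo ceph config set mon mon_lease 30.0

# Verify what a running mon actually uses (mon name from my cluster; adjust).
sudo ceph config get mon.s64-ceph mon_lease
```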
I hope my answer helps someone avoid the same misfortunes with Ceph; leave a comment if you need clarification.
Best Answer
I did eventually work out what was wrong. I had to manually change 'type host' to 'type osd' in our crushmap, which is different from Spongman's suggestion.
After starting the RGW, I found that the radosgw process was owned by "root" rather than "ceph", and "ceph -s" also reported "100.000% pgs not active".
I searched for "100.000% pgs not active" and the post https://www.cnblogs.com/boshen-hzb/p/13305560.html describes the fix: change 'type host' to 'type osd' in the CRUSH map. After that, "ceph -s" showed HEALTH_OK, the radosgw process ran as "ceph", and the RGW web service was listening on port 7480.
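For reference, the usual way to make that 'type host' to 'type osd' change is to decompile, edit and re-inject the CRUSH map; a sketch of the standard procedure (the file names are arbitrary):

```shell
# Export and decompile the current CRUSH map.
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Edit crushmap.txt: in the relevant rule, change
#   step chooseleaf firstn 0 type host
# to
#   step chooseleaf firstn 0 type osd
# (This allows PGs to place replicas on different OSDs of the same host,
#  which is what you need when you have fewer hosts than replicas.)

# Recompile and inject the modified map.
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin
```

Be aware this trades away host-level failure isolation: losing one host can then lose multiple replicas of the same PG.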