Ceph e5 handle_auth_request failed to assign global_id after a host outage

ceph

I have a small 3-host Ceph cluster running Ubuntu 20.04.1 and Ceph 15.2.5 in Docker containers, deployed with cephadm. Yesterday one of the hosts (s65-ceph) had a power outage. The other two hosts kept working for a while, but then the monitors on s63-ceph and s64-ceph started filling their logs with "e5 handle_auth_request failed to assign global_id", to the point that their HDDs filled up. I have seen this type of error before, when the problem was just clock skew, but I've synchronised the clocks and clock skew doesn't appear to be the problem here. I have also restarted ceph.target and rebooted all machines multiple times, to no avail. I haven't made any changes to the cluster in a long time, so I also know this isn't some sort of configuration error.

I have tried following the documentation's monitor troubleshooting page. The odd thing, though, is that s64-ceph lists itself and s63-ceph in the quorum, while s63-ceph says it's out of quorum. Here is what the mon_status commands return: s63-ceph, s64-ceph. I am also able to telnet to ports 3300 and 6789 from the other hosts, so I know the network is fine. The monitor (and its container) on s65-ceph is down, so there is no mon_status for it. If it's of use, here's the cluster config. I've uploaded the journal of s64-ceph here (with duplicate handle_auth_request lines and all grafana/prometheus/dashboard lines removed); s63-ceph has basically the same logs anyway.

I am aware there is an extremely similar question here, but it hasn't had any answers. This is a production system, so if there is no way to get the cluster back to normal operation, I'd appreciate some safe way of recovering my files (the cluster is used only for CephFS).

Thanks in advance.

Best Answer

I've finally sorted this out, but the documentation around this is fairly obscure, so I'll answer my own question. It turned out that the host which went down had also filled up its disk, which is why it was behaving differently from the other two hosts and why its mon wasn't starting up. I fixed that by clearing out old logs and unnecessary packages; with free space back on the disk, all three mons could start and the three hosts behaved identically again.
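Something along these lines frees up the usual suspects on an Ubuntu host; the exact cleanup obviously depends on what is actually filling your disk, so treat this as a sketch rather than a recipe:

# See which filesystem is actually full
df -h

# Trim old systemd journal entries (keep only the last 7 days)
sudo journalctl --vacuum-time=7d

# Remove packages and cached .debs that are no longer needed
sudo apt-get autoremove --purge
sudo apt-get clean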

To troubleshoot the cluster, I found the easiest way to start is to get the mon_status of each monitor. I use cephadm, so the command below is for the containerised (Docker) deployment and talks to the mon's admin socket. In a "normal" setup, you'd instead run sudo ceph tell mon.s64-ceph mon_status.

ceph --admin-daemon /run/ceph/9ea4d206-baec-11ea-b970-2165cf493db2/ceph-mon.<mon_name>.asok mon_status
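If the ceph CLI isn't installed on the host itself, the same query can be run from inside the mon's container instead. This is a sketch assuming a cephadm/Docker deployment like mine; substitute your own mon's daemon name:

# Open a shell inside the mon's container...
sudo cephadm enter --name mon.s64-ceph
# ...then, from inside that shell, query the mon's admin socket
ceph daemon mon.s64-ceph mon_status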

Either way, that will give you something like:

{
"name": "s64-ceph",
"rank": 0,
"state": "leader",
"election_epoch": 25568,
"quorum": [
    0,
    1
],
"quorum_age": 17,
"features": {
    "required_con": "2449958747315978244",
    "required_mon": [
        "kraken",
        "luminous",
        "mimic",
        "osdmap-prune",
        "nautilus",
        "octopus"
    ],
    "quorum_con": "4540138292836696063",
    "quorum_mon": [
        "kraken",
        "luminous",
        "mimic",
        "osdmap-prune",
        "nautilus",
        "octopus"
    ]
},
"outside_quorum": [],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": {
    "epoch": 5,
    "fsid": "9ea4d206-baec-11ea-b970-2165cf493db2",
    "modified": "2020-07-15T12:13:10.390355Z",
    "created": "2020-06-30T16:15:22.596364Z",
    "min_mon_release": 15,
    "min_mon_release_name": "octopus",
    "features": {
        "persistent": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus"
        ],
        "optional": []
    },
    "mons": [
        {
            "rank": 0,
            "name": "s64-ceph",
            "public_addrs": {
                "addrvec": [
                    {
                        "type": "v2",
                        "addr": "10.2.64.2:3300",
                        "nonce": 0
                    },
                    {
                        "type": "v1",
                        "addr": "10.2.64.2:6789",
                        "nonce": 0
                    }
                ]
            },
            "addr": "10.2.64.2:6789/0",
            "public_addr": "10.2.64.2:6789/0",
            "priority": 0,
            "weight": 0
        },
        {
            "rank": 1,
            "name": "s63-ceph",
            "public_addrs": {
                "addrvec": [
                    {
                        "type": "v2",
                        "addr": "10.2.63.2:3300",
                        "nonce": 0
                    },
                    {
                        "type": "v1",
                        "addr": "10.2.63.2:6789",
                        "nonce": 0
                    }
                ]
            },
            "addr": "10.2.63.2:6789/0",
            "public_addr": "10.2.63.2:6789/0",
            "priority": 0,
            "weight": 0
        },
        {
            "rank": 2,
            "name": "s65-ceph",
            "public_addrs": {
                "addrvec": [
                    {
                        "type": "v2",
                        "addr": "10.2.65.2:3300",
                        "nonce": 0
                    },
                    {
                        "type": "v1",
                        "addr": "10.2.65.2:6789",
                        "nonce": 0
                    }
                ]
            },
            "addr": "10.2.65.2:6789/0",
            "public_addr": "10.2.65.2:6789/0",
            "priority": 0,
            "weight": 0
        }
    ]
},
"feature_map": {
    "mon": [
        {
            "features": "0x3f01cfb8ffadffff",
            "release": "luminous",
            "num": 1
        }
    ],
    "client": [
        {
            "features": "0x27018fb86aa42ada",
            "release": "jewel",
            "num": 1
        }
    ]
}
}

If you look at the quorum field, it only lists two of the three monitors as being in the quorum. That is because s65-ceph was the one whose disk had filled up and whose mon wouldn't start. Once you get the third host's mon up, it will show all three monitors in the quorum.

Ordinarily, Ceph should be able to run (albeit not in a healthy state) even with only 2/3 monitors up, because two out of three is a majority and they can still form a quorum. That was not happening here, though. Examining the journal on each host, the monitors were calling for elections very frequently, roughly every 5-10 seconds (you'll see lines containing "calling for election"). They were switching monitors before the cluster ever became available to users again, which is why the cluster always appeared to be down.
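A quick way to see how often this is happening is to grep the journal for election messages. The unit name below assumes cephadm's ceph-<fsid>@mon.<hostname> naming, with my cluster's fsid filled in; adjust both to match yours:

# Count election calls logged in the last half hour
journalctl -u ceph-9ea4d206-baec-11ea-b970-2165cf493db2@mon.s64-ceph.service \
    --since "30 min ago" | grep -ci "calling.*election"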

I usually keep Glances open while troubleshooting, and I noticed very high RAM utilisation, along with network and disk read/write spikes, whenever the mons held an election. That made me think the frequent monitor switching was causing the high I/O, and that paging was making the I/O problem worse. I found one blog post which seemed to support this.
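If you don't have Glances handy, plain iostat and vmstat show the same picture. A minimal way to watch it (iostat comes from the sysstat package):

# Per-disk utilisation and wait times, refreshed every 5 seconds
iostat -dx 5

# Memory pressure: watch the si/so (swap in/out) columns
vmstat 5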

I can't add more RAM to any of the hosts to test that, but I did find that when a monitor is very slow to respond, the other monitors call for an election. In my case the HDDs aren't fast enough for constant monitor switching (read: frequent bursts of random reads/writes), so whenever a monitor had just been elected leader, it would spend a few seconds writing to its HDD and be extremely unresponsive while doing so. The other monitors would then call for another election, the next leader would hit the same problem, and the cycle went on and on like this, in a sort of positive feedback loop.
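It can also be worth checking how large the mon store is, since that is the on-disk database the monitor is hammering while it joins or leads a quorum. The path below assumes a cephadm layout and uses my cluster's fsid; adjust the fsid and hostname for your own:

# Size of this host's monitor store (RocksDB)
sudo du -sh /var/lib/ceph/9ea4d206-baec-11ea-b970-2165cf493db2/mon.s64-ceph/store.db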

I eventually found there's a parameter called mon_lease, which defaults to 5.0 seconds. Roughly speaking, it controls how long the other monitors will wait for a given monitor to respond before calling for elections again. Five seconds is the default because Ceph is usually run on reasonably fast servers, but I know my cluster runs much slower because it's built from three very old recycled laptops. I set mon_lease to 30 seconds with the command below so this frequent switching problem would go away; I'm also not running much software on top of Ceph, so I'm not concerned about reads/writes timing out when there is a mon switch. After changing mon_lease, wait a few minutes and THEN check your journal logs: you should find that none of the hosts are doing the constant monitor switching any more. Make sure your cluster is working as expected, and ideally reboot all Ceph hosts just to make sure everything will come up cleanly on the next boot.

ceph --admin-daemon /run/ceph/9ea4d206-baec-11ea-b970-2165cf493db2/ceph-mon.s64-ceph.asok config set mon_lease 30.0
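Note that a config set through the admin socket only changes the running daemon and does not survive a restart. Once a quorum is available you can also store the value in the cluster's centralised configuration database, and check what each running mon has actually picked up:

# Persist the setting for all monitors (requires a working quorum)
ceph config set mon mon_lease 30.0

# Show the value a specific running mon is using
ceph config show mon.s64-ceph mon_lease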

I hope my answer helps someone avoid the same misfortunes with Ceph. Leave a comment if you need clarification.
