NFS server unresponsive to clients, with ‘migration’ and ‘xfssyncd’ processes consuming unusual CPU

nfs  xfs

I have a CentOS 6.4 fileserver running NFS 4, serving a couple of XFS filesystems. There are a few dozen clients connected to it. Today it slowed to a crawl for the clients: they would hang, or respond only after a few minutes, when accessing the NFS share mounted from the server. On the server itself I could access the shared filesystems without trouble. The trouble went away after about four hours, but I don't know why (see below).

top showed several migration processes and an xfssyncd process consuming unusual amounts of CPU, jumping between 0% and anywhere up to 100% every few seconds. No other processes were noticeably active. The overall CPU usage reported by top was low, like so:

Cpu(s): 0.0%us, 4.2%sy, 0.0%ni, 95.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
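(For anyone trying to reproduce this, something like the following shows which migration/xfssyncd kernel threads are busy and on which CPU; mpstat needs the sysstat package.)

# list the migration and xfssyncd kernel threads with their CPU and usage
ps -eLo pid,tid,psr,pcpu,comm | egrep 'migration|xfssyncd'

# per-CPU utilisation, 2-second samples, 3 reports (needs sysstat)
mpstat -P ALL 2 3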

I haven't been able to find anything online talking about this in particular, apart possibly from a RHEL support entry that's in their subscriber-only section and I can't see.

I ran service nfs restart. Then service nfs status showed the daemons running, except for nfsd, which reported "nfsd dead but subsys locked". After another restart that message was gone and nfsd was running, but the clients were still hung.
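For reference, the restart plus a few related sanity checks on CentOS 6 look roughly like this (the last three are just things worth checking, not something that changed anything here):

service nfs restart
service nfs status

rpcinfo -p                # nfs, mountd and nlockmgr should all be registered
exportfs -v               # the XFS filesystems should still be listed
showmount -e localhost    # what the clients should be able to see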

I tried some things that were suggested for xfssyncd-related issues (the exact commands are spelled out after this list):

1) mount -o remount /mnt/data on the exported fs. Interestingly, this command took about a minute to run, and during this time the 'wild' processes settled down. But once the command finished, they went back to high CPU usage.

2) echo 720000 > /proc/sys/fs/xfs/xfssyncd_centisecs to lengthen the sync interval for xfssyncd. This made no noticeable difference, which is not too surprising, since the fs is busy with NFS clients and the issue must be something else.
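Spelled out, the two steps above were as follows (note the plain ASCII hyphen in -o; as far as I know the default for xfssyncd_centisecs is 3000, i.e. 30 seconds):

# 1) remount the exported filesystem in place
mount -o remount /mnt/data

# 2) check and then raise the xfssyncd flush interval (value is in centiseconds)
cat /proc/sys/fs/xfs/xfssyncd_centisecs
echo 720000 > /proc/sys/fs/xfs/xfssyncd_centisecs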

I had an issue with this server 3 weeks ago in which a .nfsNNN file (the kind created when a removed file is still open and being accessed) was filling up fast with a looping error message from a client. Killing the problem process fixed the NFS slowdown. [However, the file server then slowed down again a couple of days later without such a .nfsNNN file issue, and I eventually had to just reboot it. At the time I saw some processes with unusual CPU levels, but I didn't note what they were and can't remember now whether they were the same as in the current issue.]
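(For that earlier incident, the way to pin down the culprit was to ask the client which process still held the .nfsNNN file open; the path and file name below are made up, and I'm assuming the share is mounted at /mnt/data on the client as well.)

# on the client, find the process still writing to the deleted-but-open file
lsof /mnt/data/somedir/.nfs0000000000001234
fuser -v /mnt/data/somedir/.nfs0000000000001234
# then kill the offending PID, as in that earlier incident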

Today I searched again for open .nfsNNN files that might be causing problems, but found none. I did delete some from a few months ago, but they weren't being modified, so I figured they weren't a problem. I noticed no difference after deleting them.
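The search itself was just a find over the export, roughly:

# on the server, list .nfsNNN files modified in the last hour
find /mnt/data -name '.nfs*' -mmin -60 -ls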

About an hour after trying the various things above, the server went back to normal and the migration and xfssyncd processes no longer showed high CPU usage. I don't know what changed, but I'd like to figure this out ahead of time, since it seems likely to happen again.

Thanks for any suggestions.

-M

Best Answer

I have a RHEL 6.10 server with similar issues. The only thing that seems to help is killing long-running user sftp processes on the affected NFS client. These were processes kept open by GUI-based SFTP clients (e.g. WinSCP, Nimble Commander) for many hours (> 10 hrs).
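A rough way to spot candidates on a client (assuming OpenSSH's sftp-server; the ELAPSED column is in [[DD-]HH:]MM:SS form, so anything past 10:00:00 qualifies):

# list sftp-server processes with their elapsed run time
ps -eo pid,user,etime,comm | grep sftp-server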

Monitoring shows some NFSv3 client activity coinciding with the issue, but that activity is actually lower than the typical activity from other clients (there are > 100 clients) that don't cause the issue.

There isn't actually a lot of I/O going on, either.
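For what it's worth, the quick server-side checks that back this up (nfsstat ships with nfs-utils, iostat with sysstat):

# NFS server-side operation counters; run it twice a minute apart and compare
nfsstat -s

# per-device I/O load, 2-second samples
iostat -dx 2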

UPDATE 2019-12-10: The root cause seems to have been XFS quotas on the NFS server. User home directories have quotas imposed, with a soft limit 2 GB lower than the hard limit. Some users tried to install a full Anaconda Python, which exceeded the soft limit. The Anaconda installer did not seem to have a way to intercept the warnings, and kept downloading files past the soft limit. This generated a massive rate of quota warnings, completely bogging down the system, and making it unresponsive.

I say "seems" because the evidence is circumstantial. When the users tried the install into a directory with no quota, everything went fine.
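A quick way to check whether users are bumping into their limits (assuming the home directories live on an XFS filesystem mounted at /home with user quotas enabled; the limits and user name below are made up, just to illustrate the 2 GB soft/hard gap):

# report per-user usage against soft/hard limits, human-readable
xfs_quota -x -c 'report -u -h' /home

# the kind of limits involved: soft limit 2 GB below the hard limit
xfs_quota -x -c 'limit -u bsoft=8g bhard=10g someuser' /home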