NFS – XFS: possible memory allocation deadlock in kmem_alloc

Tags: nfs, nfs4, xfs

I am performing a data analysis that entails loading a large (~112 GB) data matrix into a memory-mapped file using the R programming language, specifically the bigmemory package (see https://cran.r-project.org/web/packages/bigmemory/index.html). The matrix has 80664 columns and 356751 rows.

Data storage consists of an NFS-mounted XFS filesystem.

XFS mount options are:

xfs noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k
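
For reference, a hypothetical /etc/fstab entry on the file server combining these options might look like this (the device /dev/sdb1 and the mount point /export/data are placeholders, not from the original post):

/dev/sdb1  /export/data  xfs  noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k  0  0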

NFS is exporting the FS using the following options:

rw,async,no_subtree_check,no_root_squash
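
As a sketch, assuming an export path of /export/data and a client subnet of 10.0.0.0/24 (both placeholders), the corresponding /etc/exports line would be:

/export/data  10.0.0.0/24(rw,async,no_subtree_check,no_root_squash)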

NFS client is mounting the FS using these options:

defaults,async,_netdev
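
For completeness, a hypothetical client-side /etc/fstab entry (the server name fileserver and both paths are placeholders):

fileserver:/export/data  /mnt/data  nfs  defaults,async,_netdev  0  0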

Some time into loading the file, the compute node becomes unresponsive (as do other nodes in the cluster), and the file server logs report the following error:

XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

I can resolve this by dropping caches like so:

echo 3 > /proc/sys/vm/drop_caches

The file server has 16 GB of memory.

I have already read through the following blog:

https://blog.codecentric.de/en/2017/04/xfs-possible-memory-allocation-deadlock-kmem_alloc/

However, the issue does not appear to be due to fragmentation, as the reported fragmentation is below 2% for the filesystem I am writing to.

Given the XFS error above, I assume the file server is running out of memory because it cannot keep up with the number of I/O requests issued by the task at hand.
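
One way to verify this assumption is to watch the dirty-page and writeback counters, together with kernel slab usage, on the file server while the load is running:

watch -n 5 "grep -E 'MemFree|Dirty|Writeback' /proc/meminfo"
slabtop -o | head -20

If Dirty/Writeback stay high and MemFree collapses during the load, that supports the memory-pressure theory.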

Apart from dropping caches periodically (e.g. via cron, as sketched below), is there a more permanent solution to this?
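
For illustration, the periodic workaround would be a root crontab entry along these lines (the 15-minute interval is arbitrary; syncing first ensures dropped pages have been written out):

*/15 * * * * /usr/bin/sync && echo 3 > /proc/sys/vm/drop_caches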

Thanks in advance for the help.

Edit: CentOS 7.2 on client and server.

Edit #2: Kernel 3.10.0-229.14.1.el7.x86_64 on client and server.

Best Answer

It's related to memory fragmentation and filesystem fragmentation; see https://bugzilla.kernel.org/show_bug.cgi?id=73831.

You should check your filesystem's fragmentation with xfs_db -r -c frag <filesystem>. Keeping the filesystem not too full (80% or less) and running xfs_fsr for a while should help, too.
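
For example, assuming the filesystem lives on /dev/sdb1 and is mounted at /export/data (placeholder names), the check and a defragmentation pass would look like:

xfs_db -r -c frag /dev/sdb1    # report the fragmentation factor
df -h /export/data             # keep usage at 80% or less
xfs_fsr -v /export/data        # defragment; run during a quiet period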