I am performing a data analysis that entails loading a large (~112 GB) data matrix into a memory-mapped file in the R programming language, using the bigmemory
package (see https://cran.r-project.org/web/packages/bigmemory/index.html). The matrix has 80664 columns and 356751 rows.
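As a sanity check on that figure, assuming the matrix is stored as 4-byte elements (an assumption; bigmemory also supports 8-byte doubles), the backing file works out to roughly 107 GiB, consistent with the ~112 GB above:

```shell
# Backing-file size for an 80664 x 356751 big.matrix,
# assuming 4-byte elements (e.g. type = "integer")
bytes=$(( 80664 * 356751 * 4 ))
gib=$(( bytes / 1024 / 1024 / 1024 ))
echo "$bytes bytes (~$gib GiB)"   # 115107850656 bytes (~107 GiB)
```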
The data is stored on an NFS-mounted XFS filesystem.
XFS mount options are:
xfs noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k
NFS is exporting the FS using the following options:
rw,async,no_subtree_check,no_root_squash
NFS client is mounting the FS using these options:
defaults,async,_netdev
After some time loading the file, the compute node becomes unresponsive (as do other nodes in the cluster), and the file server logs report the following error:
XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
I can resolve this by dropping the cache like so:
echo 3 > /proc/sys/vm/drop_caches
The file server has 16 GB of memory.
I have already read through the following blog post:
https://blog.codecentric.de/en/2017/04/xfs-possible-memory-allocation-deadlock-kmem_alloc/
However, the issue does not appear to be fragmentation: the reported fragmentation is below 2% for the filesystem I am writing to.
So, given the XFS error above, I assume the file server is running out of memory because it cannot keep up with the number of I/O requests issued by the task at hand.
Apart from dropping the cache periodically (e.g. via cron), is there a more permanent solution to this?
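For reference, the periodic workaround could look something like this crontab entry (a stopgap, not a fix; the hourly interval is arbitrary):

```shell
# /etc/cron.d/drop-caches -- flush page cache, dentries and inodes hourly
# on the file server; schedule is an example only
0 * * * *  root  /bin/sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```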
Thanks in advance for the help.
Edit: CentOS 7.2 on client and server.
Edit #2: Kernel 3.10.0-229.14.1.el7.x86_64 on client and server.
Best Answer
It's related to memory fragmentation and filesystem fragmentation, see https://bugzilla.kernel.org/show_bug.cgi?id=73831
You should check your filesystem fragmentation with
xfs_db -r -c frag <device>
Keeping the filesystem not too full (80% or less) and running xfs_fsr
for a while should help, too.
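A sketch of that maintenance, plus a VM knob that is commonly raised to ease high-order kmem_alloc stalls (the paths and the 262144 value are assumptions; size the reserve to your 16 GB server and persist it in /etc/sysctl.conf):

```shell
# Defragment files on the affected XFS filesystem during quiet hours;
# /path/to/mountpoint is a placeholder for your actual mount
xfs_fsr -v /path/to/mountpoint

# Keep a larger reserve of free memory (~256 MB here, an example value)
# so the kernel can satisfy contiguous allocations without stalling
sysctl -w vm.min_free_kbytes=262144
```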