Network File Transfers – Why Are Transfers Slow with Multiple Small Files?

ftp, network-share, nfs, samba, sftp

When transferring large amounts (several GB) of data over various file transfer protocols, such as FTP, SFTP, NFS, and Samba, they all suffer from the same issue: many small files drag speeds down to MB/s or even KB/s at times, even over a 10 Gbps link.

However, if I zip, tar, or rar the entire folder before transferring, the network link gets fully saturated.

  • What is it that causes this effect?

  • What can be done to improve the performance of large transfers with many small individual files over a network?

  • Out of the available file transfer protocols, which is best suited for this?

I have full administrative control over the network, so all configuration options are available, such as setting MTU and buffer sizes on network interfaces, or turning off async and encryption in file server configurations, to name a couple of throwaway ideas.

Best Answer

File system metadata. The overhead needed to make files possible is underappreciated by sysadmins, until they try to deal with many small files.

Say you have a million small 4 KB files, decently fast storage with 8 drive spindles, and a 10 Gb link that the array can sometimes saturate with sequential reads. Further assume 100 IOPS per spindle and that each file takes one IO (this is oversimplifying, but it illustrates the point).

$ units "1e6 / (8 * 100 per sec)" "sec"
        * 1250
        / 0.0008

About 21 minutes! Instead, assume the million files are in one archive, and a sequential transfer can saturate the 10 Gb link, with roughly 80% useful throughput due to the payload being wrapped in IP and Ethernet framing.

$ units "(1e6 * 4 * 1024 * 8 bits) / (1e10 bits per second * .8)" "sec"
        * 4.096
        / 0.24414062

4 seconds is quite a bit faster.

If the underlying storage holds the data as many small files, any file transfer protocol will struggle with them. When the array's IOPS are the bottleneck, the file serving protocol on top of it doesn't really help.
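To confirm whether small-file random-read IOPS really are the bottleneck, you can benchmark the array directly rather than guess. A minimal sketch with fio, where the directory, sizes, and job count are placeholders you would adapt to your array:

$ fio --name=smallfile-read --directory=/srv/data --ioengine=libaio \
      --rw=randread --bs=4k --size=1g --numjobs=8 --iodepth=8 \
      --direct=1 --runtime=60 --time_based --group_reporting

If the reported IOPS land anywhere near the back-of-the-envelope numbers above, no amount of protocol tuning will get you to line rate on per-file transfers.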

Fastest is copying one big archive or disk image: mostly sequential IO and the least file system metadata.
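If there is no space to stage an archive on disk, streaming one over SSH is a common compromise. A minimal sketch, with hostnames and paths as placeholders: it removes per-file protocol round trips on the wire, though the source still pays the small-file read IOPS and the destination still writes the files individually.

$ tar -C /srv/data -cf - . | ssh user@remotehost 'tar -C /srv/backup -xf -'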

With file serving protocols, you may not have to copy everything: mount the remote share and access only the files you need. However, accessing directories with a very large number of files, or copying them all, is still slow. (And beware: NFS servers going away unexpectedly can leave clients stuck hanging in IO forever.)
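One hedged mitigation for the hang issue, assuming a Linux client and an export path that is a placeholder: mount with soft and a bounded timeout so IO eventually errors out instead of hanging, with the usual caveat that soft mounts risk data loss on interrupted writes.

$ sudo mount -t nfs -o soft,timeo=100,retrans=3,ro fileserver:/export/data /mnt/data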