Linux – Copying data over with rsync causes size discrepancies

ext4, linux, opensuse, rsync

I am switching machines and have attached the old hard drive (/dev/sda4) to the new machine.

The old machine had a slightly smaller hard drive (720G), compared to the new one (736G), so I created a slightly larger partition as well.

So, I then ran rsync to copy all the data to the new partition, as shown below:

linux-70e2:/ # time rsync -azprvl /mnt/external-disk/foo /media/sda4/

...
sent 169,237,139,987 bytes  received 24,529 bytes  24,419,185.41 bytes/sec
total size is 190,542,953,489  speedup is 1.13

real    115m30.297s
user    112m13.068s
sys     3m59.996s

The data gets copied without errors.

However, when I do:

du -h -m -s /mnt/external-disk/foo /media/sda4/foo

I get:

162414  /mnt/external-disk/foo
181721  /media/sda4/foo

Could somebody please explain this massive difference? Why am I not getting the same results? This has been driving me nuts for days. There are a few other partitions too, and I'm seeing similar discrepancies on those as well.

Both partitions are ext4.

linux-70e2:/ # mount | grep sda4
/dev/nvme0n1p5 on /media/sda4 type ext4 (rw,relatime,data=ordered)
/dev/sda4 on /mnt/external-disk type ext4 (rw,nosuid,nodev,relatime,data=ordered,uhelper=udisks2)

To my knowledge, there is nothing wrong with either drive; both are SSDs, and one of them is brand new. I've run e2fsck on both of them.

In addition, I ran:

find -L /mnt/external-disk/foo -type l

and this doesn't list any symlinks below the source directory.

This is not my first time using rsync for this kind of task, but I've never run into this issue before. Please advise!

Best Answer

The discrepancy is most likely caused by files being more sparsely populated on the old (source) disk.

Anyway, let's first check that the file and inode counts are the same:

  • issue find <path> | wc -l on both mountpoints. Is the number of files/directories the same?
  • issue df -i. Is the number of used inodes the same?
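The count check above can be sketched end to end. This uses throwaway directories rather than the real mountpoints (substitute your own paths for a real comparison):

```shell
#!/bin/sh
# Compare entry counts between a source tree and its copy.
src=$(mktemp -d)
dst=$(mktemp -d)
touch "$src/a" "$src/b"
mkdir "$src/sub" && touch "$src/sub/c"
cp -R "$src/." "$dst/"
# find lists the directory itself plus everything below it
src_count=$(find "$src" | wc -l)
dst_count=$(find "$dst" | wc -l)
echo "src=$src_count dst=$dst_count"
rm -rf "$src" "$dst"
```

If the two counts differ, the problem is missing or extra files, not sparseness, and the rest of this answer does not apply.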

If the answer to both questions is yes, then the difference can be explained by sparser files on the source disk. But what are sparse files? In short, sparse files are normal files which take up less space on disk than they appear to. This is possible thanks to a feature of (relatively) modern filesystems which, instead of writing all the zeroes in a file to disk, simply set a flag telling the system "this file (or part of it) is full of zeroes, don't make me write them all".

By default, du reports the real space taken by a file, not its apparent size. To show the apparent size, use du --apparent-size (for other options, please see the du manpage).

For a practical example, you can create a sparse file using the command truncate -s 1G test.img. As reported by ls, the newly created file is 1 GB in size, but if you try du -hs test.img, you'll see a very, very small size (possibly even zero!). How is this possible? As stated above, modern filesystems sometimes "lie" to applications, reporting back an allocated size which does not exist in reality. On the other hand, du -hs --apparent-size test.img will print the same size as ls.

As you start writing into a sparse file, the filesystem dynamically allocates the required space. For example, issuing dd if=/etc/services of=test.img conv=notrunc,nocreat will write some real data into the previously all-sparse test.img file. Now, running du -hs test.img will report the ~600 KB actually allocated for data storage.
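The whole sequence can be reproduced in one go. A minimal sketch, assuming GNU coreutils and a filesystem with sparse-file support (ext4, XFS, tmpfs, ...); it writes from /dev/zero instead of /etc/services so the amount written is predictable:

```shell
#!/bin/sh
# Create a 1 GiB sparse file, then write a little real data into it
# and watch the allocated size grow (the apparent size stays 1 GiB).
cd "$(mktemp -d)"
truncate -s 1G test.img
echo "allocated: $(du -k test.img | cut -f1) KiB"                 # near zero
echo "apparent:  $(du -k --apparent-size test.img | cut -f1) KiB" # 1048576
# Writing real bytes forces the filesystem to allocate blocks,
# even though the bytes happen to be zeroes (dd is not sparse-aware):
dd if=/dev/zero of=test.img bs=1K count=64 conv=notrunc,nocreat 2>/dev/null
echo "allocated after write: $(du -k test.img | cut -f1) KiB"
```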

An obvious, but very important, implication is that sparse file support can only optimize zero-filled files (or zero-filled parts of files). The very moment you write to a file, its allocated space begins to grow. This is true even if you write more zeroes to the file, unless the application knows how to handle sparse files (in that case, the application advises the filesystem that it is going to write all zeroes, and the filesystem optimizes accordingly).

What if you want to really preallocate some space? Then you can use fallocate -l 1G test.img. If you execute ls -l test.img; du -hs test.img; du -hs --apparent-size test.img, you'll see that all tools report the very same size, because the file was really, fully allocated by the fallocate call.
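A quick sketch of the fallocate case, assuming util-linux fallocate and a filesystem that supports it:

```shell
#!/bin/sh
# With fallocate, the space is really reserved up front, so the
# allocated size and the apparent size match from the start.
cd "$(mktemp -d)"
fallocate -l 1M alloc.img
echo "allocated: $(du -k alloc.img | cut -f1) KiB"                 # 1024
echo "apparent:  $(du -k --apparent-size alloc.img | cut -f1) KiB" # 1024
```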

In short, it is possible that, during the copy, some files were recreated in a less sparse manner, with sparse sections replaced by "real" zeroes on disk. To preserve sparse files with rsync, you have to use the -S (--sparse) option.
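You can see the effect of -S directly, again using throwaway directories rather than the question's actual mountpoints (assumes rsync is installed):

```shell
#!/bin/sh
# Copy the same sparse file with and without -S and compare the
# space actually allocated at the destination.
src=$(mktemp -d); plain=$(mktemp -d); sparse=$(mktemp -d)
truncate -s 10M "$src/img"
rsync -a  "$src/" "$plain/"   # without -S: the zeroes are written out in full
rsync -aS "$src/" "$sparse/"  # with -S: runs of zeroes are kept sparse
echo "plain:  $(du -k "$plain/img"  | cut -f1) KiB allocated"
echo "sparse: $(du -k "$sparse/img" | cut -f1) KiB allocated"
rm -rf "$src" "$plain" "$sparse"
```

The plain copy allocates the full ~10 MiB, while the -S copy stays nearly empty, which is exactly the kind of growth the question observed.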
