Linux – Read Errors Backing Up To NFS via rsync

hard drive, linux, nfs, rsync

I'm backing up a Linux box to a NAS mounted via NFS. I'm using rsync (as part of a hard-link-based backup scheme). That is, I ssh into machine_being_backed_up and start my rsync command. It backs up files for about an hour or so and then freezes the server, which then needs to be physically rebooted. That is very inconvenient, as the server is in another building across town, so a reboot takes time. The errors at the end are (with actual names anonymized):

rsync: read errors mapping "/home/some/path/file1.gz": Input/output error (5)
rsync: read errors mapping "/home/some/path/file2.gz": Input/output error (5)

This likely indicates that the hard drive on the machine I'm trying to back up has some faulty sectors, correct? Or could the error arise from the NFS connection being too slow, or from choosing the wrong options when mounting my NFS drive (I mount with the rw,soft,intr options)? Is there any way to make these input/output errors just skip or fail those files without freezing the system (so I don't have to go across town to reboot the server)?

Update: I turned on SMART yesterday and ran short and long self-tests, which reported no errors. (I couldn't mention this yesterday: the long test finished around 7 p.m. and the computer crashed around midnight, so I couldn't log in until this morning, when I could reboot it on-site.)

Also, I tried rsync-ing the files in question to a different partition on the same drive and didn't get any errors. I'm now trying to rsync directly to the NAS (rather than mounting the NAS using NFS).

Update (Oct 3rd): I've moved the hard drive to a different machine and it's been ~2 weeks with no errors, whereas in the old machine there were daily errors of this type. I'm guessing motherboard or memory errors in the old machine (I haven't had the time to fully diagnose and pinpoint the problem).

Best Answer

The fact that it physically freezes the machine strongly indicates that this is a symptom of a hardware error. I would not expect bad sectors to cause a machine to hang though, so it may be something less easy to diagnose.

To see if it is the disks that are the problem, try reading the affected files locally (log in via SSH and use cat /home/ > /dev/null). If this works, it does not necessarily mean the disk surface is fine: it could be borderline, sometimes readable and other times not. If you do not already, run SMART monitoring tools and watch for things like the sector remapping count going up - this will indicate the disk surface is not in tip-top shape (a few remapped sectors are not unusual with modern massive drives, but many indicate a serious problem).
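The two checks above can be sketched as a pair of small shell functions. This is only an illustration, not a definitive tool: check_readable and reallocated_count are hypothetical helper names, the second one assumes the usual ATA attribute table printed by smartctl -A (from smartmontools), and /dev/sdX below is a placeholder for the real device.

```shell
# Hypothetical helper: read every regular file under a directory and
# report the ones that cannot be read in full (e.g. Input/output error).
check_readable() {
    find "$1" -type f -exec sh -c '
        for f in "$@"; do
            cat -- "$f" > /dev/null 2>&1 || printf "READ FAILED: %s\n" "$f"
        done
    ' sh {} +
}

# Hypothetical helper: print the raw reallocated-sector count from
# "smartctl -A" output supplied on stdin. Assumes the standard ATA
# attribute table, where the raw value is the 10th column.
reallocated_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}
```

For example, check_readable /home/some/path prints nothing if every file could be read, and smartctl -A /dev/sdX | reallocated_count (run as root) prints the current count, which can be recorded daily to see whether it is rising.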

It could be filesystem corruption, but again I would not expect this to completely hang the machine - or, if it were so bad as to crash the filesystem driver, I would expect a kernel panic message on the console rather than the machine simply stopping. You can use fsck to check this, though first make sure everything you can currently read is backed up, just in case the corruption is so bad that trying to fix it makes things worse (this is rare, but I have seen it happen, especially if you are using an experimental filesystem or a beta release rather than a tried-and-tested version).
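A cautious first pass is a read-only check, which reports problems without attempting any repair. The sketch below assumes an ext2/ext3/ext4 filesystem (so e2fsck applies) and that the device appears under the same name in /proc/mounts; is_mounted and readonly_check are hypothetical helper names, and /dev/sdXN is a placeholder.

```shell
# Hypothetical helper: succeed if the device is listed in the mounts
# table (default /proc/mounts; a second argument overrides the file).
is_mounted() {
    awk -v d="$1" '$1 == d { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Hypothetical helper: run a read-only check. e2fsck -n opens the
# filesystem read-only and answers "no" to every repair question.
# Assumes ext2/3/4; other filesystems have their own checkers.
readonly_check() {
    if is_mounted "$1"; then
        echo "warning: $1 is mounted; results may be unreliable" >&2
    fi
    e2fsck -n "$1"
}
```

Running readonly_check /dev/sdXN as root then shows whether the filesystem reports errors, and an actual repair can be left until the backup is complete.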

Another thing to check with the hardware freezing is that the CPU and RAM are fine. They could be faulty or overheating - not so much that it causes a problem in normal operation, but the extra load imposed by running rsync for some time may push something over the edge. Running a memory test and a CPU "burn in" test may highlight this if it is the problem. Your I/O controller could be a suspect too in the same way, though I'm not sure how you would go about testing that.
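To see whether heat builds up before a freeze, one option is to log temperatures while rsync runs and read the last entries after the reboot. This is a rough sketch that assumes the kernel exposes sensors under /sys/class/thermal, with values in millidegrees Celsius; not every machine does, and print_temps is a hypothetical helper name.

```shell
# Hypothetical helper: print each thermal zone and its temperature in
# whole degrees Celsius. The base directory can be overridden.
print_temps() {
    base=${1:-/sys/class/thermal}
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue      # skip if no sensor is exposed
        milli=$(cat "$zone/temp")            # millidegrees Celsius
        printf '%s %d\n' "$(basename "$zone")" "$((milli / 1000))"
    done
}
```

Calling date and print_temps every 30 seconds or so in a loop, appending to a log file on local disk, leaves a record that survives the hang. It only shows temperature trends; a dedicated memory tester and a CPU stress tool are still needed for the tests themselves.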