Data Backup to local drive causes frequent and random system hangs

I have been using rsync for several years to back up data onto local SATA drives using a docking station. Originally it worked quite well, but over the past year or two it has caused my system to hang any time it runs more than a few minutes, and sometimes after just a few seconds. The backups can be continued after a reboot, each run adding more data until it either hangs again or completes. I thought, as you are probably thinking, that the drives were failing, so I bought two new Hitachi 1TB drives and saw no improvement. This problem is reported in many forums, but so far I have not found a solution; if people have resolved the issue, it appears they did not return to the forum to document it.

A few links:
https://forums.opensuse.org/english/get-technical-help-here/network-internet/454704-rsync-freezes-whole-computer-without-error-messages.html
https://forums.opensuse.org/english/other-forums/news-announcements/tech-news/476148-rsync-opensuse-org-unreliable.html

The bash file:

FileID=MMSSataUpdate$(date +%Y-%m-%d_%H%M) #appending date to filename
begin=$(date +%s)        #starting time, in seconds since the epoch
echo Starting MMS data update for $(date +%D)
#mount /dev/mapper/cr_sdc1
rsync -avu --delete /media/data/* /media/backupSATA/ >> "/var/log/MMS_SataUpdateLogs/$FileID.log"
#umount /dev/mapper/cr_sdc1
end=$(date +%s)        #ending time, in seconds since the epoch
#subtracting HHMM values does not give minutes (1005 - 955 = 50), so use epoch seconds
echo the elapsed time was: $(( (end - begin) / 60 )) minutes

# -a Archive mode    preserve everything
# -v Verbose
# -u Update        Update only - don't overwrite newer files
# --delete        Delete files that don't exist on the sending side.

Pertinent facts.

  • The system drives (backup source) are RAID 1 with a promise 8350 controller. No problems noted outside of backup.
  • The destination drives are SATA drives formatted with ext4 and encrypted, but I have tested without encryption with the same result.
  • The backup destination drive is placed in a docking station; however, I have also connected it directly to the motherboard, with the same result.
  • I have unmounted and tested all drives and RAID pairs with e2fsck and all return clean.
  • SMART shows all drives functional.
  • Five drives are rotated through the backup plan. As a test, I reformatted two with the same result.
  • There is never an error message or an error in any log file. The system just hangs, I reboot, and sda recovers from journals during restart.
  • After reboot, the backup process can be continued, with more progress made each time, until eventually the backup is complete.
  • The glitches are totally random in time and volume of data written to the drive. I keep log files of progress and their sizes are all over the map.
  • I upgraded rsync to the latest version - no change.
  • In addition to rsync, I have used the cp command as root in a gnome terminal with the same result.
  • I recently upgraded 11.3 to 11.4 but the problem persists.
  • Backups can be successfully completed over the network to another computer, albeit prohibitively slowly.
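
Since nothing reaches the logs before a hang, one option is to timestamp each line of rsync output and flush it to disk as it is written, so that the last entry before the freeze survives the reboot. A minimal sketch, assuming a hypothetical helper named log_with_timestamps (it is not part of the script above):

```shell
#!/bin/sh
# Hypothetical helper: prefix every line read from stdin with a timestamp,
# append it to the log file given as $1, and flush after each line.
log_with_timestamps() {
    logfile=$1
    while IFS= read -r line; do
        printf '%s %s\n' "$(date +%Y-%m-%d_%H:%M:%S)" "$line" >> "$logfile"
        # sync flushes all dirty buffers, so the entry is on disk before any hang.
        sync
    done
}
```

It would be used by piping the rsync output into it, for example: rsync -avu --delete /media/data/* /media/backupSATA/ | log_with_timestamps /var/log/MMS_SataUpdateLogs/$FileID.log. Note that calling sync per line slows a large backup and also flushes the backup drive itself, which changes the I/O pattern; that may be informative in its own right, but it is a side effect to keep in mind.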

I am still leaning toward a hardware fault, but I think it would be a mistake to write off the OSS. So, how do I further troubleshoot this problem?

It is interesting that the system hangs even with the docking station eliminated and the SATA drive connected directly.

A few thoughts that come to mind:

  1. Can you confirm that the hang is a complete freeze, with no ability to switch to a virtual terminal (e.g. Ctrl+Alt+F1) and check dmesg, etc.?

  2. Is smartctl able to read through the Promise controller to verify health of the system drives? (Or could you power off and check them from another system?)

  3. Have you done a thorough system memory test? Perhaps bad memory is causing the hang and presenting at various times as memory is used to buffer the operations?

  4. Is mcelog able to read anything that remains in the machine-check log after a warm reboot? (This assumes a 64-bit machine.)

  5. Perhaps kdump and running a debug of the kernel state when the hang occurs might be helpful.
    openSUSE 12.3: Chapter 18. kexec and kdump
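
To make the checks above repeatable after each hang and reboot, they could be gathered into one directory with a small script. This is only a sketch: collect_diagnostics is a hypothetical helper name, the /dev/sd? device pattern is an assumption about how the disks appear, and drives behind the RAID controller may not answer smartctl this way at all. Tools that are not installed are skipped.

```shell
#!/bin/sh
# Hypothetical helper: save kernel messages, SMART data, machine-check
# events and memory information into the directory given as $1.
collect_diagnostics() {
    outdir=$1
    mkdir -p "$outdir" || return 1

    # Kernel ring buffer: look here for ATA resets or controller errors.
    dmesg > "$outdir/dmesg.txt" 2>&1 || true

    # SMART health (-H) and attributes (-A) for each disk, if smartmontools exists.
    # Assumption: disks show up as /dev/sda, /dev/sdb, ...
    if command -v smartctl >/dev/null 2>&1; then
        for disk in /dev/sd?; do
            [ -b "$disk" ] || continue
            smartctl -H -A "$disk" > "$outdir/smart_$(basename "$disk").txt" 2>&1 || true
        done
    fi

    # Machine-check events, if mcelog is installed (x86-64 only).
    if command -v mcelog >/dev/null 2>&1; then
        mcelog > "$outdir/mcelog.txt" 2>&1 || true
    fi

    # Memory summary, for comparison between runs.
    cat /proc/meminfo > "$outdir/meminfo.txt" 2>/dev/null || true
}
```

Run as root right after a reboot, for example collect_diagnostics /root/hang-$(date +%Y-%m-%d_%H%M), and compare the output across several hangs. It does not replace a proper memory test or kdump; it only keeps the evidence from items 1, 2 and 4 in one place.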

LewsTherin