This problem pops up here and there on the Internet, with the prevalent gratuitous advice involving the purchase of new hardware. I ran into this myself, and am posting my experiences for future reference.
I had this problem on two systems, one desktop and one laptop, and suspected it on a third system as well (a server with two drives in RAID1), but I’ll leave that one out of consideration because it has speed issues for other reasons and I haven’t had a proper look at it yet.
The visible symptom is that, from time to time, the machine “freezes” all disk I/O, usually for around 90 seconds, even under low load. This happens approximately twice per hour. For example, the home desktop machine might be playing music in Amarok (and doing nothing else except routine background services), and about twice an hour the music will freeze in the middle of a song; after little more than a minute, playback resumes. The laptop showed the same problem playing video files, for example films: video and audio playback freeze, then resume after the pause. In an effort to find something amusing about all this, I noticed that xine (kaffeine) resumes playback from “ninety seconds later”, as if it had been playing all along, so you need to rewind; mplayer continues right from where it froze. Other symptoms include emails not being written to mailboxes, and problems or delays in saving, moving, or renaming any kind of file.
When opening ksysguard, all processes attempting disk I/O show as “disk sleep” for the entire duration of the freeze. This usually includes the symptom process (amarok, xine, etc.), as well as background processes involved in I/O operations, such as the various jbd2, flush, and kio_file, as pointed out by Linuxmath above.
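If you’d rather check from a terminal than from ksysguard, processes in “disk sleep” show up in ps with state “D” (uninterruptible sleep). A quick way to list them during a freeze:

```shell
# List processes currently in uninterruptible ("disk") sleep.
# Keeps the header line, then any process whose state column starts with D.
ps -eo pid,stat,comm | awk 'NR==1 || $2 ~ /^D/'
```

During one of the freezes described above you would expect to see the symptom process (amarok, xine, …) plus the I/O helpers (jbd2, flush, kio_file) in this list.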
“Disk sleep” (uninterruptible sleep, process state “D” in ps) is generally understood to indicate a disk I/O bottleneck, i.e., the disk can’t keep up with the rate of data supplied to it or requested from it, so the I/O must be frozen at the software end to allow the disk to catch up. In normal everyday use, this is not supposed to cause any problems; “disk sleep” either shouldn’t happen, or should resolve itself in fractions of a second. Also, there is no reasonable explanation for an I/O bottleneck when playing reasonable multimedia formats (FLAC audio, films in xvid / dirac / theora / x264, etc., assuming the resolutions are “sensible”). Any 5400rpm hard disc can handle this… in its sleep, as it were.
When I turned to the Internet, the overwhelming initial advice I got was “dude, get a new disk” (– you buy me one, sucker!). Don’t blame your hardware before you’ve at least done the following, obviously as root:
e2fsck -Dfvy /dev/sda2
smartctl -a /dev/sda
smartctl -t long /dev/sda
badblocks -nv -b 4096 -o /root/20140202.sda2.badblocks /dev/sda2
The first, e2fsck, is a check of your file system (n.b., assuming you’re using ext2/ext3/ext4), run against the partition you’re struggling to read from, in this case partition 2 on drive /dev/sda. You will need to unmount it first. If it’s your root partition, your best bet is to boot a Live USB of openSUSE; then you’re guaranteed your disks have nothing to do. If it’s not your root partition (in my case it was /home), you can use umount, or sometimes it’s easier to comment out (i.e., prefix with #) its line in your filesystem table (/etc/fstab) and reboot. You will need to remove the # to get your files back, of course.
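Putting those pieces together for a non-root partition (here /home on /dev/sda2, as in my case; adjust the device and mount point to your own layout), the sequence looks like this:

```shell
# Run as root. umount will refuse with "target is busy" if anything still has
# files open under /home -- log out of your desktop session first, or boot a
# Live USB instead.
umount /home
e2fsck -Dfvy /dev/sda2   # -D optimize directories, -f force check, -v verbose, -y answer yes
mount /home              # remount once the check completes
```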
The second command, smartctl, is a utility to “talk” to the hard disk’s S.M.A.R.T. capabilities (Self-Monitoring, Analysis, and Reporting Technology). With the “-a” option, smartctl simply spits out all the information it has on the disk in question. Look for anything that looks like a warning; some useful attributes are “197 Current_Pending_Sector”, “198 Offline_Uncorrectable”, “199 UDMA_CRC_Error_Count” and “254 Free_Fall_Sensor”, all of which are bad news if their RAW_VALUE is anything other than zero. The “smartctl -t long” command tells the disk to start an extended self-test (a whole-drive operation, so it addresses /dev/sda rather than a single partition); the test runs in the background, and you can do other things in the meantime. The “smartctl -a” output should tell you how long the extended test will take (typically a few hours), after which a new “smartctl -a” will show its outcome.
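A quick way to pull out just the attributes mentioned above, and later to read the self-test result without wading through the full “-a” dump (the drive name is an example, as before):

```shell
# Run as root. -A prints only the vendor attribute table; grep the worrying ones.
smartctl -A /dev/sda | grep -E 'Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count|Free_Fall_Sensor'
# Once the extended test has had time to finish, the self-test log shows the verdict:
smartctl -l selftest /dev/sda
```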
Smartctl also kindly pointed out that a firmware update was available for my drive. So I went to the manufacturer’s website, and downloaded & installed the new firmware. I’d suggest you check this with the manufacturer, too; smartctl will tell you what make/model disk you have, and which firmware version is installed.
The final command, badblocks, does some of the same things as smartctl, but more disruptively to the user. Meaning: it detects “bad blocks”, storage areas on the disk that have become unusable (a sign that your disk is “dying”). However, badblocks only works while the disk is unmounted (you cannot access your data in the meantime), and for a typical SATA-II 7200rpm disk today you might want to budget 24h for every 500GB you’d like to test, per pass (the example command above does a single pass, but higher degrees of paranoia are available). By the way: the -o option specifies an output file, where badblocks will write a list of any bad blocks found; this can be passed to e2fsck to ensure those blocks are not used for data storage. That only works if your -b option to badblocks matches the block size of your file system; check it with something like: dumpe2fs -h /dev/sda2 | grep "Block size".
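For reference, feeding the list back into e2fsck looks like this (assuming the output file from the badblocks run above, and a matching 4096-byte block size):

```shell
# Confirm the file system's block size first; it must match badblocks' -b option:
dumpe2fs -h /dev/sda2 | grep "Block size"
# Then mark the listed blocks as unusable in the file system (partition unmounted):
e2fsck -l /root/20140202.sda2.badblocks /dev/sda2
```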
My drives merrily passed all these tests. And if yours does too, and does not give you any other reason to doubt it (rattling, etc.), then don’t let anyone tell you to buy a new one.
By the way: if you did find any bad blocks, with either smartctl or badblocks, you should be able to find out which file they belong to (you will have lost at least part of that file), and tell your filesystem to move the file someplace else and not use the same blocks again. Here’s a fantastic article on how to do that. You can happily enough keep on using the same disk; however, bad blocks are usually a sign of old age, and a few bad blocks are a good hint that more will go bad soon. Best to replace the disk in the near future!
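On ext2/3/4 the basic idea is debugfs’s icheck and ncheck commands, which map a block to an inode and an inode to a path. The block number 123456 and inode 7890 below are made-up placeholders; substitute the numbers from your own badblocks or smartctl output:

```shell
# Run as root. Translate a file-system block number into the inode that uses it:
debugfs -R "icheck 123456" /dev/sda2
# Then translate that inode number into a path name:
debugfs -R "ncheck 7890" /dev/sda2
```

Note that smartctl reports LBA sectors, not file-system blocks, so a bit of arithmetic (sector size, partition offset, block size) is needed first; the article covers that conversion.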
After much poking around, I found that the following command fixes the disk sleep freeze issue for me on both machines:
hdparm -va0 /dev/sda
This disables the “read ahead” feature for the disk: the -a flag gets or sets the sector count for filesystem read-ahead (here set to 0), and -v just prints the resulting settings. (Note that this is the kernel-side read-ahead setting; the drive’s own read-lookahead cache is toggled separately, with hdparm’s uppercase -A flag.)
This setting is reset whenever the computer shuts down or restarts, so it is useful to make sure the above command is executed automatically at every boot. In older SysV-based systems, the file to look in was /etc/rc.local; on openSUSE 12.3 it is /etc/init.d/boot-local. Simply add a line at the bottom with your hdparm command.
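Concretely, one way to append it (run once, as root; on other distros the equivalent file is often /etc/rc.local, and the hdparm path may differ):

```shell
# Add the read-ahead fix to the end of the local boot script (openSUSE 12.3):
echo 'hdparm -a0 /dev/sda' >> /etc/init.d/boot-local
```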
How the “read ahead” feature relates to the symptoms described above is beyond me; I am also by no means convinced that I have yet found the root cause of this problem. The above fix worked for two different machines, with different hardware and software configurations. The common denominators are: openSUSE 12.3, KDE desktop environment, and hard disks manufactured by Seagate (but different product lines: one Barracuda and one Momentus).
If anyone else has any experience to share on the same issue, I’d look forward to reading.
— Warning: some rant lives below this line. —
I’d file a bug report, but I’m tired of filing bug reports for the current version of openSUSE only to be told that “it’s been fixed in Factory” and will be available in the next release. I don’t report bugs just to do the SLED/SLES people a favour. I’m also tired of upgrading from a stable openSUSE version to the newest, now 13.1, only to find that it’s horribly buggy and unstable because Attachmate is using openSUSE releases as betas for community testing. I was once one of those people who had the latest openSUSE on my servers, and the next Release Candidate on my own machine, in order to report bugs… only to find that they’d get fixed AFTER final release, leaving me with buggy servers. So, what can I say… I’ll be upgrading to 13.1… later this year.