Intermittent I/O performance issues

arogan · May 20, 2011, 11:21pm

I am fairly new to openSUSE/Linux and have been running into a very intermittent I/O issue whose source I can not seem to pinpoint.

Just about any program that does any significant I/O will fall into uninterruptible sleep. For example, here are two runs of dd. The first time it hung in uninterruptible sleep; the second time it was completely fine and performed as expected.

dataproc@ares:/temp> dd if=/dev/zero of=ddfile.big bs=3MB count=1k
1024+0 records in
1024+0 records out
3072000000 bytes (3.1 GB) copied, 113.168 s, 27.1 MB/s
dataproc@ares:/temp> dd if=/dev/zero of=ddfile.big bs=3MB count=1k
1024+0 records in
1024+0 records out
3072000000 bytes (3.1 GB) copied, 13.5521 s, 227 MB/s
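
A note on reading the two dd runs above: without a sync option, the rate dd reports can include data that is still sitting in the page cache, so on a machine with plenty of RAM a fast run may partly measure memory rather than the array. A minimal sketch of a more comparable measurement, assuming GNU dd; the helper name `sync_write_test` and its arguments are hypothetical, not something from this thread:

```shell
# Write COUNT blocks of 1 MiB to FILE and have dd fsync the file
# before it finishes, so the reported rate includes the flush to disk.
# sync_write_test is an illustrative helper, not a standard tool.
sync_write_test() {
    file=$1
    count=${2:-1024}
    dd if=/dev/zero of="$file" bs=1M count="$count" conv=fsync
}
```

For example, `sync_write_test /temp/ddfile.big 3072` would write about 3 GiB, roughly the size used above.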

Monitoring disk I/O with iotop shows normal performance most of the time, then all of a sudden the read/write rate drops from ~60-80 MB/s to 1-2 MB/s.

This happens with just about any program I run. It is not easily repeatable since it seems to happen at random. My limited knowledge led me to believe that it could be a symptom of the old SLED 10 OS that I was running, so I upgraded to openSUSE 11.4 earlier this week, with no change.

I have changed the journaling mode to ordered and tried both the deadline and cfq I/O schedulers, to no avail.
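
For anyone following along, the active scheduler is exposed through sysfs, with the current choice shown in square brackets. A small sketch, assuming the block device is called `sda` (a placeholder; the real device name for these arrays is not given in this thread) and using a hypothetical helper `active_scheduler`:

```shell
# Print the scheduler marked with [brackets] in a sysfs scheduler line,
# e.g. "noop deadline [cfq]" -> "cfq". Reads the line from stdin.
# active_scheduler is an illustrative helper name.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage (the device name sda is an assumption):
#   active_scheduler < /sys/block/sda/queue/scheduler
# Switching at runtime needs root, for example:
#   echo deadline > /sys/block/sda/queue/scheduler
```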

Any help is appreciated.

brunomcl · May 24, 2011, 2:51am

Since you got no reply until now, I’ll venture some guesses:

Could it be related to the file system you’re using? Did you do a fresh install or keep the working partition(s)? What are their filesystems? Do they have size limitations?
Are they single disks or RAID(s)? If RAID, software or hardware? Does the RAID driver have any update? Any bug reports?
Any BIOS settings that affect HD usage? Things like legacy mode, IRQ assignment, etc.
Did you check the disks? Partition magic liveCD is a good tool for that, there are others.

I hope something above may help you. Good luck.

arogan · May 24, 2011, 4:24am

I did a fresh install and reformatted the partition as ext4. The previous partition was reiserfs. I am no expert by any means and just chose the file system type that the OS installer suggested both times. So as far as I know there shouldn’t be any size limitations.
The system comprises two RAID arrays running on the HP P410 RAID controller. The first has 2 x 1 TB drives in a RAID 1 configuration and the second has 4 x 1 TB drives in a RAID 5 configuration. I have installed the latest firmware for the controller. I have not seen any bug reports related to this issue yet.
From what I remember there were no BIOS settings that could cause this, but it has been a little while since I configured the BIOS, so I will check again in the morning.
I have not run any third-party disk checking software; I have only used the RAID controller to monitor the disk status, and they all appear healthy. I will certainly try a live CD as soon as possible.

If you have any other ideas/suggestions or need any more information from me, please don’t hesitate to ask.

Thanks!

brunomcl · May 24, 2011, 7:35am

I’m afraid I’m out of suggestions. RAID is uncharted territory for me, I don’t think it would be an issue, but…

The only thing that occurs to me is that perhaps you’re running out of disk space? Does it happen on both RAID arrays or just one? Long shot, I know.

Perhaps someone more knowledgeable will chime in.

djh-novell · May 24, 2011, 1:37pm

arogan wrote:
> I am fairly new to opensuse/linux and have been running into a very
> intermittent I/O issue that I can not seem to pinpoint the source of.
>
> Just about any program that deals with any significant IO will fall
> into uninterruptible sleep. For example here are two instances of dd.
> The first time it hung in uninterruptible sleep and the second it was
> completely fine and performed as expected.
>
> dataproc@ares:/temp> dd if=/dev/zero of=ddfile.big bs=3MB count=1k
> 1024+0 records in
> 1024+0 records out
> 3072000000 bytes (3.1 GB) copied, 113.168 s, 27.1 MB/s
> dataproc@ares:/temp> dd if=/dev/zero of=ddfile.big bs=3MB count=1k
> 1024+0 records in
> 1024+0 records out
> 3072000000 bytes (3.1 GB) copied, 13.5521 s, 227 MB/s
>
> Monitoring disk I/O with iotop indicates normal performance for most of
> the time then all of a sudden the read/write drops from ~60-80MB/s to
> 1-2MB/s.

When you say uninterruptible sleep, what do you mean? You can’t CTRL-C
it? But can you kill -9 it or not?

What does top say when the problem is occurring? Particularly, what does
it say about system and wait CPU usage?

When the process is hung, find its pid and as root try

strace -p <pid>

Report the output, please.

When the process is hung, can you do any other activity on the device
(ls etc)?

Cheers, Dave

arogan · May 24, 2011, 4:30pm

When the program or command is in uninterruptible sleep, designated by a D in the status column in top, the system is unresponsive until it comes out of this sleep. I can not issue any commands as root or any other user, so I would be unable to run strace. I can not ctrl-c, and kill -9 will not end the process until the process wakes. There is little to no CPU usage in all cases.

Would it be useful at all to run the strace command after the process wakes?
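
The D state described above can also be captured from ps, together with the kernel function the task is waiting in (the wchan column), which sometimes hints at the driver or filesystem involved. A rough sketch, assuming the procps version of `ps`; `d_state_only` is a hypothetical helper name:

```shell
# Keep only rows whose STAT column (2nd field) starts with D,
# i.e. tasks in uninterruptible sleep. Reads ps-style rows from stdin.
# d_state_only is an illustrative helper name.
d_state_only() {
    awk '$2 ~ /^D/'
}

# Usage: snapshot blocked tasks along with their kernel wait channel.
#   ps -eo pid,stat,wchan:32,comm | d_state_only
```

Since new commands can not be started during a hang, something like this would have to be left running in a loop beforehand, writing to a file on a disk that is not affected.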

arogan · May 24, 2011, 5:57pm

I managed to run strace when a process fell into regular sleep while exhibiting the same I/O issue. The results are below:
ares:/home/rogan # strace -p 9693
Process 9693 attached - interrupt to quit
futex(0x647594, FUTEX_WAIT_PRIVATE, 0, NULL

Here is what top shows:
top - 09:58:32 up 28 min, 5 users, load average: 6.83, 4.06, 2.08
Tasks: 166 total, 1 running, 165 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.9%sy, 0.0%ni, 82.6%id, 16.4%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 32238M total, 20495M used, 11742M free, 102M buffers
Swap: 32765M total, 0M used, 32765M free, 17671M cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9693 dataproc 20 0 1291m 1.0g 3928 S 0 3.2 4:55.86 PIMAWrapper
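
One reading of the top output above: a load average near 7 with the CPUs more than 80% idle is consistent with tasks blocked on I/O, because Linux counts tasks in uninterruptible sleep toward the load average. The wa figure can also be sampled directly from /proc/stat. A sketch with a hypothetical helper `iowait_pct`, which takes two aggregate `cpu` lines captured some seconds apart:

```shell
# Percentage of CPU time spent in iowait between two samples of the
# first ("cpu ...") line of /proc/stat. Fields after the label are:
# user nice system idle iowait irq softirq steal ...
# iowait_pct is an illustrative helper name.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0
          for (i = 2; i <= 9; i++) total += $i
          t[NR] = total; w[NR] = $6 }
        END { d = t[2] - t[1]
              if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
              else print "0.0" }'
}

# Usage:
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   iowait_pct "$a" "$b"
```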

djh-novell · May 24, 2011, 6:17pm

arogan wrote:
> When the program or command is in uninterruptible sleep, designated by
> a D in the status column in top, the system is unresponsive until it
> comes out of this sleep. I can not issue any commands as root or any
> other user

Ah, sorry, I hadn’t realized you were using uninterruptible sleep with a
defined meaning; it’s not a term I remember coming across before.

But it seems like you’re saying that it’s not the process that is hung
but the whole system? And yet you’re able to use top and iotop? I’m
confused.

> so I would be unable to run strace. I can not ctrl-c and kill -9 will
> not end the process until the process wakes. There is little
> to no CPU usage in all cases.
>
> Would it be useful at all to run the strace command after the process
> wakes?

Probably not.

From a little googling, this sounds like either a hardware problem or a
kernel (driver) bug or both. Is there anything in the logs?
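
As a starting point for the log check: kernels built with the hung-task detector print a "blocked for more than N seconds" message when a task stays in D state for a long time, and disk errors usually mention "I/O error". A sketch that filters log text for those strings; `suspicious_lines` is a hypothetical helper, and including `cciss` and `hpsa` (the HP Smart Array driver names) is an assumption about which driver this controller uses here:

```shell
# Print log lines that mention hung tasks, I/O errors, or the HP
# Smart Array drivers. Reads log text from stdin.
# The pattern list is a guess, not a complete set of symptoms.
suspicious_lines() {
    grep -i -E 'blocked for more than|I/O error|cciss|hpsa'
}

# Usage:
#   dmesg | suspicious_lines
#   suspicious_lines < /var/log/messages
```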

arogan · May 24, 2011, 8:38pm

Yeah. I just have top and iotop running in a separate terminal before I issue any commands that might cause the D status. During that time I am simply monitoring their output for the moment that the process switches to D status. Both top and iotop still refresh during these system hangs, but I can not issue any commands or even exit out of top until the process wakes again.

I have run a system diagnostic using HP’s SmartStart CD and everything there indicated that all the hardware was functioning normally. I checked /var/log/messages and there doesn’t seem to be anything out of the ordinary.

djh-novell · May 25, 2011, 12:17pm

arogan wrote:
> Yeah. I just have top and iotop running in a separate terminal before I
> issue any commands that might cause the D status. During that time I
> simply am monitoring their output for the moment that the process
> switches to D status. Both top and iotop still refresh during these
> system hangs but I can not issue any commands or even exit out of top
> until the process wakes again.
>
> I have run a system diagnostic using HP’s SmartStart CD and everything
> there indicated that all the hardware was functioning normally. I
> checked /var/log/messages and there doesn’t seem to be anything
> out of the ordinary.

Not sure I can think of much else. It might be worth trying an ssh
session into the machine to see if that retains control when the problem
occurs. Or possibly CTRL-ALT-F1 might still work to give you a text
console session.

The strace output doesn’t tell me much, but somebody else might know
more. It might be worth trying the opensuse mailing list or even the
opensuse-kernel list. It seems that either you need to find somebody who
recognizes the symptoms or else you will need to find some way to make
it more repeatable.
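
On making it more repeatable: one low-effort approach is to run the same write in a loop and log how long each pass takes, so slow passes can be matched against timestamps in the logs. A sketch under assumptions: `write_loop` is a hypothetical helper, sizes are in MiB, GNU dd is assumed, and timing is in whole seconds only.

```shell
# Run RUNS synced writes of SIZE_MB MiB to FILE, printing one line per
# pass: start time (epoch seconds) and elapsed seconds.
# write_loop is an illustrative helper name.
write_loop() {
    file=$1; size_mb=$2; runs=$3
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s)
        dd if=/dev/zero of="$file" bs=1M count="$size_mb" conv=fsync 2>/dev/null
        end=$(date +%s)
        echo "$start $((end - start))"
        i=$((i + 1))
    done
}

# Usage: write_loop /temp/ddfile.big 3072 20 >> write_times.log
```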

FWIW, I have an Ubuntu box that sometimes apparently hangs for a minute
or so. I suspect that it spins its disk down when idle and I have to
wait when I first do something subsequently. But I haven’t got around to
investigating yet. Doesn’t match your symptoms though, so probably
something different.

arogan · May 25, 2011, 4:11pm

Well thanks for trying anyway.