Race Condition / Lockup 13.1

@Forum Mods: Not sure where to start this, please move if needed; forum email will keep me up-to-date. Thx.

Hello! I am using openSUSE 13.1, kernel: Linux 3.11.10-21-desktop #1 SMP PREEMPT x86_64 x86_64 x86_64 GNU/Linux, with the KDE 4.11.5 desktop. I have been using it for about 6 to 8 months on an IBM M78 tower with good to excellent results. UEFI is disabled in the BIOS and has not been an issue. In the past 2 months or so, every 10-14 days I have had to force a cold / hard boot, killing the running system by pressing the power button! Fortunately, openSUSE is so well built that I’ve not lost any data. I have to force the boot because the system locks up in what I would (probably?) call a ‘race’ condition: the disk access is fast and continuous, and takes priority over everything while denying access by any means. That is, Ctrl-Alt-Del does not work, Ctrl-Alt-F(x) will not give me any tty with which to view or kill the process, etc. So I still don’t know exactly which process or application is causing this issue. If anyone has suggestions or advice, please tell me what log(s) or other info is needed and I will post it. Thanks for listening!
—rob

Oh no… It may or may not be a hard disk failure. If a sector becomes damaged, it can cause continuous retries until the checksum matches. Run smartctl to see if there is any reported damage. Run man smartctl for detailed information on its use.
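
As a rough sketch of that check (assuming the disk is /dev/sda, smartmontools is installed, and you are root; the helper name smart_bad_sectors is made up for illustration), the attribute table from smartctl -A can be filtered down to the counters that usually point at damaged sectors:

```shell
# Hypothetical helper: reads the attribute table printed by `smartctl -A`
# on stdin and prints the name and raw value (10th column) of the
# counters that usually indicate damaged sectors.
smart_bad_sectors() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ { print $2, $10 }'
}

# Usage (as root; /dev/sda is an assumption, check with lsblk):
#   smartctl -H /dev/sda                      # overall health verdict
#   smartctl -A /dev/sda | smart_bad_sectors  # non-zero raw values are suspicious
#   smartctl -t long /dev/sda                 # start an extended self-test
#   smartctl -l selftest /dev/sda             # read the result once it finishes
```

Raw values of zero for these counters do not prove the disk is healthy, but non-zero values are a strong hint.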

Another thing that can cause a lockup is a video driver problem, but then you generally don’t see continuous disk activity. A reboot will force a fsck to be run, which can fix simple file system problems. What file system are you using?

Thanks for the timely response. I should have mentioned: the hard disk is a Seagate ST500DM002-1BD142 (500 GB, SATA), and merely 4-5 months new. Also, no problems with anything related to video that I can see. The file system is ext4 for all partitions. I ran the short (2x) and long (3x-4x) options of smartctl and no problems were reported. I am not seeing any other problems with disk read/write, video, sound, nothing. I did some reading/googling, and I’m thinking of looking at the ‘magic SysRq’ key options so as to be able to (maybe) interrupt the process and find out exactly which process or program is causing the issue. Again, thx for the quick response. All advice appreciated. Take Care.
----rob
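
For reference, a minimal sketch of checking whether the magic SysRq functions are permitted before relying on them. It assumes the bitmask values given in the kernel's SysRq documentation (8 for process dumps, 64 for signalling processes); the helper name sysrq_allows is hypothetical.

```shell
# Hypothetical helper: succeeds if the SysRq bitmask in $1 permits the
# function group whose bit value is $2. A mask of 1 means "everything".
sysrq_allows() {
    mask="$1"
    bit="$2"
    [ "$mask" -eq 1 ] || [ $(( mask & bit )) -ne 0 ]
}

# Usage:
#   mask=$(cat /proc/sys/kernel/sysrq)
#   sysrq_allows "$mask" 8  && echo "Alt+SysRq+w (dump blocked tasks) is permitted"
#   sysrq_allows "$mask" 64 && echo "Alt+SysRq+f (invoke the OOM killer) is permitted"
# As root, to permit everything until the next reboot:
#   echo 1 > /proc/sys/kernel/sysrq
```

The task dumps go to the kernel log, so they can be read afterwards with dmesg or in /var/log/messages, provided the log could still be written.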

Run top in a console window and see what is eating CPU, but note that if it is a kernel process it will not show in top. Rereading a bad sector is an internal kernel process, and it will lock things up until the sector read matches the checksum, which could be forever. I used to have that problem in 10.1. It could be video, but that generally does not cause disk reads.

Thanks again. However, I cannot run top or anything else when this process takes over. I only notice the hard disk activity from the front panel access light, and I cannot get into another console. I’ve thoroughly checked the disk for errors (and memory also) since I purchased it a few months ago. Video does not give me any problems whatever, although this configuration/CPU type uses shared, not dedicated, video memory. I see nothing in the cron files; they (even using sudo) are empty! Poking around in the forums/wiki I found the following: https://en.opensuse.org/Cron_replace This info is a little scary; I’m not a dev, so I am unqualified to interpret or fully understand it. Also, a problem with this page (and some others?) is that no date/revision of the info is present! But I did: rpm -qa | grep and see that on my system I have: cronie-1.4.8-50.1.2.x86_64 and cron-4.2-50.1.2.x86_64. The Cron_replace page mentions openSUSE 11.2, and the cron/cronie version numbers there are very close, so there has only been small incremental development on these two for some time now. I find this situation really intriguing. Once, and only once, I waited about 1 hour+ and the process did stop, but I was unable to discover exactly what process took control for so long.
In any event, if you think of something further, please kindly advise and I will pursue it to the best of my limited ability. Again, thank you.
—rob

On 2014-12-21 00:06, robhwill wrote:
>
> gogalthorp;2684034 Wrote:
>> Run top in a console window and see what is eating CPU but note that if
>> it is a kernel process it will not show in top.

Also “iotop -o” on another, and “dmesg --follow” on another.

> Thanks again. However I cannot run top or anything else when this
> process takes over.

No, have them open in advance. For weeks if need be, and if possible,
also via another computer with ssh.

Another possible case is a runaway process eating memory, causing swap
to be used as fast as the disk can cope. The result is a machine
that can take a minute to respond to a key, but in the end, it responds.
If swap space is exhausted, the system gives up and dies.

What I do in those cases is press ctrl-alt-f1, and if I have another
computer available, I try to ssh in. Then I run “top”. Sometimes the
culprit turns out to be using a lot of cpu, other times a lot of
“res” memory.
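
A small sketch of the memory side of that check, for use once you are
on a console or ssh session. The helper name swap_used_pct is made up
for illustration, and the ps options assume the procps version of ps.

```shell
# Hypothetical helper: reads /proc/meminfo-style text on stdin and prints
# the percentage of swap in use as a whole number (0 if there is no swap).
swap_used_pct() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { if (total > 0) printf "%d\n", (total - free) * 100 / total; else print 0 }'
}

# Usage, e.g. from the ssh session:
#   swap_used_pct < /proc/meminfo
#   ps -eo pid,rss,comm --sort=-rss | head -n 6   # biggest resident-memory users
```

A swap percentage that keeps climbing, together with one process at the
top of the rss list, would point at the runaway-memory case.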


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)

Thanks so much to you both. I have iotop -o and dmesg --follow running on the local (problem) computer; will do ssh from another system also. What I’m still not clear on, however, is whether either command (iotop -o, or dmesg --follow) will allow me to log the process(es) that grab the cpu/hard disk, or alternatively allow me to actually interrupt the process. Will continue monitoring as advised and update when/if I experience another event. It usually takes about 10 to 14 days to occur. Must learn to play detective! : ) Again, thanks and Enjoy the Holidays and Family.
----robert

On 2014-12-23 02:36, robhwill wrote:

> Thanks so much to you both. Have iotop -o and dmesg --follow on local
> (problem computer); will do ssh from another system also. What I’m
> still not clear on however, is whether either command (iotop -o , or
> dmesg --follow) will allow me to log the process(es) that grab the
> cpu/hard disk,

No, but you could find another command line option to log the output of
iotop to a file at intervals. What iotop does is print what programs are
using the disks, and how much.
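
One possible way to do that logging, as a sketch only: iotop has a
batch mode (-b) that prints plain text, -t adds a timestamp to each
line and -d sets the interval in seconds. The function name and the
default log path below are made up; iotop normally needs root.

```shell
# Hypothetical helper: append timestamped iotop output to a log file.
#   $1 = seconds between samples (default 10)
#   $2 = log file (default ~/iotop.log, an arbitrary choice)
log_disk_io() {
    interval="${1:-10}"
    logfile="${2:-$HOME/iotop.log}"
    # -o: only processes actually doing I/O, -b: batch (non-interactive)
    iotop -o -b -t -d "$interval" >> "$logfile"
}

# Usage (as root, left running in its own terminal):
#   log_disk_io 10 /root/iotop.log
```

After a forced reboot, the last lines of that file show what was
hitting the disk just before the lockup, assuming the log itself could
still be written.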

There are other tools to log process activity, though. Look at “ac” and
“sa”. They need the “acct” service running.

dmesg just prints the log messages to the screen, and they are already
saved on disk (/var/log/messages)

top displays the processes running on the system, sorted on whatever you
want, like memory or cpu usage.

The idea is to be able to have a look at those terminals when the
problem occurs and perhaps find out why.

> or in alternative allow me to actually interrupt the
> process.

top, yes. The others, no.

Press ‘k’ on top, and it will prompt you for the PID of the process to
kill (which you see on its display), and what signal (the default value
is perfect). If the process refuses to die, use signal 9 instead.
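
The same two-step approach can be sketched from a plain shell, which
helps when only an ssh session is available. The function name
stop_pid and the 5 second grace period are assumptions for
illustration.

```shell
# Hypothetical helper: ask a process to terminate (SIGTERM, the same
# default top offers), wait a little, then force it with signal 9.
#   $1 = PID, $2 = grace period in seconds (default 5)
stop_pid() {
    pid="$1"
    kill -TERM "$pid" 2>/dev/null || return 0   # nothing to do: already gone
    sleep "${2:-5}"
    if kill -0 "$pid" 2>/dev/null; then         # still there after the grace period?
        # Ignore the race where it exits between the check and the kill.
        kill -KILL "$pid" 2>/dev/null || true
    fi
}

# Usage:
#   stop_pid 12345        # 12345 stands for the PID shown by top
```

A process stuck in uninterruptible disk wait (state D in top) will not
react even to signal 9 until the I/O completes.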

> Will continue monitoring as advised and update when/if
> experience another event. Usually takes about 10 to 14 days to occur.
> Must learn to play detective! : )

Yes, now and then we have to. At least, even if they are not beautiful,
we have tools and we don’t need to pay for them… :-)

Another possibility is that the graphical system freezes while the
system is otherwise working. In that case, you can use ssh from
another machine to kill graphics (init 3).
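
As a sketch, with a made-up host name and function name, and assuming
root logins over ssh are allowed on the problem machine:

```shell
# Hypothetical helper: switch a remote machine to text mode (runlevel 3),
# which stops the graphical session. $1 = user@host of the frozen machine.
drop_to_text_mode() {
    ssh "$1" 'init 3'
}

# Usage:
#   drop_to_text_mode root@problem-box    # "problem-box" is a placeholder
#   ssh root@problem-box 'init 5'         # bring the graphical login back
```

If ssh itself no longer answers, the freeze is not limited to the
graphical system.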

> Again, thanks and Enjoy the Holidays
> and Family.

Welcome.


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)

Hello again, and special thanks to <robin_listas> and <gogalthorp> for kind advice. I’m marking this issue [SOLVED] as it has now
been 3 weeks since last incident. See here:

top - 20:37:47 up 21 days,  6:40,  3 users,  load average: 0.04, 0.09, 0.19
Tasks: 206 total,   2 running, 204 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.1 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   3159172 total,  2600352 used,   558820 free,    91864 buffers
KiB Swap:  8400892 total,    35484 used,  8365408 free.  1510284 cached Mem
However, I must caution that in the meantime I upgraded with a 'clean' install to 13.2, and there have been no such problems since! So, although the issue is probably gone, I technically did not solve it, since I never learned what caused the problem in the first place. But I did learn a bit of detective technique using the tools suggested. I never did resolve it (no such issue on the bug list), so I suspect some kernel issue, possibly related to my specific hardware (as suggested, maybe GUI interference; my video/graphics memory is shared from main memory, always a potential 'gotcha'). I additionally always stay away from indexing-type programs for this very reason; they never seem to stay out of my way.
So again, thank you both for your time and knowledge. Take Care!!
---rob