Spontaneous Reboot Problem

My first post in these forums, so be gentle. :) Hard to say if this is the correct forum for my problem, but here goes…

I have a new system with Asus M3A78-EM mobo, 4 GB RAM, AMD Phenom II quad-core 955 3.2 GHz CPU, on which I’ve installed the 64-bit version of openSUSE 11.2. It’s a lightweight server as well as a desktop machine, and is on 24/7 (and on a UPS). The problem is that the system reboots itself spontaneously, on average a couple of times a day, at seemingly random times. After one of these reboots, an examination of the system log shows nothing at all suspicious logged in the minutes prior to the sudden reboot. Sometimes the system recovers after the reboot, but about half the time the KDE4 desktop comes up with the keyboard and mouse not working. The system log in that case shows that the USB subsystem is wedged, with the message “task khubd:38 blocked for more than 120 seconds”, followed by a series of call traces. At that point, I can usually log in via SSH from another local machine and execute a shutdown/reboot, and the system usually recovers fully after that… until the next spontaneous reboot. The only thing unusual about my USB setup is that I have a Hauppauge HVR-1950 TV tuner plugged in, and the pvrusb2 driver installed on the system.
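For anyone wanting to check their own log for the same wedged-USB symptom, here is a minimal sketch that scans log lines for the hung-task warning quoted above. The function name and the idea of feeding it lines from the system log are illustrative assumptions; where your syslog lives depends on your setup.

```python
import re

# Matches the kernel hung-task warning quoted in the post, e.g.
# "task khubd:38 blocked for more than 120 seconds"
HUNG_TASK = re.compile(r"task (\S+):(\d+) blocked for more than (\d+) seconds")


def find_hung_tasks(lines):
    """Return (task name, pid, seconds) for every hung-task warning found.

    `lines` is any iterable of log lines (an open file works too).
    """
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            name, pid, seconds = match.groups()
            hits.append((name, int(pid), int(seconds)))
    return hits
```

To use it, open your system log (assumed here to be a plain-text file) and pass the file object to find_hung_tasks(); each tuple it returns names a task the kernel reported as blocked.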

The mobo has an onboard ATI Radeon HD 3200 graphics controller, and I installed the proprietary ATI driver for it in the hopes of improving system stability. It didn’t.

Due to the lack of syslog warnings before one of these reboot events, I was inclined to think the hardware was at fault, and my first thought was bad memory. However, fairly extensive testing with memtest86+ showed no errors. Of course, it could be other more obscure hardware problems, but it would be nice to be a little more certain before I resort to replacing the motherboard or the whole system.

Other info… booting in failsafe mode doesn’t seem to cure the problem. Setting acpi=off in the boot menu doesn’t seem to help either. There are some suspicious things in the boot log, though, such as:

[    0.008496] Unpacking initramfs...
[    0.016075] BUG: scheduling while atomic: swapper/0/0x10000002
[    0.016081] Modules linked in:
[    0.016084] Pid: 0, comm: swapper Not tainted 2.6.31.8-0.1-desktop #1
[    0.016086] Call Trace:
[    0.016098]  [<ffffffff81011a19>] try_stack_unwind+0x189/0x1b0

and so on. I see some discussions about this bug on the net, but it’s not clear how serious it might be, and whether it could be relevant to my problem. More from the boot log:

[    8.620456] EDAC amd64_edac:  Ver: 3.2.0 Dec 18 2009
[    8.620659] EDAC amd64: This node reports that Memory ECC is currently disabled.
[    8.620662] EDAC amd64: bit 0x400000 in register F3x44 of the MISC_CONTROL device (0000:00:18.3) should be enabled
[    8.620664] EDAC amd64: WARNING: ECC is NOT currently enabled by the BIOS. Module will NOT be loaded.
[    8.620665]     Either Enable ECC in the BIOS, or use the 'ecc_enable_override' parameter.
[    8.620666]     Might be a BIOS bug, if BIOS says ECC is enabled
[    8.620667]     Use of the override can cause unknown side effects.
[    8.620674] amd64_edac: probe of 0000:00:18.2 failed with error -22

My system does not have ECC memory, so it is indeed properly disabled in my BIOS. Again, I see discussions of this bug(?) on the net, but little indication of whether it is anything serious.

At this point, I’m running out of ideas… I’m still not sure whether the hardware or openSUSE 11.2 are at fault, and I don’t want to give up on either prematurely. Any suggestions as to where to go from here?

ve3jf wrote:
> Due to the lack of syslog warnings before one of these reboot
> events, I was inclined to think the hardware was at fault …
> [snip] … I’m still not sure whether the hardware or openSUSE 11.2
> are at fault, and I don’t want to give up on either prematurely.
> Any suggestions as to where to go from here?

-welcome-

though i’m not sure either, i really do think it is a hardware problem…

maybe a bad/almost good ground or hot lead somewhere…or an almost
loose connector…if it were mine i’d pull the lid off and, one at a
time, unhook and reconnect every connector, except the CPU itself…

and, while it’s open i’d just take a quick look at the capacitors
(any have bowed tops?)…and see if the MB is secure…just wiggle it
a little (NOT with a hammer)…

then i’d put it back together and cross my fingers…if it happens again:

  • i’d install atop and read the directions to set the timer to take a
    snapshot as often as you think you want (i forget the default)…in
    addition to the stuff you see with top, you get some other things
    recorded…good to have…

  • and, i’d install something to check and record cpu/mb/hard drive
    heat…

  • and something to watch the drives’ SMART data…

while you are thinking about all of that you also think back: did you
check that the install media was ok?, like: http://tinyurl.com/yajm2aq

is the system fully updated?

use this:
zypper lr -d
in a terminal to see what repos you have enabled…and, if more than
the four in this post, cut’em back:
http://forums.opensuse.org/new-user-how-faq-read-only/424611-new-users-opensuse-pre-install-general-please-read.html#post2058902

and, if you have not seen that thread before, you might find some very
interesting stuff in there…as well as:
http://forums.opensuse.org/new-user-how-faq-read-only/424615-new-users-suse-11-2-pre-installation-please-read.html

in either of those you might discover a hole you fell in and didn’t
even notice…

come back and let us know how you get on…

–
palladium

So do I. In addition to your suggestions, the OP might want to try another power supply while he/she is tinkering. If one of the power supply rails has gone “soft” (i.e., the voltage drops under load), that can cause this very problem, and it’s a booger to find.

One other possibility, which you touched on: AMDs are good processors – I use them exclusively because of the great bang-per-buck ratio – but they ARE more sensitive to heat, as a general rule, than Intel. Double-check the heatsink and fan. Pull the heatsink, clean off the old grease and put on a thin, fresh layer of heatsink compound, then firmly re-attach.

(Edited: I’m sure you know this, but just to be sure … don’t remove the heatsink with the unit powered up! AMDs will fry in a matter of seconds without a heatsink.)

Thanks for your replies and useful comments! Here’s an update…

I’ve reviewed all the material on installation, etc., and didn’t discover any gotchas. I did check the install media, and it was fine. The system is fully updated, including the recent kernel update.

Good idea to reseat all the connectors and inspect the mobo for bad caps - I’ll do that when the next opportunity comes along. As for lifting the heatsink, um, maybe not… I think the guys who put together this box are competent, and nothing really points to a thermal problem at this point. This morning I set up lm_sensors and have been watching things with gkrellm… normally the cpu temperature sits at 38-39C, and goes up to 41C or so under heavier loads… max spec for that cpu is 62C. System voltages and HD temps seem okay as well. I still need to set up atop or something similar to log some of this stuff to a file, though.
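As a stopgap until atop is set up, something along these lines could log the sensor readings to a file. This is only a sketch under assumptions: it expects the kernel to expose temperatures as temp*_input files (in millidegrees Celsius) directly under /sys/class/hwmon/hwmon*, which is where lm_sensors drivers publish them on many systems; on some kernels the files sit one level deeper, under a device/ subdirectory, so the glob pattern may need adjusting. The function names are made up for illustration.

```python
import glob
import os
import time


def parse_millidegrees(raw):
    """Convert a sysfs temp*_input value (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {sensor file path: degrees C} for every readable temp*_input."""
    temps = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as sensor:
                temps[path] = parse_millidegrees(sensor.read())
        except (OSError, ValueError):
            continue  # sensor vanished or returned something unparseable
    return temps


def log_line(temps, now=None):
    """Format one timestamped line from a read_temps() result."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    readings = " ".join(
        f"{os.path.basename(path)}={celsius:.1f}C"
        for path, celsius in temps.items()
    )
    return f"{stamp} {readings}"
```

Calling log_line(read_temps()) in a loop with a short sleep and appending each line to a file gives a record that survives the reboot, so the last readings before the gap can be inspected afterwards.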

I mentioned previously that I’d tried booting with acpi=off, but it turns out I was mistaken. Since the last reboot, however, I am running with acpi=off, and so far have 26 hours uptime without any funny business. It’s way too soon to declare that the system is more stable yet, but it’s encouraging. I’m not sure what the downside of running with ACPI disabled is, and it’s not a very satisfying solution (if it is one at all), but it beats what I’ve been putting up with so far!
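Since I got this wrong once, here is a small sketch for confirming which parameters the running kernel was actually booted with. On Linux the boot command line is readable from /proc/cmdline; the helper names below are made up for illustration, and the parsing is deliberately simple (it does not handle quoted values containing spaces).

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}.

    Bare flags such as "quiet" map to None.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def acpi_disabled(cmdline):
    """True only if the command line contains acpi=off."""
    return parse_cmdline(cmdline).get("acpi") == "off"
```

Reading /proc/cmdline and passing its contents to acpi_disabled() tells you whether the setting really took effect for the current boot, rather than trusting what was typed at the boot menu.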

I’ll keep y’all posted on further developments…

You may want to clean your computer: not the OS but the actual hardware, like the cooling fan for your power supply unit, and make sure to open ’er up if it’s a desktop.
Dust contamination is an issue whatever the OS, though it’s possible you have a failing PSU.
My old computer did the same thing before she gave in: the PSU blew and the mobo got fried like a Christmas goose.

ve3jf wrote:
> I mentioned previously that I’d tried booting with acpi=off, but it
> turns out I was mistaken. Since the last reboot, however, I am running
> with acpi=off, and so far have 26 hours uptime without any funny
> business. It’s way too soon to declare that the system is more stable
> yet, but it’s encouraging. I’m not sure what the downside of running
> with ACPI disabled is, and it’s not a very satisfying solution (if it is
> one at all), but it beats what I’ve been putting up with so far!

if that =off solves the problem you may wanna fine tune it…

i’m not real clear on the ups and downs of acpi enabled/disabled…
but, i’m sure i (or you) could google up a heap of info if felt
necessary…WAIT! i have a rather long list of kernel parameters living
on my machine at
file:///usr/src/linux/Documentation/kernel-parameters.txt

i don’t recall putting it there (well, i know i didn’t) so i suspect
it was either put there in a default install or by downloading the
kernel headers (with YaST)…see if you have one because mine lists a
LOT (over 20) more options besides just enabled/disabled…

so, you might wanna get your hands on one for YOUR kernel…and, i’d
drag the Asus forum hoping to find a kernel guru to help you sort
through the options and select the parameter perfect for your setup…

–
palladium

Well, it seemed to be too good to be true, and sure enough, it was. :( After 26 hours of uptime, I had yet another spontaneous reboot. As usual, no clues in the system log. So, back to looking for solutions on the hardware side. Something intermittent on the motherboard or connectors? Bad power supply? Could my 4-year-old UPS (APC Back-UPS ES 650) have gone flaky? I dunno, but this is getting to be a pain…

For what it is worth, I have a motherboard with a somewhat flaky power switch system. Normally everything is fine, but if there are power problems in the area, then for several hours (up to 24) afterwards the machine will spontaneously shut down at random. I actually have to remove the power completely and wait a few minutes before it will boot again. I think it is very sensitive to power fluctuations that do not even faze anything else. This only happens after a major power outage. Some noise on the power line causes it, I guess. And yes, I do filter the power.

That’s a possibility. If you haven’t replaced the batteries in it yet, I doubt seriously if it’s doing any good at all. If your power company is prone to giving you brief, almost unnoticeable glitches (as ours is here!), without that UPS, you might indeed reboot.

Just for the record, the power supply in a computer will typically start getting flaky after a few years, too. I don’t know how old your machine is.

The power supply, like the rest of the system, is new, in service for less than a month. It’s a reputable brand (Antec), in a Sonata III case.

I have checked the UPS recently, by pulling the plug, and it kept the system going without any glitches. This was just a quick test, but it seems to show the batteries still have some life. I presume it also indicates that surge suppression is working, but who knows for sure.

Today I’ll open up the box, reseat all the connectors, give it a good inspection, fire it up again, and hope for the best…

Another update…

Before one of my previous spontaneous reboots, I had been running atop (per the suggestion of palladium), logging to a file every 5 seconds. After the reboot, I checked the log, and found nothing at all suspicious - the system was essentially idling, with very little going on and plenty of resources. More evidence that it’s not a software problem. I’ve also been watching the system temperatures closely… I switched from the it87 driver to the newer asus_atk0110 and have been getting more accurate readings, and they show no indication of thermal problems - the CPU temp is always in the 28-31C range, and the mobo and hard drive readings are always 27-28C.
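With samples every 5 seconds, the moment of the reboot shows up as a hole in the record, and the last sample before the hole is the one worth studying. A minimal sketch of that check, assuming the sample times have already been pulled out of the log as epoch seconds (how to extract them from atop is not shown here, and the function name and tolerance value are illustrative):

```python
def find_gaps(timestamps, interval=5.0, tolerance=3.0):
    """Return (last_before, first_after) pairs for every hole in the samples.

    A hole is any spacing between consecutive timestamps larger than
    interval * tolerance seconds; timestamps must be in ascending order.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > interval * tolerance:
            gaps.append((prev, cur))
    return gaps
```

The tolerance factor keeps an occasional late sample on a busy system from being mistaken for a reboot; each returned pair brackets the window in which the machine went down and came back.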

Earlier this week, I opened up the case, and reseated all of the connectors that I could easily get at. I didn’t see anything in there that looked suspicious. It seemed to help, as after the restart I accumulated more than three days of uptime with no problems. That got my hopes up, but earlier today - poof - another spontaneous reboot. Sigh.

I guess the next step is to talk to my dealer and see if they’ll replace the whole system (except for the drives). What a pain…

Thanks again for all your suggestions!