My first post in these forums, so be gentle. Hard to say if this is the correct forum for my problem, but here goes…
I have a new system with Asus M3A78-EM mobo, 4 GB RAM, AMD Phenom II quad-core 955 3.2 GHz CPU, on which I’ve installed the 64-bit version of openSUSE 11.2. It’s a lightweight server as well as a desktop machine, and is on 24/7 (and on a UPS). The problem is that the system reboots itself spontaneously, on average a couple of times a day, at seemingly random times. After one of these reboots, an examination of the system log shows nothing at all suspicious logged in the minutes prior to the sudden reboot. Sometimes the system recovers after the reboot, but about half the time the KDE4 desktop comes up with the keyboard and mouse not working. The system log in that case shows that the USB subsystem is wedged, with the message “task khubd:38 blocked for more than 120 seconds”, followed by a series of call traces. At that point, I can usually log in via SSH from another local machine and execute a shutdown/reboot, and the system usually recovers fully after that… until the next spontaneous reboot. The only thing unusual about my USB setup is that I have a Hauppauge HVR-1950 TV tuner plugged in, and the pvrusb2 driver installed on the system.
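For anyone wanting to check their own logs for the same symptom, a minimal Python sketch along these lines could pull out the hung-task warnings. The function name find_hung_tasks is made up for illustration, and the log location (/var/log/messages here) is an assumption that may differ on other setups.

```python
import re

# Matches the kernel hung-task warning quoted above, e.g.
# "task khubd:38 blocked for more than 120 seconds"
HUNG_TASK = re.compile(r"task (\S+):(\d+) blocked for more than (\d+) seconds")


def find_hung_tasks(lines):
    """Return (task_name, pid, seconds) for every hung-task warning in lines.

    `lines` is any iterable of log lines, such as an open file object.
    """
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return hits
```

It could be run as find_hung_tasks(open("/var/log/messages", errors="replace")), assuming that is where the system log lives and that it is readable by the user running the script.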
The mobo has an onboard ATI Radeon HD 3200 graphics controller, and I installed the proprietary ATI driver for it in the hope of improving system stability. It didn’t.
Given the lack of syslog warnings before any of these reboot events, I was inclined to think the hardware was at fault, and my first thought was bad memory. However, fairly extensive testing with memtest86+ showed no errors. Of course, it could be some other, more obscure hardware problem, but it would be nice to be a little more certain before I resort to replacing the motherboard or the whole system.
Other info… booting in failsafe mode doesn’t seem to cure the problem. Setting acpi=off in the boot menu doesn’t seem to help either. There are some suspicious things in the boot log, though, such as:
[    0.008496] Unpacking initramfs...
[    0.016075] BUG: scheduling while atomic: swapper/0/0x10000002
[    0.016081] Modules linked in:
[    0.016084] Pid: 0, comm: swapper Not tainted 2.6.31.8-0.1-desktop #1
[    0.016086] Call Trace:
[    0.016098]  [<ffffffff81011a19>] try_stack_unwind+0x189/0x1b0
and so on. I see some discussions about this bug on the net, but it’s not clear how serious it might be, and whether it could be relevant to my problem. More from the boot log:
[    8.620456] EDAC amd64_edac: Ver: 3.2.0 Dec 18 2009
[    8.620659] EDAC amd64: This node reports that Memory ECC is currently disabled.
[    8.620662] EDAC amd64: bit 0x400000 in register F3x44 of the MISC_CONTROL device (0000:00:18.3) should be enabled
[    8.620664] EDAC amd64: WARNING: ECC is NOT currently enabled by the BIOS. Module will NOT be loaded.
[    8.620665] Either Enable ECC in the BIOS, or use the 'ecc_enable_override' parameter.
[    8.620666] Might be a BIOS bug, if BIOS says ECC is enabled
[    8.620667] Use of the override can cause unknown side effects.
[    8.620674] amd64_edac: probe of 0000:00:18.2 failed with error -22
My system does not have ECC memory, so it is indeed properly disabled in my BIOS. Again, I see discussions of this bug(?) on the net, but little indication of whether it is anything serious.
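To avoid reading the whole boot log by eye, a rough Python sketch like the following could list just the timestamped lines that look suspicious. The keyword list and the name suspicious_lines are my own choices for illustration, not anything standard, and the opening bracket of the timestamp is treated as optional because pasted logs sometimes lose it.

```python
import re

# Kernel timestamp such as "[    0.016075]"; the leading "[" is optional
# because some pasted copies of the log drop it.
STAMPED = re.compile(r"^\s*\[?\s*(\d+\.\d+)\]\s*(.*)$")

# Assumption: these substrings are a reasonable first filter, nothing more.
KEYWORDS = ("BUG:", "WARNING", "failed with error")


def suspicious_lines(lines):
    """Return (timestamp_seconds, message) for lines containing a keyword."""
    found = []
    for line in lines:
        match = STAMPED.match(line)
        if not match:
            continue  # not a timestamped kernel line
        message = match.group(2)
        if any(keyword in message for keyword in KEYWORDS):
            found.append((float(match.group(1)), message))
    return found
```

Fed the excerpts above, this would pick out the "scheduling while atomic" line, the EDAC warning, and the failed amd64_edac probe, which at least makes it easier to compare one boot against another.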
At this point, I’m running out of ideas… I’m still not sure whether the hardware or openSUSE 11.2 is at fault, and I don’t want to give up on either prematurely. Any suggestions as to where to go from here?