Random and unexplained crashes

We have been suffering from frequent (often 2 - 3 times a day), apparently random and unexplained crashes for some time on one of our desktop machines.
Frequently the machine would crash without leaving any messages in /var/log/messages or any hints as to the cause. Other times the crash was less sudden and dumps such as the following would ensue:

Jul 23 08:52:19 bijvoet kernel: [234721.562898] general protection fault: 0000 [#1] PREEMPT SMP
Jul 23 08:52:19 bijvoet kernel: [234721.562908] CPU 3
Jul 23 08:52:19 bijvoet kernel: [234721.562911] Modules linked in: autofs4 snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device jc42 coretemp edd vboxpci vboxnetadp vboxnetflt vboxdrv
nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet cpufreq_conservative cpufreq_userspace microcode cpufreq_powersave acpi_cpufreq mperf snd_hda_codec_hdmi sr_mod cdrom joydev
sg snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep nvidia(P) snd_pcm i2c_i801 tpm_tis firewire_ohci firewire_core crc_itu_t i7core_edac tpm iTCO_wdt iTCO_vendor_support
pcspkr edac_core ioatdma tpm_bios xhci_hcd snd_timer snd soundcore igb dca snd_page_alloc button dm_mod linear processor thermal_sys ata_generic
Jul 23 08:52:19 bijvoet kernel: [234721.562993]
Jul 23 08:52:19 bijvoet kernel: [234721.562998] Pid: 4659, comm: kworker/3:1 Tainted: P 3.1.10-1.16-desktop #1 Intel Corporation S5520SC/S5520SC
Jul 23 08:52:19 bijvoet kernel: [234721.563009] RIP: 0010:[<ffffffff8113f7bd>] [<ffffffff8113f7bd>] free_block+0xcd/0x180
Jul 23 08:52:19 bijvoet kernel: [234721.563023] RSP: 0018:ffff88065936fd60 EFLAGS: 00010002
Jul 23 08:52:19 bijvoet kernel: [234721.563028] RAX: ffff88065b5e7d40 RBX: ffff88035b5736c0 RCX: ffff880655bfa000
Jul 23 08:52:19 bijvoet kernel: [234721.563034] RDX: ffff8803bee590c0 RSI: ffff8803bee59000 RDI: 0000e10000e50000
Jul 23 08:52:19 bijvoet kernel: [234721.563041] RBP: ffff88065b70c018 R08: 0000000000000001 R09: 0000000000000000
Jul 23 08:52:19 bijvoet kernel: [234721.563047] R10: 000000000000001b R11: dead000000100100 R12: 0000000000000018
Jul 23 08:52:19 bijvoet kernel: [234721.563053] R13: 0000000000000000 R14: ffffea0000000000 R15: 0000000000000008
Jul 23 08:52:19 bijvoet kernel: [234721.563060] FS: 0000000000000000(0000) GS:ffff88066fc20000(0000) knlGS:0000000000000000
Jul 23 08:52:19 bijvoet kernel: [234721.563067] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul 23 08:52:19 bijvoet kernel: [234721.563072] CR2: 00007fc53d648000 CR3: 0000000001c05000 CR4: 00000000000006e0
Jul 23 08:52:19 bijvoet kernel: [234721.563079] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 23 08:52:19 bijvoet kernel: [234721.563085] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 23 08:52:19 bijvoet kernel: [234721.563091] Process kworker/3:1 (pid: 4659, threadinfo ffff88065936e000, task ffff880659d30840)
Jul 23 08:52:19 bijvoet kernel: [234721.563098] Stack:
Jul 23 08:52:19 bijvoet kernel: [234721.563100] 000000000000fc40 ffff8803bee590c0 0000000000000003 ffff88065b70c000
Jul 23 08:52:19 bijvoet kernel: [234721.563111] ffff88035b5736c0 0000000000000018 ffff88065b5e7d80 ffff88065b70c018
Jul 23 08:52:19 bijvoet kernel: [234721.563121] 0000000000000001 ffffffff8113fa73 ffff88066f826780 ffff88065b5e7d40
Jul 23 08:52:19 bijvoet kernel: [234721.563131] Call Trace:
Jul 23 08:52:19 bijvoet kernel: [234721.563144] [<ffffffff8113fa73>] drain_array.part.42+0x83/0xe0
Jul 23 08:52:19 bijvoet kernel: [234721.563153] [<ffffffff8113fdff>] cache_reap+0x6f/0x220
Jul 23 08:52:19 bijvoet kernel: [234721.563165] [<ffffffff81071631>] process_one_work+0x111/0x4d0
Jul 23 08:52:19 bijvoet kernel: [234721.563174] [<ffffffff81071db2>] worker_thread+0x152/0x340
Jul 23 08:52:19 bijvoet kernel: [234721.563184] [<ffffffff81075e8e>] kthread+0x7e/0x90
Jul 23 08:52:19 bijvoet kernel: [234721.563195] [<ffffffff815a9534>] kernel_thread_helper+0x4/0x10
Jul 23 08:52:19 bijvoet kernel: [234721.563203] Code: 8b 08 81 e1 80 00 00 00 0f 84 c0 00 00 00 48 8b 70 28 48 8b 43 68 49 bb 00 01 10 00 00 00 ad de 48 8b 3e 48 8b 4e 08 4a 8b 04 38
Jul 23 08:52:19 bijvoet kernel: [234721.563251] RIP [<ffffffff8113f7bd>] free_block+0xcd/0x180
Jul 23 08:52:19 bijvoet kernel: [234721.563257] RSP <ffff88065936fd60>
Jul 23 08:52:19 bijvoet kernel: [234721.563263] ---[ end trace 2c9a843aa76ab842 ]---
Jul 23 08:52:19 bijvoet kernel: [234721.565006] note: kworker/3:1[4659] exited with preempt_count 1
Jul 23 08:52:19 bijvoet kernel: [234721.565055] BUG: unable to handle kernel paging request at fffffffffffffff8
Jul 23 08:52:19 bijvoet kernel: [234721.565067] IP: [<ffffffff810761b7>] kthread_data+0x7/0x10
Jul 23 08:52:19 bijvoet kernel: [234721.565079] PGD 1c07067 PUD 1c08067 PMD 0
Jul 23 08:52:19 bijvoet kernel: [234721.565090] Oops: 0000 [#2] PREEMPT SMP
Jul 23 08:52:19 bijvoet kernel: [234721.565099] CPU 3
Jul 23 08:52:19 bijvoet kernel: [234721.565103] Modules linked in: autofs4 snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device jc42 coretemp edd vboxpci vboxnetadp vboxnetflt vboxdrv
nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet cpufreq_conservative cpufreq_userspace microcode cpufreq_powersave acpi_cpufreq mperf snd_hda_codec_hdmi sr_mod cdrom joydev
sg snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep nvidia(P) snd_pcm i2c_i801 tpm_tis firewire_ohci firewire_core crc_itu_t i7core_edac tpm iTCO_wdt iTCO_vendor_support
pcspkr edac_core ioatdma tpm_bios xhci_hcd snd_timer snd soundcore igb dca snd_page_alloc button dm_mod linear processor thermal_sys ata_generic
Jul 23 08:52:19 bijvoet kernel: [234721.565236]
Jul 23 08:52:19 bijvoet kernel: [234721.565241] Pid: 4659, comm: kworker/3:1 Tainted: P D 3.1.10-1.16-desktop #1 Intel Corporation S5520SC/S5520SC
Jul 23 08:52:19 bijvoet kernel: [234721.565259] RIP: 0010:[<ffffffff810761b7>] [<ffffffff810761b7>] kthread_data+0x7/0x10
Jul 23 08:52:19 bijvoet kernel: [234721.565276] RSP: 0018:ffff88065936fba0 EFLAGS: 00010002
Jul 23 08:52:19 bijvoet kernel: [234721.565286] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000003
Jul 23 08:52:19 bijvoet kernel: [234721.565297] RDX: 0000000000000003 RSI: 0000000000000003 RDI: ffff880659d30840
Jul 23 08:52:19 bijvoet kernel: [234721.565309] RBP: ffff88065936fc28 R08: 0000000000989680 R09: ffff880659e52368
Jul 23 08:52:19 bijvoet kernel: [234721.565322] R10: 0000000000000400 R11: ffff880659e52358 R12: 0000000000000003
Jul 23 08:52:19 bijvoet kernel: [234721.565334] R13: ffff880659d30c50 R14: ffffea0000000000 R15: 0000000000000008
Jul 23 08:52:19 bijvoet kernel: [234721.565347] FS: 0000000000000000(0000) GS:ffff88066fc20000(0000) knlGS:0000000000000000
Jul 23 08:52:19 bijvoet kernel: [234721.565361] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul 23 08:52:19 bijvoet kernel: [234721.565371] CR2: fffffffffffffff8 CR3: 0000000001c05000 CR4: 00000000000006e0
Jul 23 08:52:19 bijvoet kernel: [234721.565383] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 23 08:52:19 bijvoet kernel: [234721.565395] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 23 08:52:19 bijvoet kernel: [234721.565408] Process kworker/3:1 (pid: 4659, threadinfo ffff88065936e000, task ffff880659d30840)
Jul 23 08:52:19 bijvoet kernel: [234721.565421] Stack:
Jul 23 08:52:19 bijvoet kernel: [234721.565427] ffffffff81072168 ffff88066fc324c0 ffffffff8159dfb8 ffff880659e52200
Jul 23 08:52:19 bijvoet kernel: [234721.565446] ffff88065936ffd8 0000000000000001 ffff88065936ffd8 ffff88065936ffd8
Jul 23 08:52:19 bijvoet kernel: [234721.565463] ffff88065936ffd8 ffff880659d30e5c ffff880659d30840 0000000000000000
Jul 23 08:52:19 bijvoet kernel: [234721.565475] Call Trace:
Jul 23 08:52:19 bijvoet kernel: [234721.565485] [<ffffffff81072168>] wq_worker_sleeping+0x8/0x90
Jul 23 08:52:19 bijvoet kernel: [234721.565498] [<ffffffff8159dfb8>] thread_return+0x26e/0x356
Jul 23 08:52:19 bijvoet kernel: [234721.565514] [<ffffffff810580f8>] do_exit+0x268/0x450
Jul 23 08:52:19 bijvoet kernel: [234721.565527] [<ffffffff815a16a6>] oops_end+0xa6/0xf0
Jul 23 08:52:19 bijvoet kernel: [234721.565538] [<ffffffff815a0945>] general_protection+0x25/0x30
Jul 23 08:52:19 bijvoet kernel: [234721.565550] [<ffffffff8113f7bd>] free_block+0xcd/0x180
Jul 23 08:52:19 bijvoet kernel: [234721.565560] [<ffffffff8113fa73>] drain_array.part.42+0x83/0xe0
Jul 23 08:52:19 bijvoet kernel: [234721.565571] [<ffffffff8113fdff>] cache_reap+0x6f/0x220
Jul 23 08:52:19 bijvoet kernel: [234721.565582] [<ffffffff81071631>] process_one_work+0x111/0x4d0
Jul 23 08:52:19 bijvoet kernel: [234721.565593] [<ffffffff81071db2>] worker_thread+0x152/0x340
Jul 23 08:52:19 bijvoet kernel: [234721.565604] [<ffffffff81075e8e>] kthread+0x7e/0x90
Jul 23 08:52:19 bijvoet kernel: [234721.565615] [<ffffffff815a9534>] kernel_thread_helper+0x4/0x10
Jul 23 08:52:19 bijvoet kernel: [234721.565623] Code: e8 9f 7f 52 00 e9 f3 fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 48 89 df e8 68 ad fd ff e9 d7 fe ff ff 0f 1f 00 48 8b 87 b8 03 00 00
Jul 23 08:52:19 bijvoet kernel: [234721.565674] RIP [<ffffffff810761b7>] kthread_data+0x7/0x10
Jul 23 08:52:19 bijvoet kernel: [234721.565683] RSP <ffff88065936fba0>
Jul 23 08:52:19 bijvoet kernel: [234721.565690] CR2: fffffffffffffff8
Jul 23 08:52:19 bijvoet kernel: [234721.565699] ---[ end trace 2c9a843aa76ab843 ]---
Jul 23 08:52:19 bijvoet kernel: [234721.567329] Fixing recursive fault but reboot is needed!
Jul 23 08:54:23 bijvoet kernel: imklog 5.8.5, log source = /proc/kmsg started.
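
To compare several dumps like the one above, it can help to reduce each to its list of call-trace symbols. The following is only a minimal Python sketch: the function name `call_trace_symbols` and the regular expression are illustrative assumptions, not an existing tool, and the pattern deliberately accepts frames whose leading "[" was lost in pasting.

```python
import re

# Matches a call-trace frame such as
#   "[<ffffffff8113fa73>] drain_array.part.42+0x83/0xe0"
# The "[" before "<" is optional because it is sometimes lost when pasting.
FRAME = re.compile(
    r"\[?<(?P<addr>[0-9a-f]{16})>\]\s+"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def call_trace_symbols(lines):
    """Return one list of symbol names per 'Call Trace:' section found."""
    traces = []
    current = None  # list being filled, or None when outside a trace
    for line in lines:
        if "Call Trace:" in line:
            current = []
            traces.append(current)
            continue
        if current is None:
            continue
        match = FRAME.search(line)
        if match:
            current.append(match.group("sym"))
        else:
            current = None  # the first non-frame line ends the trace
    return traces
```

Fed the log above, this should yield two traces, both running through free_block's callers (drain_array and cache_reap), which is the kind of fingerprint worth quoting when asking whether anyone recognizes the crash.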

The machine was purchased in Feb 2011 and was initially installed with openSUSE 11.4. It was stable until around October 2011, when the instability suddenly started (I should say that the system was regularly updated). Since then, both openSUSE 12.1 and 12.2 have been installed (as fresh installs, not updates), but the instability has persisted.

Our initial feeling was a possible hardware problem. However, we have replaced the hard drives, the graphics card and the SATA card, and the power supply, motherboard, CPUs and memory have all been replaced under warranty (the only original hardware left is the case), all without success. The system still crashes.

We finally got fed up with the instability, wiped the system and installed Ubuntu 12.04 LTS. Since then the machine has been completely stable - no crashes of any sort.

This suggests to us a possible driver issue - at least a hardware-software interaction problem - probably introduced by an update to 11.4 released around October 2011, and which clearly still persists in 12.1 and 12.2. (I should say the instability is considerably worse if the machine is booted via systemd rather than System V init.)

The hardware is:

Intel S5520SCR motherboard
Intel SC5650 Server chassis 1000W PSU
2 x Xeon X5660 2.80GHz 6 Core 12MB Cache
Seagate Barracuda 1.5TB 3.5" 7200RPM 32MB SATA
Crucial C300 SSD
24GB DDR3 1333 ECC Kingston/Samsung Server memory (6 x 4GB)
Zotac GeForce GT 560Ti 1GB DDR5 PCI-e HDMI DVI VGA
Startech PEXESAT32 eSATA card

I can send the output of hwinfo and lspci if more information is required.

Does anyone know of any issues with any of the above hardware? We have tried removing the eSATA card and the SSD and reinstalling the OS on a normal hard drive, with no improvement!

We are completely stumped.

Any thoughts?

Andrew >:(

asharff wrote:
> We have been suffering from frequent (often 2 - 3 times a day),
> apparently random and unexplained crashes for some time on one of our
> desktop machines.

A few thoughts. It depends how keen you are to find the source of the
problem :(

> The machine was purchased in Feb 2011 and was initially installed with
> Opensuse 11.4. It was stable until around October 2011 when the
> instability suddenly started (I should say that the system was regularly
> updated). Since then, both Opensuse 12.1 and 12.2 have been installed
> (as fresh installs, not updates), but the instability has persisted.
>
> Our initial feeling was a possible hardware problem. However, we have
> replaced the hard drives, the graphics card and sata card and the power
> supply, motherboard, cpu’s and memory have all been replaced under
> warranty (the only original hardware left is the case), but without any
> success. The system still crashes.
>
> We finally got fed up with the instability and wiped the system and
> installed Ubuntu 12.04 lts. Since then the machine has been completely
> stable - no crashes of any sort.
>
> This suggests to us a possible driver issue - at least a
> hardware-software interaction problem - probably introduced by an update
> first released in 11.4 as an update in around October 11.4 and which
> clearly still persists in 12.1 and 12.2. (I should say the instability
> is considerably worse if the machine was booted via systemd rather than
> systemV).
[snip]
> I can send the output of hwinfo and lspci if more information is
> required.

So there seem to be a few possible avenues to explore, but they involve
a fair amount of work:

(1) go back to 11.4 and bisect the updates to find the commit that
caused the problem to appear

(2) compare the detailed configurations of Ubuntu 12.04 and openSUSE
12.1/12.2 to see if there are any configuration differences (CPU
scheduling maybe? etc etc) and/or version or patchset differences.

(3) for each item of hardware that you have, search (google or commit
logs) for any bugs introduced and fixed in the drivers or kernel.
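
Avenue (1) is an ordinary bisection over the update history. As a rough Python sketch only: `first_bad_update` and the `is_stable` callback are hypothetical names, and `is_stable` stands for the manual step of installing exactly the given updates and soak-testing the machine. It assumes the system is stable with no updates applied, unstable with all of them, and stays unstable once the offending update is in.

```python
def first_bad_update(updates, is_stable):
    """Return the index of the update that first makes the system unstable.

    updates   -- updates in the order they were originally applied
    is_stable -- hypothetical callable: given the updates applied so far,
                 return True if the machine survived the soak test
    """
    # Invariant: stable with updates[:lo], unstable with updates[:hi].
    lo, hi = 0, len(updates)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_stable(updates[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1  # updates[hi - 1] is the one that tipped it over
```

The number of test rounds grows only with the logarithm of the number of updates, but with crashes arriving 2 - 3 times a day each "stable" verdict probably needs a day or more of running before it can be trusted, so the wall-clock cost is still considerable.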

You seem to have quite a few modules loaded so it would ease the task if
you can identify a simpler configuration that also shows the problem.
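
For avenue (2), one starting point is to diff the kernel build options of the two distributions, which are typically shipped as a /boot/config-<version> file. The sketch below is only an illustration in Python; the function names are made up, and it compares two config texts that you would have to read from the respective files yourself.

```python
def parse_kernel_config(text):
    """Map option name -> value for kernel .config-style text.

    Lines of the form '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    suffix = " is not set"
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def config_differences(a_text, b_text):
    """Return {option: (value_in_a, value_in_b)}; None means absent."""
    a = parse_kernel_config(a_text)
    b = parse_kernel_config(b_text)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }
```

The crash log shows a PREEMPT SMP kernel, so preemption and scheduler-related options would be the first entries to look at in the resulting dictionary.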

It might be worth posting that crash log to the opensuse-kernel mailing
list to see if anybody recognizes it, or at least can narrow down your
search.

Just to exclude things: does the machine behave the same way in another place, for example in another building?

Thank you for the help. Much as I would love to go back to openSUSE (I much prefer it to Ubuntu), we are a small company with limited resources and I simply cannot afford the time for such a forensic analysis at this point. Hopefully I will get the opportunity in the not too distant future, because this bugs me!

Andrew:(

I can’t imagine any company, no matter what size, that allows random crashing in its ICT infrastructure.

Knurpht wrote:
> asharff;2498949 Wrote:
>> Thank you for the help. Much as I would love to go back to Opensuse (I
>> much prefer it to Ubuntu), we are a small company with limited resources
>> and I simply cannot afford the time to go into such a forensic analysis
>> at this point in time. Hopefully I will get the opportunity in the not
>> too distant future because this bugs me!
>>
>> Andrew:(
>
> I can’t imagine any company, no matter what size, that allows random
> crashing in it’s ICT infrastructure.

But his company doesn’t have any random crashing since they stopped
using openSUSE!

The sad thing is that there’s pretty much no chance of anybody else
identifying this problem, so it will just stay there until/unless it
bites somebody else or some other patch reverses the flaw.

Hi,

I am a colleague of Andrew’s at the same company, so I have seen his pain at first hand…

We have only stopped using openSUSE on that particular machine. We have several other machines with various different hardware and openSUSE runs fine on all of them.

> The sad thing is that there’s pretty much no chance of anybody else
> identifying this problem, so it will just stay there until/unless it
> bites somebody else or some other patch reverses the flaw.

Exactly. However, if the problem (whatever it is) gets through into SUSE and significant numbers of enterprise customers have the same problem, it will be investigated and fixed pretty sharpish, I imagine ;)

As Andrew said, we were convinced for a long time that the problem was faulty hardware. The unpredictability of it combined with the complete absence of diagnostics indicating that anything was going wrong a lot of the time fed that conviction. And I don’t just mean files in /var/log: with the help of the original system builder (who was very helpful) we looked at diagnostics in the hardware too, for a sign of something wrong.

The only original hardware that remains is the case, monitors and cabling, so to that extent the problem is reproducible - in effect it has been demonstrated on more than one piece of kit, and it could in principle be investigated by someone who puts together the same system. The difficulty for us is that we were never able to trigger the crashes: we just had to wait for hours or days, which is part of the reason why we cannot spend the time investigating this further. All I can say is that they were worse in runlevel 5 than 3, and as Andrew said, systemd seemed to make things worse (which makes me wonder about a race condition of some kind). Switching between the NVIDIA and nouveau drivers also didn’t make any difference.

Regards,
Peter.

On 10/26/2012 03:46 PM, pakeller wrote:
> The only original hardware that remains is the case, monitors and
> cabling,

a long time ago (last century) in a land (and discussion forum) far away
an old old man who was a big iron mainframe admin in the '70s [maybe
'60s] told me:

“Always expect the cables to be bad, FIRST.”

i am not making this up! so since you have replaced all but the case,
monitors and cabling…i’m gonna guess if you replace the cables…and
then maybe the switches, routers, terminal blocks, connectors, whatevers
that those cables connect to--you will find your machine (even if you
put all the old parts back in it) will run just as fine as all your
other sweet runners do.


dd

dd wrote:
> On 10/26/2012 03:46 PM, pakeller wrote:
>> The only original hardware that remains is the case, monitors and
>> cabling,
>
> a long time ago (last century) in a land (and discussion forum) far away
> an old old man who was a big iron mainframe admin in the '70s [maybe
> '60s] told me:
>
> “Always expect the cables to be bad, FIRST.”

I have to say that I agree with dd (and the old old man). If there are
any power supply or disk cables that haven’t been replaced, that would
be worth a shot.

Oh, and I still think it would be worth posting to the opensuse-kernel
list. You might get lucky and find somebody who recognizes it.

And probably also worth posting a bugzilla just so if it does turn up
again the bug is already confirmed and to give another data point for
any investigation.

I don’t disagree with you completely, but if Ubuntu is more tolerant of dodgy cabling than openSUSE then it still isn’t a pure hardware issue, IMHO ;)

OTOH, is it conceivable that the monitors might be somehow involved? I think of monitors as being “arms length” devices as far as the OS and drivers are concerned, but maybe I am out of date here. FWIW, they are iiYama ProLite E2208HDS

Anyway, I’m afraid that this aspect is closed as far as we are concerned: we simply don’t have the time to put openSUSE back on this machine and continue experimenting. It isn’t our core business, and I wince every time I think about the amount of our time that investigating this single machine has already taken.

> Oh, and I still think it would be worth posting to the opensuse-kernel
> list. You might get lucky and find somebody who recognizes it.
>
> And probably also worth posting a bugzilla just so if it does turn up
> again the bug is already confirmed and to give another data point for
> any investigation.

Yes: we’ll put something together and get it out there.

Regards,
Peter.