Unexplained crashes

For the past month or so my machine has been crashing at random for no apparent reason, mostly without leaving any messages in /var/log/messages (or anywhere else).

It initially seemed to be hardware related (reinstalling openSUSE had no effect).

However, the power supply, motherboard, CPUs and memory have all been replaced and the video card swapped out, to no avail.

The crashes are now becoming more frequent and very annoying. Occasionally there are now some messages in /var/log/messages.

The messages from the last crash are:

Jul 23 08:52:19 bijvoet kernel: [234721.562898] general protection fault: 0000 [#1] PREEMPT SMP
Jul 23 08:52:19 bijvoet kernel: [234721.562908] CPU 3
Jul 23 08:52:19 bijvoet kernel: [234721.562911] Modules linked in: autofs4 snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device jc42 coretemp edd vboxpci vboxnetadp vboxnetflt vboxdrv
nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet cpufreq_conservative cpufreq_userspace microcode cpufreq_powersave acpi_cpufreq mperf snd_hda_codec_hdmi sr_mod cdrom joydev
sg snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep nvidia(P) snd_pcm i2c_i801 tpm_tis firewire_ohci firewire_core crc_itu_t i7core_edac tpm iTCO_wdt iTCO_vendor_support
pcspkr edac_core ioatdma tpm_bios xhci_hcd snd_timer snd soundcore igb dca snd_page_alloc button dm_mod linear processor thermal_sys ata_generic
Jul 23 08:52:19 bijvoet kernel: [234721.562993]
Jul 23 08:52:19 bijvoet kernel: [234721.562998] Pid: 4659, comm: kworker/3:1 Tainted: P 3.1.10-1.16-desktop #1 Intel Corporation S5520SC/S5520SC
Jul 23 08:52:19 bijvoet kernel: [234721.563009] RIP: 0010:[<ffffffff8113f7bd>] [<ffffffff8113f7bd>] free_block+0xcd/0x180
Jul 23 08:52:19 bijvoet kernel: [234721.563023] RSP: 0018:ffff88065936fd60 EFLAGS: 00010002
Jul 23 08:52:19 bijvoet kernel: [234721.563028] RAX: ffff88065b5e7d40 RBX: ffff88035b5736c0 RCX: ffff880655bfa000
Jul 23 08:52:19 bijvoet kernel: [234721.563034] RDX: ffff8803bee590c0 RSI: ffff8803bee59000 RDI: 0000e10000e50000
Jul 23 08:52:19 bijvoet kernel: [234721.563041] RBP: ffff88065b70c018 R08: 0000000000000001 R09: 0000000000000000
Jul 23 08:52:19 bijvoet kernel: [234721.563047] R10: 000000000000001b R11: dead000000100100 R12: 0000000000000018
Jul 23 08:52:19 bijvoet kernel: [234721.563053] R13: 0000000000000000 R14: ffffea0000000000 R15: 0000000000000008
Jul 23 08:52:19 bijvoet kernel: [234721.563060] FS: 0000000000000000(0000) GS:ffff88066fc20000(0000) knlGS:0000000000000000
Jul 23 08:52:19 bijvoet kernel: [234721.563067] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul 23 08:52:19 bijvoet kernel: [234721.563072] CR2: 00007fc53d648000 CR3: 0000000001c05000 CR4: 00000000000006e0
Jul 23 08:52:19 bijvoet kernel: [234721.563079] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 23 08:52:19 bijvoet kernel: [234721.563085] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 23 08:52:19 bijvoet kernel: [234721.563091] Process kworker/3:1 (pid: 4659, threadinfo ffff88065936e000, task ffff880659d30840)
Jul 23 08:52:19 bijvoet kernel: [234721.563098] Stack:
Jul 23 08:52:19 bijvoet kernel: [234721.563100] 000000000000fc40 ffff8803bee590c0 0000000000000003 ffff88065b70c000
Jul 23 08:52:19 bijvoet kernel: [234721.563111] ffff88035b5736c0 0000000000000018 ffff88065b5e7d80 ffff88065b70c018
Jul 23 08:52:19 bijvoet kernel: [234721.563121] 0000000000000001 ffffffff8113fa73 ffff88066f826780 ffff88065b5e7d40
Jul 23 08:52:19 bijvoet kernel: [234721.563131] Call Trace:
Jul 23 08:52:19 bijvoet kernel: [234721.563144] [<ffffffff8113fa73>] drain_array.part.42+0x83/0xe0
Jul 23 08:52:19 bijvoet kernel: [234721.563153] [<ffffffff8113fdff>] cache_reap+0x6f/0x220
Jul 23 08:52:19 bijvoet kernel: [234721.563165] [<ffffffff81071631>] process_one_work+0x111/0x4d0
Jul 23 08:52:19 bijvoet kernel: [234721.563174] [<ffffffff81071db2>] worker_thread+0x152/0x340
Jul 23 08:52:19 bijvoet kernel: [234721.563184] [<ffffffff81075e8e>] kthread+0x7e/0x90
Jul 23 08:52:19 bijvoet kernel: [234721.563195] [<ffffffff815a9534>] kernel_thread_helper+0x4/0x10
Jul 23 08:52:19 bijvoet kernel: [234721.563203] Code: 8b 08 81 e1 80 00 00 00 0f 84 c0 00 00 00 48 8b 70 28 48 8b 43 68 49 bb 00 01 10 00 00 00 ad de 48 8b 3e 48 8b 4e 08 4a 8b 04 38
Jul 23 08:52:19 bijvoet kernel: [234721.563251] RIP [<ffffffff8113f7bd>] free_block+0xcd/0x180
Jul 23 08:52:19 bijvoet kernel: [234721.563257] RSP <ffff88065936fd60>
Jul 23 08:52:19 bijvoet kernel: [234721.563263] ---[ end trace 2c9a843aa76ab842 ]---
Jul 23 08:52:19 bijvoet kernel: [234721.565006] note: kworker/3:1[4659] exited with preempt_count 1
Jul 23 08:52:19 bijvoet kernel: [234721.565055] BUG: unable to handle kernel paging request at fffffffffffffff8
Jul 23 08:52:19 bijvoet kernel: [234721.565067] IP: [<ffffffff810761b7>] kthread_data+0x7/0x10
Jul 23 08:52:19 bijvoet kernel: [234721.565079] PGD 1c07067 PUD 1c08067 PMD 0
Jul 23 08:52:19 bijvoet kernel: [234721.565090] Oops: 0000 [#2] PREEMPT SMP
Jul 23 08:52:19 bijvoet kernel: [234721.565099] CPU 3
Jul 23 08:52:19 bijvoet kernel: [234721.565103] Modules linked in: autofs4 snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device jc42 coretemp edd vboxpci vboxnetadp vboxnetflt vboxdrv
nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet cpufreq_conservative cpufreq_userspace microcode cpufreq_powersave acpi_cpufreq mperf snd_hda_codec_hdmi sr_mod cdrom joydev
sg snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep nvidia(P) snd_pcm i2c_i801 tpm_tis firewire_ohci firewire_core crc_itu_t i7core_edac tpm iTCO_wdt iTCO_vendor_support
pcspkr edac_core ioatdma tpm_bios xhci_hcd snd_timer snd soundcore igb dca snd_page_alloc button dm_mod linear processor thermal_sys ata_generic
Jul 23 08:52:19 bijvoet kernel: [234721.565236]
Jul 23 08:52:19 bijvoet kernel: [234721.565241] Pid: 4659, comm: kworker/3:1 Tainted: P D 3.1.10-1.16-desktop #1 Intel Corporation S5520SC/S5520SC
Jul 23 08:52:19 bijvoet kernel: [234721.565259] RIP: 0010:[<ffffffff810761b7>] [<ffffffff810761b7>] kthread_data+0x7/0x10
Jul 23 08:52:19 bijvoet kernel: [234721.565276] RSP: 0018:ffff88065936fba0 EFLAGS: 00010002
Jul 23 08:52:19 bijvoet kernel: [234721.565286] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000003
Jul 23 08:52:19 bijvoet kernel: [234721.565297] RDX: 0000000000000003 RSI: 0000000000000003 RDI: ffff880659d30840
Jul 23 08:52:19 bijvoet kernel: [234721.565309] RBP: ffff88065936fc28 R08: 0000000000989680 R09: ffff880659e52368
Jul 23 08:52:19 bijvoet kernel: [234721.565322] R10: 0000000000000400 R11: ffff880659e52358 R12: 0000000000000003
Jul 23 08:52:19 bijvoet kernel: [234721.565334] R13: ffff880659d30c50 R14: ffffea0000000000 R15: 0000000000000008
Jul 23 08:52:19 bijvoet kernel: [234721.565347] FS: 0000000000000000(0000) GS:ffff88066fc20000(0000) knlGS:0000000000000000
Jul 23 08:52:19 bijvoet kernel: [234721.565361] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul 23 08:52:19 bijvoet kernel: [234721.565371] CR2: fffffffffffffff8 CR3: 0000000001c05000 CR4: 00000000000006e0
Jul 23 08:52:19 bijvoet kernel: [234721.565383] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 23 08:52:19 bijvoet kernel: [234721.565395] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 23 08:52:19 bijvoet kernel: [234721.565408] Process kworker/3:1 (pid: 4659, threadinfo ffff88065936e000, task ffff880659d30840)
Jul 23 08:52:19 bijvoet kernel: [234721.565421] Stack:
Jul 23 08:52:19 bijvoet kernel: [234721.565427] ffffffff81072168 ffff88066fc324c0 ffffffff8159dfb8 ffff880659e52200
Jul 23 08:52:19 bijvoet kernel: [234721.565446] ffff88065936ffd8 0000000000000001 ffff88065936ffd8 ffff88065936ffd8
Jul 23 08:52:19 bijvoet kernel: [234721.565463] ffff88065936ffd8 ffff880659d30e5c ffff880659d30840 0000000000000000
Jul 23 08:52:19 bijvoet kernel: [234721.565475] Call Trace:
Jul 23 08:52:19 bijvoet kernel: [234721.565485] [<ffffffff81072168>] wq_worker_sleeping+0x8/0x90
Jul 23 08:52:19 bijvoet kernel: [234721.565498] [<ffffffff8159dfb8>] thread_return+0x26e/0x356
Jul 23 08:52:19 bijvoet kernel: [234721.565514] [<ffffffff810580f8>] do_exit+0x268/0x450
Jul 23 08:52:19 bijvoet kernel: [234721.565527] [<ffffffff815a16a6>] oops_end+0xa6/0xf0
Jul 23 08:52:19 bijvoet kernel: [234721.565538] [<ffffffff815a0945>] general_protection+0x25/0x30
Jul 23 08:52:19 bijvoet kernel: [234721.565550] [<ffffffff8113f7bd>] free_block+0xcd/0x180
Jul 23 08:52:19 bijvoet kernel: [234721.565560] [<ffffffff8113fa73>] drain_array.part.42+0x83/0xe0
Jul 23 08:52:19 bijvoet kernel: [234721.565571] [<ffffffff8113fdff>] cache_reap+0x6f/0x220
Jul 23 08:52:19 bijvoet kernel: [234721.565582] [<ffffffff81071631>] process_one_work+0x111/0x4d0
Jul 23 08:52:19 bijvoet kernel: [234721.565593] [<ffffffff81071db2>] worker_thread+0x152/0x340
Jul 23 08:52:19 bijvoet kernel: [234721.565604] [<ffffffff81075e8e>] kthread+0x7e/0x90
Jul 23 08:52:19 bijvoet kernel: [234721.565615] [<ffffffff815a9534>] kernel_thread_helper+0x4/0x10
Jul 23 08:52:19 bijvoet kernel: [234721.565623] Code: e8 9f 7f 52 00 e9 f3 fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 48 89 df e8 68 ad fd ff e9 d7 fe ff ff 0f 1f 00 48 8b 87 b8 03 00 00
Jul 23 08:52:19 bijvoet kernel: [234721.565674] RIP [<ffffffff810761b7>] kthread_data+0x7/0x10
Jul 23 08:52:19 bijvoet kernel: [234721.565683] RSP <ffff88065936fba0>
Jul 23 08:52:19 bijvoet kernel: [234721.565690] CR2: fffffffffffffff8
Jul 23 08:52:19 bijvoet kernel: [234721.565699] ---[ end trace 2c9a843aa76ab843 ]---
Jul 23 08:52:19 bijvoet kernel: [234721.567329] Fixing recursive fault but reboot is needed!
Jul 23 08:54:23 bijvoet kernel: imklog 5.8.5, log source = /proc/kmsg started.

Any ideas on what the culprit is?

Thanks

Andrew>:(

On 07/23/2012 04:06 AM, asharff wrote:
>
> For the past month or so my machine has been randomly crashing for no
> apparent reason. Mostly, without leaving any messages in
> /var/log/messages (or anywhere else).
>
> It initially seemed to be hardware related (reinstalling Opensuse had
> no effect).
>
> But, the power supply, motherboard, cpu’s and memory have been replaced
> and the video card swapped out, to no avail.
>
> The crashes are now becoming more frequent and very annoying.
> Occasionally now there are some messages in /var/log/messages.
>
> The messages from the last crash are:
>
>
> Jul 23 08:52:19 bijvoet kernel: [234721.562898] general protection
> fault: 0000 [#1] PREEMPT SMP
> Jul 23 08:52:19 bijvoet kernel: [234721.562908] CPU 3
> Jul 23 08:52:19 bijvoet kernel: [234721.562911] Modules linked in:
> autofs4 snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device jc42 coretemp
> edd vboxpci vboxnetadp vboxnetflt vboxdrv
> nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet
> cpufreq_conservative cpufreq_userspace microcode cpufreq_powersave
> acpi_cpufreq mperf snd_hda_codec_hdmi sr_mod cdrom joydev
> sg snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep
> nvidia(P) snd_pcm i2c_i801 tpm_tis firewire_ohci firewire_core crc_itu_t
> i7core_edac tpm iTCO_wdt iTCO_vendor_support
> pcspkr edac_core ioatdma tpm_bios xhci_hcd snd_timer snd soundcore igb
> dca snd_page_alloc button dm_mod linear processor thermal_sys
> ata_generic
> Jul 23 08:52:19 bijvoet kernel: [234721.562993]
> Jul 23 08:52:19 bijvoet kernel: [234721.562998] Pid: 4659, comm:
> kworker/3:1 Tainted: P 3.1.10-1.16-desktop #1 Intel
> Corporation S5520SC/S5520SC
> Jul 23 08:52:19 bijvoet kernel: [234721.563009] RIP:
> 0010:[<ffffffff8113f7bd>] [<ffffffff8113f7bd>] free_block+0xcd/0x180

The tainted flag “P” means you have loaded the proprietary module nvidia. As
that introduces kernel code that no one outside of Nvidia has been able to see,
kernel developers generally tell you to duplicate the result without that kernel
taint before they will even look at this problem. That may be necessary, but
there are some things to try first. Your crash occurred in routine free_block(),
which is part of memory management.
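
For anyone who wants to check the taint state on a running system, here is a minimal Python sketch that decodes the bitmask the kernel exposes in /proc/sys/kernel/tainted. The function names are made up for illustration, only the first eight bits are listed, and the one-line meanings are my paraphrase of the kernel documentation rather than its exact wording.

```python
# Taint letters by bit position in the /proc/sys/kernel/tainted bitmask.
# Only the first eight bits are covered; newer kernels define more.
TAINT_FLAGS = [
    (0, "P", "proprietary module loaded"),
    (1, "F", "module force-loaded"),
    (2, "S", "system running outside its specification"),
    (3, "R", "module force-unloaded"),
    (4, "M", "machine check exception occurred"),
    (5, "B", "bad page reference"),
    (6, "U", "taint requested by userspace"),
    (7, "D", "kernel has died (oops or BUG)"),
]


def decode_taint(value):
    """Return (letter, meaning) pairs for every known bit set in value."""
    return [(letter, meaning)
            for bit, letter, meaning in TAINT_FLAGS
            if value & (1 << bit)]


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the current taint bitmask; the path exists only on Linux."""
    with open(path) as handle:
        return int(handle.read().strip())


# Bits 0 and 7 together give 129, matching "Tainted: P D" in the second oops.
print([letter for letter, _ in decode_taint(129)])  # ['P', 'D']
```

A value of 0 means the kernel is untainted, which is the state the kernel developers will want to see before they look at a report.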

I saw that you changed most, if not all, of the hardware. Did you also run
memtest86+ for a long time? As it took the OS about 65 hours (the 234721 seconds
in the kernel timestamp) to hit this error, I would run the memory test for at
least 24 hours.
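
The 65-hour figure comes straight from the bracketed kernel timestamp, which counts seconds since boot. A small Python sketch of that arithmetic, assuming the syslog line layout shown in the post (the helper names are hypothetical):

```python
import re

# Matches the bracketed kernel timestamp in a syslog line such as
# "... kernel: [234721.562898] general protection fault ...".
TIMESTAMP = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def seconds_since_boot(log_line):
    """Extract the kernel timestamp from one log line, or None if absent."""
    match = TIMESTAMP.search(log_line)
    return float(match.group(1)) if match else None


def to_hours(seconds):
    """Convert seconds of uptime to hours."""
    return seconds / 3600.0


line = ("Jul 23 08:52:19 bijvoet kernel: [234721.562898] "
        "general protection fault: 0000 [#1] PREEMPT SMP")
print(round(to_hours(seconds_since_boot(line)), 1))  # 65.2
```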

If the memory test passes, the next step will depend on whether you generate
your own kernels. If you do, then change from SLAB to SLUB memory management in
case you are hitting a subtle bug in MM; however, your statement that the
frequency of these crashes is increasing argues for a hardware problem, unless
your workload has changed.
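
The functions in the trace (free_block, drain_array, cache_reap) live in the SLAB allocator, so the running kernel is presumably built with CONFIG_SLAB. A rough Python sketch for confirming that from a kernel config file follows; the function name is hypothetical, and the config location (often /boot/config-<release>, or /proc/config.gz on some systems) is an assumption that varies by distribution.

```python
import re

# Slab allocator choices offered by kernels of this era.
ALLOCATORS = ("SLAB", "SLUB", "SLOB")


def selected_allocator(config_text):
    """Return the allocator enabled in a kernel .config, or None."""
    for name in ALLOCATORS:
        if re.search(rf"^CONFIG_{name}=y$", config_text, re.MULTILINE):
            return name
    return None


# Example fragment in the usual .config format (not taken from this machine).
sample = "CONFIG_SLAB=y\n# CONFIG_SLUB is not set\n# CONFIG_SLOB is not set\n"
print(selected_allocator(sample))  # SLAB
```

To use it on a real system, read the config file into a string and pass it in.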

Finally, the nvidia driver is still not off the hook. If none of the above steps
show a problem, then try to duplicate it without loading the nvidia module. If
that cannot be done, then try different versions of the Nvidia driver.
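
To confirm that a test boot really is free of the nvidia module, the "Modules linked in:" line of any later oops can be scanned for flagged entries. A minimal Python sketch, assuming the module(FLAGS) notation seen in the log above (the function name is made up for illustration):

```python
import re


def flagged_modules(modules_line):
    """Map module name -> flag letters for entries written as name(FLAGS)."""
    # Keep only the text after the "Modules linked in:" label, if present.
    _, _, names = modules_line.partition("Modules linked in:")
    return {match.group(1): match.group(2)
            for match in re.finditer(r"(\S+?)\(([A-Z+-]+)\)", names)}


line = "Modules linked in: autofs4 snd_hwdep nvidia(P) snd_pcm i2c_i801"
print(flagged_modules(line))  # {'nvidia': 'P'}
```

An empty result means no module in that list carries a flag such as P.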

Larry Finger wrote:

> I saw that you changed most, if not all, of the hardware. Did you also run
> memtest86+ for a long time? As it took the OS about 65 hours to hit this
> error, then I would run the memory test for at least 24 hours.

Nice troubleshooting procedure, Larry.

I also resort to the old hair dryer when memory is suspect. Raising the
internal temp of the computer 10 degrees C (don’t get crazy with this!) will
greatly accelerate most memory flakeouts while testing. Conversely, adding
additional fans to keep the system as cool as possible usually extends the
failure interval for the time needed to make a backup once memory issues are
confirmed.

I must be lucky. So far I haven’t hit any issues with the Nvidia drivers
but I don’t really stress the graphics systems any more.


Will Honea

Thanks for the help.

We had run memtest86+ for ~16 hours (overnight) without any errors, but maybe it needs longer.

We do have another identical machine (no problems with it), although it is running Ubuntu. As its user is away for 2 weeks, I may use the opportunity to swap the memory between the two machines and see if that also transfers the problem! I could do the same with the video card as well.