MYRI10GE driver and kernel crash BUG: unable to handle kernel paging request at 0000033800000000

I have a cluster of 20 nodes (Mac Pros), each with 2x6 cores (hyperthreading disabled), connected by 10G Myrinet cards and an Asante Networks 10G switch. The machines are typically used for solid-state calculations with Intel MPI. The system worked flawlessly with openSUSE 11.2 and 11.4, but since upgrading to 12.2, using the Myrinet cards leads to crashes when the load becomes heavier. The same problem occurs with a direct card-to-card connection, so the switch is innocent. The log entries are pasted below for reference. The problem happens with both the standard myri10ge driver that ships with the latest version of 12.2 and the development version from Myricom.

[   12.530280] myri10ge: Version 1.5.3devel-jan-2013
[   12.555104] myri10ge 0000:03:00.0: PCIE x4 Link
[   12.642701] firewire_ohci 0000:0c:00.0: added OHCI v1.10 device as card 0, 8 IR + 8 IT contexts, quirks 0x2
[   12.803385] myri10ge 0000:03:00.0: irq 69 for MSI/MSI-X
[   12.803889] myri10ge 0000:03:00.0: MSI IRQ 69, tx bndry 4096, fw myri10ge_eth_z8e.dat, WC Enabled

The eth2 entry in ifconfig reads:

eth2 Link encap:Ethernet HWaddr 00:60:DD:45:47:2A
inet addr:192.168.100.15 Bcast:192.168.100.255 Mask:255.255.255.0
inet6 addr: fe80::260:ddff:fe45:472a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:1464 errors:0 dropped:0 overruns:0 frame:0
TX packets:1452 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:98605 (96.2 Kb) TX bytes:293262 (286.3 Kb)
Interrupt:69

I have also applied the buffer-size changes recommended on the Myricom site (the same settings I used under 11.2 and 11.4, before the kernel panics appeared).

# myrinet related buffer changes

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 250000
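
A setting in the sysctl configuration file only matters if the running kernel actually picked it up after the upgrade. A minimal sketch for checking that, assuming procfs is mounted at /proc/sys and that each dotted key maps directly onto a path there (true for the five keys above); the helper names are hypothetical:

```python
from pathlib import Path

# Expected values, copied from the sysctl settings listed above.
EXPECTED = {
    "net.core.rmem_max": "16777216",
    "net.core.wmem_max": "16777216",
    "net.ipv4.tcp_rmem": "4096 87380 16777216",
    "net.ipv4.tcp_wmem": "4096 65536 16777216",
    "net.core.netdev_max_backlog": "250000",
}


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under root."""
    return Path(root, *key.split("."))


def normalize(value):
    """Collapse tabs and newlines so multi-value entries compare cleanly."""
    return " ".join(value.split())


def mismatches(expected, root="/proc/sys"):
    """Return {key: (expected, actual)} for keys whose live value differs."""
    bad = {}
    for key, want in expected.items():
        try:
            got = normalize(sysctl_path(key, root).read_text())
        except OSError:
            got = None  # key missing or unreadable on this kernel
        if got != normalize(want):
            bad[key] = (want, got)
    return bad


if __name__ == "__main__":
    for key, (want, got) in mismatches(EXPECTED).items():
        print(f"{key}: expected {want!r}, running kernel has {got!r}")
```

If the script prints nothing, the live values match the list above.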

Any suggestions of what to try?

Thanks for any help in advance.

The kernel panic is below:
Mar 11 08:01:39 volans kernel: [250626.456788] BUG: unable to handle kernel paging request at 0000033800000000
Mar 11 08:01:39 volans kernel: [250626.460613] IP: [<ffffffff81586eb1>] cache_alloc_refill+0x128/0x1e1
Mar 11 08:01:39 volans kernel: [250626.462561] PGD 0
Mar 11 08:01:39 volans kernel: [250626.464491] Oops: 0002 [#1] PREEMPT SMP
Mar 11 08:01:39 volans kernel: [250626.466420] CPU 23
Mar 11 08:01:39 volans kernel: [250626.466443] Modules linked in: binfmt_misc fuse rfcomm bnep nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_conservative cpufreq_userspace cpufreq_powersave firewire_ohci firewire_core crc_itu_t acpi_cpufreq mperf coretemp crc32c_intel snd_hda_codec_hdmi i7core_edac snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_pcm ghash_clmulni_intel snd_seq snd_timer snd_seq_device snd_mixer_oss snd aesni_intel edac_core btusb cryptd ioatdma bluetooth i2c_i801 aes_x86_64 b43 mac80211 cfg80211 rfkill bcma myri10ge sr_mod dca cdrom shpchp soundcore pci_hotplug sg snd_page_alloc ssb mmc_core pcmcia e1000e iTCO_wdt pcmcia_core iTCO_vendor_support button applesmc input_polldev pcspkr edd microcode autofs4 radeon ata_piix ttm drm_kms_helper drm i2c_algo_bit scsi_dh_hp_sw scsi_dh_emc scsi_dh_rdac scsi_dh_alua scsi_dh fan processor ata_generic thermal thermal_sys
Mar 11 08:01:39 volans kernel: [250626.477507]
Mar 11 08:01:39 volans kernel: [250626.479829] Pid: 4342, comm: abinit Not tainted 3.4.28-2.20-desktop #1 Apple Inc. MacPro5,1/Mac-F221BEC8
Mar 11 08:01:39 volans kernel: [250626.482207] RIP: 0010:[<ffffffff81586eb1>] [<ffffffff81586eb1>] cache_alloc_refill+0x128/0x1e1
Mar 11 08:01:39 volans kernel: [250626.484600] RSP: 0018:ffff88082c0e1c48 EFLAGS: 00010046
Mar 11 08:01:39 volans kernel: [250626.486963] RAX: ffff880855a2b000 RBX: 000000000000000c RCX: ffffea001d26c588
Mar 11 08:01:39 volans kernel: [250626.489355] RDX: 0000033800000000 RSI: 0000000000000004 RDI: dead000000100100
Mar 11 08:01:39 volans kernel: [250626.491768] RBP: 00000000000412d0 R08: dead000000200200 R09: ffff88087ec054d0
Mar 11 08:01:39 volans kernel: [250626.494209] R10: ffff88087ec054e0 R11: dead000000100100 R12: ffff88087ec05500
Mar 11 08:01:39 volans kernel: [250626.496654] R13: ffff88087ec054c0 R14: ffff880856c35000 R15: ffff88087ec001c0
Mar 11 08:01:39 volans kernel: [250626.499104] FS: 00002adb372f40c0(0000) GS:ffff88087f4e0000(0000) knlGS:0000000000000000
Mar 11 08:01:39 volans kernel: [250626.501579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 11 08:01:39 volans kernel: [250626.504050] CR2: 0000033800000000 CR3: 000000082f011000 CR4: 00000000000007e0
Mar 11 08:01:39 volans kernel: [250626.506538] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 11 08:01:39 volans kernel: [250626.509026] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 11 08:01:39 volans kernel: [250626.511516] Process abinit (pid: 4342, threadinfo ffff88082c0e0000, task ffff880853002500)
Mar 11 08:01:39 volans kernel: [250626.514046] Stack:
Mar 11 08:01:39 volans kernel: [250626.516565] ffff8808532570c0 0000000052ef6f00 ffffffffa0569fe0 ffff88087ec001c0
Mar 11 08:01:39 volans kernel: [250626.519149] ffff880660f83a40 00000000000000d0 0000000000000202 0000000000000070
Mar 11 08:01:39 volans kernel: [250626.521750] 00000000000000d0 ffffffff8114a99d 00000064000003e9 ffff8808525ea5c0
Mar 11 08:01:39 volans kernel: [250626.524346] Call Trace:
Mar 11 08:01:39 volans kernel: [250626.526910] [<ffffffff8114a99d>] kmem_cache_alloc_trace+0x1ad/0x1c0
Mar 11 08:01:39 volans kernel: [250626.529497] [<ffffffffa069375c>] alloc_nfs_open_context+0x4c/0x120 [nfs]
Mar 11 08:01:39 volans kernel: [250626.532117] [<ffffffffa0693984>] nfs_open+0x24/0x70 [nfs]
Mar 11 08:01:39 volans kernel: [250626.534714] [<ffffffff8115bcd1>] __dentry_open+0x221/0x300
Mar 11 08:01:39 volans kernel: [250626.537288] [<ffffffff8116c8c0>] do_last+0x450/0x8d0
Mar 11 08:01:39 volans kernel: [250626.539838] [<ffffffff8116ce59>] path_openat+0xd9/0x3d0
Mar 11 08:01:39 volans kernel: [250626.542372] [<ffffffff8116d274>] do_filp_open+0x44/0xb0
Mar 11 08:01:39 volans kernel: [250626.544900] [<ffffffff8115cf25>] do_sys_open+0xf5/0x1e0
Mar 11 08:01:39 volans kernel: [250626.547432] [<ffffffff81595abd>] system_call_fastpath+0x1a/0x1f
Mar 11 08:01:39 volans kernel: [250626.549976] [<00002adb347cda00>] 0x2adb347cd9ff
Mar 11 08:01:39 volans kernel: [250626.552484] Code: 3b 57 18 73 07 ff cb 83 fb ff 75 c7 48 8b 08 48 8b 50 08 48 bf 00 01 10 00 00 00 ad de 49 b8 00 02 20 00 00 00 ad de 48 89 51 08 <48> 89 0a 83 78 24 ff 48 89 38 4c 89 40 08 75 15 49 8b 55 10 48
Mar 11 08:01:39 volans kernel: [250626.555447] RIP [<ffffffff81586eb1>] cache_alloc_refill+0x128/0x1e1
Mar 11 08:01:39 volans kernel: [250626.558260] RSP <ffff88082c0e1c48>
Mar 11 08:01:39 volans kernel: [250626.561072] CR2: 0000033800000000
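
For anyone reading the panic, the four hex digits after "Oops:" are the x86 page-fault error code, and its low bits say what kind of access failed. A small sketch of decoding it; the function name is made up for illustration, and the bit meanings are the standard x86 ones:

```python
import re

# x86 page-fault error-code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(line):
    """Return the meanings of the error-code bits found on an 'Oops:' line."""
    match = re.search(r"Oops:\s+([0-9a-fA-F]{4})\b", line)
    if match is None:
        raise ValueError("no 'Oops: NNNN' field found")
    code = int(match.group(1), 16)  # the kernel prints this field in hex
    meanings = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text:
            meanings.append(text)
    return meanings


print(decode_oops_code("Oops: 0002 [#1] PREEMPT SMP"))
# ['page not present', 'write access', 'kernel mode']
```

So the kernel itself tried to write to an unmapped address, and CR2 (0000033800000000) is the address it tried to write to.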

On 03/11/2013 01:56 AM, Paul in Japan wrote:
> Any suggestions of what to try?

The problem is clearly a kernel regression introduced between the standard
kernel in 11.4 and the one in 12.2. Not many people who read this list are
kernel developers, so we cannot help with the specifics. About all I can say
is that the crash happens in the SLAB memory-management routines
(cache_alloc_refill in the trace).
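
To see where that reading comes from: the faulting function is on the RIP line, and the frames under "Call Trace:" show how the kernel got there. A rough sketch that pulls the symbol names out of an oops, using a shortened copy of the trace from the original post as sample input; the helper name is hypothetical and the pattern ignores the bracketed addresses:

```python
import re

# "symbol+0xOFFSET/0xLENGTH", optionally followed by "[module]".
FRAME = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?")


def call_chain(oops_text):
    """Return (symbol, origin) pairs for the frames listed under 'Call Trace:'."""
    frames = []
    in_trace = False
    for line in oops_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
        elif in_trace and "Code:" in line:
            break  # the hex dump after the trace ends the frame list
        elif in_trace:
            match = FRAME.search(line)
            if match:  # frames without a symbol (user-space addresses) are skipped
                frames.append((match.group(1), match.group(2) or "kernel"))
    return frames


# Shortened excerpt of the trace in the original post.
SAMPLE = """\
Call Trace:
 [<ffffffff8114a99d>] kmem_cache_alloc_trace+0x1ad/0x1c0
 [<ffffffffa069375c>] alloc_nfs_open_context+0x4c/0x120 [nfs]
 [<ffffffffa0693984>] nfs_open+0x24/0x70 [nfs]
 [<ffffffff8115bcd1>] __dentry_open+0x221/0x300
 [<00002adb347cda00>] 0x2adb347cd9ff
Code: 3b 57 18 73 07
"""

for symbol, origin in call_chain(SAMPLE):
    print(f"{origin:>6}  {symbol}")
```

In this particular oops the allocation was requested by an NFS file open, and myri10ge does not appear in the call chain, so the trace shows where the fault surfaced in the allocator rather than which code caused it.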

I would try the following:

  1. Update to the latest kernel as described in the wireless forum under the
    title “Re: Wireless connection drops when downloading files (Corrected Post)”.
    The kernel patch log indicates a fix from October that might explain the problem.

  2. Use the driver built into the kernel.

  3. If the crashes still occur, report the problem (as a regression) at
    http://bugzilla.kernel.org, on the mailing lists linux-kernel@vger.kernel.org
    and netdev@vger.kernel.org, and to Andrew Gallatin <gallatin@myri.com> (the
    maintainer of the driver).
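
Before and after steps 1 and 2, and in any bug report, it helps to record exactly which kernel and which driver build are running. A minimal sketch, assuming sysfs is mounted at /sys and that the loaded myri10ge module exports a version file there (an assumption; if it does not, the version comes back as None). The function names are made up for illustration:

```python
import platform
from pathlib import Path


def module_version(name, sysfs="/sys/module"):
    """Return the version string a loaded module exports, or None."""
    try:
        return Path(sysfs, name, "version").read_text().strip()
    except OSError:
        return None  # module not loaded, or it exports no version file


def report(name="myri10ge"):
    """Collect the running kernel release and the driver version."""
    return {
        "kernel": platform.release(),
        "module": name,
        "version": module_version(name),
    }


if __name__ == "__main__":
    print(report())
```

Comparing the reported version with the "myri10ge: Version ..." line in the boot log shows whether the in-kernel driver or the separately built one is loaded.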

Good luck.