Crash on SLES 11 SP3 with KVM

Hi all,

We have four Cisco UCS B200 M3 blades running SLES 11 SP3 as KVM hypervisors supporting SAP.

This morning one of the servers had a crash on 4 of its 13 vNICs.

The trace dumped to /var/log/messages is:

Aug 13 04:47:10 hv-1 kernel: [3511310.839912] ------------[ cut here ]------------
Aug 13 04:47:10 hv-1 kernel: [3511310.839922] WARNING: at /usr/src/packages/BUILD/kernel-default-3.0.101/linux-3.0/net/sched/sch_generic.c:255 dev_watchdog+0x23e/0x250()
Aug 13 04:47:10 hv-1 kernel: [3511310.839925] Hardware name: UCSB-B200-M3
Aug 13 04:47:10 hv-1 kernel: [3511310.839927] NETDEV WATCHDOG: kvm-cluster-pec (enic): transmit queue 0 timed out
Aug 13 04:47:10 hv-1 kernel: [3511310.839928] Modules linked in: af_packet ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables edd bridge stp llc cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop vhost_net macvtap macvlan tun kvm_intel kvm ipv6 ipv6_lib joydev pcspkr iTCO_wdt iTCO_vendor_support i2c_i801 button acpi_power_meter enic container ac sg wmi rtc_cmos ext3 jbd mbcache usbhid hid sd_mod crc_t10dif ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea ehci_hcd usbcore usb_common processor thermal_sys hwmon dm_service_time dm_least_pending dm_queue_length dm_round_robin dm_multipath scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh_rdac scsi_dh dm_snapshot dm_mod fnic libfcoe libfc scsi_transport_fc scsi_tgt megaraid_sas scsi_mod
Aug 13 04:47:10 hv-1 kernel: [3511310.839984] Supported: Yes
Aug 13 04:47:10 hv-1 kernel: [3511310.839987] Pid: 0, comm: swapper Not tainted 3.0.101-0.21-default #1
Aug 13 04:47:10 hv-1 kernel: [3511310.839989] Call Trace:
Aug 13 04:47:10 hv-1 kernel: [3511310.840020] [<ffffffff81004935>] dump_trace+0x75/0x310
Aug 13 04:47:10 hv-1 kernel: [3511310.840032] [<ffffffff8145e063>] dump_stack+0x69/0x6f
Aug 13 04:47:10 hv-1 kernel: [3511310.840041] [<ffffffff8106063b>] warn_slowpath_common+0x7b/0xc0
Aug 13 04:47:10 hv-1 kernel: [3511310.840049] [<ffffffff81060735>] warn_slowpath_fmt+0x45/0x50
Aug 13 04:47:10 hv-1 kernel: [3511310.840057] [<ffffffff813c071e>] dev_watchdog+0x23e/0x250
Aug 13 04:47:10 hv-1 kernel: [3511310.840069] [<ffffffff8106f4db>] call_timer_fn+0x6b/0x120
Aug 13 04:47:10 hv-1 kernel: [3511310.840077] [<ffffffff810708f3>] run_timer_softirq+0x173/0x240
Aug 13 04:47:10 hv-1 kernel: [3511310.840087] [<ffffffff8106770f>] __do_softirq+0x11f/0x260
Aug 13 04:47:10 hv-1 kernel: [3511310.840096] [<ffffffff81469fdc>] call_softirq+0x1c/0x30
Aug 13 04:47:10 hv-1 kernel: [3511310.840107] [<ffffffff81004435>] do_softirq+0x65/0xa0
Aug 13 04:47:10 hv-1 kernel: [3511310.840114] [<ffffffff810674d5>] irq_exit+0xc5/0xe0
Aug 13 04:47:10 hv-1 kernel: [3511310.840122] [<ffffffff81026588>] smp_apic_timer_interrupt+0x68/0xa0
Aug 13 04:47:10 hv-1 kernel: [3511310.840130] [<ffffffff81469773>] apic_timer_interrupt+0x13/0x20
Aug 13 04:47:10 hv-1 kernel: [3511310.840142] [<ffffffff812bd0c1>] intel_idle+0xa1/0x130
Aug 13 04:47:10 hv-1 kernel: [3511310.840152] [<ffffffff8137a9ab>] cpuidle_idle_call+0x11b/0x280
Aug 13 04:47:10 hv-1 kernel: [3511310.840161] [<ffffffff81002126>] cpu_idle+0x66/0xb0
Aug 13 04:47:10 hv-1 kernel: [3511310.840172] [<ffffffff81befeff>] start_kernel+0x376/0x447
Aug 13 04:47:10 hv-1 kernel: [3511310.840180] [<ffffffff81bef3c9>] x86_64_start_kernel+0x123/0x13d
Aug 13 04:47:10 hv-1 kernel: [3511310.840186] ---[ end trace f0165b8680ad586b ]---

As far as I can tell, this warning means the kernel's netdev watchdog found that TX queue 0 on that vNIC had not been serviced within the watchdog timeout, so it considers the queue hung. I cannot recover from this without shutting down the server.
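If the hang can be cleared without a full reboot, I would much prefer that. What I had in mind is something along these lines, using the interface name from the watchdog message (untested on the affected host so far, so treat it as a sketch):

ip link set kvm-cluster-pec down
ip link set kvm-cluster-pec up

# Last resort short of a reboot: reload the driver.
# Note this takes down ALL enic vNICs on the blade at once.
rmmod enic && modprobe enic

Has anyone tried something like this for a hung enic TX queue?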

So far, after several tests and a fair amount of debugging, I have made no progress on solving this or finding its root cause.
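In case it helps anyone suggest next steps, the sort of data I have been collecting on the hypervisor looks more or less like this (interface name taken from the trace; nothing suspicious turned up so far):

ethtool -i kvm-cluster-pec        # driver and firmware versions (enic)
ethtool -S kvm-cluster-pec        # per-queue and error counters from the driver
ip -s link show kvm-cluster-pec   # generic TX/RX errors and drops
grep enic /var/log/messages       # driver messages leading up to the watchdog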

On the UCS side there are no logs or errors regarding the NICs.

So I am asking here to see if anyone has seen this type of error on systems like these, and whether anyone can suggest a more insightful way of tracking it down.

NOTE: I have not rebooted the server, as we are still trying to find the root cause; it is not happening on the other three servers, which have the same configuration. All VMs were moved to the other hypervisors.

Thank you for your support.

Jorge Gomes

These are the openSUSE forums and NOT the SLES/SLED forums.

The SUSE Linux Enterprise forums are at https://forums.suse.com/forum.php

Thanks for the reply.