30 seconds freeze on boot - OpenSuse 11.4

On my HP 8560w, when I boot OpenSuse 11.4, everything freezes for EXACTLY 30 seconds (I timed it) before things return to normal.
It does not happen when I resume from sleep.
Even if I don’t log in from kdm but from a text terminal (Ctrl+Alt+F1), it will freeze for 30 seconds.
Weirdly enough, adding irqpoll to the boot options seemed to solve the problem, but I did notice a performance degradation, so I’m trying to find another solution.
Anyone?

Thanks

Note that the freeze is not triggered by the login, i.e. if I wait a bit before logging in, the freeze will still happen and prevent me from logging in (it always happens about 45 seconds after boot).
It also happens if I run in safe mode.

I don’t know if my other thread, "mcelog failing to start - OpenSuse 11.4", is related.

The freeze happens shortly after the following kernel messages are displayed:


[   23.108978] irq 17: nobody cared (try booting with the "irqpoll" option)
[   23.108981] Pid: 0, comm: kworker/0:1 Tainted: P            2.6.37.6-0.7-desktop #1
[   23.108983] Call Trace:
[   23.108992]  [<ffffffff810059b9>] dump_trace+0x79/0x340
[   23.108996]  [<ffffffff81522672>] dump_stack+0x69/0x6f
[   23.109000]  [<ffffffff810cc3be>] __report_bad_irq+0x1e/0x90
[   23.109003]  [<ffffffff810cc5d9>] note_interrupt+0x1a9/0x200
[   23.109006]  [<ffffffff810cd575>] handle_fasteoi_irq+0x105/0x140
[   23.109009]  [<ffffffff810058b5>] handle_irq+0x15/0x20
[   23.109011]  [<ffffffff810054fe>] do_IRQ+0x5e/0xe0
[   23.109014]  [<ffffffff81525f13>] ret_from_intr+0x0/0xa
[   23.109018]  [<ffffffff813f0a52>] poll_idle+0x32/0x70
[   23.109021]  [<ffffffff813f0b48>] cpuidle_idle_call+0xb8/0x370
[   23.109025]  [<ffffffff8100125c>] cpu_idle+0x4c/0xa0
[   23.109027] handlers:
[   23.109027] [<ffffffffa029b560>] (azx_interrupt+0x0/0x1a0 [snd_hda_intel])
[   23.109037] Disabling IRQ #17

That device is:


01:00.1 Audio device: nVidia Corporation GF108 High Definition Audio Controller (rev a1)
        Subsystem: Hewlett-Packard Company Device 1631
        Flags: bus master, fast devsel, latency 0, IRQ 17
        Memory at d1000000 (32-bit, non-prefetchable) [size=16]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Kernel driver in use: HDA Intel


Also note that before the freeze, /proc/interrupts shows:


 17:        219          0          0          0  IR-IO-APIC-fasteoi   hda_intel

while after the freeze you would get:


 17:        219          0     199782          0  IR-IO-APIC-fasteoi   hda_intel

So nearly 200000 interrupts happened in these 30 seconds!!
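As a rough check of that figure, here is a minimal Python sketch that sums the per-CPU counters of the two /proc/interrupts lines above and turns the difference into a rate. The helper name `irq_total` and the constant `FREEZE_SECONDS` are made up for this illustration, and the 30 seconds is the hand-timed value from the first post, not something measured by the script.

```python
def irq_total(line: str) -> int:
    """Sum the per-CPU counters on one /proc/interrupts line."""
    fields = line.split()
    total = 0
    # fields[0] is the IRQ label ("17:"); the numeric fields after it are per-CPU counts.
    for field in fields[1:]:
        if not field.isdigit():
            break  # reached the controller/driver name columns
        total += int(field)
    return total


# The two lines quoted in this post (before and after the freeze).
before = " 17:        219          0          0          0  IR-IO-APIC-fasteoi   hda_intel"
after = " 17:        219          0     199782          0  IR-IO-APIC-fasteoi   hda_intel"

FREEZE_SECONDS = 30  # assumption: duration timed by hand

delta = irq_total(after) - irq_total(before)
rate = delta / FREEZE_SECONDS
print(f"{delta} extra interrupts, about {rate:.0f} per second")
```

That works out to several thousand interrupts per second on IRQ 17 for the whole duration of the freeze.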

When the freeze ends, the count obviously stays constant. Note that during the freeze the cursor still works, and if I’m in text mode I can still type characters, but I can’t execute any command…

If I boot with irqpoll, the interrupt count stays low, which explains why the freeze doesn’t happen.


17:        197          0          7          0  IR-IO-APIC-fasteoi   hda_intel

Can anyone explain this?

On 2011-08-05 04:06, thefaser wrote:
>
> The freeze happens shortly after the following kernel messages are
> displayed:
>
>
> Code:
> --------------------
>
> [   23.108978] irq 17: nobody cared (try booting with the "irqpoll" option)
> [   23.108981] Pid: 0, comm: kworker/0:1 Tainted: P 2.6.37.6-0.7-desktop #1
> [   23.108983] Call Trace:

You have a kernel crash, bug, and dump info. You can take the entire dump
trace and send it to bugzilla.


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

On 08/05/2011 07:38 AM, Carlos E. R. wrote:
> On 2011-08-05 04:06, thefaser wrote:
>>
>> The freeze happens shortly after the following kernel messages are
>> displayed:
>>
>>
>> Code:
>> --------------------
>>
>> [   23.108978] irq 17: nobody cared (try booting with the "irqpoll" option)
>> [   23.108981] Pid: 0, comm: kworker/0:1 Tainted: P 2.6.37.6-0.7-desktop #1
>> [   23.108983] Call Trace:
>
> You have a kernel crash, bug, and dump info. You can take the entire dump
> trace and send it to bugzilla.

One word of advice. That “Tainted: P” means that you have loaded a proprietary
kernel module. Developers will not touch your bug report until you reproduce the
problem without tainting the kernel. As the thread seems to indicate that the
problem is your graphics driver, you will likely not be able to do that. In that
case, you will need to take that complaint to the supplier of that faulty,
proprietary driver.

On 2011-08-05 15:34, Larry Finger wrote:
> On 08/05/2011 07:38 AM, Carlos E. R. wrote:

>> trace and send it to bugzilla.
>
> One word of advice. That “Tainted: P” means that you have loaded a
> proprietary kernel module. Developers will not touch your bug report until

Ah, yes, you are right. There are several types of “tainted” and I forget
what each letter means.
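For reference, a small Python sketch that pulls the taint letters out of a trace line like the one quoted above and looks them up. The table covers only the letters I am sure of for kernels of this era, and the names `TAINT_FLAGS`, `taint_letters` and `describe_taint` are invented for this example, so treat it as a memory aid and check the kernel's own documentation for the full list.

```python
import re

# Subset of kernel taint letters (assumption: not exhaustive, newer kernels add more).
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "G": "only GPL-compatible modules were loaded",
    "F": "module was force-loaded",
    "R": "module was force-unloaded",
    "M": "machine check exception occurred",
    "B": "bad memory page was found",
    "U": "taint requested by userspace",
    "D": "kernel died recently (OOPS or BUG)",
    "A": "ACPI table was overridden",
    "W": "kernel issued a warning",
    "C": "staging driver was loaded",
    "I": "working around a severe firmware bug",
}


def taint_letters(log_line: str) -> str:
    """Return the taint letters found after 'Tainted:' in a trace line, or ''."""
    match = re.search(r"Tainted: ([A-Z ]+?)\s+\d", log_line)
    if not match:
        return ""  # e.g. the line says "Not tainted"
    return match.group(1).replace(" ", "")


def describe_taint(letters: str) -> list:
    """Map each taint letter to a short description."""
    return [TAINT_FLAGS.get(letter, "unknown flag") for letter in letters]


line = "Pid: 0, comm: kworker/0:1 Tainted: P            2.6.37.6-0.7-desktop #1"
for letter in taint_letters(line):
    print(letter, "=", TAINT_FLAGS.get(letter, "unknown flag"))
```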

> you reproduce the problem without tainting the kernel. As the thread seems
> to indicate that the problem is your graphics driver, you will likely not
> be able to do that. In that case, you will need to take that complaint to
> the supplier of that faulty, proprietary driver.

However, it may happen that the bug is in the kernel; but closed source
being involved, there is no way to prove either way.

An unfortunate situation.


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

On 08/05/2011 08:58 AM, Carlos E. R. wrote:
>
> However, it may happen that the bug is in the kernel; but closed source
> being involved, there is no way to prove either way.
>
> An unfortunate situation.

I don’t blame the developers. In fact, I share the philosophy. How can you debug
a system when there is no possibility of seeing the source of code that is given
unrestricted access?

The policy was instituted based on experience with the 2.4 series of kernels
where the devs could not tell if proprietary code had been loaded. They wasted a
lot of time trying to find bugs that were not from the public code, but came
from sloppy private code. In addition, vendors’ code can be really ugly and
lacking in error checking.

I was able to reproduce the bug without the proprietary nvidia driver, so the driver is not the cause. The kernel crash showing in the kernel messages is not the problem either: by disabling the sound device I got rid of the crash, but the freeze remains.
Oddly enough, the problem does not happen when I boot in single user mode, but it does if I switch to runlevel 3.
An exact 30 seconds is a little off for a kernel bug, don’t you think?
What services that are not run in single user mode involve a timeout of 30 seconds?

The freeze doesn’t happen at runlevel 2 either, so that narrows it down.
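One way to narrow it down further is to list the services that are started in runlevel 3 but not in runlevel 2. Below is a minimal Python sketch of that comparison. It assumes the usual SysV layout where each runlevel directory holds start links named S##service, and the paths /etc/init.d/rc2.d and /etc/init.d/rc3.d are my assumption for this system; the function names are made up for the example.

```python
import os
import re


def started_services(link_names):
    """Return the service names that have a start link (S##name) in a runlevel."""
    names = set()
    for entry in link_names:
        match = re.fullmatch(r"S\d+(.+)", entry)
        if match:
            names.add(match.group(1))
    return names


def added_between(lower_links, higher_links):
    """Services started in the higher runlevel but not in the lower one."""
    return sorted(started_services(higher_links) - started_services(lower_links))


# Assumed locations of the runlevel directories; adjust if they differ.
RC2_DIR = "/etc/init.d/rc2.d"
RC3_DIR = "/etc/init.d/rc3.d"

if os.path.isdir(RC2_DIR) and os.path.isdir(RC3_DIR):
    for name in added_between(os.listdir(RC2_DIR), os.listdir(RC3_DIR)):
        print(name)
```

Each service it prints can then be started by hand from runlevel 2 to see which one triggers the freeze.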

Seems to be triggered when I start NetworkManager, and it does happen to be linked to that kernel crash, since the crash happens right after I start NetworkManager.
The bug is still present with the 3.0 kernel.

On 08/05/2011 02:56 PM, thefaser wrote:
>
> Seems to be triggered when I start NetworkManager and it does happen to
> be linked to that kernel crash since the crash happens right after i
> start NetworkManager.
> The bug is still present with the 3.0 kernel.

IT IS NOT A KERNEL BUG!!!

Look at the file /etc/sysconfig/network/config. You will see a section that has
the following:

## Type: int
## Default: 30
#
# When using NetworkManager you may define a timeout to wait for NetworkManager
# to connect in /etc/init.d/network(-remotefs) script. Other network services
# may require the system to have a valid network setup in order to succeed.
#
# This variable has no effect if NETWORKMANAGER=no.
#
NM_ONLINE_TIMEOUT="30"

If you change the 30 to 0, it will not wait.
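To check what the setting currently is without opening an editor, here is a small Python sketch that reads one variable from sysconfig-style text. The helper name `read_sysconfig_value` and the embedded sample are only for illustration; the file path is the one named above.

```python
import os


def read_sysconfig_value(text, key):
    """Return the value assigned to key in sysconfig-style text, or None."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip().strip("\"'")  # drop surrounding quotes
    return None


# Sample mirroring the excerpt from the config file.
sample = (
    "## Type: int\n"
    "## Default: 30\n"
    "# This variable has no effect if NETWORKMANAGER=no.\n"
    'NM_ONLINE_TIMEOUT="30"\n'
)
print(read_sysconfig_value(sample, "NM_ONLINE_TIMEOUT"))

CONFIG_PATH = "/etc/sysconfig/network/config"
if os.path.isfile(CONFIG_PATH):
    with open(CONFIG_PATH) as handle:
        print(read_sysconfig_value(handle.read(), "NM_ONLINE_TIMEOUT"))
```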

Thanks for the reply, but it’s actually the opposite. I had already set that value to 0, and that’s why I’m noticing the freeze. If you leave it at 30, the connection timeout covers the freeze. So it confirms the freeze is related to NetworkManager. It seems NetworkManager has some other (maybe hardcoded?) 30-second timeout that is independent of this value.

On 2011-08-05 16:40, Larry Finger wrote:
> On 08/05/2011 08:58 AM, Carlos E. R. wrote:

> I don’t blame the developers. In fact, I share the philosophy. How can you
> debug a system when there is no possibility of seeing the source code that
> is given unrestricted access?

The same way you debug a closed source driver of company A on a system of
company B.


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

On 2011-08-05 20:26, thefaser wrote:
> What services that are not run in single user mode involve a timeout of
> 30 seconds?

Almost all. Runlevel 1 stops almost all services.


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

On 2011-08-05 21:56, thefaser wrote:
>
> Seems to be triggered when I start NetworkManager and it does happen to
> be linked to that kernel crash since the crash happens right after i
> start NetworkManager.
> The bug is still present with the 3.0 kernel.

There is a kernel crash only if you see a kernel crash dump like the one on
your first post, not if you see a delay.


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

Well, the kernel crash is not a necessary condition, as I’m able to reproduce the bug without having the mentioned kernel crash, but it does seem to be related (possibly a side effect of the freeze), as it happens right when the freeze starts (in the case where I get both).
Note that using ifup (disabling NetworkManager) completely messes up KDE with random freezes and graphical artifacts/black background…

I’m wondering if this has anything to do with my KDE 4.6.5 upgrade. Has anyone tried to upgrade?