openSUSE Leap 15.5: failure to initialize PCI bus and GPU correctly

This problem started after some update in Leap 15.5. Sometimes my workstation (2x E5-2667 v3 with 160G on an HP Z840) will not boot into a graphical display (I’m talking post GRUB and initrd). The screen doesn’t go black; it just hangs. The host does come “up” and is on the network, but the display stays locked at the GRUB “loading initrd” stage.

GPU is a Radeon RX 5700 XT

This doesn’t happen all the time. Sometimes it will just boot right up; sometimes I have to reboot it many times. The most I’ve had to reboot is about 8 times. Most of the time I can get it to boot in fewer than 5 attempts.

For a while the problem seemed to go away, but then an Intel microcode update came in and I’m back to the problem again.

Posting in case anyone has an idea.

# diff bad1.out good1.out
53c53
< tsc: Detected 3192.859 MHz processor
---
> tsc: Detected 3192.498 MHz processor
415c415
< Memory: 2550096K/167685700K available (14336K kernel code, 3421K rwdata, 9360K rodata, 2868K init, 18132K bss, 2940348K reserved, 0K cma-reserved)
---
> Memory: 2549892K/167685700K available (14336K kernel code, 3421K rwdata, 9360K rodata, 2868K init, 18132K bss, 2940536K reserved, 0K cma-reserved)
457,458c457,458
< clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e05f04b2c2, max_idle_ns: 440795314213 ns
< Calibrating delay loop (skipped), value calculated using timer frequency.. 6385.71 BogoMIPS (lpj=12771436)
---
> clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e049b08001, max_idle_ns: 440795202126 ns
> Calibrating delay loop (skipped), value calculated using timer frequency.. 6384.99 BogoMIPS (lpj=12769992)
512,514c512,514
< smpboot: Total of 32 processors activated (204373.18 BogoMIPS)
< node 0 deferred pages initialised in 100ms
< node 1 deferred pages initialised in 100ms
---
> smpboot: Total of 32 processors activated (204350.54 BogoMIPS)
> node 0 deferred pages initialised in 96ms
> node 1 deferred pages initialised in 104ms
521c521
< PM: RTC time: 21:05:31, date: 2023-10-21
---
> PM: RTC time: 21:09:39, date: 2023-10-21
527c527
< audit: type=2000 audit(1697922330.340:1): state=initialized audit_enabled=0 res=1
---
> audit: type=2000 audit(1697922579.344:1): state=initialized audit_enabled=0 res=1
782d781
< pci 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
796,801c795,796
< pci 0000:03:00.0:   bridge window [io  0x0000-0x0fff]
< pci 0000:03:00.0:   bridge window [mem 0x00000000-0x000fffff]
< pci 0000:03:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
< pci 0000:04:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
< pci 0000:04:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
< pci 0000:04:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
---
> pci 0000:03:00.0:   bridge window [mem 0xe4000000-0xf20fffff]
> pci 0000:03:00.0:   bridge window [mem 0xb0000000-0xc9ffffff 64bit pref]
803,804c798,799
< pci 0000:05:00.0: reg 0x10: [mem 0x00000000-0x0003ffff]
< pci 0000:05:00.0: reg 0x14: [mem 0x00000000-0x00000fff]
---
> pci 0000:05:00.0: reg 0x10: [mem 0xf2000000-0xf203ffff]
> pci 0000:05:00.0: reg 0x14: [mem 0xf2040000-0xf2040fff]
808c803
< pci 0000:04:00.0: PCI bridge to [bus 05-31]
---
> pci 0000:04:00.0: PCI bridge to [bus 05]
810,811c805,808
< end is updated to 05
< end is updated to 05
---
> pci 0000:04:03.0: PCI bridge to [bus 06-2c]
> pci 0000:04:03.0:   bridge window [mem 0xe4000000-0xf1ffffff]
> pci 0000:04:03.0:   bridge window [mem 0xb0000000-0xc9ffffff 64bit pref]
> pci 0000:04:04.0: PCI bridge to [bus 2d]
901,905d897
< (conflicts with (null) [bus 04-05])
< pci 0000:04:03.0: PCI bridge to [bus 06-2c]
< pci 0000:04:03.0:   bridge window [mem 0xe4000000-0xf1ffffff]
< pci 0000:04:03.0:   bridge window [mem 0xb0000000-0xc9ffffff 64bit pref]
< cannot be assigned for them
907,915d898
< (conflicts with (null) [bus 04-05])
< pci 0000:04:04.0: PCI bridge to [bus 2d]
< cannot be assigned for them
< pci 0000:03:00.0: bridge has subordinate 05 but max busn 2d
< pci_bus 0000:04: Allocating resources
< pci 0000:04:00.0: can't claim BAR 14 [mem 0xf2000000-0xf20fffff]: no compatible bridge window
< pci 0000:04:03.0: can't claim BAR 14 [mem 0xe4000000-0xf1ffffff]: no compatible bridge window
< pci 0000:04:03.0: can't claim BAR 15 [mem 0xb0000000-0xc9ffffff 64bit pref]: no compatible bridge window
< add_size 1000
917d899
< add_size 200000 add_align 100000
919d900
< add_size 100000 add_align 100000
921,926d901
< add_size 1000
< add_size 200000 add_align 100000
< add_size 200000 add_align 100000
< add_size 1000
< add_size 200000 add_align 100000
< add_size 200000 add_align 100000
928,970d902
< add_size 4000
< add_size 600000 add_align 100000
< add_size 500000 add_align 100000
< add_size 4000
< add_size 600000 add_align 100000
< add_size 500000 add_align 100000
< pci 0000:00:02.0: BAR 14: assigned [mem 0xb0000000-0xbe5fffff]
< pci 0000:00:02.0: BAR 15: assigned [mem 0xbe600000-0xd8bfffff 64bit pref]
< pci 0000:00:02.0: BAR 13: assigned [io  0x1000-0x4fff]
< pci 0000:03:00.0: BAR 14: assigned [mem 0xb0000000-0xb07fffff]
< pci 0000:03:00.0: BAR 15: assigned [mem 0xbe600000-0xbeefffff 64bit pref]
< pci 0000:03:00.0: BAR 13: assigned [io  0x1000-0x4fff]
< pci 0000:04:00.0: BAR 14: assigned [mem 0xb0000000-0xb01fffff]
< pci 0000:04:00.0: BAR 15: assigned [mem 0xbe600000-0xbe7fffff 64bit pref]
< pci 0000:04:03.0: BAR 14: assigned [mem 0xb0200000-0xb03fffff]
< pci 0000:04:03.0: BAR 15: assigned [mem 0xbe800000-0xbe9fffff 64bit pref]
< pci 0000:04:04.0: BAR 14: assigned [mem 0xb0400000-0xb05fffff]
< pci 0000:04:04.0: BAR 15: assigned [mem 0xbea00000-0xbebfffff 64bit pref]
< pci 0000:04:00.0: BAR 13: assigned [io  0x1000-0x1fff]
< pci 0000:04:03.0: BAR 13: assigned [io  0x2000-0x2fff]
< pci 0000:04:04.0: BAR 13: assigned [io  0x3000-0x3fff]
< pci 0000:05:00.0: BAR 0: assigned [mem 0xb0000000-0xb003ffff]
< pci 0000:05:00.0: BAR 1: assigned [mem 0xb0040000-0xb0040fff]
< pci 0000:04:00.0: PCI bridge to [bus 05]
< pci 0000:04:00.0:   bridge window [io  0x1000-0x1fff]
< pci 0000:04:00.0:   bridge window [mem 0xb0000000-0xb01fffff]
< pci 0000:04:00.0:   bridge window [mem 0xbe600000-0xbe7fffff 64bit pref]
< pci 0000:04:03.0: PCI bridge to [bus 06-2c]
< pci 0000:04:03.0:   bridge window [io  0x2000-0x2fff]
< pci 0000:04:03.0:   bridge window [mem 0xb0200000-0xb03fffff]
< pci 0000:04:03.0:   bridge window [mem 0xbe800000-0xbe9fffff 64bit pref]
< pci 0000:04:04.0: PCI bridge to [bus 2d]
< pci 0000:04:04.0:   bridge window [io  0x3000-0x3fff]
< pci 0000:04:04.0:   bridge window [mem 0xb0400000-0xb05fffff]
< pci 0000:04:04.0:   bridge window [mem 0xbea00000-0xbebfffff 64bit pref]
< pci 0000:03:00.0: PCI bridge to [bus 04-05]
< pci 0000:03:00.0:   bridge window [io  0x1000-0x4fff]
< pci 0000:03:00.0:   bridge window [mem 0xb0000000-0xb07fffff]
< pci 0000:03:00.0:   bridge window [mem 0xbe600000-0xbeefffff 64bit pref]
< pci 0000:00:02.0: PCI bridge to [bus 03-31]
< pci 0000:00:02.0:   bridge window [io  0x1000-0x4fff]
< pci 0000:00:02.0:   bridge window [mem 0xb0000000-0xbe5fffff]
< pci 0000:00:02.0:   bridge window [mem 0xbe600000-0xd8bfffff 64bit pref]
978,1002d909
< pci 0000:00:01.0: can't claim BAR 13 [io  0x3000-0x3fff]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:00:03.0: can't claim BAR 13 [io  0x2000-0x2fff]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:00:03.0: can't claim BAR 15 [mem 0xd0000000-0xe01fffff 64bit pref]: address conflict with PCI Bus 0000:03 [mem 0xbe600000-0xd8bfffff 64bit pref]
< pci 0000:32:00.0: can't claim BAR 13 [io  0x2000-0x2fff]: no compatible bridge window
< pci 0000:32:00.0: can't claim BAR 15 [mem 0xd0000000-0xe01fffff 64bit pref]: no compatible bridge window
< pci 0000:33:00.0: can't claim BAR 13 [io  0x2000-0x2fff]: no compatible bridge window
< pci 0000:33:00.0: can't claim BAR 15 [mem 0xd0000000-0xe01fffff 64bit pref]: no compatible bridge window
< pci 0000:00:1c.0: can't claim BAR 13 [io  0x1000-0x1fff]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:01:00.0: can't claim BAR 0 [io  0x3000-0x30ff]: no compatible bridge window
< pci 0000:34:00.0: can't claim BAR 0 [mem 0xd0000000-0xdfffffff 64bit pref]: no compatible bridge window
< pci 0000:34:00.0: can't claim BAR 2 [mem 0xe0000000-0xe01fffff 64bit pref]: no compatible bridge window
< pci 0000:34:00.0: can't claim BAR 4 [io  0x2000-0x20ff]: no compatible bridge window
< pci 0000:00:11.4: can't claim BAR 0 [io  0x4098-0x409f]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:00:11.4: can't claim BAR 1 [io  0x40ac-0x40af]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:00:11.4: can't claim BAR 2 [io  0x4090-0x4097]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:00:11.4: can't claim BAR 3 [io  0x40a8-0x40ab]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:00:11.4: can't claim BAR 4 [io  0x4060-0x407f]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:00:1f.2: can't claim BAR 0 [io  0x4088-0x408f]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:00:1f.2: can't claim BAR 1 [io  0x40a4-0x40a7]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:00:1f.2: can't claim BAR 2 [io  0x4080-0x4087]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:00:1f.2: can't claim BAR 3 [io  0x40a0-0x40a3]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:00:1f.2: can't claim BAR 4 [io  0x4020-0x403f]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:00:19.0: can't claim BAR 2 [io  0x4040-0x405f]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
< pci 0000:35:00.0: can't claim BAR 2 [io  0x1000-0x101f]: no compatible bridge window
< pci 0000:00:1f.3: can't claim BAR 4 [io  0x4000-0x401f]: address conflict with PCI Bus 0000:03 [io  0x1000-0x4fff]
1049,1065c956,964
< pci 0000:00:03.0: BAR 15: no space for [mem size 0x18000000 64bit pref]
< pci 0000:00:03.0: BAR 15: failed to assign [mem size 0x18000000 64bit pref]
< pci 0000:00:01.0: BAR 13: assigned [io  0x5000-0x5fff]
< pci 0000:00:03.0: BAR 13: assigned [io  0x6000-0x6fff]
< pci 0000:00:1c.0: BAR 13: assigned [io  0x7000-0x7fff]
< pci 0000:00:11.4: BAR 4: assigned [io  0x8000-0x801f]
< pci 0000:00:19.0: BAR 2: assigned [io  0x8020-0x803f]
< pci 0000:00:1f.2: BAR 4: assigned [io  0x8040-0x805f]
< pci 0000:00:1f.3: BAR 4: assigned [io  0x8060-0x807f]
< pci 0000:00:11.4: BAR 0: assigned [io  0x8080-0x8087]
< pci 0000:00:11.4: BAR 2: assigned [io  0x8088-0x808f]
< pci 0000:00:1f.2: BAR 0: assigned [io  0x8090-0x8097]
< pci 0000:00:1f.2: BAR 2: assigned [io  0x8098-0x809f]
< pci 0000:00:11.4: BAR 1: assigned [io  0x80a0-0x80a3]
< pci 0000:00:11.4: BAR 3: assigned [io  0x80a4-0x80a7]
< pci 0000:00:1f.2: BAR 1: assigned [io  0x80a8-0x80ab]
< pci 0000:00:1f.2: BAR 3: assigned [io  0x80ac-0x80af]
---
> add_size 1000
> add_size 200000 add_align 100000
> add_size 1000
> add_size 1000
> add_size 200000 add_align 100000
> add_size 200000 add_align 100000
> add_size 3000
> add_size 3000
> pci 0000:00:02.0: BAR 13: assigned [io  0x5000-0x7fff]
1068d966
< pci 0000:01:00.0: BAR 0: assigned [io  0x5000-0x50ff]
1070c968
< pci 0000:00:01.0:   bridge window [io  0x5000-0x5fff]
---
> pci 0000:00:01.0:   bridge window [io  0x3000-0x3fff]
1072a971,986
> pci 0000:03:00.0: BAR 13: assigned [io  0x5000-0x7fff]
> pci 0000:04:00.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
> pci 0000:04:00.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
> pci 0000:04:04.0: BAR 14: no space for [mem size 0x00200000]
> pci 0000:04:04.0: BAR 14: failed to assign [mem size 0x00200000]
> pci 0000:04:04.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
> pci 0000:04:04.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
> pci 0000:04:00.0: BAR 13: assigned [io  0x5000-0x5fff]
> pci 0000:04:03.0: BAR 13: assigned [io  0x6000-0x6fff]
> pci 0000:04:04.0: BAR 13: assigned [io  0x7000-0x7fff]
> pci 0000:04:04.0: BAR 14: no space for [mem size 0x00200000]
> pci 0000:04:04.0: BAR 14: failed to assign [mem size 0x00200000]
> pci 0000:04:04.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
> pci 0000:04:04.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
> pci 0000:04:00.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
> pci 0000:04:00.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
1074,1076c988,989
< pci 0000:04:00.0:   bridge window [io  0x1000-0x1fff]
< pci 0000:04:00.0:   bridge window [mem 0xb0000000-0xb01fffff]
< pci 0000:04:00.0:   bridge window [mem 0xbe600000-0xbe7fffff 64bit pref]
---
> pci 0000:04:00.0:   bridge window [io  0x5000-0x5fff]
> pci 0000:04:00.0:   bridge window [mem 0xf2000000-0xf20fffff]
1078,1080c991,993
< pci 0000:04:03.0:   bridge window [io  0x2000-0x2fff]
< pci 0000:04:03.0:   bridge window [mem 0xb0200000-0xb03fffff]
< pci 0000:04:03.0:   bridge window [mem 0xbe800000-0xbe9fffff 64bit pref]
---
> pci 0000:04:03.0:   bridge window [io  0x6000-0x6fff]
> pci 0000:04:03.0:   bridge window [mem 0xe4000000-0xf1ffffff]
> pci 0000:04:03.0:   bridge window [mem 0xb0000000-0xc9ffffff 64bit pref]
1082,1088c995,999
< pci 0000:04:04.0:   bridge window [io  0x3000-0x3fff]
< pci 0000:04:04.0:   bridge window [mem 0xb0400000-0xb05fffff]
< pci 0000:04:04.0:   bridge window [mem 0xbea00000-0xbebfffff 64bit pref]
< pci 0000:03:00.0: PCI bridge to [bus 04-05]
< pci 0000:03:00.0:   bridge window [io  0x1000-0x4fff]
< pci 0000:03:00.0:   bridge window [mem 0xb0000000-0xb07fffff]
< pci 0000:03:00.0:   bridge window [mem 0xbe600000-0xbeefffff 64bit pref]
---
> pci 0000:04:04.0:   bridge window [io  0x7000-0x7fff]
> pci 0000:03:00.0: PCI bridge to [bus 04-31]
> pci 0000:03:00.0:   bridge window [io  0x5000-0x7fff]
> pci 0000:03:00.0:   bridge window [mem 0xe4000000-0xf20fffff]
> pci 0000:03:00.0:   bridge window [mem 0xb0000000-0xc9ffffff 64bit pref]
1090,1103c1001,1003
< pci 0000:00:02.0:   bridge window [io  0x1000-0x4fff]
< pci 0000:00:02.0:   bridge window [mem 0xb0000000-0xbe5fffff]
< pci 0000:00:02.0:   bridge window [mem 0xbe600000-0xd8bfffff 64bit pref]
< pci 0000:32:00.0: BAR 15: no space for [mem size 0x18000000 64bit pref]
< pci 0000:32:00.0: BAR 15: failed to assign [mem size 0x18000000 64bit pref]
< pci 0000:32:00.0: BAR 13: assigned [io  0x6000-0x6fff]
< pci 0000:33:00.0: BAR 15: no space for [mem size 0x18000000 64bit pref]
< pci 0000:33:00.0: BAR 15: failed to assign [mem size 0x18000000 64bit pref]
< pci 0000:33:00.0: BAR 13: assigned [io  0x6000-0x6fff]
< pci 0000:34:00.0: BAR 0: no space for [mem size 0x10000000 64bit pref]
< pci 0000:34:00.0: BAR 0: failed to assign [mem 0xd0000000-0xdfffffff 64bit pref]
< pci 0000:34:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
< pci 0000:34:00.0: BAR 2: failed to assign [mem 0xe0000000-0xe01fffff 64bit pref]
< pci 0000:34:00.0: BAR 4: assigned [io  0x6000-0x60ff]
---
> pci 0000:00:02.0:   bridge window [io  0x5000-0x7fff]
> pci 0000:00:02.0:   bridge window [mem 0xe4000000-0xf20fffff]
> pci 0000:00:02.0:   bridge window [mem 0xb0000000-0xc9ffffff 64bit pref]
1105c1005
< pci 0000:33:00.0:   bridge window [io  0x6000-0x6fff]
---
> pci 0000:33:00.0:   bridge window [io  0x2000-0x2fff]
1106a1007
> pci 0000:33:00.0:   bridge window [mem 0xd0000000-0xe01fffff 64bit pref]
1108c1009
< pci 0000:32:00.0:   bridge window [io  0x6000-0x6fff]
---
> pci 0000:32:00.0:   bridge window [io  0x2000-0x2fff]
1109a1011
> pci 0000:32:00.0:   bridge window [mem 0xd0000000-0xe01fffff 64bit pref]
1111c1013
< pci 0000:00:03.0:   bridge window [io  0x6000-0x6fff]
---
> pci 0000:00:03.0:   bridge window [io  0x2000-0x2fff]
1113c1015
< pci 0000:35:00.0: BAR 2: assigned [io  0x7000-0x701f]
---
> pci 0000:00:03.0:   bridge window [mem 0xd0000000-0xe01fffff 64bit pref]
1115c1017
< pci 0000:00:1c.0:   bridge window [io  0x7000-0x7fff]
---
> pci 0000:00:1c.0:   bridge window [io  0x1000-0x1fff]
1119d1020
< pci_bus 0000:00: Some PCI device resources are unassigned, try booting with pci=realloc
1124c1025
< pci_bus 0000:01: resource 0 [io  0x5000-0x5fff]
---
> pci_bus 0000:01: resource 0 [io  0x3000-0x3fff]
1126,1141c1027,1039
< pci_bus 0000:03: resource 0 [io  0x1000-0x4fff]
< pci_bus 0000:03: resource 1 [mem 0xb0000000-0xbe5fffff]
< pci_bus 0000:03: resource 2 [mem 0xbe600000-0xd8bfffff 64bit pref]
< pci_bus 0000:04: resource 0 [io  0x1000-0x4fff]
< pci_bus 0000:04: resource 1 [mem 0xb0000000-0xb07fffff]
< pci_bus 0000:04: resource 2 [mem 0xbe600000-0xbeefffff 64bit pref]
< pci_bus 0000:05: resource 0 [io  0x1000-0x1fff]
< pci_bus 0000:05: resource 1 [mem 0xb0000000-0xb01fffff]
< pci_bus 0000:05: resource 2 [mem 0xbe600000-0xbe7fffff 64bit pref]
< pci_bus 0000:06: resource 0 [io  0x2000-0x2fff]
< pci_bus 0000:06: resource 1 [mem 0xb0200000-0xb03fffff]
< pci_bus 0000:06: resource 2 [mem 0xbe800000-0xbe9fffff 64bit pref]
< pci_bus 0000:2d: resource 0 [io  0x3000-0x3fff]
< pci_bus 0000:2d: resource 1 [mem 0xb0400000-0xb05fffff]
< pci_bus 0000:2d: resource 2 [mem 0xbea00000-0xbebfffff 64bit pref]
< pci_bus 0000:32: resource 0 [io  0x6000-0x6fff]
---
> pci_bus 0000:03: resource 0 [io  0x5000-0x7fff]
> pci_bus 0000:03: resource 1 [mem 0xe4000000-0xf20fffff]
> pci_bus 0000:03: resource 2 [mem 0xb0000000-0xc9ffffff 64bit pref]
> pci_bus 0000:04: resource 0 [io  0x5000-0x7fff]
> pci_bus 0000:04: resource 1 [mem 0xe4000000-0xf20fffff]
> pci_bus 0000:04: resource 2 [mem 0xb0000000-0xc9ffffff 64bit pref]
> pci_bus 0000:05: resource 0 [io  0x5000-0x5fff]
> pci_bus 0000:05: resource 1 [mem 0xf2000000-0xf20fffff]
> pci_bus 0000:06: resource 0 [io  0x6000-0x6fff]
> pci_bus 0000:06: resource 1 [mem 0xe4000000-0xf1ffffff]
> pci_bus 0000:06: resource 2 [mem 0xb0000000-0xc9ffffff 64bit pref]
> pci_bus 0000:2d: resource 0 [io  0x7000-0x7fff]
> pci_bus 0000:32: resource 0 [io  0x2000-0x2fff]
1143c1041,1042
< pci_bus 0000:33: resource 0 [io  0x6000-0x6fff]
---
> pci_bus 0000:32: resource 2 [mem 0xd0000000-0xe01fffff 64bit pref]
> pci_bus 0000:33: resource 0 [io  0x2000-0x2fff]
1145c1044,1045
< pci_bus 0000:34: resource 0 [io  0x6000-0x6fff]
---
> pci_bus 0000:33: resource 2 [mem 0xd0000000-0xe01fffff 64bit pref]
> pci_bus 0000:34: resource 0 [io  0x2000-0x2fff]
1147c1047,1048
< pci_bus 0000:35: resource 0 [io  0x7000-0x7fff]
---
> pci_bus 0000:34: resource 2 [mem 0xd0000000-0xe01fffff 64bit pref]
> pci_bus 0000:35: resource 0 [io  0x1000-0x1fff]
1160d1060
< pci 0000:03:00.0: CLS mismatch (32 != 128), using 64 bytes
1162a1063
> pci 0000:35:00.0: CLS mismatch (32 != 128), using 64 bytes
1164a1066
> Trying to unpack rootfs image as initramfs...
1167d1068
< Trying to unpack rootfs image as initramfs...
1191c1092
< pcieport 0000:04:04.0: enabling device (0104 -> 0107)
---
> pcieport 0000:04:04.0: enabling device (0104 -> 0105)
1208,1211c1109,1114
< pci 0000:34:00.0: BAR has moved, updating efifb address
< efifb: cannot reserve video memory at 0x0
< efifb: video memory @ 0x0 spans multiple EFI memory regions
< efi-framebuffer: probe of efi-framebuffer.0 failed with error -5
---
> efifb: framebuffer at 0xd0000000, using 5120k, total 5120k
> efifb: mode is 1280x1024x32, linelength=5120, pages=1
> efifb: scrolling: redraw
> efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
> Console: switching to colour frame buffer device 160x64
> fb0: EFI VGA frame buffer device
1221c1124
< sched_clock: Marking stable (1502764875, 1303589)->(1534801548, -30733084)
---
> sched_clock: Marking stable (1500543386, 1269877)->(1532491639, -30678376)
1255,1256c1158,1160
< PM:   Magic number: 11:345:96
< pci_express 0000:04:04.0:pcie202: hash matches
---
> PM:   Magic number: 11:448:197
> acpi PNP0100:00: hash matches
> acpi device:52: hash matches
1295a1200,1202
> tsc: Refined TSC clocksource calibration: 3192.606 MHz
> clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2e050166e04, max_idle_ns: 440795273449 ns
> clocksource: Switched to clocksource tsc
1299,1301d1205
< tsc: Refined TSC clocksource calibration: 3192.607 MHz
< clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2e0501eb3d1, max_idle_ns: 440795254769 ns
< clocksource: Switched to clocksource tsc
1302a1207
> systemd[1]: Finished Setup Virtual Console.
1305a1211
> systemd[1]: Starting dracut ask for additional cmdline parameters...
1308,1310d1213
< systemd[1]: Finished Setup Virtual Console.
< systemd[1]: Finished Apply Kernel Variables.
< systemd[1]: Starting dracut ask for additional cmdline parameters...
1321c1224
< mpt2sas_cm0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (164931400 kB)
---
> mpt2sas_cm0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (164931416 kB)
1353a1257
> ACPI: bus type drm_connector registered
1367d1270
< ACPI: bus type drm_connector registered
1377d1279
< scsi host2: ahci
1378a1281
> scsi host2: ahci
1398a1302
> ahci 0000:86:00.0: AHCI 0001.0300 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
1399a1304
> ahci 0000:86:00.0: flags: 64bit ncq led clo only pio ccc 
1400a1306
> scsi host9: ahci

… rest of diff snipped, because I think the above is enough.
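
For anyone wanting to produce a comparable diff: the output above carries no timestamps, which is what keeps two boots comparable line by line. Here is a minimal sketch of one way to get there, assuming the raw logs were saved from dmesg with its default `[seconds.micros]` prefix. The `.raw` file names and the `strip_ts` helper are made up for illustration; this is not necessarily how bad1.out and good1.out were created.

```shell
# strip_ts: remove the leading "[  123.456789] " kernel timestamp from each line
# read on stdin, so that two boots can be diffed without every line differing.
strip_ts() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] ?//'
}

# Hypothetical usage (the .raw file names are assumptions):
#   dmesg > bad1.raw                      # saved over ssh on a boot that hung
#   dmesg > good1.raw                     # saved on a boot that came up
#   strip_ts < bad1.raw  > bad1.out
#   strip_ts < good1.raw > good1.out
#   diff bad1.out good1.out
```

Running dmesg with its -t (--notime) option should give timestamp-free output directly; the helper is only useful for logs that were already saved with timestamps.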

I am posting from the above workstation. It took about 7 reboots this time, which is on the high side. Once it’s up, it’s very, very stable.

Edit: While the host can dual boot, I don’t get into Windows often, but can tell you that Windows boots every time.

@cjcox Hi there :smile:

So are you running rasdaemon? (Probably worthwhile; I run it here on an HP Z440.)

What are your kernel options set to (cat /proc/cmdline)? I use acpi_osi=Linux intel_iommu=on iommu=pt.

Updated the BIOS lately?

Currently I just have mitigations=auto, but at one time I did have the parameters you mentioned. I was also experimenting with mitigations=off, but so far none of the above has helped.

Updated the BIOS to latest (just now), problem persists.

Again, booting Windows is no problem.

Intermittent problems of this nature are not unusually RAM-related. I would run a memory tester, e.g. memtest86, for several passes (around 4 passes minimum, at least several hours, but better overnight).

@mrmazda Z series workstations have quad-channel ECC RAM… that’s why running rasdaemon will show any errors :wink:

ras-mc-ctl --summary

No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.
No disk errors.
No Memory failure errors.

No MCE errors.

@cjcox What GPU(s) are in use? Check with /sbin/lspci -nnk | grep -EA3 "VGA|Display|3D". I would suggest intel_iommu=on iommu=pt.

I also see the following on my T400 (in slot 4):

 journalctl -b | grep BAR
Nov 25 20:57:52 grover kernel: pci 0000:04:00.0: BAR 1: assigned to efifb
Nov 25 20:57:52 grover kernel: pci 0000:04:00.0: can't claim BAR 6 [mem 0xfff80000-0xffffffff pref]: no compatible bridge window
Nov 25 20:57:52 grover kernel: pci 0000:00:1f.3: BAR 0: assigned [mem 0xd2000000-0xd20000ff 64bit]
Nov 25 20:57:52 grover kernel: pci 0000:04:00.0: BAR 6: assigned [mem 0xf3080000-0xf30fffff pref]

Have, or merely support? The CPU was introduced over 9 years ago. IOW, must it have ECC RAM, or could it have non-ECC RAM? The CPU specs only say “support”.

GPU was reported in OP.

@mrmazda Must have. The Z440 takes Registered ECC DIMMs only; I’m maxed out at 128GB. The Z840 takes ECC Registered DIMMs (<=256GB) and ECC Load Reducing DIMMs (<=2024GB).

HP workstations are a different beast: long-term hardware support, etc. My CPU (E5-2695 v4) only lost servicing support in June last year, and BIOS updates are available. Switching from a v3 to a v4 allows a TPM 2.0 upgrade; for me the switch (US$80) took me from a 12-core to an 18-core.

You can always check what came with an HP system via the serial number here: http://partsurfer.hp.com/Search.aspx

@cjcox I suspect it’s a kernel bug with amdgpu, maybe a funky backport, or the lack of one, in the kernel… See openSUSE:Submitting bug reports - openSUSE Wiki

You could try adding pci=nommconf to the kernel boot parameters, which disables Memory-Mapped PCI Configuration Space & reverts to the traditional handling of configuration space.

The other thing I do is set the slot type/speed and disable unused slots in the BIOS.

I wonder if it’s a PCIe 4.0 GPU in a PCIe 3.0 slot…

First, thanks for all the suggestions.

I have 160G of Reg ECC DDR4 Samsung 2133, 8 x 16GB + 4 x 8GB, all inserted appropriately.
I have run memtest overnight, one of the first things I tried. Didn’t find anything, but I haven’t tried just randomly removing sticks yet.

I have tried various kernel parameters, all of the above, including the mentioned pci=nommconf, intel_iommu=on/off, iommu=pt, and mitigations=auto/off.
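
When cycling through this many parameters, it is easy to lose track of what a given boot actually used. Here is a minimal sketch for checking which of them are present on the running kernel's command line. The function names `has_param` and `report_params` are made up for illustration, and the check only looks at /proc/cmdline style text, not at whether the kernel honoured the option.

```shell
# has_param CMDLINE NAME: succeed if NAME (bare flag or NAME=value) appears as a
# whole word in the kernel command line string CMDLINE.
has_param() {
    for word in $1; do
        case "$word" in
            "$2"|"$2"=*) return 0 ;;
        esac
    done
    return 1
}

# report_params CMDLINE NAME...: print one line per NAME saying whether it is set.
report_params() {
    cmdline=$1
    shift
    for name in "$@"; do
        if has_param "$cmdline" "$name"; then
            echo "$name: set"
        else
            echo "$name: not set"
        fi
    done
}

# Usage on a live system:
#   report_params "$(cat /proc/cmdline)" pci intel_iommu iommu mitigations
```

Matching on whole words matters here: a plain substring search for iommu would also hit intel_iommu=on and report the wrong thing.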

The sporadic nature of the ability to boot tends to indicate a possible “timing” issue with regard to bus initialization, but I haven’t dug deep into this. I’ve seen others post about problems that sound similar to mine, but those of us with dual-socket E5 workstations are somewhat rare. Again, the box will boot and is usable, just without a graphics head. When it does boot properly, everything is OK: zero issues, no crashes, or anything like that.

Windows will boot every time.

I did install and am now running rasdaemon. Thanks for the suggestion on that. It’s interesting even if it doesn’t yield anything at the moment.

My box has an older TPM 1.2. As interesting as it might be to go to 2.0 for Windows 11, etc… the older CPUs aren’t supported there anyhow. I don’t think TPM has anything to do with anything.

The GPU is an RX 5700 XT, so it is “generationally” OK, and again, it was working for quite some time even with the PCIe 3.0 limitation. Is it part of the problem? Possibly, as some have suggested. It has crossed my mind. Maybe something is slightly off there, either hardware-wise (now) or, as some suggested, with regard to the driver (?).

I have been tempted to get a newer card (but, money). My PSU is the larger 1125W variety. I have one 6pin and the other 2 6pins go to a converter to turn those into a single 8pin for the GPU. Prior to having the Radeon, I had an Nvidia Quadro K5200.

The problem first showed up for me in the Sept/Oct 2023 time frame. Can’t be absolutely sure.

@cjcox So as long as rasdaemon is running, check the summary output for errors if they pop up…

There are a number of Forum users here with Z440, Z640 and Z840 systems with single and dual CPUs :wink:

Have you looked at the slot settings in the BIOS? Swap the card to the other PCIe x16 slot and see what happens?

So it will boot to multi-user target (runlevel 3), just not the desktop?

For upgrading, I just added the registry hack to skip the TPM and CPU check (which it failed on…), as I had just installed Windows 10 Pro for Workstations and upgraded to Windows 11 (testing WSL2 and rancher-desktop).

regedit

Computer\HKEY_LOCAL_MACHINE\SYSTEM\Setup\MoSetup

New DWORD (32-bit)

AllowUpgradesWithUnsupportedTPMOrCPU and value=1

I opted for a Quadro T400 here for driving screens (three) and an Nvidia Tesla P4 for offload/compute. My other Z440 has an AMD RX550 in it…

Correct. It goes to GRUB, then (seemingly) hangs right after loading the initrd on boot. From there on out, no console.

But the machine does boot. Again, it’s trying to boot to the graphical target, so that IS the target, but effectively you end up with the equivalent of multi-user. When this happens, from another host I do a sleep 30; ssh root@z840 reboot (to wait until the host is available on the network, then reboot it).
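
A fixed sleep 30 works, but it guesses at how long the host needs. Here is a minimal sketch of a polling variant, assuming ssh is reachable as soon as the host is up. The `wait_until` helper is made up for illustration; z840 is the hostname used above, and the retry count and delay are arbitrary.

```shell
# wait_until TRIES DELAY CMD...: run CMD up to TRIES times, sleeping DELAY seconds
# after each failed attempt; succeed as soon as CMD succeeds, fail if it never does.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical usage from another host:
#   wait_until 30 5 ssh -o ConnectTimeout=5 root@z840 true && ssh root@z840 reboot
```

As written, the usage line reboots the host on every boot it is run against, good or bad, so it is only meant to be started by hand once the display has visibly hung.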

The big post showing the dmesg differences shows all the pci bridge confusion.
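
Based on that diff, a hung boot can probably be recognised from its kernel log alone. Here is a minimal sketch that checks for three messages which appear only on the "<" (bad) side of the diff above. The helper name `looks_like_bad_boot` is made up, and treating these strings as reliable markers is an assumption drawn from that single pair of boots.

```shell
# looks_like_bad_boot: read a kernel log on stdin and succeed if it contains any of
# the messages seen only on the bad side of the diff (fixed-string match).
looks_like_bad_boot() {
    grep -q -F \
        -e 'bridge configuration invalid' \
        -e 'efifb: cannot reserve video memory' \
        -e 'Some PCI device resources are unassigned'
}

# Hypothetical usage from another host (z840 is the hostname used in this thread):
#   ssh root@z840 dmesg | looks_like_bad_boot && echo "bad boot detected"
```

A check like this would let a remote reboot be limited to boots that actually show the bridge confusion, rather than rebooting unconditionally.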

@cjcox OK, so let’s eliminate Plymouth with the boot option plymouth.enable=0. Then you can try nomodeset (resolution will be sub-optimal) and maybe throw in amdgpu.dc=0 to turn off the amdgpu Display Core (DC) code.

So are you running Wayland or X11 for the desktop environment? If X11, it would be interesting to see Xorg.0.log (in /var/log/ or ~/.local/share/xorg/) after you ssh in.

Also what devices are in PCIe slots? /sbin/lspci -nnk | grep -EA3 "04:00.0|04:03.0|

@cjcox Some more info: I think the issue is the Resizable BAR (PCIe 4.0) and the Leap kernel version…

So if you run (as root user) lspci -vvvs 04:00.0 you should see something like:

lspci -vvvs 04:00.0
04:00.0 VGA compatible controller: NVIDIA Corporation TU117GLM [Quadro T400 Mobile] (rev a1) (prog-if 00 [VGA controller])
....
....
	Capabilities: [bb0 v1] Physical Resizable BAR
		BAR 0: current size: 16MB, supported: 16MB
		BAR 1: current size: 256MB, supported: 64MB 128MB 256MB
		BAR 3: current size: 32MB, supported: 32MB

While the errors do include BAR errors, this is not a Resizable BAR capable machine.

If there’s anything “good” in this, rasdaemon did find that one of my Samsung EVO 870 2TB SSDs was “bad” (less than 1 year old), so I got an RMA for that one (one of those bad Samsungs).

Machine is PCIe 3.0, it’s an HP Z840.

I do run X11 (this is Leap; Wayland still needs work unless you go outside of Leap updates, which pretty much gets you so off path… not worthy of discussion).

@cjcox Yup, same as my Z440. There is discussion that the C610/X99 series chipset in the Z[4,6,8]40 may be the issue with PCIe 4.0 cards. I was looking at an Intel ARC and may still get an A310 for testing…

Do you get any events from ras-mc-ctl --summary?

I run Tumbleweed and X11 on GNOME here.

It’s one of the things that tipped me off to the SSD problem. Posting my current ras-mc-ctl --summary

No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.
Disk errors summary:
    0:0 has 58 errors
    0:2048 has 3515 errors
    0:2080 has 47 errors
    0:2096 has 254 errors
    0:2112 has 3 errors
    0:2128 has 89 errors
    0:2144 has 54 errors
    0:2160 has 3 errors
    0:2176 has 3 errors
    0:2816 has 2326 errors
    0:2817 has 60138 errors
No Memory failure errors.

MCE records summary:
    12 MEMORY CONTROLLER MS_CHANNEL2_ERR Transaction: Memory scrubbing error Corrected patrol scrub error errors
    1 MEMORY CONTROLLER RD_CHANNEL2_ERR Transaction: Memory read error errors
    1 MEMORY CONTROLLER RD_CHANNEL2_ERR Transaction: Memory read error Corrected memory read error errors
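
To watch whether the disk error counts keep growing between checks, the per-device lines in the summary can be totalled. Here is a minimal sketch, assuming the "<dev> has <N> errors" line format shown in the paste above. The helper name `sum_disk_errors` is made up, and other rasdaemon versions may print the summary differently.

```shell
# sum_disk_errors: read `ras-mc-ctl --summary` style text on stdin and print the
# total of all "<dev> has <N> errors" lines (prints 0 if there are none).
sum_disk_errors() {
    awk '$2 == "has" && $4 == "errors" { total += $3 } END { print total + 0 }'
}

# Usage on a system running rasdaemon:
#   ras-mc-ctl --summary | sum_disk_errors
```

The field test on "has" and "errors" keeps the MCE lines and the "No ... errors." lines out of the total.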

Intel ARC is a possibility for you, because you’re on TW. I’m on Leap. But even then, not sure ARC is completely “ready”.

I’m thinking that the kernel changes, I think in the DMA arena, may have stabilized this problem. It’s early, but so far that’s 3 straight days of successful startups for me. Kernel 5.14.21-150500.55.44-default


So far, so good. Hoping fixed for good!