PCIe Bus error during boot

Hello,

As I mentioned in my previous posts, I’ve seen many Intel-related errors and possible bugs with kernel 4.14. Sometimes the system hung at boot after I ran zypper dup and rebooted. One of the errors looks like this:

~ $ dmesg | grep Error
[    0.044709] ACPI Error: [\_SB_.PCI0.XHC_.RHUB.HS11] Namespace lookup failure, AE_NOT_FOUND (20170728/dswload-210)
[    0.048721] ACPI Error: 1 table load failures, 12 successful (20170728/tbxfload-246)
[    1.801996] RAS: Correctable Errors collector initialized.
[  328.053721] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[  351.383378] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[  382.382643] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[  387.739581] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[  387.739600] nvme 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0400(Receiver ID)
[  387.739602] nvme 0000:04:00.0:    [ 0] Receiver Error         (First)
[  387.821726] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)

I’m having a difficult time understanding what this PCIe Bus Error is. It never appeared before I upgraded to 4.14.2 on Nov 30 or so. What is it, and how do I fix it? I guess I can ignore the ACPI errors above. Correct me if I’m wrong.

And, btw, when will kernel 4.15 arrive in Tumbleweed? I won’t do another zypper dup until 4.15 is here.

Thank you very much.

Well, to the extent of my knowledge…

  • Since the error says “severity=Corrected”, I assume the issue should not affect how your system runs.
  • It is an issue at the Data Link Layer (in PCIe terms, the layer just above the Physical Layer). Curiously, that’s near the very bottom of the stack, where the OS interfaces with hardware… A lot of this is firmware code, like working with your BIOS/UEFI… typically not even the main part of your device drivers. But nowadays the OS kernel sometimes reaches down into manipulating hardware in ways that weren’t done years ago. In fact, the error also references the Physical Layer, which is the hardware itself.

Bottom line: I doubt this is anything an ordinary user can do anything about. In fact, it seems someone anticipated problems like this and wrote self-correcting code to make sure the issue wouldn’t affect how your system operates. You can be a “good citizen” and report the error to the Linux kernel folks, but I’d also guess that they already know quite a bit about this issue.
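
If you do report it, it might help to say how often each device logs these messages. Here is a rough Python sketch that tallies the "PCIe Bus Error" lines by device, severity and layer. The function name summarize_pcie_errors and the regular expression are my own, written against the lines quoted in this thread, so treat them as assumptions and adjust if your kernel prints the messages differently.

```python
import re
import sys
from collections import Counter

# Assumed line shape, taken from the dmesg output quoted in this thread:
#   pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=...
PCIE_ERROR = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+),"
)


def summarize_pcie_errors(lines):
    """Count PCIe Bus Error lines per (device, severity, layer)."""
    counts = Counter()
    for line in lines:
        match = PCIE_ERROR.search(line)
        if match:
            key = (match["device"], match["severity"], match["layer"])
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Read kernel log lines from standard input and print one row per group.
    summary = summarize_pcie_errors(sys.stdin)
    for (device, severity, layer), n in summary.most_common():
        print(f"{n:5d}  {device}  {severity}  {layer}")
```

Saved under a name of your choosing (say pcie_summary.py, a made-up name), it could be run as `dmesg | python3 pcie_summary.py`. Detail lines such as "Receiver Error (First)" are skipped on purpose, since they only elaborate on the error line before them.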

TSU