Black screen with kernel 6.3.9 update of Nouveau

dr-yak · June 27, 2023, 8:21am

Hi!

Since updating my Tumbleweed, which took the kernel from 6.3.7 to 6.3.9 (the one with the bugfixes in Nouveau), my laptop’s screen turns black at the point in the boot process when kernel modesetting should kick in (before the boot splash is drawn) and stays that way.
(The rest of the system boots normally and is accessible over SSH, so I can fetch other information if needed.)

Reverting back to kernel 6.3.7 fixes this for now.

My laptop’s GPU is very old:

01:00.0 VGA compatible controller: NVIDIA Corporation GT218M [NVS 3100M] (rev a2)

I’ve kept the Xorg log from the unsuccessful run (no error is reported in it, so it’s only the modesetting failing to turn on the laptop’s screen; everything else seems to run fine).

What other information could be useful?
And the best place to report the bug would be upstream Nouveau, right?

dr-yak · June 27, 2023, 8:40am

Update: I managed to find the nouveaudrmfb crash message in the kernel logs:

jun 27 09:57:29 argo kernel: fbcon: nouveaudrmfb (fb0) is primary device
jun 27 09:57:29 argo kernel: ------------[ cut here ]------------
jun 27 09:57:29 argo kernel: WARNING: CPU: 1 PID: 90 at drivers/gpu/drm/nouveau/nvkm/engine/disp/dp.c:497 nvkm_dp_acquire+0x50a/0x750 [nouveau]
jun 27 09:57:29 argo kernel: Modules linked in: uas usb_storage nouveau(+) crct10dif_pclmul crc32_pclmul polyval_generic gf128mul ghash_clmulni_intel pcmcia sha512_ssse3 firewire_ohci drm_ttm_helper ehci_pci ttm sdhci_pci i2c_algo_bit mxm_wmi ehci_hcd drm_display_helper aesni_intel cec crypto_simd cqhci firewire_core yenta_socket sdhci pcmcia_rsrc cryptd mmc_core crc_itu_t usbcore rc_core pcmcia_core battery video wmi button serio_raw z3fold lz4hc lz4hc_compress lz4 lz4_compress btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_intel dm_mirror dm_region_hash dm_log v4l2loopback(O) videodev mc sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua ledtrig_timer msr efivarfs
jun 27 09:57:29 argo kernel: CPU: 1 PID: 90 Comm: kworker/u16:4 Tainted: G           O       6.3.9-1-default #1 openSUSE Tumbleweed 4b767630dbc263131e96e89ef291fd4fd2951892
jun 27 09:57:29 argo kernel: Hardware name: Dell Inc. Latitude E6510/0N5KHN, BIOS A17 05/12/2017
jun 27 09:57:29 argo kernel: Workqueue: nvkm-disp nv50_disp_super [nouveau]
jun 27 09:57:29 argo kernel: RIP: 0010:nvkm_dp_acquire+0x50a/0x750 [nouveau]
jun 27 09:57:29 argo kernel: Code: 03 0f 85 0a 02 00 00 a8 04 0f 84 02 02 00 00 83 c2 01 39 fa 75 ca 41 8b 85 28 01 00 00 85 c0 0f 85 36 fc ff ff e9 d6 fb ff ff <0f> 0b c1 e8 03 45 88 66 62 44 89 fe 4c 89 ef 48 69 c0 cf 0d d6 26
jun 27 09:57:29 argo kernel: RSP: 0018:ffffa52280417d60 EFLAGS: 00010246
jun 27 09:57:29 argo kernel: RAX: 0000000000041eb0 RBX: 0000000000062b1a RCX: 0000000000041eb0
jun 27 09:57:29 argo kernel: RDX: ffffffffc0b63e00 RSI: 0000000000000002 RDI: ffffa52280417cf0
jun 27 09:57:29 argo kernel: RBP: 00000000ffffffea R08: 0000000000000001 R09: 0000000000005b76
jun 27 09:57:29 argo kernel: R10: 0000000000000009 R11: ffffa52280417de8 R12: 0000000000000001
jun 27 09:57:29 argo kernel: R13: ffff98d28f13b600 R14: ffff98d28218e480 R15: 0000000000000000
jun 27 09:57:29 argo kernel: FS:  0000000000000000(0000) GS:ffff98d3a7a80000(0000) knlGS:0000000000000000
jun 27 09:57:29 argo kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jun 27 09:57:29 argo kernel: CR2: 00007f709c87fe70 CR3: 000000010ac36006 CR4: 00000000000206e0
jun 27 09:57:29 argo kernel: Call Trace:
jun 27 09:57:29 argo kernel:  <TASK>
jun 27 09:57:29 argo kernel:  ? nvkm_dp_acquire+0x50a/0x750 [nouveau d83fa5c1e9d0d2d5a178e32fe1324e871c232aae]
jun 27 09:57:29 argo kernel:  ? __warn+0x81/0x130
jun 27 09:57:29 argo kernel:  ? nvkm_dp_acquire+0x50a/0x750 [nouveau d83fa5c1e9d0d2d5a178e32fe1324e871c232aae]
jun 27 09:57:29 argo kernel:  ? report_bug+0x171/0x1a0
jun 27 09:57:29 argo kernel:  ? handle_bug+0x3c/0x80
jun 27 09:57:29 argo kernel:  ? exc_invalid_op+0x17/0x70
jun 27 09:57:29 argo kernel:  ? asm_exc_invalid_op+0x1a/0x20
jun 27 09:57:29 argo kernel:  ? __pfx_init_done+0x10/0x10 [nouveau d83fa5c1e9d0d2d5a178e32fe1324e871c232aae]
jun 27 09:57:29 argo kernel:  ? nvkm_dp_acquire+0x50a/0x750 [nouveau d83fa5c1e9d0d2d5a178e32fe1324e871c232aae]
jun 27 09:57:29 argo kernel:  nv50_disp_super_2_2+0x6d/0x430 [nouveau d83fa5c1e9d0d2d5a178e32fe1324e871c232aae]
jun 27 09:57:29 argo kernel:  nv50_disp_super+0x117/0x230 [nouveau d83fa5c1e9d0d2d5a178e32fe1324e871c232aae]
jun 27 09:57:29 argo kernel:  process_one_work+0x20a/0x420
jun 27 09:57:29 argo kernel:  worker_thread+0x4e/0x3b0
jun 27 09:57:29 argo kernel:  ? __pfx_worker_thread+0x10/0x10
jun 27 09:57:29 argo kernel:  kthread+0xde/0x110
jun 27 09:57:29 argo kernel:  ? __pfx_kthread+0x10/0x10
jun 27 09:57:29 argo kernel:  ret_from_fork+0x2c/0x50
jun 27 09:57:29 argo kernel:  </TASK>
jun 27 09:57:29 argo kernel: ---[ end trace 0000000000000000 ]---
jun 27 09:57:29 argo kernel: Console: switching to colour frame buffer device 200x56
jun 27 09:57:29 argo kernel: nouveau 0000:01:00.0: [drm] fb0: nouveaudrmfb frame buffer device
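
For anyone wanting to check their own logs for the same trace, here is a minimal Python sketch that pulls the source file, line number and symbol out of each WARNING found between a "cut here" marker and the matching "end trace" marker. The function name extract_warnings and the regular expressions are made up for illustration and only assume the log layout shown above.

```python
import re

CUT = "------------[ cut here ]------------"
END = re.compile(r"---\[ end trace [0-9a-f]+ \]---")
# Matches e.g. "WARNING: CPU: 1 PID: 90 at path/to/file.c:497 symbol+0x50a/0x750"
WARN = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+):(\d+) (\S+)")


def extract_warnings(log_text):
    """Return (source_file, line, symbol) for each WARNING inside a cut-here block."""
    found = []
    in_block = False
    for line in log_text.splitlines():
        if CUT in line:
            in_block = True
        elif END.search(line):
            in_block = False
        elif in_block:
            match = WARN.search(line)
            if match:
                found.append((match.group(1), int(match.group(2)), match.group(3)))
    return found


sample = """\
jun 27 09:57:29 argo kernel: ------------[ cut here ]------------
jun 27 09:57:29 argo kernel: WARNING: CPU: 1 PID: 90 at drivers/gpu/drm/nouveau/nvkm/engine/disp/dp.c:497 nvkm_dp_acquire+0x50a/0x750 [nouveau]
jun 27 09:57:29 argo kernel: ---[ end trace 0000000000000000 ]---
"""

for source_file, line_no, symbol in extract_warnings(sample):
    print(f"{source_file}:{line_no} in {symbol}")
```

The same function could be fed the text of `journalctl -k -b` saved to a file instead of the sample string.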

dr-yak · June 27, 2023, 8:41am

Issue opened upstream:

mrmazda · June 29, 2023, 7:03am

I commented in your freedesktop report that I cannot reproduce your problem with my GT218 desktop PC (Tesla) or a slightly newer GF108 desktop PC (Fermi).

qwert.zuiop · June 29, 2023, 7:52am

There are definitely many others here with the problem.

The NVIDIA driver can be used, but only with Wayland.

dr-yak · June 29, 2023, 8:01pm

So this could be down to:

- a subtle difference between the GeForce 210 and the NVS 3100M (e.g. it’s a failure to initialise the eDP output, which your desktop card obviously doesn’t have)
- my UEFI initialising it differently than your firmware (BIOS or UEFI).

I don’t even get to the desktop, or to the point where I can choose the graphical server (Wayland vs X11).
The crash (and loss of picture) happens early in boot, when the framebuffer would be initialised, right before the bootsplash.

mrmazda · June 30, 2023, 1:05am

That’s another possible difference between your crash and my lack of one. Plymouth is never installed here unless forced, in which case it is neutered. Have you tried appending noplymouth, plymouth.enable=0 or plymouth=0 to your Grub linux line?
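
To confirm that a parameter added at the Grub prompt actually reached the running kernel, something like the following Python sketch could be used. It only checks for the three spellings mentioned above as whole tokens on the command line; it does not verify which of them Plymouth honours. The function name cmdline_flags is hypothetical.

```python
def cmdline_flags(cmdline, wanted=("noplymouth", "plymouth.enable=0", "plymouth=0")):
    """Report which of the wanted parameters appear as whole tokens in a kernel command line."""
    tokens = cmdline.split()
    return {flag: flag in tokens for flag in wanted}


# Made-up example command line, for illustration only.
example = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda2 splash=silent plymouth.enable=0"
print(cmdline_flags(example))
```

On a live system the input would be the contents of /proc/cmdline, e.g. `cmdline_flags(open("/proc/cmdline").read())`.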

I just did a dup to 20230628/6.3.9 on a TW/Fermi GF119 [NVS 310] (chip-ID 10de:107d) with 2 DisplayPorts, and it works fine. Afterward I did the same with a different Tesla/TW than yesterday, a G98 [GeForce 8400 GS Rev. 2] (chip-ID 10de:06e4) with DVI/VGA/S-Video outputs, using DVI+VGA, and it also works as expected.

dr-yak · July 7, 2023, 8:14am

Nope, confirmed with testing: the crash happens right before the point where plymouth would have kicked in, when the nouveau-specific drm framebuffer driver is loaded and takes over from the previously used driver (efifb / simple-frame-buffer).

But…

Confirmed by plugging another monitor in:
- only the initialisation of the eDP crashes and leads to a black screen on the laptop.
- other displays initialise normally and show mirrors of the log-in screen (SDDM).

So your failure to reproduce boils down to the fact that you have no display connected on eDP (obviously, given that yours is a desktop card).

dr-yak · July 7, 2023, 10:24am

More experiments:

Booting in BIOS mode (CSM, though my laptop is so old it probably wasn’t called that back then) instead of UEFI: still fails when nouveaufb tries to take over after vgaarb + simple-frame-buffer.
- so it’s not UEFI leaving the screen in a weird mode.
Booting with the nomodeset option (which disables kernel mode setting, and thus prevents nouveaufb from taking over) works, including the bootsplash.
- well, except of course that X11 then only works on top of the framebuffer, because Nouveau’s X driver requires kernel mode setting and thus cannot run without nouveaufb.
- again, evidence that the problem is probably in nouveau’s display initialisation routines.

Conclusion: this further confirms that the crash only happens when nouveau specifically fails to initialise the eDP display. Everything else, including an external display on another port, works normally.
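
To see over SSH which outputs the kernel considers connected, a small Python sketch like this can read the per-connector status files that the DRM subsystem exposes under /sys/class/drm. The function name connector_status is made up, and connector names such as card0-eDP-1 depend on the machine. Note that this only shows what the kernel detected; a connector reported as connected does not prove that the panel was actually brought up.

```python
from pathlib import Path


def connector_status(drm_root="/sys/class/drm"):
    """Map each DRM connector directory name (e.g. 'card0-eDP-1') to its status string."""
    result = {}
    # Connector directories are named <card>-<connector>; bare 'card0' has no status file.
    for status_file in sorted(Path(drm_root).glob("card*-*/status")):
        result[status_file.parent.name] = status_file.read_text().strip()
    return result


if __name__ == "__main__":
    for name, status in connector_status().items():
        print(f"{name}: {status}")
```

The drm_root parameter exists only so the function can be pointed at a copy of the directory tree for testing.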

dr-yak · August 10, 2023, 3:37pm

Update:

dr-yak · August 21, 2023, 2:59pm

Update:

Kernel 6.4.11 contains the fix.
This solved the problem.
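
A quick way to check whether a given kernel release string is at or past 6.4.11 is sketched below in Python. The function names are hypothetical, and the check is deliberately narrow: this thread only says that 6.4.11 contains the fix, so a False result does not mean the kernel is affected (6.3.7, for example, predates the regression), and backports to other branches are not covered.

```python
import re


def kernel_version(release):
    """Parse the leading numeric part of a release string like '6.3.9-1-default'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level (e.g. '6.4') is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def at_least_fixed_release(release, fixed=(6, 4, 11)):
    """True if the release is 6.4.11 or newer, the version reported here as fixed."""
    return kernel_version(release) >= fixed


for rel in ("6.3.9-1-default", "6.4.11-1-default"):
    print(rel, at_least_fixed_release(rel))
```

On a running Linux system the release string can be obtained with `platform.release()` from the standard library.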