amdgpu simd exception in 15.2

Hi folks,

Ever since configuring a new desktop machine with a Ryzen 5 3400G and openSUSE Leap 15.2,
I’ve been plagued with system crashes for which rebooting is the only remedy I’ve
found. It is possible to ssh into the machine from elsewhere, but the jobs that were running
have gone idle, and even killing the idle processes doesn’t unhang the machine. Typically
the problem occurs while running Firefox, often with more than one tab open. A guaranteed
way to hit it within a few seconds is to start a parallel job (something compiled with
gcc that has some OpenMP loops in it) and then start Firefox.

I have looked in /var/log/messages, and the signature of the crash appears to be this:

2021-03-30T21:10:39.631908-06:00 lindblad kernel: [ 9373.307201] simd exception: 0000 [#1] SMP NOPTI
2021-03-30T21:10:39.631920-06:00 lindblad kernel: [ 9373.307206] CPU: 6 PID: 1625 Comm: X Not tainted 5.3.18-lp152.66-default #1 openSUSE Leap 15.2
2021-03-30T21:10:39.631920-06:00 lindblad kernel: [ 9373.307209] Hardware name: Micro-Star International Co., Ltd MS-7B86/B450 GAMING PLUS MAX (MS-7B86), BIOS H.70 06/17/2020
2021-03-30T21:10:39.631922-06:00 lindblad kernel: [ 9373.307284] RIP: 0010:mode_support_and_system_configuration+0x2881/0x4b20 [amdgpu]
2021-03-30T21:10:39.631922-06:00 lindblad kernel: [ 9373.307287] Code: 17 00 00 0f 28 c3 e8 6e d1 ff ff f3 41 0f 11 87 40 19 00 00 e9 2d fd ff ff 83 bd a8 00 00 00 06 75 9a f3 0f 10 85 40 1b 00 00 <f3> 0f 5e 85 f8 17 00 00 e8 42 d1 ff ff 41 8b 97 80 04 00 00 0f 28
2021-03-30T21:10:39.631927-06:00 lindblad kernel: [ 9373.307292] RSP: 0018:ffffb1790173f790 EFLAGS: 00010246
2021-03-30T21:10:39.631927-06:00 lindblad kernel: [ 9373.307295] RAX: 0000000000000000 RBX: ffff915d9229afa0 RCX: 0000000000000004
2021-03-30T21:10:39.631928-06:00 lindblad kernel: [ 9373.307297] RDX: 0000000000000006 RSI: ffff915d9229ac58 RDI: 0000000000000001
2021-03-30T21:10:39.631929-06:00 lindblad kernel: [ 9373.307300] RBP: ffff915d9229abac R08: ffff915d9229bdb4 R09: 0000000000000120
2021-03-30T21:10:39.631929-06:00 lindblad kernel: [ 9373.307302] R10: ffff915d9229b558 R11: 0000000000000004 R12: ffff915d9229aa14
2021-03-30T21:10:39.631930-06:00 lindblad kernel: [ 9373.307304] R13: ffff915d9229c28c R14: ffff915d9229aa14 R15: ffff915d9229aa14
2021-03-30T21:10:39.631930-06:00 lindblad kernel: [ 9373.307307] FS:  00007f4d141e6ec0(0000) GS:ffff915e80780000(0000) knlGS:0000000000000000
2021-03-30T21:10:39.631931-06:00 lindblad kernel: [ 9373.307310] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2021-03-30T21:10:39.631931-06:00 lindblad kernel: [ 9373.307312] CR2: 00007f4d0412e000 CR3: 00000007f8d0e000 CR4: 00000000003406e0
2021-03-30T21:10:39.631932-06:00 lindblad kernel: [ 9373.307314] Call Trace:
2021-03-30T21:10:39.631932-06:00 lindblad kernel: [ 9373.307387]  dcn_validate_bandwidth+0xd7a/0x1f80 [amdgpu]
2021-03-30T21:10:39.631933-06:00 lindblad kernel: [ 9373.307451]  dc_commit_updates_for_stream+0x92c/0x1410 [amdgpu]
2021-03-30T21:10:39.631933-06:00 lindblad kernel: [ 9373.307502]  ? amdgpu_display_get_crtc_scanoutpos+0x85/0x170 [amdgpu]
2021-03-30T21:10:39.631933-06:00 lindblad kernel: [ 9373.307567]  amdgpu_dm_atomic_commit_tail+0x10da/0x1e10 [amdgpu]
2021-03-30T21:10:39.631934-06:00 lindblad kernel: [ 9373.307580]  ? commit_tail+0x3d/0x80 [drm_kms_helper]
2021-03-30T21:10:39.631934-06:00 lindblad kernel: [ 9373.307588]  commit_tail+0x3d/0x80 [drm_kms_helper]
2021-03-30T21:10:39.631935-06:00 lindblad kernel: [ 9373.307596]  drm_atomic_helper_commit+0x107/0x130 [drm_kms_helper]
2021-03-30T21:10:39.631936-06:00 lindblad kernel: [ 9373.307612]  drm_mode_obj_set_property_ioctl+0x24d/0x2e0 [drm]
2021-03-30T21:10:39.631936-06:00 lindblad kernel: [ 9373.307616]  ? mutex_lock+0xe/0x30
2021-03-30T21:10:39.631936-06:00 lindblad kernel: [ 9373.307630]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
2021-03-30T21:10:39.631937-06:00 lindblad kernel: [ 9373.307644]  drm_ioctl_kernel+0xac/0xf0 [drm]
2021-03-30T21:10:39.631937-06:00 lindblad kernel: [ 9373.307658]  drm_ioctl+0x2eb/0x3b0 [drm]
2021-03-30T21:10:39.631938-06:00 lindblad kernel: [ 9373.307674]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
2021-03-30T21:10:39.631938-06:00 lindblad kernel: [ 9373.307678]  ? do_iter_write+0xf2/0x1a0
2021-03-30T21:10:39.631939-06:00 lindblad kernel: [ 9373.307728]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
2021-03-30T21:10:39.631939-06:00 lindblad kernel: [ 9373.307731]  do_vfs_ioctl+0xa0/0x680
2021-03-30T21:10:39.631939-06:00 lindblad kernel: [ 9373.307735]  ? __sys_recvmsg+0x8a/0xa0
2021-03-30T21:10:39.631940-06:00 lindblad kernel: [ 9373.307737]  ksys_ioctl+0x70/0x80
2021-03-30T21:10:39.631940-06:00 lindblad kernel: [ 9373.307740]  __x64_sys_ioctl+0x16/0x20
2021-03-30T21:10:39.631940-06:00 lindblad kernel: [ 9373.307743]  do_syscall_64+0x65/0x1f0
2021-03-30T21:10:39.631941-06:00 lindblad kernel: [ 9373.307745]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
2021-03-30T21:10:39.631941-06:00 lindblad kernel: [ 9373.307748] RIP: 0033:0x7f4d11ad59e7
2021-03-30T21:10:39.631942-06:00 lindblad kernel: [ 9373.307751] Code: b3 66 90 48 8b 05 b1 14 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 14 2c 00 f7 d8 64 89 01 48
2021-03-30T21:10:39.631942-06:00 lindblad kernel: [ 9373.307755] RSP: 002b:00007ffc56f609b8 EFLAGS: 00003246 ORIG_RAX: 0000000000000010
2021-03-30T21:10:39.631943-06:00 lindblad kernel: [ 9373.307758] RAX: ffffffffffffffda RBX: 000055f7c687d800 RCX: 00007f4d11ad59e7
2021-03-30T21:10:39.631943-06:00 lindblad kernel: [ 9373.307760] RDX: 00007ffc56f609f0 RSI: 00000000c01864ba RDI: 000000000000000d
2021-03-30T21:10:39.631943-06:00 lindblad kernel: [ 9373.307762] RBP: 00007ffc56f609f0 R08: 000000000000005a R09: 000055f7c687e0c0
2021-03-30T21:10:39.631944-06:00 lindblad kernel: [ 9373.307764] R10: 000055f7c795e284 R11: 0000000000003246 R12: 00000000c01864ba
2021-03-30T21:10:39.631944-06:00 lindblad kernel: [ 9373.307766] R13: 000000000000000d R14: 0000000000000fff R15: 0000000000000003
2021-03-30T21:10:39.631945-06:00 lindblad kernel: [ 9373.307769] Modules linked in: fuse af_packet xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter usblp dmi_sysfs msr edac_mce_amd ccp kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio irqbypass snd_hda_codec_hdmi snd_hda_intel snd_hda_codec crct10dif_pclmul crc32_pclmul snd_hda_core ghash_clmulni_intel snd_hwdep aesni_intel snd_pcm aes_x86_64 ppdev joydev snd_timer crypto_simd sp5100_tco cryptd glue_helper r8169 snd wmi_bmof soundcore parport_pc realtek parport libphy i2c_piix4 k10temp pcspkr gpio_amdpt gpio_generic button hid_microsoft ff_memless hid_generic usbhid xfs libcrc32c amdgpu amd_iommu_v2 gpu_sched i2c_algo_bit ttm
2021-03-30T21:10:39.631945-06:00 lindblad kernel: [ 9373.307797]  drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel drm xhci_pci sr_mod xhci_hcd cdrom usbcore wmi video sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
2021-03-30T21:10:39.631946-06:00 lindblad kernel: [ 9373.307822] ---[ end trace f7e31fd7e1ab6046 ]---
2021-03-30T21:10:39.631946-06:00 lindblad kernel: [ 9373.307892] RIP: 0010:mode_support_and_system_configuration+0x2881/0x4b20 [amdgpu]
2021-03-30T21:10:39.631947-06:00 lindblad kernel: [ 9373.307897] Code: 17 00 00 0f 28 c3 e8 6e d1 ff ff f3 41 0f 11 87 40 19 00 00 e9 2d fd ff ff 83 bd a8 00 00 00 06 75 9a f3 0f 10 85 40 1b 00 00 <f3> 0f 5e 85 f8 17 00 00 e8 42 d1 ff ff 41 8b 97 80 04 00 00 0f 28
2021-03-30T21:10:39.631947-06:00 lindblad kernel: [ 9373.307903] RSP: 0018:ffffb1790173f790 EFLAGS: 00010246
2021-03-30T21:10:39.631947-06:00 lindblad kernel: [ 9373.307908] RAX: 0000000000000000 RBX: ffff915d9229afa0 RCX: 0000000000000004
2021-03-30T21:10:39.631948-06:00 lindblad kernel: [ 9373.307913] RDX: 0000000000000006 RSI: ffff915d9229ac58 RDI: 0000000000000001
2021-03-30T21:10:39.631948-06:00 lindblad kernel: [ 9373.307917] RBP: ffff915d9229abac R08: ffff915d9229bdb4 R09: 0000000000000120
2021-03-30T21:10:39.631948-06:00 lindblad kernel: [ 9373.307921] R10: ffff915d9229b558 R11: 0000000000000004 R12: ffff915d9229aa14
2021-03-30T21:10:39.631949-06:00 lindblad kernel: [ 9373.307925] R13: ffff915d9229c28c R14: ffff915d9229aa14 R15: ffff915d9229aa14
2021-03-30T21:10:39.631949-06:00 lindblad kernel: [ 9373.307929] FS:  00007f4d141e6ec0(0000) GS:ffff915e80780000(0000) knlGS:0000000000000000
2021-03-30T21:10:39.631950-06:00 lindblad kernel: [ 9373.307934] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2021-03-30T21:10:39.631950-06:00 lindblad kernel: [ 9373.307939] CR2: 00007f4d0412e000 CR3: 00000007f8d0e000 CR4: 00000000003406e0
2021-03-30T21:11:04.768518-06:00 lindblad tracker-store[14225]: OK

Googling some of the strings in the output above leads me to
threads like this one:

https://gitlab.freedesktop.org/drm/amd/-/issues/1154

where I read that my problem has something to do with an exception-masking inconsistency (or something like that).
It also seems to indicate that the fixes for the problem (if it is indeed the same one) appear in Linux 5.8 or so. Not being terribly deep into kernel
patching/hacking/substitution, I’m not in a position to swap out kernels to try this myself, however.

So my questions are these:

  1. Am I right in my diagnosis that my problem is likely the same one as in the referenced thread?
  2. If so, are the patches involved scheduled for (or already in) the update stream for openSUSE 15.2? How would
    I know whether my OS has had them applied?
  3. If not, is this error signature familiar to anyone, and does it have a fix that I can try out on my machine?

Thanks,

andy271828

@andy271828:

Here with –


 # inxi --admin --filter --cpu
CPU:       Info: Quad Core model: AMD Ryzen 5 3400G with Radeon Vega Graphics socket: AM4 bits: 64 type: MT MCP arch: Zen+ 
           family: 17 (23) model-id: 18 (24) stepping: 1 microcode: 8108109 L1 cache: 384 KiB L2 cache: 2048 KiB 
           L3 cache: 4096 KiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 59088 
           Speed: 1259 MHz min/max: 1400/3700 MHz base/boost: 3700/4200 boost: enabled volts: 1.5 V ext-clock: 100 MHz 
           Core speeds (MHz): 1: 1259 2: 1258 3: 1259 4: 1265 5: 1259 6: 1275 7: 1258 8: 1308 
           Vulnerabilities: Type: itlb_multihit status: Not affected 
           Type: l1tf status: Not affected 
           Type: mds status: Not affected 
           Type: meltdown status: Not affected 
           Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp 
           Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization 
           Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: conditional, STIBP: disabled, RSB filling 
           Type: srbds status: Not affected 
           Type: tsx_async_abort status: Not affected 
 # 
 # inxi --admin --filter --system
System:    Kernel: 5.3.18-lp152.66-default x86_64 bits: 64 compiler: gcc v: 7.5.0 
           parameters: BOOT_IMAGE=/boot/vmlinuz-5.3.18-lp152.66-default root=UUID=c59a64bf-b464-4ea2-bf3a-d3fd9dded03f 
           splash=silent resume=/dev/disk/by-id/ata-Intenso_SSD_Sata_III_AA000000000000035990-part3 quiet mitigations=auto 
           Console: tty 6 wm: kwin_x11 DM: SDDM Distro: openSUSE Leap 15.2 
 # 

No crashes …
On the other hand, I have another mainboard –


 # inxi --admin --filter --machine
Machine:   Type: Desktop Mobo: ASUSTeK model: PRIME B450-PLUS v: Rev X.0x serial: <filter> UEFI: American Megatrends v: 2807 
           date: 02/01/2021 
 # 

And, the Bug Report against the Kernel you’ve mentioned is waiting for further information – <https://bugzilla.kernel.org/show_bug.cgi?id=207979>

Have you checked the BIOS/UEFI version of your Mainboard?

Are you running the memory at the default clock speed offered by the Mainboard, or have you raised the clock speed to the maximum AMD specifies for the APU?

Hi, and thanks for the reply,

I ran the commands you tried on your machine with the following results:

 inxi --admin --filter --cpu
CPU:       Topology: Quad Core model: AMD Ryzen 5 3400G with Radeon Vega Graphics bits: 64 type: MT MCP arch: Zen+ 
           family: 17 (23) model-id: 18 (24) stepping: 1 microcode: 8108109 L2 cache: 2048 KiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 59201 
           Speed: 3016 MHz min/max: N/A Core speeds (MHz): 1: 3016 2: 1258 3: 1569 4: 1235 5: 2911 6: 1527 7: 1698 8: 1242 
           Vulnerabilities: Type: itlb_multihit status: Not affected 
           Type: l1tf status: Not affected 
           Type: mds status: Not affected 
           Type: meltdown status: Not affected 
           Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp 
           Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization 
           Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: conditional, STIBP: disabled, RSB filling 
           Type: srbds status: Not affected 
           Type: tsx_async_abort status: Not affected 


 inxi --admin --filter --machine

Machine:   Type: Desktop Mobo: Micro-Star model: B450 GAMING PLUS MAX (MS-7B86) v: 3.0 serial: <filter> 
           UEFI [Legacy]: American Megatrends v: H.70 date: 06/17/2020 

Not sure of the significance of any of the contents above though, but here it is for whatever value it may have.

Regarding board firmware, I have checked the firmware version and it is current as of last summer or so (I suppose the half a dozen or so
earlier updates were applied at the factory, before I purchased the board). I bought it new near
Christmas. There are a couple of newer firmware updates, but nothing looked consequential, so I haven’t tried a firmware
update on the board.

Re ram speed: I have 3200 speed ram, or at least that is what it claimed on the package etc. When it registered with
the motherboard however, it only wanted to run at about 2200 (don’t recall). I have it manually set to run at 3200, but
have experimented with both default and 3200, with no difference in system hang behavior. (both settings have hangs).
I have not twiddled the cpu speed in any respect.

Not sure how the ram speed and the apu speed you mention correlate with anything, or even with how to determine
what the apu speed is…

One other tidbit. I mentioned above that I can get hangs easily with an OpenMP-ized executable running and then
starting up Firefox. It may be of relevance that of late I’ve been doing a lot of code debugging, using debugging
flags rather than optimization flags for the compiler. That may be significant because the debug build traps
floating-point exceptions (via -ffpe-trap) while the optimized build does not, but I don’t know. In any case, I am certain I’ve had crashes when running under
either alternative, since sometimes I am going back and forth between one and the other on a few-minute debug/run
cadence.

FWIW, I’m compiling/running a gfortran (v10) workflow with these compile flags for opt/debug (cut directly
from my makefile):


ifeq ($(MODE),FAST)
   FFLAGS = $(PARFLAG) \
            -O3 -mcmodel=medium -m64 \
            -mtune=native -fomit-frame-pointer  \
            -funroll-loops -fprefetch-loop-arrays  \
            -ffast-math \
            -fconvert=big-endian


endif
ifeq ($(MODE),DEBUG)
   FFLAGS = $(PARFLAG) \
            -mcmodel=medium -m64 -g -fexceptions -fbounds-check \
            -ffpe-trap=invalid,zero,overflow,underflow,denormal \
            -finit-real=nan -finit-integer=-1010101010 \
            -finit-character=0 -Wuninitialized -Wno-unused-label \
            -fstack-protector \
            -ftrapv  \
            -Wall -fconvert=big-endian
#            -std=f95 -pedantic
endif

The PARFLAG variable switches OpenMP on or off. I get fewer (none?) crashes
without OpenMP turned on.

andy271828

…and one more of the inxi incantations I seem to have missed:

 inxi --admin --filter --system
System:    Kernel: 5.3.18-lp152.66-default x86_64 bits: 64 compiler: gcc v: 7.5.0 
           parameters: BOOT_IMAGE=/boot/vmlinuz-5.3.18-lp152.66-default root=UUID=14895fd3-dbd1-4300-bba9-5eb19d6efd4a 
           splash=silent resume=/dev/disk/by-id/ata-ST4000VN008-2DR166_ZDH9DZQT-part3 mitigations=auto quiet 
           Desktop: KDE Plasma 5.18.6 tk: Qt 5.12.7 wm: kwin_x11 dm: SDDM Distro: openSUSE Leap 15.2 

@andy271828:

Have you installed the “kernel-firmware” and “ucode-amd” firmware and micro-code packages?

  • If not, please do so …

Hello again,

Yes, I have verified that the two packages you cite are installed on my machine.

Further: I just wrote up a small reproducer code that triggers the problem in
about a second on my machine. Unfortunately, I don’t see a way to attach it
to this thread as a tarball or something.

Any ideas?

Thanks,

andy271828

Same behavior here:

Leap:~ # inxi -aCS
System:    Host: Leap Kernel: 5.3.18-lp152.66-default x86_64 bits: 64 compiler: gcc v: 7.5.0
           parameters: BOOT_IMAGE=/boot/vmlinuz-5.3.18-lp152.66-default root=UUID=7798d9ae-137c-4b29-8156-666d4574bd7f
           splash=silent resume=/dev/disk/by-uuid/f6e38c6d-9ca2-4827-95af-c562916958d4 quiet mitigations=auto
           Console: tty 1 wm: kwin_wayland dm: SDDM Distro: openSUSE Leap 15.2
CPU:       Topology: Quad Core model: AMD Ryzen 5 3400G with Radeon Vega Graphics bits: 64 type: MT MCP arch: Zen+
           family: 17 (23) model-id: 18 (24) stepping: 1 microcode: 8108109 L1 cache: 384 KiB L2 cache: 2048 KiB
           L3 cache: 4096 KiB
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 59088
           Speed: 1340 MHz min/max: 1400/3700 MHz boost: enabled Core speeds (MHz): 1: 1305 2: 1393 3: 1299 4: 1258 5: 1268
           6: 1391 7: 1268 8: 1259
           Vulnerabilities: Type: itlb_multihit status: Not affected
           Type: l1tf status: Not affected
           Type: mds status: Not affected
           Type: meltdown status: Not affected
           Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp
           Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization
           Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: conditional, STIBP: disabled, RSB filling
           Type: srbds status: Not affected
           Type: tsx_async_abort status: Not affected
Leap:~ #

My machine runs rock solid and never crashed. However suspend to RAM fails with:

Mar 13 09:46:17 Leap kernel: Non-boot CPUs are not disabled
Mar 13 09:46:17 Leap kernel: amdgpu 0000:08:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-22).
Mar 13 09:46:17 Leap kernel: [drm:process_one_work] *ERROR* ib ring test failed (-22).

Does suspend to RAM work on your machine?

What I do in case of trouble similar to what is reported:

  • Do extensive memory stress testing: https://www.memtest86.com/
  • Refrain from tinkering; load the UEFI optimized defaults.
  • Check for BIOS updates. Select the highest version recommended for the processor, but not higher.

Compilation is rock solid. I built several older kernels with ‘make -j 8’ and never ran into stability issues.

To OP:
Which memory modules? How much memory? How much memory is dedicated to the built-in video?
Is only the Vega 11 video in use?
Have you installed any drivers directly from AMD?

You could use the kernel from the kernel:stable repo. I hope it is available for Leap by now.

Memory:

G.SKILL Ripjaws V Series 32GB (2 x 16GB) 288-Pin DDR4 SDRAM DDR4 3200 (PC4 25600) Desktop Memory Model F4-3200C16D-32GVK

bought from newegg. Set to run at 3200 via my mb interface. By default, the mb wants to set it to a much lower speed, as I noted up
thread. I’ve had hangs at both speeds.

I have no idea what amount of that memory might be dedicated to video. Whatever the opensuse 15.2 default is, is what I am using.

I have processed my memory through memtest, and it comes up clean, at least for the duration I have run memtest (a few hours).

To my knowledge, I have not installed anything directly from AMD. I have a bit of stuff installed from Packman, but turned off that
repo when it stopped responding a couple of months ago. I just poked around and attached to a different mirror. Checking,
it appears I have a bit of video stuff from Packman, which I believe provides some codecs that are not part of the normal openSUSE release,
but not very much: libavcodec*, vlc, MPlayer, ffmpeg3, gstreamer (this last is a dependency/requirement of the others??), and a few other
things that also look like low-level dependencies of the high-level packages. The only other repos I pull from are directly openSUSE related.

Not at all clear to me that any of that could relate to my problems. For reference, when I start up firefox using the reproducer I mentioned up
thread, it accesses this website: https://my.xfinity.com/?cid=cust as my first home page site. There are some moving graphics things active
there, as you can see, but they work perfectly fine if I don’t have the reproducer background code running at the same time.

Any suggestions on how I might attach this reproducer to this thread or post it at some other location? It reliably hangs within a second of trying
to start up firefox, when the openmp code is running…

Also, how does one pull from the kernel:stable repo you reference?

Here you go…

Well, that’s interesting!

I went to the webpage referenced above and did the zypper incantations, to get the stable kernel
into my system.

After rebooting, I attempted to reproduce the same hangs that I was getting before, with the same
openmp executable+firefox combination that triggered it before. Now, however, I do not get the
hang to occur. It appears fixed, at least upon this quick smoke test level of inspection.

More detail: if I fire up yast, it tells me I have a number of other firmware upgrade options from the
new repository as well. The test I just did, was before any of the firmware upgrades.

Given the option to do those firmware upgrades, I then took the upgrades, and tested again. Still
find that the reproducer combination above does not reproduce my hangs any more.

For reference, the kernel I’m running now is:

uname -a
Linux lindblad 5.11.11-1.gdbc4a02-default #1 SMP Tue Mar 30 17:57:52 UTC 2021 (dbc4a02) x86_64 x86_64 x86_64 GNU/Linux

Going through reboot, the grub boot screen no longer gives me a separate option for the opensuse 15.2 standard release
kernel (5.early-something iirc). Not sure I need/want such an option given its instability for me, but it is of note nonetheless.
I wonder if I’m now stuck on this stable kernel sidetrack, or if it will let me off at some point in the future, if/when it
makes sense to do so.

In any case, and for the short term future at least, this changeout looks like it may work for me. And given that it works
for my case, it may be a good thing to chase down the relevant changes and get them into the standard 15.2 update stream.
That’s beyond anything I could make intelligent commentary about however, so I’ll just leave that comment out for others
to consider.

I’ll watch things on my system closely over the next couple of days to see if things stay as stable as they seem to have gotten now.
If I don’t add more to this thread, you can likely infer that my problem has stayed fixed.

In any case, thanks to those who contributed to the discussion here. At the moment, it really looks like it solved my problems.

andy271828

I’m amazed at what you’re doing.
You provided the link “amdgpu crashes with simd exception in mode_support_and_system_configuration (#1154)” (drm/amd on GitLab), which leads to the kernel Bugzilla page “207979 – kernel_fpu_begin() does not set mxcsr value”, and that page contains all the needed info:

What seems to happen is the kernel_fpu_begin() does not reset the mxcsr value, instead leaving it at whatever value the user mode thread has been using. This has the potential to disturb any floating point operations in the kernel and causing FPU/SIMD exceptions if the mask bits have been changed.
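
To make the quoted mechanism more concrete, here is a minimal sketch in Python that decodes the exception-mask bits of an MXCSR value. The bit layout (mask bits 7 to 12, default value 0x1F80 with everything masked) comes from the x86 architecture documentation. The function names are made up for illustration, and the sketch assumes that each entry in the -ffpe-trap list from the debug flags earlier in this thread simply clears the matching mask bit. It only does bit arithmetic; it does not read the real register.

```python
# Sketch only: pure bit arithmetic on MXCSR values, no hardware access.
# Assumption: each -ffpe-trap entry clears exactly one MXCSR mask bit.

MXCSR_DEFAULT = 0x1F80  # all six SSE exceptions masked

# MXCSR exception-mask bits, keyed by the gfortran -ffpe-trap names.
MASK_BITS = {
    "invalid": 1 << 7,
    "denormal": 1 << 8,
    "zero": 1 << 9,
    "overflow": 1 << 10,
    "underflow": 1 << 11,
    "inexact": 1 << 12,
}


def mxcsr_with_traps(traps, base=MXCSR_DEFAULT):
    """Return the MXCSR value after unmasking the named exceptions."""
    value = base
    for name in traps:
        value &= ~MASK_BITS[name]
    return value


def unmasked_exceptions(mxcsr):
    """List the exceptions that would trap for this MXCSR value."""
    return [name for name, bit in MASK_BITS.items() if not mxcsr & bit]


# The trap list from the DEBUG flags quoted earlier in the thread.
debug_traps = ["invalid", "zero", "overflow", "underflow", "denormal"]
debug_mxcsr = mxcsr_with_traps(debug_traps)

print(hex(MXCSR_DEFAULT), unmasked_exceptions(MXCSR_DEFAULT))
print(hex(debug_mxcsr), unmasked_exceptions(debug_mxcsr))
```

If the kernel starts its own floating-point work with the second value still loaded, an operation that would normally just produce an infinity or NaN can raise a SIMD exception instead, which would be consistent with the report quoted above.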

The problem was in the Linux kernel and is already solved by a patch in kernel 5.8:

7ad816762f9b (“x86/fpu: Reset MXCSR to default in kernel_fpu_begin()”) is already in v5.8-rc4.
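
As a rough first check of whether a machine already runs a kernel new enough to contain this commit, the release string printed by `uname -r` can be compared against 5.8. The sketch below is only a heuristic under stated assumptions: the helper names are hypothetical, and a distribution kernel with an older version number may still carry a backport, so a result of False only means "not guaranteed by version number alone".

```python
import re

# Assumption: the fix is present in any mainline-based kernel >= 5.8.
FIX_VERSION = (5, 8)


def kernel_version(release):
    """Extract (major, minor) from a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_fix_by_version(release):
    """True if the version number alone implies the fix is included."""
    return kernel_version(release) >= FIX_VERSION


# The two kernel releases that appear in this thread.
for release in ("5.3.18-lp152.66-default", "5.11.11-1.gdbc4a02-default"):
    print(release, has_fix_by_version(release))
```

On a live system the string could be taken from `platform.release()` instead of being typed in by hand.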

With openSUSE you have 3 options:

  1. Use Tumbleweed (TW); the new kernel is included.
  2. Use Leap plus a new kernel (from kernel:stable or another source).
  3. Ask the SUSE developers to backport the patch into Leap’s 5.3 kernel by creating a bug report and helping them sort things out. Or do it yourself if you need it.
    Leap 15.3 also uses kernel 5.3, so upgrading 15.2 -> 15.3 will not help unless the patch is applied.

This bug is not related to amdgpu at all.

Not sure I’ve parsed your meaning correctly. Are you amazed because

  1. I posted this problem at all, given that in the end, the solution seems already to exist
    or
  2. the connection between the problem and solution is sufficiently baroque that it was amazing that such connection was made here

Either way, here is a bit of back story:

Any time I encounter a problem, I attempt to research/solve it myself before asking for help.
Sometimes that works by itself. Sometimes I need help. This was one of the latter times.
When asking for help, it is always good for the help provider to get all of the information gathered so
far, both because it saves their time redoing background research and because it makes it
clear that the requester is willing and has put in some effort before wasting someone else’s time
with a trivial issue. This is particularly true on forum-like environments when any such help is
offered on a voluntary basis, and let me be clear: I am quite grateful for that help when it is offered.

So I provided everything I knew, with appropriate caveats about how confident I was of each bit of information.
As stated in the original post, I am not well versed in kernel internals, nor with how to approach/solve/work
around problems in them. I could see in the trouble ticket information on the other sites that some progress
existed, but not what to do about it. I could also see that they were still waiting for additional information because, as near
as I could parse the contents of that thread, the solutions they had were only partially effective.

End result: A solution and path forward for me was not clear to me when I posted my request for help, and is clear
to me now. Hopefully, that solution can find its way into the hands of others with the same problem, both through
reading this thread, and perhaps also through some backport/whatever of the relevant kernel changes into
the normal opensuse update stream. I’m not capable of making a reliable contribution to such an effort however,
except insofar as I can say “Works for me”, so that baton will need to be taken up by others, if there is interest.

Again, thanks to all who contributed to this topic.

andy271828