3D engines causing frequent GPU lockups

https://bugzilla.opensuse.org/show_bug.cgi?id=1046962
https://bugs.freedesktop.org/show_bug.cgi?id=101672

A GPU lockup has once again been introduced in Mesa and / or the RadeonSI driver. As is usual with this sort of thing, the image immediately freezes in place while audio stops and every form of input becomes unresponsive (including the NumLock / CapsLock keyboard LEDs); the only option is to power the machine off and back on. I started noticing this crash roughly a month ago after a distribution upgrade (openSUSE Tumbleweed).

The crash only appears to be caused by 3D rendering. It’s probabilistic but very frequent. It is triggered by a variety of game engines, and I’ve noticed it with at least the following ones:

  • Blender 3D: When opening certain scenes in Blender and going into Weight Paint mode, the system is bound to crash in at most 5 minutes of usage.

  • Second Life: Linux native viewers for Second Life also trigger this, by my estimate somewhere between 5 and 30 minutes into a session.

  • Xonotic (Darkplaces engine): Starting a game will freeze the machine anywhere between instantly (the moment a game starts) and 30 minutes at most.

  • The Dark Mod (idTech 4 engine): The same freeze will occur when playing The Dark Mod, anywhere between instantly and roughly 10 minutes at most.

  • Minecraft: The native version of Minecraft can also trigger the crash, after at most 1 hour of play, especially on servers with a lot of geometry.

My OS is openSUSE Tumbleweed x64. My current Mesa version is 17.1.3. I can confirm first noticing this in 17.1.1, but I don’t know if the issue was introduced in 17.1.0 or prior. My video card is a Radeon R7 370 (Gigabyte), Pitcairn Islands GPU, GCN 1.0, RadeonSI. Official product page: http://www.gigabyte.com/products/product-page.aspx?pid=5469

I was able to find an important clue, while running Xonotic using the following environment variables:

MESA_DEBUG=1
MESA_LOG_FILE=/foo/bar/mesa_err.log

A log is generated and readable after restarting the machine. It only contains one line, but that looks like it might point to the cause:

Mesa: User error: GL_INVALID_OPERATION in glGetQueryObjectiv(out of bounds)
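For anyone who wants to repeat this, here is a minimal sketch of a wrapper that sets the two variables above for a single program only, rather than for the whole session. The function name and the example log path and binary name are assumptions made for illustration and are not taken from the report.

```shell
# run_with_mesa_log LOGFILE COMMAND [ARGS...]
# Starts COMMAND with MESA_DEBUG and MESA_LOG_FILE set only for that process,
# using the same values as in the post above.
run_with_mesa_log() {
    log_file=$1
    shift
    env MESA_DEBUG=1 MESA_LOG_FILE="$log_file" "$@"
}

# Example (hypothetical log path; "xonotic-glx" is an assumed binary name):
#   run_with_mesa_log "$HOME/mesa_err.log" xonotic-glx
```

Since the variables are passed through env, they do not leak into the calling shell, which keeps other programs from writing into the same log file.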

The exact same story repeats with Mesa 17.1.5 (as with 17.1.4). The release notes claim that a core crash has been fixed, yet inexplicably this freeze persists after updating to the latest version.

Something very unusual happened however: I rebooted my machine and started testing under Xonotic. After about 10 minutes of playing I got my first freeze, but it did not block the machine. Only Xonotic itself hung (image froze and sound died), so I was able to alt-tab to my desktop… the system detected that the process was unresponsive and killed it, and I noticed that it did NOT eat up any CPU or memory while it was frozen. I tested again and after about 15 minutes I got another freeze… this time though it froze the entire computer as usual (including taking down SSH).

I performed the suggested test of monitoring the files in /sys/kernel/debug/dri/0 through my SSH connection, to check whether this might be caused by a VRAM leak. The most relevant file in there was radeon_vram, which seems to report 2.0 GB at all times (makes sense as that’s the amount of VRAM on my video card). I used the command “watch -n 1 cat /sys/kernel/debug/dri/0/radeon_vram” to monitor it, but it did not show any changes in the file. Adding a screenshot of that directory and its contents.

http://i.imgur.com/g6lRZWN.png

I was able to use parallel SSH sessions to monitor changes in the files suggested by Max Staudt, which I did by using the command:

watch -n 1 cat filename

The relevant files that existed and I was able to watch are:

/sys/kernel/debug/dri/0/clients
/sys/kernel/debug/dri/0/gem_names
/sys/kernel/debug/dri/0/radeon_gtt_mm
/sys/kernel/debug/dri/0/radeon_vram_mm
/sys/kernel/debug/dri/0/ttm_dma_page_pool
/sys/kernel/debug/dri/0/ttm_page_pool

I will attach the captures of each output, each showing its file at most 1 second before the freeze. I understand those files should retain information about VRAM, which should indicate whether this could be a progressive memory leak.
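As an alternative to watching each file in its own SSH session, the same information could be written to disk once per second so that the last state survives the reboot. The following is only a sketch under assumptions: the function name and the output directory are made up for illustration, and reading debugfs normally requires root.

```shell
# snapshot_once OUTDIR FILE...
# Copies the current content of each readable FILE into OUTDIR as
# <basename>.last. Each copy is written to a temporary file first and then
# renamed, so a freeze in the middle of a write cannot truncate the last
# good snapshot.
snapshot_once() {
    out_dir=$1
    shift
    mkdir -p "$out_dir" || return 1
    for src in "$@"; do
        # Skip files that do not exist on this kernel/driver combination.
        [ -r "$src" ] || continue
        dst="$out_dir/$(basename "$src").last"
        cat "$src" > "$dst.tmp" && mv "$dst.tmp" "$dst"
    done
    # Ask the kernel to flush the new snapshots to disk.
    sync
}

# Example loop (hypothetical output directory, stop with Ctrl+C):
#   while :; do
#       snapshot_once /var/tmp/dri-snapshots \
#           /sys/kernel/debug/dri/0/radeon_vram_mm \
#           /sys/kernel/debug/dri/0/radeon_gtt_mm
#       sleep 1
#   done
```

After the next freeze and restart, the .last files in the output directory would then hold the state from roughly the final second before the lockup.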

Very important note: It has taken me hours to obtain those outputs, and for a while I thought the freeze was fixed by an update altogether. For over 2 hours I was able to run all game engines that produced this crash without getting any freeze whatsoever, which has never happened before! However the freeze returned after I restarted my machine, meaning it’s still present. I have no idea whether there’s a switch in my system that causes it to happen only sometimes, but hopefully those files will say something.


Every 1,0s: cat /sys/kernel/debug/dri/0/clients                                                                                                                                                    linux-qz0r.site: Thu Jul 27 00:25:38 2017

             command   pid dev master a   uid      magic
                   X  1766   0   y    y     0          0
                   X  1766 128   n    y     0          0
                   X  1766 128   n    y     0          0
                   X  1766 128   n    y     0          0
                   X  1766 128   n    y     0          0
                   X  1766 128   n    y     0          0
                   X  1766 128   n    y     0          0
                   X  1766 128   n    y     0          0
                   X  1766 128   n    y     0          0
                   X  1766 128   n    y     0          0
                   X  1766 128   n    y     0          0
                   X  1766 128   n    y     0          0
                   X  1766 128   n    y     0          0
                   X  1766 128   n    y     0          0


Every 1,0s: cat /sys/kernel/debug/dri/0/gem_names                                                                                                                                                  linux-qz0r.site: Thu Jul 27 00:25:38 2017

  name     size handles refcount
     1  8847360       2        1
     2  8294400       3        2


Every 1,0s: cat /sys/kernel/debug/dri/0/radeon_gtt_mm                                                                                                                                              linux-qz0r.site: Thu Jul 27 00:25:38 2017

0x0000000000000000-0x0000000000000001: 1: used
0x0000000000000001-0x0000000000000011: 16: used
0x0000000000000011-0x0000000000000111: 256: used
0x0000000000000111-0x0000000000000211: 256: used
0x0000000000000211-0x0000000000000311: 256: used
0x0000000000000311-0x0000000000000321: 16: used
0x0000000000000321-0x0000000000000331: 16: used
0x0000000000000331-0x0000000000000332: 1: used
0x0000000000000332-0x0000000000000333: 1: used
0x0000000000000333-0x0000000000000334: 1: used
0x0000000000000334-0x0000000000000434: 256: used
0x0000000000000434-0x0000000000000444: 16: used
0x0000000000000444-0x0000000000000445: 1: used
0x0000000000000445-0x0000000000000448: 3: used
0x0000000000000448-0x0000000000000449: 1: used
0x0000000000000449-0x000000000000044a: 1: used
0x000000000000044a-0x000000000000054a: 256: used
0x000000000000054a-0x000000000000054b: 1: used
0x000000000000054b-0x000000000000054c: 1: used
0x000000000000054c-0x000000000000055c: 16: used
0x000000000000055c-0x000000000000055d: 1: used
0x000000000000055d-0x000000000000055e: 1: used
0x000000000000055e-0x000000000000056e: 16: used
0x000000000000056e-0x000000000000057e: 16: used
0x000000000000057e-0x0000000000000583: 5: free
0x0000000000000583-0x0000000000000584: 1: used
0x0000000000000584-0x0000000000000594: 16: used
0x0000000000000594-0x000000000000059c: 8: free
0x000000000000059c-0x000000000000059d: 1: used
0x000000000000059d-0x000000000000059e: 1: used
0x000000000000059e-0x00000000000005ab: 13: used
0x00000000000005ab-0x00000000000005ad: 2: used
0x00000000000005ad-0x00000000000005ae: 1: used
0x00000000000005ae-0x00000000000005b1: 3: free
0x00000000000005b1-0x00000000000005b2: 1: used
0x00000000000005b2-0x00000000000005b5: 3: free
0x00000000000005b5-0x00000000000005b6: 1: used
0x00000000000005b6-0x00000000000005b7: 1: used
0x00000000000005b7-0x00000000000005c7: 16: used
0x00000000000005c7-0x00000000000005d2: 11: used
0x00000000000005d2-0x00000000000005d3: 1: used
0x00000000000005d3-0x00000000000005d4: 1: used
0x00000000000005d4-0x00000000000005d5: 1: used
0x00000000000005d5-0x00000000000005d6: 1: used
0x00000000000005d6-0x00000000000005d7: 1: used
0x00000000000005d7-0x00000000000005d8: 1: used
0x00000000000005d8-0x00000000000005d9: 1: used
0x00000000000005d9-0x00000000000005da: 1: used
0x00000000000005da-0x00000000000005db: 1: used
0x00000000000005db-0x00000000000005dc: 1: used
0x00000000000005dc-0x00000000000005dd: 1: used
0x00000000000005dd-0x00000000000005de: 1: used
0x00000000000005de-0x00000000000005df: 1: used
0x00000000000005df-0x00000000000005e2: 3: free
0x00000000000005e2-0x00000000000005e3: 1: used
0x00000000000005e3-0x00000000000005e4: 1: used
0x00000000000005e4-0x0000000000000624: 64: free
0x0000000000000624-0x0000000000000634: 16: used
0x0000000000000634-0x0000000000000644: 16: used
0x0000000000000644-0x0000000000000654: 16: used
0x0000000000000654-0x0000000000000655: 1: used
0x0000000000000655-0x0000000000000656: 1: used
0x0000000000000656-0x0000000000000657: 1: used
0x0000000000000657-0x000000000000065d: 6: free
0x000000000000065d-0x000000000000065e: 1: used
0x000000000000065e-0x000000000000065f: 1: used


Every 1,0s: cat /sys/kernel/debug/dri/0/radeon_vram_mm                                                                                                                                             linux-qz0r.site: Thu Jul 27 00:25:38 2017

0x0000000000000000-0x0000000000000040: 64: used
0x0000000000000040-0x0000000000000165: 293: used
0x0000000000000165-0x00000000000001d6: 113: used
0x00000000000001d6-0x00000000000005d6: 1024: used
0x00000000000005d6-0x00000000000005d7: 1: used
0x00000000000005d7-0x00000000000005d8: 1: used
0x00000000000005d8-0x0000000000000dc1: 2025: used
0x0000000000000dc1-0x0000000000000dc5: 4: used
0x0000000000000dc5-0x0000000000000dc6: 1: used
0x0000000000000dc6-0x0000000000000dc7: 1: used
0x0000000000000dc7-0x0000000000000dc8: 1: used
0x0000000000000dc8-0x0000000000000dc9: 1: used
0x0000000000000dc9-0x0000000000000dd1: 8: used
0x0000000000000dd1-0x0000000000000de1: 16: used
0x0000000000000de1-0x0000000000000df1: 16: used
0x0000000000000df1-0x0000000000000df5: 4: used
0x0000000000000df5-0x0000000000000df9: 4: used
0x0000000000000df9-0x0000000000000dfd: 4: used
0x0000000000000dfd-0x0000000000000e01: 4: used
0x0000000000000e01-0x0000000000000e05: 4: used
0x0000000000000e05-0x0000000000000e0d: 8: used
0x0000000000000e0d-0x0000000000000e2d: 32: used
0x0000000000000e2d-0x0000000000000e2e: 1: used
0x0000000000000e2e-0x0000000000000e30: 2: free
0x0000000000000e30-0x0000000000000e40: 16: used
0x0000000000000e40-0x0000000000000e48: 8: used
0x0000000000000e48-0x0000000000000e50: 8: used
0x0000000000000e50-0x0000000000000ea0: 80: free
0x0000000000000ea0-0x0000000000000ea8: 8: used
0x0000000000000ea8-0x0000000000000eb0: 8: used
0x0000000000000eb0-0x0000000000000eb8: 8: used
0x0000000000000eb8-0x0000000000000ec0: 8: used
0x0000000000000ec0-0x0000000000000ed0: 16: used
0x0000000000000ed0-0x0000000000000ee0: 16: used
0x0000000000000ee0-0x0000000000000ef0: 16: used
0x0000000000000ef0-0x0000000000000ef8: 8: used
0x0000000000000ef8-0x0000000000000f08: 16: used
0x0000000000000f08-0x0000000000000f09: 1: used
0x0000000000000f09-0x0000000000000f10: 7: free
0x0000000000000f10-0x0000000000000f20: 16: used
0x0000000000000f20-0x0000000000000f30: 16: used
0x0000000000000f30-0x0000000000000f31: 1: used
0x0000000000000f31-0x0000000000000f38: 7: free
0x0000000000000f38-0x0000000000000f40: 8: used
0x0000000000000f40-0x0000000000000f48: 8: used
0x0000000000000f48-0x0000000000000f49: 1: used
0x0000000000000f49-0x0000000000000f51: 8: used
0x0000000000000f51-0x0000000000000f61: 16: used
0x0000000000000f61-0x0000000000000f68: 7: free
0x0000000000000f68-0x0000000000000f69: 1: used
0x0000000000000f69-0x0000000000000f70: 7: free
0x0000000000000f70-0x0000000000000f78: 8: used
0x0000000000000f78-0x0000000000000f80: 8: used
0x0000000000000f80-0x0000000000000f88: 8: used
0x0000000000000f88-0x0000000000000f98: 16: used
0x0000000000000f98-0x0000000000000fa0: 8: used
0x0000000000000fa0-0x0000000000000fa1: 1: used
0x0000000000000fa1-0x0000000000000fa8: 7: free
0x0000000000000fa8-0x0000000000000fa9: 1: used
0x0000000000000fa9-0x0000000000000fb9: 16: used
0x0000000000000fb9-0x0000000000000fc1: 8: used
0x0000000000000fc1-0x0000000000000fd1: 16: used
0x0000000000000fd1-0x0000000000000fd9: 8: used
0x0000000000000fd9-0x0000000000000fe0: 7: free
0x0000000000000fe0-0x0000000000000ff0: 16: used
0x0000000000000ff0-0x0000000000000ff8: 8: used


Every 1,0s: cat /sys/kernel/debug/dri/0/ttm_dma_page_pool                                                                                                                                          linux-qz0r.site: Thu Jul 27 00:25:38 2017

         pool      refills   pages freed    inuse available     name
           wc         5008             0     3833    16199 radeon 0000:03:00.0
       cached        22077         83375     4929        4 radeon 0000:03:00.0


Every 1,0s: cat /sys/kernel/debug/dri/0/ttm_page_pool                                                                                                                                              linux-qz0r.site: Thu Jul 27 00:25:37 2017

  pool      refills   pages freed     size
    wc            0             0        0
    uc            0             0        0
wc dma            0             0        0
uc dma            0             0        0
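To compare two captures of radeon_vram_mm or radeon_gtt_mm without reading every range by hand, the used and free sizes can be added up. This is a rough sketch: the function name is an assumption, and the unit of the size column is not stated anywhere in this thread, so the totals are only meaningful relative to each other.

```shell
# mm_summary: reads lines of the form "START-END: SIZE: used|free" on stdin
# and prints the summed SIZE of the used and of the free ranges.
# Lines that do not end in "used" or "free" (such as the watch header) are
# ignored.
mm_summary() {
    awk '
        $3 == "used" { used += $2 + 0 }   # "$2 + 0" drops the trailing colon
        $3 == "free" { free += $2 + 0 }
        END { printf "used=%d free=%d\n", used, free }
    '
}

# Example (as root):
#   mm_summary < /sys/kernel/debug/dri/0/radeon_vram_mm
```

A used total that keeps growing between captures taken minutes apart would support the leak theory, while a stable total would speak against it.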

I used ‘dmesg -w’ via SSH to monitor dmesg output as the system froze. I have not seen anything of interest, and no new messages were printed before the crash took place. The only arguably suspicious line was:

[ 1286.800069] perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79750

Nevertheless, I have discovered another important factor during my tests: I decided to look through my BIOS settings again, as I remembered I had left enabled a memory overclock setting called Performance Enhance. In the past when I had a different set of memory modules, this option caused the exact same freeze when I was watching YouTube (1080p @ 60fps videos). Later on I got new modules, and due to how my clocks are synced I’m running those at an underclocked (therefore more stable) frequency, so I figured I could leave this enabled without problems. The highly erratic probabilities of the freezes threw me off (once it’s after 10 minutes, then it’s after 2 hours), whereas a crash this obvious would have been all over the bug tracker by now if it was Mesa.

After disabling it, I no longer seem to get any immediate system freezes. It will however require more testing to confirm it was that option, so please give me a few more weeks before we close this. If my theory is proven wrong, I’ll immediately post a new comment and let everyone know.

The freeze still happens with the Performance Enhance BIOS setting turned off, so the crash is not caused by my overclocking settings. It took 2 hours of playing Minecraft in a row for it to occur again.

I noticed an important clue: In the case of Minecraft, the system only seems to crash after mobs have loaded into view. If I only explore a world where no entities spawn (be it full of voxel geometry), the freeze has never happened thus far. This made me realize that all engines I noticed the freeze with have one thing in common: A skeletal mesh is loaded into view. Could this be an issue related to animated models by chance?

Note that I don’t suspect Vertex Buffer Objects to be a cause: I once turned off VBO in Minecraft, restarted the game, and still got a system freeze.

For the time being, I’ve decided to test whether this also happens with the RadeonSI scheduler. To make sure I’m applying it to all games across the system, I’ve added the following line to ~/.profile and restarted:

export R600_DEBUG=sisched

I managed to play Minecraft for over an hour several times, including in areas with many mobs and therefore skeletal models in view… so far no freeze. However it will take much more testing to be sure this makes a difference, and so far there is no real verdict. I’ll also follow the advice of testing with SuperTuxKart, which should be an easier test case for other developers.

If the SI scheduler does turn out to fix the problem, it would mean this is a bug specific to the old scheduler (still default, hence why that environment variable is needed to switch). That would make sense since IIRC the scheduler influences how drawable items are queued and rendered, which is a likely candidate for something causing an error that freezes the system.

To rule out the possibility of a hardware issue, I ran two Memtest86 5.01 sessions from a Clonezilla bootable CD. The first was in the day for 5 hours, the second was during the night for over 10 hours: The program only registered 3 passes in total, but it did not find any errors. I’ll attach a picture just in case any useful information is printed there.

http://i.imgur.com/Tdelc3O.jpg

Wasn’t sure whether to bump this same bug report, as the original issue has clearly been fixed during nearly a year of countless Kernel + Mesa + driver updates. Unfortunately I now experience a new issue acting just like what I described here at the time: When certain 3D engines are running, there is a chance that after a few minutes the machine instantly freezes and becomes fully unusable until powered off and back on. I don’t know when the new crash was introduced since I haven’t played a lot of 3D games recently, but I’d assume somewhere within the last few months.

I now have Kernel 4.15.3 and Mesa 18.0.0. Again my video card is a Radeon R7 370 from Gigabyte (RadeonSI, GCN 1.0, AMD Pitcairn Islands). I’m running the openSUSE Tumbleweed x64 rolling release distribution.

Can someone please explain a way to debug those instant system freezes as they’re added to the system components? I can’t get an output at the time of the crash as the entire machine stops working and becomes bricked until restarted (likely including SSH), but maybe I can make it log info that I can retrieve after I reboot? Any useful info will help, just please nothing dangerous that might permanently break my OS.
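One low-risk way to keep kernel messages across a hard freeze is to append every new line to a file and flush it to disk immediately. The sketch below is an assumption-laden illustration rather than an established procedure: the function name and log path are invented, and it changes nothing in the system configuration.

```shell
# log_lines_synced FILE
# Appends every line read from stdin to FILE and calls sync after each one,
# so the last message printed before a freeze has a chance of reaching the
# disk.
log_lines_synced() {
    log_file=$1
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$log_file"
        sync
    done
}

# Example (as root, hypothetical log path):
#   dmesg -w | log_lines_synced /var/log/dmesg-live.log
#
# If the systemd journal is stored persistently, the kernel messages of the
# previous boot can also be read after the restart with:
#   journalctl -k -b -1
```

Calling sync for every line is slow while the existing dmesg backlog is being written, but after that only new messages arrive and the cost is negligible. If the GPU hang also stops the kernel from writing to disk, nothing will be recorded either way.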

Hi
Are you using radeon or amdgpu? I would switch if using radeon and see if that makes a difference.

Can you show the output from;


/usr/sbin/lspci -nnk | grep -A3 VGA

The switch is pretty easy, a blacklist, add some boot options and rebuild initrd…

Searching the two bugs and this thread I don’t see any mention of your having tried the default modesetting driver built into the Xorg server. It should be used automatically if you uninstall the radeon driver, but can be specified through xorg.conf* if desired. If the modesetting driver solves your problem it might be that the pci id of the r370 ought to be requested to be added to or removed from /etc/X11/xorg_pci_ids/modesetting.ids via bug 1046962.

I’m still on radeonsi. I tried switching, it won’t work: Whenever I try to blacklist radeonsi and use amdgpu instead, the system boots to a black screen or remains stuck in the console.

If anyone has a guaranteed way to do it, by all means let me know! amdgpu was supposed to support GCN 1.0 cards by default over a year ago… for some reason it still doesn’t default me to it up to this very day.

Something tells me the modesetting driver wouldn’t be able to handle any games, so I couldn’t even test with it. Uninstalling radeon sounds like a very risky option, likely something that would leave me with an unbootable system.

Removing an Xorg driver is incapable of preventing boot, and certainly would not be default if there was any such risk. The only “risk” might come from failure to remove all traces of a proprietary driver causing the Xorg server to fail. The Xorg modesetting driver, not to be confused with KMS, is in use here on well over half my installations: all Intels and GeForces less than about 10 years old, and both my ATI Evergreens. The minority here not using it are all at least 10 years old.

Hi
If you don’t set the boot options, then yes, you get the black screen :wink:

Add to the kernel command line options via YaST Bootloader;


amdgpu.si_support=1

This is also valid for the radeon driver (which may help…?), suggest try this first;


radeon.si_support=1

Hi
Scratch the above, I see it’s enabled by default, so only needed for amdgpu.

Oh my gosh YES! I just booted the system successfully with the following Kernel parameters:

radeon.si_support=0 amdgpu.si_support=1

The console command confirms that it’s working:

mircea@linux-qz0r:~> /sbin/lspci -nnk | grep -A3 VGA
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Curacao PRO [Radeon R7 370 / R9 270/370 OEM] [1002:6811] (rev 81)
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:2015]
        Kernel driver in use: amdgpu
        Kernel modules: radeon, amdgpu
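The same check can be scripted, for example to confirm after each reboot which driver is bound and whether the parameters were really passed to the kernel. A minimal sketch, with both function names being assumptions chosen for illustration:

```shell
# vga_kernel_driver: reads `lspci -nnk` output on stdin and prints the kernel
# driver bound to the first VGA controller, or nothing if none is listed.
vga_kernel_driver() {
    awk '
        !seen && /VGA compatible controller/ { in_vga = 1; next }
        in_vga && /Kernel driver in use:/    { print $NF; in_vga = 0; seen = 1 }
        # A non-indented line starts the next PCI device.
        in_vga && /^[^ \t]/                  { in_vga = 0; seen = 1 }
    '
}

# has_kernel_param CMDLINE PARAM
# Succeeds if PARAM appears as a whole word in CMDLINE.
has_kernel_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example:
#   /sbin/lspci -nnk | vga_kernel_driver
#   has_kernel_param "$(cat /proc/cmdline)" amdgpu.si_support=1 && echo "parameter is set"
```

With the lspci output shown above, the first function would print amdgpu.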

Finally, for the first time ever, I get to see amdgpu in action. I’ve been waiting for almost a year, but the previous console commands all failed… thank you for that suggestion! If the configuration proves stable, I’m adding those parameters to grub2 permanently lol!

Tomorrow I’ll probably check if the freezes also go away, at least at a first glance. If it doesn’t then this is a very weird problem which definitely needs deep investigation. Hopefully it’s specific to the radeon driver and doesn’t reach amdgpu too.

Hi
Well there are a bazillion kernel module parameter options… have no idea what all of them do…

I have a systemd service to set the power options (openSUSE Software), I’ve also played with the clock speed via command line;


cat /sys/kernel/debug/dri/0/amdgpu_pm_info
uvd    disabled
vce    disabled
power level 0    sclk: 30000 vddc: 3850
grover:~ # echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
grover:~ # cat /sys/kernel/debug/dri/0/amdgpu_pm_info
uvd    disabled
vce    disabled
power level 4    sclk: 80000 vddc: 4250
echo auto > /sys/class/drm/card0/device/power_dpm_force_performance_level
cat /sys/kernel/debug/dri/0/amdgpu_pm_info
uvd    disabled
vce    disabled
power level 0    sclk: 30000 vddc: 3850

I can confirm that the exact same system freeze happens with both the radeon and amdgpu driver: Using either module makes absolutely no difference.

This is the last piece of confirmation I needed to conclude that what’s happening must be deliberate malware: There is simply no way a GPU crash bug could behave 100% the same way on two entirely different video drivers. Furthermore, this freeze is completely identical to the one I initially reported here… which was obviously fixed, since there have been so many updates to every system component that it would have been solved by sheer chance at this point! Functionality like this should only be seen if someone is actively re-implementing the problem on top of updated system components, with the active intent of keeping its effects identical each time. It’s possible that my machine may be used to test malware usable to shut down Linux systems, in which case I need to find out where it’s hidden and how it’s bricking computers before it spreads.

This attack must exploit vulnerabilities that keep coming and going in X11, Mesa, or some other system component… those are hopefully holes which can be discovered and plugged to render the attacks impractical altogether. Again I only know that it’s triggered while certain 3D engines are running (possibly aimed primarily at gamers?) and has a random chance of happening roughly once every 30 minutes (likely to make testing harder and better hide the exploit).

The old issue was definitely resolved, so I’ve moved the new one to a new set of bug reports. Let’s hope this also gets solved soon and does not come back this time.

https://bugzilla.opensuse.org/show_bug.cgi?id=1084767
https://bugs.freedesktop.org/show_bug.cgi?id=105425