Bug 1084767 - 3D & games produce periodic GPU crashes (Radeon R7 370)
Status: RESOLVED UPSTREAM
Duplicates: 1028575
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: X.Org
Version: Current
Hardware: x86-64 openSUSE Factory
Priority: P2 - High
Severity: Major
Assigned To: E-mail List
Reported: 2018-03-09 21:29 UTC by Mircea Kitsune
Modified: 2018-05-18 16:07 UTC (History)


Attachments
dmesg (79.69 KB, text/plain), 2018-03-09 21:31 UTC, Mircea Kitsune
journalctl (3.01 MB, text/plain), 2018-03-09 21:35 UTC, Mircea Kitsune
lspci (5.81 KB, text/plain), 2018-03-09 21:36 UTC, Mircea Kitsune
Xorg.0.log (79.26 KB, text/x-log), 2018-03-09 21:37 UTC, Mircea Kitsune
xsession-errors-:0 (272.13 KB, text/plain), 2018-03-09 21:38 UTC, Mircea Kitsune
Screenshot of the Blender window glitching (1.95 MB, image/png), 2018-03-30 01:49 UTC, Mircea Kitsune
Output of: watch cat /sys/kernel/debug/dri/0/amdgpu_pm_info (1.51 KB, text/plain), 2018-04-01 23:11 UTC, Mircea Kitsune

Description Mircea Kitsune 2018-03-09 21:29:48 UTC
I am experiencing periodic system freezes which, to my knowledge, are caused by a GPU lockup. The freezes are always triggered by 3D rendering, and are seemingly produced by a multitude of game engines. The crash is highly probabilistic, with various programs triggering it anywhere between one minute and one hour into a session. The image instantly freezes in place, audio stops working and every form of input dies (including the NumLock / CapsLock keyboard LEDs). The machine is entirely bricked until powered off and back on.

The problem is oddly similar to an issue I experienced a year ago, which had the exact same behavior and was also caused by 3D. That issue was fixed as every system component received major updates, but it seems to have come back sometime during the last few months. I will link its report here as it may still contain useful information:

https://bugzilla.opensuse.org/show_bug.cgi?id=1046962

My OS is Linux openSUSE Tumbleweed x64: Kernel 4.15.7, Xorg X11 Server 1.19.6, Mesa 18.0.0, xf86-video-ati 17.10.0, xf86-video-amdgpu 18.0.0. My motherboard is a Gigabyte GA-X58A-UD7 (rev 1.0). My video card is a Radeon R7 370 (Gigabyte) (rev 1.0), Pitcairn (Southern Islands) GPU, GCN 1.0, RadeonSI.

https://www.gigabyte.com/Motherboard/GA-X58A-UD7-rev-10
https://www.gigabyte.com/Graphics-Card/GV-R737WF2OC-2GD-rev-10
Comment 1 Mircea Kitsune 2018-03-09 21:31:45 UTC
Created attachment 763297 [details]
dmesg
Comment 2 Mircea Kitsune 2018-03-09 21:35:28 UTC
Created attachment 763298 [details]
journalctl

journalctl --since "1 day ago"

This encompasses multiple crashes caused today by Blender 2.79.
Comment 3 Mircea Kitsune 2018-03-09 21:36:20 UTC
Created attachment 763299 [details]
lspci
Comment 4 Mircea Kitsune 2018-03-09 21:37:09 UTC
Created attachment 763300 [details]
Xorg.0.log
Comment 5 Mircea Kitsune 2018-03-09 21:38:02 UTC
Created attachment 763301 [details]
xsession-errors-:0
Comment 6 Mircea Kitsune 2018-03-09 21:50:29 UTC
An important note: The exact same freeze happens with both the "radeon" and "amdgpu" driver: Using either module makes absolutely no difference.
Comment 7 Max Staudt 2018-03-15 18:11:42 UTC
(In reply to Mircea Kitsune from comment #6)
> An important note: The exact same freeze happens with both the "radeon" and
> "amdgpu" driver: Using either module makes absolutely no difference.

That is really, really strange. Maybe a hardware issue, then? I know it's unlikely you have one, but could you try the same card and system on a different motherboard, by any chance? How about a different Radeon card?

Just a random thought: Do the crashes happen during specific times of the year?
Maybe you've got a faulty power supply, or the computer heats up when it's warm (but it's cold in the winter)?
Comment 8 Mircea Kitsune 2018-03-15 23:01:08 UTC
(In reply to Max Staudt from comment #7)

The card is surely fine, this is definitely software related: I've been using this card for several years now... it only began showing those issues last year, I believe around a major Mesa update. 3D rendering is always the sole cause, even a very simple 3D scene (not GPU / VRAM intensive) in the Blender viewport can trigger it... meanwhile 2D shaders such as KDE desktop effects never cause a freeze, not even resource-heavy games. Heat doesn't affect the problem either, I've already tested this with the weather fluctuating in my area. Currently I cannot test the card on another system.

Given how both the cause and behavior of the issue are constant over a long period of time, I do suspect the issue may be a self-updating malware exploiting X11 vulnerabilities as they come and go; it feels a bit like someone actively re-implementing it as system components get updated. Unfortunately I don't yet have anything clear to prove this beyond a doubt... just the fact that such an issue shouldn't be constant over such a long period of time given there are major updates daily, as well as acting identically on two different video drivers.
Comment 9 Mircea Kitsune 2018-03-25 21:20:57 UTC
I've been testing this crash using Xonotic during the past two days, given that it's a game I have a lot of experience customizing. What I found is pretty interesting and should be a good start in shedding light on this bug.

Initially the system freeze occurred somewhere between 10 and 40 minutes. Upon changing a few cvars, I seem to have almost entirely gotten rid of it: After nearly 5 hours of continuous testing, only one lockup has taken place! Below are the cvar overrides I added to my autoexec.cfg for the test: At least one of them had an influence... I'm still working on pinning down which, and that will take several more days due to the probability rate of the issue.

r_batch_multidraw 0 // old: 1
r_batch_dynamicbuffer 0 // old: 1
r_depthfirst 0 // old: 2
gl_vbo 0 // old: 3
gl_vbo_dynamicindex 0 // old: 1
gl_vbo_dynamicvertex 0 // old: 1
r_glsl_skeletal 0 // old: 1
vid_samples 1 // old: 4
gl_texture_anisotropy 0 // old: 16

I know the issue has something to do with triangles or vertices: The crash seems more frequent when there are a lot of players or objects present, indicating that an increased surface count may be a contributor. I've suspected mesh data stored on the video card to be the culprit, especially shared data with multiple objects using one instance of a mesh from video memory. This is why my bet is currently on gl_vbo (Vertex Buffer Objects / GL_ARB_vertex_buffer_object) being the variable that made a difference... again I still got a lockup even without it, so if anything it just heavily mitigated the crash.

This belief is reinforced by my previous experience in Blender: The only scene causing the GPU lockup is one where several high-poly objects share common mesh data, and the crash always occurred upon me adding a Subdivision Surface to just one of them (increasing its polygon count). It's been confirmed that as of Blender 2.77 (I have 2.79) VBO is indeed enabled in the 3D viewport. Note that I was also using the untextured viewport, thus I doubt textures play a role.

Lastly I ruled out the possibility of overheating having anything to do with it: During the first 3 hours in which I got no lockup, the temperature in my room was above 26°C. When I did get that one lockup later at night, the temperature of my room had long dropped to 23°C. The stress on the GPU was the same at all times, absolutely no settings were changed including the map.
Comment 10 Max Staudt 2018-03-26 12:36:48 UTC
That's good news, thank you Mircea!

If you can find out more, that would be really helpful. Maybe you can find something that can be used as a reproducer, and thus as a debugging aid?
Comment 11 Mircea Kitsune 2018-03-28 00:20:33 UTC
Testing is still very much under way. There's nothing conclusive yet, but I should definitely share a piece of information early on.

To my surprise, it would appear the culprit may be either Anti-Aliasing or Anisotropic Filtering. I decided to re-enable their cvars first in Xonotic since I honestly suspected them the least... the moment I did that all hell broke loose again: In 30 minutes I had two system lockups! Then I disabled them once more, and could play a 40 minute match with no problem.

I have no idea which of the two it could be, but I should be getting there in the following days. I'm slowly re-enabling the other cvars first to rule them out, then I'll see whether AA or texture filtering is behind the crashes.
Comment 12 Mircea Kitsune 2018-03-28 23:03:09 UTC
And we have a verdict. The influential factor is by far the anti-aliasing, at least in the case of Xonotic. The other cvars I previously mentioned have absolutely no effect on the frequency of this freeze.

Today I enabled the feature again and tried playing another match: I instantly got two lockups, one after 8 minutes and the other after only 20 seconds! I then disabled it and let the bots play again while I was away: This time the machine froze after more than 2 whole hours of experiencing no issues.

I find it interesting how the probability of the freeze seems to scale with the number of samples: If I use 4x AA ("vid_samples 4"), I get a crash roughly every 30 minutes... if I disable AA ("vid_samples 1"), I get a crash less than once per 2 hours... 30 minutes * 4x = 2 hours. Maybe this is just me seeing patterns but I thought I should suggest the idea.
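The scaling guess above can be put into numbers. The following is only a sketch under two assumptions that are not established anywhere in this report: that the freeze behaves like a memoryless event, and that its rate grows linearly with the sample count. The function names are hypothetical, and the 120 minute baseline is just the rough figure quoted above.

```python
import math

def mean_minutes_between_freezes(baseline_minutes: float, samples: int) -> float:
    """Mean time between freezes if the freeze rate scales linearly with vid_samples."""
    return baseline_minutes / samples

def chance_of_surviving(run_minutes: float, mean_minutes: float) -> float:
    """Probability that a run lasts at least run_minutes, assuming an
    exponentially distributed (memoryless) time-to-freeze with the given mean."""
    return math.exp(-run_minutes / mean_minutes)

# Rough figure from this comment: about one freeze per 2 hours with vid_samples 1.
baseline = 120.0
for samples in (1, 2, 4):
    print(samples, mean_minutes_between_freezes(baseline, samples))

# If AA made no difference and the mean stayed at 30 minutes, how likely
# would a single 2 hour run without a freeze be?
print(chance_of_surviving(120.0, 30.0))
```

Under these assumptions a freeze-free 2 hour run would be unusual (under 2%) if the mean were still 30 minutes, but one run per setting is still thin evidence, so repeating each configuration several times would make the comparison more trustworthy.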

I'd like to hear some thoughts from the developers or experienced users at this point. Can we close in on the source of this GPU lockup, knowing that anti-aliasing greatly affects its frequency in the DarkPlaces engine? Are there any open bugs about AA-related X11 crashes I should check out? What else can I test, ideally still under Xonotic where I have the best test case prepared?
Comment 13 Mircea Kitsune 2018-03-30 01:49:16 UTC
Created attachment 765502 [details]
Screenshot of the Blender window glitching

I should add another detail to the discussion. I know this may be a separate issue which might have nothing to do with the crash, but at the same time I wouldn't be surprised if it does: Glitched graphics often indicate something going wrong with the display, such as corrupt textures in video memory, which may ultimately lead to just such a lockup.

On occasion, certain programs (namely Firefox and Blender) glitch out and draw broken rectangles all over the window. Some of those glitches are just boxes of random colors, others contain pieces of past images (for instance I saw patterns from my lock screen background). Sometimes they quickly disappear on their own, at other times I have to restart the program as it becomes illegible and unusable. If I move anything the squares flicker all over the place. The glitches continue even after I disable desktop effects, thus KDE compositing should have nothing to do with it.

Attached is a screenshot of the glitch happening in Blender, showing its window covered in the corrupt squares. I'm curious what your opinion is. Again I know this may be an unrelated issue, but I'm wondering whether it indicates some video storage corruption that's also leading up to the lockups.
Comment 14 Mircea Kitsune 2018-04-01 19:53:06 UTC
I'm still struggling to debug this. The more I see the more my jaw drops.

First of all, the rule that disabling anti-aliasing decreases the frequency of the freeze (see the comments above) was just patched out: AA no longer has any effect either, it always freezes between 0 and 30 minutes now.

I ran the following new tests in Xonotic, none of which had any influence:

- Running with the following environment variable set: R600_DEBUG=checkir,precompile,nooptvariant,nodpbb,nodfsm,nodma,nowc,nooutoforder,nohyperz,norbplus,no2d,notiling,nodcc,nodccclear,nodccfb,nodccmsaa

- Disabling all shaders, even turning off OpenGL 2.0 support entirely.

- Resetting the entire BIOS to its failsafe defaults, making sure that neither overclocking nor any other settings are involved.

- Running under both an X11 and Wayland session (Plasma). In Wayland it crashes instantly so it's even worse.

- Verified that this occurs on both the "radeon" and "amdgpu" modules, meaning the video driver makes no difference either.

It's clear to me at this point that this is the work of a professional: The code causing the crash is carefully maintained and injected into my system. If this was just a bug, at least one of the countless things I tried would have affected it somehow, it's impossible for a randomly occurring bug to survive so many different settings and environments... the issue instead is adaptive, so that the moment I find and disable one implementation another is activated within minutes to keep the crashes going. I imagine the objective is to block the user from finding a solution and ultimately censor them from using specific programs. I find it unbelievable that someone out there is actively doing this.

Please help me get to the bottom of this: The crash clearly acts by simulating some sort of bug, so there must be a vulnerability deep in the system which hidden code is exploiting. I offered a lot of test data on this report: If the developers read this, please let me know what to try next!
Comment 15 Mircea Kitsune 2018-04-01 23:11:04 UTC
Created attachment 765652 [details]
Output of: watch cat /sys/kernel/debug/dri/0/amdgpu_pm_info

I decided to turn my attention to the last logical thing I can imagine: DPM (Dynamic Power Management) and the clocks on my video card. The kernel added support for realtime tuning of the frequencies a while ago, so I was pondering if the default setup may have led to excess overclocking.

I left a console to watch the file /sys/kernel/debug/dri/0/amdgpu_pm_info which I understand contains the video card frequencies. The maximum "power level" I seem to reach is 4, at sclk 101500 and mclk 140000. I'm attaching the peak output of this file here.

My video card is supposed to run at 1015 MHz (core clock) + 5600 MHz (memory clock). I don't fully understand how those numbers translate to frequencies, but from what I heard that represents the MHz * 100. If such is the case, my GPU clock is just right whereas my VRAM is actually under-clocked to a quarter of its default frequency! Can anyone confirm this so at least the hypothesis of bad clocks is out of the way?
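One possible reading of these numbers, sketched below, rests on two assumptions: that amdgpu_pm_info reports clocks in units of 10 kHz (the "MHz * 100" mentioned above), and that the card's advertised 5600 MHz is a GDDR5 effective data rate, which is conventionally quoted at four times the real memory clock. The constant and function names are made up for this illustration.

```python
# Assumption: amdgpu_pm_info reports sclk/mclk in units of 10 kHz.
UNITS_PER_MHZ = 100

# Assumption: the card uses GDDR5, whose advertised "effective" rate is
# four times the actual memory clock.
GDDR5_DATA_RATE_MULTIPLIER = 4

def to_mhz(raw_clock: int) -> float:
    """Convert a raw pm_info clock value to MHz."""
    return raw_clock / UNITS_PER_MHZ

def effective_memory_mhz(raw_mclk: int) -> float:
    """Convert a raw mclk value to the marketing-style effective rate in MHz."""
    return to_mhz(raw_mclk) * GDDR5_DATA_RATE_MULTIPLIER

sclk_raw, mclk_raw = 101500, 140000    # peak values quoted above
print(to_mhz(sclk_raw))                # core clock in MHz
print(to_mhz(mclk_raw))                # real memory clock in MHz
print(effective_memory_mhz(mclk_raw))  # effective memory rate in MHz
```

If both assumptions hold, the memory would be running at its rated speed rather than at a quarter of it, which would take the bad-clocks hypothesis off the table.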

I may try testing with the kernel parameters "radeon.dpm=0 amdgpu.dpm=0" later: I tried doing so briefly but the performance is too horrible to play a game, so I'll instead leave a bot match running in spectator mode while I'm away.
Comment 16 Mircea Kitsune 2018-04-04 17:35:57 UTC
Today I ran two tests to ensure that frequencies and DPM are not a factor.

- Setting the DPM profile to low by running the following commands as root:

    echo battery > /sys/class/drm/card0/device/power_dpm_state
    echo low > /sys/class/drm/card0/device/power_dpm_force_performance_level

- Booting my system with the following kernel parameters to disable DPM:

    radeon.dpm=0 amdgpu.dpm=0

Just like with everything else, they made absolutely no difference: Xonotic froze the machine after only 8 minutes of running each time. The settings are applied and visible by checking /sys/kernel/debug/dri/0/amdgpu_pm_info, and are even reflected in the performance which was reduced from 60 FPS to below 30 FPS.
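For anyone repeating the first test, the two writes and the read-back check can be scripted. This is a minimal sketch: the helper name is hypothetical, the two file names are the ones from the commands above, and the demo deliberately runs against a scratch directory rather than the real sysfs tree.

```python
import tempfile
from pathlib import Path

def set_dpm_profile(device_dir: Path, state: str, level: str) -> dict:
    """Write both DPM settings under device_dir and return what reads back."""
    settings = {
        "power_dpm_state": state,
        "power_dpm_force_performance_level": level,
    }
    applied = {}
    for name, value in settings.items():
        node = device_dir / name
        node.write_text(value + "\n")             # same effect as: echo value > node
        applied[name] = node.read_text().strip()  # read back to confirm
    return applied

# Dry run against a scratch directory so the sketch is safe to execute anywhere.
# On the real system, run as root and pass Path("/sys/class/drm/card0/device").
with tempfile.TemporaryDirectory() as scratch:
    print(set_dpm_profile(Path(scratch), "battery", "low"))
```

Reading the values back after writing them is the point of the helper: it confirms the setting was accepted before a long test run starts.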

This is NOT a hardware failure: The freezes occur identically even if both the core (GPU) and memory (VRAM) clocks are under-clocked to very safe frequencies. The key must be something in the Linux firmware for this card.
Comment 17 Mircea Kitsune 2018-04-08 22:47:08 UTC
I have moved on to testing the various kernel parameters available for my driver and card. As was pointed out by malcolmlewis on the openSUSE forums, they can be listed with the following commands:

modinfo amdgpu
systool -vm amdgpu

I tested nearly half of them today, and almost none made any difference. There were however a few settings that appeared to influence the frequency of the freeze. The most notable one of all seems to be the following:

amdgpu.moverate=4

With no parameters changed, the freeze now occurs roughly once per 30 minutes in Xonotic. With the move rate limited to 4 MB/s however, I seemingly reduced it to only once per 90 minutes! The FPS will constantly drop and recover, but that makes sense as this setting explicitly limits the buffer migration rate.
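When testing parameters like this it helps to confirm that the value actually took effect after boot. Loaded modules expose their parameters under /sys/module/<module>/parameters/, so a small reader can check them. The function name below is hypothetical, and whether a given parameter is readable depends on its permissions (some need root), which the sketch tolerates.

```python
from pathlib import Path

def read_module_parameters(module: str,
                           sys_module_root: Path = Path("/sys/module")) -> dict:
    """Return {parameter name: current value} for a loaded kernel module."""
    params_dir = sys_module_root / module / "parameters"
    values = {}
    if not params_dir.is_dir():
        return values  # module not loaded, or it exposes no parameters
    for node in sorted(params_dir.iterdir()):
        try:
            values[node.name] = node.read_text().strip()
        except OSError:
            values[node.name] = None  # not readable with current privileges
    return values

params = read_module_parameters("amdgpu")
print(params.get("moverate", "amdgpu not loaded or moverate not exposed"))
```

Recording this output next to each test run keeps the parameter set and the observed time-to-freeze together.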

I may test other variables in the days to come, but for now I'm hoping this offers at least some clue to get things started. My feeling is that the video card may be slowly loaded with information until something fills up, or perhaps some events throw too much data in at once and it reaches a bottleneck?
Comment 18 Max Staudt 2018-04-09 15:34:13 UTC
Thanks Mircea for the extensive testing.

(I've been away for a week, thus the delay)

Since your problems happen with both the radeon driver and the amdgpu driver, it sounds like an issue that is either common to both drivers (they're both written by AMD), or a hardware fault. Since limiting the bandwidth between the host and the card helps you, it's either a bug in memory management, or on the mainboard, or on the card.

My bet is on a hardware problem. You also suspect the GPU's firmware - this may well be a problem, too. Have you tried a different Radeon card? How about a 3rd party graphics card (e.g. Nvidia)? How about the same card, but a different computer?


For reference, I'll link your new upstream bug report here:

  https://bugs.freedesktop.org/show_bug.cgi?id=105425

And the old one:

  https://bugs.freedesktop.org/show_bug.cgi?id=101672

Thanks for reporting these!


Please let me also repeat what Christian König said upstream in the old bug - it is very, very unlikely that this is a symptom of malware being tested on your system. Why would someone want to keep you from playing Xonotic by hanging your GPU? That makes no sense to me.

And lo and behold, in the new bug, a hardware fault is suggested:

  https://bugs.freedesktop.org/show_bug.cgi?id=105425#c13

FWIW, I've had an ATI card fail me due to some video RAM pin coming loose somewhere. It happens.


In any case, I'm afraid upstream is more knowledgeable here. I'd appreciate it if you kept us posted if there is an action to take, such as backporting a fix to a driver, and we'll gladly do that for you!

So let's close this as RESOLVED UPSTREAM, but I'll stay in CC and leave the Component set to Xorg, so we receive your updates.

*Please* try different hardware first, and also take seriously the upstream commenter's suggestion to try Windows and see whether the card locks up there. I'm really sorry that I can't be of more help :(
Comment 19 Mircea Kitsune 2018-04-09 19:47:50 UTC
(In reply to Max Staudt from comment #18)

The possibility of a hardware failure was ruled out by several tests:

- I ran the card at reduced GPU / VRAM clock frequencies (DPM disabled). Even when underclocked, the behavior of the issue is in no way affected. When a hardware issue is at play, people always report the clock rates proportionally influencing the crashes.

- I constantly monitor the temperature of the system. The highest I ever caught the GPU at is 67°C, but typically it will only reach 64°C. The freezes are also not influenced by the changing temperature in my room, the effect is the same even if it's 23°C or 27°C in here.

- The freezes are only caused by 3D rendering, unrelated to the complexity of the data (even simple scenes may cause them). If the VRAM or GPU were broken, the same freezes would be caused by many other programs which put equal or more strain on resources (GTK / Qt interfaces, desktop compositing, etc).

That being said, I've just gathered new evidence that the driver settings are involved in this issue: Today I tested the last amdgpu parameters on the list, and seem to have found a set that greatly mitigates the problem. Those parameters have given me up to 144 minutes before experiencing the freeze, a huge record compared to the previous 90 minutes! They are:

amdgpu.prim_buf_per_se=16
amdgpu.pos_buf_per_se=16
amdgpu.cntl_sb_buf_per_se=16
amdgpu.param_buf_per_se=16

By default, all 4 of those settings are set to 0 by the system. Setting them to 16 has, at least during one test case, reduced the problem to 1/5 of its previous frequency. The descriptions of the variables are:

parm: prim_buf_per_se:the size of Primitive Buffer per Shader Engine (default depending on gfx) (int)
parm: pos_buf_per_se:the size of Position Buffer per Shader Engine (default depending on gfx) (int)
parm: cntl_sb_buf_per_se:the size of Control Sideband per Shader Engine (default depending on gfx) (int)
parm: param_buf_per_se:the size of Off-Chip Pramater Cache per Shader Engine (default depending on gfx) (int)
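To keep track of which parameters were tried, the `parm:` lines can be parsed and turned into a kernel command line fragment. This is a sketch only: the regular expression assumes the `parm: name:description (type)` layout shown above, the helper names are hypothetical, and the value 16 is simply the one used in this test.

```python
import re

# Matches lines such as: "parm: name:some description (int)"
PARM_LINE = re.compile(r"^parm:\s*(?P<name>\w+):(?P<desc>.*)\((?P<type>\w+)\)\s*$")

def parse_parm_lines(text: str) -> dict:
    """Map each parameter name to its description and type."""
    params = {}
    for line in text.splitlines():
        match = PARM_LINE.match(line.strip())
        if match:
            params[match["name"]] = {
                "description": match["desc"].strip(),
                "type": match["type"],
            }
    return params

def kernel_cmdline(module: str, values: dict) -> str:
    """Build 'module.param=value' pairs suitable for the boot command line."""
    return " ".join(f"{module}.{name}={value}" for name, value in values.items())

# The four lines quoted above, verbatim (including the upstream spelling).
MODINFO_OUTPUT = """\
parm: prim_buf_per_se:the size of Primitive Buffer per Shader Engine (default depending on gfx) (int)
parm: pos_buf_per_se:the size of Position Buffer per Shader Engine (default depending on gfx) (int)
parm: cntl_sb_buf_per_se:the size of Control Sideband per Shader Engine (default depending on gfx) (int)
parm: param_buf_per_se:the size of Off-Chip Pramater Cache per Shader Engine (default depending on gfx) (int)
"""

parsed = parse_parm_lines(MODINFO_OUTPUT)
print(kernel_cmdline("amdgpu", {name: 16 for name in parsed}))
```

The printed string can be appended to the boot loader's kernel parameters in the same way as the `radeon.dpm=0 amdgpu.dpm=0` test earlier in this report.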

I've obviously let the guys at freedesktop.org know about this as well, but unfortunately that tracker seems to be very inactive. I'd appreciate it if someone could at least check what those parameters do and let me know why they had such a fundamental effect on the issue earlier.

If it's later agreed that those or other driver parameters were at fault, I may reopen this bug or file a new one to suggest changing their defaults, if that is considered okay. If not, I will continue in the upstream issue like you suggested. I'd like to ask the openSUSE devs to please follow it, in case you believe this can be customized or prioritized by the OS:

https://bugs.freedesktop.org/show_bug.cgi?id=105425
Comment 20 Max Staudt 2018-04-10 10:36:42 UTC
Thanks!

I've subscribed to the upstream bug report. As far as I can see, people there have already replied to your update (including Alex Deucher, an AMD open source driver developer) and they know more than we possibly can, so please listen to them :)
Comment 21 Michal Srb 2018-04-17 10:57:45 UTC
*** Bug 1028575 has been marked as a duplicate of this bug. ***
Comment 22 Max Staudt 2018-05-18 16:07:08 UTC
Dear Mircea,

Just a quick update to let you know that I'm leaving the graphics team. If upstream finds a solution to your problem, and if that can be integrated into openSUSE, my teammates will be able to help you out.

Thanks!