Bug 1075901 - [Mesa 17.3.2] Some OpenGL applications cause system to freeze
Status: RESOLVED WONTFIX
Duplicates: 1075902 1075903 1075904 1075905
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: X.Org
Version: Current
Hardware: x86-64 openSUSE Factory
Priority: P5 - None
Severity: Major
Assigned To: Michal Srb
Reported: 2018-01-14 13:43 UTC by Filip Vaverka
Modified: 2018-06-21 21:36 UTC

Attachments
glxinfo with Mesa 17.2.6 (101.32 KB, text/plain)
2018-01-15 18:38 UTC, Filip Vaverka
Kernel log after freeze (Mesa 17.2.6) (3.02 KB, text/plain)
2018-01-15 18:39 UTC, Filip Vaverka
R600_DEBUG=check_vm output of volumetric visualization in VisPy (48.44 KB, text/plain)
2018-01-16 19:17 UTC, Filip Vaverka

Description Filip Vaverka 2018-01-14 13:43:29 UTC
After switching to the latest Mesa 17.3.2 packages from the openSUSE X11:XOrg repository (https://build.opensuse.org/project/show/X11:XOrg), some OpenGL applications cause the system (or at least everything graphics-related) to freeze. The same seems to hold for every version from 17.3.0 onward. This happens, for example, with VisPy-based applications, but not with OpenGL-accelerated KDE Plasma.

Hardware:
- AMD RX Vega 64 (VEGA10 GPU)
- AMD Ryzen R7 1800X

Software:
- openSUSE Tumbleweed
- Kernel 4.15.rc7-3.1 from Kernel:/HEAD/standard
- Mesa 17.3.2 (I had to switch from the default packages because of the Vega GPU)

Symptoms:
Launching some OpenGL-based applications causes the system to freeze, and only a hard reset seems to help.
After the reboot I can see the following errors (using "sudo journalctl -b -1"):

-- Logs begin at Mon 2017-07-24 18:28:06 CEST, end at Sun 2018-01-14 14:38:07 CET. --
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0:   at page 0x0000003fe07ee000 from 27
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vm_id:5 pas_id:0)
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0:   at page 0x0000003fe0e2e000 from 27
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vm_id:5 pas_id:0)
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0:   at page 0x0000003fe1037000 from 27
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vm_id:5 pas_id:0)
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0:   at page 0x0000003fe0a01000 from 27
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vm_id:5 pas_id:0)
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0:   at page 0x0000003fe0c2b000 from 27
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vm_id:5 pas_id:0)
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0:   at page 0x0000003fe123a000 from 27
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vm_id:5 pas_id:0)
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0:   at page 0x0000003fdef18000 from 27
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vm_id:5 pas_id:0)
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0:   at page 0x000000402ccec000 from 27
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vm_id:5 pas_id:0)
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0:   at page 0x000000402cce9000 from 27
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vm_id:5 pas_id:0)
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0050113D
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0:   at page 0x000000402cced000 from 27
Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vm_id:5 pas_id:0)
Jan 14 13:07:10 Filip kernel: gmc_v9_0_process_interrupt: 35 callbacks suppressed
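
With this many near-identical lines it can help to condense the log before attaching it. The following is only a minimal sketch: the regular expressions are derived from the lines shown above, the function name `summarize_faults` is made up for illustration, and the message format may differ on other kernel versions.

```python
import re

# Patterns derived from the amdgpu log lines quoted in this report
# (an assumption: other kernel versions may format these differently).
PAGE_RE = re.compile(r"at page (0x[0-9a-fA-F]+) from (\d+)")
STATUS_RE = re.compile(r"VM_L2_PROTECTION_FAULT_STATUS:(0x[0-9a-fA-F]+)")
FAULT_RE = re.compile(
    r"VMC page fault \(src_id:(\d+) ring:(\d+) vm_id:(\d+) pas_id:(\d+)\)"
)


def summarize_faults(lines):
    """Collect faulting page addresses, non-zero status values and VM IDs."""
    pages, statuses, vm_ids = [], set(), set()
    for line in lines:
        if "amdgpu" not in line:
            continue
        page = PAGE_RE.search(line)
        status = STATUS_RE.search(line)
        fault = FAULT_RE.search(line)
        if page:
            pages.append(int(page.group(1), 16))
        if status:
            statuses.add(int(status.group(1), 16))
        if fault:
            vm_ids.add(int(fault.group(3)))
    return {
        "pages": pages,
        "nonzero_statuses": sorted(s for s in statuses if s),
        "vm_ids": sorted(vm_ids),
    }


# A few lines copied from the log above.
SAMPLE = [
    "Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000",
    "Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0:   at page 0x0000003fe07ee000 from 27",
    "Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vm_id:5 pas_id:0)",
    "Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0050113D",
    "Jan 14 13:07:10 Filip kernel: amdgpu 0000:0d:00.0:   at page 0x000000402cced000 from 27",
]

summary = summarize_faults(SAMPLE)
print([hex(p) for p in summary["pages"]])
print([hex(s) for s in summary["nonzero_statuses"]])
print(summary["vm_ids"])
```

To run it on a real log, the lines of "sudo journalctl -b -1" saved to a file can be passed in place of SAMPLE.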

Possible causes:
It seems like this problem might be related to an LLVM 5 bug discovered in https://bugs.freedesktop.org/show_bug.cgi?id=104192 and fixed in https://reviews.llvm.org/rL320466, assuming an affected version of LLVM is being used.
Comment 1 Stefan Dirsch 2018-01-14 17:52:19 UTC
*** Bug 1075902 has been marked as a duplicate of this bug. ***
Comment 2 Stefan Dirsch 2018-01-14 17:53:18 UTC
*** Bug 1075903 has been marked as a duplicate of this bug. ***
Comment 3 Stefan Dirsch 2018-01-14 17:53:54 UTC
*** Bug 1075904 has been marked as a duplicate of this bug. ***
Comment 4 Stefan Dirsch 2018-01-14 17:54:43 UTC
*** Bug 1075905 has been marked as a duplicate of this bug. ***
Comment 5 Michal Srb 2018-01-15 10:20:41 UTC
It does not seem to be exactly the same issue as in the referenced freedesktop bug. That bug occurred with llvm 6.0.0, but we have 5.0.1 in Tumbleweed (because 6.0.0 has not had its final release yet).

None of the reverted svn revisions r320049, r320014 and r319894 is present in 5.0.1 yet.
Comment 6 Filip Vaverka 2018-01-15 10:45:29 UTC
I see, I misread the freedesktop bug report and thought it said 5.0.0. Anyway, should I try to report the issue there? I don't think this is acceptable behavior even if VisPy itself does something bad with OpenGL (I'm using it for volumetric visualization).
Comment 7 Michal Srb 2018-01-15 10:52:25 UTC
I am currently searching the Mesa and LLVM history to see whether any later commits fix this issue. I will probably ask you to try a test package later.

In the initial description you said that the issue has been happening since 17.3.0; did I understand that correctly? Did you find any earlier version where it worked?

Also, please attach the output of glxinfo (with any Mesa version; it does not matter).
Comment 8 Filip Vaverka 2018-01-15 11:17:29 UTC
Yes, it did happen with 17.3.0, 17.3.1 and 17.3.2 and both with LLVM 5.0.0 and now 5.0.1 (as X11:XOrg packages were updated). I could try default Tumbleweed packages (these are 17.2.6 I think) and see if it works, but I'm not sure if Mesa 17.2.6 already supports Vega GPUs.
Comment 9 Stefan Dirsch 2018-01-15 13:05:13 UTC
(In reply to Filip Vaverka from comment #8)
> Yes, it did happen with 17.3.0, 17.3.1 and 17.3.2 and both with LLVM 5.0.0
> and now 5.0.1 (as X11:XOrg packages were updated). I could try default
> Tumbleweed packages (these are 17.2.6 I think) and see if it works, but I'm
> not sure if Mesa 17.2.6 already supports Vega GPUs.

Yeah, could be that before Mesa 17.3.0 swrast was still being used ...
Comment 10 Filip Vaverka 2018-01-15 18:38:35 UTC
Created attachment 756117 [details]
glxinfo with Mesa 17.2.6
Comment 11 Filip Vaverka 2018-01-15 18:39:32 UTC
Created attachment 756118 [details]
Kernel log after freeze (Mesa 17.2.6)
Comment 12 Filip Vaverka 2018-01-15 18:54:03 UTC
I've tried going back to the stable Mesa 17.2.6 and I got the same results as before (system freeze). One thing I noticed is that LLVM 5.0.1 is still being used. I've attached the glxinfo output and the kernel error log.
Comment 13 Michal Srb 2018-01-16 09:32:36 UTC
Thank you for testing with different Mesa versions. To sum it up, it does not seem that the issue is caused by any change in Mesa.

As far as I can see, the amdgpu target in llvm4 did not support gfx9 cards, so your card was not supported. With llvm5 it is supported, but evidently still buggy.

Before running into this issue, did you try some older Tumbleweed or Leap? Can you confirm that with llvm4 you did not have accelerated rendering?

There are not many obvious fixes in git; I will try to backport some for you to test. If it turns out that we need to backport a larger number of patches, we will have to consider the card still unsupported with llvm5 and wait for llvm6. The 6.0.0 version has already been branched and is going through pre-release testing. The final release is expected in early March. I would put it into Tumbleweed shortly after.
Comment 14 Filip Vaverka 2018-01-16 10:05:25 UTC
I have been using Tumbleweed for some time now. I used it with an AMD R9 Nano (Fiji-based) until kernel 4.15 reached the RC stage, then I switched to Vega (as the DC code finally got merged).

You are right about gfx9 support - gfx9 is supported with LLVM 5+ (when I made the switch LLVM 4.X was still used in Tumbleweed and I got no HW acceleration).

I might try to debug (i.e. comment out parts) shaders used by Vispy (probably these: https://github.com/vispy/vispy/blob/master/vispy/visuals/volume.py) to try to narrow it down a bit, but I don't have much experience with debugging Mesa (if it can somehow dump binaries of compiled shaders and so on - I played around a bit with GCN assembly before).
Comment 15 Michal Srb 2018-01-16 13:07:34 UTC
(In reply to Filip Vaverka from comment #14)
> I might try to debug (i.e. comment out parts) shaders used by Vispy
> (probably these:
> https://github.com/vispy/vispy/blob/master/vispy/visuals/volume.py) to try
> to narrow it down a bit, but I don't have much experience with debugging
> Mesa (if it can somehow dump binaries of compiled shaders and so on - I
> played around a bit with GCN assembly before).

You can run the Vispy program with the R600_DEBUG environment variable set to something like "precompile,vs,tcs,tes,gs,ps,cs,noir,notgsi" and it should print the disassembled shaders to stderr. The variable name is a bit misleading; it applies to more than just R600.

For more details have a look at this file:
https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/radeonsi/si_pipe.c

If you can find the shader that causes it and get its assembly, please attach it here.
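
As a rough sketch of the above, the application can be launched from a small Python wrapper that sets R600_DEBUG and saves stderr to a file for attaching. The flag list is the one quoted in this comment; the helper name `run_with_r600_debug`, the default log file name and the script name in the usage comment are placeholders, not part of Mesa or VisPy.

```python
import os
import subprocess
import sys

# Flag list quoted from the comment above.
SHADER_DUMP_FLAGS = "precompile,vs,tcs,tes,gs,ps,cs,noir,notgsi"


def run_with_r600_debug(cmd, flags=SHADER_DUMP_FLAGS, log_path="shaders.log"):
    """Run cmd with R600_DEBUG=flags and write its stderr to log_path.

    Returns the exit code of the child process.
    """
    # Copy the current environment and add the debug variable for the child only.
    env = dict(os.environ, R600_DEBUG=flags)
    with open(log_path, "wb") as log:
        completed = subprocess.run(cmd, env=env, stderr=log, check=False)
    return completed.returncode


# Example call (volume_demo.py is a hypothetical VisPy script):
# run_with_r600_debug([sys.executable, "volume_demo.py"])
```

The shader disassembly should then end up in shaders.log, assuming the driver writes it to stderr as described.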

Meanwhile I am building a test version of the llvm5 package with some amdgpu fixes on top. It will take a while; I'll let you know when it is available.
Comment 16 Michal Srb 2018-01-16 13:45:24 UTC
PS: Also running with R600_DEBUG=check_vm may give us some information. When the page fault happens, it should write a report into a file in ~/ddebug_dumps/.
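
To find the report after a reboot, something like the following sketch could list the most recently written files in that directory. It assumes only the ~/ddebug_dumps/ location mentioned above; the helper name `newest_dumps` is made up, and nothing is assumed about how the dump files are named.

```python
from pathlib import Path


def newest_dumps(dump_dir=None, limit=3):
    """Return up to `limit` files below dump_dir, newest first."""
    # Default to the directory named in this comment.
    root = Path(dump_dir) if dump_dir is not None else Path.home() / "ddebug_dumps"
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


for path in newest_dumps():
    print(path)
```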
Comment 17 Filip Vaverka 2018-01-16 19:17:59 UTC
Created attachment 756291 [details]
R600_DEBUG=check_vm output of volumetric visualization in VisPy

I ran the script with R600_DEBUG=check_vm and it caught a VM page fault, but I can't make much of it.
Comment 18 Michal Srb 2018-01-25 08:22:26 UTC
Thank you for the VM fault report. Unfortunately I don't see anything wrong in the shader disassembly. There were various fixes in the AMDGPU target in LLVM, but I don't see any of them making a difference in this case.

The report also mentions that the Sampler slot was corrupted in GPU memory. If that is true, it would explain why this shader is failing (perhaps it is trying to sample from a texture at some random address). But it is still unclear how it got corrupted. Maybe some shader that ran previously did it, or Mesa set it up incorrectly, or there is some memory management bug in the kernel driver...

Meanwhile LLVM6 RC1 came out. I have an experimental build of llvm-6.0.0 and Mesa-17.3.3 in this build service project:
https://build.opensuse.org/project/show/home:michalsrb:branches:Mesa-llvm6:X11:XOrg

Tumbleweed repository:
https://download.opensuse.org/repositories/home:/michalsrb:/branches:/Mesa-llvm6:/X11:/XOrg/openSUSE_Tumbleweed/

Can you try to update to this Mesa & llvm combination and test if it still happens? If yes, we could take the bug upstream.
Comment 19 Filip Vaverka 2018-01-25 10:12:46 UTC
I'll try it.
By the way, is there any way to avoid the system freeze when a VM page fault happens? It's rather hard to test when I have to hard reset every time. I still have another GPU in the system, so perhaps that could help?
Comment 20 Michal Srb 2018-01-25 10:40:59 UTC
There is a parameter amdgpu.vm_fault_stop:
"Stop on VM fault (0 = never (default), 1 = print first, 2 = always)"

But it should be 0 by default, so it should not stop as it is. I think the freeze is an unintentional side effect of the fault; I have no idea how to prevent it.

Hopefully no more than one test (and so at most one hard reset) is needed. Please set the R600_DEBUG=check_vm variable again when running the application.
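
To check which value is actually in effect, the parameter can usually be read from sysfs. This is a sketch under the assumption that the amdgpu module is loaded and exports vm_fault_stop under /sys/module/amdgpu/parameters/; the helper names are made up, and the value descriptions are taken from the parameter help text quoted above.

```python
from pathlib import Path

# Meanings as given in the parameter description quoted above.
VM_FAULT_STOP_MEANING = {0: "never (default)", 1: "print first", 2: "always"}


def read_module_param(module, name, sysfs_root="/sys/module"):
    """Return the raw value of a kernel module parameter, or None if unreadable."""
    path = Path(sysfs_root) / module / "parameters" / name
    try:
        return path.read_text().strip()
    except OSError:
        return None


def describe_vm_fault_stop(sysfs_root="/sys/module"):
    """Return a one-line description of the current amdgpu.vm_fault_stop value."""
    raw = read_module_param("amdgpu", "vm_fault_stop", sysfs_root)
    if raw is None:
        return "amdgpu.vm_fault_stop is not readable"
    try:
        value = int(raw)
    except ValueError:
        return f"amdgpu.vm_fault_stop has unexpected value {raw!r}"
    meaning = VM_FAULT_STOP_MEANING.get(value, "unknown value")
    return f"amdgpu.vm_fault_stop = {value}: {meaning}"


print(describe_vm_fault_stop())
```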
Comment 21 Filip Vaverka 2018-01-25 17:48:17 UTC
I have some very good news: it seems to be working without any issues!
It runs fine and no VM page fault is caught when running with R600_DEBUG=check_vm.

I was asking about the freezing in case I had to dissect the shader to find what causes the issue, but that doesn't matter anymore.
Comment 22 Michal Srb 2018-01-29 10:38:00 UTC
Thank you for the test!

Since LLVM6 will be released relatively soon (March) and we do not have the resources to determine which exact fixes need to be backported, we have decided to disable GFX9/Vega support in Mesa with LLVM5.

This will prevent freezes for other users and allow them to use at least software rendering. Once LLVM6 is out, the acceleration will work again. You can use the LLVM6 & Mesa build from the "home:michalsrb:branches:Mesa-llvm6:X11:XOrg" repository until then if you want.
Comment 23 Michal Srb 2018-01-29 14:02:50 UTC
Patch that disables GFX9/Vega support with LLVM5 submitted to Factory:
https://build.opensuse.org/request/show/570623

Closing the bug as WONTFIX. It will be fixed with LLVM6.