Bug 1088902

Summary: amdgpu: Screen flickering every few seconds since update 2018-04-09
Product: [openSUSE] openSUSE Tumbleweed
Reporter: Andreas Schneider <asn>
Component: Kernel
Assignee: E-mail List <kernel-maintainers>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: asn, glin, tiwai
Version: Current
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Andreas Schneider 2018-04-10 15:11:14 UTC
Hi,

since the update to Tumbleweed yesterday (2018-04-09) my screen flickers every now and then. When this happens I get the following kernel call trace:

[33171.510555] amdgpu 0000:01:00.0: swiotlb buffer is full (sz: 2097152 bytes)
[33171.510557] amdgpu 0000:01:00.0: swiotlb: coherent allocation failed, size=2097152
[33171.510559] CPU: 8 PID: 2224 Comm: X Tainted: G           O     4.16.0-1-default #1 openSUSE Tumbleweed (unreleased)
[33171.510560] Hardware name: System manufacturer System Product Name/P9X79 PRO, BIOS 4801 07/25/2014
[33171.510560] Call Trace:
[33171.510567]  dump_stack+0x85/0xc5
[33171.510572]  swiotlb_alloc_coherent+0x1b6/0x1d0
[33171.510579]  ttm_dma_pool_get_pages+0x1ed/0x5b0 [ttm]
[33171.510583]  ttm_dma_populate+0x25e/0x350 [ttm]
[33171.510586]  ttm_tt_bind+0x2c/0x60 [ttm]
[33171.510589]  ttm_bo_handle_move_mem+0x577/0x5b0 [ttm]
[33171.510593]  ttm_bo_validate+0x100/0x110 [ttm]
[33171.510609]  ? drm_vma_offset_add+0x41/0x60 [drm]
[33171.510617]  ? do_detailed_mode+0x51e/0x5a0 [drm]
[33171.510620]  ttm_bo_init_reserved+0x382/0x430 [ttm]
[33171.510659]  amdgpu_bo_do_create+0x1e9/0x480 [amdgpu]
[33171.510679]  ? amdgpu_fill_buffer+0x2d0/0x2d0 [amdgpu]
[33171.510697]  amdgpu_bo_create+0x3d/0x220 [amdgpu]
[33171.510717]  amdgpu_gem_object_create+0x6a/0xf0 [amdgpu]
[33171.510737]  ? amdgpu_gem_object_close+0x1b0/0x1b0 [amdgpu]
[33171.510755]  amdgpu_gem_create_ioctl+0x1c3/0x240 [amdgpu]
[33171.510774]  ? amdgpu_gem_object_close+0x1b0/0x1b0 [amdgpu]
[33171.510781]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[33171.510787]  drm_ioctl+0x2ad/0x350 [drm]
[33171.510805]  ? amdgpu_gem_object_close+0x1b0/0x1b0 [amdgpu]
[33171.510808]  ? timerqueue_add+0x52/0x80
[33171.510825]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[33171.510828]  do_vfs_ioctl+0x90/0x5f0
[33171.510831]  ? __sys_recvmsg+0x5d/0x70
[33171.510834]  ? __fget+0x6e/0xb0
[33171.510835]  SyS_ioctl+0x74/0x80
[33171.510838]  do_syscall_64+0x76/0x140
[33171.510840]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[33171.510842] RIP: 0033:0x7fe9516f7967
[33171.510843] RSP: 002b:00007ffe3d16ae48 EFLAGS: 00003246 ORIG_RAX: 0000000000000010
[33171.510845] RAX: ffffffffffffffda RBX: 000055b6e55af6c0 RCX: 00007fe9516f7967
[33171.510845] RDX: 00007ffe3d16ae90 RSI: 00000000c0206440 RDI: 0000000000000018
[33171.510846] RBP: 00007ffe3d16ae90 R08: 000055b6e55af6c0 R09: 0000000000000004
[33171.510847] R10: 000055b6e4907010 R11: 0000000000003246 R12: 00000000c0206440
[33171.510847] R13: 0000000000000018 R14: 00007ffe3d16af18 R15: 000055b6e603ae70



Linux magrathea.fritz.box 4.16.0-1-default #1 SMP PREEMPT Wed Apr 4 13:35:56 UTC 2018 (e16f96d) x86_64 x86_64 x86_64 GNU/Linux

kernel-default-4.16.0-1.5.x86_64
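
To check how often this failure occurs, a saved kernel log can be scanned for the "coherent allocation failed" message shown above. A minimal Python sketch, assuming the log text was captured with dmesg; the function name is hypothetical and the pattern is modelled only on the two lines in this report:

```python
import re

# Modelled on the line from this report:
# [33171.510557] amdgpu 0000:01:00.0: swiotlb: coherent allocation failed, size=2097152
SWIOTLB_FAIL = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>\S+)\s+(?P<dev>\S+):\s+"
    r"swiotlb: coherent allocation failed, size=(?P<size>\d+)"
)

def find_swiotlb_failures(log_text):
    """Return (timestamp, driver, device, size_in_bytes) for each failure line."""
    failures = []
    for line in log_text.splitlines():
        m = SWIOTLB_FAIL.match(line.strip())
        if m:
            failures.append(
                (float(m["ts"]), m["driver"], m["dev"], int(m["size"]))
            )
    return failures

# The two swiotlb lines quoted in the description; only the second one matches.
sample = """\
[33171.510555] amdgpu 0000:01:00.0: swiotlb buffer is full (sz: 2097152 bytes)
[33171.510557] amdgpu 0000:01:00.0: swiotlb: coherent allocation failed, size=2097152
"""

for ts, driver, dev, size in find_swiotlb_failures(sample):
    print(f"{ts:.6f} {driver} {dev}: {size // 1024} KiB coherent allocation failed")
```

Each failure here is a 2 MiB (2097152-byte) request, which is the size reported in both messages.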
Comment 1 Andreas Schneider 2018-04-10 15:11:55 UTC
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/580] (rev cf) (prog-if 00 [VGA controller])
        Subsystem: PC Partner Limited / Sapphire Technology Radeon RX 470/480
        Flags: bus master, fast devsel, latency 0, IRQ 38
        Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Memory at d0000000 (64-bit, prefetchable) [size=2M]
        I/O ports at e000 [size=256]
        Memory at fbe00000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [58] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [200] #15
        Capabilities: [270] #19
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [320] Latency Tolerance Reporting
        Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [370] L1 PM Substates
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu
Comment 2 Takashi Iwai 2018-04-10 15:19:08 UTC
Could you try the kernel in OBS home:tiwai:bnc1088658?
It's not published yet but the build of kernel-default already finished, so you can fetch binaries via "osc getbinaries home:tiwai:bnc1088658/kernel-default"

There is a known regression wrt swiotlb, and this looks like that.
Comment 3 Takashi Iwai 2018-04-10 15:20:26 UTC
... and it's published now:
  http://download.opensuse.org/repositories/home:/tiwai:/bnc1088658/standard/
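
The published URL follows from the OBS project name by turning each ":" into ":/". A small Python sketch of that mapping, inferred only from the project name and URL in this thread; the helper name and the "standard" default are assumptions:

```python
def obs_repo_url(project, repository="standard"):
    """Build a download.opensuse.org URL from an OBS project name.

    Assumption: every ':' in the project name becomes ':/' in the path,
    as seen for home:tiwai:bnc1088658 in this thread.
    """
    path = project.replace(":", ":/")
    return f"http://download.opensuse.org/repositories/{path}/{repository}/"

print(obs_repo_url("home:tiwai:bnc1088658"))
```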
Comment 4 Andreas Schneider 2018-04-10 15:48:05 UTC
I've installed the kernel you pointed me to and it looks like the issue is fixed.  I don't see the call trace anymore.

I had a few flickers when starting Firefox, but they stopped. I think that the kernel fixes the issue!

Thanks Takashi for the quick reply!
Comment 5 Andreas Schneider 2018-04-10 18:33:55 UTC
I still have flickering; the screen goes either white or black. But I dunno how I could debug that. It might be mesa or amdgpu ...
Comment 6 Takashi Iwai 2018-04-11 06:30:49 UTC
So we seem to have multiple issues.

Could you try amdgpu.dc=0 boot option?
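
After rebooting, it is worth confirming that the option really reached the kernel, for example by looking at /proc/cmdline. A minimal Python sketch of that check; the helper name is hypothetical, the sample command line is made up for illustration, and the plain whitespace split ignores quoted arguments:

```python
def get_boot_param(cmdline, name):
    """Return the value of `name` on a kernel command line, or None if absent.

    A bare flag without '=' yields an empty string. If the parameter is
    given more than once, this sketch returns the last occurrence.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value

# Made-up sample; on a real system pass the contents of /proc/cmdline instead.
sample_cmdline = "root=/dev/mapper/system-root splash=silent amdgpu.dc=0 quiet"
print("amdgpu.dc =", get_boot_param(sample_cmdline, "amdgpu.dc"))
```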
Comment 7 Andreas Schneider 2018-04-11 07:01:07 UTC
I've tried it and after unlocking my encrypted partition I ended up with a black screen, probably loading X. I needed a hardware reset to restart the machine.
Comment 8 Takashi Iwai 2018-04-12 09:40:37 UTC
Too bad.

I refreshed the repo home:tiwai:bnc1088658 with a couple more patches to improve the memory allocations.
I'm not sure whether the symptom is related to them, but in any case, please give it a try later (it's being built now, so in an hour or so).
Comment 9 Andreas Schneider 2018-04-12 14:56:47 UTC
I've tested with kernel-default-4.16.1-2.1.g98a8438.x86_64

After I typed in my password in the desktop manager and KDE started loading (probably the compositor for OpenGL), the machine froze completely.
Comment 10 Takashi Iwai 2018-04-12 15:00:49 UTC
Hrm so it's worse than before?  That wasn't expected :-<
Comment 11 Andreas Schneider 2018-04-13 07:00:19 UTC
Oh wait.

Well, I didn't remove the 4.16.1-2.g98a8438-default kernel. So this morning I booted into this kernel and it worked. Maybe the video card needed a complete power down.

The flickering is mostly gone. However, I still have flickering when playing a YouTube video in Firefox. When I move the mouse or switch to a different window (YouTube video still in sight), I get a completely white screen for a fraction of a second.
Comment 12 Andreas Schneider 2018-04-13 07:03:13 UTC
OK, it sometimes happens when switching windows too, but it isn't as bad with the kernel you gave me.
Comment 13 Takashi Iwai 2018-04-13 08:03:06 UTC
OK, the fixes included in the previous 4.16.1-2.g98a8438-default are now in the stable branch.  Could you check whether the kernel in OBS Kernel:stable now works equivalently?

I also updated the home:tiwai:bnc1088658 repo again, now based on 4.16.2.
If you have time, please try it again to double-check whether the further DMA allocation change causes the problem.

And the remaining flickering has a different cause, I suppose.  Do you see any kernel message when the flickering happens?
Comment 14 Andreas Schneider 2018-04-13 09:48:13 UTC
I've installed 4.16.2-1.g4ef185f-default and it works fine.

I still have some flickering, and I dunno where it is coming from or what is causing it. There are no relevant messages in dmesg when it happens.
Comment 15 Takashi Iwai 2018-04-16 07:07:23 UTC
There was another amdgpu DC regression report, and it pointed to the buggy commit to revert.

A test kernel is being built in the OBS home:tiwai:bnc1089615 repo.
Please check it later.
Comment 16 Andreas Schneider 2018-04-16 10:26:10 UTC
Takashi, that fixed the issue! Well spotted ;-)
Comment 17 Takashi Iwai 2018-04-16 11:32:24 UTC
Good to hear!  I'm going to merge the fix patch to the stable branch.

One last favor: could you also try the kernel in the OBS home:tiwai:bnc1088902 repo?
I'd like to see whether the DMA32 fix patch really gives any bad result.  Thanks.
Comment 18 Andreas Schneider 2018-04-16 14:16:20 UTC
Seems to work fine too.
Comment 19 Takashi Iwai 2018-04-16 14:20:25 UTC
Thanks, now I can submit the DMA32 patch upstream, having confirmed that it has no big side effect.

I think all issues are addressed, and the fix patch has now been merged to the stable branch, so let's close.
Comment 20 Andreas Schneider 2018-04-16 14:22:03 UTC
Thank you very much for your help addressing those issues!
Comment 21 Takashi Iwai 2018-04-18 08:56:12 UTC
*** Bug 1089998 has been marked as a duplicate of this bug. ***