Bugzilla – Bug 1180742
[amdgpu] An AMD Vega series GPU randomly crashes
Last modified: 2022-02-28 15:22:57 UTC
Created attachment 844970 [details]
partial kernel log

The AMDGPU kernel driver randomly crashes the GPU, usually under load, on Radeon VII hardware. The hang is relatively hard to hit, as it usually takes 5 to 7 days before it occurs. After a hang the driver attempts to reset the GPU, but sometimes the reset fails and the system stays largely unresponsive: you can still access it over the network, and there is some reaction to keyboard events, but the display stays dead. It also seems to bring the PCIe bus down to 1.0 mode, and it stays that way until reboot.

There's an upstream bug open that may be related: https://gitlab.freedesktop.org/drm/amd/-/issues/716

This particular GPU works fine in a Windows machine.

openSUSE Leap 15.2, kernel 5.3.18-lp152.57-default #1 SMP Fri Dec 4 07:27:58 UTC 2020 (7be5551)
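The PCIe downgrade described above can be checked from sysfs without rebooting. The following is a minimal sketch, assuming the kernel exposes the standard `current_link_speed` and `max_link_speed` attributes for the device; the function names are illustrative and not part of any tool mentioned in this report.

```python
import re
from pathlib import Path


def parse_gt_per_s(text: str) -> float:
    """Extract the transfer rate in GT/s from a sysfs link-speed string.

    Depending on the kernel version the attribute reads like "8 GT/s" or
    "8.0 GT/s PCIe", so only the leading number is used.
    """
    match = re.match(r"\s*([0-9]+(?:\.[0-9]+)?)\s*GT/s", text)
    if match is None:
        raise ValueError(f"unrecognised link speed: {text!r}")
    return float(match.group(1))


def link_is_downgraded(device_dir: Path) -> bool:
    """Return True if the current link speed is below the device maximum."""
    current = parse_gt_per_s((device_dir / "current_link_speed").read_text())
    maximum = parse_gt_per_s((device_dir / "max_link_speed").read_text())
    return current < maximum
```

To use it, pass the device's sysfs directory, for example `link_is_downgraded(Path("/sys/bus/pci/devices/0000:03:00.0"))`, where the PCI address is a placeholder to be replaced with the one `lspci` shows for the GPU. PCIe 1.0 corresponds to 2.5 GT/s. A lower speed while the GPU is idle can be normal power-saving behaviour, so the check is mainly meaningful under load or after a failed reset.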
It's a GPU hang that leads to the real kernel crash, which has sometimes happened to others, too. Unfortunately there is no fix for this yet, and there likely won't be one for the Leap 15.2 kernel.

Could you try the kernel in OBS Kernel:stable or the SLE15-SP3 kernel in OBS Kernel:SLE15-SP3? The latter contains the backport of the DRM stack up to 5.9.x.
(In reply to Takashi Iwai from comment #1)
> It's some GPU hang that leads to the real kernel crash.... which happened on
> others sometimes, too. Unfortunately there is no fix for this and likely
> not for Leap 15.2 kernel.
>
> Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS
> Kernel:SLE15-SP3? The latter contains the backport of DRM stack up to 5.9.x.

Kernel 5.3.18-100.g3524980 from Kernel:SLE15-SP3 won't boot on this machine (it gets stuck right after the bootloader, with not even a single line after "loading initrd" on screen). Testing with Kernel:stable may take some time.
(In reply to Iakov Karpov from comment #2)
> (In reply to Takashi Iwai from comment #1)
> > It's some GPU hang that leads to the real kernel crash.... which happened on
> > others sometimes, too. Unfortunately there is no fix for this and likely
> > not for Leap 15.2 kernel.
> >
> > Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS
> > Kernel:SLE15-SP3? The latter contains the backport of DRM stack up to 5.9.x.
>
> kernel 5.3.18-100.g3524980 of Kernel:SLES15-SP3 won't boot on this machine
> (stuck right after bootloader, not even a single line after "loading initrd"
> on screen.

That's bad. Do you have Secure Boot enabled? If so, disable it when you test a kernel from an OBS repo other than the official release.
(In reply to Takashi Iwai from comment #1)
> It's some GPU hang that leads to the real kernel crash.... which happened on
> others sometimes, too. Unfortunately there is no fix for this and likely
> not for Leap 15.2 kernel.
>
> Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS
> Kernel:SLE15-SP3? The latter contains the backport of DRM stack up to 5.9.x.

I've been testing kernel 5.10.6-3.g183dcff-default from Kernel:stable for almost 14 days now, without a single crash.

(In reply to Takashi Iwai from comment #3)
> That's bad. Do you have the secure boot enabled? If so, disable it when
> you test a kernel from OBS repo that is other than the official release.

I'm on kernel 5.3.18-107.g0b709ea-default from Kernel:SLE15-SP3 now, and it works for me. I didn't change anything about Secure Boot, though; I don't think I had it enabled. I'll report back in another 2 weeks, unless it crashes sooner.
(In reply to Takashi Iwai from comment #1)
> It's some GPU hang that leads to the real kernel crash.... which happened on
> others sometimes, too. Unfortunately there is no fix for this and likely
> not for Leap 15.2 kernel.
>
> Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS
> Kernel:SLE15-SP3? The latter contains the backport of DRM stack up to 5.9.x.

It crashed on the 12th day with 5.3.18-107.g0b709ea-default (Kernel:SLE15-SP3).
Created attachment 845864 [details] Partial kernel log of 5.3.18-107.g0b709ea-default
So something unstable is still floating around. Tweaking the module options (like disabling power management) might work around it, but that's not a proper solution.

I believe the best way forward would be to report it to, and/or follow, the upstream bug tracker.
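Before tweaking any module options it helps to record what the driver is currently running with. The sketch below reads the values from `/sys/module/amdgpu/parameters`; `dpm` and `runpm` are amdgpu module parameters related to power management, but whether changing either of them avoids this hang is an untested assumption, and the helper name is illustrative.

```python
from pathlib import Path


def read_module_parameters(module, names, sysfs_root=Path("/sys/module")):
    """Return {name: value} for the given parameters of a loaded module.

    A value of None means the parameter file is missing or unreadable,
    e.g. the module is not loaded or the parameter is not exported.
    """
    params_dir = Path(sysfs_root) / module / "parameters"
    values = {}
    for name in names:
        try:
            values[name] = (params_dir / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    # Parameters chosen for illustration; adjust the list as needed.
    for key, value in read_module_parameters("amdgpu", ["dpm", "runpm"]).items():
        print(f"amdgpu.{key} = {value}")
```

A changed value would normally be passed on the kernel command line in the form `amdgpu.<name>=<value>`, and the output of this script before and after the change is useful to attach to the upstream report.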
It's rather similar to the upstream issue: https://gitlab.freedesktop.org/drm/amd/-/issues/934
Still not resolved in upstream according to the reports. Might be worked around by disabling the dynamic power management of the GPU or by the GPU frequency throttling manipulation.

Iakov, by any chance, would the latest kernel from Leap 15.4 or the latest kernel from OBS Kernel:stable:Backport work better for you? Leap 15.2 is not supported anymore, Leap 15.3 is probably not better if I read your feedback correctly. Leap 15.4 will be based on v5.14 kernel.
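The frequency manipulation mentioned above could be tried at runtime through the amdgpu sysfs attribute `power_dpm_force_performance_level`. This is only a sketch of that idea: the card path in the usage note is an assumption, writing the attribute requires root, and nothing in this report confirms that pinning the level avoids the hang.

```python
from pathlib import Path

# Basic levels accepted by the attribute; "auto" is the driver default.
VALID_LEVELS = {"auto", "low", "high", "manual"}


def set_performance_level(device_dir: Path, level: str) -> str:
    """Write a new performance level and return the previous one.

    device_dir is the GPU's sysfs device directory. The previous value is
    returned so the caller can restore it later.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    control = Path(device_dir) / "power_dpm_force_performance_level"
    previous = control.read_text().strip()
    control.write_text(level)
    return previous
```

For example, `set_performance_level(Path("/sys/class/drm/card0/device"), "high")` would pin the clocks to their highest state on a system where the GPU is `card0`. The setting does not survive a reboot, which makes it convenient for a multi-day test run that can be undone by writing `"auto"` back.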
(In reply to Miroslav Beneš from comment #9)
> Still not resolved in upstream according to the reports. Might be worked
> around by disabling the dynamic power management of the GPU or by the GPU
> frequency throttling manipulation.
>
> Iakov, by any chance, would the latest kernel from Leap 15.4 or the latest
> kernel from OBS Kernel:stable:Backport work better for you? Leap 15.2 is not
> supported anymore, Leap 15.3 is probably not better if I read your feedback
> correctly. Leap 15.4 will be based on v5.14 kernel.

I'm currently using Leap 15.3 with kernel 5.15.13 from Kernel:stable:Backport. It's better, but it still crashes sometimes. With 5.16.x kernels my system crashes every few minutes, but I'm not sure the GPU is the cause there. I was not able to recover any crash logs, so there is no bug report on that.
Thanks for the feedback. I'll leave the bug open and will occasionally monitor it. CCing Patrik and Thomas so that they are aware, but I am not sure if we can do anything here besides waiting for upstream.
One thing that might be worth trying is to update kernel-firmware-amdgpu from the OBS Kernel:stable:Backport repo (if not done yet).