Bug 1161720 - i915 hang continues in 5.4.12-1-default
Status: RESOLVED FIXED
Duplicates: 1161785
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: x86-64 Other
Importance: P5 - None Normal
Assigned To: E-mail List
Depends on:
Blocks: 1161207 1164498
Reported: 2020-01-23 17:42 UTC by Clarence Dillon
Modified: 2022-07-21 17:55 UTC (History)

Attachments
contents of /sys/class/drm/card0/error (4.76 KB, text/plain)
2020-01-23 17:42 UTC, Clarence Dillon

Description Clarence Dillon 2020-01-23 17:42:49 UTC
Created attachment 828183
contents of /sys/class/drm/card0/error

I dropped down to 5.4.12-default today when the latest Tumbleweed release came out, and I have already experienced a graphical environment freeze with this error.

Perhaps I was mistaken in thinking that the Tumbleweed release managers applied the drm/i915/gt patch to the 5.4.12 default kernel? I understand that the i915 drm managers do not plan to apply this patch to 5.4 branch kernels (https://www.spinics.net/lists/stable/msg351278.html).

I hadn't experienced any freezes under the 5.4.13 pre-release from Kernel:next (but I also don't have any support for Bumblebee/NVIDIA in that kernel).

Also attaching the contents of dmesg | grep i915:

```
[    3.845376] i915 0000:00:02.0: vgaarb: deactivate vga console
[    3.847415] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[    3.848292] [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
[    4.529697] [drm] Initialized i915 1.6.0 20190822 for 0000:00:02.0 on minor 0
[    4.681033] fbcon: i915drmfb (fb0) is primary device
[    4.725228] i915 0000:00:02.0: fb0: i915drmfb frame buffer device
[    9.671989] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[    9.721605] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[ 3021.917516] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
[ 3021.917518] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 3021.918523] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 3021.919248] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 3021.919351] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 3021.921095] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 3021.921814] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}

```
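For anyone triaging similar logs, the hang lines in the excerpt above can be picked out mechanically. The following is a minimal sketch in Python and is not part of the original report; the helper name `find_gpu_hangs` and the regular expression are illustrative assumptions based only on the message format shown in this excerpt, and other kernel versions may word the message differently.

```python
import re

# Pattern derived from the "GPU HANG" line in the dmesg excerpt above.
# Assumption: other kernel versions may format this message differently.
HANG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+i915 (?P<pci>\S+): GPU HANG: "
    r"ecode (?P<ecode>\S+), hang on (?P<engine>\w+)"
)


def find_gpu_hangs(dmesg_text):
    """Return one dict per i915 GPU HANG line found in dmesg output."""
    hangs = []
    for line in dmesg_text.splitlines():
        match = HANG_RE.search(line)
        if match:
            hangs.append(
                {
                    "time": float(match.group("time")),
                    "pci": match.group("pci"),
                    "ecode": match.group("ecode"),
                    "engine": match.group("engine"),
                }
            )
    return hangs


if __name__ == "__main__":
    sample = (
        "[ 3021.917516] i915 0000:00:02.0: GPU HANG: "
        "ecode 9:1:0x00000000, hang on rcs0"
    )
    for hang in find_gpu_hangs(sample):
        print(hang)
```

Feeding it the saved output of `dmesg | grep i915` would list each hang with its timestamp and engine, which makes it easier to compare how often hangs occur between kernels.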
Comment 1 Takashi Iwai 2020-01-23 17:52:52 UTC
(In reply to Clarence Dillon from comment #0)
> Perhaps I misunderstood that Tumbleweed release managers applied the
> drm/i915/gt patch to 5.4.12 default kernel? I understand that this patch is
> not planned to be applied to 5.4 branch kernels by i915 drm managers.
> (https://www.spinics.net/lists/stable/msg351278.html)

It's not included in the TW kernel, either. Do you mean this has to be included to address your problem?

> I hadn't experienced any freezes under 5.4.13 pre-release from Kernel:next
> (but I also don't any support for Bumblebee/NVIDIA in that kernel).

Do you mean the OBS Kernel:stable repo?  The only i915 difference between 5.4.12 and 5.4.13 is one patch that adds the inclusion of linux/math64.h, so it's likely irrelevant.
Comment 2 Clarence Dillon 2020-01-23 18:49:08 UTC
Sorry, you're correct. I got the 5.4.13 kernel from Kernel:stable/standard.

I saw that _some_ drm/i915/gt fix was applied to 5.4.12 because of this comment in Factory/kernel-source (line 56), and I expected it to be included in the next TW release. Don't changes to Factory get released as the next TW release once they pass the automated testing?

https://build.opensuse.org/package/rdiff/openSUSE:Factory/kernel-source?linkrev=base&rev=521

Anyway, when I first reported the i915 hang to the Intel project team on freedesktop.org (https://gitlab.freedesktop.org/drm/intel/issues/993) they told me that it should be fixed by that patch.
Comment 3 Takashi Iwai 2020-01-23 20:14:54 UTC
The TW release testing is mostly performed on openQA, so i915 issues aren't covered.

So, you need a backport fix of the suggested patch?  Then I can try to build a test kernel (if possible).
Comment 4 Takashi Iwai 2020-01-23 20:34:54 UTC
A test kernel package with the backported patch is being built in OBS home:tiwai:bsc1161720 repo now.
It'll take some time (an hour or so) until the build finishes.
Please give it a try after the build finishes.
Comment 5 Clarence Dillon 2020-01-23 21:33:41 UTC
Thank you! 

I'll watch it and switch over when it's ready. Then I can follow up tomorrow when I've had some time to check. The hang is intermittent, so...
Comment 6 Martin Wilck 2020-01-24 15:23:17 UTC
*** Bug 1161785 has been marked as a duplicate of this bug. ***
Comment 7 Clarence Dillon 2020-01-24 19:44:15 UTC
So far today, I've had no desktop freezes, which makes me pretty happy. There are some related problems with the Intel driver that cause other symptoms.

- Booting takes an unusually long time, then opens onto a blank screen, which turns out to be in some power-saving mode. Moving the mouse or pressing a key wakes the screen up to a login prompt, but no dots are visible in the fields. Still, I can log in and wait again for the desktop to appear.

- I'm running on a laptop + docking station + external monitor. After about 60 seconds of inactivity on either screen, that screen enters a power-save mode (fades to black) and wakes up when I move the mouse to that desktop. This is annoying since I often read from one screen and work on the other, so I have to wake up the laptop screen every other minute.

- Chromium (and all Chrome browser based apps) fail to start. The error is `libva error: /usr/local/lib/dri/iHD_drv_video.so init failed`. 

A web search shows some others getting the same error recently, but I have not looked into it enough to know whether the cause is related.

This libva issue is not present on the current TW kernel 5.4.13, at least for me.
Comment 8 Takashi Iwai 2020-01-24 20:45:07 UTC
OK, I pushed the fix to the stable/for-next branch now. Hopefully it'll be merged soon and included in a later TW kernel.
I also backported the fix to the SLE15-SP2 (i.e. Leap 15.2) branch.

Are the remaining issues regressions from earlier kernels?
Comment 9 Clarence Dillon 2020-01-24 21:04:53 UTC
No, those are all new in this kernel. I suspect the underlying cause is that there is no bbswitch-kmp-default for 5.4.14 yet. 

The specific error I gave you was the wrong line from my cut-paste history of issues and searches. (That error is what Ubuntu gives, in most of the reports.) Of course, we have it at `/usr/lib64/dri/iHD_drv_video.so`.
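
The two driver locations quoted in this thread can be checked directly. The following is a minimal sketch in Python, added for illustration only; `find_va_driver` is a hypothetical helper, and the candidate list contains just the two paths mentioned in these comments. Finding the file only shows that the driver is installed, not that libva can initialize it.

```python
import os

# The two locations quoted in this thread; other systems may use others.
CANDIDATE_PATHS = [
    "/usr/lib64/dri/iHD_drv_video.so",
    "/usr/local/lib/dri/iHD_drv_video.so",
]


def find_va_driver(candidates=CANDIDATE_PATHS):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    found = find_va_driver()
    print(found if found else "iHD_drv_video.so not found in the candidate paths")
```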

Should I open a new bug for that? Or just wait for the next TW release and see if it's still present with the rest of the libraries in alignment?
Comment 10 Clarence Dillon 2020-01-25 17:29:44 UTC
I just added the bbswitch-kmp-default 5.4.14 from X11:/Bumblebee/Kernel_stable_standard and Chromium & Chromium-based apps are now working again, so that seems to have been the cause.

As far as I can tell, everything is working properly.
Comment 11 Takashi Iwai 2020-01-26 08:41:33 UTC
Thanks, then let's close now.
Comment 12 Takashi Iwai 2020-01-27 11:22:56 UTC
It turned out that my backport fix had an off-by-one error and caused another regression.

Meanwhile the stable branch was already moved to the 5.5 kernel base, and the upstream fix is included there.  So this will be fixed anyway in the next release with the 5.5 kernel.
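
To check whether a given system is already on a kernel at or above 5.5, the release string can be compared numerically. This is a minimal sketch in Python, added for illustration; the helper names are hypothetical, the threshold of 5.5 is taken from the statement above that the upstream fix is included in the 5.5 base, and `5.5.0-1-default` is a made-up example release string.

```python
import re


def kernel_version_tuple(release):
    """Parse the leading 'major.minor[.patch]' of a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def has_upstream_fix(release, fixed_in=(5, 5, 0)):
    """True if the release is at or above the version assumed to carry the fix."""
    return kernel_version_tuple(release) >= fixed_in


if __name__ == "__main__":
    # The first string is from this bug's title; the second is hypothetical.
    for release in ("5.4.12-1-default", "5.5.0-1-default"):
        print(release, has_upstream_fix(release))
```

On a running system the release string would typically come from `platform.release()` or `uname -r`. A distribution kernel below 5.5 may still carry a backport, so a negative result only describes the version number.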
Comment 13 Clarence Dillon 2020-01-27 17:46:37 UTC
Thanks. I'll stay in this config until TW is released with 5.5. I have discovered a few glitchy things (like the laptop screen repeatedly falling asleep), but I can live with things like this for a while.
Comment 14 Martin Wilck 2020-01-30 08:38:45 UTC
I saw the issue with the patch applied (5.4.14-2.1.g3041591), and so did Martin Sirringhaus (bug 1161207). Uptime was somewhat higher than without the patch (3 days vs. 1 day), but this isn't statistically significant.

Upstream reaction was pretty blunt:

> Linux 5.5 is released now, try updating to that.

@Takashi, would updating SLE15-SP2 to the 5.5 code base be an option?
Comment 15 Takashi Iwai 2020-01-30 08:45:38 UTC
(In reply to Martin Wilck from comment #14)
> I saw the issue with the patch applied (5.4.14-2.1.g3041591), and so did
> Martin Sirringhaus (bug 1161207). Uptime was somewhat higher than without
> the patch (3 days vs. 1 day), but this isn't statistically significant.
> 
> Upstream reaction was pretty blunt:
> 
> > Linux 5.5 is released now, try updating to that.
> 
> @Takashi, would updating SLE15-SP2 to the 5.5 code base be an option?

Unlikely.  We should ask upstream to fix 5.4.y properly.  5.4.y is an LTS stable kernel, so they are responsible for fixing it further.
Comment 16 Martin Wilck 2020-01-30 11:21:12 UTC
It just happened again with 5.4.14-2.1.g3041591. I guess I can't keep the assertion that that patch actually fixed anything. I have the vague impression though that the problems I see are related to chrome (RocketChat), which suggests that I may be looking at a Mesa-related issue.

I don't feel qualified to dig much deeper, I'm more a dumb user than anything else in this area.

(In reply to Takashi Iwai from comment #15)
> > @Takashi, would updating SLE15-SP2 to the 5.5 code base be an option?
> 
> Unlikely.  We should ask upstream for fixing 5.4.y properly.  5.4.y is LTS
> stable kernel, so they are responsible for fixing it further.

Ack. I already said so in the Gitlab issue. Maybe you can reach out to Chris/Intel, too?
Comment 17 Takashi Iwai 2020-01-30 14:05:48 UTC
Martin, could you confirm that the issue is still present on SLE15-SP2 / Leap 15.2 kernels?  Since TW shall be fixed after moving to 5.5, we'd need to track the bug specifically for SLE15-SP2.

I wonder what the best way to trigger the bug is.  Judging from the information in the GitLab issue, I can provide a hackish patch: a partial revert of f8c08d8faee5567803c8c533865296ca30286bbf.
Comment 18 Clarence Dillon 2020-02-01 02:10:30 UTC
Takashi,
Since it looks like the patch for this is still in the pipeline for a fix in 5.5, I thought I'd try to get ahead of an issue that will arise there. 

I mentioned before that the only problem I was experiencing with 5.5 was that I needed to reinstall nvidia/bumblebee. It turns out that the _actual problem_ is that nvidia drivers will not build on 5.5. There is a patch, but nvidia seems to be planning to wait until <after v440.44> to implement it. Our bumblebee driver is currently at 418.113. 

https://devtalk.nvidia.com/default/topic/1068332/linux/nvidia-driver-does-not-build-on-linux-v5-5-release-candidate-kernel/

So, not much to look forward to in 5.5, I'm afraid.