Bug 1174868

Summary: Upgrade to Leap 15.2 makes Thinkpad T480s sometimes freeze completely, run 20°C warmer, and lose its journal
Product: [openSUSE] openSUSE Distribution
Reporter: Markus Zimmermann <markus.zimmermann>
Component: Kernel
Assignee: openSUSE Kernel Bugs <kernel-bugs>
Status: RESOLVED NORESPONSE
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: markus.zimmermann, mbenes, tiwai
Version: Leap 15.2
Flags: tiwai: needinfo? (markus.zimmermann)
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments:
- Dmesg for current Kernel of 15.1 running on a 15.2 system
- Dmesg for current Kernel of 15.2 running on a 15.2 system
- Hwinfo for current Kernel of 15.1 running on a 15.2 system
- Hwinfo for current Kernel of 15.2 running on a 15.2 system

Description Markus Zimmermann 2020-08-04 11:39:50 UTC
One of our machines here is a Thinkpad T480s on openSUSE Leap 15.2 with kernel 5.3.18-lp152.33-default. This machine was upgraded from Leap 15.1 to 15.2, and then the problems started to show. The CPU is constantly 20°C hotter while working.

This machine now freezes sometimes completely.

When that happens:
- the machine is not pingable
- no keyboard interaction works
- lid close/open does nothing
- no mouse interaction works
- the notebook is really warm
- When a freeze happens, the journal is also trimmed, i.e. we lose the whole journal for the day.
- We SSH'ed into the machine and watched it via "htop", "journalctl -f" and "sensors". When the machine freezes, htop shows no significant load, the journal contains no interesting messages, and sensors only shows that the machine is hotter.
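
Since the journal does not survive the freeze, one option is to append the readings to a separate file and force every line to disk. The following is only a minimal Python sketch under assumptions: it assumes the kernel exposes temperatures under /sys/class/thermal (one "temp" file per zone, in millidegrees Celsius), and the function names and the log path are hypothetical.

```python
import os
import time
from pathlib import Path


def read_temps(root="/sys/class/thermal"):
    """Return {"zone_name:zone_type": degrees_celsius} for each readable zone."""
    temps = {}
    for zone in sorted(Path(root).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone unreadable or reported garbage; skip it
        temps[f"{zone.name}:{kind}"] = millideg / 1000.0
    return temps


def append_durably(path, line):
    """Append one line and ask the OS to write it to disk before returning."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def log_sample(path, root="/sys/class/thermal"):
    """Write one timestamped line with the 1-minute load and all zone temperatures."""
    temps = read_temps(root)
    try:
        load = os.getloadavg()[0]
    except OSError:
        load = float("nan")  # load average not available on this platform
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    fields = " ".join(f"{name}={value:.1f}" for name, value in temps.items())
    append_durably(path, f"{stamp} load1={load:.2f} {fields}")
    return temps
```

Calling log_sample("/var/tmp/thermal.log") every few seconds (the path is just an example) leaves a trail that ends shortly before the freeze. The fsync call narrows the window in which a line can be lost, but it cannot close it completely.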

Interestingly, we upgraded another machine before (also a Thinkpad T480s), and it does not have any problems.

We are, by the way, on the latest firmware, installed via fwupdmgr.

What should we try? How can we debug it?
Comment 1 Takashi Iwai 2020-08-05 15:42:39 UTC
Try installing the Leap 15.1 kernel on the Leap 15.2 system, and check whether you see the same problem or not.  This should indicate whether it's a kernel regression.

Note that we had some issue in the latest Leap 15.1 update.  Maybe better to try the one in OBS Kernel:openSUSE-15.1:Update repo for now:
  http://download.opensuse.org/repositories/Kernel:/openSUSE-15.1:/Update/standard/
Comment 2 Markus Zimmermann 2020-08-10 08:53:02 UTC
I now have access to the machine in question again. We installed the kernel from http://download.opensuse.org/repositories/Kernel:/openSUSE-15.1/standard/ on that machine, because http://download.opensuse.org/repositories/Kernel:/openSUSE-15.1:/Update/standard/ does not seem to exist.

We will test the kernel this week and report at the end of the week how it went.

Btw. we have another T480s with a fresh 15.2 (no upgrade) which has now had 3 freezes in 2 weeks.
Comment 3 Takashi Iwai 2020-08-10 15:40:41 UTC
The one in Kernel:openSUSE-15.1 should suffice, it's synced again with the latest git repo.

It's hard to diagnose without any logs.  Please give hwinfo output (running on both Leap 15.1 and Leap 15.2 kernels) and the kernel messages (dmesg outputs).
Hopefully we see some stack traces or such.

There seems to be a problem with the i915 driver in the Leap 15.2 kernel for certain models (or under certain scenarios) that comes from upstream, but it's likely unrelated to the weirdly high temperature in this report.
Comment 4 Markus Zimmermann 2020-08-12 09:54:59 UTC
We have tested the older kernel: it still freezes. I will add the "hwinfo" and "dmesg" outputs to this issue after this comment.

We now have (on Leap 15.2):
- 1 Thinkpad T480s (with old and new kernel) that freezes regularly. I also cleaned it, so there is no dust and the cooling system looks good to me. This is btw the only notebook that has a higher temperature.
- 1 Thinkpad T480s that freezes maybe every two days
- 1 Thinkpad T480s that has frozen once (since I reported this bug)
- 2 Thinkpad T480s that never froze

What bugs me the most is that the output of "journalctl --since today" is cut off. Do you know why this could happen?
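
One possible explanation, which nothing in this report confirms, is that journald keeps the journal only in memory (under /run/log/journal), so it disappears with every hard reset. Below is a minimal Python sketch of that check. It assumes the setting lives in /etc/systemd/journald.conf, it ignores drop-in files, and the function names are hypothetical.

```python
from pathlib import Path


def configured_storage(conf_text):
    """Return the last uncommented Storage= value in journald.conf text."""
    value = "auto"  # journald's documented default when nothing is set
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == "Storage" and val.strip():
            value = val.strip().lower()
    return value


def journal_survives_reboot(storage, persistent_dir_exists):
    """Apply journald's Storage= rules to decide whether logs outlive a reset."""
    if storage == "persistent":
        return True
    if storage == "auto":
        # "auto" writes to disk only if /var/log/journal already exists
        return persistent_dir_exists
    return False  # "volatile" and "none" keep nothing on disk


def check_host(conf="/etc/systemd/journald.conf", journal_dir="/var/log/journal"):
    """Run the check against the (assumed) default paths of the local system."""
    try:
        text = Path(conf).read_text()
    except OSError:
        text = ""  # no main config file: the default applies
    return journal_survives_reboot(configured_storage(text), Path(journal_dir).is_dir())
```

If check_host() returns False, the journal is expected to be empty after every freeze, independent of any kernel problem. If it returns True, the missing entries need a different explanation.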

Downgrading to Leap 15.1 is not really an option (EOL is coming and we have to migrate at some point), so I am, as always, willing to test anything, because the alternative is to switch to Tumbleweed or away from openSUSE completely.

Can you point me to the "i915" issue or to a repository that we should try?
Comment 5 Markus Zimmermann 2020-08-12 10:00:09 UTC
Created attachment 840542 [details]
Dmesg for current Kernel of 15.1 running on a 15.2 system
Comment 6 Markus Zimmermann 2020-08-12 10:00:28 UTC
Created attachment 840543 [details]
Dmesg for current Kernel of 15.2 running on a 15.2 system
Comment 7 Markus Zimmermann 2020-08-12 10:00:51 UTC
Created attachment 840544 [details]
Hwinfo for current Kernel of 15.1 running on a 15.2 system
Comment 8 Markus Zimmermann 2020-08-12 10:01:09 UTC
Created attachment 840545 [details]
Hwinfo for current Kernel of 15.2 running on a 15.2 system
Comment 9 Takashi Iwai 2020-08-17 13:04:56 UTC
The i915 bug was bsc#1174737, and the latest OBS Kernel:openSUSE-15.2 repo should contain the fix already.

BTW, I noticed that you've enabled Secure Boot.  What if you disable Secure Boot?  I don't think it's relevant, but at least this makes it easier to test a non-standard kernel like the one above.
Comment 10 Markus Zimmermann 2020-08-19 12:47:30 UTC
The freezes on one machine went away after disabling power management; at least there were no freezes for two days.

But the original machine still freezes with both the old 15.1 kernel and the newest 15.2 kernel.

We are now testing http://download.opensuse.org/repositories/home:/tiwai:/bsc1174737-leap2/standard/ if that makes any difference.
Comment 11 Markus Zimmermann 2020-08-20 12:01:58 UTC
OK, on the "second problem machine" the newest 15.2 kernel with power management off did not help for long. That machine froze 3 times today while using VS Code, VirtualBox, Kubernetes/Docker and Git.

The "main problem machine" however has now the second day without problems with the http://download.opensuse.org/repositories/home:/tiwai:/bsc1174737-leap2/standard/ kernel. So we will try that now on the "second problem machine".

Btw. we have two other T480s with no problems and one T490 running 15.2 without problems. So I am still very unsure about all of this.

And I overlooked something last time: we turned Secure Boot off for these machines.
Comment 12 Markus Zimmermann 2020-08-27 08:37:05 UTC
With the kernel from http://download.opensuse.org/repositories/home:/tiwai:/bsc1174737-leap2/standard/ we have 0 (ZERO!) problems. So this is definitely a kernel issue. How should we proceed with this issue?
Comment 13 Takashi Iwai 2020-08-27 09:42:00 UTC
Then the problem should have been already solved in the latest (or the upcoming) update.

To verify it, please test the kernel in OBS Kernel:openSUSE-15.2 repo:
  http://download.opensuse.org/repositories/Kernel:/openSUSE-15.2/standard/

This contains the build from the latest git branch.
Comment 14 Markus Zimmermann 2020-08-27 10:08:19 UTC
OK, we have now installed "5.3.18-lp152.105" on one machine. If it works on that one, we will try it on the others and I will let you know. In case this works: how long does an update stay in http://download.opensuse.org/repositories/Kernel:/openSUSE-15.2/standard/ before it becomes regularly available? Also, thanks.
Comment 15 Takashi Iwai 2020-08-27 10:11:35 UTC
The next update is planned for next week or so.  Let's keep our fingers crossed.
Comment 16 Markus Zimmermann 2020-08-28 07:32:46 UTC
I have two pieces of bad news:
- The latest kernel from http://download.opensuse.org/repositories/Kernel:/openSUSE-15.2/standard/ does not work. We had two freezes yesterday and two already today.
- I mentioned the wrong kernel. We have zero problems with http://download.opensuse.org/repositories/home:/tiwai:/kernel:/5.7/standard/, i.e. with the 5.7 kernel, which I took from https://bugzilla.suse.com/show_bug.cgi?id=1174737. Sorry for the confusion.
Comment 17 Takashi Iwai 2020-08-28 07:45:20 UTC
Could you get any trace of the crash somehow?  Otherwise it's quite difficult to diagnose which part went wrong.

You might be able to catch the crash via kdump, if we're lucky.  But it often doesn't work reliably in a case like a complete hardware freeze.
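
Before waiting for the next freeze, it can help to confirm that a crash kernel is actually reserved and loaded. Below is a minimal Python sketch, assuming a Linux system that exposes /proc/cmdline and /sys/kernel/kexec_crash_loaded; the function names are hypothetical.

```python
from pathlib import Path


def crashkernel_args(cmdline):
    """Return every crashkernel=... token found on the kernel command line."""
    return [tok for tok in cmdline.split() if tok.startswith("crashkernel=")]


def crash_kernel_loaded(flag_path="/sys/kernel/kexec_crash_loaded"):
    """Return True/False from the kernel's flag file, or None if it is unreadable."""
    try:
        return Path(flag_path).read_text().strip() == "1"
    except OSError:
        return None


def kdump_ready(cmdline_path="/proc/cmdline",
                flag_path="/sys/kernel/kexec_crash_loaded"):
    """True only if memory is reserved on the command line and a crash kernel is loaded."""
    try:
        cmdline = Path(cmdline_path).read_text()
    except OSError:
        cmdline = ""
    return bool(crashkernel_args(cmdline)) and crash_kernel_loaded(flag_path) is True
```

If kdump_ready() returns False, a crash at that point would not produce a dump. A True result only shows that the prerequisites are in place; it does not guarantee that a hard hardware freeze will be captured.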
Comment 19 Markus Zimmermann 2020-08-28 07:51:46 UTC
Unfortunately we never had any trace at all. Even viewing the logs live did not show anything. The journal is always just gone completely. Even yesterday's data is removed.

Since I never used kdump: can you recommend a tutorial/documentation? Is that a good starting point? https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
Comment 20 Markus Zimmermann 2020-08-28 07:53:33 UTC
If you have any commands that we should run so you have more info, please let me know. I still do not know what the difference between our Thinkpad T480s is and why the Thinkpad T490 ones are not affected.
Comment 21 Takashi Iwai 2020-08-28 08:04:17 UTC
(In reply to Markus Zimmermann from comment #19)
> Since I never used kdump: can you recommend a tutorial/documentation?

https://doc.opensuse.org/documentation/leap/tuning/html/book.sle.tuning/cha-tuning-kexec.html

Better to disable Secure Boot for enabling this feature, too.
Comment 22 Takashi Iwai 2020-08-28 08:05:49 UTC
(In reply to Markus Zimmermann from comment #20)
> If you have any commands that we should run so you have more info, please
> let me know. I still do not know what the difference between our Thinkpad
> T480s is and why the Thinkpad T490 ones are not affected.

FYI, we've had a report about T490 crash (also a hard one without trace), too, so it can't be excluded.  But T490 has a totally different chip set, AFAIK, hence it's no wonder that the problem may hit only on T480.
Comment 26 Markus Zimmermann 2020-12-03 09:33:08 UTC
We have been using http://download.opensuse.org/repositories/home:/tiwai:/kernel:/5.7/standard/ for a long time now, and I guess it is pretty outdated at this point, but it is the kernel that has no problems. I am giving this issue another go because I moved from Tumbleweed to 15.2 after Tumbleweed suddenly started to have IO timeouts inside VirtualBox VMs. This broke my workflow completely; I couldn't work anymore. So I am now on 15.2 with everyone else, and it still has these freezes with the latest kernel.
Comment 27 Takashi Iwai 2020-12-07 13:53:16 UTC
Well, does the later kernel (5.9 or 5.10-rc) still show the problem?  If yes, we should move on to the upstream bug tracker.
Comment 28 Markus Zimmermann 2020-12-07 14:15:38 UTC
The current 5.9 kernel in Tumbleweed (at least the one from 4 days ago) triggered IO problems in VirtualBox, which made the VMs unusable for me. So I cannot test with that one at all. I guess I will try again with 5.10 when it is released.
Comment 30 Miroslav Beneš 2022-01-21 12:38:53 UTC
Markus, I apologize for the late response. What is the situation now? Do you still use Leap 15.2? (It is EOL now, but I suspect Leap 15.3 would not make anything better in this regard.) TW also has a much newer kernel now.
Comment 31 Markus Zimmermann 2022-07-20 16:14:40 UTC
AFAIK this is not an issue with 15.3 anymore. However, I am wondering whether this is due to using https://github.com/erpalma/throttled, which has been a default in our development environment for a long time. Otherwise we cannot use the full potential of our CPUs, which is just sad.