Bugzilla – Bug 1174868
Upgrade to LEAP 15.2 makes Thinkpad 480s sometimes completely freezes, 20°C warmer and journal is trimmed
Last modified: 2022-07-20 16:14:40 UTC
One of our machines here, a Thinkpad T480s, runs openSUSE Leap 15.2 with kernel 5.3.18-lp152.33-default. The machine was upgraded from Leap 15.1 to 15.2, and then the problems started to show. The CPU is constantly about 20°C hotter while working, and the machine now sometimes freezes completely. During a freeze:
- the machine is not pingable
- no keyboard interaction works
- lid close/open does nothing
- no mouse interaction works
- the notebook is really warm
- the journal is also trimmed, i.e. we lose the whole journal for the day
We SSH'ed into the machine and watched it via "htop", "journal -f" and "sensors". When the machine freezes, htop shows no significant load, the journal contains no interesting logs, and sensors only shows that the machine is hotter. Interestingly, we upgraded another machine before (also a Thinkpad T480s) and it does not have any problems. We are, by the way, on the latest firmware via fwupdmgr. What should we try? How can we debug it?
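Because the live views over SSH disappear with the freeze, one option is to write temperature samples to disk and force each one out before the next is taken. The following is only a minimal sketch under assumptions: it reads the kernel's thermal zones under /sys/class/thermal (values are in millidegrees Celsius), and the function names, the log file name and the interval are hypothetical, not something used in this report.

```python
import glob
import os
import time


def read_zone_temps(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return {path: degrees Celsius} for every readable thermal zone."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as fh:
                # The kernel exposes the value as an integer in millidegrees.
                temps[path] = int(fh.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # zone not readable right now, skip it
    return temps


def format_sample(timestamp, temps):
    """Render one sample as 'epoch zone=temp zone=temp ...'."""
    fields = ["%s=%.1f" % (os.path.basename(os.path.dirname(path)), temp)
              for path, temp in sorted(temps.items())]
    return "%d %s" % (timestamp, " ".join(fields))


def append_durably(log_path, line):
    """Append one line and fsync it so it is on disk before we continue."""
    with open(log_path, "a") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def log_samples(log_path, interval, count):
    """Take `count` samples, `interval` seconds apart (hypothetical helper)."""
    for _ in range(count):
        append_durably(log_path, format_sample(time.time(), read_zone_temps()))
        time.sleep(interval)
```

Calling, for example, `log_samples("thermal.log", 5, 10000)` in the background would leave the last temperatures before a freeze in `thermal.log`. Whether the last write actually reaches the disk during a hard freeze is not guaranteed.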
Try to install Leap 15.1 kernel on the Leap 15.2 system, and check whether you see the same problem or not. This should indicate whether it's a kernel regression. Note that we had some issue in the latest Leap 15.1 update. Maybe better to try the one in OBS Kernel:openSUSE-15.1:Update repo for now: http://download.opensuse.org/repositories/Kernel:/openSUSE-15.1:/Update/standard/
I now have access to the machine in question again. We installed the kernel from http://download.opensuse.org/repositories/Kernel:/openSUSE-15.1/standard/ on that machine. Does http://download.opensuse.org/repositories/Kernel:/openSUSE-15.1:/Update/standard/ not exist? We will test the kernel this week and report at the end of the week how it went. Btw. we have another T480s with a fresh 15.2 (no upgrade) which has now had 3 freezes in 2 weeks.
The one in Kernel:openSUSE-15.1 should suffice; it's synced again with the latest git repo. It's hard to diagnose without any logs. Please give the hwinfo output (running on both Leap 15.1 and Leap 15.2 kernels) and the kernel messages (dmesg outputs). Hopefully we see some stack traces or such. There seems to be a problem with the i915 driver in the Leap 15.2 kernel for certain models (or under certain scenarios) that comes from upstream, but it's likely unrelated to the unusually high temperature in this report.
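The requested outputs are easier to compare when each file is named after the kernel it was taken on. A minimal sketch, assuming `dmesg` and `hwinfo` are installed and run with sufficient privileges (both may need root for full output); the helper names and the file naming scheme are hypothetical.

```python
import os
import platform
import shutil
import subprocess


def output_name(tool, release=None):
    """Build a file name such as 'dmesg-<kernel release>.txt'."""
    release = release or platform.release()
    return "%s-%s.txt" % (tool, release)


def capture(command, out_path):
    """Run `command`, write its output to `out_path`.

    Returns False without running anything if the tool is not installed.
    """
    if shutil.which(command[0]) is None:
        return False
    with open(out_path, "w") as fh:
        subprocess.run(command, stdout=fh, stderr=subprocess.STDOUT, check=False)
    return True


def collect(out_dir="."):
    """Save dmesg and hwinfo output for the currently running kernel."""
    results = {}
    for tool in ("dmesg", "hwinfo"):
        path = os.path.join(out_dir, output_name(tool))
        results[path] = capture([tool], path)
    return results
```

Running `collect()` once after booting each kernel gives one pair of files per kernel, ready to attach.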
We have tested the older kernel: it still freezes. I will add "hwinfo" and "dmesg" outputs to this issue after this comment. We now have (on SUSE 15.2):
- 1 Thinkpad T480s (with old and new kernel) that freezes regularly. I also cleaned it, so there is no dust and the cooling system looks good to me. This is, by the way, the only notebook that has a higher temperature.
- 1 Thinkpad T480s that freezes maybe every two days
- 1 Thinkpad T480s that has frozen once (since I reported this bug)
- 2 Thinkpad T480s that never froze
What bugs me the most is that "journal --since today" is cut off. Do you know why this could ever happen? Since downgrading to SUSE 15.1 is also not really an option (EOL is coming and we have to migrate at some point), I am as always willing to test anything, because the alternative is to maybe switch to Tumbleweed or away completely. Can you point me to the "i915" issue or a repository that we should give a try?
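One thing that might be worth ruling out for the missing journal (an assumption, not something established in this report) is whether journald keeps its data only in memory. With the default `Storage=auto`, systemd-journald writes to /var/log/journal only if that directory exists and otherwise uses /run/log/journal, which does not survive a reboot. The sketch below is a heuristic check with hypothetical function names; it reads only the main journald.conf text it is given and ignores drop-in files.

```python
import os


def configured_storage(conf_text):
    """Return the last uncommented Storage= value in journald.conf text, or None."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == "Storage":
            value = val.strip().lower()
    return value


def journal_storage(persistent_dir="/var/log/journal",
                    volatile_dir="/run/log/journal"):
    """Guess where the journal lives, based on which directory exists."""
    if os.path.isdir(persistent_dir):
        return "persistent"
    if os.path.isdir(volatile_dir):
        return "volatile"
    return "unknown"
```

If `journal_storage()` reports "volatile", earlier entries being gone after a forced reboot would be expected behaviour rather than trimming.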
Created attachment 840542: Dmesg for current Kernel of 15.1 running on a 15.2 system
Created attachment 840543: Dmesg for current Kernel of 15.2 running on a 15.2 system
Created attachment 840544: Hwinfo for current Kernel of 15.1 running on a 15.2 system
Created attachment 840545: Hwinfo for current Kernel of 15.2 running on a 15.2 system
The i915 bug was bsc#1174737, and the latest OBS Kernel:openSUSE-15.2 repo should contain the fix already. BTW, I noticed that you've enabled Secure Boot. What if you disable Secure Boot? I don't think it's relevant, but at least this makes it easier to test a non-standard kernel like the one above.
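Before testing a kernel from a non-standard repository, the current Secure Boot state can be confirmed from the running system. A minimal sketch, assuming the `mokutil` tool is installed; it relies on `mokutil --sb-state` printing "SecureBoot enabled" or "SecureBoot disabled", and the function names are hypothetical.

```python
import shutil
import subprocess


def parse_sb_state(text):
    """Map mokutil output to True/False, or None if it is not recognised."""
    lowered = text.lower()
    if "secureboot enabled" in lowered:
        return True
    if "secureboot disabled" in lowered:
        return False
    return None


def secure_boot_enabled():
    """Return True/False for the Secure Boot state, None if it cannot be read."""
    if shutil.which("mokutil") is None:
        return None
    result = subprocess.run(["mokutil", "--sb-state"],
                            capture_output=True, text=True, check=False)
    return parse_sb_state(result.stdout + result.stderr)
```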
The freezes on one machine went away after disabling power management; at least there were no freezes for two days. But the original machine still freezes with both the old SUSE 15.1 kernel and the newest 15.2 kernel. We are now testing http://download.opensuse.org/repositories/home:/tiwai:/bsc1174737-leap2/standard/ to see if that makes any difference.
OK, on the "second problem machine" the newest 15.2 kernel with power management off did not help for that long. That machine froze 3 times today during VS Code, VirtualBox, Kubernetes/Docker and Git usage. The "main problem machine", however, is now on its second day without problems with the http://download.opensuse.org/repositories/home:/tiwai:/bsc1174737-leap2/standard/ kernel. So we will try that now on the "second problem machine". Btw. we have two other T480s with no problems and one T490 running 15.2 without problems. So I am still very, very unsure about all of this. And I overlooked something last time: we turned "secure boot" off for these machines.
With the kernel from http://download.opensuse.org/repositories/home:/tiwai:/bsc1174737-leap2/standard/ we have 0 (ZERO!) problems. So this is definitely a kernel issue. How should we proceed with this issue?
Then the problem should have been already solved in the latest (or the upcoming) update. To verify it, please test the kernel in OBS Kernel:openSUSE-15.2 repo: http://download.opensuse.org/repositories/Kernel:/openSUSE-15.2/standard/ This contains the build from the latest git branch.
OK, we installed "5.3.18-lp152.105" on one machine now. If it works on that one, we will try it on the others and I will let you know. In case this works: how long does an update stay in http://download.opensuse.org/repositories/Kernel:/openSUSE-15.2/standard/ before it is regularly available? Also, thanks.
The next update is planned in the next week or so. Let's cross fingers.
I have two pieces of bad news:
- The latest kernel from http://download.opensuse.org/repositories/Kernel:/openSUSE-15.2/standard/ does not work. We had two freezes yesterday and two already today.
- I mentioned the wrong kernel. We have zero problems with http://download.opensuse.org/repositories/home:/tiwai:/kernel:/5.7/standard/ (that is, with the 5.7 kernel), which I took from https://bugzilla.suse.com/show_bug.cgi?id=1174737.
Sorry for the confusion.
Could you somehow get a trace of the crash? Otherwise it's quite difficult to diagnose which part went wrong. You might be able to catch the crash via kdump if we're lucky, but it often does not work reliably in cases like a complete hardware freeze.
Unfortunately we never had any trace at all. Even viewing the logs live did not show anything. The "journal" is always just gone completely. Even yesterday's data is removed. Since I never used kdump: can you recommend a tutorial/documentation? Is this a good starting point? https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
If you have any commands that we should run so you have more info, please let me know. I still do not know what the difference between our Thinkpad T480s is and why the Thinkpad T490 ones are not affected.
(In reply to Markus Zimmermann from comment #19)
> Since I never used kdump: can you recommend a tutorial/documentation?
https://doc.opensuse.org/documentation/leap/tuning/html/book.sle.tuning/cha-tuning-kexec.html
Better to disable Secure Boot for enabling this feature, too.
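Once kdump has been set up following that documentation, a quick sanity check is whether memory was actually reserved and a crash kernel is loaded. A minimal sketch with hypothetical function names: it looks for `crashkernel=` on the kernel command line in /proc/cmdline and reads /sys/kernel/kexec_crash_loaded, which contains 1 when a crash kernel is loaded. It does not verify that a dump would actually be written.

```python
def read_text(path):
    """Return the stripped content of `path`, or None if it cannot be read."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None


def crashkernel_args(cmdline):
    """Return all crashkernel=... tokens from a kernel command line."""
    return [tok for tok in cmdline.split() if tok.startswith("crashkernel=")]


def kdump_status(cmdline_path="/proc/cmdline",
                 loaded_path="/sys/kernel/kexec_crash_loaded"):
    """Summarise whether kdump looks ready on the running system."""
    cmdline = read_text(cmdline_path) or ""
    return {
        "crashkernel": crashkernel_args(cmdline),
        "crash_kernel_loaded": read_text(loaded_path) == "1",
    }
```

An empty `crashkernel` list or `crash_kernel_loaded` being False would suggest that a crash could not be captured in the current boot.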
(In reply to Markus Zimmermann from comment #20)
> If you have any commands that we should run so you have more info, please
> let me know. I still do not know what the difference between our Thinkpad
> T480s is and why the Thinkpad T490 ones are not affected.
FYI, we've had a report about T490 crash (also a hard one without trace), too, so it can't be excluded. But T490 has a totally different chip set, AFAIK, hence it's no wonder that the problem may hit only on T480.
We have been using http://download.opensuse.org/repositories/home:/tiwai:/kernel:/5.7/standard/ for a long time now and it is pretty outdated at this point, I guess, but it is the kernel that has no problems. I gave this issue another go because I moved from Tumbleweed to 15.2, since Tumbleweed suddenly started to have IO timeouts inside of VirtualBox VMs. This broke my workflow completely; I couldn't work anymore. So I am now on 15.2 with everyone else, which still has these freezes with the latest kernel.
Well, does the later kernel (5.9 or 5.10-rc) still show the problem? If yes, we should move on to the upstream bug tracker.
The current 5.9 kernel that is in Tumbleweed (at least the one 4 days ago) triggered IO problems in VirtualBox which made the VMs unusable for me. So i cannot test with that one at all. I guess i will try with 5.10 when it is released again.
Markus, I apologize for the late response. What is the situation now? Do you still use Leap 15.2 (EOL now, but Leap 15.3 would not make anything better regarding this, I suspect)? TW also has a much newer kernel now.
AFAIK this is not an issue with 15.3 anymore. However, i am wondering if this is due to using https://github.com/erpalma/throttled which is a default in our development environment since a long time. Otherwise we cannot use the full potential of our CPUs, which is just sad.