Bug 1177624

Summary: [HUAWEI] watchdog: BUG: soft lockup - CPU#16 stuck for 23s! [qemu-system-aar:4554]
Product: [openSUSE] PUBLIC SUSE Linux Enterprise Server 15 SP4 Reporter: Guillaume GARDET <guillaume.gardet>
Component: KernelAssignee: Kernel Bugs <kernel-bugs>
Status: REOPENED --- QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P2 - High CC: afaerber, aginies, chester.lin, claudio.fontana, dmueller, fvogt, jcheung, jlee, jroedel, lubos.kocman, lyan, mbenes, mbrugger, okurz, ro
Version: PublicBeta-202202   
Target Milestone: ---   
Hardware: aarch64   
OS: openSUSE Leap 15.3   
See Also: https://bugzilla.suse.com/show_bug.cgi?id=1002860
https://bugzilla.suse.com/show_bug.cgi?id=1123535
https://bugzilla.suse.com/show_bug.cgi?id=1007813
https://bugzilla.suse.com/show_bug.cgi?id=1055940
https://bugzilla.suse.com/show_bug.cgi?id=1164534
https://bugzilla.suse.com/show_bug.cgi?id=1125435
https://bugzilla.suse.com/show_bug.cgi?id=1008510
https://bugzilla.suse.com/show_bug.cgi?id=910159
https://bugzilla.suse.com/show_bug.cgi?id=1124667
https://bugzilla.suse.com/show_bug.cgi?id=1048175
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: dmesg.log

Description Guillaume GARDET 2020-10-13 11:47:19 UTC
Created attachment 842562 [details]
dmesg.log

The D05 machine used as a worker for openqa.opensuse.org have been upgraded from Leap 15.1 to Leap 15.2 and shows lots of performance problems.
Lots of mistyping in qemu, few screen corruptions.

Checking the kernel log (in attachment), we can see some traces related to the watchdog.
Comment 1 Guillaume GARDET 2020-10-13 13:30:15 UTC
Sometimes, I also have only:

[  220.878852] watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [qemu-system-aar:5195]
[  220.886671] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet tun nfnetlink_cttimeout iscsi_ibft iscsi_boot_sysfs rfkill openvswitch nsh nf_conncount xt_MASQUERADE ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables ext4 mbcache jbd2 nls_iso8859_1 nls_cp437 vfat fat ipmi_ssif marvell aes_ce_blk crypto_simd cryptd aes_ce_cipher hibmc_drm crct10dif_ce drm_vram_helper ghash_ce joydev aes_arm64 ttm sha2_ce ses sha256_arm64 enclosure drm_kms_helper sha1_ce drm sbsa_gwdt efi_pstore ipmi_si fb_sys_fops ipmi_devintf ipmi_msghandler syscopyarea sysfillrect sysimgblt hns_dsaf hns_enet_drv hnae hns_mdio btrfs libcrc32c xor
[  220.886740]  xor_neon zlib_deflate raid6_pq sd_mod hid_generic usbhid ehci_platform hisi_sas_v2_hw ehci_hcd hisi_sas_main libsas scsi_transport_sas usbcore libata overlay sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod efivarfs
[  220.886758] CPU: 24 PID: 5195 Comm: qemu-system-aar Not tainted 5.3.18-lp152.44-default #1 openSUSE Leap 15.2
[  220.886759] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.50 06/01/2018
[  220.886761] pstate: 80000005 (Nzcv daif -PAN -UAO)
[  220.886772] pc : invalidate_icache_range+0x28/0x50
[  220.886776] lr : user_mem_abort+0x954/0xad0
[  220.886777] sp : ffff000025bd38e0
[  220.886778] x29: ffff000025bd38e0 x28: 00000000000016bd 
[  220.886780] x27: 0000000040208000 x26: 0000000000000001 
[  220.886782] x25: ffff8015843c8008 x24: 0000000001e40000 
[  220.886784] x23: ffff8016d38fe270 x22: ffff000011699000 
[  220.886786] x21: ffff8016dbe746a0 x20: 0000000000000000 
[  220.886788] x19: ffff8016e8810000 x18: 000000009b190274 
[  220.886789] x17: 02ffff800001000e x16: 0000000000000000 
[  220.886791] x15: 0000000000000000 x14: 0000000000000000 
[  220.886793] x13: 0000000000000000 x12: 000000009b1906a4 
[  220.886795] x11: 0000000041d8c440 x10: 0000ffff00000000 
[  220.886797] x9 : 00000000410fd082 x8 : 0000000047e00000 
[  220.886799] x7 : 0000000fbc000000 x6 : 0000000000000018 
[  220.886801] x5 : 0000000000000000 x4 : ffff000011999218 
[  220.886803] x3 : ffff801e7e9d4f40 x2 : 0000000000000040 
[  220.886804] x1 : ffff801e80000000 x0 : ffff801e40000000 
[  220.886807] Call trace:
[  220.886810]  invalidate_icache_range+0x28/0x50
[  220.886813]  kvm_handle_guest_abort+0x258/0x468
[  220.886815]  handle_exit+0x14c/0x1c8
[  220.886818]  kvm_arch_vcpu_ioctl_run+0x298/0x8a8
[  220.886821]  kvm_vcpu_ioctl+0x490/0x8b0
[  220.886824]  do_vfs_ioctl+0xc4/0x778
[  220.886826]  ksys_ioctl+0x8c/0xa0
[  220.886828]  __arm64_sys_ioctl+0x28/0x38
[  220.886831]  el0_svc_common.constprop.0+0xa0/0x1f8
[  220.886833]  el0_svc_handler+0x34/0x90
[  220.886836]  el0_svc+0x10/0x14
[  406.194705] hrtimer: interrupt took 2740 ns
[  655.700660] BTRFS info (device sda2): qgroup scan completed (inconsistency flag cleared)
Comment 2 Guillaume GARDET 2020-12-07 12:33:56 UTC
It has been reproduced with kernel 5.8.15-1.gc680e93-default from Kernel:stable repo.
Comment 3 Miroslav BeneŇ° 2022-01-21 12:23:28 UTC
The bug fell through the cracks. I am sorry about that.

Guillaume, is the issue still reproducible with the latest Leap 15.3 kernel?
Comment 4 Guillaume GARDET 2022-01-21 14:21:12 UTC
I think this is ok with Leap 15.3.
Let's close this bug for 15.2 which is EOL.
Comment 5 Oliver Kurz 2022-03-08 10:31:33 UTC
As you reported yourself this is still appearing, see https://progress.opensuse.org/issues/107980
Comment 6 Guillaume GARDET 2022-03-08 10:36:00 UTC
Updated the bug to match Leap 15.3
Comment 7 Lubos Kocman 2022-03-08 12:30:41 UTC
*** Bug 1196864 has been marked as a duplicate of this bug. ***
Comment 8 Lubos Kocman 2022-03-08 12:36:00 UTC
Changing product to PUBLIC SLES based on the new bugzilla policy https://en.opensuse.org/openSUSE:Submitting_bug_reports
Comment 9 Lubos Kocman 2022-03-08 12:36:21 UTC
Happens on 15.4 as well
Comment 10 Lubos Kocman 2022-03-08 14:29:00 UTC
This started to hit quite a lot of desktop issues, where the UI is irresponsive for a while, and then continues, once the openQA timesout.
Comment 12 Ruediger Oertel 2022-03-08 15:29:58 UTC
well transparent huge pages is something our kernel did by default IIRC
but we run a special hugepage setup at bootup on these machines:

echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
echo 65536 > /proc/sys/vm/min_free_kbytes
echo $((3000*4/2)) > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
echo "got $(< /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages) huge pages"
Comment 13 Chester Lin 2022-03-08 18:06:23 UTC
(In reply to Oliver Kurz from comment #5)
> As you reported yourself this is still appearing, see
> https://progress.opensuse.org/issues/107980

Checked the soft lockup messages from this link and found that in some cases the CPUs were spinning on acquiring locks (Perhaps kvm->mmu_lock or something else?), For example:

[14000.639084] watchdog: BUG: soft lockup - CPU#10 stuck for 26s! [qemu-system-aar:29364]
...
[14000.647137] Call trace:
[14000.647141]  queued_spin_lock_slowpath+0x208/0x2d8
[14000.647143]  kvm_mmu_notifier_invalidate_range_start+0xdc/0xe0
...

[14000.699085] watchdog: BUG: soft lockup - CPU#28 stuck for 26s! [qemu-system-aar:29380]
...
[14000.707108] Call trace:
[14000.707111]  queued_spin_lock_slowpath+0x250/0x2d8
[14000.707113]  kvm_handle_guest_abort+0xaac/0xe88
...

[14008.649039] watchdog: BUG: soft lockup - CPU#13 stuck for 22s! [qemu-system-aar:29381]
...
[14008.657085] Call trace:
[14008.657089]  queued_spin_lock_slowpath+0x208/0x2d8
[14008.657091]  kvm_handle_guest_abort+0xaac/0xe88

...
[14012.619015] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [qemu-system-aar:32507]
[14012.626968] Call trace:
[14012.626971]  queued_spin_lock_slowpath+0x208/0x2d8
[14012.626973]  kvm_mmu_notifier_invalidate_range_start+0xdc/0xe0
Comment 14 Miroslav BeneŇ° 2022-03-11 12:09:06 UTC
CCing Bruce, Joerg and Matthias if it rings a bell
Comment 15 Joerg Roedel 2022-05-04 12:25:02 UTC
(In reply to Chester Lin from comment #13)
> (In reply to Oliver Kurz from comment #5)
> > As you reported yourself this is still appearing, see
> > https://progress.opensuse.org/issues/107980
> 
> Checked the soft lockup messages from this link and found that in some cases
> the CPUs were spinning on acquiring locks (Perhaps kvm->mmu_lock or
> something else?), For example:
> 
> [14000.639084] watchdog: BUG: soft lockup - CPU#10 stuck for 26s!
> [qemu-system-aar:29364]
> ...
> [14000.647137] Call trace:
> [14000.647141]  queued_spin_lock_slowpath+0x208/0x2d8
> [14000.647143]  kvm_mmu_notifier_invalidate_range_start+0xdc/0xe0
> ...
> 
> [14000.699085] watchdog: BUG: soft lockup - CPU#28 stuck for 26s!
> [qemu-system-aar:29380]
> ...
> [14000.707108] Call trace:
> [14000.707111]  queued_spin_lock_slowpath+0x250/0x2d8
> [14000.707113]  kvm_handle_guest_abort+0xaac/0xe88
> ...
> 
> [14008.649039] watchdog: BUG: soft lockup - CPU#13 stuck for 22s!
> [qemu-system-aar:29381]
> ...
> [14008.657085] Call trace:
> [14008.657089]  queued_spin_lock_slowpath+0x208/0x2d8
> [14008.657091]  kvm_handle_guest_abort+0xaac/0xe88
> 
> ...
> [14012.619015] watchdog: BUG: soft lockup - CPU#3 stuck for 26s!
> [qemu-system-aar:32507]
> [14012.626968] Call trace:
> [14012.626971]  queued_spin_lock_slowpath+0x208/0x2d8
> [14012.626973]  kvm_mmu_notifier_invalidate_range_start+0xdc/0xe0

Can you please run a kernel with lock debugging enabled on this machine and trigger the issue? It would be interesting to see whether lockdep warnings are seen there.