Bug 1177624 - [HUAWEI] watchdog: BUG: soft lockup - CPU#16 stuck for 23s! [qemu-system-aar:4554]
Status: REOPENED
Duplicates: 1196864
Classification: openSUSE
Product: PUBLIC SUSE Linux Enterprise Server 15 SP4
Component: Kernel
Version: PublicBeta-202202
Hardware/OS: aarch64 openSUSE Leap 15.3
Priority: P2 - High
Severity: Major
Assigned To: Kernel Bugs E-mail List
Reported: 2020-10-13 11:47 UTC by Guillaume GARDET
Modified: 2022-12-05 10:26 UTC
CC: 17 users

Attachments:
dmesg.log (122.59 KB, text/plain), 2020-10-13 11:47 UTC, Guillaume GARDET
Description Guillaume GARDET 2020-10-13 11:47:19 UTC
Created attachment 842562 [details]
dmesg.log

The D05 machine used as a worker for openqa.opensuse.org has been upgraded from Leap 15.1 to Leap 15.2 and now shows lots of performance problems: lots of mistyped characters in qemu and a few screen corruptions.

Checking the kernel log (in attachment), we can see some traces related to the watchdog.
Comment 1 Guillaume GARDET 2020-10-13 13:30:15 UTC
Sometimes I also see only the following:

[  220.878852] watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [qemu-system-aar:5195]
[  220.886671] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet tun nfnetlink_cttimeout iscsi_ibft iscsi_boot_sysfs rfkill openvswitch nsh nf_conncount xt_MASQUERADE ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables ext4 mbcache jbd2 nls_iso8859_1 nls_cp437 vfat fat ipmi_ssif marvell aes_ce_blk crypto_simd cryptd aes_ce_cipher hibmc_drm crct10dif_ce drm_vram_helper ghash_ce joydev aes_arm64 ttm sha2_ce ses sha256_arm64 enclosure drm_kms_helper sha1_ce drm sbsa_gwdt efi_pstore ipmi_si fb_sys_fops ipmi_devintf ipmi_msghandler syscopyarea sysfillrect sysimgblt hns_dsaf hns_enet_drv hnae hns_mdio btrfs libcrc32c xor
[  220.886740]  xor_neon zlib_deflate raid6_pq sd_mod hid_generic usbhid ehci_platform hisi_sas_v2_hw ehci_hcd hisi_sas_main libsas scsi_transport_sas usbcore libata overlay sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod efivarfs
[  220.886758] CPU: 24 PID: 5195 Comm: qemu-system-aar Not tainted 5.3.18-lp152.44-default #1 openSUSE Leap 15.2
[  220.886759] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.50 06/01/2018
[  220.886761] pstate: 80000005 (Nzcv daif -PAN -UAO)
[  220.886772] pc : invalidate_icache_range+0x28/0x50
[  220.886776] lr : user_mem_abort+0x954/0xad0
[  220.886777] sp : ffff000025bd38e0
[  220.886778] x29: ffff000025bd38e0 x28: 00000000000016bd 
[  220.886780] x27: 0000000040208000 x26: 0000000000000001 
[  220.886782] x25: ffff8015843c8008 x24: 0000000001e40000 
[  220.886784] x23: ffff8016d38fe270 x22: ffff000011699000 
[  220.886786] x21: ffff8016dbe746a0 x20: 0000000000000000 
[  220.886788] x19: ffff8016e8810000 x18: 000000009b190274 
[  220.886789] x17: 02ffff800001000e x16: 0000000000000000 
[  220.886791] x15: 0000000000000000 x14: 0000000000000000 
[  220.886793] x13: 0000000000000000 x12: 000000009b1906a4 
[  220.886795] x11: 0000000041d8c440 x10: 0000ffff00000000 
[  220.886797] x9 : 00000000410fd082 x8 : 0000000047e00000 
[  220.886799] x7 : 0000000fbc000000 x6 : 0000000000000018 
[  220.886801] x5 : 0000000000000000 x4 : ffff000011999218 
[  220.886803] x3 : ffff801e7e9d4f40 x2 : 0000000000000040 
[  220.886804] x1 : ffff801e80000000 x0 : ffff801e40000000 
[  220.886807] Call trace:
[  220.886810]  invalidate_icache_range+0x28/0x50
[  220.886813]  kvm_handle_guest_abort+0x258/0x468
[  220.886815]  handle_exit+0x14c/0x1c8
[  220.886818]  kvm_arch_vcpu_ioctl_run+0x298/0x8a8
[  220.886821]  kvm_vcpu_ioctl+0x490/0x8b0
[  220.886824]  do_vfs_ioctl+0xc4/0x778
[  220.886826]  ksys_ioctl+0x8c/0xa0
[  220.886828]  __arm64_sys_ioctl+0x28/0x38
[  220.886831]  el0_svc_common.constprop.0+0xa0/0x1f8
[  220.886833]  el0_svc_handler+0x34/0x90
[  220.886836]  el0_svc+0x10/0x14
[  406.194705] hrtimer: interrupt took 2740 ns
[  655.700660] BTRFS info (device sda2): qgroup scan completed (inconsistency flag cleared)
Comment 2 Guillaume GARDET 2020-12-07 12:33:56 UTC
It has been reproduced with kernel 5.8.15-1.gc680e93-default from the Kernel:stable repo.
Comment 3 Miroslav Beneš 2022-01-21 12:23:28 UTC
The bug fell through the cracks. I am sorry about that.

Guillaume, is the issue still reproducible with the latest Leap 15.3 kernel?
Comment 4 Guillaume GARDET 2022-01-21 14:21:12 UTC
I think this is ok with Leap 15.3.
Let's close this bug for 15.2 which is EOL.
Comment 5 Oliver Kurz 2022-03-08 10:31:33 UTC
As you reported yourself, this is still appearing; see https://progress.opensuse.org/issues/107980
Comment 6 Guillaume GARDET 2022-03-08 10:36:00 UTC
Updated the bug to match Leap 15.3
Comment 7 Lubos Kocman 2022-03-08 12:30:41 UTC
*** Bug 1196864 has been marked as a duplicate of this bug. ***
Comment 8 Lubos Kocman 2022-03-08 12:36:00 UTC
Changing product to PUBLIC SLES based on the new bugzilla policy https://en.opensuse.org/openSUSE:Submitting_bug_reports
Comment 9 Lubos Kocman 2022-03-08 12:36:21 UTC
Happens on 15.4 as well
Comment 10 Lubos Kocman 2022-03-08 14:29:00 UTC
This has started to cause quite a lot of desktop test issues, where the UI is unresponsive for a while and then continues once openQA times out.
Comment 12 Ruediger Oertel 2022-03-08 15:29:58 UTC
Well, transparent huge pages are something our kernel enabled by default, IIRC, but we also run a special hugepage setup at bootup on these machines:

echo 3 > /proc/sys/vm/drop_caches        # drop page cache, dentries and inodes
echo 1 > /proc/sys/vm/compact_memory     # compact memory to create contiguous regions
echo 65536 > /proc/sys/vm/min_free_kbytes   # keep 64 MiB free for atomic allocations
echo $((3000*4/2)) > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages   # request 6000 2 MiB huge pages
echo "got $(< /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages) huge pages"
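Note that writing to nr_hugepages is only a request: with fragmented memory the kernel may allocate fewer huge pages than asked for, which is presumably why the last line echoes the resulting count back. A minimal sketch of a variant that warns on partial allocation (same sysfs path as above, bash syntax; this is an illustrative addition, not the script actually deployed on these machines):

```shell
#!/bin/bash
# Request 6000 huge pages of 2 MiB (about 12 GiB) and verify the result.
hp=/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
requested=$((3000*4/2))

echo "$requested" > "$hp"
got=$(< "$hp")   # read back what the kernel actually allocated

if [ "$got" -lt "$requested" ]; then
    echo "WARNING: only $got of $requested huge pages allocated" >&2
fi
```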
Comment 13 Chester Lin 2022-03-08 18:06:23 UTC
(In reply to Oliver Kurz from comment #5)
> As you reported yourself this is still appearing, see
> https://progress.opensuse.org/issues/107980

I checked the soft lockup messages from this link and found that in some cases the CPUs were spinning on lock acquisition (perhaps kvm->mmu_lock or something else?). For example:

[14000.639084] watchdog: BUG: soft lockup - CPU#10 stuck for 26s! [qemu-system-aar:29364]
...
[14000.647137] Call trace:
[14000.647141]  queued_spin_lock_slowpath+0x208/0x2d8
[14000.647143]  kvm_mmu_notifier_invalidate_range_start+0xdc/0xe0
...

[14000.699085] watchdog: BUG: soft lockup - CPU#28 stuck for 26s! [qemu-system-aar:29380]
...
[14000.707108] Call trace:
[14000.707111]  queued_spin_lock_slowpath+0x250/0x2d8
[14000.707113]  kvm_handle_guest_abort+0xaac/0xe88
...

[14008.649039] watchdog: BUG: soft lockup - CPU#13 stuck for 22s! [qemu-system-aar:29381]
...
[14008.657085] Call trace:
[14008.657089]  queued_spin_lock_slowpath+0x208/0x2d8
[14008.657091]  kvm_handle_guest_abort+0xaac/0xe88

...
[14012.619015] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [qemu-system-aar:32507]
[14012.626968] Call trace:
[14012.626971]  queued_spin_lock_slowpath+0x208/0x2d8
[14012.626973]  kvm_mmu_notifier_invalidate_range_start+0xdc/0xe0
Comment 14 Miroslav Beneš 2022-03-11 12:09:06 UTC
CCing Bruce, Joerg, and Matthias in case it rings a bell.
Comment 15 Joerg Roedel 2022-05-04 12:25:02 UTC
(In reply to Chester Lin from comment #13)
> (In reply to Oliver Kurz from comment #5)
> > As you reported yourself this is still appearing, see
> > https://progress.opensuse.org/issues/107980
> 
> Checked the soft lockup messages from this link and found that in some cases
> the CPUs were spinning on acquiring locks (Perhaps kvm->mmu_lock or
> something else?), For example:
> [soft lockup traces quoted in comment 13 snipped]

Can you please run a kernel with lock debugging enabled on this machine and trigger the issue? It would be interesting to see whether lockdep warnings are seen there.
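For reference, lock debugging of this kind is usually enabled through kernel config options along these lines (a sketch; the exact set of options varies between kernel versions):

```
CONFIG_PROVE_LOCKING=y     # lockdep: report locking patterns that could deadlock
CONFIG_DEBUG_SPINLOCK=y    # basic spinlock/rwlock sanity checks
CONFIG_DEBUG_LOCK_ALLOC=y  # detect locks freed or reinitialized while held
CONFIG_LOCK_STAT=y         # lock contention statistics via /proc/lock_stat
```

With these enabled, lockdep warnings would show up in dmesg.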
Comment 16 Till Dörges 2022-12-04 12:31:19 UTC
It looks like we've stumbled across another instance of this bug on a Dell PowerEdge R300. The R300 used to work fine with openSUSE Leap 15.3.

After going to openSUSE Leap 15.4 and kernel 5.14.21-150400.24.33-default, we started seeing messages similar to the above. The machine was generally not usable: issuing a few shell commands would work, but running zypper to downgrade the kernel would only work in single-user mode.

The solution was to downgrade to 5.14.21-150400.24.18-default. That was inspired by bug 1203630, but I am not sure whether the two are related.
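For anyone hitting the same thing, a downgrade plus version pin can be done with zypper roughly like this (a sketch; the exact package version string may differ on your system):

```shell
# Install the older kernel explicitly; --oldpackage permits a version downgrade
zypper install --oldpackage kernel-default=5.14.21-150400.24.18
# Lock the package so the next update does not pull the newer kernel back in
zypper addlock kernel-default
```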

If our symptoms are not related to this bug, I'm happy to open a new one. And of course if you need more information I'll happily try to provide it.


[  172.359285][    C0] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [qemu-kvm:2420]
[  172.359285][    C0] Modules linked in: mpt3sas raid_class nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_raw iptable_security ip_set nfnetlink tun bridge stp llc iscsi_ibft iscsi_boot_sysfs ebtable_filter ebtables rfkill ip6table_filter ip6_tables iptable_filter bpfilter dell_rbu(X) dmi_sysfs msr ipmi_ssif joydev coretemp radeon kvm_intel iTCO_wdt intel_pmc_bxt iTCO_vendor_support i2c_algo_bit drm_ttm_helper ttm kvm drm_kms_helper dcdbas(X) cec rc_core tg3 syscopyarea sysfillrect sysimgblt fb_sys_fops irqbypass libphy mptctl lpc_ich pcspkr ipmi_si mfd_core ipmi_devintf ipmi_msghandler button i5100_edac acpi_cpufreq drm fuse configfs ip_tables x_tables ext4 crc16 mbcache jbd2 sr_mod sd_mod cdrom t10_pi hid_generic
[  172.359285][    C0]  uas usb_storage usbhid ata_generic ata_piix ahci libahci libata ehci_pci serio_raw uhci_hcd ehci_hcd usbcore mptsas mptscsih mptbase scsi_transport_sas sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[  172.359285][    C0] Supported: Yes, External
[  172.359285][    C0] CPU: 0 PID: 2420 Comm: qemu-kvm Tainted: G      D        X    5.14.21-150400.24.33-default #1 SLE15-SP4 018fa6ac9e6418760457307f6579fc12335772aa
[  172.359285][    C0] Hardware name: Dell Inc. PowerEdge R300/abcdef, BIOS 1.5.2 11/02/2010
[  172.359285][    C0] RIP: 0010:queued_write_lock_slowpath+0x5c/0x80
[  172.359285][    C0] Code: 00 0f 1f 40 00 5b 5d c3 cc cc cc cc f0 81 0b 00 01 00 00 b9 ff 00 00 00 be 00 01 00 00 8b 03 3d 00 01 00 00 74 0c f3 90 8b 13 <81> fa 00 01 00 00 75 f4 89 f0 f0 0f b1 0b 74 c1 eb e1 89 c6 48 89
[  172.359285][    C0] RSP: 0018:ffffa47cc0c4bb98 EFLAGS: 00000206
[  172.359285][    C0] RAX: 00000000000001ff RBX: ffffa47cc0c65000 RCX: 00000000000000ff
[  172.359285][    C0] RDX: 00000000000001ff RSI: 0000000000000100 RDI: ffffa47cc0c65000
[  172.359285][    C0] RBP: ffffa47cc0c65004 R08: 0000000000000000 R09: 0000000000000000
[  172.359285][    C0] R10: ffffa47cc0c4be48 R11: 0000000000000000 R12: 0000000000000000
[  172.359285][    C0] R13: 00007fa4723f7000 R14: 00007fa4723f8000 R15: ffffffffc06df380
[  172.359285][    C0] FS:  00007fa472bf7700(0000) GS:ffff9909e7c00000(0000) knlGS:0000000000000000
[  172.359285][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  172.359285][    C0] CR2: 00007fa47768ddce CR3: 000000010ba60000 CR4: 00000000000426f0
[  172.359285][    C0] Call Trace:
[  172.359285][    C0]  <TASK>
[  172.359285][    C0]  kvm_mmu_notifier_invalidate_range_start+0x7b/0x2d0 [kvm d2fafb973bfd2d45f9ddb3def12513288c6a8a0b]
[  172.359285][    C0]  ? __switch_to_asm+0x42/0x80
[  172.359285][    C0]  __mmu_notifier_invalidate_range_start+0x6b/0x1e0
[  172.359285][    C0]  ? __SCT__cond_resched+0x8/0x8
[  172.359285][    C0]  zap_page_range+0x148/0x160
[  172.359285][    C0]  ? find_vma+0x64/0x70
[  172.359285][    C0]  do_madvise+0xa7c/0xfe0
[  172.359285][    C0]  ? __handle_mm_fault+0xd52/0x1220
[  172.359285][    C0]  ? __x64_sys_madvise+0x26/0x30
[  172.359285][    C0]  __x64_sys_madvise+0x26/0x30
[  172.359285][    C0]  do_syscall_64+0x5b/0x80
[  172.359285][    C0]  ? handle_mm_fault+0xd7/0x290
[  172.359285][    C0]  ? do_user_addr_fault+0x1ff/0x730
[  172.359285][    C0]  ? exc_page_fault+0x67/0x150
[  172.359285][    C0]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  172.359285][    C0] RIP: 0033:0x7fa477655257
[  172.359285][    C0] Code: ff ff ff ff c3 48 8b 15 37 7c 2d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 66 2e 0f 1f 84 00 00 00 00 00 90 b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 7c 2d 00 f7 d8 64 89 01 48
[  172.359285][    C0] RSP: 002b:00007fa472bf6778 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
[  172.359285][    C0] RAX: ffffffffffffffda RBX: 00007fa4723f7000 RCX: 00007fa477655257
[  172.359285][    C0] RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fa4723f7000
[  172.359285][    C0] RBP: 00007fa472bf7700 R08: 0000564a272fcc10 R09: 0000000000000000
[  172.359285][    C0] R10: 00007fa472bf6660 R11: 0000000000000206 R12: 00007ffc2aa0f63e
[  172.359285][    C0] R13: 00007ffc2aa0f63f R14: 00007ffc2aa0f6d0 R15: 0000000000000000
[  172.359285][    C0]  </TASK>