Bug 1162612 - rcu_sched detected stalls on CPUs/tasks, stuck on lspci
Status: RESOLVED WORKSFORME
Duplicates: 1156813
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: Leap 15.1
Hardware: aarch64 Other
Priority: P5 - None
Severity: Major
Assigned To: Matthias Brugger
Blocks: 1165642
Reported: 2020-02-04 10:40 UTC by Oliver Kurz
Modified: 2022-04-25 17:40 UTC (8 users in CC)

Attachments:
- logs of IPMI SOL including "sysrq-w" output, showing lspci stuck and other stack traces (66.17 KB, text/x-log), 2020-03-18 08:17 UTC, Oliver Kurz
- dmidecode of affected machine (16.24 KB, text/plain), 2020-07-13 19:53 UTC, Oliver Kurz
Description Oliver Kurz 2020-02-04 10:40:31 UTC
## Observation

The SUSE-internal machine "openqaworker-arm-1", an aarch64 machine (model: Cavium ThunderX CN88XX) running openSUSE Leap 15.1 with kernel 4.12.14-lp151.28.36-default, was pingable on 2020-01-22 but we could not log in over ssh. When logged in one is greeted with
"BUG: workqueue lockup - pool cpus=15 node=0 flags=0x0 nice=0 stuck for …"

I did

```
sysctl -w kernel.softlockup_panic=1
sysctl -p
```

and could see the stack trace from `journalctl -kf`:

```
Jan 22 10:08:13 openqaworker-arm-1 kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Jan 22 10:08:13 openqaworker-arm-1 kernel:         23-...: (1 GPs behind) idle=9ca/140000000000000/0 softirq=420584/420588 fqs=724371 
Jan 22 10:08:13 openqaworker-arm-1 kernel:         (detected by 8, t=1896528 jiffies, g=400737, c=400736, q=12082145)
Jan 22 10:08:13 openqaworker-arm-1 kernel: Task dump for CPU 23:
Jan 22 10:08:13 openqaworker-arm-1 kernel: lspci           R  running task        0  3725   8151 0x00000002
Jan 22 10:08:13 openqaworker-arm-1 kernel: Call trace:
Jan 22 10:08:13 openqaworker-arm-1 kernel:  ret_from_fork+0x0/0x20
Jan 22 10:08:27 openqaworker-arm-1 kernel: BUG: workqueue lockup - pool cpus=15 node=0 flags=0x0 nice=0 stuck for 7575s!
Jan 22 10:08:27 openqaworker-arm-1 kernel: Showing busy workqueues and worker pools:
Jan 22 10:08:27 openqaworker-arm-1 kernel: workqueue events: flags=0x0
Jan 22 10:08:27 openqaworker-arm-1 kernel:   pwq 30: cpus=15 node=0 flags=0x0 nice=0 active=2/256
Jan 22 10:08:27 openqaworker-arm-1 kernel:     in-flight: 322:wait_rcu_exp_gp
Jan 22 10:08:27 openqaworker-arm-1 kernel:     pending: cache_reap
Jan 22 10:08:27 openqaworker-arm-1 kernel: workqueue mm_percpu_wq: flags=0x8
Jan 22 10:08:27 openqaworker-arm-1 kernel:   pwq 30: cpus=15 node=0 flags=0x0 nice=0 active=1/256
Jan 22 10:08:27 openqaworker-arm-1 kernel:     pending: vmstat_update
Jan 22 10:08:27 openqaworker-arm-1 kernel: workqueue writeback: flags=0x4e
Jan 22 10:08:27 openqaworker-arm-1 kernel:   pwq 96: cpus=0-47 flags=0x4 nice=0 active=1/256
Jan 22 10:08:27 openqaworker-arm-1 kernel:     in-flight: 43459:wb_workfn
Jan 22 10:08:27 openqaworker-arm-1 kernel: workqueue kblockd: flags=0x18
Jan 22 10:08:27 openqaworker-arm-1 kernel:   pwq 65: cpus=32 node=0 flags=0x0 nice=-20 active=1/256
Jan 22 10:08:27 openqaworker-arm-1 kernel:     in-flight: 0:blk_mq_timeout_work
Jan 22 10:08:27 openqaworker-arm-1 kernel:   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
Jan 22 10:08:27 openqaworker-arm-1 kernel:     in-flight: 0:blk_mq_timeout_work
Jan 22 10:08:27 openqaworker-arm-1 kernel: pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=331s workers=3 idle: 46484 18
Jan 22 10:08:27 openqaworker-arm-1 kernel: pool 30: cpus=15 node=0 flags=0x0 nice=0 hung=7575s workers=2 idle: 101
Jan 22 10:08:27 openqaworker-arm-1 kernel: pool 65: cpus=32 node=0 flags=0x0 nice=-20 hung=3138s workers=3 idle: 1204 204
Jan 22 10:08:27 openqaworker-arm-1 kernel: pool 96: cpus=0-47 flags=0x4 nice=0 hung=0s workers=7 idle: 13255 12971 9864 14174 13163 14060
```
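Workqueue lockup dumps like the one above list per-pool hang durations (`hung=7575s`). A minimal, hypothetical helper (not part of the original report) to pull those out of a saved log and spot the worst-hung pool quickly:

```python
import re

# Matches the per-pool summary lines of a "BUG: workqueue lockup" dump, e.g.:
#   pool 30: cpus=15 node=0 flags=0x0 nice=0 hung=7575s workers=2 idle: 101
POOL_RE = re.compile(r"pool (\d+): cpus=(\S+).*?hung=(\d+)s")

def hung_pools(log_text):
    """Return [(pool_id, cpus, hung_seconds)] sorted by longest hang first."""
    hits = [(int(p), cpus, int(s)) for p, cpus, s in POOL_RE.findall(log_text)]
    return sorted(hits, key=lambda t: -t[2])

# Sample lines taken from the dump above (timestamps stripped).
log = """\
pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=331s workers=3 idle: 46484 18
pool 30: cpus=15 node=0 flags=0x0 nice=0 hung=7575s workers=2 idle: 101
pool 65: cpus=32 node=0 flags=0x0 nice=-20 hung=3138s workers=3 idle: 1204 204
pool 96: cpus=0-47 flags=0x4 nice=0 hung=0s workers=7 idle: 13255
"""

print(hung_pools(log)[0])  # longest-hung pool: (30, '15', 7575)
```

This confirms at a glance that pool 30 (CPU 15) has been stuck for over two hours, matching the "stuck for 7575s" message at login.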

Sometimes specific processes are mentioned, as in the above example of lspci:

```
lspci           R  running task        0  3725   8151 0x00000002
```

That is lspci with PID 3725, spawned by salt-minion:

```
root      3150  0.0  0.0  39532 28028 ?        Ss   Jan21   0:01 /usr/bin/python3 /usr/bin/salt-minion
root      8151  0.1  0.0 998512 71104 ?        Sl   Jan21   0:58  \_ /usr/bin/python3 /usr/bin/salt-minion
root      8814  0.0  0.0 123024 28572 ?        S    Jan21   0:00      \_ /usr/bin/python3 /usr/bin/salt-minion
root      3725  0.0  0.0   3476   900 ?        R    04:52   0:00      \_ /sbin/lspci -vmm
```

I tried to kill lspci but that did not seem to help; eventually I had to power-cycle the machine. On 2020-02-04 I found what looks like the same problem again:

```
[Tue Feb  4 10:18:34 2020] SFW2-INext-ACC-TCP IN=eth0 OUT= MAC=1c:1b:0d:68:7e:c7:00:de:fb:e3:d7:7c:08:00 SRC=10.163.1.98 DST=10.160.0.245 LEN=60 TOS=0x00 PREC=0x00 TTL=62 ID=62479 DF PROTO=TCP SPT=32952 DPT=22 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (0204051C0402080A1D51F8FA0000000001030307) 
[Tue Feb  4 10:19:46 2020] INFO: rcu_sched detected stalls on CPUs/tasks:
[Tue Feb  4 10:19:46 2020]      2-...: (1 GPs behind) idle=fbe/140000000000000/0 softirq=6509573/6509574 fqs=14344932 
[Tue Feb  4 10:19:46 2020]      (detected by 43, t=34503583 jiffies, g=6715213, c=6715212, q=231673150)
[Tue Feb  4 10:19:46 2020] Task dump for CPU 2:
[Tue Feb  4 10:19:46 2020] lspci           R  running task        0 26073   7344 0x00000002
[Tue Feb  4 10:19:46 2020] Call trace:
[Tue Feb  4 10:19:46 2020]  __switch_to+0xe4/0x150
[Tue Feb  4 10:19:46 2020]  0xffffffffffffffff
```

I triggered a complete stack trace of all processes with `echo t > /proc/sysrq-trigger` and saved the output with `dmesg -T > /tmp/dmesg_$(date +%F)`.

The dmesg log contains the complete output from the current boot. The first suspicious entry:

```
[584684.843935] nvme nvme0: I/O 328 QID 26 timeout, completion polled
[585160.113773] nvme nvme0: I/O 985 QID 4 timeout, completion polled
[585160.113804] nvme nvme0: I/O 9 QID 23 timeout, completion polled
[591194.945207] nvme nvme0: I/O 209 QID 25 timeout, completion polled
[597368.812505] Synchronous External Abort: synchronous external abort (0x96000210) at 0xffff000018be201c
[597368.812513] Internal error: : 96000210 [#1] SMP
[597368.817161] Modules linked in: nf_log_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_TCPMSS nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_nat nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfnetlink_cthelper nfnetlink nfs lockd grace sunrpc fscache scsi_transport_iscsi af_packet tun ip_gre gre ip_tunnel openvswitch nf_nat_ipv6 nf_nat_ipv4 nf_nat ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_physdev br_netfilter bridge stp llc xt_pkttype xt_tcpudp iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack libcrc32c ip6table_filter ip6_tables x_tables nls_iso8859_1 nls_cp437 vfat fat nicvf cavium_ptp nicpf joydev thunder_bgx mdio_thunder mdio_cavium thunderx_edac thunder_xcv
[597368.888273]  cavium_rng_vf cavium_rng uio_pdrv_genirq aes_ce_blk crypto_simd uio cryptd aes_ce_cipher crc32_ce crct10dif_ce ghash_ce aes_arm64 ipmi_ssif ipmi_devintf ipmi_msghandler sha2_ce sha256_arm64 sha1_ce btrfs xor zstd_decompress zstd_compress xxhash zlib_deflate raid6_pq hid_generic usbhid ast xhci_pci i2c_algo_bit xhci_hcd drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm nvme nvme_core gpio_keys drm_panel_orientation_quirks usbcore thunderx_mmc i2c_thunderx i2c_smbus mmc_core dm_mirror dm_region_hash dm_log sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[597368.942048] CPU: 3 PID: 480 Comm: kworker/3:1H Not tainted 4.12.14-lp151.28.36-default #1 openSUSE Leap 15.1
[597368.951977] Hardware name: GIGABYTE R120-T32/MT30-GS1, BIOS T19 09/29/2016
[597368.958963] Workqueue: kblockd blk_mq_timeout_work
[597368.963850] task: ffff803ee7152000 task.stack: ffff803e5fae8000
[597368.969863] pstate: 60000005 (nZCv daif -PAN -UAO)
[597368.974756] pc : nvme_timeout+0x48/0x340 [nvme]
[597368.979382] lr : blk_mq_check_expired+0x140/0x178
[597368.984180] sp : ffff803e5faebc20
[597368.987587] x29: ffff803e5faebc20 x28: ffff803e2c05b800 
[597368.992995] x27: 000000000000000e x26: 0000000000000000 
[597368.998405] x25: ffff803eedb73810 x24: ffff803e22950000 
[597369.003813] x23: ffff803e22950138 x22: ffff803e25ab1380 
[597369.009223] x21: ffff803ee735c000 x20: ffff000009339000 
[597369.014634] x19: ffff803e22950000 x18: 0000ffff7c2ee000 
[597369.020042] x17: 0000ffff7bf687a0 x16: ffff0000082f46d0 
[597369.025451] x15: 0000d0dd25e9e532 x14: 0029801fcab7a304 
[597369.030870] x13: 000000005e33ec9f x12: 0000000000000018 
[597369.036286] x11: 0000000005ea2fbc x10: 0000000000001950 
[597369.041695] x9 : ffff803e5faebd80 x8 : ffff803ee71539b0 
[597369.047102] x7 : 0000000000000002 x6 : ffff803e22ae94d0 
[597369.052513] x5 : ffff803c40d30258 x4 : ffff000009337000 
[597369.057922] x3 : 0000000000000001 x2 : ffff000018be201c 
[597369.063329] x1 : 0000000000000000 x0 : ffff000009339710 
[597369.068739] Process kworker/3:1H (pid: 480, stack limit = 0xffff803e5fae8000)
[597369.075968] Call trace:
[597369.078518]  nvme_timeout+0x48/0x340 [nvme]
[597369.082799]  blk_mq_check_expired+0x140/0x178
[597369.087252]  bt_for_each+0x118/0x140
[597369.090923]  blk_mq_queue_tag_busy_iter+0xa8/0x140
[597369.095808]  blk_mq_timeout_work+0x58/0x118
[597369.100087]  process_one_work+0x1e4/0x430
[597369.104202]  worker_thread+0x50/0x478
[597369.107963]  kthread+0x134/0x138
[597369.111286]  ret_from_fork+0x10/0x20
[597369.114958] Code: f94012f6 f94006d5 f94092a2 91007042 (b9400053) 
[597369.121172] ---[ end trace 0dc43e02d03f7b76 ]---
```

So a faulty NVMe could be causing the breakdown, though there is quite some time between the I/O timeouts and the stack trace, which also mentions "nvme_timeout". The next entry:

```
[599128.041657] Synchronous External Abort: synchronous external abort (0x96000210) at 0xffff000081000000
[599128.050998] Internal error: : 96000210 [#3] SMP
[599128.055626] Modules linked in: nf_log_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_TCPMSS nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_nat nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfnetlink_cthelper nfnetlink nfs lockd grace sunrpc fscache scsi_transport_iscsi af_packet tun ip_gre gre ip_tunnel openvswitch nf_nat_ipv6 nf_nat_ipv4 nf_nat ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_physdev br_netfilter bridge stp llc xt_pkttype xt_tcpudp iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack libcrc32c ip6table_filter ip6_tables x_tables nls_iso8859_1 nls_cp437 vfat fat nicvf cavium_ptp nicpf joydev thunder_bgx mdio_thunder mdio_cavium thunderx_edac thunder_xcv
[599128.126717]  cavium_rng_vf cavium_rng uio_pdrv_genirq aes_ce_blk crypto_simd uio cryptd aes_ce_cipher crc32_ce crct10dif_ce ghash_ce aes_arm64 ipmi_ssif ipmi_devintf ipmi_msghandler sha2_ce sha256_arm64 sha1_ce btrfs xor zstd_decompress zstd_compress xxhash zlib_deflate raid6_pq hid_generic usbhid ast xhci_pci i2c_algo_bit xhci_hcd drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm nvme nvme_core gpio_keys drm_panel_orientation_quirks usbcore thunderx_mmc i2c_thunderx i2c_smbus mmc_core dm_mirror dm_region_hash dm_log sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[599128.180483] CPU: 44 PID: 20557 Comm: lspci Tainted: G      D          4.12.14-lp151.28.36-default #1 openSUSE Leap 15.1
[599128.191358] Hardware name: GIGABYTE R120-T32/MT30-GS1, BIOS T19 09/29/2016
[599128.198327] task: ffff803c3c186100 task.stack: ffff803c420e0000
[599128.204340] pstate: 20000085 (nzCv daIf -PAN -UAO)
[599128.209228] pc : pci_generic_config_read+0x5c/0xf0
[599128.214112] lr : pci_generic_config_read+0x48/0xf0
[599128.218995] sp : ffff803c420e3c00
[599128.222402] x29: ffff803c420e3c00 x28: 0000000000000000 
[599128.227810] x27: ffff803c3c1c1b00 x26: 0000000000000004 
[599128.233217] x25: ffff803c3c1c1b00 x24: 000000000000000f 
[599128.238624] x23: 0000000000000000 x22: 0000000000000000 
[599128.244031] x21: ffff803c420e3cbc x20: 0000000000000004 
[599128.249438] x19: ffff803ee7543800 x18: 0000fffffb2ad007 
[599128.254846] x17: 0000ffff9ea7c970 x16: ffff0000082dae98 
[599128.260265] x15: 000000000000000a x14: 0000000000000000 
[599128.265675] x13: 0000000000000000 x12: 0000000000000020 
[599128.271083] x11: 0000fffffb2ad008 x10: 0000000000000000 
[599128.276490] x9 : 0000ffff9ea2b5d8 x8 : 0000000000000043 
[599128.281897] x7 : 000000000000008f x6 : 00000000000000c7 
[599128.287304] x5 : 0000000000000010 x4 : 000000000000008f 
[599128.292711] x3 : ffff000080000000 x2 : 0000000000000090 
[599128.298118] x1 : 0000000001000000 x0 : ffff000081000000 
[599128.303528] Process lspci (pid: 20557, stack limit = 0xffff803c420e0000)
[599128.310322] Call trace:
[599128.312866]  pci_generic_config_read+0x5c/0xf0
[599128.317408]  thunder_pem_config_read+0x78/0x268
[599128.322032]  pci_user_read_config_dword+0x70/0x108
[599128.326918]  pci_read_config+0xdc/0x228
[599128.330848]  sysfs_kf_bin_read+0x6c/0xa8
[599128.334865]  kernfs_fop_read+0xa8/0x208
[599128.338796]  __vfs_read+0x48/0x140
[599128.342292]  vfs_read+0x94/0x150
[599128.345614]  SyS_pread64+0x8c/0xa8
[599128.349111]  el0_svc_naked+0x44/0x48
[599128.352781] Code: 7100069f 54000180 71000a9f 54000280 (b9400001) 
[599128.359002] ---[ end trace 0dc43e02d03f7b78 ]---
```

One entry after that, some time later:

```
[602781.468092] INFO: rcu_sched detected stalls on CPUs/tasks:
[602781.473695]         2-...: (1 GPs behind) idle=fbe/140000000000000/0 softirq=6509573/6509574 fqs=2690 
[602781.482486]         (detected by 14, t=6003 jiffies, g=6715213, c=6715212, q=40961)
[602781.489635] Task dump for CPU 2:
[602781.492956] lspci           R  running task        0 26073   7344 0x00000002
[602781.500102] Call trace:
[602781.502649]  __switch_to+0xe4/0x150
[602781.506231]  0xffffffffffffffff
[602961.514252] INFO: rcu_sched detected stalls on CPUs/tasks:
[602961.519855]         2-...: (1 GPs behind) idle=fbe/140000000000000/0 softirq=6509573/6509574 fqs=10747 
[602961.528733]         (detected by 3, t=24008 jiffies, g=6715213, c=6715212, q=163507)
[602961.535968] Task dump for CPU 2:
[602961.539289] lspci           R  running task        0 26073   7344 0x00000002
[602961.546435] Call trace:
[602961.548981]  __switch_to+0xe4/0x150
[602961.552564]  0xffffffffffffffff
```

which then repeats.


## Further details

Internal issue: https://progress.opensuse.org/issues/41882
Comment 1 Oliver Kurz 2020-03-18 08:17:16 UTC
Created attachment 833187 [details]
logs of IPMI SOL including "sysrq-w" output, showing lspci stuck and other stack traces

Another instance of this problem on the machine openqaworker-arm-1. The stack traces reference the blocked I/O state of the filesystem(s), with lspci among the stuck processes.
Comment 2 Oliver Kurz 2020-07-12 20:11:54 UTC
The problem keeps being reproduced, e.g. this latest stack trace on kernel 5.7.7-1.gcba119b-default:

```
Jul 10 14:17:55 [70390.645946] Internal error: synchronous external abort: 96000210 [#1] SMP
Jul 10 14:17:55 [70390.657300] Modules linked in: nf_log_ipv6 xt_MASQUERADE xt_comment xt_TCPMSS nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_nat nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfnetlink_cthelper nfnetlink nfs lockd grace sunrpc fscache af_packet tun iscsi_ibft iscsi_boot_sysfs rfkill ip_gre ip_tunnel gre openvswitch nsh nf_conncount nf_nat ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 br_netfilter bridge stp llc xt_physdev xt_pkttype xt_tcpudp iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast ip_tables xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables x_tables nls_iso8859_1 nls_cp437 vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher joydev nicvf cavium_ptp nicpf cavium_rng_vf mdio_thunder thunder_bgx thunder_xcv mdio_cavium thunderx_edac cavium_rng crct10dif_ce ghash_ce ipmi_ssif sha2_ce ipmi_devintf sha256_arm64 uio_pdrv_genirq ipmi_msghandler sha1_ce efi_pstore uio raid0 md_mod btrfs libcrc32c xor
Jul 10 14:17:55 [70390.657381]  xor_neon raid6_pq hid_generic usbhid ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core xhci_pci xhci_hcd drm nvme thunderx_mmc usbcore nvme_core gpio_keys mmc_core i2c_thunderx dm_mirror dm_region_hash dm_log sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
Jul 10 14:17:56 [70390.830836] CPU: 39 PID: 495 Comm: kworker/39:1H Kdump: loaded Not tainted 5.7.7-1.gcba119b-default #1 openSUSE Tumbleweed (unreleased)
Jul 10 14:17:56 [70390.853140] Hardware name: GIGABYTE R120-T32/MT30-GS1, BIOS T19 09/29/2016
Jul 10 14:17:56 [70390.865054] Workqueue: kblockd blk_mq_timeout_work
Jul 10 14:17:56 [70390.874811] pstate: 20000005 (nzCv daif -PAN -UAO)
Jul 10 14:17:56 [70390.884507] pc : nvme_timeout+0x4c/0x524 [nvme]
Jul 10 14:17:56 [70390.893864] lr : blk_mq_check_expired.part.0+0x188/0x1cc
Jul 10 14:17:56 [70390.903945] sp : ffff80001447bbd0
Jul 10 14:17:56 [70390.911946] x29: ffff80001447bbd0 x28: 00000000000000c0
Jul 10 14:17:56 [70390.921890] x27: ffff003dcbd6a710 x26: ffff003fcd010400
Jul 10 14:17:56 [70390.931745] x25: ffff003e61d8a1a0 x24: ffff003ddc971140
Jul 10 14:17:56 [70390.941513] x23: 0000000000000000 x22: ffff003e61d8a080
Jul 10 14:17:56 [70390.951201] x21: ffff003e26ec9e80 x20: ffff003dcbf90000
Jul 10 14:17:56 [70390.960804] x19: ffff80001360201c x18: 0000000000000000
Jul 10 14:17:56 [70390.970329] x17: 0000000000000000 x16: 0000000000000000
Jul 10 14:17:56 [70390.979763] x15: 0000000000000000 x14: 0000000000000000
Jul 10 14:17:56 [70390.989138] x13: 0000000000000000 x12: 0000000000000000
Jul 10 14:17:56 [70390.998418] x11: 0000000000000000 x10: 0000000000001a40
Jul 10 14:17:56 [70391.007617] x9 : ffff800010613ebc x8 : fefefefefefefeff
Jul 10 14:17:56 [70391.016730] x7 : 0000000000000018 x6 : 00000001006af300
Jul 10 14:17:56 [70391.025764] x5 : 0000000000000000 x4 : 0000000000000001
Jul 10 14:17:56 [70391.034706] x3 : 0000000000000001 x2 : ffff800008ea2a90
Jul 10 14:17:56 [70391.043559] x1 : 0000000000000000 x0 : ffff003e61d8a080
Jul 10 14:17:56 [70391.052326] Call trace:
Jul 10 14:17:56 [70391.058143]  nvme_timeout+0x4c/0x524 [nvme]
Jul 10 14:17:56 [70391.065629]  blk_mq_check_expired.part.0+0x188/0x1cc
Jul 10 14:17:56 [70391.073830]  blk_mq_check_expired+0x60/0x80
Jul 10 14:17:56 [70391.081156]  blk_mq_queue_tag_busy_iter+0x1d4/0x314
Jul 10 14:17:56 [70391.089106]  blk_mq_timeout_work+0x74/0x150
Jul 10 14:17:56 [70391.096276]  process_one_work+0x1e4/0x490
Jul 10 14:17:56 [70391.103193]  worker_thread+0x170/0x440
Jul 10 14:17:56 [70391.109769]  kthread+0x11c/0x120
Jul 10 14:17:56 [70391.115734]  ret_from_fork+0x10/0x18
Jul 10 14:17:56 [70391.121968] Code: d2800001 f9400314 f940ca93 91007273 (b9400273)
Jul 10 14:17:56 [70391.130716] SMP: stopping secondary CPUs
```

after which the machine is just dead and needs a power cycle to be recovered.
Comment 3 Matthias Brugger 2020-07-13 13:29:55 UTC
Can you provide the output of dmidecode, please?
Comment 4 Oliver Kurz 2020-07-13 19:53:57 UTC
Created attachment 839648 [details]
dmidecode of affected machine

Sure, dmidecode attached.
Comment 5 Michal Koutný 2020-08-18 08:50:38 UTC
*** Bug 1156813 has been marked as a duplicate of this bug. ***
Comment 6 Michal Koutný 2020-08-18 08:53:15 UTC
Oliver, would it be possible to get a full crash dump of the affected system? That would show the wait-for dependencies; I wasn't able to figure out the origin from comment 1, although lspci may be suspect.
Comment 7 Oliver Kurz 2020-08-18 12:20:37 UTC
(In reply to Michal Koutný from comment #6)
> Oliver, would it be possible to get a full crash dump of the affected system

I wouldn't know how. As stated, the system crashes hard. kdump is enabled but no crash dumps have been recorded. Maybe you have some hints on what else to try?
Comment 8 Michal Koutný 2022-04-25 17:40:04 UTC
I'd have suggested testing the kdump setup with `echo c >/proc/sysrq-trigger`.
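A sketch of what that kdump test could look like (hypothetical crash-dump path; run as root on the affected machine, and note that the final command deliberately crashes it):

```shell
# Check that a crash kernel is actually reserved and the kdump service is up
# before forcing a test crash.
grep -o 'crashkernel=[^ ]*' /proc/cmdline
systemctl is-active kdump

# Deliberately crash the kernel. If kdump works, the machine reboots and a
# vmcore should appear under the configured dump directory (commonly
# /var/crash/<timestamp>/ on openSUSE).
echo c > /proc/sysrq-trigger
```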

(I just dug this up in my to-response list, sorry.)

Nevertheless, this bug is old and only good for reference now. I'm closing it without a fixed resolution.