Bug 1165642 - aarch64 machine pingable, but ssh login not possible as well as serial console login, magic sysrq still works, logs recorded
Status: NEW
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: Leap 15.1
Hardware: aarch64 Other
Priority: P5 - None  Severity: Major
Assigned To: Wenruo Qu
Depends on: 1162612
Blocks:
Reported: 2020-03-04 09:28 UTC by Oliver Kurz
Modified: 2020-03-25 16:23 UTC (History)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
log of the IPMI SOL session with all output from sysrq-w, sysrq-t, etc. (127.93 KB, application/x-xz)
2020-03-04 09:28 UTC, Oliver Kurz
log of stuck processes and sysrqw-w from openqaworker-arm-1 on kernel-default-5.6.rc6-1.1.g5c2f002.aarch64 (10.30 KB, application/x-xz)
2020-03-25 05:57 UTC, Oliver Kurz
console log (1.37 MB, text/plain)
2020-03-25 13:47 UTC, Michal Suchanek
log of stuck processes and sysrqw-w from openqaworker-arm-1 on kernel-default-5.6.rc6-1.1.g5c2f002.aarch64 (100.59 KB, text/plain)
2020-03-25 13:49 UTC, Michal Suchanek

Description Oliver Kurz 2020-03-04 09:28:24 UTC
Created attachment 831851 [details]
log of the IPMI SOL session with all output from sysrq-w, sysrq-t, etc.

+++ This bug was initially created as a clone of Bug #1162612 +++

## Observation

The SUSE internal machine "openqaworker-arm-2", an aarch64 machine (model: Cavium ThunderX CN88XX) running openSUSE Leap 15.1 with kernel 4.12.14-lp151.28.36-default, was pingable on 2020-03-04, but we could not log in over ssh, similar to #1162612. I also could not log in over a serial terminal via IPMI SOL. I could still issue magic sysrq commands to find out what was blocking the host.

I attached a log of the IPMI SOL session with all output from sysrq-w, sysrq-t, etc. Then I did a `power reset` over IPMI to bring the machine back up; it is up and running now. The problem might have happened in the past, but this time I managed to gather more logs, which at first glance seem to point to "btrfs".

## Further details

Internal issue: https://progress.opensuse.org/issues/41882
Comment 1 Michal Suchanek 2020-03-18 11:58:14 UTC
Please retest with KOTD. There is a known btrfs balance issue with the current Leap MU.
Comment 2 Oliver Kurz 2020-03-19 19:40:13 UTC
I will. Theoretically this experiment could run forever, as the problem does not appear that often and I am looking for the issue *not* appearing anymore. I could give that assessment with high confidence only after multiple months of operation without the problem appearing once. Can I do something else to investigate further in case this issue happens again?
Comment 3 Michal Suchanek 2020-03-22 18:54:36 UTC
The btrfs issue is triggered by btrfsmaintenance, so you can run it a few times to see if that is your problem, and run it again with the updated kernel to see if it's fixed.
Comment 4 Michal Suchanek 2020-03-24 16:20:04 UTC
Please update to the current MU. The update should have the btrfs fix which should make diagnosing any remaining issues easier.
Comment 5 Oliver Kurz 2020-03-25 05:57:11 UTC
Created attachment 833804 [details]
log of stuck processes and sysrqw-w from openqaworker-arm-1 on kernel-default-5.6.rc6-1.1.g5c2f002.aarch64

Good suggestion! I triggered

```
for i in {1..20}; do
    msg="## https://progress.opensuse.org/issues/41882#note-34: Run $i"
    # Mark each run in the system log and on the terminal, then start the maintenance units.
    logger "$msg" && echo "$msg" &&
        systemctl start btrfs-scrub btrfs-balance btrfs-trim fstrim
    sleep 300
done
```

on the two machines "openqaworker-arm-1", which is running kernel-default-5.6.rc6-1.1.g5c2f002.aarch64 (KOTD), and "openqaworker-arm-2", which is running 4.12.14-lp151.28.40-default. After one night, openqaworker-arm-1 with KOTD is in a state resembling the reported one, with a lot of processes blocked on I/O. Logs attached.

So KOTD *still* reproduces the problem. I hope the logs are helpful to determine what could be the next step for investigation or fix.
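
For spotting the "blocked I/O processes" state while a shell is still usable, a minimal sketch might look like the following. It is not taken from the attached logs. The function names `filter_blocked` and `list_blocked` are hypothetical, and it assumes a procps-style `ps` where a state starting with `D` means uninterruptible sleep.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original report.

# Keep only lines whose first field (the ps state) starts with "D",
# i.e. tasks in uninterruptible sleep, typically waiting on I/O.
filter_blocked() {
    awk '$1 ~ /^D/'
}

# Print state, PID, kernel wait channel and command name of blocked tasks.
# The trailing "=" on each column suppresses the ps header line.
list_blocked() {
    ps -eo stat=,pid=,wchan=,comm= | filter_blocked
}
```

Calling `list_blocked` between runs of the maintenance loop would show whether tasks start piling up in state `D` before the machine stops accepting logins. Once login is no longer possible, sysrq-w over the serial console remains the way to get this information.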
Comment 6 Michal Suchanek 2020-03-25 13:47:46 UTC
Created attachment 833842 [details]
console log
Comment 7 Michal Suchanek 2020-03-25 13:49:32 UTC
Created attachment 833843 [details]
log of stuck processes and sysrqw-w from openqaworker-arm-1 on kernel-default-5.6.rc6-1.1.g5c2f002.aarch64
Comment 8 Michal Suchanek 2020-03-25 13:51:28 UTC
Triggered by btrfsmaintenance - likely btrfs related.