Bug 1172997 - kernel panic about "SError Interrupt on CPU…" on aarch64, 4.12.14-lp151.28.48-default
Status: NEW
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: Leap 15.1
Hardware: aarch64 Other
Priority: P5 - None, Severity: Major
Assigned To: Matthias Brugger
Reported: 2020-06-16 14:20 UTC by Oliver Kurz
Modified: 2020-09-29 12:20 UTC (History)

Description Oliver Kurz 2020-06-16 14:20:41 UTC
+++ This bug was initially created as a clone of Bug #1162612 +++

## Observation

The SUSE-internal machine "openqaworker-arm-3", an aarch64 system (machine model: Cavium ThunderX CN88XX) running openSUSE Leap 15.1 with kernel 4.12.14-lp151.28.48-default, has crashed multiple times with a kernel panic and error reports about "SError Interrupt on CPU…", for example:


```
Apr 28 15:17:30 [256898.042772] SError Interrupt on CPU81, code 0xbe000000 -- SError
Apr 28 15:17:30 [256898.042775] CPU: 81 PID: 24143 Comm: worker Tainted: G        W        4.12.14-lp151.28.44-default #1 openSUSE Leap 15.1
Apr 28 15:17:30 [256898.042777] Hardware name: GIGABYTE R270-T64-00/MT60-SC4-00, BIOS T32 03/03/2017
Apr 28 15:17:30 [256898.042778] task: ffff810cc501c080 task.stack: ffff810cc829c000
Apr 28 15:17:30 [256898.042779] pstate: 60000005 (nZCv daif -PAN -UAO)
Apr 28 15:17:30 [256898.042780] pc : ptep_set_access_flags+0xc0/0x100
Apr 28 15:17:30 [256898.042781] lr : 0xe0010bbfeb9bd3
Apr 28 15:17:30 [256898.042782] sp : ffff810cc829fca0
Apr 28 15:17:30 [256898.042784] x29: ffff810cc829fca0 x28: ffff810cc501c080
Apr 28 15:17:30 [256898.042787] x27: ffff810cc2443468 x26: 0000000000000007
Apr 28 15:17:30 [256898.042789] x25: ffff810f6f63d0c8 x24: 0000000000000054
Apr 28 15:17:30 [256898.042792] x23: ffff810cc2443400 x22: ffff810f6f63d0c8
Apr 28 15:17:30 [256898.042794] x21: 62c6000aaab37e3b x20: 00e0010bbfeb9fd3
Apr 28 15:17:30 [256898.042796] x19: ffff810cce57b1d8 x18: 0000000000000000
Apr 28 15:17:30 [256898.042799] x17: 0000ffffa9823238 x16: 0000aaaaeabfd350
Apr 28 15:17:30 [256898.042801] x15: 000040c4034d32e5 x14: 0000000000000000
Apr 28 15:17:30 [256898.042803] x13: 8080c08013012026 x12: 0000000000000000
Apr 28 15:17:30 [256898.042806] x11: 0000000000000040 x10: 000000000000000e
Apr 28 15:17:30 [256898.042808] x9 : 000000000000ff00 x8 : 0000ffffa937bac8
Apr 28 15:17:30 [256898.042810] x7 : 0000aaab1bb72d48 x6 : 0000000000000024
Apr 28 15:17:30 [256898.042813] x5 : 00e0010bbfeb9b53 x4 : 00e0010bbfeb9b53
Apr 28 15:17:30 [256898.042815] x3 : 0080000000000400 x2 : 00e0010bbfeb9fd3
Apr 28 15:17:30 [256898.042817] x1 : 00e0010bbfeb9bd3 x0 : 00000000001262c6
Apr 28 15:17:30 [256898.042820] Kernel panic - not syncing: Asynchronous SError Interrupt
Apr 28 15:17:30 [256898.042822] CPU: 81 PID: 24143 Comm: worker Tainted: G        W        4.12.14-lp151.28.44-default #1 openSUSE Leap 15.1
Apr 28 15:17:30 [256898.042823] Hardware name: GIGABYTE R270-T64-00/MT60-SC4-00, BIOS T32 03/03/2017
Apr 28 15:17:30 [256898.042824] Call trace:
Apr 28 15:17:30 [256898.042825]  dump_backtrace+0x0/0x188
Apr 28 15:17:30 [256898.042826]  show_stack+0x24/0x30
Apr 28 15:17:30 [256898.042827]  dump_stack+0x90/0xb0
Apr 28 15:17:30 [256898.042828]  panic+0x114/0x28c
Apr 28 15:17:30 [256898.042828]  nmi_panic+0x7c/0x80
Apr 28 15:17:30 [256898.042829]  arm64_serror_panic+0x80/0x90
Apr 28 15:17:30 [256898.042830]  __pte_error+0x0/0x50
Apr 28 15:17:30 [256898.042831]  el1_error+0x7c/0xdc
Apr 28 15:17:30 [256898.042832]  ptep_set_access_flags+0xc0/0x100
Apr 28 15:17:30 [256898.042833]  __handle_mm_fault+0x204/0x500
Apr 28 15:17:30 [256898.042834]  handle_mm_fault+0xd4/0x178
Apr 28 15:17:30 [256898.042835]  do_page_fault+0x1a0/0x448
Apr 28 15:17:30 [256898.042836]  do_mem_abort+0x54/0xb0
Apr 28 15:17:30 [256898.042837]  el0_da+0x24/0x28
Apr 28 15:17:30 [256899.126203] SError Interrupt on CPU6, code 0xbe000000 -- SError
Apr 28 15:17:30 [256899.126205] CPU: 6 PID: 19627 Comm: /usr/bin/isotov Tainted: G        W        4.12.14-lp151.28.44-default #1 openSUSE Leap 15.1
Apr 28 15:17:30 [256899.126207] Hardware name: GIGABYTE R270-T64-00/MT60-SC4-00, BIOS T32 03/03/2017
Apr 28 15:17:30 [256899.126208] task: ffff810cbfc24080 task.stack: ffff810cc9564000
Apr 28 15:17:30 [256899.126209] pstate: 60000005 (nZCv daif -PAN -UAO)
Apr 28 15:17:30 [256899.126210] pc : ptep_clear_flush+0x88/0xd8
Apr 28 15:17:30 [256899.126211] lr : wp_page_copy+0x298/0x6c0
Apr 28 15:17:30 [256899.126212] sp : ffff810cc9567be0
Apr 28 15:17:30 [256899.126213] x29: ffff810cc9567be0 x28: 0000aaaafc866000
Apr 28 15:17:30 [256899.126216] x27: ffff810cb93f8330 x26: ffff810cc1142c00
Apr 28 15:17:30 [256899.126218] x25: 0000aaaafc866000 x24: ffff810cc1142c00
Apr 28 15:17:30 [256899.126220] x23: 00e801048f9ccf53 x22: ffff810f56d2a708
Apr 28 15:17:30 [256899.126223] x21: ffff810f56d2a708 x20: 0000000aaaafc866
Apr 28 15:17:30 [256899.126225] x19: ffff810cb93f8330 x18: 0000000000000000
Apr 28 15:17:30 [256899.126228] x17: 0000aaaafc85f420 x16: 0000aaaafc8693c0
Apr 28 15:17:30 [256899.126230] x15: 4000000b00000001 x14: 0000aaaafc863880
Apr 28 15:17:30 [256899.126232] x13: 0000000000000032 x12: 0000110100000001
Apr 28 15:17:30 [256899.126235] x11: 0000aaaafc866fb8 x10: 0000aaaafc867fd0
Apr 28 15:17:30 [256899.126237] x9 : 1000440300000001 x8 : 0000aaaafc85f440
Apr 28 15:17:30 [256899.126239] x7 : 0000000000000029 x6 : 0000110100000001
Apr 28 15:17:30 [256899.126242] x5 : 0088000000000000 x4 : 00e0000e8ed2ff53
Apr 28 15:17:30 [256899.126244] x3 : 00e001048f9ccfd3 x2 : 0000000000000000
Apr 28 15:17:30 [256899.126247] x1 : 3804000aaaafc866 x0 : 00e0000e8ed2ffd3
Apr 28 15:17:30 [256899.126271] SMP: stopping secondary CPUs
Apr 28 15:17:30 [256899.126272] SMP: failed to stop secondary CPUs 1,6,48,81,84-85
Apr 28 15:17:30 [256899.126273] Kernel Offset: disabled
Apr 28 15:17:30 [256899.126274] CPU features: 0x01,04101128
Apr 28 15:17:30 [256899.126275] Memory Limit: none
Apr 28 15:17:30 [256899.126277] SError Interrupt on CPU85, code 0xbe000000 -- SError
Apr 28 15:17:30 [256899.126279] CPU: 85 PID: 23855 Comm: worker Tainted: G        W        4.12.14-lp151.28.44-default #1 openSUSE Leap 15.1
Apr 28 15:17:30 [256899.126280] Hardware name: GIGABYTE R270-T64-00/MT60-SC4-00, BIOS T32 03/03/2017
Apr 28 15:17:30 [256899.126281] task: ffff810cc9f0a180 task.stack: ffff810cc99b4000
Apr 28 15:17:30 [256899.126282] pstate: 60000005 (nZCv daif -PAN -UAO)
Apr 28 15:17:30 [256899.126283] pc : ptep_set_access_flags+0xc0/0x100
Apr 28 15:17:30 [256899.126284] lr : 0xe0000a0c000fd1
Apr 28 15:17:30 [256899.126285] sp : ffff810cc99b7c30
Apr 28 15:17:30 [256899.126286] x29: ffff810cc99b7c30 x28: ffff810cc9f0a180
Apr 28 15:17:30 [256899.126288] x27: ffff000008d32000 x26: 0000000000000008
Apr 28 15:17:30 [256899.126291] x25: ffff7e0000000000 x24: ffff810f6b56ced8
Apr 28 15:17:30 [256899.126293] x23: ffff00000933a000 x22: ffff810f6b56ced8
Apr 28 15:17:30 [256899.126295] x21: 5f3e000aaab2f400 x20: 00e8000a0c000f51
Apr 28 15:17:30 [256899.126298] x19: ffff810cdab62bd0 x18: 0000000000000000
Apr 28 15:17:30 [256899.126300] x17: 0000ffffafddac80 x16: 0000aaaae4c21020
Apr 28 15:17:30 [256899.126302] x15: 000089a652d558cc x14: 0000000000000000
Apr 28 15:17:30 [256899.126305] x13: 0d4b4f2030303220 x12: 312e312f50545448
Apr 28 15:17:30 [256899.126307] x11: 0000aaab233d1eb0 x10: 00000000c6cf1b21
Apr 28 15:17:30 [256899.126309] x9 : 0000000000000009 x8 : 0000aaab24a44d00
Apr 28 15:17:31 [256899.126311] x7 : 41203832202c6575 x6 : 54203a657461440a
Apr 28 15:17:31 [256899.126314] x5 : 0000000000100073 x4 : 00e0000a0c000f51
Apr 28 15:17:31 [256899.126316] x3 : 0088000000000480 x2 : 00e8000a0c000f51
Apr 28 15:17:31 [256899.126318] x1 : 00e0000a0c000fd1 x0 : 0000000000125f3e
Apr 28 15:17:31 [256899.725248] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt
```
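For triage, the SError lines in a console log like the one above can be summarized with a short script. The following is a minimal sketch and not part of the original report. It assumes the log format shown above and assumes that the printed `code` is the ESR_ELx syndrome value; the function names `decode_esr` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... SError Interrupt on CPU81, code 0xbe000000 -- SError"
SERROR_RE = re.compile(r"SError Interrupt on CPU(\d+), code (0x[0-9a-fA-F]+)")
# Matches the program counter line, e.g. "pc : ptep_set_access_flags+0xc0/0x100"
PC_RE = re.compile(r"\bpc : (\S+?)\+0x")


def decode_esr(code: int) -> dict:
    """Split a 32-bit syndrome value into EC, IL and ISS fields.

    Assumption: the value follows the ESR_ELx layout of the Arm architecture.
    """
    return {
        "ec": (code >> 26) & 0x3F,  # exception class, bits [31:26]
        "il": (code >> 25) & 0x1,   # instruction length bit, bit [25]
        "iss": code & 0x1FFFFFF,    # instruction specific syndrome, bits [24:0]
    }


def summarize(log_text: str) -> dict:
    """Collect affected CPUs, distinct codes and PC symbols from a log."""
    matches = list(SERROR_RE.finditer(log_text))
    return {
        "cpus": [int(m.group(1)) for m in matches],
        "codes": {int(m.group(2), 16) for m in matches},
        "pcs": Counter(m.group(1) for m in PC_RE.finditer(log_text)),
    }
```

For the value 0xbe000000 this yields EC 0x2f, which the Arm architecture assigns to SError interrupts, with the IL bit set and an all-zero ISS field. Whether the same PC symbols recur across crashes can then be checked by running `summarize` over each captured log.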


## Reproducible

Happens sporadically; it has occurred multiple times, every couple of days or weeks.
Two other, very similar machines running kernel 5.6.0 also crash randomly, but this error report has never been seen on them.


## Further details

Internal issue: https://progress.opensuse.org/issues/41882
Comment 1 Richard Fan 2020-06-17 03:53:59 UTC
Hello Oliver,

I am adding some core dump messages regarding the "isotovideo" process here.

As we discussed in another mail thread before, I can see lots of core dump messages in the journal log, like the one below:

-----

Jun 16 21:29:14 openqaworker-arm-3 systemd-coredump[4258]: Process 95237 (/usr/bin/isotov) of user 481 dumped core.

                                                           Stack trace of thread 96054:
                                                           #0  0x0000aaaac704fb20 Perl_csighandler (/usr/bin/perl)
-----

I can see 82 core dump files generated during the past 3-4 days.
#pwd
/var/lib/systemd/coredump
#ls |grep isotov|wc
     82      82    7704
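To see whether the core dumps cluster on particular days, the count above can be broken down by file modification date. This is only a sketch under assumptions: the helper name `coredumps_per_day` is hypothetical, the directory is the one shown above, and the modification time of each file is taken as the time of the dump.

```python
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def coredumps_per_day(directory: str, needle: str = "isotov") -> Counter:
    """Count files whose name contains `needle`, grouped by UTC mtime date."""
    per_day = Counter()
    for entry in Path(directory).iterdir():
        # Same filter as `ls | grep isotov`, restricted to regular files.
        if entry.is_file() and needle in entry.name:
            mtime = datetime.fromtimestamp(entry.stat().st_mtime, tz=timezone.utc)
            per_day[mtime.date().isoformat()] += 1
    return per_day


if __name__ == "__main__":
    for day, count in sorted(coredumps_per_day("/var/lib/systemd/coredump").items()):
        print(day, count)
```

Comparing these dates with the timestamps of the system crashes might show whether the two are correlated.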

I don't know whether the core dumps cause the CPU to get stuck or not. However, hopefully the messages can provide some help.

BR//Richard.
Comment 2 Oliver Kurz 2020-06-17 06:41:19 UTC
The isotovideo crashes are also described in https://progress.opensuse.org/issues/53999. At least we know that not every isotovideo crash leads to a system crash, but that does not rule out the possibility that complete system crashes are still occasionally triggered by this.
Comment 3 Matthias Brugger 2020-06-17 11:54:53 UTC
I think we would need a memory dump through kdump to be able to understand what is happening. Can you provide that for the next crash you see?
Comment 4 Oliver Kurz 2020-06-20 08:57:32 UTC
Well, unfortunately automatic kdump is enabled but fails to record any usable information. In many cases the system just crashes without any information at all. Only occasionally am I lucky enough to get the report over IPMI SoL recorded in a text file. I therefore do not see how I can provide that information.
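Before the next crash, it may be worth confirming that the basic kdump prerequisites are in place on the running system. The sketch below is an assumption-laden starting point, not a diagnosis: it only reads `/proc/cmdline`, `/sys/kernel/kexec_crash_loaded` and `/sys/kernel/kexec_crash_size`, and cannot tell why a dump fails after an SError panic. The helper names and the `root` parameter are hypothetical and exist only to make the check testable.

```python
from pathlib import Path


def read_text(path: Path):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def kdump_status(root: str = "/") -> dict:
    """Report whether a crash kernel is requested, loaded and has memory."""
    base = Path(root)
    cmdline = read_text(base / "proc/cmdline") or ""
    loaded = read_text(base / "sys/kernel/kexec_crash_loaded")
    size = read_text(base / "sys/kernel/kexec_crash_size")
    return {
        # crashkernel=... must be on the kernel command line to reserve memory
        "crashkernel_param": any(
            tok.startswith("crashkernel=") for tok in cmdline.split()
        ),
        # "1" means a crash kernel image is currently loaded via kexec
        "crash_kernel_loaded": loaded == "1",
        # number of bytes reserved for the crash kernel, if readable
        "reserved_bytes": int(size) if size and size.isdigit() else None,
    }


if __name__ == "__main__":
    print(kdump_status())
```

If all three values look sane and dumps are still missing, the problem is more likely in the crash path itself than in the kdump configuration.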