Bug 1173870

Summary: Internal error: synchronous parity or ECC error - while building on obs-arm-2/3 (ThunderX)
Product: [openSUSE] openSUSE Distribution
Reporter: Guillaume GARDET <guillaume.gardet>
Component: Kernel
Assignee: openSUSE Kernel Bugs <kernel-bugs>
Status: RESOLVED WORKSFORME
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: adrian.schroeter, afaerber, bpetkov, dmueller, mbenes, ro
Version: Leap 15.2
Target Milestone: ---
Hardware: aarch64
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Guillaume GARDET 2020-07-08 06:40:10 UTC
I noticed a build failure on the obs-arm-3 machine (ThunderX1) with a 'synchronous parity or ECC error' while building a kernel-preempt package.

Kernel traces:

[11412s] [11360.577260] Internal error: synchronous parity or ECC error: 96000018 [#1] SMP
[11412s] [11360.585784] Modules linked in: qemu_fw_cfg e1000 sd_mod nls_iso8859_1 nls_cp437 vfat fat virtio_rng virtio_blk virtio_mmio xfs btrfs xor xor_neon zlib_deflate raid6_pq libcrc32c reiserfs ext4 mbcache jbd2 squashfs lz4_decompress fuse dm_snapshot dm_bufio dm_crypt dm_mod binfmt_misc loop sg scsi_mod
[11412s] [11360.619330] CPU: 4 PID: 26711 Comm: cc1 Not tainted 5.3.18-lp152.84-default #1 openSUSE Leap 15.2 (unreleased)
[11412s]   CC [M]  drivers/net/ethernet/neterion/vxge/vxge-config.o
[11412s] [11360.635110] Hardware name: linux,dummy-virt (DT)
[11412s] [11360.655870] pstate: 80000005 (Nzcv daif -PAN -UAO)
[11412s] [11360.659090] pc : __arch_copy_to_user+0x10c/0x220
[11412s] [11360.662182] lr : cp_new_stat+0x140/0x178
[11412s] [11360.672133] sp : ffff000017943cd0
[11412s] [11360.675914] x29: ffff000017943cd0 x28: ffff8000f99d9e80 
[11412s] [11360.683608] x27: 0000000000000000 x26: 0000000000000000 
[11412s] [11360.688386] x25: 0000000056000000 x24: 0000000000000015 
[11412s] [11360.698666] x23: 0000000000000000 x22: 0000000017943d18 
[11412s] [11360.722256] x21: ffff8000f99d9e80 x20: ffff000011679000 
[11412s] [11360.726995] x19: ffff000017943dc0 x18: 0000000000000000 
[11412s] [11360.731753] x17: 0000000000000000 x16: 0000000000000000 
[11412s] [11360.736488] x15: 0000000000000000 x14: 0000000000000000 
[11412s] [11360.741256] x13: 0000000000000000 x12: 0000000000000000 
[11412s] [11360.745976] x11: 0000000000000000 x10: 00000000000000eb 
[11412s] [11360.750741] x9 : 000000005f003b39 x8 : 00000001000081a4 
[11412s] [11360.755451] x7 : 00000000000a42b5 x6 : 0000000017943d30 
[11412s] [11360.760184] x5 : 0000000017943d98 x4 : 0000000000000008 
[11412s] [11360.764885] x3 : 000000000000fd10 x2 : fffffffffffffff8 
[11412s] [11360.769629] x1 : ffff000017943d20 x0 : 0000000017943d18 
[11412s] [11360.774332] Call trace:
[11412s] [11360.776631]  __arch_copy_to_user+0x10c/0x220
[11412s] [11360.780498]  __se_sys_newfstat+0x58/0x88
[11412s] [11360.784049]  __arm64_sys_newfstat+0x20/0x30
[11412s] [11360.788029]  el0_svc_common.constprop.0+0xa0/0x1f8
[11412s] [11360.792513]  el0_svc_handler+0x34/0x90
[11412s] [11360.796102]  el0_svc+0x10/0x14
[11412s] [11360.799389] Code: a8c12027 a88120c7 d503201f d503201f (a8c12829) 
[11412s] [11360.804980] ---[ end trace 929f196ebb48dbcd ]---
[11412s]   CC [M]  drivers/net/ethernet/neterion/vxge/vxge-ethtool.o
[11412s] gcc: internal compiler error: Segmentation fault (program cc1)
[11412s] Please submit a full bug report,
[11412s] with preprocessed source if appropriate.
[11412s] See <https://bugs.opensuse.org/> for instructions.
[11412s] make[6]: *** [../scripts/Makefile.build:282: drivers/net/ethernet/neterion/vxge/vxge-ethtool.o] Error 4
[11412s] make[6]: *** Waiting for unfinished jobs....
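
Note: the hexadecimal value 96000018 in the "Internal error" line is the exception syndrome (ESR) reported by the guest kernel. The following is a minimal Python sketch for decoding it, assuming the ESR_ELx field layout from the Arm architecture documentation (EC in bits [31:26], IL in bit 25, ISS in bits [24:0]; for data aborts, WnR is ISS bit 6 and DFSC is ISS bits [5:0]). The function name and the two small lookup tables are illustrative only and cover just the codes relevant to this trace.

```python
# Hypothetical helper: decode an arm64 ESR value such as the one in the oops.
# Only a handful of EC/DFSC encodings are listed; anything else is "unknown".

EC_NAMES = {
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}

DFSC_NAMES = {
    0x10: "Synchronous External abort, not on translation table walk",
    0x18: "Synchronous parity or ECC error on memory access, "
          "not on translation table walk",
}


def decode_esr(esr):
    """Split a 32-bit ESR value into the fields relevant for a data abort."""
    ec = (esr >> 26) & 0x3F    # exception class, bits [31:26]
    il = (esr >> 25) & 0x1     # instruction length bit, bit 25
    iss = esr & 0x1FFFFFF      # instruction specific syndrome, bits [24:0]
    wnr = (iss >> 6) & 0x1     # write-not-read (data aborts only)
    dfsc = iss & 0x3F          # data fault status code (data aborts only)
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "unknown"),
        "il": il,
        "wnr": wnr,
        "dfsc": dfsc,
        "dfsc_name": DFSC_NAMES.get(dfsc, "unknown"),
    }


if __name__ == "__main__":
    for key, value in decode_esr(0x96000018).items():
        print(f"{key}: {value}")
```

Under these assumptions the value decodes to EC 0x25 (a data abort taken inside the guest kernel itself), WnR 0 (a read) and DFSC 0x18, which matches the "synchronous parity or ECC error" text printed in the trace.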
Comment 1 Dirk Mueller 2020-07-08 10:35:04 UTC
when was that build log? date/timestamp?

obs-arm-3 # uptime
 10:34:41  up 1 day  4:57,  1 user,  load average: 49.68, 58.70, 61.89

There is no such message on the host system, so this is odd.
Comment 2 Guillaume GARDET 2020-07-08 11:06:09 UTC
(In reply to Dirk Mueller from comment #1)
> when was that build log? date/timestamp?

4th of July.
Comment 3 Dirk Mueller 2020-07-08 11:10:07 UTC
OK, let's see if it happens again.
Comment 4 Guillaume GARDET 2020-07-24 07:50:58 UTC
Happened on obs-arm-2 (ThunderX1) yesterday.
Comment 5 Dirk Mueller 2020-07-27 09:27:48 UTC
do you have the exact timestamp?

On the host things look normal, although the NVMe timeouts are weird:

[So Jul 26 14:15:59 2020] nvme nvme0: I/O 407 QID 7 timeout, completion polled
[So Jul 26 14:51:58 2020] nvme nvme0: I/O 726 QID 14 timeout, completion polled
[So Jul 26 20:11:21 2020] nvme nvme0: I/O 115 QID 18 timeout, completion polled
[So Jul 26 20:12:01 2020] nvme nvme0: I/O 798 QID 4 timeout, completion polled
[So Jul 26 20:21:10 2020] nvme nvme0: I/O 673 QID 25 timeout, completion polled
[So Jul 26 20:26:30 2020] nvme nvme0: I/O 365 QID 11 timeout, completion polled
[So Jul 26 20:56:37 2020] nvme nvme0: I/O 738 QID 16 timeout, completion polled
[So Jul 26 20:59:03 2020] nvme nvme0: I/O 90 QID 20 timeout, completion polled
[So Jul 26 20:59:53 2020] nvme nvme0: I/O 315 QID 19 timeout, completion polled
[So Jul 26 21:01:35 2020] nvme nvme0: I/O 897 QID 29 timeout, completion polled
[So Jul 26 21:02:29 2020] nvme nvme0: I/O 745 QID 17 timeout, completion polled
[So Jul 26 21:03:11 2020] nvme nvme0: I/O 489 QID 18 timeout, completion polled
[So Jul 26 21:03:16 2020] nvme nvme0: I/O 826 QID 20 timeout, completion polled
[So Jul 26 21:14:30 2020] nvme nvme0: I/O 901 QID 25 timeout, completion polled
[So Jul 26 21:16:43 2020] nvme nvme0: I/O 932 QID 26 timeout, completion polled
[So Jul 26 21:17:22 2020] nvme nvme0: I/O 153 QID 26 timeout, completion polled
[So Jul 26 21:18:19 2020] nvme nvme0: I/O 341 QID 31 timeout, completion polled
[So Jul 26 21:23:53 2020] nvme nvme0: I/O 476 QID 21 timeout, completion polled
[So Jul 26 21:43:30 2020] nvme nvme0: I/O 264 QID 8 timeout, completion polled
[So Jul 26 21:45:17 2020] nvme nvme0: I/O 731 QID 31 timeout, completion polled
[So Jul 26 21:58:37 2020] nvme nvme0: I/O 393 QID 3 timeout, completion polled
[So Jul 26 21:59:12 2020] nvme nvme0: I/O 537 QID 2 timeout, completion polled
[So Jul 26 22:04:07 2020] nvme nvme0: I/O 281 QID 15 timeout, completion polled
[So Jul 26 22:06:56 2020] nvme nvme0: I/O 414 QID 23 timeout, completion polled
[So Jul 26 22:37:24 2020] nvme nvme0: I/O 176 QID 13 timeout, completion polled
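
Note: to check whether timeouts like these cluster around a build failure, the log lines can be grouped by hour and by queue. The following is a rough Python sketch, assuming the lines have exactly the format shown above (as printed by `dmesg -T`). The regular expression and the function name are illustrative, and the weekday abbreviation is skipped rather than parsed because it is locale-dependent ("So" here).

```python
import re
from collections import Counter

# Matches e.g. "[So Jul 26 14:15:59 2020] nvme nvme0: I/O 407 QID 7 timeout, ..."
LINE_RE = re.compile(
    r"\[\w+\s+(\w+\s+\d+)\s+(\d{2}):\d{2}:\d{2}\s+\d{4}\]\s+"
    r"nvme (\w+): I/O (\d+) QID (\d+) timeout"
)


def summarize_timeouts(lines):
    """Count NVMe timeout messages per (day, hour) and per queue ID."""
    per_hour = Counter()
    per_qid = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an NVMe timeout line
        day, hour, _device, _tag, qid = match.groups()
        per_hour[(day, int(hour))] += 1
        per_qid[int(qid)] += 1
    return per_hour, per_qid


# A few lines copied from the host log above.
sample = [
    "[So Jul 26 14:15:59 2020] nvme nvme0: I/O 407 QID 7 timeout, completion polled",
    "[So Jul 26 14:51:58 2020] nvme nvme0: I/O 726 QID 14 timeout, completion polled",
    "[So Jul 26 20:11:21 2020] nvme nvme0: I/O 115 QID 18 timeout, completion polled",
    "[So Jul 26 21:03:11 2020] nvme nvme0: I/O 489 QID 18 timeout, completion polled",
]

if __name__ == "__main__":
    hours, qids = summarize_timeouts(sample)
    for (day, hour), count in sorted(hours.items()):
        print(f"{day} {hour:02d}:00  {count} timeout(s)")
```

Comparing the per-hour counts against the time of a failed build would only show a correlation; it would not by itself establish that the NVMe timeouts and the guest ECC error are related.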
Comment 6 Dirk Mueller 2020-07-27 10:05:32 UTC
obs-arm-2 has been updated (host was Leap 15.1, now is Tumbleweed (TW) 20200721). Let's see if it reappears.
Comment 7 Dirk Mueller 2020-07-27 10:06:22 UTC
(I am somehow assuming that ECC errors would be reported not only in the guest but also in the host. Maybe that assumption isn't correct?)
Comment 8 Guillaume GARDET 2020-07-27 11:54:34 UTC
(In reply to Dirk Mueller from comment #5)
> do you have the exact timestamp?

No, I don't.
Comment 9 Miroslav Beneš 2020-11-19 14:31:21 UTC
Has it happened again since the update to TW?

Seeing an ECC error, Boris comes to mind. Boris, this is mostly an FYI, but it would be nice if you could reply to comment 7. Also note that this is arm64.
Comment 10 Guillaume GARDET 2020-11-19 14:40:07 UTC
I have not noticed it lately.
Comment 11 Borislav Petkov 2020-11-19 17:20:47 UTC
For RAS on arm64 talk to James Morse. On x86 there are certain conditions when host hw errors get injected into the guest for further handling.

Looking at this:

7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")

Apparently arm64 is doing page offlining too, but this commit is in the 5.8 kernel and thus not in 5.3.

Also, there's code like this in arch/arm64/kvm/mmu.c:

        /* Synchronous External Abort? */
        if (kvm_vcpu_abt_issea(vcpu)) {
                /*
                 * For RAS the host kernel may handle this abort.
                 * There is no need to pass the error into the guest.
                 */
                if (kvm_handle_guest_sea(fault_ipa, kvm_vcpu_get_esr(vcpu)))

but how many of the errors do get reported on the host, I don't know. James most certainly knows more there.

HTH.
Comment 12 Miroslav Beneš 2020-12-23 11:44:50 UTC
Thanks, Boris.

So it seems like upgrading the system to TW may have really helped here. The issue has not appeared again since then, so let's close for now. If it reappears, we will revisit.