Bugzilla – Bug 1202275
Kernel OOPS on filesystem operations
Last modified: 2022-10-27 08:16:22 UTC
This OOPS shows up on the openQA worker "power8" after some uptime. After it occurs for the first time, other similar OOPSes show up frequently and other weird issues appear, like zypper and openQA workers getting stuck in syscalls such as "poll".

Aug 08 19:23:06 power8 openqa-continuous-update[37713]: Loading repository data...
Aug 08 19:23:06 power8 openqa-continuous-update[37713]: Reading installed packages...
Aug 08 19:23:07 power8 kernel: Kernel attempted to read user page (16) - exploit attempt? (uid: 0)
Aug 08 19:23:07 power8 kernel: BUG: Kernel NULL pointer dereference on read at 0x00000016
Aug 08 19:23:07 power8 kernel: Faulting instruction address: 0xc0000000005561dc
Aug 08 19:23:07 power8 kernel: Oops: Kernel access of bad area, sig: 11 [#1]
Aug 08 19:23:07 power8 kernel: LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
Aug 08 19:23:07 power8 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache netfs kvm_hv kvm af_packet rfkill crct10dif_vpmsum ipmi_powernv(X) ipmi_devintf ipmi_msghandler leds_powernv(X) r>
Aug 08 19:23:07 power8 kernel: Supported: No, Unsupported modules are loaded
Aug 08 19:23:07 power8 kernel: CPU: 72 PID: 37713 Comm: Zypp-main Tainted: G W X N 5.14.21-150400.24.11-default #1 SLE15-SP4 5031505b0a65e234cdf253965338bef90a38442d
Aug 08 19:23:07 power8 kernel: NIP: c0000000005561dc LR: c00000000053c9a8 CTR: 0000000000000001
Aug 08 19:23:07 power8 kernel: REGS: c0000010208c3730 TRAP: 0300 Tainted: G W X N (5.14.21-150400.24.11-default)
Aug 08 19:23:07 power8 kernel: MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 84442222 XER: 00000000
Aug 08 19:23:07 power8 kernel: CFAR: c00000000000cb8c DAR: 0000000000000016 DSISR: 40000000 IRQMASK: 0
GPR00: c00000000053c9a8 c0000010208c39d0 c000000002833a00 c000000804bde100
GPR04: c0000010208c3be8 c0000010208c3ae4 0000000000000000 0000000000000000
GPR08: c00000100a97002c 00000000018e0000 c008000000000000 ffffffffffff0000
GPR12: 0000000000002200 c0000017fffdae80 00007fffe0bae620 00007fffe0bae600
GPR16: 00007fffe0bae5d0 00007fffe0bae5f0 c00000100a970020 00007fffe0bae610
GPR20: 0000000000000000 c0000010208c3be8 0000000000000043 c0000010208c3af8
GPR24: ffffffffffffffff 0000000000000000 c0000010208c3be8 0000000000000000
GPR28: c000000804bde100 c000000804bde100 0000000063800036 fffffffffffffffe
Aug 08 19:23:07 power8 kernel: NIP [c0000000005561dc] __d_lookup+0x8c/0x290
Aug 08 19:23:07 power8 kernel: LR [c00000000053c9a8] lookup_fast+0x108/0x240
Aug 08 19:23:07 power8 kernel: Call Trace:
Aug 08 19:23:07 power8 kernel: [c0000010208c39d0] [c0000010208c3a00] 0xc0000010208c3a00 (unreliable)
Aug 08 19:23:07 power8 kernel: [c0000010208c3a40] [c00000000053c9a8] lookup_fast+0x108/0x240
Aug 08 19:23:07 power8 kernel: [c0000010208c3aa0] [c00000000054267c] path_openat+0x25c/0x1330
Aug 08 19:23:07 power8 kernel: [c0000010208c3ba0] [c0000000005456e4] do_filp_open+0xa4/0x130
Aug 08 19:23:07 power8 kernel: [c0000010208c3ce0] [c000000000524418] do_sys_openat2+0x2e8/0x440
Aug 08 19:23:07 power8 kernel: [c0000010208c3d50] [c000000000526198] do_sys_open+0x78/0xc0
Aug 08 19:23:07 power8 kernel: [c0000010208c3db0] [c00000000003269c] system_call_exception+0x15c/0x330
Aug 08 19:23:07 power8 kernel: [c0000010208c3e10] [c00000000000c74c] system_call_common+0xec/0x250
Aug 08 19:23:07 power8 kernel: --- interrupt: c00 at 0x7fffb5429e14
Aug 08 19:23:07 power8 kernel: NIP: 00007fffb5429e14 LR: 00007fffb53ed7ec CTR: 0000000000000000
Aug 08 19:23:07 power8 kernel: REGS: c0000010208c3e80 TRAP: 0c00 Tainted: G W X N (5.14.21-150400.24.11-default)
Aug 08 19:23:07 power8 kernel: MSR: 900000000280f033 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 28444888 XER: 00000000
Aug 08 19:23:07 power8 kernel: IRQMASK: 0
GPR00: 000000000000011e 00007fffe0bae1a0 00007fffb5517200 ffffffffffffff9c
GPR04: 000001001e2b5a40 0000000000084800 0000000000000000 000000006332692f
GPR08: 0000000000004000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fffb3c32450 00007fffe0bae620 00007fffe0bae600
GPR16: 00007fffe0bae5d0 00007fffe0bae5f0 00007fffb62c75a0 00007fffe0bae610
GPR20: 00007fffb62d62e0 00007fffe0bae630 00007fffb62d62d8 0000000000000000
GPR24: 0000000000000000 00007fffe0bae5e0 00007fffe0bae640 00007fffe0bae5d0
GPR28: 000001001e41e85b 000001001e2b5de0 000001001e41e848 000001001e2b5e50
Aug 08 19:23:07 power8 kernel: NIP [00007fffb5429e14] 0x7fffb5429e14
Aug 08 19:23:07 power8 kernel: LR [00007fffb53ed7ec] 0x7fffb53ed7ec
Aug 08 19:23:07 power8 kernel: --- interrupt: c00
Aug 08 19:23:07 power8 kernel: Instruction dump:
Aug 08 19:23:07 power8 kernel: fb610048 fb810050 fba10058 7c9a2378 7c7c1b78 3b600000 3b200000 3b00ffff
Aug 08 19:23:07 power8 kernel: 48000010 ebff0000 2fbf0000 419e0060 <813f0018> 7f89f000 409effec 3bbf0050
Aug 08 19:23:07 power8 kernel: ---[ end trace f3fc0069d5e8e587 ]---
Not sure whether it's an arch-specific issue. The stack trace doesn't prove it, but it looks like a generic fs issue. Let's toss it to the filesystem people to take a first look.
Created attachment 860730 [details]
Journal

Apparently the full journal I attached was silently dropped, probably because of its size. I attached the first part, up to the third OOPS message.
After upgrading to Leap 15.4 we've also seen other problems on power machines, see https://progress.opensuse.org/issues/114565 and https://bugzilla.opensuse.org/show_bug.cgi?id=1202138 and https://bugzilla.suse.com/show_bug.cgi?id=1201796. Of course these are all different symptoms and likely also have different causes. However, it still strikes me that we see so many problems on power machines (with Leap 15.4).
What are ipmi_powernv and leds_powernv? How were they loaded, and why are they not supported? Can this be reproduced without these modules? I am trying to eliminate memory corruption caused by unsupported modules before we focus on VFS issues.
(In reply to Goldwyn Rodrigues from comment #4)
> What is ipmi_powernv and leds_powernv? How were they loaded and why are they
> not supported? Can this be reproduced without these modules?
>
> I am trying to eliminate memory corruption issues because of unsupported
> modules before we focus on VFS issues.

They're part of kernel-default, but marked as "supported: external".
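To test whether the OOPS reproduces without these modules, they could be blacklisted via a modprobe.d fragment; a sketch (the file name is arbitrary):

```
# /etc/modprobe.d/99-no-powernv-test.conf  (hypothetical file name)
# Keep the externally-supported modules from being autoloaded on next boot:
blacklist ipmi_powernv
blacklist leds_powernv
# "blacklist" only stops alias-based autoloading; to also block explicit
# loads, lines like the following could be added:
#install ipmi_powernv /bin/false
#install leds_powernv /bin/false
```

After a reboot, `lsmod` should no longer list them and the "Supported: No" taint should be gone.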
After the machine got stuck and had to be rebooted a couple of times again, I downgraded to the latest kernel from 15.3. Let's see how that goes.
(In reply to Fabian Vogt from comment #6)
> After the machine got stuck and had to be rebooted a couple of times again,
> I downgraded to the latest kernel from 15.3. Let's see how that goes.

The machine has an uptime of >5 days now without any issues. So it appears this is an issue with the 15.4 kernel.
Could you please take a kdump? It would help analyze the issue better.
(In reply to Goldwyn Rodrigues from comment #8)
> Could you please take a kdump? It would help analyze the issue better.

I configured kdump and the service was running properly after a reboot. I triggered a crash to test it, but that didn't even print the "Starting crashkernel" message and, even worse, IPMI is dead now. Fun.
(In reply to Fabian Vogt from comment #9)
> (In reply to Goldwyn Rodrigues from comment #8)
> > Could you please take a kdump? It would help analyze the issue better.
>
> I configured kdump and the service was running properly after a reboot. I
> triggered a crash to test it, but that didn't even print the "Starting
> crashkernel" message and even worse, even IPMI is dead now. Fun.

IPMI came back after a few minutes, so I could "mc reset cold" and try again. Same result though, so kdump just doesn't want to work. Anything else I can do?
(In reply to Fabian Vogt from comment #10)
> (In reply to Fabian Vogt from comment #9)
> > (In reply to Goldwyn Rodrigues from comment #8)
> > > Could you please take a kdump? It would help analyze the issue better.
> >
> > I configured kdump and the service was running properly after a reboot. I
> > triggered a crash to test it, but that didn't even print the "Starting
> > crashkernel" message and even worse, even IPMI is dead now. Fun.
>
> IPMI came back after a few minutes, so I could "mc reset cold" and try
> again. Same result though, so kdump just doesn't want to work. Anything else
> I can do?

Does kdump work if you trigger it manually? e.g.

echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

If kdump isn't an option then we could try to cobble something together with kprobes / ftrace or a custom vfs-debug kernel, but it'd be nice to get confirmation first.

(In reply to Fabian Vogt from comment #2)
> Created attachment 860730 [details]
> Journal

Aug 08 15:18:10 localhost kernel: opal: OPAL_CONSOLE_FLUSH missing.
Aug 08 15:18:10 localhost kernel: WARNING: CPU: 11 PID: 1475 at ../arch/powerpc/platforms/powernv/opal.c:528 __opal_flush_console+0xfc/0x110

523         /*
524          * If OPAL_CONSOLE_FLUSH is not implemented in the firmware,
525          * the console can still be flushed by calling the polling
526          * function while it has OPAL_EVENT_CONSOLE_OUTPUT events.
527          */
528         WARN_ONCE(1, "opal: OPAL_CONSOLE_FLUSH missing.\n");

Old firmware? Please try to upgrade to the latest IBM firmware if possible.
(In reply to David Disseldorp from comment #11)
> (In reply to Fabian Vogt from comment #10)
> > (In reply to Fabian Vogt from comment #9)
> > > (In reply to Goldwyn Rodrigues from comment #8)
> > > > Could you please take a kdump? It would help analyze the issue better.
> > >
> > > I configured kdump and the service was running properly after a reboot. I
> > > triggered a crash to test it, but that didn't even print the "Starting
> > > crashkernel" message and even worse, even IPMI is dead now. Fun.
> >
> > IPMI came back after a few minutes, so I could "mc reset cold" and try
> > again. Same result though, so kdump just doesn't want to work. Anything else
> > I can do?
>
> Does kdump work if you trigger it manually? e.g.
>
> echo 1 > /proc/sys/kernel/sysrq
> echo c > /proc/sysrq-trigger

That's what I tried:

power8:~ # echo c > /proc/sysrq-trigger
[ 233.397216][ T6857] sysrq: Trigger a crash
[ 233.397251][ T6857] Kernel panic - not syncing: sysrq triggered crash
[ 233.397270][ T6857] CPU: 104 PID: 6857 Comm: bash Tainted: G W X N 5.14.21-150400.24.18-default #1 SLE15-SP4 a5d3db5b7f5fbb29c4ee73b4cdefcad058b71f7f
[ 233.397294][ T6857] Call Trace:
[ 233.397310][ T6857] [c00000181de83ab0] [c00000000086755c] dump_stack_lvl+0x70/0xa4 (unreliable)
[ 233.397337][ T6857] [c00000181de83af0] [c000000000159724] panic+0x164/0x400
[ 233.397357][ T6857] [c00000181de83b80] [c00000000095f230] sysrq_handle_crash+0x30/0x40
[ 233.397378][ T6857] [c00000181de83be0] [c00000000095fb50] __handle_sysrq+0xf0/0x2a0
[ 233.397397][ T6857] [c00000181de83c90] [c000000000960428] write_sysrq_trigger+0xd8/0x190
[ 233.397416][ T6857] [c00000181de83cd0] [c000000000624af8] proc_reg_write+0x108/0x1b0
[ 233.397436][ T6857] [c00000181de83d00] [c000000000529e10] vfs_write+0xf0/0x340
[ 233.397456][ T6857] [c00000181de83d60] [c00000000052a28c] ksys_write+0xdc/0x130
[ 233.397474][ T6857] [c00000181de83db0] [c00000000003269c] system_call_exception+0x15c/0x330
[ 233.397494][ T6857] [c00000181de83e10] [c00000000000c74c] system_call_common+0xec/0x250
[ 233.397514][ T6857] --- interrupt: c00 at 0x7fffb7a03114
[ 233.397532][ T6857] NIP: 00007fffb7a03114 LR: 00007fffb7978fe4 CTR: 0000000000000000
[ 233.397550][ T6857] REGS: c00000181de83e80 TRAP: 0c00 Tainted: G W X N (5.14.21-150400.24.18-default)
[ 233.397568][ T6857] MSR: 900000000280f033 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 22242222 XER: 00000000
[ 233.397595][ T6857] IRQMASK: 0
[ 233.397595][ T6857] GPR00: 0000000000000004 00007fffe857f0a0 00007fffb7af7200 0000000000000001
[ 233.397595][ T6857] GPR04: 0000010028618830 0000000000000002 0000000000000010 0000000000000000
[ 233.397595][ T6857] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 233.397595][ T6857] GPR12: 0000000000000000 00007fffb7c5b600 0000000000000000 000000011de337c0
[ 233.397595][ T6857] GPR16: 000000011dd97ed0 0000000000000000 000000011ddddd10 0000000000000000
[ 233.397595][ T6857] GPR20: 000000011ddedb28 0000010028773460 000000011de37828 000000011de36c70
[ 233.397595][ T6857] GPR24: 0000010028518030 0000000000000000 0000000000000002 0000010028618830
[ 233.397595][ T6857] GPR28: 0000000000000002 00007fffb7af16d8 0000010028618830 0000000000000002
[ 233.397670][ T6857] NIP [00007fffb7a03114] 0x7fffb7a03114
[ 233.397686][ T6857] LR [00007fffb7978fe4] 0x7fffb7978fe4
[ 233.397702][ T6857] --- interrupt: c00
(stuck here)
~. [terminated ipmitool]
(IPMI broke)
^C
SIGN INT: Close Interface IPMI v2.0 RMCP+ LAN Interface
^C^C^C^C
Close Session command failed

As the crash is apparently so severe that it upsets the BMC, I'm not hopeful here.

> If kdump isn't an option then we could try to cobble something together with
> kprobes / ftrace or a custom vfs-debug kernel, but it'd be nice to get
> confirmation first.
>
> (In reply to Fabian Vogt from comment #2)
> > Created attachment 860730 [details]
> > Journal
>
> Aug 08 15:18:10 localhost kernel: opal: OPAL_CONSOLE_FLUSH missing.
> Aug 08 15:18:10 localhost kernel: WARNING: CPU: 11 PID: 1475 at
> ../arch/powerpc/platforms/powernv/opal.c:528 __opal_flush_console+0xfc/0x110
>
> 523         /*
> 524          * If OPAL_CONSOLE_FLUSH is not implemented in the firmware,
> 525          * the console can still be flushed by calling the polling
> 526          * function while it has OPAL_EVENT_CONSOLE_OUTPUT events.
> 527          */
> 528         WARN_ONCE(1, "opal: OPAL_CONSOLE_FLUSH missing.\n");
>
> Old firmware? Please try to upgrade to the latest IBM firmware if possible.

I don't really know anything about this system, so I'll leave that to someone else.

@Marius: Could you attempt a FW update (if a newer version is available) or know someone who could?
(In reply to Fabian Vogt from comment #12)
> (In reply to David Disseldorp from comment #11)
> > (In reply to Fabian Vogt from comment #10)
> > > (In reply to Fabian Vogt from comment #9)
> > > > (In reply to Goldwyn Rodrigues from comment #8)
> > > > > Could you please take a kdump? It would help analyze the issue better.
> > > >
> > > > I configured kdump and the service was running properly after a reboot. I
> > > > triggered a crash to test it, but that didn't even print the "Starting
> > > > crashkernel" message and even worse, even IPMI is dead now. Fun.
> > >
> > > IPMI came back after a few minutes, so I could "mc reset cold" and try
> > > again. Same result though, so kdump just doesn't want to work. Anything else
> > > I can do?
> >
> > Does kdump work if you trigger it manually? e.g.
> >
> > echo 1 > /proc/sys/kernel/sysrq
> > echo c > /proc/sysrq-trigger
>
> That's what I tried:
>
> power8:~ # echo c > /proc/sysrq-trigger
> [ 233.397216][ T6857] sysrq: Trigger a crash
> [ 233.397251][ T6857] Kernel panic - not syncing: sysrq triggered crash
> [ 233.397270][ T6857] CPU: 104 PID: 6857 Comm: bash Tainted: G W X N 5.14.21-150400.24.18-default #1 SLE15-SP4 a5d3db5b7f5fbb29c4ee73b4cdefcad058b71f7f
> [ 233.397294][ T6857] Call Trace:
> [ 233.397310][ T6857] [c00000181de83ab0] [c00000000086755c] dump_stack_lvl+0x70/0xa4 (unreliable)
> [ 233.397337][ T6857] [c00000181de83af0] [c000000000159724] panic+0x164/0x400
> [ 233.397357][ T6857] [c00000181de83b80] [c00000000095f230] sysrq_handle_crash+0x30/0x40
> [ 233.397378][ T6857] [c00000181de83be0] [c00000000095fb50] __handle_sysrq+0xf0/0x2a0
> [ 233.397397][ T6857] [c00000181de83c90] [c000000000960428] write_sysrq_trigger+0xd8/0x190
> [ 233.397416][ T6857] [c00000181de83cd0] [c000000000624af8] proc_reg_write+0x108/0x1b0
> [ 233.397436][ T6857] [c00000181de83d00] [c000000000529e10] vfs_write+0xf0/0x340
> [ 233.397456][ T6857] [c00000181de83d60] [c00000000052a28c] ksys_write+0xdc/0x130
> [ 233.397474][ T6857] [c00000181de83db0] [c00000000003269c] system_call_exception+0x15c/0x330
> [ 233.397494][ T6857] [c00000181de83e10] [c00000000000c74c] system_call_common+0xec/0x250
> [ 233.397514][ T6857] --- interrupt: c00 at 0x7fffb7a03114
> [ 233.397532][ T6857] NIP: 00007fffb7a03114 LR: 00007fffb7978fe4 CTR: 0000000000000000
> [ 233.397550][ T6857] REGS: c00000181de83e80 TRAP: 0c00 Tainted: G W X N (5.14.21-150400.24.18-default)
> [ 233.397568][ T6857] MSR: 900000000280f033 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 22242222 XER: 00000000
> [ 233.397595][ T6857] IRQMASK: 0
> [ 233.397595][ T6857] GPR00: 0000000000000004 00007fffe857f0a0 00007fffb7af7200 0000000000000001
> [ 233.397595][ T6857] GPR04: 0000010028618830 0000000000000002 0000000000000010 0000000000000000
> [ 233.397595][ T6857] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 233.397595][ T6857] GPR12: 0000000000000000 00007fffb7c5b600 0000000000000000 000000011de337c0
> [ 233.397595][ T6857] GPR16: 000000011dd97ed0 0000000000000000 000000011ddddd10 0000000000000000
> [ 233.397595][ T6857] GPR20: 000000011ddedb28 0000010028773460 000000011de37828 000000011de36c70
> [ 233.397595][ T6857] GPR24: 0000010028518030 0000000000000000 0000000000000002 0000010028618830
> [ 233.397595][ T6857] GPR28: 0000000000000002 00007fffb7af16d8 0000010028618830 0000000000000002
> [ 233.397670][ T6857] NIP [00007fffb7a03114] 0x7fffb7a03114
> [ 233.397686][ T6857] LR [00007fffb7978fe4] 0x7fffb7978fe4
> [ 233.397702][ T6857] --- interrupt: c00
> (stuck here)
> ~. [terminated ipmitool]
> (IPMI broke)
> ^C
> SIGN INT: Close Interface IPMI v2.0 RMCP+ LAN Interface
> ^C^C^C^C
> Close Session command failed
>
> As the crash is apparently so severe that it upsets the BMC, I'm not hopeful
> here.

Hmm, it might be worth increasing the memory reservation for the dump-capture kernel, although I wouldn't expect those limits to affect the BMC.

I'd be happy to take a look at 15.4 kdump if someone can provide access to some ppc64le (preferably power8) hardware.
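For reference, raising the reservation would look roughly like this; a sketch with example values (not a measured recommendation), assuming the usual SLE grub2 setup:

```
# Check the current reservation on the running system:
#   cat /sys/kernel/kexec_crash_size
# Raise it via the kernel command line in /etc/default/grub, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="... crashkernel=1024M"
# Then regenerate the bootloader configuration and reboot:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

If the capture kernel was running out of memory it typically hangs or OOMs after "Starting crashkernel", so this is a long shot given that message never appeared.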
I was on vacation. I can attempt a firmware update today. If you're SUSE employees you can ping me to get access to the hardware. Not sure whether I can give you access to the o3 worker (as it requires access to the o3 network) but I could give you access to qa-power8-4-kvm.qa.suse.de which is reachable from engineering network. On that machine I've also experienced crashes but haven't got any dumps despite kdump being enabled (see issue https://bugzilla.suse.com/show_bug.cgi?id=1202138#c12 for details).
I've asked on our internal testing channel whether any PowerPC experts can help with upgrading the firmware.

Note that the BMC is reachable via https://openqaworker-power8-ipmi.suse.de (see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/432/diffs for credentials).

I figured we need firmware for 8247-22L from https://www.ibm.com/support/fixcentral/main/selectFixes?parent=ibm~power&product=ibm~power~824722L&release=All&platform=All but I cannot download anything from there without an IBM account. It is also not 100 % clear to me which download is suitable. Currently we have SV840_043 installed, and I suppose upgrading to the latest version, SV860_243, would make the most sense. I'm also not quite sure how I'd actually conduct the firmware upgrade after the download. It doesn't seem possible directly via the BMC interface.
Thanks. I'll try flashing. It seems the tool comes from powerpc-utils which is already installed.
The BMC now shows the new version, but the machine hasn't come up again and IPMI access seems broken. Setting the IPMI password within the BMC doesn't help, and rebooting the machine within the BMC also hasn't helped so far.

This was the output of the firmware update:

```
power8:/tmp/fwupdate # update_flash -f 01SV860_243_165.img
info: Temporary side will be updated with a newer or identical image.
Projected Flash Update Results:
Current T Image: SV840_043
Current P Image: SV840_043
New T Image:     SV860_243
New P Image:     SV840_043
FLASH: Image ready...rebooting the system...
FLASH: This will take several minutes.
FLASH: Do not power off!
Connection to power8 closed by remote host.
Connection to power8 closed.
```
At least the host is back again. I suppose it really just took a while, or did you do something? Maybe I can restore IPMI access now using ipmitool locally.
I haven't seen this issue on any of our other PowerPC hosts. So unfortunately we can likely not investigate the issue any further right now.
(In reply to Marius Kittler from comment #24)
> I haven't seen this issue on any of our other PowerPC hosts. So
> unfortunately we can likely not investigate the issue any further right now.

So let's just be optimistic and close this until it occurs somewhere again.