Bug 1141881

Summary: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000029" on OBS job
Product: [openSUSE] openSUSE Tumbleweed
Reporter: Oliver Kurz <okurz>
Component: Kernel
Assignee: E-mail List <kernel-maintainers>
Status: RESOLVED WORKSFORME
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: adrian.schroeter, jslaby, nfbrown, okurz, rgoldwyn
Version: Current
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: full log file of failed OBS job with the kernel bug stack trace

Description Oliver Kurz 2019-07-17 15:05:57 UTC
Created attachment 810755
full log file of failed OBS job with the kernel bug stack trace

## Observation

https://build.opensuse.org/build/devel:openQA:tested/openSUSE_Tumbleweed/x86_64/openQA/_log shows:

```
[ 1844s] + rm -rf /home/abuild/rpmbuild/BUILDROOT/openQA-4.6.1563206570.e00d3964-220.2.x86_64/DB
[ 1844s] [ 1832.460286] BUG: unable to handle kernel NULL pointer dereference at 0000000000000029
[ 1844s] [ 1832.461661] #PF error: [normal kernel read fault]
[ 1844s] [ 1832.461848] PGD 0 P4D 0 
[ 1844s] [ 1832.461848] Oops: 0000 [#1] SMP NOPTI
[ 1844s] [ 1832.461848] CPU: 7 PID: 7939 Comm: rm Not tainted 5.1.16-1-default #1 openSUSE Tumbleweed (unreleased)
[ 1844s] [ 1832.461848] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
[ 1844s] [ 1832.461848] RIP: 0010:vfs_unlink+0xb3/0x1c0
[ 1844s] [ 1832.461848] Code: bc f0 ff ff ff e8 2d da e4 ff 44 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 f0 83 44 24 fc 00 49 8b 86 68 01 00 00 48 85 c0 74 49 <48> 8b 50 28 48 8d 48 28 48 39 ca 0f 84 de 00 00 00 ba 04 00 00 00
[ 1844s] [ 1832.461848] RSP: 0018:ffffa50c40e1be98 EFLAGS: 00010202
[ 1844s] [ 1832.461848] RAX: 0000000000000001 RBX: ffffa50c40e1bee0 RCX: 000000000000018f
[ 1844s] [ 1832.461848] RDX: 0000000000000000 RSI: ffff978410ebc300 RDI: ffff97842fc6fa88
[ 1844s] [ 1832.461848] RBP: ffff978410ebc300 R08: 0000000000000020 R09: ffffa50c40e1be90
[ 1844s] [ 1832.461848] R10: 0000000000000005 R11: 00007fffffffffff R12: 0000000000000000
[ 1844s] [ 1832.461848] R13: ffff97842fc6fa88 R14: ffff978410ec31a8 R15: ffff978410ec3248
[ 1844s] [ 1832.461848] FS:  00007fa066ffe580(0000) GS:ffff9784b7bc0000(0000) knlGS:0000000000000000
[ 1844s] [ 1832.461848] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1844s] [ 1832.461848] CR2: 0000000000000029 CR3: 0000000231c72000 CR4: 00000000000406e0
[ 1844s] [ 1832.461848] Call Trace:
[ 1844s] [ 1832.461848]  do_unlinkat+0x18b/0x2c0
[ 1844s] [ 1832.461848]  do_syscall_64+0x60/0x120
[ 1844s] [ 1832.461848]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
```

in an OBS job.
Comment 1 Jiri Slaby 2019-08-09 08:00:46 UTC
inode->i_flctx is 1 (RAX) when break_lease() performs this test:
if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease)) 
        return __break_lease(inode, mode, FL_LEASE);

The faulting instruction then reads 0x28(%rax) to fetch flc_lease, which crashes at address 0x29 (the value in CR2).
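
The arithmetic behind that fault address can be sketched in user space. The struct below is only a simplified stand-in for the kernel's struct file_lock_context, assuming x86_64 and a 4-byte spinlock_t (no lock debugging); the `_sketch` type names and `lease_read_address()` are hypothetical and do not exist in the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space stand-ins for the kernel types.
 * Assumptions: x86_64 (LP64) and a 4-byte spinlock_t. */
struct list_head_sketch {
    struct list_head_sketch *next, *prev;
};

struct file_lock_context_sketch {
    uint32_t flc_lock;                  /* stand-in for spinlock_t */
    struct list_head_sketch flc_flock;
    struct list_head_sketch flc_posix;
    struct list_head_sketch flc_lease;  /* expected at offset 0x28 */
};

/* Address the CPU would read when loading flc_lease.next through a
 * given i_flctx value. Pure integer arithmetic; nothing is dereferenced. */
static uintptr_t lease_read_address(uintptr_t i_flctx)
{
    return i_flctx + offsetof(struct file_lock_context_sketch, flc_lease);
}
```

Under these assumptions, `lease_read_address(1)` evaluates to 0x29, which matches the CR2 value in the oops, and a value of 1 passes the non-NULL check in the test above.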

Goldwyn, any idea how this could happen? Or who else could know? I don't see any fixes post 5.1 in this area, so I suppose this was not fixed upstream yet?
Comment 2 Goldwyn Rodrigues 2019-08-09 15:05:43 UTC
inode->i_flctx should clearly be a pointer. The only way it can have 0x1 is corruption. Without a dump this may be difficult to diagnose.

Adding Neil to check if he has seen something similar before.
Comment 3 Neil Brown 2019-08-12 07:40:50 UTC
> Adding Neil to check if he has seen something similar before.

No, I haven't seen anything like this. I agree with your analysis.
Comment 4 Jiri Slaby 2019-08-12 09:22:25 UTC
(In reply to Goldwyn Rodrigues from comment #2)
> Without a dump this may be difficult to diagnose.

Could you set up kdump on OBS workers somehow?
Comment 5 Oliver Kurz 2019-08-12 19:12:05 UTC
No, sorry, that is not possible for me: I don't have that kind of access to the OBS workers. I suggest you contact the administrators of the OBS instance or resolve the bug as "WORKSFORME"; I have not seen this again, so it might not be easily reproducible anyway.
Comment 6 Jiri Slaby 2019-10-31 11:58:30 UTC
If you see it again, feel free to reopen.