Bug 1170822 - xfs related crash while building llvm
Status: RESOLVED NORESPONSE
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: x86-64 Other
Importance: P5 - None : Normal
Target Milestone: ---
Assigned To: Anthony Iliopoulos
Reported: 2020-04-29 12:57 UTC by Ismail Dönmez
Modified: 2021-01-03 18:34 UTC (History)
ailiopoulos: needinfo? (ismail)


Description Ismail Dönmez 2020-04-29 12:57:54 UTC
While building LLVM (and sometimes while installing the build) I can reliably crash the kernel. Here is the info and backtrace from crash(1):

> crash /usr/lib/debug/boot/vmlinux-5.6.6-1-default.debug /boot/vmlinux-5.6.6-1-default.xz /var/crash/2020-04-29-14\:30/vmcore

crash 7.2.8
Copyright (C) 2002-2020  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [646MB]: patching 113108 gdb minimal_symbol values

      KERNEL: /boot/vmlinux-5.6.6-1-default.xz
   DEBUGINFO: /usr/lib/debug/boot/vmlinux-5.6.6-1-default.debug
    DUMPFILE: /var/crash/2020-04-29-14:30/vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Wed Apr 29 14:30:45 2020
      UPTIME: 01:23:51
LOAD AVERAGE: 10.83, 10.31, 10.11
       TASKS: 273
    NODENAME: havana
     RELEASE: 5.6.6-1-default
     VERSION: #1 SMP Wed Apr 22 04:15:55 UTC 2020 (c11f000)
     MACHINE: x86_64  (3491 Mhz)
      MEMORY: 15.9 GB
       PANIC: "kernel BUG at mm/filemap.c:1318!"
         PID: 170
     COMMAND: "kworker/4:1"
        TASK: ffff97e5c73b9ec0  [THREAD_INFO: ffff97e5c73b9ec0]
         CPU: 4
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 170    TASK: ffff97e5c73b9ec0  CPU: 4   COMMAND: "kworker/4:1"
 #0 [ffffb12f80507af0] machine_kexec at ffffffffa966dce1
 #1 [ffffb12f80507b48] __crash_kexec at ffffffffa974f9fd
 #2 [ffffb12f80507c10] crash_kexec at ffffffffa9750795
 #3 [ffffb12f80507c20] oops_end at ffffffffa9634ed2
 #4 [ffffb12f80507c40] do_trap at ffffffffa963149b
 #5 [ffffb12f80507c90] do_invalid_op at ffffffffa9631ee7
 #6 [ffffb12f80507cb0] invalid_op at ffffffffaa000d68
    [exception RIP: end_page_writeback+95]
    RIP: ffffffffa9819e4f  RSP: ffffb12f80507d68  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff97e5c42e0d48  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff97e8effe7000  RDI: 0000000000000000
    RBP: ffffd4df47f80a80   R8: 0000000000000001   R9: ffff97e8effe7000
    R10: 00000000000353c0  R11: ffffffffffffffc0  R12: 0000000000000000
    R13: 0000000000000000  R14: ffff97e892522180  R15: ffffd4df47f80a80
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #7 [ffffb12f80507d70] iomap_finish_ioend at ffffffffa996125e
 #8 [ffffb12f80507dd8] iomap_finish_ioends at ffffffffa9961393
 #9 [ffffb12f80507e08] xfs_end_ioend at ffffffffc08a646d [xfs]
#10 [ffffb12f80507e40] xfs_end_io at ffffffffc08a6fdc [xfs]
#11 [ffffb12f80507e78] process_one_work at ffffffffa96b9433
#12 [ffffb12f80507eb8] worker_thread at ffffffffa96b964d
#13 [ffffb12f80507f10] kthread at ffffffffa96bfd09
#14 [ffffb12f80507f50] ret_from_fork at ffffffffaa0001ff

Let me know if I can provide more information.
Comment 1 Ismail Dönmez 2020-04-29 12:59:28 UTC
FWIW I can crash it with kernel-vanilla too, but the dump is from kernel-default.
Comment 2 Anthony Iliopoulos 2020-04-29 13:19:35 UTC
Thanks for the report, Ismail.

If you could place the kdump (either from kernel-default or vanilla) somewhere, and point me to a path to the debug rpms, I can have a look.
Comment 3 Ismail Dönmez 2020-04-29 15:21:21 UTC
I have put everything under /mounts/users-space/idoenmez/bsc-1170822
Comment 4 Anthony Iliopoulos 2020-04-29 15:50:56 UTC
(In reply to Ismail Dönmez from comment #3)
> I have put everything under /mounts/users-space/idoenmez/bsc-1170822

thanks! I just need a chmod g+r /mounts/users-space/idoenmez/bsc-1170822/vmcore as it's currently owner-readable only.
Comment 5 Ismail Dönmez 2020-04-29 16:41:28 UTC
(In reply to Anthony Iliopoulos from comment #4)
> (In reply to Ismail Dönmez from comment #3)
> > I have put everything under /mounts/users-space/idoenmez/bsc-1170822
> 
> thanks! I just need a chmod g+r
> /mounts/users-space/idoenmez/bsc-1170822/vmcore as it's currently
> owner-readable only.

Oops, fixed!
Comment 6 Anthony Iliopoulos 2020-05-12 14:22:32 UTC
It looks like writeback somehow ends up with a bogus page and triggers the BUG().

The page flags don't look right for a page expected to be under writeback:

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffd4df47f80a80 1fe02a000 ffff97e708965899     81a9  1 2ffff800080034 uptodate,lru,active,swapbacked

Not sure yet how we can end up with this page in the bio.
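
To make the flag check above concrete, here is a minimal sketch that decodes the low bits of the flags word from the crash output. The bit positions are an assumption taken from `enum pageflags` in a v5.x x86-64 kernel and can shift with kernel version and config; the helper name `decode_page_flags` is hypothetical. With that table the result matches what crash printed, and the writeback bit is clear.

```python
# Assumed bit order of enum pageflags (include/linux/page-flags.h, v5.x,
# x86-64). This table is an assumption and depends on kernel version/config.
PAGE_FLAG_BITS = [
    "locked", "referenced", "uptodate", "dirty", "lru", "active",
    "workingset", "waiters", "error", "slab", "owner_priv_1", "arch_1",
    "reserved", "private", "private_2", "writeback", "head",
    "mappedtodisk", "reclaim", "swapbacked", "unevictable", "mlocked",
]


def decode_page_flags(flags):
    """Return the names of the known low flag bits set in `flags`.

    Higher bits (zone/node/section encoding) are ignored.
    """
    return [name for bit, name in enumerate(PAGE_FLAG_BITS)
            if (flags >> bit) & 1]


flags = 0x2ffff800080034          # FLAGS column from the crash output
mapping = 0xffff97e708965899      # MAPPING column from the crash output

names = decode_page_flags(flags)
print(names)
print("writeback set:", "writeback" in names)
# Assumption: bit 0 of page->mapping is PAGE_MAPPING_ANON, which would be
# consistent with the swapbacked flag on this page.
print("mapping low bit set:", bool(mapping & 1))
```

If the low mapping bit really is the anonymous-mapping tag, this page would be an anonymous page rather than a page cache page of the file being written, which fits the "bogus page" reading; that interpretation is not confirmed anywhere in this report.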

@Ismail, since this is reliably reproducible, could you please share how you're building llvm to trigger this? Also, please provide the xfs_info for the particular fs so that I can see all the relevant params.

I'd like to reproduce this locally, as we may have to bisect it (there was quite a bit of code refactoring in iomap/writeback, merged in v5.5-rc1, that may have introduced a bug).
Comment 7 Ismail Dönmez 2020-05-13 12:51:02 UTC
havana ~ > xfs_info /data/
meta-data=/dev/sda4              isize=512    agcount=4, agsize=22917568 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=91670272, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=44760, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
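
For anyone trying to build a similar filesystem to reproduce this, the geometry above can be turned into sizes with a little arithmetic. This is only a sketch using the numbers printed by xfs_info; the variable names are mine.

```python
# Values copied from the xfs_info output above.
agcount, agsize = 4, 22917568      # allocation groups, blocks per group
bsize, blocks = 4096, 91670272     # data block size (bytes), data blocks
log_blocks = 44760                 # internal log size in blocks

# The allocation groups should account for every data block.
assert agcount * agsize == blocks

data_bytes = blocks * bsize
log_bytes = log_blocks * bsize
print(f"data: {data_bytes} bytes ({data_bytes / 2**30:.1f} GiB)")
print(f"log:  {log_bytes} bytes ({log_bytes / 2**20:.1f} MiB)")
```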

For building llvm:

git clone git@github.com:llvm/llvm-project.git
mkdir llvm-project/build; cd llvm-project/build

cmake -GNinja -DLLVM_ENABLE_PROJECTS='clang;clang-tools-extra;compiler-rt;flang;libunwind;lld;lldb;openmp;parallel-libs;polly;pstl;mlir' -DLLVM_ENABLE_RUNTIMES='libcxx;libcxxabi' -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DLLVM_CCACHE_BUILD=ON -DENABLE_LINKER_BUILD_ID=ON -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=OFF -DCLANG_ENABLE_ARCMT=OFF -DCLANG_ENABLE_STATIC_ANALYZER=OFF -DLLVM_ENABLE_PIC=ON -DLLVM_ENABLE_LLD=ON -DCLANG_DEFAULT_CXX_STDLIB=libc++ -DLLVM_ENABLE_LIBCXX=ON -DLIBCXX_ENABLE_PARALLEL_ALGORITHMS=OFF -DLLVM_STATIC_LINK_CXX_STDLIB=ON -DCMAKE_INSTALL_PREFIX=/data/binaries/llvm -DPSTL_PARALLEL_BACKEND=tbb -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_61 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=61 ../llvm 

ninja && ninja install/strip

This sets up CUDA as well, but I don't think that's needed to trigger the bug.
Comment 8 Anthony Iliopoulos 2020-05-20 20:51:48 UTC
Thanks Ismail. Unfortunately, I still wasn't able to reproduce this locally. Given that you can trigger this reliably on your machine, is there any chance you recall the approximate last kernel where this wasn't a problem?

We could try to maybe bisect and pinpoint the culprit, restricting the bisection to fs/iomap and fs/xfs (I still suspect the iomap writeback refactoring in v5.5-rc1, but I cannot verify since it doesn't reproduce on my hardware). The bisection between v5.4..v5.6.6 for fs/{iomap,xfs} seems to be just 7 steps.
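
A path-restricted bisection like the one described above could be set up roughly as follows. This is a sketch, not a tested procedure: the helper names are hypothetical, the tags are the ones mentioned in this comment (v5.6.6 is a stable tag, so it needs a tree that contains the stable branches), and the step bound is that of a plain binary search, which may differ slightly from git's own estimate on non-linear history.

```python
def bisect_start_cmd(bad, good, paths):
    """argv for `git bisect start <bad> <good> -- <paths>`."""
    return ["git", "bisect", "start", bad, good, "--", *paths]


def count_commits_cmd(good, bad, paths):
    """argv that counts the candidate commits touching `paths`."""
    return ["git", "rev-list", "--count", f"{good}..{bad}", "--", *paths]


def max_bisect_steps(n_commits):
    """Upper bound on tests for a binary search over n candidates.

    Equals ceil(log2(n)) for n >= 1, computed with integers only.
    """
    return (n_commits - 1).bit_length() if n_commits > 0 else 0


paths = ["fs/iomap", "fs/xfs"]
print(" ".join(count_commits_cmd("v5.4", "v5.6.6", paths)))
print(" ".join(bisect_start_cmd("v5.6.6", "v5.4", paths)))
# 7 steps would correspond to somewhere between 65 and 128 candidates.
print(max_bisect_steps(65), max_bisect_steps(128))
```

The commands are only printed here; they could be passed to `subprocess.run` from inside a kernel checkout.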
Comment 9 Ismail Dönmez 2020-05-25 06:56:15 UTC
(In reply to Anthony Iliopoulos from comment #8)
> Thanks Ismail, I still wasn't able to reproduce this locally unfortunately.
> Given that you can trigger this reliably on your machine, is there any
> chance you recall the approximate last kernel this wasn't a problem?
> 
> We could try to maybe bisect and pinpoint the culprit, restricting the
> bisection to fs/iomap and fs/xfs (I still suspect the iomap writeback
> refactoring in v5.5-rc1, but I cannot verify since it doesn't reproduce on
> my hardware). The bisection between v5.4..v5.6.6 for fs/{iomap,xfs} seems to
> be just 7 steps.

I can (un)fortunately still reproduce this :-) I guess you want me to check out kernel-source for our packages from GitHub and do a git bisect, starting from v5.4?
Comment 10 Ismail Dönmez 2020-05-25 13:10:58 UTC
> I can (un)fortunately still reproduce this :-) I guess you want me to
> check out kernel-source for our packages from GitHub and do a git bisect,
> starting from v5.4?

FWIW I reproduced the bug, but kdump somehow didn't kick in and the machine is now unreachable. It'll be offline until I can get it rebooted. Meanwhile I'll wait for your answer to the above.
Comment 11 Anthony Iliopoulos 2020-05-25 15:54:45 UTC
(In reply to Ismail Dönmez from comment #9)
>
> I can (un)fortunately still reproduce this :-) I guess you want me to
> check out kernel-source for our packages from GitHub and do a git bisect,
> starting from v5.4?

You mentioned this being reproducible on vanilla too, correct? In that case, once your machine is back online, I'd first check if this is reproducible still on our latest vanilla build. If you can still trigger it there, it would certainly indicate an upstream bug and would probably be easier to bisect directly on the mainline upstream tree, as our vanilla tree isn't updated on every upstream commit (so a single kernel-source/vanilla commit will contain multiple upstream commits).

I suppose we can easily build a vanilla 5.4 or so in ibs and check whether it works, to have a starting point for a bisection (unless you remember a specific older version in which this wasn't reproducible). I'd probably restrict the bisection to fs and mm (hopefully the bug is contained there) to minimize the steps. I wonder if this would be reproducible in a large VM on top of the same machine, so that you could automate the bisect and save some time.

Ping me before you start the bisection if you want to discuss any details.
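
If the bisection is automated in a VM as suggested above, `git bisect run` decides each step from the exit code of a test script: 0 marks the commit good, 125 skips it, and other codes from 1 to 127 mark it bad. A minimal sketch of that mapping follows; how "built" and "crashed" are determined (building the kernel, booting the VM, running the LLVM build) is left out, and the function name is hypothetical.

```python
# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def bisect_exit_code(built, crashed):
    """Map one test outcome to a `git bisect run` exit code.

    built:   the kernel at this commit compiled and booted
    crashed: the reproducer triggered the BUG() on that kernel
    """
    if not built:
        # Cannot test this commit; let git pick a neighbouring one.
        return SKIP
    return BAD if crashed else GOOD


print(bisect_exit_code(built=True, crashed=True))
print(bisect_exit_code(built=True, crashed=False))
print(bisect_exit_code(built=False, crashed=False))
```

Since the crash here does not trigger on every run, a real script would probably need to repeat the reproducer several times before declaring a commit good.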
Comment 12 Anthony Iliopoulos 2020-11-04 16:26:28 UTC
Shall we close this one, or is this still reproducible on current kernels?
Comment 13 Anthony Iliopoulos 2021-01-03 18:34:42 UTC
Closing due to inactivity; please reopen if this is reproducible on current kernels.