Bug 1180917 - kernel BUG at mm/huge_memory.c:2144!
Status: VERIFIED INVALID
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: S/390-64
OS: Other
Priority: P2 - High
Severity: Major
Target Milestone: ---
Assigned To: openSUSE Kernel Bugs E-mail List
Depends on:
Blocks:
Reported: 2021-01-14 09:28 UTC by Berthold Gunreben
Modified: 2022-02-16 08:29 UTC
CC: 7 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
full build log (16.15 KB, application/gzip)
2021-01-14 09:29 UTC, Berthold Gunreben

Description Berthold Gunreben 2021-01-14 09:28:39 UTC
Recently, a number of packages fail to build and instead hang in the cc1plus command. In a recent build, I found a kernel trace in a hanging worker:

[  608s] [  596.875979] kernel BUG at mm/huge_memory.c:2144!
[  608s] [  596.876290] monitor event: 0040 ilc:2 [#1] SMP 
[  608s] [  596.876374] Modules linked in: sha256_s390 sha_common overlay sd_mod t10_pi nls_iso8859_1 nls_cp437 vfat fat virtio_rng rng_core virtio_blk xfs btrfs blake2b_generic xor raid6_pq libcrc32c crc32_vx_s390 reiserfs squashfs fuse dm_snapshot dm_bufio dm_crypt dm_mod binfmt_misc loop sg scsi_mod
[  608s] [  596.876666] CPU: 1 PID: 2660 Comm: cc1plus Not tainted 5.10.5-1-default #1 openSUSE Tumbleweed
[  608s] [  596.876750] Hardware name: IBM 2964 N63 400 (KVM/Linux)
[  608s] [  596.876797] Krnl PSW : 0704e00180000000 00000000e33fb9ea (__split_huge_pmd+0x62a/0xc30)
[  608s] [  596.877158]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[  608s] [  596.877328] Krnl GPRS: 0000000000000000 00000000b2b40215 0000000003c91000 fffffffffffff800
[  608s] [  596.877408]            0000000081691a00 00000000f2440237 00000000000000c0 0000000000000000
[  608s] [  596.877493]            0000000081691800 000003d083c91030 0000000086d81d48 000003ff71a40000
[  608s] [  596.877567]            0000000086c1c000 00000000e47c4268 00000000e33fb748 000003e003ef3a80
[  608s] [  596.877652] Krnl Code: 00000000e33fb9de: a71f0400		cghi	%r1,1024
[  608s] [  596.877652]            00000000e33fb9e2: a784ff18		brc	8,00000000e33fb812
[  608s] [  596.877652]           #00000000e33fb9e6: af000000		mc	0,0
[  608s] [  596.877652]           >00000000e33fb9ea: a55b0602		oill	%r5,1538
[  608s] [  596.877652]            00000000e33fb9ee: a7f4ff06		brc	15,00000000e33fb7fa
[  608s] [  596.877652]            00000000e33fb9f2: e32010080004	lg	%r2,8(%r1)
[  608s] [  596.877652]            00000000e33fb9f8: a7210001		tmll	%r2,1
[  608s] [  596.877652]            00000000e33fb9fc: a77401f1		brc	7,00000000e33fbdde
[  608s] [  596.878162] Call Trace:
[  608s] [  596.878190]  [<00000000e33fb9ea>] __split_huge_pmd+0x62a/0xc30 
[  608s] [  596.878257] ([<00000000e33fb6ca>] __split_huge_pmd+0x30a/0xc30)
[  608s] [  596.878327]  [<00000000e3379116>] zap_p4d_range+0x246/0xbb0 
[  608s] [  596.878396]  [<00000000e33808f6>] zap_page_range+0x1a6/0x2e0 
[  608s] [  596.878458]  [<00000000e33b2e14>] do_madvise.part.0+0x844/0xc70 
[  608s] [  596.879153]  [<00000000e33b32a8>] __s390x_sys_madvise+0x68/0x80 
[  608s] [  596.879247]  [<00000000e3c676bc>] system_call+0xe0/0x2ac 
[  608s] [  596.879347] Last Breaking-Event-Address:
[  608s] [  596.879390]  [<00000000e33fb9b2>] __split_huge_pmd+0x5f2/0xc30
[  608s] [  596.879465] ---[ end trace 50ad5147a244f7d2 ]---

I don't know if this is related to boo#1163684.
Comment 1 Berthold Gunreben 2021-01-14 09:29:32 UTC
Created attachment 845109 [details]
full build log
Comment 2 Sarah Julia Kriesch 2021-01-15 13:07:22 UTC
This is happening continuously during packaging.
Comment 3 LTC BugProxy 2021-01-15 15:41:19 UTC
------- Comment From geraldsc@de.ibm.com 2021-01-15 10:30 EDT-------
(In reply to comment #4)
[..]
>
> I don't know if this is related boo#1163684

If that kernel does not have the fix from that bugzilla (LTC bug#184202, SUSE bug#1163684), then it most likely is related.

If it has the fix, then it must be something new. This is a BUG message from common code, do you only see this on s390?

BTW, where can I find the source code for this kernel "5.10.5-1-default", or any openSUSE Tumbleweed kernel? I cannot find it at the same place where the SLES kernel source lives: https://github.com/openSUSE/kernel.git
Comment 4 Sarah Julia Kriesch 2021-01-15 16:10:01 UTC
We are using kernel 5.10.7 in the latest Tumbleweed version.
The latest iso image is available under: 
https://download.opensuse.org/ports/zsystems/tumbleweed/iso/
Comment 5 Sarah Julia Kriesch 2021-01-15 16:10:54 UTC
(In reply to LTC BugProxy from comment #3)
> ------- Comment From geraldsc@de.ibm.com 2021-01-15 10:30 EDT-------
> (In reply to comment #4)
> [..]
> >
> > I don't know if this is related boo#1163684
> 
> If that kernel does not have the fix from that bugzilla (LTC bug#184202,
> SUSE bug#1163684), then it most likely is related.
> 
> If it has the fix, then it must be something new. This is a BUG message from
> common code, do you only see this on s390?
> 
That is s390x-related.
Comment 6 Sarah Julia Kriesch 2021-01-15 17:33:13 UTC
Our kernel developers are working upstream.
Comment 7 LTC BugProxy 2021-01-15 18:11:07 UTC
------- Comment From geraldsc@de.ibm.com 2021-01-15 13:02 EDT-------
(In reply to comment #9)
> We are using kernel 5.10.7 in the latest Tumbleweed version.
> The latest iso image is available under:
> https://download.opensuse.org/ports/zsystems/tumbleweed/iso/

Hmm, it says "5.10.5-1-default" in the kernel BUG output. In order to match the given line 2144 from "mm/huge_memory.c:2144" and to find the corresponding kernel code, a matching kernel source would be needed.

Is there any other means of kernel source access for openSUSE Tumbleweed, ideally a git repo like for SLES? Seems hard to believe that "open"SUSE kernel source is harder to find / access than SLES code...

Anyway, SUSE developers surely have such access, and since this is BUG statement in common memory management code anyway, I would suggest to let one of the corresponding SUSE developers have a look first.

BTW, some information that might help: THP worked fine on s390 with Tumbleweed, at least for some very short time, when verifying the other THP fix in LTC bug#184202 / SUSE bug#1163684. IIUC it was verified there with 5.9.11, but that is not 100% clear to me from the other bugzilla. Please verify on which kernel version it last worked fine. Then, with access to a proper source repo (and not just an ISO), one might be able to see what was changed in between with regard to THP, maybe madvise.
Comment 8 Berthold Gunreben 2021-01-15 22:30:18 UTC
(In reply to LTC BugProxy from comment #7)

That's a whole lot of questions, some of which need more explanation.

> ------- Comment From geraldsc@de.ibm.com 2021-01-15 13:02 EDT-------
> (In reply to comment #9)
> > We are using kernel 5.10.7 in the latest Tumbleweed version.
> > The latest iso image is available under:
> > https://download.opensuse.org/ports/zsystems/tumbleweed/iso/
> 
> Hmm, it says "5.10.5-1-default" in the kernel BUG output. In order to match
> the given line 2144 from "mm/huge_memory.c:2144" and to find the
> corresponding kernel code, a matching kernel source would be needed.

The kernel used for the builds is special: it originates from Tumbleweed, but in the build systems it can be substituted with specific versions, so it is not automatically updated to the latest release. The version string that you see tells the truth.
> 
> Is there any other means of kernel source access for openSUSE Tumbleweed,
> ideally a git repo like for SLES? Seems hard to believe that "open"SUSE
> kernel source is harder to find / access than SLES code...

It is not hard to find at all. All you need to know is that the different kernel flavors all depend on a central package called kernel-source, which has its own mechanism to integrate patches depending on a variety of conditions. The source can be found in the package http://download.opensuse.org/ports/zsystems/tumbleweed/repo/oss/noarch/kernel-source-5.10.5-1.1.noarch.rpm

I downloaded this in case it gets overwritten and would no longer be available that easily. Note that one can always rebuild an older version of a package, because OBS does not throw away sources.

> Anyway, SUSE developers surely have such access, and since this is BUG
> statement in common memory management code anyway, I would suggest to let
> one of the corresponding SUSE developers have a look first.

That is the reason why the assignee is the openSUSE kernel developers.
 
> BTW, some information that might help is the fact(?) that THP worked fine on
> s390 with Tumbleweed, at least for some very short time, when verifying the
> other THP fix in LTC bug#184202 / SUSE bug#1163684. IIUC, then it was
> verified there with 5.9.11, but that is not 100% clear to me from the other
> bugzilla. 

Now that is an interesting question as well. We could never reliably reproduce the behavior; it is more of a statistical experience. My feeling is that the kernel at least worked fine for some time.

Another slightly strange thing is that only one process, cc1plus, now leads to issues. On the other hand, the compiler is one of the biggest processes (from a memory perspective) to be found. Often enough, simply restarting the build makes it work.

> Please verify on which kernel version it worked fine the last
> time. Then, with having access to some proper source repo (and not just an
> ISO), one might be able to see what was changed in between and with regard
> to THP, maybe madvise.

The changes can be found in the changelog of the RPM; this is the reliable source for knowing what has changed when. The changelog is available via rpm (rpm -q --changelog ...) and also next to the spec file, with a .changes extension.

With regards to the sources, you can get the sources for the kernel-source package with the command:

osc co -r 77cf39676446e7f7aa15ea53ef337b64 openSUSE:Factory:zSystems kernel-source

The config is found in the file config.tar.bz2 within. The definition of which patch is applied in which case is found in the series.conf file.

Would it be helpful to temporarily add some extra kernel parameter for testing? With boo#1163684, it was very helpful to have a reliable build environment. I know that this kind of issue is hard to debug and hard to find. However, I also believe that it is vital to find it before it hits customers of enterprise distributions. This case hits even less often than boo#1163684, but that does not help those who are hit.
Comment 9 LTC BugProxy 2021-01-19 14:51:02 UTC
------- Comment From geraldsc@de.ibm.com 2021-01-19 09:45 EDT-------
(In reply to comment #13)
[...]
> > Is there any other means of kernel source access for openSUSE Tumbleweed,
> > ideally a git repo like for SLES? Seems hard to believe that "open"SUSE
> > kernel source is harder to find / access than SLES code...
>
> It is not hard to find at all. All you need to know is, that the different
> flavors of kernels all depend on a central package called kernel-source,
> which has an own mechanics to integrate patches depending on a variety of
> conditions. The source can be found in the package
> http://download.opensuse.org/ports/zsystems/tumbleweed/repo/oss/noarch/
> kernel-source-5.10.5-1.1.noarch.rpm

Hmm, almost right; you also need to know that kernel-source.noarch.rpm does not contain the full source, and that you need kernel-devel.noarch.rpm on top (e.g. for arch code). This is the same for SLES, so fortunately I knew it, but of course for SLES it is much less annoying, because there you also have a proper public git for both the kernel source tree and the kernel-source src.rpm content, so you don't really need to know which rpms contain what...

Anyway, back to the bug, from the kernel source for 5.10.5-1-default I see that it happens on the BUG_ON(!pte_none(*pte)) in __split_huge_pmd_locked(). This is very strange / interesting, because those are the ptes from the pre-allocated and deposited pagetable, which was withdrawn just shortly before that BUG_ON, with pgtable_trans_huge_withdraw().

The pre-allocated pagetables are initialized with empty (invalid) ptes before deposit, so they should of course all (still) be pte_none() after withdrawal. If a pte is !pte_none, then this means that either the pre-allocated pagetable got corrupted while it was deposited, or maybe that pgtable_trans_huge_withdraw() returns something that is not really a pagetable at all.

E.g. in theory it could return NULL if there were more withdrawals than deposits, if I understand the list handling code there correctly. Of course, such a thing should never happen (i.e. it would be a bug), but I am a bit confused why the common code does not also check this with a BUG_ON. Having a system dump could help to see more of what was going on. Any chance that kdump generated a dump after the BUG_ON?

From the backtrace and register output, and a kernel disassembly, one can at least see that in %r1 we have the pte value that did not pass the !pte_none check: 00000000b2b40215. This actually looks like a valid pte, with present / young / read / write-protect set, so one could assume that this is not the "NULL returned" case, but rather really a pre-allocated pagetable, which somehow got corrupted by someone having it in active access and filling it with valid ptes.

Of course, such a thing should also never happen; the pre-allocated and deposited pagetables cannot be used until they are withdrawn. Very strange. We actually have our own implementation of the deposit/withdraw functions, because we cannot use the generic versions. On s390, we have 2K pagetables, two of them within one 4K page, so we cannot use the generic logic that operates on struct pages for (4K) pagetable pages. The pgtable_t is therefore also not a struct page on s390, but rather a direct pointer to the pagetable. For maintaining the list of pre-allocated pagetables, we put a list_head directly into the pagetables, at the beginning, instead of using page->lru of the struct page associated with the pagetable, like it is done in the generic case. Then, on withdraw, and after list_del, the first two ptes are cleared so that the list_head gets overwritten and the whole pagetable should be empty again.

That is at least suspicious, and it could explain why you only see this on s390 (do you really?). However, I do not really see how our implementation would change anything that allows the deposited pagetables to change before withdrawing them. It is really the same logic as in the generic code, only that we put the list_head somewhere else. I still suspect some race in common code, e.g. some concurrent withdrawals without proper locking, but I could not yet find anything suspicious in the code...
Comment 10 LTC BugProxy 2021-01-19 15:01:39 UTC
------- Comment From geraldsc@de.ibm.com 2021-01-19 09:51 EDT-------
BTW, you seem to have a really good testcase for THP; at least you are able to trigger all kinds of THP issues (at least on s390 :-)).

What exactly is your system setup? Is it a KVM or z/VM guest, or did it maybe crash in a KVM host or LPAR? The workload you are using is some "SUSE build service", right?
Comment 11 Sarah Julia Kriesch 2021-01-19 15:37:10 UTC
@Ihno Can you tell Gerald something about the setup of our OBS workers, please?
Comment 12 Ruediger Oertel 2021-01-19 15:48:46 UTC
All Build Service workers are LPARs running KVM to host the individual worker instances. The LPARs (in the case of OBS) usually run a current Tumbleweed kernel, and the KVM VMs on top usually run the kernel of the target distribution the worker is building for. The bug displayed above is happening at the worker level, i.e. in a KVM VM running in an LPAR.
Comment 13 Sarah Julia Kriesch 2021-01-19 15:55:17 UTC
Thank you for the information, Rüdiger.
Comment 14 LTC BugProxy 2021-01-28 18:01:39 UTC
------- Comment From geraldsc@de.ibm.com 2021-01-28 12:56 EDT-------
(In reply to comment #17)
> all buildservice workers are LPARs running KVM to run the individual worker
> instances. the LPARS are (in case of OBS) usually running a current
> Tumbleweed
> kernel and the KVM VMs on top usually run the kernel of the target
> distribution
> of the package the worker is building for. the bug displayed above is
> happening
> on the worker level so as a KVM VM running in an LPAR.
>
> Thank you for the information, Rüdiger.

Thanks. One more thing, out of curiosity: what could be the reason why this is apparently only reproducible in the Tumbleweed build service?

Do you use "real" large pages for the KVM guests, i.e. have qemu (in the host) back the KVM guest with hugetlbfs mapping (see "hpage=1" parameter in https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.ljdd/ljdd_c_kvm_host.html)?
Comment 15 Sarah Julia Kriesch 2021-02-04 15:46:28 UTC
Rüdiger has removed the hugepage parameter after your last bugfix.
If I understand that correctly, you can find our kvm configuration of the OBS here:
https://github.com/openSUSE/obs-build/blob/876d9080994a969809b19166c8bc8db205dd760d/build-vm-kvm#L148
Comment 16 LTC BugProxy 2021-02-04 16:11:08 UTC
------- Comment From geraldsc@de.ibm.com 2021-02-04 11:04 EDT-------
Would it be possible to collect a system dump if the problem occurs again? It might not be possible to find the root cause from it, but it could at least verify whether the assumption is correct that a pre-allocated and deposited pagetable either was withdrawn twice or was even used while deposited.
Comment 17 Sarah Kriesch 2021-06-18 08:51:56 UTC
I am closing this bug report after such a long time without any reproduction of this issue. I will create a new one if it happens again.
Comment 18 Ihno Krumreich 2022-02-16 08:29:50 UTC
Verified