Bug 1092456 - 4.17 regression loading custom SPI module
4.17 regression loading custom SPI module
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Current
aarch64 openSUSE Factory
: P3 - Medium : Normal (vote)
: ---
Assigned To: Michal Kubeček
E-mail List
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2018-05-09 01:16 UTC by Andreas Färber
Modified: 2022-07-21 17:22 UTC (History)
11 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
tentative patch (1.58 KB, patch)
2018-05-15 10:50 UTC, Michal Kubeček
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Andreas Färber 2018-05-09 01:16:28 UTC
On a Pine64 board (arm64) with Kernel:HEAD kernel-default 4.17.rc3 and 4.17.rc4, insmod'ing a custom SPI device driver kernel module built locally against kernel-default-devel (with spi0 enabled in the DT) leads to an underflowing usage count:

# insmod ./lan9252.ko
$ lsmod
Module                  Size  Used by
lan9252                16384  21656030
[...]
spi_sun6i              16384  0
[...]
# rmmod lan9252
rmmod: ERROR: Module lan9252 is in use

Using earlier kernel versions, such as 4.16.6, the usage counter is at 0 instead, and unloading works as expected.
Comment 1 Michal Kubeček 2018-05-09 06:25:22 UTC
"21656030" (0x014a71de) doesn't look like a typical underflow, rather a random
garbage, IMHO. Do we have source of the module?
Comment 2 Andreas Färber 2018-05-09 10:30:46 UTC
(In reply to Michal Kubeček from comment #1)
> Do we have source of the module?

Pushed to:

https://github.com/afaerber/lan9252-module

Derived from previously working ones:

https://github.com/afaerber/netx-module
https://github.com/afaerber/lora-modules
Comment 4 Michal Kubeček 2018-05-10 11:10:53 UTC
I couldn't see any obvious problem in the module source, neither have I any
idea what kind of change could result in such refcounting problem. Some tips
for debugging:

1. Seeing how does the refcount behave might give us some hint. Try e.g. to
display the refcount repeatedly after loading the module to see if it's growing,
decreasing or stays the same (in the third case, it would be interesting to see
how the values differ in repeated tests).

2. There are tracepoints module:module_get and module:module_point so that
you could use perf to trace refcount changes. If there is a misbalance in get
and put operations, it could tell us where.
Comment 5 Andreas Färber 2018-05-12 17:14:50 UTC
I'm seeing similar symptoms on a Raspberry Pi 3 Model B with 4.17-rc4:

# lsmod
Module                  Size  Used by
sx1276                 20480  10549406
lora_dev               16384  10510879 sx1276
lora                   16384  10455742
[...]
spi_bcm2835            20480  0
[...]

As you can see, the numbers are differing significantly between modules where they should have the same usage count. They do not differ between lsmod calls.
Comment 6 Jessica Yu 2018-05-14 09:08:02 UTC
What does the trace output on module load look like when you enable the module_get tracepoint (/sys/kernel/debug/tracing/events/module/module_get/)?

Another thought: is it possible that the config with which the module(s) were compiled differs slightly from that of the running kernel (while controlling for kernel version)? That may potentially produce a differently sized module struct whose fields could have slightly different offsets..
Comment 7 Michal Kubeček 2018-05-14 10:22:46 UTC
(In reply to Jessica Yu from comment #6)
> Another thought: is it possible that the config with which the module(s)
> were compiled differs slightly from that of the running kernel (while
> controlling for kernel version)? That may potentially produce a differently
> sized module struct whose fields could have slightly different offsets..

Looks like you are right. I checked the layout of struct module in vmlinux
and compiled module (compiled on SLE15 but that shouldn't make a difference)
and indeed, they differ:

--- m2.vmlinux  2018-05-14 06:15:09.710000000 -0400
+++ m2.module   2018-05-14 06:13:39.130000000 -0400
@@ -168,8 +168,9 @@
 /*           |     4 */    unsigned int percpu_size;
 /*           |     4 */    unsigned int num_tracepoints;
 /*           |     8 */    struct tracepoint * const *tracepoints_ptrs;
+/*           |     8 */    struct jump_entry *jump_entries;
+/*           |     4 */    unsigned int num_jump_entries;
 /*           |     4 */    unsigned int num_trace_bprintk_fmt;
-/* XXX  4-byte hole  */
 /*           |     8 */    const char **trace_bprintk_fmt_start;
 /*           |     8 */    struct trace_event_call **trace_events;
 /*           |     4 */    unsigned int num_trace_events;

As a result, the module thinks module::refcnt should be at offset 808 while
kernel expects it at offset 800. module's structure has module::exit at offset
800 and the numbers shown in comments 0 and 5 could pretty well be lower
32 bits of these callbacks.

The two extra members in module's version indicate a HAVE_JUMP_LABEL was
enabled when building the module but not kernel so the next question is where
does this difference come from.
Comment 8 Michal Kubeček 2018-05-14 11:05:18 UTC
HAVE_JUMP_LABEL is defined in <linux/jump_label.h> this way:

  #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
  # define HAVE_JUMP_LABEL
  #endif

where CONFIG_JUMP_LABEL is enabled. CC_HAVE_ASM_GOTO is detected on build
time by trying to compile this (see scripts/gcc-goto.sh):

-----------------------------------------------------------------------------
int main(void)
{
#if defined(__arm__) || defined(__aarch64__)
        /*
         * Not related to asm goto, but used by jump label
         * and broken on some ARM GCC versions (see GCC Bug 48637).
         */
        static struct { int dummy; int state; } tp;
        asm (".long %c0" :: "i" (&tp.state));
#endif

entry:
        asm goto ("" :::: entry);
        return 0;
}
-----------------------------------------------------------------------------

For some reason this succeeds when building the module but fails when building
the kernel packages in IBS. There is one related commit between 4.16 and
4.17-rc1: e501ce957a78 ("x86: Force asm-goto") but I don't see how it could
break the detection in IBS.
Comment 9 Michal Kubeček 2018-05-14 11:10:57 UTC
One strange thing in IBS build log is that it installs gcc 7.3.1 but libgcc_s1
version 8.0.1:

> [   35s] libgcc_s1-8.0.1+r259636-1.1           ########################################
> [   60s] gcc7-7.3.1+r258812-3.1                ########################################

No idea if it could be related. The SLE15 system I was testing on has also gcc 7.3.1 but
slightly different:

  gcc7-7.3.1+r258812-2.8.aarch64
Comment 10 Michal Kubeček 2018-05-14 11:13:16 UTC
(In reply to Michal Kubeček from comment #9)
> > [   35s] libgcc_s1-8.0.1+r259636-1.1           ########################################
> > [   60s] gcc7-7.3.1+r258812-3.1                ########################################
> 
> No idea if it could be related. The SLE15 system I was testing on has also
> gcc 7.3.1 but slightly different:
> 
>   gcc7-7.3.1+r258812-2.8.aarch64

Sorry for the noise, I was looking at libgcc_s1 above. So the version is the
same. It's probably either the libgcc_s1 or something in the environment.
Comment 11 Andreas Färber 2018-05-14 11:17:26 UTC
(In reply to Michal Kubeček from comment #8)
> [...] CC_HAVE_ASM_GOTO is detected on build
> time by trying to compile this (see scripts/gcc-goto.sh):
> 
> -----------------------------------------------------------------------------
> int main(void)
> {
> #if defined(__arm__) || defined(__aarch64__)
>         /*
>          * Not related to asm goto, but used by jump label
>          * and broken on some ARM GCC versions (see GCC Bug 48637).
>          */
>         static struct { int dummy; int state; } tp;
>         asm (".long %c0" :: "i" (&tp.state));
> #endif
> 
> entry:
>         asm goto ("" :::: entry);
>         return 0;
> }
> -----------------------------------------------------------------------------
> 
> For some reason this succeeds when building the module but fails when
> building
> the kernel packages in IBS.

I've confirmed that commenting out the two arm lines it still misbehaves, whereas if I torpedo the test by adding a syntactically invalid "foo" it works.
Comment 12 Andreas Färber 2018-05-14 11:20:02 UTC
Maybe the compiler folks have an idea of what's going wrong here?
Comment 13 Richard Biener 2018-05-14 11:34:42 UTC
What's this strange

        static struct { int dummy; int state; } tp;
        asm (".long %c0" :: "i" (&tp.state));

supposed to do besides having a random number as instruction?  That is,
if you are lucky it might be a valid instruction.  Maybe it shouldn't be
in the execution flow?  Or the testcase is supposed to be compiled and
not run?
Comment 14 Richard Biener 2018-05-14 11:36:46 UTC
So - how does this testcase FAIL when building the kernel?  Ah, possibly different compiler flags?  Like -fPIE/fPIC vs. no such flags?  Because  the quoted part
ends up emitting a relocation.
Comment 15 Andreas Färber 2018-05-14 11:45:51 UTC
(In reply to Richard Biener from comment #13)
> What's this strange
> 
>         static struct { int dummy; int state; } tp;
>         asm (".long %c0" :: "i" (&tp.state));
> 
> supposed to do besides having a random number as instruction?  That is,
> if you are lucky it might be a valid instruction.  Maybe it shouldn't be
> in the execution flow?  Or the testcase is supposed to be compiled and
> not run?

AFAIU it's a compile test only (needs to work for cross-compiling, too).

Confirming that Tumbleweed's published aarch64 gcc7 matches the revision Michal mentioned, which is the same that built the latest Kernel:HEAD kernel-default on aarch64. Local libgcc_s1 is also at version 8, no difference to OBS/IBS.

https://build.opensuse.org/package/show/Kernel:HEAD/kernel-default

(Note: -rc5 will arrive with the next cron job kernel source upload and potentially lead to newer Factory package versions used.)
Comment 16 Andreas Färber 2018-05-14 11:49:50 UTC
(In reply to Richard Biener from comment #14)
> So - how does this testcase FAIL when building the kernel?

ICE for %c - cf. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48637
Comment 17 Richard Biener 2018-05-14 11:52:31 UTC
(In reply to Andreas Färber from comment #16)
> (In reply to Richard Biener from comment #14)
> > So - how does this testcase FAIL when building the kernel?
> 
> ICE for %c - cf. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48637

So it ICEs for building the kernel?  Well. GCC trunk has for example

> ./cc1 -quiet t.c -I include -fPIE
t.c: In function ‘main’:
t.c:9:3: warning: asm operand 0 probably doesn’t match constraints
   asm (".long %c0" :: "i" (&tp.state));
   ^~~
t.c:9:3: error: impossible constraint in ‘asm’
> ./cc1 -quiet t.c -I include 

so whether it compiles depends on the flags.
Comment 18 Michal Kubeček 2018-05-14 11:56:17 UTC
(In reply to Richard Biener from comment #17)
> (In reply to Andreas Färber from comment #16)
> > (In reply to Richard Biener from comment #14)
> > > So - how does this testcase FAIL when building the kernel?
> > 
> > ICE for %c - cf. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48637
> 
> So it ICEs for building the kernel?  Well. GCC trunk has for example

That was a bug fixed back in 2011. It's an additional test ot make sure we
do not try to use jumplabels if kernel is built with (old) compiler exhibiting
it. Certainly not what is going on in IBS now.

> > ./cc1 -quiet t.c -I include -fPIE
> t.c: In function ‘main’:
> t.c:9:3: warning: asm operand 0 probably doesn’t match constraints
>    asm (".long %c0" :: "i" (&tp.state));
>    ^~~
> t.c:9:3: error: impossible constraint in ‘asm’
> > ./cc1 -quiet t.c -I include 
> 
> so whether it compiles depends on the flags.

This is more like it.
Comment 19 Andreas Färber 2018-05-14 12:28:29 UTC
(In reply to Richard Biener from comment #17)
> (In reply to Andreas Färber from comment #16)
> > (In reply to Richard Biener from comment #14)
> > > So - how does this testcase FAIL when building the kernel?
> > 
> > ICE for %c - cf. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48637
> 
> So it ICEs for building the kernel?

Oh, you mean _our_ kernel build. :) We don't see that in the build log because gcc-goto.sh redirects output to /dev/null; I hoped you had a hunch what might differ between the two environments and how to investigate or fix.

The test has been in the kernel for years already:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/scripts/gcc-goto.sh?id=f3c003f72dfb2497056bcbb864885837a1968ed5
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/scripts/gcc-goto.sh?id=a9468f30b5eac6957c86aea97954553bfb7c1f18

Did we recently try building the kernel with PIE or other changing flags?
Comment 20 Andreas Färber 2018-05-14 12:32:25 UTC
(In reply to Andreas Färber from comment #19)
> Did we recently try building the kernel with PIE or other changing flags?

gcc-PIE-7-2.7 :( which I locally do not have installed.

Relying on gcc-PIE seems rather risky to me - can we BuildIgnore that and instead make sure the flags are always the same, both in OBS and for local builds?
Comment 21 Michal Kubeček 2018-05-14 12:37:41 UTC
I'm confused... Kernel:stable (4.16.8) from May 9 seems to be built with the
same gcc packages as Kernel:HEAD but it has jump_entries and num_jump_entries
in struct module. So there seems to be some difference between master and
stable. Some problem with commit e501ce957a78, perhaps?
Comment 22 Dominique Leuenberger 2018-05-14 12:40:22 UTC
(In reply to Andreas Färber from comment #20)
> (In reply to Andreas Färber from comment #19)
> > Did we recently try building the kernel with PIE or other changing flags?
> 
> gcc-PIE-7-2.7 :( which I locally do not have installed.

> Relying on gcc-PIE seems rather risky to me - can we BuildIgnore that and
> instead make sure the flags are always the same, both in OBS and for local
> builds?

ANY build in OBS uses gcc-PIE, also if you use 'osc build' it is being installed.

PIE has been in use since 2017-05-27 (on Factory)

BuildIgnore: gcc-PIE is possible, but should be discussed with the sec team first (who asked for PIE to be enabled in first place)
Comment 23 Michal Kubeček 2018-05-14 12:43:42 UTC
Installing gcc-PIE makes the out of tree module compatible with kernel image
(no jump_entries and num_jump_entries in struct module). But it still does not
explain the difference between Kernel:HEAD (master) and Kernel:stable (stable).
Comment 24 Michal Kubeček 2018-05-14 12:49:40 UTC
BuildIgnore gcc-PIE when building the kernel would IMHO result in an opposite
problem if an out of tree module is built on a system with gcc-PIE installed.
Perhaps rather let kernel-*-devel (or kernel-devel) depend on gcc-PIE.

Anyway, long term solution should be fixing the testcase so that we detect
CC_HAVE_ASM_GOTO correctly, with or without gcc-PIE. But that is beyond my
ARM64 assembler skills.
Comment 25 Andreas Färber 2018-05-14 12:50:50 UTC
https://kernel.opensuse.org/cgit/kernel/commit/?id=e501ce957a78 moves the gcc-goto.sh test to an earlier location - possibly the $(KBUILD_CFLAGS) passed to the script make a difference?
Comment 26 Michal Kubeček 2018-05-14 12:52:03 UTC
(In reply to Andreas Färber from comment #25)
> https://kernel.opensuse.org/cgit/kernel/commit/?id=e501ce957a78 moves the
> gcc-goto.sh test to an earlier location - possibly the $(KBUILD_CFLAGS)
> passed to the script make a difference?

This looks promising. There is

  KBUILD_CFLAGS  += $(call cc-option,-fno-PIE)
  KBUILD_AFLAGS  += $(call cc-option,-fno-PIE)

between the new and old location.
Comment 27 Michal Kubeček 2018-05-15 10:50:55 UTC
Created attachment 770256 [details]
tentative patch

This shold do the trick, just move the lines disabling PIE before the asm goto
test again.

Tested with the lan9252 module: both with and without gcc-PIE package, struct
module has jump_entries and num_jump_entries members.

Going to check that it does the trick in IBS as well but unfortunately I got
some slow worker, the build (project home:mkubecek:bsc1092456) is already
running for over 3 hours and doesn't seem to be going to finish any time soon.

Once tested, I would like to submit the patch to upstream, unless someone has
a better idea. In any case, it would be nice to fix the testcase but that
would require someone else.
Comment 28 Michal Kubeček 2018-05-16 05:23:59 UTC
Good news... with kernel packages built in IBS with the patch from comment 27,
struct module looks the same in vmlinux, lan9252.ko built with gcc-PIE package
installed and lan9252.ko built without gcc-PIE installed. I also tried to
insmod lan9252.ko, refcount is zero after loading and module can be unloaded
cleanly.

I'm going to submit the patch to upstream. Let's see if someone comes with
a nicer solution.
Comment 29 Michal Kubeček 2018-05-16 06:19:43 UTC
(In reply to Michal Kubeček from comment #28)
> Good news... with kernel packages built in IBS with the patch from comment 27,
> struct module looks the same

For the sake of completeness: "the same" means it has both members related
to jump labels (i.e. what we want).
Comment 32 Michal Kubeček 2018-05-21 10:43:54 UTC
The fix (or rather workaround) was accepted to kbuild tree but it's not in
v4.17-rc6 yet so I cherry picked it to our master branch. 4.17-rc6 packages
building now in IBS Devel:Kernel:master already have it. OBS Kernel:HEAD was
not updated to RC6 yet but once it is, it should have the patch as well.
Comment 33 Michal Kubeček 2018-05-31 07:21:28 UTC
4.17-rc7 packages available in both OBS Kernel:HEAD and IBS Devel:Kernel:master
now should be fixed. Feel free to reopen if something goes wrong.