Bug 1082704 - compiling nvidia kernel modules uses wrong kernel tree?
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: X11 3rd Party Driver
Version: Current
Hardware: Other Other
Priority: P3 - Medium, Severity: Normal
Assigned To: E-mail List
Stefan Dirsch
Reported: 2018-02-24 20:30 UTC by Peter Sütterlin
Modified: 2021-06-14 21:58 UTC (History)

Attachments
zypp history from upgrade (23.77 KB, application/x-xz)
2018-02-25 21:59 UTC, Peter Sütterlin

Description Peter Sütterlin 2018-02-24 20:30:54 UTC
My desktop has an NVIDIA GTX 1060. I use the proprietary driver from the NVIDIA repo on a recent Tumbleweed.

The driver's kernel module package is nvidia-gfxG04-kmp-default-390.25_k4.15.2_1-10.1.x86_64; the kernel (TW 20180221) is 4.15.4-1-default.

To check some other issue, I rebooted into the previous kernel (4.15.2-1-default) today.  The driver wouldn't load.

I have
/lib/modules/4.15.2-1-default/updates/nvidia.ko
/lib/modules/4.15.4-1-default/weak-updates/updates/nvidia.ko -> /lib/modules/4.15.2-1-default/updates/nvidia.ko

However,

modinfo /lib/modules/4.15.2-1-default/updates/nvidia.ko
filename:       /lib/modules/4.15.2-1-default/updates/nvidia.ko
alias:          char-major-195-*
version:        390.25
supported:      external
license:        NVIDIA
srcversion:     B5B1CA3087B567ADFADC070
alias:          pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        ipmi_msghandler
retpoline:      Y
name:           nvidia
vermagic:       4.15.4-1-default SMP preempt mod_unload modversions 

or
strings /lib/modules/4.15.2-1-default/updates/nvidia.ko | grep 4\\.15
4.15.4-1H
/usr/src/linux-4.15.4-1/include/linux/dma-mapping.h
/usr/src/linux-4.15.4-1/include/linux/dma-mapping.h
......

Seems it was compiled using the 4.15.4 kernel tree, but installed in the 4.15.2 modules directory?

No manual compiles etc., just using standard 'zypper dup'
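
A mismatch like the one described above can be checked by comparing the kernel release in the module's directory with the first field of its vermagic string. The following is only a sketch using the values quoted in this report; the function names `module_dir_release` and `vermagic_matches_dir` are made up for illustration and are not part of any package.

```shell
# Extract the kernel release from a module path such as
# /lib/modules/4.15.2-1-default/updates/nvidia.ko
module_dir_release() {
    rest=${1#/lib/modules/}        # strip the fixed prefix
    printf '%s\n' "${rest%%/*}"    # keep everything up to the next slash
}

# Compare the directory release with the first word of a vermagic string.
# Returns 0 when they match, 1 otherwise.
vermagic_matches_dir() {
    dir_release=$(module_dir_release "$1")
    built_for=${2%% *}             # vermagic starts with the kernel release
    [ "$dir_release" = "$built_for" ]
}

# Values taken from the modinfo output in this report
if vermagic_matches_dir /lib/modules/4.15.2-1-default/updates/nvidia.ko \
        "4.15.4-1-default SMP preempt mod_unload modversions"; then
    echo "ok"
else
    echo "mismatch"
fi
```

On a live system the vermagic string could be obtained with `modinfo -F vermagic <module path>` instead of being typed in by hand.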
Comment 1 Michael Hirmke 2018-02-25 09:54:57 UTC
I can confirm this behaviour.
On my Tumbleweed machines the same is happening.
Comment 2 Michael Hirmke 2018-02-25 09:56:36 UTC
Even booting the older kernels 4.15.[23] and running "zypper up --force <nvidia-packages>" doesn't change this behaviour.
So booting older kernels gives no GUI at the moment.
Comment 3 Stefan Dirsch 2018-02-25 20:54:41 UTC
First, there is nothing wrong with the weak-updates compatibility symlink. Please attach dmesg output with the kernel where it doesn't work.
Comment 4 Peter Sütterlin 2018-02-25 21:40:13 UTC
No, of course the weak-update links are as they should be.  But it should be the 4.15.2 module in /lib/modules/4.15.2-1-default/updates.  It obviously is not.

Here's the dmesg when I tried to manually load the module in kernel 4.15.2 (I had booted to runlevel 3, logged in on console and typed the modprobe command):

Feb 24 20:07:05 lux login[1775]: ROOT LOGIN ON tty1
Feb 24 20:07:23 lux kernel: nvidia: disagrees about version of symbol kmem_cache_alloc_trace
Feb 24 20:07:23 lux kernel: nvidia: Unknown symbol kmem_cache_alloc_trace (err -22)
Feb 24 20:07:23 lux kernel: nvidia: disagrees about version of symbol kmem_cache_alloc
Feb 24 20:07:23 lux kernel: nvidia: Unknown symbol kmem_cache_alloc (err -22)
Feb 24 20:07:23 lux kernel: nvidia: disagrees about version of symbol kmem_cache_free
Feb 24 20:07:23 lux kernel: nvidia: Unknown symbol kmem_cache_free (err -22)

There were similar/identical messages earlier in the boot, as the machine is nvidia-only, and the module is supposed to load already from initrd, which of course also failed.
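
To get a compact list of the affected symbols from a log like the one above, the lines can be filtered. This is a minimal sketch, assuming the message wording shown in this report; `mismatched_symbols` is a hypothetical helper name.

```shell
# Read kernel log lines on stdin and print, once each, every symbol
# whose version the module disagrees about.
mismatched_symbols() {
    sed -n 's/.*disagrees about version of symbol \([A-Za-z0-9_]*\).*/\1/p' | sort -u
}

# Sample lines copied from the log above
mismatched_symbols <<'EOF'
Feb 24 20:07:23 lux kernel: nvidia: disagrees about version of symbol kmem_cache_alloc_trace
Feb 24 20:07:23 lux kernel: nvidia: Unknown symbol kmem_cache_alloc_trace (err -22)
Feb 24 20:07:23 lux kernel: nvidia: disagrees about version of symbol kmem_cache_alloc
Feb 24 20:07:23 lux kernel: nvidia: disagrees about version of symbol kmem_cache_free
EOF
```

On a running system the input could come from `dmesg | mismatched_symbols`.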

It is supposedly the same issue that you requested a separate report for in
https://bugzilla.opensuse.org/show_bug.cgi?id=1080742#c113
Comment 5 Peter Sütterlin 2018-02-25 21:59:46 UTC
Created attachment 761606
zypp history from upgrade

This is the /var/log/zypp/history from the upgrade that installed both new kernel (4.15.4) and nvidia modules.

Line 148 is the start of the additional rpm output for kernel-default-devel-4.15.4-1.5, which compiles the nvidia module in /usr/src/linux-4.15.4-1-obj/x86_64/default

Line 1156+ then does this again when installing nvidia-gfxG04-kmp-default-390.25_k4.15.2_1-10.1, again in the new kernel tree (line 1359)

The log starts with warnings that
# Warning: /lib/modules/4.15.1-1-default is inconsistent
# Warning: /lib/modules/4.15.2-1-default is inconsistent
(lines 1328/1337)

To me it looks like the installation scripts use the latest/highest installed kernel tree for the build, but the location where the result is stored is hardcoded.
Comment 6 Stefan Dirsch 2018-02-26 12:44:07 UTC
My guess is that you have various versions of kernel-default-devel, kernel-devel, kernel-source installed.

I cannot investigate that issue without direct access to the system. If you want to investigate the issue yourself, you would need to run %post of nvidia-gfxG04-kmp-default manually. Please check it out via

  rpm --scripts -q nvidia-gfxG04-kmp-default

postinstall scriptlet (using /bin/sh):
arch=x86_64
[...]
/usr/sbin/update-alternatives [...]

Maybe you can figure out something.
Comment 7 Peter Sütterlin 2018-02-26 14:21:54 UTC
(In reply to Stefan Dirsch from comment #6)
> My guess is, that you have various versions of kernel-default-devel,
> kernel-devel, kernel-source installed.

Yes, indeed.  
lux:~ # rpm -q kernel-default-devel kernel-devel
kernel-default-devel-4.15.2-1.4.x86_64
kernel-default-devel-4.15.4-1.5.x86_64
kernel-devel-4.15.2-1.4.noarch
kernel-devel-4.15.4-1.5.noarch

nvidia-gfxG04-kmp-default requires kernel-default-devel, and this gets (also) updated with every new kernel version when running 'zypper dup'.
So I'd assume every user of the nvidia repo would be in that situation?

> I cannot investigate that issue without direct access to the system. If you
> want to investigate the issue yourself, you would need to run %post of
> nvidia-gfxG04-kmp-default manually. Please check out via
> 
>   rpm --scripts -q nvidia-gfxG04-kmp-default

I had a look at those before, and also at the makefiles etc., but so far could not spot where it decides to use the latest devel version...
Going through it again now.  There's no mention of any update-alternatives, though.

The postinstall scriptlet only consists of
-----
postinstall scriptlet (using /bin/sh):
nvr=nvidia-gfxG04-kmp-default-390.25_k4.15.2_1-10.1
wm2=/usr/lib/module-init-tools/weak-modules2
if [ -x $wm2 ]; then
     INITRD_IN_POSTTRANS=1 /bin/bash -${-/e/} $wm2 --add-kmp $nvr
fi
-----
And running that does not compile anything, I think it only creates the links.

So then the suspicion is it is the kernel-default-devel package.  The zypper log shows that that one did actually compile the nvidia modules.

I checked the postinstall of that one, but that only does the ..obj links.
Still, the log has
# 2018-02-22 23:24:25 kernel-default-devel-4.15.4-1.5.x86_64.rpm installed ok
# Additional rpm output:
# Changing symlink /usr/src/linux-obj/x86_64/default from ../../linux-4.15.2-1-obj/x86_64/default to ../../linux-4.15.4-1-obj/x86_64/default
# /usr/src/kernel-modules/nvidia-390.25-default /
# rm -f -r conftest
# make[1]: Entering directory '/usr/src/linux-4.15.2-1'

The 'Changing symlink' is from the post script.
No idea why it starts compiling now.  

Does it call dkms or something?
Comment 8 Stefan Dirsch 2018-02-26 14:39:24 UTC
This sounds weird. On my test system:

# rpm --scripts -q nvidia-gfxG04-kmp-default
[...]
postinstall scriptlet (using /bin/sh):
arch=x86_64
flavor=default
kver=$(make -sC /usr/src/linux-obj/$arch/$flavor kernelrelease)
make -C /usr/src/linux-obj/$arch/$flavor \
     modules \
     M=/usr/src/kernel-modules/nvidia-390.25-$flavor \
     SYSSRC=/lib/modules/$kver/source \
     SYSOUT=/usr/src/linux-obj/$arch/$flavor
pushd /usr/src/kernel-modules/nvidia-390.25-$flavor 
make -f Makefile \
     nv-linux.o \
     SYSSRC=/lib/modules/$kver/source \
     SYSOUT=/usr/src/linux-obj/$arch/$flavor
popd
install -m 755 -d /lib/modules/4.4.76-1-$flavor/updates
install -m 644 /usr/src/kernel-modules/nvidia-390.25-$flavor/nvidia*.ko \
        /lib/modules/4.4.76-1-$flavor/updates
depmod 4.4.76-1-$flavor
[...]

Oh. And there are also trigger scripts to rebuild the kernel module once a new kernel is being installed (kind of DKMS reimplemented on package level).

# rpm --triggers -q nvidia-gfxG04-kmp-default
triggerpostun scriptlet (using /bin/sh) -- drm-kmp-default
flavor=default
pushd /usr/src/kernel-modules/nvidia-390.25-$flavor || true
cp -a Makefile{,.tmp} || true
make clean || true
mv Makefile{.tmp,} || true
popd || true
arch=x86_64
flavor=default
kver=$(make -sC /usr/src/linux-obj/$arch/$flavor kernelrelease)
make -C /usr/src/linux-obj/$arch/$flavor \
     modules \
     M=/usr/src/kernel-modules/nvidia-390.25-$flavor \
     SYSSRC=/lib/modules/$kver/source \
     SYSOUT=/usr/src/linux-obj/$arch/$flavor
pushd /usr/src/kernel-modules/nvidia-390.25-$flavor 
make -f Makefile \
     nv-linux.o \
     SYSSRC=/lib/modules/$kver/source \
     SYSOUT=/usr/src/linux-obj/$arch/$flavor
popd
install -m 755 -d /lib/modules/4.4.76-1-$flavor/updates
install -m 644 /usr/src/kernel-modules/nvidia-390.25-$flavor/nvidia*.ko \
        /lib/modules/4.4.76-1-$flavor/updates
depmod 4.4.76-1-$flavor
[...]


If you have completely different scripts in your packages, you're using completely
different packages.
Comment 9 Peter Sütterlin 2018-02-26 15:14:51 UTC
I definitely have different packages, as I am running Tumbleweed, whereas yours seem to indicate Leap.  I just re-downloaded the rpm package from nvidia and checked the scripts again, they are identical, and the postinstall only has the mentioned call to /usr/lib/module-init-tools/weak-modules2

And while I did have a look at the triggers of the kernel-default-devel package (there are none), I forgot to do the same for the nvidia package :((

And indeed, they show what I suspected:

--------
triggerin scriptlet (using /bin/sh) -- kernel-default-devel
flavor=default
pushd /usr/src/kernel-modules/nvidia-390.25-$flavor || true
cp -a Makefile{,.tmp} || true
make clean || true
mv Makefile{.tmp,} || true
popd || true
arch=x86_64
flavor=default
kver=$(make -sC /usr/src/linux-obj/$arch/$flavor kernelrelease)
make -C /usr/src/linux-obj/$arch/$flavor \
     modules \
     M=/usr/src/kernel-modules/nvidia-390.25-$flavor \
     SYSSRC=/lib/modules/$kver/source \
     SYSOUT=/usr/src/linux-obj/$arch/$flavor
pushd /usr/src/kernel-modules/nvidia-390.25-$flavor 
make -f Makefile \
     nv-linux.o \
     SYSSRC=/lib/modules/$kver/source \
     SYSOUT=/usr/src/linux-obj/$arch/$flavor
popd
install -m 755 -d /lib/modules/4.15.2-1-$flavor/updates
install -m 644 /usr/src/kernel-modules/nvidia-390.25-$flavor/nvidia*.ko \
        /lib/modules/4.15.2-1-$flavor/updates
depmod 4.15.2-1-$flavor
---------

It compiles for the latest kernel (kver resolves to 4.15.4-1-default), but then installs the modules to the (fixed) location of the old modules. 
 
IMHO the install should go to /lib/modules/$kver/updates
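
The install step suggested above could look roughly like the following sketch, which derives the destination from the release the build used rather than from a fixed string. The function name `install_built_modules` and the optional root prefix argument are assumptions added so the sketch can be tried without root privileges; they are not taken from the package scripts.

```shell
# Hypothetical helper: install freshly built modules into the tree of the
# kernel release they were built against.
#   $1 = kernel release the build used (the scriptlet's $kver)
#   $2 = directory containing the built nvidia*.ko files
#   $3 = optional root prefix (assumption, for trying this outside of /)
install_built_modules() {
    kver=$1
    builddir=$2
    root=${3:-}
    dest="$root/lib/modules/$kver/updates"
    install -m 755 -d "$dest" || return 1
    install -m 644 "$builddir"/nvidia*.ko "$dest" || return 1
    printf '%s\n' "$dest"
}
```

In the trigger this would be followed by `depmod $kver`, so that build, install and depmod all refer to the same release.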
Comment 10 Stefan Dirsch 2018-02-26 16:25:19 UTC
Ok. You're right. I was testing on Leap 42.3. Indeed there we have the rebuild in %post and in the %triggers. On TW only in the %triggers.

I also believe it's correct to install to /lib/modules/$kver/updates instead
of the hardcoded path in the trigger scripts for TW.
Comment 11 Peter Sütterlin 2018-02-26 16:46:49 UTC
(In reply to Stefan Dirsch from comment #10)
> Ok. You're right. I was testing on Leap 42.3. Indeed there we have the
> rebuild in %post and in the %triggers. On TW only in the %triggers.

one mystery solved :)
 
> I also believe it's correct to install to /lib/modules/$kver/updates instead
> of the hardcoded path in the trigger scripts for TW.

I think it's wrong in both TW and Leap.

kver=$(make -sC /usr/src/linux-obj/$arch/$flavor kernelrelease)

will always point to the latest installed kernel.  This *can* be the one the modules were compiled for originally.  In that case $kver=4.4.76-1-$flavor (in your case).  But if it's different, it would overwrite the original one with a 'broken' one.  Likely the error doesn't show up in Leap, as that one doesn't really do big kernel jumps.

In principle it's easy:  If you compile against $kver, then also install in /lib/modules/$kver .....
Comment 12 Stefan Dirsch 2018-02-26 17:07:21 UTC
The reason the issue doesn't show up on Leap 42.3 is that we are kABI compatible there. And without using the fixed tree, the weak-updates mechanism wouldn't work.

On TW we're no longer kABI compatible. 

Anyway, I've changed this now. But I haven't done any testing yet.

Mon Feb 26 16:22:07 UTC 2018 - sndirsch@suse.com

- rebuilded kernel modules in %trigger of TW packages should go to
  the tree against which the kernel module gets builded, not the
  hardcoded one during build of the package; introduced
  kmp-trigger.sh/kmp-trigger-old.sh script snippets for this based
  on kmp-post.sh/kmp-post-old.sh (boo#1082704)

--> obs://X11:Drivers:Video/nvidia-gfxG04

You can do a manual build, if you want (check the README file).
Comment 13 Peter Sütterlin 2018-02-26 17:29:30 UTC
(In reply to Stefan Dirsch from comment #12)
> The reason the issue doesn't show up on Leap 42.3 is that we are kABI
> compatible there. And without using the fixed tree, the weak-updates
> mechanism wouldn't work.

> On TW we're no longer kABI compatible. 

Ah!  Then you wouldn't use/need it in TW at all...

> Anyway, I've changed this now. But I haven't done any testing yet.
> 
> Mon Feb 26 16:22:07 UTC 2018 - sndirsch@suse.com
> 
> - rebuilded kernel modules in %trigger of TW packages should go to
>   the tree against which the kernel module gets builded, not the
>   hardcoded one during build of the package; introduced
>   kmp-trigger.sh/kmp-trigger-old.sh script snippets for this based
>   on kmp-post.sh/kmp-post-old.sh (boo#1082704)
> 
> --> obs://X11:Drivers:Video/nvidia-gfxG04
> 
> You can do a manual build, if you want (check the README file).

I had a look at kmp-post.sh; it looks fine to me.

I had just manually compiled the 4.15.2-1-default version of the modules, based on the old TW script, changing $kver and the related directory names (linux-obj -> linux-${kver%$flavor}obj). Went fine, depmod $kver doesn't give errors.
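
The directory substitution mentioned above (linux-obj -> linux-${kver%$flavor}obj) relies on the shell's suffix-stripping expansion. As a small sketch, with `obj_dir_for` being a hypothetical helper name:

```shell
# Derive the per-version object directory for a given kernel release.
#   $1 = kernel release, $2 = flavor, $3 = architecture
obj_dir_for() {
    kver=$1
    flavor=$2
    arch=$3
    # ${kver%"$flavor"} strips the trailing flavor:
    # 4.15.2-1-default -> 4.15.2-1-
    printf '/usr/src/linux-%sobj/%s/%s\n' "${kver%"$flavor"}" "$arch" "$flavor"
}

obj_dir_for 4.15.2-1-default default x86_64
# prints /usr/src/linux-4.15.2-1-obj/x86_64/default
```

This matches the target of the symlink shown in the zypp log in comment 7.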

Thanks!
Comment 14 Stefan Dirsch 2018-02-26 19:04:48 UTC
Probably you need kmp-trigger.sh. Thanks for giving it a try!
Comment 15 Peter Sütterlin 2018-02-28 09:09:51 UTC
I'm not really familiar with obs - is there an easy way to get a real package based on this (or a src.rpm),  or is it scheduled to be included in the official nvidia repo?
I'm currently holding back the update of my nvidia box (there's a kernel update), to check if things work properly....
Comment 16 Stefan Dirsch 2018-02-28 09:51:39 UTC
We cannot build the RPMs in obs, only provide the package sources. Due to legal reasons. :-(

So you would need to build the package yourself. See my comment #12.

I cannot push the changes to NVIDIA before I have tested my changes myself.
Comment 17 Peter Sütterlin 2018-02-28 12:04:00 UTC
Phew, it took a while to master obs...

So I built a package now, nvidia-gfxG04-kmp-default-390.25_k4.15.5_1-0.x86_64.rpm

I installed it on my system, which still runs 4.15.4-1, and it correctly compiled the modules for this kernel and installed them in /lib/modules/4.15.4-1-default/updates/.  No error from depmod.

So far, so good!

However, it looks like the weak-updates script has messed up things with the older kernel that is still installed (4.15.2-1).
That one did have (proper) modules in /lib/modules/4.15.2-1-default/updates/.

After installing the new package, this directory had been deleted, and instead a weak-updates/updates directory had been created, with links to /lib/modules/4.15.5-1-default/updates/ (i.e., the ones that came with the rpm).

Furthermore, those modules have been deleted, too (don't know whether by the install, or by the weak-update script - probably the latter), so the links go to nowhere :(

As you mentioned yourself, the weak-updates mechanism relies on kABI stability, which is not given for Tumbleweed.  It probably should not be used at all?

(and while at it - would there be a reason not to use a 'make -j ' for compiling the modules in the trigger script?)
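
For the parallel build question above, one possible way to pick a job count is sketched below. This is only an illustration, not what the package scripts do; `parallel_jobs` is a hypothetical helper, and the fallback chain is an assumption.

```shell
# Print a job count for make: the number of online CPUs, or 1 if it
# cannot be determined.
parallel_jobs() {
    n=$(nproc 2>/dev/null) ||
        n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) ||
        n=1
    printf '%s\n' "$n"
}

# How the trigger's make call might use it (shown as a comment, not run):
#   make -j"$(parallel_jobs)" -C /usr/src/linux-obj/$arch/$flavor modules \
#        M=/usr/src/kernel-modules/nvidia-390.25-$flavor
echo "would build with -j$(parallel_jobs)"
```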
Comment 18 Peter Sütterlin 2018-02-28 12:32:22 UTC
Final progress note:

Just did a zypper dup (-> new kernel) with the self-compiled version installed.

Builds and installs the module for the new kernel correctly, and this time did not touch the previously compiled ones of the older kernel.

(and I just realize that the reported removal of the 4.15.2 modules was of course because they were part of nvidia-gfxG04-kmp-default-390.25_k4.15.2_1-10.1, which gets removed by an update....)
Comment 19 Stefan Dirsch 2018-02-28 13:53:26 UTC
I agree and have removed the weak-updates run on TW.

Wed Feb 28 13:50:50 UTC 2018 - sndirsch@suse.com

- do not run weak-updates on TW, since it creates more harm than
  benefit (boo#1082704)
Comment 20 Stefan Dirsch 2018-03-02 13:53:41 UTC
I was finally able to test the changes. Looks good so far. The changes will be in the next driver update. Closing as fixed.
Comment 21 Sebastian Turzański 2021-06-14 21:58:50 UTC
I think this is happening again after the June 2021 updates.

I get kernel: nvidia: disagrees about version of symbol module_layout

but I only have version 5.12.9-1.1 installed:

sudo rpm -qa |grep kernel
kernel-firmware-qcom-20210503-1.2.noarch
kernel-firmware-network-20210503-1.2.noarch
kernel-firmware-radeon-20210503-1.2.noarch
kernel-firmware-ath11k-20210503-1.2.noarch
kernel-firmware-realtek-20210503-1.2.noarch
kernel-firmware-sound-20210503-1.2.noarch
kernel-firmware-usb-network-20210503-1.2.noarch
kernel-firmware-qlogic-20210503-1.2.noarch
kernel-firmware-chelsio-20210503-1.2.noarch
kernel-firmware-bnx2-20210503-1.2.noarch
kernel-firmware-all-20210503-1.2.noarch
kernel-firmware-ti-20210503-1.2.noarch
kernel-firmware-marvell-20210503-1.2.noarch
kernel-firmware-atheros-20210503-1.2.noarch
kernel-firmware-liquidio-20210503-1.2.noarch
kernel-firmware-bluetooth-20210503-1.2.noarch
kernel-firmware-mediatek-20210503-1.2.noarch
kernel-firmware-serial-20210503-1.2.noarch
kernel-firmware-intel-20210503-1.2.noarch
kernel-firmware-nfp-20210503-1.2.noarch
kernel-firmware-nvidia-20210503-1.2.noarch
kernel-firmware-mellanox-20210503-1.2.noarch
kernel-firmware-dpaa2-20210503-1.2.noarch
kernel-firmware-iwlwifi-20210503-1.2.noarch
kernel-firmware-amdgpu-20210503-1.2.noarch
purge-kernels-service-0-8.1.noarch
kernel-firmware-i915-20210503-1.2.noarch
kernel-firmware-ueagle-20210503-1.2.noarch
kernel-firmware-brcm-20210503-1.2.noarch
kernel-firmware-mwifiex-20210503-1.2.noarch
kernel-firmware-media-20210503-1.2.noarch
kernel-firmware-ath10k-20210503-1.2.noarch
kernel-firmware-platform-20210503-1.2.noarch
kernel-syms-5.12.9-1.1.x86_64
kernel-firmware-prestera-20210503-1.2.noarch
kernel-devel-5.12.9-1.1.noarch
kernel-default-devel-5.12.9-1.1.x86_64
kernel-source-5.12.9-1.1.noarch
kernel-default-5.12.9-1.1.x86_64
kernel-macros-5.12.9-1.1.noarch