Bug 1168832 - nvidia 440.64: refcount_t: underflow; use-after-free
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: X11 3rd Party Driver
Version: Current
Hardware: x86-64 Other
Priority: P3 - Medium
Severity: Normal
Target Milestone: ---
Assigned To: Stefan Dirsch
Reported: 2020-04-07 08:52 UTC by Dan Čermák
Modified: 2020-04-20 12:07 UTC

Attachments
Kernel trace (5.93 KB, text/plain)
2020-04-07 08:52 UTC, Dan Čermák
Description Dan Čermák 2020-04-07 08:52:53 UTC
Created attachment 835058
Kernel trace

I've found the attached trace in my system log this morning after booting the machine with the latest kernel on Tumbleweed.
Comment 1 Takashi Iwai 2020-04-07 09:27:16 UTC
Looks like some bug in nouveau driver.

You didn't get this in the earlier kernels, right?
Comment 2 Dan Čermák 2020-04-07 09:35:21 UTC
(In reply to Takashi Iwai from comment #1)
> Looks like some bug in nouveau driver.

I thought I was not using nouveau: I've installed the proprietary NVIDIA driver but keep the NVIDIA GPU off via prime-select.

> 
> You didn't get this in the earlier kernels, right?

Turns out I did:

Boreas:~ # journalctl |grep refcount
Feb 13 17:05:37 Boreas kernel: refcount_t: underflow; use-after-free.
Feb 13 17:05:37 Boreas kernel: WARNING: CPU: 9 PID: 2991 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Feb 13 17:05:37 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Feb 17 01:11:31 Boreas kernel: refcount_t: underflow; use-after-free.
Feb 17 01:11:31 Boreas kernel: WARNING: CPU: 9 PID: 2712 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Feb 17 01:11:31 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Feb 24 01:24:37 Boreas kernel: refcount_t: underflow; use-after-free.
Feb 24 01:24:37 Boreas kernel: WARNING: CPU: 10 PID: 2728 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Feb 24 01:24:37 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Mar 02 00:59:51 Boreas kernel: refcount_t: underflow; use-after-free.
Mar 02 00:59:51 Boreas kernel: WARNING: CPU: 6 PID: 2727 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Mar 02 00:59:51 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Mar 09 01:18:05 Boreas kernel: refcount_t: underflow; use-after-free.
Mar 09 01:18:05 Boreas kernel: WARNING: CPU: 7 PID: 2746 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Mar 09 01:18:05 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Mar 23 01:01:32 Boreas kernel: refcount_t: underflow; use-after-free.
Mar 23 01:01:32 Boreas kernel: WARNING: CPU: 6 PID: 3163 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Mar 23 01:01:32 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Mar 24 00:17:49 Boreas kernel: refcount_t: underflow; use-after-free.
Mar 24 00:17:49 Boreas kernel: WARNING: CPU: 0 PID: 3727 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Mar 24 00:17:49 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Mar 30 00:47:55 Boreas kernel: refcount_t: underflow; use-after-free.
Mar 30 00:47:55 Boreas kernel: WARNING: CPU: 8 PID: 3790 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Mar 30 00:47:55 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Apr 05 10:09:45 Boreas kernel: refcount_t: underflow; use-after-free.
Apr 05 10:09:45 Boreas kernel: WARNING: CPU: 5 PID: 2774 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Apr 05 10:09:45 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Apr 07 07:33:34 Boreas kernel: refcount_t: underflow; use-after-free.
Apr 07 07:33:34 Boreas kernel: WARNING: CPU: 9 PID: 2754 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Apr 07 07:33:34 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
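
The grep output above contains three lines per warning (the message itself, the WARNING line, and the RIP line). A minimal Python sketch, assuming the journal lines keep the "Mon DD HH:MM:SS host kernel: message" layout shown above, reduces them to one timestamp per event. The function name `refcount_events` and the regular expression are illustrative and not part of any tool mentioned in this report.

```python
import re

# Matches lines such as:
#   Feb 13 17:05:37 Boreas kernel: refcount_t: underflow; use-after-free.
LINE_RE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ kernel: (.*)$")


def refcount_events(lines):
    """Return one timestamp per 'refcount_t: underflow' message line."""
    stamps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(2).startswith("refcount_t: underflow"):
            stamps.append(match.group(1))
    return stamps


# The first two events from the journal output above.
sample = [
    "Feb 13 17:05:37 Boreas kernel: refcount_t: underflow; use-after-free.",
    "Feb 13 17:05:37 Boreas kernel: WARNING: CPU: 9 PID: 2991 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0",
    "Feb 13 17:05:37 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0",
    "Feb 17 01:11:31 Boreas kernel: refcount_t: underflow; use-after-free.",
    "Feb 17 01:11:31 Boreas kernel: WARNING: CPU: 9 PID: 2712 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0",
    "Feb 17 01:11:31 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0",
]

print(refcount_events(sample))
```

Fed the full journal output, the same function would yield the ten event timestamps between Feb 13 and Apr 07.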
Comment 3 Takashi Iwai 2020-04-07 09:39:48 UTC
(In reply to Dan Čermák from comment #2)
> (In reply to Takashi Iwai from comment #1)
> > Looks like some bug in nouveau driver.
> 
> I thought that I was not using nouveau, I've installed the proprietary
> NVidia driver but keep the nvidia gpu off via prime-select.

Ah right, that must be it; I didn't take a close look.

So this is more likely an Nvidia binary driver problem.  Adding Stefan to Cc.
Comment 4 Stefan Dirsch 2020-04-07 10:00:03 UTC
This doesn't make much sense to me. You can't switch off the NVIDIA GPU as long as a kernel module for it, such as nouveau or the ones from NVIDIA, is loaded (only the bbswitch module can be loaded).

So could you answer one question first: are you interested in using NVIDIA's proprietary driver or in disabling your NVIDIA GPU?
Comment 5 Dan Čermák 2020-04-07 13:44:19 UTC
(In reply to Stefan Dirsch from comment #4)
> This doesn't make much sense to me. You can't switch off the NVIDIA GPU as
> long as a kernel module for it, such as nouveau or the ones from NVIDIA, is
> loaded (only the bbswitch module can be loaded).

I was under the impression that the whole point of suse-prime is to disable the NVIDIA GPU if you switch to the intel driver?

> 
> So could you answer one question first: are you interested in using NVIDIA's
> proprietary driver or in disabling your NVIDIA GPU?

I'd like the NVIDIA GPU to be off, but I do occasionally need the proprietary driver, so I'd prefer not to remove it completely, if that is somehow possible.
Comment 6 Stefan Dirsch 2020-04-07 21:27:26 UTC
Ok. You can do this, if you use the suse-prime-bbswitch package. 

Can you make sure that nvidia kernel modules are not loaded (lsmod | grep nvidia) and then check if the refcount messages are shown nevertheless?
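
The check requested above can also be scripted. The following is a minimal Python sketch that parses text in the /proc/modules format and returns the loaded modules whose names start with "nvidia". The helper name `nvidia_modules` and the sample module list (including its sizes and use counts) are made up for illustration. Unlike `lsmod | grep nvidia`, it matches module names only, not the "used by" column.

```python
def nvidia_modules(proc_modules_text):
    """Return names of loaded modules that start with 'nvidia'.

    The input is text in the /proc/modules layout, where the first
    whitespace-separated field of each line is the module name.
    """
    names = [
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    ]
    return [name for name in names if name.startswith("nvidia")]


# Hypothetical /proc/modules excerpt; sizes and use counts are invented.
sample = """\
nvidia_drm 57344 3 - Live 0x0000000000000000
nvidia_modeset 1118208 6 nvidia_drm, Live 0x0000000000000000
nvidia 20488192 248 nvidia_modeset, Live 0x0000000000000000
i915 2412544 25 - Live 0x0000000000000000
"""

print(nvidia_modules(sample))

# On a live system the input could be read like this:
#   from pathlib import Path
#   print(nvidia_modules(Path("/proc/modules").read_text()))
```

An empty result would correspond to `lsmod | grep nvidia` printing no module named nvidia*.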
Comment 7 Dan Čermák 2020-04-08 06:12:23 UTC
(In reply to Stefan Dirsch from comment #6)
> Ok. You can do this, if you use the suse-prime-bbswitch package. 

The official docs state that bbswitch is not supposed to be used with the G05 driver packages (which I am using). Should I use that nevertheless or use this instead: https://github.com/openSUSE/SUSEPrime#nvidia-power-off-support-with-435xxx-driver-and-later-g05-driver-packages ?

> Can you make sure that nvidia kernel modules are not loaded (lsmod | grep
> nvidia) and then check if the refcount messages are shown nevertheless?
Comment 8 Stefan Dirsch 2020-04-08 08:35:10 UTC
(In reply to Dan Čermák from comment #7)
> (In reply to Stefan Dirsch from comment #6)
> > Ok. You can do this, if you use the suse-prime-bbswitch package. 
> 
> The official docs state that bbswitch is not supposed to be used with the
> G05 driver packages (which I am using). Should I use that nevertheless or
> use this instead:
> https://github.com/openSUSE/SUSEPrime#nvidia-power-off-support-with-435xxx-
> driver-and-later-g05-driver-packages ?

Unfortunately that's not the whole truth. From the README:

Chapter 22. PCI-Express Runtime D3 (RTD3) Power Management
[...]
22B. SUPPORTED CONFIGURATIONS

This feature is available only when the following conditions are satisfied:

   o This feature is supported only on notebooks.

   o This feature requires system hardware as well as ACPI support (ACPI
     "_PR0" and "_PR3" methods are needed to control PCIe power). The
     necessary hardware and ACPI support was first added in Intel Coffeelake
     chipset series. Hence, this feature is supported from Intel Coffeelake
     chipset series.

   o This feature requires a Turing or newer GPU.

   o This feature is supported with Linux kernel versions 4.18 and newer. With
     older kernel versions, it may not work as intended.

   o This feature is supported when Linux kernel defines CONFIG_PM
     (CONFIG_PM=y). Typically, if the system supports S3 (suspend-to-RAM),
     then CONFIG_PM would be defined.
[...]
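
Two of the conditions quoted above (kernel version 4.18 or newer, and CONFIG_PM=y) can be checked mechanically. The sketch below is a hedged illustration in Python: the helper names `kernel_at_least` and `config_pm_enabled` are hypothetical, and the notebook, chipset, ACPI and GPU-generation conditions are not covered at all.

```python
import re


def kernel_at_least(release, minimum=(4, 18)):
    """Compare the major.minor part of a kernel release string with a minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def config_pm_enabled(config_text):
    """Return True if the kernel config text contains the exact line CONFIG_PM=y."""
    return any(line.strip() == "CONFIG_PM=y" for line in config_text.splitlines())


# Example inputs; both release strings are made-up illustrations.
print(kernel_at_least("5.6.2-1-default"))
print(kernel_at_least("4.12.14-lp151.28-default"))
print(config_pm_enabled("CONFIG_PM_SLEEP=y\nCONFIG_PM=y\n"))
```

On a running system, `platform.release()` from the Python standard library returns the release string to pass to `kernel_at_least`. Where the kernel config file lives depends on the distribution, so it is left as an input here.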

I doubt your laptop doesn't have a  NVIDIA Turing GPU.

https://en.wikipedia.org/wiki/Turing_(microarchitecture)
Comment 9 Stefan Dirsch 2020-04-08 08:39:05 UTC
> I doubt your laptop doesn't have a  NVIDIA Turing GPU.
I doubt your laptop *has* an NVIDIA Turing GPU.

> Can you make sure that nvidia kernel modules are not loaded (lsmod | grep nvidia) and then check if the refcount messages are shown 
> nevertheless?

I would like to see whether these messages you're complaining about are related to the NVIDIA kernel modules at all ... or are a generic kernel regression.
Comment 10 Dan Čermák 2020-04-09 06:43:14 UTC
(In reply to Stefan Dirsch from comment #9)
> > I doubt your laptop doesn't have a  NVIDIA Turing GPU.
>  I doubt your laptop *has* a  NVIDIA Turing GPU.

Indeed, it's a Quadro P1000.

> 
> > Can you make sure that nvidia kernel modules are not loaded (lsmod | grep nvidia) and then check if the refcount messages are shown 
> > nevertheless?
> 
> I would like to see whether these messages you're complaining about are
> related to the NVIDIA kernel modules at all ... or are a generic kernel
> regression.

I have set up suse-prime-bbswitch and will see whether the error pops up again.
Comment 11 Stefan Dirsch 2020-04-09 07:26:49 UTC
Thanks. And please make sure you're now really in Intel mode without any nvidia kernel modules in place.
Comment 12 Dan Čermák 2020-04-20 10:03:38 UTC
(In reply to Stefan Dirsch from comment #11)
> Thanks. And please make sure you're now really in Intel mode without any
> nvidia kernel modules in place.

So far the last underflow was on April 7th, and given the previous occurrences, it should have shown up again by now:

journalctl |grep refcount
Feb 13 17:05:37 Boreas kernel: refcount_t: underflow; use-after-free.
Feb 13 17:05:37 Boreas kernel: WARNING: CPU: 9 PID: 2991 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Feb 13 17:05:37 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Feb 17 01:11:31 Boreas kernel: refcount_t: underflow; use-after-free.
Feb 17 01:11:31 Boreas kernel: WARNING: CPU: 9 PID: 2712 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Feb 17 01:11:31 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Feb 24 01:24:37 Boreas kernel: refcount_t: underflow; use-after-free.
Feb 24 01:24:37 Boreas kernel: WARNING: CPU: 10 PID: 2728 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Feb 24 01:24:37 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Mar 02 00:59:51 Boreas kernel: refcount_t: underflow; use-after-free.
Mar 02 00:59:51 Boreas kernel: WARNING: CPU: 6 PID: 2727 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Mar 02 00:59:51 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Mar 09 01:18:05 Boreas kernel: refcount_t: underflow; use-after-free.
Mar 09 01:18:05 Boreas kernel: WARNING: CPU: 7 PID: 2746 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Mar 09 01:18:05 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Mar 23 01:01:32 Boreas kernel: refcount_t: underflow; use-after-free.
Mar 23 01:01:32 Boreas kernel: WARNING: CPU: 6 PID: 3163 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Mar 23 01:01:32 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Mar 24 00:17:49 Boreas kernel: refcount_t: underflow; use-after-free.
Mar 24 00:17:49 Boreas kernel: WARNING: CPU: 0 PID: 3727 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Mar 24 00:17:49 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Mar 30 00:47:55 Boreas kernel: refcount_t: underflow; use-after-free.
Mar 30 00:47:55 Boreas kernel: WARNING: CPU: 8 PID: 3790 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Mar 30 00:47:55 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Apr 05 10:09:45 Boreas kernel: refcount_t: underflow; use-after-free.
Apr 05 10:09:45 Boreas kernel: WARNING: CPU: 5 PID: 2774 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Apr 05 10:09:45 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Apr 07 07:33:34 Boreas kernel: refcount_t: underflow; use-after-free.
Apr 07 07:33:34 Boreas kernel: WARNING: CPU: 9 PID: 2754 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Apr 07 07:33:34 Boreas kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0



Unfortunately due to http://bugzilla.opensuse.org/show_bug.cgi?id=1169386 I've had to switch back to suse-prime, so the nvidia kernel modules are now loaded again.
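
The "should have shown up by now" reasoning can be checked with a little date arithmetic. The Python sketch below takes the ten warning dates from the journal output, assumes the year 2020 (from the comment dates in this report), and compares the gaps between warnings with the quiet period up to the date of comment 12.

```python
from datetime import date

# Dates of the ten warnings in the journal output; the year is assumed to be 2020.
occurrences = [
    date(2020, 2, 13), date(2020, 2, 17), date(2020, 2, 24),
    date(2020, 3, 2), date(2020, 3, 9), date(2020, 3, 23),
    date(2020, 3, 24), date(2020, 3, 30), date(2020, 4, 5),
    date(2020, 4, 7),
]


def gaps_in_days(dates):
    """Days between each pair of consecutive dates."""
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


gaps = gaps_in_days(occurrences)
# Days without a warning between the last occurrence and comment 12.
quiet_days = (date(2020, 4, 20) - occurrences[-1]).days

print("gaps:", gaps)
print("longest gap:", max(gaps))
print("quiet days so far:", quiet_days)
```

The quiet period is longer than every earlier gap except the longest one, which is suggestive but not proof that the warning is gone.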
Comment 13 Stefan Dirsch 2020-04-20 10:51:17 UTC
Ok. So we no longer see the issue in intel mode with nvidia kernel modules loaded. So this could be a regression in the nouveau kernel module.
Comment 14 Takashi Iwai 2020-04-20 11:02:27 UTC
(In reply to Stefan Dirsch from comment #13)
> Ok. So we no longer see the issue in intel mode with nvidia kernel modules
> loaded. So this could be a regression in the nouveau kernel module.

Very unlikely.  The refcount warning does come from Nvidia binary driver code and has nothing to do with nouveau.

Since it's a refcount warning and it happens intermittently after long use, it's more likely a bug that is triggered either by an occasional race or by accumulated usage counts.  In intel mode, such a condition won't be triggered, hence the bug isn't seen.

That said, this should be reported to the Nvidia people.
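
To illustrate what a "refcount_t: underflow; use-after-free" warning means, here is a deliberately simplified toy model in Python. It is not kernel code and not the NVIDIA driver's logic: the class `ToyRefcount`, its methods, and the `SATURATED` value are stand-ins chosen for this sketch. It only shows the general idea that dropping a reference on a counter that has already reached zero is flagged instead of silently wrapping around.

```python
SATURATED = -(2 ** 30)  # stand-in "poisoned" value, an assumption of this toy


class ToyRefcount:
    """Toy reference counter that flags a put() on an already-released object."""

    def __init__(self):
        self.refs = 1        # the creator holds the first reference
        self.warnings = []   # collected warning messages

    def get(self):
        """Take a reference, unless the counter is already saturated."""
        if self.refs != SATURATED:
            self.refs += 1

    def put(self):
        """Drop a reference; return True when the last reference is gone."""
        if self.refs == SATURATED:
            return False
        if self.refs <= 0:
            # One put() too many: the object was already released.
            self.warnings.append("underflow; use-after-free")
            self.refs = SATURATED
            return False
        self.refs -= 1
        return self.refs == 0


counter = ToyRefcount()
counter.get()            # a second user takes a reference
print(counter.put())     # first user done, object still in use
print(counter.put())     # second user done, object may be freed
print(counter.put())     # unbalanced extra put triggers the warning
print(counter.warnings)
```

An unbalanced put like the last one can come from a race between two code paths or from a counting error that only surfaces after long use, which matches the intermittent pattern described above.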
Comment 15 Stefan Dirsch 2020-04-20 12:05:12 UTC
Seems to be related:

https://bugzilla.redhat.com/show_bug.cgi?id=1806257

Ok. Seems 440.82 has fixed the issue.
Comment 16 Stefan Dirsch 2020-04-20 12:07:04 UTC
Fixed.