Bug 1103356 - nouveau: fan stays on maximum speed after fanboost
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: Leap 15.0
Hardware: x86-64 Other
Priority: P5 - None
Severity: Normal
Assigned To: Takashi Iwai
Reported: 2018-08-01 08:01 UTC by Thomas Blume
Modified: 2022-03-04 20:49 UTC (History)


Attachments
nouveau hwmon sysfs values Leap42.3 (20.37 KB, text/plain)
2018-08-01 08:02 UTC, Thomas Blume
nouveau hwmon sysfs values Leap15 (19.19 KB, text/plain)
2018-08-01 08:03 UTC, Thomas Blume
logs from debug kernel (71.22 KB, application/x-bzip)
2018-08-29 12:48 UTC, Thomas Blume
logs from new debug kernel (49.84 KB, application/x-bzip)
2018-08-31 09:23 UTC, Thomas Blume
debug logs from : home:tiwai:bsc1103356-test2 (43.84 KB, application/x-bzip)
2018-08-31 15:16 UTC, Thomas Blume
Test fix patch (1.29 KB, patch)
2018-08-31 15:30 UTC, Takashi Iwai

Description Thomas Blume 2018-08-01 08:01:28 UTC
I have an older NVIDIA card:

-->
# hwinfo --gfxcard
23: PCI 100.0: 0300 VGA compatible controller (VGA)             
  [Created at pci.378]
  Unique ID: VCu0.x9HhAPKYST4
  Parent ID: vSkL.sBCJa6uSmM6
  SysFS ID: /devices/pci0000:00/0000:00:01.0/0000:01:00.0
  SysFS BusID: 0000:01:00.0
  Hardware Class: graphics card
  Model: "nVidia Quadro FX 1500"
  Vendor: pci 0x10de "nVidia Corporation"
  Device: pci 0x029e "Quadro FX 1500"
  SubVendor: pci 0x10de "nVidia Corporation"
  SubDevice: pci 0x032c 
  Revision: 0xa1
  Driver: "nouveau"
  Driver Modules: "drm"
  Memory Range: 0xf2000000-0xf2ffffff (rw,non-prefetchable)
  Memory Range: 0xe0000000-0xefffffff (ro,non-prefetchable)
  Memory Range: 0xf1000000-0xf1ffffff (rw,non-prefetchable)
  I/O Ports: 0x4000-0x4fff (rw)
  IRQ: 28 (8673 events)
  I/O Ports: 0x3c0-0x3df (rw)
  Module Alias: "pci:v000010DEd0000029Esv000010DEsd0000032Cbc03sc00i00"
  Driver Info #0:
    XFree86 v4 Server Module: nv
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #8 (PCI bridge)

Primary display adapter: #23
--<

This worked fine with the nouveau driver up to and including Leap 42.3.
With Leap 15, the graphics card fan starts running at maximum speed and never slows down.
The switch to maximum speed might be related to temperature management.
I get the following messages:

-->
2018-08-01T07:16:01.878010+02:00 alpha kernel: [  379.049048] nouveau 0000:01:00.0: therm: temperature (90 C) hit the 'fanboost' threshold
2018-08-01T07:16:08.881928+02:00 alpha kernel: [  386.049548] nouveau 0000:01:00.0: therm: temperature (87 C) went below the 'fanboost' threshold
--<

I would expect the fan speed to decrease after the temperature goes below the fanboost threshold, but it doesn't.
As written above, on 42.3 the fan stays nice and quiet at a low rotation speed:

-->
# cat /sys/class/drm/card0/device/hwmon/hwmon0/pwm1
20
--<

Attaching the sysfs values of the nouveau hwmon from 42.3 and 15.
Any hint what I need to tune to get the 42.3 behaviour back?
Comment 1 Thomas Blume 2018-08-01 08:02:29 UTC
Created attachment 778568 [details]
nouveau hwmon sysfs values Leap42.3
Comment 2 Thomas Blume 2018-08-01 08:03:16 UTC
Created attachment 778569 [details]
nouveau hwmon sysfs values Leap15
Comment 3 Takashi Iwai 2018-08-01 08:31:45 UTC
Did you try the very latest kernel in OBS Kernel:openSUSE-15.0?
Basically the nouveau drm driver on Leap 15.0 took all 4.14.y backports.
It might be something missing in hwmon side, though.
Comment 4 Thomas Blume 2018-08-01 09:56:34 UTC
(In reply to Takashi Iwai from comment #3)
> Did you try the very latest kernel in OBS Kernel:openSUSE-15.0?
> Basically the nouveau drm driver on Leap 15.0 took all 4.14.y backports.
> It might be something missing in hwmon side, though.

Thanks for the hint Takashi.
I've tried with:

kernel-default-4.12.14-lp150.93.1.g8ee019b.x86_64 from https://download.opensuse.org/repositories/Kernel:/openSUSE-15.0/standard/

but the fan still runs at high speed, creating noise.

/sys/class/drm/card0/device/hwmon/hwmon0/pwm1

shows it running at 70%.
gpu temperature at 72°C.
Rebooting to 42.3 and it goes back to 20%.
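
When comparing kernels like this, it can help to sample the duty value together with the temperature instead of reading pwm1 once. Below is a minimal sketch, assuming the hwmon path quoted in this report (the hwmon index may differ on other machines or boots) and assuming that `temp1_input` in the same directory reports millidegrees Celsius, as is usual for hwmon. The names `read_int` and `sample` are hypothetical helpers, not part of any tool mentioned here.

```python
from pathlib import Path

# Path taken from this report; the hwmonN index is an assumption and may differ.
HWMON = Path("/sys/class/drm/card0/device/hwmon/hwmon0")


def read_int(path):
    """Return the integer stored in a sysfs attribute file."""
    return int(Path(path).read_text().strip())


def sample(hwmon_dir=HWMON):
    """Return (pwm1 value, temperature in degrees C) for one hwmon directory."""
    hwmon_dir = Path(hwmon_dir)
    pwm = read_int(hwmon_dir / "pwm1")
    # hwmon temperatures are conventionally given in millidegrees Celsius.
    temp_c = read_int(hwmon_dir / "temp1_input") / 1000.0
    return pwm, temp_c
```

Calling `sample()` every few seconds under each kernel would show whether pwm1 ever follows the temperature back down.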
Comment 5 Takashi Iwai 2018-08-01 10:14:36 UTC
So you tested Leap 42.3 kernel on top of Leap 15.0 system, or tested the whole Leap 42.3 system?  If the latter, please try the former; just install Leap 42.3 kernel on top of Leap 15.0 (with --force --oldpackage or whatever option), and check whether the problem doesn't happen with it.

If the problem doesn't happen with Leap 42.3 kernel, then please try TW kernel.  I checked the nouveau_hwmon code but there is no significant change, at least.  So it must be really a high temperature due to some incorrect mode (no proper power saving, etc).
Comment 6 Thomas Blume 2018-08-01 11:28:55 UTC
(In reply to Takashi Iwai from comment #5)
> So you tested Leap 42.3 kernel on top of Leap 15.0 system, or tested the
> whole Leap 42.3 system?  If the latter, please try the former; just install
> Leap 42.3 kernel on top of Leap 15.0 (with --force --oldpackage or whatever
> option), and check whether the problem doesn't happen with it.
> 
> If the problem doesn't happen with Leap 42.3 kernel, then please try TW
> kernel.  I checked the nouveau_hwmon code but there is no significant
> change, at least.  So it must be really a high temperature due to some
> incorrect mode (no proper power saving, etc).

The problem doesn't happen with the Leap 42.3 kernel on top of Leap 15.
It returns when installing the tumbleweed kernel on Leap 15.
Checking the kernel log for differences, I've found that the message below is only shown with the 42.3 kernel:

-->
2018-08-01T12:48:09.794514+02:00 linux-rr7g kernel: [    7.769999] nouveau 0000:01:00.0: DRM: 0xC73F: Parsing digital output script table
--<
Comment 7 Takashi Iwai 2018-08-01 11:58:13 UTC
(In reply to Thomas Blume from comment #6)
> The problem doesn't happen with the Leap 42.3 kernel on top of Leap 15.
> It returns when installing the tumbleweed kernel on Leap 15.

Thanks, so this is a still remaining regression.
Could you report it to upstream?  e.g. bugzilla.freedesktop.org category DRI/Nouveau.

> Checking the kernel log for differences, I've found that the message below
> is only shown with the 42.3 kernel:
> 
> -->
> 2018-08-01T12:48:09.794514+02:00 linux-rr7g kernel: [    7.769999] nouveau
> 0000:01:00.0: DRM: 0xC73F: Parsing digital output script table
> --<

This is a part of BIOS parsing stuff, so something might be missing in the recent kernel relevant with it...
Comment 8 Thomas Blume 2018-08-01 14:12:20 UTC
(In reply to Takashi Iwai from comment #7)
> (In reply to Thomas Blume from comment #6)
> > The problem doesn't happen with the Leap 42.3 kernel on top of Leap 15.
> > It returns when installing the tumbleweed kernel on Leap 15.
> 
> Thanks, so this is a still remaining regression.
> Could you report it to upstream?  e.g. bugzilla.freedesktop.org category
> DRI/Nouveau.
> 

Ah, that was the right pointer.
I've checked the bug reports there and found the debug option for the nouveau driver. Activating it I can see this:

-->
# grep 'therm' /mnt/dmesg-4_17.txt 
[    6.595497] nouveau 0000:01:00.0: therm: FAN control: PWM
[    6.595504] nouveau 0000:01:00.0: therm: parsing the fan table failed
[    6.595515] nouveau 0000:01:00.0: therm: fan management: automatic
[    6.595520] nouveau 0000:01:00.0: therm: FAN target request: 70%
[    6.595525] nouveau 0000:01:00.0: therm: FAN target: 70
[    6.595529] nouveau 0000:01:00.0: therm: FAN update: 23
[    6.595538] nouveau 0000:01:00.0: therm: internal sensor: yes
[    6.615401] nouveau 0000:01:00.0: therm: programmed thresholds [ 90(3), 95(3), 130(2), 135(5) ]
[    7.095580] nouveau 0000:01:00.0: therm: FAN update: 26
[    7.595674] nouveau 0000:01:00.0: therm: FAN update: 29
[    8.095757] nouveau 0000:01:00.0: therm: FAN update: 32
[    8.595853] nouveau 0000:01:00.0: therm: FAN update: 35
[    9.095938] nouveau 0000:01:00.0: therm: FAN update: 38
[    9.596029] nouveau 0000:01:00.0: therm: FAN update: 41
[   10.096105] nouveau 0000:01:00.0: therm: FAN update: 44
[   10.597783] nouveau 0000:01:00.0: therm: FAN update: 47
[   11.099110] nouveau 0000:01:00.0: therm: FAN update: 50
[   11.600452] nouveau 0000:01:00.0: therm: FAN update: 53
[   12.101842] nouveau 0000:01:00.0: therm: FAN update: 56
[   12.603128] nouveau 0000:01:00.0: therm: FAN update: 59
[   13.104425] nouveau 0000:01:00.0: therm: FAN update: 62
[   13.604474] nouveau 0000:01:00.0: therm: FAN update: 65
[   14.104522] nouveau 0000:01:00.0: therm: FAN update: 68
[   14.606060] nouveau 0000:01:00.0: therm: FAN update: 70
--<

Preparing the upstream bug report.
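
The ramp in the log above can be summarized mechanically. The following is a rough sketch that extracts the `FAN update` lines from a dmesg dump in the format shown above; the regular expression and the function names are assumptions made for illustration, not part of nouveau or any existing tool.

```python
import re

# Matches lines such as:
# [    6.595529] nouveau 0000:01:00.0: therm: FAN update: 23
LINE_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+nouveau\s+\S+\s+therm: FAN update: (\d+)"
)


def fan_updates(log_text):
    """Return [(timestamp_seconds, duty), ...] for every 'FAN update' line."""
    return [(float(ts), int(duty)) for ts, duty in LINE_RE.findall(log_text)]


def ramp_summary(updates):
    """Return (first_duty, last_duty, duration_seconds) of the ramp."""
    (t_first, d_first), (t_last, d_last) = updates[0], updates[-1]
    return d_first, d_last, round(t_last - t_first, 3)
```

Applied to the log in this comment, it should show the duty climbing from 23 to the requested target of 70 in roughly eight seconds, with no update afterwards.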
Comment 9 Takashi Iwai 2018-08-01 14:32:03 UTC
Great, feel free to put me (tiwai@suse.de) to Cc if you enter a bug report in freedesktop.org bugzilla.
(But I'm not 100% sure whether Bugzilla is the preferred tracker now, as they moved to gitlab recently.  It used to work reliably in the past, though.)
Comment 10 Takashi Iwai 2018-08-01 19:41:18 UTC
I'm building a test kernel with the revert of a suspected commit (800efb4c2857ec543).
It's being built on OBS home:tiwai:bsc1103356 repo.

Could you give it a try later?
Comment 11 Thomas Blume 2018-08-02 06:52:46 UTC
(In reply to Takashi Iwai from comment #10)
> I'm building a test kernel with the revert of a suspected commit
> (800efb4c2857ec543).
> It's being built on OBS home:tiwai:bsc1103356 repo.
> 
> Could you give it a try later?

Thanks a lot Takashi, I don't have access to the testmachine today, will try tomorrow.
Comment 12 Thomas Blume 2018-08-03 10:00:31 UTC
(In reply to Takashi Iwai from comment #10)
> I'm building a test kernel with the revert of a suspected commit
> (800efb4c2857ec543).
> It's being built on OBS home:tiwai:bsc1103356 repo.
> 
> Could you give it a try later?

This build fixes the issue on my machine.
The dmesg logs show:

-->
Aug 03 11:45:19 linux-rr7g kernel: nouveau 0000:01:00.0: therm: FAN control: PWM
Aug 03 11:45:19 linux-rr7g kernel: nouveau 0000:01:00.0: therm: parsing the fan table failed
Aug 03 11:45:19 linux-rr7g kernel: nouveau 0000:01:00.0: therm: fan management: automatic
Aug 03 11:45:19 linux-rr7g kernel: nouveau 0000:01:00.0: therm: internal sensor: yes
Aug 03 11:45:19 linux-rr7g kernel: nouveau 0000:01:00.0: therm: programmed thresholds [ 90(3), 95(3), 130(2), 135(5) ]
--<

and the fan speed shows:

-->
 # cat /sys/class/drm/card0/device/hwmon/hwmon0/pwm1
20
--<
Comment 13 Takashi Iwai 2018-08-03 11:49:09 UTC
I reverted the commit as a temporary workaround until the upstream fixes it properly.
Comment 14 Swamp Workflow Management 2018-08-06 20:26:05 UTC
This is an autogenerated message for OBS integration:
This bug (1103356) was mentioned in
https://build.opensuse.org/request/show/627749 15.0 / kernel-source
Comment 15 Swamp Workflow Management 2018-08-07 19:24:12 UTC
openSUSE-SU-2018:2242-1: An update that solves two vulnerabilities and has 87 fixes is now available.

Category: security (important)
Bug References: 1012382,1037697,1046299,1046300,1046302,1046303,1046305,1046306,1046307,1046533,1046543,1050242,1050536,1050538,1050540,1051510,1054245,1056651,1056787,1058169,1058659,1060463,1066110,1068032,1075087,1075360,1077338,1077761,1077989,1085042,1085536,1085539,1086301,1086313,1086314,1086324,1086457,1087092,1087202,1087217,1087233,1090098,1090888,1091041,1091171,1093148,1093666,1094119,1096330,1097583,1097584,1097585,1097586,1097587,1097588,1098633,1099193,1100132,1100884,1101143,1101337,1101352,1101465,1101564,1101669,1101674,1101789,1101813,1101816,1102088,1102097,1102147,1102340,1102512,1102851,1103216,1103220,1103230,1103356,1103421,1103517,1103723,1103724,1103725,1103726,1103727,1103728,1103729,1103730
CVE References: CVE-2017-18344,CVE-2018-5390
Sources used:
openSUSE Leap 15.0 (src):    kernel-debug-4.12.14-lp150.12.10.1, kernel-default-4.12.14-lp150.12.10.1, kernel-docs-4.12.14-lp150.12.10.1, kernel-kvmsmall-4.12.14-lp150.12.10.1, kernel-obs-build-4.12.14-lp150.12.10.1, kernel-obs-qa-4.12.14-lp150.12.10.1, kernel-source-4.12.14-lp150.12.10.1, kernel-syms-4.12.14-lp150.12.10.1, kernel-vanilla-4.12.14-lp150.12.10.1
Comment 19 Takashi Iwai 2018-08-13 14:51:18 UTC
I took a deeper look at the patch, and it looks like either the reported temperature or the reported duty value is wrong.

For further debugging, I'm building a test kernel that adds some debug prints (via nvkm_debug() calls).  It reverts the revert patch and should show the buggy behavior again.  Please test it later and report back the debug messages (they appear as "XXX ...").
Comment 20 Takashi Iwai 2018-08-13 14:51:49 UTC
... and it's being built in OBS home:tiwai:bsc1103356-dbg repo.
Comment 23 Swamp Workflow Management 2018-08-16 16:18:14 UTC
SUSE-SU-2018:2380-1: An update that solves 11 vulnerabilities and has 61 fixes is now available.

Category: security (important)
Bug References: 1051510,1051979,1066110,1077761,1086274,1086314,1087081,1089343,1099811,1099813,1099844,1099845,1099846,1099849,1099858,1099863,1099864,1100132,1101116,1101331,1101669,1101828,1101832,1101833,1101837,1101839,1101841,1101843,1101844,1101845,1101847,1101852,1101853,1101867,1101872,1101874,1101875,1101882,1101883,1101885,1101887,1101890,1101891,1101893,1101895,1101896,1101900,1101902,1101903,1102633,1102658,1103097,1103356,1103421,1103517,1103723,1103724,1103725,1103726,1103727,1103728,1103729,1103730,1103917,1103920,1103948,1103949,1104066,1104111,1104174,1104211,1104319
CVE References: CVE-2018-10876,CVE-2018-10877,CVE-2018-10878,CVE-2018-10879,CVE-2018-10880,CVE-2018-10881,CVE-2018-10882,CVE-2018-10883,CVE-2018-3620,CVE-2018-3646,CVE-2018-5391
Sources used:
SUSE Linux Enterprise Workstation Extension 15 (src):    kernel-default-4.12.14-25.13.1
SUSE Linux Enterprise Module for Legacy Software 15 (src):    kernel-default-4.12.14-25.13.1
SUSE Linux Enterprise Module for Development Tools 15 (src):    kernel-docs-4.12.14-25.13.1, kernel-obs-build-4.12.14-25.13.1, kernel-source-4.12.14-25.13.1, kernel-syms-4.12.14-25.13.1, kernel-vanilla-4.12.14-25.13.1, lttng-modules-2.10.0-5.4.2
SUSE Linux Enterprise Module for Basesystem 15 (src):    kernel-default-4.12.14-25.13.1, kernel-source-4.12.14-25.13.1, kernel-zfcpdump-4.12.14-25.13.1
SUSE Linux Enterprise High Availability 15 (src):    kernel-default-4.12.14-25.13.1
Comment 24 Swamp Workflow Management 2018-08-16 16:29:06 UTC
SUSE-SU-2018:2381-1: An update that solves 11 vulnerabilities and has 61 fixes is now available.

Category: security (important)
Bug References: 1051510,1051979,1066110,1077761,1086274,1086314,1087081,1089343,1099811,1099813,1099844,1099845,1099846,1099849,1099858,1099863,1099864,1100132,1101116,1101331,1101669,1101828,1101832,1101833,1101837,1101839,1101841,1101843,1101844,1101845,1101847,1101852,1101853,1101867,1101872,1101874,1101875,1101882,1101883,1101885,1101887,1101890,1101891,1101893,1101895,1101896,1101900,1101902,1101903,1102633,1102658,1103097,1103356,1103421,1103517,1103723,1103724,1103725,1103726,1103727,1103728,1103729,1103730,1103917,1103920,1103948,1103949,1104066,1104111,1104174,1104211,1104319
CVE References: CVE-2018-10876,CVE-2018-10877,CVE-2018-10878,CVE-2018-10879,CVE-2018-10880,CVE-2018-10881,CVE-2018-10882,CVE-2018-10883,CVE-2018-3620,CVE-2018-3646,CVE-2018-5391
Sources used:
SUSE Linux Enterprise Module for Live Patching 15 (src):    kernel-default-4.12.14-25.13.1
Comment 25 Swamp Workflow Management 2018-08-20 13:21:06 UTC
SUSE-SU-2018:2450-1: An update that solves 12 vulnerabilities and has 88 fixes is now available.

Category: security (important)
Bug References: 1051510,1051979,1065600,1066110,1077761,1081917,1083647,1086274,1086288,1086314,1086315,1086317,1086327,1086331,1086906,1087081,1087092,1089343,1090888,1097104,1097577,1097808,1099811,1099813,1099844,1099845,1099846,1099849,1099858,1099863,1099864,1100132,1101116,1101331,1101669,1101822,1101828,1101832,1101833,1101837,1101839,1101841,1101843,1101844,1101845,1101847,1101852,1101853,1101867,1101872,1101874,1101875,1101882,1101883,1101885,1101887,1101890,1101891,1101893,1101895,1101896,1101900,1101902,1101903,1102633,1102658,1103097,1103269,1103277,1103356,1103363,1103421,1103445,1103517,1103723,1103724,1103725,1103726,1103727,1103728,1103729,1103730,1103886,1103917,1103920,1103948,1103949,1104066,1104111,1104174,1104211,1104319,1104353,1104365,1104427,1104494,1104495,1104708,1104777,1104897
CVE References: CVE-2018-10853,CVE-2018-10876,CVE-2018-10877,CVE-2018-10878,CVE-2018-10879,CVE-2018-10880,CVE-2018-10881,CVE-2018-10882,CVE-2018-10883,CVE-2018-3620,CVE-2018-3646,CVE-2018-5391
Sources used:
SUSE Linux Enterprise Module for Public Cloud 15 (src):    kernel-azure-4.12.14-5.13.1, kernel-source-azure-4.12.14-5.13.1, kernel-syms-azure-4.12.14-5.13.1
Comment 26 Thomas Blume 2018-08-27 09:06:22 UTC
(In reply to Takashi Iwai from comment #19)
> I took a deeper look at the patch, and the issue looks like that either the
> reported temperature is wrong or the reported duty value is wrong.
> 
> For further debugging, I'm building a test kernel that adds some debug
> prints (via nvkm_debug() calls).  It reverted the revert-patch and should
> show the buggy behavior again.  Please test it later, and give back the
> debug messages (appear as "XXX ...").

Sorry I was on vacation.
Will test ASAP.
Comment 27 Thomas Blume 2018-08-29 12:48:18 UTC
Created attachment 781156 [details]
logs from debug kernel

Hi Takashi,

the tests with your debug kernel were a bit surprising.
In contrast to the unpatched kernel, the fan didn't start immediately after loading the nouveau driver.
Even after logging into the Xserver it took a few minutes until the fan noise started.

cat /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:01\:00.0/hwmon/hwmon0/pwm1

showed zero, until the fan noise started.
Then it showed: 100

That is quite different from the behaviour of the kernel with the reverted patch.
That one constantly shows 20.
So it seems that, for some reason, the debug kernel causes a sudden jump of the fan speed from 0 to 100.
The attached debug logs contain a dmesg and the journal log from before and after the fan noise started.
Comment 28 Takashi Iwai 2018-08-29 14:26:47 UTC
Weird, indeed.  It's a heisenbug, then.

I wonder, though, why linear_duty debug is printed only once.  It implies that the method was switched to another one, like NVBIOS_THERM_FAN_TRIP.
But still it doesn't explain why it doesn't update the fan speed...

I refreshed the debug prints to show more data, including BIOS parser.
It's built in OBS home:tiwai:bsc1103356-dbg2 repo.  Please give it a try later again.
Comment 29 Takashi Iwai 2018-08-29 17:51:10 UTC
Doh, scratch that.  Of course, my test kernel shows the different behavior drastically.  It still contains the revert commit I merged in Leap 15.0 branch!

OK, I'll remove the revert and add a debug print again to see what's going on.
Stay tuned.
Comment 30 Takashi Iwai 2018-08-29 17:56:01 UTC
Erm, sorry, my previous comment was again wrong.  I disabled the patch in series.conf (although not mentioned in changelog, my bad).  So it's with the original kernel state plus a debug patch.  It's still a mystery why it behaved so differently.

That said, please go ahead testing with a new kernel in OBS home:tiwai:bsc1103356-dbg2 repo.  This is also with the revert of the revert, i.e. should be buggy.
Comment 31 Thomas Blume 2018-08-31 09:23:51 UTC
Created attachment 781492 [details]
logs from new debug kernel

Here are the logs with the new debug kernel.
With this kernel, the fan noise starts immediately and /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:01\:00.0/hwmon/hwmon0/pwm1 remains constant at 78
Comment 32 Takashi Iwai 2018-08-31 09:57:51 UTC
Thanks.  The behavior at this time looks normal, isn't it?

The measured temperature at start was 77.14C (= 7154 * 458/10000 - 25051/100), and it went down to 55.89C (= 6690), slightly up to 68C (6968).

Does the boost behavior appear when you turn off the console loglevel?

And my wild guess now is that it's because polling is disabled when entering this mode.  Will cook up another test patch.
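
The conversion quoted above can be checked with a few lines. This is only a worked example of the arithmetic, using the slope (458/10000) and offset (25051/100) given in this comment; the constant and function names are made up for illustration and are not claimed to match the driver source.

```python
from fractions import Fraction

# Constants as quoted in the comment above; names are hypothetical.
SLOPE = Fraction(458, 10000)
OFFSET = Fraction(25051, 100)


def raw_to_celsius(raw):
    """Convert a raw sensor reading to degrees Celsius: raw * slope - offset."""
    return float(raw * SLOPE - OFFSET)


for raw in (7154, 6690, 6968):
    print(raw, round(raw_to_celsius(raw), 2))
```

The three raw readings mentioned above come out at about 77.14, 55.89 and 68.62 degrees Celsius, which agrees with the figures in the comment.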
Comment 33 Takashi Iwai 2018-08-31 10:17:13 UTC
A test kernel with a hopefully working patch is being built in OBS home:tiwai:bsc1103356-test2 repo.
Please give it a try later.
Comment 34 Thomas Blume 2018-08-31 14:09:06 UTC
(In reply to Takashi Iwai from comment #32)
> Thanks.  The behavior at this time looks normal, isn't it?
> 
> The measured temperature at start was 77.14C (= 7154 * 458/10000 -
> 25051/100), and it went down to 55.89C (= 6690), slightly up to 68C (6968).
> 
> Does the boost behavior appear when you turn off the console loglevel?
> 
> And my wild guess now is that it's because polling is disabled when entering
> this mode.  Will cook up another test patch.

The fan boost is indeed gone, with or without nouveau debug logging.
Still, the fan stays noisy: /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:01\:00.0/hwmon/hwmon0/pwm1 shows 78 with debug logging and 80 without, even though there is very low GPU load.
I'd expect that the fan speed decreases as the gpu cools down.
Will now try your latest patch and report.
Comment 35 Thomas Blume 2018-08-31 15:15:04 UTC
(In reply to Takashi Iwai from comment #33)
> A test kernel with a hopefully working patch is being built in OBS
> home:tiwai:bsc1103356-test2 repo.
> Please give it a try later.

Looks better, no more fan boost, but still the fan reaches an annoying noise level.
The fan speed stays at 63.
Attaching the new debug logs below.
Comment 36 Thomas Blume 2018-08-31 15:16:12 UTC
Created attachment 781557 [details]
debug logs from : home:tiwai:bsc1103356-test2
Comment 37 Takashi Iwai 2018-08-31 15:29:23 UTC
Thanks.  I guess this is the expected behavior of the current driver; at least it reads the temperature high and tries to cool down accordingly.

The difference is that, as you can see in the log, my patched kernel continuously updates the FAN target depending on the measured temperature.
So if this is still too high, it means either that the measured temperature is incorrect, or its evaluation is wrong.

Actually, the fact that it appeared to work in the past was merely coincidental, I guess.  The fan control seems to be NVBIOS_THERM_FAN_OTHER on your machine, and it didn't do anything unless a cstate change happens in its clk code.  And the cstate change doesn't seem to happen on yours, hence it keeps running with the initial state.

The difference between the old and the recent kernels is only this initial state.  On an old kernel, it used to be nothing, i.e. no fan (until the board gets the alert high temperature).  On a recent kernel, the fan speed is evaluated from the temperature, and it kicks off high, and keeps running without further adjustment.
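
The difference between "evaluated once at start" and "re-evaluated on every poll" can be illustrated with a toy model. Everything in this sketch is an assumption for illustration: the linear mapping, its thresholds and the function names are invented and do not reflect the driver's actual fan table or code paths.

```python
def duty_from_temp(temp_c, t_min=40, t_max=90, d_min=20, d_max=100):
    """Hypothetical linear temperature-to-duty mapping (made-up thresholds)."""
    if temp_c <= t_min:
        return d_min
    if temp_c >= t_max:
        return d_max
    return d_min + (d_max - d_min) * (temp_c - t_min) // (t_max - t_min)


def simulate(temps, polling):
    """Return the duty in effect after each temperature sample.

    polling=False: duty is computed from the first sample and never updated.
    polling=True:  duty is recomputed for every sample.
    """
    duty = duty_from_temp(temps[0])
    history = []
    for temp in temps:
        if polling:
            duty = duty_from_temp(temp)
        history.append(duty)
    return history
```

With a cooling sequence such as `[77, 68, 56]`, the non-polling variant stays at its initial duty while the polling variant follows the temperature down, which is the qualitative behavior described in this comment.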
Comment 38 Takashi Iwai 2018-08-31 15:30:35 UTC
Created attachment 781563 [details]
Test fix patch
Comment 39 Takashi Iwai 2018-10-31 11:42:21 UTC
JFYI, I replaced the fix from the reverting one to my submitted patch on SLE15/Leap15.

It's still not included in 4.20, and I'll need to ping nouveau guys, but it's certainly closer to the upstream solution.
Comment 40 Swamp Workflow Management 2018-11-01 19:41:41 UTC
This is an autogenerated message for OBS integration:
This bug (1103356) was mentioned in
https://build.opensuse.org/request/show/645932 15.0 / kernel-source
Comment 41 Swamp Workflow Management 2018-11-07 20:14:45 UTC
openSUSE-SU-2018:3658-1: An update that solves 5 vulnerabilities and has 86 fixes is now available.

Category: security (important)
Bug References: 1051510,1055120,1065600,1066674,1067906,1076830,1079524,1083647,1084760,1084831,1091800,1094825,1095805,1100132,1103356,1103543,1104124,1104731,1105025,1105428,1105536,1106110,1106237,1106240,1108377,1109330,1109739,1109806,1109818,1109907,1109911,1109915,1109919,1109951,1110006,1111040,1111506,1111806,1111819,1111830,1111834,1111841,1111870,1111901,1111904,1111921,1111928,1111983,1112170,1112173,1112208,1112219,1112221,1112246,1112372,1112514,1112554,1112708,1112710,1112711,1112712,1112713,1112731,1112732,1112733,1112734,1112735,1112736,1112738,1112739,1112740,1112741,1112743,1112745,1112746,1112878,1112894,1112899,1112902,1112903,1112905,1112906,1112907,1113257,1113284,1113295,1113408,1113667,1113722,1113751,1113972
CVE References: CVE-2017-16533,CVE-2017-18224,CVE-2018-18386,CVE-2018-18445,CVE-2018-18710
Sources used:
openSUSE Leap 15.0 (src):    kernel-debug-4.12.14-lp150.12.25.1, kernel-default-4.12.14-lp150.12.25.1, kernel-docs-4.12.14-lp150.12.25.1, kernel-kvmsmall-4.12.14-lp150.12.25.1, kernel-obs-build-4.12.14-lp150.12.25.1, kernel-obs-qa-4.12.14-lp150.12.25.1, kernel-source-4.12.14-lp150.12.25.1, kernel-syms-4.12.14-lp150.12.25.1, kernel-vanilla-4.12.14-lp150.12.25.1
Comment 44 Swamp Workflow Management 2018-11-28 14:15:36 UTC
SUSE-SU-2018:3934-1: An update that solves 5 vulnerabilities and has 101 fixes is now available.

Category: security (important)
Bug References: 1051510,1055120,1061840,1065600,1066674,1067906,1076830,1079524,1083647,1084760,1084831,1086196,1091800,1094825,1095805,1100132,1101138,1103356,1103543,1103925,1104124,1104731,1105025,1105428,1105536,1106110,1106237,1106240,1106287,1106359,1106838,1108377,1108468,1108870,1109330,1109739,1109772,1109784,1109806,1109818,1109907,1109911,1109915,1109919,1109951,1110006,1111040,1111076,1111506,1111806,1111811,1111819,1111830,1111834,1111841,1111870,1111901,1111904,1111921,1111928,1111983,1112170,1112173,1112208,1112219,1112221,1112246,1112372,1112514,1112554,1112708,1112710,1112711,1112712,1112713,1112731,1112732,1112733,1112734,1112735,1112736,1112738,1112739,1112740,1112741,1112743,1112745,1112746,1112878,1112894,1112899,1112902,1112903,1112905,1112906,1112907,1113257,1113284,1113295,1113408,1113667,1113722,1113751,1113780,1113972,1114279
CVE References: CVE-2017-16533,CVE-2017-18224,CVE-2018-18386,CVE-2018-18445,CVE-2018-18710
Sources used:
SUSE Linux Enterprise Server 12-SP4 (src):    kernel-azure-4.12.14-6.3.1, kernel-source-azure-4.12.14-6.3.1, kernel-syms-azure-4.12.14-6.3.1
Comment 45 Swamp Workflow Management 2018-11-30 20:27:26 UTC
SUSE-SU-2018:3961-1: An update that solves 22 vulnerabilities and has 286 fixes is now available.

Category: security (important)
Bug References: 1012382,1031392,1043912,1044189,1046302,1046305,1046306,1046307,1046540,1046543,1050244,1050319,1050536,1050540,1051510,1054914,1055014,1055117,1055120,1058659,1060463,1061840,1065600,1065729,1066674,1067126,1067906,1068032,1069138,1071995,1076830,1077761,1077989,1078720,1079524,1080157,1082519,1082555,1083647,1083663,1084760,1084831,1085030,1085042,1085262,1086282,1086283,1086288,1086327,1089663,1090078,1091800,1092903,1094244,1094825,1095344,1095805,1096748,1097105,1097583,1097584,1097585,1097586,1097587,1097588,1098459,1098782,1098822,1099125,1099922,1099999,1100001,1100132,1101480,1101557,1101669,1102346,1102495,1102517,1102715,1102870,1102875,1102877,1102879,1102881,1102882,1102896,1103269,1103308,1103356,1103363,1103387,1103405,1103421,1103543,1103587,1103636,1103948,1103949,1103961,1104172,1104353,1104482,1104683,1104731,1104824,1104888,1104890,1105025,1105190,1105247,1105292,1105322,1105355,1105378,1105396,1105428,1105467,1105524,1105536,1105597,1105603,1105672,1105731,1105795,1105907,1106007,1106016,1106105,1106110,1106121,1106170,1106178,1106229,1106230,1106231,1106233,1106235,1106236,1106237,1106238,1106240,1106291,1106297,1106333,1106369,1106427,1106464,1106509,1106511,1106594,1106636,1106688,1106697,1106779,1106800,1106838,1106890,1106891,1106892,1106893,1106894,1106896,1106897,1106898,1106899,1106900,1106901,1106902,1106903,1106905,1106906,1106948,1106995,1107008,1107060,1107061,1107065,1107074,1107207,1107319,1107320,1107522,1107535,1107685,1107689,1107735,1107756,1107783,1107829,1107870,1107924,1107928,1107945,1107947,1107966,1108010,1108093,1108096,1108170,1108241,1108243,1108260,1108281,1108323,1108377,1108399,1108468,1108520,1108823,1108841,1108870,1109151,1109158,1109217,1109244,1109269,1109330,1109333,1109336,1109337,1109511,1109603,1109739,1109772,1109784,1109806,1109818,1109907,1109915,1109919,1109951,1109979,1109992,1110006,1110096,1110301,1110363,1110538,1110561,1110639,1110642,1110643,1110644,1110645,1110646,1110647,1110649,
1110650,1111028,1111040,1111076,1111506,1111806,1111819,1111830,1111834,1111841,1111870,1111901,1111904,1111921,1111928,1111983,1112170,1112208,1112219,1112246,1112372,1112514,1112554,1112708,1112710,1112711,1112712,1112713,1112731,1112732,1112733,1112734,1112735,1112736,1112738,1112739,1112740,1112741,1112743,1112745,1112746,1112878,1112894,1112899,1112902,1112903,1112905,1112906,1112907,1113257,1113284,1113295,1113408,1113667,1113722,1113751,1113780,1113972,1114279,971975
CVE References: CVE-2017-16533,CVE-2017-18224,CVE-2018-10902,CVE-2018-10938,CVE-2018-10940,CVE-2018-1128,CVE-2018-1129,CVE-2018-12896,CVE-2018-13093,CVE-2018-13095,CVE-2018-14613,CVE-2018-14617,CVE-2018-14633,CVE-2018-15572,CVE-2018-16658,CVE-2018-17182,CVE-2018-18386,CVE-2018-18445,CVE-2018-18710,CVE-2018-6554,CVE-2018-6555,CVE-2018-9363
Sources used:
SUSE Linux Enterprise Module for Public Cloud 15 (src):    kernel-azure-4.12.14-5.16.1, kernel-source-azure-4.12.14-5.16.1, kernel-syms-azure-4.12.14-5.16.1
Comment 47 Swamp Workflow Management 2018-12-11 14:14:43 UTC
SUSE-SU-2018:4069-1: An update that solves 7 vulnerabilities and has 184 fixes is now available.

Category: security (important)
Bug References: 1051510,1055120,1061840,1065600,1065729,1066674,1067906,1068273,1076830,1078248,1079524,1082555,1082653,1083647,1084760,1084831,1085535,1086196,1089350,1091800,1094825,1095805,1097755,1100132,1103356,1103925,1104124,1104731,1104824,1105025,1105428,1106105,1106110,1106237,1106240,1107256,1107385,1107866,1108377,1108468,1109330,1109739,1109772,1109806,1109818,1109907,1109911,1109915,1109919,1109951,1110006,1110998,1111040,1111062,1111174,1111506,1111696,1111809,1111921,1111983,1112128,1112170,1112173,1112208,1112219,1112221,1112246,1112372,1112514,1112554,1112708,1112710,1112711,1112712,1112713,1112731,1112732,1112733,1112734,1112735,1112736,1112738,1112739,1112740,1112741,1112743,1112745,1112746,1112878,1112894,1112899,1112902,1112903,1112905,1112906,1112907,1112963,1113257,1113284,1113295,1113408,1113412,1113501,1113667,1113677,1113722,1113751,1113769,1113780,1113972,1114015,1114178,1114279,1114385,1114576,1114577,1114578,1114579,1114580,1114581,1114582,1114583,1114584,1114585,1114839,1115074,1115269,1115431,1115433,1115440,1115567,1115709,1115976,1116183,1116692,1116693,1116698,1116699,1116700,1116701,1116862,1116863,1116876,1116877,1116878,1116891,1116895,1116899,1116950,1117168,1117172,1117174,1117181,1117184,1117188,1117189,1117349,1117561,1117788,1117789,1117790,1117791,1117792,1117794,1117795,1117796,1117798,1117799,1117801,1117802,1117803,1117804,1117805,1117806,1117807,1117808,1117815,1117816,1117817,1117818,1117819,1117820,1117821,1117822,1118102,1118136,1118137,1118138,1118140,1118152,1118316
CVE References: CVE-2017-16533,CVE-2017-18224,CVE-2018-18281,CVE-2018-18386,CVE-2018-18445,CVE-2018-18710,CVE-2018-19824
Sources used:
SUSE Linux Enterprise Workstation Extension 12-SP4 (src):    kernel-default-4.12.14-95.3.1
SUSE Linux Enterprise Software Development Kit 12-SP4 (src):    kernel-docs-4.12.14-95.3.1, kernel-obs-build-4.12.14-95.3.2
SUSE Linux Enterprise Server 12-SP4 (src):    kernel-default-4.12.14-95.3.1, kernel-source-4.12.14-95.3.1, kernel-syms-4.12.14-95.3.1
SUSE Linux Enterprise High Availability 12-SP4 (src):    kernel-default-4.12.14-95.3.1
SUSE Linux Enterprise Desktop 12-SP4 (src):    kernel-default-4.12.14-95.3.1, kernel-source-4.12.14-95.3.1, kernel-syms-4.12.14-95.3.1
Comment 48 Swamp Workflow Management 2018-12-12 08:18:14 UTC
SUSE-SU-2018:4072-1: An update that solves 7 vulnerabilities and has 184 fixes is now available.

Category: security (important)
Bug References: 1051510,1055120,1061840,1065600,1065729,1066674,1067906,1068273,1076830,1078248,1079524,1082555,1082653,1083647,1084760,1084831,1085535,1086196,1089350,1091800,1094825,1095805,1097755,1100132,1103356,1103925,1104124,1104731,1104824,1105025,1105428,1106105,1106110,1106237,1106240,1107256,1107385,1107866,1108377,1108468,1109330,1109739,1109772,1109806,1109818,1109907,1109911,1109915,1109919,1109951,1110006,1110998,1111040,1111062,1111174,1111506,1111696,1111809,1111921,1111983,1112128,1112170,1112173,1112208,1112219,1112221,1112246,1112372,1112514,1112554,1112708,1112710,1112711,1112712,1112713,1112731,1112732,1112733,1112734,1112735,1112736,1112738,1112739,1112740,1112741,1112743,1112745,1112746,1112878,1112894,1112899,1112902,1112903,1112905,1112906,1112907,1112963,1113257,1113284,1113295,1113408,1113412,1113501,1113667,1113677,1113722,1113751,1113769,1113780,1113972,1114015,1114178,1114279,1114385,1114576,1114577,1114578,1114579,1114580,1114581,1114582,1114583,1114584,1114585,1114839,1115074,1115269,1115431,1115433,1115440,1115567,1115709,1115976,1116183,1116692,1116693,1116698,1116699,1116700,1116701,1116862,1116863,1116876,1116877,1116878,1116891,1116895,1116899,1116950,1117168,1117172,1117174,1117181,1117184,1117188,1117189,1117349,1117561,1117788,1117789,1117790,1117791,1117792,1117794,1117795,1117796,1117798,1117799,1117801,1117802,1117803,1117804,1117805,1117806,1117807,1117808,1117815,1117816,1117817,1117818,1117819,1117820,1117821,1117822,1118102,1118136,1118137,1118138,1118140,1118152,1118316
CVE References: CVE-2017-16533,CVE-2017-18224,CVE-2018-18281,CVE-2018-18386,CVE-2018-18445,CVE-2018-18710,CVE-2018-19824
Sources used:
SUSE Linux Enterprise Live Patching 12-SP4 (src):    kgraft-patch-SLE12-SP4_Update_1-1-7.1
Comment 50 Takashi Iwai 2019-01-21 15:17:36 UTC
My fix patch has landed upstream and has been backported to the relevant branches.

What remains is tuning from user space, which was explained in the upstream bugzilla (or mail thread), IIRC.
Comment 51 Swamp Workflow Management 2019-02-01 23:17:59 UTC
SUSE-SU-2019:0224-1: An update that solves 13 vulnerabilities and has 253 fixes is now available.

Category: security (important)
Bug References: 1024718,1046299,1050242,1050244,1051510,1055120,1055121,1055186,1058115,1060463,1061840,1065600,1065729,1068273,1078248,1079935,1082387,1082555,1082653,1083647,1085535,1086196,1086282,1086283,1086423,1087978,1088386,1089350,1090888,1091405,1091800,1094244,1097593,1097755,1100132,1102875,1102877,1102879,1102882,1102896,1103257,1103356,1103925,1104124,1104353,1104427,1104824,1104967,1105168,1105428,1106105,1106110,1106237,1106240,1106615,1106913,1107256,1107385,1107866,1108270,1108468,1109272,1109772,1109806,1110006,1110558,1110998,1111040,1111062,1111174,1111183,1111188,1111469,1111696,1111795,1111809,1111921,1112878,1112963,1113295,1113408,1113412,1113501,1113667,1113677,1113722,1113751,1113769,1113780,1113972,1114015,1114178,1114279,1114385,1114576,1114577,1114578,1114579,1114580,1114581,1114582,1114583,1114584,1114585,1114839,1114871,1115074,1115269,1115431,1115433,1115440,1115567,1115709,1115976,1116040,1116183,1116336,1116692,1116693,1116698,1116699,1116700,1116701,1116803,1116841,1116862,1116863,1116876,1116877,1116878,1116891,1116895,1116899,1116950,1117115,1117162,1117165,1117168,1117172,1117174,1117181,1117184,1117186,1117188,1117189,1117349,1117561,1117656,1117788,1117789,1117790,1117791,1117792,1117794,1117795,1117796,1117798,1117799,1117801,1117802,1117803,1117804,1117805,1117806,1117807,1117808,1117815,1117816,1117817,1117818,1117819,1117820,1117821,1117822,1117953,1118102,1118136,1118137,1118138,1118140,1118152,1118215,1118316,1118319,1118428,1118484,1118505,1118752,1118760,1118761,1118762,1118766,1118767,1118768,1118769,1118771,1118772,1118773,1118774,1118775,1118798,1118809,1118962,1119017,1119086,1119212,1119322,1119410,1119714,1119749,1119804,1119946,1119962,1119968,1120036,1120046,1120053,1120054,1120055,1120058,1120088,1120092,1120094,1120096,1120097,1120173,1120214,1120223,1120228,1120230,1120232,1120234,1120235,1120238,1120594,1120598,1120600,1120601,1120602,1120603,1120604,1120606,1120612,1120613,1120614,1120615,1120616,1120617,1120618,1120620,1120621,1120632,1120633,1120743,1120954,1121017,1121058,1121263,1121273,1121477,1121483,1121599,1121621,1121714,1121715,1121973
CVE References: CVE-2018-12232,CVE-2018-14625,CVE-2018-16862,CVE-2018-16884,CVE-2018-18281,CVE-2018-18397,CVE-2018-18710,CVE-2018-19407,CVE-2018-19824,CVE-2018-19854,CVE-2018-19985,CVE-2018-20169,CVE-2018-9568
Sources used:
SUSE Linux Enterprise Workstation Extension 15 (src):    kernel-default-4.12.14-25.28.1
SUSE Linux Enterprise Module for Open Buildservice Development Tools 15 (src):    kernel-default-4.12.14-25.28.1, kernel-docs-4.12.14-25.28.1, kernel-obs-qa-4.12.14-25.28.1
SUSE Linux Enterprise Module for Legacy Software 15 (src):    kernel-default-4.12.14-25.28.1
SUSE Linux Enterprise Module for Development Tools 15 (src):    kernel-docs-4.12.14-25.28.1, kernel-obs-build-4.12.14-25.28.1, kernel-source-4.12.14-25.28.1, kernel-syms-4.12.14-25.28.1, kernel-vanilla-4.12.14-25.28.1
SUSE Linux Enterprise Module for Basesystem 15 (src):    kernel-default-4.12.14-25.28.1, kernel-source-4.12.14-25.28.1, kernel-zfcpdump-4.12.14-25.28.1
SUSE Linux Enterprise High Availability 15 (src):    kernel-default-4.12.14-25.28.1
Comment 52 Swamp Workflow Management 2019-02-02 00:01:28 UTC
SUSE-SU-2019:0224-1: An update that solves 13 vulnerabilities and has 253 fixes is now available.

Category: security (important)
Bug References: 1024718,1046299,1050242,1050244,1051510,1055120,1055121,1055186,1058115,1060463,1061840,1065600,1065729,1068273,1078248,1079935,1082387,1082555,1082653,1083647,1085535,1086196,1086282,1086283,1086423,1087978,1088386,1089350,1090888,1091405,1091800,1094244,1097593,1097755,1100132,1102875,1102877,1102879,1102882,1102896,1103257,1103356,1103925,1104124,1104353,1104427,1104824,1104967,1105168,1105428,1106105,1106110,1106237,1106240,1106615,1106913,1107256,1107385,1107866,1108270,1108468,1109272,1109772,1109806,1110006,1110558,1110998,1111040,1111062,1111174,1111183,1111188,1111469,1111696,1111795,1111809,1111921,1112878,1112963,1113295,1113408,1113412,1113501,1113667,1113677,1113722,1113751,1113769,1113780,1113972,1114015,1114178,1114279,1114385,1114576,1114577,1114578,1114579,1114580,1114581,1114582,1114583,1114584,1114585,1114839,1114871,1115074,1115269,1115431,1115433,1115440,1115567,1115709,1115976,1116040,1116183,1116336,1116692,1116693,1116698,1116699,1116700,1116701,1116803,1116841,1116862,1116863,1116876,1116877,1116878,1116891,1116895,1116899,1116950,1117115,1117162,1117165,1117168,1117172,1117174,1117181,1117184,1117186,1117188,1117189,1117349,1117561,1117656,1117788,1117789,1117790,1117791,1117792,1117794,1117795,1117796,1117798,1117799,1117801,1117802,1117803,1117804,1117805,1117806,1117807,1117808,1117815,1117816,1117817,1117818,1117819,1117820,1117821,1117822,1117953,1118102,1118136,1118137,1118138,1118140,1118152,1118215,1118316,1118319,1118428,1118484,1118505,1118752,1118760,1118761,1118762,1118766,1118767,1118768,1118769,1118771,1118772,1118773,1118774,1118775,1118798,1118809,1118962,1119017,1119086,1119212,1119322,1119410,1119714,1119749,1119804,1119946,1119962,1119968,1120036,1120046,1120053,1120054,1120055,1120058,1120088,1120092,1120094,1120096,1120097,1120173,1120214,1120223,1120228,1120230,1120232,1120234,1120235,1120238,1120594,1120598,1120600,1120601,1120602,1120603,1120604,1120606,1120612,1120613,1120614,1120615,1120616,1120617,1120618,1120620,1120621,1120632,1120633,1120743,1120954,1121017,1121058,1121263,1121273,1121477,1121483,1121599,1121621,1121714,1121715,1121973
CVE References: CVE-2018-12232,CVE-2018-14625,CVE-2018-16862,CVE-2018-16884,CVE-2018-18281,CVE-2018-18397,CVE-2018-18710,CVE-2018-19407,CVE-2018-19824,CVE-2018-19854,CVE-2018-19985,CVE-2018-20169,CVE-2018-9568
Sources used:
SUSE Linux Enterprise Workstation Extension 15 (src):    kernel-default-4.12.14-25.28.1
SUSE Linux Enterprise Module for Open Buildservice Development Tools 15 (src):    kernel-default-4.12.14-25.28.1, kernel-docs-4.12.14-25.28.1, kernel-obs-qa-4.12.14-25.28.1
SUSE Linux Enterprise Module for Live Patching 15 (src):    kernel-default-4.12.14-25.28.1, kernel-livepatch-SLE15_Update_8-1-1.3.1
SUSE Linux Enterprise Module for Legacy Software 15 (src):    kernel-default-4.12.14-25.28.1
SUSE Linux Enterprise Module for Development Tools 15 (src):    kernel-docs-4.12.14-25.28.1, kernel-obs-build-4.12.14-25.28.1, kernel-source-4.12.14-25.28.1, kernel-syms-4.12.14-25.28.1, kernel-vanilla-4.12.14-25.28.1
SUSE Linux Enterprise Module for Basesystem 15 (src):    kernel-default-4.12.14-25.28.1, kernel-source-4.12.14-25.28.1, kernel-zfcpdump-4.12.14-25.28.1
SUSE Linux Enterprise High Availability 15 (src):    kernel-default-4.12.14-25.28.1