Bug 1071375 - amdgpu/Polaris: Kernel update breaks amdgpu support due to missing firmware file polaris11_mc.bin
Status: RESOLVED DUPLICATE of bug 1066682
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Basesystem
Version: Current
Hardware: aarch64 Other
Priority: P3 - Medium
Severity: Critical (5 votes)
Assigned To: Daniel Molkentin
Reported: 2017-12-05 20:16 UTC by andrey yakunin
Modified: 2017-12-11 09:59 UTC (History)
daniel: needinfo? (gvozdila)


Attachments
zypper dup -d for breaking update. (15.22 KB, text/plain)
2017-12-05 20:16 UTC, andrey yakunin
journalctl -b -1 after repairing from last good snapshot (168.36 KB, text/plain)
2017-12-05 20:17 UTC, andrey yakunin
Xorg.0.log.old from bad state (4.72 KB, text/plain)
2017-12-05 20:18 UTC, andrey yakunin
journalctl -b -1 after trying drm.debug=0x1e log_buf_len=1M (2.18 MB, text/plain)
2017-12-06 12:46 UTC, andrey yakunin
result for ls -l /lib/firmware/amdgpu/ (6.66 KB, text/plain)
2017-12-06 15:25 UTC, andrey yakunin
snapper diff for update breaking amdgpu support (7.95 KB, text/plain)
2017-12-06 15:26 UTC, andrey yakunin
sudo dracut -f --debug 2>&1| gzip > dracut.log.gz (926.57 KB, application/gzip)
2017-12-11 06:16 UTC, andrey yakunin

Description andrey yakunin 2017-12-05 20:16:26 UTC
Created attachment 751616 [details]
zypper dup -d for breaking update.

A Tumbleweed update makes the system unable to boot.
The screen is black, with no console or any message. The hard drive blinks a little.

The update was from 20171123 to 20171203.
The symptoms are the same for "zypper update" and "zypper dup"; I tried both from 3 December through 5 December.

Snapper allows booting from the last good state.
I'll add journalctl -b -1 as an attachment; it contains things like:

Dec 05 21:50:57 linux-kqa8 display-manager[1405]: /usr/bin/xauth: (stdin):1: bad "remove" command line
Dec 05 21:50:57 linux-kqa8 display-manager[1405]: /usr/bin/xauth: (stdin):2: bad "add" command line
Dec 05 21:50:57 linux-kqa8 sddm[1447]: QProcess: Destroyed while process ("/usr/lib/sddm/sddm-helper") is still running.
Dec 05 21:50:57 linux-kqa8 sddm[1447]: Display server stopped.
Dec 05 21:50:57 linux-kqa8 sddm[1447]: Running display stop script "/usr/share/sddm/scripts/Xstop"
Dec 05 21:50:58 linux-kqa8 sddm[1447]: Socket server stopping...
Dec 05 21:50:58 linux-kqa8 sddm[1447]: Socket server stopped.

Xorg.0.log.old (attached) looks fine except for:
[ 82671.095] (II) AMDGPU: Driver for AMD Radeon:
All GPUs supported by the amdgpu kernel driver
[ 82671.095] (II) [KMS] drm report modesetting isn't supported.
[ 82671.096] (II) [KMS] drm report modesetting isn't supported.
[ 82671.096] (EE) Screen 0 deleted because of no matching config section.
[ 82671.096] (II) UnloadModule: "amdgpu"
[ 82671.096] (EE) Screen 0 deleted because of no matching config section.
[ 82671.096] (II) UnloadModule: "amdgpu"
[ 82671.096] (EE) Device(s) detected, but none match those in the config file.
[ 82671.096] (EE)
Fatal server error:
[ 82671.096] (EE) no screens found(EE)
[ 82671.096] (EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
[ 82671.096] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[ 82671.096] (EE)
[ 82671.096] (EE) Server terminated with error (1). Closing log file.



Snapper shows only one changed entry in /etc/X11 

--- /.snapshots/391/snapshot/etc/X11/xdm/Xstartup 2014-12-15 21:34:46.000000000 +0300
+++ /.snapshots/392/snapshot/etc/X11/xdm/Xstartup 2017-11-23 13:42:33.000000000 +0300
@@ -54,6 +54,9 @@
#
# Find out if this is a local or remote connection
#
+if test $DISPLAY == '(null)' ; then
+ DISPLAY=""
+fi
LOCATION=${DISPLAY%:*}
LINE=:${DISPLAY#*:}
if test -z "$LOCATION" ; then
Comment 1 andrey yakunin 2017-12-05 20:17:47 UTC
Created attachment 751617 [details]
journalctl -b -1  after repairing from last good snapshot
Comment 2 andrey yakunin 2017-12-05 20:18:50 UTC
Created attachment 751618 [details]
Xorg.0.log.old from bad state
Comment 3 Stefan Dirsch 2017-12-06 10:30:23 UTC
[ 82671.095] (II) [KMS] drm report modesetting isn't supported.
[ 82671.096] (II) [KMS] drm report modesetting isn't supported.

It seems the amdgpu kernel module could not be loaded. Could you boot into this state again and provide the output of dmesg? Add the kernel command-line options

  drm.debug=0x1e log_buf_len=1M 

before.
Comment 4 andrey yakunin 2017-12-06 12:36:48 UTC
Hello! 

||Seems amdgpu Kernel module could not be loaded. Could you boot into this state
||again and provide the output dmesg. Add the kernel commandline options
||
||  drm.debug=0x1e log_buf_len=1M
||before.

Unfortunately it is not possible to get a terminal in the "bad" state.
I tried to change:

 linux   /boot/vmlinuz-4.14.0-1-default root=UUID=a0f1cb50-17b6-4e2d-86f1-2b17ec6bde9a  ${extra_cmdline} resume=/dev/disk/by-uuid/daf78ba0-891b-41cf-8f31-d284340dd975 splash=silent quiet showopts

to:

linux   /boot/vmlinuz-4.14.0-1-default root=UUID=a0f1cb50-17b6-4e2d-86f1-2b17ec6bde9a  ${extra_cmdline} resume=/dev/disk/by-uuid/daf78ba0-891b-41cf-8f31-d284340dd975 debug drm.debug=0x1e log_buf_len=1M 

but without a terminal I don't know how to get dmesg from that state.
There are no "dmesg*" files in /var/log/ after I boot into the "stable" state with snapper.
Sorry for the silly questions.
Comment 5 andrey yakunin 2017-12-06 12:46:19 UTC
Created attachment 751718 [details]
journalctl -b -1  after trying drm.debug=0x1e log_buf_len=1M
Comment 6 Stefan Dirsch 2017-12-06 13:23:13 UTC
Dec 06 15:05:28 linux-kqa8 kernel: mc: Failed to load firmware "amdgpu/polaris11_mc.bin"
Dec 06 15:05:28 linux-kqa8 kernel: [drm:gmc_v8_0_sw_init [amdgpu]] *ERROR* Failed to load mc firmware!
Dec 06 15:05:28 linux-kqa8 kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v8_0> failed -2
Dec 06 15:05:28 linux-kqa8 kernel: amdgpu 0000:01:00.0: amdgpu_init failed
Dec 06 15:05:28 linux-kqa8 kernel: amdgpu 0000:01:00.0: Fatal error during GPU init
Dec 06 15:05:28 linux-kqa8 kernel: [drm] amdgpu: finishing device.
Dec 06 15:05:28 linux-kqa8 kernel: [TTM] Memory type 2 has not been initialized

Likely a kernel regression, but could also be related to changes in power management according to Takashi.
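
For anyone triaging a similar black-screen boot, here is a minimal sketch for pulling the names of firmware files the kernel failed to load out of a saved log. It assumes the messages have the `Failed to load firmware "..."` shape quoted above; the function name `failed_firmware` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): read kernel log lines on
# stdin and print each firmware file named in a
# 'Failed to load firmware "<name>"' message, once per name.
failed_firmware() {
    sed -n 's/.*Failed to load firmware "\([^"]*\)".*/\1/p' | sort -u
}

# Example use against the kernel messages of the previous boot:
#   journalctl -k -b -1 | failed_firmware
```

Reading from the journal of the previous boot avoids needing a working terminal in the broken state.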
Comment 7 Takashi Iwai 2017-12-06 13:34:04 UTC
(In reply to Stefan Dirsch from comment #6)
> Dec 06 15:05:28 linux-kqa8 kernel: mc: Failed to load firmware
> "amdgpu/polaris11_mc.bin"
> Dec 06 15:05:28 linux-kqa8 kernel: [drm:gmc_v8_0_sw_init [amdgpu]] *ERROR*
> Failed to load mc firmware!

Looks rather like the missing firmware file.
Do you still have kernel-firmware package installed?  This should contain the file /lib/firmware/amdgpu/polaris11_mc.bin.

The PM thing is likely a different issue, see bsc#1068793.
Comment 8 andrey yakunin 2017-12-06 15:24:36 UTC
||Looks rather like the missing firmware file.
||Do you still have kernel-firmware package installed?  This should contain the
||file /lib/firmware/amdgpu/polaris11_mc.bin.

The system is now in the stable state, and /lib/firmware/amdgpu/polaris11_mc.bin is there.

I checked snapper diff | grep /lib/firmware/
and there are no changes to polaris11_mc.bin.
So it looks like the "bad" state had polaris11_mc.bin too.

I'll add the listing from the stable system and the diff from snapper ( grep /lib/firmware/ ) for the breaking release. I can add the full changelog, but it is quite big.
Comment 9 andrey yakunin 2017-12-06 15:25:46 UTC
Created attachment 751743 [details]
result for ls -l /lib/firmware/amdgpu/
Comment 10 andrey yakunin 2017-12-06 15:26:52 UTC
Created attachment 751744 [details]
snapper diff  for update breaking amdgpu support
Comment 11 Takashi Iwai 2017-12-06 15:27:46 UTC
The firmware might be loaded in the initrd when the amdgpu kernel module gets loaded there, too.  So you need to check whether the initrd contains the corresponding firmware, not only the root system.

I can't say exactly which update or change triggered it, though.
Comment 12 andrey yakunin 2017-12-06 15:46:09 UTC
||The firmware might be loaded in initrd when amdgpu kernel module gets loaded
||there, too.  So you need to check the content of initrd whether it contains
||the corresponding firmware, not only the root system.


During the update, this file was added:
initrd-4.14.2-1-default

and these were changed:
initrd 
initrd-4.13.12-1-default 
initrd-4.14.0-1-default 

/boot/vmlinuz-4.14.0-1-default was not modified.

Is that why it was impossible to boot in the "bad" state using the older kernel?
Comment 13 Takashi Iwai 2017-12-06 15:51:27 UTC
Well, if you can still reproduce the issue with a certain system (the kernel message shows the same error), please check the initrd content in that state.  Otherwise it's nothing but guesswork.
Comment 14 Stefan Dirsch 2017-12-06 15:58:51 UTC
JFYI, you can list the content of initrd via

  lsinitrd <path-of-initrd>
Comment 15 andrey yakunin 2017-12-06 19:01:18 UTC
I tried a rollback and ran the update one more time (to 20171204_0).


Zypper log after update contains:

2017-12-06 21:06:35 <1> linux-kqa8(5728) [zypp::posttrans++] RpmPostTransCollector.cc(executeScripts):94 dracut: Possible missing firmware "amdgpu/polaris11_mc.bin" for kernel module "amdgpu.ko" 

A diff between the lsinitrd output for the old and the new initrd shows:

> -rw-r--r--   2 root     root        32604 Aug  2 12:26 lib/firmware/amdgpu/polaris11_mc.bin

So there is no lib/firmware/amdgpu/polaris11_mc.bin in the new initrd-4.14.0-1-default (the older kernel).



First of all I tried to boot with 4.14.0-1 and the new (bad state) initrd.
Black screen, no console; dmesg shows:
Dec 06 21:35:27 linux-kqa8 kernel: amdgpu 0000:01:00.0: Direct firmware load for amdgpu/polaris11_mc.bin failed with error -2
Dec 06 21:35:27 linux-kqa8 kernel: mc: Failed to load firmware "amdgpu/polaris11_mc.bin"



Then I took the older initrd for 4.14.0-1 (from the 20171127 update) and booted the new system with it. Everything worked.
It seems the only problem is the initrd from the 20171203 update.


When I run lsinitrd initrd-4.14.2-1-default | grep polaris
I get no output.
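
The comparison above can be sketched as a small script that works on two saved lsinitrd listings and prints the firmware files only the first one contains. This is an illustration under assumptions: the function names and the listing file names are made up, and it relies on the `ls -l` style lines shown above, where the path is the ninth field.

```shell
#!/bin/sh
# Hypothetical helpers; they operate on saved listings, for example
#   lsinitrd <path-of-old-initrd> > old.txt
#   lsinitrd <path-of-new-initrd> > new.txt

# Print the firmware paths found in one listing (field 9 of each line).
firmware_paths() {
    awk '$9 ~ /^lib\/firmware\// { print $9 }' "$1" | sort -u
}

# Print firmware paths present in the first listing but absent from the second.
missing_firmware() {
    old_list=$(mktemp) || return 1
    new_list=$(mktemp) || { rm -f "$old_list"; return 1; }
    firmware_paths "$1" > "$old_list"
    firmware_paths "$2" > "$new_list"
    # comm -23 keeps lines that occur only in the first sorted file.
    comm -23 "$old_list" "$new_list"
    rm -f "$old_list" "$new_list"
}

# Example: missing_firmware old.txt new.txt
```

Paths containing spaces would be cut short by the field split, which should not matter for firmware files named like the ones in this report.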
Comment 16 Stefan Dirsch 2017-12-07 10:38:24 UTC
Ok. Seems the kernel-firmware package on your system is now missing this file. :-(
I don't understand this. It's in kernel-firmware of openSUSE:Factory and Kernel:HEAD.
Comment 17 Stefan Dirsch 2017-12-07 10:46:09 UTC
I suggest installing the latest firmware package and then running once more

  mkinitrd
Comment 18 andrey yakunin 2017-12-08 10:13:35 UTC
||Ok. Seems the kernel-firmware package on your system is now missing this file.
||:-(
||I don't understand this. It's in kernel-firmware of openSUSE:Factory and
||Kernel:HEAD.

I checked it. kernel-firmware is installed. 
zypper info kernel-firmware
Loading repository data...
Reading installed packages...


Information for package kernel-firmware:
----------------------------------------
Repository     : openSUSE-20170419-0             
Name           : kernel-firmware                 
Version        : 20171125-1.1                    
Arch           : noarch                          
Vendor         : openSUSE                        
Installed Size : 223.4 MiB                       
Installed      : Yes (automatically)             
Status         : up-to-date                      
Source package : kernel-firmware-20171125-1.1.src
Summary        : Linux kernel firmware files     
Description    :                                 
    This package contains the firmware for in-kernel drivers that was
    previously included in the kernel. It is shared by all kernels >=
    2.6.27-rc1.

The files in /lib/firmware/amdgpu exist and are not empty.
mkinitrd says:
...
dracut: Possible missing firmware "amdgpu/polaris11_smc_sk.bin" for kernel module "amdgpu.ko"
dracut: Possible missing firmware "amdgpu/polaris11_smc.bin" for kernel module "amdgpu.ko"
...
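
A minimal sketch for cross-checking such warnings against the files on disk. The function name `check_warned_firmware` is invented for illustration, and it assumes the warning text has the `Possible missing firmware "..."` form shown above.

```shell
#!/bin/sh
# Hypothetical helper: read dracut/mkinitrd output on stdin and, for each
# 'Possible missing firmware "<name>"' warning, report whether <name>
# exists under the firmware root given as $1 (default: /lib/firmware).
check_warned_firmware() {
    fw_root=${1:-/lib/firmware}
    sed -n 's/.*Possible missing firmware "\([^"]*\)".*/\1/p' | sort -u |
    while IFS= read -r name; do
        if [ -e "$fw_root/$name" ]; then
            echo "present on disk: $name"
        else
            echo "absent on disk:  $name"
        fi
    done
}

# Example: mkinitrd 2>&1 | check_warned_firmware
```

If a file is reported as present on disk while dracut still warns about it, dracut is presumably searching a different set of directories than /lib/firmware.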
Comment 19 andrey yakunin 2017-12-08 10:25:57 UTC
I forced an update of kernel-firmware, but nothing changed.


Can I provide more info? Just say the word.
Comment 20 andrey yakunin 2017-12-08 10:35:38 UTC
>>dracut: Possible missing firmware "amdgpu/polaris11_smc_sk.bin" for kernel module "amdgpu.ko"

Where is it trying to find polaris11_smc_sk.bin?
Is it possible to check dracut's search paths?
Comment 21 Takashi Iwai 2017-12-08 10:37:35 UTC
Reassigned to dracut maintainer.
Comment 22 Daniel Molkentin 2017-12-08 11:53:24 UTC
Just tried this on an up-to-date Tumbleweed system:

# cat /etc/os-release 
NAME="openSUSE Tumbleweed"
# VERSION="20171206"
ID=opensuse
ID_LIKE="suse"
VERSION_ID="20171206"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:tumbleweed:20171206"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"

# cd /tmp
# sudo dracut -f --add-drivers amdgpu initrd.test
# sudo lsinitrd initrd.test|grep polaris11

-rw-r--r--   2 root     root         8832 Nov 25 16:15 lib/firmware/amdgpu/polaris11_ce.bin
-rw-r--r--   1 root     root       130228 Nov 25 16:15 lib/firmware/amdgpu/polaris11_k_smc.bin
-rw-r--r--   1 root     root        32724 Nov 25 16:15 lib/firmware/amdgpu/polaris11_mc.bin
-rw-r--r--   1 root     root        17024 Nov 25 16:15 lib/firmware/amdgpu/polaris11_me.bin
-rw-r--r--   2 root     root            0 Nov 25 16:15 lib/firmware/amdgpu/polaris11_mec2.bin
-rw-r--r--   2 root     root       262784 Nov 25 16:15 lib/firmware/amdgpu/polaris11_mec.bin
-rw-r--r--   1 root     root        17024 Nov 25 16:15 lib/firmware/amdgpu/polaris11_pfp.bin
-rw-r--r--   1 root     root        23184 Nov 25 16:15 lib/firmware/amdgpu/polaris11_rlc.bin
-rw-r--r--   2 root     root        12692 Nov 25 16:15 lib/firmware/amdgpu/polaris11_sdma1.bin
-rw-r--r--   2 root     root        12692 Nov 25 16:15 lib/firmware/amdgpu/polaris11_sdma.bin
-rw-r--r--   1 root     root       130196 Nov 25 16:15 lib/firmware/amdgpu/polaris11_smc.bin
-rw-r--r--   1 root     root       130196 Nov 25 16:15 lib/firmware/amdgpu/polaris11_smc_sk.bin
-rw-r--r--   3 root     root            0 Nov 25 16:15 lib/firmware/amdgpu/polaris11_uvd.bin
-rw-r--r--   3 root     root            0 Nov 25 16:15 lib/firmware/amdgpu/polaris11_vce.bin

So I can't reproduce this. I had to force-include amdgpu, because the test system doesn't have an AMD graphics card. Are you confident your dracut run has picked up the amdgpu driver? Check for your running initrd with:

sudo lsinitrd|grep amdgpu.ko

For me, on my experimental module, this yields:

-rw-r--r--   1 root     root      3990136 Dec  6 12:31 lib/modules/4.14.3-1-default/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko
Comment 23 andrey yakunin 2017-12-08 15:55:12 UTC
||So I can't reproduce this. I had to force-include amdgpu, because the test
||system doesn't have an AMD graphics card. Are you confident your dracut run
||has picked up the amdgpu driver? Check for your running initrd with:

||sudo lsinitrd|grep amdgpu.ko

||For me, on my experimental module, this yields:

||-rw-r--r--   1 root     root      3990136 Dec  6 12:31 lib/modules/4.14.3-1-default/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko


lsinitrd old_good(20171127) | grep amdgpu.ko

-rw-r--r--   1 root     root      3990136 Nov 22 15:33 lib/modules/4.14.0-1-default/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko

lsinitrd new_bad(20171203) | grep amdgpu.ko

-rw-r--r--   1 root     root      3990136 Nov 22 15:33 lib/modules/4.14.0-1-default/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko

The outputs are identical.
Comment 24 Daniel Molkentin 2017-12-08 16:57:50 UTC
Please run

sudo dracut -f --debug 2>&1| gzip > dracut.log.gz

and upload the result.
Comment 25 andrey yakunin 2017-12-11 06:16:17 UTC
Created attachment 752247 [details]
sudo dracut -f --debug 2>&1| gzip > dracut.log.gz
Comment 26 Daniel Molkentin 2017-12-11 09:58:32 UTC
There's the culprit:

//etc/dracut.conf.d/amdgpu-4.14.0-1-default.conf@1(source): add_drivers+=' amdgpu'
//etc/dracut.conf.d/amdgpu-4.14.0-1-default.conf@2(source): add_drivers+=' amdkfd'
//etc/dracut.conf.d/amdgpu-4.14.0-1-default.conf@3(source): fw_dir+=/lib/firmware/4.14.0-1-default

Duplicate of #1066682:

The amdgpu package installs amdgpu-$(uname -r)-default.conf.
They think they can add directories by using fw_dir+=/lib/firmware/$(uname -r), but that's not how fw_dir works, because it replaces the existing value (and that's not a bug, but by design). As a workaround, remove the fw_dir statements.
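
A minimal sketch for spotting such statements, assuming the drop-in files live in /etc/dracut.conf.d as in the debug output above. The function name `find_fw_dir_settings` is invented for illustration and is not part of dracut.

```shell
#!/bin/sh
# Hypothetical helper: list every fw_dir assignment in the *.conf files of a
# dracut configuration directory ($1, default: /etc/dracut.conf.d).
find_fw_dir_settings() {
    conf_dir=${1:-/etc/dracut.conf.d}
    for conf in "$conf_dir"/*.conf; do
        [ -f "$conf" ] || continue
        # Passing /dev/null as a second file makes grep print the file name,
        # so each hit appears as "file:line-number:text".
        grep -n -E '^[[:space:]]*fw_dir\+?=' "$conf" /dev/null || true
    done
}

# Example: find_fw_dir_settings /etc/dracut.conf.d
```

After removing the reported lines, the initrd has to be rebuilt (for example with the mkinitrd call suggested in comment 17) before the firmware is picked up again.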
Comment 27 Daniel Molkentin 2017-12-11 09:59:48 UTC
Closing as duplicate.

*** This bug has been marked as a duplicate of bug 1066682 ***