Bug 1075359 - Long delay to switch root during boot
Long delay to switch root during boot
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Basesystem
Current
x86-64 openSUSE Factory
: P5 - None : Normal (vote)
: ---
Assigned To: Dr. Werner Fink
E-mail List
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2018-01-10 11:01 UTC by Dr. Werner Fink
Modified: 2021-11-15 10:35 UTC (History)
1 user (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Timings can be found in the output of the journal (141.76 KB, text/plain)
2018-01-10 11:01 UTC, Dr. Werner Fink
Details
output of jounrnalctl -xb ... (792.97 KB, text/plain)
2018-06-19 10:07 UTC, Dr. Werner Fink
Details
Screeen shot ... (4.70 MB, image/jpeg)
2018-06-19 10:57 UTC, Dr. Werner Fink
Details
2nd output of jounrnalctl -xb (790.82 KB, text/plain)
2018-06-19 14:29 UTC, Dr. Werner Fink
Details
3rd output now from blogd screen copy of /dev/console (146.14 KB, text/plain)
2018-06-19 14:57 UTC, Dr. Werner Fink
Details
4th output now from journalctl -b (137.19 KB, text/plain)
2018-06-20 11:58 UTC, Dr. Werner Fink
Details
4th output now from journalctl -b (121.67 KB, text/plain)
2018-06-20 12:05 UTC, Dr. Werner Fink
Details
/dev/shm/switch-root.log (239.95 KB, text/plain)
2018-06-21 10:43 UTC, Dr. Werner Fink
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Dr. Werner Fink 2018-01-10 11:01:13 UTC
Created attachment 755507 [details]
Timings can be found in the output of the journal

This looks more like a problem with kernel as I see kernel error messages about tpm and (amd)kfd modules at switch root, the grep through dmesg output shows:

  [   17.942195] tpm_tis 00:04: 1.2 TPM (device-id 0x0, rev-id 78)
  [   17.974398] tpm tpm0: A TPM error (6) occurred attempting to read a pcr value
  [   17.975009] tpm tpm0: TPM is disabled/deactivated (0x6)
  [   17.975014] tpm tpm0: tpm_read_log_acpi: TCPA log area empty
  [   17.975029] tpm_tis: probe of 00:04 failed with error -5
  [   19.387074] kfd kfd: Initialized module
  [   21.240162] kfd kfd: DID 68f9 is missing in supported_devices
  [   21.240164] kfd kfd: kgd2kfd_probe failed

with lsmod also grepped

  amdkfd                147456  1
  amd_iommu_v2           20480  1 amdkfd
  tpm_tis                16384  0
  tpm_tis_core           20480  1 tpm_tis
  tpm                    65536  2 tpm_tis,tpm_tis_core

the system has an old ATI Radeon HD 5450 graphic card and an Interl i7-2600 CPU
Comment 1 Jiri Slaby 2018-06-16 11:51:53 UTC
Jan 10 11:38:08 noether purge-kernels[1178]: Removed:
Jan 10 11:38:08 noether purge-kernels[1178]:     kernel-syms-4.14.1-2.1.ga5bca71.x86_64
Jan 10 11:38:08 noether purge-kernels[1178]:     kernel-default-devel-4.14.1-2.1.ga5bca71.x86_64
Jan 10 11:38:08 noether purge-kernels[1178]:     kernel-default-4.14.1-2.1.ga5bca71.x86_64
Jan 10 11:38:08 noether purge-kernels[1178]:     kernel-source-4.14.1-2.1.ga5bca71.noarch
Jan 10 11:38:08 noether purge-kernels[1178]:     kernel-devel-4.14.1-2.1.ga5bca71.noarch
...
Jan 10 11:38:08 noether systemd[1]: Startup finished in 1.721s (kernel) + 2.448s (initrd) + 4min 15.223s (userspace) = 4min 19.392s.

purge-kernels might be slow. But anyway, it spent over 4 minutes in userspace.
Comment 2 Franck Bui 2018-06-18 13:52:37 UTC
Werner does it happen again if you reboot a second time ?

purge-kernels shouldn't be run a second time I guess...
Comment 4 Dr. Werner Fink 2018-06-18 13:59:16 UTC
(In reply to Franck Bui from comment #2)
> Werner does it happen again if you reboot a second time ?
> 
> purge-kernels shouldn't be run a second time I guess...

The problem seems to be gone in current Tumbleweed.  The purge of the kernels as well as the waiting on wicked services are still there but the hang on the kernel switch is now round about 4 or 5 seconds. (And the purge as well as wicked are performed *after* the switch).
Comment 5 Franck Bui 2018-06-18 14:02:26 UTC
(In reply to Dr. Werner Fink from comment #4)
> the waiting on wicked services are still there 

Just in case you can get rid of it by setting  WAIT_FOR_INTERFACES in /etc/sysconfig/network/config to something smaller. In most cases this shouldn't have any bad side effect.
Comment 6 Dr. Werner Fink 2018-06-19 06:54:24 UTC
Maybe bug #1096072 could be a duplicate of this
Comment 7 Franck Bui 2018-06-19 09:47:56 UTC
Werner, can you try to boot with the debug logs enabled ?

Perhaps also try to boot with the "emergency" default target so we can limit the number of services started after switching to the rootfs.
Comment 8 Dr. Werner Fink 2018-06-19 10:07:19 UTC
Created attachment 774463 [details]
output of jounrnalctl -xb ...

... hmmm I've seen messages which are *NOT* in the journal log ...someting like an warning aboout an empty domain in some system call.

This requires a screen shot with e.g. a smart phone
Comment 9 Dr. Werner Fink 2018-06-19 10:57:44 UTC
Created attachment 774474 [details]
Screeen shot ...

and trhere had been a lot of messages with Numerical argument out of domain ... but those are not visible anymore as I can not scroll back to those messages
Comment 10 Franck Bui 2018-06-19 12:33:32 UTC
(In reply to Dr. Werner Fink from comment #8)
> Created attachment 774463 [details]
> output of jounrnalctl -xb ...
> 
> ... hmmm I've seen messages which are *NOT* in the journal log ...someting
> like an warning aboout an empty domain in some system call.
> 
> This requires a screen shot with e.g. a smart phone

These messages are debug logs so I would ignore them for now. They're not part of the journal probably because journald is stopped when they are emitted.
Comment 11 Franck Bui 2018-06-19 12:35:27 UTC
(In reply to Dr. Werner Fink from comment #4)
> (In reply to Franck Bui from comment #2)
> > Werner does it happen again if you reboot a second time ?
> > 
> > purge-kernels shouldn't be run a second time I guess...
> 
> The problem seems to be gone in current Tumbleweed.  The purge of the
> kernels as well as the waiting on wicked services are still there but the
> hang on the kernel switch is now round about 4 or 5 seconds. (And the purge
> as well as wicked are performed *after* the switch).

Just to make sure, by "the kernel switch", you meant when the system switches from initramfs to the final rootfs or something lse ?
Comment 12 Dr. Werner Fink 2018-06-19 13:04:57 UTC
Found this https://github.com/systemd/systemd/issues/8001
Comment 13 Dr. Werner Fink 2018-06-19 13:06:48 UTC
(In reply to Franck Bui from comment #11)
> (In reply to Dr. Werner Fink from comment #4)
> > (In reply to Franck Bui from comment #2)
> > > Werner does it happen again if you reboot a second time ?
> > > 
> > > purge-kernels shouldn't be run a second time I guess...
> > 
> > The problem seems to be gone in current Tumbleweed.  The purge of the
> > kernels as well as the waiting on wicked services are still there but the
> > hang on the kernel switch is now round about 4 or 5 seconds. (And the purge
> > as well as wicked are performed *after* the switch).
> 
> Just to make sure, by "the kernel switch", you meant when the system
> switches from initramfs to the final rootfs or something lse ?

It is the switch from initramfs to final rootfs and the delay is before ... also the messages with `... ignoring: Numerical argument out of domain' are all before
Comment 14 Dr. Werner Fink 2018-06-19 13:19:31 UTC
maybe there is a problem with the detection of the architecture and the used system call for this and this seems to be a problem with udev rules due to the starting part of the messages `Failed to add rule for system call ...'

Or in initrd both the x86_64 as well as the i586 architecture will be probed in the udev rules
Comment 15 Dr. Werner Fink 2018-06-19 14:29:01 UTC
Created attachment 774505 [details]
2nd output of  jounrnalctl -xb

here we are
Comment 16 Dr. Werner Fink 2018-06-19 14:57:46 UTC
Created attachment 774509 [details]
3rd output now from blogd screen copy of /dev/console

Here this is the copy of the input of the system console
Comment 17 Dr. Werner Fink 2018-06-20 11:58:54 UTC
Created attachment 774668 [details]
4th output now from journalctl -b

Journbal with Storage=volatile
Comment 18 Dr. Werner Fink 2018-06-20 12:05:58 UTC
Created attachment 774670 [details]
4th output now from journalctl -b

OK ... now with  "printk.devkmsg=on"
Comment 19 Dr. Werner Fink 2018-06-21 10:43:52 UTC
Created attachment 774844 [details]
/dev/shm/switch-root.log

The result os a top process started within initrd-switch-root.service .. compare with bug boo#1096072
Comment 20 Dr. Werner Fink 2018-06-21 10:52:13 UTC
Hmmm ... maybe haveged should be masked by using argv[0][0]='@' ... also we have some zombies around like `systemd-journal' and `systemctl'
Comment 21 Dr. Werner Fink 2018-06-21 10:55:41 UTC
(In reply to Dr. Werner Fink from comment #20)
> Hmmm ... maybe haveged should be masked by using argv[0][0]='@' ... also we
> have some zombies around like `systemd-journal' and `systemctl'

and ... a chroot(2) system call to be placed on new root
Comment 22 Swamp Workflow Management 2018-07-12 13:10:06 UTC
This is an autogenerated message for OBS integration:
This bug (1075359) was mentioned in
https://build.opensuse.org/request/show/622301 Factory / haveged
Comment 23 Tomáš Chvátal 2018-07-26 12:08:43 UTC
Werner is there something else to do or can this be closed as your patches were integrated?
Comment 24 Dr. Werner Fink 2018-07-26 12:15:48 UTC
(In reply to Tomáš Chvátal from comment #23)
> Werner is there something else to do or can this be closed as your patches
> were integrated?

Nope, should be fine