Bug 1161232 - kernel stack trace on resume
Status: RESOLVED WORKSFORME
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: x86-64 SUSE Other
Importance: P5 - None : Normal
Assigned To: E-mail List
Reported: 2020-01-17 18:39 UTC by Michael Hirmke
Modified: 2020-04-15 11:14 UTC (History)


Attachments
dmesg (38.31 KB, application/x-bzip)
2020-01-17 18:43 UTC, Michael Hirmke
dmesg (213.02 KB, text/plain)
2020-01-20 12:27 UTC, Jiri Slaby
dmesg latest crash (271.95 KB, text/plain)
2020-02-02 19:30 UTC, Michael Hirmke
dmesg from the before last crash (213.02 KB, text/plain)
2020-02-03 17:06 UTC, Michael Hirmke

Description Michael Hirmke 2020-01-17 18:39:47 UTC
Starting with the 5.4.x kernel series on Tumbleweed, my Dell XPS 13 9370 always crashes when resuming from hibernation if an external Thunderbolt box (CalDigit TS3 Plus) with an external monitor was connected at hibernation time but is missing on resume.
See attached dmesg.txt.
The important messages seem to be:

[ 5392.134376] [drm:intel_dp_link_training_clock_recovery [i915]] *ERROR* failed to enable link training
[ 5394.363888] [drm:intel_dp_link_training_clock_recovery [i915]] *ERROR* failed to enable link training

This worked perfectly for more than a year with earlier kernel versions.
Comment 1 Michael Hirmke 2020-01-17 18:42:25 UTC
Can't add the attachment:

Server error!

The server encountered an internal error and was unable to complete your request.

Error message:
Malformed multipart POST: data truncated ,

If you think this is a server error, please contact the webmaster.
Error 500
bugzilla.opensuse.org
Fri Jan 17 18:42:04 2020
Apache
Comment 2 Michael Hirmke 2020-01-17 18:43:45 UTC
Created attachment 827758 [details]
dmesg
Comment 3 Michael Hirmke 2020-01-17 18:44:20 UTC
Finally succeeded.
Comment 4 Jiri Slaby 2020-01-20 12:25:55 UTC
> general protection fault: 0000 [#1] SMP PTI
> CPU: 1 PID: 16179 Comm: kworker/u16:3 Kdump: loaded Not tainted 5.4.10-1-default #1 openSUSE Tumbleweed (unreleased)
> Hardware name: Dell Inc. XPS 13 9370/0F6P3V, BIOS 1.12.1 12/11/2019
> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> RIP: 0010:remove_files.isra.0+0x1f/0x70
> Code: 41 bc fe ff ff ff eb ef 0f 1f 00 0f 1f 44 00 00 41 54 49 89 d4 55 48 89 fd 53 48 85 f6 74 24 48 8b 06 48 89 f3 48 85 c0 74 19 <48> 8b 30 31 d2 48 89 ef 48 83 c3 08 e8 70 d6 ff ff 48 8b 03 48 85
> RSP: 0018:ffffbb9808ce3b78 EFLAGS: 00010206
> RAX: 79221371e1231feb RBX: ffff9c9646ef1ac0 RCX: 0000000000000000
> RDX: ffff9c964e131480 RSI: ffff9c9646ef1ac0 RDI: ffff9c963a721908
> RBP: ffff9c963a721908 R08: 0000000000000000 R09: ffffffffb0568700
> usb 3-2: USB disconnect, device number 3
> R10: ffff9c963a721990 R11: ffff9c92c7c540d8 R12: ffff9c964e131480
> R13: ffff9c963a7a88c0 R14: ffff9c966d11e0b0 R15: ffff9c92c7282750
> FS:  0000000000000000(0000) GS:ffff9c966d440000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 000000048eeee006 CR4: 00000000003606e0
> Call Trace:
>  sysfs_remove_group+0x3d/0x80
>  sysfs_remove_groups+0x29/0x40
>  device_remove_attrs+0x39/0x70
>  device_del+0x15a/0x370
>  cdev_device_del+0x15/0x30
>  posix_clock_unregister+0x21/0x50
>  ptp_clock_unregister+0x6e/0x80
>  igb_ptp_stop+0x21/0x50 [igb]
>  igb_remove+0x47/0x130 [igb]
>  pci_device_remove+0x3b/0xa0
>  device_release_driver_internal+0xe4/0x1c0
>  pci_stop_bus_device+0x68/0x90
>  pci_stop_bus_device+0x2c/0x90
>  pci_stop_bus_device+0x3d/0x90
>  pci_stop_bus_device+0x2c/0x90
>  pci_stop_bus_device+0x2c/0x90
>  pci_stop_and_remove_bus_device+0xe/0x20
>  disable_slot+0x49/0x90
>  acpiphp_check_bridge.part.0+0xba/0x140
>  acpiphp_hotplug_notify+0xe2/0x1e0
>  ? free_bridge+0x110/0x110
>  acpi_device_hotplug+0x9e/0x3f0
>  acpi_hotplug_work_fn+0x1a/0x30
>  process_one_work+0x1f0/0x3a0
>  worker_thread+0x4d/0x400
>  kthread+0xf9/0x130
>  ? process_one_work+0x3a0/0x3a0
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x3a/0x50
> Modules linked in: fuse rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet xt_set xt_tcpudp typec_displayport xt_pkttype ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6t_rpfilter nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG xt_conntrack ip_set_hash_ip ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables scsi_transport_iscsi ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter rfcomm cmac snd_seq_midi snd_seq_midi_event snd_seq algif_hash bnep iTCO_wdt iTCO_vendor_support hid_multitouch mei_hdcp mei_wdt intel_rapl_msr snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi_intel_match dell_laptop snd_hda_codec_hdmi snd_soc_acpi x86_pkg_temp_thermal intel_powerclamp snd_soc_core kvm_intel
>  snd_hda_codec_realtek
> done.
> usb 6-4: USB disconnect, device number 2
>  snd_hda_codec_generic snd_compress ledtrig_audio snd_pcm_dmaengine kvm dmi_sysfs dell_wmi snd_hda_intel ath10k_pci dell_smbios snd_intel_nhlt coretemp msr dcdbas ath10k_core irqbypass snd_usb_audio snd_hda_codec snd_usbmidi_lib ath pcspkr snd_hda_core snd_rawmidi snd_hwdep i2c_i801 wmi_bmof dell_wmi_descriptor intel_wmi_thunderbolt mac80211 igb snd_seq_device mc snd_pcm dca joydev btusb btrtl snd_timer btbcm btintel snd cfg80211 soundcore bluetooth thunderbolt libarc4 ucsi_acpi ecdh_generic typec_ucsi mei_me processor_thermal_device rfkill rtsx_pci_ms intel_lpss_pci intel_xhci_usb_role_switch intel_lpss intel_rapl_common memstick ecc mei roles idma64 intel_soc_dts_iosf intel_pch_thermal typec thermal int3403_thermal int340x_thermal_zone intel_hid sparse_keymap ac acpi_pad int3400_thermal acpi_thermal_rel button nls_iso8859_1 nls_cp437 vfat fat dm_crypt algif_skcipher af_alg hid_generic usbhid uas usb_storage i915 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
>  rtsx_pci_sdmmc
> usb 7-1: USB disconnect, device number 2
>  mmc_core i2c_algo_bit drm_kms_helper aesni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci xhci_hcd glue_helper crypto_simd drm cryptd usbcore serio_raw rtsx_pci wmi i2c_hid battery pinctrl_sunrisepoint video pinctrl_intel sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua loop efivarfs
Comment 5 Jiri Slaby 2020-01-20 12:27:17 UTC
Created attachment 827827 [details]
dmesg
Comment 6 Jiri Slaby 2020-01-20 12:29:57 UTC
Oliver, any idea why this happens?

> usb usb1: root hub lost power or was reset
> usb usb2: root hub lost power or was reset
> ACPI: EC: event unblocked
> usb usb3: root hub lost power or was reset
> usb usb4: root hub lost power or was reset
> xhci_hcd 0000:3c:00.0: Host halt failed, -19
> xhci_hcd 0000:3c:00.0: Host not accessible, reset failed.
> usb usb5: root hub lost power or was reset
> usb usb6: root hub lost power or was reset
> xhci_hcd 0000:3d:00.0: Host halt failed, -19
> usb usb7: root hub lost power or was reset
> usb usb8: root hub lost power or was reset
> xhci_hcd 0000:3d:00.0: Host not accessible, reset failed.
> xhci_hcd 0000:3e:00.0: Host halt failed, -19
> xhci_hcd 0000:3e:00.0: Host not accessible, reset failed.
> xhci_hcd 0000:3c:00.0: No Extended Capability registers, unable to set up roothub
> xhci_hcd 0000:3c:00.0: Host halt failed, -19
> xhci_hcd 0000:3c:00.0: Host not accessible, reset failed.
> xhci_hcd 0000:3c:00.0: PCI post-resume error -12!
> xhci_hcd 0000:3c:00.0: HC died; cleaning up
> PM: dpm_run_callback(): pci_pm_restore+0x0/0x90 returns -12
> PM: Device 0000:3c:00.0 failed to restore async: error -12
> xhci_hcd 0000:3d:00.0: No Extended Capability registers, unable to set up roothub
> xhci_hcd 0000:3d:00.0: Host halt failed, -19
> xhci_hcd 0000:3d:00.0: Host not accessible, reset failed.
> xhci_hcd 0000:3d:00.0: PCI post-resume error -12!
> xhci_hcd 0000:3d:00.0: HC died; cleaning up
> xhci_hcd 0000:3e:00.0: No Extended Capability registers, unable to set up roothub
> PM: dpm_run_callback(): pci_pm_restore+0x0/0x90 returns -12
> PM: Device 0000:3d:00.0 failed to restore async: error -12
> xhci_hcd 0000:3e:00.0: Host halt failed, -19
> xhci_hcd 0000:3e:00.0: Host not accessible, reset failed.
> xhci_hcd 0000:3e:00.0: PCI post-resume error -12!
> xhci_hcd 0000:3e:00.0: HC died; cleaning up
> PM: dpm_run_callback(): pci_pm_restore+0x0/0x90 returns -12
> PM: Device 0000:3e:00.0 failed to restore async: error -12
> PM: dpm_run_callback(): pci_pm_restore+0x0/0x90 returns -19
> PM: Device 0000:3f:00.0 failed to restore async: error -19
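
The negative numbers in the log above are kernel errno values. As a reading aid, here is a minimal Python sketch that maps them to symbolic names; it assumes it is run on a Linux host, because Python's `errno` table reflects the platform it runs on. The helper name `describe` is made up for this illustration.

```python
import errno
import os

# Negative return codes as they appear in the log above (e.g. "error -12").
codes = [-19, -12]


def describe(code: int) -> str:
    """Map a negative kernel return code to its errno name and message text."""
    number = abs(code)
    # errno.errorcode maps numeric values to names such as "ENODEV".
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{code}: {name} ({os.strerror(number)})"


for code in codes:
    print(describe(code))
```

On Linux, -19 is ENODEV ("No such device") and -12 is ENOMEM, which fits the reading that the xHCI controllers behind the dock are no longer reachable at resume time.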
Comment 7 Michael Hirmke 2020-01-20 20:09:49 UTC
Just tested with 5.3.12-2 - works without any problem.
Comment 8 Michael Hirmke 2020-02-01 11:02:11 UTC
I was too hasty 8-<
Problem still exists in 5.4.14 and even in 5.5.0 8-(
It does not always happen, but in 3 out of 4 hibernate/resume cycles.
Comment 9 Michael Hirmke 2020-02-02 19:30:49 UTC
Created attachment 828869 [details]
dmesg latest crash
Comment 10 Oliver Neukum 2020-02-03 12:37:16 UTC
(In reply to Jiri Slaby from comment #6)
> Oliver, any idea why this happens?
> 
> > usb usb1: root hub lost power or was reset
> > usb usb2: root hub lost power or was reset
> > ACPI: EC: event unblocked
> > usb usb3: root hub lost power or was reset
> > usb usb4: root hub lost power or was reset
> > xhci_hcd 0000:3c:00.0: Host halt failed, -19
> > xhci_hcd 0000:3c:00.0: Host not accessible, reset failed.
> > usb usb5: root hub lost power or was reset
> > usb usb6: root hub lost power or was reset
> > xhci_hcd 0000:3d:00.0: Host halt failed, -19
> > usb usb7: root hub lost power or was reset
> > usb usb8: root hub lost power or was reset
> > xhci_hcd 0000:3d:00.0: Host not accessible, reset failed.
> > xhci_hcd 0000:3e:00.0: Host halt failed, -19
> > xhci_hcd 0000:3e:00.0: Host not accessible, reset failed.

[ 5390.147140] pcieport 0000:03:00.0: Refused to change power state, currently in D3
[ 5390.150504] pcieport 0000:04:01.0: Refused to change power state, currently in D3
[ 5390.150505] pcieport 0000:04:00.0: Refused to change power state, currently in D3
[ 5390.150506] pcieport 0000:04:02.0: Refused to change power state, currently in D3
[ 5390.150507] pcieport 0000:04:04.0: Refused to change power state, currently in D3
[ 5390.156540] pcieport 0000:3a:00.0: Refused to change power state, currently in D3
[ 5390.156545] thunderbolt 0000:05:00.0: Refused to change power state, currently in D3
[ 5390.159558] pcieport 0000:3b:00.0: Refused to change power state, currently in D3
[ 5390.159559] pcieport 0000:3b:01.0: Refused to change power state, currently in D3
[ 5390.159560] pcieport 0000:3b:02.0: Refused to change power state, currently in D3


This is basically a PCI issue, which in turn crashes USB. But the second dmesg is inconsistent with that. Could we have dmesg output from two or three crashes with the same kernel, so we can be sure that we are not seeing memory corruption?
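
To compare several crash logs as requested above, the "Refused to change power state" lines can be pulled out and grouped by driver. The following is only a sketch: the regular expression is written against the line format shown in this comment, the sample lines are copied from it, and the function name `refused_devices` is hypothetical.

```python
import re
from collections import defaultdict

# Sample lines copied from the excerpt above; in practice, read an attached dmesg file.
SAMPLE = """\
[ 5390.147140] pcieport 0000:03:00.0: Refused to change power state, currently in D3
[ 5390.156545] thunderbolt 0000:05:00.0: Refused to change power state, currently in D3
[ 5390.159558] pcieport 0000:3b:00.0: Refused to change power state, currently in D3
"""

# Timestamp, driver name, PCI address (domain:bus:device.function), reported state.
PATTERN = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>\S+)\s+"
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):"
    r"\s+Refused to change power state, currently in (?P<state>\S+)$"
)


def refused_devices(text):
    """Group devices that refused a power state change by driver name."""
    by_driver = defaultdict(list)
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            by_driver[match["driver"]].append(
                (float(match["ts"]), match["bdf"], match["state"])
            )
    return dict(by_driver)


for driver, entries in refused_devices(SAMPLE).items():
    print(driver, [bdf for _, bdf, _ in entries])
```

Running this over each attached dmesg and comparing the device lists would show whether the same bridges fail first in every crash.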
Comment 11 Michael Hirmke 2020-02-03 17:06:56 UTC
Created attachment 828952 [details]
dmesg from the before last crash

I'll attach every dmesg.txt from all of the upcoming crashes, until you say stop ;)
Comment 12 Oliver Neukum 2020-02-04 09:13:06 UTC
(In reply to Michael Hirmke from comment #11)
> Created attachment 828952 [details]
> dmesg from the before last crash
> 
> I'll attach every dmesg.txt from all of the upcoming crashes, until you say
> stop ;)

Again, a PCI issue comes first. The primary theory would be that if you unplug this, the whole bus and bridge go away in the hardware, and that the PCI hotplug code is faulty. Can you please test what happens if you replace your TB box with any other device in the Type-C port? A USB cable should be enough.
Comment 13 Michael Hirmke 2020-02-06 10:03:07 UTC
After installing the latest Tumbleweed snapshot (containing 5.4.14-2-default), the system no longer creates a crash dump when resuming. Instead it shows a black screen after resuming and is completely unresponsive, i.e. it is not even pingable.
I'll still have to check if the problem also occurs with different devices instead of the TB box.
Comment 14 Michael Hirmke 2020-02-06 16:43:14 UTC
I tested with

1. thunderbolt box connected to my notebook with an external monitor connected to the TB box
2. thunderbolt box connected to my notebook without an external monitor connected to the TB box
3. no thunderbolt box connected to my notebook, but an external ssd connected to the thunderbolt port of my notebook

For cases 2. and 3. I never got a crash/hang.
For case 1. I get a hang every second or third hibernate/resume cycle.

I'm not sure, though, whether this leads in the right direction.
Because even in case 1. everything works sometimes, so it might be working only accidentally in cases 2. and 3.
And I only had the time to run every test case three times.
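
The concern about cases 2. and 3. passing only by accident can be put into numbers. This is a rough sketch that assumes each hibernate/resume cycle fails independently with the rate seen in case 1. (a hang every second or third cycle); the independence assumption and the function name `chance_of_clean_run` are not from the report.

```python
def chance_of_clean_run(failure_rate: float, cycles: int) -> float:
    """Probability of seeing no hang at all in `cycles` independent attempts."""
    return (1.0 - failure_rate) ** cycles


# "Every second or third cycle" is read here as a failure rate of 1/2 or 1/3.
for rate in (1 / 2, 1 / 3):
    p = chance_of_clean_run(rate, 3)
    print(f"failure rate {rate:.2f}: P(no hang in 3 cycles) = {p:.3f}")
```

Under these assumptions three clean cycles would occur by chance in roughly 12.5% to 30% of cases, so three runs per case cannot rule out the problem for cases 2. and 3.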
Comment 15 Michael Hirmke 2020-02-22 17:41:43 UTC
Problem didn't occur for at least ten days now.
I'm not sure what might have solved it - new kernel, new Mesa, ...?
Comment 16 Michael Hirmke 2020-02-23 09:50:24 UTC
Of course the problem reoccurred today after my last comment 8-<
Again I got a black screen and an unresponsive system after resuming from hibernation.
No log entries or anything else to see what might have gone wrong.
Comment 17 Takashi Iwai 2020-03-12 13:58:56 UTC
So does the issue still happen with 5.5.y kernel?  There have been a couple of ptp-clock-related fixes in 5.5, and I hoped that those addressed the issue.

If it's with 5.5 or later, please try to get a crash stack trace again.  Thanks.
Comment 18 Michael Hirmke 2020-03-12 16:01:56 UTC
(In reply to Takashi Iwai from comment #17)
> So does the issue still happen with 5.5.y kernel?  There have been a couple
> of ptp-clock-related fixes in 5.5, and I hoped that those addressed the
> issue.
> 
> If it's with 5.5 or later, please try to get a crash stack trace again. 
> Thanks.

No stack trace at all with later kernels, but every so often the machine only shows a black screen after resuming.
With 5.5.6 the situation was acceptable, with 5.5.7-1.1 it was really bad - I even got errors like

igb 0000:3f:00.0: can't change power state from D3cold to D0 (config space inaccessible)
xhci_hcd 0000:3e:00.0: No Extended Capability registers, unable to set up roothub

Now I'm testing with 5.5.7-1.2.
Comment 19 Michiel Janssens 2020-03-14 21:05:45 UTC
Hi, I'm using a Dell XPS 13 9360 with a Dell WD15 dock and have experienced similar behavior on Tumbleweed for at least the last month.
Hotplugging the dock doesn't work reliably anymore. I've also had several freezes and black screens and had to use REISUB to reboot. Two DP monitors are attached.
Currently running kernel 5.5.7-1-default.

I searched if upstream similar reports were filed and found this one, which has similar behavior:

https://bugzilla.kernel.org/show_bug.cgi?id=206459

with a patch already submitted upstream:

https://lore.kernel.org/lkml/20200302141451.18983-1-mika.westerberg@linux.intel.com/

I will try to reproduce the scenarios from the upstream report and, if my knowledge goes far enough, perhaps investigate whether the BIOS-e820 reserved area clipping described there is the cause.
Comment 20 Michael Hirmke 2020-04-15 11:14:39 UTC
The problem vanished, at least with kernels > 5.6.x!