Bug 1083866 - Kernel 4.15 - USB network adapter r8152 freeze
Kernel 4.15 - USB network adapter r8152 freeze
Status: RESOLVED INVALID
Classification: openSUSE
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Current
x86-64 openSUSE Factory
: P5 - None : Major with 17 votes (vote)
: ---
Assigned To: Oliver Neukum
E-mail List
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2018-03-04 14:02 UTC by Steffen Ungruh
Modified: 2018-12-10 18:35 UTC (History)
7 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
The "lsusb -v" output as requested. (45.31 KB, text/plain)
2018-03-08 18:10 UTC, Steffen Ungruh
Details
"ethtool enp57s0u1u4" to confirm. (762 bytes, text/plain)
2018-03-08 18:11 UTC, Steffen Ungruh
Details
"ethtool --show-offload enp57s0u1u4" for further confirmation. (1.50 KB, text/plain)
2018-03-08 18:12 UTC, Steffen Ungruh
Details
"udevadm info -e" for good measure. (189.84 KB, text/plain)
2018-03-08 18:13 UTC, Steffen Ungruh
Details
Call trace of the blocked kworker (10.08 KB, text/plain)
2018-04-24 12:34 UTC, Steffen Ungruh
Details
Pre-4.15 behaviour (1.51 KB, text/plain)
2018-04-24 12:35 UTC, Steffen Ungruh
Details
patch to disable LPM specifically for the r8153 (896 bytes, patch)
2018-05-02 14:03 UTC, Oliver Neukum
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Steffen Ungruh 2018-03-04 14:02:30 UTC
openSUSE Tumbleweed on a Dell XPS 13 with a Dell DA-200 4-in-1 Thunderbolt dongle.
Of the four (HDMI, VGA, USB3, LAN) only LAN (Realtek Semiconductor Corp. RTL8153 Gigabit Ethernet Adapter) is affected. This problem started with kernel 4.15.

Problem and symptoms:
- After some time of inactivity the network adapter freezes (goes into powersaving) and doesn't come back up.
- NFS share is mounted and minidlna server running - freezes nonetheless; even a simple "ls" in the console causes hangs.
- shutdown is locked-up. The graphical environment closes, system is then stuck on plymouth shutdown screen. Half of the time it manages to run the shutdown scripts and then locks-up. The other half it can't run any shutdown scripts at all or gets stuck in a loop with "Avahi" and other network related tasks.

Mitigation - Preliminary fix:
- In powertop, switching "Autosuspend for USB device USB 10/100/1000 LAN [Realtek]" from the default "GOOD" to "BAD".
- Equivalent of "echo 'on' > '/sys/bus/usb/devices/4-1.4/power/control'"

My two cents:
- Wasn't sure what caused the hangs. Suspected my NVME at first. I was aware of the NVME ALPM changes. Played with the switches, no change.
- The frozen KWorker processes then revealed they were dealing with two kernel modules: usbcore and r8152
- Tried to find powermanagement relevant changes concerning those modules in the 4.15 changelog, but the only ones standing out to me were the NVME/SSD dipm ALPM ones. I'm far from being an expert in that matter.
Comment 1 Steffen Ungruh 2018-03-07 15:56:49 UTC
Quick follow-up:

I have not provided any logs since the journal entries appear to be quite unhelpful. No clues given at all.

However, if you want me to investigate further - I do have hardware availability after all - let me know.  Just provide me with a step-by-step guide (what to install, what to run etc.) and I'll get you what you need.
Comment 2 Oliver Neukum 2018-03-08 11:05:44 UTC
(In reply to Steffen Ungruh from comment #1)
> Quick follow-up:
> 
> I have not provided any logs since the journal entries appear to be quite
> unhelpful. No clues given at all.
> 
> However, if you want me to investigate further - I do have hardware
> availability after all - let me know.  Just provide me with a step-by-step
> guide (what to install, what to run etc.) and I'll get you what you need.

Are you positive it worked with v4.14?

If so, the likely cause is:

commit 0b1655143df00ac5349f27b765b2ed13a3ac40ca
Author: Kai-Heng Feng <kai.heng.feng@canonical.com>
Date:   Tue Jan 16 16:46:27 2018 +0800

    r8152: disable RX aggregation on Dell TB16 dock

If it worked with v4.14, please attach the output of "lsusb -v"
Comment 3 Steffen Ungruh 2018-03-08 18:10:39 UTC
Created attachment 763133 [details]
The "lsusb -v" output as requested.
Comment 4 Steffen Ungruh 2018-03-08 18:11:43 UTC
Created attachment 763134 [details]
"ethtool enp57s0u1u4" to confirm.
Comment 5 Steffen Ungruh 2018-03-08 18:12:36 UTC
Created attachment 763135 [details]
"ethtool --show-offload enp57s0u1u4" for further confirmation.
Comment 6 Steffen Ungruh 2018-03-08 18:13:45 UTC
Created attachment 763136 [details]
"udevadm info -e" for good measure.
Comment 7 Steffen Ungruh 2018-03-08 18:30:05 UTC
Yes. I've been switching back and forth between 4.14 and 4.15 to make sure. The bug starts with 4.15.

As far as I can tell my Dell DA-200 dongle doesn't trip the wire set up by the commit. I added output mentioned in the patch/commit's original lauchpad thread (https://bugs.launchpad.net/dell-sputnik/+bug/1729674) to confirm. But to be honest, I can't make heads or tails of it.

Then again, considering that a "fix" that works for me under the 4.15 kernel is to simply run 

echo 'on' > '/sys/bus/usb/devices/4-1.4/power/control'    (It was set to "auto".)

suggests a powermanagement issue, right?

And, the frozen/stalled/dead KWorker process touches the "usbcore" module as well as the "r8152" one.

So, somewhere on that intersection between USB, powermanagement, the network driver and maybe even thunderbolt resides some code that has seen changes between the 4.14. and 4.15 release.

I'm just spitballing here. I really haven't got a clue.
Comment 8 Steffen Ungruh 2018-03-20 17:33:46 UTC
So. I've checked a number of times again just to make sure. The 4.14 kernel is the last "unfreezing" one. I also took a look at the most recent 4.16 from the Kernel:HEAD repo and discovered the same problem. My hopes of that issue going away have therefore vanished.

Any suggestions what the cause might be? Is my adapter affected by the commit changes at all?

Just to reiterate, I'm willing to install kernel debug packages and all the necessary tools in order to provide useful information. I can run both kernel versions for comparisons sake until the powermanagement (if that's the issue) kicks in or whatever else makes sense.

But for that I need some guidance i.e. needed tools, specific commands, what to look out for etc.
Kernel debugging would be a first for me.

I have a free weekend coming up. Bring it on and I'll deliver.
Comment 9 Oliver Neukum 2018-03-22 14:31:39 UTC
(In reply to Steffen Ungruh from comment #8)

> Any suggestions what the cause might be? Is my adapter affected by the
> commit changes at all?

No. It looks like a change in another part of the kernel affected the driver.
 
> Just to reiterate, I'm willing to install kernel debug packages and all the
> necessary tools in order to provide useful information. I can run both
> kernel versions for comparisons sake until the powermanagement (if that's
> the issue) kicks in or whatever else makes sense.

It looks like this issue will have to be bisected.
The procedure is described here:

https://www.kernel.org/doc/html/v4.14/admin-guide/bug-bisect.html

the good commit:
git bisect good v4.14

the bad commit:
git bisect v4.15
Comment 10 Steffen Ungruh 2018-04-24 12:34:56 UTC
Created attachment 768110 [details]
Call trace of the blocked kworker
Comment 11 Steffen Ungruh 2018-04-24 12:35:42 UTC
Created attachment 768111 [details]
Pre-4.15 behaviour
Comment 12 Steffen Ungruh 2018-04-24 12:58:07 UTC
Alright. That took longer than expected. And still, I can not provide a bad commit. While the regular Tumbleweed kernels keep on freezing like clockwork within minutes after a boot, the compiled/bisected ones produce false positives and false negatives alike and are way more erratic in their behavior. 

I ran the bisect process three times, each time trying to nail down the (then less than obvious) failed states. Once I got send to an drm/amd commit in the 4.13 kernel, once to 4.15 POWERPC spectre patches and finally to a usbip commit. But, my GPU is Intel on an x86-64 and I don't use usbip. So, that was a bust.

Nonetheless, it gave me an opportunity to observe the logs closely. Note: The compiled kernels are chattier than the distro ones. This is mostly due to a 

echo /proc/sys/kernel/hung_task_timeout_secs
0

setting in the latter.

So, this seems to be the call trace of the frozen kworker:

Apr 24 13:58:49 neroon kernel: INFO: task kworker/0:2:50 blocked for more than 60 seconds.
Apr 24 13:58:49 neroon kernel:       Not tainted 4.16.2-1-default #1
Apr 24 13:58:49 neroon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 24 13:58:49 neroon kernel: kworker/0:2     D    0    50      2 0x80000000
Apr 24 13:58:49 neroon kernel: Workqueue: events rtl_work_func_t [r8152]
Apr 24 13:58:49 neroon kernel: Call Trace:
Apr 24 13:58:49 neroon kernel:  ? __schedule+0x28f/0x890
Apr 24 13:58:49 neroon kernel:  schedule+0x2f/0x90
Apr 24 13:58:49 neroon kernel:  rpm_resume+0xf9/0x7a0
Apr 24 13:58:49 neroon kernel:  ? wait_woken+0x80/0x80
Apr 24 13:58:49 neroon kernel:  rpm_resume+0x547/0x7a0
Apr 24 13:58:49 neroon kernel:  ? dequeue_entity+0xd9/0x430
Apr 24 13:58:49 neroon kernel:  __pm_runtime_resume+0x3a/0x50
Apr 24 13:58:49 neroon kernel:  usb_autopm_get_interface+0x1d/0x50 [usbcore]
Apr 24 13:58:49 neroon kernel:  rtl_work_func_t+0x3c/0x27c [r8152]
Apr 24 13:58:49 neroon kernel:  process_one_work+0x1d4/0x3f0
Apr 24 13:58:49 neroon kernel:  worker_thread+0x2b/0x3d0
Apr 24 13:58:49 neroon kernel:  ? process_one_work+0x3f0/0x3f0
Apr 24 13:58:49 neroon kernel:  kthread+0x113/0x130
Apr 24 13:58:49 neroon kernel:  ? kthread_create_worker_on_cpu+0x50/0x50
Apr 24 13:58:49 neroon kernel:  ret_from_fork+0x24/0x50

I have attached a gdb trace (see file) of each function nailing it down to the line in the code. 

Furthermore, I compared it to the logs of older kernels and found a bunch of output that's bee missing since 4.15. If you take a look at the other attached file (Pre-4.15 behaviour) you'll see why.

Apr 01 14:18:25 neroon kernel: usb 4-1.4: Disable of device-initiated U1 failed.
Apr 01 14:18:25 neroon kernel: usb 4-1.4: Disable of device-initiated U2 failed.
Apr 01 14:18:25 neroon kernel: usb 4-1.4: usb_reset_and_verify_device Failed to disable LTM
                               .
Apr 01 14:18:25 neroon kernel: usb 4-1.4: USB disconnect, device number 5

Note: the single "." in line 4 is not a typo.
Apparently, autosuspend never worked properly. Instead, the kernel - after failing to wake up the device - performs an internal unplug/plug and enlists it thereby as a new one. Pretty dirty, but at least it worked and didn't cause all kinds of havoc on the entire system.

So. Now what? What else can I do to further the cause?
Comment 13 Gabriel Rauter 2018-04-25 15:55:21 UTC
Hi, I just wanted to add that I am having the same issue on Arch Linux on a Macbook Air since Kernel 4.15.

But in my case it isn't a thunderbolt adapter but a USB 3.0 to RJ45 Gigabit Ethernet Network Adapter also with Realtek Semiconductor Corp. RTL8153 chipset.

Steffen Ungruhs workaround with switching "Autosuspend" to "BAD" seems to work so far.
Comment 14 Oliver Neukum 2018-05-02 14:03:16 UTC
Created attachment 768826 [details]
patch to disable LPM specifically for the r8153
Comment 15 Oliver Neukum 2018-05-02 14:04:05 UTC
Could you try the patch from comment#14 or do you need a test kernel?
Comment 16 Steffen Ungruh 2018-05-03 15:38:23 UTC
Nice effort. I see what you did there.

Used the kernel-sources from the repo instead of the git tree for this. Patched it. Compiled it. Ran it.

Results in the same crash. Call trace is basically identical.
Comment 17 Oliver Neukum 2018-05-08 10:37:06 UTC
We absolutely need to verify whether this issue is seen upstream.
Could you test the kernel from:
http://kernel.suse.com/packages/vanilla
Comment 18 Steffen Ungruh 2018-05-12 07:15:06 UTC
You're on to something.

I've tested default and vanilla kernels from the OSS, kernel:vanilla and kernel:HEAD repos ranging from 4.16.x to 4.17-rc4.

The only kernels which don't break are the 4.17-rc4-vanilla ones regardless of the repo. Everything else freezes within minutes after a reboot.
Comment 19 Steffen Ungruh 2018-05-12 23:12:43 UTC
Correction.

After hours of running smoothly it froze. Right after copying files from an NFS share straight to an attached USB HDD. The files copied successfully. The output was as follows.


May 13 01:36:55 neroon kernel: INFO: task kworker/0:0:4 blocked for more than 60 seconds.
May 13 01:36:55 neroon kernel:       Tainted: G     U            4.17.0-rc4-1.g8257a00-vanilla #1
May 13 01:36:55 neroon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 13 01:36:55 neroon kernel: kworker/0:0     D    0     4      2 0x80000000
May 13 01:36:55 neroon kernel: Workqueue: events rtl_work_func_t [r8152]
May 13 01:36:55 neroon kernel: Call Trace:
May 13 01:36:55 neroon kernel:  ? __schedule+0x289/0x880
May 13 01:36:55 neroon kernel:  schedule+0x2f/0x90
May 13 01:36:55 neroon kernel:  rpm_resume+0xf9/0x7a0
May 13 01:36:55 neroon kernel:  ? wait_woken+0x80/0x80
May 13 01:36:55 neroon kernel:  rpm_resume+0x547/0x7a0
May 13 01:36:55 neroon kernel:  ? __switch_to_asm+0x40/0x70
May 13 01:36:55 neroon kernel:  ? __switch_to_asm+0x34/0x70
May 13 01:36:55 neroon kernel:  ? __switch_to_asm+0x40/0x70
May 13 01:36:55 neroon kernel:  ? __switch_to_asm+0x34/0x70
May 13 01:36:55 neroon kernel:  ? __switch_to_asm+0x40/0x70
May 13 01:36:55 neroon kernel:  __pm_runtime_resume+0x3a/0x50
May 13 01:36:55 neroon kernel:  usb_autopm_get_interface+0x1d/0x50 [usbcore]
May 13 01:36:55 neroon kernel:  rtl_work_func_t+0x3c/0x27c [r8152]
May 13 01:36:55 neroon kernel:  process_one_work+0x1d4/0x3f0
May 13 01:36:55 neroon kernel:  worker_thread+0x2b/0x3d0
May 13 01:36:55 neroon kernel:  ? process_one_work+0x3f0/0x3f0
May 13 01:36:55 neroon kernel:  kthread+0x113/0x130
May 13 01:36:55 neroon kernel:  ? kthread_create_worker_on_cpu+0x50/0x50
May 13 01:36:55 neroon kernel:  ret_from_fork+0x3a/0x50
Comment 20 Steffen Ungruh 2018-05-13 09:15:24 UTC
So, to sum up what we've learned so far:

1.  All kernels are affected.

2.  The distro kernels break within a very short timeframe after reboot.

3.  The vanilla and bisected ones are totally erratic. Sometimes they break, sometimes they don't. (In my last case within a second after network use)

4.  And, arch-kernels are also affected.


My questions:

What differentiates the default and vanilla kernels when it comes to USB power management?

How can I test this more efficiently? Meaning, how can this error be provoked. Waiting for something to happen/not happen has proven completely unreliable.
Comment 21 Michael Rans 2018-08-30 12:41:41 UTC
I stumbled across this bug report. I have been having similar issues on various versions of Linux Mint with various kernel versions (including 4.15). I have field a bug report against the Linux kernel: https://bugzilla.kernel.org/show_bug.cgi?id=200977

I have referenced this report in there.

Did you find a solution to this problem?
Comment 22 Steffen Ungruh 2018-09-16 13:45:14 UTC
The hang still occurs with the most recent kernel 4.18.7.

I have tried to bisect the problem repeatedly, but ended up nowhere - or in different places every single time depending on how you look at it. The fact that the issue arises with pre-4.15 kernels on other distributions might have something to do with that. However, 4.14 worked fine for me. The bisected kernels themselves kept on behaving rather strangely. Sometimes running for 24 hours without hanging, sometimes hanging within minutes after a reboot, sometimes with sometimes without hung_task output.

Advise on how to proceed would be appreciated.

Where do we go from here?
Comment 23 Colin Hamilton 2018-09-25 16:43:49 UTC
I'm experiencing this issue on openSUSE Leap 42.3, my current kernel is 4.4.155-68-default.

My network will suddenly lock up and this will appear in the dmesg:

[17521.847039] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[17521.847050] xhci_hcd 0000:0e:00.0: Looking for event-dma 00000001c5680010 trb-start 00000001c5681fe0 trb-end 00000001c5681fe0 seg-start 00000001c5681000 seg-end 00000001c5681ff0
[17521.847272] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[17521.847285] xhci_hcd 0000:0e:00.0: Looking for event-dma 00000001c5680020 trb-start 00000001c5681fe0 trb-end 00000001c5681fe0 seg-start 00000001c5681000 seg-end 00000001c5681ff0
[17521.973697] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[17521.973710] xhci_hcd 0000:0e:00.0: Looking for event-dma 00000001c5680030 trb-start 00000001c5681fe0 trb-end 00000001c5681fe0 seg-start 00000001c5681000 seg-end 00000001c5681ff0
[17522.047586] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[17522.047596] xhci_hcd 0000:0e:00.0: Looking for event-dma 00000001c5680040 trb-start 00000001c5681fe0 trb-end 00000001c5681fe0 seg-start 00000001c5681000 seg-end 00000001c5681ff0
[17522.376027] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[17522.376038] xhci_hcd 0000:0e:00.0: Looking for event-dma 00000001c5680050 trb-start 00000001c5681fe0 trb-end 00000001c5681fe0 seg-start 00000001c5681000 seg-end 00000001c5681ff0
[17522.376454] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[17522.376467] xhci_hcd 0000:0e:00.0: Looking for event-dma 00000001c5680060 trb-start 00000001c5681fe0 trb-end 00000001c5681fe0 seg-start 00000001c5681000 seg-end 00000001c5681ff0
[17522.410585] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[17522.410596] xhci_hcd 0000:0e:00.0: Looking for event-dma 00000001c5680070 trb-start 00000001c5681fe0 trb-end 00000001c5681fe0 seg-start 00000001c5681000 seg-end 00000001c5681ff0
[17522.434837] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[17522.434849] xhci_hcd 0000:0e:00.0: Looking for event-dma 00000001c5680080 trb-start 00000001c5681fe0 trb-end 00000001c5681fe0 seg-start 00000001c5681000 seg-end 00000001c5681ff0
[17543.007925] r8152 4-1.2:1.0 eth2: Tx timeout
[17543.049740] r8152 4-1.2:1.0 eth2: Tx status -2
[17543.059507] r8152 4-1.2:1.0 eth2: Tx status -2
[17543.072625] r8152 4-1.2:1.0 eth2: Tx status -2
[17543.082387] r8152 4-1.2:1.0 eth2: Tx status -2

The best way for me to recover is to unplug and replug the ethernet cable. My laptop is an XPS Precision 15 and I'm using the ethernet port on the Thunderbolt 3 Dock which is connected to the laptop via usb-c. 

08:00.0 System peripheral: Intel Corporation DSL6340 Thunderbolt 3 NHI [Alpine Ridge 2C 2015]
	Subsystem: Device 2222:1111
	Flags: bus master, fast devsel, latency 0, IRQ 255
	Memory at ea000000 (32-bit, non-prefetchable) [size=256K]
	Memory at ea040000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [80] Power Management version 3
	Capabilities: [88] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [c0] Express Endpoint, MSI 00
	Capabilities: [a0] MSI-X: Enable- Count=16 Masked-
	Capabilities: [100] Device Serial Number fa-1d-c0-3a-f7-e9-07-00
	Capabilities: [200] Advanced Error Reporting
	Capabilities: [300] Virtual Channel
	Capabilities: [400] Power Budgeting <?>
	Capabilities: [500] Vendor Specific Information: ID=1234 Rev=1 Len=088 <?>
	Capabilities: [600] Latency Tolerance Reporting

0e:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller (prog-if 30 [XHCI])
	Subsystem: Dell Device b054
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Memory at d4000000 (64-bit, non-prefetchable) [size=32K]
	Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [68] MSI-X: Enable+ Count=8 Masked-
	Capabilities: [78] Power Management version 3
	Capabilities: [80] Express Legacy Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci

I've tried these workarounds gleaning from different bug reports similar to this issue with no success:

1. ethtool --offload <device> rx off
2. udev rule:
ACTION=="add", SUBSYSTEM=="usb", ATTR{idVendor}=="0bda", ATTR{idProduct}=="8153", TEST=="power/control", ATTR{power/control}="on"
KERNEL=="cpu",RUN+="/bin/sh -c 'echo -n 1 > /sys/devices/system/cpu/intel_pstate/no_turbo'"
3. /sys/bus/usb/devices/4-1.2/power/control was already set to "on" by default

This next one decreased the frequency of the issue. Whereas before I’d run into it multiple times a day, with this I run into it maybe once every few days:
4. ethtool -s eth2 speed 100 duplex full autoneg on

It’s unfortunate to have to decrease my speed though.

As far as I know I've always had this issue on Leap 42.3 so I don't know if there's a kernel where it doesn't occur.

Could this be related to this patch?

https://www.spinics.net/lists/linux-usb/msg157958.html

Anything I can provide to help figure this out?
Comment 24 Colin Hamilton 2018-10-17 17:07:23 UTC
Updating the firmware for the dock and the laptop resolved the issue for me.
Comment 25 Oliver Neukum 2018-12-10 15:30:35 UTC
Firmware problem
Comment 26 Steffen Ungruh 2018-12-10 18:35:38 UTC
Just to clarify, which firmware is the culprit?

My thunderbolt adapter's firmware is up to date at NVM26.

I am not sure to which dock Colin Hamilton is referring. The Dell DA200 - for which I filed this bug report - is just a dongle with no upgradable firmware to speak of.