Bug 961263

Summary: NCQ Timeout with SMR drives (e.g. Seagate 8tb hdd)
Product: [openSUSE] openSUSE Distribution
Reporter: Ferdinand Thiessen <rpm>
Component: Kernel
Assignee: E-mail List <kernel-maintainers>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: hare, rpm, sweet_f_a, tiwai
Version: Leap 42.1
Target Milestone: ---
Hardware: All
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Ferdinand Thiessen 2016-01-09 17:51:01 UTC
Running openSUSE 42.1 (and also with Tumbleweed) I experienced random hard disk drive crashes.
After some investigation I found this bug report, which describes the problem better than I can:

https://bugzilla.kernel.org/show_bug.cgi?id=93581

It would be great to see this fixed in the kernel packages provided by openSUSE.

A fix can be found in comment 67:

https://bugzilla.kernel.org/show_bug.cgi?id=93581#c67

The bug is fixed in mainline kernel version 4.4:

http://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=bugzilla-93581&id=7c4fbd50bfece00abf529bc96ac989dd2bb83ca4

So at the moment I have to use the kernel provided by the Kernel:HEAD project to work around this bug.
Comment 1 Ferdinand Thiessen 2016-01-09 17:55:54 UTC
The commit in the master branch is: http://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?id=ca369d51b3e1649be4a72addd6d6a168cfb3f537
Comment 2 Takashi Iwai 2016-01-10 07:54:16 UTC
It looks like this should have been backported via stable trees, as the original regression commit was also merged to stable trees...
Comment 3 Ferdinand Thiessen 2016-01-10 15:30:41 UTC
I have checked that, but I cannot find it in the changelog of e.g. 4.1.5.
I have now patched 4.3.3 (from the Kernel:STABLE project) and can confirm this fixes the bug. After 500 GiB of written data there was no crash (whereas the drives crash after 10-100 GiB of data with an unpatched kernel).
Comment 4 Takashi Iwai 2016-01-10 17:15:25 UTC
The commit 4f258a4634 ('sd: Fix maximum I/O size for BLOCK_PC requests') was backported to 4.1.7 kernel.
Comment 5 Takashi Iwai 2016-01-12 10:26:36 UTC
Hmm, the fix commit (7c4fbd50bf) requires other previous changes as well: one reverts the change 34b48db66 (commit 30e2bc08b2) and another bumps BLK_DEF_MAX_SECTORS (commit d2be537c3ba).  So it cannot be applied on its own as a stable patch to 4.1.x.

Worse, this fix changes the kABI by adding a new field to the queue_limits struct, which is embedded in other structs.  Fortunately, there is a hole at the tail of the struct, and the new field should fit there, at least on x86-64.

Anyway, I set up a test kernel including these fix patches in OBS home:tiwai:bnc961263 repo.  The package is being built now.  Could you try the kernel later from that OBS repo?
Comment 6 Ferdinand Thiessen 2016-01-14 02:13:02 UTC
(In reply to Takashi Iwai from comment #5)
> Anyway, I set up a test kernel including these fix patches in OBS
> home:tiwai:bnc961263 repo.  The package is being built now.  Could you try
> the kernel later from that OBS repo?

Yes, of course I can and will test it. But at the moment I cannot reboot, so it will have to wait until Friday.
Comment 7 Ferdinand Thiessen 2016-01-17 13:37:53 UTC
OK, I have started testing the kernel.
I will write about 3 TiB of data, read it back, and then start a long-running test (multiple simultaneous reads and writes, letting the drive go to sleep and waking it up again, which is when it often crashed).

I will be back in a few days.
Comment 8 Ferdinand Thiessen 2016-01-18 16:48:39 UTC
Ok, looks good!
No issues found, system runs stable.
Comment 9 Takashi Iwai 2016-01-18 17:07:55 UTC
Good to hear.  I merged the fix patches to openSUSE-42.1 branch now.  The next update kernel will include the fix.

Let's close the bug.  Thanks for reporting and testing.
Comment 10 Swamp Workflow Management 2016-01-29 13:17:27 UTC
openSUSE-SU-2016:0280-1: An update that solves 10 vulnerabilities and has 18 fixes is now available.

Category: security (important)
Bug References: 865096,865259,913996,950178,950998,952621,954324,954532,954647,955422,956708,957152,957988,957990,958439,958463,958504,958510,958886,958951,959190,959399,960021,960710,961263,961509,962075,962597
CVE References: CVE-2015-7550,CVE-2015-8539,CVE-2015-8543,CVE-2015-8550,CVE-2015-8551,CVE-2015-8552,CVE-2015-8569,CVE-2015-8575,CVE-2015-8767,CVE-2016-0728
Sources used:
openSUSE Leap 42.1 (src):    kernel-debug-4.1.15-8.1, kernel-default-4.1.15-8.1, kernel-docs-4.1.15-8.3, kernel-ec2-4.1.15-8.1, kernel-obs-build-4.1.15-8.2, kernel-obs-qa-4.1.15-8.1, kernel-obs-qa-xen-4.1.15-8.1, kernel-pae-4.1.15-8.1, kernel-pv-4.1.15-8.1, kernel-source-4.1.15-8.1, kernel-syms-4.1.15-8.1, kernel-vanilla-4.1.15-8.1, kernel-xen-4.1.15-8.1
Comment 11 Ruediger Meier 2016-01-31 15:38:18 UTC
Patch ca369d51b3e1 ("block/sd: Fix device-imposed transfer length
limits") introduced a regression:

Reproduce bug:
$ modprobe -r scsi_debug
$ modprobe scsi_debug sector_size=512
$ udevadm settle
$ devname=$(grep --with-filename scsi_debug /sys/block/*/device/model | awk -F '/' '{print $4}')
$ cat /sys/block/$devname/queue/{minimum_io_size,optimal_io_size}
512
64

Should be:
512
>=512
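
The expectation above can be turned into a small check. The following is a minimal sketch and not part of the original report: the function name check_io_limits is made up for illustration, and treating an optimal_io_size of 0 as "not reported" is an assumption based on the sysfs block documentation.

```shell
#!/bin/sh
# check_io_limits QUEUE_DIR
# QUEUE_DIR is a block device queue directory such as /sys/block/sdX/queue.
# Returns 0 if optimal_io_size is 0 (assumed to mean "not reported") or is
# at least minimum_io_size, 1 if it is smaller, and 2 if a file is unreadable.
# Hypothetical helper for illustration; not part of any existing tool.
check_io_limits() {
    cil_dir=$1
    cil_min=$(cat "$cil_dir/minimum_io_size") || return 2
    cil_opt=$(cat "$cil_dir/optimal_io_size") || return 2
    if [ "$cil_opt" -ne 0 ] && [ "$cil_opt" -lt "$cil_min" ]; then
        echo "suspicious: optimal_io_size=$cil_opt < minimum_io_size=$cil_min"
        return 1
    fi
    echo "ok: minimum_io_size=$cil_min optimal_io_size=$cil_opt"
    return 0
}
```

After the modprobe steps above, it could be called as: check_io_limits "/sys/block/$devname/queue"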


I've sent a GitHub pull request to fix that:
https://github.com/openSUSE/kernel-source/pull/2

Note: This regression also happens in kernel 4.4 (current Tumbleweed) but should be fixed in 4.5.
Comment 12 Takashi Iwai 2016-01-31 17:21:24 UTC
We don't handle github pull requests at all.  Please give just the upstream commit ids to cherry-pick.
Comment 13 Ruediger Meier 2016-01-31 18:09:16 UTC
It's

commit 9c1d9c207bb800498347a2716da298043ee280c5
Author: Martin K. Petersen <martin.petersen@oracle.com>
Date:   Wed Dec 16 17:53:52 2015 -0500
Subject: sd: Reject optimal transfer length smaller than page size

and

commit d0eb20a863ba7dc1d3f4b841639671f134560be2
Author: Martin K. Petersen <martin.petersen@oracle.com>
Date:   Wed Jan 20 11:01:23 2016 -0500
Subject: sd: Optimal I/O size is in bytes, not sectors


The first one may not be strictly needed, but it is related and is also required to apply the second one without conflicts.
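
For illustration, the unit mismatch described by the second commit's subject can be worked through with plain shell arithmetic. This is a sketch under assumptions, not kernel code: it assumes the 64 shown in comment 11 is a transfer length in logical blocks, that the logical block size is 512 bytes (as set by sector_size=512), and that the page size is 4096 bytes. The helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not taken from the kernel source.

# blocks_to_bytes BLOCKS BLOCK_SIZE
# Convert a length given in logical blocks to bytes.
blocks_to_bytes() {
    echo $(( $1 * $2 ))
}

# at_least_one_page BYTES PAGE_SIZE
# Succeed only if BYTES is not smaller than PAGE_SIZE.
at_least_one_page() {
    [ "$1" -ge "$2" ]
}

# Assumed values: 64 blocks of 512 bytes, 4096-byte pages.
opt_bytes=$(blocks_to_bytes 64 512)
echo "$opt_bytes"    # prints 32768

if at_least_one_page "$opt_bytes" 4096; then
    echo "accepted"
else
    echo "rejected"
fi
```

Under these assumptions, a value of 32768 in optimal_io_size would satisfy the ">=512" expectation from comment 11, whereas the raw 64 does not.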
Comment 14 Takashi Iwai 2016-01-31 21:35:52 UTC
Thanks, I have now applied both to the openSUSE-42.1 branch.  The second patch will also be merged to the stable branch for TW soon.
Comment 15 Swamp Workflow Management 2016-04-12 10:13:39 UTC
openSUSE-SU-2016:1008-1: An update that solves 15 vulnerabilities and has 26 fixes is now available.

Category: security (important)
Bug References: 814440,884701,949936,951440,951542,951626,951638,953527,954018,954404,954405,954876,958439,958463,958504,959709,960561,960563,960710,961263,961500,961509,962257,962866,962977,963746,963765,963767,963931,965125,966137,966179,966259,966437,966684,966693,968018,969356,969582,970845,971125
CVE References: CVE-2015-1339,CVE-2015-7799,CVE-2015-7872,CVE-2015-7884,CVE-2015-8104,CVE-2015-8709,CVE-2015-8767,CVE-2015-8785,CVE-2015-8787,CVE-2015-8812,CVE-2016-0723,CVE-2016-2069,CVE-2016-2184,CVE-2016-2383,CVE-2016-2384
Sources used:
openSUSE Leap 42.1 (src):    kernel-debug-4.1.20-11.1, kernel-default-4.1.20-11.1, kernel-docs-4.1.20-11.3, kernel-ec2-4.1.20-11.1, kernel-obs-build-4.1.20-11.2, kernel-obs-qa-4.1.20-11.1, kernel-obs-qa-xen-4.1.20-11.1, kernel-pae-4.1.20-11.1, kernel-pv-4.1.20-11.1, kernel-source-4.1.20-11.1, kernel-syms-4.1.20-11.1, kernel-vanilla-4.1.20-11.1, kernel-xen-4.1.20-11.1