Bug 910440 - "soft lockup CPU#0 stuck..." with bcache devices
Status: RESOLVED FIXED
Duplicates: 909716
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: 13.2
Hardware: Other Other
Importance: P5 - None, Normal
Target Milestone: ---
Assigned To: E-mail List
Reported: 2014-12-17 09:01 UTC by Stefan Seyfried
Modified: 2018-07-03 20:52 UTC (History)


Attachments
/var/log/messages with backtraces etc. (322.63 KB, application/x-gzip)
2014-12-17 09:01 UTC, Stefan Seyfried
5 emails with patches inside (4.95 KB, application/x-gzip)
2014-12-17 09:04 UTC, Stefan Seyfried

Description Stefan Seyfried 2014-12-17 09:01:43 UTC
Created attachment 617624 [details]
/var/log/messages with backtraces etc.

After updating my home server to 13.2, I got "soft lockup" messages in "bcache_gc" about once per week. They often rendered the bcache device unusable (nothing could read from or write to it anymore) and in the end required a hard reboot.

I have the following hardware setup:
* Abit IP35-E(Intel P35+ICH9R) mainboard
* core2 duo E6750 CPU
* 8GB memory
* 2TB 2.5" laptop disk as main system disk
* 240GB Crucial CT240M5 SSD to use as bcache device
* 2TB WD Red (uncached) for "big data"
* 1TB WD Green (uncached) for "backup data"

bcache-backed devices are:
* 30GB /home (very low workload overall; I'm the only active user on the box and do not really do anything in $HOME)
* 750GB /space1 (local data, lots of files, kernel development, embedded build system, git repos,... the high-workload filesystem)
* 4GB /var/log/journal (since 13.2, because the journal sucks on rotating rust)

So the errors I see:
Nov  9 21:56:31 server kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [bcache_gc:944]
Nov  9 21:56:31 server kernel: Modules linked in: binfmt_misc tcp_diag inet_diag tun nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc iscsi_ibft iscsi_boot_sysfs af_packet w83627ehf hwmon_vid ts2020 ds3000 bcache coretemp gpio_ich pl2303 iTCO_wdt usbserial kvm_intel dvb_usb_dw2102 kvm xfs iTCO_vendor_support dvb_usb libcrc32c serio_raw i2c_i801 pcspkr sky2 lpc_ich mfd_core shpchp acpi_cpufreq button ppdev hid_generic ehci_pci uhci_hcd ehci_hcd sr_mod cdrom usbhid usbcore usb_common parport_serial parport_pc parport edd fan ata_piix ata_generic pata_jmicron sg fuse thermal processor dm_mirror dm_region_hash dm_log dm_mod stv0299 budget_ci rc_core budget_core ttpci_eeprom saa7146 dvb_core bridge stp llc
Nov  9 21:56:31 server kernel: CPU: 0 PID: 944 Comm: bcache_gc Tainted: G        W      3.17.2-3.gbf63174-default #1
Nov  9 21:56:31 server kernel: Hardware name: .   .  /IP35-E(Intel P35+ICH9R), BIOS 6.00 PG 05/30/2008
Nov  9 21:56:31 server kernel: task: ffff8800daeca650 ti: ffff8800d7528000 task.ti: ffff8800d7528000
Nov  9 21:56:31 server kernel: RIP: 0010:[<ffffffffa03c86fd>]  [<ffffffffa03c86fd>] bch_btree_iter_next_filter+0x2dd/0x350 [bcache]
Nov  9 21:56:31 server kernel: RSP: 0018:ffff8800d752bc40  EFLAGS: 00000282
Nov  9 21:56:31 server kernel: RAX: 0000000000000005 RBX: ffff880200000001 RCX: 0000000000000002
Nov  9 21:56:31 server kernel: RDX: 0000000000000001 RSI: ffff88020ea21020 RDI: 0000000000000005
Nov  9 21:56:31 server kernel: RBP: ffff8800d752bc70 R08: ffff8800d752bca0 R09: ffff8800d752bc90
Nov  9 21:56:31 server kernel: R10: ffffffffffee89f8 R11: 0000000000000000 R12: ffff8800d752be38
Nov  9 21:56:31 server kernel: R13: ffffffffa03d0dc5 R14: ffff8800dbb40000 R15: ffff88020ea114e0
Nov  9 21:56:31 server kernel: FS:  0000000000000000(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
Nov  9 21:56:31 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov  9 21:56:31 server kernel: CR2: 00007fc9cba2a000 CR3: 00000000d60b9000 CR4: 00000000000007f0
Nov  9 21:56:31 server kernel: Stack:
Nov  9 21:56:31 server kernel:  ffff880213cea4d0 000000000000229b ffff8800d752be38 ffff8800d752bde0
Nov  9 21:56:31 server kernel:  ffff8800daecc8d0 ffff8800d752bfd8 ffff880213cea4d0 ffffffffa03c8e98
Nov  9 21:56:31 server kernel:  0000000000000004 0000000000000003 ffff88020ea11510 ffff88020ea20870
Nov  9 21:56:31 server kernel: Call Trace:
Nov  9 21:56:31 server kernel:  [<ffffffffa03c8e98>] btree_gc_count_keys+0x48/0x60 [bcache]
Nov  9 21:56:31 server kernel:  [<ffffffffa03cde4d>] btree_gc_recurse+0x19d/0x300 [bcache]
Nov  9 21:56:31 server kernel:  [<ffffffffa03ce527>] bch_btree_gc+0x3d7/0x560 [bcache]
Nov  9 21:56:31 server kernel:  [<ffffffffa03ce6e8>] bch_gc_thread+0x38/0x120 [bcache]
Nov  9 21:56:31 server kernel:  [<ffffffff81078a4d>] kthread+0xbd/0xe0
Nov  9 21:56:31 server kernel:  [<ffffffff815e14bc>] ret_from_fork+0x7c/0xb0
Nov  9 21:56:31 server kernel: Code: 06 49 8b 3e 25 ff ff 0f 00 81 e7 ff ff 0f 00 48 39 f8 75 3a 4c 8b 56 08 4d 2b 56 08 4d 85 d2 48 0f 4e f8 49 0f 4f f6 4d 0f 4e c7 <48> 0f 4e ca e9 34 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 45 31 
Nov  9 21:56:55 server kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [bcache_gc:944]
Nov  9 21:56:55 server kernel: Modules linked in: binfmt_misc tcp_diag inet_diag tun nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc iscsi_ibft iscsi_boot_sysfs af_packet w83627ehf hwmon_vid ts2020 ds3000 bcache coretemp gpio_ich pl2303 iTCO_wdt usbserial kvm_intel dvb_usb_dw2102 kvm xfs iTCO_vendor_support dvb_usb libcrc32c serio_raw i2c_i801 pcspkr sky2 lpc_ich mfd_core shpchp acpi_cpufreq button ppdev hid_generic ehci_pci uhci_hcd ehci_hcd sr_mod cdrom usbhid usbcore usb_common parport_serial parport_pc parport edd fan ata_piix ata_generic pata_jmicron sg fuse thermal processor dm_mirror dm_region_hash dm_log dm_mod stv0299 budget_ci rc_core budget_core ttpci_eeprom saa7146 dvb_core bridge stp llc
Nov  9 21:56:55 server kernel: CPU: 0 PID: 944 Comm: bcache_gc Tainted: G        W    L 3.17.2-3.gbf63174-default #1
Nov  9 21:56:55 server kernel: Hardware name: .   .  /IP35-E(Intel P35+ICH9R), BIOS 6.00 PG 05/30/2008
Nov  9 21:56:55 server kernel: task: ffff8800daeca650 ti: ffff8800d7528000 task.ti: ffff8800d7528000
Nov  9 21:56:55 server kernel: RIP: 0010:[<ffffffffa03d0ee5>]  [<ffffffffa03d0ee5>] bch_extent_bad+0xf5/0x1b0 [bcache]
Nov  9 21:56:55 server kernel: RSP: 0018:ffff8800d752bc20  EFLAGS: 00000202
Nov  9 21:56:55 server kernel: RAX: 0000000000000000 RBX: ffff8800dbb40000 RCX: 000000000000000a
Nov  9 21:56:55 server kernel: RDX: ffffc9001107b000 RSI: 0000000000000004 RDI: 00000001288dd804
Nov  9 21:56:55 server kernel: RBP: ffff88021313a8d0 R08: ffff8800dbb40000 R09: 0000000000000001
Nov  9 21:56:55 server kernel: R10: 0000000000000000 R11: ffff8800dafae000 R12: ffff8800daecc800
Nov  9 21:56:55 server kernel: R13: 0000000000000018 R14: ffffffffa03cd419 R15: ffff8800daecc800
Nov  9 21:56:55 server kernel: FS:  0000000000000000(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
Nov  9 21:56:55 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov  9 21:56:55 server kernel: CR2: 00007fc9cba2a000 CR3: 00000001d0ee9000 CR4: 00000000000007f0
Nov  9 21:56:55 server kernel: Stack:
Nov  9 21:56:55 server kernel:  ffff8800d752bc80 ffff8800d752bc70 ffffffffa03c8920 ffffffffa03c85db
Nov  9 21:56:55 server kernel:  ffff88021313a8d0 0000000000002b7a ffff8800d752be38 ffff8800d752bde0
Nov  9 21:56:55 server kernel:  ffff8800daecc8d0 ffff8800d752bfd8 ffff88021313a8d0 ffffffffa03c8e98
Nov  9 21:56:55 server kernel: Call Trace:
Nov  9 21:56:55 server kernel:  [<ffffffffa03c85db>] bch_btree_iter_next_filter+0x1bb/0x350 [bcache]
Nov  9 21:56:55 server kernel:  [<ffffffffa03c8e98>] btree_gc_count_keys+0x48/0x60 [bcache]
Nov  9 21:56:55 server kernel:  [<ffffffffa03cde4d>] btree_gc_recurse+0x19d/0x300 [bcache]
Nov  9 21:56:55 server kernel:  [<ffffffffa03ce527>] bch_btree_gc+0x3d7/0x560 [bcache]
Nov  9 21:56:55 server kernel:  [<ffffffffa03ce6e8>] bch_gc_thread+0x38/0x120 [bcache]
Nov  9 21:56:55 server kernel:  [<ffffffff81078a4d>] kthread+0xbd/0xe0
Nov  9 21:56:55 server kernel:  [<ffffffff815e14bc>] ret_from_fork+0x7c/0xb0
Nov  9 21:56:55 server kernel: Code: 48 c1 ea 08 81 e6 ff 0f 00 00 4c 21 e2 4d 8b 9c f0 40 0c 00 00 48 d3 ea 48 8d 34 52 49 8b 93 d8 0a 00 00 48 8d 34 b2 0f b6 76 06 <29> fe 40 80 fe 80 0f 87 7f 00 00 00 40 0f b6 d6 83 fa 60 76 66 

(I'll attach /var/log/messages)
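
To get an overview of how often the lockups occur and where the CPU was stuck, log lines like the ones above can be tallied with a small script. This is only a minimal sketch: the regular expressions are written against the line format shown in this report (syslog timestamp, "NMI watchdog: BUG: soft lockup" header, "RIP:" line), and the names `parse_lockups` and `count_by_function` are made up for illustration. Other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Watchdog header line, e.g.
# "... kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [bcache_gc:944]"
LOCKUP_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: .*"
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# RIP line: the symbol follows the bracketed addresses, e.g.
# "... RIP: 0010:[<addr>]  [<addr>] bch_extent_bad+0xf5/0x1b0 [bcache]"
RIP_RE = re.compile(
    r"kernel: RIP: .*\]\s+(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_lockups(lines):
    """Return one dict per soft lockup report found in the given log lines."""
    events = []
    current = None
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            # A new watchdog header starts a new event.
            current = {
                "stamp": m.group("stamp"),
                "cpu": int(m.group("cpu")),
                "secs": int(m.group("secs")),
                "comm": m.group("comm"),
                "pid": int(m.group("pid")),
                "rip": None,
            }
            events.append(current)
            continue
        m = RIP_RE.search(line)
        # Attach only the first RIP line that follows a header.
        if m and current is not None and current["rip"] is None:
            current["rip"] = m.group("sym")
    return events


def count_by_function(events):
    """Count how often each function shows up as the stuck instruction pointer."""
    return Counter(e["rip"] for e in events if e["rip"])
```

Feeding it the lines of /var/log/messages (for example via `open(path, errors="replace")`) gives one entry per watchdog report, and `count_by_function` shows which functions the stuck CPU was executing.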

After investigating, I found patches on the bcache mailing list that have not been merged upstream yet, probably because the maintainer is busy.

I applied those 5 patches, and with the patched bcache module I have observed no problems since 21 November.

Patches 0-2 look "important"; patches 3 and 4 only fix the error path taken when bcache init fails, which seems less urgent.
Comment 1 Stefan Seyfried 2014-12-17 09:04:39 UTC
Created attachment 617628 [details]
5 emails with patches inside

The patches I applied.

Note that the code is unchanged upstream from 3.16 to 3.18.

Before patching, I tried to solve this by using Kernel:Stable (3.17 at that time) and Kernel:HEAD (3.18rc at that time), neither of which changed anything. So I patched bcache.ko, which helped.

The patches should apply from 3.16 to 3.18.
Comment 2 Takashi Iwai 2015-02-10 15:09:53 UTC
Yeah, upstream looks lazy; there are no significant commits for 3.19 or for 3.20 yet.

I backported the patches to the openSUSE-13.2, stable and master branches for now.  Let's hope that these will become obsolete later during 3.20 development...
Comment 3 Bernhard Wiedemann 2015-02-12 10:00:11 UTC
This is an autogenerated message for OBS integration:
This bug (910440) was mentioned in
https://build.opensuse.org/request/show/285765 Factory / kernel-source
Comment 4 Oliver Neukum 2015-02-26 15:04:24 UTC
*** Bug 909716 has been marked as a duplicate of this bug. ***
Comment 5 Swamp Workflow Management 2015-04-13 12:11:17 UTC
openSUSE-SU-2015:0713-1: An update that solves 13 vulnerabilities and has 52 fixes is now available.

Category: security (important)
Bug References: 867199,893428,895797,900811,901925,903589,903640,904899,905681,907039,907818,907988,908582,908588,908589,908592,908593,908594,908596,908598,908603,908604,908605,908606,908608,908610,908612,909077,909078,909477,909634,910150,910322,910440,911311,911325,911326,911356,911438,911578,911835,912061,912202,912429,912705,913059,913466,913695,914175,915425,915454,915456,915577,915858,916608,917830,917839,918954,918970,919463,920581,920604,921313,922542,922944
CVE References: CVE-2014-8134,CVE-2014-8160,CVE-2014-8559,CVE-2014-9419,CVE-2014-9420,CVE-2014-9428,CVE-2014-9529,CVE-2014-9584,CVE-2014-9585,CVE-2015-0777,CVE-2015-1421,CVE-2015-1593,CVE-2015-2150
Sources used:
openSUSE 13.2 (src):    bbswitch-0.8-3.6.6, cloop-2.639-14.6.6, crash-7.0.8-6.6, hdjmod-1.28-18.7.6, ipset-6.23-6.6, kernel-docs-3.16.7-13.2, kernel-obs-build-3.16.7-13.7, kernel-obs-qa-3.16.7-13.1, kernel-obs-qa-xen-3.16.7-13.1, kernel-source-3.16.7-13.1, kernel-syms-3.16.7-13.1, pcfclock-0.44-260.6.2, vhba-kmp-20140629-2.6.2, virtualbox-4.3.20-10.2, xen-4.4.1_08-12.2, xtables-addons-2.6-6.2