Bugzilla – Bug 910440
"soft lockup CPU#0 stuck..." with bcache devices
Last modified: 2018-07-03 20:52:58 UTC
Created attachment 617624 [details]
/var/log/messages with backtraces etc.

After updating my home server to 13.2 I got "soft lockup" messages in "bcache_gc" about once per week, often rendering the bcache device unusable (nothing would read from or write to it anymore) and in the end requiring a hard reboot.

I have the following hardware setup:
* Abit IP35-E (Intel P35+ICH9R) mainboard
* Core 2 Duo E6750 CPU
* 8GB memory
* 2TB 2.5" laptop disk as main system disk
* 240GB Crucial CT240M5 SSD to use as bcache device
* 2TB WD Red (uncached) for "big data"
* 1TB WD Green (uncached) for "backup data"

bcache-backed devices are:
30GB /home (very low "workout" overall, I'm the only active user on the box and do not really do anything in $HOME)
750GB /space1 (local data, lots of files, kernel development, embedded build system, git repos,... "high workout" filesystem)
4GB /var/log/journal (since 13.2, because the journal sucks on rotating rust)

So the errors I see:

Nov 9 21:56:31 server kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [bcache_gc:944]
Nov 9 21:56:31 server kernel: Modules linked in: binfmt_misc tcp_diag inet_diag tun nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc iscsi_ibft iscsi_boot_sysfs af_packet w83627ehf hwmon_vid ts2020 ds3000 bcache coretemp gpio_ich pl2303 iTCO_wdt usbserial kvm_intel dvb_usb_dw2102 kvm xfs iTCO_vendor_support dvb_usb libcrc32c serio_raw i2c_i801 pcspkr sky2 lpc_ich mfd_core shpchp acpi_cpufreq button ppdev hid_generic ehci_pci uhci_hcd ehci_hcd sr_mod cdrom usbhid usbcore usb_common parport_serial parport_pc parport edd fan ata_piix ata_generic pata_jmicron sg fuse thermal processor dm_mirror dm_region_hash dm_log dm_mod stv0299 budget_ci rc_core budget_core ttpci_eeprom saa7146 dvb_core bridge stp llc
Nov 9 21:56:31 server kernel: CPU: 0 PID: 944 Comm: bcache_gc Tainted: G W 3.17.2-3.gbf63174-default #1
Nov 9 21:56:31 server kernel: Hardware name: . . /IP35-E(Intel P35+ICH9R), BIOS 6.00 PG 05/30/2008
Nov 9 21:56:31 server kernel: task: ffff8800daeca650 ti: ffff8800d7528000 task.ti: ffff8800d7528000
Nov 9 21:56:31 server kernel: RIP: 0010:[<ffffffffa03c86fd>] [<ffffffffa03c86fd>] bch_btree_iter_next_filter+0x2dd/0x350 [bcache]
Nov 9 21:56:31 server kernel: RSP: 0018:ffff8800d752bc40 EFLAGS: 00000282
Nov 9 21:56:31 server kernel: RAX: 0000000000000005 RBX: ffff880200000001 RCX: 0000000000000002
Nov 9 21:56:31 server kernel: RDX: 0000000000000001 RSI: ffff88020ea21020 RDI: 0000000000000005
Nov 9 21:56:31 server kernel: RBP: ffff8800d752bc70 R08: ffff8800d752bca0 R09: ffff8800d752bc90
Nov 9 21:56:31 server kernel: R10: ffffffffffee89f8 R11: 0000000000000000 R12: ffff8800d752be38
Nov 9 21:56:31 server kernel: R13: ffffffffa03d0dc5 R14: ffff8800dbb40000 R15: ffff88020ea114e0
Nov 9 21:56:31 server kernel: FS: 0000000000000000(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
Nov 9 21:56:31 server kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 9 21:56:31 server kernel: CR2: 00007fc9cba2a000 CR3: 00000000d60b9000 CR4: 00000000000007f0
Nov 9 21:56:31 server kernel: Stack:
Nov 9 21:56:31 server kernel: ffff880213cea4d0 000000000000229b ffff8800d752be38 ffff8800d752bde0
Nov 9 21:56:31 server kernel: ffff8800daecc8d0 ffff8800d752bfd8 ffff880213cea4d0 ffffffffa03c8e98
Nov 9 21:56:31 server kernel: 0000000000000004 0000000000000003 ffff88020ea11510 ffff88020ea20870
Nov 9 21:56:31 server kernel: Call Trace:
Nov 9 21:56:31 server kernel: [<ffffffffa03c8e98>] btree_gc_count_keys+0x48/0x60 [bcache]
Nov 9 21:56:31 server kernel: [<ffffffffa03cde4d>] btree_gc_recurse+0x19d/0x300 [bcache]
Nov 9 21:56:31 server kernel: [<ffffffffa03ce527>] bch_btree_gc+0x3d7/0x560 [bcache]
Nov 9 21:56:31 server kernel: [<ffffffffa03ce6e8>] bch_gc_thread+0x38/0x120 [bcache]
Nov 9 21:56:31 server kernel: [<ffffffff81078a4d>] kthread+0xbd/0xe0
Nov 9 21:56:31 server kernel: [<ffffffff815e14bc>] ret_from_fork+0x7c/0xb0
Nov 9 21:56:31 server kernel: Code: 06 49 8b 3e 25 ff ff 0f 00 81 e7 ff ff 0f 00 48 39 f8 75 3a 4c 8b 56 08 4d 2b 56 08 4d 85 d2 48 0f 4e f8 49 0f 4f f6 4d 0f 4e c7 <48> 0f 4e ca e9 34 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 45 31
Nov 9 21:56:55 server kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [bcache_gc:944]
Nov 9 21:56:55 server kernel: Modules linked in: binfmt_misc tcp_diag inet_diag tun nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc iscsi_ibft iscsi_boot_sysfs af_packet w83627ehf hwmon_vid ts2020 ds3000 bcache coretemp gpio_ich pl2303 iTCO_wdt usbserial kvm_intel dvb_usb_dw2102 kvm xfs iTCO_vendor_support dvb_usb libcrc32c serio_raw i2c_i801 pcspkr sky2 lpc_ich mfd_core shpchp acpi_cpufreq button ppdev hid_generic ehci_pci uhci_hcd ehci_hcd sr_mod cdrom usbhid usbcore usb_common parport_serial parport_pc parport edd fan ata_piix ata_generic pata_jmicron sg fuse thermal processor dm_mirror dm_region_hash dm_log dm_mod stv0299 budget_ci rc_core budget_core ttpci_eeprom saa7146 dvb_core bridge stp llc
Nov 9 21:56:55 server kernel: CPU: 0 PID: 944 Comm: bcache_gc Tainted: G W L 3.17.2-3.gbf63174-default #1
Nov 9 21:56:55 server kernel: Hardware name: . . /IP35-E(Intel P35+ICH9R), BIOS 6.00 PG 05/30/2008
Nov 9 21:56:55 server kernel: task: ffff8800daeca650 ti: ffff8800d7528000 task.ti: ffff8800d7528000
Nov 9 21:56:55 server kernel: RIP: 0010:[<ffffffffa03d0ee5>] [<ffffffffa03d0ee5>] bch_extent_bad+0xf5/0x1b0 [bcache]
Nov 9 21:56:55 server kernel: RSP: 0018:ffff8800d752bc20 EFLAGS: 00000202
Nov 9 21:56:55 server kernel: RAX: 0000000000000000 RBX: ffff8800dbb40000 RCX: 000000000000000a
Nov 9 21:56:55 server kernel: RDX: ffffc9001107b000 RSI: 0000000000000004 RDI: 00000001288dd804
Nov 9 21:56:55 server kernel: RBP: ffff88021313a8d0 R08: ffff8800dbb40000 R09: 0000000000000001
Nov 9 21:56:55 server kernel: R10: 0000000000000000 R11: ffff8800dafae000 R12: ffff8800daecc800
Nov 9 21:56:55 server kernel: R13: 0000000000000018 R14: ffffffffa03cd419 R15: ffff8800daecc800
Nov 9 21:56:55 server kernel: FS: 0000000000000000(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
Nov 9 21:56:55 server kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 9 21:56:55 server kernel: CR2: 00007fc9cba2a000 CR3: 00000001d0ee9000 CR4: 00000000000007f0
Nov 9 21:56:55 server kernel: Stack:
Nov 9 21:56:55 server kernel: ffff8800d752bc80 ffff8800d752bc70 ffffffffa03c8920 ffffffffa03c85db
Nov 9 21:56:55 server kernel: ffff88021313a8d0 0000000000002b7a ffff8800d752be38 ffff8800d752bde0
Nov 9 21:56:55 server kernel: ffff8800daecc8d0 ffff8800d752bfd8 ffff88021313a8d0 ffffffffa03c8e98
Nov 9 21:56:55 server kernel: Call Trace:
Nov 9 21:56:55 server kernel: [<ffffffffa03c85db>] bch_btree_iter_next_filter+0x1bb/0x350 [bcache]
Nov 9 21:56:55 server kernel: [<ffffffffa03c8e98>] btree_gc_count_keys+0x48/0x60 [bcache]
Nov 9 21:56:55 server kernel: [<ffffffffa03cde4d>] btree_gc_recurse+0x19d/0x300 [bcache]
Nov 9 21:56:55 server kernel: [<ffffffffa03ce527>] bch_btree_gc+0x3d7/0x560 [bcache]
Nov 9 21:56:55 server kernel: [<ffffffffa03ce6e8>] bch_gc_thread+0x38/0x120 [bcache]
Nov 9 21:56:55 server kernel: [<ffffffff81078a4d>] kthread+0xbd/0xe0
Nov 9 21:56:55 server kernel: [<ffffffff815e14bc>] ret_from_fork+0x7c/0xb0
Nov 9 21:56:55 server kernel: Code: 48 c1 ea 08 81 e6 ff 0f 00 00 4c 21 e2 4d 8b 9c f0 40 0c 00 00 48 d3 ea 48 8d 34 52 49 8b 93 d8 0a 00 00 48 8d 34 b2 0f b6 76 06 <29> fe 40 80 fe 80 0f 87 7f 00 00 00 40 0f b6 d6 83 fa 60 76 66

(I'll attach /var/log/messages)

After investigating, I found patches on the bcache mailing list that have not been merged upstream yet, probably because the maintainer is busy. I applied those 5 patches, and with the patched bcache module no problems have been observed since the 21st of November.

Patches 0-2 look "important"; patches 3 and 4 only fix the error path taken when the bcache init fails, which seems less urgent.
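For anyone who wants to check how often their own box hits this, here is a rough sketch (not part of the original report) that counts soft lockup reports per task and tallies the function named on each RIP line. The regular expressions are written against the two backtraces quoted above; the function name `summarize_lockups` is made up for illustration, and other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Watchdog line as it appears in the log above, e.g.
# "NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [bcache_gc:944]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# RIP line, capturing the symbol in front of "+0xOFFSET/0xSIZE".
RIP_RE = re.compile(r"RIP:.*\s(?P<sym>[A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_lockups(lines):
    """Return (Counter of stuck task names, Counter of RIP symbols).

    Assumption: every RIP line in the input is counted, including ones
    that belong to other kinds of kernel reports, so filter the input
    first if the log mixes several problems.
    """
    tasks, symbols = Counter(), Counter()
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            tasks[m.group("comm")] += 1
            continue
        m = RIP_RE.search(line)
        if m:
            symbols[m.group("sym")] += 1
    return tasks, symbols


# Two lines copied from the report above, used as sample input.
sample = [
    "Nov 9 21:56:31 server kernel: NMI watchdog: BUG: soft lockup - "
    "CPU#0 stuck for 23s! [bcache_gc:944]",
    "Nov 9 21:56:31 server kernel: RIP: 0010:[<ffffffffa03c86fd>] "
    "[<ffffffffa03c86fd>] bch_btree_iter_next_filter+0x2dd/0x350 [bcache]",
]
tasks, symbols = summarize_lockups(sample)
print(tasks.most_common())
print(symbols.most_common())
```

To run it over a real log, pass an open file object (for example `open("/var/log/messages", errors="replace")`) instead of the sample list.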
Created attachment 617628 [details]
5 emails with patches inside

These are the patches I applied. Note that the code is unchanged upstream from 3.16 to 3.18. Before patching, I tried to solve this by switching to Kernel:Stable (3.17 at that time) and Kernel:HEAD (3.18rc at that time), which did not change anything. Patching bcache.ko is what helped. The patches should apply to kernels from 3.16 to 3.18.
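As a small illustration of the version range mentioned above, the following sketch checks whether a kernel release string falls into the 3.16 to 3.18 series. The helper names are hypothetical, and treating the range as inclusive on both ends is an assumption; the check says nothing about whether a distribution kernel already carries the fixes.

```python
import re

# Range taken from the comment above; inclusive bounds are an assumption.
PATCH_RANGE = ((3, 16), (3, 18))


def kernel_series(release):
    """Extract (major, minor) from a string like '3.17.2-3.gbf63174-default'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def patches_expected_to_apply(release, patch_range=PATCH_RANGE):
    """True if the release belongs to a series inside patch_range."""
    low, high = patch_range
    return low <= kernel_series(release) <= high


# Release string copied from the backtrace in the first comment.
print(patches_expected_to_apply("3.17.2-3.gbf63174-default"))
```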
Yeah, upstream looks lazy; there are no significant commits for 3.19 or 3.20 yet. I backported the patches to the openSUSE-13.2, stable and master branches for now. Let's hope these backports become obsolete later in 3.20 development...
This is an autogenerated message for OBS integration: This bug (910440) was mentioned in https://build.opensuse.org/request/show/285765 Factory / kernel-source
*** Bug 909716 has been marked as a duplicate of this bug. ***
openSUSE-SU-2015:0713-1: An update that solves 13 vulnerabilities and has 52 fixes is now available.

Category: security (important)
Bug References: 867199,893428,895797,900811,901925,903589,903640,904899,905681,907039,907818,907988,908582,908588,908589,908592,908593,908594,908596,908598,908603,908604,908605,908606,908608,908610,908612,909077,909078,909477,909634,910150,910322,910440,911311,911325,911326,911356,911438,911578,911835,912061,912202,912429,912705,913059,913466,913695,914175,915425,915454,915456,915577,915858,916608,917830,917839,918954,918970,919463,920581,920604,921313,922542,922944
CVE References: CVE-2014-8134,CVE-2014-8160,CVE-2014-8559,CVE-2014-9419,CVE-2014-9420,CVE-2014-9428,CVE-2014-9529,CVE-2014-9584,CVE-2014-9585,CVE-2015-0777,CVE-2015-1421,CVE-2015-1593,CVE-2015-2150
Sources used:
openSUSE 13.2 (src): bbswitch-0.8-3.6.6, cloop-2.639-14.6.6, crash-7.0.8-6.6, hdjmod-1.28-18.7.6, ipset-6.23-6.6, kernel-docs-3.16.7-13.2, kernel-obs-build-3.16.7-13.7, kernel-obs-qa-3.16.7-13.1, kernel-obs-qa-xen-3.16.7-13.1, kernel-source-3.16.7-13.1, kernel-syms-3.16.7-13.1, pcfclock-0.44-260.6.2, vhba-kmp-20140629-2.6.2, virtualbox-4.3.20-10.2, xen-4.4.1_08-12.2, xtables-addons-2.6-6.2