Bug 1147412 - bcache-register fails during boot, leaving backing store busy
Status: RESOLVED WONTFIX
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: x86-64 Linux
Priority: P5 - None
Severity: Normal
Target Milestone: ---
Assigned To: Coly Li
Reported: 2019-08-23 22:43 UTC by Gary Buchanan
Modified: 2020-06-17 16:40 UTC (History)



Description Gary Buchanan 2019-08-23 22:43:04 UTC
I have a 16GB (Ha!) Optane and a 128GB SSD, both mounted to an expansion card (the same one, actually) in a PCIe slot.  After some fumbling around I got it to work, using the Optane to cache the SSD.  I'm playing around here, not trying to get anything useful done.

It worked for a few weeks, then apparently some update broke it (no, I do not have a clear idea of what or when).  I get something like the following:

 snd_hda_codec_realtek hdaudioC0D0:    hp_outs=1 (0x1b/0x0/0x0/0x0/0x0)
------------[ cut here ]------------
kernel BUG at drivers/md/bcache/bset.h:433!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 1 PID: 668 Comm: bcache-register Not tainted 5.2.8-1-default #1 openSUSE Tumbleweed (unreleased)
Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97, BIOS 1605 10/25/2012
RIP: 0010:bch_extent_sort_fixup+0x724/0x730 [bcache]
Code: ff ff 4c 89 c8 e9 3e ff ff ff 49 39 f1 0f 97 c1 e9 74 ff ff ff 49 39 f2 41 0f 97 c5 e9 12 ff ff f>
RSP: 0018:ffff9773c239fa38 EFLAGS: 00010286
RAX: fffffffffffe242d RBX: ffff8ab82f878020 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9773c239faf8
RBP: ffff9773c239fa90 R08: 000000000773e168 R09: ffff8ab8209d8860
R10: 0000000000000000 R11: 0000000000000001 R12: 000000000775bea0
R13: 000000000775bec0 R14: ffff9773c239fae0 R15: ffff8ab82f878000
FS:  00007f498bb36bc0(0000) GS:ffff8ab832a40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdc60a79fe8 CR3: 0000000330250000 CR4: 00000000000406e0
Call Trace:
 btree_mergesort+0x19b/0x5c0 [bcache]
 ? bch_cache_allocator_start+0x50/0x50 [bcache]
 __btree_sort+0x9e/0x1d0 [bcache]
 bch_btree_node_read_done+0x2cb/0x3c0 [bcache]
 bch_btree_node_read+0xdb/0x180 [bcache]
 ? bch_keybuf_init+0x60/0x60 [bcache]
 bch_btree_check_recurse+0x127/0x1f0 [bcache]
 ? bch_extent_to_text+0x10f/0x190 [bcache]
 bch_btree_check+0x18e/0x1b0 [bcache]
 ? wait_woken+0x70/0x70
 run_cache_set+0x487/0x730 [bcache]
 register_bcache+0xc0b/0xf90 [bcache]
 ? __seccomp_filter+0x7b/0x640
 ? kernfs_fop_write+0x10e/0x190
 kernfs_fop_write+0x10e/0x190
 vfs_write+0xb6/0x1a0
 ksys_write+0x4f/0xc0
 do_syscall_64+0x6e/0x1e0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f498b97e874
Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 48 8d 05 c9 49 0d 00 8b 00 85 c>
RSP: 002b:00007ffc7657ed38 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000f RCX: 00007f498b97e874
RDX: 000000000000000f RSI: 0000557c6b6c0260 RDI: 0000000000000003
RBP: 0000557c6b6c0260 R08: 00000000ffffffff R09: 000000000000000f
R10: 00007ffc76580ac7 R11: 0000000000000246 R12: 000000000000000f
R13: 00007ffc7657edc0 R14: 000000000000000f R15: 00007f498ba4e7c0
Modules linked in: snd_hda_codec_realtek(+) crc32_pclmul snd_hda_codec_generic ghash_clmulni_intel fjes>
---[ end trace 4587505d36f45756 ]---
RIP: 0010:bch_extent_sort_fixup+0x724/0x730 [bcache]
Code: ff ff 4c 89 c8 e9 3e ff ff ff 49 39 f1 0f 97 c1 e9 74 ff ff ff 49 39 f2 41 0f 97 c5 e9 12 ff ff f>
RSP: 0018:ffff9773c239fa38 EFLAGS: 00010286
RAX: fffffffffffe242d RBX: ffff8ab82f878020 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9773c239faf8
RBP: ffff9773c239fa90 R08: 000000000773e168 R09: ffff8ab8209d8860
R10: 0000000000000000 R11: 0000000000000001 R12: 000000000775bea0
R13: 000000000775bec0 R14: ffff9773c239fae0 R15: ffff8ab82f878000
FS:  00007f498bb36bc0(0000) GS:ffff8ab832a40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdc60a79fe8 CR3: 0000000330250000 CR4: 00000000000406e0


Afterwards the backing device is busy and I cannot do anything with it, presumably because bcache-register never finishes.
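
For reference, a minimal sketch of the commands I use to poke at the half-registered state; the device names (/dev/sdd1 as backing, /dev/nvme0n1p1 as cache) are from my setup and will differ elsewhere:

    lsblk -o NAME,SIZE,TYPE,MOUNTPOINT        # is /dev/bcache0 there at all?
    ls /sys/fs/bcache/                        # a cache set UUID directory only appears once registration finishes
    dmesg | grep -i bcache                    # registration messages and the BUG trace above
    sudo bcache-super-show /dev/sdd1          # backing device superblock (from bcache-tools)
    sudo bcache-super-show /dev/nvme0n1p1     # cache device superblock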
Comment 1 Coly Li 2019-08-27 17:21:25 UTC
Can you try a vanilla kernel at or after Linux v5.3-rc3?
There are quite a lot of fixes merged into Linux v5.3, so we can check whether this is caused by a known issue.

Thanks.
Comment 2 Coly Li 2019-09-03 08:49:36 UTC
(In reply to Coly Li from comment #1)
> Can you try the vanilla kernel after Linux v5.3-rc3 ?
> There are quite a lot fixes merged into Linux v5.3, we can have a try
> whether it is caused by known issue.

The fix is merged into the SLE15 and SLE15-SP1 kernels; it will go into Tumbleweed very soon.
Comment 3 Coly Li 2019-09-03 08:50:43 UTC
(In reply to Coly Li from comment #2)
> (In reply to Coly Li from comment #1)
> > Can you try the vanilla kernel after Linux v5.3-rc3 ?
> > There are quite a lot fixes merged into Linux v5.3, we can have a try
> > whether it is caused by known issue.
> 
> The fix is merged in SLE15 and SLE15-SP1 kernel, it will go into Tumbleweed
> very soon.

I typed this into the wrong bugzilla report; please ignore the above comment.
Comment 4 Coly Li 2019-09-03 09:06:04 UTC
(In reply to Coly Li from comment #1)
> Can you try the vanilla kernel after Linux v5.3-rc3 ?
> There are quite a lot fixes merged into Linux v5.3, we can have a try
> whether it is caused by known issue.

Strange, I remember seeing a comment saying that you were not able to find a 5.3-rc3 kernel to test, but I don't see that comment here...

The kernel BUG location is line 433:
431 static inline bool bch_cut_back(const struct bkey *where, struct bkey *k)
432 {
433         BUG_ON(bkey_cmp(where, &START_KEY(k)) < 0);
434         return __bch_cut_back(where, k);
435 }

This BUG_ON() is triggered by out-of-order bkey items. Recently we had a report of such a problem, but its fix, commit 31b90956b124 ("bcache: fix stack corruption by PRECEDING_KEY()"), was already merged into Linux v5.2.

This is the first report I have seen for bcache with Optane. Maybe this is another new bug that has been hiding for a long time.

Do you use the 16GB Optane as the cache device and the 128GB SSD as the backing device? I am thinking about how to reproduce a similar configuration and see what happens.
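
For reference, a minimal sketch of how I would build a similar configuration with bcache-tools; the device names are only assumptions taken from this report and must be adjusted (run as root):

    # assumed names: /dev/nvme0n1p1 = Optane (cache), /dev/sdd1 = SSD (backing)
    wipefs -a /dev/nvme0n1p1                        # start from clean superblocks
    wipefs -a /dev/sdd1
    make-bcache -C /dev/nvme0n1p1                   # format the cache device, prints the cache set UUID
    make-bcache -B /dev/sdd1                        # format the backing device
    echo /dev/nvme0n1p1 > /sys/fs/bcache/register   # only needed if udev did not auto-register
    echo /dev/sdd1 > /sys/fs/bcache/register
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach   # attach the backing device to the cache set
    mkfs.ext4 /dev/bcache0                          # then use /dev/bcache0 as the cached device

Here <cset-uuid> is a placeholder for the UUID printed by make-bcache -C (or shown by bcache-super-show /dev/nvme0n1p1).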
Comment 5 Gary Buchanan 2019-09-03 17:20:50 UTC
On Tue, 27 Aug 2019 17:21:25 +0000, bugzilla_noreply@novell.com wrote:

    http://bugzilla.suse.com/show_bug.cgi?id=1147412
    http://bugzilla.suse.com/show_bug.cgi?id=1147412#c1

    --- Comment #1 from Coly Li <colyli@suse.com> ---
    Can you try the vanilla kernel after Linux v5.3-rc3 ?
    There are quite a lot fixes merged into Linux v5.3, we can have a try
    whether
    it is caused by known issue.

    Thanks.

=================================================

At first I could not find an -rc3 RPM to install (I had never built a kernel before; still haven't, as it turned out).

Tried to build -rc3: failed.
Tried to build -rc6: failed.
Scratched my head and thought some, then found this: 5.3.0-rc6-1.g87ddd45-vanilla.
Installed it, booted, and logged this:


Aug 30 16:49:44 8-ball kernel: bcache: register_bdev() registered backing
device sdd1
Aug 30 16:49:44 8-ball systemd-udevd[562]: link_config: autonegotiation is
unset or enabled, the speed and duplex are not writable.
Aug 30 16:49:44 8-ball kernel: ------------[ cut here ]------------
Aug 30 16:49:44 8-ball kernel: kernel BUG at drivers/md/bcache/bset.h:433!
Aug 30 16:49:44 8-ball kernel: invalid opcode: 0000 [#1] SMP NOPTI
Aug 30 16:49:44 8-ball kernel: CPU: 6 PID: 641 Comm: bcache-register Not
tainted 5.3.0-rc6-1.g87ddd45-vanilla #1
Aug 30 16:49:44 8-ball kernel: Hardware name: To be filled by O.E.M. To be
filled by O.E.M./M5A97, BIOS 1605 10/25/2012
Aug 30 16:49:44 8-ball kernel: RIP: 0010:bch_extent_sort_fixup+0x724/0x730
[bcache]
Aug 30 16:49:44 8-ball kernel: Code: ff ff 4c 89 c8 e9 3e ff ff ff 49 39
f1 0f 97 c1 e9 74 ff ff ff 49 39 f2 41 0f 97 c5 e9 12 ff ff ff 48 8b 04 24
e9 88 fa ff ff <0f> 0b 0f 0b 48 29 d0 e9 >
Aug 30 16:49:44 8-ball kernel: RSP: 0018:ffffaaad80fc3a18 EFLAGS: 00010286
Aug 30 16:49:44 8-ball kernel: RAX: fffffffffffe242d RBX: ffff8ab7ed910020
RCX: 0000000000000000
Aug 30 16:49:44 8-ball kernel: RDX: 0000000000000000 RSI: 0000000000000000
RDI: ffffaaad80fc3ad8
Aug 30 16:49:44 8-ball kernel: RBP: ffffaaad80fc3a70 R08: 000000000773e168
R09: ffff8ab7ecd18860
Aug 30 16:49:44 8-ball kernel: R10: 0000000000000000 R11: 0000000000000001
R12: 000000000775bea0
Aug 30 16:49:44 8-ball kernel: R13: 000000000775bec0 R14: ffffaaad80fc3ac0
R15: ffff8ab7ed910000
Aug 30 16:49:44 8-ball kernel: FS:  00007fa0a2324bc0(0000)
GS:ffff8ab7f2b80000(0000) knlGS:0000000000000000
Aug 30 16:49:44 8-ball kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Aug 30 16:49:44 8-ball kernel: CR2: 00007ffe56b20098 CR3: 0000000326760000
CR4: 00000000000406e0
Aug 30 16:49:44 8-ball kernel: Call Trace:
Aug 30 16:49:44 8-ball kernel:  btree_mergesort+0x19b/0x5c0 [bcache]
Aug 30 16:49:44 8-ball kernel:  ? bch_cache_allocator_start+0x50/0x50
[bcache]
Aug 30 16:49:44 8-ball kernel:  __btree_sort+0x9e/0x1d0 [bcache]
Aug 30 16:49:44 8-ball kernel:  bch_btree_node_read_done+0x2cb/0x3c0
[bcache]
Aug 30 16:49:44 8-ball kernel:  bch_btree_node_read+0xdb/0x180 [bcache]
Aug 30 16:49:44 8-ball kernel:  ? bch_keybuf_init+0x60/0x60 [bcache]
Aug 30 16:49:44 8-ball kernel:  bch_btree_check_recurse+0x127/0x1f0
[bcache]
Aug 30 16:49:44 8-ball kernel:  ? bch_extent_to_text+0x10f/0x190 [bcache]
Aug 30 16:49:44 8-ball kernel:  bch_btree_check+0x18e/0x1b0 [bcache]
Aug 30 16:49:44 8-ball kernel:  ? wait_woken+0x70/0x70
Aug 30 16:49:44 8-ball kernel:  run_cache_set+0x487/0x780 [bcache]
Aug 30 16:49:44 8-ball kernel:  ? kernfs_activate+0x5f/0x80
Aug 30 16:49:44 8-ball kernel:  ? kernfs_add_one+0xe2/0x130
Aug 30 16:49:44 8-ball kernel:  register_bcache+0xc25/0xfb0 [bcache]
Aug 30 16:49:44 8-ball kernel:  ? __seccomp_filter+0x7b/0x640
Aug 30 16:49:44 8-ball kernel:  ? kernfs_fop_write+0x10e/0x190
Aug 30 16:49:44 8-ball kernel:  kernfs_fop_write+0x10e/0x190
Aug 30 16:49:44 8-ball kernel:  vfs_write+0xb6/0x1a0
Aug 30 16:49:44 8-ball kernel:  ksys_write+0x4f/0xc0
Aug 30 16:49:44 8-ball kernel:  do_syscall_64+0x6e/0x1e0
Aug 30 16:49:44 8-ball kernel:  entry_SYSCALL_64_after_hwframe+0x49/0xbe
Aug 30 16:49:44 8-ball kernel: RIP: 0033:0x7fa0a2166874
Aug 30 16:49:44 8-ball kernel: Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff
ff eb bb 0f 1f 80 00 00 00 00 48 8d 05 c9 49 0d 00 8b 00 85 c0 75 13 b8 01
00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 >
Aug 30 16:49:44 8-ball kernel: RSP: 002b:00007ffd671c8ee8 EFLAGS: 00000246
ORIG_RAX: 0000000000000001
Aug 30 16:49:44 8-ball kernel: RAX: ffffffffffffffda RBX: 000000000000000f
RCX: 00007fa0a2166874
Aug 30 16:49:44 8-ball kernel: RDX: 000000000000000f RSI: 0000561757534260
RDI: 0000000000000003
Aug 30 16:49:44 8-ball kernel: RBP: 0000561757534260 R08: 00000000ffffffff
R09: 000000000000000f
Aug 30 16:49:44 8-ball kernel: R10: 00007ffd671c9ac8 R11: 0000000000000246
R12: 000000000000000f
Aug 30 16:49:44 8-ball kernel: R13: 00007ffd671c8f70 R14: 000000000000000f
R15: 00007fa0a22367c0
Aug 30 16:49:44 8-ball kernel: Modules linked in: pcc_cpufreq(-)
glue_helper snd_hda_codec bcache eeepc_wmi asus_wmi uas snd_hda_core
sparse_keymap rfkill snd_hwdep video crc64 pcspkr wmi_b>
Aug 30 16:49:44 8-ball kernel: ---[ end trace ca4ea58bd8a7c544 ]---


Gary B.

============================================

Yes, I was using the Optane to cache the SSD, just as a way to get my feet wet,
not really trying to get anything accomplished other than that.
Comment 6 Coly Li 2019-09-04 00:00:44 UTC
(In reply to Gary Buchanan from comment #5)
[snipped]
> 
> Yes, I was using the Optane to cache the ssd, just aS a way to get my feet
> wet, 
> not trying to really get anything accomplished other than that.

I am not very familiar with the product name "Optane"; it seems there is an SSD named Optane and also an NVDIMM chip named Optane (maybe I am confusing them). I guess you use the Optane SSD, not the NVDIMM, am I right?
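
As a quick sketch of how to tell the two apart from the command line (nothing here is specific to this system):

    lsblk -d -o NAME,MODEL,SIZE,TRAN    # an Optane SSD shows up as an nvme block device (TRAN=nvme)
    ls /dev/pmem* 2>/dev/null           # Optane persistent memory (NVDIMM) shows up as /dev/pmemN
    ndctl list 2>/dev/null              # lists NVDIMM namespaces, if ndctl is installed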
Comment 7 Gary Buchanan 2019-09-07 01:12:21 UTC
The Optane I have mounts in an M.2 socket and appears to my system as /dev/nvme0n1p1 (for the partition).

If it helps any, this link shows the type of thing I'm using, though it is 32GB, not 16GB:

https://images.anandtech.com/doci/11210/imgp7358_678x452.jpg
Comment 8 Coly Li 2019-09-25 15:49:33 UTC
I found a bug in the bcache btree code which may leave a dirty btree node without journal protection across a power failure. The result might be an inconsistent, broken btree node, triggering a panic similar to the one posted.

Now I am working on a fix and will test it.

Coly Li

P.S. The fix looks like this:

commit d48baef4543246ef910b262959ae89c5a6d197f7
Author: Coly Li <colyli@suse.de>
Date:   Wed Sep 25 22:16:33 2019 +0800

    bcache: fix fifo index swapping condition in journal_pin_cmp()

    Fifo structure journal.pin is implemented by a cycle buffer, if the back
    index reaches highest location of the cycle buffer, it will be swapped
    to 0. Once the swapping happens, it means a smaller fifo index might be
    associated to a newer journal entry. So the btree node with oldest
    journal entry won't be selected in bch_btree_leaf_dirty() to reference
    the dirty B+tree leaf node. This problem may cause bcache journal won't
    protect unflushed oldest B+tree dirty leaf node in power failure, and
    this B+tree leaf node is possible to be inconsistent after reboot from
    power failure.

    This patch fixes the fifo index comparing logic in journal_pin_cmp(),
    to avoid potential corrupted B+tree leaf node when the back index of
    journal pin is swapped.

    Signed-off-by: Coly Li <colyli@suse.de>

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index fac1e32001b9..0e333d81e58b 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -528,6 +528,32 @@ static void btree_node_write_work(struct work_struct *w)
        mutex_unlock(&b->write_lock);
 }

+/* return true if journal pin 'l' is newer than 'r' */
+static bool journal_pin_cmp(struct cache_set *c,
+                                    atomic_t *l,
+                                    atomic_t *r)
+{
+        int l_idx, r_idx, f_idx, b_idx;
+        bool ret = false;
+
+        l_idx = fifo_idx(&(c)->journal.pin, (l));
+        r_idx = fifo_idx(&(c)->journal.pin, (r));
+        f_idx = (c)->journal.pin.front;
+        b_idx = (c)->journal.pin.back;
+
+        if (l_idx > r_idx)
+                ret = true;
+        /* in case fifo back pointer is swapped */
+        if (b_idx < f_idx) {
+                if (l_idx <= b_idx && r_idx >= f_idx)
+                        ret = true;
+                else if (l_idx >= f_idx && r_idx <= b_idx)
+                        ret = false;
+        }
+
+        return ret;
+}
+
 static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
 {
        struct bset *i = btree_bset_last(b);
@@ -543,6 +569,11 @@ static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)

        set_btree_node_dirty(b);

+       /*
+        * w->journal is always the oldest journal pin of all bkeys
+        * in the leaf node, to make sure the oldest jset seq won't
+        * be increased before this btree node is flushed.
+        */
        if (journal_ref) {
                if (w->journal &&
                    journal_pin_cmp(b->c, w->journal, journal_ref)) {
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index f2ea34d5f431..06b3eaab7d16 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -157,10 +157,6 @@ struct journal_device {
 };

 #define BTREE_FLUSH_NR 8
-
-#define journal_pin_cmp(c, l, r)                               \
-       (fifo_idx(&(c)->journal.pin, (l)) > fifo_idx(&(c)->journal.pin, (r)))
-
 #define JOURNAL_PIN    20000

 #define journal_full(j)                                                \
Comment 9 Coly Li 2019-09-25 15:50:42 UTC
BTW, a reboot without a power failure may also cause a similarly inconsistent B+tree node.
Comment 10 Coly Li 2020-02-16 15:11:47 UTC
(In reply to Coly Li from comment #8)
> I find a bug in bcache btree code which may cause a dirty btree node being
> lack of journal protection from a power failure. The result might be an
> inconsistent and broken btree node, and trigger a similar panic as the port
> listed.
> 
> Now I am working on a fix and will test it.
> 
> Coly Li
> 
> P.S the fix looks like this,
> 
> commit d48baef4543246ef910b262959ae89c5a6d197f7
> Author: Coly Li <colyli@suse.de>
> Date:   Wed Sep 25 22:16:33 2019 +0800
> 
>     bcache: fix fifo index swapping condition in journal_pin_cmp()
> 

This patch is invalid; the original code is correct. So far I haven't found any suspicious place in the code.

Now I am doing a bcache backport of a series of fixes; let's see whether these fixes are helpful.
Comment 11 Coly Li 2020-03-04 03:07:03 UTC
(In reply to Coly Li from comment #10)
> (In reply to Coly Li from comment #8)
> > I find a bug in bcache btree code which may cause a dirty btree node being
> > lack of journal protection from a power failure. The result might be an
> > inconsistent and broken btree node, and trigger a similar panic as the port
> > listed.
> > 
> > Now I am working on a fix and will test it.
> > 
> > Coly Li
> > 
> > P.S the fix looks like this,
> > 
> > commit d48baef4543246ef910b262959ae89c5a6d197f7
> > Author: Coly Li <colyli@suse.de>
> > Date:   Wed Sep 25 22:16:33 2019 +0800
> > 
> >     bcache: fix fifo index swapping condition in journal_pin_cmp()
> > 
> 
> This patch is invalid, the original code is correct. So far I don't find any
> suspicious place from the code.
> 
> Now I am doing bcache backport for a series fixes, let's see weather these
> fixes may be a bit helpful.

My guess is that the in-memory btree node was not flushed in time and the on-SSD btree node got corrupted. This is just a guess; let me explain why I think so.

Before v5.3, there was a problem where, when I/O is busy, a race can happen while flushing a dirty btree node onto the SSD. If that happens, the behavior is undefined.

This bug was fixed by the following two patches:
commit 91be66e1318f ("bcache: performance improvement for btree_flush_write()")
commit 2aa8c529387c ("bcache: avoid unnecessary btree nodes flushing in btree_flush_write()")

My suggestion is: back up the data, or make sure the data on the backing device is consistent. Then update to the latest Tumbleweed kernel or a Linux stable kernel and re-make the bcache devices (a sketch follows below).

This is only my guess; so far it is the only related clue that comes to mind.
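
A sketch of that suggestion, assuming the device names from this report (/dev/nvme0n1p1 as cache, /dev/sdd1 as backing); check what is dirty before wiping anything:

    bcache-super-show /dev/sdd1     # check whether the backing device is still marked dirty
    # 1. back up the filesystem on /dev/bcache0 (or on /dev/sdd1 once it is clean/detached)
    # 2. wipe the old bcache superblocks
    wipefs -a /dev/nvme0n1p1
    wipefs -a /dev/sdd1
    # 3. on the updated kernel, re-make and re-attach the devices with
    #    make-bcache -C / make-bcache -B as sketched in comment #4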

Coly Li
Comment 12 Coly Li 2020-03-04 04:06:44 UTC
For the backing disk busy issue, can you see a file 
    /sys/fs/bcache/pendings_cleanup

If the kernel is new enough to have this file, try
    echo 1 > /sys/fs/bcache/pendings_cleanup
then the pending backing device (which is waiting for its dirty cache device) will be stopped.
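
As a sketch of the whole sequence (the device name is an assumption from earlier in this report; note that a plain sudo with echo does not work for the sysfs redirection, hence tee):

    ls /sys/fs/bcache/pendings_cleanup                  # only present on new enough kernels
    echo 1 | sudo tee /sys/fs/bcache/pendings_cleanup   # stop all pending backing devices
    lsblk /dev/sdd1                                     # the backing partition should now be free to reuse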
Comment 13 Coly Li 2020-03-29 15:38:51 UTC
(In reply to Coly Li from comment #12)
> For the backing disk busy issue, can you see a file 
>     /sys/fs/bcache/pendings_cleanup
> 
> If the kernel is new enough to have this file, try
>     echo 1 > /sys/fs/bcache/pendings_cleanup
> then the pending backing device (which is waiting for its dirty cache
> device) will be stopped.

Hi Gary,

Recently I backported many bcache fixes, up to Linux v5.6, into the 15.1 and 15.2 kernels. Could you please update the kernel and check whether the problem still exists?

Since such a problem still happens during bcache start-up time, I guess the on-SSD btree node may be corrupted. In that case you have to rebuild the cache device.

I still suspect the btree node got corrupted by a kernel which didn't have commit 31b90956b124 ("bcache: fix stack corruption by PRECEDING_KEY()"). Even though the newly updated kernels have the fix, an already-corrupted btree node cannot be recovered.

Could you please update to the latest openSUSE 15.1 kernel, rebuild the cache device, and see whether the problem happens again?

Thanks.

Coly Li
Comment 14 Gary Buchanan 2020-04-16 21:37:43 UTC
(In reply to Coly Li from comment #12)
> For the backing disk busy issue, can you see a file 
>     /sys/fs/bcache/pendings_cleanup
> 
> If the kernel is new enough to have this file, try
>     echo 1 > /sys/fs/bcache/pendings_cleanup
> then the pending backing device (which is waiting for its dirty cache
> device) will be stopped.

I found this file while running a 5.5.7 kernel.  After figuring out that sudo was not going to work, I did the echo.  I had expected some reaction; it did not even return to the prompt.  I tried on other occasions and got the same thing.  At some later time I thought to continue with re-making the bcache setup.  For not knowing what I was doing, it all worked fine.  I was able to destroy and re-create the cached/caching setup.  As far as I can tell, it is working.  I am currently on a 5.6.2 kernel (Tumbleweed).

I do not actually have a Leap 15.x system; I have been using Tumbleweed almost exclusively and have not updated past 42.3(?) on my leave-it-alone-and-don't-mess-with-it system.

Gary B.
Comment 15 Coly Li 2020-04-22 07:07:28 UTC
(In reply to Gary Buchanan from comment #14)
> (In reply to Coly Li from comment #12)
> > For the backing disk busy issue, can you see a file 
> >     /sys/fs/bcache/pendings_cleanup
> > 
> > If the kernel is new enough to have this file, try
> >     echo 1 > /sys/fs/bcache/pendings_cleanup
> > then the pending backing device (which is waiting for its dirty cache
> > device) will be stopped.
> 
> I found this file while running a 5.5.7 kernel.  After figuring out that
> sudo was not going to work, I did the echo. I had expected some reaction,
> did not even return to the prompt.  I tried on other occasions, and got the
> same thinhg. At some later time I thought to continue with the re-making of
> the bcache set up.  For not knowing what I was doing, it all worked fine.  I
> was able to destroy and re-create the cached/caching set up.  As far as I
> can tell, it is working.  I am currently on a 5.6.2 kernel (Tumbleweed).
> 
> I do not actually have a Leap 15.x system, I have been using Tumbleweed
> almost exclusively and have not updated past 42.3(?) on my
> leave-it-alone-and-don't-mess-with-it system.

Thanks for the update. Recently I also received a bug report about a potential issue in the bcache internal btree. The patch will go into Linux v5.7 and I will backport it into the Tumbleweed kernel. Once I am sure the patch is merged and the kernel is released, I will let you know for further verification.

Coly Li
Comment 16 Coly Li 2020-06-17 16:40:08 UTC
(In reply to Coly Li from comment #15)
> (In reply to Gary Buchanan from comment #14)
> > (In reply to Coly Li from comment #12)
> > > For the backing disk busy issue, can you see a file 
> > >     /sys/fs/bcache/pendings_cleanup
> > > 
> > > If the kernel is new enough to have this file, try
> > >     echo 1 > /sys/fs/bcache/pendings_cleanup
> > > then the pending backing device (which is waiting for its dirty cache
> > > device) will be stopped.
> > 
> > I found this file while running a 5.5.7 kernel.  After figuring out that
> > sudo was not going to work, I did the echo. I had expected some reaction,
> > did not even return to the prompt.  I tried on other occasions, and got the
> > same thinhg. At some later time I thought to continue with the re-making of
> > the bcache set up.  For not knowing what I was doing, it all worked fine.  I
> > was able to destroy and re-create the cached/caching set up.  As far as I
> > can tell, it is working.  I am currently on a 5.6.2 kernel (Tumbleweed).
> > 
> > I do not actually have a Leap 15.x system, I have been using Tumbleweed
> > almost exclusively and have not updated past 42.3(?) on my
> > leave-it-alone-and-don't-mess-with-it system.
> 
> Thanks for the update. Recently I also have a bug report for a potential
> issue in bache internal btree. The patch will go into Linux v5.7 and I will
> back port it into Tumbleweed kernel. Once I am sure the patch is merged and
> the kernel is released, I will let you know for further verification.

This fix is an unrelated CVE patch, so I won't mention it in this bug report.

OK, now I close this bug as WONTFIX, since the pendings_cleanup approach helps.

But if an identical issue happens again in the future, please reopen this bug report.

Thanks.

Coly Li