Bug 1099312 - system turns unresponsive with very high load and many blocked tasks, e.g. in btrfs_buffer_uptodate
Status: CONFIRMED
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: Other Other
Priority: P5 - None
Severity: Critical
Assigned To: Jeff Mahoney
QA Contact: E-mail List
Depends on:
Blocks:
Reported: 2018-06-27 12:53 UTC by Oliver Kurz
Modified: 2019-03-25 17:35 UTC

Attachments
screenshot showing stack trace of btrfs job stuck in btrfs_buffer_uptodate and others (1.21 MB, image/jpeg)
2018-06-27 12:53 UTC, Oliver Kurz

Description Oliver Kurz 2018-06-27 12:53:48 UTC
Created attachment 775414
screenshot showing stack trace of btrfs job stuck in btrfs_buffer_uptodate and others

## Observations

I installed openSUSE Tumbleweed on an encrypted root partition with the default setup of subvolumes, snapshots, qgroups, etc. After two weeks in STR (suspend-to-RAM) I woke the system up, and btrfs balance and related jobs were triggered. This made the system nearly unresponsive: only occasionally did the clock in the Plasma session update, along with an "htop" instance I had started two weeks earlier. I managed to switch to a tty, execute some magic-sysrq steps, and capture the attached screenshot.

## Further details

This is openSUSE Tumbleweed running Linux 4.16.
Comment 1 Jeff Mahoney 2018-07-11 19:45:48 UTC
The stack traces look less "pristine" because newer kernels no longer use the DWARF unwinder.

The screenshot shows an operation waiting on a read lock for an extent buffer in the btree.  Some other operation must be holding that lock.
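
To narrow down which task might be holding the lock, it can help to list every task that is stuck in uninterruptible sleep and see where each one is waiting. The following is only a minimal sketch, assuming a Linux /proc filesystem: it reads the state field from /proc/<pid>/stat and the wait channel from /proc/<pid>/wchan. The helper names `parse_stat` and `blocked_tasks` are made up for this illustration and are not part of any btrfs tooling. It gives a rough userspace approximation of the blocked-task dump that magic-sysrq 'w' writes to the kernel log, with far less detail.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left].strip())
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm, wchan) for every process in state D.

    Only top-level /proc/<pid> entries are scanned; individual threads
    under /proc/<pid>/task are not visited in this sketch.
    """
    result = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited while scanning, or unparsable entry
        if state != "D":
            continue
        try:
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip() or "?"
        except OSError:
            wchan = "?"  # wait channel not readable for this process
        result.append((pid, comm, wchan))
    return sorted(result)
```

Calling `blocked_tasks()` on the affected machine returns a sorted list of tuples, one per blocked process. The wait channel is a single symbol at most; the full kernel stack in /proc/<pid>/stack is more informative but usually needs root to read.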

Since this is Tumbleweed, it may be that we don't have the qgroup optimization patch reverted there, which can lead to deadlocks.

We'll need btrfs-qgroup-move-half-of-the-qgroup-accounting-time-out-of-commit-trans.patch on Tumbleweed as well.  It needs updating.  I have a version in my own tree already.
Comment 2 Jeff Mahoney 2019-03-25 17:35:33 UTC
I've just backported:

commit 38e3eebff643db725633657d1d87a3be019d1018
Author: Josef Bacik <josef@toxicpanda.com>
Date:   Wed Jan 16 11:00:57 2019 -0500

    btrfs: honor path->skip_locking in backref code

to the stable branch.