Bug 1099769 - btrfs balance generates 100% CPU usage for long periods (15min's +)
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: x86-64 SUSE Other
Priority: P2 - High
Severity: Major
Assigned To: E-mail List
Reported: 2018-07-01 05:03 UTC by Chris .
Modified: 2019-06-17 06:26 UTC

Found By: Community User
jslaby: needinfo? (fa2)


Description Chris . 2018-07-01 05:03:27 UTC
System becomes unresponsive for long periods (15 minutes or more).

'top' command indicates btrfs process using 100% CPU.
Previously reported bug is still not fixed IMHO.
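
For anyone wanting to record the same observation in a form that can be attached to a report, here is a minimal sketch (not an official diagnostic procedure) that samples CPU time for processes and kernel threads whose name starts with "btrfs" by reading /proc/<pid>/stat. The function names, the "btrfs" prefix filter and the sampling interval are assumptions made for illustration; the field positions follow proc(5).

```python
import os
import time


def parse_stat_line(line):
    """Return (pid, comm, cpu_ticks) from one /proc/<pid>/stat line.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' instead of splitting on whitespace.
    """
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    fields = tail.split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are
    # fields 14 and 15, i.e. indexes 11 and 12 of the remainder.
    utime, stime = int(fields[11]), int(fields[12])
    return int(pid_text), comm, utime + stime


def snapshot(prefix="btrfs"):
    """Map pid -> (comm, cpu_ticks) for every task whose name starts with prefix."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as handle:
                pid, comm, ticks = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited meanwhile or the line was unreadable
        if comm.startswith(prefix):
            result[pid] = (comm, ticks)
    return result


def cpu_percent(before, after, interval, ticks_per_second):
    """CPU usage in percent of one core for tasks present in both snapshots."""
    usage = {}
    for pid, (comm, ticks) in after.items():
        if pid in before:
            delta = ticks - before[pid][1]
            usage[pid] = (comm, 100.0 * delta / (ticks_per_second * interval))
    return usage


def report(interval=5.0):
    """Take two snapshots 'interval' seconds apart and print usage per task."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second on this system
    first = snapshot()
    time.sleep(interval)
    for pid, (comm, pct) in sorted(cpu_percent(first, snapshot(), interval, hz).items()):
        print(f"{pid:>7} {comm:<20} {pct:6.1f}%")
```

Calling report() while the balance is running prints one line per btrfs task, which can be pasted into a comment together with the kernel version.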

* Please advise which logs would be helpful & how to generate them on openSUSE Tumbleweed.

Recently installed & updated openSUSE Tumbleweed.
KDE desktop.
Kernel 4.17.2-1-default.

Regards
Comment 1 Chris . 2018-07-01 05:09:49 UTC
* Affected systems are cold booted daily.
Comment 2 Adam Szyszko 2018-07-02 05:33:28 UTC
I can confirm this bug. It happened to me this morning also.
btrfs and btrfs-transacti eat 100% CPU for a few minutes. My laptop is unusable during this process.
Comment 3 Nicolas Patricio Saenz Julienne 2018-11-06 10:37:03 UTC
Same here, running KOTD (4.19.0-1) and an update tumbleweed image. btrfs-balance kicked in consuming 100% of one of my CPUs making the system unusable.
Comment 4 Nicolas Patricio Saenz Julienne 2018-11-06 10:38:29 UTC
s/update/up to date/
Comment 5 Bruno Friedmann 2018-11-19 23:18:10 UTC
With tumbleweed snapshot 2018116 kernel 4.19.1
Hardware is Laptop Dell Precision 7510 with Xeon CPU E3-1535M v5 uefi boot
64GB Ram 2400 DDR4
Primary storage is Toshiba NVMe 1024GB (gpt)
p1 : 164MB vfat16 (efi)
p2 : 943GB luks encrypted / btrfs
p3 : 2GB luks encrypted swap
GRUB is used to decrypt the root btrfs partition

This morning I discovered a btrfs-transacti process running at 100% CPU (started by the btrfs balance timer during the night).

After a reboot, the system is now unusable: it fails with a timeout while mounting the btrfs root filesystem.

Booting a rescue system (same TW snapshot) and trying to mount the fs manually results in the same btrfs-transacti process eating 100% of a CPU.
Tonight the mount still had not succeeded after waiting more than 45 minutes.

When the btrfs partition was last mounted (even with usebackuproot), there was a btrfs balance running on it.

Trying to run btrfs balance cancel /mnt led to several backtraces in the kernel log (dmesg captures will be attached if one day I can get them out of the broken fs).

It seems that there is not enough space on the partition, even though df was reporting 163GB free out of the 943GB total.
These figures are from memory, as it is not possible to mount the defective volume or read information from it.

BTW: following what's written in the wiki https://en.opensuse.org/SDB:BTRFS,
btrfs scrub start /dev/mapper/cr_nvme0n1p2 doesn't work; it fails with an error saying the filesystem is not mounted. What's the problem? It should normally be possible to use a device (here the device unlocked with cryptsetup luksOpen /dev/nvme0n1p2 cr_nvme0n1p2).

This is the first time this has happened in the year since the system was installed. The only things that have changed are a lot of new data (up to 85% used) and a cleanup of numerous files and directories yesterday afternoon.

What is the best possible recovery scenario, and what is the best forensic approach to debug this nasty behaviour? Any tricks are welcome.

In the meantime I will at least try again to mount it ro and use a btrfs dump.
Comment 6 Jiri Slaby 2019-06-14 09:41:16 UTC
Was this fixed in the meantime, by any chance?

Anyway, Nikolay might have a clue.
Comment 7 Nikolay Borisov 2019-06-17 06:10:55 UTC
This should have been improved by the latest work that Qu did on qgroups. His patches seem to have been backported to SLE12-SP3 and newer and SLE15 and newer. Additionally, Mark Fasheh is working on further improvements to the generic backref walking logic.
Comment 8 Jiri Slaby 2019-06-17 06:26:26 UTC
Closing then -- if you still see some problems with up-to-date openSUSE, please reopen.