Bug 1079375 - Btrfs filesystem corrupted beyond repair after aborted balancing
Status: NEW
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Basesystem
Hardware: Other Other
Priority: P5 - None
Severity: Normal
Assigned To: David Sterba
Reported: 2018-02-05 14:53 UTC by Oliver Schmidt
Modified: 2018-02-06 09:13 UTC (History)

kernel stack trace after mounting the Btrfs filesystem (1.40 MB, image/jpeg)
2018-02-05 14:57 UTC, Oliver Schmidt

Description Oliver Schmidt 2018-02-05 14:53:35 UTC
My Btrfs root filesystem got messed up so badly that the kernel was constantly printing Btrfs-related stack traces, and `btrfs check --repair` crashed when trying to repair it:

The scheduled (systemd timer) Btrfs balance caused my system to freeze (GUI and also tty). As it didn't return to a usable state after 30 minutes, I forcibly powered off the laptop. At the next boot, the filesystem mounts timed out and the system only booted into rescue mode with a read-only filesystem.
There I ran `btrfs check --force`, which reported several errors (many of them related to qgroups). As repairing a Btrfs filesystem is only possible while it is unmounted, I booted SystemRescueCd and ran `btrfs check --repair` from there. The repair tool itself crashed before completing the repair. After that, mounting the filesystem resulted in the kernel constantly printing Btrfs-related stack traces (see attachment).

Sadly, I can't provide many log files or other details, as my priority was to get my machine production-ready again as fast as possible (it is now running Tumbleweed with XFS only).
I still decided to file this bug report to point out that Btrfs as currently used in openSUSE may not be *that* production-ready. I especially suspect that qgroups played a role in this incident, as they are probably the most unstable Btrfs feature currently in use (according to the Btrfs wiki, they are not yet considered fully stable and are known to slow down balancing when there are many subvolumes, i.e. Snapper snapshots).

I can try to provide more details on request, but the broken system has now been wiped and replaced.

system details:
openSUSE Tumbleweed, last updated on 02.01.2018, on a ThinkPad X230 with SSD
LUKS-encrypted Btrfs-based filesystem layout (without separate home partition)
Comment 1 Oliver Schmidt 2018-02-05 14:57:11 UTC
Created attachment 758841
kernel stack trace after mounting the Btrfs filesystem
Comment 2 Oliver Schmidt 2018-02-06 08:13:10 UTC
Richard Brown pointed out [1] that I may have made things worse by running `btrfs check --repair` directly instead of following the SDB guide first.
That might be true, but the real problem is that the filesystem got into this messed-up state in the first place, likely because of a scheduled balance operation.
Even if the right tools and knowledge allow the filesystem to be restored, an end user should not have to expect that a core system feature (Snapper + Btrfs balance timer) first freezes their system and then corrupts it.

[1] https://twitter.com/sysrich/status/960726166171717633