Bugzilla – Bug 1079375
Btrfs filesystem corrupted beyond repair after aborted balancing
Last modified: 2018-02-06 09:13:11 UTC
I managed to get my btrfs rootfs messed up so badly that the kernel was constantly printing btrfs-related stack traces and `btrfs check --repair` crashed when trying to repair it:
The scheduled (systemd-timer) btrfs balance caused my system to freeze (both the GUI and the ttys). As it hadn't returned to a usable state after 30 minutes, I forcibly powered off the laptop. At the next reboot, the filesystem mounts timed out and the system only booted into rescue mode with a read-only filesystem.
There I ran a `btrfs check --force` filesystem check, which reported several errors (many of them qgroup-related). Since repairing a Btrfs filesystem is only possible while it is unmounted, I fired up a SystemRescueCd and ran `btrfs check --repair` from there. The repair tool itself crashed before completing the repair. After that, mounting the filesystem resulted in the kernel constantly printing btrfs-related stack traces (see attachment).
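For reference, a more conservative recovery order (read-only inspection and data rescue first, destructive repair strictly last) might look like the sketch below. The device path `/dev/mapper/cr_root` is a placeholder for a LUKS-mapped root; the script only *prints* the commands it would run, so the destructive `--repair` step can never fire accidentally.

```shell
#!/bin/sh
# Dry-run sketch of a conservative Btrfs recovery sequence.
# DEV is a placeholder for the (LUKS-mapped) root device -- adjust as needed.
DEV=/dev/mapper/cr_root

run() {
    # Print instead of executing, so nothing destructive happens here.
    echo "would run: $*"
}

# 1. Read-only check from a rescue system: gather information first.
run btrfs check --readonly "$DEV"

# 2. Attempt a read-only mount (with backup roots) and copy data off the disk.
run mount -o ro,rescue=usebackuproot "$DEV" /mnt

# 3. Non-destructive repair attempts, e.g. clearing a corrupt log tree.
run btrfs rescue zero-log "$DEV"

# 4. Only as a last resort, with backups already made: the destructive repair.
run btrfs check --repair "$DEV"
```

Running `--repair` only after the read-only steps above would at least have preserved a chance to copy data off the filesystem before it got worse.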
Sadly I can't provide many log files or other details, as I just tried to make my machine production-ready again as fast as possible (it's now running Tumbleweed with XFS only).
I still decided to file this bug report to point out that the current state of Btrfs usage in openSUSE may not be *that* production-ready yet. I especially suspect qgroups of playing a role in this incident, as they're probably the most unstable Btrfs feature currently in use (according to the Btrfs wiki they're not considered fully stable yet and are known to slow down balance operations when there are many subvolumes, i.e. Snapper snapshots).
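As an aside, for anyone hitting similar balance stalls: qgroup state can be inspected, and quota tracking switched off entirely, before the next scheduled balance runs. The mountpoint `/` below is an assumption based on the default openSUSE layout, and the script again only echoes the commands rather than executing them.

```shell
#!/bin/sh
# Placeholder mountpoint; openSUSE mounts the root subvolume at /.
MNT=/

# List existing qgroups and their usage.
echo "would run: btrfs qgroup show $MNT"

# Turn quota/qgroup tracking off entirely (removes the qgroup overhead
# from subsequent balance operations).
echo "would run: btrfs quota disable $MNT"
```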
I can try to provide more details on request, but the broken system has since been wiped and replaced.
openSUSE Tumbleweed, last updated on 02.01.2018, on a Thinkpad X230 with an SSD
LUKS-encrypted, Btrfs-based filesystem layout (without a separate home partition)
Created attachment 758841 [details]
kernel stack trace after mounting the Btrfs filesystem
Richard Brown pointed out that I may have made things worse by running `btrfs check --repair` directly instead of following the SDB guide first.
That might be true, but the real problem is that the filesystem got into this messed-up state in the first place, (likely) because of a scheduled balance operation.
Even if the right tools and knowledge would have allowed restoring the filesystem, it's not reasonable to expect an end user to deal with a core system feature (snapper + the Btrfs balance timer) first freezing their system and then corrupting it.