Bugzilla – Bug 1143865
btrfs mount takes more than 5 minutes
Last modified: 2019-08-27 11:29:12 UTC
Created attachment 812481 [details]
open_ctree_failed error on boot

Dear all,

after a recent update in openSUSE LEAP 15.1 a server failed to boot with "open_ctree failed" (see attachment). I was able to resolve this issue by booting into the system without the corresponding device and running a 'btrfs device check' on that device. After that I was able to mount the system again, but it's now horribly slow.

I have btrfs on an ARECA hardware RAID for a 70 TiB filesystem. The system is about 70% full. btrfs is working ON the hardware RAID, no btrfs RAID functions are used in this system.

The boot procedure of the filesystem failed without any useful information (after resolving the "open_ctree failed" issue). I was able to boot into the system by disabling the 70 TiB filesystem in /etc/fstab. After booting I could manually mount the filesystem, but mounting and unmounting took horribly long. Long as in more than 5 minutes:

eso-storage:/ # time mount /dev/sdb1 /RAID

real    5m37.123s
user    0m0.001s
sys     0m6.102s

This explains why the boot procedure fails: it takes more than 4:30 minutes, which is displayed as the timeout. Still, a mount of a rather large filesystem that takes more than 5 minutes seems to be a bit fishy.

Using openSUSE LEAP 15.1 with all updates installed, on kernel 4.12.14-lp151.28.10-default.

Here are some details about the filesystems:

/dev/sda is the system SSD
/dev/sdb is the ARECA RAID

The filesystem has been created with `mkfs.btrfs` using the default settings.

# lsblk
/dev/sda1: UUID="36AC-53E1" TYPE="vfat" PARTUUID="b45ecebb-24f0-481a-9223-a2cde070ddab"
/dev/sda2: UUID="ba1a7010-f9cd-42b3-801d-9e64f0f85352" UUID_SUB="9b7c4815-9db1-45c5-abc3-f77287901e1f" TYPE="btrfs" PTTYPE="dos" PARTUUID="fd92a855-ead4-491c-9f3b-e646a9020e2a"
/dev/sdb1: UUID="ef07c50a-4759-484c-930d-e8bac9f673e9" UUID_SUB="0740289c-156b-4317-b807-d2de1906318c" TYPE="btrfs" PARTUUID="fbe19045-6d19-4201-81b7-c908abc3cef2"

# btrfs filesystem show
Label: none  uuid: ba1a7010-f9cd-42b3-801d-9e64f0f85352
        Total devices 1 FS bytes used 3.57GiB
        devid    1 size 50.00GiB used 4.53GiB path /dev/sda2

Label: none  uuid: ef07c50a-4759-484c-930d-e8bac9f673e9
        Total devices 1 FS bytes used 76.83TiB
        devid    1 size 98.22TiB used 76.94TiB path /dev/sdb1

Greetings,
Felix
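As a stopgap for the boot failure itself (not for the underlying slowness), the systemd timeout for this one mount can be raised in /etc/fstab. A minimal sketch, assuming the mount simply needs more than the default ~4.5 minutes and that the concrete timeout value is a placeholder to be tuned:

# /etc/fstab (illustrative entry, not the reporter's actual line)
# x-systemd.mount-timeout= and x-systemd.device-timeout= are standard systemd fstab options;
# 'nofail' keeps a still-failing mount from dropping the boot into emergency mode.
UUID=ef07c50a-4759-484c-930d-e8bac9f673e9  /RAID  btrfs  defaults,nofail,x-systemd.mount-timeout=15min,x-systemd.device-timeout=15min  0 0

After editing fstab, `systemctl daemon-reload` makes systemd pick up the new options without a reboot.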
So there are a few things going on here.

First, I think we need to report the error code that open_ctree failed with. I expect that systemd is killing the mount process and that's tripping a mutex_lock_interruptible or something, so the mount fails entirely.

Second, what you're hitting is a known scalability issue in btrfs. The block groups are read in during mount. Each block group is 1 GB in size, so you'll have about 76k block groups. The block group items are located in the extent tree along with all of the extent records, which means they end up distributed across the disk and are read in with what is effectively a random I/O pattern.

Still, it shouldn't take 5 minutes to do it. Is there any output in the log for that mount?
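For what it's worth, the block group count can be checked directly against the unmounted device with btrfs-progs; a rough sketch, assuming a btrfs-progs version whose dump-tree accepts tree names for -t (the numeric tree id 2 works otherwise):

# Count BLOCK_GROUP_ITEM records in the extent tree (device must be unmounted).
# With roughly 77 TiB allocated in ~1 GiB block groups this should land around 76k.
btrfs inspect-internal dump-tree -t extent /dev/sdb1 | grep -c 'BLOCK_GROUP_ITEM'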
Created attachment 812707 [details]
dmesg and journal of unsuccessful boot

Tar archive containing the dmesg and journal output of the unsuccessful boot, when /RAID is enabled in /etc/fstab.
Hi Jeff,

thanks for the reply, I'll try to elaborate as precisely as possible.

I have to add that the RAID is heavily used. We have several million files on this storage, which is filled up to 70%. The setup is rather simple: standard openSUSE LEAP 15.1 with NFS and InfiniBand being the only additional services running on the server. I started to notice the unresponsive behavior after transferring the old RAID to this one, which is consistent with a scalability issue.

Yesterday afternoon I started a `btrfs balance start /RAID` on the corresponding volume, which rendered the whole filesystem completely unresponsive. A subsequent `btrfs balance status /RAID` was unresponsive for more than an hour without any output (but it was not killed by the hung task detector either). Because the whole RAID had become totally unusable with no visible progress in the last 12h (a `ls -l /RAID` also hangs for more than an hour), I tried to reboot the machine. After 5 minutes without a reboot and no visible activity on any disk, I decided on a forcible reboot (CTRL+ALT+DEL more than 7 times); after another 5 minutes of complete unresponsiveness, no progress and no disk activity, I turned the machine off.

After a successful boot (without mounting /RAID) I re-enabled /RAID in /etc/fstab and rebooted:

## /etc/fstab
#UUID=ef07c50a-4759-484c-930d-e8bac9f673e9 /RAID btrfs defaults 0 0

A video of the whole boot process is on my webserver:
ftp://feldspaten.org/dump/20190803_btrfs_balance_issue/btrfs_openctree_failed.mp4

Here, of course, the open_ctree error appeared. I dumped dmesg and the journal and am attaching both files here.

After that, I disabled /RAID in /etc/fstab again and successfully rebooted the machine. Then I ran a `btrfs device scan` on the unmounted volume:

eso-storage:~ # time btrfs device scan /dev/sdb1
Scanning for Btrfs filesystems in '/dev/sdb1'

real    0m0.004s
user    0m0.001s
sys     0m0.003s

Then I mounted the volume (took 5 minutes):

eso-storage:~ # time mount /dev/sdb1 /RAID

real    5m36.333s
user    0m0.001s
sys     0m6.165s

Then, to test the filesystem and actually get an overview of what the users do on this system, I'm counting the files:

eso-storage:~ # time find /RAID -type f | wc -l

This is gonna take some time. I'll post the output when something changes and restart a `btrfs balance` over the remaining weekend, in the hope the filesystem repairs itself ...
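If the balance/ls hangs reoccur, a kernel-side view of where the tasks are stuck is usually more informative than the hung task detector alone. A minimal sketch using standard procfs/sysrq facilities (nothing btrfs-specific), assuming sysrq is enabled via kernel.sysrq:

# Dump all blocked (uninterruptible) tasks to dmesg; shows where balance/ls sit in the kernel
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# Kernel stack of one specific hung process, e.g. the running 'btrfs balance'
# (pidof -s returns a single PID; pick the right one if several btrfs processes run)
cat /proc/$(pidof -s btrfs)/stack

Attaching that output to the bug alongside the journal would show whether the processes are blocked on the extent tree locks Jeff mentioned or somewhere else.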
Addendum to the mount. The only messages I get in dmesg after/while mounting are the following:

[  248.218567] BTRFS info (device sdb1): disk space caching is enabled
[  248.218571] BTRFS info (device sdb1): has skinny extents
[  607.781209] BTRFS info (device sda2): qgroup scan completed (inconsistency flag cleared)
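The first message ("disk space caching is enabled") refers to the v1 space cache. One experiment that may be worth a test mount is the free space tree (space_cache=v2, available since kernel 4.5); it does not remove the block group reads described above, but it replaces the per-block-group v1 cache with a dedicated tree. A minimal sketch, with the caveat that the one-time conversion on the first such mount can itself take a while on a filesystem this size:

# First mount with this option converts the cache; later mounts keep using the free space tree
mount -o space_cache=v2 /dev/sdb1 /RAID

# Check which cache implementation the kernel reports
dmesg | grep -i 'free space tree'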
Hey there,

the file count operation didn't finish over the weekend. For comparison, I ran it on the old RAID (xfs):

time find /ESO-RAID -type f | wc -l | tee /root/ESO-count.txt
18652209

real    0m48.494s
user    0m18.983s
sys     0m34.245s

The system itself is pretty busy, load average: 4.00, 4.00, 4.0 (4-core Intel Xeon E3-1230). perf top shows the following:

perf top --sort comm,dso
  74.72%  swapper      [kernel]
  25.28%  irqbalance   [kernel]
   0.00%  perf         [kernel]

As we rely on the new NAS, we will probably need to switch back to xfs. I could negotiate with my supervisor that we can use the system this week until Friday to run further tests, if required. So feel free to go all crazy, if you find any use for it :-)

Greetings,
Felix
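The perf output (74% in swapper, i.e. the idle task) suggests the load average comes from tasks blocked on I/O rather than CPU work. A small sketch of how to make that visible, assuming sysstat is installed; the device path and interval are just illustrative:

# Per-device utilization and queue stats every 2 seconds; if sdb sits near 100% %util
# with small request sizes, the random extent-tree reads are the likely bottleneck
iostat -x -d 2 /dev/sdb

# Which processes are in uninterruptible sleep (state 'D') and therefore count toward the load
ps -eo state,pid,comm | awk '$1 == "D"'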
I resolved the issue by switching to a different filesystem. Thanks to Jeff for the reply, and I hope the report and logs can help make btrfs work better under similar loads. I'm closing this now because the affected system now runs in a different configuration.