Bug 1143865 - btrfs mount takes more than 5 minutes
Status: RESOLVED NORESPONSE
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: Leap 15.1
Hardware: Other
OS: Other
Priority: P5 - None
Severity: Normal
Assigned To: E-mail List
Reported: 2019-08-01 15:01 UTC by Felix Niederwanger
Modified: 2019-08-27 11:29 UTC
CC List: 3 users



Attachments:
open_ctree_failed error on boot (3.32 MB, image/jpeg), 2019-08-01 15:01 UTC, Felix Niederwanger
dmesg and journal of unsuccessful boot (180.00 KB, application/x-tar), 2019-08-03 12:27 UTC, Felix Niederwanger

Description Felix Niederwanger 2019-08-01 15:01:28 UTC
Created attachment 812481 [details]
open_ctree_failed error on boot

Dear all,

After a recent update in openSUSE Leap 15.1, a server failed to boot with "open_ctree failed" (see attachment).
I was able to work around this by booting the system without the corresponding device and running a 'btrfs device check' on that device. After that I was able to mount the filesystem again, but it is now horribly slow:

I have btrfs on an ARECA hardware RAID for a 70 TiB filesystem, which is about 70% full.
Btrfs runs on top of the hardware RAID; no btrfs RAID features are used in this system.
Even after resolving the "open_ctree failed" issue, the boot procedure failed without any useful information. I was able to boot the system by disabling the 70 TiB filesystem in /etc/fstab.
After booting I could mount the filesystem manually, but mounting and unmounting take horribly long, as in more than 5 minutes:


	eso-storage:/ # time mount /dev/sdb1 /RAID
	
	real	5m37.123s
	user	0m0.001s
	sys	0m6.102s


This explains why the boot procedure fails: the mount takes longer than the 4:30 minutes displayed as the timeout.
Still, a mount of a rather large filesystem that takes more than 5 minutes seems a bit fishy.
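
As a possible workaround for the boot timeout (untested on my side), I assume the systemd timeout for this mount could be raised via /etc/fstab, or the mount marked as nofail so the boot does not hang on it:

    ## /etc/fstab -- sketch only: nofail and x-systemd.mount-timeout are systemd mount options
    UUID=ef07c50a-4759-484c-930d-e8bac9f673e9  /RAID  btrfs  defaults,nofail,x-systemd.mount-timeout=15min  0  0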


I am using openSUSE Leap 15.1 with all updates installed, on kernel 4.12.14-lp151.28.10-default. Here are some details about the filesystems:

	/dev/sda  is the system ssd
	/dev/sdb  is the ARECA RAID

The filesystem has been created with `mkfs.btrfs` using the default settings.
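
If it matters, my understanding is that the defaults on a single device amount to roughly the following (data profile 'single', metadata 'DUP'); I did not pass any extra options myself:

    # equivalent of the mkfs defaults as I understand them, not the literal command I ran
    mkfs.btrfs -d single -m dup /dev/sdb1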



# blkid
 
/dev/sda1: UUID="36AC-53E1" TYPE="vfat" PARTUUID="b45ecebb-24f0-481a-9223-a2cde070ddab"
/dev/sda2: UUID="ba1a7010-f9cd-42b3-801d-9e64f0f85352" UUID_SUB="9b7c4815-9db1-45c5-abc3-f77287901e1f" TYPE="btrfs" PTTYPE="dos" PARTUUID="fd92a855-ead4-491c-9f3b-e646a9020e2a"
/dev/sdb1: UUID="ef07c50a-4759-484c-930d-e8bac9f673e9" UUID_SUB="0740289c-156b-4317-b807-d2de1906318c" TYPE="btrfs" PARTUUID="fbe19045-6d19-4201-81b7-c908abc3cef2"
 
# btrfs filesystem show
 
Label: none  uuid: ba1a7010-f9cd-42b3-801d-9e64f0f85352
    Total devices 1 FS bytes used 3.57GiB
    devid    1 size 50.00GiB used 4.53GiB path /dev/sda2
 
Label: none  uuid: ef07c50a-4759-484c-930d-e8bac9f673e9
    Total devices 1 FS bytes used 76.83TiB
    devid    1 size 98.22TiB used 76.94TiB path /dev/sdb1


Greetings,
Felix
Comment 1 Jeff Mahoney 2019-08-02 15:47:01 UTC
So there are a few things going on here.

First, I think we need to report the error code that open_ctree failed with.  I expect that systemd is killing the mount process and that's tripping a mutex_lock_interruptible or something, so the mount fails entirely.

Second, what you're hitting is a known scalability issue in btrfs.  The block groups are read in during mount.  Each block group is 1 GB in size, so you'll have about 76k block groups.  The block group items are located in the extent tree along with all of the extent records, so that means they are ultimately distributed across the disk and are read in with what is effectively a random I/O pattern.  Still, it shouldn't take 5 minutes to do it.
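
As a rough sanity check (assuming a btrfs-progs with inspect-internal, and that walking the extent tree of a 76 TiB filesystem offline is acceptable), the block group count can be estimated with something like:

    # count BLOCK_GROUP_ITEM entries in the extent tree; read-only, but it walks
    # a lot of metadata on a filesystem this size
    btrfs inspect-internal dump-tree -t extent /dev/sdb1 | grep -c BLOCK_GROUP_ITEM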

Is there any output in the log for that mount?
Comment 2 Felix Niederwanger 2019-08-03 12:27:18 UTC
Created attachment 812707 [details]
dmesg and journal of unsuccessful boot

Tar archive containing the dmesg and journal output of the unsuccessful boot with /RAID enabled in /etc/fstab.
Comment 3 Felix Niederwanger 2019-08-03 16:43:59 UTC
Hi Jeff,

thanks for the reply. I'll try to elaborate as precisely as possible:

I have to add that the RAID is heavily used. We have several million files on this storage, which is filled to about 70%. The setup is rather simple: standard openSUSE Leap 15.1, with NFS and InfiniBand as the only additional services running on the server.

I started to notice the unresponsive behavior after transferring the data from the old RAID to this one, which is consistent with a scalability issue.


Yesterday afternoon I started a `btrfs balance start /RAID` on the volume, which rendered the whole filesystem completely unresponsive. A subsequent `btrfs balance status /RAID` hung for more than an hour without any output (but it was not reported by the hung task detector either).

Because the whole RAID had become totally unusable, with no visible progress in the last 12 hours and even an `ls -l /RAID` hanging for more than an hour, I tried to reboot the machine. After 5 minutes without a reboot and no visible activity on any disk, I attempted a forcible reboot (Ctrl+Alt+Del more than 7 times); after another 5 minutes of complete unresponsiveness, no progress, and no disk activity, I turned the machine off.
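
In hindsight (assuming magic SysRq is enabled on this kernel), I probably should have tried to cancel the balance and dump the blocked task stacks before power-cycling, along the lines of:

    # ask the running balance to stop after the current block group
    btrfs balance cancel /RAID

    # dump kernel stacks of all blocked (uninterruptible) tasks to the kernel log
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200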


After a successful boot (without mounting /RAID) I re-enabled /RAID in /etc/fstab and rebooted:

    ## /etc/fstab
    #UUID=ef07c50a-4759-484c-930d-e8bac9f673e9  /RAID                   btrfs  defaults                      0  0

A video of the whole boot process is available on my webserver: ftp://feldspaten.org/dump/20190803_btrfs_balance_issue/btrfs_openctree_failed.mp4

Here, of course, the open_ctree error appeared. I dumped dmesg and the journal and am attaching both files here. After that, I disabled /RAID in /etc/fstab again and successfully rebooted the machine.

After that, I ran a `btrfs device scan` on the unmounted volume:

    eso-storage:~ # time btrfs device scan /dev/sdb1
    Scanning for Btrfs filesystems in '/dev/sdb1'
    
    real	0m0.004s
    user	0m0.001s
    sys	0m0.003s


Then I mounted the volume (this again took more than 5 minutes):

    eso-storage:~ # time mount /dev/sdb1 /RAID                                                                                
    
    real    5m36.333s
    user    0m0.001s
    sys     0m6.165s
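
If it helps, I can repeat the slow mount and sample where the kernel spends its time; a rough sketch (run as root, assuming /proc/<pid>/stack is available) would be:

    # start the mount in the background ...
    mount /dev/sdb1 /RAID &

    # ... and periodically sample the kernel stack of the mount process while it runs
    for i in $(seq 10); do
        cat /proc/$(pgrep -x mount)/stack
        sleep 30
    done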

Then, to test the filesystem and get an overview of what the users actually do on this system, I am counting the files:

    eso-storage:~ # time find /RAID -type f | wc -l

This is going to take some time. I'll post the output when something changes, and I'll restart a `btrfs balance` over the remaining weekend in the hope that the filesystem sorts itself out ...
Comment 4 Felix Niederwanger 2019-08-03 16:46:23 UTC
Addendum to the mount: the only messages I get in dmesg while/after mounting are the following:

[  248.218567] BTRFS info (device sdb1): disk space caching is enabled
[  248.218571] BTRFS info (device sdb1): has skinny extents
[  607.781209] BTRFS info (device sda2): qgroup scan completed (inconsistency flag cleared)
Comment 5 Felix Niederwanger 2019-08-05 11:17:01 UTC
Hey there,

The file count operation didn't finish over the weekend. For comparison, I ran the same command on the old RAID (XFS):

    time find /ESO-RAID -type f | wc -l | tee /root/ESO-count.txt
    18652209
    
    real	0m48.494s
    user	0m18.983s
    sys	0m34.245s

The system itself is pretty busy, load average: 4.00, 4.00, 4.00 (4-core Intel Xeon E3-1230). perf top shows the following:

    perf top --sort comm,dso
     74.72%  swapper     [kernel]
     25.28%  irqbalance  [kernel]
      0.00%  perf        [kernel]

As we rely on the new NAS, we will probably have to switch back to XFS.

I negotiated with my supervisor that we can use the system for further tests until Friday this week, if required. So feel free to go all crazy with it if that is of any use to you :-)

Greetings,
Felix
Comment 6 Felix Niederwanger 2019-08-27 11:29:12 UTC
I resolved the issue by switching to a different filesystem. Thanks to Jeff for the reply; I hope this report and the logs can help make btrfs work better under similar loads.

I'm closing this now because the affected system is now running in a different configuration.