Bug 950178

Summary: Can't rename files (on btrfs)
Product: [openSUSE] openSUSE Distribution Reporter: Stephan Kulow <coolo>
Component: KernelAssignee: Jeff Mahoney <jeffm>
Status: RESOLVED WONTFIX QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P2 - High CC: aschnell, ddiss, dimstar, fdmanana, lmuelle, lnussel, mge
Version: Leap 42.1Flags: coolo: SHIP_STOPPER+
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: btrfs kernel patch

Description Stephan Kulow 2015-10-13 14:52:14 UTC
There is something really fishy on my test machine - and it happened out of the sudden when I tried to zypper dup:

tortuga:/ # df -h
Dateisystem    Größe Benutzt Verf. Verw% Eingehängt auf
devtmpfs        7,9G    8,0K  7,9G    1% /dev
tmpfs           7,9G       0  7,9G    0% /dev/shm
tmpfs           7,9G     11M  7,8G    1% /run
tmpfs           7,9G       0  7,9G    0% /sys/fs/cgroup
/dev/sda3        41G    4,5G   36G   12% /
/dev/sda3        41G    4,5G   36G   12% /var/opt
/dev/sda3        41G    4,5G   36G   12% /var/lib/mailman
/dev/sda3        41G    4,5G   36G   12% /var/tmp
/dev/sda3        41G    4,5G   36G   12% /var/spool
/dev/sda3        41G    4,5G   36G   12% /var/log
/dev/sda3        41G    4,5G   36G   12% /var/lib/pgsql
/dev/sda3        41G    4,5G   36G   12% /var/lib/named
/dev/sda3        41G    4,5G   36G   12% /var/lib/mariadb
/dev/sda3        41G    4,5G   36G   12% /var/lib/libvirt/images
/dev/sda3        41G    4,5G   36G   12% /boot/grub2/x86_64-efi
/dev/sda3        41G    4,5G   36G   12% /var/crash
/dev/sda3        41G    4,5G   36G   12% /usr/local
/dev/sda3        41G    4,5G   36G   12% /tmp
/dev/sda3        41G    4,5G   36G   12% /opt
/dev/sda3        41G    4,5G   36G   12% /srv
/dev/sda3        41G    4,5G   36G   12% /boot/grub2/i386-pc
/dev/sda1       156M    4,5M  152M    3% /boot/efi
/dev/sda5       1,0T     15G 1010G    2% /home
tortuga:/ # LC_ALL=C mv bar baz
mv: cannot move 'bar' to 'baz': No space left on device
tortuga:/ # touch baz


I simply can't move files, but I can create new ones - also by copy. 
Running 4.1.8-1-default from 42.1-Current
Comment 1 Stephan Kulow 2015-10-14 11:03:10 UTC
Not being able to update packages on Leap is a ship stopper
Comment 2 Stephan Kulow 2015-10-14 12:05:20 UTC
after a reboot it's working now
Comment 3 Filipe Manana 2015-10-14 12:28:46 UTC
When this happens try running "sync" or "btrfs filesystem sync <mountpoint>" and retry the rename/move.

Also, for this sort of issues, the report from df is pretty much useless.
The output of "btrfs filesystem usage <mountpoint>" gives useful information, how much space is allocated for data and metadata and how much of that space is used/free.

A rename in btrfs requires at least 2883584 bytes of free metadata space (quite large, but it's because it's very pessimistic).

If an issue persists across unmount/mount, an image of the fs is useful for debugging, made with "btrfs-image -c 9 /dev/sda image_file" (the fs must not be mounted).
Comment 4 Stephan Kulow 2015-10-15 09:25:20 UTC
tortuga:/ # mv bar baz
mv: cannot move 'bar' to 'baz': No space left on device
tortuga:/ # sync /
sync: ignoring all arguments
tortuga:/ # mv bar baz
mv: cannot move 'bar' to 'baz': No space left on device
tortuga:/ # btrfs filesystem sync /
FSSync '/'
tortuga:/ # mv bar baz
mv: cannot move 'bar' to 'baz': No space left on device
tortuga:/ # btrfs filesystem usage /
Overall:
    Device size:                  40.00GiB
    Device allocated:             40.00GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Used:                          4.69GiB
    Free (estimated):             35.12GiB      (min: 35.12GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:               64.00MiB      (used: 0.00B)

Data,single: Size:39.44GiB, Used:4.32GiB
   /dev/sda3      39.44GiB                                                                                                                                                                          
                                                                                                                                                                                                    
Metadata,DUP: Size:256.00MiB, Used:189.27MiB                                                                                                                                                        
   /dev/sda3     512.00MiB                                                                                                                                                                          
                                                                                                                                                                                                    
System,DUP: Size:32.00MiB, Used:16.00KiB                                                                                                                                                            
   /dev/sda3      64.00MiB                                                                                                                                                                          
                                                                                                                                                                                                    
Unallocated:                                                                                                                                                                                        
   /dev/sda3       1.00MiB
Comment 5 Stephan Kulow 2015-10-15 09:26:17 UTC
extracting an image of / is a bit challenging ;(
Comment 6 Stephan Kulow 2015-10-15 11:48:24 UTC
thanks to the rescue system: /suse/coolo/Export/BUG950178.btrfs

note however that after a reboot, it's working for a short while, but zypper up will stop at the 3rd to 5th package with the same problem.
Comment 7 Filipe Manana 2015-10-15 13:15:16 UTC
When this happens try running balance against data block groups only to see if it helps:

btrfs balance -dusage=0 <mountpoint>
btrfs balance -dusage=5 <mountpoint>
btrfs balance -dusage=10 <mountpoint>

Where dusage means data block groups with a usage of N% or below.
It seems there's way too much space allocated for data that is preventing new metadata block groups from being allocated.

Though there's still considerably more than enough metadata free to allow for the rename operation to succeed (there's probably some bad accounting of metadata bug somewhere).
Comment 8 Jeff Mahoney 2015-10-15 14:16:23 UTC
For renames, there's only 3 MB available for the transaction.  What we should be doing here is allocating a new metadata block group since there is ~ 36GB free.
Comment 9 Jeff Mahoney 2015-10-15 14:17:07 UTC
Nevermind. "Free" isn't unallocated in this context. Yeah, the balance should help.
Comment 10 Stephan Kulow 2015-10-16 05:35:21 UTC
So if that's what helps, why is btrfs not doing it itself?
Comment 11 Ludwig Nussel 2015-10-20 12:22:54 UTC
How can btrfs run into this situation and how to prevent it? GM is just a few days away ...
Comment 12 Matthias Eckermann 2015-10-22 11:46:35 UTC
(In reply to Ludwig Nussel from comment #11)
> How can btrfs run into this situation and how to prevent it? GM is just a
> few days away ...

While "auto balance" might have serious performance impact (and thus
is not implemented upstream in brtfs yet, as far as I know), the
"btrfsmaintenance" package contains the right scripts to keep the
btrfs filesystem in a reasonable shape. Make sure that "btrfsmaintenance"
is installed by default, and the scripts run.
Comment 13 Dominique Leuenberger 2015-10-27 15:13:25 UTC
(In reply to Matthias Eckermann from comment #12)
> (In reply to Ludwig Nussel from comment #11)
> > How can btrfs run into this situation and how to prevent it? GM is just a
> > few days away ...
> 
> While "auto balance" might have serious performance impact (and thus
> is not implemented upstream in brtfs yet, as far as I know), the
> "btrfsmaintenance" package contains the right scripts to keep the
> btrfs filesystem in a reasonable shape. Make sure that "btrfsmaintenance"
> is installed by default, and the scripts run.

btrfsmaintenance is well installed in a default setup, BUT:
> systemctl status btrfsmaintenance-refresh
● btrfsmaintenance-refresh.service - Update cron periods from /etc/sysconfig/btrfsmaintenance
   Loaded: loaded (/usr/lib/systemd/system/btrfsmaintenance-refresh.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

being disabled (as any default service) it is not going to do a lot of work - if this should be changed, a diff to systemd-presets-branding-openSUSE is needed.

After I started the service, it indeed created links in /etc/cron.{weekly,monthly} as expected.
Comment 14 Stephan Kulow 2015-10-27 15:21:21 UTC
that doesn't seem to be the problem though, tortuga has these cron jobs. But I doubt a weekly cronjob is the solution to the problem we saw.
Comment 15 Filipe Manana 2015-10-30 16:23:46 UTC
So, I looked at the image and restored it into a test device on my machine, with an upstream 4.2 kernel.

I couldn't reproduce the issue. Some analysis details:

There are over 20 data blocks groups that are empty, i.e. not used:

$ btrfs-image -r BUG950178.btrfs image
$ btrfs-debug-tree image 2>&1 | egrep 'BLOCK_GROUP_ITEM' -A 1
(...)
	item 194 key (26705133568 BLOCK_GROUP_ITEM 1073741824) itemoff 5977 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 195 key (27778875392 BLOCK_GROUP_ITEM 1073741824) itemoff 5953 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 196 key (28852617216 BLOCK_GROUP_ITEM 1073741824) itemoff 5929 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 197 key (29926359040 BLOCK_GROUP_ITEM 1073741824) itemoff 5905 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 198 key (31000100864 BLOCK_GROUP_ITEM 1073741824) itemoff 5881 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 199 key (32073842688 BLOCK_GROUP_ITEM 1073741824) itemoff 5857 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 200 key (33147584512 BLOCK_GROUP_ITEM 1073741824) itemoff 5833 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 201 key (34221326336 BLOCK_GROUP_ITEM 1073741824) itemoff 5809 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 202 key (35295068160 BLOCK_GROUP_ITEM 1073741824) itemoff 5785 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 203 key (36368809984 BLOCK_GROUP_ITEM 1073741824) itemoff 5761 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 204 key (37442551808 BLOCK_GROUP_ITEM 1073741824) itemoff 5737 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 205 key (38516293632 BLOCK_GROUP_ITEM 1073741824) itemoff 5713 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 206 key (39590035456 BLOCK_GROUP_ITEM 1073741824) itemoff 5689 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 207 key (40663777280 BLOCK_GROUP_ITEM 1073741824) itemoff 5665 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 208 key (41737519104 BLOCK_GROUP_ITEM 1073741824) itemoff 5641 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 209 key (42811260928 BLOCK_GROUP_ITEM 1073741824) itemoff 5617 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 210 key (43885002752 BLOCK_GROUP_ITEM 1073741824) itemoff 5593 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 211 key (44958744576 BLOCK_GROUP_ITEM 1073741824) itemoff 5569 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 212 key (46032486400 BLOCK_GROUP_ITEM 687865856) itemoff 5545 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 213 key (46720352256 BLOCK_GROUP_ITEM 635437056) itemoff 5521 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 214 key (47355789312 BLOCK_GROUP_ITEM 134217728) itemoff 5497 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 215 key (47490007040 BLOCK_GROUP_ITEM 67108864) itemoff 5473 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
	item 216 key (47557115904 BLOCK_GROUP_ITEM 11534336) itemoff 5449 itemsize 24
		block group used 0 chunk_objectid 256 flags DATA
(...)

As of the upstream kernel 3.19, we have a background kernel thread that automatically deletes unused block groups.

When I mount the image, these unused block groups get deallocated shortly after (about 30s after the fs is mounted, which is the default transaction commit period), releasing over 20G of space.

Yes, for a fs for which there's no more unallocated space it means we have a time window of 30s where applications can get ENOSPC errors after mount - far from ideal, but the problem you are having on that machine seems to be that the deallocation isn't happening at all.

We had bugs with this automatic cleanup of unused block groups since 3.19, but none of them were related to not getting the unused block groups deleted (we had BUG_ON's, crashes, etc) as far as I remember and can see through git history. The only thing I can think of that prevents them from getting deleted is if at some time the fs turned into readonly mode, due to some serious issue - if that's the case you would see a message and a trace about this in dmesg/syslog.

Also, did you end up running a manual balance operation as suggested in a previous comment?

btrfs balance -dusage=0 <mountpoint>

I know it's not ideal and that people should not worry about this, but it's just to confirm that even if doing that the unused block group deallocations aren't happening.
Comment 18 Filipe Manana 2015-11-12 17:13:53 UTC
Hi,

I don't need the machine, the image was enough.
The fs got into a state where it can't even reserve/allocate metadata to delete unused block groups (deleting them requires COWing the extent, device and chunk trees to delete some items and update another one).

Not sure at this point if it's really possible to guarantee 100% we don't reach this scenario (however it can be minimized a bit) - haven't seen this happening before.

thanks
Comment 19 Filipe Manana 2015-11-12 17:58:58 UTC
Created attachment 655735 [details]
btrfs kernel patch

Are you able to test patches on that box btw? (kernel patch)
If so please try the attached patch and see if it helps (at least after a sync).
Comment 20 Filipe Manana 2015-11-12 18:32:57 UTC
So I was able to reproduce by hacking btrfs-image' restore behaviour so that it sets the target fs total size to match the used size:

diff --git a/btrfs-image.c b/btrfs-image.c
index 82eed05..1d44ea1 100644
--- a/btrfs-image.c
+++ b/btrfs-image.c
@@ -2386,6 +2386,7 @@ static int fixup_devices(struct btrfs_fs_info *fs_info,
 
        devid = btrfs_stack_device_id(dev_item);
 
+       dev_size = mdres->alloced_chunks;
        btrfs_set_stack_device_total_bytes(dev_item, dev_size);
        btrfs_set_stack_device_bytes_used(dev_item, mdres->alloced_chunks);


With that applied to btrfs-progs, I was able to reproduce your enospc issue. After mounting the restored fs, the unused block groups were never deleted, because the cleaner kthread always got a -ENOSPC error when trying to start a transaction to delete unused block groups - the cleaner does not log such errors nor takes any special action, whence why it was silent (nothing in syslog/dmesg).

With the patch I attached earlier, soon after the mounting the fs the unused block groups were deleted, and about 33Gb of space released.
Comment 21 Jeff Mahoney 2015-11-13 06:15:44 UTC
(In reply to Ludwig Nussel from comment #16)
> I just tried zypper dup on an older tumbleweed installation with kernel
> 4.1.6-3-desktop and it ran into the same situation. You can log into
> jkeil.suse.de to check.
> 
> jkeil:~ # btrfs balance start -dusage=0 /
> ERROR: error during balancing '/' - No space left on device
> There may be more info in syslog - try dmesg | tail

There are a few issues here but they have the same root cause in different places in the code.  The removal of a block group will use the global reservation but starting the transaction won't without fixing.  So we fail to start the transaction and don't even touch the global reservation.  Filipe's fix (and mine, which I'm testing overnight and in the morning) fix these cases.

But, even with those fixes, you'd still get the ENOSPC for balance start.  The balance will report ENOSPC but will kick the cleaner thread while it's doing it and the locking means that the cleaner thread will usually run before btrfs balance start returns.  So that's now a UI issue and one I'm working on a fix for too.
Comment 22 Stephan Kulow 2015-11-13 07:19:40 UTC
I should be able to test on the test machine that triggered the bug report. I left the file system as is and reinstalled on another partition. But I don't understand the kernel patching system well enough to build such a kernel. So if you have some rpm in OBS, let me know.
Comment 23 Jeff Mahoney 2015-11-13 16:47:49 UTC
We can give you an RPM with the fixes we have already.  Those will just go into git pretty much immediately.

The false ENOSPC reporting issue w/ balance is one I'm still tracking down.
Comment 24 Filipe Manana 2015-11-13 17:05:26 UTC
For the non-balance related issues, automatic removal of unused block group not working and a BUG_ON/crash when deleting a block group, I've sent the changes upstream today:

http://thread.gmane.org/gmane.comp.file-systems.btrfs/50150

We'll get them to sle12/leap/opensuse soon.
Comment 25 Ludwig Nussel 2015-12-23 14:08:01 UTC
Will Santa bring the fixes to our users finally?
Comment 26 Filipe Manana 2015-12-23 14:21:13 UTC
(In reply to Ludwig Nussel from comment #25)
> Will Santa bring the fixes to our users finally?

Only to SLE12 and SLE12-SP1 users... And upstream 4.4 kernel (now at rc6) users.
Comment 27 Ludwig Nussel 2015-12-31 12:19:36 UTC
What about Leap?
Comment 29 Ludwig Nussel 2016-01-26 16:08:52 UTC
https://build.opensuse.org/request/show/355284 references this bug for leap, does it mean it contains a fix or is that just partial?
Comment 30 Filipe Manana 2016-01-26 16:45:37 UTC
(In reply to Ludwig Nussel from comment #29)
> https://build.opensuse.org/request/show/355284 references this bug for leap,
> does it mean it contains a fix or is that just partial?

It means it contains the 2 patches I mentioned before (comment 24). 2 fixes for 2 different scenarios leading to ENOSPC.
Jeff was/is still investigating the other (probably less common) case.
Comment 31 Swamp Workflow Management 2016-01-29 13:12:53 UTC
openSUSE-SU-2016:0280-1: An update that solves 10 vulnerabilities and has 18 fixes is now available.

Category: security (important)
Bug References: 865096,865259,913996,950178,950998,952621,954324,954532,954647,955422,956708,957152,957988,957990,958439,958463,958504,958510,958886,958951,959190,959399,960021,960710,961263,961509,962075,962597
CVE References: CVE-2015-7550,CVE-2015-8539,CVE-2015-8543,CVE-2015-8550,CVE-2015-8551,CVE-2015-8552,CVE-2015-8569,CVE-2015-8575,CVE-2015-8767,CVE-2016-0728
Sources used:
openSUSE Leap 42.1 (src):    kernel-debug-4.1.15-8.1, kernel-default-4.1.15-8.1, kernel-docs-4.1.15-8.3, kernel-ec2-4.1.15-8.1, kernel-obs-build-4.1.15-8.2, kernel-obs-qa-4.1.15-8.1, kernel-obs-qa-xen-4.1.15-8.1, kernel-pae-4.1.15-8.1, kernel-pv-4.1.15-8.1, kernel-source-4.1.15-8.1, kernel-syms-4.1.15-8.1, kernel-vanilla-4.1.15-8.1, kernel-xen-4.1.15-8.1
Comment 32 Ludwig Nussel 2016-02-09 08:10:07 UTC
Thanks, reducing severity for less common case mentioned in comment #30
Comment 33 Tomáš Chvátal 2018-04-13 14:59:07 UTC
This is automated batch bugzilla cleanup.

The openSUSE 42.1 changed to end-of-life (EOL [1]) status. As such
it is no longer maintained, which means that it will not receive any
further security or bug fix updates.
As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
openSUSE, or you can still observe it under openSUSE Leap 15.0, please
feel free to reopen this bug against that version (see the "Version"
component in the bug fields), or alternatively open
a new ticket.

Thank you for reporting this bug and we are sorry it could not be fixed
during the lifetime of the release.

[1] https://en.opensuse.org/Lifetime