Bug 908706 - cachefilesd on nfs via autofs makes kwin run wild
cachefilesd on nfs via autofs makes kwin run wild
Status: RESOLVED INVALID
Classification: openSUSE
Product: openSUSE Distribution
Classification: openSUSE
Component: Other
13.2
x86-64 openSUSE 13.2
: P5 - None : Minor (vote)
: ---
Assigned To: Neil Brown
E-mail List
maint:running:61844:important maint:r...
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-12-06 11:28 UTC by Mark Draheim
Modified: 2015-08-13 09:52 UTC (History)
6 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
messages log from boot to fail (193.87 KB, text/plain)
2014-12-08 17:52 UTC, Mark Draheim
Details
nfs client dmesg via netconsole (19.35 KB, text/plain)
2015-04-08 15:27 UTC, Mark Draheim
Details
nfs server and client confs df du (2.20 KB, text/plain)
2015-04-08 15:28 UTC, Mark Draheim
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Draheim 2014-12-06 11:28:41 UTC
This used to work in previous releases. Computers are fully up to date 13.2 but have been upgraded from 13.1, so no clean install of 13.2.

Here is the setup:

* nfs4 server on box1
* cachefilesd running on box2
* autofs the nfs share with option nfs2,fsc on box2

* problem: kwin runs at 100% CPU time at login, effectively freezing the box

Right, it is a rather unusual setup and I know that autofs for nfs does not implement all nfs features. I set this to minor because one can work around the problem. However, there seems to be something wrong with one of the components in this setup.

* cachefilesd running but it does not seem to cache anything
* it creates directories in /var/cache/fscache but du is always 0
* /proc nfsfs shows fsc enabled for the nfs share
* /proc fscache shows some stats for fscache but way less than in previous releases
* lots of cachefs "Lookup failed -1" in messages

If I mount the nfs share with fsc from fstab, then kwin does not run mad. Still, caching does not seem to work. Actually on one box I have /home statically nfs mounted via fstab AND data store autofs nfs mounted from a different server. Both are supposed to make use of cachefilesd. Disabling one of the nfs,fsc mounts does not help, though. On another box, I have only autofs nfs,fsc mount of data store. There, KDE login works but it will freeze sometime into the session.

If I mount nfs via autofs but without fscache, then kwin has no problems.

For now, I disabled cachefilesd but I would like to use it again because it speeds up things considerably. I use autofs a lot on my laptops to mount my data store while at home.

Anything I could do to debug the problem? Is there something I missed?
Comment 1 Bernhard Wiedemann 2014-12-06 15:36:27 UTC
There might be more than one bug here,
but it is a bit hard to tell them apart.

To debug it further, you can try to find out what kwin is doing,
when it is running at 100% CPU, using
strace -p `pidof kwin`

or gdb -p `pidof kwin`
with "thread apply all bt" command to dump all backtraces - best after installing some relevant debuginfo packages.


As always, it is also helpful to give steps to reproduce
(possibly as commented script) so that someone like me who does not want to spend some hours hunting the right docs to find the right config values and places
can still try to reproduce and narrow down the problem
Comment 2 Takashi Iwai 2014-12-06 17:17:31 UTC
Also, try to install openSUSE 13.1 kernel on your running 13.2 system, boot with it, and check whether the issue happens.  If yes, it's a regression in the kernel side.
Comment 3 Mark Draheim 2014-12-07 18:02:15 UTC
Some further testing. I thin I can reliably crash the system by having two nfs mounts with at least one of them using fsc.

/home mounted via fstab:
10.0.2.3:/home  /home   nfs4    defaults,fsc 0 0 

/vol mounted via autofs /etc/auto.vol:
data -fstype=nfs4 10.0.2.8:/store/data

Note that the second mount does not configured to use fsc in the above setup. The autofs mount gets mounted the moment I log in to KDE. The kernel then goes panicking. I can only get a glimpse at that on VT1. What I did see was this

Kernel BUG at ../fs/fscache/operation.c:38!

If I read the code correctly, it's here:

void fscache_enqueue_operation(struct fscache_operation *op)
{
        _enter("{OBJ%x OP%x,%u}",
               op->object->debug_id, op->debug_id, atomic_read(&op->usage));

        ASSERT(list_empty(&op->pend_link));
        ASSERT(op->processor != NULL);
        ASSERT(fscache_object_is_available(op->object));  <---

I am writing this on the same machine but with autofs disabled, thus preventing mount of /vol. My /home here is mounted via nfs,fsc.

On the other hand, I can work with both nfs mounts, if I disable fsc.

Still testing. Will try old kernel as suggested and check file systems. Seems I had btrfs errors on more than one machine which I had upgraded to 13.2. Unlikely for this machine because the system was copied (not dd) off the disk and then back to a new disk and there a newly formatted btrfs partition. Still the copied data might be corrupt.
Comment 4 Takashi Iwai 2014-12-08 08:31:51 UTC
Could you give the full kernel messages?  You might be able to get the log via netconsole, for example.

Also, another try would be the latest stable kernel available at OBS Kernel:stable repo.
Comment 5 Mark Draheim 2014-12-08 17:52:25 UTC
Created attachment 616303 [details]
messages log from boot to fail
Comment 6 Mark Draheim 2014-12-08 17:57:35 UTC
tested 3.11.10 kernel from 13.1 and 3.17.6-1.1.g12b7bf1 from kernel:stable, both desktop version. Both oops out at various points in fscache. The log in the previous comment is from the 3.11.10 kernel.

That kernel has worked fine in 13.1 with my setup. Which leads me to suspect that the kernel is not to blame and that these errors come from somewhere else with fscache just not able to handle them.

Also, kwin/KDE has nothing to do with it other than triggering the error by shuffling files from nfs /home to this computer. I get the error simply be copying my Documents folder to /tmp, ie from nfs to local.
Comment 7 Weihua Du 2014-12-22 02:32:16 UTC
If it is not the kernel or Kwin problem, I think we should see more deeply about nfs. Neil, would you please have a look? Thanks!
Comment 8 Neil Brown 2014-12-22 02:49:39 UTC
I'm afraid this will have to wait until I return from leave in late January.
Comment 9 Neil Brown 2015-04-01 00:54:19 UTC
Hi ... early April is close to late January - right?

sorry for the delay.

There are no upstream patches to cachefiled and nothing in the kernel which looks relevant, so it is probable that this bug still exists.

Firstly, could you check that /var/cache/fscache isn't running out of space.
In particular, the output of:

 df /var/cache/fscache
 du -s /var/cache/fscache
 cat /etc/cachefilesd.conf

would be helpful - if only to confirm there is nothing odd there.

Secondly, can you get me a full kernel log of a cache on the most recent kernel that you can conveniently run?  One way to get a kernel log would be to configure netconsole.  I can give details if you need them.  This requires another computer running Linux to gather the logs.

It might also help to turn on cachefiles debugging.
Either 
 echo 7 > /sys/module/cachefiles/parameters/debug

or add
  debug 7

to /etc/cachefilesd.conf

Thanks.
Comment 10 Mark Draheim 2015-04-07 20:07:24 UTC
Will try the netconsole setup on a Thinkstation that recently got a fresh 13.2 install. It is running Kernel:stable 3.19. Will report back with the logs.
Comment 11 Mark Draheim 2015-04-08 15:27:31 UTC
Created attachment 630383 [details]
nfs client dmesg via netconsole
Comment 12 Mark Draheim 2015-04-08 15:28:43 UTC
Created attachment 630384 [details]
nfs server and client confs df du
Comment 13 Mark Draheim 2015-04-08 15:42:48 UTC
Tried again: kernel 3.19.3, cachefilesd on nfs, see the last two attachments for details. The debug parameter on cachefiles does not provide any more info.

Setup: 

* new export from the server, new nfs mount on the client
* placed gcc source on the share
* put some load on the system: find . -type f -name '*.c' -exec sed -i -e 's/$/bar/g' \{\} \; && find . -type f -name '*.h' -exec sed -i -e 's/$/bar/g' \{\} \;

It will eventually oops out. The error is the same as with kernel 3.16. It only seems to happen under rw load. That is why my initial report blamed KDE but it probably was only akonadi/baloo doing heavy rw which triggered the oops.

The nfs server is a fairly low spec machine. I wonder if I should try again on a more powerful computer or maybe the nfs export conf is odd. That too is in the attachments.

What really puzzles me is that du always reports 0 on fscache, though that apparently does cache something. The "Lookup failed -1" message is always there on the first run, ie nothing in the cache yet(?). On the second run the lookup fail message show less often.
Comment 14 Mark Draheim 2015-04-08 21:34:31 UTC
I will try to implant David Howells' fscache-fixes into a kernel build and see if that helps with the problem. There has been some activity lately

 http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-fixes
Comment 15 Neil Brown 2015-04-09 00:54:19 UTC
Hi Mark,
thanks for all the details.

Part of the problem is that your /var/cache/fscache is in a 'btrfs' volume.
fscache doesn't work on btrfs.  It requires the filesystem to provide a 'bmap' interface user e.g. by the FIBMAP ioctl.  btrfs doesn't provide that (for good reason).
fscache (or more acturately: cachefiles which is the cache backend in use) refuses to work if bmap isn't available, and this results in the 

[  859.668561] CacheFiles: Lookup failed error -1
[  859.672224] CacheFiles: Lookup failed error -1
[  859.675402] CacheFiles: Lookup failed error -1
[  859.694269] CacheFiles: Lookup failed error -1


messages, and results in all the files always being empty.

The crash:

[  859.700614] FS-Cache: Assertion failed
[  859.700704] ------------[ cut here ]------------
[  859.700765] kernel BUG at ../fs/fscache/operation.c:38!

seems to be caused by the object being removed from the cache before it has even been added.  I suspect that is being caused because creating the object fails due to the lack of 'bmap'.

Is there any way you can mount some other filesystem on /var/cache/fscache?
You could possible create a large file somewhere, then
  losetup /dev/loop0 /my/large/file
  mkfs.ext4 /dev/loop0
  mount /dev/loop0 /var/cache/fscache

Then restart cachefilesd.
Comment 16 Mark Draheim 2015-04-09 12:45:53 UTC
Neil, you are absolutely right. I formatted an SD Card with ext4 and used that for fscache. And that works fine.

Hum, I could do some loopback magic at boot to mount an ext4 image for fscache, maybe even put that in RAM. But that seems overkill. Or switch to another fs, though I happen to like btrfs. And isn't btrfs now the default even in SLES? Ah, okay, cachefiles is client stuff and certainly not a default :-D

Anyway, I think this bug is resolved now, invalid because of user doing strange things, and it was badly named in the first place.

Thank you and all the others who commented on this. Always good to learn something about Linux internals.
Comment 17 Neil Brown 2015-04-10 01:02:52 UTC
Yes, btrfs is default for sles, and while fscache isn't a default, some of our customers do use it.  So I will be fixing that side of the problem.

I cannot get the bug to trigger in my testing, which is a shame as I'd like to be sure that got fixed too....

Thanks for the report!
Comment 18 Swamp Workflow Management 2015-05-29 09:53:37 UTC
An update workflow for this issue was started.
This issue was rated as moderate.
Please submit fixed packages until 2015-06-12.
When done, reassign the bug to security-team@suse.de.
https://swamp.suse.de/webswamp/wf/61844
Comment 19 Swamp Workflow Management 2015-06-08 12:00:40 UTC
An update workflow for this issue was started.
This issue was rated as important.
Please submit fixed packages until 2015-06-15.
When done, reassign the bug to security-team@suse.de.
https://swamp.suse.de/webswamp/wf/61904
Comment 20 Swamp Workflow Management 2015-07-02 15:09:57 UTC
SUSE-SU-2015:1174-1: An update that solves 15 vulnerabilities and has 71 fixes is now available.

Category: security (moderate)
Bug References: 831029,877456,889221,891212,891641,900881,902286,904242,904883,904901,906027,908706,909309,909312,909477,909684,910517,911326,912202,912741,913080,913598,914726,914742,914818,914987,915045,915200,915577,916521,916848,917093,917120,917648,917684,917830,917839,918333,919007,919018,919357,919463,919589,919682,919808,921769,922583,923344,924142,924271,924333,924340,925012,925370,925443,925567,925729,926016,926240,926439,926767,927190,927257,927262,927338,928122,928130,928142,928333,928970,929145,929148,929283,929525,929647,930145,930171,930226,930284,930401,930669,930786,930788,931014,931015,931850
CVE References: CVE-2014-8086,CVE-2014-8159,CVE-2014-9419,CVE-2014-9529,CVE-2014-9683,CVE-2015-0777,CVE-2015-1421,CVE-2015-2041,CVE-2015-2042,CVE-2015-2150,CVE-2015-2830,CVE-2015-2922,CVE-2015-3331,CVE-2015-3339,CVE-2015-3636
Sources used:
SUSE Linux Enterprise Server 11 SP3 for VMware (src):    kernel-bigsmp-3.0.101-0.47.55.1, kernel-default-3.0.101-0.47.55.1, kernel-pae-3.0.101-0.47.55.1, kernel-source-3.0.101-0.47.55.1, kernel-syms-3.0.101-0.47.55.1, kernel-trace-3.0.101-0.47.55.1, kernel-xen-3.0.101-0.47.55.1
SUSE Linux Enterprise Server 11 SP3 (src):    kernel-bigsmp-3.0.101-0.47.55.1, kernel-default-3.0.101-0.47.55.1, kernel-ec2-3.0.101-0.47.55.1, kernel-pae-3.0.101-0.47.55.1, kernel-ppc64-3.0.101-0.47.55.1, kernel-source-3.0.101-0.47.55.1, kernel-syms-3.0.101-0.47.55.1, kernel-trace-3.0.101-0.47.55.1, kernel-xen-3.0.101-0.47.55.1, xen-4.2.5_08-0.7.1
SUSE Linux Enterprise High Availability Extension 11 SP3 (src):    cluster-network-1.4-2.28.1.21, gfs2-2-0.17.1.21, ocfs2-1.6-0.21.1.21
SUSE Linux Enterprise Desktop 11 SP3 (src):    kernel-bigsmp-3.0.101-0.47.55.1, kernel-default-3.0.101-0.47.55.1, kernel-pae-3.0.101-0.47.55.1, kernel-source-3.0.101-0.47.55.1, kernel-syms-3.0.101-0.47.55.1, kernel-trace-3.0.101-0.47.55.1, kernel-xen-3.0.101-0.47.55.1, xen-4.2.5_08-0.7.1
SLE 11 SERVER Unsupported Extras (src):    kernel-bigsmp-3.0.101-0.47.55.1, kernel-default-3.0.101-0.47.55.1, kernel-pae-3.0.101-0.47.55.1, kernel-ppc64-3.0.101-0.47.55.1, kernel-xen-3.0.101-0.47.55.1
Comment 21 Swamp Workflow Management 2015-08-12 17:13:14 UTC
SUSE-SU-2015:1376-1: An update that solves 15 vulnerabilities and has 71 fixes is now available.

Category: security (important)
Bug References: 831029,877456,889221,891212,891641,900881,902286,904242,904883,904901,906027,908706,909309,909312,909477,909684,910517,911326,912202,912741,913080,913598,914726,914742,914818,914987,915045,915200,915577,916521,916848,917093,917120,917648,917684,917830,917839,918333,919007,919018,919357,919463,919589,919682,919808,921769,922583,923344,924142,924271,924333,924340,925012,925370,925443,925567,925729,926016,926240,926439,926767,927190,927257,927262,927338,928122,928130,928142,928333,928970,929145,929148,929283,929525,929647,930145,930171,930226,930284,930401,930669,930786,930788,931014,931015,931850
CVE References: CVE-2014-8086,CVE-2014-8159,CVE-2014-9419,CVE-2014-9529,CVE-2014-9683,CVE-2015-0777,CVE-2015-1421,CVE-2015-2041,CVE-2015-2042,CVE-2015-2150,CVE-2015-2830,CVE-2015-2922,CVE-2015-3331,CVE-2015-3339,CVE-2015-3636
Sources used:
SUSE Linux Enterprise Real Time Extension 11 SP3 (src):    cluster-network-1.4-2.28.1.22, drbd-kmp-8.4.4-0.23.1.22, iscsitarget-1.4.20-0.39.1.22, kernel-rt-3.0.101.rt130-0.33.38.1, kernel-rt_trace-3.0.101.rt130-0.33.38.1, kernel-source-rt-3.0.101.rt130-0.33.38.1, kernel-syms-rt-3.0.101.rt130-0.33.38.1, lttng-modules-2.1.1-0.12.1.20, ocfs2-1.6-0.21.1.22, ofed-1.5.4.1-0.14.1.22