Bug 909634 - NFS: 100% iowait CPU load with error -10023 (STALE_STATE_ID) or -10025 (BAD_STATE_ID)
NFS: 100% iowait CPU load with error -10023 (STALE_STATE_ID) or -10025 (BAD_S...
Status: RESOLVED WORKSFORME
Classification: openSUSE
Product: openSUSE Distribution
Classification: openSUSE
Component: Kernel
13.2
Other Other
: P5 - None : Normal (vote)
: ---
Assigned To: Neil Brown
E-mail List
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-12-12 09:14 UTC by Joschi Brauchle
Modified: 2016-05-20 04:09 UTC (History)
3 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
nfbrown: needinfo? (joschibrauchle)


Attachments
ps auxf output (stripped of user processes) (19.54 KB, text/plain)
2015-01-06 14:33 UTC, Joschi Brauchle
Details
RPC-debug output in journald (49.43 KB, text/x-log)
2015-01-06 14:37 UTC, Joschi Brauchle
Details
NFS-debug output in journald (9.14 KB, text/x-log)
2015-01-06 14:41 UTC, Joschi Brauchle
Details
RPC/NFS debug log (253.75 KB, text/x-log)
2015-04-22 12:42 UTC, Joschi Brauchle
Details
RPC/NFS debug log (253.62 KB, text/x-log)
2015-04-22 12:48 UTC, Joschi Brauchle
Details
Wireshark trace of NFS/RPC calls (302.65 KB, application/vnd.tcpdump.pcap)
2015-04-23 10:06 UTC, Joschi Brauchle
Details
TCP Dump at start of BAD_STATEID error (1.69 MB, application/vnd.tcpdump.pcap)
2015-06-17 14:34 UTC, Joschi Brauchle
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Joschi Brauchle 2014-12-12 09:14:24 UTC
On openSUSE 13.2 with kernel 3.16.6 we seeing the following problem on multiple machine connected to a 'NetApp FAS 3140 with ONTAP 8.1.4P1 7-mode' NFSv3/4 Server. We are mounting user homes and other shares using 'sec=krb' and NFSv4 and the user logged into the machine has a valid KRB5 ticket.


Something without a PID (i.e. the kernel) uses 100% on one core, 'top' shows '71.7 id, 25.6 wa' on a 4 core machine.

Cranking up debug level on RPC and NFS yields the following, constantly repeating messages (for the -10025 error case):
---------------------------
``rpcdebug -m nfs all``
---------------------------
...
kernel: NFS: nfs_pgio_result:  9702, (status -10025)
kernel: nfs4_free_slot: slotid 0 highest_used_slotid 1
kernel: nfs4_schedule_stateid_recovery: scheduling stateid recovery for server <OUR_NFS_SERVER>
kernel: --> nfs4_alloc_slot used_slots=0002 highest_used=1 max_slots=1024
kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=0
kernel: --> nfs_put_client({4})
kernel: NFS: nfs_pgio_result:  9701, (status -10025)
kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
kernel: nfs4_schedule_stateid_recovery: scheduling stateid recovery for server <OUR_NFS_SERVER>
kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=1024
kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
kernel: --> nfs_put_client({4})
kernel: NFS: nfs_pgio_result:  9702, (status -10025)
kernel: nfs4_free_slot: slotid 0 highest_used_slotid 1
kernel: nfs4_schedule_stateid_recovery: scheduling stateid recovery for server <OUR_NFS_SERVER>
kernel: --> nfs4_alloc_slot used_slots=0002 highest_used=1 max_slots=1024
kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=0
kernel: --> nfs_put_client({4})
kernel: NFS: nfs_pgio_result:  9701, (status -10025)
kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
kernel: nfs4_schedule_stateid_recovery: scheduling stateid recovery for server <OUR_NFS_SERVER>
kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=1024
kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
kernel: --> nfs_put_client({4})
...
---------------------------

and ``rpcdebug -m rpc all``
---------------------------
...
kernel: RPC:  9702 gss_unwrap_resp returning -10025
kernel: RPC:  9702 call_decode result -10025
kernel: RPC:       wake_up_first(ffff8802348e1a98 "NFSv4.0 transport Slot table")
kernel: RPC:  9702 sleep_on(queue "NFS client" time 5089081878)
kernel: RPC:  9702 added to queue ffff880233645dc0 "NFS client"
kernel: RPC:  9702 __rpc_wake_up_task (now 5089081878)
kernel: RPC:  9702 removed from queue ffff880233645dc0 "NFS client"
kernel: RPC:       __rpc_wake_up_task done
kernel: RPC:       freeing buffer of size 5316 at ffff8801b24b4000
kernel: RPC:  9702 release request ffff8802230c1a00
kernel: RPC:       wake_up_first(ffff8800b7b1b320 "xprt_backlog")
kernel: RPC:  9702 call_start nfs4 proc WRITE (async)
kernel: RPC:  9702 call_reserve (status 0)
kernel: RPC:  9702 reserved req ffff8802230c1a00 xid 21f704d8
kernel: RPC:       wake_up_first(ffff8800b7b1b190 "xprt_sending")
kernel: RPC:  9702 call_reserveresult (status 0)
kernel: RPC:  9702 call_refresh (status 0)
kernel: RPC:  9702 refreshing RPCSEC_GSS cred ffff88021f305d00
kernel: RPC:  9702 call_refreshresult (status 0)
kernel: RPC:  9702 call_allocate (status 0)
kernel: RPC:  9702 allocated buffer of size 5316 at ffff8801b24b4000
kernel: RPC:  9702 call_bind (status 0)
kernel: RPC:  9702 call_connect xprt ffff8800b7b1b000 is connected
kernel: RPC:  9702 call_transmit (status 0)
kernel: RPC:  9702 xprt_prepare_transmit
kernel: RPC:  9702 rpc_xdr_encode (status 0)
kernel: RPC:  9702 marshaling RPCSEC_GSS cred ffff88021f305d00
kernel: RPC:  9702 gss_marshal
kernel: RPC:       gss_get_mic_v1
kernel: RPC:       krb5_make_rc4_seq_num:
kernel: krb5_rc4_setup_seq_key: entered
kernel: krb5_rc4_setup_seq_key: returning 0
kernel: RPC:       krb5_encrypt returns 0
kernel: RPC:  9702 using AUTH_GSS cred ffff88021f305d00 to wrap rpc data
kernel: RPC:  9702 gss_wrap_req
kernel: RPC:  9702 gss_wrap_req returning 0
kernel: RPC:  9702 xprt_transmit(32984)
kernel: RPC:       xs_tcp_send_request(32984) = 32984
kernel: RPC:  9702 xmit complete
kernel: RPC:  9702 sleep_on(queue "xprt_pending" time 5089081878)
kernel: RPC:  9702 added to queue ffff8800b7b1b258 "xprt_pending"
kernel: RPC:  9702 setting alarm for 60000 ms
kernel: RPC:       wake_up_first(ffff8800b7b1b190 "xprt_sending")
kernel: RPC:       xs_tcp_data_ready...
kernel: RPC:       xs_tcp_data_recv started
kernel: RPC:       reading TCP record fragment of length 92
kernel: RPC:       reading XID (4 bytes)
kernel: RPC:       reading reply for XID 20f704d8
kernel: RPC:       reading CALL/REPLY flag (4 bytes)
kernel: RPC:       read reply XID 20f704d8
kernel: RPC:       XID 20f704d8 read 84 bytes
kernel: RPC:       xprt = ffff8800b7b1b000, tcp_copied = 92, tcp_offset = 92, tcp_reclen = 92
kernel: RPC:  9701 xid 20f704d8 complete (92 bytes received)
kernel: RPC:  9701 __rpc_wake_up_task (now 5089081879)
kernel: RPC:  9701 disabling timer
kernel: RPC:  9701 removed from queue ffff8800b7b1b258 "xprt_pending"
kernel: RPC:       __rpc_wake_up_task done
kernel: RPC:       xs_tcp_data_recv done
kernel: RPC:  9701 __rpc_execute flags=0xa01
kernel: RPC:  9701 call_status (status 92)
kernel: RPC:  9701 call_decode (status 92)
kernel: RPC:  9701 validating RPCSEC_GSS cred ffff88021f305d00
kernel: RPC:  9701 gss_validate
kernel: RPC:       krb5_read_token
kernel: RPC:       krb5_get_seq_num:
kernel: RPC:       krb5_get_rc4_seq_num:
kernel: krb5_rc4_setup_seq_key: entered
kernel: krb5_rc4_setup_seq_key: returning 0
kernel: RPC:       gss_k5decrypt returns 0
kernel: RPC:  9701 gss_validate: gss_verify_mic succeeded.
kernel: RPC:  9701 using AUTH_GSS cred ffff88021f305d00 to unwrap rpc data
kernel: RPC:  9701 gss_unwrap_resp returning -10025
kernel: RPC:  9701 call_decode result -10025
kernel: RPC:       wake_up_first(ffff8802348e1a98 "NFSv4.0 transport Slot table")
kernel: RPC:  9701 sleep_on(queue "NFS client" time 5089081879)
kernel: RPC:  9701 added to queue ffff880233645dc0 "NFS client"
kernel: RPC:       freeing buffer of size 5316 at ffff880233d04000
kernel: RPC:  9701 release request ffff8802230c1c00
kernel: RPC:  9701 __rpc_wake_up_task (now 5089081879)
kernel: RPC:  9701 removed from queue ffff880233645dc0 "NFS client"
kernel: RPC:       __rpc_wake_up_task done
kernel: RPC:       wake_up_first(ffff8800b7b1b320 "xprt_backlog")
kernel: RPC:  9701 call_start nfs4 proc WRITE (async)
kernel: RPC:  9701 call_reserve (status 0)
kernel: RPC:  9701 reserved req ffff8802230c1c00 xid 22f704d8
kernel: RPC:       wake_up_first(ffff8800b7b1b190 "xprt_sending")
kernel: RPC:  9701 call_reserveresult (status 0)
kernel: RPC:  9701 call_refresh (status 0)
kernel: RPC:  9701 refreshing RPCSEC_GSS cred ffff88021f305d00
kernel: RPC:  9701 call_refreshresult (status 0)
kernel: RPC:  9701 call_allocate (status 0)
kernel: RPC:  9701 allocated buffer of size 5316 at ffff880233d04000
kernel: RPC:  9701 call_bind (status 0)
kernel: RPC:  9701 call_connect xprt ffff8800b7b1b000 is connected
kernel: RPC:  9701 call_transmit (status 0)
kernel: RPC:  9701 xprt_prepare_transmit
kernel: RPC:  9701 rpc_xdr_encode (status 0)
kernel: RPC:  9701 marshaling RPCSEC_GSS cred ffff88021f305d00
kernel: RPC:  9701 gss_marshal
kernel: RPC:       gss_get_mic_v1
kernel: RPC:       krb5_make_rc4_seq_num:
kernel: krb5_rc4_setup_seq_key: entered
kernel: krb5_rc4_setup_seq_key: returning 0
kernel: RPC:       krb5_encrypt returns 0
kernel: RPC:  9701 using AUTH_GSS cred ffff88021f305d00 to wrap rpc data
kernel: RPC:  9701 gss_wrap_req
kernel: RPC:  9701 gss_wrap_req returning 0
kernel: RPC:  9701 xprt_transmit(32984)
kernel: RPC:       xs_tcp_send_request(32984) = 32984
kernel: RPC:  9701 xmit complete
kernel: RPC:  9701 sleep_on(queue "xprt_pending" time 5089081879)
kernel: RPC:  9701 added to queue ffff8800b7b1b258 "xprt_pending"
kernel: RPC:  9701 setting alarm for 60000 ms
kernel: RPC:       wake_up_first(ffff8800b7b1b190 "xprt_sending")
kernel: RPC:       xs_tcp_data_ready...
kernel: RPC:       xs_tcp_data_recv started
kernel: RPC:       reading TCP record fragment of length 92
kernel: RPC:       reading XID (4 bytes)
kernel: RPC:       reading reply for XID 21f704d8
kernel: RPC:       reading CALL/REPLY flag (4 bytes)
kernel: RPC:       read reply XID 21f704d8
kernel: RPC:       XID 21f704d8 read 84 bytes
kernel: RPC:       xprt = ffff8800b7b1b000, tcp_copied = 92, tcp_offset = 92, tcp_reclen = 92
kernel: RPC:  9702 xid 21f704d8 complete (92 bytes received)
kernel: RPC:  9702 __rpc_wake_up_task (now 5089081880)
kernel: RPC:  9702 disabling timer
kernel: RPC:  9702 removed from queue ffff8800b7b1b258 "xprt_pending"
kernel: RPC:       __rpc_wake_up_task done
kernel: RPC:       xs_tcp_data_recv done
kernel: RPC:  9702 __rpc_execute flags=0xa01
kernel: RPC:  9702 call_status (status 92)
kernel: RPC:  9702 call_decode (status 92)
kernel: RPC:  9702 validating RPCSEC_GSS cred ffff88021f305d00
kernel: RPC:  9702 gss_validate
kernel: RPC:       krb5_read_token
kernel: RPC:       krb5_get_seq_num:
kernel: RPC:       krb5_get_rc4_seq_num:
kernel: krb5_rc4_setup_seq_key: entered
kernel: krb5_rc4_setup_seq_key: returning 0
kernel: RPC:       gss_k5decrypt returns 0
kernel: RPC:  9702 gss_validate: gss_verify_mic succeeded.
kernel: RPC:  9702 using AUTH_GSS cred ffff88021f305d00 to unwrap rpc data
kernel: RPC:  9702 gss_unwrap_resp returning -10025
kernel: RPC:  9702 call_decode result -10025
kernel: RPC:       wake_up_first(ffff8802348e1a98 "NFSv4.0 transport Slot table")
kernel: RPC:  9702 sleep_on(queue "NFS client" time 5089081880)
kernel: RPC:  9702 added to queue ffff880233645dc0 "NFS client"
kernel: RPC:       freeing buffer of size 5316 at ffff8801b24b4000
kernel: RPC:  9702 __rpc_wake_up_task (now 5089081880)
kernel: RPC:  9702 removed from queue ffff880233645dc0 "NFS client"
kernel: RPC:       __rpc_wake_up_task done
kernel: RPC:  9702 release request ffff8802230c1a00
kernel: RPC:       wake_up_first(ffff8800b7b1b320 "xprt_backlog")
kernel: RPC:  9702 call_start nfs4 proc WRITE (async)
kernel: RPC:  9702 call_reserve (status 0)
kernel: RPC:  9702 reserved req ffff8802230c1a00 xid 23f704d8
kernel: RPC:       wake_up_first(ffff8800b7b1b190 "xprt_sending")
kernel: RPC:  9702 call_reserveresult (status 0)
kernel: RPC:  9702 call_refresh (status 0)
kernel: RPC:  9702 refreshing RPCSEC_GSS cred ffff88021f305d00
kernel: RPC:  9702 call_refreshresult (status 0)
kernel: RPC:  9702 call_allocate (status 0)
kernel: RPC:  9702 allocated buffer of size 5316 at ffff8801b24b4000
kernel: RPC:  9702 call_bind (status 0)
kernel: RPC:  9702 call_connect xprt ffff8800b7b1b000 is connected
kernel: RPC:  9702 call_transmit (status 0)
kernel: RPC:  9702 xprt_prepare_transmit
kernel: RPC:  9702 rpc_xdr_encode (status 0)
kernel: RPC:  9702 marshaling RPCSEC_GSS cred ffff88021f305d00
kernel: RPC:  9702 gss_marshal
kernel: RPC:       gss_get_mic_v1
kernel: RPC:       krb5_make_rc4_seq_num:
kernel: krb5_rc4_setup_seq_key: entered
kernel: krb5_rc4_setup_seq_key: returning 0
kernel: RPC:       krb5_encrypt returns 0
kernel: RPC:  9702 using AUTH_GSS cred ffff88021f305d00 to wrap rpc data
kernel: RPC:  9702 gss_wrap_req
kernel: RPC:  9702 gss_wrap_req returning 0
kernel: RPC:  9702 xprt_transmit(32984)
kernel: RPC:       xs_tcp_send_request(32984) = 32984
kernel: RPC:  9702 xmit complete
kernel: RPC:  9702 sleep_on(queue "xprt_pending" time 5089081880)
kernel: RPC:  9702 added to queue ffff8800b7b1b258 "xprt_pending"
kernel: RPC:  9702 setting alarm for 60000 ms
kernel: RPC:       wake_up_first(ffff8800b7b1b190 "xprt_sending")
kernel: RPC:       xs_tcp_data_ready...
kernel: RPC:       xs_tcp_data_recv started
kernel: RPC:       reading TCP record fragment of length 92
kernel: RPC:       reading XID (4 bytes)
kernel: RPC:       reading reply for XID 22f704d8
kernel: RPC:       reading CALL/REPLY flag (4 bytes)
kernel: RPC:       read reply XID 22f704d8
kernel: RPC:       XID 22f704d8 read 84 bytes
kernel: RPC:       xprt = ffff8800b7b1b000, tcp_copied = 92, tcp_offset = 92, tcp_reclen = 92
kernel: RPC:  9701 xid 22f704d8 complete (92 bytes received)
...
---------------------------

I can complete provide logs for both NFS and RPC debug, but in the end they seem to be just repeating. Also I have a wireshark trace from this machine for a couple of seconds.

Some more info on the software versions in use:
---------------------------
# rpm -qa|grep -E '^(kernel-desktop|nfs|rpc)'
nfsidmap-0.25-6.1.2.x86_64
nfs4-acl-tools-0.3.3-18.1.2.x86_64
kernel-desktop-3.16.6-2.1.x86_64
rpcbind-0.2.1_rc4-8.4.1.x86_64
nfs-client-1.3.0-4.2.1.x86_64
---------------------------
Comment 1 Joschi Brauchle 2014-12-12 09:15:13 UTC
Note that all debug info is for the error case -10025. We currently have no logs for the -10023 case, but will provide them ASAP once we observe the problem again.
Comment 2 Jiri Slaby 2014-12-12 10:33:51 UTC
Neil, could you help with this one?
Comment 3 Neil Brown 2014-12-15 00:33:25 UTC
The
  nfs4_schedule_stateid_recovery: scheduling stateid recovery for server

message confirms that nfs4_schedule_state_manager is being called should ensure that the state management thread is running.
This is a process with a name IPADDRESS-manager - though only 16 character are used.  So e.g. "192.168.1.1-mana".
If this cannot be started and error message mentioning "kthread_run" will be printed.  That doesn't appear in the logs so we must assume it is running.

After printing the above message, the failed write request should be queued on the "cl_rpcwaitq" queue, also known as "NFS client".  We see this happening in the logs.

The request is then woken up either:
- immediately if NFS4CLNT_MANAGER_RUNNING is not set, or
- when the state manager thinks it has done everything that needs doing.


The state manager should at least try to re-open the file, but there is no evidence of that happening.

Is the state-manager thread running?  If you can find it in 'ps' output, is it always there with the same  PID, or does it keep changing PID?

When the enable or disable rpc debugging, it should print state status of any pending RPC requests.  Does it?  Can you provide that part of the logs?
Comment 4 Joschi Brauchle 2014-12-19 09:48:57 UTC
I'll try to get the requested information ASAP. Unfortunately there currently is no machine in our network exhibiting the problems and we'll have to wait for them to reappear. This is kinda unlikely during the christmas break as most machines are idle or off, thus it looks more like January for more info to come!
Comment 5 Joschi Brauchle 2015-01-06 14:33:12 UTC
Created attachment 618735 [details]
ps auxf output (stripped of user processes)

This is the 'ps auxf' output of a machine exhibiting the -10025 error message. I dont see any $IP-manager in the list.

The machine had about 35% IO-Wait load (4 cores) and thus had to be rebooted unfortunately...
Comment 6 Joschi Brauchle 2015-01-06 14:37:16 UTC
Created attachment 618736 [details]
RPC-debug output in journald

I was able to catch some of the RPC debug output of this machine from the terminal before reboot... it looks different that the previous logs I provided, does it?
Comment 7 Joschi Brauchle 2015-01-06 14:41:41 UTC
Created attachment 618737 [details]
NFS-debug output in journald
Comment 8 Joschi Brauchle 2015-01-06 14:46:56 UTC
In addition to the NFS/RPC debug logs, I found the following in the journal. It looks like the system went to standby and resumed! Then the problems started...

---------------------
Jan 06 12:31:52 tueilnt-st-stinner kernel: PM: Finishing wakeup.
Jan 06 12:31:52 tueilnt-st-stinner kernel: Restarting tasks ... done.
... 
...
Jan 06 12:31:51 tueilnt-st-stinner sssd_be[1779]: GSSAPI client step 1
Jan 06 12:31:51 tueilnt-st-stinner sssd_be[1779]: GSSAPI client step 1
Jan 06 12:31:52 tueilnt-st-stinner sssd_be[1779]: GSSAPI client step 1
Jan 06 12:31:52 tueilnt-st-stinner sssd_be[1779]: GSSAPI client step 2
Jan 06 12:31:52 tueilnt-st-stinner automount[15034]: failed to to open old config /etc/sysconfig/autofs
Jan 06 12:31:54 tueilnt-st-stinner kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: key description: cifs.spnego;3298478;3000000;39010000;ver=0x2;host=<NAS_HOSTNAME>;ip4
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: ver=2
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: host=<NAS_HOSTNAME>
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: ip=<NAS_IP>
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: sec=1
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: uid=3298478
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: creduid=3298478
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: user=<USERID>
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: pid=2139
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: find_krb5_cc: considering /tmp/krb5ccmachine_<REALM>
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: find_krb5_cc: /tmp/krb5ccmachine_<REALM> is owned by 0, not 3298478
Jan 06 12:31:55 tueilnt-st-stinner kernel: NFS: nfs4_reclaim_open_state: unhandled error -10026
Jan 06 12:31:55 tueilnt-st-stinner kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff880231d5e020!
Jan 06 12:31:55 tueilnt-st-stinner kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff880232f91420!
Jan 06 12:31:55 tueilnt-st-stinner cifs.upcall[15057]: krb5_get_init_creds_keytab: 13
Jan 06 12:31:55 tueilnt-st-stinner cifs.upcall[15057]: handle_krb5_mech: getting service ticket for nas.ads.mwn.de
Jan 06 12:31:55 tueilnt-st-stinner cifs.upcall[15057]: handle_krb5_mech: obtained service ticket
Jan 06 12:31:55 tueilnt-st-stinner cifs.upcall[15057]: Exit status 0
...
Jan 06 12:32:42 tueilnt-st-stinner kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff88021be5e020!
Jan 06 12:32:42 tueilnt-st-stinner kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff88022a813220!
...
---------------------

The machine is mounting CIFS and NFS with sec=KRB5 using an automounter.
Comment 9 Neil Brown 2015-02-20 01:45:48 UTC
(sorry for the delay).

It looks like the client is holding a delegation from the server, the server is rejecting use of the delegation (maybe because the client was suspended for a while and didn't renew the delegation) and the client isn't responding to the error properly.

There are a few upstream patches that could be relevant (Deal with NFS state management errors) and were marked for inclusion in -stable kernels, but didn't get into 3.16 because it was closed for new patches by the time they came along.

In particular, 

commit 4dfd4f7af0afd201706ad186352ca423b0f17d4b
Author: Trond Myklebust <trond.myklebust@primarydata.com>
Date:   Fri Oct 17 15:10:25 2014 +0300

    NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired


looks like it might be relevant.

I have added 5 'stable' patches to our 13.2 kernel including this one.

The new kernel should appear in
  http://download.opensuse.org/repositories/Kernel:/openSUSE-13.2/standard/x86_64/

in a day or so.  The current kernel version is 3.16.7-47.1.gf9c15b3.
When that changes, the NFS patches will be included.

Please try installing that kernel and report if it makes a difference.
Comment 10 Joschi Brauchle 2015-03-23 09:32:49 UTC
(In reply to Neil F Brown from comment #9)

> Please try installing that kernel and report if it makes a difference.

Today we had some NAS storage problems and one of the patched machine (with above kernel) was also affected with 10023 errors.

Unfortunatelty I was unable to acquire logs before the machine got rebooted. I will try next time.

But it looks like the problem is not fixed with the new patches...
Comment 11 Neil Brown 2015-04-02 01:11:38 UTC
I'm sorry to here that ...
Can you please provide logs showing the problem and the exact kernel version that you were running?
Comment 12 Swamp Workflow Management 2015-04-13 12:10:41 UTC
openSUSE-SU-2015:0713-1: An update that solves 13 vulnerabilities and has 52 fixes is now available.

Category: security (important)
Bug References: 867199,893428,895797,900811,901925,903589,903640,904899,905681,907039,907818,907988,908582,908588,908589,908592,908593,908594,908596,908598,908603,908604,908605,908606,908608,908610,908612,909077,909078,909477,909634,910150,910322,910440,911311,911325,911326,911356,911438,911578,911835,912061,912202,912429,912705,913059,913466,913695,914175,915425,915454,915456,915577,915858,916608,917830,917839,918954,918970,919463,920581,920604,921313,922542,922944
CVE References: CVE-2014-8134,CVE-2014-8160,CVE-2014-8559,CVE-2014-9419,CVE-2014-9420,CVE-2014-9428,CVE-2014-9529,CVE-2014-9584,CVE-2014-9585,CVE-2015-0777,CVE-2015-1421,CVE-2015-1593,CVE-2015-2150
Sources used:
openSUSE 13.2 (src):    bbswitch-0.8-3.6.6, cloop-2.639-14.6.6, crash-7.0.8-6.6, hdjmod-1.28-18.7.6, ipset-6.23-6.6, kernel-docs-3.16.7-13.2, kernel-obs-build-3.16.7-13.7, kernel-obs-qa-3.16.7-13.1, kernel-obs-qa-xen-3.16.7-13.1, kernel-source-3.16.7-13.1, kernel-syms-3.16.7-13.1, pcfclock-0.44-260.6.2, vhba-kmp-20140629-2.6.2, virtualbox-4.3.20-10.2, xen-4.4.1_08-12.2, xtables-addons-2.6-6.2
Comment 13 Joschi Brauchle 2015-04-22 12:42:29 UTC
Created attachment 631961 [details]
RPC/NFS debug log

(In reply to Neil F Brown from comment #11)
> I'm sorry to here that ...
> Can you please provide logs showing the problem and the exact kernel version
> that you were running?

I have a machine showing the problem now:
Linux hostname 3.16.7-58.g23df5ff-desktop #1 SMP PREEMPT Sat Mar 21 15:44:56 UTC 2015 (23df5ff) x86_64 x86_64 x86_64 GNU/Linux

RPC/NFS full debug log, see attachment.
Comment 14 Joschi Brauchle 2015-04-22 12:48:06 UTC
Created attachment 631962 [details]
RPC/NFS debug log
Comment 15 Neil Brown 2015-04-23 01:42:35 UTC
Thanks.

What I see happening here is that a WRITE or LOCK request gets NFS4ERR_STALE_STATEID, and then the client sends a "RENEW" request.

If the stateid is really stale, that should fail, but it doesn't.
So the client assumes everything is OK and keeps sending requests .... which fail with NFS4ERR_STALE_STATEID.

STALE_STATEID means:

      A stateid generated by an earlier server instance was used.

which means that the server must have restarted since that stateid was issued.

So there seem to be two possible explanations:

1/ The server is lying.  The stateid isn't really 'stale'.  Maybe it is 'bad'.
2/ The client has previously handled a STALE_STATEID error and created a
   new "client" handle, but hasn't refreshed all of the stateids that it was
   holding.  So it is trying to use old stateid even though it has established
   a new client.

The client code looks OK, though it is sufficiently complex that I cannot be certain.  There are no patches in more recent versions.

My only suggestions are:

1/ contact Netapp if you have support from them and see if they know of a
  problem that fits this description, maybe fixed in an update???

2/ report the problem the Linux NFS developers at linux-nfs@vger.kernel.org
  feel free to quote anything I have said.
  They will probably want you to reproduce the problem on a more recent
  kernel.
  https://en.opensuse.org/openSUSE:Kernel_of_the_day
  provides instructions on doing that.

Sorry I cannot be more helpful.
Comment 16 Joschi Brauchle 2015-04-23 07:48:58 UTC
(In reply to Neil F Brown from comment #15)
> STALE_STATEID means:
> 
>       A stateid generated by an earlier server instance was used.
> 
> which means that the server must have restarted since that stateid was
> issued.

Hello Neil,
thanks for all your help so far.

What we can see is that these problems with our client machines start happening soon/right after the Netapp Server is unavailable for a short period of time. We have logs showing that correlation.

I am not sure if 'unavailable' means 'rebooted', but it looks very much like this unavailability is causing the client side issue.

I will talk to the server operators and see what really happened there.
Comment 17 Joschi Brauchle 2015-04-23 10:06:31 UTC
Created attachment 632104 [details]
Wireshark trace of NFS/RPC calls

For reference in email to NFS mailing list.
Comment 18 Joschi Brauchle 2015-06-17 14:28:10 UTC
I was able to track the start of the problem from a (virtual machine) NFS client. In order to get the problem started, I opened multiple browsers (firefox/google chrome), because we have the impression, these programs are always involved.

I then dropped the connection between the NFS client and server in the firewall and blocked any outgoing connections from the NFS Client for some time (1-2 minutes).

After I allowed connections to go through the firewall again, the problem starts to appear on the client.

I will upload a (rather large, sorry!) TCP dump with some initial observations next.
Comment 19 Joschi Brauchle 2015-06-17 14:34:45 UTC
Created attachment 638234 [details]
TCP Dump at start of BAD_STATEID error

Please find attached the tcp dump capturing the start of the problem.
Here are some observations in order to help read the file:

Packets 1-4350: Normal NFS traffic
-------
At Packet ~4350: Network Connection Drop
-------
Packets 4350-6966: Some NFS traffic, incl mostly GETATTR. Server replies with NFS4ERR_ADMIN_REVOKE often
-------
Packets 6966-14928: Looong sequence of only NFS4ERR_ADMIN_REVOKE from server!
-------
Packets 14928-15016: Very few WRITES from the client, then:
-------
Packets 15016-END: Mostly only NFS4ERR_BAD_STATE and NFS4ERR_ADMIN_REVOKE from now on from the server.
Comment 20 Neil Brown 2016-02-18 05:49:51 UTC
Hmmm... looks like I really dropped the ball here.  Sorry.

Here have been a number of NFS state management bugs fixed in the last several month, so maybe this has been fixed.
Can you try a recent kernel and see if the problem is still there (or maybe you have already)
Again: sorry for dropping this.
Comment 21 Neil Brown 2016-05-20 04:09:03 UTC
Sorry we didn't get anywhere with this.
I'm going to close the bug now on the assumption that it is fixed with later kernels.  If you are still experiencing problems, please re-open.