Bugzilla – Bug 909634
NFS: 100% iowait CPU load with error -10023 (STALE_STATE_ID) or -10025 (BAD_STATE_ID)
Last modified: 2016-05-20 04:09:03 UTC
On openSUSE 13.2 with kernel 3.16.6 we are seeing the following problem on multiple machines connected to a 'NetApp FAS 3140 with ONTAP 8.1.4P1 7-mode' NFSv3/4 server. We are mounting user homes and other shares over NFSv4 with 'sec=krb5', and the user logged into the machine has a valid KRB5 ticket. Something without a PID (i.e. the kernel) uses 100% of one core; 'top' shows '71.7 id, 25.6 wa' on a 4-core machine.

Cranking up the debug level on RPC and NFS yields the following, constantly repeating messages (for the -10025 error case):

---------------------------
``rpcdebug -m nfs all``
---------------------------
...
kernel: NFS: nfs_pgio_result: 9702, (status -10025)
kernel: nfs4_free_slot: slotid 0 highest_used_slotid 1
kernel: nfs4_schedule_stateid_recovery: scheduling stateid recovery for server <OUR_NFS_SERVER>
kernel: --> nfs4_alloc_slot used_slots=0002 highest_used=1 max_slots=1024
kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=0
kernel: --> nfs_put_client({4})
kernel: NFS: nfs_pgio_result: 9701, (status -10025)
kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
kernel: nfs4_schedule_stateid_recovery: scheduling stateid recovery for server <OUR_NFS_SERVER>
kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=1024
kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
kernel: --> nfs_put_client({4})
kernel: NFS: nfs_pgio_result: 9702, (status -10025)
kernel: nfs4_free_slot: slotid 0 highest_used_slotid 1
kernel: nfs4_schedule_stateid_recovery: scheduling stateid recovery for server <OUR_NFS_SERVER>
kernel: --> nfs4_alloc_slot used_slots=0002 highest_used=1 max_slots=1024
kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=0
kernel: --> nfs_put_client({4})
kernel: NFS: nfs_pgio_result: 9701, (status -10025)
kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
kernel: nfs4_schedule_stateid_recovery: scheduling stateid recovery for server <OUR_NFS_SERVER>
kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=1024
kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
kernel: --> nfs_put_client({4})
...
---------------------------
and ``rpcdebug -m rpc all``
---------------------------
...
kernel: RPC: 9702 gss_unwrap_resp returning -10025
kernel: RPC: 9702 call_decode result -10025
kernel: RPC: wake_up_first(ffff8802348e1a98 "NFSv4.0 transport Slot table")
kernel: RPC: 9702 sleep_on(queue "NFS client" time 5089081878)
kernel: RPC: 9702 added to queue ffff880233645dc0 "NFS client"
kernel: RPC: 9702 __rpc_wake_up_task (now 5089081878)
kernel: RPC: 9702 removed from queue ffff880233645dc0 "NFS client"
kernel: RPC: __rpc_wake_up_task done
kernel: RPC: freeing buffer of size 5316 at ffff8801b24b4000
kernel: RPC: 9702 release request ffff8802230c1a00
kernel: RPC: wake_up_first(ffff8800b7b1b320 "xprt_backlog")
kernel: RPC: 9702 call_start nfs4 proc WRITE (async)
kernel: RPC: 9702 call_reserve (status 0)
kernel: RPC: 9702 reserved req ffff8802230c1a00 xid 21f704d8
kernel: RPC: wake_up_first(ffff8800b7b1b190 "xprt_sending")
kernel: RPC: 9702 call_reserveresult (status 0)
kernel: RPC: 9702 call_refresh (status 0)
kernel: RPC: 9702 refreshing RPCSEC_GSS cred ffff88021f305d00
kernel: RPC: 9702 call_refreshresult (status 0)
kernel: RPC: 9702 call_allocate (status 0)
kernel: RPC: 9702 allocated buffer of size 5316 at ffff8801b24b4000
kernel: RPC: 9702 call_bind (status 0)
kernel: RPC: 9702 call_connect xprt ffff8800b7b1b000 is connected
kernel: RPC: 9702 call_transmit (status 0)
kernel: RPC: 9702 xprt_prepare_transmit
kernel: RPC: 9702 rpc_xdr_encode (status 0)
kernel: RPC: 9702 marshaling RPCSEC_GSS cred ffff88021f305d00
kernel: RPC: 9702 gss_marshal
kernel: RPC: gss_get_mic_v1
kernel: RPC: krb5_make_rc4_seq_num:
kernel: krb5_rc4_setup_seq_key: entered
kernel: krb5_rc4_setup_seq_key: returning 0
kernel: RPC: krb5_encrypt returns 0
kernel: RPC: 9702 using AUTH_GSS cred ffff88021f305d00 to wrap rpc data
kernel: RPC: 9702 gss_wrap_req
kernel: RPC: 9702 gss_wrap_req returning 0
kernel: RPC: 9702 xprt_transmit(32984)
kernel: RPC: xs_tcp_send_request(32984) = 32984
kernel: RPC: 9702 xmit complete
kernel: RPC: 9702 sleep_on(queue "xprt_pending" time 5089081878)
kernel: RPC: 9702 added to queue ffff8800b7b1b258 "xprt_pending"
kernel: RPC: 9702 setting alarm for 60000 ms
kernel: RPC: wake_up_first(ffff8800b7b1b190 "xprt_sending")
kernel: RPC: xs_tcp_data_ready...
kernel: RPC: xs_tcp_data_recv started
kernel: RPC: reading TCP record fragment of length 92
kernel: RPC: reading XID (4 bytes)
kernel: RPC: reading reply for XID 20f704d8
kernel: RPC: reading CALL/REPLY flag (4 bytes)
kernel: RPC: read reply XID 20f704d8
kernel: RPC: XID 20f704d8 read 84 bytes
kernel: RPC: xprt = ffff8800b7b1b000, tcp_copied = 92, tcp_offset = 92, tcp_reclen = 92
kernel: RPC: 9701 xid 20f704d8 complete (92 bytes received)
kernel: RPC: 9701 __rpc_wake_up_task (now 5089081879)
kernel: RPC: 9701 disabling timer
kernel: RPC: 9701 removed from queue ffff8800b7b1b258 "xprt_pending"
kernel: RPC: __rpc_wake_up_task done
kernel: RPC: xs_tcp_data_recv done
kernel: RPC: 9701 __rpc_execute flags=0xa01
kernel: RPC: 9701 call_status (status 92)
kernel: RPC: 9701 call_decode (status 92)
kernel: RPC: 9701 validating RPCSEC_GSS cred ffff88021f305d00
kernel: RPC: 9701 gss_validate
kernel: RPC: krb5_read_token
kernel: RPC: krb5_get_seq_num:
kernel: RPC: krb5_get_rc4_seq_num:
kernel: krb5_rc4_setup_seq_key: entered
kernel: krb5_rc4_setup_seq_key: returning 0
kernel: RPC: gss_k5decrypt returns 0
kernel: RPC: 9701 gss_validate: gss_verify_mic succeeded.
kernel: RPC: 9701 using AUTH_GSS cred ffff88021f305d00 to unwrap rpc data
kernel: RPC: 9701 gss_unwrap_resp returning -10025
kernel: RPC: 9701 call_decode result -10025
kernel: RPC: wake_up_first(ffff8802348e1a98 "NFSv4.0 transport Slot table")
kernel: RPC: 9701 sleep_on(queue "NFS client" time 5089081879)
kernel: RPC: 9701 added to queue ffff880233645dc0 "NFS client"
kernel: RPC: freeing buffer of size 5316 at ffff880233d04000
kernel: RPC: 9701 release request ffff8802230c1c00
kernel: RPC: 9701 __rpc_wake_up_task (now 5089081879)
kernel: RPC: 9701 removed from queue ffff880233645dc0 "NFS client"
kernel: RPC: __rpc_wake_up_task done
kernel: RPC: wake_up_first(ffff8800b7b1b320 "xprt_backlog")
kernel: RPC: 9701 call_start nfs4 proc WRITE (async)
kernel: RPC: 9701 call_reserve (status 0)
kernel: RPC: 9701 reserved req ffff8802230c1c00 xid 22f704d8
kernel: RPC: wake_up_first(ffff8800b7b1b190 "xprt_sending")
kernel: RPC: 9701 call_reserveresult (status 0)
kernel: RPC: 9701 call_refresh (status 0)
kernel: RPC: 9701 refreshing RPCSEC_GSS cred ffff88021f305d00
kernel: RPC: 9701 call_refreshresult (status 0)
kernel: RPC: 9701 call_allocate (status 0)
kernel: RPC: 9701 allocated buffer of size 5316 at ffff880233d04000
kernel: RPC: 9701 call_bind (status 0)
kernel: RPC: 9701 call_connect xprt ffff8800b7b1b000 is connected
kernel: RPC: 9701 call_transmit (status 0)
kernel: RPC: 9701 xprt_prepare_transmit
kernel: RPC: 9701 rpc_xdr_encode (status 0)
kernel: RPC: 9701 marshaling RPCSEC_GSS cred ffff88021f305d00
kernel: RPC: 9701 gss_marshal
kernel: RPC: gss_get_mic_v1
kernel: RPC: krb5_make_rc4_seq_num:
kernel: krb5_rc4_setup_seq_key: entered
kernel: krb5_rc4_setup_seq_key: returning 0
kernel: RPC: krb5_encrypt returns 0
kernel: RPC: 9701 using AUTH_GSS cred ffff88021f305d00 to wrap rpc data
kernel: RPC: 9701 gss_wrap_req
kernel: RPC: 9701 gss_wrap_req returning 0
kernel: RPC: 9701 xprt_transmit(32984)
kernel: RPC: xs_tcp_send_request(32984) = 32984
kernel: RPC: 9701 xmit complete
kernel: RPC: 9701 sleep_on(queue "xprt_pending" time 5089081879)
kernel: RPC: 9701 added to queue ffff8800b7b1b258 "xprt_pending"
kernel: RPC: 9701 setting alarm for 60000 ms
kernel: RPC: wake_up_first(ffff8800b7b1b190 "xprt_sending")
kernel: RPC: xs_tcp_data_ready...
kernel: RPC: xs_tcp_data_recv started
kernel: RPC: reading TCP record fragment of length 92
kernel: RPC: reading XID (4 bytes)
kernel: RPC: reading reply for XID 21f704d8
kernel: RPC: reading CALL/REPLY flag (4 bytes)
kernel: RPC: read reply XID 21f704d8
kernel: RPC: XID 21f704d8 read 84 bytes
kernel: RPC: xprt = ffff8800b7b1b000, tcp_copied = 92, tcp_offset = 92, tcp_reclen = 92
kernel: RPC: 9702 xid 21f704d8 complete (92 bytes received)
kernel: RPC: 9702 __rpc_wake_up_task (now 5089081880)
kernel: RPC: 9702 disabling timer
kernel: RPC: 9702 removed from queue ffff8800b7b1b258 "xprt_pending"
kernel: RPC: __rpc_wake_up_task done
kernel: RPC: xs_tcp_data_recv done
kernel: RPC: 9702 __rpc_execute flags=0xa01
kernel: RPC: 9702 call_status (status 92)
kernel: RPC: 9702 call_decode (status 92)
kernel: RPC: 9702 validating RPCSEC_GSS cred ffff88021f305d00
kernel: RPC: 9702 gss_validate
kernel: RPC: krb5_read_token
kernel: RPC: krb5_get_seq_num:
kernel: RPC: krb5_get_rc4_seq_num:
kernel: krb5_rc4_setup_seq_key: entered
kernel: krb5_rc4_setup_seq_key: returning 0
kernel: RPC: gss_k5decrypt returns 0
kernel: RPC: 9702 gss_validate: gss_verify_mic succeeded.
kernel: RPC: 9702 using AUTH_GSS cred ffff88021f305d00 to unwrap rpc data
kernel: RPC: 9702 gss_unwrap_resp returning -10025
kernel: RPC: 9702 call_decode result -10025
kernel: RPC: wake_up_first(ffff8802348e1a98 "NFSv4.0 transport Slot table")
kernel: RPC: 9702 sleep_on(queue "NFS client" time 5089081880)
kernel: RPC: 9702 added to queue ffff880233645dc0 "NFS client"
kernel: RPC: freeing buffer of size 5316 at ffff8801b24b4000
kernel: RPC: 9702 __rpc_wake_up_task (now 5089081880)
kernel: RPC: 9702 removed from queue ffff880233645dc0 "NFS client"
kernel: RPC: __rpc_wake_up_task done
kernel: RPC: 9702 release request ffff8802230c1a00
kernel: RPC: wake_up_first(ffff8800b7b1b320 "xprt_backlog")
kernel: RPC: 9702 call_start nfs4 proc WRITE (async)
kernel: RPC: 9702 call_reserve (status 0)
kernel: RPC: 9702 reserved req ffff8802230c1a00 xid 23f704d8
kernel: RPC: wake_up_first(ffff8800b7b1b190 "xprt_sending")
kernel: RPC: 9702 call_reserveresult (status 0)
kernel: RPC: 9702 call_refresh (status 0)
kernel: RPC: 9702 refreshing RPCSEC_GSS cred ffff88021f305d00
kernel: RPC: 9702 call_refreshresult (status 0)
kernel: RPC: 9702 call_allocate (status 0)
kernel: RPC: 9702 allocated buffer of size 5316 at ffff8801b24b4000
kernel: RPC: 9702 call_bind (status 0)
kernel: RPC: 9702 call_connect xprt ffff8800b7b1b000 is connected
kernel: RPC: 9702 call_transmit (status 0)
kernel: RPC: 9702 xprt_prepare_transmit
kernel: RPC: 9702 rpc_xdr_encode (status 0)
kernel: RPC: 9702 marshaling RPCSEC_GSS cred ffff88021f305d00
kernel: RPC: 9702 gss_marshal
kernel: RPC: gss_get_mic_v1
kernel: RPC: krb5_make_rc4_seq_num:
kernel: krb5_rc4_setup_seq_key: entered
kernel: krb5_rc4_setup_seq_key: returning 0
kernel: RPC: krb5_encrypt returns 0
kernel: RPC: 9702 using AUTH_GSS cred ffff88021f305d00 to wrap rpc data
kernel: RPC: 9702 gss_wrap_req
kernel: RPC: 9702 gss_wrap_req returning 0
kernel: RPC: 9702 xprt_transmit(32984)
kernel: RPC: xs_tcp_send_request(32984) = 32984
kernel: RPC: 9702 xmit complete
kernel: RPC: 9702 sleep_on(queue "xprt_pending" time 5089081880)
kernel: RPC: 9702 added to queue ffff8800b7b1b258 "xprt_pending"
kernel: RPC: 9702 setting alarm for 60000 ms
kernel: RPC: wake_up_first(ffff8800b7b1b190 "xprt_sending")
kernel: RPC: xs_tcp_data_ready...
kernel: RPC: xs_tcp_data_recv started
kernel: RPC: reading TCP record fragment of length 92
kernel: RPC: reading XID (4 bytes)
kernel: RPC: reading reply for XID 22f704d8
kernel: RPC: reading CALL/REPLY flag (4 bytes)
kernel: RPC: read reply XID 22f704d8
kernel: RPC: XID 22f704d8 read 84 bytes
kernel: RPC: xprt = ffff8800b7b1b000, tcp_copied = 92, tcp_offset = 92, tcp_reclen = 92
kernel: RPC: 9701 xid 22f704d8 complete (92 bytes received)
...
---------------------------

I can provide complete logs for both NFS and RPC debug, but in the end they just seem to repeat. I also have a Wireshark trace from this machine covering a couple of seconds.

Some more info on the software versions in use:

---------------------------
# rpm -qa|grep -E '^(kernel-desktop|nfs|rpc)'
nfsidmap-0.25-6.1.2.x86_64
nfs4-acl-tools-0.3.3-18.1.2.x86_64
kernel-desktop-3.16.6-2.1.x86_64
rpcbind-0.2.1_rc4-8.4.1.x86_64
nfs-client-1.3.0-4.2.1.x86_64
---------------------------
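For reference, this is roughly how the debug output above can be enabled and captured (a minimal sketch; it assumes kernel messages go to systemd's journal, and the log file name is just a placeholder):

---------------------------
# Enable verbose NFS and RPC client debugging in the kernel.
rpcdebug -m nfs -s all
rpcdebug -m rpc -s all

# In a second terminal, follow the kernel messages while the problem occurs.
journalctl -k -f > /tmp/nfs-rpc-debug.log

# Switch the (very noisy) debugging off again afterwards.
rpcdebug -m nfs -c all
rpcdebug -m rpc -c all
---------------------------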
Note that all of the debug info above is for the -10025 error case. We currently have no logs for the -10023 case, but will provide them ASAP once we observe the problem again.
Neil, could you help with this one?
The "nfs4_schedule_stateid_recovery: scheduling stateid recovery for server" message confirms that nfs4_schedule_state_manager is being called, which should ensure that the state management thread is running. This is a process named IPADDRESS-manager, though only 16 characters are used, so e.g. "192.168.1.1-mana". If it cannot be started, an error message mentioning "kthread_run" will be printed. That doesn't appear in the logs, so we must assume it is running.

After printing the above message, the failed write request should be queued on the "cl_rpcwaitq" queue, also known as "NFS client". We see this happening in the logs. The request is then woken up either:
- immediately, if NFS4CLNT_MANAGER_RUNNING is not set, or
- when the state manager thinks it has done everything that needs doing.

The state manager should at least try to re-open the file, but there is no evidence of that happening.

Is the state-manager thread running (see the sketch below for one way to check)? If you can find it in 'ps' output, is it always there with the same PID, or does it keep changing PID?

When you enable or disable RPC debugging, it should print the state of any pending RPC requests. Does it? Can you provide that part of the logs?
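One way to look for the thread (just a sketch; the grep pattern assumes the thread name contains "-man", and the real name depends on your server's IP address):

---------------------------
# Look for the NFSv4 state-manager kthread; its name is the server IP plus
# "-manager", truncated to 16 characters, e.g. "192.168.1.1-mana".
ps -e -o pid,comm | grep -- '-man'

# Run it repeatedly to see whether the PID stays the same or keeps changing.
watch -n 1 "ps -e -o pid,comm | grep -- -man"
---------------------------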
I'll try to get the requested information ASAP. Unfortunately there is currently no machine in our network exhibiting the problems, and we'll have to wait for them to reappear. This is kinda unlikely during the Christmas break, as most machines are idle or off, so it looks more like January before more info comes in!
Created attachment 618735 [details]
ps auxf output (stripped of user processes)

This is the 'ps auxf' output of a machine exhibiting the -10025 error message. I don't see any $IP-manager in the list. The machine had about 35% IO-wait load (4 cores) and thus unfortunately had to be rebooted...
Created attachment 618736 [details]
RPC-debug output in journald

I was able to catch some of the RPC debug output of this machine from the terminal before the reboot... it looks different from the previous logs I provided, doesn't it?
Created attachment 618737 [details] NFS-debug output in journald
In addition to the NFS/RPC debug logs, I found the following in the journal. It looks like the system went to standby and resumed! Then the problems started...

---------------------
Jan 06 12:31:52 tueilnt-st-stinner kernel: PM: Finishing wakeup.
Jan 06 12:31:52 tueilnt-st-stinner kernel: Restarting tasks ... done.
...
...
Jan 06 12:31:51 tueilnt-st-stinner sssd_be[1779]: GSSAPI client step 1
Jan 06 12:31:51 tueilnt-st-stinner sssd_be[1779]: GSSAPI client step 1
Jan 06 12:31:52 tueilnt-st-stinner sssd_be[1779]: GSSAPI client step 1
Jan 06 12:31:52 tueilnt-st-stinner sssd_be[1779]: GSSAPI client step 2
Jan 06 12:31:52 tueilnt-st-stinner automount[15034]: failed to to open old config /etc/sysconfig/autofs
Jan 06 12:31:54 tueilnt-st-stinner kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: key description: cifs.spnego;3298478;3000000;39010000;ver=0x2;host=<NAS_HOSTNAME>;ip4
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: ver=2
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: host=<NAS_HOSTNAME>
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: ip=<NAS_IP>
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: sec=1
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: uid=3298478
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: creduid=3298478
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: user=<USERID>
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: pid=2139
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: find_krb5_cc: considering /tmp/krb5ccmachine_<REALM>
Jan 06 12:31:54 tueilnt-st-stinner cifs.upcall[15057]: find_krb5_cc: /tmp/krb5ccmachine_<REALM> is owned by 0, not 3298478
Jan 06 12:31:55 tueilnt-st-stinner kernel: NFS: nfs4_reclaim_open_state: unhandled error -10026
Jan 06 12:31:55 tueilnt-st-stinner kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff880231d5e020!
Jan 06 12:31:55 tueilnt-st-stinner kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff880232f91420!
Jan 06 12:31:55 tueilnt-st-stinner cifs.upcall[15057]: krb5_get_init_creds_keytab: 13
Jan 06 12:31:55 tueilnt-st-stinner cifs.upcall[15057]: handle_krb5_mech: getting service ticket for nas.ads.mwn.de
Jan 06 12:31:55 tueilnt-st-stinner cifs.upcall[15057]: handle_krb5_mech: obtained service ticket
Jan 06 12:31:55 tueilnt-st-stinner cifs.upcall[15057]: Exit status 0
...
Jan 06 12:32:42 tueilnt-st-stinner kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff88021be5e020!
Jan 06 12:32:42 tueilnt-st-stinner kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff88022a813220!
...
---------------------

The machine is mounting CIFS and NFS with sec=KRB5 using an automounter.
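A quick way to check the journal for that suspend/resume vs. NFS error correlation (a sketch; the grep patterns are simply taken from the messages above):

---------------------
# Show PM wakeup and NFS state messages from the kernel log together,
# so the ordering around the resume is visible.
journalctl -k | grep -E 'PM: |NFS: '
---------------------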
(Sorry for the delay.)

It looks like the client is holding a delegation from the server, the server is rejecting use of the delegation (maybe because the client was suspended for a while and didn't renew the delegation), and the client isn't responding to the error properly.

There are a few upstream patches that could be relevant ("Deal with NFS state management errors") and were marked for inclusion in -stable kernels, but didn't get into 3.16 because it was closed for new patches by the time they came along. In particular,

commit 4dfd4f7af0afd201706ad186352ca423b0f17d4b
Author: Trond Myklebust <trond.myklebust@primarydata.com>
Date:   Fri Oct 17 15:10:25 2014 +0300

    NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired

looks like it might be relevant.

I have added 5 'stable' patches to our 13.2 kernel, including this one. The new kernel should appear in
http://download.opensuse.org/repositories/Kernel:/openSUSE-13.2/standard/x86_64/
in a day or so. The current kernel version is 3.16.7-47.1.gf9c15b3. When that changes, the NFS patches will be included.

Please try installing that kernel and report if it makes a difference.
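In case it helps, a rough sketch of pulling in the updated kernel from that repository once the version has changed (the repo alias "kernel-13.2" is arbitrary and the exact repository path may differ slightly):

---------------------------
# Add the Kernel:openSUSE-13.2 repository and install the updated kernel.
zypper ar -f http://download.opensuse.org/repositories/Kernel:/openSUSE-13.2/standard/ kernel-13.2
zypper refresh
zypper install -r kernel-13.2 kernel-desktop

# After a reboot, confirm which kernel is actually running.
uname -r
---------------------------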
(In reply to Neil F Brown from comment #9)
> Please try installing that kernel and report if it makes a difference.

Today we had some NAS storage problems, and one of the patched machines (with the above kernel) was also affected, with -10023 errors. Unfortunately I was unable to acquire logs before the machine got rebooted. I will try next time. But it looks like the problem is not fixed by the new patches...
I'm sorry to hear that ... Can you please provide logs showing the problem and the exact kernel version that you were running?
openSUSE-SU-2015:0713-1: An update that solves 13 vulnerabilities and has 52 fixes is now available.

Category: security (important)
Bug References: 867199,893428,895797,900811,901925,903589,903640,904899,905681,907039,907818,907988,908582,908588,908589,908592,908593,908594,908596,908598,908603,908604,908605,908606,908608,908610,908612,909077,909078,909477,909634,910150,910322,910440,911311,911325,911326,911356,911438,911578,911835,912061,912202,912429,912705,913059,913466,913695,914175,915425,915454,915456,915577,915858,916608,917830,917839,918954,918970,919463,920581,920604,921313,922542,922944
CVE References: CVE-2014-8134,CVE-2014-8160,CVE-2014-8559,CVE-2014-9419,CVE-2014-9420,CVE-2014-9428,CVE-2014-9529,CVE-2014-9584,CVE-2014-9585,CVE-2015-0777,CVE-2015-1421,CVE-2015-1593,CVE-2015-2150
Sources used:
openSUSE 13.2 (src): bbswitch-0.8-3.6.6, cloop-2.639-14.6.6, crash-7.0.8-6.6, hdjmod-1.28-18.7.6, ipset-6.23-6.6, kernel-docs-3.16.7-13.2, kernel-obs-build-3.16.7-13.7, kernel-obs-qa-3.16.7-13.1, kernel-obs-qa-xen-3.16.7-13.1, kernel-source-3.16.7-13.1, kernel-syms-3.16.7-13.1, pcfclock-0.44-260.6.2, vhba-kmp-20140629-2.6.2, virtualbox-4.3.20-10.2, xen-4.4.1_08-12.2, xtables-addons-2.6-6.2
Created attachment 631961 [details]
RPC/NFS debug log

(In reply to Neil F Brown from comment #11)
> I'm sorry to hear that ...
> Can you please provide logs showing the problem and the exact kernel version
> that you were running?

I have a machine showing the problem now:

Linux hostname 3.16.7-58.g23df5ff-desktop #1 SMP PREEMPT Sat Mar 21 15:44:56 UTC 2015 (23df5ff) x86_64 x86_64 x86_64 GNU/Linux

For the full RPC/NFS debug log, see the attachment.
Created attachment 631962 [details] RPC/NFS debug log
Thanks. What I see happening here is that a WRITE or LOCK request gets NFS4ERR_STALE_STATEID, and then the client sends a RENEW request. If the stateid is really stale, that should fail, but it doesn't. So the client assumes everything is OK and keeps sending requests ... which fail with NFS4ERR_STALE_STATEID.

STALE_STATEID means:

    A stateid generated by an earlier server instance was used.

which means that the server must have restarted since that stateid was issued.

So there seem to be two possible explanations:

1/ The server is lying. The stateid isn't really 'stale'. Maybe it is 'bad'.

2/ The client has previously handled a STALE_STATEID error and created a new "client" handle, but hasn't refreshed all of the stateids that it was holding. So it is trying to use an old stateid even though it has established a new client.

The client code looks OK, though it is sufficiently complex that I cannot be certain. There are no relevant patches in more recent versions.

My only suggestions are:

1/ Contact NetApp, if you have support from them, and see if they know of a problem that fits this description, maybe fixed in an update???

2/ Report the problem to the Linux NFS developers at linux-nfs@vger.kernel.org - feel free to quote anything I have said. They will probably want you to reproduce the problem on a more recent kernel. https://en.opensuse.org/openSUSE:Kernel_of_the_day provides instructions on doing that (see the sketch below).

Sorry I cannot be more helpful.
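Roughly, trying a "Kernel of the day" build looks like this (a sketch only; the repository path is an assumption based on the wiki page above and may differ, and the alias is arbitrary):

---------------------------
# Add the Kernel:HEAD ("Kernel of the day") repository and install its kernel.
zypper ar -f http://download.opensuse.org/repositories/Kernel:/HEAD/standard/ kernel-of-the-day
zypper install -r kernel-of-the-day kernel-default

# Boot into the new kernel and try to reproduce the problem there.
---------------------------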
(In reply to Neil F Brown from comment #15)
> STALE_STATEID means:
>
>     A stateid generated by an earlier server instance was used.
>
> which means that the server must have restarted since that stateid was
> issued.

Hello Neil,

thanks for all your help so far. What we can see is that these problems on our client machines start happening soon after, or right after, the NetApp server is unavailable for a short period of time. We have logs showing that correlation. I am not sure whether 'unavailable' means 'rebooted', but it looks very much like this unavailability is causing the client-side issue. I will talk to the server operators and see what really happened there.
Created attachment 632104 [details]
Wireshark trace of NFS/RPC calls

For reference in the email to the NFS mailing list.
I was able to track the start of the problem from a (virtual machine) NFS client. In order to get the problem started, I opened multiple browsers (Firefox/Google Chrome), because we have the impression that these programs are always involved. I then dropped the connection between the NFS client and the server in the firewall and blocked any outgoing connections from the NFS client for some time (1-2 minutes). After I allowed connections to go through the firewall again, the problem started to appear on the client. I will upload a (rather large, sorry!) TCP dump with some initial observations next.
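For what it's worth, the outage can be simulated locally in much the same way with iptables (just a sketch of the idea; the server address is a placeholder, and our actual block was done on the firewall):

---------------------------
NFS_SERVER=192.0.2.10                          # placeholder NFS server address

iptables -I OUTPUT -d "$NFS_SERVER" -j DROP    # drop all traffic to the server
sleep 120                                      # keep the "outage" up for ~2 minutes
iptables -D OUTPUT -d "$NFS_SERVER" -j DROP    # remove the rule again
---------------------------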
Created attachment 638234 [details]
TCP Dump at start of BAD_STATEID error

Please find attached the TCP dump capturing the start of the problem. Here are some observations to help read the file:

Packets 1-4350: Normal NFS traffic
-------
At packet ~4350: network connection drop
-------
Packets 4350-6966: Some NFS traffic, mostly GETATTR. The server often replies with NFS4ERR_ADMIN_REVOKE.
-------
Packets 6966-14928: Long sequence of only NFS4ERR_ADMIN_REVOKE from the server!
-------
Packets 14928-15016: Very few WRITEs from the client, then:
-------
Packets 15016-END: From now on, mostly only NFS4ERR_BAD_STATE and NFS4ERR_ADMIN_REVOKE from the server.
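To make the phases above easier to locate, the capture can be filtered for the NFSv4 error codes (a sketch; the file name is a placeholder, the field name comes from Wireshark's NFS dissector, and 10023/10025/10047 correspond to STALE_STATEID, BAD_STATEID and ADMIN_REVOKED in the NFSv4 spec):

---------------------------
# List frames whose NFSv4 status is STALE_STATEID, BAD_STATEID or ADMIN_REVOKED.
tshark -r bad_stateid_start.pcap \
       -Y 'nfs.nfsstat4 == 10023 || nfs.nfsstat4 == 10025 || nfs.nfsstat4 == 10047' \
       -T fields -e frame.number -e nfs.nfsstat4
---------------------------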
Hmmm... looks like I really dropped the ball here. Sorry. There have been a number of NFS state management bugs fixed in the last several months, so maybe this has been fixed. Can you try a recent kernel and see if the problem is still there (or maybe you already have)? Again: sorry for dropping this.
Sorry we didn't get anywhere with this. I'm going to close the bug now on the assumption that it is fixed with later kernels. If you are still experiencing problems, please re-open.