CentOS – What is causing `input/output` errors when reading from NFS v4 on CentOS

We're seeing apps like nginx and php-fpm error out occasionally (and temporarily) while opening good files from a connected NFS mount:

php-fpm error example:

2017/05/20 22:53:09 [error] 55#0: *6575 FastCGI sent in stderr: "PHP message: PHP Warning:  getimagesize(/www/newspaperfoundation.org/html/wp-content/blogs.dir/22/files/2017/05/19-highest-honors-1.jpg): failed to open stream: Input/output error in /www/newspaperfoundation.org/html/wp-content/plugins/mashsharer/includes/header-meta-tags.php on line 271" while reading response header from upstream, client: 192.168.255.34, server: www.dailyrepublic.com, request: "GET /solano-news/fairfield/highest-honors-commends-students-with-4-0-and-higher-grade-point-average/ HTTP/1.1", upstream: "fastcgi://172.17.0.3:9001", host: "www.dailyrepublic.com"

nginx error example:

2017/05/20 23:22:32 [crit] 56#0: *712 open() "/www/newspaperfoundation.org/html/wp-content/blogs.dir/24/files/2017/05/Tandem1W-550x550.jpg" failed (5: Input/output error), client: 192.168.255.34, server: www.davisenterprise.com, request: "GET /files/2017/05/Tandem1W-550x550.jpg HTTP/1.1", host: "www.davisenterprise.com", referrer: "http://www.davisenterprise.com/"

During a temporary error, I can ls and see the file exists with correct permissions. The image eventually becomes OK after a long while. Other files return OK without input/output errors.

I can find little logging that documents the issue, but with rpcdebug enabled I see many messages like these around the time of the errors:

May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner           (null)
May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011
May 20 16:10:07 tomentella kernel: nfsv4 compound returned 10011
May 20 16:10:07 tomentella kernel: nfsd_dispatch: vers 4 proc 1
May 20 16:10:07 tomentella kernel: nfsv4 compound op #1/5: 22 (OP_PUTFH)
May 20 16:10:07 tomentella kernel: nfsd: fh_verify(36: 01070001 008c0312 00000000 3c639297 604b0f25 ce691899)
May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #1: 22: status 0
May 20 16:10:07 tomentella kernel: nfsv4 compound op #2/5: 18 (OP_OPEN)
May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner           (null)
May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011
May 20 16:10:07 tomentella kernel: nfsv4 compound returned 10011
May 20 16:10:08 tomentella kernel: nfsd_dispatch: vers 4 proc 1
May 20 16:10:08 tomentella kernel: nfsv4 compound op #1/4: 22 (OP_PUTFH)
May 20 16:10:08 tomentella kernel: nfsd: fh_verify(36: 01070001 008c0312 00000000 3c639297 604b0f25 ce691899)
May 20 16:10:08 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 4 #1: 22: status 0
May 20 16:10:08 tomentella kernel: nfsv4 compound op #2/4: 15 (OP_LOOKUP)
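
For reference, NFSv4 status 10011 is NFS4ERR_EXPIRED in RFC 7530. The following is a minimal Python sketch, assuming the rpcdebug lines keep the format shown above, that tallies the per-operation status codes in a kernel log. The names `tally_statuses` and `describe` are illustrative, and the status table is only a small subset of the codes defined by the RFC.

```python
import re
from collections import Counter

# Small subset of NFSv4 status codes (RFC 7530); extend as needed.
NFS4_STATUS = {
    0: "NFS4_OK",
    10011: "NFS4ERR_EXPIRED",
    10022: "NFS4ERR_STALE_CLIENTID",
    10025: "NFS4ERR_BAD_STATEID",
}

# Matches e.g. "nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011"
OP_STATUS = re.compile(r"nfsv4 compound op \S+ opcnt \d+ #\d+: (\d+): status (\d+)")


def tally_statuses(lines):
    """Count (operation number, status code) pairs found in the log lines."""
    counts = Counter()
    for line in lines:
        match = OP_STATUS.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


def describe(status):
    """Return the symbolic name for a status code, or the number as text."""
    return NFS4_STATUS.get(status, str(status))
```

Feeding it the server's kernel log (for example an open file object for the log) gives a quick view of which operations fail and how often; in the excerpt above, operation 18 (OP_OPEN) is the one returning 10011.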

In particular, I feel like I only see this message for files that are erroring out:

May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner           (null)

Any ideas on what might be causing the input/output errors?

Client mounts using the following:

mount.nfs4 -v -o proto=tcp $NFSMASTERHOST:/srv/data /srv/data

CentOS 7 with updated packages. The error is new, and there have been few server changes recently. I suspect my recent update of the system packages triggered it.

Because the problem goes in and out for some images, I'm able to somewhat watch the logs and compare/contrast. Here's an example of it going from OK to bad when grepping on a particular image name:

May 20 18:38:37 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
May 20 18:38:37 tomentella kernel: NFSD: nfsd4_open_confirm on file Ron-Thomas-web-150x150.jpg
May 20 18:38:37 tomentella kernel: NFSD: nfsd4_close on file Ron-Thomas-web-150x150.jpg
May 20 18:39:08 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
May 20 18:39:08 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
May 20 18:39:10 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
May 20 18:39:10 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
May 20 18:39:11 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
May 20 18:39:11 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner           (null)
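
The same comparison can be done across all files at once instead of grepping one name at a time. This is a rough Python sketch, assuming the `NFSD: nfsd4_open`, `nfsd4_open_confirm` and `nfsd4_close` lines look exactly like the ones above; `summarize_events` is a hypothetical helper name, not part of any NFS tooling.

```python
import re
from collections import Counter, defaultdict

# Matches the three NFSD debug messages shown above and captures
# the event type and the filename.
EVENT = re.compile(
    r"NFSD: nfsd4_(open_confirm|open|close) (?:filename|on file) (\S+)"
)


def summarize_events(lines):
    """Return {filename: Counter of event types} for the given log lines."""
    summary = defaultdict(Counter)
    for line in lines:
        match = EVENT.search(line)
        if match:
            event, filename = match.group(1), match.group(2)
            summary[filename][event] += 1
    return summary
```

A file with many `open` events but few `open_confirm` or `close` events, like the one above after 18:39:08, is a candidate for closer inspection. This is only a heuristic for spotting files worth a look.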

Here's the nfsstat output:

tomentella ★ ~ $ nfsstat
Server rpc stats:
calls      badcalls   badclnt    badauth    xdrcall
94437487   6          6          0          0       

Server nfs v4:
null         compound     
503       0% 94436978 99% 

Server nfs v4 operations:
op0-unused   op1-unused   op2-future   access       close        commit       
0         0% 0         0% 0         0% 11213689  3% 2631554   0% 3377      0% 
create       delegpurge   delegreturn  getattr      getfh        link         
579       0% 0         0% 0         0% 88581315 31% 32460559 11% 0         0% 
lock         lockt        locku        lookup       lookup_root  nverify      
365       0% 0         0% 365       0% 30058556 10% 0         0% 0         0% 
open         openattr     open_conf    open_dgrd    putfh        putpubfh     
2771686   0% 0         0% 74326     0% 0         0% 92969992 32% 0         0% 
putrootfh    read         readdir      readlink     remove       rename       
2435      0% 1999675   0% 1917567   0% 350       0% 12404     0% 5072      0% 
renew        restorefh    savefh       secinfo      setattr      setcltid     
1226801   0% 0         0% 5072      0% 0         0% 18315216  6% 121025    0% 
setcltidconf verify       write        rellockowner bc_ctl       bind_conn    
121105    0% 0         0% 115189    0% 365       0% 0         0% 0         0% 
exchange_id  create_ses   destroy_ses  free_stateid getdirdeleg  getdevinfo   
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
getdevlist   layoutcommit layoutget    layoutreturn secinfononam sequence     
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
set_ssv      test_stateid want_deleg   destroy_clid reclaim_comp 
0         0% 0         0% 0         0% 0         0% 0         0% 

Client rpc stats:
calls      retrans    authrefrsh
0          0          0

Best Answer

I found this searching for a solution to my own input/output error issues with a shared NFS mount. I was mounting a shared NFS drive on several machines, reading and writing with PHP. I was getting sporadic, but frequent, errors like this. I don't know if what I did fixed it, but on the off chance it helps someone else with the same problem ...

I was creating worker servers by cloning them, which left them all with the same hostname. I didn't think anything of that, since the hostname didn't affect what I was doing as far as I could tell. I changed the hostnames so that each one is unique and made sure the /etc/hosts file on each machine maps the hostname to 127.0.0.1, and the NFS errors haven't come back since.
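
The two checks described here can be sketched in Python. This is only an illustration under assumptions: the function names are made up, the machine-to-hostname mapping has to be collected by whatever means you have (it is passed in as a plain dict), and the hosts-file check works on text you supply rather than reading /etc/hosts itself.

```python
from collections import defaultdict


def duplicate_hostnames(hostname_by_machine):
    """Return {hostname: [machines]} for hostnames used by more than one machine."""
    groups = defaultdict(list)
    for machine, hostname in hostname_by_machine.items():
        groups[hostname.strip().lower()].append(machine)
    return {name: sorted(machines)
            for name, machines in groups.items() if len(machines) > 1}


def hosts_file_has(hostname, hosts_text, address="127.0.0.1"):
    """True if hosts_text contains a line mapping `address` to `hostname`."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and fields[0] == address and hostname in fields[1:]:
            return True
    return False
```

On a given machine, `socket.gethostname()` returns the local hostname and the contents of /etc/hosts can be read into a string and passed as `hosts_text`.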
