We're seeing apps like nginx and php-fpm error out occasionally (and temporarily) while opening good files from a connected NFS mount:
php-fpm error example:
2017/05/20 22:53:09 [error] 55#0: *6575 FastCGI sent in stderr: "PHP message: PHP Warning: getimagesize(/www/newspaperfoundation.org/html/wp-content/blogs.dir/22/files/2017/05/19-highest-honors-1.jpg): failed to open stream: Input/output error in /www/newspaperfoundation.org/html/wp-content/plugins/mashsharer/includes/header-meta-tags.php on line 271" while reading response header from upstream, client: 192.168.255.34, server: www.dailyrepublic.com, request: "GET /solano-news/fairfield/highest-honors-commends-students-with-4-0-and-higher-grade-point-average/ HTTP/1.1", upstream: "fastcgi://172.17.0.3:9001", host: "www.dailyrepublic.com"
nginx error example:
2017/05/20 23:22:32 [crit] 56#0: *712 open() "/www/newspaperfoundation.org/html/wp-content/blogs.dir/24/files/2017/05/Tandem1W-550x550.jpg" failed (5: Input/output error), client: 192.168.255.34, server: www.davisenterprise.com, request: "GET /files/2017/05/Tandem1W-550x550.jpg HTTP/1.1", host: "www.davisenterprise.com", referrer: "http://www.davisenterprise.com/"
During a temporary error, I can ls the file and see that it exists with correct permissions. The image eventually becomes readable again after a long while. Other files open without input/output errors.
There's not much logging I can find to document the issue, but after enabling rpcdebug I see a lot of messages like these around the time of the errors:
May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner (null)
May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011
May 20 16:10:07 tomentella kernel: nfsv4 compound returned 10011
May 20 16:10:07 tomentella kernel: nfsd_dispatch: vers 4 proc 1
May 20 16:10:07 tomentella kernel: nfsv4 compound op #1/5: 22 (OP_PUTFH)
May 20 16:10:07 tomentella kernel: nfsd: fh_verify(36: 01070001 008c0312 00000000 3c639297 604b0f25 ce691899)
May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #1: 22: status 0
May 20 16:10:07 tomentella kernel: nfsv4 compound op #2/5: 18 (OP_OPEN)
May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner (null)
May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011
May 20 16:10:07 tomentella kernel: nfsv4 compound returned 10011
May 20 16:10:08 tomentella kernel: nfsd_dispatch: vers 4 proc 1
May 20 16:10:08 tomentella kernel: nfsv4 compound op #1/4: 22 (OP_PUTFH)
May 20 16:10:08 tomentella kernel: nfsd: fh_verify(36: 01070001 008c0312 00000000 3c639297 604b0f25 ce691899)
May 20 16:10:08 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 4 #1: 22: status 0
May 20 16:10:08 tomentella kernel: nfsv4 compound op #2/4: 15 (OP_LOOKUP)
In particular, I feel like I only see this message for files that are erroring out:
May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner (null)
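The `status 10011` in the compound lines above is a numeric NFSv4 status code; in RFC 7530 that value is NFS4ERR_EXPIRED (a lease or piece of state has expired). A minimal sketch for pulling failing compounds out of such a log and naming the code is below. The function names `decode_nfs4_status` and `failed_compounds` are made up for illustration, the lookup table covers only a handful of codes, and the parsing assumes the debug line format shown above.

```shell
#!/bin/sh
# decode_nfs4_status CODE
# Print the RFC 7530 name for a small, hand-picked set of NFSv4 status codes.
decode_nfs4_status() {
    case "$1" in
        0)     echo NFS4_OK ;;
        5)     echo NFS4ERR_IO ;;
        70)    echo NFS4ERR_STALE ;;
        10008) echo NFS4ERR_DELAY ;;
        10011) echo NFS4ERR_EXPIRED ;;
        10013) echo NFS4ERR_GRACE ;;
        10022) echo NFS4ERR_STALE_CLIENTID ;;
        *)     echo "UNKNOWN($1)" ;;
    esac
}

# failed_compounds
# Read nfsd debug lines on stdin. For every "nfsv4 compound returned N"
# line with N != 0, print "N NAME" on stdout.
failed_compounds() {
    sed -n 's/.*nfsv4 compound returned \([0-9][0-9]*\).*/\1/p' |
    while read -r code; do
        [ "$code" -eq 0 ] && continue
        echo "$code $(decode_nfs4_status "$code")"
    done
}

# Example (assumes kernel messages land in /var/log/messages):
#   failed_compounds < /var/log/messages | sort | uniq -c
```

Counting the output per code gives a quick view of whether the failures are all the same status or a mix.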
Any ideas on what might be causing the input/output errors?
Client mounts using the following:
mount.nfs4 -v -o proto=tcp $NFSMASTERHOST:/srv/data /srv/data
CentOS 7 with updated packages. The error is new, and there have been few server changes recently. I suspect my recent update of system packages may have been the trigger.
Because the problem comes and goes for some images, I'm able to watch the logs and compare. Here's an example of an image going from OK to bad, grepping on its file name:
May 20 18:38:37 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:38:37 tomentella kernel: NFSD: nfsd4_open_confirm on file Ron-Thomas-web-150x150.jpg
May 20 18:38:37 tomentella kernel: NFSD: nfsd4_close on file Ron-Thomas-web-150x150.jpg
May 20 18:39:08 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:08 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:10 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:10 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:11 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:11 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
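To make the comparison above less manual, the per-file pattern (an open followed by a confirm and a close when things work, repeated opens with nothing after them when they don't) can be tallied with a short filter. This is only a sketch: `open_summary` is a hypothetical name, it assumes the three message formats shown in the excerpt, and it will miscount file names that contain spaces. An OPEN without an OPEN_CONFIRM is not by itself proof of failure, so treat the counts as a rough signal.

```shell
#!/bin/sh
# open_summary
# Read nfsd debug lines on stdin and print, per file name, how many
# nfsd4_open, nfsd4_open_confirm and nfsd4_close messages were seen.
open_summary() {
    awk '
        / nfsd4_open filename / {
            # The file name is the field right after the word "filename".
            for (i = 1; i < NF; i++)
                if ($i == "filename") { f = $(i + 1); break }
            opens[f]++; names[f] = 1
        }
        / nfsd4_open_confirm on file / { confirms[$NF]++; names[$NF] = 1 }
        / nfsd4_close on file /        { closes[$NF]++;   names[$NF] = 1 }
        END {
            for (f in names)
                printf "%s opens=%d confirms=%d closes=%d\n",
                       f, opens[f], confirms[f], closes[f]
        }
    ' | sort
}

# Example (assumes kernel messages land in /var/log/messages):
#   grep Ron-Thomas-web-150x150.jpg /var/log/messages | open_summary
```

Files whose open count keeps climbing while confirms and closes stay flat are the ones worth correlating with the client-side errors.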
Here's the nfsstat output from the server:
tomentella ★ ~ $ nfsstat
Server rpc stats:
calls badcalls badclnt badauth xdrcall
94437487 6 6 0 0
Server nfs v4:
null compound
503 0% 94436978 99%
Server nfs v4 operations:
op0-unused op1-unused op2-future access close commit
0 0% 0 0% 0 0% 11213689 3% 2631554 0% 3377 0%
create delegpurge delegreturn getattr getfh link
579 0% 0 0% 0 0% 88581315 31% 32460559 11% 0 0%
lock lockt locku lookup lookup_root nverify
365 0% 0 0% 365 0% 30058556 10% 0 0% 0 0%
open openattr open_conf open_dgrd putfh putpubfh
2771686 0% 0 0% 74326 0% 0 0% 92969992 32% 0 0%
putrootfh read readdir readlink remove rename
2435 0% 1999675 0% 1917567 0% 350 0% 12404 0% 5072 0%
renew restorefh savefh secinfo setattr setcltid
1226801 0% 0 0% 5072 0% 0 0% 18315216 6% 121025 0%
setcltidconf verify write rellockowner bc_ctl bind_conn
121105 0% 0 0% 115189 0% 365 0% 0 0% 0 0%
exchange_id create_ses destroy_ses free_stateid getdirdeleg getdevinfo
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
getdevlist layoutcommit layoutget layoutreturn secinfononam sequence
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
set_ssv test_stateid want_deleg destroy_clid reclaim_comp
0 0% 0 0% 0 0% 0 0% 0 0%
Client rpc stats:
calls retrans authrefrsh
0 0 0
Best Answer
I found this searching for a solution to my own input/output error issues with a shared NFS mount. I was mounting a shared NFS drive on several machines, reading and writing with PHP. I was getting sporadic, but frequent, errors like this. I don't know if what I did fixed it, but on the off chance it helps someone else with the same problem ...
So, I was creating worker servers by cloning them, which resulted in them all having the same hostname. I didn't think anything of it; as far as I could tell, the hostname didn't affect what I was doing. I changed the hostnames to all be unique and made sure each machine's /etc/hosts file included its hostname pointing to 127.0.0.1, and the NFS errors haven't come back since.
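The two checks described above (unique hostnames, and each hostname present on the 127.0.0.1 line of /etc/hosts) can be scripted. The sketch below is an illustration under assumptions, not the exact procedure used in this answer: the function names `hosts_maps_loopback` and `duplicate_hostnames` are invented, and how the list of hostnames is collected from the clients is left to the reader. Whether a hostname collision is the cause in any given setup is not something this check can confirm.

```shell
#!/bin/sh
# hosts_maps_loopback HOSTNAME [HOSTS_FILE]
# Succeed (exit 0) if HOSTS_FILE has a 127.0.0.1 entry that names HOSTNAME.
hosts_maps_loopback() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v name="$name" '
        { sub(/#.*/, "") }              # ignore comments
        $1 == "127.0.0.1" {
            for (i = 2; i <= NF; i++)
                if ($i == name) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# duplicate_hostnames
# Read one hostname per line on stdin and print each name that
# appears more than once.
duplicate_hostnames() {
    sort | uniq -d
}

# Example, run on each client:
#   hosts_maps_loopback "$(hostname)" || echo "hostname missing from 127.0.0.1 line"
# Example, run on a list gathered from all clients (hostnames.txt is hypothetical):
#   duplicate_hostnames < hostnames.txt
```

Any name printed by `duplicate_hostnames` identifies clones that still share a hostname and would need renaming before remounting.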