Linux – non-CPU-intensive alternative to lsof


We run an Apache Cassandra cluster where each host has a few hundred thousand files open at any given time.

We'd like to be able to get a count of open files at periodic intervals and feed this number into graphite, but when we run lsof under collectd, it ends up taking a few minutes to complete and chewing up an inordinate amount of CPU in the meantime.

I'm wondering if there's an alternative, more CPU-friendly means of getting the same data that lsof provides, or even a way of running lsof that won't eat into CPU as noticeably? (Although I assume the latter would likely take much longer to complete than it currently does… not ideal).

Perhaps the kernel maintains some variable somewhere that contains the number of open files? Wishful thinking?

Update:

In response to one of the answers: we're already using the -b and -n flags. Here's the full command as I have it running under collectd:

sudo lsof -b -n -w | stdbuf -i0 -o0 -e0 wc -l

Best Answer

You probably don't need to resolve the network addresses for sockets, so at the very least use the -n switch. You may also want to skip blocking kernel operations with -b.

These first two switches should make it noticeably faster.

Then add -l to avoid resolving UIDs, and -L to avoid listing file link counts. And so on; see man lsof for the rest.
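Putting those together, and adding -w (which the command in the question already uses) to suppress warning messages, a minimal sketch of the counting pipeline might look like:

sudo lsof -b -n -l -w | wc -l

Each of these switches skips a lookup or a message rather than filtering the output, so the line count should be unaffected: -b avoids kernel functions that could block, -n skips network name resolution, -l skips UID-to-login-name lookups, and -w suppresses warnings.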

Alternatively, on Linux, you could simply count the symlinks under /proc/<PID>/fd with something like this:

find /proc -mindepth 3 -maxdepth 3 -type l | awk -F/ '$4 == "fd" { s++ } END { print s }'
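Note that this needs to run with enough privilege to read every process's fd directory; an unprivileged user can only see its own processes, so the count will be incomplete otherwise.

If you want to feed that number into graphite through collectd, a minimal sketch using collectd's exec plugin could look like the following. To flag the assumptions: the "exec-openfiles/gauge" identifier is a placeholder invented for this example (pick whatever matches your naming scheme), and, as far as I know, the exec plugin refuses to run scripts as root, so the count will only cover processes the configured user can read.

#!/bin/bash
# Sketch for collectd's exec plugin: emits one PUTVAL line per interval.
# COLLECTD_HOSTNAME and COLLECTD_INTERVAL are set by collectd when it
# runs the script; the fallbacks here are only for testing by hand.
HOST="${COLLECTD_HOSTNAME:-$(hostname -f)}"
INTERVAL="${COLLECTD_INTERVAL:-60}"
while sleep "$INTERVAL"; do
    count=$(find /proc -mindepth 3 -maxdepth 3 -type l 2>/dev/null |
        awk -F/ '$4 == "fd" { s++ } END { print s }')
    # "exec-openfiles/gauge" is a placeholder identifier; adjust to taste.
    echo "PUTVAL \"$HOST/exec-openfiles/gauge\" interval=$INTERVAL N:$count"
done

And to answer the "wishful thinking" part of the question: if a single system-wide figure is good enough, the kernel does keep one. The first field of /proc/sys/fs/file-nr is the number of allocated file handles (the third is the system-wide maximum), and reading it costs next to nothing:

cat /proc/sys/fs/file-nr
# e.g. "123456  0  3270710": allocated, free, maximum

Bear in mind this counts open file descriptions system-wide, so it won't match lsof's line count exactly; lsof also lists entries such as memory-mapped files, current working directories, and program text.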