How do I speedily traverse a file system, extracting various data along the way, while providing user feedback?

command-line · cross-platform · file-systems · hashing · recursion

I'm working on a file-system scanner that reveals info about various files (e.g. size, last used, duplicates, etc). Currently I traverse the file system once just to get a good measure of the files I'll be processing; I then loop through again doing the actual processing (size info, hash info, etc). Obviously that creates an entire layer of "extra" work, but it lets me use the previously acquired info to provide the user with some progress data.

I've been looking for a good mechanism to speed up the process while still showing progress data for end users. I thought about creating separate threads (one appending files to a stack, the other reading from the stack as entries become available), but I worry that approach could quickly become hard to manage.
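For what it's worth, the two-thread idea need not get out of hand if the threads only share a queue and a few counters. Below is a minimal Python sketch (names like `scan_and_process` are my own, and the per-file "work" is a stand-in `getsize` call); the counter updates are not strictly atomic, which is acceptable for approximate progress display:

```python
import os
import queue
import threading

def scan_and_process(root):
    """Producer/consumer sketch: one thread discovers files while another
    processes them; shared counters drive a progress display."""
    work = queue.Queue()
    counters = {"found": 0, "done": 0, "bytes": 0}

    def producer():
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                work.put(os.path.join(dirpath, name))
                counters["found"] += 1  # progress denominator grows as we scan
        work.put(None)  # sentinel: no more files coming

    def consumer():
        while True:
            path = work.get()
            if path is None:
                break
            try:
                counters["bytes"] += os.path.getsize(path)  # stand-in for hashing etc.
            except OSError:
                pass  # file vanished or unreadable; skip it
            counters["done"] += 1
            # e.g. report counters["done"] / max(counters["found"], 1) here

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counters
```

Processing starts as soon as the first file is found, so there is no separate counting pass; the progress denominator simply keeps growing until the scan finishes.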

In the interest of speeding up the initial scan, I currently run `find <path>` (or the equivalent for the OS in use) and grab all of its output. This, however, prevents me from excluding entire subfolders (if the user so desires), since it simply lists everything recursively. Certain OSes do have command-line options for excluding directories, but I need a cross-platform solution.
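One cross-platform way around the external `find` call is to traverse in-process and prune excluded subtrees before descending into them. A sketch in Python (the function name and the name-based exclusion rule are mine; a real scanner might match full paths or glob patterns instead):

```python
import os

def walk_with_excludes(root, excluded_names):
    """Cross-platform traversal that skips unwanted subtrees entirely,
    rather than filtering a full recursive listing after the fact."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place: os.walk never descends into removed entries,
        # so excluded subtrees cost zero I/O.
        dirnames[:] = [d for d in dirnames if d not in excluded_names]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Because pruning happens before descent, an excluded folder's contents are never even read, which an after-the-fact filter on `find` output cannot achieve.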

So, aaaaaaall that said, does anyone have any algorithmic suggestions for being speedy while providing quality progress? I'm not fundamentally tied to any specific language. I'm looking more for a higher level view of what needs to occur.

Best.

Best Answer

You've got an I/O-bound problem. That means you'll have to fit the computation to the I/O. You're likely dealing with physical hard disks, where seeks dominate. Therefore, "speedily" translates to "minimum number of seeks".

We can therefore list the following principles:

1. Scan entire directories (breadth first). Don't be tempted to enter subdirectories before you've scanned the parent directory; going back takes another seek.
2. Save all data you can possibly get from a directory. Having to go back takes another seek.

Now, some filesystems (e.g. NTFS) store small file contents inside the directory entries themselves. On such systems, you should hash those file contents immediately after scanning the directory; otherwise, it really doesn't matter when you do it. It may be wise to scan subdirectories first so you can report a running count of files found, and to defer reading large files.
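A sketch of that split, hash small files on the spot and defer large ones, might look like the following. The 4096-byte threshold is an assumption of mine (the size below which file data may be resident in the directory entry varies by filesystem), and `scan_directory` is a hypothetical helper:

```python
import hashlib
import os

SMALL_THRESHOLD = 4096  # assumed cutoff for "possibly resident in the directory"

def hash_file(path):
    """Hash a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def scan_directory(directory, deferred):
    """Hash small files immediately (their bytes may already be in the
    directory entry we just read); queue large files for a later pass."""
    hashes = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                if entry.stat(follow_symlinks=False).st_size <= SMALL_THRESHOLD:
                    hashes[entry.path] = hash_file(entry.path)
                else:
                    deferred.append(entry.path)  # big read; do it after the scan
    return hashes
```

Deferring the large reads keeps the directory scan itself fast, so the file count (and hence the progress denominator) stabilizes early.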

When you truly want to push for high performance, asynchronous I/O is the proper solution. This won't be needed on regular PCs (even an SSD isn't that fast), but high-end file servers can overload a single thread. In such systems, a solution such as Boost.Asio can scale: you throw read requests at Boost.Asio, and it returns results some time later as they come in, possibly on other threads if needed. This gives the underlying OS more flexibility in scheduling the read requests.
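The same many-requests-in-flight pattern that Boost.Asio provides in C++ can be approximated in Python with `asyncio`. This is only a sketch of the idea (`hash_many` is my own name, and true native async file I/O would need OS-specific support; here blocking reads are simply handed to worker threads):

```python
import asyncio
import hashlib

async def hash_many(paths, concurrency=32):
    """Keep many read requests in flight at once and collect results as
    they complete -- the shape of the Boost.Asio pattern, in Python."""
    sem = asyncio.Semaphore(concurrency)  # cap outstanding requests

    async def hash_one(path):
        async with sem:
            # File reads block, so run them in a worker thread; a high-end
            # server setup would use native asynchronous file I/O instead.
            data = await asyncio.to_thread(lambda: open(path, "rb").read())
        return path, hashlib.sha256(data).hexdigest()

    pairs = await asyncio.gather(*(hash_one(p) for p in paths))
    return dict(pairs)
```

Because many requests are outstanding at once, the OS and drive firmware are free to reorder them to minimize seeks, which is exactly the flexibility described above.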
