C++ – Find if certain files have been added or removed in large directory structure

c++ files qt

I have a large directory structure (a directory plus subdirectories) with files. It contains files of several types. For one particular type (let's say with the extension .foo) I need to figure out if files have been added, changed or deleted.

The first approach was to iterate over all files and check the timestamp of the files. It works pretty ok for local files, but once they reside on a network file system it gets too slow.

And, in order to detect deleted files I have to create an index anyway. One idea is to create hash values for the relevant files and store them per directory. Unfortunately the normal case is that nothing changes most of the time, yet I still have to recalculate all hash values (slow) just to detect that nothing changed.
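For illustration, a minimal Qt sketch of that index idea (the function name buildFooHashIndex and the *.foo filter are mine, not from the original post). It also shows exactly the expensive part: every file is read and hashed on every run, even when nothing changed.

    #include <QCryptographicHash>
    #include <QDir>
    #include <QDirIterator>
    #include <QFile>
    #include <QHash>
    #include <QString>
    #include <QStringList>

    // Hash every *.foo file under `root`, keyed by path. This is the slow,
    // naive variant: every file is read in full on every run, even when
    // nothing has changed.
    QHash<QString, QByteArray> buildFooHashIndex(const QString &root)
    {
        QHash<QString, QByteArray> index;
        QDirIterator it(root, QStringList() << "*.foo",
                        QDir::Files, QDirIterator::Subdirectories);
        while (it.hasNext()) {
            const QString path = it.next();
            QFile file(path);
            if (!file.open(QIODevice::ReadOnly))
                continue;                         // skip unreadable files
            QCryptographicHash hash(QCryptographicHash::Sha256);
            hash.addData(&file);                  // streams the whole file content
            index.insert(path, hash.result());
        }
        return index;
    }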

Remark: I am coding in Qt 5.6/C++ and the code has to work on Windows, Mac, and Linux, but the question's focus is more on the concept than on working code.

— edit —

Some clarifications (as asked below)

  • it is about files for flight simulation. When the user adds or changes files I have to rerun some tasks, like parsing those files. The .foo files are the special files I care about, while the tens of thousands of other flight simulator files can be ignored. My goal is to find out whether I have to start the expensive parsing process, which is only required if any .foo file changed
  • there is indeed a monitor for changed files in Qt, QFileSystemWatcher (a short sketch of its use follows after this list). But I cannot guarantee to monitor all the time, and it would not be needed anyway. When I start my software I need to find out whether any .foo file changed; if so I start parsing, otherwise I skip this step.
  • There are somewhere between 100 and 10000 .foo files within a directory structure of more than 50000 files
  • The files change because the user installs new features or maybe deletes some. This happens while my software is not running; as already said, I cannot monitor all the time. So I need something that works after I have started my software
  • There is no client / server and it needs to work on Linux, Mac and Windows (but that was already mentioned in the original post)
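Since QFileSystemWatcher came up in the clarifications above, here is a minimal sketch of how it would be used if the program could stay running (the root path is hypothetical). Note that the watcher is not recursive, so every subdirectory has to be added explicitly, and it cannot report changes that happened while the program was off.

    #include <QCoreApplication>
    #include <QDebug>
    #include <QDir>
    #include <QDirIterator>
    #include <QFileSystemWatcher>
    #include <QObject>

    // Watch a directory tree for changes while the application is running.
    // QFileSystemWatcher is not recursive: each subdirectory is added by hand.
    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);
        const QString root = QStringLiteral("/path/to/flightsim");  // hypothetical root

        QFileSystemWatcher watcher;
        watcher.addPath(root);
        QDirIterator it(root, QDir::Dirs | QDir::NoDotAndDotDot,
                        QDirIterator::Subdirectories);
        while (it.hasNext())
            watcher.addPath(it.next());

        QObject::connect(&watcher, &QFileSystemWatcher::directoryChanged,
                         [](const QString &dir) {
            qDebug() << "Something changed in" << dir
                     << "- rescan it for *.foo additions/removals";
        });

        return app.exec();
    }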

Best Answer

If possible, have such monitoring happen close to the disk. In particular, avoid monitoring remote file systems; make the monitoring program run on the file server instead. The monitor itself might not need a Qt interface (in other words, if you need a Qt interface, make it a separate executable communicating with the monitor). Perhaps the monitor could be queryable through HTTP (by using some HTTP server library like libonion in it).

On Linux with local genuine file systems like Ext4, you could use the inotify(7) facilities (which don't work on remote file systems or on pseudo-file systems à la /proc/).
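A minimal, Linux-only sketch of the inotify(7) route (the watched path is hypothetical). A real monitor would add one watch per subdirectory and keep running, which is the kind of close-to-disk service suggested above.

    // Linux-only sketch of inotify(7). This watches a single directory; a
    // real monitor would add a watch per subdirectory (and add new watches
    // when directories appear). Not usable on network or pseudo file systems.
    #include <sys/inotify.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        int fd = inotify_init();
        if (fd < 0) { perror("inotify_init"); return 1; }

        // Watch creations, deletions, modifications and moves in one directory.
        int wd = inotify_add_watch(fd, "/path/to/flightsim",      // hypothetical path
                                   IN_CREATE | IN_DELETE | IN_MODIFY |
                                   IN_MOVED_TO | IN_MOVED_FROM);
        if (wd < 0) { perror("inotify_add_watch"); return 1; }

        char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
        for (;;) {
            ssize_t len = read(fd, buf, sizeof buf);   // blocks until events arrive
            if (len <= 0) break;
            for (char *p = buf; p < buf + len; ) {
                const struct inotify_event *ev =
                    reinterpret_cast<const struct inotify_event *>(p);
                if (ev->len > 0 && strstr(ev->name, ".foo"))   // crude suffix check
                    printf("event 0x%x on %s\n", ev->mask, ev->name);
                p += sizeof(struct inotify_event) + ev->len;
            }
        }
        close(fd);
        return 0;
    }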

You might indeed create and maintain an index, using nftw(3).
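As a sketch, an nftw(3) walk that collects the size and modification time of every .foo file into an index (POSIX only, so Linux and macOS; the root path and the names collectFoo and g_index are mine):

    // POSIX-only sketch: walk the tree with nftw(3) and collect size + mtime
    // of every *.foo file into an index that can be compared against the one
    // saved at the previous run.
    #define _XOPEN_SOURCE 700
    #include <ftw.h>
    #include <cstring>
    #include <map>
    #include <string>
    #include <utility>

    static std::map<std::string, std::pair<off_t, time_t>> g_index;

    static int collectFoo(const char *path, const struct stat *sb,
                          int typeflag, struct FTW * /*ftwbuf*/)
    {
        if (typeflag == FTW_F) {                       // regular files only
            size_t len = std::strlen(path);
            if (len > 4 && std::strcmp(path + len - 4, ".foo") == 0)
                g_index[path] = std::make_pair(sb->st_size, sb->st_mtime);
        }
        return 0;                                      // keep walking
    }

    int main()
    {
        if (nftw("/path/to/flightsim", collectFoo, 32, FTW_PHYS) != 0)  // hypothetical root
            return 1;
        // g_index now maps each .foo path to (size, mtime); diff it against
        // the previously stored index to detect added, changed or deleted files.
        return 0;
    }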

It looks like plain file systems are not optimal for the job as a whole. Perhaps you should think the other way round and use a genuine database with triggers (instead of the file system) to store the monitored data. Did you consider MariaDB, PostgreSQL, or MongoDB?

See also locate(1) & updatedb(1)

Unfortunately the normal case is that nothing changes most of the time, yet I still have to recalculate all hash values (slow) just to detect that nothing changed.

Again (with a Linux perspective, or any OS and filesystem with reliable modification times!) you don't need to do that. You query the file metadata (size and modification time) with stat(2) or with QFileInfo and its lastModified() and size() member functions, keep it alongside the hash value, and recompute the hash value (i.e. read the file contents again) only when the size or mtime of the file has changed since the previous hash was computed.
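A cross-platform Qt sketch of that scheme (the struct and function names are mine): keep (size, mtime, hash) per .foo file, and on the next start re-hash only the files whose size or mtime differ from the stored entry.

    #include <QCryptographicHash>
    #include <QDateTime>
    #include <QDir>
    #include <QDirIterator>
    #include <QFile>
    #include <QFileInfo>
    #include <QHash>
    #include <QStringList>

    // One index entry per .foo file: cheap metadata plus the last known hash.
    struct FileStamp {
        qint64     size;
        QDateTime  mtime;
        QByteArray sha256;
    };

    // Rebuild the index from the directory tree, reusing old hashes whenever
    // size and mtime are unchanged. Comparing the returned keys and hashes
    // with `previous` tells you whether anything was added, changed or removed.
    QHash<QString, FileStamp> scanFooFiles(const QString &root,
                                           const QHash<QString, FileStamp> &previous)
    {
        QHash<QString, FileStamp> current;
        QDirIterator it(root, QStringList() << "*.foo",
                        QDir::Files, QDirIterator::Subdirectories);
        while (it.hasNext()) {
            it.next();
            const QFileInfo info = it.fileInfo();
            FileStamp stamp;
            stamp.size  = info.size();
            stamp.mtime = info.lastModified();

            const auto old = previous.constFind(info.filePath());
            if (old != previous.constEnd()
                    && old->size == stamp.size && old->mtime == stamp.mtime) {
                stamp.sha256 = old->sha256;        // metadata unchanged: reuse hash
            } else {
                QFile file(info.filePath());       // metadata changed: re-read content
                if (file.open(QIODevice::ReadOnly)) {
                    QCryptographicHash hash(QCryptographicHash::Sha256);
                    hash.addData(&file);
                    stamp.sha256 = hash.result();
                }
            }
            current.insert(info.filePath(), stamp);
        }
        return current;   // persist this (e.g. with QDataStream) for the next start
    }

Keys present in the previous index but missing from the new one are deleted files, new keys are added files, and a differing hash marks a changed file; persisting the index between runs is what makes the startup check cheap.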

It looks like you are reinventing make or omake or some other builder... (some of them work on content checksums, not mtimes).
