Python – Is creating and writing to one large file faster than creating and writing to many smaller files on Linux?

Tags: file-handling, linux, python, speed

If using Python on a Linux machine, which of the following would be faster? Why?

  1. Creating a file at the very beginning of the program, writing very large amounts of data (text), closing it, then splitting the large file up into many smaller files at the very end of the program.
  2. Throughout the program's span, many smaller files will be created, written to and closed.

Specifically, the program in question is one which needs to record the state of a very large array at each of many time-steps. The state of the array at each time-step needs to be recorded in independent files.

I've worked with C on Linux and know that opening/creating and closing files is quite time-expensive, and that fewer open/create operations mean a faster program. Is the same true when writing in Python? Would changing the language even matter if I'm still using the same OS?

I'm also interested in RAM's role in this context. For example (correct me if I'm wrong), I'm assuming that parts of a file being written to are placed in RAM. If the file gets too big, will it bloat RAM usage and cause speed or other problems? An answer that also covers RAM would be great.

Best Answer

To answer your question, you really should benchmark (i.e. measure the execution time of several variants of your program). It will likely depend on how many small files you need (ten thousand files is not the same as ten billion), on the file system you are using (you could even use a tmpfs file system, which lives in RAM), and obviously on the hardware (SSDs are faster than spinning disks).
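For instance, a rough benchmark along these lines (the chunk size and file count below are made-up placeholders, not numbers from your program) would already tell you a lot on your actual hardware and file system:

    # Minimal benchmark sketch: one big file vs. many small files.
    # N_CHUNKS and CHUNK are illustrative assumptions.
    import os
    import time
    import tempfile

    N_CHUNKS = 10_000            # pretend number of time-steps
    CHUNK = b"x" * 4096          # pretend "array state" payload

    def one_big_file(base):
        with open(os.path.join(base, "big.dat"), "wb") as f:
            for _ in range(N_CHUNKS):
                f.write(CHUNK)

    def many_small_files(base):
        for i in range(N_CHUNKS):
            with open(os.path.join(base, f"step{i:06d}.dat"), "wb") as f:
                f.write(CHUNK)

    for variant in (one_big_file, many_small_files):
        with tempfile.TemporaryDirectory() as base:
            t0 = time.perf_counter()
            variant(base)
            print(f"{variant.__name__}: {time.perf_counter() - t0:.3f} s")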

I would also suggest avoiding a large number of files in the same directory. So prefer dir01/file001.txt ... dir01/file999.txt, dir02/file001.txt ... over file00001.txt ... file99999.txt, i.e. keep each directory to at most a thousand files or so.
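One possible helper for such a layout (the names and the per-directory limit are just illustrative):

    # Hypothetical helper: map a time-step index to dirNNNN/fileNNN.txt
    # so that no directory holds more than `per_dir` files.
    import os

    def path_for_step(root, step, per_dir=1000):
        d = os.path.join(root, f"dir{step // per_dir:04d}")
        os.makedirs(d, exist_ok=True)
        return os.path.join(d, f"file{step % per_dir:03d}.txt")

    # e.g. path_for_step("states", 12345) -> "states/dir0012/file345.txt"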

I would also advise against having a large number of tiny files (e.g. files with less than a hundred bytes of data each): they make many file systems unhappy, since each file needs at least its own inode.

However, you should perhaps consider other alternatives, like using a database (which might be as simple as SQLite ...) or using some indexed file (like GDBM ...).
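As a rough illustration of the SQLite route (table and column names are my own invention, and the array state is assumed to be serialisable to bytes), all the time-steps then live in a single file:

    # Sketch: one SQLite database instead of many files, one row per time-step.
    import sqlite3

    conn = sqlite3.connect("states.db")
    conn.execute("CREATE TABLE IF NOT EXISTS state (step INTEGER PRIMARY KEY, data BLOB)")

    def save_state(step, array_bytes):
        conn.execute("INSERT OR REPLACE INTO state VALUES (?, ?)", (step, array_bytes))
        conn.commit()   # committing per step is simple; batching commits is faster

    def load_state(step):
        row = conn.execute("SELECT data FROM state WHERE step = ?", (step,)).fetchone()
        return row[0] if row else None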

Regarding RAM, the kernel tries quite hard to keep file data in RAM (in its page cache). See e.g. linuxatemyram.com, and read about posix_fadvise(2), fsync(2), readahead(2), ...
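Some of these calls are reachable directly from Python's os module on Unix; here is a small sketch (the file name and size are arbitrary assumptions) of flushing a file and then telling the kernel you no longer need its cached pages:

    import os

    with open("big.dat", "wb") as f:
        f.write(b"x" * (64 * 1024 * 1024))        # 64 MiB of dummy data
        f.flush()                                  # flush Python's user-space buffer
        os.fsync(f.fileno())                       # fsync(2): push data to the disk
        os.posix_fadvise(f.fileno(), 0, 0,         # posix_fadvise(2): len=0 means
                         os.POSIX_FADV_DONTNEED)   # "whole file"; drop cached pages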

BTW, Python code will ultimately call C code and use the same (kernel-provided) syscalls(2). Most file-system-related processing happens inside the Linux kernel. So it won't be faster (unless Python adds its own user-space buffering to e.g. read(2) data in megabyte chunks, hence lowering the number of executed syscalls).
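Python's built-in open() does add such user-space buffering by default; the sketch below (file names and counts are arbitrary) contrasts buffered and unbuffered writes, and you could count the resulting write(2) syscalls with strace -c if you want to see the difference:

    chunk = b"x" * 100

    with open("buffered.dat", "wb") as f:               # default buffer (typically 8 KiB)
        for _ in range(10_000):
            f.write(chunk)                               # most writes stay in user space

    with open("unbuffered.dat", "wb", buffering=0) as f: # buffering=0: binary mode only
        for _ in range(10_000):
            f.write(chunk)                               # one write(2) syscall per call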

Notice that every Linux system can deal with a lot of disk data, either in a single huge file (much bigger than the available RAM: you could have a 50 GB file on your laptop and a terabyte file on your desktop!) or spread across many files.
