Python – Is creating and writing to one large file faster than creating and writing to many smaller files on Linux?

Tags: file-handling, linux, python, speed

If using Python on a Linux machine, which of the following would be faster? Why?

  1. Creating a file at the very beginning of the program, writing very large amounts of data (text), closing it, then splitting the large file up into many smaller files at the very end of the program.
  2. Throughout the program's span, many smaller files will be created, written to and closed.

Specifically, the program in question is one which needs to record the state of a very large array at each of many time-steps. The state of the array at each time-step needs to be recorded in independent files.

I've worked with C on Linux and know that opening/creating and closing files is quite time-expensive, and that fewer open/create operations mean a faster program. Is the same true when writing in Python? Would changing the language even matter if I'm still using the same OS?

I'm also interested in RAM's role in this context. For example (correct me if I'm wrong), I'm assuming that parts of a file being written to are placed in RAM. If the file gets too big, will it bloat RAM usage and cause speed or other problems? An answer that also covers RAM would be great.

Best Answer

To answer your question, you really should benchmark (i.e. measure the execution time of several variants of your program). It will likely depend on how many small files you need (ten thousand files is not the same as ten billion), on the file system you are using (you could even use a tmpfs file system, which lives in RAM), and obviously on the hardware (SSDs are faster than spinning disks).
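For instance, a rough benchmark along these lines (the chunk size and file count below are made-up placeholders, not numbers from your program) would already tell you a lot on your actual hardware and file system:

    # Minimal benchmark sketch: one big file vs. many small files.
    # N_CHUNKS and CHUNK are illustrative assumptions.
    import os
    import time
    import tempfile

    N_CHUNKS = 10_000            # pretend number of time-steps
    CHUNK = b"x" * 4096          # pretend "array state" payload

    def one_big_file(base):
        with open(os.path.join(base, "big.dat"), "wb") as f:
            for _ in range(N_CHUNKS):
                f.write(CHUNK)

    def many_small_files(base):
        for i in range(N_CHUNKS):
            with open(os.path.join(base, f"step{i:06d}.dat"), "wb") as f:
                f.write(CHUNK)

    for variant in (one_big_file, many_small_files):
        with tempfile.TemporaryDirectory() as base:
            t0 = time.perf_counter()
            variant(base)
            print(f"{variant.__name__}: {time.perf_counter() - t0:.3f} s")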

I would also suggest avoiding a large number of files in the same directory. So prefer dir01/file001.txt ... dir01/file999.txt, dir02/file001.txt ... over file00001.txt ... file99999.txt, i.e. keep each directory to at most a thousand files or so.
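One possible helper for such a layout (the names and the per-directory limit are just illustrative):

    # Hypothetical helper: map a time-step index to dirNNNN/fileNNN.txt
    # so that no directory holds more than `per_dir` files.
    import os

    def path_for_step(root, step, per_dir=1000):
        d = os.path.join(root, f"dir{step // per_dir:04d}")
        os.makedirs(d, exist_ok=True)
        return os.path.join(d, f"file{step % per_dir:03d}.txt")

    # e.g. path_for_step("states", 12345) -> "states/dir0012/file345.txt"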

I would also advise against having a large number of tiny files (e.g. files with less than a hundred bytes of data each): they make many file systems unhappy, since each file needs at least its own inode.

However, you should perhaps consider other alternatives, like using a database (which might be as simple as SQLite ...) or using some indexed file (like GDBM ...).
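As a rough illustration of the SQLite route (table and column names are my own invention, and the array state is assumed to be serialisable to bytes), all the time-steps then live in a single file:

    # Sketch: one SQLite database instead of many files, one row per time-step.
    import sqlite3

    conn = sqlite3.connect("states.db")
    conn.execute("CREATE TABLE IF NOT EXISTS state (step INTEGER PRIMARY KEY, data BLOB)")

    def save_state(step, array_bytes):
        conn.execute("INSERT OR REPLACE INTO state VALUES (?, ?)", (step, array_bytes))
        conn.commit()   # committing per step is simple; batching commits is faster

    def load_state(step):
        row = conn.execute("SELECT data FROM state WHERE step = ?", (step,)).fetchone()
        return row[0] if row else None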

Regarding RAM, the kernel tries quite hard to keep file data in RAM (in its page cache). See e.g. linuxatemyram.com, and read about posix_fadvise(2), fsync(2), readahead(2), ...
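Some of these calls are reachable directly from Python's os module on Unix; here is a small sketch (the file name and size are arbitrary assumptions) of flushing a file and then telling the kernel you no longer need its cached pages:

    import os

    with open("big.dat", "wb") as f:
        f.write(b"x" * (64 * 1024 * 1024))        # 64 MiB of dummy data
        f.flush()                                  # flush Python's user-space buffer
        os.fsync(f.fileno())                       # fsync(2): push data to the disk
        os.posix_fadvise(f.fileno(), 0, 0,         # posix_fadvise(2): len=0 means
                         os.POSIX_FADV_DONTNEED)   # "whole file"; drop cached pages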

BTW, Python code will ultimately call C code and use the same (kernel-provided) syscalls(2). Most file-system-related processing happens inside the Linux kernel. So it won't be faster (unless Python adds its own user-space buffering to e.g. read(2) data in megabyte chunks, hence lowering the number of executed syscalls).
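Python's built-in open() does add such user-space buffering by default; the sketch below (file names and counts are arbitrary) contrasts buffered and unbuffered writes, and you could count the resulting write(2) syscalls with strace -c if you want to see the difference:

    chunk = b"x" * 100

    with open("buffered.dat", "wb") as f:               # default buffer (typically 8 KiB)
        for _ in range(10_000):
            f.write(chunk)                               # most writes stay in user space

    with open("unbuffered.dat", "wb", buffering=0) as f: # buffering=0: binary mode only
        for _ in range(10_000):
            f.write(chunk)                               # one write(2) syscall per call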

Notice that every Linux system can deal with a lot of disk data, either in a single huge file (much bigger than the available RAM: you could have a 50 GB file on your laptop and a terabyte file on your desktop!) or spread across many files.
