Don't know about python, but I've moved Java applications from Windows to Linux and vice-versa. Java makes the "write once, run anywhere" claim which may not be 100% true, but with very little work I was able to make it true enough (basically everything works great on Linux, a few issues on Windows).
I'll use W and L for Windows and Linux:
W: files and folders are case insensitive. L: case sensitive. Test file name capitalization carefully on Linux because Windows hides these issues.
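A quick way to see the difference from Python (the file name below is invented for the demo):

```python
import os
import tempfile

# On Linux, a file name lookup with different capitalization fails;
# on Windows, the same lookup would succeed.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "Config.txt"), "w").close()
    print(os.path.exists(os.path.join(d, "Config.txt")))  # True everywhere
    print(os.path.exists(os.path.join(d, "config.txt")))  # False on Linux, True on Windows
```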
W: has a more granular file permission system (ACLs) that lets you combine various users, groups, and permissions per file. L: has a simpler default model of one owner and one group per file or folder, plus read/write/execute bits. There are other little differences too: the setgid bit on a directory makes new files inherit the directory's group (somewhat like the permission inheritance you get in Windows), whereas by default each file belongs to the user and group that created it; and the execute bit on a directory controls whether you can enter it at all. These issues mostly come into play when zipping and unzipping files, for instance during an install.
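The Linux owner/group/other bits can be inspected and set from Python's standard library; a minimal sketch using a throwaway file:

```python
import os
import stat
import tempfile

# Create a temporary file and give it rwx for the owner, r-x for the group,
# and nothing for others (the classic Linux permission model).
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o750)
mode = os.stat(path).st_mode
print(stat.filemode(mode))  # -rwxr-x---
os.remove(path)
```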
W: drives are mounted in the root folder as letters. L: drives can be mounted anywhere, under any path. A single file can also appear in multiple places in your file system (via symlinks or hard links).
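Symlinks can be created and checked from Python as well; a small sketch (the file names are invented):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "data.txt")
    link = os.path.join(d, "alias.txt")
    with open(target, "w") as fh:
        fh.write("hello")
    os.symlink(target, link)  # one file, reachable under two paths
    print(os.path.samefile(target, link))  # True
    with open(link) as fh:
        print(fh.read())  # hello
```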
Folder separator: W: \ L: /
Path separator: W: ; L: :
End of line in a text file: W: \r\n L: \n
Default character set: W: ISO-8859-1 L: UTF-8
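From Python you rarely need to hard-code any of these four differences; the standard library exposes them all:

```python
import os
import locale

print(repr(os.sep))      # folder separator: '/' on Linux, '\\' on Windows
print(repr(os.pathsep))  # path separator: ':' on Linux, ';' on Windows
print(repr(os.linesep))  # end of line: '\n' on Linux, '\r\n' on Windows
print(locale.getpreferredencoding())  # default character set, e.g. 'UTF-8'

# os.path.join picks the right folder separator for you:
print(os.path.join("conf", "app.ini"))  # conf/app.ini on Linux
```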
You need to know which Linux distribution you are targeting. Two areas of difference are how System V init scripts are handled and how super-user tasks are performed (sudo vs su). Also you mentioned an install script. Apt and Yum are popular, but you need to work with the tool that your distribution uses. Use yum on RedHat, apt on Debian, etc.
This is why you need a Linux machine for testing, whether virtual or physical. It must use the exact same distribution you are targeting. Have someone set up a dual-boot on an old server or something. I also strongly recommend Cygwin for every developer. The file permissions aren't quite the same as on Linux, and you can make it case sensitive (though it's more useful case insensitive on Windows), but it makes a pretty reasonable test bed.
It doesn't hurt you to know both (Windows and Linux) and once you do, you can make an informed choice about what works best for you. I was a Windows-only developer for the first 10 years of my career. I've been almost purely Linux for the last 4-6 years, so some of my Windows information might be old. I still run Windows in a virtual machine to do testing on Internet Explorer.
One thing you will get used to quickly on Linux is that you can solve most problems by Googling the error message. 90% of command line tools tell you how they work if you type "man <toolname>". If you really need it, most source code is easily available, depending on the distribution. When I solve a problem in Linux, I feel like I learned something about how computers really work. In Windows, I feel like I just keep blindly trying stuff until something works. When I find the solution, I'm lucky to remember everything I tried, let alone know what it all means.
So I'd encourage you to spend some of your own time learning Linux and this job might be a way to get paid for some portion of that learning. But don't mistake temptation for opportunity. If the time frame is truly short, or money is tight, you may have to say that you have to deploy to what you know (Windows) or not take the job.
I have a similar system but a different concept:

- I use cron to auto-start processes.
- Check whether the related process is running.
- If it is not running, start the process.

The module uses the psutil package to get a list of running processes, searches for the related process, and returns whether it is in the list or not. That may or may not sound practical depending on your use case:
import psutil


class ProcessControl(object):
    """
    This class checks whether the given python file is running or not. If the check_params flag is set
    to True, it checks whether an instance of the python file is running with all of the given
    parameters. If check_params is set to False, the check is made using only the python file name, so
    instances of the file running with different parameters are also counted.
    """

    def __init__(self, filename, *args):
        self.filename = filename
        self.args = args
        self.arg_num = len(args)

    def process_count(self, check_params):
        process_count = 0
        # Examine the process list to check if the given process is running or not...
        for _prs in psutil.process_iter():
            try:
                _cmdline = _prs.cmdline()
            except TypeError:
                # Older psutil versions expose cmdline as an attribute, not a method
                _cmdline = _prs.cmdline
            if len(_cmdline) >= 2 and "python" in _cmdline[0] and self.filename in _cmdline[1] and _prs.is_running():
                # We found a running instance of the process. Since this control function is triggered
                # when we run the python code, there will be at least one running process (this one).
                # Counting 2 or more processes means another instance was still running when we
                # triggered the python code file.
                if check_params:
                    # Also require all of the given parameters to match
                    if len(_cmdline) == 2 + self.arg_num and all(str(_arg) in _cmdline[2:] for _arg in self.args):
                        process_count += 1
                else:
                    process_count += 1
        return process_count

    def is_running(self, check_params=True):
        return self.process_count(check_params) > 1
I have small Python files which all have code similar to this:

Ex: myFile.py

from process_control import ProcessControl  # assuming the class above is saved as process_control.py


class MyCodeClass:
    def run_code(self):
        ...


if __name__ == "__main__":
    my_code = MyCodeClass()
    try:
        if ProcessControl(__file__).is_running():
            print("Already Running")
        else:
            my_code.run_code()
    except Exception as e:
        print(e)
And finally my crontab has lines that trigger this file:

* * * * * python myFile.py

Logic: cron triggers the file every x minutes, and ProcessControl checks whether it is already running. When calling the is_running method, you can pass a bool value so the controller will either compare or ignore the parameters that were passed when running the python file. If check_params is True, the following two commands are treated as different programs and both will be triggered:

python myOtherFile.py 127.0.0.1 220
python myOtherFile.py 127.0.0.1 250

Disabling check_params will evaluate the second call as the same program, so it will not be triggered.
Best Answer
To answer your question, you really should benchmark (i.e. measure the execution time of several variants of your program). I guess it might depend on how many small files you need (10 thousand files is not the same as 10 billion files), and what file system you are using. You could use tmpfs file systems. It also obviously depends on the hardware (SSDs are faster). I would also suggest avoiding putting a large number of files in the same directory. So prefer
dir01/file001.txt ... dir01/file999.txt
dir02/file001.txt ...

to

file00001.txt ... file99999.txt
i.e. have directories with, e.g., at most a thousand files each. I would also advise avoiding a large number of tiny files (e.g. files with less than a hundred bytes of data each): they make a lot of file systems unhappy, since each file needs at least its own inode.
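A tiny helper for the dirNN/fileNNN.txt layout above (the function name and the 999-files-per-directory cap are my own choices for the sketch):

```python
def sharded_path(index, per_dir=999):
    """Map a running file index to the dir01/file001.txt ... dir01/file999.txt,
    dir02/file001.txt ... layout, with at most per_dir files per directory."""
    d = index // per_dir + 1
    f = index % per_dir + 1
    return "dir%02d/file%03d.txt" % (d, f)

print(sharded_path(0))    # dir01/file001.txt
print(sharded_path(998))  # dir01/file999.txt
print(sharded_path(999))  # dir02/file001.txt
```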
However, you should perhaps consider other alternatives, like using a database (which might be as simple as SQLite...) or using some indexed file (like gdbm...). Regarding RAM, the kernel tries quite hard to keep file data in RAM. See e.g. linuxatemyram.com; read about posix_fadvise(2), fsync(2), readahead(2), ...
BTW, Python code will ultimately call C code and use the same (kernel provided) syscalls(2). Most file system related processing happens inside the Linux kernel. So it won't be faster (unless Python adds its own user-space buffering to e.g. read(2) data in megabyte chunks, hence lowering the number of executed syscalls).
Notice that every Linux system is able to deal with a lot of disk data, either with a single huge file (much bigger than available RAM: you could have a 50Gbyte file on your laptop, and a terabyte file on your desktop!) or in many files.
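The "just benchmark it" advice above can be sketched in a few lines; the file count and payload size here are arbitrary, and real numbers will depend heavily on your file system and hardware:

```python
import os
import tempfile
import time

def write_small_files(root, n, payload):
    # One tiny file per record
    for i in range(n):
        with open(os.path.join(root, "f%04d.txt" % i), "wb") as fh:
            fh.write(payload)

def write_one_file(root, n, payload):
    # All records appended to a single file
    with open(os.path.join(root, "big.bin"), "wb") as fh:
        for _ in range(n):
            fh.write(payload)

payload = b"x" * 100
with tempfile.TemporaryDirectory() as root:
    t0 = time.perf_counter()
    write_small_files(root, 1000, payload)
    t_small = time.perf_counter() - t0

    t0 = time.perf_counter()
    write_one_file(root, 1000, payload)
    t_one = time.perf_counter() - t0

print("1000 small files: %.4fs, one file: %.4fs" % (t_small, t_one))
```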