So I have some fairly gigantic .gz files: we're talking 10 to 20 GB each when decompressed.
I need to loop through each line of them, so I'm using the standard:
import gzip
f = gzip.open(path+myFile, 'r')
for line in f.readlines():
    #(yadda yadda)
f.close()
However, both the open() and close() commands take AGES, using up 98% of the memory+CPU. So much so that the program exits and prints Killed to the terminal. Maybe it is loading the entire extracted file into memory?
I'm now using something like:
from subprocess import call
f = open(path+'myfile.txt', 'w')
call(['gunzip', '-c', path+myfile], stdout=f)
#do some looping through the file
f.close()
#then delete extracted file
This works. But is there a cleaner way?
Best Answer
I'm 99% sure that your problem is not in the gzip.open(), but in the readlines().
As the documentation explains, readlines() reads the whole file and returns a list containing every line.
Obviously, that requires reading and decompressing the entire file, and building up an absolutely gigantic list.
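To make that concrete, here is a minimal sketch, assuming Python 3. The file name sample.txt.gz and the 1000-line size are made up for illustration; the real files in the question are far larger. It shows that readlines() hands back an ordinary list holding every line at once:

```python
import gzip
import os
import tempfile

# Hypothetical small test file, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, "wt") as out:
    for i in range(1000):
        out.write(f"line {i}\n")

with gzip.open(path, "rt") as f:
    # Decompresses the whole file and stores every line in memory.
    lines = f.readlines()

print(type(lines).__name__, len(lines))
```

With a 20 GB file, that list is what ends up eating all of the memory.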
Most likely, it's actually the malloc calls to allocate all that memory that are taking forever. And then, at the end of this scope (assuming you're using CPython), it has to GC that whole gigantic list, which will also take forever.
You almost never want to use readlines. Unless you're using a very old Python, just loop over the file itself: for line in f: instead of for line in f.readlines():.
A file is an iterable full of lines, just like the list returned by readlines, except that it's not actually a list; it generates more lines on the fly by reading out of a buffer. So, at any given time, you'll only have one line and a couple of buffers on the order of 10MB each, instead of a 25GB list. And the reading and decompressing will be spread out over the lifetime of the loop, instead of done all at once.
From a quick test, with a 3.5GB gzip file, gzip.open() is effectively instant, for line in f: pass takes a few seconds, and f.close() is effectively instant. But if I do for line in f.readlines(): pass, it takes… well, I'm not sure how long, because after about a minute my system went into swap thrashing hell and I had to force-kill the interpreter to get it to respond to anything…
Since this has come up a dozen more times since this answer, I wrote this blog post which explains a bit more.
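For reference, a minimal sketch of the lazy version, assuming Python 3 (where "rt" opens the gzip file in text mode). The helper name count_lines, the file name big.txt.gz, and the 5000-line test file are hypothetical, and the counter is only a stand-in for whatever per-line work you actually need:

```python
import gzip
import os
import tempfile


def count_lines(gz_path):
    """Walk a gzip file line by line without building a list of all lines."""
    count = 0
    # The with block closes the file even if the loop raises.
    with gzip.open(gz_path, "rt") as f:
        for line in f:  # lines are decompressed on demand from a buffer
            count += 1  # stand-in for the real per-line processing
    return count


# Hypothetical test file so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "big.txt.gz")
with gzip.open(path, "wt") as out:
    for i in range(5000):
        out.write(f"record {i}\n")

print(count_lines(path))
```

The only change from the code in the question is dropping .readlines() from the loop and letting with handle the close, so no temporary extracted file is needed.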