Reading CSV in Julia is slow compared to Python

Reading large text/CSV files in Julia takes a long time compared to Python. Here are the times to read a 486.6 MB file with 153895 rows and 644 columns.

Python 3.3 example

import pandas as pd
import time
start=time.time()
myData=pd.read_csv("C:\\myFile.txt",sep="|",header=None,low_memory=False)
print(time.time()-start)

Output: 19.90

R 3.0.2 example

system.time(myData<-read.delim("C:/myFile.txt",sep="|",header=F,
   stringsAsFactors=F,na.strings=""))

Output:
User    System  Elapsed
181.13  1.07    182.32

Julia 0.2.0 (Julia Studio 0.4.4) example # 1

using DataFrames
timing = @time myData = readtable("C:/myFile.txt",separator='|',header=false)

Output:
elapsed time: 80.35 seconds (10319624244 bytes allocated)

Julia 0.2.0 (Julia Studio 0.4.4) example # 2

timing = @time myData = readdlm("C:/myFile.txt",'|',header=false)

Output:
elapsed time: 65.96 seconds (9087413564 bytes allocated)

Julia is faster than R, but quite slow compared to Python. What can I do differently to speed up reading a large text file?
A separate issue is memory use: in Julia the data takes 18 times the on-disk file size in memory, but only 2.5 times in Python. In Matlab, which I have found to be the most memory-efficient for large files, it is 2 times the on-disk file size. Is there a particular reason for the large in-memory size in Julia?
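As a rough sanity check on these ratios, the figures quoted above can be turned into a back-of-the-envelope estimate. The sketch below is in Python and assumes (this is not stated in the question) that every cell is numeric and stored as an 8-byte float, and that MB means 10^6 bytes. The variable names are illustrative only.

```python
# Rough size estimate using the figures quoted in the question.
# Assumption (not stated in the question): every cell is an 8-byte float.
rows, cols = 153895, 644
file_mb = 486.6                      # on-disk size, taken as 10**6-byte megabytes

cells = rows * cols
dense_mb = cells * 8 / 1e6           # one contiguous 8-byte value per cell
bytes_per_cell_on_disk = file_mb * 1e6 / cells
bytes_per_cell_at_18x = 18 * file_mb * 1e6 / cells

print(f"cells: {cells}")
print(f"dense float size: {dense_mb:.1f} MB ({dense_mb / file_mb:.2f} x file size)")
print(f"average bytes per cell on disk: {bytes_per_cell_on_disk:.2f}")
print(f"bytes per cell at 18 x file size: {bytes_per_cell_at_18x:.1f}")
```

Under that assumption a dense numeric array would take about 1.6 times the file size, so an 18 x footprint works out to roughly 88 bytes per cell, which suggests per-value overhead rather than the raw data itself.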

Best Answer

The best answer is probably that I'm not as good a programmer as Wes.

In general, the code in DataFrames is much less well-optimized than the code in Pandas. I'm confident that we can catch up, but it will take some time as there's a lot of basic functionality that we need to implement first. Since there's so much that needs to be built in Julia, I tend to focus on doing things in three parts: (1) build any version, (2) build a correct version, (3) build a fast, correct version. For the work I do, Julia often doesn't offer any versions of essential functionality, so my work gets focused on (1) and (2). As more of the tools I need get built, it'll be easier to focus on performance.

As for memory usage, I think the answer is that we use a set of data structures when parsing tabular data that's much less efficient than those used by Pandas. If I knew the internals of Pandas better, I could list off places where we're less efficient, but for now I'll just speculate that one obvious failing is that we're reading the whole dataset into memory rather than grabbing chunks from disk. This certainly can be avoided and there are issues open for doing so. It's just a matter of time.
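To make the "grabbing chunks from disk" idea concrete, here is a minimal sketch in Python using only the standard library. It is not how readtable or Pandas is implemented; the function name `read_in_chunks`, the chunk size, and the in-memory sample data are all assumptions made for illustration.

```python
import csv
import io
from itertools import islice

def read_in_chunks(handle, delimiter="|", chunk_rows=10000):
    """Yield lists of at most chunk_rows parsed rows from an open text file."""
    reader = csv.reader(handle, delimiter=delimiter)
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            return
        yield chunk

# Tiny in-memory stand-in for a large pipe-delimited file (hypothetical data).
sample = io.StringIO("1|2|3\n4|5|6\n7|8|9\n")
column_sums = [0.0, 0.0, 0.0]
for chunk in read_in_chunks(sample, chunk_rows=2):
    # Only one chunk of rows is held in memory at a time.
    for row in chunk:
        for j, value in enumerate(row):
            column_sums[j] += float(value)
print(column_sums)
```

Peak memory then depends on the chunk size rather than the file size, at the cost of having to write the downstream processing so that it works incrementally.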

On that note, the readtable code is fairly easy to read. The most certain way to get readtable to be faster is to whip out the Julia profiler and start fixing the performance flaws it uncovers.

Related Solutions

R – Linking R and Julia

I too have been looking at Julia ever since Doug Bates sent me a heads-up in January. But like @gsk3, I measure this on an "Rcpp scale" as I would like to pass rich R objects to Julia. And that does not seem to be supported at all right now.

Julia has a nice and simple C interface. So that gets us something like .C(). But as recently discussed on r-devel, you really do not want .C(); in most cases you would rather use .Call() in order to pass actual SEXP variables representing real R objects. So right now I see little scope for Julia from R because of this limitation.

Maybe an indirect interface using TCP/IP to Rserve could be a first start before Julia matures a little and we get a proper C++ interface. Or we use something based on Rcpp to get from R to C++ before we enter an intermediate layer [which someone would have to write] from which we feed data to Julia, just like the actual R API only offers a C layer. I don't know.

At the end of the day, some patience may be needed. I started to look at R around 1996 or 1997 when Fritz Leisch made the first announcements on the comp.os.linux.announce newsgroup. And R had rather limited facilities then (but the full promise of the S language, of course, so we knew we had a winner). And a few years later I was ready to make it my primary modeling language. At that time CRAN still had way fewer than 100 packages...

Julia may well get there. But for now I suspect many of us will get work done in R, and have just a few curious glimpses at Julia.

Julia (Julia-lang) Performance Compared to Fortran and Python

I have followed the Julia project for a while now, and I have some comments on the code that might be relevant.

It seems like you run a substantial amount of the code in global scope. The global environment is currently very slow in Julia, because the types of all variables have to be checked on every iteration. Loops should usually be written in a function.
You seem to use array slicing. Currently that makes a copy because Julia does not have fast Array views. You might try switching to subarray, but subarrays are currently much slower than they should be.
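The copy-versus-view distinction in the second point is not specific to Julia. As a rough illustration only, here is the same contrast in Python using the standard library (`array` and `memoryview`); this is an analogy, not the Julia code under discussion, and the variable names are made up.

```python
from array import array

data = array("d", [0.0, 1.0, 2.0, 3.0, 4.0])

# Slicing an array builds a new array: writing to it leaves `data` untouched.
copied = data[1:4]
copied[0] = 99.0

# A memoryview slice shares the original buffer: writing through it changes `data`.
view = memoryview(data)[1:4]
view[0] = -1.0

print(data[1], copied[0])
```

A copy costs an allocation plus a pass over the elements each time a slice is taken, which is why slicing inside a hot loop can dominate the run time.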

The loading time of PyPlot (and any other package) is a known issue; it is slow because parsing Julia code and compiling it to machine code is time-consuming. There are ideas about caching the result so that loading becomes instantaneous, but that work is not finished yet. The Base library is currently cached in compiled state, so most of the infrastructure is on the master branch now.

ADDED: I tried to run the test in an isolated function and got these results. See this gist

Parsing:

elapsed time: 0.334042578 seconds (11797548 bytes allocated)

And three consecutive runs of the main test loop:

elapsed time: 0.62999287 seconds (195210884 bytes allocated)
elapsed time: 0.39398753 seconds (184735016 bytes allocated)
elapsed time: 0.392036875 seconds (184735016 bytes allocated)

Notice how the timing improved after the first run, because the compiled code was used again.

Update 2: With some improved memory handling (ensuring reuse of arrays, because assignment does not copy), I got the timing down to 0.2 seconds (on my machine). There is definitely more that could be done to avoid allocating new arrays, but then it starts to get a little tricky.

This line does not do what you think:

vx_old = vx

but this does what you want:

copy!(vx_old, vx)

I also devectorized one loop, changing:

x += 0.5*(vx + vx_old)*delta_t
y += 0.5*(vy + vy_old)*delta_t

to:

for i = 1:nvortex
    x[i] += 0.5*(vx[i] + vx_old[i])*delta_t
    y[i] += 0.5*(vy[i] + vy_old[i])*delta_t
end
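The difference between rebinding a name and copying into an existing array, as in the `vx_old = vx` versus `copy!(vx_old, vx)` lines above, can be seen in other languages too. The following is a small Python sketch of the same pitfall, offered as an analogy rather than a translation of the original program; `alias` is a name introduced here for illustration.

```python
vx = [1.0, 2.0, 3.0]
vx_old = [0.0, 0.0, 0.0]     # preallocated buffer that should hold the old values

alias = vx                   # like `vx_old = vx`: a second name for the same list
vx_old[:] = vx               # like `copy!(vx_old, vx)`: overwrite contents in place

vx[0] = 10.0                 # update the "current" values
print(alias[0], vx_old[0])
```

The alias follows every later change to `vx`, so it cannot serve as a record of the previous step, while the in-place copy keeps the old values and reuses the existing buffer instead of allocating a new one.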
