Performance Optimization – Improving Grep Performance on Large Files

algorithms, bash, performance, perl

I have FILE_A, which has over 300,000 lines, and FILE_B, which has over 30 million lines. I created a Bash script that greps each line of FILE_A in FILE_B and writes the result of the grep to a new file.

This whole process is taking over 5 hours.

How can I improve the performance of my script?

I'm using grep -F -m 1 as the grep command. FILE_A looks like this:

123456789 
123455321

and FILE_B looks like this:

123456789,123456789,730025400149993,
123455321,123455321,730025400126097,

So in Bash I have a while loop that reads the next line of FILE_A and greps for it in FILE_B. When the pattern is found in FILE_B, I write it to the file result.txt.

while read -r line; do
   # Scan the 30M-line file for the current pattern; stop at the first match
   grep -F -m 1 "$line" 30MFile
done < 300KFile > result.txt

Best Answer

Try using grep --file=FILE_A. It almost certainly loads the patterns into memory, meaning it will only scan FILE_B once.

grep -F -m 1 --file=300KFile 30MFile
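
As a rough sketch of the full command (assuming the files are named 300KFile and 30MFile as above, and that the output should go to result.txt as in the question), the single-pass version would look like this. One caveat: with a pattern file, -m 1 makes grep stop after the first matching line overall, not after one match per pattern, so you may want to drop it if you expect a hit for every line of FILE_A:

# Load all 300K patterns once, then scan the 30M-line file a single time
grep -F --file=300KFile 30MFile > result.txt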