Most Efficient Way to Search Large Strings in Java

java

I have a question that I have been thinking about recently but I am still not sure of the answer(s) so I would like a broader opinion.

Imagine the scenario where text strings of up to lets say 500,000,000 characters are processed in Java. But each string of text is searched for the words "the", "and" and "yes".

What is the most efficient way of completing this task?

Best Answer

Efficient in terms of space or size? Will the string change, and if so will it be appended to or arbitrarily changed? How is the string created in the first place i.e where does it come from?

Not knowing the answers to these crucial questions I be tempted to think of (and throughly test) a number of approaches:

Think about indexing substrings at the time of string creation. You need to create this string originally, either by reading it from a file, growing it as part of an algorithm's output, or generating it in its entirety algorithmically. At time of creation, as the characters are 'going into' the string, think about searching in the input stream and indexing positions in the string that the substrings occur so at the end you then have a table mapping the substrings to list of locations that the substrings start at. This means that you are building an index at the time of string creation/reading (much like an insert into a DB). You need to consider whether the string changes though and although appending works OK with this approach, insertion is not so easy (although after a mid-string change, adjusting the index of substrings wouldn't require a new computation, just calculated offset additions or subtraction). If you need to search for arbitrary words then at the time of build, create an index that maps all words to locations in the string. It'll be a very big index, but you haven't indicated whether you are interested in speed or space.
Consider using Java language constructs that already exist. You may need to subdivide the string into arrays of substrings to ensure in-memory operations, and then use pattern matching or split() on the substrings and check the beginning and end of those strings to ensure that you haven't subdivided on a word you are looking for.
Consider using tools in the underlying OS - for example, if you're on *nix platform then execute a 'grep' on the string (saved as a file?) or similar tool. The work on searching efficiently has already been done by the OS command developer and has undertaken decades of continual improvement and the additional time of firing up a shell and grepping may well be outweighed by the efficiency of the tool.
This sounds like DNA sequence matching and there will be an existing body of knowledge covering the search of these sequences which may be useful to this problem for you. I'd definitely check that out.

You'd need to test these, and it will depend on whether your string is constantly changing or not. My feeling is that the building of indexes at the time of creating/updating the string is probably the right way to go.

Related Solutions

Java – How to Transition from JavaScript to Java

Using arrays for everything is a bad idea. Even in JavaScript. But in Java even more so, because arrays are far more narrow in Java.

The elements of your return array are in fact representing results. You should communicate it accordingly:

class Result {
     public int someInt;//no idea what this actually is, but a class will allow you giving it a proper name
     public String someString;//same as with the other field. Also, I suppose String is actually wrong here, since it rather looks like a '-'-separated list of integers, which should rather be represented as an array of integers
     Result(String someString, int someInt) {
           this.someString = someString;
           this.someInt = someInt;
     }
}

And then your return value should look something like:

ArrayList.ofList(new Result('1-3-5', 50), new Result('2-5-8', 90), new Result('2-4-7', 75))

Java File Handling – Simulating Inner Join on Very Large Files Without Running Out of Memory

you can use a random RandomAccessFile

private void join(RandomAccessFile left, RandomAccessFile right, BufferedWriter output) throws Throwable {

    long rightIndex = right.getFilePointer();

    Record leftRecord = read(left);
    Record rightRecord;

    while((leftRecord )!=null){
        right.seek(rightIndex);

        while((rightRecord = read(right))!=null && leftRecord.getKey() > rightRecord.getKey()){
            //skip right until it can match left
            // store pointer because we don't need to return to earlier
            rightIndex = right.getFilePointer();
        }

        while(rightRecord !=null && leftRecord.getKey()==rightRecord.getKey()){
            write(_output, leftRecord, rightRecord );
            rightRecord = read(right))
        }
        leftRecord = read(left);
        //you can test the need to skip this left record if you keep the previous key and the current rightRecord.getKey()
    }
}

this will be a O(n+m) time and O(1) space algorithm

Best Answer

Related Solutions

Java – How to Transition from JavaScript to Java

Java File Handling – Simulating Inner Join on Very Large Files Without Running Out of Memory

Related Topic