Most Efficient Way to Search Large Strings in Java

java

I have a question that I have been thinking about recently but I am still not sure of the answer(s) so I would like a broader opinion.

Imagine the scenario where text strings of up to lets say 500,000,000 characters are processed in Java. But each string of text is searched for the words "the", "and" and "yes".

What is the most efficient way of completing this task?

Best Answer

Efficient in terms of space or size? Will the string change, and if so will it be appended to or arbitrarily changed? How is the string created in the first place i.e where does it come from?

Not knowing the answers to these crucial questions I be tempted to think of (and throughly test) a number of approaches:

  1. Think about indexing substrings at the time of string creation. You need to create this string originally, either by reading it from a file, growing it as part of an algorithm's output, or generating it in its entirety algorithmically. At time of creation, as the characters are 'going into' the string, think about searching in the input stream and indexing positions in the string that the substrings occur so at the end you then have a table mapping the substrings to list of locations that the substrings start at. This means that you are building an index at the time of string creation/reading (much like an insert into a DB). You need to consider whether the string changes though and although appending works OK with this approach, insertion is not so easy (although after a mid-string change, adjusting the index of substrings wouldn't require a new computation, just calculated offset additions or subtraction). If you need to search for arbitrary words then at the time of build, create an index that maps all words to locations in the string. It'll be a very big index, but you haven't indicated whether you are interested in speed or space.

  2. Consider using Java language constructs that already exist. You may need to subdivide the string into arrays of substrings to ensure in-memory operations, and then use pattern matching or split() on the substrings and check the beginning and end of those strings to ensure that you haven't subdivided on a word you are looking for.

  3. Consider using tools in the underlying OS - for example, if you're on *nix platform then execute a 'grep' on the string (saved as a file?) or similar tool. The work on searching efficiently has already been done by the OS command developer and has undertaken decades of continual improvement and the additional time of firing up a shell and grepping may well be outweighed by the efficiency of the tool.

  4. This sounds like DNA sequence matching and there will be an existing body of knowledge covering the search of these sequences which may be useful to this problem for you. I'd definitely check that out.

You'd need to test these, and it will depend on whether your string is constantly changing or not. My feeling is that the building of indexes at the time of creating/updating the string is probably the right way to go.

Related Topic