Following the below question on Stack Overflow
https://stackoverflow.com/questions/5564610/fast-alernative-for-stringindexofstring-str
I got to wondering why it is that java (6 at least) not use a more efficient implementation?
Following is the code:
java.lang.String#indexOf(String str)
1762 static int indexOf(char[] source, int sourceOffset, int sourceCount,
1763 char[] target, int targetOffset, int targetCount,
1764 int fromIndex) {
1765 if (fromIndex >= sourceCount) {
1766 return (targetCount == 0 ? sourceCount : -1);
1767 }
1768 if (fromIndex < 0) {
1769 fromIndex = 0;
1770 }
1771 if (targetCount == 0) {
1772 return fromIndex;
1773 }
1774
1775 char first = target[targetOffset];
1776 int max = sourceOffset + (sourceCount - targetCount);
1777
1778 for (int i = sourceOffset + fromIndex; i <= max; i++) {
1779 /* Look for first character. */
1780 if (source[i] != first) {
1781 while (++i <= max && source[i] != first);
1782 }
1783
1784 /* Found first character, now look at the rest of v2 */
1785 if (i <= max) {
1786 int j = i + 1;
1787 int end = j + targetCount - 1;
1788 for (int k = targetOffset + 1; j < end && source[j] ==
1789 target[k]; j++, k++);
1790
1791 if (j == end) {
1792 /* Found whole string. */
1793 return i - sourceOffset;
1794 }
1795 }
1796 }
1797 return -1;
1798 }
Best Answer
"Efficiency" is all about tradeoffs, and the "best" algorithm will depend on many factors. In the case of
indexOf()
, one of those factors is the expected size of strings.The JDK's algorithm is based on simple indexed reference into existing character arrays. The Knuth-Morris-Pratt that you reference needs to create a new
int[]
that's the same size as the input string. For Boyer-Moore, you need several external tables, at least one of which is two-dimensional (I think; I've never implemented B-M).So the question becomes: are allocating the additional objects and building lookup tables offset by the increased performance of the algorithm? Remember, we're not talking about a change from O(N2) to O(N), but simply a reduction in the number of steps taken for each N.
And I would expect that the JDK designers said something like "for strings less than X characters, the simple approach is faster, we don't expect regular use of strings longer than that, and people who do use longer strings will know how to optimize their searches."