Java – Why does Java’s String class not implement a more efficient indexOf()

Tags: efficiency, implementations, java, strings

Following the question below on Stack Overflow

https://stackoverflow.com/questions/5564610/fast-alernative-for-stringindexofstring-str

I got to wondering why Java (6, at least) does not use a more efficient implementation.

Following is the code:

java.lang.String#indexOf(String str)

static int indexOf(char[] source, int sourceOffset, int sourceCount,
                   char[] target, int targetOffset, int targetCount,
                   int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    char first  = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        if (source[i] != first) {
            while (++i <= max && source[i] != first);
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j] ==
                     target[k]; j++, k++);

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}
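
To make the parameter list concrete, here is a small driver of my own (not JDK code); in Java 6 the public String.indexOf(String) delegates to this package-private helper, passing each string's backing array, offset, and length:

// Illustrative call, assuming the static helper above is pasted into
// the same class. The offset/count parameters exist because a Java 6
// String can be a view over a shared char[] (e.g. after substring()).
public static void main(String[] args) {
    char[] source = "the quick brown fox".toCharArray();
    char[] target = "brown".toCharArray();
    int idx = indexOf(source, 0, source.length, target, 0, target.length, 0);
    System.out.println(idx); // prints 10
}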

Best Answer

"Efficiency" is all about tradeoffs, and the "best" algorithm will depend on many factors. In the case of indexOf(), one of those factors is the expected size of strings.

The JDK's algorithm is based on simple indexed reference into existing character arrays. The Knuth-Morris-Pratt algorithm that you reference needs to allocate a new int[] the same size as the search pattern. For Boyer-Moore, you need several external tables, at least one of which is two-dimensional (I think; I've never implemented Boyer-Moore).
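
For a concrete picture of that extra allocation, here is a minimal sketch of the KMP preprocessing step (my own illustration, not code from the JDK or from the linked question):

// Sketch of Knuth-Morris-Pratt preprocessing: builds the "failure"
// table, one int per character of the pattern. This is the extra
// allocation the simple JDK loop avoids entirely.
static int[] buildFailureTable(char[] pattern) {
    int[] fail = new int[pattern.length];
    int k = 0;                           // length of current matched prefix
    for (int i = 1; i < pattern.length; i++) {
        while (k > 0 && pattern[i] != pattern[k]) {
            k = fail[k - 1];             // fall back to a shorter prefix
        }
        if (pattern[i] == pattern[k]) {
            k++;
        }
        fail[i] = k;
    }
    return fail;
}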

So the question becomes: is the cost of allocating the additional objects and building the lookup tables offset by the improved performance of the algorithm? Remember, we're not talking about a change from O(N²) to O(N), but simply a reduction in the number of steps taken for each N.
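
To see where that reduction in steps comes from, here is the matching scan for the sketch above; the index into the source only moves forward, so after the table is built the scan never re-reads earlier source characters the way the nested JDK loop can:

// Sketch of the KMP scan, using buildFailureTable from above. The
// scan itself is O(sourceCount), versus the JDK loop's worst case of
// roughly O(sourceCount * targetCount) on pathological inputs.
static int kmpIndexOf(char[] source, char[] target) {
    if (target.length == 0) return 0;
    int[] fail = buildFailureTable(target);
    int k = 0;                              // chars of target matched so far
    for (int i = 0; i < source.length; i++) {
        while (k > 0 && source[i] != target[k]) {
            k = fail[k - 1];                // shift pattern without moving i back
        }
        if (source[i] == target[k]) {
            k++;
        }
        if (k == target.length) {
            return i - target.length + 1;   // index where the match starts
        }
    }
    return -1;
}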

And I would expect that the JDK designers said something like: "for strings shorter than X characters, the simple approach is faster; we don't expect regular use of strings longer than that; and people who do use longer strings will know how to optimize their searches."
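
If you want to test that intuition yourself, a rough timing harness along these lines would do (for anything serious use a proper benchmark framework; this single-shot version ignores JIT warmup and GC noise, and the crossover length is whatever your measurements say, not a number from the JDK):

// Very rough single-shot timing. The input is deliberately pathological
// for the naive loop: a long run of 'a' ending in 'b', searched for
// "aaab", which maximizes backtracking in the nested JDK loop.
public static void main(String[] args) {
    String haystack = new String(new char[100000]).replace('\0', 'a') + "b";
    String needle = "aaab";
    long t0 = System.nanoTime();
    int viaJdk = haystack.indexOf(needle);
    long t1 = System.nanoTime();
    int viaKmp = kmpIndexOf(haystack.toCharArray(), needle.toCharArray());
    long t2 = System.nanoTime();
    System.out.println("JDK: " + viaJdk + " in " + (t1 - t0) + " ns");
    System.out.println("KMP: " + viaKmp + " in " + (t2 - t1) + " ns");
}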