What are you using this trie for? What is the total number of words that you plan to hold, and what is the sparseness of their constituent characters? And most important, is a trie even appropriate (versus a simple map of prefix to list of words)?
Your idea of an intermediate table and replacing pointers with indexes will work, provided that you have a relatively small set of short words and a sparse character set. Otherwise you risk running out of space in your intermediate table. And unless you're looking at an extremely small set of words, you won't really save that much space: 2 bytes for a short versus 4 bytes for a reference on a 32-bit machine. If you're running on a 64-bit JVM, the savings will be more.
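For illustration, here is a minimal sketch of what such an index-based layout might look like in Java, assuming words limited to uppercase A-Z and using 0 as the "no child" marker; the class and field names are made up for the example:

```java
import java.util.Arrays;

// Sketch of a trie whose child links are short indexes into a shared node
// table instead of object references. Assumes uppercase A-Z only; a slot
// value of 0 means "no child", so node 0 is reserved for the root.
class ShortIndexTrie {
    private static final int ALPHABET = 26;
    private short[][] children = { new short[ALPHABET] }; // node 0 = root
    private int size = 1;

    void insert(String word) {
        int node = 0;
        for (char c : word.toCharArray()) {
            int slot = c - 'A';
            if (children[node][slot] == 0) {
                if (size == Short.MAX_VALUE) {
                    // this is the "running out of space" risk mentioned above
                    throw new IllegalStateException("node table full");
                }
                children = Arrays.copyOf(children, size + 1); // naive growth, for brevity
                children[size] = new short[ALPHABET];
                children[node][slot] = (short) size++;
            }
            node = children[node][slot];
        }
    }
}
```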
Your idea about breaking the characters into 4-bit chunks probably won't save you much, unless all of your expected characters are in an extremely limited range (maybe OK for words limited to uppercase US-ASCII, not likely with a general Unicode corpus).
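As a quick illustration of that idea, splitting a Java char into 4-bit chunks could look something like the following sketch (the helper name is made up): each node then needs at most 16 child slots, but every character costs four levels of depth instead of one.

```java
// Splits one 16-bit char into four 4-bit chunks, most significant first.
static int[] nibbles(char c) {
    return new int[] { (c >>> 12) & 0xF, (c >>> 8) & 0xF, (c >>> 4) & 0xF, c & 0xF };
}
```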
If you have a sparse character set, then a HashMap<Character,Map<...>> might be your best implementation. Yes, each entry will be much larger, but if you don't have many entries you'll get an overall win. (As a side note: I always thought it was funny that the Wikipedia article on tries showed -- maybe still does -- an example based on a hashed data structure, completely ignoring the space/time tradeoffs of that choice.)
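A minimal sketch of that map-based node, assuming the usual "insert walks down, creating children as needed" shape (class and field names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Trie node whose children are held in a HashMap keyed by Character.
// Each entry costs more space, but only characters that actually occur
// get a slot.
class MapTrieNode {
    private final Map<Character, MapTrieNode> children = new HashMap<>();
    private boolean isWord;

    void insert(String word) {
        MapTrieNode node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new MapTrieNode());
        }
        node.isWord = true;
    }
}
```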
Finally, you might want to avoid a trie altogether. If you're looking at a corpus of normal words in a human language (10,000 words in active use, with words 4-8 characters long), you'll probably be MUCH better off with a HashMap<String,List<String>>, where the key is the entire prefix.
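For a corpus of that size, the prefix map could be as simple as the following sketch (names are illustrative): every prefix of every word maps straight to the words that start with it, so a lookup is a single hash probe. You trade a handful of extra entries per word for not walking a tree at all.

```java
import java.util.*;

// Sketch of the prefix-map alternative: every prefix of every word is a
// key, and the value is the list of words starting with that prefix.
class PrefixIndex {
    private final Map<String, List<String>> byPrefix = new HashMap<>();

    void add(String word) {
        for (int i = 1; i <= word.length(); i++) {
            byPrefix.computeIfAbsent(word.substring(0, i), k -> new ArrayList<>())
                    .add(word);
        }
    }

    List<String> wordsWithPrefix(String prefix) {
        return byPrefix.getOrDefault(prefix, Collections.emptyList());
    }
}
```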
Best Answer
Chances are that this method only supports the Basic Multilingual Plane of Unicode. That plane contains the lowest 64k codepoints and can be represented with a 16-bit data type.
There was a time when the BMP was all the Unicode standard defined, and at that time many languages and/or runtimes added "Unicode support". They assumed that 16 bits would always be enough, and therefore "Unicode" equals "16-bit characters" in many places (even though this is wrong these days). To be fair: the Unicode consortium also thought that 16 bits ought to be enough for everybody.
Unicode 2.0, however, introduced additional planes, and it became clear that 16 bits are no longer enough to represent every possible Unicode codepoint.
The "solution" to this is usually to use UTF-16 instead of UCS-2. I'm not only faulting .NET for this: Java has fallen into the same trap, having a 16-bit
char
data type and now having to supportString
instances that need 2 "characters" to represent a single codepoint.