Efficient Trie Implementation for Unicode Strings

I have been looking for an efficient String trie implementation. Mostly I have found code like this:

Reference implementation in Java (per Wikipedia)
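
For concreteness, here is a minimal sketch of that pattern (the class and field names are mine, not taken from the Wikipedia article): every node allocates one reference slot per possible byte value, whether or not it is ever used.

```java
/**
 * Sketch of the pattern being criticized: a 256-slot child array
 * in every node, one slot per possible byte value.
 */
class ArrayTrie {
    private static final class Node {
        final Node[] children = new Node[256]; // 256 references, mostly null
        boolean terminal;                      // true if a key ends here
    }

    private final Node root = new Node();

    public void insert(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            int c = key.charAt(i) & 0xFF; // silently truncates non-Latin chars
            if (node.children[c] == null) {
                node.children[c] = new Node();
            }
            node = node.children[c];
        }
        node.terminal = true;
    }

    public boolean contains(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children[key.charAt(i) & 0xFF];
            if (node == null) return false;
        }
        return node.terminal;
    }
}
```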

I dislike these implementations for mostly two reasons:

  1. They support only 256 characters (essentially extended ASCII), while I need to cover alphabets such as Cyrillic.
  2. They are extremely memory inefficient.

Each node contains an array of 256 references, which is 2048 bytes on a 64-bit JVM (8 bytes per reference, without compressed oops). Each of these nodes can have up to 256 subnodes, each with its own 2KB array of references. So a full trie covering every 2-character string would need 257 such arrays, about half a megabyte. Three character strings? Roughly 128MB just for the arrays in the nodes. And so on.

Of course I don't intend to hold all 16 million possible three-character strings in my trie, so most of that space is simply wasted. Most of the array slots are null references, since their capacity far exceeds the actual number of inserted keys. And if I add Unicode, the arrays get even larger: in Java a char has 65,536 possible values instead of 256.

Is there any hope of making an efficient trie for strings?
I have considered a couple of improvements over these types of implementations:

  • Instead of using an array of references, each node could hold an array of a primitive integer type (a short, say) whose values index into a single shared array of nodes, sized close to the actual node count.
  • I could break each character into 4-bit parts (nibbles), which would allow node arrays of size 16 at the cost of a tree four times as deep; see the sketch after this list.
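
To make the second idea concrete, here is one possible reading of it as a minimal sketch (the names are mine, and this is not an established implementation): each 16-bit char is split into four nibbles, most significant first, so every node needs at most 16 child slots.

```java
/**
 * Sketch of a nibble trie: each char contributes four 4-bit steps,
 * so child arrays shrink from 256 (or 65,536) slots to 16.
 */
class NibbleTrie {
    private static final class Node {
        Node[] children;  // lazily allocated, at most 16 entries
        boolean terminal; // true if a key ends at this node
    }

    private final Node root = new Node();

    public void insert(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            // Walk the four nibbles of the char, most significant first.
            for (int shift = 12; shift >= 0; shift -= 4) {
                int nibble = (c >>> shift) & 0xF;
                if (node.children == null) node.children = new Node[16];
                if (node.children[nibble] == null) node.children[nibble] = new Node();
                node = node.children[nibble];
            }
        }
        node.terminal = true;
    }

    public boolean contains(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            for (int shift = 12; shift >= 0; shift -= 4) {
                if (node.children == null) return false;
                node = node.children[(c >>> shift) & 0xF];
                if (node == null) return false;
            }
        }
        return node.terminal;
    }
}
```

The price is four node hops per character instead of one, so lookups do more pointer chasing even though each individual node is small.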

Best Answer

What are you using this trie for? What is the total number of words that you plan to hold, and what is the sparseness of their constituent characters? And most important, is a trie even appropriate (versus a simple map of prefix to list of words)?

Your idea of an intermediate table and replacing pointers with indexes will work, provided that you have a relatively small set of short words and a sparse character set. Otherwise you risk running out of space in your intermediate table. And unless you're looking at an extremely small set of words, you won't really save that much space: 2 bytes for a short versus 4 bytes for a reference on a 32-bit machine. If you're running on a 64-bit JVM, the savings will be larger, since references there are 8 bytes.
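
A minimal sketch of what that intermediate-table layout might look like, assuming an 8-bit alphabet for brevity (all names here are illustrative): each node stores short indexes into one shared node list, and the short type itself is what caps the table, at 32,767 nodes in this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IndexedTrie {
    private static final int ALPHABET = 256; // assumes 8-bit input for brevity
    private static final short NONE = -1;

    private static final class Node {
        final short[] next = new short[ALPHABET]; // 512 bytes vs. ~2KB of references
        boolean terminal;
        Node() { Arrays.fill(next, NONE); }
    }

    // The shared intermediate table; a short index caps it at 32,767 nodes
    // (no overflow check in this sketch).
    private final List<Node> nodes = new ArrayList<>();

    IndexedTrie() { nodes.add(new Node()); } // index 0 is the root

    public void insert(String key) {
        Node node = nodes.get(0);
        for (int i = 0; i < key.length(); i++) {
            int c = key.charAt(i) & 0xFF; // truncate to the assumed alphabet
            if (node.next[c] == NONE) {
                nodes.add(new Node());
                node.next[c] = (short) (nodes.size() - 1);
            }
            node = nodes.get(node.next[c]);
        }
        node.terminal = true;
    }

    public boolean contains(String key) {
        Node node = nodes.get(0);
        for (int i = 0; i < key.length(); i++) {
            short idx = node.next[key.charAt(i) & 0xFF];
            if (idx == NONE) return false;
            node = nodes.get(idx);
        }
        return node.terminal;
    }
}
```

With 2-byte slots, each node's table is 512 bytes instead of roughly 2KB of 64-bit references, which is where the saving discussed above comes from.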

Your idea about breaking the characters into 4-bit chunks probably won't save you much, unless all of your expected characters are in an extremely limited range (maybe OK for words limited to uppercase US-ASCII, not likely with a general Unicode corpus).

If you have a sparse character set, then a HashMap<Character,Map<...>> might be your best implementation. Yes, each entry will be much larger, but if you don't have many entries you'll get an overall win. (As a side note: I always thought it was funny that the Wikipedia article on tries showed, and maybe still does, an example based on a hashed data structure, completely ignoring the space/time tradeoffs of that choice.)
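
A sketch of that map-based shape (illustrative names, assuming plain HashMap nodes): memory now scales with the number of children a node actually has rather than with the alphabet size, and any char, Cyrillic included, works without widening any arrays.

```java
import java.util.HashMap;
import java.util.Map;

class MapTrie {
    private static final class Node {
        // Only the children that exist occupy memory.
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    private final Node root = new Node();

    public void insert(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    public boolean contains(String key) {
        Node node = root;
        for (int i = 0; i < key.length() && node != null; i++) {
            node = node.children.get(key.charAt(i));
        }
        return node != null && node.terminal;
    }
}
```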

Finally, you might want to avoid a trie altogether. If you're looking at a corpus of normal words in a human language (10,000 words in active use, with words 4-8 characters long), you'll probably be MUCH better off with a HashMap<String,List<String>>, where the key is the entire prefix.
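
A sketch of that prefix-map idea under the same assumptions (short words, modest corpus; the names are mine): every prefix of every inserted word maps directly to the list of words that share it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrefixIndex {
    private final Map<String, List<String>> byPrefix = new HashMap<>();

    public void add(String word) {
        // Index the word under every one of its prefixes.
        for (int len = 1; len <= word.length(); len++) {
            byPrefix.computeIfAbsent(word.substring(0, len), p -> new ArrayList<>())
                    .add(word);
        }
    }

    public List<String> wordsWithPrefix(String prefix) {
        return byPrefix.getOrDefault(prefix, Collections.emptyList());
    }
}
```

Each word is stored once per prefix length, so a 10,000-word corpus of 4-8 character words costs on the order of tens of thousands of list entries, and prefix lookup becomes a single hash probe.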
