Preliminary remark: I'm no longer a lawyer, and I never specialized in copyright or intellectual property law. If you want an authoritative answer, you should consult a lawyer.
1. Data and data files are not the same
As it states, Exhibit 1 covers data files:
BY DOWNLOADING, INSTALLING, COPYING OR OTHERWISE USING UNICODE INC.'S DATA FILES [...]
Data files and the data itself are not the same. When Microsoft implements uppercase and lowercase methods in .NET Framework, the Unicode standard is used, but this doesn't mean that .NET Framework contains, somewhere, the files downloaded from http://www.unicode.org/
A simple illustration of the difference between the data and the medium that carries it:
Imagine that I create a database with a list of countries, cities and the corresponding post codes. I expose this data through a web service and on my website.
The data itself is in the public domain: you can't reasonably copyright the list of countries and ask every person who uses such a list to pay you or to distribute a copy of your copyright notice.
On the other hand, nothing prevents me from enforcing a restrictive license on the usage of the web service or the website (especially since I invested a lot of effort in creating this data set). If I find that an application is scraping my website to download the data, this would be a copyright infringement, and I would be able to sue the person who created the scraper.
2. Data is too vague
If http://www.unicode.org/ stated that the license covers the data itself, it would be very difficult for this organization to enforce such copyright.
Imagine the following method:
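public char ToUpper(char c)
{
    const string upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const string lower = "abcdefghijklmnopqrstuvwxyz";

    // Already an uppercase letter: return it unchanged.
    if (upper.IndexOf(c) >= 0)
    {
        return c;
    }

    // Lowercase letter: return the letter at the same position in the uppercase string.
    int index = lower.IndexOf(c);
    if (index >= 0)
    {
        return upper[index];
    }

    // Anything else is outside the range this toy method handles.
    throw new ArgumentOutOfRangeException(nameof(c));
}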
Is this a violation of the copyright? Did I actually use the data from http://www.unicode.org/, and should I include a copy of the license in my answer? Or maybe I just typed those letters myself?
In other words, if the data itself were licensed, how far could the license go?
3. Copyright and data
Here are some interesting quotes:
http://www.lib.umich.edu/copyright/facts-and-data: University of Michigan
Copyright law does not apply to facts, data, or ideas. [...]
However, copyright may protect a collection of data as contained in a database or compilation, but only if it meets certain requirements. Simply working really hard to gather the data [...] is not enough. [...]
In order for a database to qualify for copyright protection, the author has to make choices about the selection, coordination, or arrangement of the facts or data, and those choices must be at least a little bit creative. [...]
It is important to remember that even if a database or compilation is arranged with sufficient originality to qualify for copyright protection, the facts and data within that database are still in the public domain.
http://www.ands.org.au/guides/copyright-and-data-awareness.html: Australian National Data Service
A table or compilation, consisting of words, figures or symbols (or a combination of these) is protected if it is
a literary work and
has the required degree of originality.
[...] Copyright applies not to the facts/information itself, but to the particular way the facts/information are presented in the dataset or database.
These two examples, one concerning the USA and the other Australia, clearly show that the data itself, i.e. the Unicode symbols with their respective numbers and attributes such as "is this a digit?" or "is this a capital letter from the Cyrillic alphabet?", is not covered by copyright.
Data files, on the other hand, may be covered by copyright, depending on their originality. For example, the PDFs you find on http://www.unicode.org/ are very probably covered by copyright. If, on the other hand, it is purely a matter of a CSV mapping lowercase characters to uppercase or vice versa, the author of such data would hardly be able to enforce copyright on it.
Clearly, the ToUpper method I put above is not a violation of the http://www.unicode.org/ copyright. Nor is the code used by .NET Framework or Firefox, unless those systems contain, somewhere inside, data files which are clearly, undoubtedly copied from http://www.unicode.org/ with, optionally, some minor changes.
Best Answer
What are you using this trie for? What is the total number of words that you plan to hold, and what is the sparseness of their constituent characters? And most important, is a trie even appropriate (versus a simple map of prefix to list of words)?
Your idea of an intermediate table and replacing pointers with indexes will work, provided that you have a relatively small set of short words and a sparse character set. Otherwise you risk running out of space in your intermediate table. And unless you're looking at an extremely small set of words, you won't really save that much space: 2 bytes for a short versus 4 bytes for a reference on a 32-bit machine. If you're running on a 64-bit JVM, the savings will be more.
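To make the index-table idea concrete, here is a minimal sketch of one way to read it. Everything in it is an assumption for illustration rather than something taken from the question: the class name IndexedTrie, the lowercase a-z alphabet, and the dense 26-slot layout per node. Child links are 2-byte indexes into one shared array, which also shows where the table runs out: a short can address at most 32,767 nodes.

```java
import java.util.Arrays;

/** Hypothetical sketch: a trie whose child links are 2-byte indexes into one shared table. */
class IndexedTrie {
    private static final int ALPHABET = 26;                // assumption: lowercase a-z only
    private static final int MAX_NODES = Short.MAX_VALUE;  // a short index caps the node count

    // children[node * ALPHABET + letter] holds the child's node index, or 0 for "no child".
    // Node 0 is the root and nothing ever links back to it, so 0 is a safe "empty" marker.
    private short[] children = new short[16 * ALPHABET];
    private boolean[] terminal = new boolean[16];
    private int nodeCount = 1; // the root already exists

    void add(String word) {
        int node = 0;
        for (int i = 0; i < word.length(); i++) {
            int slot = node * ALPHABET + letter(word.charAt(i));
            if (children[slot] == 0) {
                // Allocate first: allocate() may replace the children array when it grows.
                int child = allocate();
                children[slot] = (short) child;
            }
            node = children[slot];
        }
        terminal[node] = true;
    }

    boolean contains(String word) {
        int node = 0;
        for (int i = 0; i < word.length(); i++) {
            node = children[node * ALPHABET + letter(word.charAt(i))];
            if (node == 0) {
                return false;
            }
        }
        return terminal[node];
    }

    private int allocate() {
        if (nodeCount >= MAX_NODES) {
            throw new IllegalStateException("index table full: " + MAX_NODES + " nodes");
        }
        if (nodeCount == terminal.length) {
            terminal = Arrays.copyOf(terminal, nodeCount * 2);
            children = Arrays.copyOf(children, nodeCount * 2 * ALPHABET);
        }
        return nodeCount++;
    }

    private static int letter(char c) {
        if (c < 'a' || c > 'z') {
            throw new IllegalArgumentException("unsupported character: " + c);
        }
        return c - 'a';
    }
}
```

After add("car") and add("cart"), contains("car") and contains("cart") are true, while contains("ca") is false because no word ends at that node. Note that this dense layout spends 26 slots on every node, so it only pays off when most slots are actually used.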
Your idea about breaking the characters into 4-bit chunks probably won't save you much, unless all of your expected characters are in an extremely limited range (maybe OK for words limited to uppercase US-ASCII, not likely with a general Unicode corpus).
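To see why, here is a small sketch of what the split looks like. The helper name Nibbles is hypothetical, and it assumes each chunk becomes one level of 16-way trie nodes:

```java
/** Hypothetical helper: splits a char into 4-bit chunks, most significant chunk first. */
final class Nibbles {
    private Nibbles() {}

    /** A Java char is 16 bits, so it always yields four chunks, each in the range 0..15. */
    static int[] split(char c) {
        return new int[] { (c >> 12) & 0xF, (c >> 8) & 0xF, (c >> 4) & 0xF, c & 0xF };
    }

    /** Trie levels needed for a word if every chunk gets its own 16-way node. */
    static int levels(String word) {
        return word.length() * 4;
    }
}
```

For 'A' (U+0041) the chunks are 0, 0, 4, 1. For uppercase US-ASCII the first two chunks are always 0 and the third is always 4 or 5, so you could drop most of the levels. With general Unicode text every character costs four levels, and the extra nodes eat into whatever the narrower nodes saved.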
If you have a sparse character set, then a HashMap<Character,Map<...>> might be your best implementation. Yes, each entry will be much larger, but if you don't have many entries you'll get an overall win. (As a side note: I always thought it was funny that the Wikipedia article on tries showed, and maybe still does, an example based on a hashed data structure, completely ignoring the space/time tradeoffs of that choice.)
Finally, you might want to avoid a trie altogether. If you're looking at a corpus of normal words in a human language (10,000 words in active use, with words 4-8 characters long), you'll probably be MUCH better off with a HashMap<String,List<String>>, where the key is the entire prefix.
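A minimal sketch of that prefix map, assuming you want every word registered under each of its non-empty prefixes (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: prefix lookup backed by a plain map instead of a trie. */
class PrefixIndex {
    private final Map<String, List<String>> byPrefix = new HashMap<>();

    /** Registers the word under every one of its non-empty prefixes, including the word itself. */
    void add(String word) {
        for (int end = 1; end <= word.length(); end++) {
            byPrefix.computeIfAbsent(word.substring(0, end), k -> new ArrayList<>()).add(word);
        }
    }

    /** Returns the words starting with the prefix, in insertion order; empty if there are none. */
    List<String> startingWith(String prefix) {
        return byPrefix.getOrDefault(prefix, Collections.emptyList());
    }
}
```

After adding "car", "cart" and "dog", startingWith("ca") returns [car, cart] with a single hash lookup. The cost is that each word is stored once per prefix, which stays manageable with short words, and that adding the same word twice lists it twice unless you deduplicate.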