In a nutshell, Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). Note, however, that Lucene does not (necessarily) load all indexed terms into RAM, as described by Michael McCandless, the author of Lucene's indexing system himself. By using Skip-Lists, the index can be traversed from one hit to another, making things like set and, particularly, range queries possible (much like B-Trees). The Wikipedia entry on indexing Skip-Lists also explains why Lucene's Skip-List implementation is called a multi-level Skip-List: essentially, to make O(log n) look-ups possible (again, much like B-Trees).
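To make the "traverse from one hit to another" idea a bit more concrete, here is a minimal sketch of an AND query over two sorted posting lists that uses skip jumps to advance. It is not Lucene's code: the class name, the fixed `SKIP` interval and the in-memory `int[]` arrays are assumptions made for illustration, and it models only a single skip level rather than a multi-level Skip-List.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipIntersectionSketch {
    // Hypothetical fixed skip interval; a multi-level list would stack several such intervals.
    static final int SKIP = 4;

    /** Returns the first index >= from whose doc ID is >= target, trying skip jumps first. */
    static int advance(int[] postings, int from, int target) {
        int i = from;
        // Jump SKIP entries at a time while the entry we would land on is still too small.
        while (i + SKIP < postings.length && postings[i + SKIP] < target) {
            i += SKIP;
        }
        // Finish with a short linear scan.
        while (i < postings.length && postings[i] < target) {
            i++;
        }
        return i;
    }

    /** Intersects two sorted doc-ID lists, i.e. an AND query over two terms. */
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                hits.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = advance(a, i, b[j]);
            } else {
                j = advance(b, j, a[i]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] termA = {1, 3, 5, 8, 13, 21, 34, 55, 89};
        int[] termB = {2, 3, 21, 22, 55, 90};
        System.out.println(intersect(termA, termB)); // [3, 21, 55]
    }
}
```

With more skip levels, each `advance` call can cover long gaps in roughly logarithmic time, which is the property the multi-level name refers to.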
So once the inverted (term) index, which is based on a Skip-List data structure, is built from the documents, the index is stored on disk. Lucene then loads (as already said: possibly only some of) those terms into a Finite State Transducer, in an FST implementation loosely inspired by Morfologik.
Michael McCandless (also) does a pretty good and terse job of explaining how and why Lucene uses a (minimal acyclic) FST to index the terms Lucene stores in memory, essentially as a SortedMap<ByteSequence,SomeOutput>, and gives a basic idea of how FSTs work (i.e., how the FST compacts the byte sequences [i.e., the indexed terms] to make the memory use of this mapping grow sub-linearly). And he points to the paper that describes the particular FST algorithm Lucene uses, too.
For those curious why Lucene uses Skip-Lists, while most databases use (B+)- and/or (B)-Trees, take a look at the relevant SO answer regarding this question (Skip-Lists vs. B-Trees). That answer gives a pretty good, deep explanation. Essentially, the point is not so much that Skip-Lists make concurrent updates of the index "more amenable" (because you can decide not to re-balance a B-Tree immediately, thereby gaining about the same concurrent performance as a Skip-List), but rather that Skip-Lists save you from having to work on the (delayed or not) balancing operation (ultimately) needed by B-Trees. (In fact, as the answer shows/references, there is probably very little performance difference between B-Trees and [multi-level] Skip-Lists, if either is "done right.")
That's the hard way, and those java.util.Date setter methods have been deprecated since Java 1.1 (1997). Moreover, the whole java.util.Date class has been de-facto deprecated (discommended) since the introduction of the java.time API in Java 8 (2014).
Simply parse the date using DateTimeFormatter with a pattern matching the input string (the tutorial is available here).
In your specific case of "January 2, 2010" as the input string:
- "January" is the full text month, so use the MMMM pattern for it.
- "2" is the short day-of-month, so use the d pattern for it.
- "2010" is the 4-digit year, so use the yyyy pattern for it.
String string = "January 2, 2010";
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH);
LocalDate date = LocalDate.parse(string, formatter);
System.out.println(date); // 2010-01-02
Note: if your format pattern happens to contain the time part as well, then use LocalDateTime#parse(text, formatter) instead of LocalDate#parse(text, formatter). And, if your format pattern happens to contain the time zone as well, then use ZonedDateTime#parse(text, formatter) instead.
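As a rough illustration of that note, the sketch below extends the "January 2, 2010" example with a time part and then with a zone ID. The class name, the method names and the exact patterns ("HH:mm" for the time, "VV" for the zone ID) are assumptions chosen for this example, not something the question prescribes.

```java
import java.time.LocalDateTime;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class DateTimeParseSketch {
    /** Pattern with a time part, so the target type is LocalDateTime. */
    static LocalDateTime parseLocal(String text) {
        DateTimeFormatter formatter =
                DateTimeFormatter.ofPattern("MMMM d, yyyy HH:mm", Locale.ENGLISH);
        return LocalDateTime.parse(text, formatter);
    }

    /** Pattern with a time part and a zone ID, so the target type is ZonedDateTime. */
    static ZonedDateTime parseZoned(String text) {
        DateTimeFormatter formatter =
                DateTimeFormatter.ofPattern("MMMM d, yyyy HH:mm VV", Locale.ENGLISH);
        return ZonedDateTime.parse(text, formatter);
    }

    public static void main(String[] args) {
        System.out.println(parseLocal("January 2, 2010 13:45"));
        // 2010-01-02T13:45
        System.out.println(parseZoned("January 2, 2010 13:45 Europe/Paris"));
        // 2010-01-02T13:45+01:00[Europe/Paris]
    }
}
```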
Here's an extract of relevance from the javadoc, listing all available format patterns:
| Symbol | Meaning | Presentation | Examples |
|---|---|---|---|
| G | era | text | AD; Anno Domini; A |
| u | year | year | 2004; 04 |
| y | year-of-era | year | 2004; 04 |
| D | day-of-year | number | 189 |
| M/L | month-of-year | number/text | 7; 07; Jul; July; J |
| d | day-of-month | number | 10 |
| Q/q | quarter-of-year | number/text | 3; 03; Q3; 3rd quarter |
| Y | week-based-year | year | 1996; 96 |
| w | week-of-week-based-year | number | 27 |
| W | week-of-month | number | 4 |
| E | day-of-week | text | Tue; Tuesday; T |
| e/c | localized day-of-week | number/text | 2; 02; Tue; Tuesday; T |
| F | week-of-month | number | 3 |
| a | am-pm-of-day | text | PM |
| h | clock-hour-of-am-pm (1-12) | number | 12 |
| K | hour-of-am-pm (0-11) | number | 0 |
| k | clock-hour-of-day (1-24) | number | 24 |
| H | hour-of-day (0-23) | number | 0 |
| m | minute-of-hour | number | 30 |
| s | second-of-minute | number | 55 |
| S | fraction-of-second | fraction | 978 |
| A | milli-of-day | number | 1234 |
| n | nano-of-second | number | 987654321 |
| N | nano-of-day | number | 1234000000 |
| V | time-zone ID | zone-id | America/Los_Angeles; Z; -08:30 |
| z | time-zone name | zone-name | Pacific Standard Time; PST |
| O | localized zone-offset | offset-O | GMT+8; GMT+08:00; UTC-08:00; |
| X | zone-offset 'Z' for zero | offset-X | Z; -08; -0830; -08:30; -083015; -08:30:15; |
| x | zone-offset | offset-x | +0000; -08; -0830; -08:30; -083015; -08:30:15; |
| Z | zone-offset | offset-Z | +0000; -0800; -08:00; |
Do note that it has several predefined formatters for the more popular patterns. So instead of e.g. DateTimeFormatter.ofPattern("EEE, d MMM yyyy HH:mm:ss Z", Locale.ENGLISH); you could use DateTimeFormatter.RFC_1123_DATE_TIME. This is possible because they are, unlike SimpleDateFormat, thread safe. You could thus also define your own as a constant, if necessary.
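A minimal sketch of both points: parsing with the predefined RFC_1123_DATE_TIME formatter, and keeping a custom formatter in a shared constant. The class name, the constant name `LONG_DATE` and the sample input strings are assumptions made for this example.

```java
import java.time.LocalDate;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class SharedFormatterSketch {
    // Fine as a shared constant: DateTimeFormatter is immutable and thread safe.
    static final DateTimeFormatter LONG_DATE =
            DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH);

    /** Parses an RFC 1123 style date-time with the predefined formatter. */
    static ZonedDateTime parseRfc1123(String text) {
        return ZonedDateTime.parse(text, DateTimeFormatter.RFC_1123_DATE_TIME);
    }

    /** Parses with the custom shared formatter. */
    static LocalDate parseLongDate(String text) {
        return LocalDate.parse(text, LONG_DATE);
    }

    public static void main(String[] args) {
        System.out.println(parseRfc1123("Sat, 2 Jan 2010 13:45:00 GMT"));
        System.out.println(parseLongDate("January 2, 2010")); // 2010-01-02
    }
}
```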
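For input strings in a standard ISO 8601 format, you don't need an explicit DateTimeFormatter: a local date-time without offset, like 2016-09-26T17:44:57, can be parsed directly with LocalDateTime#parse(text) as it already uses the ISO_LOCAL_DATE_TIME formatter. Similarly, LocalDate#parse(text) parses an ISO date without the time component (see ISO_LOCAL_DATE), and ZonedDateTime#parse(text) parses an ISO date-time with an offset and, optionally, a time zone added (see ISO_ZONED_DATE_TIME). Note that a string with a trailing offset such as 2016-09-26T17:44:57Z falls in the last category: LocalDateTime#parse(text) rejects it with a DateTimeParseException because ISO_LOCAL_DATE_TIME does not accept the Z.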
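Here is a small sketch of the formatter-less parse methods. The class and helper names are made up for the example. One thing it makes visible is that a string carrying an offset (the trailing Z) is not accepted by LocalDateTime#parse(text), while ZonedDateTime#parse(text) takes it.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZonedDateTime;
import java.time.format.DateTimeParseException;

public class IsoParseSketch {
    /** Reports whether the text is accepted by LocalDateTime.parse (ISO_LOCAL_DATE_TIME). */
    static boolean parsesAsLocalDateTime(String text) {
        try {
            LocalDateTime.parse(text);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.parse("2016-09-26");
        LocalDateTime dateTime = LocalDateTime.parse("2016-09-26T17:44:57");
        ZonedDateTime zoned = ZonedDateTime.parse("2016-09-26T17:44:57Z");

        System.out.println(date);     // 2016-09-26
        System.out.println(dateTime); // 2016-09-26T17:44:57
        System.out.println(zoned);    // 2016-09-26T17:44:57Z

        // The offset suffix is not part of ISO_LOCAL_DATE_TIME.
        System.out.println(parsesAsLocalDateTime("2016-09-26T17:44:57Z")); // false
    }
}
```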
Pre-Java 8
In case you're not on Java 8 yet, or are forced to use java.util.Date, then parse the date using SimpleDateFormat with a format pattern matching the input string.
String string = "January 2, 2010";
DateFormat format = new SimpleDateFormat("MMMM d, yyyy", Locale.ENGLISH);
Date date = format.parse(string);
System.out.println(date); // Sat Jan 02 00:00:00 GMT 2010
Note the importance of the explicit Locale argument. If you omit it, then it will use the default locale, which is not necessarily English as used in the month name of the input string. If the locale doesn't match the input string, then you would confusingly get a java.text.ParseException even though the format pattern seems valid.
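The locale pitfall can be reproduced with a short sketch like the one below. The class and method names are hypothetical, and Locale.FRENCH is just one example of a non-English locale, since its month names ("janvier", ...) don't match "January".

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class LocaleMismatchSketch {
    /** Reports whether the text parses with the "MMMM d, yyyy" pattern in the given locale. */
    static boolean parses(String text, Locale locale) {
        try {
            new SimpleDateFormat("MMMM d, yyyy", locale).parse(text);
            return true;
        } catch (ParseException e) {
            // Same pattern, but the month name is not known in this locale.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("January 2, 2010", Locale.ENGLISH)); // true
        System.out.println(parses("January 2, 2010", Locale.FRENCH));  // false
    }
}
```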
Here's an extract of relevance from the javadoc, listing all available format patterns:
| Letter | Date or Time Component | Presentation | Examples |
|---|---|---|---|
| G | Era designator | Text | AD |
| y | Year | Year | 1996; 96 |
| Y | Week year | Year | 2009; 09 |
| M/L | Month in year | Month | July; Jul; 07 |
| w | Week in year | Number | 27 |
| W | Week in month | Number | 2 |
| D | Day in year | Number | 189 |
| d | Day in month | Number | 10 |
| F | Day of week in month | Number | 2 |
| E | Day in week | Text | Tuesday; Tue |
| u | Day number of week | Number | 1 |
| a | Am/pm marker | Text | PM |
| H | Hour in day (0-23) | Number | 0 |
| k | Hour in day (1-24) | Number | 24 |
| K | Hour in am/pm (0-11) | Number | 0 |
| h | Hour in am/pm (1-12) | Number | 12 |
| m | Minute in hour | Number | 30 |
| s | Second in minute | Number | 55 |
| S | Millisecond | Number | 978 |
| z | Time zone | General time zone | Pacific Standard Time; PST; GMT-08:00 |
| Z | Time zone | RFC 822 time zone | -0800 |
| X | Time zone | ISO 8601 time zone | -08; -0800; -08:00 |
Note that the patterns are case sensitive and that text based patterns of four characters or more represent the full form; otherwise a short or abbreviated form is used if available. So e.g. MMMMM or more is unnecessary, as MMMM already yields the full month name.
Here are some examples of valid SimpleDateFormat patterns to parse a given string to date:
| Input string | Pattern |
|---|---|
| 2001.07.04 AD at 12:08:56 PDT | yyyy.MM.dd G 'at' HH:mm:ss z |
| Wed, Jul 4, '01 | EEE, MMM d, ''yy |
| 12:08 PM | h:mm a |
| 12 o'clock PM, Pacific Daylight Time | hh 'o''clock' a, zzzz |
| 0:08 PM, PDT | K:mm a, z |
| 02001.July.04 AD 12:08 PM | yyyyy.MMMM.dd GGG hh:mm aaa |
| Wed, 4 Jul 2001 12:08:56 -0700 | EEE, d MMM yyyy HH:mm:ss Z |
| 010704120856-0700 | yyMMddHHmmssZ |
| 2001-07-04T12:08:56.235-0700 | yyyy-MM-dd'T'HH:mm:ss.SSSZ |
| 2001-07-04T12:08:56.235-07:00 | yyyy-MM-dd'T'HH:mm:ss.SSSXXX |
| 2001-W27-3 | YYYY-'W'ww-u |
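As a quick check of a few rows from this list, the sketch below parses three of the input strings that carry a numeric offset and compares the resulting instants. The class and helper names are assumptions, and the helper wraps the checked ParseException only to keep the example short.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class PatternExamplesSketch {
    /** Parses text with the given pattern using a fresh, English-locale SimpleDateFormat. */
    static Date parse(String pattern, String text) {
        try {
            return new SimpleDateFormat(pattern, Locale.ENGLISH).parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Pattern does not match input: " + text, e);
        }
    }

    public static void main(String[] args) {
        Date rfc822 = parse("yyyy-MM-dd'T'HH:mm:ss.SSSZ", "2001-07-04T12:08:56.235-0700");
        Date iso8601 = parse("yyyy-MM-dd'T'HH:mm:ss.SSSXXX", "2001-07-04T12:08:56.235-07:00");
        Date mail = parse("EEE, d MMM yyyy HH:mm:ss Z", "Wed, 4 Jul 2001 12:08:56 -0700");

        // Both offset notations describe the same instant.
        System.out.println(rfc822.equals(iso8601)); // true
        // The mail-style string has no milliseconds, so it is 235 ms earlier.
        System.out.println(rfc822.getTime() - mail.getTime()); // 235
    }
}
```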
An important note is that SimpleDateFormat is not thread safe. In other words, you should never declare and assign it as a static or instance variable and then reuse it from different methods/threads. You should always create it brand new within the method local scope.
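A minimal sketch of the method-local approach, exercised from several threads. The class name, the "yyyy-MM-dd" pattern, the pool size and the task count are arbitrary choices for illustration.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LocalFormatterSketch {
    /** Creates the SimpleDateFormat inside the method, so no state is shared between threads. */
    static Date parse(String text) throws ParseException {
        DateFormat format = new SimpleDateFormat("yyyy-MM-dd", Locale.ENGLISH);
        return format.parse(text);
    }

    /** Parses the same text in many concurrent tasks and returns the distinct results. */
    static Set<Date> parseInParallel(String text, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Date>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(() -> parse(text)));
            }
            Set<Date> distinct = new HashSet<>();
            for (Future<Date> future : futures) {
                distinct.add(future.get());
            }
            return distinct;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Every task gets its own formatter, so all results agree.
        System.out.println(parseInParallel("2010-01-02", 100).size()); // 1
    }
}
```

Creating a formatter per call costs a little allocation, but it avoids the corrupted or inconsistent results a shared instance can produce under concurrent use.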
Best Answer
Your code sorts by score first and then by date. Since the scores coming back are rarely identical, the results will almost always end up ordered by score anyway.
This is what I would do:
This will just sort by the field you specified. You can also do: