Java – Sorting lucene documents by date

java lucene

How can I achieve scoring and sorting in Lucene by start date?

The event with the latest start date should be shown first in the search results. I am using Lucene Version.LUCENE_44.

I have retrieved data from the DB and stored it in a Lucene Document as follows:

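import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

public static Document createDoc(Event e) {
    Document d = new Document();
    // event id
    d.add(new StoredField("id", e.getId()));
    // event name: stored for display, plus a boosted, analyzed copy for searching
    d.add(new StoredField("eventname", e.getEName()));
    TextField field = new TextField("enameSrch", e.getEName(), Store.NO);
    field.setBoost(10.0f);
    d.add(field);
    // event owner
    d.add(new StoredField("eventowner", e.getEOwner()));
    // event start date, inverted so that an ascending sort puts the latest date first
    d.add(new LongField("edateSort", Long.MAX_VALUE - e.getEStartTime(), Store.YES));
    // event tags
    if (e.getTags() != null) {
        field = new TextField("eTagSrch", e.getTags(), Store.NO);
        field.setBoost(5.0f);
        d.add(field);
        d.add(new StoredField("eTags", e.getTags()));
    }
    return d;
}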

When searching, I do the following:

public List search(String srchTxt) {
    PhraseQuery enameQuery = new PhraseQuery();
    Term term = new Term("enameSrch", srchTxt.toLowerCase());
    enameQuery.add(term);

    PhraseQuery etagQuery = new PhraseQuery();
    term = new Term("eTagSrch", srchTxt.toLowerCase());
    etagQuery.add(term);

    // match on either the event name or the tags
    BooleanQuery b = new BooleanQuery();
    b.add(enameQuery, Occur.SHOULD);
    b.add(etagQuery, Occur.SHOULD);

    // primary sort key: relevance score; secondary sort key: inverted start date
    SortField startField = new SortField("edateSort", Type.LONG);
    SortField scoreField = SortField.FIELD_SCORE;
    Sort sort = new Sort(scoreField, startField);

    TopFieldDocs tfd = searcher.search(b, 10, sort);
    ScoreDoc[] myscore = tfd.scoreDocs;

To rephrase: I want to sort Documents by date, which is stored as a Long field in my Document (see code above)

Best Answer

Your code sorts by score first and only then by date. Because two documents rarely come back with exactly the same score, the date tie-breaker almost never applies, so the results are effectively ordered by score alone.

This is what I would do:

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.SortField.Type;

Sort sorter = new Sort(); // new sort object

String field = "fieldName"; // enter the field to sort by
Type type = Type.LONG; // since your field is a long
boolean reverse = false; // false = ascending, which is the default

SortField sortField = new SortField(field, type, reverse);

sorter.setSort(sortField); // now set the sort field

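This sorts only by the field you specified. With the edateSort field from the question, which stores Long.MAX_VALUE minus the start time, ascending order puts the latest event first. You can also do: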

sorter.setSort(sortField, SortField.FIELD_SCORE); // this will sort by field, then by score
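
The effect of the key order can be sketched without Lucene. The following is plain Java using java.util.Comparator, not Lucene code; the Hit class, its values, and the comparator names are hypothetical and only mimic a descending score sort combined with an ascending sort on the inverted date key from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortOrderSketch {
    // Hypothetical stand-in for a search hit: a relevance score plus the stored sort key.
    static final class Hit {
        final String name;
        final float score;
        final long edateSort;

        Hit(String name, float score, long startTime) {
            this.name = name;
            this.score = score;
            this.edateSort = Long.MAX_VALUE - startTime; // same inversion as in createDoc
        }
    }

    // Score descending first, date key ascending only as a tie-breaker.
    static final Comparator<Hit> SCORE_THEN_DATE =
        Comparator.comparingDouble((Hit h) -> h.score).reversed()
                  .thenComparingLong(h -> h.edateSort);

    // Date key ascending first (latest start time first), score as a tie-breaker.
    static final Comparator<Hit> DATE_THEN_SCORE =
        Comparator.comparingLong((Hit h) -> h.edateSort)
                  .thenComparing(Comparator.comparingDouble((Hit h) -> h.score).reversed());

    // Returns the hit names in the order produced by the given comparator.
    static List<String> names(List<Hit> hits, Comparator<Hit> order) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(order);
        List<String> out = new ArrayList<>();
        for (Hit h : sorted) {
            out.add(h.name);
        }
        return out;
    }

    // Made-up sample data: distinct scores, distinct start times.
    static List<Hit> sample() {
        return Arrays.asList(
            new Hit("old", 2.5f, 1000L),
            new Hit("new", 1.2f, 3000L),
            new Hit("mid", 1.9f, 2000L));
    }

    public static void main(String[] args) {
        System.out.println(names(sample(), SCORE_THEN_DATE)); // [old, mid, new]
        System.out.println(names(sample(), DATE_THEN_SCORE)); // [new, mid, old]
    }
}
```

With distinct scores the first ordering never consults the date at all, which matches the behaviour described in the question.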

Related Solutions

How does lucene index documents

In a nutshell, Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). Note, however, that Lucene does not (necessarily) load all indexed terms to RAM, as described by Michael McCandless, the author of Lucene's indexing system himself. Note that by using Skip-Lists, the index can be traversed from one hit to another, making things like set and, particularly, range queries possible (much like B-Trees). And the Wikipedia entry on indexing Skip-Lists also explains why Lucene's Skip-List implementation is called a multi-level Skip-List - essentially, to make O(log n) look-ups possible (again, much like B-Trees).
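
To make the idea of skipping from one hit to another concrete, here is a deliberately simplified sketch of an AND query over two sorted postings lists. It is not Lucene's implementation: the fixed SKIP interval, the array-based lists, and all names are assumptions for illustration, and a real multi-level Skip-List keeps several levels of such pointers.

```java
import java.util.ArrayList;
import java.util.List;

class SkipIntersect {
    static final int SKIP = 4; // hypothetical skip interval (single level only)

    // Returns the first index >= from whose doc id is >= target, or postings.length.
    static int advance(int[] postings, int from, int target) {
        int i = from;
        // Follow skip pointers while the entry they point at is still below the target.
        while (i + SKIP < postings.length && postings[i + SKIP] < target) {
            i += SKIP;
        }
        // Finish with a short linear scan inside the current block.
        while (i < postings.length && postings[i] < target) {
            i++;
        }
        return i;
    }

    // Intersects two ascending doc id lists, skipping ahead in whichever list lags behind.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                hits.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = advance(a, i, b[j]);
            } else {
                j = advance(b, j, a[i]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] termA = {1, 4, 7, 9, 12, 15, 20, 31, 42, 57};
        int[] termB = {7, 20, 42, 99};
        System.out.println(intersect(termA, termB)); // [7, 20, 42]
    }
}
```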

So once the inverted (term) index, which is based on a Skip-List data structure, is built from the documents, the index is stored on disk. Lucene then loads those terms (as already said, possibly only some of them) into a Finite State Transducer, in an FST implementation loosely inspired by Morfologik.

Michael McCandless also does a good, terse job of explaining how and why Lucene uses a (minimal acyclic) FST to index the terms it stores in memory, essentially as a SortedMap<ByteSequence,SomeOutput>. He also gives a basic idea of how FSTs work (i.e., how the FST compacts the byte sequences, which are the indexed terms, so that the memory use of this mapping grows sub-linearly), and he points to the paper that describes the particular FST algorithm Lucene uses.
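
The sorted-map view of the term dictionary can be pictured with a java.util.TreeMap. This is only a stand-in for the interface: a TreeMap is not an FST and does none of the compaction described above, and the terms and output values below are made up.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class TermDictionarySketch {
    // Builds a sorted term -> output map; the outputs here are made-up ordinals.
    static SortedMap<String, Long> build() {
        SortedMap<String, Long> terms = new TreeMap<>();
        terms.put("date", 0L);
        terms.put("document", 1L);
        terms.put("lucene", 2L);
        terms.put("sort", 3L);
        return terms;
    }

    // All terms starting with the prefix, found via a range view over the sorted keys.
    static SortedMap<String, Long> withPrefix(SortedMap<String, Long> terms, String prefix) {
        return terms.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        SortedMap<String, Long> terms = build();
        System.out.println(terms.get("lucene"));             // 2
        System.out.println(withPrefix(terms, "d").keySet()); // [date, document]
    }
}
```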

For those curious why Lucene uses Skip-Lists while most databases use B-Trees and/or B+-Trees, take a look at the SO answer on this question (Skip-Lists vs. B-Trees). That answer gives a good, in-depth explanation. Essentially, the point is not so much that Skip-Lists make concurrent updates of the index "more amenable" (you can decide not to re-balance a B-Tree immediately and thereby get about the same concurrent performance as a Skip-List), but rather that Skip-Lists save you from the balancing operation, delayed or not, that B-Trees ultimately need. In fact, as the answer shows and references, there is probably very little performance difference between B-Trees and multi-level Skip-Lists if both are "done right."

Java string to date conversion

That's the hard way, and those java.util.Date setter methods have been deprecated since Java 1.1 (1997). Moreover, the whole java.util.Date class was de-facto deprecated (discommended) since introduction of java.time API in Java 8 (2014).

Simply parse the date using DateTimeFormatter with a pattern matching the input string (the tutorial is available here).

In your specific case of "January 2, 2010" as the input string:

"January" is the full text month, so use the MMMM pattern for it
"2" is the short day-of-month, so use the d pattern for it.
"2010" is the 4-digit year, so use the yyyy pattern for it.

String string = "January 2, 2010";
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH);
LocalDate date = LocalDate.parse(string, formatter);
System.out.println(date); // 2010-01-02

Note: if your format pattern happens to contain the time part as well, then use LocalDateTime#parse(text, formatter) instead of LocalDate#parse(text, formatter). And, if your format pattern happens to contain the time zone as well, then use ZonedDateTime#parse(text, formatter) instead.
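
A minimal sketch of both variants follows. The input strings, the patterns, and the class and method names are assumptions chosen for illustration; VV is the pattern for a time-zone ID.

```java
import java.time.LocalDateTime;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class DateTimeParseSketch {
    // Pattern with a time part: parse into LocalDateTime.
    static LocalDateTime parseLocal(String text) {
        DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("MMMM d, yyyy HH:mm", Locale.ENGLISH);
        return LocalDateTime.parse(text, formatter);
    }

    // Pattern with a time part and a zone ID: parse into ZonedDateTime.
    static ZonedDateTime parseZoned(String text) {
        DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("MMMM d, yyyy HH:mm VV", Locale.ENGLISH);
        return ZonedDateTime.parse(text, formatter);
    }

    public static void main(String[] args) {
        System.out.println(parseLocal("January 2, 2010 13:45"));
        // 2010-01-02T13:45
        System.out.println(parseZoned("January 2, 2010 13:45 Europe/Paris"));
        // 2010-01-02T13:45+01:00[Europe/Paris]
    }
}
```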

Here's an extract of relevance from the javadoc, listing all available format patterns:

Symbol	Meaning	Presentation	Examples
`G`	era	text	AD; Anno Domini; A
`u`	year	year	2004; 04
`y`	year-of-era	year	2004; 04
`D`	day-of-year	number	189
`M`/`L`	month-of-year	number/text	7; 07; Jul; July; J
`d`	day-of-month	number	10
`Q`/`q`	quarter-of-year	number/text	3; 03; Q3; 3rd quarter
`Y`	week-based-year	year	1996; 96
`w`	week-of-week-based-year	number	27
`W`	week-of-month	number	4
`E`	day-of-week	text	Tue; Tuesday; T
`e`/`c`	localized day-of-week	number/text	2; 02; Tue; Tuesday; T
`F`	day-of-week-in-month	number	3
`a`	am-pm-of-day	text	PM
`h`	clock-hour-of-am-pm (1-12)	number	12
`K`	hour-of-am-pm (0-11)	number	0
`k`	clock-hour-of-day (1-24)	number	24
`H`	hour-of-day (0-23)	number	0
`m`	minute-of-hour	number	30
`s`	second-of-minute	number	55
`S`	fraction-of-second	fraction	978
`A`	milli-of-day	number	1234
`n`	nano-of-second	number	987654321
`N`	nano-of-day	number	1234000000
`V`	time-zone ID	zone-id	America/Los_Angeles; Z; -08:30
`z`	time-zone name	zone-name	Pacific Standard Time; PST
`O`	localized zone-offset	offset-O	GMT+8; GMT+08:00; UTC-08:00;
`X`	zone-offset 'Z' for zero	offset-X	Z; -08; -0830; -08:30; -083015; -08:30:15;
`x`	zone-offset	offset-x	+0000; -08; -0830; -08:30; -083015; -08:30:15;
`Z`	zone-offset	offset-Z	+0000; -0800; -08:00;

Do note that it has several predefined formatters for the more popular patterns. So instead of e.g. DateTimeFormatter.ofPattern("EEE, d MMM yyyy HH:mm:ss Z", Locale.ENGLISH);, you could use DateTimeFormatter.RFC_1123_DATE_TIME. Predefined shared constants are possible because DateTimeFormatter instances are, unlike SimpleDateFormat, thread safe. You could thus also define your own constants, if necessary.
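
As a small sketch of the predefined formatter in use (the class name and the sample input are assumptions; the input is the RFC 1123 style string from the SimpleDateFormat javadoc examples):

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

class Rfc1123Sketch {
    // Formatters are immutable and thread safe, so one shared constant is fine.
    static final DateTimeFormatter RFC_1123 = DateTimeFormatter.RFC_1123_DATE_TIME;

    static OffsetDateTime parse(String text) {
        return OffsetDateTime.parse(text, RFC_1123);
    }

    public static void main(String[] args) {
        OffsetDateTime parsed = parse("Wed, 4 Jul 2001 12:08:56 -0700");
        System.out.println(parsed);                  // 2001-07-04T12:08:56-07:00
        System.out.println(RFC_1123.format(parsed)); // Wed, 4 Jul 2001 12:08:56 -0700
    }
}
```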

For a particular input string format, you don't need to use an explicit DateTimeFormatter: a standard ISO 8601 local date-time, like 2016-09-26T17:44:57, can be parsed directly with LocalDateTime#parse(text) as it already uses the ISO_LOCAL_DATE_TIME formatter. Similarly, LocalDate#parse(text) parses an ISO date without the time component (see ISO_LOCAL_DATE), and ZonedDateTime#parse(text) parses an ISO date-time with an offset and optionally a time zone added, like 2016-09-26T17:44:57Z (see ISO_ZONED_DATE_TIME). Note that LocalDateTime#parse(text) rejects a string with a trailing Z or offset.
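
A short sketch of the formatter-less parse methods follows; the class and helper names are hypothetical. It also shows that LocalDateTime.parse only accepts the local form, so a string with a trailing Z has to go through ZonedDateTime.parse (or Instant.parse).

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZonedDateTime;
import java.time.format.DateTimeParseException;

class IsoParseSketch {
    // True if LocalDateTime.parse(text) accepts the string with its default ISO formatter.
    static boolean localDateTimeAccepts(String text) {
        try {
            LocalDateTime.parse(text);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(LocalDate.parse("2016-09-26"));               // 2016-09-26
        System.out.println(LocalDateTime.parse("2016-09-26T17:44:57"));  // 2016-09-26T17:44:57
        System.out.println(ZonedDateTime.parse("2016-09-26T17:44:57Z")); // 2016-09-26T17:44:57Z
        System.out.println(Instant.parse("2016-09-26T17:44:57Z"));       // 2016-09-26T17:44:57Z
        System.out.println(localDateTimeAccepts("2016-09-26T17:44:57Z")); // false
    }
}
```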

Pre-Java 8

In case you're not on Java 8 yet, or are forced to use java.util.Date, then parse the date using SimpleDateFormat with a format pattern matching the input string.

String string = "January 2, 2010";
DateFormat format = new SimpleDateFormat("MMMM d, yyyy", Locale.ENGLISH);
Date date = format.parse(string);
System.out.println(date); // Sat Jan 02 00:00:00 GMT 2010

Note the importance of the explicit Locale argument. If you omit it, then it will use the default locale, which is not necessarily English as used in the month name of the input string. If the locale doesn't match the input string, then you would confusingly get a java.text.ParseException even though the format pattern seems valid.
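
The locale mismatch can be reproduced with a small sketch. The class and method names are hypothetical, and Locale.FRENCH is just one example of a locale whose month names differ from the input.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

class LocaleMismatchSketch {
    // True if the text parses with the "MMMM d, yyyy" pattern under the given locale.
    static boolean parses(String text, Locale locale) {
        try {
            new SimpleDateFormat("MMMM d, yyyy", locale).parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("January 2, 2010", Locale.ENGLISH)); // true
        System.out.println(parses("January 2, 2010", Locale.FRENCH));  // false
    }
}
```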

Here's an extract of relevance from the javadoc, listing all available format patterns:

Letter	Date or Time Component	Presentation	Examples
`G`	Era designator	Text	AD
`y`	Year	Year	1996; 96
`Y`	Week year	Year	2009; 09
`M`/`L`	Month in year	Month	July; Jul; 07
`w`	Week in year	Number	27
`W`	Week in month	Number	2
`D`	Day in year	Number	189
`d`	Day in month	Number	10
`F`	Day of week in month	Number	2
`E`	Day in week	Text	Tuesday; Tue
`u`	Day number of week	Number	1
`a`	Am/pm marker	Text	PM
`H`	Hour in day (0-23)	Number	0
`k`	Hour in day (1-24)	Number	24
`K`	Hour in am/pm (0-11)	Number	0
`h`	Hour in am/pm (1-12)	Number	12
`m`	Minute in hour	Number	30
`s`	Second in minute	Number	55
`S`	Millisecond	Number	978
`z`	Time zone	General time zone	Pacific Standard Time; PST; GMT-08:00
`Z`	Time zone	RFC 822 time zone	-0800
`X`	Time zone	ISO 8601 time zone	-08; -0800; -08:00

Note that the patterns are case sensitive and that text based patterns of four characters or more represent the full form; otherwise a short or abbreviated form is used if available. So e.g. MMMMM or more is unnecessary.

Here are some examples of valid SimpleDateFormat patterns to parse a given string to date:

Input string	Pattern
2001.07.04 AD at 12:08:56 PDT	`yyyy.MM.dd G 'at' HH:mm:ss z`
Wed, Jul 4, '01	`EEE, MMM d, ''yy`
12:08 PM	`h:mm a`
12 o'clock PM, Pacific Daylight Time	`hh 'o''clock' a, zzzz`
0:08 PM, PDT	`K:mm a, z`
02001.July.04 AD 12:08 PM	`yyyyy.MMMM.dd GGG hh:mm aaa`
Wed, 4 Jul 2001 12:08:56 -0700	`EEE, d MMM yyyy HH:mm:ss Z`
010704120856-0700	`yyMMddHHmmssZ`
2001-07-04T12:08:56.235-0700	`yyyy-MM-dd'T'HH:mm:ss.SSSZ`
2001-07-04T12:08:56.235-07:00	`yyyy-MM-dd'T'HH:mm:ss.SSSXXX`
2001-W27-3	`YYYY-'W'ww-u`

An important note is that SimpleDateFormat is not thread safe. In other words, you should never declare and assign it as a static or instance variable and then reuse it from different methods/threads. You should always create it brand new within the method local scope.
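
One way to follow this advice is a helper that builds its own formatter on every call, sketched below. The class name, method name, pattern, and the choice of UTC are assumptions for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class PerCallFormatSketch {
    // A new SimpleDateFormat per call: no formatter state is shared between threads.
    static Date parseUtcDay(String text) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd", Locale.ENGLISH);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // fixed zone so the result is reproducible
        try {
            return format.parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseUtcDay("2010-01-02").getTime()); // 1262390400000
    }
}
```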
