PHP – Problem with Lucene search not indexing numeric values

lucene, php, zend-framework

I am using Lucene in PHP (the Zend Framework implementation, Zend_Search_Lucene). My problem is that I cannot search on a field that contains a number.

Here is the data in the index:

      ts      |    contents
--------------+-----------------
  1236917100  | dog cat gerbil
  1236630752  |  cow pig goat
  1235680249  | lion tiger bear
  nonnumeric  | bass goby trout

My problem: A query for "ts:1236630752" returns no hits. However, a query for "ts:nonnumeric" returns a hit.

I am storing "ts" as a keyword field, which according to the documentation "is not tokenized, but is indexed and stored. Useful for non-text fields, e.g. date or url." I have also tried treating it as a "text" field, but the behavior is the same, except that a query for "ts:*" returns nothing when ts is text.

I'm using Zend Framework 1.7 (just downloaded the latest 3 days ago) and PHP 5.2.9. Here is my code:

<?php

//=========================================================
// Initializes Zend Framework (Zend_Loader).
//=========================================================
set_include_path(realpath('../library') . PATH_SEPARATOR . get_include_path());
require_once('Zend/Loader.php');
Zend_Loader::registerAutoload();

//=========================================================
// Delete existing index and create a new one
//=========================================================
define('SEARCH_INDEX', 'test_search_index');
if(file_exists(SEARCH_INDEX))
  foreach(scandir(SEARCH_INDEX) as $file)
    if(!is_dir(SEARCH_INDEX . "/$file"))  // check the path inside the index directory, not the cwd
      unlink(SEARCH_INDEX . "/$file");

$index = Zend_Search_Lucene::create(SEARCH_INDEX);

//=========================================================
// Create this data in index:
//         ts      |    contents
//   --------------+-----------------
//     1236917100  | dog cat gerbil
//     1236630752  |  cow pig goat
//     1235680249  | lion tiger bear
//     nonnumeric  | bass goby trout
//=========================================================

function add_to_index($index, $ts, $contents) {
  $doc = new Zend_Search_Lucene_Document();
  $doc->addField(Zend_Search_Lucene_Field::Keyword('ts', $ts));
  $doc->addField(Zend_Search_Lucene_Field::Text('contents', $contents));
  $index->addDocument($doc);
}

add_to_index($index, '1236917100', 'dog cat gerbil');
add_to_index($index, '1236630752', 'cow pig goat');
add_to_index($index, '1235680249', 'lion tiger bear');
add_to_index($index, 'nonnumeric', 'bass goby trout');

//=========================================================
// Run some test queries and output results
//=========================================================

echo '<html><body><pre>';

function run_query($index, $query) {
  echo "Running query:  $query\n";
  $hits = $index->find($query);
  echo 'Got ' . count($hits) . " hits.\n";
  foreach($hits as $hit)
    echo "  ts='$hit->ts', contents='$hit->contents'\n";
  echo "\n";
}

run_query($index, 'pig');           //1 hit
run_query($index, 'ts:1236630752'); //0 hits
run_query($index, '1236630752');    //0 hits
run_query($index, 'ts:pig');        //0 hits
run_query($index, 'contents:pig');  //1 hits
run_query($index, 'ts:[1236630700 TO 1236630800]'); //0 hits (range query)
run_query($index, 'ts:*');          //4 hits if ts is keyword, 1 hit otherwise
run_query($index, 'nonnumeric');    //1 hits
run_query($index, 'ts:nonnumeric'); //1 hits
run_query($index, 'trout');         //1 hits

Output

Running query:  pig
Got 1 hits.
  ts='1236630752', contents='cow pig goat'

Running query:  ts:1236630752
Got 0 hits.

Running query:  1236630752
Got 0 hits.

Running query:  ts:pig
Got 0 hits.

Running query:  contents:pig
Got 1 hits.
  ts='1236630752', contents='cow pig goat'

Running query:  ts:[1236630700 TO 1236630800]
Got 0 hits.

Running query:  ts:*
Got 4 hits.
  ts='1236917100', contents='dog cat gerbil'
  ts='1236630752', contents='cow pig goat'
  ts='1235680249', contents='lion tiger bear'
  ts='nonnumeric', contents='bass goby trout'

Running query:  nonnumeric
Got 1 hits.
  ts='nonnumeric', contents='bass goby trout'

Running query:  ts:nonnumeric
Got 1 hits.
  ts='nonnumeric', contents='bass goby trout'

Running query:  trout
Got 1 hits.
  ts='nonnumeric', contents='bass goby trout'

Best Answer

The find() method tokenizes the query, and with the default analyzer your numbers will be pretty much ignored. If you want to search for a number, you have to construct the query through the API or use an alternate analyzer that keeps numeric values.
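To see why the number vanishes, here is a rough stand-in for the two kinds of analyzer. This is not the Zend code: letters_only_tokens() and letters_and_digits_tokens() are made-up helper names, and the regexes only approximate what I understand the letters-only default and a text-plus-numbers analyzer to do.

```php
<?php
// Hypothetical stand-in for a letters-only analyzer: keep runs of ASCII
// letters, lower-cased, and silently drop everything else (digits included).
function letters_only_tokens($text) {
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

// Hypothetical stand-in for a text-plus-numbers analyzer: digits are kept
// as part of the token.
function letters_and_digits_tokens($text) {
    preg_match_all('/[a-zA-Z0-9]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

foreach (array('1236630752', 'nonnumeric') as $value) {
    echo "$value\n";
    echo '  letters only:       [' . implode(', ', letters_only_tokens($value)) . "]\n";
    echo '  letters and digits: [' . implode(', ', letters_and_digits_tokens($value)) . "]\n";
}
```

With the letters-only rule, "1236630752" produces no tokens at all, so the parsed query has nothing to match against the keyword stored in the index, while "nonnumeric" survives intact.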

http://framework.zend.com/manual/en/zend.search.lucene.searching.html

It is important to note that the query parser uses the standard analyzer to tokenize separate parts of query string. Thus all transformations which are applied to indexed text are also applied to query strings.

The standard analyzer may transform the query string to lower case for case-insensitivity, remove stop-words, and stem among other transformations.

The API method doesn't transform or filter input terms in any way. It's therefore more suitable for computer generated or untokenized fields.
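Here is a sketch of both options against the index from the question. It assumes the same Zend Framework 1.x setup as the question's script (include path, Zend_Loader autoloading, and an existing 'test_search_index' directory). I have not run it against 1.7 specifically, so treat it as a starting point.

```php
<?php
// Same bootstrap as the question's script (assumes ../library holds ZF 1.x).
set_include_path(realpath('../library') . PATH_SEPARATOR . get_include_path());
require_once('Zend/Loader.php');
Zend_Loader::registerAutoload();

// Open the index that the question's script already created.
$index = Zend_Search_Lucene::open('test_search_index');

// Option 1: build the query through the API. The term text is used as-is,
// so the analyzer never gets a chance to drop the digits.
$term  = new Zend_Search_Lucene_Index_Term('1236630752', 'ts');
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits  = $index->find($query);
echo 'Term query: ' . count($hits) . " hits\n";

// Option 2: switch the default analyzer to one that keeps digits, then use
// an ordinary query string. For tokenized (Text) fields, the same analyzer
// should also be set before the documents are indexed.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive()
);
$hits = $index->find('ts:1236630752');
echo 'Parsed query: ' . count($hits) . " hits\n";
```

Option 1 is probably the better fit for a keyword field like "ts", since the value is computer generated and stored untokenized. Option 2 changes how every query string is parsed, so it also affects searches on "contents".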
