PHP – Lucene foreign chars problem

lucene, php, zend-framework

I'm having some serious issues using Zend_Lucene and foreign characters like åäö. These issues appear both when the index is created and when it's queried. I've tried both iso-8859-1 and utf-8.

ISO-8859-1

The query that doesn't work looks like "+_area:skåne". With Zend_Lucene I get no matches, but if I run the same query in Luke I get many matching documents.

The index contains 20 fields. The "_area" field is added with the following syntax:

$doc->addField(Zend_Search_Lucene_Field::keyword('_area', strtolower($item['area']), 'iso-8859-1')); 
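One thing worth checking in the line above: strtolower() works on bytes and, depending on PHP version and locale, may leave non-ASCII letters such as Å unchanged, so the indexed keyword and a lowercased query term can end up different. A minimal sketch of an encoding-aware alternative, assuming the mbstring extension is available and the input really is ISO-8859-1 (the helper name lowercaseLatin1 is hypothetical, not part of Zend):

```php
<?php
// Hypothetical helper: lowercase an ISO-8859-1 string, including
// non-ASCII letters such as Å (byte 0xC5) -> å (byte 0xE5).
// Assumes ext-mbstring is loaded and $value is ISO-8859-1.
function lowercaseLatin1(string $value): string
{
    return mb_strtolower($value, 'ISO-8859-1');
}

// "SKÅNE" written as raw ISO-8859-1 bytes.
$area = lowercaseLatin1("SK\xC5NE");
```

The result could then be passed to Zend_Search_Lucene_Field::keyword() in place of the strtolower() call.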

I am using the Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive analyzer.

During indexing, the notice below appeared intermittently (the indexed documents were randomly selected from a database with ISO-8859-1 encoding):

Notice: iconv(): Detected an illegal character in input string in TextNum.php.

This was "solved" by checking whether $this->_input is empty, since an empty input seemed to cause the notices. Note: the weird query results were already present before this change.

When I search keyword fields using foreign characters I get the notice above, but searching text fields behaves differently: it generates about a hundred notices like the one below.

Notice: Undefined offset: 1996 in \Zend\Search\Lucene\Search\Query\MultiTerm.php on line 472

But it produces what looks like a correct result set! On a side note, this second query doesn't generate any results in Luke.

UTF-8

I've also tried UTF-8 because, to my knowledge, Zend_Lucene uses it internally. Since the data set is ISO-8859-1, I convert it using utf8_encode. But the indexing produces the following errors.

Notice: Undefined offset: 266979 in \Zend\Search\Lucene\Index\SegmentInfo.php on line 632

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentMerger.php on line 196

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentMerger.php on line 200

Notice: Undefined index: in \Zend\Search\Lucene\Index\SegmentWriter.php on line 231

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentWriter.php on line 231

Notice: Undefined offset: 250595 in \Zend\Search\Lucene\Index\SegmentInfo.php on line 2020

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentInfo.php on line 2020

Notice: Undefined index: in \Zend\Search\Lucene\Index\SegmentWriter.php on line 465
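
For the conversion step itself: utf8_encode() assumes ISO-8859-1 input and is deprecated as of PHP 8.2. A hedged sketch using mbstring instead, again assuming the source data really is ISO-8859-1 (the function name toUtf8Lower is hypothetical). It lowercases after converting, so case folding happens in UTF-8. This only covers the input encoding; it is not claimed to explain the segment notices above.

```php
<?php
// Hypothetical helper: convert an ISO-8859-1 string to UTF-8, then
// lowercase it with a multibyte-aware function.
// Assumes ext-mbstring is loaded and $latin1 is NOT already UTF-8
// (converting UTF-8 input again would double-encode it).
function toUtf8Lower(string $latin1): string
{
    $utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    return mb_strtolower($utf8, 'UTF-8');
}

// "SKÅNE" as raw ISO-8859-1 bytes becomes UTF-8 "skåne".
$area = toUtf8Lower("SK\xC5NE");
```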


So. Can someone please shed some light? 🙂 I believe (after days of googling) that I'm not the only one experiencing this.

Best Answer

I suggest you try a UTF-8 compatible text analyzer. It looks like the analyzer you are using destroys the non-ASCII characters. You should also make sure that the text is correctly encoded when it is read, and that it reaches Lucene in the encoding the index and analyzer expect.
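
As far as I know, Zend Framework 1 ships UTF-8 analyzers such as Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive, and the query parser has its own default encoding setting. A rough sketch of switching to them, assuming Zend Framework 1 is autoloadable (the wrapper function useUtf8LuceneAnalyzer is hypothetical, and the guard makes it a no-op when the library is not loaded):

```php
<?php
// Hypothetical wrapper: select a UTF-8 aware, case-insensitive analyzer
// and tell the query parser that query strings are UTF-8.
// Returns false when Zend Framework 1 cannot be loaded.
function useUtf8LuceneAnalyzer(): bool
{
    $analyzerClass = 'Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive';
    if (!class_exists($analyzerClass)) {
        return false;
    }

    // Use the same analyzer for indexing and for searching.
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(new $analyzerClass());

    // Query strings passed to the parser are treated as UTF-8.
    Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

    return true;
}

$configured = useUtf8LuceneAnalyzer();
```

This would need to run before both indexing and querying, with field values passed as UTF-8 (for example 'utf-8' as the third argument to Zend_Search_Lucene_Field::keyword()). An index built with a different analyzer or encoding probably has to be rebuilt afterwards.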
