R – How to index and find numbers with Lucene.NET

full-text-searchlucenelucene.netnet

I've implemented full text search for a web site using Lucene.NET (Version 2.0). Indexing and searching works well, but I have one problem. If I look for numbers (phone numbers, product numbers etc.) as search terms, I don't get any resulting documents.

I'm using the Lucene.Net.Analysis.SimpleAnalyzer Class. I guess I have to change Analyzer and/or Tokenizer.

Any advice?

Thank you!

Best Answer

When you build up a Lucene Document, you get to select different indexing options for each field. For fields you don't want tokenized, you need to select the Field.Index.UN_TOKENIZED option. This will keep your phone numbers and product numbers in tact.

I would also advise using the StandardAnalyzer, as its doesn't strip numbers out like SimpleAnalyzer.

It is also important you use the same analyzer for both indexing and searching, to get consistent results.

Related Solutions

C# – How to create an Excel (.XLS and .XLSX) file in C# without installing Microsoft Office

You can use a library called ExcelLibrary. It's a free, open source library posted on Google Code:

ExcelLibrary

This looks to be a port of the PHP ExcelWriter that you mentioned above. It will not write to the new .xlsx format yet, but they are working on adding that functionality in.

It's very simple, small and easy to use. Plus it has a DataSetHelper that lets you use DataSets and DataTables to easily work with Excel data.

ExcelLibrary seems to still only work for the older Excel format (.xls files), but may be adding support in the future for newer 2007/2010 formats.

You can also use EPPlus, which works only for Excel 2007/2010 format files (.xlsx files). There's also NPOI which works with both.

There are a few known bugs with each library as noted in the comments. In all, EPPlus seems to be the best choice as time goes on. It seems to be more actively updated and documented as well.

Also, as noted by @АртёмЦарионов below, EPPlus has support for Pivot Tables and ExcelLibrary may have some support (Pivot table issue in ExcelLibrary)

Here are a couple links for quick reference:
ExcelLibrary - GNU Lesser GPL
EPPlus - GNU (LGPL) - No longer maintained
EPPlus 5 - Polyform Noncommercial - Starting May 2020
NPOI - Apache License

Here some example code for ExcelLibrary:

Here is an example taking data from a database and creating a workbook from it. Note that the ExcelLibrary code is the single line at the bottom:

//Create the data set and table
DataSet ds = new DataSet("New_DataSet");
DataTable dt = new DataTable("New_DataTable");

//Set the locale for each
ds.Locale = System.Threading.Thread.CurrentThread.CurrentCulture;
dt.Locale = System.Threading.Thread.CurrentThread.CurrentCulture;

//Open a DB connection (in this example with OleDB)
OleDbConnection con = new OleDbConnection(dbConnectionString);
con.Open();

//Create a query and fill the data table with the data from the DB
string sql = "SELECT Whatever FROM MyDBTable;";
OleDbCommand cmd = new OleDbCommand(sql, con);
OleDbDataAdapter adptr = new OleDbDataAdapter();

adptr.SelectCommand = cmd;
adptr.Fill(dt);
con.Close();

//Add the table to the data set
ds.Tables.Add(dt);

//Here's the easy part. Create the Excel worksheet from the data set
ExcelLibrary.DataSetHelper.CreateWorkbook("MyExcelFile.xls", ds);

Creating the Excel file is as easy as that. You can also manually create Excel files, but the above functionality is what really impressed me.

Mysql – Comparison of full text search engine – Lucene, Sphinx, Postgresql, MySQL

Good to see someone's chimed in about Lucene - because I've no idea about that.

Sphinx, on the other hand, I know quite well, so let's see if I can be of some help.

Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
Indexing speed is super-fast, because it talks directly to the database. Any slowness will come from complex SQL queries and un-indexed foreign keys and other such problems. I've never noticed any slowness in searching either.
I'm a Rails guy, so I've no idea how easy it is to implement with Django. There is a Python API that comes with the Sphinx source though.
The search service daemon (searchd) is pretty low on memory usage - and you can set limits on how much memory the indexer process uses too.
Scalability is where my knowledge is more sketchy - but it's easy enough to copy index files to multiple machines and run several searchd daemons. The general impression I get from others though is that it's pretty damn good under high load, so scaling it out across multiple machines isn't something that needs to be dealt with.
There's no support for 'did-you-mean', etc - although these can be done with other tools easily enough. Sphinx does stem words though using dictionaries, so 'driving' and 'drive' (for example) would be considered the same in searches.
Sphinx doesn't allow partial index updates for field data though. The common approach to this is to maintain a delta index with all the recent changes, and re-index this after every change (and those new results appear within a second or two). Because of the small amount of data, this can take a matter of seconds. You will still need to re-index the main dataset regularly though (although how regularly depends on the volatility of your data - every day? every hour?). The fast indexing speeds keep this all pretty painless though.

I've no idea how applicable to your situation this is, but Evan Weaver compared a few of the common Rails search options (Sphinx, Ferret (a port of Lucene for Ruby) and Solr), running some benchmarks. Could be useful, I guess.

I've not plumbed the depths of MySQL's full-text search, but I know it doesn't compete speed-wise nor feature-wise with Sphinx, Lucene or Solr.

Best Answer

Related Solutions

C# – How to create an Excel (.XLS and .XLSX) file in C# without installing Microsoft Office

Mysql – Comparison of full text search engine – Lucene, Sphinx, Postgresql, MySQL

Related Topic