SQL – How to optimize a Core Data query for full-text search

cocoa, cocoa-touch, core-data, iphone, sql

Can I optimize a Core Data query when searching for matching words in a text? (This question also pertains to the wisdom of custom SQL versus Core Data on an iPhone.)

I'm working on a new iPhone app that is a handheld reference tool for a scientific database. The main interface is a standard searchable table view, and I want as-you-type response as the user types new words. Search terms must match prefixes of words in the text, which contains hundreds of thousands of words.

In my prototype I coded SQL directly. I created a separate "words" table containing every word from the text fields of the main entity. I indexed the word column and ran searches along the lines of:

SELECT id, * FROM textTable 
  JOIN (SELECT DISTINCT textTableId FROM words 
         WHERE word >= 'foo' AND word < 'fop' ) -- half-open range covers every word starting with 'foo'
    ON id=textTableId
 LIMIT 50

This runs very fast. Using an IN would probably work just as well, i.e.

SELECT * FROM textTable
 WHERE id IN (SELECT textTableId FROM words 
               WHERE word >= 'foo' AND word < 'fop' ) 
 LIMIT 50

The LIMIT is crucial; it lets me display results quickly. When the limit is reached, I notify the user that there are too many matches to display, which is kludgy.
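For reference, the words-table approach above can be sketched outside Cocoa entirely. Here is a minimal, self-contained version using Python's sqlite3 module; the table and column names mirror the queries above, and the toy data is made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE textTable (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE words (word TEXT, textTableId INTEGER);
    CREATE INDEX words_word ON words(word);  -- makes the range scan fast
""")

rows = [(1, "the foolish fox"), (2, "a football match"), (3, "no match here")]
con.executemany("INSERT INTO textTable VALUES (?, ?)", rows)
for rid, body in rows:
    # Naive tokenization: one row per word, pointing back at the text row.
    con.executemany("INSERT INTO words VALUES (?, ?)",
                    [(w, rid) for w in body.split()])

# Prefix search: the half-open range [foo, fop) covers every word
# starting with "foo" ('fop' is 'foo' with the last character incremented).
hits = con.execute("""
    SELECT id, body FROM textTable
     WHERE id IN (SELECT textTableId FROM words
                   WHERE word >= 'foo' AND word < 'fop')
     LIMIT 50
""").fetchall()
```

With the sample data, `hits` contains the rows for "the foolish fox" and "a football match" but not "no match here".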

I've spent the last several days weighing the advantages of moving to Core Data, but I worry about the lack of control over the schema, indexing, and querying for such an important query.

Theoretically an NSPredicate of textField MATCHES '.*\bfoo.*' would just work, but I'm sure it would be slow. This sort of text search seems so common that I wonder what the usual attack is. Would you create a words entity as I did above and use a predicate of "word BEGINSWITH 'foo'"? Would that be as fast as my prototype? Will Core Data automatically create the right indexes? I can't find any explicit means of advising the persistent store about indexes.

I see some nice advantages of Core Data in my iPhone app. The faulting and other memory considerations allow for efficient database retrievals for tableview queries without setting arbitrary limits. The object graph management allows me to easily traverse entities without writing lots of SQL. Migration features will be nice in the future. On the other hand, in a limited resource environment (iPhone) I worry that an automatically generated database will be bloated with metadata, unnecessary inverse relationships, inefficient attribute datatypes, etc.

Should I dive in or proceed with caution?

Best Answer

I made a workaround. I think it's similar to this post. I added the SQLite amalgamation source to my Core Data project, then created a full-text search class that is not a managed object subclass. In the FTS class I #import "sqlite3.h" from the amalgamation instead of linking the system SQLite library. The FTS class saves to a different .sqlite file than the Core Data persistent store.

When I import my data, the Core Data object stores the rowid of the related FTS object as an integer attribute. I have a static dataset, so I don't worry about referential integrity, but the code to maintain integrity should be trivial.

To perform a search, I run a MATCH query against the FTS table, which returns a set of rowids. In my managed object class, I then query for the corresponding objects with [NSPredicate predicateWithFormat:@"rowid IN %@", rowids]. This way I avoid traversing any many-to-many relationships.
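As a rough illustration of the two-step lookup, here is the same flow outside Cocoa, using Python's sqlite3 with an FTS4 virtual table standing in for the amalgamation build (names and data are hypothetical, and FTS4 must be compiled into your SQLite):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# textTable stands in for the Core Data store; fts mirrors the searchable text.
con.execute("CREATE TABLE textTable (id INTEGER PRIMARY KEY, title TEXT)")
con.execute("CREATE VIRTUAL TABLE fts USING fts4(body)")

docs = [(1, "Fox taxonomy", "the foolish fox"),
        (2, "Football",     "a football match"),
        (3, "Other",        "no match here")]
for rid, title, body in docs:
    con.execute("INSERT INTO textTable VALUES (?, ?)", (rid, title))
    # Insert with an explicit rowid so FTS rowids line up with textTable ids,
    # analogous to storing the FTS rowid on the managed object.
    con.execute("INSERT INTO fts(rowid, body) VALUES (?, ?)", (rid, body))

# Step 1: MATCH with a trailing * for prefix search, collecting rowids.
rowids = [r[0] for r in
          con.execute("SELECT rowid FROM fts WHERE body MATCH 'foo*'")]

# Step 2: fetch the corresponding main-table rows (Core Data would use
# [NSPredicate predicateWithFormat:@"rowid IN %@", rowids] here instead).
qmarks = ",".join("?" * len(rowids))
hits = con.execute(
    f"SELECT id, title FROM textTable WHERE id IN ({qmarks})", rowids
).fetchall()
```

With the sample data, the MATCH returns the rowids for "foolish" and "football", and the second query fetches those two rows from the main table.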

The performance improvement is dramatic. My dataset is 142,287 rows, comprising 194 MB (Core Data) and 92 MB (FTS, with stopwords removed). Depending on search-term frequency, my searches went from several seconds to 0.1 seconds for infrequent terms (<100 hits) and 0.2 seconds for frequent terms (>2,000 hits).

I'm sure there are myriad problems with my approach (code bloat, possible namespace collisions, loss of some Core Data features), but it seems to be working.