I know several basic string-matching algorithms such as KMP or Boyer-Moore, but all of those analyze the pattern before searching.However, if one has a single character, there is not much to analyze.
So is there any better algorithm than the naive search of comparing every character of the text ?
Ny good search algorithm for a single character
algorithmsstring-matching
Related Solutions
It depends on the kind of search you want to perform. Each of the algorithms performs particularly well for certain types of a search, but you have not stated the context of your searches.
Here are some typical thoughts on search types:
Boyer-Moore: works by pre-analyzing the pattern and comparing from right-to-left. If a mismatch occurs, the initial analysis is used to determine how far the pattern can be shifted w.r.t. the text being searched. This works particularly well for long search patterns. In particular, it can be sub-linear, as you do not need to read every single character of your text.
Knuth-Morris-Pratt: also pre-analyzes the pattern, but tries to re-use whatever was already matched in the initial part of the pattern to avoid having to rematch that. This can work quite well, if your alphabet is small (f.ex. DNA bases), as you get a higher chance that your search patterns contain reuseable subpatterns.
Aho-Corasick: Needs a lot of preprocessing, but does so for a number of patterns. If you know you will be looking for the same search patterns over and over again, then this is much better than the other, because you need to analyse patterns only once, not once per search.
Hence, as usual in CS, there is no definite answer to the overall best. It is rather a matter of choosing the right tool for the job at hand.
Another note on your worst-case reasoning: Do consider the kinds of searches required to create that worst-case and think thoroughly about whether these are really relevant in your case. For example, the O(mn)
worst-case complexity of the Boyer-Moore algorithm stems from a search pattern and a text that each use only one character (like finding aaa
in aaaaaaaaaaaaaaaaaaaaa
) - do you really need to be fast for searches like that?
You may want to have a look at Lucene. It's written in Java but you can use it from PHP through either its Zend port or Solr, although I think Solr might not be a solution since you probably want something embedded within your CMS.
The idea is that you can build an index from all the searchable data and then search that index. It has the advantage that it can be much faster than database queries. You can also implement a scoring system, so that some records show up higher in the results list if they are more relevant to the user. One possible downside is that you have to update the index each time the data is updated.
If the Lucene approach doesn't fit your scenario, you could go with a mix of PHP and database code. A view or a stored procedure could act as a surrogate index, aggregating the searchable data. You'd need to split strings by whitespaces and this could make it rather slow, but will get you there. The PHP code will be responsible for building the WHERE clause (you can have a look at how Solr implements query strategies for suggestions as source of inspiration). Assuming you have Id
, Name
and Score
as index fields and two contact records with Id=1, FirstName="Ric", LastName="Flair"
and Id=2, FirstName="Ric", LastName="Flare"
the index records would be looking something like this:
+--------------+-------+
| Id | Name | Score |
+----+---------+-------+
| 1 | Ric | 1 |
+----+---------+-------+
| 1 | Flair | 1 |
+----+---------+-------+
| 2 | Ric | 1 |
+----+---------+-------+
| 2 | Flare | 1 |
+----+---------+-------+
An example could look like this (SQL Server):
DECLARE @index TABLE (
Id INT NOT NULL,
Name NVARCHAR(50) NOT NULL,
Score INT NOT NULL
)
INSERT INTO @index (Id, Name, Score) VALUES (1, 'Ric', 1)
INSERT INTO @index (Id, Name, Score) VALUES (1, 'Flair', 1)
INSERT INTO @index (Id, Name, Score) VALUES (2, 'Ric', 1)
INSERT INTO @index (Id, Name, Score) VALUES (2, 'Flare', 1)
SELECT
Id,
SUM(Score) AS Score
FROM
@index
WHERE
Name = 'Ric'
OR Name = 'Flair'
GROUP BY
Id
ORDER BY
Score DESC
Ric Flair
would have a higher score since it matches on both values, thus popping first in the search results.
The index could also contain title and summary fields to be used as values to be displayed in the search results page. Or you could join the results with a view that preselects those values.
You can toy around with the conditions in the WHERE
clause or with the Score
field (you could give different scores to different type of records or properties) to get a more refined search experience.
Best Answer
It being understood that the worst case is
O(N)
, there are some very nice micro-optimizations.The naive method performs a character comparison and an end-of-text comparison for each character.
Using a sentinel (i.e. a copy of the target character at the end of the text) reduces the number of comparisons to 1 per character.
At the bit twiddling level there is:
to know if any byte in a word (
x
) has a specific value (n
).The subexpression
v - 0x01010101UL
, evaluates to a high bit set in any byte whenever the corresponding byte inv
is zero or greater than0x80
.The sub-expression
~v & 0x80808080UL
evaluates to high bits set in bytes where the byte ofv
doesn't have its high bit set (so the byte was less than0x80
).By ANDing these two sub-expressions (
haszero
) the result is the high bits set where the bytes inv
were zero, since the high bits set due to a value greater than0x80
in the first sub-expression are masked off by the second (April 27, 1987 by Alan Mycroft).Now we can XOR the value to test (
x
) with a word that has been filled with the byte value in which we're interested (n
). Because XORing a value with itself results in a zero byte and nonzero otherwise, we can pass the result tohaszero
.This is often used in a typical
strchr
implementation.(Stephen M Bennet suggested this on December 13, 2009. Further details in the well known Bit Twiddling Hacks).
PS
The hack passes the brute force test (just be patient):
Thank you for the remark.
The answer was meant to be anything but an essay on multi-byte / variable-width encodings :-) (in all fairness that's not my area of expertise and I'm not sure it's what the OP was looking for).
Anyway it seems to me that the above ideas/tricks could somewhat be adapted to MBE (especially self-synchronizing encodings):
strchr
/strstr
(e.g. GNUlib coreutils mbschr)