Ny good search algorithm for a single character

algorithmsstring-matching

I know several basic string-matching algorithms such as KMP or Boyer-Moore, but all of those analyze the pattern before searching.However, if one has a single character, there is not much to analyze.
So is there any better algorithm than the naive search of comparing every character of the text ?

Best Answer

It being understood that the worst case is O(N), there are some very nice micro-optimizations.

The naive method performs a character comparison and an end-of-text comparison for each character.

Using a sentinel (i.e. a copy of the target character at the end of the text) reduces the number of comparisons to 1 per character.

At the bit twiddling level there is:

#define haszero(v)      ( ((v) - 0x01010101UL) & ~(v) & 0x80808080UL )
#define hasvalue(x, n)  ( haszero((x) ^ (~0UL / 255 * (n))) )

to know if any byte in a word (x) has a specific value (n).

The subexpression v - 0x01010101UL, evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80.

The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80).

By ANDing these two sub-expressions (haszero) the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second (April 27, 1987 by Alan Mycroft).

Now we can XOR the value to test (x) with a word that has been filled with the byte value in which we're interested (n). Because XORing a value with itself results in a zero byte and nonzero otherwise, we can pass the result to haszero.

This is often used in a typical strchr implementation.

(Stephen M Bennet suggested this on December 13, 2009. Further details in the well known Bit Twiddling Hacks).

this code is broken for any combination of 1111's next to a 0

The hack passes the brute force test (just be patient):

#include <iostream>
#include <limits>

bool haszero(std::uint32_t v)
{
  return (v - std::uint32_t(0x01010101)) & ~v & std::uint32_t(0x80808080);
}

bool hasvalue(std::uint32_t x, unsigned char n)
{
  return haszero(x ^ (~std::uint32_t(0) / 255 * n));
}

bool hasvalue_slow(std::uint32_t x, unsigned char n)
{
  for (unsigned i(0); i < 32; i += 8)
    if (((x >> i) & 0xFF) == n)
      return true;

  return false;
}

int main()
{
  const std::uint64_t stop(std::numeric_limits<std::uint32_t>::max());

  for (unsigned c(0); c < 256; ++c)
  {
    std::cout << "Testing " << c << std::endl;

    for (std::uint64_t w(0); w != stop; ++w)
    {
      if (w && w % 100000000 == 0)
        std::cout << w * 100 / stop << "%\r" << std::flush;

      const bool h(hasvalue(w, c));
      const bool hs(hasvalue_slow(w, c));

      if (h != hs)
        std::cerr << "hasvalue(" << w << ',' << c << ") is " << h << '\n';
    }
  }

  return 0;
}

Lots of upvotes for an answer which makes the assumption one chararacter=one byte, which is nowadays not the standard any more

Thank you for the remark.

The answer was meant to be anything but an essay on multi-byte / variable-width encodings :-) (in all fairness that's not my area of expertise and I'm not sure it's what the OP was looking for).

Anyway it seems to me that the above ideas/tricks could somewhat be adapted to MBE (especially self-synchronizing encodings):

as noted in Johan's comment the hack can 'easily' be extended to work for double bytes or anything (of course you cannot stretch it too much);
a typical function that locates a character in a multibyte character string:
- contains calls to strchr / strstr (e.g. GNUlib coreutils mbschr)
- expects them to be well tuned.
the sentinel technique can be used with a little foresight.

Related Solutions

C++ Algorithms – Fastest String Search Algorithm

It depends on the kind of search you want to perform. Each of the algorithms performs particularly well for certain types of a search, but you have not stated the context of your searches.

Here are some typical thoughts on search types:

Boyer-Moore: works by pre-analyzing the pattern and comparing from right-to-left. If a mismatch occurs, the initial analysis is used to determine how far the pattern can be shifted w.r.t. the text being searched. This works particularly well for long search patterns. In particular, it can be sub-linear, as you do not need to read every single character of your text.
Knuth-Morris-Pratt: also pre-analyzes the pattern, but tries to re-use whatever was already matched in the initial part of the pattern to avoid having to rematch that. This can work quite well, if your alphabet is small (f.ex. DNA bases), as you get a higher chance that your search patterns contain reuseable subpatterns.
Aho-Corasick: Needs a lot of preprocessing, but does so for a number of patterns. If you know you will be looking for the same search patterns over and over again, then this is much better than the other, because you need to analyse patterns only once, not once per search.

Hence, as usual in CS, there is no definite answer to the overall best. It is rather a matter of choosing the right tool for the job at hand.

Another note on your worst-case reasoning: Do consider the kinds of searches required to create that worst-case and think thoroughly about whether these are really relevant in your case. For example, the O(mn) worst-case complexity of the Boyer-Moore algorithm stems from a search pattern and a text that each use only one character (like finding aaa in aaaaaaaaaaaaaaaaaaaaa) - do you really need to be fast for searches like that?

Algorithms – How to Create a Search Algorithm for a CRM

You may want to have a look at Lucene. It's written in Java but you can use it from PHP through either its Zend port or Solr, although I think Solr might not be a solution since you probably want something embedded within your CMS.

The idea is that you can build an index from all the searchable data and then search that index. It has the advantage that it can be much faster than database queries. You can also implement a scoring system, so that some records show up higher in the results list if they are more relevant to the user. One possible downside is that you have to update the index each time the data is updated.

If the Lucene approach doesn't fit your scenario, you could go with a mix of PHP and database code. A view or a stored procedure could act as a surrogate index, aggregating the searchable data. You'd need to split strings by whitespaces and this could make it rather slow, but will get you there. The PHP code will be responsible for building the WHERE clause (you can have a look at how Solr implements query strategies for suggestions as source of inspiration). Assuming you have Id, Name and Score as index fields and two contact records with Id=1, FirstName="Ric", LastName="Flair" and Id=2, FirstName="Ric", LastName="Flare" the index records would be looking something like this:

+--------------+-------+
| Id | Name    | Score |
+----+---------+-------+
| 1  | Ric     |   1   |
+----+---------+-------+
| 1  | Flair   |   1   |
+----+---------+-------+
| 2  | Ric     |   1   |
+----+---------+-------+
| 2  | Flare   |   1   |
+----+---------+-------+

An example could look like this (SQL Server):

DECLARE @index TABLE (
    Id INT NOT NULL,
    Name NVARCHAR(50) NOT NULL,
    Score INT NOT NULL
)

INSERT INTO @index (Id, Name, Score) VALUES (1, 'Ric', 1)
INSERT INTO @index (Id, Name, Score) VALUES (1, 'Flair', 1)
INSERT INTO @index (Id, Name, Score) VALUES (2, 'Ric', 1)
INSERT INTO @index (Id, Name, Score) VALUES (2, 'Flare', 1)

SELECT
    Id,
    SUM(Score) AS Score
FROM
    @index
WHERE
    Name = 'Ric'
    OR Name = 'Flair'
GROUP BY
    Id
ORDER BY
    Score DESC

Ric Flair would have a higher score since it matches on both values, thus popping first in the search results.

The index could also contain title and summary fields to be used as values to be displayed in the search results page. Or you could join the results with a view that preselects those values.

You can toy around with the conditions in the WHERE clause or with the Score field (you could give different scores to different type of records or properties) to get a more refined search experience.

Best Answer

Related Solutions

C++ Algorithms – Fastest String Search Algorithm

Algorithms – How to Create a Search Algorithm for a CRM

Related Topic