Ny good search algorithm for a single character

algorithmsstring-matching

I know several basic string-matching algorithms such as KMP or Boyer-Moore, but all of those analyze the pattern before searching.However, if one has a single character, there is not much to analyze.
So is there any better algorithm than the naive search of comparing every character of the text ?

Best Answer

It being understood that the worst case is O(N), there are some very nice micro-optimizations.

The naive method performs a character comparison and an end-of-text comparison for each character.

Using a sentinel (i.e. a copy of the target character at the end of the text) reduces the number of comparisons to 1 per character.

At the bit twiddling level there is:

#define haszero(v)      ( ((v) - 0x01010101UL) & ~(v) & 0x80808080UL )
#define hasvalue(x, n)  ( haszero((x) ^ (~0UL / 255 * (n))) )

to know if any byte in a word (x) has a specific value (n).

The subexpression v - 0x01010101UL, evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80.

The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80).

By ANDing these two sub-expressions (haszero) the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second (April 27, 1987 by Alan Mycroft).

Now we can XOR the value to test (x) with a word that has been filled with the byte value in which we're interested (n). Because XORing a value with itself results in a zero byte and nonzero otherwise, we can pass the result to haszero.

This is often used in a typical strchr implementation.

(Stephen M Bennet suggested this on December 13, 2009. Further details in the well known Bit Twiddling Hacks).


PS

this code is broken for any combination of 1111's next to a 0

The hack passes the brute force test (just be patient):

#include <iostream>
#include <limits>

bool haszero(std::uint32_t v)
{
  return (v - std::uint32_t(0x01010101)) & ~v & std::uint32_t(0x80808080);
}

bool hasvalue(std::uint32_t x, unsigned char n)
{
  return haszero(x ^ (~std::uint32_t(0) / 255 * n));
}

bool hasvalue_slow(std::uint32_t x, unsigned char n)
{
  for (unsigned i(0); i < 32; i += 8)
    if (((x >> i) & 0xFF) == n)
      return true;

  return false;
}

int main()
{
  const std::uint64_t stop(std::numeric_limits<std::uint32_t>::max());

  for (unsigned c(0); c < 256; ++c)
  {
    std::cout << "Testing " << c << std::endl;

    for (std::uint64_t w(0); w != stop; ++w)
    {
      if (w && w % 100000000 == 0)
        std::cout << w * 100 / stop << "%\r" << std::flush;

      const bool h(hasvalue(w, c));
      const bool hs(hasvalue_slow(w, c));

      if (h != hs)
        std::cerr << "hasvalue(" << w << ',' << c << ") is " << h << '\n';
    }
  }

  return 0;
}

Lots of upvotes for an answer which makes the assumption one chararacter=one byte, which is nowadays not the standard any more

Thank you for the remark.

The answer was meant to be anything but an essay on multi-byte / variable-width encodings :-) (in all fairness that's not my area of expertise and I'm not sure it's what the OP was looking for).

Anyway it seems to me that the above ideas/tricks could somewhat be adapted to MBE (especially self-synchronizing encodings):

  • as noted in Johan's comment the hack can 'easily' be extended to work for double bytes or anything (of course you cannot stretch it too much);
  • a typical function that locates a character in a multibyte character string:
  • the sentinel technique can be used with a little foresight.
Related Topic