Php – How to generate excerpt with most searched words in PHP

PHP

Here is an excerpt function:

    function excerpt($text, $phrase, $radius = 100, $ending = "...") {
270             if (empty($text) or empty($phrase)) {
271                 return $this->truncate($text, $radius * 2, $ending);
272             }
273     
274             $phraseLen = strlen($phrase);
275             if ($radius < $phraseLen) {
276                 $radius = $phraseLen;
277             }
278     
279             $pos = strpos(strtolower($text), strtolower($phrase));
280     
281             $startPos = 0;
282             if ($pos > $radius) {
283                 $startPos = $pos - $radius;
284             }
285     
286             $textLen = strlen($text);
287     
288             $endPos = $pos + $phraseLen + $radius;
289             if ($endPos >= $textLen) {
290                 $endPos = $textLen;
291             }
292     
293             $excerpt = substr($text, $startPos, $endPos - $startPos);
294             if ($startPos != 0) {
295                 $excerpt = substr_replace($excerpt, $ending, 0, $phraseLen);
296             }
297     
298             if ($endPos != $textLen) {
299                 $excerpt = substr_replace($excerpt, $ending, -$phraseLen);
300             }
301     
302             return $excerpt;
303         }

Its drawback is that it doesn't try to match as many searched words as possible,which only matches once by default.

How to implement the desired one?

Best Answer

The code listed here thus far has not worked for me so I spent some time thinking of an algorithm to implement. What I have now works decently, and it does not appear to be a performance problem - feel free to test. Results are not as snazzy Google's snippets as there is no detection for where sentences start and end. I could add this but it'd be that much more complicated and I'd have to throw in the towel on doing this in a single function. Already its getting crowded and could be better coded if, for example, the object manipulations were abstracted to methods.

Anyhow, this is what I have and it should be a good start. The most dense excerpt is determined and the resulting string will approximately be the span you have specified. I urge some testing of this code as I have not done a thorough job of it. Surely there are problematic cases to be found.

I also encourage anyone to improve on this algorithm, or simply the code to execute it.

Enjoy.

// string excerpt(string $text, string $phrase, int $span = 100, string $delimiter = '...')
// parameters:
//  $text - text to be searched
//  $phrase - search string
//  $span - approximate length of the excerpt
//  $delimiter - string to use as a suffix and/or prefix if the excerpt is from the middle of a text

function excerpt($text, $phrase, $span = 100, $delimiter = '...') {

  $phrases = preg_split('/\s+/', $phrase);

  $regexp = '/\b(?:';
  foreach ($phrases as $phrase) {
    $regexp .= preg_quote($phrase, '/') . '|';
  }

  $regexp = substr($regexp, 0, -1) . ')\b/i';
  $matches = array();
  preg_match_all($regexp, $text, $matches, PREG_OFFSET_CAPTURE);
  $matches = $matches[0];

  $nodes = array();
  foreach ($matches as $match) {
    $node = new stdClass;
    $node->phraseLength = strlen($match[0]);
    $node->position = $match[1];
    $nodes[] = $node;
  }

  if (count($nodes) > 0) {
    $clust = new stdClass;
    $clust->nodes[] = array_shift($nodes);
    $clust->length = $clust->nodes[0]->phraseLength;
    $clust->i = 0;
    $clusters = new stdClass;
    $clusters->data = array($clust);
    $clusters->i = 0;
    foreach ($nodes as $node) {
      $lastClust = $clusters->data[$clusters->i];
      $lastNode = $lastClust->nodes[$lastClust->i];
      $addedLength = $node->position - $lastNode->position - $lastNode->phraseLength + $node->phraseLength;
      if ($lastClust->length + $addedLength <= $span) {
        $lastClust->nodes[] = $node;
        $lastClust->length += $addedLength;
        $lastClust->i += 1;
      } else {
        if ($addedLength > $span) {
          $newClust = new stdClass;
          $newClust->nodes = array($node);
          $newClust->i = 0;
          $newClust->length = $node->phraseLength;
          $clusters->data[] = $newClust;
          $clusters->i += 1;
        } else {
          $newClust = clone $lastClust;
          while ($newClust->length + $addedLength > $span) {
            $shiftedNode = array_shift($newClust->nodes);
            if ($shiftedNode === null) {
              break;
            }
            $newClust->i -= 1;
            $removedLength = $shiftedNode->phraseLength;
            if (isset($newClust->nodes[0])) {
              $removedLength += $newClust->nodes[0]->position - $shiftedNode->position;
            }
            $newClust->length -= $removedLength;
          }
          if ($newClust->i < 0) {
            $newClust->i = 0;
          }
          $newClust->nodes[] = $node;
          $newClust->length += $addedLength;
          $clusters->data[] = $newClust;
          $clusters->i += 1;
        }
      }
    }
    $bestClust = $clusters->data[0];
    $bestClustSize = count($bestClust->nodes);
    foreach ($clusters->data as $clust) {
      $newClustSize = count($clust->nodes);
      if ($newClustSize > $bestClustSize) {
        $bestClust = $clust;
        $bestClustSize = $newClustSize;
      }
    }
    $clustLeft = $bestClust->nodes[0]->position;
    $clustLen = $bestClust->length;
    $padding = round(($span - $clustLen)/2);
    $clustLeft -= $padding;
    if ($clustLeft < 0) {
      $clustLen += $clustLeft*-1 + $padding;
      $clustLeft = 0;
    } else {
      $clustLen += $padding*2;
    }
  } else {
    $clustLeft = 0;
    $clustLen = $span;
  }

  $textLen = strlen($text);
  $prefix = '';
  $suffix = '';

  if (!ctype_space($text[$clustLeft]) && isset($text[$clustLeft-1]) && !ctype_space($text[$clustLeft-1])) {
    while (!ctype_space($text[$clustLeft])) {
      $clustLeft += 1;
    }
    $prefix = $delimiter;
  }

  $lastChar = $clustLeft + $clustLen;
  if (!ctype_space($text[$lastChar]) && isset($text[$lastChar+1]) && !ctype_space($text[$lastChar+1])) {
    while (!ctype_space($text[$lastChar])) {
      $lastChar -= 1;
    }
    $suffix = $delimiter;
    $clustLen = $lastChar - $clustLeft;
  }

  if ($clustLeft > 0) {
    $prefix = $delimiter;
  }

  if ($clustLeft + $clustLen < $textLen) {
    $suffix = $delimiter;
  }

  return $prefix . trim(substr($text, $clustLeft, $clustLen+1)) . $suffix;
}