How to generate an excerpt containing the most search words in PHP?

Time: 2009-09-17 03:28:17

Tags: php

Here is the excerpt function:

    function excerpt($text, $phrase, $radius = 100, $ending = "...") {
            if (empty($text) or empty($phrase)) {
                return $this->truncate($text, $radius * 2, $ending);
            }

            $phraseLen = strlen($phrase);
            if ($radius < $phraseLen) {
                $radius = $phraseLen;
            }

            $pos = strpos(strtolower($text), strtolower($phrase));

            $startPos = 0;
            if ($pos > $radius) {
                $startPos = $pos - $radius;
            }

            $textLen = strlen($text);

            $endPos = $pos + $phraseLen + $radius;
            if ($endPos >= $textLen) {
                $endPos = $textLen;
            }

            $excerpt = substr($text, $startPos, $endPos - $startPos);
            if ($startPos != 0) {
                $excerpt = substr_replace($excerpt, $ending, 0, $phraseLen);
            }

            if ($endPos != $textLen) {
                $excerpt = substr_replace($excerpt, $ending, -$phraseLen);
            }

            return $excerpt;
        }

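Its drawback is that it does not try to match as many of the search words as possible; by default it matches the phrase only once.

How can I implement what I need?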
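
To make the limitation concrete, here is a minimal sketch (the helper name `firstHits` and the sample sentence are made up for illustration and are not part of the function above). Searching for the whole phrase fails when the words are not adjacent, whereas looking up each word separately still finds them:

```php
<?php
// Hypothetical helper: first case-insensitive byte offset of each search word,
// or false for a word that does not occur in $text.
function firstHits(string $text, string $phrase): array
{
    $hits = [];
    foreach (preg_split('/\s+/', $phrase, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $hits[$word] = stripos($text, $word);
    }
    return $hits;
}

$text = 'The quick brown fox jumps over the lazy dog';

// The phrase as a whole never occurs, so a single strpos/stripos call finds nothing.
var_dump(stripos($text, 'quick dog'));   // bool(false)

// Each word on its own is found.
print_r(firstHits($text, 'quick dog'));  // quick => 4, dog => 40
```

An excerpt function that wants to cover several words therefore has to start from per-word positions rather than from one match of the full phrase.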

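4 Answers:

Answer 0: (score: 5)

So far the code listed here has not worked for me, so I spent some time thinking about an algorithm to implement. What I have now works decently, and it does not appear to be a performance problem; feel free to test it. The results are not as snazzy as Google's, because there is no detection of where sentences start and end. I could add this, but it would be more complicated and I would have to give up doing it in a single function. It is already getting crowded, and it could be coded better if, for example, the object manipulations were abstracted into methods.

Anyway, this is what I have, and it should be a good start. The densest excerpt is determined, and the resulting string is approximately the span you specify. I urge some testing of this code, as I have not tested it thoroughly. There are surely problematic cases to be found.

I also encourage anyone to improve this algorithm, or simply to improve the code that implements it.

Enjoy.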

// string excerpt(string $text, string $phrase, int $span = 100, string $delimiter = '...')
// parameters:
//  $text - text to be searched
//  $phrase - search string
//  $span - approximate length of the excerpt
//  $delimiter - string to use as a suffix and/or prefix if the excerpt is from the middle of a text

function excerpt($text, $phrase, $span = 100, $delimiter = '...') {

  $phrases = preg_split('/\s+/', $phrase);

  $regexp = '/\b(?:';
  foreach ($phrases as $phrase) {
    $regexp .= preg_quote($phrase, '/') . '|';
  }

  $regexp = substr($regexp, 0, -1) . ')\b/i';
  $matches = array();
  preg_match_all($regexp, $text, $matches, PREG_OFFSET_CAPTURE);
  $matches = $matches[0];

  $nodes = array();
  foreach ($matches as $match) {
    $node = new stdClass;
    $node->phraseLength = strlen($match[0]);
    $node->position = $match[1];
    $nodes[] = $node;
  }

  if (count($nodes) > 0) {
    $clust = new stdClass;
    $clust->nodes[] = array_shift($nodes);
    $clust->length = $clust->nodes[0]->phraseLength;
    $clust->i = 0;
    $clusters = new stdClass;
    $clusters->data = array($clust);
    $clusters->i = 0;
    foreach ($nodes as $node) {
      $lastClust = $clusters->data[$clusters->i];
      $lastNode = $lastClust->nodes[$lastClust->i];
      $addedLength = $node->position - $lastNode->position - $lastNode->phraseLength + $node->phraseLength;
      if ($lastClust->length + $addedLength <= $span) {
        $lastClust->nodes[] = $node;
        $lastClust->length += $addedLength;
        $lastClust->i += 1;
      } else {
        if ($addedLength > $span) {
          $newClust = new stdClass;
          $newClust->nodes = array($node);
          $newClust->i = 0;
          $newClust->length = $node->phraseLength;
          $clusters->data[] = $newClust;
          $clusters->i += 1;
        } else {
          $newClust = clone $lastClust;
          while ($newClust->length + $addedLength > $span) {
            $shiftedNode = array_shift($newClust->nodes);
            if ($shiftedNode === null) {
              break;
            }
            $newClust->i -= 1;
            $removedLength = $shiftedNode->phraseLength;
            if (isset($newClust->nodes[0])) {
              $removedLength += $newClust->nodes[0]->position - $shiftedNode->position;
            }
            $newClust->length -= $removedLength;
          }
          if ($newClust->i < 0) {
            $newClust->i = 0;
          }
          $newClust->nodes[] = $node;
          $newClust->length += $addedLength;
          $clusters->data[] = $newClust;
          $clusters->i += 1;
        }
      }
    }
    $bestClust = $clusters->data[0];
    $bestClustSize = count($bestClust->nodes);
    foreach ($clusters->data as $clust) {
      $newClustSize = count($clust->nodes);
      if ($newClustSize > $bestClustSize) {
        $bestClust = $clust;
        $bestClustSize = $newClustSize;
      }
    }
    $clustLeft = $bestClust->nodes[0]->position;
    $clustLen = $bestClust->length;
    $padding = (int) round(($span - $clustLen)/2); // cast: round() returns a float, but string offsets must be integers
    $clustLeft -= $padding;
    if ($clustLeft < 0) {
      $clustLen += $clustLeft*-1 + $padding;
      $clustLeft = 0;
    } else {
      $clustLen += $padding*2;
    }
  } else {
    $clustLeft = 0;
    $clustLen = $span;
  }

  $textLen = strlen($text);
  $prefix = '';
  $suffix = '';

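  // Move the left edge forward to the next whitespace if it falls inside a word.
  // The bounds checks avoid reading past either end of $text.
  if ($clustLeft > 0 && $clustLeft < $textLen && !ctype_space($text[$clustLeft]) && !ctype_space($text[$clustLeft-1])) {
    while ($clustLeft < $textLen && !ctype_space($text[$clustLeft])) {
      $clustLeft += 1;
    }
    $prefix = $delimiter;
  }

  // Move the right edge back to the previous whitespace if it falls inside a word.
  $lastChar = $clustLeft + $clustLen;
  if ($lastChar + 1 < $textLen && !ctype_space($text[$lastChar]) && !ctype_space($text[$lastChar+1])) {
    while ($lastChar > $clustLeft && !ctype_space($text[$lastChar])) {
      $lastChar -= 1;
    }
    $suffix = $delimiter;
    $clustLen = $lastChar - $clustLeft;
  }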

  if ($clustLeft > 0) {
    $prefix = $delimiter;
  }

  if ($clustLeft + $clustLen < $textLen) {
    $suffix = $delimiter;
  }

  return $prefix . trim(substr($text, $clustLeft, $clustLen+1)) . $suffix;
}
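
For comparison, the core idea of the answer above (pick the stretch of text of at most `$span` bytes that contains the most matches) can also be sketched with a two-pointer window over the match offsets. This is a simplified illustration under stated assumptions, not the exact clustering code from the answer: `densestWindow` is a hypothetical name, and it expects `[offset, length]` pairs already sorted by offset, such as those derived from `preg_match_all` with `PREG_OFFSET_CAPTURE`.

```php
<?php
// Sketch: find the window of at most $span bytes that covers the most matches.
// $matches is a list of [offset, length] pairs sorted by offset (an assumption).
function densestWindow(array $matches, int $span): array
{
    $best = ['start' => 0, 'count' => 0];
    $left = 0;
    foreach ($matches as $right => [$offset, $length]) {
        // Drop matches from the left until the stretch from the leftmost
        // match's start to the current match's end fits into $span.
        while ($left < $right && $offset + $length - $matches[$left][0] > $span) {
            $left++;
        }
        $count = $right - $left + 1;
        if ($count > $best['count']) {
            $best = ['start' => $matches[$left][0], 'count' => $count];
        }
    }
    return $best;
}

// Five 3-byte matches; the three around offsets 50..72 are packed most densely.
$matches = [[0, 3], [50, 3], [60, 3], [70, 3], [300, 3]];
print_r(densestWindow($matches, 40)); // start => 50, count => 3
```

The returned start offset would still need the padding and word-boundary handling that the function above performs before it can be used to cut the excerpt.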

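Answer 1: (score: 5)

I came up with the following to generate excerpts. You can see the code here: https://github.com/boyter/php-excerpt. It works by finding all the locations of the matching words and then taking an excerpt based on which words are closest together. In theory this does not sound very good, but in practice it works very well.

It is actually very close to how Sphider generates its snippets (for the record, that is in searchfuncs.php, lines 529 to 566). I think the following is easier to read and does not have the bugs that exist in Sphider. It also does not use regular expressions, which makes it a bit faster than the other methods I have used.

I blogged about it here: http://www.boyter.org/2013/04/building-a-search-result-extract-generator-in-php/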

<?php

// find the locations of each of the words
// Nothing exciting here. The array_unique is required 
// unless you decide to make the words unique before passing in
function _extractLocations($words, $fulltext) {
    $locations = array();
    foreach($words as $word) {
        $wordlen = strlen($word);
        $loc = stripos($fulltext, $word);
        while($loc !== FALSE) {
            $locations[] = $loc;
            $loc = stripos($fulltext, $word, $loc + $wordlen);
        }
    }
    $locations = array_unique($locations);
    sort($locations);

    return $locations;
}

// Work out which is the most relevant portion to display
// This is done by looping over each match and finding the smallest distance between two found 
// strings. The idea being that the closer the terms are the better match the snippet would be. 
// When checking for matches we only change the location if there is a better match. 
// The only exception is where we have only two matches in which case we just take the 
// first as will be equally distant.
function _determineSnipLocation($locations, $prevcount) {
    // If we only have 1 match we don't actually run the for loop, so default to the first.
    // Fall back to 0 when nothing matched, so an empty $locations raises no warning.
    $startpos = $locations[0] ?? 0;
    $loccount = count($locations);
    $smallestdiff = PHP_INT_MAX;    

    // If we only have 2 skip as its probably equally relevant
    if(count($locations) > 2) {
        // skip the first as we check 1 behind
        for($i=1; $i < $loccount; $i++) { 
            if($i == $loccount-1) { // at the end
                $diff = $locations[$i] - $locations[$i-1];
            }
            else {
                $diff = $locations[$i+1] - $locations[$i];
            }

            if($smallestdiff > $diff) {
                $smallestdiff = $diff;
                $startpos = $locations[$i];
            }
        }
    }

    $startpos = $startpos > $prevcount ? $startpos - $prevcount : 0;
    return $startpos;
}

// 1/6 ratio on prevcount tends to work pretty well and puts the terms
// in the middle of the extract
function extractRelevant($words, $fulltext, $rellength=300, $prevcount=50, $indicator='...') {

    $textlength = strlen($fulltext);
    if($textlength <= $rellength) {
        return $fulltext;
    }

    $locations = _extractLocations($words, $fulltext);
    $startpos  = _determineSnipLocation($locations,$prevcount);

    // if we are going to snip too much...
    if($textlength-$startpos < $rellength) {
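        // cast to int (substr() expects an integer offset) and never go below the start of the text
        $startpos = max(0, (int) ($startpos - ($textlength-$startpos)/2));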
    }

    $reltext = substr($fulltext, $startpos, $rellength);

    // check to ensure we dont snip the last word if thats the match
    if( $startpos + $rellength < $textlength) {
        $reltext = substr($reltext, 0, strrpos($reltext, " ")).$indicator; // remove last word
    }

    // If we trimmed from the front add ...
    if($startpos != 0) {
        $reltext = $indicator.substr($reltext, strpos($reltext, " ") + 1); // remove first word
    }

    return $reltext;
}
?>
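
The selection step described above (anchor the excerpt where two matches sit closest together) can be illustrated in isolation. This is a simplified variant for illustration, not the author's exact loop: `closestPairStart` is a hypothetical name, and unlike `_determineSnipLocation` it also considers the gap after the first match.

```php
<?php
// Sketch: return the offset of the match that is followed most closely by another match.
// Falls back to the first match (or 0 when there are none).
function closestPairStart(array $locations): int
{
    sort($locations);
    $start = $locations[0] ?? 0;
    $smallestGap = PHP_INT_MAX;
    $count = count($locations);
    for ($i = 0; $i + 1 < $count; $i++) {
        $gap = $locations[$i + 1] - $locations[$i];
        if ($gap < $smallestGap) {
            $smallestGap = $gap;
            $start = $locations[$i];
        }
    }
    return $start;
}

// Matches at 100 and 104 are only 4 bytes apart, so the excerpt is anchored at 100.
$anchor = closestPairStart([5, 100, 104, 400]);
// Leave some context in front of the anchor, as $prevcount does above.
$startpos = max(0, $anchor - 50);
echo $anchor, ' ', $startpos, "\n"; // 100 50
```

Note that this looks only at neighbouring pairs, so a tight pair can win over a slightly looser group of three or more matches.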

Answer 2: (score: 0)

function excerpt($text, $phrase, $radius = 100, $ending = "...") {
    $phraseLen = strlen($phrase);
    if ($radius < $phraseLen) {
        $radius = $phraseLen;
    }

    // Split the phrase into words and stop at the first word found in the text.
    $phrases = explode(' ', $phrase);

    foreach ($phrases as $phrase) {
        $pos = strpos(strtolower($text), strtolower($phrase));
        // strpos() returns false when there is no match, so compare strictly.
        if ($pos !== false) break;
    }

    $startPos = 0;
    if ($pos > $radius) {
        $startPos = $pos - $radius;
    }

    $textLen = strlen($text);

    $endPos = $pos + $phraseLen + $radius;
    if ($endPos >= $textLen) {
        $endPos = $textLen;
    }

    $excerpt = substr($text, $startPos, $endPos - $startPos);
    if ($startPos != 0) {
        $excerpt = substr_replace($excerpt, $ending, 0, $phraseLen);
    }

    if ($endPos != $textLen) {
        $excerpt = substr_replace($excerpt, $ending, -$phraseLen);
    }

    return $excerpt;
}

Answer 3: (score: 0)

I could not contact erisco, so I am posting his function with multiple fixes (most importantly, multibyte support).

/**
 * @param string $text text to be searched
 * @param string $phrase search string
 * @param int $span approximate length of the excerpt
 * @param string $delimiter string to use as a suffix and/or prefix if the excerpt is from the middle of a text
 *
 * @return string
 */
public static function excerpt($text, $phrase, $span = 100, $delimiter = '...')
{
	$phrases = preg_split('/\s+/u', $phrase);
	$regexp = '/\b(?:';
	foreach($phrases as $phrase)
	{
		$regexp.= preg_quote($phrase, '/') . '|';
	}

	$regexp = mb_substr($regexp, 0, -1) .')\b/ui';
	$matches = [];
	preg_match_all($regexp, $text, $matches, PREG_OFFSET_CAPTURE);
	$matches = $matches[0];
	$nodes = [];
	foreach($matches as $match)
	{
		$node = new stdClass;
		$node->phraseLength = mb_strlen($match[0]);
		$node->position = mb_strlen(substr($text, 0, $match[1])); // calculate UTF-8 position (@see https://bugs.php.net/bug.php?id=67487)
		$nodes[] = $node;
	}

	if(count($nodes) > 0)
	{
		$clust = new stdClass;
		$clust->nodes[] = array_shift($nodes);
		$clust->length = $clust->nodes[0]->phraseLength;
		$clust->i = 0;
		$clusters = new stdClass;
		$clusters->data =
		[
			$clust
		];
		$clusters->i = 0;
		foreach($nodes as $node)
		{
			$lastClust = $clusters->data[$clusters->i];
			$lastNode = $lastClust->nodes[$lastClust->i];
			$addedLength = $node->position - $lastNode->position - $lastNode->phraseLength + $node->phraseLength;
			if($lastClust->length + $addedLength <= $span)
			{
				$lastClust->nodes[] = $node;
				$lastClust->length+= $addedLength;
				$lastClust->i++;
			}
			else
			{
				if($addedLength > $span)
				{
					$newClust = new stdClass;
					$newClust->nodes =
					[
						$node
					];
					$newClust->i = 0;
					$newClust->length = $node->phraseLength;
					$clusters->data[] = $newClust;
					$clusters->i++;
				}
				else
				{
					$newClust = clone $lastClust;
					while($newClust->length + $addedLength > $span)
					{
						$shiftedNode = array_shift($newClust->nodes);
						if($shiftedNode === null)
						{
							break;
						}

						$newClust->i--;
						$removedLength = $shiftedNode->phraseLength;
						if(isset($newClust->nodes[0]))
						{
							$removedLength+= $newClust->nodes[0]->position - $shiftedNode->position;
						}

						$newClust->length-= $removedLength;
					}

					if($newClust->i < 0)
					{
						$newClust->i = 0;
					}

					$newClust->nodes[] = $node;
					$newClust->length+= $addedLength;
					$clusters->data[] = $newClust;
					$clusters->i++;
				}
			}
		}

		$bestClust = $clusters->data[0];
		$bestClustSize = count($bestClust->nodes);
		foreach($clusters->data as $clust)
		{
			$newClustSize = count($clust->nodes);
			if($newClustSize > $bestClustSize)
			{
				$bestClust = $clust;
				$bestClustSize = $newClustSize;
			}
		}

		$clustLeft = $bestClust->nodes[0]->position;
		$clustLen = $bestClust->length;
		$padding = intval(round(($span - $clustLen) / 2));
		$clustLeft-= $padding;
		if($clustLeft < 0)
		{
			$clustLen+= $clustLeft * -1 + $padding;
			$clustLeft = 0;
		}
		else
		{
			$clustLen+= $padding * 2;
		}
	}
	else
	{
		$clustLeft = 0;
		$clustLen = $span;
	}

	$textLen = mb_strlen($text);
	$prefix = '';
	$suffix = '';
	if($clustLeft > 0 && !ctype_space(mb_substr($text, $clustLeft, 1))
		&& !ctype_space(mb_substr($text, $clustLeft - 1, 1)))
	{
		$clustLeft++;
		while($clustLeft < $textLen && !ctype_space(mb_substr($text, $clustLeft, 1))) // stop at the end of the text if no whitespace follows
		{
			$clustLeft++;
		}

		$prefix = $delimiter;
	}

	$lastChar = $clustLeft + $clustLen;
	if($lastChar < $textLen && !ctype_space(mb_substr($text, $lastChar, 1))
		&& !ctype_space(mb_substr($text, $lastChar + 1, 1)))
	{
		$lastChar--;
		while($lastChar > $clustLeft && !ctype_space(mb_substr($text, $lastChar, 1))) // never move back past the start of the excerpt
		{
			$lastChar--;
		}

		$suffix = $delimiter;
		$clustLen = $lastChar - $clustLeft;
	}

	if($clustLeft > 0)
	{
		$prefix = $delimiter;
	}
	if($clustLeft + $clustLen < $textLen)
	{
		$suffix = $delimiter;
	}

	return $prefix . trim(mb_substr($text, $clustLeft, $clustLen + 1)) . $suffix;
}
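
The key multibyte detail is that `preg_match_all` with `PREG_OFFSET_CAPTURE` reports byte offsets even with the `u` modifier, while `mb_substr` works in characters. Here is a minimal sketch of the conversion used in the function above, assuming the mbstring extension is loaded and the text is UTF-8 (`charOffset` is a hypothetical helper name, and the sample sentence is made up):

```php
<?php
// Hypothetical helper: convert a byte offset into a character offset
// by counting the characters in the bytes that precede it.
function charOffset(string $text, int $byteOffset): int
{
    return mb_strlen(substr($text, 0, $byteOffset), 'UTF-8');
}

$text = 'Überraschung für alle';

// "Ü" takes two bytes in UTF-8, so byte and character offsets diverge after it.
preg_match('/für/u', $text, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];                        // 14 (bytes)
$charOffset = charOffset($text, $byteOffset);  // 13 (characters)

// Cutting with the character offset yields the matched word.
echo mb_substr($text, $charOffset, 3, 'UTF-8'), "\n"; // für
```

Mixing the two kinds of offset is what shifts or cuts excerpts in the wrong place when the text contains non-ASCII characters.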