这是摘录功能:
function excerpt($text, $phrase, $radius = 100, $ending = "...") {
270 if (empty($text) or empty($phrase)) {
271 return $this->truncate($text, $radius * 2, $ending);
272 }
273
274 $phraseLen = strlen($phrase);
275 if ($radius < $phraseLen) {
276 $radius = $phraseLen;
277 }
278
279 $pos = strpos(strtolower($text), strtolower($phrase));
280
281 $startPos = 0;
282 if ($pos > $radius) {
283 $startPos = $pos - $radius;
284 }
285
286 $textLen = strlen($text);
287
288 $endPos = $pos + $phraseLen + $radius;
289 if ($endPos >= $textLen) {
290 $endPos = $textLen;
291 }
292
293 $excerpt = substr($text, $startPos, $endPos - $startPos);
294 if ($startPos != 0) {
295 $excerpt = substr_replace($excerpt, $ending, 0, $phraseLen);
296 }
297
298 if ($endPos != $textLen) {
299 $excerpt = substr_replace($excerpt, $ending, -$phraseLen);
300 }
301
302 return $excerpt;
303 }
它的缺点是它不会尝试匹配尽可能多的搜索词,默认情况下只匹配一次。
如何实现所需的?
答案 0 :(得分:5)
到目前为止,此处列出的代码对我没有用,所以我花了一些时间考虑实现算法。我现在所做的工作得体,而且似乎不是性能问题 - 随意测试。结果并不像谷歌那样时髦,因为没有检测到句子的开始和结束位置。我可以添加这个,但它会更复杂,我不得不在一个功能中做这件事。如果对象操作被抽象为方法,那么它已经变得拥挤并且可以更好地编码。
无论如何,这就是我所拥有的,它应该是一个良好的开端。确定最密集的摘录,结果字符串大约是您指定的范围。我敦促对这段代码进行一些测试,因为我还没有彻底完成它。肯定有问题的案例可以找到。
我也鼓励任何人改进这个算法,或者只是改进代码来执行它。
享受。
// string excerpt(string $text, string $phrase, int $span = 100, string $delimiter = '...')
// parameters:
// $text - text to be searched
// $phrase - search string
// $span - approximate length of the excerpt
// $delimiter - string to use as a suffix and/or prefix if the excerpt is from the middle of a text
function excerpt($text, $phrase, $span = 100, $delimiter = '...') {
$phrases = preg_split('/\s+/', $phrase);
$regexp = '/\b(?:';
foreach ($phrases as $phrase) {
$regexp .= preg_quote($phrase, '/') . '|';
}
$regexp = substr($regexp, 0, -1) . ')\b/i';
$matches = array();
preg_match_all($regexp, $text, $matches, PREG_OFFSET_CAPTURE);
$matches = $matches[0];
$nodes = array();
foreach ($matches as $match) {
$node = new stdClass;
$node->phraseLength = strlen($match[0]);
$node->position = $match[1];
$nodes[] = $node;
}
if (count($nodes) > 0) {
$clust = new stdClass;
$clust->nodes[] = array_shift($nodes);
$clust->length = $clust->nodes[0]->phraseLength;
$clust->i = 0;
$clusters = new stdClass;
$clusters->data = array($clust);
$clusters->i = 0;
foreach ($nodes as $node) {
$lastClust = $clusters->data[$clusters->i];
$lastNode = $lastClust->nodes[$lastClust->i];
$addedLength = $node->position - $lastNode->position - $lastNode->phraseLength + $node->phraseLength;
if ($lastClust->length + $addedLength <= $span) {
$lastClust->nodes[] = $node;
$lastClust->length += $addedLength;
$lastClust->i += 1;
} else {
if ($addedLength > $span) {
$newClust = new stdClass;
$newClust->nodes = array($node);
$newClust->i = 0;
$newClust->length = $node->phraseLength;
$clusters->data[] = $newClust;
$clusters->i += 1;
} else {
$newClust = clone $lastClust;
while ($newClust->length + $addedLength > $span) {
$shiftedNode = array_shift($newClust->nodes);
if ($shiftedNode === null) {
break;
}
$newClust->i -= 1;
$removedLength = $shiftedNode->phraseLength;
if (isset($newClust->nodes[0])) {
$removedLength += $newClust->nodes[0]->position - $shiftedNode->position;
}
$newClust->length -= $removedLength;
}
if ($newClust->i < 0) {
$newClust->i = 0;
}
$newClust->nodes[] = $node;
$newClust->length += $addedLength;
$clusters->data[] = $newClust;
$clusters->i += 1;
}
}
}
$bestClust = $clusters->data[0];
$bestClustSize = count($bestClust->nodes);
foreach ($clusters->data as $clust) {
$newClustSize = count($clust->nodes);
if ($newClustSize > $bestClustSize) {
$bestClust = $clust;
$bestClustSize = $newClustSize;
}
}
$clustLeft = $bestClust->nodes[0]->position;
$clustLen = $bestClust->length;
$padding = round(($span - $clustLen)/2);
$clustLeft -= $padding;
if ($clustLeft < 0) {
$clustLen += $clustLeft*-1 + $padding;
$clustLeft = 0;
} else {
$clustLen += $padding*2;
}
} else {
$clustLeft = 0;
$clustLen = $span;
}
$textLen = strlen($text);
$prefix = '';
$suffix = '';
if (!ctype_space($text[$clustLeft]) && isset($text[$clustLeft-1]) && !ctype_space($text[$clustLeft-1])) {
while (!ctype_space($text[$clustLeft])) {
$clustLeft += 1;
}
$prefix = $delimiter;
}
$lastChar = $clustLeft + $clustLen;
if (!ctype_space($text[$lastChar]) && isset($text[$lastChar+1]) && !ctype_space($text[$lastChar+1])) {
while (!ctype_space($text[$lastChar])) {
$lastChar -= 1;
}
$suffix = $delimiter;
$clustLen = $lastChar - $clustLeft;
}
if ($clustLeft > 0) {
$prefix = $delimiter;
}
if ($clustLeft + $clustLen < $textLen) {
$suffix = $delimiter;
}
return $prefix . trim(substr($text, $clustLeft, $clustLen+1)) . $suffix;
}
答案 1 :(得分:5)
我想出了以下内容来生成摘录。你可以在这里看到代码https://github.com/boyter/php-excerpt它的工作原理是找到匹配单词的所有位置,然后根据最接近的单词进行摘录。从理论上讲,这听起来并不是很好,但在实践中它的效果非常好。
它实际上非常接近Sphider(它在第529行到第566行的searchfuncs.php中的记录)如何产生它的片段。我认为以下内容更容易阅读,并且没有Sphider中存在的错误。它也不使用正则表达式,这使得它比我使用的其他方法快一点。
我在这里发表了博客http://www.boyter.org/2013/04/building-a-search-result-extract-generator-in-php/
<?php
// find the locations of each of the words
// Nothing exciting here. The array_unique is required
// unless you decide to make the words unique before passing in
function _extractLocations($words, $fulltext) {
$locations = array();
foreach($words as $word) {
$wordlen = strlen($word);
$loc = stripos($fulltext, $word);
while($loc !== FALSE) {
$locations[] = $loc;
$loc = stripos($fulltext, $word, $loc + $wordlen);
}
}
$locations = array_unique($locations);
sort($locations);
return $locations;
}
// Work out which is the most relevant portion to display
// This is done by looping over each match and finding the smallest distance between two found
// strings. The idea being that the closer the terms are the better match the snippet would be.
// When checking for matches we only change the location if there is a better match.
// The only exception is where we have only two matches in which case we just take the
// first as will be equally distant.
function _determineSnipLocation($locations, $prevcount) {
// If we only have 1 match we dont actually do the for loop so set to the first
$startpos = $locations[0];
$loccount = count($locations);
$smallestdiff = PHP_INT_MAX;
// If we only have 2 skip as its probably equally relevant
if(count($locations) > 2) {
// skip the first as we check 1 behind
for($i=1; $i < $loccount; $i++) {
if($i == $loccount-1) { // at the end
$diff = $locations[$i] - $locations[$i-1];
}
else {
$diff = $locations[$i+1] - $locations[$i];
}
if($smallestdiff > $diff) {
$smallestdiff = $diff;
$startpos = $locations[$i];
}
}
}
$startpos = $startpos > $prevcount ? $startpos - $prevcount : 0;
return $startpos;
}
// 1/6 ratio on prevcount tends to work pretty well and puts the terms
// in the middle of the extract
function extractRelevant($words, $fulltext, $rellength=300, $prevcount=50, $indicator='...') {
$textlength = strlen($fulltext);
if($textlength <= $rellength) {
return $fulltext;
}
$locations = _extractLocations($words, $fulltext);
$startpos = _determineSnipLocation($locations,$prevcount);
// if we are going to snip too much...
if($textlength-$startpos < $rellength) {
$startpos = $startpos - ($textlength-$startpos)/2;
}
$reltext = substr($fulltext, $startpos, $rellength);
// check to ensure we dont snip the last word if thats the match
if( $startpos + $rellength < $textlength) {
$reltext = substr($reltext, 0, strrpos($reltext, " ")).$indicator; // remove last word
}
// If we trimmed from the front add ...
if($startpos != 0) {
$reltext = $indicator.substr($reltext, strpos($reltext, " ") + 1); // remove first word
}
return $reltext;
}
?>
答案 2 :(得分:0)
function excerpt($text, $phrase, $radius = 100, $ending = "...") {
$phraseLen = strlen($phrase);
if ($radius < $phraseLen) {
$radius = $phraseLen;
}
$phrases = explode (' ',$phrase);
foreach ($phrases as $phrase) {
$pos = strpos(strtolower($text), strtolower($phrase));
if ($pos > -1) break;
}
$startPos = 0;
if ($pos > $radius) {
$startPos = $pos - $radius;
}
$textLen = strlen($text);
$endPos = $pos + $phraseLen + $radius;
if ($endPos >= $textLen) {
$endPos = $textLen;
}
$excerpt = substr($text, $startPos, $endPos - $startPos);
if ($startPos != 0) {
$excerpt = substr_replace($excerpt, $ending, 0, $phraseLen);
}
if ($endPos != $textLen) {
$excerpt = substr_replace($excerpt, $ending, -$phraseLen);
}
return $excerpt; }
答案 3 :(得分:0)
我无法联系erisco,因此我发布了多项修复功能(最重要的是多字节支持)。
/**
* @param string $text text to be searched
* @param string $phrase search string
* @param int $span approximate length of the excerpt
* @param string $delimiter string to use as a suffix and/or prefix if the excerpt is from the middle of a text
*
* @return string
*/
public static function excerpt($text, $phrase, $span = 100, $delimiter = '...')
{
$phrases = preg_split('/\s+/u', $phrase);
$regexp = '/\b(?:';
foreach($phrases as $phrase)
{
$regexp.= preg_quote($phrase, '/') . '|';
}
$regexp = mb_substr($regexp, 0, -1) .')\b/ui';
$matches = [];
preg_match_all($regexp, $text, $matches, PREG_OFFSET_CAPTURE);
$matches = $matches[0];
$nodes = [];
foreach($matches as $match)
{
$node = new stdClass;
$node->phraseLength = mb_strlen($match[0]);
$node->position = mb_strlen(substr($text, 0, $match[1])); // calculate UTF-8 position (@see https://bugs.php.net/bug.php?id=67487)
$nodes[] = $node;
}
if(count($nodes) > 0)
{
$clust = new stdClass;
$clust->nodes[] = array_shift($nodes);
$clust->length = $clust->nodes[0]->phraseLength;
$clust->i = 0;
$clusters = new stdClass;
$clusters->data =
[
$clust
];
$clusters->i = 0;
foreach($nodes as $node)
{
$lastClust = $clusters->data[$clusters->i];
$lastNode = $lastClust->nodes[$lastClust->i];
$addedLength = $node->position - $lastNode->position - $lastNode->phraseLength + $node->phraseLength;
if($lastClust->length + $addedLength <= $span)
{
$lastClust->nodes[] = $node;
$lastClust->length+= $addedLength;
$lastClust->i++;
}
else
{
if($addedLength > $span)
{
$newClust = new stdClass;
$newClust->nodes =
[
$node
];
$newClust->i = 0;
$newClust->length = $node->phraseLength;
$clusters->data[] = $newClust;
$clusters->i++;
}
else
{
$newClust = clone $lastClust;
while($newClust->length + $addedLength > $span)
{
$shiftedNode = array_shift($newClust->nodes);
if($shiftedNode === null)
{
break;
}
$newClust->i--;
$removedLength = $shiftedNode->phraseLength;
if(isset($newClust->nodes[0]))
{
$removedLength+= $newClust->nodes[0]->position - $shiftedNode->position;
}
$newClust->length-= $removedLength;
}
if($newClust->i < 0)
{
$newClust->i = 0;
}
$newClust->nodes[] = $node;
$newClust->length+= $addedLength;
$clusters->data[] = $newClust;
$clusters->i++;
}
}
}
$bestClust = $clusters->data[0];
$bestClustSize = count($bestClust->nodes);
foreach($clusters->data as $clust)
{
$newClustSize = count($clust->nodes);
if($newClustSize > $bestClustSize)
{
$bestClust = $clust;
$bestClustSize = $newClustSize;
}
}
$clustLeft = $bestClust->nodes[0]->position;
$clustLen = $bestClust->length;
$padding = intval(round(($span - $clustLen) / 2));
$clustLeft-= $padding;
if($clustLeft < 0)
{
$clustLen+= $clustLeft * -1 + $padding;
$clustLeft = 0;
}
else
{
$clustLen+= $padding * 2;
}
}
else
{
$clustLeft = 0;
$clustLen = $span;
}
$textLen = mb_strlen($text);
$prefix = '';
$suffix = '';
if($clustLeft > 0 && !ctype_space(mb_substr($text, $clustLeft, 1))
&& !ctype_space(mb_substr($text, $clustLeft - 1, 1)))
{
$clustLeft++;
while(!ctype_space(mb_substr($text, $clustLeft, 1)))
{
$clustLeft++;
}
$prefix = $delimiter;
}
$lastChar = $clustLeft + $clustLen;
if($lastChar < $textLen && !ctype_space(mb_substr($text, $lastChar, 1))
&& !ctype_space(mb_substr($text, $lastChar + 1, 1)))
{
$lastChar--;
while(!ctype_space(mb_substr($text, $lastChar, 1)))
{
$lastChar--;
}
$suffix = $delimiter;
$clustLen = $lastChar - $clustLeft;
}
if($clustLeft > 0)
{
$prefix = $delimiter;
}
if($clustLeft + $clustLen < $textLen)
{
$suffix = $delimiter;
}
return $prefix . trim(mb_substr($text, $clustLeft, $clustLen + 1)) . $suffix;
}
&#13;