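Highlight 4 consecutive matching words

Time: 2019-01-30 07:34:20

Tags: php

I have two strings: one is the model answer and the other is the answer given by a student. I want to highlight, in the student's answer, every run of four consecutive words that also appears in the model answer.

I wrote the function below to match and highlight words in the answer string.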

function getCopiedText($modelAnswer, $answer) {
    $modelAnsArr = explode(' ', $modelAnswer);
    $answerArr = explode(' ', $answer);
    $common = array_intersect($answerArr, $modelAnsArr);
    if (isset($common) && !empty($common)) {
        $common[max(array_keys($common)) + 2] = '';
        $count = 0;
        $word = '';
        for ($i = 0; $i <= max(array_keys($common)); $i++) {
            if (isset($common[$i])) {
                $count++;
                $word .= $common[$i] . ' ';
            } else {
                if ($count >= 4) {
                    $answer = preg_replace("@($word)@i", '<span style="color:blue">$1</span>', $answer);
                }
                $count = 0;
                $word = '';
            }
        }
    }
    return $answer;
}

Sample strings

$modelAnswer = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry`s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.';

$answer ='Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry`s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.';

Function call

echo getCopiedText($modelAnswer, $answer);

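Problem: when the $answer string is longer than 300 characters, the function does not return a highlighted string. When $answer is shorter than 300 characters, it does. For example, if $answer is Lorem Ipsum is simply dummy text of the printing and typesetting industry., the highlighted string is returned, but this does not work for strings longer than 300 characters.

I'm not sure, but there seems to be a problem with preg_replace. The pattern (the first argument to preg_replace) may exceed a length limit.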

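2 answers:

Answer 0: (score: 2)

I'm adding a separate answer because the OP later commented that they do want to match phrases of 4 or more words, whereas my original answer was based on the OP's initial comment about matching sets of exactly 4 words.

I refactored the original answer to use a CachingIterator that walks through each word, rather than searching only for sets of 4 words. This makes it possible to specify the minimum number of words per phrase (default 4), to handle shortened duplicate phrases, and to rewind when a partial match is encountered.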
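As a minimal sketch of why a CachingIterator is useful here: it reads one element ahead, so hasNext() can tell the loop whether the current word is the last one. The sample string and the $seen variable are illustrative assumptions; the rewind via seek() on the inner iterator is not shown.

```php
<?php
// Wrap the word list so the loop can look one element ahead.
$words = new CachingIterator(new ArrayIterator(explode(' ', 'one two three')));

$seen = [];
foreach ($words as $i => $word) {
    // hasNext() is false only while the last word is being processed.
    $seen[] = $word . ($words->hasNext() ? ' (more follow)' : ' (last)');
}

print_r($seen);
```

In the answer's function this check is what lets a phrase that runs to the very end of the student's answer still be recorded as a match.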

Examples of the duplicate and rewind handling:

Model: "one two three four one two three four five six seven"
Answer:
    "two three four five two three four five six seven"
Shortened Duplicate:
    "[two three four five] [[two three four five] six seven]"

Answer: 
    "one one two three four"
Partial Match Rewind:
    "one [one two three four]"

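Source: https://3v4l.org/AKRTQ

Example: https://3v4l.org/5P2L6

This solution is case-insensitive and takes into account special characters such as @ ( , ) and non-printable characters such as \n \r \t.

I suggest also stripping all non-alphanumeric characters from both the answer and the model, to sanitize them for comparison and to make the detection algorithm more predictable.

preg_replace(['/[^[:alnum:][:space:]]/u', '/[[:space:]]{2,}/u'], ['', ' '], $answer); https://3v4l.org/Pn6CT

Alternatively, instead of explode you can use str_word_count($answer, 1, '1234567890') (https://3v4l.org/cChjo) to achieve the same result while preserving hyphenated words and words containing apostrophes.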
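The two options above can be compared side by side with a small sketch. The helper names sanitizeForComparison and splitIntoWords and the sample string are assumptions made for illustration; they are not part of the answer's code.

```php
<?php
// Hypothetical helper: strip everything except letters, digits and
// whitespace, then collapse runs of whitespace into a single space.
function sanitizeForComparison(string $text): string
{
    $clean = preg_replace(
        ['/[^[:alnum:][:space:]]/u', '/[[:space:]]{2,}/u'],
        ['', ' '],
        $text
    );
    // preg_replace() returns null on failure (e.g. invalid UTF-8 with /u),
    // so fall back to the untouched input in that case.
    return $clean ?? $text;
}

// Hypothetical helper: split into words, treating digits as word
// characters so that tokens such as "1500s" are kept.
function splitIntoWords(string $text): array
{
    return str_word_count($text, 1, '1234567890');
}

$sample = "the industry's well-known  text, since the 1500s!";

echo sanitizeForComparison($sample), "\n";
print_r(splitIntoWords($sample));
```

The first approach drops the apostrophe and the hyphen ("industrys", "wellknown"), while str_word_count keeps them inside the words and only discards the surrounding punctuation.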

function getCopiedText($model, $answer, $min = 4)
{
    //ensure there are not double spaces
    $model = str_replace('  ', ' ', $model);
    $answer = str_replace('  ', ' ', $answer);
    $test = new CachingIterator(new ArrayIterator(explode(' ', $answer)));
    $words = $matches = [];
    $p = $match = null;
    //test each word
    foreach($test as $i => $word) {
        $words[] = $word;
        $count = count($words);
        if ($count === 2) {
            //save pointer at second word
            $p = $i;
        }
        //check if the phrase of words exists in the model
        if (false !== stripos($model, $phrase = implode(' ', $words))) {
            //only match phrases with the minimum or more words
            if ($count >= $min) {
                //reset back to here for more matches
                $match = $phrase;
                if (!$test->hasNext()) {
                    //add the last word to the phrase
                    $matches[$match] = true;
                    $p = null;
                }
            }
        } else {
            //the phrase of words was no longer found
            if (null !== $match && !isset($matches[$match])) {
                //add the matched phrase to the list of matches
                $matches[$match] = true;
                $p = null;
                $iterator = $test->getInnerIterator();
                if ($iterator->valid()) {
                    //rewind pointer back to the current word since the current word may be part of the next phrase
                    $iterator->seek($i);
                }
            } elseif (null !== $p) {
                //match not found, determine if we need to rewind the pointer
                $iterator = $test->getInnerIterator();
                if ($iterator->valid()) {
                    //rewind pointer back to second word since a partial phrase less than 4 words was matched
                    $iterator->seek($p);
                }
                $p = null;
            }
            //reset testing
            $words = [];
            $match = null;
        }
    }

    //highlight the matched phrases in the answer
    if (!empty($matches)) {
        $phrases = array_keys($matches);
        //sort phrases by the length
        array_multisort(array_map('strlen', $phrases), $phrases);

        //filter the matches as regular expression patterns
        //order by longest phrase first to ensure double highlighting of smaller phrases
        $phrases  = array_map(function($phrase) {
            return '/(' . preg_quote($phrase, '/') . ')/iu';
        }, array_reverse($phrases));

        $answer = preg_replace($phrases, '<span style="color:blue">$0</span>', $answer);
    }

    return $answer;
}
$modelAnswer = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry`s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.';

$answer ='NOT IN is simply dummy text NOT in when an unknown printer took a galley -this- is simply dummy text of the printing and typesetting industry';

echo getCopiedText($modelAnswer, $answer);

Result:

NOT IN <span style="color:blue">is simply dummy text</span> NOT in <span style="color:blue">when an unknown printer took a galley</span> -this- <span style="color:blue"><span style="color:blue">is simply dummy text</span> of the printing and typesetting industry</span>

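Answer 1: (score: 1)

I'm not entirely sure what end result you want, but it looks like you are trying to highlight any set of 4 consecutive words in the given answer that also appears consecutively in the model, in order to detect potential plagiarism.

Based on your comment about retrieving matching sets of 4 words, I'd like to suggest several optimizations.

Example: https://3v4l.org/uvPug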

function getCopiedText($model, $answer) 
{
    $test = explode(' ', $answer);
    while ($test) {
        if (count($test) < 4) {
            break;
        }
        //retrieve 4 consecutive words from the answer and remove them
        $words = array_splice($test, 0, 4);
        $phrase = implode(' ', $words);
        //ensure the phrase is found in the model
        if (false !== stripos($model, $phrase)) {
            $answer = str_ireplace($phrase, '<span style="color:blue">' . $phrase . '</span>', $answer);
        }
    }

    return $answer;
}

$modelAnswer = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry`s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.';

$answer ='NOT IN is simply dummy text NOT IN when an unknown printer took a galley -this- is simply dummy text';

echo getCopiedText($modelAnswer, $answer);

Result:

NOT IN <span style="color:blue">is simply dummy text</span> NOT IN <span style="color:blue">when an unknown printer</span> took a galley -this- <span style="color:blue">is simply dummy text</span>
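Note that in this output "took a galley" stays unhighlighted even though it follows "when an unknown printer" in the model, because the words are consumed in fixed chunks of four. One possible variation, sketched below under the assumption that overlapping matches should be merged, slides a window one word at a time instead. The function name highlightSlidingWindows is hypothetical and this is not the answer's own method.

```php
<?php
// Hypothetical helper: marks every word that belongs to at least one
// $size-word window found in the model, then wraps each run of marked
// words in a single span.
function highlightSlidingWindows(string $model, string $answer, int $size = 4): string
{
    $words = explode(' ', $answer);
    $marked = array_fill(0, count($words), false);

    // Slide the window one word at a time over the answer.
    for ($i = 0; $i + $size <= count($words); $i++) {
        $phrase = implode(' ', array_slice($words, $i, $size));
        // Case-insensitive substring test, as in the answer above; like
        // that code, it can also match in the middle of a model word.
        if (stripos($model, $phrase) !== false) {
            for ($j = $i; $j < $i + $size; $j++) {
                $marked[$j] = true;
            }
        }
    }

    $out = [];
    $run = [];
    foreach ($words as $i => $word) {
        if ($marked[$i]) {
            $run[] = $word;
            continue;
        }
        if ($run) {
            $out[] = '<span style="color:blue">' . implode(' ', $run) . '</span>';
            $run = [];
        }
        $out[] = $word;
    }
    if ($run) {
        $out[] = '<span style="color:blue">' . implode(' ', $run) . '</span>';
    }

    return implode(' ', $out);
}

echo highlightSlidingWindows(
    'when an unknown printer took a galley of type',
    'NOT IN when an unknown printer took a galley -this-'
);
```

With this input the whole run "when an unknown printer took a galley" ends up inside one span, at the cost of one substring search per word rather than one per four words.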

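A tip regarding your original approach.

Whenever you pass a variable into a regex function in PHP, make sure it has been properly escaped with preg_quote. This ensures that special characters in the variable (such as @ or \) are treated as literal text rather than as pattern syntax.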
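A minimal sketch of that tip, assuming the same @ delimiter as the original function. The helper name highlightPhrase and the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: highlights one literal phrase inside a text.
function highlightPhrase(string $phrase, string $text): string
{
    // Pass the delimiter as the second argument so that it is escaped too;
    // preg_quote() does not escape "@" on its own.
    $pattern = '@(' . preg_quote($phrase, '@') . ')@i';
    $result = preg_replace($pattern, '<span style="color:blue">$1</span>', $text);
    // preg_replace() returns null on error; keep the original text then.
    return $result ?? $text;
}

echo highlightPhrase('industry. (see @ref)', 'The typesetting industry. (see @ref) today');
```

Without the escaping, the "." would match any character, the parentheses would open a capture group, and the "@" would end the pattern early.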