Question

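I am trying to find (and replace) repeated strings within a string.

My string might look like this:

Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.

This should become:

Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.

Note that amit sit is not removed, because it is not repeated.

Or the string could look like this:

Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.

Which should become:

Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

So it is not just a-z; other (ASCII) characters can appear as well. I would be very happy if someone could help me with this.

The next step is to match (and replace) something like this:

2 questions 3 questions 4 questions 5 questions

Which would become:

2 questions

The number in the final output can be any of the numbers (2, 3, 4, ...); it does not matter. In this last example only the numbers differ, while the words are the same.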

Answer 1

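In case it helps, \1, \2, etc. refer back to previously captured groups. So, for example, the following picks out repeated words and reduces them to a single occurrence:

$string =~ s/(\w+)( \1)+/$1/g;

Repeated phrases can be handled similarly.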
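As a minimal sketch of the same backreference idea in PHP (the language most of the other answers use), the pattern below also adds \b word boundaries, so that a word is not matched against the tail of the previous one (for example the "is" inside "this is"). The function name collapse_repeated_words is hypothetical and only there for illustration.

```php
<?php
// Hypothetical helper: collapse runs of the same word ("sit sit sit" becomes "sit").
function collapse_repeated_words(string $text): string
{
    // \b(\w+)\b   captures one whole word as group 1
    // (?: \1\b)+  matches one or more repeats of that word, each preceded by a space
    return preg_replace('/\b(\w+)\b(?: \1\b)+/', '$1', $text);
}

echo collapse_repeated_words('ut aliquip aliquip ex ea'), "\n"; // ut aliquip ex ea
```

Because \w only covers letters, digits and underscore, this sketch will not collapse tokens such as "()"; it is meant to show the backreference, not to solve the whole question.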

Answer 2

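An interesting problem. It can be solved with a single preg_replace() statement, but the length of the repeated phrase must be limited to avoid excessive backtracking. Here is a solution with a commented regex that works on the test data and fixes doubled, tripled (or n-times repeated) phrases with a maximum length of 50 characters:

Solution to part 1: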

$result = preg_replace('/
    # Match a doubled "phrase" having length up to 50 chars.
    (            # $1: Phrase having whitespace boundaries.
      (?<=\s|^)  # Assert phrase preceded by ws or BOL.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $1: Phrase
    (?:          # Group for one or more duplicate phrases.
      \s+        # Doubled phrase separated by whitespace.
      \1         # Match duplicate of phrase.
    ){1,}        # Require one or more duplicate phrases.
    /x', '$1', $text);

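Note that with this solution a "phrase" can consist of a single word, and there are legitimate cases where a doubled word is valid grammar and should not be fixed. If that is not the desired behavior, the regex can easily be modified to define a "phrase" as two or more "words".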
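One way the "two or more words" variant might look is sketched below. It is not the answer author's regex: the function name collapse_repeated_phrases is hypothetical, the cap of six words per phrase is an assumption that stands in for the 50-character limit, and the trailing lookahead is an extra guard so that a duplicate cannot end in the middle of a longer word.

```php
<?php
// Hypothetical helper: collapse repeated phrases of two to six words.
function collapse_repeated_phrases(string $text): string
{
    return preg_replace('/
        (                    # $1: phrase of two or more words.
          (?<=\s|^)          # Phrase preceded by whitespace or start.
          \S+                # First word.
          (?:\s+\S+){1,5}?   # Lazily add one to five more words (assumed cap).
        )
        (?:                  # One or more duplicates of the phrase.
          \s+\1              # Whitespace, then the same phrase again.
          (?=\s|$)           # Duplicate must end at whitespace or end.
        )+
        /x', '$1', $text);
}

echo collapse_repeated_phrases(
    'dolor sit amet () sit amet () sit amet () sit nostrud ut aliquip aliquip ex'
), "\n";
// dolor sit amet () sit nostrud ut aliquip aliquip ex
```

With this definition the single doubled word "aliquip aliquip" is deliberately left alone, which is the trade-off described above.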

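Edit: modified the regex above to handle any number of phrase repetitions. Also added a solution to the second part of the question below.

Here is a similar solution in which the phrase begins with a word of digits, and the repeated phrases must also begin with a word of digits (though the first digits word of a repeated phrase does not need to match the original):

Solution to part 2: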

$result = preg_replace('/
    # Match doubled "phrases" with wildcard digits first word.
    (            # $1: 1st word of phrase (digits).
    \b           # Anchor 1st phrase word to word boundary.
    \d+          # Phrase 1st word is string of digits.
    \s+          # 1st and 2nd words separated by whitespace.
    )            # End $1:  1st word of phrase (digits).
    (            # $2: Part of phrase after 1st digits word.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $2: Part of phrase after 1st digits word.
    (?:          # Group for one or more duplicate phrases.
      \s+        # Duplicate separated from phrase by whitespace.
      \d+        # Duplicate starts with any digits word.
      \s+        # Digits word followed by whitespace.
      \2         # Match duplicate of rest of phrase.
    ){1,}        # Require one or more duplicate phrases.
    /x', '$1$2', $text);

Answer 3

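((?:\b|^)[\x20-\x7E]+)(\1)+ will match any repeated string of printable ASCII characters that starts on a word boundary. That means it will match hello hello but not the double l in hello.

If you want to adjust which characters match, you can change and add ranges of the form \x##-\x##\x##-\x## (where ## is a hex value) and omit the -\x## where you only want to add a single character.

The only problem I can see is that this somewhat simple approach will pick out a legitimately repeated word rather than a repeated phrase. If you want to force it to pick only repeated phrases made up of multiple words, you could use something like ((?:\b|^)[\x20-\x7E]+\s)(\1)+ (note the extra \s).

((?:\b|^)[\x20-\x7E]+\s)(.*(\1))+ comes close to solving your second problem, but I may have thought myself into a corner on that one.

Edit: just to clarify, you would use this in a substitution in Perl, or with the PHP equivalent.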

Answer 4

Good old brute force...

This is so ugly I was inclined to post it as eval(base64_decode(...)), but here it is:

function fixi($str) {
    $a = explode(" ", $str);
    return implode(' ', fix($a));
}

function fix($a) {
    $l = count($a);
    $len = 0;
    for($i=1; $i <= $l/2; $i++) {
        for($j=0; $j <= $l - 2*$i; $j++) {
            $n = 1;
            $found = false;
            while(1) {
                $a1 = array_slice($a, $j, $i);
                $a2 = array_slice($a, $j+$n*$i, $i);
                if ($a1 != $a2)
                    break;
                $found = true;
                $n++;
            }
            if ($found && $n*$i > $len) {
                $len = $n*$i;
                $f_j = $j;
                $f_i = $i;
            }
        }
    }
    if ($len) {
        return array_merge(
            fix(array_slice($a, 0, $f_j)),
            array_slice($a, $f_j, $f_i),
            fix(array_slice($a, $f_j+$len, $l))
        );
    }
    return $a;
}

Punctuation counts as part of the word, so don't expect miracles.

Answer 5

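2 questions 3 questions 4 questions 5 questions

becoming

2 questions

can be solved with:

$string =~ s/(\d+ (.*))( \d+ (\2))+/$1/g;

It matches a number followed by anything (greedily), and then a series of items that each start with a space, followed by a number, followed by something that matches the earlier "anything". All of that is replaced with the first number-and-text pair.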
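For readers working in PHP rather than Perl, the same substitution could be sketched with preg_replace(), which replaces every match by default, so no g flag is needed. The function name collapse_numbered_repeats is hypothetical, and the inner capture group around \2 is dropped because the backreference works without it.

```php
<?php
// Hypothetical helper: keep the first "number text" pair and drop the
// following pairs that repeat the same text with a different number.
function collapse_numbered_repeats(string $text): string
{
    // (\d+ (.*))   $1 is the first pair, $2 is its text (greedy, then backtracks)
    // ( \d+ \2)+   one or more further pairs: space, any number, space, same text
    return preg_replace('/(\d+ (.*))( \d+ \2)+/', '$1', $text);
}

echo collapse_numbered_repeats('2 questions 3 questions 4 questions 5 questions'), "\n";
// 2 questions
```

Because .* is greedy, the engine first tries the longest possible text and backtracks until the repeated pairs fit, which can get slow on long inputs; that is the backtracking cost the length-limited regex in Answer 2 is meant to avoid.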

Answer 6

Code for the first task:

<?php

    function split_repeating($string)
    {
        $words = explode(' ', $string);
        $words_count = count($words);

        $need_remove = array();
        for ($i = 0; $i < $words_count; $i++) {
            $need_remove[$i] = false;
        }

        // Try each repeat length (in words), longest first, and check every position where a repeat could start
        for ($i = intdiv($words_count, 2); $i >= 1; $i--) {
            // Stop early enough that $words[$k + $i] below is always a valid index
            for ($j = 0; $j <= ($words_count - 2 * $i); $j++) {
                $need_remove_item = !$need_remove[$j];
                for ($k = $j; $k < ($j + $i); $k++) {
                    if ($words[$k] != $words[$k + $i]) {
                        $need_remove_item = false;
                        break;
                    }
                }
                if ($need_remove_item) {
                    for ($k = $j; $k < ($j + $i); $k++) {
                        $need_remove[$k] = true;
                    }
                }
            }
        }

        $result_string = '';
        for ($i = 0; $i < $words_count; $i++) {
            if (!$need_remove[$i]) {
                $result_string .= ' ' . $words[$i];
            }
        }
        return trim($result_string);
    }



    $string = 'Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo 'Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.' . '<br>' . '<br>';



    $string = 'Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo 'Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.';

?>

Code for the second task:

<?php

    function split_repeating($string)
    {
        $words = explode(' ', $string);
        $words_count = count($words);

        $need_remove = array();
        for ($i = 0; $i < $words_count; $i++) {
            $need_remove[$i] = false;
        }

        for ($j = 0; $j < ($words_count - 1); $j++) {
            $need_remove_item = !$need_remove[$j];
            for ($k = $j + 1; $k < ($words_count - 1); $k += 2) {
                if ($words[$k] != $words[$k + 2]) {
                    $need_remove_item = false;
                    break;
                }
            }
            if ($need_remove_item) {
                for ($k = $j + 2; $k < $words_count; $k++) {
                    $need_remove[$k] = true;
                }
            }
        }

        $result_string = '';
        for ($i = 0; $i < $words_count; $i++) {
            if (!$need_remove[$i]) {
                $result_string .= ' ' . $words[$i];
            }
        }
        return trim($result_string);
    }



    $string = '2 questions 3 questions 4 questions 5 questions';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo '2 questions';

?>

Answer 7

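Thank you all very much for answering the question. It helped me a lot! I tried both Ridgerunner's and dtanders' regular expressions, and although they worked on some test strings (after some modifications), I ran into trouble with other strings.

So I went for the brute-force approach :) which is inspired by Nox. This way I could combine the two problems and still get good performance (even better than regexp, since that is slow in PHP).

For anyone interested, here is the concept code: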

function split_repeating_num($string) {
$words = explode(' ', $string);
$all_words = $words;
$num_words = count($words);
$max_length = 100; //max length of substring to check
$max_words = 4; //maximum number of words in substring 
$found = array();
$current_pos = 0;
$unset = array();
foreach ($words as $key=>$word) {
    //see if this word exist in the next part of the string
    $len = strlen($word);
    if ($len === 0) continue;
    $current_pos += $len + 1; //+1 for the space
    $substr = substr($string, $current_pos, $max_length);
    if (($pos = strpos(substr($string, $current_pos, $max_length), $word)) !== false) {
        //found it
        //set pointer words and all_words to same value
        while (key($all_words) < $key ) next($all_words);
        while (key($all_words) > $key ) prev($all_words);
        $next_word = next($all_words);

        while (is_numeric($next_word) || $next_word === '') {
            $next_word = next($all_words);
        }
        // see if it follows the word directly
        if ($word === $next_word) {
            $unset [$key] = 1;
        } elseif ($key + 3 < $num_words) {
            for($i = $max_words; $i > 0; $i --) {
                $x = 0;
                $string_a = '';
                $string_b = '';
                while ($x < $i ) {
                    while (is_numeric($next_word) || $next_word === '' ) {
                        $next_word = next($all_words); // each() was removed in PHP 8; next() matches the skip loop above
                    }
                    $x ++;
                    $string_a .= $next_word;
                    $string_b .= $words [key($all_words) + $i];
                }

                if ($string_a === $string_b) {
                    //we have a match
                    for($x = $key; $x < $i + $key; $x ++)
                        $unset [$x] = 1;
                }
            }
        }
    }

}
foreach ($unset as $k=>$v) {
    unset($words [$k]);
}
return implode(' ', $words);

}

There are still some small issues, and I do need to test it more, but it seems to do its job.

Replacing repeated strings within a string
