Question

I want a Perl regular expression that matches duplicated words in a string.

Given the following input:

$str = "Thus joyful Troy Troy maintained the the watch of night..."

I would like the following output:

Thus joyful [Troy Troy] maintained [the the] watch of night...

Answer 1

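This is similar to one of the Learning Perl exercises. The trick is to catch all of the repeated words, so you need a "one or more" quantifier on the duplication: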

 $str = 'This is Goethe the the the their sentence';

 $str =~ s/\b((\w+)(?:\s+\2\b)+)/[$1]/g;

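The features I use here are documented in perlre when they apply to the pattern, and in perlop when they affect how the substitution operator does its work.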
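To see why the "one or more" quantifier matters, here is a small sketch that runs the same sample string through the pattern with and without it. The variable names `$pairs` and `$runs` are only illustrative:

```perl
use strict;
use warnings;

my $str = 'This is Goethe the the the their sentence';

# Without the quantifier: only one repetition is captured per match.
(my $pairs = $str) =~ s/\b((\w+)\s+\2\b)/[$1]/g;

# With (?:...)+ the whole run of repeats lands in a single capture.
(my $runs = $str) =~ s/\b((\w+)(?:\s+\2\b)+)/[$1]/g;

print "$pairs\n";   # This is Goethe [the the] the their sentence
print "$runs\n";    # This is Goethe [the the the] their sentence
```

The trailing \b is what keeps "their" out of the match, and the leading \b keeps the "the" inside "Goethe" from starting one.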

If you like, the /x flag lets you add insignificant whitespace and comments:

 $str =~ s/
      \b
      (
         (\w+)
         (?:
          \s+
          \2
          \b
         )+
      )
     /[$1]/xg;

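I don't like that \2 though, because I hate counting capture-group positions. I can use the relative backreferences introduced in Perl 5.10. \g{-1} refers to the immediately preceding capture group: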

 use 5.010;
 $str =~ s/
      \b
      (
         (\w+)
         (?:
          \s+
          \g{-1}
          \b
         )+
      )
     /[$1]/xg;

Counting still isn't all that great, so I can use named captures instead:

 use 5.010;
 $str =~ s/
      \b
      (
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
      )
     /[$1]/xg;

I can name the first capture ($1) as well and access its value later through %+:

 use 5.010;
 $str =~ s/
      \b
      (?<dups>
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
      )
     /[$+{dups}]/xg;
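As a quick illustration of what ends up in %+, here is a minimal sketch that uses the same named groups in a plain match rather than a substitution. The variables `$word` and `$dups` are just local names chosen for this example:

```perl
use strict;
use warnings;
use 5.010;

my $str = 'This is Goethe the the the their sentence';

my ($word, $dups);
if ($str =~ /\b(?<dups>(?<word>\w+)(?:\s+\k<word>\b)+)/) {
    # %+ maps each capture name to the text that group matched.
    ($word, $dups) = ($+{word}, $+{dups});
    say "word: $word";   # word: the
    say "dups: $dups";   # dups: the the the
}
```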

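I shouldn't really need that first capture, though, since it is only there to refer to everything that matched. Sadly, it looks like ${^MATCH} isn't set early enough for me to use it on the replacement side. I think that's a bug. This should work, but it doesn't: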

 $str =~ s/
      \b
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
     /[${^MATCH}]/pgx;   # DOESN'T WORK

I'm checking this on blead, but it takes a while to compile on my tiny machine.
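If ${^MATCH} is not usable in the replacement on your Perl, one possible alternative for dropping the outer capture is the older $& variable, which holds the text of the last successful match. This is only a sketch, not part of the original answer; note that perlvar documents a program-wide performance cost for $& on Perls older than 5.20:

```perl
use strict;
use warnings;
use 5.010;

my $str = 'This is Goethe the the the their sentence';

# $& is the entire matched text, so no outer capture group is needed.
(my $marked = $str) =~ s/
      \b
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
     /[$&]/gx;

say $marked;   # This is Goethe [the the the] their sentence
```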

Answer 2

This works:

$str =~ s/\b((\w+)\s+\2)\b/[\1]/g;

Answer 3

You can try:

$str = "Thus joyful Troy Troy maintained the the watch of night...";
$str =~s{\b(\w+)\s+\1\b}{[$1 $1]}g;
print "$str"; # prints Thus joyful [Troy Troy] maintained [the the] watch of night...

Regex used: \b(\w+)\s+\1\b

Explanation:

\b: word boundary
\w+: a word
(): remembers the word matched above
\s+: whitespace
\1: the remembered word

It effectively finds two identical whole words separated by whitespace and puts [ ] around them.

EDIT

If you want to preserve the amount of whitespace between the words, you can use:

$str =~s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;
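The difference between the two substitutions only shows up when the duplicates are separated by something other than a single space. A small sketch, using a made-up input with three spaces and a tab (the names `$collapsed` and `$preserved` are illustrative):

```perl
use strict;
use warnings;

my $str = "Thus joyful Troy   Troy maintained the\tthe watch";

# First form: the separator is rewritten as a single space.
(my $collapsed = $str) =~ s{\b(\w+)\s+\1\b}{[$1 $1]}g;

# Second form: $2 carries the original whitespace through unchanged.
(my $preserved = $str) =~ s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;

print "$collapsed\n";   # Thus joyful [Troy Troy] maintained [the the] watch
print "$preserved\n";   # three spaces and the tab are kept inside the brackets
```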

Answer 4

Try the following:

$str =~ s/\b(\S+)\b(\s+\1\b)+/[\1]/g;

How can I highlight consecutive duplicate words with a Perl regular expression?