Question

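I can easily capture repeated words using "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", but this regex doesn't extend to multiple words (nor should it, in its current form). How can I find repeated phrases with a regex?

Here I extract the repeated terms (regardless of case), but the same regex fails to pick up the repeated phrase: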

library(qdapRegex)
rm_default("this is a big Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
rm_default("this is a big is a Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)

I'd like a regex that returns:

"is a big is a Big"

for:

x <- "this is a big is a Big deal"

To cover corner cases, here is a larger test set and the desired output:

y <- c(
    "this is a big is a Big deal",
    "I want want to see",
    "I want, want to see",
    "I want...want to see see how",
    "this is a big is a Big deal for those of, those of you who are.",
    "I like it. It is cool"
)


[[1]]
[1] "is a big is a Big"

[[2]]
[1] "want want"

[[3]]
[1] "want, want"

[[4]]
[1] "want...want" "see see"    

[[5]]
[1] "is a big is a Big" "those of, those of"

[[6]]
[1] NA

My current regex only gets me:

rm_default(y, pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)

## [[1]]
## [1] NA
## 
## [[2]]
## [1] "want want"
## 
## [[3]]
## [1] "want, want"
## 
## [[4]]
## [1] "want...want" "see see"    
## 
## [[5]]
## [1] NA
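
For anyone without qdapRegex installed, the same behaviour can be reproduced in base R. This is only a sketch: it assumes `regmatches()` with `gregexpr(..., perl = TRUE)` is an acceptable stand-in for `rm_default(..., extract = TRUE)`, and the names `sentences`, `word_pattern`, `hits`, and `no_match` are made up for illustration. Sentences with no match come back as `character(0)` rather than `NA`.

```r
# Test sentences copied from the question.
sentences <- c(
  "this is a big is a Big deal",
  "I want want to see",
  "I want, want to see",
  "I want...want to see see how",
  "this is a big is a Big deal for those of, those of you who are.",
  "I like it. It is cool"
)

# The single-word pattern from the question.
word_pattern <- "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b"

# gregexpr() locates every match; regmatches() extracts the matched text.
hits <- regmatches(sentences, gregexpr(word_pattern, sentences, perl = TRUE))

# Flag the sentences that produced no match at all.
no_match <- lengths(hits) == 0

print(hits)
print(sentences[no_match])
```

Because group 1 is `\\w+`, the backreference `\\1` can only ever repeat a single word, which is why the sentences that repeat a multi-word phrase return nothing.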

Answer 1

I think this does what you want (note that we only allow a space, ..., or , as separators, but you should be able to adjust that easily):

pattern <- "(?i)\\b(\\w.*)((?:\\s|\\.{3}|,)+\\1)+\\b"
rm_default(x, pattern = pattern, extract=TRUE)

Produces:

[[1]]
[1] "is a big is a Big"

[[2]]
[1] "want want"

[[3]]
[1] "want, want"

[[4]]
[1] "want...want" "see see"    

[[5]]
[1] "is a big is a Big"  "those of, those of"
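
To see why this pattern works, it may help to read it piece by piece. The sketch below annotates each part and runs it in base R. It assumes base R's PCRE engine (`perl = TRUE`) behaves like `rm_default()` for this pattern; `sentences` and `phrase_hits` are illustrative names, and the sixth sentence from the question is included as well.

```r
# Pattern from Answer 1, read left to right:
#   (?i)                case-insensitive matching (also applies to \1)
#   \\b                 start at a word boundary
#   (\\w.*)             group 1: a word character followed by anything,
#                       so the candidate phrase may span several words
#   (?:\\s|\\.{3}|,)+   one or more separators: whitespace, "...", or ","
#   \\1                 the same text as group 1 again
#   ( ... )+            separator plus repeat may occur one or more times
#   \\b                 end at a word boundary
pattern <- "(?i)\\b(\\w.*)((?:\\s|\\.{3}|,)+\\1)+\\b"

sentences <- c(
  "this is a big is a Big deal",
  "I want want to see",
  "I want, want to see",
  "I want...want to see see how",
  "this is a big is a Big deal for those of, those of you who are.",
  "I like it. It is cool"
)

# Extract every match per sentence; character(0) means no match.
phrase_hits <- regmatches(sentences, gregexpr(pattern, sentences, perl = TRUE))
print(phrase_hits)
```

The key difference from the question's regex is that group 1 may contain spaces. The greedy `.*` first grabs as much text as possible, and the engine backtracks until whatever follows the separator equals group 1. In this run the sixth sentence produces no match, since a lone period is not one of the allowed separators.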

Answer 2

Try this:

> regmatches(x, gregexpr("(?i)\\b(\\S.*\\S)[ ,.]*\\b(\\1)", x, perl = TRUE))
[[1]]
[1] "is a big is a Big"

[[2]]
[1] "want want"

[[3]]
[1] "want, want"

[[4]]
[1] "want...want" "see see"    

[[5]]
[1] "is a big is a Big"  "those of, those of"

Here is a visualization (apart from one error in it: the \S parts should be inside the group):

(?i)\b(\S.*\S)[ ,.]*\b(\1)

[Figure: regular expression visualization (Debuggex Demo)]

You may want to replace [ ,.] with [ [:punct:]]. I didn't do that because Debuggex doesn't support POSIX character classes.
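
The two separator classes can be compared side by side. The following is a sketch, not part of the original answer: the helper `find_repeats()` and the variable names are hypothetical, and the sixth sentence is taken from the question's test set.

```r
# Hypothetical helper: builds the Answer 2 pattern around a separator class.
find_repeats <- function(text, separators) {
  pattern <- paste0("(?i)\\b(\\S.*\\S)", separators, "*\\b(\\1)")
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

sentences <- c(
  "this is a big is a Big deal",
  "I want want to see",
  "I want, want to see",
  "I want...want to see see how",
  "this is a big is a Big deal for those of, those of you who are.",
  "I like it. It is cool"
)

literal_hits <- find_repeats(sentences, "[ ,.]")        # class used in the answer
punct_hits   <- find_repeats(sentences, "[ [:punct:]]") # suggested POSIX variant

# Do the two classes agree on these sentences?
print(identical(literal_hits, punct_hits))

# Inspect the sentence whose repeat straddles a full stop.
print(punct_hits[[6]])
```

On these inputs both classes give the same matches. Both also accept a single period as a separator, so "I like it. It is cool" yields "it. It", whereas the question asks for NA in that case. Requiring three periods, as the `\\.{3}` alternative in Answer 1 does, would be one way to avoid that.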

Regex to capture repeated phrases

2 answers: