Regex to capture repeated phrases

Date: 2015-02-28 20:19:49

Tags: regex r

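I can capture repeated words fairly easily with the following: "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b" but this regex does not seem to extend to multiple words (and there is no reason it should in its current state). How can I find repeated phrases using a regex?

Here I extract the repeated terms (regardless of case), but the same regex does not pick up repeated phrases: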

library(qdapRegex)
rm_default("this is a big Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
rm_default("this is a big is a Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
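
For comparison, the same behaviour can be reproduced in base R without qdapRegex. This is only a sketch, assuming PCRE matching (perl = TRUE); extract_all is a hypothetical helper introduced for illustration. Because (\w+) can capture only a single run of word characters, the backreference \1 never spans a space, so the phrase example finds nothing:

```r
# Single-word pattern from the question, unchanged
word_pat <- "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b"

# Hypothetical helper (not part of qdapRegex): return every match per string
extract_all <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

# Repeated word: the capture group holds "big", and \1 matches "Big" caselessly
single <- extract_all("this is a big Big deal", word_pat)[[1]]

# Repeated phrase: no single word is immediately repeated, so nothing matches
phrase <- extract_all("this is a big is a Big deal", word_pat)[[1]]

print(single)
print(phrase)
```

Note that base R returns an empty character vector where rm_default(..., extract=TRUE) reports NA.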

I would like a regex that returns:

"is a big is a Big"

for:

x <- "this is a big is a Big deal"

To cover corner cases, here is a larger test set and the desired output:

y <- c(
    "this is a big is a Big deal",
    "I want want to see",
    "I want, want to see",
    "I want...want to see see how",
    "this is a big is a Big deal for those of, those of you who are.",
    "I like it. It is cool"
)


[[1]]
[1] "is a big is a Big"

[[2]]
[1] "want want"

[[3]]
[1] "want, want"

[[4]]
[1] "want...want" "see see"    

[[5]]
[1] "is a big is a Big" "those of, those of"

[[6]]
[1] NA

My current regex only gets me:

rm_default(y, pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)

## [[1]]
## [1] NA
## 
## [[2]]
## [1] "want want"
## 
## [[3]]
## [1] "want, want"
## 
## [[4]]
## [1] "want...want" "see see"    
## 
## [[5]]
## [1] NA

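2 answers:

Answer 0 (score: 3)

I think this does what you want (note that we only allow a space, ..., and , as separators, but you should be able to adjust that easily). Here x holds the test vector from the question: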

pattern <- "(?i)\\b(\\w.*)((?:\\s|\\.{3}|,)+\\1)+\\b"
rm_default(x, pattern = pattern, extract=TRUE)

Produces:

[[1]]
[1] "is a big is a Big"

[[2]]
[1] "want want"

[[3]]
[1] "want, want"

[[4]]
[1] "want...want" "see see"    

[[5]]
[1] "is a big is a Big"  "those of, those of"
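
One way to make the separators adjustable, as the note above suggests, is to build the pattern from a separator alternation. The following is a minimal sketch in base R; make_phrase_pattern is a hypothetical helper, and the added ";" separator and the sample string are assumptions chosen only for illustration:

```r
# Hypothetical helper: assemble the phrase pattern around a separator alternation.
# The default reproduces the pattern given in this answer.
make_phrase_pattern <- function(sep = "\\s|\\.{3}|,") {
  paste0("(?i)\\b(\\w.*)((?:", sep, ")+\\1)+\\b")
}

default_pat <- make_phrase_pattern()

# Illustrative assumption: also accept ";" between the repeated phrases
semi_pat <- make_phrase_pattern("\\s|\\.{3}|,|;")

sample_text <- "those of; those of you"
with_default <- regmatches(sample_text, gregexpr(default_pat, sample_text, perl = TRUE))[[1]]
with_semi <- regmatches(sample_text, gregexpr(semi_pat, sample_text, perl = TRUE))[[1]]

print(with_default)
print(with_semi)
```

With the default separators the ";" blocks the match; once ";" is added to the alternation, the repeated phrase is picked up.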

Answer 1 (score: 1)

Try this:

> regmatches(x, gregexpr("(?i)\\b(\\S.*\\S)[ ,.]*\\b(\\1)", x, perl = TRUE))
[[1]]
[1] "is a big is a Big"

[[2]]
[1] "want want"

[[3]]
[1] "want, want"

[[4]]
[1] "want...want" "see see"    

[[5]]
[1] "is a big is a Big"  "those of, those of"

Here is a visualization (apart from one bug in the visualization: the \S parts should be inside the group).

(?i)\b(\S.*\S)[ ,.]*\b(\1)

[Image: regular expression visualization, linked to a Debuggex Demo]

You may want to replace [ ,.] with [ [:punct:]]. I did not do that here because Debuggex does not support POSIX character classes.
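
The suggested substitution could look like the sketch below, assuming PCRE matching (perl = TRUE), which accepts POSIX classes inside a bracket expression. The variable names are illustrative, and only three of the question's test strings are used:

```r
# Three of the test strings from the question
tests <- c(
  "this is a big is a Big deal",
  "I want, want to see",
  "I like it. It is cool"
)

# Same pattern as above, with [ ,.] replaced by [ [:punct:]]
punct_pat <- "(?i)\\b(\\S.*\\S)[ [:punct:]]*\\b(\\1)"

res <- regmatches(tests, gregexpr(punct_pat, tests, perl = TRUE))
print(res)
```

Two caveats seem worth keeping in mind. Both [ ,.] and [ [:punct:]] contain the period, so the last string yields "it. It" instead of the NA that the question asks for. Also, (\S.*\S) needs at least two characters, so a repeated one-letter word would not be matched by this pattern.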