I can easily catch repeated words with the following:
"(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b"
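But this regex does not extend to multiple words (and in its current state there is no reason it should). How can I find repeated phrases with a regex?
Here I extract the repeated terms (regardless of case), but the same regex does not pick up repeated phrases: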
library(qdapRegex)
rm_default("this is a big Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
rm_default("this is a big is a Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
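For comparison, the same behaviour can be reproduced in base R, which may make the limitation easier to see: the backreference \1 can only repeat whatever (\w+) captured, and \w+ never spans whitespace. This is only a minimal sketch; word_pattern and find_matches are hypothetical names, not part of qdapRegex, and regmatches returns character(0) where rm_default returns NA.

```r
# Single-word pattern from the question.
word_pattern <- "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b"

# Hypothetical helper: all matches per input string, using base R only.
find_matches <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

single <- find_matches("this is a big Big deal", word_pattern)
phrase <- find_matches("this is a big is a Big deal", word_pattern)

print(single)  # the repeated word "big Big" is found
print(phrase)  # empty: "is a big" contains spaces, which \w+ cannot capture
```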
I would like a regex that returns:
"is a big is a Big"
given:
x <- "this is a big is a Big deal"
To cover corner cases, here is a larger test set and the desired output...
y <- c(
    "this is a big is a Big deal",
    "I want want to see",
    "I want, want to see",
    "I want...want to see see how",
    "this is a big is a Big deal for those of, those of you who are.",
    "I like it. It is cool"
)
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
[[6]]
[1] NA
My current regex only gets me:
rm_default(y, pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
## [[1]]
## [1] NA
##
## [[2]]
## [1] "want want"
##
## [[3]]
## [1] "want, want"
##
## [[4]]
## [1] "want...want" "see see"
##
## [[5]]
## [1] NA
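Answer 0 (score: 3)
I think this does what you want (note that only whitespace, "...", and "," are accepted as separators, but you should be able to adjust that easily):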
pattern <- "(?i)\\b(\\w.*)((?:\\s|\\.{3}|,)+\\1)+\\b"
rm_default(x, pattern = pattern, extract=TRUE)
Produces:
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
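One way to make the separators adjustable, as suggested above, is to assemble the pattern from a vector of separator regexes. This is only a sketch: make_phrase_pattern and find_repeats are hypothetical helpers, and the semicolon is an extra separator chosen purely for illustration. Base R's regmatches is used instead of rm_default, so a string without repeats yields character(0) rather than NA.

```r
# Hypothetical helper: build the phrase pattern from a set of separator regexes.
# The default separators reproduce the pattern given in this answer.
make_phrase_pattern <- function(seps = c("\\s", "\\.{3}", ",")) {
  sep_alt <- paste(seps, collapse = "|")
  paste0("(?i)\\b(\\w.*)((?:", sep_alt, ")+\\1)+\\b")
}

# Hypothetical helper: extract every repeated phrase from each string.
find_repeats <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

default_pattern <- make_phrase_pattern()
semi_pattern <- make_phrase_pattern(c("\\s", "\\.{3}", ",", ";"))

s <- "I want; want to see"
default_hits <- find_repeats(s, default_pattern)  # ";" is not a separator here
semi_hits <- find_repeats(s, semi_pattern)        # ";" now counts as a separator

print(default_hits)
print(semi_hits)
```

Keeping the separators in a vector means punctuation can be added or removed without editing the rest of the expression.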
Answer 1 (score: 1)
Try this:
> regmatches(x, gregexpr("(?i)\\b(\\S.*\\S)[ ,.]*\\b(\\1)", x, perl = TRUE))
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
Here is a visualization (apart from a bug in the visualization: the \S parts should be inside the group):
(?i)\b(\S.*\S)[ ,.]*\b(\1)
You may want to replace [ ,.] with [ [:punct:]]. I did not do that because debuggex does not support POSIX character classes.
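The effect of that substitution can be sketched as follows. The variable names and the test string with a dash are illustrative assumptions, not part of the original test set; perl = TRUE is kept so that the backreference and (?i) behave as in the call above.

```r
# Pattern from this answer, and the variant with the POSIX punctuation class.
narrow <- "(?i)\\b(\\S.*\\S)[ ,.]*\\b(\\1)"
wide   <- "(?i)\\b(\\S.*\\S)[ [:punct:]]*\\b(\\1)"

# Illustrative input: the repeated word is separated by a dash.
s <- "I want - want to see"

narrow_hits <- regmatches(s, gregexpr(narrow, s, perl = TRUE))
wide_hits   <- regmatches(s, gregexpr(wide, s, perl = TRUE))

print(narrow_hits)  # empty: "-" is not in [ ,.]
print(wide_hits)    # the dash is covered by [:punct:]
```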