我的目标是仅在字符串集的括号中删除重复的单词。
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
我想得到的就像这样
a
[1]'I (have|has) certain (words|word|worded) certain'
[2]'(You|Youre) (can|cans) do this (works|worked)'
[3]'I (am|are) pretty (sure|surely) you know (what|when) (you|her) should (do|)'
为了得到结果,我使用了像这样的代码
a = gsub('\\|', " | ", a)
a = gsub('\\(', "( ", a)
a = gsub('\\)', " )", a)
a = vapply(strsplit(a, " "), function(x) paste(unique(x), collapse = " "), character(1L))
然而,它导致了不良的输出。
a
[1] "I ( have | has ) certain words word worded"
[2] "( You | Youre ) can cans do this works worked"
[3] "I ( am | are ) sure surely you know what when her should do"
为什么我的代码会删除位于字符串后半部分的括号? 我应该怎样做我想要的结果?
答案 0 :(得分:5)
我们可以使用gsubfn
。这里的想法是通过匹配开括号来选择括号内的字符(\\(
必须转义括号,因为它是元字符)后跟一个或多个不是右括号的字符({{1 }},将其作为括号内的一组捕获。在替换中,我们将字符组([^)]+
)与x
,strsplit
unlist
输出分开,获取list
元素和{{1}它在一起
unique
答案 1 :(得分:2)
得到上面的答案。这更简单,但您也可以尝试:
library(stringi)
library(stringr)
a_new <- gsub("[|]","-",a) # replace this | due to some issus during the replacement later
a1 <- str_extract_all(a_new,"[(](.*?)[)]") # extract the "units"
# some magic using stringi::stri_extract_all_words()
a2 <- unlist(lapply(a1,function(x) unlist(lapply(stri_extract_all_words(x), function(y) paste(unique(y),collapse = "|")))))
# prepare replacement
names(a2) <- unlist(a1)
# replacement and finalization
str_replace_all(a_new, a2)
[1] "I (have|has) certain (words|word|worded) certain"
[2] "(You|Youre) (can|cans) do this (works|worked)"
[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
这个想法是将括号内的单词作为单位提取出来。然后删除重复项并用更新后的旧单元替换。
答案 2 :(得分:1)
更长但更精心的尝试
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# blank output
new_a <- c()
for (sentence in 1:length(a)) {
split <- trim(unlist(strsplit(a[sentence],"[( )]")))
newsentence <- c()
for (i in split) {
j1 <- as.character(unique(trim(unlist(strsplit(gsub('\\|'," ",i)," ")))))
if( length(j1)==0) {
next
} else {
ifelse(length(j1)>1,
newsentence <- c(newsentence,paste("(",paste(j1,collapse="|"),")",sep="")),
newsentence <- c(newsentence,j1[1]))
}
}
newsentence <- paste(newsentence,collapse=" ")
print(newsentence)
new_a <- c(new_a,newsentence)}
# [1] "I (have|has) certain (words|word|worded) certain"
# [2] "(You|Youre) (can|cans) do this (works|worked)"
# [3] "I (am|are) (sure|surely) you know (what|when) (you|her) should do"