s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."
> unique(strsplit(s, ",")[[1]])
[1] "height (female)" " weight" " BRCA1" " height (female)" " body mass index" " height (e.g. by kilos)" " body mass index."
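I have a string with the following structure: <string>, <string>, <string>, ..., <string>.
Every <string> except the last is followed by a comma; the last one is followed by a period. I want to use a regular expression to remove the duplicated strings. Each string takes one of three forms: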
words followed by a parenthetical (...), e.g. height (female) or height (e.g. by kilos)
a single word, e.g. weight or BRCA1
several words, e.g. body mass index
The output I want is:
"height (female), weight, BRCA1, body mass index, height (e.g. by kilos)."
Simply running strsplit on the comma does not handle two special cases: the space before the second occurrence of height (female), and the period after the final body mass index.
Answer 0 (score: 3)
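@thelatemail's comment points you in the right direction. Use unlist(strsplit(x = <input string>, split = <regex pattern>)) to split on the commas and spaces and on the trailing period. unique removes the duplicates, and paste(<character vector>, collapse = ", ") puts everything back together. Don't forget the unlist: strsplit returns a list, so without it unique looks for distinct elements of that list rather than distinct elements of a character vector.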
# input
s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."
# code
paste(unique(unlist(strsplit(s, ",\\s+|\\.$"))), collapse = ", ")
# [1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)"
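To see why the unlist step matters, here is a small sketch on a shortened, made-up input (s_small is illustrative and not from the question). strsplit returns a list with one element per input string, so unique applied directly to that list compares whole vectors instead of individual terms.

```r
# Shortened illustrative input (hypothetical, not from the question)
s_small <- "weight, BRCA1, weight."

# Same split pattern as above: comma plus whitespace, or the final period
parts <- strsplit(s_small, ",\\s+|\\.$")

is.list(parts)          # TRUE: one list element per input string
length(unique(parts))   # 1: the single element is kept whole, duplicates included

# Flatten to a character vector first, then de-duplicate the terms
terms <- unique(unlist(parts))
terms
# [1] "weight" "BRCA1"
```

Indexing with [[1]], as the question does, should work equally well for a single input string; unlist is just the more general way to flatten.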
Answer 1 (score: 2)
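As long as the input never contains a comma that has to be escaped, and the format is known (for example, the string ends with a period that should be stripped), this can be done in a few simple steps: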
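library(stringr)

# s is the input string defined in the question
s_unique <- s %>%
  str_remove("\\.$") %>%               # drop the trailing period
  str_split(",", simplify = TRUE) %>%  # split on commas (1-row character matrix)
  str_trim() %>%                       # trim whitespace around each term
  unique()                             # keep the first occurrence of each term
paste0(s_unique, collapse = ", ")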
Output:
[1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)"
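Note that both answers drop the trailing period, which the desired output in the question keeps. A minimal base R sketch of a wrapper that puts it back might look like the following. The name dedupe_terms is made up for illustration, and the function assumes that no term contains a comma and that the input ends in a single period.

```r
# Hypothetical helper (not from either answer): de-duplicate a
# comma-separated list of terms and restore the final period.
dedupe_terms <- function(x) {
  x <- sub("\\.$", "", x)                       # strip the trailing period
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]  # split on literal commas
  parts <- trimws(parts)                        # drop surrounding whitespace
  paste0(paste(unique(parts), collapse = ", "), ".")
}

s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."
result <- dedupe_terms(s)
result
# [1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)."
```

Only the final period is removed, so the periods inside "e.g." are left alone.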