Removing duplicate strings with a regular expression

Asked: 2018-06-13 00:16:12

Tags: r regex

s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index." 

> unique(strsplit(s, ",")[[1]])
[1] "height (female)"         " weight"                 " BRCA1"                  " height (female)"        " body mass index"        " height (e.g. by kilos)"      " body mass index." 

I have a string with the following structure: <string>, <string>, <string>, ..., <string>.

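The <string> items are separated by commas, except for the last one, which is followed by a period. I want to remove the duplicate strings using a regular expression. Each string can take one of the following three forms: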

  1. A word followed by (...), e.g. height (female), height (e.g. by kilos)
  2. A single word, e.g. weight, BRCA1
  3. Multiple words separated by spaces, e.g. body mass index

My desired output is:

    "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)."
    

Simply running strsplit on the comma does not handle two special cases: the space before the second occurrence of height (female), and the period after the last body mass index.

2 Answers:

Answer 0: (score: 3)

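@thelatemail's comment pointed you in the right direction. Use unlist(strsplit(x = <input string>, split = <regex pattern>)) to split on the commas and the whitespace that follows them, as well as on the trailing period. unique removes the duplicates, and paste(<character vector>, collapse = ", ") puts everything back together. Don't forget unlist; otherwise unique looks for distinct elements of the list rather than of the character vector.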

# input
s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index." 

# code
paste(unique(unlist(strsplit(s, ",\\s+|\\.$"))), collapse = ", ")
# [1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)"
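
Note that the result above has no trailing period, while the desired output in the question ends with one. A minimal sketch of one way to put it back, assuming every input ends with exactly one period (the variable names parts and deduped are only illustrative and are not from the original answer):

```r
s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."

# Split on "comma + whitespace" or on the final period, then flatten the list
parts <- unlist(strsplit(s, ",\\s+|\\.$"))

# Drop duplicates (first occurrence wins), rejoin, and re-append the period
deduped <- paste0(paste(unique(parts), collapse = ", "), ".")
deduped
# [1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)."
```

unique keeps the first occurrence of each item, so the original order is preserved.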

Answer 1: (score: 2)

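As long as you don't have to escape any commas in the input, and the format is known (e.g. the string ends with a period that should be stripped), this can be done in a few simple steps: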

library(stringr)
s_unique = s %>%
    str_remove("\\.$") %>%
    str_split(",", simplify = TRUE) %>%
    str_trim() %>%  # Trim whitespace
    unique()

paste0(s_unique, collapse = ", ")

Output:

[1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)"
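
For readers who prefer not to load stringr, the same steps can probably be written in base R. This is a sketch under the same assumptions as the answer (no commas inside items, a single trailing period); the helper name dedupe_items is hypothetical and is not part of the original answer:

```r
# Hypothetical helper: base R equivalent of the stringr pipeline above
dedupe_items <- function(x) {
  x <- sub("\\.$", "", x)                       # strip the trailing period
  items <- strsplit(x, ",", fixed = TRUE)[[1]]  # split on literal commas
  items <- trimws(items)                        # trim surrounding whitespace
  paste(unique(items), collapse = ", ")         # de-duplicate and rejoin
}

s <- "height (female), weight, BRCA1, height (female), BRCA1, weight, body mass index, body mass index, weight, weight, height (e.g. by kilos), body mass index."
result <- dedupe_items(s)
result
# [1] "height (female), weight, BRCA1, body mass index, height (e.g. by kilos)"
```

Trimming after the split, rather than matching the whitespace in the split pattern, also tolerates items separated by a bare comma with no space.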