Correctly eliminating duplicates in overlapping strings in R?

Asked: 2012-10-24 17:15:48

Tags: regex r edit

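I would like to ask a follow-up question to this issue, because another problem has come up: I found subjects that belong to more than one category (Cultural Studies, for example, falls under both Arts & Humanities and Social Sciences), so overlaps have to be taken into account.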

I have a long list of categories, such as this machine-readable example:

AB <- c("Science","Arts & Humanities","Arts & Humanities; Social Sciences","Science","Arts & Humanities; Arts & Humanities; Social Sciences","Science","Science; Social Sciences","Social Sciences; Science")  

So it looks like this:

> AB  
[1] "Science"                                               "Arts & Humanities"  
[3] "Arts & Humanities; Social Sciences"                    "Science"  
[5] "Arts & Humanities; Arts & Humanities; Social Sciences" "Science"  
[7] "Science; Social Sciences"                              "Social Sciences; Science"  

I would like to edit these terms and eliminate the duplicates to get this result:

[1] "Science"                                    "Arts & Humanities"  
[3] "Arts & Humanities; Social Sciences"         "Science"  
[5] "Arts & Humanities; Social Sciences"         "Science"  
[7] "Science; Social Sciences"                   "Science; Social Sciences"  

So I am looking for another loop that eliminates the duplicate in #5. I tried combining strsplit() with unique(), but this does not work:

> unique(strsplit(AB, "; *"))  
[[1]]  
[1] "Science"  

[[2]]  
[1] "Arts & Humanities"  

[[3]]  
[1] "Arts & Humanities" "Social Sciences"  

[[4]]  
[1] "Arts & Humanities" "Arts & Humanities" "Social Sciences"  

[[5]]  
[1] "Social Sciences" "Science"  

So I would like to ask once more: how can I get the correct output shown above? Thank you very much in advance for your consideration!

1 answer:

Answer 0: (score: 2)

I think this has to do with trailing or leading whitespace. If you apply this to AB, it will take care of the problem for you:

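fun <- function(text.var){
    # Split the string into its individual categories
    x <- unlist(strsplit(text.var, ";"))
    # Strip leading and trailing whitespace from each category
    Trim <- function(x) gsub("^\\s+|\\s+$", "", x)
    # Drop duplicates, sort, and rejoin with "; "
    paste(sort(unique(Trim(x))), collapse="; ")
}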

sapply(AB, fun, USE.NAMES = FALSE)

Yields:

> sapply(AB, fun, USE.NAMES = FALSE)
[1] "Science"                            "Arts & Humanities"                 
[3] "Arts & Humanities; Social Sciences" "Science"                           
[5] "Arts & Humanities; Social Sciences" "Science"                           
[7] "Science; Social Sciences"           "Science; Social Sciences"
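
One more thing that may explain why the attempt in the question did not work: unique() called on the list returned by strsplit() compares whole list elements, so it only drops vectors that are identical to an earlier one; it does not remove repeated terms inside each vector. Below is a minimal sketch of a per-element variant. It assumes R 3.2.0 or later for trimws(), and dedupe_terms is a hypothetical helper name, not something from the code above.

```r
AB <- c("Science", "Arts & Humanities", "Arts & Humanities; Social Sciences",
        "Science", "Arts & Humanities; Arts & Humanities; Social Sciences",
        "Science", "Science; Social Sciences", "Social Sciences; Science")

# Split every string on ";" together with any whitespace around it.
# The result is a list with one character vector per input string.
parts <- strsplit(AB, "\\s*;\\s*")

# dedupe_terms is a hypothetical helper: it works on ONE vector of terms,
# so unique() removes repeats within that vector only.
dedupe_terms <- function(terms) {
  terms <- trimws(terms)                      # drop stray outer whitespace
  paste(sort(unique(terms)), collapse = "; ") # dedupe, sort, rejoin
}

# vapply() applies the helper to each list element and guarantees
# a character vector with exactly one string per input.
result <- vapply(parts, dedupe_terms, character(1))
print(result)
```

Like fun above, this sorts the terms within each string, which is what turns "Social Sciences; Science" into "Science; Social Sciences". If the original order of the terms matters, dropping sort() keeps the first occurrence of each term in place.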