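I would like to ask a follow-up to this issue, because another problem has come up: I found subjects that belong to more than one category (Cultural Studies, for example, falls under both Arts & Humanities and Social Sciences), so overlaps have to be taken into account.
I have a long list of categories, such as this machine-readable example: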
AB <- c("Science","Arts & Humanities","Arts & Humanities; Social Sciences","Science","Arts & Humanities; Arts & Humanities; Social Sciences","Science","Science; Social Sciences","Social Sciences; Science")
So it looks like this:
> AB
[1] "Science" "Arts & Humanities"
[3] "Arts & Humanities; Social Sciences" "Science"
[5] "Arts & Humanities; Arts & Humanities; Social Sciences" "Science"
[7] "Science; Social Sciences" "Social Sciences; Science"
I would like to edit these terms and remove the duplicates to get this result:
[1] "Science" "Arts & Humanities"
[3] "Arts & Humanities; Social Sciences" "Science"
[5] "Arts & Humanities; Social Sciences" "Science"
[7] "Science; Social Sciences" "Science; Social Sciences"
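So I am looking for another loop that removes the duplicate in #5 (and, as in #8, puts the categories in a consistent order). I tried using strsplit() and unique(), but that does not work: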
> unique(strsplit(AB, "; *"))
[[1]]
[1] "Science"
[[2]]
[1] "Arts & Humanities"
[[3]]
[1] "Arts & Humanities" "Social Sciences"
[[4]]
[1] "Arts & Humanities" "Arts & Humanities" "Social Sciences"
[[5]]
[1] "Social Sciences" "Science"
So I would like to ask once more: how can I get the correct output shown above? Thank you very much in advance!
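Answer 0 (score: 2)
I think this has to do with trailing or leading white space. If you apply the following to AB, it should solve the problem: it splits each string on ";", trims the white space around every piece, removes duplicates within that string, sorts the remaining categories, and pastes them back together: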
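fun <- function(text.var) {
  # Split one string into its individual categories
  x <- unlist(strsplit(text.var, ";"))
  # Helper: strip leading and trailing white space
  Trim <- function(x) gsub("^\\s+|\\s+$", "", x)
  # Trim, drop duplicates, sort alphabetically, and rejoin with "; "
  paste(sort(unique(Trim(x))), collapse = "; ")
}
# Apply to every element of AB; USE.NAMES = FALSE keeps the result unnamed
sapply(AB, fun, USE.NAMES = FALSE)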
Yields:
> sapply(AB, fun, USE.NAMES = FALSE)
[1] "Science" "Arts & Humanities"
[3] "Arts & Humanities; Social Sciences" "Science"
[5] "Arts & Humanities; Social Sciences" "Science"
[7] "Science; Social Sciences" "Science; Social Sciences"
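As a side note on why the attempt in the question falls short: unique() called on the list returned by strsplit() compares whole list elements with each other, so it drops repeated entries such as the second "Science" but never looks inside an element. The deduplication has to happen within each element instead. Below is a minimal sketch of that idea, assuming base R 3.2.0 or later for trimws(); the helper name dedupe_categories is hypothetical and not part of the original answer.

```r
AB <- c("Science", "Arts & Humanities", "Arts & Humanities; Social Sciences",
        "Science", "Arts & Humanities; Arts & Humanities; Social Sciences",
        "Science", "Science; Social Sciences", "Social Sciences; Science")

# Split every string into its categories; "; *" also consumes spaces after ";"
parts <- strsplit(AB, "; *")

# Hypothetical helper: trim, deduplicate, and sort the categories of ONE
# element, then paste them back into a single "; "-separated string
dedupe_categories <- function(p) {
  paste(sort(unique(trimws(p))), collapse = "; ")
}

# vapply() enforces that each element yields exactly one character string
result <- vapply(parts, dedupe_categories, character(1))
result
```

vapply() is used here in place of sapply() only because it checks the type and length of each result; the sorting step is what turns "Social Sciences; Science" into "Science; Social Sciences", so both orderings collapse to the same label.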