Removing duplicate text with the same words in R

Time: 2017-10-26 02:40:39

Tags: r

My sample data frame:

query <- c("women dress","dress women","dresses women","black women jean","women jeans black")
SearchVolume <- c(1000,1000,400,900,900)
PredictiveImpression <- c(900,900,200,700,700)
Lem <- c("women,dress","dress,women","dress women","black,women,jean","women,jean,black")

data <- data.frame(query,SearchVolume,PredictiveImpression,Lem)

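I need to remove queries that (1) contain the same words, even in a different order or in singular/plural form, and (2) have the same search volume and predictive impression. In the end, "women dress", "dresses women", and "black women jean" should remain.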

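I used lemmatization in R to extract the root words, but I can't figure out how to deduplicate queries that contain the same words in a different order. This is what I have so far.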

[image: current result]

My expected result: [image]

1 answer:

Answer 0: (score: 2)

We can split "Lem" into a list of vectors, sort each vector, apply duplicated, and subset the rows:

data[!duplicated(lapply(strsplit(as.character(data$Lem), ','), sort)),]
#           query SearchVolume PredictiveImpression              Lem
#1      women dress         1000                  900      women,dress
#3    dresses women          400                  200      dress women
#4 black women jean          900                  700 black,women,jean
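
Note that the one-liner compares only Lem, and row 3 survives because its Lem value is written with a space rather than a comma, so splitting on ',' leaves it as a single string. The following is a minimal sketch of a variant, assuming the lemmas may be separated by commas or whitespace and that a duplicate must also match on SearchVolume and PredictiveImpression, as the question asks. The names lem_key, key_df and deduped are introduced here for illustration only.

```r
# Rebuild the sample data from the question; keep Lem as character, not factor.
data <- data.frame(
  query = c("women dress", "dress women", "dresses women",
            "black women jean", "women jeans black"),
  SearchVolume = c(1000, 1000, 400, 900, 900),
  PredictiveImpression = c(900, 900, 200, 700, 700),
  Lem = c("women,dress", "dress,women", "dress women",
          "black,women,jean", "women,jean,black"),
  stringsAsFactors = FALSE
)

# Split on commas or whitespace (assumption: both may occur as separators),
# sort the words, and paste them back into one canonical key per row.
words <- strsplit(data$Lem, "[,[:space:]]+")
lem_key <- vapply(words, function(w) paste(sort(w), collapse = ","), character(1))

# A row counts as a duplicate only if its key, search volume and
# predictive impression all match an earlier row.
key_df <- data.frame(lem_key, data$SearchVolume, data$PredictiveImpression)
deduped <- data[!duplicated(key_df), ]
deduped$query
```

With this key, "dresses women" is kept because its search volume and predictive impression differ from the other "dress"/"women" rows, not because of how its Lem string happens to be delimited.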