My sample data frame:
query <- c("women dress","dress women","dresses women","black women jean","women jeans black")
SearchVolume <- c(1000,1000,400,900,900)
PredictiveImpression <- c(900,900,200,700,700)
Lem <- c("women,dress","dress,women","dress women","black,women,jean","women,jean,black")
data <- data.frame(query,SearchVolume,PredictiveImpression,Lem)
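I need to remove queries that have (1) the same words, even when they appear in a different order or in singular/plural form, and (2) the same search volume and predictive impression. In the end, "women dress", "dresses women", and "black women jean" should remain.
I used lemmatization in R to extract the root words, but I can't figure out how to deduplicate queries that contain the same words in a different order. This is what I have so far.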
Answer 0 (score: 2)
We can split "Lem" into a list of vectors with strsplit, sort each vector, apply duplicated, and use the result to subset the data:
data[!duplicated(lapply(strsplit(as.character(data$Lem), ','), sort)),]
#             query SearchVolume PredictiveImpression              Lem
#1      women dress         1000                  900      women,dress
#3    dresses women          400                  200      dress women
#4 black women jean          900                  700 black,women,jean
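The one-liner above keys on "Lem" alone, so it never compares SearchVolume or PredictiveImpression. If two rows had the same lemmas but different volumes, one of them would be dropped. A minimal sketch of a variant that checks both conditions from the question follows. The names `key`, `dup`, and `result` are illustrative, and splitting on commas or whitespace is an assumption made so that the third "Lem" value (shown as "dress women" in the printed output) normalizes the same way as the comma-separated ones.

```r
# Rebuild the sample data; the third Lem value is taken from the printed output.
data <- data.frame(
  query = c("women dress", "dress women", "dresses women",
            "black women jean", "women jeans black"),
  SearchVolume = c(1000, 1000, 400, 900, 900),
  PredictiveImpression = c(900, 900, 200, 700, 700),
  Lem = c("women,dress", "dress,women", "dress women",
          "black,women,jean", "women,jean,black"),
  stringsAsFactors = FALSE
)

# Order-independent key: split each Lem on commas or whitespace (assumption),
# sort the words, and paste them back into a single string.
key <- vapply(
  strsplit(data$Lem, "[,[:space:]]+"),
  function(words) paste(sort(words), collapse = ","),
  character(1)
)

# A row is a duplicate only if the key AND both numeric columns
# match an earlier row.
dup <- duplicated(data.frame(key, data$SearchVolume, data$PredictiveImpression))

result <- data[!dup, ]
print(result)
```

Here rows 1 and 3 share the same sorted key, and row 3 survives only because its volumes differ, which is what condition (2) in the question asks for.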