I have a data frame containing two- and three-word strings. I want to filter out certain strings that share the same pattern.
df <- data.frame(word = c("thin film", "film resistor", "thin film resistor",
"protection material", "protection material removed",
"protection layer", "interconnect metal"))
> df
word
1 thin film
2 film resistor
3 thin film resistor
4 protection material
5 protection material removed
6 protection layer
7 interconnect metal
I want to filter out the strings whose pattern is repeated inside another, longer string.
So this is what I want:
words
1 thin film resistor
2 protection material removed
3 protection layer
4 interconnect metal
Answer 0 (score: 4)
Assuming the word column is of class character.
There must be a more optimal way, but this works:
data.frame(words=names(which(colSums(sapply(df[,1],grepl,df[,1]))==1)))
words
1 thin film resistor
2 protection material removed
3 protection layer
4 interconnect metal
Hope this helps.
You can also do this:
df$word[colSums(sapply(df[,1],grepl,df[,1]))==1]
[1] "thin film resistor" "protection material removed" "protection layer"
[4] "interconnect metal"
Or:
df$word[colSums(outer(df$word, df$word, stringr::str_detect)) == 1]
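To see why the `== 1` test works, it may help to look at the intermediate matrix. The sketch below is an illustration rather than part of the answer above: it assumes the strings are held in a character vector, and it adds `fixed = TRUE` (an assumption on my part, so that each string is matched literally instead of being interpreted as a regular expression). The names `words`, `hits`, and `keep` are made up for the example.

```r
words <- c("thin film", "film resistor", "thin film resistor",
           "protection material", "protection material removed",
           "protection layer", "interconnect metal")

# hits[i, j] is TRUE when words[j] occurs inside words[i]
hits <- sapply(words, function(p) grepl(p, words, fixed = TRUE))

# A column sum of 1 means the string was found only in itself,
# i.e. it is not a substring of any other entry
keep <- words[colSums(hits) == 1]
keep
```

Note that exact duplicates would match each other, so with this test both copies would be dropped.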
Answer 1 (score: 0)
When creating the data.frame, set stringsAsFactors = FALSE.
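As a minimal sketch of that setting (the data frame name `df2` is hypothetical): before R 4.0.0, `data.frame()` converted strings to factors by default, and `strsplit()` needs a character vector, so the column is created as character explicitly here.

```r
df2 <- data.frame(word = c("thin film", "film resistor", "thin film resistor"),
                  stringsAsFactors = FALSE)

# The column stays character, so strsplit() can be applied to it directly
is.character(df2$word)
strsplit(df2$word, split = " ")[[1]]
```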
Try this:
library(stringr)  # provides str_c()

# Split every entry into its individual words
lst <- strsplit(df$word, split = " ")
output <- sapply(seq_along(lst), function(t, dict) {
  # Collect every other entry that contains all the words of entry t
  temp <- lapply(dict[-t], function(u, v) {
    if (all(v %in% u)) str_c(u, collapse = " ") else NULL
  }, dict[[t]])
  temp <- unlist(temp)  # drops the NULLs (entries that did not match)
  if (length(temp) == 0) {
    # No superstring found: keep the entry itself
    str_c(dict[[t]], collapse = " ")
  } else {
    # Otherwise keep the longest superstring
    temp[which.max(nchar(temp))]
  }
}, lst)
unique(output)
#[1] "thin film resistor" "protection material removed" "protection layer" "interconnect metal"
Not the most optimized, but it should do the job.