Question

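I have a data frame of two- and three-word strings. I want to filter out the strings that repeat a pattern already present in another string.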

df <- data.frame(word = c("thin film", "film resistor", "thin film resistor", 
                          "protection material", "protection material removed",
                          "protection layer", "interconnect metal"))
> df
  word
1 thin film
2 film resistor
3 thin film resistor
4 protection material
5 protection material removed
6 protection layer
7 interconnect metal

I want to drop every string that is contained in another, longer string.

So this is what I want:

  words
1 thin film resistor
2 protection material removed
3 protection layer
4 interconnect metal

Answer 1

Assuming the word column is of class character:

There must be a more optimal way, but this works:

  data.frame(words=names(which(colSums(sapply(df[,1],grepl,df[,1]))==1)))                       
              words
 1          thin film resistor
 2 protection material removed
 3            protection layer
 4          interconnect metal
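To see why the one-liner works, it may help to unpack it into steps. This is only a sketch: it assumes word is a character column (hence stringsAsFactors = FALSE), and the intermediate names words, m and hits are made up for illustration.

```r
df <- data.frame(word = c("thin film", "film resistor", "thin film resistor",
                          "protection material", "protection material removed",
                          "protection layer", "interconnect metal"),
                 stringsAsFactors = FALSE)

words <- df$word

# m[i, j] is TRUE when words[j], used as a pattern, is found in words[i]
m <- sapply(words, grepl, words)

# Number of strings in which each pattern occurs. Every string matches itself,
# so a count of 1 means that no other string contains it.
hits <- colSums(m)

result <- data.frame(words = names(hits)[hits == 1])
print(result)
```

A string such as "thin film" gets a count of 2 (itself plus "thin film resistor"), so it is dropped.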

Hope this helps.

You can also do:

 df$word[colSums(sapply(df[,1],grepl,df[,1]))==1]
 [1] "thin film resistor"          "protection material removed" "protection layer"           
 [4] "interconnect metal"

Or:

 df$word[colSums(outer(df$word, df$word, stringr::str_detect)) == 1]
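One caveat with these variants: grepl and str_detect treat each string as a regular expression, so entries containing characters such as "." or "(" could match more than intended. A minimal sketch of a literal-matching variant, using the fixed = TRUE argument of base R's grepl; the sample vector is made up for illustration.

```r
# Hypothetical sample containing a regex metacharacter (".")
words <- c("a.b", "axb c", "thin film", "thin film resistor")

# Regex matching: the pattern "a.b" also matches "axb c", so "a.b" is dropped
regex_keep <- words[colSums(sapply(words, grepl, words)) == 1]

# Literal matching: "a.b" only matches itself, so it is kept
fixed_keep <- words[colSums(sapply(words, grepl, words, fixed = TRUE)) == 1]

print(regex_keep)
print(fixed_keep)
```

For the data in the question both versions give the same result, since none of the strings contain metacharacters.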

Answer 2

When creating the data.frame, set stringsAsFactors=FALSE.
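For example, the data frame from the question could be built like this so that word is a character vector rather than a factor (from R 4.0.0 onward this is already the default):

```r
df <- data.frame(word = c("thin film", "film resistor", "thin film resistor",
                          "protection material", "protection material removed",
                          "protection layer", "interconnect metal"),
                 stringsAsFactors = FALSE)

# strsplit() requires a character vector, so check the column type
str(df$word)
```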

Try this:

library(stringr)  # provides str_c()

# Split each string into its individual words
lst = strsplit(df$word, split = " ")

output = sapply(seq_along(lst),
   function(t, dict){
       # For every other entry u: if u contains all the words of entry t,
       # return u as a single string, otherwise NULL
       temp = lapply(dict[-t], function(u, v){
                  if(all(v %in% u)){
                      str_c(u, collapse = " ")
                  }else{
                      NULL
                  }}, dict[[t]])
       temp = unlist(temp)  # drops the NULL entries
       if(length(temp) == 0){
           # No other entry contains entry t, so keep it as is
           str_c(dict[[t]], collapse = " ")
       }else{
           # Otherwise replace it with the longest entry that contains it
           temp[which.max(nchar(temp))]
       }
   }, lst)

# Entries that were replaced by the same longer string collapse to one
unique(output)

#[1] "thin film resistor"          "protection material removed" "protection layer"            "interconnect metal"

Not the most optimized, but it should do the job.
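The same word-set idea can be sketched more compactly in base R. This is an illustrative variant, not a drop-in replacement: the names tokens and is_contained are made up, and, like the code above, it compares sets of words, so word order is ignored (a hypothetical "film thin" would count as contained in "thin film resistor"). Exact duplicates would all be dropped, since each copy is contained in the other.

```r
words <- c("thin film", "film resistor", "thin film resistor",
           "protection material", "protection material removed",
           "protection layer", "interconnect metal")

tokens <- strsplit(words, " ", fixed = TRUE)

# TRUE when every word of entry i also occurs in at least one other entry
is_contained <- vapply(seq_along(tokens), function(i) {
  any(vapply(tokens[-i],
             function(other) all(tokens[[i]] %in% other),
             logical(1)))
}, logical(1))

kept <- words[!is_contained]
print(kept)
```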
