Filter out strings matching a specific pattern

Time: 2017-09-08 04:14:34

Tags: r regex filter

I have a data frame containing two-word and three-word phrases. I want to filter out certain strings that share the same pattern.

df <- data.frame(word = c("thin film", "film resistor", "thin film resistor", 
                          "protection material", "protection material removed",
                          "protection layer", "interconnect metal"))
> df
  word
1 thin film
2 film resistor
3 thin film resistor
4 protection material
5 protection material removed
6 protection layer
7 interconnect metal

I want to filter out every string whose pattern is repeated inside another, longer string.

So this is the result I want:

  words
1 thin film resistor
2 protection material removed
3 protection layer
4 interconnect metal

2 answers:

Answer 0: (score: 4)

Assuming the word column is of class character:

There is probably a more optimal way, but this works:

data.frame(words = names(which(colSums(sapply(df[,1], grepl, df[,1])) == 1)))
                        words
1          thin film resistor
2 protection material removed
3            protection layer
4          interconnect metal
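
To see why this works, it can help to look at the intermediate steps. The sketch below rebuilds the example data with `stringsAsFactors = FALSE` so the column is character, then names each step. The names `contains`, `n_containing` and `kept` are illustrative and not part of the original answer.

```r
# Rebuild the example data as a character column
df <- data.frame(word = c("thin film", "film resistor", "thin film resistor",
                          "protection material", "protection material removed",
                          "protection layer", "interconnect metal"),
                 stringsAsFactors = FALSE)

# contains[i, j] is TRUE when string i contains string j (used as a regex pattern)
contains <- sapply(df$word, grepl, df$word)

# Number of strings that contain each pattern. Every string contains itself,
# so a count of exactly 1 means no other string contains it.
n_containing <- colSums(contains)

kept <- names(which(n_containing == 1))
kept
```

Here "thin film" gets a count of 2 (itself plus "thin film resistor"), so it is dropped, while "protection layer" gets a count of 1 and is kept.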

Hope this helps.

You can also do this:

df$word[colSums(sapply(df[,1], grepl, df[,1])) == 1]
[1] "thin film resistor"          "protection material removed" "protection layer"           
[4] "interconnect metal"

# The same idea with outer() and stringr::str_detect
df$word[colSums(outer(df$word, df$word, stringr::str_detect)) == 1]
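
One caveat: `grepl` and `str_detect` treat each string as a regular expression, so strings containing characters such as `.` or `(` may match the wrong rows or fail to match themselves. A minimal sketch of a literal-matching variant, assuming plain substring matching is what is wanted. The vector `x` is made up for illustration.

```r
# Made-up strings containing regex metacharacters
x <- c("a.c", "abc", "film (thin)", "film (thin) layer")

# fixed = TRUE makes grepl() treat each pattern as a literal string
contains <- sapply(x, grepl, x, fixed = TRUE)

# Keep the strings that only they themselves contain
kept <- x[colSums(contains) == 1]
kept
# [1] "a.c"               "abc"               "film (thin) layer"
```

Without `fixed = TRUE`, the pattern "a.c" would also match "abc", and "film (thin) layer" would not match itself because the parentheses are read as a regex group.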

Answer 1: (score: 0)

When creating the data.frame, set stringsAsFactors=FALSE.
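
A short sketch of why this matters, using two throwaway data frames (`df_factor` and `df_char` are illustrative names). Since R 4.0.0 the default is already `stringsAsFactors = FALSE`, so the argument mainly matters on older versions.

```r
# The same column created as a factor and as character
df_factor <- data.frame(word = c("thin film", "film resistor"),
                        stringsAsFactors = TRUE)
df_char   <- data.frame(word = c("thin film", "film resistor"),
                        stringsAsFactors = FALSE)

class(df_factor$word)  # "factor"
class(df_char$word)    # "character"

# strsplit() only accepts character input, so a factor column
# has to be converted first
words_from_factor <- strsplit(as.character(df_factor$word), split = " ")
words_from_char   <- strsplit(df_char$word, split = " ")
```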

Try this:

library(stringr)  # provides str_c()

# Split each phrase into its individual words
lst <- strsplit(df$word, split = " ")

output <- sapply(seq_along(lst), function(t, dict) {
  # Collect every other phrase that contains all the words of phrase t
  supers <- unlist(lapply(dict[-t], function(u, v) {
    if (all(v %in% u)) str_c(u, collapse = " ") else NULL
  }, dict[[t]]))
  if (length(supers) == 0) {
    # No other phrase contains phrase t, so keep it unchanged
    str_c(dict[[t]], collapse = " ")
  } else {
    # Otherwise replace phrase t with the longest phrase that contains it
    supers[which.max(nchar(supers))]
  }
}, lst)

unique(output)

#[1] "thin film resistor"          "protection material removed" "protection layer"            "interconnect metal" 

Not the most optimized, but it should do the job.