Keep only the words in a data frame that are found in a vector (R)

Time: 2015-03-06 01:51:45

Tags: r

I need to remove all non-English words from a data frame that looks like this:

ID     text
1      they all went to the store bonkobuns and bought chicken
2      if we believe no exomunch standards are in order then we're ok
3      living among the calipodians seems reasonable  
4      given the state of all relimited editions we should be fine

I would like to end up with a data frame like this:

 ID     text
 1      they all went to the store and bought chicken
 2      if we believe no standards are in order then we're ok
 3      living among the seems reasonable  
 4      given the state of all editions we should be fine

I have a vector containing all English words: word_vec

I can use the tm package to remove all of the words in a vector from the data frame:
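library(tm)
# Remove every word in word_vec from the text column, one row at a time
for (k in seq_len(nrow(frame))) {
    for (i in seq_along(word_vec)) {
        frame$text[k] <- removeWords(frame$text[k], word_vec[i])
    }
}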

But I want to do the opposite: I want to keep only the words that are found in the vector.

3 Answers:

Answer 0: (score: 3)

Here's a simple way to do it:

txt <- "Hi this is an example"
words <- c("this", "is", "an", "example")
paste(intersect(strsplit(txt, "\\s")[[1]], words), collapse=" ")
[1] "this is an example"

Of course, the devil is in the details, so you may need to tweak this a bit to account for apostrophes and other punctuation.
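One detail worth knowing is that intersect() returns each word only once, so a repeated word such as "the" would be dropped after its first occurrence. Below is a minimal sketch of a variant that filters with %in% instead and ignores punctuation during the lookup. The helper name keep_known_words and the sample frame are made up for illustration; only the text column and word_vec come from the question.

```r
# Hypothetical helper: keep only the tokens of each string that appear in `words`.
keep_known_words <- function(text, words) {
  vapply(strsplit(text, "\\s+"), function(tokens) {
    # Normalise each token for the lookup only: lower-case it and drop
    # everything except letters and apostrophes (so "store," matches "store").
    key <- tolower(gsub("[^[:alpha:]']", "", tokens))
    paste(tokens[key %in% words], collapse = " ")
  }, character(1))
}

# Illustrative data (assumed, shortened from the question)
word_vec <- c("they", "all", "went", "to", "the", "store", "and",
              "bought", "chicken")
frame <- data.frame(
  ID = 1:2,
  text = c("they all went to the store bonkobuns and bought chicken",
           "the store, the chicken and the zzyzx"),
  stringsAsFactors = FALSE
)

frame$text <- keep_known_words(frame$text, word_vec)
```

Because the original tokens are pasted back together, punctuation attached to a kept word survives, and repeated words stay in place.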

Answer 1: (score: 2)

You could try gsub:

 word_vec <- paste(c('bonkobuns ', 'exomunch ', 'calipodians ', 
          'relimited '), collapse="|")
 gsub(word_vec, '', df1$text)
 #[1] "they all went to the store and bought chicken"        
 #[2] "if we believe no standards are in order then we're ok"
 #[3] "living among the seems reasonable"                    
 #[4] "given the state of all editions we should be fine" 

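Suppose instead that you already have a word_vec that is the opposite of the vector above, i.e. one that holds the words you want to keep, for example: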

  word_vec <- c("among", "editions", "bought", "seems", "fine", 
  "state", "in", 
  "then", "reasonable", "ok", "standards", "store", "order", "should", 
  "and", "be", "to", "they", "are", "no", "living", "all", "if", 
  "we're", "went", "of", "given", "the", "chicken", "believe", 
  "we")


  word_vec2 <-  paste(gsub('^ +| +$', '', gsub(paste(word_vec, 
        collapse="|"), '', df1$text)), collapse= ' |')
  gsub(word_vec2, '', df1$text)
  #[1] "they all went to the store and bought chicken"        
  #[2] "if we believe no standards are in order then we're ok"
  #[3] "living among the seems reasonable"                    
  #[4] "given the state of all  editions we should be fine"  
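The patterns above are plain alternations, so they match substrings as well as whole words (removing "in" would also hit the "in" inside "bin", for instance). A possible way around that, sketched here under the assumptions that there is at least one unwanted word and that none of the words contain regex metacharacters, is to derive the unwanted words by tokenising and to wrap the alternation in \\b word boundaries. The names texts, keep, unwanted and cleaned are illustrative only.

```r
# Illustrative data (assumed, shortened from the question)
texts <- c("they all went to the store bonkobuns and bought chicken",
           "living among the calipodians seems reasonable")
keep <- c("they", "all", "went", "to", "the", "store", "and", "bought",
          "chicken", "living", "among", "seems", "reasonable")

# Every token that is not in `keep` is treated as unwanted.
tokens <- unique(unlist(strsplit(texts, "\\s+")))
unwanted <- setdiff(tokens, keep)

# The \\b anchors make each alternative match whole words only; the trailing
# \\s* also swallows the whitespace that followed the removed word.
pattern <- paste0("\\b(", paste(unwanted, collapse = "|"), ")\\b\\s*")
cleaned <- trimws(gsub(pattern, "", texts, perl = TRUE))
```

If unwanted could ever be empty, the gsub call should be skipped, because an empty group would match at every word boundary.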

Answer 2: (score: 0)

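All I can think of is the following procedure:

  1. For each row in the vector, strsplit() it on whitespace into a vector
  2. For each element of the new vector, use regexpr() to check it against your word_vec
  3. If the value at a particular position comes back as -1 (see the regexpr examples), remove that position
  4. Join the string back together and store it in a new vector

If you go down this route, it may also be worth thinking about the which() function: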

        which(c('a','b','c','d','e') == 'd')
    [1] 4
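
The procedure above could be sketched roughly as follows. This is only one reading of it: it swaps regexpr() for an exact comparison via which(), since regexpr() would also report substring matches, and the sample text and word_vec values are assumed for illustration.

```r
# Illustrative data (assumed, taken from row 3 of the question)
word_vec <- c("living", "among", "the", "seems", "reasonable")
text <- c("living among the calipodians seems reasonable")

result <- character(length(text))           # new vector for the cleaned rows
for (r in seq_along(text)) {
  tokens <- strsplit(text[r], " ")[[1]]     # split the row on spaces
  keep <- logical(length(tokens))
  for (j in seq_along(tokens)) {
    # which() returns an empty vector when the token is not in word_vec
    keep[j] <- length(which(word_vec == tokens[j])) > 0
  }
  result[r] <- paste(tokens[keep], collapse = " ")  # rejoin and store
}
```

The explicit loops mirror the steps one by one; in practice the inner loop can be replaced by tokens %in% word_vec.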