I need to remove all non-English words from a data frame like this:
ID text
1 they all went to the store bonkobuns and bought chicken
2 if we believe no exomunch standards are in order then we're ok
3 living among the calipodians seems reasonable
4 given the state of all relimited editions we should be fine
I would like to end up with a data frame like this:
ID text
1 they all went to the store and bought chicken
2 if we believe no standards are in order then we're ok
3 living among the seems reasonable
4 given the state of all editions we should be fine
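I have a vector containing all English words: word_vec
Using the tm package, I can remove every word in that vector from the data frame:
library(tm)
for (k in 1:nrow(frame)) {
  for (i in 1:length(word_vec)) {
    # remove the i-th word from the text in row k
    frame$text[k] <- removeWords(frame$text[k], word_vec[i])
  }
}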
But I want to do the opposite: I want to keep only the words that are found in the vector.
Answer 0 (score: 3)
Here is a simple approach:
txt <- "Hi this is an example"
words <- c("this", "is", "an", "example")
paste(intersect(strsplit(txt, "\\s")[[1]], words), collapse=" ")
[1] "this is an example"
Of course, the devil is in the details, so you may need to tweak this a bit to account for apostrophes and other punctuation.
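One detail worth keeping in mind is that intersect() returns each matching word only once, so a sentence with repeated words would be shortened. A minimal sketch of a variant that filters with %in% instead, applied to a whole column, might look like this. The helper name keep_known_words, the vocab vector, and the small df1 below are made up for illustration and are not part of the original answer:

```r
# Hypothetical helper: keep only the tokens of each string that appear in `vocab`.
keep_known_words <- function(text, vocab) {
  vapply(strsplit(text, "\\s+"), function(tokens) {
    # Compare a lowercased copy with leading/trailing punctuation stripped,
    # so that "Store," would still match "store"; inner apostrophes are kept.
    bare <- tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens))
    paste(tokens[bare %in% vocab], collapse = " ")
  }, character(1))
}

# Illustrative data, cut down from the question
df1 <- data.frame(
  ID = 1:2,
  text = c("they all went to the store bonkobuns and bought chicken",
           "living among the calipodians seems reasonable"),
  stringsAsFactors = FALSE
)
vocab <- c("they", "all", "went", "to", "the", "store", "and", "bought",
           "chicken", "living", "among", "seems", "reasonable")

df1$text <- keep_known_words(df1$text, vocab)
```

Because %in% tests every token separately, word order and repeated words are preserved.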
Answer 1 (score: 2)
You could try gsub:
word_vec <- paste(c('bonkobuns ', 'exomunch ', 'calipodians ',
'relimited '), collapse="|")
gsub(word_vec, '', df1$text)
#[1] "they all went to the store and bought chicken"
#[2] "if we believe no standards are in order then we're ok"
#[3] "living among the seems reasonable"
#[4] "given the state of all editions we should be fine"
Suppose you already have a word_vec that is the opposite of the one above, i.e. it holds the words you want to keep, for example:
word_vec <- c("among", "editions", "bought", "seems", "fine",
"state", "in",
"then", "reasonable", "ok", "standards", "store", "order", "should",
"and", "be", "to", "they", "are", "no", "living", "all", "if",
"we're", "went", "of", "given", "the", "chicken", "believe",
"we")
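# Strip the known words from each row; what remains is the unknown word.
# This assumes exactly one unknown word per row. The trailing space added to
# each alternative keeps the removal from leaving a double space behind.
word_vec2 <- paste0(gsub('^ +| +$', '', gsub(paste(word_vec,
    collapse="|"), '', df1$text)), ' ', collapse='|')
gsub(word_vec2, '', df1$text)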
#[1] "they all went to the store and bought chicken"
#[2] "if we believe no standards are in order then we're ok"
#[3] "living among the seems reasonable"
#[4] "given the state of all editions we should be fine"
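The pattern above matches substrings rather than whole words and relies on each row containing a single unknown word. A hedged alternative is to collect the unknown tokens with setdiff() and remove them with a word-boundary regex. This is only a sketch: the vectors texts and known are illustrative, and it assumes the tokens contain no regex metacharacters:

```r
texts <- c("they all went to the store bonkobuns and bought chicken",
           "given the state of all relimited editions we should be fine")
known <- c("they", "all", "went", "to", "the", "store", "and", "bought",
           "chicken", "given", "state", "of", "editions", "we", "should",
           "be", "fine")

# Every distinct token that is not in the known vocabulary
unknown <- setdiff(unlist(strsplit(texts, "\\s+")), known)

cleaned <- texts
if (length(unknown) > 0) {
  # One alternation anchored at word boundaries, plus any whitespace that follows
  pattern <- paste0("\\b(", paste(unknown, collapse = "|"), ")\\b\\s*")
  cleaned <- trimws(gsub(pattern, "", texts, perl = TRUE))
}
cleaned
```

The length check matters: with no unknown words the alternation would be empty and the pattern would strip whitespace after every word.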
Answer 2 (score: 0)
All I can think of are the following functions:
strsplit()
regexpr()
If you go down that route, it may also be worth looking at the which() function:
which(c('a','b','c','d','e') == 'd')
[1] 4
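As a rough sketch of how these pieces could fit together, which() can be combined with %in% to find the positions of the tokens to keep. The tokens and known vectors here are illustrative only:

```r
tokens <- strsplit("living among the calipodians seems reasonable", "\\s+")[[1]]
known  <- c("living", "among", "the", "seems", "reasonable")

# Positions of the tokens that appear in `known`
hits <- which(tokens %in% known)

# Rebuild the sentence from the kept positions only
kept <- paste(tokens[hits], collapse = " ")
kept
```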