Removing a large number of strings from a list in R

Time: 2014-06-19 06:10:23

Tags: r text-mining gsub tm

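I have a character list of sentences, a little over 10,000 rows. I want to remove more than 1,000 words from it, so I have a character vector containing the words to delete. The approach I am using is: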

c<-gsub(pattern = wordsToBeDeleted,replacement = "",x = mainList)

This only uses the first word. How can I make it remove all of them?

2 answers:

Answer 0: (score: 1)

gsub currently takes only a single pattern, but you can combine it with Reduce:

#sample data
sentences<-c(
    "Morbi in tempus metus, quis commodo eros",
    "Cum sociis natoque penatibus et magnis dis parturient montes",
    "Nulla diam quam, imperdiet vitae blandit eu",
    "Nullam nec pellentesque sapien, ac mollis mauris")

words<-c("quis","eros","diam","nec")

Now we iterate over all the words, removing each one from the sentences:

# a is the sentences processed so far, b is the next word to strip out
Reduce(function(a, b) gsub(b, "", a, fixed = TRUE), words, sentences)

which gives us

[1] "Morbi in tempus metus,  commodo "                            
[2] "Cum sociis natoque penatibus et magnis dis parturient montes"
[3] "Nulla  quam, imperdiet vitae blandit eu"                     
[4] "Nullam  pellentesque sapien, ac mollis mauris" 
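
One caveat: with fixed = TRUE the match is on substrings, so "nec" would also be cut out of a longer word that contains it. If whole-word removal is what is wanted, a minimal sketch is to collapse the words into a single alternation pattern wrapped in \\b word boundaries. This is an assumption about the intent, not part of the original answer; the names pattern and cleaned and the third sample sentence are made up for illustration, and the words are assumed to contain no regex metacharacters.

```r
sentences <- c(
    "Morbi in tempus metus, quis commodo eros",
    "Nulla diam quam, imperdiet vitae blandit eu",
    "Donec necessitatibus nec sapien")

words <- c("quis", "eros", "diam", "nec")

# Build one pattern of the form \b(quis|eros|diam|nec)\b
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# A single vectorised gsub call removes every listed word, whole words only
cleaned <- gsub(pattern, "", sentences, perl = TRUE)

# Collapse the double spaces left behind and trim the ends
cleaned <- trimws(gsub(" +", " ", cleaned))
cleaned
```

Here "Donec" and "necessitatibus" are left alone because "nec" is not a standalone word inside them. With a very long word list the pattern gets large, so it may be worth timing this against the Reduce approach on the real data.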

Answer 1: (score: 0)

How about trying this recipe:

sentences = tolower(c("I don't like you.", "But I do like this."))
dropWords = tolower(c("I", "like"))

splitSentences = strsplit(sentences, " ")
purged = lapply(X=splitSentences, FUN=setdiff, y=dropWords)

purged
[[1]]
[1] "don't" "you." 

[[2]]
[1] "but"   "do"    "this."

I would also recommend using tolower there, since it takes care of differences in case.
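
Note that this recipe returns a list of word vectors rather than sentences, and setdiff also drops repeated words within a sentence. If the sentences are needed back as strings, one possible follow-up is to filter with %in% and paste the pieces together. This is only a sketch; keepWords and rebuilt are names introduced here for illustration.

```r
sentences <- tolower(c("I don't like you.", "But I do like this."))
dropWords <- tolower(c("I", "like"))

splitSentences <- strsplit(sentences, " ", fixed = TRUE)

# Keep every token that is not in dropWords; unlike setdiff,
# this preserves duplicate words within a sentence
keepWords <- lapply(splitSentences, function(w) w[!w %in% dropWords])

# Glue each token vector back into a single string
rebuilt <- vapply(keepWords, paste, character(1), collapse = " ")
rebuilt
```

Because the split is on spaces only, punctuation stays attached to the tokens, so a drop word followed directly by a comma or full stop would not be matched.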