Text mining with tm: removing words that are not in a list

Time: 2015-01-21 14:12:57

Tags: r tm

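I am doing text mining with the tm package and removeWords(). I have a list of several thousand relevant words. Can I use removeWords() with inverted logic, so that it removes from the corpus the words that are NOT in the list?

In Perl, I could do something like this: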

$diminishedText = (fullText =! s/$wordlist//g); #not tested

In R, this removes the words that are in the word list:

text <- tm_map(text, removeWords, wordList)

What is the correct syntax for doing something like this?

text <- tm_map(text, removeWords, not in wordList)

3 Answers:

Answer 0: (score: 1)

This feels very clunky, but it may work. There is a different possibility at the end.

library(tm)
library(qdap); library(gtools)
library(stringr)

docs <- c("cat", "dog", "mouse", "oil", "crude", "tanker") # starting documents

EDIT: I came across this approach: tdm.keep <- Text.tdm[rownames(Text.tdm) %in% keepWords, ]

keepWords <- c("oil", "crude", "tanker") # choose the words to keep from the starting documents
keeppattern <- paste0("\\b(", paste(keepWords, collapse = "|"), ")\\b") # regex matching the keepWords as whole words only
Text <- unlist(str_extract_all(string = docs, pattern = keeppattern)) # extract only the keepWords, as a vector
Text.tdm <- TermDocumentMatrix(VCorpus(VectorSource(Text))) # create the tdm based on keepWords only

Here is another possibility, but I did not work it through: R remove stopwords from a character vector using %in%

EDIT: Another approach:

tdm.keep <- Text.tdm[rownames(Text.tdm)%in%keepWords, ]

'%nin%' <- Negate('%in%') # assign to an operator the opposite of %in%
Text <- tm_map(crude, removeWords(crude %nin% keepWords)) 
# Error because removeWords can't take a logical argument
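
Since removeWords() only accepts the words to drop, one way around the error above is a small transformation that keeps the listed tokens instead. The following is only a minimal sketch in base R: keepOnlyWords is a hypothetical helper (not part of tm), and it assumes whitespace-separated tokens, case-insensitive matching, and text already stripped of punctuation.

```r
# Hypothetical helper (not part of tm): keep only the tokens found in keepWords.
# Assumes tokens are separated by whitespace and that matching ignores case.
keepOnlyWords <- function(x, keepWords) {
  vapply(x, function(doc) {
    tokens <- unlist(strsplit(doc, "\\s+"))
    tokens <- tokens[nzchar(tokens)]  # drop empty strings from leading whitespace
    paste(tokens[tolower(tokens) %in% tolower(keepWords)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

docs <- c("crude oil prices rose", "the cat chased the mouse", "oil tanker")
keepWords <- c("oil", "crude", "tanker")
kept <- keepOnlyWords(docs, keepWords)
print(kept)

# If tm is installed, the same helper can be applied to a corpus by wrapping
# it in content_transformer(); extra arguments to tm_map() are passed through.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(docs))
  corpus <- tm::tm_map(corpus, tm::content_transformer(keepOnlyWords), keepWords)
}
```

Documents that contain none of the keepWords come back as empty strings rather than being dropped, so the corpus keeps its original length.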

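Answer 1: (score: 1)

The text analysis package quanteda has both positive (keep) and negative (remove) feature selection. Here is an example where we want to keep a set of economy-related words, from the US presidential inaugural address corpus: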

require(quanteda)
dfm(inaugTexts[50:57], keptFeatures = c("tax*", "econom*", "mone*"), verbose = FALSE)
# Document-feature matrix of: 8 documents, 5 features.
# 8 x 5 sparse Matrix of class "dfmSparse"
#               features
# docs           economic taxes tax economy money
#   1985-Reagan         4     2   4       5     1
#   1989-Bush           0     0   0       0     1
#   1993-Clinton        0     0   0       3     0
#   1997-Clinton        0     0   0       2     0
#   2001-Bush           0     1   0       2     0
#   2005-Bush           1     0   0       0     0
#   2009-Obama          0     0   0       3     0
#   2013-Obama          2     0   1       1     0

The matching here uses the default "glob" format, but fixed and regular-expression matching are also available for feature selection. See ?dfm and ?selectFeatures.
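
The keptFeatures argument belongs to the quanteda version current when this answer was written. In later releases the same kind of selection is, to my knowledge, done with dfm_select(). A rough sketch under that assumption, using two made-up texts instead of the inaugural corpus and guarded so it only runs when quanteda is installed:

```r
# Sketch assuming a recent quanteda release that provides tokens(),
# dfm() and dfm_select(). The texts below are invented for illustration.
if (requireNamespace("quanteda", quietly = TRUE)) {
  library(quanteda)
  txts <- c(d1 = "Taxes and the economy matter", d2 = "Money for the cat")
  dfmat <- dfm(tokens(txts))  # features are lowercased by default
  kept <- dfm_select(dfmat, pattern = c("tax*", "econom*", "mone*"),
                     selection = "keep", valuetype = "glob")
  print(featnames(kept))
}
```

Switching selection to "remove" gives the negative selection, and valuetype can be set to "fixed" or "regex" for the other matching styles.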

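Answer 2: (score: 0)

Maybe you can brute-force it.

Download a dictionary and remove the words in wordList from it.

Then try passing that reduced dictionary to removeWords() in tm_map().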
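
The brute-force idea could look roughly like this. It is only a sketch: vocabulary is a small stand-in for a downloaded dictionary, and the sample sentence is invented. The words to remove are simply the set difference between the dictionary and wordList.

```r
# Assumption: `vocabulary` stands in for a downloaded dictionary.
vocabulary <- c("cat", "dog", "mouse", "oil", "crude", "tanker")
wordList <- c("oil", "crude", "tanker")

# Everything in the dictionary that is NOT in wordList gets removed.
wordsToRemove <- setdiff(vocabulary, wordList)
print(wordsToRemove)

# If tm is installed, pass the complement to removeWords() as usual.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource("crude oil and a cat on a tanker"))
  corpus <- tm::tm_map(corpus, tm::removeWords, wordsToRemove)
}
```

Words missing from the dictionary (here "and", "a", "on") survive, so the result is only as good as the dictionary's coverage. With a very long removal list, the single regular expression that removeWords() builds may also become too large, in which case removing the words in chunks may be necessary.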