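I am using removeWords() from the tm package for text mining. I have a list of a few thousand relevant words. Can I use removeWords() with the logic inverted, so that it removes from the corpus every word that is NOT in the list?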
In Perl, I could do something like this:
(my $diminishedText = $fullText) =~ s/$wordlist//g; # not tested
In R, this removes the words that are in the word list:
text <- tm_map(text, removeWords, wordList)
What is the correct syntax for doing something like this?
text <- tm_map(text, removeWords, not in wordList)
Answer 0 (score: 1)
This feels very clumsy, but it may work. There is a different possibility at the end.
library(tm)
library(qdap); library(gtools)
library(stringr)
docs <- c("cat", "dog", "mouse", "oil", "crude", "tanker") # starting documents
keepWords <- c("oil", "crude", "tanker") # choose the words to keep from the starting documents
keeppattern <- paste0(keepWords, collapse = "|") # create a regex pattern of the keepWords
Text <- unlist(str_extract_all(string = docs, pattern = keeppattern)) # extract only the keepWords, as a vector
Text.tdm <- TermDocumentMatrix(Corpus(VectorSource(Text))) # wrap the vector in a corpus, then create the tdm based on keepWords only
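One caveat with the pattern above: a plain alternation such as oil|crude|tanker also matches inside longer words (for example, the "oil" in "boiling"). The following is a minimal base R sketch, assuming whole-word matching is what is wanted; it wraps the alternation in \\b word boundaries. The sample documents are made up for illustration.

```r
# Made-up sample documents; "boiling" contains "oil" as a substring.
docs <- c("boiling crude oil", "the tanker sank")
keepWords <- c("oil", "crude", "tanker")

# Whole-word pattern: \\b(oil|crude|tanker)\\b
pattern <- paste0("\\b(", paste(keepWords, collapse = "|"), ")\\b")

# Extract every whole-word match per document, then glue them back together.
matches <- regmatches(docs, gregexpr(pattern, docs, perl = TRUE))
kept <- vapply(matches, paste, character(1), collapse = " ")
print(kept)
```

Without the boundaries, the first document would also yield the "oil" inside "boiling".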
Here is another possibility, but I could not get it to work. See "R remove stopwords from a character vector using %in%".
'%nin%' <- Negate('%in%') # assign to an operator the opposite of %in%
Text <- tm_map(crude, removeWords(crude %nin% keepWords))
# Error because removeWords can't take a logical argument
EDIT: I came across another approach:
tdm.keep <- Text.tdm[rownames(Text.tdm) %in% keepWords, ] # keep only the tdm rows whose term is in keepWords
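What the %nin% attempt seems to be reaching for is an inverse of removeWords(). A rough sketch of one way to get that, assuming a simple split on non-alphanumeric characters is an acceptable tokenizer: keepOnlyWords() below is a hypothetical helper, not a tm function, and the sample documents are made up. Matching is case-sensitive as written.

```r
# Hypothetical helper (not part of tm): keep only whitelisted words
# in each string, i.e. the inverse of removeWords().
keepOnlyWords <- function(x, words) {
  tokens <- strsplit(x, "[^[:alnum:]']+")
  vapply(tokens,
         function(tok) paste(tok[tok %in% words], collapse = " "),
         character(1))
}

docs <- c("crude oil prices rose", "the cat sat on the tanker")
keepWords <- c("oil", "crude", "tanker")

kept <- keepOnlyWords(docs, keepWords)
print(kept)

# With tm loaded, the same helper could be applied to a corpus through
# content_transformer(), which forwards extra arguments to the function:
# corpus <- VCorpus(VectorSource(docs))
# corpus <- tm_map(corpus, content_transformer(keepOnlyWords), keepWords)
```

Because the lookup uses %in% rather than one large regular expression, it should stay practical for a keep list of a few thousand words.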
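Answer 1 (score: 1)
The text analysis package quanteda has both positive (keep) and negative (remove) feature selection. Here is an example, using the US presidential inaugural address corpus, where we want to keep a set of economic words: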
require(quanteda)
dfm(inaugTexts[50:57], keptFeatures = c("tax*", "econom*", "mone*"), verbose = FALSE)
# Document-feature matrix of: 8 documents, 5 features.
# 8 x 5 sparse Matrix of class "dfmSparse"
#               features
# docs           economic taxes tax economy money
#   1985-Reagan         4     2   4       5     1
#   1989-Bush           0     0   0       0     1
#   1993-Clinton        0     0   0       3     0
#   1997-Clinton        0     0   0       2     0
#   2001-Bush           0     1   0       2     0
#   2005-Bush           1     0   0       0     0
#   2009-Obama          0     0   0       3     0
#   2013-Obama          2     0   1       1     0
The matching here uses the default "glob" format, but fixed and regular-expression matching are also available for feature selection. See ?dfm and ?selectFeatures.
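To make the "glob" matching concrete, here is a small base R sketch of how patterns such as "tax*" select features. It uses utils::glob2rx() to turn each glob into a regular expression; this is only an illustration of the matching rule, not quanteda's own implementation, and the feature vector is made up (the last two entries are there to show non-matches).

```r
# Made-up feature names; "freedom" and "nation" should not be selected.
features <- c("economic", "taxes", "tax", "economy", "money", "freedom", "nation")
globs <- c("tax*", "econom*", "mone*")

# Convert each glob to a regex and join them into one alternation.
rx <- paste(glob2rx(globs), collapse = "|")

# Positive (keep) selection: features matching any glob.
kept <- features[grepl(rx, features)]
# Negative (remove) selection: everything else.
removed <- features[!grepl(rx, features)]
print(kept)
print(removed)
```

Later quanteda releases expose this kind of selection through functions such as dfm_select() and tokens_select(), so the exact call shown in this answer may differ depending on the installed version.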
Answer 2 (score: 0)
Maybe you can brute-force it.
Download some dictionary and remove the words in wordList from it.
Then try passing that dictionary to tm_map().
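A variation on this idea, sketched below under the assumption that the corpus's own vocabulary can stand in for the downloaded dictionary: take every distinct word in the documents, subtract the keep list with setdiff(), and hand the remainder to removeWords(). The sample documents and the simple tokenizer are illustrative only; the tm step runs only if tm is installed.

```r
docs <- c("crude oil prices rose", "the cat sat on the tanker")
keepWords <- c("oil", "crude", "tanker")

# "Dictionary" = every distinct lower-cased token in the documents.
vocab <- unique(unlist(strsplit(tolower(docs), "[^[:alnum:]']+")))
vocab <- vocab[nzchar(vocab)]

# Words to delete = dictionary minus the keep list.
dropWords <- setdiff(vocab, keepWords)
print(dropWords)

# Pass the complement to removeWords() via tm_map(), if tm is available.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(docs))
  corpus <- tm::tm_map(corpus, tm::removeWords, dropWords)
  print(vapply(corpus, function(d) paste(as.character(d), collapse = " "),
               character(1)))
}
```

Since removeWords() combines its word list into a single regular expression, a very large complement may need to be applied in several smaller chunks.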