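I am using the 'tm' package in R to create a term-document matrix from stemmed terms. The process completes, but the resulting matrix includes terms that do not appear to have been stemmed. I am trying to understand why that happens and how to fix it.
Here is the script for the process, which uses a couple of online news stories as a sandbox: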
library(boilerpipeR)
library(RCurl)
library(tm)
# Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
url <- "http://blogs.wsj.com/digits/2015/07/14/google-mozilla-disable-flash-over-security-concerns/"
extract <- LargestContentExtractor(getURL(url))
url2 <- "http://www.cnet.com/news/startup-lands-100-million-to-challenge-smartphone-superpowers-apple-and-google/"
extract2 <- LargestContentExtractor(getURL(url2))
# Now put those text vectors in a corpus and create a tdm
news.corpus <- VCorpus(VectorSource(c(extract, extract2)))
news.tdm <- TermDocumentMatrix(news.corpus,
                               control = list(removePunctuation = TRUE,
                                              stopwords = TRUE,
                                              stripWhitespace = TRUE,
                                              stemDocument = TRUE))
# Now inspect the result
findFreqTerms(news.tdm, 4)
Here is the output produced by that last line:
[1] "acadine" "adobe" "android" "browser" "challenge" "companies" "company" "devices" "firefox" "flash"
[11] "funding" "gong" "hackers" "international" "ios" "like" "million" "mobile" "mozilla" "mozillas"
[21] "new" "online" "operating" "said" "security" "smartphones" "software" "startup" "system" "systems"
[31] "tsinghua" "unigroup" "used" "users" "videos" "web" "will"
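In the first row, for example, we see "companies" and "company", and we see "devices". I thought stemming would reduce "companies" and "company" to the same stem ("compani"?), and I thought it would trim the "s" from plurals such as "devices". Am I wrong about that? If not, why isn't this code producing the desired result?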
Answer 0 (score: 2)
Use stemming = TRUE or stemming = stemDocument instead of stemDocument = TRUE. (?termFreq shows that stemDocument is not a valid control parameter.)
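As a minimal sketch of that fix, the call below swaps stemDocument = TRUE for stemming = TRUE. It assumes the SnowballC package is installed, since tm relies on it for stemming. The two sentences and the names docs, demo.corpus, and demo.tdm are made up for illustration; they stand in for the scraped news stories so the example runs without network access.

```r
library(tm)  # stemming also needs the SnowballC package to be installed

# Hypothetical stand-in text, not the original news stories
docs <- c("The company sells devices.",
          "Many companies sell a device.")
demo.corpus <- VCorpus(VectorSource(docs))

# 'stemming' is the control name that termFreq() recognizes
demo.tdm <- TermDocumentMatrix(demo.corpus,
                               control = list(removePunctuation = TRUE,
                                              stopwords = TRUE,
                                              stemming = TRUE))

# List the terms; "company" and "companies" should now share one stem
Terms(demo.tdm)
```

As far as I can tell, tm simply ignores control entries it does not recognize, which would explain why the original call ran without any error or warning while leaving the terms unstemmed.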