Question

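I'm using the 'tm' package in R to create a term-document matrix from stemmed terms. The process completes, but the resulting matrix includes terms that don't appear to have been stemmed. I'm trying to understand why that happens and how to fix it.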

Here is the script for the process, which uses a couple of online news stories as a sandbox:

library(boilerpipeR)
library(RCurl)
library(tm)

# Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
url <- "http://blogs.wsj.com/digits/2015/07/14/google-mozilla-disable-flash-over-security-concerns/"
extract <- LargestContentExtractor(getURL(url))
url2 <- "http://www.cnet.com/news/startup-lands-100-million-to-challenge-smartphone-superpowers-apple-and-google/"
extract2 <- LargestContentExtractor(getURL(url2))

# Now put those text vectors in a corpus and create a tdm
news.corpus <- VCorpus(VectorSource(c(extract, extract2)))
news.tdm <- TermDocumentMatrix(news.corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 stripWhitespace = TRUE,
                 stemDocument = TRUE))

# Now inspect the result
findFreqTerms(news.tdm, 4)

Here is the output from that last line:

[1] "acadine"       "adobe"         "android"       "browser"       "challenge"     "companies"     "company"       "devices"       "firefox"       "flash"        
[11] "funding"       "gong"          "hackers"       "international" "ios"           "like"          "million"       "mobile"        "mozilla"       "mozillas"     
[21] "new"           "online"        "operating"     "said"          "security"      "smartphones"   "software"      "startup"       "system"        "systems"      
[31] "tsinghua"      "unigroup"      "used"          "users"         "videos"        "web"           "will"

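In the first line of output, for example, we see "companies" and "company", and we see "devices". I thought stemming would reduce "companies" and "company" to the same stem ("compani"?), and I thought it would cut the "s" off plurals such as "devices". Am I wrong about that? If not, why doesn't this code produce the desired result?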

Answer 1

Use stemming = TRUE or stemming = stemDocument instead of stemDocument = TRUE. (?termFreq shows that stemDocument is not a valid control parameter.)
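
As a minimal sketch of that change, here is the same kind of call with the control option renamed. The two toy sentences are made up for illustration and stand in for the scraped articles, so nothing needs to be downloaded. It assumes the SnowballC package is installed, since tm relies on it for stemming. I also left out stripWhitespace, because as far as I can tell ?termFreq does not list it as a control option either.

```r
library(tm)  # stemming additionally needs the SnowballC package installed

# Two made-up documents standing in for the scraped news stories
docs <- c("The company ships devices. Other companies ship devices too.",
          "Companies fund the company that makes the devices.")
toy.corpus <- VCorpus(VectorSource(docs))

# 'stemming' (not 'stemDocument') is the option documented in ?termFreq
toy.tdm <- TermDocumentMatrix(toy.corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 stemming = TRUE))

# Terms that occur at least twice across the toy corpus
findFreqTerms(toy.tdm, 2)
```

With stemming applied, "company" and "companies" should collapse into a single stemmed term, and the plural "devices" should no longer appear as its own entry.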