Why isn't stemDocument stemming?

Time: 2015-07-15 18:50:31

Tags: r nlp text-mining tm

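I'm using the 'tm' package in R to create a term-document matrix of stemmed terms. The process completes, but the resulting matrix includes terms that don't appear to have been stemmed. I'm trying to understand why that happens and how to fix it.

Here is the script for the process, which uses a couple of online news stories as a sandbox: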

library(boilerpipeR)
library(RCurl)
library(tm)

# Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
url <- "http://blogs.wsj.com/digits/2015/07/14/google-mozilla-disable-flash-over-security-concerns/"
extract <- LargestContentExtractor(getURL(url))
url2 <- "http://www.cnet.com/news/startup-lands-100-million-to-challenge-smartphone-superpowers-apple-and-google/"
extract2 <- LargestContentExtractor(getURL(url2))

# Now put those text vectors in a corpus and create a tdm
news.corpus <- VCorpus(VectorSource(c(extract, extract2)))
news.tdm <- TermDocumentMatrix(news.corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 stripWhitespace = TRUE,
                 stemDocument = TRUE))

# Now inspect the result
findFreqTerms(news.tdm, 4)

Here is the output produced by that last line:

[1] "acadine"       "adobe"         "android"       "browser"       "challenge"     "companies"     "company"       "devices"       "firefox"       "flash"        
[11] "funding"       "gong"          "hackers"       "international" "ios"           "like"          "million"       "mobile"        "mozilla"       "mozillas"     
[21] "new"           "online"        "operating"     "said"          "security"      "smartphones"   "software"      "startup"       "system"        "systems"      
[31] "tsinghua"      "unigroup"      "used"          "users"         "videos"        "web"           "will"  

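For example, in row 1 we see both "companies" and "company", and we see "devices". I thought stemming would reduce "companies" and "company" to the same stem ("compani"?), and I thought it would strip the "s" from plurals such as "devices". Am I wrong about that? If not, why isn't this code producing the desired result?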

1 answer:

Answer 0 (score: 2)

Use stemming = TRUE or stemming = stemDocument instead of stemDocument = TRUE. (?termFreq shows that stemDocument is not a valid control parameter.)
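As a minimal sketch of that fix, the snippet below builds a term-document matrix with stemming = TRUE. The two short documents are made up for illustration and stand in for the extracted news text, so nothing is downloaded. Only the options needed to show stemming are kept in the control list, and the sketch assumes the SnowballC package is installed, since tm uses it for stemming.

```r
library(tm)  # assumes SnowballC is also installed; tm uses it for stemming

# Two made-up documents standing in for the extracted news stories
docs <- c("The company said other companies ship devices.",
          "Each device from the company runs new systems.")
corpus <- VCorpus(VectorSource(docs))

# 'stemming = TRUE' is the control option documented in ?termFreq;
# 'stemDocument = TRUE' is not one of the documented options
tdm <- TermDocumentMatrix(corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 stemming = TRUE))

# List the terms that ended up in the matrix
terms <- Terms(tdm)
print(terms)
```

With stemming applied, "company" and "companies" should collapse into a single stemmed term rather than appearing as separate rows.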