Fatal error when creating a document-term matrix with tm (text mining)

Time: 2015-04-21 16:00:16

Tags: r text-mining tm

When I try to create a document-term matrix, tm throws an error:

library(tm)
data(crude)

#control parameters
dtm.control <- list(
    tolower           = TRUE, 
    removePunctuation = TRUE,
    removeNumbers     = TRUE,
    stopWords         = stopwords("english"),
    stemming          = TRUE, # false for sentiment
    wordLengths       = c(3, "inf"))

dtm <- DocumentTermMatrix(corp, control = dtm.control)

Error:

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
  'i, j, v' different lengths
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
  NAs introduced by coercion

What am I doing wrong? Also:

I am working from these tutorials:

Is there a better or more recent walkthrough?

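1 answer:

Answer 0 (score: 0)

You might consider making a few changes to your code, especially to removeStopWords and to how the corpus is created. The following worked for me: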

library(tm)
data("crude")

#control parameters
dtm.control <- list(
  tolower           = TRUE, 
  removePunctuation = TRUE,
  removeNumbers     = TRUE,
  removestopWords   = TRUE,
  stemming          = TRUE, # false for sentiment
  wordLengths       = c(3, "inf"))

corp <- Corpus(VectorSource(crude))

dtm <- DocumentTermMatrix(corp, control = dtm.control)

> inspect(dtm)
<<DocumentTermMatrix (documents: 20, terms: 848)>>
Non-/sparse entries: 1877/15083
Sparsity           : 89%
Maximal term length: 16
Weighting          : term frequency (tf)
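
A possible variant of the code above, offered as a sketch rather than a verified fix for the original error. The question's snippet never defines `corp`, and since `crude` is already a corpus it can be passed to `DocumentTermMatrix` directly. As far as the tm documentation for `termFreq` goes, the stopword option is spelled `stopwords` (all lowercase), so `stopWords` and `removestopWords` are probably ignored silently, and `c(3, "inf")` is a character vector rather than a numeric one. The `options(mc.cores = 1)` line is an assumption: it is a workaround often suggested for older tm releases that called `mclapply`, and it is left commented out here.

```r
library(tm)
data("crude")  # `crude` ships with tm and is already a corpus of 20 Reuters articles

# Control options spelled as documented in ?termFreq.
dtm.control <- list(
  tolower           = TRUE,
  removePunctuation = TRUE,
  removeNumbers     = TRUE,
  stopwords         = TRUE,       # TRUE selects tm's default stopword list
  stemming          = TRUE,       # needs the SnowballC package; FALSE for sentiment work
  wordLengths       = c(3, Inf))  # numeric Inf, not the string "inf"

# Assumed workaround for older tm versions that parallelised with mclapply;
# uncomment to force serial execution so the underlying error becomes visible.
# options(mc.cores = 1)

# Pass the corpus itself; no separate `corp` object is required.
dtm <- DocumentTermMatrix(crude, control = dtm.control)

inspect(dtm)
```

Because stopwords are actually removed in this variant, the number of terms should differ from the 848 reported above.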