DocumentTermMatrix / LDA raises a non-zero entry error even though there are no empty documents

Date: 2018-10-26 15:45:40

Tags: r text tm lda topic-modeling

I am trying to fit my first LDA model in R, and it throws this error:

Error in LDA(Corpus_clean_dtm, k, method = "Gibbs", control = list(nstart = nstart,  :    Each row of the input matrix needs to contain at least one non-zero entry

Here is my model code, which includes some standard preprocessing steps:

library(tm)
library(topicmodels)
library(textstem)


df_withduplicateID <- data.frame(
  doc_id = c("2095/1", "2836/1", "2836/1", "2836/1", "9750/2", 
    "13559/1", "19094/1", "19053/1", "20215/1", "20215/1"), 
  text = c("He do subjects prepared bachelor juvenile ye oh.", 
    "He feelings removing informed he as ignorant we prepared.",
    "He feelings removing informed he as ignorant we prepared.",
    "He feelings removing informed he as ignorant we prepared.",
    "Fond his say old meet cold find come whom. ",
    "Wonder matter now can estate esteem assure fat roused.",
    ".Am performed on existence as discourse is.", 
    "Moment led family sooner cannot her window pulled any.",
    "Why resolution one motionless you him thoroughly.", 
    "Why resolution one motionless you him thoroughly.")     
)


clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, tolower)
  corpus <- tm_map(corpus, lemmatize_strings)
  return(corpus)
}

# Keep only the first row for each doc_id
df <- subset(df_withduplicateID, !duplicated(subset(df_withduplicateID, select = doc_id)))
Corpus <- Corpus(DataframeSource(df))
Corpus_clean <- clean_corpus(Corpus)
Corpus_clean_dtm <- DocumentTermMatrix(Corpus_clean)


burnin <- 4000
iter <- 2000
thin <- 500
seed <- list(203, 500, 623, 1001, 765)
nstart <- 5
best <- TRUE
k <- 5

LDAresult_1683 <- LDA(Corpus_clean_dtm, k, method = "Gibbs", 
  control = list(nstart = nstart, seed = seed, best = best, 
  burnin = burnin, iter = iter, thin = thin))

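After some searching, it looks like my DocumentTermMatrix may contain empty documents (as mentioned previously here and here), which is what triggers this error message.

I then removed the empty documents and re-ran the LDA model, and everything went fine. No errors were thrown.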

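# Count the terms in each document (row) of the matrix
rowTotals <- apply(Corpus_clean_dtm, 1, sum)
# Keep the documents that have at least one term
Corpus_clean_dtm.new <- Corpus_clean_dtm[rowTotals > 0, ]
# Collect the documents that ended up with no terms and list their IDs
Corpus_clean_dtm.empty <- Corpus_clean_dtm[rowTotals <= 0, ]
Corpus_clean_dtm.empty$dimnames$Docs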
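
The filtering step above can be sketched on a tiny hand-made count matrix. The matrix, its document IDs and its terms are hypothetical stand-ins, not the corpus from this question. It also records which document IDs were dropped, so they can be looked up in the corpus afterwards. On a real DocumentTermMatrix, slam::row_sums() should give the same totals and may be lighter on memory than apply(), which converts the sparse matrix to a dense one first.

```r
# Hypothetical 4 x 3 document-term count matrix; rows are documents.
counts <- matrix(
  c(1, 0, 2,
    0, 0, 0,
    0, 3, 0,
    0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(Docs = c("d1", "d2", "d3", "d4"),
                  Terms = c("alpha", "beta", "gamma"))
)

# Total number of term occurrences per document.
row_totals <- rowSums(counts)

# Keep the documents with at least one term; drop = FALSE keeps the
# result a matrix even if only one row survives.
kept <- counts[row_totals > 0, , drop = FALSE]

# IDs of the documents that would trigger the LDA error.
dropped_ids <- rownames(counts)[row_totals == 0]
print(dropped_ids)
```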

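I then manually looked up the row IDs in Corpus_clean_dtm.empty (which holds all of the "empty" document entries) and matched them against the same IDs (and row numbers) in Corpus_clean. It turns out these documents are not really empty: each "empty" document contains at least 20 characters.

Am I missing something here?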

1 Answer:

Answer 0: (score: 0)

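After some more digging, and inspired by the discussion here, I think the problem I raised is caused by an actual bug in the tm package (please correct me if I am wrong).

Once I converted the data frame with VCorpus() instead of Corpus() and wrapped every cleaning step in content_transformer(), I was able to lemmatize all documents and apply DocumentTermMatrix() to the cleaned corpus without any error. LDA() no longer throws an error either. If I do not wrap the cleaning steps in content_transformer(), my VCorpus() object comes back from cleaning as a list rather than as a corpus structure.

For future reference, I am using tm version 0.7-3.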

library(tm)
library(topicmodels)
library(textstem)


df_withduplicateID <- data.frame(
  doc_id = c("2095/1", "2836/1", "2836/1", "2836/1", "9750/2", 
    "13559/1", "19094/1", "19053/1", "20215/1", "20215/1"), 
  text = c("He do subjects prepared bachelor juvenile ye oh.", 
    "He feelings removing informed he as ignorant we prepared.",
    "He feelings removing informed he as ignorant we prepared.",
    "He feelings removing informed he as ignorant we prepared.",
    "Fond his say old meet cold find come whom. ",
    "Wonder matter now can estate esteem assure fat roused.",
    ".Am performed on existence as discourse is.", 
    "Moment led family sooner cannot her window pulled any.",
    "Why resolution one motionless you him thoroughly.", 
    "Why resolution one motionless you him thoroughly.")     
)


clean_corpus <- function(corpus) {
  # content_transformer() keeps each document a tm text document
  corpus <- tm_map(corpus, content_transformer(stripWhitespace))
  corpus <- tm_map(corpus, content_transformer(removePunctuation))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, content_transformer(lemmatize_strings))
  return(corpus)
}

# Keep only the first row for each doc_id
df <- subset(df_withduplicateID, !duplicated(subset(df_withduplicateID, select = doc_id)))
Corpus <- VCorpus(DataframeSource(df), readerControl = list(reader = reader(DataframeSource(df)), language = "en"))
Corpus_clean <- clean_corpus(Corpus)
Corpus_clean_dtm <- DocumentTermMatrix(Corpus_clean)
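
As a quick sanity check before calling LDA(), one could confirm that the cleaned object is still a corpus and that no row of the matrix is empty. The sketch below is only an illustration: it assumes tm and slam are installed, uses a hypothetical three-document data frame rather than the real data, and leaves out the lemmatization step so that it does not depend on textstem.

```r
library(tm)

# Hypothetical toy data with the doc_id/text columns DataframeSource expects.
docs <- data.frame(
  doc_id = c("a", "b", "c"),
  text = c("He do subjects prepared bachelor juvenile.",
           "Fond his say old meet cold find come whom.",
           "Why resolution one motionless you him thoroughly."),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(docs))

# Wrap plain string functions so each document stays a tm text document.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removePunctuation))

dtm <- DocumentTermMatrix(corpus)

# Per-document term totals, computed on the sparse matrix directly.
row_totals <- slam::row_sums(dtm)

# Both checks should pass before the matrix is handed to LDA().
stopifnot(inherits(corpus, "VCorpus"))
stopifnot(all(row_totals > 0))
```

If the second check fails, names(row_totals)[row_totals == 0] should list the affected document IDs.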