Question

I am doing topic modelling with the topicmodels package in R. I create a Corpus object, do some basic preprocessing, and then create a DocumentTermMatrix:

corpus <- Corpus(VectorSource(vec), readerControl=list(language="en")) 
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
# ...snip removing several custom lists of stopwords...
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus, control=list(minDocFreq=2, minWordLength=2))

Then I run LDA:

LDA(dtm, 30)

The final call to LDA() returns the error

  "Each row of the input matrix needs to contain at least one non-zero entry".

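I take this to mean that at least one document has no terms left after preprocessing. Is there an easy way to remove documents that contain no terms from a DocumentTermMatrix?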

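I looked through the documentation for the topicmodels package and found the function removeSparseTerms, which removes terms that do not appear in any document, but there is no analogue for removing documents.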

Answer 1

"Each row of the input matrix needs to contain at least one non-zero entry"

The error means that the sparse matrix contains a row with no entries (words). One idea is to compute the sum of words by row:

rowTotals <- apply(dtm, 1, sum)       # find the sum of words in each document
dtm.new   <- dtm[rowTotals > 0, ]     # remove all docs without words
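
A minimal sketch of the same row-filtering idea on a toy corpus. The three sample strings and the variable names are made up for illustration, and slam::row_sums() is swapped in for apply() as an assumption (slam is the sparse-matrix package that tm builds on); it should give the same per-document totals.

```r
library(tm)

# Toy input (hypothetical): the second document is empty.
docs   <- c("apple banana", "", "cherry apple")
corpus <- VCorpus(VectorSource(docs))
dtm    <- DocumentTermMatrix(corpus)

# Per-document term totals, computed on the sparse representation.
row_totals <- slam::row_sums(dtm)

# Keep only the documents that still contain at least one term.
dtm_new <- dtm[row_totals > 0, ]
```

After this, dtm_new can be passed to LDA() in place of dtm.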

Answer 2

agstudy's answer works well, but using it on a slow computer proved problematic.

library(tictoc)  # assumption: tic()/toc() come from the tictoc package
tic()
row_total = apply(dtm, 1, sum)
dtm.new = dtm[row_total > 0, ]
toc()
# 4.859 sec elapsed

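(This was done with a 4000x15000 dtm.)

The bottleneck appears to be applying sum() to a sparse matrix.

A document-term matrix created by the tm package contains the components i and j, which are the row and column indices of the entries in the sparse matrix. If dtm$i does not contain a particular row index p, then row p is empty.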

tic()
ui = unique(dtm$i)
dtm.new = dtm[ui, ]
toc()
# 0.121 sec elapsed

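ui contains the indices of all non-empty rows, and since dtm$i is already ordered, dtm.new keeps the same row order as dtm. For smaller document-term matrices the performance gain may not matter, but for larger ones it can become significant.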
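
To make the index trick concrete, here is a small sketch on a hand-built sparse matrix. The toy simple_triplet_matrix stands in for a real dtm (a DocumentTermMatrix is stored in the same triplet form), and the sort() call is a defensive addition of mine in case the indices are not already ordered.

```r
library(slam)

# Toy 3 x 2 sparse matrix in triplet form; row 2 has no entries.
m <- simple_triplet_matrix(
  i = c(1L, 1L, 3L),   # row index of each stored entry
  j = c(1L, 2L, 2L),   # column index of each stored entry
  v = c(2, 1, 4),      # the non-zero values
  nrow = 3, ncol = 2
)

# Rows that appear in m$i have at least one stored entry.
non_empty <- sort(unique(m$i))

# Subset to those rows; no per-row sums are computed.
m_new <- m[non_empty, ]
```

The approach relies on the matrix storing only non-zero entries, which is assumed here.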

Answer 3

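This is just to elaborate on the answer given by agstudy.

Instead of removing the empty rows from the dtm, we can identify the zero-length documents in our corpus and remove them directly from the corpus, before building a second dtm that contains only non-empty documents.

This is useful for keeping a 1:1 correspondence between the dtm and the corpus.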

empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
corpus <- corpus[-as.numeric(empty.rows)]
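
A possible end-to-end sketch of this corpus-first approach, under stated assumptions: the toy documents are invented, and rows are selected by position with which() rather than by dimnames, so it does not depend on how the documents are labelled.

```r
library(tm)

# Toy input (hypothetical): the second document is empty.
docs   <- c("apple banana", "", "cherry apple")
corpus <- VCorpus(VectorSource(docs))
dtm    <- DocumentTermMatrix(corpus)

# Positions of the documents that still contain terms.
keep <- unname(which(slam::row_sums(dtm) > 0))

# Drop the empty documents from the corpus, then rebuild the dtm
# so that row k of dtm_clean corresponds to corpus_clean[[k]].
corpus_clean <- corpus[keep]
dtm_clean    <- DocumentTermMatrix(corpus_clean)
```

Rebuilding the dtm costs a second pass over the corpus, which is the price of keeping the two objects aligned.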

Answer 4

Just remove the sparse terms from the DTM and everything will work fine.

dtm <- DocumentTermMatrix(crude, sparse=TRUE)

Answer 5

A small addendum to Dario Lacan's answer:

empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]

will collect the record ids, not the row positions. Try this:

library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm[1, ]$dimnames[1][[1]] # returns "127", not "1"

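If you build your own corpus with sequential numbering, some documents may be removed when the data is cleaned, which breaks the numbering. So it is better to use the ids directly: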

corpus <- tm_filter(
  corpus,
  FUN = function(doc) !is.element(meta(doc)$id, empty.rows)
  # equivalently: !(meta(doc)$id %in% empty.rows)
)

Answer 6

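I have a data frame column, lt$title, that contains strings. There are no "empty" rows in this column, but I still got the error:

Error in LDA(dtm, k = 20, control = list(seed = 813)) : Each row of the input matrix needs to contain at least one non-zero entry

Some of the solutions above did not work for me, because I need to join the vector of predicted topics back onto my original data frame. Removing the empty rows from the document-term matrix is therefore not an option.

The problem was that some (very short) strings in lt$title contained special characters that Corpus() and/or DocumentTermMatrix() could not handle.

My solution was to remove the "short" strings (one or two words at most), which do not carry much information anyway.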

# Clean up text data
lt$test=nchar(lt$title)
lt = lt[!lt$test<10,]
lt$test<-NULL

# Topic modeling
corpus <- Corpus(VectorSource(lt$title))
dtm = DocumentTermMatrix(corpus)
tm = LDA(dtm, k = 20, control = list(seed = 813))

# Add "topics" to original DF
lt$topic = topics(tm)
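
An alternative sketch that keeps every row of the data frame, for cases where dropping short titles is not acceptable. This is not the method used above: the sample titles, the k = 2 setting and the names keep and fit are illustrative assumptions. Rows whose documents end up empty simply get NA as their topic.

```r
library(tm)
library(topicmodels)

# Toy data frame (hypothetical); "?!" yields no terms under the
# default DocumentTermMatrix settings (minimum word length of 3).
lt <- data.frame(
  title = c("stock market prices rise", "?!",
            "oil prices fall sharply", "market traders watch oil"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(VectorSource(lt$title))
dtm    <- DocumentTermMatrix(corpus)

# Logical mask of the rows that LDA() can accept.
keep <- unname(slam::row_sums(dtm) > 0)

# Fit on the non-empty rows only.
fit <- LDA(dtm[keep, ], k = 2, control = list(seed = 813))

# Write topics back by mask; empty documents stay NA.
lt$topic <- NA_integer_
lt$topic[keep] <- topics(fit)
```

Because the mask is computed from the dtm itself, the data frame keeps its original length and row order.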

Remove empty documents from DocumentTermMatrix in R topicmodels?
