Question

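I have a large CSV file (3.8 GB) that holds data with terms as columns and documents as rows. I want to convert it into a DocumentTermMatrix (DTM) from the tm package.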

I'm skipping the read.csv step here, but you get the idea:

dtm <- structure(list(the = c(2L, 1L), apple = c(0L, 2L), dumb = c(1L, 0L)), .Names = c("the", "apple", "dumb"), class = "data.frame", row.names = c(NA, -2L))

After that, I don't know how to convert it into a proper tm package DTM:

c <- Corpus(DataframeSource(dtm))

This is obviously wrong.

Thanks for any pointers.

Answer 1

Do it like this:

tmDTM <- tm::as.DocumentTermMatrix(slam::as.simple_triplet_matrix(dtm),
                                   weighting = tm::weightTf)
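
The call above assumes the whole data frame is already in memory. For a file of several GB that may not be practical, so here is a minimal sketch that reads the CSV in row chunks and keeps only the non-zero counts. The helper name csv_to_dtm and the chunk_rows parameter are hypothetical, and the sketch assumes the first line is a header of term names, every other line is one document, there is no document ID column, and no field contains an embedded newline.

```r
library(slam)
library(tm)

# Hypothetical helper: build a tm DocumentTermMatrix from a term-count CSV
# without ever holding the full dense table in memory.
csv_to_dtm <- function(path, chunk_rows = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Assumption: the header row holds the term names.
  terms <- scan(text = readLines(con, n = 1L), what = "", sep = ",", quiet = TRUE)

  i <- integer(0)  # document (row) index of each non-zero count
  j <- integer(0)  # term (column) index of each non-zero count
  v <- integer(0)  # the counts themselves
  n_docs <- 0L

  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break

    chunk <- as.matrix(read.csv(text = lines, header = FALSE))
    nz <- which(chunk != 0, arr.ind = TRUE)  # positions of non-zero cells

    i <- c(i, nz[, 1] + n_docs)  # shift row indices by the rows read so far
    j <- c(j, nz[, 2])
    v <- c(v, chunk[nz])
    n_docs <- n_docs + nrow(chunk)
  }

  stm <- slam::simple_triplet_matrix(
    i, j, v,
    nrow = n_docs, ncol = length(terms),
    dimnames = list(as.character(seq_len(n_docs)), terms)
  )
  tm::as.DocumentTermMatrix(stm, weighting = tm::weightTf)
}
```

Usage would look like tmDTM <- csv_to_dtm("terms.csv"), where the file name is a placeholder. Only one chunk is dense at any time, so chunk_rows trades memory for the number of read calls.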

The quanteda package also has some nice implementations.
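
As a rough illustration of the quanteda route, assuming a reasonably recent quanteda with tm also installed: as.dfm() coerces a count matrix to a sparse dfm, and convert() with to = "tm" turns that into a DocumentTermMatrix. The row names doc1 and doc2 and the variable names are made up for the example.

```r
library(quanteda)

# Toy counts from the question, with illustrative document names.
df <- data.frame(the = c(2L, 1L), apple = c(0L, 2L), dumb = c(1L, 0L),
                 row.names = c("doc1", "doc2"))

# Coerce the count matrix to a quanteda dfm (stored sparsely).
dfm_counts <- as.dfm(as.matrix(df))

# Convert the dfm to a tm DocumentTermMatrix (needs tm installed).
tmDTM2 <- convert(dfm_counts, to = "tm")
```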
