Question

I am using R's quanteda package, with the latest versions of both R and the package. I have a corpus of several million documents.

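Suppose I have a DFM generated by quanteda, and each document has a date docvar. Thousands of documents are generated on any given day, but I would like to aggregate the DFM by day, so that I get the total count of each term per day. I understand quanteda is built on data.table, so this should be possible, but I could not find a clean way to do it in "Getting Started with quanteda" or on StackOverflow.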

Any suggestions?

Answer 1

You want the 'groups' argument of dfm:

> # Add some random dates to an existing corpus
> docvars(data_corpus_inaugural)$date <- rep(as.Date(runif(19, 1, 18000), origin='1970-01-01'), 3)

> dfm_inaugural <- dfm(data_corpus_inaugural, groups='date')
> head(dfm_inaugural)
Document-feature matrix of: 19 documents, 9,215 features (80.8% sparse).
(showing first 6 documents and first 6 features)
            features
docs         fellow citizens  i appear before you
  1970-12-27      4        7 39      2     10  17
  1972-04-25      8       13 29      1      8   8
  1973-08-22      1        3 48      1      6   1
  1973-10-11      2        4 25      0      3   5
  1974-01-05      3        9 57      0      7   2
  1975-04-12      7       21 63      4      6  16
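
Since the question starts from a DFM that already exists, it may be more convenient to group that DFM directly rather than rebuild it from the corpus. The following is a minimal sketch, assuming a quanteda version that provides dfm_group(); as far as I know, newer releases also expect tokens() to be called before dfm() and no longer accept groups in dfm() itself. The toy texts and dates are made up for illustration.

```r
library(quanteda)

# Toy corpus: four short documents spread over two made-up dates
txt <- c(d1 = "apple banana apple",
         d2 = "banana cherry",
         d3 = "apple cherry cherry",
         d4 = "banana")
corp <- corpus(txt)
docvars(corp, "date") <- as.Date(c("2020-01-01", "2020-01-01",
                                   "2020-01-02", "2020-01-02"))

# Build the per-document DFM first, as in the question
dfmat <- dfm(tokens(corp))

# Collapse all rows that share a date; counts are summed within each group
dfmat_daily <- dfm_group(dfmat, groups = docvars(dfmat, "date"))
dfmat_daily
```

The result has one row per date, with each cell holding the total count of that term across all documents from that day.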

Quanteda - Apply a function to a DFM over a document variable
