Consider this simple example:
tibble(text = c('a grande latte with soy milk',
'black coffee no room',
'latte is a latte',
'coke, diet coke'),
myday = c(ymd('2018-01-01','2018-01-01','2018-01-03','2018-01-03'))) %>%
corpus() %>%
tokens() %>%
dfm()
Document-feature matrix of: 4 documents, 14 features (71.4% sparse).
4 x 14 sparse Matrix of class "dfm"
       features
docs    a grande latte with soy milk black coffee no room is coke , diet
  text1 1      1     1    1   1    1     0      0  0    0  0    0 0    0
  text2 0      0     0    0   0    0     1      1  1    1  0    0 0    0
  text3 1      0     2    0   0    0     0      0  0    0  1    0 0    0
  text4 0      0     0    0   0    0     0      0  0    0  0    2 1    1
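I am interested in the proportion of the word coffee, aggregated by day.
That is, for the day 2018-01-01 there are 10 words (a, grande, latte, with, soy, milk, black, coffee, no, room), and coffee is mentioned only once, so the proportion is 1/10. The same goes for the other days.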
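For reference, the target numbers can be checked by hand in base R. This is only a rough sketch: the whitespace split with the comma peeled off is an assumption that approximates what tokens() produces for these four texts, and the names texts, toks, n_tokens, n_coffee and prop_coffee are made up for the illustration.

```r
texts <- c("a grande latte with soy milk",
           "black coffee no room",
           "latte is a latte",
           "coke, diet coke")
myday <- c("2018-01-01", "2018-01-01", "2018-01-03", "2018-01-03")

# Treat the comma as its own token, then split on spaces
# (an approximation of tokens(), good enough for these texts)
toks <- strsplit(gsub(",", " ,", texts, fixed = TRUE), " +")

# Total tokens per day, and occurrences of "coffee" per day
n_tokens <- tapply(lengths(toks), myday, sum)
n_coffee <- tapply(vapply(toks, function(x) sum(x == "coffee"), integer(1)),
                   myday, sum)

# Daily proportion of "coffee"
prop_coffee <- n_coffee / n_tokens
prop_coffee
```

This gives 1/10 for 2018-01-01 and 0/8 for 2018-01-03, which is the result wanted from the dfm.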
How can I do this in quanteda? Of course, the idea is to avoid materializing the sparse matrix as a dense one.
Thanks!
Answer 0 (score: 1)
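This is easy, and it follows from a core quanteda design decision: docvars are passed from the corpus object to "downstream" objects such as a dfm. You can solve this by using dfm_group() on the myday docvar and then weighting.
First, to make the example fully reproducible, and to assign a name to the dfm object: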
library("quanteda")
## Package version: 1.4.3
library("tibble")
library("lubridate")
dfmat <- tibble(
text = c(
"a grande latte with soy milk",
"black coffee no room",
"latte is a latte",
"coke, diet coke"
),
myday = c(ymd("2018-01-01", "2018-01-01", "2018-01-03", "2018-01-03"))
) %>%
corpus() %>%
tokens() %>%
dfm()
Now it takes just two operations to get the desired result.
dfmat2 <- dfm_group(dfmat, groups = "myday") %>%
dfm_weight(scheme = "prop")
dfmat2
## Document-feature matrix of: 2 documents, 14 features (42.9% sparse).
## 2 x 14 sparse Matrix of class "dfm"
##            features
## docs           a grande latte with soy milk black coffee  no room    is
##   2018-01-01 0.100    0.1  0.10  0.1 0.1  0.1   0.1    0.1 0.1  0.1     0
##   2018-01-03 0.125      0  0.25    0   0    0     0      0   0    0 0.125
##            features
## docs         coke     ,  diet
##   2018-01-01    0     0     0
##   2018-01-03 0.25 0.125 0.125
dfmat2[, "coffee"]
## Document-feature matrix of: 2 documents, 1 feature (50.0% sparse).
## 2 x 1 sparse Matrix of class "dfm"
##            features
## docs         coffee
##   2018-01-01    0.1
##   2018-01-03      0
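As a rough illustration of why the grouped, weighted result never needs to become dense, the same two steps can be sketched with the Matrix package alone. This is not how quanteda implements dfm_group() or dfm_weight(); the counts are re-entered by hand from the example above, and the names feats, counts, day, indicator, grouped, totals and props are made up for the sketch.

```r
library(Matrix)

# Counts from the example, entered by hand as (row, column, value) triplets
feats <- c("a", "grande", "latte", "with", "soy", "milk", "black",
           "coffee", "no", "room", "is", "coke", ",", "diet")
counts <- sparseMatrix(
  i = c(rep(1, 6), rep(2, 4), rep(3, 3), rep(4, 3)),
  j = c(1:6, 7:10, 1, 3, 11, 12, 13, 14),
  x = c(rep(1, 10), 1, 2, 1, 2, 1, 1),
  dims = c(4, 14),
  dimnames = list(paste0("text", 1:4), feats)
)
day <- factor(c("2018-01-01", "2018-01-01", "2018-01-03", "2018-01-03"))

# Sparse day-by-document indicator; multiplying by it sums the
# document rows that belong to each day
indicator <- sparseMatrix(
  i = as.integer(day), j = seq_along(day), x = 1,
  dims = c(nlevels(day), length(day))
)
grouped <- indicator %*% counts
dimnames(grouped) <- list(levels(day), feats)

# Divide only the stored non-zero entries by their row totals
# (@i holds the 0-based row index of each stored entry in a dgCMatrix)
totals <- as.numeric(rowSums(grouped))
props <- grouped
props@x <- props@x / totals[props@i + 1L]

props[, "coffee"]
```

Only the stored non-zero entries are touched, so props remains a sparse matrix throughout, and the coffee column again comes out as 0.1 and 0.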