Consider this example:
tibble(text = c('a grande latte with soy milk',
'black coffee no room'),
repetition = c(100, 2))
# A tibble: 2 x 2
  text                         repetition
  <chr>                             <dbl>
1 a grande latte with soy milk        100
2 black coffee no room                  2
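The data mean that the sentence "a grande latte with soy milk" appears 100 times in my dataset. Storing that redundancy would of course waste memory, which is why I have the repetition variable.
Still, I would like quanteda's dfm to reflect this, because the sparsity of a dfm gives me room to keep that information. That is, how can I still get 100 rows for the first text in the dfm? The following code on its own does not take repetition into account: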
tibble(text = c('a grande latte with soy milk',
'black coffee no room'),
repetition = c(100, 2)) %>%
corpus() %>%
tokens() %>%
dfm()
Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
       features
docs    a grande latte with soy milk black coffee no room
  text1 1      1     1    1   1    1     0      0  0    0
  text2 0      0     0    0   0    0     1      1  1    1
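Answer 0 (score: 2)
Assuming your data.frame is called df1, you can use cbind to add a column to the dfm. This may not give you the result you want, though, because the count is then stored as if it were just another feature. The other two options below are probably better.
cbind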
df1 <- tibble(text = c('a grande latte with soy milk',
'black coffee no room'),
repetition = c(100, 2))
my_dfm <- df1 %>%
corpus() %>%
tokens() %>%
dfm() %>%
cbind(repetition = df1$repetition) # add column to dfm with name repetition
Document-feature matrix of: 2 documents, 11 features (45.5% sparse).
2 x 11 sparse Matrix of class "dfm"
       features
docs    a grande latte with soy milk black coffee no room repetition
  text1 1      1     1    1   1    1     0      0  0    0        100
  text2 0      0     0    0   0    0     1      1  1    1          2
docvars
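You can also add the data with the docvars function. The data are then attached to the dfm, but stored less visibly in a slot of the dfm class (reachable via @).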
docvars(my_dfm, "repetition") <- df1$repetition
docvars(my_dfm)
      repetition
text1        100
text2          2
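Multiplication
Use multiplication, which scales each document's row by its repetition count:
# my_dfm here is the plain 10-feature dfm, built without the cbind() step above
my_dfm * df1$repetition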
Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
       features
docs      a grande latte with soy milk black coffee no room
  text1 100    100   100  100 100  100     0      0  0    0
  text2   0      0     0    0   0    0     2      2  2    2
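To make the trade-off concrete, here is a minimal base R sketch (not quanteda code) that compares a weighted matrix with a row-expanded one. The small matrix `m` and the counts in `reps` are made up for illustration. Both versions give the same column totals, but only the expanded one has one row per occurrence.

```r
# Hypothetical 2-document, 3-feature count matrix (illustration only)
m <- matrix(c(1, 1, 0,
              0, 0, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("text1", "text2"),
                            c("latte", "milk", "coffee")))
reps <- c(3, 2)  # hypothetical repetition counts, one per row

# Multiplication: reps is recycled down the columns,
# so row i is scaled by reps[i]
weighted <- m * reps

# Row expansion: repeat each row index reps[i] times
expanded <- m[rep(seq_len(nrow(m)), times = reps), , drop = FALSE]

print(dim(weighted))
print(dim(expanded))
print(colSums(weighted))
print(colSums(expanded))
```

If you only need feature totals, the weighted form is enough. If a later step counts documents (rows), you need the expanded form.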
Answer 1 (score: 1)
You can use indexing to get the repetitions you want while keeping the efficiency of storing each text only once.
library("tibble")
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
tib <- tibble(
text = c(
"a grande latte with soy milk",
"black coffee no room"
),
repetition = c(100, 2)
)
dfmat <- corpus(tib) %>%
dfm()
Define a function that turns your repetition variable into repeated row indices:
repindex <- function(x) rep(seq_along(x), times = x)
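As a quick check of what this helper returns, here is a small base R sketch. The vector `counts` is a made-up input, not part of the original data.

```r
repindex <- function(x) rep(seq_along(x), times = x)

counts <- c(3, 2)        # hypothetical counts: position 1 three times, position 2 twice
idx <- repindex(counts)
print(idx)               # [1] 1 1 1 2 2

# With the counts from the question, the index has 100 + 2 = 102 entries
print(length(repindex(c(100, 2))))   # [1] 102
```

Each position in the input is repeated as many times as its value, so the result can be used directly as a row index.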
Then index the two-document dfm with the repeated indices:
dfmat2 <- dfmat[repindex(tib$repetition), ]
dfmat2
## Document-feature matrix of: 102 documents, 10 features (40.4% sparse).
head(dfmat2, 2)
## Document-feature matrix of: 2 documents, 10 features (40.0% sparse).
## 2 x 10 sparse Matrix of class "dfm"
##        features
## docs    a grande latte with soy milk black coffee no room
##   text1 1      1     1    1   1    1     0      0  0    0
##   text1 1      1     1    1   1    1     0      0  0    0
tail(dfmat2, 4)
## Document-feature matrix of: 4 documents, 10 features (50.0% sparse).
## 4 x 10 sparse Matrix of class "dfm"
##        features
## docs    a grande latte with soy milk black coffee no room
##   text1 1      1     1    1   1    1     0      0  0    0
##   text1 1      1     1    1   1    1     0      0  0    0
##   text2 0      0     0    0   0    0     1      1  1    1
##   text2 0      0     0    0   0    0     1      1  1    1
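The reported sparsity of the expanded object can be checked independently of quanteda. The sketch below rebuilds the same 2 x 10 pattern as a plain sparse matrix with the Matrix package, which is an assumption made for illustration and not how quanteda builds a dfm internally. It then repeats the rows and computes the share of zero cells.

```r
library(Matrix)

feats <- c("a", "grande", "latte", "with", "soy", "milk",
           "black", "coffee", "no", "room")

# Same nonzero pattern as the two-document dfm above:
# row 1 uses features 1-6, row 2 uses features 7-10
m <- sparseMatrix(
  i = c(rep(1, 6), rep(2, 4)),
  j = 1:10,
  x = rep(1, 10),
  dims = c(2, 10),
  dimnames = list(c("text1", "text2"), feats)
)

repindex <- function(x) rep(seq_along(x), times = x)
expanded <- m[repindex(c(100, 2)), ]

# Share of zero cells: 1 - (nonzero cells / all cells)
sparsity <- 1 - nnzero(expanded) / prod(dim(expanded))
print(dim(expanded))            # 102 rows, 10 columns
print(round(100 * sparsity, 1)) # 40.4, matching the dfm summary line
```

There are 100 * 6 + 2 * 4 = 608 nonzero cells out of 1020, which leaves 412 zeros, or about 40.4%.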