如何在汇总数据上使用Quanteda?

时间:2019-02-15 15:34:26

标签: r quanteda

考虑此示例

tibble(text = c('a grande latte with soy milk',
                'black coffee no room'),
       repetition = c(100, 2)) 
# A tibble: 2 x 2
  text                         repetition
  <chr>                             <dbl>
1 a grande latte with soy milk        100
2 black coffee no room                  2

数据意味着句子a grande latte with soy milk在我的数据集中出现了100次。当然,存储该冗余是浪费内存,这就是为什么我有repetition变量。

不过,我仍然想让Quanted的dtm来反映这一点,因为dfm的稀疏性给了我保留这些信息的空间。也就是说,我如何在dfm中仍然有100行用于第一个文本?仅使用以下代码不会考虑repetition

tibble(text = c('a grande latte with soy milk',
                'black coffee no room'),
       repetition = c(100, 2)) %>% 
  corpus() %>% 
  tokens() %>% 
  dfm()
Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
       features
docs    a grande latte with soy milk black coffee no room
  text1 1      1     1    1   1    1     0      0  0    0
  text2 0      0     0    0   0    0     1      1  1    1

2 个答案:

答案 0 :(得分:2)

假设您的data.frame被称为df1,则可以使用cbind向dfm添加一列。但这可能不会给您所需的结果。下面的其他两个选项可能更好。

绑定

df1 <- tibble(text = c('a grande latte with soy milk',
                'black coffee no room'),
       repetition = c(100, 2))

my_dfm <- df1 %>%  
  corpus() %>% 
  tokens() %>% 
  dfm() %>% 
  cbind(repetition = df1$repetition) # add column to dfm with name repetition

Document-feature matrix of: 2 documents, 11 features (45.5% sparse).
2 x 11 sparse Matrix of class "dfm"
       features
docs    a grande latte with soy milk black coffee no room repetition
  text1 1      1     1    1   1    1     0      0  0    0        100
  text2 0      0     0    0   0    0     1      1  1    1          2

docvars

您还可以通过docvars函数添加数据,然后将数据添加到dfm中,但更多地隐藏在dfm类插槽中(可​​通过@到达)。

docvars(my_dfm, "repetition") <- df1$repetition
docvars(my_dfm)

      repetition
text1        100
text2          2

乘法

使用乘法:

my_dfm * df1$repetition

Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
       features
docs      a grande latte with soy milk black coffee no room
  text1 100    100   100  100 100  100     0      0  0    0
  text2   0      0     0    0   0    0     2      2  2    2

答案 1 :(得分:1)

您可以使用索引来获取所需的重复,同时保持仅包含单个文本的效率。

library("tibble")
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

tib <- tibble(
  text = c(
    "a grande latte with soy milk",
    "black coffee no room"
  ),
  repetition = c(100, 2)
)
dfmat <- corpus(tib) %>%
  dfm()

定义一个函数来重复您的“重复”变量:

repindex <- function(x) rep(seq_along(x), times = x)

然后重复两个文档dfm的索引:

dfmat2 <- dfmat[repindex(tib$repetition), ]
dfmat2
## Document-feature matrix of: 102 documents, 10 features (40.4% sparse).

head(dfmat2, 2)
## Document-feature matrix of: 2 documents, 10 features (40.0% sparse).
## 2 x 10 sparse Matrix of class "dfm"
##        features
## docs    a grande latte with soy milk black coffee no room
##   text1 1      1     1    1   1    1     0      0  0    0
##   text1 1      1     1    1   1    1     0      0  0    0
tail(dfmat2, 4)
## Document-feature matrix of: 4 documents, 10 features (50.0% sparse).
## 4 x 10 sparse Matrix of class "dfm"
##        features
## docs    a grande latte with soy milk black coffee no room
##   text1 1      1     1    1   1    1     0      0  0    0
##   text1 1      1     1    1   1    1     0      0  0    0
##   text2 0      0     0    0   0    0     1      1  1    1
##   text2 0      0     0    0   0    0     1      1  1    1