Can I use data.table without using the merge() function?

Asked: 2019-05-29 19:47:31

Tags: r merge data.table

I work with very large datasets and am trying to speed up my R code. Here is a sample of the data:

library(data.table)

dt <- data.table(id = c(100, 101, 102, 103, 104), sex = c("m", "f", "m", "m", "f"),
                 value = c(32, 14, 32, 3, 3))

The data looks like this:
    id sex value
1: 100   m    32
2: 101   f    14
3: 102   m    32
4: 103   m     3
5: 104   f     3 

The final output I want:

   value f.value m.value  f  m
1:     3       1       1  1  1
2:    14       1      NA  1 NA
3:    32      NA       2 NA  2

The code I currently use:

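# Distinct (id, sex, value) combinations, counted per value and sex
dt_u <- unique(dt, by = c("id", "sex", "value"))
dt_u <- dt_u[, .(n = .N), keyby = .(value, sex)]
dt_u <- dcast(dt_u, value ~ sex, value.var = "n")
# All rows, counted per value and sex
dt_t <- dt[, .(n = .N), keyby = .(value, sex)]
dt_t <- dcast(dt_t, value ~ sex, value.var = "n")
# Full outer join of the two wide tables on value
dt <- merge(dt_t, dt_u, by = "value", all = TRUE)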

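The code works fine. The problem is that merging dt_u and dt_t can take a long time on 10 GB+ of data. So my question is: is it possible to get the same final output without having to "split" the data and then merge it?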

If possible, I would also like the answer to use data.table. Thanks.

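EDIT: Example and clarification. An ID represents a person, and that person can travel to the same place (value) multiple times. For this example, you can think of each value as a different city.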
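As a minimal illustration of that distinction, here is a sketch on toy data (hypothetical, not taken from the real dataset; the column names visits and people are made up for this illustration). One person visiting the same city twice counts as two visits but only one person:

```r
library(data.table)

# Hypothetical toy data: person 1 visits city 21 twice, person 2 visits it once
toy <- data.table(id = c(1, 1, 2), sex = c("f", "f", "f"), value = c(21, 21, 21))

# visits = number of rows in the group; people = number of distinct ids
toy_counts <- toy[, .(visits = .N, people = uniqueN(id)), by = .(value, sex)]
print(toy_counts)
```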

Example data:

dt <- data.table(
  value = c(21, 21, 21, 21, 21, 40, 1, 22, 1, 1, 22, 22, 49, 49, 21, 21, 1, 1, 1),
  id = c(1000716624, 1000722724, 1000716624, 1000746824, 1001012024,
         1002067324, 1002743624, 1002743645, 1002743636, 1002743423,
         1000716624, 1000722724, 1000722724, 1001012024, 1000716624,
         1000716624, 1002743624, 1002743624, 1002743624),
  sex = c("f", "m", "m", "m", "f", "f", "m", "f", "f", "m",
          "f", "m", "m", "f", "f", "f", "m", "m", "m")
)

Output:

   value places_women places_men number_women number_men
1:     1            1          5            1          2
2:    21            4          3            2          3
3:    22            2          1            2          1
4:    40            1         NA            1         NA
5:    49            1          1            1          1

2 Answers:

Answer 0 (score: 3)

This works for the second example (based on reverse-engineering the desired output):

> dcast(dt, value ~ sex, value.var=list("value", "id"), fun=list(length, uniqueN), fill=NA)
   value value.1_length_f value.1_length_m id_uniqueN_f id_uniqueN_m
1:     1                1                5            1            2
2:    21                4                3            2            3
3:    22                2                1            2            1
4:    40                1               NA            1           NA
5:    49                1                1            1            1

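If this doesn't solve the whole problem, please state more explicitly which calculation should go in each column (perhaps using more natural column names in the example).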
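If the column names from the question matter, one possible variant is to aggregate once in long format and then reshape with a single dcast() over two value.var columns, renaming at the end. This is only a sketch: it assumes places_* means the row count and number_* means the number of distinct ids, as inferred from the expected output, and the intermediate names places and number are illustrative. Whether it is faster than the call above on 10 GB+ of data would need benchmarking.

```r
library(data.table)

dt <- data.table(
  value = c(21, 21, 21, 21, 21, 40, 1, 22, 1, 1, 22, 22, 49, 49, 21, 21, 1, 1, 1),
  id = c(1000716624, 1000722724, 1000716624, 1000746824, 1001012024,
         1002067324, 1002743624, 1002743645, 1002743636, 1002743423,
         1000716624, 1000722724, 1000722724, 1001012024, 1000716624,
         1000716624, 1002743624, 1002743624, 1002743624),
  sex = c("f", "m", "m", "m", "f", "f", "m", "f", "f", "m",
          "f", "m", "m", "f", "f", "f", "m", "m", "m")
)

# One grouped pass: row count and distinct-id count per (value, sex)
long <- dt[, .(places = .N, number = uniqueN(id)), keyby = .(value, sex)]

# One reshape: each value.var is spread across the levels of sex;
# (value, sex) combinations that do not occur are left as NA
wide <- dcast(long, value ~ sex, value.var = c("places", "number"))

# Rename to the column names used in the question (assumed mapping)
setnames(wide,
         old = c("places_f", "places_m", "number_f", "number_m"),
         new = c("places_women", "places_men", "number_women", "number_men"))
print(wide)
```

No merge() is involved: both statistics come from the same grouped aggregation, so the data is never split into two tables.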

Answer 1 (score: 1)

library(data.table)

dt <- data.table(id = c(100,101,102,103, 104), sex = c("m","f","m","m","f"), 
                 value = c(32,14,32,03,03))

dcast(
  unique(
    unique(dt, by = c("id", "sex", "value"))[
      , count := .N, by = list(value, sex)][
      , id := NULL]
  ),
  value ~ sex, value.var = "count"
)

#>    value  f  m
#> 1:     3  1  1
#> 2:    14  1 NA
#> 3:    32 NA  2

Created on 2019-05-29 by the reprex package (v0.3.0)
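Note that the output above contains only the distinct counts, while the question's target table also has the total counts. A hedged sketch of how both might be produced in one dcast() call on the first example, checked against the merge()-based code from the question, is shown below. The names merged and one_pass are illustrative, and the comparison is done by column position rather than by name because the two approaches label their columns differently.

```r
library(data.table)

dt <- data.table(id = c(100, 101, 102, 103, 104), sex = c("m", "f", "m", "m", "f"),
                 value = c(32, 14, 32, 3, 3))

# Reference result: the merge()-based approach from the question
dt_u <- unique(dt, by = c("id", "sex", "value"))[, .(n = .N), keyby = .(value, sex)]
dt_u <- dcast(dt_u, value ~ sex, value.var = "n")
dt_t <- dcast(dt[, .(n = .N), keyby = .(value, sex)], value ~ sex, value.var = "n")
merged <- merge(dt_t, dt_u, by = "value", all = TRUE)

# Single call: length() gives the total rows, uniqueN() the distinct ids,
# both computed per (value, sex); fill = NA marks combinations that do not occur
one_pass <- dcast(dt, value ~ sex, value.var = "id",
                  fun.aggregate = list(length, uniqueN), fill = NA)

# Compare the count columns by position, ignoring column names
same <- all.equal(unname(as.list(one_pass[, !"value"])),
                  unname(as.list(merged[, !"value"])))
print(same)
```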