I work with very large datasets and am trying to speed up my R code. Here is a sample of the data:
dt <- data.table(id = c(100,101,102,103, 104), sex = c("m","f","m","m","f"),
value = c(32,14,32,03,03))
The data look like this:
    id sex value
1: 100   m    32
2: 101   f    14
3: 102   m    32
4: 103   m     3
5: 104   f     3
The final output I want:
   value f.value m.value  f  m
1:     3       1       1  1  1
2:    14       1      NA  1 NA
3:    32      NA       2 NA  2
The code I currently use:
dt_u <- unique(dt, by = c("id", "sex", "value"))
dt_u <- dt_u[, .(n = .N), keyby = .(value, sex)]
dt_u <- dcast(dt_u, value ~ sex, value.var = "n")
dt_t <- dt[, .(n = .N), keyby = .(value, sex)]
dt_t <- dcast(dt_t, value ~ sex, value.var = "n")
dt <- merge(dt_t, dt_u, by = "value", all = TRUE)
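The code works fine. The problem is that merging dt_u and dt_t can take a long time on 10 GB+ of data. So my question is: is it possible to get the same final output without having to "split" the data and then merge it?
If possible, I would also like the answer to use data.table. Thanks.
EDIT: example and clarification. An ID represents a person, and that person can visit the same location (value) more than once. For this example, you can think of each value as a different city.
For example: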
dt <- data.table(value = c(21,21,21,21,21,40,1,22,1,1,22, 22, 49,
49,21,21,1,1,1), id =
c(1000716624,1000722724,1000716624,1000746824,1001012024,
1002067324,1002743624,1002743645, 1002743636,
1002743423,1000716624,1000722724, 1000722724,1001012024,
1000716624,1000716624,1002743624,1002743624,1002743624), sex = c("f", "m",
"m", "m", "f", "f", "m", "f", "f", "m", "f", "m", "m", "f","f","f", "m",
"m", "m"))
Output:
   value places_women places_men number_women number_men
1:     1            1          5            1          2
2:    21            4          3            2          3
3:    22            2          1            2          1
4:    40            1         NA            1         NA
5:    49            1          1            1          1
Answer 0 (score: 3)
This works for the second example (based on reverse-engineering the desired output):
> dcast(dt, value ~ sex, value.var=list("value", "id"), fun=list(length, uniqueN), fill=NA)
   value value.1_length_f value.1_length_m id_uniqueN_f id_uniqueN_m
1:     1                1                5            1            2
2:    21                4                3            2            3
3:    22                2                1            2            1
4:    40                1               NA            1           NA
5:    49                1                1            1            1
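If this does not cover the whole problem, please state more explicitly which calculation should go into each column (perhaps using more natural column names in the example).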
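For comparison, here is a minimal sketch of another way to avoid the merge, assuming the two quantities wanted per (value, sex) are the number of rows (visits) and the number of distinct ids (people), as in the second example. It aggregates once and then reshapes both count columns in a single dcast call. The column names places and number are chosen here for illustration and are not from the question's code; whether this is faster than the approach above on 10 GB+ of data has not been measured.

```r
library(data.table)

# Second example from the question
dt <- data.table(
  value = c(21, 21, 21, 21, 21, 40, 1, 22, 1, 1, 22, 22, 49,
            49, 21, 21, 1, 1, 1),
  id = c(1000716624, 1000722724, 1000716624, 1000746824, 1001012024,
         1002067324, 1002743624, 1002743645, 1002743636,
         1002743423, 1000716624, 1000722724, 1000722724, 1001012024,
         1000716624, 1000716624, 1002743624, 1002743624, 1002743624),
  sex = c("f", "m", "m", "m", "f", "f", "m", "f", "f", "m", "f", "m",
          "m", "f", "f", "f", "m", "m", "m")
)

# Single grouped pass: row count and distinct-id count per (value, sex).
# "places" and "number" are illustrative names, not from the original post.
agg <- dt[, .(places = .N, number = uniqueN(id)), keyby = .(value, sex)]

# Reshape both count columns at once; combinations that never occur stay NA.
res <- dcast(agg, value ~ sex, value.var = c("places", "number"))
res
```

The result has the columns value, places_f, places_m, number_f and number_m, which can be renamed with setnames() if the names from the question are preferred.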
Answer 1 (score: 1)
library(data.table)
dt <- data.table(id = c(100,101,102,103, 104), sex = c("m","f","m","m","f"),
value = c(32,14,32,03,03))
dcast(unique(unique(dt,
by = c("id", "sex", "value"))[ ,
count := .N, by = list(value,sex)][,
id:=NULL]),
value ~ sex, value.var = "count")
#>    value  f  m
#> 1:     3  1  1
#> 2:    14  1 NA
#> 3:    32 NA  2
Created on 2019-05-29 by the reprex package (v0.3.0)