data.table: aggregate by group and keep the corresponding values of other columns

Time: 2016-11-22 18:52:29

Tags: r data.table aggregate

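I want to aggregate the values of a data.table in R by a grouping variable AND by several functions, while keeping the information from the other columns (those not included in the aggregation) of the corresponding row (= the same row the aggregate comes from). An example: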

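Note: The code uses this which.quantile() function (with sort(x) instead of order(x) in its code, so it returns the value itself rather than its position). It finds an actual value in the data set that is close to the defined quantile.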
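The linked helper is not reproduced in this post, so here is a minimal sketch of what a value-returning variant could look like. The nearest-rank rule used here (take the ceiling(n * prob)-th smallest value) is an assumption, not necessarily the rule of the original function; it does reproduce the quantile70 values shown further down for this sample data.

```r
# Hypothetical stand-in for the which.quantile() helper mentioned above.
# Assumption: nearest-rank rule; `prob` is a single number in [0, 1].
# sort() drops NAs, so missing values are silently ignored here.
which.quantile <- function(x, prob) {
  n <- length(x)
  j <- max(1L, ceiling(n * prob))  # rank of the element to pick
  sort(x)[j]                       # an actual value from x, not an interpolation
}

ak <- which.quantile(c(82L, 104L, 37L, 24L), 0.7)   # Employees in AK
ri <- which.quantile(c(19L, 118L, 88L, 42L), 0.7)   # Employees in RI
print(c(AK = ak, RI = ri))
```

Replacing sort(x) with order(x) in the last line of the function would return the position of that value instead, which is what row-index based approaches need.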

# sample data
dt <- structure(list(State = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
.Label = c("AK", "RI"), class = "factor"), Company = structure(1:8, .Label = c("A", 
"B", "C", "D", "E", "F", "G", "H"), class = "factor"), Employees = c(82L, 
104L, 37L, 24L, 19L, 118L, 88L, 42L), Number=c(1L,2L,3L,4L,5L,6L,7L,8L), Number2=c(9,10,11,12,13,14,15,16)),
.Names = c("State", "Company", "Employees", "Number", "Number2"), class = "data.frame", row.names = c(NA, 8L))

require(data.table)
setDT(dt)

# aggregation
agg <- dt[ , .(max = max(Employees),
               min = min(Employees),
               quantile70 = which.quantile(Employees, 0.7)), by=State]
agg_m <- dt[agg, on="State"]

Aggregating the DT gives the following output:

     State  max   min   quantile70
1:    AK    104   24    82
2:    RI    118   19    88

Merging the aggregate with the original DT gives:

     State Company Employees Number Number2    max      min quantile70
1:    AK       A        82      1       9      104       24     82
2:    AK       B       104      2      10      104       24     82
3:    AK       C        37      3      11      104       24     82
4:    AK       D        24      4      12      104       24     82
5:    RI       E        19      5      13      118       19     88
6:    RI       F       118      6      14      118       19     88
7:    RI       G        88      7      15      118       19     88
8:    RI       H        42      8      16      118       19     88

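Question: How can I aggregate the data.table while keeping the corresponding values in the Company, Number and Number2 columns? The maximum of Employees in State AK is 104, and the corresponding value in Number2 is 10. The minimum is 24, with a corresponding value of 12, and so on. How can I keep this information when aggregating a data.table?

Desired output: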

    State Company Employees Number Number2 aggregation
1:    AK       A        82      1       9      quantile70
2:    AK       B       104      2      10      max
3:    AK       D        24      4      12      min
4:    RI       E        19      5      13      min
5:    RI       F       118      6      14      max
6:    RI       G        88      7      15      quantile70

The question is similar to this one. The sample data was also taken from there and adapted.

The following aggregations do not solve my problem:

dt[ ,.SD[ which.max(Employees) ], by=State]
dt[dt[ ,.I[ which.max(Employees) ], by=State ]$V1]
# only which.max() OR which.min() are possible

dt[ , max_Empl := max(Employees), by=State ]
# only ONE aggregation function at a time is possible

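1 Answer:

Answer 0: (score: 0)

Following @eddi's canonical answer on subsetting by group ... Note that indexing .I this way requires which.quantile() to return a position within the group (as which.max() does), not a value.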

aggi <- dt[ , .(max = .I[which.max(Employees)],
               min = .I[which.min(Employees)],
               quantile70 = .I[which.quantile(Employees, 0.7)]), by=State]
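To make the role of .I concrete, here is a small sketch on a reduced, illustrative table (four rows taken from the sample data; the names d and idx are made up for this example). Within each group, .I holds the row numbers of that group in the full table, so .I[which.max(...)] turns a within-group position into a row number that can be used to subset the original table.

```r
library(data.table)

# Reduced illustrative table: two companies per state
d <- data.table(State     = c("AK", "AK", "RI", "RI"),
                Employees = c(82L, 104L, 19L, 118L))

# which.max()/which.min() give positions inside the group;
# indexing .I with them yields row numbers in the whole table
idx <- d[, .(max = .I[which.max(Employees)],
             min = .I[which.min(Employees)]), by = State]
print(idx)

# The row numbers select complete rows, so every other column comes along
print(d[idx$max])
```

This only works with functions that return a position, which is why a value-returning quantile helper cannot be dropped in unchanged.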

From aggi, you can do:

maggi <- melt(aggi, id="State")

dt[maggi$value][, v := maggi$variable][]

   State Company Employees Number Number2          v
1:    AK       B       104      2      10        max
2:    RI       F       118      6      14        max
3:    AK       D        24      4      12        min
4:    RI       E        19      5      13        min
5:    AK       A        82      1       9 quantile70
6:    RI       G        88      7      15 quantile70
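For reference, the whole approach can be put together as one runnable sketch. The helper which_quantile_pos() is hypothetical and not from the original post: it assumes a nearest-rank rule and returns a position via order(x), which is what indexing .I needs. The final setorder() call and the column names aggregation and row are additions for readability, chosen to match the layout of the desired output in the question.

```r
library(data.table)

# Sample data from the question, rebuilt directly as a data.table
dt <- data.table(
  State     = rep(c("AK", "RI"), each = 4),
  Company   = LETTERS[1:8],
  Employees = c(82L, 104L, 37L, 24L, 19L, 118L, 88L, 42L),
  Number    = 1:8,
  Number2   = as.numeric(9:16)
)

# Hypothetical helper (assumption: nearest-rank rule): position in x
# of the ceiling(n * prob)-th smallest value
which_quantile_pos <- function(x, prob) {
  order(x)[max(1L, ceiling(length(x) * prob))]
}

# One row number per group and aggregation function
aggi <- dt[, .(max        = .I[which.max(Employees)],
               min        = .I[which.min(Employees)],
               quantile70 = .I[which_quantile_pos(Employees, 0.7)]),
           by = State]

# Long format: one line per (State, aggregation) with its row number
maggi <- melt(aggi, id.vars = "State",
              variable.name = "aggregation", value.name = "row")

# Pull the full rows and label them with the aggregation they satisfy
res <- dt[maggi$row][, aggregation := maggi$aggregation]
setorder(res, State, Number)
print(res)
```

If one row satisfies several aggregations (for example in a group with a single row), it appears once per aggregation in the result.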