Aggregating with data.table in R

Date: 2013-03-05 19:06:47

Tags: r data.table

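The exercise consists of aggregating a numeric vector of values by a combination of factors, using data.table in R. Take the following data table as an example: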

require (data.table)
require (plyr)
dtb <- data.table (cbind (expand.grid (month = rep (month.abb[1:3], each = 3),
                                       fac = letters[1:3]),
                          value = rnorm (27)))

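Notice that each unique combination of 'month' and 'fac' shows up three times. So, when I average the values by these two factors, I should expect a data frame with 9 unique rows: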
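A quick way to check that claim is to count the rows per group. This is only a sketch: it rebuilds the sample data with a `set.seed()` call added for reproducibility (not part of the original question) and relies on data.table's `.N` symbol, which holds the number of rows in each group.

```r
library(data.table)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
dtb <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))

# .N is the number of rows in each (month, fac) group
counts <- dtb[, .N, by = list(month, fac)]
print(counts)
```

Every one of the 9 groups should report N = 3.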

(agg1 <- ddply (dtb, c ("month", "fac"), function (dfr) mean (dfr$value)))
  month fac          V1
1   Jan   a -0.36030953
2   Jan   b -0.58444588
3   Jan   c -0.15472876
4   Feb   a -0.05674483
5   Feb   b  0.26415972
6   Feb   c -1.62346772
7   Mar   a  0.24560510
8   Mar   b  0.82548140
9   Mar   c  0.18721114

However, when aggregating with data.table, I keep getting the result repeated for every redundant combination of the two factors:

(agg2 <- dtb[, value := mean (value), by = list (month, fac)])
    month fac       value
 1:   Jan   a -0.36030953
 2:   Jan   a -0.36030953
 3:   Jan   a -0.36030953
 4:   Feb   a -0.05674483
 5:   Feb   a -0.05674483
 6:   Feb   a -0.05674483
 7:   Mar   a  0.24560510
 8:   Mar   a  0.24560510
 9:   Mar   a  0.24560510
10:   Jan   b -0.58444588
11:   Jan   b -0.58444588
12:   Jan   b -0.58444588
13:   Feb   b  0.26415972
14:   Feb   b  0.26415972
15:   Feb   b  0.26415972
16:   Mar   b  0.82548140
17:   Mar   b  0.82548140
18:   Mar   b  0.82548140
19:   Jan   c -0.15472876
20:   Jan   c -0.15472876
21:   Jan   c -0.15472876
22:   Feb   c -1.62346772
23:   Feb   c -1.62346772
24:   Feb   c -1.62346772
25:   Mar   c  0.18721114
26:   Mar   c  0.18721114
27:   Mar   c  0.18721114
    month fac       value

Is there an elegant way to collapse these results into one row per unique combination of factors with data.table?

2 answers:

Answer 0 (score: 9)

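The problem (and the reasoning behind it) has to do with the fact that the aggregated value is being assigned, not just computed.

This is easier to observe if you look at a data.table that has more columns than those used for the calculation.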

# Therefore, let's add a new column
# (LETTERS has only 26 entries, so the 27th row gets NA)
dtb[, newCol := LETTERS[seq(length(value))]]

Notice that if we only want to output the computed values, the expression on the RHS is just fine.

# This gives the expected results
dtb[, mean (value), by = list (month, fac)]

# This on the other hand assigns the respective values to *each* row
dtb[, value := mean (value), by = list (month, fac)]

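In other words, when the value is only computed, the data is grouped and one value per group is returned. However, if you want to save this value back into the SAME data.table (which is what happens when you use the := operator), then a value is assigned to every row identified in i (by default, all rows). (This makes sense when you look at the output with the additional column.)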

Copying this data.table to agg2 then still carries over all of the rows.
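The difference can be seen by comparing row counts. The following is a minimal sketch on the question's sample data; the `set.seed()` call and the names `computed` and `assigned` are added here for illustration only.

```r
library(data.table)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
dtb <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))

# Computing in j without := returns a new table with one row per group
computed <- dtb[, list(value = mean(value)), by = list(month, fac)]
nrow(computed)   # 9
nrow(dtb)        # 27, dtb itself is unchanged so far

# := updates dtb in place and returns the updated table, all rows included
assigned <- dtb[, value := mean(value), by = list(month, fac)]
nrow(assigned)   # 27
```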

Therefore, if you want to copy only the unique rows of the original table to a new table, you can either

a.  wrap the original table inside `unique()` before assigning it
b.  assign the table that is returned when `j` only computes the value,
    without `:=` (which is what @Arun suggested)

An example of a. is:

 agg2 <- unique(dtb[, value := mean (value), by = list (month, fac)])
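One caveat with option a., assuming a recent data.table version in which `unique()` compares all columns by default: if the table carries an extra column that varies within a group, the duplicates are no longer exact and nothing collapses. The sketch below uses a hypothetical per-row column `id` (not from the question) and restricts the comparison to the grouping columns through the `by` argument of `unique()`.

```r
library(data.table)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
dtb <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))
dtb[, id := seq_len(.N)]   # hypothetical extra column, different in every row

dtb[, value := mean(value), by = list(month, fac)]

# All columns are compared, and id differs in every row, so nothing collapses
nrow(unique(dtb))

# Compare only the grouping columns; the first row of each group is kept
agg <- unique(dtb, by = c("month", "fac"))
nrow(agg)
```

Note that `agg` still contains the `id` column, holding whichever value belonged to the first row of each group, so option b. is usually the cleaner route when extra columns are present.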

The following examples may help to illustrate.

(You will need to copy and paste this, since the output is omitted.)

  # SAMPLE DATA, as above
  library(data.table)
  dtb.bak <- data.table (expand.grid (month = rep (month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm (27))

  #  METHOD 1  # 
  #------------#
  dtb <- copy(dtb.bak)  # restore, from sample data.


  dtb[, value := mean (value), by = list (month, fac)]
  dtb

  # this is what you would like to assign
  unique(dtb)


  #  METHOD 2  # 
  #------------#
  dtb <- copy(dtb.bak)  # restore, from sample data.

  # this is what you would like to assign
  # next two lines are the same, only difference is the column name
  dtb[, mean (value), by = list (month, fac)]
  dtb[, list("mean" = mean (value)), by = list (month, fac)]  # quote marks added for clarity

  # dtb is unchanged. 
  dtb



  # NOW COMPARE THE SAME TWO METHODS, BUT WITH AN ADDITIONAL COLUMN
  dtb.bak[, newCol := rep(c("A", "B", "A"), length(value)/3)]


  dtb1 <- copy(dtb.bak)  # restore, from sample data.
  dtb2 <- copy(dtb.bak)  # restore, from sample data.


  # Method 1
  dtb1[, value := mean (value), by = list (month, fac)]
  dtb1
  unique(dtb1)

  #  METHOD 2  # 
  dtb2[, list("mean" = mean (value)), by = list (month, fac)]  # quote marks added for clarity
  dtb2

  # METHOD 2, WITH ADDED COLUMNS IN list() in `j`
  dtb2[, list("mean" = mean (value), newCol), by = list (month, fac)]  # quote marks added for clarity
  # notice this has more rows (one per original row) than
  unique(dtb1)

Answer 1 (score: 5)

You should be doing:

agg2 <- dtb[, list(value = mean(value)), by = list (month, fac)]

:= recycles the value of the RHS to fit the number of elements in the LHS. See ?':=' to read more about it.
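The recycling is easy to see on a small table. This is a toy sketch with made-up data (the table `dt` and its columns `g`, `x`, and `grp_mean` are not from the question):

```r
library(data.table)

# Toy data: two groups of different sizes
dt <- data.table(g = c("a", "a", "b", "b", "b"),
                 x = c(1, 3, 10, 20, 30))

# Each group's mean is a single number; := recycles it across every row of that group
dt[, grp_mean := mean(x), by = g]
print(dt)
```

The table keeps its five rows; `grp_mean` is 2 for both rows of group "a" and 20 for all three rows of group "b".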