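How do I use the apply functions across multiple columns for multiple subsets of data?

Asked: 2015-11-20 16:44:05

Tags: r subset apply

I have a data frame (example below) with 943 columns and 500 rows.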

df <-data.frame(Rep=c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), Depth=c("D", "D", "D", "M", "M", "M", "D", "D", "D", "M", "M", "D", "D"), T0= c(-165,-163,-160,-161,-270,165,-163,-160,-161,-270,-181,-231, -230), T0.01= c(458,459,457,342,158,458,459,457,342,158,324,333,320), T0.02=c(-151,-153,-131,-125,-130,-151,-153,-131,-125,-130,-120, -130,-120)) 

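I need the column medians for columns 7:943 of my dataset (all of the columns with numeric data; their names also all begin with "T", such as T0, T0.01, and so on). However, I only need the column medians for specific subsets of rows. The subsets are defined by "Rep" and "Depth". For example, I need one vector of column medians for "Rep 1 at Depth D", then another vector of column medians for "Rep 1 at Depth M". In total I have 24 Reps and 3 Depths, and I need the median vectors for all combinations, which gives 3 x 24 = 72 vectors. The result would be a table structured like this (a transposed version would also be fine):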

 df <-data.frame(Rep=c(1, 1, 1, 2, 2, 2), Depth=c("D", "M", "S", "D", "M", "S"), T0= c(-163,-160,-161,-270,165, 165), T0.01= c(458,459,457,342,158,458), T0.02=c(-151,-153,-131,-125,-130,-151))

   Rep Depth   T0 T0.01 T0.02
   1     D -163   458  -151
   1     M -160   459  -153
   1     S -161   457  -131
   2     D -270   342  -125
   2     M  165   158  -130
   2     S  165   458  -151

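In addition, I need to calculate the variance of all cells in columns 7:943 (the "T" columns) for these same subsets of data. That would give a single number (rather than a vector) for each subset.

I have tried all sorts of combinations of the subset, tapply, and grepl functions, but I can't seem to get them to do what I want. Thanks.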

1 Answer:

Answer 0: (score: 0)

Using the data you provided:

library(dplyr) 

df %>% 
  group_by(Rep, Depth) %>%
  summarise_each(funs(median, var))

    Rep  Depth T0_median T0.01_median T0.02_median      T0_var T0.01_var T0.02_var
  (dbl) (fctr)     (dbl)        (dbl)        (dbl)       (dbl)     (dbl)     (dbl)
1     1      D    -163.0        458.0       -151.0    6.333333     1.000  148.0000
2     1      M    -215.5        250.0       -127.5 5940.500000 16928.000   12.5000
3     2      D    -161.0        457.0       -131.0    2.333333  4486.333  217.3333
4     2      M     165.0        458.0       -151.0          NA        NA        NA
5     3      D    -230.5        326.5       -125.0    0.500000    84.500   50.0000
6     3      M    -225.5        241.0       -125.0 3960.500000 13778.000   50.0000
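
As a side note, summarise_each() and funs() are deprecated in newer dplyr releases. The following is a minimal sketch of the same grouped summary, assuming dplyr 1.0.0 or later (where across() is available) and assuming the data columns can be picked out by their "T" prefix. The name res is only illustrative.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Rep   = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Depth = c("D", "D", "D", "M", "M", "M", "D", "D", "D", "M", "M", "D", "D"),
  T0    = c(-165, -163, -160, -161, -270, 165, -163, -160, -161, -270, -181, -231, -230),
  T0.01 = c(458, 459, 457, 342, 158, 458, 459, 457, 342, 158, 324, 333, 320),
  T0.02 = c(-151, -153, -131, -125, -130, -151, -153, -131, -125, -130, -120, -130, -120)
)

# One row per Rep/Depth combination; each "T" column gets a
# <column>_median and a <column>_var summary column
res <- df %>%
  group_by(Rep, Depth) %>%
  summarise(
    across(starts_with("T"), list(median = median, var = var)),
    .groups = "drop"
  )

print(res)
```

Selecting with starts_with("T") rather than summarising every non-grouping column should matter on the full data set, where columns 1:6 are not all data columns.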

Or, if you want the grouping to be more descriptive:

df %>% 
  mutate(group=paste("Rep",Rep,"at Depth", Depth)) %>%
  group_by(group) %>%
  summarise_each(funs(median, var), matches("^T"))

             group T0_median T0.01_median T0.02_median      T0_var T0.01_var T0.02_var
             (chr)     (dbl)        (dbl)        (dbl)       (dbl)     (dbl)     (dbl)
1 Rep 1 at Depth D    -163.0        458.0       -151.0    6.333333     1.000  148.0000
2 Rep 1 at Depth M    -215.5        250.0       -127.5 5940.500000 16928.000   12.5000
3 Rep 2 at Depth D    -161.0        457.0       -131.0    2.333333  4486.333  217.3333
4 Rep 2 at Depth M     165.0        458.0       -151.0          NA        NA        NA
5 Rep 3 at Depth D    -230.5        326.5       -125.0    0.500000    84.500   50.0000
6 Rep 3 at Depth M    -225.5        241.0       -125.0 3960.500000 13778.000   50.0000
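
For comparison, the grouped medians can also be sketched in base R without dplyr. This assumes, as the question states, that every data column name begins with "T". The names t_cols and med are illustrative only.

```r
# Sample data from the question
df <- data.frame(
  Rep   = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Depth = c("D", "D", "D", "M", "M", "M", "D", "D", "D", "M", "M", "D", "D"),
  T0    = c(-165, -163, -160, -161, -270, 165, -163, -160, -161, -270, -181, -231, -230),
  T0.01 = c(458, 459, 457, 342, 158, 458, 459, 457, 342, 158, 324, 333, 320),
  T0.02 = c(-151, -153, -131, -125, -130, -151, -153, -131, -125, -130, -120, -130, -120)
)

# Names of the data columns (those starting with "T")
t_cols <- grep("^T", names(df), value = TRUE)

# Median of every "T" column within each Rep/Depth combination
med <- aggregate(
  df[t_cols],
  by  = list(Rep = df$Rep, Depth = df$Depth),
  FUN = median
)

print(med)
```

The result has one row per Rep/Depth combination and one column per "T" column, which is the table shape asked for in the question.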

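UPDATE: So, for the group variance over all of the data columns, is this what you mean? (The do statement may be more complicated than it needs to be.)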

df %>% 
  mutate(group=paste("Rep",Rep,"at Depth", Depth)) %>%
  select(-Rep, -Depth) %>%
  group_by(group) %>%
  do(data.frame(variance=var(unlist(.[,sapply(., is.numeric)]))))

             group variance
             (chr)    (dbl)
1 Rep 1 at Depth D 93682.36
2 Rep 1 at Depth M 53501.60
3 Rep 2 at Depth D 81997.03
4 Rep 2 at Depth M 92764.33
5 Rep 3 at Depth D 70057.87
6 Rep 3 at Depth M 51781.50
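
A possibly simpler route to the same per-group variance, sketched in base R: split the "T" columns by group, flatten each piece into one vector, and take its variance. This again assumes the data columns are the ones whose names start with "T". The names t_cols, groups, and cell_var are illustrative.

```r
# Sample data from the question
df <- data.frame(
  Rep   = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Depth = c("D", "D", "D", "M", "M", "M", "D", "D", "D", "M", "M", "D", "D"),
  T0    = c(-165, -163, -160, -161, -270, 165, -163, -160, -161, -270, -181, -231, -230),
  T0.01 = c(458, 459, 457, 342, 158, 458, 459, 457, 342, 158, 324, 333, 320),
  T0.02 = c(-151, -153, -131, -125, -130, -151, -153, -131, -125, -130, -120, -130, -120)
)

# Names of the data columns (those starting with "T")
t_cols <- grep("^T", names(df), value = TRUE)

# One data frame of "T" columns per Rep/Depth combination;
# drop = TRUE discards combinations that never occur
groups <- split(df[t_cols], list(df$Rep, df$Depth), drop = TRUE)

# Flatten each group's cells into a single vector and take its variance
cell_var <- sapply(groups, function(g) var(unlist(g)))

print(cell_var)
```

The result is a named numeric vector with one variance per group, where names such as "1.D" stand for Rep 1 at Depth D.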