R - How do I compute group means for a list of data frames, using a different subset condition for each mean?

Time: 2015-03-22 22:12:53

Tags: r list aggregate lapply cbind

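I have a list of three data frames and want to produce another list of three data frames, each with one row per value of a grouping variable (g1) and the means of six variables by g1. The twist is that I only want to compute the mean of each of the three continuous variables over the rows where the corresponding dummy variable equals 1.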

Reproducible example:

a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))

u[[1]]

> u
[[1]]
   g1 dummy1 dummy2 dummy3  contin1 contin2 contin3
1  fj      1      0      0       199      18      61
2  fj      1      0      0        91     158      28
3  fj      1      1      0       147      67     190
4   a      1      0      1       181     105      22
5  fj      0      1      0        14      16     156
6   a      0      0      0       178      14      98
7   g      0      0      1       116      97      30
8   g      1      1      1        48      31     144
9   g      0      0      0        60      21     112
10  g      0      1      0        95     145     199

I want to compute the mean of contin1 only where dummy1 = 1, the mean of contin2 only where dummy2 = 1, and the mean of contin3 only where dummy3 = 1.

The output I would like for the first list element:

> rates
[[1]]
  x[, 1]   V1  V2  V3 x[, 1] x[, 6] x[, 1] x[, 7] x[, 1] x[, 8]
1      a 0.50 0.0 0.5      a 181         a  NA         a  22
2     fj 0.75 0.5 0.0     fj 145.67     fj  41.5      fj  NA
3      g 0.25 0.5 0.5      g  48         g  88         g  87

I tried:

rates <- lapply(u, function(x) {
    cbind(aggregate(cbind(x[,2],x[,3],x[,4]) ~ x[,1], FUN = mean, na.action = NULL),
    aggregate(x[,6] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,2] == 1)),
    aggregate(x[,7] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,3] == 1)),
    aggregate(x[,8] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,4] == 1)))
    })
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 3, 2

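I know this error comes from cbind, because cbind fails whenever you try to bind objects with different numbers of rows. (The result for x[,6] has three rows, while the results for x[,7] and x[,8] have two rows.) I was hoping aggregate had some way to keep a row for every value of the grouping variable, which would give me the same number of rows everywhere so that cbind would work. Based on the R documentation, that may not be possible?: "Rows with missing values in any of the by variables will be omitted from the result."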
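The row mismatch can be reproduced in a small sketch, together with one possible way around it: merging on the group key with all = TRUE instead of using cbind. The toy data frame copies five columns of u[[1]] as printed above; the names x, m1, m2 and both are introduced only for illustration, and this is a sketch under those assumptions rather than a complete solution.

```r
# Toy data copied from the printed u[[1]]: group 'a' has no row with dummy2 == 1.
x <- data.frame(
  g1      = c("fj", "fj", "fj", "a", "fj", "a", "g", "g", "g", "g"),
  dummy1  = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 0),
  dummy2  = c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1),
  contin1 = c(199, 91, 147, 181, 14, 178, 116, 48, 60, 95),
  contin2 = c(18, 158, 67, 105, 16, 14, 97, 31, 21, 145)
)

# Each aggregate() only returns the groups that survive its own subset.
m1 <- aggregate(contin1 ~ g1, data = x, FUN = mean, subset = dummy1 == 1)
m2 <- aggregate(contin2 ~ g1, data = x, FUN = mean, subset = dummy2 == 1)
nrow(m1)  # 3 groups
nrow(m2)  # 2 groups, so cbind(m1, m2) would fail

# merge() aligns on g1; all = TRUE keeps every group and fills the gaps with NA.
both <- merge(m1, m2, by = "g1", all = TRUE)
both
```

With more than two partial results, the same idea extends to Reduce(function(l, r) merge(l, r, by = "g1", all = TRUE), list_of_results).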

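I have carefully read the documentation for aggregate. The following two posts address similar problems, but neither uses different subsets of the data to compute the means.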

R: Calculate means for subset of a group
Means from a list of data frames in R

Any suggestions would be greatly appreciated.

2 answers:

Answer 0: (score: 1)

If you have dplyr installed, the following code appears to solve your problem.

library(dplyr)

set.seed(1234)

a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))


rates <- lapply(u, function(x)
  x %>% 
    mutate( contin1_ = ifelse(dummy1==1, contin1, NA) ) %>%
    mutate( contin2_ = ifelse(dummy2==1, contin2, NA) ) %>%
    mutate( contin3_ = ifelse(dummy3==1, contin3, NA) ) %>%
    group_by(g1) %>%
    summarize( 
              V1 = mean(dummy1, na.rm=TRUE),
              V2 = mean(dummy2, na.rm=TRUE),
              V3 = mean(dummy3, na.rm=TRUE),
              mean1 = mean(contin1_, na.rm=TRUE),
              mean2 = mean(contin2_, na.rm=TRUE),
              mean3 = mean(contin3_, na.rm=TRUE)
               )
)

print(rates[[1]])

This gives me:

Source: local data frame [3 x 7]

  g1   V1  V2  V3     mean1 mean2 mean3
1  a 0.50 0.0 0.5 128.00000   NaN    17
2 fj 0.75 0.5 0.0  94.66667    64   NaN
3  g 0.25 0.5 0.5  54.00000    57   146

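The numbers I get look about right, and the missing values (shown as NaN) are in all the right places. Unfortunately, your example is not fully reproducible: you did not specify the seed used to generate the random variables, so my runif calls give me different values from yours.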
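The NaN entries arise because mean(..., na.rm = TRUE) is taken over an empty set once every value in a group has been masked. A minimal sketch of that behaviour follows, with an optional conversion to NA; nan_to_na is a hypothetical helper introduced here for illustration and is not part of dplyr.

```r
# A group where no dummy equals 1: every value is masked to NA.
masked <- ifelse(c(0, 0) == 1, c(105, 14), NA_real_)

# Dropping the NAs leaves nothing to average, so the mean is NaN.
m <- mean(masked, na.rm = TRUE)
is.nan(m)  # TRUE
is.na(m)   # TRUE as well: is.na() treats NaN as missing

# Hypothetical helper: turn NaN into NA if a plain NA is preferred in the output.
nan_to_na <- function(v) {
  v[is.nan(v)] <- NA
  v
}

out <- nan_to_na(c(128, NaN, 17))
out
```

Such a helper could be applied to the summarised mean columns afterwards if NA is wanted in place of NaN.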

Answer 1: (score: 1)

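Another option is to change the format from 'wide' to 'long' and convert back to 'wide' after getting the 'mean' values. For multiple value columns, this is now possible with melt and dcast from the devel version of data.table, i.e. v1.9.5. It can be installed from here. (Using the same dataset as in @akhmed's post.)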

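We can melt the datasets in the list ('u') by specifying the indices of the columns ('dummy' and 'contin') as a list in measure.vars. Grouped by 'g1' and 'variable' (created by melt), we get the means of the 'dummy' and 'contin' columns, then dcast from long to wide, specifying value.var as 'dummyMean' and 'continMean'.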

 library(data.table)  # melt/dcast with several value columns need v1.9.5 or later

 res <-  lapply(u, function(x) {
   x1 <- melt(setDT(x), measure.vars=list(2:4,5:7),
                        value.name=c('dummy', 'contin'))
   x2 <- x1[, list(dummyMean = mean(dummy, na.rm=TRUE),
             continMean = mean(contin[dummy==1], na.rm=TRUE)), 
                           by=list(g1, variable)]

  dcast(x2, g1~variable, value.var=c('dummyMean', 'continMean'))})

 res[[1]]
 #   g1 1_dummyMean 2_dummyMean 3_dummyMean 1_continMean 2_continMean
 #1:  a        0.50         0.0         0.5    128.00000          NaN
 #2: fj        0.75         0.5         0.0     94.66667           64
 #3:  g        0.25         0.5         0.5     54.00000           57
 #    3_continMean
 #1:           17
 #2:          NaN
 #3:          146

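An option using base R. Create functions 'fdummy' and 'fcontin' to subset the 'dummy' and 'contin' columns. Loop through 'u' with lapply(...). Use Map to pair the corresponding 'dummy' and 'contin' columns and, grouped by the 'g1' column, use tapply to get the mean of each 'dummy' column and the mean of each 'contin' column where 'dummy == 1', then combine the results.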
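The code for this base R option did not survive in the post. The steps described above could be sketched roughly as follows; this is an illustration written from the description, not the answer's original code. The function name cond_means and the "_mean" column suffix are assumptions made here, and the data frame x copies u[[1]] as printed in the question.

```r
# u[[1]] as printed in the question.
x <- data.frame(
  g1      = c("fj", "fj", "fj", "a", "fj", "a", "g", "g", "g", "g"),
  dummy1  = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 0),
  dummy2  = c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1),
  dummy3  = c(0, 0, 0, 1, 0, 0, 1, 1, 0, 0),
  contin1 = c(199, 91, 147, 181, 14, 178, 116, 48, 60, 95),
  contin2 = c(18, 158, 67, 105, 16, 14, 97, 31, 21, 145),
  contin3 = c(61, 28, 190, 22, 156, 98, 30, 144, 112, 199)
)

# Hypothetical helper: group means of the dummies, and of each contin
# column restricted to the rows where its matching dummy equals 1.
cond_means <- function(x) {
  g      <- factor(x$g1)                   # keeps all levels after subsetting
  dummy  <- x[grep("^dummy",  names(x))]
  contin <- x[grep("^contin", names(x))]

  dummy_mean <- lapply(dummy, function(d) as.vector(tapply(d, g, mean)))

  # Map pairs dummy1 with contin1, dummy2 with contin2, and so on.
  # Groups with no qualifying row come back from tapply as NA.
  contin_mean <- Map(function(d, v) {
    as.vector(tapply(v[d == 1], g[d == 1], mean))
  }, dummy, contin)
  names(contin_mean) <- paste0(names(contin), "_mean")

  data.frame(g1 = levels(g), dummy_mean, contin_mean)
}

res <- cond_means(x)
res
```

For the full list this would be applied as lapply(u, cond_means). Because tapply returns NA for empty groups of a factor, the gaps show up as NA rather than NaN.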