Creating a new data frame from weighted means of data frames in a list

Time: 2014-09-30 02:05:49

Tags: r

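I have many data frames stored in a list, and I want to compute weighted means from these data frames and store the results in a new data frame. For example, with the list: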

dfs <- structure(list(df1 = structure(list(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")), 
                                      .Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame"), 
                      df2 = structure(list(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y")), 
                                      .Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame")), 
                 .Names = c("df1", "df2"))

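In this example, I want to take the weighted means of columns A and B, using Weight as the weights. I also want to carry over related data such as Site, and to sum the TRUE/FALSE values in Weight (that is, count the TRUEs). My desired result would look like: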

result <- structure(list(Site = structure(1:2, .Label = c("X", "Y"), class = "factor"), 
    A.Weight = c(4.5, 8), B.Weight = c(6L, 4L), Sum.Weight = c(2L, 
    1L)), .Names = c("Site", "A.Weight", "B.Weight", "Sum.Weight"
), class = "data.frame", row.names = c(NA, -2L))


    Site    A.Weight    B.Weight    Sum.Weight
1   X       4.5         6           2
2   Y       8.0         4           1

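The above is just a very simple example. My real data has many data frames in the list, and many more columns than just A and B for which I want weighted means. I also have several columns like Site that are constant within each data frame and that I want to carry over to the result.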

I am able to calculate the weighted means manually using something like:
weighted.mean(dfs$df1$A, dfs$df1$Weight)
weighted.mean(dfs$df1$B, dfs$df1$Weight)
weighted.mean(dfs$df2$A, dfs$df2$Weight)
weighted.mean(dfs$df2$B, dfs$df2$Weight)

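but I'm not sure how to do this in a shorter, less "manual" way. Does anyone have any suggestions? I recently learned how to lapply across data frames in a list, but my attempts so far have not gotten very far.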

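2 Answers:

Answer 0: (score: 2)

The trick is to create a function that works for a single data.frame, then use lapply to iterate across your list. Since lapply returns a list, we then use do.call to rbind the resulting objects together: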

foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
  # drop = FALSE keeps a data.frame even when meanCols selects a single column
  means <- t(sapply(data[, meanCols, drop = FALSE], weighted.mean, w = data[, weightCol]))
  sumWeight <- sum(data[, weightCol])
  others <- data[1, otherCols, drop = FALSE] #You said all the other data was constant, so we can just grab first row
  out <- data.frame(others, means, sumWeight)
  return(out)
}

In action:

do.call(rbind, lapply(dfs, foo))
---
    Site   A B sumWeight
df1    X 4.5 6         2
df2    Y 8.0 4         1

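Since you said this was a minimal example, here is one way to extend it to other columns. We use grepl() with a regular expression to identify the right columns. Alternatively, you could write them all out in a vector. Something like this: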

do.call(rbind, lapply(dfs, foo, 
                      meanCols = grepl("A|B", names(dfs[[1]])),
                      otherCols = grepl("Site", names(dfs[[1]]))
                      ))
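One caveat with the regular-expression route: an unanchored pattern such as "A|B" matches any column name that merely contains an A or a B, which could pick up unintended columns in wider data. The sketch below shows the two alternatives mentioned above, an explicit name vector and an anchored pattern. The helper names `idCols`, `valueCols`, and `valueCols2` are made up for illustration, the example list is rebuilt with data.frame() for brevity, and `foo` is the function from this answer with drop = FALSE added so that a single mean column also works.

```r
# Example data with the same values as dfs in the question
dfs <- list(
  df1 = data.frame(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
  df2 = data.frame(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y"))
)

# foo from this answer (drop = FALSE is an addition, see the note above)
foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
  means <- t(sapply(data[, meanCols, drop = FALSE], weighted.mean, w = data[, weightCol]))
  sumWeight <- sum(data[, weightCol])
  others <- data[1, otherCols, drop = FALSE]
  data.frame(others, means, sumWeight)
}

# Option 1: name the constant columns, treat everything else except Weight as a value column
idCols <- "Site"
valueCols <- setdiff(names(dfs[[1]]), c(idCols, "Weight"))

# Option 2: an anchored pattern that matches the names A and B exactly
valueCols2 <- grep("^(A|B)$", names(dfs[[1]]), value = TRUE)

res <- do.call(rbind, lapply(dfs, foo, meanCols = valueCols, otherCols = idCols))
print(res)
```

Passing character names rather than a logical vector also keeps the call valid if the data frames in the list happen to store their columns in different orders.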

Answer 1: (score: 2)

Using dplyr

 library(dplyr)
 library('devtools')
 install_github('hadley/tidyr')
 library(tidyr)

 unnest(dfs) %>%
           group_by(Site) %>% 
           filter(Weight) %>% 
           mutate(Sum=n()) %>%
           select(-Weight) %>% 
           summarise_each(funs(mean=mean(., na.rm=TRUE)))

Gives the result

 #  Site   A B Sum
 #1    X 4.5 6   2
 #2    Y 8.0 4   1

Or using data.table

 library(data.table)
 DT <- rbindlist(dfs)
 DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE), 
                Sum=.N), by = Site, .SDcols = c("A", "B")]
 #   Site   A B Sum
 #1:    X 4.5 6   2
 #2:    Y 8.0 4   1
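Both versions above filter on Weight and then take a plain mean. That matches weighted.mean() only because Weight is logical here, so each row counts either fully or not at all. A small base R check, using made-up numeric weights purely for illustration, shows where the two diverge:

```r
x <- c(6, 8)

# Logical weights: filtering and weighting agree
w_logical <- c(FALSE, TRUE)
wm_logical <- weighted.mean(x, w_logical)   # 8
fm_logical <- mean(x[w_logical])            # 8

# Numeric weights (hypothetical): filtering no longer reproduces the weighted mean
w_numeric <- c(1, 3)
wm_numeric <- weighted.mean(x, w_numeric)   # (6*1 + 8*3) / (1 + 3) = 7.5
fm_numeric <- mean(x[w_numeric > 0])        # 7

print(c(wm_logical, fm_logical, wm_numeric, fm_numeric))
```

If the real weights are numeric, calling weighted.mean() directly is the safer choice.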

Update

In response to @jazzuro's comment, using dplyr 0.3, I am getting

   unnest(dfs) %>% 
             group_by(Site) %>% 
             summarise_each(funs(weighted.mean=stats::weighted.mean(., Weight),
                    Sum.Weight=sum(Weight)), -starts_with("Weight")) %>%
             select(Site:B_weighted.mean, Sum.Weight=A_Sum.Weight) 

  #    Site A_weighted.mean B_weighted.mean Sum.Weight
  #1    X             4.5               6          2
  #2    Y             8.0               4          1
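summarise_each() and funs() have since been deprecated in dplyr. As a rough sketch only, assuming dplyr 1.0 or later is installed, the same result can be obtained with bind_rows() and across(). The output column names are chosen here to match the result requested in the question, and the example list is rebuilt with data.frame() for brevity.

```r
library(dplyr)

# Example data with the same values as dfs in the question
dfs <- list(
  df1 = data.frame(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
  df2 = data.frame(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y"))
)

res <- bind_rows(dfs) %>%          # stack the list into one data frame
  group_by(Site) %>%
  summarise(
    # weighted mean of each value column, named A.Weight, B.Weight
    across(c(A, B), ~ weighted.mean(.x, Weight), .names = "{.col}.Weight"),
    # number of TRUE weights per site
    Sum.Weight = sum(Weight)
  )

print(res)
```

Because weighted.mean() is called directly, this version also works when Weight holds numeric weights rather than TRUE/FALSE.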