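I have many data frames stored in a list, and I want to compute weighted means from these data frames and store the results in a new data frame. For example, with the list: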
dfs <- structure(list(df1 = structure(list(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame"),
df2 = structure(list(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame")),
.Names = c("df1", "df2"))
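In this example, I would like to use columns A, B, and Weight to compute the weighted means. I also want to carry over related data such as Site, and I want to sum the number of TRUE and FALSE values in Weight. My desired result looks like: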
result <- structure(list(Site = structure(1:2, .Label = c("X", "Y"), class = "factor"),
A.Weight = c(4.5, 8), B.Weight = c(6L, 4L), Sum.Weight = c(2L,
1L)), .Names = c("Site", "A.Weight", "B.Weight", "Sum.Weight"
), class = "data.frame", row.names = c(NA, -2L))
Site A.Weight B.Weight Sum.Weight
1 X 4.5 6 2
2 Y 8.0 4 1
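The above is only a very simple example. My real data has many data frames in the list, and many more columns than just A and B for which I want to compute weighted means. I also have several columns like Site that are constant within each data frame and that I want to carry over to the result.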
I am able to calculate the weighted means manually using something like:
weighted.mean(dfs$df1$A, dfs$df1$Weight)
weighted.mean(dfs$df1$B, dfs$df1$Weight)
weighted.mean(dfs$df2$A, dfs$df2$Weight)
weighted.mean(dfs$df2$B, dfs$df2$Weight)
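but I am not sure how to do this in a shorter, less "manual" way. Does anyone have any suggestions? I recently learned how to lapply across the data frames in a list, but my attempts so far have not been that great.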
Answer 0 (score: 2)
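The trick is to create a function that works on a single data.frame, then use lapply to iterate over the list. Since lapply returns a list, we then use do.call together with rbind to combine the resulting objects into one data frame: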
foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
  # Weighted mean of each requested column; drop = FALSE keeps a data frame even for a single column
  means <- t(sapply(data[, meanCols, drop = FALSE], weighted.mean, w = data[, weightCol]))
  # With logical weights this is the number of TRUE rows
  sumWeight <- sum(data[, weightCol])
  # You said all the other data was constant, so we can just grab the first row
  others <- data[1, otherCols, drop = FALSE]
  out <- data.frame(others, means, sumWeight)
  return(out)
}
In action:
do.call(rbind, lapply(dfs, foo))
---
Site A B sumWeight
df1 X 4.5 6 2
df2 Y 8.0 4 1
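To make the two steps explicit, here is a minimal sketch that separates the lapply call from the do.call(rbind, ...) call. It rebuilds the example list with data.frame() for brevity, and summarise_one is a hypothetical, hard-coded stand-in for foo used only for illustration:

```r
# Same example data as in the question, built with data.frame() for brevity
dfs <- list(
  df1 = data.frame(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
  df2 = data.frame(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y"))
)

# Hypothetical helper: summarise one data frame into a single row
summarise_one <- function(data) {
  data.frame(
    Site = data$Site[1],
    A = weighted.mean(data$A, data$Weight),
    B = weighted.mean(data$B, data$Weight),
    sumWeight = sum(data$Weight)
  )
}

# Step 1: lapply() returns a named list with one single-row data frame per input
pieces <- lapply(dfs, summarise_one)
str(pieces, max.level = 1)

# Step 2: do.call() passes the list elements as separate arguments to rbind(),
# i.e. it is equivalent to rbind(df1 = pieces$df1, df2 = pieces$df2)
combined <- do.call(rbind, pieces)
print(combined)
```

Keeping the per-data-frame logic in its own function means it can be tested on a single element such as dfs$df1 before it is applied to the whole list.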
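Since you said this is a minimal example, here is one way to extend it to other columns. We will use grepl() with a regular expression to identify the right columns. Alternatively, you could write them all out in a vector. Something like this: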
do.call(rbind, lapply(dfs, foo,
meanCols = grepl("A|B", names(dfs[[1]])),
otherCols = grepl("Site", names(dfs[[1]]))
))
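One caveat with the pattern above: "A|B" matches any column whose name contains an A or a B. That is fine for this example, but it could pick up unrelated columns in wider data. A minimal sketch of a stricter selection follows; the column names Area and Block are hypothetical and are not part of the original data:

```r
# Hypothetical column names, for illustration only
cols <- c("A", "B", "Area", "Block", "Weight", "Site")

# Unanchored pattern: also matches "Area" and "Block"
loose <- grepl("A|B", cols)
cols[loose]

# Anchored with ^ and $: matches only the columns named exactly "A" or "B"
strict <- grep("^(A|B)$", cols, value = TRUE)
strict
```

Because value = TRUE returns column names rather than a logical vector, the result can be passed directly as meanCols and does not rely on every data frame in the list having its columns in the same order as dfs[[1]].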
Answer 1 (score: 2)
Using dplyr:
library(dplyr)
library('devtools')
install_github('hadley/tidyr')
library(tidyr)
unnest(dfs) %>%
group_by(Site) %>%
filter(Weight) %>%
mutate(Sum=n()) %>%
select(-Weight) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)))
which gives the result:
# Site A B Sum
#1 X 4.5 6 2
#2 Y 8.0 4 1
Or using data.table:
library(data.table)
DT <- rbindlist(dfs)
DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE),
Sum=.N), by = Site, .SDcols = c("A", "B")]
# Site A B Sum
#1: X 4.5 6 2
#2: Y 8.0 4 1
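Both versions above drop the rows where Weight is FALSE and then take an ordinary mean. That matches weighted.mean() only because the weights here are logical, so each row counts either fully or not at all. A small base R check of this, where the numeric weights in the second half are made-up values for illustration:

```r
x <- c(6, 8)

# Logical weights: filtering and then averaging equals the weighted mean
w_logical <- c(FALSE, TRUE)
mean(x[w_logical])            # mean of the kept rows
weighted.mean(x, w_logical)   # same value

# Numeric weights (hypothetical values): the two are no longer the same
w_numeric <- c(1, 3)
mean(x[w_numeric > 0])        # plain mean, ignores the size of the weights
weighted.mean(x, w_numeric)   # (6*1 + 8*3) / (1 + 3)
```

If the real data ever carries numeric weights, the filter-then-mean shortcut would silently give a different answer, so calling weighted.mean() directly is the safer choice in that case.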
In response to @jazzuro's comment, using dplyr 0.3, I am getting:
unnest(dfs) %>%
group_by(Site) %>%
summarise_each(funs(weighted.mean=stats::weighted.mean(., Weight),
Sum.Weight=sum(Weight)), -starts_with("Weight")) %>%
select(Site:B_weighted.mean, Sum.Weight=A_Sum.Weight)
# Site A_weighted.mean B_weighted.mean Sum.Weight
#1 X 4.5 6 2
#2 Y 8.0 4 1
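summarise_each() and funs() have since been deprecated in dplyr, and unnest() in current tidyr may no longer accept a bare list of data frames. A rough equivalent, assuming dplyr 1.0 or later and using bind_rows() to stack the list instead of unnest(), might look like this (the example list is rebuilt with data.frame() so the snippet is self-contained):

```r
library(dplyr)

# Same example data as in the question
dfs <- list(
  df1 = data.frame(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
  df2 = data.frame(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y"))
)

result <- bind_rows(dfs) %>%
  group_by(Site) %>%
  summarise(
    # Weighted mean of A and B within each Site, using Weight as the weights
    across(c(A, B), ~ weighted.mean(.x, Weight)),
    # With logical weights this counts the TRUE rows
    Sum.Weight = sum(Weight)
  )
result
```

The summary column is named Sum.Weight rather than Weight so that the original Weight column stays available to the across() call within the same summarise().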