I have the following data frame:
df = data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D"),
sub=rep(c(1:4),4),
acc1=runif(16,0,3),
acc2=runif(16,0,3),
acc3=runif(16,0,3),
acc4=runif(16,0,3))
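What I want is one averaged row per id: that is, I want to average over the sub values to get the mean acc1, acc2, acc3 and acc4 for each of the levels A, B, C and D (each id has 4 sub levels). The final result would look something like this (with the NAs replaced by the values I want, of course):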
dfavg = data.frame(id=c("A","B","C","D"),meanacc1=NA,meanacc2=NA,meanacc3=NA,meanacc4=NA)
Thanks in advance!
Answer 0 (score: 3)
Try:
You could use any of the specialized packages dplyr or data.table, or use base R. Because you have several columns starting with acc to average, I chose dplyr. The idea is to first group by the variable id and then use summarise_each to get the mean, within each id, of every column whose name starts_with acc.
library(dplyr)
df1 <- df %>%
  group_by(id) %>%
  summarise_each(funs(mean=mean(., na.rm=TRUE)), starts_with("acc")) %>%
  rename(meanacc1=acc1, meanacc2=acc2, meanacc3=acc3, meanacc4=acc4) #this works but it requires more typing.
Instead of rename, I would use paste to build the new names:
# colnames(df1)[-1] <- paste0("mean", colnames(df1)[-1])
which gives the result
# id meanacc1 meanacc2 meanacc3 meanacc4
#1 A 1.7061929 2.401601 2.057538 1.643627
#2 B 1.7172095 1.405389 2.132378 1.769410
#3 C 1.4424233 1.737187 1.998414 1.137112
#4 D 0.5468509 1.281781 1.790294 1.429353
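In dplyr 1.0.0 and later, summarise_each and funs are deprecated in favour of across. The following is a minimal sketch of the same grouped mean written that way; the sample data, the seed and the name df_avg are assumptions made only for illustration, not part of the original answer.

```r
library(dplyr)

# Illustrative sample data (assumed), shaped like the data frame in the question
set.seed(1)
df <- data.frame(id   = rep(c("A", "B", "C", "D"), each = 4),
                 sub  = rep(1:4, 4),
                 acc1 = runif(16, 0, 3),
                 acc2 = runif(16, 0, 3),
                 acc3 = runif(16, 0, 3),
                 acc4 = runif(16, 0, 3))

# Group by id, average every column starting with "acc",
# and prefix the result names with "mean" through the .names template
df_avg <- df %>%
  group_by(id) %>%
  summarise(across(starts_with("acc"),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "mean{.col}"))

df_avg
```

The .names argument does the renaming inside the summary step, so no separate rename or paste0 call is needed.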
Or using data.table:
library(data.table)
nm1 <- paste0("acc", 1:4) #names of columns to do the `means`
dt1 <- setDT(df)[, lapply(.SD, mean, na.rm=TRUE), by=id, .SDcols=nm1]
Here, .SD means "Subset of Data.table", and .SDcols names the columns to which we apply the mean operation.
setnames(dt1, 2:5, paste0("mean", nm1)) #change the names of the concerned columns in the result
dt1
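Note that setDT converts df to a data.table by reference, so df itself changes class. If that side effect is unwanted, one possible variant is to work on a copy made with as.data.table. This is a sketch under assumed sample data; the names dt, acc_cols and dt_avg are illustrative.

```r
library(data.table)

# Illustrative sample data (assumed), shaped like the data frame in the question
set.seed(1)
df <- data.frame(id   = rep(c("A", "B", "C", "D"), each = 4),
                 sub  = rep(1:4, 4),
                 acc1 = runif(16, 0, 3),
                 acc2 = runif(16, 0, 3),
                 acc3 = runif(16, 0, 3),
                 acc4 = runif(16, 0, 3))

dt <- as.data.table(df)  # a copy; df remains a plain data.frame

# Pick the columns by pattern instead of typing their names
acc_cols <- grep("^acc", names(dt), value = TRUE)

# Mean of each selected column within each id
dt_avg <- dt[, lapply(.SD, mean, na.rm = TRUE), by = id, .SDcols = acc_cols]

# Rename by old name rather than by position
setnames(dt_avg, acc_cols, paste0("mean", acc_cols))
dt_avg
```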
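Answer 1 (score: 2)
(This must have been asked at least 20 times.) The aggregate function applies the same function (given as the third argument) to all columns of the first argument, within the groups defined by the second argument:
df2 <- aggregate(df[-(1:2)], df[1], mean)
df2
If you want to prepend the letters "mean" to the column names (leaving the id column alone):
names(df2)[-1] <- paste0("mean", names(df2)[-1])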
If you want to automate the column selection, then grep or grepl would work:
aggregate(df[ grepl("acc", names(df) )], df[1], mean)
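aggregate also has a formula interface, which some readers find easier to scan. The sketch below uses a tiny hand-made data frame (an assumption, chosen so the means can be checked by eye) rather than the random data from the question.

```r
# Tiny illustrative data frame (assumed values, easy to verify by hand)
df <- data.frame(id   = rep(c("A", "B"), each = 2),
                 sub  = rep(1:2, 2),
                 acc1 = c(1, 3, 2, 4),
                 acc2 = c(0, 2, 10, 20))

# Left side: columns to average; right side: grouping variable
res <- aggregate(cbind(acc1, acc2) ~ id, data = df, FUN = mean)

# Prefix every column except id with "mean"
names(res)[-1] <- paste0("mean", names(res)[-1])
res
#   id meanacc1 meanacc2
# 1  A        2        1
# 2  B        3       15
```

One difference worth keeping in mind: the formula method drops rows containing NA by default (na.action = na.omit), whereas the data frame method shown above passes them on to mean.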
Answer 2 (score: 1)
Here are a few other base R options:
split + vapply (since we know vapply will simplify to a matrix whenever possible)
t(vapply(split(df[-c(1, 2)], df[, 1]), colMeans, numeric(4L)))
by (with do.call(rbind, ...) to get the final structure)
do.call(rbind, by(data = df[-c(1, 2)], INDICES = df[[1]], FUN = colMeans))
Both will give you something like this as your result:
# acc1 acc2 acc3 acc4
# A 1.337496 2.091926 1.978835 1.799669
# B 1.287303 1.447884 1.297933 1.312325
# C 1.870008 1.145385 1.768011 1.252027
# D 1.682446 1.413716 1.582506 1.274925
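Both options return a numeric matrix with the ids as row names, not the data frame layout asked for in the question. A possible conversion step is sketched below; the sample data and the names m and dfavg are assumptions for illustration.

```r
# Illustrative sample data (assumed), shaped like the data frame in the question
set.seed(1)
df <- data.frame(id   = rep(LETTERS[1:4], 4),
                 sub  = rep(1:4, 4),
                 acc1 = runif(16, 0, 3),
                 acc2 = runif(16, 0, 3),
                 acc3 = runif(16, 0, 3),
                 acc4 = runif(16, 0, 3))

# Matrix of group means: one row per id, one column per acc variable
m <- t(vapply(split(df[-c(1, 2)], df[, 1]), colMeans, numeric(4L)))

# Move the row names into an id column and prefix the value columns
dfavg <- data.frame(id = rownames(m), m, row.names = NULL)
names(dfavg)[-1] <- paste0("mean", colnames(m))
dfavg
```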
The sample data used here was (with set.seed, for reproducibility):
set.seed(1)
df = data.frame(id = rep(LETTERS[1:4], 4),
sub = rep(c(1:4), 4),
acc1 = runif(16, 0, 3),
acc2 = runif(16, 0, 3),
acc3 = runif(16, 0, 3),
acc4 = runif(16, 0, 3))
Scaled up to 1M rows, these perform quite well (though obviously not as fast as "dplyr" or "data.table").
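To try this on your own machine, a rough timing sketch along these lines could be used. The 1M-row data frame big is generated here purely as an assumption, and the elapsed time will vary by system, so no figure is quoted.

```r
# Assumed large test data: 1M rows, four ids assigned at random
set.seed(1)
n <- 1e6
big <- data.frame(id   = sample(LETTERS[1:4], n, replace = TRUE),
                  sub  = rep(1:4, length.out = n),
                  acc1 = runif(n, 0, 3),
                  acc2 = runif(n, 0, 3),
                  acc3 = runif(n, 0, 3),
                  acc4 = runif(n, 0, 3))

# Time the split + vapply approach; the result is kept in res
timing <- system.time(
  res <- t(vapply(split(big[-c(1, 2)], big[, 1]), colMeans, numeric(4L)))
)

timing["elapsed"]  # seconds, machine dependent
res
```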
Answer 3 (score: 0)
You can do this within the base package itself as follows:
a <- list()
ids <- levels(factor(df$id)) # also works when id is a character column rather than a factor
for (i in seq_along(ids))
{
  a[[i]] <- colMeans(subset(df, id == ids[i])[, c(3,4,5,6)]) ##select columns of df of which you want to compute the means. In your example, 3, 4, 5 and 6 are the columns
}
meanDF <- cbind(data.frame(ids), data.frame(matrix(unlist(a), nrow=length(ids), ncol=4, byrow=TRUE)))
colnames(meanDF) <- c("id", "meanacc1", "meanacc2", "meanacc3", "meanacc4")
meanDF
id meanacc1 meanacc2 meanacc3 meanacc4
A 1.464635 1.645898 1.7461862 1.026917
B 1.807555 1.097313 1.7135346 1.517892
C 1.350708 1.922609 0.8068907 1.607274
D 1.458911 0.726527 2.4643733 2.141865