Consider the following data frame with 4 columns:
df = data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10))
Columns A, B, C and D belong to different groups, which are defined in a separate data frame:
groups = data.frame(Class = c("A","B","C","D"), Group = c("G1", "G2", "G2", "G1"))
#> groups
# Class Group
#1 A G1
#2 B G2
#3 C G2
#4 D G1
I would like to average the elements of the columns that belong to the same group, and get something like:
#> res
# G1 G2
#1 -0.30023039 -0.71075139
#2 0.53053443 -0.12397126
#3 0.21968567 -0.46916160
#4 -1.13775100 -0.61266026
#5 1.30388130 -0.28021734
#6 0.29275876 -0.03994522
#7 -0.09649998 0.59396983
#8 0.71334020 -0.29818438
#9 -0.29830924 -0.47094084
#10 -0.36102888 -0.40181739
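where each cell of G1 is the mean of the corresponding cells of A and D, each cell of G2 is the mean of the corresponding cells of B and C, and so on.
I was able to achieve this result, but in a rather brute-force way: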
l = levels(factor(groups$Group))  # factor() so this also works when Group is a character column
res = data.frame(matrix(ncol = length(l), nrow = nrow(df)))
for(i in 1:length(l)) {
df.sub = df[which(groups$Group == l[i])]
res[,i] = apply(df.sub, 1, mean)
}
names(res) <- l
Is there a better way? In reality, I have more than 20 columns and more than 10 groups.
Thanks!
Answer 0: (score: 3)
library(data.table)
groups <- data.table(groups, key="Group")
DT <- data.table(df)
groups[, rowMeans(DT[, Class, with=FALSE]), by=Group][, setnames(as.data.table(matrix(V1, ncol=length(unique(Group)))), unique(Group))]
G1 G2
1: -0.13052091 -0.3667552
2: 1.17178729 -0.5496347
3: 0.23115841 0.8317714
4: 0.45209516 -1.2180895
5: -0.01861638 -0.4174929
6: -0.43156831 0.9008427
7: -0.64026238 0.1854066
8: 0.56225108 -0.3563087
9: -2.00405840 -0.4680040
10: 0.57608055 -0.6177605
# Also, make sure you have characters, not factors,
groups[, Class := as.character(Class)]
groups[, Group := as.character(Group)]
In plain base R:
tapply(groups$Class, groups$Group, function(X) rowMeans(df[, X]))
Using sapply:
sapply(unique(groups$Group), function(X)
rowMeans(df[, groups[groups$Group==X, "Class"]]) )
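For reference, here is a minimal self-contained sketch of the same base R idea. It assumes `Class` and `Group` are character columns, and it uses `split()` to build the group-to-columns mapping, which is a small variation on the `sapply` call above rather than the answer's exact code. The seed and the name `cols_by_group` are only there for illustration.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
df <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10))
groups <- data.frame(Class = c("A", "B", "C", "D"),
                     Group = c("G1", "G2", "G2", "G1"),
                     stringsAsFactors = FALSE)

# Named list mapping each group to the column names it contains
cols_by_group <- split(groups$Class, groups$Group)

# Row means per group; drop = FALSE keeps a one-column group as a data frame
res <- as.data.frame(
  sapply(cols_by_group, function(cols) rowMeans(df[, cols, drop = FALSE]))
)

# Sanity check against the definition in the question
stopifnot(isTRUE(all.equal(unname(res$G1), (df$A + df$D) / 2)))
stopifnot(isTRUE(all.equal(unname(res$G2), (df$B + df$C) / 2)))
```

Selecting columns by name rather than by position means the result does not depend on the rows of `groups` being in the same order as the columns of `df`.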
Answer 1: (score: 0)
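Personally I would go with Ricardo's solution, but another option is to first merge your two datasets and then use whichever aggregation method you prefer.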
library(reshape2)
## Retain the "rownames" so we can aggregate by row
temp <- merge(cbind(id = rownames(df), melt(df)), groups,
by.x = "variable", by.y = "Class")
head(temp)
# variable id value Group
# 1 A 1 -0.6264538 G1
# 2 A 2 0.1836433 G1
# 3 A 3 -0.8356286 G1
# 4 A 4 1.5952808 G1
# 5 A 5 0.3295078 G1
# 6 A 6 -0.8204684 G1
## This is the perfect form for `dcast` to do its work
dcast(temp, id ~ Group, value.var="value", mean)
# id G1 G2
# 1 1 0.36611287 1.21537927
# 2 10 0.22889368 0.50592144
# 3 2 0.04042780 0.58598977
# 4 3 -0.22397850 -0.27333780
# 5 4 0.77073788 -2.10202579
# 6 5 -0.52377589 0.87237833
# 7 6 -0.61773147 -0.05053117
# 8 7 0.04656955 -0.08599288
# 9 8 0.33950565 -0.26345809
# 10 9 0.83790336 0.17153557
(The data above was generated by calling set.seed(1) before creating the sample "df".)
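The same merge-then-aggregate idea can also be sketched without reshape2, using only base R. This is an illustration under assumptions, not the answer's own code: `stack()` stands in for `melt()`, `tapply()` stands in for `dcast()`, and the names `long`, `merged` and `res` are made up for this example. Keeping `id` as an integer also keeps the rows in numeric order instead of the character order (1, 10, 2, ...) seen in the output above.

```r
set.seed(1)  # assumption: same seed as mentioned above
df <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10))
groups <- data.frame(Class = c("A", "B", "C", "D"),
                     Group = c("G1", "G2", "G2", "G1"),
                     stringsAsFactors = FALSE)

# Long format: one row per (row id, column) pair; stack() yields `values` and `ind`
long <- cbind(id = rep(seq_len(nrow(df)), times = ncol(df)), stack(df))
long$ind <- as.character(long$ind)  # plain character key for the merge

# Attach the group of each original column
merged <- merge(long, groups, by.x = "ind", by.y = "Class")

# Mean of `values` for every (id, Group) combination, returned as a matrix
res <- tapply(merged$values, list(merged$id, merged$Group), mean)

# Sanity check against the definition in the question
stopifnot(isTRUE(all.equal(unname(res[, "G1"]), (df$A + df$D) / 2)))
```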