I have a large data frame like this:
ID c_Al c_D c_Hy occ
A 0 0 0 2306
B 0 0 0 3031
C 0 0 1 2581
D 0 0 1 1917
E 0 0 1 2708
F 0 1 0 2751
G 0 1 0 1522
H 0 1 0 657
I 0 1 1 469
J 0 1 1 2629
L 1 0 0 793
L 1 0 0 793
M 1 0 0 564
N 1 0 1 2617
O 1 0 1 1167
P 1 0 1 389
Q 1 0 1 294
R 1 1 0 1686
S 1 1 0 992
How can I get the mean of occ for each value (0 and 1) of every column, like this?
0 1
c_Al 1506.2 1641.2
c_D 748.6 1467.5
c_Hy 1506.2 1641.2
I tried aggregate(occ~c_Al, mean, data=table2), but that has to be repeated for every column; ddply gives the same result. I also tried for(i in 1:dim(table2)[1]){ aggregate(occ~[,i], mean, data=table2)}, but it does not work.
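For anyone who wants to reproduce the outputs in the answers, here is a sketch that rebuilds the table with read.table. The name table2 is taken from the question, and the duplicated L row is kept as posted. It also shows one way the loop attempt could be repaired: occ~[,i] is not valid formula syntax, and 1:dim(table2)[1] iterates over rows rather than columns. The repair below is an assumption about what the loop was meant to do, not the asker's code.

```r
# Rebuild the question's table (19 rows, including the repeated "L" row).
table2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID c_Al c_D c_Hy occ
A 0 0 0 2306
B 0 0 0 3031
C 0 0 1 2581
D 0 0 1 1917
E 0 0 1 2708
F 0 1 0 2751
G 0 1 0 1522
H 0 1 0 657
I 0 1 1 469
J 0 1 1 2629
L 1 0 0 793
L 1 0 0 793
M 1 0 0 564
N 1 0 1 2617
O 1 0 1 1167
P 1 0 1 389
Q 1 0 1 294
R 1 1 0 1686
S 1 1 0 992
")

# Loop over the flag column names (not over rows) and build each
# formula, e.g. occ ~ c_Al, with reformulate().
cols <- c("c_Al", "c_D", "c_Hy")
means_by_col <- lapply(setNames(cols, cols), function(v) {
  aggregate(reformulate(v, response = "occ"), data = table2, FUN = mean)
})

means_by_col$c_Al  # one small data frame per flag column
```

This gives a list of three two-row data frames rather than a single table, which is why the answers reshape or use tapply instead.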
Answer 0 (score: 10)
I would just use melt and dcast from "reshape2":
library(reshape2)
dfL <- melt(table2, id.vars = c("ID", "occ"))
dcast(dfL, variable ~ value, value.var = "occ", fun.aggregate = mean)
# variable 0 1
# 1 c_Al 2057.100 1032.778
# 2 c_D 1596.667 1529.429
# 3 c_Hy 1509.500 1641.222
Of course, base R can handle this too. Here I've used tapply and vapply:
vapply(table2[2:4], function(x) tapply(table2$occ, x, mean), numeric(2L))
# c_Al c_D c_Hy
# 0 2057.100 1596.667 1509.500
# 1 1032.778 1529.429 1641.222
t(vapply(table2[2:4], function(x) tapply(table2$occ, x, mean), numeric(2L)))
# 0 1
# c_Al 2057.100 1032.778
# c_D 1596.667 1529.429
# c_Hy 1509.500 1641.222
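To see what each piece of the base R version does, here is a minimal sketch on toy data. The data frame toy and its columns f1 and f2 are made up for illustration and are not the question's table.

```r
# Toy data (hypothetical): two 0/1 flag columns and a value column.
toy <- data.frame(f1  = c(0, 0, 1, 1),
                  f2  = c(0, 1, 1, 1),
                  occ = c(10, 20, 30, 40))

# One column at a time: tapply() splits occ by the flag and averages
# each group, giving 15 for f1 == 0 and 35 for f1 == 1.
one_col <- tapply(toy$occ, toy$f1, mean)

# vapply() repeats that for every flag column. numeric(2L) declares that
# each column must return exactly two numbers (one mean per flag value).
res <- vapply(toy[c("f1", "f2")],
              function(x) tapply(toy$occ, x, mean),
              numeric(2L))
res     # rows are flag values 0 and 1, columns are f1 and f2
t(res)  # transposed: one row per flag column
```

One design consequence: if some flag column contained only 0s or only 1s, tapply would return a single mean and vapply would stop with an error instead of silently producing a ragged result.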
Answer 1 (score: 4)
Using dplyr (with tidyr for the reshaping). If dat is the dataset:
library(dplyr)
library(tidyr)
dat%>%
gather(Var,Value, c_Al:c_Hy)%>%
group_by(Value,Var)%>%
summarize(occ=mean(occ))%>%
spread(Value, occ)
# Source: local data frame [3 x 3]
# Var 0 1
# 1 c_Al 2057.100 1032.778
# 2 c_D 1596.667 1529.429
# 3 c_Hy 1509.500 1641.222
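In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. A sketch of the same pipeline with the newer functions follows, on a made-up data frame toy with flag columns f1 and f2 (hypothetical names). For the question's data the column selection would be c_Al:c_Hy.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical), same shape as the question's table.
toy <- data.frame(ID  = c("A", "B", "C", "D"),
                  f1  = c(0, 0, 1, 1),
                  f2  = c(0, 1, 1, 1),
                  occ = c(10, 20, 30, 40))

res <- toy %>%
  # long format: one row per (ID, flag column) pair
  pivot_longer(c(f1, f2), names_to = "Var", values_to = "Value") %>%
  group_by(Var, Value) %>%
  summarise(occ = mean(occ), .groups = "drop") %>%
  # wide again: one column per flag value
  pivot_wider(names_from = Value, values_from = occ)

res
```

The result has one row per flag column and columns named 0 and 1, matching the layout of the output above.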
Answer 2 (score: 3)
I tried this with dplyr and tidyr. It is similar to @akrun's approach, but keeps the data in a wider format (for no particular reason).
library(tidyr)
library(dplyr)
new_df <- df %>%
gather(category,value,c_Al:c_Hy) %>%
mutate(ids = 1:n()) %>%
#unique %>%
spread(value,occ,fill = NA)
mean_na <- function(x) mean(x,na.rm = TRUE)
new_df %>%
group_by(category) %>%
select(-ID,-ids) %>%
summarise_each(funs(mean_na))
category 0 1
1 c_Al 2057.100 1032.778
2 c_D 1596.667 1529.429
3 c_Hy 1509.500 1641.222
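summarise_each and funs are deprecated in current dplyr releases, so the code above may warn or fail there. A hedged sketch of the same wide-format idea with across, pivot_longer and pivot_wider, again on a made-up toy data frame (the names toy, f1 and f2 are hypothetical):

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical), same shape as the question's table.
toy <- data.frame(ID  = c("A", "B", "C", "D"),
                  f1  = c(0, 0, 1, 1),
                  f2  = c(0, 1, 1, 1),
                  occ = c(10, 20, 30, 40))

new_df <- toy %>%
  pivot_longer(c(f1, f2), names_to = "category", values_to = "value") %>%
  # a unique row id keeps every row separate when widening,
  # even if the input contains duplicated rows
  mutate(ids = row_number()) %>%
  pivot_wider(names_from = value, values_from = occ)

# Each row now has occ in either column `0` or column `1` and NA in the
# other, so the means must skip NA.
res <- new_df %>%
  group_by(category) %>%
  summarise(across(c(`0`, `1`), ~ mean(.x, na.rm = TRUE)))

res
```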
Answer 3 (score: 1)
An alternative in plain R:
sapply(0:1,
function(i) sapply(colnames(df[2:4]),
function(column) mean(df[df[,column]==i, "occ"])))
Edit: or, following a request for column names in the result (0:1 is replaced by a vector with named elements):
sapply(c("0"=0, "1"=1),
function(i) sapply(colnames(df[2:4]),
function(column) mean(df[df[,column]==i, "occ"])))
Answer 4 (score: 1)
Here is a solution that uses only colSums and subsetting, by taking advantage of the matrix structure of the problem:
cbind(`0`=colSums((x[,2:4]-1)*x[,5]*-1)/colSums(x[,2:4]==0),
`1`=colSums(x[,2:4]*x[,5])/colSums(x[,2:4]==1))
0 1
c_Al 2057.100 1032.778
c_D 1596.667 1529.429
c_Hy 1509.500 1641.222
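The arithmetic behind that one-liner can be spelled out step by step. This is only a sketch on made-up data (toy, f1 and f2 are hypothetical names): multiplying a 0/1 matrix by occ keeps occ where the flag is 1 and zeroes it elsewhere, so a column sum divided by a count is a group mean.

```r
# Toy data (hypothetical): two 0/1 flag columns and a value column.
toy <- data.frame(f1  = c(0, 0, 1, 1),
                  f2  = c(0, 1, 1, 1),
                  occ = c(10, 20, 30, 40))

flags <- as.matrix(toy[c("f1", "f2")])
occ   <- toy$occ

# occ is recycled down each column, so every flag column is multiplied
# elementwise by occ.
sum1 <- colSums(flags * occ)        # total occ where flag == 1
sum0 <- colSums((1 - flags) * occ)  # total occ where flag == 0

res <- cbind(`0` = sum0 / colSums(flags == 0),
             `1` = sum1 / colSums(flags == 1))
res
```

Here (1 - flags) is the same quantity as (x[,2:4]-1)*-1 in the answer, just written without the sign flip.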