How can I get the means for each column?

Time: 2014-07-16 18:01:28

Tags: r dataframe aggregate

I have a large data frame like this:

ID  c_Al   c_D    c_Hy      occ
A     0     0      0        2306
B     0     0      0        3031
C     0     0      1        2581
D     0     0      1        1917
E     0     0      1        2708
F     0     1      0        2751
G     0     1      0        1522
H     0     1      0        657
I     0     1      1        469
J     0     1      1        2629
L     1     0      0        793
L     1     0      0        793
M     1     0      0        564
N     1     0      1        2617
O     1     0      1        1167
P     1     0      1        389
Q     1     0      1        294
R     1     1      0        1686
S     1     1      0        992

How can I get the mean of occ for each value (0 or 1) of each column, laid out like this?

               0        1
    c_Al    1506.2  1641.2
    c_D     748.6   1467.5
    c_Hy    1506.2  1641.2

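I tried aggregate(occ~c_Al, mean, data=table2), but it has to be repeated for every column; ddply has the same drawback. I also tried for(i in 1:dim(table2)[1]){ aggregate(occ~[,i], mean, data=table2)}, but it doesn't work.

5 Answers:

Answer 0 (score: 10)

I would just use melt and dcast from "reshape2":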

library(reshape2)
dfL <- melt(table2, id.vars = c("ID", "occ"))
dcast(dfL, variable ~ value, value.var = "occ", fun.aggregate = mean)
#   variable        0        1
# 1     c_Al 2057.100 1032.778
# 2      c_D 1596.667 1529.429
# 3     c_Hy 1509.500 1641.222

Of course, base R can handle this too.

Here, I've used tapply and vapply:

vapply(table2[2:4], function(x) tapply(table2$occ, x, mean), numeric(2L))
#       c_Al      c_D     c_Hy
# 0 2057.100 1596.667 1509.500
# 1 1032.778 1529.429 1641.222
t(vapply(table2[2:4], function(x) tapply(table2$occ, x, mean), numeric(2L)))
#             0        1
# c_Al 2057.100 1032.778
# c_D  1596.667 1529.429
# c_Hy 1509.500 1641.222
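
To reproduce these numbers, the sketch below rebuilds the question's table as table2 (typed in by hand from the post, so the construction itself is an assumption) and runs the tapply step on one column before applying it to all three. The helper names ind_cols, al_means and means are illustrative only.

```r
# Rebuild the example data from the question (typed in by hand).
table2 <- data.frame(
  ID   = c("A","B","C","D","E","F","G","H","I","J",
           "L","L","M","N","O","P","Q","R","S"),
  c_Al = c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1),
  c_D  = c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1),
  c_Hy = c(0,0,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1,0,0),
  occ  = c(2306,3031,2581,1917,2708,2751,1522,657,469,2629,
           793,793,564,2617,1167,389,294,1686,992)
)

# Step 1: mean of occ within each level (0/1) of one indicator column.
al_means <- tapply(table2$occ, table2$c_Al, mean)

# Step 2: the same for every indicator column. vapply() insists that each
# result is a numeric vector of length 2; t() turns columns into rows.
ind_cols <- c("c_Al", "c_D", "c_Hy")
means <- t(vapply(table2[ind_cols],
                  function(x) tapply(table2$occ, x, mean),
                  numeric(2L)))
print(round(means, 3))
```

Selecting the columns by name rather than by position (2:4) is a small design choice that keeps the call working if the column order changes.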

Answer 1 (score: 4)

Using dplyr. If dat is the dataset:

library(dplyr)
library(tidyr) 

dat %>%
  gather(Var, Value, c_Al:c_Hy) %>%
  group_by(Value, Var) %>%
  summarize(occ = mean(occ)) %>%
  spread(Value, occ)
# Source: local data frame [3 x 3]

#   Var        0        1
# 1 c_Al 2057.100 1032.778
# 2  c_D 1596.667 1529.429
# 3 c_Hy 1509.500 1641.222
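
In more recent tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a rough equivalent of the pipeline above, assuming tidyr 1.0 or later and dplyr 1.0 or later are installed; dat is rebuilt by hand from the question, and res is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Example data typed in from the question.
dat <- data.frame(
  ID   = c("A","B","C","D","E","F","G","H","I","J",
           "L","L","M","N","O","P","Q","R","S"),
  c_Al = c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1),
  c_D  = c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1),
  c_Hy = c(0,0,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1,0,0),
  occ  = c(2306,3031,2581,1917,2708,2751,1522,657,469,2629,
           793,793,564,2617,1167,389,294,1686,992)
)

res <- dat %>%
  # Long format: one row per (ID, indicator column) pair.
  pivot_longer(c_Al:c_Hy, names_to = "Var", values_to = "Value") %>%
  group_by(Var, Value) %>%
  summarise(occ = mean(occ), .groups = "drop") %>%
  # Back to wide: one column per indicator value (0 and 1).
  pivot_wider(names_from = Value, values_from = occ)

print(res)
```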

Answer 2 (score: 3)

I tried it with dplyr and tidyr. Similar to @akrun's approach, but keeping the data in a wider format (for no particular reason):

library(tidyr)
library(dplyr)

new_df <- df %>% 
  gather(category,value,c_Al:c_Hy) %>%
  mutate(ids = 1:n()) %>%
  #unique %>%
  spread(value,occ,fill = NA)

mean_na <- function(x) mean(x,na.rm = TRUE)

new_df %>% 
  group_by(category) %>%
  select(-ID,-ids) %>%
  summarise_each(funs(mean_na))

  category        0        1
1     c_Al 2057.100 1032.778
2      c_D 1596.667 1529.429
3     c_Hy 1509.500 1641.222

Answer 3 (score: 1)

An alternative in plain R:

sapply(0:1, 
       function(i) sapply(colnames(df[2:4]), 
                          function(column) mean(df[df[,column]==i, "occ"])))

Edit: or, following the request for column names in the result (0:1 is replaced by a vector with named elements):

sapply(c("0"=0, "1"=1), 
       function(i) sapply(colnames(df[2:4]), 
                          function(column) mean(df[df[,column]==i, "occ"])))
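
The loop attempted in the question can also be made to work in plain R. It fails as written for three reasons: it iterates over the number of rows (dim(table2)[1]) rather than over the indicator columns, occ~[,i] is not valid formula syntax, and the result of each aggregate() call is thrown away. A minimal sketch of one possible repair builds each formula with reformulate() and collects the results; df is rebuilt by hand from the question, and ind_cols, per_col and res are illustrative names.

```r
# Example data typed in from the question.
df <- data.frame(
  ID   = c("A","B","C","D","E","F","G","H","I","J",
           "L","L","M","N","O","P","Q","R","S"),
  c_Al = c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1),
  c_D  = c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1),
  c_Hy = c(0,0,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1,0,0),
  occ  = c(2306,3031,2581,1917,2708,2751,1522,657,469,2629,
           793,793,564,2617,1167,389,294,1686,992)
)

ind_cols <- names(df)[2:4]

# One aggregate() call per indicator column.
# reformulate("c_Al", response = "occ") builds the formula occ ~ c_Al.
per_col <- lapply(ind_cols, function(v) {
  aggregate(reformulate(v, response = "occ"), data = df, FUN = mean)
})
names(per_col) <- ind_cols

# Each element has the group value in column 1 and the mean in "occ";
# turn that into a named vector and stack the vectors into a matrix.
res <- t(sapply(per_col, function(a) setNames(a$occ, a[[1]])))
print(round(res, 3))
```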

Answer 4 (score: 1)

Here is a solution that uses only colSums and subsetting, taking advantage of the matrix structure of the problem:

cbind(`0`=colSums((x[,2:4]-1)*x[,5]*-1)/colSums(x[,2:4]==0),
      `1`=colSums(x[,2:4]*x[,5])/colSums(x[,2:4]==1))
            0        1
c_Al 2057.100 1032.778
c_D  1596.667 1529.429
c_Hy 1509.500 1641.222
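
The colSums trick relies on the indicator columns containing only 0 and 1 (and no NA values): multiplying an indicator by occ keeps occ where the indicator is 1 and zeroes it elsewhere. The same idea can be written as a matrix product, as in the sketch below. Here x is rebuilt by hand from the question, and ind, occ, sum0, sum1 and res are illustrative names.

```r
# Example data typed in from the question.
x <- data.frame(
  ID   = c("A","B","C","D","E","F","G","H","I","J",
           "L","L","M","N","O","P","Q","R","S"),
  c_Al = c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1),
  c_D  = c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1),
  c_Hy = c(0,0,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1,0,0),
  occ  = c(2306,3031,2581,1917,2708,2751,1522,657,469,2629,
           793,793,564,2617,1167,389,294,1686,992)
)

ind <- as.matrix(x[, 2:4])   # 0/1 indicator matrix, one column per variable
occ <- x[, 5]

# t(ind) %*% occ sums occ over the rows where each indicator is 1;
# using (1 - ind) does the same for the rows where it is 0.
sum1 <- drop(crossprod(ind, occ))
sum0 <- drop(crossprod(1 - ind, occ))

# Divide each sum by the number of rows in that group to get the mean.
res <- cbind(`0` = sum0 / colSums(ind == 0),
             `1` = sum1 / colSums(ind == 1))
print(round(res, 3))
```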