I have a large data frame with more than 40,000 columns, and I am facing a problem similar to "Sum by distinct column value in R".
shop <- data.frame(
'shop_id' = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'),
'Assets' = c(2, 15, 7, 5, 8, 3),
'Liabilities' = c(5, 3, 8, 9, 12, 8),
'sale' = c(12, 5, 9, 15, 10, 18),
'profit' = c(3, 1, 3, 6, 5, 9))
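I have a column shop_id whose values repeat many times. Each shop_id has other values associated with it, such as assets, liabilities, profit, loss, and so on. I now want to average every variable over the rows that share the same shop_id, i.e. I want one row per unique shop_id containing the mean of each of the other columns. Handling each column separately is very tedious, because there are thousands of variables (columns).
My expected result is: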
shop_id Assets Liabilities sale profit
Shop A 8.0 5.333333 8.666667 2.333333
Shop B 5.0 9.000000 15.000000 6.000000
Shop C 5.5 10.000000 14.000000 7.000000
R being as versatile as it is, I believe there should be a faster way to do this. I am currently using nested for loops, as shown below:
# Row indices for each shop_id
idx <- split(seq_len(nrow(shop)), shop$shop_id)
newdata <- data.frame()
for (i in seq_along(idx)) {
  newdata[i, 1] <- names(idx)[i]  # group label
  for (j in 2:ncol(shop)) {
    # mean of column j over the rows of group i
    newdata[i, j] <- mean(shop[idx[[i]], j])
  }
}
Answer 0 (score: 3)
Try data.table:
library(data.table)
setDT(shop)[, lapply(.SD, mean), shop_id]
# shop_id Assets Liabilities sale profit
#1: Shop A 8.0 5.333333 8.666667 2.333333
#2: Shop B 5.0 9.000000 15.000000 6.000000
#3: Shop C 5.5 10.000000 14.000000 7.000000
Or
library(dplyr)
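# summarise_each() and funs() are deprecated in current dplyr;
# across() (dplyr >= 1.0.0) replaces them and skips the grouping column
shop %>%
  group_by(shop_id) %>%
  summarise(across(everything(), mean))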
# shop_id Assets Liabilities sale profit
#1 Shop A 8.0 5.333333 8.666667 2.333333
#2 Shop B 5.0 9.000000 15.000000 6.000000
#3 Shop C 5.5 10.000000 14.000000 7.000000
Or
aggregate(.~shop_id, shop, FUN=mean)
# shop_id Assets Liabilities sale profit
#1 Shop A 8.0 5.333333 8.666667 2.333333
#2 Shop B 5.0 9.000000 15.000000 6.000000
#3 Shop C 5.5 10.000000 14.000000 7.000000
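One caveat with the formula interface, as far as I know: its default na.action is na.omit, so a row with an NA in any column is dropped before averaging. The question's data has no missing values, so the NA below is made up purely for illustration. A minimal sketch comparing it with the data-frame interface of aggregate:

```r
# Same toy data as in the question
shop <- data.frame(
  shop_id = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop C", "Shop C"),
  Assets = c(2, 15, 7, 5, 8, 3),
  Liabilities = c(5, 3, 8, 9, 12, 8),
  sale = c(12, 5, 9, 15, 10, 18),
  profit = c(3, 1, 3, 6, 5, 9)
)

# Hypothetical missing value, not part of the original data
shop_na <- shop
shop_na$Assets[2] <- NA

# Formula interface: row 2 is dropped entirely, so the other
# columns of Shop A are averaged over two rows instead of three
by_formula <- aggregate(. ~ shop_id, shop_na, FUN = mean)

# Data-frame interface: na.rm = TRUE is passed on to mean(),
# so NAs are skipped column by column
by_columns <- aggregate(shop_na[-1], by = list(shop_id = shop_na$shop_id),
                        FUN = mean, na.rm = TRUE)
```

With this made-up NA, the two results agree on Assets for Shop A but differ on its other columns.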
For 40,000 columns, I would use data.table or dplyr.
Answer 1 (score: 2)
Try dplyr:
library("dplyr")
# summarise_each(funs(mean)) in older dplyr; across() needs dplyr >= 1.0.0
shop %>% group_by(shop_id) %>% summarise(across(everything(), mean))
# shop_id Assets Liabilities sale profit
# 1 Shop A 8.0 5.333333 8.666667 2.333333
# 2 Shop B 5.0 9.000000 15.000000 6.000000
# 3 Shop C 5.5 10.000000 14.000000 7.000000
Answer 2 (score: 2)
rowsum might be helpful too:
rowsum(shop[-1], shop[[1]]) / table(shop[[1]])
# Assets Liabilities sale profit
#Shop A 8.0 5.333333 8.666667 2.333333
#Shop B 5.0 9.000000 15.000000 6.000000
#Shop C 5.5 10.000000 14.000000 7.000000
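The one-liner works because a group mean is just the group sum divided by the group size. A slightly more explicit sketch of the same idea, assuming all value columns are numeric (the names group_sums, group_n and group_means are only illustrative):

```r
# Same toy data as in the question
shop <- data.frame(
  shop_id = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop C", "Shop C"),
  Assets = c(2, 15, 7, 5, 8, 3),
  Liabilities = c(5, 3, 8, 9, 12, 8),
  sale = c(12, 5, 9, 15, 10, 18),
  profit = c(3, 1, 3, 6, 5, 9)
)

# Per-group column sums; a matrix input gives a matrix result
group_sums <- rowsum(as.matrix(shop[-1]), shop$shop_id)

# Number of rows per group, reordered to match the rows of group_sums
group_n <- table(shop$shop_id)[rownames(group_sums)]

# A vector with one entry per row recycles down each column,
# so row i of group_sums is divided by group_n[i]
group_means <- group_sums / as.vector(group_n)
group_means
```

Working on a matrix rather than a data frame may also help with very wide data, though I have not benchmarked it at 40,000 columns.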
Answer 3 (score: 1)
Using the ddply function from the plyr package:
> require("plyr")
> ddply(shop, ~shop_id, summarise, Assets=mean(Assets),
Liabilities=mean(Liabilities), sale=mean(sale), profit=mean(profit))
shop_id Assets Liabilities sale profit
1 Shop A 8.0 5.333333 8.666667 2.333333
2 Shop B 5.0 9.000000 15.000000 6.000000
3 Shop C 5.5 10.000000 14.000000 7.000000
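Spelling out every column as above is not practical with thousands of variables. A possible variant, assuming all the columns to be averaged are numeric, uses plyr's numcolwise() so that no column has to be named (shop_means is just an illustrative name):

```r
library(plyr)

# Same toy data as in the question
shop <- data.frame(
  shop_id = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop C", "Shop C"),
  Assets = c(2, 15, 7, 5, 8, 3),
  Liabilities = c(5, 3, 8, 9, 12, 8),
  sale = c(12, 5, 9, 15, 10, 18),
  profit = c(3, 1, 3, 6, 5, 9)
)

# numcolwise(mean) wraps mean() into a function that is applied to
# every numeric column of each group's piece of the data frame
shop_means <- ddply(shop, ~shop_id, numcolwise(mean))
shop_means
```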