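Operations on multiple columns in a data frame based on group by

Date: 2015-03-06 23:27:12

Tags: r

I have the following data frame and want to get the median price by state for each year from 2003 onward.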

state freq   2003   2004   2005   2006   2007   2008   2009   2010   2011
MS    2     83000  88300  87000  94400  94400  94400  94400  94400  94400
MS    2     97000  98000 110200 115700 115700 115700 115700 115700 115700
LA    2     154300 164600 181300 149200 149200 149200 149200 149200 149200
LA    2     126800 139200 157100 144500 144500 144500 144500 144500 144500

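I'm still learning, so any help is appreciated. I thought I could use sqldf on the data frame.

2 answers:

Answer 0: (score: 2)

If I understand your goal correctly, you are looking for the aggregate() function, which applies a function to every column of a data.frame, split by a grouping variable. Here the first two columns (state and freq) are dropped so that only the year columns are summarised.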

aggregate(yourDf[ ,-(1:2)], by = list(yourDf$state), FUN = median) 
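As a worked illustration of the call above, here is a minimal sketch that rebuilds the sample data from the question and runs aggregate() on it. The name df1 is borrowed from the second answer, and naming the list element `state` is an optional addition (not in the original answer) so that the grouping column is not labelled Group.1.

```r
# Sample data copied from the question; check.names = FALSE keeps the
# year column names as "2003", "2004", ... instead of "X2003", "X2004", ...
df1 <- read.table(header = TRUE, check.names = FALSE, text = "
state freq   2003   2004   2005   2006   2007   2008   2009   2010   2011
MS    2     83000  88300  87000  94400  94400  94400  94400  94400  94400
MS    2     97000  98000 110200 115700 115700 115700 115700 115700 115700
LA    2     154300 164600 181300 149200 149200 149200 149200 149200 149200
LA    2     126800 139200 157100 144500 144500 144500 144500 144500 144500
")

# Drop state and freq, then take the median of each year column per state.
# Naming the list element (an assumption added here) labels the group column.
res <- aggregate(df1[, -(1:2)], by = list(state = df1$state), FUN = median)
print(res)
```

With only two rows per state, each median is simply the mean of the two prices, for example (83000 + 97000) / 2 = 90000 for MS in 2003.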

Answer 1: (score: 2)

Other options for large datasets are

library(dplyr)
df1 %>%
    group_by(state) %>% 
    summarise_each(funs(median), -2)
    #there are many options to select the variables
    #e.g. starts_with, ends_with, contains, matches, num_range, one_of..
    #summarise_each(funs(median), matches('^\\d+'))

#   state   2003   2004   2005   2006   2007   2008   2009   2010   2011
# 1    MS  90000  93150  98600 105050 105050 105050 105050 105050 105050
# 2    LA 140550 151900 169200 146850 146850 146850 146850 146850 146850
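In current dplyr releases, summarise_each() and funs() are deprecated in favour of across(). The following is a sketch of an equivalent call, assuming dplyr 1.0.0 or later and the sample data from the question stored as df1; it is not part of the original answer.

```r
library(dplyr)

# Sample data copied from the question (check.names = FALSE keeps "2003" etc.).
df1 <- read.table(header = TRUE, check.names = FALSE, text = "
state freq   2003   2004   2005   2006   2007   2008   2009   2010   2011
MS    2     83000  88300  87000  94400  94400  94400  94400  94400  94400
MS    2     97000  98000 110200 115700 115700 115700 115700 115700 115700
LA    2     154300 164600 181300 149200 149200 149200 149200 149200 149200
LA    2     126800 139200 157100 144500 144500 144500 144500 144500 144500
")

# across() never touches the grouping column, so -freq selects every
# remaining column except freq, i.e. the year columns.
res <- df1 %>%
  group_by(state) %>%
  summarise(across(-freq, median))
  # alternative selection: summarise(across(matches("^\\d+"), median))
print(res)
```

Selecting by name (-freq) rather than by position (-2) keeps the call working if the column order changes.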

Or

library(data.table)  
setDT(df1)[, lapply(.SD, median), by = state, .SDcols=2:ncol(df1)]
#  state freq  2003   2004   2005   2006   2007   2008   2009   2010   2011
#1:  MS    2  90000  93150  98600 105050 105050 105050 105050 105050 105050
#2:  LA    2 140550 151900 169200 146850 146850 146850 146850 146850 146850

Benchmarks

set.seed(42)
m1 <- matrix(rnorm(9*1e6), ncol=9, dimnames=list(NULL, 2003:2011))
set.seed(29)
d1 <- data.frame(state=sample(state.abb, 1e6, replace=TRUE), m1, 
    stringsAsFactors=FALSE, check.names=FALSE)

 agg <- function() { aggregate(d1[,-1], by=list(d1$state), FUN=median)}
 dply <- function() {d1 %>% group_by(state) %>% summarise_each(funs(median))}
 dtable <- function() {DT <- as.data.table(d1)
      DT[, lapply(.SD, median), by = state] }

 library(microbenchmark)
 microbenchmark(agg(), dply(), dtable(), times=10L, unit='relative')  
 #Unit: relative
 #    expr        min         lq       mean    median        uq        max neval cld
 #   agg() 20.8518599 23.0428495 23.3284269 24.702038 21.304252 25.9574602    10   b
 #  dply()  1.0000000  1.0000000  1.0000000  1.000000  1.000000  1.0000000    10  a
 #dtable()  0.9273991  0.9062682  0.9769268  1.014912  1.012644  0.9540644    10  a