我有以下数据框架,并希望从州到2003年获得中位数价格。
state freq 2003 2004 2005 2006 2007 2008 2009 2010 2011
MS 2 83000 88300 87000 94400 94400 94400 94400 94400 94400
MS 2 97000 98000 110200 115700 115700 115700 115700 115700 115700
LA 2 154300 164600 181300 149200 149200 149200 149200 149200 149200
LA 2 126800 139200 157100 144500 144500 144500 144500 144500 144500
我还在学习,所以任何帮助都会受到赞赏。我以为我可以在数据框上使用sqldf。
答案 0 :(得分:2)
如果我正确理解了您的目标,那么您正在寻找aggregate()
函数,该函数通过分组变量将函数应用于data.frame的所有列。
aggregate(yourDf[ ,-(1:2)], by = list(yourDf$state), FUN = median)
答案 1 :(得分:2)
大数据集的其他选项是
library(dplyr)
df1 %>%
group_by(state) %>%
summarise_each(funs(median), -2)
#there are many options to select the variables
#e.g. starts_with, end_with, contains, matches, num_range, one_of..
#summarise_each(funs(median), matches('^\\d+'))
# state 2003 2004 2005 2006 2007 2008 2009 2010 2011
# 1 MS 90000 93150 98600 105050 105050 105050 105050 105050 105050
# 2 LA 140550 151900 169200 146850 146850 146850 146850 146850 146850
或者
library(data.table)
setDT(df1)[, lapply(.SD, median), by = state, .SDcols=2:ncol(df1)]
# state freq 2003 2004 2005 2006 2007 2008 2009 2010 2011
#1: MS 2 90000 93150 98600 105050 105050 105050 105050 105050 105050
#2: LA 2 140550 151900 169200 146850 146850 146850 146850 146850 146850
set.seed(42)
m1 <- matrix(rnorm(9*1e6), ncol=9, dimnames=list(NULL, 2003:2011))
set.seed(29)
d1 <- data.frame(state=sample(state.abb, 1e6, replace=TRUE), m1,
stringsAsFactors=FALSE, check.names=FALSE)
agg <- function() { aggregate(d1[,-1], by=list(d1$state), FUN=median)}
dply <- function() {d1 %>% group_by(state) %>% summarise_each(funs(median))}
dtable <- function() {DT <- as.data.table(d1)
DT[, lapply(.SD, median), by = state] }
library(microbenchmark)
microbenchmark(agg(), dply(), dtable(), times=10L, unit='relative')
#Unit: relative
# expr min lq mean median uq max neval
# agg() 20.8518599 23.0428495 23.3284269 24.702038 21.304252 25.9574602 10
# dply() 1.0000000 1.0000000 1.0000000 1.000000 1.000000 1.0000000 10
#dtable() 0.9273991 0.9062682 0.9769268 1.014912 1.012644 0.9540644 10
# cld
# b
# a
# a