I have a data frame with 3 columns: custId, saleDate, DelivDate.
> head(events22)
custId saleDate DelivDate
1 280356593 2012-11-14 14:04:59 11/14/12 17:29
2 280367076 2012-11-14 17:04:44 11/14/12 20:48
3 280380097 2012-11-14 17:38:34 11/14/12 20:45
4 280380095 2012-11-14 20:45:44 11/14/12 23:59
5 280380095 2012-11-14 20:31:39 11/14/12 23:49
6 280380095 2012-11-14 19:58:32 11/15/12 00:10
Here is the dput:
> dput(events22)
structure(list(custId = c(280356593L, 280367076L, 280380097L,
280380095L, 280380095L, 280380095L, 280364279L, 280364279L, 280398506L,
280336395L, 280364376L, 280368458L, 280368458L, 280368456L, 280368456L,
280364225L, 280391721L, 280353458L, 280387607L, 280387607L),
saleDate = structure(c(1352901899.215, 1352912684.484, 1352914714.971,
1352925944.429, 1352925099.247, 1352923112.636, 1352922476.55,
1352920666.968, 1352915226.534, 1352911135.077, 1352921349.592,
1352911494.975, 1352910529.86, 1352924755.295, 1352907511.476,
1352920108.577, 1352906160.883, 1352905925.134, 1352916810.309,
1352916025.673), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
DelivDate = c("11/14/12 17:29", "11/14/12 20:48", "11/14/12 20:45",
"11/14/12 23:59", "11/14/12 23:49", "11/15/12 00:10", "11/14/12 23:35",
"11/14/12 22:59", "11/14/12 20:53", "11/14/12 19:52", "11/14/12 23:01",
"11/14/12 19:47", "11/14/12 19:42", "11/14/12 23:31", "11/14/12 23:33",
"11/14/12 22:45", "11/14/12 18:11", "11/14/12 18:12", "11/14/12 19:17",
"11/14/12 19:19")), .Names = c("custId", "saleDate", "DelivDate"
), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20"
), class = "data.frame")
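I am trying to find the DelivDate for the most recent saleDate for each custId.
I can do this using plyr::ddply like this:
dd1 <- ddply(events22, .(custId), .inform = T, function(x){
x[x$saleDate == max(x$saleDate),"DelivDate"]
})
My question is whether there is a faster way to do this, since the ddply approach is somewhat time-consuming (the full dataset is about 400k rows). I have looked at using aggregate(), but I don't know how to get a value other than the one I am sorting on.
Any suggestions?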
EDIT:
Here are the benchmark results on 10k rows at 10 iterations.
EDIT2: Although AGG2() is the fastest, it does not give the correct answer.
test replications elapsed relative user.self
2 AGG2() 10 5.96 1.000 5.93
1 AGG1() 10 20.87 3.502 20.75
5 DATATABLE() 10 61.32 10.289 60.31
3 DDPLY() 10 80.04 13.430 79.63
4 DOCALL() 10 90.43 15.173 88.39
Answer 0 (score: 10)
I would also recommend data.table here, but since you asked for an aggregate solution, here is one that combines aggregate and merge to get all the columns:
merge(events22, aggregate(saleDate ~ custId, events22, max))
If you only want the "custId" and "DelivDate" columns, aggregate alone is enough:
aggregate(list(DelivDate = events22$saleDate),
list(custId = events22$custId),
function(x) events22[["DelivDate"]][which.max(x)])
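Note that the snippet above applies a position computed inside each group (which.max(x)) to the full DelivDate column, so it can pick a value from the wrong row; this matches the question's EDIT2 remark that AGG2() is incorrect. A minimal sketch of one way to stay with aggregate alone is to aggregate row indices instead, so the within-group position can be mapped back to the full data. The data frame ev below is a small made-up example with the same column names, not the original events22:

```r
# Hypothetical sample data with the same column names as events22
ev <- data.frame(
  custId    = c(1L, 1L, 2L, 2L, 2L),
  saleDate  = as.POSIXct(c("2012-11-14 14:00:00", "2012-11-14 15:00:00",
                           "2012-11-14 18:00:00", "2012-11-14 16:00:00",
                           "2012-11-14 17:00:00"), tz = "UTC"),
  DelivDate = c("11/14/12 17:00", "11/14/12 18:00",
                "11/14/12 21:00", "11/14/12 19:00", "11/14/12 20:00"),
  stringsAsFactors = FALSE
)

# Aggregate over row numbers: `i` holds the row indices of one custId group,
# so i[which.max(...)] is the row of the latest saleDate in the full data frame.
res <- aggregate(list(DelivDate = seq_len(nrow(ev))),
                 list(custId = ev$custId),
                 function(i) ev$DelivDate[i[which.max(ev$saleDate[i])]])
print(res)
```

Whether this stays as fast as the original AGG2() is something to check with the benchmark on the real data.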
Finally, here is an option using sqldf:
library(sqldf)
sqldf("select custId, DelivDate, max(saleDate) `saleDate`
from events22 group by custId")
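I'm not a benchmarking or data.table expert, but it surprised me that data.table is not faster here. I suspect the results would be quite different on a larger dataset, for example your 400k rows. In any case, here is some benchmarking code modeled after an answer by @mnel, so you can run tests on your actual dataset for future reference.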
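library(rbenchmark)
library(plyr)        # ddply
library(data.table)
library(sqldf)
dt <- data.table(events22)  # DATATABLE() below uses dt; assumed to be created beforehand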
First, set up the functions you want to benchmark.
DDPLY <- function() {
x <- ddply(events22, .(custId), .inform = T,
function(x) {
x[x$saleDate == max(x$saleDate),"DelivDate"]})
}
DATATABLE <- function() { x <- dt[, .SD[which.max(saleDate), ], by = custId] }
AGG1 <- function() {
x <- merge(events22, aggregate(saleDate ~ custId, events22, max)) }
AGG2 <- function() {
x <- aggregate(list(DelivDate = events22$saleDate),
list(custId = events22$custId),
function(x) events22[["DelivDate"]][which.max(x)]) }
SQLDF <- function() {
x <- sqldf("select custId, DelivDate, max(saleDate) `saleDate`
from events22 group by custId") }
DOCALL <- function() {
do.call(rbind,
lapply(split(events22, events22$custId), function(x){
x[which.max(x$saleDate), ]
})
)
}
Second, run the benchmark.
benchmark(DDPLY(), DATATABLE(), AGG1(), AGG2(), SQLDF(), DOCALL(),
order = "elapsed")[1:5]
# test replications elapsed relative user.self
# 4 AGG2() 100 0.285 1.000 0.284
# 3 AGG1() 100 0.891 3.126 0.896
# 6 DOCALL() 100 1.202 4.218 1.204
# 2 DATATABLE() 100 1.251 4.389 1.248
# 1 DDPLY() 100 1.254 4.400 1.252
# 5 SQLDF() 100 2.109 7.400 2.108
Answer 1 (score: 7)
Between ddply and aggregate, I would expect aggregate to be faster, especially on data as large as yours. However, the fastest would be data.table.
require(data.table)
dt <- data.table(events22)
dt[, .SD[which.max(saleDate),], by=custId]
From ?data.table: .SD is a data.table containing the subset of x's data for each group, excluding the group column(s).
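Building .SD for every group can be costly when there are many groups. A commonly used alternative, sketched here on a small made-up table rather than the original events22, is to collect the row number of each group's latest sale with .I and then subset the table once. Whether it is actually faster on a given dataset is worth benchmarking:

```r
library(data.table)

# Hypothetical sample data with the same column names as events22
dt_small <- data.table(
  custId    = c(1L, 1L, 2L, 2L, 2L),
  saleDate  = as.POSIXct(c("2012-11-14 14:00:00", "2012-11-14 15:00:00",
                           "2012-11-14 18:00:00", "2012-11-14 16:00:00",
                           "2012-11-14 17:00:00"), tz = "UTC"),
  DelivDate = c("11/14/12 17:00", "11/14/12 18:00",
                "11/14/12 21:00", "11/14/12 19:00", "11/14/12 20:00")
)

# .I holds the row numbers of dt_small for the current group;
# the unnamed result column is called V1.
idx <- dt_small[, .I[which.max(saleDate)], by = custId]$V1

# One subset of the whole table instead of one .SD subset per group
res <- dt_small[idx]
print(res)
```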
Answer 2 (score: 3)
This should be pretty fast, but data.table is probably faster:
do.call(rbind,
lapply(split(events22, events22$custId), function(x){
x[which.max(x$saleDate), ]
})
)
Answer 3 (score: 2)
Here is a faster data.table function:
DATATABLE <- function() {
dt <- data.table(events, key=c('custId', 'saleDate'))
dt[, maxrow := 1:.N==.N, by = custId]
return(dt[maxrow==TRUE, list(custId, DelivDate)])
}
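Note that this function creates the data.table and sorts the data, which is a step you only need to perform once. If you remove this step (perhaps you have a multi-step data processing pipeline and create the data.table once, as a first step), the function is twice as fast.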
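The sort-then-take-the-last-row idea does not depend on data.table. As a rough base R sketch (using a small made-up data frame ev, not the original events), one can order by custId and saleDate and keep the last row of each custId:

```r
# Hypothetical sample data with the same column names as events22
ev <- data.frame(
  custId    = c(1L, 1L, 2L, 2L, 2L),
  saleDate  = as.POSIXct(c("2012-11-14 14:00:00", "2012-11-14 15:00:00",
                           "2012-11-14 18:00:00", "2012-11-14 16:00:00",
                           "2012-11-14 17:00:00"), tz = "UTC"),
  DelivDate = c("11/14/12 17:00", "11/14/12 18:00",
                "11/14/12 21:00", "11/14/12 19:00", "11/14/12 20:00"),
  stringsAsFactors = FALSE
)

# Sort once by customer, then by sale time (ascending)
sorted <- ev[order(ev$custId, ev$saleDate), ]

# After sorting, the last row of each custId is its most recent sale;
# duplicated(..., fromLast = TRUE) is FALSE exactly for those last rows.
last_rows <- sorted[!duplicated(sorted$custId, fromLast = TRUE),
                    c("custId", "DelivDate")]
print(last_rows)
```

This version was not part of the benchmarks in this thread, so its timing relative to the other functions is untested here.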
I also modified all the previous functions to return their results, for easier comparison:
DDPLY <- function() {
return(ddply(events, .(custId), .inform = T,
function(x) {
x[x$saleDate == max(x$saleDate),"DelivDate"]}))
}
AGG1 <- function() {
return(merge(events, aggregate(saleDate ~ custId, events, max)))}
SQLDF <- function() {
return(sqldf("select custId, DelivDate, max(saleDate) `saleDate`
from events group by custId"))}
DOCALL <- function() {
return(do.call(rbind,
lapply(split(events, events$custId), function(x){
x[which.max(x$saleDate), ]
})
))
}
Here are the results on 10k rows, with 10 replications:
library(rbenchmark)
library(plyr)
library(data.table)
library(sqldf)
events <- do.call(rbind, lapply(1:500, function(x) events22))
events$custId <- sample(1:nrow(events), nrow(events))
benchmark(a <- DDPLY(), b <- DATATABLE(), c <- AGG1(), d <- SQLDF(),
e <- DOCALL(), order = "elapsed", replications=10)[1:5]
test replications elapsed relative user.self
2 b <- DATATABLE() 10 0.13 1.000 0.13
4 d <- SQLDF() 10 0.42 3.231 0.41
3 c <- AGG1() 10 12.11 93.154 12.03
1 a <- DDPLY() 10 32.17 247.462 32.01
5 e <- DOCALL() 10 56.05 431.154 55.85
Since all the functions return their results, we can verify that they all return the same answer:
c <- c[order(c$custId),]
dim(a); dim(b); dim(c); dim(d); dim(e)
all(a$V1==b$DelivDate)
all(a$V1==c$DelivDate)
all(a$V1==d$DelivDate)
all(a$V1==e$DelivDate)
/Edit: On the smaller 20-row dataset, data.table is still the fastest, but by a thinner margin:
test replications elapsed relative user.self
2 b <- DATATABLE() 100 0.22 1.000 0.22
3 c <- AGG1() 100 0.42 1.909 0.42
5 e <- DOCALL() 100 0.48 2.182 0.49
1 a <- DDPLY() 100 0.55 2.500 0.55
4 d <- SQLDF() 100 1.00 4.545 0.98
/Edit2: If we remove the data.table creation from the function, we get the following results:
dt <- data.table(events, key=c('custId', 'saleDate'))
DATATABLE2 <- function() {
dt[, maxrow := 1:.N==.N, by = custId]
return(dt[maxrow==TRUE, list(custId, DelivDate)])
}
benchmark(a <- DDPLY(), b <- DATATABLE2(), c <- AGG1(), d <- SQLDF(),
e <- DOCALL(), order = "elapsed", replications=10)[1:5]
test replications elapsed relative user.self
2 b <- DATATABLE2() 10 0.09 1.000 0.08
4 d <- SQLDF() 10 0.41 4.556 0.39
3 c <- AGG1() 10 11.73 130.333 11.67
1 a <- DDPLY() 10 31.59 351.000 31.50
5 e <- DOCALL() 10 55.05 611.667 54.91