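I have the following code. It runs fine on this toy dataset. However, when I apply the exact same code to a large dataset of 7.5 million rows with 2.5 million unique accountIds, it repeatedly crashes the R session.
Any ideas about what I might be doing wrong? Should I be doing something differently to make dplyr scale better?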
library(dplyr)
fakedata <- data.frame(accountId = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5),
                       amount = c(22.5, 22.5, 22.5, 59, 82, 102, 44,
                                  44, 64, 64, 202.5, 202.5),
                       date = c('2014-01-03', '2014-02-03', '2014-03-04',
                                '2015-04-01', '2015-05-01', '2014-02-08',
                                '2012-10-06', '2012-11-06', '2012-12-06',
                                '2013-01-06', '2014-06-02', '2014-09-03'))
fakedata
repeats <- fakedata %>%
  group_by(accountId) %>%
  summarise(repeated = n(),
            repeatb = (repeated > 1),
            diffamt = (n_distinct(amount) > 1),
            initamt = amount[which.min(date)],
            lastamt = amount[which.max(date)],
            higheramt = (lastamt > initamt))
repeats