I have a dataset consisting of date_time and account (both character) and amount (numeric), as shown below.
Sample data:
date_time <- as.character(c('2018-01-22 18:18:00','2018-01-22 18:18:05','2018-01-22 18:18:19','2018-01-22 18:18:00','2018-01-22 18:30:12','2018-01-22 18:18:11'))
account <- as.character(c('a0001','a0001','a0001','b0001','b0001','c0001'))
amount <- c(1000,200,300,10000,400,10000)
df.sample <- data.frame(date_time, account, amount)
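For each transaction, I want to return TRUE if, for the same account and within 1 minute, the total count is >= 2 and the total amount is >= 12000; otherwise FALSE.
I wrote a function using dplyr, as follows: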
library(dplyr)

# Relies on the globals time.interval, agg.count and agg.amount defined below.
# Example arguments: v.acct <- '5408044133161021'
#                    v.date.time <- as.POSIXct('2018-01-22 18:18:11')
simulation <- function(df, v.acct, v.date.time) {
  # Keep this account's earlier transactions that fall inside the window
  df2 <- df %>%
    mutate(date.time = as.POSIXct(date_time),
           diff.time = difftime(v.date.time, date.time, units = c('mins'))) %>%
    filter(account %in% v.acct, diff.time <= time.interval, diff.time > 0)
  # Count and total amount per account
  df.summary <- df2 %>%
    group_by(account) %>%
    summarise(agg.cnt = n(),
              agg.amt = sum(amount))
  # TRUE if any account passes both thresholds
  n.hits <- df.summary %>% filter(agg.cnt >= agg.count, agg.amt >= agg.amount) %>% nrow()
  result <- n.hits > 0
  return(result)
}
This then returns a vector of TRUE or FALSE values:
time.interval <- 10
agg.count <- 10
agg.amount <- 20000
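v.result <- apply(df.sample[, c(1, 2)], 1, function(x) simulation(df.sample, x[2], x[1]))
Question: the code above does return the result, but if the dataset has more than 90,000 observations, the computation takes a very long time. Is there an alternative approach? Thanks.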
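Answer 0: (score: 1)
Assuming the OP doesn't mind a data.table solution, you can use a non-equi self-join to find, for each transaction, the instances within 1 minute of it. by=.EACHI tells data.table to perform the join for each row of i=df; see ?data.table for what i and .EACHI mean.
Then check whether the count is greater than or equal to agg.count and whether the total is greater than or equal to agg.amount.
Data: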
date_time <- as.character(c('2018-01-22 18:18:00','2018-01-22 18:18:05','2018-01-22 18:18:19','2018-01-22 18:18:00','2018-01-22 18:30:12','2018-01-22 18:18:11'))
account <- c('a0001','a0001','a0001','b0001','b0001','c0001')
amount <- c(1000,200,300,10000,400,10000)
df <- data.frame(date_time, account, amount)
time.interval <- 60
agg.count <- 10
agg.amount <- 20000
Code:
library(data.table)
setDT(df)
df[, date_time := as.POSIXct(date_time, format="%Y-%m-%d %H:%M:%S")]
df[, oneMinLater := date_time + time.interval]
df[, hit :=
df[df,
.N >= agg.count & sum(amount, na.rm=TRUE) >= agg.amount,
by=.EACHI,
on=.(account, date_time > date_time, date_time <= oneMinLater)]$V1
]
Output:
date_time account amount oneMinLater hit
1: 2018-01-22 18:18:00 a0001 1000 2018-01-22 18:19:00 FALSE
2: 2018-01-22 18:18:05 a0001 200 2018-01-22 18:19:05 FALSE
3: 2018-01-22 18:18:19 a0001 300 2018-01-22 18:19:19 FALSE
4: 2018-01-22 18:18:00 b0001 10000 2018-01-22 18:19:00 FALSE
5: 2018-01-22 18:30:12 b0001 400 2018-01-22 18:31:12 FALSE
6: 2018-01-22 18:18:11 c0001 10000 2018-01-22 18:19:11 FALSE
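Note that the join above looks forward from each transaction (date_time > date_time, date_time <= oneMinLater), whereas the function in the question looks back (diff.time > 0 and diff.time <= time.interval). Below is a minimal sketch of a look-back variant, assuming that is the behaviour wanted. The names dt, window.secs, min.count, min.amount and window_start are hypothetical, and the thresholds are chosen only so that the sample data produces a hit:

```r
library(data.table)

# Sample data from the question; tz is fixed so the result is reproducible
dt <- data.table(
  date_time = as.POSIXct(c("2018-01-22 18:18:00", "2018-01-22 18:18:05",
                           "2018-01-22 18:18:19", "2018-01-22 18:18:00",
                           "2018-01-22 18:30:12", "2018-01-22 18:18:11"),
                         tz = "UTC"),
  account = c("a0001", "a0001", "a0001", "b0001", "b0001", "c0001"),
  amount  = c(1000, 200, 300, 10000, 400, 10000)
)

# Hypothetical illustration thresholds (not the OP's values)
window.secs <- 60
min.count   <- 2
min.amount  <- 1200

# Start of the look-back window for each transaction
dt[, window_start := date_time - window.secs]

# For each row of i, aggregate the same account's earlier rows with
# window_start <= date_time < (this row's date_time)
dt[, hit := dt[dt,
               .N >= min.count & sum(amount, na.rm = TRUE) >= min.amount,
               by = .EACHI,
               on = .(account, date_time < date_time, date_time >= window_start)]$V1]

print(dt[, .(date_time, account, amount, hit)])
```

With these thresholds only the third row (a0001 at 18:18:19) should be flagged, since the two earlier a0001 transactions in the preceding minute add up to 1200.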
Answer 1: (score: 0)
This might be one solution:
library(data.table)
library(lubridate)
library(zoo)
setDT(df)
df[, date.time := as.POSIXct(date_time, format="%Y-%m-%d %H:%M:%S")]
df[, time.diff := difftime(date.time,min(date.time), units='mins')+0.0001, by=account]
df[, interval := ceiling(time.diff / dminutes(time.interval)), by=account]
df[, agg.cnt:=seq_len(.N), by=.(account, interval)]
df[, agg.amt2:=cumsum(amount), by=.(account, interval)]
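The code above stops at the running count and running amount within fixed intervals measured from each account's first transaction. A rough sketch of the same idea carried through to a TRUE/FALSE flag, using only data.table and plain numeric seconds, might look like this. The names dt2, window.secs, min.count, min.amount, secs, bin, run.cnt and run.amt are hypothetical, the thresholds are for illustration only, and the rows are assumed to be sorted by time within each account:

```r
library(data.table)

dt2 <- data.table(
  date_time = as.POSIXct(c("2018-01-22 18:18:00", "2018-01-22 18:18:05",
                           "2018-01-22 18:18:19", "2018-01-22 18:18:00",
                           "2018-01-22 18:30:12", "2018-01-22 18:18:11"),
                         tz = "UTC"),
  account = c("a0001", "a0001", "a0001", "b0001", "b0001", "c0001"),
  amount  = c(1000, 200, 300, 10000, 400, 10000)
)

# Hypothetical illustration thresholds (not the OP's values)
window.secs <- 60
min.count   <- 2
min.amount  <- 1200

# Seconds elapsed since the account's first transaction
dt2[, secs := as.numeric(difftime(date_time, min(date_time), units = "secs")),
    by = account]
# Fixed, non-overlapping bins of window.secs seconds per account
dt2[, bin := floor(secs / window.secs) + 1]
# Running count and running amount inside each bin (current row included)
dt2[, run.cnt := seq_len(.N), by = .(account, bin)]
dt2[, run.amt := cumsum(amount), by = .(account, bin)]
# Flag rows where both running totals reach the thresholds
dt2[, hit := run.cnt >= min.count & run.amt >= min.amount]

print(dt2[, .(date_time, account, amount, bin, run.cnt, run.amt, hit)])
```

Because the bins are fixed rather than sliding, two transactions less than a minute apart can land in different bins and be missed; the non-equi join in the other answer does not have that limitation.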