R dplyr: aggregate count or amount within a time interval (in minutes)

Time: 2018-06-25 08:51:55

Tags: r dplyr

I have a dataset consisting of date_time and account (both character) and amount (numeric), as shown below:

Sample data:
    date_time <- as.character(c('2018-01-22 18:18:00','2018-01-22 18:18:05','2018-01-22 18:18:19','2018-01-22 18:18:00','2018-01-22 18:30:12','2018-01-22 18:18:11'))
    account <- as.character(c('a0001','a0001','a0001','b0001','b0001','c0001'))
    amount <- c(1000,200,300,10000,400,10000)
    df.sample <- data.frame(date_time, account, amount)

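For each transaction, I want to return TRUE if, for the same account and within 1 minute, the total count is >= 2 and the total amount is >= 12000; otherwise I want to return FALSE.

I wrote a function using dplyr, as follows: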

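library(dplyr)

# Relies on the globals time.interval (in minutes), agg.count and agg.amount.
simulation <- function(df, v.acct, v.date.time) {

  # v.acct <- '5408044133161021'
  # v.date.time <- as.POSIXct('2018-01-22 18:18:11')
  # time.interval <- 120

  # subset: same account, strictly earlier than v.date.time, at most time.interval minutes back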
  df2 <- df %>% 
          mutate(date.time=as.POSIXct(date_time),
                 diff.time=difftime(v.date.time, date.time, units=c('mins'))) %>%
          filter(account %in% v.acct,  diff.time <= time.interval, diff.time > 0) 

  df.summary <- df2 %>% 
                  group_by(account) %>%
                  summarise(agg.cnt=n(),
                            agg.amt=sum(amount))

  # number of accounts that meet both thresholds (avoids shadowing base::nrow)
  n.hits <- df.summary %>% filter(agg.cnt >= agg.count, agg.amt >= agg.amount) %>% nrow()

  result <- n.hits > 0

  return(result)

}

Applying it to every row then returns a vector of TRUE/FALSE values:

time.interval <- 10
agg.count <- 10
agg.amount <- 20000
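# pass the data frame itself as the first argument; x[1] is date_time, x[2] is account
v.result <- apply(df.sample[, c(1, 2)], 1, function(x) simulation(df.sample, x[2], x[1]))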

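Question: The code above returns the result, but when the dataset has more than 90,000 observations the computation takes a very long time. Is there an alternative approach? Thanks.

2 answers:

Answer 0 (score: 1)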

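Assuming the OP does not mind a data.table solution, you can use a non-equi self-join to find the rows that fall within 1 minute of each transaction (by=.EACHI tells data.table to evaluate the join separately for each row of i=df). See ?data.table for the meaning of i and .EACHI.

Then check whether the count is greater than or equal to agg.count and whether the total amount is greater than or equal to agg.amount.

Data: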

date_time <- as.character(c('2018-01-22 18:18:00','2018-01-22 18:18:05','2018-01-22 18:18:19','2018-01-22 18:18:00','2018-01-22 18:30:12','2018-01-22 18:18:11'))
account <- c('a0001','a0001','a0001','b0001','b0001','c0001')
amount <- c(1000,200,300,10000,400,10000)
df <- data.frame(date_time, account, amount)

time.interval <- 60
agg.count <- 10
agg.amount <- 20000

Code:

library(data.table)
setDT(df)
df[, date_time := as.POSIXct(date_time, format="%Y-%m-%d %H:%M:%S")]
df[, oneMinLater := date_time + time.interval]
df[, hit :=
    df[df, 
    .N >= agg.count & sum(amount, na.rm=TRUE) >= agg.amount,
    by=.EACHI, 
    on=.(account, date_time > date_time, date_time <= oneMinLater)]$V1
]

Output:

             date_time account amount         oneMinLater   hit
1: 2018-01-22 18:18:00   a0001   1000 2018-01-22 18:19:00 FALSE
2: 2018-01-22 18:18:05   a0001    200 2018-01-22 18:19:05 FALSE
3: 2018-01-22 18:18:19   a0001    300 2018-01-22 18:19:19 FALSE
4: 2018-01-22 18:18:00   b0001  10000 2018-01-22 18:19:00 FALSE
5: 2018-01-22 18:30:12   b0001    400 2018-01-22 18:31:12 FALSE
6: 2018-01-22 18:18:11   c0001  10000 2018-01-22 18:19:11 FALSE
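
As a sanity check on the join window, the same rule can be written as a brute-force loop in base R. This is only a sketch for small data (it is quadratic in the number of rows), not a replacement for the join. The helper name window_hit, its arguments, and the data frame name df.check are made up for illustration. It assumes the window covers transactions of the same account that are strictly later than each row and at most window.secs seconds later, mirroring the on= condition above.

```r
# Sample data from the question
date_time <- c('2018-01-22 18:18:00', '2018-01-22 18:18:05', '2018-01-22 18:18:19',
               '2018-01-22 18:18:00', '2018-01-22 18:30:12', '2018-01-22 18:18:11')
account <- c('a0001', 'a0001', 'a0001', 'b0001', 'b0001', 'c0001')
amount <- c(1000, 200, 300, 10000, 400, 10000)
df.check <- data.frame(date_time, account, amount, stringsAsFactors = FALSE)

# Hypothetical helper: for each row, look at rows of the same account that are
# strictly later and at most window.secs seconds later, then test both thresholds.
window_hit <- function(df, window.secs, min.count, min.amount) {
  t <- as.POSIXct(df$date_time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  vapply(seq_len(nrow(df)), function(i) {
    in.window <- df$account == df$account[i] &
      t > t[i] & t <= t[i] + window.secs
    sum(in.window) >= min.count && sum(df$amount[in.window]) >= min.amount
  }, logical(1))
}

# Thresholds used in the answer: no row qualifies
hit.strict <- window_hit(df.check, 60, 10, 20000)

# Looser, illustrative thresholds: only the first a0001 row has two later
# transactions (200 + 300 = 500) inside its 60-second window
hit.loose <- window_hit(df.check, 60, 2, 500)
```

With the answer's thresholds every element of hit.strict is FALSE, which agrees with the hit column shown above.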

Answer 1 (score: 0)

This could be one solution:

    library(data.table)
    library(lubridate)
    library(zoo)
    setDT(df)
    df[, date.time := as.POSIXct(date_time, format="%Y-%m-%d %H:%M:%S")]
    df[, time.diff := difftime(date.time,min(date.time), units='mins')+0.0001, by=account]
    df[, interval := ceiling(time.diff / dminutes(time.interval)), by=account]
    df[, agg.cnt:=seq_len(.N), by=.(account, interval)]
    df[, agg.amt2:=cumsum(amount), by=.(account, interval)]
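
The snippet above stops at the running count and running amount per bin. A base R sketch of the same fixed-bin idea, carried through to a TRUE/FALSE flag, might look like the following. The 60-second bin width, the thresholds (2 and 1200), and the names df.bin, window.secs and hit are assumptions chosen for illustration. It also assumes rows are already in time order within each account. Note that fixed bins are not a sliding window: two transactions a few seconds apart can land in different bins.

```r
# Sample data from the question
date_time <- c('2018-01-22 18:18:00', '2018-01-22 18:18:05', '2018-01-22 18:18:19',
               '2018-01-22 18:18:00', '2018-01-22 18:30:12', '2018-01-22 18:18:11')
account <- c('a0001', 'a0001', 'a0001', 'b0001', 'b0001', 'c0001')
amount <- c(1000, 200, 300, 10000, 400, 10000)
df.bin <- data.frame(date_time, account, amount, stringsAsFactors = FALSE)

window.secs <- 60  # assumed bin width in seconds

# Seconds elapsed since each account's first transaction
t <- as.numeric(as.POSIXct(df.bin$date_time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"))
elapsed <- t - ave(t, df.bin$account, FUN = min)

# Fixed bin index per account, starting at 1
df.bin$interval <- floor(elapsed / window.secs) + 1

# Running count and running amount within each (account, interval) bin
df.bin$agg.cnt <- ave(df.bin$amount, df.bin$account, df.bin$interval, FUN = seq_along)
df.bin$agg.amt <- ave(df.bin$amount, df.bin$account, df.bin$interval, FUN = cumsum)

# Illustrative thresholds, not the ones from the question
df.bin$hit <- df.bin$agg.cnt >= 2 & df.bin$agg.amt >= 1200
```

Here the three a0001 transactions share bin 1, so the second and third rows are flagged once the running amount reaches 1200; the second b0001 transaction falls into bin 13 on its own.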