I have 1 billion rows of time-and-sales data that look like this:
datetime price
"2016-05-01 18:00:02" 2060.75
"2016-05-01 18:00:22" 2060.50
"2016-05-01 18:00:35" 2060.50
"2016-05-01 18:01:59" 2060.75
"2016-05-01 18:03:21" 2061.00
"2016-05-01 18:03:21" 2061.25
"2016-05-01 18:03:42" 2061.00
"2016-05-01 18:04:22" 2061.00
"2016-05-01 18:04:25" 2061.25
"2016-05-01 18:04:44" 2061.50
"2016-05-01 18:06:41" 2061.50
I have a function that gives the most recent price at each one-minute interval:
datetime price
"2016-05-01 18:01:00" 2060.50
"2016-05-01 18:02:00" 2060.75
"2016-05-01 18:03:00" 2060.75
"2016-05-01 18:04:00" 2061.00
"2016-05-01 18:05:00" 2061.50
"2016-05-01 18:06:00" 2061.50
"2016-05-01 18:07:00" 2061.50
My function first rounds each time up to the next minute:
datetime price
"2016-05-01 18:01:00" 2060.75
"2016-05-01 18:01:00" 2060.50
"2016-05-01 18:01:00" 2060.50
"2016-05-01 18:02:00" 2060.75
"2016-05-01 18:04:00" 2061.00
"2016-05-01 18:04:00" 2061.25
"2016-05-01 18:04:00" 2061.00
"2016-05-01 18:05:00" 2061.00
"2016-05-01 18:05:00" 2061.25
"2016-05-01 18:05:00" 2061.50
"2016-05-01 18:07:00" 2061.50
Then, starting from the bottom and moving up, it removes rows with a duplicate datetime:
datetime price
"2016-05-01 18:01:00" 2060.50
"2016-05-01 18:02:00" 2060.75
"2016-05-01 18:04:00" 2061.00
"2016-05-01 18:05:00" 2061.50
"2016-05-01 18:07:00" 2061.50
Then it adds the missing minutes:
datetime price
"2016-05-01 18:01:00" 2060.50
"2016-05-01 18:02:00" 2060.75
"2016-05-01 18:03:00" 2060.75
"2016-05-01 18:04:00" 2061.00
"2016-05-01 18:05:00" 2061.50
"2016-05-01 18:06:00" 2061.50
"2016-05-01 18:07:00" 2061.50
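For reference, the three steps above could be sketched in R roughly as follows. This is an illustration only, not the function from the question: it assumes the data.table package, a shortened copy of the sample data, and `tz = "UTC"`, and the names `minute`, `last_per_minute`, `all_minutes` and `filled` are made up for the example.

```r
library(data.table)

# Shortened, hypothetical copy of the sample data
dt <- data.table(
  datetime = as.POSIXct(c("2016-05-01 18:00:02", "2016-05-01 18:00:22",
                          "2016-05-01 18:01:59", "2016-05-01 18:03:21",
                          "2016-05-01 18:03:42"), tz = "UTC"),
  price = c(2060.75, 2060.50, 2060.75, 2061.25, 2061.00)
)

# Step 1: round every timestamp up to the next whole minute
dt[, minute := as.POSIXct(ceiling(as.numeric(datetime) / 60) * 60,
                          origin = "1970-01-01", tz = "UTC")]

# Step 2: scanning from the bottom, keep only the last row of each minute
last_per_minute <- unique(dt, by = "minute", fromLast = TRUE)[, .(minute, price)]

# Step 3: add the missing minutes and carry the previous price forward
all_minutes <- data.table(
  minute = seq(min(last_per_minute$minute), max(last_per_minute$minute), by = "min")
)
filled <- last_per_minute[all_minutes, on = "minute"]
filled[, price := nafill(price, type = "locf")]

filled$price
# 2060.50 2060.75 2060.75 2061.00
```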
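I've tried many different functions, and this is the fastest approach I could find, but it is still slow. I think there must be a more efficient way to do this that I just can't think of. Can anyone help?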
Answer 0 (score: 3)
You can do this in two steps using a rolling join from library(data.table).
Create a data.table of all the "minutes" of interest
dt_minutes <- data.table(datetime = seq(as.POSIXct("2016-05-01 18:00:00"),
                                        length.out = 10,
                                        by = "mins"))
dt_minutes
# datetime
# 1: 2016-05-01 18:00:00
# 2: 2016-05-01 18:01:00
# 3: 2016-05-01 18:02:00
# 4: 2016-05-01 18:03:00
# 5: 2016-05-01 18:04:00
# 6: 2016-05-01 18:05:00
# 7: 2016-05-01 18:06:00
# 8: 2016-05-01 18:07:00
# 9: 2016-05-01 18:08:00
# 10: 2016-05-01 18:09:00
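With a billion rows you would probably not hard-code the start time and length. A minimal sketch of deriving the minute grid from the data's own range is shown below; `trade_times` is a hypothetical stand-in for the real datetime column, and `tz = "UTC"` is an assumption you should replace with the time zone of your data.

```r
library(data.table)

# Hypothetical first and last trade times standing in for the real column
trade_times <- as.POSIXct(c("2016-05-01 18:00:02", "2016-05-01 18:06:41"),
                          tz = "UTC")

# Round a POSIXct value up to the next whole minute
# (a value already on a minute boundary is left unchanged)
ceiling_minute <- function(x) {
  as.POSIXct(ceiling(as.numeric(x) / 60) * 60, origin = "1970-01-01", tz = "UTC")
}

# One row per minute, from the first to the last rounded trade time
dt_minutes <- data.table(
  datetime = seq(ceiling_minute(min(trade_times)),
                 ceiling_minute(max(trade_times)),
                 by = "min")
)

nrow(dt_minutes)
# 7  (18:01:00 through 18:07:00)
```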
Use a rolling join to get the latest price for each minute
## dt must be a data.table, and dt$datetime must be POSIXct like dt_minutes$datetime
# library(data.table)
# setDT(dt)
dt[dt_minutes, roll = TRUE, on = "datetime"]
# datetime price
# 1: 2016-05-01 18:00:00 NA
# 2: 2016-05-01 18:01:00 2060.50
# 3: 2016-05-01 18:02:00 2060.75
# 4: 2016-05-01 18:03:00 2060.75
# 5: 2016-05-01 18:04:00 2061.00
# 6: 2016-05-01 18:05:00 2061.50
# 7: 2016-05-01 18:06:00 2061.50
# 8: 2016-05-01 18:07:00 2061.50
# 9: 2016-05-01 18:08:00 2061.50
# 10: 2016-05-01 18:09:00 2061.50
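To make the behaviour of `roll = TRUE` easier to check, here is a small self-contained sketch on four trades. The tables `trades` and `minutes` are made up for the example and `tz = "UTC"` is an assumption. For each minute in `minutes`, the join returns the price of the last trade at or before that minute, and `NA` when no earlier trade exists.

```r
library(data.table)

# Hypothetical miniature version of the trade data
trades <- data.table(
  datetime = as.POSIXct(c("2016-05-01 18:00:02", "2016-05-01 18:00:35",
                          "2016-05-01 18:01:59", "2016-05-01 18:03:21"),
                        tz = "UTC"),
  price = c(2060.75, 2060.50, 2060.75, 2061.00)
)

# Minute grid 18:00 .. 18:04
minutes <- data.table(
  datetime = seq(as.POSIXct("2016-05-01 18:00:00", tz = "UTC"),
                 length.out = 5, by = "min")
)

# Rolling join: the most recent trade is carried forward to each minute
res <- trades[minutes, roll = TRUE, on = "datetime"]

res$price
# NA 2060.50 2060.75 2060.75 2061.00

# Drop minutes that fall before the first trade, if they are not wanted
res_known <- res[!is.na(price)]
```

The 18:00 row is `NA` because no trade happened at or before 18:00:00, which matches the first row of the output above.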
Data
library(data.table)
dt <- fread('datetime price
"2016-05-01 18:00:02" 2060.75
"2016-05-01 18:00:22" 2060.50
"2016-05-01 18:00:35" 2060.50
"2016-05-01 18:01:59" 2060.75
"2016-05-01 18:03:21" 2061.00
"2016-05-01 18:03:21" 2061.25
"2016-05-01 18:03:42" 2061.00
"2016-05-01 18:04:22" 2061.00
"2016-05-01 18:04:25" 2061.25
"2016-05-01 18:04:44" 2061.50
"2016-05-01 18:06:41" 2061.50', header = TRUE)
## make datetime POSIXct in the session time zone so it can be joined to dt_minutes
dt[, datetime := as.POSIXct(as.character(datetime))]
Here's a good blog post on rolling joins to help you get started.