Trying to improve the efficiency of an R function that works with time series data

Time: 2016-06-30 02:16:43

Tags: r data.table

I have 1 billion rows of time and sales data that look like this:

datetime               price 
"2016-05-01 18:00:02"  2060.75
"2016-05-01 18:00:22"  2060.50
"2016-05-01 18:00:35"  2060.50
"2016-05-01 18:01:59"  2060.75
"2016-05-01 18:03:21"  2061.00
"2016-05-01 18:03:21"  2061.25
"2016-05-01 18:03:42"  2061.00
"2016-05-01 18:04:22"  2061.00
"2016-05-01 18:04:25"  2061.25
"2016-05-01 18:04:44"  2061.50
"2016-05-01 18:06:41"  2061.50

I have a function that returns the most recent price at each one-minute interval:

datetime               price 
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:02:00"  2060.75
"2016-05-01 18:03:00"  2060.75
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:05:00"  2061.50
"2016-05-01 18:06:00"  2061.50
"2016-05-01 18:07:00"  2061.50

My function first rounds each time up to the next whole minute:

datetime               price 
"2016-05-01 18:01:00"  2060.75
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:02:00"  2060.75
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:04:00"  2061.25
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:05:00"  2061.00
"2016-05-01 18:05:00"  2061.25
"2016-05-01 18:05:00"  2061.50
"2016-05-01 18:07:00"  2061.50

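Then, starting from the bottom and moving up, it removes the duplicated rows, so that only the last row for each minute remains: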

datetime               price 
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:02:00"  2060.75
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:05:00"  2061.50
"2016-05-01 18:07:00"  2061.50

Then it adds the missing minutes, carrying the previous price forward:

datetime               price 
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:02:00"  2060.75
"2016-05-01 18:03:00"  2060.75
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:05:00"  2061.50
"2016-05-01 18:06:00"  2061.50
"2016-05-01 18:07:00"  2061.50
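
For reference, the three steps above can be sketched in data.table roughly as follows. This is an illustrative reconstruction under assumptions, not the original function: the names `ticks`, `last_per_minute`, `all_minutes` and `filled` are made up for the example, the time zone is fixed to UTC to keep it reproducible, and `nafill()` needs a reasonably recent data.table version.

```r
library(data.table)

# Sample ticks from the question (tz = "UTC" is an assumption)
ticks <- data.table(
  datetime = as.POSIXct(c(
    "2016-05-01 18:00:02", "2016-05-01 18:00:22", "2016-05-01 18:00:35",
    "2016-05-01 18:01:59", "2016-05-01 18:03:21", "2016-05-01 18:03:21",
    "2016-05-01 18:03:42", "2016-05-01 18:04:22", "2016-05-01 18:04:25",
    "2016-05-01 18:04:44", "2016-05-01 18:06:41"), tz = "UTC"),
  price = c(2060.75, 2060.50, 2060.50, 2060.75, 2061.00, 2061.25,
            2061.00, 2061.00, 2061.25, 2061.50, 2061.50)
)

# Step 1: round each timestamp up to the next whole minute
ticks[, minute := as.POSIXct(ceiling(as.numeric(datetime) / 60) * 60,
                             origin = "1970-01-01", tz = "UTC")]

# Step 2: keep only the last row within each minute
last_per_minute <- unique(ticks, by = "minute", fromLast = TRUE)[, .(minute, price)]

# Step 3: add the missing minutes and carry the previous price forward
all_minutes <- data.table(
  minute = seq(min(ticks$minute), max(ticks$minute), by = "1 min")
)
filled <- merge(all_minutes, last_per_minute, by = "minute", all.x = TRUE)
filled[, price := nafill(price, type = "locf")]
```

On the sample data, `filled` should contain the seven minutes from 18:01 to 18:07 with the prices shown in the last table.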

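I have tried many different functions, and this is the fastest approach I could find, but it is still slow. I think there must be a more efficient way to do this that I haven't thought of. Can anyone help?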

1 Answer:

Answer 0: (score: 3)

You can do this in two steps using a rolling join from library(data.table).

Create a data.table of all the "minutes" of interest

dt_minutes <- data.table(datetime = seq(as.POSIXct("2016-05-01 18:00:00"), 
                                    length.out = 10, 
                                    by = "mins"))
dt_minutes

#                datetime
# 1: 2016-05-01 18:00:00
# 2: 2016-05-01 18:01:00
# 3: 2016-05-01 18:02:00
# 4: 2016-05-01 18:03:00
# 5: 2016-05-01 18:04:00
# 6: 2016-05-01 18:05:00
# 7: 2016-05-01 18:06:00
# 8: 2016-05-01 18:07:00
# 9: 2016-05-01 18:08:00
# 10: 2016-05-01 18:09:00

Use a rolling join to get the most recent price for each minute

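## dt must be a data.table, and datetime must be POSIXct in both tables
# library(data.table)
# setDT(dt)
# dt[, datetime := as.POSIXct(datetime)]  # only if datetime was read as character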
dt[dt_minutes, roll = TRUE, on = "datetime"]

#               datetime   price
# 1: 2016-05-01 18:00:00      NA
# 2: 2016-05-01 18:01:00 2060.50
# 3: 2016-05-01 18:02:00 2060.75
# 4: 2016-05-01 18:03:00 2060.75
# 5: 2016-05-01 18:04:00 2061.00
# 6: 2016-05-01 18:05:00 2061.50
# 7: 2016-05-01 18:06:00 2061.50
# 8: 2016-05-01 18:07:00 2061.50
# 9: 2016-05-01 18:08:00 2061.50
# 10: 2016-05-01 18:09:00 2061.50
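
With real data you would probably derive the minute grid from the data instead of hard-coding the start time and length. A minimal sketch of that idea, assuming the timestamps are POSIXct in UTC; `to_next_minute()` is a hypothetical helper written for this example, not a data.table function:

```r
library(data.table)

# Sample data from the question (tz = "UTC" is an assumption)
dt <- data.table(
  datetime = as.POSIXct(c(
    "2016-05-01 18:00:02", "2016-05-01 18:00:22", "2016-05-01 18:00:35",
    "2016-05-01 18:01:59", "2016-05-01 18:03:21", "2016-05-01 18:03:21",
    "2016-05-01 18:03:42", "2016-05-01 18:04:22", "2016-05-01 18:04:25",
    "2016-05-01 18:04:44", "2016-05-01 18:06:41"), tz = "UTC"),
  price = c(2060.75, 2060.50, 2060.50, 2060.75, 2061.00, 2061.25,
            2061.00, 2061.00, 2061.25, 2061.50, 2061.50)
)

# Hypothetical helper: round a POSIXct vector up to the next whole minute
to_next_minute <- function(x) {
  as.POSIXct(ceiling(as.numeric(x) / 60) * 60, origin = "1970-01-01", tz = "UTC")
}

# One row per minute, from the first minute boundary after the earliest tick
# to the first minute boundary after the latest tick
dt_minutes <- data.table(
  datetime = seq(to_next_minute(min(dt$datetime)),
                 to_next_minute(max(dt$datetime)),
                 by = "1 min")
)

# roll = TRUE carries the last observation at or before each minute forward
result <- dt[dt_minutes, on = "datetime", roll = TRUE]
```

Because the grid starts at the first minute boundary after the earliest tick, there is no leading NA row, and on the sample data `result` should run from 18:01 to 18:07 like the output the question asks for.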

Data

library(data.table)

dt <- fread('datetime               price 
"2016-05-01 18:00:02"  2060.75
"2016-05-01 18:00:22"  2060.50
"2016-05-01 18:00:35"  2060.50
"2016-05-01 18:01:59"  2060.75
"2016-05-01 18:03:21"  2061.00
"2016-05-01 18:03:21"  2061.25
"2016-05-01 18:03:42"  2061.00
"2016-05-01 18:04:22"  2061.00
"2016-05-01 18:04:25"  2061.25
"2016-05-01 18:04:44"  2061.50
"2016-05-01 18:06:41"  2061.50', header = TRUE)

Here's a good blog post on rolling joins to help you get started.
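
As a quick illustration of what the `roll` argument controls, here is a toy sketch with made-up tables `x` and `i` (not part of the data above). It compares rolling forward, rolling backward and rolling to the nearest value:

```r
library(data.table)

# Toy lookup table and query points (made up for illustration)
x <- data.table(t = c(1, 5, 10), v = c("a", "b", "c"))
i <- data.table(t = c(0, 2, 8, 12))

# roll = TRUE: last observation carried forward (previous value in x)
locf <- x[i, on = "t", roll = TRUE]

# roll = -Inf: next observation carried backward (next value in x)
nocb <- x[i, on = "t", roll = -Inf]

# roll = "nearest": closest value in x, in either direction
nearest <- x[i, on = "t", roll = "nearest"]

locf$v     # NA  "a" "b" "c"
nocb$v     # "a" "b" "c" NA
nearest$v  # "a" "a" "c" "c"
```

The "most recent price per minute" problem is the `roll = TRUE` case: each minute picks up the last tick at or before it.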