I am trying hard to reproduce the classic pandas rolling join example, in which quotes data is merged with trades data.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html
Here is the data in data.table format:
trades <- data.table(time = c('2016-05-25 13:30:00.023',
'2016-05-25 13:30:00.038',
'2016-05-25 13:30:00.048',
'2016-05-25 13:30:00.048',
'2016-05-25 13:30:00.048'),
ticker = c('MSFT','MSFT','GOOG','GOOG','AAPL'),
price = c(51.95,51.95,720.77,720.92,98.0),
quantity = c(75,155,100,100,100))
> trades
time ticker price quantity
1: 2016-05-25 13:30:00.023 MSFT 51.95 75
2: 2016-05-25 13:30:00.038 MSFT 51.95 155
3: 2016-05-25 13:30:00.048 GOOG 720.77 100
4: 2016-05-25 13:30:00.048 GOOG 720.92 100
5: 2016-05-25 13:30:00.048 AAPL 98.00 100
and the quotes:
quotes <- data.table(time = c('2016-05-25 13:30:00.023',
'2016-05-25 13:30:00.023',
'2016-05-25 13:30:00.030',
'2016-05-25 13:30:00.041',
'2016-05-25 13:30:00.048',
'2016-05-25 13:30:00.049',
'2016-05-25 13:30:00.072',
'2016-05-25 13:30:00.075'),
ticker = c('GOOG','MSFT','MSFT','MSFT','GOOG','AAPL','GOOG','MSFT'),
bid = c(720.50, 51.95, 51.97, 51.99, 720.5,97.99,720.5,52.01),
ask = c(720.93,51.96,51.98,52.00,720.93,98.01,720.88,52.03))
> quotes
time ticker bid ask
1: 2016-05-25 13:30:00.023 GOOG 720.50 720.93
2: 2016-05-25 13:30:00.023 MSFT 51.95 51.96
3: 2016-05-25 13:30:00.030 MSFT 51.97 51.98
4: 2016-05-25 13:30:00.041 MSFT 51.99 52.00
5: 2016-05-25 13:30:00.048 GOOG 720.50 720.93
6: 2016-05-25 13:30:00.049 AAPL 97.99 98.01
7: 2016-05-25 13:30:00.072 GOOG 720.50 720.88
8: 2016-05-25 13:30:00.075 MSFT 52.01 52.03
What I would like to do is merge the trades data with the quotes data as follows.
The output (the same as in the pandas tutorial) should be:
time ticker price quantity bid ask
1: 2016-05-25 13:30:00.023 MSFT 51.95 75 NA NA
2: 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
3: 2016-05-25 13:30:00.048 GOOG 720.77 100 NA NA
4: 2016-05-25 13:30:00.048 GOOG 720.92 100 NA NA
5: 2016-05-25 13:30:00.048 AAPL 98.00 100 NA NA
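Indeed, you can see that the only possible quote match is for the second trade, at 2016-05-25 13:30:00.038, because the closest (previous) quote occurred at 2016-05-25 13:30:00.030, which is within 10 ms (and is not an exact match).
Despite my attempts, I have not been able to reproduce this in data.table. Any ideas?
Thanks!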
Answer 0 (score: 2)
You can also combine this idiom with a rolling join, which is similar, but not identical, to what @sindri_baldur proposed:
library(lubridate)
library(data.table)
quotes[, time := as.POSIXct(time, format="%Y-%m-%d %H:%M:%OS", tz = "GMT")]
trades[, time := as.POSIXct(time, format="%Y-%m-%d %H:%M:%OS", tz = "GMT")]
match_inexact <- function(q_time, t_time, bid, ask) {
exact <- q_time == t_time # exact matches get NA
bid[exact] <- NA_real_
ask[exact] <- NA_real_
list(bid, ask)
}
trades[, c("bid", "ask") := quotes[.SD,
match_inexact(x.time, i.time, x.bid, x.ask),
on = .(ticker, time),
roll = lubridate::dmilliseconds(10L)]]
Important thing to note: time is the last column specified for the join, because that is the column on which data.table will try to roll values.
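To make the point about column order concrete, here is a minimal sketch on toy data. The tables quotes_toy and trades_toy are hypothetical and not part of the question, and times are plain numbers standing in for milliseconds. The ticker column is matched exactly, while time, listed last in on, is rolled forward by at most 10 units.

```r
library(data.table)

# Hypothetical toy data: time is a plain number of milliseconds.
quotes_toy <- data.table(ticker = c("A", "A", "B"),
                         time   = c(1000, 1020, 1000),
                         bid    = c(10, 11, 20))
trades_toy <- data.table(ticker = c("A", "A", "B"),
                         time   = c(1005, 1040, 1005))

# ticker must match exactly; time is the last column in `on`,
# so it is the one that is rolled. roll = 10 carries the previous
# quote forward by at most 10 time units.
res <- quotes_toy[trades_toy, on = .(ticker, time), roll = 10]
print(res)
```

The second trade of ticker A is 20 units after the nearest earlier quote, so it falls outside the roll window and its bid should come back as NA.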
Answer 1 (score: 1)
Here is something that gets the job done (quick and dirty):
# Format as POSIXct*
quotes[, time := as.POSIXct(time, format="%Y-%m-%d %H:%M:%OS", tz = "GMT")]
trades[, time := as.POSIXct(time, format="%Y-%m-%d %H:%M:%OS", tz = "GMT")]
# Match the nearest time (in the right direction) for each ticker and add as column
trades[quotes, on = .(time > time, ticker), qtime := i.time]
# Remove if not within the time limit (10 milliseconds)
trades[(time - qtime) > 0.01, qtime := NA_real_]
# Now perform an equi-join after removing timestamps that were too distant
# (ticker is included so that quotes for other tickers at the same time cannot match)
trades[, c("bid", "ask") := quotes[trades, on = .(ticker, time = qtime), .(bid, ask)]]
trades[, !"qtime"] # show the result without this temporary column
# time ticker price quantity bid ask
# 1: 2016-05-25 13:30:00 MSFT 51.95 75 NA NA
# 2: 2016-05-25 13:30:00 MSFT 51.95 155 51.97 51.98
# 3: 2016-05-25 13:30:00 GOOG 720.77 100 NA NA
# 4: 2016-05-25 13:30:00 GOOG 720.92 100 NA NA
# 5: 2016-05-25 13:30:00 AAPL 98.00 100 NA NA
*POSIXct vectors are built on top of double vectors, where the value represents the number of seconds since 1970-01-01.
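As a small illustration of that footnote, the sketch below parses two of the timestamps from the question and checks that the gap between them is below 0.01 seconds, which is why 0.01 stands for 10 ms in the code. The variable names t_quote, t_trade and gap are only for illustration.

```r
# %OS keeps the fractional seconds when parsing.
t_quote <- as.POSIXct("2016-05-25 13:30:00.030",
                      format = "%Y-%m-%d %H:%M:%OS", tz = "GMT")
t_trade <- as.POSIXct("2016-05-25 13:30:00.038",
                      format = "%Y-%m-%d %H:%M:%OS", tz = "GMT")

# The underlying storage type is double (seconds since 1970-01-01).
print(typeof(t_quote))

# Gap in seconds between the trade and the preceding quote (about 0.008).
gap <- as.numeric(difftime(t_trade, t_quote, units = "secs"))
print(gap < 0.01)
```

Because the values are doubles, the gap is only approximately 0.008, so comparisons that land exactly on the 10 ms boundary may be sensitive to floating-point rounding.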
Here is a more concise version that uses the roll argument, which I learned from Alexis's post:
trades[, c("qtime", "bid", "ask") := quotes[.SD, roll = 0.01, on = .(ticker, time), .(x.time, bid, ask)]]
trades[time == qtime, c("bid", "ask") := NA_real_][, qtime := NULL]
Answer 2 (score: 1)
Another possible approach is a non-equi join that uses the latest quote within that 10 ms window:
options(digits.secs=3) #see https://stackoverflow.com/a/43475068/1989480
library(data.table)
quotes[, time := as.POSIXct(time, format="%Y-%m-%d %H:%M:%OS", tz = "GMT")]
trades[, time := as.POSIXct(time, format="%Y-%m-%d %H:%M:%OS", tz = "GMT")][,
c("start", "end") := .(time-0.01, time)]
trades[, c("bid", "ask") :=
quotes[trades, on=.(ticker, time>=start, time<end), mult="last", .(bid, ask)]
][, c("start", "end") := NULL]
Output:
time ticker price quantity bid ask
1: 2016-05-25 13:30:00.023 MSFT 51.95 75 NA NA
2: 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
3: 2016-05-25 13:30:00.048 GOOG 720.77 100 NA NA
4: 2016-05-25 13:30:00.048 GOOG 720.92 100 NA NA
5: 2016-05-25 13:30:00.048 AAPL 98.00 100 NA NA
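The half-open window used above (time >= start, time < end) can be sketched on toy data. The tables quotes_toy and trades_toy are hypothetical, and times are plain numbers standing in for milliseconds. The first trade has a quote at exactly its own time plus earlier quotes inside the window, a case the question's data does not contain.

```r
library(data.table)

# Hypothetical toy data: one ticker, times as plain milliseconds.
quotes_toy <- data.table(ticker = "A",
                         time   = c(1000, 1004, 1008),
                         bid    = c(10, 11, 12))
trades_toy <- data.table(ticker = "A", time = c(1008, 1030))

# Window [time - 10, time): the strict upper bound excludes exact matches.
trades_toy[, c("start", "end") := .(time - 10, time)]

# For each trade, keep only the last quote that falls inside its window.
res <- quotes_toy[trades_toy,
                  on = .(ticker, time >= start, time < end),
                  mult = "last", .(bid)]
print(res)
```

Here the first trade should pick up the quote at 1004 rather than the one at 1008, and the second trade has no quote in its window. Note that a rolling join followed by blanking exact matches would leave the first trade as NA in this situation, so the two approaches can differ when an exact match and an earlier in-window quote both exist.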