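pandas.merge_asof merges two data frames, performing a left join, except that it matches on the nearest key rather than on equal keys.
Example (stolen from the documentation):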
>>> quotes
                     time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03
>>> trades
                     time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100
>>> pd.merge_asof(trades, quotes,
...               on='time',
...               by='ticker')
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN
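In the example above, pd.merge_asof matches each row of trades with the row of quotes that has the same ticker and the closest quote time. With the default direction='backward', "closest" means the most recent quote at or before the trade, which is why the AAPL trade at 13:30:00.048 gets NaN even though an AAPL quote exists at 13:30:00.049.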
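To make the matching rule concrete, here is a minimal pure-Python sketch of a backward as-of join on the example data. It is an illustration of the idea only, not how pandas implements it; the helper name asof_backward and the encoding of times as integer milliseconds after 13:30:00 are assumptions made for this sketch.

```python
from bisect import bisect_right
from collections import defaultdict

# Times are milliseconds after 13:30:00 on 2016-05-25, copied from the tables above.
quotes = [
    (23, "GOOG", 720.50, 720.93),
    (23, "MSFT", 51.95, 51.96),
    (30, "MSFT", 51.97, 51.98),
    (41, "MSFT", 51.99, 52.00),
    (48, "GOOG", 720.50, 720.93),
    (49, "AAPL", 97.99, 98.01),
    (72, "GOOG", 720.50, 720.88),
    (75, "MSFT", 52.01, 52.03),
]
trades = [
    (23, "MSFT", 51.95, 75),
    (38, "MSFT", 51.95, 155),
    (48, "GOOG", 720.77, 100),
    (48, "GOOG", 720.92, 100),
    (48, "AAPL", 98.00, 100),
]


def asof_backward(trades, quotes):
    """Attach to each trade the latest same-ticker quote at or before the trade time."""
    # Group quotes by ticker, sorted by time, so each group can be binary-searched.
    by_ticker = defaultdict(list)
    for time, ticker, bid, ask in sorted(quotes, key=lambda q: q[0]):
        by_ticker[ticker].append((time, bid, ask))

    merged = []
    for time, ticker, price, quantity in trades:
        rows = by_ticker.get(ticker, [])
        times = [r[0] for r in rows]
        pos = bisect_right(times, time)  # number of quotes with quote time <= trade time
        bid, ask = rows[pos - 1][1:] if pos else (None, None)
        merged.append((time, ticker, price, quantity, bid, ask))
    return merged


for row in asof_backward(trades, quotes):
    print(row)
```

Every trade is kept (a left join), and a trade with no earlier quote for its ticker gets None for bid and ask, mirroring the NaN row in the pandas output.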
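I find this operation indispensable in my workflow, and I have been racking my brain over how to do it in R. Of course, I could do it in Python and then read the data frame into R, but part of my motivation is to learn R.
Answer 0: (score: 1)
You can use the data.table package to do a rolling join: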
trades[quotes, on=.(ticker, time), roll=-Inf, c("bid","ask") := .(bid, ask)]
Output:
                  time ticker  price quantity    bid    ask
1: 2016-05-25 13:30:00   MSFT  51.95       75  51.95  51.96
2: 2016-05-25 13:30:00   MSFT  51.95      155  51.97  51.98
3: 2016-05-25 13:30:00   GOOG 720.77      100 720.50 720.93
4: 2016-05-25 13:30:00   GOOG 720.92      100 720.50 720.93
5: 2016-05-25 13:30:00   AAPL  98.00      100     NA     NA
Data:
library(data.table)
quotes <- fread("time ticker bid ask
2016-05-25_13:30:00.023 GOOG 720.50 720.93
2016-05-25_13:30:00.023 MSFT 51.95 51.96
2016-05-25_13:30:00.030 MSFT 51.97 51.98
2016-05-25_13:30:00.041 MSFT 51.99 52.00
2016-05-25_13:30:00.048 GOOG 720.50 720.93
2016-05-25_13:30:00.049 AAPL 97.99 98.01
2016-05-25_13:30:00.072 GOOG 720.50 720.88
2016-05-25_13:30:00.075 MSFT 52.01 52.03")
trades <- fread("time ticker price quantity
2016-05-25_13:30:00.023 MSFT 51.95 75
2016-05-25_13:30:00.038 MSFT 51.95 155
2016-05-25_13:30:00.048 GOOG 720.77 100
2016-05-25_13:30:00.048 GOOG 720.92 100
2016-05-25_13:30:00.048 AAPL 98.00 100")
quotes[, time := as.POSIXct(time, format="%Y-%m-%d_%H:%M:%OS")]
trades[, time := as.POSIXct(time, format="%Y-%m-%d_%H:%M:%OS")]
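Answer 1: (score: 1)
Complex joins can be done with SQL (the test input is shown reproducibly in the Note at the end). One advantage of this approach is that the SQL statement makes explicit which conditions are used.
Suppose you want to join on ticker with a time difference of less than .002 seconds: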
library(sqldf)
sqldf("select t.*, q.bid, q.ask
from trades t
left join quotes q on t.ticker = q.ticker and abs(q.time - t.time) < .002")
giving:
                 time ticker  price quantity    bid    ask
1 2016-05-25 13:30:00   MSFT  51.95       75  51.95  51.96
2 2016-05-25 13:30:00   MSFT  51.95      155     NA     NA
3 2016-05-25 13:30:00   GOOG 720.77      100 720.50 720.93
4 2016-05-25 13:30:00   GOOG 720.92      100 720.50 720.93
5 2016-05-25 13:30:00   AAPL  98.00      100  97.99  98.01
or join on ticker and the minimum time difference:
sqldf("select t.*, q.bid, q.ask, min(abs(q.time - t.time))
from trades t
left join quotes q on t.ticker = q.ticker
group by t.rowid")[1:6]
giving:
                 time ticker  price quantity    bid    ask
1 2016-05-25 13:30:00   MSFT  51.95       75  51.95  51.96
2 2016-05-25 13:30:00   MSFT  51.95      155  51.99  52.00
3 2016-05-25 13:30:00   GOOG 720.77      100 720.50 720.93
4 2016-05-25 13:30:00   GOOG 720.92      100 720.50 720.93
5 2016-05-25 13:30:00   AAPL  98.00      100  97.99  98.01
or join on the minimum difference among quotes within a time difference of 0.002 seconds:
sqldf("select t.*, q.bid, q.ask, min(abs(q.time - t.time))
from trades t
left join quotes q on t.ticker = q.ticker and abs(q.time - t.time) < 0.002
group by t.rowid")[1:6]
giving:
                 time ticker  price quantity    bid    ask
1 2016-05-25 13:30:00   MSFT  51.95       75  51.95  51.96
2 2016-05-25 13:30:00   MSFT  51.95      155     NA     NA
3 2016-05-25 13:30:00   GOOG 720.77      100 720.50 720.93
4 2016-05-25 13:30:00   GOOG 720.92      100 720.50 720.93
5 2016-05-25 13:30:00   AAPL  98.00      100  97.99  98.01
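For readers coming from the pandas side, the "nearest quote, optionally within a tolerance" semantics of the last two queries can be sketched in plain Python. This is only an illustration of the matching logic under stated assumptions: the helper name nearest_quote and its tolerance_ms parameter are hypothetical, and times are encoded as integer milliseconds after 13:30:00 so that 0.002 seconds becomes 2.

```python
# Times are milliseconds after 13:30:00 on 2016-05-25, copied from the example data.
quotes = [
    (23, "GOOG", 720.50, 720.93),
    (23, "MSFT", 51.95, 51.96),
    (30, "MSFT", 51.97, 51.98),
    (41, "MSFT", 51.99, 52.00),
    (48, "GOOG", 720.50, 720.93),
    (49, "AAPL", 97.99, 98.01),
    (72, "GOOG", 720.50, 720.88),
    (75, "MSFT", 52.01, 52.03),
]
trades = [
    (23, "MSFT", 51.95, 75),
    (38, "MSFT", 51.95, 155),
    (48, "GOOG", 720.77, 100),
    (48, "GOOG", 720.92, 100),
    (48, "AAPL", 98.00, 100),
]


def nearest_quote(trade, quotes, tolerance_ms=None):
    """Return (bid, ask) of the same-ticker quote closest in time, or (None, None)."""
    time, ticker = trade[0], trade[1]
    # Keep quotes for the same ticker, optionally only those strictly within the tolerance.
    candidates = [
        q for q in quotes
        if q[1] == ticker
        and (tolerance_ms is None or abs(q[0] - time) < tolerance_ms)
    ]
    if not candidates:
        return (None, None)
    # On ties, min() keeps the first candidate in list order.
    best = min(candidates, key=lambda q: abs(q[0] - time))
    return (best[2], best[3])


unbounded = [nearest_quote(t, quotes) for t in trades]
within_2ms = [nearest_quote(t, quotes, tolerance_ms=2) for t in trades]
print(unbounded)
print(within_2ms)
```

Unlike the backward-only match of the default pd.merge_asof call, this looks in both directions, which is why the AAPL trade picks up the quote that arrives one millisecond later.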
Note
Lines1 <- "
time ticker bid ask
0 2016-05-25T13:30:00.023 GOOG 720.50 720.93
1 2016-05-25T13:30:00.023 MSFT 51.95 51.96
2 2016-05-25T13:30:00.030 MSFT 51.97 51.98
3 2016-05-25T13:30:00.041 MSFT 51.99 52.00
4 2016-05-25T13:30:00.048 GOOG 720.50 720.93
5 2016-05-25T13:30:00.049 AAPL 97.99 98.01
6 2016-05-25T13:30:00.072 GOOG 720.50 720.88
7 2016-05-25T13:30:00.075 MSFT 52.01 52.03"
quotes <- read.table(text = Lines1, as.is = TRUE)
quotes <- transform(quotes, time = as.POSIXct(sub("T", " ", time)))
Lines2 <- "
time ticker price quantity
0 2016-05-25T13:30:00.023 MSFT 51.95 75
1 2016-05-25T13:30:00.038 MSFT 51.95 155
2 2016-05-25T13:30:00.048 GOOG 720.77 100
3 2016-05-25T13:30:00.048 GOOG 720.92 100
4 2016-05-25T13:30:00.048 AAPL 98.00 100"
trades <- read.table(text = Lines2, as.is = TRUE)
trades <- transform(trades, time = as.POSIXct(sub("T", " ", time)))
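Answer 2: (score: 0)
You can use the merge function to merge two data frames in R. Note that this matches on equal ticker values only, so each trade is paired with every quote for that ticker rather than with the nearest one: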
merge(trades,quotes,by="ticker",all=TRUE)
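As a rough check on how an equality join on ticker alone differs from an as-of join, the following sketch counts the rows such a join would produce for the example data. It only counts ticker pairings in plain Python and does not call merge itself; the list names are made up for the illustration.

```python
from collections import Counter

# Ticker columns copied from the quotes and trades tables in the question.
quote_tickers = ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"]
trade_tickers = ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"]

quotes_per_ticker = Counter(quote_tickers)

# An equality join pairs each trade with every quote that has the same ticker.
pairs_per_trade = [quotes_per_ticker[t] for t in trade_tickers]
total_rows = sum(pairs_per_trade)

print(pairs_per_trade)  # [4, 4, 3, 3, 1]
print(total_rows)       # 15
```

Because every ticker appears in both tables, the outer join adds no unmatched rows, so the result has 15 rows instead of the 5 rows (one per trade) that merge_asof returns.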
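Answer 3: (score: 0)
The fuzzyjoin package has exactly this functionality (joining on a condition). For example: How can I match fuzzy match strings from two datasets?
Answer 4: (score: 0)
You can also use the data.table package to perform a non-equi join: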
quotes[trades, on=.(ticker, time<=time), .(time=i.time, ticker, price, quantity, bid, ask), mult='last']
This gives more control and is easier to adapt to other matching conditions. The result is the same:
                      time ticker  price quantity    bid    ask
1: 2016-05-25 13:30:00.023   MSFT  51.95       75  51.95  51.96
2: 2016-05-25 13:30:00.038   MSFT  51.95      155  51.97  51.98
3: 2016-05-25 13:30:00.048   GOOG 720.77      100 720.50 720.93
4: 2016-05-25 13:30:00.048   GOOG 720.92      100 720.50 720.93
5: 2016-05-25 13:30:00.048   AAPL  98.00      100     NA     NA