I want to find the maximum bid/ask spread per second. Suppose I have this quotes file:
In [1]: !head quotes.txt
exchtime|bid|ask
1389178814.587758|520.0000|541.0000
1389178830.462050|540.4300|540.8700
1389178830.462050|540.4300|540.8700
1389178830.468602|540.4300|540.8600
1389178830.468602|540.4300|540.8600
1389178847.67500|540.4300|540.8500
1389178847.67500|540.4300|540.8500
1389178847.73541|540.4300|540.8400
1389178847.73541|540.4300|540.8400
The timestamps are just seconds since the UTC epoch. With some trickery on the first column, I can read such a file:
import pandas as pd
import numpy as np
from datetime import datetime
def convert(x): return np.datetime64(datetime.fromtimestamp(float(x)).isoformat())
df = pd.read_csv('quotes.txt', sep='|', parse_dates=True, converters={0:convert})
This produces what I want:
In [10]: df.head()
Out[10]:
exchtime bid ask
0 2014-01-08 11:00:14.587758 520.00 541.00
1 2014-01-08 11:00:30.462050 540.43 540.87
2 2014-01-08 11:00:30.462050 540.43 540.87
3 2014-01-08 11:00:30.468602 540.43 540.86
4 2014-01-08 11:00:30.468602 540.43 540.86
The aggregation is what has me stumped. In q/kdb+ I would just do:
select spread:max ask-bid by exchtime.second from df
What I've come up with in pandas is:
df['spread'] = df.ask - df.bid
df['exchtime_sec'] = [e.replace(microsecond=0) for e in df.exchtime]
df.groupby('exchtime_sec')['spread'].agg(np.max)
This seems to work, but the exchtime_sec line takes three orders of magnitude longer to run than I expected! Is there a faster (and more concise) way to express this aggregation?
Answer 0 (score: 4)
Read the file in like this, without using a converter on the time column:
In [11]: df = pd.read_csv('quotes.txt', sep='|')
then convert the timestamps with to_datetime, which is much faster:
In [12]: df['exchtime'] = pd.to_datetime(df['exchtime'],unit='s')
In [13]: df
Out[13]:
exchtime bid ask
0 2014-01-08 11:00:14.587758 520.00 541.00
1 2014-01-08 11:00:30.462050 540.43 540.87
2 2014-01-08 11:00:30.462050 540.43 540.87
3 2014-01-08 11:00:30.468602 540.43 540.86
4 2014-01-08 11:00:30.468602 540.43 540.86
5 2014-01-08 11:00:47.675000 540.43 540.85
6 2014-01-08 11:00:47.675000 540.43 540.85
7 2014-01-08 11:00:47.735410 540.43 540.84
8 2014-01-08 11:00:47.735410 540.43 540.84
[9 rows x 3 columns]
Create the spread column:
In [15]: df['spread'] = df.ask-df.bid
Set the index to exchtime, resample at 1-second intervals, and take the max as the aggregator:
In [16]: df.set_index('exchtime').resample('1s').max()
Out[16]:
bid ask spread
exchtime
2014-01-08 11:00:14 520.00 541.00 21.00
2014-01-08 11:00:15 NaN NaN NaN
2014-01-08 11:00:16 NaN NaN NaN
2014-01-08 11:00:17 NaN NaN NaN
2014-01-08 11:00:18 NaN NaN NaN
2014-01-08 11:00:19 NaN NaN NaN
2014-01-08 11:00:20 NaN NaN NaN
2014-01-08 11:00:21 NaN NaN NaN
2014-01-08 11:00:22 NaN NaN NaN
2014-01-08 11:00:23 NaN NaN NaN
2014-01-08 11:00:24 NaN NaN NaN
2014-01-08 11:00:25 NaN NaN NaN
2014-01-08 11:00:26 NaN NaN NaN
2014-01-08 11:00:27 NaN NaN NaN
2014-01-08 11:00:28 NaN NaN NaN
2014-01-08 11:00:29 NaN NaN NaN
2014-01-08 11:00:30 540.43 540.87 0.44
2014-01-08 11:00:31 NaN NaN NaN
2014-01-08 11:00:32 NaN NaN NaN
2014-01-08 11:00:33 NaN NaN NaN
2014-01-08 11:00:34 NaN NaN NaN
2014-01-08 11:00:35 NaN NaN NaN
2014-01-08 11:00:36 NaN NaN NaN
2014-01-08 11:00:37 NaN NaN NaN
2014-01-08 11:00:38 NaN NaN NaN
2014-01-08 11:00:39 NaN NaN NaN
2014-01-08 11:00:40 NaN NaN NaN
2014-01-08 11:00:41 NaN NaN NaN
2014-01-08 11:00:42 NaN NaN NaN
2014-01-08 11:00:43 NaN NaN NaN
2014-01-08 11:00:44 NaN NaN NaN
2014-01-08 11:00:45 NaN NaN NaN
2014-01-08 11:00:46 NaN NaN NaN
2014-01-08 11:00:47 540.43 540.85 0.42
[34 rows x 3 columns]
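If the empty intermediate seconds are unwanted, the all-NaN rows can be dropped after resampling. A minimal sketch of the same pipeline (assuming dropping empty seconds is acceptable for your use):

df.set_index('exchtime').resample('1s').max().dropna()  # keep only seconds that actually had quotes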
Performance comparison (the list comprehension constructs a Python datetime for every row, while resample stays vectorized):
In [10]: df = pd.DataFrame(np.random.randn(100000,2),index=pd.date_range('20130101',periods=100000,freq='50U'))
In [7]: def f1(df):
...: df = df.copy()
...: df['seconds'] = [ e.replace(microsecond=0) for e in df.index ]
...: df.groupby('seconds')[0].agg(np.max)
...:
In [11]: def f2(df):
....: df = df.copy()
....:     df.resample('1s').max()
....:
In [8]: %timeit f1(df)
1 loops, best of 3: 692 ms per loop
In [12]: %timeit f2(df)
100 loops, best of 3: 2.36 ms per loop
Here is another method, which is faster at lower frequencies (high/low are equivalent to max/min, with open being the first value in each bucket and close the last):
In [9]: df = pd.DataFrame(np.random.randn(100000,2),index=pd.date_range('20130101',periods=100000,freq='50L'))
In [10]: df.groupby(pd.Grouper(freq='1s'))[0].ohlc()
In [11]: %timeit df.groupby(pd.Grouper(freq='1s'))[0].ohlc()
1000 loops, best of 3: 1.2 ms per loop
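Pulling the pieces together, here is a minimal end-to-end sketch against current pandas, assuming the quotes.txt layout from the question; max_spread_per_sec is an illustrative name, not anything from the original post:

import pandas as pd

def max_spread_per_sec(path='quotes.txt'):
    # exchtime arrives as float epoch seconds in a pipe-delimited file
    df = pd.read_csv(path, sep='|')
    # vectorized epoch-seconds -> datetime64 conversion (the fast path shown above)
    df['exchtime'] = pd.to_datetime(df['exchtime'], unit='s')
    df['spread'] = df.ask - df.bid
    # max spread per 1-second bucket; dropna removes seconds with no quotes
    return df.set_index('exchtime')['spread'].resample('1s').max().dropna()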