Performance problem with pandas and a datetime column

Asked: 2016-08-11 17:01:57

Tags: python pandas numpy dataframe

I have a pandas DataFrame with datetime64 objects in one of its columns.

    time    volume  complete    closeBid    closeAsk    openBid openAsk highBid highAsk lowBid  lowAsk  closeMid
0   2016-08-07 21:00:00+00:00   9   True    0.84734 0.84842 0.84706 0.84814 0.84734 0.84842 0.84706 0.84814 0.84788
1   2016-08-07 21:05:00+00:00   10  True    0.84735 0.84841 0.84752 0.84832 0.84752 0.84846 0.84712 0.8482  0.84788
2   2016-08-07 21:10:00+00:00   10  True    0.84742 0.84817 0.84739 0.84828 0.84757 0.84831 0.84735 0.84817 0.847795
3   2016-08-07 21:15:00+00:00   18  True    0.84732 0.84811 0.84737 0.84813 0.84737 0.84813 0.84721 0.8479  0.847715
4   2016-08-07 21:20:00+00:00   4   True    0.84755 0.84822 0.84739 0.84812 0.84755 0.84822 0.84739 0.84812 0.847885
5   2016-08-07 21:25:00+00:00   4   True    0.84769 0.84843 0.84758 0.84827 0.84769 0.84843 0.84758 0.84827 0.84806
6   2016-08-07 21:30:00+00:00   5   True    0.84764 0.84851 0.84768 0.84852 0.8478  0.84857 0.84764 0.84851 0.848075
7   2016-08-07 21:35:00+00:00   4   True    0.84755 0.84825 0.84762 0.84844 0.84765 0.84844 0.84755 0.84824 0.8479
8   2016-08-07 21:40:00+00:00   1   True    0.84759 0.84812 0.84759 0.84812 0.84759 0.84812 0.84759 0.84812 0.847855
9   2016-08-07 21:45:00+00:00   3   True    0.84727 0.84817 0.84743 0.8482  0.84743 0.84822 0.84727 0.84817 0.84772

My application follows this (simplified) structure:

from datetime import timedelta

class Runner():
    def execute_tick(self, clock_tick, previous_tick):
        candles = self.broker.get_new_candles(clock_tick, previous_tick)
        if not candles.empty:  # a non-empty DataFrame means new candles arrived
            run_calculations(candles)

class Broker():
    def get_new_candles(self, clock_tick, previous_tick):
        start = previous_tick - timedelta(minutes=1)
        end = clock_tick - timedelta(minutes=3)
        # df is the candle DataFrame shown above
        return df[(df.time > start) & (df.time <= end)]

While profiling the application I noticed that the calls to df[(df.time > start) & (df.time <= end)] cause the biggest performance problems, and I would like to know whether there is a way to speed these calls up.

Edit: I am adding some more information about the use case here (the source code is also available at https://github.com/jmelett/pyFxTrader)

  • The application takes a list of instruments (e.g. EUR_USD, USD_JPY, GBP_CHF) and pre-fetches the ticks/candles for each of them and for each of their timeframes (e.g. 5 minutes, 30 minutes, 1 hour, etc.). The initialized data is essentially a dict of instruments, each containing another dict with the candle data for the M5, M30 and H1 timeframes (a sketch of this structure follows the list).
  • Each "timeframe" is a pandas DataFrame like the one shown at the top.
  • A clock simulator is then used to query individual candles for a specific time (e.g. at 15:30:00, give me the last x "5-minute candles") for EUR_USD.
  • That slice of data is then used to "simulate" specific market conditions (e.g. if the average price rose 10% over the last hour, open a long market position).
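
A minimal sketch of what that pre-fetched structure could look like (the construction below is hypothetical; only the instrument names, timeframe keys and column names come from the description above):

import pandas as pd

# {instrument: {timeframe: candle DataFrame}}, filled by the pre-fetch step
candle_store = {
    instrument: {
        timeframe: pd.DataFrame(columns=[
            'time', 'volume', 'complete',
            'closeBid', 'closeAsk', 'openBid', 'openAsk',
            'highBid', 'highAsk', 'lowBid', 'lowAsk', 'closeMid',
        ])
        for timeframe in ('M5', 'M30', 'H1')
    }
    for instrument in ('EUR_USD', 'USD_JPY', 'GBP_CHF')
}

# e.g. the 5-minute candles for EUR_USD:
m5_candles = candle_store['EUR_USD']['M5']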

3 answers:

Answer 0 (score: 2):

If efficiency is your goal, I would use numpy for everything.

I rewrote get_new_candles as get_new_candles2:

def get_new_candles2(clock_tick, previous_tick):
    start = previous_tick - timedelta(minutes=1)
    end = clock_tick - timedelta(minutes=3)
    # Build the boolean mask on the raw numpy datetime64 array instead of the Series
    ge_start = df.time.values >= start.to_datetime64()
    le_end = df.time.values <= end.to_datetime64()
    mask = ge_start & le_end
    # Rebuild the result directly from numpy arrays
    return pd.DataFrame(df.values[mask], df.index[mask], df.columns)
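
(A usage note, not from the original answer: because the frame mixes datetime, bool and float columns, df.values is an object-dtype array here, so the DataFrame returned by get_new_candles2 ends up with object-dtype columns rather than the original dtypes.)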

Data setup

from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO
from datetime import timedelta
import pandas as pd

text = """time,volume,complete,closeBid,closeAsk,openBid,openAsk,highBid,highAsk,lowBid,lowAsk,closeMid
2016-08-07 21:00:00+00:00,9,True,0.84734,0.84842,0.84706,0.84814,0.84734,0.84842,0.84706,0.84814,0.84788
2016-08-07 21:05:00+00:00,10,True,0.84735,0.84841,0.84752,0.84832,0.84752,0.84846,0.84712,0.8482,0.84788
2016-08-07 21:10:00+00:00,10,True,0.84742,0.84817,0.84739,0.84828,0.84757,0.84831,0.84735,0.84817,0.847795
2016-08-07 21:15:00+00:00,18,True,0.84732,0.84811,0.84737,0.84813,0.84737,0.84813,0.84721,0.8479,0.847715
2016-08-07 21:20:00+00:00,4,True,0.84755,0.84822,0.84739,0.84812,0.84755,0.84822,0.84739,0.84812,0.847885
2016-08-07 21:25:00+00:00,4,True,0.84769,0.84843,0.84758,0.84827,0.84769,0.84843,0.84758,0.84827,0.84806
2016-08-07 21:30:00+00:00,5,True,0.84764,0.84851,0.84768,0.84852,0.8478,0.84857,0.84764,0.84851,0.848075
2016-08-07 21:35:00+00:00,4,True,0.84755,0.84825,0.84762,0.84844,0.84765,0.84844,0.84755,0.84824,0.8479
2016-08-07 21:40:00+00:00,1,True,0.84759,0.84812,0.84759,0.84812,0.84759,0.84812,0.84759,0.84812,0.847855
2016-08-07 21:45:00+00:00,3,True,0.84727,0.84817,0.84743,0.8482,0.84743,0.84822,0.84727,0.84817,0.84772
"""

df = pd.read_csv(StringIO(text), parse_dates=[0])

Test input variables

previous_tick = pd.to_datetime('2016-08-07 21:10:00')
clock_tick = pd.to_datetime('2016-08-07 21:45:00')
get_new_candles2(clock_tick, previous_tick)

(screenshot of the resulting DataFrame omitted)

Timing

(screenshot of the timing results omitted)

Answer 1 (score: 0):

I have learned that those datetime objects can become quite memory-hungry and computationally expensive, especially when they are set as the index (DatetimeIndex objects?).

I think your best bet is to convert df.time, start and end to UNIX timestamps (as plain integers, no longer datetime dtypes) and do simple integer comparisons.

A UNIX timestamp looks like this: 1471554233 (the time of this post). More on the topic: https://en.wikipedia.org/wiki/Unix_time

There are a few caveats when doing this (e.g. keep time zones in mind): Convert datetime to Unix timestamp and convert it back in python
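
A rough sketch of what that integer comparison could look like (not from the original answer; it assumes df.time is the datetime64 column from the question and that start and end are interpreted in the same time zone as that column, i.e. UTC here):

import pandas as pd

# Nanosecond UNIX timestamps for the column
# (.values converts tz-aware datetime data to UTC datetime64[ns])
time_ns = df['time'].values.astype('int64')

# Timestamp.value is also nanoseconds since the epoch
start_ns = pd.Timestamp(start).value
end_ns = pd.Timestamp(end).value

# Plain integer comparisons instead of datetime comparisons
mask = (time_ns > start_ns) & (time_ns <= end_ns)
candles = df[mask]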

Answer 2 (score: 0):

My guess is that you are already doing this in a relatively efficient way.

When working with time series, best practice is usually to use the timestamps as the DataFrame index rather than keeping them in a regular column; a plain RangeIndex as the index is not of much use. However, I ran a few tests on a (2650069, 2) DataFrame containing six months of trade price data for a given stock on a given exchange, and the results show that your approach (building a boolean array and using it to slice the DataFrame) appears to be about 10x faster than regular DatetimeIndex slicing (which I would have expected to be faster).

The data I tested with looks like this:

                                Price  Volume
time                                         
2016-02-10 11:16:15.951403000  6197.0   200.0
2016-02-10 11:16:16.241380000  6197.0   100.0
2016-02-10 11:16:16.521871000  6197.0   900.0
2016-02-10 11:16:16.541253000  6197.0   100.0
2016-02-10 11:16:16.592049000  6196.0   200.0

Setting up start/end

start = df.index[len(df)/4]      # integer division under Python 2; use // on Python 3
end = df.index[len(df)/4*3]

Test 1:

%%time
_ = df[start:end]  # Same for df.ix[start:end]

CPU times: user 413 ms, sys: 20 ms, total: 433 ms
Wall time: 430 ms

On the other hand, using your approach:

df = df.reset_index()
df.columns = ['time', 'Price', 'Volume']

Test 2:

%%time
u = (df['time'] > start) & (df['time'] <= end)

CPU times: user 21.2 ms, sys: 368 µs, total: 21.6 ms
Wall time: 20.4 ms

Test 3:

%%time
_ = df[u]

CPU times: user 10.4 ms, sys: 27.6 ms, total: 38.1 ms
Wall time: 36.8 ms

Test 4:

%%time
_ = df[(df['time'] > start) & (df['time'] <= end)]

CPU times: user 21.6 ms, sys: 24.3 ms, total: 45.9 ms
Wall time: 44.5 ms

Note: each code block corresponds to a Jupyter notebook cell together with its output. I am using the %%time magic because %%timeit usually introduces some caching that makes the code look faster than it actually is. Also, the kernel was restarted after each run.

I am not entirely sure why this happens (I thought slicing with a DatetimeIndex would make things faster), but I suppose it has to do with how numpy works under the hood in most cases: the datetime slicing operation probably generates a boolean array which numpy then uses internally to do the actual slicing. But don't quote me on that.
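
For completeness, here is a self-contained sketch (not part of the original answer) that reproduces the comparison on synthetic data; the row count follows the description above, but the prices are random and absolute timings will vary by machine:

import numpy as np
import pandas as pd

# Synthetic trade data: ~2.65 million rows of Price/Volume indexed by timestamp
# (one row every 6 seconds spans roughly six months)
n = 2650069
idx = pd.date_range('2016-02-10', periods=n, freq='6s')
df = pd.DataFrame({'Price': np.random.rand(n) * 6000,
                   'Volume': np.random.randint(1, 1000, n)}, index=idx)

start = df.index[n // 4]
end = df.index[n // 4 * 3]

# Test 1: label slicing on the DatetimeIndex (wrap in %%time in a notebook cell)
sliced = df[start:end]

# Tests 2-4: boolean-mask slicing on a regular column
df2 = df.reset_index().rename(columns={'index': 'time'})
mask = (df2['time'] > start) & (df2['time'] <= end)
masked = df2[mask]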