Slow performance of pandas Timestamp vs. datetime

Asked: 2015-03-21 19:17:00

Tags: python performance datetime numpy pandas

I seem to be getting unexpectedly slow arithmetic performance from pandas.Timestamp compared to regular Python datetime() objects.
Here is a benchmark:

import datetime
import pandas
import numpy

# using datetime:
def test1():
    d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
    d2 = datetime.datetime(2015, 3, 20, 10, 0, 15)
    delta = datetime.timedelta(minutes=30)

    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1

# using pandas:
def test2():
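    # pandas.datetime was only an alias for datetime.datetime (removed in
    # pandas 2.0); pandas.Timestamp is the type this test is meant to exercise.
    d1 = pandas.Timestamp(2015, 3, 20, 10, 0, 0)
    d2 = pandas.Timestamp(2015, 3, 20, 10, 0, 15)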
    delta = pandas.Timedelta(minutes=30)

    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1

# using numpy
def test3():
    d1 = numpy.datetime64('2015-03-20 10:00:00')
    d2 = numpy.datetime64('2015-03-20 10:00:15')
    delta = numpy.timedelta64(30, 'm')

    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1


time1 = datetime.datetime.now()
test1()
time2 = datetime.datetime.now()
test2()
time3 = datetime.datetime.now()
test3()
time4 = datetime.datetime.now()

print('DELTA test1: ' + str(time2-time1))
print('DELTA test2: ' + str(time3-time2))
print('DELTA test3: ' + str(time4-time3))

The corresponding results on my machine (Python 3.3, pandas 0.15.2):

DELTA test1: 0:00:00.131698
DELTA test2: 0:00:10.034970
DELTA test3: 0:00:05.233389

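Is this expected? Is there a way to eliminate the performance problem other than switching the code to Python's default datetime implementation wherever possible?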

1 Answer:

Answer 0: (score: 1)

I get similar results on my machine:

$ python -mtimeit -s "from datetime import datetime, timedelta; d1, d2 = datetime(2015, 3, 20, 10, 0, 0), datetime(2015, 3, 20, 10, 0, 15); delta = timedelta(minutes=30)" "(d2 - d1) > delta"
10000000 loops, best of 3: 0.107 usec per loop
$ python -mtimeit -s "from numpy import datetime64, timedelta64; d1, d2 = datetime64('2015-03-20T10:00:00Z'), datetime64('2015-03-20T10:00:15Z'); delta = timedelta64(30, 'm')" "(d2 - d1) > delta"
100000 loops, best of 3: 5.35 usec per loop
$ python -mtimeit -s "from pandas import Timestamp, Timedelta; d1, d2 = Timestamp('2015-03-20T10:00:00Z'), Timestamp('2015-03-20T10:00:15Z'); delta = Timedelta(minutes=30)" "(d2 - d1) > delta"
10000 loops, best of 3: 19.9 usec per loop

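In these measurements, datetime is about 50 times faster than the corresponding numpy analog and almost 200 times faster than the pandas one.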

$ python -c "import numpy, pandas; print(numpy.__version__, pandas.__version__)"
('1.9.2', '0.15.2')

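It is not clear why the difference is so large. It is true that numpy and pandas code is optimized for vectorized operations. But it is not obvious why these particular scalar operations are two orders of magnitude slower. For example, adding an explicit timezone does not slow down the datetime.datetime code: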

$ python3 -mtimeit -s "from datetime import datetime, timedelta, timezone; d1, d2 = datetime(2015, 3, 20, 10, 0, 0, tzinfo=timezone.utc), datetime(2015, 3, 20, 10, 0, 15, tzinfo=timezone.utc); delta = timedelta(minutes=30)" "(d2 - d1) > delta"
10000000 loops, best of 3: 0.0939 usec per loop
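
To illustrate the remark about vectorized operations, here is a minimal sketch that does the same subtraction and comparison once over whole numpy arrays instead of per element in a Python loop. The helper name count_exceeding and the sample arrays are made up for illustration, and I have not timed this against the loops above:

```python
import numpy as np

def count_exceeding(start, end, delta):
    """Count elements where end - start > delta, using one vectorized pass."""
    return int(np.count_nonzero((end - start) > delta))

n = 500000
delta = np.timedelta64(30, 'm')

# Hypothetical data: the same pair of instants as in the question, repeated n times.
start = np.full(n, np.datetime64('2015-03-20T10:00:00'), dtype='datetime64[s]')
end = np.full(n, np.datetime64('2015-03-20T10:00:15'), dtype='datetime64[s]')

# Every difference is 15 seconds, so nothing exceeds 30 minutes.
same_pair_count = count_exceeding(start, end, delta)

# Hypothetical data: end times that drift one second further per element.
drifting_end = start + np.arange(n).astype('timedelta64[s]')
drifting_count = count_exceeding(start, drifting_end, delta)

print(same_pair_count, drifting_count)
```

This only helps when the values are already available as arrays (or as a pandas Series or DatetimeIndex). It does not speed up code that has to handle one scalar at a time.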

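As a workaround, if you can't use vectorized operations, you could try converting the native date/time types to simpler (and faster) analogs, e.g., POSIX timestamps represented as floats.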
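
A minimal sketch of that workaround, assuming the values can be converted once up front and that float precision is acceptable for your use case. It uses only the standard library; the variable names are illustrative:

```python
from datetime import datetime, timedelta, timezone

d1 = datetime(2015, 3, 20, 10, 0, 0, tzinfo=timezone.utc)
d2 = datetime(2015, 3, 20, 10, 0, 15, tzinfo=timezone.utc)
delta = timedelta(minutes=30)

# Convert once, outside the hot loop: seconds since the epoch as plain floats.
t1 = d1.timestamp()
t2 = d2.timestamp()
delta_seconds = delta.total_seconds()

count = 0
for _ in range(500000):
    # Plain float subtraction and comparison, no date/time objects involved.
    if t2 - t1 > delta_seconds:
        count += 1

print(count)
```

Recent pandas versions also provide Timestamp.timestamp() and Timedelta.total_seconds(), so the same one-time conversion should be possible there. Check that these exist in the version you are running.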