I seem to be getting unexpectedly slow arithmetic with pandas.Timestamp compared to Python's regular datetime() objects.
Here is a benchmark:
import datetime
import pandas
import numpy

# using datetime:
def test1():
    d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
    d2 = datetime.datetime(2015, 3, 20, 10, 0, 15)
    delta = datetime.timedelta(minutes=30)
    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1

# using pandas:
def test2():
    d1 = pandas.datetime(2015, 3, 20, 10, 0, 0)
    d2 = pandas.datetime(2015, 3, 20, 10, 0, 15)
    delta = pandas.Timedelta(minutes=30)
    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1

# using numpy
def test3():
    d1 = numpy.datetime64('2015-03-20 10:00:00')
    d2 = numpy.datetime64('2015-03-20 10:00:15')
    delta = numpy.timedelta64(30, 'm')
    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1

time1 = datetime.datetime.now()
test1()
time2 = datetime.datetime.now()
test2()
time3 = datetime.datetime.now()
test3()
time4 = datetime.datetime.now()
print('DELTA test1: ' + str(time2-time1))
print('DELTA test2: ' + str(time3-time2))
print('DELTA test3: ' + str(time4-time3))
The corresponding results on my machine (Python 3.3, pandas 0.15.2):
DELTA test1: 0:00:00.131698
DELTA test2: 0:00:10.034970
DELTA test3: 0:00:05.233389
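Is this expected? Is there any way to get rid of the performance problem other than switching the code to Python's built-in datetime implementation wherever possible?
Answer 0 (score: 1)
I get similar results on my machine: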
$ python -mtimeit -s "from datetime import datetime, timedelta; d1, d2 = datetime(2015, 3, 20, 10, 0, 0), datetime(2015, 3, 20, 10, 0, 15); delta = timedelta(minutes=30)" "(d2 - d1) > delta"
10000000 loops, best of 3: 0.107 usec per loop
$ python -mtimeit -s "from numpy import datetime64, timedelta64; d1, d2 = datetime64('2015-03-20T10:00:00Z'), datetime64('2015-03-20T10:00:15Z'); delta = timedelta64(30, 'm')" "(d2 - d1) > delta"
100000 loops, best of 3: 5.35 usec per loop
$ python -mtimeit -s "from pandas import Timestamp, Timedelta; d1, d2 = Timestamp('2015-03-20T10:00:00Z'), Timestamp('2015-03-20T10:00:15Z'); delta = Timedelta(minutes=30)" "(d2 - d1) > delta"
10000 loops, best of 3: 19.9 usec per loop
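In these timings, datetime is roughly 50 times faster than the corresponding numpy analog and nearly 200 times faster than the pandas one.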
$ python -c "import numpy, pandas; print(numpy.__version__, pandas.__version__)"
('1.9.2', '0.15.2')
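It is not clear why the difference is so large. It is true that numpy and pandas code is optimized for vectorized operations. But it is not obvious why these particular scalar operations are two orders of magnitude slower. For example, adding an explicit timezone does not slow down the datetime.datetime code: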
$ python3 -mtimeit -s "from datetime import datetime, timedelta, timezone; d1, d2 = datetime(2015, 3, 20, 10, 0, 0, tzinfo=timezone.utc), datetime(2015, 3, 20, 10, 0, 15, tzinfo=timezone.utc); delta = timedelta(minutes=30)" "(d2 - d1) > delta"
10000000 loops, best of 3: 0.0939 usec per loop
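For comparison, here is a minimal sketch of what a vectorized version of the same check might look like in numpy. The sample data (a fixed start time and end times one second apart) and the helper name count_gaps_over are made up for illustration; they are not part of the original benchmark, and no timing is claimed for it.

```python
import numpy as np

def count_gaps_over(starts, ends, delta):
    """Count the pairs whose difference exceeds delta (hypothetical helper)."""
    # One array subtraction and one array comparison replace the Python loop.
    return int(np.count_nonzero((ends - starts) > delta))

n = 500000
start = np.datetime64('2015-03-20T10:00:00')

# Illustrative data: every start is the same instant,
# and the i-th end is i seconds after it.
starts = np.full(n, start, dtype='datetime64[s]')
ends = start + np.arange(n).astype('timedelta64[s]')
delta = np.timedelta64(30, 'm')

count = count_gaps_over(starts, ends, delta)
print(count)
```

The per-element overhead that shows up in the scalar timings is paid once per array here rather than once per comparison, which is the case numpy is designed for.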
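As a workaround, if you can't use vectorized operations, you could try converting the native date/time types to simpler (and faster) analogs, e.g., POSIX timestamps represented as floats.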
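A minimal sketch of that workaround, assuming the values can be treated as UTC and that float precision is acceptable for your use case. The helper name to_posix is hypothetical. The conversion happens once, outside the loop, so the loop body only does float arithmetic:

```python
import datetime

def to_posix(dt):
    """Convert a timezone-aware datetime to a POSIX timestamp (float seconds)."""
    return dt.timestamp()

utc = datetime.timezone.utc

# Convert once, up front; the values are the ones from the question.
d1 = to_posix(datetime.datetime(2015, 3, 20, 10, 0, 0, tzinfo=utc))
d2 = to_posix(datetime.datetime(2015, 3, 20, 10, 0, 15, tzinfo=utc))
delta = datetime.timedelta(minutes=30).total_seconds()

count = 0
for _ in range(500000):
    # Plain float subtraction and comparison, no date/time objects involved.
    if d2 - d1 > delta:
        count += 1

print(count)
```

If the data starts out as pandas or numpy objects, the same idea applies, but the conversion itself has a cost, so it is worth measuring on your own data whether the trade pays off.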