Question

A performance test of creating equal pd.MultiIndex objects using different class methods:

import pandas as pd

size_mult = 8
d1 = [1]*10**size_mult
d2 = [2]*10**size_mult

pd.__version__

'0.24.2'

Using .from_arrays, .from_tuples, and .from_frame respectively:

# Cell from_arrays
%%time
index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
# Cell from_tuples
%%time
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])
# Cell from_frame
%%time
df = pd.DataFrame({'a':d1, 'b':d2})
index_frm = pd.MultiIndex.from_frame(df)

The corresponding cell outputs:

# from_arrays
CPU times: user 1min 15s, sys: 6.58 s, total: 1min 21s
Wall time: 1min 21s
# from_tuples
CPU times: user 26.4 s, sys: 4.99 s, total: 31.4 s
Wall time: 31.3 s
# from_frame
CPU times: user 47.9 s, sys: 5.65 s, total: 53.6 s
Wall time: 53.7 s

Let's check that all the results are the same for this case:

index_arr.difference(index_tup)
index_arr.difference(index_frm)

Both lines produce:

MultiIndex(levels=[[1], [2]],
           codes=[[], []],
           names=['a', 'b'])
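
An empty difference only shows that index_arr has no entries missing from the other index. A more direct check would be MultiIndex.equals. The following is a minimal sketch on a much smaller input (the size n = 1000 is an assumption chosen so it runs instantly; it is not the 10**8 rows used above):

```python
import pandas as pd

# Assumption: a small stand-in for the 10**8-element lists in the question.
n = 1000
d1 = [1] * n
d2 = [2] * n

index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
# list(...) materializes the zip iterator before handing it to from_tuples.
index_tup = pd.MultiIndex.from_tuples(list(zip(d1, d2)), names=['a', 'b'])
index_frm = pd.MultiIndex.from_frame(pd.DataFrame({'a': d1, 'b': d2}))

# equals() compares the two indexes element by element, in order.
same_tup = index_arr.equals(index_tup)
same_frm = index_arr.equals(index_frm)
print(same_tup, same_frm)
```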

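So why is there such a big difference? from_arrays is about 3 times slower than from_tuples. It is even slower than creating a DataFrame and building the index on top of it.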

Edit:

I ran another, more general test, and the result was exactly the opposite:

import numpy as np

np.random.seed(232)

size_mult = 7
d1 = np.random.randint(0, 10**size_mult, 10**size_mult)
d2 = np.random.randint(0, 10**size_mult, 10**size_mult)

start = pd.Timestamp.now()
index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
print('ARR done in %f' % (pd.Timestamp.now()-start).total_seconds())

start = pd.Timestamp.now()
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])
print('TUP done in %f' % (pd.Timestamp.now()-start).total_seconds())

ARR done in 9.559764
TUP done in 70.457208

So now from_tuples is much slower, even though the source data is the same.
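
For anyone who wants to reproduce this comparison quickly, here is a scaled-down sketch of the second test. The helper name time_call and the size of 10**5 rows are assumptions made for illustration; absolute timings will differ by machine and pandas version, so no expected numbers are shown.

```python
import time

import numpy as np
import pandas as pd


def time_call(fn):
    """Run fn once and return (result, elapsed seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Assumption: 10**5 rows instead of 10**7, so the test finishes quickly.
size = 10**5
np.random.seed(232)
d1 = np.random.randint(0, size, size)
d2 = np.random.randint(0, size, size)

index_arr, t_arr = time_call(
    lambda: pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
)
index_tup, t_tup = time_call(
    lambda: pd.MultiIndex.from_tuples(list(zip(d1, d2)), names=['a', 'b'])
)

print('ARR done in %f' % t_arr)
print('TUP done in %f' % t_tup)
```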

Answer 1

Your second example makes more sense to me. Looking at the pandas source code, from_tuples actually calls from_arrays, so it makes sense to me that from_arrays would be faster.

from_tuples also performs some extra steps here that take more time:

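You passed in zip(d1, d2), which is actually an iterator. from_tuples converts this into a list.
Once it has been converted into a list of tuples, an extra step is needed to convert it into a list of numpy arrays.
That step iterates through the list of tuples twice, making from_tuples significantly slower than from_arrays.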

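So overall, I'm not surprised that from_tuples is slower, since it has to iterate through your list of tuples twice (and do some extra work) before it even reaches the from_arrays function (which, by the way, iterates a couple more times).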
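
The flow described above can be sketched roughly as follows. This is an illustration only, not the pandas source: the function name from_tuples_sketch is hypothetical, and a plain zip(*...) transpose stands in for whatever internal conversion routines pandas actually uses.

```python
import pandas as pd


def from_tuples_sketch(tuples, names=None):
    """Hypothetical stand-in for the steps described in the answer."""
    # Step 1: an iterator such as zip(d1, d2) is materialized into a list.
    tuples = list(tuples)
    # Step 2: the list of tuples is transposed into one sequence per level,
    # which costs extra passes over the data.
    arrays = [list(level) for level in zip(*tuples)]
    # Step 3: the real work is delegated to from_arrays.
    return pd.MultiIndex.from_arrays(arrays, names=names)


d1 = [1, 1, 2]
d2 = [2, 3, 3]

sketch = from_tuples_sketch(zip(d1, d2), names=['a', 'b'])
direct = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
print(sketch.equals(direct))
```

If the data already lives in separate arrays, calling from_arrays directly skips steps 1 and 2 entirely.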
