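Comparison of Pandas lookup times

Time: 2016-07-07 19:46:24

Tags: python performance pandas

After experimenting with timing various types of lookups on a Pandas (0.17.1) DataFrame, I am left with a few questions.

Here is the setup...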

import pandas as pd
import numpy as np
import itertools

letters = [chr(x) for x in range(ord('a'), ord('z'))]
letter_combinations = [''.join(x) for x in itertools.combinations(letters, 3)]

df1 = pd.DataFrame({
        'value': np.random.normal(size=(1000000)), 
        'letter': np.random.choice(letter_combinations, 1000000)
    })
df2 = df1.sort_values('letter')
df3 = df1.set_index('letter')
df4 = df3.sort_index()

So df1 looks something like this...

print(df1.head(5))


>>>
  letter     value
0    bdh  0.253778
1    cem -1.915726
2    mru -0.434007
3    lnw -1.286693
4    fjv  0.245523

Here is the code to test the difference in lookup performance...

print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df1[df1.letter == 'ben']
%timeit df1[df1.letter == 'amy']
%timeit df1[df1.letter == 'abe']

print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df2[df2.letter == 'ben']
%timeit df2[df2.letter == 'amy']
%timeit df2[df2.letter == 'abe']

print('~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df3.loc['ben']
%timeit df3.loc['amy']
%timeit df3.loc['abe']

print('~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df4.loc['ben']
%timeit df4.loc['amy']
%timeit df4.loc['abe']

The results...

~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 193 ms per loop
~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 4.66 times longer than the fastest. This could mean that an intermediate result is being cached 
10 loops, best of 3: 40.9 ms per loop
10 loops, best of 3: 41 ms per loop
10 loops, best of 3: 40.9 ms per loop
~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 1621.00 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 259 µs per loop
1000 loops, best of 3: 242 µs per loop
1000 loops, best of 3: 243 µs per loop

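The questions...

  1. It's pretty clear why the lookup on the sorted index is so much faster: binary search gives O(log(n)) performance versus O(n) for a full array scan. But why is the lookup on the sorted, non-indexed column of df2 SLOWER than the lookup on the unsorted, non-indexed column of df1?

  2. What is up with "The slowest run took x times longer than the fastest. This could mean that an intermediate result is being cached"? Surely the results aren't being cached. Is it because the created index is lazy and isn't actually built until it is needed? That would explain why it happens only on the first call to .loc[].

  3. Why isn't the index sorted by default? Could the fixed cost of the sort be too much?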
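For what it's worth, the contrast in question 1 can be seen on a toy frame. This is only a sketch: toy, toy_unsorted and toy_sorted are made-up stand-ins for df1, df3 and df4, and it relies on the documented return types of Index.get_loc (a boolean mask for a non-unique, non-monotonic index, a slice for a monotonic one), not on anything measured above.

```python
import pandas as pd

# Hypothetical toy data standing in for the one-million-row frame.
toy = pd.DataFrame({
    'letter': ['ben', 'amy', 'ben', 'abe'],
    'value': [1.0, 2.0, 3.0, 4.0],
})
toy_unsorted = toy.set_index('letter')   # like df3: non-unique, unsorted index
toy_sorted = toy_unsorted.sort_index()   # like df4: non-unique, sorted index

# Unsorted, non-unique index: get_loc returns one boolean per row.
mask = toy_unsorted.index.get_loc('ben')

# Sorted (monotonic) index: get_loc returns a slice over the matching run.
span = toy_sorted.index.get_loc('ben')

print(mask)
print(span)
```

A slice only needs the two boundaries of the matching run, which is consistent with a binary search, while a mask has one entry per row, which is consistent with a full scan.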

2 answers:

Answer 0: (score: 11)

The disparity in these %timeit results

In [273]: %timeit df1[df1['letter'] == 'ben']
10 loops, best of 3: 36.1 ms per loop

In [274]: %timeit df2[df2['letter'] == 'ben']
10 loops, best of 3: 108 ms per loop

also shows up in the pure NumPy equality comparisons:

In [275]: %timeit df1['letter'].values == 'ben'
10 loops, best of 3: 24.1 ms per loop

In [276]: %timeit df2['letter'].values == 'ben'
10 loops, best of 3: 96.5 ms per loop

Under the hood, Pandas' df1['letter'] == 'ben' calls a Cython function which loops through the values of the underlying NumPy array, df1['letter'].values. It is essentially doing the same thing as df1['letter'].values == 'ben', but with different handling of NaNs.
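As a quick sanity check of the claim that the two comparisons do essentially the same work, here is a minimal sketch on a tiny object-dtype Series. The name letters is made up for illustration, the dtype is forced to object to match what pandas 0.17 would use for strings, and the NaN handling is deliberately not exercised.

```python
import pandas as pd

# Hypothetical miniature of df1['letter'], forced to object dtype.
letters = pd.Series(['ben', 'amy', 'abe', 'ben'], dtype=object)

pandas_mask = letters == 'ben'         # comparison through pandas
numpy_mask = letters.values == 'ben'   # comparison on the underlying ndarray

print(pandas_mask.tolist())
print(numpy_mask.tolist())
```

On NaN-free string data both masks select the same rows; only the code path differs.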

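Moreover, notice that simply accessing the items in df1['letter'] in sequential order can be done more quickly than doing the same for df2['letter']:

In [11]: %timeit [item for item in df1['letter']]
10 loops, best of 3: 49.4 ms per loop

In [12]: %timeit [item for item in df2['letter']]
10 loops, best of 3: 124 ms per loop

The difference in times within each of these three sets of %timeit tests is roughly the same. I think that is because they all share the same cause.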

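Since the letter column holds strings, the NumPy arrays df1['letter'].values and df2['letter'].values have dtype object, and therefore they hold pointers to the memory locations of arbitrary Python objects (in this case, strings).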
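To make the "array of pointers" point concrete, here is a small sketch. The names first, second and arr are made up, and the pointer-size check assumes a platform where np.intp has the width of a pointer, which is the common case.

```python
import numpy as np

# Build the strings at runtime so each one is a distinct Python object.
first = ''.join(['b', 'e', 'n'])
second = ''.join(['a', 'm', 'y'])

arr = np.array([first, second], dtype=object)

# Indexing hands back the very same Python object that was stored,
# because the array only keeps a reference to it.
same_object = arr[0] is first

# Each slot is one pointer wide, however long the string is.
slot_size = arr.itemsize

print(arr.dtype, same_object, slot_size)
```

The string data itself lives wherever CPython allocated it; the array holds only the references.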

Consider the memory locations of the strings stored in the DataFrames df1 and df2. In CPython, id returns the memory location of the object:

memloc = pd.DataFrame({'df1': list(map(id, df1['letter'])),
                       'df2': list(map(id, df2['letter'])), })

               df1              df2
0  140226328244040  140226299303840
1  140226328243088  140226308389048
2  140226328243872  140226317328936
3  140226328243760  140226230086600
4  140226328243368  140226285885624

The strings in df1 (after the first dozen or so) tend to appear sequentially in memory, while sorting causes the strings in df2 (taken in order) to be scattered in memory:

In [272]: diffs = memloc.diff(); diffs.head(30)
Out[272]:
          df1         df2
0         NaN         NaN
1      -952.0   9085208.0
2       784.0   8939888.0
3      -112.0 -87242336.0
4      -392.0  55799024.0
5      -392.0   5436736.0
6       952.0  22687184.0
7        56.0 -26436984.0
8      -448.0  24264592.0
9       -56.0  -4092072.0
10     -168.0 -10421232.0
11  -363584.0   5512088.0
12       56.0 -17433416.0
13       56.0  40042552.0
14       56.0 -18859440.0
15       56.0 -76535224.0
16       56.0  94092360.0
17       56.0  -4189368.0
18       56.0     73840.0
19       56.0  -5807616.0
20       56.0  -9211680.0
21       56.0  20571736.0
22       56.0 -27142288.0
23       56.0   5615112.0
24       56.0  -5616568.0
25       56.0   5743152.0
26       56.0 -73057432.0
27       56.0  -4988200.0
28       56.0  85630584.0
29       56.0  -4706136.0

Most of the strings in df1 are 56 bytes apart:

In [16]: diffs['df1'].value_counts()
Out[16]:
 56.0           986109
 120.0           13671
-524168.0          215
-56.0                1
-12664712.0          1
 41136.0             1
-231731080.0         1
Name: df1, dtype: int64

In [20]: len(diffs['df1'].value_counts())
Out[20]: 7

In contrast, the strings in df2 are scattered all over the place:

In [17]: diffs['df2'].value_counts().head()
Out[17]:
-56.0     46
 56.0     44
 168.0    39
-112.0    37
-392.0    35
Name: df2, dtype: int64

In [19]: len(diffs['df2'].value_counts())
Out[19]: 837764

When these objects (strings) are located sequentially in memory, their values can be retrieved more quickly. This is why the equality comparisons performed by df1['letter'].values == 'ben' can be done faster than those in df2['letter'].values == 'ben'. The lookup time is smaller.
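A small sketch of why sorting scatters the accesses: reordering an object array moves the pointers, not the strings they point to. The names words, arr and sorted_arr are made up for illustration, and id is read as a memory address only under CPython.

```python
import numpy as np

# Distinct string objects, created in this (unsorted) order.
words = [''.join(t) for t in (('m', 'r', 'u'), ('b', 'd', 'h'),
                              ('l', 'n', 'w'), ('c', 'e', 'm'))]
arr = np.array(words, dtype=object)

order = np.argsort(arr)    # positions that would sort the strings
sorted_arr = arr[order]    # new array of pointers, same string objects

ids_before = [id(s) for s in arr]
ids_after = [id(s) for s in sorted_arr]

print(sorted_arr.tolist())
print(sorted(ids_before) == sorted(ids_after))
```

The sorted array refers to exactly the same objects, just visited in a different order. With a million strings, walking them in sorted order means jumping around in memory instead of moving through it roughly front to back.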

This memory accessing issue also explains why there is no disparity in the %timeit results for the value column:

In [5]: %timeit df1[df1['value'] == 0]
1000 loops, best of 3: 1.8 ms per loop

In [6]: %timeit df2[df2['value'] == 0]
1000 loops, best of 3: 1.78 ms per loop

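df1['value'] and df2['value'] are NumPy arrays of dtype float64. Unlike object arrays, their values are packed together contiguously in memory. Sorting df1 with df2 = df1.sort_values('letter') causes the values in df2['value'] to be reordered, but since the values are copied into a new NumPy array, they are still located sequentially in memory. So accessing the values in df2['value'] can be done just as quickly as accessing those in df1['value'].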
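The same idea in miniature, as a sketch. The names values, order and reordered are made up, and plain NumPy fancy indexing stands in for whatever sort_values does internally.

```python
import numpy as np

values = np.array([0.25, -1.92, -0.43, -1.29, 0.25])   # float64 by default
order = np.array([3, 1, 2, 0, 4])    # hypothetical row order after a sort

# Fancy indexing copies the selected floats into a fresh array.
reordered = values[order]

print(reordered.dtype)
print(reordered.flags['C_CONTIGUOUS'])
print(np.shares_memory(values, reordered))
```

The reordered floats sit side by side in their own buffer, 8 bytes apart, so scanning them is just as cache-friendly as scanning the original.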

Answer 1: (score: 5)

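(1) pandas currently has no knowledge of the sortedness of a column. If you want to take advantage of sorted data, you could use df2.letter.searchsorted. See @unutbu's answer for an explanation of what is actually causing the difference in time here.

(2) The hash table that sits underneath the index is lazily created, then cached.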
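The searchsorted suggestion in (1) could look roughly like this. It is a sketch under assumptions: toy and toy_sorted are made-up stand-ins for df1 and df2, and np.searchsorted is called on the underlying object array rather than Series.searchsorted, since the latter's return type has varied between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df1.
toy = pd.DataFrame({
    'letter': ['ben', 'amy', 'ben', 'abe', 'amy', 'ben'],
    'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
toy_sorted = toy.sort_values('letter')   # like df2

# Sorted keys as a plain object ndarray.
keys = toy_sorted['letter'].to_numpy(dtype=object)

# Binary search for both ends of the run of 'ben'.
left = int(np.searchsorted(keys, 'ben', side='left'))
right = int(np.searchsorted(keys, 'ben', side='right'))

# Positional slice instead of a full boolean scan.
hits = toy_sorted.iloc[left:right]
print(hits)
```

This only gives correct results when the column really is sorted. pandas will not check that for you, which is the point made in (1).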