Question

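I am using pandas with Python 3.4 to identify matches between two data frames. Matching is based on strict equality, except for the last column, where a close match (+/- 5) is fine.

One data frame contains many rows; the second, in this case, contains only one row. The desired result is a data frame containing the subset of the first data frame that matches that row.

I first used a bespoke solution based on boolean indexing, but it took a while to go through all the data, so I tried the pandas merge function. However, my merge implementation is even slower on my test data: it runs 2 to 4 times slower than boolean indexing.

Here is a test run: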

import pandas as pd
import random
import time

def make_lsts(lst, num, num_choices):
    choices = list(range(0,num_choices))
    lst.extend(random.choice(choices) for i in range(0,num))
    return lst

def old_way(test, data):
    t1 = time.time()
    tmp = data[(data.col_1 == test.col_1[0]) &
              (data.col_2 == test.col_2[0]) &
              (data.col_3 == test.col_3[0]) &
              (data.col_4 == test.col_4[0]) &
              (data.col_5 == test.col_5[0]) &
              (data.col_6 == test.col_6[0]) &
              (data.col_7 == test.col_7[0]) &
              (data.col_8 >= (test.col_8[0]-5)) &
              (data.col_8 <= (test.col_8[0]+5))]
    t2 = time.time()
    print('old time:', t2-t1)

def new_way(test, data):
    t1 = time.time()
    tmp = pd.merge(test, data, how='inner', sort=False, copy=False,
                   on=['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7'])
    tmp = tmp[(tmp.col_8_y >= (test.col_8[0] - 5)) & (tmp.col_8_y <= (test.col_8[0] + 5))]
    t2 = time.time()
    print('new time:', t2-t1)

if __name__ == '__main__':
    t1 = time.time()
    data = pd.DataFrame({'col_1':make_lsts([], 4000000, 7),
                         'col_2':make_lsts([], 4000000, 3),
                         'col_3':make_lsts([], 4000000, 3),
                         'col_4':make_lsts([], 4000000, 5),
                         'col_5':make_lsts([], 4000000, 4),
                         'col_6':make_lsts([], 4000000, 4),
                         'col_7':make_lsts([], 4000000, 2),
                         'col_8':make_lsts([], 4000000, 20)})

    test = pd.DataFrame({'col_1':[1], 'col_2':[1], 'col_3':[1], 'col_4':[4], 'col_5':[0], 'col_6':[1], 'col_7':[0], 'col_8':[12]})
    t2 = time.time()
    old_way(test, data)
    new_way(test, data)
    print('time building data:', t2-t1)

In my most recent run, I see the following:

 # old time: 0.2209608554840088
 # new time: 0.9070699214935303
 # time building data: 75.05818915367126

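Note that even the new way, which uses the merge function, still relies on boolean indexing to handle the range of values in the last column. I had hoped the merge would take care of the bulk of the problem. That is clearly not the case, since the merge on the first seven columns takes up nearly all of the time spent in the new way.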
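To make the comparison concrete, here is a minimal sketch that runs both approaches on a small frame and checks that they select the same rows. The frame size, the seed, the two-valued key columns, and the names key_cols, by_mask and by_merge are assumptions made for illustration; they are not part of the original test.

```python
import numpy as np
import pandas as pd

key_cols = ['col_%d' % i for i in range(1, 8)]  # col_1 .. col_7

# Small reproducible stand-in for the 4,000,000-row frame (sizes are hypothetical).
rng = np.random.RandomState(0)
data = pd.DataFrame({c: rng.randint(0, 2, size=2000) for c in key_cols})
data['col_8'] = rng.randint(0, 20, size=2000)

test = pd.DataFrame({c: [1] for c in key_cols})
test['col_8'] = [12]
target = test.col_8[0]

# Boolean indexing: exact equality on the seven key columns, +/- 5 on col_8.
key_match = (data[key_cols].to_numpy() == test[key_cols].to_numpy()[0]).all(axis=1)
in_range = data['col_8'].between(target - 5, target + 5)
by_mask = data[key_match & in_range]

# Merge on the key columns, then apply the same range filter.
# Both frames have col_8, so merge names data's copy col_8_y.
merged = pd.merge(test, data, how='inner', on=key_cols)
by_merge = merged[merged.col_8_y.between(target - 5, target + 5)]

print(len(by_mask), len(by_merge))
```

Both paths end with the same range filter on col_8, so any timing difference comes from how the seven equality conditions are evaluated.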

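Is it possible to optimize my use of the merge function? (Coming from R and data.table, I spent 30 minutes searching unsuccessfully for a way to set a key on a pandas data frame.) Is this simply a type of problem that merge is not good at handling? Why is boolean indexing faster than merge in this example?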

I do not fully understand the memory backend of these approaches, so any insight is appreciated.

Answer 1

Although you can merge on any column, merge performs best when it merges on the index.

If you replace

tmp = pd.merge(test, data, how='inner', sort=False, copy=False,
               on=['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7'])

with

cols = ['col_%i' % (i+1) for i in range(7)]  # range: xrange does not exist in Python 3
test.set_index(cols, inplace=True)
data.set_index(cols, inplace=True)
tmp = pd.merge(test, data, how='inner', left_index=True, right_index=True)
test.reset_index(inplace=True)
data.reset_index(inplace=True)

does that run faster? I have not tested it, but I think it should help...

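When the columns to be merged on are set as the index, the DataFrame organizes the data so that it can locate values much faster than when those columns are ordinary columns.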
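For reference, the index-based merge can be sketched on a tiny hand-made frame as follows. The row values and the names test_idx, data_idx and result are made up for illustration, and the sketch only shows that the approach selects the expected rows; it says nothing about speed on the 4,000,000-row frame, where building the index has a cost of its own.

```python
import pandas as pd

key_cols = ['col_%d' % i for i in range(1, 8)]  # col_1 .. col_7

# Four hypothetical rows: the first three match the keys, the last does not.
data = pd.DataFrame({
    'col_1': [1, 1, 1, 2],
    'col_2': [1, 1, 1, 1],
    'col_3': [1, 1, 1, 1],
    'col_4': [4, 4, 4, 4],
    'col_5': [0, 0, 0, 0],
    'col_6': [1, 1, 1, 1],
    'col_7': [0, 0, 0, 0],
    'col_8': [12, 16, 18, 12],
})
test = pd.DataFrame({'col_1': [1], 'col_2': [1], 'col_3': [1], 'col_4': [4],
                     'col_5': [0], 'col_6': [1], 'col_7': [0], 'col_8': [12]})

# set_index without inplace=True returns new frames, so the originals
# keep their columns and no reset_index on them is needed afterwards.
test_idx = test.set_index(key_cols)
data_idx = data.set_index(key_cols)

merged = pd.merge(test_idx, data_idx, how='inner',
                  left_index=True, right_index=True)

# Both frames carry col_8, so merge suffixes them: col_8_x (test), col_8_y (data).
target = test.col_8[0]
result = merged[merged.col_8_y.between(target - 5, target + 5)].reset_index()
print(result)
```

Using the non-inplace form of set_index is a design choice for this sketch: it avoids mutating test and data, at the price of holding an indexed copy of each frame in memory.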
