Question

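I have a large DataFrame (df_b, ~50 million rows, 3 columns) that I need to query to check whether a subset of it contains anything from a list. Each lookup in the large DataFrame df_b (the df_b.query() call) takes 1-2 seconds. Any suggestions for speeding this up or doing it another way?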

My example code is below:

import pandas as pd
df_b = pd.DataFrame({'M':[11,11,11,11,11,11,33,33,33,44,44],'C':['a','b','c','a','b','c','a','b','c','a','b'],'W':['AA','AA','AA','BB','BB','BB','CC','CC','CC','AA','AA']})

df_scope = pd.DataFrame({'M':[11,22,33,44,55],'W':['AA','CC','CC','CC','QQ']})

my_list = {'a','b','z'}

for row in df_scope.itertuples():
    k = df_b.query('M == '+ str(row[1]) +' and W == "'+ row[2] +'"')
    c_found = len(k[k['C'].isin(my_list)])

    if c_found > 0:
        print("PN: " + str(row[1]) + " Yes")
    else:
        print("PN: " + str(row[1]) + " No")

Answer 1

I hope I've understood you correctly, but you can do a .merge first and then a .groupby:

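# Right merge keeps every (M, W) pair from df_scope; pairs missing from df_b get NaN in C
x = df_b.merge(df_scope, on=['M', 'W'], how='right')
# For each M, check whether any of its C values is in my_list
t = x.groupby('M')['C'].apply(lambda s: s.isin(my_list).any())
for i, v in zip(t.index, t):
    print('PN: {} {}'.format(i, 'Yes' if v else 'No'))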

Prints:

PN: 11 Yes
PN: 22 No
PN: 33 Yes
PN: 44 No
PN: 55 No
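
The per-group lambda above can probably be replaced by a vectorized form, which may matter at ~50 million rows. This is a minimal sketch, assuming the same df_b, df_scope and my_list as in the question; it has not been timed on data of that size.

```python
import pandas as pd

# Sample data copied from the question
df_b = pd.DataFrame({
    'M': [11, 11, 11, 11, 11, 11, 33, 33, 33, 44, 44],
    'C': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b'],
    'W': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'CC', 'CC', 'CC', 'AA', 'AA'],
})
df_scope = pd.DataFrame({'M': [11, 22, 33, 44, 55],
                         'W': ['AA', 'CC', 'CC', 'CC', 'QQ']})
my_list = {'a', 'b', 'z'}

x = df_b.merge(df_scope, on=['M', 'W'], how='right')
# isin() runs once over the whole column (NaN from unmatched scope rows -> False),
# then any() is reduced per M without calling a Python lambda for each group.
t = x['C'].isin(my_list).groupby(x['M']).any()

for m, found in t.items():
    print('PN: {} {}'.format(m, 'Yes' if found else 'No'))
```

On the sample data this gives the same Yes/No result per M as the .apply version.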

Another solution, without .groupby:

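# Flag the rows whose C value is in my_list (note: this adds a 'tmp' column to df_b)
df_b['tmp'] = df_b['C'].isin(my_list)
# Keep one flagged row per (M, W) pair, then right-merge so every df_scope row survives
x = df_b[df_b['tmp']].drop_duplicates(subset=['M', 'W']).merge(df_scope, on=['M', 'W'], how='right')

# Scope rows without a match have NaN in 'tmp'
for i, v in zip(x['M'], x['tmp']):
    print('PN: {} {}'.format(i, 'No' if pd.isna(v) else 'Yes'))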

Prints:

PN: 11 Yes
PN: 22 No
PN: 33 Yes
PN: 44 No
PN: 55 No
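
If modifying df_b is undesirable, the same filter-then-merge idea can be wrapped in a function. This is a minimal sketch, assuming the data from the question; scope_hits and the found column are hypothetical names introduced here for illustration, and it uses a left merge with indicator=True in place of the temporary tmp column.

```python
import pandas as pd

def scope_hits(df_b, df_scope, wanted):
    """Return df_scope plus a boolean 'found' column (hypothetical helper)."""
    # Keep only rows whose C is in the wanted set, reduced to one row per (M, W).
    hits = df_b.loc[df_b['C'].isin(wanted), ['M', 'W']].drop_duplicates()
    # Left-join the scope onto the hits; '_merge' is 'both' where a hit exists.
    merged = df_scope.merge(hits, on=['M', 'W'], how='left', indicator=True)
    return merged.assign(found=merged['_merge'].eq('both')).drop(columns='_merge')

# Sample data copied from the question
df_b = pd.DataFrame({
    'M': [11, 11, 11, 11, 11, 11, 33, 33, 33, 44, 44],
    'C': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b'],
    'W': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'CC', 'CC', 'CC', 'AA', 'AA'],
})
df_scope = pd.DataFrame({'M': [11, 22, 33, 44, 55],
                         'W': ['AA', 'CC', 'CC', 'CC', 'QQ']})

result = scope_hits(df_b, df_scope, {'a', 'b', 'z'})
for m, found in zip(result['M'], result['found']):
    print('PN: {} {}'.format(m, 'Yes' if found else 'No'))
```

Because hits is unique per (M, W), the left merge returns exactly one row per df_scope row, in df_scope order, and df_b itself is left unchanged.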
