Pandas query on a large DataFrame is slow

Time: 2021-05-13 10:17:23

Tags: python pandas

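I have a large DataFrame (df_b, about 50 million rows and 3 columns) that I need to query to check whether a subset of it contains any value from a list. Each lookup in df_b (a df_b.query() call) takes 1-2 seconds. Any suggestions for speeding this up, or for doing it another way?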

My sample code is below:

import pandas as pd
df_b = pd.DataFrame({'M':[11,11,11,11,11,11,33,33,33,44,44],'C':['a','b','c','a','b','c','a','b','c','a','b'],'W':['AA','AA','AA','BB','BB','BB','CC','CC','CC','AA','AA']})

df_scope = pd.DataFrame({'M':[11,22,33,44,55],'W':['AA','CC','CC','CC','QQ']})

my_list = {'a','b','z'}

for row in df_scope.itertuples():
    k = df_b.query('M == '+ str(row[1]) +' and W == "'+ row[2] +'"')
    c_found = len(k[k['C'].isin(my_list)])

    if c_found > 0:
        print("PN: " + str(row[1]) + " Yes")
    else:
        print("PN: " + str(row[1]) + " No")

1 answer:

Answer 0 (score: 0)

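I hope I have understood the question correctly. Instead of querying df_b once per row of df_scope, you can do a single .merge followed by a .groupby: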

x = df_b.merge(df_scope, on=['M', 'W'], how='right')
t = x.groupby('M')['C'].apply(lambda x: x.isin(my_list).any())
for i, v in zip(t.index, t):
    print('PN: {} {}'.format(i, 'Yes' if v else 'No'))

Prints:

PN: 11 Yes
PN: 22 No
PN: 33 Yes
PN: 44 No
PN: 55 No

Another solution, without .groupby: filter df_b down to the rows whose C is in my_list first, then merge.

df_b['tmp'] = df_b['C'].isin(my_list)
x = df_b[df_b['tmp']].drop_duplicates(subset=['M', 'W']).merge(df_scope, on=['M', 'W'], how='right')

for i, v in zip(x['M'], x['tmp']):
    print('PN: {} {}'.format(i, 'No' if pd.isna(v) else 'Yes'))

Prints:

PN: 11 Yes
PN: 22 No
PN: 33 Yes
PN: 44 No
PN: 55 No
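
The second solution adds a 'tmp' column to df_b. If mutating the large frame is undesirable, the same filter-first idea can be expressed as a membership test on (M, W) pairs. This is only a sketch of an alternative, not the answerer's method; the variable names hits, hit_index, scope_index and found are made up for illustration.

```python
import pandas as pd

# Sample data from the question.
df_b = pd.DataFrame({
    'M': [11, 11, 11, 11, 11, 11, 33, 33, 33, 44, 44],
    'C': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b'],
    'W': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'CC', 'CC', 'CC', 'AA', 'AA'],
})
df_scope = pd.DataFrame({'M': [11, 22, 33, 44, 55],
                         'W': ['AA', 'CC', 'CC', 'CC', 'QQ']})
my_list = {'a', 'b', 'z'}

# Distinct (M, W) pairs that have at least one row with C in my_list.
# df_b itself is left unchanged.
hits = df_b.loc[df_b['C'].isin(list(my_list)), ['M', 'W']].drop_duplicates()

# Compare the scope pairs against the hit pairs in one vectorized call.
hit_index = pd.MultiIndex.from_frame(hits)
scope_index = pd.MultiIndex.from_frame(df_scope[['M', 'W']])
found = scope_index.isin(hit_index)  # boolean array aligned with df_scope

for m, ok in zip(df_scope['M'], found):
    print('PN: {} {}'.format(m, 'Yes' if ok else 'No'))
```

As in the answer above, df_b is scanned once for the C filter instead of once per scope row.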