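I have a large DataFrame (df_b, ~50 million rows, 3 columns) that I need to query to check whether a subset of it contains any value from a list. Each lookup in the large DataFrame df_b (done with df_b.query()) takes 1-2 seconds. Any suggestions for speeding this up or doing it another way?
My example code is below: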
import pandas as pd
df_b = pd.DataFrame({'M':[11,11,11,11,11,11,33,33,33,44,44],'C':['a','b','c','a','b','c','a','b','c','a','b'],'W':['AA','AA','AA','BB','BB','BB','CC','CC','CC','AA','AA']})
df_scope = pd.DataFrame({'M':[11,22,33,44,55],'W':['AA','CC','CC','CC','QQ']})
my_list = {'a','b','z'}
for row in df_scope.itertuples():
    k = df_b.query('M == '+ str(row[1]) +' and W == "'+ row[2] +'"')
    c_found = len(k[k['C'].isin(my_list)])
    if c_found > 0:
        print("PN: " + str(row[1]) + " Yes")
    else:
        print("PN: " + str(row[1]) + " No")
Answer 0 (score: 0)
I hope I've understood the question correctly, but you can do a .merge first and then a .groupby:
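# Right merge keeps every row of df_scope, even those with no match in df_b
x = df_b.merge(df_scope, on=['M', 'W'], how='right')
# For each M, check whether any matched C value is in my_list
t = x.groupby('M')['C'].apply(lambda s: s.isin(my_list).any())
for i, v in zip(t.index, t):
    print('PN: {} {}'.format(i, 'Yes' if v else 'No'))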
Prints:
PN: 11 Yes
PN: 22 No
PN: 33 Yes
PN: 44 No
PN: 55 No
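If df_scope can contain the same M with more than one W, grouping on M alone would fold those scope rows into a single answer. Here is a minimal sketch of a variant that groups on both keys and swaps the per-group lambda for a vectorised .any(). It reuses the sample data from the question; the names hit and t are only illustrative, and it has not been timed on 50 million rows, so any speed-up is something to measure rather than assume.

```python
import pandas as pd

# Sample data copied from the question
df_b = pd.DataFrame({
    'M': [11, 11, 11, 11, 11, 11, 33, 33, 33, 44, 44],
    'C': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b'],
    'W': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'CC', 'CC', 'CC', 'AA', 'AA'],
})
df_scope = pd.DataFrame({'M': [11, 22, 33, 44, 55],
                         'W': ['AA', 'CC', 'CC', 'CC', 'QQ']})
my_list = {'a', 'b', 'z'}

# Right merge keeps every scope row; unmatched rows get NaN in C
x = df_b.merge(df_scope, on=['M', 'W'], how='right')

# Boolean flag per merged row (NaN in C counts as "not in my_list")
hit = x['C'].isin(my_list)

# One answer per (M, W) pair, without a Python-level lambda per group
t = hit.groupby([x['M'], x['W']]).any()

for (m, w), v in t.items():
    print('PN: {} {} {}'.format(m, w, 'Yes' if v else 'No'))
```

The result is indexed by (M, W), so a scope with repeated M values still gets one line per pair.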
Another solution, without .groupby:
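# Flag the rows of df_b whose C value is in my_list (adds a helper column to df_b)
df_b['tmp'] = df_b['C'].isin(my_list)
# Keep one flagged row per (M, W), then right-merge so every scope row survives
x = df_b[df_b['tmp']].drop_duplicates(subset=['M', 'W']).merge(df_scope, on=['M', 'W'], how='right')
# tmp is NaN where the scope row had no flagged match
for i, v in zip(x['M'], x['tmp']):
    print('PN: {} {}'.format(i, 'No' if pd.isna(v) else 'Yes'))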
Prints:
PN: 11 Yes
PN: 22 No
PN: 33 Yes
PN: 44 No
PN: 55 No
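A variation on the second approach that leaves df_b untouched: instead of a helper column, it relies on the indicator argument of merge to mark which scope rows found a match. This is only a sketch on the question's sample data. The function name scope_hits and the found column are hypothetical, chosen for illustration.

```python
import pandas as pd

# Sample data copied from the question
df_b = pd.DataFrame({
    'M': [11, 11, 11, 11, 11, 11, 33, 33, 33, 44, 44],
    'C': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b'],
    'W': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'CC', 'CC', 'CC', 'AA', 'AA'],
})
df_scope = pd.DataFrame({'M': [11, 22, 33, 44, 55],
                         'W': ['AA', 'CC', 'CC', 'CC', 'QQ']})
my_list = {'a', 'b', 'z'}


def scope_hits(df_b, df_scope, values):
    """Hypothetical helper: flag each df_scope row that has a match in df_b."""
    # Unique (M, W) pairs in df_b with at least one C value in `values`
    hits = df_b.loc[df_b['C'].isin(values), ['M', 'W']].drop_duplicates()
    # Left merge keeps df_scope's rows and order; _merge is 'both' on a match
    out = df_scope.merge(hits, on=['M', 'W'], how='left', indicator=True)
    out['found'] = out['_merge'] == 'both'
    return out.drop(columns='_merge')


result = scope_hits(df_b, df_scope, my_list)
for m, found in zip(result['M'], result['found']):
    print('PN: {} {}'.format(m, 'Yes' if found else 'No'))
```

Because the filter on C runs once over df_b and the merge works on the deduplicated pairs, the per-row df_b.query() calls from the question are avoided entirely.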