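I have a dataset with 10M rows and another with 20K rows, and I run two pandas apply functions on them. Even on a subset (1,000 rows), the first apply function takes several minutes to run. I want to debug the second apply function without re-running the whole module and waiting several minutes each time. I have read the documentation, but I am still confused about how to do this.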
The functions are as follows:
First function
def match_name(row, pubdf):
    # pdb.set_trace()
    if row is not None and row != 'nan':
        minscore = 90
        choice, score = fwp.extractOne(row, pubdf)
        return choice if score > minscore else None
    else:
        return None
Second function
def match_address(privrow):
    pdb.set_trace()
    minscore = 90
    if privrow.supplier_streetadd:
        if privrow.supplier_streetadd != 'nan':
            pub_addresses = [privrow.pub_streetadd1,
                             privrow.pub_streetadd2,
                             privrow.pub_streetadd3]
            choice, score = fwp.extractOne(privrow.supplier_streetadd, pub_addresses)
            return choice if score > minscore else None
    else:
        return None
They are run with this code:
tqdm.pandas()
priv_df['pub_org_name'] = priv_df['supplier_name'].progress_apply(match_name, args=(list(pub_df['org_name']),))
priv_df['pub_streetaddress'] = priv_df.progress_apply(match_address, axis=1)
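
One direction that might avoid re-running the slow first apply while debugging the second is to cache the result of the first step on disk and reload it on later runs. The sketch below is only an illustration under assumptions: the helper name `load_or_compute` and the file name in the comments are made up for this example and are not part of my module. It relies on the standard-library `pickle` module, which can serialize a pandas DataFrame.

```python
import os
import pickle


def load_or_compute(path, compute):
    """Return the object pickled at `path`, or call `compute()` and save its result.

    `load_or_compute` is a hypothetical helper name used only for this sketch.
    """
    if os.path.exists(path):
        # A cached result exists, so the expensive step is skipped.
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result


# Hypothetical usage with the frames above (file name is an assumption):
#
# def run_first_step():
#     priv_df['pub_org_name'] = priv_df['supplier_name'].progress_apply(
#         match_name, args=(list(pub_df['org_name']),))
#     return priv_df
#
# priv_df = load_or_compute('priv_df_after_match_name.pkl', run_first_step)
# priv_df['pub_streetaddress'] = priv_df.progress_apply(match_address, axis=1)
```

With this shape, only the first run pays for `match_name`; later runs would go straight to `match_address` and its `pdb.set_trace()`. The cache file has to be deleted by hand whenever `match_name` or the input data changes, since the helper does not detect stale results.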